WO2023273188A1 - A text processing method and related apparatus - Google Patents

A text processing method and related apparatus

Info

Publication number
WO2023273188A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
character
correction
characters
position information
Prior art date
Application number
PCT/CN2021/137584
Other languages
English (en)
French (fr)
Inventor
李明
付彬
乔宇
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 filed Critical 中国科学院深圳先进技术研究院
Publication of WO2023273188A1 publication Critical patent/WO2023273188A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • the present application relates to the technical field of scene text recognition (STR), in particular to a text processing method and related devices.
  • Scene text recognition means that a text picture containing text information in a specific scene is input into a program, and the program converts it into text symbols that a computer can understand.
  • Scene text recognition is an important branch of computer vision and plays an important role, with broad prospects, in application scenarios such as autonomous driving and assistance for the blind. Improving the accuracy and efficiency of scene text recognition is therefore particularly important.
  • the embodiment of the present application provides a text processing method and a related device.
  • the position information of each character is obtained by detecting the text information contained in the picture, the coordinate frame of each character is reconstructed using that position information, and finer control point sampling is then performed on the characters according to the coordinate frame. This makes the corrected text more horizontal, improving the accuracy and efficiency of text correction and, in turn, the accuracy and efficiency of text recognition.
  • the embodiment of the present application provides a text processing method, the method comprising:
  • the first text picture is a picture including the first text
  • the text correction network is a network that performs correction using the position information of each character in the first text; the text content of the second text is the same as the text content of the first text, and the second text in the second text picture is horizontal text.
  • the acquired first text image is input to the text correction network for text correction.
  • the position and geometric information of each character is obtained by detecting the first text included in the first text picture, the coordinate frame of each character is reconstructed using this information, finer control points are then sampled on the characters according to the coordinate frame, and the text is corrected using the control points of the characters to obtain a second text picture including the second text.
  • the text content of the second text in the second text picture is the same as the text content of the first text in the first text picture, but the second text in the second text picture is more horizontal, which is more conducive to its text recognition.
  • the currently common text correction method performs text correction with text-level control point sampling and ignores the information of the characters themselves, so the text may be distorted during sampling, making the corrected text picture difficult to recognize.
  • the method in the embodiment of the present application performs text correction with character-level control point sampling, so that the text included in the corrected text picture is more horizontal, improving the accuracy and efficiency of text correction and thereby the accuracy and efficiency of text recognition.
  • the inputting the first text picture to the text correction network for text correction to obtain the second text picture including the second text includes:
  • a possible specific implementation manner of performing text correction on a text image is provided.
  • the masks of different levels of the characters in the first text are obtained.
  • the masks of the different levels are of different sizes and represent the position information of the characters: a smaller mask avoids the problem of characters sticking together in the text, while a larger mask avoids the problem of characters being missed.
  • the coordinate frame of the character is constructed, and the coordinate frame is used to determine the control points of the characters, and finally the characters are corrected according to the control points.
  • multiple masks of different sizes are regressed for each character position when obtaining the position information of the characters, which solves the problems of adhesion and omission between characters, improves the accuracy of determining character position information, and further improves the accuracy and efficiency of text correction.
  • before constructing the coordinate frame of the character according to the position information of the character, the method further includes:
  • the target connected domains are used to optimize the position information of the characters.
  • a possible specific implementation manner of optimizing character position information is provided. Specifically, after obtaining the masks of the characters at different levels in the first text, and before constructing the coordinate frame of the characters according to the position information of the characters, the embodiment of the present application also searches for the connected domains corresponding to the masks of the different levels to obtain the target connected domains. This optimizes the masks of the different levels, eliminates inaccurate masks, and thereby optimizes the position information of the characters. Through the embodiments of the present application, the accuracy of determining character position information can be further improved, thereby improving the accuracy and efficiency of text correction.
  • the searching for the connected domains corresponding to the masks of different levels to obtain the target connected domains includes:
  • the first connected domain and the second connected domain are used as the target connected domain.
  • a possible specific implementation manner of determining the target connected domains is provided. Specifically, starting from the smallest mask, the connected domains of the current layer are found and put into a queue. After the current layer has been searched, higher-level masks are searched: for each connected domain in a higher-level mask, it is judged whether it overlaps with a connected domain that already exists from a previous layer; if it overlaps, it is discarded, and if it does not overlap, it is put into the queue. This process is repeated until the connected domains corresponding to the largest mask have also been searched, at which point the connected domains in the queue are used as the target connected domains.
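The queue-based selection described above can be sketched in a few lines. This is a simplified illustration on small binary grids (the function names are invented for this example), not the patent's exact CCSA implementation:

```python
from collections import deque

def connected_components(mask):
    """4-connected components of a binary grid; returns a list of pixel sets."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                comp, stack = set(), [(y, x)]
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    comp.add((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                comps.append(comp)
    return comps

def ccsa(masks_small_to_large):
    """Select target connected domains: process masks from smallest to largest,
    and keep a domain from a larger mask only if it does not overlap any
    domain already in the queue (overlapping domains are discarded)."""
    queue = deque()
    for mask in masks_small_to_large:
        for comp in connected_components(mask):
            if all(comp.isdisjoint(kept) for kept in queue):
                queue.append(comp)
    return list(queue)
```

Here a blob in a larger mask that covers characters already found in a smaller mask is discarded, while a character that only appears in the larger mask is kept, matching the adhesion/omission trade-off described above.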
  • the masks of different levels can be optimized, and inaccurate masks can be eliminated, so as to realize the optimization of the position information of the characters, further improve the accuracy of determining the position information of the characters, and further Improve the accuracy and efficiency of text correction.
  • the obtaining masks of characters in the first text at different levels includes:
  • the masks of the characters in the first text at different levels are obtained according to a loss function, where the loss function is used to characterize the accuracy of the masks of the characters and of the position information of the characters.
  • the position information of the characters themselves can be regressed more accurately, and this position information can then be used to obtain more precise control points, improving the accuracy and efficiency of text correction.
  • the method also includes:
  • the second text image is input to a text recognition network for recognition to obtain the second text.
  • a possible specific implementation manner of text recognition is provided, that is, the above corrected second text image is input into a text recognition network for recognition, and the second text can be recognized.
  • the accuracy and efficiency of text recognition can be improved.
  • an embodiment of the present application provides a text processing device, which includes:
  • An acquisition unit configured to acquire a first text picture; the first text picture is a picture including the first text;
  • a correction unit configured to input the first text picture to a text correction network for text correction to obtain a second text picture including the second text; the text correction network is a network that performs correction using the position information of each character in the first text, the text content of the second text is the same as that of the first text, and the second text in the second text picture is horizontal text.
  • the acquired first text image is input to the text correction network for text correction.
  • the position and geometric information of each character is obtained by detecting the first text included in the first text picture, the coordinate frame of each character is reconstructed using this information, finer control points are then sampled on the characters according to the coordinate frame, and the text is corrected using the control points of the characters to obtain a second text picture including the second text.
  • the text content of the second text in the second text picture is the same as the text content of the first text in the first text picture, but the second text in the second text picture is more horizontal, which is more conducive to its text recognition.
  • the currently common text correction method performs text correction with text-level control point sampling and ignores the information of the characters themselves, so the text may be distorted during sampling, making the corrected text picture difficult to recognize.
  • the method in the embodiment of the present application performs text correction with character-level control point sampling, so that the text included in the corrected text picture is more horizontal, improving the accuracy and efficiency of text correction and thereby the accuracy and efficiency of text recognition.
  • the device also includes:
  • the acquiring unit is further configured to acquire masks of different levels for the characters in the first text; the masks of the different levels have different sizes and are used to characterize the position information of the characters;
  • a construction unit configured to construct a coordinate frame of the character according to the position information of the character; the coordinate frame is used to determine the control point of the character;
  • a sampling unit configured to sample the control points of the character according to the coordinate frame;
  • the correcting unit is specifically configured to correct the characters according to the control points to obtain the second text picture including the second text.
  • a possible specific implementation manner of performing text correction on a text image is provided.
  • the masks of different levels of the characters in the first text are obtained.
  • the masks of the different levels are of different sizes and represent the position information of the characters: a smaller mask avoids the problem of characters sticking together in the text, while a larger mask avoids the problem of characters being missed.
  • the coordinate frame of the character is constructed, and the coordinate frame is used to determine the control points of the characters, and finally the characters are corrected according to the control points.
  • multiple masks of different sizes are regressed for each character position when obtaining the position information of the characters, which solves the problems of adhesion and omission between characters, improves the accuracy of determining character position information, and further improves the accuracy and efficiency of text correction.
  • the acquiring unit is further configured to search for the connected domains corresponding to the masks of different levels to obtain the target connected domains; the target connected domains are used to optimize the position information of the characters.
  • a possible specific implementation manner of optimizing character position information is provided. Specifically, after obtaining the masks of the characters at different levels in the first text, and before constructing the coordinate frame of the characters according to the position information of the characters, the embodiment of the present application also searches for the connected domains corresponding to the masks of the different levels to obtain the target connected domains. This optimizes the masks of the different levels, eliminates inaccurate masks, and thereby optimizes the position information of the characters. Through the embodiments of the present application, the accuracy of determining character position information can be further improved, thereby improving the accuracy and efficiency of text correction.
  • the acquiring unit is specifically configured to search for the connected domain corresponding to the first mask to obtain the first connected domain, and to search for the connected domain corresponding to the second mask to obtain the second connected domain;
  • the second connected domain does not overlap with the first connected domain, and the second mask is larger than the first mask;
  • the obtaining unit is further configured to use the first connected domain and the second connected domain as the target connected domain.
  • a possible specific implementation manner of determining the target connected domains is provided. Specifically, starting from the smallest mask, the connected domains of the current layer are found and put into a queue. After the current layer has been searched, higher-level masks are searched: for each connected domain in a higher-level mask, it is judged whether it overlaps with a connected domain that already exists from a previous layer; if it overlaps, it is discarded, and if it does not overlap, it is put into the queue. This process is repeated until the connected domains corresponding to the largest mask have also been searched, at which point the connected domains in the queue are used as the target connected domains.
  • the masks of different levels can be optimized, and inaccurate masks can be eliminated, so as to realize the optimization of the position information of the characters, further improve the accuracy of determining the position information of the characters, and further Improve the accuracy and efficiency of text correction.
  • the acquiring unit is further configured to acquire the masks of the characters at different levels in the first text according to a loss function, where the loss function is used to characterize the accuracy of the masks of the characters and of the position information of the characters.
  • a possible specific implementation manner of obtaining masks of different levels is provided. That is, the masks of the different levels, and the position information of the characters determined by these masks, can be obtained according to the loss function. Through the embodiment of the present application, under the guidance of the pre-designed loss function, the position information of the characters themselves can be regressed more accurately, and this position information can then be used to obtain more precise control points, improving the accuracy and efficiency of text correction.
  • the device also includes:
  • the recognition unit is configured to input the second text picture to a text recognition network for recognition to obtain the second text.
  • a possible specific implementation manner of text recognition is provided, that is, the above corrected second text image is input into a text recognition network for recognition, and the second text can be recognized.
  • the accuracy and efficiency of text recognition can be improved.
  • an embodiment of the present application provides a text processing device, the text processing device includes a processor and a memory; the memory is used to store computer-executable instructions; the processor is used to execute the computer-executable instructions stored in the memory, so that the text processing device executes the method according to the above first aspect and any possible implementation manner.
  • the text processing apparatus further includes a transceiver, configured to receive signals or send signals.
  • an embodiment of the present application provides a computer-readable storage medium for storing instructions or computer programs; when the instructions or the computer programs are executed, the method described in the first aspect and any one of its possible implementation manners is implemented.
  • the embodiment of the present application provides a computer program product, the computer program product includes instructions or computer programs; when the instructions or the computer programs are executed, the method described in the first aspect and any one of its possible implementation manners is implemented.
  • an embodiment of the present application provides a chip, the chip includes a processor, and the processor is configured to execute instructions; when the processor executes the instructions, the chip executes the method described in the first aspect and any one of its possible implementation manners.
  • the chip further includes a communication interface, and the communication interface is used for receiving signals or sending signals.
  • an embodiment of the present application provides a system, the system including at least one text processing device according to the second aspect or the third aspect, or the chip according to the sixth aspect.
  • the process of sending information and/or receiving information in the above method can be understood as the process of outputting information by the processor, And/or, the process by which the processor receives input information.
  • the processor may output information to a transceiver (or a communication interface, or a sending module) for transmission by the transceiver. After the information is output by the processor, additional processing may be required before it reaches the transceiver (or communication interface, or sending module).
  • similarly, information received by the transceiver (or communication interface) may require other processing before being input to the processor.
  • the sending information mentioned in the foregoing method can be understood as the processor outputting information.
  • receiving information may be understood as the processor receiving input information.
  • the above-mentioned processor may be a processor specially used to execute these methods, or may be a processor, such as a general-purpose processor, that executes these methods by executing computer instructions in the memory.
  • the above-mentioned memory can be a non-transitory (non-transitory) memory, such as a read-only memory (Read Only Memory, ROM), which can be integrated with the processor on the same chip, or can be respectively arranged on different chips.
  • this embodiment does not limit the type of the memory or the arrangement of the memory and the processor.
  • the above at least one memory is located outside the device.
  • the at least one memory is located within the device.
  • part of the memory of the at least one memory is located inside the device, and another part of the memory is located outside the device.
  • the processor and the memory may also be integrated into one device, that is, the processor and the memory may be integrated together.
  • the position and geometric information of each character is obtained by detecting the text information contained in the picture, and the text is corrected using this information, so that the corrected text is more horizontal. This improves the accuracy and efficiency of text correction and, in turn, the accuracy and efficiency of text recognition.
  • FIG. 1 is a schematic diagram of a text processing architecture provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a text processing scenario provided by an embodiment of the present application.
  • FIG. 3 is a schematic flow chart of a text processing method provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the effect of a character mask provided by the embodiment of the present application.
  • FIG. 5 is a schematic diagram of an effect of character-level information optimization provided by an embodiment of the present application.
  • Fig. 6a is a schematic diagram of a text correction effect provided by the embodiment of the present application.
  • Fig. 6b is a schematic diagram of another text correction effect provided by the embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a text processing device provided in an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • an embodiment means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application.
  • the occurrences of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is understood explicitly and implicitly by those skilled in the art that the embodiments described herein can be combined with other embodiments.
  • "at least one (item)" means one or more;
  • "multiple" means two or more;
  • "at least two (items)" means two or more;
  • "and/or" is used to describe the association relationship of associated objects and indicates that three relationships can exist; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • the character "/" generally indicates that the contextual objects are in an "or" relationship.
  • "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items.
  • "at least one item (piece) of a, b, or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be singular or plural.
  • Text image refers to an image that contains text information.
  • Scene text recognition refers to inputting text pictures containing text information in a specific scene into the program, and the program converts the input text pictures containing text information into computer-understandable text symbols.
  • Scene text recognition is an important branch in the field of computer vision, and has an important role and prospects in application scenarios such as automatic driving and blind assistance.
  • the embodiment of the present application provides a text processing architecture, and based on this text processing architecture proposes a new text processing method.
  • text correction can be performed using a character-level control point sampling method, so that the text included in the corrected text picture is more horizontal, improving the accuracy and efficiency of text correction and, in turn, the accuracy and efficiency of text recognition.
  • FIG. 1 is a schematic diagram of a text processing architecture provided by an embodiment of the present application.
  • the text processing architecture mainly includes a Character-Aware Sampling and Rectification Module (CASR) and a text recognition module.
  • the character-level sampling and correction module corrects the text contained in the text picture, correcting originally inclined or even curved text into horizontal text; the corrected text picture is then input to the text recognition module for text recognition to obtain the final text sequence result.
  • the character-level sampling and correction module can be divided into four parts, which are character-level information extraction part, character-level information optimization part, character reconstruction and sampling part, and image correction part.
  • the character-level information extraction part is used to extract the position and geometric information of each character in the input text picture, including the area where each character is located and character-level information such as the width, height, and sine and cosine values of the angle of each character. It should be noted that, in order to characterize the position of each character, the character-level information extraction network regresses masks of different sizes for each character to prevent the problems of character adhesion and character omission.
  • the character-level information optimization part uses the Connected Components Selecting Algorithm (CCSA) to optimize the extracted character position information.
  • the character reconstruction and sampling part uses the character-level information obtained above to reconstruct the coordinate frame of each character, and then performs control point sampling according to the constructed coordinate frame. Unlike the common rectangular coordinate frame, this part uses a parallelogram-style coordinate frame during reconstruction, which further improves the applicability of the character-level sampling and correction module.
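To make the reconstruction concrete, here is a small sketch of how a character's coordinate frame might be rebuilt from the regressed character-level quantities (center position plus the height, width, and angle sine/cosine). The patent does not spell out the exact geometry, so this is an assumed construction in which the height direction is taken perpendicular to the width direction, giving a rotated rectangle, the simplest special case of a parallelogram frame:

```python
def character_frame(cx, cy, w, h, sin_t, cos_t):
    """Reconstruct the four corners of a character's coordinate frame from its
    center (cx, cy), width w, height h, and angle given as (sin_t, cos_t).
    The width runs along the character's angle; the height is taken
    perpendicular here, which is an assumption for illustration only."""
    wx, wy = 0.5 * w * cos_t, 0.5 * w * sin_t   # half-width vector
    hx, hy = -0.5 * h * sin_t, 0.5 * h * cos_t  # half-height vector (perpendicular)
    return [(cx - wx - hx, cy - wy - hy),  # top-left
            (cx + wx - hx, cy + wy - hy),  # top-right
            (cx + wx + hx, cy + wy + hy),  # bottom-right
            (cx - wx + hx, cy - wy + hy)]  # bottom-left
```

A true parallelogram frame would replace the perpendicular height vector with a second regressed direction; the corner bookkeeping stays the same.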
  • the picture correction part is used to correct the text contained in the text picture into horizontal text by applying a thin plate spline (Thin Plate Spline, TPS) transformation according to the control points obtained by sampling.
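As an illustration only, a TPS warp can be fitted from sampled control points to their horizontal targets with a generic thin-plate-spline solve. This numpy sketch is not the patent's implementation, and the function and variable names are invented for this example:

```python
import numpy as np

def fit_tps(src, dst):
    """Fit 2D thin-plate-spline coefficients mapping src control points to dst.
    Returns a function that warps arbitrary (N, 2) coordinates."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    n = len(src)

    def U(r2):  # TPS radial kernel U(r) = r^2 * log(r^2), defined as 0 at r = 0
        with np.errstate(divide="ignore", invalid="ignore"):
            return np.where(r2 > 0, r2 * np.log(r2), 0.0)

    d2 = ((src[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    K = U(d2)
    P = np.hstack([np.ones((n, 1)), src])        # affine part [1, x, y]
    A = np.zeros((n + 3, n + 3))
    A[:n, :n], A[:n, n:], A[n:, :n] = K, P, P.T  # standard TPS system
    b = np.vstack([dst, np.zeros((3, 2))])
    coef = np.linalg.solve(A, b)                 # (n + 3, 2) coefficients

    def warp(pts):
        pts = np.asarray(pts, float)
        r2 = ((pts[:, None, :] - src[None, :, :]) ** 2).sum(-1)
        Q = np.hstack([U(r2), np.ones((len(pts), 1)), pts])
        return Q @ coef
    return warp
```

By construction the fitted warp reproduces the target positions exactly at the control points; in an image correction pipeline, the pixels of the text picture would then be resampled through the same mapping.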
  • text correction can be performed using the character-level control point sampling method, so that the text included in the corrected text picture is more horizontal, improving the accuracy and efficiency of text correction and further improving the accuracy and efficiency of text recognition.
  • based on the text processing architecture in FIG. 1, the application scenario of performing character-level sampling and correction on an input text picture can also be described.
  • FIG. 2 is a schematic diagram of a text processing scenario provided by an embodiment of the present application.
  • a text picture including the non-horizontal text "CHELSEA" is input into the text processing architecture, and the information extraction part extracts the position information of each character in the text picture, regressing the masks (M1, M2, ..., Mk) and the corresponding character-level information feature maps (Fh, Fw, Fsin, Fcos), where Fh represents the height of a character, Fw represents its width, Fsin represents the sine value of its angle, and Fcos represents the cosine value of its angle.
  • the character-level information optimization part uses the CCSA algorithm to optimize the position information of the extracted characters; the character reconstruction and sampling part reconstructs the coordinate frame of the character according to the optimized character position information, and then according to the constructed The coordinate frame samples the control points; finally, the image correction part corrects the non-horizontal text "CHELSEA" contained in the text image to the horizontal text by using the TPS transformation based on the sampled control points. Then input the text picture including the horizontal text into the text recognition network for text recognition, and obtain the final text sequence result "CHELSEA".
  • the present application also provides a new text processing method, which will be described below in conjunction with FIG. 3 .
  • FIG. 3 is a schematic flow chart of a text processing method provided in the embodiment of the present application. The method includes but is not limited to the following steps:
  • Step 301 Acquire a first text image.
  • the electronic device acquires a first text picture, where the first text picture is a picture including the first text.
  • the electronic device in the embodiment of the present application is a device equipped with a processor capable of executing computer-executed instructions.
  • the electronic device may be a computer, a server, etc., and is used to perform text correction on the acquired first text picture, and The original non-horizontal text is corrected into horizontal text, thereby improving the accuracy and efficiency of text recognition.
  • Step 302 Input the first text picture to the text correction network for text correction to obtain a second text picture including the second text.
  • the electronic device inputs the first text picture to the text correction network for text correction to obtain a second text picture including the second text.
  • the text correction network is a network that corrects using the position information of each character in the first text, the text content of the second text is the same as that of the first text, and the second text in the second text picture for horizontal text.
  • the text correction network performs text correction on the first text mainly includes character-level information extraction, character-level information optimization, and character reconstruction and sampling.
  • the process of text correction of the first text by the text correction network from these three parts will be further explained below.
  • the extracted character information mainly includes two parts: the Masks representing character position information at different levels, and the corresponding character-level information. For each layer of Mask, the task is essentially a binary classification problem: 1 indicates that a pixel belongs to the character area, and 0 indicates that a pixel belongs to the non-character area.
  • the corresponding character-level information mainly includes the height of the character, the width of the character, the sine value of the angle of the character, and the cosine value of the angle of the character.
  • the training data set contains the position labels of the character frames, so the position of each character can be determined from its label; all pixel values inside a character frame are set to 1 and all pixel values outside the frame are set to 0. This gives the Mask result under the original label, denoted G1.
  • the output of the network is a multi-layer Mask of different sizes, so the results of the Mask under the original label also need different scales. Therefore, first of all, it is necessary to set a minimum ratio, and then set the layer number k of the Mask to be regressed. After determining the value of k, the reduction ratio of each layer of Mask can be calculated, and the original label of each layer of Mask can be further obtained.
  • G2 ⁇ Gk The following results are denoted as G2 ⁇ Gk.
  • the sine and cosine values of the width, height and angle of each character can be calculated, and these character-level information of each character can be filled into the corresponding character position of G1 to obtain the corresponding
  • the character-level information is denoted as G h , G w , G sin , and G cos , respectively.
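The label-generation procedure above (shrink each annotated character frame by a per-layer ratio, fill the interior with 1) can be sketched as follows. This is a minimal illustration, not the patented implementation: axis-aligned frames and a linear ratio schedule from 1.0 down to the minimum ratio are assumptions.

```python
# Sketch: build multi-scale ground-truth masks G1..Gk from axis-aligned
# character frames. The centered shrink and the linear ratio schedule are
# assumptions for illustration only.
def build_gt_masks(h, w, char_frames, k=3, min_ratio=0.4):
    """char_frames: list of (x0, y0, x1, y1).
    Returns masks[0] == G1 (full size) .. masks[k-1] == Gk (most shrunken)."""
    masks = []
    for layer in range(k):
        # interpolate the shrink ratio from 1.0 down to min_ratio
        r = 1.0 - (1.0 - min_ratio) * layer / max(k - 1, 1)
        grid = [[0] * w for _ in range(h)]
        for (x0, y0, x1, y1) in char_frames:
            cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
            hw, hh = (x1 - x0) * r / 2, (y1 - y0) * r / 2
            # set every pixel inside the shrunken frame to 1
            for y in range(max(0, int(cy - hh)), min(h, int(cy + hh) + 1)):
                for x in range(max(0, int(cx - hw)), min(w, int(cx + hw) + 1)):
                    grid[y][x] = 1
        masks.append(grid)
    return masks
```

Smaller layers cover strictly fewer pixels per character, which is what lets them separate adjacent characters while the full-size layer G1 avoids missing any.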
  • the masks at different levels, and the character-level information of the characters determined by these masks, are obtained under the guidance of a loss function.
  • the loss function of CASR's character-level information extraction network is divided into two parts, the mask part and the character-level information (attribute) part. The mask part is essentially a binary classification problem, while the attribute part is a regression problem, so the overall loss function contains two terms, which can be expressed by the following formula:
  • L = L mask + λL attri
  • L is the overall loss function of CASR;
  • L mask is the loss function of the mask part of CASR;
  • L attri is the loss function of the character-level information;
  • λ is the balance coefficient between the two loss terms.
  • the cross-entropy loss function (Cross Entropy Loss) is a conventional loss function for classification problems.
  • for smaller masks, however, the character regions are small; that is, the number of pixels classified as "positive" is far lower than the number of pixels classified as "negative".
  • in extreme cases the gap between the two can reach 10 or even 20 times. Under such extreme class imbalance, the cross-entropy loss function offers no particular advantage.
  • the embodiment of the present application therefore adopts a combination of the Dice coefficient (Dice Coefficient) and online hard example mining (Online Hard Example Mining, OHEM) as the final loss function for L mask .
  • the Dice coefficient is expressed by the following formula:
  • D(M i , G i ) = 2·Σ(M i ·G i ) / (Σ M i ² + Σ G i ²), where the sums run over all pixels;
  • M i is the i-th layer mask output by the network;
  • G i is the result of the i-th layer mask under the original annotation.
  • L mask is expressed by the following formula:
  • L mask = 1 − (1/k)·Σ i D(M i ·O, G i ·O);
  • k is the number of regressed mask layers;
  • O is the training mask obtained by the OHEM process.
  • L attri is expressed by the following formula:
  • L attri = Σ x Σ(G 1 ·|F x − G x |) / Σ G 1 ;
  • x is the type of character-level information being regressed, one of the four types h, w, sin, and cos;
  • G x is the result of the character-level information of type x under the original annotation;
  • G 1 is the result of the largest mask under the original annotation;
  • F x is the network output for character information of type x.
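The two loss terms can be sketched numerically as below. This is an illustrative sketch on flat pixel lists, not the patented implementation: the OHEM pixel selection is omitted (i.e. O is taken as all ones), and a plain L1 penalty is assumed for the attribute term.

```python
# Illustrative sketch of the two loss terms on flat pixel lists.
# Assumptions: O (OHEM mask) is all ones; attribute loss is masked L1.
def dice(m, g):
    """Dice coefficient between a predicted mask m and ground truth g."""
    inter = sum(mi * gi for mi, gi in zip(m, g))
    denom = sum(mi * mi for mi in m) + sum(gi * gi for gi in g)
    return 2.0 * inter / denom if denom else 1.0

def l_mask(masks, gts):
    """1 minus the mean Dice coefficient over the k mask layers."""
    k = len(masks)
    return 1.0 - sum(dice(m, g) for m, g in zip(masks, gts)) / k

def l_attri(preds, gts, g1):
    """L1 attribute loss restricted by G1 to character pixels.
    preds/gts: dicts over {'h','w','sin','cos'} of per-pixel value lists."""
    norm = sum(g1)
    return sum(
        sum(w * abs(p - t) for w, p, t in zip(g1, preds[x], gts[x]))
        for x in ("h", "w", "sin", "cos")
    ) / norm
```

Because G 1 weights the attribute residuals, pixels outside character regions contribute nothing, so the network is penalized only where attribute labels are defined.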
  • guided by the loss function designed in advance, the position information of the characters themselves can be regressed more accurately, and this position information can then be used to obtain more precise control points, improving the accuracy and efficiency of text correction.
  • to solve the problems of characters sticking together and characters being lost, the output of CASR contains multiple layers of masks of different sizes. It is necessary to search for connected domains directly on each layer's mask, but a single character is likely to appear on several mask layers at the same time, so the embodiment of this application proposes a connected domain selection algorithm to solve this problem.
  • when judging whether two connected domains overlap, CCSA does not compare them pixel by pixel; instead it directly uses element-wise matrix multiplication and judges overlap by whether the resulting matrix is the zero matrix. Compared with pixel-by-pixel comparison, this method is not only simpler but also has lower time complexity.
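The overlap test described above can be sketched as follows; the exact matrix representation is an assumption, but the principle is the one stated: two binary masks overlap if and only if their element-wise product is not the all-zero matrix.

```python
# Sketch of CCSA's overlap test: two binary connected-domain masks overlap
# iff their element-wise product contains a nonzero entry.
def overlaps(a, b):
    """a, b: binary masks as equal-shaped lists of rows."""
    return any(
        x * y != 0
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )
```

Note that `any` short-circuits on the first nonzero product, so in practice the test often terminates long before scanning the whole matrix.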
  • for the effect of CCSA, refer to FIG. 5, a schematic diagram of the effect of character-level information optimization provided by an embodiment of the present application.
  • after the CCSA algorithm runs, each connected domain is obtained; the connected domains selected by the CCSA algorithm are denoted SC. At this point the height, width, and the sine and cosine values of the angle of each character can be obtained,
  • as expressed by the following formula:
  • x j = Σ p∈SC j F x (p) / |SC j |;
  • x is the type of character-level information being regressed, one of the four types h, w, sin, and cos;
  • x j is the x information of the j-th character;
  • SC j is the connected domain of the j-th character;
  • F x is the feature map of character information of type x.
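The per-character read-out can be sketched as below. The averaging form is an assumption consistent with the definitions above: each character's attribute value is taken as the mean of the feature map F x over the pixels of its connected domain SC j .

```python
# Sketch (assumed form): recover a character's attribute value by averaging
# the attribute feature map F_x over the pixels of its connected domain SC_j.
def char_attribute(feature_map, domain):
    """feature_map: 2-D list of floats; domain: set of (row, col) pixels."""
    return sum(feature_map[r][c] for r, c in domain) / len(domain)
```

Averaging over the whole domain rather than reading a single pixel makes the estimate robust to per-pixel regression noise.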
  • the widths, heights, and angles of the four characters represented by points C 1 , C 2 , C 3 , and C 4 in (a) in Figure 6a are denoted w 1,2,3,4 , h 1,2,3,4 , and a 1,2,3,4 respectively. Taking the character "U" as an example, C 3 and C 2 are first connected, the segment C 3 C 2 is divided in proportion to the character widths to obtain the point D 1 , and a segment A 0 A 3 with midpoint D 1 is then constructed from the height and angle of "U", with A 0 A 3 = h 3 .
  • the line segment A 1 A 2 can be constructed toward the right in the same way. Connecting A 0 A 1 and A 3 A 2 then yields the parallelogram A 0 A 1 A 2 A 3 , which is the reconstructed character frame of the character "U", as shown in (b) in Figure 6a.
  • the parallelogram frames in (c) in Figure 6a are the reconstructed coordinate frames of the characters, and the dots are the control points obtained by sampling.
  • (d) in Figure 6a shows the corrected text image and the corresponding control points.
  • in summary, the position information of each character is obtained, the coordinate frame of each character is reconstructed using that position information, finer control point sampling is then performed on the characters according to the coordinate frames, and the text is corrected according to the sampled control points, making the corrected text more horizontal.
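The parallelogram-frame construction can be sketched for a single character as follows. The parametrization is an assumption for illustration: the frame is built from a character's center, width, height, and angle, with the angle tilting the two "vertical" sides while the width stays horizontal, which always yields a parallelogram.

```python
# Sketch (assumed parametrization): reconstruct a parallelogram character
# frame from center (cx, cy), width w, height h and angle (radians).
import math

def char_frame(cx, cy, w, h, angle_rad):
    """Return the four corners A0 (top-left), A1 (top-right),
    A2 (bottom-right), A3 (bottom-left)."""
    # direction of the tilted "vertical" sides, of half-length h/2
    dx, dy = math.sin(angle_rad) * h / 2, math.cos(angle_rad) * h / 2
    left, right = cx - w / 2, cx + w / 2
    a0 = (left - dx, cy - dy)
    a3 = (left + dx, cy + dy)
    a1 = (right - dx, cy - dy)
    a2 = (right + dx, cy + dy)
    return a0, a1, a2, a3
```

At angle 0 the frame degenerates to an axis-aligned rectangle; for any angle the opposite sides remain parallel, matching the parallelogram frames in (c) of Figure 6a.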
  • refer to FIG. 6b, a schematic diagram of another text correction effect provided by an embodiment of the present application.
  • Figure 6b compares the text correction effect of an existing method and of the method of the embodiment of the present application on the same input text pictures.
  • the first row shows the input text pictures, containing the non-horizontal texts "SIARI", "handmade", "COFFEE", "CHELSEA", and "MANCHESTER";
  • the second row shows the results obtained after text correction of the above text pictures using the existing method;
  • the third row shows the results obtained after text correction of the above text pictures using the method of the embodiment of this application.
  • it can be seen that the text in the pictures obtained by the correction method of the embodiment of the present application is more horizontal, and the accuracy and efficiency of text correction are higher;
  • accordingly, the accuracy and efficiency of text recognition based on it are also higher.
  • refer to FIG. 7, a schematic structural diagram of a text processing device provided in an embodiment of the present application.
  • the text processing device 70 may include an acquisition unit 701 and a correction unit 702, which are described as follows:
  • an acquisition unit 701, configured to acquire a first text picture, the first text picture being a picture including a first text;
  • a correction unit 702, configured to input the first text picture into the text correction network for text correction to obtain a second text picture including a second text; the text correction network is a network that performs correction using the position information of each character in the first text, the text content of the second text is the same as the text content of the first text, and the second text in the second text picture is horizontal text.
  • in the embodiment of the present application, the acquired first text picture is input into the text correction network for text correction.
  • specifically, the position information of each character is obtained by detecting the first text included in the first text picture, the coordinate frame of each character is reconstructed using that position information, finer control point sampling is then performed on the characters according to the coordinate frames, and the control points of the characters are used to correct the text, obtaining the second text picture including the second text.
  • the text content of the second text in the second text picture is the same as the text content of the first text in the first text picture, but the second text in the second text picture is more horizontal, which is more conducive to its text recognition.
  • the currently common text correction method performs correction with text-level control point sampling and ignores the information of the characters themselves; the text may therefore be distorted during sampling, making the corrected text picture difficult to recognize.
  • compared with that method, the method of the embodiment of the present application performs text correction with character-level control point sampling, so that the text included in the corrected text picture is more horizontal, improving the accuracy and efficiency of text correction and thereby the accuracy and efficiency of text recognition.
  • the device also includes:
  • the acquiring unit 701 is further configured to acquire masks of different levels of characters in the first text; the masks of different levels have different sizes, and the masks of different levels are used to represent the character position information;
  • a construction unit 703 configured to construct a coordinate frame of the character according to the position information of the character; the coordinate frame is used to determine the control point of the character;
  • a sampling unit 704 configured to sample control points of the character according to the coordinate frame
  • the correcting unit 702 is specifically configured to correct the characters according to the control points to obtain the second text picture including the second text.
  • a possible specific implementation of performing text correction on a text picture is thus provided.
  • specifically, the masks at different levels of the characters in the first text are obtained.
  • the masks at different levels have different sizes and represent the position information of the characters; a smaller mask can avoid the problem of characters sticking together in the text, while a larger mask avoids the problem of characters being missed.
  • the coordinate frame of each character is then constructed from the character's position information, the coordinate frame is used to determine the character's control points, and finally the characters are corrected according to the control points.
  • when the position information of the characters is obtained, multiple masks of different sizes are regressed for each character position, which can solve the problems of sticking and omission between characters, improve the accuracy of determining character position information, and further improve the accuracy and efficiency of text correction.
  • the acquiring unit 701 is further configured to search for the connected domains corresponding to the masks of different levels to obtain the target connected domains; the target connected domains are used to optimize the positions of the characters information.
  • a possible specific implementation manner of optimizing character position information is provided. Specifically, after obtaining the masks of the characters at different levels in the first text, and before constructing the coordinate frame of the characters according to the position information of the characters, the embodiment of the present application will also search for the connectivity corresponding to the masks of the different levels. domain, to obtain its target connected domain, which can optimize the masks of different levels, eliminate inaccurate masks, and then realize the optimization of the position information of characters. Through the embodiments of the present application, the accuracy of determining character position information can be further improved, thereby improving the accuracy and efficiency of text correction.
  • the acquiring unit 701 is specifically configured to search for the connected domain corresponding to the first mask to obtain the first connected domain; and to search for the connected domain corresponding to the second mask to obtain the second connected domain. domain; the second connected domain does not overlap with the first connected domain, and the second mask is larger than the first mask;
  • the obtaining unit 701 is further configured to use the first connected domain and the second connected domain as the target connected domain.
  • a possible specific implementation of determining the target connected domain is provided. Specifically, starting from the smallest mask, the connected domains of the current layer are found and placed into a queue. After the current layer is finished, the next higher-level mask is searched; every connected domain in the higher-level mask is checked for overlap with the connected domains already found in the previous layers, and is discarded if it overlaps, or placed into the queue if it does not. This process is repeated until the connected domains of the largest mask have also been searched; at that point, the connected domains in the queue are taken as the target connected domains.
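The selection loop just described can be sketched as follows. Assumptions for illustration: connected domains are supplied per layer, smallest mask first, each domain as a set of (row, col) pixels, and two domains overlap when they share a pixel (equivalent to a nonzero element-wise product of their binary masks).

```python
# Sketch of the CCSA-style selection loop over per-layer connected domains.
def select_domains(layers):
    """layers: list of layers (smallest mask first), each a list of
    connected domains given as sets of (row, col) pixels."""
    queue = []
    for domains in layers:
        for dom in domains:
            # keep a domain only if it is disjoint from every kept domain
            if all(not (dom & kept) for kept in queue):
                queue.append(dom)
    return queue
```

Because the smallest masks are processed first, a character found cleanly at a small scale suppresses its (possibly merged) larger-scale duplicates, while characters missed at small scales are still picked up from larger masks.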
  • the masks of different levels can be optimized, and inaccurate masks can be eliminated, so as to realize the optimization of the position information of the characters, further improve the accuracy of determining the position information of the characters, and further Improve the accuracy and efficiency of text correction.
  • the obtaining unit 701 is further configured to obtain the masks at different levels of the characters in the first text according to a loss function, the loss function being used to characterize the accuracy of the masks of the characters and of the position information of the characters.
  • guided by a loss function designed in advance, the position information of the characters themselves can be regressed more accurately, and this position information can then be used to obtain more precise control points, improving the accuracy and efficiency of text correction.
  • the device also includes:
  • the recognition unit 705 is configured to input the second text picture to a text recognition network for recognition to obtain the second text.
  • a possible specific implementation manner of text recognition is provided, that is, the above corrected second text image is input into a text recognition network for recognition, and the second text can be recognized.
  • the accuracy and efficiency of text recognition can be improved.
  • each unit in the device shown in FIG. 7 can be separately or wholly combined into one or several other units, or one (or some) of the units can be further split into multiple functionally smaller units; either arrangement can achieve the same operation without affecting the realization of the technical effects of the embodiments of the present application.
  • the above-mentioned units are divided based on logical functions. In practical applications, the function of one unit may also be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present application, the device may also include other units, and in practical applications these functions may also be realized with the assistance of other units and may be implemented cooperatively by multiple units.
  • for the input text picture, the position information of each character is obtained by detecting the text information contained in it, the coordinate frame of each character is reconstructed using that position information, and finer control point sampling is then performed on the characters according to the coordinate frames, which makes the corrected text more horizontal, improves the accuracy and efficiency of text correction, and further improves the accuracy and efficiency of text recognition.
  • FIG. 8 is a schematic structural diagram of an electronic device 80 provided in an embodiment of the present application.
  • the electronic device 80 may include a memory 801 and a processor 802 .
  • a communication interface 803 and a bus 804 may also be included, wherein the memory 801 , the processor 802 and the communication interface 803 are connected to each other through the bus 804 .
  • the communication interface 803 is used for data interaction with the above-mentioned text processing device 70 .
  • the memory 801 is used to provide a storage space, in which data such as operating systems and computer programs can be stored.
  • the memory 801 includes, but is not limited to, a random access memory (random access memory, RAM), a read-only memory (read-only memory, ROM), an erasable programmable read-only memory (erasable programmable read only memory, EPROM), or a portable read-only memory (compact disc read-only memory, CD-ROM).
  • the processor 802 is a module that performs arithmetic and logic operations, and may be one or a combination of processing modules such as a central processing unit (central processing unit, CPU), a graphics processing unit (graphics processing unit, GPU), or a microprocessor (microprocessor unit, MPU).
  • a computer program is stored in the memory 801, and the processor 802 calls the computer program stored in the memory 801 to execute the text processing method shown in FIG. 3 above:
  • the first text picture is a picture including the first text
  • the text correction network is a network that performs correction using the position information of each character in the first text, the text content of the second text is the same as the text content of the first text, and the second text in the second text picture is horizontal text.
  • the processor 802 calls the computer program stored in the memory 801, and can also be used to execute the method steps performed by the various units in the text processing device 70 shown in FIG. 7; details are not repeated here.
  • the position information of each character is obtained by detecting the text information contained in it, and the coordinate frame of each character is reconstructed by using the position information, and then according to the The coordinate frame performs finer control point sampling on the characters, which makes the corrected text more horizontal, improves the accuracy and efficiency of text correction, and further improves the accuracy and efficiency of text recognition.
  • the embodiment of the present application also provides a computer-readable storage medium in which a computer program is stored; when the computer program is run on one or more processors, the method shown in Figure 3 above can be implemented.
  • An embodiment of the present application further provides a computer program product, where the computer program product includes a computer program, and when the computer program product runs on a processor, the method shown in FIG. 3 above can be implemented.
  • the embodiment of the present application also provides a chip, the chip includes a processor, and the processor is configured to execute instructions, and when the processor executes the instructions, the above method shown in FIG. 3 can be implemented.
  • the chip also includes a communication interface, which is used for inputting signals or outputting signals.
  • the embodiment of the present application also provides a system, which includes at least one text processing device 70 or electronic device 80 or chip as described above.
  • for the input text picture, the position information of each character is obtained by detecting the text information contained in it, the coordinate frame of each character is reconstructed using that position information, and finer control point sampling is then performed on the characters according to the coordinate frames, which makes the corrected text more horizontal, improves the accuracy and efficiency of text correction, and further improves the accuracy and efficiency of text recognition.
  • all or part of the processes can be completed by hardware related to computer programs, and the computer programs can be stored in computer-readable storage media.
  • when the computer programs are executed, they may include the processes of the foregoing method embodiments.
  • the aforementioned storage media include various media capable of storing computer program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.


Abstract

The present application discloses a text processing method and a related device. The method includes: acquiring a first text picture, the first text picture being a picture including a first text; and inputting the first text picture into a text correction network for text correction to obtain a second text picture including a second text, where the text correction network is a network that performs correction using the position information of each character in the first text, the text content of the second text is the same as the text content of the first text, and the second text in the second text picture is horizontal text. For an input text picture, the method obtains the position information of each character by detecting the text information the picture contains, reconstructs the coordinate frame of each character using that position information, and then performs finer control point sampling on the characters according to the coordinate frames, making the corrected text more horizontal, improving the accuracy and efficiency of text correction, and thereby improving the accuracy and efficiency of text recognition.

Description

A text processing method and related device — Technical field
The present application relates to the technical field of scene text recognition (scene text recognition, STR), and in particular to a text processing method and a related device.
Background
Scene text recognition means that a text picture containing text information in a specific scene is input into a program, and the program converts the input text picture containing text information into text symbols understandable by a computer. Scene text recognition is an important branch of computer vision and plays an important role, with broad prospects, in application scenarios such as automatic driving and assistance for the blind; improving the accuracy and efficiency of scene text recognition is therefore particularly important.
Current scene text recognition usually achieves high recognition accuracy only for horizontal text. For the tilted or even curved text contained in text pictures, current scene text recognition methods can hardly recognize it correctly. Therefore, before the text information contained in a text picture is recognized, the originally tilted or curved text usually needs to be corrected into horizontal text.
How to correct text efficiently, so as to improve the accuracy and efficiency of scene text recognition, has therefore become an important research topic for those skilled in the art.
Summary of the invention
The embodiments of the present application provide a text processing method and a related device. For an input text picture, the position information of each character is obtained by detecting the text information the picture contains, the coordinate frame of each character is reconstructed using that position information, and finer control point sampling is then performed on the characters according to the coordinate frames, making the corrected text more horizontal, improving the accuracy and efficiency of text correction, and thereby improving the accuracy and efficiency of text recognition.
In a first aspect, an embodiment of the present application provides a text processing method, the method including:
acquiring a first text picture, the first text picture being a picture including a first text;
inputting the first text picture into a text correction network for text correction to obtain a second text picture including a second text; the text correction network is a network that performs correction using the position information of each character in the first text, the text content of the second text is the same as the text content of the first text, and the second text in the second text picture is horizontal text.
In the embodiment of the present application, the acquired first text picture is input into the text correction network for text correction. Specifically, the position and geometric information of each character is obtained by detecting the first text included in the first text picture, the coordinate frame of each character is reconstructed using that information, finer control point sampling is then performed on the characters according to the coordinate frames, and the control points of the characters are used to correct the text, obtaining a second text picture including the second text. The text content of the second text in the second text picture is the same as the text content of the first text in the first text picture, but the second text in the second text picture is more horizontal, which is more conducive to its text recognition.
The currently common text correction method performs correction with text-level control point sampling and ignores the information of the characters themselves; the text may therefore be distorted during sampling, making the corrected text picture difficult to recognize.
Compared with that method, the method of the embodiment of the present application performs text correction with character-level control point sampling, so that the text included in the corrected text picture is more horizontal, improving the accuracy and efficiency of text correction and thereby the accuracy and efficiency of text recognition.
In a possible implementation, inputting the first text picture into the text correction network for text correction to obtain the second text picture including the second text includes:
acquiring masks at different levels of the characters in the first text, the masks at different levels having different sizes and being used to represent the position information of the characters;
constructing a coordinate frame of a character according to the position information of the character, the coordinate frame being used to determine the control points of the character;
correcting the characters according to the control points to obtain the second text picture including the second text.
In the embodiment of the present application, a possible specific implementation of performing text correction on a text picture is provided. Specifically, the masks at different levels of the characters in the first text are obtained; the masks at different levels have different sizes and represent the position information of the characters, where a smaller mask can avoid the problem of characters sticking together in the text, and a larger mask can avoid the problem of characters being missed. The coordinate frame of each character is then constructed from the position information represented by the masks at different levels, the coordinate frame is used to determine the character's control points, and finally the characters are corrected according to the control points. Through the embodiment of the present application, multiple masks of different sizes are regressed for each character position when obtaining the position information of the characters, which can solve the problems of sticking and omission between characters, improve the accuracy of determining character position information, and further improve the accuracy and efficiency of text correction.
In a possible implementation, before constructing the coordinate frame of the character according to the position information of the character, the method further includes:
searching for the connected domains corresponding to the masks at different levels to obtain target connected domains, the target connected domains being used to optimize the position information of the characters.
In the embodiment of the present application, a possible specific implementation of optimizing the position information of characters is provided. Specifically, after obtaining the masks at different levels of the characters in the first text, and before constructing the coordinate frames of the characters from the characters' position information, the embodiment of the present application also searches for the connected domains corresponding to the masks at different levels to obtain the target connected domains; the target connected domains can optimize the masks at different levels and eliminate inaccurate masks, thereby optimizing the position information of the characters. Through the embodiment of the present application, the accuracy of determining character position information can be further improved, and thereby the accuracy and efficiency of text correction.
In a possible implementation, searching for the connected domains corresponding to the masks at different levels to obtain the target connected domains includes:
searching for the connected domain corresponding to a first mask to obtain a first connected domain, and searching for the connected domain corresponding to a second mask to obtain a second connected domain, where the second connected domain does not coincide with the first connected domain and the second mask is larger than the first mask;
taking the first connected domain and the second connected domain as the target connected domains.
In the embodiment of the present application, a possible specific implementation of determining the target connected domains is provided. Specifically, starting from the smallest mask, the connected domains of the current layer are found and placed into a queue. After the current layer is finished, the next higher-level mask is searched; every connected domain in the higher-level mask is checked for overlap with the connected domains found in the previous layers, and is discarded if it overlaps, or placed into the queue if it does not. This process is repeated until the connected domains of the largest mask have also been searched; at that point, the connected domains in the queue are taken as the target connected domains. The target connected domains obtained through the embodiment of the present application can optimize the masks at different levels and eliminate inaccurate masks, thereby optimizing the position information of the characters, further improving the accuracy of determining character position information, and thereby improving the accuracy and efficiency of text correction.
In a possible implementation, acquiring the masks at different levels of the characters in the first text includes:
acquiring the masks at different levels of the characters in the first text according to a loss function, the loss function being used to characterize the accuracy of the masks of the characters and of the position information of the characters.
Through the embodiment of the present application, guided by a loss function designed in advance, the position information of the characters themselves can be regressed more accurately, and this position information can then be used to obtain more precise control points, improving the accuracy and efficiency of text correction.
In a possible implementation, the method further includes:
inputting the second text picture into a text recognition network for recognition to obtain the second text.
In the embodiment of the present application, a possible specific implementation of text recognition is provided: the corrected second text picture obtained above is input into a text recognition network for recognition, and the second text can be recognized. By performing text recognition on the corrected text picture, the accuracy and efficiency of text recognition can be improved.
In a second aspect, an embodiment of the present application provides a text processing device, the device including:
an acquisition unit, configured to acquire a first text picture, the first text picture being a picture including a first text;
a correction unit, configured to input the first text picture into a text correction network for text correction to obtain a second text picture including a second text; the text correction network is a network that performs correction using the position information of each character in the first text, the text content of the second text is the same as the text content of the first text, and the second text in the second text picture is horizontal text.
In the embodiment of the present application, the acquired first text picture is input into the text correction network for text correction. Specifically, the position and geometric information of each character is obtained by detecting the first text included in the first text picture, the coordinate frame of each character is reconstructed using that information, finer control point sampling is then performed on the characters according to the coordinate frames, and the control points of the characters are used to correct the text, obtaining a second text picture including the second text. The text content of the second text in the second text picture is the same as the text content of the first text in the first text picture, but the second text in the second text picture is more horizontal, which is more conducive to its text recognition.
The currently common text correction method performs correction with text-level control point sampling and ignores the information of the characters themselves; the text may therefore be distorted during sampling, making the corrected text picture difficult to recognize.
Compared with that method, the method of the embodiment of the present application performs text correction with character-level control point sampling, so that the text included in the corrected text picture is more horizontal, improving the accuracy and efficiency of text correction and thereby the accuracy and efficiency of text recognition.
In a possible implementation, the device further includes:
the acquisition unit, further configured to acquire masks at different levels of the characters in the first text, the masks at different levels having different sizes and being used to represent the position information of the characters;
a construction unit, configured to construct a coordinate frame of a character according to the position information of the character, the coordinate frame being used to determine the control points of the character;
a sampling unit, configured to sample the control points of the character according to the coordinate frame;
the correction unit, specifically configured to correct the characters according to the control points to obtain the second text picture including the second text.
In the embodiment of the present application, a possible specific implementation of performing text correction on a text picture is provided. Specifically, the masks at different levels of the characters in the first text are obtained; the masks at different levels have different sizes and represent the position information of the characters, where a smaller mask can avoid the problem of characters sticking together in the text, and a larger mask can avoid the problem of characters being missed. The coordinate frame of each character is then constructed from the position information represented by the masks at different levels, the coordinate frame is used to determine the character's control points, and finally the characters are corrected according to the control points. Through the embodiment of the present application, multiple masks of different sizes are regressed for each character position when obtaining the position information of the characters, which can solve the problems of sticking and omission between characters, improve the accuracy of determining character position information, and further improve the accuracy and efficiency of text correction.
In a possible implementation, the acquisition unit is further configured to search for the connected domains corresponding to the masks at different levels to obtain target connected domains, the target connected domains being used to optimize the position information of the characters.
In the embodiment of the present application, a possible specific implementation of optimizing the position information of characters is provided. Specifically, after obtaining the masks at different levels of the characters in the first text, and before constructing the coordinate frames of the characters from the characters' position information, the embodiment of the present application also searches for the connected domains corresponding to the masks at different levels to obtain the target connected domains; the target connected domains can optimize the masks at different levels and eliminate inaccurate masks, thereby optimizing the position information of the characters. Through the embodiment of the present application, the accuracy of determining character position information can be further improved, and thereby the accuracy and efficiency of text correction.
In a possible implementation, the acquisition unit is specifically configured to search for the connected domain corresponding to a first mask to obtain a first connected domain, and to search for the connected domain corresponding to a second mask to obtain a second connected domain, where the second connected domain does not coincide with the first connected domain and the second mask is larger than the first mask;
the acquisition unit is further specifically configured to take the first connected domain and the second connected domain as the target connected domains.
In the embodiment of the present application, a possible specific implementation of determining the target connected domains is provided. Specifically, starting from the smallest mask, the connected domains of the current layer are found and placed into a queue. After the current layer is finished, the next higher-level mask is searched; every connected domain in the higher-level mask is checked for overlap with the connected domains found in the previous layers, and is discarded if it overlaps, or placed into the queue if it does not. This process is repeated until the connected domains of the largest mask have also been searched; at that point, the connected domains in the queue are taken as the target connected domains. The target connected domains obtained through the embodiment of the present application can optimize the masks at different levels and eliminate inaccurate masks, thereby optimizing the position information of the characters, further improving the accuracy of determining character position information, and thereby improving the accuracy and efficiency of text correction.
In a possible implementation, the acquisition unit is further specifically configured to acquire the masks at different levels of the characters in the first text according to a loss function, the loss function being used to characterize the accuracy of the masks of the characters and of the position information of the characters.
In the embodiment of the present application, a possible specific implementation of the acquired masks at different levels is provided: the masks at different levels, and the position information of the characters determined by these masks, are obtained according to the loss function. Through the embodiment of the present application, guided by a loss function designed in advance, the position information of the characters themselves can be regressed more accurately, and this position information can then be used to obtain more precise control points, improving the accuracy and efficiency of text correction.
In a possible implementation, the device further includes:
a recognition unit, configured to input the second text picture into a text recognition network for recognition to obtain the second text.
In the embodiment of the present application, a possible specific implementation of text recognition is provided: the corrected second text picture obtained above is input into a text recognition network for recognition, and the second text can be recognized. By performing text recognition on the corrected text picture, the accuracy and efficiency of text recognition can be improved.
In a third aspect, an embodiment of the present application provides a text processing device, the text processing device including a processor and a memory; the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions stored in the memory, so that the text processing device performs the method of the first aspect or any possible implementation thereof. Optionally, the text processing device further includes a transceiver, the transceiver being configured to receive signals or send signals.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, the computer-readable storage medium being configured to store instructions or a computer program; when the instructions or the computer program are executed, the method of the first aspect or any possible implementation thereof is implemented.
In a fifth aspect, an embodiment of the present application provides a computer program product, the computer program product including instructions or a computer program; when the instructions or the computer program are executed, the method of the first aspect or any possible implementation thereof is implemented.
In a sixth aspect, an embodiment of the present application provides a chip, the chip including a processor configured to execute instructions; when the processor executes the instructions, the chip performs the method of the first aspect or any possible implementation thereof. Optionally, the chip further includes a communication interface, the communication interface being configured to receive signals or send signals.
In a seventh aspect, an embodiment of the present application provides a system, the system including at least one text processing device of the second or third aspect, or at least one chip of the sixth aspect.
In addition, in the course of performing the method of the first aspect or any possible implementation thereof, the processes of sending information and/or receiving information in the above method can be understood as processes in which the processor outputs information and/or the processor receives input information. When outputting information, the processor may output the information to a transceiver (or a communication interface, or a sending module) so that it is transmitted by the transceiver. After being output by the processor, the information may also require other processing before reaching the transceiver. Similarly, when the processor receives input information, the transceiver (or communication interface, or receiving module) receives the information and inputs it into the processor; further, after the transceiver receives the information, the information may require other processing before being input into the processor.
Based on this principle, for example, the sending of information mentioned in the foregoing method can be understood as the processor outputting information; likewise, receiving information can be understood as the processor receiving input information.
Optionally, the operations of transmitting, sending, and receiving involving the processor may, unless otherwise specified or in conflict with their actual role or internal logic in the relevant description, be understood more generally as processor output, reception, and input operations.
Optionally, in performing the method of the first aspect or any possible implementation thereof, the processor may be a processor specially designed to perform these methods, or a processor that performs them by executing computer instructions in a memory, such as a general-purpose processor. The memory may be a non-transitory (non-transitory) memory, such as a read-only memory (Read Only Memory, ROM), which may be integrated with the processor on the same chip or arranged on different chips; the embodiments of the present application do not limit the type of the memory or the manner in which the memory and the processor are arranged.
In a possible implementation, the at least one memory is located outside the device.
In yet another possible implementation, the at least one memory is located inside the device.
In yet another possible implementation, part of the at least one memory is located inside the device, and the other part is located outside the device.
In the present application, the processor and the memory may also be integrated into one component; that is, the processor and the memory may be integrated together.
In the embodiments of the present application, for an input text picture, the position and geometric information of each character is obtained by detecting the text information the picture contains, and this information is used for text correction, making the corrected text more horizontal, improving the accuracy and efficiency of text correction, and thereby improving the accuracy and efficiency of text recognition.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present application more clearly, the accompanying drawings required in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
FIG. 1 is a schematic architectural diagram of text processing provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a text processing scenario provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a text processing method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the effect of a character mask provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of the effect of character-level information optimization provided by an embodiment of the present application;
FIG. 6a is a schematic diagram of a text correction effect provided by an embodiment of the present application;
FIG. 6b is a schematic diagram of another text correction effect provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a text processing device provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed description
To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described below with reference to the accompanying drawings.
The terms "first" and "second" in the specification, claims, and drawings of the present application are used to distinguish different objects, not to describe a particular order. In addition, the terms "include" and "have", and any variants thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
"Embodiment" mentioned herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to independent or alternative embodiments mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
It should be understood that in the present application, "at least one (item)" means one or more, "multiple" means two or more, and "at least two (items)" means two, three, or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items. For example, at least one of a, b, or c may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.
The present application provides a text processing method. To describe the solution of the present application more clearly, some knowledge related to text processing is first introduced below.
Text picture: a picture containing text information.
Scene text recognition: a text picture containing text information in a specific scene is input into a program, and the program converts the input text picture containing text information into text symbols understandable by a computer. Scene text recognition is an important branch of computer vision and plays an important role, with broad prospects, in application scenarios such as automatic driving and assistance for the blind.
Current scene text recognition usually achieves high recognition accuracy only for horizontal text. For the tilted or even curved text contained in text pictures, current scene text recognition methods can hardly recognize it correctly. Therefore, before the text information contained in a text picture is recognized, the originally tilted or curved text usually needs to be corrected into horizontal text. The currently common text correction method performs correction with text-level control point sampling and ignores the information of the characters themselves; the text may therefore be distorted during sampling, making the corrected text picture difficult to recognize.
Aiming at the problems in the above text correction process — text distortion and poor recognizability, which lead to low correction accuracy and efficiency and low text recognition accuracy and efficiency — the embodiments of the present application provide a text processing architecture and, based on this architecture, propose a new text processing method. By implementing the text processing architecture and text processing method provided by the present application, text correction can be performed with character-level control point sampling, so that the text included in the corrected text picture is more horizontal, improving the accuracy and efficiency of text correction and thereby the accuracy and efficiency of text recognition.
The embodiments of the present application are described below with reference to the accompanying drawings.
Refer to FIG. 1, a schematic architectural diagram of text processing provided by an embodiment of the present application.
As shown in FIG. 1, the text processing architecture mainly includes a character-aware sampling and rectification module (Character-Aware Sampling and Rectification Module, CASR) and a text recognition module. After a text picture is input into the character-level sampling and rectification module, the module corrects the text contained in the text picture, rectifying originally tilted or even curved text into horizontal text; the corrected text picture is then input into the text recognition module for text recognition to obtain the final text sequence result.
The character-level sampling and rectification module can be divided into four parts: a character-level information extraction part, a character-level information optimization part, a character reconstruction and sampling part, and a picture rectification part.
The character-level information extraction part is used to extract the position and geometric information of each character in the input text picture, including the region where the character is located and character-level information such as the width, height, and the sine and cosine of the angle of each character. Note that, to represent the position of each character, the character-level information extraction network regresses masks (Mask) of different sizes for each character, to prevent the problems of characters merging together and characters being lost.
The character-level information optimization part proposes a connected domain selection algorithm (Connected Components Selecting Algorithm, CCSA), used to optimize the character position information extracted by the character-level information extraction part. The connected domain selection algorithm can further avoid the problem of missed characters, and the position, width, height, and angle of the characters are also optimized here.
The character reconstruction and sampling part uses the character-level information obtained above to reconstruct the coordinate frame of each character, and then samples control points according to the constructed coordinate frames. Unlike the common rectangular coordinate frame, this part reconstructs parallelogram-style coordinate frames, further improving the applicability of the character-level sampling and rectification module.
The picture rectification part uses a thin plate spline (Thin Plate Spline, TPS) transformation, based on the sampled control points, to rectify the text contained in the text picture into horizontal text.
Through the text processing architecture in the embodiment of the present application, text correction can be performed with character-level control point sampling, so that the text included in the corrected text picture is more horizontal, improving the accuracy and efficiency of text correction and thereby the accuracy and efficiency of text recognition.
Specifically, based on the text processing architecture in FIG. 1, the description can also be combined with the application scenario of performing character-level sampling and rectification on an input text picture.
Refer to FIG. 2, a schematic diagram of a text processing scenario provided by an embodiment of the present application.
As shown in FIG. 2, a text picture including the non-horizontal text "CHELSEA" is first input into the text processing architecture. The information extraction part extracts the position information of each character in the text picture and regresses masks of different sizes for the characters (M 1 , M 2 , …, M k ), as well as the corresponding character-level feature maps (F h , F w , F sin , F cos ), where F h represents the height of the character, F w the width of the character, F sin the sine of the character's angle, and F cos the cosine of the character's angle. Next, the character-level information optimization part uses the CCSA algorithm to optimize the extracted character position information; the character reconstruction and sampling part then reconstructs the coordinate frame of each character according to the optimized character position information and samples control points according to the constructed coordinate frames; finally, the picture rectification part uses the TPS transformation, based on the sampled control points, to rectify the non-horizontal text "CHELSEA" in the text picture into horizontal text. The text picture including the horizontal text is then input into the text recognition network for text recognition, obtaining the final text sequence result "CHELSEA".
Based on the text processing architecture in FIG. 1 above, the present application further provides a new text processing method, which is described below with reference to FIG. 3.
Refer to FIG. 3, a schematic flowchart of a text processing method provided by an embodiment of the present application; the method includes, but is not limited to, the following steps:
Step 301: acquire a first text picture.
An electronic device acquires a first text picture, the first text picture being a picture including a first text.
The electronic device in the embodiment of the present application is a device equipped with a processor capable of executing computer-executable instructions; the electronic device may be a computer, a server, or the like, and is used to perform text correction on the acquired first text picture, correcting originally non-horizontal text into horizontal text, thereby improving the accuracy and efficiency of text recognition.
Step 302: input the first text picture into a text correction network for text correction to obtain a second text picture including a second text.
The electronic device inputs the first text picture into the text correction network for text correction to obtain a second text picture including a second text. The text correction network is a network that performs correction using the position information of each character in the first text; the text content of the second text is the same as the text content of the first text, and the second text in the second text picture is horizontal text.
Specifically, as can be seen from the text processing architecture in FIG. 1 above, the process by which the text correction network corrects the first text mainly includes character-level information extraction, character-level information optimization, and character reconstruction and sampling. The text correction process is further described below in terms of these three parts.
First, character-level information extraction.
In extracting the first text information from the first text picture, the extracted character information mainly comprises two parts: masks at different levels that represent character position information, and the corresponding character-level information. Each layer of mask is, in essence, a binary classification problem: in the output, 1 indicates that the pixel belongs to a character region, and 0 indicates that the pixel belongs to a non-character region. The corresponding character-level information mainly includes the height of the character, the width of the character, and the sine and cosine values of the character's angle.
Extracting character-level information requires character-level text position detection, and one great difficulty there is the problem of characters sticking together; see FIG. 4, a schematic diagram of the effect of a character mask provided by an embodiment of the present application.
As shown in FIG. 4, in a text picture (a), the characters are very close to one another. Therefore, if a mask equal to the original character size is regressed directly during character-level text position detection, the situation shown in (b) of FIG. 4 occurs: the masks of different characters stick together, making it difficult to distinguish each character in the text. If a smaller mask is regressed instead, the situation in (c) of FIG. 4 occurs: some characters are not detected, resulting in character loss. Only some exactly right scaling ratio could solve both problems at once, as shown in (d) of FIG. 4; however, that scaling ratio is unknowable, and the optimal ratio differs for every text picture. Based on these problems, this step chooses to regress multiple layers of masks of different sizes: small masks avoid the problem of characters sticking together, while larger masks avoid the problem of missed characters.
In addition, since the training data set contains character-frame position annotations, the position of each character can be determined from its annotation: all pixel values inside the character frame are set to 1 and those outside the frame are set to 0, yielding the mask under the original annotation, denoted G1. As stated above, the output of the network is a multi-layer set of masks of different sizes, so the mask results under the original annotation are also needed at different scales. First a minimum ratio must be set, and then the number of layers k of masks to be regressed is set; once k is determined, the shrink ratio of each mask layer can be calculated, and the result of each mask layer under the original annotation can further be obtained, denoted G2 to Gk. Meanwhile, from the coordinate frame of each character, the width, height, and the sine and cosine of the angle of each character can be calculated; filling this character-level information of each character into the corresponding character positions of G1 yields the corresponding character-level annotations, denoted G h , G w , G sin , and G cos respectively.
It should be noted that the masks at the different levels, and the character-level attributes determined by those masks, can be obtained under the guidance of a loss function. The loss function of the character-level information extraction network of CASR has two parts, the mask part and the character-level attribute part: the mask part is essentially a binary classification problem, while the attribute part is a regression problem, so the overall loss contains two terms, expressed as:
L = L_mask + λ·L_attri
where L is the overall loss of CASR, L_mask is the loss of the mask part of CASR, L_attri is the loss of the character-level attributes, and λ is the coefficient balancing the two terms.
Several forms of L_mask are possible; cross-entropy loss is a conventional loss for classification problems. However, for the smaller masks the character regions become very small, i.e. the number of pixels classified as "positive" is far below the number classified as "negative"; in extreme cases the gap can reach a factor of 10 or even 20. Under such severe class imbalance, cross-entropy loss offers no particular advantage. The embodiment of the present application therefore adopts the Dice coefficient combined with Online Hard Example Mining (OHEM) as the final loss L_mask. The Dice coefficient is expressed as:
D(M_i, G_i) = 2·Σ_{x,y} M_i(x,y)·G_i(x,y) / (Σ_{x,y} M_i(x,y)² + Σ_{x,y} G_i(x,y)²)
where M_i is the i-th output mask layer and G_i is the ground truth of the i-th mask layer under the original annotation.
L_mask is expressed as:
L_mask = 1 − (1/k)·Σ_{i=1}^{k} D(M_i·O, G_i·O)
where k is the number of regressed mask layers and O is the training mask obtained from the OHEM procedure.
L_attri is expressed as:
L_attri = Σ_{x∈{h,w,sin,cos}} SmoothL1(G_1·F_x, G_1·G_x)
where x is the type of regressed character-level attribute, of which there are four (h, w, sin and cos); G_x is the ground truth of attribute x under the original annotation; G_1 is the ground truth of the largest mask under the original annotation; and F_x is the network output for attribute x.
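The combined loss L = L_mask + λ·L_attri described above can be sketched numerically as follows. This is an illustrative NumPy version under stated assumptions: a real implementation would use a deep-learning framework's differentiable operations, the OHEM step is assumed to have already produced the binary selection mask O, and the function names are hypothetical.

```python
import numpy as np

def dice(m, g, eps=1e-6):
    # Dice coefficient between a predicted mask m and ground truth g.
    inter = np.sum(m * g)
    return 2.0 * inter / (np.sum(m * m) + np.sum(g * g) + eps)

def smooth_l1(pred, target):
    # Standard Smooth-L1: quadratic below 1, linear above.
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d * d, d - 0.5).mean()

def total_loss(masks, gts, f_attrs, g_attrs, g1, ohem, lam=1.0):
    """L = L_mask + lam * L_attri: Dice-based mask loss restricted to the
    OHEM-selected pixels, plus Smooth-L1 attribute loss restricted to the
    character pixels of the largest ground-truth mask g1."""
    k = len(masks)
    l_mask = 1.0 - sum(dice(m * ohem, g * ohem) for m, g in zip(masks, gts)) / k
    l_attri = sum(smooth_l1(g1 * f_attrs[x], g1 * g_attrs[x]) for x in f_attrs)
    return l_mask + lam * l_attri
```

For a perfect prediction both terms vanish (up to the numerical epsilon), which is a quick sanity check on the formulation.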
Through this embodiment of the present application, guided by the pre-designed loss function, the position information of the characters themselves can be regressed more accurately; more precise control points can then be obtained from that position information, improving the accuracy and efficiency of text rectification.
Second, character-level information optimization.
To solve the problems of character adhesion and character loss, the output of CASR contains multiple layers of masks of different sizes. Connected components must be sought directly on each mask layer, but a single character is likely to appear on several layers at once; the embodiment of the present application therefore proposes a connected-component selection algorithm (CCSA) to resolve this.
Specifically, connected components are first sought on the smallest mask, and the components found are pushed into a queue. After the current layer is exhausted, the next higher mask layer is searched; each connected component found there is checked against the components already obtained from lower layers: if it overlaps one of them it is discarded, otherwise it is pushed into the queue. This process repeats until the connected components of the largest mask have also been processed, at which point the components in the queue are taken as the target connected components.
It is worth noting that when checking whether two connected components overlap, CCSA does not compare them pixel by pixel; instead it multiplies the two component masks element-wise and tests whether the result is the zero matrix. Compared with pixel-wise comparison, this method is not only simpler but also faster in practice thanks to vectorized matrix operations.
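A minimal sketch of the CCSA selection loop, assuming binary NumPy masks ordered from smallest to largest. The 4-connected BFS labeling and the function names are illustrative choices (a production system might use a library labeling routine), while the overlap test uses the element-wise product described above.

```python
import numpy as np
from collections import deque

def components(mask):
    """4-connected components of a binary mask, each returned as a
    boolean array of the same shape as the mask."""
    seen = np.zeros(mask.shape, dtype=bool)
    comps = []
    h, w = mask.shape
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                comp = np.zeros_like(seen)
                q = deque([(sy, sx)])
                seen[sy, sx] = True
                while q:
                    y, x = q.popleft()
                    comp[y, x] = True
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                comps.append(comp)
    return comps

def ccsa(masks_small_to_large):
    """Keep every component of the smallest mask, then from each larger
    layer keep only components that overlap none already kept; overlap is
    tested with an element-wise product, not pixel-by-pixel comparison."""
    kept = []
    for mask in masks_small_to_large:
        for comp in components(mask):
            if all(not np.any(comp * k) for k in kept):
                kept.append(comp)
    return kept
```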
For the effect of CCSA, see FIG. 5, which is a schematic diagram of the effect of character-level information optimization provided by an embodiment of the present application.
As shown in FIG. 5, (a) is the input image, and (b), (c) and (d) are the results of successively larger mask layers. The first row shows the original masks, the second row the results of character reconstruction and sampling, and the third row the rectified results. In FIG. 5(b), because that mask layer is small, characters are lost: neither "U" nor "Y" is detected, so the final rectification result is mediocre. In FIG. 5(c), "Y" is recovered, and the final rectification result is clearly more horizontal.
After the CCSA algorithm, each connected component is obtained; denoting a component selected by CCSA as SC, the height, width and the sine and cosine of the angle of each character can then be obtained as:
x_j = (1/|SC_j|)·Σ_{p∈SC_j} F_x(p)
where x is the type of regressed character-level attribute, of which there are four (h, w, sin and cos); x_j is the attribute x of the j-th character; SC_j is the connected component of the j-th character; and F_x is the feature map of attribute x.
When obtaining the angle of each character, regressing the sine and cosine of the angle and then normalizing them is more robust than regressing the angle directly.
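The attribute averaging and the sine/cosine normalization above can be sketched as follows; the function names are illustrative, and the connected component is assumed to be a boolean mask the same shape as the feature maps.

```python
import numpy as np

def component_attribute(f_x, sc):
    # Average the attribute feature map f_x over the pixels of the
    # connected component sc (a boolean mask) — the x_j formula above.
    return float(f_x[sc].mean())

def component_angle(f_sin, f_cos, sc):
    """Recover the character angle from the averaged sine/cosine pair,
    normalizing the pair onto the unit circle before applying atan2."""
    s, c = f_sin[sc].mean(), f_cos[sc].mean()
    norm = np.hypot(s, c)
    return float(np.arctan2(s / norm, c / norm))
```

Normalizing first keeps the recovered angle stable even when the raw (sin, cos) outputs drift off the unit circle, which is the robustness advantage the text describes.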
Third, character reconstruction and sampling.
The steps above yield the midpoint of each character together with its character-level attributes; character reconstruction and sampling is described below with reference to FIG. 6a.
As shown in FIG. 6a(a), the points C_1, C_2, C_3, C_4 mark four characters whose widths, heights and angles are denoted w_1…w_4, h_1…h_4 and a_1…a_4 respectively. Taking the character "U" as an example, C_3 and C_2 are first connected, and the segment C_3C_2 is divided in proportion to the widths of the characters "L" and "U" to obtain the point D_1, such that the following condition holds:
|C_2D_1| / |D_1C_3| = w_2 / w_3
Once D_1 is obtained, a segment A_0A_3 with midpoint D_1 and length A_0A_3 = h_3 is constructed from the height and angle of "U". The segment A_1A_2 is constructed to the right in the same way. Connecting A_0A_1 and A_3A_2 then yields the parallelogram A_0A_1A_2A_3, i.e. the reconstructed character box of "U", as shown in FIG. 6a(b).
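The edge construction above can be sketched geometrically as follows. This is an illustration under stated assumptions: the function name `left_edge` is hypothetical, and the mapping of the character angle to the edge direction (sine for the x-component, cosine for the y-component) is an assumed convention, since the text does not fix the reference axis.

```python
import numpy as np

def left_edge(c_prev, c_cur, w_prev, w_cur, h_cur, angle_cur):
    """Reconstruct the left edge A0-A3 of a character box: split the
    segment between the two character centers in proportion to their
    widths, then extend h_cur/2 in both directions along the character
    angle so the division point D is the edge midpoint."""
    c_prev = np.asarray(c_prev, dtype=float)
    c_cur = np.asarray(c_cur, dtype=float)
    t = w_prev / (w_prev + w_cur)          # |C_prev D| : |D C_cur| = w_prev : w_cur
    d = c_prev + t * (c_cur - c_prev)      # division point D on the segment
    u = np.array([np.sin(angle_cur), np.cos(angle_cur)])  # assumed edge direction
    a0 = d - 0.5 * h_cur * u
    a3 = d + 0.5 * h_cur * u
    return a0, a3
```

Repeating the construction on the right side and joining the four points yields the parallelogram box from which control points are then sampled.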
The parallelogram frames in FIG. 6a(c) are the reconstructed coordinate boxes of the characters, and the dots are the sampled control points. FIG. 6a(d) shows the rectified text image and the corresponding control points.
In summary, by detecting the text information contained in a text image, the position information of each character is obtained; from that position information the coordinate box of each character is reconstructed; finer control points are then sampled from the boxes; and the text is rectified according to the sampled control points, making the rectified text more horizontal.
Specifically, the rectification effect can be seen in FIG. 6b, which is a schematic diagram of another text rectification effect provided by an embodiment of the present application. As shown in FIG. 6b, the figure compares rectifying input text images with an existing method and with the method of this embodiment: the first row shows the input images, containing the non-horizontal texts "SIARI", "handmade", "COFFEE", "CHELSEA" and "MANCHESTER"; the second row shows the results of rectifying them with the existing method; and the third row shows the results of the method of this embodiment. It can be seen that, compared with the existing rectification method, the text in the images produced by the method of this embodiment is more horizontal, text rectification is more accurate and efficient, and text recognition based on it is accordingly more accurate and efficient.
The method of the embodiments of the present application has been described in detail above; the apparatus of the embodiments is provided below.
Referring to FIG. 7, FIG. 7 is a schematic structural diagram of a text processing apparatus provided by an embodiment of the present application. The text processing apparatus 70 may include an acquisition unit 701 and a rectification unit 702, described as follows:
the acquisition unit 701 is configured to acquire a first text image, the first text image being an image that includes a first text;
the rectification unit 702 is configured to input the first text image into a text rectification network for text rectification to obtain a second text image that includes a second text; the text rectification network is a network that performs rectification using the position information of each character in the first text, the text content of the second text is the same as that of the first text, and the second text in the second text image is horizontal text.
In this embodiment of the present application, the acquired first text image is input into the text rectification network for rectification. Specifically, the position information of each character is obtained by detecting the first text contained in the first text image; the coordinate box of each character is reconstructed from that position information; finer control points are then sampled from the boxes; and the control points of the characters are used to rectify the text, yielding a second text image that includes the second text. The text content of the second text is the same as that of the first text in the first text image, but the second text is more horizontal and therefore more amenable to text recognition.
Commonly used text rectification methods sample control points at the text level and ignore the information of the characters themselves; the sampling process may therefore distort the text, making the rectified text image difficult to recognize.
Compared with such methods, the method of this embodiment samples control points at the character level, so the text in the rectified image is more horizontal, improving the accuracy and efficiency of text rectification and hence of text recognition.
In a possible implementation, the apparatus further includes:
the acquisition unit 701, further configured to acquire masks at different levels for the characters in the first text; the masks at the different levels differ in size and are used to characterize the position information of the characters;
a construction unit 703, configured to construct a coordinate box of a character according to its position information; the coordinate box is used to determine the control points of the character;
a sampling unit 704, configured to sample the control points of the character according to the coordinate box;
the rectification unit 702, specifically configured to rectify the character according to the control points to obtain the second text image that includes the second text.
This embodiment of the present application provides a possible specific implementation of rectifying a text image. Specifically, masks at different levels are acquired for the characters of the first text; the masks differ in size and characterize the position information of the characters, where smaller masks avoid the problem of character adhesion in the text and larger masks avoid the problem of missing characters. The coordinate box of each character is then constructed from the position information characterized by the masks at the different levels; the box is used to determine the character's control points, and the character is finally rectified according to the control points. By regressing multiple masks of different sizes for each character position when acquiring position information, this embodiment resolves adhesion and omission between characters, improves the accuracy of determining character position information, and in turn improves the accuracy and efficiency of text rectification.
In a possible implementation, the acquisition unit 701 is further configured to search the connected components corresponding to the masks at the different levels to obtain target connected components; the target connected components are used to optimize the position information of the characters.
This embodiment of the present application provides a possible specific implementation of optimizing the position information of characters. Specifically, after the masks at different levels are acquired for the characters of the first text, and before the coordinate boxes are constructed from the position information, the connected components corresponding to the masks at the different levels are searched to obtain the target connected components; these can optimize the masks by discarding inaccurate ones, thereby optimizing the position information of the characters. This further improves the accuracy of determining character position information and hence the accuracy and efficiency of text rectification.
In a possible implementation, the acquisition unit 701 is specifically configured to search the connected components corresponding to a first mask to obtain first connected components, and to search the connected components corresponding to a second mask to obtain second connected components; the second connected components do not overlap the first connected components, and the second mask is larger than the first mask;
the acquisition unit 701 is further specifically configured to take the first connected components and the second connected components as the target connected components.
This embodiment of the present application provides a possible specific implementation of determining the target connected components. Specifically, starting from the smallest mask, the connected components of the current layer are found and pushed into a queue. After the current layer is exhausted, the next higher mask layer is searched; each of its connected components is checked for overlap with the components from the lower layers, being discarded if it overlaps and enqueued otherwise. This process repeats until the connected components of the largest mask have been processed, at which point the components in the queue are taken as the target connected components. The target connected components so obtained can optimize the masks at the different levels by discarding inaccurate ones, thereby optimizing the character position information, further improving the accuracy of determining character position information, and hence improving the accuracy and efficiency of text rectification.
In a possible implementation, the acquisition unit 701 is further specifically configured to acquire the masks at the different levels for the characters in the first text according to a loss function, the loss function characterizing the accuracy of the character masks and of the character position information.
Through this embodiment of the present application, guided by the pre-designed loss function, the position information of the characters themselves can be regressed more accurately; more precise control points can then be obtained from that position information, improving the accuracy and efficiency of text rectification.
In a possible implementation, the apparatus further includes:
a recognition unit 705, configured to input the second text image into a text recognition network for recognition to obtain the second text.
This embodiment of the present application provides a possible specific implementation of text recognition: the second text image obtained by the above rectification is input into a text recognition network, which recognizes the second text. Performing text recognition on the rectified text image improves the accuracy and efficiency of text recognition.
According to the embodiments of the present application, the units of the apparatus shown in FIG. 7 may be merged, separately or wholly, into one or several additional units, or one (or more) of them may be further split into multiple functionally smaller units; this achieves the same operations without affecting the realization of the technical effects of the embodiments of the present application. The above units are divided on the basis of logical function; in practice, the function of one unit may be realized by multiple units, or the functions of multiple units by one unit. In other embodiments of the present application, the network-based device may likewise include other units, and in practice these functions may be realized with the assistance of other units and jointly by multiple units.
It should be noted that the implementation of each unit may also refer to the corresponding description of the method embodiment shown in FIG. 3 above.
In the text processing apparatus 70 described in FIG. 7, for an input text image, the position information of each character is obtained by detecting the text information it contains; the coordinate box of each character is reconstructed from that position information; and finer control points are then sampled from the boxes, making the rectified text more horizontal and improving the accuracy and efficiency of text rectification, and hence of text recognition.
Referring to FIG. 8, FIG. 8 is a schematic structural diagram of an electronic device 80 provided by an embodiment of the present application. The electronic device 80 may include a memory 801 and a processor 802, and optionally further a communication interface 803 and a bus 804, with the memory 801, the processor 802 and the communication interface 803 communicating with one another through the bus 804. The communication interface 803 is used for data exchange with the text processing apparatus 70 described above.
The memory 801 provides storage space, which may store data such as an operating system and computer programs. The memory 801 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM).
The processor 802 is a module that performs arithmetic and logical operations, and may be one or a combination of processing modules such as a central processing unit (CPU), a graphics processing unit (GPU) or a microprocessor unit (MPU).
The memory 801 stores a computer program, and the processor 802 calls the computer program stored in the memory 801 to execute the text processing method shown in FIG. 3 above:
acquiring a first text image, the first text image being an image that includes a first text;
inputting the first text image into a text rectification network for text rectification to obtain a second text image that includes a second text, wherein the text rectification network is a network that performs rectification using the position information of each character in the first text, the text content of the second text is the same as that of the first text, and the second text in the second text image is horizontal text.
For the details of the method executed by the processor 802, refer to FIG. 3 above; they are not repeated here.
Correspondingly, the processor 802 may also call the computer program stored in the memory 801 to execute the method steps performed by the units of the text processing apparatus 70 shown in FIG. 7; for the details, refer to FIG. 7 above, not repeated here.
In the electronic device 80 described in FIG. 8, for an input text image, the position information of each character is obtained by detecting the text information it contains; the coordinate box of each character is reconstructed from that position information; and finer control points are then sampled from the boxes, making the rectified text more horizontal and improving the accuracy and efficiency of text rectification, and hence of text recognition.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program runs on one or more processors, the method shown in FIG. 3 above can be implemented.
An embodiment of the present application further provides a computer program product comprising a computer program; when the computer program product runs on a processor, the method shown in FIG. 3 above can be implemented.
An embodiment of the present application further provides a chip comprising a processor configured to execute instructions; when the processor executes the instructions, the method shown in FIG. 3 above can be implemented. Optionally, the chip further comprises a communication interface for inputting or outputting signals.
An embodiment of the present application further provides a system comprising at least one of the text processing apparatus 70, the electronic device 80 or the chip described above.
In summary, for an input text image, the position information of each character is obtained by detecting the text information it contains; the coordinate box of each character is reconstructed from that position information; and finer control points are then sampled from the boxes, making the rectified text more horizontal and improving the accuracy and efficiency of text rectification, and hence of text recognition.
A person of ordinary skill in the art will understand that all or part of the processes of the above method embodiments may be completed by hardware related to a computer program; the computer program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The aforementioned storage medium includes various media capable of storing computer program code, such as read-only memory (ROM), random access memory (RAM), magnetic disks or optical discs.

Claims (10)

  1. A text processing method, characterized by comprising:
    acquiring a first text image; the first text image being an image that includes a first text;
    inputting the first text image into a text rectification network for text rectification to obtain a second text image that includes a second text; the text rectification network being a network that performs rectification using position information of each character in the first text, the text content of the second text being the same as the text content of the first text, and the second text in the second text image being horizontal text.
  2. The method according to claim 1, characterized in that the step of inputting the first text image into a text rectification network for text rectification to obtain a second text image that includes a second text comprises:
    acquiring masks at different levels for the characters in the first text; the masks at the different levels differing in size and being used to characterize the position information of the characters;
    constructing a coordinate box of a character according to the position information of the character; the coordinate box being used to determine control points of the character;
    rectifying the character according to the control points to obtain the second text image that includes the second text.
  3. The method according to claim 2, characterized in that, before constructing the coordinate box of the character according to the position information of the character, the method further comprises:
    searching connected components corresponding to the masks at the different levels to obtain target connected components; the target connected components being used to optimize the position information of the characters.
  4. The method according to claim 3, characterized in that searching the connected components corresponding to the masks at the different levels to obtain the target connected components comprises:
    searching connected components corresponding to a first mask to obtain first connected components; and searching connected components corresponding to a second mask to obtain second connected components; the second connected components not overlapping the first connected components, and the second mask being larger than the first mask;
    taking the first connected components and the second connected components as the target connected components.
  5. The method according to any one of claims 2 to 4, characterized in that acquiring the masks at the different levels for the characters in the first text comprises:
    acquiring the masks at the different levels for the characters in the first text according to a loss function, the loss function characterizing the accuracy of the character masks and of the character position information.
  6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:
    inputting the second text image into a text recognition network for recognition to obtain the second text.
  7. A text processing apparatus, characterized by comprising:
    an acquisition unit, configured to acquire a first text image; the first text image being an image that includes a first text;
    a rectification unit, configured to input the first text image into a text rectification network for text rectification to obtain a second text image that includes a second text; the text rectification network being a network that performs rectification using position information of each character in the first text, the text content of the second text being the same as the text content of the first text, and the second text in the second text image being horizontal text.
  8. A text processing apparatus, characterized by comprising: a processor and a memory;
    the memory being configured to store computer-executable instructions;
    the processor being configured to execute the computer-executable instructions stored in the memory, so that the text processing apparatus performs the method according to any one of claims 1 to 6.
  9. A computer-readable storage medium, characterized by comprising:
    the computer-readable storage medium being configured to store instructions or a computer program; when the instructions or the computer program are executed, the method according to any one of claims 1 to 6 is implemented.
  10. A computer program product, characterized by comprising: instructions or a computer program;
    when the instructions or the computer program are executed, the method according to any one of claims 1 to 6 is implemented.
PCT/CN2021/137584 2021-06-30 2021-12-13 Text processing method and related apparatus WO2023273188A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110738496.2 2021-06-30
CN202110738496.2A CN113627242B (zh) 2021-06-30 2021-06-30 Text processing method and related apparatus

Publications (1)

Publication Number Publication Date
WO2023273188A1 true WO2023273188A1 (zh) 2023-01-05

Family

ID=78378806

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/137584 WO2023273188A1 (zh) 2021-06-30 2021-12-13 一种文本处理方法及相关装置

Country Status (2)

Country Link
CN (1) CN113627242B (zh)
WO (1) WO2023273188A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627242B (zh) * 2021-06-30 2022-09-27 中国科学院深圳先进技术研究院 一种文本处理方法及相关装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120134588A1 (en) * 2010-11-29 2012-05-31 Microsoft Corporation Rectification of characters and text as transform invariant low-rank textures
CN110956171A (zh) * 2019-11-06 2020-04-03 广州供电局有限公司 铭牌自动识别方法、装置、计算机设备和存储介质
CN111695554A (zh) * 2020-06-09 2020-09-22 广东小天才科技有限公司 一种文本矫正的方法、装置、电子设备和存储介质
CN111832371A (zh) * 2019-04-23 2020-10-27 珠海金山办公软件有限公司 文本图片矫正方法、装置、电子设备及机器可读存储介质
CN113627242A (zh) * 2021-06-30 2021-11-09 中国科学院深圳先进技术研究院 一种文本处理方法及相关装置


Also Published As

Publication number Publication date
CN113627242B (zh) 2022-09-27
CN113627242A (zh) 2021-11-09


Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE