CN113780229A - Text recognition method and device

Info

Publication number
CN113780229A
CN113780229A
Authority
CN
China
Prior art keywords
text
target
character
recognized
characters
Legal status
Pending
Application number
CN202111101994.2A
Other languages
Chinese (zh)
Inventor
徐支勇
李长亮
Current Assignee
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Beijing Kingsoft Digital Entertainment Co Ltd
Application filed by Beijing Kingsoft Digital Entertainment Co Ltd
Priority to CN202111101994.2A
Publication of CN113780229A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The application provides a text recognition method and a text recognition device. The text recognition method includes: acquiring a text to be recognized; inputting the text to be recognized into a recognition module for processing to obtain target characters in the text to be recognized and text boxes corresponding to the target characters; and establishing a position relationship between the target characters and the text boxes, and generating a target text corresponding to the text to be recognized according to the position relationship.

Description

Text recognition method and device
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a text recognition method. The application also relates to a text recognition device, a computing device, and a computer-readable storage medium.
Background
With the development of Internet technology, text recognition has become an indispensable capability in most business scenarios; for example, photo-based question searching, file entry, digitization of paper documents, and document format conversion all involve text recognition technology. The accuracy of text recognition is particularly important in each of these scenarios. In the prior art, text recognition requirements in document format conversion scenarios are mostly met with OCR technology; however, as the demand for recognition accuracy increases, OCR alone struggles to meet the accuracy requirements of most scenarios, so an effective solution to the above problem is needed.
Disclosure of Invention
In view of this, embodiments of the present application provide a text recognition method to solve technical defects in the prior art. The embodiment of the application also provides a text recognition device, a computing device and a computer readable storage medium.
According to a first aspect of embodiments of the present application, there is provided a text recognition method, including:
acquiring a text to be recognized;
inputting the text to be recognized into a recognition module for processing to obtain target characters in the text to be recognized and text boxes corresponding to the target characters;
and establishing a position relation between the target character and the text box, and generating a target text corresponding to the text to be recognized according to the position relation.
Optionally, the inputting the text to be recognized into a recognition module for processing to obtain a target character in the text to be recognized includes:
inputting the text to be recognized into the recognition module, and processing the text to be recognized through a character recognition unit in the recognition module to obtain an initial character and a character coordinate corresponding to the initial character;
calculating the coordinate similarity between the character coordinates, and screening target character coordinates according to the calculation result;
and screening the target characters from the initial characters based on the target character coordinates, and outputting the target characters through the recognition module.
Optionally, the inputting the text to be recognized into a recognition module for processing to obtain a text box corresponding to the target character includes:
inputting the text to be recognized into the recognition module, and processing the text to be recognized through a text processing unit in the recognition module to obtain a text picture and size information corresponding to the text picture;
detecting text constituent elements contained in the text picture, and creating a text box corresponding to the text constituent elements based on the size information;
and taking the text box corresponding to the text composition element as the text box corresponding to the target character, and outputting the text box through the identification module.
Optionally, the text component element comprises at least one of: header, footer, text line;
correspondingly, the creating a text box corresponding to the text component element based on the size information includes:
determining header coordinates corresponding to the headers, footer coordinates corresponding to the footers, and text line coordinates corresponding to the text lines based on the size information;
creating a header text box according to the header coordinates, creating a footer text box according to the footer coordinates, and creating a text line text box according to the text line coordinates;
and taking the header text box, the footer text box and the text line text box as text boxes corresponding to the text composition elements.
Optionally, before the step of establishing the position relationship between the target character and the text box is executed, the method further includes:
determining character coordinate information corresponding to the target character and text box coordinate information corresponding to the text box;
correspondingly, the establishing of the position relationship between the target character and the text box includes:
and establishing the position relation between the target character and the text box based on the character coordinate information and the text box coordinate information.
Optionally, the generating a target text corresponding to the text to be recognized according to the position relationship includes:
sequencing the target characters in the text box according to the position relation and the character coordinate information to obtain a character text box containing the target characters;
and sequencing the character text boxes according to the coordinate information of the text boxes, and acquiring the target text corresponding to the text to be recognized according to a sequencing result.
Optionally, the generating a target text corresponding to the text to be recognized according to the position relationship includes:
detecting whether remaining characters exist among the target characters according to the position relationship;
and if not, generating the target text corresponding to the text to be recognized according to the position relationship.
Optionally, if it is detected according to the position relationship that remaining characters exist among the target characters, the following steps are executed:
extracting the remaining characters from the target characters, and determining position information corresponding to the remaining characters;
clustering the remaining characters based on height information in the position information, and sorting the clustered remaining characters based on width information in the position information;
obtaining a supplementary text composed of the remaining characters according to the sorting result, and generating an intermediate text according to the position relationship;
and integrating the supplementary text and the intermediate text to obtain the target text corresponding to the text to be recognized.
Optionally, after the step of generating the target text corresponding to the text to be recognized according to the position relationship is executed, the method further includes:
and under the condition that the target text is detected to contain overlapped characters, adjusting the word spacing in the target text.
According to a second aspect of embodiments of the present application, there is provided a text recognition apparatus including:
the acquisition module is configured to acquire a text to be recognized;
the processing module is configured to input the text to be recognized into the recognition module for processing, and obtain a target character in the text to be recognized and a text box corresponding to the target character;
and the generating module is configured to establish a position relation between the target character and the text box and generate a target text corresponding to the text to be recognized according to the position relation.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is for storing computer-executable instructions that when executed by the processor implement the steps of the text recognition method.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the text recognition method.
According to a fifth aspect of embodiments of the present application, there is provided a chip storing a computer program which, when executed by the chip, implements the steps of the text recognition method.
After the text to be recognized is obtained, in order to accurately recognize the characters contained in the text to be recognized in its current format, the text to be recognized may be input to a recognition module for processing, so that the target characters in the text to be recognized and the text boxes corresponding to the target characters are obtained from the module's recognition result. At this point, the position relationship between the target characters and the text boxes can be established, so that the target characters contained in each text box are determined according to the position relationship, and the target text corresponding to the text to be recognized is in turn determined according to the position relationship. That is, after the target characters are recognized, the target characters contained in the text to be recognized can be located through the text boxes, and the recognized target text is obtained by mapping them back to the text to be recognized, which effectively guarantees the recognition accuracy of the text to be recognized and reduces recognition errors caused by incorrect character ordering.
Drawings
FIG. 1 is a schematic diagram of a text recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart of a text recognition method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of another text recognition method according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating a text recognition method applied to a house rental contract scenario according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a target text provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of another target text provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present application;
FIG. 8 is a block diagram of a computing device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the present application; therefore, the present application is not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application.
First, the terms involved in one or more embodiments of the present application are explained.
OCR (Optical Character Recognition): the process by which an electronic device (e.g., a scanner or a digital camera) examines characters printed on paper and translates their shapes into computer text using a character recognition method; that is, the process of scanning text material, analyzing and processing the resulting image file, and obtaining the characters and layout information. The main indicators for measuring the performance of an OCR system are the rejection rate, the false recognition rate, the recognition speed, user interface friendliness, product stability, usability, feasibility and the like.
PDF (Portable Document Format): a file format developed by Adobe Systems for exchanging files in a manner independent of application programs, operating systems, and hardware. A PDF file is based on the PostScript language imaging model and guarantees accurate colors and printing effects regardless of the printer; that is, PDF faithfully reproduces every character, color, and image of the original.
DBNet: a text detection network based on a semantic segmentation algorithm. It receives an input picture and obtains a feature map F after feature extraction, upsampling, fusion and concatenation operations; the feature map F is then used to predict a probability map P and a threshold map T, and finally a binary map B is computed from the probability map P and the threshold map T and output.
pdfplumber: an open-source Python PDF parsing package that can extract information such as characters and tables.
NMS (Non-Maximum Suppression): suppresses elements that are not maxima and is used in target detection; that is, detection boxes with high confidence are kept while spurious detection boxes with lower confidence are suppressed. Generally, when a model outputs target boxes, the number of boxes is very large (the exact number is determined by the number of anchors), and many duplicate boxes locate the same target; NMS is used to remove these duplicate boxes and obtain the true target box.
YOLO (You Only Look Once): a target detection algorithm that detects objects using features learned by a deep convolutional neural network. YOLO v3, the third version of YOLO, is composed primarily of 75 convolutional layers, which are effective for extracting object features; because it does not use fully connected layers, the network can handle inputs of any size.
Faster R-CNN: a fully differentiable model. The input is expressed as a tensor (multidimensional array) of height x width x depth and is processed by a pre-trained CNN to obtain a convolutional feature map (conv feature map); that is, the CNN is used as a feature extractor whose output is sent to the next stage. An RPN (Region Proposal Network) then processes the extracted convolutional feature map to find a predefined number of regions (bounding boxes) that may contain objects, and finally target detection is completed by the R-CNN module.
In the present application, a text recognition method is provided. The present application relates to a text recognition apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
In practical applications, text recognition is needed in many scenarios; for example, converting PDF to WORD format requires text recognition, and extracting text from a PDF file for information entry also requires text recognition. In general, most PDF text extraction methods employ OCR technology, i.e., the PDF is converted into pictures and detection and recognition are performed with OCR; alternatively, PDF text information is extracted with a toolkit such as pdfplumber. OCR suffers from inaccurate character recognition and a high error rate on special characters; a text information extraction tool cannot remove headers and footers during recognition; and when bold text is rendered by overprinting the same character several times, recognition easily produces multiple copies of the same character. There is therefore a need for an effective solution to the above problems.
Referring to the text recognition schematic diagram shown in FIG. 1, after the text to be recognized is obtained, in order to accurately recognize the characters contained in the text to be recognized in its current format, the text to be recognized may be input to a recognition module for processing, so that the target characters in the text to be recognized and the text boxes corresponding to the target characters are obtained from the module's recognition result. The position relationship between the target characters and the text boxes is then established, so that the target characters contained in each text box are determined according to the position relationship, and the target text corresponding to the text to be recognized is in turn determined according to the position relationship. That is, after the target characters are recognized, the target characters contained in the text to be recognized can be located through the text boxes, and the recognized target text is obtained by mapping them back to the text to be recognized, which effectively guarantees the recognition accuracy of the text to be recognized and reduces recognition errors caused by incorrect character ordering.
Fig. 2 shows a flowchart of a text recognition method according to an embodiment of the present application, which specifically includes the following steps:
step S202, a text to be recognized is obtained.
Specifically, the text to be recognized refers to a text whose word units, picture content, tables and/or formulas need to be recognized and whose content needs to be extracted. It should be noted that the text to be recognized may be obtained from a client connected to the server; that is, after a user uploads the text to be recognized through a client connected to the server, the text recognition processing operation is started and the recognition result is returned to the client held by the user. The client held by the user includes, but is not limited to, a mobile phone, a computer, a tablet computer and other intelligent devices, and this embodiment is not limited here; accordingly, the text to be recognized includes, but is not limited to, the PDF format, the doc format, the docx format and the like, and this embodiment is not limited here either.
In this embodiment, the text to be recognized in PDF format (a parsable PDF, so that the text content can subsequently be parsed and recognized) is taken as an example; the recognition process for texts to be recognized in other formats may refer to the same or similar description in this embodiment and is not detailed here.
Step S204, inputting the text to be recognized into a recognition module for processing, and obtaining a target character in the text to be recognized and a text box corresponding to the target character.
Specifically, on the basis of obtaining the text to be recognized, in order to accurately recognize the content included in the text to be recognized and to avoid characters being recognized repeatedly or missed, the text to be recognized may be input to the recognition module for processing; that is, the characters in the text to be recognized are recognized, text boxes are constructed for the content to be recognized in the text to be recognized, and after recognition is completed, the recognition result is output through an output layer of the recognition module, so that the target characters included in the text to be recognized and the text boxes corresponding to the target characters are obtained from the recognition result.
The recognition module is an integrated module combining a text processing unit and a character recognition unit. When the text to be recognized is recognized, the text processing unit and the character recognition unit may perform their recognition processing operations simultaneously or sequentially; to increase the processing rate, it is preferable that the two units perform recognition in parallel, and this embodiment is not limited here. The text processing unit may be implemented based on DBNet, YOLO v3 or Faster R-CNN, and the character recognition unit may be implemented based on pdfplumber or PDFMiner; in a specific implementation, the choice may be made according to actual requirements, and this embodiment is not limited here. Correspondingly, a target character refers to a word unit contained in the text to be recognized, where one word unit represents one Chinese character, one digit or one letter. A text box is a rectangular box that frames a component of the text to be recognized, and the target characters can be located through the text boxes.
That is, when the position of each target character is located, the text box to which the target character belongs is determined first, and the target characters are then sorted within that text box according to their coordinates. It should be noted that an xy coordinate system may be established with the upper-left corner of the text to be recognized as the origin, and the character coordinates of each character are determined in this coordinate system; that is, the height information is the y-axis coordinate of the character and the width information is the x-axis coordinate of the character. On this basis, if, for example, a certain text box corresponds to one line of target characters in the text to be recognized, then after the position relationship between the target characters and the text box is determined, the target characters can be sorted by their x-axis coordinates alone, without considering the y-axis coordinate; this effectively improves the character sorting efficiency, guarantees the accuracy of the subsequently recognized target text, and prevents the character ordering from becoming disordered.
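As an illustration of this coordinate convention, the minimal Python sketch below sorts the characters of a single-line text box by their x-axis coordinate alone, with the origin at the upper-left corner of the text to be recognized; the character records and field names are hypothetical placeholders rather than part of the application.

```python
# Minimal sketch: ordering the characters of a single-line text box by x-coordinate.
# The origin is assumed to be the upper-left corner of the text to be recognized,
# so x grows to the right (width) and y grows downward (height).
line_chars = [
    {"char": "识", "x": 36.0, "y": 120.5},
    {"char": "文", "x": 12.0, "y": 120.5},
    {"char": "本", "x": 24.0, "y": 120.5},
]

# For a single-line box the y-coordinate can be ignored; sort by x only.
ordered = sorted(line_chars, key=lambda c: c["x"])
line_text = "".join(c["char"] for c in ordered)
print(line_text)  # "文本识"
```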
In this process, in order to ensure that the recognition module can accurately complete the recognition processing operation, the text processing unit and the character recognition unit need to be sufficiently trained in a preparation stage. The training process of the text processing unit is as follows: after a sample to be recognized is obtained, text box labels are added to the sample in units of lines, paragraphs or languages, sample pairs are formed from the labeled samples, the initial text processing unit is trained with them, and the text processing unit that satisfies the training stop condition is added to the recognition module. When labeling a sample in units of lines, only the x-direction coordinates of a single line of characters need to be considered; in units of paragraphs, the y-direction coordinate must be located in addition to the x-direction coordinate, so that text boxes in units of paragraphs can be recognized; adding text boxes in units of languages means that consecutive characters belonging to the same language in the text are labeled together, so that the trained model can output text boxes with the language as the unit.
In practical application, different recognition scenarios may use the text processing unit trained in different manners to perform subsequent recognition processing operations, and the specific recognition manner may be selected according to the practical application scenario, which is not limited herein.
Based on this, after the text to be recognized is obtained, it can be input to the recognition module for processing: the character recognition unit in the recognition module recognizes the target characters, and the text processing unit recognizes the text boxes, so that the target text can conveniently be determined by subsequently combining the text boxes and the target characters, and the recognition efficiency is improved while the recognition accuracy is ensured.
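Where the two units are run in parallel as described above, one possible arrangement is sketched below; the helper names extract_text_boxes and extract_and_dedup are hypothetical placeholders standing in for the text processing unit and the character recognition unit, not functions defined by the application.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize(pdf_path, extract_text_boxes, extract_and_dedup):
    """Run the text processing unit and the character recognition unit in parallel."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Text processing unit: convert the PDF into pictures and detect text boxes.
        boxes_future = pool.submit(extract_text_boxes, pdf_path)
        # Character recognition unit: extract characters and their coordinates.
        chars_future = pool.submit(extract_and_dedup, pdf_path)
        return boxes_future.result(), chars_future.result()
```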
Further, in the process of recognizing the target characters, each character corresponds to a unique position in the text to be recognized. To avoid position overlap caused by low recognition accuracy (characters spaced too closely and overlapping), which would affect the determination of the target characters, the target characters may be determined with the help of coordinates during recognition. In this embodiment, the specific implementation is as follows:
step S2042, inputting the text to be recognized into a recognition module, and processing the text through a character recognition unit in the recognition module to obtain an initial character and a character coordinate corresponding to the initial character;
step S2044, calculating coordinate similarity between character coordinates, and screening target character coordinates according to a calculation result;
and step S2046, screening out target characters from the initial characters based on the target character coordinates, and outputting the target characters through a recognition module.
Specifically, the character recognition unit is capable of recognizing the word units contained in the text to be recognized and determining the coordinate information of each character. Correspondingly, the initial characters are the recognized character units, and the character coordinates are the coordinate information corresponding to each initial character, representing the position of the character in the text to be recognized. The coordinate similarity is a measure of the degree of coincidence between character coordinates: the higher the coordinate similarity, the greater the probability that characters overlap, i.e., the more likely it is that two characters were recognized at the same position during recognition; conversely, the smaller the coordinate similarity, the smaller the overlap probability and the less likely it is that two characters were recognized at the same position. The coordinate similarity can be computed with the IoU (Intersection over Union) algorithm, whose formula is IoU = area(A ∩ B) / area(A ∪ B), where A and B respectively represent the boxes corresponding to two adjacent characters. When the IoU value is higher than a preset threshold, the characters are considered coincident; otherwise they are not. The preset threshold may be set to 0.9 or 0.85, for example, and this embodiment is not limited here. The target character coordinates are the coordinates determined after repeated character coordinates are removed; they sufficiently reflect the position of each character, the target characters in the text to be recognized can be determined on this basis, and errors caused by inaccurate recognition are reduced through this coordinate screening.
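A minimal sketch of this IoU computation is shown below, assuming each character is located by an axis-aligned box (x_min, y_min, x_max, y_max); the function name and the example threshold are illustrative only.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two character detections at almost the same position count as duplicates
# when their IoU exceeds a preset threshold such as 0.9 or 0.85.
print(iou((10, 10, 20, 30), (10.5, 10, 20, 30)) > 0.9)  # True -> duplicate
```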
For the character coordinates of the initial characters, an xy coordinate system is established with the upper-left corner of the text to be recognized as the origin, and the character coordinates of each initial character are determined in this coordinate system; that is, the height information is the y-axis coordinate of the character and the width information is the x-axis coordinate of the character.
Meanwhile, when determining the coordinates of each initial character in the coordinate system, in order to ensure the accuracy of the located character coordinates, one character may be located by four coordinates: a rectangular box is constructed around the initial character, each vertex of the rectangle serves as a coordinate point, and the character is located by these points, i.e., the coordinates of the four vertices are the character coordinates of that initial character. Alternatively, one character may be located by a single coordinate, i.e., the coordinate of the center point of the initial character is taken as its character coordinate. In practical applications, the way of determining the character coordinates of the initial characters may be selected according to the actual application scenario, and this embodiment is not limited here.
Based on this, after the text to be recognized is obtained, it may be input to the recognition module, and the initial characters contained in the text to be recognized and the character coordinates corresponding to each initial character are determined by the character recognition unit in the recognition module. To improve recognition accuracy, the target characters may then be screened based on the character coordinates: the coordinate similarity between the character coordinates of the initial characters is calculated, character coordinates whose similarity is greater than a preset similarity threshold are selected for deduplication, i.e., the initial characters corresponding to character coordinates whose similarity exceeds the preset similarity threshold are removed, the target characters are screened from the initial characters according to the target character coordinates determined after removing the repeated character coordinates, and the target characters are output by the recognition module for subsequently creating the target text corresponding to the text to be recognized, which ensures recognition accuracy.
In practical applications, the process of calculating the coordinate similarity and screening out the target characters may use the NMS method to achieve deduplication, so that the screened target characters are more accurate, a foundation is laid for the subsequent generation of the target text, and errors caused by recognition accuracy problems are avoided. Alternatively, the deduplication operation may be realized based on a preset deduplication rule: the center point of each initial character is determined, the distance between adjacent center points is calculated, the initial characters corresponding to adjacent center points are considered to overlap when the distance is smaller than a preset distance, and the overlapping initial characters are removed to obtain the target characters. Correspondingly, the recognition of the characters can be completed with pdfplumber, PDFMiner or the like, so that the characters and the coordinate information in the PDF text to be recognized can be accurately extracted to assist in completing the recognition of the target characters. In a specific implementation, other processing manners may also be selected for character recognition and deduplication according to actual requirements, and this embodiment is not limited in any way.
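The following sketch illustrates one way the character recognition unit could be realized with pdfplumber, reusing the iou helper sketched above; the record layout, the 0.9 threshold and the simple quadratic deduplication loop are illustrative assumptions rather than the application's exact procedure.

```python
import pdfplumber  # open-source PDF parsing package, used here as the character recognition unit

def extract_and_dedup(pdf_path, iou_threshold=0.9):
    """Extract initial characters with coordinates, then drop near-duplicate detections."""
    targets = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages):
            # pdfplumber character objects carry the glyph and its bounding box.
            chars = [
                {"char": c["text"],
                 "box": (c["x0"], c["top"], c["x1"], c["bottom"]),
                 "page": page_no}
                for c in page.chars
            ]
            kept = []
            for c in chars:
                # A detection whose box almost coincides with one already kept
                # (e.g. bold produced by overprinting the same glyph) is discarded.
                if all(iou(c["box"], k["box"]) <= iou_threshold for k in kept):
                    kept.append(c)
            targets.extend(kept)
    return targets
```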
In conclusion, by screening the target characters on the basis of the character coordinates, errors caused by inaccurate recognition can be effectively reduced, the recognition accuracy of the target characters is improved, the target characters can be closer to word units contained in the text to be recognized, and accordingly the target text generated subsequently is effectively guaranteed to be more accurate.
Furthermore, in the process of recognizing the text boxes corresponding to the target characters, in order to accurately locate each target character through the text boxes and thus ensure the recognition accuracy of the target text, the text boxes may be recognized with the text processing unit. In this embodiment, the specific implementation is as follows:
step S2142, the text to be recognized is input into the recognition module, and the text image and the size information corresponding to the text image are obtained by processing through a text processing unit in the recognition module.
Specifically, the text processing unit is a unit capable of converting the text to be recognized in PDF format into a picture format and of recognizing the size information of the text picture. The text picture refers to the picture obtained after converting the text to be recognized into a picture format, and correspondingly, the size information refers to the length and width information of the text picture. In the process of processing the text to be recognized, the text processing unit converts the text to be recognized into a common picture format, such as jpg or png, for subsequent use.
Based on this, after the text to be recognized is obtained, it can be input into the recognition module and processed by the text processing unit in the recognition module, which converts the text to be recognized into a picture format to obtain the text picture and, at the same time, determines the size information of the text picture, so that the text boxes corresponding to the target characters can subsequently be determined by combining the text picture and the size information, assisting the later recognition of the target text.
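As a sketch of this conversion step, the snippet below rasterizes each page and records its size; it assumes the pdf2image package (which requires the poppler utilities) is used, and the file naming and dpi are arbitrary illustrative choices.

```python
from pdf2image import convert_from_path  # rasterizes PDF pages (requires the poppler utilities)

def pdf_to_pictures(pdf_path, out_prefix="page", dpi=200):
    """Convert each page of the text to be recognized into a picture and record its size."""
    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images, one per page
    records = []
    for i, img in enumerate(pages):
        width, height = img.size           # size information of the text picture
        path = f"{out_prefix}_{i}.png"     # a common picture format such as png
        img.save(path)
        records.append({"picture": path, "width": width, "height": height})
    return records
```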
Step S2144, detecting text constituent elements contained in the text picture, and creating a text box corresponding to the text constituent elements based on the size information.
Specifically, after the text image and the size information corresponding to the text image are obtained, in order to accurately determine the text box corresponding to the target character, at this time, text component elements of the text to be recognized may be detected in the text image, and a text box corresponding to each text component element is created based on the size information. The text composition elements specifically refer to basic elements forming the text to be recognized, and include but are not limited to headers, footers, text lines and the like; correspondingly, the text box corresponding to the text component element specifically refers to a rectangular box capable of framing a header, a footer or a text line, and after the text box corresponding to each text component element is determined, the target character can be positioned to assist in completing the recognition processing of the target text.
It should be noted that the text box serves as the basis for subsequently composing the target text; therefore, when creating a text box based on the size information, either of the following two ways can be used. In the first way, the text box is created with the edges of the word units as its boundary, i.e., the created text box fits closely around the word units inside it. In the second way, the text box is created at a set distance from the edges of the word units, i.e., the word units are kept a set distance away from the box border, and this distance is chosen so that the box does not affect, i.e., does not overlap with, other text boxes.
In practical applications, the detection of text constituent elements in the text picture can be realized with algorithms such as DBNet, PSENet or PANet; that is, the text picture is input into a trained model, and the text constituent elements of the text to be recognized, namely headers, footers and text lines, are obtained from the model's recognition result, which facilitates the subsequent recognition of the target text.
Based on this, in order to accurately detect the text constituent elements, during training a large number of PDF files are converted into pictures, the headers, footers and text lines in each picture are labeled, a large number of sample pairs are obtained from the labeling results, and the DBNet model is then trained with these sample pairs, so that a DBNet model meeting the usage requirement is obtained from the training result for the recognition of text constituent elements. That is to say, the text picture can be recognized by the DBNet model, which outputs the headers, footers and text lines corresponding to the text to be recognized.
Further, after the text component elements are detected, each text component element corresponds to a part of the target characters; therefore, the target text can be recognized simply by assigning each target character to its corresponding text box. Determining the text box corresponding to each text component element thus assists the subsequent locating of the target characters. In this embodiment, the specific implementation is as follows:
step S21442, header coordinates corresponding to the headers, footer coordinates corresponding to the footers, and text line coordinates corresponding to the text lines are determined based on the size information;
step S21444, creating a header text box according to the header coordinates, a footer text box according to the footer coordinates, and a text line text box according to the text line coordinates;
step S21446, uses the header text box, the footer text box and the text line text box as the text boxes corresponding to the text component elements.
Specifically, the header coordinates refer to the coordinate information corresponding to the header region in the text to be recognized; the footer coordinates refer to the coordinate information corresponding to the footer region; and the text line coordinates refer to the coordinate information corresponding to the body text region. Correspondingly, the header text box is a rectangular box framing the header region, the footer text box is a rectangular box framing the footer region, and the text line text box is a rectangular box framing the body text region. It should be noted that the text line text box may be a single rectangular box framing all the content of the body text region; or a plurality of rectangular boxes framing that content by paragraph; or a plurality of rectangular boxes framing that content by word unit; or a plurality of rectangular boxes framing that content by line; or a plurality of rectangular boxes framing that content by language. In a specific implementation, the form may be selected as required, and this embodiment is not limited in any way.
In practical applications, the components of a document usually consist of text lines, headers and/or footers, so only each region needs to be determined and the screened target characters added to each region in order to generate the target text corresponding to the text to be recognized.
Based on the above, when the text component elements include headers, footers and text lines, after the text component elements in the text to be recognized are detected, the header coordinates corresponding to the headers, the footer coordinates corresponding to the footers and the text line coordinates corresponding to the text lines can be located according to the size information of the text picture. In this process, a reference coordinate positioning template matching the text picture can be selected based on the size information of the text picture; the four vertex coordinates corresponding to the header are then read from the template as the header coordinates, the four vertex coordinates corresponding to the footer as the footer coordinates, and the four vertex coordinates corresponding to the text lines as the text line coordinates. The coordinates corresponding to each text component element are connected to create its text box: a header text box corresponding to the header region is created from the header coordinates, a footer text box corresponding to the footer region is created from the footer coordinates, and a text line text box corresponding to the text line region is created from the text line coordinates. The header text box, the footer text box and the text line text box are taken as the text boxes corresponding to the text component elements, to be used subsequently for locating each target character and recognizing the target text.
The reference coordinate positioning template is a preset template matching the type of the text to be recognized and contains the coordinates corresponding to each text constituent element. When the text component elements in the text to be recognized are detected, the coordinates corresponding to the text component elements are determined through the reference coordinate positioning template, to be used for subsequently creating the text boxes.
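A minimal sketch of such a template lookup is given below; the template contents, the page-size key and the region coordinates are hypothetical placeholders, not values from the application.

```python
# Hypothetical reference coordinate positioning templates, keyed by text picture size.
# Each entry stores a (x_min, y_min, x_max, y_max) region for the header, footer and
# text line areas; the numbers are placeholders only.
TEMPLATES = {
    (1240, 1754): {  # e.g. an A4 page rendered at about 150 dpi
        "header":    (100, 40, 1140, 110),
        "footer":    (100, 1660, 1140, 1720),
        "text_line": (100, 130, 1140, 1640),
    },
}

def create_text_boxes(size_info):
    """Select the template matching the text picture size and build the element text boxes."""
    template = TEMPLATES[size_info]  # reference coordinate positioning template matched by size
    return {element: {"box": region} for element, region in template.items()}

print(create_text_boxes((1240, 1754))["header"]["box"])  # (100, 40, 1140, 110)
```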
In summary, by creating the text boxes of the text component elements with the region as the unit, it can be ensured that the text boxes do not overlap; at the same time, this assists the subsequent locating of the target characters and ensures the accuracy of the recognized target text.
Step S2146, the text box corresponding to the text component element is taken as the text box corresponding to the target character and is output through the recognition module.
Specifically, after the text boxes corresponding to the text component elements are obtained, they are taken as the text boxes corresponding to the target characters, so that each target character can subsequently be located according to the text boxes, guaranteeing the accuracy of the recognized target text. On this basis, since each target character corresponds to unique coordinate information, each target character can be added to its text box according to the containment relationship between the text box and the coordinates, and the target characters in the text boxes can then be sorted to obtain the target text.
For example, after a paper in PDF format is acquired, the paper is input to the recognition module, and the PDF paper is converted into pictures by the text processing unit (such as an OCR detection unit) in the recognition module, with the page size information of the paper, i.e., the length and width of each page, recorded from the conversion result; the trained DBNet model is then used to detect the headers, footers and text lines in the paper, the coordinate information corresponding to the headers, footers and text lines is recorded at the same time, and the header text boxes, footer text boxes and text line text boxes are determined by combining this coordinate information.
Furthermore, once the text box corresponding to each component of the paper has been determined, the characters contained in the paper are extracted with the character recognition unit in the recognition module, i.e., each character is extracted with pdfplumber and the character coordinates corresponding to each character are determined. Then the similarity between the character coordinates is calculated with the NMS method, character coordinates whose similarity is greater than the preset similarity threshold are selected for deduplication, the remaining character coordinates are taken as the target character coordinates, and the target characters are screened from the recognized characters in combination with the target character coordinates, serving as the recognition result of all word units in the paper for subsequently composing the target text corresponding to the paper.
In addition, when the text to be recognized contains pictures, tables and/or formulas, in order to ensure that these pictures, tables and/or formulas all appear in the target text after recognition and to avoid content loss, they can be handled by picture recognition during the recognition processing; that is, the content contained in the pictures, tables and/or formulas is recognized in picture form by OCR, to be used for subsequently creating the target text. In this embodiment, the specific implementation is as follows:
when the content to be recognized in the text to be recognized includes picture content, table content and/or formula content, the content to be recognized can be framed and recognized by the text processing unit in the recognition module to obtain the coordinate information of the content to be recognized, so that the positions of the pictures, tables and/or formulas can conveniently be determined from this coordinate information later, achieving the purpose of composing the target text.
It should be noted that, when the text to be recognized includes pictures, words, tables and/or formulas, the above recognition methods may be integrated to respectively recognize each type of content to be recognized, so as to ensure recognition accuracy and achieve a higher matching degree between the target text subsequently created and the text to be recognized.
In addition, when the text to be recognized including the table is subjected to recognition processing, the text processing unit is required to recognize the structure frame of the table, the character recognition unit is used for recognizing the content in the structure frame, and the character recognition unit and the structure frame are combined to be used as the recognition content of the table, so that the target text can be created conveniently in the following process.
That is to say, under the condition that the content to be recognized contains picture content, table content and/or formula content, the picture, the table and/or the formula are/is used as a recognition unit for recognition, so that the influence caused by excessive character content contained in the picture, the table and/or the formula can be avoided, the recognition efficiency can be improved, the text with richer content can be recognized, the coverage range of the recognition type is improved, and the recognition effect is ensured.
According to the above example, in the case that the text in the PDF format contains the picture, the table and the formula, the OCR detection unit in the recognition module may recognize the length and width information of the picture, the table and the formula, and determine the coordinate information of the picture, the table and the formula in combination with the length and width information, so as to facilitate the subsequent composition of the target text, and add the coordinate information to the recognition result to form the target text containing the characters, the picture, the table and the formula.
Step S206, establishing the position relation between the target characters and the text box, and generating a target text corresponding to the text to be recognized according to the position relation.
Specifically, after the target characters and the text boxes corresponding to the target characters are obtained, in order to ensure that the error between the recognized target text and the text to be recognized is small, the target characters may be located based on the text boxes, so that the target characters are sorted reasonably and accurately within each text box and the target text is generated from the sorting result. The target text refers to the recognition result obtained after the text to be recognized has been processed; correspondingly, the position relationship refers to the relationship of a text box containing target characters.
Based on the above, after the target characters are obtained, the position relationship between the target characters and the text boxes can be established by the back-end server, i.e., the target characters contained in each text box are determined. When determining which target characters a text box contains, the region enclosed by the text box in the coordinate system may first be determined, and the target characters whose character coordinates lie within that region are then selected to establish the position relationship, thereby determining the target characters contained in the text box. In this process, the degree of overlap between each target character and the text box can also be calculated with the IoU, and target characters whose overlap is greater than a preset threshold are selected to establish the position relationship with the text box, so that the target characters contained in the text box are determined accurately; the target characters in the text box are then sorted, and the target text corresponding to the text to be recognized is generated from the sorting result, ensuring recognition accuracy.
Further, since each text box corresponds to a text component element in the text to be recognized, the text box has a unique position in the text to be recognized, and each target character also has a unique position in the text to be recognized; therefore, the position relationship between the target characters and the text boxes can be determined from the coordinate information of the target characters and of the text boxes. In this embodiment, the specific implementation is as follows:
step S2062, determining character coordinate information corresponding to the target character and text box coordinate information corresponding to the text box;
step S2064, the position relationship between the target character and the text box is established based on the character coordinate information and the text box coordinate information.
Specifically, the character coordinate information specifically refers to a coordinate corresponding to each target character; the text box coordinate information specifically refers to coordinates corresponding to each text box.
Based on this, in order to accurately determine the position relationship between the text boxes and the target characters and thereby improve the accuracy of the subsequently generated target text, the character coordinate information corresponding to the target characters and the text box coordinate information corresponding to the text boxes may be determined first, and the position relationship between the target characters and the text boxes is then established based on the character coordinate information and the text box coordinate information. That is, the target characters located inside a text box are selected to establish the position relationship, so that the text box in which each target character is located is determined, and the target characters contained in the text box are then sorted according to the character coordinate information, so that the target text can be generated.
In summary, the position relation between the text box and the target characters is established by combining the coordinate information, and each target character can be accurately positioned, so that the identification accuracy is effectively improved, and the efficiency of generating the target text is improved.
Furthermore, after determining the position relationship between the target character and the text box based on the coordinate information, the generation of the target text can be completed according to the character coordinate information, and in this embodiment, the specific implementation manner is as follows:
step S2162, according to the position relation and the character coordinate information, the target characters are sequenced in the text box, and a character text box containing the target characters is obtained;
and S2164, sorting the character text boxes according to the coordinate information of the text boxes, and obtaining a target text corresponding to the text to be recognized according to a sorting result.
Specifically, the character text box refers to a rectangular box obtained after the target character is added to the corresponding text box, and the target character is contained in the rectangular box.
Based on this, after the position relationship between the target character and the text box is obtained, the target characters contained in the text box can be sequenced in the text box based on the position relationship and the character coordinate information, so as to obtain the character text box containing the target character according to the sequencing result, and then the character text boxes containing the target character are sequenced according to the text box coordinate information, so as to obtain the target text corresponding to the text to be recognized according to the sequencing result.
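The following sketch illustrates this two-level ordering under the assumption that both text boxes and character boxes are given as (x_min, y_min, x_max, y_max) tuples; the containment test and the data layout are illustrative, not the application's exact procedure.

```python
def box_contains(text_box, char_box):
    """True when the character box lies inside the text box (coordinate containment)."""
    tx0, ty0, tx1, ty1 = text_box
    cx0, cy0, cx1, cy1 = char_box
    return tx0 <= cx0 and ty0 <= cy0 and cx1 <= tx1 and cy1 <= ty1

def assemble_target_text(text_boxes, target_chars):
    """Attach each target character to its text box, sort within boxes, then sort the boxes."""
    char_text_boxes = []
    for tb in text_boxes:  # tb: {"box": (x_min, y_min, x_max, y_max)}
        members = [c for c in target_chars if box_contains(tb["box"], c["box"])]
        # Within a box, order characters top-to-bottom and then left-to-right.
        members.sort(key=lambda c: (c["box"][1], c["box"][0]))
        char_text_boxes.append({"box": tb["box"],
                                "text": "".join(c["char"] for c in members)})
    # Order the character text boxes themselves by their own coordinates.
    char_text_boxes.sort(key=lambda b: (b["box"][1], b["box"][0]))
    return "\n".join(b["text"] for b in char_text_boxes)
```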
Following the above example, after the header text box, the footer text box and the text line text box are obtained, the character coordinates corresponding to each character and the text box coordinates corresponding to each text box can be determined, and the position relationship between each text box and each character is then established according to whether the character coordinates are contained within the text box coordinates. For instance, if the region corresponding to the header text box is bounded by the coordinates (0, 0), (5, 0), (5, 5) and (0, 5), and the character coordinates are (1, 1), (1, 2), (6, 3), ..., then the characters at (1, 1) and (1, 2) fall inside the header text box and are associated with it, whereas the character at (6, 3) does not; this continues until all characters have been processed. According to the establishment result, it is determined that the header text box comprises characters Z1-Z10, the footer text box comprises characters Z51-Z55, and the text line text box comprises characters Z11-Z50.
Further, at this time, the characters in each text box for which a position relationship exists may be sorted based on the position relationship and the coordinates of each character. That is, when the selected region of the identified header text box, footer text box or text line text box corresponds to a single line of text content, the arrangement order of the characters in that text box can be determined by positioning in the x-axis direction alone, according to the x-axis coordinate of each character. Otherwise, an initial positioning is carried out in the x-axis direction according to the x-axis coordinate of each character, a secondary positioning is then carried out in the y-axis direction according to the y-axis coordinate of each character, and the sorting result of the characters in the text box is finally determined from the two positioning results.
Furthermore, the characters Z1 to Z10 are sorted in the header text box according to their coordinates, the characters Z51 to Z55 are sorted in the footer text box according to their coordinates, and the characters Z11 to Z50 are sorted in the text line text box according to their coordinates; the target text corresponding to the paper in PDF format can then be obtained from the sorting result, thereby completing the recognition of the paper in PDF format.
In summary, sorting the target characters in the text box in a coordinate-driven manner ensures the precision of the character ordering and thus the accuracy of recognizing the text to be recognized.
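A minimal sketch of the two sorting steps (S2162 and S2164) is given below, reusing the assumed character and box dictionaries from the previous sketch; the single_line flag and the function names are likewise illustrative assumptions rather than a definitive implementation.

```python
def sort_box_chars(box, single_line=True):
    """Order the characters inside one character text box (sketch of step S2162).

    For a box that corresponds to a single line of text, sorting by the x-axis
    coordinate alone suffices; otherwise the characters are sorted first by the
    y-axis coordinate (line position) and then by the x-axis coordinate.
    """
    key = (lambda c: c["x0"]) if single_line else (lambda c: (c["top"], c["x0"]))
    box["chars"].sort(key=key)
    return "".join(c["text"] for c in box["chars"])


def assemble_target_text(boxes):
    """Order the character text boxes themselves by their coordinates
    (sketch of step S2164) and join their contents top to bottom."""
    ordered = sorted(boxes, key=lambda b: (b["top"], b["x0"]))
    return "\n".join(
        sort_box_chars(b, single_line=b.get("single_line", True)) for b in ordered
    )
```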
In addition, when the characters included in the text boxes are sorted based on the position relationship, it should be considered that the target characters are obtained after screening, so there may be characters that have not been sorted into any text box; discarding these characters may affect the recognition accuracy. Therefore, when such unsorted characters exist, they may be processed in a clustering manner, and in this embodiment, the specific implementation manner is as follows:
detecting whether residual characters exist in the target characters or not according to the position relation between the target characters and the text box;
and if not, generating a target text corresponding to the text to be recognized according to the position relation.
If yes, extracting residual characters from the target characters, and determining position information corresponding to the residual characters; clustering the residual characters based on the height information in the position information, and sequencing the residual characters after clustering based on the width information in the position information; obtaining a supplementary text consisting of the residual characters according to the sequencing result, and generating an intermediate text according to the position relation between the target character and the text box; and integrating the supplementary text and the intermediate text to obtain a target text corresponding to the text to be recognized.
Specifically, the residual characters refer to those target characters that have not yet been associated with any text box; correspondingly, the supplementary text refers to the text composed of the residual characters, and the intermediate text refers to the text created based on the position relationship between the text boxes and the target characters.
Based on this, after the position relationship between the target characters and the text boxes is established, it can be detected whether residual characters exist among the target characters; the specific detection manner is to detect whether there are target characters for which no position relationship with any text box has been established. If no such characters exist, all characters have been associated with their corresponding text boxes, and the target text is generated directly. If such characters do exist, part of the characters have not been associated with a text box; in order to ensure the comprehensiveness of recognition, the position information of the residual characters can be determined, the residual characters are then clustered according to the height information in the position information so that residual characters with consistent height information are grouped together, and the clustered residual characters are sorted according to the width information in the position information. During sorting, if the text boxes are divided in units of lines, the residual characters with the same height information may be sorted along the x-axis direction; if the text boxes are not divided in units of lines, the residual characters with the same height information may be sorted initially along the x-axis direction and then sorted a second time along the y-axis direction according to the height information. The supplementary text composed of the residual characters can thus be obtained according to the sorting result.
Further, after the supplementary text is obtained, an intermediate text is generated according to the position relation between the text box and the target character, and finally the supplementary text and the intermediate text are integrated to obtain the target text corresponding to the text to be recognized.
In addition, the detection of residual characters may also be performed after an initial target text has been created: an initial target text is first created according to the position relationship between the target characters and the text boxes, it is then detected whether there are residual characters that have not been added to the initial target text, and, when residual characters exist, a supplementary text is generated according to the processing operation described above and added to the initial target text, so that the target text corresponding to the text to be recognized is obtained. It should be noted that the process of creating the supplementary text may refer to the corresponding description above and is not detailed here.
In an embodiment, assuming that a paper is recognized to obtain a header text box and a footer text box, and that the target characters further include residual characters, the residual characters can be clustered according to their height information and the clustered residual characters sorted based on their width information; during the width-based sorting, the residual characters are connected according to their x-axis coordinates, so that a supplementary text is obtained from the sorting result. By combining the height information of the characters in the supplementary text, it is determined that the supplementary text corresponds to the lowest part of the paper, i.e. the footer content of the paper; the footer content is then integrated with the header content and the body content to obtain the target text corresponding to the paper in PDF format.
In conclusion, the comprehensiveness of recognition can be effectively guaranteed by detecting the residual characters, so that the accuracy of recognizing the text to be recognized is improved, and the content in the text to be recognized can be fully reflected through the target text.
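The clustering of residual characters by height information and their ordering by width information might be sketched as follows; the height_tolerance bucketing is an assumed, simplified way of grouping characters whose height information is consistent, introduced only for this illustration.

```python
from collections import defaultdict


def build_supplementary_text(residual_chars, height_tolerance=1.0):
    """Cluster residual characters by height (y position) and sort each cluster
    by width (x position) -- an illustrative sketch of the step described above."""
    clusters = defaultdict(list)
    for ch in residual_chars:
        # Characters whose height information agrees (within the tolerance)
        # are grouped into the same cluster, i.e. the same text line.
        clusters[round(ch["top"] / height_tolerance)].append(ch)

    lines = []
    for key in sorted(clusters):                                # top to bottom
        line = sorted(clusters[key], key=lambda c: c["x0"])     # left to right
        lines.append("".join(c["text"] for c in line))
    return "\n".join(lines)
```

The resulting supplementary text can then be integrated with the intermediate text generated from the position relationship, as described above.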
In addition, in order to prevent overlapping characters in the recognized target text from affecting downstream service processing, overlapping-character detection may be performed on the target text after it is generated, and in this embodiment, the specific implementation manner is as follows:
step S2262, when detecting that the target text includes the overlapped characters, adjusting a word space in the target text.
Specifically, overlapping characters are target characters that, after being sorted in the text box, overlap one another and thus impair readability. Based on this, whether the target text contains overlapping characters can be detected by examining the word spacing: it is judged whether the word spacing in the target text is smaller than a preset threshold, and if so, target characters in the current target text overlap one another. To facilitate use by downstream services, the word spacing of the characters in the target text can then be adjusted so that a certain distance exists between adjacent characters, ensuring that the characters in the target text are clearly distinguishable and more convenient for downstream services to use.
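A sketch of the word-spacing adjustment in step S2262 is shown below; min_gap is an assumed preset threshold, and the function simply shifts the following characters to the right whenever two neighbouring characters are closer than that threshold.

```python
def fix_overlapping_chars(sorted_chars, min_gap=0.1):
    """Detect overlapping (or overly tight) neighbouring characters in a sorted
    line and push the following characters right so that every pair of adjacent
    characters keeps at least min_gap of spacing (illustrative sketch only)."""
    shift = 0.0
    for prev, cur in zip(sorted_chars, sorted_chars[1:]):
        cur["x0"] += shift                      # carry over the accumulated shift
        cur["x1"] += shift
        gap = cur["x0"] - prev["x1"]
        if gap < min_gap:                       # word spacing below the threshold
            delta = min_gap - gap
            cur["x0"] += delta
            cur["x1"] += delta
            shift += delta
    return sorted_chars
```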
In practical application, when the downstream service uses the target text, information can be extracted from the editable target text according to requirements, and the information is used for generating a review text, an edit text or a typeset text which are fed back to a user, so that the downstream service is convenient for the user to use.
Referring to the schematic diagram shown in fig. 3, after a user submits a text to be recognized in PDF format through a front end (i.e., an interface visible to the user, such as a website front-end interface or an APP user interface), the front end parses the text to be recognized to obtain a PDF file stream and sends the PDF file stream to a back-end server. After receiving the text to be recognized, the back-end server, in order to accurately recognize the characters contained in the text to be recognized in its current format, inputs the PDF file stream corresponding to the text to be recognized into the recognition module, where it is processed by the text box recognition unit and the character recognition unit so as to output the target characters in the text to be recognized and the text boxes corresponding to the target characters according to the recognition results. At this point, the position relationship between the target characters and the text boxes can be established, so that which target characters each text box contains is determined according to the position relationship, and the target text corresponding to the text to be recognized is determined accordingly. Finally, in response to the processing operation of a downstream service, information is extracted from the file, a text meeting the user's usage requirements is created, and the text is fed back to the user. After the target characters are recognized, the target characters contained in the text to be recognized are located through the text boxes, and the recognized target text is obtained by mapping the text to be recognized, so that the recognition accuracy of the text to be recognized is effectively guaranteed and recognition errors caused by wrong character ordering are reduced.
In the following, with reference to fig. 4, the text recognition method provided by the present application is further described by taking its application in a house rental contract scenario as an example. Fig. 4 shows a processing flow chart of a text recognition method applied to a house rental contract scenario provided in an embodiment of the present application, which specifically includes the following steps:
step S402, obtaining the text to be recognized in PDF format.
In practical application, text recognition is required in a plurality of scenes, such as converting PDF to WORD format, or extracting text from a PDF file for information entry. In general, most PDF text extraction methods either adopt OCR recognition technology, that is, the PDF is converted into a picture and then detection and recognition are carried out by means of OCR, or extract the PDF text information by using tool packages such as pdfplumber. OCR recognition suffers from inaccurate character recognition, with a particularly high error rate on special characters; the text information extraction tools cannot strip headers and footers before recognition, and when bold text is rendered by overlapping duplicate characters, many similar characters are easily recognized repeatedly. There is therefore a need for an effective solution to the above problems.
In this embodiment, the text to be recognized is a house rental contract, which contains {header - contract number: 123456789; body: house rental contract, Party A: A, Party B: B, signing time: May 15, 2021}; the contract is in PDF format.
Step S404, inputting the text to be recognized in the PDF format to a recognition module for processing, and obtaining the target characters in the text to be recognized and the text boxes corresponding to the target characters.
After the house rental contract in PDF format is obtained, it can be input into the recognition module. A text recognition unit in the recognition module, for example an OCR detection unit, first converts the contract in PDF format into a picture and records the page size information of the contract, i.e. the length and width of each page, according to the conversion result. A trained DBNet model (i.e. the text processing unit) is then used to detect the header, footer and text lines in the contract, and the corresponding coordinate information is recorded according to the detection result. At this point, the DBNet model outputs a header text box corresponding to the header, a footer text box corresponding to the footer and a text line text box corresponding to each text line, each text box being associated with its coordinate information, i.e. the position and the length and width of the text box, which facilitates subsequent text recognition. That is, it is determined that the header "contract number 123456789" corresponds to the header text box, whose position information is S1, and that the body "house rental contract, Party A: A, Party B: B, signing time: May 15, 2021" corresponds to the text line text box, whose position information is S2.
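A hedged sketch of this detection step is given below; pdfplumber is used only to rasterize the pages and read the page size, while the detect(image) interface of the trained DBNet model is an assumed wrapper introduced for illustration, since this application does not prescribe a particular model API.

```python
import pdfplumber


def detect_text_boxes(pdf_path, dbnet_model, resolution=150):
    """Convert each PDF page into a picture, record its size, and run a trained
    DBNet-style detector to obtain header / footer / text line boxes (sketch)."""
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            image = page.to_image(resolution=resolution).original  # PIL image
            page_size = (page.width, page.height)                  # length/width info
            boxes = dbnet_model.detect(image)   # assumed wrapper; returns dicts with
            for box in boxes:                   # "label", "x0", "x1", "top", "bottom"
                box["chars"] = []               # filled later when matching characters
            pages.append({"size": page_size, "boxes": boxes})
    return pages
```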
Further, once the text boxes corresponding to the text to be recognized have been determined, each character can be extracted by a tool extraction unit in the recognition module; optionally, the text information and coordinate information of each character are extracted with pdfplumber, the coordinate information being the position of each character. The characters are then de-duplicated by an NMS method, so that all character units in the text to be recognized are obtained for subsequently creating the target text.
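The character extraction and NMS-style de-duplication could be sketched as follows; the IoU-based similarity measure and the overlap_threshold value stand in for the coordinate similarity described above and are assumptions made for the example.

```python
import pdfplumber


def _iou(a, b):
    """Intersection-over-union of two character boxes, used here as the
    coordinate-similarity measure."""
    ix = max(0.0, min(a["x1"], b["x1"]) - max(a["x0"], b["x0"]))
    iy = max(0.0, min(a["bottom"], b["bottom"]) - max(a["top"], b["top"]))
    inter = ix * iy
    union = ((a["x1"] - a["x0"]) * (a["bottom"] - a["top"])
             + (b["x1"] - b["x0"]) * (b["bottom"] - b["top"]) - inter)
    return inter / union if union > 0 else 0.0


def extract_unique_chars(pdf_path, overlap_threshold=0.8):
    """Extract every character and its coordinates with pdfplumber, then drop
    near-duplicates whose text and coordinates almost coincide (as happens when
    bold glyphs are produced by overprinting the same character)."""
    chars = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages):
            kept = []  # de-duplicate within a page; coordinates are page-relative
            for ch in page.chars:
                duplicate = any(
                    ch["text"] == other["text"] and _iou(ch, other) >= overlap_threshold
                    for other in kept
                )
                if not duplicate:
                    kept.append(dict(ch, page=page_no))
            chars.extend(kept)
    return chars
```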
In step S406, a text box matching the target character is determined.
Specifically, after the header text box with position information S1, the text line text box with position information S2 and the target characters have been determined, the target characters can be matched against the header text box and the text line text box to determine the character units corresponding to each text box, i.e., the character units making up "contract number 123456789" correspond to the header text box, and the character units making up "house rental contract, Party A: A, Party B: B, signing time: May 15, 2021" correspond to the text line text box.
Step S408, sorting the target characters according to the text boxes.
After the text box matching each character has been determined, in order to subsequently generate a text in the required target format, the characters can be sorted within their text boxes according to the coordinate information corresponding to each character, the text box constraining the character sorting result so that the converted text matches the text in PDF format. The character units making up "contract number 123456789" are sorted in the header text box according to the coordinate information of each character, and a first text is obtained from the sorting result; likewise, the character units making up "house rental contract, Party A: A, Party B: B, signing time: May 15, 2021" are sorted in the text line text box according to the coordinate information of each character, and a second text is obtained from the sorting result.
And step S410, generating a target text according to the sequencing result.
Specifically, after sorting each character into each text box, the target text shown in fig. 5 can be obtained according to the sorting result, and optionally, the target text is in an editable WORD format.
In addition, when matching text boxes and characters, individual characters may fail to match any text box; for example, the text to be recognized may include a footer for which no corresponding text box was matched. In this case, in order to recognize the footer accurately, the unmatched characters can be clustered according to their height information, i.e., the remaining unmatched target characters are grouped by height; the clustered characters are then sorted according to their width information, i.e., the characters within a single line of text are placed in order according to width, and the lines are ordered according to the height-direction coordinate (height information), so that each line of text is supplemented at its corresponding height position. In this embodiment, a target text including a header, a footer and a body as shown in fig. 6 is thereby obtained.
According to the text recognition method provided by the application, after the text to be recognized is obtained, in order to accurately recognize the characters contained in the text to be recognized in its current format, the text to be recognized can be input into the recognition module for processing, so that the target characters in the text to be recognized output by the module and the text boxes corresponding to the target characters are obtained according to the recognition results. At this point, the position relationship between the target characters and the text boxes can be established, so that which target characters each text box contains is determined according to the position relationship, and the target text corresponding to the text to be recognized is determined accordingly. After the target characters are recognized, the target characters contained in the text to be recognized are located through the text boxes, and the recognized target text is obtained by mapping the text to be recognized, so that the recognition accuracy of the text to be recognized is effectively guaranteed and recognition errors caused by wrong character ordering are reduced.
Corresponding to the above method embodiment, the present application further provides a text recognition apparatus embodiment, and fig. 7 shows a schematic structural diagram of a text recognition apparatus provided in an embodiment of the present application. As shown in fig. 7, the apparatus includes:
an obtaining module 702 configured to obtain a text to be recognized;
the processing module 704 is configured to input the text to be recognized into a recognition module for processing, so as to obtain a target character in the text to be recognized and a text box corresponding to the target character;
the generating module 706 is configured to establish a position relationship between the target character and the text box, and generate a target text corresponding to the text to be recognized according to the position relationship.
In an optional embodiment, the processing module 704 is further configured to:
inputting the text to be recognized into the recognition module, and processing the text to be recognized through a character recognition unit in the recognition module to obtain an initial character and a character coordinate corresponding to the initial character; calculating the coordinate similarity between the character coordinates, and screening target character coordinates according to the calculation result; and screening the target characters from the initial characters based on the target character coordinates, and outputting the target characters through the recognition module.
In an optional embodiment, the processing module 704 is further configured to:
inputting the text to be recognized into the recognition module, and processing the text to be recognized through a text processing unit in the recognition module to obtain a text picture and size information corresponding to the text picture; detecting text constituent elements contained in the text picture, and creating a text box corresponding to the text constituent elements based on the size information; and taking the text box corresponding to the text composition element as the text box corresponding to the target character, and outputting the text box through the identification module.
In an alternative embodiment, the text component element comprises at least one of: header, footer, text line; accordingly, the processing module 704 is further configured to:
determining header coordinates corresponding to the headers, footer coordinates corresponding to the footers, and text line coordinates corresponding to the text lines based on the size information; creating a header text box according to the header coordinates, creating a footer text box according to the footer coordinates, and creating a text line text box according to the text line coordinates; and taking the header text box, the footer text box and the text line text box as text boxes corresponding to the text composition elements.
In an optional embodiment, the text recognition apparatus further includes:
a determining module configured to determine character coordinate information corresponding to the target character and text box coordinate information corresponding to the text box;
accordingly, the generating module 706 is further configured to:
and establishing the position relation between the target character and the text box based on the character coordinate information and the text box coordinate information.
In an optional embodiment, the generating module 706 is further configured to:
sequencing the target characters in the text box according to the position relation and the character coordinate information to obtain a character text box containing the target characters; and sequencing the character text boxes according to the coordinate information of the text boxes, and acquiring the target text corresponding to the text to be recognized according to a sequencing result.
In an optional embodiment, the generating module 706 is further configured to:
detecting whether residual characters exist in the target characters or not according to the position relation; and if not, generating the target text corresponding to the text to be recognized according to the position relation.
In an optional embodiment, the generating module 706 is further configured to:
extracting the residual characters from the target characters, and determining position information corresponding to the residual characters; clustering the residual characters based on height information in the position information, and sequencing the clustered residual characters based on width information in the position information; obtaining a supplementary text composed of the residual characters according to the sequencing result, and generating an intermediate text according to the position relation; and integrating the supplementary text and the intermediate text to obtain the target text corresponding to the text to be recognized.
In an optional embodiment, the text recognition apparatus further includes:
the adjusting module is configured to adjust the word space in the target text under the condition that the target text is detected to contain the overlapped characters.
After the text to be recognized is obtained, in order to accurately recognize the characters contained in the text to be recognized in its current format, the text to be recognized may be input into the recognition module for processing, so as to obtain the target characters in the text to be recognized output by the module and the text boxes corresponding to the target characters according to the recognition result. At this point, a position relationship between the target characters and the text boxes may be established, so that which target characters each text box contains is determined according to the position relationship, and the target text corresponding to the text to be recognized is determined accordingly. After the target characters are recognized, the target characters contained in the text to be recognized are located through the text boxes, and the recognized target text is obtained by mapping the text to be recognized, so that the recognition accuracy of the text to be recognized is effectively guaranteed and recognition errors caused by wrong character ordering are reduced.
The above is a schematic scheme of a text recognition apparatus of the present embodiment. It should be noted that the technical solution of the text recognition apparatus and the technical solution of the text recognition method belong to the same concept, and details that are not described in detail in the technical solution of the text recognition apparatus can be referred to the description of the technical solution of the text recognition method. Further, the components in the apparatus embodiment should be understood as the functional modules that must be created to implement the steps of the program flow or of the method; they are not necessarily divided or defined as separate physical entities. Device claims defined by such a set of functional modules are to be understood as a functional-module framework for implementing the solution mainly by means of the computer program described in the specification, rather than as a physical device implementing the solution mainly by means of hardware.
Fig. 8 illustrates a block diagram of a computing device 800 provided according to an embodiment of the present application. The components of the computing device 800 include, but are not limited to, memory 810 and a processor 820. The processor 820 is coupled to the memory 810 via a bus 830, and the database 850 is used to store data.
Computing device 800 also includes access device 840, access device 840 enabling computing device 800 to communicate via one or more networks 860. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 840 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)) whether wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the application, the above-described components of the computing device 800 and other components not shown in fig. 8 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 8 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 800 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 800 may also be a mobile or stationary server.
Wherein, the processor 820 is configured to execute the following computer-executable instructions:
acquiring a text to be identified;
inputting the text to be recognized into a recognition module for processing to obtain target characters in the text to be recognized and text boxes corresponding to the target characters;
and establishing a position relation between the target character and the text box, and generating a target text corresponding to the text to be recognized according to the position relation.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the text recognition method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the text recognition method.
An embodiment of the present application further provides a computer-readable storage medium storing computer instructions which, when executed by a processor, perform the following steps:
acquiring a text to be identified;
inputting the text to be recognized into a recognition module for processing to obtain target characters in the text to be recognized and text boxes corresponding to the target characters;
and establishing a position relation between the target character and the text box, and generating a target text corresponding to the text to be recognized according to the position relation.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the text recognition method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the text recognition method.
The present embodiment discloses a chip, which stores a computer program that, when executed by the chip, implements the steps of the text recognition method.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code which may be in the form of source code, object code, an executable file or some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (12)

1. A text recognition method, comprising:
acquiring a text to be identified;
inputting the text to be recognized into a recognition module for processing to obtain target characters in the text to be recognized and text boxes corresponding to the target characters;
and establishing a position relation between the target character and the text box, and generating a target text corresponding to the text to be recognized according to the position relation.
2. The text recognition method of claim 1, wherein the inputting the text to be recognized into a recognition module for processing to obtain a target character in the text to be recognized comprises:
inputting the text to be recognized into the recognition module, and processing the text to be recognized through a character recognition unit in the recognition module to obtain an initial character and a character coordinate corresponding to the initial character;
calculating the coordinate similarity between the character coordinates, and screening target character coordinates according to the calculation result;
and screening the target characters from the initial characters based on the target character coordinates, and outputting the target characters through the recognition module.
3. The text recognition method of claim 1, wherein the inputting the text to be recognized into a recognition module for processing to obtain the text box corresponding to the target character comprises:
inputting the text to be recognized into the recognition module, and processing the text to be recognized through a text processing unit in the recognition module to obtain a text picture and size information corresponding to the text picture;
detecting text constituent elements contained in the text picture, and creating a text box corresponding to the text constituent elements based on the size information;
and taking the text box corresponding to the text composition element as the text box corresponding to the target character, and outputting the text box through the identification module.
4. The text recognition method of claim 3, wherein the text component element comprises at least one of: header, footer, text line;
correspondingly, the creating a text box corresponding to the text component element based on the size information includes:
determining header coordinates corresponding to the headers, footer coordinates corresponding to the footers, and text line coordinates corresponding to the text lines based on the size information;
creating a header text box according to the header coordinates, creating a footer text box according to the footer coordinates, and creating a text line text box according to the text line coordinates;
and taking the header text box, the footer text box and the text line text box as text boxes corresponding to the text composition elements.
5. The text recognition method according to any one of claims 1 to 4, wherein before the step of establishing the positional relationship between the target character and the text box is executed, the method further comprises:
determining character coordinate information corresponding to the target character and text box coordinate information corresponding to the text box;
correspondingly, the establishing of the position relationship between the target character and the text box includes:
and establishing the position relation between the target character and the text box based on the character coordinate information and the text box coordinate information.
6. The text recognition method according to claim 5, wherein the generating of the target text corresponding to the text to be recognized according to the position relationship comprises:
sequencing the target characters in the text box according to the position relation and the character coordinate information to obtain a character text box containing the target characters;
and sequencing the character text boxes according to the coordinate information of the text boxes, and acquiring the target text corresponding to the text to be recognized according to a sequencing result.
7. The text recognition method according to any one of claims 1 to 4, wherein the generating of the target text corresponding to the text to be recognized according to the position relationship includes:
detecting whether residual characters exist in the target characters or not according to the position relation;
and if not, generating the target text corresponding to the text to be recognized according to the position relation.
8. The text recognition method according to claim 7, wherein if the detection result of detecting whether there are any remaining characters in the target characters according to the positional relationship is yes, the following steps are performed:
extracting the residual characters from the target characters, and determining position information corresponding to the residual characters;
clustering the residual characters based on height information in the position information, and sequencing the clustered residual characters based on width information in the position information;
obtaining a supplementary text composed of the residual characters according to the sequencing result, and generating an intermediate text according to the position relation;
and integrating the supplementary text and the intermediate text to obtain the target text corresponding to the text to be recognized.
9. The text recognition method according to claim 1, wherein after the step of generating the target text corresponding to the text to be recognized according to the position relationship is executed, the method further comprises:
and under the condition that the target text is detected to contain overlapped characters, adjusting the word spacing in the target text.
10. A text recognition apparatus, comprising:
the acquisition module is configured to acquire a text to be recognized;
the processing module is configured to input the text to be recognized into the recognition module for processing, and obtain a target character in the text to be recognized and a text box corresponding to the target character;
and the generating module is configured to establish a position relation between the target character and the text box and generate a target text corresponding to the text to be recognized according to the position relation.
11. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the steps of the method of any one of claims 1 to 9.
12. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 9.
CN202111101994.2A 2021-09-18 2021-09-18 Text recognition method and device Pending CN113780229A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111101994.2A CN113780229A (en) 2021-09-18 2021-09-18 Text recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111101994.2A CN113780229A (en) 2021-09-18 2021-09-18 Text recognition method and device

Publications (1)

Publication Number Publication Date
CN113780229A true CN113780229A (en) 2021-12-10

Family

ID=78852418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111101994.2A Pending CN113780229A (en) 2021-09-18 2021-09-18 Text recognition method and device

Country Status (1)

Country Link
CN (1) CN113780229A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020232872A1 (en) * 2019-05-22 2020-11-26 平安科技(深圳)有限公司 Table recognition method and apparatus, computer device, and storage medium
CN111340023A (en) * 2020-02-24 2020-06-26 创新奇智(上海)科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN112597773A (en) * 2020-12-08 2021-04-02 上海深杳智能科技有限公司 Document structuring method, system, terminal and medium
CN113111871A (en) * 2021-04-21 2021-07-13 北京金山数字娱乐科技有限公司 Training method and device of text recognition model and text recognition method and device
CN113378710A (en) * 2021-06-10 2021-09-10 平安科技(深圳)有限公司 Layout analysis method and device for image file, computer equipment and storage medium

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114637845A (en) * 2022-03-11 2022-06-17 上海弘玑信息技术有限公司 Model testing method, device, equipment and storage medium
CN114637845B (en) * 2022-03-11 2023-04-14 上海弘玑信息技术有限公司 Model testing method, device, equipment and storage medium
CN115410191A (en) * 2022-11-03 2022-11-29 平安银行股份有限公司 Text image recognition method, device, equipment and storage medium
CN115497115A (en) * 2022-11-03 2022-12-20 杭州实在智能科技有限公司 Header and footer detection method and system based on deep learning
CN115497115B (en) * 2022-11-03 2024-03-15 杭州实在智能科技有限公司 Deep learning-based header and footer detection method and system
CN115640401A (en) * 2022-12-07 2023-01-24 恒生电子股份有限公司 Text content extraction method and device
CN115640401B (en) * 2022-12-07 2023-04-07 恒生电子股份有限公司 Text content extraction method and device
CN116916047A (en) * 2023-09-12 2023-10-20 北京点聚信息技术有限公司 Intelligent storage method for layout file identification data
CN116916047B (en) * 2023-09-12 2023-11-10 北京点聚信息技术有限公司 Intelligent storage method for layout file identification data
CN117217185A (en) * 2023-11-07 2023-12-12 江西五十铃汽车有限公司 Document generation method and system
CN117217185B (en) * 2023-11-07 2024-03-01 江西五十铃汽车有限公司 Document generation method and system

Similar Documents

Publication Publication Date Title
CN113780229A (en) Text recognition method and device
CN111476067B (en) Character recognition method and device for image, electronic equipment and readable storage medium
US20090285482A1 (en) Detecting text using stroke width based text detection
KR101377601B1 (en) System and method for providing recognition and translation of multiple language in natural scene image using mobile camera
CN113221711A (en) Information extraction method and device
CN109344914A (en) A kind of method and system of the Text region of random length end to end
CN113961685A (en) Information extraction method and device
CN110516259B (en) Method and device for identifying technical keywords, computer equipment and storage medium
CN115424282A (en) Unstructured text table identification method and system
CN112434690A (en) Method, system and storage medium for automatically capturing and understanding elements of dynamically analyzing text image characteristic phenomena
CN112418812A (en) Distributed full-link automatic intelligent clearance system, method and storage medium
Tsai et al. Using cell phone pictures of sheet music to retrieve MIDI passages
CN113033269B (en) Data processing method and device
CN110209759B (en) Method and device for automatically identifying page
CN112464907A (en) Document processing system and method
CN115713775B (en) Method, system and computer equipment for extracting form from document
KR102043693B1 (en) Machine learning based document management system
CN113486171B (en) Image processing method and device and electronic equipment
CN114359912B (en) Software page key information extraction method and system based on graph neural network
CN112560849B (en) Neural network algorithm-based grammar segmentation method and system
CN112836632B (en) Method and system for realizing user-defined template character recognition
Pattnaik et al. A Framework to Detect Digital Text Using Android Based Smartphone
CN111950542A (en) Learning scanning pen based on OCR recognition algorithm
CN114399782B (en) Text image processing method, apparatus, device, storage medium, and program product
CN113641746B (en) Document structuring method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination