CN115311666A - Image-text recognition method and device, computer equipment and storage medium


Info

Publication number
CN115311666A
Authority
CN
China
Prior art keywords
result
text
recognition
image
picture
Prior art date
Legal status
Pending
Application number
CN202210918645.8A
Other languages
Chinese (zh)
Inventor
高鹏
康维鹏
袁兰
吴飞
周伟华
高峰
潘晶
Current Assignee
Hangzhou Mjoys Big Data Technology Co ltd
Original Assignee
Hangzhou Mjoys Big Data Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Mjoys Big Data Technology Co ltd filed Critical Hangzhou Mjoys Big Data Technology Co ltd
Publication of CN115311666A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

The embodiment of the invention discloses an image-text recognition method and apparatus, a computer device, and a storage medium. The method comprises the following steps: acquiring a picture to be recognized in the bank field; preprocessing the picture to be recognized to obtain a potential text region; inputting the potential text region into a character recognition model for image-text recognition to obtain a first recognition result; correcting the first recognition result to obtain a second recognition result; and outputting the second recognition result. By implementing the method of the embodiment of the invention, characters in pictures from the bank field can be recognized accurately, recognition errors caused by interference factors such as brightness, color difference, wrinkles, surface masking, and similar glyph shapes are avoided, and recognition accuracy is improved.

Description

Image-text recognition method and device, computer equipment and storage medium
Technical Field
The invention relates to image-text recognition, and in particular to an image-text recognition method, an image-text recognition device, computer equipment, and a storage medium.
Background
Image-text recognition is one of the key application technologies in the field of AI (Artificial Intelligence); it has wide application scenarios and is being studied extensively. However, most existing image-text recognition methods rely only on the visual information presented by characters, while pictures in the bank field suffer from brightness variation, color difference, wrinkles, surface masking, and the like; for characters with similar shapes, recognition of those parts is therefore inaccurate.
Therefore, it is necessary to design a new method that accurately recognizes the characters in pictures from the bank field, avoids recognition errors caused by interference factors such as brightness, color difference, wrinkles, surface masking, and similar shapes, and improves recognition accuracy.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method and a device for identifying pictures and texts, a computer device and a storage medium.
In order to achieve the above purpose, the invention adopts the following technical scheme: an image-text recognition method, comprising the following steps:
acquiring a picture to be identified in the bank field;
preprocessing the picture to be identified to obtain a potential text area;
inputting the potential text area into a character recognition model for image-text recognition to obtain a first recognition result;
correcting the first recognition result to obtain a second recognition result;
and outputting the second recognition result.
The further technical scheme is as follows: the preprocessing of the picture to be recognized to obtain a potential text region comprises the following steps:
performing picture graying processing on the picture to be recognized to obtain a first processing result;
performing size scaling on the first processing result to obtain a second processing result;
performing tilt rotation correction on the second processing result to obtain a third processing result;
and performing character region positioning on the third processing result to obtain a potential text region.
The further technical scheme is as follows: the performing character region positioning on the third processing result to obtain a potential text region comprises:
extracting multiple CNN convolution features from the third processing result using an image sliding window;
performing binary classification on the features and screening out the image sliding windows that contain text characters;
and performing contour extraction on the third processing result with a maximum-region connected-graph contour recognition algorithm, combined with the image sliding windows containing text characters, to obtain a potential text region.
The further technical scheme is as follows: the character recognition model is obtained by training an optimized CRNN network on pictures with character labels as a sample set, wherein the optimized CRNN network is formed by replacing the BLSTM of the CRNN network with a trained Bert language model; the trained Bert language model is obtained by training a Bert language model with a corpus data set from the bank field as a first sample set.
The further technical scheme is as follows: obtaining the trained Bert language model by training the Bert language model with the corpus data set in the bank field as a first sample set comprises the following steps:
acquiring a corpus data set in the bank field and segmenting it into natural clauses at special punctuation to obtain a first sample set;
performing vectorization characterization on the first sample set to obtain a characterized sample set;
constructing a Bert language model;
and training the Bert language model with the characterized sample set to obtain the trained Bert language model.
The further technical scheme is as follows: the correcting the first recognition result to obtain a second recognition result comprises:
performing image-text table identification and merging on the first recognition result to obtain a processing result;
and merging the image-text paragraphs of the same paragraph in the processing result to obtain a second recognition result.
The further technical scheme is as follows: the performing image-text table identification and merging on the first recognition result to obtain a processing result comprises:
extracting feature information from the first recognition result to obtain an information extraction result;
performing CNN feature extraction on the information extraction result according to type, position, and row/column number information to obtain a feature extraction result;
performing binary classification discrimination on the feature extraction result in a fully connected manner to obtain a discrimination result;
and performing table row and column merging according to the discrimination result to obtain a processing result.
The invention also provides an image-text recognition device, comprising:
a picture acquisition unit, used for acquiring a picture to be recognized in the bank field;
a preprocessing unit, used for preprocessing the picture to be recognized to obtain a potential text region;
a recognition unit, used for inputting the potential text region into a character recognition model for image-text recognition to obtain a first recognition result;
a processing unit, used for correcting the first recognition result to obtain a second recognition result;
and an output unit, used for outputting the second recognition result.
The invention also provides a computer device, comprising a memory and a processor, wherein the memory stores a computer program and the processor implements the above method when executing the computer program.
The invention also provides a storage medium storing a computer program which, when executed by a processor, implements the method described above.
Compared with the prior art, the invention has the following beneficial effects: the method performs graying, size scaling, tilt rotation correction, and character region positioning on the picture to be recognized from the bank field to determine the potential text regions; determines the text content with a character recognition model; and then performs post-processing operations such as table recognition and field merging to determine the final text content. Characters in pictures from the bank field are thus recognized accurately, recognition errors caused by interference factors such as brightness, color difference, wrinkles, surface masking, and similar shapes are avoided, and recognition accuracy is improved.
The invention is further described below with reference to the figures and the specific embodiments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic view of an application scenario of a method for recognizing a text according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a method for identifying graphics and text according to an embodiment of the present invention;
fig. 3 is a sub-flow diagram of an image-text recognition method according to an embodiment of the present invention;
fig. 4 is a sub-flow diagram of an image-text recognition method according to an embodiment of the present invention;
fig. 5 is a schematic sub-flow diagram of an image-text recognition method according to an embodiment of the present invention;
fig. 6 is a schematic sub-flow chart of the image-text recognition method according to the embodiment of the present invention;
fig. 7 is a sub-flow diagram of an image-text recognition method according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a character recognition model according to an embodiment of the present invention;
fig. 9 is a schematic block diagram of an image-text recognition apparatus according to an embodiment of the invention;
fig. 10 is a schematic block diagram of a preprocessing unit of the image-text recognition apparatus according to an embodiment of the present invention;
fig. 11 is a schematic block diagram of a positioning subunit of the image-text recognition apparatus provided in the embodiment of the present invention;
fig. 12 is a schematic block diagram of a processing unit of the image-text recognition apparatus according to an embodiment of the invention;
FIG. 13 is a schematic block diagram of a table merging subunit of the image-text recognition apparatus according to an embodiment of the present invention;
FIG. 14 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of an image-text recognition method according to an embodiment of the present invention, and fig. 2 is a schematic flowchart of the method. The image-text recognition method is applied to a server. The server exchanges data with a terminal: a picture to be recognized from the bank field is input through the terminal, the server preprocesses the picture to determine the potential text regions, performs image-text recognition and correction on those regions to obtain the text content in the picture, and outputs the result to the terminal.
Fig. 2 is a schematic flowchart of an image-text recognition method according to an embodiment of the present invention. As shown in fig. 2, the method includes the following steps S110 to S150.
S110, acquiring a picture to be recognized in the bank field.
In this embodiment, the picture to be recognized refers to a picture in the bank field that needs to be subjected to image-text recognition.
S120, preprocessing the picture to be recognized to obtain a potential text region.
In the present embodiment, the potential text region refers to a region where text may exist.
Specifically, the preprocessing of the picture mainly includes picture graying, size scaling and normalization, tilt rotation correction, character region positioning, and the like.
In an embodiment, referring to fig. 3, the step S120 may include steps S121 to S124.
S121, performing picture graying processing on the picture to be recognized to obtain a first processing result.
In this embodiment, the first processing result refers to the picture obtained by graying the picture to be recognized.
In this embodiment, the graying processing of the picture can be performed by using the existing picture graying processing technology, which is not described herein again.
S122, performing size scaling on the first processing result to obtain a second processing result.
In this embodiment, the second processing result is the grayed picture after size scaling.
Specifically, since the character recognition model has normalization requirements on its input, the maximum recognizable picture size is set to 2056 × 2056, and both the length and the width must be integer multiples of 32. If the length or width of a picture exceeds 2056, the larger dimension is scaled down proportionally to 2056. Meanwhile, if the length or width is not an integer multiple of 32, for example a 28 × 30 picture, the length and width are enlarged by 32/28 and 32/30 times respectively, yielding a picture of the standard input size, i.e., the second processing result.
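As an illustration of this normalization, here is a minimal Python/OpenCV sketch; the function name is hypothetical and the use of cv2.resize is an assumed implementation choice, with only the 2056 limit and the multiple-of-32 constraint taken from the description above.

```python
import cv2

MAX_SIDE = 2056   # maximum recognizable picture size stated above
MULTIPLE = 32     # length and width must be integer multiples of 32

def normalize_size(img):
    """Scale a picture to the character recognition model's standard input size."""
    h, w = img.shape[:2]
    # If either side exceeds 2056, shrink proportionally so the larger side is 2056.
    if max(h, w) > MAX_SIDE:
        ratio = MAX_SIDE / max(h, w)
        h, w = int(h * ratio), int(w * ratio)
    # Enlarge each side up to the nearest multiple of 32,
    # e.g. 28 x 30 becomes 32 x 32 (factors 32/28 and 32/30).
    h = ((h + MULTIPLE - 1) // MULTIPLE) * MULTIPLE
    w = ((w + MULTIPLE - 1) // MULTIPLE) * MULTIPLE
    return cv2.resize(img, (w, h))  # cv2.resize expects (width, height)
```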
S123, performing tilt rotation correction on the second processing result to obtain a third processing result.
In this embodiment, the third processing result is obtained by performing tilt rotation correction on the zoomed picture.
Specifically, because some pictures are obtained by scanner, taken with a mobile phone, or typeset in an italic style, the text in a picture may be rotated; detection therefore needs to be performed on the text image information so that the whole picture can be rotated back. The rotation angle can be detected and calculated with a Hough transform line detection algorithm, and the picture is then rotated to the correct position according to this angle, ensuring the accuracy of the subsequent character region positioning.
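The following is a sketch of such a deskewing step using OpenCV's Hough line detection; the Canny thresholds, vote count, angle filter, and sign handling are illustrative assumptions rather than the patent's parameters.

```python
import cv2
import numpy as np

def deskew(gray):
    """Estimate the dominant text angle via Hough line detection and rotate back."""
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)
    if lines is None:
        return gray  # no lines found; assume the picture is already upright
    # theta is the angle of each line's normal; near-horizontal text lines have
    # theta around 90 degrees, so the skew is the deviation from 90.
    angles = [np.degrees(theta) - 90.0
              for rho, theta in lines[:, 0]
              if 45.0 < np.degrees(theta) < 135.0]
    if not angles:
        return gray
    skew = float(np.mean(angles))
    h, w = gray.shape[:2]
    # Rotate about the image center; flip the sign of `skew` if your coordinate
    # convention makes the result rotate the wrong way.
    rot = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), skew, 1.0)
    return cv2.warpAffine(gray, rot, (w, h), borderValue=255)
```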
S124, performing character region positioning on the third processing result to obtain a potential text region.
In an embodiment, referring to fig. 4, the step S124 may include steps S1241 to S1243.
S1241, extracting multiple CNN convolution features from the third processing result using an image sliding window;
S1242, performing binary classification on the features and screening out the image sliding windows that contain text characters;
S1243, performing contour extraction on the third processing result with a maximum-region connected-graph contour recognition algorithm, combined with the image sliding windows containing text characters, to obtain a potential text region.
Specifically, since characters do not necessarily exist at every position of a picture, the picture position blocks where text data exists must be determined before actual character recognition. First, each potential text area is framed with a roughly parallelogram-shaped region block. On the converted picture, i.e., the third processing result, multiple CNN (Convolutional Neural Network) convolution features are extracted over a 2x2 image sliding window, and the extracted features are binary-classified and represented as 0/1, i.e., a classification decides whether each sliding window contains text characters. The picture is thereby converted into a 0/1 matrix image. Then a maximum-region connected-graph contour recognition algorithm, generally the relevant algorithm functions of OpenCV2 (OpenCV is an open-source image processing toolkit), analyzes the contours in this matrix to find the maximum connected boundary of each character region, i.e., a parallelogram region defined by its upper-left, upper-right, lower-left, and lower-right corner positions. A series of text character regions is finally obtained: the potential text regions, composed of multiple parallelograms.
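A compact sketch of this localization under stated assumptions: window_has_text is a hypothetical stand-in for the CNN binary classifier described above, and OpenCV 4's findContours return signature is assumed.

```python
import cv2
import numpy as np

def locate_text_regions(img, window_has_text):
    """Build the 0/1 text matrix over 2x2 sliding windows, then extract the
    maximum connected region contours as parallelogram corner boxes."""
    h, w = img.shape[:2]
    mask = np.zeros((h // 2, w // 2), dtype=np.uint8)
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            if window_has_text(img[i:i + 2, j:j + 2]):  # CNN classifier stand-in
                mask[i // 2, j // 2] = 255
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        rect = cv2.minAreaRect(c)          # rotated bounding rectangle
        corners = cv2.boxPoints(rect) * 2  # back to full-image coordinates
        boxes.append(corners)              # four corners: a parallelogram box
    return boxes
```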
S130, inputting the potential text region into a character recognition model for image-text recognition to obtain a first recognition result.
In this embodiment, the first recognition result refers to text content obtained by performing text recognition on the potential text region through a text recognition model.
In this embodiment, the character recognition model is formed by training an optimized CRNN network by using an image with a character tag as a sample set, wherein the optimized CRNN network is formed by replacing a BLSTM of the CRNN network with a trained Bert language model; the trained Bert language model is obtained by training the Bert language model by taking the corpus data set in the bank field as a first sample set.
After the potential text region boxes in the picture are obtained, character recognition must be performed on each text region box in turn. At present, image-text recognition generally uses deep learning: an image-text recognition network recognizes the segmented image blocks of a character region as character content. The recognition process of such a deep network generally comprises the following steps. First, the text region block, represented as a parallelogram and generally a rectangular matrix block, is scaled or padded and converted into a rectangular matrix block of standard height; if the block is a non-rectangular parallelogram, the corresponding positions are filled with blanks. Then, multiple CNN convolution features are extracted from the converted standard text matrix box over a sliding window, yielding abstract character-and-glyph feature information for the box. Finally, serialized character recognition is performed on the abstracted feature information to complete character recognition; for Chinese, the total classification category at each position during serialized classification covers about 4000 Chinese characters.
In the present embodiment, as shown in fig. 8, the final character recognition model combines a CRNN (Convolutional Recurrent Neural Network) with the trained Bert language model, optimizing both the feature extraction part and the final sequence-transduction recognition part. CRNN is a popular image-text recognition network that can recognize comparatively long, variable-length text sequences; its feature extraction layers comprise a CNN and a BLSTM, and it supports end-to-end joint training. It learns the contextual relationship between character images by means of the BLSTM and a CTC network, which effectively improves image-text recognition accuracy. Here, the trained Bert language model replaces the BLSTM of the native CRNN model to capture contextual semantic and graphical information. Because the Bert language model fuses word text semantics, character glyphs, and pinyin information, it remedies the insufficient capture of context information in the original CRNN model, alleviates recognition problems caused by picture wrinkles, unclear characters, and the like, and improves the final character recognition accuracy.
Specifically, for a picture text region box to be recognized, a CNN model first performs convolution and pooling feature extraction to obtain the basic character-morphology visual features of the region; a Bert model that fuses character text semantics and character-morphology semantics then produces glyph-sequence semantic features. Finally, the output at each position of the Bert model is fed into a BiLSTM recurrent network to obtain a cleaner character-sequence semantic encoding, which is connected to a transcription model to obtain the character recognition result for the text region.
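A structural sketch of such a recognizer in PyTorch follows; the layer sizes, the bert-base-chinese checkpoint, the roughly 4000-character output vocabulary, and splicing the projected CNN columns into Bert via inputs_embeds are illustrative assumptions, not the patent's exact wiring.

```python
import torch.nn as nn
from transformers import BertModel

class CrnnBert(nn.Module):
    """Sketch: CNN visual features -> Bert (replacing the native BLSTM) ->
    BiLSTM sequence encoder -> per-position character classifier for CTC."""

    def __init__(self, bert_name="bert-base-chinese", num_chars=4000):
        super().__init__()
        self.cnn = nn.Sequential(                    # visual feature extractor
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.proj = nn.Linear(256 * 8, 768)          # CNN columns -> Bert width
        self.bert = BertModel.from_pretrained(bert_name)
        self.bilstm = nn.LSTM(768, 256, bidirectional=True, batch_first=True)
        self.head = nn.Linear(512, num_chars)        # ~4000 Chinese characters

    def forward(self, images):                       # images: (B, 1, 32, W)
        feats = self.cnn(images)                     # (B, 256, 8, W/2)
        b, c, h, w = feats.shape
        seq = self.proj(feats.permute(0, 3, 1, 2).reshape(b, w, c * h))
        # Feed the visual column embeddings through Bert in place of a BLSTM;
        # the sequence length must stay within Bert's 512-position limit.
        ctx = self.bert(inputs_embeds=seq).last_hidden_state
        out, _ = self.bilstm(ctx)                    # clearer sequence encoding
        return self.head(out)                        # per-position logits
```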
When the character recognition is performed, in order to facilitate subsequent character post-processing correction and the like, original image-text position information corresponding to the text box needs to be transmitted and recorded.
Specifically, the X/Y coordinates of the four corners, upper-left, upper-right, lower-left, and lower-right, of each potential text region box are recorded in detail. When the picture is enlarged or reduced, these X/Y coordinates must be scaled by the same ratio, so that the finally recognized text still corresponds accurately to the coordinates of its box in the original picture. This facilitates the structural correction of the final picture text, for example the corrective recognition of table information in the picture.
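A minimal sketch of that coordinate bookkeeping, assuming each box is a list of four (x, y) corner tuples:

```python
def rescale_box(box, x_ratio, y_ratio):
    """Scale a text region box's four corners (UL, UR, LL, LR) by the same
    ratios applied to the picture, so the recognized text still maps back
    to its position in the original picture."""
    return [(x * x_ratio, y * y_ratio) for (x, y) in box]
```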
In an embodiment, referring to fig. 5, obtaining the trained Bert language model by training the Bert language model with the corpus data set in the bank field as the first sample set may include steps S131 to S134.
S131, acquiring a corpus data set in the bank field and segmenting it into natural clauses at special punctuation to obtain a first sample set.
In this embodiment, the first sample set refers to a set of natural clauses into which corpus data sets in the bank domain are divided according to special punctuations.
Character recognition converts the characters in a picture into text characters, so the visual information of characters and their forward/backward collocation semantics need to be pre-trained. Model training uses the BERT model training provided by Google on corpus data sets oriented to the bank field; the corpora mainly come from the dialogue corpus of a bank intelligent question-answering system, bank business corpus data, data collected by industry website crawlers, and the like. The collected corpus data set is cut into natural clauses at special punctuation marks such as periods and question marks, and these clauses are combined into the Bert training corpus, i.e., the first sample set.
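A sketch of the clause segmentation, assuming plain strings, treating full-width and ASCII sentence-final punctuation as the special punctuation, and using an illustrative sample sentence:

```python
import re

# Sentence-final punctuation used as clause boundaries (full-width and ASCII).
CLAUSE_END = re.compile(r"[。？！?!.]")

def to_clauses(document):
    """Cut one corpus document into natural clauses at special punctuation."""
    return [c.strip() for c in CLAUSE_END.split(document) if c.strip()]

raw_documents = ["您好，请问如何办理定期存款？可以在手机银行操作。"]  # illustrative
first_sample_set = [clause for doc in raw_documents for clause in to_clauses(doc)]
```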
S132, performing vectorization characterization on the first sample set to obtain a characterized sample set.
In this embodiment, the characterized sample set refers to the data set formed by vectorizing the first sample set in multiple feature modes.
Because the model is finally used for character recognition in pictures, the basic unit of a sample is the single character, and the original feature input of each character is vectorized in several feature modes, mainly: first, an N-bit random vectorization value of the character, generally 256 or 512 dimensions; second, the graphical features of common fonts for the character, specifically the common font features of KaiTi, SongTi, and the like, each font feature represented as a 28 × 28 grayscale map; third, the pinyin information of the character, with the basic pinyin elements, initial, final, and tone, encoded per basic character.
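The three feature modes could be assembled as in the sketch below; the font file paths and the lookup_pinyin helper are hypothetical placeholders, and the fixed per-character random seed is an assumption made for reproducibility.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def lookup_pinyin(ch):
    """Hypothetical stub mapping a character to (initial, final, tone);
    a real system would use a pinyin dictionary."""
    table = {"银": ("y", "in", 2), "行": ("h", "ang", 2)}
    return table.get(ch, ("", "", 0))

def glyph_gray(ch, font_path):
    """Render one character as a 28 x 28 grayscale map in the given font."""
    img = Image.new("L", (28, 28), color=255)
    font = ImageFont.truetype(font_path, 26)   # font path is a placeholder
    ImageDraw.Draw(img).text((1, 0), ch, font=font, fill=0)
    return np.asarray(img, dtype=np.float32) / 255.0

def char_features(ch, dim=256):
    """Three-part characterization of a single character, per the text above."""
    # 1) N-bit random vectorization value (256 or 512 dims), fixed per character.
    rand_vec = np.random.default_rng(ord(ch)).standard_normal(dim)
    # 2) Common-font glyph features, one 28x28 grayscale map per font.
    glyphs = [glyph_gray(ch, p) for p in ("KaiTi.ttf", "SimSun.ttf")]
    # 3) Pinyin initial / final / tone encoding.
    return rand_vec, glyphs, lookup_pinyin(ch)
```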
S133, constructing a Bert language model;
and S134, training the Bert language model by adopting the characteristic sample set to obtain the trained Bert language model.
Specifically, the characterized sample set is fed into the Bert language model, which is trained to capture the front-and-back character collocation semantics of words in the bank field, along with front-and-back collocation information such as character glyphs and pronunciation tones along the text sequence. The trained Bert language model therefore better fits the image-text recognition task.
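A condensed masked-language-model pre-training sketch with the Hugging Face Transformers library, assuming the bert-base-chinese tokenizer; the hyperparameters and sample clauses are illustrative, and the glyph/pinyin feature fusion described above is omitted for brevity.

```python
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

first_sample_set = ["请问如何办理定期存款", "可以在手机银行操作"]  # illustrative
encodings = tokenizer(first_sample_set, truncation=True, max_length=128)
dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-bank", num_train_epochs=3),
    train_dataset=dataset,
    # Random 15% masking turns the clause corpus into an MLM training signal.
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()  # yields the trained Bert language model used by the recognizer
```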
S140, correcting the first recognition result to obtain a second recognition result.
In this embodiment, the second recognition result refers to the text data formed by performing image-text table identification and merging on the first recognition result and then merging the image-text paragraphs that belong to the same paragraph.
In an embodiment, referring to fig. 6, the step S140 may include steps S141 to S142.
S141, performing image-text table identification and merging on the first recognition result to obtain a processing result.
In this embodiment, the processing result refers to the tabular text data formed by performing image-text table identification and merging on the first recognition result.
In one embodiment, referring to fig. 7, the step S141 may include steps S1411 to S1414.
S1411, extracting feature information from the first recognition result to obtain an information extraction result.
Because the first recognition result records the picture position information of the recognized characters, table position recognition can use this information. A table generally has obvious row and column structure: the text string information of each row is basically the same; whichever alignment each table column uses (left, right, or center), the character positions lie within a certain range of a column center point on the horizontal axis; and the character content within a column tends to be uniform, e.g., all numeric values, names, or dates. The first recognition result is therefore discriminated and merged mainly according to this feature information.
In this embodiment, the above feature information extraction is performed on the first recognition result within a certain context window: the N lines of text above and below the current text whose X-axis and Y-axis position information lies within a certain deviation range of the current text's position.
S1412, performing CNN feature extraction on the information extraction result according to type, position, and row/column number information to obtain a feature extraction result.
In this embodiment, the feature extraction result refers to the CNN features extracted from the information extraction result according to type, position, and row/column number information.
S1413, performing binary classification discrimination on the feature extraction result in a fully connected manner to obtain a discrimination result.
In this embodiment, the discrimination result indicates whether two extraction results belong to the same table, as decided by a binary classification through a fully connected layer.
S1414, performing table row and column merging according to the discrimination result to obtain a processing result.
In this embodiment, if the discrimination result is the same table, the rows and columns of the table are merged according to rules, forming the tabular text data.
In this embodiment, the rules are table identification rules applied mainly to the position information and text features recorded during image-text recognition: if texts in a picture belong to the same table, the number of columns into which each line of text splits is roughly equal, and the blank intervals and the start and end subscripts of each line are generally the same; the starting and ending subscript positions of each table row are roughly equal; and the position information and character types within the same table column are generally similar. The same table can therefore be quickly located, judged, and identified with these rules.
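A sketch of the same-table rule under stated assumptions: each recognized line is a list of cell dicts carrying x0/x1 pixel bounds, and the 10-pixel tolerance is illustrative.

```python
def same_table(line_a, line_b, tol=10):
    """Apply the table identification rules above to two recognized lines:
    roughly equal column counts, and column start/end subscripts (here pixel
    bounds) that line up within a tolerance."""
    if len(line_a) != len(line_b):           # column counts must match
        return False
    return all(abs(a["x0"] - b["x0"]) <= tol and abs(a["x1"] - b["x1"]) <= tol
               for a, b in zip(line_a, line_b))
```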
S142, merging the image-text paragraphs of the same paragraph in the processing result to obtain a second recognition result.
In long-text recognition, a long paragraph is generally split across several lines, so sentences of the first recognition result that belong together are separated. Text paragraphs that originally belong to the same paragraph therefore need to be merged. Such paragraphs typically have these characteristics: a leading blank (indent) at the start of the paragraph, ending punctuation at the end of the paragraph, a clear end-of-paragraph mark at the end of the last line, and aligned line-start positions inside the paragraph; in addition, lines joined end to end form complete words or sentences at the semantic level. Step S142 therefore merges image-text paragraphs using this feature information, completing the final image-text recognition post-processing.
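A sketch of this paragraph merge using the cues above; representing each recognized line as a dict with text and indented fields is an assumed structure.

```python
PARA_END = ("。", "！", "？", ".", "!", "?")

def merge_paragraphs(lines):
    """Merge recognized lines into paragraphs: a leading indent opens a new
    paragraph and sentence-final punctuation at a line end closes one."""
    paragraphs, current = [], []
    for line in lines:
        if line["indented"] and current:      # indent marks a paragraph start
            paragraphs.append("".join(current))
            current = []
        current.append(line["text"].strip())
        if line["text"].rstrip().endswith(PARA_END):
            paragraphs.append("".join(current))
            current = []
    if current:                               # flush a trailing open paragraph
        paragraphs.append("".join(current))
    return paragraphs
```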
S150, outputting the second recognition result.
The method of this embodiment can effectively recognize text, improve recognition accuracy, and provide structured information such as tables and paragraphs.
According to the image-text recognition method, graying, size scaling, tilt rotation correction, and character region positioning are performed on the picture to be recognized from the bank field to determine the potential text regions; a character recognition model determines the text content; and subsequent processing operations such as table recognition and field merging determine the final text content. Characters in pictures from the bank field are thus recognized accurately, recognition errors caused by interference factors such as brightness, color difference, wrinkles, surface masking, and similar shapes are avoided, and recognition accuracy is improved.
Fig. 9 is a schematic block diagram of an image-text recognition apparatus 300 according to an embodiment of the present invention. As shown in fig. 9, the present invention also provides an image-text recognition apparatus 300 corresponding to the image-text recognition method above. The image-text recognition apparatus 300 comprises units for performing the image-text recognition method described above, and the apparatus may be arranged in a server. Specifically, referring to fig. 9, the image-text recognition apparatus 300 includes a picture acquisition unit 301, a preprocessing unit 302, a recognition unit 303, a processing unit 304, and an output unit 305.
The picture acquisition unit 301 is used for acquiring a picture to be recognized in the bank field; the preprocessing unit 302 is used for preprocessing the picture to be recognized to obtain a potential text region; the recognition unit 303 is used for inputting the potential text region into a character recognition model for image-text recognition to obtain a first recognition result; the processing unit 304 is used for correcting the first recognition result to obtain a second recognition result; and the output unit 305 is used for outputting the second recognition result.
In one embodiment, as shown in fig. 10, the preprocessing unit 302 includes a graying subunit 3021, a scaling subunit 3022, a rotation subunit 3023, and a positioning subunit 3024.
The graying subunit 3021 is configured to perform picture graying processing on the picture to be recognized to obtain a first processing result; the scaling subunit 3022 is configured to perform size scaling on the first processing result to obtain a second processing result; the rotation subunit 3023 is configured to perform tilt rotation correction on the second processing result to obtain a third processing result; and the positioning subunit 3024 is configured to perform character region positioning on the third processing result to obtain a potential text region.
In one embodiment, as shown in fig. 11, the positioning subunit 3024 includes a feature extraction module 30241, a binary classification module 30242, and a contour extraction module 30243.
The feature extraction module 30241 is configured to extract multiple CNN convolution features from the third processing result using an image sliding window; the binary classification module 30242 is configured to perform binary classification on the features and screen out the image sliding windows containing text characters; and the contour extraction module 30243 is configured to perform contour extraction on the third processing result using a maximum-region connected-graph contour recognition algorithm combined with the image sliding windows containing text characters, to obtain a potential text region.
The character recognition model is obtained by training an optimized CRNN network on pictures with character labels as a sample set, wherein the optimized CRNN network is formed by replacing the BLSTM of the CRNN network with a trained Bert language model; the trained Bert language model is obtained by training the Bert language model with a corpus data set from the bank field as a first sample set. This training comprises: acquiring a corpus data set in the bank field and segmenting it into natural clauses at special punctuation to obtain a first sample set; performing vectorization characterization on the first sample set to obtain a characterized sample set; constructing a Bert language model; and training the Bert language model with the characterized sample set to obtain the trained Bert language model.
In one embodiment, as shown in fig. 12, the processing unit 304 includes a table merging subunit 3041 and a paragraph merging subunit 3042.
The table merging subunit 3041 is configured to perform image-text table identification and merging on the first recognition result to obtain a processing result; the paragraph merging subunit 3042 is configured to merge the image-text paragraphs of the same paragraph in the processing result to obtain a second recognition result.
In an embodiment, as shown in fig. 13, the table merging subunit 3041 includes an information extraction module 30411, a feature extraction module 30412, a discrimination module 30413, and a row-column merging module 30414.
The information extraction module 30411 is configured to extract feature information from the first recognition result to obtain an information extraction result; the feature extraction module 30412 is configured to perform CNN feature extraction on the information extraction result according to type, position, and row/column number information to obtain a feature extraction result; the discrimination module 30413 is configured to perform binary classification discrimination on the feature extraction result in a fully connected manner to obtain a discrimination result; and the row-column merging module 30414 is configured to perform table row and column merging according to the discrimination result to obtain a processing result.
It should be noted that, as can be clearly understood by those skilled in the art, the detailed implementation process of the image-text recognition apparatus 300 and each unit may refer to the corresponding description in the foregoing method embodiment, and for convenience and conciseness of description, no further description is provided herein.
The image-text recognition apparatus 300 described above may be implemented in the form of a computer program that can run on a computer device as shown in fig. 14.
Referring to fig. 14, fig. 14 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a server, wherein the server may be an independent server or a server cluster composed of a plurality of servers.
Referring to fig. 14, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032 comprises program instructions that, when executed, cause the processor 502 to perform an image-text recognition method.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, the processor 502 is caused to perform an image-text recognition method.
The network interface 505 is used for network communication with other devices. It will be appreciated by those skilled in the art that the configuration shown in fig. 14 is a block diagram of only part of the configuration relevant to the present application and does not limit the computer device 500 to which the present application is applied; a particular computer device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
Wherein the processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps:
acquiring a picture to be identified in the bank field; preprocessing the picture to be identified to obtain a potential text area; inputting the potential text area into a character recognition model for image-text recognition to obtain a first recognition result; correcting the first recognition result to obtain a second recognition result; and outputting the second recognition result.
The character recognition model is obtained by training an optimized CRNN network on pictures with character labels as a sample set, wherein the optimized CRNN network is formed by replacing the BLSTM of the CRNN network with a trained Bert language model; the trained Bert language model is obtained by training the Bert language model with a corpus data set from the bank field as a first sample set.
In an embodiment, when the processor 502 implements the step of preprocessing the picture to be recognized to obtain the potential text region, the following steps are specifically implemented:
performing picture graying processing on the picture to be recognized to obtain a first processing result; performing size scaling on the first processing result to obtain a second processing result; performing tilt rotation correction on the second processing result to obtain a third processing result; and performing character region positioning on the third processing result to obtain a potential text region.
In an embodiment, when the processor 502 implements the step of performing text region positioning on the third processing result to obtain a potential text region, the following steps are specifically implemented:
extracting multiple CNN convolution features from the third processing result using an image sliding window; performing binary classification on the features and screening out the image sliding windows containing text characters; and performing contour extraction on the third processing result using a maximum-region connected-graph contour recognition algorithm combined with the image sliding windows containing text characters, to obtain a potential text region.
In an embodiment, when the processor 502 implements the step of training the trained Bert language model by using the corpus data set in the bank field as the first sample set, the following steps are specifically implemented:
acquiring a corpus data set in the bank field and segmenting it into natural clauses at special punctuation to obtain a first sample set; performing vectorization characterization on the first sample set to obtain a characterized sample set; constructing a Bert language model; and training the Bert language model with the characterized sample set to obtain the trained Bert language model.
In an embodiment, when implementing the step of performing the correction processing on the first recognition result to obtain the second recognition result, the processor 502 specifically implements the following steps:
carrying out image-text table identification and combination on the first identification result to obtain a processing result; and merging the image-text paragraphs of the same paragraph of the processing result to obtain a second identification result.
In an embodiment, when implementing the step of performing the table-text identification and merging on the first identification result to obtain the processing result, the processor 502 specifically implements the following steps:
extracting feature information from the first recognition result to obtain an information extraction result; performing CNN feature extraction on the information extraction result according to type, position, and row/column number information to obtain a feature extraction result; performing binary classification discrimination on the feature extraction result in a fully connected manner to obtain a discrimination result;
and performing table row and column merging according to the discrimination result to obtain a processing result.
It should be understood that in the embodiment of the present application, the processor 502 may be a Central Processing Unit (CPU), and the processor 502 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
It will be understood by those skilled in the art that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program instructing associated hardware. The computer program includes program instructions, and the computer program may be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program, wherein the computer program, when executed by a processor, causes the processor to perform the steps of:
acquiring a picture to be identified in the bank field; preprocessing the picture to be recognized to obtain a potential text area; inputting the potential text area into a character recognition model for image-text recognition to obtain a first recognition result; correcting the first recognition result to obtain a second recognition result; and outputting the second recognition result.
The character recognition model is formed by training an optimized CRNN through a picture with a character label as a sample set, wherein the optimized CRNN is formed by replacing a BLSTM of the CRNN with a trained Bert language model; the trained Bert language model is obtained by training the Bert language model by taking the corpus data set in the bank field as a first sample set.
In an embodiment, when the processor executes the computer program to implement the step of preprocessing the picture to be recognized to obtain the potential text region, the following steps are specifically implemented:
performing picture graying processing on the picture to be recognized to obtain a first processing result; performing size scaling on the first processing result to obtain a second processing result; performing tilt rotation correction on the second processing result to obtain a third processing result; and performing character region positioning on the third processing result to obtain a potential text region.
In an embodiment, when the processor executes the computer program to perform the step of performing text region location on the third processing result to obtain a potential text region, the following steps are specifically implemented:
extracting multiple CNN convolution features from the third processing result using an image sliding window; performing binary classification on the features and screening out the image sliding windows containing text characters; and performing contour extraction on the third processing result using a maximum-region connected-graph contour recognition algorithm combined with the image sliding windows containing text characters, to obtain a potential text region.
In an embodiment, when the processor executes the computer program to implement the step of training the Bert language model by using the corpus data set in the bank field as the first sample set, the following steps are specifically implemented:
acquiring a corpus data set in the bank field and segmenting it into natural clauses at special punctuation to obtain a first sample set; performing vectorization characterization on the first sample set to obtain a characterized sample set; constructing a Bert language model; and training the Bert language model with the characterized sample set to obtain the trained Bert language model.
In an embodiment, when the processor executes the computer program to implement the step of performing the correction processing on the first recognition result to obtain the second recognition result, the following steps are specifically implemented:
carrying out image-text table identification and combination on the first identification result to obtain a processing result; and merging the image-text paragraphs of the same paragraph of the processing result to obtain a second identification result.
In an embodiment, when the processor executes the computer program to implement the step of performing the table-text recognition and merging on the first recognition result to obtain the processing result, the following steps are specifically implemented:
extracting feature information from the first recognition result to obtain an information extraction result; performing CNN feature extraction on the information extraction result according to type, position, and row/column number information to obtain a feature extraction result; performing binary classification discrimination on the feature extraction result in a fully connected manner to obtain a discrimination result; and performing table row and column merging according to the discrimination result to obtain a processing result.
The storage medium may be a USB disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium that can store program code.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two; to illustrate the interchangeability of hardware and software clearly, the components and steps of the examples have been described above in general terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, various elements or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented.
The steps in the method of the embodiment of the invention can be reordered, combined, and deleted according to actual needs. The units in the device of the embodiment of the invention can be merged, divided, and deleted according to actual needs. In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a terminal, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An image-text recognition method, characterized by comprising the following steps:
acquiring a picture to be identified in the bank field;
preprocessing the picture to be identified to obtain a potential text area;
inputting the potential text area into a character recognition model for image-text recognition to obtain a first recognition result;
correcting the first recognition result to obtain a second recognition result;
and outputting the second recognition result.
2. The image-text recognition method according to claim 1, wherein the preprocessing the picture to be recognized to obtain a potential text region comprises:
performing picture graying processing on the picture to be recognized to obtain a first processing result;
scaling the size of the first processing result to obtain a second processing result;
performing tilt rotation correction on the second processing result to obtain a third processing result;
and carrying out character area positioning on the third processing result to obtain a potential text area.
3. The image-text recognition method according to claim 2, wherein the performing character region positioning on the third processing result to obtain a potential text region comprises:
extracting multiple CNN convolution features from the third processing result using an image sliding window;
performing binary classification on the features and screening out the image sliding windows that contain text characters;
and performing contour extraction on the third processing result using a maximum-region connected-graph contour recognition algorithm combined with the image sliding windows containing text characters, to obtain a potential text region.
4. The image-text recognition method according to claim 1, wherein the text recognition model is formed by training an optimized CRNN network by using a picture with text labels as a sample set, wherein the optimized CRNN network is formed by replacing BLSTM of the CRNN network with a trained Bert language model; the trained Bert language model is obtained by training the Bert language model by taking a corpus data set in the bank field as a first sample set.
5. The image-text recognition method according to claim 4, wherein the training of the Bert language model using the corpus data set in the bank field as the first sample set comprises:
obtaining a corpus data set in the bank field, and segmenting the corpus data set into natural clauses at special punctuation marks to obtain the first sample set;
vectorizing and characterizing the first sample set to obtain a characterized sample set;
constructing a Bert language model;
and training the Bert language model by adopting the characteristic sample set to obtain the trained Bert language model.
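
For illustration only: a condensed sketch of the four steps of claim 5, assuming the HuggingFace transformers library and the public bert-base-chinese checkpoint. The clause splitting, the 15% masking rate, and the single gradient step are illustrative simplifications of a full masked-language-model pre-training loop (special and padding tokens are not excluded from masking here).

    import re
    import torch
    from transformers import BertTokenizerFast, BertForMaskedLM

    def split_clauses(corpus_text):
        # Step 1: segment the corpus into natural clauses at punctuation marks
        return [c.strip() for c in re.split(r"[。！？；!?.;]", corpus_text) if c.strip()]

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
    model = BertForMaskedLM.from_pretrained("bert-base-chinese")

    clauses = split_clauses("贷款利率已经下调。请携带银行卡前往网点办理！")
    # Step 2: vectorized characterization of the first sample set
    batch = tokenizer(clauses, padding=True, truncation=True, return_tensors="pt")
    # Steps 3-4: one masked-language-model training step
    labels = batch["input_ids"].clone()
    masked = torch.rand(labels.shape) < 0.15       # mask roughly 15% of tokens
    batch["input_ids"][masked] = tokenizer.mask_token_id
    labels[~masked] = -100                         # loss on masked tokens only
    loss = model(**batch, labels=labels).loss
    loss.backward()
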
6. The image-text recognition method according to claim 1, wherein the correcting the first recognition result to obtain a second recognition result comprises:
performing image-text table identification and merging on the first recognition result to obtain a processing result;
and merging image-text paragraphs belonging to the same paragraph in the processing result to obtain the second recognition result.
7. The image-text recognition method according to claim 6, wherein the performing of image-text table identification and merging on the first recognition result to obtain a processing result comprises:
extracting characteristic information from the first recognition result to obtain a first extraction result;
performing CNN feature extraction on the first extraction result according to type, position and row-number information to obtain a second extraction result;
performing binary classification on the second extraction result through a fully connected layer to obtain a discrimination result;
and merging table rows and columns according to the discrimination result to obtain the processing result.
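
For illustration only: a sketch of the fully connected binary discrimination of claim 7, assuming each recognized fragment has already been reduced to a small feature vector (type, position, row number). Consecutive fragments are merged greedily whenever the classifier votes "merge"; all dimensions and names are illustrative.

    import torch
    import torch.nn as nn

    class MergeJudge(nn.Module):
        # Fully connected binary classifier: given the features of two
        # recognized fragments, decide whether they belong together and
        # should be merged into the same table row/column
        def __init__(self, feat_dim=6):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * feat_dim, 64), nn.ReLU(),
                nn.Linear(64, 2),        # two classes: keep separate / merge
            )

        def forward(self, a, b):
            return self.net(torch.cat([a, b], dim=-1))

    def merge_fragments(fragments, feats, judge):
        # Greedily merge consecutive fragments on a "merge" vote
        merged = [fragments[0]]
        for i in range(1, len(fragments)):
            vote = judge(feats[i - 1], feats[i]).argmax(-1).item()
            if vote == 1:
                merged[-1] = merged[-1] + " " + fragments[i]
            else:
                merged.append(fragments[i])
        return merged
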
8. An image-text recognition device, characterized by comprising:
the image acquisition unit is used for acquiring a picture to be recognized in the bank field;
the preprocessing unit is used for preprocessing the picture to be recognized to obtain a potential text area;
the recognition unit is used for inputting the potential text area into a character recognition model for image-text recognition to obtain a first recognition result;
the processing unit is used for correcting the first recognition result to obtain a second recognition result;
and the output unit is used for outputting the second recognition result.
9. A computer device, characterized by comprising a memory storing a computer program and a processor which, when executing the computer program, implements the method according to any one of claims 1 to 7.
10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202210918645.8A 2022-01-26 2022-08-01 Image-text recognition method and device, computer equipment and storage medium Pending CN115311666A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210095181.5A CN114639106A (en) 2022-01-26 2022-01-26 Image-text recognition method and device, computer equipment and storage medium
CN2022100951815 2022-01-26

Publications (1)

Publication Number Publication Date
CN115311666A true CN115311666A (en) 2022-11-08

Family

ID=81945630

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202210095181.5A Withdrawn CN114639106A (en) 2022-01-26 2022-01-26 Image-text recognition method and device, computer equipment and storage medium
CN202210918645.8A Pending CN115311666A (en) 2022-01-26 2022-08-01 Image-text recognition method and device, computer equipment and storage medium


Country Status (1)

Country Link
CN (2) CN114639106A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115564656B (en) * 2022-11-11 2023-04-28 成都智元汇信息技术股份有限公司 Multi-graph merging and graph identifying method and device based on scheduling

Also Published As

Publication number Publication date
CN114639106A (en) 2022-06-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination