CN110569830B - Multilingual text recognition method, device, computer equipment and storage medium - Google Patents

Multilingual text recognition method, device, computer equipment and storage medium

Info

Publication number
CN110569830B
Authority
CN
China
Prior art keywords
text
image
recognition
target
identified
Prior art date
Legal status
Active
Application number
CN201910706802.7A
Other languages
Chinese (zh)
Other versions
CN110569830A (en)
Inventor
王健宗
回艳菲
韩茂琨
于凤英
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910706802.7A
Priority to PCT/CN2019/116488 (published as WO2021017260A1)
Publication of CN110569830A
Application granted
Publication of CN110569830B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/146 Aligning or centring of the image pick-up or image-field
    • G06V 30/1475 Inclination or skew detection or correction of characters or of image to be recognised
    • G06V 30/1478 Inclination or skew detection or correction of characters or character lines
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a multilingual text recognition method, a multilingual text recognition device, computer equipment and a storage medium. The method comprises the following steps: acquiring an image to be recognized, wherein the image to be recognized contains original characters corresponding to at least two languages; performing layout analysis and recognition on the image to be recognized to obtain at least one text line image, and determining the text line position of each text line image in the image to be recognized; performing text type recognition on each text line image to obtain the target text type corresponding to each text line image; querying a recognition model database based on the target text type to acquire the target OCR recognition model corresponding to the target text type; performing transcription recognition on each text line image with its target OCR recognition model to obtain the target characters corresponding to the text line image; and acquiring the target recognition text corresponding to the image to be recognized based on the target characters and text line positions of the text line images. Because each text line image is recognized with a matching target OCR recognition model, the method helps improve the recognition accuracy of multilingual text.

Description

Multilingual text recognition method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of text recognition technologies, and in particular, to a method and apparatus for recognizing multilingual text, a computer device, and a storage medium.
Background
Multilingual text recognition applies to scenes in which a text image containing multiple languages is recognized, for example a text image in which Chinese, Japanese and English characters coexist. Current approaches train a sequence-to-sequence (Seq2Seq) recognition model to recognize such multilingual text images, but the model structure is complex, the training process is very difficult, the model runs inefficiently, and the recognition accuracy is low. "Seq2Seq" refers to a class of models that convert a sequence in one domain (e.g., English) into a sequence in another domain (e.g., French). When a traditional Seq2Seq recognition model is used to recognize text images containing characters of multiple languages, misjudging the text type usually corrupts the finally recognized text content, so the recognition accuracy is low, which hinders popularization and application of the recognition model.
Disclosure of Invention
The embodiment of the invention provides a multilingual text recognition method, device, computer equipment and storage medium, which are used for solving the problem of low recognition accuracy when a single recognition model is applied to multilingual text recognition.
A multilingual text recognition method comprising:
acquiring an image to be recognized, wherein the image to be recognized comprises original characters corresponding to at least two languages;
performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized;
performing text type recognition on each text line image to obtain the target text type corresponding to each text line image;
querying a recognition model database based on the target text type, and acquiring the target OCR recognition model corresponding to the target text type;
performing transcription recognition on the text line image with the target OCR recognition model to obtain the target characters corresponding to the text line image;
and acquiring a target recognition text corresponding to the image to be recognized based on the target characters corresponding to the text line images and the text line positions.
A multilingual text recognition device comprising:
an image acquisition module, used for acquiring an image to be recognized, wherein the image to be recognized comprises original characters corresponding to at least two languages;
a text line image acquisition module, used for performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized;
a target text type acquisition module, used for performing text type recognition on each text line image and acquiring the target text type corresponding to each text line image;
a recognition model acquisition module, used for querying a recognition model database based on the target text type and acquiring the target OCR recognition model corresponding to the target text type;
a target character acquisition module, used for performing transcription recognition on the text line image with the target OCR recognition model to acquire the target characters corresponding to the text line image;
and a recognition text acquisition module, used for acquiring the target recognition text corresponding to the image to be recognized based on the target characters corresponding to the text line images and the text line positions.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the above-described multilingual text recognition method when executing the computer program.
A computer readable storage medium storing a computer program which when executed by a processor implements the multilingual text recognition method described above.
According to the multilingual text recognition method, the device, the computer equipment and the storage medium, layout analysis and recognition are first performed on the image to be recognized, which contains original characters of at least two languages, to determine at least one text line image and the corresponding text line position, so that the multilingual image to be recognized is converted into single-language text line images for recognition, improving accuracy in the subsequent recognition process. After text type recognition determines the target text type of each text line image, each text line image is recognized with the target OCR recognition model corresponding to its target text type, ensuring the accuracy of the target characters recognized from each text line image. Finally, based on the text line positions of the text line images, the target characters corresponding to the text line images are re-typeset to obtain the target recognition text corresponding to the image to be recognized, so that comparison and verification can be carried out based on the target recognition text and the recognition accuracy is guaranteed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application environment of a multi-language text recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for multi-lingual text recognition in an embodiment of the invention;
FIG. 3 is another flow chart of a method of multilingual text recognition in accordance with one embodiment of the present invention;
FIG. 4 is another flow chart of a method of multilingual text recognition in accordance with one embodiment of the present invention;
FIG. 5 is another flow chart of a method of multilingual text recognition in accordance with one embodiment of the present invention;
FIG. 6 is another flow chart of a method of multilingual text recognition in accordance with one embodiment of the present invention;
FIG. 7 is a schematic diagram of a model structure of an Encoder-Summarizer mechanism in accordance with one embodiment of the present invention, where 7(a) is the model structure of the Encoder component, 7(b) is the model structure of the Summarizer component, and 7(c) is the model structure of the Inception module in 7(a);
FIG. 8 is a schematic diagram of a multi-language text recognition device according to an embodiment of the invention;
FIG. 9 is a schematic diagram of a computer device in accordance with an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The multilingual text recognition method provided by the embodiment of the invention can be applied to the application environment shown in fig. 1. Specifically, the method is applied to a multilingual text recognition system comprising a client and a server, as shown in fig. 1, which communicate through a network. When recognizing an image to be recognized in which characters of multiple languages coexist, the system determines a corresponding target OCR recognition model for each language and recognizes the image with the several target OCR recognition models, so as to guarantee the accuracy of the finally recognized target recognition text. The client, also called the user end, refers to the program corresponding to the server that provides local services for the user. The client may be installed on, but is not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 2, a multi-language text recognition method is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
S201: and acquiring an image to be identified, wherein the image to be identified comprises original characters corresponding to at least two languages.
The image to be recognized is an image on which character recognition needs to be performed. The original characters refer to the characters recorded in the image to be recognized. In this embodiment, the original characters in the image to be recognized correspond to at least two languages, that is, the image to be recognized is an image in which at least two languages coexist. For example, an image to be recognized may contain original characters corresponding to Chinese, English and Japanese.
S202: and carrying out layout analysis and identification on the image to be identified, obtaining at least one text line image, and determining the text line position of each text line image in the image to be identified.
Performing layout analysis and recognition on the image to be recognized means dividing the image's distribution structure and analyzing the image characteristics of each divided partial image to determine the attribute category of that partial image. Attribute categories include, but are not limited to, text block, image block and table block. A text line image is an image formed so as to contain at least one original character. Generally, if the typesetting type of the text line images is horizontal, each text line image comprises one row of original characters; if the typesetting type is vertical, each text line image comprises one column of original characters.
Specifically, the server may first perform image preprocessing operations on the image to be recognized, such as gray-level conversion, binarization, smoothing and edge detection, and then perform layout analysis and recognition on the preprocessed image to obtain at least two block images; attribute analysis is then carried out on each block image to obtain its attribute category, and the block images whose attribute category is text block are cropped out and determined to be text line images. It is understood that the layout analysis of the image to be recognized may be performed by, but is not limited to, a bottom-up layout analysis algorithm under multi-level credibility guidance, a layout analysis algorithm based on neighborhood characteristics, or a projection-based layout analysis algorithm combined with a bottom-up layout analysis algorithm.
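As a concrete illustration of the preprocessing and block-extraction step above, the following is a minimal Python/OpenCV sketch; the morphology kernel size and threshold parameters are illustrative assumptions, not values from the patent:

```python
import cv2

def extract_candidate_blocks(image_path):
    """Preprocess a page image and return candidate block rectangles (x, y, w, h)."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)                      # gray-level conversion
    smooth = cv2.GaussianBlur(gray, (3, 3), 0)                        # smoothing
    binary = cv2.adaptiveThreshold(smooth, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 31, 15)     # binarization
    # Dilate horizontally so the characters of one block fuse into one connected region.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 5))
    fused = cv2.dilate(binary, kernel)
    contours, _ = cv2.findContours(fused, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]
```

Attribute classification of each block (text block vs. image block vs. table block) would then run on these rectangles; that classifier is outside the scope of this sketch.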
Specifically, after performing layout analysis and recognition on an image to be recognized and acquiring at least one text line image, the server needs to determine the text line position of the text line image in the image to be recognized so as to perform subsequent text positioning or context combination recognition based on the text line position, thereby ensuring the accuracy of a target recognition text recognized by the image to be recognized.
In a specific embodiment, taking an image to be recognized with horizontal typesetting as an example, after performing layout analysis and recognition and obtaining at least one text line image, the server sorts the text line images by the vertical coordinate of the upper-left corner or center of the area where each one is located, and obtains the line sequence number corresponding to each text line image, thereby obtaining the text line position of each text line image. For example, a text line image with line number 001 corresponds to the first line of original characters, so that subsequent positioning or context semantic analysis can be performed based on the line numbers, which helps improve text recognition accuracy.
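For a horizontally typeset page, assigning line sequence numbers amounts to sorting the detected line rectangles by their top coordinate; a small sketch under that assumption:

```python
def assign_line_numbers(line_boxes):
    """line_boxes: list of (x, y, w, h) rectangles of detected text lines.
    Returns [(line_number, box), ...] ordered top to bottom."""
    ordered = sorted(line_boxes, key=lambda box: box[1])   # sort by upper-left y coordinate
    return [(i + 1, box) for i, box in enumerate(ordered)]
```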
S203: and carrying out text type recognition on each text line image to obtain a target text type corresponding to each text line image.
The target text type refers to the language type corresponding to the original characters in a text line image, such as Chinese, English or Japanese. Specifically, the server performs text type recognition on each text line image to determine its target text type, so that the corresponding target OCR recognition model can be selected based on that text type and used to recognize the text line image. In this way the multilingual image to be recognized is converted into several single-language text line images, and the corresponding text line images are recognized with several target OCR recognition models, which improves the recognition accuracy of the whole image to be recognized.
S204: and inquiring the recognition model database based on the target text, and acquiring a target OCR recognition model corresponding to the target text.
The recognition model database is a database for storing the text-type OCR recognition models used to recognize different languages. Each text-type OCR recognition model corresponds to one text type (i.e., one language category); that is, it is an OCR recognition model for recognizing the characters of that text type. Specifically, after obtaining the target text type corresponding to each text line image, the server queries the recognition model database with each recognized target text type and determines the text-type OCR recognition model corresponding to that target text type as the target OCR recognition model, so that each text line image is recognized with its own target OCR recognition model, which improves the accuracy of text recognition on the text line image.
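Conceptually, the recognition model database is a mapping from text type to a dedicated OCR model; the sketch below uses hypothetical class and key names to show the lookup:

```python
class OCRModel:
    """Stand-in for a per-language OCR recognition model (hypothetical)."""
    def __init__(self, language):
        self.language = language
    def transcribe(self, line_image):
        raise NotImplementedError  # a real model would return the recognized characters

# Hypothetical registry: one model per text type.
RECOGNITION_MODEL_DB = {lang: OCRModel(lang) for lang in ("zh", "en", "ja", "ko")}

def get_target_ocr_model(target_text_type):
    # Query the database with the target text type recognized for a line image.
    return RECOGNITION_MODEL_DB[target_text_type]
```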
S205: and performing transcription recognition on the text line image by using a target OCR recognition model to obtain target characters corresponding to the text line image.
Specifically, the server can determine the recognition order from the text line position of each text line image, and transcribe each text line image with its target OCR recognition model to obtain the target characters corresponding to that text line image. This ensures the recognition accuracy of the target characters of each text line image and avoids the low accuracy that results from recognizing text line images of different languages with the same OCR recognition model. The target characters are the characters obtained by recognizing a text line image with its target OCR recognition model.
S206: and acquiring a target recognition text corresponding to the image to be recognized based on the target text corresponding to the text line image and the text line position.
Specifically, after recognizing the target characters corresponding to each text line image, the server re-typesets them based on the text line position of each text line image to obtain a target recognition text with the same layout as the image to be recognized, so that the original characters at the corresponding positions in the image can be compared and verified against the target recognition text. Taking an image with horizontal typesetting as an example, the text line position is determined by the line sequence number of each text line image; after the target characters of a text line image are recognized, they are typeset according to the line sequence number to obtain the target recognition text corresponding to the image to be recognized, so that subsequent comparison and verification can be performed and the recognition accuracy guaranteed.
In the multilingual text recognition method provided by this embodiment, layout analysis and recognition are first performed on the image to be recognized, which contains original characters of at least two languages, to determine at least one text line image and the corresponding text line position, so that the multilingual image to be recognized is converted into single-language text line images for recognition, improving accuracy in the subsequent recognition process. After text type recognition determines the target text type of each text line image, each text line image is recognized with the target OCR recognition model corresponding to its target text type, ensuring the accuracy of the target characters recognized from each text line image. Finally, based on the text line positions of the text line images, the target characters are re-typeset to obtain the target recognition text corresponding to the image to be recognized, so that comparison and verification can be carried out based on the target recognition text and the recognition accuracy is guaranteed.
In one embodiment, as shown in fig. 3, before step S201, that is, before the image to be recognized is acquired, the multilingual text recognition method further includes:
s301: and acquiring an original image, detecting the original image by adopting a ambiguity detection algorithm, and acquiring the ambiguity of the original image.
The original image refers to an unprocessed image acquired by the server. The blur degree detection algorithm is an algorithm for detecting the blur degree of an image. The ambiguity detection algorithm may be a detection algorithm commonly used in the art. The blurring degree of the original image is a numerical value for reflecting the blurring degree of the original image, and the larger the blurring degree is, the more blurring of the original image is described; accordingly, the smaller the blur, the clearer the original image is explained.
The blur detection algorithm in this embodiment may employ, but is not limited to, the Laplace operator (Laplacian) for blur detection. The Laplace operator is a second-order differential operator suited to improving image blur caused by diffuse reflection of light. The principle is that, while an image is captured, light spots diffusely reflect light to the surrounding area, and the degree of image blurring caused by this diffuse reflection is often a constant multiple of the Laplacian response compared with an image captured under normal conditions. Thus, step S301 specifically includes the following steps:
s3011: and sharpening the original image by using the Laplacian operator to obtain a sharpened image and pixel gray values of the sharpened image. The method comprises the steps that a server firstly processes an original image by using a Laplace operator to obtain a Laplace image describing grey level mutation, and then the Laplace image is overlapped with the original image to obtain a sharpened image. After the sharpened image is acquired, the RGB value of each pixel point in the sharpened image is acquired, and the RGB value is processed to acquire the pixel gray value corresponding to the sharpened image.
S3012: and carrying out variance calculation on the pixel gray values of the sharpened image, obtaining a target variance value corresponding to the sharpened image, and determining the target variance value as the ambiguity corresponding to the original image. The server calculates the square sum of the pixel gray value of each pixel point in the sharpened image minus the average gray value of the sharpened image, and divides the square sum by the number of the pixel points to obtain the target variance value capable of reflecting the blurring degree of the sharpened image.
It can be appreciated that sharpening the original image with the Laplace operator yields a sharpened image with clearer detail than the original, improving image definition. The target variance value of the sharpened image then characterizes the spread of its pixel gray values. Using this target variance value as the blur degree of the original image allows blur filtering of original images according to how the blur degree compares with preset thresholds, achieving the goal of obtaining clearer original images.
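A minimal Python/OpenCV sketch of S3011-S3012; note that, taken as a variance, a lower score normally indicates a blurrier image, so the comparison direction against the thresholds is a deployment detail:

```python
import cv2
import numpy as np

def blur_degree(image_path):
    """Sharpen with the Laplacian, then return the variance of the sharpened image."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE).astype(np.float64)
    laplacian = cv2.Laplacian(gray, cv2.CV_64F)   # image of abrupt gray-level changes
    sharpened = gray + laplacian                  # superimpose onto the original image
    return float(sharpened.var())                 # target variance value
```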
S302: and if the ambiguity is greater than the first ambiguity threshold, acquiring ambiguity prompt information.
The first blurring threshold value is a threshold value which is preset by the system and used for evaluating whether the highest blurring degree of the image to be identified can be achieved. The blurring cue information is information for cue that the image is too blurred.
Specifically, after comparing the ambiguity of the original image with the first ambiguity threshold, if the ambiguity of the original image is greater than the first ambiguity threshold, it is indicated that the original image is too ambiguous, and if the original image is directly text-recognized, the accuracy of text recognition in the original image may be recognized, so that the ambiguity prompt information may be obtained and sent to the corresponding terminal, so that the user may upload the original image to the server again based on the ambiguity prompt information.
S303: if the ambiguity is not greater than the first ambiguity threshold and is greater than the second ambiguity threshold, sharpening and correcting the original image to obtain the image to be identified.
The second blurring threshold is a threshold preset by the system and used for evaluating whether the minimum blurring degree of the clearer image to be recognized can be evaluated. It will be appreciated that the first blur threshold is greater than the second blur threshold.
Specifically, when the ambiguity of the original image is greater than the second ambiguity threshold and not greater than the first ambiguity threshold, that is, when the ambiguity of the original image is between the second ambiguity threshold and the first ambiguity threshold, it is indicated that the original image is not excessively ambiguous but does not reach the sharpness standard, and at this time, the sharpening process is required to be performed on the original image to improve the sharpness of the original image; and then, carrying out correction processing on the sharpened original image to obtain a clearer and non-inclined image to be identified, thereby ensuring the accuracy of text identification on the image to be identified subsequently. In general, when optical scanning is performed, the original image scanned is not properly positioned for objective reasons, which affects the accuracy of the later image recognition processing, and therefore, the image correction work is required for the image. The key to image tilt correction is to automatically detect the tilt direction and tilt angle of an image based on the image characteristics. The current common inclination angle method comprises the following steps: projection-based methods, hough transform-based methods, linear fitting-based methods, and Fourier transform-based methods for detection into the frequency domain.
S304: and if the ambiguity is not greater than the second ambiguity threshold, correcting the original image to obtain the image to be identified.
Specifically, when the ambiguity of the original image is not greater than the second ambiguity threshold, the server indicates that the original image is clearer, sharpening processing is not needed to enhance the definition of the original image, and therefore image processing efficiency is improved; when the original image is generated, the position of the original image may be incorrect due to various objective reasons, so that the server needs to perform correction processing on the original image to obtain a non-inclined image to be identified after the correction processing.
In the multilingual text recognition method provided by this embodiment, processing is chosen according to how the blur degree of the original image compares with the first and second blur thresholds, ensuring that a clear, non-tilted image to be recognized is finally obtained. This guarantees the accuracy of text recognition based on the image to be recognized and prevents blur or tilt from interfering with the recognition result.
In an embodiment, as shown in fig. 4, step S202, that is, performing layout analysis and recognition on an image to be recognized, obtaining at least one text line image, and determining a text line position of each text line image in the image to be recognized, specifically includes the following steps:
S401: and carrying out text positioning on the image to be identified by adopting a text positioning algorithm to obtain at least one text line area.
The text positioning algorithm is an algorithm for locating characters in an image. In this embodiment, text positioning algorithms include, but are not limited to, the proximity search algorithm and the CTPN-RNN algorithm. A text line region is a region containing original characters, identified from the image to be recognized by the text positioning algorithm on the basis of one row or one column of original characters.
Taking horizontal proximity search as an example, the proximity search algorithm starts from a connected region, finds that region's horizontally circumscribed rectangle, and expands the connected region to the whole rectangle. When the distance between the connected region and its nearest neighbouring region is smaller than a certain range, the rectangle is expanded towards the nearest neighbouring region, and the expansion operation is performed if and only if that direction is horizontal, so that at least one text line region is determined from the image. This effectively merges the original characters of one row of the image into a single text line region, achieving the goal of text positioning. Taking horizontal expansion as an example, the process of locating at least one text line region with the proximity search algorithm is as follows. For any two rectangular regions formed by one or more original characters in the image to be recognized, compute the central vector difference $(x_c, y_c)$, i.e. the vector formed by the center points of the two rectangular regions. Subtract from the central vector difference the distances from the two center points to their rectangle boundaries to obtain the boundary vector difference:

$$(x'_c, y'_c) = \left(|x_c| - \frac{a_1 + a_2}{2},\ |y_c| - \frac{b_1 + b_2}{2}\right)$$

where $(x'_c, y'_c)$ is the boundary vector difference, $(x_c, y_c)$ is the central vector difference, $a_1$ and $b_1$ are the length and width of the first rectangular region, and $a_2$ and $b_2$ are the length and width of the second rectangular region. The distance $d$ between the two rectangular regions is then computed with the distance formula

$$d = \sqrt{\max(x'_c, 0)^2 + \max(y'_c, 0)^2}$$

where $\max()$ is the function returning the larger of its arguments. If the distance $d$ is smaller than a certain range, the expansion operation is performed on the text line; at least one text line region is thereby obtained, and the proximity search method obtains it quickly.
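A small numeric sketch of the boundary vector difference and distance just defined; clamping negative components to zero (so overlapping rectangles get distance 0) is an assumption consistent with the max() in the formula:

```python
import math

def rect_distance(r1, r2):
    """r = (cx, cy, a, b): center point plus length a and width b of a character rectangle."""
    xc = abs(r1[0] - r2[0])                       # central vector difference, x component
    yc = abs(r1[1] - r2[1])                       # central vector difference, y component
    xb = xc - (r1[2] + r2[2]) / 2                 # boundary vector difference
    yb = yc - (r1[3] + r2[3]) / 2
    return math.hypot(max(xb, 0.0), max(yb, 0.0))

# Two 20x10 character boxes on one line, centers 24 px apart horizontally -> d = 4
print(rect_distance((100, 50, 20, 10), (124, 50, 20, 10)))
```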
CTPN (Connectionist Text Proposal Network, hereinafter CTPN) is a model for accurately locating text lines in an image; CTPN can identify the coordinates of the four corners of each text line. The main purpose of an RNN (Recurrent Neural Network, hereinafter RNN) is to process and predict sequence data; the nodes between its hidden layers are connected, and the input of a hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step. The CTPN-RNN algorithm locates at least one text line region in the image to be recognized by seamlessly combining CTPN with the RNN convolutional network, so that text lines in the image can be accurately located and the text line regions determined from the position of each text line; that is, the CTPN-RNN algorithm automatically identifies at least one text line region, and the seamless combination of CTPN and RNN effectively improves detection precision.
S402: and adopting a screenshot tool to screenshot at least one text line area, obtaining at least one text line image, and determining the text line position of each text line image in the image to be identified according to the screenshot sequence of the screenshot tool.
Specifically, the server captures the at least one text line region with OpenCV to obtain the corresponding text line images, and determines the text line position of each text line image in the image to be recognized according to the capture order. OpenCV (Open Source Computer Vision Library) is a cross-platform computer vision library released under the BSD (open source) license that runs on Linux, Windows, Android and Mac OS operating systems. It is lightweight and efficient, consists of a series of C functions and a small number of C++ classes, provides interfaces for Python, Ruby, MATLAB and other languages, and implements many general-purpose algorithms in image processing and computer vision. In this embodiment, the capture operation is performed with OpenCV on the coordinates of the four corners of each text line region to obtain the corresponding text line image; capturing with OpenCV keeps the computation simple, efficient and stable. Each text line image corresponds to a text line position, which may be the coordinates of the four vertices (e.g., the upper-left corner) or of the center point of the text line image, so that the location of the text line image in the image to be recognized can be determined from the text line position.
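Cropping each located region is a slice of the image array; a sketch assuming axis-aligned (x, y, w, h) boxes:

```python
def capture_text_lines(page_img, line_boxes):
    """page_img: numpy array of the page; line_boxes: (x, y, w, h) per text line region.
    Returns text line images paired with their text line positions."""
    captures = []
    for x, y, w, h in line_boxes:
        captures.append({"image": page_img[y:y + h, x:x + w],  # the text line image
                         "position": (x, y, w, h)})            # kept for re-typesetting
    return captures
```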
In the multilingual text recognition method provided by this embodiment, a text positioning algorithm first locates, quickly and accurately, text line regions each containing one row or one column of original characters. A capture tool then captures each text line region to obtain at least one text line image, so that the image to be recognized is divided into text line images that can be recognized one by one in subsequent text recognition, avoiding the inaccuracy caused by recognizing text line images of different languages with the same recognition model. Finally, the text line position corresponding to each text line image is obtained from the capture order, so that subsequent positioning or context semantic analysis can be based on the text line positions, improving text recognition accuracy.
In one embodiment, the multilingual text recognition system performs text type recognition with the Encoder-Summarizer mechanism shown in fig. 7: the Encoder component converts a text line image into a feature sequence, and the Summarizer component then aggregates the feature sequence and executes the classification task, thereby determining the corresponding target text type. As shown in fig. 5, step S203, i.e. performing text type recognition on each text line image to obtain the target text type corresponding to each text line image, specifically includes the following steps:
S501: and converting the format of the text line image to obtain a feature map conforming to a preset format.
The preset format is the format required of the feature maps input to the Encoder component for encoding. The feature map is the image derived from the text line image that can be input to the Encoder component. The preset format comprises a preset height, a preset width and a preset number of channels. The preset number of channels is set to 1, so that the feature map is a grayscale image, which reduces the amount of computation in subsequent processing. Fixing the preset height and width effectively reduces the interference of varying width and height on the encoding performed by the Encoder component, ensuring the accuracy of the resulting feature vector.
Specifically, the server's format conversion of a text line image proceeds as follows: first, the text line image is grayed and converted into the corresponding grayscale image, so that the number of channels is 1; the grayscale image is then scaled to a first image matching the preset height h; whether the width of the first image reaches the preset width is then judged: if it does, the first image is used directly as the feature map; if not, black or white regions are added at the left and right edges of the first image to convert it into a feature map matching the preset width. It can be understood that converting text line images into feature maps conforming to the preset format removes the influence of these factors (width, height and number of channels) from subsequent image processing, so the final recognition is more accurate.
For example, let the width, height and number of channels of a text line image be w, h and d respectively; to reduce the workload of the subsequent format conversion, the text line image may be converted into a grayscale image in advance so that d = 1. After format conversion, a feature map with preset width, height and number of channels w', h' and d' is obtained; in this embodiment h' = 1 and d' = 1, i.e., the height and number of channels are kept uniform across feature maps during format conversion, ensuring the accuracy of the subsequent encoding and recognition.
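A sketch of the S501 format conversion; the concrete preset height and width values and the white padding are illustrative assumptions:

```python
import cv2
import numpy as np

def to_feature_map(line_img, preset_h=32, preset_w=512, pad_value=255):
    """Grayscale (d = 1), scale to the preset height, then pad or clip to the preset width."""
    gray = cv2.cvtColor(line_img, cv2.COLOR_BGR2GRAY)          # channel count becomes 1
    scale = preset_h / gray.shape[0]
    first = cv2.resize(gray, (max(1, round(gray.shape[1] * scale)), preset_h))
    if first.shape[1] >= preset_w:                             # width already reached
        return first[:, :preset_w]
    pad = preset_w - first.shape[1]                            # add margins at the edges
    return np.pad(first, ((0, 0), (pad // 2, pad - pad // 2)),
                  constant_values=pad_value)
```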
S502: and adopting an Encoder component to encode the feature map to obtain corresponding feature vectors.
As shown in fig. 7, the Encoder component is an encoder built by combining convolution layers (Conv), a max pooling layer (MaxPool) and an Inception layer, and is used to obtain the feature vector corresponding to the feature map. As shown in fig. 7(a), the Encoder component comprises, in order, a convolution layer (Conv), a max pooling layer (MaxPool), an Inception layer and four convolution layers (Conv), whose convolution kernel sizes and activation functions are as shown in the figure and can be set independently according to actual requirements. The Inception layer is formed by combining several convolution layers (Conv) and an average pooling layer (AvgPool) according to the structure shown in fig. 7(c). As figs. 7(a) and 7(c) show, the model structure of the Encoder component is comparatively simple, which helps speed up computation and save computing cost. During encoding, the Encoder component extracts the information of each pixel in the feature map conforming to the preset format to obtain a feature vector that uniquely identifies the feature map. The goal of Inception is to design a network with a good local topology, i.e., to perform multiple convolution or pooling operations on the input in parallel and to concatenate all the outputs into a very deep feature map. Because different convolution operations such as 1×1, 3×3 or 5×5 and pooling operations obtain different information from the input feature map, processing them in parallel with Inception and combining all the results yields a better image representation, i.e., the feature vector, ensuring the accuracy of the obtained feature vector.
S503: and integrating and classifying the feature vectors by adopting a Summarizer component to obtain at least one text probability, and outputting the identification text with the maximum text probability as a target text corresponding to the text line image.
Specifically, the summerizer component is a classifier preset in the multilingual text recognition system for performing classification tasks to determine corresponding target text types. As shown in fig. 7 (b), the summerizer component includes three convolutional layers (Conv), the first two of which employ a Relu activation function, and the last of which employs a Sigmoid activation function. As shown in fig. 7 (b), the model structure of the summerizer component is simpler, which is helpful to increase the calculation speed and save the calculation cost. The server integrates and classifies the feature vectors output by the Encoder component by adopting the Summarizer component to obtain at least one text probability, wherein each text probability corresponds to a recognition text; and then, carrying out probability distribution processing on at least one text probability by adopting a Softmax function so as to output the text with the maximum text probability as a target text corresponding to the text line image, thereby guaranteeing the accuracy of the identified target text. In this embodiment, the probability of at least one text species identified by the summerizer component may exist in the form of an array, for example, represented by |s|, where each number in the array corresponds to a probability of identifying a text species, and the |s| is an array formed by four values of 0.7, 0, 0.2, and 0.1, and according to the order of the four values, the probabilities refer to chinese, english, japanese, and korean, respectively, and the target text species may be determined to be chinese based on the |s|, where |s| refers to a code point sequence.
In this embodiment, let x be the feature vector corresponding to the text line image and y the code point sequence to be output, where the code point sequence may be the sequence associated with the text type probabilities formed after the Summarizer component integrates the feature vector. Modeling probabilistically, the conditional probability is P(y|x), the probability that the decoded code point sequence is y given x; the component outputs the code point sequence with the highest probability. Assuming s ∈ S, where S represents the set of text types, and taking s as a latent variable, the text type information can be folded into the conditional probability P(y|x) to obtain the following formula:

$$\hat{y} = \operatorname*{argmax}_{y} P(y \mid x) = \operatorname*{argmax}_{y} \sum_{s \in S} P(y \mid s, x)\, P(s \mid x)$$

where P(y|s, x) represents an OCR model capable of recognizing a given text type; x is the feature vector of an image of fixed height (h' = 1); s is a text type; that is, P(y|s, x) is the probability that the code point sequence is y given the text type s and the image feature vector x. argmax is the function that returns the argument maximizing a function over its (set of) parameters.
Specifically, the calculation formula of P(y|s, x) is as follows:

$$P(y \mid s, x) = P(y \mid s) \prod_{i=1}^{|C(y)|} \frac{P(c_i \mid s, x)}{P(c_i \mid s)}$$

where C(y) denotes the function that converts y into the corresponding sequence of glyph clusters $(c_1, c_2, \ldots, c_{|C(y)|})$; $c_i$ denotes the i-th glyph cluster of y; and P(c|s, x), P(c|s) and P(y|s) represent, respectively, the OCR recognition model for the text type s, the glyph cluster prior, and the language model. P(c|s, x) is the probability that the glyph cluster is c given the text type s and the known feature vector x; P(c|s) is the probability of the glyph cluster c given the text type s; and P(y|s) is the probability that the code point sequence is y given the text type s.
A glyph cluster is a string of characters that has a certain uniform characteristic within a character stream and is inseparable in typesetting; in the normal case it is a single character (a Chinese character or a letter), but in some cases it is a character string arranged according to special rules. If a glyph cluster is composed of several characters, it has the following properties: 1) within a character line, the glyph cluster behaves like a single character in a traditional typesetting system; 2) the relation and layout between the characters within the glyph cluster depend only on character attributes and not on typesetting rules; 3) the characters in a glyph cluster should have the same character and font attributes and can be output in one pass.
It will be appreciated that after training the Encoder component and the Summarizer component, the system measured in advance the confusion rates of text type recognition for the traditional Seq2Seq recognition model and for the Encoder-Summarizer mechanism of this embodiment, so as to judge text type recognition accuracy from the confusion rate. The confusion rate is the probability that one text type is mistaken for another. The testing process comprises: (1) acquiring multilingual image test samples, each carrying a text type label, i.e., a label of the text type of the characters in the sample; (2) testing the image test samples with the traditional Seq2Seq recognition model and with the method of steps S501-S503 respectively, obtaining for each sample a result label, i.e., a label of the text type recognized from the sample; (3) based on the text type labels and result labels of all image test samples, computing the confusion rate for any two text types, which can be taken as the number of times the two text types are mistaken for each other divided by the total number of occurrences of the two text types. According to the test results, the confusion rate of recognizing Cyrillic as Latin is 4.2% with the traditional Seq2Seq recognition model, and 1.8% with the Encoder-Summarizer. As shown in Table 1, under the same test conditions and for most text types, recognition with the Encoder-Summarizer has a confusion rate far lower than that of the traditional algorithm; therefore, in this embodiment, recognizing text types with the Encoder-Summarizer improves the accuracy of the recognized target text type and of the subsequent character recognition. As shown in fig. 7, the model structures of the Encoder component and the Summarizer component are comparatively simple, which helps speed up computation and save computing cost.
Table 1. Confusion rate test results (reproduced as an image in the original publication; the figures are not recoverable here).
In the multilingual text recognition method provided by this embodiment, the text line image is format-converted into a feature map conforming to the preset format, reducing the computation of subsequent encoding and improving encoding efficiency; the Encoder component encodes the feature map to obtain the corresponding feature vector quickly and accurately; the Summarizer component then integrates and classifies the feature vector to obtain at least one text type probability and outputs the text type with the largest probability as the target text type of the text line image, ensuring both the accuracy of the recognized target text type and the recognition efficiency.
In one embodiment, as shown in fig. 6, step S205, i.e. performing transcription recognition on the text line images with the target OCR recognition models to obtain the target characters corresponding to the text line images, specifically includes the following steps:
s601: and integrating adjacent text line images corresponding to the same target text line image based on the target text type and the text line position corresponding to the text line image to form a text block image.
Since the text between the text contexts may have a specific relation, that is, the text between the contexts may have a specific semantic meaning when being connected, when the server identifies the text line images, the server needs to integrate adjacent text line images corresponding to the same target text line image into a text block image based on the target text type and the text line position corresponding to the text line image, so as to take the text block image as an integrally identified object, thereby ensuring the accuracy of the target text corresponding to each text line image identified based on the text block image.
Specifically, after identifying the target text type corresponding to each text line image, the server sorts all the text line images by their text line positions, so as to integrate all text line images that are adjacent in position and belong to the same target text type into one text block image; the context semantics within the text block image can then be fully considered during subsequent recognition, improving text recognition accuracy.
S602: cutting the text block image by adopting a text cutting algorithm to obtain at least two single-character images, and obtaining the identification sequence and the row and column labels corresponding to each single-character image according to the typesetting type and the cutting sequence of the text block image.
The text cutting algorithm is an algorithm for cutting a text block image into single-character images, and the text cutting algorithm can be a projection-based text cutting algorithm. For example, when a text block image is switched by adopting a text cutting algorithm based on projection, each text line image can be projected in the vertical direction in sequence to obtain vertical projection pixels, and if continuous pixels meet a preset condition, an original text exists in the area corresponding to the continuous pixels, and cutting is performed to form a single-character image.
Wherein, typesetting types of the text block images comprise transverse typesetting and longitudinal typesetting. The recognition sequence corresponding to the single font image means that the sequence of a certain single font image in the whole text block image is determined according to the typesetting position of the single font image in the text block image. The row and column labels corresponding to a single character image refer to row labels and column labels of a certain single character image in a text block image. For example, a text block image is formed by transversely typesetting 3 text line images, the numbers of words are 18, 20 and 17 respectively, 55 single-word body images are formed by cutting, the identification sequence of the single-word body images corresponding to the 3 rd word of the 2 nd line is 21, the line label is 2, and the column label is 3.
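A sketch of projection-based cutting for one binarized horizontal line image; the ink threshold and minimum character width are illustrative assumptions:

```python
import numpy as np

def cut_into_single_characters(line_img, ink_threshold=128, min_width=2):
    """Cut a horizontal line image into single-character images by vertical projection."""
    ink = (line_img < ink_threshold).sum(axis=0)   # vertical projection profile
    pieces, start = [], None
    for x, count in enumerate(ink):
        if count > 0 and start is None:
            start = x                              # entering a character column
        elif count == 0 and start is not None:
            if x - start >= min_width:
                pieces.append(line_img[:, start:x])
            start = None
    if start is not None:                          # character touching the right edge
        pieces.append(line_img[:, start:])
    return pieces
```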
S603: inputting the single character images into a target OCR recognition model for transcription recognition according to the recognition sequence corresponding to the single character images, and obtaining at least one recognition result corresponding to the single character images and the recognition probability corresponding to each recognition result.
Specifically, the server inputs the single-character images into an OCR recognition model for transcription recognition according to the recognition sequence corresponding to each single-character image, and obtains at least one recognition result corresponding to each single-character image and recognition probability corresponding to the recognition result. In this embodiment, only the first 3 recognition results with a larger recognition probability and the corresponding recognition probabilities thereof may be selected, so as to reduce the workload of the subsequent recognition processing. The recognition result refers to the result recognized by each single character image, and can be a recognition symbol or a recognition text. The recognition probability corresponding to each recognition result refers to the probability of being recognized as the recognition result from the single character image. For example, for a single character image containing the original character "it" the recognized characters are "it", "even" and "kanji", respectively, the corresponding recognition probabilities are 99.99%, 84.23% and 47.88%, respectively.
S604: dividing all single-character images in the text block image into at least two units to be identified based on the recognition results and the typesetting type.
Specifically, the server divides all single-character images in the text block image into at least two units to be identified, based on the at least one recognition result corresponding to each single-character image and on the typesetting type corresponding to the text block image. A unit to be identified is the smallest unit on which semantic recognition is performed: each unit contains the characters of a complete sentence, so that its semantics can be analyzed, which helps improve the accuracy of text recognition. Since a recognition result may be either a recognition symbol or a recognition text, the division can be made according to whether the recognition results contain recognition symbols; when they do not, the division can be made according to the typesetting type of the text block image, thereby improving the recognition accuracy of the units to be identified.
In one embodiment, step S604, namely dividing all single-character images in the text block image into at least two units to be identified based on the recognition results and the typesetting type, specifically includes the following steps:
S6041: if the recognition results contain recognition symbols, forming a unit to be identified from all single-character images between any two adjacent recognition symbols.
Specifically, if the recognition results of the single-character images cut from a text block image include recognition symbols, that is, punctuation marks recognized by the target OCR recognition model, then the text corresponding to that text block is separated by punctuation, and the semantics of the context are tied to the positions of the punctuation marks. In this case the server forms all single-character images between any two adjacent recognition symbols into one unit to be identified, and uses that unit as the smallest unit for semantic recognition, so that the text corresponding to each unit is a single sentence. This helps improve the recognition accuracy of the text corresponding to each unit to be identified and reduces recognition complexity.
Specifically, when the server recognizes a single-character image with the target OCR recognition model, it obtains at least one recognition result and the recognition probability corresponding to each result; if the result with the largest recognition probability is a recognition symbol, or the probability of a recognition symbol is greater than the preset probability threshold, the recognition results are considered to include a recognition symbol. The preset probability threshold is the threshold for judging that an image is indeed recognized as a given character or symbol, and may be set to a relatively high value to ensure recognition accuracy.
Further, when the server finds that the recognition results of the single-character images cut from a text block image include recognition symbols, it may judge whether the recognition symbols include a preset symbol; if so, it forms a unit to be identified from all single-character images between any two adjacent preset symbols, which further improves the recognition accuracy of the units to be identified. The preset symbols are punctuation marks preset by the system to mark the end of a sentence, including but not limited to periods, question marks and exclamation marks.
S6042: if the recognition results do not contain a recognition symbol, forming a unit to be identified from all single-character images sharing the same row label or the same column label, according to the typesetting type of the text block image and the row-column labels corresponding to the single-character images.
Specifically, if the recognition results of the single-character images cut from a text block image contain no recognition symbol, the text corresponding to that text block is not separated by punctuation; in such cases the characters in the same row or the same column generally form one sentence. The server therefore forms a unit to be identified from all single-character images sharing the same row label or the same column label, according to the typesetting type of the text block image and the row-column labels of the single-character images, which helps improve the recognition accuracy of the units to be identified.
It can be appreciated that the server forms units to be identified from the single-character images between adjacent recognition symbols when the recognition results contain recognition symbols, and from all single-character images sharing the same row label or column label, according to the typesetting type and the row-column labels, when they do not. This ensures to a large extent that each unit to be identified contains a complete sentence, which facilitates subsequent semantic analysis and improves recognition accuracy.
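A rough sketch of the two division branches, S6041 and S6042, might look as follows; the record layout of `chars`, the `SENTENCE_END` set and the `layout` flag are assumptions introduced for illustration.

```python
SENTENCE_END = {"。", "？", "！", ".", "?", "!"}   # preset end-of-sentence symbols

def split_units(chars, layout):
    # chars: list of dicts with keys 'top1' (best recognition result),
    # 'row' and 'col'; layout: 'horizontal' or 'vertical'.
    has_symbol = any(c["top1"] in SENTENCE_END for c in chars)
    units, current = [], []
    if has_symbol:
        # Branch S6041: close a unit at every recognized sentence-end symbol.
        for c in chars:
            current.append(c)
            if c["top1"] in SENTENCE_END:
                units.append(current)
                current = []
    else:
        # Branch S6042: no punctuation, so group by row label (horizontal
        # typesetting) or by column label (vertical typesetting).
        key = "row" if layout == "horizontal" else "col"
        last = None
        for c in chars:
            if last is not None and c[key] != last:
                units.append(current)
                current = []
            current.append(c)
            last = c[key]
    if current:
        units.append(current)
    return units
```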
S605: acquiring the single-character text corresponding to each single-character image based on at least one recognition result corresponding to each single-character image in any unit to be identified and the recognition probability corresponding to each recognition result.
Specifically, the server takes the unit to be identified as the smallest unit for semantic recognition. Based on the at least one recognition result corresponding to each single-character image in the unit and on the recognition probability corresponding to each result, the server exploits the contextual semantic relations that may exist among the single-character images to determine more accurately the single-character text corresponding to each image, thereby improving the accuracy of text recognition.
In an embodiment, step S605, namely acquiring the single-character text corresponding to each single-character image based on at least one recognition result corresponding to each single-character image in any unit to be identified and the recognition probability corresponding to each recognition result, specifically includes the following steps:
S6051: if every single-character image in a unit to be identified has a recognition text whose recognition probability is greater than the preset probability threshold, determining that recognition text as the single-character text corresponding to the single-character image.
Specifically, the server compares each recognition probability corresponding to each single-character image in a unit to be identified with the preset probability threshold, one by one, to judge whether each single-character image has a recognition text whose recognition probability exceeds the threshold; if every single-character image does, that recognition text is directly determined as the single-character text corresponding to the image.
Suppose a unit to be identified contains N single-character images, each with M recognition texts and a recognition probability for each recognition text. The server compares the M recognition probabilities of each of the N single-character images with the preset probability threshold in turn, to determine whether each image has a recognition text whose probability exceeds the threshold. If all N images do, the recognition texts are highly reliable and can be adopted directly as the single-character text of each image. For example, with the preset probability threshold set to 95%, suppose the unit to be identified corresponds to the six characters of 今天天气真好 ("the weather is really nice today") and every single-character image has a recognition text whose probability exceeds the threshold, for example 97% for 今, 98% for each 天, 99% for 气, 96% for 真 and 99% for 好; each such recognition text is then determined as the single-character text, ensuring the accuracy of the text recognized in the unit.
S6052: if at least one single-character image in a unit to be identified has no recognition text whose recognition probability is greater than the preset probability threshold, adopting the target language model to evaluate, according to the recognition order corresponding to the single-character images, the word sequences formed from the at least one recognition text corresponding to all single-character images in the unit, acquiring the sequence probability corresponding to each word sequence, and acquiring the single-character text corresponding to each single-character image based on the word sequence with the largest sequence probability.
Specifically, the server compares each recognition probability corresponding to each single-character image in a unit to be identified with the preset probability threshold, one by one, to judge whether each image has a recognition text whose probability exceeds the threshold. If at least one image does not, the accuracy of the recognition text for that image is not assured, and semantic analysis over the context is required to improve the recognition accuracy of the text corresponding to the unit. The server therefore adopts the target language model to evaluate, in recognition order, the word sequences formed from the candidate recognition texts of all single-character images in the unit, obtains the sequence probability corresponding to each word sequence, and acquires the single-character text of each image from the word sequence with the largest sequence probability.
The target language model is a model for semantic analysis of continuous text; it may adopt, but is not limited to, an N-gram model, for example a Chinese language model. A Chinese language model exploits collocation information between adjacent words in context: when continuous, unsegmented pinyin, strokes or digits representing letters or strokes must be converted into a Chinese character string (a sentence), the sentence with the highest probability can be computed, so that Chinese characters are converted automatically, without manual selection by the user, avoiding the ambiguity of many Chinese characters sharing the same pinyin (or stroke string or digit string).
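As a rough illustration of how such a language model can rank candidate word sequences, the sketch below enumerates every sequence of candidate characters and scores each with a bigram model; `bigram_prob` stands in for a trained model, which the patent does not specify.

```python
import math
from itertools import product

def sequence_log_prob(seq, bigram_prob):
    # P(c1..cn) ~ P(c1|<s>) * prod P(c_i | c_{i-1}); bigram_prob must
    # return a smoothed, non-zero probability for any character pair.
    return sum(math.log(bigram_prob(prev, cur))
               for prev, cur in zip(("<s>",) + tuple(seq), seq))

def best_sequence(candidates, bigram_prob):
    # candidates: one list of candidate recognition texts per character
    # position; enumerate every word sequence, keep the most probable.
    return max(("".join(seq) for seq in product(*candidates)),
               key=lambda s: sequence_log_prob(s, bigram_prob))
```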
For example, with the preset probability threshold set to 95%, suppose that among the six single-character images of the unit to be identified corresponding to 今天天气真好, the first image is recognized as 今 with probability 97%; the second image is recognized as 夫 with probability 85% and as 天 with probability 83%; the third image is recognized as 天 with probability 90% and as 夫 with probability 75%; the fourth image is recognized as 气 with probability 99%; the fifth image is recognized as 真 with probability 96%; and the sixth image is recognized as 好 with probability 99%. Since the second and third single-character images have no recognition text whose probability exceeds the preset probability threshold (95%), word sequences are formed from the candidate recognition texts of all the single-character images, such as 今天天气真好, 今夫天气真好, 今天夫气真好 and 今夫夫气真好; the target language model evaluates these word sequences to obtain the sequence probability of each, and the single-character text of each single-character image in the unit is determined from the word sequence with the largest sequence probability, ensuring the accuracy of single-character text recognition.
It can be understood that when every single-character image in a unit to be identified has a recognition text whose recognition probability exceeds the preset probability threshold, the server can directly determine that recognition text as the single-character text corresponding to the image, ensuring both the accuracy and the efficiency of recognition; when at least one image has no such recognition text, the server adopts the target language model to evaluate the word sequences formed from all single-character images and determines the single-character text of each image from the word sequence with the largest sequence probability, so that semantic analysis guarantees the accuracy of single-character text recognition.
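Putting S6051 and S6052 together, a minimal per-unit decision routine could look like the following; it reuses `best_sequence` from the sketch above, and the `results` field layout is an assumption.

```python
def recognize_unit(unit, threshold, bigram_prob):
    # unit: list of dicts; each dict's 'results' is a list of
    # (recognized_char, probability) pairs sorted by descending probability.
    if all(img["results"][0][1] > threshold for img in unit):
        # S6051: every image has a result above the preset threshold.
        return "".join(img["results"][0][0] for img in unit)
    # S6052: fall back to the language model over all candidate sequences.
    candidates = [[char for char, _ in img["results"]] for img in unit]
    return best_sequence(candidates, bigram_prob)
```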
S606: performing page typesetting on the single-character text corresponding to the single-character images according to the recognition order and the row-column label corresponding to each single-character image, and obtaining the target text corresponding to the text line images.
Specifically, after recognizing the single-character text corresponding to each single-character image, the server performs page typesetting on the single-character text of all single-character images according to their recognition order and row-column labels; that is, it places the single-character text of each image at the position indicated by its row-column label, so as to acquire the target text corresponding to the text line images. Subsequent comparison and verification are then performed on this target text to ensure its recognition accuracy. The target text is the text finally recognized from the text line images.
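A sketch of the page-typesetting step: each resolved character is placed at the position given by its row-column label. The record layout is again an assumption.

```python
def typeset(chars):
    # chars: list of dicts with keys 'row', 'col' and 'char' (the resolved
    # single-character text); rebuild the block one line per row label.
    rows = {}
    for c in chars:
        rows.setdefault(c["row"], {})[c["col"]] = c["char"]
    return "\n".join("".join(cols[k] for k in sorted(cols))
                     for _, cols in sorted(rows.items()))
```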
In the multilingual text recognition method provided by this embodiment, adjacent text line images corresponding to the same target text type are integrated into text block images, so that subsequent recognition proceeds in units of text blocks and the context semantics among text line images are fully considered, improving the accuracy of text recognition. Each text block image is cut character by character to obtain single-character images together with their recognition order and row-column labels, which allow each character to be positioned. The target OCR recognition model is then used to recognize each single-character image and determine at least one recognition result and its recognition probability; because the target OCR recognition model is a recognition model dedicated to the target text type, its recognition results are more accurate. Next, all single-character images are divided, according to the recognition results and the typesetting type of the text block image, into at least two units to be identified on which semantic analysis can be performed, ensuring the accuracy of the single-character text recognized for each single-character image. Finally, page typesetting is applied to all single-character text according to the recognition order and row-column labels of the single-character images to obtain the target text corresponding to the text line images, so that comparison and verification can follow and the recognition accuracy of the target text is ensured.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
In an embodiment, a multilingual text recognition device is provided, and the device corresponds one-to-one to the multilingual text recognition method in the above embodiments. As shown in fig. 8, the multilingual text recognition device includes an image to be identified acquisition module 801, a text line image acquisition module 802, a target text type acquisition module 803, a recognition model acquisition module 804, a target text acquisition module 805 and a recognition text acquisition module 806. The functional modules are described in detail as follows:
the image to be identified obtaining module 801 is configured to obtain an image to be identified, where the image to be identified includes original characters corresponding to at least two languages.
The text line image obtaining module 802 is configured to perform layout analysis and recognition on an image to be recognized, obtain at least one text line image, and determine a text line position of each text line image in the image to be recognized.
The target text type obtaining module 803 is configured to identify a text type for each text line image, and obtain a target text type corresponding to each text line image.
The recognition model acquisition module 804 is configured to query a recognition model database based on the target text type, and obtain a target OCR recognition model corresponding to the target text type.
The target text acquisition module 805 is configured to perform transcription recognition on the text line image by using the target OCR recognition model, so as to acquire the target text corresponding to the text line image.
The recognition text obtaining module 806 is configured to obtain a target recognition text corresponding to the image to be recognized based on the target text corresponding to the text line image and the text line position.
Preferably, the multilingual text recognition device further includes, before the image to be identified acquisition module 801:
A blur detection unit, configured to acquire an original image, detect the original image with a blur detection algorithm, and acquire the blur degree of the original image.
A first blur processing unit, configured to acquire blur prompt information if the blur degree is greater than a first blur threshold.
A second blur processing unit, configured to sharpen and correct the original image to acquire the image to be identified, if the blur degree is not greater than the first blur threshold but is greater than a second blur threshold.
A third blur processing unit, configured to correct the original image to acquire the image to be identified, if the blur degree is not greater than the second blur threshold.
Wherein the first blur threshold is greater than the second blur threshold.
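The patent does not name a concrete blur metric. As an illustration only, the sketch below uses the variance of the Laplacian, a common choice in which a sharper image has a higher variance, so the numeric comparisons run opposite to a "blur degree"; the thresholds and the sharpening kernel are assumptions.

```python
import cv2
import numpy as np

SHARPEN_KERNEL = np.array([[0, -1, 0],
                           [-1, 5, -1],
                           [0, -1, 0]], dtype=np.float32)

def preprocess(gray, sharp_enough=300.0, too_blurry=100.0):
    # Variance of the Laplacian: low variance means few edges, i.e. blur.
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    if sharpness < too_blurry:
        # Blur degree above the first blur threshold: prompt for re-capture.
        return None, "image too blurry, please provide a clearer original"
    if sharpness < sharp_enough:
        # Blur degree between the two thresholds: sharpen before correction.
        gray = cv2.filter2D(gray, -1, SHARPEN_KERNEL)
    # Correction (e.g. deskewing) would follow here in both remaining cases.
    return gray, None
```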
Preferably, the text line image acquisition module 802 includes:
the text line region acquisition unit is used for carrying out text positioning on the image to be identified by adopting a text positioning algorithm to acquire at least one text line region.
The text line image determining unit is configured to capture at least one text line area with a screenshot tool, obtain at least one text line image, and determine the text line position of each text line image in the image to be identified according to the screenshot order of the screenshot tool.
Preferably, the target text type acquisition module 803 includes:
A feature map acquisition unit, configured to perform format conversion on the text line image to obtain a feature map conforming to a preset format.
A feature vector acquisition unit, configured to encode the feature map with the Encoder component to acquire the corresponding feature vector.
A target text type output unit, configured to integrate and classify the feature vector with the Summarizer component, acquire at least one text type probability, and output the text type with the largest probability as the target text type corresponding to the text line image.
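For illustration, a minimal PyTorch sketch of such an Encoder/Summarizer pipeline follows; the convolutional architecture, the layer sizes and `n_types` are assumptions, since the patent does not specify the internals of either component.

```python
import torch
import torch.nn as nn

class TextTypeClassifier(nn.Module):
    """Encoder turns the fixed-format feature map of a text line into a
    feature vector; Summarizer integrates and classifies it into
    per-text-type probabilities."""
    def __init__(self, n_types):
        super().__init__()
        self.encoder = nn.Sequential(                 # Encoder component
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())    # -> feature vector
        self.summarizer = nn.Linear(64, n_types)      # Summarizer component

    def forward(self, feature_map):                   # (N, 1, H, W)
        probs = torch.softmax(self.summarizer(self.encoder(feature_map)), dim=1)
        return probs.argmax(dim=1), probs             # target type + probabilities
```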
Preferably, the identifying text acquisition module 806 includes:
The text block image acquisition unit is configured to integrate adjacent text line images corresponding to the same target text type, based on the target text type and the text line position corresponding to the text line image, to form a text block image.
The single-character image acquisition unit is configured to cut the text block image with a text cutting algorithm to acquire at least two single-character images, and to acquire the recognition order and the row-column label corresponding to each single-character image according to the typesetting type and the cutting order of the text block image.
The recognition result acquisition unit is configured to input the single-character images into the target OCR recognition model for transcription recognition according to their recognition order, and to acquire at least one recognition result corresponding to each single-character image and the recognition probability corresponding to each recognition result.
The unit-to-be-identified division unit is configured to divide all single-character images in the text block image into at least two units to be identified based on the recognition results and the typesetting type.
The single-character text acquisition unit is configured to acquire the single-character text corresponding to each single-character image based on at least one recognition result corresponding to each single-character image in any unit to be identified and the recognition probability corresponding to each recognition result.
The target text acquisition unit is configured to perform page typesetting on the single-character text corresponding to the single-character images according to the recognition order and the row-column label of each single-character image, and to acquire the target text corresponding to the text line images.
Preferably, the unit-to-be-identified division unit includes:
A first unit acquisition subunit, configured to form a unit to be identified from all single-character images between any two adjacent recognition symbols, if the recognition results contain recognition symbols.
A second unit acquisition subunit, configured to form a unit to be identified from all single-character images sharing the same row label or the same column label, according to the typesetting type of the text block image and the row-column labels corresponding to the single-character images, if the recognition results do not contain a recognition symbol.
Preferably, the single-character text acquisition unit includes:
A first single-character text acquisition subunit, configured to determine the recognition text whose recognition probability is greater than the preset probability threshold as the single-character text corresponding to each single-character image, if every single-character image in a unit to be identified has such a recognition text.
A second single-character text acquisition subunit, configured to, if at least one single-character image in a unit to be identified has no recognition text whose recognition probability is greater than the preset probability threshold, adopt the target language model to evaluate, according to the recognition order corresponding to the single-character images, the word sequences formed from at least one recognition text corresponding to all single-character images in the unit, acquire the sequence probability corresponding to each word sequence, and acquire the single-character text corresponding to each single-character image based on the word sequence with the largest sequence probability.
For the specific limitations of the multilingual text recognition device, reference may be made to the limitations of the multilingual text recognition method above, which are not repeated here. Each module in the multilingual text recognition device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used to store data employed or generated during execution of the multilingual text recognition method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a multilingual text recognition method.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the multilingual text recognition method in the above embodiments, for example steps S201 to S206 shown in fig. 2, or the steps shown in figs. 3 to 6, which are not repeated here. Alternatively, when executing the computer program, the processor implements the functions of the modules/units of the multilingual text recognition device in the above embodiment, for example the functions of the image to be identified acquisition module 801, the text line image acquisition module 802, the target text type acquisition module 803, the recognition model acquisition module 804, the target text acquisition module 805 and the recognition text acquisition module 806 shown in fig. 8, which are not repeated here.
In an embodiment, a computer readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the multilingual text recognition method in the above embodiments, for example steps S201 to S206 shown in fig. 2, or the steps shown in figs. 3 to 6, which are not repeated here. Alternatively, when executed by a processor, the computer program implements the functions of the modules/units of the multilingual text recognition device described above, for example the functions of the image to be identified acquisition module 801, the text line image acquisition module 802, the target text type acquisition module 803, the recognition model acquisition module 804, the target text acquisition module 805 and the recognition text acquisition module 806 shown in fig. 8, which are not repeated here.
Those skilled in the art will appreciate that all or part of the flows in the methods of the above embodiments may be implemented by instructing the relevant hardware through a computer program, which may be stored in a non-volatile computer readable storage medium; when executed, the program may include the flows of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the division of the functional units and modules described above is illustrated. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the device may be divided into different functional units or modules to perform all or part of the functions described above.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (8)

1. A multilingual text recognition method, comprising:
acquiring an image to be identified, wherein the image to be identified comprises original characters corresponding to at least two languages;
Performing layout analysis and identification on the image to be identified, acquiring at least one text line image, and determining the text line position of each text line image in the image to be identified;
performing format conversion on the text line image to obtain a feature map conforming to a preset format; adopting an Encoder component to encode the feature map to obtain a corresponding feature vector; and adopting a Summarizer component to integrate and classify the feature vector, obtaining at least one text type probability, and outputting the text type with the largest text type probability as a target text type corresponding to the text line image;
querying a recognition model database based on the target text type, and acquiring a target OCR recognition model corresponding to the target text type;
integrating adjacent text line images corresponding to the same target text type into a text block image based on the target text type and the text line position corresponding to the text line image; cutting the text block image by adopting a text cutting algorithm to obtain at least two single-character images, and obtaining a recognition order and a row-column label corresponding to each single-character image according to the typesetting type and the cutting order of the text block image; inputting the single-character images into the target OCR recognition model for transcription recognition according to the recognition order corresponding to the single-character images, and obtaining at least one recognition result corresponding to each single-character image and a recognition probability corresponding to each recognition result; dividing all single-character images in the text block image into at least two units to be identified based on the recognition results and the typesetting type; acquiring the single-character text corresponding to each single-character image based on at least one recognition result corresponding to each single-character image in any unit to be identified and the recognition probability corresponding to each recognition result; and performing page typesetting on the single-character text corresponding to the single-character images according to the recognition order and the row-column label corresponding to each single-character image, and obtaining the target text corresponding to the text line images;
and acquiring a target recognition text corresponding to the image to be identified based on the target text corresponding to the text line image and the text line position.
2. The multilingual text recognition method of claim 1, wherein, before the acquiring an image to be identified, the multilingual text recognition method further comprises:
acquiring an original image, detecting the original image by adopting a blur detection algorithm, and acquiring a blur degree of the original image;
if the blur degree is greater than a first blur threshold, acquiring blur prompt information;
if the blur degree is not greater than the first blur threshold and is greater than a second blur threshold, sharpening and correcting the original image to obtain the image to be identified;
if the blur degree is not greater than the second blur threshold, correcting the original image to obtain the image to be identified;
wherein the first blur threshold is greater than the second blur threshold.
3. The multilingual text recognition method according to claim 1, wherein the performing layout analysis and recognition on the image to be identified, acquiring at least one text line image, and determining the text line position of each text line image in the image to be identified comprises:
Performing text positioning on the image to be identified by adopting a text positioning algorithm to obtain at least one text line area;
and adopting a screenshot tool to screenshot at least one text line area, obtaining at least one text line image, and determining the text line position of each text line image in the image to be identified according to the screenshot sequence of the screenshot tool.
4. The multilingual text recognition method of claim 1, wherein the dividing all single-character images in the text block image into at least two units to be identified based on the recognition results and the typesetting type comprises:
if the recognition results contain recognition symbols, forming a unit to be identified based on all single-character images between any two adjacent recognition symbols;
if the recognition results do not contain a recognition symbol, forming a unit to be identified based on all single-character images sharing the same row label or the same column label, according to the typesetting type of the text block image and the row-column labels corresponding to the single-character images.
5. The multilingual text recognition method according to claim 1, wherein the acquiring the single-character text corresponding to each single-character image based on at least one recognition result corresponding to each single-character image in any unit to be identified and the recognition probability corresponding to each recognition result comprises:
if every single-character image in any unit to be identified has a recognition text whose recognition probability is greater than a preset probability threshold, determining the recognition text whose recognition probability is greater than the preset probability threshold as the single-character text corresponding to the single-character image;
if at least one single-character image in any unit to be identified has no recognition text whose recognition probability is greater than the preset probability threshold, adopting a target language model to evaluate, according to the recognition order corresponding to the single-character images, the word sequences formed from at least one recognition text corresponding to all single-character images in the unit to be identified, acquiring a sequence probability corresponding to each word sequence, and acquiring the single-character text corresponding to each single-character image based on the word sequence with the largest sequence probability.
6. A multilingual text recognition device comprising:
the image to be identified acquisition module is used for acquiring an image to be identified, wherein the image to be identified comprises original characters corresponding to at least two languages;
the text line image acquisition module is used for carrying out layout analysis and identification on the image to be identified, acquiring at least one text line image and determining the text line position of each text line image in the image to be identified;
the target text type acquisition module is used for performing format conversion on the text line image to obtain a feature map conforming to a preset format; adopting an Encoder component to encode the feature map to obtain a corresponding feature vector; and adopting a Summarizer component to integrate and classify the feature vector, obtain at least one text type probability, and output the text type with the largest text type probability as a target text type corresponding to the text line image;
the recognition model acquisition module is used for querying a recognition model database based on the target text type, and acquiring a target OCR recognition model corresponding to the target text type;
the target text acquisition module is used for integrating adjacent text line images corresponding to the same target text type into a text block image based on the target text type and the text line position corresponding to the text line image; cutting the text block image by adopting a text cutting algorithm to obtain at least two single-character images, and obtaining a recognition order and a row-column label corresponding to each single-character image according to the typesetting type and the cutting order of the text block image; inputting the single-character images into the target OCR recognition model for transcription recognition according to the recognition order corresponding to the single-character images, and obtaining at least one recognition result corresponding to each single-character image and a recognition probability corresponding to each recognition result; dividing all single-character images in the text block image into at least two units to be identified based on the recognition results and the typesetting type; acquiring the single-character text corresponding to each single-character image based on at least one recognition result corresponding to each single-character image in any unit to be identified and the recognition probability corresponding to each recognition result; and performing page typesetting on the single-character text corresponding to the single-character images according to the recognition order and the row-column label corresponding to each single-character image, and obtaining the target text corresponding to the text line images;
and the recognition text acquisition module is used for acquiring a target recognition text corresponding to the image to be identified based on the target text corresponding to the text line image and the text line position.
7. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the multilingual text recognition method according to any one of claims 1 to 5 when the computer program is executed.
8. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the multilingual text recognition method according to any one of claims 1 to 5.
CN201910706802.7A 2019-08-01 2019-08-01 Multilingual text recognition method, device, computer equipment and storage medium Active CN110569830B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910706802.7A CN110569830B (en) 2019-08-01 2019-08-01 Multilingual text recognition method, device, computer equipment and storage medium
PCT/CN2019/116488 WO2021017260A1 (en) 2019-08-01 2019-11-08 Multi-language text recognition method and apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910706802.7A CN110569830B (en) 2019-08-01 2019-08-01 Multilingual text recognition method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110569830A CN110569830A (en) 2019-12-13
CN110569830B true CN110569830B (en) 2023-08-22

Family

ID=68773976

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910706802.7A Active CN110569830B (en) 2019-08-01 2019-08-01 Multilingual text recognition method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110569830B (en)
WO (1) WO2021017260A1 (en)

Families Citing this family (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461097A (en) * 2020-03-18 2020-07-28 北京大米未来科技有限公司 Method, apparatus, electronic device and medium for recognizing image information
CN111444906B (en) * 2020-03-24 2023-09-29 腾讯科技(深圳)有限公司 Image recognition method and related device based on artificial intelligence
CN111444905B (en) * 2020-03-24 2023-09-22 腾讯科技(深圳)有限公司 Image recognition method and related device based on artificial intelligence
CN111488826B (en) * 2020-04-10 2023-10-17 腾讯科技(深圳)有限公司 Text recognition method and device, electronic equipment and storage medium
CN111563495B (en) * 2020-05-09 2023-10-27 北京奇艺世纪科技有限公司 Method and device for recognizing characters in image and electronic equipment
CN111626383B (en) * 2020-05-29 2023-11-07 Oppo广东移动通信有限公司 Font identification method and device, electronic equipment and storage medium
CN111709249B (en) * 2020-05-29 2023-02-24 北京百度网讯科技有限公司 Multi-language model training method and device, electronic equipment and storage medium
CN113822280A (en) * 2020-06-18 2021-12-21 阿里巴巴集团控股有限公司 Text recognition method, device and system and nonvolatile storage medium
CN111797922B (en) * 2020-07-03 2023-11-28 泰康保险集团股份有限公司 Text image classification method and device
CN111783786A (en) * 2020-07-06 2020-10-16 上海摩勤智能技术有限公司 Picture identification method and system, electronic equipment and storage medium
CN111832550B (en) * 2020-07-13 2022-06-07 北京易真学思教育科技有限公司 Data set manufacturing method and device, electronic equipment and storage medium
CN111832657A (en) * 2020-07-20 2020-10-27 上海眼控科技股份有限公司 Text recognition method and device, computer equipment and storage medium
CN111914825B (en) * 2020-08-03 2023-10-27 腾讯科技(深圳)有限公司 Character recognition method and device and electronic equipment
CN112818979B (en) * 2020-08-26 2024-02-02 腾讯科技(深圳)有限公司 Text recognition method, device, equipment and storage medium
CN112100063B (en) * 2020-08-31 2022-03-01 腾讯科技(深圳)有限公司 Interface language display test method and device, computer equipment and storage medium
CN112101367A (en) * 2020-09-15 2020-12-18 杭州睿琪软件有限公司 Text recognition method, image recognition and classification method and document recognition processing method
CN112149680B (en) * 2020-09-28 2024-01-16 武汉悦学帮网络技术有限公司 Method and device for detecting and identifying wrong words, electronic equipment and storage medium
CN112200188B (en) * 2020-10-16 2023-09-12 北京市商汤科技开发有限公司 Character recognition method and device and storage medium
CN112185348A (en) * 2020-10-19 2021-01-05 平安科技(深圳)有限公司 Multilingual voice recognition method and device and electronic equipment
CN111967545B (en) * 2020-10-26 2021-02-26 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and computer storage medium
CN112288018B (en) * 2020-10-30 2023-06-30 北京市商汤科技开发有限公司 Training method of character recognition network, character recognition method and device
CN112464724B (en) * 2020-10-30 2023-10-24 中科院成都信息技术股份有限公司 Vote identification method and system
CN112364667B (en) * 2020-11-10 2023-03-24 成都安易迅科技有限公司 Character checking method and device, computer equipment and computer readable storage medium
CN112766052A (en) * 2020-12-29 2021-05-07 有米科技股份有限公司 CTC-based image character recognition method and device
CN112800972A (en) * 2021-01-29 2021-05-14 北京市商汤科技开发有限公司 Character recognition method and device, and storage medium
CN112883967B (en) * 2021-02-24 2023-02-28 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN112883966B (en) * 2021-02-24 2023-02-24 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN112883968B (en) * 2021-02-24 2023-02-28 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN113553524B (en) * 2021-06-30 2022-10-18 上海硬通网络科技有限公司 Method, device, equipment and storage medium for typesetting characters of webpage
CN113420756B (en) * 2021-07-28 2023-05-12 浙江大华技术股份有限公司 Identification method and device for certificate image, storage medium and electronic device
CN113780131B (en) * 2021-08-31 2024-04-12 众安在线财产保险股份有限公司 Text image orientation recognition method, text content recognition method, device and equipment
CN113780276B (en) * 2021-09-06 2023-12-05 成都人人互娱科技有限公司 Text recognition method and system combined with text classification
CN114140928B (en) * 2021-11-19 2023-08-22 苏州益多多信息科技有限公司 High-precision digital color unified ticket checking method, system and medium
CN115033318B (en) * 2021-11-22 2023-04-14 荣耀终端有限公司 Character recognition method for image, electronic device and storage medium
CN113903035A (en) * 2021-12-06 2022-01-07 北京惠朗时代科技有限公司 Character recognition method and system based on super-resolution multi-scale reconstruction
CN114170594A (en) * 2021-12-07 2022-03-11 奇安信科技集团股份有限公司 Optical character recognition method, device, electronic equipment and storage medium
CN114047981A (en) * 2021-12-24 2022-02-15 珠海金山数字网络科技有限公司 Project configuration method and device
CN114596566B (en) * 2022-04-18 2022-08-02 腾讯科技(深圳)有限公司 Text recognition method and related device
CN114663878B (en) * 2022-05-25 2022-09-16 成都飞机工业(集团)有限责任公司 Finished product software version checking method, device, equipment and medium
CN116502625B (en) * 2023-06-28 2023-09-15 浙江同花顺智能科技有限公司 Resume analysis method and system


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7912289B2 (en) * 2007-05-01 2011-03-22 Microsoft Corporation Image text replacement
US9436682B2 (en) * 2014-06-24 2016-09-06 Google Inc. Techniques for machine language translation of text from an image based on non-textual context information from the image
CN107656922B (en) * 2017-09-25 2021-07-20 广东小天才科技有限公司 Translation method, translation device, translation terminal and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609408A (en) * 2012-01-11 2012-07-25 清华大学 Cross-lingual interpretation method based on multi-lingual document image recognition
CN107515849A (en) * 2016-06-15 2017-12-26 阿里巴巴集团控股有限公司 It is a kind of into word judgment model generating method, new word discovery method and device
CN109840465A (en) * 2017-11-29 2019-06-04 三星电子株式会社 Identify the electronic device of the text in image
CN108197109A (en) * 2017-12-29 2018-06-22 北京百分点信息科技有限公司 A kind of multilingual analysis method and device based on natural language processing
CN109492143A (en) * 2018-09-21 2019-03-19 平安科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN109492643A (en) * 2018-10-11 2019-03-19 平安科技(深圳)有限公司 Certificate recognition methods, device, computer equipment and storage medium based on OCR
CN109543690A (en) * 2018-11-27 2019-03-29 北京百度网讯科技有限公司 Method and apparatus for extracting information
CN109598272A (en) * 2019-01-11 2019-04-09 北京字节跳动网络技术有限公司 Recognition methods, device, equipment and the medium of character row image

Also Published As

Publication number Publication date
CN110569830A (en) 2019-12-13
WO2021017260A1 (en) 2021-02-04

Similar Documents

Publication Publication Date Title
CN110569830B (en) Multilingual text recognition method, device, computer equipment and storage medium
US11715014B2 (en) System and method of character recognition using fully convolutional neural networks with attention
US10936862B2 (en) System and method of character recognition using fully convolutional neural networks
CN108710866B (en) Chinese character model training method, chinese character recognition method, device, equipment and medium
CN107656922B (en) Translation method, translation device, translation terminal and storage medium
EP1598770B1 (en) Low resolution optical character recognition for camera acquired documents
Razak et al. Off-line handwriting text line segmentation: A review
US10643094B2 (en) Method for line and word segmentation for handwritten text images
CN113158808B (en) Method, medium and equipment for Chinese ancient book character recognition, paragraph grouping and layout reconstruction
US9286527B2 (en) Segmentation of an input by cut point classification
CN110942004A (en) Handwriting recognition method and device based on neural network model and electronic equipment
EP3539051A1 (en) System and method of character recognition using fully convolutional neural networks
CN109389115B (en) Text recognition method, device, storage medium and computer equipment
Hussain et al. Nastalique segmentation-based approach for Urdu OCR
Roy et al. Word searching in scene image and video frame in multi-script scenario using dynamic shape coding
RU2597163C2 (en) Comparing documents using reliable source
US9418281B2 (en) Segmentation of overwritten online handwriting input
US10217020B1 (en) Method and system for identifying multiple strings in an image based upon positions of model strings relative to one another
Lou et al. Generative shape models: Joint text recognition and segmentation with very little training data
CN115311666A (en) Image-text recognition method and device, computer equipment and storage medium
US20150169949A1 (en) Segmentation of Devanagari-Script Handwriting for Recognition
CN115147846A (en) Multi-language bill identification method, device, equipment and storage medium
CN113887375A (en) Text recognition method, device, equipment and storage medium
Razak et al. A real-time line segmentation algorithm for an offline overlapped handwritten Jawi character recognition chip
CN111832551A (en) Text image processing method and device, electronic scanning equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant