CN110569830A - Multi-language text recognition method and device, computer equipment and storage medium


Info

Publication number
CN110569830A
Authority
CN
China
Prior art keywords
image
text
recognition
target
text line
Prior art date
Legal status
Granted
Application number
CN201910706802.7A
Other languages
Chinese (zh)
Other versions
CN110569830B (en)
Inventor
王健宗
回艳菲
韩茂琨
于凤英
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910706802.7A
Priority to PCT/CN2019/116488 (published as WO2021017260A1)
Publication of CN110569830A
Application granted
Publication of CN110569830B
Legal status: Active

Classifications

    • G06F18/241 (Physics; Computing; Electric digital data processing; Pattern recognition; Analysing) — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045 (Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology) — Combinations of networks
    • G06N3/08 (Computing arrangements based on biological models; Neural networks) — Learning methods
    • G06V10/22 (Image or video recognition or understanding; Image preprocessing) — Selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/267 (Image or video recognition or understanding; Segmentation of patterns in the image field) — Segmentation by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V30/1478 (Character recognition; Image acquisition; Aligning or centring) — Inclination or skew detection or correction of characters or of character lines
    • Y02D10/00 (Climate change mitigation technologies in information and communication technologies) — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a multilingual text recognition method, a multilingual text recognition apparatus, a computer device, and a storage medium. The method comprises the following steps: acquiring an image to be recognized, wherein the image to be recognized comprises original characters corresponding to at least two languages; performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized; performing language type identification on each text line image to acquire the target language corresponding to each text line image; querying a recognition model database based on the target language to acquire the target OCR recognition model corresponding to the target language; performing transcription recognition on the text line image with the target OCR recognition model to acquire the target characters corresponding to the text line image; and acquiring the target recognition text corresponding to the image to be recognized based on the target characters and text line position corresponding to the text line image. Because each text line image is recognized with its own target OCR recognition model, the method helps improve the recognition accuracy of multilingual text.

Description

Multi-language text recognition method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of text recognition technologies, and in particular, to a method and an apparatus for recognizing a multi-language text, a computer device, and a storage medium.
Background
Multilingual text recognition applies in particular to scenes in which a text image containing several languages is recognized, for example a text image in which Chinese, Japanese, and English characters coexist. The conventional approach trains a sequence-to-sequence (Seq2Seq) recognition model to recognize such multilingual text images; however, the model structure is complex, the training process is very difficult, the running efficiency of the model is low, and the recognition accuracy is low. Here, Seq2Seq refers to the technique of converting a sequence in one domain (for example, English) into a sequence in another domain (for example, French) with a single model. When a conventional Seq2Seq recognition model is used to recognize a text image in which characters of several languages coexist, a mistaken judgment of a character's language type usually corrupts the finally recognized character content, so the recognition accuracy is low, which hinders the popularization and application of the recognition model.
Disclosure of Invention
Embodiments of the invention provide a multilingual text recognition method and apparatus, a computer device, and a storage medium, so as to solve the problem of low recognition accuracy when current recognition models are used to recognize multilingual text.
A multilingual text recognition method, comprising:
acquiring an image to be recognized, wherein the image to be recognized comprises original characters corresponding to at least two languages;
performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized;
performing language type identification on each text line image to acquire a target language corresponding to each text line image;
querying a recognition model database based on the target language, and acquiring a target OCR recognition model corresponding to the target language;
performing transcription recognition on the text line image with the target OCR recognition model to acquire target characters corresponding to the text line image; and
acquiring a target recognition text corresponding to the image to be recognized based on the target characters corresponding to the text line image and the text line position.
A multilingual text recognition apparatus, comprising:
an image acquisition module, configured to acquire an image to be recognized, wherein the image to be recognized comprises original characters corresponding to at least two languages;
a text line image acquisition module, configured to perform layout analysis and recognition on the image to be recognized, acquire at least one text line image, and determine the text line position of each text line image in the image to be recognized;
a target language acquisition module, configured to perform language type identification on each text line image and acquire a target language corresponding to each text line image;
a recognition model acquisition module, configured to query a recognition model database based on the target language and acquire a target OCR recognition model corresponding to the target language;
a target character acquisition module, configured to perform transcription recognition on the text line image with the target OCR recognition model and acquire target characters corresponding to the text line image; and
a recognition text acquisition module, configured to acquire a target recognition text corresponding to the image to be recognized based on the target characters corresponding to the text line image and the text line position.
A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above multilingual text recognition method when executing the computer program.
A computer-readable storage medium storing a computer program which, when executed by a processor, carries out the above multilingual text recognition method.
According to the multilingual text recognition method and apparatus, the computer device, and the storage medium, layout analysis and recognition are first performed on an image to be recognized in which original characters corresponding to at least two languages coexist, so as to determine at least one text line image and the corresponding text line positions; the multilingual image to be recognized is thus converted into single-language text line images for recognition, which improves accuracy in the subsequent recognition process. After language type identification determines the target language of each text line image, the text line image is recognized with the target OCR recognition model corresponding to that target language, ensuring the accuracy of the target characters recognized from each text line image. The target characters of the text line images are then rearranged based on the text line positions to obtain the target recognition text corresponding to the image to be recognized, so that comparison and verification can be performed on the target recognition text, which helps ensure recognition accuracy.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a diagram illustrating an application environment of a multilingual text-recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a multilingual text-recognition method in accordance with an embodiment of the present invention;
FIG. 3 is another flow chart of a method for multi-lingual text recognition in an embodiment of the present invention;
FIG. 4 is another flow chart of a method for multi-lingual text recognition in an embodiment of the present invention;
FIG. 5 is another flow chart of a method for multi-lingual text recognition in an embodiment of the present invention;
FIG. 6 is another flow chart of a method for multi-lingual text recognition in an embodiment of the present invention;
FIG. 7 is a schematic diagram of the model structure of the Encoder-Summarizer mechanism in an embodiment of the present invention, in which 7(a) is the model structure of the Encoder component, 7(b) is the model structure of the Summarizer component, and 7(c) is the model structure of the Inception layer in 7(a);
FIG. 8 is a diagram of a multilingual text-recognition apparatus in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The multilingual text recognition method provided by the embodiments of the invention can be applied in the application environment shown in fig. 1. Specifically, the method is applied in a multilingual text recognition system comprising the client and the server shown in fig. 1, which communicate over a network. When an image to be recognized in which multilingual characters coexist is recognized, the system determines a corresponding target OCR recognition model for each language and performs recognition with the several target OCR recognition models, so as to ensure the accuracy of the finally recognized target recognition text. The client, also called the user side, is the program that corresponds to the server and provides local services to the user; it may be installed on, but is not limited to, personal computers, laptops, smartphones, tablets, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In an embodiment, as shown in fig. 2, a multilingual text recognition method is provided. Taking the server in fig. 1 as an example, the method includes the following steps:
S201: acquiring an image to be recognized, wherein the image to be recognized comprises original characters corresponding to at least two languages.
The image to be recognized is an image on which character recognition needs to be performed. The original characters are the characters written in the image to be recognized. In this embodiment, the original characters in the image to be recognized correspond to at least two languages, that is, the image to be recognized is an image in which at least two languages coexist; for example, an image to be recognized may contain original characters in Chinese, English, and Japanese.
S202: performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized.
Performing layout analysis and recognition on the image to be recognized means dividing the image according to its distribution structure and analysing the image characteristics of each divided partial image to determine the attribute category of that partial image. Attribute categories include, but are not limited to, text blocks, image blocks, and table blocks. A text line image is an image containing at least one original character. Generally, if the corresponding layout type is horizontal, each text line image contains one row of original characters; if the layout type is vertical, each text line image contains one column of original characters.
Specifically, the server may first perform image preprocessing operations such as gray-scale transformation, binarization, smoothing, and edge detection on the image to be recognized, and then perform layout analysis and recognition on the preprocessed image to obtain at least two block images; attribute analysis is then performed on each block image to obtain its attribute category, and block images whose attribute category is text block are cropped out and determined to be text line images. The layout analysis of the image to be recognized may use, but is not limited to, a bottom-up layout analysis algorithm guided by multi-level confidence, a layout analysis algorithm based on neighborhood features, or a projection algorithm combined with a bottom-up layout analysis algorithm.
Specifically, after performing layout analysis and recognition on the image to be recognized and acquiring at least one text line image, the server needs to determine the text line position of each text line image in the image to be recognized, so that subsequent text positioning, or recognition in combination with the context, can be based on the text line position, ensuring the accuracy of the target recognition text recognized from the image to be recognized.
In a specific embodiment, taking a horizontally laid-out image to be recognized as an example: after performing layout analysis and recognition on the image to be recognized and acquiring at least one text line image, the server sorts the text line images by the vertical coordinate of the upper-left corner or center of each image's region, assigning each text line image a line number and thereby obtaining its text line position. For example, the text line image with line number 001 corresponds to the original characters of line 1; subsequent positioning or contextual semantic analysis can then be based on the line number, improving text recognition accuracy.
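As an illustration of this sorting step, here is a minimal Python sketch (not taken from the patent; the TextLine structure and its field names are assumptions) that numbers horizontally laid-out line regions from top to bottom:

from dataclasses import dataclass

@dataclass
class TextLine:
    image_id: str
    x: int  # top-left x of the region in the page image
    y: int  # top-left y of the region in the page image

def assign_line_numbers(lines: list[TextLine]) -> dict[str, int]:
    # Sort horizontally laid-out lines top-to-bottom and number them from 1.
    ordered = sorted(lines, key=lambda ln: ln.y)
    return {ln.image_id: i + 1 for i, ln in enumerate(ordered)}

lines = [TextLine("b", 10, 120), TextLine("a", 12, 40), TextLine("c", 9, 200)]
print(assign_line_numbers(lines))  # {'a': 1, 'b': 2, 'c': 3}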
S203: performing language type identification on each text line image to acquire the target language corresponding to each text line image.
The target language is the language type of the original characters in the text line image, such as Chinese, English, or Japanese. Specifically, the server identifies the language of each text line image to determine its target language, so that the corresponding target OCR recognition model can be determined from the target language. The multilingual image to be recognized is thereby converted into several single-language text line images, and recognizing each text line image with its own target OCR recognition model improves the recognition accuracy of the whole image to be recognized.
S204: querying the recognition model database based on the target language to acquire the target OCR recognition model corresponding to the target language.
The recognition model database is a database storing the OCR recognition models used to recognize different languages. Each OCR recognition model corresponds to one language type, i.e., it is the OCR recognition model for recognizing the characters of that language. Specifically, after acquiring the target language of each text line image, the server queries the recognition model database with the identified target language and determines the OCR recognition model of the corresponding language to be the target OCR recognition model, so that the target OCR recognition model can be used to recognize the text line image, improving the accuracy of text recognition on it.
S205: performing transcription recognition on the text line image with the target OCR recognition model to acquire the target characters corresponding to the text line image.
Specifically, the server may determine a recognition order from the text line position of each text line image and perform transcription recognition on each text line image with its target OCR recognition model to acquire the corresponding target characters. This ensures the recognition accuracy of each text line image's target characters and avoids the low accuracy that results when text line images of different languages are all recognized with the same OCR recognition model. The target characters are the characters obtained by recognizing the text line image with the target OCR recognition model.
S206: acquiring the target recognition text corresponding to the image to be recognized based on the target characters and text line positions of the text line images.
Specifically, after recognizing the target characters of each text line image, the server rearranges them based on the text line positions of the text line images to obtain a target recognition text with the same layout as the image to be recognized, so that the original characters at the corresponding positions in the image to be recognized can be compared and verified against the target recognition text. Taking a horizontally laid-out image to be recognized as an example, the text line position is given by the line number of each text line image; after the target characters of each text line image are recognized, they are typeset by line number to obtain the target recognition text corresponding to the image to be recognized, for subsequent comparison and verification to ensure recognition accuracy.
In the multilingual text recognition method provided by this embodiment, layout analysis and recognition are first performed on an image to be recognized in which original characters of at least two languages coexist, so as to determine at least one text line image and the corresponding text line positions; the multilingual image to be recognized is thus converted into single-language text line images, which improves accuracy in the subsequent recognition. After language type identification determines the target language of each text line image, the text line image is recognized with the target OCR recognition model corresponding to that language, ensuring the accuracy of the target characters recognized from each text line image. Finally, the target characters of the text line images are rearranged based on their text line positions to obtain the target recognition text corresponding to the image to be recognized, so that comparison and verification can be performed on the target recognition text, which helps ensure recognition accuracy.
In an embodiment, as shown in fig. 3, before step S201, that is, before the image to be recognized is obtained, the multilingual text recognition method further includes:
S301: acquiring an original image, and detecting the original image with a blur detection algorithm to acquire the blur degree of the original image.
The original image is the unprocessed image acquired by the server. A blur detection algorithm is an algorithm for detecting how blurred an image is, and any detection algorithm commonly used in the art may be employed. The blur degree of the original image is a numerical value reflecting how blurred it is: the higher the blur degree, the more blurred the original image; correspondingly, the smaller the blur degree, the sharper the original image.
The blur detection algorithm in this embodiment may use, but is not limited to, the Laplacian operator. The Laplacian operator is a second-order differential operator suitable for mitigating the image blur caused by the diffuse reflection of light. The principle is that, while an image is captured, a light spot diffusely reflects light into the surrounding area, blurring the image to a certain degree; compared with an image captured under normal conditions, this blur is often a constant multiple of the Laplacian. Therefore, step S301 specifically includes the following steps:
S3011: sharpening the original image with the Laplacian operator to obtain a sharpened image and the pixel gray values of the sharpened image. That is, the server first applies the Laplacian operator to the original image to obtain a Laplacian image describing the gray-level discontinuities, and then superposes the Laplacian image on the original image to obtain the sharpened image. After the sharpened image is obtained, the RGB value of each pixel is read and processed into the pixel gray value of the sharpened image.
S3012: computing the variance of the pixel gray values of the sharpened image, acquiring the target variance value of the sharpened image, and determining the target variance value to be the blur degree of the original image. The server sums the squared differences between each pixel's gray value and the average gray value of the sharpened image, and then divides by the number of pixels, obtaining a target variance value that reflects the blur of the sharpened image.
As can be understood, the Laplacian operator is first used to sharpen the original image into an image whose details are clearer than the original's, improving image definition. The target variance value of the sharpened image is then calculated to reflect the spread of the pixel gray values across the sharpened image, and this target variance value is taken as the blur degree of the original image, so that blur filtering can be performed on the original image according to how the blur degree compares with the preset thresholds, achieving the aim of obtaining a sufficiently clear image.
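A minimal Python sketch of S3011-S3012, assuming OpenCV and NumPy are available (this is not the patent's own code): the Laplacian response is superposed on the grayscale original, and the variance of the result is returned as the target variance value of S3012. How that value maps onto the thresholds of S302-S304 follows the patent's own convention.

import cv2
import numpy as np

def blur_degree(image_bgr: np.ndarray) -> float:
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # S3011: Laplacian image describing the gray-level discontinuities,
    # superposed on the original to give the sharpened image
    lap = cv2.Laplacian(gray, cv2.CV_64F)
    sharpened = gray.astype(np.float64) + lap
    # S3012: variance of the sharpened pixel gray values (the target variance)
    return float(sharpened.var())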
S302: if the blur degree is greater than the first blur threshold, acquiring blur prompt information.
The first blur threshold is a threshold preset by the system, representing the highest blur degree at which an image can still be used as the image to be recognized. The blur prompt information is information prompting that the image is excessively blurred.
Specifically, the blur degree of the original image is compared with the first blur threshold; if the blur degree is greater than the first blur threshold, the original image is too blurred, and performing character recognition on it directly would compromise the accuracy of the recognized characters, so the blur prompt information is acquired.
S303: if the blur degree is not greater than the first blur threshold but greater than the second blur threshold, sharpening and correcting the original image to obtain the image to be recognized.
The second blur threshold is a threshold preset by the system, representing the blur degree below which an image is evaluated as sufficiently clear. It will be appreciated that the first blur threshold is greater than the second blur threshold.
Specifically, when the blur degree of the original image is greater than the second blur threshold but not greater than the first blur threshold, i.e., it lies between the two thresholds, the original image is not excessively blurred but does not meet the sharpness standard. In this case the original image is first sharpened to improve its definition, and the sharpened image is then corrected to obtain a clear, non-tilted image to be recognized, safeguarding the accuracy of subsequent text recognition. Generally, during optical scanning, the original to be scanned may be positioned incorrectly for objective reasons, which affects the accuracy of later image recognition, so correction work is required on the image. The key to image tilt correction is automatically detecting the tilt direction and tilt angle from the image characteristics. Commonly used tilt-angle methods include projection-based methods, Hough-transform-based methods, methods based on line fitting, and methods based on Fourier transformation to the frequency domain.
S304: if the blur degree is not greater than the second blur threshold, correcting the original image to obtain the image to be recognized.
Specifically, when the blur degree of the original image is not greater than the second blur threshold, the original image is already clear and no sharpening is needed to enhance its sharpness, which improves image processing efficiency; however, the original image may still be positioned incorrectly for various objective reasons when it is produced, so the server performs correction processing on it to obtain a non-tilted image to be recognized.
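A hedged Python sketch of the S302-S304 branching follows. The threshold values are placeholders, the blur score is assumed to follow the patent's convention (larger means more blurred), and the deskew routine shows just one of the correction methods listed above (minimum-area-rectangle angle estimation; Hough-transform or projection methods would serve equally well):

import cv2
import numpy as np

FIRST_BLUR_THRESHOLD = 0.8   # hypothetical: above this, reject as too blurred
SECOND_BLUR_THRESHOLD = 0.3  # hypothetical: above this, sharpen before correcting

def prepare_image(original: np.ndarray, blur: float):
    if blur > FIRST_BLUR_THRESHOLD:
        return None                                      # S302: emit blur prompt info instead
    if blur > SECOND_BLUR_THRESHOLD:                     # S303: sharpen, then correct
        kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
        original = cv2.filter2D(original, -1, kernel)
    return deskew(original)                              # S304 path only corrects

def deskew(img: np.ndarray) -> np.ndarray:
    # Estimate the tilt angle from the minimum-area rectangle of the ink pixels.
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    coords = np.column_stack(np.where(bw > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    angle = -(90 + angle) if angle < -45 else -angle     # classic OpenCV angle convention
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)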
In the multilingual text recognition method provided by this embodiment, the processing applied is chosen according to how the blur degree of the original image compares with the first and second blur thresholds, which guarantees that a clear, non-tilted image to be recognized is finally obtained, ensures the accuracy of text recognition based on that image, and prevents image blur or image tilt from interfering with the recognition result.
In an embodiment, as shown in fig. 4, step S202 is to perform layout analysis and recognition on an image to be recognized, obtain at least one text line image, and determine a text line position of each text line image in the image to be recognized, and specifically includes the following steps:
S401: performing text positioning on the image to be recognized with a text positioning algorithm to obtain at least one text line region.
The text positioning algorithm is an algorithm for positioning the characters in an image. In this embodiment, text positioning algorithms include, but are not limited to, the proximity search algorithm and the CTPN-RNN algorithm. A text line region is a region containing original characters, identified from the image to be recognized with the text positioning algorithm; it is determined based on one row or one column of original characters.
Taking horizontal proximity search as an example, the proximity search algorithm starts from a connected region, finds the horizontal circumscribed rectangle of the connected region, and expands the connected region to the whole rectangle. When the distance between the connected region and its nearest neighboring region is within a certain range, expansion of the rectangle is considered; the direction of expansion is the direction of the nearest neighboring region, and the expansion operation is performed if and only if that direction is horizontal, so as to determine at least one text line region from the image. This effectively merges the original characters lying on the same line of the image into one text line region, achieving text positioning. Taking horizontal expansion as an example, performing text positioning on the image to be recognized with the proximity search algorithm to obtain at least one text line region proceeds as follows. For any two rectangular regions, each formed by one or more original characters in the image to be recognized, compute the central vector difference (x_c, y_c), i.e., the vector formed by the center points of the two rectangular regions. Then subtract the center-to-boundary spans of the two rectangular regions from the central vector difference to obtain the boundary vector difference:

(x'_c, y'_c) = (|x_c| - (a_1 + a_2)/2, |y_c| - (b_1 + b_2)/2)

where (x'_c, y'_c) is the boundary vector difference, (x_c, y_c) is the central vector difference, a_1 and b_1 are the length and width of the first rectangular region, and a_2 and b_2 are the length and width of the second rectangular region. Then compute the distance d between the two rectangular regions with the distance formula

d = max(x'_c, y'_c)

where max() is a function returning the maximum value. If the distance d is smaller than a certain range, the expansion operation is performed on that text line, obtaining at least one text line region; the proximity search method thus acquires text line regions quickly.
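A small Python sketch of this distance test (the rectangle tuples and the merge threshold are illustrative assumptions, not the patent's notation):

def boundary_vector_difference(r1, r2):
    # Rectangles as (cx, cy, a, b): center coordinates, length, and width.
    (cx1, cy1, a1, b1), (cx2, cy2, a2, b2) = r1, r2
    xc, yc = cx2 - cx1, cy2 - cy1              # central vector difference
    x_c = abs(xc) - (a1 + a2) / 2              # subtract center-to-boundary spans
    y_c = abs(yc) - (b1 + b2) / 2
    return x_c, y_c

def region_distance(r1, r2):
    x_c, y_c = boundary_vector_difference(r1, r2)
    return max(x_c, y_c)                       # d = max(x'_c, y'_c)

# Two boxes on the same line, 6 px apart horizontally and overlapping vertically;
# they would be merged if d is below the chosen expansion threshold:
d = region_distance((10, 20, 20, 12), (36, 20, 20, 12))
print(d)  # 6.0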
CTPN (Connectionist Text Proposal Network, hereinafter CTPN) is a model for accurately positioning text lines in an image; it can identify the coordinate positions of the four corners of each text line. The main purpose of an RNN (Recurrent Neural Network, hereinafter RNN) is to process and predict sequence data; the nodes between its hidden layers are connected, and the input of a hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step. Positioning at least one text line region from the image to be recognized with the CTPN-RNN algorithm seamlessly combines CTPN with the recurrent network, so that the text lines in the image to be recognized can be positioned accurately and a text line region determined from the position of each text line; that is, the CTPN-RNN algorithm automatically identifies at least one text line region, and the seamless combination of CTPN and RNN effectively improves detection precision.
S402: capturing at least one text line region with a screenshot tool, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized according to the capture order of the screenshot tool.
Specifically, the server captures each text line region with OpenCV, obtains the corresponding text line images, and determines the text line position of each text line image in the image to be recognized according to the capture order. OpenCV (Open Source Computer Vision Library) is a cross-platform computer vision library released under the BSD (open source) license that runs on Linux, Windows, Android, and macOS. It is lightweight and efficient, consists of a series of C functions and a small number of C++ classes, provides interfaces for languages such as Python, Ruby, and MATLAB, and implements many general algorithms in image processing and computer vision. In this embodiment, the coordinates of the four corners of each text line region are used to crop out the corresponding text line image with OpenCV; performing the cropping with OpenCV is computationally simple, efficient, and stable. The text line position of each text line image may be the coordinates of its four vertices (for example, of the upper-left corner) or of its center point, so that the position of the text line image within the image to be recognized is determined from the text line position.
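Since OpenCV images are NumPy arrays, the cropping described here reduces to array slicing. A minimal sketch, in which the box coordinates and the input file name are assumptions for illustration:

import cv2

def crop_text_lines(page, boxes):
    # boxes: list of (x1, y1, x2, y2) with (x1, y1) the top-left and
    # (x2, y2) the bottom-right corner. Crops are returned in capture order,
    # which here stands in for the text line position.
    return [(idx + 1, page[y1:y2, x1:x2])
            for idx, (x1, y1, x2, y2) in enumerate(boxes)]

page = cv2.imread("page.png")  # hypothetical input image
line_images = crop_text_lines(page, [(12, 30, 580, 62), (12, 70, 575, 101)])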
In the multilingual text recognition method provided by this embodiment, a text positioning algorithm is first used to perform text positioning on the image to be recognized, quickly locating text line regions that each contain one row or one column of original characters, so that the text line regions are acquired efficiently and accurately. A screenshot tool then captures each text line region to acquire at least one text line image, dividing the image to be recognized into text line images, so that subsequent character recognition can proceed image by image; this avoids the inaccurate results caused by recognizing text line images of different languages with the same recognition model. Finally, the text line position of each text line image is acquired from the capture order of the screenshot tool, so that subsequent positioning or contextual semantic analysis based on the text line position can improve text recognition accuracy.
In an embodiment, the multilingual text recognition system uses the Encoder-Summarizer mechanism shown in fig. 7 to perform language type identification: an Encoder component converts the text line image into a feature sequence, and a Summarizer component aggregates the feature sequence to perform the classification task, thereby determining the corresponding target language. As shown in fig. 5, step S203, namely performing language type identification on each text line image and acquiring the target language corresponding to each text line image, specifically includes the following steps:
S501: performing format conversion on the text line image to obtain a feature map conforming to a preset format.
The preset format is the predefined format of the feature map that is input to the Encoder component for encoding. The feature map is the map extracted from the text line image that can be input to the Encoder component for encoding processing. The preset format comprises a preset height, a preset width, and a preset number of channels. The preset number of channels is set to 1, meaning the feature map is a grayscale image, which reduces the computation in subsequent processing. By fixing the preset height and preset width, interference from varying width and height is effectively reduced when the Encoder component encodes the feature map, safeguarding the accuracy of the resulting feature vector.
Specifically, the server converts the format of the text line image as follows: first, the text line image is grayed, converting it into the corresponding grayscale image so that the number of channels is 1; the grayscale image is then scaled to a first image matching the preset height h; whether the width of the first image reaches the preset width is judged; if it does, the first image is used directly as the feature map; if it does not, black or white areas are added to the left and right edges of the first image, converting it into a feature map matching the preset width. It can be understood that format-converting the text line image into a feature map conforming to the preset format excludes the influence of these factors (width, height, and channel number) from the subsequent image processing, making the final identification more accurate.
For example, let the width, height, and number of channels of the text line image be w, h, and d respectively; to reduce the workload of the subsequent format conversion, the text line image may be converted into a grayscale image in advance so that d = 1. After format conversion, a feature map with the preset width w', preset height h', and preset channel number d' is obtained; in this embodiment h' = 1 and d' = 1, i.e., the height and the channel number are kept fixed during the format conversion, to ensure the accuracy of the subsequent encoding and identification.
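A sketch of this conversion under stated assumptions: OpenCV is available, the preset height and width below are placeholder pixel values (the patent's own notation normalizes the height to h' = 1), and white padding is chosen from the black-or-white option described above.

import cv2
import numpy as np

PRESET_H, PRESET_W = 32, 512  # hypothetical preset values

def to_feature_map(line_img: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(line_img, cv2.COLOR_BGR2GRAY)          # d' = 1
    scale = PRESET_H / gray.shape[0]                           # scale to preset height
    first = cv2.resize(gray, (max(1, int(gray.shape[1] * scale)), PRESET_H))
    if first.shape[1] >= PRESET_W:                             # already wide enough
        return first[:, :PRESET_W]
    pad = PRESET_W - first.shape[1]
    left, right = pad // 2, pad - pad // 2                     # pad both edges
    return cv2.copyMakeBorder(first, 0, 0, left, right,
                              cv2.BORDER_CONSTANT, value=255)  # white areas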
S502: encoding the feature map with the Encoder component to obtain the corresponding feature vector.
As shown in fig. 7, the Encoder component is an encoder constructed by combining a convolutional layer (Conv), a max pooling layer (MaxPool), and an Inception layer, and is used to acquire the feature vector corresponding to the feature map. As shown in fig. 7(a), the Encoder component sequentially comprises a convolutional layer (Conv), a max pooling layer (MaxPool), an Inception layer, and four convolutional layers (Conv); the sizes of their convolution kernels and the corresponding activation functions are shown in the figure and can be set according to actual requirements. The Inception layer is formed by combining several convolutional layers (Conv) and an average pooling layer (AvgPool) according to the structure shown in fig. 7(c). As shown in figs. 7(a) and 7(c), the model structure of the Encoder component is simple, which helps speed up computation and save computation cost. During encoding, the Encoder component extracts the information of each pixel in the feature map conforming to the preset format to obtain a feature vector that uniquely identifies the feature map. The purpose of Inception is to design a network with a good local topology, i.e., to perform multiple convolution or pooling operations on the input image in parallel and stitch all the output results into a very deep feature map. Different convolution operations such as 1×1, 3×3, or 5×5, together with pooling operations, obtain different information from the input feature map; processing these operations in parallel with Inception and combining all the results yields a better image representation, i.e., the feature vector, ensuring the accuracy of the obtained feature vector.
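For orientation only, here is a highly simplified PyTorch sketch of the Conv → MaxPool → Inception → Conv shape described above. Channel counts, kernel sizes, and activations are placeholders, since the exact values appear only in Fig. 7, and the trailing four convolutions are collapsed into one:

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    # Parallel 1x1 / 3x3 / 5x5 convolutions plus average pooling,
    # with the branch outputs concatenated along the channel axis.
    def __init__(self, c_in: int, c_branch: int):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, c_branch, 1)
        self.b3 = nn.Conv2d(c_in, c_branch, 3, padding=1)
        self.b5 = nn.Conv2d(c_in, c_branch, 5, padding=2)
        self.pool = nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(c_in, c_branch, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            InceptionBlock(16, 16),                    # 4 branches * 16 = 64 channels
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )

    def forward(self, feature_map):                    # (N, 1, H', W') grayscale input
        return self.net(feature_map)                   # feature sequence

x = torch.zeros(1, 1, 32, 512)
print(Encoder()(x).shape)                              # torch.Size([1, 64, 16, 256])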
S503: integrating and classifying the feature vector with the Summarizer component to obtain at least one language probability, and outputting the identified language with the largest language probability as the target language corresponding to the text line image.
Specifically, the Summarizer component is a classifier preset in the multilingual text recognition system for executing the classification task to determine the corresponding target language. As shown in fig. 7(b), the Summarizer component comprises three convolutional layers (Conv); the first two use the ReLU activation function and the last uses the Sigmoid activation function. As shown in fig. 7(b), the model structure of the Summarizer component is simple, which helps speed up computation and save computation cost. The server uses the Summarizer component to integrate and classify the feature vector output by the Encoder component, obtaining at least one language probability, each of which corresponds to one identified language; a Softmax function then processes the distribution over the language probabilities so that the identified language with the largest probability is output as the target language of the text line image, ensuring the accuracy of the identified target language. In this embodiment, the language probabilities identified by the Summarizer component may exist as an array in which each number is the recognition probability of one identified language; for example, for an array formed by the four values 0.7, 0, 0.2, and 0.1, referring in that order to Chinese, English, Japanese, and Korean, the target language is determined from the array to be Chinese.
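The final selection step amounts to an argmax over the language probability array. A tiny sketch using the [0.7, 0, 0.2, 0.1] example above, with the language order stated above:

LANGUAGES = ["Chinese", "English", "Japanese", "Korean"]

def pick_target_language(probs: list[float]) -> str:
    # Return the identified language with the largest language probability.
    return LANGUAGES[max(range(len(probs)), key=probs.__getitem__)]

print(pick_target_language([0.7, 0.0, 0.2, 0.1]))  # Chinese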
In this embodiment, let x be the feature vector corresponding to the text line image and y be the code point sequence produced by the encoding, where the code point sequence may be the sequence formed after the Summarizer component integrates the feature vector. The model is built with a probabilistic method, using the conditional probability P(y|x): the probability that the encoded code point sequence is y given x. The Summarizer component outputs the code point sequence with the highest probability. Let s ∈ S, where s denotes one language; by treating s as a latent variable, the language information can be merged into the conditional probability P(y|x), giving the following formula:

P(y|x) = Σ_{s∈S} P(y|s,x) · P(s|x), with the output ŷ = argmax_y P(y|x)

where P(y|s,x) represents an OCR model capable of recognizing the characters of a given language, x represents the feature vector of an image with fixed height (h' = 1), and s represents a language; that is, P(y|s,x) is the probability that, given the feature vector x of the image, the characters in the image are encoded as y under language s. argmax is the function that returns the argument at which a function attains its maximum.
Specifically, P(y|s,x) is calculated with the following formula:

P(y|s,x) = P(y|s) · Π_{i=1}^{|C(y)|} P(c_i|s,x) / P(c_i|s)

where C(y) denotes a function that converts y into the corresponding sequence of glyph clusters (c_1, c_2, …, c_{|C(y)|}); c_i is the i-th glyph cluster of y; and P(c|s,x), P(c|s), and P(y|s) represent, respectively, the OCR recognition model for language s, the prior probability of glyph cluster c, and the language model. Here P(c|s,x) is the probability that, given x and the language s, a character in the image belongs to glyph cluster c; P(c|s) is the prior probability that a character of the given language s belongs to glyph cluster c; and P(y|s) is the language-model probability that a character sequence of language s encodes to the code point sequence y.
A glyph cluster is a character string that has some uniform characteristic within the character stream and cannot be split during typesetting; in the normal case it is an ordinary character (a Chinese character or a letter), but in some cases it is a character string arranged according to a special rule. A glyph cluster composed of several characters has the following properties: 1) within a line of characters, the glyph cluster behaves like a single character in a traditional typesetting system; 2) the relations and layout of the glyphs inside the cluster depend only on the character attributes and are unrelated to the typesetting rules; the characters in a glyph cluster should share the same character and font attributes, which determines whether they can be output in one pass.
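To make the decomposition concrete, the following sketch evaluates it on made-up numbers; every probability here is hypothetical and serves only to show how the per-glyph-cluster ratios and the language model combine before the per-language weighting:

def p_y_given_s_x(p_y_given_s, per_glyph):
    # per_glyph: list of (P(c_i|s,x), P(c_i|s)) pairs for the glyph clusters of y.
    score = p_y_given_s                      # language model term P(y|s)
    for p_csx, p_cs in per_glyph:
        score *= p_csx / p_cs                # OCR posterior over cluster prior
    return score

# Weight each language's score by a hypothetical P(s|x) and take the best one.
scores = {
    "zh": p_y_given_s_x(0.02, [(0.9, 0.05), (0.8, 0.04)]) * 0.6,
    "ja": p_y_given_s_x(0.01, [(0.3, 0.05), (0.2, 0.04)]) * 0.4,
}
print(max(scores, key=scores.get))  # zh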
It can be understood that, after the system trains the Encoder component and the Summarizer component, it compares in advance the confusion rates of the traditional Seq2Seq recognition model and of language type identification based on the Encoder-Summarizer mechanism of this embodiment, so as to judge identification accuracy from the confusion rate. The confusion rate is the probability that one script is mistaken for another. The test process is as follows: (1) acquire multilingual image test samples, each carrying a language label, i.e., the label of the language of the characters in that sample; (2) test the image test samples with the traditional Seq2Seq recognition model and with the method of steps S501-S503 respectively, obtaining a result label for each sample, i.e., the label of the language identified from it; (3) from the language labels and result labels of all the image test samples, compute the confusion rate for any two languages, which can be taken as the number of times the two scripts are mistaken for each other divided by the total number of occurrences of the two scripts. According to the test results, the confusion rate of recognizing Cyrillic letters as Latin is 4.2% with the traditional Seq2Seq recognition model, but only 1.8% with the Encoder-Summarizer. As shown in Table 1, when tested under the same conditions and for most languages, language type identification with the Encoder-Summarizer yields a much lower confusion rate than the traditional algorithm. As shown in fig. 7, the model structures of the Encoder component and the Summarizer component are simple, which helps speed up computation and save computation cost.
Table 1. Confusion rate test results
In the multilingual text recognition method provided by this embodiment, format conversion is performed on the text line image to obtain a feature map conforming to the preset format, reducing the computation of the subsequent encoding and improving its efficiency; the Encoder component encodes the feature map so that the corresponding feature vector is acquired quickly and its accuracy is ensured; the Summarizer component then integrates and classifies the feature vector to obtain at least one language probability, and the identified language with the largest language probability is output as the target language of the text line image, ensuring both the accuracy of the identified target language and the identification efficiency.
In an embodiment, as shown in fig. 6, step S205, that is, performing transcription recognition on the text line image by using a target OCR recognition model to obtain target characters corresponding to the text line image, specifically includes the following steps:
S601: integrating adjacent text line images corresponding to the same target language into a text block image, based on the target language and text line position corresponding to each text line image.
Because the characters of a text may have specific contextual relations, i.e., the connected context carries specific semantics, when recognizing the text line images the server integrates adjacent text line images corresponding to the same target language into a text block image, based on the target language and text line position of each text line image, and takes the text block image as the object of holistic recognition, thereby ensuring the accuracy of the target characters recognized for each text line image on the basis of the text block image.
Specifically, after identifying the target language of each text line image, the server sorts the target languages of all the text line images according to their text line positions, and integrates all text line images that occupy adjacent text line positions and belong to the same target language into one text block image, so that subsequent recognition based on the text block image can fully consider the contextual semantics within it, improving the accuracy of character recognition.
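A sketch of this grouping, assuming the text line images are already sorted by text line position; the line tuples are illustrative:

from itertools import groupby

lines = [  # (line_no, language, image_id) — hypothetical data
    (1, "zh", "img1"), (2, "zh", "img2"), (3, "en", "img3"), (4, "zh", "img4"),
]

# Consecutive runs with the same target language become one text block image.
blocks = [(lang, [img for _, _, img in group])
          for lang, group in ((k, list(g))
                              for k, g in groupby(lines, key=lambda t: t[1]))]
print(blocks)  # [('zh', ['img1', 'img2']), ('en', ['img3']), ('zh', ['img4'])]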
S602: cutting the text block image with a character cutting algorithm to obtain at least two single-character images, and obtaining the recognition order and row/column label of each single-character image according to the layout type and cutting order of the text block image.
The character cutting algorithm is an algorithm for cutting a text block image into single-character images; it may specifically be a projection-based character cutting algorithm. For example, when cutting text block images with a projection-based character cutting algorithm, vertical projection may be performed on each text line image in turn to obtain the vertically projected pixels; if there are consecutive pixels satisfying a preset condition, an original character is judged to exist in the area corresponding to those consecutive pixels, and that original character is cut out to form a single-character image.
The layout types of a text block image are horizontal layout and vertical layout. The recognition order of a single-character image is its order within the whole text block image, determined from the position at which it is typeset in the block. The row/column label of a single-character image consists of its row label and its column label within the text block image. For example, if a text block image is formed by laying out 3 text line images horizontally, containing 18, 20, and 17 characters respectively, then 55 single-character images are cut out; the single-character image of the 3rd character on line 2 has recognition order 21, row label 2, and column label 3.
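A minimal sketch of projection-based cutting for one horizontal text line; the binarization convention (text pixels equal to 1) and the ink threshold are assumptions:

import numpy as np

def cut_characters(line_bw: np.ndarray, min_ink: int = 1):
    # line_bw: binary line image with text pixels == 1.
    profile = line_bw.sum(axis=0)               # vertical projection per column
    cuts, start = [], None
    for x, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = x                            # a character column run begins
        elif ink < min_ink and start is not None:
            cuts.append(line_bw[:, start:x])     # run ends: emit single-character image
            start = None
    if start is not None:
        cuts.append(line_bw[:, start:])
    return cuts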
S603: Inputting the single-character images into the target OCR recognition model for transcription recognition according to the recognition order corresponding to each single-character image, and acquiring at least one recognition result corresponding to each single-character image and the recognition probability corresponding to each recognition result.
Specifically, the server inputs the single-character images into the target OCR recognition model for transcription recognition according to the recognition order corresponding to each image, and acquires at least one recognition result and the corresponding recognition probability for each single-character image. In this embodiment, only the first 3 recognition results with the highest recognition probabilities, together with those probabilities, may be kept, to reduce the workload of subsequent processing. A recognition result is what is recognized from a single-character image, and may be a recognition symbol or a recognition character. The recognition probability corresponding to a recognition result is the likelihood that the single-character image is recognized as that result. For example, a single-character image may yield three candidate recognition characters, the first being the original character itself, with recognition probabilities of 99.99%, 84.23% and 47.88% respectively.
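Keeping only the first 3 recognition results by probability might look like the sketch below; the candidate characters and probabilities are hypothetical model output, not values taken from this embodiment:

```python
def top_k_candidates(char_probs, k=3):
    """Keep the k recognition results with the highest recognition
    probability for one single-character image."""
    ranked = sorted(char_probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

# Hypothetical per-candidate confidences for one single-character image.
probs = {"天": 0.9999, "夫": 0.8423, "失": 0.4788, "大": 0.1201}
print(top_k_candidates(probs))
# [('天', 0.9999), ('夫', 0.8423), ('失', 0.4788)]
```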
S604: Dividing all the single-character images in the text block image into at least two units to be recognized, based on the recognition results and the layout type.
Specifically, the server divides all the single-character images in the text block image into at least two units to be recognized, based on the at least one recognition result corresponding to each single-character image and the layout type corresponding to the text block image. A unit to be recognized is the minimum unit on which semantic recognition is performed; each unit contains characters that can form a complete sentence, so that semantic recognition is possible, which helps improve the accuracy of character recognition. Because a recognition result may be either a recognition symbol or a recognition character, semantic separation can be performed according to whether the result is a recognition symbol; when it is not, semantic separation can be performed according to the layout type of the text block image, thereby improving the accuracy of the units to be recognized.
In an embodiment, step S604, that is, dividing all the single-character images in the text block image into at least two units to be recognized based on the recognition results and the layout type, specifically includes the following steps:
S6041: If the recognition results contain recognition symbols, forming a unit to be recognized from all the single-character images between any two adjacent recognition symbols.
Specifically, if the recognition results of the single-character images cut from any text block image include a recognition symbol, that is, a punctuation mark recognized by the target OCR recognition model, the characters corresponding to that text block are separated by punctuation marks, and the semantics of the context depend on the positions of those marks. The server therefore forms a unit to be recognized from all the single-character images between any two adjacent recognition symbols. The unit to be recognized is the minimum unit on which semantic recognition is performed, so the characters corresponding to each unit form one sentence, which helps improve the recognition accuracy of the characters in the unit and reduces recognition complexity.
Specifically, when the server recognizes a single-character image with the target OCR recognition model, it obtains at least one recognition result and the recognition probability corresponding to each result. If the recognition result that is a recognition symbol has the largest recognition probability, or a probability greater than a preset probability threshold, the recognition results are determined to include a recognition symbol. The preset probability threshold is the threshold for judging that an image is indeed recognized as a given character or symbol, and can be set to a relatively high value to ensure recognition accuracy.
Further, when the recognition results of the single-character images cut from any text block image contain recognition symbols, the server may first judge whether the recognition symbols include a preset symbol; if they do, a unit to be recognized is formed from all the single-character images between any two adjacent preset symbols, which improves the accuracy of the units to be recognized. A preset symbol is a punctuation mark preset by the system to mark the end of a sentence, including but not limited to the period, the question mark and the exclamation mark.
S6042: If the recognition results do not contain a recognition symbol, forming a unit to be recognized from all the single-character images corresponding to the same row label or the same column label, according to the layout type of the text block image and the row and column labels corresponding to the single-character images.
Specifically, if the recognition results of the single-character images cut from any text block image contain no recognition symbol, the characters corresponding to that text block are not separated by punctuation marks. In this case the characters in the same row or the same column generally form one sentence, so the server may form a unit to be recognized from all the single-character images with the same row label or the same column label, according to the layout type of the text block image and the row and column labels of the single-character images, which helps improve the accuracy of the units to be recognized.
It can be understood that when the recognition results contain recognition symbols, the server forms the units to be recognized from the single-character images between adjacent recognition symbols; when they do not, it forms the units from the single-character images sharing the same row label or the same column label, according to the layout type of the text block image and the row-column labels. This ensures that each unit to be recognized contains, as far as possible, one complete sentence, which facilitates the subsequent semantic analysis and improves recognition accuracy.
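Steps S6041 and S6042 together can be sketched as follows, assuming each single-character image record carries its top recognition result and the row-column labels from the indexing sketch above; the preset symbol set and field names are illustrative:

```python
PRESET_SYMBOLS = {"。", "？", "！", ".", "?", "!"}  # sentence-ending marks

def split_into_units(char_images, layout="horizontal"):
    """Divide the single-character images of one text block into units to
    be recognized, cutting at recognized sentence-ending symbols when any
    are present (S6041), otherwise grouping by row or column (S6042)."""
    if any(c["top_result"] in PRESET_SYMBOLS for c in char_images):
        units, current = [], []
        for c in char_images:
            if c["top_result"] in PRESET_SYMBOLS:
                if current:
                    units.append(current)
                current = []
            else:
                current.append(c)
        if current:
            units.append(current)
        return units
    # No symbols: group by row for horizontal layout, by column otherwise.
    key = "row" if layout == "horizontal" else "col"
    by_key = {}
    for c in char_images:
        by_key.setdefault(c[key], []).append(c)
    return [by_key[k] for k in sorted(by_key)]

chars = [{"top_result": "今", "row": 1, "col": 1},
         {"top_result": "天", "row": 1, "col": 2},
         {"top_result": "。", "row": 1, "col": 3},
         {"top_result": "好", "row": 2, "col": 1}]
print(len(split_into_units(chars)))  # 2 units, cut at the period
```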
S605: Acquiring the single character corresponding to each single-character image, based on at least one recognition result corresponding to each single-character image in any unit to be recognized and the recognition probability corresponding to each recognition result.
Specifically, the server uses the unit to be recognized as the minimum unit for semantic recognition. Based on the at least one recognition result corresponding to each single-character image in the unit and the recognition probability of each result, it exploits the possible contextual semantic relations among the single-character images to determine the single character corresponding to each image more accurately, thereby improving the accuracy of text recognition.
In an embodiment, step S605, that is, acquiring the single character corresponding to each single-character image based on at least one recognition result corresponding to each single-character image in any unit to be recognized and the recognition probability corresponding to each recognition result, specifically includes the following steps:
S6051: If every single-character image in any unit to be recognized has a recognition character whose recognition probability is greater than a preset probability threshold, determining that recognition character as the single character corresponding to the single-character image.
Specifically, the server compares each recognition probability corresponding to each single-character image in any unit to be recognized with the preset probability threshold, one by one, to judge whether each single-character image has a recognition character whose recognition probability is greater than the threshold. If every single-character image has such a recognition character, that character is directly determined as the single character corresponding to the image.
If a unit to be recognized contains N single-character images, each with M recognition characters and a recognition probability for each, the server compares the M recognition probabilities of each of the N images with the preset probability threshold in turn, to determine whether every single-character image has a recognition character whose probability exceeds the threshold. If all N images have such a recognition character, the recognition characters are sufficiently accurate and can be taken directly as the single characters. For example, suppose the preset probability threshold is set to 95% and a unit to be recognized corresponds to the sentence "今天天气真好" ("the weather is really nice today"), cut into six single-character images. If "今" is recognized with probability 97%, each "天" with 98%, "气" with 99%, "真" with 96% and "好" with 99%, then every single-character image has a recognition character whose probability exceeds the threshold (95%), and those recognition characters are determined as the single characters, ensuring the accuracy of character recognition within the unit.
S6052: If at least one single-character image in any unit to be recognized has no recognition character whose recognition probability is greater than the preset probability threshold, recognizing, with the target language model and in the recognition order of the single-character images, the word sequences formed by the recognition characters of all the single-character images in the unit, acquiring the sequence probability corresponding to each word sequence, and acquiring the single character corresponding to each single-character image from the word sequence with the largest sequence probability.
Specifically, the server compares each recognition probability corresponding to each single-character image in any unit to be recognized with the preset probability threshold, one by one. When at least one single-character image has no recognition character whose probability exceeds the threshold, the accuracy of that image's recognition is not assured, and semantic analysis combining the context is required to improve the recognition accuracy of the characters in the unit. The server therefore uses the target language model to recognize, in the recognition order of the single-character images, the word sequences formed by the recognition characters of all the single-character images in the unit, acquires the sequence probability corresponding to each word sequence, and acquires the single character corresponding to each single-character image from the word sequence with the largest sequence probability.
The target language model is a model for semantic analysis of continuous characters; it may be, but is not limited to, an N-gram model, for example a Chinese language model. When continuous, unseparated pinyin, strokes, or digits representing letters or strokes need to be converted into a Chinese character string (that is, a sentence), the Chinese language model uses the collocation information between adjacent words in the context to compute the sentence with the largest probability. Chinese characters can thus be converted automatically, without manual selection by the user, avoiding the problem that many different Chinese characters correspond to the same pinyin (or stroke string or digit string).
For example, suppose the preset probability threshold is set to 95% and the unit to be recognized again corresponds to "今天天气真好". Among its six single-character images, the first is recognized as "今" with probability 97%; the second as "夫" with probability 85% and as "天" with probability 83%; the third as "天" with probability 90% and as "夫" with probability 75%; the fourth as "气" with probability 99%; the fifth as "真" with probability 96%; and the sixth as "好" with probability 99%. Since the second and third single-character images have no recognition character whose probability exceeds the preset probability threshold (95%), word sequences are formed from the recognition characters of all the single-character images, such as "今天天气真好", "今夫天气真好", "今天夫气真好" and "今夫夫气真好". The target language model recognizes these word sequences to obtain the sequence probability of each, and the single character corresponding to each single-character image in the unit is determined from the word sequence with the largest sequence probability, ensuring the accuracy of single-character recognition.
In this way, when every single-character image in a unit to be recognized has a recognition character whose recognition probability exceeds the preset probability threshold, the server determines that character as the single character corresponding to the image, ensuring both the accuracy and the efficiency of single character recognition. When at least one single-character image has no such recognition character, the target language model recognizes the word sequences formed by all the single-character images, and the single character corresponding to each image is determined from the word sequence with the largest sequence probability, so that semantic analysis by the target language model guarantees the accuracy of single character recognition.
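A toy sketch of steps S6051 and S6052, with a two-line bigram scorer standing in for a trained N-gram language model; every candidate character, probability and bigram value below is hypothetical:

```python
from itertools import product

def bigram_score(sequence, bigram_probs, default=1e-6):
    """Stand-in language model: score a character sequence as the
    product of its adjacent-pair probabilities."""
    score = 1.0
    for a, b in zip(sequence, sequence[1:]):
        score *= bigram_probs.get((a, b), default)
    return score

def decode_unit(candidates, threshold, bigram_probs):
    """candidates: per single-character image, a list of (char, prob).
    If every image has a candidate above the threshold, take those
    directly (S6051); otherwise enumerate the word sequences and keep
    the one the language model scores highest (S6052)."""
    best = [max(c, key=lambda kv: kv[1]) for c in candidates]
    if all(p > threshold for _, p in best):
        return [ch for ch, _ in best]
    sequences = product(*[[ch for ch, _ in c] for c in candidates])
    return list(max(sequences, key=lambda s: bigram_score(s, bigram_probs)))

# The middle image is ambiguous between "夫" and "天"; context decides.
cands = [[("今", 0.97)], [("夫", 0.85), ("天", 0.83)], [("天", 0.90)]]
probs = {("今", "天"): 0.4, ("天", "天"): 0.3,
         ("今", "夫"): 0.01, ("夫", "天"): 0.01}
print(decode_unit(cands, threshold=0.95, bigram_probs=probs))
# ['今', '天', '天']
```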
S606: Performing page layout on the single characters corresponding to the single-character images according to the recognition order and the row-column label of each single-character image, to acquire the target characters corresponding to the text line images.
Specifically, after recognizing the single character corresponding to each single-character image, the server lays out the single characters of all the single-character images on the page according to their recognition order and row-column labels; that is, the single character corresponding to each image is placed at the position indicated by its row-column label, so as to acquire the target characters corresponding to the text line images, on which subsequent comparison and verification are performed to ensure the recognition accuracy of the target characters. The target characters are the characters finally recognized from the text line images.
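Page layout by row-column labels then reduces to placing each recognized single character into a grid and reading the grid back row by row; a minimal sketch with hypothetical records:

```python
def layout_page(recognized):
    """Place each recognized single character at the position given by
    its row and column labels, yielding the target text line by line."""
    rows = {}
    for item in recognized:   # item: {"row": int, "col": int, "char": str}
        rows.setdefault(item["row"], {})[item["col"]] = item["char"]
    return ["".join(rows[r][c] for c in sorted(rows[r])) for r in sorted(rows)]

recognized = [{"row": 1, "col": 2, "char": "天"},
              {"row": 1, "col": 1, "char": "今"},
              {"row": 2, "col": 1, "char": "好"}]
print(layout_page(recognized))  # ['今天', '好']
```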
In the multilingual text recognition method provided by this embodiment, adjacent text line images corresponding to the same target language are integrated into a text block image, so that subsequent recognition proceeds in units of text blocks; the context semantics among the text line images are fully considered, which improves the accuracy of character recognition. The text block image is cut character by character into single-character images, each with a recognition order and a row-column label that allow it to be positioned. The target OCR recognition model then recognizes each single-character image to determine at least one recognition result and the corresponding recognition probability; because the model is dedicated to the target language, the recognition results are more accurate. Next, according to all the recognition results and the layout type of the text block image, all the single-character images are divided into at least two units to be recognized on which semantic analysis can be performed, ensuring the accuracy of the single character recognized for each image. Finally, page layout is performed on all the single characters according to the recognition order and row-column labels of the single-character images to acquire the target characters corresponding to the text line images, so that subsequent comparison and verification can guarantee the recognition accuracy of the target characters.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
In one embodiment, a multilingual text recognition apparatus is provided, and the multilingual text recognition apparatus corresponds one to one to the multilingual text recognition method in the above embodiments. As shown in fig. 8, the multilingual text recognition apparatus includes an image to be recognized acquisition module 801, a text line image acquisition module 802, a target language acquisition module 803, a recognition model acquisition module 804, a target character acquisition module 805 and a recognition text acquisition module 806. The functional modules are described in detail as follows:
The image to be recognized acquisition module 801 is configured to acquire an image to be recognized, where the image to be recognized includes original characters corresponding to at least two languages.
The text line image acquisition module 802 is configured to perform layout analysis and recognition on the image to be recognized, acquire at least one text line image, and determine the text line position of each text line image in the image to be recognized.
The target language acquisition module 803 is configured to perform language type identification on each text line image and acquire the target language corresponding to each text line image.
The recognition model acquisition module 804 is configured to query the recognition model database based on the target language and acquire the target OCR recognition model corresponding to the target language.
The target character acquisition module 805 is configured to perform transcription recognition on the text line image by using the target OCR recognition model and acquire the target characters corresponding to the text line image.
The recognition text acquisition module 806 is configured to acquire the target recognition text corresponding to the image to be recognized based on the target characters and the text line positions corresponding to the text line images.
Preferably, before the image to be recognized acquisition module 801, the multilingual text recognition apparatus further includes:
The blur detection unit is used for acquiring an original image, detecting the original image by adopting a blur degree detection algorithm, and acquiring the blur degree of the original image.
The first blur processing unit is used for acquiring blur prompt information if the blur degree is greater than a first blur threshold.
The second blur processing unit is used for sharpening and correcting the original image to acquire the image to be recognized if the blur degree is not greater than the first blur threshold and is greater than a second blur threshold.
The third blur processing unit is used for correcting the original image to acquire the image to be recognized if the blur degree is not greater than the second blur threshold.
Wherein the first blur threshold is greater than the second blur threshold.
Preferably, the text line image acquisition module 802 includes:
The text line area acquisition unit is used for performing text positioning on the image to be recognized by adopting a text positioning algorithm to acquire at least one text line area.
The text line image determination unit is used for capturing an image of each of the at least one text line area by adopting a screenshot tool, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized according to the screenshot order of the screenshot tool.
Preferably, the target language acquisition module 803 includes:
The feature map acquisition unit is used for performing format conversion on the text line image to acquire a feature map conforming to a preset format.
The feature vector acquisition unit is used for encoding the feature map by adopting the Encoder component to acquire the corresponding feature vector.
The target language output unit is used for integrating and classifying the feature vectors by adopting the Summarizer component, acquiring at least one language probability, and outputting the identified language with the largest language probability as the target language corresponding to the text line image.
Preferably, the target character acquisition module 805 includes:
The text block image acquisition unit is used for integrating adjacent text line images corresponding to the same target language to form a text block image, based on the target language and the text line position corresponding to each text line image.
The single-character image acquisition unit is used for cutting the text block image by adopting a character segmentation algorithm to acquire at least two single-character images, and acquiring the recognition order and the row-column label corresponding to each single-character image according to the layout type of the text block image and the cutting order.
The recognition result acquisition unit is used for inputting the single-character images into the target OCR recognition model for transcription recognition according to their recognition order, and acquiring at least one recognition result corresponding to each single-character image and the recognition probability corresponding to each recognition result.
The unit-to-be-recognized dividing unit is used for dividing all the single-character images in the text block image into at least two units to be recognized, based on the recognition results and the layout type.
The single character acquisition unit is used for acquiring the single character corresponding to each single-character image, based on at least one recognition result corresponding to each single-character image in any unit to be recognized and the recognition probability corresponding to each recognition result.
The target character acquisition unit is used for performing page layout on the single characters corresponding to the single-character images according to the recognition order and the row-column label of each single-character image, to acquire the target characters corresponding to the text line images.
Preferably, the unit-to-be-recognized dividing unit includes:
The first unit acquisition subunit is used for forming a unit to be recognized based on all the single-character images between any two adjacent recognition symbols if the recognition results contain recognition symbols.
The second unit acquisition subunit is used for forming a unit to be recognized based on all the single-character images corresponding to the same row label or the same column label, according to the layout type of the text block image and the row and column labels corresponding to the single-character images, if the recognition results do not contain a recognition symbol.
Preferably, the single character acquisition unit includes:
The first single character acquisition subunit is used for, if every single-character image in any unit to be recognized has a recognition character whose recognition probability is greater than the preset probability threshold, determining that recognition character as the single character corresponding to the single-character image.
The second single character acquisition subunit is used for, if at least one single-character image in any unit to be recognized has no recognition character whose recognition probability is greater than the preset probability threshold, recognizing, with the target language model and in the recognition order of the single-character images, the word sequences formed by the recognition characters of all the single-character images in the unit, acquiring the sequence probability corresponding to each word sequence, and acquiring the single character corresponding to each single-character image from the word sequence with the largest sequence probability.
For the specific limitations of the multilingual text-recognition apparatus, reference may be made to the above limitations of the multilingual text-recognition method, which will not be described herein again. The modules in the multilingual text-recognition apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data used or generated during the execution of the multilingual text recognition method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a multilingual text-recognition method.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the multilingual text recognition method in the foregoing embodiments is implemented, for example, steps S201 to S206 shown in fig. 2 or the steps shown in figs. 3 to 6, which are not repeated here to avoid repetition. Alternatively, when executing the computer program the processor implements the functions of each module/unit in the embodiment of the multilingual text recognition apparatus, such as the functions of the image to be recognized acquisition module 801, the text line image acquisition module 802, the target language acquisition module 803, the recognition model acquisition module 804, the target character acquisition module 805 and the recognition text acquisition module 806 shown in fig. 8, which are likewise not repeated here.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the multilingual text recognition method in the foregoing embodiments, for example, steps S201 to S206 shown in fig. 2 or the steps shown in figs. 3 to 6, which are not repeated here to avoid repetition. Alternatively, when executed by the processor, the computer program implements the functions of the modules/units in the above multilingual text recognition apparatus, such as the functions of the image to be recognized acquisition module 801, the text line image acquisition module 802, the target language acquisition module 803, the recognition model acquisition module 804, the target character acquisition module 805 and the recognition text acquisition module 806 shown in fig. 8, which are likewise not repeated here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the division of the functional units and modules described above is illustrated; in practical applications, the functions may be allocated to different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method for multi-lingual text recognition, comprising:
Acquiring an image to be recognized, wherein the image to be recognized comprises original characters corresponding to at least two languages;
Performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized;
Performing language type identification on each text line image to obtain a target language corresponding to each text line image;
Inquiring a recognition model database based on the target language, and acquiring a target OCR recognition model corresponding to the target language;
Performing transcription recognition on the text line image by adopting the target OCR recognition model to obtain target characters corresponding to the text line image;
And acquiring a target recognition text corresponding to the image to be recognized based on the target characters corresponding to the text line image and the text line position.
2. The multilingual text-recognition method of claim 1, wherein, prior to said acquiring the image to be recognized, said multilingual text-recognition method further comprises:
Acquiring an original image, and detecting the original image by adopting a blur degree detection algorithm to acquire the blur degree of the original image;
If the blur degree is greater than a first blur threshold, acquiring blur prompt information;
If the blur degree is not greater than the first blur threshold and is greater than a second blur threshold, sharpening and correcting the original image to obtain the image to be recognized;
If the blur degree is not greater than the second blur threshold, correcting the original image to obtain the image to be recognized;
Wherein the first blur threshold is greater than the second blur threshold.
3. The multilingual text-recognition method of claim 1, wherein said performing layout analysis and recognition on said image to be recognized, acquiring at least one text line image, and determining the text line position of each of said text line images in said image to be recognized comprises:
Performing text positioning on the image to be recognized by adopting a text positioning algorithm to obtain at least one text line area;
and adopting a screenshot tool to screenshot at least one text line area to obtain at least one text line image, and determining the text line position of each text line image in the image to be recognized according to the screenshot sequence of the screenshot tool.
4. The multilingual text-recognition method of claim 1, wherein said performing language type identification on each of said text line images to obtain a target language corresponding to each of said text line images comprises:
Carrying out format conversion on the text line image to obtain a feature map conforming to a preset format;
Encoding the feature map by adopting an Encoder component to obtain a corresponding feature vector;
And adopting a Summarizer component to integrate and classify the feature vectors to obtain at least one language probability, and outputting the identified language with the largest language probability as the target language corresponding to the text line image.
5. The multilingual text-recognition method of claim 1, wherein said performing transcription recognition on the text line image by adopting the target OCR recognition model to obtain the target characters corresponding to the text line image comprises:
Integrating adjacent text line images corresponding to the same target language to form a text block image based on the target language corresponding to the text line image and the text line position;
Cutting the text block image by adopting a character cutting algorithm to obtain at least two single-character images, and obtaining a recognition order and a row-column label corresponding to each single-character image according to the layout type and the cutting order of the text block image;
Inputting the single-character images into the target OCR recognition model for transcription recognition according to the recognition order corresponding to each single-character image, and acquiring at least one recognition result corresponding to each single-character image and the recognition probability corresponding to each recognition result;
Dividing all the single-character images in the text block image into at least two units to be recognized based on the recognition results and the layout type;
Acquiring the single character corresponding to each single-character image based on at least one recognition result corresponding to each single-character image and the recognition probability corresponding to each recognition result in any unit to be recognized;
And performing page layout on the single characters corresponding to the single-character images according to the recognition order and the row and column labels corresponding to each single-character image, to obtain the target characters corresponding to the text line image.
6. The multilingual text-recognition method of claim 5, wherein the dividing all of the single-character images in the text block image into at least two units to be recognized based on the recognition results and the layout type comprises:
If the recognition results contain recognition symbols, forming a unit to be recognized based on all the single-character images between any two adjacent recognition symbols;
And if the recognition results do not contain a recognition symbol, forming a unit to be recognized based on all the single-character images corresponding to the same row label or the same column label, according to the layout type of the text block image and the row and column labels corresponding to the single-character images.
7. The multilingual text-recognition method of claim 5, wherein said obtaining the single character corresponding to each single-character image based on at least one recognition result corresponding to each single-character image and the recognition probability corresponding to each recognition result in any unit to be recognized comprises:
If every single-character image in any unit to be recognized has a recognition character whose recognition probability is greater than a preset probability threshold, determining the recognition character whose recognition probability is greater than the preset probability threshold as the single character corresponding to the single-character image;
If at least one single-character image in any unit to be recognized has no recognition character whose recognition probability is greater than the preset probability threshold, recognizing, by adopting a target language model and according to the recognition order corresponding to the single-character images, the word sequences formed by at least one recognition character corresponding to all the single-character images in the unit to be recognized, acquiring the sequence probability corresponding to each word sequence, and acquiring the single character corresponding to each single-character image based on the word sequence with the largest sequence probability.
8. A multilingual text-recognition apparatus, comprising:
An image to be recognized acquisition module, used for acquiring an image to be recognized, wherein the image to be recognized comprises original characters corresponding to at least two languages;
the text line image acquisition module is used for performing layout analysis and identification on the image to be identified, acquiring at least one text line image and determining the text line position of each text line image in the image to be identified;
The target language type acquisition module is used for identifying the language type of each text line image and acquiring the target language type corresponding to each text line image;
The recognition model acquisition module is used for inquiring a recognition model database based on the target language and acquiring a target OCR recognition model corresponding to the target language;
The target character acquisition module is used for performing transcription recognition on the text line image by adopting the target OCR recognition model to acquire target characters corresponding to the text line image;
And the identification text acquisition module is used for acquiring a target identification text corresponding to the image to be identified based on the target characters corresponding to the text line image and the text line position.
9. A computer device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, characterized in that said processor implements the multilingual text-recognition method of any one of claims 1 to 7 when executing said computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a method of multilingual text-recognition according to any one of claims 1 to 7.
CN201910706802.7A 2019-08-01 2019-08-01 Multilingual text recognition method, device, computer equipment and storage medium Active CN110569830B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910706802.7A CN110569830B (en) 2019-08-01 2019-08-01 Multilingual text recognition method, device, computer equipment and storage medium
PCT/CN2019/116488 WO2021017260A1 (en) 2019-08-01 2019-11-08 Multi-language text recognition method and apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910706802.7A CN110569830B (en) 2019-08-01 2019-08-01 Multilingual text recognition method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110569830A true CN110569830A (en) 2019-12-13
CN110569830B CN110569830B (en) 2023-08-22

Family

ID=68773976

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910706802.7A Active CN110569830B (en) 2019-08-01 2019-08-01 Multilingual text recognition method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110569830B (en)
WO (1) WO2021017260A1 (en)

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444906A (en) * 2020-03-24 2020-07-24 腾讯科技(深圳)有限公司 Image recognition method based on artificial intelligence and related device
CN111444905A (en) * 2020-03-24 2020-07-24 腾讯科技(深圳)有限公司 Image recognition method based on artificial intelligence and related device
CN111461097A (en) * 2020-03-18 2020-07-28 北京大米未来科技有限公司 Method, apparatus, electronic device and medium for recognizing image information
CN111488826A (en) * 2020-04-10 2020-08-04 腾讯科技(深圳)有限公司 Text recognition method and device, electronic equipment and storage medium
CN111563495A (en) * 2020-05-09 2020-08-21 北京奇艺世纪科技有限公司 Method and device for recognizing characters in image and electronic equipment
CN111626383A (en) * 2020-05-29 2020-09-04 Oppo广东移动通信有限公司 Font identification method and device, electronic equipment and storage medium
CN111709249A (en) * 2020-05-29 2020-09-25 北京百度网讯科技有限公司 Multi-language model training method and device, electronic equipment and storage medium
CN111783786A (en) * 2020-07-06 2020-10-16 上海摩勤智能技术有限公司 Picture identification method and system, electronic equipment and storage medium
CN111797922A (en) * 2020-07-03 2020-10-20 泰康保险集团股份有限公司 Text image classification method and device
CN111832550A (en) * 2020-07-13 2020-10-27 北京易真学思教育科技有限公司 Data set manufacturing method and device, electronic equipment and storage medium
CN111832657A (en) * 2020-07-20 2020-10-27 上海眼控科技股份有限公司 Text recognition method and device, computer equipment and storage medium
CN111914825A (en) * 2020-08-03 2020-11-10 腾讯科技(深圳)有限公司 Character recognition method and device and electronic equipment
CN111967545A (en) * 2020-10-26 2020-11-20 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and computer storage medium
CN112101367A (en) * 2020-09-15 2020-12-18 杭州睿琪软件有限公司 Text recognition method, image recognition and classification method and document recognition processing method
CN112100063A (en) * 2020-08-31 2020-12-18 腾讯科技(深圳)有限公司 Interface language display test method and device, computer equipment and storage medium
CN112149680A (en) * 2020-09-28 2020-12-29 武汉悦学帮网络技术有限公司 Wrong word detection and identification method and device, electronic equipment and storage medium
CN112185348A (en) * 2020-10-19 2021-01-05 平安科技(深圳)有限公司 Multilingual voice recognition method and device and electronic equipment
CN112200188A (en) * 2020-10-16 2021-01-08 北京市商汤科技开发有限公司 Character recognition method and device, and storage medium
CN112288018A (en) * 2020-10-30 2021-01-29 北京市商汤科技开发有限公司 Training method of character recognition network, character recognition method and device
CN112364667A (en) * 2020-11-10 2021-02-12 成都安易迅科技有限公司 Character checking method and device, computer equipment and computer readable storage medium
CN112464724A (en) * 2020-10-30 2021-03-09 中科院成都信息技术股份有限公司 Vote identification method and system
CN112766052A (en) * 2020-12-29 2021-05-07 有米科技股份有限公司 CTC-based image character recognition method and device
CN112800972A (en) * 2021-01-29 2021-05-14 北京市商汤科技开发有限公司 Character recognition method and device, and storage medium
CN112818979A (en) * 2020-08-26 2021-05-18 腾讯科技(深圳)有限公司 Text recognition method, device, equipment and storage medium
CN112883967A (en) * 2021-02-24 2021-06-01 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN112883966A (en) * 2021-02-24 2021-06-01 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN112883968A (en) * 2021-02-24 2021-06-01 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN113111871A (en) * 2021-04-21 2021-07-13 北京金山数字娱乐科技有限公司 Training method and device of text recognition model and text recognition method and device
CN113780131A (en) * 2021-08-31 2021-12-10 众安在线财产保险股份有限公司 Text image orientation recognition method and text content recognition method, device and equipment
CN113780276A (en) * 2021-09-06 2021-12-10 成都人人互娱科技有限公司 Text detection and identification method and system combined with text classification
CN113822280A (en) * 2020-06-18 2021-12-21 阿里巴巴集团控股有限公司 Text recognition method, device and system and nonvolatile storage medium
CN113903035A (en) * 2021-12-06 2022-01-07 北京惠朗时代科技有限公司 Character recognition method and system based on super-resolution multi-scale reconstruction
CN114140928A (en) * 2021-11-19 2022-03-04 苏州益多多信息科技有限公司 High-precision digital color unification ticket checking method, system and medium
CN114170594A (en) * 2021-12-07 2022-03-11 奇安信科技集团股份有限公司 Optical character recognition method, device, electronic equipment and storage medium
CN114596566A (en) * 2022-04-18 2022-06-07 腾讯科技(深圳)有限公司 Text recognition method and related device
CN114663878A (en) * 2022-05-25 2022-06-24 成都飞机工业(集团)有限责任公司 Finished product software version checking method, device, equipment and medium
CN112185348B (en) * 2020-10-19 2024-05-03 平安科技(深圳)有限公司 Multilingual voice recognition method and device and electronic equipment

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553524B (en) * 2021-06-30 2022-10-18 上海硬通网络科技有限公司 Method, device, equipment and storage medium for typesetting characters of webpage
CN113420756B (en) * 2021-07-28 2023-05-12 浙江大华技术股份有限公司 Identification method and device for certificate image, storage medium and electronic device
CN115033318B (en) * 2021-11-22 2023-04-14 荣耀终端有限公司 Character recognition method for image, electronic device and storage medium
CN114047981A (en) * 2021-12-24 2022-02-15 珠海金山数字网络科技有限公司 Project configuration method and device
CN116502625B (en) * 2023-06-28 2023-09-15 浙江同花顺智能科技有限公司 Resume analysis method and system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609408A (en) * 2012-01-11 2012-07-25 清华大学 Cross-lingual interpretation method based on multi-lingual document image recognition
US20150370785A1 (en) * 2014-06-24 2015-12-24 Google Inc. Techniques for machine language translation of text from an image based on non-textual context information from the image
CN107515849A (en) * 2016-06-15 2017-12-26 阿里巴巴集团控股有限公司 It is a kind of into word judgment model generating method, new word discovery method and device
CN108197109A (en) * 2017-12-29 2018-06-22 北京百分点信息科技有限公司 A kind of multilingual analysis method and device based on natural language processing
CN109492643A (en) * 2018-10-11 2019-03-19 平安科技(深圳)有限公司 Certificate recognition methods, device, computer equipment and storage medium based on OCR
CN109492143A (en) * 2018-09-21 2019-03-19 平安科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN109543690A (en) * 2018-11-27 2019-03-29 北京百度网讯科技有限公司 Method and apparatus for extracting information
CN109598272A (en) * 2019-01-11 2019-04-09 北京字节跳动网络技术有限公司 Recognition methods, device, equipment and the medium of character row image
CN109840465A (en) * 2017-11-29 2019-06-04 三星电子株式会社 Identify the electronic device of the text in image

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7912289B2 (en) * 2007-05-01 2011-03-22 Microsoft Corporation Image text replacement
CN107656922B (en) * 2017-09-25 2021-07-20 广东小天才科技有限公司 Translation method, translation device, translation terminal and storage medium

Cited By (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461097A (en) * 2020-03-18 2020-07-28 北京大米未来科技有限公司 Method, apparatus, electronic device and medium for recognizing image information
CN111444906A (en) * 2020-03-24 2020-07-24 腾讯科技(深圳)有限公司 Image recognition method based on artificial intelligence and related device
CN111444905A (en) * 2020-03-24 2020-07-24 腾讯科技(深圳)有限公司 Image recognition method based on artificial intelligence and related device
CN111444905B (en) * 2020-03-24 2023-09-22 腾讯科技(深圳)有限公司 Image recognition method and related device based on artificial intelligence
CN111444906B (en) * 2020-03-24 2023-09-29 腾讯科技(深圳)有限公司 Image recognition method and related device based on artificial intelligence
CN111488826A (en) * 2020-04-10 2020-08-04 腾讯科技(深圳)有限公司 Text recognition method and device, electronic equipment and storage medium
CN111488826B (en) * 2020-04-10 2023-10-17 腾讯科技(深圳)有限公司 Text recognition method and device, electronic equipment and storage medium
CN111563495A (en) * 2020-05-09 2020-08-21 北京奇艺世纪科技有限公司 Method and device for recognizing characters in image and electronic equipment
CN111563495B (en) * 2020-05-09 2023-10-27 北京奇艺世纪科技有限公司 Method and device for recognizing characters in image and electronic equipment
CN111709249A (en) * 2020-05-29 2020-09-25 北京百度网讯科技有限公司 Multi-language model training method and device, electronic equipment and storage medium
CN111626383A (en) * 2020-05-29 2020-09-04 Oppo广东移动通信有限公司 Font identification method and device, electronic equipment and storage medium
CN111626383B (en) * 2020-05-29 2023-11-07 Oppo广东移动通信有限公司 Font identification method and device, electronic equipment and storage medium
CN113822280A (en) * 2020-06-18 2021-12-21 阿里巴巴集团控股有限公司 Text recognition method, device and system and nonvolatile storage medium
CN111797922A (en) * 2020-07-03 2020-10-20 泰康保险集团股份有限公司 Text image classification method and device
CN111797922B (en) * 2020-07-03 2023-11-28 泰康保险集团股份有限公司 Text image classification method and device
CN111783786A (en) * 2020-07-06 2020-10-16 上海摩勤智能技术有限公司 Picture identification method and system, electronic equipment and storage medium
CN111832550A (en) * 2020-07-13 2020-10-27 北京易真学思教育科技有限公司 Data set manufacturing method and device, electronic equipment and storage medium
CN111832657A (en) * 2020-07-20 2020-10-27 上海眼控科技股份有限公司 Text recognition method and device, computer equipment and storage medium
CN111914825B (en) * 2020-08-03 2023-10-27 腾讯科技(深圳)有限公司 Character recognition method and device and electronic equipment
CN111914825A (en) * 2020-08-03 2020-11-10 腾讯科技(深圳)有限公司 Character recognition method and device and electronic equipment
CN112818979B (en) * 2020-08-26 2024-02-02 腾讯科技(深圳)有限公司 Text recognition method, device, equipment and storage medium
CN112818979A (en) * 2020-08-26 2021-05-18 腾讯科技(深圳)有限公司 Text recognition method, device, equipment and storage medium
CN112100063A (en) * 2020-08-31 2020-12-18 腾讯科技(深圳)有限公司 Interface language display test method and device, computer equipment and storage medium
CN112101367A (en) * 2020-09-15 2020-12-18 杭州睿琪软件有限公司 Text recognition method, image recognition and classification method and document recognition processing method
WO2022057707A1 (en) * 2020-09-15 2022-03-24 杭州睿琪软件有限公司 Text recognition method, image recognition classification method, and document recognition processing method
CN112149680A (en) * 2020-09-28 2020-12-29 武汉悦学帮网络技术有限公司 Wrong word detection and identification method and device, electronic equipment and storage medium
CN112149680B (en) * 2020-09-28 2024-01-16 武汉悦学帮网络技术有限公司 Method and device for detecting and identifying wrong words, electronic equipment and storage medium
CN112200188B (en) * 2020-10-16 2023-09-12 北京市商汤科技开发有限公司 Character recognition method and device and storage medium
CN112200188A (en) * 2020-10-16 2021-01-08 北京市商汤科技开发有限公司 Character recognition method and device, and storage medium
CN112185348B (en) * 2020-10-19 2024-05-03 平安科技(深圳)有限公司 Multilingual voice recognition method and device and electronic equipment
CN112185348A (en) * 2020-10-19 2021-01-05 平安科技(深圳)有限公司 Multilingual voice recognition method and device and electronic equipment
CN111967545A (en) * 2020-10-26 2020-11-20 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and computer storage medium
CN112288018A (en) * 2020-10-30 2021-01-29 北京市商汤科技开发有限公司 Training method of character recognition network, character recognition method and device
CN112464724B (en) * 2020-10-30 2023-10-24 中科院成都信息技术股份有限公司 Vote identification method and system
CN112464724A (en) * 2020-10-30 2021-03-09 中科院成都信息技术股份有限公司 Vote identification method and system
CN112364667A (en) * 2020-11-10 2021-02-12 成都安易迅科技有限公司 Character checking method and device, computer equipment and computer readable storage medium
CN112364667B (en) * 2020-11-10 2023-03-24 成都安易迅科技有限公司 Character checking method and device, computer equipment and computer readable storage medium
CN112766052A (en) * 2020-12-29 2021-05-07 有米科技股份有限公司 CTC-based image character recognition method and device
CN112800972A (en) * 2021-01-29 2021-05-14 北京市商汤科技开发有限公司 Character recognition method and device, and storage medium
WO2022160598A1 (en) * 2021-01-29 2022-08-04 北京市商汤科技开发有限公司 Text recognition method and device, and storage medium
CN112883966B (en) * 2021-02-24 2023-02-24 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN112883967B (en) * 2021-02-24 2023-02-28 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN112883968B (en) * 2021-02-24 2023-02-28 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN112883968A (en) * 2021-02-24 2021-06-01 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN112883967A (en) * 2021-02-24 2021-06-01 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN112883966A (en) * 2021-02-24 2021-06-01 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN113111871B (en) * 2021-04-21 2024-04-19 北京金山数字娱乐科技有限公司 Training method and device of text recognition model, text recognition method and device
CN113111871A (en) * 2021-04-21 2021-07-13 北京金山数字娱乐科技有限公司 Training method and device of text recognition model and text recognition method and device
CN113780131B (en) * 2021-08-31 2024-04-12 众安在线财产保险股份有限公司 Text image orientation recognition method, text content recognition method, device and equipment
CN113780131A (en) * 2021-08-31 2021-12-10 众安在线财产保险股份有限公司 Text image orientation recognition method and text content recognition method, device and equipment
CN113780276B (en) * 2021-09-06 2023-12-05 成都人人互娱科技有限公司 Text recognition method and system combined with text classification
CN113780276A (en) * 2021-09-06 2021-12-10 成都人人互娱科技有限公司 Text detection and identification method and system combined with text classification
CN114140928A (en) * 2021-11-19 2022-03-04 苏州益多多信息科技有限公司 High-precision digital color unification ticket checking method, system and medium
CN114140928B (en) * 2021-11-19 2023-08-22 苏州益多多信息科技有限公司 High-precision digital color unified ticket checking method, system and medium
CN113903035A (en) * 2021-12-06 2022-01-07 北京惠朗时代科技有限公司 Character recognition method and system based on super-resolution multi-scale reconstruction
CN114170594A (en) * 2021-12-07 2022-03-11 奇安信科技集团股份有限公司 Optical character recognition method, device, electronic equipment and storage medium
WO2023202197A1 (en) * 2022-04-18 2023-10-26 腾讯科技(深圳)有限公司 Text recognition method and related apparatus
CN114596566A (en) * 2022-04-18 2022-06-07 腾讯科技(深圳)有限公司 Text recognition method and related device
CN114596566B (en) * 2022-04-18 2022-08-02 腾讯科技(深圳)有限公司 Text recognition method and related device
CN114663878A (en) * 2022-05-25 2022-06-24 成都飞机工业(集团)有限责任公司 Finished product software version checking method, device, equipment and medium

Also Published As

Publication number Publication date
WO2021017260A1 (en) 2021-02-04
CN110569830B (en) 2023-08-22

Similar Documents

Publication Title
CN110569830B (en) Multilingual text recognition method, device, computer equipment and storage medium
US11922318B2 (en) System and method of character recognition using fully convolutional neural networks with attention
US10936862B2 (en) System and method of character recognition using fully convolutional neural networks
CN108710866B (en) Chinese character model training method, Chinese character recognition method, device, equipment and medium
US7756335B2 (en) Handwriting recognition using a graph of segmentation candidates and dictionary search
EP1598770B1 (en) Low resolution optical character recognition for camera acquired documents
Razak et al. Off-line handwriting text line segmentation: A review
US10643094B2 (en) Method for line and word segmentation for handwritten text images
CN113158808B (en) Method, medium and equipment for Chinese ancient book character recognition, paragraph grouping and layout reconstruction
CN109343920B (en) Image processing method and device, equipment and storage medium thereof
US9286527B2 (en) Segmentation of an input by cut point classification
CN109740606B (en) Image identification method and device
WO2018090011A1 (en) System and method of character recognition using fully convolutional neural networks
CN110942004A (en) Handwriting recognition method and device based on neural network model and electronic equipment
CN109389115B (en) Text recognition method, device, storage medium and computer equipment
RU2597163C2 (en) Comparing documents using reliable source
US9418281B2 (en) Segmentation of overwritten online handwriting input
Song et al. Recognition of merged characters based on forepart prediction, necessity-sufficiency matching, and character-adaptive masking
CN115311666A (en) Image-text recognition method and device, computer equipment and storage medium
CN115147846A (en) Multi-language bill identification method, device, equipment and storage medium
CN113887375A (en) Text recognition method, device, equipment and storage medium
Razak et al. A real-time line segmentation algorithm for an offline overlapped handwritten Jawi character recognition chip
CN111832551A (en) Text image processing method and device, electronic scanning equipment and storage medium
Zheng et al. Chinese/English mixed character segmentation as semantic segmentation
KR100317653B1 (en) A feature extraction method for recognition of large-set printed characters

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant