WO2021017260A1 - Multilingual text recognition method, apparatus, computer device and storage medium - Google Patents

Multilingual text recognition method, apparatus, computer device and storage medium

Info

Publication number
WO2021017260A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
text
recognition
recognized
font
Prior art date
Application number
PCT/CN2019/116488
Other languages
English (en)
French (fr)
Inventor
王健宗
回艳菲
韩茂琨
于凤英
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021017260A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/146Aligning or centring of the image pick-up or image-field
    • G06V30/1475Inclination or skew detection or correction of characters or of image to be recognised
    • G06V30/1478Inclination or skew detection or correction of characters or of image to be recognised of characters or characters lines
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of text recognition technology, and in particular to a multilingual text recognition method, device, computer equipment and storage medium.
  • Seq2Seq refers to a model that converts a sequence in one domain (such as English) into a sequence in another domain (such as French).
  • The embodiments of the present application provide a multilingual text recognition method, device, computer equipment, and storage medium, to solve the problem of low recognition accuracy when recognition models are currently used for multilingual text recognition.
  • A multilingual text recognition method includes: acquiring an image to be recognized, where the image to be recognized includes original text corresponding to at least two languages; performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized; performing language recognition on each text line image to obtain the target language corresponding to each text line image; querying a recognition model database based on the target language to obtain the target OCR recognition model corresponding to the target language; using the target OCR recognition model to perform transcription recognition on the text line image to obtain the target text corresponding to the text line image; and acquiring the target recognition text corresponding to the image to be recognized based on the target text corresponding to the text line image and the text line position.
  • a multilingual text recognition device includes:
  • a to-be-recognized image acquisition module for acquiring a to-be-recognized image, where the to-be-recognized image includes original text corresponding to at least two languages;
  • a text line image acquisition module configured to perform layout analysis and recognition on the image to be recognized, acquire at least one text line image, and determine the text line position of each text line image in the image to be recognized;
  • the target language acquisition module is configured to perform language recognition on each of the text line images, and obtain the target language corresponding to each of the text line images;
  • a recognition model obtaining module configured to query a recognition model database based on the target language, and obtain a target OCR recognition model corresponding to the target language;
  • the target text obtaining module is configured to use the target OCR recognition model to perform transcription recognition on the text line image, and obtain the target text corresponding to the text line image;
  • the recognition text obtaining module is configured to obtain the target recognition text corresponding to the image to be recognized based on the target text corresponding to the text line image and the text line position.
  • A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the steps of the above multilingual text recognition method are implemented.
  • One or more readable storage media storing computer readable instructions are provided; when the computer readable instructions are executed by one or more processors, the one or more processors perform the steps of the above multilingual text recognition method.
  • FIG. 1 is a schematic diagram of an application environment of a multilingual text recognition method in an embodiment of the present application
  • FIG. 2 is a flowchart of a multilingual text recognition method in an embodiment of the present application
  • FIG. 3 is another flowchart of a multilingual text recognition method in an embodiment of the present application.
  • FIG. 4 is another flowchart of the multilingual text recognition method in an embodiment of the present application.
  • FIG. 5 is another flowchart of a multilingual text recognition method in an embodiment of the present application.
  • FIG. 6 is another flowchart of a multilingual text recognition method in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the model structure of the Encoder-Summarizer mechanism in an embodiment of the present application, where 7(a) is the model structure of the Encoder component, 7(b) is the model structure of the Summarizer component, and 7(c) is the model structure of the Inception layer in 7(a);
  • FIG. 8 is a schematic diagram of a multilingual text recognition device in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a computer device in an embodiment of the present application.
  • the multilingual text recognition method provided by the embodiment of the present application can be applied to the application environment shown in FIG. 1.
  • The multilingual text recognition method is applied in a multilingual text recognition system, which includes a client and a server as shown in FIG. 1. The client and the server communicate through a network. For an image to be recognized in which multiple languages coexist, the corresponding target OCR recognition model can be determined for each language, and multiple target OCR recognition models are used for recognition, ensuring the recognition accuracy of the finally recognized target recognition text.
  • The client, also known as the user side, refers to a program that corresponds to the server and provides local services for users.
  • the client can be installed on, but not limited to, various personal computers, laptops, smart phones, tablet computers, and portable wearable devices.
  • the server can be implemented as an independent server or a server cluster composed of multiple servers.
  • In an embodiment, a multilingual text recognition method is provided. Taking the application of the method to the server in FIG. 1 as an example, the method includes the following steps:
  • S201 Acquire an image to be recognized, where the image to be recognized includes original characters corresponding to at least two languages.
  • the image to be recognized refers to an image that requires character recognition.
  • the original text refers to the text recorded in the image to be recognized.
  • the original text in the image to be recognized corresponds to at least two languages, that is, the image to be recognized is an image in which at least two languages coexist.
  • For example, the original text in a single image to be recognized may correspond to Chinese, English, and Japanese simultaneously.
  • S202 Perform layout analysis and recognition on the image to be recognized, acquire at least one text line image, and determine the text line position of each text line image in the image to be recognized.
  • performing layout analysis and recognition on the image to be recognized refers to segmenting the distribution structure of the image to be recognized, and analyzing the image characteristics of each segmented partial image to determine the attribute category corresponding to the partial image.
  • This attribute category includes but is not limited to text blocks, image blocks, and table blocks.
  • A text line image is an image containing at least one original character. Generally speaking, if the typesetting type corresponding to the text line image is horizontal, each text line image includes one line of original text; if the typesetting type is vertical, each text line image includes one column of original text.
  • Specifically, the server may first perform image preprocessing operations such as grayscale transformation, binarization, smoothing, and edge detection on the image to be recognized, and then perform layout analysis and recognition on the preprocessed image to obtain at least two block images; it then performs attribute analysis on each block image to obtain its attribute category, intercepts the block images whose attribute category is text block, and determines them as text line images.
  • The layout analysis of the image to be recognized can use, but is not limited to, a bottom-up layout analysis algorithm guided by multi-level credibility, a layout analysis algorithm based on neighborhood features, or a projection algorithm combined with bottom-up layout analysis.
  • After the server performs layout analysis and recognition on the image to be recognized and obtains at least one text line image, it needs to determine the text line position of each text line image in the image to be recognized, so that subsequent text positioning or context-based recognition can be performed based on the text line position, thereby ensuring the accuracy of the target recognition text recognized from the image to be recognized.
  • Take an image to be recognized whose typesetting type is horizontal as an example. After the server performs layout analysis and recognition on the image and obtains at least one text line image, it sorts the text line images by the ordinate of the upper-left corner or the center of the area where each text line image is located, obtaining the line number corresponding to each text line image and, with it, the text line position. For example, a text line image with line number 001 is the text line image corresponding to the first line of original text, so that subsequent positioning or contextual semantic analysis can be performed based on the line number to improve text recognition accuracy.
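  • As an illustration of this ordering step, the following minimal Python sketch sorts detected text line bounding boxes by their top-left ordinate to assign line numbers; the box format and function name are illustrative assumptions, not part of the publication:

```python
# Minimal sketch: assign line numbers to detected text line images by
# sorting their bounding boxes on the vertical (y) coordinate, assuming
# horizontal typesetting. Names and the box format are illustrative only.

def assign_line_numbers(boxes):
    """boxes: list of (x, y, w, h) tuples, one per detected text line area.
    Returns a list of (line_number, box) pairs, numbered top to bottom."""
    ordered = sorted(boxes, key=lambda b: b[1])  # sort by top-left ordinate
    return [(i + 1, box) for i, box in enumerate(ordered)]

boxes = [(12, 310, 400, 28), (10, 40, 420, 30), (11, 175, 415, 29)]
for line_no, box in assign_line_numbers(boxes):
    print(f"line {line_no:03d}: box {box}")
```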
  • S203 Perform language type recognition on each text line image, and obtain the target language type corresponding to each text line image.
  • the target language refers to the language type corresponding to the original text in the text line image, such as Chinese, English, or Japanese.
  • Specifically, the server determines the target language corresponding to each text line image by recognizing the language of each text line image, so that the corresponding target OCR recognition model can be determined based on the target language and used to recognize the text line image. In this way, the multilingual image to be recognized is converted into multiple single-language text line images, and multiple target OCR recognition models then recognize their corresponding text line images, improving the recognition accuracy of the entire image to be recognized.
  • S204 Query the recognition model database based on the target language, and obtain the target OCR recognition model corresponding to the target language.
  • the recognition model database is a database used to store OCR recognition models used to recognize different languages.
  • the language OCR recognition model corresponding to different languages is stored in the recognition model database.
  • Each language OCR recognition model corresponds to one language type, that is, it is an OCR recognition model used to recognize the characters of that language.
  • Specifically, the server queries the recognition model database according to the target language identified for each text line image, and determines the language OCR recognition model corresponding to the target language as the target OCR recognition model, which is then used to recognize the text line image, improving the accuracy of its text recognition.
  • S205 Use the target OCR recognition model to perform transcription recognition on the text line image, and obtain the target text corresponding to the text line image.
  • Specifically, the server may determine the recognition order according to the text line position corresponding to each text line image, and use the target OCR recognition model to transcribe and recognize the corresponding text line image to obtain its target text. This ensures the recognition accuracy of the target text corresponding to each text line image and avoids the low accuracy that occurs when a single OCR recognition model is used to recognize multiple text line images in different languages.
  • the target text refers to the text obtained by recognizing the text line image using the target OCR recognition model.
  • S206 Obtain target recognition text corresponding to the image to be recognized based on the target text corresponding to the text line image and the text line position.
  • After the server recognizes the target text corresponding to each text line image, it needs to retypeset the target texts based on the text line positions corresponding to the text line images, so as to obtain target recognition text with the same layout as the image to be recognized, allowing the original text at the corresponding positions in the image to be compared and verified against the target recognition text.
  • For example, if the text line position is determined by the line number corresponding to each text line image, then after the target text corresponding to each text line image is recognized, the target texts are typeset according to their line numbers to obtain the target recognition text corresponding to the image to be recognized for subsequent comparison and verification, ensuring recognition accuracy.
  • In the multilingual text recognition method provided by this embodiment, layout analysis and recognition are performed first to determine at least one text line image and the corresponding text line positions, so that the multilingual image to be recognized is converted into single-language text line images for recognition, which helps improve accuracy in the subsequent recognition process. The target OCR recognition model corresponding to each target language is then used to recognize the text line image, ensuring the accuracy of the target text recognized from each text line image.
  • In an embodiment, before step S201, that is, before acquiring the image to be recognized, the multilingual text recognition method further includes:
  • S301 Obtain an original image, detect the original image using a blur degree detection algorithm, and obtain the blur degree of the original image.
  • the original image refers to the unprocessed image obtained by the server.
  • the blur degree detection algorithm is an algorithm for detecting the blur degree of an image.
  • The blur degree detection algorithm can be a detection algorithm commonly used in the prior art.
  • the blur degree of the original image is a numerical value used to reflect the blur degree of the original image. The larger the blur degree, the more blurry the original image; correspondingly, the smaller the blur degree, the clearer the original image.
  • The blur degree detection algorithm in this embodiment can use, but is not limited to, the Laplacian operator. The Laplacian operator is a second-order differential operator, suitable for handling image blur caused by diffuse reflection of light. Its principle is that, when an image is captured and recorded, light spots diffusely reflect light to their surrounding areas; this diffuse reflection blurs the image to a certain extent, and the degree of blur, compared with an image captured under normal circumstances, is often a constant multiple of the Laplacian. Therefore, step S301 specifically includes the following steps:
  • S3011: Use the Laplacian operator to sharpen the original image, obtaining the sharpened image and the pixel gray values of the sharpened image. That is, the server first processes the original image with the Laplacian operator to obtain a Laplacian image describing the gray-level transitions, and then superimposes the Laplacian image onto the original image to obtain the sharpened image. After obtaining the sharpened image, the RGB value of each pixel in it is obtained and processed into the pixel gray value corresponding to the sharpened image.
  • S3012: Perform a variance calculation on the pixel gray values of the sharpened image, obtain the target variance value corresponding to the sharpened image, and determine the target variance value as the blur degree corresponding to the original image. That is, the server first computes the sum of the squared differences between each pixel's gray value and the mean gray value of the sharpened image, and then divides this sum by the number of pixels to obtain the target variance value of the sharpened image.
  • In this embodiment, the Laplacian operator is first used to sharpen the original image, obtaining a sharpened image with clearer details than the original and thereby improving image clarity. The target variance value of the sharpened image is then calculated to reflect the differences among the pixel gray values of the sharpened image, and this target variance value is taken as the blur degree of the original image, so that the original image can be filtered for blur according to the comparison of the blur degree with preset thresholds, achieving the purpose of obtaining a clearer original image.
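  • The following Python sketch shows one common way to realize this check with OpenCV, taking the variance of the Laplacian response as the measure. Note that in this formulation a higher variance indicates a sharper image, so a practical implementation would orient its thresholds accordingly; the function name and file handling are illustrative:

```python
import cv2

def blur_degree(path):
    """Sketch of the Laplacian-based blur check described above:
    second-derivative response, then the variance of the result.
    Caution: a HIGHER variance here means a SHARPER image."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    lap = cv2.Laplacian(gray, cv2.CV_64F)  # second-order differential operator
    return lap.var()                       # target variance value
```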
  • The first blur threshold is a threshold preset by the system for evaluating the highest blur degree at which an image can still serve as the image to be recognized. The blur prompt information is information for prompting that the image is too blurry. If the blur degree of the original image is greater than the first blur threshold, the blur prompt information is obtained and sent to the corresponding terminal, so that the user re-uploads the original image to the server based on the blur prompt information.
  • The second blur threshold is a threshold preset by the system for evaluating the lowest blur degree at which an image can be regarded as a clear image to be recognized; understandably, the first blur threshold is greater than the second blur threshold. When the blur degree of the original image is greater than the second blur threshold but not greater than the first blur threshold, the original image is not too blurry yet does not reach the standard of clarity. In this case, the original image is sharpened to improve its clarity, and the sharpened image is then corrected to obtain a clearer, non-tilted image to be recognized, ensuring the accuracy of subsequent text recognition. Generally speaking, during optical scanning the position of the scanned image may be skewed for objective reasons, which affects the accuracy of later image recognition processing, so the image needs to be corrected.
  • The key to image tilt correction is to automatically detect the tilt direction and tilt angle of the image based on image features. Commonly used tilt angle detection methods include projection-based methods, the Hough transform, line fitting, and detection after a Fourier transform to the frequency domain.
  • When the blur degree of the original image is not greater than the second blur threshold, the original image is already clear and no sharpening is needed, which improves image processing efficiency; however, since its position may still be skewed for various objective reasons, the server performs correction processing on the original image to obtain a corrected, non-tilted image to be recognized.
  • In this embodiment, corresponding processing is performed according to the comparison of the original image's blur degree with the first and second blur thresholds, so as to ensure that a clearer, non-tilted image to be recognized is finally obtained, thereby ensuring the accuracy of text recognition based on the image to be recognized and preventing blur or tilt from interfering with the recognition result.
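  • As a sketch of the correction step, the following Python code estimates the tilt angle with the Hough transform (one of the methods named above) and rotates the image back; all parameter values are illustrative assumptions:

```python
import cv2
import numpy as np

def deskew(image):
    """Sketch of tilt correction via the Hough transform; the Canny and
    Hough parameters below are illustrative, not from the publication."""
    edges = cv2.Canny(image, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                            minLineLength=image.shape[1] // 3, maxLineGap=20)
    if lines is None:
        return image
    # Estimate the dominant tilt angle from near-horizontal line segments.
    angles = []
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
        if abs(angle) < 45:
            angles.append(angle)
    if not angles:
        return image
    tilt = float(np.median(angles))
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), tilt, 1.0)
    return cv2.warpAffine(image, m, (w, h), flags=cv2.INTER_LINEAR,
                          borderMode=cv2.BORDER_REPLICATE)
```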
  • In an embodiment, step S202, namely performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized, specifically includes the following steps:
  • S401 Use a text positioning algorithm to perform text positioning on the image to be recognized, and obtain at least one text line area.
  • the text positioning algorithm is an algorithm used to locate text in an image.
  • the text location algorithm includes but is not limited to the proximity search algorithm and the CTPN-RNN algorithm.
  • the text line area refers to the area containing the original text that is recognized from the image to be recognized by using a text positioning algorithm.
  • the text line area is an area determined based on a line of original text or a column of original text.
  • The proximity search algorithm is an algorithm that starts from a connected area, finds the horizontal circumscribed rectangle of the connected area, and expands the connected area to the entire rectangle. The direction of expansion is toward the nearest neighboring area, and the expansion is performed if and only if that direction is horizontal, so as to obtain at least one text line area in the image to be recognized. This method effectively merges original text on the same line of the image into one text line area, achieving the purpose of text positioning.
  • Specifically, the process of using the proximity search algorithm to locate text in the image to be recognized includes forming a rectangular area from any one or more original characters in the image and calculating the center vector difference of any two rectangular areas (that is, the vector difference formed by the center points of the two rectangular areas).
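  • A minimal sketch of the center vector difference and the horizontal-expansion test, assuming rectangles are given as (x, y, w, h) tuples (an illustrative convention, not one fixed by the publication):

```python
import numpy as np

def center_vector_diff(rect_a, rect_b):
    """Vector difference between the center points of two rectangular
    areas, as used by the proximity search described above."""
    ca = np.array([rect_a[0] + rect_a[2] / 2, rect_a[1] + rect_a[3] / 2])
    cb = np.array([rect_b[0] + rect_b[2] / 2, rect_b[1] + rect_b[3] / 2])
    return cb - ca

def is_horizontal_neighbor(rect_a, rect_b):
    """Expand only when the nearest neighbor lies in the horizontal
    direction: the x-component of the center vector difference dominates."""
    dx, dy = center_vector_diff(rect_a, rect_b)
    return abs(dx) > abs(dy)
```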
  • CTPN (Connectionist Text Proposal Network) can identify the coordinate positions of the four corners of each text line.
  • The main purpose of an RNN (Recurrent Neural Network) is to process and predict sequence data. The nodes between the hidden layers of an RNN are connected: the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment.
  • In this embodiment, the CTPN-RNN algorithm is used to locate at least one text line area in the image to be recognized. Combining CTPN with the RNN convolutional network makes it possible to accurately locate the text lines in the image to be recognized and determine the position of each text line area; that is, the CTPN-RNN algorithm automatically identifies at least one text line area, and the seamless combination of CTPN and RNN effectively improves detection accuracy.
  • S402 Take a screenshot of at least one text line area with a screenshot tool, obtain at least one text line image, and determine the text line position of each text line image in the image to be recognized according to the screenshot sequence of the screenshot tool.
  • The screenshot tool may be OpenCV (Open Source Computer Vision Library), a cross-platform computer vision library released under the BSD open-source license. It is lightweight and efficient, consisting of a series of C functions and a small number of C++ classes; it also provides interfaces to languages such as Python, Ruby, and MATLAB, and implements many common algorithms in image processing and computer vision. In this embodiment, OpenCV is used to intercept each text line area according to the coordinates of its four corners to obtain the corresponding text line image; performing the interception with OpenCV involves simple calculations, high computational efficiency, and relatively stable performance.
  • The text line position corresponding to each text line image can be the coordinates of the four vertices of the text line image (such as the upper-left corner) or of its center point, so that the position of the text line image in the image to be recognized can be determined from the text line position.
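  • A minimal sketch of the interception step with OpenCV, where the "screenshot" is simply an array slice over the area coordinates; the file names and area list are illustrative assumptions:

```python
import cv2

img = cv2.imread("page_to_recognize.png")  # illustrative file name

# Text line areas as (x, y, w, h), e.g. produced by the positioning step.
text_line_areas = [(15, 42, 600, 34), (15, 90, 580, 33)]

for idx, (x, y, w, h) in enumerate(text_line_areas, start=1):
    line_img = img[y:y + h, x:x + w]       # "screenshot" = array slice
    cv2.imwrite(f"text_line_{idx:03d}.png", line_img)
    # idx doubles as the text line position (screenshot sequence).
```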
  • In this embodiment, the text positioning algorithm is first used to locate the text of the image to be recognized, quickly obtaining text line areas that each contain one line or one column of original text, with high efficiency and accuracy. The screenshot tool is then used to capture each text line area, obtaining at least one text line image and dividing the image to be recognized into text line images, so that subsequent text recognition can proceed image by image and avoid the inaccuracy caused by using a single recognition model on text line images in different languages. Finally, the text line position corresponding to each text line image is obtained from the screenshot sequence of the screenshot tool, so that subsequent positioning or contextual semantic analysis can be performed based on the text line position to improve text recognition accuracy.
  • In an embodiment, the Encoder-Summarizer mechanism shown in FIG. 7 is used in the multilingual text recognition system for language recognition: the Encoder component converts the text line image into a feature sequence, and the Summarizer component then aggregates the feature sequence to perform the classification task, determining the corresponding target language.
  • In an embodiment, step S203, namely performing language recognition on each text line image and obtaining the target language corresponding to each text line image, specifically includes the following steps:
  • S501 Perform format conversion on the text line image to obtain a feature map conforming to a preset format.
  • the preset format is a preset format used for inputting the feature map of the Encoder component for encoding.
  • the feature map is a map extracted from the text line image that can be input to the Encoder component for encoding processing.
  • the preset format includes a preset height, a preset width, and a preset number of channels.
  • the preset number of channels is set to 1, which means that the feature map is a gray-scale image to reduce the amount of calculation in the subsequent processing.
  • Specifically, the server's format conversion of the text line image includes: first graying the text line image into a corresponding grayscale image, so that the preset number of channels is 1; then scaling the grayscale image into a first image matching the preset height h; then determining whether the width of the first image reaches the preset width. If it does, the first image is used directly as the feature map; if not, black or white areas are added to the left and right edges of the first image to convert it into a feature map matching the preset width.
  • The server converts the format of the text line image to obtain a feature map conforming to the preset format, so that the subsequent image processing is free of the influence of width, height, and channel number, making the final recognition more accurate.
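  • The following Python sketch illustrates this format conversion, assuming illustrative preset values (height 40, width 800, 1 channel) and black padding; none of these constants come from the publication:

```python
import cv2

PRESET_H, PRESET_W = 40, 800  # illustrative preset height/width; channels = 1

def to_feature_map(line_img_bgr):
    """Convert a text line image into a fixed-format grayscale feature map:
    gray -> scale to the preset height -> pad left/right to the preset width."""
    gray = cv2.cvtColor(line_img_bgr, cv2.COLOR_BGR2GRAY)     # 1 channel
    scale = PRESET_H / gray.shape[0]
    first = cv2.resize(gray, (max(1, int(gray.shape[1] * scale)), PRESET_H))
    if first.shape[1] >= PRESET_W:
        return first[:, :PRESET_W]
    pad = PRESET_W - first.shape[1]
    # Pad black areas onto the left and right edges.
    return cv2.copyMakeBorder(first, 0, 0, pad // 2, pad - pad // 2,
                              cv2.BORDER_CONSTANT, value=0)
```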
  • S502 Use the Encoder component to perform encoding processing on the feature map to obtain a corresponding feature vector.
  • Specifically, the Encoder component is an encoder constructed from convolutional layers (Conv), max pooling layers (MaxPool), and Inception layers, used to obtain the feature vector corresponding to the feature map. The Encoder component in turn includes a convolutional layer, a max pooling layer, an Inception layer, a max pooling layer, two Inception layers, and four convolutional layers; the convolution kernel sizes and corresponding activation functions shown in the figure can be set independently according to actual needs. The Inception layer is formed by combining multiple convolutional layers (Conv) and an average pooling layer (AvgPool) according to the structure shown in FIG. 7(c).
  • the model structure of the Encoder component is relatively simple, which helps to speed up the calculation and save the calculation cost.
  • In the encoding process, the Encoder component extracts the information of each pixel in the feature map conforming to the preset format, obtaining a feature vector that can uniquely identify the feature map.
  • the purpose of Inception is to design a network with a good local topology, that is, to perform multiple convolution operations or pooling operations on the input image in parallel, and stitch all the output results into a very deep feature map.
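  • The following PyTorch sketch mirrors the layer order described for FIG. 7(a) with a simplified Inception block in the spirit of FIG. 7(c); all channel counts and kernel sizes are illustrative assumptions, since the publication leaves them configurable:

```python
import torch
import torch.nn as nn

class MiniInception(nn.Module):
    """Simplified Inception block: parallel convolutions plus average
    pooling whose outputs are concatenated into one deep feature map.
    Channel sizes are illustrative, not taken from the publication."""
    def __init__(self, c_in, c_branch):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, c_branch, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(c_in, c_branch, 1),
                                nn.Conv2d(c_branch, c_branch, 3, padding=1))
        self.pool = nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(c_in, c_branch, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.pool(x)], dim=1)

class Encoder(nn.Module):
    """Conv -> MaxPool -> Inception -> MaxPool -> Inception -> Inception
    -> four Convs, mirroring the layer order described for FIG. 7(a)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            MiniInception(32, 16),           # out: 48 channels
            nn.MaxPool2d(2),
            MiniInception(48, 24),           # out: 72
            MiniInception(72, 32),           # out: 96
            nn.Conv2d(96, 96, 3, padding=1), nn.ReLU(),
            nn.Conv2d(96, 96, 3, padding=1), nn.ReLU(),
            nn.Conv2d(96, 96, 3, padding=1), nn.ReLU(),
            nn.Conv2d(96, 96, 3, padding=1), nn.ReLU(),
        )

    def forward(self, feature_map):          # (N, 1, H, W) grayscale input
        return self.net(feature_map)         # encoded feature sequence
```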
  • S503 Use the Summarizer component to integrate and classify the feature vector, obtain at least one language type probability, and output the recognized language type with the highest language type probability as the target language type corresponding to the text line image.
  • the Summarizer component is a pre-set classifier in the multilingual text recognition system, used to perform classification tasks to determine the corresponding target language.
  • The Summarizer component includes three convolutional layers (Conv); the first two convolutional layers use the ReLU activation function, and the last convolutional layer uses the Sigmoid activation function.
  • the model structure of the Summarizer component is relatively simple, which helps to speed up the calculation and save the calculation cost.
  • Specifically, the server uses the Summarizer component to integrate and classify the feature vector output by the Encoder component, obtaining at least one language type probability, where each language type probability corresponds to one recognized language; the Softmax function is then used to turn the language type probabilities into a probability distribution, and the recognized language with the highest probability is output as the target language corresponding to the text line image, ensuring the accuracy of the identified target language.
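  • A matching PyTorch sketch of the Summarizer component under the same assumptions (three convolutional layers, ReLU on the first two, Sigmoid on the last, Softmax over the aggregated scores); the channel counts and the four-language set are illustrative:

```python
import torch
import torch.nn as nn

class Summarizer(nn.Module):
    """Three conv layers as described: ReLU on the first two, Sigmoid on
    the last; per-language scores are aggregated and normalised with
    Softmax. Channel counts and the language set are illustrative."""
    def __init__(self, c_in=96, num_langs=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_langs, 1), nn.Sigmoid(),
        )

    def forward(self, features):
        scores = self.conv(features).mean(dim=(2, 3))  # aggregate over H, W
        return torch.softmax(scores, dim=1)            # language probabilities

langs = ["Chinese", "English", "Japanese", "Korean"]
probs = Summarizer()(torch.randn(1, 96, 10, 200))
print(langs[int(probs.argmax(dim=1))])                 # target language
```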
  • The at least one language type probability identified by the Summarizer component may exist in the form of an array, where each number in the array corresponds to the identification probability of one recognized language. For example, an array formed by the four values 0.7, 0, 0.2, and 0.1 refers to Chinese, English, Japanese, and Korean respectively, in the order of those four values.
  • In the model, x is the feature vector corresponding to the text line image, and y is the code point sequence encoded by the Encoder component; the code point sequence can be the language type probability correspondence formed after the Summarizer component integrates the feature vectors.
  • The model is built using the probability method: the conditional probability P(y|x) refers to the probability that the encoded code point sequence is y under the condition that x is known, and the Summarizer component outputs the code point sequence with the largest probability, i.e. argmax_y P(y|x), where argmax is the function that obtains the argument (set) maximizing a function.
  • Letting s ∈ S represent a language (a type of text), the script information can be merged into the above conditional probability as P(y|s,x), where P(y|s,x) represents an OCR model that can recognize text of the type s, and P(s|x) refers to the probability that the text in the image belongs to the script s under the condition of the given image's feature vector x.
  • C(y) represents a function whose role is to convert y into the sequence of corresponding glyph clusters (c_1, c_2, ⋯, c_n), where c_i represents the i-th glyph cluster of y.
  • P(c|s,x), P(c|s), and P(y|s) respectively represent the OCR recognition model for the script s, the prior probability of a glyph cluster c, and a language model: P(c|s,x) is the probability that the given input belongs to the glyph cluster c under the condition that s and x are known; P(c|s) is the probability that a glyph cluster is c given the script s; and P(y|s) is the probability that the code point sequence is y under the condition of the script s.
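  • Read together, these definitions suggest a decoding rule of roughly the following form. This is a hedged reconstruction from the definitions above, not a formula quoted verbatim from the publication; the exact combination used by the model may differ:

```latex
% Script identification merged into decoding (reconstruction):
y^{*} = \arg\max_{y} \sum_{s \in S} P(y \mid s, x)\, P(s \mid x)

% Per-script decomposition over the glyph-cluster sequence C(y) = (c_1, \ldots, c_n):
P(y \mid s, x) \;\propto\; \Bigl[ \prod_{i} \frac{P(c_i \mid s, x)}{P(c_i \mid s)} \Bigr]\, P(y \mid s)
```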
  • Glyph clusters are strings that have certain uniform characteristics in the text stream and cannot be split during typesetting. In most cases a glyph cluster is an ordinary character (a Chinese character or a letter), but sometimes it is a string arranged in a special order. A glyph cluster composed of multiple characters has the following properties: 1) the glyph cluster behaves in a text line like a single character in a traditional typesetting system; 2) the inter-character relationships and layout within the glyph cluster are related only to text attributes and have nothing to do with typesetting rules; 3) the characters in the glyph cluster have the same text and font attributes and can be output at one time.
  • Further, the system pre-computes the confusion rates of the traditional Seq2Seq recognition model and of the Encoder-Summarizer mechanism used for language recognition in this embodiment, so as to compare their language recognition accuracy on the basis of the confusion rate. The confusion rate refers to the probability that one type of text is mistaken for another type of text. The testing process includes the following steps: (1) Obtain image test samples in which multiple languages coexist, each image test sample carrying a corresponding language type label, namely the label of the language of the text in the image test sample.
  • Compared with the traditional Seq2Seq recognition model, the Encoder-Summarizer mechanism adopted in this embodiment improves the accuracy of the identified target language, and thus the accuracy of subsequent text recognition.
  • the model structure of the Encoder component and the Summarizer component is relatively simple, which helps to accelerate the calculation speed and save the calculation cost.
  • In the multilingual text recognition method provided in this embodiment, the text line image is first format-converted to obtain a feature map conforming to the preset format, which helps reduce the amount of computation in subsequent encoding and improves encoding efficiency; the Encoder component is then used to encode the feature map, quickly obtaining the corresponding feature vector while ensuring its accuracy; finally, the Summarizer component integrates and classifies the feature vector to obtain at least one language type probability, and the recognized language with the highest probability is output as the target language corresponding to the text line image, ensuring both the accuracy of the identified target language and the recognition efficiency.
  • In an embodiment, step S205, namely using the target OCR recognition model to perform transcription recognition on the text line image and obtaining the target text corresponding to the text line image, specifically includes the following steps:
  • S601: Based on the target language and the text line position, integrate adjacent text line images corresponding to the same target language to form a text block image, so that the text block image is treated as a whole recognition object, thereby ensuring the accuracy of the target text corresponding to each text line image recognized from the text block image.
  • Specifically, the server sorts the target languages corresponding to all text line images according to their text line positions, and integrates adjacent text line images belonging to the same target language into a text block image, so that subsequent recognition based on the text block image can fully consider the contextual semantics within it to improve text recognition accuracy.
  • S602 Use a text cutting algorithm to cut the text block image to obtain at least two single font images, and obtain the recognition order and row and column labels corresponding to each single font image according to the typesetting type and cutting order of the text block image.
  • The text cutting algorithm refers to an algorithm for cutting a text block image into single font images; it may specifically be a projection-based text cutting algorithm. For example, when a projection-based text cutting algorithm is used to cut a text block image, each text line image can first be projected vertically to obtain the vertical projection pixels; if there is a run of continuous pixels meeting preset conditions, the area corresponding to that run contains one original character, and it is cut out to form a single font image.
  • the typesetting of the text block image includes horizontal typesetting and vertical typesetting.
  • the recognition sequence corresponding to a single font image refers to the sequence of a single font image in the overall text block image according to its typesetting position in the text block image.
  • the row and column labels corresponding to a single font image refer to the row labels and column labels of a certain single font image in the text block image.
  • For example, suppose a text block image is formed by the horizontal typesetting of 3 text line images containing 18, 20, and 17 characters respectively, so that 55 single font images are cut from it. The single font image corresponding to the third character in the second line then has recognition order 21 (18 + 3), row label 2, and column label 3.
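  • A minimal Python sketch of projection-based cutting for a binarised, horizontally typeset line image, plus the recognition-order arithmetic from the example above; the thresholds are illustrative assumptions:

```python
def cut_single_fonts(line_img, ink_threshold=128, min_run=2):
    """Sketch of projection-based cutting: project a grayscale text line
    image onto the horizontal axis and cut out each run of columns that
    contains ink. line_img is a 2-D numpy array; thresholds illustrative."""
    ink = (line_img < ink_threshold).sum(axis=0)     # vertical projection
    cuts, start = [], None
    for col, v in enumerate(ink):
        if v > 0 and start is None:
            start = col
        elif v == 0 and start is not None:
            if col - start >= min_run:
                cuts.append(line_img[:, start:col])  # one single font image
            start = None
    if start is not None:
        cuts.append(line_img[:, start:])
    return cuts

# Recognition order for horizontally typeset lines: with 18, 20 and 17
# characters per line, the 3rd character of line 2 gets recognition order
# 18 + 3 = 21, row label 2, column label 3 (matching the example above).
```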
  • S603 According to the recognition sequence corresponding to the single font image, input the single font image to the target OCR recognition model for transcription recognition, and obtain at least one recognition result corresponding to the single font image and the recognition probability corresponding to each recognition result.
  • the server inputs the single-font image into the target OCR recognition model for transcription recognition according to the recognition order corresponding to each single-font image, and obtains at least one recognition result corresponding to each single-font image and the recognition probability corresponding to the recognition result .
  • For example, the system can be set to keep only the three recognition results with the largest recognition probabilities and their corresponding probabilities, so as to reduce the workload of subsequent recognition processing.
  • the recognition result refers to the recognition result of each single font image, which can be a recognition symbol or a recognized text.
  • The recognition probability corresponding to each recognition result refers to the likelihood that the single font image is recognized as that result. For example, for a single font image containing a given original character, three candidate characters may be recognized with recognition probabilities of 99.99%, 84.23%, and 47.88% respectively.
  • S604 Based on the recognition result and the typesetting type, divide all single-font images in the text block image into at least two units to be recognized.
  • the server divides all the single font images in the text block image into at least two units to be recognized based on at least one recognition result corresponding to each single font image in the text block image and the typesetting type corresponding to the text block image.
  • A unit to be recognized is the smallest unit requiring semantic recognition; each unit to be recognized contains text that can form a complete sentence for semantic recognition, which helps improve the accuracy of text recognition. Since a recognition result may be either a recognition symbol or a recognized character, semantic separation can be performed according to the recognition symbols when the recognition results include them; when they do not, semantic separation can be performed according to the typesetting type of the text block image, which helps improve the recognition accuracy of the units to be recognized.
  • In an embodiment, step S604, namely dividing all single font images in the text block image into at least two units to be recognized based on the recognition results and the typesetting type, specifically includes the following steps:
  • A recognition symbol is a punctuation mark recognized by the target OCR recognition model, indicating a position where punctuation is used in the text corresponding to the text block image.
  • the semantics between contexts are related to the position of punctuation marks.
  • S6041: If the recognition results include recognition symbols, the server forms all single font images between any two adjacent recognition symbols into a unit to be recognized. The unit to be recognized is the smallest unit used for semantic recognition, so that the text corresponding to each unit to be recognized is one sentence, which helps improve the recognition accuracy of the text corresponding to the unit and reduces recognition complexity.
  • Specifically, when the server uses the target OCR recognition model to recognize a single font image and obtains at least one recognition result with its recognition probability, if the recognition result whose probability is largest, or greater than the preset probability threshold, is a recognition symbol, the recognition results are deemed to include a recognition symbol.
  • the preset probability threshold is used to evaluate whether the recognition probability reaches the threshold for evaluating it as a certain character/symbol, and the preset probability threshold can be set to a higher value to ensure recognition accuracy.
  • Further, when the server finds that the recognition results of the single font images cut from a text block image include recognition symbols, it may first determine whether the recognition symbols include a preset symbol; if so, all single font images between any two adjacent preset symbols are formed into a unit to be recognized, improving the recognition accuracy of the unit to be recognized.
  • the preset symbol is a punctuation mark preset by the system for determining the end of a sentence, including but not limited to a period, a question mark, and an exclamation mark.
  • S6042 If the recognition result does not include the recognition symbol, form a unit to be recognized based on all single font images corresponding to the same row of labels or the same column of labels according to the typesetting type of the text block image and the row and column labels corresponding to the single font image.
  • Specifically, when the recognition results do not include recognition symbols, the server forms a unit to be recognized from all single font images corresponding to the same row label or the same column label, according to the typesetting type of the text block image and the row and column labels of the single font images, which helps improve the recognition accuracy of the unit to be recognized.
  • In this embodiment, when the recognition results include recognition symbols, the server forms units to be recognized from all single font images between adjacent recognition symbols; when they do not, it forms units from all single font images sharing the same row label or column label according to the typesetting type and the row and column labels. This ensures, to a certain extent, that each final unit to be recognized contains as complete a sentence as possible for subsequent semantic analysis, improving recognition accuracy.
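  • The following Python sketch illustrates the two branches S6041/S6042, splitting at sentence-ending recognition symbols when they are present and otherwise grouping by row label; the preset symbol set is an illustrative assumption:

```python
SENTENCE_END = {"。", "？", "！", ".", "?", "!"}  # illustrative preset symbols

def split_units(glyphs, row_labels):
    """Sketch of S6041/S6042: split the single font images of a text block
    into units to be recognized. `glyphs` holds the top recognition result
    per single font image, `row_labels` its row label."""
    if any(g in SENTENCE_END for g in glyphs):
        units, current = [], []
        for g in glyphs:
            current.append(g)
            if g in SENTENCE_END:            # close the unit at the symbol
                units.append(current)
                current = []
        if current:
            units.append(current)
        return units
    # No recognition symbols: group by row label instead (S6042).
    units = {}
    for g, row in zip(glyphs, row_labels):
        units.setdefault(row, []).append(g)
    return [units[row] for row in sorted(units)]
```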
  • S605 Based on at least one recognition result corresponding to each single font image and the recognition probability corresponding to each recognition result in any unit to be recognized, obtain a single font text corresponding to each single font image.
  • Specifically, the server takes the unit to be recognized as the smallest unit for semantic recognition; according to the at least one recognition result corresponding to each single font image in the unit and the recognition probability of each result, it uses the possible contextual semantic relationships among all single font images to more accurately determine the single font text corresponding to each single font image, improving the accuracy of text recognition.
  • In an embodiment, step S605, namely obtaining the single font text corresponding to each single font image based on the at least one recognition result corresponding to each single font image in any unit to be recognized and the recognition probability corresponding to each recognition result, specifically includes the following steps:
  • S6051: If each single font image has a recognized character whose recognition probability is greater than the preset probability threshold, the recognized character whose probability is greater than the threshold is determined as the single font text corresponding to that single font image.
  • Specifically, the server compares each recognition probability corresponding to each single font image in a unit to be recognized with the preset probability threshold one by one, to determine whether each single font image has a recognized character whose probability exceeds the threshold; if every single font image does, that recognized character is directly determined as the single font text corresponding to the single font image.
  • For example, suppose any unit to be recognized includes N single font images, each corresponding to M recognized characters, and each recognized character corresponds to a recognition probability. The server compares the M recognition probabilities of each of the N single font images with the preset probability threshold in turn, to determine whether each single font image has a recognized character whose probability exceeds the threshold; if all N single font images do, the accuracy of the recognized characters is relatively high, and each such recognized character can be directly taken as the single font text corresponding to its single font image.
  • For example, suppose the preset probability threshold is set to 95% and the unit to be recognized corresponds to the six single font images of "今天天气真好" ("The weather is really nice today"), each of which has a recognized character whose recognition probability exceeds the threshold: "今" at 97%, the two "天" at 98% each, "气" at 99%, "真" at 96%, and "好" at 99%. In this case, the recognized characters whose recognition probabilities exceed the preset probability threshold are determined as the single font texts, ensuring the accuracy of the character recognition in the unit to be recognized.
  • S6052: If at least one single font image has no recognized character whose recognition probability is greater than the preset probability threshold, the accuracy of the characters recognized from that image does not meet the standard, and contextual semantic analysis is needed to improve the recognition accuracy of the text corresponding to the unit to be recognized. In this case, the server uses the target language model to recognize the word sequences formed by the at least one recognized character corresponding to all single font images in the unit, in the recognition order of the single font images; it obtains the sequence probability corresponding to each word sequence and derives the single font text corresponding to each single font image from the word sequence with the highest sequence probability.
  • the target language model is a model used for semantic analysis of continuous characters.
  • The target language model can be, but is not limited to, an N-gram model such as a Chinese language model. The Chinese language model uses the collocation information between adjacent words in context: when consecutive pinyin, strokes, or numbers representing letters or strokes need to be converted into a Chinese character string (i.e., a sentence), the sentence with the greatest probability can be calculated, realizing automatic conversion to Chinese characters without manual selection by the user and avoiding the problem of many Chinese characters corresponding to the same pinyin (or stroke string or number string).
  • For example, suppose the preset probability threshold is set to 95%. The first single font image recognizes "今" with a probability of 97%; the second recognizes "夫" with 85% and another candidate with 83%; the third recognizes one candidate with 90% and "夫" with 75%; the fourth recognizes "气" with 99%; the fifth recognizes "真" with 96%; and the sixth recognizes "好" with 99%. Since the second and third single font images have no recognized character whose recognition probability exceeds the preset probability threshold, the target language model is used to recognize the word sequences formed by the candidate characters of all six single font images, and the single font text corresponding to each single font image is determined from the word sequence with the highest sequence probability.
  • In this embodiment, when every single font image has a recognized character whose recognition probability exceeds the preset probability threshold, the server directly determines those recognized characters as the single font texts, ensuring both the accuracy and the efficiency of recognition; when at least one single font image lacks such a recognized character, the target language model is used to recognize the word sequences formed by all single font images, the single font text corresponding to each single font image is determined from the word sequence with the highest sequence probability, and semantic analysis through the target language model ensures the accuracy of single font text recognition.
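  • A minimal Python sketch of this selection logic: take the top candidate per single font image when all top candidates clear the threshold, and otherwise fall back to scoring candidate sequences with a language model. The `lm_score` callable stands in for the target language model (e.g., an N-gram model) and is an assumption, not an API from the publication; exhaustive enumeration is used only for illustration, where a beam search would be used in practice:

```python
from itertools import product

def pick_single_font_texts(candidates, threshold=0.95, lm_score=None):
    """Sketch of S6051/S6052. `candidates` is, per single font image, a
    list of (character, probability) pairs. If every image has a candidate
    above the threshold, take those directly; otherwise score every
    combination of candidates with the language model and keep the best."""
    best = [max(c, key=lambda kv: kv[1]) for c in candidates]
    if all(p > threshold for _, p in best):
        return [ch for ch, _ in best]               # S6051 branch
    sequences = product(*[[ch for ch, _ in c] for c in candidates])
    return list(max(sequences, key=lm_score))       # S6052 branch

# Usage: pick_single_font_texts(cands, lm_score=my_ngram_model.score)
```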
  • S606: Perform page layout on the single font texts corresponding to the single font images according to the recognition order and the row and column labels corresponding to each single font image, and obtain the target text corresponding to the text line image.
  • After identifying the single font text corresponding to each single font image, the server lays out all the single font texts according to the recognition order and row and column labels of the single font images, placing each single font text at the position corresponding to its row and column labels, to obtain the target text corresponding to the text line image, so that subsequent comparison and verification can be performed based on the target text to ensure its recognition accuracy. The target text is the text finally recognized from the text line image.
  • In the multilingual text recognition method provided in this embodiment, adjacent text line images corresponding to the same target language are first integrated to form a text block image, so that subsequent recognition proceeds in units of text block images and fully considers the contextual semantics between text line images, improving text recognition accuracy. The text block image is then cut character by character to obtain the single font images with their recognition order and row and column labels, which support positioning based on the recognition order and the labels. The target OCR recognition model then recognizes each single font image to determine at least one recognition result and the corresponding recognition probability; since the target OCR recognition model is a recognition model specific to the target language, its recognition results are more accurate. Next, based on all recognition results and the typesetting type of the text block image, all single font images are divided into at least two units to be recognized for semantic analysis, ensuring the accuracy of the single font text recognized for each single font image. Finally, all single font texts are laid out according to the recognition order and row and column labels to obtain the target text corresponding to the text line image for subsequent comparison and verification, ensuring the accuracy of target text recognition.
  • in an embodiment, a multilingual text recognition device is provided, corresponding one-to-one to the multilingual text recognition method in the above embodiment.
  • the multilingual text recognition device includes a to-be-recognized image acquisition module 801, a text line image acquisition module 802, a target language acquisition module 803, a recognition model acquisition module 804, a target text acquisition module 805, and a recognition text acquisition module 806.
  • the detailed description of each functional module is as follows:
  • the to-be-recognized image acquisition module 801 is configured to acquire the to-be-recognized image, and the to-be-recognized image includes original characters corresponding to at least two languages.
  • the text line image acquisition module 802 is configured to perform layout analysis and recognition on the image to be recognized, acquire at least one text line image, and determine the text line position of each text line image in the image to be recognized.
  • the target language acquisition module 803 is configured to recognize the language of each text line image, and obtain the target language corresponding to each text line image.
  • the recognition model obtaining module 804 is configured to query the recognition model database based on the target language, and obtain the target OCR recognition model corresponding to the target language.
  • the target text acquisition module 805 is configured to use the target OCR recognition model to perform transcription recognition on the text line image, and obtain the target text corresponding to the text line image.
  • the recognition text obtaining module 806 is configured to obtain the target recognition text corresponding to the image to be recognized based on the target text corresponding to the text line image and the text line position.
  • the multilingual text recognition device further includes:
  • the blur degree detection unit is used to obtain the original image, and uses the blur degree detection algorithm to detect the original image to obtain the blur degree of the original image.
  • the first blur processing unit is configured to obtain blur prompt information if the blur degree is greater than the first blur threshold.
  • the second blur processing unit is configured to perform sharpening and correction processing on the original image if the blur degree is not greater than the first blur threshold and greater than the second blur threshold to obtain the image to be recognized.
  • the third blur processing unit is configured to perform correction processing on the original image if the blur degree is not greater than the second blur threshold to obtain the image to be recognized.
  • the first blur threshold is greater than the second blur threshold.
  • the text line image acquisition module 802 includes:
  • the text line area acquiring unit is configured to use a text positioning algorithm to perform text positioning on the image to be recognized to obtain at least one text line area.
  • the text line image determining unit is used to take a screenshot of at least one text line area with a screenshot tool, obtain at least one text line image, and determine the text line position of each text line image in the image to be recognized according to the screenshot sequence of the screenshot tool.
  • the target language acquisition module 803 includes:
  • the feature map acquiring unit is used to perform format conversion on the text line image to acquire a feature map conforming to a preset format.
  • the feature vector obtaining unit is used to encode the feature map by using the Encoder component to obtain the corresponding feature vector.
  • the target language output unit is used to use the Summarizer component to integrate and classify the feature vectors, obtain at least one language type probability, and output the recognized language type with the highest language type probability as the target language type corresponding to the text line image.
  • the recognition text acquisition module 806 includes:
  • the text block image acquisition unit is used to integrate adjacent text line images corresponding to the same target language to form a text block image based on the target language and text line position corresponding to the text line image.
  • the single-font image acquisition unit is used to cut the text block image using a text cutting algorithm to obtain at least two single-font images, and to obtain the recognition order and the row and column labels of each single-font image according to the typesetting type and cutting order of the text block image.
  • the recognition result obtaining unit is configured to input the single font image into the target OCR recognition model for transcription recognition according to the recognition sequence corresponding to the single font image, and obtain at least one recognition result corresponding to the single font image and the recognition probability corresponding to each recognition result.
  • the to-be-recognized unit dividing unit is used to divide all single-font images in the text block image into at least two to-be-recognized units based on the recognition result and the typesetting type.
  • the single-font text obtaining unit is used to obtain the single-font text corresponding to each single-font image based on at least one recognition result corresponding to each single-font image and the recognition probability corresponding to each recognition result in any unit to be recognized.
  • the target text obtaining unit is used to perform page layout on the single font text corresponding to the single font image according to the recognition sequence and the row and column labels corresponding to each single font image, and obtain the target text corresponding to the text line image.
  • the to-be-recognized unit dividing unit includes:
  • the first unit acquisition subunit is used to form a to-be-recognized unit from all single-font images between any two adjacent recognized symbols if the recognition results include a recognized symbol.
  • the second unit acquisition subunit is used, if the recognition results do not include a recognized symbol, to form a to-be-recognized unit from all single-font images corresponding to the same row label or the same column label, according to the typesetting type of the text block image and the row and column labels corresponding to the single-font images.
  • the single-font text acquisition unit includes:
  • the first single-character acquisition subunit is used, if every single-font image in a to-be-recognized unit has a recognized character whose recognition probability is greater than the preset probability threshold, to determine the recognized character whose recognition probability is greater than the preset probability threshold as the single-font text corresponding to that single-font image.
  • the second single-character acquisition subunit is used, if at least one single-font image in a to-be-recognized unit has no recognized character with recognition probability greater than the preset probability threshold, to recognize with the target language model, in the recognition order corresponding to the single-font images, the word sequences formed by at least one recognized character corresponding to all single-font images in the to-be-recognized unit, obtain the sequence probability corresponding to each word sequence, and obtain the single-font text corresponding to each single-font image based on the word sequence with the largest sequence probability.
  • Each module in the above multilingual text recognition device can be implemented in whole or in part by software, hardware, or a combination thereof.
  • the foregoing modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 9.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus, where the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer device is used to store data used or generated in the process of executing the multilingual text recognition method.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize a multilingual text recognition method.
  • a computer device is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor.
  • when the processor executes the computer program, the multilingual text recognition method in the above embodiment is implemented, for example steps S201-S206 shown in FIG. 2, or the steps shown in FIG. 3 to FIG. 6; to avoid repetition, they are not repeated here.
  • alternatively, when the processor executes the computer program, the functions of the modules/units in the above embodiment of the multilingual text recognition device are realized, such as the functions of the to-be-recognized image acquisition module 801, the text line image acquisition module 802, the target language acquisition module 803, the recognition model acquisition module 804, the target text acquisition module 805, and the recognition text acquisition module 806 shown in FIG. 8; to avoid repetition, they are not repeated here.
  • one or more readable storage media storing computer readable instructions are provided.
  • the computer readable storage media store computer readable instructions which, when executed by one or more processors, cause the one or more processors to implement the multilingual text recognition method in the foregoing embodiment, for example steps S201-S206 shown in FIG. 2, or the steps shown in FIG. 3 to FIG. 6; to avoid repetition, they are not repeated here.
  • alternatively, when the computer readable instructions are executed, the functions of the modules/units in the above embodiment of the multilingual text recognition device are realized, for example the functions of the to-be-recognized image acquisition module 801, the text line image acquisition module 802, the target language acquisition module 803, the recognition model acquisition module 804, the target text acquisition module 805, and the recognition text acquisition module 806 shown in FIG. 8; to avoid repetition, they are not repeated here.
  • the readable storage medium in this embodiment includes a nonvolatile readable storage medium and a volatile readable storage medium.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Character Discrimination (AREA)

Abstract

A multilingual text recognition method and device, a computer device, and a storage medium. The method includes: acquiring an image to be recognized, the image to be recognized including original text corresponding to at least two languages (S201); performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized (S202); performing language identification on each text line image to acquire the target language corresponding to each text line image (S203); querying a recognition model database based on the target language to acquire the target OCR recognition model corresponding to the target language (S204); performing transcription recognition on the text line image using the target OCR recognition model to acquire the target text corresponding to the text line image (S205); and acquiring, based on the target text and the text line position corresponding to the text line image, the target recognition text corresponding to the image to be recognized (S206). The method can use a dedicated target OCR recognition model to recognize each text line image, which helps improve the recognition accuracy of multilingual text.

Description

Multilingual text recognition method and device, computer device, and storage medium
This application is based on, and claims priority from, Chinese invention application No. 201910706802.7, filed on August 1, 2019 and entitled "Multilingual text recognition method and device, computer device, and storage medium".
Technical Field
This application relates to the technical field of text recognition, and in particular to a multilingual text recognition method and device, a computer device, and a storage medium.
Background
Multilingual text recognition is specifically applied in scenarios where a text image containing multiple languages is recognized, for example a text image in which Chinese, Japanese, and English text coexist. At present, a Seq2Seq recognition model trained with the sequence-to-sequence (Seq2Seq) method is used to recognize text images in which multiple languages coexist; its model structure is complex, its training process is very difficult, it runs inefficiently, and its recognition accuracy is low. Here, Seq2Seq refers to the technique of models that convert a sequence in one domain (such as English) into a sequence in another domain (such as French). When a traditional Seq2Seq recognition model is used to recognize a text image in which text of multiple languages coexists, the finally recognized text content often turns out wrong because the language type is judged incorrectly, resulting in low recognition accuracy, which hinders the popularization and application of the recognition model.
Summary
The embodiments of this application provide a multilingual text recognition method and device, a computer device, and a storage medium, to solve the problem of low recognition accuracy in the current process of multilingual text recognition using a recognition model.
A multilingual text recognition method, including:
acquiring an image to be recognized, the image to be recognized including original text corresponding to at least two languages;
performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized;
performing language identification on each text line image, and acquiring the target language corresponding to each text line image;
querying a recognition model database based on the target language, and acquiring the target OCR recognition model corresponding to the target language;
performing transcription recognition on the text line image using the target OCR recognition model, and acquiring the target text corresponding to the text line image;
acquiring, based on the target text and the text line position corresponding to the text line image, the target recognition text corresponding to the image to be recognized.
A multilingual text recognition device, including:
a to-be-recognized image acquisition module, configured to acquire an image to be recognized, the image to be recognized including original text corresponding to at least two languages;
a text line image acquisition module, configured to perform layout analysis and recognition on the image to be recognized, acquire at least one text line image, and determine the text line position of each text line image in the image to be recognized;
a target language acquisition module, configured to perform language identification on each text line image and acquire the target language corresponding to each text line image;
a recognition model acquisition module, configured to query a recognition model database based on the target language and acquire the target OCR recognition model corresponding to the target language;
a target text acquisition module, configured to perform transcription recognition on the text line image using the target OCR recognition model and acquire the target text corresponding to the text line image;
a recognition text acquisition module, configured to acquire, based on the target text and the text line position corresponding to the text line image, the target recognition text corresponding to the image to be recognized.
A computer device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor implements the following steps when executing the computer program:
acquiring an image to be recognized, the image to be recognized including original text corresponding to at least two languages;
performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized;
performing language identification on each text line image, and acquiring the target language corresponding to each text line image;
querying a recognition model database based on the target language, and acquiring the target OCR recognition model corresponding to the target language;
performing transcription recognition on the text line image using the target OCR recognition model, and acquiring the target text corresponding to the text line image;
acquiring, based on the target text and the text line position corresponding to the text line image, the target recognition text corresponding to the image to be recognized.
One or more readable storage media storing computer readable instructions, where the computer readable instructions, when executed by one or more processors, cause the one or more processors to execute the following steps:
acquiring an image to be recognized, the image to be recognized including original text corresponding to at least two languages;
performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized;
performing language identification on each text line image, and acquiring the target language corresponding to each text line image;
querying a recognition model database based on the target language, and acquiring the target OCR recognition model corresponding to the target language;
performing transcription recognition on the text line image using the target OCR recognition model, and acquiring the target text corresponding to the text line image;
acquiring, based on the target text and the text line position corresponding to the text line image, the target recognition text corresponding to the image to be recognized.
The details of one or more embodiments of this application are set forth in the drawings and the description below. Other features and advantages of this application will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of the multilingual text recognition method in an embodiment of this application;
FIG. 2 is a flowchart of the multilingual text recognition method in an embodiment of this application;
FIG. 3 is another flowchart of the multilingual text recognition method in an embodiment of this application;
FIG. 4 is another flowchart of the multilingual text recognition method in an embodiment of this application;
FIG. 5 is another flowchart of the multilingual text recognition method in an embodiment of this application;
FIG. 6 is another flowchart of the multilingual text recognition method in an embodiment of this application;
FIG. 7 is a schematic diagram of the model structure of the Encoder-Summarizer mechanism in an embodiment of this application, where 7(a) is the model structure of the Encoder component, 7(b) is the model structure of the Summarizer component, and 7(c) is the model structure of the Inception block in 7(a);
FIG. 8 is a schematic diagram of the multilingual text recognition device in an embodiment of this application;
FIG. 9 is a schematic diagram of the computer device in an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application will be described clearly and completely below in conjunction with the drawings in the embodiments of this application. Obviously, the described embodiments are only part, not all, of the embodiments of this application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
The multilingual text recognition method provided by the embodiments of this application can be applied in the application environment shown in FIG. 1. Specifically, the method is applied in a multilingual text recognition system including a client and a server as shown in FIG. 1, where the client and the server communicate over a network, so that when an image to be recognized in which text of multiple languages coexists is recognized, a corresponding target OCR recognition model can be determined for each language and multiple target OCR recognition models can be used for recognition, guaranteeing the recognition accuracy of the finally recognized target recognition text. The client, also called the user side, is a program that corresponds to the server and provides local services to the user; it may be installed on, but not limited to, various personal computers, notebook computers, smartphones, tablets, and portable wearable devices. The server may be implemented as an independent server or as a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, a multilingual text recognition method is provided. Taking the method applied to the server in FIG. 1 as an example, it includes the following steps:
S201: acquire an image to be recognized, the image to be recognized including original text corresponding to at least two languages.
The image to be recognized is an image on which text recognition needs to be performed, and the original text is the text recorded in that image. In this embodiment, the original text in the image to be recognized corresponds to at least two languages, i.e., the image to be recognized is one in which at least two languages coexist, for example an image containing original text in Chinese, English, and Japanese.
S202: perform layout analysis and recognition on the image to be recognized, acquire at least one text line image, and determine the text line position of each text line image in the image to be recognized.
Performing layout analysis and recognition on the image to be recognized means segmenting its distribution structure and analyzing the image features of each segmented partial image to determine that partial image's attribute category, which includes but is not limited to text block, image block, and table block. A text line image is an image containing at least one original character. Generally, if the typesetting type of the text line images is horizontal, each text line image contains one row of original text; if it is vertical, each text line image contains one column of original text.
Specifically, the server may first perform image preprocessing such as grayscale transformation, binarization, smoothing, and edge detection on the image to be recognized, then perform layout analysis and recognition on the preprocessed image to acquire at least two block images, perform attribute analysis on each block image to acquire its attribute category, and crop the block images whose attribute category is text block as text line images. Understandably, the layout analysis may adopt, but is not limited to, a bottom-up layout analysis algorithm guided by multi-level confidence, a layout analysis algorithm based on neighborhood features, or a projection algorithm combined with bottom-up layout analysis.
Specifically, after performing layout analysis and recognition and acquiring at least one text line image, the server needs to determine the text line position of each text line image in the image to be recognized, so that subsequent text positioning or contextual recognition can be based on that position, guaranteeing the accuracy of the target recognition text recognized from the image to be recognized.
In a specific implementation, taking horizontal typesetting as an example, after acquiring at least one text line image, the server sorts the text line images by the vertical coordinate of the upper-left corner or the center of each image's region and acquires the row number of each text line image, thereby acquiring that text line image's text line position. For example, a text line image with row number 001 corresponds to the first row of original text, so that subsequent positioning or contextual semantic analysis can be performed based on the row number to improve text recognition accuracy.
S203: perform language identification on each text line image, and acquire the target language corresponding to each text line image.
The target language is the language type corresponding to the original text in a text line image, such as Chinese, English, or Japanese. Specifically, the server performs language identification on each text line image to determine its target language, so that the corresponding target OCR recognition model can be determined based on the target language and used to recognize the text line image; the multilingual image to be recognized is thus converted into multiple monolingual text line images, which are recognized by multiple target OCR recognition models, improving the recognition accuracy of the whole image.
S204: query a recognition model database based on the target language, and acquire the target OCR recognition model corresponding to the target language.
The recognition model database is a database storing language-specific OCR recognition models for recognizing different languages; each such model corresponds to one language (i.e., language type) and is the OCR recognition model used to recognize text of that language. Specifically, after acquiring the target language of each text line image, the server queries the recognition model database with the identified target language and determines the language-specific OCR recognition model corresponding to it as the target OCR recognition model, so that this model can recognize the text line image, improving the accuracy of text recognition on it.
S205: perform transcription recognition on the text line image using the target OCR recognition model, and acquire the target text corresponding to the text line image.
Specifically, the server may determine a recognition order from the text line position of each text line image and use the target OCR recognition model to perform transcription recognition on the corresponding text line image, ensuring the recognition accuracy of each text line image's target text and avoiding the low accuracy that results from recognizing text line images of multiple different languages with one and the same OCR recognition model. The target text is the text recognized from the text line image by the target OCR recognition model.
S206: acquire, based on the target text and the text line position corresponding to the text line image, the target recognition text corresponding to the image to be recognized.
Specifically, after recognizing the target text of each text line image, the server re-typesets the target text of every text line image based on its text line position, so as to acquire a target recognition text with the same layout as the image to be recognized, so that the original text at the corresponding positions of the image can be compared and verified against the target recognition text. Taking horizontal typesetting as an example, the text line position is determined by each text line image's row number; after the target text of a text line image is recognized, the target text is typeset by row number to acquire the target recognition text of the image to be recognized for subsequent comparison and verification, guaranteeing recognition accuracy.
In the multilingual text recognition method provided by this embodiment, for an image to be recognized in which original text of at least two languages coexists, layout analysis and recognition are first performed to determine at least one text line image and the corresponding text line positions, so that the multilingual image to be recognized is converted into monolingual text line images for recognition, which helps improve accuracy in subsequent recognition. After language identification determines the target language of each text line image, the target OCR recognition model corresponding to that target language recognizes the text line image, guaranteeing the accuracy of each text line image's recognized target text. Then, based on the text line positions, the target text of the text line images is re-typeset to acquire the target recognition text corresponding to the image to be recognized, so that comparison and verification based on the target recognition text help guarantee recognition accuracy.
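As an illustration of the S201-S206 flow, the following is a minimal Python sketch of the per-line dispatch logic. The function names and the model_db mapping are hypothetical stand-ins for the trained components described above, not the actual models of this application.

```python
from typing import Callable, Dict, List, Tuple

def recognize_multilingual(
    image,
    detect_text_lines: Callable,      # S202: returns (line_position, line_image) pairs
    classify_language: Callable,      # S203: returns a language tag such as "zh" or "en"
    model_db: Dict[str, Callable],    # S204: language tag -> dedicated OCR model
) -> str:
    results: List[Tuple[int, str]] = []
    for position, line_image in detect_text_lines(image):
        lang = classify_language(line_image)     # identify the target language
        text = model_db[lang](line_image)        # S205: per-language transcription
        results.append((position, text))
    results.sort(key=lambda r: r[0])             # S206: reassemble by line position
    return "\n".join(text for _, text in results)

# Usage with trivial stand-ins:
# recognize_multilingual(img, lambda im: [(1, im)], lambda im: "zh", {"zh": lambda im: "你好"})
```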
In an embodiment, as shown in FIG. 3, before step S201, i.e., before acquiring the image to be recognized, the multilingual text recognition method further includes:
S301: acquire an original image, detect the original image using a blur degree detection algorithm, and acquire the blur degree of the original image.
The original image is the unprocessed image acquired by the server. The blur degree detection algorithm is an algorithm for detecting the blur degree of an image and may adopt detection algorithms commonly used in the prior art. The blur degree of the original image is a value reflecting how blurred the original image is: the larger the blur degree, the more blurred the original image; correspondingly, the smaller the blur degree, the clearer the original image.
The blur degree detection in this embodiment may adopt, but is not limited to, the Laplacian operator. The Laplacian operator is a second-order differential operator, suitable for improving image blur caused by diffuse reflection of light. Its principle is that, during photographic recording, a light spot diffusely reflects light to its surrounding area; the blur so caused is, relative to an image shot under normal conditions, often a constant multiple of the Laplacian. Therefore, step S301 specifically includes the following steps:
S3011: sharpen the original image with the Laplacian operator, and acquire the sharpened image and the pixel gray values of the sharpened image. That is, the server first processes the original image with the Laplacian operator to acquire a Laplacian image describing abrupt grayscale changes, and then superimposes the Laplacian image on the original image to acquire the sharpened image. After acquiring the sharpened image, it acquires the RGB value of each pixel of the sharpened image and processes the RGB values to acquire the pixel gray values of the sharpened image.
S3012: compute the variance of the pixel gray values of the sharpened image, acquire the target variance value of the sharpened image, and determine the target variance value as the blur degree of the original image. That is, the server first computes the sum of the squared differences between each pixel's gray value and the mean gray value of the sharpened image, then divides that sum by the number of pixels, yielding the target variance value that reflects the blur degree of the sharpened image.
Understandably, the original image is first sharpened with the Laplacian operator to acquire a sharpened image whose details are clearer than the original, improving image clarity. Then the target variance value of the sharpened image is computed to reflect the differences among the pixel gray values of the sharpened image. The target variance value of the sharpened image is taken as the blur degree of the original image, so that blur filtering can be performed on the original image according to the comparison of this blur degree with preset thresholds, achieving the goal of acquiring a sufficiently clear original image.
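A minimal OpenCV sketch of S3011-S3012, assuming a grayscale uint8 input; the unit-weight Laplacian overlay is one common choice, not a parameter fixed by this application. Note that with a variance-based measure a larger value normally indicates a sharper image, so the thresholds of S302-S304 below must be calibrated to whichever convention is adopted.

```python
import cv2
import numpy as np

def blur_degree(gray: np.ndarray) -> float:
    """S3011: sharpen by overlaying the Laplacian on the original;
    S3012: variance of the sharpened image's gray values."""
    lap = cv2.Laplacian(gray, cv2.CV_64F)        # second-order derivative image
    sharpened = gray.astype(np.float64) - lap    # overlay to enhance edges
    return float(sharpened.var())                # mean squared deviation per pixel
```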
S302: if the blur degree is greater than a first blur threshold, acquire blur prompt information.
The first blur threshold is a threshold preset by the system for evaluating the highest blur degree at which an image can still serve as the image to be recognized. The blur prompt information is information prompting that the image is too blurred.
Specifically, after comparing the blur degree of the original image with the first blur threshold, if the blur degree is greater than the first blur threshold, the original image is too blurred; performing text recognition directly on it could compromise the accuracy of recognizing its text. Therefore, blur prompt information can be acquired and sent to the corresponding terminal, so that the user re-uploads an original image to the server based on the prompt.
S303: if the blur degree is not greater than the first blur threshold and is greater than a second blur threshold, sharpen and correct the original image to acquire the image to be recognized.
The second blur threshold is a threshold preset by the system for evaluating the lowest blur degree at which an image can still be regarded as a relatively clear image to be recognized. Understandably, the first blur threshold is greater than the second blur threshold.
Specifically, when the blur degree of the original image is greater than the second blur threshold and not greater than the first, i.e., between the two thresholds, the original image is not excessively blurred but does not meet the clarity standard either. In this case, the original image is first sharpened to improve its clarity, and the sharpened image is then corrected to acquire a relatively clear, non-skewed image to be recognized, guaranteeing the accuracy of subsequent text recognition on it. Generally, during optical scanning, objective causes may make the scanned original image skewed, affecting the accuracy of later image recognition, so image correction is needed. The key to skew correction is to automatically detect the skew direction and skew angle from the image features; commonly used skew angle methods include projection-based methods, Hough-transform-based methods, linear-fitting-based methods, and frequency-domain detection via the Fourier transform.
S304: if the blur degree is not greater than the second blur threshold, correct the original image to acquire the image to be recognized.
Specifically, when the blur degree of the original image is not greater than the second blur threshold, the original image is relatively clear and needs no sharpening to enhance its clarity, improving image processing efficiency; but since the original image may be skewed for various objective reasons when it is generated, the server still corrects it to acquire a corrected, non-skewed image to be recognized.
In the multilingual text recognition method provided by this embodiment, corresponding processing is performed according to the comparison of the original image's blur degree with the first and second blur thresholds, which guarantees that a relatively clear and non-skewed image to be recognized is finally acquired, guaranteeing the accuracy of text recognition based on that image and preventing a blurred or skewed image from interfering with the recognition results.
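Continuing the sketch above, the threshold branching of S302-S304 can be written as below. Here sharpen and deskew are placeholders for the Laplacian sharpening and the skew-correction methods (projection, Hough transform, line fitting, or Fourier-domain detection) named in the text, and the threshold values are left as parameters.

```python
def sharpen(img: np.ndarray) -> np.ndarray:
    lap = cv2.Laplacian(img, cv2.CV_64F)
    return np.clip(img.astype(np.float64) - lap, 0, 255).astype(np.uint8)

def deskew(img: np.ndarray) -> np.ndarray:
    return img  # placeholder: estimate skew angle (e.g. via Hough lines) and rotate

def prepare_image(original: np.ndarray, first_thresh: float, second_thresh: float):
    d = blur_degree(original)
    if d > first_thresh:
        return None, "image too blurred, please re-upload"   # S302: blur prompt
    if d > second_thresh:
        return deskew(sharpen(original)), None               # S303: sharpen + correct
    return deskew(original), None                            # S304: correct only
```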
In an embodiment, as shown in FIG. 4, step S202, i.e., performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining each text line image's text line position in the image to be recognized, specifically includes the following steps:
S401: perform text positioning on the image to be recognized using a text positioning algorithm, and acquire at least one text line region.
A text positioning algorithm is an algorithm for locating the text in an image. In this embodiment, text positioning algorithms include, but are not limited to, the proximity search algorithm and the CTPN-RNN algorithm. A text line region is a region containing original text identified from the image to be recognized with the text positioning algorithm; it is determined from one row or one column of original text.
Taking horizontal proximity search as an example, the proximity search algorithm starts from a connected region, finds the horizontal circumscribed rectangle of that connected region, and expands the connected region to the whole rectangle. When the distance between the connected region and its nearest neighboring region is within a certain range, dilation of the rectangle is considered; the dilation direction is the direction of the nearest neighboring region, and the dilation is performed if and only if that direction is horizontal, so as to determine at least one text line region in the image. This effectively merges original characters on the same row of the image into one text line region, achieving the purpose of text positioning. Taking horizontal dilation as an example, performing text positioning on the image to be recognized with the proximity search algorithm to acquire at least one text line region includes: for the rectangular regions formed by any one or more original characters in the image, compute the center vector difference of any two rectangular regions (i.e., the vector formed by the two regions' center points), then subtract from the center vector difference the distances from the two regions' center points to their boundaries, acquiring the boundary vector difference, i.e. (the equation image PCTCN2019116488-appb-000001 in the original; by the stated definitions it amounts to):
(x'_c, y'_c) = (|x_c| - (a_1 + a_2)/2, |y_c| - (b_1 + b_2)/2)
where (x'_c, y'_c) is the boundary vector difference, (x_c, y_c) is the center vector difference, a_1 and b_1 are the length and width of the first rectangular region, and a_2 and b_2 are the length and width of the second rectangular region. Then compute the distance d of the two rectangular regions with the distance formula (equation image PCTCN2019116488-appb-000002 in the original; by the stated definitions it amounts to)
d = max(x'_c, y'_c)
where max() is the function returning the largest value. If the distance d is within a certain range, the dilation operation is performed on that row of text to acquire at least one text line region; the proximity search method acquires the text line regions quickly.
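A small sketch of the rectangle-gap test used by the proximity search. Since the equation images are not reproduced in this copy, the formulas are reconstructed from the textual definitions above and should be read as an interpretation rather than the exact published form.

```python
def boundary_gap(rect1, rect2):
    """rect1/rect2: (cx, cy, a, b) with center point and side lengths.
    Returns d; a small d means the two character boxes are close enough
    for the text line rectangle to dilate toward its neighbor."""
    (x1, y1, a1, b1), (x2, y2, a2, b2) = rect1, rect2
    xc, yc = abs(x2 - x1), abs(y2 - y1)      # center vector difference
    xb = xc - (a1 + a2) / 2                  # minus center-to-boundary distances
    yb = yc - (b1 + b2) / 2
    return max(xb, yb)                       # max() returns the largest value
```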
CTPN (Connectionist Text Proposal Network, hereafter CTPN) is a model for accurately locating text lines in an image; it identifies the coordinates of the four corners of each text line. The main use of RNN (Recurrent Neural Network, hereafter RNN) is to process and predict sequence data; the nodes between its hidden layers are connected, and the input to a hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step. The CTPN-RNN algorithm locates at least one text line region in the image to be recognized; by seamlessly combining CTPN with an RNN convolutional network, it accurately locates the text lines in the image and determines the text line regions from each line's position in the image, i.e., the CTPN-RNN algorithm automatically marks at least one text line region, and the seamless combination of CTPN and RNN effectively improves detection precision.
S402: take screenshots of the at least one text line region with a screenshot tool, acquire at least one text line image, and determine the text line position of each text line image in the image to be recognized according to the screenshot order of the screenshot tool.
Specifically, the server uses OpenCV to take screenshots of the text line regions, acquires the corresponding text line images, and then determines each text line image's text line position in the image to be recognized according to the screenshot order of the tool. OpenCV (Open Source Computer Vision Library) is a cross-platform computer vision library released under the BSD license (open source) that runs on Linux, Windows, Android, and Mac OS. It is lightweight and efficient, consisting of a series of C functions and a small number of C++ classes; it provides interfaces for languages such as Python, Ruby, and MATLAB, and implements many general-purpose algorithms in image processing and computer vision. In this embodiment, the four corner coordinates of each text line region are cropped with OpenCV to acquire the corresponding text line image; cropping with OpenCV is computationally simple, efficient, and stable. The text line position corresponding to each text line image may be the coordinates of the image's four vertices (e.g., the upper-left corner) or of its center point, so that the position of the text line image in the image to be recognized can be determined from it.
In the multilingual text recognition method provided by this embodiment, text positioning is first performed on the image to be recognized with a text positioning algorithm, so that text line regions containing one row or one column of original text are obtained quickly and accurately; a screenshot tool then captures each text line region to acquire at least one text line image, dividing the image to be recognized into at least one text line image so that subsequent text recognition can proceed image by image, avoiding the inaccurate recognition results caused by recognizing text line images of different languages with the same recognition model; finally, the text line position of each text line image is acquired from the screenshot order, so that subsequent positioning or contextual semantic analysis based on the text line position improves text recognition accuracy.
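A minimal OpenCV sketch of S402: crop each located text line region and record a row-based text line position from the top-to-bottom capture order. The (x, y, w, h) box format is an assumption about the detector's output, e.g. CTPN corner coordinates reduced to axis-aligned rectangles.

```python
import cv2

def crop_text_lines(image_path: str, boxes):
    """boxes: iterable of (x, y, w, h) text line regions.
    Returns (row_number, line_image) pairs ordered top to bottom."""
    img = cv2.imread(image_path)
    ordered = sorted(boxes, key=lambda b: b[1])    # capture order: top edge first
    return [(row + 1, img[y:y + h, x:x + w])       # numpy slicing does the "screenshot"
            for row, (x, y, w, h) in enumerate(ordered)]
```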
In an embodiment, the multilingual text recognition system performs language identification with the Encoder-Summarizer mechanism shown in FIG. 7: the Encoder component converts the text line image into a feature sequence, and the Summarizer aggregates the feature sequence and performs the classification task, so that the corresponding target language is determined through classification. As shown in FIG. 5, step S203, i.e., performing language identification on each text line image and acquiring each text line image's target language, specifically includes the following steps:
S501: perform format conversion on the text line image, and acquire a feature map conforming to a preset format.
The preset format is the preset format of the feature maps fed into the Encoder component for encoding; a feature map is a map extracted from the text line image that can be fed into the Encoder component for encoding. The preset format includes a preset height, a preset width, and a preset channel count. The preset channel count is set to 1, meaning the feature map is a grayscale image, which reduces the computation of subsequent processing. Setting the preset height and preset width effectively reduces the interference of the feature map's width and height on the encoding performed by the Encoder component, guaranteeing the accuracy of the acquired feature vector.
Specifically, the server's format conversion of the text line image includes: first grayscale the text line image into a corresponding grayscale image so that its preset channel count is 1; then scale the grayscale image into a first image matching the preset height h; then judge whether the first image's width reaches the preset width: if it does, directly take the first image as the feature map; if it does not, add black or white regions at the left and right edges of the first image to convert it into a feature map matching the preset width. Understandably, the server converts the text line image into a feature map conforming to the preset format so that subsequent image processing is free of the influence of the relevant factors (width, height, and channel count), making the final recognition more accurate.
For example, let the width, height, and channel count of the text line image be w, h, and d respectively; understandably, to reduce the workload of later format conversion, the text line image may be converted into a grayscale image in advance so that d = 1. After format conversion, a feature map with preset width w', preset height h', and preset channel count d' is acquired; in this embodiment h' = 1 and d' = 1 may be used, i.e., the height and channel count are kept fixed during format conversion, guaranteeing the accuracy of subsequent encoding and recognition.
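A minimal sketch of the S501 format conversion: grayscale, scale to the preset height, then pad (or clip) to the preset width. The preset sizes used here are illustrative; the text above fixes only the channel count d' = 1 (and, at the feature-vector level of this embodiment, h' = 1).

```python
import cv2
import numpy as np

def to_feature_map(line_img: np.ndarray, preset_h: int = 32, preset_w: int = 512) -> np.ndarray:
    gray = cv2.cvtColor(line_img, cv2.COLOR_BGR2GRAY)      # channel count becomes 1
    scale = preset_h / gray.shape[0]
    resized = cv2.resize(gray, (max(1, int(gray.shape[1] * scale)), preset_h))
    if resized.shape[1] >= preset_w:                       # overly wide line: clip
        return resized[:, :preset_w]
    pad = preset_w - resized.shape[1]                      # pad the left/right edges
    return cv2.copyMakeBorder(resized, 0, 0, pad // 2, pad - pad // 2,
                              cv2.BORDER_CONSTANT, value=255)
```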
S502: encode the feature map with the Encoder component, and acquire the corresponding feature vector.
As shown in FIG. 7, the Encoder component is an encoder built from convolution layers (Conv) and max pooling layers (MaxPool) combined with Inception blocks, used to acquire the feature vector corresponding to the feature map. As shown in FIG. 7(a), the Encoder component consists, in order, of a convolution layer (Conv), a max pooling layer (MaxPool), an Inception block, a max pooling layer (MaxPool), an Inception block, an Inception block, and four convolution layers (Conv); the kernel sizes and corresponding activation functions are as shown in the figure and can be set according to actual needs. An Inception block is composed of multiple convolution layers (Conv) and average pooling layers (AvgPool) arranged as in FIG. 7(c). As FIGS. 7(a) and 7(c) show, the Encoder component's model structure is fairly simple, which helps speed up computation and save computing cost. During encoding, the Encoder component extracts the information of each pixel of the feature map conforming to the preset format and acquires a feature vector that uniquely identifies the feature map. The purpose of Inception is to design a network with a good local topology, i.e., to perform multiple convolution or pooling operations on the input image in parallel and concatenate all outputs into a very deep feature map. Since convolution operations with different kernels (1*1, 3*3, 5*5, etc.) and pooling operations obtain different information about the input feature map, processing them in parallel with Inception and combining all results yields a better image representation, i.e., the feature vector, guaranteeing the accuracy of the acquired feature vector.
S503: integrate and classify the feature vector with the Summarizer component, acquire at least one language probability, and output the identified language with the largest language probability as the target language corresponding to the text line image.
Specifically, the Summarizer component is a classifier preset in the multilingual text recognition system for performing the classification task that determines the corresponding target language. As shown in FIG. 7(b), the Summarizer component includes three convolution layers (Conv): the first two use the ReLU activation function, and the last uses the Sigmoid activation function. As FIG. 7(b) shows, the Summarizer component's model structure is fairly simple, which helps speed up computation and save computing cost. The server uses the Summarizer component to integrate and classify the feature vector output by the Encoder component and acquires at least one language probability, each corresponding to one identified language; then a Softmax function performs probability-distribution processing on the language probabilities, outputting the identified language with the largest probability as the text line image's target language, guaranteeing the accuracy of the identified target language. In this embodiment, the language probabilities identified by the Summarizer component may exist as an array, denoted |s|, in which each number corresponds to the identification probability of one language; if |s| is the array formed by the four values 0.7, 0, 0.2, and 0.1, referring in order to Chinese, English, Japanese, and Korean, then based on |s| the target language is determined to be Chinese. Here |s| refers to the code point sequence.
In this embodiment, let x be the feature vector corresponding to the text line image and y the code point sequence after encoding by the Encoder component; the code point sequence may be the sequence corresponding to the language probabilities formed after the Summarizer component integrates the feature vector. The model is built probabilistically with conditional probability P(y|x), i.e., the probability that the encoded code point sequence is y given x; the Summarizer component outputs the code point sequence with the largest probability. Let s ∈ S, where s denotes a language; treating s as a latent variable, the text information can be merged into the conditional probability P(y|x), giving the following formula (equation image PCTCN2019116488-appb-000003 in the original; by the stated definitions it amounts to summing over the latent language):
P(y|x) = Σ_{s∈S} P(y|s,x) P(s|x), with the output ŷ = argmax_y P(y|x)
where P(y|s,x) denotes an OCR model able to recognize one kind of text, x denotes the feature vector of an image of fixed height (h' = 1), and s denotes one kind of text, i.e., P(y|s,x) is the probability that, given the image feature vector x, the text in the image belongs to text s. argmax is a function that returns the argument (set) maximizing a function.
Specifically, P(y|s,x) is computed as follows (equation image PCTCN2019116488-appb-000004 in the original; by the stated definitions it amounts to):
P(y|s,x) ∝ P(y|s) · Π_{i=1}^{|C(y)|} P(c_i|s,x) / P(c_i|s)
where C(y) denotes a function converting y into the sequence corresponding to glyph clusters (c_1, c_2, ..., c_|C(y)|); c_i denotes the i-th glyph of y; and P(c|s,x), P(c|s), and P(y|s) denote, respectively, an OCR recognition model for text s, the prior probability of a glyph cluster c, and a language model. P(c|s,x) is the probability that, given x, the given input text s belongs to glyph cluster c; P(c|s) is the probability that, given the input text s, the text in the image belongs to glyph cluster c; and P(y|s) is the probability that, given the input text s, the code point sequence of the encoded text in the image is y.
A glyph cluster is a character string that has some unified property in the text stream and cannot be split in typesetting. Usually a glyph cluster is an ordinary character (a Chinese character or a letter), but sometimes it is a character string arranged by some special rule. A glyph cluster composed of multiple characters has the following properties: 1) in a text line, it behaves like a single character in a traditional typesetting system; 2) the relations and layout between its characters depend only on text attributes, not on typesetting rules; the characters within a glyph cluster should have the same text and font attributes and can be output in one pass.
Understandably, after the system trains the Encoder component and the Summarizer component, it compares in advance the confusion rates of the traditional Seq2Seq recognition model and of the Encoder-Summarizer-based language identification of this embodiment, so that identification accuracy can be determined based on the confusion rate. The confusion rate is the probability that one script is mistaken for another. The test procedure includes the following steps: (1) acquire multilingual image test samples, each containing a corresponding language label, i.e., the label of the language of the text in the image test sample; (2) test the image test samples with the traditional Seq2Seq recognition model and with the method of steps S501-S503 respectively, and acquire each sample's result label, i.e., the label of the language identified for that sample; (3) based on the language labels and result labels of all image test samples, compute the confusion rate of any two languages, which can here be taken as the number of times the two languages are mistaken for each other divided by the total number of occurrences of the two languages. The test results show that the traditional Seq2Seq recognition model confuses Cyrillic with Latin at a rate of 4.2%, while Encoder-Summarizer identification confuses Cyrillic with Latin at a rate of 1.8%. As shown in Table 1, when tested under identical conditions, the confusion rate of Encoder-Summarizer language identification is, for most languages, far lower than that of the traditional algorithm; therefore this embodiment adopts Encoder-Summarizer language identification, improving the accuracy of the identified target language and hence of subsequent text recognition. As shown in FIG. 7, the model structures of the Encoder and Summarizer components are fairly simple, which helps speed up computation and save computing cost.
Table 1: Confusion rate test results (rendered as table images PCTCN2019116488-appb-000005 and PCTCN2019116488-appb-000006 in the original; the numeric entries are not recoverable from this copy)
In the multilingual text recognition method provided by this embodiment, format conversion is first performed on the text line image to acquire a feature map conforming to the preset format, which helps reduce the computation of subsequent encoding and improves its efficiency; the Encoder component then encodes the feature map, quickly acquiring the corresponding feature vector while guaranteeing its accuracy; finally, the Summarizer component integrates and classifies the feature vector, acquires at least one language probability, and outputs the identified language with the largest probability as the text line image's target language, guaranteeing the accuracy of the identified target language and ensuring identification efficiency.
In an embodiment, as shown in FIG. 6, step S205, i.e., performing transcription recognition on the text line image with the target OCR recognition model and acquiring the text line image's target text, specifically includes the following steps:
S601: based on the target language and the text line position corresponding to the text line images, integrate adjacent text line images corresponding to the same target language to form a text block image.
Since the text of a passage's context may be specifically related, i.e., connected context carries particular semantics, when recognizing the text line images the server needs to integrate adjacent text line images of the same target language into one text block image based on their target language and text line positions, treating the text block image as the object recognized as a whole, thereby guaranteeing the accuracy of the target text recognized for each text line image within it.
Specifically, after identifying the target language of each text line image, the server sorts the target languages of all text line images by their text line positions, integrating all text line images that are adjacent in position and belong to the same target language into one text block image, so that subsequent recognition based on the text block image can take its contextual semantics fully into account, improving text recognition accuracy.
S602: cut the text block image with a text cutting algorithm, acquire at least two single-font images, and acquire each single-font image's recognition order and row and column labels according to the typesetting type and cutting order of the text block image.
A text cutting algorithm is an algorithm for cutting a text block image into single-font images; it may specifically be a projection-based text cutting algorithm. For example, when a projection-based text cutting algorithm processes a text block image, each text line image may be projected vertically in turn to acquire the vertical projection pixels; if consecutive pixels satisfy a preset condition, the region corresponding to these consecutive pixels is deemed to contain one original character and is cut off to form a single-font image.
The typesetting type of a text block image is horizontal or vertical. The recognition order of a single-font image is its order within the whole text block image, determined by its typeset position in the block; its row and column labels are its row label and column label in the text block image. For example, if a text block image is formed by three horizontally typeset text line images with 18, 20, and 17 characters respectively, cutting yields 55 single-font images, and the single-font image of the 3rd character in row 2 has recognition order 21, row label 2, and column label 3.
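A minimal sketch of the projection-based cutting of S602 on one binarized text line, plus the order/label bookkeeping from the 18/20/17 example; the min_gap parameter is an assumed stand-in for the "preset condition" on consecutive pixels.

```python
import numpy as np

def cut_line_by_projection(line_bin: np.ndarray, min_gap: int = 2):
    """line_bin: binarized line image with text pixels = 1.
    Returns (start_col, end_col) boxes, one per single-font image."""
    profile = line_bin.sum(axis=0)             # vertical projection
    boxes, start = [], None
    for x, v in enumerate(profile):
        if v > 0 and start is None:
            start = x
        elif v == 0 and start is not None:
            if x - start >= min_gap:           # "preset condition" on the pixel run
                boxes.append((start, x))
            start = None
    if start is not None:
        boxes.append((start, len(profile)))
    return boxes

# Row/column labels and recognition order for horizontal typesetting:
# with per-row character counts [18, 20, 17], the 3rd character of row 2
# has recognition order 18 + 3 = 21, row label 2, column label 3.
```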
S603: input the single-font images into the target OCR recognition model for transcription recognition according to the recognition order corresponding to the single-font images, and acquire at least one recognition result for each single-font image and the recognition probability corresponding to each recognition result.
Specifically, the server inputs the single-font images into the target OCR recognition model for transcription recognition in their recognition order and acquires, for each single-font image, at least one recognition result and the recognition probability corresponding to each result. In this embodiment, only the top three recognition results by probability and their probabilities may be kept, reducing the workload of subsequent recognition processing. A recognition result is what is recognized from a single-font image; it may be a recognized symbol or a recognized character. The recognition probability of a result is the likelihood of recognizing that result from the single-font image. For example, for a single-font image containing the original character "其", the recognized characters may be "其", "甚", and "堪", with recognition probabilities 99.99%, 84.23%, and 47.88% respectively.
S604: based on the recognition results and the typesetting type, divide all single-font images in the text block image into at least two to-be-recognized units.
Specifically, based on the at least one recognition result of each single-font image in the text block image and the typesetting type of the text block image, the server divides all single-font images of the text block image into at least two to-be-recognized units. A to-be-recognized unit is the smallest unit on which semantic recognition is performed; each unit contains the characters that can form one complete sentence, so that semantic recognition can be performed, which helps improve text recognition accuracy. Since a recognition result may be a recognized symbol or a recognized character, semantic separation can be based on whether the recognition results include recognized symbols; when they do not, separation can be based on the typesetting type of the text block image, which helps improve the recognition accuracy of the to-be-recognized units.
In an embodiment, step S604, i.e., dividing all single-font images of the text block image into at least two to-be-recognized units based on the recognition results and the typesetting type, specifically includes the following steps:
S6041: if the recognition results include recognized symbols, form one to-be-recognized unit from all single-font images between any two adjacent recognized symbols.
Specifically, if the recognition results of the single-font images cut from a text block image include recognized symbols, i.e., punctuation marks recognized by the target OCR recognition model, the text of that block is separated by punctuation, and the semantics of the context are related to the punctuation positions. In this case the server needs to form a to-be-recognized unit from all single-font images between any two adjacent recognized symbols, taking that unit as the smallest unit of semantic recognition so that each unit's text corresponds to one sentence, which helps improve the recognition accuracy of the unit's text and reduces the complexity of recognition.
Specifically, when the server recognizes a single-font image with the target OCR recognition model and acquires at least one recognition result and each result's probability, the results are deemed to include a recognized symbol if a symbol's recognition probability is the largest or greater than a preset probability threshold. The preset probability threshold is the threshold for evaluating whether a recognition probability is high enough to identify a result as a given character or symbol; it can be set to a high value to guarantee recognition accuracy.
Further, when the server finds that the recognition results of the single-font images cut from a text block image include recognized symbols, it may first judge whether the recognized symbols include preset symbols; if so, a to-be-recognized unit is formed from all single-font images between any two adjacent preset symbols, improving the unit's recognition accuracy. Preset symbols are punctuation marks preset by the system as marking the end of a sentence, including but not limited to the full stop, question mark, and exclamation mark.
S6042: if the recognition results do not include a recognized symbol, form one to-be-recognized unit from all single-font images corresponding to the same row label or the same column label, according to the typesetting type of the text block image and the single-font images' row and column labels.
Specifically, if the recognition results of the single-font images cut from a text block image include no recognized symbol, the text of that block is not separated by punctuation; in this case the characters of one row or one column generally form one sentence, so the server can form a to-be-recognized unit from all single-font images corresponding to the same row label or the same column label, according to the typesetting type and the row and column labels, which helps improve the unit's recognition accuracy.
Understandably, when the recognition results include recognized symbols, to-be-recognized units are formed from the single-font images between adjacent recognized symbols; when they do not, units are formed from the single-font images of the same row label or column label according to the typesetting type and the row and column labels. This guarantees, to a certain extent, that each unit finally formed contains as complete a sentence as possible, so that subsequent semantic analysis improves recognition accuracy.
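A minimal sketch of the S6041/S6042 division rule for horizontal typesetting: split at sentence-ending preset symbols when any are recognized, otherwise group by row label. The preset symbol set is illustrative.

```python
PRESET_SYMBOLS = {"。", "？", "！"}   # sentence-ending punctuation (illustrative set)

def split_units(chars):
    """chars: (top1_result, row_label) pairs in recognition order."""
    if any(c in PRESET_SYMBOLS for c, _ in chars):     # S6041: split at punctuation
        units, current = [], []
        for c, row in chars:
            current.append((c, row))
            if c in PRESET_SYMBOLS:
                units.append(current)
                current = []
        return units + ([current] if current else [])
    by_row = {}                                        # S6042: one unit per row label
    for c, row in chars:
        by_row.setdefault(row, []).append((c, row))
    return list(by_row.values())
```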
S605: based on the at least one recognition result of each single-font image in any to-be-recognized unit and the recognition probability corresponding to each recognition result, acquire the single-font text corresponding to each single-font image.
Specifically, with the to-be-recognized unit as the smallest unit of semantic recognition, the server uses the at least one recognition result of each single-font image in the unit and each result's recognition probability, together with the contextual semantic relations that may exist among all the single-font images, to determine each single-font image's single-font text more accurately, improving the accuracy of text recognition.
In an embodiment, step S605, i.e., acquiring each single-font image's single-font text based on the at least one recognition result of each single-font image in any to-be-recognized unit and each result's recognition probability, specifically includes the following steps:
S6051: if every single-font image in any to-be-recognized unit has a recognized character whose recognition probability is greater than the preset probability threshold, determine the recognized character whose recognition probability is greater than the preset probability threshold as the single-font text corresponding to that single-font image.
Specifically, the server compares each recognition probability of each single-font image in the unit with the preset probability threshold one by one, to judge whether every single-font image has a recognized character whose probability exceeds the threshold; if every single-font image does, that recognized character is directly determined to be the single-font text corresponding to the single-font image.
Suppose a to-be-recognized unit includes N single-font images, each with M recognized characters and each recognized character with one recognition probability. The server compares the M recognition probabilities of the N single-font images with the preset probability threshold in turn to determine whether each single-font image has a recognized character above the threshold; if all N single-font images do, the accuracy of each single-font image's recognized character is high, and the recognized character can be directly taken as that single-font image's single-font text. For example, with the preset probability threshold set to 95%, in the six single-font images of the to-be-recognized unit "今天天气真好", every image has a recognized character above the threshold: "今" at 97%, "天" at 98%, "天" at 98%, "气" at 99%, "真" at 96%, and "好" at 99%; the recognized characters with probability above the preset threshold (e.g. 95%) are determined as the single-font text, guaranteeing the accuracy of text recognition within the unit.
S6052: if at least one single-font image in any to-be-recognized unit has no recognized character whose recognition probability is greater than the preset probability threshold, use the target language model, in the recognition order corresponding to the single-font images, to recognize the word sequences formed by at least one recognized character of all single-font images in the unit, acquire the sequence probability corresponding to each word sequence, and acquire each single-font image's single-font text based on the word sequence with the largest sequence probability.
Specifically, the server compares each recognition probability of each single-font image in the unit with the preset probability threshold one by one, to judge whether every single-font image has a recognized character above the threshold; when at least one single-font image has no recognized character above the threshold, the accuracy of that image's recognized characters falls short, and contextual semantic analysis is needed to improve the recognition accuracy of the unit's text. The server therefore uses the target language model, in the single-font images' recognition order, to recognize the word sequences formed by the at least one recognized character of all single-font images in the unit, acquires each word sequence's sequence probability, and acquires each single-font image's single-font text based on the word sequence with the largest probability.
The target language model is the model used for semantic analysis of continuous text; it may adopt, but is not limited to, an N-gram model, i.e., a Chinese language model. The Chinese language model uses the collocation information between adjacent words in context: when continuous unspaced pinyin, strokes, or digits representing letters or strokes need to be converted into a Chinese character string (i.e., a sentence), it can compute the sentence with the greatest probability, realizing automatic conversion to Chinese characters without manual selection by the user and avoiding the ambiguity of many Chinese characters sharing the same pinyin (or stroke string, or digit string).
For example, with the preset probability threshold set to 95%, among the six single-font images of the to-be-recognized unit "今天天气真好": the first image recognizes "今" with probability 97%; the second recognizes "夫" with 85% and "天" with 83%; the third recognizes "天" with 90% and "夫" with 75%; the fourth recognizes "气" with 99%; the fifth recognizes "真" with 96%; and the sixth recognizes "好" with 99%. Since the second and third single-font images have no recognized character with probability above the preset threshold (e.g. 95%), word sequences are formed from the recognized characters of all single-font images, such as "今天天气真好", "今天夫气真好", "今夫天气真好", and "今夫夫气真好"; the target language model recognizes these word sequences, acquires each word sequence's sequence probability, and determines each single-font image's single-font text from the word sequence with the largest probability, guaranteeing the accuracy of single-font text recognition.
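A minimal sketch of the S6052 fallback: enumerate the candidate word sequences from each image's recognized characters and keep the one the language model scores highest. Exhaustive enumeration is fine for a short unit; a beam search would replace it in practice. Both the OCR scores and lm_score are assumed here to be log-probabilities.

```python
from itertools import product
from math import log

def best_word_sequence(candidates, lm_score):
    """candidates: per single-font image, a list of (char, log_prob);
    lm_score: callable scoring a whole string, e.g. an N-gram model."""
    best, best_score = None, float("-inf")
    for combo in product(*candidates):                   # every word sequence
        sentence = "".join(c for c, _ in combo)
        score = lm_score(sentence) + sum(p for _, p in combo)
        if score > best_score:
            best, best_score = sentence, score
    return best

# e.g. candidates = [[("今", log(.97))], [("夫", log(.85)), ("天", log(.83))],
#                    [("天", log(.90)), ("夫", log(.75))], [("气", log(.99))],
#                    [("真", log(.96))], [("好", log(.99))]]
```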
Understandably, when every single-font image in a to-be-recognized unit has a recognized character with recognition probability above the preset threshold, the server can directly determine those recognized characters as the single-font images' single-font text, guaranteeing both the accuracy and the efficiency of single-font text recognition; when at least one single-font image has no recognized character above the threshold, the target language model recognizes the word sequences formed by all single-font images and determines each single-font image's single-font text from the word sequence with the largest sequence probability, and the semantic analysis performed by the target language model guarantees the accuracy of single-font text recognition.
S606: perform page layout on the single-font text corresponding to the single-font images according to each single-font image's recognition order and row and column labels, and acquire the target text corresponding to the text line image.
Specifically, after identifying each single-font image's single-font text, the server performs page layout on the single-font text of all single-font images according to their recognition order and row and column labels, i.e., places each single-font image's single-font text at the position corresponding to its row and column labels, acquiring the text line image's target text, so that subsequent comparison and verification based on the target text ensure its recognition accuracy. The target text is the text finally recognized from the text line image.
In the multilingual text recognition method provided by this embodiment, adjacent text line images of the same target language are first integrated into a text block image, so that subsequent recognition proceeds in units of text block images and the contextual semantics between text line images are fully considered, improving text recognition accuracy; the text block image is cut character by character to acquire the single-font images and their corresponding recognition order and row and column labels, enabling positioning based on them; the target OCR recognition model then recognizes the single-font images to determine at least one recognition result and the corresponding recognition probability, and since the target OCR recognition model is dedicated to the target language, its recognition results are more accurate; next, all single-font images are divided, according to all recognition results and the text block image's typesetting type, into at least two to-be-recognized units amenable to semantic analysis, so that analysis based on the units guarantees the accuracy of each single-font image's single-font text; finally, page layout is performed on all single-font text based on the single-font images' recognition order and row and column labels to acquire the text line image's target text for subsequent comparison and verification, ensuring its recognition accuracy.
It should be understood that the step numbers in the above embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
In an embodiment, a multilingual text recognition device is provided, corresponding one-to-one to the multilingual text recognition method in the above embodiments. As shown in FIG. 8, the multilingual text recognition device includes a to-be-recognized image acquisition module 801, a text line image acquisition module 802, a target language acquisition module 803, a recognition model acquisition module 804, a target text acquisition module 805, and a recognition text acquisition module 806. The functional modules are described in detail as follows:
the to-be-recognized image acquisition module 801 is configured to acquire an image to be recognized, the image to be recognized including original text corresponding to at least two languages;
the text line image acquisition module 802 is configured to perform layout analysis and recognition on the image to be recognized, acquire at least one text line image, and determine the text line position of each text line image in the image to be recognized;
the target language acquisition module 803 is configured to perform language identification on each text line image and acquire the target language corresponding to each text line image;
the recognition model acquisition module 804 is configured to query the recognition model database based on the target language and acquire the target OCR recognition model corresponding to the target language;
the target text acquisition module 805 is configured to perform transcription recognition on the text line image using the target OCR recognition model and acquire the target text corresponding to the text line image;
the recognition text acquisition module 806 is configured to acquire, based on the target text and the text line position corresponding to the text line image, the target recognition text corresponding to the image to be recognized.
Preferably, before the to-be-recognized image acquisition module 801, the multilingual text recognition device further includes:
a blur degree detection unit, used to acquire an original image, detect the original image with a blur degree detection algorithm, and acquire the blur degree of the original image;
a first blur processing unit, used to acquire blur prompt information if the blur degree is greater than a first blur threshold;
a second blur processing unit, used to sharpen and correct the original image and acquire the image to be recognized if the blur degree is not greater than the first blur threshold and is greater than a second blur threshold;
a third blur processing unit, used to correct the original image and acquire the image to be recognized if the blur degree is not greater than the second blur threshold;
where the first blur threshold is greater than the second blur threshold.
Preferably, the text line image acquisition module 802 includes:
a text line region acquisition unit, used to perform text positioning on the image to be recognized with a text positioning algorithm and acquire at least one text line region;
a text line image determination unit, used to take screenshots of the at least one text line region with a screenshot tool, acquire at least one text line image, and determine each text line image's text line position in the image to be recognized according to the screenshot order of the screenshot tool.
Preferably, the target language acquisition module 803 includes:
a feature map acquisition unit, used to perform format conversion on the text line image and acquire a feature map conforming to the preset format;
a feature vector acquisition unit, used to encode the feature map with the Encoder component and acquire the corresponding feature vector;
a target language output unit, used to integrate and classify the feature vector with the Summarizer component, acquire at least one language probability, and output the identified language with the largest language probability as the text line image's target language.
Preferably, the recognition text acquisition module 806 includes:
a text block image acquisition unit, used to integrate adjacent text line images of the same target language into a text block image based on the text line images' target language and text line positions;
a single-font image acquisition unit, used to cut the text block image with a text cutting algorithm, acquire at least two single-font images, and acquire each single-font image's recognition order and row and column labels according to the text block image's typesetting type and cutting order;
a recognition result acquisition unit, used to input the single-font images into the target OCR recognition model for transcription recognition in their recognition order and acquire at least one recognition result for each single-font image and the recognition probability corresponding to each result;
a to-be-recognized unit dividing unit, used to divide all single-font images of the text block image into at least two to-be-recognized units based on the recognition results and the typesetting type;
a single-font text acquisition unit, used to acquire each single-font image's single-font text based on the at least one recognition result of each single-font image in any to-be-recognized unit and each result's recognition probability;
a target text acquisition unit, used to perform page layout on the single-font images' single-font text according to each single-font image's recognition order and row and column labels, and acquire the text line image's target text.
Preferably, the to-be-recognized unit dividing unit includes:
a first unit acquisition subunit, used to form a to-be-recognized unit from all single-font images between any two adjacent recognized symbols if the recognition results include a recognized symbol;
a second unit acquisition subunit, used, if the recognition results do not include a recognized symbol, to form a to-be-recognized unit from all single-font images corresponding to the same row label or the same column label, according to the text block image's typesetting type and the single-font images' row and column labels.
Preferably, the single-font text acquisition unit includes:
a first single-character acquisition subunit, used, if every single-font image in any to-be-recognized unit has a recognized character with recognition probability greater than the preset probability threshold, to determine the recognized character above the threshold as that single-font image's single-font text;
a second single-character acquisition subunit, used, if at least one single-font image in any to-be-recognized unit has no recognized character with recognition probability above the preset threshold, to recognize with the target language model, in the single-font images' recognition order, the word sequences formed by at least one recognized character of all single-font images in the unit, acquire each word sequence's sequence probability, and acquire each single-font image's single-font text based on the word sequence with the largest sequence probability.
For the specific limitations of the multilingual text recognition device, refer to the limitations of the multilingual text recognition method above, which are not repeated here. Each module of the above multilingual text recognition device may be implemented wholly or partly by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In an embodiment, a computer device is provided; the computer device may be a server, and its internal structure may be as shown in FIG. 9. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus, where the processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and internal memory; the non-volatile storage medium stores an operating system, a computer program, and a database, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store data used or generated in the process of executing the multilingual text recognition method; the network interface of the computer device is used to communicate with external terminals over a network connection. The computer program, when executed by the processor, implements a multilingual text recognition method.
In an embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor. When the processor executes the computer program, the multilingual text recognition method in the above embodiments is implemented, for example steps S201-S206 shown in FIG. 2 or the steps shown in FIG. 3 to FIG. 6, which are not repeated here to avoid repetition. Alternatively, when the processor executes the computer program, the functions of the modules/units in the above embodiment of the multilingual text recognition device are realized, for example the functions of the to-be-recognized image acquisition module 801, the text line image acquisition module 802, the target language acquisition module 803, the recognition model acquisition module 804, the target text acquisition module 805, and the recognition text acquisition module 806 shown in FIG. 8, which are not repeated here to avoid repetition.
In an embodiment, one or more readable storage media storing computer readable instructions are provided. The computer readable instructions, when executed by one or more processors, cause the one or more processors to implement the multilingual text recognition method in the above embodiments, for example steps S201-S206 shown in FIG. 2 or the steps shown in FIG. 3 to FIG. 6, which are not repeated here to avoid repetition. Alternatively, when the computer readable instructions are executed by the processor, the functions of the modules/units in the above embodiment of the multilingual text recognition device are realized, for example the functions of the to-be-recognized image acquisition module 801, the text line image acquisition module 802, the target language acquisition module 803, the recognition model acquisition module 804, the target text acquisition module 805, and the recognition text acquisition module 806 shown in FIG. 8, which are not repeated here to avoid repetition. The readable storage media in this embodiment include non-volatile readable storage media and volatile readable storage media.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program; the computer readable instructions may be stored in a non-volatile readable storage medium or a volatile readable storage medium, and the computer program, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, database, or other media used in the embodiments provided by this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above functional units and modules is used as an example; in practical applications, the above functions can be allocated to different functional units and modules as needed, i.e., the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or substitute equivalents for some of the technical features therein, and such modifications or substitutions do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of this application; they should all be included within the protection scope of this application.

Claims (20)

  1. A multilingual text recognition method, characterized by comprising:
    acquiring an image to be recognized, the image to be recognized comprising original text corresponding to at least two languages;
    performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized;
    performing language identification on each text line image, and acquiring the target language corresponding to each text line image;
    querying a recognition model database based on the target language, and acquiring the target OCR recognition model corresponding to the target language;
    performing transcription recognition on the text line image using the target OCR recognition model, and acquiring the target text corresponding to the text line image;
    acquiring, based on the target text and the text line position corresponding to the text line image, the target recognition text corresponding to the image to be recognized.
  2. The multilingual text recognition method according to claim 1, characterized in that, before the acquiring an image to be recognized, the multilingual text recognition method further comprises:
    acquiring an original image, detecting the original image using a blur degree detection algorithm, and acquiring the blur degree of the original image;
    if the blur degree is greater than a first blur threshold, acquiring blur prompt information;
    if the blur degree is not greater than the first blur threshold and is greater than a second blur threshold, performing sharpening and correction processing on the original image to acquire the image to be recognized;
    if the blur degree is not greater than the second blur threshold, performing correction processing on the original image to acquire the image to be recognized;
    wherein the first blur threshold is greater than the second blur threshold.
  3. The multilingual text recognition method according to claim 1, characterized in that the performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized comprises:
    performing text positioning on the image to be recognized using a text positioning algorithm, and acquiring at least one text line region;
    taking screenshots of the at least one text line region using a screenshot tool, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized according to the screenshot order of the screenshot tool.
  4. The multilingual text recognition method according to claim 1, characterized in that the performing language identification on each text line image and acquiring the target language corresponding to each text line image comprises:
    performing format conversion on the text line image, and acquiring a feature map conforming to a preset format;
    encoding the feature map using an Encoder component, and acquiring the corresponding feature vector;
    integrating and classifying the feature vector using a Summarizer component, acquiring at least one language probability, and outputting the identified language with the largest language probability as the target language corresponding to the text line image.
  5. The multilingual text recognition method according to claim 1, characterized in that the performing transcription recognition on the text line image using the target OCR recognition model and acquiring the target text corresponding to the text line image comprises:
    integrating, based on the target language and the text line position corresponding to the text line images, adjacent text line images corresponding to the same target language to form a text block image;
    cutting the text block image using a text cutting algorithm, acquiring at least two single-font images, and acquiring the recognition order and the row and column labels corresponding to each single-font image according to the typesetting type and the cutting order of the text block image;
    inputting the single-font images into the target OCR recognition model for transcription recognition according to the recognition order corresponding to the single-font images, and acquiring at least one recognition result corresponding to each single-font image and the recognition probability corresponding to each recognition result;
    dividing, based on the recognition results and the typesetting type, all single-font images in the text block image into at least two to-be-recognized units;
    acquiring, based on the at least one recognition result corresponding to each single-font image in any to-be-recognized unit and the recognition probability corresponding to each recognition result, the single-font text corresponding to each single-font image;
    performing page layout on the single-font text corresponding to the single-font images according to the recognition order and the row and column labels corresponding to each single-font image, and acquiring the target text corresponding to the text line image.
  6. The multilingual text recognition method according to claim 5, characterized in that the dividing, based on the recognition results and the typesetting type, all single-font images in the text block image into at least two to-be-recognized units comprises:
    if the recognition results comprise recognized symbols, forming one to-be-recognized unit from all single-font images between any two adjacent recognized symbols;
    if the recognition results do not comprise a recognized symbol, forming one to-be-recognized unit from all single-font images corresponding to the same row label or the same column label, according to the typesetting type of the text block image and the row and column labels corresponding to the single-font images.
  7. The multilingual text recognition method according to claim 5, characterized in that the acquiring, based on the at least one recognition result corresponding to each single-font image in any to-be-recognized unit and the recognition probability corresponding to each recognition result, the single-font text corresponding to each single-font image comprises:
    if every single-font image in any to-be-recognized unit has a recognized character whose recognition probability is greater than a preset probability threshold, determining the recognized character whose recognition probability is greater than the preset probability threshold as the single-font text corresponding to the single-font image;
    if at least one single-font image in any to-be-recognized unit has no recognized character whose recognition probability is greater than the preset probability threshold, recognizing, using a target language model and according to the recognition order corresponding to the single-font images, the word sequences formed by at least one recognized character corresponding to all single-font images in the to-be-recognized unit, acquiring the sequence probability corresponding to each word sequence, and acquiring the single-font text corresponding to each single-font image based on the word sequence with the largest sequence probability.
  8. A multilingual text recognition device, characterized by comprising:
    a to-be-recognized image acquisition module, configured to acquire an image to be recognized, the image to be recognized comprising original text corresponding to at least two languages;
    a text line image acquisition module, configured to perform layout analysis and recognition on the image to be recognized, acquire at least one text line image, and determine the text line position of each text line image in the image to be recognized;
    a target language acquisition module, configured to perform language identification on each text line image and acquire the target language corresponding to each text line image;
    a recognition model acquisition module, configured to query a recognition model database based on the target language and acquire the target OCR recognition model corresponding to the target language;
    a target text acquisition module, configured to perform transcription recognition on the text line image using the target OCR recognition model and acquire the target text corresponding to the text line image;
    a recognition text acquisition module, configured to acquire, based on the target text and the text line position corresponding to the text line image, the target recognition text corresponding to the image to be recognized.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor implements the following steps when executing the computer program:
    acquiring an image to be recognized, the image to be recognized comprising original text corresponding to at least two languages;
    performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized;
    performing language identification on each text line image, and acquiring the target language corresponding to each text line image;
    querying a recognition model database based on the target language, and acquiring the target OCR recognition model corresponding to the target language;
    performing transcription recognition on the text line image using the target OCR recognition model, and acquiring the target text corresponding to the text line image;
    acquiring, based on the target text and the text line position corresponding to the text line image, the target recognition text corresponding to the image to be recognized.
  10. The computer device according to claim 9, characterized in that the performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized comprises:
    performing text positioning on the image to be recognized using a text positioning algorithm, and acquiring at least one text line region;
    taking screenshots of the at least one text line region using a screenshot tool, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized according to the screenshot order of the screenshot tool.
  11. The computer device according to claim 9, characterized in that the performing language identification on each text line image and acquiring the target language corresponding to each text line image comprises:
    performing format conversion on the text line image, and acquiring a feature map conforming to a preset format;
    encoding the feature map using an Encoder component, and acquiring the corresponding feature vector;
    integrating and classifying the feature vector using a Summarizer component, acquiring at least one language probability, and outputting the identified language with the largest language probability as the target language corresponding to the text line image.
  12. The computer device according to claim 9, characterized in that the performing transcription recognition on the text line image using the target OCR recognition model and acquiring the target text corresponding to the text line image comprises:
    integrating, based on the target language and the text line position corresponding to the text line images, adjacent text line images corresponding to the same target language to form a text block image;
    cutting the text block image using a text cutting algorithm, acquiring at least two single-font images, and acquiring the recognition order and the row and column labels corresponding to each single-font image according to the typesetting type and the cutting order of the text block image;
    inputting the single-font images into the target OCR recognition model for transcription recognition according to the recognition order corresponding to the single-font images, and acquiring at least one recognition result corresponding to each single-font image and the recognition probability corresponding to each recognition result;
    dividing, based on the recognition results and the typesetting type, all single-font images in the text block image into at least two to-be-recognized units;
    acquiring, based on the at least one recognition result corresponding to each single-font image in any to-be-recognized unit and the recognition probability corresponding to each recognition result, the single-font text corresponding to each single-font image;
    performing page layout on the single-font text corresponding to the single-font images according to the recognition order and the row and column labels corresponding to each single-font image, and acquiring the target text corresponding to the text line image.
  13. The computer device according to claim 12, characterized in that the dividing, based on the recognition results and the typesetting type, all single-font images in the text block image into at least two to-be-recognized units comprises:
    if the recognition results comprise recognized symbols, forming one to-be-recognized unit from all single-font images between any two adjacent recognized symbols;
    if the recognition results do not comprise a recognized symbol, forming one to-be-recognized unit from all single-font images corresponding to the same row label or the same column label, according to the typesetting type of the text block image and the row and column labels corresponding to the single-font images.
  14. The computer device according to claim 12, characterized in that the acquiring, based on the at least one recognition result corresponding to each single-font image in any to-be-recognized unit and the recognition probability corresponding to each recognition result, the single-font text corresponding to each single-font image comprises:
    if every single-font image in any to-be-recognized unit has a recognized character whose recognition probability is greater than a preset probability threshold, determining the recognized character whose recognition probability is greater than the preset probability threshold as the single-font text corresponding to the single-font image;
    if at least one single-font image in any to-be-recognized unit has no recognized character whose recognition probability is greater than the preset probability threshold, recognizing, using a target language model and according to the recognition order corresponding to the single-font images, the word sequences formed by at least one recognized character corresponding to all single-font images in the to-be-recognized unit, acquiring the sequence probability corresponding to each word sequence, and acquiring the single-font text corresponding to each single-font image based on the word sequence with the largest sequence probability.
  15. One or more readable storage media storing computer readable instructions, characterized in that the computer readable instructions, when executed by one or more processors, cause the one or more processors to execute the following steps:
    acquiring an image to be recognized, the image to be recognized comprising original text corresponding to at least two languages;
    performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized;
    performing language identification on each text line image, and acquiring the target language corresponding to each text line image;
    querying a recognition model database based on the target language, and acquiring the target OCR recognition model corresponding to the target language;
    performing transcription recognition on the text line image using the target OCR recognition model, and acquiring the target text corresponding to the text line image;
    acquiring, based on the target text and the text line position corresponding to the text line image, the target recognition text corresponding to the image to be recognized.
  16. The readable storage medium according to claim 15, characterized in that the performing layout analysis and recognition on the image to be recognized, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized comprises:
    performing text positioning on the image to be recognized using a text positioning algorithm, and acquiring at least one text line region;
    taking screenshots of the at least one text line region using a screenshot tool, acquiring at least one text line image, and determining the text line position of each text line image in the image to be recognized according to the screenshot order of the screenshot tool.
  17. The readable storage medium according to claim 15, characterized in that the performing language identification on each text line image and acquiring the target language corresponding to each text line image comprises:
    performing format conversion on the text line image, and acquiring a feature map conforming to a preset format;
    encoding the feature map using an Encoder component, and acquiring the corresponding feature vector;
    integrating and classifying the feature vector using a Summarizer component, acquiring at least one language probability, and outputting the identified language with the largest language probability as the target language corresponding to the text line image.
  18. The readable storage medium according to claim 15, characterized in that the performing transcription recognition on the text line image using the target OCR recognition model and acquiring the target text corresponding to the text line image comprises:
    integrating, based on the target language and the text line position corresponding to the text line images, adjacent text line images corresponding to the same target language to form a text block image;
    cutting the text block image using a text cutting algorithm, acquiring at least two single-font images, and acquiring the recognition order and the row and column labels corresponding to each single-font image according to the typesetting type and the cutting order of the text block image;
    inputting the single-font images into the target OCR recognition model for transcription recognition according to the recognition order corresponding to the single-font images, and acquiring at least one recognition result corresponding to each single-font image and the recognition probability corresponding to each recognition result;
    dividing, based on the recognition results and the typesetting type, all single-font images in the text block image into at least two to-be-recognized units;
    acquiring, based on the at least one recognition result corresponding to each single-font image in any to-be-recognized unit and the recognition probability corresponding to each recognition result, the single-font text corresponding to each single-font image;
    performing page layout on the single-font text corresponding to the single-font images according to the recognition order and the row and column labels corresponding to each single-font image, and acquiring the target text corresponding to the text line image.
  19. The readable storage medium according to claim 18, characterized in that the dividing, based on the recognition results and the typesetting type, all single-font images in the text block image into at least two to-be-recognized units comprises:
    if the recognition results comprise recognized symbols, forming one to-be-recognized unit from all single-font images between any two adjacent recognized symbols;
    if the recognition results do not comprise a recognized symbol, forming one to-be-recognized unit from all single-font images corresponding to the same row label or the same column label, according to the typesetting type of the text block image and the row and column labels corresponding to the single-font images.
  20. The readable storage medium according to claim 18, characterized in that the acquiring, based on the at least one recognition result corresponding to each single-font image in any to-be-recognized unit and the recognition probability corresponding to each recognition result, the single-font text corresponding to each single-font image comprises:
    if every single-font image in any to-be-recognized unit has a recognized character whose recognition probability is greater than a preset probability threshold, determining the recognized character whose recognition probability is greater than the preset probability threshold as the single-font text corresponding to the single-font image;
    if at least one single-font image in any to-be-recognized unit has no recognized character whose recognition probability is greater than the preset probability threshold, recognizing, using a target language model and according to the recognition order corresponding to the single-font images, the word sequences formed by at least one recognized character corresponding to all single-font images in the to-be-recognized unit, acquiring the sequence probability corresponding to each word sequence, and acquiring the single-font text corresponding to each single-font image based on the word sequence with the largest sequence probability.
PCT/CN2019/116488 2019-08-01 2019-11-08 多语言文本识别方法、装置、计算机设备及存储介质 WO2021017260A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910706802.7 2019-08-01
CN201910706802.7A CN110569830B (zh) 2019-08-01 2019-08-01 多语言文本识别方法、装置、计算机设备及存储介质

Publications (1)

Publication Number Publication Date
WO2021017260A1 true WO2021017260A1 (zh) 2021-02-04

Family

ID=68773976

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116488 WO2021017260A1 (zh) 2019-08-01 2019-11-08 多语言文本识别方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN110569830B (zh)
WO (1) WO2021017260A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420756A (zh) * 2021-07-28 2021-09-21 浙江大华技术股份有限公司 证件图像的识别方法和装置、存储介质及电子装置
CN113553524A (zh) * 2021-06-30 2021-10-26 上海硬通网络科技有限公司 一种网页的文字排版方法、装置、设备和存储介质
CN113780131A (zh) * 2021-08-31 2021-12-10 众安在线财产保险股份有限公司 文本图像朝向识别方法和文本内容识别方法、装置、设备
CN114047981A (zh) * 2021-12-24 2022-02-15 珠海金山数字网络科技有限公司 项目配置方法及装置
CN115033318A (zh) * 2021-11-22 2022-09-09 荣耀终端有限公司 图像的文字识别方法、电子设备及存储介质
CN116502625A (zh) * 2023-06-28 2023-07-28 浙江同花顺智能科技有限公司 一种简历解析方法和系统

Families Citing this family (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325104A (zh) * 2020-01-22 2020-06-23 平安科技(深圳)有限公司 文本识别方法、装置及存储介质
CN111461097A (zh) * 2020-03-18 2020-07-28 北京大米未来科技有限公司 识别图像信息的方法、装置、电子设备及介质
CN111444905B (zh) * 2020-03-24 2023-09-22 腾讯科技(深圳)有限公司 基于人工智能的图像识别方法和相关装置
CN111444906B (zh) * 2020-03-24 2023-09-29 腾讯科技(深圳)有限公司 基于人工智能的图像识别方法和相关装置
CN111488826B (zh) * 2020-04-10 2023-10-17 腾讯科技(深圳)有限公司 一种文本识别方法、装置、电子设备和存储介质
CN111563495B (zh) * 2020-05-09 2023-10-27 北京奇艺世纪科技有限公司 一种图像中字符的识别方法、装置及电子设备
CN111709249B (zh) * 2020-05-29 2023-02-24 北京百度网讯科技有限公司 多语种模型的训练方法、装置、电子设备和存储介质
CN111626383B (zh) * 2020-05-29 2023-11-07 Oppo广东移动通信有限公司 字体识别方法及装置、电子设备、存储介质
CN113822280A (zh) * 2020-06-18 2021-12-21 阿里巴巴集团控股有限公司 文本识别方法、装置、系统和非易失性存储介质
CN113920286A (zh) * 2020-06-22 2022-01-11 北京字节跳动网络技术有限公司 字符定位方法和装置
CN111797922B (zh) * 2020-07-03 2023-11-28 泰康保险集团股份有限公司 文本图像分类方法及装置
CN111783786A (zh) * 2020-07-06 2020-10-16 上海摩勤智能技术有限公司 图片的识别方法、系统、电子设备及存储介质
CN111832550B (zh) * 2020-07-13 2022-06-07 北京易真学思教育科技有限公司 数据集制作方法、装置、电子设备及存储介质
CN111832657A (zh) * 2020-07-20 2020-10-27 上海眼控科技股份有限公司 文本识别方法、装置、计算机设备和存储介质
CN111914825B (zh) * 2020-08-03 2023-10-27 腾讯科技(深圳)有限公司 文字识别方法、装置及电子设备
CN112818979B (zh) * 2020-08-26 2024-02-02 腾讯科技(深圳)有限公司 文本识别方法、装置、设备及存储介质
CN112100063B (zh) * 2020-08-31 2022-03-01 腾讯科技(深圳)有限公司 界面语言的显示测试方法、装置、计算机设备和存储介质
CN112101367A (zh) * 2020-09-15 2020-12-18 杭州睿琪软件有限公司 文本识别方法、图像识别分类方法、文档识别处理方法
CN112149680B (zh) * 2020-09-28 2024-01-16 武汉悦学帮网络技术有限公司 错字检测识别方法、装置、电子设备及存储介质
CN112200188B (zh) * 2020-10-16 2023-09-12 北京市商汤科技开发有限公司 文字识别方法及装置、存储介质
CN112185348B (zh) * 2020-10-19 2024-05-03 平安科技(深圳)有限公司 多语种语音识别方法、装置及电子设备
CN111967545B (zh) * 2020-10-26 2021-02-26 北京易真学思教育科技有限公司 文本检测方法、装置、电子设备及计算机存储介质
CN112288018B (zh) * 2020-10-30 2023-06-30 北京市商汤科技开发有限公司 文字识别网络的训练方法、文字识别方法和装置
CN112464724B (zh) * 2020-10-30 2023-10-24 中科院成都信息技术股份有限公司 选票识别方法及系统
CN112364667B (zh) * 2020-11-10 2023-03-24 成都安易迅科技有限公司 字符校验方法、装置、计算机设备及计算机可读存储介质
CN112766052A (zh) * 2020-12-29 2021-05-07 有米科技股份有限公司 基于ctc的图像文字识别方法及装置
CN112800972A (zh) * 2021-01-29 2021-05-14 北京市商汤科技开发有限公司 文字识别方法及装置、存储介质
CN113569608A (zh) * 2021-02-08 2021-10-29 腾讯科技(深圳)有限公司 基于深度学习的文本识别方法、装置、设备及存储介质
CN112883966B (zh) * 2021-02-24 2023-02-24 北京有竹居网络技术有限公司 图像字符识别方法、装置、介质及电子设备
CN112883968B (zh) * 2021-02-24 2023-02-28 北京有竹居网络技术有限公司 图像字符识别方法、装置、介质及电子设备
CN112883967B (zh) * 2021-02-24 2023-02-28 北京有竹居网络技术有限公司 图像字符识别方法、装置、介质及电子设备
CN113111871B (zh) * 2021-04-21 2024-04-19 北京金山数字娱乐科技有限公司 文本识别模型的训练方法及装置、文本识别方法及装置
CN113780276B (zh) * 2021-09-06 2023-12-05 成都人人互娱科技有限公司 一种结合文本分类的文本识别方法及系统
CN114140928B (zh) * 2021-11-19 2023-08-22 苏州益多多信息科技有限公司 一种高精准度的数字彩统一化查票方法、系统及介质
CN113903035A (zh) * 2021-12-06 2022-01-07 北京惠朗时代科技有限公司 一种基于超分辨率多尺度重建的文字识别方法及系统
CN114170594A (zh) * 2021-12-07 2022-03-11 奇安信科技集团股份有限公司 光学字符识别方法、装置、电子设备及存储介质
CN115147852A (zh) * 2022-03-16 2022-10-04 北京有竹居网络技术有限公司 一种古籍识别方法、装置、存储介质及设备
CN114596566B (zh) * 2022-04-18 2022-08-02 腾讯科技(深圳)有限公司 文本识别方法及相关装置
CN114663878B (zh) * 2022-05-25 2022-09-16 成都飞机工业(集团)有限责任公司 一种成品软件版本检查方法、装置、设备及介质
CN115147846A (zh) * 2022-07-15 2022-10-04 平安科技(深圳)有限公司 多语言票据识别方法、装置、设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080273796A1 (en) * 2007-05-01 2008-11-06 Microsoft Corporation Image Text Replacement
CN107656922A (zh) * 2017-09-25 2018-02-02 广东小天才科技有限公司 一种翻译方法、装置、终端及存储介质
CN108197109A (zh) * 2017-12-29 2018-06-22 北京百分点信息科技有限公司 一种基于自然语言处理的多语言分析方法和装置
CN109840465A (zh) * 2017-11-29 2019-06-04 三星电子株式会社 识别图像中的文本的电子装置

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609408B (zh) * 2012-01-11 2014-11-26 清华大学 基于多文种文档图像识别的跨文种理解方法
US9436682B2 (en) * 2014-06-24 2016-09-06 Google Inc. Techniques for machine language translation of text from an image based on non-textual context information from the image
CN107515849A (zh) * 2016-06-15 2017-12-26 阿里巴巴集团控股有限公司 一种成词判定模型生成方法、新词发现方法及装置
CN109492143A (zh) * 2018-09-21 2019-03-19 平安科技(深圳)有限公司 图像数据处理方法、装置、计算机设备及存储介质
CN109492643B (zh) * 2018-10-11 2023-12-19 平安科技(深圳)有限公司 基于ocr的证件识别方法、装置、计算机设备及存储介质
CN109543690B (zh) * 2018-11-27 2020-04-07 北京百度网讯科技有限公司 用于提取信息的方法和装置
CN109598272B (zh) * 2019-01-11 2021-08-06 北京字节跳动网络技术有限公司 字符行图像的识别方法、装置、设备及介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080273796A1 (en) * 2007-05-01 2008-11-06 Microsoft Corporation Image Text Replacement
CN107656922A (zh) * 2017-09-25 2018-02-02 广东小天才科技有限公司 一种翻译方法、装置、终端及存储介质
CN109840465A (zh) * 2017-11-29 2019-06-04 三星电子株式会社 识别图像中的文本的电子装置
CN108197109A (zh) * 2017-12-29 2018-06-22 北京百分点信息科技有限公司 一种基于自然语言处理的多语言分析方法和装置

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553524A (zh) * 2021-06-30 2021-10-26 上海硬通网络科技有限公司 一种网页的文字排版方法、装置、设备和存储介质
CN113553524B (zh) * 2021-06-30 2022-10-18 上海硬通网络科技有限公司 一种网页的文字排版方法、装置、设备和存储介质
CN113420756A (zh) * 2021-07-28 2021-09-21 浙江大华技术股份有限公司 证件图像的识别方法和装置、存储介质及电子装置
CN113780131A (zh) * 2021-08-31 2021-12-10 众安在线财产保险股份有限公司 文本图像朝向识别方法和文本内容识别方法、装置、设备
CN113780131B (zh) * 2021-08-31 2024-04-12 众安在线财产保险股份有限公司 文本图像朝向识别方法和文本内容识别方法、装置、设备
CN115033318A (zh) * 2021-11-22 2022-09-09 荣耀终端有限公司 图像的文字识别方法、电子设备及存储介质
CN114047981A (zh) * 2021-12-24 2022-02-15 珠海金山数字网络科技有限公司 项目配置方法及装置
CN116502625A (zh) * 2023-06-28 2023-07-28 浙江同花顺智能科技有限公司 一种简历解析方法和系统
CN116502625B (zh) * 2023-06-28 2023-09-15 浙江同花顺智能科技有限公司 一种简历解析方法和系统

Also Published As

Publication number Publication date
CN110569830A (zh) 2019-12-13
CN110569830B (zh) 2023-08-22

Similar Documents

Publication Publication Date Title
WO2021017260A1 (zh) 多语言文本识别方法、装置、计算机设备及存储介质
US10853638B2 (en) System and method for extracting structured information from image documents
US10936862B2 (en) System and method of character recognition using fully convolutional neural networks
CN108710866B (zh) 汉字模型训练方法、汉字识别方法、装置、设备及介质
WO2019232853A1 (zh) 中文模型训练、中文图像识别方法、装置、设备及介质
WO2017020723A1 (zh) 一种字符分割方法、装置及电子设备
TWI435276B (zh) 用以辨識手寫符號之方法及設備
US10643094B2 (en) Method for line and word segmentation for handwritten text images
US11790675B2 (en) Recognition of handwritten text via neural networks
US9286527B2 (en) Segmentation of an input by cut point classification
CN113158808A (zh) 中文古籍字符识别、组段与版面重建方法、介质和设备
CN110942004A (zh) 基于神经网络模型的手写识别方法、装置及电子设备
RU2648638C2 (ru) Способы и системы эффективного автоматического распознавания символов, использующие множество кластеров эталонов символов
WO2019232850A1 (zh) 手写汉字图像识别方法、装置、计算机设备及存储介质
JP2021166070A (ja) 文書比較方法、装置、電子機器、コンピュータ読取可能な記憶媒体及びコンピュータプログラム
JP2019102061A (ja) テキスト線の区分化方法
WO2019232870A1 (zh) 手写字训练样本获取方法、装置、计算机设备及存储介质
Sharma et al. Segmentation of handwritten words using structured support vector machine
CN115461792A (zh) 手写文本识别方法、装置和系统,手写文本搜索方法和系统,以及计算机可读存储介质
RU2597163C2 (ru) Сравнение документов с использованием достоверного источника
CN113887375A (zh) 一种文本识别方法、装置、设备及存储介质
US9418281B2 (en) Segmentation of overwritten online handwriting input
US11087122B1 (en) Method and system for processing candidate strings detected in an image to identify a match of a model string in the image
Al Sayed et al. Survey on handwritten recognition
US20150169949A1 (en) Segmentation of Devanagari-Script Handwriting for Recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19939681

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19939681

Country of ref document: EP

Kind code of ref document: A1