CN116958992A - Text recognition method and related device - Google Patents

Text recognition method and related device

Info

Publication number
CN116958992A
CN116958992A (application CN202310814591.5A)
Authority
CN
China
Prior art keywords
text
image
images
reference image
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310814591.5A
Other languages
Chinese (zh)
Inventor
郑岩 (Zheng Yan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310814591.5A priority Critical patent/CN116958992A/en
Publication of CN116958992A publication Critical patent/CN116958992A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/16 Image preprocessing
    • G06V30/166 Normalisation of pattern dimensions
    • G06V30/14 Image acquisition
    • G06V30/146 Aligning or centring of the image pick-up or image-field
    • G06V30/147 Determination of region of interest
    • G06V30/19 Recognition using electronic means

Abstract

The embodiment of the application discloses a text recognition method and a text recognition device. Before text recognition is performed on an image to be detected by a text recognition model, image stitching is performed on the text images corresponding to a plurality of text regions in the image to be detected, so as to obtain a reference image; the reference image is then processed by the text recognition model to obtain the text recognition result corresponding to each text image in the reference image; finally, the target text recognition result corresponding to the image to be detected is determined from the text recognition results corresponding to the text images in the image to be detected. By stitching a plurality of text images into a single processing object for the text recognition model, the amount of effective information the model processes in each pass is increased, so that the model recognizes as much text as possible in each pass, which reduces performance loss and improves recognition efficiency.

Description

Text recognition method and related device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a text recognition method and a related device.
Background
Optical character recognition (Optical Character Recognition, OCR) refers to the process in which an electronic device examines characters in a document, determines their shapes by detecting dark and light patterns, and then translates those shapes into computer text using a character recognition algorithm.
In the related art, OCR technology is often adopted to recognize the text in an image containing text: the text image to be recognized is input into a pre-trained text recognition model, which performs the corresponding processing and outputs the text content contained in the text image. In general, the input image size of a text recognition model is fixed, and a relatively large fixed size is usually set so that text images containing many characters can be accommodated; a text image containing few characters must then be padded with background up to that fixed size. When the text recognition model processes such an image, considerable computing resources are wasted on the meaningless background, which causes performance loss and reduces text recognition efficiency.
Disclosure of Invention
The embodiment of the application provides a text recognition method and a related device, which are used for reducing performance loss and improving text recognition efficiency.
The first aspect of the application provides a text recognition method, which comprises the following steps:
based on n text areas in the image to be detected, intercepting text images corresponding to the n text areas from the image to be detected; n is an integer greater than 1;
performing image stitching processing based on the n text images to obtain at least one reference image; the size of the reference image is smaller than or equal to the size of the input image of the text recognition model, and if the reference image is spliced by a plurality of text images, a separator image is inserted between two adjacent text images in the reference image;
performing text recognition processing on each reference image through a text recognition model to obtain a recognition result corresponding to each reference image; the recognition results corresponding to the reference images comprise text recognition results corresponding to the text images in the reference images;
and determining a target text recognition result corresponding to the image to be detected according to the text recognition result and the text region corresponding to each of the n text images.
A second aspect of the present application provides a text recognition apparatus, the apparatus comprising:
the image intercepting module is used for intercepting text images corresponding to the n text areas from the image to be detected based on the n text areas in the image to be detected; n is an integer greater than 1;
The image stitching module is used for performing image stitching processing based on the n text images to obtain at least one reference image; the size of the reference image is smaller than or equal to the size of the input image of the text recognition model, and if the reference image is spliced by a plurality of text images, a separator image is inserted between two adjacent text images in the reference image;
the text recognition module is used for carrying out text recognition processing on each reference image through the text recognition model to obtain a recognition result corresponding to each reference image; the recognition results corresponding to the reference images comprise text recognition results corresponding to the text images in the reference images;
the result determining module is used for determining a target text recognition result corresponding to the image to be detected according to the text recognition result and the text region corresponding to each of the n text images.
A third aspect of the application provides a computer apparatus comprising a processor and a memory:
the memory is used for storing a computer program;
the processor is configured to perform the steps of the text recognition method as described in the first aspect above according to the computer program.
A fourth aspect of the present application provides a computer readable storage medium storing a computer program for executing the steps of the text recognition method according to the first aspect.
A fifth aspect of the application provides a computer program product or computer program comprising computer instructions stored on a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps of the text recognition method according to the first aspect.
From the above technical solutions, the embodiment of the present application has the following advantages:
according to the text recognition method provided by the embodiment of the application, before the text recognition processing is carried out on the image to be detected by using the text recognition model, the image stitching processing is carried out on the basis of the text images corresponding to the text areas in the image to be detected, so as to obtain the reference image, and further, the text recognition result corresponding to the text images in the reference image is obtained by processing the reference image by using the text recognition model, and finally, the target text recognition result corresponding to the image to be detected is determined by using the text results corresponding to the text images in the image to be detected. The text recognition model is used for recognizing the text, and the text recognition model is used for recognizing the text, wherein the text recognition model is used for recognizing the text, and the text recognition model is used for recognizing the text.
Drawings
FIG. 1a is a schematic diagram of a text recognition method according to an embodiment of the present application;
FIG. 1b is a schematic diagram of text image stitching according to an embodiment of the present application;
fig. 2 is a schematic diagram of an application scenario of a text recognition method according to an embodiment of the present application;
FIG. 3 is a flowchart of a text recognition method according to an embodiment of the present application;
fig. 4 is a schematic diagram of an image to be detected according to an embodiment of the present application;
FIG. 5a is a schematic diagram of a text recognition method according to an embodiment of the present application;
fig. 5b is a schematic diagram of an image to be detected and a target text recognition result according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a text recognition model training process based on different splicing modes according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a text recognition device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a server according to an embodiment of the application.
Detailed Description
In order to make the present application better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the related art, OCR technology is generally adopted to recognize text in an image containing text: a text image corresponding to a text region is extracted, the text image to be recognized is input into a pre-trained text recognition model, and the model performs the corresponding processing and outputs the text content contained in the text image.
Specifically, referring to fig. 1a, the diagram is a schematic diagram of a text recognition method according to an embodiment of the present application.
Referring to fig. 1a, in the related art, a plurality of text images including text, such as the text image P1 and the text image P2 in fig. 1a, are generally identified for the image a to be detected including text, and then the text image P1 and the text image P2 are respectively input into a text recognition model for recognition.
In general, the input image size of the text recognition model is fixed, and a relatively large fixed size is set so that text images with many characters can be accommodated. Suppose the input image size of the text recognition model is 32×512, the size of text image P1 is 32×412, and the size of text image P2 is 32×32. For text image P1, only a 32×100 background image needs to be filled in to bring its size to 32×512; for text image P2, however, a 32×480 background image must be filled in to bring its size to 32×512.
For text images containing few characters, a large amount of background must be filled in to reach the fixed size. When text image P2 is processed by the text recognition model, considerable computing resources are therefore wasted on the meaningless background, which causes performance loss and reduces text recognition efficiency.
In order to solve the above technical problems, an embodiment of the present application provides a text recognition method and a related device, before performing text recognition processing on an image to be detected by using a text recognition model, performing image stitching processing on text images corresponding to a plurality of text regions in the image to be detected, so as to obtain a reference image, further, processing the reference image by using the text recognition model, so as to obtain text recognition results corresponding to the text images in the reference image, and finally, determining a target text recognition result corresponding to the image to be detected by using the text results corresponding to each text image in the image to be detected.
Referring to fig. 1b, which is a schematic diagram of text image stitching provided in the embodiment of the present application: text images P1 and P2 are stitched together with a 32×8 separator image between them, yielding a stitched text image P12 of size 32×452. P12 then only needs a 32×60 background fill to meet the input requirement. Compared with processing P1 and P2 independently of each other, the stitched image P12 avoids filling a large amount of background, thereby reducing performance loss and improving recognition efficiency.
Therefore, by splicing a plurality of text images to be used as processing objects of the text recognition model, effective information required to be processed by the text recognition model in each working process can be increased, so that the text recognition model can recognize as many texts as possible in each working process, performance loss is reduced, and recognition efficiency is improved.
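As a concrete illustration of the stitching described above, a minimal NumPy sketch is given below. The sizes (P1 at 32×412, P2 at 32×32, a 32×8 separator, input width 512) come from the example; the grayscale arrays, the `stitch_with_separator` helper and the zero-valued separator pixels are illustrative assumptions, not the patent's actual implementation.

```python
import numpy as np

SEP_W = 8  # separator width from the example above

def stitch_with_separator(images, sep):
    """Horizontally concatenate text images, inserting the separator
    image between each adjacent pair."""
    parts = []
    for img in images:
        if parts:
            parts.append(sep)
        parts.append(img)
    return np.hstack(parts)

# P1 (32x412) and P2 (32x32) from the example, as dummy grayscale arrays
p1 = np.ones((32, 412), dtype=np.uint8)
p2 = np.ones((32, 32), dtype=np.uint8)
sep = np.zeros((32, SEP_W), dtype=np.uint8)

p12 = stitch_with_separator([p1, p2], sep)
assert p12.shape == (32, 452)  # 412 + 8 + 32

# background columns padded for a 512-wide model input:
separate = (512 - 412) + (512 - 32)  # 580 when P1 and P2 are fed separately
stitched = 512 - 452                 # 60 when the stitched P12 is fed
```

The arithmetic makes the saving explicit: feeding the two images separately pads 580 background columns, while the stitched image pads only 60.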
Referring to fig. 2, the application scenario of the text recognition method according to the embodiment of the present application includes a text recognition device 200.
The text recognition device 200 intercepts text images corresponding to the n text regions from the image to be detected based on the n text regions in the image to be detected; n is an integer greater than 1. As an example, the text recognition apparatus 200 recognizes 3 text areas a, B, and C, respectively, in the image to be detected, and intercepts text images a, B, and C, respectively, corresponding to the text areas a, B, and C, respectively.
The text recognition device 200 performs image stitching processing based on the n text images to obtain at least one reference image; the size of the reference image is smaller than or equal to the size of the input image of the text recognition model, and if the reference image is spliced by a plurality of text images, a separator image is inserted between two adjacent text images in the reference image. As an example, the size of the text image a is 32×92, the size of the text image b is 32×400, the size of the text image c is 32×500, the size of the separator image is 32×8, and the size of the input image is 32×500; then the text image a and the text image b can be spliced, and a separator image is inserted in the middle to obtain a reference image 1, wherein the size is 32 x 500; since the size of the text image c is equal to the size of the input image, the text image c is taken as the reference image 2, and the size is 32×500; in this case, since the reference image 1 is formed by stitching the text image a and the text image b, a separator image needs to be added between the two text images to separate different text images in the reference image.
The text recognition device 200 performs text recognition processing on each reference image through a text recognition model to obtain a recognition result corresponding to each reference image; the recognition result corresponding to the reference image includes a text recognition result corresponding to the text image in the reference image. As an example, the text recognition model recognizes the reference image 1 and the reference image 2 respectively, so as to obtain a recognition result of the reference image 1, that is, a recognition result of the text image a and a recognition result of the text image b; and obtaining the recognition result of the reference image 2, namely obtaining the recognition result of the text image c.
The text recognition device 200 determines a target text recognition result corresponding to the image to be detected according to the text recognition result and the text region corresponding to each of the n text images. As an example, the recognition results corresponding to the text images a, B and C are respectively corresponding to the text areas a, B and C, and each recognition result is restored to the corresponding text area to complete the recognition of the image to be detected, so as to obtain the target text recognition result corresponding to the image to be detected.
Therefore, by splicing a plurality of text images to be used as processing objects of the text recognition model, effective information required to be processed by the text recognition model in each working process can be increased, so that the text recognition model can recognize as many texts as possible in each working process, performance loss is reduced, and recognition efficiency is improved. That is, by splicing the text image a and the text image b as the processing objects of the text recognition model, the effective information of the text recognition model required to be processed during working can be increased, so that the text recognition model can recognize two texts of the text image a and the text image b at one time, thereby reducing performance loss and improving recognition efficiency.
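The restoration step described above (splitting each reference image's recognition result and mapping the pieces back to their text regions) can be sketched as follows. The `<sep>` output token, the `restore_results` helper and its argument layout are assumptions, on the premise that the recognizer emits one string per reference image with a separator token between the texts of the stitched images.

```python
# hypothetical token the recognizer is assumed to emit for the separator image
SEP_TOKEN = "<sep>"

def restore_results(reference_outputs, batches, regions):
    """Split each reference image's recognized string on the separator
    token and map each piece back to its source text region."""
    result = {}
    for out, batch in zip(reference_outputs, batches):
        pieces = out.split(SEP_TOKEN)
        for idx, text in zip(batch, pieces):
            result[regions[idx]] = text
    return result

# reference image 1 stitches text images a and b; reference image 2 is c alone
outputs = ["hello<sep>world", "patents"]
batches = [[0, 1], [2]]          # text-image indices per reference image
regions = ["A", "B", "C"]        # text region per text-image index
print(restore_results(outputs, batches, regions))
# → {'A': 'hello', 'B': 'world', 'C': 'patents'}
```

Keeping a per-reference-image record of which text images were stitched in (here `batches`) is what makes the split results attributable to the right regions.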
The text recognition method provided by the application can be applied to the text recognition device 200 with data processing capability, such as a server and a terminal device. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, a cloud server for providing cloud computing service, or the like, but is not limited thereto; terminal devices include, but are not limited to, cell phones, tablets, computers, smart cameras, smart voice interaction devices, smart appliances, vehicle terminals, aircraft, and the like. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
The text recognition method provided by the embodiment of the application relates to artificial intelligence and computer vision technology. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate and extend human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is thus the study of the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason and make decisions.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Basic technologies for artificial intelligence generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, electromechanical integration, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and other directions.
Computer Vision (CV) is the science of how to make machines "see": it uses cameras and computers, in place of human eyes, to identify and measure targets, and further performs graphics processing so that the result becomes an image better suited for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
When the examples of the present application are applied, the collection and processing of the related data should strictly comply with the requirements of relevant national laws and regulations, the informed consent or separate consent of the personal-information subject should be obtained, and subsequent data use and processing should be carried out within the scope authorized by laws, regulations and the personal-information subject.
Next, the text recognition method provided by the embodiment of the present application will be specifically described below in terms of the above text recognition device.
Referring to fig. 3, a flowchart of a text recognition method according to an embodiment of the present application is shown. The text recognition method provided by the embodiment of the application specifically comprises the following steps:
s301: based on n text areas in the image to be detected, intercepting text images corresponding to the n text areas from the image to be detected; n is an integer greater than 1.
The image to be detected means an image containing text and to be subjected to text recognition.
The text region means a region containing text in the image to be detected, and the size of the region is generally smaller than the size of the image to be detected and larger than the size of text in the text region. For example, the size of the image to be detected is 400×400, and the size of the text is 20×30, and then the size of the text region may be 25×35, and the size of the text region is not particularly limited.
As an implementation manner, the text region may be determined according to the size of the text, i.e. the sizes of different text regions may be the same or different. For example, assuming the size of text 1 is 30×60 and the size of text 2 is 70×35, and the length and width of a text region are preset to exceed those of the text by 5, then text region 1 corresponding to text 1 is 35×65 and text region 2 corresponding to text 2 is 75×40. The difference between the size of the text region and the size of the text may be determined according to the practical situation and is not particularly limited here.
It should be understood that after the image to be detected is acquired, the image to be detected may be detected, the text line position in the image to be detected is detected, and then the text region is determined based on the detected text line position and according to a preset size rule of the text region.
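One possible reading of this region-determination rule is sketched below in Python. The `(x, y, w, h)` box format, the `region_from_text_box` helper and the way the extra margin is split between the two sides are assumptions; the 20×30 text and 400×400 image dimensions come from the example above.

```python
def region_from_text_box(box, extra=5, img_w=400, img_h=400):
    """Grow a detected text bounding box (x, y, w, h) so that the text
    region is `extra` pixels larger than the text in each dimension
    (as in the 20x30 text -> 25x35 region example), clipped to the
    image bounds."""
    x, y, w, h = box
    shift = extra // 2            # assumed split of the margin between sides
    x0 = max(0, x - shift)
    y0 = max(0, y - shift)
    x1 = min(img_w, x0 + w + extra)
    y1 = min(img_h, y0 + h + extra)
    return (x0, y0, x1 - x0, y1 - y0)

# text of size 20x30 detected at (100, 50) in a 400x400 image
print(region_from_text_box((100, 50, 20, 30)))  # → (98, 48, 25, 35)
```

Clipping to the image bounds keeps regions near the border valid, at the cost of a slightly smaller margin there.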
S302: and performing image stitching processing based on the n text images to obtain at least one reference image.
The size of the reference image is smaller than or equal to the size of the input image of the text recognition model, and if the reference image is spliced by a plurality of text images, a separator image is inserted between two adjacent text images in the reference image.
The reference image is an image obtained by stitching a plurality of text images, or a single text image: when the size of a text image is already close to the size of the input image, that text image can be used directly as a reference image, without stitching.
The input image size is the size of the images input to the text recognition model. The input image size of a typical text recognition model is fixed, for example 32×512 or 40×400, and is not particularly limited here.
The separator image is an intermediate image used to separate different text images; its size is generally preset and fixed. To prevent the content of the separator image from interfering with the recognition of normal text, its content consists of characters outside the text recognition dictionary, such as Greek or Roman letters, so that the image features of the separator differ clearly from those of the text images and the separator remains clearly distinguishable, which ensures the accuracy of character recognition. The pixels of the separator image can also be set to a fixed, meaningless sequence, for example ordered values in the range 0 to 255, so that the separator can be distinguished from the text images during later recognition, ensuring that the separator itself is recognized accurately.
It should be understood that when the sizes of the text images are small, stitching a plurality of text images brings the size of the stitched image closer to the input image size. However, since the text images are independent of each other, a separator image needs to be inserted between two adjacent text images, both to distinguish the different text images and to allow the recognition result of each text image to be restored to its corresponding text region, thereby ensuring recognition accuracy.
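The fixed-pixel-sequence variant of the separator image mentioned above could be sketched as follows. The `make_separator` helper and the exact arrangement of the 0-255 values are assumptions; a 32×8 separator happens to contain exactly 256 pixels, so the full value range fits once.

```python
import numpy as np

def make_separator(height=32, width=8):
    """Build a separator image whose pixels follow a fixed, meaningless
    sequence of ordered 0-255 values, so that its image features differ
    clearly from real text crops (one possible reading of the scheme
    described above)."""
    seq = np.arange(height * width) % 256
    return seq.reshape(height, width).astype(np.uint8)

sep = make_separator()
# fixed and reproducible: the same separator is used at train and test time
assert (sep == make_separator()).all()
```

Any fixed pattern that is unlike text works; the essential property is that it is identical across training and inference so the model can learn to recognize it.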
To further illustrate the process of stitching the n text images to obtain at least one reference image, in one possible implementation, step S302 may specifically include:
a1: the n text images are ordered in order of image width from small to large or from large to small based on the respective image widths of the n text images.
A2: and according to the sequencing result of the n text images, performing image stitching processing on the basis of the n text images to obtain at least one reference image.
It should be understood that the height of each text image is generally smaller than or equal to the height of the input image, so the heights are not particularly limited here; the reference images are obtained by sorting the text images by image width, from large to small or from small to large, and then stitching text images of similar widths.
The text images with similar widths are spliced, and when the reference image is identified by a subsequent text identification model, the context relation of each text is learned, so that consistency reasoning is performed, and the identification accuracy is improved. As an example, refer to fig. 4, which is a schematic diagram of an image to be detected according to an embodiment of the present application.
Referring to fig. 4, the image to be detected includes text regions A, B, C, D, E and F, whose corresponding text images are A, B, C, D, E and F, with sizes 32×30, 32×31, 32×30, 32×100, 32×101 and 32×120 respectively. Sorted by width from small to large, the 6 text images are ordered A, C, B, D, E, F. A, C and B can then be stitched to obtain reference image 1. Since the texts corresponding to the three text images A, C and B all represent numbers, the text recognition model can take the context among them into account during recognition, which further improves accuracy: for example, the digit "1" is recognized rather than the English letter "l", i.e. the recognition accuracy for similar-looking characters is improved.
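The sort-and-stitch grouping of steps A1 and A2 can be sketched as a simple greedy packing. The `pack_text_images` helper, the greedy strategy and the 200-pixel input width used in the demo are assumptions; the widths of text images A to F are taken from the fig. 4 example.

```python
def pack_text_images(widths, input_w=512, sep_w=8):
    """Sort text-image widths ascending, then greedily pack consecutive
    images into reference images whose total width (including one
    separator between adjacent images) stays within the model input
    width. Returns lists of original indices, one per reference image.
    A single image wider than input_w still gets its own batch."""
    order = sorted(range(len(widths)), key=lambda i: widths[i])
    batches, current, used = [], [], 0
    for i in order:
        extra = widths[i] if not current else sep_w + widths[i]
        if current and used + extra > input_w:
            batches.append(current)
            current, used = [], 0
            extra = widths[i]
        current.append(i)
        used += extra
    if current:
        batches.append(current)
    return batches

# widths of text images A..F from the example above (indices 0..5)
widths = [30, 31, 30, 100, 101, 120]
print(pack_text_images(widths, input_w=200))  # → [[0, 2, 1], [3], [4], [5]]
```

With a hypothetical 200-pixel input width, the three narrow images A, C and B land in one reference image, matching the grouping in the example.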
S303: and carrying out text recognition processing on each reference image through the text recognition model to obtain a recognition result corresponding to each reference image.
The recognition results corresponding to the reference images comprise text recognition results corresponding to the text images in the reference images.
The text recognition model means a model for recognizing text in an image containing text, typically a pre-trained text model.
After the reference images are obtained, they need to be input into the text recognition model for processing. Since the size of a reference image is smaller than or equal to the input image size, a reference image equal to the input size can be input directly, while a reference image smaller than the input size must first be filled so that its size meets the input requirement.
As a possible implementation manner, step S303 may specifically include:
b1: for each reference image, if the size of the reference image is smaller than the input image size of the text recognition model, filling the reference image to obtain a target reference image with the size equal to the input image size; and if the size of the reference image is equal to the size of the input image of the text recognition model, taking the reference image as a target reference image.
B2: and carrying out text recognition processing on each target reference image through the text recognition model to obtain a recognition result corresponding to each target reference image.
It should be appreciated that filling a reference image that is smaller than the input image size makes it meet the input requirement, and that, compared with filling a single text image and inputting it into the text recognition model, stitching lets the model process more effective information in one pass, reducing performance loss and improving recognition efficiency.
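The filling in step B1 can be sketched as right-padding each pixel row of the reference image out to the model's input width. A minimal sketch under assumed names; the fill value 0 (black padding) is an assumption, as the embodiment does not state it.

```python
def pad_to_input_width(image_rows, input_w=512, fill=0):
    """Right-pad each row of a height x width pixel grid with `fill`
    so the reference image matches the model's input width (step B1).
    A reference image already at input_w passes through unchanged."""
    assert all(len(r) <= input_w for r in image_rows)
    return [r + [fill] * (input_w - len(r)) for r in image_rows]

# A 32x100 reference image becomes a 32x512 target reference image
rows = [[1] * 100 for _ in range(32)]
padded = pad_to_input_width(rows)
print(len(padded), len(padded[0]))  # 32 512
```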
S304: and determining a target text recognition result corresponding to the image to be detected according to the text recognition result and the text region corresponding to each of the n text images.
The target text recognition result means a text recognition result corresponding to the image to be detected, which includes the text recognition result of each text region.
In order to further explain the process of text recognition of an image to be detected, refer to fig. 5a, which is a schematic diagram of a text recognition method according to an embodiment of the present application.
Referring to fig. 5a, in the text recognition method provided by the embodiment of the application, an image to be detected is input into a detection model to obtain a plurality of text images; the text images are sorted by width and spliced to obtain a reference image; the reference image is then input into the text recognition model to obtain a text recognition result corresponding to the reference image; finally, the text recognition result is split and restored to obtain the target text recognition result corresponding to the image to be detected.
The detection model means a model for detecting text in an image to be detected, the detection model is a model trained in advance, and a specific training process can be referred to in the related art, and is not described in detail herein.
For further explanation of the target recognition result, refer to fig. 5b, which is a schematic diagram of the recognition result of the image to be detected and the target text according to the embodiment of the present application. The image to be detected includes three texts, namely "Tianqing qi Shuang", "Yi Liu" and "Yi Liu Zhu Bao", and the target text recognition result is obtained by processing the image to be detected through the steps S301 to S304, where the target text recognition result is a result of reducing three text recognition results of "Tianqing qi Shuang", "Yi Liu" and "Yi Liu Zhu Bao" to each text region in the image to be detected, specifically, see fig. 5 b.
It should be understood that after the reference image is identified by the text recognition model to obtain the recognition result, the recognition result is further required to be segmented according to the separator, and each text recognition result is restored to each position in the image to be detected according to the corresponding relationship between the text image and the text region, so as to obtain the final target text recognition result.
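The segmentation-and-restoration step above can be sketched as splitting the stitched recognition string on the separator and mapping each piece back to its text region. The `"<sep>"` token and region identifiers are illustrative assumptions; the embodiment only specifies that the result is segmented according to the separator.

```python
def restore_results(recognized, regions, sep_token="<sep>"):
    """Split the stitched recognition result on the separator token and
    restore each text recognition result to its text region, using the
    known correspondence between text images and regions."""
    pieces = recognized.split(sep_token)
    return dict(zip(regions, pieces))

# Three stitched digit recognitions restored to regions A, C, B
print(restore_results("1<sep>1<sep>1", ["A", "C", "B"]))
```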
According to the text recognition method provided by the embodiment of the application, before text recognition is performed on the image to be detected with the text recognition model, image stitching is performed on the text images corresponding to the text areas in the image to be detected to obtain the reference image; the reference image is then processed by the text recognition model to obtain the text recognition results corresponding to the text images in the reference image; finally, the target text recognition result corresponding to the image to be detected is determined from the text recognition results corresponding to the text images. Since the reference image is obtained by stitching, the text recognition model processes more effective information in a single pass and can exploit the contextual relationship among the stitched text images, which improves both recognition efficiency and recognition accuracy.
Based on the text recognition method provided in the foregoing embodiment, in order to further improve the accuracy of text recognition, in a possible implementation manner, the text recognition model used in the embodiment of the present application may be a text recognition model based on a self-attention mechanism, and when the input reference image is a target reference image equal to the size of the input image, step B2 may specifically include:
C1: performing downsampling processing on the target reference image aiming at each target reference image to obtain image features of the target reference image, determining separation embedded features and position embedded features corresponding to the image features, wherein the separation embedded features are used for representing whether feature bits in the image features correspond to separator images or not; and determining the input characteristics of the target reference image according to the image characteristics, the corresponding separation embedding characteristics and the position embedding characteristics.
As an example, assuming the input image size is 32×512, the height dimension of a target reference image with height 32 is typically downsampled by a factor of 2 five times, so the height becomes 1, while the width dimension is typically downsampled by a factor of 8 to preserve recognition accuracy. A 32×512 target reference image thus yields an image feature of size 1×64, which may be denoted F-img.
Since the target reference image may be obtained by splicing a plurality of text images, a separator image exists between two adjacent text images. To ensure recognition accuracy, a separation embedding code of length 64 can be defined according to the positions of the separator images in the target reference image, assigning 0 to non-separator feature bits and 1 to separator feature bits. Assuming the separator image size is set to 32×8, the separator image after downsampling is 1×1, that is, one separator occupies one feature bit, so a 64-bit separation embedding code is obtained; this code is passed through an embedding layer to obtain the separation embedding feature, denoted F-split.
Further, since a text recognition model based on the self-attention mechanism has no inherent notion of order while text recognition is sequential, a position embedding feature needs to be added. Specifically, a unique code can be assigned along the feature length of the target reference image, for example the sequential codes 0 to 63, and passed through an embedding layer to obtain the position embedding feature, denoted F-pos.
The sum of the image feature, the separation embedding feature and the position embedding feature is taken as the input of the text recognition model based on the self-attention mechanism, that is, F = F-img + F-split + F-pos.
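The construction of the input feature F can be sketched as below. This is a simplified illustration: in the embodiment each of the 64 feature bits is an embedding vector produced by an embedding layer, whereas here each bit is a scalar so the element-wise sum F = F-img + F-split + F-pos is easy to follow; all function names are assumptions.

```python
SEQ_LEN = 64  # a 512-wide input downsampled 8x along the width

def split_embedding_code(sep_positions, seq_len=SEQ_LEN):
    """Separation embedding code: 1 where a feature bit corresponds to a
    separator image, 0 elsewhere (before the embedding layer)."""
    return [1 if i in sep_positions else 0 for i in range(seq_len)]

def model_input(f_img, f_split, f_pos):
    """F = F-img + F-split + F-pos, summed element-wise over the
    64 feature bits of the target reference image."""
    return [a + b + c for a, b, c in zip(f_img, f_split, f_pos)]

code = split_embedding_code({10, 25})          # separators at bits 10 and 25
f = model_input([1.0] * SEQ_LEN, [0.1] * SEQ_LEN, [0.01] * SEQ_LEN)
print(sum(code), len(f))  # 2 64
```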
C2: and determining a recognition result corresponding to the target reference image according to the input characteristics of the target reference image through a text recognition model based on a self-attention mechanism.
The self-attention mechanism (Self-Attention) is an attention mechanism from input to output that aims to capture the correlation between different parts of the input. A neural network receives vectors of different sizes between which certain relationships exist, but ordinary training cannot exploit the relationships between different inputs deeply, which degrades the training effect; the self-attention mechanism addresses this.
A Transformer is a neural network that can learn context, and thus meaning, by considering the relationships in sequence data. Correspondingly, the Transformer model is a deep learning model based on the self-attention mechanism that uses attention to improve the training speed of the model.
As a possible implementation manner, in order to further improve the recognition accuracy of the separator image, the self-attention weight of the separator embedded feature may be increased, so as to improve the recognition accuracy of the separator image by the text recognition model based on the self-attention mechanism.
Specifically, as an example, when the target reference image is processed, the sum of the image feature A, the separation embedding feature B and the position embedding feature C is taken as the input of the text recognition model based on the self-attention mechanism, denoted F. After F is input into the model, it is processed to obtain an intermediate value F1, from which the output Z of the text recognition model is computed. When computing Z, the respective self-attention weights of the image feature A, the separation embedding feature B and the position embedding feature C are considered. Assume the self-attention weight of the image feature A is 0.4, that of the separation embedding feature B is 0.1 and that of the position embedding feature C is 0.5. To improve the recognition accuracy for separator images, the self-attention weight of the separation embedding feature B can be increased while the self-attention weight of the image feature A and/or the position embedding feature C is reduced, for example raising the weight of B to 0.2 and reducing the weight of A to 0.3, thereby improving the accuracy with which the text recognition model based on the self-attention mechanism recognizes separator images.
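The weight adjustment in this example can be sketched as shifting attention mass from the image-feature weight to the separation-embedding weight while keeping the three weights summing to 1. The function name and the choice to take the compensating amount from the image feature are assumptions; the embodiment allows reducing either the image-feature or the position-embedding weight.

```python
def boost_separator_weight(weights, delta=0.1):
    """Increase the separation-embedding self-attention weight by delta
    and reduce the image-feature weight by the same amount, so the
    weights of (image A, separator B, position C) still sum to 1."""
    a, b, c = weights  # image feature, separation embedding, position embedding
    return (round(a - delta, 10), round(b + delta, 10), c)

print(boost_separator_weight((0.4, 0.1, 0.5)))  # (0.3, 0.2, 0.5)
```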
The embodiment of the application improves the accuracy of text recognition by taking the sum of the image features, the separation embedded features and the position embedded features as the input of a text recognition model based on a self-attention mechanism and considering the context relation between text images based on the self-attention mechanism.
Based on the text recognition method provided in the foregoing embodiment, in order to avoid the text recognition model missing separator images or recognizing extra ones, in one possible implementation manner, when the recognition result corresponding to the reference image further includes the separator recognition result corresponding to the separator image in the reference image, before step S304, the method may further include:
d1: determining the total number of separator recognition based on the separator recognition results in the recognition results corresponding to the reference images; the number of separator images inserted when image stitching processing is performed based on n text images is determined as the total number of inserted separators.
D2: and under the condition that the total number of the separator recognition is different from the total number of the separator insertion, respectively performing text recognition processing on the n text images through a text recognition model to obtain text recognition results corresponding to the n text images.
The total number of separator recognitions means the number of separator images recognized by the text recognition model in the recognition result.
The total number of inserted separators means the number of separator images inserted when the n text images were stitched.
It should be understood that, when the total number of separator recognitions is inconsistent with the number of separators actually inserted, the text recognition model has missed or over-recognized separators. In that case, to guarantee the precision of text recognition, the text images are no longer stitched but are instead input into the text recognition model separately for recognition.
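The consistency check of steps D1 and D2 can be sketched as a simple count comparison that decides whether to fall back to per-image recognition. The `"<sep>"` token and function name are assumptions.

```python
def needs_fallback(recognized, n_inserted, sep_token="<sep>"):
    """True when the number of separators recognized in the stitched
    result differs from the number of separator images inserted during
    stitching; the caller then recognizes the n text images separately."""
    return recognized.count(sep_token) != n_inserted

# Two text images stitched with one separator:
print(needs_fallback("hello<sep>world", 1))  # counts match, keep stitched result
print(needs_fallback("helloworld", 1))       # separator missed, fall back
```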
Based on the text recognition method provided by the above embodiment, the text recognition model provided by the embodiment of the present application is pre-trained, and as a possible implementation manner, the text recognition model is trained by the following manner:
e1: acquiring a training sample set; the training sample set comprises a plurality of training samples, and each training sample comprises a training text image and a corresponding labeling result.
E2: performing image stitching processing based on training text images respectively included by a plurality of training samples in the training sample set to obtain training reference images; the size of the training reference image is smaller than or equal to the input image size of the text recognition model, and a separator image is inserted between two adjacent training text images in the training reference image.
In order to improve the recognition accuracy of the text recognition model, different stitching manners may be specifically adopted to obtain the training reference image, and in a specific possible implementation manner, step E2 may include:
sorting the plurality of training text images in order of the image width from small to large or from large to small based on the image width of the training text image included in each of the plurality of training samples;
and performing image stitching processing based on the training text images according to the sequencing results of the training text images to obtain training reference images.
It should be understood that a training reference image obtained by width sorting and splicing puts text images of similar widths, which tend to share similar text characteristics, together. Inputting such training reference images into the text recognition model lets it learn the contextual relationship among the texts and perform more consistent reasoning, improving the recognition accuracy of similar-shaped characters.
In another possible implementation manner, step E2 may specifically include:
randomly adjusting the arrangement sequence of a plurality of training samples;
and performing image stitching processing on the training text images respectively included in the training samples after the arrangement sequence adjustment to obtain training reference images.
It can be understood that training reference images obtained by random splicing have more combination forms and randomness, which can improve the generalization ability of the text recognition model, enabling it to accurately recognize reference images obtained by different splicing modes; this improves text recognition precision and also lets separator images be recognized better, improving the recognition precision of separator images.
In another possible implementation manner, the process of acquiring the training reference image may be:
f1: acquiring a training detection image set; the training detection image set comprises a plurality of training detection images; each training test image comprises a plurality of training text images;
f2: determining a scene tag of each training detection image;
f3: determining a scene splicing mode in the corresponding training detection image according to the scene label;
f4: and performing image stitching on a plurality of training text images in the corresponding training detection images according to the scene stitching mode to obtain training reference images.
The scene tag means the application scene or usage scene corresponding to the training detection image. For example, if the text image is a signature, its scene tag can be a signature tag, associated with handwritten signatures; if the text image is an invoice, its scene tag can be an invoice tag, associated with numbers.
It is understood that different scene splicing modes are selected according to scene labels of training detection images to splice a plurality of text images, so that a text recognition model can learn the context relation among texts better based on the scene, consistency of text reasoning is improved, precision of text recognition is further improved, and generalization capability of the text recognition model is improved.
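The scene-tag splicing of steps F2 through F4 can be sketched as grouping training text images by their scene tag before stitching each group. A minimal sketch; the tag names and function name are assumptions.

```python
def group_by_scene(samples):
    """Group training text images by scene tag (steps F2-F3), so that
    images sharing a scene, e.g. signatures or invoices, are stitched
    together in step F4."""
    groups = {}
    for image, tag in samples:
        groups.setdefault(tag, []).append(image)
    return groups

samples = [("img1", "signature"), ("img2", "invoice"), ("img3", "signature")]
print(group_by_scene(samples))
```

Each resulting group would then be passed to the same width-sorting or packing step used for the other stitching modes.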
Referring to fig. 6, a schematic diagram of a text recognition model training process based on different stitching modes according to an embodiment of the present application is shown.
In combination with the method shown in fig. 6: the width-sorting splicing mode splices text images of similar widths, so that inputting the training reference images into the text recognition model lets it learn the contextual relationship among texts and perform more consistent reasoning, improving recognition accuracy for similar-shaped characters; the random splicing mode splices text images randomly, giving more combination forms and randomness, which improves the distinction between different text images during splicing, improves text recognition precision, and lets separator images be recognized better, improving the recognition precision of separator images; the scene-tag splicing mode splices text images sharing the same scene tag, so the text recognition model learns the contextual relationship among texts better, improving the consistency of text reasoning, the precision of text recognition and the generalization ability of the text recognition model.
E3: performing text recognition processing on the training reference image through a text recognition model to be trained to obtain a recognition result corresponding to the training reference image; the recognition results corresponding to the training reference images comprise text recognition results corresponding to training text images in the training reference images.
It should be appreciated that whichever stitching approach is used, the size of the training reference image needs to be padded to the input image size before it is entered into the text recognition model to ensure consistency of the input.
It should be noted that, in the process of training the text recognition model, either width-sorted stitching or random stitching may be selected at random for each iteration; for example, if the third iteration uses width-sorted stitching, the fourth iteration may use either random stitching or width-sorted stitching. In one implementation, any of width-sorted stitching, random stitching and scene-tag stitching can be selected as the stitching mode for each iteration, which is not limited in detail here.
E4: and training a text recognition model according to the text recognition result corresponding to the training text image in the training reference image and the labeling result corresponding to the training text image in the training reference image.
It should be appreciated that the loss value may be determined based on the difference between the text recognition result and the labeling result; a loss function is then constructed from the loss value, and the model parameters of the text recognition model are adjusted based on the loss function. These operations are performed iteratively on different training samples until the trained text recognition model meets the training end condition, for example until the number of training iterations reaches a preset number, or the model performance of the text recognition model reaches a preset performance requirement. In one possible implementation manner, during training of the text recognition model, the input image size can be set to 32×512, and 64 training reference images can be used for one training iteration.
Based on the text recognition method provided by the embodiment, the embodiment of the application also provides a text recognition device, and referring to fig. 7, the diagram is a schematic structural diagram of the text recognition device provided by the embodiment of the application. Referring to fig. 7, a text recognition device 700 provided in an embodiment of the present application may specifically include:
An image capturing module 701, configured to capture, from an image to be detected, text images corresponding to n text regions respectively, based on the n text regions in the image to be detected; n is an integer greater than 1;
the image stitching module 702 is configured to perform image stitching processing based on the n text images, so as to obtain at least one reference image; the size of the reference image is smaller than or equal to the size of the input image of the text recognition model, and if the reference image is spliced by a plurality of text images, a separator image is inserted between two adjacent text images in the reference image;
a text recognition module 703, configured to perform text recognition processing on each reference image through the text recognition model, so as to obtain a recognition result corresponding to each reference image; the recognition results corresponding to the reference images comprise text recognition results corresponding to the text images in the reference images;
the result determining module 704 is configured to determine a target text recognition result corresponding to the image to be detected according to the text recognition results and the text regions corresponding to the n text images.
As one example, the image stitching module 702 includes:
the sorting unit is used for sorting the n text images according to the order of the image width from small to large or from large to small based on the image width of each of the n text images;
And the image stitching unit is used for performing image stitching processing on the basis of the n text images according to the sequencing result of the n text images to obtain at least one reference image.
As an example, the text recognition module 703 includes:
a target reference image determining unit, configured to, for each reference image, perform a filling process on the reference image if the size of the reference image is smaller than the size of the input image of the text recognition model, to obtain a target reference image having a size equal to the size of the input image; if the size of the reference image is equal to the size of the input image of the text recognition model, taking the reference image as a target reference image;
and the text recognition unit is used for carrying out text recognition processing on each target reference image through the text recognition model to obtain a recognition result corresponding to each target reference image.
As an example, a text recognition unit includes:
the feature determination subunit is used for performing downsampling processing on the target reference image aiming at each target reference image to obtain image features of the target reference image, determining separation embedded features and position embedded features corresponding to the image features, wherein the separation embedded features are used for representing whether feature bits in the image features correspond to separator images or not; determining input features of the target reference image according to the image features, the corresponding separation embedding features and the position embedding features;
And the text recognition subunit is used for determining a recognition result corresponding to the target reference image according to the input characteristics of the target reference image through a text recognition model based on a self-attention mechanism.
As an example, the recognition results corresponding to the reference image further include separator recognition results corresponding to the separator image in the reference image, and the apparatus 700 further includes, before the result determining module 704:
the total number determining module is used for determining the total number of the separator recognition based on the separator recognition results in the recognition results corresponding to the reference images; determining the number of the inserted separator images when the image stitching processing is performed based on the n text images as the total number of the inserted separators;
the abnormal processing module is used for respectively carrying out text recognition processing on the n text images through the text recognition model under the condition that the total number of the separator recognition is different from the total number of the separator insertion, so as to obtain text recognition results corresponding to the n text images.
As one example, a text recognition model is trained by:
the acquisition module is used for acquiring a training sample set; the training sample set comprises a plurality of training samples, and each training sample comprises a training text image and a corresponding labeling result;
The training reference image determining module is used for performing image stitching processing based on training text images respectively included by a plurality of training samples in the training sample set to obtain training reference images; the size of the training reference image is smaller than or equal to the input image size of the text recognition model, and a separator image is inserted between two adjacent training text images in the training reference image;
the text recognition result generation module is used for carrying out text recognition processing on the training reference image through the text recognition model to be trained to obtain a recognition result corresponding to the training reference image; the recognition results corresponding to the training reference images comprise text recognition results corresponding to training text images in the training reference images;
the training module is used for training the text recognition model according to the text recognition result corresponding to the training text image in the training reference image and the labeling result corresponding to the training text image in the training reference image.
As an example, the training reference image determination module is specifically configured to:
sorting the plurality of training text images in order of the image width from small to large or from large to small based on the image width of the training text image included in each of the plurality of training samples;
According to the sequencing result of the training text images, performing image stitching processing based on the training text images to obtain training reference images;
or alternatively, the process may be performed,
randomly adjusting the arrangement sequence of a plurality of training samples;
and performing image stitching processing on the training text images respectively included in the training samples after the arrangement sequence adjustment to obtain training reference images.
The text recognition device provided by the embodiment of the application has the same beneficial effects as the text recognition method provided by the embodiment, so that the description is omitted.
The embodiment of the application also provides a computer device, which can be a terminal device or a server, and the terminal device and the server provided by the embodiment of the application are introduced from the aspect of hardware materialization.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 8, for convenience of explanation, only the portions related to the embodiments of the present application are shown; for specific technical details not disclosed, please refer to the method portions of the embodiments of the present application. The terminal may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point of sales (POS) terminal, a vehicle-mounted computer, and the like; the following takes a computer as an example:
Fig. 8 is a block diagram showing a part of the structure of a computer related to a terminal provided by an embodiment of the present application. Referring to fig. 8, a computer includes: radio Frequency (RF) circuitry 1210, memory 1220, input unit 1230 (including touch panel 1231 and other input devices 1232), display unit 1240 (including display panel 1241), sensors 1250, audio circuitry 1260 (which may connect speaker 1261 and microphone 1262), wireless fidelity (wireless fidelity, wiFi) module 1270, processor 1280, and power supply 1290. Those skilled in the art will appreciate that the computer architecture shown in fig. 8 is not limiting and that more or fewer components than shown may be included, or that certain components may be combined, or that different arrangements of components may be utilized.
Memory 1220 may be used to store software programs and modules, and processor 1280 may execute the various functional applications and data processing of the computer by executing the software programs and modules stored in memory 1220. The memory 1220 may mainly include a storage program area that may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and a storage data area; the storage data area may store data created according to the use of the computer (such as audio data, phonebooks, etc.), and the like. In addition, memory 1220 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
Processor 1280 is a control center of the computer and connects various parts of the entire computer using various interfaces and lines, performing various functions of the computer and processing data by running or executing software programs and/or modules stored in memory 1220, and invoking data stored in memory 1220. In the alternative, processor 1280 may include one or more processing units; preferably, the processor 1280 may integrate an application processor and a modem processor, wherein the application processor primarily handles operating systems, user interfaces, application programs, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1280.
In an embodiment of the present application, the processor 1280 included in the terminal further has the following functions:
based on n text regions in an image to be detected, cropping text images corresponding to the n text regions from the image to be detected, n being an integer greater than 1;
performing image stitching processing based on the n text images to obtain at least one reference image; the size of each reference image is smaller than or equal to an input image size of a text recognition model, and if a reference image is stitched from a plurality of text images, a separator image is inserted between every two adjacent text images in the reference image;
performing text recognition processing on each reference image through the text recognition model to obtain a recognition result corresponding to each reference image; the recognition result corresponding to a reference image comprises a text recognition result corresponding to each text image in the reference image;
and determining a target text recognition result corresponding to the image to be detected according to the text recognition results and the text regions corresponding to the n text images.
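The stitching step above can be sketched as a greedy packing routine. This is a minimal illustration, not the patented implementation: the separator width, model input width, and line height are assumed values, and `stitch_text_images` is a hypothetical helper name.

```python
import numpy as np

SEP_WIDTH = 8        # width of the separator image, an assumed value
MODEL_INPUT_W = 512  # fixed input width of the recognition model, assumed
LINE_H = 32          # common height the text crops are resized to, assumed

def stitch_text_images(text_images):
    """Greedily pack equal-height text images (H x W x C uint8 arrays)
    into reference images no wider than MODEL_INPUT_W, inserting a
    separator image between every two adjacent text images."""
    separator = np.zeros((LINE_H, SEP_WIDTH, 3), dtype=np.uint8)
    reference_images = []
    current, width = [], 0
    for img in text_images:
        extra = img.shape[1] + (SEP_WIDTH if current else 0)
        if current and width + extra > MODEL_INPUT_W:
            # Current reference image is full; start a new one.
            reference_images.append(np.concatenate(current, axis=1))
            current, width = [], 0
            extra = img.shape[1]
        if current:
            current.append(separator)
        current.append(img)
        width += extra
    if current:
        reference_images.append(np.concatenate(current, axis=1))
    return reference_images
```

Packing several short text lines into one model input amortizes the per-inference cost over multiple text regions, which is the motivation for the stitching step.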
Optionally, the processor 1280 is further configured to perform steps of any implementation of the text recognition method provided by the embodiment of the present application.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a server 1300 according to an embodiment of the present application. The server 1300 may vary considerably in configuration or performance and may include one or more central processing units (CPUs) 1322 (e.g., one or more processors), memory 1332, and one or more storage media 1330 (e.g., one or more mass storage devices) storing applications 1342 or data 1344. The memory 1332 and storage medium 1330 may be transitory or persistent. The program stored on the storage medium 1330 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 1322 may be configured to communicate with the storage medium 1330 and execute, on the server 1300, the series of instruction operations in the storage medium 1330.
The server 1300 may also include one or more power supplies 1326, one or more wired or wireless network interfaces 1350, one or more input/output interfaces 1358, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 9.
Wherein CPU 1322 is configured to perform the following steps:
based on n text regions in an image to be detected, cropping text images corresponding to the n text regions from the image to be detected, n being an integer greater than 1;
performing image stitching processing based on the n text images to obtain at least one reference image; the size of each reference image is smaller than or equal to an input image size of a text recognition model, and if a reference image is stitched from a plurality of text images, a separator image is inserted between every two adjacent text images in the reference image;
performing text recognition processing on each reference image through the text recognition model to obtain a recognition result corresponding to each reference image; the recognition result corresponding to a reference image comprises a text recognition result corresponding to each text image in the reference image;
and determining a target text recognition result corresponding to the image to be detected according to the text recognition results and the text regions corresponding to the n text images.
Optionally, CPU 1322 may also be configured to perform the steps of any one implementation of the text recognition method provided by embodiments of the present application.
The embodiments of the present application also provide a computer-readable storage medium storing a computer program for executing any one of the text recognition methods described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform any one of the text recognition methods described in the foregoing respective embodiments.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing a computer program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A exists, only B exists, and both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following items" or similar expressions refers to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be singular or plural.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method of text recognition, the method comprising:
based on n text regions in an image to be detected, cropping text images corresponding to the n text regions from the image to be detected, n being an integer greater than 1;
performing image stitching processing based on the n text images to obtain at least one reference image; the size of each reference image is smaller than or equal to an input image size of a text recognition model, and if a reference image is stitched from a plurality of text images, a separator image is inserted between every two adjacent text images in the reference image;
performing text recognition processing on each reference image through the text recognition model to obtain a recognition result corresponding to each reference image; the recognition result corresponding to a reference image comprises a text recognition result corresponding to each text image in the reference image;
and determining a target text recognition result corresponding to the image to be detected according to the text recognition results and the text regions corresponding to the n text images.
2. The method according to claim 1, wherein the performing image stitching based on the n text images to obtain at least one reference image includes:
sorting the n text images in ascending or descending order of image width based on the respective image widths of the n text images;
and performing image stitching processing based on the n text images according to the sorting result of the n text images to obtain the at least one reference image.
3. The method according to claim 1 or 2, wherein the performing text recognition processing on each reference image through the text recognition model to obtain a recognition result corresponding to each reference image comprises:
for each reference image, if the size of the reference image is smaller than the input image size of the text recognition model, padding the reference image to obtain a target reference image whose size is equal to the input image size; and if the size of the reference image is equal to the input image size of the text recognition model, taking the reference image as the target reference image;
and performing text recognition processing on each target reference image through the text recognition model to obtain a recognition result corresponding to each target reference image.
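The padding step of claim 3 can be sketched as follows. The 512-pixel input width and the zero fill value are assumptions; the claim only requires that the padded size equal the model's input image size, and `pad_to_input_size` is a hypothetical helper name.

```python
import numpy as np

def pad_to_input_size(reference_image, input_w=512, pad_value=0):
    """Right-pad a reference image (H x W x C array) so its width equals
    the model's fixed input width; return it unchanged if it already
    matches."""
    h, w, c = reference_image.shape
    if w >= input_w:
        return reference_image
    padded = np.full((h, input_w, c), pad_value, dtype=reference_image.dtype)
    padded[:, :w] = reference_image  # original content stays left-aligned
    return padded
```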
4. A method according to claim 3, wherein said performing text recognition processing on each of said target reference images by said text recognition model to obtain a recognition result corresponding to each of said target reference images comprises:
performing downsampling processing on each target reference image to obtain image features of the target reference image, and determining separation embedding features and position embedding features corresponding to the image features, wherein the separation embedding features are used for representing whether feature bits in the image features correspond to the separator images; and determining input features of the target reference image according to the image features and the corresponding separation embedding features and position embedding features;
and determining a recognition result corresponding to the target reference image according to the input features of the target reference image through a text recognition model based on a self-attention mechanism.
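Claim 4's feature construction can be sketched as a table lookup plus combination. This is a sketch under stated assumptions: the sequence length and feature dimension are invented values, combining the three feature sets by element-wise addition is an assumption (the claim does not fix the operator), and `build_input_features` is a hypothetical helper name.

```python
import numpy as np

SEQ_LEN, D = 128, 64  # feature-bit count after downsampling and feature
                      # dimension; both are assumed values

def build_input_features(image_features, sep_mask, sep_table, pos_table):
    """Combine downsampled image features with separation and position
    embeddings.

    image_features: (SEQ_LEN, D) visual features of the target reference image
    sep_mask:       (SEQ_LEN,) int array, 1 where a feature bit corresponds
                    to a separator image, 0 otherwise
    sep_table:      (2, D) separation-embedding table
    pos_table:      (SEQ_LEN, D) position-embedding table
    """
    sep_embed = sep_table[sep_mask]  # per-bit lookup: (SEQ_LEN, D)
    return image_features + sep_embed + pos_table
```

The resulting (SEQ_LEN, D) matrix is what a self-attention based recognizer would consume as its input sequence.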
5. The method of claim 1, wherein the recognition result corresponding to the reference image further includes a separator recognition result corresponding to each separator image in the reference image, and wherein, before determining the target text recognition result corresponding to the image to be detected according to the text recognition results and the text regions corresponding to the n text images, the method further comprises:
determining a total number of recognized separators based on the separator recognition results in the recognition results corresponding to the reference images; and determining, as a total number of inserted separators, the number of separator images inserted when the image stitching processing is performed based on the n text images;
and in the case that the total number of recognized separators differs from the total number of inserted separators, performing text recognition processing on the n text images respectively through the text recognition model to obtain the text recognition results corresponding to the n text images.
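Claim 5's consistency check amounts to comparing the recognized separator count with the inserted separator count and falling back to per-image recognition on mismatch. A minimal sketch, with hypothetical names (`merge_or_fallback`, the `"separators"`/`"texts"` result keys, and `recognize_single` are all assumptions):

```python
def merge_or_fallback(recognition_results, inserted_separators,
                      text_images, recognize_single):
    """recognition_results: per reference image, a dict with the
    recognized separator count ('separators') and per-segment texts
    ('texts'). If the recognized separator total differs from the number
    inserted during stitching, the segment-to-text-image alignment is
    unreliable, so each cropped text image is recognized individually."""
    recognized = sum(r["separators"] for r in recognition_results)
    if recognized != inserted_separators:
        # Mismatch: distrust the stitched result, recognize one by one.
        return [recognize_single(img) for img in text_images]
    texts = []
    for r in recognition_results:
        texts.extend(r["texts"])  # segments already align with crops
    return texts
```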
6. The method of claim 1, wherein the text recognition model is trained by:
acquiring a training sample set; the training sample set comprises a plurality of training samples, and each training sample comprises a training text image and a corresponding labeling result;
performing image stitching processing based on training text images respectively included by a plurality of training samples in the training sample set to obtain training reference images; the size of the training reference image is smaller than or equal to the input image size of the text recognition model, and a separator image is inserted between two adjacent training text images in the training reference image;
performing text recognition processing on the training reference image through the text recognition model to be trained to obtain a recognition result corresponding to the training reference image; the recognition result corresponding to the training reference image comprises a text recognition result corresponding to the training text image in the training reference image;
and training the text recognition model according to the text recognition result corresponding to the training text image in the training reference image and the labeling result corresponding to the training text image in the training reference image.
7. The method of claim 6, wherein the performing image stitching based on the training text images included in each of the plurality of training samples in the training sample set to obtain the training reference image includes:
sorting a plurality of training text images in ascending or descending order of image width based on the image widths of the training text images included in each of the plurality of training samples;
and performing image stitching processing based on the plurality of training text images according to the sorting result of the plurality of training text images to obtain the training reference image;
or alternatively,
randomly adjusting an arrangement order of the plurality of training samples;
and performing image stitching processing on the training text images respectively included in the plurality of training samples based on the adjusted arrangement order to obtain the training reference image.
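Claim 7's two ordering strategies can be sketched in one hypothetical helper (`order_training_images`, the `strategy` flag, and the `seed` parameter are all assumptions, not part of the patent):

```python
import random

def order_training_images(images, strategy="sort", seed=0):
    """Order training text images before stitching: sort by width
    (packs similarly sized crops together, wasting less padding) or
    shuffle randomly (yields more varied stitched combinations)."""
    if strategy == "sort":
        return sorted(images, key=lambda img: img.shape[1])
    shuffled = list(images)
    random.Random(seed).shuffle(shuffled)  # deterministic for a fixed seed
    return shuffled
```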
8. A text recognition device, the device comprising:
an image cropping module, configured to crop, based on n text regions in an image to be detected, text images corresponding to the n text regions from the image to be detected, n being an integer greater than 1;
an image stitching module, configured to perform image stitching processing based on the n text images to obtain at least one reference image; the size of each reference image is smaller than or equal to an input image size of a text recognition model, and if a reference image is stitched from a plurality of text images, a separator image is inserted between every two adjacent text images in the reference image;
a text recognition module, configured to perform text recognition processing on each reference image through the text recognition model to obtain a recognition result corresponding to each reference image; the recognition result corresponding to a reference image comprises a text recognition result corresponding to each text image in the reference image;
and a result determination module, configured to determine a target text recognition result corresponding to the image to be detected according to the text recognition results and the text regions corresponding to the n text images.
9. A computer device, the computer device comprising a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to perform the text recognition method of any one of claims 1 to 7 according to the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is for storing a computer program for executing the text recognition method of any one of claims 1 to 7.
CN202310814591.5A 2023-07-04 2023-07-04 Text recognition method and related device Pending CN116958992A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310814591.5A CN116958992A (en) 2023-07-04 2023-07-04 Text recognition method and related device


Publications (1)

Publication Number Publication Date
CN116958992A

Family

ID=88452137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310814591.5A Pending CN116958992A (en) 2023-07-04 2023-07-04 Text recognition method and related device

Country Status (1)

Country Link
CN (1) CN116958992A (en)


Legal Events

Date Code Title Description
PB01 Publication