CN111242114B - Character recognition method and device


Info

Publication number
CN111242114B
Authority
CN
China
Prior art keywords
image
character
style
network
loss function
Prior art date
Legal status
Active
Application number
CN202010019533.XA
Other languages
Chinese (zh)
Other versions
CN111242114A (en)
Inventor
薛文元
黄珊
李清勇
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010019533.XA
Publication of CN111242114A
Application granted
Publication of CN111242114B
Legal status: Active
Anticipated expiration

Classifications

    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06F18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V30/10 Character recognition


Abstract

The application provides a character recognition method and device. The method includes: acquiring an image to be recognized; performing style conversion on at least one of the characters and the background in the image to be recognized to obtain a converted image, where, if both the characters and the background are style-converted, the style of the converted characters differs from the style of the converted background; and performing character recognition on the characters in the converted image to obtain a character recognition result. In this scheme, before the characters in the image to be recognized are recognized, at least one of the characters and the background is style-converted. Because the character style in the converted image differs from the background style, and the converted image is simpler, more regular and more uniform than the image to be recognized, the characters in the image can be distinguished more accurately, which improves the character recognition result.

Description

Character recognition method and device
Technical Field
The application relates to the technical field of computers, in particular to a character recognition method and device.
Background
Text is important information in images and videos. If it can be extracted, many valuable applications become possible, such as photo translation on mobile phones, intelligent navigation, guidance for the blind, and content-based retrieval, bringing great convenience to people's work and life. Scene-based text detection and recognition is therefore currently a research hotspot in computer vision and artificial intelligence.
In the prior art, character recognition results are generally poor for images to be recognized with relatively complex styles, for example when the colors of the characters and the background are close, or when the image has been deformed or distorted during shooting.
Disclosure of Invention
The embodiment of the application mainly aims to provide a character recognition method, a character recognition device, an electronic device and a computer readable storage medium.
In a first aspect, an embodiment of the present application provides a text recognition method, where the method includes:
acquiring an image to be recognized;
performing style conversion on at least one of characters and a background in the image to be recognized based on the image to be recognized to obtain a converted image, wherein if the characters and the background in the image to be recognized are subjected to style conversion, the style corresponding to the converted characters is different from the style corresponding to the background;
and performing character recognition on the converted image to obtain a character recognition result.
In an optional embodiment of the first aspect, performing style conversion on at least one of a text and a background in the image to be recognized to obtain a converted image includes:
carrying out first image style conversion on characters of an image to be recognized, and carrying out second image style conversion on a background of the image to be recognized to obtain a converted image;
the first image style is a black font, and the second image style is a white background.
In an optional embodiment of the first aspect, performing text recognition on the text in the converted image to obtain a text recognition result, includes:
extracting image features of the converted image;
and based on the image characteristics, a character recognition result is obtained by adopting a recurrent neural network.
In an optional embodiment of the first aspect, the style conversion of the image to be recognized and the character recognition of the characters in the converted image are performed by a character recognition model;
the character recognition model is obtained by training based on the following modes:
acquiring training sample pairs, wherein each training sample pair comprises a first sample image and a second sample image, the second sample image is an image after style conversion corresponding to the first sample image, the first sample image carries a character label, and the character label represents a character labeling result in the first sample image; the first sample image corresponds to a third image style, and the second sample image corresponds to a fourth image style;
training the initial neural network model based on the first sample image until the loss function of the initial neural network model converges, and taking the initial neural network model at the end of training as a character recognition model;
the initial neural network model comprises a first style conversion network and a character recognition network which are connected in series, wherein the first style conversion network is used for converting an input image into an image of a fourth image style; the input of the first style conversion network comprises a first sample image, the output comprises a first image, the input of the character recognition network comprises a first image, and the output comprises a character recognition result of the first image;
the loss function comprises an image loss function and a text recognition loss function, the image loss function comprises a loss function representing the difference between the second sample image and the corresponding first image, and the text recognition loss function comprises a loss function representing the difference between the character labeling result in the first sample image and the character recognition result in the corresponding first image;
the character recognition model comprises a first style conversion network and a character recognition network which are cascaded when training is finished.
In an optional embodiment of the first aspect, the input of the text recognition network further comprises at least one of a second sample image or a first sample image;
if the input of the character recognition network comprises a second sample image, the text recognition loss function also comprises a loss function representing the difference between the character labeling result in the first sample image and the character recognition result of the corresponding second sample image;
if the input to the word recognition network comprises a first sample image, the text recognition loss function further comprises a loss function characterizing a difference between the word annotation result in the first sample image and the word recognition result of the first sample image.
In an optional embodiment of the first aspect, the initial neural network model further comprises a second style conversion network for converting the input image into an image of a third image style, the input of the second style conversion network comprising a second sample image, the output comprising a second image;
the image loss functions further include a loss function characterizing a difference between the first sample image and the second image.
In an optional embodiment of the first aspect, the input of the first style conversion network further comprises a second sample image, the output further comprises a third image, the loss function further comprises an invariance loss function, the invariance loss function comprising a loss function characterizing a difference between the second sample image and the third image;
and/or,
the input of the second style conversion network further comprises a first sample image, and the output further comprises a fourth image; the invariance loss function includes a loss function characterizing a difference between the first sample image and the fourth image.
In an optional embodiment of the first aspect, the input of the text recognition network further comprises a second image, and the output further comprises a text recognition result of the second image;
the text recognition loss function further includes a loss function characterizing a difference between the text annotation result of the first image and the corresponding text recognition result of the second image.
In an optional embodiment of the first aspect, the character recognition model further includes at least one of a first discrimination network and a second discrimination network, where the input of the first discrimination network is an image of the third image style, the output of the first discrimination network is information characterizing whether the input image is the first sample image or an image generated by the first style conversion network, the input of the second discrimination network is an image of the fourth image style, and the output of the second discrimination network is information characterizing whether the input image is the second sample image or an image generated by the second style conversion network;
the input of the first discrimination network comprises the first sample image and the second image, and the input of the second discrimination network comprises the second sample image and the first image;
the loss functions also include discriminant loss functions that characterize style discriminant losses for the discriminant network.
In an optional embodiment of the first aspect, the input of the second style conversion network further comprises the first image, the output further comprises a fifth image, and the loss function further comprises a cycle-consistency loss function, the cycle-consistency loss function comprising a loss function characterizing a difference between the first and fifth images;
and/or,
the input of the first style conversion network further comprises the second image, the output further comprises a sixth image, and the cycle-consistency loss function comprises a loss function characterizing a difference between the second sample image and the sixth image.
In an optional embodiment of the first aspect, the input of the text recognition network further includes the fifth image, the output further includes a text recognition result of the fifth image, and the text recognition loss function further includes a loss function representing a difference between the text annotation result of the first sample image and the text recognition result of the corresponding fifth image;
and/or,
the input of the text recognition network further includes the sixth image, the output further includes a text recognition result of the sixth image, and the text recognition loss function further includes a loss function representing a difference between the text annotation result of the first sample image and the text recognition result of the corresponding sixth image.
In an optional embodiment of the first aspect, if the text in the training sample pair is vowel-diacritic (abugida) type text, in which each text unit consists of at least one character, the text label of the first sample image is determined by:
acquiring the first sample image and the character labels of the first sample image, wherein each character label represents one character of the text to be recognized in the first sample image;
and generating the text label based on the character labels according to the writing rules of the text in the first sample image.
In an optional embodiment of the first aspect, the vowel-diacritic (abugida) type text comprises at least one of Tibetan or Thai.
In an optional embodiment of the first aspect, the method further comprises:
and performing style conversion on at least one of the characters or the background in the first sample image to obtain a second sample image in a fourth image style.
In a second aspect, the present application provides a text recognition apparatus, comprising:
the image acquisition module is used for acquiring an image to be recognized;
the style conversion module is used for carrying out style conversion on at least one item of characters or backgrounds in the image to be recognized based on the image to be recognized to obtain a converted image;
if the style of the characters and the background in the image to be recognized is converted, the style corresponding to the converted characters is different from the style corresponding to the background;
and the character recognition module is used for recognizing characters in the converted image to obtain a character recognition result.
In an optional embodiment of the second aspect, the style conversion module is specifically configured to, when performing style conversion on at least one of a text and a background in the image to be recognized to obtain a converted image:
carrying out first image style conversion on characters of an image to be recognized, and carrying out second image style conversion on a background of the image to be recognized to obtain a converted image;
the first image style is a black font, and the second image style is a white background.
In an optional embodiment of the second aspect, when the character recognition module performs character recognition on characters in the converted image to obtain a character recognition result, the character recognition module is specifically configured to:
extracting image features of the converted image;
and based on the image characteristics, a character recognition result is obtained by adopting a recurrent neural network.
In an optional embodiment of the second aspect, the style conversion of the image to be recognized and the character recognition of the characters in the converted image are performed through a character recognition model;
the character recognition model is obtained by training based on the following modes:
acquiring training sample pairs, wherein each training sample pair comprises a first sample image and a second sample image, the second sample image is an image after style conversion corresponding to the first sample image, the first sample image carries a character label, and the character label represents a character labeling result in the first sample image; the first sample image corresponds to a third image style, and the second sample image corresponds to a fourth image style;
training the initial neural network model based on the first sample image until the loss function of the initial neural network model converges, and taking the initial neural network model at the end of training as a character recognition model;
the initial neural network model comprises a first style conversion network and a character recognition network which are connected in series, wherein the first style conversion network is used for converting an input image into an image of a fourth image style; the input of the first style conversion network comprises a first sample image, the output comprises a first image, the input of the character recognition network comprises a first image, and the output comprises a character recognition result of the first image;
the loss function comprises an image loss function and a text recognition loss function, the image loss function comprises a loss function representing the difference between the second sample image and the corresponding first image, and the text recognition loss function comprises a loss function representing the difference between the character labeling result in the first sample image and the character recognition result in the corresponding first image;
the character recognition model comprises a first style conversion network and a character recognition network which are cascaded after training is finished.
In an optional embodiment of the second aspect, the input to the text recognition network further comprises at least one of a second sample image or a first sample image;
if the input of the character recognition network comprises a second sample image, the text recognition loss function also comprises a loss function representing the difference between the character labeling result in the first sample image and the character recognition result of the corresponding second sample image;
if the input to the word recognition network comprises a first sample image, the text recognition loss function further comprises a loss function characterizing a difference between the word annotation result in the first sample image and the word recognition result of the first sample image.
In an optional embodiment of the second aspect, the initial neural network model further comprises a second style conversion network for converting the input image into an image of a third image style, the input of the second style conversion network comprising a second sample image and the output comprising a second image;
the image loss functions further include a loss function characterizing a difference between the first sample image and the second image.
In an alternative embodiment of the second aspect, the input of the first style conversion network further comprises a second sample image, the output further comprises a third image, and the loss function further comprises an invariance loss function comprising a loss function characterizing a difference between the second sample image and the third image;
and/or,
the input of the second style conversion network further comprises a first sample image, and the output further comprises a fourth image; the invariance loss function includes a loss function characterizing a difference between the first sample image and the fourth image.
In an optional embodiment of the second aspect, the input of the text recognition network further comprises a second image, and the output further comprises a text recognition result of the second image;
the text recognition loss function further includes a loss function characterizing a difference between the text annotation result of the first image and the corresponding text recognition result of the second image.
In an optional embodiment of the second aspect, the character recognition model further includes at least one of a first discrimination network and a second discrimination network, an input of the first discrimination network is an image of a third image style, an output of the first discrimination network is information for characterizing that the input image is a first sample image or an image generated by the first style conversion network, an input of the second discrimination network is an image of a fourth image style, and an output of the second discrimination network is information for characterizing that the input image is a second sample image or an image generated by the second style conversion network;
the input of the first discrimination network comprises the first sample image and the second image, and the input of the second discrimination network comprises the second sample image and the first image;
the loss functions further include discriminant loss functions that characterize a style discriminant loss for the discriminated network.
In an optional embodiment of the second aspect, the input of the second style conversion network further comprises the first image, the output further comprises a fifth image, and the loss function further comprises a cycle-consistency loss function, the cycle-consistency loss function comprising a loss function characterizing a difference between the first and fifth images;
and/or,
the input of the first style conversion network further comprises the second image, the output further comprises a sixth image, and the cycle-consistency loss function comprises a loss function characterizing a difference between the second sample image and the sixth image.
In an optional embodiment of the second aspect, the input of the text recognition network further includes the fifth image, the output further includes a text recognition result of the fifth image, and the text recognition loss function further includes a loss function representing a difference between the text annotation result of the first sample image and the text recognition result of the corresponding fifth image;
and/or,
the input of the text recognition network further includes the sixth image, the output further includes a text recognition result of the sixth image, and the text recognition loss function further includes a loss function representing a difference between the text annotation result of the first sample image and the text recognition result of the corresponding sixth image.
In an optional embodiment of the second aspect, if the text in the training sample pair is vowel-diacritic (abugida) type text, in which each text unit consists of at least one character, the text label of the first sample image is determined by:
acquiring the first sample image and the character labels of the first sample image, wherein each character label represents one character of the text to be recognized in the first sample image;
and generating the text label based on the character labels according to the writing rules of the text in the first sample image.
In an optional embodiment of the second aspect, the vowel-diacritic (abugida) type text includes at least one of Tibetan or Thai.
In an optional embodiment of the second aspect, the apparatus further comprises:
and the second sample image determining module is used for performing style conversion on at least one of characters and backgrounds in the first sample image to obtain a second sample image in a fourth image style.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor and a memory; the memory has stored therein readable instructions which, when loaded and executed by the processor, implement the method as shown in the first aspect or any one of the alternative embodiments of the first aspect described above.
In a fourth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is loaded by a processor and executed, the method as shown in the first aspect or any optional embodiment of the first aspect is implemented.
The technical scheme provided by this application has the following beneficial effects: before character recognition is performed on the characters in an image to be recognized, style conversion is performed on at least one of the characters and the background of the image. When both the characters and the background are style-converted, the style of the converted characters differs from the style of the converted background, so that the characters and the background in the converted image can be clearly distinguished. Because the character style in the converted image differs from the background style, and the converted image is simpler, more regular and more uniform than the image to be recognized, the characters in the image are easier to distinguish and can be recognized more accurately, yielding a more accurate character recognition result.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a text recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an image before and after a style conversion according to an embodiment of the present application;
FIG. 3 is a diagram of a syllable of Tibetan according to one embodiment of the present application;
fig. 4 is a schematic diagram of a character tag and a text tag corresponding to a Tibetan language according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a network structure of an initial neural network according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a network structure of another initial neural network provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of a character recognition apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to yet another embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject, and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other subjects. It specially studies how a computer can simulate or realize human learning behaviour so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and adversarial learning.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
For better understanding and description of the embodiments of the present application, some technical terms used in the embodiments of the present application will be briefly described below.
Neural Networks (NN): the method is an arithmetic mathematical model simulating animal neural network behavior characteristics and performing distributed parallel information processing. The network achieves the aim of processing information by adjusting the mutual connection relationship among a large number of nodes in the network depending on the complexity of the system.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of a text recognition method provided in the present application. The execution subject of the method may be at least one of a terminal or a server. As shown in fig. 1, the method may include steps S110 to S130:
and step S110, acquiring an image to be identified.
The image to be recognized is an image on which character recognition needs to be performed; it contains characters and a background, the background being the part of the image other than the characters. The image to be recognized may be a color image or a grayscale image, and its specific representation form is not limited in this application. It may be an image containing characters taken by a user, an image containing characters downloaded from the Internet, an image received from another device, or a frame of a video; the source of the image to be recognized is likewise not limited in this application.
Step S120: perform style conversion on at least one of the characters and the background in the image to be recognized, based on the image to be recognized, to obtain a converted image.
For example, the text in the image to be recognized may be converted into text of a first image style, or the background in the image to be recognized may be converted into a background of a second image style. In the example of the present application, the first image style is a black font, and the second image style is a white background, i.e. the image to be recognized can be converted into an image with black characters on white.
The first image style may be a set font, a set font color, or the like. The second image style may be a set background color, a set background of the scene, etc.
If the style of the characters and the background in the image to be recognized is converted, the style corresponding to the converted characters is different from the style corresponding to the background.
If both the characters and the background in the image to be recognized are style-converted, then, in order to distinguish the characters from the background clearly, the style of the characters after conversion is different from the style of the background after conversion. For example, if the characters in the image to be recognized are color-converted and become blue, and the background is also color-converted, the color of the converted background should not be blue but a different color.
As an example, fig. 2 shows schematic diagrams of images before and after several style conversions. The original scene images shown in fig. 2 are images without style conversion, and the characters in them are Tibetan. The original scene images include backgrounds of various styles; from left to right, the background of the first image contains multiple colors and its characters are yellow, the background of the second image is light green and its characters are dark green, the background of the third image is white and black and its characters are black, and the background of the fourth image is green and its characters are yellow.
The synthesized images shown in fig. 2 are the style-converted images. In this example, the background of each synthesized image is white, the characters are black and all rendered in the same font, for example the Noto Sans Tibetan Bold font, and all synthesized images have the same height.
After the original scene image is style-converted, the styles of the background and the characters in the converted image are simpler, more regular and more uniform than before the conversion, and the conversion effectively increases the difference between the background and the character regions of the image. When characters are then recognized based on the converted image, they are easier to identify, which improves character recognition accuracy.
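For concreteness, the following is a minimal sketch of how such a target-style sample ("black text on a white background", fixed height) could be rendered with Pillow. The patent does not specify how the synthesized images are produced; Pillow, the font path, and the sizing heuristics below are illustrative assumptions only.

```python
from PIL import Image, ImageDraw, ImageFont

def synthesize_target_image(text, font_path, height=32):
    """Render `text` as black glyphs on a white background at a fixed height.

    A minimal sketch of the 'converted' style described above; the font file
    and sizing heuristics are assumptions, not values given in the patent.
    """
    font = ImageFont.truetype(font_path, size=height - 4)
    # Measure the rendered text to size the canvas.
    left, top, right, bottom = font.getbbox(text)
    img = Image.new("L", (right - left + 8, height), color=255)  # white background
    draw = ImageDraw.Draw(img)
    # Roughly center the text vertically; fill=0 draws black glyphs.
    draw.text((4 - left, (height - (bottom - top)) // 2 - top), text, fill=0, font=font)
    return img

# Hypothetical usage: render a Tibetan string with a Noto Sans Tibetan font file.
# img = synthesize_target_image("བཀྲ་ཤིས་", "NotoSansTibetan-Bold.ttf")
```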
Step S130: perform character recognition on the characters in the converted image to obtain a character recognition result.
The specific meaning of the characters can be recognized through OCR (Optical Character Recognition), and the character recognition can also be performed with a neural network model.
According to the scheme of the embodiment of the application, before character recognition is performed on the characters in the image to be recognized, style conversion is performed on at least one of the characters and the background. When both the characters and the background are style-converted, the style of the converted characters differs from the style of the converted background, so that the characters and the background in the converted image can be clearly distinguished. When the characters in the converted image are recognized, they are easier to distinguish because their style differs from that of the background, and because the converted image is simpler, more regular and more uniform than the image to be recognized, the characters can be recognized more accurately, giving a more accurate character recognition result.
In an alternative of the application, performing character recognition on the converted image in step S130 to obtain a character recognition result may include:
extracting image features of the converted image;
and based on the image characteristics, a character recognition result is obtained by adopting a recurrent neural network.
In an optional example of the present application, a convolutional neural network may be used to extract image features, and a recurrent neural network may then be used, based on those features, to obtain the text recognition result. Combining a convolutional neural network with a recurrent neural network allows the whole text image to be recognized at once, which avoids the accumulated errors caused in the related art by first segmenting the text character by character and then recognizing it character by character, and therefore improves the character recognition rate; moreover, the deep learning capability of the convolutional neural network effectively improves the overall performance of the system.
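The patent does not give the exact recognition architecture, but the combination it describes (convolutional feature extraction followed by a recurrent network over the feature sequence) is commonly realized as a CRNN. A minimal PyTorch sketch, with layer counts and sizes chosen purely for illustration:

```python
import torch
import torch.nn as nn

class SimpleCRNN(nn.Module):
    """Convolutional feature extractor followed by a recurrent sequence model.

    Layer counts and sizes are illustrative assumptions, not the patent's
    actual configuration.
    """
    def __init__(self, num_classes, img_height=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            # Collapse the remaining height so the width axis becomes a sequence.
            nn.MaxPool2d((img_height // 4, 1), (img_height // 4, 1)),
        )
        self.rnn = nn.LSTM(256, 256, num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):                         # x: (batch, 1, H, W)
        feats = self.cnn(x)                       # (batch, 256, 1, W/4)
        feats = feats.squeeze(2).permute(0, 2, 1) # (batch, W/4, 256)
        seq, _ = self.rnn(feats)                  # (batch, W/4, 512)
        return self.fc(seq)                       # per-timestep class scores
```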
In an alternative scheme of the application, the style conversion of the image to be recognized and the character recognition of the characters in the converted image are performed through a character recognition model;
the character recognition model is obtained by training based on the following modes:
acquiring training sample pairs, wherein each training sample pair comprises a first sample image and a second sample image, the second sample image is an image after style conversion corresponding to the first sample image, the first sample image carries a character label, and the character label represents a character labeling result in the first sample image; the first sample image corresponds to a third image style, and the second sample image corresponds to a fourth image style;
training the initial neural network model based on the first sample image until the loss function of the initial neural network model converges, and taking the initial neural network model after training as a character recognition model;
the initial neural network model comprises a first style conversion network and a character recognition network which are connected in series, wherein the first style conversion network is used for converting an input image into an image of a fourth image style; the input of the first style conversion network comprises a first sample image, the output comprises a first image, the input of the character recognition network comprises a first image, and the output comprises a character recognition result of the first image;
the loss functions comprise image loss functions and text recognition loss functions, the image loss functions comprise loss functions which represent the difference between the second sample images and the corresponding first images, and the text recognition loss functions comprise loss functions which represent the difference between the character labeling results in the first sample images and the character recognition results of the corresponding first images;
the character recognition model comprises a first style conversion network and a character recognition network which are cascaded when training is finished.
The character recognition model is trained in advance, can perform style conversion on an image to be recognized, and performs character recognition on characters in the image after the style conversion to obtain a character recognition result. Each training sample pair comprises a first sample image and a second sample image, wherein the second sample image is an image after style conversion corresponding to the first sample image, the first sample image carries a character label, and the character label represents a character labeling result in the first sample image; the first sample image corresponds to a third image style and the second sample image corresponds to a fourth image style.
The third image style may be the first image style, the second image style, or a combination of the first and second image styles, and the same holds for the fourth image style; however, the third image style and the fourth image style are different image styles.
The text label may be annotated manually, that is, the recognition result of the characters in the image is marked by hand. The text label of a sample image in a training sample pair may also be obtained in other ways; the specific way of obtaining it is not limited in this application. The text label may be a character string, characters, numbers, and so on; the specific representation form of the text label is not limited in this application.
Wherein the first image is an image of a fourth image style.
In the scheme of the application, only the first sample image may be provided with a text label, or both the first sample image and the second sample image may be provided with text labels, and the text labeling results corresponding to the first sample image and the second sample image in each training sample pair are the same.
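A minimal sketch of how the cascaded training described above could be wired together, assuming PyTorch, an L1 image loss between the generated first image and the second sample image, and a CTC loss for the recognition branch. The framework, the concrete loss choices and the weighting factors are assumptions made for illustration, not values fixed by the patent.

```python
import torch
import torch.nn as nn

# style_net: first style conversion network (third style -> fourth style)
# recog_net: text recognition network; both are assumed to be nn.Module instances.
image_loss_fn = nn.L1Loss()
ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)

def training_step(style_net, recog_net, first_sample, second_sample,
                  targets, target_lengths, lambda_img=1.0, lambda_txt=1.0):
    first_image = style_net(first_sample)                 # converted (first) image
    logits = recog_net(first_image)                       # (batch, T, num_classes)
    log_probs = logits.log_softmax(2).permute(1, 0, 2)    # (T, batch, C) layout for CTC
    input_lengths = torch.full((logits.size(0),), logits.size(1), dtype=torch.long)

    # Image loss: generated first image vs. the paired second sample image.
    image_loss = image_loss_fn(first_image, second_sample)
    # Text recognition loss: recognition result of the first image vs. the text label.
    text_loss = ctc_loss_fn(log_probs, targets, input_lengths, target_lengths)
    return lambda_img * image_loss + lambda_txt * text_loss
```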
In an alternative aspect of the present application, the input to the text recognition network further comprises at least one of the second sample image or the first sample image;
if the input of the character recognition network comprises a second sample image, the text recognition loss function also comprises a loss function representing the difference between the character labeling result in the first sample image and the character recognition result of the corresponding second sample image;
if the input to the word recognition network comprises a first sample image, the text recognition loss function further comprises a loss function characterizing a difference between the word annotation result in the first sample image and the word recognition result of the first sample image.
When the initial neural network model is trained, the second sample image or the first sample image can be used, so that the text recognition network obtained through training can accurately recognize characters in the image.
In an alternative of the present application, the initial neural network model further includes a second style conversion network, the second style conversion network is configured to convert the input image into an image of a third image style, an input of the second style conversion network includes a second sample image, and an output includes a second image;
the image loss functions further include a loss function characterizing a difference between the first sample image and the second image.
The second style conversion network can convert an image of the fourth image style into an image of the third image style; in particular, it can convert the second sample image into the second image, which is of the third image style. The smaller the difference between the first sample image and the second image, the closer the image style of the second image is to that of the first sample image, which shows that the second style conversion network can accurately convert an image of the fourth image style into an image of the third image style.
In an alternative of the present application, the input of the first style conversion network further includes a second sample image, the output further includes a third image, the loss function further includes an invariance loss function, and the invariance loss function includes a loss function representing a difference between the second sample image and the third image;
and/or,
the input of the second style conversion network further comprises a first sample image, and the output further comprises a fourth image; the invariance loss function includes a loss function characterizing a difference between the first sample image and the fourth image.
The third image is an image of the fourth image style. The invariance loss function characterizes the difference between the second sample image and the third image; the smaller the difference, the closer the second sample image and the third image are in style, which in turn indicates that the first style conversion network accurately maps images into the fourth image style, leaving an image that is already in the fourth style essentially unchanged.
Similarly, the fourth image is an image of the third image style, and the invariance loss function includes a term characterizing the difference between the first sample image and the fourth image; the smaller the difference, the closer the first sample image and the fourth image are in style, which in turn indicates that the second style conversion network accurately maps images into the third image style, leaving an image that is already in the third style essentially unchanged.
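A short sketch of the invariance losses just described, assuming PyTorch and an L1 distance (the patent does not fix the distance measure): an image that is already in a network's target style should pass through that network essentially unchanged.

```python
import torch.nn as nn

l1 = nn.L1Loss()

def invariance_loss(first_style_net, second_style_net, first_sample, second_sample):
    # third_image: the second sample pushed through the network that targets
    # its own (fourth) style; ideally unchanged.
    third_image = first_style_net(second_sample)
    # fourth_image: the first sample pushed through the network that targets
    # its own (third) style; ideally unchanged.
    fourth_image = second_style_net(first_sample)
    return l1(third_image, second_sample) + l1(fourth_image, first_sample)
```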
In an alternative scheme of the application, the input of the character recognition network further comprises a second image, and the output further comprises a character recognition result of the second image;
the text recognition penalty functions further include a penalty function characterizing a difference between the word annotation result for the first image and the corresponding word recognition result for the second image.
The input of the character recognition network may also include the second image. Training with the second image and its character recognition result enables the trained character recognition network to recognize characters in images more accurately, improving the precision of the character recognition network.
In an alternative of the present application, the character recognition model further includes at least one of a first discrimination network and a second discrimination network, where the input of the first discrimination network is an image of the third image style, the output of the first discrimination network is information characterizing whether the input image is the first sample image or an image generated by the first style conversion network, the input of the second discrimination network is an image of the fourth image style, and the output of the second discrimination network is information characterizing whether the input image is the second sample image or an image generated by the second style conversion network;
the input of the first discrimination network comprises the first sample image and the second image, and the input of the second discrimination network comprises the second sample image and the first image;
the loss functions also include discriminant loss functions that characterize style discriminant losses for the discriminant network.
The first discrimination network is used for discriminating whether the image input to it is the first sample image or an image generated by the first style conversion network. The second discrimination network is used for discriminating whether the image input to it is the second sample image or an image generated by the second style conversion network.
The smaller the style discrimination loss, the more accurately the corresponding discrimination network can tell which images were generated by a style conversion network and which are sample images.
The style discrimination loss reflects how accurately the discrimination network judges the image style; the smaller the loss, the higher the accuracy. The composition of the discrimination loss function differs depending on the input images (one or two). During training, the style discrimination loss may be taken as the negative logarithm of the discrimination probability.
In an alternative of the present application, the discrimination network may specifically be the first discrimination network and/or the second discrimination network, depending on the actual network configuration.
The discrimination networks are used only when training the initial neural network model and are not part of the final character recognition model.
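One way to realize the negative-log-probability style discrimination loss mentioned above, assuming PyTorch and a discriminator that outputs the probability that its input is a real sample rather than a generated image (a standard GAN formulation; the exact form is not fixed by the patent):

```python
import torch

def discriminator_loss(disc, real_images, fake_images, eps=1e-8):
    """Negative log-likelihood loss for one discrimination network.

    `disc` is assumed to output a probability in (0, 1) that its input is a
    real sample image rather than an image produced by a style conversion
    network.
    """
    p_real = disc(real_images)
    p_fake = disc(fake_images.detach())  # do not backpropagate into the generator here
    return -(torch.log(p_real + eps).mean() + torch.log(1.0 - p_fake + eps).mean())
```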
In an alternative of the present application, the input of the second style conversion network further includes the first image, the output further includes a fifth image, and the loss function further includes a cycle-consistency loss function, the cycle-consistency loss function including a loss function characterizing the difference between the first and fifth images;
and/or,
the input of the first style conversion network further comprises the second image, the output further comprises a sixth image, and the cycle-consistency loss function comprises a loss function characterizing the difference between the second sample image and the sixth image.
The fifth image is an image of the third image style. The cycle-consistency loss function characterizes the difference between the first image and the fifth image; the smaller this loss, the smaller the difference and the closer the styles, which shows that the first style conversion network can accurately convert an image of the third image style into an image of the fourth image style and that the second style conversion network can accurately convert an image of the fourth image style into an image of the third image style.
Similarly, the sixth image is an image of the fourth image style, and the cycle-consistency loss function also characterizes the difference between the second sample image and the sixth image; the smaller this loss, the smaller the difference and the closer the styles, which likewise shows that the first style conversion network can accurately convert an image of the third image style into an image of the fourth image style and that the second style conversion network can accurately convert an image of the fourth image style into an image of the third image style.
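A sketch of the cycle-consistency losses, assuming PyTorch, an L1 distance, and the standard CycleGAN-style formulation in which each round-trip reconstruction is compared with the original sample image; where the text above can also be read as comparing the fifth image with the first image, that choice is an assumption made here for illustration.

```python
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(first_style_net, second_style_net, first_sample, second_sample):
    first_image = first_style_net(first_sample)     # third style -> fourth style
    fifth_image = second_style_net(first_image)     # round trip back to third style
    second_image = second_style_net(second_sample)  # fourth style -> third style
    sixth_image = first_style_net(second_image)     # round trip back to fourth style
    # Compare each round-trip reconstruction with the original sample image.
    return l1(fifth_image, first_sample) + l1(sixth_image, second_sample)
```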
In an alternative scheme of the application, the input of the character recognition network further comprises the fifth image, the output further comprises a character recognition result of the fifth image, and the text recognition loss function further comprises a loss function representing the difference between the text annotation result of the first sample image and the character recognition result of the corresponding fifth image;
and/or,
the input of the character recognition network further comprises the sixth image, the output further comprises a character recognition result of the sixth image, and the text recognition loss function further comprises a loss function representing the difference between the text annotation result of the first sample image and the character recognition result of the corresponding sixth image.
When the character recognition network is trained, its input may further include at least one of the fifth image and the sixth image. Correspondingly, the text recognition loss function further includes a loss function representing the difference between the text annotation result of the first sample image and the character recognition result of the corresponding fifth image, or a loss function representing the difference between the text annotation result of the first sample image and the character recognition result of the corresponding sixth image, so that the trained text recognition network achieves higher accuracy.
In an alternative aspect of the present application, if the text in the training sample pair is vowel-diacritic (abugida) type text, in which each text unit consists of at least one character, the text label of the first sample image is determined by:
acquiring the first sample image and the character labels of the first sample image, wherein each character label represents one character of the text to be recognized in the first sample image;
and generating the text label based on the character labels according to the writing rules of the text in the first sample image.
Vowel-diacritic (abugida) type text is a phonographic writing system in which consonant letters form the main body and vowels are marked with additional symbols. In the scheme of the present application, taking the text label of the first sample image as an example, the text label may be determined in the following manner: acquiring the first sample image and the character labels of the first sample image, wherein each character label represents one character of the text to be recognized in the first sample image; then generating a text result based on the character labels according to the writing rules of the text in the first sample image, and generating the text label based on the text result.
A character label is a label obtained by annotating the text character by character: each character corresponds to one character label, and if one text unit consists of two characters, that unit corresponds to two character labels. After the character labels of the first sample image are determined, the writing rules of the text, that is, the order in which the characters are composed into a unit, such as a top-bottom structure or a left-right structure, can be followed. The characters belonging to one text unit are combined into that unit according to these writing rules; in other words, a text label can be generated from the character labels of the individual characters. The text label represents the annotation result of each text unit to be recognized in the sample image, and the writing rules of the corresponding text are reflected in the text label.
The character label may be manually labeled or determined based on other manners, and the determination manner of the character label is not limited in the application.
In an alternative aspect of the present application, the vowel superscript type text includes at least one of Tibetan or Thai.
As an example, fig. 3 shows a schematic diagram of a Tibetan syllable. The syllable is the basic ideographic unit of Tibetan and is built around a base character: a prefixed character, a suffixed character and a re-suffixed character may be arranged before and after the base character in the horizontal direction, while a superscribed character, an upper vowel, a subscribed character and a lower vowel may be stacked above and below it, the superscribed character being a consonant letter.
As can be seen from the above, unlike the Chinese and English writing systems, in which the basic character units are arranged in the horizontal direction, Tibetan writing has a distinct local vertical arrangement. Referring to fig. 4, a schematic diagram of the character labels and text labels corresponding to Tibetan: in a computer system, Tibetan letters are arranged in writing order into a sequence for storage, which is referred to as a character label (also called a character-level label). Based on the character-level labels, the components that are stacked vertically, such as the superscribed character, the upper vowel, the subscribed character and the lower vowel, are combined with the base character into a whole according to the original writing rule, and the result is called a text label (also called a stack label). For the prefixed character, the suffixed character and the re-suffixed character, the labels remain unchanged, because the positions they occupy contain only their own consonant letters in the vertical dimension. Such a recombined label sequence is referred to as a stack-level label.
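As an illustration only, the following Python sketch (not the patent's implementation; the grouping rule and the sample syllable are assumptions for demonstration) regroups a character-level label sequence into a stack-level label sequence by merging every vertically stacked component, detected here as a Unicode combining mark, which covers Tibetan subjoined letters and vowel signs, with its base character:

    import unicodedata

    def chars_to_stacks(char_labels):
        """Group single-character labels into stack-level labels."""
        stacks = []
        for ch in char_labels:
            # Subjoined consonants and vowel signs are combining marks written
            # above or below the previous letter, so they join its stack.
            if stacks and unicodedata.category(ch).startswith("M"):
                stacks[-1] += ch
            else:
                stacks.append(ch)
        return stacks

    # Character-level labels of a hypothetical syllable, letter by letter:
    # BA (U+0F56), SA (U+0F66), subjoined GA (U+0F92), vowel O (U+0F7C), DA (U+0F51)
    chars = ["\u0F56", "\u0F66", "\u0F92", "\u0F7C", "\u0F51"]
    print(chars_to_stacks(chars))  # 3 stacks: prefix, root stack, suffix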
In an alternative aspect of the present application, the method further includes:
and performing style conversion on at least one of the characters or the background in the first sample image to obtain a second sample image in a fourth image style.
The first sample image is an image of a third image style, the second sample image is an image of a fourth image style, and the second sample image may be an image obtained by style converting at least one of a text and a background in the first sample image. The specific conversion method may be the same as the style conversion of at least one of the text and the background in the image to be recognized as described above, and is not described herein again.
In the solution of the present application, the height of the sample image is normalized, for example to 32 pixels, and the width of the sample image is scaled proportionally. Each generation network (G_A, G_B) contains two convolutional layers with a stride of 2, 9 residual blocks and two deconvolutional (transposed convolution) layers. Each discrimination network (D_A, D_B) extracts the image into a feature vector of dimension (1, 2, w) through 5 groups of convolutional layers, where 1 denotes the number of channels, 2 denotes the height of the feature and w denotes its width. Based on this feature vector, image information of the input can be derived, which characterizes whether the input is an image generated by the style conversion network or a sample image.
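For illustration, the following PyTorch sketch mirrors the structure just described: two stride-2 convolutions, 9 residual blocks and two deconvolutions for each generation network, and 5 convolution groups reducing a height-32 input to a (1, 2, w) feature for each discrimination network. The channel widths, kernel sizes, normalization and activation choices are assumptions, since the patent does not specify them:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(True),
                nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

        def forward(self, x):
            return x + self.body(x)

    class Generator(nn.Module):           # G_A or G_B
        def __init__(self, ch=64, n_blocks=9):
            super().__init__()
            layers = [nn.Conv2d(3, ch, 7, padding=3), nn.ReLU(True),
                      nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(True),
                      nn.Conv2d(ch * 2, ch * 4, 3, stride=2, padding=1), nn.ReLU(True)]
            layers += [ResidualBlock(ch * 4) for _ in range(n_blocks)]     # 9 residual blocks
            layers += [nn.ConvTranspose2d(ch * 4, ch * 2, 3, 2, 1, output_padding=1), nn.ReLU(True),
                       nn.ConvTranspose2d(ch * 2, ch, 3, 2, 1, output_padding=1), nn.ReLU(True),
                       nn.Conv2d(ch, 3, 7, padding=3), nn.Tanh()]
            self.net = nn.Sequential(*layers)

        def forward(self, x):
            return self.net(x)

    class Discriminator(nn.Module):       # D_A or D_B
        def __init__(self, ch=64):
            super().__init__()
            chans = [3, ch, ch * 2, ch * 4, ch * 8, 1]
            strides = [(2, 1), (2, 1), (2, 1), (2, 1), (1, 1)]   # 5 groups, height 32 -> 2
            layers = []
            for i in range(5):
                layers.append(nn.Conv2d(chans[i], chans[i + 1], 3, stride=strides[i], padding=1))
                if i < 4:
                    layers.append(nn.LeakyReLU(0.2, True))
            self.net = nn.Sequential(*layers)

        def forward(self, x):
            return self.net(x)            # shape (N, 1, 2, w)

    x = torch.randn(1, 3, 32, 100)        # height normalized to 32 pixels
    print(Generator()(x).shape)           # torch.Size([1, 3, 32, 100])
    print(Discriminator()(x).shape)       # torch.Size([1, 1, 2, 100])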
The text recognition network first extracts image features with a convolutional neural network, and then derives a feature sequence from these features. During training the model is fine-tuned with an Adam optimizer, whose first-moment and second-moment decay rates are set to 0.5 and 0.999 respectively; the initial learning rate is 0.0002, and after 1 epoch the learning rate begins to decrease. The mini-batch (mini-batch stochastic gradient descent) size is set to 8, i.e. each mini-batch includes 8 training samples, mainly to speed up training. A bucketing strategy is adopted so that images of similar width are grouped into the same batch. Here, 1 epoch corresponds to one round of training using all samples in the training set, i.e. one forward pass and one backward pass over all training samples.
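A minimal sketch of this training configuration is given below; the decay schedule after the first epoch and the exact bucketing procedure are assumptions, since the text only fixes the optimizer hyper-parameters, the mini-batch size of 8 and the grouping of images of similar width:

    import torch

    def make_optimizer(model):
        # Adam with first/second moment decay rates 0.5 and 0.999, initial lr 0.0002
        return torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))

    def make_scheduler(optimizer, total_epochs):
        # Keep the initial learning rate for the first epoch, then decay linearly (assumed schedule).
        decay = lambda e: 1.0 if e < 1 else max(0.0, 1.0 - (e - 1) / max(1, total_epochs - 1))
        return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=decay)

    def bucket_by_width(samples, batch_size=8):
        # samples: list of (image_tensor, label); heights are already normalized to 32.
        samples = sorted(samples, key=lambda s: s[0].shape[-1])   # similar widths end up together
        return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]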
In the solution of the present application, the performance of the text recognition model is evaluated with accuracy and character error rate, where accuracy = number of correctly recognized samples / total number of test samples, and character error rate = sum of the edit distances between the recognition results and the labels of all samples / total number of characters in all test samples. Based on the accuracy and the character error rate, it is determined whether the trained text recognition model reaches the training precision, that is, whether it can accurately recognize the characters.
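The two metrics can be computed as in the following sketch (the edit distance here is the standard Levenshtein distance; variable names are illustrative):

    def edit_distance(a, b):
        # Levenshtein distance with a single rolling row.
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
        return dp[len(b)]

    def evaluate(predictions, labels):
        correct = sum(p == t for p, t in zip(predictions, labels))
        accuracy = correct / len(labels)
        cer = (sum(edit_distance(p, t) for p, t in zip(predictions, labels))
               / sum(len(t) for t in labels))
        return accuracy, cer

    print(evaluate(["abc", "abd"], ["abc", "abc"]))   # (0.5, 0.1666...)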
The following describes in detail a training process of the text recognition model of the present application with reference to specific examples, and the specific scheme is as follows:
First, training sample pairs are obtained, each training sample pair including a first sample image Real_A and a second sample image Real_B; the first sample image corresponds to a third image style (scene A) and the second sample image corresponds to a fourth image style (scene B). In this example, the text in the images of the training sample pairs is taken to be Tibetan.
Determining the text label of the first sample image Real_A specifically comprises: acquiring the first sample image and the character labels (char-level labels) of the first sample image, wherein one character label represents one character of the text to be recognized in the first sample image; and generating a text result based on the character labels according to the writing rules of the characters in the first sample image, and generating a text label (stack-level label) based on the text result.
Then, the initial neural network model is trained based on the first sample image Real_A until the loss function of the initial neural network model converges, and the initial neural network model at the end of training is taken as the character recognition model.
As shown in FIG. 5, the initial neural network model includes, in a first branch, a cascaded first style conversion network G_A, second style conversion network G_B and text recognition network (Text Recognizer), and, in a second branch, a cascaded second style conversion network G_B and first style conversion network G_A, as well as a first discrimination network D_A and a second discrimination network D_B.
The first style conversion network G_A is used for converting an input image into an image of the fourth image style (scene B); the second style conversion network G_B is used for converting an input image into an image of the third image style (scene A). The first discrimination network D_A is used for judging whether an input image belongs to scene A, that is, whether the input image is a sample image; the second discrimination network D_B is used for judging whether an input image belongs to scene B, that is, whether the input image is an image generated by the style conversion network.
The input of the first style conversion network G_A may include the first sample image Real_A, and the output includes the first image Fake_B: Real_A first passes through G_A to generate an image with the style of scene B, which is denoted as the first image Fake_B; Fake_B then passes through G_B to generate an image that again has the style of scene A, which is denoted as the fifth image Rec_A. Similarly, the second sample image Real_B passes through G_B and G_A in sequence, generating the second image Fake_A and the sixth image Rec_B respectively. Real_A, Fake_A, Real_B and Fake_B are respectively input into D_A and D_B, which judge whether the input image is a sample image or an image generated by the style conversion network, so as to form the adversarial training.
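Reusing the Generator and Discriminator sketches given earlier, one forward pass of this cycle can be written as follows (the shapes and the detach() placement are assumptions for illustration):

    import torch

    G_A, G_B = Generator(), Generator()     # G_A: scene A -> B, G_B: scene B -> A
    D_A, D_B = Discriminator(), Discriminator()

    real_A = torch.randn(1, 3, 32, 100)     # first sample image (scene A)
    real_B = torch.randn(1, 3, 32, 100)     # second sample image (scene B)

    fake_B = G_A(real_A)                    # first image: Real_A rendered in the style of scene B
    rec_A = G_B(fake_B)                     # fifth image: cycled back to the style of scene A
    fake_A = G_B(real_B)                    # second image: Real_B rendered in the style of scene A
    rec_B = G_A(fake_A)                     # sixth image: cycled back to the style of scene B

    # The discriminators are shown real and generated images to form the adversarial training.
    d_A_real, d_A_fake = D_A(real_A), D_A(fake_A.detach())
    d_B_real, d_B_fake = D_B(real_B), D_B(fake_B.detach())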
FIG. 6 is a schematic diagram of the network structure of the initial neural network. The text recognition network (Text Recognizer) in the initial neural network takes Real_A, Fake_A, Rec_A, Real_B, Fake_B and Rec_B as input; it is used to predict the text recognition result (text prediction result) in the corresponding image and to perform supervised training using the corresponding text annotation result.
In this example, the first sample image Real_A passes through the first style conversion network G_A to obtain the converted first image Fake_B. Fake_B then passes through the second discrimination network D_B, which is used for judging whether the input image belongs to scene B, i.e. judging whether Fake_B is an image of scene B. If Fake_B is judged to belong to scene B, it means that an image similar or identical in style to the second sample image Real_B of scene B can be obtained through the first style conversion network G_A; at this time the trained neural network performs well and reaches the preset training precision.
Real_B passes through the second style conversion network G_B to obtain the converted second image Fake_A. After the second image Fake_A is recognized by the text recognition network, the initial neural network can be trained in a supervised manner based on the character recognition result of the second image, so that an accurate character recognition result can also be obtained on the style-converted image.
It is understood that the initial neural network may not include a discrimination network, and the initial neural network may be, for example, a convolutional neural network (CNN).
The loss function includes an image loss function and a text recognition loss function. The image loss function may include a loss function L_G_A characterizing the difference between the second sample image and the corresponding first image, and may also include a loss function L_G_B characterizing the difference between the first sample image and the second image.

The image loss function may be referred to as a generation loss, and the generation loss is as follows:

L_gen = L_G_A + L_G_B = GANLoss(D_B(Fake_B), true) + GANLoss(D_A(Fake_A), true)

The image loss function L_G_A characterizes the difference between the first image and the second sample image after the first image is judged by the second discrimination network; the image loss function L_G_B characterizes the difference between the second image and the first sample image after the second image is judged by the first discrimination network. For the image loss function L_G_A, if the difference between the first image and the second sample image meets a first set condition, for example is smaller than a first threshold, it indicates that the first image obtained by style conversion of the first sample image through the first style conversion network is close in style to the second sample image Real_B, and an image of scene B can be obtained based on the first style conversion network. Analogously, for the image loss function L_G_B, if the difference between the first sample image and the second image satisfies a second set condition, for example is smaller than a second threshold, the second image obtained by style conversion of the second sample image through the second style conversion network is close in style to the first sample image (scene A), and an image of scene A can be obtained based on the second style conversion network. The first threshold and the second threshold may be configured based on actual demand, and may be the same or different.
Here, GANLoss(feature, label) is the loss function of the generative adversarial network (GAN, Generative Adversarial Network); its input is a feature vector (feature) and a corresponding label (label) representing the style of the corresponding sample image. For example, the first image Fake_B has a corresponding label that marks the style of the first image as scene B. The output of the generation loss function is the probability of judging a feature as its corresponding label. The goal of the generation loss is to make the discriminator unable to normally distinguish the synthesized data (the images generated by the style conversion network), while the goal of the discrimination loss is to distinguish, as far as possible, whether a sample comes from the real scene (a sample image) or from the style conversion network. During training, the negative logarithm of the probability can be taken to minimize the loss function.
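As a sketch only, GANLoss(feature, label) can be modeled as a binary cross-entropy between the discriminator output and an all-true or all-false target of the same shape; the exact adversarial loss form (cross-entropy vs. least squares) is an assumption, since the text only states that it is a generative adversarial loss:

    import torch
    import torch.nn.functional as F

    def gan_loss(disc_out, is_real):
        # GANLoss(feature, label): compare the discriminator output with a "true" or "false" target.
        target = torch.ones_like(disc_out) if is_real else torch.zeros_like(disc_out)
        return F.binary_cross_entropy_with_logits(disc_out, target)

    def generation_loss(d_B_of_fake_B, d_A_of_fake_A):
        # Each generator tries to make the corresponding discriminator judge its output as "true".
        return gan_loss(d_B_of_fake_B, True) + gan_loss(d_A_of_fake_A, True)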
The discriminant loss function is as follows:

L_D_A = GANLoss(D_A(Fake_A), false) + GANLoss(D_A(Real_A), true)
L_D_B = GANLoss(D_B(Fake_B), false) + GANLoss(D_B(Real_B), true)

The discriminant loss function may include a first discriminant loss function L_D_A and a second discriminant loss function L_D_B.
Here, false indicates that the input to the discrimination network is an image generated by the style conversion network, and true indicates that the input is a sample image. For the first discriminant loss function L_D_A, if the style discrimination loss corresponding to the first discrimination network D_A satisfies a third set condition, for example is smaller than a third threshold, the first discrimination network can accurately distinguish images generated by the style conversion network from sample images after verifying the images input to it; that is, the first discrimination network can be used to accurately judge whether an image input to it is an image generated by the style conversion network or a sample image, and the discrimination loss is small enough. The style discrimination loss reflects the accuracy with which the discrimination network judges the image style; the smaller the loss, the higher the accuracy. The composition of the discriminant loss function also differs depending on the input images (one or two). For example, in this example, the input images may be the second image Fake_A generated by the style conversion network and the first sample image Real_A.

The images input to the first discrimination network may include at least one of the second image Fake_A and the first sample image Real_A. If, after verification by the first discrimination network, the second image Fake_A is judged to be an image generated by the style conversion network and the first sample image Real_A is judged to be a sample image, it can be shown that the first discrimination network can accurately judge whether an image input to it is an image generated by the style conversion network or a sample image.

When the images input to the first discrimination network include both the second image Fake_A and the first sample image Real_A, the first discriminant loss function L_D_A corresponding to the first discrimination network can be characterized by the first discriminant loss function corresponding to the second image Fake_A and the first discriminant loss function corresponding to the first sample image Real_A. In practical applications, the importance of the loss function corresponding to the second image Fake_A and that corresponding to the first sample image Real_A may be determined by weights: the greater the weight, the more important the corresponding loss function.
The second discriminant loss function L_D_B characterizes the style discrimination loss of the second discrimination network. For the loss function L_D_B, if the style discrimination loss of the second discrimination network D_B satisfies a fourth set condition, for example is smaller than a fourth threshold, the second discrimination network can accurately distinguish images generated by the style conversion network from sample images after verifying the images input to it; that is, the second discrimination network can be used to accurately judge whether an image input to it is an image generated by the style conversion network or a sample image, and the discrimination loss is small enough. The composition of the discriminant loss function also differs depending on the input images (one or two). For example, in this example, the input images may be the first image Fake_B generated by the style conversion network and the second sample image Real_B.

The images input to the second discrimination network may include at least one of the first image Fake_B and the second sample image Real_B. If, after verification by the second discrimination network, the first image Fake_B is judged to be an image generated by the style conversion network and the second sample image Real_B is judged to be a sample image, the second discrimination network can be used to accurately judge whether an image input to it is an image generated by the style conversion network or a sample image.

Similarly, when the images input to the second discrimination network include both the first image Fake_B and the second sample image Real_B, the second discriminant loss function L_D_B corresponding to the second discrimination network can be characterized by the second discriminant loss function corresponding to the first image Fake_B and the second discriminant loss function corresponding to the second sample image Real_B. In practical applications, the importance of each may be determined by weights: the greater the weight, the more important the corresponding loss function.
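Following the same convention, and reusing the gan_loss helper sketched above, the two discriminant losses can be written as:

    def discriminant_loss_A(d_A_of_fake_A, d_A_of_real_A):
        # D_A should answer "false" on generated images and "true" on sample images of scene A.
        return gan_loss(d_A_of_fake_A, False) + gan_loss(d_A_of_real_A, True)

    def discriminant_loss_B(d_B_of_fake_B, d_B_of_real_B):
        # D_B should answer "false" on generated images and "true" on sample images of scene B.
        return gan_loss(d_B_of_fake_B, False) + gan_loss(d_B_of_real_B, True)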
In this example, the loss function may also include a cycle consistency loss, where the cycle consistency loss is as follows:

L_cyc = L_cycA + L_cycB = λ_A · L_1(Rec_A, Real_A) + λ_B · L_1(Rec_B, Real_B)

wherein λ_A, λ_B and λ_idt are adjustment parameters, and L_1(·) is the mean absolute error loss function.
The cycle consistency loss L_cycA characterizes the difference between the first sample image and the fifth image Rec_A. The smaller the loss function, the smaller the difference between the fifth image and the first sample image and the closer their styles, which shows that the first image style conversion network can accurately convert an image of the third image style into an image of the fourth image style and that the second image style conversion network can accurately convert an image of the fourth image style back into an image of the third image style, ensuring the consistency of the image style (scene A) before and after the style conversion.
The cycle consistency loss L_cycB characterizes the difference between the second sample image Real_B and the sixth image Rec_B. The smaller the loss function, the smaller the difference between the style of the sixth image Rec_B and the style of the second sample image and the closer they are, which shows that the first image style conversion network can accurately convert an image of the third image style into an image of the fourth image style and that the second image style conversion network can accurately convert an image of the fourth image style into an image of the third image style, ensuring the consistency of the image style (scene B) before and after the style conversion.
The loss functions further include an invariance loss function, where the invariance loss function is as follows:

L_idtA = λ_idt · L_1(G_A(Real_B), Real_B)
L_idtB = λ_idt · L_1(G_B(Real_A), Real_A)

where G_A(Real_B) is the third image and G_B(Real_A) is the fourth image.
The invariance loss function L_idtA characterizes the difference between the second sample image and the third image; the third image is obtained by converting the second sample image through the first style conversion network. The smaller the difference, the closer the image style of the third image is to that of the second sample image, which on the other hand shows that the first style conversion network accurately produces images of the fourth image style and does not convert its input into an image of another scene, reflecting the style invariance of images converted through the first style conversion network.
The invariance loss function L_idtB characterizes the difference between the first sample image and the fourth image; the fourth image is obtained by converting the first sample image through the second style conversion network. The smaller the difference, the closer the image styles of the first sample image and the fourth image are, which on the other hand shows that the second style conversion network accurately produces images of the third image style and does not convert its input into an image of another scene, reflecting the style invariance of images converted through the second style conversion network.
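Both losses reduce to mean absolute errors between image pairs, as in the following sketch; the default values of the adjustment parameters λ_A, λ_B and λ_idt are assumptions:

    import torch.nn.functional as F

    def cycle_consistency_loss(real_A, rec_A, real_B, rec_B, lambda_A=10.0, lambda_B=10.0):
        # Real_A vs. its reconstruction Rec_A (scene A) and Real_B vs. Rec_B (scene B).
        return lambda_A * F.l1_loss(rec_A, real_A) + lambda_B * F.l1_loss(rec_B, real_B)

    def invariance_loss(real_B, third_image, real_A, fourth_image, lambda_idt=5.0):
        # third_image = G_A(Real_B), fourth_image = G_B(Real_A): feeding a generator an
        # image already in its target style should leave the image unchanged.
        return lambda_idt * (F.l1_loss(third_image, real_B) + F.l1_loss(fourth_image, real_A))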
The text recognition loss function includes a loss function CTCLoss(Fake_B) characterizing the difference between the text annotation result of the first sample image and the text recognition result of the corresponding first image, a loss function CTCLoss(Real_B) characterizing the difference between the text annotation result of the first sample image and the text recognition result of the corresponding second sample image, a loss function CTCLoss(Real_A) characterizing the difference between the text annotation result of the first sample image and the text recognition result of the first sample image, a loss function CTCLoss(Fake_A) characterizing the difference between the text annotation result of the first sample image and the text recognition result of the corresponding second image, a loss function CTCLoss(Rec_A) characterizing the difference between the text annotation result of the first sample image and the text recognition result of the corresponding fifth image, and a loss function CTCLoss(Rec_B) characterizing the difference between the text annotation result of the first sample image and the text recognition result of the corresponding sixth image.
The text recognition loss function L_ctc can be expressed as follows:

L_ctc = CTCLoss(Real_A) + CTCLoss(Fake_A) + CTCLoss(Rec_A) + CTCLoss(Real_B) + CTCLoss(Fake_B) + CTCLoss(Rec_B)

where CTCLoss(·) denotes the neural-network-based CTC (Connectionist Temporal Classification) loss function used in text recognition.
The text recognition loss function L_ctc is used for characterizing the degree of difference between the text recognition result corresponding to each image input to the text recognition network and the corresponding character annotation result. In the text recognition loss function L_ctc, if the degree of difference between the text recognition result corresponding to each input image and the corresponding character annotation result satisfies a fifth set condition, for example is smaller than a fifth threshold, it indicates that the precision of recognizing the text in the image through the text recognition network is good and meets the actual requirement. The input images of the text recognition network may include at least one of Real_A, Fake_A, Rec_A, Real_B, Fake_B and Rec_B. It can be understood that, if the input images of the text recognition network include at least two images, a corresponding weight may be configured for the text recognition loss function corresponding to each image; the importance of each text recognition loss function is determined by its weight, and the greater the weight, the more important the corresponding loss function.
Based on the above text recognition loss functions, the text recognition loss function of the initial neural network may include at least one of the above-mentioned functions; if several of them are included, a corresponding weight may be configured for each loss function in the manner described above.
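A sketch of this combined text recognition loss with torch.nn.CTCLoss is given below; the blank index, the equal weighting of the six terms and the tensor shapes follow PyTorch conventions and are assumptions rather than the patent's exact setup:

    import torch
    import torch.nn as nn

    ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def text_recognition_loss(log_probs_per_image, targets, input_lengths, target_lengths):
        """log_probs_per_image: list of (T, N, C) log-probability tensors, one per image
        fed to the text recognition network (Real_A, Fake_A, Rec_A, Real_B, Fake_B, Rec_B).
        targets / lengths encode the character annotation of the first sample image."""
        return sum(ctc(lp, targets, input_lengths, target_lengths)
                   for lp in log_probs_per_image)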
The character recognition model comprises a first style conversion network and a character recognition network which are cascaded when training is finished.
Based on the same principle as the method shown in fig. 1, the embodiment of the present application further provides a text recognition apparatus 20, as shown in fig. 7, the text recognition apparatus 20 may include an image acquisition module 210, a style conversion module 220, and a text recognition module 230, where:
an image obtaining module 210, configured to obtain an image to be identified;
the style conversion module 220 is configured to perform style conversion on at least one of a text and a background in the image to be recognized based on the image to be recognized to obtain a converted image;
if the style of the characters and the background in the image to be recognized is converted, the style corresponding to the converted characters is different from the style corresponding to the background;
and the character recognition module 230 is configured to perform character recognition on characters in the converted image to obtain a character recognition result.
According to the scheme of the embodiment of the application, before character recognition is performed on the characters in the image to be recognized, style conversion is performed on at least one of the characters and the background in the image to be recognized; when both the characters and the background are style-converted, the style corresponding to the converted characters is different from the style corresponding to the converted background, so that the characters and the background in the converted image can be clearly distinguished. When the characters in the converted image are recognized, the characters can be distinguished more easily and accurately because their style differs from that of the background; moreover, compared with the image to be recognized, the converted image is relatively simple, regular and uniform in style, so the characters in the image can be recognized more accurately and a more accurate character recognition result is obtained.
Optionally, the style conversion module 220 is specifically configured to, when performing style conversion on at least one of the text and the background in the image to be recognized to obtain a converted image:
carrying out first image style conversion on characters of an image to be recognized, and carrying out second image style conversion on a background of the image to be recognized to obtain a converted image;
the first image style is a black font, and the second image style is a white background.
Optionally, the character recognition module 230 is specifically configured to, when performing character recognition on characters in the converted image to obtain a character recognition result:
extracting image features of the converted image;
and based on the image characteristics, a character recognition result is obtained by adopting a recurrent neural network.
Optionally, performing style conversion on the image to be recognized, and performing character recognition on characters in the converted image through a character recognition model;
the character recognition model is obtained by training based on the following modes:
acquiring training sample pairs, wherein each training sample pair comprises a first sample image and a second sample image, the second sample image is an image after style conversion corresponding to the first sample image, the first sample image carries a character label, and the character label represents a character labeling result in the first sample image; the first sample image corresponds to a third image style, and the second sample image corresponds to a fourth image style;
training the initial neural network model based on the first sample image until the loss function of the initial neural network model converges, and taking the initial neural network model at the end of training as a character recognition model;
the initial neural network model comprises a first style conversion network and a character recognition network which are connected in series, wherein the first style conversion network is used for converting an input image into an image of a fourth image style; the input of the first style conversion network comprises a first sample image, the output comprises a first image, the input of the character recognition network comprises a first image, and the output comprises a character recognition result of the first image;
the loss function comprises an image loss function and a text recognition loss function, the image loss function comprises a loss function representing the difference between the second sample image and the corresponding first image, and the text recognition loss function comprises a loss function representing the difference between the character labeling result in the first sample image and the character recognition result in the corresponding first image;
the character recognition model comprises a first style conversion network and a character recognition network which are cascaded after training is finished.
Optionally, the input of the text recognition network further comprises at least one of the second sample image or the first sample image;
if the input of the character recognition network comprises a second sample image, the text recognition loss function also comprises a loss function representing the difference between the character labeling result in the first sample image and the character recognition result of the corresponding second sample image;
if the input to the word recognition network comprises a first sample image, the text recognition loss function further comprises a loss function characterizing a difference between the word annotation result in the first sample image and the word recognition result of the first sample image.
Optionally, the initial neural network model further includes a second style conversion network, the second style conversion network is configured to convert the input image into an image of a third image style, an input of the second style conversion network includes a second sample image, and an output includes a second image;
the image loss functions further include a loss function characterizing a difference between the first sample image and the second image.
Optionally, the input of the first style conversion network further includes a second sample image, the output further includes a third image, and the loss function further includes an invariance loss function, the invariance loss function including a loss function characterizing a difference between the second sample image and the third image;
and/or,
the input of the second style conversion network further comprises a first sample image, and the output further comprises a fourth image; the invariance loss function includes a loss function characterizing a difference between the first sample image and the fourth image.
Optionally, the input of the character recognition network further includes a second image, and the output further includes a character recognition result of the second image;
the text recognition penalty functions further include a penalty function characterizing a difference between the word annotation result for the first image and the corresponding word recognition result for the second image.
Optionally, the character recognition model further includes at least one of a first discrimination network and a second discrimination network, where an input of the first discrimination network is an image of a third image style, an output of the first discrimination network is information for characterizing that the input image is an image of a first sample or an image generated by the first style conversion network, an input of the second discrimination network is an image of a fourth image style, and an output of the second discrimination network is an image for characterizing that the input image is an image of a second sample or an image generated by the second style conversion network;
the input of the first judgment network comprises a first sample image and a second image, and the input of the second judgment network is a second sample image and a first image;
the loss functions also include discriminant loss functions that characterize style discriminant losses for the discriminant network.
Optionally, the input of the second style conversion network further includes a first image, the output further includes a fifth image, and the loss function further includes a recurring consistent loss function, the recurring consistent loss function including a loss function characterizing a difference between the first sample image and the fifth image;
and/or,
the input of the first image style conversion network further comprises a second image and the output further comprises a sixth image, and the recurring consistent loss function comprises a loss function characterizing a difference between the second sample image and the sixth image.
Optionally, the input of the text recognition network further includes a fifth image, the output further includes a text recognition result of the fifth image, and the text recognition loss function further includes a loss function representing a difference between the text annotation result of the first sample image and the text recognition result of the corresponding fifth image;
and/or,
the input of the character recognition network further comprises a sixth image, the output further comprises a character recognition result of the sixth image, and the character recognition loss function further comprises a loss function representing the difference between the character labeling result of the first sample image and the character recognition result of the corresponding sixth image.
Optionally, if the characters in the training sample pair are vowel-attaching type characters, each text is composed of at least one character, and the text label in the first sample image is determined by the following method:
acquiring a first sample image and character labels of the first sample image, wherein one character label represents one character corresponding to characters to be recognized in the first sample image;
and generating a text label based on the character labels according to the writing rules of the characters in the first sample image.
Optionally, the vowel superscript type text includes at least one of Tibetan or Thai.
Optionally, the apparatus further comprises:
and the second sample image determining module is used for performing style conversion on at least one of characters and backgrounds in the first sample image to obtain a second sample image in a fourth image style.
Since the text recognition apparatus provided in the embodiment of the present application is an apparatus capable of executing the text recognition method in the embodiment of the present application, based on the text recognition method provided in the embodiment of the present application, a person skilled in the art can understand a specific implementation manner of the text recognition apparatus in the embodiment of the present application and various variations thereof, so that a detailed description of how the text recognition apparatus implements the text recognition method in the embodiment of the present application is not repeated here. The text recognition device used by those skilled in the art to implement the text recognition method in the embodiments of the present application is within the scope of the present application.
Based on the same principle as the character recognition method and the character recognition apparatus provided in the embodiments of the present application, an embodiment of the present application also provides an electronic device, which may include a processor and a memory. The memory stores therein readable instructions, which when loaded and executed by the processor, may implement the method shown in any of the embodiments of the present application.
As an example, fig. 8 shows a schematic structural diagram of an electronic device 4000 to which the solution of the embodiment of the present application is applied, and as shown in fig. 8, the electronic device 4000 includes: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. Processor 4001 is configured to execute application code stored in memory 4003 to implement any of the method embodiments described above.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not performed in a strictly limited order and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different moments, and are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing is only a partial embodiment of the present application. It should be noted that, for those skilled in the art, various improvements and modifications can be made without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (13)

1. A method for recognizing a character, comprising:
acquiring an image to be identified;
performing style conversion on at least one of characters and a background in the image to be recognized based on the image to be recognized to obtain a converted image, wherein if the styles of the characters and the background in the image to be recognized are both converted, the style corresponding to the converted characters is different from the style corresponding to the background;
performing character recognition on the converted image to obtain a character recognition result;
wherein the style conversion of the image to be recognized and the character recognition of the characters in the converted image are performed through a character recognition model;
the character recognition model is obtained by training based on the following modes:
acquiring training sample pairs, wherein each training sample pair comprises a first sample image and a second sample image, the second sample image is an image after style conversion corresponding to the first sample image, the first sample image carries a text label, and the text label represents a text labeling result in the first sample image; the first sample image corresponds to a third image style, and the second sample image corresponds to a fourth image style;
training an initial neural network model based on the first sample image until a loss function of the initial neural network model converges, and taking the initial neural network model at the end of training as the character recognition model;
the initial neural network model comprises a first style conversion network and a character recognition network which are connected in series, wherein the first style conversion network is used for converting an input image into an image of the fourth image style; the input of the first style conversion network comprises the first sample image, the output comprises a first image, the input of the character recognition network comprises the first image, and the output comprises a character recognition result of the first image;
the loss functions include an image loss function and a text recognition loss function, the image loss function includes a loss function representing a difference between the second sample image and the corresponding first image, and the text recognition loss function includes a loss function representing a difference between a text annotation result in the first sample image and a corresponding text recognition result in the first image;
the character recognition model comprises the first style conversion network and the character recognition network which are cascaded at the end of training;
if the characters in the training sample pair are vowel-attaching-type characters, each text is composed of at least one character, and the text label in the first sample image is determined in the following way:
acquiring the first sample image and character labels of the first sample image, wherein one character label represents one character corresponding to a character to be recognized in the first sample image;
generating a text label based on the character labels according to the writing rules of the characters in the first sample image;
the writing rule is the composition order of the characters in the text;
and generating a corresponding text label based on the character labels of the characters, wherein the text label represents the text labeling result of the text to be recognized in the sample image, and the writing rule of the corresponding text is reflected through the text label.
2. The method according to claim 1, wherein the style conversion of at least one of a text or a background in the image to be recognized to obtain a converted image comprises:
carrying out first image style conversion on characters of the image to be recognized, and carrying out second image style conversion on a background of the image to be recognized to obtain the converted image;
the first image style is a black font, and the second image style is a white background.
3. The method of claim 1, wherein the performing character recognition on the characters in the converted image to obtain a character recognition result comprises:
extracting image features of the converted image;
and obtaining the character recognition result by adopting a recurrent neural network based on the image characteristics.
4. The method of claim 1, wherein the input to the text recognition network further comprises at least one of the second sample image or the first sample image;
if the input of the character recognition network comprises the second sample image, the text recognition loss function also comprises a loss function representing the difference between the character marking result in the first sample image and the corresponding character recognition result of the second sample image;
if the input to the word recognition network comprises the first sample image, the text recognition loss function further comprises a loss function characterizing a difference between a word annotation result in the first sample image and a word recognition result of the first sample image.
5. The method of claim 1 or 4, wherein the initial neural network model further comprises a second style conversion network for converting an input image into an image of the third image style, wherein the input of the second style conversion network comprises the second sample image and the output comprises the second image;
the image loss function further includes a loss function characterizing a difference between the first sample image and the second image.
6. The method of claim 5, wherein the input to the first style conversion network further comprises the second sample image, the output further comprises a third image, and the loss function further comprises an invariance loss function comprising a loss function characterizing a difference between the second sample image and the third image;
and/or,
the input of the second style conversion network further comprises the first sample image, and the output further comprises a fourth image; the invariance loss function comprises a loss function characterizing a difference between the first sample image and the fourth image.
7. The method of claim 5, wherein the input to the word recognition network further comprises the second image, and wherein the output further comprises a word recognition result of the second image;
the text recognition loss function further comprises a loss function characterizing a difference between the text annotation result of the first sample image and the corresponding text recognition result of the second image.
8. The method of claim 5, wherein the character recognition model further comprises at least one of a first discrimination network having an input of an image of a third image style and an output of information characterizing the input image as the first sample image or an image generated by the first style conversion network, or a second discrimination network having an input of an image of a fourth image style and an output of information characterizing the input image as the second sample image or an image generated by the second style conversion network;
the input of the first discrimination network comprises the first sample image and the second image, and the input of the second discrimination network is the second sample image and the first image;
the loss functions further include discriminant loss functions that characterize style discriminant losses of the discriminant network.
9. The method of claim 5,
the input of the second style conversion network further comprises the first image, the output further comprises a fifth image, the loss function further comprises a recurring consistent loss function comprising a loss function characterizing a difference between the first sample image and the fifth image;
and/or,
the input of the first image style conversion network further comprises the second image and the output further comprises a sixth image, and the recurring consistent loss function comprises a loss function characterizing a difference between the second sample image and the sixth image.
10. The method of claim 9, wherein the input to the word recognition network further comprises the fifth image, the output further comprises word recognition results for the fifth image, and the text recognition loss function further comprises a loss function characterizing a difference between word annotation results for the first sample image and corresponding word recognition results for the fifth image;
and/or,
the input of the character recognition network further comprises the sixth image, the output further comprises a character recognition result of the sixth image, and the text recognition loss function further comprises a loss function representing a difference between a character labeling result of the first sample image and a corresponding character recognition result of the sixth image.
11. The method of claim 1, wherein the vowel-attaching-type characters include at least one of Tibetan or Thai.
12. The method of claim 1, further comprising:
and performing style conversion on at least one of characters and backgrounds in the first sample image to obtain the second sample image in the fourth image style.
13. A character recognition apparatus, comprising:
the image acquisition module is used for acquiring an image to be identified;
the style conversion module is used for carrying out style conversion on at least one item of characters or backgrounds in the image to be recognized based on the image to be recognized to obtain a converted image;
if the style of the characters and the background in the image to be recognized is converted, the style corresponding to the converted characters is different from the style corresponding to the background;
the character recognition module is used for recognizing characters in the converted image to obtain a character recognition result;
wherein the style conversion of the image to be recognized and the character recognition of the characters in the converted image are performed through a character recognition model;
the character recognition model is obtained by training based on the following modes:
obtaining training sample pairs, wherein each training sample pair comprises a first sample image and a second sample image, the second sample image is an image after style conversion corresponding to the first sample image, the first sample image carries a character label, and the character label represents a character labeling result in the first sample image; the first sample image corresponds to a third image style, and the second sample image corresponds to a fourth image style;
training an initial neural network model based on the first sample image until a loss function of the initial neural network model converges, and taking the initial neural network model at the end of training as the character recognition model;
the initial neural network model comprises a first style conversion network and a character recognition network which are connected in series, wherein the first style conversion network is used for converting an input image into an image of the fourth image style; the input of the first style conversion network comprises the first sample image, the output comprises a first image, the input of the character recognition network comprises the first image, and the output comprises a character recognition result of the first image;
the loss functions include an image loss function and a text recognition loss function, the image loss function includes a loss function representing a difference between the second sample image and the corresponding first image, and the text recognition loss function includes a loss function representing a difference between a text annotation result in the first sample image and a corresponding text recognition result in the first image;
the character recognition model comprises the first style conversion network and the character recognition network which are cascaded at the end of training;
if the characters in the training sample pair are vowel-attaching-type characters, each text is composed of at least one character, and the text label in the first sample image is determined in the following way:
acquiring the first sample image and character labels of the first sample image, wherein one character label represents one character corresponding to a character to be recognized in the first sample image;
generating a text label based on the character labels according to the writing rules of the characters in the first sample image;
the writing rule is the composition order of the characters in the text;
and generating a corresponding text label based on the character labels of the characters, wherein the text label represents the text labeling result of the text to be recognized in the sample image, and the writing rule of the corresponding text is reflected through the text label.
CN202010019533.XA 2020-01-08 2020-01-08 Character recognition method and device Active CN111242114B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010019533.XA CN111242114B (en) 2020-01-08 2020-01-08 Character recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010019533.XA CN111242114B (en) 2020-01-08 2020-01-08 Character recognition method and device

Publications (2)

Publication Number Publication Date
CN111242114A CN111242114A (en) 2020-06-05
CN111242114B true CN111242114B (en) 2023-04-07

Family

ID=70880424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010019533.XA Active CN111242114B (en) 2020-01-08 2020-01-08 Character recognition method and device

Country Status (1)

Country Link
CN (1) CN111242114B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580623B (en) * 2020-12-25 2023-07-25 北京百度网讯科技有限公司 Image generation method, model training method, related device and electronic equipment
CN112966685B (en) * 2021-03-23 2024-04-19 深圳赛安特技术服务有限公司 Attack network training method and device for scene text recognition and related equipment


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140363082A1 (en) * 2013-06-09 2014-12-11 Apple Inc. Integrating stroke-distribution information into spatial feature extraction for automatic handwriting recognition

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10319939A (en) * 1997-05-16 1998-12-04 Matsushita Graphic Commun Syst Inc Character font forming device
WO2000060560A1 (en) * 1999-04-05 2000-10-12 Connor Mark Kevin O Text processing and display methods and systems
JP2007179196A (en) * 2005-12-27 2007-07-12 Konica Minolta Medical & Graphic Inc Information processor for medical use
JP2007299164A (en) * 2006-04-28 2007-11-15 Konica Minolta Medical & Graphic Inc Medical image processing device and program
WO2009137073A1 (en) * 2008-05-06 2009-11-12 Compulink Management Center, Inc. Camera-based document imaging
CN110471630A (en) * 2018-05-11 2019-11-19 京瓷办公信息系统株式会社 The control method of image processing apparatus and image processing apparatus
CN110533020A (en) * 2018-05-25 2019-12-03 腾讯科技(深圳)有限公司 A kind of recognition methods of text information, device and storage medium
CN109285111A (en) * 2018-09-20 2019-01-29 广东工业大学 A kind of method, apparatus, equipment and the computer readable storage medium of font conversion
CN109977649A (en) * 2019-02-13 2019-07-05 平安科技(深圳)有限公司 Method for generating picture verification codes, device, storage medium and computer equipment
CN110245257A (en) * 2019-05-31 2019-09-17 阿里巴巴集团控股有限公司 The generation method and device of pushed information
CN110427948A (en) * 2019-07-29 2019-11-08 杭州云深弘视智能科技有限公司 The generation method and its system of character sample
CN110516201A (en) * 2019-08-20 2019-11-29 Oppo广东移动通信有限公司 Image processing method, device, electronic equipment and storage medium
CN110516577A (en) * 2019-08-20 2019-11-29 Oppo广东移动通信有限公司 Image processing method, device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wen Zhu. Research on offline handwritten Thai pattern recognition based on Java technology. Wireless Internet Technology, 2017, No. 21, pp. 120-122. *
Li Ying; Liu Juhua; Yi Yaohua. Character recognition methods for natural scene images. Packaging Engineering, 2018, No. 05, pp. 177-181. *

Also Published As

Publication number Publication date
CN111242114A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN110427867B (en) Facial expression recognition method and system based on residual attention mechanism
CN107133622B (en) Word segmentation method and device
CN109840531B (en) Method and device for training multi-label classification model
Das et al. Sign language recognition using deep learning on custom processed static gesture images
WO2020228446A1 (en) Model training method and apparatus, and terminal and storage medium
WO2019100724A1 (en) Method and device for training multi-label classification model
CN103984943B (en) A kind of scene text recognition methods based on Bayesian probability frame
CN113011357B (en) Depth fake face video positioning method based on space-time fusion
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
Chandio et al. Cursive text recognition in natural scene images using deep convolutional recurrent neural network
CN109002755B (en) Age estimation model construction method and estimation method based on face image
CN110598603A (en) Face recognition model acquisition method, device, equipment and medium
CN106650617A (en) Pedestrian abnormity identification method based on probabilistic latent semantic analysis
CN108681735A (en) Optical character recognition method based on convolutional neural networks deep learning model
CN111523421A (en) Multi-user behavior detection method and system based on deep learning and fusion of various interaction information
He et al. Aggregating local context for accurate scene text detection
CN111242114B (en) Character recognition method and device
CN114387641A (en) False video detection method and system based on multi-scale convolutional network and ViT
An Pedestrian Re‐Recognition Algorithm Based on Optimization Deep Learning‐Sequence Memory Model
Zhao et al. Cbph-net: A small object detector for behavior recognition in classroom scenarios
CN114898472A (en) Signature identification method and system based on twin vision Transformer network
CN110751005B (en) Pedestrian detection method integrating depth perception features and kernel extreme learning machine
WO2023185074A1 (en) Group behavior recognition method based on complementary spatio-temporal information modeling
Song et al. Text Siamese network for video textual keyframe detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40023603

Country of ref document: HK

GR01 Patent grant