CN113657391A - Training method of character recognition model, and method and device for recognizing characters


Publication number
CN113657391A
Authority
CN
China
Prior art keywords: language, content, determining, sample, picture
Prior art date
Legal status: Pending (assumed; not a legal conclusion)
Application number
CN202110934328.0A
Other languages
Chinese (zh)
Inventor
王晓燕
吕鹏原
范森
章成全
姚锟
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110934328.0A priority Critical patent/CN113657391A/en
Publication of CN113657391A publication Critical patent/CN113657391A/en
Priority to PCT/CN2022/104891 priority patent/WO2023016163A1/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The present disclosure provides a training method for a character recognition model, and a method, apparatus, device, storage medium and program product for recognizing characters. It relates to the technical field of artificial intelligence, in particular to computer vision and deep learning, and can be applied to scenarios such as optical character recognition (OCR). The specific implementation scheme is as follows: determining a plurality of first sample pictures, and content labels and language labels of the plurality of first sample pictures, according to a plurality of monolingual corpora; determining a plurality of second sample pictures, and content labels and language labels of the plurality of second sample pictures, according to a plurality of mixed-language corpora; and training a character recognition model according to the plurality of first sample pictures, the content labels and language labels of the plurality of first sample pictures, the plurality of second sample pictures, and the content labels and language labels of the plurality of second sample pictures.

Description

Training method of character recognition model, and method and device for recognizing characters
Technical Field
The present disclosure relates to the technical field of artificial intelligence, in particular to computer vision and deep learning, and can be applied to scenarios such as optical character recognition (OCR).
Background
In everyday life, many files such as documents, pictures and videos contain text in multiple languages, for example English, Spanish, Portuguese, Russian and Polish. Recognizing the multilingual text content in a file and outputting the corresponding language category is a prerequisite for extracting and translating the text of each language. This recognition process is of great significance for information auditing, cultural communication, business exchange and the like.
Disclosure of Invention
The present disclosure provides a training method of a character recognition model, a method, an apparatus, a device, a storage medium, and a program product for recognizing characters.
According to an aspect of the present disclosure, there is provided a method for training a character recognition model, including: determining a plurality of first sample pictures and content labels and language labels of the plurality of first sample pictures according to a plurality of monolingual corpora; determining a plurality of second sample pictures and content labels and language labels of the plurality of second sample pictures according to a plurality of mixed language corpora; and training a character recognition model according to the plurality of first sample pictures, the content labels and the language labels of the plurality of first sample pictures, the plurality of second sample pictures and the content labels and the language labels of the plurality of second sample pictures.
According to another aspect of the present disclosure, there is provided a method of recognizing characters, including: acquiring a picture to be recognized that contains text information; and inputting the picture to be recognized into a character recognition model to obtain a content recognition result and a language recognition result of the picture to be recognized, where the content recognition result represents the text information contained in the picture to be recognized and the language recognition result represents the language corresponding to that text information, the character recognition model being trained according to the method of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a training apparatus for a character recognition model, including: a first determining module, configured to determine a plurality of first sample pictures, and content labels and language labels of the plurality of first sample pictures, according to a plurality of monolingual corpora; a second determining module, configured to determine a plurality of second sample pictures, and content labels and language labels of the plurality of second sample pictures, according to a plurality of mixed-language corpora; and a training module, configured to train the character recognition model according to the plurality of first sample pictures, the content labels and language labels of the plurality of first sample pictures, the plurality of second sample pictures, and the content labels and language labels of the plurality of second sample pictures.
According to another aspect of the present disclosure, there is provided an apparatus for recognizing characters, including: an acquisition module, configured to acquire a picture to be recognized that contains text information; and an input module, configured to input the picture to be recognized into a character recognition model to obtain a content recognition result and a language recognition result of the picture to be recognized, where the content recognition result represents the text information contained in the picture to be recognized and the language recognition result represents the language corresponding to that text information, the character recognition model being trained by the apparatus of the embodiments of the present disclosure.
Another aspect of the present disclosure provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method shown in the embodiments of the present disclosure.
According to another aspect of the embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method shown in the embodiments of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of a method of training a text recognition model according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a text recognition model according to an embodiment of the present disclosure;
FIG. 3 is a flow diagram of a method of training a text recognition model according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a schematic diagram of a method of training a text recognition model, in accordance with an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a method of recognizing text in accordance with an embodiment of the present disclosure;
FIG. 6 schematically illustrates a schematic diagram of a method of recognizing text, in accordance with an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a training apparatus for a text recognition model according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a block diagram of an apparatus for recognizing text, in accordance with an embodiment of the present disclosure; and
FIG. 9 schematically shows a block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1 is a flow chart of a method of training a text recognition model according to an embodiment of the present disclosure.
As shown in fig. 1, the method 100 includes, in operation S110, determining a plurality of first sample pictures, and content tags and language tags of the plurality of first sample pictures, according to a plurality of monolingual corpora.
Then, in operation S120, a plurality of second sample pictures, and content tags and language tags of the plurality of second sample pictures, are determined according to a plurality of mixed-language corpora.
In operation S130, a character recognition model is trained according to the plurality of first sample pictures, the content tags and the language tags of the plurality of first sample pictures, the plurality of second sample pictures, and the content tags and the language tags of the plurality of second sample pictures.
According to an embodiment of the present disclosure, the character recognition model may be used to determine a content recognition result and a language recognition result of an input picture. The content recognition result may be used to represent the text information included in the input picture, and the language recognition result may be used to represent the language corresponding to the text information.
According to the embodiment of the disclosure, the trained character recognition model can automatically output the languages corresponding to the characters while recognizing the characters contained in the picture.
In the related art, pictures in different languages can be collected from real scenes and labeled for use as sample pictures in model training. However, the number of such pictures that can be collected from real scenes is limited, and labeling them is difficult.
According to the embodiment of the disclosure, besides the pictures of different languages collected in the real scene, the text corpora of each language can be collected, and a large number of pictures with characters can be synthesized according to the corpora for model training.
According to an embodiment of the present disclosure, for example, for a text containing mixed languages, the unwanted languages in the text may be filtered out according to the character set (also referred to as a dictionary) of a predetermined language. Each line of the filtered text is then used as one corpus.
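As an illustration of this filtering step, a minimal Python sketch is given below; the character set, function name and sample text are assumptions chosen for illustration, not taken from the patent:

```python
# Sketch of filtering a mixed-language text against the character set
# (dictionary) of a predetermined language, then taking each remaining
# non-empty line as one monolingual corpus. The Russian character set
# below is an illustrative assumption.

RUSSIAN_CHARSET = set(
    "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
    "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"
    " 0123456789"
)

def filter_text(text: str, charset: set) -> list:
    """Drop characters outside `charset`; return one corpus per line."""
    corpora = []
    for line in text.splitlines():
        kept = "".join(ch for ch in line if ch in charset)
        kept = " ".join(kept.split())  # collapse leftover whitespace runs
        if kept:
            corpora.append(kept)
    return corpora

mixed = "привет hello мир 123\nabc only latin\nещё строка"
print(filter_text(mixed, RUSSIAN_CHARSET))
# → ['привет мир 123', 'ещё строка']
```

Each surviving line then serves as one monolingual corpus from which a sample picture can be rendered.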
Based on this, according to the embodiments of the present disclosure, a picture including a monolingual corpus may be generated as a first sample picture for each monolingual corpus of a plurality of monolingual corpora. And then determining the content label of the first sample picture according to the text content of the single language corpus. And determining the language label of the first sample picture according to the language of the monolingual corpus.
According to an embodiment of the present disclosure, original corpora of a plurality of languages can be mixed and spliced, splicing corpora of several languages into one text to obtain a plurality of mixed-language corpora. Then, for each mixed-language corpus of the plurality of mixed-language corpora, a picture containing the mixed-language corpus is generated as a second sample picture. A content label of the second sample picture is determined according to the text content of the mixed-language corpus, and a language label of the second sample picture is determined according to the language of the mixed-language corpus. For example, the language label of a mixed-language corpus may be the language with the largest number of words in that corpus. When several languages are tied for the largest number of words, any one of them may be chosen as the language label.
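The majority-vote rule for the language label described above can be sketched as follows; `detect_language` is a hypothetical stand-in for the per-word language lookup, which the patent does not specify:

```python
from collections import Counter

# Sketch of the language-label rule for a mixed-language corpus: the
# label is the language contributing the most words; when several
# languages tie, any of them may be chosen. `detect_language` is a
# hypothetical stand-in based on a Unicode script range.

def detect_language(word: str) -> str:
    if any("\u0400" <= ch <= "\u04ff" for ch in word):
        return "ru"  # any Cyrillic letter -> treat the word as Russian
    return "en"

def language_label(corpus: str) -> str:
    counts = Counter(detect_language(w) for w in corpus.split())
    return counts.most_common(1)[0][0]  # a tie resolves arbitrarily

print(language_label("hello мир это тест"))  # → ru (3 Cyrillic words vs 1)
print(language_label("one two три"))         # → en (2 Latin words vs 1)
```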
According to other embodiments of the present disclosure, the pictures input into the character recognition model may be of different sizes, which affects recognition accuracy. For this reason, the size of a picture may be adjusted into a preset range before it is input into the character recognition model. Illustratively, in this embodiment, the height of the picture in the vertical direction can be adjusted to between 32 and 48 pixels, with the width in the horizontal direction scaled proportionally according to the original aspect ratio of the picture. In addition, the width of the picture can be limited to no more than 512 pixels.
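The size normalisation described above amounts to simple arithmetic on the picture dimensions; a hedged sketch follows (the rounding and clamping details are assumptions):

```python
# Sketch of the size-normalisation rule: clamp the height into
# [32, 48] pixels, scale the width by the same factor to keep the
# original aspect ratio, and cap the width at 512 pixels. Only the
# target-dimension arithmetic is shown; an image library would then
# perform the actual resampling.

MIN_H, MAX_H, MAX_W = 32, 48, 512

def target_size(h: int, w: int) -> tuple:
    if h < MIN_H:
        scale = MIN_H / h
    elif h > MAX_H:
        scale = MAX_H / h
    else:
        scale = 1.0
    new_h = round(h * scale)
    new_w = min(MAX_W, max(1, round(w * scale)))
    return new_h, new_w

print(target_size(64, 640))   # → (48, 480): height clamped down
print(target_size(16, 200))   # → (32, 400): height scaled up
print(target_size(40, 2000))  # → (40, 512): width capped at 512
```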
The text recognition model shown above is further described with reference to fig. 2 in conjunction with specific embodiments. Those skilled in the art will appreciate that the following example embodiments are only for the understanding of the present disclosure, and the present disclosure is not limited thereto.
FIG. 2 is a schematic diagram of a word recognition model according to an embodiment of the present disclosure.
As shown in fig. 2, the character recognition model may include a first convolutional neural network (CNN) 210, a recurrent neural network (RNN) 220, a connectionist temporal classification (CTC) network 230, and a second convolutional neural network 240.
According to an embodiment of the present disclosure, the first convolutional neural network 210 may perform feature extraction on a picture 21 input into the character recognition model to obtain a feature vector 22 of the picture. The features in the feature vector 22 are ordered by time step. The recurrent neural network 220 may further extract sequence features from the feature vector 22 extracted by the first convolutional neural network 210. The connectionist temporal classification network 230 may determine the content recognition result 23 for the picture according to the sequence features extracted by the recurrent neural network. In addition, a multivariate (N-gram) feature vector 24 can be determined from the feature vector 22, and the second convolutional neural network 240 can determine the language recognition result 25 from the multivariate feature vector 24.
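The patent does not give an exact formula for the multivariate (N-gram) feature vector 24; one plausible reading, used here purely as an illustration, is to concatenate the feature vectors of adjacent time steps:

```python
# Illustrative assumption (not the patent's exact method): form a
# bigram feature at each time step by concatenating the feature
# vectors of adjacent time steps, giving the language-classification
# branch access to correlations between neighbouring characters.

def bigram_features(feats: list) -> list:
    """feats: per-time-step feature vectors (lists of floats)."""
    return [feats[t] + feats[t + 1] for t in range(len(feats) - 1)]

seq = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(bigram_features(seq))
# → [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
```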
Because the character recognition model contains only a small number of sub-networks, it consumes fewer computing resources and simplifies the system pipeline.
The method for training the character recognition model described above is further described with reference to fig. 3 in conjunction with the specific embodiments. Those skilled in the art will appreciate that the following example embodiments are only for the understanding of the present disclosure, and the present disclosure is not limited thereto.
FIG. 3 is a flow diagram of a method of training a text recognition model according to an embodiment of the present disclosure.
As shown in fig. 3, the method 330 includes, in operation S331, acquiring one sample picture from among the plurality of first sample pictures and the plurality of second sample pictures.
In operation S332, a content recognition result and a language recognition result of the sample picture are determined using the character recognition model.
In operation S333, a first loss is determined according to the content recognition result and the content tag of the sample picture, and a second loss is determined according to the language recognition result and the language tag of the sample picture.
According to an embodiment of the present disclosure, a loss (loss) between the content recognition result and the content tag of the sample picture, i.e., a first loss, may be determined according to a first loss function, for example. The second loss may be determined as a loss between the language identification result and the language label of the sample picture according to a second loss function. The first loss function and the second loss function may be the same or different.
In operation S334, a total loss is determined based on the first loss and the second loss.
According to implementations of the present disclosure, the first loss and the second loss may be added in a weighted manner to obtain the total loss. The weights of the first loss and the second loss can be determined according to actual needs. For example, in this embodiment, the weight of the second loss may be lower than the weight of the first loss.
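As a minimal illustration of the weighted sum, with weights 1.0 and 0.3 chosen purely as assumed example values (the patent only indicates that the second weight may be lower than the first):

```python
# Minimal sketch of the weighted total loss. The weights below are
# assumed example values, not values from the patent.

W_CONTENT, W_LANGUAGE = 1.0, 0.3

def total_loss(content_loss: float, language_loss: float) -> float:
    return W_CONTENT * content_loss + W_LANGUAGE * language_loss

print(total_loss(2.0, 1.0))  # 1.0 * 2.0 + 0.3 * 1.0 ≈ 2.3
```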
In operation S335, parameters of the character recognition model are adjusted according to the total loss.
In operation S336, another sample picture among the plurality of first sample pictures and the plurality of second sample pictures is acquired, and the process jumps back to operation S332 to determine the content recognition result and language recognition result of that sample picture using the character recognition model.
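The loop over operations S331 to S336 can be sketched as follows; the model, the two losses (absolute errors on scalar stand-ins) and the update step are toy assumptions, so that only the control flow mirrors the operations described above:

```python
# Toy skeleton of the S331-S336 training loop. "model" returns
# (content_pred, language_pred); "step" applies the parameter update.
# Both are illustrative stand-ins, as are the loss definitions.

W_CONTENT, W_LANGUAGE = 1.0, 0.3  # assumed example weights

def run_epoch(samples, model, step):
    """samples: (picture, content_label, language_label) triples."""
    totals = []
    for picture, content_label, language_label in samples:         # S331/S336
        content_pred, language_pred = model(picture)               # S332
        first_loss = abs(content_pred - content_label)             # S333
        second_loss = abs(language_pred - language_label)
        total = W_CONTENT * first_loss + W_LANGUAGE * second_loss  # S334
        step(total)                                                # S335
        totals.append(total)
    return totals

# Toy usage: a "model" that always predicts zero, a no-op update step.
samples = [(None, 1.0, 2.0), (None, 3.0, 4.0)]
print(run_epoch(samples, lambda pic: (0.0, 0.0), lambda t: None))
```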
The method for training the character recognition model described above is further described with reference to fig. 4 in conjunction with specific embodiments. Those skilled in the art will appreciate that the following example embodiments are only for the understanding of the present disclosure, and the present disclosure is not limited thereto.
FIG. 4 schematically shows a schematic diagram of a method of training a text recognition model according to an embodiment of the present disclosure.
In fig. 4, it is shown that during training of the character recognition model, the first convolutional neural network 410 may be used to determine the feature vector 42 of a sample picture 41. Then, based on the feature vector 42, character recognition and language classification are performed by two separate branches. In the branch for character recognition, the recurrent neural network 420 may determine sequence features from the feature vector 42, and the connectionist temporal classification network 430 may determine the content recognition result 43 from the sequence features. In the branch for language classification, the N-gram feature vector 44 may be determined from the feature vector 42, and the second convolutional neural network 440 may determine the language recognition result 45 from the N-gram feature vector.
Next, the first loss 46 may be determined from the content recognition result 43 and the content tag of the sample picture 41, and the second loss 47 may be determined from the language recognition result 45 and the language tag of the sample picture 41. The total loss 48 is then determined based on the first loss 46 and the second loss 47. And adjusting parameters of the character recognition model according to the total loss 48, namely realizing error return.
According to the embodiments of the present disclosure, the two branches for multilingual character recognition and language classification share the underlying feature vector, and forward computation and error backpropagation are carried out for both simultaneously. The two tasks complement each other, which improves generalization.
In addition, the language category helps distinguish visually similar characters and improves recognition precision for each language, for example the English character n and a similar-looking Russian character (shown as an image in the original publication, Figure BDA0003211189890000071). Conversely, characters unique to a language also help classify the language category; for example, the character shown as Figure BDA0003211189890000072 appears in Russian, Ukrainian and related languages. According to the character recognition model of the embodiments of the present disclosure, extracting the n-gram feature vector from the image convolution feature vector exploits the semantic correlation between adjacent characters, which can further improve language classification precision.
Fig. 5 schematically shows a flow chart of a method of recognizing text according to an embodiment of the present disclosure.
As shown in fig. 5, the method includes acquiring a picture to be recognized including text information in operation S510.
Then, in operation S520, the picture to be recognized is input into the character recognition model, and a content recognition result and a language recognition result of the picture to be recognized are obtained.
According to the embodiment of the present disclosure, the character recognition model may be obtained by training according to the training method of the character recognition model shown above, for example. The output of the character recognition model may include a content recognition result and a language recognition result. The content identification result may be used to indicate text information included in the picture to be identified, and the language identification result may be used to indicate a language corresponding to the text information.
The method for recognizing text shown above is further described with reference to fig. 6 in conjunction with the specific embodiments. Those skilled in the art will appreciate that the following example embodiments are only for the understanding of the present disclosure, and the present disclosure is not limited thereto.
Fig. 6 schematically shows a schematic diagram of a method of recognizing text according to an embodiment of the present disclosure.
As shown in fig. 6, according to an embodiment of the present disclosure, the character recognition model may include a first convolutional neural network, a recurrent neural network, a connectionist temporal classification network, and a second convolutional neural network. Based on this, the first convolutional neural network 610 may determine the feature vector 62 of the picture to be recognized 61. The recurrent neural network 620 can then determine sequence features from the feature vector 62, and the connectionist temporal classification network 630 can determine the content recognition result 63 for the picture to be recognized 61 from the sequence features. On the other hand, the N-gram feature vector 64 can be determined from the feature vector 62, and the second convolutional neural network 640 can determine the language recognition result 65 for the picture to be recognized 61 from the N-gram feature vector 64.
FIG. 7 schematically illustrates a block diagram of a training apparatus for a character recognition model according to an embodiment of the present disclosure.
As shown in fig. 7, the training apparatus 700 for the character recognition model may include a first determining module 710, a second determining module 720 and a training module 730.
The first determining module 710 may be configured to determine a plurality of first sample pictures, and content tags and language tags of the plurality of first sample pictures, according to a plurality of monolingual corpora.
The second determining module 720 may be configured to determine a plurality of second sample pictures, and content tags and language tags of the plurality of second sample pictures, according to a plurality of mixed-language corpora.
The training module 730 may be configured to train the character recognition model according to the plurality of first sample pictures, the content tags and language tags of the plurality of first sample pictures, the plurality of second sample pictures, and the content tags and language tags of the plurality of second sample pictures.
According to an embodiment of the present disclosure, the first determining module may include a first generating sub-module, a first content tag determining sub-module, and a first language tag determining sub-module. The first generating sub-module may be configured to generate, for each monolingual corpus of the plurality of monolingual corpora, a picture including the monolingual corpus as the first sample picture. The first content tag determining sub-module may be configured to determine the content tag of the first sample picture according to the text content of the monolingual corpus. And the first language label determining submodule can be used for determining the language label of the first sample picture according to the language of the monolingual corpus.
According to an embodiment of the present disclosure, the apparatus may further include a splicing module, which may be configured to perform mixed splicing processing on original corpora of multiple languages to obtain the plurality of mixed-language corpora.
According to an embodiment of the present disclosure, the second determining module may include a second generating sub-module, a second content tag determining sub-module, and a second language tag determining sub-module. The second generating sub-module may be configured to generate, for each mixed language corpus of the multiple mixed language corpora, a picture including the mixed language corpus as the second sample picture. The second content tag determination submodule may be configured to determine the content tag of the second sample picture according to the text content of the mixed corpus. And the second language label determining submodule may be configured to determine the language label of the second sample picture according to the language of the mixed language corpus.
According to an embodiment of the present disclosure, the training module may include a recognition sub-module, a first loss determination sub-module, a second loss determination sub-module, and an adjustment sub-module. The recognition sub-module may be configured to determine a content recognition result and a language recognition result of one sample picture among the first sample pictures and the second sample pictures by using the character recognition model. The first loss determination sub-module may be configured to determine a first loss according to the content recognition result and the content tag of the sample picture, and to determine a second loss according to the language recognition result and the language tag of the sample picture. The second loss determination sub-module may be configured to determine a total loss based on the first loss and the second loss. The adjustment sub-module may be configured to adjust parameters of the character recognition model according to the total loss, and to return to the step of determining a content recognition result and a language recognition result using the character recognition model for another sample picture among the first sample pictures and the second sample pictures.
According to an embodiment of the present disclosure, the character recognition model may include a first convolutional neural network, a recurrent neural network, a connectionist temporal classification network, and a second convolutional neural network.
According to an embodiment of the present disclosure, the recognition sub-module includes a feature vector determination unit, a content recognition unit, and a language recognition unit. The feature vector determination unit may be configured to determine a feature vector of the sample picture using the first convolutional neural network. The content recognition unit may be configured to determine sequence features from the feature vector using the recurrent neural network, and to determine the content recognition result from the sequence features using the connectionist temporal classification network. The language recognition unit may be configured to determine a multivariate feature vector from the feature vector, and to determine the language recognition result from the multivariate feature vector using the second convolutional neural network.
Fig. 8 schematically shows a block diagram of an apparatus for recognizing text according to an embodiment of the present disclosure.
As shown in fig. 8, the apparatus 800 for recognizing text may include an obtaining module 810 and an inputting module 820.
The obtaining module 810 may be configured to obtain a picture to be recognized that contains text information.
the input module 820 may be configured to input the picture to be recognized into the text recognition model, and obtain a content recognition result and a language recognition result of the picture to be recognized, where the content recognition result is used to represent text information included in the picture to be recognized, and the language recognition result is used to represent a language corresponding to the text information.
According to an embodiment of the present disclosure, the character recognition model is trained by the training apparatus of the character recognition model above.
It should be noted that, in the technical solution of the present disclosure, the acquisition, storage and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations, and do not violate public order or good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 schematically shows a block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a read-only memory (ROM) 902 or a computer program loaded from a storage unit 908 into a random access memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the respective methods and processes described above, such as the training method of a character recognition model and the method of recognizing characters. For example, in some embodiments, the training method of the character recognition model and the method of recognizing characters may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the training method of the character recognition model and the method of recognizing characters described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the training method of the character recognition model and the method of recognizing characters by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A training method of a character recognition model comprises the following steps:
determining a plurality of first sample pictures and content labels and language labels of the plurality of first sample pictures according to a plurality of monolingual corpora;
determining a plurality of second sample pictures and content labels and language labels of the plurality of second sample pictures according to a plurality of mixed language corpora; and
training a character recognition model according to the plurality of first sample pictures, the content labels and language labels of the plurality of first sample pictures, the plurality of second sample pictures, and the content labels and language labels of the plurality of second sample pictures.
2. The method of claim 1, wherein the determining a plurality of first sample pictures and content labels and language labels of the plurality of first sample pictures according to the plurality of monolingual corpora comprises:
for each monolingual corpus in the plurality of monolingual corpora:
generating a picture containing the monolingual corpus as the first sample picture;
determining a content label of the first sample picture according to the text content of the monolingual corpus; and
determining a language label of the first sample picture according to the language of the monolingual corpus.
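As a rough illustration of the per-corpus labelling in claim 2, the loop below builds one labelled sample per monolingual corpus. `render_text_image` is a hypothetical placeholder for a real text-to-image renderer (e.g. one drawing glyphs onto a canvas), and the data layout is illustrative only.

```python
from dataclasses import dataclass


@dataclass
class LabelledSample:
    picture: bytes       # rendered image containing the corpus text
    content_label: str   # the text content of the corpus
    language_label: str  # language of the corpus, e.g. "zh" or "en"


def render_text_image(text: str) -> bytes:
    # Placeholder: a real implementation would draw `text` onto an image.
    return text.encode("utf-8")


def make_first_samples(monolingual_corpora):
    """One first sample picture per (text, language) monolingual corpus,
    labelled with its own text content and its own language."""
    return [
        LabelledSample(render_text_image(text), text, language)
        for text, language in monolingual_corpora
    ]


samples = make_first_samples([("你好世界", "zh"), ("hello world", "en")])
```

Because each corpus is monolingual, its language label can be assigned directly from the corpus itself, with no manual annotation of the rendered picture.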
3. The method of claim 1, further comprising:
performing mixed splicing processing on original corpora of a plurality of languages to obtain the plurality of mixed language corpora.
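One possible reading of the mixed splicing in claim 3 is sketched below: fragments drawn from corpora of different languages are concatenated into mixed-language corpora. The two-fragment scheme and uniform sampling are illustrative choices of this sketch; the claim does not fix a particular splicing strategy.

```python
import random


def make_mixed_corpora(corpora_by_language, count, seed=0):
    """Splice fragments from different languages into mixed-language corpora.

    corpora_by_language: mapping from language code to a list of text fragments.
    """
    rng = random.Random(seed)
    languages = sorted(corpora_by_language)
    mixed = []
    for _ in range(count):
        # Pick two distinct languages and concatenate one fragment of each.
        lang_a, lang_b = rng.sample(languages, 2)
        mixed.append(
            rng.choice(corpora_by_language[lang_a])
            + rng.choice(corpora_by_language[lang_b])
        )
    return mixed


mixed = make_mixed_corpora({"zh": ["你好"], "en": ["hello"]}, count=3)
```

The resulting strings can then be rendered and labelled exactly as the monolingual corpora are, per claim 4.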
4. The method of claim 3, wherein the determining a plurality of second sample pictures and content labels and language labels of the plurality of second sample pictures according to the plurality of mixed language corpora comprises:
for each mixed language corpus in the plurality of mixed language corpora:
generating a picture containing the mixed language corpus as the second sample picture;
determining a content label of the second sample picture according to the text content of the mixed language corpus; and
determining a language label of the second sample picture according to the language of the mixed language corpus.
5. The method of claim 1, wherein the training a character recognition model according to the plurality of first sample pictures, the content labels and language labels of the plurality of first sample pictures, the plurality of second sample pictures, and the content labels and language labels of the plurality of second sample pictures comprises:
determining a content recognition result and a language recognition result of one sample picture of the plurality of first sample pictures and the plurality of second sample pictures by using the character recognition model;
determining a first loss according to the content recognition result and the content label of the sample picture, and determining a second loss according to the language recognition result and the language label of the sample picture;
determining a total loss according to the first loss and the second loss; and
adjusting parameters of the character recognition model according to the total loss, and returning to the step of determining a content recognition result and a language recognition result by using the character recognition model for another sample picture of the plurality of first sample pictures and the plurality of second sample pictures.
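The per-sample loss combination of claim 5 can be sketched as one training step. All names here are hypothetical stand-ins; the plain sum of the two losses is one choice consistent with the claim (a weighted sum would also qualify), and the actual backpropagation and parameter update are elided as a comment.

```python
from types import SimpleNamespace


def training_step(model, sample, content_loss_fn, language_loss_fn):
    """One step of claim 5: a first loss from the content branch, a second
    loss from the language branch, and a total loss combining the two."""
    content_pred, language_pred = model(sample.picture)
    first_loss = content_loss_fn(content_pred, sample.content_label)
    second_loss = language_loss_fn(language_pred, sample.language_label)
    total_loss = first_loss + second_loss
    # A real trainer would now backpropagate total_loss and adjust the
    # model parameters before moving on to the next sample picture.
    return total_loss


# Toy demo: 0/1 losses stand in for CTC and cross-entropy losses.
zero_one = lambda predicted, target: 0.0 if predicted == target else 1.0
sample = SimpleNamespace(picture=b"", content_label="abc", language_label="fr")
loss = training_step(lambda picture: ("abc", "en"), sample, zero_one, zero_one)
```

In the demo the content branch is correct and the language branch is wrong, so only the second loss contributes to the total.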
6. The method of claim 5, wherein the character recognition model comprises a first convolutional neural network, a recurrent neural network, a connectionist temporal classification (CTC) network, and a second convolutional neural network.
7. The method of claim 6, wherein the determining a content recognition result and a language recognition result of the sample picture using the character recognition model comprises:
determining a feature vector of the sample picture using the first convolutional neural network;
determining a sequence feature from the feature vector using the recurrent neural network, and determining the content recognition result from the sequence feature using the connectionist temporal classification network; and
determining a multivariate feature vector from the feature vector, and determining the language recognition result from the multivariate feature vector using the second convolutional neural network.
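The dual-branch forward pass of claim 7 can be sketched as a data-flow skeleton: both branches share the features of the first convolutional neural network, the content branch goes through the recurrent network and CTC decoding, and the language branch goes through the second convolutional network. Every callable below is a hypothetical stand-in; shapes and real decoding are elided.

```python
def dual_branch_forward(picture, first_cnn, rnn, ctc_decode, second_cnn):
    """Claim 7 data flow: one shared feature extractor, two output branches."""
    feature_vector = first_cnn(picture)             # shared visual features
    sequence_feature = rnn(feature_vector)          # sequence features for content
    content_result = ctc_decode(sequence_feature)   # text via CTC decoding
    language_result = second_cnn(feature_vector)    # language from same features
    return content_result, language_result


# Toy stand-ins just to exercise the data flow.
content, language = dual_branch_forward(
    picture="img",
    first_cnn=lambda p: ["f1", "f2"],
    rnn=lambda f: f,
    ctc_decode=lambda s: "".join(s),
    second_cnn=lambda f: "en",
)
```

Sharing `feature_vector` between the two branches is the structural point: the language head adds little cost on top of the content head, since the expensive convolutional features are computed once.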
8. A method of recognizing text, comprising:
acquiring a picture to be identified containing text information;
inputting the picture to be recognized into a character recognition model to obtain a content recognition result and a language recognition result of the picture to be recognized, wherein the content recognition result is used for representing the text information contained in the picture to be recognized, and the language recognition result is used for representing the language corresponding to the text information,
wherein the character recognition model is trained according to the method of any one of claims 1-7.
9. An apparatus for training a character recognition model, comprising:
the first determining module is used for determining a plurality of first sample pictures and content labels and language labels of the plurality of first sample pictures according to a plurality of monolingual corpora;
the second determining module is used for determining a plurality of second sample pictures and content labels and language labels of the plurality of second sample pictures according to a plurality of mixed language materials; and
the training module is used for training the character recognition model according to the plurality of first sample pictures, the content labels and language labels of the plurality of first sample pictures, the plurality of second sample pictures, and the content labels and language labels of the plurality of second sample pictures.
10. The apparatus of claim 9, wherein the first determining module comprises:
a first generating sub-module, configured to generate, for each monolingual corpus of the plurality of monolingual corpora, a picture including the monolingual corpus as the first sample picture;
a first content tag determination submodule, configured to determine a content tag of the first sample picture according to the text content of the monolingual corpus; and
the first language label determining submodule is used for determining the language label of the first sample picture according to the language of the monolingual corpus.
11. The apparatus of claim 9, further comprising:
the splicing module is used for performing mixed splicing processing on original corpora of a plurality of languages to obtain the plurality of mixed language corpora.
12. The apparatus of claim 11, wherein the second determining module comprises:
a second generating sub-module, configured to generate, for each mixed language corpus of the multiple mixed language corpora, a picture including the mixed language corpus as the second sample picture;
a second content tag determination submodule, configured to determine a content tag of the second sample picture according to the text content of the mixed language corpus; and
the second language label determining submodule is used for determining the language label of the second sample picture according to the language of the mixed language corpus.
13. The apparatus of claim 9, wherein the training module comprises:
an identification submodule, configured to determine a content recognition result and a language recognition result of one sample picture of the plurality of first sample pictures and the plurality of second sample pictures by using the character recognition model;
a first loss determining submodule, configured to determine a first loss according to the content recognition result and the content label of the sample picture, and a second loss according to the language recognition result and the language label of the sample picture;
a second loss determination submodule for determining a total loss based on the first loss and the second loss; and
the adjusting submodule is used for adjusting the parameters of the character recognition model according to the total loss, and returning to the step of determining a content recognition result and a language recognition result by using the character recognition model for another sample picture of the plurality of first sample pictures and the plurality of second sample pictures.
14. The apparatus of claim 13, wherein the character recognition model comprises a first convolutional neural network, a recurrent neural network, a connectionist temporal classification network, and a second convolutional neural network.
15. The apparatus of claim 14, wherein the identification submodule comprises:
a feature vector determination unit for determining a feature vector of the sample picture using the first convolutional neural network;
a content recognition unit for determining a sequence feature from the feature vector using the recurrent neural network, and determining the content recognition result from the sequence feature using the connectionist temporal classification network; and
a language identification unit for determining a multivariate feature vector from the feature vector, and determining the language recognition result from the multivariate feature vector using the second convolutional neural network.
16. An apparatus for recognizing a character, comprising:
the acquisition module is used for acquiring the picture to be identified containing the text information;
an input module, configured to input the picture to be recognized into a character recognition model to obtain a content recognition result and a language recognition result of the picture to be recognized, where the content recognition result is used to represent the text information contained in the picture to be recognized, and the language recognition result is used to represent the language corresponding to the text information,
wherein the character recognition model is trained by the apparatus of any one of claims 9-15.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
CN202110934328.0A 2021-08-13 2021-08-13 Training method of character recognition model, and method and device for recognizing characters Pending CN113657391A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110934328.0A CN113657391A (en) 2021-08-13 2021-08-13 Training method of character recognition model, and method and device for recognizing characters
PCT/CN2022/104891 WO2023016163A1 (en) 2021-08-13 2022-07-11 Method for training text recognition model, method for recognizing text, and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110934328.0A CN113657391A (en) 2021-08-13 2021-08-13 Training method of character recognition model, and method and device for recognizing characters

Publications (1)

Publication Number Publication Date
CN113657391A true CN113657391A (en) 2021-11-16

Family

ID=78480310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110934328.0A Pending CN113657391A (en) 2021-08-13 2021-08-13 Training method of character recognition model, and method and device for recognizing characters

Country Status (2)

Country Link
CN (1) CN113657391A (en)
WO (1) WO2023016163A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023016163A1 (en) * 2021-08-13 2023-02-16 北京百度网讯科技有限公司 Method for training text recognition model, method for recognizing text, and apparatus

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108648747A (en) * 2018-03-21 2018-10-12 清华大学 Language recognition system
CN109948696A (en) * 2019-03-19 2019-06-28 上海七牛信息技术有限公司 A kind of multilingual scene character recognition method and system
CN110110777A (en) * 2019-04-28 2019-08-09 网易有道信息技术(北京)有限公司 Image processing method and training method and device, medium and calculating equipment
CN111401374A (en) * 2020-03-06 2020-07-10 湖南快乐阳光互动娱乐传媒有限公司 Model training method based on multiple tasks, character recognition method and device
CN112288018A (en) * 2020-10-30 2021-01-29 北京市商汤科技开发有限公司 Training method of character recognition network, character recognition method and device
US20210049443A1 (en) * 2019-08-15 2021-02-18 Sap Se Densely connected convolutional neural network for service ticket classification
CN112613324A (en) * 2020-12-29 2021-04-06 北京中科闻歌科技股份有限公司 Semantic emotion recognition method, device, equipment and storage medium
CN112883149A (en) * 2021-01-20 2021-06-01 华为技术有限公司 Natural language processing method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021081562A2 (en) * 2021-01-20 2021-04-29 Innopeak Technology, Inc. Multi-head text recognition model for multi-lingual optical character recognition
CN112883968B (en) * 2021-02-24 2023-02-28 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN113033660B (en) * 2021-03-24 2022-08-02 支付宝(杭州)信息技术有限公司 Universal language detection method, device and equipment
CN113657391A (en) * 2021-08-13 2021-11-16 北京百度网讯科技有限公司 Training method of character recognition model, and method and device for recognizing characters


Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
BANSAL, N. et al.: "Word level language identification of English-Punjabi code-mixed social media text", Research Cell: An International Journal of Engineering Sciences, vol. 33, pages 24-32 *
JAMES MUTINDA et al.: "Lexicon-pointed hybrid N-gram Features Extraction Model (LeNFEM) for sentence level sentiment analysis", Engineering Reports (Hoboken, N.J.), vol. 03, no. 08, pages 1-17 *
NAGAR, AJAY et al.: "Text Classification using Gated Fusion of n-gram Features and Semantic Features", Computación y Sistemas, vol. 23, no. 03, pages 1015-1020 *
PENGYUAN LIU: "Word Sense Disambiguation based on Expanding Training Set Automatically", International Journal of Knowledge and Language Processing, no. 2, pages 44-166 *
WANG HAITAO et al.: "A Short Text Classification Method Based on N-Gram and CNN", Chinese Journal of Electronics, vol. 29, no. 04, pages 248-254, XP006090526, DOI: 10.1049/cje.2020.01.001 *
ZHENGUO YAN et al.: "A Neural N-Gram Network for Text Classification", Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 22, no. 03, pages 380-386 *
ZHANG Linlin et al.: "Language identification method for short texts of similar languages based on deep learning", Computer Applications and Software, vol. 37, no. 02, pages 124-129 *
WANG Jianxin; WANG Ziya; TIAN Xuan: "A survey of natural scene text detection and recognition based on deep learning", Journal of Software, vol. 31, no. 05, pages 1465-1496 *
LU Chaohong: "Text classification method based on multi-channel recurrent convolutional neural networks", Computer Applications and Software, vol. 37, no. 08, pages 282-288 *


Also Published As

Publication number Publication date
WO2023016163A1 (en) 2023-02-16

Similar Documents

Publication Publication Date Title
CN109214386B (en) Method and apparatus for generating image recognition model
CN113435529A (en) Model pre-training method, model training method and image processing method
CN112487826A (en) Information extraction method, extraction model training method and device and electronic equipment
CN114595686B (en) Knowledge extraction method, and training method and device of knowledge extraction model
CN112926306A (en) Text error correction method, device, equipment and storage medium
CN116012481B (en) Image generation processing method and device, electronic equipment and storage medium
CN113360699A (en) Model training method and device, image question answering method and device
CN115982376A (en) Method and apparatus for training models based on text, multimodal data and knowledge
CN111127191A (en) Risk assessment method and device
CN112989097A (en) Model training and picture retrieval method and device
CN114417879B (en) Method and device for generating cross-language text semantic model and electronic equipment
CN112906368B (en) Industry text increment method, related device and computer program product
WO2023016163A1 (en) Method for training text recognition model, method for recognizing text, and apparatus
CN113553428A (en) Document classification method and device and electronic equipment
CN113761923A (en) Named entity recognition method and device, electronic equipment and storage medium
CN113626441A (en) Text management method, device and equipment based on scanning equipment and storage medium
US20230377225A1 (en) Method and apparatus for editing an image and method and apparatus for training an image editing model, device and medium
CN114970666B (en) Spoken language processing method and device, electronic equipment and storage medium
CN115017922A (en) Method and device for translating picture, electronic equipment and readable storage medium
CN113361522B (en) Method and device for determining character sequence and electronic equipment
US20210342379A1 (en) Method and device for processing sentence, and storage medium
CN114661904A (en) Method, apparatus, device, storage medium, and program for training document processing model
CN114048315A (en) Method and device for determining document tag, electronic equipment and storage medium
CN114118049A (en) Information acquisition method and device, electronic equipment and storage medium
CN113239273A (en) Method, device, equipment and storage medium for generating text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination