CN115631502A - Character recognition method, character recognition device, model training method, electronic device and medium - Google Patents

Character recognition method, character recognition device, model training method, electronic device and medium Download PDF

Info

Publication number
CN115631502A
CN115631502A
Authority
CN
China
Prior art keywords
feature extraction
training
semantic
visual
extraction model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211297956.3A
Other languages
Chinese (zh)
Inventor
乔美娜
吕鹏原
刘珊珊
章成全
姚锟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211297956.3A priority Critical patent/CN115631502A/en
Publication of CN115631502A publication Critical patent/CN115631502A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19127Extracting features by transforming the feature space, e.g. multidimensional scaling; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/24Character recognition characterised by the processing or recognition method

Abstract

The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of deep learning, image processing and computer vision, and can be applied to scenes such as OCR (optical character recognition). The specific implementation scheme is as follows: training a first neural network based on an image sample to obtain a visual feature extraction model; training a second neural network based on a character sample to obtain a semantic feature extraction model; training the visual feature extraction model based on the image sample; acquiring a text corresponding to the image sample based on the visual features output during training of the visual feature extraction model; and training the semantic feature extraction model based on the text until the visual feature extraction model and the semantic feature extraction model converge. Before the character recognition model is trained, the sub-model for extracting visual features and the sub-model for extracting semantic features are pre-trained separately, which improves the robustness of the character recognition model and the precision of character recognition.

Description

Character recognition method, character recognition device, model training method, electronic device and medium
Technical Field
The present disclosure relates to the technical field of artificial intelligence, and in particular to the technical field of deep learning, image processing, and computer vision, which can be applied to scenes such as Optical Character Recognition (OCR).
Background
With the iterative upgrading of computing resources and the development of deep learning, OCR technology has matured and plays an important role in scenarios such as transportation and card/certificate recognition. However, images captured in natural scenes inevitably contain interference such as lighting and noise. In addition, because images are often captured manually, characters may be incompletely captured or occluded, which degrades the recognition performance of OCR.
An existing character recognition method proceeds as follows: a picture containing characters is input, and candidate character regions are selected with bounding boxes; the corresponding character regions are then cropped from the original image based on the candidate boxes and fed into a character recognition model to obtain the final recognition result. On the one hand, when facing external interference such as illumination and noise, the robustness of the model to a particular kind of interference can be improved by adding corresponding training data, but collecting/generating data, labeling data and training the model all cost time and labor, the learning capability of the model is limited by its structure, and not every scenario can be covered. On the other hand, when characters in a picture are incomplete, the model may misrecognize or miss characters; the currently effective remedy is to correct the recognition result with an error-correction module after recognition, but this two-stage approach performs poorly, its effectiveness depends entirely on the error-correction module, and it is difficult to deploy in practice.
Disclosure of Invention
The disclosure provides a character recognition method, a character recognition device, a training method of a character recognition model, a training device, an electronic device and a storage medium.
According to a first aspect of the present disclosure, there is provided a model training method, comprising:
training a first neural network based on the image sample to obtain a visual feature extraction model;
training a second neural network based on the character samples to obtain a semantic feature extraction model;
training the visual feature extraction model based on the image sample;
acquiring a text corresponding to the image sample based on the visual features output in the training process of the visual feature extraction model;
and training the semantic feature extraction model based on the text until the visual feature extraction model and the semantic feature extraction model are converged.
According to a second aspect of the present disclosure, there is provided a character recognition method including:
acquiring a character image to be recognized;
extracting visual features of the character image to be recognized;
extracting semantic features based on the visual features;
performing feature fusion on the visual features and the semantic features;
and performing text prediction based on the feature after feature fusion to obtain a character recognition result.
According to a third aspect of the present disclosure, there is provided a model training apparatus comprising:
the first training module is configured to train a first neural network based on the image sample to obtain a visual feature extraction model;
the second training module is configured to train the second neural network based on the character samples to obtain a semantic feature extraction model;
a third training module configured to train the visual feature extraction model based on the image sample;
the third training module acquires a text corresponding to the image sample based on the visual features output in the training process of the visual feature extraction model;
and the third training module trains the semantic feature extraction model based on the text until the visual feature extraction model and the semantic feature extraction model are converged.
According to a fourth aspect of the present disclosure, there is provided a character recognition apparatus including:
the acquisition module is configured to acquire a character image to be recognized;
the first feature extraction module is configured to extract visual features of the character image to be recognized;
a second feature extraction module configured to extract semantic features based on the visual features;
a feature fusion module configured to feature fuse the visual features with the semantic features;
and the character recognition module is configured to perform text prediction to obtain a character recognition result based on the feature after feature fusion.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method according to any one of the above aspects.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the above-mentioned aspects.
The present disclosure provides a character recognition method, a character recognition apparatus, a model training method, a model training apparatus, an electronic device, and a storage medium, in which a sub-model for extracting visual features and a sub-model for extracting semantic features are separately pre-trained before training a character recognition model, thereby improving robustness of the character recognition model and improving precision of character recognition.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram illustrating steps of a text recognition model training method in an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of the training steps of a visual feature extraction model in an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of the training steps of the semantic feature extraction model in an embodiment of the disclosure;
FIG. 4 is a flow chart illustrating a text recognition method in an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating steps of a text recognition method according to an embodiment of the present disclosure;
FIG. 6 is a functional block diagram of a model training apparatus in an embodiment of the present disclosure;
FIG. 7 is a functional block diagram of a text recognition device in an embodiment of the present disclosure;
fig. 8 is a schematic block diagram of an example electronic device in an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present disclosure provides a model training method, as shown in fig. 1, including:
step S101, training a first neural network based on an image sample to obtain a visual feature extraction model. The image sample may be image data containing text and as shown in fig. 2, the input to the first neural network may be a picture containing line text. The basic structure of the first neural network may include an Encoder module 201 and a Decoder module 202 based on the principle of self-supervision, the Encoder module 201 is used for extracting image features, the Decoder module 202 is used for reconstructing input, and the loss of model training is the reconstruction loss of output and input (generally, L2 loss). And after the model training is converged, the capability of extracting the visual features is achieved.
Step S102, training a second neural network based on a character sample to obtain a semantic feature extraction model. The character sample may be pure text data; as shown in fig. 3, the input of the second neural network may be the plain text "Hello World". The second neural network may adopt an MLM (Masked Language Model) so that the language model is trained in a targeted manner, which further improves the robustness of the language model compared with using image data as training samples.
Step S103, training the visual feature extraction model based on the image sample. After the pre-training in step S101 and step S102, the visual feature extraction model and the semantic feature extraction model are obtained respectively; training is then performed simultaneously on the basis of these two sub-models to obtain a text recognition model.
Step S104, acquiring a text corresponding to the image sample based on the visual features output during the training of the visual feature extraction model. The image sample is input into the visual feature extraction model, the corresponding visual features are output, and the corresponding text is obtained by classification based on the visual features.
Step S105, training the semantic feature extraction model based on the text until the visual feature extraction model and the semantic feature extraction model converge. As shown in fig. 4, the visual feature extraction model and the semantic feature extraction model are first introduced, then pictures containing line text are input and the whole pipeline is trained together until the two sub-models converge, yielding the text recognition model shown in fig. 4; character recognition is then performed based on this model.
In the present disclosure, before the character recognition model is trained, its sub-models, namely the visual feature extraction model and the semantic feature extraction model, are pre-trained; the two sub-models extract visual features and semantic features respectively, and character recognition is performed by fusing the visual features with the semantic features. Pre-training the visual feature extraction model and the semantic feature extraction model improves the robustness of the model; the trained model fuses visual and semantic features, which greatly reduces the influence of noise or occlusion on character recognition and improves its accuracy. The model also has strong universality and portability and can be widely applied to OCR scenarios, expanding the application field of OCR.
As an optional implementation manner, in step S102, training the second neural network based on the character sample to obtain the semantic feature extraction model includes: randomly masking at least one character in the character sample, and converting all characters in the randomly masked character sample into corresponding first character identification codes; and inputting all the first character identification codes into the second neural network (MLM) for training to obtain the semantic feature extraction model.
As shown in fig. 3, the semantic feature extraction model may be used to extract semantic information, and its pre-training process includes: step S301, inputting the plain text "Hello World" into the MLM model; step S302, randomly masking the character "e" in "Hello World" to obtain "H[M]llo World"; step S303, converting all characters of "H[M]llo World" into corresponding character ids (i.e. character identification codes) and feeding them into the language model (MLM); and step S304, predicting the masked character with a classification task through the Head module. The training loss is cross entropy; after training converges, the text model has the capability of extracting semantic information and can correctly predict randomly masked characters from the semantic information. Because plain text data is used in pre-training and a certain proportion of the characters in the text data are randomly masked, the robustness of the language model is enhanced, and when extracting semantic features the language model can use context to predict characters that were incompletely captured or occluded, thereby improving the accuracy of character recognition.
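For illustration, a minimal sketch of this masked-character pre-training is given below. PyTorch is assumed, and the vocabulary, the 15% mask ratio and the TinyMLM architecture are illustrative assumptions rather than details taken from the disclosure; only the random masking, the conversion to character ids and the cross-entropy loss follow the steps described above.

import random
import torch
import torch.nn as nn

# Illustrative character vocabulary; the real character set is not specified here.
VOCAB = ["[PAD]", "[MASK]"] + list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ")
CHAR2ID = {c: i for i, c in enumerate(VOCAB)}
MASK_ID = CHAR2ID["[MASK]"]

def mask_and_encode(text, mask_ratio=0.15):
    """Randomly mask characters and convert the whole sequence to character ids."""
    ids, labels = [], []
    for ch in text:
        cid = CHAR2ID.get(ch, CHAR2ID["[PAD]"])
        if random.random() < mask_ratio:
            ids.append(MASK_ID)   # e.g. "Hello World" -> "H[M]llo World"
            labels.append(cid)    # only masked positions contribute to the loss
        else:
            ids.append(cid)
            labels.append(-100)   # ignored by CrossEntropyLoss
    if all(l == -100 for l in labels):  # ensure at least one masked position
        ids[0], labels[0] = MASK_ID, CHAR2ID.get(text[0], CHAR2ID["[PAD]"])
    return torch.tensor([ids]), torch.tensor([labels])

class TinyMLM(nn.Module):
    """A small Transformer encoder standing in for the masked language model."""
    def __init__(self, vocab_size=len(VOCAB), dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)  # the classification "Head" module

    def forward(self, ids):
        return self.head(self.backbone(self.embed(ids)))

mlm = TinyMLM()
criterion = nn.CrossEntropyLoss(ignore_index=-100)  # cross-entropy, as stated above

ids, labels = mask_and_encode("Hello World")
logits = mlm(ids)                                   # (1, T, vocab_size)
loss = criterion(logits.view(-1, logits.size(-1)), labels.view(-1))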
As an alternative embodiment, training the semantic feature extraction model based on the text in step S105 includes: converting characters in the text into corresponding second character recognition codes; and inputting the second character recognition codes into the semantic feature extraction model for training.
As shown in fig. 4, when training the character recognition model, the coding module 401 (Encoder) of the visual feature extraction model and the semantic feature extraction model 402 (MLM model) are introduced first, and then the whole pipeline shown in fig. 4 is trained together until convergence. The specific training process is as follows: an image containing line text is input into the model and its visual features are extracted by the Encoder module 401; after the visual features are extracted, the process splits into two parallel branches. In one branch, the visual features extracted by the Encoder module 401 are classified by the first classification module 403, the character ids corresponding to the characters are then predicted by the calculation module 404 (comprising Softmax and Argmax functions), and the character ids are input into the MLM (Masked Language Model) 402 to extract semantic features. In the other branch, the visual features extracted by the Encoder module 401 are mapped by the mapping module 405 so that the visual features and the semantic features lie in the same feature space; the visual features and the semantic information are then fused by the feature fusion module 406, and text prediction is performed by the second classification module 407. In the training process shown in fig. 4, the weights of the coding module 401 and the semantic feature extraction model 402 are fixed and not updated during the first two rounds, only the weights of the other modules are updated; after these two rounds of training, the weights of all modules are updated together.
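For illustration, the wiring of fig. 4 can be sketched as follows. PyTorch is assumed; RecognitionModel and the linear placeholder modules are hypothetical names and shapes (the encoder is assumed to return per-position visual features and the MLM to return per-position semantic features of the same width), while the two-branch structure, the Softmax+Argmax step, the mapping into a shared feature space, the fusion, and the freezing of the pre-trained weights for the first two rounds follow the description above.

import torch
import torch.nn as nn

class RecognitionModel(nn.Module):
    """Hypothetical wiring of the modules in fig. 4; shapes are assumptions."""
    def __init__(self, encoder, mlm, feat_dim, vocab_size):
        super().__init__()
        self.encoder = encoder  # pre-trained visual Encoder (module 401)
        self.mlm = mlm          # pre-trained semantic model (module 402), assumed to
                                # map character ids to per-position semantic features
        self.cls1 = nn.Linear(feat_dim, vocab_size)    # first classification module 403
        self.project = nn.Linear(feat_dim, feat_dim)   # mapping module 405
        self.fuse = nn.Linear(feat_dim * 2, feat_dim)  # feature fusion module 406
        self.cls2 = nn.Linear(feat_dim, vocab_size)    # second classification module 407

    def forward(self, images):
        visual = self.encoder(images)   # assumed (N, T, feat_dim) visual features
        # Branch 1: visual features -> characters -> character ids -> semantic features.
        # Softmax + Argmax (module 404) yields discrete, non-differentiable ids.
        char_ids = torch.argmax(torch.softmax(self.cls1(visual), dim=-1), dim=-1)
        semantic = self.mlm(char_ids)   # assumed (N, T, feat_dim) semantic features
        # Branch 2: map visual features into the same feature space and fuse.
        mapped = self.project(visual)
        fused = self.fuse(torch.cat([mapped, semantic], dim=-1))
        return self.cls2(fused)         # text prediction logits

def set_pretrained_frozen(model, frozen):
    """Freeze or unfreeze the two pre-trained sub-models."""
    for p in list(model.encoder.parameters()) + list(model.mlm.parameters()):
        p.requires_grad = not frozen

# Training-loop sketch: the first two rounds update only the newly added modules.
# for epoch in range(num_epochs):
#     set_pretrained_frozen(model, frozen=(epoch < 2))
#     ...train on line-text images with a cross-entropy loss on the output logits...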
It should be noted that, after the Encoder module 401 performs feature extraction, if the visual features were used directly for text prediction, the characters in the image may be incomplete and the incomplete characters would interfere with the model, which may lead to misrecognition; after the semantic features are fused with the visual features, however, the recognition result can be adjusted in time and the correct characters are output. No error correction is needed after the recognition result is obtained, which effectively improves the accuracy of character recognition and reduces misrecognition and missed recognition.
The present disclosure provides a character recognition method, as shown in fig. 5, including: step S501, acquiring a character image to be recognized, such as the image shown in fig. 4 containing the line text "a to y bear"; step S502, extracting visual features of the character image to be recognized; step S503, extracting semantic features based on the visual features; step S504, performing feature fusion on the visual features and the semantic features; and step S505, performing text prediction based on the fused features to obtain a character recognition result. Steps S502 to S505 are executed by the text recognition model obtained with the model training method of the above embodiment: the visual features of the character image to be recognized are extracted by the visual feature extraction model within the text recognition model, the semantic features are further extracted by the semantic feature extraction model based on the visual features, and finally the text recognition model fuses the visual features with the semantic features and performs text prediction.
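As a usage sketch (hypothetical helper names, reusing the RecognitionModel wiring sketched above), steps S502 to S505 then amount to a single forward pass followed by decoding the predicted character ids:

import torch

def recognize(model, image_tensor, id2char):
    """Run a trained recognition model on one text-line image (steps S502-S505)."""
    model.eval()
    with torch.no_grad():
        logits = model(image_tensor.unsqueeze(0))   # extract, fuse and predict
        pred_ids = logits.argmax(dim=-1)[0]         # best character id per position
    return "".join(id2char[int(i)] for i in pred_ids)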
According to the method, the visual features and the semantic features are combined: when some characters in the image are occluded or incompletely captured, prediction can draw on the extracted semantic features, and during semantic feature extraction the language model can predict the missing characters from the context, so combining the semantic features improves the accuracy of character recognition.
As an optional implementation, obtaining the semantic features based on the visual features includes: classifying the visual features to obtain the characters corresponding to the character image; calculating the corresponding character identification codes based on the characters; and extracting the semantic features based on the character identification codes.
As shown in fig. 4, an image containing line text is first input into the model and its visual features are extracted; the visual features extracted by the Encoder module 401 are then classified by the first classification module 403 to obtain the characters in the image, the character ids corresponding to these characters are computed by the SA (Softmax + Argmax) module 404, and the character ids are input into the MLM model 402 to extract semantic features. In addition, before the visual features and the semantic features are fused, the mapping module maps the visual features into the same feature space as the semantic features. Even if characters in the image are occluded, the language model can predict the occluded characters from the context, so fusing the semantic features into the visual features effectively improves the recognition accuracy for occluded characters.
The present disclosure provides a model training apparatus 600, as shown in fig. 6, comprising:
the first training module 601 is configured to train the first neural network based on the image sample to obtain a visual feature extraction model. The image sample may be image data containing text and as shown in fig. 2, the input to the first neural network may be a picture containing line text. The basic structure of the first neural network may include an Encoder module 201 and a Decoder module 202 based on the principle of self-supervision, the Encoder module 201 is used for extracting image features, the Decoder module 202 is used for reconstructing input, and the loss of model training is the reconstruction loss of output and input (generally, L2 loss). And after the model training is converged, the capability of extracting the visual features is achieved.
The second training module 602 is configured to train the second neural network based on the character sample to obtain the semantic feature extraction model. The character sample may be pure text data; as shown in fig. 3, the input of the second neural network may be the plain text "Hello World". The second neural network may adopt an MLM (Masked Language Model) so that the language model is trained in a targeted manner, which further improves the robustness of the language model compared with using image data as training samples.
A third training module 603 configured to train the visual feature extraction model based on the image samples. After being pre-trained by the first training module 601 and the second training module 602, a visual feature extraction model and a semantic feature extraction model are respectively obtained, and then the third training module 603 performs simultaneous training based on two submodels of the visual feature extraction model and the semantic feature extraction model to obtain a text recognition model.
The third training module 603 obtains a text corresponding to the image sample based on the visual features output in the training process of the visual feature extraction model. And inputting the image sample into a visual feature extraction model, outputting corresponding visual features, and classifying to obtain corresponding texts based on the visual features.
The third training module 603 trains the semantic feature extraction model based on the text until the visual feature extraction model and the semantic feature extraction model converge. As shown in fig. 4, the visual feature extraction model and the semantic feature extraction model are first introduced, then pictures containing line text are input and the whole pipeline is trained together until the two sub-models converge, yielding the text recognition model shown in fig. 4; character recognition is then performed based on this model.
In the present disclosure, before the character recognition model is trained, its sub-models, namely the visual feature extraction model and the semantic feature extraction model, are pre-trained; the two trained sub-models are used to extract visual features and semantic features respectively, and character recognition is performed by fusing the visual features with the semantic features. Pre-training the visual feature extraction model and the semantic feature extraction model improves the robustness of the model; the trained model fuses visual and semantic features, which greatly reduces the influence of noise or occlusion on character recognition and improves its accuracy. The model also has strong universality and portability and can be widely applied to OCR scenarios, expanding the application field of OCR.
As an optional implementation manner, the second training module 602 training the second neural network based on the character sample to obtain the semantic feature extraction model includes: randomly masking at least one character in the character sample, and converting all characters in the randomly masked character sample into corresponding first character identification codes; and inputting all the first character identification codes into the second neural network for training to obtain the semantic feature extraction model. As shown in fig. 3, the semantic feature extraction model may be used to extract semantic information, and its pre-training process includes: step S301, inputting the plain text "Hello World" into the MLM model; step S302, randomly masking the character "e" in "Hello World" to obtain "H[M]llo World"; step S303, converting all characters of "H[M]llo World" into corresponding character ids and feeding them into the MLM model; and step S304, predicting the masked character with a classification task through the Head module and outputting the text prediction result "Hello World". The training loss is cross entropy; after training converges, the text model has the capability of extracting semantic information and can correctly predict randomly masked characters from the semantic information. Because plain text data is used in pre-training and a certain proportion of the characters in the text data are randomly masked, the robustness of the language model is enhanced, and when extracting semantic features the language model can use context to predict characters that were incompletely captured or occluded, thereby improving the accuracy of character recognition.
As an alternative embodiment, the third training module 603 training the semantic feature extraction model based on the text includes: converting characters in the text into corresponding second character recognition codes; and inputting the second character recognition codes into the semantic feature extraction model for training. As shown in fig. 4, when training the character recognition model, the coding module 401 (Encoder) of the visual feature extraction model and the semantic feature extraction model, i.e. the MLM model 402, are introduced first, and then the whole pipeline shown in fig. 4 is trained together until convergence. The specific training process is as follows: an image containing line text is input into the model and its visual features are extracted by the Encoder module 401; after the visual features are extracted, the process splits into two parallel branches. In one branch, the visual features extracted by the Encoder module 401 are classified by the first classification module 403, the character ids corresponding to the characters are predicted by the SA (Softmax + Argmax) module 404, and the character ids are then input into the MLM (Masked Language Model) 402 to extract semantic features. In the other branch, the visual features extracted by the Encoder module 401 are mapped by the mapping module 405 so that the visual features and the semantic features lie in the same feature space; the visual features and the semantic information are then fused by the feature fusion module 406, and text prediction is performed by the second classification module 407. In the training process shown in fig. 4, the weights of the coding module 401 and the semantic feature extraction model 402 are fixed and not updated during the first two rounds, only the weights of the other modules are updated; after these two rounds of training, the weights of all modules are updated together.
It should be noted that after the Encoder module 401 performs feature extraction, if the visual features are directly input into the classification module (Head) for text prediction (as shown by the dotted line in fig. 4), since the characters in the image may be incomplete, the model is interfered by the incomplete characters at this time, and a situation of misrecognition may occur, but after the semantic features are fused in the visual features, the recognition result may be adjusted in time, and correct characters are output. Error correction is not needed after the recognition result is obtained, the accuracy of character recognition can be effectively improved, and false recognition or missing recognition is reduced.
The present disclosure provides a character recognition apparatus 700, as shown in fig. 7, including: an obtaining module 701 configured to acquire a character image to be recognized; a first feature extraction module 702 configured to extract visual features of the character image to be recognized; a second feature extraction module 703 configured to extract semantic features based on the visual features; a feature fusion module 704 configured to perform feature fusion on the visual features and the semantic features; and a character recognition module 705 configured to perform text prediction based on the fused features to obtain a character recognition result. Because the visual features and the semantic features are combined, when some characters in the image are occluded or incompletely captured, prediction can draw on the extracted semantic features, and during semantic feature extraction the language model can predict the missing characters from the context, so combining the semantic features improves the accuracy of character recognition.
As an optional implementation manner, the second feature extraction module 703 obtaining semantic features based on the visual features includes: classifying the visual features to obtain the characters corresponding to the character image; calculating the corresponding character identification codes based on the characters; and extracting the semantic features based on the character identification codes. As shown in fig. 4, an image containing line text is first input into the model and its visual features are extracted; the visual features extracted by the Encoder module 401 are classified by the first classification module 403 to obtain the characters in the image, the character ids corresponding to these characters are computed by the SA (Softmax + Argmax) module 404, and the character ids are input into the MLM model 402 to extract semantic features. In addition, the character recognition apparatus further includes a mapping module that maps the visual features into the same feature space as the semantic features before the visual features and the semantic features are fused. Even if characters in the image are occluded, the language model can predict the occluded characters from the context, so fusing the semantic features into the visual features effectively improves the recognition accuracy for occluded characters.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the personal information of the related user all accord with the regulations of related laws and regulations, and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The calculation unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 performs the various methods and processes described above, such as a text recognition method or a model training method. For example, in some embodiments, the text recognition method or the model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto device 800 via ROM 802 and/or communications unit 809. When loaded into RAM 803 and executed by computing unit 801, a computer program may perform one or more steps of the above described text recognition method or model training method. Alternatively, in other embodiments, the computing unit 801 may be configured to perform a text recognition method or a model training method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A model training method, comprising:
training a first neural network based on the image sample to obtain a visual feature extraction model;
training a second neural network based on the character samples to obtain a semantic feature extraction model;
training the visual feature extraction model based on the image sample;
acquiring a text corresponding to the image sample based on the visual features output in the training process of the visual feature extraction model;
and training the semantic feature extraction model based on the text until the visual feature extraction model and the semantic feature extraction model are converged.
2. The method of claim 1, wherein training a second neural network based on the text samples to obtain a semantic feature extraction model comprises:
randomly masking at least one character in the text sample;
converting all characters in the character sample subjected to the random shielding into corresponding first character identification codes;
and inputting the first character identification codes into the second neural network for training to obtain the semantic feature extraction model.
3. The method of claim 1, wherein the training the semantic feature extraction model based on the text comprises:
converting characters in the text into corresponding second character recognition codes;
and inputting the second character recognition code into the semantic feature extraction model for training.
4. A method of word recognition, comprising:
acquiring a character image to be recognized;
extracting visual features of the character image to be recognized;
extracting semantic features based on the visual features;
performing feature fusion on the visual features and the semantic features;
and performing text prediction based on the feature after feature fusion to obtain a character recognition result.
5. The method of claim 4, wherein the extracting semantic features based on the visual features comprises:
classifying the visual features to obtain characters corresponding to the character image to be recognized;
calculating to obtain corresponding character identification codes based on the characters;
and extracting the semantic features based on the character identification codes.
6. The method of claim 4, further comprising: before the visual feature and the semantic feature are fused, the visual feature is mapped to the same feature space as the semantic feature.
7. A model training apparatus comprising:
the first training module is configured to train a first neural network based on the image sample to obtain a visual feature extraction model;
the second training module is configured to train the second neural network based on the character samples to obtain a semantic feature extraction model;
a third training module configured to train the visual feature extraction model based on the image sample;
the third training module acquires a text corresponding to the image sample based on the visual features output in the training process of the visual feature extraction model;
and the third training module trains the semantic feature extraction model based on the text until the visual feature extraction model and the semantic feature extraction model are converged.
8. The apparatus of claim 7, wherein the second training module training a second neural network based on the text samples to obtain a semantic feature extraction model comprises:
randomly masking at least one character in the text sample;
converting all characters in the character sample subjected to the random shielding into corresponding first character identification codes;
and inputting the first character identification codes into the second neural network for training to obtain the semantic feature extraction model.
9. The apparatus of claim 7, wherein the third training module to train the semantic feature extraction model based on the text comprises:
converting characters in the text into corresponding second character recognition codes;
and inputting the second character recognition code into the semantic feature extraction model for training.
10. A character recognition apparatus comprising:
the acquisition module is configured to acquire a character image to be recognized;
the first feature extraction module is configured to extract visual features of the character image to be recognized;
a second feature extraction module configured to extract semantic features based on the visual features;
a feature fusion module configured to feature fuse the visual features with the semantic features;
and the character recognition module is configured to perform text prediction to obtain a character recognition result based on the feature after feature fusion.
11. The apparatus of claim 10, wherein the second feature extraction module extracting semantic features based on the visual features comprises:
classifying the visual features to obtain characters corresponding to the character image to be recognized;
calculating to obtain corresponding character identification codes based on the characters;
and extracting the semantic features based on the character identification codes.
12. The apparatus of claim 10, further comprising: a mapping module configured to map the visual features to a same feature space as the semantic features before the feature fusion module fuses the visual features and the semantic features.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-6.
CN202211297956.3A 2022-10-21 2022-10-21 Character recognition method, character recognition device, model training method, electronic device and medium Pending CN115631502A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211297956.3A CN115631502A (en) 2022-10-21 2022-10-21 Character recognition method, character recognition device, model training method, electronic device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211297956.3A CN115631502A (en) 2022-10-21 2022-10-21 Character recognition method, character recognition device, model training method, electronic device and medium

Publications (1)

Publication Number Publication Date
CN115631502A true CN115631502A (en) 2023-01-20

Family

ID=84905897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211297956.3A Pending CN115631502A (en) 2022-10-21 2022-10-21 Character recognition method, character recognition device, model training method, electronic device and medium

Country Status (1)

Country Link
CN (1) CN115631502A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117221391A (en) * 2023-11-09 2023-12-12 天津华来科技股份有限公司 Intelligent camera pushing method, device and equipment based on visual semantic big model
CN117221391B (en) * 2023-11-09 2024-02-23 天津华来科技股份有限公司 Intelligent camera pushing method, device and equipment based on visual semantic big model

Similar Documents

Publication Publication Date Title
CN114372477B (en) Training method of text recognition model, and text recognition method and device
CN113657399B (en) Training method of character recognition model, character recognition method and device
CN113313022B (en) Training method of character recognition model and method for recognizing characters in image
CN114399769B (en) Training method of text recognition model, and text recognition method and device
CN112966742A (en) Model training method, target detection method and device and electronic equipment
CN113792854A (en) Model training and word stock establishing method, device, equipment and storage medium
CN113657274B (en) Table generation method and device, electronic equipment and storage medium
US20230009547A1 (en) Method and apparatus for detecting object based on video, electronic device and storage medium
CN114863437B (en) Text recognition method and device, electronic equipment and storage medium
CN113657395B (en) Text recognition method, training method and device for visual feature extraction model
CN112861825B (en) Model training method, pedestrian re-recognition method, device and electronic equipment
CN116152833B (en) Training method of form restoration model based on image and form restoration method
CN113869205A (en) Object detection method and device, electronic equipment and storage medium
CN113887615A (en) Image processing method, apparatus, device and medium
CN114861637A (en) Method and device for generating spelling error correction model and method and device for spelling error correction
CN115631502A (en) Character recognition method, character recognition device, model training method, electronic device and medium
CN112560846B (en) Error correction corpus generation method and device and electronic equipment
CN113947700A (en) Model determination method and device, electronic equipment and memory
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
CN115565186B (en) Training method and device for character recognition model, electronic equipment and storage medium
CN115496734A (en) Quality evaluation method of video content, network training method and device
CN113947195A (en) Model determination method and device, electronic equipment and memory
CN114724144A (en) Text recognition method, model training method, device, equipment and medium
CN115359323A (en) Image text information generation method and deep learning model training method
CN114842541A (en) Model training and face recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination