CN112016543A - Text recognition network, neural network training method and related equipment

Info

Publication number
CN112016543A
CN112016543A (application CN202010723541.2A)
Authority
CN
China
Prior art keywords
character
feature
image
recognition
recognized
Prior art date
Legal status
Pending
Application number
CN202010723541.2A
Other languages
Chinese (zh)
Inventor
刘志广
王靓伟
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202010723541.2A
Publication of CN112016543A
Priority to PCT/CN2021/106397 (published as WO2022017245A1)

Classifications

    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V30/10 Character recognition

Abstract

The application relates to text recognition in the field of artificial intelligence, and discloses a text recognition network, a neural network training method, and related equipment. The text recognition network is a neural network for recognizing characters in an image. It comprises an image feature extraction module, which acquires an image to be recognized and extracts features from it to generate a first feature corresponding to a first character in the image; a text feature acquisition module, which acquires a preset character corresponding to the first character and performs text prediction according to the preset character to generate the semantic feature of a first predicted character; and a recognition module, which performs a recognition operation according to the first feature and the semantic feature of the first predicted character to generate a recognition result corresponding to the image. Because the recognition operation is performed on features of more dimensions, and because image-quality problems do not affect the accuracy of the character prediction, the accuracy of the text recognition result is improved.

Description

Text recognition network, neural network training method and related equipment
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a text recognition network, a neural network training method, and related devices.
Background
Artificial intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making. At present, recognizing characters in images with a neural network based on deep learning is a common application of artificial intelligence.
However, in practice, when the quality of the image to be recognized is low (for example, the image is blurred or some characters in it are occluded), the neural network may output a wrong recognition result, reducing the accuracy of the text recognition result. A scheme for improving the accuracy of the text recognition result is therefore urgently needed.
Disclosure of Invention
The embodiments of the present application provide a text recognition network, a neural network training method, and related equipment. A recognition result is generated from the semantic features of predicted characters together with the image features of the image to be recognized, so that the recognition operation is performed on features of more dimensions; and because image problems such as a blurred image or partially occluded characters in the image to be recognized do not affect the accuracy of the character prediction, the accuracy of the text recognition result is improved.
In order to solve the above technical problem, an embodiment of the present application provides the following technical solutions:
in a first aspect, an embodiment of the present application provides a text recognition network, which may be used in the text recognition field within the field of artificial intelligence. The text recognition network is a neural network for recognizing characters in an image and comprises an image feature extraction module, a text feature acquisition module, and a recognition module. The image feature extraction module acquires an image to be recognized and performs feature extraction on it to generate a first feature corresponding to a first character in the image, where the first character is a character to be recognized. The image feature extraction module may be embodied as a convolutional neural network, a histogram of oriented gradients, or a local binary pattern, and the image to be recognized may be a whole image or a segmented image containing one row or one column of characters produced by an image segmentation operation. The text feature acquisition module acquires a preset character corresponding to the first character and performs text prediction according to the preset character to generate the semantic feature of a first predicted character. The preset character may be a start-flag character, represented in a computer program as a <BOS> character, which instructs the text feature acquisition module to start text prediction. The recognition module combines the first feature with the semantic feature of the first predicted character and performs a recognition operation on the combined feature to generate a recognition result corresponding to the first character in the image to be recognized. The recognition module may be a classification network, embodied as a classifier; the classifier may be a multi-layer perceptron, or may consist of a linear transformation matrix and a classification function.
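To make the module structure concrete, the following is a minimal PyTorch sketch of how the three modules could be wired together; all class names, layer choices, and dimensions here are illustrative assumptions for exposition, not the patent's reference implementation.

```python
# Minimal sketch of the three-module structure described above (assumptions:
# layer choices, dimensions, and the fusion-by-concatenation step).
import torch
import torch.nn as nn

class TextRecognitionNet(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int = 512):
        super().__init__()
        # Image feature extraction module (here: a small CNN).
        self.image_features = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 32)),  # one feature vector per horizontal slot
        )
        # Text feature acquisition module (here: one self-attention layer).
        self.char_embed = nn.Embedding(num_classes + 1, feat_dim)  # +1 for <BOS>
        self.text_features = nn.TransformerEncoderLayer(
            feat_dim, nhead=8, batch_first=True)
        # Recognition module (here: a linear classifier over combined features).
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, image, prev_tokens):
        f = self.image_features(image)          # first feature: (B, C, 1, 32)
        f = f.squeeze(2).transpose(1, 2)        # (B, 32, C)
        s = self.text_features(self.char_embed(prev_tokens))  # semantic feature
        # Combine the (pooled) first feature with each semantic position.
        fused = torch.cat([s, f.mean(1, keepdim=True).expand_as(s)], dim=-1)
        return self.classifier(fused)  # (B, L, classes): position i predicts char i+1
```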
In this implementation, the image features of the image to be recognized are obtained, the semantic feature of the predicted character is generated from the second character corresponding to the already-recognized characters among the first characters, and the recognition operation is performed on features of more dimensions, which improves the accuracy of the text recognition result. When the image to be recognized is blurred or some of its characters are occluded, the accuracy of the blurred or occluded characters within the first feature can drop sharply; the semantic feature of the predicted character, however, is generated from the semantic information of the already-recognized characters, so image problems such as blurring or partial occlusion do not affect the accuracy of the prediction. Generating the recognition result from both the semantic feature of the predicted character and the image feature therefore helps improve the accuracy of the text recognition result.
In a possible implementation of the first aspect, the text feature acquisition module is specifically configured, when the recognition operation is performed on the image to be recognized for the first time, to acquire the preset character corresponding to the first character in the image and to perform text prediction according to the preset character to generate the semantic feature of the first predicted character. If the execution device performs image segmentation on the whole image, performing the recognition operation on the first character for the first time means performing it for the first time on a segmented image (i.e., a text region of the image to be recognized); if not, it means performing it for the first time on the whole image. When the recognition operation has already been performed on at least one of the first characters, the text feature acquisition module determines the at least one recognition result together with the preset character as a second character, and performs text prediction according to the second character to generate the semantic feature of a second predicted character corresponding to it.
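The first-pass/subsequent-pass behaviour above amounts to an autoregressive decoding loop. The sketch below illustrates it, reusing the hypothetical `TextRecognitionNet` from the earlier example; the `<BOS>` index, the end-of-sequence class, and the length cap are assumptions.

```python
import torch

BOS = 0  # assumed index of the preset start-flag character
EOS = 1  # assumed end-of-sequence class

def recognize(net, image, max_chars: int = 32):
    """First pass feeds only the preset character; later passes feed the
    preset character plus all recognition results so far (the second character)."""
    tokens = [BOS]
    for _ in range(max_chars):
        logits = net(image, torch.tensor([tokens]))  # (1, len(tokens), classes)
        next_id = int(logits[0, -1].argmax())        # prediction after last token
        if next_id == EOS:
            break
        tokens.append(next_id)                       # recognized result fed back
    return tokens[1:]                                # drop the preset character
```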
In this implementation, when the recognition operation is performed on the first character in the image to be recognized for the first time, the execution device generates the semantic feature of the first predicted character from the preset character; when the recognition operation has already been performed on at least one of the first characters, the execution device determines the recognition results of the recognized characters, together with the preset character, as the second characters corresponding to the recognized characters. This guarantees the completeness of the scheme, and the whole recognition process needs no manual intervention, which improves the user stickiness of the scheme.
In a possible implementation manner of the first aspect, the recognition module is further configured to perform a recognition operation according to the first feature and a semantic feature of the second predicted character to generate a recognition result corresponding to the first character in the image to be recognized.
In this implementation, each time the text recognition network performs the recognition operation it may obtain the recognition result of only part of the first characters. Therefore, when the recognition operation has already been performed on at least one of the first characters, the execution device performs text prediction according to the recognition results of the recognized characters to generate the semantic feature of the second predicted character, and then performs the recognition operation according to the first feature and that semantic feature, further improving the completeness of the scheme.
In one possible implementation of the first aspect, the text feature acquisition module includes: a first generation submodule, which vectorizes each preset character to generate its character code and generates its position code from the position of the first character in the image to be recognized; and a combination submodule, which combines the character code and the position code of the preset character to obtain the initial feature of the preset character, and performs a self-attention encoding operation and a self-attention decoding operation on the initial feature to generate the semantic feature of the first predicted character. The character code and the position code may be combined in any of the following ways: concatenation, addition, fusion, or multiplication.
In this implementation, text prediction is performed by applying a self-attention encoding operation and a self-attention decoding operation to the initial feature of the preset character to generate the semantic feature of the first predicted character; this approach is fast and has low complexity.
In one possible implementation of the first aspect, the recognition module includes: a calculation submodule, which calculates the similarity between the first feature and the semantic feature of the first predicted character. The similarity may be obtained by computing the cosine similarity, Euclidean distance, or Mahalanobis distance between the two features, or by a dot-product operation between them; it may comprise one similarity value, or two transposed similarity values. A second generation submodule generates a second feature and a third feature from the first feature, the semantic feature of the first predicted character, and the similarity, where the second feature is the first feature combined with the semantic feature of the first predicted character, and the third feature is the semantic feature of the first predicted character combined with the first feature. The second generation submodule then combines the second and third features and performs the recognition operation on the combined feature to generate the recognition result.
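A hedged sketch of the calculation submodule and the second generation submodule described above, using the dot-product variant of the similarity; the softmax normalization and the residual-style combination are assumptions, and the patent equally allows cosine similarity, Euclidean distance, or Mahalanobis distance.

```python
import torch
import torch.nn.functional as F

def fuse(first_feat: torch.Tensor, sem_feat: torch.Tensor):
    """first_feat: (B, T, C) image features of the first characters;
    sem_feat: (B, L, C) semantic features of the predicted characters."""
    sim = torch.bmm(first_feat, sem_feat.transpose(1, 2))  # (B, T, L) similarity
    # Second feature: semantic information folded into the image feature.
    second = first_feat + torch.bmm(F.softmax(sim, dim=-1), sem_feat)
    # Third feature: image information folded into the semantic feature,
    # using the transposed similarity mentioned in the text.
    third = sem_feat + torch.bmm(F.softmax(sim.transpose(1, 2), dim=-1), first_feat)
    return second, third  # the recognition operation runs on their combination
```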
In this implementation, the similarity between the first feature and the semantic feature of the first predicted character is calculated, and the second and third features are generated from it. The image feature of the character to be recognized is thereby enhanced with the semantic feature of the predicted character, and the image feature is blended into the semantic feature of the predicted character. This helps fuse the image feature and the predicted-character feature fully, which in turn helps improve the accuracy of the text recognition result.
In one possible implementation of the first aspect, the text recognition network further includes a feature update module, configured to combine the features of the preset characters with the first feature to generate an updated first feature. The feature of a preset character may be its initial feature or its updated feature. The first feature includes the image features of a plurality of first characters, at least one of which has already had the recognition operation performed on it; when the preset characters include the recognition results of already-recognized characters, the features of the preset characters include the features of those recognition results. Relative to the first feature, the updated first feature enhances the features of the recognized characters. The recognition module is then specifically configured to perform the recognition operation according to the updated first feature and the semantic feature of the first predicted character to generate the recognition result corresponding to the first character in the image to be recognized.
In this implementation, the semantic features of the recognized characters are blended into the image features, making the recognized characters more salient within the image features; the recognition module can then focus on the characters not yet recognized. This reduces the difficulty of each single recognition pass and helps improve the accuracy of text recognition.
In a possible implementation of the first aspect, the feature update module is specifically configured to perform a self-attention encoding operation on the initial features of the preset characters to obtain their updated features, and to perform a self-attention encoding operation according to the first feature and the updated features of the preset characters to generate the updated first feature. In this implementation, self-attention encoding combines the features of the preset characters with the first feature; this promotes a full combination of the two, has low complexity, and is easy to implement.
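One possible realization of this feature update module, sketched with PyTorch's `nn.MultiheadAttention`; the specific layer class and the residual connection are assumptions rather than the patent's specified design.

```python
import torch.nn as nn

class FeatureUpdate(nn.Module):
    """Blends the features of the preset (already-recognized) characters
    into the first feature, as described above."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, first_feat, preset_feat):
        # Self-attention encoding of the preset characters' initial features.
        upd, _ = self.self_attn(preset_feat, preset_feat, preset_feat)
        # Encoding over the first feature and the updated character features,
        # realised here with the image features as the attention queries.
        attended, _ = self.cross_attn(first_feat, upd, upd)
        return first_feat + attended  # updated first feature
```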
In a possible implementation of the first aspect, when the granularity of the recognition operation performed by the text recognition network is a character, one first character includes at least one character, and one recognition result output after a single recognition operation includes one character. When the granularity is a word, one first character includes at least one word, and one recognition result output after a single recognition operation is a word comprising one or more characters.
In this implementation, the granularity of the recognition operation performed by the text recognition network may be either a character or a word, which widens the application scenarios of the scheme and increases its implementation flexibility.
In a second aspect, an embodiment of the present application provides a training method for a text recognition network, which can be used in the text recognition field within the field of artificial intelligence. The text recognition network is a neural network for recognizing characters in an image and comprises an image feature extraction module, a text feature acquisition module, and a recognition module. The method includes: the training device inputs the image to be recognized into the image feature extraction module and performs feature extraction on it to generate the first feature corresponding to the first character in the image, where the first character is a character to be recognized; it inputs the preset character corresponding to the first character into the text feature acquisition module and performs text prediction according to the preset character to generate the semantic feature of the first predicted character. The training device then performs the recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character to generate the recognition result corresponding to the first character. Finally, the training device trains the text recognition network according to the correct result corresponding to the first character, the recognition result, and a loss function, where the loss function indicates the similarity between the correct result and the recognition result, the training objective is to increase that similarity, and the loss function may specifically be a cross-entropy loss function, a focal loss function, or a center loss function.
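A minimal training step consistent with the method above, assuming the cross-entropy variant of the loss and teacher forcing (feeding the correct previous characters); the optimizer and the label layout are assumptions.

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # one of the three loss functions named above

def train_step(net, optimizer, image, prev_tokens, target_tokens):
    """prev_tokens: the preset character plus the correct characters shifted
    right (teacher forcing); target_tokens: the correct characters."""
    logits = net(image, prev_tokens)  # (B, L, num_classes)
    loss = criterion(logits.reshape(-1, logits.size(-1)),
                     target_tokens.reshape(-1))
    optimizer.zero_grad()
    loss.backward()   # backpropagation, e.g. with gradient descent
    optimizer.step()
    return loss.item()
```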
The second aspect of the embodiments of the present application may further perform the steps of each possible implementation of the first aspect. For the specific implementation steps of the second aspect and its possible implementations, and the benefits each brings, refer to the descriptions of the possible implementations of the first aspect; details are not repeated here.
In a third aspect, an embodiment of the present application provides a text recognition method, which may be used in the text recognition field within the field of artificial intelligence. The method includes: the execution device inputs the image to be recognized into the image feature extraction module and performs feature extraction on it to generate the first feature corresponding to the first character in the image, where the first character is a character that needs to be recognized; it inputs the preset character corresponding to the first character into the text feature acquisition module and performs text prediction according to the preset character to generate the semantic feature of the first predicted character. The execution device then performs the recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character to generate the recognition result corresponding to the first character. The image feature extraction module, the text feature acquisition module, and the recognition module belong to the same text recognition network.
The third aspect of the embodiments of the present application may further perform the steps of each possible implementation of the first aspect. For the specific implementation steps of the third aspect and its possible implementations, and the benefits each brings, refer to the descriptions of the possible implementations of the first aspect; details are not repeated here.
In a fourth aspect, an embodiment of the present application provides a training apparatus for a text recognition network, where the text recognition network is a neural network for recognizing characters in an image and includes an image feature extraction module, a text feature acquisition module, and a recognition module. The training apparatus includes: an input unit, configured to input the image to be recognized into the image feature extraction module and perform feature extraction on it to generate the first feature corresponding to the first character in the image, where the first character is a character that needs to be recognized; the input unit is further configured to input the preset character corresponding to the first character into the text feature acquisition module and perform text prediction according to the preset character to generate the semantic feature of the first predicted character; a recognition unit, configured to perform the recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character to generate the recognition result corresponding to the first character; and a training unit, configured to train the text recognition network according to the correct result corresponding to the first character, the recognition result, and a loss function, where the loss function indicates the similarity between the correct result and the recognition result.
The fourth aspect of the embodiments of the present application may further perform the steps of each possible implementation of the second aspect. For the specific implementation steps of the fourth aspect and its possible implementations, and the benefits each brings, refer to the descriptions of the possible implementations of the second aspect; details are not repeated here.
In a fifth aspect, an embodiment of the present application provides an execution device, which may include a processor coupled with a memory; the memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the steps performed by the text recognition network according to the first aspect are implemented.
In a sixth aspect, an embodiment of the present application provides a training device, which may include a processor coupled with a memory; the memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the training method for the text recognition network according to the second aspect is implemented.
In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the steps performed by the text recognition network according to the first aspect, or the computer is caused to execute the training method for the text recognition network according to the second aspect.
In an eighth aspect, an embodiment of the present application provides a circuit system, where the circuit system includes a processing circuit, and the processing circuit is configured to perform the steps performed by the text recognition network according to the first aspect, or perform the training method for the text recognition network according to the second aspect.
In a ninth aspect, embodiments of the present application provide a computer program, which when run on a computer, causes the computer to perform the steps performed by the text recognition network according to the first aspect, or perform the training method for the text recognition network according to the second aspect.
In a tenth aspect, embodiments of the present application provide a chip system, which includes a processor, configured to implement the functions recited in the above aspects, for example, to transmit or process data and/or information recited in the above methods. In one possible design, the system-on-chip further includes a memory for storing program instructions and data necessary for the server or the communication device. The chip system may be formed by a chip, or may include a chip and other discrete devices.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence main framework provided by an embodiment of the present application;
FIG. 2 is a system architecture diagram of a text recognition system provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of the workflow of a text recognition network provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of generating a fourth feature in the workflow of a text recognition network provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of generating a fifth feature and a sixth feature in the workflow of a text recognition network provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of the network architecture of a text recognition network provided by an embodiment of the present application;
FIG. 7 is a schematic flowchart of a training method for a text recognition network provided by an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating an advantageous effect of a text recognition network provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a text recognition network provided by an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a text recognition network provided by an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a training apparatus for a text recognition network provided by an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a training apparatus for a text recognition network provided by an embodiment of the present application;
FIG. 13 is a schematic structural diagram of an execution device provided by an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a training device provided by an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a chip provided by an embodiment of the present application.
Detailed Description
The embodiments of the present application provide a text recognition network, a neural network training method, and related equipment. A recognition result is generated from the semantic features of predicted characters together with the image features of the image to be recognized, so that the recognition operation is performed on features of more dimensions; and because image problems such as a blurred image or partially occluded characters in the image to be recognized do not affect the accuracy of the character prediction, the accuracy of the text recognition result is improved.
Embodiments of the present application are described below with reference to the accompanying drawings. As those skilled in the art will appreciate, with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are likewise applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The general workflow of an artificial intelligence system is described first. Referring to FIG. 1, which shows a schematic structural diagram of the artificial intelligence main framework, the framework is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The intelligent information chain reflects the general process from data acquisition to execution: intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a refinement from data to information to knowledge to wisdom. The IT value chain reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (providing and processing technology) up to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and provides support through a base platform. Communication with the outside is achieved through sensors. Computing power is provided by intelligent chips, including but not limited to hardware acceleration chips such as the central processing unit (CPU), neural network processing unit (NPU), graphics processing unit (GPU), application-specific integrated circuit (ASIC), and field-programmable gate array (FPGA). The base platform comprises platform guarantees and support such as a distributed computing framework and networking, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to acquire data, and the data are provided for computation to the intelligent chips in the distributed computing system provided by the base platform.
(2) Data
Data at the layer above the infrastructure represents the data sources of the field of artificial intelligence. The data involve graphics, images, speech, and text, as well as Internet-of-Things data from traditional devices, including service data of existing systems and sensing data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and the like on data.
Inference is the process of simulating the human mode of intelligent inference in a computer or intelligent system, using formalized information to think and solve problems by machine according to an inference control strategy; its typical function is searching and matching.
Decision making is the process of making decisions after intelligent information has been reasoned about, and generally provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the data processing described above, some general capabilities may be further formed based on the results, such as algorithms or a general system, e.g., translation, text analysis, computer vision processing, and speech recognition.
(5) Intelligent products and industry applications
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision making, and realize practical applications. The application fields mainly include intelligent terminals, intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, autonomous driving, safe cities, and so on.
The embodiments of the present application may be applied in various fields of artificial intelligence, and in particular in any scenario in which characters need to be recognized in an image acquired by a camera, printer, scanner, or other device. For example, in the fields of finance, accounting, and taxation, an enterprise needs to scan documents such as receipts or invoices to obtain image files and recognize the characters in them to extract text information, enabling functions such as digital archiving, fast indexing, or document analysis. In another application scenario, a user who needs to enter the information on a document such as an identity card, driving license, or passport may capture an image of the document with a camera and recognize the characters in the image to extract the key information. It should be understood that these examples are given only to ease understanding of the application scenarios of the embodiments and are not exhaustive. In all the foregoing scenarios the image quality may be low, so the image needs to be recognized by the text recognition network provided in the embodiments of the present application in order to improve the accuracy of the recognition result.
To facilitate understanding of the present solution, the text recognition system provided in the embodiments of the present application is first described with reference to FIG. 2. FIG. 2 is a system architecture diagram of the text recognition system provided by an embodiment of the present application. In FIG. 2, the text recognition system 200 includes an execution device 210, a training device 220, a database 230, and a data storage system 240; the execution device 210 includes a calculation module 211.
In the training phase, the database 230 stores a training data set, which may include a plurality of images to be recognized and the correct result corresponding to the first character in each image. The training device 220 generates a target model/rule 201 for processing the sequence data and iteratively trains it with the training data in the database to obtain a mature target model/rule 201.
During the inference phase, the execution device 210 may invoke data, code, etc. from the data storage system 240 and may store data, instructions, etc. in the data storage system 240. The data storage system 240 may be configured in the execution device 210, or the data storage system 240 may be an external memory with respect to the execution device 210. The calculation module 211 may perform a recognition operation on the image to be recognized input by the execution device 210 through the mature target model/rule 201, so as to obtain a recognition result of the first character in the image to be recognized.
In some embodiments of the present application, for example in FIG. 2, the "user" may interact directly with the execution device 210; that is, the execution device 210 and the client device are integrated in the same device. However, FIG. 2 is only a schematic architecture diagram of the text recognition system provided by the embodiments of the present application, and the positional relationships among the devices and modules shown in it constitute no limitation. In other embodiments, the execution device 210 and the client device may be separate devices: the execution device 210 is configured with an input/output interface for data interaction with the client device, the "user" inputs the captured image to the input/output interface through the client device, and the execution device 210 returns the processing result to the client device through the input/output interface.
Based on the above description, an embodiment of the application provides a text recognition network that includes an image feature extraction module, a text feature acquisition module, and a recognition module. The image feature extraction module extracts the image features of the first character in the image to be recognized; the text feature acquisition module performs text prediction on the semantic information of the preset character corresponding to the first character to obtain the semantic features of the predicted character; and the recognition module performs the recognition operation according to the image features of the first character and the semantic features of the predicted character to generate the recognition result. Because image problems such as blurring or partial occlusion of characters in the image to be recognized do not affect the accuracy of the character prediction, generating the recognition result from the semantic features of the predicted character together with the image features improves the accuracy of the text recognition result. As described with reference to FIG. 2, the embodiments of the present application include an inference phase and a training phase, whose flows differ; the two phases are described separately below.
1. Inference phase
In the embodiment of the present application, the inference phase describes how the execution device 210 performs character recognition on the image to be recognized using a mature text recognition network. Referring to FIG. 3, FIG. 3 is a schematic flowchart of the workflow of a text recognition network provided by an embodiment of the present application. The method includes:
301. The execution device inputs the image to be recognized into the image feature extraction module and performs feature extraction on the image to be recognized to generate a first feature corresponding to a first character in the image to be recognized.
In the embodiment of the present application, after obtaining the image to be recognized, the execution device inputs it into the image feature extraction module of the text recognition network, which performs feature extraction on it to generate the first feature corresponding to the first character in the image; the first character is a character that needs to be recognized in the image.
The image feature extraction module in the text recognition network may be embodied as a convolutional neural network, a histogram of oriented gradients (HOG), a local binary pattern (LBP), or another network for extracting the features of an image.
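As one concrete possibility for the convolutional variant of this module, the sketch below extracts a sequence of per-column features from a text-line image; the layer sizes are illustrative assumptions only.

```python
import torch.nn as nn

# Any CNN producing a sequence of per-column features for a text line could
# serve as the image feature extraction module; sizes here are assumed.
image_feature_extractor = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                  # halve height and width
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, None)),  # collapse height -> (B, 128, 1, W/2)
)
```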
One image to be recognized may include one or more rows of first characters, or one or more columns of first characters. If the granularity of the recognition operation performed by the text recognition network is a character, that is, each time the execution device performs the recognition operation through the text recognition network it obtains the recognition result of one character in the image, then one first character includes one or more characters. As an example, if a first character included in the image to be recognized is "cat", one recognition operation of the text recognition network obtains the recognition result of the single character "c". As another example, if a first character included in the image to be recognized is the Chinese sentence "today the weather is really nice", one recognition operation of the text recognition network generates the feature of one Chinese character and outputs the recognition result corresponding to that character.
If the granularity of the recognition operation performed by the text recognition network is a word, that is, each time the execution device performs the recognition operation through the text recognition network it obtains the recognition result of one word in the image, then one first character includes one or more words. As an example, if a first character included in the image to be recognized is "how are you", one recognition operation of the text recognition network obtains the recognition result of the single word "how". As another example, if a first character is the sentence "today the weather is really nice", the text recognition network obtains the recognition result of the word "today" each time it performs the recognition operation. It should be understood that the above examples are given only to ease understanding of the present solution and do not limit it.
Specifically, in one implementation, after acquiring the image to be recognized, the execution device performs image segmentation on it to generate at least one segmented image (i.e., it segments the image to be recognized into at least one text region). If an image to be recognized includes one or more rows of first characters, each segmented image (i.e., each text region) includes one row of first characters; if it includes one or more columns of first characters, each segmented image includes one column of first characters.
More specifically, in one case, an image segmentation module is additionally configured in the text recognition network, and the execution device performs image segmentation on the image to be recognized through this module to obtain at least one segmented image. In another case, in addition to the text recognition network, a first neural network for image segmentation may be configured on the execution device, and the execution device performs image segmentation on the image to be recognized through this first neural network to obtain at least one segmented image. Further, the image segmentation module in the text recognition network, or the first neural network for image segmentation, may be embodied as a shape-robust text detection network based on a progressive scale expansion network (PSENet), a connectionist text proposal network (CTPN), or another neural network for image segmentation, which is not limited here.
Correspondingly, step 301 may include: the execution device inputs the segmented image into the image feature extraction module and performs feature extraction on it to generate the first feature corresponding to the first character in the segmented image, the segmented image being a text region of the image to be recognized. A first feature here is the feature of a segmented image and includes the image features of one row of first characters (i.e., one text region of the image to be recognized) or of one column of first characters.
In another implementation, when an image to be recognized includes multiple rows or multiple columns of first characters, the execution device, after acquiring the image, inputs the whole image into the image feature extraction module of the text recognition network and performs feature extraction on the whole image to generate the first features corresponding to the first characters. Here the first feature is the feature of the whole image to be recognized: if the image includes one row or one column of first characters, one first feature is the image feature of that row or column of first characters; if the image includes multiple rows or columns, one first feature is the image feature of those multiple rows or columns of first characters.
302. The execution device inputs a preset character corresponding to the first character in the image to be recognized into the text feature acquisition module and performs text prediction according to the preset character to generate the semantic feature of the first predicted character.
In the embodiment of the present application, when the first character in the image to be recognized is recognized for the first time, the execution device obtains the preset character corresponding to the first character, inputs it into the text feature acquisition module of the text recognition network, and performs text prediction through this module according to the preset character to generate the semantic feature of the first predicted character.
If the execution device performs image segmentation on the whole image to be recognized, performing a recognition operation on a first character in the image to be recognized for the first time refers to performing a recognition operation on the segmented image to be recognized (i.e., a text region of the image to be recognized) for the first time. If the execution device does not perform image segmentation on the whole image to be recognized, performing the recognition operation on the first character in the image to be recognized for the first time refers to performing the recognition operation on the whole image to be recognized for the first time.
The preset character may be a start-flag character, represented in a computer program as a <BOS> character, which instructs the text feature acquisition module to start text prediction. The representation of the preset character is predefined and may specifically be a vector of N elements, each element being a fixed value, where N is an integer greater than or equal to 1. As an example, the preset character may specifically be a vector of 32 ones, or a vector of 64 twos, and so on; the possibilities are not exhausted here.
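For illustration, the two example representations above can be written down directly; this is a sketch whose dimensionalities and values simply mirror the text.

```python
import torch

# The preset character as a predefined vector of N identical values.
bos_32 = torch.ones(32)          # a vector of 32 ones
bos_64 = torch.full((64,), 2.0)  # a vector of 64 twos
```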
The text feature acquisition module of the text recognition network may include an encoding module for extracting the text features of the input characters and a decoding module for generating the text features of the predicted characters from the text features of the input characters. Further, the encoding module may be an encoder of a recurrent neural network (RNN), with the decoding module a decoder of a recurrent neural network; as an example, the encoding and decoding modules may be those of a long short-term memory network (LSTM). The encoding module may also be a self-attention encoding module, with the decoding module a self-attention decoding module; as an example, they may be the self-attention encoding and decoding modules of a neural network based on Bidirectional Encoder Representations from Transformers (BERT). The encoding and decoding modules may also be those of other neural networks for text prediction, which are not exhausted here.
Specifically, in one implementation, the encoding module and the decoding module of the text feature acquisition module are a self-attention encoding module and a self-attention decoding module, respectively. Step 302 may then include: the execution device converts, through the text feature acquisition module, the preset character from character form into tensor form to generate the character code of the preset character, and generates the position code of the preset character according to the position of the first character in the image to be recognized; it then combines the character code and the position code of the preset character to obtain the initial feature of the preset character. The execution device performs, through the text feature acquisition module, a self-attention encoding operation and a self-attention decoding operation according to the initial feature of the preset character to generate the semantic feature of the first predicted character.
In the embodiment of the application, text prediction is performed by performing self-attention encoding operation and self-attention decoding operation on the initial features of the preset characters to generate the semantic features of the first predicted character, and the method is high in calculation speed and low in complexity.
More specifically, regarding the generation process of the character code: the execution device may perform vectorization (encoding) processing on the preset character through the text feature acquisition module to generate the character code of the preset character. Alternatively, the execution device may obtain a one-hot code of the preset character and determine the one-hot code as the character code of the preset character; the process of generating the character code of the preset character is not limited here. The character code of the preset character may be a vector including M elements, where the value of M depends on which neural network the text feature acquisition module of the text recognition network adopts, and is not limited here.
Regarding the generation process of the position code: the position of the first character of the preset character in the image to be recognized is the first position, and the position code of the preset character indicates that the position of the preset character is this first position. Optionally, the position code of the preset character may be a vector including M elements. As an example, if the value of M is 512, the position code of the preset character may be a vector including a single 1 and 511 0s, where the 1 is located at the head of the vector, indicating that the position of the first character of the preset character in the image to be recognized is the head. Optionally, the execution device may further perform a secondary conversion on the aforementioned 512 elements through a cosine function, and the like. The combination of the character code and the position code includes but is not limited to concatenation (concat), addition (add), fusion, multiplication, and the like.
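Putting the character code and the position code together, a minimal PyTorch sketch might look as follows; the vocabulary size, the value of M, the one-hot position code, and addition as the combination manner are all illustrative assumptions rather than details fixed by the scheme.

```python
import torch
import torch.nn as nn

# Assumed sizes: a small vocabulary and M = 512, matching the example above.
VOCAB_SIZE, M = 100, 512
char_embedding = nn.Embedding(VOCAB_SIZE, M)  # vectorization: character code

def position_code(pos: int, m: int = M) -> torch.Tensor:
    """One-hot position code: a single 1 at index `pos`, zeros elsewhere."""
    code = torch.zeros(m)
    code[pos] = 1.0
    return code

BOS_ID = 0                                    # assumed index of <BOS>
char_code = char_embedding(torch.tensor(BOS_ID))
pos_code = position_code(0)                   # <BOS> sits at the head position
initial_feature = char_code + pos_code        # combination by addition
```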
Regarding the process of generating the semantic feature of the first predicted character: after the execution device obtains the initial feature of the preset character, it performs text prediction through the text feature acquisition module of the text recognition network; that is, it performs a self-attention encoding operation on the initial feature of the preset character to generate the updated feature of the preset character, and performs a self-attention decoding operation on the updated feature of the preset character to generate the semantic feature of the first predicted character.
In another implementation, the encoding module and the decoding module in the text feature acquisition module are taken from a recurrent neural network. Step 302 may include: the execution device converts the preset character from character form to tensor form through the text feature acquisition module to generate the character code of the preset character, and determines the character code of the preset character as the initial feature of the preset character. The execution device then performs an encoding operation and a decoding operation according to the initial feature of the preset character through the text feature acquisition module to generate the semantic feature of the first predicted character.
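A minimal sketch of this recurrent variant, assuming an LSTM-based encoder and decoder; the feature size and the zero start-of-decoding input are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Assumed LSTM text feature acquisition module: the encoder consumes the
# initial feature(s) of the input characters, and the decoder produces the
# semantic feature of the first predicted character from the encoder state.
M = 512
encoder = nn.LSTM(input_size=M, hidden_size=M, batch_first=True)
decoder = nn.LSTM(input_size=M, hidden_size=M, batch_first=True)

initial_features = torch.randn(1, 1, M)        # [batch, seq, M]: the preset character
_, state = encoder(initial_features)           # encoding operation
decoder_input = torch.zeros(1, 1, M)           # assumed start-of-decoding input
semantic_feature, _ = decoder(decoder_input, state)  # semantic feature of the predicted character
```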
It should be noted that, when the encoding module and the decoding module in the text feature obtaining module are selected from other types of neural networks for text prediction, step 302 may be modified accordingly, which is not exhaustive here.
In addition, the embodiment of the present application does not limit the execution sequence of step 301 and step 302, and step 301 and step 302 may be executed first, or step 302 and step 301 may be executed first, or step 301 and step 302 may be executed at the same time.
303. The execution device combines the feature of the preset character with the first feature through the feature update module to generate a fourth feature.
In some embodiments of the application, after the execution device generates, by an image feature extraction module of the text recognition network, a first feature corresponding to a first character in an image to be recognized, the execution device may further combine a feature of a preset character with the first feature to generate a fourth feature, where the fourth feature is the updated first feature. The feature of the preset character may be an updated feature of the preset character, or may be an initial feature of the preset character.
Specifically, in an implementation manner, the execution device executes, through a feature update module of the text recognition network, a self-attention coding operation according to an initial feature of a preset character to obtain an updated feature of the preset character, and executes the self-attention coding operation according to the first feature and the updated feature of the preset character to generate the fourth feature.
Regarding the generation process of the updated feature of the preset character: in order to understand the self-attention encoding process more intuitively, the formula for performing the self-attention encoding operation on the preset character is disclosed as follows:
Q'_char = Norm(softmax(Q_char · K_char) · V_char + Q_char); (1)

where Q_char is obtained by multiplying the initial feature of the preset character by a first transformation matrix, K_char is obtained by multiplying the initial feature of the preset character by a second transformation matrix, Q_char · K_char represents the dot product of Q_char and K_char, softmax(Q_char · K_char) · V_char represents the dot product of softmax(Q_char · K_char) and V_char, V_char is obtained by multiplying the initial feature of the preset character by a third transformation matrix, and Q'_char represents the updated feature of the preset character. The first transformation matrix, the second transformation matrix, and the third transformation matrix may be the same or different. It should be understood that the example in equation (1) is only for facilitating understanding of the scheme and is not used to limit the scheme.
Regarding the process of generating the fourth feature: in order to more intuitively understand the process of performing self-attention encoding according to the first feature and the updated feature of the preset character, the formula for this self-attention encoding operation is disclosed as follows:
Q'_img = Norm(softmax(Q'_char · K_img) · V_img + Q'_char); (2)

where Q'_img represents the fourth feature, Q'_char represents the updated feature of the preset character, K_img is obtained by multiplying the first feature by a fourth transformation matrix, and V_img is obtained by multiplying the first feature by a fifth transformation matrix. The fourth transformation matrix and the fifth transformation matrix may be the same or different. It should be understood that the example in equation (2) is only for facilitating understanding of the scheme and is not used to limit the scheme.
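For readers who prefer code to formulas, the following is a minimal PyTorch sketch of equations (1) and (2); the use of LayerNorm for Norm, random matrices standing in for the five transformation matrices, and the feature shapes are all illustrative assumptions rather than details fixed by the scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

M = 512
norm = nn.LayerNorm(M)                               # assumed form of Norm

W1, W2, W3 = (torch.randn(M, M) for _ in range(3))   # first/second/third matrices
W4, W5 = (torch.randn(M, M) for _ in range(2))       # fourth/fifth matrices

char_init = torch.randn(1, M)      # initial feature of the preset character
first_feature = torch.randn(10, M) # first feature: 10 image-feature vectors

# Equation (1): self-attention over the preset character, with residual + Norm.
Q_char, K_char, V_char = char_init @ W1, char_init @ W2, char_init @ W3
Q_char_upd = norm(F.softmax(Q_char @ K_char.T, dim=-1) @ V_char + Q_char)

# Equation (2): attend from the updated character feature to the image feature.
K_img, V_img = first_feature @ W4, first_feature @ W5
fourth_feature = norm(F.softmax(Q_char_upd @ K_img.T, dim=-1) @ V_img + Q_char_upd)
```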
For a more intuitive understanding of the present solution, please refer to fig. 4. Fig. 4 is a schematic flowchart of generating the fourth feature in the work flow of a text recognition network according to an embodiment of the present application; fig. 4 takes as its example a text recognition network that obtains the recognition result of one character per recognition operation, that is, one second character includes one character. As shown in fig. 4, the execution device inputs the image to be recognized into the image feature extraction module of the text recognition network and obtains the image feature of the first character in the image to be recognized (that is, the first feature of the first character in the image to be recognized); in fig. 4, the image feature extraction module is exemplified as including a plurality of convolution layers and a plurality of pooling layers, where max pool refers to maximum pooling. As shown in fig. 4, the execution device generates the character code and the position code of the preset character to obtain the initial feature of the preset character, and generates the updated feature Q'_char of the preset character according to equation (1) above. After obtaining the image feature of the first character in the image to be recognized (i.e., the first feature) and the updated feature of the preset character, the execution device performs the self-attention encoding operation of equation (2) above to generate the fourth feature. It should be noted that, in practical cases, more neural network layers may be arranged: for example, in addition to performing the self-attention encoding operation according to the first feature and the updated feature of the preset character, the feature update module in the text recognition network may also be provided with a feedforward neural network, a regularization module, and the like. Fig. 4 is only an example for facilitating understanding of the present solution and is not used to limit the present solution.
In another implementation, the execution device performs a self-attention encoding operation according to the first feature and an initial feature of a preset character through a feature update module of the text recognition network to generate a fourth feature.
In another implementation manner, the execution device executes, through a feature update module of the text recognition network, a coding operation according to an initial feature of the preset character to obtain an updated feature of the preset character, and executes the coding operation according to the first feature and the updated feature of the preset character to generate the fourth feature. Further, the feature update module of the text recognition network performs an encoding operation by an encoder, which is an encoder in the recurrent neural network.
In another implementation, the execution device performs, by a feature update module of the text recognition network, an encoding operation according to the first feature and an initial feature of a preset character to generate a fourth feature.
304. The execution device performs a recognition operation by the recognition module based on the first feature and the semantic feature of the first predicted character to generate a first recognition result.
In the embodiment of the application, the execution device combines the first feature and the semantic feature of the first predicted character through the recognition module, and executes recognition operation according to the combined feature to generate a first recognition result. If the granularity of the recognition operation executed by the text recognition network is characters, a first recognition result is a character recognized by the text recognition network. If the granularity of the recognition operation executed by the text recognition network is a word, a first recognition result is a word recognized by the text recognition network.
Specifically, step 303 is an optional step. If step 303 is executed, step 304 includes: the execution device combines the fourth feature (i.e., the updated first feature) and the semantic feature of the first predicted character through the recognition module, and performs the recognition operation according to the combined feature to generate the first recognition result.
Regarding the process of combining the fourth feature and the semantic feature of the first predicted character: in one implementation, the execution device directly combines the fourth feature (i.e., the updated first feature) and the semantic feature of the first predicted character through the recognition module, by means such as splicing or matrix multiplication.
In another implementation, the execution device performs, through the recognition module, the combination of the fourth feature and the semantic feature of the first predicted character according to the similarity between them. The execution device calculates a first similarity between the fourth feature and the semantic feature of the first predicted character through the recognition module; generates a fifth feature according to the fourth feature, the semantic feature of the first predicted character, and the first similarity; and generates a sixth feature according to the fourth feature, the semantic feature of the first predicted character, and the first similarity. The recognition module then combines the fifth feature and the sixth feature.
The first similarity may be obtained by calculating a cosine similarity, a Euclidean distance, a Mahalanobis distance, or the like between the fourth feature and the semantic feature of the first predicted character, or by performing a dot product operation on the two. Further, the first similarity may include a single similarity value, or two similarity values that are transposes of each other. The fifth feature combines the semantic feature of the first predicted character on the basis of the fourth feature, and the sixth feature combines the fourth feature on the basis of the semantic feature of the first predicted character. The combination of the fifth feature and the sixth feature includes but is not limited to splicing, addition, multiplication, or other combination manners, which are not exhaustively listed here.
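As a concrete illustration, the following PyTorch sketch builds a fifth and a sixth feature from a dot-product first similarity; the shapes, the sqrt(d) scaling, and the residual-style combination are assumptions chosen to match the description above, not the definitive implementation.

```python
import torch

d = 512
fourth = torch.randn(10, d)     # fourth feature: updated image feature vectors
semantic = torch.randn(1, d)    # semantic feature of the first predicted character

s_vis = (fourth @ semantic.T) / d ** 0.5  # similarity of the fourth feature to the character
s_lin = s_vis.T                           # the transposed similarity value(s)

fifth = fourth + s_vis @ semantic   # semantic feature combined on the basis of the fourth
sixth = semantic + s_lin @ fourth   # fourth feature combined on the basis of the semantic
combined = torch.cat([fifth, sixth], dim=0)  # splicing the fifth and sixth features
```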
More specifically, for a more intuitive understanding of the process of generating the fifth feature and the sixth feature, please refer to fig. 5. Fig. 5 is a schematic flowchart of generating the fifth feature and the sixth feature in the work flow of the text recognition network provided in the embodiment of the present application, and it exemplifies generating the first similarity by means of dot multiplication. In fig. 5, K_vis represents the fourth feature (i.e., the updated first feature) and Q_lin represents the semantic feature of the first predicted character. A first weight and a second weight (rendered in the original as formula images that cannot be recovered here) are determined in the training stage of the text recognition network; P_lin is obtained by multiplying Q_lin by the first weight, and P_vis is obtained by multiplying K_vis by the second weight. S_vis represents the similarity of the fourth feature to the first predicted character, and S_lin represents the similarity of the first predicted character to the fourth feature; each is calculated by a dot-product formula (likewise rendered as an image) in which d represents the number of dimensions of the feature, that is, the number of elements included in the fourth or fifth feature. The fifth feature, which combines the semantic feature of the first predicted character on the basis of the fourth feature, is obtained in fig. 5 by splicing intermediate features, one of which is produced by the dot product of S_lin, K_lin, and a further projected feature; the sixth feature, which combines the fourth feature on the basis of the semantic feature of the first predicted character, is obtained analogously by splicing intermediate features, one of which is produced by the dot product of S_vis, K_vis, and a further projected feature. It should be understood that fig. 5 is only an example for facilitating understanding of the present solution, and is not intended to limit the present solution.
Regarding the process of performing the recognition operation according to the combined feature: after the execution device combines the fifth feature and the sixth feature through the recognition module, the combined feature is input into the classification network in the recognition module, so that the recognition operation is performed through the classification network, and the first recognition result output by the whole recognition module is obtained.
The classification network may be specifically represented as a classifier; the classifier may be a multi-layer perceptron (MLP), or may be composed of a linear transformation matrix and a softmax classification function. The specific form of the classification network is not limited here.
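A minimal sketch of the second classifier variant (a linear transformation matrix followed by a softmax classification function) might look as follows; the combined-feature size and the size of the character set are assumptions.

```python
import torch
import torch.nn as nn

COMBINED_DIM, NUM_CLASSES = 1024, 97       # assumed sizes
classifier = nn.Sequential(
    nn.Linear(COMBINED_DIM, NUM_CLASSES),  # linear transformation matrix
    nn.Softmax(dim=-1),                    # softmax classification function
)
combined_feature = torch.randn(1, COMBINED_DIM)
probs = classifier(combined_feature)       # class probabilities
recognition_result = probs.argmax(dim=-1)  # index of the recognized character
```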
If step 303 is not performed, step 304 includes: the executing device combines the first feature obtained in step 301 and the semantic feature of the first predicted character through the recognition module, and executes a recognition operation according to the combined feature to generate a first recognition result. The specific implementation process may refer to the description in the case of executing step 303, and is not described in detail here.
305. The execution device inputs the second character corresponding to the recognized character in the first character into the text feature acquisition module, and performs text prediction according to the second character to generate the semantic feature of the first predicted character.
In this embodiment, the specific implementation of step 305 is similar to that of step 302. In the case that the execution device has performed a recognition operation on at least one character in the first character, the execution device obtains at least one second character corresponding to all recognized characters in the first character. Specifically, the execution device determines the at least one recognition result corresponding to all recognized characters in the first character as the at least one second character corresponding to the recognized characters. In the embodiment of the application, when a recognition operation has been performed on at least one of the first characters, the execution device determines the at least one recognition result corresponding to the recognized characters, together with the preset character, as the second characters; when the recognition operation is performed on the first character in the image to be recognized for the first time, the execution device generates the semantic feature of the first predicted character according to the preset character alone. This ensures the completeness of the scheme, the whole recognition process requires no manual intervention, and the user stickiness of the scheme is improved.
More specifically, if the executing device proceeds to step 305 through step 304, step 305 includes: the execution device determines the first recognition result and the preset character as one second character corresponding to the recognized character in the first characters. If the executing device proceeds to step 305 through step 307, step 305 includes: the execution device determines the preset character, the first recognition result and the at least one second recognition result as a plurality of second characters corresponding to recognized characters in the first characters.
In the case that the granularity of the recognition operation performed by the text recognition network is a character, the first character is a word including at least one character, one recognition result includes one character, and each second character includes one character. In the case where the granularity of the recognition operation performed by the text recognition network is a word, at least one word is included in the first characters, one recognition result is a word including one or more characters, and each of the second characters is a word including one or more characters. In the embodiment of the application, the granularity of the text recognition network for executing the recognition operation can be characters or words, so that the application scene of the scheme is expanded, and the realization flexibility of the scheme is improved.
The execution device inputs all second characters corresponding to all recognized characters in the first character into the text feature acquisition module of the text recognition network, so as to perform text prediction through the encoding module and the decoding module in the text feature acquisition module according to all the second characters and generate the semantic feature of the first predicted character.
Specifically, in one implementation, the encoding module and the decoding module in the text feature acquisition module are a self-attention encoding module and a self-attention decoding module, respectively. Step 305 may include: the execution device converts any one of the at least one second character from character form to tensor form through the text feature acquisition module to generate the character code of that second character, and generates the position code of that second character according to the position of its first character in the image to be recognized. The execution device combines the character code and the position code of each second character through the text feature acquisition module of the text recognition network to obtain the initial feature of that second character; performing this operation on every one of the at least one second character yields the initial feature of each second character. The execution device then performs a self-attention encoding operation and a self-attention decoding operation according to the initial features of the second characters through the text feature acquisition module to generate the semantic feature of the first predicted character.
In another implementation, the encoding module and the decoding module in the text feature acquisition module are taken from a recurrent neural network. Step 305 may include: the execution device converts each of the at least one second character from character form to tensor form through the text feature acquisition module to generate the character code of each second character, and determines the character code of each second character as its initial feature. The execution device then performs an encoding operation and a decoding operation according to the initial features of all of the at least one second character through the text feature acquisition module to generate the semantic feature of the first predicted character.
The specific implementation manners of the two implementation manners may refer to the description in step 302, which is not described herein.
306. The execution device combines the feature of the second character with the first feature through the feature update module to generate a seventh feature.
In this embodiment of the application, the specific implementation of step 306 is similar to that of step 303. After the execution device generates, through the image feature extraction module of the text recognition network, the first feature corresponding to the first character in the image to be recognized, the feature of the second character may be combined with the first feature to generate a seventh feature, where the seventh feature is the updated first feature. The feature of the second character may be an updated feature of the second character, or may be an initial feature of the second character. The first feature includes image features of a plurality of first characters, at least one of which has already undergone a recognition operation; in the case where the second characters include the recognition results corresponding to a plurality of recognized characters, the features of the second characters include the features of those recognition results. Relative to the first feature, the seventh feature is enhanced in the features of the recognized characters.
In the embodiment of the application, the semantic features of the recognized characters are blended into the image features, so that the features of the recognized characters in the image features are more obvious, and the recognition module can more intensively recognize the unrecognized characters, so that the difficulty of the recognition module in a single recognition process is reduced, and the accuracy of text recognition is improved.
Specifically, in one implementation, the execution device executes, through a feature update module of the text recognition network, a self-attention coding operation according to an initial feature of the second character to obtain an updated feature of the second character, and executes the self-attention coding operation according to the first feature and the updated feature of the second character to generate a seventh feature (i.e., the updated first feature). In the embodiment of the application, the feature of the second character is combined with the first feature by adopting a self-attention coding mode, so that the full combination of the feature of the second character and the first feature is favorably realized, the complexity is low, and the realization is easy.
In another implementation, the performing device performs a self-attention encoding operation according to the first feature and the initial feature of the second character to generate a seventh feature by a feature update module of the text recognition network.
In another implementation, the execution device executes, by a feature update module of the text recognition network, a coding operation according to the initial feature of the second character to obtain an updated feature of the second character, and executes the coding operation according to the first feature and the updated feature of the second character to generate the seventh feature. Further, the feature update module of the text recognition network performs an encoding operation by an encoder, which is an encoder in the recurrent neural network.
In another implementation, the execution device performs, by a feature update module of the text recognition network, an encoding operation based on the first feature and the initial feature of the second character to generate a seventh feature.
For the specific implementation of various forms in step 306, reference may be made to the description in step 303, and details are not repeated here.
307. The execution device executes the recognition operation through the recognition module according to the first characteristic and the semantic characteristic of the first predicted character to generate a second recognition result.
In this embodiment, the specific implementation manner of step 307 is similar to the specific implementation manner of step 304, and the executing device combines the first feature and the semantic feature of the first predicted character through the recognition module, and executes the recognition operation according to the combined feature, so as to generate a second recognition result.
Specifically, step 306 is an optional step, and if step 306 is executed, step 307 includes: the execution device combines the seventh feature (i.e., the updated first feature) and the semantic feature of the first predicted character through the recognition module, and performs a recognition operation according to the combined feature to generate a second recognition result.
Regarding the process of combining the seventh feature and the semantic feature of the first predicted character: in one implementation, the execution device directly combines the seventh feature (i.e., the updated first feature) and the semantic feature of the first predicted character through the recognition module, by means such as splicing or matrix multiplication.
In another implementation, the execution device performs, by the recognition module, a combination operation of the seventh feature and the semantic feature of the first predicted character according to a similarity between the seventh feature and the semantic feature of the first predicted character. The execution device calculates the similarity between the seventh feature (namely the updated first feature) and the semantic feature of the first predicted character through the recognition module; and generating a second feature and a third feature according to the seventh feature, the semantic feature and the similarity of the first predicted character. The second characteristic is that the semantic characteristic of the first predicted character is combined on the basis of the seventh characteristic, and the third characteristic is that the seventh characteristic is combined on the basis of the semantic characteristic of the first predicted character; and performing a recognition operation according to the second characteristic and the third characteristic to generate a second recognition result.
In the embodiment of the application, the similarity between the first feature and the semantic feature of the first predicted character is calculated, and the second feature and the third feature are then generated according to that similarity. The second feature combines the semantic feature of the first predicted character on the basis of the first feature, and the third feature combines the first feature on the basis of the semantic feature of the first predicted character. That is, the image feature of the character to be recognized is enhanced according to the semantic feature of the predicted character, and the image feature of the character to be recognized is blended into the semantic feature of the predicted character, which facilitates the full fusion of the image feature and the feature of the predicted character and helps to improve the accuracy of the text recognition result.
Regarding the process of performing the recognition operation according to the combined feature: the execution device combines the second feature and the third feature through the recognition module, then inputs the combined feature into the classification network in the recognition module, so as to perform the recognition operation through the classification network and obtain the second recognition result output by the whole recognition module.
If step 306 is not executed, step 307 includes: the executing device combines the first feature obtained in step 301 and the semantic feature of the first predicted character through the recognition module, and executes the recognition operation according to the combined feature to generate a second recognition result. The specific implementation manner of step 307 may refer to the description in step 304, which is not described herein again.
It should be noted that, in the embodiment of the present application, the number of execution times of the steps 301 to 304 and the steps 305 to 307 is not limited, and the steps 305 to 307 may be repeatedly executed after the steps 301 to 304 are executed once, so as to obtain a plurality of second recognition results.
Specifically, if the granularity of one recognition operation performed by the text recognition network is a character, the execution device obtains the recognition result of one character in a first character each time it performs steps 305 to 307, and repeatedly performs steps 305 to 307 to obtain the recognition results of all characters in the first character. If the granularity of one recognition operation performed by the text recognition network is a word, the execution device obtains the recognition result of one word in a first character each time it performs steps 305 to 307, and repeatedly performs steps 305 to 307 to obtain the recognition results of all words in the first character. The recognition result of the whole first character can then be output.
Further, if only one character to be recognized is included in one first character, or only one word to be recognized is included in one first character, the execution device may directly output the recognition result of the entire first character after performing steps 301 to 304.
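To make the iteration of steps 301 to 307 concrete, the following is a high-level Python sketch of the loop; the module interfaces (image_feature_extractor, text_feature_module, feature_update, recognition_module) and the <EOS> stopping criterion are hypothetical, since the embodiment leaves the exact stop condition open.

```python
def recognize(image, net, max_len=32, eos="<EOS>"):
    first_feature = net.image_feature_extractor(image)           # step 301
    characters = ["<BOS>"]                                       # the preset character
    results = []
    while len(results) < max_len:
        semantic = net.text_feature_module(characters)           # steps 302 / 305
        updated = net.feature_update(first_feature, characters)  # steps 303 / 306
        result = net.recognition_module(updated, semantic)       # steps 304 / 307
        if result == eos:
            break
        results.append(result)
        characters.append(result)   # recognized result fed back as a second character
    return "".join(results)
```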
For a more intuitive understanding of the present solution, please refer to fig. 6; fig. 6 is a schematic diagram of a network architecture of a text recognition network according to an embodiment of the present application. The text recognition network comprises an image feature extraction module, A1, A2, and a recognition module, where A1 represents the text feature acquisition module and A2 represents the feature update module. As shown in fig. 6, the execution device inputs the image to be recognized into the image feature extraction module to obtain the image features (i.e., the first feature) of the first character in the image to be recognized, and inputs the characters corresponding to the first character in the image to be recognized into A1 (i.e., the text feature acquisition module); the characters corresponding to the first character may be the preset character alone, or the preset character together with second characters. The text feature acquisition module generates the initial features of these characters, and performs a self-attention encoding operation and a self-attention decoding operation on the initial features to obtain the semantic feature of the predicted character. After obtaining the first feature, the execution device also performs self-attention encoding on the initial features of the characters to obtain their updated features, and then performs a self-attention encoding operation according to the first feature and the updated features to generate the updated first feature. The execution device inputs the updated first feature and the semantic feature of the predicted character into the recognition module, so that the recognition module performs the recognition operation and outputs the recognition result. For the specific implementation of each step in fig. 6, reference may be made to the above description, which is not repeated here. It should be understood that, in practice, more or fewer neural network layers may be set in the text recognition network; fig. 6 is only one example for facilitating understanding of the present solution and is not used to limit the present solution.
In this implementation, the image feature of the image to be recognized is obtained, the semantic feature of the predicted character is generated according to the second characters corresponding to the recognized characters in the first character, and the recognition operation is performed according to features of more dimensions, which improves the accuracy of the text recognition result. Moreover, when the image to be recognized is blurred or some characters in it are occluded, the accuracy of the features of the blurred or occluded characters included in the first feature can be greatly reduced; since the semantic feature of the predicted character is generated based on the semantic information of the recognized characters, such image problems do not affect the accuracy of the predicted character, so generating the recognition result according to both the semantic feature of the predicted character and the image feature is beneficial to improving the accuracy of the text recognition result.
Second, training phase
In the embodiment of the present application, the training phase describes a process of how the training device 220 trains the text recognition network. Referring to fig. 7, fig. 7 is a schematic flowchart of a training method for a text recognition network according to an embodiment of the present application, where the method includes:
701. The training device obtains an image to be recognized from a training data set.
In the embodiment of the application, a training data set is pre-configured on a training device, the training data set comprises a plurality of images to be recognized and a correct result corresponding to a first character in each image to be recognized, and the training device randomly acquires one image to be recognized from the training data set.
702. The training equipment inputs the image to be recognized to an image feature extraction module, and performs feature extraction on the image to be recognized to generate a first feature corresponding to a first character in the image to be recognized.
703. The training equipment inputs a preset character corresponding to the first character in the image to be recognized into a text characteristic acquisition module, and performs text prediction according to the preset character to generate a semantic characteristic of the first predicted character.
704. The training equipment combines the characteristics of the preset characters with the first characteristics through the characteristic updating module to generate fourth characteristics.
705. The training device performs a recognition operation by the recognition module based on the first feature and the semantic features of the first predicted character to generate a first recognition result.
706. The training equipment inputs a second character corresponding to the recognized character in the first character to a text feature acquisition module, and performs text prediction according to the second character to generate a semantic feature of the first predicted character.
707. The training device combines the features of the second character with the first features through a feature update module to generate seventh features.
708. The training device performs a recognition operation by the recognition module based on the first feature and the semantic features of the first predicted character to generate a second recognition result.
In this embodiment of the application, a specific implementation manner of the training device to execute the steps 702 to 708 is similar to a specific implementation manner of the steps 301 to 307 in the embodiment corresponding to fig. 3, and reference may be made to the description of the steps 301 to 307 in the embodiment corresponding to fig. 3, which is not described herein again.
709. The training equipment trains the text recognition network according to the correct result, the recognition result and the loss function corresponding to the first character in the image to be recognized.
In the embodiment of the application, after the training device obtains the recognition result of a first character in the image to be recognized, the training device calculates the function value of the loss function according to the correct result corresponding to the first character in the image to be recognized and the recognition result of the first character in the image to be recognized, and performs gradient derivation on the function value of the loss function to reversely update the weight parameter of the text recognition network, so as to complete one training on the text recognition network. The training equipment repeatedly executes the steps to realize iterative training of the text recognition network.
Specifically, if a first character includes only one character to be recognized, or a first character includes only one word to be recognized, the training device may directly output the recognition result of the whole first character after performing steps 701 to 705, and calculates the function value of the loss function according to the correct result corresponding to the first character in the image to be recognized and the first recognition result output in step 705.

If a first character includes a plurality of characters to be recognized, or a first character includes a plurality of words to be recognized, the training device outputs the recognition result of the whole first character after performing steps 701 to 705 once and performing steps 706 to 708 at least once, and calculates the function value of the loss function according to the correct result corresponding to the first character in the image to be recognized, the first recognition result output in step 705, and the at least one second recognition result obtained in step 708.
The loss function indicates the similarity between the correct result corresponding to the first character in the image to be recognized and the recognition result of that first character, and the goal of training is to increase this similarity. The loss function may be specifically a cross-entropy loss function, a focal loss function, a center loss function, or another type of loss function, which is not limited here.
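A minimal sketch of one such training iteration with the cross-entropy variant follows; the helper forward_recognition (which stacks the per-step outputs of steps 702 to 708 into logits) and the optimizer choice are hypothetical.

```python
import torch
import torch.nn.functional as F

def train_step(net, optimizer, image, target_ids: torch.Tensor) -> float:
    # forward_recognition is a hypothetical helper returning [T, num_classes] logits
    logits = forward_recognition(net, image, steps=len(target_ids))
    loss = F.cross_entropy(logits, target_ids)  # similarity to the correct result
    optimizer.zero_grad()
    loss.backward()      # gradient derivation on the function value of the loss
    optimizer.step()     # update the weight parameters of the text recognition network
    return loss.item()
```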
The condition for ending the iterative training (i.e., the preset condition) may be that the loss function satisfies a convergence condition, or that the number of iterations reaches a preset number.
In the embodiment of the application, a training method for the text recognition network is provided, which improves the completeness of the scheme. When the image to be recognized is blurred or some characters in it are occluded, the accuracy of the features of the blurred or occluded characters included in the first feature can be greatly reduced. In the training stage, the semantic feature of the predicted character is generated based on the semantic information of the recognized characters, and the recognition result is generated according to the semantic feature of the predicted character together with the image feature; since image problems such as blurring or partial occlusion do not affect the accuracy of the predicted character, this is beneficial to improving the accuracy of the text recognition results output by the trained text recognition network.
In order to more intuitively understand the beneficial effects brought by the embodiments of the present application, the following table 1 shows the beneficial effects brought by the embodiments of the present application through experimental data.
                                           svt       SVTP      CT80
OCR                                        88.2%     77.67%    84.98%
The embodiments of the present application 92.4%     84.2%     89.9%

TABLE 1
Referring to table 1 above, svt, SVTP, and CT80 are three public data sets. The first data row in table 1 indicates the accuracy of the recognition results obtained by performing text recognition on the images in data set svt, data set SVTP, and data set CT80 using optical character recognition (OCR) technology. The second data row indicates the accuracy of the recognition results obtained by performing text recognition on the same images using the text recognition network provided in the embodiment of the present application. Clearly, the accuracy of the recognition results obtained with the text recognition network provided in the embodiment of the present application is higher.
In addition, please continue to refer to fig. 8, where fig. 8 is a schematic diagram illustrating an advantageous effect of the text recognition network according to the embodiment of the present application. For the first line of data in fig. 8, when the characters in the image to be recognized are recognized only according to the image features of the image to be recognized, the obtained recognition result is "sheet", and the characters in the image to be recognized are recognized by using the text recognition network provided by the embodiment of the present application, and the obtained recognition result is "sheet". By analogy with the data in the second row and the data in the third row in fig. 8, it is obvious that the recognition result obtained by using the text recognition network provided by the embodiment of the present application has higher accuracy.
On the basis of the embodiments corresponding to fig. 1 to 8, in order to better implement the above-mentioned scheme of the embodiments of the present application, the following also provides related equipment for implementing the above-mentioned scheme. Specifically referring to fig. 9, fig. 9 is a schematic structural diagram of a text recognition network according to an embodiment of the present application. The text recognition network 900 may include an image feature extraction module 901, a text feature acquisition module 902, and a recognition module 903. The image feature extraction module 901 is configured to acquire an image to be recognized and perform feature extraction on the image to be recognized to generate a first feature corresponding to a first character in the image to be recognized, where the first character is a character to be recognized in the image to be recognized; a text feature obtaining module 902, configured to obtain a preset character corresponding to a first character in an image to be recognized, and perform text prediction according to the preset character to generate a semantic feature of the first predicted character; and the recognition module 903 is used for executing a recognition operation according to the first characteristic and the semantic characteristic of the first predicted character so as to generate a recognition result corresponding to the first character in the image to be recognized.
In one possible design, the text feature obtaining module 902 is specifically configured to, under the condition that a recognition operation is performed on an image to be recognized for the first time, obtain a preset character corresponding to a first character in the image to be recognized, and perform text prediction according to the preset character to generate a semantic feature of a second predicted character; the text feature obtaining module 902 is further configured to, in a case that a recognition operation has been performed on at least one of the first characters, determine a recognition result corresponding to a recognized character in the first characters as a second character, and generate a semantic feature of a second predicted character corresponding to the second character.
In one possible design, the recognition module 903 is further configured to perform a recognition operation according to the first feature and the semantic feature of the second predicted character to generate a recognition result corresponding to the first character in the image to be recognized.
In a possible design, please refer to fig. 10, where fig. 10 is a schematic structural diagram of a text recognition network according to an embodiment of the present application. The text feature obtaining module 902 includes: the first generation sub-module 9021 is configured to perform vectorization processing on the preset character to generate a character code of the preset character, and generate a position code of the preset character according to a position of a first character of the preset character in the image to be recognized; the combining sub-module 9022 is configured to combine the character code of the preset character and the position code of the preset character to obtain an initial feature of the preset character, and perform a self-attention coding operation and a self-attention decoding operation according to the initial feature of the preset character to generate a semantic feature of the first predicted character.
In one possible design, referring to fig. 10, the identification module 903 comprises: a calculation submodule 9031, configured to calculate a similarity between the first feature and a semantic feature of the first predicted character; the second generation submodule 9032 is configured to generate a second feature and a third feature according to the first feature, the semantic feature of the first predicted character, and the similarity, where the second feature is a combination of the semantic feature of the first predicted character on the basis of the first feature, and the third feature is a combination of the semantic feature of the first predicted character and the first feature; and the second generation submodule 9032 is further configured to execute a recognition operation according to the second feature and the third feature, so as to generate a recognition result.
In one possible design, referring to fig. 10, the text recognition network further includes a feature update module 904, the feature update module 904 being configured to: combining the characteristics of the preset characters with the first characteristics to generate updated first characteristics; the recognition module 903 is specifically configured to perform a recognition operation according to the updated first feature and the semantic feature of the first predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
In one possible design, the feature update module 904 is specifically configured to: and executing the self-attention coding operation according to the initial features of the preset characters to obtain updated features of the preset characters, and executing the self-attention coding operation according to the first features and the updated features of the preset characters to generate updated first features.
In a possible design, in the case that the granularity of the recognition operation performed by the text recognition network is characters, at least one character is included in one first character, and one recognition result output by the text recognition network after performing one recognition operation includes one character; in the case that the granularity of the recognition operation performed by the text recognition network is a word, at least one word is included in one first character, and one recognition result output by the text recognition network after performing the recognition operation once is a word including one or more characters.
It should be noted that, the contents of information interaction, execution process, and the like between the modules/units in the text recognition network 900 are based on the same concept as that of the method embodiments corresponding to fig. 3 to fig. 6 in the present application, and specific contents may refer to the description in the foregoing method embodiments in the present application, and are not described herein again.
The embodiment of the present application further provides a training device for a text recognition network, and specifically, referring to fig. 11, fig. 11 is a schematic structural diagram of the training device for a text recognition network provided in the embodiment of the present application. The text recognition network is a neural network used for recognizing characters in the image and comprises an image feature extraction module, a text feature acquisition module and a recognition module. The training apparatus 1100 of the text recognition network includes: input unit 1101, recognition unit 1102, and training unit 1103. The input unit 1101 is configured to input an image to be recognized to the image feature extraction module, perform feature extraction on the image to be recognized, so as to generate a first feature corresponding to a first character in the image to be recognized, where the first character is a character that needs to be recognized in the image to be recognized; the input unit 1101 is further configured to input a preset character corresponding to the first character in the image to be recognized to the text feature acquisition module, and perform text prediction according to the preset character to generate a semantic feature of the first predicted character; a recognition unit 1102, configured to perform a recognition operation by a recognition module according to the first feature and a semantic feature of the first predicted character to generate a recognition result corresponding to the first character in the image to be recognized; the training unit 1103 is configured to train the text recognition network according to a correct result corresponding to the first character in the image to be recognized, a recognition result, and a loss function, where the loss function indicates a similarity between the correct result corresponding to the first character in the image to be recognized and the recognition result corresponding to the first character in the image to be recognized.
In a possible design, please refer to fig. 12, where fig. 12 is a schematic structural diagram of a training apparatus of a text recognition network according to an embodiment of the present disclosure. The input unit 1101 is specifically configured to input a preset character corresponding to a first character in an image to be recognized to the text feature acquisition module in a case where a recognition operation is performed on the image to be recognized for the first time; the training apparatus 1100 of the text recognition network further includes a generating unit 1104 configured to determine, by the text feature acquisition module, a recognition result corresponding to a recognized character in the first characters as a second character and generate a semantic feature of a second predicted character corresponding to the second character in a case where a recognition operation has been performed on at least one character in the first characters.
In one possible design, the recognition unit 1102 is further configured to perform a recognition operation by the recognition module according to the first feature and the semantic feature of the second predicted character to generate a recognition result corresponding to the first character in the image to be recognized.
In one possible design, the input unit 1101 is specifically configured to perform vectorization processing on a preset character through the text feature acquisition module to generate a character code of the preset character, and generate a position code of the preset character according to a position of a first character of the preset character in an image to be recognized; the method comprises the steps of combining a character code of a preset character and a position code of the preset character through a text feature acquisition module to obtain an initial feature of the preset character, and executing self-attention coding operation and self-attention decoding operation according to the initial feature of the preset character to generate a semantic feature of a first predicted character.
In one possible design, the identifying unit 1102 is specifically configured to: calculating, by the recognition module, a similarity between the first feature and a semantic feature of the first predicted character; generating a second feature and a third feature through a recognition module according to the first feature and the semantic features and the similarity of the first predicted character, wherein the second feature is the semantic features combined with the first predicted character on the basis of the first feature, and the third feature is the semantic features combined with the first feature on the basis of the semantic features of the first predicted character; and performing recognition operation according to the second characteristic and the third characteristic through the recognition module to generate a recognition result.
In one possible design, referring to FIG. 12, the text recognition network further includes a feature update module. The training apparatus 1100 for text recognition network further includes a combining unit 1105, configured to combine the feature of the preset character with the first feature through the feature updating module to generate an updated first feature; the identifying unit 1102 is specifically configured to perform, by the identifying module, an identifying operation according to the updated first feature and the semantic feature of the first predicted character, so as to generate an identifying result corresponding to the first character in the image to be identified.
In one possible design, the combining unit 1105 is specifically configured to: perform, through the feature update module, a self-attention encoding operation according to the initial feature of the preset character to obtain an updated feature of the preset character; and perform, through the feature update module, a self-attention encoding operation according to the first feature and the updated feature of the preset character to generate the updated first feature.
In a possible design, in a case where the granularity of the recognition operation performed by the text recognition network is a character, one first character includes at least one character, and one recognition result output by the text recognition network after performing one recognition operation includes one character; in a case where the granularity of the recognition operation performed by the text recognition network is a word, one first character includes at least one word, and one recognition result output by the text recognition network after performing one recognition operation is a word including one or more characters.
It should be noted that the information exchange and execution processes between the modules/units in the training apparatus 1100 of the text recognition network are based on the same concept as the method embodiment corresponding to fig. 7 in this application. For specific content, refer to the descriptions in the foregoing method embodiments of this application; details are not described herein again.
Referring to fig. 13, fig. 13 is a schematic structural diagram of an execution device according to an embodiment of the present application. The text recognition network 900 described in the embodiment corresponding to fig. 9 or fig. 10 may be deployed on the execution device 1300, to implement the functions of the execution device in the embodiments corresponding to fig. 3 to fig. 6. Specifically, the execution device 1300 includes a receiver 1301, a transmitter 1302, a processor 1303, and a memory 1304 (the execution device 1300 may include one or more processors 1303; one processor is used as an example in fig. 13), where the processor 1303 may include an application processor 13031 and a communication processor 13032. In some embodiments of the present application, the receiver 1301, the transmitter 1302, the processor 1303, and the memory 1304 may be connected through a bus or in another manner.
The memory 1304 may include a read-only memory and a random access memory, and provide instructions and data to the processor 1303. A part of the memory 1304 may further include a non-volatile random access memory (NVRAM). The memory 1304 stores operating instructions executable by the processor, executable modules, or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
The processor 1303 controls the operation of the execution apparatus. In a particular application, the various components of the execution device are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. For clarity of illustration, the various buses are referred to in the figures as a bus system.
The method disclosed in the embodiments of the present application may be applied to the processor 1303, or implemented by the processor 1303. The processor 1303 may be an integrated circuit chip with a signal processing capability. In an implementation process, the steps of the method may be completed by hardware integrated logic circuits in the processor 1303 or by instructions in the form of software. The processor 1303 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 1303 may implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or may be any conventional processor. The steps of the method disclosed with reference to the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM or an EPROM, or a register. The storage medium is located in the memory 1304, and the processor 1303 reads information in the memory 1304 and completes the steps of the foregoing method in combination with its hardware.
The receiver 1301 may be configured to receive input numeric or character information and generate signal input related to settings and function control of the execution device. The transmitter 1302 may be configured to output numeric or character information through a first interface; the transmitter 1302 may be further configured to send an instruction to a disk group through the first interface to modify data in the disk group; the transmitter 1302 may further include a display device such as a display screen.
In this embodiment, the application processor 13031 is configured to execute the functions of the execution device in the corresponding embodiments in fig. 3 to fig. 6. It should be noted that, for specific implementation manners and advantageous effects brought by the application processor 13031 for executing the functions of the execution device in the embodiments corresponding to fig. 3 to fig. 6, reference may be made to descriptions in each method embodiment corresponding to fig. 3 to fig. 6, and details are not described here any more.
Referring to fig. 14, fig. 14 is a schematic structural diagram of a training device according to an embodiment of the present application. The training apparatus 1100 of the text recognition network described in the embodiment corresponding to fig. 11 or fig. 12 may be deployed on the training device 1400, to implement the functions of the training device in the embodiment corresponding to fig. 7. Specifically, the training device 1400 is implemented by one or more servers. The training device 1400 may vary greatly with configuration or performance, and may include one or more central processing units (CPUs) 1422 (for example, one or more processors), a memory 1432, and one or more storage media 1430 (for example, one or more mass storage devices) storing an application 1442 or data 1444. The memory 1432 and the storage medium 1430 may be transient storage or persistent storage. The program stored in the storage medium 1430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the training device. Further, the central processing unit 1422 may be configured to communicate with the storage medium 1430, to perform, on the training device 1400, the series of instruction operations in the storage medium 1430.
The training device 1400 may further include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1458, and/or one or more operating systems 1441 such as Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.
In this embodiment of the application, the central processing unit 1422 is configured to implement the function of the training device in the embodiment corresponding to fig. 7. It should be noted that, for the specific implementation manner and the beneficial effects brought by the central processing unit 1422 executing the functions of the training device in the embodiment corresponding to fig. 7, reference may be made to the descriptions in each method embodiment corresponding to fig. 7, and details are not repeated here.
An embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium stores a program, and when the program is run on a computer, the computer is enabled to perform the steps performed by the execution device in the embodiments corresponding to fig. 3 to fig. 6, or the steps performed by the training device in the embodiment corresponding to fig. 7.
An embodiment of the present application further provides a computer program product. When the computer program product runs on a computer, the computer is enabled to perform the steps performed by the execution device in the embodiments corresponding to fig. 3 to fig. 6, or the steps performed by the training device in the embodiment corresponding to fig. 7.
An embodiment of the present application further provides a circuit system. The circuit system includes a processing circuit, and the processing circuit is configured to perform the steps performed by the execution device in the embodiments corresponding to fig. 3 to fig. 6, or the steps performed by the training device in the embodiment corresponding to fig. 7.
The execution device or the training device provided in the embodiments of the present application may specifically be a chip. The chip includes a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute computer-executable instructions stored in a storage unit, so that the chip performs the steps performed by the execution device in the embodiments corresponding to fig. 3 to fig. 6, or the steps performed by the training device in the embodiment corresponding to fig. 7. Optionally, the storage unit is a storage unit in the chip, for example, a register or a cache; or the storage unit may be a storage unit outside the chip, for example, a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
Specifically, refer to fig. 15, where fig. 15 is a schematic structural diagram of a chip according to an embodiment of the present application. The chip may be represented as a neural network processing unit NPU 150. The NPU 150 is mounted to a host CPU (Host CPU) as a coprocessor, and the host CPU allocates tasks. A core part of the NPU is the arithmetic circuit 1503, and the controller 1504 controls the arithmetic circuit 1503 to extract matrix data from a memory and perform multiplication.
In some implementations, the arithmetic circuit 1503 includes a plurality of processing units (PEs). In some implementations, the arithmetic circuit 1503 is a two-dimensional systolic array. Alternatively, the arithmetic circuit 1503 may be a one-dimensional systolic array or another electronic circuit that can perform mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1503 is a general-purpose matrix processor.
For example, it is assumed that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 1503 fetches data corresponding to the matrix B from the weight memory 1502, and buffers the data on each PE in the arithmetic circuit. The arithmetic circuit 1503 fetches data of the matrix A from the input memory 1501, performs a matrix operation on the data of the matrix A and the data of the matrix B, and stores a partial result or a final result of the obtained matrix in an accumulator (accumulator) 1508.
The unified memory 1506 is configured to store input data and output data. The weight data is directly transferred to the weight memory 1502 through a direct memory access controller (DMAC) 1505. The input data is also transferred to the unified memory 1506 through the DMAC.
A bus interface unit (BIU) 1510 is used for interaction among the AXI bus, the DMAC, and the instruction fetch buffer (IFB) 1509. The bus interface unit 1510 is used by the instruction fetch buffer 1509 to obtain instructions from an external memory, and is further used by the direct memory access controller 1505 to obtain original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly configured to transfer input data in the external memory DDR to the unified memory 1506, transfer the weight data to the weight memory 1502, or transfer the input data to the input memory 1501.
The vector calculation unit 1507 includes a plurality of operation processing units, and when necessary, performs further processing on an output of the arithmetic circuit 1503, for example, vector multiplication, vector addition, an exponential operation, a logarithmic operation, or size comparison. The vector calculation unit 1507 is mainly used for network calculation at a non-convolutional/fully connected layer in a neural network, for example, batch normalization, pixel-level summation, and up-sampling of a feature plane.
In some implementations, the vector calculation unit 1507 can store a processed output vector in the unified memory 1506. For example, the vector calculation unit 1507 may apply a linear function and/or a nonlinear function to the output of the arithmetic circuit 1503, for example, perform linear interpolation on a feature plane extracted at a convolutional layer, or accumulate vectors of values to generate an activation value. In some implementations, the vector calculation unit 1507 generates a normalized value, a pixel-level summed value, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 1503, for example, for use at a subsequent layer in the neural network.
An instruction fetch buffer (IFB) 1509 connected to the controller 1504 is configured to store instructions used by the controller 1504.
The unified memory 1506, the input memory 1501, the weight memory 1502, and the instruction fetch buffer 1509 are all on-chip memories. The external memory is private to the NPU hardware architecture.
Here, the operation at each layer of the recurrent neural network may be performed by the arithmetic circuit 1503 or the vector calculation unit 1507.
Any of the foregoing processors may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits configured to control execution of the program of the method according to the first aspect.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be implemented as one or more communication buses or signal lines.
Based on the foregoing descriptions of the embodiments, a person skilled in the art may clearly understand that the present application may be implemented by software in addition to necessary universal hardware, or certainly by dedicated hardware, including an application-specific integrated circuit, a dedicated CPU, a dedicated memory, a dedicated component, and the like. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and a specific hardware structure used to implement a same function may be in various forms, for example, an analog circuit, a digital circuit, or a dedicated circuit. However, for the present application, a software program implementation is usually preferable. Based on such an understanding, the technical solutions of the present application may essentially be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application.
All or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used to implement the embodiments, all or some of the embodiments may be implemented in a form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are all or partially generated. The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or may be transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state disk (SSD)).

Claims (26)

1. A text recognition network is characterized in that the text recognition network is a neural network used for recognizing characters in an image, and the text recognition network comprises an image feature extraction module, a text feature acquisition module and a recognition module;
the image feature extraction module is used for acquiring an image to be recognized and extracting features of the image to be recognized to generate a first feature corresponding to a first character in the image to be recognized, wherein the first character is a character which needs to be recognized in the image to be recognized;
the text feature acquisition module is used for acquiring a preset character corresponding to a first character in the image to be recognized and performing text prediction according to the preset character to generate a semantic feature of the first predicted character;
the recognition module is used for executing recognition operation according to the first feature and the semantic feature of the first predicted character so as to generate a recognition result corresponding to the first character in the image to be recognized.
2. The network of claim 1,
the text feature acquisition module is specifically configured to, under the condition that the recognition operation is performed on the image to be recognized for the first time, acquire a preset character corresponding to a first character in the image to be recognized, and perform text prediction according to the preset character to generate a semantic feature of the first predicted character;
the text feature obtaining module is further configured to, when a recognition operation has been performed on at least one of the first characters, determine a recognition result corresponding to a recognized character in the first characters as a second character, and perform text prediction according to the second character to generate a semantic feature of the second predicted character corresponding to the second character.
3. The network of claim 2,
the recognition module is further configured to perform a recognition operation according to the first feature and the semantic feature of the second predicted character to generate a recognition result corresponding to the first character in the image to be recognized.
4. The network according to any one of claims 1 to 3, wherein the text feature acquisition module comprises:
the first generation submodule is used for carrying out vectorization processing on the preset character so as to generate a character code of the preset character, and generating a position code of the preset character according to the position of the first character of the preset character in the image to be recognized;
and the combination sub-module is used for combining the character code of the preset character and the position code of the preset character to obtain an initial feature of the preset character, and performing a self-attention encoding operation and a self-attention decoding operation according to the initial feature of the preset character to generate the semantic feature of the first predicted character.
5. The network of any one of claims 1 to 3, wherein the identification module comprises:
a calculation submodule for calculating a similarity between the first feature and a semantic feature of the first predicted character;
a second generation submodule, configured to generate a second feature and a third feature according to the first feature, the semantic feature of the first predicted character, and the similarity, wherein the second feature is the first feature combined with the semantic feature of the first predicted character, and the third feature is the semantic feature of the first predicted character combined with the first feature;
the second generation submodule is further configured to execute a recognition operation according to the second feature and the third feature to generate a recognition result.
6. The network of any one of claims 1 to 3, wherein the text recognition network further comprises a feature update module configured to:
combining the feature of the preset character with the first feature to generate an updated first feature;
the recognition module is specifically configured to perform a recognition operation according to the updated first feature and the semantic feature of the first predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
7. The network of claim 6,
the feature update module is specifically configured to perform a self-attention encoding operation according to the initial feature of the preset character to obtain an updated feature of the preset character, and perform a self-attention encoding operation according to the first feature and the updated feature of the preset character to generate the updated first feature.
8. The network according to any of claims 1 to 3,
under the condition that the granularity of the recognition operation executed by the text recognition network is a character, a first character comprises at least one character, and a recognition result output by the text recognition network after executing the recognition operation comprises one character;
in the case that the granularity of the recognition operation performed by the text recognition network is a word, at least one word is included in one first character, and one recognition result output by the text recognition network after performing one recognition operation is a word including one or more characters.
9. A training method of a text recognition network is characterized in that the text recognition network is a neural network used for recognizing characters in an image, the text recognition network comprises an image feature extraction module, a text feature acquisition module and a recognition module, and the method comprises the following steps:
inputting an image to be recognized into the image feature extraction module, and performing feature extraction on the image to be recognized to generate a first feature corresponding to a first character in the image to be recognized, wherein the first character is a character which needs to be recognized in the image to be recognized;
inputting a preset character corresponding to a first character in the image to be recognized into the text feature acquisition module, and performing text prediction according to the preset character to generate a semantic feature of the first predicted character;
performing recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character to generate a recognition result corresponding to the first character in the image to be recognized;
and training the text recognition network according to a correct result corresponding to the first character in the image to be recognized, a recognition result and a loss function, wherein the loss function indicates the similarity between the correct result corresponding to the first character in the image to be recognized and the recognition result corresponding to the first character in the image to be recognized.
10. The method of claim 9,
the text feature acquisition module is specifically configured to, under the condition that the recognition operation is performed on the image to be recognized for the first time, acquire a preset character corresponding to a first character in the image to be recognized, and perform text prediction according to the preset character to generate a semantic feature of the first predicted character;
the text feature obtaining module is further configured to, when a recognition operation has been performed on at least one of the first characters, determine a recognition result corresponding to a recognized character in the first characters as a second character, and perform text prediction according to the second character to generate a semantic feature of the second predicted character corresponding to the second character.
11. The method of claim 10,
the recognition module is further configured to perform a recognition operation according to the first feature and the semantic feature of the second predicted character to generate a recognition result corresponding to the first character in the image to be recognized.
12. A method of text recognition, the method comprising:
inputting an image to be recognized into an image feature extraction module, and performing feature extraction on the image to be recognized to generate a first feature corresponding to a first character in the image to be recognized, wherein the first character is a character which needs to be recognized in the image to be recognized;
inputting a preset character corresponding to a first character in the image to be recognized into a text feature acquisition module, and performing text prediction according to the preset character to generate a semantic feature of the first predicted character;
according to the first feature and the semantic feature of the first predicted character, performing recognition operation through a recognition module to generate a recognition result corresponding to the first character in the image to be recognized;
the image feature extraction module, the text feature acquisition module and the identification module belong to the same text identification network.
13. The method according to claim 12, wherein the inputting a preset character corresponding to the first character in the image to be recognized to a text feature acquisition module comprises:
under the condition that the recognition operation is performed on the image to be recognized for the first time, inputting a preset character corresponding to a first character in the image to be recognized to a text feature acquisition module;
the method further comprises the following steps:
and under the condition that the recognition operation is performed on at least one character in the first characters, determining a recognition result corresponding to the recognized character in the first characters as a second character through the text feature acquisition module, and performing text prediction according to the second character to generate semantic features of the second predicted character corresponding to the second character.
14. The method of claim 13, further comprising:
and according to the first characteristic and the semantic characteristic of the second predicted character, performing recognition operation through the recognition module to generate a recognition result corresponding to the first character in the image to be recognized.
15. The method according to any one of claims 12 to 14, wherein the inputting a preset character corresponding to a first character in the image to be recognized to a text feature acquisition module, and performing text prediction according to the preset character to generate a semantic feature of the first predicted character comprises:
vectorizing the preset character through the text feature acquisition module to generate a character code of the preset character, and generating a position code of the preset character according to the position of a first character of the preset character in the image to be recognized;
and combining, by the text feature acquisition module, the character code of the preset character and the position code of the preset character to obtain the initial feature of the preset character, and performing a self-attention encoding operation and a self-attention decoding operation according to the initial feature of the preset character to generate the semantic feature of the first predicted character.
16. The method according to any one of claims 12 to 14, wherein the performing, by the recognition module, a recognition operation according to the first feature and the semantic feature of the first predicted character to generate a recognition result corresponding to the first character in the image to be recognized comprises:
calculating, by the recognition module, a similarity between the first feature and a semantic feature of the first predicted character;
generating, by the recognition module, a second feature and a third feature according to the first feature, the semantic feature of the first predicted character, and the similarity, wherein the second feature is the first feature combined with the semantic feature of the first predicted character, and the third feature is the semantic feature of the first predicted character combined with the first feature;
and performing, by the recognition module, a recognition operation according to the second feature and the third feature to generate the recognition result.
17. The method of any of claims 12 to 14, wherein the text recognition network further comprises a feature update module, the method further comprising:
combining, by the feature update module, the feature of the preset character with the first feature to generate an updated first feature;
wherein the performing, by the recognition module, a recognition operation according to the first feature and the semantic feature of the first predicted character to generate a recognition result corresponding to the first character in the image to be recognized comprises:
performing, by the recognition module, a recognition operation according to the updated first feature and the semantic feature of the first predicted character, to generate the recognition result corresponding to the first character in the image to be recognized.
18. The method according to claim 17, wherein the combining, by the feature update module, the feature of the preset character with the first feature to generate an updated first feature comprises:
performing, by the feature update module, a self-attention encoding operation according to the initial feature of the preset character to obtain an updated feature of the preset character;
and performing, by the feature update module, a self-attention encoding operation according to the first feature and the updated feature of the preset character to generate the updated first feature.
19. The method according to any one of claims 12 to 14,
under the condition that the granularity of the recognition operation executed by the text recognition network is a character, a first character comprises at least one character, and a recognition result output by the text recognition network after executing the recognition operation comprises one character;
in the case that the granularity of the recognition operation performed by the text recognition network is a word, at least one word is included in one first character, and one recognition result output by the text recognition network after performing one recognition operation is a word including one or more characters.
20. An apparatus for training a text recognition network, wherein the text recognition network is a neural network for recognizing characters in an image, the text recognition network comprises an image feature extraction module, a text feature acquisition module and a recognition module, and the apparatus comprises:
the image recognition device comprises an input unit, a recognition unit and a recognition unit, wherein the input unit is used for inputting an image to be recognized to the image feature extraction module, and performing feature extraction on the image to be recognized so as to generate a first feature corresponding to a first character in the image to be recognized, and the first character is a character which needs to be recognized in the image to be recognized;
the input unit is further configured to input a preset character corresponding to the first character in the image to be recognized to the text feature acquisition module, and perform text prediction according to the preset character to generate a semantic feature of the first predicted character;
the recognition unit is used for executing recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character so as to generate a recognition result corresponding to the first character in the image to be recognized;
and the training unit is used for training the text recognition network according to a correct result corresponding to the first character in the image to be recognized, a recognition result and a loss function, wherein the loss function indicates the similarity between the correct result corresponding to the first character in the image to be recognized and the recognition result corresponding to the first character in the image to be recognized.
21. The apparatus of claim 20,
the input unit is specifically configured to input a preset character corresponding to a first character in the image to be recognized to a text feature acquisition module under the condition that the recognition operation is performed on the image to be recognized for the first time;
the input unit is further configured to, when a recognition operation has been performed on at least one of the first characters, determine, by the text feature acquisition module, a recognition result corresponding to a recognized character in the first characters as a second character, and perform text prediction according to the second character to generate a semantic feature of the second predicted character corresponding to the second character.
22. The apparatus of claim 21,
the recognition unit is further configured to perform, by the recognition module, a recognition operation according to the first feature and the semantic feature of the second predicted character to generate a recognition result corresponding to the first character in the image to be recognized.
23. An execution device comprising a processor coupled to a memory, the memory storing program instructions that, when executed by the processor, implement the steps performed by the text recognition network of any of claims 1 to 8.
24. A training device, comprising a processor, wherein the processor is coupled to a memory, the memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the method according to any one of claims 9 to 11 is implemented.
25. A computer-readable storage medium, characterized by comprising a program which, when run on a computer, causes the computer to perform the steps performed by the text recognition network according to any one of claims 1 to 8, or causes the computer to perform the method according to any one of claims 9 to 11.
26. A circuit system, characterized in that the circuit system comprises a processing circuit configured to perform the steps performed by the text recognition network according to any one of claims 1 to 8, or to perform the method according to any one of claims 9 to 11.
CN202010723541.2A 2020-07-24 2020-07-24 Text recognition network, neural network training method and related equipment Pending CN112016543A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010723541.2A CN112016543A (en) 2020-07-24 2020-07-24 Text recognition network, neural network training method and related equipment
PCT/CN2021/106397 WO2022017245A1 (en) 2020-07-24 2021-07-15 Text recognition network, neural network training method, and related device

Publications (1)

Publication Number Publication Date
CN112016543A true CN112016543A (en) 2020-12-01

Family

ID=73499014

Country Status (2)

Country Link
CN (1) CN112016543A (en)
WO (1) WO2022017245A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114140802B (en) * 2022-01-29 2022-04-29 北京易真学思教育科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN114818738A (en) * 2022-03-01 2022-07-29 达而观信息科技(上海)有限公司 Method and system for identifying user intention track of customer service hotline
CN114661904B (en) * 2022-03-10 2023-04-07 北京百度网讯科技有限公司 Method, apparatus, device, storage medium, and program for training document processing model
CN114495106A (en) * 2022-04-18 2022-05-13 电子科技大学 MOCR (metal-oxide-semiconductor resistor) deep learning method applied to DFB (distributed feedback) laser chip
CN115565186B (en) * 2022-09-26 2023-09-22 北京百度网讯科技有限公司 Training method and device for character recognition model, electronic equipment and storage medium
CN116071759B (en) * 2023-03-06 2023-07-18 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Optical character recognition method fusing GPT2 pre-training large model
CN116311271B (en) * 2023-03-22 2023-12-26 北京百度网讯科技有限公司 Text image processing method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389091B (en) * 2018-10-22 2022-05-03 重庆邮电大学 Character recognition system and method based on combination of neural network and attention mechanism
CN112016543A (en) * 2020-07-24 2020-12-01 华为技术有限公司 Text recognition network, neural network training method and related equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180217976A1 (en) * 2017-01-30 2018-08-02 International Business Machines Corporation Text prediction using captured image from an image capture device
CN108898137A (en) * 2018-05-25 2018-11-27 黄凯 A kind of natural image character identifying method and system based on deep neural network
CN109117846A (en) * 2018-08-22 2019-01-01 北京旷视科技有限公司 A kind of image processing method, device, electronic equipment and computer-readable medium
CN109389909A (en) * 2018-09-20 2019-02-26 友达光电股份有限公司 Display panel
CN111126410A (en) * 2019-12-31 2020-05-08 讯飞智元信息科技有限公司 Character recognition method, device, equipment and readable storage medium

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022017245A1 (en) * 2020-07-24 2022-01-27 华为技术有限公司 Text recognition network, neural network training method, and related device
CN113011246A (en) * 2021-01-29 2021-06-22 招商银行股份有限公司 Bill classification method, device, equipment and storage medium
CN112819684A (en) * 2021-03-02 2021-05-18 成都视海芯图微电子有限公司 Accelerating device for image text recognition
CN112801228A (en) * 2021-04-06 2021-05-14 北京世纪好未来教育科技有限公司 Text recognition method, electronic equipment and storage medium thereof
CN112801228B (en) * 2021-04-06 2021-08-06 北京世纪好未来教育科技有限公司 Text recognition method, electronic equipment and storage medium thereof
CN113762050A (en) * 2021-05-12 2021-12-07 腾讯云计算(北京)有限责任公司 Image data processing method, apparatus, device and medium
CN113610081A (en) * 2021-08-12 2021-11-05 北京有竹居网络技术有限公司 Character recognition method and related equipment thereof
WO2023015941A1 (en) * 2021-08-13 2023-02-16 北京百度网讯科技有限公司 Text detection model training method and apparatus, text detection method, and device
CN115035538A (en) * 2022-03-22 2022-09-09 北京百度网讯科技有限公司 Training method of text recognition model, and text recognition method and device
CN114743020A (en) * 2022-04-02 2022-07-12 华南理工大学 Food identification method combining tag semantic embedding and attention fusion
CN114743020B (en) * 2022-04-02 2024-05-14 华南理工大学 Food identification method combining label semantic embedding and attention fusion

Also Published As

Publication number Publication date
WO2022017245A1 (en) 2022-01-27

Similar Documents

Publication Publication Date Title
CN112016543A (en) Text recognition network, neural network training method and related equipment
CN110020620B (en) Face recognition method, device and equipment under large posture
CN111797893B (en) Neural network training method, image classification system and related equipment
CN115203380B (en) Text processing system and method based on multi-mode data fusion
CN112183577A (en) Training method of semi-supervised learning model, image processing method and equipment
CN111401406B (en) Neural network training method, video frame processing method and related equipment
US20160086078A1 (en) Object recognition with reduced neural network weight precision
CN111275046B (en) Character image recognition method and device, electronic equipment and storage medium
CN111797589A (en) Text processing network, neural network training method and related equipment
WO2021218471A1 (en) Neural network for image processing and related device
CN111414915B (en) Character recognition method and related equipment
CN108171328B (en) Neural network processor and convolution operation method executed by same
JP6107531B2 (en) Feature extraction program and information processing apparatus
CN113095475A (en) Neural network training method, image processing method and related equipment
CN111931002A (en) Matching method and related equipment
CN111738403A (en) Neural network optimization method and related equipment
WO2021190433A1 (en) Method and device for updating object recognition model
CN113011568A (en) Model training method, data processing method and equipment
CN111950700A (en) Neural network optimization method and related equipment
CN111985525A (en) Text recognition method based on multi-mode information fusion processing
CN115081616A (en) Data denoising method and related equipment
CN114821096A (en) Image processing method, neural network training method and related equipment
CN113627421A (en) Image processing method, model training method and related equipment
CN114169393A (en) Image classification method and related equipment thereof
US20230401838A1 (en) Image processing method and related apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination