CN116030471A - Text recognition method, training method, device and equipment for text recognition model - Google Patents

Text recognition method, training method, device and equipment for text recognition model

Info

Publication number
CN116030471A
CN116030471A CN202211712854.3A CN202211712854A
Authority
CN
China
Prior art keywords
character
text
text image
sampling
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211712854.3A
Other languages
Chinese (zh)
Inventor
吕鹏原
范森
章成全
姚锟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211712854.3A priority Critical patent/CN116030471A/en
Publication of CN116030471A publication Critical patent/CN116030471A/en
Pending legal-status Critical Current

Landscapes

  • Character Discrimination (AREA)

Abstract

The disclosure provides a text recognition method and a training method, device and equipment for a text recognition model, relates to the technical field of artificial intelligence, in particular to the technical fields of deep learning, image processing and computer vision, and can be applied to scenes such as OCR. The specific implementation scheme is as follows: respectively determining text image features and character position coding features according to the text image; sampling the text image features according to the character position coding features to obtain sampling features of at least one character in the text image; and performing text recognition on the text image according to the sampling features. With this technical solution, the accuracy of recognizing text in a text image can be improved.

Description

Text recognition method, training method, device and equipment for text recognition model
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of deep learning, image processing and computer vision, and can be applied to scenes such as OCR.
Background
Optical character recognition (OCR), i.e., text detection and recognition in natural scenes, can be widely applied to various industries such as education, medical care, and finance. Technologies derived from text detection and recognition, such as recognition of common cards and tickets, automatic document entry, and photo-based question search, have greatly improved the degree of intelligence and the production efficiency of traditional industries and facilitated people's daily study and life. Although end-to-end text detection and recognition in natural scenes has developed rapidly in recent years, many problems remain unsolved, such as detection and recognition of text of arbitrary shapes and fast text detection and recognition. Improvements are therefore needed.
Disclosure of Invention
The disclosure provides a text recognition method and a training method, device and equipment for a text recognition model.
According to an aspect of the present disclosure, there is provided a text recognition method, the method including:
respectively determining text image characteristics and character position coding characteristics according to the text image;
sampling the text image features according to the character position coding features to obtain sampling features of at least one character in the text image;
and carrying out text recognition on the text image according to the sampling characteristics.
According to another aspect of the present disclosure, there is provided a training method of a text recognition model, the method including:
respectively determining text image characteristics and character position coding characteristics according to the text image;
sampling the text image features according to the character position coding features to obtain sampling features of at least one character in the text image;
according to the sampling characteristics, carrying out text recognition on the text image;
and training the text recognition model according to the recognition result and the label data of the text image.
According to another aspect of the present disclosure, there is provided a text recognition apparatus including:
the image feature determining module is used for respectively determining text image features and character position coding features according to the text image;
the sampling feature determining module is used for sampling the text image features according to the character position coding features to obtain sampling features of at least one character in the text image;
and the text recognition module is used for recognizing the text of the text image according to the sampling characteristics.
According to another aspect of the present disclosure, there is provided a training device for a text recognition model, the device comprising:
the image feature determining module is used for determining text image features and character position coding features according to the text images respectively;
the sampling feature determining module is used for sampling the text image features according to the character position coding features to obtain sampling features of at least one character in the text image;
the text recognition module is used for recognizing the text of the text image according to the sampling characteristics;
and the text recognition model training module is used for training the text recognition model according to the recognition result and the label data of the text image.
According to another aspect of the present disclosure, there is provided an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the text recognition method of any embodiment of the present disclosure, or the training method of the text recognition model of any embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the text recognition method of any embodiment of the present disclosure, or the training method of the text recognition model of any embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a text recognition method according to any embodiment of the present disclosure, or a training method of a text recognition model according to any embodiment of the present disclosure.
According to the technology of the present disclosure, the accuracy of recognizing text in a text image can be improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a text recognition method provided in accordance with an embodiment of the present disclosure;
FIG. 2 is a flow chart of yet another text recognition method provided in accordance with an embodiment of the present disclosure;
FIG. 3A is a flow chart of a training method for a text recognition model provided in accordance with an embodiment of the present disclosure;
FIG. 3B is a schematic diagram of a training process for a text recognition model provided in accordance with an embodiment of the present disclosure;
fig. 4 is a schematic structural view of a text recognition device provided according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a training device for a text recognition model according to an embodiment of the present disclosure;
fig. 6 is a block diagram of an electronic device used to implement a text recognition method or training method for a text recognition model in accordance with an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of text images all comply with relevant laws and regulations and do not violate public order and good morals.
Fig. 1 is a flowchart of a text recognition method provided according to an embodiment of the present disclosure. The method is applicable to end-to-end recognition of text images. The method may be performed by a text recognition apparatus, which may be implemented in software and/or hardware and may be integrated in an electronic device carrying a text recognition function, such as a server. As shown in fig. 1, the text recognition method of this embodiment may include:
s101, respectively determining text image characteristics and character position coding characteristics according to the text image.
In this embodiment, the text image refers to an image on which text recognition needs to be performed; it may be a text image captured by an image capturing device, a text image obtained by a screenshot tool, etc. Text image features refer to features used to characterize the text image and may be represented in matrix or vector form. Character position coding features are used to reflect the interdependence between characters and can be represented in matrix or vector form.
Alternatively, a feature extraction network may be used to perform feature extraction on the text image to obtain the text image features. The feature extraction network may be a convolutional neural network of any structure, such as a VGG network, a ResNet network, a DenseNet network, or a MobileNet network; the feature extraction network may also include operators for improving network performance, such as deformable convolution (deformable conv), squeeze-and-excitation (SE) networks, dilated convolution (dilation conv), Inception modules, and the like. Further, in order to obtain both local features and deeper semantic features, the feature extraction network may also be a Feature Pyramid Network (FPN), a Pyramid Attention Network (PAN), or the like.
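As a rough illustration only (the patent does not specify a concrete network; the backbone choice, channel number, and the class name below are assumptions), such a feature extractor could be sketched as a small convolutional backbone with a 1×1 projection standing in for an FPN/PAN neck:

```python
import torch
import torchvision

class ImageEncoder(torch.nn.Module):
    """Hypothetical feature extractor: a ResNet backbone plus a 1x1 projection
    standing in for an FPN/PAN neck. The disclosure allows any CNN backbone."""

    def __init__(self, out_channels: int = 256):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        # Keep all convolutional stages, drop the average pooling and fc head.
        self.backbone = torch.nn.Sequential(*list(resnet.children())[:-2])
        self.proj = torch.nn.Conv2d(512, out_channels, kernel_size=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) -> text image features: (B, C, H/32, W/32)
        return self.proj(self.backbone(image))
```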
Alternatively, the text image may be embedded to obtain character position encoding features. Or, the text image may be processed based on a preset position coding function to obtain character position coding features.
S102, sampling the text image features according to the character position coding features to obtain sampling features of at least one character in the text image.
In this embodiment, the sampled feature refers to a feature of a character sampled from the text image feature, and may be represented in a matrix or vector form.
Specifically, the text image feature can be sampled according to the character position coding feature based on the sampling feature extraction network, so as to obtain the sampling feature of at least one character in the text image. The sampled feature extraction network may be a deep convolutional neural network, among others.
S103, carrying out text recognition on the text image according to the sampling characteristics.
Specifically, a text recognition network may be used to perform text recognition on the text image according to the sampled characteristics of at least one character in the text image. Wherein the text recognition network may comprise at least one linear layer.
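For illustration, a minimal sketch of such a recognition head is given below; the pooling over sampling points, the feature dimension, and the vocabulary size are assumptions rather than values from the disclosure:

```python
import torch

class RecognitionHead(torch.nn.Module):
    """Hypothetical recognition head: one linear layer over per-character features."""

    def __init__(self, feat_dim: int = 256, vocab_size: int = 6625):
        super().__init__()
        self.classifier = torch.nn.Linear(feat_dim, vocab_size)

    def forward(self, sampled: torch.Tensor) -> torch.Tensor:
        # sampled: (B, N, k, C) features of N characters with k sampling points each.
        pooled = sampled.mean(dim=2)      # (B, N, C): pool over the sampling points
        return self.classifier(pooled)    # (B, N, vocab_size) character logits
```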
According to the technical scheme provided by the embodiment of the disclosure, the text image characteristics and the character position coding characteristics are respectively determined according to the text image, then the text image characteristics are sampled according to the character position coding characteristics, the sampling characteristics of at least one character in the text image are obtained, and further text recognition is carried out on the text image according to the sampling characteristics. Compared with the existing two-stage text recognition scheme with detection before recognition, the technical scheme has the advantages that the text recognition is directly carried out on the sampling features obtained by sampling the text image features, so that the influence of the detection result on the text recognition can be avoided; meanwhile, character position coding features are introduced, so that the interdependence relationship between characters in the text image can be fully obtained, and the text recognition is more accurate.
On the basis of the above embodiment, as an alternative manner of the present disclosure, determining a character position coding feature from a text image includes: encoding pixels in the text image to obtain character position encoding characteristics corresponding to the text image; or, coding the text image features to obtain character position coding features corresponding to the text image.
Specifically, the pixels in the text image can be encoded by adopting sine functions and cosine functions with different frequencies, so that character position encoding characteristics corresponding to the text image are obtained. Alternatively, the text image features may be encoded using sine and cosine functions of different frequencies to obtain character position encoding features corresponding to the text image.
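The sine/cosine encoding mentioned here can be illustrated with a standard sinusoidal encoding over N character positions; the values of N and the feature dimension d below are assumptions:

```python
import math
import torch

def sinusoidal_position_encoding(num_positions: int, dim: int) -> torch.Tensor:
    """Standard sine/cosine position encoding (even dims: sin, odd dims: cos)."""
    positions = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # (N, 1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / dim))                         # (dim/2,)
    enc = torch.zeros(num_positions, dim)
    enc[:, 0::2] = torch.sin(positions * div_term)
    enc[:, 1::2] = torch.cos(positions * div_term)
    return enc                                                                  # (N, d)

# For example, 25 character positions encoded into 256-dimensional features.
character_queries = sinusoidal_position_encoding(25, 256)
```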
It can be understood that providing different ways of obtaining character position coding features can ensure that the interdependence relationship between characters in the text image is learned later, thereby laying a foundation for subsequent text recognition.
Fig. 2 is a flow chart of yet another text recognition method provided in accordance with an embodiment of the present disclosure. The present embodiment provides an alternative implementation based on the above embodiment, further optimizing the "sample feature of at least one character in a text image obtained by sampling the text image feature according to the character position coding feature". As shown in fig. 2, the text recognition method of the present embodiment may include:
s201, respectively determining text image characteristics and character position coding characteristics according to the text image.
S202, inquiring the text image characteristics according to the character position coding characteristics to obtain character characteristics of at least one character in the text image.
In this embodiment, the character features refer to features of characters extracted from text image features, and may be represented in matrix or vector form. The dimensions of the character feature and the dimensions of the character position coding feature are the same.
Alternatively, a character feature extraction network may be used to extract character features from the character position coding features and the text image features to obtain character features of at least one character in the text image. The character feature extraction network is used for extracting features of each character in the text image, and may be a Transformer decoder network.
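A minimal sketch of this query step, assuming a standard Transformer decoder in which N position-encoded character queries attend to the flattened text image features (the layer count, head count, and dimensions are assumptions):

```python
import torch

# N position-encoded character queries attend to the flattened image features
# through a standard Transformer decoder; layer/head counts are assumptions.
decoder_layer = torch.nn.TransformerDecoderLayer(d_model=256, nhead=8, batch_first=True)
decoder = torch.nn.TransformerDecoder(decoder_layer, num_layers=3)

B, C, H, W, N = 2, 256, 8, 32, 25
image_feats = torch.randn(B, C, H, W)             # text image features
queries = torch.randn(N, C)                       # character position coding features

memory = image_feats.flatten(2).transpose(1, 2)   # (B, H*W, C) flattened image features
tgt = queries.unsqueeze(0).repeat(B, 1, 1)        # (B, N, C) one query set per image
char_feats = decoder(tgt, memory)                 # (B, N, C) per-character features
```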
In yet another alternative, a query may be performed from the text image features based on the character position encoding features to obtain query features, and then the query features and the character position encoding features are subjected to attention learning to obtain character features of at least one character in the text image.
S203, a sampling point detection network is adopted to respectively predict character characteristics of at least one character, and sampling point information corresponding to the character is obtained.
In this embodiment, the sampling point detection network is used to determine sampling point information corresponding to characters in the text image, which may be a feedforward neural network; wherein the feed-forward neural network may comprise a single layer or multiple linear layers. The sampling point information refers to the related information of the sampling points corresponding to the characters and can comprise the number of the sampling points corresponding to the characters and the coordinate information of each sampling point; alternatively, the sampling point information may be represented in a matrix or vector form.
Specifically, character features of at least one character are respectively input into a sampling point detection network to conduct sampling point prediction, and sampling point information corresponding to each character is obtained.
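As an illustrative sketch of such a feed-forward sampling point detection network, the head below regresses k normalized (x, y) sampling points per character; the number of points k and the hidden size are assumptions:

```python
import torch

class SamplingPointHead(torch.nn.Module):
    """Hypothetical feed-forward network predicting k (x, y) sampling points per character."""

    def __init__(self, feat_dim: int = 256, num_points: int = 4):
        super().__init__()
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(feat_dim, feat_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(feat_dim, num_points * 2),
        )
        self.num_points = num_points

    def forward(self, char_feats: torch.Tensor) -> torch.Tensor:
        # char_feats: (B, N, C) -> sampling points: (B, N, k, 2), normalized to [0, 1]
        coords = torch.sigmoid(self.ffn(char_feats))
        return coords.view(char_feats.size(0), char_feats.size(1), self.num_points, 2)
```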
S204, sampling the characteristics of the text image according to the sampling point information to obtain the sampling characteristics of at least one character in the text image.
Specifically, feature sampling can be performed from text image features according to coordinate information of sampling points in the sampling point information corresponding to each character, so as to obtain sampling features of each character.
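One possible realization of this sampling step (an assumption, not necessarily the method of the disclosure) is bilinear sampling of the text image features at the predicted normalized coordinates:

```python
import torch
import torch.nn.functional as F

def sample_character_features(image_feats: torch.Tensor,
                              points: torch.Tensor) -> torch.Tensor:
    """Bilinearly sample text image features at per-character sampling points.

    image_feats: (B, C, H, W); points: (B, N, k, 2) with (x, y) normalized to [0, 1].
    Returns sampled features of shape (B, N, k, C).
    """
    grid = points * 2.0 - 1.0                                         # grid_sample expects [-1, 1]
    sampled = F.grid_sample(image_feats, grid, align_corners=False)   # (B, C, N, k)
    return sampled.permute(0, 2, 3, 1)                                # (B, N, k, C)
```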
S205, carrying out text recognition on the text image according to the sampling characteristics.
According to the technical scheme provided by the embodiment of the disclosure, the text image characteristics and the character position coding characteristics are respectively determined according to the text image, then the text image characteristics are queried according to the character position coding characteristics to obtain character characteristics of at least one character in the text image, a sampling point detection network is adopted to respectively predict the character characteristics of the at least one character to obtain sampling point information corresponding to the character, the text image characteristics are sampled according to the sampling point information to obtain sampling characteristics of the at least one character in the text image, and further text recognition is carried out on the text image according to the sampling characteristics. According to the technical scheme, the character characteristics are predicted to obtain the sampling point information corresponding to the characters, and the sampling point information of the characters can be detected rapidly and automatically, so that the text image characteristics can be sampled rapidly and accurately, and a foundation is laid for text recognition.
On the basis of the above embodiment, as an alternative manner of the present disclosure, further includes: and predicting character characteristics of at least one character by adopting a sampling point detection network to obtain character probability and detection frame position characteristics corresponding to the character.
The character probability refers to the probability that the result obtained after predicting the character features is a character, and can be represented by a one-dimensional vector. The detection frame position feature is used for representing the position of the detection frame corresponding to the character. The feature dimension of the detection frame position feature is determined by the shape of the character: the more complex the shape of the character, the larger the feature dimension of the detection frame position feature, that is, the greater the number of points constituting the detection frame.
Specifically, when sampling point information corresponding to at least one character is obtained based on the sampling point detection network, the sampling point detection network can be adopted to respectively predict character features of the at least one character, so that character probability and detection frame position features corresponding to each character are obtained.
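A minimal sketch of such prediction heads, assuming a sigmoid character probability and a detection frame of k points with normalized (x, y) coordinates (the value of k is an assumption):

```python
import torch

class DetectionHead(torch.nn.Module):
    """Hypothetical heads predicting a character probability and a k-point detection frame."""

    def __init__(self, feat_dim: int = 256, box_points: int = 8):
        super().__init__()
        self.prob = torch.nn.Linear(feat_dim, 1)               # 0/1 character probability
        self.box = torch.nn.Linear(feat_dim, box_points * 2)   # k frame points, (x, y) each

    def forward(self, char_feats: torch.Tensor):
        # char_feats: (B, N, C)
        prob = torch.sigmoid(self.prob(char_feats)).squeeze(-1)   # (B, N)
        box = torch.sigmoid(self.box(char_feats))                 # (B, N, 2k), normalized
        return prob, box
```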
It can be understood that the text recognition is performed, and meanwhile, the position of a detection frame of the text can be predicted, so that decoupling of text detection and recognition is realized; meanwhile, text recognition can be carried out on text images with different shapes by automatically adjusting the feature dimension of the position features of the detection frame.
FIG. 3A is a flow chart of a training method for a text recognition model provided in accordance with an embodiment of the present disclosure; fig. 3B is a schematic diagram of a training process for a text recognition model provided in accordance with an embodiment of the present disclosure. The method is applicable to end-to-end recognition of text images. The method may be performed by a training apparatus of the text recognition model, which may be implemented in software and/or hardware and may be integrated in an electronic device carrying a training function of the text recognition model, such as a server. As shown in fig. 3A and 3B, the training method of the text recognition model of this embodiment may include:
s301, respectively determining text image characteristics and character position coding characteristics according to the text image.
In this embodiment, the text image refers to an image on which text recognition needs to be performed; it may be a text image captured by an image capturing device, a text image obtained by a screenshot tool, etc. Text image features refer to features used to characterize the text image and may be represented in matrix or vector form. Character position coding features are used to reflect the interdependence between characters and can be represented in matrix or vector form.
Alternatively, a feature extraction network of the text recognition model may be used to perform feature extraction on the text image to obtain the text image features. The feature extraction network may be a convolutional neural network of any structure, such as a VGG network, a ResNet network, a DenseNet network, or a MobileNet network; the feature extraction network may also include operators for improving network performance, such as deformable convolution (deformable conv), squeeze-and-excitation (SE) networks, dilated convolution (dilation conv), Inception modules, and the like. Further, in order to obtain both local features and deeper semantic features, the feature extraction network may also be a Feature Pyramid Network (FPN), a Pyramid Attention Network (PAN), or the like.
Alternatively, the text image may be embedded to obtain character position encoding features. Or, the text image may be processed based on a preset position coding function to obtain character position coding features.
S302, sampling the text image features according to the character position coding features to obtain sampling features of at least one character in the text image.
In this embodiment, the sampled feature refers to a feature of a character sampled from the text image feature, and may be represented in a matrix or vector form.
Specifically, a sampling feature extraction network of a text recognition model may be used, and according to character position coding features, the text image features are sampled to obtain sampling features of at least one character in the text image. The sampled feature extraction network may be a deep convolutional neural network, among others.
Alternatively, the sampled feature extraction network may include a character feature extraction network and a sampled point detection network. Specifically, a character feature extraction network may be used to extract character features from the character position coding features and the text image features, so as to obtain character features of at least one character in the text image. The character feature extraction network is used for extracting features of each character in the text image, and may be a Transformer decoder network. And then, respectively inputting character features of at least one character into a sampling point detection network to perform sampling point prediction, so as to obtain sampling point information corresponding to each character.
The sampling point detection network is used for determining sampling point information corresponding to characters in the text image, and can be a feedforward neural network; wherein the feed-forward neural network may comprise a single layer or multiple linear layers. The sampling point information refers to the related information of the sampling points corresponding to the characters and can comprise the number of the sampling points corresponding to the characters and the coordinate information of each sampling point; alternatively, the sampling point information may be represented in a matrix or vector form.
S303, carrying out text recognition on the text image according to the sampling characteristics.
Specifically, a text recognition network of a text recognition model may be used to perform text recognition on the text image according to the sampled characteristics of at least one character in the text image. Wherein the text recognition network may comprise at least one linear layer.
S304, training the text recognition model according to the recognition result and the label data of the text image.
In this embodiment, the tag data of the text image refers to the real content of the characters at each position in the pre-labeled text image.
Specifically, a preset loss function may be adopted, training loss is determined according to the recognition result of the text image and the label data of the text image, and then the text recognition model is trained according to the training loss until the training stopping condition is met, so that training of the text recognition model is stopped. The training stopping condition may be that the iteration number meets a frequency threshold, or that the loss error is within a set error range; wherein the frequency threshold and the setting error range can be set by a person skilled in the art according to the actual situation.
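As an illustration of such a training step, the sketch below assumes the model returns per-character recognition logits, character probabilities, and detection frame points, and combines a masked cross-entropy recognition loss with a binary cross-entropy character-presence loss; this concrete loss composition is an assumption, since the disclosure only requires a preset loss function:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, image, char_labels, char_mask):
    """One illustrative training step.

    char_labels: (B, N) long tensor of ground-truth character indices.
    char_mask:   (B, N) float tensor, 1 where a real character exists, 0 for padding.
    """
    logits, char_prob, _boxes = model(image)                  # (B, N, V), (B, N), (B, N, 2k)
    rec_loss = F.cross_entropy(logits.flatten(0, 1), char_labels.flatten(),
                               reduction="none")
    rec_loss = (rec_loss * char_mask.flatten()).sum() / char_mask.sum().clamp(min=1.0)
    prob_loss = F.binary_cross_entropy(char_prob, char_mask)  # character-presence loss
    loss = rec_loss + prob_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```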
As a specific example, referring to fig. 3B, feature extraction may be performed on a text image through the feature extraction network (encoder) of the text recognition model to obtain text image features of dimension C×H×W, where C denotes the number of channels, H the height, and W the width of the text image features. Then, the character feature extraction network (a Transformer decoder) of the text recognition model extracts character features from the text image features and the N×d-dimensional character position coding features (N queries), obtaining character features of at least one character in the text image. Further, the sampling point detection network (FFN) of the text recognition model predicts sampling points from the character features of each character, and the resulting sampling point information is recorded as Rec; meanwhile, the FFN also predicts, for each character, a character probability (a 0/1 classification) and detection frame position features (localization), i.e., an N×1-dimensional character probability and N×2k-dimensional detection frame position features for the text image, where k denotes the number of points constituting a detection frame. Then, the text image features are sampled using the sampling point information Rec of each character (denoted Sampling) to obtain the sampling features of each character. Finally, the text recognition network of the text recognition model performs text recognition on the text image according to the sampling features of at least one character, yielding the recognition result of the text image, and the text recognition model is trained according to the recognition result and the label data of the text image.
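Tying these pieces together, a hypothetical end-to-end forward pass reusing the modules sketched earlier in this description (ImageEncoder, sinusoidal_position_encoding, SamplingPointHead, DetectionHead, sample_character_features, RecognitionHead — all assumed names, not components of the disclosure) could look as follows:

```python
import torch

class TextSpotter(torch.nn.Module):
    """Hypothetical end-to-end model assembled from the earlier sketches."""

    def __init__(self, num_queries: int = 25, dim: int = 256):
        super().__init__()
        self.encoder = ImageEncoder(dim)
        layer = torch.nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = torch.nn.TransformerDecoder(layer, num_layers=3)
        self.queries = torch.nn.Parameter(sinusoidal_position_encoding(num_queries, dim))
        self.point_head = SamplingPointHead(dim)
        self.det_head = DetectionHead(dim)
        self.rec_head = RecognitionHead(dim)

    def forward(self, image: torch.Tensor):
        feats = self.encoder(image)                                    # (B, C, h, w)
        memory = feats.flatten(2).transpose(1, 2)                      # (B, h*w, C)
        tgt = self.queries.unsqueeze(0).repeat(image.size(0), 1, 1)    # (B, N, C)
        char_feats = self.decoder(tgt, memory)                         # (B, N, C)
        points = self.point_head(char_feats)                           # (B, N, k, 2), "Rec"
        prob, box = self.det_head(char_feats)                          # (B, N), (B, N, 2k)
        sampled = sample_character_features(feats, points)             # (B, N, k, C), "Sampling"
        logits = self.rec_head(sampled)                                # (B, N, vocab)
        return logits, prob, box
```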
According to the technical solution provided by the embodiment of the present disclosure, text image features and character position coding features are respectively determined according to the text image; the text image features are then sampled according to the character position coding features to obtain sampling features of at least one character in the text image; text recognition is further performed on the text image according to the sampling features; and finally the text recognition model is trained according to the recognition result and the label data of the text image. Compared with existing two-stage text recognition schemes that detect before recognizing, this technical solution directly performs text recognition on the sampling features obtained by sampling the text image features, so the influence of detection results on text recognition can be avoided; meanwhile, character position coding features are introduced, so that the interdependence between characters in the text image can be fully captured, making text recognition more accurate.
Fig. 4 is a schematic structural diagram of a text recognition apparatus provided according to an embodiment of the present disclosure. This embodiment is applicable to end-to-end recognition of text images. The apparatus may be implemented in software and/or hardware and may be integrated into an electronic device carrying a text recognition function, such as a server. As shown in fig. 4, the text recognition apparatus 400 of this embodiment may include:
the image feature determination module 401, configured to respectively determine text image features and character position coding features according to the text image;
the sampling feature determination module 402, configured to sample the text image features according to the character position coding features to obtain sampling features of at least one character in the text image;
the text recognition module 403, configured to perform text recognition on the text image according to the sampling features.
According to the technical solution provided by the embodiments of the present disclosure, text image features and character position coding features are respectively determined according to the text image; the text image features are then sampled according to the character position coding features to obtain sampling features of at least one character in the text image; and text recognition is further performed on the text image according to the sampling features. Compared with existing two-stage text recognition schemes that detect before recognizing, this technical solution directly performs text recognition on the sampling features obtained by sampling the text image features, so the influence of detection results on text recognition can be avoided; meanwhile, character position coding features are introduced, so that the interdependence between characters in the text image can be fully captured, making text recognition more accurate.
Further, the sampling feature determining module 402 is specifically configured to:
inquiring the text image characteristics according to the character position coding characteristics to obtain character characteristics of at least one character in the text image;
a sampling point detection network is adopted to respectively predict character characteristics of at least one character to obtain sampling point information corresponding to the character;
and sampling the characteristics of the text image according to the sampling point information to obtain the sampling characteristics of at least one character in the text image.
Further, the image feature determining module 401 is specifically configured to:
encoding pixels in the text image to obtain character position coding characteristics corresponding to the text image; or,
and coding the text image characteristics to obtain character position coding characteristics corresponding to the text image.
Further, the apparatus further comprises:
and the detection frame position determining module is used for respectively predicting character characteristics of at least one character by adopting a sampling point detection network to obtain character probability corresponding to the character and detection frame position characteristics.
Further, the feature dimension of the detection frame position feature is determined by the shape of the character.
Fig. 5 is a schematic structural diagram of a training device for a text recognition model according to an embodiment of the present disclosure. This embodiment is applicable to end-to-end recognition of text images. The apparatus may be implemented in software and/or hardware and may be integrated into an electronic device carrying a training function of the text recognition model, such as a server. As shown in fig. 5, the training device 500 of the text recognition model of this embodiment may include:
an image feature determination module 501 for determining text image features and character position coding features from the text image, respectively;
the sampling feature determining module 502 is configured to sample the text image feature according to the character position coding feature to obtain a sampling feature of at least one character in the text image;
a text recognition module 503, configured to perform text recognition on the text image according to the sampling feature;
the text recognition model training module 504 is configured to train the text recognition model according to the recognition result and the tag data of the text image.
According to the technical scheme provided by the embodiment of the disclosure, the text image characteristics and the character position coding characteristics are respectively determined according to the text image, then the text image characteristics are sampled according to the character position coding characteristics, the sampling characteristics of at least one character in the text image are obtained, further, text recognition is carried out on the text image according to the sampling characteristics, and finally, the text recognition model is trained according to the recognition result and the label data of the text image. Compared with the existing two-stage text recognition scheme with detection before recognition, the technical scheme has the advantages that the text recognition is directly carried out on the sampling features obtained by sampling the text image features, so that the influence of the detection result on the text recognition can be avoided; meanwhile, character position coding features are introduced, so that the interdependence relationship between characters in the text image can be fully obtained, and the text recognition is more accurate.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 is a block diagram of an electronic device used to implement a text recognition method or training method for a text recognition model in accordance with an embodiment of the present disclosure. Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the electronic device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic device 600 can also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
A number of components in the electronic device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the electronic device 600 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the respective methods and processes described above, such as a text recognition method or a training method of a text recognition model. For example, in some embodiments, the text recognition method or the training method of the text recognition model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the text recognition method or training method of the text recognition model described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the text recognition method or the training method of the text recognition model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
Artificial intelligence is the discipline of studying the process of making a computer mimic certain mental processes and intelligent behaviors (e.g., learning, reasoning, thinking, planning, etc.) of a person, both hardware-level and software-level techniques. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; the artificial intelligent software technology mainly comprises a computer vision technology, a voice recognition technology, a natural language processing technology, a machine learning/deep learning technology, a big data processing technology, a knowledge graph technology and the like.
Cloud computing refers to a technical system in which an elastically extensible pool of shared physical or virtual resources is accessed through a network; the resources can include servers, operating systems, networks, software, applications, storage devices, and the like, and can be deployed and managed in an on-demand, self-service manner. Cloud computing technology can provide efficient and powerful data processing capabilities for model training and for technical applications such as artificial intelligence and blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (15)

1. A text recognition method, comprising:
respectively determining text image characteristics and character position coding characteristics according to the text image;
sampling the text image features according to the character position coding features to obtain sampling features of at least one character in the text image;
and carrying out text recognition on the text image according to the sampling characteristics.
2. The method of claim 1, wherein the sampling the text image feature according to the character position encoding feature to obtain a sampled feature of at least one character in the text image comprises:
inquiring the text image characteristics according to the character position coding characteristics to obtain character characteristics of at least one character in the text image;
a sampling point detection network is adopted to respectively predict character characteristics of at least one character to obtain sampling point information corresponding to the character;
and sampling the text image characteristics according to the sampling point information to obtain sampling characteristics of at least one character in the text image.
3. The method of claim 1, determining character position-coding features from a text image, comprising:
encoding pixels in the text image to obtain character position encoding characteristics corresponding to the text image; or,
and coding the text image characteristics to obtain character position coding characteristics corresponding to the text image.
4. The method of claim 2, the method further comprising:
and predicting character characteristics of at least one character by adopting a sampling point detection network to obtain character probability and detection frame position characteristics corresponding to the character.
5. The method of claim 4, wherein a feature dimension of the detection frame position feature is determined by a shape of the character.
6. A training method of a text recognition model, comprising:
respectively determining text image characteristics and character position coding characteristics according to the text image;
sampling the text image features according to the character position coding features to obtain sampling features of at least one character in the text image;
according to the sampling characteristics, carrying out text recognition on the text image;
and training the text recognition model according to the recognition result and the label data of the text image.
7. A text recognition device, comprising:
the image feature determining module is used for determining text image features and character position coding features according to the text images respectively;
the sampling feature determining module is used for sampling the text image features according to the character position coding features to obtain sampling features of at least one character in the text image;
and the text recognition module is used for recognizing the text of the text image according to the sampling characteristics.
8. The apparatus of claim 7, wherein the sampling feature determination module is specifically configured to:
inquiring the text image characteristics according to the character position coding characteristics to obtain character characteristics of at least one character in the text image;
a sampling point detection network is adopted to respectively predict character characteristics of at least one character to obtain sampling point information corresponding to the character;
and sampling the text image characteristics according to the sampling point information to obtain sampling characteristics of at least one character in the text image.
9. The apparatus of claim 7, the image feature determination module is specifically configured to:
encoding pixels in the text image to obtain character position encoding characteristics corresponding to the text image; or,
and coding the text image characteristics to obtain character position coding characteristics corresponding to the text image.
10. The apparatus of claim 8, the apparatus further comprising:
and the detection frame position determining module is used for respectively predicting character characteristics of at least one character by adopting a sampling point detection network to obtain character probability and detection frame position characteristics corresponding to the character.
11. The apparatus of claim 10, wherein a feature dimension of the detection frame position feature is determined by a shape of the character.
12. A training device for a text recognition model, comprising:
the image feature determining module is used for determining text image features and character position coding features according to the text images respectively;
the sampling feature determining module is used for sampling the text image features according to the character position coding features to obtain sampling features of at least one character in the text image;
the text recognition module is used for recognizing the text of the text image according to the sampling characteristics;
and the text recognition model training module is used for training the text recognition model according to the recognition result and the label data of the text image.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the text recognition method of any one of claims 1-5 or the training method of the text recognition model of claim 6.
14. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the text recognition method of any one of claims 1-5, or the training method of the text recognition model of claim 6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the text recognition method according to any one of claims 1-5, or the training method of a text recognition model according to claim 6.
CN202211712854.3A 2022-12-29 2022-12-29 Text recognition method, training method, device and equipment for text recognition model Pending CN116030471A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211712854.3A CN116030471A (en) 2022-12-29 2022-12-29 Text recognition method, training method, device and equipment for text recognition model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211712854.3A CN116030471A (en) 2022-12-29 2022-12-29 Text recognition method, training method, device and equipment for text recognition model

Publications (1)

Publication Number Publication Date
CN116030471A true CN116030471A (en) 2023-04-28

Family

ID=86090669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211712854.3A Pending CN116030471A (en) 2022-12-29 2022-12-29 Text recognition method, training method, device and equipment for text recognition model

Country Status (1)

Country Link
CN (1) CN116030471A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507328A (en) * 2020-04-13 2020-08-07 北京爱咔咔信息技术有限公司 Text recognition and model training method, system, equipment and readable storage medium
CN114708580A (en) * 2022-04-08 2022-07-05 北京百度网讯科技有限公司 Text recognition method, model training method, device, apparatus, storage medium, and program
CN114973229A (en) * 2022-05-31 2022-08-30 深圳市星桐科技有限公司 Text recognition model training method, text recognition device, text recognition equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jingjing Wu, Pengyuan Lyu, Guangming Lu, Chengquan Zhang, Kun Yao, Wenjie Pei: "Decoupling Recognition from Detection: Single Shot Self-Reliant Scene Text Spotter", Association for Computing Machinery, 31 October 2022 (2022-10-31), pages 1319, XP059114309, DOI: 10.1145/3503161.3548266 *

Similar Documents

Publication Publication Date Title
CN113033534A (en) Method and device for establishing bill type identification model and identifying bill type
CN113361578A (en) Training method and device of image processing model, electronic equipment and storage medium
CN114863437B (en) Text recognition method and device, electronic equipment and storage medium
CN113177449B (en) Face recognition method, device, computer equipment and storage medium
CN113901907A (en) Image-text matching model training method, image-text matching method and device
CN113947188A (en) Training method of target detection network and vehicle detection method
CN112632227A (en) Resume matching method, resume matching device, electronic equipment, storage medium and program product
CN114022887B (en) Text recognition model training and text recognition method and device, and electronic equipment
CN115578735A (en) Text detection method and training method and device of text detection model
CN114495101A (en) Text detection method, and training method and device of text detection network
CN114022865A (en) Image processing method, apparatus, device and medium based on lane line recognition model
CN115457329B (en) Training method of image classification model, image classification method and device
CN116246287A (en) Target object recognition method, training device and storage medium
CN115909357A (en) Target identification method based on artificial intelligence, model training method and device
CN112966140B (en) Field identification method, field identification device, electronic device, storage medium and program product
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN115937993A (en) Living body detection model training method, living body detection device and electronic equipment
CN112560848B (en) Training method and device for POI (Point of interest) pre-training model and electronic equipment
CN115761839A (en) Training method of human face living body detection model, human face living body detection method and device
CN115273148A (en) Pedestrian re-recognition model training method and device, electronic equipment and storage medium
CN116030471A (en) Text recognition method, training method, device and equipment for text recognition model
CN114724144A (en) Text recognition method, model training method, device, equipment and medium
CN114842541A (en) Model training and face recognition method, device, equipment and storage medium
CN114882334A (en) Method for generating pre-training model, model training method and device
CN114417029A (en) Model training method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination