WO2023221422A1

WO2023221422A1 - Neural network used for text recognition, training method thereof and text recognition method

Info

Publication number: WO2023221422A1
Application number: PCT/CN2022/131189
Authority: WO
Inventors: 殷晓婷; 杜宇宁; 李晨霞; 杨烨华; 赖宝华; 毕然; 马艳军; 胡晓光; 于佃海
Original assignee: 北京百度网讯科技有限公司
Priority date: 2022-05-18
Filing date: 2022-11-10
Publication date: 2023-11-23
Also published as: CN114743196B; CN114743196A

Abstract

Provided in the present disclosure are a neural network used for text recognition, a training method thereof and a text recognition method, relating to the field of artificial intelligence, and particularly relating to computer vision and deep learning technology. The neural network comprises: a first convolutional sub-network configured to output a first feature map on the basis of an image to be recognized; a local fusion sub-network configured to determine, on the basis of a feature vector of each pixel in the first feature map and a feature vector of a plurality of target pixels in the first feature map, local feature vectors of the pixels by means of a self-attention mechanism, so as to obtain a second feature map; a second convolutional sub-network configured to output a third feature map on the basis of the second feature map; a global fusion sub-network configured to determine, on the basis of a feature vector corresponding to each pixel in the third feature map and a feature vector of each pixel itself in the third feature map, global feature vectors of the pixels by using the self-attention mechanism, so as to obtain a fourth feature map; and an output sub-network configured to output a text recognition result on the basis of the fourth feature map.

Description

Neural network for text recognition and its training method, text recognition method

Cross-references to related applications

This application claims priority from Chinese patent application 202210548237.8 filed on May 18, 2022, the entire content of which is incorporated herein by reference.

Technical field

The present disclosure relates to the field of artificial intelligence, specifically to machine learning technology, computer vision technology, image processing technology and deep learning technology, and in particular to a neural network for text recognition, a method of text recognition using a neural network, a training method for a neural network, an electronic device, a computer-readable storage medium, and a computer program product.

Background

Artificial intelligence is the study of using computers to simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.). It involves both hardware-level and software-level technologies. Artificial intelligence hardware technology generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, etc.; artificial intelligence software technology mainly includes computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning, big data processing technology, knowledge graph technology and other major directions.

OCR (Optical Character Recognition) is a technology that converts image information into text information that is easier to edit and store. It is currently widely used in various scenarios, such as bill recognition, bank card information recognition, formula recognition, etc. In addition, OCR also helps many downstream tasks, such as subtitle translation and security monitoring, as well as other visual tasks, such as video search.

The approaches described in this section are not necessarily those that have been previously envisioned or employed. Unless otherwise indicated, it should not be assumed that any method described in this section is prior art merely by virtue of its inclusion in this section. Similarly, unless otherwise indicated, the issues mentioned in this section should not be considered to be recognized in any prior art.

Summary

The present disclosure provides a neural network for text recognition, a method for text recognition using a neural network, a neural network training method, an electronic device, a computer-readable storage medium and a computer program product.

According to an aspect of the present disclosure, a neural network for text recognition is provided, including: a first convolution sub-network configured to perform convolution processing on an image to be recognized to output a first feature map; a local fusion sub-network configured to, for each pixel in the first feature map, use a self-attention mechanism to determine a local feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vectors of multiple target pixels in the first feature map, so as to obtain a second feature map, wherein the multiple target pixels include multiple pixels located in a neighborhood of the pixel in the first feature map; a second convolution sub-network configured to perform convolution processing on the second feature map to output a third feature map; a global fusion sub-network configured to, for each pixel in the third feature map, use the self-attention mechanism to determine a global feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vectors of the pixels in the third feature map, so as to obtain a fourth feature map; and an output sub-network configured to output a text recognition result based on the fourth feature map.

According to another aspect of the present disclosure, a method for text recognition using a neural network is provided. The neural network includes a first convolution sub-network, a local fusion sub-network, a second convolution sub-network, a global fusion sub-network, and an output sub-network. The method includes: inputting an image to be recognized into the first convolution sub-network, the first convolution sub-network being configured to perform convolution processing on the image to be recognized to output a first feature map; inputting the first feature map into the local fusion sub-network, the local fusion sub-network being configured to, for each pixel in the first feature map, use a self-attention mechanism to determine a local feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vectors of multiple target pixels in the first feature map, so as to obtain a second feature map, wherein the multiple target pixels include multiple pixels located in a neighborhood of the pixel in the first feature map; inputting the second feature map into the second convolution sub-network, the second convolution sub-network being configured to perform convolution processing on the second feature map to output a third feature map; inputting the third feature map into the global fusion sub-network, the global fusion sub-network being configured to, for each pixel in the third feature map, use the self-attention mechanism to determine a global feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vectors of the pixels in the third feature map, so as to obtain a fourth feature map; and inputting the fourth feature map into the output sub-network, the output sub-network being configured to output a text recognition result based on the fourth feature map.

According to another aspect of the present disclosure, a training method for a neural network is provided. The neural network includes a first convolution sub-network, a local fusion sub-network, a second convolution sub-network, a global fusion sub-network, and an output sub-network. The method includes: determining a sample image and a corresponding real result; inputting the sample image into the first convolution sub-network, the first convolution sub-network being configured to perform convolution processing on the sample image to output a first feature map; inputting the first feature map into the local fusion sub-network, the local fusion sub-network being configured to, for each pixel in the first feature map, use a self-attention mechanism to determine a local feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vectors of multiple target pixels in the first feature map, so as to obtain a second feature map, wherein the multiple target pixels include multiple pixels located in a neighborhood of the pixel in the first feature map; inputting the second feature map into the second convolution sub-network, the second convolution sub-network being configured to perform convolution processing on the second feature map to output a third feature map; inputting the third feature map into the global fusion sub-network, the global fusion sub-network being configured to, for each pixel in the third feature map, use the self-attention mechanism to determine a global feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vectors of the pixels in the third feature map, so as to obtain a fourth feature map; inputting the fourth feature map into the output sub-network, the output sub-network being configured to output a prediction result of text recognition on the sample image based on the fourth feature map; calculating a loss value based on the real result and the prediction result; and adjusting the parameters of the neural network based on the loss value to obtain a trained neural network.

According to another aspect of the present disclosure, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the above method.

According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause a computer to perform the above method.

According to another aspect of the present disclosure, a computer program product is provided, including a computer program, wherein the computer program implements the above method when executed by a processor.

According to one or more embodiments of the present disclosure, using network modules based on a self-attention mechanism allows image features to be processed in parallel, thereby improving training speed and prediction speed, and using a local fusion sub-network and a global fusion sub-network allows both the local correlation and the global correlation between text characters to be considered, thereby improving prediction accuracy. In addition, the use of convolutional sub-networks makes it possible to use existing deep learning acceleration libraries, thereby further improving the training speed and the prediction speed in the inference stage.

It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the present disclosure will become readily understood from the following description.

Description of the drawings

The drawings illustrate exemplary embodiments and constitute a part of the specification, and together with the written description, serve to explain exemplary implementations of the embodiments. The embodiments shown are for illustrative purposes only and do not limit the scope of the claims. Throughout the drawings, the same reference numbers refer to similar, but not necessarily identical, elements.

Figure 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented in accordance with embodiments of the present disclosure;

Figure 2 shows a structural block diagram of a neural network for text recognition according to an exemplary embodiment of the present disclosure;

Figure 3 shows a flowchart of a method of text recognition according to an exemplary embodiment of the present disclosure;

Figure 4 shows a flow chart of a training method of a neural network according to an exemplary embodiment of the present disclosure; and

Figure 5 illustrates a structural block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the present disclosure are included to facilitate understanding and should be considered to be exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope of the disclosure. Also, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.

In this disclosure, unless otherwise stated, the use of the terms “first”, “second”, etc. to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of these elements. Such terms are only used to distinguish one element from another. In some examples, the first element and the second element can refer to the same instance of the element, and in some cases, based on contextual description, they can refer to different instances.

The terminology used in the description of the various described examples in this disclosure is for the purpose of describing the particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the element may be one or more. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.

In related technologies, existing OCR methods usually use a Recurrent Neural Network (RNN) for sequence modeling, but an RNN cannot be trained in parallel and has low training and prediction efficiency.

To solve the above problems, the present disclosure uses network modules based on a self-attention mechanism so that image features can be processed in parallel, thereby improving training speed and prediction speed, and uses a local fusion sub-network and a global fusion sub-network so that both the local correlation and the global correlation between text characters can be considered, thereby improving prediction accuracy. In addition, the use of convolutional sub-networks makes it possible to use existing deep learning acceleration libraries, thereby further improving the training speed and the prediction speed in the inference stage.

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.

Figure 1 shows a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to Figure 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120.

Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.

In embodiments of the present disclosure, the server 120 may run one or more services or software applications that enable performing methods of text recognition and/or training methods of neural networks.

In some embodiments, server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, such as under a Software as a Service (SaaS) model, to users of client devices 101, 102, 103, 104, 105, and/or 106.

In the configuration shown in Figure 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or combinations thereof that are executable by one or more processors. Users operating client devices 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with server 120 to utilize services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, Figure 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.

Users can use client devices 101, 102, 103, 104, 105 and/or 106 to capture images to be recognized. The client device may provide an interface that enables the user of the client device to interact with the client device. For example, the user may use a camera of the client device to capture an image to be recognized, or use the client device to upload to the server an image stored on the client device. The client device can also output information to the user via the interface. For example, the client device can output to the user the text obtained by recognizing the uploaded image using a text recognition method running on the server. Although Figure 1 depicts only six client devices, those skilled in the art will understand that the present disclosure can support any number of client devices.

Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, game systems, thin clients, various messaging devices, sensors or other sensing devices, etc. These computer devices can run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux or Linux-like operating systems (such as GOOGLE Chrome OS); or include various mobile operating systems, such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, Android. Portable handheld devices may include cellular phones, smart phones, tablet computers, personal digital assistants (PDAs), and the like. Wearable devices may include head-mounted displays (such as smart glasses) and other devices. Gaming systems may include various handheld gaming devices, Internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (such as email applications), Short Message Service (SMS) applications, and can use various communication protocols.

Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a variety of available protocols (including, but not limited to, TCP/IP, SNA, IPX, etc.). By way of example only, one or more networks 110 may be a local area network (LAN), an Ethernet-based network, a token ring, a wide area network (WAN), the Internet, a virtual network, a virtual private network (VPN), an intranet, an extranet, Public Switched Telephone Network (PSTN), infrared network, wireless network (e.g. Bluetooth, WIFI) and/or any combination of these and/or other networks.

Server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (Personal Computer) servers, UNIX servers, midrange servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. Server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture involving virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain the server's virtual storage devices). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.

Computing units in server 120 may run one or more operating systems, including any of the operating systems described above, as well as any commercially available server operating system. Server 120 may also run any of a variety of additional server applications and/or middle-tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.

In some implementations, server 120 may include one or more applications to analyze and incorporate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.

In some implementations, the server 120 may be a server of a distributed system, or a server combined with a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in the cloud computing service system that addresses the drawbacks of difficult management and weak business scalability in traditional physical host and virtual private server (VPS) services.

System 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. Database 130 may reside in various locations. For example, a data repository used by server 120 may be local to server 120, or may be remote from server 120 and may communicate with server 120 via a network-based or dedicated connection. Database 130 may be of different types. In some embodiments, the database used by server 120 may be a database, such as a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to commands.

In some embodiments, one or more of databases 130 may also be used by applications to store application data. The database used by the application can be different types of databases such as key-value repositories, object repositories or regular repositories backed by a file system.

The system 100 of Figure 1 may be configured and operated in various ways to enable the application of the various methods and apparatus described in accordance with the present disclosure.

According to one aspect of the present disclosure, a neural network is provided. As shown in Figure 2, the neural network 200 includes: a first convolution sub-network 204 configured to perform convolution processing on the image to be recognized 202 to output a first feature map; a local fusion sub-network 206 configured to, for each pixel in the first feature map, use a self-attention mechanism to determine a local feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vectors of multiple target pixels in the first feature map, so as to obtain a second feature map, wherein the multiple target pixels include multiple pixels located in a neighborhood of the pixel in the first feature map; a second convolution sub-network 208 configured to perform convolution processing on the second feature map to output a third feature map; a global fusion sub-network 210 configured to, for each pixel in the third feature map, use the self-attention mechanism to determine a global feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vectors of the pixels in the third feature map, so as to obtain a fourth feature map; and an output sub-network 212 configured to output a text recognition result 214 based on the fourth feature map.

Therefore, using network modules based on the self-attention mechanism allows image features to be processed in parallel, thereby improving training speed and prediction speed, and using the local fusion sub-network and the global fusion sub-network allows both the local correlation and the global correlation between text characters to be considered, thereby improving prediction accuracy. In addition, the use of convolutional sub-networks makes it possible to use existing deep learning acceleration libraries, such as the Math Kernel Library for Deep Neural Networks (MKL-DNN), thereby further improving the training speed and the prediction speed in the inference stage.

According to some embodiments, the neural network and its training method and text recognition method of the present disclosure can be applied to any text recognition scenario, including Chinese, English, multi-language, and so on.

According to some embodiments, the image to be recognized may be any image containing text. As mentioned above, the image to be recognized may be an image captured by the camera of the client device, an image already stored on the client device, or an image obtained in other ways, which is not limited here.

In some embodiments, since text is usually in the shape of a long strip, the size of the image to be recognized that is input to the neural network can be limited. In an exemplary embodiment, the size is 32×320. It is understandable that different input sizes can be set according to actual needs. In some embodiments, a preprocessing subnetwork can be set before the first convolutional subnetwork to preprocess the received original image, so that an image to be recognized that meets the input size and/or meets other input requirements can be obtained.

According to some embodiments, at least one of the first convolutional subnetwork and the second convolutional subnetwork may include depthwise separable convolutional layers. The operation of a depthwise separable convolution layer on a received feature map can be divided into two steps: in the first step, each channel of the original feature map is processed with a corresponding N×N×1 convolution kernel to obtain an intermediate feature map of the same size as the original feature map; in the second step, k 1×1 convolution kernels are used to process the intermediate feature map to obtain a feature map with the same width and height as the original feature map, but with a depth of k. Using depthwise separable convolutions can significantly reduce the number of multiplication operations, thereby significantly reducing computational cost and the number of parameters that need to be stored.
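By way of non-limiting illustration, the following minimal sketch compares the multiplication and parameter counts of a standard convolution with those of the two-step depthwise separable convolution described above. It assumes stride 1, size-preserving padding and no bias terms; the function names and the example sizes are hypothetical and are not taken from the disclosure.

```python
def standard_conv_mults(h, w, c_in, k_out, n):
    """Multiplications for a standard N x N convolution producing k_out channels."""
    return h * w * c_in * k_out * n * n


def depthwise_separable_mults(h, w, c_in, k_out, n):
    """Multiplications for the two-step depthwise separable convolution."""
    depthwise = h * w * c_in * n * n   # step 1: one N x N x 1 kernel per input channel
    pointwise = h * w * c_in * k_out   # step 2: k_out kernels of size 1 x 1 (x c_in)
    return depthwise + pointwise


def standard_conv_params(c_in, k_out, n):
    """Weights stored by a standard convolution (bias ignored)."""
    return c_in * k_out * n * n


def depthwise_separable_params(c_in, k_out, n):
    """Weights stored by the depthwise and pointwise kernels (bias ignored)."""
    return c_in * n * n + c_in * k_out


# Hypothetical example sizes: a 4 x 40 feature map, 64 input channels,
# k = 128 output channels, N = 3.
h, w, c_in, k_out, n = 4, 40, 64, 128, 3
std_mults = standard_conv_mults(h, w, c_in, k_out, n)
sep_mults = depthwise_separable_mults(h, w, c_in, k_out, n)
ratio = sep_mults / std_mults  # algebraically equal to 1/k_out + 1/n**2

print(std_mults, sep_mults, round(ratio, 4))
print(standard_conv_params(c_in, k_out, n), depthwise_separable_params(c_in, k_out, n))
```

Under these assumptions the cost ratio simplifies to 1/k + 1/N², so the saving grows with both the number of output channels and the kernel size.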

According to some embodiments, the first convolutional sub-network may also include a conventional convolutional layer to better extract image feature information from the image to be recognized. At least one of the first convolutional subnetwork and the second convolutional subnetwork may include a first depthwise separable convolutional layer, and the second convolutional subnetwork may include a second depthwise separable convolutional layer. The size of the convolution kernel of the first depthwise separable convolutional layer may be smaller than the size of the convolution kernel of the second depthwise separable convolutional layer. In an exemplary embodiment, the size of the convolution kernel of the first depthwise separable convolutional layer may be 3×3, and the size of the convolution kernel of the second depthwise separable convolutional layer may be 5×5. Thus, by gradually increasing the size of the receptive field, the deep semantic features of the image to be recognized can be fully learned.

In some embodiments, after some layers, the obtained feature map can also be processed using a Squeeze-and-Excitation network (SENet) to further enhance the features.

In some embodiments, each of the first convolutional subnetwork and the second convolutional subnetwork may be part of a PaddlePaddle-based Lightweight CPU Convolutional Neural Network (PP-LCNet) suited to central processing units (CPUs). PP-LCNet is a lightweight network that uses fewer parameters and requires less computation in the training and inference stages, and it can be optimized with MKL-DNN at the CPU operation level, so it can be used in task scenarios with higher performance requirements. OCR tasks usually require fast and accurate text recognition results, so using PP-LCNet allows these advantages to be fully exploited.

PP-LCNet includes five stages:

Stage 1 consists of a regular convolutional layer with a convolution kernel of 3×3 and a stride of 2;

Stage 2 includes two depthwise separable convolutional layers with a convolution kernel of 3×3 and strides of 1 and 2 respectively;

Stage 3 includes two depthwise separable convolutional layers with a convolution kernel of 3×3 and strides of 1 and 2 respectively;

Stage 4 includes two depthwise separable convolutional layers with a convolution kernel of 3×3 and strides of 1 and 2 respectively;

Stage 5 consists of seven depthwise separable convolutional layers with a convolution kernel of 5×5. The first five convolutional layers and the seventh convolutional layer have a stride of 1, and the sixth convolutional layer has a stride of 2. SENet (also called the SE module) is used after the sixth and seventh convolutional layers.
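As a non-limiting illustration, the stage listing above can be written down as a small configuration table from which the cumulative downsampling factor is computed. The sketch assumes that each listed stride applies equally to height and width, which the listing does not state; the table layout and the names `STAGES` and `cumulative_stride` are hypothetical.

```python
# (stage name, layer type, kernel size, strides of the layers in order)
STAGES = [
    ("stage1", "conv", 3, [2]),
    ("stage2", "depthwise_separable", 3, [1, 2]),
    ("stage3", "depthwise_separable", 3, [1, 2]),
    ("stage4", "depthwise_separable", 3, [1, 2]),
    ("stage5", "depthwise_separable", 5, [1, 1, 1, 1, 1, 2, 1]),
]


def cumulative_stride(stages):
    """Return the total downsampling factor reached at the end of each stage."""
    total = 1
    factors = {}
    for name, _layer_type, _kernel, strides in stages:
        for stride in strides:
            total *= stride
        factors[name] = total
    return factors


factors = cumulative_stride(STAGES)
for name, factor in factors.items():
    print(name, factor)
```

Under the symmetric-stride assumption the five stages downsample by a factor of 32 in total, so an input of height 32 yields a feature map of height 1. An embodiment that retains more width than height would need different strides along the width, which the listing above does not specify.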

In some embodiments, local fusion sub-networks may be added at one or more of the four positions between stages 1 and 2, between stages 2 and 3, between stages 3 and 4, and between stages 4 and 5.

The more local fusion sub-networks are added, the slower the inference speed of the model. Experiments show that adding one local fusion sub-network can significantly improve accuracy. In addition, placing the local fusion sub-network close to the input end of the neural network significantly increases the amount of computation (the numbers of pixels in the feature maps output by two adjacent convolutional layers/stages differ by a factor of four, or even by a higher power of four, and the closer to the input end, the greater the difference), whereas placing it close to the output end reduces accuracy to some extent. Tests show that adding a local fusion sub-network between stages 3 and 4 achieves the best balance between the two, significantly improving the inference accuracy of the neural network at a small time cost.
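The factor-of-four relationship can be illustrated with a short calculation. The sketch below is a rough cost model under stated assumptions, not a measurement: it assumes that every stage halves both height and width, uses the exemplary 32×320 input, and assumes a hypothetical 3×3 neighborhood (9 target pixels per pixel, ignoring clipping at the borders).

```python
def candidate_positions(height, width, window_size):
    """Pixel count and approximate attention-score count after stages 1 to 4.

    Assumes each stage halves both height and width. The score count is an
    upper bound because neighborhoods are clipped at the map borders.
    """
    rows = []
    for stage in range(1, 5):
        factor = 2 ** stage
        pixels = (height // factor) * (width // factor)
        rows.append((f"after stage {stage}", pixels, pixels * window_size))
    return rows


positions = candidate_positions(32, 320, window_size=9)
for name, pixels, scores in positions:
    print(name, pixels, scores)
```

Each step toward the input multiplies the number of pixels, and hence the number of attention scores the local fusion sub-network has to compute, by four.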

According to some embodiments, using the self-attention mechanism, for each pixel in the first feature map, to determine the local feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vectors of the multiple target pixels in the first feature map, so as to obtain the second feature map, may include: determining an attention score of the feature vector corresponding to each of the multiple target pixels with respect to the feature vector corresponding to the pixel; and fusing the feature vectors corresponding to the multiple target pixels based on those attention scores to obtain the local feature vector of the pixel. In this way, each pixel in the first feature map is fused with the pixels in its local neighborhood using the self-attention mechanism, which captures local features between strokes and thereby strengthens the feature vector of the pixel.

The above-mentioned processing method of the feature vector of the pixel and the feature vectors corresponding to the multiple target pixels can refer to the operation of the Transformer block on different input features in the prior art. By using methods that utilize the self-attention mechanism, on the one hand, the inference accuracy can be improved, and on the other hand, features can be processed in parallel to speed up the training process and improve the inference speed.

In some embodiments, the range of the local neighborhood can be set according to requirements, such as a rectangular area with a preset width and a preset height centered on the target pixel. The specific values of the preset width and preset height can also be determined according to needs. It can be understood that local neighborhoods of other shapes or other ranges can also be set, which are not limited here.

In some embodiments, the local fusion sub-network does not change the size of the feature map. That is, the first feature map and the second feature map have the same size.
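As a minimal sketch of the local fusion described above, the following code computes, for each pixel, attention scores against the pixels in a rectangular neighborhood and fuses their feature vectors with the resulting weights. It assumes scaled dot-product scores, no learned projections, and neighborhoods clipped at the border; none of these details, nor the names `local_fusion`, `window_h` and `window_w`, come from the disclosure.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def local_fusion(feature_map, window_h=3, window_w=3):
    """Fuse each pixel with the pixels in a rectangle centered on it.

    feature_map: list of rows, each row a list of feature vectors.
    Assumptions: scaled dot-product scores, no learned projections,
    neighborhoods clipped at the border.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    dim = len(feature_map[0][0])
    half_h, half_w = window_h // 2, window_w // 2
    output = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            query = feature_map[r][c]
            # Target pixels: the clipped rectangular neighborhood of (r, c).
            targets = [
                feature_map[i][j]
                for i in range(max(0, r - half_h), min(rows, r + half_h + 1))
                for j in range(max(0, c - half_w), min(cols, c + half_w + 1))
            ]
            weights = softmax([dot(query, t) / math.sqrt(dim) for t in targets])
            # Weighted sum of the target vectors is the local feature vector.
            fused = [sum(w * t[k] for w, t in zip(weights, targets)) for k in range(dim)]
            out_row.append(fused)
        output.append(out_row)
    return output


first_map = [[[float(r + c), 1.0] for c in range(4)] for r in range(2)]
second_map = local_fusion(first_map)
```

The output has the same height, width and vector length as the input, matching the statement that the local fusion sub-network does not change the size of the feature map.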

In some embodiments, after the third feature map is obtained, it can be processed directly by the global fusion sub-network, or its size can first be transformed by a convolution layer and the size-transformed third feature map then processed by the global fusion sub-network. In an exemplary embodiment, the size of the image to be recognized is H×W, the size of the third feature map output by the second convolution subnetwork is H/32×W/4, and the size of the third feature map after size transformation by the convolution layer is H/32×W/8.

According to some embodiments, the height of the third feature map may be 1/32 of the height of the image to be recognized. In an embodiment where the height of the image to be recognized is 32, the height of the third feature map may be 1. It can be understood that the third feature map here may be the third feature map output by the second convolution subnetwork, or it may be the third feature map after size transformation. This setting is based on the fact that the prediction speed of the global fusion subnetwork is highly sensitive to the shape/size of the features it receives. Therefore, by limiting the input feature shape, its prediction speed can be improved, thereby improving the overall text recognition speed. In fact, the third feature map with a height of 1 is essentially equivalent to a feature vector sequence, and each feature vector in the sequence corresponds to an image area composed of several consecutive columns of pixels in the image to be recognized.
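The shape arithmetic can be checked with a short sketch. It uses the exemplary ratios given above (H/32×W/4 at the output of the second convolution subnetwork, H/32×W/8 after size transformation); the function name and the 32×320 input size are hypothetical.

```python
def third_feature_map_shape(image_h, image_w, transformed=True):
    """Shape (height, width) of the third feature map for an H x W image.

    Uses the exemplary ratios from the text: H/32 x W/4 at the output of the
    second convolution subnetwork, H/32 x W/8 after the size transformation.
    """
    width_divisor = 8 if transformed else 4
    return image_h // 32, image_w // width_divisor


# Hypothetical 32 x 320 input image.
raw_shape = third_feature_map_shape(32, 320, transformed=False)
seq_shape = third_feature_map_shape(32, 320)
# Each feature vector in the height-1 map covers this many image columns.
columns_per_vector = 320 // seq_shape[1]
```

With a height of 1, the map can be handled as a plain sequence of feature vectors, one per group of consecutive image columns.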

After the third feature map is obtained, it can be processed by the global fusion sub-network, which is also based on the self-attention mechanism. It can be understood that the way the global fusion sub-network processes the third feature map is similar to the way the local fusion sub-network processes the first feature map. The difference is that, for each target pixel in the third feature map, the global fusion sub-network calculates an attention score with respect to every pixel in the third feature map and fuses the feature vectors of all pixels according to these attention scores to strengthen the feature vector of the target pixel. The global fusion sub-network thus achieves the fusion of global features.

In some embodiments, the global fusion sub-network also does not change the size of the feature map. In other words, the third feature map and the fourth feature map have the same size. In an exemplary embodiment, the size of the third feature map and the fourth feature map are both 1×40.
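A minimal sketch of global fusion over a height-1 feature map (a sequence of feature vectors) follows. As with any illustration here, scaled dot-product scores without learned projections are an assumption, and `global_fusion` is a hypothetical name.

```python
import math


def global_fusion(sequence):
    """Fuse every feature vector with all vectors in the sequence.

    sequence: list of feature vectors (a height-1 feature map).
    Assumptions: scaled dot-product scores and no learned projections.
    """
    dim = len(sequence[0])
    fused_sequence = []
    for query in sequence:
        # One attention score per pixel of the whole map.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in sequence]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum over all pixels is the global feature vector.
        fused = [sum(w * v[d] for w, v in zip(weights, sequence)) for d in range(dim)]
        fused_sequence.append(fused)
    return fused_sequence


third_map = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fourth_map = global_fusion(third_map)
```

The only structural difference from the local case is that the set of pixels attended to is the whole map rather than a neighborhood, and the output length equals the input length.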

According to some embodiments, the neural network may further include at least one of the following: a first fusion layer configured to fuse the first feature map and the second feature map to update the second feature map; and a second fusion layer configured to fuse the third feature map and the fourth feature map to update the fourth feature map. As a result, through the above-mentioned fusion layers (i.e., skip connections), the representation of the feature map is further enriched so that it includes both deep and shallow semantic information, thereby improving the accuracy of the inference results.
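A fusion layer of this kind can be sketched as an element-wise sum of two equally sized feature maps. Element-wise addition is an assumption made for illustration; the text only states that the maps are fused, and `fuse` is a hypothetical name.

```python
def fuse(shallow_map, deep_map):
    """Skip connection: element-wise sum of two equally sized feature maps.

    Each map is a list of rows, each row a list of feature vectors.
    Element-wise addition is an assumption; the text only says "fuse".
    """
    return [
        [[a + b for a, b in zip(vec_s, vec_d)] for vec_s, vec_d in zip(row_s, row_d)]
        for row_s, row_d in zip(shallow_map, deep_map)
    ]


first_map = [[[1.0, 2.0], [3.0, 4.0]]]
second_map = [[[0.5, 0.5], [0.25, 0.75]]]
second_map = fuse(first_map, second_map)  # updated second feature map
```

Because the fusion sub-networks preserve the feature map size, the two inputs always have matching shapes.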

According to some embodiments, the output subnetwork can be any network structure capable of outputting text recognition results based on feature maps. In an exemplary embodiment, the output sub-network may be a fully connected layer or a multi-layer perceptron. It can be understood that other network structures can also be used as output subnetworks, which are not limited here.
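As an illustration of a fully connected output sub-network, the sketch below maps each feature vector of the fourth feature map to per-character logits and picks the highest-scoring character. The per-position argmax decoding, the two-character alphabet and the hand-picked weights are assumptions made for the example, not details from the disclosure.

```python
def fully_connected(vector, weights, bias):
    """One logit per character class: weights[c] . vector + bias[c]."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def recognize_sequence(fourth_map, weights, bias, alphabet):
    """Pick the highest-scoring character for every feature vector."""
    chars = []
    for vector in fourth_map:
        logits = fully_connected(vector, weights, bias)
        chars.append(alphabet[logits.index(max(logits))])
    return "".join(chars)


# Hypothetical two-class alphabet and hand-picked weights.
alphabet = "ab"
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
text = recognize_sequence([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]], weights, bias, alphabet)
```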

According to another aspect of the present disclosure, a method for text recognition using a neural network is provided. The neural network includes a first convolution subnetwork, a local fusion subnetwork, a second convolution subnetwork, a global fusion subnetwork, and an output subnetwork. As shown in Figure 3, the method includes: step S301, inputting the image to be recognized into the first convolution subnetwork, where the first convolution subnetwork is configured to perform convolution processing on the image to be recognized to output the first feature map; step S302, inputting the first feature map into the local fusion sub-network, where the local fusion sub-network is configured to, for each pixel in the first feature map, use the self-attention mechanism to determine the local feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vectors of multiple target pixels in the first feature map, so as to obtain the second feature map, where the multiple target pixels include multiple pixels located in the neighborhood of the pixel in the first feature map; step S303, inputting the second feature map into the second convolution sub-network, where the second convolution sub-network is configured to perform convolution processing on the second feature map to output the third feature map; step S304, inputting the third feature map into the global fusion sub-network, where the global fusion sub-network is configured to, for each pixel in the third feature map, use the self-attention mechanism to determine the global feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vector of each pixel in the third feature map, so as to obtain the fourth feature map; and step S305, inputting the fourth feature map into the output sub-network, where the output sub-network is configured to output the text recognition result based on the fourth feature map.

It can be understood that the operations of steps S301 to S305 in FIG. 3 are similar to the operations of subnetwork 204 to subnetwork 212 in the neural network 200, respectively, and will not be described again.
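The order of steps S301 to S305 can be sketched as a simple chain of five callables. The sub-networks themselves are replaced by placeholders that only record when they run; `recognize` and `stage` are hypothetical names introduced for this illustration.

```python
def recognize(image, first_conv, local_fusion, second_conv, global_fusion, output_net):
    """Chain the five sub-networks in the order of steps S301 to S305."""
    first_map = first_conv(image)            # S301
    second_map = local_fusion(first_map)     # S302
    third_map = second_conv(second_map)      # S303
    fourth_map = global_fusion(third_map)    # S304
    return output_net(fourth_map)            # S305


# Placeholder callables that only record the order in which they run.
trace = []


def stage(name):
    def run(x):
        trace.append(name)
        return x
    return run


result = recognize(
    "image",
    stage("first_conv"),
    stage("local_fusion"),
    stage("second_conv"),
    stage("global_fusion"),
    stage("output"),
)
```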

Therefore, by using network modules that utilize the self-attention mechanism, image features can be processed in parallel, thereby improving the prediction speed, and by using the local fusion sub-network and the global fusion sub-network, both the local correlation and the global correlation between text characters can be considered, thereby improving prediction accuracy. In addition, the use of convolutional subnetworks enables the use of existing deep learning acceleration libraries for acceleration, thereby further improving the prediction speed of the inference stage.

According to some embodiments, at least one of the first convolutional subnetwork and the second convolutional subnetwork may include depthwise separable convolutional layers.

According to some embodiments, the first convolutional subnetwork may include a regular convolutional layer, at least one of the first convolutional subnetwork and the second convolutional subnetwork may include a first depthwise separable convolutional layer, and the second convolutional subnetwork may include a second depthwise separable convolutional layer. The size of the convolution kernel used by the first depthwise separable convolutional layer is smaller than the size of the convolution kernel used by the second depthwise separable convolutional layer.
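To give a sense of why depthwise separable convolutional layers are attractive, the following sketch compares weight counts for a regular convolution and a depthwise separable one (a per-channel k×k depthwise filter followed by a 1×1 pointwise convolution). The channel counts and the 3×3 and 5×5 kernel sizes are hypothetical stand-ins for the smaller and larger kernels; the disclosure does not give these values.

```python
def regular_conv_params(kernel, in_channels, out_channels):
    """Weights of a regular convolution (bias ignored)."""
    return kernel * kernel * in_channels * out_channels


def depthwise_separable_params(kernel, in_channels, out_channels):
    """Depthwise k x k filter per input channel plus a 1 x 1 pointwise convolution."""
    depthwise = kernel * kernel * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Hypothetical channel counts; 3 x 3 and 5 x 5 stand in for the two kernel sizes.
regular = regular_conv_params(3, 64, 128)
small_kernel = depthwise_separable_params(3, 64, 128)
large_kernel = depthwise_separable_params(5, 64, 128)
```

In this example, enlarging the depthwise kernel from 3×3 to 5×5 adds comparatively few weights, since the pointwise term dominates.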

According to some embodiments, using the self-attention mechanism, for each pixel in the first feature map, to determine the local feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vectors of multiple target pixels in the first feature map, so as to obtain the second feature map, may include: determining the attention score of the feature vector corresponding to each of the multiple target pixels with respect to the feature vector corresponding to the pixel; and fusing the feature vectors corresponding to the multiple target pixels, based on those attention scores, to obtain the local feature vector of the pixel.

According to some embodiments, the height of the third feature map may be 1/32 of the height of the image to be recognized.

According to some embodiments, the method of text recognition may further include at least one of the following: fusing the first feature map and the second feature map to update the second feature map; and fusing the third feature map and the fourth feature map to update the fourth feature map.

According to another aspect of the present disclosure, a training method of a neural network is provided. The neural network includes a first convolution subnetwork, a local fusion subnetwork, a second convolution subnetwork, a global fusion subnetwork, and an output subnetwork. As shown in Figure 4, the training method includes: step S401, determining a sample image and the corresponding real result; step S402, inputting the sample image into the first convolution subnetwork, where the first convolution subnetwork is configured to perform convolution processing on the sample image to output the first feature map; step S403, inputting the first feature map into the local fusion sub-network, where the local fusion sub-network is configured to, for each pixel in the first feature map, use the self-attention mechanism to determine the local feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vectors of multiple target pixels in the first feature map, so as to obtain the second feature map; step S404, inputting the second feature map into the second convolution subnetwork, where the second convolution sub-network is configured to perform convolution processing on the second feature map to output the third feature map; step S405, inputting the third feature map into the global fusion sub-network, where the global fusion sub-network is configured to, for each pixel in the third feature map, use the self-attention mechanism to determine the global feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vector of each pixel in the third feature map, so as to obtain the fourth feature map; step S406, inputting the fourth feature map into the output sub-network, where the output sub-network is configured to output the prediction result of text recognition on the sample image based on the fourth feature map; step S407, calculating the loss value based on the real result and the prediction result; and step S408, adjusting the parameters of the neural network based on the loss value to obtain the trained neural network. It can be understood that the operations of steps S402 to S406 in FIG. 4 are similar to the operations of steps S301 to S305 in FIG. 3 and will not be described again.
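The order of the training steps can be illustrated with a deliberately tiny stand-in. The sketch below fits a one-parameter model y = w·x by gradient descent; the model, the squared-error loss and the learning rate are assumptions chosen only to make the S401 to S408 sequence runnable, and they do not represent the neural network or the loss of the disclosure.

```python
def train(samples, learning_rate=0.1, epochs=50):
    """Fit y = w * x by gradient descent, mirroring the order of S401 to S408."""
    w = 0.0                                     # the only "network parameter"
    for _ in range(epochs):
        for x, real in samples:                 # S401: sample and real result
            predicted = w * x                   # S402-S406: forward pass (stand-in)
            loss = (predicted - real) ** 2      # S407: loss from real vs. predicted
            grad = 2 * (predicted - real) * x   # gradient of the loss w.r.t. w
            w -= learning_rate * grad           # S408: adjust the parameter
    return w


trained_w = train([(1.0, 2.0), (2.0, 4.0)])
```

In practice a deep learning framework would compute the gradients of all network parameters automatically; the manual gradient here only keeps the example self-contained.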

Therefore, by using network modules that utilize the self-attention mechanism, image features can be processed in parallel, thereby improving the training speed and the prediction speed, and by using the local fusion sub-network and the global fusion sub-network, both the local correlation and the global correlation between text characters can be considered, thereby improving prediction accuracy. In addition, the use of convolutional subnetworks enables the use of existing deep learning acceleration libraries for acceleration, thereby further improving the training speed and the prediction speed in the inference stage.

According to some embodiments, the loss value may include a connectionist temporal classification (CTC) loss value and a center loss value. CTC loss is a commonly used loss for predicting label sequences, and center loss provides a category center for each category and minimizes the distance between each sample in each batch and the corresponding category center, thereby making the intra-class distance smaller. Therefore, using CTC loss and center loss together on the one hand ensures the model's prediction speed and supports variable-length text input, and on the other hand further exploits the correlation between characters and addresses the difficulty of distinguishing similar characters in text.
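The center loss term can be sketched as follows, using one common form (half the mean squared distance between each sample's feature vector and its category center). The exact formula, the weighting of the two terms and the value 0.05 are assumptions for illustration; the CTC loss value is taken as given rather than computed.

```python
def center_loss(features, labels, centers):
    """Half the mean squared distance between each sample and its class center."""
    total = 0.0
    for feature, label in zip(features, labels):
        center = centers[label]
        total += sum((f - c) ** 2 for f, c in zip(feature, center))
    return 0.5 * total / len(features)


def total_loss(ctc_loss_value, center_loss_value, center_weight=0.05):
    """Weighted sum of the two terms; the weight is a hypothetical choice."""
    return ctc_loss_value + center_weight * center_loss_value


centers = {"a": [0.0, 0.0], "b": [1.0, 1.0]}
features = [[0.0, 1.0], [1.0, 1.0]]
labels = ["a", "b"]
c_loss = center_loss(features, labels, centers)
loss = total_loss(2.0, c_loss)
```

Minimizing this term pulls the features of samples from the same category toward a shared center, which is what reduces the intra-class distance.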

In the technical solution of this disclosure, the collection, storage, use, processing, transmission, provision and disclosure of user personal information are in compliance with relevant laws and regulations and do not violate public order and good customs.

According to embodiments of the present disclosure, an electronic device, a readable storage medium, and a computer program product are also provided.

Referring to FIG. 5, a structural block diagram of an electronic device 500 that may serve as a server or client of the present disclosure will now be described; it is an example of a hardware device that may be applied to aspects of the present disclosure. Electronic devices are intended to refer to various forms of digital electronic computing equipment, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.

As shown in FIG. 5, the device 500 includes a computing unit 501 that can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 502 or loaded from a storage unit 508 into a random access memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. Computing unit 501, ROM 502 and RAM 503 are connected to each other via bus 504. An input/output (I/O) interface 505 is also connected to bus 504.

Multiple components in the device 500 are connected to the I/O interface 505, including: an input unit 506, an output unit 507, a storage unit 508, and a communication unit 509. The input unit 506 may be any type of device capable of inputting information to the device 500. The input unit 506 may receive input numeric or character information and generate key signal input related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, keyboard, touch screen, trackpad, trackball, joystick, microphone and/or remote control. Output unit 507 may be any type of device capable of presenting information, and may include, but is not limited to, a display, speakers, video/audio output terminal, vibrator, and/or printer. The storage unit 508 may include, but is not limited to, magnetic disks and optical disks. The communication unit 509 allows the device 500 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunications networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices and/or the like.

Computing unit 501 may be any of a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning network algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc. The computing unit 501 performs the various methods and processes described above, such as the text recognition method and/or the neural network training method and the machine learning model training method. For example, in some embodiments, the text recognition method and/or the neural network training method and the machine learning model training method may be implemented as a computer software program that is tangibly included in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 500 via ROM 502 and/or communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the text recognition method and/or the neural network training method and the machine learning model training method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured in any other suitable manner (for example, by means of firmware) to perform the text recognition method and/or the neural network training method and the machine learning model training method.

Various implementations of the systems and techniques described above may be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or a combination thereof. These various embodiments may include implementation in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor, which may be a special purpose or general purpose programmable processor, may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).

The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and technologies described herein), or a computing system that includes any combination of such back-end components, middleware components, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include: local area network (LAN), wide area network (WAN), and the Internet.

Computer systems may include clients and servers. Clients and servers are generally remote from each other and typically interact over a communications network. The relationship of client and server is created by computer programs running on the corresponding computers and having a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the defects of difficult management and weak business scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server can also be a distributed system server or a server combined with a blockchain.

It should be understood that various forms of the process shown above may be used, with steps reordered, added or deleted. For example, each step described in the present disclosure can be executed in parallel, sequentially, or in a different order. As long as the desired results of the technical solution disclosed in the present disclosure can be achieved, there is no limitation here.

Although the embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it should be understood that the above-mentioned methods, systems and devices are only exemplary embodiments or examples, and the scope of the present invention is not limited by these embodiments or examples. It is limited only by the granted claims and their equivalent scope. Various elements in the embodiments or examples may be omitted or replaced by equivalent elements thereof. Furthermore, the steps may be performed in a different order than described in this disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many elements described herein may be replaced by equivalent elements appearing after this disclosure.

Claims

A neural network for text recognition, including:

The first convolution subnetwork is configured to perform convolution processing on the image to be recognized to output the first feature map;

The local fusion sub-network is configured to, for each pixel in the first feature map, use a self-attention mechanism to determine the local feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vectors of multiple target pixels in the first feature map, so as to obtain the second feature map, wherein the plurality of target pixels include multiple pixels located in the neighborhood of the pixel in the first feature map;

a second convolution subnetwork configured to perform convolution processing on the second feature map to output a third feature map;

The global fusion sub-network is configured to, for each pixel in the third feature map, use a self-attention mechanism to determine the global feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vector of each pixel in the third feature map, so as to obtain a fourth feature map; and

The output sub-network is configured to output text recognition results based on the fourth feature map.
The neural network of claim 1, wherein at least one of the first convolutional subnetwork and the second convolutional subnetwork includes a depthwise separable convolutional layer.
The neural network of claim 2, wherein the first convolutional subnetwork includes a regular convolutional layer, at least one of the first convolutional subnetwork and the second convolutional subnetwork includes a first depthwise separable convolutional layer, and the second convolutional subnetwork includes a second depthwise separable convolutional layer, wherein the size of the convolution kernel used by the first depthwise separable convolutional layer is smaller than the size of the convolution kernel used by the second depthwise separable convolutional layer.
The neural network according to any one of claims 1-3, wherein the height of the third feature map is 1/32 of the height of the image to be recognized.
The neural network according to any one of claims 1-3, further comprising at least one of the following:

a first fusion layer configured to fuse the first feature map and the second feature map to update the second feature map; and

The second fusion layer is configured to fuse the third feature map and the fourth feature map to update the fourth feature map.
The neural network according to any one of claims 1 to 3, wherein, for each pixel in the first feature map, using a self-attention mechanism to determine the local feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vectors of the multiple target pixels in the first feature map, so as to obtain the second feature map, includes:

Determining the attention score of the feature vector corresponding to each target pixel in the plurality of target pixels with respect to the feature vector corresponding to the pixel; and

Fusing, based on the attention score of the feature vector corresponding to each target pixel in the plurality of target pixels with respect to the feature vector corresponding to the pixel, the feature vectors corresponding to the plurality of target pixels to obtain the local feature vector of the pixel.
A method for text recognition using a neural network. The neural network includes a first convolution subnetwork, a local fusion subnetwork, a second convolution subnetwork, a global fusion subnetwork, and an output subnetwork. The method includes:

Input the image to be recognized into the first convolution subnetwork, and the first convolution subnetwork is configured to perform convolution processing on the image to be recognized to output a first feature map;

The first feature map is input into the local fusion sub-network, and the local fusion sub-network is configured to, for each pixel in the first feature map, use a self-attention mechanism to determine the local feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vectors of multiple target pixels in the first feature map, so as to obtain a second feature map, wherein the plurality of target pixels include multiple pixels located in the neighborhood of the pixel in the first feature map;

Input the second feature map into the second convolution subnetwork, and the second convolution subnetwork is configured to perform convolution processing on the second feature map to output a third feature map;

The third feature map is input to the global fusion sub-network, and the global fusion sub-network is configured to, for each pixel in the third feature map, use a self-attention mechanism to determine the global feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vector of each pixel in the third feature map, so as to obtain a fourth feature map; and

The fourth feature map is input to the output sub-network, and the output sub-network is configured to output a text recognition result based on the fourth feature map.
The method of claim 7, wherein at least one of the first convolutional subnetwork and the second convolutional subnetwork includes a depthwise separable convolutional layer.
The method of claim 8, wherein the first convolutional subnetwork includes a regular convolutional layer, at least one of the first convolutional subnetwork and the second convolutional subnetwork includes a first depthwise separable convolutional layer, and the second convolutional subnetwork includes a second depthwise separable convolutional layer, wherein the size of the convolution kernel used by the first depthwise separable convolutional layer is smaller than the size of the convolution kernel used by the second depthwise separable convolutional layer.
The method according to any one of claims 7-9, wherein the height of the third feature map is 1/32 of the height of the image to be recognized.
The method according to any one of claims 7-9, further comprising at least one of the following:

fusing the first feature map and the second feature map to update the second feature map; and

The third feature map and the fourth feature map are fused to update the fourth feature map.
The method according to any one of claims 7-9, wherein, for each pixel in the first feature map, using a self-attention mechanism to determine the local feature vector of the pixel based on the feature vector corresponding to the pixel and the respective feature vectors of the multiple target pixels in the first feature map, so as to obtain the second feature map, includes:

Determining the attention score of the feature vector corresponding to each target pixel in the plurality of target pixels with respect to the feature vector corresponding to the pixel; and

Fusing, based on the attention score of the feature vector corresponding to each target pixel in the plurality of target pixels with respect to the feature vector corresponding to the pixel, the feature vectors corresponding to the plurality of target pixels to obtain the local feature vector of the pixel.
A training method for a neural network, wherein the neural network includes a first convolutional subnetwork, a local fusion subnetwork, a second convolutional subnetwork, a global fusion subnetwork, and an output subnetwork, the method including:

determining a sample image and a corresponding real result;

inputting the sample image into the first convolutional subnetwork, wherein the first convolutional subnetwork is configured to perform convolution processing on the sample image to output a first feature map;

inputting the first feature map into the local fusion subnetwork, wherein the local fusion subnetwork is configured to, for each pixel in the first feature map, determine the local feature vector of the pixel by using a self-attention mechanism, based on the feature vector corresponding to the pixel and the respective feature vectors of a plurality of target pixels in the first feature map, to obtain a second feature map, wherein the plurality of target pixels include a plurality of pixels located in the neighborhood of the pixel in the first feature map;

inputting the second feature map into the second convolutional subnetwork, wherein the second convolutional subnetwork is configured to perform convolution processing on the second feature map to output a third feature map;

inputting the third feature map into the global fusion subnetwork, wherein the global fusion subnetwork is configured to, for each pixel in the third feature map, determine the global feature vector of the pixel by using a self-attention mechanism, based on the feature vector corresponding to the pixel and the respective feature vector of each pixel in the third feature map, to obtain a fourth feature map;

inputting the fourth feature map into the output subnetwork, wherein the output subnetwork is configured to output a prediction result of text recognition on the sample image based on the fourth feature map;

calculating a loss value based on the real result and the prediction result; and

adjusting the parameters of the neural network based on the loss value to obtain a trained neural network.
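The global fusion step in the method above differs from the local one in that every pixel attends to every pixel of the feature map. A minimal sketch follows, assuming scaled dot-product scores with softmax normalization and no learned query/key/value projections; these choices and the name `global_self_attention` are illustrative assumptions, not details taken from the claim.

```python
import math


def global_self_attention(feature_map):
    """Fuse each pixel with all pixels of an H x W x C nested-list feature map."""
    height, width = len(feature_map), len(feature_map[0])
    pixels = [feature_map[i][j] for i in range(height) for j in range(width)]
    channels = len(pixels[0])
    scale = math.sqrt(channels)
    fused_pixels = []
    for query in pixels:
        # Score every pixel in the map against the current pixel.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in pixels]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # The global feature vector is the score-weighted sum over all pixels.
        fused_pixels.append(
            [sum(w * p[c] for w, p in zip(weights, pixels)) for c in range(channels)]
        )
    # Restore the original height x width layout.
    return [fused_pixels[i * width:(i + 1) * width] for i in range(height)]
```

Because every pixel is compared with every other pixel, the cost grows quadratically with the number of pixels, which is one reason to apply this step to the smaller third feature map rather than to the first.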
The method of claim 13, wherein the loss value includes a connectionist temporal classification (CTC) loss value and a center loss value.
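To make the two loss terms concrete, the sketch below computes a CTC loss with the standard forward recursion and a center loss as the mean of half the squared distance between each feature and its class center. The claim does not say how the two values are combined; the weighted sum and the `center_weight` value are assumptions, as are all function names. Plain probabilities are used instead of log-space arithmetic, which is adequate only for short sequences.

```python
import math


def ctc_loss(probs, target, blank=0):
    """Negative log-likelihood of `target` under per-frame class probabilities.

    probs[t][k] is the probability of class k at frame t.
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank].
    extended = [blank]
    for label in target:
        extended += [label, blank]
    alpha = [0.0] * len(extended)
    alpha[0] = probs[0][blank]
    if len(extended) > 1:
        alpha[1] = probs[0][extended[1]]
    for frame in probs[1:]:
        prev = alpha
        alpha = [0.0] * len(extended)
        for s, label in enumerate(extended):
            total = prev[s]
            if s >= 1:
                total += prev[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and label != blank and label != extended[s - 2]:
                total += prev[s - 2]
            alpha[s] = total * frame[label]
    likelihood = alpha[-1] + (alpha[-2] if len(extended) > 1 else 0.0)
    return -math.log(likelihood) if likelihood > 0 else math.inf


def center_loss(features, labels, centers):
    """Mean over samples of 0.5 * squared distance to the class center."""
    total = 0.0
    for feature, label in zip(features, labels):
        total += 0.5 * sum((f - c) ** 2 for f, c in zip(feature, centers[label]))
    return total / len(features)


def combined_loss(ctc_value, center_value, center_weight=0.05):
    """Hypothetical combination: CTC loss plus a weighted center loss."""
    return ctc_value + center_weight * center_value
```

For two frames with uniform probabilities over a blank and one character, three of the four alignments decode to that character, so the CTC loss for that target is -log(0.75).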
The method of claim 13, wherein at least one of the first convolutional subnetwork and the second convolutional subnetwork includes a depthwise separable convolutional layer.
The method of claim 15, wherein the first convolutional subnetwork includes a regular convolutional layer, at least one of the first convolutional subnetwork and the second convolutional subnetwork includes a first depthwise separable convolutional layer, and the second convolutional subnetwork includes a second depthwise separable convolutional layer, wherein the size of the convolution kernel used by the first depthwise separable convolutional layer is smaller than the size of the convolution kernel used by the second depthwise separable convolutional layer.
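A short parameter count shows why depthwise separable layers are attractive and why a larger kernel in the later layer stays cheap. The kernel sizes 3 and 5 and the channel counts below are hypothetical values chosen for illustration; the claim only requires the first kernel to be smaller than the second. Bias terms are ignored.

```python
def regular_conv_params(kernel_size, in_channels, out_channels):
    """Weights in a standard convolution: k * k * C_in * C_out."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_params(kernel_size, in_channels, out_channels):
    """Weights in a depthwise (k * k per channel) plus 1 x 1 pointwise convolution."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Hypothetical layer sizes: 64 input channels, 128 output channels.
regular_3x3 = regular_conv_params(3, 64, 128)           # 73728
separable_3x3 = depthwise_separable_params(3, 64, 128)  # 8768
separable_5x5 = depthwise_separable_params(5, 64, 128)  # 9792
```

In this example, moving the separable layer from a 3 x 3 to a 5 x 5 kernel adds about a thousand weights, while a regular 3 x 3 layer of the same width already needs more than seven times as many.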
The method according to any one of claims 13-16, wherein the height of the third feature map is 1/32 of the height of the image to be recognized.
The method according to any one of claims 13-16, further comprising at least one of the following:

fusing the first feature map and the second feature map to update the second feature map; and

fusing the third feature map and the fourth feature map to update the fourth feature map.
The method according to any one of claims 13-16, wherein, for each pixel in the first feature map, determining the local feature vector of the pixel by using a self-attention mechanism, based on the feature vector corresponding to the pixel and the respective feature vectors of the plurality of target pixels in the first feature map, to obtain the second feature map includes:

determining an attention score of the feature vector corresponding to each target pixel in the plurality of target pixels with respect to the feature vector corresponding to the pixel; and

fusing, based on the attention score of the feature vector corresponding to each target pixel in the plurality of target pixels with respect to the feature vector corresponding to the pixel, the feature vectors corresponding to the plurality of target pixels to obtain the local feature vector of the pixel.
An electronic device including:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1-19.
A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the method according to any one of claims 1-19.
A computer program product comprising a computer program, wherein the computer program implements the method of any one of claims 1-19 when executed by a processor.