CN114821581A - Image recognition method and method for training image recognition model

Image recognition method and method for training image recognition model

Info

Publication number
CN114821581A
CN114821581A
Authority
CN
China
Prior art keywords
labels
image
obtaining
training
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210503528.5A
Other languages
Chinese (zh)
Inventor
陈科桦
倪子涵
孙逸鹏
姚锟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210503528.5A
Publication of CN114821581A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides an image recognition method and a method for training an image recognition model, relates to the field of artificial intelligence, in particular to the fields of deep learning, image processing, and computer vision, and can be applied to scenarios such as optical character recognition (OCR). The implementation scheme is as follows: obtaining a target image, the target image including a first number of characters arranged along a first direction, each character of the first number of characters being from a preset character set having a corresponding preset label set; obtaining a second number of sequentially arranged labels based on the target image, each label of the second number of labels being from the preset label set, the sequentially arranged second number of labels corresponding to a second number of regions arranged along the first direction in the target image; and obtaining a recognition result of the target image, the recognition result including the sequentially arranged first number of characters corresponding to the first number of labels among the sequentially arranged second number of labels.

Description

Image recognition method and method for training image recognition model
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to deep learning, image processing, and computer vision technologies, which may be applied to scenarios such as face recognition. More specifically, it relates to an image recognition method, a method and an apparatus for training an image recognition model, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Artificial intelligence is the discipline of making computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies, and the like.
Image processing techniques based on artificial intelligence have penetrated various fields. Among them, artificial-intelligence-based optical character recognition (OCR) technology processes an image to recognize shapes in the image and translates the recognized shapes into characters.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides an image recognition method, a method and an apparatus for training an image recognition model, an electronic device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided an image recognition method including: obtaining a target image, the target image including a first number of characters arranged along a first direction, each character of the first number of characters being from a preset character set having a corresponding preset label set; obtaining a second number of sequentially arranged labels based on the target image, each label of the second number of labels being from the preset label set, the sequentially arranged second number of labels corresponding to a second number of regions arranged along the first direction in the target image; and obtaining a recognition result of the target image based on the sequentially arranged second number of labels, the recognition result including the sequentially arranged first number of characters corresponding to the first number of labels among the sequentially arranged second number of labels.
According to another aspect of the present disclosure, there is provided a method for training an image recognition model, comprising: obtaining a training image comprising a first number of characters arranged along a first direction, each character of the first number of characters being from a preset character set having a corresponding preset label set; obtaining annotation labels of the training image, the annotation labels including a sequentially arranged first number of labels corresponding to the first number of characters arranged along the first direction, each label of the first number of labels being from the preset label set; inputting the training image to an image recognition model to obtain a second number of sequentially arranged labels, each label of the second number of labels being from the preset label set, the sequentially arranged second number of labels corresponding to a second number of regions arranged along the first direction in the training image; obtaining predicted labels comprising the first number of labels among the sequentially arranged second number of labels; and adjusting parameters of the image recognition model based on the annotation labels and the predicted labels.
According to another aspect of the present disclosure, there is provided an image recognition apparatus including: an image acquisition unit configured to obtain a target image including a first number of characters arranged in a first direction, each character of the first number of characters being from a preset character set having a corresponding preset tag set; a prediction unit configured to obtain, based on the target image, a second number of labels arranged in sequence, each of the second number of labels being from the preset label set, the second number of labels arranged in sequence corresponding to the second number of regions arranged in the first direction in the target image; and a tag obtaining unit configured to obtain a recognition result of the target image based on a second number of sequentially arranged tags, the recognition result including the first number of sequentially arranged characters corresponding to the first number of tags among the second number of sequentially arranged tags.
According to another aspect of the present disclosure, there is provided an apparatus for training an image recognition model, including: a training image acquisition unit configured to obtain a training image, the training image including a first number of characters arranged along a first direction, each character of the first number of characters being from a preset character set having a corresponding preset label set; a labeling unit configured to obtain annotation labels of the training image, where the annotation labels include a sequentially arranged first number of labels corresponding to the first number of characters arranged along the first direction, and each label of the first number of labels is from the preset label set; a training image input unit configured to input the training image to an image recognition model to obtain a second number of sequentially arranged labels, each label of the second number of labels being from the preset label set, the sequentially arranged second number of labels corresponding to a second number of regions arranged along the first direction in the training image; a predicted label obtaining unit configured to obtain predicted labels including the first number of labels among the sequentially arranged second number of labels; and a parameter adjusting unit configured to adjust parameters of the image recognition model based on the annotation labels and the predicted labels.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to implement a method according to the above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to implement the method according to the above.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the method described above.
According to one or more embodiments of the present disclosure, the amount of data processed for a target image containing characters can be reduced, and the accuracy of recognition of the characters contained in the target image can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;
FIG. 2 shows a flow diagram of an image recognition method according to an embodiment of the present disclosure;
FIG. 3 shows a flowchart of a process of obtaining a target image in an image recognition method according to an embodiment of the present disclosure;
FIG. 4 shows a flowchart of a process of obtaining the target image based on the corresponding region in an image recognition method according to an embodiment of the present disclosure;
FIG. 5 shows a flowchart of a process of obtaining a second number of sequentially arranged labels based on a target image in an image recognition method according to an embodiment of the present disclosure;
FIG. 6 shows a flowchart of a process of obtaining the second number of sequentially arranged labels based on a second number of convolution features corresponding to the second number of feature maps in an image recognition method according to an embodiment of the present disclosure;
FIG. 7 shows a flowchart of a process of obtaining the second number of sequentially arranged labels based on a prediction matrix in an image recognition method according to an embodiment of the present disclosure;
FIG. 8 shows a flow diagram of a method for training an image recognition model in accordance with an embodiment of the present disclosure;
FIG. 9 shows a flowchart of a process of inputting training images into an image recognition model in a method for training an image recognition model according to an embodiment of the present disclosure;
FIG. 10 shows a flow diagram of a process of inputting a second number of feature maps into a prediction network in a method for training an image recognition model according to an embodiment of the present disclosure;
FIG. 11 shows a flowchart of a process for obtaining a second number of labels arranged in an order based on a second number of convolution features corresponding to the second number of feature maps in a method for training an image recognition model according to an embodiment of the present disclosure;
FIG. 12 shows a flow diagram of a process of obtaining a second number of labels arranged in order based on a prediction matrix in a method for training an image recognition model according to an embodiment of the present disclosure;
FIG. 13 shows a block diagram of the structure of an image recognition apparatus according to an embodiment of the present disclosure;
FIG. 14 shows a block diagram of an apparatus for training an image recognition model according to an embodiment of the present disclosure; and
FIG. 15 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, the server 120 may run one or more services or software applications that enable the image recognition method to be performed.
In some embodiments, the server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In certain embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may receive the first classification result using client devices 101, 102, 103, 104, 105, and/or 106. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptops), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and so forth. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various Mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, Android. Portable handheld devices may include cellular telephones, smart phones, tablets, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head-mounted displays (such as smart glasses) and other devices. The gaming system may include a variety of handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that can support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. By way of example only, one or more networks 110 may be a Local Area Network (LAN), an Ethernet-based network, a token ring, a Wide Area Network (WAN), the Internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., Bluetooth, WiFi), and/or any combination of these and/or other networks.
The server 120 may include one or more general-purpose computers, special-purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-range servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architectures involving virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.
In some implementations, the server 120 can include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.
In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or a smart cloud computing server or smart cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the shortcomings of difficult management and weak service scalability in traditional physical host and virtual private server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and object files. The data store 130 may reside in various locations. For example, the data store used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The data store 130 may be of different types. In certain embodiments, the data store used by the server 120 may be a database, such as a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to the command.
In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
Referring to fig. 2, an image recognition method 200 according to some embodiments of the present disclosure includes:
step S210: obtaining a target image, the target image including a first number of characters arranged in a first direction;
step S220: obtaining a second number of labels arranged in sequence based on the target image;
step S230: and obtaining a recognition result of the target image based on the sequentially arranged second number of labels.
Wherein each character of the first number of characters is from a preset character set having a corresponding preset label set; each label of the second number of labels is from the preset label set, the sequentially arranged second number of labels corresponding to a second number of regions arranged along the first direction in the target image; and the recognition result includes the sequentially arranged first number of characters corresponding to the first number of labels among the sequentially arranged second number of labels.
By processing a target image containing a first number of characters arranged along a first direction, a second number of labels corresponding to a second number of regions of the target image is obtained, where the first number of characters comes from a preset character set corresponding to a preset label set and each of the second number of labels comes from that preset label set. The task of recognizing the characters in the target image is thus turned into a task of classifying the target image, which reduces the data processing amount while improving the accuracy of the recognition result of the target image.
In the related art, an image including a plurality of characters is processed by a segmentation-based method to locate the position of each character in the image; the character at each position is then recognized, and the recognized characters are finally combined into a character string to obtain the recognition result of the image. Because individual characters must be recognized, the characters involved need to be accurately annotated when training the model, so the annotation cost is high.
According to embodiments of the present disclosure, by obtaining a preset label set corresponding to the preset character set, only the preset character set needs a limited amount of annotation, which greatly reduces the amount of annotated data. Meanwhile, during image recognition, the image is classified with respect to the preset label set according to its features, which greatly reduces the data processing amount involved in the classification process.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of the users involved all comply with relevant laws and regulations and do not violate public order and good customs.
In some embodiments, the target image may be any image including a first number of characters arranged in a first direction, wherein the first direction may be a length direction of the target image.
In some embodiments, the target image may be an image containing a target object, where the target object bears a first number of characters arranged along one direction (which may be any direction in physical space). For example, the target object may be a sheet of paper containing text, and the target image may be an image of that sheet of paper.
In some embodiments, the target object comprises any one of: motor vehicle number plate, house number plate, athlete number cloth.
In some embodiments, for a target object, a preset character set and a preset label set corresponding to the target object are obtained.
For example, if the target object is a motor vehicle number plate, the preset character set includes the Chinese characters, numbers, and letters corresponding to the abbreviations of a plurality of preset regions. For example, the preset character set may include a Chinese character set consisting of 31 regional-abbreviation characters such as jing, jin, ji, jin, meng, liao, ji, hei, hu, su, and so on, a numeric character set consisting of the 10 digits 0, 1, 2, ..., 9, and a letter character set consisting of the 26 letters A, B, C, ..., Z. The preset label set includes a label corresponding to each of the 31 Chinese characters, a label corresponding to each of the 10 numeric characters, and a label corresponding to each of the 26 letter characters.
In some embodiments, the preset character set further includes a blank character, which indicates correspondence to no character. The preset label set includes a first label corresponding to the blank character.
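As a non-limiting illustration, the correspondence between such a preset character set and its preset label set may be sketched as follows in Python (the regional abbreviations shown, the label numbering, and the name BLANK are illustrative assumptions, not limitations of the disclosure):

# Hedged sketch: a preset character set and its corresponding preset label
# set for motor vehicle number plates; label 0 is the first label reserved
# for the blank character.
region_chars = ["京", "津", "冀", "晋", "蒙", "辽", "吉", "黑", "沪", "苏"]  # first 10 of the 31 regional abbreviations
digit_chars = [str(d) for d in range(10)]                       # 0-9
letter_chars = [chr(c) for c in range(ord("A"), ord("Z") + 1)]  # A-Z

BLANK = "<blank>"  # blank character: indicates correspondence to no character
preset_charset = [BLANK] + region_chars + digit_chars + letter_chars

# Each character maps to one integer label in the preset label set.
char_to_label = {ch: i for i, ch in enumerate(preset_charset)}
label_to_char = {i: ch for ch, i in char_to_label.items()}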
In some embodiments, the first number is a preset number and corresponds to the target object.
In some embodiments, the first number is any number less than or equal to the second number, wherein the second number is predetermined.
In some embodiments, as shown in fig. 3, the obtaining of the target image in step S210 includes:
step S310: obtaining a first image, the first image including a target object, the target object including the first number of characters arranged in the first direction;
step S320: obtaining a corresponding region of the target object in the first image; and
step S330: obtaining the target image based on the corresponding region.
The target image is obtained by obtaining a first image that includes a target object containing a first number of characters, and obtaining the region of the first image in which the target object is located. Because the target image contains only the target object, recognition of the characters in the target object is achieved, and in application scenarios where the characters in a target object need to be recognized, the accuracy of the recognition result is improved.
In some embodiments, the corresponding region of the target object in the first image is obtained by performing target detection on the first image.
In some embodiments, for the target object, the corresponding region of the target object in the first image is obtained by matting the first image.
In some embodiments, after obtaining the corresponding region, the corresponding region is scaled to a preset size to obtain the target image.
In some embodiments, as shown in fig. 4, the step S330 of obtaining the target image based on the corresponding region includes:
step S410: acquiring a plurality of points on the corresponding region; and
step S420: performing a perspective transformation based on the plurality of points to obtain the target image, wherein the target image is a front view of the target object.
Through the perspective transformation, the projection plane of the target object is changed so that the plane of the target image is the orthographic projection plane of the target object; that is, the target image is a front view of the target object. The image features of the target image then represent the target object more accurately, so the sequentially arranged second number of labels obtained based on the target image is accurate.
In some embodiments, the corresponding region of the target object is a rectangular region, and the plurality of points on the corresponding region are vertices of four corners of the rectangular region.
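A minimal sketch of such a perspective transformation, assuming OpenCV (the corner ordering and the preset output size are assumptions for illustration):

import cv2
import numpy as np

def rectify_region(first_image, corners, out_w=192, out_h=48):
    # corners: the four vertices of the rectangular corresponding region,
    # ordered top-left, top-right, bottom-right, bottom-left.
    src = np.asarray(corners, dtype=np.float32)
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    m = cv2.getPerspectiveTransform(src, dst)
    # The warped result is the target image: a front view of the target
    # object, scaled to a preset size.
    return cv2.warpPerspective(first_image, m, (out_w, out_h))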
In some embodiments, the sequentially arranged second number of labels is obtained by inputting the target image to a trained image recognition model, wherein the trained image recognition model comprises a feature extraction network and a classification prediction network.
In some embodiments, as shown in fig. 5, the step S220 of obtaining a second number of labels arranged in sequence based on the target image includes:
step S510: performing feature extraction corresponding to the second number of channels on the target image to obtain the second number of feature maps;
step S520: performing convolution operation on the second quantity of feature maps respectively to obtain convolution features corresponding to each feature map in the second quantity of feature maps; and
step S530: and obtaining a second number of labels arranged in sequence based on a second number of convolution features corresponding to the second number of feature maps.
After feature extraction over the second number of channels is performed on the target image, separate convolutions are applied to the resulting second number of feature maps: a convolution operation is performed on each of the feature maps on its own channel, and the sequentially arranged second number of labels is then obtained based on the second number of convolution features, which greatly reduces the data processing amount.
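A minimal PyTorch sketch of steps S510 and S520 (the layer choices, the value of second_number, and the image size are placeholder assumptions, not the disclosed network):

import torch
import torch.nn as nn

second_number = 24        # assumed preset number of regions/labels
backbone = nn.Sequential( # stand-in for the feature extraction network
    nn.Conv2d(3, second_number, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)
# groups=second_number applies a separate convolution to each feature map
separate_conv = nn.Conv2d(second_number, second_number, kernel_size=3,
                          padding=1, groups=second_number)

target_image = torch.randn(1, 3, 48, 192)    # N, C, H, W
feature_maps = backbone(target_image)        # second number of feature maps
conv_features = separate_conv(feature_maps)  # one convolution feature per map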
In some embodiments, feature extraction over a second number of channels is performed on the target image using a classification perspective network (CPNet). The classification perspective network improves the network's ability to focus on important features, suppresses attention to unnecessary features, and has a larger effective receptive field, so that the sequentially arranged second number of labels obtained after classification based on the extracted features is accurate.
In some embodiments, ResNet is used as the underlying backbone network. For each residual block, a 1 x 1 convolution is used when the input and output dimensions differ.
In some embodiments, a separable-convolution prediction network with global average pooling (SPPN) is employed to perform a convolution operation on each of the second number of feature maps. The SPPN, combined with global semantic information, implicitly encodes the position of each character in the target image, so that the process of obtaining the sequentially arranged second number of labels conforms to the mechanism behind the classification perspective: the i-th classification head predicts the i-th character of the input image, which requires the head to know which character is i-th from the left.
In some embodiments, as shown in fig. 6, the step S530 of obtaining the second number of sequentially arranged labels based on the second number of convolution features corresponding to the second number of feature maps includes:
step S610: fusing the second number of convolution features to obtain fused features; and
step S620: obtaining a prediction matrix based on the fusion features, wherein the prediction matrix comprises the second number of rows, a plurality of elements of each row in the second number of rows respectively correspond to a plurality of labels in the preset label set, and a value of each element in the plurality of elements indicates whether the row corresponds to the label corresponding to the element; and
step S630: obtaining the second number of labels in the sequential order based on the prediction matrix.
After the second number of convolution features is fused, prediction is performed based on the fused feature, which simplifies the processing steps and further reduces the data processing amount. Meanwhile, the fused feature further incorporates global semantics, further improving the accuracy of the sequentially arranged second number of labels.
In some embodiments, a second number of the convolution features are fused in the channel direction.
In some embodiments, the prediction matrix is obtained by performing a global average pooling of the fused features.
In embodiments according to the disclosure, the second number of rows of the prediction matrix corresponds to the second number of regions of the target image. Each row of the prediction matrix may be understood as a classification of the target image with respect to the preset label set based on the corresponding region of the target image, where each element is the probability, obtained during classification, that the region corresponds to the respective label of the preset label set.
In some embodiments, the respective elements in the prediction matrix are respective probability values.
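A hedged PyTorch sketch of steps S610 through S630 under the same assumptions as above (the 1 x 1 projection to second_number * num_labels channels is an assumption made so that global average pooling yields a matrix of the stated shape):

import torch
import torch.nn as nn
import torch.nn.functional as F

second_number, num_labels = 24, 68  # e.g. 1 blank + 31 + 10 + 26 labels
fuse = nn.Conv2d(second_number, second_number * num_labels, kernel_size=1)

conv_features = torch.randn(1, second_number, 24, 96)
fused = fuse(conv_features)                          # fuse along the channel direction
pooled = F.adaptive_avg_pool2d(fused, 1).flatten(1)  # global average pooling
prediction_matrix = pooled.view(-1, second_number, num_labels).softmax(dim=-1)
# prediction_matrix[0, i, j]: probability that region i corresponds to label j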
In some embodiments, as shown in fig. 7, the step S620 of obtaining the second number of labels arranged in the order based on the prediction matrix includes:
step S710: for each row of the second number of rows, obtaining a largest element of the plurality of elements of the row having a largest value; and
step S720: obtaining the second number of labels arranged in sequence based on the second number of maximum elements corresponding to the second number of rows and the arrangement order of the second number of rows.
Taking, for each row of the prediction matrix, the label of the element with the largest value as the label corresponding to that row improves the accuracy of the prediction result.
In some embodiments, the labels corresponding to the second number of largest elements are determined to be the second number of labels.
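The row-wise selection of steps S710 and S720 may be sketched as follows (assuming PyTorch; the random matrix is for shape illustration only):

import torch

prediction_matrix = torch.rand(24, 68).softmax(dim=-1)  # rows x preset labels
largest = prediction_matrix.argmax(dim=-1)  # index of the largest element per row
sequential_labels = largest.tolist()        # row order gives the label order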
In some embodiments, the preset tag set further includes a first tag that does not correspond to any character in the preset character set, and the obtaining of the recognition result of the target image in step S230 includes:
determining the first number of labels among the sequentially arranged second number of labels as the sequentially arranged first number of labels, wherein each of the first number of labels among the sequentially arranged second number of labels is different from the first label.
By setting the preset label set to further include a first label that corresponds to no character in the preset character set, the prediction process can predict the gap regions of the target image that correspond to no character, which improves prediction accuracy.
For example, in processing a target image containing the motor vehicle number plate "jing Z123456", the sequentially arranged second number of labels is obtained as [label 1, label 2, label 3, label 4, label 5, label 6, first label, ..., first label]; the first number of labels, i.e., label 1 through label 6, is obtained by taking those labels among the second number of labels that differ from the first label.
In some embodiments, the sequentially arranged second number of labels is filtered based on the first number to obtain the sequentially arranged first number of labels. For example, two consecutive identical labels among the sequentially arranged second number of labels are merged into one label, thereby determining the sequentially arranged first number of labels.
In some embodiments, after the sequentially arranged first number of labels is obtained, a first number of characters corresponding to the first number of labels is obtained from the preset character set, and the first number of characters is arranged in the order of the sequentially arranged first number of labels. The first number of characters arranged in this order is the recognition result of the target image.
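A hedged decoding sketch combining the filtering steps above, i.e., dropping the first (blank) label and merging consecutive identical labels (label_to_char and the blank label index 0 refer back to the earlier character-set sketch and are assumptions):

def decode(sequential_labels, label_to_char, blank_label=0):
    # Merge consecutive identical labels into one, drop the blank label,
    # and map the remaining first number of labels back to characters.
    chars, prev = [], None
    for label in sequential_labels:
        if label != prev and label != blank_label:
            chars.append(label_to_char[label])
        prev = label
    return "".join(chars)  # the recognition result, in arrangement order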
According to another aspect of the present disclosure, there is also provided a method for training an image recognition model, as shown in fig. 8, the method 800 includes:
step S810: obtaining a training image comprising a first number of characters arranged along a first direction, each character of the first number of characters being from a preset character set having a corresponding preset label set;
step S820: obtaining annotation labels of the training image, the annotation labels including a sequentially arranged first number of labels corresponding to the first number of characters arranged along the first direction, each label of the first number of labels being from the preset label set;
step S830: inputting the training image to an image recognition model to obtain a second number of sequentially arranged labels, each label of the second number of labels being from the preset label set, the sequentially arranged second number of labels corresponding to a second number of regions arranged along the first direction in the training image;
step S840: obtaining predicted labels comprising the first number of labels among the sequentially arranged second number of labels; and
step S850: and adjusting parameters of the image recognition model based on the labeling label and the prediction label.
In the process of training the image recognition model, a training image whose first number of characters comes from the preset character set is annotated using the preset label set corresponding to the preset character set, so the annotation process is simple and the annotation cost is low. A second number of sequentially arranged labels corresponding to a second number of regions of the training image arranged along the first direction is obtained from the image recognition model as the predicted labels used to train the model. The task of recognizing the characters in the training image is thereby modeled as a task of classifying the image based on the image features of regions of the training image, which further reduces the data processing amount.
In some embodiments, the training image may be any image containing a first number of characters arranged along a first direction, where the first direction may be the length direction of the training image.
In some embodiments, the training image may be an image containing a training sample, where the training sample bears a first number of characters arranged along one direction (which may be any direction in physical space). For example, the training sample may be a sheet of paper containing text, and the training image may be an image of that sheet of paper.
In some embodiments, the training samples include any one of: motor vehicle number plate, house number plate, athlete number cloth.
In some embodiments, for a training sample, a preset character set and a preset label set corresponding to the training sample are obtained.
For example, if the training sample is a motor vehicle number plate, the preset character set includes the Chinese characters, numbers, and letters corresponding to the abbreviations of a plurality of preset regions. For example, the preset character set may include a Chinese character set consisting of 31 regional-abbreviation characters such as jing, jin, ji, jin, meng, liao, ji, hei, hu, su, and so on, a numeric character set consisting of the 10 digits 0, 1, 2, ..., 9, and a letter character set consisting of the 26 letters A, B, C, ..., Z. The preset label set includes a label corresponding to each of the 31 Chinese characters, a label corresponding to each of the 10 numeric characters, and a label corresponding to each of the 26 letter characters.
In some embodiments, the preset character set further includes a blank character, which indicates correspondence to no character. The preset label set includes a first label corresponding to the blank character.
In some embodiments, the obtaining of the training image in step S810 includes:
obtaining a first image comprising a training sample comprising the first number of characters arranged along the first direction;
obtaining a corresponding region of the training sample in the first image; and
obtaining the training image based on the corresponding region.
By obtaining the training image from the training sample in this way, recognition of the characters in the training sample is achieved; in application scenarios where the characters in a specific type of training sample need to be recognized, the training efficiency of the image recognition model and the accuracy of the recognition result are improved.
In some embodiments, after the corresponding region is obtained, the corresponding region is scaled to a preset size to obtain the training image.
In some embodiments, obtaining the training image based on the corresponding region comprises:
acquiring a plurality of points on the corresponding area; and
performing a perspective transformation based on the plurality of points to obtain the training image, wherein the training image is a front view of the training sample.
Through the perspective transformation, the projection plane of the training sample is changed so that the plane of the training image is the orthographic projection plane of the training sample; that is, the training image is a front view of the training sample. During training, the model's attention is then distributed over the regions of the training image more accurately, making both the trained model and its predictions more accurate.
In some embodiments according to the present disclosure, instead of using other images to train the model, methods such as label smoothing and model warm-up are used to avoid over-fitting. The other images here are images distinct from the training images obtained from the first image containing the training sample; for example, if the training sample is a motor vehicle number plate, the other images are images without a motor vehicle number plate.
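Label smoothing, one of the measures mentioned, may be sketched as follows (assuming PyTorch; the smoothing factor 0.1 is an assumed value, not one given in the disclosure):

import torch
import torch.nn.functional as F

def smoothed_targets(labels, num_labels, eps=0.1):
    # Replace one-hot annotation targets with softened targets to
    # discourage over-confident predictions (over-fitting).
    one_hot = F.one_hot(labels, num_labels).float()
    return one_hot * (1.0 - eps) + eps / num_labels

targets = smoothed_targets(torch.tensor([1, 59, 34]), num_labels=68)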
In some embodiments, the annotation labels of the training image are obtained by obtaining a first number of labels respectively corresponding to the first number of characters arranged along the first direction in the training image, and arranging the first number of labels in the order of the first number of characters along the first direction.
in some embodiments, the image recognition model includes a feature extraction network and a prediction network, and as shown in fig. 9, the inputting the training image into the image recognition model in step S830 includes:
step S910: inputting the training image to the feature extraction network to obtain a second number of feature maps corresponding to the second number of channels; and
step S920: inputting the second number of profiles into the predictive network to obtain the second number of labels in the sequential order.
By setting the image recognition model to include a feature extraction network and a prediction network, a pipeline (pipeline) structure of the model is simplified.
In the related art, a sequence-to-sequence (seq2seq-based) image recognition model encodes an image into a feature sequence and then decodes the feature sequence to recognize the characters in the image, so the pipeline structure of the model is complex and its efficiency is low. The image recognition model used according to embodiments of the present disclosure models the in-image character recognition task as an image classification task, so the pipeline structure is simpler and the training process is efficient.
In some embodiments, the feature extraction network comprises a classification perspective network (CPNet). The classification perspective network improves the network's ability to focus on important features, suppresses attention to unnecessary features, and has a larger effective receptive field, so that the sequentially arranged second number of labels obtained after classification based on the extracted features is accurate.
In some embodiments, the feature extraction network uses ResNet as the underlying backbone network. For each residual block, a 1 x 1 convolution is used when the input and output dimensions differ.
In some embodiments, the prediction network employs a separable-convolution prediction network with global average pooling (SPPN). The SPPN, combined with global semantic information, implicitly encodes the position of each character in the target image, so that the process of obtaining the sequentially arranged second number of labels conforms to the mechanism behind the classification perspective: the i-th classification head predicts the i-th character of the input image, which requires the head to know which character is i-th from the left. At the same time, the SPPN reduces the computational burden of back-propagation.
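Putting the feature extraction network and the prediction network together, a minimal model skeleton might look like the following (assuming PyTorch; the layers stand in for the disclosed CPNet and SPPN and are not the patented architecture):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageRecognitionModel(nn.Module):
    def __init__(self, second_number=24, num_labels=68):
        super().__init__()
        self.second_number, self.num_labels = second_number, num_labels
        self.features = nn.Sequential(  # feature extraction network (stand-in)
            nn.Conv2d(3, second_number, 3, stride=2, padding=1), nn.ReLU())
        self.predict = nn.Sequential(   # prediction network (stand-in)
            nn.Conv2d(second_number, second_number, 3, padding=1,
                      groups=second_number),  # separate per-map convolution
            nn.Conv2d(second_number, second_number * num_labels, 1))

    def forward(self, x):
        maps = self.features(x)  # second number of feature maps
        pooled = F.adaptive_avg_pool2d(self.predict(maps), 1).flatten(1)
        # prediction matrix: one row per region, one column per preset label
        return pooled.view(-1, self.second_number, self.num_labels)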
In some embodiments, as shown in fig. 10, the step S920 of inputting the second number of feature maps into the prediction network includes:
step S1010: performing convolution operation on the second quantity of feature maps respectively to obtain convolution features corresponding to each feature map in the second quantity of feature maps; and
step S1020: and obtaining a second number of labels arranged in sequence based on a second number of convolution features corresponding to the second number of feature maps.
By performing separate convolutions on the second number of feature maps, that is, by performing a convolution operation on each of the feature maps on its own channel, the sequentially arranged second number of labels is obtained based on the resulting second number of convolution features, which greatly reduces the data processing amount.
In some embodiments, as shown in fig. 11, the step S1020 of obtaining the second number of sequentially arranged labels based on the second number of convolution features corresponding to the second number of feature maps includes:
step S1110: fusing the second number of convolution features to obtain fused features; and
step S1120: obtaining a prediction matrix based on the fusion features, wherein the prediction matrix comprises the second number of rows, a plurality of elements of each row in the second number of rows respectively correspond to a plurality of labels in the preset label set, and a value of each element in the plurality of elements indicates whether the row corresponds to the label corresponding to the element; and
step S1130: obtaining the second number of labels in the sequential order based on the prediction matrix.
After the second number of convolution features is fused, prediction is performed based on the fused feature, which simplifies the processing steps and further reduces the data processing amount. Meanwhile, the fused feature further incorporates global semantics, further improving the accuracy of the sequentially arranged second number of labels.
In some embodiments, a second number of the convolution features are fused in the channel direction.
In some embodiments, the prediction matrix is obtained by performing a global average pooling of the fused features.
In embodiments according to the disclosure, the second number of rows of the prediction matrix corresponds to the second number of regions of the training image. Each row of the prediction matrix may be understood as a classification of the training image with respect to the preset label set based on the corresponding region of the training image, where each element is the probability, obtained during classification, that the region corresponds to the respective label of the preset label set.
In some embodiments, the respective elements in the prediction matrix are respective probability values.
In some embodiments, as shown in fig. 12, the obtaining a second number of labels arranged in sequence based on the prediction matrix in step S1130 includes:
step S1210: for each row of the second number of rows, obtaining the largest element, that is, the element of the row having the largest value; and
step S1220: obtaining the second number of labels arranged in sequence based on the second number of maximum elements corresponding to the second number of rows and the arrangement order of the second number of rows.
Taking the label of the element with the largest value in each row of the prediction matrix as the label corresponding to that row improves model training efficiency.
In some embodiments, the labels corresponding to the second number of largest elements are determined to be the second number of labels.
In some embodiments, the preset label set further includes a first label that corresponds to no character in the preset character set, and obtaining the predicted labels of the training image includes:
determining the first number of labels among the sequentially arranged second number of labels as the sequentially arranged first number of labels, wherein each of the first number of labels among the sequentially arranged second number of labels is different from the first label.
By setting the preset label set to further include a first label that corresponds to no character in the preset character set, the model outputs the corresponding label for gap regions of the training image that correspond to no character; the trained model can therefore make predictions for images containing regions that correspond to no character or carry unannotated characters, which broadens the application range and improves the prediction accuracy of the trained model.
In some embodiments, after the predicted labels are obtained, a loss is calculated based on the predicted labels and the annotation labels, and the parameters of the image recognition model are adjusted based on the loss.
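A hedged sketch of this parameter adjustment (assuming PyTorch and the ImageRecognitionModel sketch above; the per-position cross-entropy loss and the padding of annotation labels with the blank label up to the second number of positions are assumptions consistent with, but not dictated by, the text):

import torch
import torch.nn.functional as F

model = ImageRecognitionModel()  # from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

training_image = torch.randn(1, 3, 48, 192)
# Annotation labels for 8 characters, padded with the blank label (0)
# up to second_number = 24 positions.
annotation = torch.tensor([[1, 40, 34, 35, 36, 37, 38, 39] + [0] * 16])

logits = model(training_image)  # (1, second_number, num_labels)
loss = F.cross_entropy(logits.flatten(0, 1), annotation.flatten())
optimizer.zero_grad()
loss.backward()   # back-propagation through the image recognition model
optimizer.step()  # adjust parameters based on the loss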
According to another aspect of the present disclosure, there is also provided an image recognition apparatus, as shown in fig. 13, an apparatus 1300 including: an image obtaining unit 1310 configured to obtain a target image, the target image including a first number of characters arranged in a first direction, each character of the first number of characters being from a preset character set having a corresponding preset label set; a prediction unit 1320 configured to obtain, based on the target image, a second number of labels arranged in order, each of the second number of labels being from the preset label set, the second number of labels arranged in order corresponding to the second number of regions arranged in the first direction in the target image; and a tag obtaining unit 1330 configured to obtain a recognition result of the target image based on the second number of sequentially arranged tags, the recognition result including the first number of sequentially arranged characters corresponding to the first number of tags of the second number of sequentially arranged tags.
In some embodiments, the image acquisition unit 1310 includes: a first obtaining subunit configured to obtain a first image, the first image including a target object, the target object including the first number of characters arranged in the first direction; a region acquisition unit configured to obtain a corresponding region of the target object in the first image; and a second acquisition subunit configured to acquire the target image based on the corresponding region.
In some embodiments, the second acquisition subunit comprises: a third acquisition subunit configured to acquire a plurality of points on the corresponding area; and a perspective transformation unit configured to perform perspective transformation based on the plurality of points to obtain the target image, wherein the target image is a front view of the target object.
In some embodiments, the prediction unit comprises: a feature extraction unit configured to perform feature extraction corresponding to the second number of channels on the target image to obtain the second number of feature maps; a convolution unit configured to perform a convolution operation on the second number of feature maps respectively to obtain a convolution feature corresponding to each feature map in the second number of feature maps; and the predicting subunit is configured to obtain the second number of labels arranged in sequence based on a second number of convolution features corresponding to the second number of feature maps.
In some embodiments, the predictor includes: a feature fusion unit configured to fuse the second number of convolution features to obtain a fused feature; a prediction subunit configured to obtain a prediction matrix based on the fused feature, wherein the prediction matrix includes the second number of rows, a plurality of elements of each row in the second number of rows respectively correspond to a plurality of labels in the preset label set, and a value of each element in the plurality of elements indicates whether the row corresponds to the label corresponding to the element; and a fourth obtaining subunit configured to obtain the second number of labels arranged in sequence based on the prediction matrix.
In some embodiments, the fourth obtaining subunit includes: a first determining unit configured to obtain, for each row of the second number of rows, the largest element, that is, the element having the largest value among the plurality of elements of the row; and a label obtaining subunit configured to obtain the second number of labels arranged in sequence based on the second number of largest elements corresponding to the second number of rows and the arrangement order of the second number of rows.
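The per-feature-map convolution, fusion, prediction-matrix, and row-wise largest-element steps in the preceding paragraphs might look as follows in PyTorch; every layer choice and shape here is an illustrative assumption rather than the claimed architecture:

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Sketch: convolve each of the second number of feature maps, fuse the
    convolution features, form a (second_number x num_labels) prediction
    matrix, and take the largest element of each row as that row's label."""

    def __init__(self, second_number, num_labels, h=8, w=32):
        super().__init__()
        # groups=second_number gives one independent convolution per feature map
        self.per_map_conv = nn.Conv2d(second_number, second_number,
                                      kernel_size=3, padding=1,
                                      groups=second_number)
        # fusion of all convolution features into the prediction matrix
        self.fuse = nn.Linear(second_number * h * w, second_number * num_labels)
        self.second_number, self.num_labels = second_number, num_labels

    def forward(self, feature_maps):                  # (B, second_number, h, w)
        conv_feats = self.per_map_conv(feature_maps)  # per-map convolution features
        fused = conv_feats.flatten(start_dim=1)       # fused feature
        matrix = self.fuse(fused).reshape(-1, self.second_number, self.num_labels)
        labels = matrix.argmax(dim=-1)  # largest element per row, in row order
        return matrix, labels
```

Here `matrix` plays the role of the prediction matrix with the second number of rows, and `labels` yields one label index per region in the rows' arrangement order.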
In some embodiments, the preset label set further includes a first label that does not correspond to any character in the preset character set, and the label obtaining subunit includes: a second determining unit configured to determine the first number of labels among the sequentially arranged second number of labels as the sequentially arranged first number of labels, wherein each label of the first number of labels among the sequentially arranged second number of labels is different from the first label.
In some embodiments, the target object includes any one of: motor vehicle number plates, house number plates and athlete number bibs.
According to another aspect of the present disclosure, there is also provided an apparatus for training an image recognition model. As shown in fig. 14, the apparatus 1400 includes: a training image obtaining unit 1410 configured to obtain a training image, the training image including a first number of characters arranged along a first direction, each character of the first number of characters being from a preset character set having a corresponding preset label set; an annotation unit 1420 configured to obtain annotation labels of the training image, the annotation labels including the first number of labels arranged in sequence, the first number of labels arranged in sequence corresponding to the first number of characters arranged along the first direction, and each label of the first number of labels being from the preset label set; a training image input unit 1430 configured to input the training image to an image recognition model to obtain a second number of labels arranged in sequence, each label of the second number of labels being from the preset label set, the second number of labels arranged in sequence corresponding to the second number of regions arranged along the first direction in the training image; a predicted label obtaining unit 1440 configured to obtain predicted labels including the first number of labels among the second number of labels arranged in sequence; and a parameter adjusting unit 1450 configured to adjust parameters of the image recognition model based on the annotation labels and the predicted labels.
In some embodiments, the image recognition model includes a feature extraction network and a prediction network, and the training image input unit includes: a first input subunit configured to input the training image to the feature extraction network to obtain a second number of feature maps corresponding to the second number of channels; and a second input subunit configured to input the second number of feature maps to the prediction network to obtain the second number of labels arranged in sequence.
In some embodiments, the feature extraction network comprises a class perspective network.
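The backbone is only loosely identified in this translation, so the following is a deliberately generic sketch of a feature extraction network whose output has the second number of channels, one feature map per region; every layer choice is an assumption:

```python
import torch
import torch.nn as nn

def make_feature_extractor(second_number: int) -> nn.Sequential:
    """Hypothetical backbone: maps a (B, 3, H, W) image to
    (B, second_number, H/4, W/4) feature maps, one channel per region."""
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, second_number, kernel_size=1),  # second number of channels out
    )

# e.g. 8 regions for a sequence of up to 8 characters:
extractor = make_feature_extractor(8)
maps = extractor(torch.randn(1, 3, 32, 128))  # -> torch.Size([1, 8, 8, 32])
```

With `second_number = 8`, this output matches the (8, 8, 32) feature maps consumed by the `PredictionHead` sketched earlier.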
In some embodiments, the second input subunit includes: a convolution unit configured to perform a convolution operation on each of the second number of feature maps to obtain a convolution feature corresponding to each feature map; and a first prediction subunit configured to obtain the second number of labels arranged in sequence based on the second number of convolution features corresponding to the second number of feature maps.
In some embodiments, the first prediction subunit includes: a fusion unit configured to fuse the second number of convolution features to obtain a fused feature; a second prediction subunit configured to obtain a prediction matrix based on the fused feature, wherein the prediction matrix includes the second number of rows, a plurality of elements of each row in the second number of rows respectively correspond to a plurality of labels in the preset label set, and a value of each element in the plurality of elements indicates whether the row corresponds to the label corresponding to the element; and a third prediction subunit configured to obtain the second number of labels arranged in sequence based on the prediction matrix.
In some embodiments, the third prediction subunit includes: a determining subunit configured to obtain, for each row of the second number of rows, the largest element, that is, the element having the largest value among the plurality of elements of the row; and a predicted label obtaining subunit configured to obtain the second number of labels arranged in sequence based on the second number of largest elements corresponding to the second number of rows and the arrangement order of the second number of rows.
In some embodiments, the training image obtaining unit includes: a first image obtaining unit configured to obtain a first image, the first image including a training sample, the training sample including the first number of characters arranged along the first direction; an image area acquisition unit configured to obtain a corresponding region of the training sample in the first image; and a training image obtaining subunit configured to obtain the training image based on the corresponding region.
In some embodiments, the training image obtaining subunit includes: a point determining unit configured to acquire a plurality of points on the corresponding region; and a perspective transformation unit configured to perform a perspective transformation based on the plurality of points to obtain the training image, wherein the training image is a front view of the training sample.
In some embodiments, the training sample includes any one of: motor vehicle number plates, house number plates and athlete number bibs.
According to another aspect of the present disclosure, there is also provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method according to the present disclosure.
According to another aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method according to the present disclosure.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to the present disclosure.
Referring to fig. 15, a block diagram of an electronic device 1500 will now be described. The electronic device 1500, which may be a server or a client of the present disclosure, is an example of a hardware device that may be applied to aspects of the present disclosure. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 15, the electronic device 1500 includes a computing unit 1501, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 1502 or a computer program loaded from a storage unit 1508 into a Random Access Memory (RAM) 1503. In the RAM 1503, various programs and data necessary for the operation of the electronic device 1500 can also be stored. The computing unit 1501, the ROM 1502, and the RAM 1503 are connected to each other by a bus 1504. An input/output (I/O) interface 1505 is also connected to the bus 1504.
Various components in the electronic device 1500 are connected to the I/O interface 1505, including: an input unit 1506, an output unit 1507, the storage unit 1508, and a communication unit 1509. The input unit 1506 may be any type of device capable of inputting information to the electronic device 1500; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote control. The output unit 1507 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 1508 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 1509 allows the electronic device 1500 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and/or chipsets, such as Bluetooth (TM) devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 1501 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 1501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The computing unit 1501 executes the respective methods and processes described above, such as the method 200. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1500 via the ROM 1502 and/or the communication unit 1509. When the computer program is loaded into the RAM 1503 and executed by the computing unit 1501, one or more steps of the method 200 described above may be performed. Alternatively, in other embodiments, the computing unit 1501 may be configured to perform the method 200 in any other suitable manner (e.g., by way of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems, and apparatuses are merely exemplary embodiments or examples, and that the scope of the present disclosure is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements in the embodiments or examples may be omitted or replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Notably, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims (37)

1. An image recognition method, comprising:
obtaining a target image, the target image including a first number of characters arranged along a first direction, each character of the first number of characters being from a preset character set having a corresponding preset label set;
obtaining a second number of sequentially arranged labels based on the target image, each label of the second number of labels being from the preset label set, the second number of sequentially arranged labels corresponding to the second number of regions arranged along the first direction in the target image; and
obtaining a recognition result of the target image based on the sequentially arranged second number of labels, the recognition result including the sequentially arranged first number of characters corresponding to the first number of labels of the sequentially arranged second number of labels.
2. The method of claim 1, wherein the obtaining a target image comprises:
obtaining a first image, the first image including a target object, the target object including the first number of characters arranged in the first direction;
obtaining a corresponding region of the target object in the first image; and
obtaining the target image based on the corresponding region.
3. The method of claim 2, wherein the obtaining the target image based on the corresponding region comprises:
acquiring a plurality of points on the corresponding region; and
performing a perspective transformation based on the plurality of points to obtain the target image, wherein the target image is a front view of the target object.
4. The method of any of claims 1-3, wherein the obtaining a second number of labels arranged in sequence based on the target image comprises:
performing feature extraction corresponding to the second number of channels on the target image to obtain the second number of feature maps;
performing a convolution operation on each of the second number of feature maps to obtain a convolution feature corresponding to each feature map; and
obtaining the second number of labels arranged in sequence based on a second number of convolution features corresponding to the second number of feature maps.
5. The method of claim 4, wherein the obtaining the second number of labels arranged in sequence based on the second number of convolution features corresponding to the second number of feature maps comprises:
fusing the second number of convolution features to obtain a fused feature;
obtaining a prediction matrix based on the fused feature, wherein the prediction matrix comprises the second number of rows, a plurality of elements of each row in the second number of rows respectively correspond to a plurality of labels in the preset label set, and a value of each element in the plurality of elements indicates whether the row corresponds to the label corresponding to the element; and
obtaining the second number of labels in the sequential order based on the prediction matrix.
6. The method of claim 5, wherein the obtaining the second number of labels arranged in sequence based on the prediction matrix comprises:
for each row of the second number of rows, obtaining the largest element, that is, the element having the largest value among the plurality of elements of the row; and
obtaining the second number of labels arranged in sequence based on the second number of maximum elements corresponding to the second number of rows and an arrangement order of the second number of rows.
7. The method according to any one of claims 1-6, wherein the preset label set further includes a first label that does not correspond to any character in the preset character set, and the obtaining the recognition result of the target image includes:
determining the first number of labels among the sequentially arranged second number of labels as the sequentially arranged first number of labels, wherein each label of the first number of labels among the sequentially arranged second number of labels is different from the first label.
8. The method according to any one of claims 2-7, wherein the target object comprises any one of: motor vehicle number plates, house number plates and athlete number bibs.
9. A method for training an image recognition model, comprising:
obtaining a training image comprising a first number of characters arranged along a first direction, each character of the first number of characters being from a preset character set having a corresponding preset label set;
obtaining annotation labels of the training image, the annotation labels including the first number of labels arranged in sequence, the first number of labels arranged in sequence corresponding to the first number of characters arranged along the first direction, and each label of the first number of labels being from the preset label set;
inputting the training image to an image recognition model to obtain a second number of labels arranged in sequence, each label of the second number of labels being from the preset label set, the second number of labels arranged in sequence corresponding to the second number of regions arranged along the first direction in the training image;
obtaining predicted labels comprising the first number of labels among the second number of labels arranged in sequence; and
adjusting parameters of the image recognition model based on the annotation labels and the predicted labels.
10. The method of claim 9, wherein the image recognition model comprises a feature extraction network and a prediction network, and the inputting the training image to the image recognition model comprises:
inputting the training image to the feature extraction network to obtain a second number of feature maps corresponding to the second number of channels; and
inputting the second number of feature maps to the prediction network to obtain the second number of labels arranged in sequence.
11. The method of claim 10, wherein the feature extraction network comprises a class perspective network.
12. The method of claim 10, wherein the inputting the second number of feature maps to the predictive network comprises:
performing a convolution operation on each of the second number of feature maps to obtain a convolution feature corresponding to each feature map; and
obtaining the second number of labels arranged in sequence based on a second number of convolution features corresponding to the second number of feature maps.
13. The method of claim 12, wherein the obtaining the second number of labels arranged in sequence based on the second number of convolution features corresponding to the second number of feature maps comprises:
fusing the second number of convolution features to obtain a fused feature;
obtaining a prediction matrix based on the fused feature, wherein the prediction matrix comprises the second number of rows, a plurality of elements of each row in the second number of rows respectively correspond to a plurality of labels in the preset label set, and a value of each element in the plurality of elements indicates whether the row corresponds to the label corresponding to the element; and
obtaining the second number of labels in the sequential order based on the prediction matrix.
14. The method of claim 13, wherein the obtaining the second number of labels arranged in sequence based on the prediction matrix comprises:
for each row of the second number of rows, obtaining the largest element, that is, the element having the largest value among the plurality of elements of the row; and
obtaining the second number of labels arranged in sequence based on the second number of maximum elements corresponding to the second number of rows and an arrangement order of the second number of rows.
15. The method of any of claims 9-14, wherein the obtaining a training image comprises:
obtaining a first image comprising a training sample comprising the first number of characters arranged along the first direction;
obtaining a corresponding region of the training sample in the first image; and
obtaining the training image based on the corresponding region.
16. The method of claim 15, wherein the obtaining the training image based on the corresponding region comprises:
acquiring a plurality of points on the corresponding region; and
performing a perspective transformation based on the plurality of points to obtain the training image, wherein the training image is a front view of the training sample.
17. The method of claim 15 or 16, wherein the training sample comprises any one of: motor vehicle number plates, house number plates and athlete number bibs.
18. An image recognition apparatus comprising:
an image acquisition unit configured to obtain a target image, the target image including a first number of characters arranged along a first direction, each character of the first number of characters being from a preset character set having a corresponding preset label set;
a prediction unit configured to obtain, based on the target image, a second number of labels arranged in sequence, each of the second number of labels being from the preset label set, the second number of labels arranged in sequence corresponding to the second number of regions arranged in the first direction in the target image; and
a label obtaining unit configured to obtain a recognition result of the target image based on the sequentially arranged second number of labels, the recognition result including the sequentially arranged first number of characters corresponding to the first number of labels among the sequentially arranged second number of labels.
19. The apparatus of claim 18, wherein the image acquisition unit comprises:
a first obtaining subunit configured to obtain a first image, the first image including a target object, the target object including the first number of characters arranged in the first direction;
a region acquisition unit configured to obtain a corresponding region of the target object in the first image; and
a second obtaining subunit configured to obtain the target image based on the corresponding region.
20. The apparatus of claim 19, wherein the second obtaining subunit comprises:
a third acquisition subunit configured to acquire a plurality of points on the corresponding region; and
a perspective transformation unit configured to perform perspective transformation based on the plurality of points to obtain the target image, wherein the target image is a front view of the target object.
21. The apparatus according to any one of claims 18-20, wherein the prediction unit comprises:
a feature extraction unit configured to perform feature extraction corresponding to the second number of channels on the target image to obtain the second number of feature maps;
a convolution unit configured to perform a convolution operation on the second number of feature maps respectively to obtain a convolution feature corresponding to each feature map in the second number of feature maps; and
a predictor configured to obtain the second number of labels arranged in sequence based on a second number of convolution features corresponding to the second number of feature maps.
22. The apparatus of claim 21, wherein the predictor comprises:
a feature fusion unit configured to fuse the second number of convolution features to obtain a fused feature;
a prediction subunit configured to obtain, based on the fused feature, a prediction matrix, where the prediction matrix includes the second number of rows, a plurality of elements of each row in the second number of rows respectively correspond to a plurality of labels in the preset label set, and a value of each element in the plurality of elements indicates whether the row corresponds to the label corresponding to the element; and
a fourth obtaining subunit configured to obtain the second number of labels arranged in order based on the prediction matrix.
23. The apparatus of claim 22, wherein the fourth obtaining subunit comprises:
a first determining unit configured to obtain, for each of the second number of rows, a largest element having a largest value among a plurality of elements of the row; and
a label obtaining subunit configured to obtain the second number of labels arranged in order based on the second number of maximum elements corresponding to the second number of rows and an arrangement order of the second number of rows.
24. The apparatus according to any of claims 18-23, wherein the preset set of labels further comprises a first label that does not correspond to any character in the preset set of characters, the label obtaining subunit comprising:
a second determination unit configured to determine the first number of labels among the sequentially arranged second number of labels as the sequentially arranged first number of labels, wherein each label of the first number of labels among the sequentially arranged second number of labels is different from the first label.
25. The apparatus according to any one of claims 19-24, wherein the target object comprises any one of: motor vehicle number plates, house number plates and athlete number bibs.
26. An apparatus for training an image recognition model, comprising:
a training image acquisition unit configured to obtain a training image, the training image including a first number of characters arranged in a first direction, each character of the first number of characters being from a preset character set having a corresponding preset label set;
an annotation unit configured to obtain annotation labels of the training image, wherein the annotation labels include the first number of labels arranged in sequence, the first number of labels arranged in sequence corresponds to the first number of characters arranged along the first direction, and each label of the first number of labels is from the preset label set;
a training image input unit configured to input the training image to an image recognition model to obtain a second number of labels arranged in sequence, each label of the second number of labels being from the preset label set, the second number of labels arranged in sequence corresponding to the second number of regions arranged along the first direction in the training image;
a predicted label obtaining unit configured to obtain predicted labels comprising the first number of labels among the second number of labels arranged in sequence; and
a parameter adjusting unit configured to adjust parameters of the image recognition model based on the annotation labels and the predicted labels.
27. The apparatus of claim 26, wherein the image recognition model includes a feature extraction network and a prediction network, the training image input unit includes:
a first input subunit configured to input the training image to the feature extraction network to obtain a second number of feature maps corresponding to the second number of channels; and
a second input subunit configured to input the second number of feature maps to the prediction network to obtain the second number of labels arranged in sequence.
28. The apparatus of claim 27, wherein the feature extraction network comprises a class perspective network.
29. The apparatus of claim 27, wherein the second input subunit comprises:
a convolution unit configured to perform a convolution operation on the second number of feature maps respectively to obtain a convolution feature corresponding to each feature map in the second number of feature maps; and
a first prediction subunit configured to obtain a second number of labels arranged in the order based on a second number of convolution features corresponding to the second number of feature maps.
30. The apparatus of claim 29, wherein the first prediction subunit comprises:
a fusion unit configured to fuse the second number of convolution features to obtain a fused feature;
a second prediction subunit, configured to obtain, based on the fused feature, a prediction matrix, where the prediction matrix includes the second number of rows, a plurality of elements of each row in the second number of rows respectively correspond to a plurality of labels in the preset label set, and a value of each element in the plurality of elements indicates whether the row corresponds to the label corresponding to the element; and
a third prediction subunit configured to obtain the second number of labels arranged in order based on the prediction matrix.
31. The apparatus of claim 30, wherein the third prediction subunit comprises:
a determining subunit configured to, for each of the second number of rows, obtain a largest element of the plurality of elements of the row having a largest value; and
a predicted label obtaining subunit configured to obtain the second number of labels arranged in order based on the second number of maximum elements corresponding to the second number of rows and an arrangement order of the second number of rows.
32. The apparatus of any one of claims 26-31, wherein the training image acquisition unit comprises:
a first image obtaining unit configured to obtain a first image, the first image including a training sample including the first number of characters arranged in the first direction;
an image area acquisition unit configured to obtain a corresponding region of the training sample in the first image; and
a training image obtaining subunit configured to obtain the training image based on the corresponding region.
33. The apparatus of claim 32, wherein the training image obtaining subunit comprises:
a point determination unit configured to acquire a plurality of points on the corresponding region; and
a perspective transformation unit configured to perform perspective transformation based on the plurality of points to obtain the training image, wherein the training image is a front view of the training sample.
34. The apparatus of any one of claims 26-33, wherein the training sample comprises any one of: motor vehicle number plates, house number plates and athlete number bibs.
35. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any one of claims 1-17.
36. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-17.
37. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-17.
CN202210503528.5A 2022-05-09 2022-05-09 Image recognition method and method for training image recognition model Pending CN114821581A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210503528.5A 2022-05-09 2022-05-09 Image recognition method and method for training image recognition model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210503528.5A 2022-05-09 2022-05-09 Image recognition method and method for training image recognition model

Publications (1)

Publication Number Publication Date
CN114821581A 2022-07-29

Family

ID=82513881

Family Applications (1)

Application Number Priority Date Filing Date Title Status
CN202210503528.5A 2022-05-09 2022-05-09 Image recognition method and method for training image recognition model Pending

Country Status (1)

Country Link
CN (1) CN114821581A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620271A (en) * 2022-09-09 2023-01-17 北京精英路通科技有限公司 Image processing and model training method and device
CN115620271B (en) * 2022-09-09 2023-08-04 北京精英路通科技有限公司 Image processing and model training method and device
CN115578584A (en) * 2022-09-30 2023-01-06 北京百度网讯科技有限公司 Image processing method, and image processing model construction and training method
CN115578584B (en) * 2022-09-30 2023-08-29 北京百度网讯科技有限公司 Image processing method, image processing model construction and training method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination