WO2023047159A1 - Object sequence recognition method, network training method, apparatuses, device, and medium - Google Patents

Object sequence recognition method, network training method, apparatuses, device, and medium

Info

Publication number
WO2023047159A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
sequence
feature
object sequence
class
Prior art date
Application number
PCT/IB2021/058767
Other languages
French (fr)
Inventor
Jinghuan Chen
Jiabin MA
Original Assignee
Sensetime International Pte. Ltd.
Priority date
Filing date
Publication date
Application filed by Sensetime International Pte. Ltd. filed Critical Sensetime International Pte. Ltd.
Priority to CN202180002770.8A priority Critical patent/CN116157801A/en
Priority to AU2021240212A priority patent/AU2021240212A1/en
Publication of WO2023047159A1 publication Critical patent/WO2023047159A1/en

Classifications

    • GPHYSICS
    • G07CHECKING-DEVICES
    • G07FCOIN-FREED OR LIKE APPARATUS
    • G07F17/00Coin-freed apparatus for hiring articles; Coin-freed facilities or services
    • G07F17/32Coin-freed apparatus for hiring articles; Coin-freed facilities or services for games, toys, sports, or amusements
    • G07F17/3202Hardware aspects of a gaming system, e.g. components, construction, architecture thereof
    • G07F17/3216Construction aspects of a gaming system, e.g. housing, seats, ergonomic aspects
    • G07F17/322Casino tables, e.g. tables having integrated screens, chip detection means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G07CHECKING-DEVICES
    • G07FCOIN-FREED OR LIKE APPARATUS
    • G07F17/00Coin-freed apparatus for hiring articles; Coin-freed facilities or services
    • G07F17/32Coin-freed apparatus for hiring articles; Coin-freed facilities or services for games, toys, sports, or amusements
    • G07F17/3225Data transfer within a gaming system, e.g. data sent between gaming machines and users
    • G07F17/3232Data transfer within a gaming system, e.g. data sent between gaming machines and users wherein the operator is informed
    • GPHYSICS
    • G07CHECKING-DEVICES
    • G07FCOIN-FREED OR LIKE APPARATUS
    • G07F17/00Coin-freed apparatus for hiring articles; Coin-freed facilities or services
    • G07F17/32Coin-freed apparatus for hiring articles; Coin-freed facilities or services for games, toys, sports, or amusements
    • G07F17/3244Payment aspects of a gaming system, e.g. payment schemes, setting payout ratio, bonus or consolation prizes
    • G07F17/3248Payment aspects of a gaming system, e.g. payment schemes, setting payout ratio, bonus or consolation prizes involving non-monetary media of fixed value, e.g. casino chips of fixed value

Definitions

  • Embodiments of the application relate to the technical field of image processing, and relate, but are not limited, to an object sequence recognition method, a network training method, apparatuses, a device, and a medium.
  • Sequence recognition on an image is an important research subject in computer vision.
  • a sequence recognition algorithm is widely applied to scene text recognition, license plate recognition and other scenes.
  • a neural network is used to recognize an image of sequential objects.
  • the neural network may be obtained by training that takes the classes of the objects in the sequential objects as supervision information.
  • An embodiment of the application provides an object sequence recognition method, which may include the following operations.
  • a first image including an object sequence is acquired.
  • the first image is input to an object sequence recognition network, and feature extraction is performed to obtain a feature sequence, supervision information in a training process of the object sequence recognition network at least including class supervision information of each sample object in a sample object sequence and sequence length supervision information of the sample object of each class in the sample object sequence.
  • a class of the object sequence is predicted based on the feature sequence using a classifier of the object sequence recognition network to obtain class information of the object sequence.
  • the operation that the first image is input to an object sequence recognition network and feature extraction is performed to obtain a feature sequence may include the following operations. Feature extraction is performed on the first image using a convolutional subnetwork in the object sequence recognition network to obtain a feature map. The feature map is split to obtain the feature sequence. As such, it is easy to subsequently recognize object classes in the feature sequence more accurately.
  • the operation that feature extraction is performed on the first image using a convolutional subnetwork in the object sequence recognition network to obtain a feature map may include the following operations.
  • the first image is down-sampled using the convolutional subnetwork in a length dimension of the first image in a first direction to obtain a first-dimensional feature, the first direction being different from an arrangement direction of objects in the object sequence.
  • a feature in a length dimension of the first image in a second direction is extracted based on a length of the first image in the second direction to obtain a second-dimensional feature.
  • the feature map is obtained based on the first-dimensional feature and the second-dimensional feature. As such, feature information of the first image in the dimension in the second direction may be maximally retained.
  • the operation that the feature map is split to obtain the feature sequence may include the following operations.
  • the feature map is pooled in the first direction to obtain a pooled feature map.
  • the pooled feature map is split in the second direction to obtain the feature sequence. Accordingly, the feature map is split in the second direction after being pooled in the first direction, so that the feature sequence may include more detail information of the first image in the second direction.
  • the operation that a class of the object sequence is predicted based on the feature sequence using a classifier of the object sequence recognition network to obtain class information of the object sequence may include the following operations.
  • a class corresponding to each feature in the feature sequence is predicted using the classifier of the object sequence recognition network.
  • a class of each object in the object sequence is determined based on a prediction result of the class corresponding to each feature in the feature sequence.
  • a sequence length of target features of objects belonging to the same class is determined in the feature sequence.
  • the class information of the object sequence is obtained based on the class of each object in the object sequence and a sequence length of target features corresponding to the objects of each class. Accordingly, a classification result of the feature sequence is processed using a post-processing rule of a Connectionist Temporal Classification (CTC) loss function, so that the predicted class of each object and the length of the object sequence may be more accurate.
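  • To make the overall flow concrete, the following is a minimal PyTorch-style sketch of such an inference pipeline, assuming the objects are stacked along the image height. The names backbone, classifier, and decode_ctc_style are illustrative placeholders for the convolutional subnetwork, the classifier, and the CTC-style post-processing, not identifiers from the embodiments; the pooling, splitting, and decoding steps are sketched in more detail later in this document.

```python
import torch

def recognize(first_image, backbone, classifier, decode_ctc_style):
    """Hypothetical end-to-end inference over one preprocessed first image."""
    feature_map = backbone(first_image.unsqueeze(0))     # (1, C, H, W) feature map
    pooled = feature_map.mean(dim=3)                     # pool over width (the first direction)
    feature_sequence = pooled.squeeze(0).permute(1, 0)   # (T, C): one feature per height slice
    logits = classifier(feature_sequence)                # (T, num_classes) per-feature prediction
    classes, lengths = decode_ctc_style(logits)          # class of each object + per-class length
    return classes, lengths
```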
  • An embodiment of the application provides a method for training an object sequence recognition network, which may include the following operations.
  • a sample image is acquired, the sample image including a sample object sequence and class labeling information of the sample object sequence.
  • the sample image is input to an object sequence recognition network to be trained, and feature extraction is performed to obtain a sample feature sequence.
  • Class prediction is performed on sample objects in the sample object sequence based on the sample feature sequence using a classifier of the object sequence recognition network to be trained to obtain a class prediction result of the sample object sequence, the class prediction result of the sample object sequence including class prediction information of each sample object in the sample object sequence.
  • a first loss and a second loss are determined based on the class prediction result of the sample object sequence, the first loss being configured to supervise the class prediction result of the sample object sequence based on the class labeling information of the sample object sequence, and the second loss being configured to supervise the number of sample objects of each class in the sample object sequence based on the class labeling information of the sample object sequence.
  • a network parameter of the object sequence recognition network to be trained is adjusted according to the first loss and the second loss such that a loss of a classification result output by an adjusted object sequence recognition network satisfies a convergence condition. Accordingly, the first loss for supervising the whole sequence and the second loss for supervising the number of each class in the sequence are introduced, so that an overall class prediction effect of the network may be improved.
  • the operation that the sample image is input to an object sequence recognition network to be trained and feature extraction is performed to obtain a sample feature sequence may include the following operations. Feature extraction is performed on the sample image using a convolutional subnetwork in the object sequence recognition network to be trained to obtain a sample feature map of the sample image. The sample feature map is split to obtain the sample feature sequence. As such, the obtained sample feature sequence may retain more features in the second direction, and the training accuracy of the network may be improved.
  • the operation that feature extraction is performed on the sample image using a convolutional subnetwork in the object sequence recognition network to be trained to obtain a sample feature map of the sample image may include the following operations.
  • the sample image is down-sampled using the convolutional subnetwork in a length dimension of the sample image in a first direction to obtain a first-dimensional sample feature, the first direction being different from an arrangement direction of the sample object sequence in the sample sequence.
  • a feature in a length dimension of the sample image in a second direction is extracted based on a length of the sample image in the second direction to obtain a second-dimensional sample feature.
  • the sample feature map of the sample image is obtained based on the first-dimensional sample feature and the second-dimensional sample feature. As such, feature information in a dimension of each sample image in the second direction may be maximally retained.
  • the operation that the sample feature map is split to obtain the sample feature sequence may include the following operations.
  • the sample feature map is pooled in the first direction to obtain a pooled sample feature map.
  • the pooled sample feature map is split in the second direction to obtain the sample feature sequence. Accordingly, the sample feature map is split in the dimension in the second direction after being pooled in the dimension in the first direction, so that the sample feature sequence may retain more detailed information of the sample image in the dimension in the second direction.
  • the operation that a network parameter of the object sequence recognition network to be trained is adjusted according to the first loss and the second loss such that a loss of a classification result output by an adjusted object sequence recognition network satisfies a convergence condition may include the following operations. Weighted fusion is performed on the first loss and the second loss to obtain a total loss. The network parameter of the object sequence recognition network to be trained is adjusted according to the total loss such that the loss of the classification result output by the adjusted object sequence recognition network satisfies the convergence condition. Accordingly, two loss functions are fused as the total loss, and the network is trained using the total loss, so that the object recognition performance of the network may be improved.
  • the operation that weighted fusion is performed on the first loss and the second loss to obtain a total loss may include the following operations.
  • a first dynamic weight is assigned to the first loss to obtain a first dynamic loss, the first dynamic weight gradually decreasing with the increase of a training count and/or training time of the object sequence recognition network to be trained when the training count reaches a first threshold or the training time reaches a first time threshold.
  • a second dynamic weight is assigned to the second loss to obtain a second dynamic loss, the second dynamic weight gradually increasing with the increase of the training count and/or training time of the object sequence recognition network to be trained when the training count reaches a second threshold or the training time reaches a second time threshold.
  • the first dynamic loss and the second dynamic loss are fused to obtain the total loss. Accordingly, the weights of the two loss functions are dynamically adjusted, so that a prediction effect of the whole network may be improved, and furthermore, an object sequence recognition network with relatively high performance may be obtained.
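  • As a rough illustration of how the two supervisions and their weights might be combined in a single training iteration, the following PyTorch-style sketch assumes torch.nn.CTCLoss as the first loss and a hypothetical ace_loss function (sketched later in this document) as the second loss; the shapes, argument names, and the simple weighted sum are assumptions, not the exact implementation of the embodiments.

```python
import torch

def training_step(backbone, classifier, optimizer, ctc_loss, ace_loss,
                  sample_image, targets, target_lengths, w1, w2):
    """One hypothetical training iteration with the first (CTC) and second (ACE) losses."""
    feats = backbone(sample_image)                     # (N, C, H, W) sample feature map
    seq = feats.mean(dim=3).permute(2, 0, 1)           # (T, N, C) sample feature sequence
    log_probs = classifier(seq).log_softmax(dim=-1)    # (T, N, num_classes)
    input_lengths = torch.full((seq.size(1),), seq.size(0), dtype=torch.long)
    first_loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    second_loss = ace_loss(log_probs.exp(), targets, target_lengths)
    total = w1 * first_loss + w2 * second_loss         # weighted fusion into a total loss
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```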
  • An embodiment of the application provides an object sequence recognition apparatus, which may include a first acquisition module, a first extraction module, and a first prediction module.
  • the first acquisition module may be configured to acquire a first image including an object sequence.
  • the first extraction module may be configured to input the first image to an object sequence recognition network and perform feature extraction to obtain a feature sequence, supervision information in a training process of the object sequence recognition network at least including class supervision information of each sample object in a sample object sequence and sequence length supervision information of the sample object of each class in the sample object sequence.
  • the first prediction module may be configured to predict a class of the object sequence based on the feature sequence using a classifier of the object sequence recognition network to obtain class information of the object sequence.
  • the first extraction module may include a first extraction submodule and a first splitting submodule.
  • the first extraction submodule may be configured to perform feature extraction on the first image using a convolutional subnetwork in the object sequence recognition network to obtain a feature map.
  • the first splitting submodule may be configured to split the feature map to obtain the feature sequence.
  • the first extraction submodule may include a first down-sampling unit, a first extraction unit, and a first determination unit.
  • the first down-sampling unit may be configured to down-sample the first image using the convolutional subnetwork in a length dimension of the first image in a first direction to obtain a first-dimensional feature, the first direction being different from an arrangement direction of objects in the object sequence.
  • the first extraction unit may be configured to extract a feature in a length dimension of the first image in a second direction based on a length of the first image in the second direction to obtain a second-dimensional feature.
  • the first determination unit may be configured to obtain the feature map based on the first-dimensional feature and the second-dimensional feature.
  • the first splitting submodule may include a first pooling unit and a first splitting unit.
  • the first pooling unit may be configured to pool the feature map in the first direction to obtain a pooled feature map.
  • the first splitting unit may be configured to split the pooled feature map in the second direction to obtain the feature sequence.
  • the first prediction module may include a first prediction submodule, a first determination submodule, a second determination submodule, and a third determination submodule.
  • the first prediction submodule may be configured to predict a class corresponding to each feature in the feature sequence using the classifier of the object sequence recognition network.
  • the first determination submodule may be configured to determine a class of each object in the object sequence based on a prediction result of the class corresponding to each feature in the feature sequence.
  • the second determination submodule may be configured to determine a sequence length of target features of objects belonging to the same class in the feature sequence.
  • the third determination submodule may be configured to obtain the class information of the object sequence based on the class of each object in the object sequence and a sequence length of target features corresponding to the objects of each class.
  • An embodiment of the application provides an apparatus for training an object sequence recognition network, which may include a second acquisition module, a second extraction module, a second prediction module, a first determination module, and a first adjustment module.
  • the second acquisition module may be configured to acquire a sample image, the sample image including a sample object sequence and class labeling information of the sample object sequence.
  • the second extraction module may be configured to input the sample image to an object sequence recognition network to be trained and perform feature extraction to obtain a sample feature sequence.
  • the second prediction module may be configured to perform class prediction on sample objects in the sample object sequence based on the sample feature sequence using a classifier of the object sequence recognition network to be trained to obtain a class prediction result of the sample object sequence, the class prediction result of the sample object sequence including class prediction information of each sample object in the sample object sequence.
  • the first determination module may be configured to determine a first loss and a second loss based on the class prediction result of the sample object sequence, the first loss being configured to supervise the class prediction result of the sample object sequence based on the class labeling information of the sample object sequence, and the second loss being configured to supervise the number of sample objects of each class in the sample object sequence based on the class labeling information of the sample object sequence.
  • the first adjustment module may be configured to adjust a network parameter of the object sequence recognition network to be trained according to the first loss and the second loss such that a loss of a classification result output by an adjusted object sequence recognition network satisfies a convergence condition.
  • the second extraction module may include a second extraction submodule and a second splitting submodule.
  • the second extraction submodule may be configured to perform feature extraction on the sample image using a convolutional subnetwork in the object sequence recognition network to be trained to obtain a sample feature map of the sample image.
  • the second splitting submodule may be configured to split the sample feature map to obtain the sample feature sequence.
  • the second extraction submodule may include a second down-sampling unit, a second extraction unit, and a second determination unit.
  • the second down-sampling unit may be configured to down-sample the sample image using the convolutional subnetwork in a length dimension of the sample image in a first direction to obtain a first-dimensional sample feature, the first direction being different from an arrangement direction of the sample object sequence in the sample sequence.
  • the second extraction unit may be configured to extract a feature in a length dimension of the sample image in a second direction based on a length of the sample image in the second direction to obtain a second-dimensional sample feature.
  • the second determination unit may be configured to obtain the sample feature map of the sample image based on the first-dimensional sample feature and the second-dimensional sample feature.
  • the second splitting submodule may include a second pooling unit and a second splitting unit.
  • the second pooling unit may be configured to pool the sample feature map in the first direction to obtain a pooled sample feature map.
  • the second splitting unit may be configured to split the pooled sample feature map in the second direction to obtain the sample feature sequence.
  • the first adjustment module may include a first fusion submodule and a first adjustment submodule.
  • the first fusion submodule may be configured to perform weighted fusion on the first loss and the second loss to obtain a total loss.
  • the first adjustment submodule may be configured to adjust the network parameter of the object sequence recognition network to be trained according to the total loss such that the loss of the classification result output by the adjusted object sequence recognition network satisfies the convergence condition.
  • the first fusion submodule may include a first assignment unit, a second assignment unit, and a first fusion unit.
  • the first assignment unit may be configured to assign a first dynamic weight to the first loss to obtain a first dynamic loss, the first dynamic weight gradually decreasing with the increase of a training count and/or training time of the object sequence recognition network to be trained when the training count reaches a first threshold or the training time reaches a first time threshold.
  • the second assignment unit may be configured to assign a second dynamic weight to the second loss to obtain a second dynamic loss, the second dynamic weight gradually increasing with the increase of the training count and/or training time of the object sequence recognition network to be trained when the training count reaches a second threshold or the training time reaches a second time threshold.
  • the first fusion unit may be configured to fuse the first dynamic loss and the second dynamic loss to obtain the total loss.
  • an embodiment of the application provides a computer storage medium, in which a computer-executable instruction may be stored.
  • the computer-executable instruction may be executed to implement the abovementioned object sequence recognition method.
  • the computer-executable instruction may be executed to implement the abovementioned method for training an object sequence recognition network.
  • An embodiment of the application provides a computer device, which may include a memory and a processor.
  • a computer-executable instruction may be stored in the memory.
  • the processor may run the computer-executable instruction in the memory to implement the abovementioned object sequence recognition method.
  • the processor may run the computer-executable instruction in the memory to implement the abovementioned method for training an object sequence recognition network.
  • feature extraction is performed on the first image at first to obtain the feature sequence.
  • class prediction is performed on the object sequence in the feature sequence to obtain a relatively accurate classification result of the object sequence in the feature sequence.
  • the classification result of the object sequence in the feature sequence is further processed to determine the class information of multiple object sequences. As such, the accuracy of recognizing the object sequence in the feature sequence may still be improved even though the feature sequence of the object is relatively long.
  • FIG. 1 is an implementation flowchart of an object sequence recognition method according to an embodiment of the application.
  • FIG. 2A is another implementation flowchart of an object sequence recognition method according to an embodiment of the application.
  • FIG. 2B is an implementation flowchart of a method for training an object sequence recognition network according to an embodiment of the application.
  • FIG. 3 is a structure diagram of an object sequence recognition network according to an embodiment of the application.
  • FIG. 4 is a schematic diagram of an application scene of an object sequence recognition network according to an embodiment of the application.
  • FIG. 5A is a structure composition diagram of an object sequence recognition apparatus according to an embodiment of the application.
  • FIG. 5B is another structure composition diagram of an object sequence recognition apparatus according to an embodiment of the application.
  • FIG. 6 is a composition structure diagram of a computer device according to an embodiment of the application.
  • “first/second/third” involved in the following descriptions is only for distinguishing similar objects, and does not represent a specific order of the objects. It can be understood that “first/second/third” may be interchanged where permitted, so that the embodiments of the application described herein can be implemented in orders other than those illustrated or described.
  • ACE is short for Aggregation Cross-Entropy, a loss used to supervise the number of objects of each class in a sequence.
  • Connectionist Temporal Classification calculates a loss value, has the main advantage that unaligned data may be aligned automatically, and is mainly used for training on sequential data that is not aligned in advance, e.g., speech recognition and Optical Character Recognition (OCR).
  • a CTC loss may be used to supervise an overall prediction condition of a sequence during the early training of a network.
  • the device provided in the embodiments of the application may be implemented as various types of user terminals with an image collection function, such as a notebook computer, a tablet computer, a desktop computer, a camera, and a mobile device (e.g., a personal digital assistant, a dedicated messaging device, and a portable game device), or may be implemented as a server.
  • the exemplary application of the device implemented as the terminal or the server will be described below.
  • a method may be applied to a computer device.
  • a function realized by the method may be realized by a processor in the computer device by calling a program code.
  • the program code may be stored in a computer storage medium. It can be seen that the computer device at least includes the processor and the storage medium.
  • An embodiment of the application provides an object sequence recognition method. Descriptions will be made below in combination with the operations shown in FIG. 1.
  • the object sequence may be a sequence formed by sequentially arranging any objects.
  • a specific object type is not specially limited.
  • the first image is an image collected in a game place, and the object sequence may be tokens in a game in the game place.
  • the first image is an image collected in a scene in which planks of various materials or colors are stacked, and the object sequence may be a pile of stacked planks.
  • the first image is at least one frame of image.
  • the at least one frame of image is an image whose size information and pixel values both satisfy certain conditions, obtained by size adjustment and pixel value normalization.
  • an acquired second image is preprocessed to obtain the first image that may be input to an object sequence recognition network. That is, S101 may be implemented through the following S111 and S112 (not shown in the figure).
  • the second image may be an image including appearance information of the object sequence.
  • the second image may be an image collected by any collection device, or may be an image acquired from the Internet or another device or any frame in a video.
  • the second image is a frame of image which is acquired from a network and of which picture content includes the object sequence.
  • the second image is a video segment of which picture content includes the object sequence, etc.
  • an image parameter of the second image is preprocessed based on a preset image parameter to obtain the first image.
  • the preset image parameter includes an image width, a height, an image pixel value, etc.
  • size information of the original image is adjusted according to a preset size to obtain an adjusted image.
  • the preset size is a preset width and a preset aspect ratio. For example, the widths of multiple frames of original images are uniformly adjusted to the preset width.
  • pixel values of the adjusted image are normalized to obtain the first image. For example, for an original image of which a height is less than a preset height, an image region of which a height does not reach the preset height is filled with pixels, e.g., gray pixel values.
  • the size information is adjusted so that the obtained first images have the same aspect ratio, and deformations generated when the image is processed may be reduced.
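  • As an illustration of this preprocessing, the following Python sketch resizes an image to a preset width, pads the height with gray pixels, and normalizes the pixel values. The concrete target sizes, the pad value, and the normalization constants are assumptions for illustration only.

```python
import numpy as np
from PIL import Image

def preprocess(image_path, target_w=224, target_h=1280, pad_value=127):
    """Resize to a preset width, pad the height with gray pixels, normalize pixel values."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    new_h = int(h * target_w / w)                    # keep the original aspect ratio
    img = img.resize((target_w, new_h), Image.BILINEAR)
    arr = np.asarray(img, dtype=np.float32)
    if new_h < target_h:                             # fill the missing region with gray pixels
        pad = np.full((target_h - new_h, target_w, 3), pad_value, dtype=np.float32)
        arr = np.concatenate([arr, pad], axis=0)
    else:
        arr = arr[:target_h]
    return (arr / 255.0 - 0.5) / 0.5                 # assumed normalization to [-1, 1]
```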
  • the first image is input to an object sequence recognition network, and feature extraction is performed to obtain a feature sequence.
  • supervision information in a training process of the object sequence recognition network at least includes class supervision information of each sample object in a sample object sequence and sequence length supervision information of the sample object of each class in the sample object sequence.
  • the first image is input to the object sequence recognition network, and feature extraction is performed on the first image using a convolutional neural network part in the object sequence recognition network to obtain a feature map.
  • the feature map is split according to a certain manner, thereby splitting the feature map extracted by the convolutional neural network into a plurality of feature sequences to facilitate subsequent classification of the object sequence in the first image.
  • the feature sequence is a sequence formed by features in the feature map. In some possible implementation modes, the feature map may be split according to a height of the feature map to obtain the feature sequence.
  • Each feature in the feature sequence may correspond to an object in the object sequence. Alternatively, multiple features in the feature sequence correspond to an object in the object sequence.
  • a class of the object sequence is predicted based on the feature sequence using a classifier of the object sequence recognition network to obtain class information of the object sequence.
  • class prediction is performed on features in the feature sequence using the classifier to obtain a classification result of each feature.
  • Class information of the at least one object sequence is determined based on the classification result of the feature sequence.
  • a class of the feature in the feature sequence is predicted using the classifier in the object sequence recognition network, thereby obtaining a predicted probability of the class of the object sequence corresponding to the feature sequence.
  • the class information includes a class of each object and a sequence length of objects of the same class in the object sequence.
  • the classification result of the feature sequence may represent a probability that the feature in the feature sequence belongs to a class corresponding to each classification label.
  • a class corresponding to a classification label, of which a probability value is greater than a certain threshold, in a group of probabilities corresponding to a feature sequence is determined as a class of an object corresponding to a feature in the feature sequence. Accordingly, class prediction may be performed on the feature in the feature sequence to obtain the class of each feature, the class of each feature being a class of an object corresponding to the feature.
  • feature sequences belonging to the same class are feature sequences corresponding to the same object, and the class of the features belonging to the same class is the class of the object corresponding to these features. In this way, the class of each object in the object sequence may be obtained.
  • feature extraction is performed on the first image at first to obtain the feature sequence.
  • class prediction is performed on the object sequence in the feature sequence to obtain a relatively accurate classification result of the object. As such, the accuracy of recognizing the object sequence may still be improved even though the feature sequence of the object is relatively long.
  • the feature extraction of the first image is implemented by a convolutional network obtained by finely adjusting the structure of a Residual Network (ResNet), thereby obtaining the feature sequence. That is, S102 may be implemented through the operations shown in FIG. 2A.
  • FIG. 2A is another implementation flowchart of an object sequence recognition method according to an embodiment of the application. The following descriptions will be made in combination with FIG. 2A.
  • the object sequence recognition network is obtained by training based on a first loss for supervising a whole sample image and a second loss for supervising an object of each class in the sample image.
  • Feature extraction is performed on the first image using a convolutional network part in the object sequence recognition network to obtain the feature map.
  • the convolutional network part in the object sequence recognition network may be obtained by fine adjustment based on a network structure of a ResNet.
  • S201 may be implemented through the following S211 to S213 (not shown in the figure).
  • the first image is down-sampled using the convolutional subnetwork in a length dimension of the first image in a first direction to obtain a first-dimensional feature.
  • a network structure of an adjusted ResNet is taken as a convolutional network for the feature extraction of the first image.
  • the first direction is different from an arrangement direction of objects in the object sequence.
  • the first direction may be a width direction of the object sequence.
  • the object sequence is multiple objects arranged in a horizontal direction, namely the arrangement direction of the objects in the object sequence is the horizontal direction, the first direction may be the height direction of the object sequence.
  • the strides in the first direction in the last strides of convolutional layers 3 and 4 in the network structure of the ResNet are kept unchanged at 2.
  • a feature in a length dimension of the first image in a second direction is extracted based on a length of the first image in the second direction to obtain a second-dimensional feature.
  • the second direction is the same as the arrangement direction of the objects in the object sequence. The strides in the second direction in the last strides of convolutional layers 3 and 4 in the network structure of the ResNet are changed from 2 to 1. In this manner, down-sampling is not performed in the length dimension of the first image in the second direction, namely the length of the first image in the second direction is kept, and feature extraction is performed in the length dimension of the first image in the second direction to obtain a second-dimensional feature whose length is the same as the length of the first image in the second direction.
  • the arrangement direction of the object sequence is the height direction.
  • Height strides in the last strides of convolutional layers 3 and 4 in the network structure of the ResNet are changed from 2 to 1. In this manner, down-sampling is not performed in the height dimension of the first image, namely the height of the first image is kept, and feature extraction is performed in the height direction of the first image to obtain a feature whose height is the same as the height of the first image.
  • the feature map is obtained based on the first-dimensional feature and the second-dimensional feature.
  • the first-dimensional feature is combined with the second-dimensional feature to form the feature map of the first image.
  • the first image is not down-sampled in its length dimension in the second direction, so that the dimension of the dimensional feature in the second direction is the same as that of the first image in the second direction; the first image is down-sampled in the dimension in the first direction, which is different from the arrangement direction of the objects, so that the length of the dimensional feature in the first direction is changed to half of the length of the first image in the first direction.
  • feature information of the first image in the dimension of the arrangement direction of the object sequence may be maximally retained.
  • convolutional layers 3 and 4, of which the last strides are (2, 2) in the ResNet, are changed to convolutional layers of which the strides are (1, 2), so that the first image is not down-sampled in the height dimension, making the dimension of the height-dimensional feature the same as the height of the first image, and the first image is down-sampled in the width dimension, changing the width of the width-dimensional feature to half of the width of the first image.
  • feature information of the first image in the height dimension may be maximally retained.
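  • As a minimal sketch of this stride adjustment, the following Python code assumes torchvision's ResNet-50, in which the down-sampling of stages 3 and 4 is carried by the first block's 3x3 convolution and its residual (downsample) branch; the attribute names are specific to torchvision, and the choice of ResNet-50 is an assumption.

```python
import torch
from torchvision.models import resnet50

def build_backbone():
    """Change the last down-sampling strides of stages 3 and 4 from (2, 2) to (1, 2),
    so the height dimension is no longer down-sampled while the width still is."""
    net = resnet50(weights=None)
    for stage in (net.layer3, net.layer4):
        block = stage[0]                        # the block that performs down-sampling
        block.conv2.stride = (1, 2)             # (height stride, width stride)
        block.downsample[0].stride = (1, 2)     # keep the residual branch consistent
    return torch.nn.Sequential(
        net.conv1, net.bn1, net.relu, net.maxpool,
        net.layer1, net.layer2, net.layer3, net.layer4,
    )
```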
  • the feature map is split to obtain the feature sequence.
  • the feature map is split based on dimension information of the feature map to obtain the feature sequence.
  • the dimension information of the feature map includes a dimension in a first direction and a dimension in a second direction (e.g., a width dimension and a height dimension).
  • the feature map is processed differently based on the two dimensions to obtain the feature sequence. For example, the feature map is pooled at first in the dimension of the feature map in the first direction, and then a splitting operation is performed on the feature map in the dimension of the feature map in the second direction, thereby splitting the feature map into the feature sequence.
  • feature extraction is performed on the image using the object sequence recognition network obtained by training based on two loss functions, and the feature map is split according to the dimension information, so that the obtained feature sequence may retain more features in the second direction to make it easy to subsequently recognize the class of the object sequence in the feature sequence more accurately.
  • the feature map is pooled in the dimension in the first direction, and is split in the dimension in the second direction to obtain the feature sequence. That is, S202 may be implemented through S221 and S222 (not shown in the figure).
  • the feature map is pooled in the first direction to obtain a pooled feature map.
  • average pooling is performed on the feature map in the dimension of the feature map in the first direction, and the dimension of the feature map in the second direction and a channel dimension are kept unchanged, to obtain the pooled feature map.
  • the arrangement direction of the objects in the object sequence is the height direction
  • the feature map is pooled in the width dimension in the dimension information to obtain the pooled feature map.
  • a dimension of a first feature map is 2,048*40*16 (the channel dimension is 2,048, the height dimension is 40, and the width dimension is 16), and a 2,048*40*1 pooled feature map is obtained by average pooling in the width dimension.
  • the pooled feature map is split in the second direction to obtain the feature sequence.
  • the pooled feature map is split in the dimension of the feature map in the second direction to obtain the feature sequence.
  • the number of vectors obtained by splitting the pooled feature map may be determined based on a length of the feature map in the second direction. For example, if the length of the feature map in the second direction is 60, the pooled feature map is split into 60 vectors.
  • the arrangement direction of the objects in the object sequence is the height direction, and the pooled feature map is split based on the height dimension to obtain the feature sequence.
  • the pooled feature map is 2,048*40*1
  • the pooled feature map is split in the height dimension to obtain 40 2,048-dimensional vectors, each of which corresponds to the feature of 1/40 of the image region in the height direction of the original first image. Accordingly, the feature map is split in the second direction, which is the same as the arrangement direction of the objects, after being pooled in the first direction, which is different from the arrangement direction of the objects, so that the feature sequence may include more detailed information of the first image in the second direction.
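  • The pool-then-split step described above might look like the following sketch, using the 2,048*40*16 example; the (channel, height, width) tensor layout is an assumption.

```python
import torch

def split_feature_map(feature_map):
    """Pool over the width (first direction), then split along the height (second direction)."""
    pooled = feature_map.mean(dim=2, keepdim=True)   # average pooling over width -> (C, H, 1)
    pooled = pooled.squeeze(2)                       # (C, H)
    # Split along the height into H feature vectors, one per 1/H of the image height.
    return [pooled[:, i] for i in range(pooled.size(1))]

# Example: a 2,048*40*16 feature map becomes 40 vectors of dimension 2,048.
sequence = split_feature_map(torch.randn(2048, 40, 16))
assert len(sequence) == 40 and sequence[0].shape == (2048,)
```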
  • the classification result of the feature sequence is further processed to predict a class of each object and a length of the object sequence. That is, S104 may be implemented through the following S141 to S143 (not shown in the figure).
  • a class corresponding to each feature in the feature sequence is predicted using the classifier of the object sequence recognition network.
  • the feature sequence is input to the classifier to predict the class corresponding to each feature in the feature sequence. For example, if the total class number of the object sequence is n, the class of the feature in the feature sequence is predicted using a classifier with n class labels, thereby obtaining a predicted probability that the feature in the feature sequence corresponds to each class label in the n class labels.
  • a class of each object in the object sequence is determined based on a prediction result of the class corresponding to each feature in the feature sequence.
  • the feature sequence includes multiple feature vectors of the image to be recognized in the dimension in the second direction; that is, each feature vector is a part of the features of the image to be recognized, and may include all features of one or more object sequences or only part of the features of an object sequence.
  • the classification result of the object corresponding to each feature in the feature sequence may be combined to accurately recognize the class of each object in the object sequence in the first image.
  • a feature set of objects belonging to the same class is determined at first in the feature sequence, and then a sequence length of a sequence formed by these features is determined.
  • the object sequence is tokens stacked in the height direction, and a token sequence length corresponding to features of tokens belonging to the same class is determined in the feature sequence.
  • a class of a token includes a face value of the token, a pattern of the token, a game that the token is suitable for, etc.
  • the sequence length of the target features of the objects of each class is indeterminate. Therefore, the fixed-length feature sequence is converted into target feature sequences of variable lengths.
  • the class information of the object sequence is obtained based on the class of each object in the object sequence and a sequence length of target features corresponding to the objects of each class.
  • the class of each object and the sequence length corresponding to the objects of each class are taken as class information of at least one object. Accordingly, the classification result of the feature sequence is processed using a post-processing rule of a CTC loss function, so that the predicted class of each object and the length of the object sequence may be more accurate.
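  • One plausible reading of this CTC-style post-processing is the following greedy sketch: take the per-feature argmax, skip blanks, start a new object class whenever the predicted class changes, and count how many consecutive target features each class run occupies. The blank index and the exact merging rule are assumptions.

```python
import torch

def decode_ctc_style(logits, blank_id=0):
    """logits: (T, num_classes) classifier output for the T features in the feature sequence.
    Returns the class of each run of objects and the sequence length of its target features."""
    ids = logits.argmax(dim=-1).tolist()     # predicted class for each feature
    classes, lengths, prev = [], [], None
    for cid in ids:
        if cid == blank_id:
            prev = cid
            continue
        if cid == prev:
            lengths[-1] += 1                 # same class: extend the current run of target features
        else:
            classes.append(cid)              # a new class appears in the object sequence
            lengths.append(1)
        prev = cid
    return classes, lengths
```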
  • the object sequence recognition network is configured to recognize the class of the object in the object sequence.
  • the object sequence recognition network is obtained by training an object sequence recognition network to be trained.
  • a training process of the object sequence recognition network to be trained may be implemented through the operations shown in FIG. 2B.
  • FIG. 2B is an implementation flowchart of a method for training an object sequence recognition network according to an embodiment of the application. The following descriptions will be made in combination with FIG. 2B.
  • the sample image includes a sample object sequence and class labeling information of the sample object sequence.
  • the sample image may be multiple collected frames of labeled images of which pictures include sample objects, or may be a sample image obtained by preprocessing a collected image.
  • the sample image is input to an object sequence recognition network to be trained, and feature extraction is performed to obtain a sample feature sequence.
  • a sample image set is preprocessed at first to make sizes of sample images in the sample image set the same. Then, feature extraction is performed on the processed sample images to obtain a sample feature sequence.
  • a collected sample original image is preprocessed, data enhancement is performed on a preprocessed image, and the preprocessed image and an enhanced image are combined as a sample image. That is, S21 may be implemented through the following process.
  • image collection may be performed on a scene with the sample objects using an image collection device to obtain the sample original image.
  • the sample original image is multiple frames of images.
  • an image parameter of the sample original image is preprocessed according to a preset image parameter to obtain an adjusted image.
  • size information of the sample original image of which a picture includes the sample objects is adjusted according to a preset size, and a normalization operation is performed on pixel values of the adjusted image.
  • the preset size is a preset width and a preset aspect ratio. The widths of multiple frames of sample original images are uniformly adjusted to the preset width. For a sample original image of which a height is less than a preset height, an image region of which a height does not reach the preset height is filled with pixels, e.g., gray pixel values.
  • the size information is adjusted so that the obtained multiple frames of adjusted images have the same aspect ratio, and deformations generated when the multiple frames of adjusted images are processed may be reduced.
  • data enhancement includes random flipping, random clipping, random aspect ratio fine adjustment, random rotation, and other operations. Therefore, random flipping, random clipping, random aspect ratio fine adjustment, random rotation and other operations may be performed on the multiple frames of adjusted images to obtain richer sample images.
  • the sample image is any image in the sample image set.
  • the adjusted image and the enhanced image, of which the sizes are unified, are combined as the sample image set. Therefore, sample images may be enriched, and the overall robustness of the network to be trained may be improved.
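  • A data enhancement pipeline of the kind described above could be sketched with torchvision transforms as follows; the probabilities, rotation angle, crop scale, and aspect-ratio range are illustrative assumptions, not values from the embodiments.

```python
import torchvision.transforms as T

# Applied to the adjusted (resized and padded) sample images to enrich the sample set.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),           # random flipping
    T.RandomRotation(degrees=5),             # random small-angle rotation
    T.RandomResizedCrop(size=(1280, 224),    # random clipping with a slight
                        scale=(0.9, 1.0),    # aspect-ratio fine adjustment
                        ratio=(0.16, 0.19)),
])
```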
  • class prediction is performed on sample objects in the sample object sequence based on the sample feature sequence using a classifier of the object sequence recognition network to be trained to obtain a class prediction result of the sample object sequence.
  • the sample feature sequence is input to the classifier of the object sequence recognition network to be trained, and class prediction is performed to obtain a class prediction result corresponding to each sample feature in the sample feature sequence.
  • S23 may be implemented through the following process.
  • when the sample object is a token, the classes of all the tokens, i.e., the total classes of the tokens, are determined, and the classification labels of the classifier of the object sequence recognition network to be trained are determined based on the total classes.
  • the classification labels of the classifier are set according to the total classes of the sample objects, and then the classifier may predict the probabilities that the sample objects in the sample image belong to any class.
  • class prediction is performed on the sample objects in the sample feature sequence using the classifier with the classification labels to obtain the class prediction result of the sample feature sequence.
  • a probability that an object in each sample feature sequence belongs to each class may be predicted using a classifier with multiple classification labels to obtain a class prediction result of the sample feature sequence.
  • the class that the object in the sample feature sequence most probably belongs to in the sample feature sequence may be determined based on the class prediction result. Accordingly, the total classes of the objects are analyzed, and the classification labels that the classifier has are set, so that the classes of the objects in the sample feature sequence may be predicted more accurately.
  • a first loss and a second loss are determined based on the class prediction result of the sample object sequence.
  • the first loss is configured to supervise the class prediction result of the sample object sequence based on the class labeling information of the sample object sequence
  • the second loss is configured to supervise the number of sample objects of each class in the sample object sequence based on the class labeling information of the sample object sequence.
  • a CTC loss is adopted as the first loss
  • an ACE loss is adopted as the second loss.
  • taking the classification result of each sample feature sequence output by the classifier and the truth value information of the sample objects in each sample feature sequence as an input of the CTC loss, the predictions belonging to the same sample object in the class prediction result and the class of that sample object are predicted, so that the class of each sample object in the present frame of sample image may be predicted.
  • taking the classification result of each sample feature sequence output by the classifier and the truth value information of the sample objects in each sample feature sequence as an input of the ACE loss, the number of the sample objects belonging to the same class, i.e., the sequence length corresponding to the sample objects of each class, is predicted.
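  • For illustration, the first loss can be realized directly with a standard CTC loss, while the second loss might follow the commonly used Aggregation Cross-Entropy formulation, which compares the aggregated (time-averaged) predicted class distribution with the normalized per-class label counts; the code below is a sketch under those assumptions, and the exact equations used by the embodiments are not specified in this document.

```python
import torch
import torch.nn as nn

# First loss: supervises the whole predicted sequence against the class labeling information.
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def ace_loss(probs, targets, target_lengths, blank_id=0):
    """Second loss (ACE-style): supervises the number of sample objects of each class.
    probs: (T, N, C) per-feature class probabilities; targets: 1-D concatenated labels."""
    T, N, C = probs.shape
    aggregated = probs.sum(dim=0) / T                    # (N, C) aggregated prediction
    counts = torch.zeros(N, C, device=probs.device)
    offset = 0
    for n, length in enumerate(target_lengths.tolist()):
        for c in targets[offset:offset + length].tolist():
            counts[n, c] += 1.0                          # labeled count of each class
        counts[n, blank_id] = float(T - length)          # remaining positions count as blank
        offset += length
    counts = counts / T
    return -(counts * aggregated.clamp_min(1e-10).log()).sum(dim=1).mean()
```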
  • a network parameter of the object sequence recognition network to be trained is adjusted according to the first loss and the second loss such that a loss of a classification result output by an adjusted object sequence recognition network satisfies a convergence condition.
  • a class representing each sample object in the classification result and the truth value information of each sample object may be compared to determine the first loss.
  • a sequence representing the sample objects belonging to the same class in the classification result and a truth value of the sequence length of the sample objects of each class may be compared to determine the second loss.
  • the first loss and the second loss are combined to adjust the weight values and adjustment amounts of the object sequence recognition network to be trained, so that the losses of the classes of the sample objects and of the sequence lengths of the sample objects of the same class, which are output by the trained object sequence recognition network, converge.
  • the feature extraction of the sample image is implemented using a convolutional subnetwork in the object sequence recognition network to be trained, thereby obtaining the sample feature sequence. That is, S22 may be implemented through the following S231 and S232 (not shown in the figure).
  • feature extraction is performed on the sample image using a convolutional network obtained by finely adjusting the structure of a ResNet as the convolutional subnetwork of the recognition network to be trained to obtain the sample feature map.
  • S231 may be implemented through the following operations (not shown in the figure).
  • the sample image is down-sampled using the convolutional subnetwork in a length dimension of the sample image in a first direction to obtain a first-dimensional sample feature.
  • an implementation process of the first step is similar to that of S211. That is, when an arrangement direction of the sample object sequence is a stacking height direction, the sample image is down-sampled in a width dimension of the sample image to obtain a first-dimensional sample feature. It is set that the width strides in the last strides of convolutional layers 3 and 4 of the convolutional subnetwork are kept unchanged at 2, and the height strides are changed from 2 to 1.
  • a feature in a length dimension of the sample image in a second direction is extracted based on a length of the sample image in the second direction to obtain a second-dimensional sample feature.
  • an implementation process of the second step is similar to that of S212. That is, when an arrangement direction of the sample object sequence is a stacking height direction, feature extraction is performed based on a height of the sample image in a height dimension of the sample image to obtain a second-dimensional sample feature. For example, it is set that the height strides in the last strides of convolutional layers 3 and 4 of the convolutional subnetwork are changed from 2 to 1. In such case, down-sampling is not performed in the height dimension of the sample image, namely the height of the sample image is retained in the second-dimensional sample feature.
  • the sample feature map of the sample image is obtained based on the first-dimensional sample feature and the second-dimensional sample feature.
  • the first-dimensional sample feature and the second-dimensional sample feature are combined to form the sample feature map of the sample image.
  • the sample feature map is split to obtain the sample feature sequence.
  • an implementation process of S232 is similar to that of S202. That is, the sample feature map is processed differently based on the dimension in the first direction and the dimension in the second direction to obtain the sample feature sequence. For example, the sample feature map is pooled in the dimension in the first direction, and is split into multiple feature vectors in the dimension in the second direction to form the sample feature sequence. As such, the obtained sample feature sequence may retain more dimensional features of the sample object in the arrangement direction, and the training accuracy of the network may be improved.
  • the sample feature map is pooled in the dimension in the first direction, and is split in the dimension in the second direction to obtain the sample feature sequence. That is, S232 may be implemented through the following operations.
  • the sample feature map is pooled in the first direction to obtain a pooled sample feature map.
  • an implementation process of the first step is similar to that of S221. That is, average pooling is performed on the sample feature map in the dimension of the sample feature map in the first direction, and the dimension of the sample feature map in the second direction and a channel dimension are kept unchanged, to obtain the pooled sample feature map.
  • the pooled sample feature map is split in the second direction to obtain the sample feature sequence.
  • an implementation process of the second step is similar to that of S222. That is, the pooled sample feature map is split in the dimension of the sample feature map in the second direction to obtain the sample feature sequence. For example, if the dimension of the sample feature map in the second direction is 40, the pooled sample feature map is split into 40 vectors to form a sample feature sequence. Accordingly, the sample feature map is split in the dimension in the second direction after being pooled in the dimension in the first direction, so that the sample feature sequence may retain more detailed information of the sample image in the dimension in the second direction.
  • dynamic weighted fusion is performed on the first loss and the second loss to improve the object sequence recognition performance of the object sequence recognition network to be trained. That is, S25 may be implemented through the following S251 and S252.
  • the first loss and the second loss are weighted using different dynamic weights, and the weighted first loss and second loss are fused to obtain the total loss (a code sketch of this dynamic weighting is provided below).
  • dynamic adjustment parameters are set for the first loss and the second loss to obtain the total loss. That is, S251 may be implemented through the following process.
  • a first dynamic weight is assigned to the first loss to obtain a first dynamic loss.
  • the first dynamic weight gradually decreases with the increase of a training count and/or training time of the object sequence recognition network to be trained when the training count reaches a first threshold or the training time reaches first time. That is, the first dynamic weight gradually decreases with the training process of the object sequence recognition network to be trained. Accordingly, in the training process of the object sequence recognition network to be trained, sequences belonging to the same object in feature sequences are supervised based on classification results of the feature sequences output by the classifier using the CTC loss as the first loss during early training. Therefore, the CTC loss has relatively high performance during the early training of the network.
  • a second dynamic weight is assigned to the second loss to obtain a second dynamic loss.
  • the second dynamic weight gradually increases with the increase of the training count and/or training time of the object sequence recognition network to be trained when the training count reaches a second threshold or the training time reaches second time. That is, the second dynamic weight gradually increases with the training process of the object sequence recognition network to be trained. Accordingly, in the training process of the object sequence recognition network to be trained, the number of the objects of each class in the feature sequence is supervised based on the classification result of the feature sequence output by the classifier using the ACE loss as the second loss during later training. Therefore, the ACE loss has relatively high performance during the later training of the network.
  • the two loss functions are added to obtain the total loss of the object sequence recognition network to be trained. Accordingly, the first dynamic loss and the second dynamic loss are fused to obtain the total loss, and the object sequence recognition network is trained using the total loss, so that the robustness of the network may be improved.
  • the network parameter of the object sequence recognition network to be trained is adjusted according to the total loss such that the loss of the classification result output by the adjusted object sequence recognition network satisfies the convergence condition.
  • the object sequence recognition network to be trained is trained using the total loss obtained by fusing the first dynamic loss and the second dynamic loss, so that the prediction effect of the whole network may be improved, and an object sequence recognition network with relatively high performance may be obtained.
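  • As an illustration of the dynamic weighting described above, the following is a minimal sketch in Python; the linear schedule, the epoch threshold, and the function names are illustrative assumptions rather than the implementation of the embodiments.

```python
# Hedged sketch: the weight of the first (CTC) loss decays and the weight of the
# second (ACE) loss grows as training progresses. Schedule shape and threshold
# are assumptions for illustration only.
def dynamic_weights(epoch: int, total_epochs: int, first_threshold: int = 10):
    """Return (alpha, beta): the first and second dynamic weights for this epoch."""
    if epoch < first_threshold:
        return 1.0, 0.0                              # early training: CTC supervision dominates
    progress = (epoch - first_threshold) / max(1, total_epochs - first_threshold)
    alpha = max(0.0, 1.0 - progress)                 # first dynamic weight, gradually decreasing
    beta = min(1.0, progress)                        # second dynamic weight, gradually increasing
    return alpha, beta

def total_loss(ctc_loss, ace_loss, epoch, total_epochs):
    """Fuse the two losses with the dynamic weights to obtain the total loss."""
    alpha, beta = dynamic_weights(epoch, total_epochs)
    return alpha * ctc_loss + beta * ace_loss
```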
  • a sequence recognition algorithm for an image is applied extensively to scene text recognition, license plate recognition and other scenes.
  • the algorithm mainly includes extracting an image feature using a convolutional neural network, performing classification prediction on each slice feature, performing duplicate elimination in combination with a CTC loss function and supervising a predicted output, and is applicable to text recognition and license plate recognition tasks.
  • the token sequence usually has a relatively large sequence length, and the requirement on the accuracy of predicting the face value and type of each token is relatively high.
  • an embodiment of the application provides an object sequence recognition method.
  • a CTC loss is fused with an ACE loss, and corresponding weights are dynamically adjusted. Therefore, the supervision of a sequence length in an object sequence recognition network is strengthened, and an object sequence may be recognized accurately.
  • FIG. 3 is a structure diagram of an object sequence recognition network according to an embodiment of the application. The following descriptions will be made in combination with FIG. 3.
  • a framework of the object sequence recognition network includes an image input module 301, a feature extraction module 302, and a loss module.
  • the image input module 301 is configured to preprocess each sample image in a sample image set to obtain a processed sample image set.
  • preprocessing a frame of sample image mainly includes adjusting a size of the image with an aspect ratio kept unchanged, normalizing pixel values of the image, and other operations.
  • the operation of adjusting the size of the image with the aspect ratio kept unchanged refers to adjusting widths of multiple frames of sample images to be the same.
  • a data enhancement operation is performed on processed sample images. For example, random flipping, random clipping, random aspect ratio fine adjustment, random rotation and other operations are performed on the processed sample images. As such, the overall robustness of the network to be trained may be improved.
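  • A minimal sketch of such a data enhancement step, assuming torchvision transforms, is given below; the specific parameter values (crop size, rotation angle, aspect-ratio range) are illustrative assumptions only.

```python
# Hedged sketch of random flipping, random cropping, mild aspect-ratio jitter and
# small random rotation applied to a processed sample image (PIL image assumed).
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(size=(800, 80), scale=(0.9, 1.0), ratio=(0.09, 0.11)),
    transforms.RandomRotation(degrees=3),
    transforms.ToTensor(),
])
```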
  • the feature extraction module 302 performs feature extraction on processed sample images to obtain a feature sequence 303.
  • high-layer features of the input sample images are extracted at first using a convolutional neural network part in the object sequence recognition network to be trained.
  • the convolutional neural network part is obtained by fine adjustment based on a network structure of a ResNet. For example, the last strides (2, 2) of convolutional layers 3 and 4 in the network structure of the ResNet are changed to (1, 2).
  • a feature map is not down-sampled in a height dimension, and is down-sampled in a width dimension to halve an original width. Therefore, feature information in the height dimension may be maximally retained.
  • a splitting operation is performed on the feature map, namely the feature map extracted by the convolutional neural network is split into a plurality of feature sequences to facilitate subsequent calculation of a classifier and a loss function.
  • average pooling is performed in a width direction of the feature map, and no changes are made in a height direction and a channel dimension.
  • a size of the feature map is 2,048*40*8 (the channel dimension is 2,048, the height dimension is 40, and the width dimension is 8), a 2,048*40*1 feature map is obtained by average pooling in the width direction, and the feature map is split in the height dimension to obtain 40 2,048-dimensional vectors, each of which corresponds to the feature of 1/40 of the region in the height direction in the original map; a code sketch of this pooling and splitting is given below.
  • the feature sequence is obtained by division according to a height dimension of the image 401.
  • each feature in the feature sequence corresponds to at most one token.
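  • The pooling and splitting described above may be sketched as follows, using the 2,048*40*8 example; PyTorch and the (batch, channel, height, width) tensor layout are assumptions for illustration.

```python
# Hedged sketch: average-pool the feature map over the width dimension, keep the
# height and channel dimensions, then split along the height into 40 vectors.
import torch

feature_map = torch.randn(1, 2048, 40, 8)        # (batch, channels, height, width)
pooled = feature_map.mean(dim=3, keepdim=True)   # average pooling over width -> (1, 2048, 40, 1)
sequence = pooled.squeeze(3).permute(0, 2, 1)    # -> (1, 40, 2048): 40 feature vectors of size 2,048
assert sequence.shape == (1, 40, 2048)
```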
  • the classifier adopts an n-class classifier, and performs token class prediction on the feature sequence to obtain a predicted probability for each feature in the feature sequence.
  • n is the total number of token classes.
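  • A minimal sketch of an n-class classifier applied to each feature in the feature sequence is given below; the feature dimension of 2,048 follows the example above, while the total class count (token classes plus a blank class) is an assumption.

```python
# Hedged sketch: a linear classifier produces per-slice class probabilities.
import torch
import torch.nn as nn

n_classes = 6                           # illustrative: n token classes plus a blank class
classifier = nn.Linear(2048, n_classes)
sequence = torch.randn(1, 40, 2048)     # feature sequence from the splitting step
logits = classifier(sequence)           # (1, 40, n_classes)
probs = logits.softmax(dim=-1)          # predicted probability of each class for every slice
```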
  • the loss module adopts a dynamic adjustment manner for predicted probabilities of all classes in the feature sequence, combines a CTC loss 304 and an ACE loss 305, and simultaneously supervises a prediction result.
  • the CTC loss 304 and the ACE loss 305 are combined into a total loss 306 in the object sequence recognition network to be trained, which may be represented as L_total = α·L_CTC + β·L_ACE, where α is the first dynamic weight and β is the second dynamic weight.
  • since the CTC loss has a relatively good supervision effect on the overall prediction condition of the sequence during early training, and the ACE loss additionally supervises the prediction of the number of each class in the sequence, i.e., the sequence length, during later training, the weights of the two loss functions are dynamically adjusted to improve the overall prediction effect: the first dynamic weight α gradually decreases with the training process, and the second dynamic weight β gradually increases with the training process.
  • the prediction result of the sequence length may be improved, and meanwhile, the class recognition accuracy may be improved to finally improve the overall recognition result, particularly in a scene with a long token sequence.
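  • For illustration, one training iteration combining the two supervisions may be sketched as follows; compute_ctc_loss and compute_ace_loss are hypothetical helpers standing in for the CTC loss 304 and the ACE loss 305, and the sketch is not asserted to be the exact implementation of the embodiments.

```python
# Hedged sketch of a single training step: forward pass, dynamically weighted
# total loss, back-propagation, and parameter update.
def train_step(network, optimizer, images, labels, class_counts, alpha, beta,
               compute_ctc_loss, compute_ace_loss):
    log_probs = network(images)                        # (T, N, C) per-slice log-probabilities
    loss = alpha * compute_ctc_loss(log_probs, labels) \
         + beta * compute_ace_loss(log_probs, class_counts)
    optimizer.zero_grad()
    loss.backward()                                    # back-propagate the total loss
    optimizer.step()                                   # adjust the network parameters
    return loss.item()
```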
  • FIG. 5A is a structure composition diagram of an object sequence recognition apparatus according to an embodiment of the application.
  • the object sequence recognition apparatus 500 includes a first acquisition module 501, a first extraction module 502, and a first prediction module 503.
  • the first acquisition module 501 is configured to acquire a first image including an object sequence.
  • the first extraction module 502 is configured to input the first image to an object sequence recognition network and perform feature extraction to obtain a feature sequence, supervision information in a training process of the object sequence recognition network at least including class supervision information of each sample object in a sample object sequence and sequence length supervision information of the sample object of each class in the sample object sequence.
  • the first prediction module 503 is configured to predict a class of the object sequence based on the feature sequence using a classifier of the object sequence recognition network to obtain class information of the object sequence.
  • the first extraction module 502 includes a first extraction submodule and a first splitting submodule.
  • the first extraction submodule is configured to perform feature extraction on the first image using a convolutional subnetwork in the object sequence recognition network to obtain a feature map.
  • the first splitting submodule is configured to split the feature map to obtain the feature sequence.
  • the first extraction submodule includes a first downsampling unit, a first extraction unit, and a first determination unit.
  • the first down-sampling unit is configured to down-sample the first image using the convolutional subnetwork in a length dimension of the first image in a first direction to obtain a first-dimensional feature, the first direction being different from an arrangement direction of objects in the object sequence.
  • the first extraction unit is configured to extract a feature in a length dimension of the first image in a second direction based on a length of the first image in the second direction to obtain a second-dimensional feature.
  • the first determination unit is configured to obtain the feature map based on the first-dimensional feature and the second-dimensional feature.
  • the first splitting submodule includes a first pooling unit and a first splitting unit.
  • the first pooling unit is configured to pool the feature map in the first direction to obtain a pooled feature map.
  • the first splitting unit is configured to split the pooled feature map in the second direction to obtain the feature sequence.
  • the first prediction module 503 includes a first prediction submodule, a first determination submodule, a second determination submodule, and a third determination submodule.
  • the first prediction submodule is configured to predict a class corresponding to each feature in the feature sequence using the classifier of the object sequence recognition network.
  • the first determination submodule is configured to determine a class of each object in the object sequence based on a prediction result of the class corresponding to each feature in the feature sequence.
  • the second determination submodule is configured to determine a sequence length of target features of objects belonging to the same class in the feature sequence.
  • the third determination submodule is configured to obtain the class information of the object sequence based on the class of each object in the object sequence and a sequence length of target features corresponding to the objects of each class.
  • FIG. 5B is a structure composition diagram of an apparatus for training an object sequence recognition network according to an embodiment of the application.
  • the apparatus 510 for training an object sequence recognition network includes a second acquisition module 511, a second extraction module 512, a second prediction module 513, a first determination module 514, and a first adjustment module 515.
  • the second acquisition module 511 is configured to acquire a sample image, the sample image including a sample object sequence and class labeling information of the sample object sequence.
  • the second extraction module 512 is configured to input the sample image to an object sequence recognition network to be trained and perform feature extraction to obtain a sample feature sequence.
  • the second prediction module 513 is configured to perform class prediction on sample objects in the sample object sequence based on the sample feature sequence using a classifier of the object sequence recognition network to be trained to obtain a class prediction result of the sample object sequence, the class prediction result of the sample object sequence including class prediction information of each sample object in the sample object sequence.
  • the first determination module 514 is configured to determine a first loss and a second loss based on the class prediction result of the sample object sequence, the first loss being configured to supervise the class prediction result of the sample object sequence based on the class labeling information of the sample object sequence, and the second loss being configured to supervise the number of sample objects of each class in the sample object sequence based on the class labeling information of the sample object sequence.
  • the first adjustment module 515 is configured to adjust a network parameter of the object sequence recognition network to be trained according to the first loss and the second loss such that a loss of a classification result output by an adjusted object sequence recognition network satisfies a convergence condition.
  • the second extraction module 512 includes a second extraction submodule and a second splitting submodule.
  • the second extraction submodule is configured to perform feature extraction on the sample image using a convolutional subnetwork in the object sequence recognition network to be trained to obtain a sample feature map of the sample image.
  • the second splitting submodule is configured to split the sample feature map to obtain the sample feature sequence.
  • the second extraction submodule includes a second down-sampling unit, a second extraction unit, and a second determination unit.
  • the second down-sampling unit is configured to down-sample the sample image using the convolutional subnetwork in a length dimension of the sample image in a first direction to obtain a first-dimensional sample feature, the first direction being different from an arrangement direction of the sample object sequence in the sample sequence.
  • the second extraction unit is configured to extract a feature in a length dimension of the sample image in a second direction based on a length of the sample image in the second direction to obtain a second-dimensional sample feature.
  • the second determination unit is configured to obtain the sample feature map of the sample image based on the first-dimensional sample feature and the second- dimensional sample feature.
  • the second splitting submodule includes a second pooling unit and a second splitting unit.
  • the second pooling unit is configured to pool the sample feature map in the first direction to obtain a pooled sample feature map.
  • the second splitting unit is configured to split the pooled sample feature map in the second direction to obtain the sample feature sequence.
  • the first adjustment module 515 includes a first fusion submodule and a first adjustment submodule.
  • the first fusion submodule is configured to perform weighted fusion on the first loss and the second loss to obtain a total loss.
  • the first adjustment submodule is configured to adjust the network parameter of the object sequence recognition network to be trained according to the total loss such that the loss of the classification result output by the adjusted object sequence recognition network satisfies the convergence condition.
  • the first fusion submodule includes a first assignment unit, a second assignment unit, and a first fusion unit.
  • the first assignment unit is configured to assign a first dynamic weight to the first loss to obtain a first dynamic loss, the first dynamic weight gradually decreasing with the increase of a training count and/or training time of the object sequence recognition network to be trained when the training count reaches a first threshold or the training time reaches first time.
  • the second assignment unit is configured to assign a second dynamic weight to the second loss to obtain a second dynamic loss, the second dynamic weight gradually increasing with the increase of the training count and/or training time of the object sequence recognition network to be trained when the training count reaches a second threshold or the training time reaches second time.
  • the first fusion unit is configured to fuse the first dynamic loss and the second dynamic loss to obtain the total loss.
  • the object sequence recognition method may also be stored in a computer-readable storage medium when implemented in the form of a software function module and sold or used as an independent product.
  • the technical solutions of the embodiments of the application substantially or parts making contributions to the conventional art may be embodied in form of software product, and the computer software product is stored in a storage medium, including a plurality of instructions configured to enable a computer device (which may be a terminal, a server, etc.) to execute all or part of the method in each embodiment of the application.
  • the storage medium includes various media capable of storing program codes, such as a USB flash disk, a mobile hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Therefore, the embodiments of the application are not limited to any specific hardware and software combination.
  • An embodiment of the application also provides a computer program product including a computer-executable instruction which may be executed to implement the object sequence recognition method provided in the embodiments of the application.
  • An embodiment of the application also provides a computer storage medium having stored therein a computer-executable instruction which is executed by a processor to implement the object sequence recognition method provided in the abovementioned embodiments.
  • FIG. 6 is a composition structure diagram of a computer device according to an embodiment of the application.
  • the computer device 600 includes a processor 601, at least one communication bus, a communication interface 602, at least one external communication interface, and a memory 603.
  • the communication interface 602 is configured to implement connections and communications between these components.
  • the communication interface 602 may include a display screen.
  • the external communication interface may include a standard wired interface and wireless interface.
  • the processor 601 is configured to execute an image processing program in the memory to implement the object sequence recognition method provided in the abovementioned embodiments.
  • the units described as separate parts may or may not be physically separated, and parts displayed as units may or may not be physical units; namely, they may be located in the same place, or may also be distributed to multiple network units. Part or all of the units may be selected according to a practical requirement to achieve the purposes of the solutions of the embodiments.
  • each function unit in each embodiment of the application may be integrated into a processing unit; each unit may also serve as an independent unit, and two or more units may also be integrated into one unit.
  • the integrated unit may be implemented in a hardware form, and may also be implemented in the form of a hardware and software function unit.
  • the storage medium includes various media capable of storing program codes such as a mobile storage device, a ROM, a magnetic disk, or an optical disc.
  • the integrated unit of the application may also be stored in a computer-readable storage medium when implemented in the form of a software function module and sold or used as an independent product.
  • the technical solutions of the embodiments of the application substantially or parts making contributions to the conventional art may be embodied in form of software product, and the computer software product is stored in a storage medium, including a plurality of instructions configured to enable a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the method in each embodiment of the application.
  • the storage medium includes various media capable of storing program codes such as a mobile hard disk, a ROM, a magnetic disk, or an optical disc.

Abstract

Provided are an object sequence recognition method, a network training method, apparatuses, a device, and a medium. The method includes that: a first image including an object sequence is acquired; the first image is input to an object sequence recognition network, and feature extraction is performed to obtain a feature sequence, supervision information in a training process of the object sequence recognition network at least including class supervision information of each sample object in a sample object sequence and sequence length supervision information of the sample object of each class in the sample object sequence; and a class of the object sequence is predicted based on the feature sequence using a classifier of the object sequence recognition network to obtain class information of the object sequence.

Description

OBJECT SEQUENCE RECOGNITION METHOD, NETWORK TRAINING METHOD, APPARATUSES, DEVICE, AND MEDIUM
CROSS-REFERENCE TO RELATED APPLICATION(S)
[ 0001] The application claims priority to Singapore patent application No. 10202110495U filed with IPOS on 22 September 2021, the content of which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
[ 0002] Embodiments of the application relate to the technical field of image processing, and relate, but are not limited, to an object sequence recognition method, a network training method, apparatuses, a device, and a medium.
BACKGROUND
[ 0003] Sequence recognition on an image is an important research subject in computer vision. A sequence recognition algorithm is widely applied to scene text recognition, license plate recognition and other scenes. In the related art, a neural network is used to recognize an image of sequential objects. The neural network may be obtained by training taking classes of objects in sequential objects as supervision information.
[ 0004] In some scenes, object sequences are relatively long, and requirements on the accuracy of recognizing these objects are relatively high, so it is unlikely to achieve satisfactory sequence recognition effects by a sequence recognition method in the related art.
SUMMARY
[ 0005] The embodiments of the application provide technical solutions to the recognition of an object sequence.
[ 0006] The technical solutions of the embodiments of the application are implemented as follows.
[ 0007] An embodiment of the application provides an object sequence recognition method, which may include the following operations.
[ 0008] A first image including an object sequence is acquired.
[ 0009] The first image is input to an object sequence recognition network, and feature extraction is performed to obtain a feature sequence, supervision information in a training process of the object sequence recognition network at least including class supervision information of each sample object in a sample object sequence and sequence length supervision information of the sample object of each class in the sample object sequence.
[ 0010] A class of the object sequence is predicted based on the feature sequence using a classifier of the object sequence recognition network to obtain class information of the object sequence.
[ 0011] In some embodiments, the operation that the first image is input to an object sequence recognition network and feature extraction is performed to obtain a feature sequence may include the following operations. Feature extraction is performed on the first image using a convolutional subnetwork in the object sequence recognition network to obtain a feature map. The feature map is split to obtain the feature sequence. As such, it is easy to subsequently recognize object classes in the feature sequence more accurately.
[ 0012] In some embodiments, the operation that feature extraction is performed on the first image using a convolutional subnetwork in the object sequence recognition network to obtain a feature map may include the following operations. The first image is down- sampled using the convolutional subnetwork in a length dimension of the first image in a first direction to obtain a first-dimensional feature, the first direction being different from an arrangement direction of objects in the object sequence. A feature in a length dimension of the first image in a second direction is extracted based on a length of the first image in the second direction to obtain a second-dimensional feature. The feature map is obtained based on the first-dimensional feature and the second-dimensional feature. As such, feature information of the first image in the dimension in the second direction may be maximally retained.
[ 0013] In some embodiments, the operation that the feature map is split to obtain the feature sequence may include the following operations. The feature map is pooled in the first direction to obtain a pooled feature map. The pooled feature map is split in the second direction to obtain the feature sequence. Accordingly, the feature map is split in the second direction after being pooled in the first direction, so that the feature sequence may include more detail information of the first image in the second direction.
[ 0014] In some embodiments, the operation that a class of the object sequence is predicted based on the feature sequence using a classifier of the object sequence recognition network to obtain class information of the object sequence may include the following operations. A class corresponding to each feature in the feature sequence is predicted using the classifier of the object sequence recognition network. A class of each object in the object sequence is determined based on a prediction result of the class corresponding to each feature in the feature sequence. A sequence length of target features of objects belonging to the same class is determined in the feature sequence. The class information of the object sequence is obtained based on the class of each object in the object sequence and a sequence length of target features corresponding to the objects of each class. Accordingly, a classification result of the feature sequence is processed using a post-processing rule of a Connectionist Temporal Classification (CTC) loss function, so that the predicted class of each object and the length of the object sequence may be more accurate.
[ 0015] An embodiment of the application provides a method for training an object sequence recognition network, which may include the following operations. A sample image is acquired, the sample image including a sample object sequence and class labeling information of the sample object sequence. The sample image is input to an object sequence recognition network to be trained, and feature extraction is performed to obtain a sample feature sequence. Class prediction is performed on sample objects in the sample object sequence based on the sample feature sequence using a classifier of the object sequence recognition network to be trained to obtain a class prediction result of the sample object sequence, the class prediction result of the sample object sequence including class prediction information of each sample object in the sample object sequence. A first loss and a second loss are determined based on the class prediction result of the sample object sequence, the first loss being configured to supervise the class prediction result of the sample object sequence based on the class labeling information of the sample object sequence, and the second loss being configured to supervise the number of sample objects of each class in the sample object sequence based on the class labeling information of the sample object sequence. A network parameter of the object sequence recognition network to be trained is adjusted according to the first loss and the second loss such that a loss of a classification result output by an adjusted object sequence recognition network satisfies a convergence condition. Accordingly, the first loss for supervising the whole sequence and the second loss for supervising the number of each class in the sequence are introduced, so that an overall class prediction effect of the network may be improved.
[ 0016] In some embodiments, the operation that the sample image is input to an object sequence recognition network to be trained and feature extraction is performed to obtain a sample feature sequence may include the following operations. Feature extraction is performed on the sample image using a convolutional subnetwork in the object sequence recognition network to be trained to obtain a sample feature map of the sample image. The sample feature map is split to obtain the sample feature sequence. As such, the obtained sample feature sequence may retain more features in the second direction, and the training accuracy of the network may be improved.
[ 0017] In some embodiments, the operation that feature extraction is performed on the sample image using a convolutional subnetwork in the object sequence recognition network to be trained to obtain a sample feature map of the sample image may include the following operations. The sample image is down-sampled using the convolutional subnetwork in a length dimension of the sample image in a first direction to obtain a first-dimensional sample feature, the first direction being different from an arrangement direction of the sample object sequence in the sample sequence. A feature in a length dimension of the sample image in a second direction is extracted based on a length of the sample image in the second direction to obtain a second-dimensional sample feature. The sample feature map of the sample image is obtained based on the first-dimensional sample feature and the second-dimensional sample feature. As such, feature information in a dimension of each sample image in the second direction may be maximally retained.
[ 0018] In some embodiments, the operation that the sample feature map is split to obtain the sample feature sequence may include the following operations. The sample feature map is pooled in the first direction to obtain a pooled sample feature map. The pooled sample feature map is split in the second direction to obtain the sample feature sequence. Accordingly, the sample feature map is split in the dimension in the second direction after being pooled in the dimension in the first direction, so that the sample feature sequence may retain more detailed information of the sample image in the dimension in the second direction.
[ 0019] In some embodiments, the operation that a network parameter of the object sequence recognition network to be trained is adjusted according to the first loss and the second loss such that a loss of a classification result output by an adjusted object sequence recognition network satisfies a convergence condition may include the following operations. Weighted fusion is performed on the first loss and the second loss to obtain a total loss. The network parameter of the object sequence recognition network to be trained is adjusted according to the total loss such that the loss of the classification result output by the adjusted object sequence recognition network satisfies the convergence condition. Accordingly, two loss functions are fused as the total loss, and the network is trained using the total loss, so that the object recognition performance of the network may be improved.
[ 0020] In some embodiments, the operation that weighted fusion is performed on the first loss and the second loss to obtain a total loss may include the following operations.
[ 0021] A first dynamic weight is assigned to the first loss to obtain a first dynamic loss, the first dynamic weight gradually decreasing with the increase of a training count and/or training time of the object sequence recognition network to be trained when the training count reaches a first threshold or the training time reaches first time.
[ 0022] A second dynamic weight is assigned to the second loss to obtain a second dynamic loss, the second dynamic weight gradually increasing with the increase of the training count and/or training time of the object sequence recognition network to be trained when the training count reaches a second threshold or the training time reaches second time. The first dynamic loss and the second dynamic loss are fused to obtain the total loss. Accordingly, the weights of the two loss functions are dynamically adjusted, so that a prediction effect of the whole network may be improved, and furthermore, an object sequence recognition network with relatively high performance may be obtained.
[ 0023] An embodiment of the application provides an object sequence recognition apparatus, which may include a first acquisition module, a first extraction module, and a first prediction module.
[ 0024] The first acquisition module may be configured to acquire a first image including an object sequence.
[ 0025] The first extraction module may be configured to input the first image to an object sequence recognition network and perform feature extraction to obtain a feature sequence, supervision information in a training process of the object sequence recognition network at least including class supervision information of each sample object in a sample object sequence and sequence length supervision information of the sample object of each class in the sample object sequence.
[ 0026] The first prediction module may be configured to predict a class of the object sequence based on the feature sequence using a classifier of the object sequence recognition network to obtain class information of the object sequence.
[ 0027] In some embodiments, the first extraction module may include a first extraction submodule and a first splitting submodule.
[ 0028] The first extraction submodule may be configured to perform feature extraction on the first image using a convolutional subnetwork in the object sequence recognition network to obtain a feature map.
[ 0029] The first splitting submodule may be configured to split the feature map to obtain the feature sequence.
[ 0030] In some embodiments, the first extraction submodule may include a first down-sampling unit, a first extraction unit, and a first determination unit.
[ 0031] The first down-sampling unit may be configured to down-sample the first image using the convolutional subnetwork in a length dimension of the first image in a first direction to obtain a first-dimensional feature, the first direction being different from an arrangement direction of objects in the object sequence.
[ 0032] The first extraction unit may be configured to extract a feature in a length dimension of the first image in a second direction based on a length of the first image in the second direction to obtain a second-dimensional feature.
[ 0033] The first determination unit may be configured to obtain the feature map based on the first-dimensional feature and the second-dimensional feature.
[ 0034] In some embodiments, the first splitting submodule may include a first pooling unit and a first splitting unit.
[ 0035] The first pooling unit may be configured to pool the feature map in the first direction to obtain a pooled feature map.
[ 0036] The first splitting unit may be configured to split the pooled feature map in the second direction to obtain the feature sequence.
[ 0037] In some embodiments, the first prediction module may include a first prediction submodule, a first determination submodule, a second determination submodule, and a third determination submodule.
[ 0038] The first prediction submodule may be configured to predict a class corresponding to each feature in the feature sequence using the classifier of the object sequence recognition network.
[ 0039] The first determination submodule may be configured to determine a class of each object in the object sequence based on a prediction result of the class corresponding to each feature in the feature sequence.
[ 0040] The second determination submodule may be configured to determine a sequence length of target features of objects belonging to the same class in the feature sequence.
[ 0041] The third determination submodule may be configured to obtain the class information of the object sequence based on the class of each object in the object sequence and a sequence length of target features corresponding to the objects of each class.
[ 0042] An embodiment of the application provides an apparatus for training an object sequence recognition network, which may include a second acquisition module, a second extraction module, a second prediction module, a first determination module, and a first adjustment module.
[ 0043] The second acquisition module may be configured to acquire a sample image, the sample image including a sample object sequence and class labeling information of the sample object sequence.
[ 0044] The second extraction module may be configured to input the sample image to an object sequence recognition network to be trained and perform feature extraction to obtain a sample feature sequence.
[ 0045] The second prediction module may be configured to perform class prediction on sample objects in the sample object sequence based on the sample feature sequence using a classifier of the object sequence recognition network to be trained to obtain a class prediction result of the sample object sequence, the class prediction result of the sample object sequence including class prediction information of each sample object in the sample object sequence.
[ 0046] The first determination module may be configured to determine a first loss and a second loss based on the class prediction result of the sample object sequence, the first loss being configured to supervise the class prediction result of the sample object sequence based on the class labeling information of the sample object sequence, and the second loss being configured to supervise the number of sample objects of each class in the sample object sequence based on the class labeling information of the sample object sequence.
[ 0047] The first adjustment module may be configured to adjust a network parameter of the object sequence recognition network to be trained according to the first loss and the second loss such that a loss of a classification result output by an adjusted object sequence recognition network satisfies a convergence condition.
[ 0048] In some embodiments, the second extraction module may include a second extraction submodule and a second splitting submodule.
[ 0049] The second extraction submodule may be configured to perform feature extraction on the sample image using a convolutional subnetwork in the object sequence recognition network to be trained to obtain a sample feature map of the sample image.
[ 0050] The second splitting submodule may be configured to split the sample feature map to obtain the sample feature sequence.
[ 0051] In some embodiments, the second extraction submodule may include a second down-sampling unit, a second extraction unit, and a second determination unit.
[ 0052] The second down-sampling unit may be configured to down-sample the sample image using the convolutional subnetwork in a length dimension of the sample image in a first direction to obtain a first-dimensional sample feature, the first direction being different from an arrangement direction of the sample object sequence in the sample sequence.
[ 0053] The second extraction unit may be configured to extract a feature in a length dimension of the sample image in a second direction based on a length of the sample image in the second direction to obtain a second-dimensional sample feature.
[ 0054] The second determination unit may be configured to obtain the sample feature map of the sample image based on the first-dimensional sample feature and the second-dimensional sample feature.
[ 0055] In some embodiments, the second splitting submodule may include a second pooling unit and a second splitting unit.
[ 0056] The second pooling unit may be configured to pool the sample feature map in the first direction to obtain a pooled sample feature map.
[ 0057] The second splitting unit may be configured to split the pooled sample feature map in the second direction to obtain the sample feature sequence.
[ 0058] In some embodiments, the first adjustment module may include a first fusion submodule and a first adjustment submodule.
[ 0059] The first fusion submodule may be configured to perform weighted fusion on the first loss and the second loss to obtain a total loss.
[ 0060] The first adjustment submodule may be configured to adjust the network parameter of the object sequence recognition network to be trained according to the total loss such that the loss of the classification result output by the adjusted object sequence recognition network satisfies the convergence condition.
[ 0061] In some embodiments, the first fusion submodule may include a first assignment unit, a second assignment unit, and a first fusion unit.
[ 0062] The first assignment unit may be configured to assign a first dynamic weight to the first loss to obtain a first dynamic loss, the first dynamic weight gradually decreasing with the increase of a training count and/or training time of the object sequence recognition network to be trained when the training count reaches a first threshold or the training time reaches first time.
[ 0063] The second assignment unit may be configured to assign a second dynamic weight to the second loss to obtain a second dynamic loss, the second dynamic weight gradually increasing with the increase of the training count and/or training time of the object sequence recognition network to be trained when the training count reaches a second threshold or the training time reaches second time.
[ 0064] The first fusion unit may be configured to fuse the first dynamic loss and the second dynamic loss to obtain the total loss.
[ 0065] Correspondingly, an embodiment of the application provides a computer storage medium, in which a computer-executable instruction may be stored. The computer-executable instruction may be executed to implement the abovementioned object sequence recognition method. Alternatively, the computer-executable instruction may be executed to implement the abovementioned method for training an object sequence recognition network.
[ 0066] An embodiment of the application provides a computer device, which may include a memory and a processor. A computer-executable instruction may be stored in the memory. The processor may run the computer-executable instruction in the memory to implement the abovementioned object sequence recognition method. Alternatively, the processor may run the computer-executable instruction in the memory to implement the abovementioned method for training an object sequence recognition network.
[ 0067] According to the object sequence recognition method, apparatus, device and storage medium provided in the embodiments of the application, feature extraction is performed on the first image at first to obtain the feature sequence. Then, class prediction is performed on the object sequence in the feature sequence to obtain a relatively accurate classification result of the object sequence in the feature sequence. Finally, the classification result of the object sequence in the feature sequence is further processed to determine class information of multiple object sequences. As such, the accuracy of recognizing the object sequence in the feature sequence may still be improved even though the feature sequence of the object is relatively long.
BRIEF DESCRIPTION OF THE DRAWINGS
[ 0068] FIG. 1 is an implementation flowchart of an object sequence recognition method according to an embodiment of the application.
[ 0069] FIG. 2A is another implementation flowchart of an object sequence recognition method according to an embodiment of the application.
[ 0070] FIG. 2B is an implementation flowchart of a method for training an object sequence recognition network according to an embodiment of the application.
[ 0071] FIG. 3 is a structure diagram of an object sequence recognition network according to an embodiment of the application.
[ 0072] FIG. 4 is a schematic diagram of an application scene of an object sequence recognition network according to an embodiment of the application.
[ 0073] FIG. 5A is a structure composition diagram of an object sequence recognition apparatus according to an embodiment of the application.
[ 0074] FIG. 5B is another structure composition diagram of an object sequence recognition apparatus according to an embodiment of the application.
[ 0075] FIG. 6 is a composition structure diagram of a computer device according to an embodiment of the application.
DETAILED DESCRIPTION
[ 0076] In order to make the purposes, technical solutions, and advantages of the embodiments of the application clearer, specific technical solutions of the disclosure will further be described below in combination with the drawings in the embodiments of the application in detail. The following embodiments are adopted to describe the application rather than limit the scope of the application.
[ 0077] "Some embodiments" involved in the following descriptions describes a subset of all possible embodiments. However, it can be understood that "some embodiments" may be the same subset or different subsets of all the possible embodiments, and may be combined without conflicts.
[ 0078] Term "first/second/third" involved in the following descriptions is only for distinguishing similar objects, and does not represent a specific sequence of the objects. It can be understood that "first/second/third" may be interchanged to specific sequences or orders if allowed to implement the embodiments of the application described herein in sequences except the illustrated or described ones.
[ 0079] Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the art of the application. Terms used in the application are only adopted to describe the embodiments of the application and not intended to limit the application.
[ 0080] Nouns and terms involved in the embodiments of the application will be described before the embodiments of the application are further described in detail. The nouns and terms involved in the embodiments of the application are suitable to be explained as follows.
[ 0081] 1) Aggregation Cross-Entropy (ACE): firstly, ACE does not minimize the loss function by maximizing the predicted probability at each position; instead, it only cares about the accumulated probability of each class, without considering the order within the sequence, so as to simplify the problem, and it only requires the network to predict the character count of each class accurately to minimize the loss function. Secondly, ACE may solve two-dimensional prediction problems.
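A minimal sketch of an ACE-style loss following the description above is given below; the tensor shapes, the normalization by the number of slices T, and the handling of the blank-class count are assumptions based on the general formulation of ACE rather than details stated in the embodiments.

```python
# Hedged sketch: only the aggregated (expected) count of each class is supervised,
# without considering the order of the sequence.
import torch

def ace_loss(log_probs: torch.Tensor, class_counts: torch.Tensor) -> torch.Tensor:
    """log_probs: (T, N, C) per-slice class log-probabilities.
    class_counts: (N, C) ground-truth count of each class in the sequence;
    the blank-class count is assumed to be T minus the number of labelled objects."""
    T = log_probs.shape[0]
    aggregated = log_probs.exp().sum(dim=0) / T        # mean predicted probability per class
    targets = class_counts / T                         # normalized ground-truth counts
    return -(targets * (aggregated + 1e-10).log()).sum(dim=1).mean()
```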
[ 0082] 2) Connectionist Temporal Classification (CTC) calculates a loss value, has the main advantage that unaligned data may be aligned automatically, and is mainly used for the training of sequential data that is not aligned in advance, e.g., voice recognition and Optical Character Recognition (OCR). In the embodiments of the application, a CTC loss may be used to supervise an overall prediction condition of a sequence during the early training of a network.
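A minimal sketch of supervising per-slice predictions with a CTC loss (here PyTorch's nn.CTCLoss) is given below; the number of slices, the batch size, the class count, the label length, and the use of index 0 as the blank class are illustrative assumptions.

```python
# Hedged sketch: CTC loss over 40 feature slices, batch of 2, 6 classes (incl. blank).
import torch
import torch.nn as nn

T, N, C, S = 40, 2, 6, 12                                # slices, batch, classes, label length
log_probs = torch.randn(T, N, C).log_softmax(dim=2)
targets = torch.randint(1, C, (N, S), dtype=torch.long)  # labelled object classes (no blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```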
[ 0083] An exemplary application of an object sequence recognition device provided in the embodiments of the application will be described below. The device provided in the embodiments of the application may be implemented as various types of user terminals with an image collection function, such as a notebook computer, a tablet computer, a desktop computer, a camera, and a mobile device (e.g., a personal digital assistant, a dedicated messaging device, and a portable game device), or may be implemented as a server. The exemplary application of the device implemented as the terminal or the server will be described below.
[ 0084] The method may be applied to a computer device. A function realized by the method may be realized by a processor in the computer device calling a program code. Of course, the program code may be stored in a computer storage medium. It can be seen that the computer device at least includes the processor and the storage medium.
[ 0085] An embodiment of the application provides an object sequence recognition method. As shown in FIG. 1, descriptions will be made in combination with the operations shown in FIG. 1.
[ 0086] In S101, a first image including an object sequence is acquired.
[ 0087] In some embodiments, the object sequence may be a sequence formed by sequentially arranging any objects. A specific object type is not specially limited. For example, the first image is an image collected in a game place, and the object sequence may be tokens in a game in the game place. Alternatively, the first image is an image collected in a scene in which planks of various materials or colors are stacked, and the object sequence may be a pile of stacked planks.
[ 0088] The first image is at least one frame of image. The at least one frame of image is an image of which both size information and a pixel value satisfy certain conditions and which is obtained by size adjustment and pixel value normalization.
[ 0089] In some possible implementation modes, an acquired second image is preprocessed to obtain the first image that may be input to an object sequence recognition network. That is, S101 may be implemented through the following S111 and S112 (not shown in the figure).
[ 0090] In S111, a second image including at least one object sequence is acquired.
[ 0091] Here, the second image may be an image including appearance information of the object sequence. The second image may be an image collected by any collection device, or may be an image acquired from the Internet or another device or any frame in a video. For example, the second image is a frame of image which is acquired from a network and of which picture content includes the object sequence. Alternatively, the second image is a video segment of which picture content includes the object sequence, etc.
[ 0092] In S112, an image parameter of the second image is preprocessed based on a preset image parameter to obtain the first image.
[ 0093] In some possible implementation modes, the preset image parameter includes an image width, a height, an image pixel value, etc. First, size information of the original image is adjusted according to a preset size to obtain an adjusted image. The preset size includes a preset width and a preset aspect ratio. For example, widths of multiple frames of original images are adjusted to the preset width in a unified manner. Then, pixel values of the adjusted image are normalized to obtain the first image. For example, for an original image of which a height is less than a preset height, an image region of which the height does not reach the preset height is filled with pixels, e.g., gray pixel values. As such, the size information is adjusted to make the aspect ratios of the obtained first images the same, and deformations generated when the image is processed may be reduced.
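A minimal sketch of such preprocessing is given below, assuming a three-channel image, OpenCV for resizing, and concrete values (a preset width of 80, a preset height of 800, a gray padding value of 127) that are illustrative assumptions only.

```python
# Hedged sketch: resize to the preset width keeping the aspect ratio, pad the
# missing height with gray pixels, then normalize pixel values to [0, 1].
import numpy as np
import cv2

def preprocess(image: np.ndarray, preset_w: int = 80, preset_h: int = 800) -> np.ndarray:
    h, w = image.shape[:2]
    scale = preset_w / w
    resized = cv2.resize(image, (preset_w, int(round(h * scale))))
    if resized.shape[0] < preset_h:                      # fill the missing region with gray pixels
        pad = np.full((preset_h - resized.shape[0], preset_w, 3), 127, dtype=resized.dtype)
        resized = np.concatenate([resized, pad], axis=0)
    else:
        resized = resized[:preset_h]
    return resized.astype(np.float32) / 255.0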
[ 0094] In S102, the first image is input to an object sequence recognition network, and feature extraction is performed to obtain a feature sequence.
[ 0095] In some embodiments, supervision information in a training process of the object sequence recognition network at least includes class supervision information of each sample object in a sample object sequence and sequence length supervision information of the sample object of each class in the sample object sequence. The first image is input to the object sequence recognition network, and feature extraction is performed on the first image using a convolutional neural network part in the object sequence recognition network to obtain a feature map. The feature map extracted by the convolutional neural network is then split in a certain manner into a plurality of feature sequences to facilitate subsequent classification of the object sequence in the first image. The feature sequence is a sequence formed by features in the feature map. In some possible implementation modes, the feature map may be split according to a height of the feature map to obtain the feature sequence. Each feature in the feature sequence may correspond to an object in the object sequence. Alternatively, multiple features in the feature sequence correspond to an object in the object sequence.
[ 0096] In S103, a class of the object sequence is predicted based on the feature sequence using a classifier of the object sequence recognition network to obtain class information of the object sequence.
[ 0097] In some embodiments, class prediction is performed on features in the feature sequence using the classifier to obtain a classification result of each feature. Class information of the at least one object sequence is determined based on the classification result of the feature sequence. A class of the feature in the feature sequence is predicted using the classifier in the object sequence recognition network, thereby obtaining a predicted probability of the class of the object sequence corresponding to the feature sequence.
[ 0098] In some embodiments, the class information includes a class of each object and a sequence length of objects of the same class in the object sequence. The classification result of the feature sequence may represent a probability that the feature in the feature sequence belongs to a class corresponding to each classification label. A class corresponding to a classification label, of which a probability value is greater than a certain threshold, in a group of probabilities corresponding to a feature sequence is determined as a class of an object corresponding to a feature in the feature sequence. Accordingly, class prediction may be performed on the feature in the feature sequence to obtain the class of each feature, the class of each feature being a class of an object corresponding to the feature. Therefore, feature sequences belonging to the same class are feature sequences corresponding to the same object, a class of features belonging to the same class is a class of an object corresponding to the features of this class, and furthermore, the class of each object in the object sequence may be obtained.
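A minimal sketch of this post-processing is given below: repeated per-slice predictions are collapsed, the blank class is dropped, and the number of remaining objects of each class is counted as the per-class sequence length; the use of index 0 as the blank class is an assumption.

```python
# Hedged sketch of CTC-style greedy decoding over per-slice class probabilities.
from collections import Counter
import torch

def decode(per_slice_probs: torch.Tensor, blank: int = 0):
    """per_slice_probs: (T, C) probabilities for each of the T feature slices."""
    ids = per_slice_probs.argmax(dim=-1).tolist()    # best class per slice
    objects = [c for i, c in enumerate(ids)
               if c != blank and (i == 0 or c != ids[i - 1])]
    return objects, Counter(objects)                 # object classes and per-class counts
```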
[ 0099] In the embodiment of the application, feature extraction is performed on the first image at first to obtain the feature sequence. Then, class prediction is performed on the object sequence in the feature sequence to obtain a relatively accurate classification result of the object. As such, the accuracy of recognizing the object sequence may still be improved even though the feature sequence of the object is relatively long.
[ 00100] In some embodiments, the feature extraction of the first image is implemented by a convolutional network obtained by finely adjusting the structure of a Residual Network (ResNet), thereby obtaining the feature sequence. That is, S102 may be implemented through the operations shown in FIG. 2. FIG. 2 is another implementation flowchart of an object sequence recognition method according to an embodiment of the application. The following descriptions will be made in combination with FIG. 2.
[ 00101] In S201, feature extraction is performed on the first image using a convolutional subnetwork in the object sequence recognition network to obtain a feature map.
[ 00102] In some embodiments, the object sequence recognition network is obtained by training based on a first loss for supervising a whole sample image and a second loss for supervising an object of each class in the sample image. Feature extraction is performed on the first image using a convolutional network part in the object sequence recognition network to obtain the feature map. The convolutional network part in the object sequence recognition network may be obtained by fine adjustment based on a network structure of a ResNet.
[ 00103] In some possible implementation modes, feature extraction is performed on the first image using a convolutional network obtained by stride adjustment in the object sequence recognition network, thereby obtaining a feature map of which a height is kept unchanged and a width is changed. That is, S201 may be implemented through the following S211 to S213 (not shown in the figure).
[ 00104] In S211, the first image is down-sampled using the convolutional subnetwork in a length dimension of the first image in a first direction to obtain a first-dimensional feature.
[ 00105] In some possible implementation modes, a network structure of an adjusted ResNet is taken as a convolutional network for the feature extraction of the first image. The first direction is different from an arrangement direction of objects in the object sequence. For example, if the object sequence is multiple objects arranged or stacked in a height direction, namely the arrangement direction of the objects in the object sequence is the height direction, the first direction may be a width direction of the object sequence. If the object sequence is multiple objects arranged in a horizontal direction, namely the arrangement direction of the objects in the object sequence is the horizontal direction, the first direction may be the height direction of the object sequence. For example, strides in the first direction in the last strides of convolutional layers 3 and 4 in the network structure of the ResNet are kept at 2 and unchanged. In this manner, down-sampling in the length dimension of the first image in the first direction is implemented, and a length of the obtained feature map in the first direction is changed to a half of a length of the first image in the first direction. For example, the object sequence is multiple objects stacked in a height direction. In such case, width strides in the last strides of convolutional layers 3 and 4 in the network structure of the ResNet are kept at 2 and unchanged. In this manner, down-sampling in a width dimension of the first image is implemented, and a width of the obtained feature map is changed to a half of a width of the first image.
[ 00106] In S212, a feature in a length dimension of the first image in a second direction is extracted based on a length of the first image in the second direction to obtain a second-dimensional feature.
[ 00107] In some possible implementation modes, the second direction is the same as the arrangement direction of the objects in the object sequence. Strides in the second direction in the last strides of convolutional layers 3 and 4 in the network structure of the ResNet are changed from 2 to 1. In this manner, down-sampling is not performed in the length dimension of the first image in the second direction, namely the length of the first image in the second direction is kept, and feature extraction is performed along the second direction of the first image to obtain a second-dimensional feature whose length is the same as the length of the first image in the second direction.
[ 00108] In a specific example, the arrangement direction of the object sequence is the height direction. Height strides in the last strides of convolutional layers 3 and 4 in the network structure of the ResNet are changed from 2 to 1. In this manner, down-sampling is not performed in a height dimension of the first image, namely the height of the first image is kept, and feature extraction is performed in the height direction of the first image to obtain a feature the same as the height of the first image.
[ 00109] In S213, the feature map is obtained based on the first-dimensional feature and the second-dimensional feature.
[ 00110] In some possible implementation modes, the first-dimensional feature is combined with the second-dimensional feature to form the feature map of the first image.
[ 00111] In S211 to S213, the first image is not down-sampled in the length dimension of the first image in the second direction, so that the dimension of the dimensional feature in the second direction is the same as that of the first image in the second direction, and the first image is down-sampled in the dimension in the first direction, which is different from the arrangement direction of the objects, so that the length of the dimensional feature in the first direction is changed to a half of the length of the first image in the first direction. As such, feature information of the first image in the dimension of the arrangement direction of the object sequence may be maximally retained. When the arrangement direction of the object sequence is the height direction, convolutional layers 3 and 4 of which the last strides are (2, 2) in the ResNet are changed to convolutional layers of which the strides are (1, 2), so that the first image is not down-sampled in the height dimension to make the dimension of the height-dimensional feature the same as the height of the first image, and the first image is down-sampled in the width dimension to change the width of the width-dimensional feature to a half of the width of the first image. As such, feature information of the first image in the height dimension may be maximally retained.
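By way of a non-limiting sketch, the stride adjustment described above may be expressed as follows, assuming that the backbone is torchvision's ResNet-50 and that "convolutional layers 3 and 4" correspond to its layer3 and layer4 stages; the input size of 320x256 is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet50(weights=None)
for stage in (backbone.layer3, backbone.layer4):
    stage[0].conv2.stride = (1, 2)           # height stride 2 -> 1, width stride kept at 2
    stage[0].downsample[0].stride = (1, 2)   # keep the shortcut branch consistent with the main path

stem_and_stages = nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4,
)
feature_map = stem_and_stages(torch.randn(1, 3, 320, 256))
print(feature_map.shape)  # torch.Size([1, 2048, 40, 8]): height 320/8, width 256/32
```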
[ 00112] In S202, the feature map is split to obtain the feature sequence.
[ 00113] In some embodiments, the feature map is split based on dimension information of the feature map to obtain the feature sequence. The dimension information of the feature map includes a dimension in a first direction and a dimension in a second direction (e.g., a width dimension and a height dimension). The feature map is processed differently based on the two dimensions to obtain the feature sequence. For example, the feature map is pooled at first in the dimension of the feature map in the first direction, and then a splitting operation is performed on the feature map in the dimension of the feature map in the second direction, thereby splitting the feature map into the feature sequence. In this manner, feature extraction is performed on the image using the object sequence recognition network obtained by training based on two loss functions, and the feature map is split according to the dimension information, so that the obtained feature sequence may retain more features in the second direction to make it easy to subsequently recognize the class of the object sequence in the feature sequence more accurately.
[ 00114] In some possible implementation modes, the feature map is pooled in the dimension in the first direction, and is split in the dimension in the second direction to obtain the feature sequence. That is, S202 may be implemented through S221 and S222 (not shown in the figure).
[ 00115] In S221, the feature map is pooled in the first direction to obtain a pooled feature map.
[ 00116] In some embodiments, average pooling is performed on the feature map in the dimension of the feature map in the first direction, and the dimension of the feature map in the second direction and a channel dimension are kept unchanged, to obtain the pooled feature map. For example, the arrangement direction of the objects in the object sequence is the height direction, and the feature map is pooled in the width dimension in the dimension information to obtain the pooled feature map. A dimension of a first feature map is 2,048*40*16 (the channel dimension is 2,048, the height dimension is 40, and the width dimension is 16), and a 2,048*40*1 pooled feature map is obtained by average pooling in the width dimension.
[ 00117] In S222, the pooled feature map is split in the second direction to obtain the feature sequence.
[ 00118] In some embodiments, the pooled feature map is split in the dimension of the feature map in the second direction to obtain the feature sequence. The number of vectors obtained by splitting the pooled feature map may be determined based on a length of the feature map in the second direction. For example, if the length of the feature map in the second direction is 60, the pooled feature map is split into 60 vectors. In a specific example, the arrangement direction of the objects in the object sequence is the height direction, and the pooled feature map is split based on the height dimension to obtain the feature sequence. If the pooled feature map is 2,048*40*1, the pooled feature map is split in the height dimension to obtain 40 2,048-dimensional vectors, of which each corresponds to a feature corresponding to 1/40 of an image region in the height direction in the original first image. Accordingly, the feature map is split in the second direction the same as the arrangement direction of the objects after being pooled in the first direction different from the arrangement direction of the objects, so that the feature sequence may include more detail information of the first image in the second direction.
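A minimal sketch of the pooling and splitting described above is given below, assuming a feature map of size 2,048*40*16 with the objects stacked along the height; the sizes are the illustrative values used in the example.

```python
import torch

feature_map = torch.randn(2048, 40, 16)                  # (channel, height, width)

# Average pooling in the width dimension (the first direction), keeping the
# height and channel dimensions unchanged: (2048, 40, 16) -> (2048, 40, 1).
pooled = feature_map.mean(dim=2, keepdim=True)

# Splitting in the height dimension (the second direction) into 40 vectors of
# 2,048 dimensions, each covering 1/40 of the image region in the height direction.
feature_sequence = [pooled[:, h, 0] for h in range(pooled.shape[1])]
print(len(feature_sequence), feature_sequence[0].shape)  # 40 torch.Size([2048])
```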
[ 00119] In some embodiments, the classification result of the feature sequence is further processed to predict a class of each object and a length of the object sequence. That is, S103 may be implemented through the following S141 to S144 (not shown in the figure).
[ 00120] In S141, a class corresponding to each feature in the feature sequence is predicted using the classifier of the object sequence recognition network.
[ 00121] In some embodiments, the feature sequence is input to the classifier to predict the class corresponding to each feature in the feature sequence. For example, if the total class number of the object sequence is n, the class of the feature in the feature sequence is predicted using a classifier with n class labels, thereby obtaining a predicted probability that the feature in the feature sequence corresponds to each class label in the n class labels.
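As a hedged illustration of the classifier described above, the following sketch applies a single linear layer with n class labels to each feature in the feature sequence; the feature dimension of 2,048 and n = 10 are illustrative assumptions.

```python
import torch
import torch.nn as nn

n = 10                                                 # total class number of the object sequence
classifier = nn.Linear(2048, n)

feature_sequence = torch.randn(40, 2048)               # 40 features obtained by splitting the feature map
probabilities = torch.softmax(classifier(feature_sequence), dim=-1)
predicted = probabilities.argmax(dim=-1)               # predicted class label for each feature
print(probabilities.shape, predicted.shape)            # torch.Size([40, 10]) torch.Size([40])
```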
[ 00122] In S142, a class of each object in the object sequence is determined based on a prediction result of the class corresponding to each feature in the feature sequence.
[ 00123] In some embodiments, after the feature map is split, the feature sequence includes multiple feature vectors of the image to be recognized in the dimension in the second direction. That is, each feature vector carries part of the features of the image to be recognized, and may include all features of one or more object sequences or only part of the features of an object sequence. As such, the classification results of the objects corresponding to the features in the feature sequence may be combined to accurately recognize the class of each object in the object sequence in the first image.
[ 00124] In S143, a sequence length of target features of objects belonging to the same class is determined in the feature sequence.
[ 00125] In some embodiments, a feature set of objects belonging to the same class is determined at first in the feature sequence, and then a sequence length of the sequence formed by these features is determined. In a specific example, the object sequence is tokens stacked in the height direction, and a token sequence length corresponding to features of tokens belonging to the same class is determined in the feature sequence. A class of a token includes a face value of the token, a pattern of the token, a game that the token is suitable for, etc. Since the sequence length of the target features of the objects of each class is not fixed in advance, the fixed-length feature sequence is converted into target features of a variable sequence length.
[ 00126] In S144, the class information of the object sequence is obtained based on the class of each object in the object sequence and a sequence length of target features corresponding to the objects of each class.
[ 00127] In some embodiments, the class of each object and the sequence length corresponding to the objects of each class are taken as class information of at least one object. Accordingly, the classification result of the feature sequence is processed using a post-processing rule of a CTC loss function, so that the predicted class of each object and the length of the object sequence may be more accurate.
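A hedged sketch of such CTC-style post-processing is given below: a per-step argmax over the class labels, collapse of consecutive repeated labels, removal of blanks, and a count of objects per class. The layout with class index 0 as the CTC blank is an assumption for illustration.

```python
import torch
from collections import Counter

def ctc_post_process(logits: torch.Tensor, blank: int = 0):
    """logits: (T, n + 1) scores per feature; returns decoded classes and per-class counts."""
    steps = logits.argmax(dim=-1).tolist()      # best label for each feature in the sequence
    decoded, prev = [], None
    for label in steps:
        if label != prev and label != blank:    # collapse repeats, drop blanks
            decoded.append(label)
        prev = label
    return decoded, Counter(decoded)            # object classes and sequence length per class

classes, per_class_length = ctc_post_process(torch.randn(40, 11))
print(classes, per_class_length)
```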
[ 00128] In some embodiments, the object sequence recognition network is configured to recognize the class of the object in the object sequence. The object sequence recognition network is obtained by training an object sequence recognition network to be trained. A training process of the object sequence recognition network to be trained may be implemented through the operations shown in FIG. 2B. FIG. 2B is an implementation flowchart of a method for training an object sequence recognition network according to an embodiment of the application. The following descriptions will be made in combination with FIG. 2B.
[ 00129] In S21, a sample image is acquired.
[ 00130] In some embodiments, the sample image includes a sample object sequence and class labeling information of the sample object sequence. The sample image may be multiple collected frames of labeled images of which pictures include sample objects, or may be a sample image obtained by preprocessing a collected image.
[ 00131] In S22, the sample image is input to an object sequence recognition network to be trained, and feature extraction is performed to obtain a sample feature sequence.
[ 00132] In some embodiments, a sample image set is preprocessed at first to make sizes of sample images in the sample image set the same. Then, feature extraction is performed on the processed sample images to obtain a sample feature sequence.
[ 00133] In some possible implementation modes, a collected sample original image is preprocessed, data enhancement is performed on a preprocessed image, and the preprocessed image and an enhanced image are combined as a sample image. That is, S21 may be implemented through the following process.
[ 00134] First, a sample original image of the labeled sample image is acquired.
[ 00135] Here, image collection may be performed on a scene with the sample objects using an image collection device to obtain the sample original image. The sample original image is multiple frames of images.
[ 00136] Then, an image parameter of the sample original image is preprocessed according to a preset image parameter to obtain an adjusted image.
[ 00137] Here, size information of the sample original image of which a picture includes the sample objects is adjusted according to a preset size, and a normalization operation is performed on pixel values of the adjusted image. The preset size is a preset width and a preset aspect ratio. Widths of multiple frames of sample original images are adjusted to the preset width according to the preset width in a unified manner. For a sample original image of which a height is less than a preset height, an image region of which a height does not reach the preset height is filled with pixels, e.g., gray pixel values. As such, the size information is adjusted to make the aspect ratios in the sizes of obtained multiple frames of adjusted images the same, and deformations generated when the multiple frames of adjusted images are processed may be reduced.
[ 00138] Next, data enhancement is performed on the adjusted image to obtain an enhanced image.
[ 00139] Here, data enhancement includes random flipping, random clipping, random aspect ratio fine adjustment, random rotation, and other operations. Therefore, random flipping, random clipping, random aspect ratio fine adjustment, random rotation and other operations may be performed on the multiple frames of adjusted images to obtain richer sample images.
[ 00140] Finally, the enhanced image and the adjusted image are taken as the sample image set.
[ 00141] Here, the sample image is any image in the sample image set. The adjusted image and the enhanced image, of which the sizes are unified, are combined as the sample image set. Therefore, sample images may be enriched, and the overall robustness of the network to be trained may be improved.
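By way of a non-limiting sketch, the data enhancement operations named above may be expressed with torchvision transforms as follows; the crop size, rotation range, and aspect-ratio jitter are illustrative assumptions rather than values given in the embodiment.

```python
import numpy as np
from PIL import Image
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                                  # random flipping
    T.RandomRotation(degrees=5),                                    # random rotation
    T.RandomResizedCrop(size=(640, 256), scale=(0.9, 1.0),
                        ratio=(0.35, 0.45)),                        # random clipping with aspect-ratio fine adjustment
])

adjusted_image = Image.fromarray(np.zeros((640, 256, 3), dtype=np.uint8))
enhanced_image = augment(adjusted_image)
print(enhanced_image.size)  # (256, 640): width x height of the enhanced image
```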
[ 00142] In S23, class prediction is performed on sample objects in the sample object sequence based on the sample feature sequence using a classifier of the object sequence recognition network to be trained to obtain a class prediction result of the sample object sequence.
[ 00143] In some embodiments, the sample feature sequence is input to the classifier of the object sequence recognition network to be trained, and class prediction is performed to obtain a class prediction result corresponding to each sample feature in the sample feature sequence.
[ 00144] In some possible implementation modes, all classes of the sample objects are analyzed, and classification labels of the classifier are set, so that a class prediction result corresponding to each sample feature sequence is predicted. That is, S23 may be implemented through the following process.
[ 00145] First, total classes of the sample objects in the sample image set are determined.
[ 00146] Here, all classes of the sample objects in a scene of the sample image are analyzed. For example, in a game scene, the sample object is a token, and classes of all tokens, i.e., total classes of the tokens, are determined.
[ 00147] Then, classification labels of the classifier of the object sequence recognition network to be trained are determined based on the total classes.
[ 00148] Here, the classification labels of the classifier are set according to the total classes of the sample objects, and then the classifier may predict the probabilities that the sample objects in the sample image belong to each class.
[ 00149] Finally, class prediction is performed on the sample objects in the sample feature sequence using the classifier with the classification labels to obtain the class prediction result of the sample feature sequence.
[ 00150] Here, a probability that an object in each sample feature sequence belongs to each class may be predicted using a classifier with multiple classification labels to obtain a class prediction result of the sample feature sequence. The class that the object in the sample feature sequence most probably belongs to may be determined based on the class prediction result. Accordingly, the total classes of the objects are analyzed, and the classification labels of the classifier are set, so that the classes of the objects in the sample feature sequence may be predicted more accurately.
[ 00151] In S24, a first loss and a second loss are determined based on the class prediction result of the sample object sequence.
[ 00152] In some embodiments, the first loss is configured to supervise the class prediction result of the sample object sequence based on the class labeling information of the sample object sequence, and the second loss is configured to supervise the number of sample objects of each class in the sample object sequence based on the class labeling information of the sample object sequence. In some possible implementation modes, a CTC loss is adopted as the first loss, and an ACE loss is adopted as the second loss. The classification result of each sample feature sequence output by the classifier and the truth value information of the sample objects in each sample feature sequence are taken as an input of the CTC loss, which supervises the prediction of each sample object and its class, so that the class of each sample object in the present frame of sample image may be predicted. The classification result of each sample feature sequence output by the classifier and the truth value information of the sample objects in each sample feature sequence are also taken as an input of the ACE loss, which supervises the prediction of the number of sample objects belonging to the same class, i.e., the sequence length corresponding to the sample objects of each class.
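A hedged sketch of the two losses is given below. torch.nn.CTCLoss is an existing PyTorch module; the ace_loss function is a compact re-implementation of the aggregation cross-entropy idea, and its details (blank at index 0, counts normalized by the sequence length T) are assumptions for illustration rather than the embodiment's exact formulation.

```python
import torch
import torch.nn.functional as F

def ace_loss(log_probs: torch.Tensor, class_counts: torch.Tensor) -> torch.Tensor:
    """log_probs: (T, n + 1) per-step log-probabilities; class_counts: (n + 1,) ground-truth counts."""
    T_steps = log_probs.shape[0]
    mean_probs = log_probs.exp().mean(dim=0)                 # aggregated prediction per class
    target = class_counts.float() / T_steps                  # normalized ground-truth counts (blank included)
    return -(target * torch.log(mean_probs + 1e-10)).sum()

T_steps, n = 40, 10
log_probs = F.log_softmax(torch.randn(T_steps, 1, n + 1), dim=-1)   # (T, batch, n + 1), index 0 = blank

labels = torch.tensor([[3, 3, 7, 1]])                                # hypothetical token classes in the sample
first_loss = torch.nn.CTCLoss(blank=0)(
    log_probs, labels,
    input_lengths=torch.tensor([T_steps]), target_lengths=torch.tensor([labels.shape[1]]))

counts = torch.zeros(n + 1)
counts[0] = T_steps - labels.shape[1]                                # remaining steps are counted as blank
for c in labels[0]:
    counts[c] += 1
second_loss = ace_loss(log_probs[:, 0, :], counts)
print(first_loss.item(), second_loss.item())
```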
[ 00153] In S25, a network parameter of the object sequence recognition network to be trained is adjusted according to the first loss and the second loss such that a loss of a classification result output by an adjusted object sequence recognition network satisfies a convergence condition.
[ 00154] In some embodiments, the class representing each sample object in the classification result may be compared with the truth value information of each sample object to determine the first loss. The sequence representing the sample objects belonging to the same class in the classification result may be compared with the truth value of the sequence length of the sample objects of each class to determine the second loss. The first loss and the second loss are combined to adjust the weight values of the object sequence recognition network to be trained and the adjustment amounts thereof, so that the losses of the classes of the sample objects and of the sequence length of the sample objects of the same class, which are output by the trained object sequence recognition network, converge.
[ 00155] Through S21 to S25, the first loss for supervising the whole sequence and the second loss for supervising the number of each class in the sequence are introduced to the object sequence recognition network to be trained, so that an overall class prediction effect of the network may be improved.
[ 00156] In some embodiments, the feature extraction of the sample image is implemented using a convolutional subnetwork in the object sequence recognition network to be trained, thereby obtaining the sample feature sequence. That is, S22 may be implemented through the following S231 and S232 (not shown in the figure).
[ 00157] In S231, feature extraction is performed on the sample image using a convolutional subnetwork in the object sequence recognition network to be trained to obtain a sample feature map of the sample image.
[ 00158] In some embodiments, feature extraction is performed on the sample image using a convolutional network obtained by finely adjusting the structure of a ResNet as the convolutional subnetwork of the recognition network to be trained to obtain the sample feature map.
[ 00159] In some possible implementation modes, feature extraction is performed on the sample image using a convolutional subnetwork obtained by stride adjustment in the object sequence recognition network, thereby obtaining a sample feature map of which a height is kept unchanged and a width is changed. That is, S231 may be implemented through the following operations (not shown in the figure).
[ 00160] In a first step, the sample image is down-sampled using the convolutional subnetwork in a length dimension of the sample image in a first direction to obtain a first-dimensional sample feature.
[ 00161] Here, an implementation process of the first step is similar to that of S211. That is, when an arrangement direction of the sample object sequence is a stacking height direction, the sample image is down-sampled in a width dimension of the sample image to obtain a first-dimensional sample feature. It is set that width strides in last strides of convolutional layers 3 and 4 of the convolutional subnetwork are kept at 2 and unchanged, and height strides are changed from 2 to 1.
[ 00162] In a second step, a feature in a length dimension of the sample image in a second direction is extracted based on a length of the sample image in the second direction to obtain a second-dimensional sample feature.
[ 00163] Here, an implementation process of the second step is similar to that of S212. That is, when the arrangement direction of the sample object sequence is a stacking height direction, feature extraction is performed based on a height of the sample image in a height dimension of the sample image to obtain a second-dimensional sample feature. For example, it is set that the height strides in the last strides of convolutional layers 3 and 4 of the convolutional subnetwork are changed from 2 to 1. In such case, down-sampling is not performed in the height dimension of the sample image, namely the height of the sample image is kept in the second-dimensional sample feature.
[ 00164] In a third step, the sample feature map of the sample image is obtained based on the first-dimensional sample feature and the second-dimensional sample feature.
[ 00165] Here, the first-dimensional sample feature and the second-dimensional sample feature are combined to form the sample feature map of the sample image.
[ 00166] Through the first step to the third step, when the arrangement direction of the sample object sequence is the stacking height direction, convolutional layers 3 and 4 of which last strides are (2, 2) in the ResNet are changed to convolutional layers of which strides are (1, 2) to form the convolutional subnetwork of the object sequence recognition network to be trained. Therefore, feature information of the sample image in the dimension in the arrangement direction may be maximally retained.
[ 00167] In S232, the sample feature map is split to obtain the sample feature sequence.
[ 00168] Here, an implementation process of S232 is similar to that of S202. That is, the sample feature map is processed differently based on the dimension in the first direction and the dimension in the second direction to obtain the sample feature sequence. For example, the sample feature map is pooled in the dimension in the first direction, and is split into multiple feature vectors in the dimension in the second direction to form the sample feature sequence. As such, the obtained sample feature sequence may retain more dimensional features of the sample object in the arrangement direction, and the training accuracy of the network may be improved.
[ 00169] In some possible implementation modes, the sample feature map is pooled in the dimension in the first direction, and is split in the dimension in the second direction to obtain the sample feature sequence. That is, S232 may be implemented through the following operations.
[ 00170] In a first step, the sample feature map is pooled in the first direction to obtain a pooled sample feature map.
[ 00171] Here, an implementation process of the first step is similar to that of S221. That is, average pooling is performed on the sample feature map in the dimension of the sample feature map in the first direction, and the dimension of the sample feature map in the second direction and a channel dimension are kept unchanged, to obtain the pooled sample feature map.
[ 00172] In a second step, the pooled sample feature map is split in the second direction to obtain the sample feature sequence.
[ 00173] Here, an implementation process of the second step is similar to that of S222. That is, the pooled sample feature map is split in the dimension of the sample feature map in the second direction to obtain the sample feature sequence. For example, if the dimension of the sample feature map in the second direction is 40, the pooled sample feature map is split into 40 vectors to form the sample feature sequence. Accordingly, the sample feature map is split in the dimension in the second direction after being pooled in the dimension in the first direction, so that the sample feature sequence may retain more detailed information of the sample image in the dimension in the second direction.
[ 00174] In some embodiments, dynamic weighted fusion is performed on the first loss and the second loss to improve the object sequence recognition performance of the object sequence recognition network to be trained. That is, S25 may be implemented through the following S251 and S252.
[ 00175] In S251, weighted fusion is performed on the first loss and the second loss to obtain a total loss.
[ 00176] In some embodiments, the first loss and the second loss are weighted using different dynamic weights, and a first loss and second loss which are obtained by weighted adjustment are fused to obtain the total loss.
[ 00177] In some possible implementation modes, dynamic adjustment parameters are set for the first loss and the second loss to obtain the total loss. That is, S251 may be implemented through the following process.
[ 00178] First, a first dynamic weight is assigned to the first loss to obtain a first dynamic loss.
[ 00179] In some embodiments, the first dynamic weight gradually decreases with the increase of a training count and/or training time of the object sequence recognition network to be trained when the training count reaches a first threshold or the training time reaches a first time. That is, the first dynamic weight gradually decreases as the training of the object sequence recognition network to be trained proceeds. Accordingly, in the training process of the object sequence recognition network to be trained, sequences belonging to the same object in the feature sequences are supervised based on the classification results of the feature sequences output by the classifier using the CTC loss as the first loss during early training. Therefore, the CTC loss has relatively high performance during the early training of the network.
[ 00180] Then, a second dynamic weight is assigned to the second loss to obtain a second dynamic loss.
[ 00181] In some embodiments, the second dynamic weight gradually increases with the increase of the training count and/or training time of the object sequence recognition network to be trained when the training count reaches a second threshold or the training time reaches a second time. That is, the second dynamic weight gradually increases as the training of the object sequence recognition network to be trained proceeds. Accordingly, in the training process of the object sequence recognition network to be trained, the number of the objects of each class in the feature sequence is supervised based on the classification result of the feature sequence output by the classifier using the ACE loss as the second loss during later training. Therefore, the ACE loss has relatively high performance during the later training of the network.
[ 00182] Finally, the first dynamic loss and the second dynamic loss are fused to obtain the total loss.
[ 00183] In some embodiments, after the first dynamic weight is assigned to the first loss and the second dynamic weight is assigned to the second loss, the two weighted loss functions are added to obtain the total loss of the object sequence recognition network to be trained. Accordingly, the first dynamic loss and the second dynamic loss are fused to obtain the total loss, and the object sequence recognition network is trained using the total loss, so that the robustness of the network may be improved.
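A minimal sketch of the dynamic weighting is given below, assuming the weights are driven by the current epoch; the linear schedule and its bounds are illustrative assumptions rather than values taken from the embodiment.

```python
import torch

def dynamic_weights(epoch: int, total_epochs: int):
    """First dynamic weight decays over training; second dynamic weight grows."""
    alpha = max(0.0, 1.0 - epoch / total_epochs)
    beta = min(1.0, epoch / total_epochs)
    return alpha, beta

first_loss = torch.tensor(2.3)      # placeholder value for the CTC (first) loss
second_loss = torch.tensor(1.1)     # placeholder value for the ACE (second) loss
alpha, beta = dynamic_weights(epoch=30, total_epochs=100)
total_loss = alpha * first_loss + beta * second_loss
print(alpha, beta, total_loss.item())
```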
[ 00184] In S252, the network parameter of the object sequence recognition network to be trained is adjusted according to the total loss such that the loss of the classification result output by the adjusted object sequence recognition network satisfies the convergence condition.
[ 00185] In some embodiments, the object sequence recognition network to be trained is trained using the total loss obtained by fusing the first dynamic loss and the second dynamic loss, so that the prediction effect of the whole network may be improved, and an object sequence recognition network with relatively high performance may be obtained.
[ 00186] An exemplary application of the embodiment of the application to a practical application scene will be described below. Taking a game place as an example of the application scene, descriptions will be made with the recognition of an object (e.g., a token) in the game place as an example.
[ 00187] A sequence recognition algorithm for an image is applied extensively to scene text recognition, license plate recognition and other scenes. In the related art, the algorithm mainly includes extracting an image feature using a convolutional neural network, performing classification prediction on each slice feature, performing duplicate elimination in combination with a CTC loss function and supervising a predicted output, and is applicable to text recognition and license plate recognition tasks.
[ 00188] However, for the recognition of a token sequence in the game place, the token sequence usually has a relatively long sequence length, and the requirement on the accuracy of predicting the face value and type of each token is relatively high.
[ 00189] Based on this, an embodiment of the application provides an object sequence recognition method. A CTC loss is fused with an ACE loss, and corresponding weights are dynamically adjusted. Therefore, the supervision of a sequence length in an object sequence recognition network is strengthened, and an object sequence may be recognized accurately.
[ 00190] FIG. 3 is a structure diagram of an object sequence recognition network according to an embodiment of the application. The following descriptions will be made in combination with FIG. 3. A framework of the object sequence recognition network includes an image input module 301, a feature extraction module 302, and a loss module.
[ 00191] The image input module 301 is configured to preprocess each sample image in a sample image set to obtain a processed sample image set.
[ 00192] In some possible implementation modes, preprocessing a frame of sample image mainly includes adjusting a size of the image with an aspect ratio kept unchanged, normalizing pixel values of the image, and other operations. The operation of adjusting the size of the image with the aspect ratio kept unchanged refers to adjusting the widths of multiple frames of sample images to be the same. Because the numbers of tokens in the input images differ, the aspect ratios of the images differ greatly, and forcing the images to the same aspect ratio would cause great deformations. Therefore, in the embodiment of the application, for an image of which the height is less than a maximum height, the positions of which the heights are less than the maximum height are filled with average gray pixel values (127, 127, 127). In order to enrich the sample image set, a data enhancement operation is performed on the processed sample images. For example, random flipping, random clipping, random aspect ratio fine adjustment, random rotation and other operations are performed on the processed sample images. As such, the overall robustness of the network to be trained may be improved.
[ 00193] The feature extraction module 302 performs feature extraction on processed sample images to obtain a feature sequence 303.
[ 00194] In some possible implementation modes, high-layer features of the input sample images are extracted at first using a convolutional neural network part in the object sequence recognition network to be trained. The convolutional neural network part is obtained by fine adjustment based on a network structure of a ResNet. For example, the last strides (2, 2) of convolutional layers 3 and 4 in the network structure of the ResNet are changed to strides (1, 2). As such, the feature map is not down-sampled in the height dimension, and is down-sampled in the width dimension to halve the original width. Therefore, feature information in the height dimension may be maximally retained. Then, a splitting operation is performed on the feature map, namely the feature map extracted by the convolutional neural network is split into a plurality of feature sequences to facilitate subsequent calculation of the classifier and the loss function. When the feature map is split, average pooling is performed in the width direction of the feature map, and no changes are made in the height direction and the channel dimension. For example, a size of the feature map is 2,048*40*8 (the channel dimension is 2,048, the height dimension is 40, and the width dimension is 8), a 2,048*40*1 feature map is obtained by average pooling in the width direction, and the feature map is split in the height dimension to obtain 40 2,048-dimensional vectors, of which each corresponds to a feature corresponding to 1/40 of a region in the height direction in the original map.
[ 00195] In a specific example, if the sample image includes multiple tokens, as shown in FIG. 4, the feature sequence is obtained by division according to a height dimension of the image 401. Each feature in the feature sequence includes the features of at most one token.
[ 00196] The classifier adopts an n-classifier, and performs token class prediction on the feature sequence to obtain a predicted probability of each feature sequence.
[ 00197] Here, n is the total number of token classes.
[ 00198] The loss module adopts a dynamic adjustment manner for predicted probabilities of all classes in the feature sequence, combines a CTC loss 304 and an ACE loss 305, and simultaneously supervises a prediction result.
[ 00199] In some possible implementation modes, the CTC loss 304 and the ACE loss 305 are combined into a total loss 306 in the object sequence recognition network to be trained, which may be represented as
Loss_total = α · Loss_CTC + β · Loss_ACE, where Loss_CTC is the CTC loss 304, Loss_ACE is the ACE loss 305, α is a first dynamic weight, and β is a second dynamic weight.
Since the CTC loss has a relatively good supervision effect on an overall prediction condition of the sequence during early training, and the ACE loss additionally supervises the prediction of the number of each class in the sequence, i.e., a sequence length, during later training, the overall prediction effect is improved. Therefore, the weights of the two loss functions are dynamically adjusted. The first dynamic weight α gradually decreases with the training process, and the second dynamic weight β gradually increases with the training process.
[ 00200] Finally, back propagation is performed according to the classification result of the feature sequence and calculation results of the loss functions to update a network parameter weight. In a test stage, the classification result of the feature sequence is processed according to a post-processing rule of the CTC loss function to obtain a predicted token sequence result, including a length of the token sequence and a class corresponding to each token.
[ 00201] In the embodiment of the application, without introducing any additional parameter or modifying a network structure, the prediction result of the sequence length may be improved, and meanwhile, the class recognition accuracy may be improved to finally improve the overall recognition result, particularly in a scene with a long token sequence.
[ 00202] An embodiment of the application provides an object sequence recognition apparatus. FIG. 5A is a structure composition diagram of an object sequence recognition apparatus according to an embodiment of the application. As shown in FIG. 5A, the object sequence recognition apparatus 500 includes a first acquisition module 501, a first extraction module 502, and a first prediction module 503.
[ 00203] The first acquisition module 501 is configured to acquire a first image including an object sequence.
[ 00204] The first extraction module 502 is configured to input the first image to an object sequence recognition network and perform feature extraction to obtain a feature sequence, supervision information in a training process of the object sequence recognition network at least including class supervision information of each sample object in a sample object sequence and sequence length supervision information of the sample object of each class in the sample object sequence.
[ 00205] The first prediction module 503 is configured to predict a class of the object sequence based on the feature sequence using a classifier of the object sequence recognition network to obtain class information of the object sequence.
[ 00206] In some embodiments, the first extraction module 502 includes a first extraction submodule and a first splitting submodule.
[ 00207] The first extraction submodule is configured to perform feature extraction on the first image using a convolutional subnetwork in the object sequence recognition network to obtain a feature map.
[ 00208] The first splitting submodule is configured to split the feature map to obtain the feature sequence.
[ 00209] In some embodiments, the first extraction submodule includes a first downsampling unit, a first extraction unit, and a first determination unit.
[ 00210] The first down-sampling unit is configured to down-sample the first image using the convolutional subnetwork in a length dimension of the first image in a first direction to obtain a first-dimensional feature, the first direction being different from an arrangement direction of objects in the object sequence.
[ 00211] The first extraction unit is configured to extract a feature in a length dimension of the first image in a second direction based on a length of the first image in the second direction to obtain a second-dimensional feature.
[ 00212] The first determination unit is configured to obtain the feature map based on the first-dimensional feature and the second-dimensional feature.
[ 00213] In some embodiments, the first splitting submodule includes a first pooling unit and a first splitting unit.
[ 00214] The first pooling unit is configured to pool the feature map in the first direction to obtain a pooled feature map.
[ 00215] The first splitting unit is configured to split the pooled feature map in the second direction to obtain the feature sequence.
[ 00216] In some embodiments, the first prediction module 503 includes a first prediction submodule, a first determination submodule, a second determination submodule, and a third determination submodule.
[ 00217] The first prediction submodule is configured to predict a class corresponding to each feature in the feature sequence using the classifier of the object sequence recognition network.
[ 00218] The first determination submodule is configured to determine a class of each object in the object sequence based on a prediction result of the class corresponding to each feature in the feature sequence.
[ 00219] The second determination submodule is configured to determine a sequence length of target features of objects belonging to the same class in the feature sequence.
[ 00220] The third determination submodule is configured to obtain the class information of the object sequence based on the class of each object in the object sequence and a sequence length of target features corresponding to the objects of each class.
[ 00221] An embodiment of the application provides an apparatus for training an object sequence recognition network. FIG. 5B is a structure composition diagram of an apparatus for training an object sequence recognition network according to an embodiment of the application. As shown in FIG. 5B, the apparatus 510 for training an object sequence recognition network includes a second acquisition module 511, a second extraction module 512, a second prediction module 513, a first determination module 514, and a first adjustment module 515.
[ 00222] The second acquisition module 511 is configured to acquire a sample image, the sample image including a sample object sequence and class labeling information of the sample object sequence.
[ 00223] The second extraction module 512 is configured to input the sample image to an object sequence recognition network to be trained and perform feature extraction to obtain a sample feature sequence.
[ 00224] The second prediction module 513 is configured to perform class prediction on sample objects in the sample object sequence based on the sample feature sequence using a classifier of the object sequence recognition network to be trained to obtain a class prediction result of the sample object sequence, the class prediction result of the sample object sequence including class prediction information of each sample object in the sample object sequence.
[ 00225] The first determination module 514 is configured to determine a first loss and a second loss based on the class prediction result of the sample object sequence, the first loss being configured to supervise the class prediction result of the sample object sequence based on the class labeling information of the sample object sequence, and the second loss being configured to supervise the number of sample objects of each class in the sample object sequence based on the class labeling information of the sample object sequence.
[ 00226] The first adjustment module 515 is configured to adjust a network parameter of the object sequence recognition network to be trained according to the first loss and the second loss such that a loss of a classification result output by an adjusted object sequence recognition network satisfies a convergence condition.
[ 00227] In some embodiments, the second extraction module 512 includes a second extraction submodule and a second splitting submodule.
[ 00228] The second extraction submodule is configured to perform feature extraction on the sample image using a convolutional subnetwork in the object sequence recognition network to be trained to obtain a sample feature map of the sample image.
[ 00229] The second splitting submodule is configured to split the sample feature map to obtain the sample feature sequence.
[ 00230] In some embodiments, the second extraction submodule includes a second down-sampling unit, a second extraction unit, and a second determination unit.
[ 00231] The second down-sampling unit is configured to down-sample the sample image using the convolutional subnetwork in a length dimension of the sample image in a first direction to obtain a first-dimensional sample feature, the first direction being different from an arrangement direction of the sample objects in the sample object sequence.
[ 00232] The second extraction unit is configured to extract a feature in a length dimension of the sample image in a second direction based on a length of the sample image in the second direction to obtain a second-dimensional sample feature.
[ 00233] The second determination unit is configured to obtain the sample feature map of the sample image based on the first-dimensional sample feature and the second-dimensional sample feature.
[ 00234] In some embodiments, the second splitting submodule includes a second pooling unit and a second splitting unit.
[ 00235] The second pooling unit is configured to pool the sample feature map in the first direction to obtain a pooled sample feature map.
[ 00236] The second splitting unit is configured to split the pooled sample feature map in the second direction to obtain the sample feature sequence.
[ 00237] In some embodiments, the first adjustment module 515 includes a first fusion submodule and a first adjustment submodule.
[ 00238] The first fusion submodule is configured to perform weighted fusion on the first loss and the second loss to obtain a total loss.
[ 00239] The first adjustment submodule is configured to adjust the network parameter of the object sequence recognition network to be trained according to the total loss such that the loss of the classification result output by the adjusted object sequence recognition network satisfies the convergence condition.
[ 00240] In some embodiments, the first fusion submodule includes a first assignment unit, a second assignment unit, and a first fusion unit.
[ 00241] The first assignment unit is configured to assign a first dynamic weight to the first loss to obtain a first dynamic loss, the first dynamic weight gradually decreasing with the increase of a training count and/or training time of the object sequence recognition network to be trained when the training count reaches a first threshold or the training time reaches a first time.
[ 00242] The second assignment unit is configured to assign a second dynamic weight to the second loss to obtain a second dynamic loss, the second dynamic weight gradually increasing with the increase of the training count and/or training time of the object sequence recognition network to be trained when the training count reaches a second threshold or the training time reaches a second time.
[ 00243] The first fusion unit is configured to fuse the first dynamic loss and the second dynamic loss to obtain the total loss.
[ 00244] It is to be noted that the descriptions about the above apparatus embodiment are similar to those about the method embodiment and beneficial effects similar to those of the method embodiment are achieved. Technical details undisclosed in the apparatus embodiment of the application may be understood with reference to the descriptions about the method embodiment of the application.
[ 00245] It is to be noted that, in the embodiments of the application, the object sequence recognition method may also be stored in a computer-readable storage medium when being implemented in form of software function module and sold or used as an independent product. Based on such an understanding, the technical solutions of the embodiments of the application substantially or parts making contributions to the conventional art may be embodied in form of software product, and the computer software product is stored in a storage medium, including a plurality of instructions configured to enable a computer device (which may be a terminal, a server, etc.) to execute all or part of the method in each embodiment of the application. The storage medium includes various media capable of storing program codes such as a U disk, a mobile hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Therefore, the embodiments of the application are not limited to any specific hardware and software combination.
[ 00246] An embodiment of the application also provides a computer program product including a computer-executable instruction which may be executed to implement the object sequence recognition method provided in the embodiments of the application.
[ 00247] An embodiment of the application also provides a computer storage medium having stored therein a computer-executable instruction which is executed by a processor to implement the object sequence recognition method provided in the abovementioned embodiments.
[ 00248] An embodiment of the application provides a computer device. FIG. 6 is a composition structure diagram of a computer device according to an embodiment of the application. As shown in FIG. 6, the computer device 600 includes a processor 601, at least one communication bus, a communication interface 602, at least one external communication interface, and a memory 603. The communication interface 602 is configured to implement connections and communications between these components. The communication interface 602 may include a display screen. The external communication interface may include a standard wired interface and wireless interface. The processor 601 is configured to execute an image processing program in the memory to implement the object sequence recognition method provided in the abovementioned embodiments.
[ 00249] The above descriptions about the embodiments of the object sequence recognition apparatus, the computer device and the storage medium are similar to the descriptions about the method embodiments, and technical descriptions and beneficial effects are similar to those of the corresponding method embodiments. Due to the space limitation, references can be made to the records in the method embodiments, and elaborations are omitted herein. Technical details undisclosed in the embodiments of the object sequence recognition apparatus, computer device and storage medium of the application may be understood with reference to the descriptions about the method embodiments of the application.
[ 00250] It is to be understood that "one embodiment" and "an embodiment" mentioned throughout the specification mean that specific features, structures, or characteristics related to the embodiment are included in at least one embodiment of the application. Therefore, "in one embodiment" or "in an embodiment" mentioned throughout the specification does not always refer to the same embodiment. In addition, these specific features, structures, or characteristics may be combined in one or more embodiments freely as appropriate. It is to be understood that, in each embodiment of the application, the magnitude of the sequence number of each process does not imply an execution sequence; the execution sequence of each process should be determined by its function and internal logic, and should not form any limit to the implementation process of the embodiments of the application. The sequence numbers of the embodiments of the application are adopted only for description and do not represent the superiority or inferiority of the embodiments. It is to be noted that the terms "include" and "contain" or any other variant thereof are intended to cover non-exclusive inclusions herein, so that a process, method, object, or device including a series of elements not only includes those elements but also includes other elements which are not clearly listed, or further includes elements intrinsic to the process, the method, the object, or the device. Without further limitations, an element defined by the statement "including a/an" does not exclude the existence of other identical elements in the process, method, object, or device including the element.
[ 00251] In some embodiments provided by the application, it is to be understood that the disclosed device and method may be implemented in other manners. The device embodiment described above is only schematic; for example, the division of the units is only a logical function division, and other division manners may be adopted in practical implementation. For example, multiple units or components may be combined or integrated into another system, or some characteristics may be neglected or not executed. In addition, the coupling or direct coupling or communication connection between the displayed or discussed components may be indirect coupling or communication connection of the device or the units implemented through some interfaces, and may be electrical, mechanical, or in other forms.
[ 00252] The units described as separate parts may or may not be physically separated, and parts displayed as units may or may not be physical units, namely they may be located in the same place or distributed to multiple network units. Part or all of the units may be selected according to a practical requirement to achieve the purposes of the solutions of the embodiments.
[ 00253] In addition, each function unit in each embodiment of the application may be integrated into a processing unit, each unit may also serve as an independent unit and two or more than two units may also be integrated into a unit. The integrated unit may be implemented in a hardware form and may also be implemented in form of hardware and software function unit. Those of ordinary skill in the art should know that all or part of the steps of the method embodiment may be implemented by related hardware instructed through a program, the program may be stored in a computer-readable storage medium, and the program is executed to execute the steps of the method embodiment. The storage medium includes various media capable of storing program codes such as a mobile storage device, a ROM, a magnetic disk, or an optical disc.
[ 00254] Or, the integrated unit of the application may also be stored in a computer- readable storage medium when implemented in form of a software function module and sold or used as an independent product. Based on such an understanding, the technical solutions of the embodiments of the application substantially or parts making contributions to the conventional art may be embodied in form of software product, and the computer software product is stored in a storage medium, including a plurality of instructions configured to enable a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the method in each embodiment of the application. The storage medium includes various media capable of storing program codes such as a mobile hard disk, a ROM, a magnetic disk, or an optical disc. The above is only the specific implementation mode of the application and not intended to limit the scope of protection of the application. Any variations or replacements apparent to those skilled in the art within the technical scope disclosed by the application shall fall within the scope of protection of the application. Therefore, the scope of protection of the application shall be subject to the scope of protection of the claims.

Claims

1. An object sequence recognition method, comprising: acquiring a first image comprising an object sequence; inputting the first image to an object sequence recognition network and performing feature extraction to obtain a feature sequence, supervision information in a training process of the object sequence recognition network at least comprising class supervision information of each sample object in a sample object sequence and sequence length supervision information of the sample object of each class in a sample object sequence; and predicting a class of the object sequence based on the feature sequence using a classifier of the object sequence recognition network to obtain class information of the object sequence.
2. The method of claim 1, wherein the inputting the first image to an object sequence recognition network and performing feature extraction to obtain a feature sequence comprises: performing feature extraction on the first image using a convolutional subnetwork in the object sequence recognition network to obtain a feature map; and splitting the feature map to obtain the feature sequence.
3. The method of claim 2, wherein the performing feature extraction on the first image using a convolutional subnetwork in the object sequence recognition network to obtain a feature map comprises: down-sampling the first image using the convolutional subnetwork in a length dimension of the first image in a first direction to obtain a first-dimensional feature, the first direction being different from an arrangement direction of objects in the object sequence; extracting a feature in a length dimension of the first image in a second direction based on a length of the first image in the second direction to obtain a second-dimensional feature; and obtaining the feature map based on the first-dimensional feature and the second-dimensional feature.
4. The method of claim 3, wherein the splitting the feature map to obtain the feature sequence comprises: pooling the feature map in the first direction to obtain a pooled feature map; and splitting the pooled feature map in the second direction to obtain the feature sequence.
5. The method of any one of claims 1-4, wherein the predicting a class of the object sequence based on the feature sequence using a classifier of the object sequence recognition network to obtain class information of the object sequence comprises: predicting a class corresponding to each feature in the feature sequence using the classifier of the object sequence recognition network; determining a class of each object in the object sequence based on a prediction result of the class corresponding to each feature in the feature sequence; determining a sequence length of target features of objects belonging to a same class in the feature sequence; and obtaining the class information of the object sequence based on the class of each object in the object sequence and a sequence length of target features corresponding to the objects of each class.
6. A method for training an object sequence recognition network, comprising: acquiring a sample image, the sample image comprising a sample object sequence and
class labeling information of the sample object sequence; inputting the sample image to an object sequence recognition network to be trained and performing feature extraction to obtain a sample feature sequence; performing class prediction on sample objects in the sample object sequence based on the sample feature sequence using a classifier of the object sequence recognition network to be trained to obtain a class prediction result of the sample object sequence, the class prediction result of the sample object sequence comprising class prediction information of each sample object in the sample object sequence; determining a first loss and a second loss based on the class prediction result of the sample object sequence, the first loss being configured to supervise the class prediction result of the sample object sequence based on the class labeling information of the sample object sequence, and the second loss being configured to supervise the number of sample objects of each class in the sample object sequence based on the class labeling information of the sample object sequence; and adjusting a network parameter of the object sequence recognition network to be trained according to the first loss and the second loss to make a loss of a classification result output by an adjusted object sequence recognition network satisfy a convergence condition.
7. The method of claim 6, wherein the inputting the sample image to an object sequence recognition network to be trained and performing feature extraction to obtain a sample feature sequence comprises: performing feature extraction on the sample image using a convolutional subnetwork in the object sequence recognition network to be trained to obtain a sample feature map of the sample image; and splitting the sample feature map to obtain the sample feature sequence.
8. The method of claim 7, wherein the performing feature extraction on the sample image using a convolutional subnetwork in the object sequence recognition network to be trained to obtain a sample feature map of the sample image comprises: down-sampling the sample image using the convolutional subnetwork in a length dimension of the sample image in a first direction to obtain a first-dimensional sample feature, the first direction being different from an arrangement direction of sample objects in the sample object sequence; extracting a feature in a length dimension of the sample image in a second direction based on a length of the sample image in the second direction to obtain a second-dimensional sample feature; and obtaining the sample feature map of the sample image based on the first-dimensional sample feature and the second-dimensional sample feature.
9. The method of claim 8, wherein the splitting the sample feature map to obtain the sample feature sequence comprises: pooling the sample feature map in the first direction to obtain a pooled sample feature map; and splitting the pooled sample feature map in the second direction to obtain the sample feature sequence.
10. The method of any one of claims 6-9, wherein the adjusting a network parameter of the object sequence recognition network to be trained according to the first loss and the second loss to make a loss of a classification result output by an adjusted object sequence recognition network satisfy a convergence condition comprises: performing weighted fusion on the first loss and the second loss to obtain a total loss; and adjusting the network parameter of the object sequence recognition network to be trained according to the total loss to make the loss of the classification result output by the adjusted object sequence recognition network satisfy the convergence condition.
11. The method of claim 10, wherein the performing weighted fusion on the first loss and the second loss to obtain a total loss comprises: assigning a first dynamic weight to the first loss to obtain a first dynamic loss, the first dynamic weight gradually decreasing with the increase of a training count and/or training time of the object sequence recognition network to be trained when the training count reaches a first threshold or the training time reaches a first time; assigning a second dynamic weight to the second loss to obtain a second dynamic loss, the second dynamic weight gradually increasing with the increase of the training count and/or training time of the object sequence recognition network to be trained when the training count reaches a second threshold or the training time reaches a second time; and fusing the first dynamic loss and the second dynamic loss to obtain the total loss.
12. A computer storage medium, in which a computer-executable instruction is stored, wherein when executed by a processor, the computer-executable instruction is configured to: acquire a first image comprising an object sequence; input the first image to an object sequence recognition network and perform feature extraction to obtain a feature sequence, supervision information in a training process of the object sequence recognition network at least comprising class supervision information of each sample object in a sample object sequence and sequence length supervision information of the sample object of each class in the sample object sequence; and predict a class of the object sequence based on the feature sequence using a classifier of the object sequence recognition network to obtain class information of the object sequence.
13. A computer storage medium, in which a computer-executable instruction is stored, wherein when executed by a processor, the computer-executable instruction is configured to: acquire a sample image, the sample image comprising a sample object sequence and class labeling information of the sample object sequence; input the sample image to an object sequence recognition network to be trained and perform feature extraction to obtain a sample feature sequence; perform class prediction on sample objects in the sample object sequence based on the sample feature sequence using a classifier of the object sequence recognition network to be trained to obtain a class prediction result of the sample object sequence, the class prediction result of the sample object sequence comprising class prediction information of each sample object in the sample object sequence; determine a first loss and a second loss based on the class prediction result of the sample object sequence, the first loss being configured to supervise the class prediction result of the sample object sequence based on the class labeling information of the sample object sequence, and the second loss being configured to supervise the number of sample objects of each class in the sample object sequence based on the class labeling information of the sample object sequence; and adjust a network parameter of the object sequence recognition network to be trained according to the first loss and the second loss to make a loss of a classification result output by an adjusted object sequence recognition network satisfy a convergence condition.
14. A computer device, comprising a memory and a processor, wherein a computer-executable instruction is stored in the memory; wherein when executing the computer-executable instruction in the memory, the processor is configured to: acquire a first image comprising an object sequence; input the first image to an object sequence recognition network and perform feature extraction to obtain a feature sequence, supervision information in a training process of the object sequence recognition network at least comprising class supervision information of each sample object in a sample object sequence and sequence length supervision information of the sample object of each class in the sample object sequence; and predict a class of the object sequence based on the feature sequence using a classifier of the object sequence recognition network to obtain class information of the object sequence.
15. The computer device of claim 14, wherein when inputting the first image to the object sequence recognition network and performing the feature extraction to obtain the feature sequence, the processor is configured to: perform feature extraction on the first image using a convolutional subnetwork in the object sequence recognition network to obtain a feature map; and split the feature map to obtain the feature sequence.
16. The computer device of claim 15, wherein when performing the feature extraction on the first image using the convolutional subnetwork in the object sequence recognition network to obtain the feature map, the processor is configured to: down-sample the first image using the convolutional subnetwork in a length dimension of the first image in a first direction to obtain a first-dimensional feature, the first direction being different from an arrangement direction of objects in the object sequence; extract a feature in a length dimension of the first image in a second direction based on a length of the first image in the second direction to obtain a second-dimensional feature; and obtain the feature map based on the first-dimensional feature and the second-dimensional feature.
17. The computer device of claim 16, wherein when splitting the feature map to obtain the feature sequence, the processor is configured to: pool the feature map in the first direction to obtain a pooled feature map; and split the pooled feature map in the second direction to obtain the feature sequence.
18. A computer device, comprising a memory and a processor, wherein a computer-executable instruction is stored in the memory; wherein when executing the computer-executable instruction in the memory, the processor is configured to: acquire a sample image, the sample image comprising a sample object sequence and class labeling information of the sample object sequence; input the sample image to an object sequence recognition network to be trained and perform feature extraction to obtain a sample feature sequence; perform class prediction on sample objects in the sample object sequence based on the sample feature sequence using a classifier of the object sequence recognition network to be trained to obtain a class prediction result of the sample object sequence, the class prediction result of the sample object sequence comprising class prediction information of each sample object in the sample object sequence; determine a first loss and a second loss based on the class prediction result of the sample object sequence, the first loss being configured to supervise the class prediction result of the sample object sequence based on the class labeling information of the sample object sequence, and the second loss being configured to supervise the number of sample objects of each class in the sample object sequence based on the class labeling information of the sample object sequence; and
adjust a network parameter of the object sequence recognition network to be trained according to the first loss and the second loss to make a loss of a classification result output by an adjusted object sequence recognition network satisfy a convergence condition.
19. A computer program, comprising computer instructions executable by an electronic device, wherein when executed by a processor in the electronic device, the computer instructions are configured to: acquire a first image comprising an object sequence; input the first image to an object sequence recognition network and perform feature extraction to obtain a feature sequence, supervision information in a training process of the object sequence recognition network at least comprising class supervision information of each sample object in a sample object sequence and sequence length supervision information of the sample object of each class in the sample object sequence; and predict a class of the object sequence based on the feature sequence using a classifier of the object sequence recognition network to obtain class information of the object sequence.
20. A computer program, comprising computer instructions executable by an electronic device, wherein when executed by a processor in the electronic device, the computer instructions are configured to: acquire a sample image, the sample image comprising a sample object sequence and class labeling information of the sample object sequence; input the sample image to an object sequence recognition network to be trained and perform feature extraction to obtain a sample feature sequence; perform class prediction on sample objects in the sample object sequence based on the sample feature sequence using a classifier of the object sequence recognition network to be trained to obtain a class prediction result of the sample object sequence, the class prediction result of the sample object sequence comprising class prediction information of each sample object in the sample object sequence; determine a first loss and a second loss based on the class prediction result of the sample object sequence, the first loss being configured to supervise the class prediction result of the sample object sequence based on the class labeling information of the sample object sequence, and the second loss being configured to supervise the number of sample objects of each class in the sample object sequence based on the class labeling information of the sample object sequence; and adjust a network parameter of the object sequence recognition network to be trained according to the first loss and the second loss to make a loss of a classification result output by an adjusted object sequence recognition network satisfy a convergence condition.
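Claims 2 to 4 describe feature extraction in which the image is down-sampled along a first direction (different from the direction in which the objects are arranged), pooled along that same direction, and then split along the second direction into a feature sequence. The following is a minimal, non-authoritative sketch of that idea, assuming a PyTorch-style convolutional subnetwork, that the objects are stacked along the image height, and that the width is the first direction; the layer count, channel sizes, and strides are illustrative only.

    # Minimal sketch of the asymmetric feature extraction of claims 2-4 (assumptions:
    # objects stacked along the height, width is the "first direction", PyTorch).
    import torch
    import torch.nn as nn

    class AsymmetricBackbone(nn.Module):
        def __init__(self, in_channels: int = 3, feat_channels: int = 64):
            super().__init__()
            # Stride only along the width: the height (arrangement direction)
            # keeps its resolution so each object can map to its own feature slice.
            self.convs = nn.Sequential(
                nn.Conv2d(in_channels, feat_channels, kernel_size=3, stride=(1, 2), padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(feat_channels, feat_channels, kernel_size=3, stride=(1, 2), padding=1),
                nn.ReLU(inplace=True),
            )

        def forward(self, image: torch.Tensor) -> torch.Tensor:
            """image: (N, C, H, W) -> feature sequence: (N, H, feat_channels)."""
            feature_map = self.convs(image)      # down-sampled along the width only
            pooled = feature_map.mean(dim=3)     # pool over the first direction (width)
            return pooled.permute(0, 2, 1)       # split along the second direction (height)

Each of the H slices then plays the role of one feature in the feature sequence of claim 2.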
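For the prediction step of claim 5, one possible reading is that each feature in the sequence is classified independently and that consecutive features of the same class are grouped so a per-class sequence length can be read off. The sketch below follows that reading; the linear classifier and the grouping rule are assumptions, not the claimed implementation.

    # Hedged illustration of claim 5: classify each feature slice, then group
    # consecutive slices of the same class to obtain per-class run lengths.
    from itertools import groupby
    from typing import List, Tuple

    import torch
    import torch.nn as nn

    def decode_sequence(features: torch.Tensor, classifier: nn.Module) -> List[Tuple[int, int]]:
        """features: (T, C) feature sequence of one image.
        Returns (class_id, run_length) pairs for consecutive same-class slices."""
        logits = classifier(features)               # (T, num_classes)
        per_slice = logits.argmax(dim=-1).tolist()  # one predicted class per slice
        return [(cls, len(list(run))) for cls, run in groupby(per_slice)]

    # Example usage with the hypothetical backbone above:
    # backbone = AsymmetricBackbone(); classifier = nn.Linear(64, num_classes)
    # sequence = backbone(image)                    # (1, H, 64)
    # result = decode_sequence(sequence[0], classifier)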
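Claim 6 trains the network with a first loss that supervises the sequence-level class prediction and a second loss that supervises the number of sample objects of each class. The claims do not name concrete loss functions, so the sketch below makes two explicit assumptions: a CTC loss stands in for the first loss, and an L1 loss between soft per-class counts (summed softmax probabilities over the slices, treating each slice as one object) and the labelled counts stands in for the second loss.

    # Hedged two-loss training sketch for claim 6 (CTC and soft-count L1 are assumptions).
    import torch
    import torch.nn.functional as F

    def training_losses(logits, targets, input_lengths, target_lengths, class_counts, blank=0):
        """logits: (T, N, num_classes) raw classifier outputs for a batch of feature sequences.
        targets, target_lengths: class labels of the sample object sequences (CTC format).
        class_counts: (N, num_classes) labelled number of sample objects of each class."""
        log_probs = F.log_softmax(logits, dim=-1)
        first_loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=blank)

        # Soft count of slices predicted for each class, used here as a proxy for
        # the number of objects of that class.
        expected_counts = log_probs.exp().sum(dim=0)      # (N, num_classes)
        second_loss = F.l1_loss(expected_counts, class_counts.float())
        return first_loss, second_loss

Claim 10 then fuses the two losses by weighting, for example total = w1 * first_loss + w2 * second_loss, before back-propagation.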
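Claim 11 makes the two weights dynamic: the first weight gradually decreases once the training count (or time) passes a first threshold, and the second gradually increases once it passes a second threshold. A simple linear schedule satisfying that description might look as follows; the thresholds, bounds, and linear form are illustrative assumptions.

    # Illustrative dynamic weighting for claims 10-11 (all numeric values are assumptions).
    def dynamic_weights(step: int,
                        first_threshold: int = 1000,
                        second_threshold: int = 1000,
                        ramp_steps: int = 9000,
                        w_min: float = 0.1,
                        w_max: float = 1.0):
        """Returns (w1, w2): w1 decays and w2 grows after their thresholds are reached."""
        if step <= first_threshold:
            w1 = w_max
        else:
            progress = min((step - first_threshold) / ramp_steps, 1.0)
            w1 = w_max - (w_max - w_min) * progress      # gradually decreasing
        if step <= second_threshold:
            w2 = w_min
        else:
            progress = min((step - second_threshold) / ramp_steps, 1.0)
            w2 = w_min + (w_max - w_min) * progress      # gradually increasing
        return w1, w2

    # total_loss = w1 * first_loss + w2 * second_loss   # weighted fusion of claim 10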
PCT/IB2021/058767 2021-09-22 2021-09-27 Object sequence recognition method, network training method, apparatuses, device, and medium WO2023047159A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180002770.8A CN116157801A (en) 2021-09-22 2021-09-27 Object sequence identification method, network training method, device, equipment and medium
AU2021240212A AU2021240212A1 (en) 2021-09-22 2021-09-27 Object sequence recognition method, network training method, apparatuses, device, and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10202110495U 2021-09-22
SG10202110495U 2021-09-22

Publications (1)

Publication Number Publication Date
WO2023047159A1 true WO2023047159A1 (en) 2023-03-30

Family

ID=85719322

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2021/058767 WO2023047159A1 (en) 2021-09-22 2021-09-27 Object sequence recognition method, network training method, apparatuses, device, and medium

Country Status (3)

Country Link
CN (1) CN116157801A (en)
AU (1) AU2021240212A1 (en)
WO (1) WO2023047159A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200388109A1 (en) * 2017-11-15 2020-12-10 Angel Playing Cards Co., Ltd. Recognition system
US10453197B1 (en) * 2019-02-18 2019-10-22 Inception Institute of Artificial Intelligence, Ltd. Object counting and instance segmentation using neural network architectures with image-level supervision
CN110084244A (en) * 2019-03-14 2019-08-02 上海达显智能科技有限公司 Method, smart machine and application based on image recognition object
US20210073578A1 (en) * 2019-09-05 2021-03-11 Sensetime International Pte. Ltd. Method and apparatus for recognizing sequence in image, electronic device, and storage medium

Also Published As

Publication number Publication date
CN116157801A (en) 2023-05-23
AU2021240212A1 (en) 2023-04-06


Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase
  Ref document number: 2021571361
  Country of ref document: JP
ENP Entry into the national phase
  Ref document number: 2021240212
  Country of ref document: AU
  Date of ref document: 20210927
  Kind code of ref document: A