CN114998678A - Model training method, target tracking method and device - Google Patents

Model training method, target tracking method and device

Info

Publication number
CN114998678A
Authority
CN
China
Prior art keywords
training
model
image
data
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210583640.4A
Other languages
Chinese (zh)
Inventor
陈子亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210583640.4A priority Critical patent/CN114998678A/en
Publication of CN114998678A publication Critical patent/CN114998678A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761: Proximity, similarity or dissimilarity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10016: Video; Image sequence
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20081: Training; Learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20084: Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a model training method, a target tracking method, and corresponding apparatuses, relating to the technical field of artificial intelligence, in particular to deep learning, image processing, and computer vision, and applicable to scenes such as Optical Character Recognition (OCR). The specific implementation scheme is as follows: a first pre-training is performed on a first model according to image-text data to obtain the pre-training parameters that the first model loads in a second pre-training; training data are constructed according to a first image sample set and a second image sample set; and the second pre-training is performed on the first model according to the training data and the pre-training parameters to obtain a second model. Because the second model is obtained by loading the pre-training parameters, the method and apparatus improve model accuracy.

Description

Model training method, target tracking method and device
Technical Field
The present disclosure relates to the field of artificial intelligence technology, in particular to deep learning, image processing, and computer vision, and can be applied to scenes such as Optical Character Recognition (OCR).
Background
As technology develops, artificial intelligence can be used to improve hardware performance across a wide range of application scenarios. In computer-vision applications such as single-target tracking, OCR recognition, image processing, and video processing, a trained model can be deployed in hardware to increase its processing speed and accuracy. Single-target tracking is a core task in computer vision, but its accuracy suffers from the complexity of real environments and from the instability and varying resolution of target objects. How to improve the accuracy of single-target tracking in practical applications is therefore a problem to be solved.
Disclosure of Invention
The disclosure provides a model training method, a target tracking method, corresponding apparatuses, an electronic device, and a storage medium.
According to an aspect of the present disclosure, there is provided a model training method, including:
performing first pre-training on the first model according to the image-text data to obtain pre-training parameters loaded by the first model in second pre-training;
constructing training data according to the first image sample set and the second image sample set;
and carrying out second pre-training on the first model according to the training data and the pre-training parameters to obtain a second model.
According to another aspect of the present disclosure, there is provided a target tracking method including:
acquiring a first image frame and an Nth image frame from video stream data, wherein N is a positive integer greater than 2;
inputting the first image frame and the Nth image frame into a second model for target tracking, wherein the second model is obtained by loading pre-training parameters to perform model training;
according to the second model, identifying the types of the objects to be tracked in the first image frame and the Nth image frame to obtain an identification result;
and tracking the target according to the recognition result.
According to another aspect of the present disclosure, there is provided a model training apparatus including:
the first training module is used for carrying out first pre-training on the first model according to the image-text data to obtain pre-training parameters loaded by the first model in second pre-training;
the first construction module is used for constructing training data according to the first image sample set and the second image sample set;
and the second training module is used for carrying out second pre-training on the first model according to the training data and the pre-training parameters to obtain a second model.
According to another aspect of the present disclosure, there is provided a target tracking apparatus including:
the first acquisition module is used for acquiring a first image frame and an Nth image frame from video stream data, wherein N is a positive integer greater than 2;
the first processing module is used for inputting the first image frame and the Nth image frame into a second model for target tracking, and the second model is obtained by loading pre-training parameters to perform model training;
the second processing module is used for identifying the types of the objects to be tracked in the first image frame and the Nth image frame according to the second model to obtain an identification result;
and the target tracking module is used for tracking the target according to the recognition result.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided by any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method provided by any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement the method provided by any one of the embodiments of the present disclosure.
By adopting the method and the apparatus, a first pre-training can be performed on the first model according to the image-text data to obtain the pre-training parameters that the first model loads in the second pre-training; training data can be constructed according to the first image sample set and the second image sample set; and the second pre-training can then be performed on the first model according to the training data and the pre-training parameters to obtain the second model. Because the second model is obtained by loading the pre-training parameters, model accuracy is improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a distributed cluster processing scenario according to an embodiment of the present disclosure;
FIG. 2 is a flow diagram of a model training method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a first pre-training in an example of an application according to an embodiment of the present disclosure;
FIG. 4 is a diagram of a second pre-training in an example of an application according to an embodiment of the present disclosure;
FIG. 5 is a schematic flow diagram of a target tracking method according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an application scenario of a target tracking method according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a component structure of a model training apparatus according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a component structure of a target tracking device according to an embodiment of the present disclosure;
FIG. 9 is a block diagram of an electronic device for implementing the model training method/target tracking method of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. The term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C. The terms "first" and "second" as used herein are intended to refer to and distinguish one from another, are not intended to limit the order in which the terms are used, or are intended to limit the order in which the terms are used, and are intended to refer to two or more features, e.g., a first feature and a second feature, where the first feature may be one or more and the second feature may be one or more.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 is a schematic diagram of a distributed cluster processing scenario according to an embodiment of the present disclosure. The distributed cluster system shown is one example of a cluster system that can be used for model training. As shown in Fig. 1, the distributed cluster system includes a plurality of nodes (e.g., server cluster 101, server 102, server cluster 103, server 104, and server 105); the server 105 may also be connected to electronic devices, such as a cell phone 1051 and a desktop 1052, and one or more model training tasks may be performed among the nodes and the connected electronic devices. Optionally, the nodes in the distributed cluster system may train the model with data parallelism, in which case the nodes execute the model training task using the same training mode; if the nodes adopt model parallelism instead, they execute the training task using different training modes. Optionally, after each round of model training is completed, data exchange (e.g., data synchronization) may be performed among the nodes.
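A minimal sketch of the data-parallel option in PyTorch (an assumed framework; the disclosure does not prescribe one). FirstModel, loader, and the launch environment are hypothetical stand-ins; each node runs the same training step on its own data shard, and gradients are synchronized after every iteration:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Each node executes this same script (e.g., launched via torchrun).
dist.init_process_group(backend="nccl")
model = DDP(FirstModel().cuda())                  # FirstModel: hypothetical module
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for images, texts in loader:                      # loader: this node's data shard
    loss = model(images.cuda(), texts.cuda())     # assume the module returns a loss
    loss.backward()                               # DDP all-reduces gradients here
    optimizer.step()
    optimizer.zero_grad()

Model parallelism, by contrast, would split the layers of the model across nodes so that each node executes a different part of the forward pass.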
According to an embodiment of the present disclosure, a model training method is provided. Fig. 2 is a schematic flowchart of the model training method according to an embodiment of the present disclosure. The method may be applied to a model training apparatus; for example, the apparatus may be deployed in a terminal, a server, or another processing device of a single-machine, multi-machine, or cluster system, and can carry out model training and related processing. The terminal may be a User Equipment (UE), a mobile device, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the method may also be implemented by a processor calling computer-readable instructions stored in a memory. As shown in Fig. 2, the method is applied to any node or electronic device (a mobile phone, a desktop, etc.) in the cluster system shown in Fig. 1 and includes:
s201, performing first pre-training on the first model according to the image-text data to obtain pre-training parameters loaded by the first model in second pre-training.
S202, training data are constructed according to the first image sample set and the second image sample set.
S203, performing second pre-training on the first model according to the training data and the pre-training parameters to obtain a second model.
In an example of S201 to S203, the image-text data may be massive image data together with the text data corresponding to it. For example, the foreground of one image may contain objects to be tracked such as vehicles and lane lines, and roadside devices such as traffic lights, while the background contains the buildings around the vehicles; the text data describes the objects to be tracked in that image. In other words, a mapping relationship can be established between the image data and its corresponding text data, which makes it possible to identify the category of an object to be tracked in the image, so that the category of a target object (e.g., a vehicle) among the objects to be tracked can be obtained from the image-text data (the image data and its corresponding text data). A first pre-training is performed on the first model according to the image-text data to obtain the pre-training parameters that the first model loads in the second pre-training (the pre-training parameters characterize the target object category obtained from the image-text data). Training data are then constructed according to the first image sample set and the second image sample set, and a second pre-training is performed on the first model according to the training data and the pre-training parameters to obtain the trained second model. The second model can be used for identifying target object categories, and for the related processing that follows identification, in computer-vision application scenarios such as single-target tracking, OCR recognition, image processing, and video processing.
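The three steps S201 to S203 can be summarized, purely as an illustrative sketch, in the following Python pseudocode; every name here (FirstModel, first_pretrain, build_training_data, second_pretrain, vision_to_text) is a hypothetical stand-in rather than an API of the disclosure:

def train_second_model(image_text_data, first_samples, second_samples):
    model = FirstModel()                          # twin-network model before pre-training

    # S201: first pre-training on image-text pairs; keep the parameters of
    # the vision-to-text mapping module for loading in the second pre-training.
    first_pretrain(model, image_text_data)
    pretrain_params = model.vision_to_text.state_dict()

    # S202: build training data from the two image sample sets
    # (e.g., original images and cropped target patches).
    training_data = build_training_data(first_samples, second_samples)

    # S203: second pre-training with the pre-training parameters loaded
    # (and typically fixed), which yields the second model.
    model.vision_to_text.load_state_dict(pretrain_params)
    second_pretrain(model, training_data)
    return model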
By adopting the method and the apparatus, a first pre-training can be performed on the first model according to the image-text data to obtain the pre-training parameters that the first model loads in the second pre-training, and training data can be constructed according to the first image sample set and the second image sample set, so that the second pre-training can be performed on the first model according to the training data and the pre-training parameters to obtain the second model. Because the second model is obtained by loading the pre-training parameters, model accuracy is improved.
In one embodiment, performing the first pre-training on the first model according to the image-text data to obtain the pre-training parameters loaded by the first model in the second pre-training includes: extracting first image data, and first text data corresponding to the first image data, from the image-text data; inputting the first image data and the first text data into the first model; and performing the first pre-training according to the mapping relationship between the first image data and the first text data to obtain the pre-training parameters.
In some examples, the first image data and the first text data may form an image-text pair, which serves as the input to the first model. Because the first image data and the first text data correspond, that is, a mapping relationship exists between them, the first model can learn that mapping relationship to obtain the target object category conveyed by the image-text data. For example, if the first image data contains the target object "human body", the corresponding first text data describes that the first image data contains a target object and that this target object is a human body. In other words, through the mapping relationship established between the first image data and its corresponding first text data, the target object in the image data is identified as a human body, rather than a cat, a dog, a plant, or the like.
In some examples, the pre-training parameters are used to characterize the target object category derived from the image-text data.
By adopting this embodiment, during the first pre-training the first model can obtain the pre-training parameters by learning the mapping relationship between the first image data and the first text data. Since the pre-training parameters characterize the target object category obtained from the image-text data, the first pre-training yields, in advance, the parameters needed to identify target object categories. These parameters can then be loaded directly during the second pre-training of the first model, which speeds up the training iterations, lets training converge sooner, and improves the accuracy of the trained model.
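As a small illustration of how such image-text pairs might be fed to the first model, a hypothetical PyTorch dataset is sketched below; images, captions, and tokenize are assumed inputs, not interfaces specified by the disclosure:

from torch.utils.data import Dataset

class ImageTextPairs(Dataset):
    # Each caption describes the target object in its paired image, supplying
    # the mapping relationship that the first pre-training learns from.
    def __init__(self, images, captions, tokenize):
        assert len(images) == len(captions)       # one caption per image
        self.images, self.captions, self.tokenize = images, captions, tokenize

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        # returns (image tensor, token ids of the paired text description)
        return self.images[i], self.tokenize(self.captions[i])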
In one embodiment, performing the first pre-training according to the mapping relationship between the first image data and the first text data to obtain the pre-training parameters includes: obtaining, in the first model, a first pre-training target according to the mapping relationship between the first image data and the first text data; and performing the first pre-training according to the first pre-training target while performing parameter adjustment on a mapping module in the first model, to obtain the pre-training parameters.
In some examples, in the first model, the first image data may be input into a first processing branch and feature extraction performed on it to obtain first image features, and the first text data may be input into a second processing branch and feature extraction performed on it to obtain first text features. The first image features and the first text features are each mapped into the same target feature space, giving the mapping relationship between the first image data and the first text data in that space, and the first pre-training target is obtained from this mapping relationship.
In some examples, the mapping module provided in the first model may be a vision-to-text projection layer, and parameter adjustment is performed on this mapping module to obtain the pre-training parameters.
By adopting this embodiment, the extracted features can be mapped into the same target feature space by the mapping module during the second pre-training of the first model. The pre-training parameters used by the mapping module are loaded in the second pre-training, and they can be fixed so as to constrain the first model to inherit, during the second pre-training, the prior information learned in the first pre-training. The prior information includes the category labels indicated by the mapping relationship between the first image data and the first text data in the same target feature space. Taking single-target tracking as an example, when the second model is used for single-target tracking, the category of the object to be tracked can be determined from these category labels. Because the second model is trained from image-text data and inherits the prior information obtained from the mapping relationship within that data, it can accurately identify objects to be tracked of previously unknown categories, improving recognition accuracy on unknown categories.
As shown in fig. 3, in the first pre-training process the input of the first model (here, the first model before pre-training) is image-text data (such as the first image data and the corresponding first text data). The first model may have a twin-network structure, comprising, in a first processing branch of the twin network, a convolutional layer 301, a vision-to-text projection layer 302, and a first projection matrix 303, and, in a second processing branch, a pre-training layer 304 and a second projection matrix 305. The convolutional layer 301 may adopt a Convolutional Neural Network (CNN) and is used to extract features from the first image data. The vision-to-text projection layer 302, an example of the mapping module in the first model mentioned above, projects the image features (or visual features) of the first image data and the text features of the first text data into the same target feature space (which may be a language feature space) and establishes the mapping relationship between the image features and their corresponding text features, so that the first pre-training target can be obtained from that mapping relationship. The first projection matrix 303 is the projection matrix obtained by mapping the image features (or visual features) of the first image data into the language feature space, i.e., the projection matrix of the vision-to-text layer. The pre-training layer 304 pre-trains on the first text data to improve the accuracy of the text processing; it may also slice the first text data into text segments (segments formed of multiple characters) or single characters. A text segment might read "the first image data contains a vehicle to be positioned"; in this example a mapping relationship exists between this textual description and the target object "vehicle" in the first image data. The pre-training layer 304 may also extract features from the first text data to obtain the text features. The second projection matrix 305 is the projection matrix obtained by mapping the text features of the first text data into the language feature space. A first loss function is obtained by comparing the values of the first projection matrix 303 and the second projection matrix 305; this first loss function serves as the first pre-training target, and the first model is pre-trained according to it.
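The Fig. 3 structure can be sketched as follows in PyTorch; layer names mirror the reference numerals 301-305, while the layer sizes, the text encoder, and the cosine form of the alignment loss are assumptions of this sketch (the disclosure states only that the first loss compares the outputs of the two projections):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FirstPretrainModel(nn.Module):
    def __init__(self, vocab_size=30000, dim=256):
        super().__init__()
        # 301: convolutional image encoder (a small CNN stand-in)
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        # 302/303: vision-to-text projection into the language feature space
        self.vision_to_text = nn.Linear(dim, dim)
        # 304/305: text encoder and its projection into the same space
        self.text_encoder = nn.EmbeddingBag(vocab_size, dim)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, image, token_ids):
        v = self.vision_to_text(self.conv(image))          # image -> language space
        t = self.text_proj(self.text_encoder(token_ids))   # text  -> language space
        return v, t

def first_pretrain_loss(v, t):
    # One plausible first pre-training target: pull the two projections of a
    # matching image-text pair together via cosine alignment.
    return 1 - F.cosine_similarity(v, t, dim=-1).mean()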
In an embodiment, performing a second pre-training on the first model according to the training data and the pre-training parameters to obtain a second model, includes: extracting first image sample data and second image sample data from the training data, inputting the first image sample data and the second image sample data into the first model, and performing second pre-training on the first model under the condition that a mapping module in the first model loads the pre-training parameters to obtain a second model.
In some examples, after the first pre-training, the model further goes through the second pre-training, which is the formal training process of the model. Once the pre-training parameters used by the mapping module in the first model have been obtained through the first pre-training, these parameters can be fixed during the formal training to constrain it. Then, after the first image sample data and the second image sample data are input into the first model, the second pre-training is performed with the mapping module loading the pre-training parameters, which ensures that the first model inherits, during the second pre-training, the prior information learned in the first pre-training.
By adopting this embodiment, the pre-training parameters are tuned in the pre-training stage of the first pre-training and then loaded and used directly by the mapping module in the formal training stage of the second pre-training, so that model training iterates faster and the finally trained model has better model performance (e.g., model accuracy).
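Continuing the hypothetical names from the sketches above, loading and fixing the mapping-module parameters for the second pre-training might look like this:

# Load the first-stage parameters into the mapping module and freeze them,
# so the second (formal) pre-training inherits the image-text prior.
model.vision_to_text.load_state_dict(pretrain_params)
for p in model.vision_to_text.parameters():
    p.requires_grad = False                      # fix the projection weights

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)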
In one embodiment, the method further comprises: loading the pre-training parameters through the mapping module in the first model to obtain prior information, and determining, in the first model, the categories of the objects to be tracked in the first image sample data and the second image sample data according to the prior information. The prior information includes the category label indicated by the mapping relationship between the first image data and the first text data in the same target feature space.
By adopting this embodiment, the first model can inherit the prior information, so that after the second pre-training produces the trained second model, the category of an object to be tracked can be accurately identified from the category labels, locking onto the target object among the objects to be tracked.
In one embodiment, inputting the first image sample data and the second image sample data into the first model, and performing the second pre-training on the first model with the mapping module in the first model loading the pre-training parameters to obtain the second model, includes: inputting the first image sample data into a first processing branch of the first model and performing feature extraction on it to obtain first image sample features; inputting the second image sample data into a second processing branch of the first model and performing feature extraction on it to obtain second image sample features; mapping the first image sample features and the second image sample features through the mapping module into the same target feature space; in that same target feature space, performing similarity matching, according to the prior information, between the features that represent the category of the object to be tracked in the first image sample features and the second image sample features, to obtain a matching result; and obtaining a second pre-training target from the matching result and performing the second pre-training according to it to obtain the second model.
By adopting this embodiment, the first image sample features and the second image sample features can each be mapped by the mapping module into the same target feature space. Through the mapping relationship between the first image data and the first text data, the first model has inherited the prior information obtained from that relationship, so in the shared target feature space similarity matching can be performed, based on the prior information, between the features that represent the category of the object to be tracked in the first and second image sample features. Having inherited the prior information, the model can determine the category of the tracked object from it; similarity matching therefore yields a second pre-training target (such as a second loss function), and training produces the second model. Because categories can be identified based on this prior information, an accurate matching result can be obtained.
As shown in fig. 4, in the second pre-training process the input of the first model (specifically, the first model after the first pre-training) is image data, such as first image sample data from the first image sample set and second image sample data from the second image sample set. The first image sample set contains at least one piece of first image sample data (e.g., an original image containing several objects to be tracked, among them a target object), and the second image sample set contains at least one piece of second image sample data (e.g., a small image cropped from the original image that contains only the target object). The cropped image is combined with the original image to check whether the original image contains the target object shown in the crop.
As shown in fig. 4, the first model may have a twin-network structure, comprising, in a first processing branch of the twin network, a first convolutional layer 401 and a first vision-to-text projection layer 403, and, in a second processing branch, a second convolutional layer 405 and a second vision-to-text projection layer 407. The first convolutional layer 401 may adopt a CNN structure and extracts features from the first image sample data to obtain first image features 402 in the original feature space; the 9 blocks in the first image features 402 represent the image features at the different positions of the several objects to be tracked in one image, indicating where those objects are located. They can be mapped by the first vision-to-text projection layer 403, converting the first image features 402 from the original feature space into a target feature space (e.g., a language feature space), to obtain third image features 404 in the target feature space. The second convolutional layer 405 extracts features from the second image sample data to obtain second image features 406 in the original feature space; the 1 block in the second image features 406 represents the target object among the several objects to be tracked, indicating the position of the image where the target is located. It can be mapped by the second vision-to-text projection layer 407, converting the second image features 406 from the original feature space into the target feature space, to obtain a fourth image feature 408. The first vision-to-text projection layer 403 and the second vision-to-text projection layer 407 serve as examples of the aforementioned mapping module in the first model; through them, similarity matching of image features can be performed in the same target feature space. Because the pre-trained first model inherits the prior information obtained from the mapping relationship (i.e., the mapping relationship between the first image data and the first text data), similarity matching between the third image features 404 and the fourth image feature 408 can be performed in that space based on the prior information, and because the category of the target object can be identified from the prior information, an accurate matching result can be obtained. The similarity matching may be a similarity computation between 1 feature (the 1 block of the fourth image feature 408) and 9 features (the 9 blocks of the third image features 404); specifically, the matching result can be obtained by computing cosine distances.
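The 1-versus-9 cosine matching of Fig. 4 can be written compactly as below, with assumed tensor shapes; how the scores feed the second loss is left open, since the disclosure only requires that a second pre-training target be derived from the matching result:

import torch
import torch.nn.functional as F

# search_feats: (B, 9, D) - third image features 404, one per search position
# target_feat:  (B, 1, D) - fourth image feature 408 for the cropped target
# Both are assumed to be already projected into the shared feature space.
def match(search_feats: torch.Tensor, target_feat: torch.Tensor) -> torch.Tensor:
    scores = F.cosine_similarity(search_feats, target_feat, dim=-1)  # (B, 9)
    return scores  # the highest score marks the position most similar to the target

# For example, a cross-entropy over the 9 positions against the ground-truth
# position could serve as the second pre-training target.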
According to an embodiment of the present disclosure, a target tracking method is provided. Fig. 5 is a flowchart of the target tracking method according to an embodiment of the present disclosure. The method may be applied to a target tracking apparatus; for example, the apparatus may be deployed in a terminal, a server, or another processing device of a single-machine, multi-machine, or cluster system, and can carry out target tracking and related processing. The terminal may be a User Equipment (UE), a mobile device, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the method may also be implemented by a processor calling computer-readable instructions stored in a memory. As shown in fig. 5, the method is applied to any node or electronic device (a mobile phone, a desktop, etc.) in the cluster system shown in fig. 1 and includes:
s501, acquiring a first image frame and an Nth image frame from video stream data, wherein N is a positive integer larger than 2.
S502, inputting the first image frame and the Nth image frame into a second model for target tracking, wherein the second model is obtained by loading pre-training parameters to perform model training.
S503, according to the second model, the categories of the objects to be tracked in the first image frame and the Nth image frame are identified, and identification results are obtained.
And S504, tracking the target according to the recognition result.
In an example of S501-S504, the video stream data includes a plurality of image frames, for example three image frames, in which several objects to be tracked appear, one of which is the target object. First, the second model is used to identify the categories of the objects to be tracked in the first image frame and the third image frame, yielding a recognition result. Because the second model is obtained by loading the pre-training parameters (which characterize the target object category obtained from the image-text data), the target object can be identified in the first and third image frames. Target tracking is then performed according to the recognition result, that is, the position of the target object present in both the first and third image frames is tracked. The second model here is the one obtained through the first pre-training and the second pre-training of the above embodiments, as shown in fig. 4.
By adopting the method and the device, the target object is identified and tracked through the second model obtained by loading the pre-training parameters, so that the identification precision is improved, and the target tracking precision is further improved.
In one embodiment, performing target tracking according to the recognition result includes: determining, according to the recognition result, the same object to be tracked contained in both the first image frame and the Nth image frame; taking that same object to be tracked as the target object; and performing target tracking according to the position changes of the target object to obtain its current target position.
By adopting this embodiment, the same object to be tracked in the first image frame and the Nth image frame can be locked onto according to the recognition result of the second model. That object is taken as the target object and tracked according to its position changes, finally yielding the current target position of the target object with high tracking accuracy.
In some examples, as shown in fig. 6, in an application scenario using the above target tracking method, the video stream data includes a plurality of image frames, such as three image frames, in which several objects to be tracked appear, one of which is the target object. When tracking the target object, the original image 600 of an image frame containing several objects to be tracked (such as objects to be tracked 601-603) is extracted, and a cropped image 604 containing one object to be tracked 602 is cut out of the original image; the object to be tracked 602 is the target object, and the crop 604 serves as the target search template for locking onto the target object within the original image 600. Specifically, the original image 600 and the crop 604 are input into the second model for similarity matching; the second model identifies the categories of the objects to be tracked in the first and third image frames to obtain a recognition result, and because the second model is obtained by loading the pre-training parameters (which characterize the target object category obtained from the image-text data), the target object can be identified in the first and third image frames. Target tracking is then performed on the same target object according to the recognition result, that is, the position of the target object present in both the first and third image frames is tracked.
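Purely as an assumed interface, inference with the trained second model could proceed as follows; the crop coordinates, the tensor layout of the frames, and the model's score output are illustrative assumptions:

import torch

def track(second_model, frames, init_box):
    # frames: list of (C, H, W) tensors; init_box: (x0, y0, x1, y1) in frame 0
    x0, y0, x1, y1 = init_box
    template = frames[0][:, y0:y1, x0:x1]        # crop containing the target object
    positions = []
    for frame in frames[1:]:                     # the Nth image frame, N > 1
        with torch.no_grad():
            scores = second_model(frame.unsqueeze(0), template.unsqueeze(0))
        positions.append(int(scores.argmax()))   # index of the best-matching position
    return positions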
The following is an example of the model training method provided in the embodiment of the present disclosure.
In single-target tracking, an instance object box is selected in the starting frame of a continuous video; in subsequent frames, the current position of the target object is determined either by extracting features and computing similarity, or by computing the intersection-over-union (IoU) with previous positions.
One approach to single-target tracking is kernelized correlation filtering. It expands the number of negative samples by means of a circulant matrix, and the larger sample count strengthens the tracking model (e.g., the filter). Although Histogram of Oriented Gradients (HOG) features and the circulant-matrix construction improve the robustness of the filter, and the computation is reduced by converting from the image domain to the frequency domain (using the property that convolution in the time domain equals multiplication in the frequency domain), such traditional hand-crafted features are far less rich than features extracted by a convolutional neural network and struggle to generalize well.
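The frequency-domain property mentioned above is the core of this family of trackers. A minimal MOSSE-style correlation filter in NumPy illustrates it; this is a generic sketch of that family of methods, not the method of the present disclosure:

import numpy as np

def train_filter(patch: np.ndarray, peak: np.ndarray, lam: float = 1e-2):
    # Solve, per frequency, for a filter H whose correlation with the
    # template patch reproduces a desired response peak (typically a
    # 2-D Gaussian centered on the target); lam regularizes the division.
    Fp = np.fft.fft2(patch)
    G = np.fft.fft2(peak)
    return (G * np.conj(Fp)) / (Fp * np.conj(Fp) + lam)

def respond(H: np.ndarray, patch: np.ndarray) -> np.ndarray:
    # Correlation as an inverse FFT of an element-wise product;
    # the location of the maximum gives the target position.
    return np.real(np.fft.ifft2(np.fft.fft2(patch) * H))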
Another approach treats single-target tracking as a template-matching problem using a model with a twin-network structure. The model does not need to understand the target object; when a new frame arrives it only needs to find the same target object again, thereby determining the current target position. Because of the diversity of the real world, this scheme depends heavily on its training data: if a relevant category is absent from the training data, tracking quality drops significantly.
In this application example, the finally trained second model exploits the ready availability of massive image-text data and the richness of its content, using the prior information of a model pre-trained on that data to improve single-target tracking accuracy, and in particular its applicability to unknown categories. The mapping relationship between the image data and its corresponding text data in the massive image-text corpus can be used to identify unknown categories. Specifically, the first pre-training of the (not yet pre-trained) first model is performed on the massive image-text data, pre-training a model with a twin-network structure. Because related categories appear in the massive image-text data, the twin-network pre-training can inherit the prior information and therefore generalizes well to unknown categories; image features (or visual features) are mapped into the language feature space according to the projection matrix, and the pre-training parameters are obtained when the first pre-training finishes. These pre-training parameters are loaded in the second pre-training of the pre-trained first model and can be fixed, which fixes the weights of the projection matrix, so that the trained model inherits the prior information of the massive image-text data and the consistency from image features (or visual features) to language features is preserved. This improves the generalization ability of the single-target tracker on unknown categories, namely: with the second model, a single-target tracking algorithm achieves accurate tracking across different categories, especially unknown categories absent from the training data set.
According to an embodiment of the present disclosure, a model training apparatus is provided, fig. 7 is a schematic diagram of a composition structure of the model training apparatus according to the embodiment of the present disclosure, and as shown in fig. 7, the model training apparatus includes: the first training module 701 is used for performing first pre-training on a first model according to the image-text data to obtain pre-training parameters loaded by the first model in second pre-training; a first constructing module 702, configured to construct training data according to the first image sample set and the second image sample set; the second training module 703 is configured to perform second pre-training on the first model according to the training data and the pre-training parameters, so as to obtain a second model.
In one embodiment, the pre-training parameters are used to characterize the target object category derived from the image-text data.
In one embodiment, the first training module 701 is configured to extract first image data and first text data corresponding to the first image data from the image-text data; inputting the first image data and the first text data into the first model; and performing the first pre-training according to the mapping relation between the first image data and the first text data to obtain the pre-training parameters.
In an embodiment, the first training module 701 is configured to obtain, in the first model, a first pre-training target according to a mapping relationship between the first image data and the first text data; and performing the first pre-training according to the first pre-training target, and performing parameter adjustment on a mapping module in the first model to obtain the pre-training parameters.
In an embodiment, the first training module 701 is configured to input the first image data into a first processing branch of the first model, and perform feature extraction on the first image data to obtain a first image feature; inputting the first text data into a second processing branch of the first model, and performing feature extraction on the first text data to obtain a first text feature; mapping the first image characteristic and the first text characteristic respectively, and mapping the first image characteristic and the first text characteristic into the same target characteristic space to obtain a mapping relation between the first image data and the first text data in the same target characteristic space; and obtaining the first pre-training target according to the mapping relation between the first image data and the first text data in the same target feature space.
In one embodiment, the second training module 703 is configured to extract first image sample data and second image sample data from the training data; and inputting the first image sample data and the second image sample data into the first model, and performing the second pre-training on the first model under the condition that the mapping module in the first model loads the pre-training parameters to obtain the second model.
In an embodiment, the method further includes a category determining module, configured to load the pre-training parameter through the mapping module in the first model to obtain prior information; in the first model, determining the category of an object to be tracked in the first image sample data and the second image sample data according to the prior information; wherein the prior information comprises: a category label indicated by a mapping relationship between the first image data and the first text data in the same target feature space.
In one embodiment, the second training module 703 is configured to input the first image sample data into a first processing branch of the first model and perform feature extraction on it to obtain first image sample features; input the second image sample data into a second processing branch of the first model and perform feature extraction on it to obtain second image sample features; map the first image sample features and the second image sample features through the mapping module into the same target feature space; in that same target feature space, perform similarity matching, according to the prior information, between the features that represent the category of the object to be tracked in the first image sample features and the second image sample features, to obtain a matching result; and obtain a second pre-training target according to the matching result and perform the second pre-training according to it to obtain the second model.
According to an embodiment of the present disclosure, a target tracking apparatus is provided. Fig. 8 is a schematic diagram of the composition structure of the target tracking apparatus according to the embodiment of the present disclosure. As shown in fig. 8, the target tracking apparatus includes: a first obtaining module 801, configured to obtain a first image frame and an Nth image frame from video stream data, where N is a positive integer greater than 2; a first processing module 802, configured to input the first image frame and the Nth image frame into a second model for target tracking, where the second model is obtained by loading pre-training parameters for model training; a second processing module 803, configured to identify the categories of the objects to be tracked in the first image frame and the Nth image frame according to the second model, to obtain a recognition result; and a target tracking module 804, configured to perform target tracking according to the recognition result.
In one embodiment, the pre-training parameters are used to characterize the target object category derived from the image-text data.
In an embodiment, the target tracking module 804 is configured to determine, according to the identification result, a same object to be tracked included in the first image frame and the nth image frame; and taking the same object to be tracked as a target object, and carrying out target tracking according to the position change of the target object to obtain the current target position corresponding to the target object.
In the technical scheme of the disclosure, the acquisition, storage, and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the electronic device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the electronic device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the respective methods and processes described above, such as the model training method/target tracking method. For example, in some embodiments, the model training method/target tracking method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the model training method/target tracking method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the model training method/target tracking method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine; partly on the machine; as a stand-alone software package, partly on the machine and partly on a remote machine; or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders; no limitation is imposed herein as long as the desired results of the technical solutions of the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (25)

1. A model training method, comprising:
performing first pre-training on a first model according to image-text data to obtain pre-training parameters loaded by the first model in second pre-training;
constructing training data according to the first image sample set and the second image sample set;
and performing second pre-training on the first model according to the training data and the pre-training parameters to obtain a second model.
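Illustratively, the two-stage flow of claim 1 can be outlined in Python as below. This is a minimal, hypothetical sketch, not an implementation the claim prescribes: first_pretrain and second_pretrain are placeholder stubs for the stages detailed in claims 3-5 and 6-8, and zipping the two sample sets is one assumed way of constructing the training data.

```python
# Hypothetical outline of the two-stage pre-training of claim 1.
# All helper names are illustrative placeholders.

def first_pretrain(model, image_text_data):
    """Stage 1: pre-train on paired image-text data and return the
    parameters that the mapping module loads in stage 2 (claims 3-5)."""
    raise NotImplementedError

def second_pretrain(model, training_data, pretrain_params):
    """Stage 2: continue pre-training on image-only sample pairs with
    the stage-1 parameters loaded (claims 6-8)."""
    raise NotImplementedError

def build_second_model(first_model, image_text_data, first_set, second_set):
    # First pre-training on image-text data yields the pre-training
    # parameters to be loaded in the second pre-training.
    pretrain_params = first_pretrain(first_model, image_text_data)
    # Training data is constructed from the two image sample sets.
    training_data = list(zip(first_set, second_set))
    # Second pre-training with the loaded parameters yields the second model.
    return second_pretrain(first_model, training_data, pretrain_params)
```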
2. The method according to claim 1, wherein the pre-training parameters are used to characterize a target object category derived from the image-text data.
3. The method according to claim 1 or 2, wherein performing the first pre-training on the first model according to the image-text data to obtain the pre-training parameters loaded by the first model in the second pre-training comprises:
extracting first image data and first text data corresponding to the first image data from the image-text data;
inputting the first image data and the first text data into the first model;
and performing the first pre-training according to the mapping relation between the first image data and the first text data to obtain the pre-training parameters.
4. The method of claim 3, wherein the performing the first pre-training according to the mapping relationship between the first image data and the first text data to obtain the pre-training parameters comprises:
in the first model, obtaining a first pre-training target according to the mapping relation between the first image data and the first text data;
and performing the first pre-training according to the first pre-training target, and performing parameter adjustment on a mapping module in the first model to obtain the pre-training parameters.
5. The method of claim 4, wherein obtaining a first pre-training target in the first model according to a mapping relationship between the first image data and the first text data comprises:
inputting the first image data into a first processing branch of the first model, and performing feature extraction on the first image data to obtain a first image feature;
inputting the first text data into a second processing branch of the first model, and performing feature extraction on the first text data to obtain a first text feature;
mapping the first image feature and the first text feature into the same target feature space, respectively, to obtain a mapping relation between the first image data and the first text data in the same target feature space;
and obtaining the first pre-training target according to the mapping relation between the first image data and the first text data in the same target feature space.
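Claims 3-5 describe a dual-branch structure: image and text features are extracted by separate branches, projected by a mapping module into one target feature space, and the mapping relation between matched pairs supplies the first pre-training target. A minimal PyTorch sketch of that structure follows; the linear encoder stubs, the dimensions, and the symmetric InfoNCE-style loss are assumptions chosen for illustration, since the claims fix neither an architecture nor a concrete loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchModel(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, shared_dim=512):
        super().__init__()
        # First processing branch: image feature extraction (stub; a
        # real model would place a vision backbone here).
        self.image_branch = nn.Linear(img_dim, 1024)
        # Second processing branch: text feature extraction (stub).
        self.text_branch = nn.Linear(txt_dim, 1024)
        # Mapping module: projects both modalities into the same target
        # feature space; its parameters are the "pre-training
        # parameters" loaded in the second pre-training.
        self.image_proj = nn.Linear(1024, shared_dim)
        self.text_proj = nn.Linear(1024, shared_dim)

    def forward(self, image_feats, text_feats):
        img = F.normalize(self.image_proj(self.image_branch(image_feats)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_branch(text_feats)), dim=-1)
        return img, txt

def first_pretrain_target(img, txt, temperature=0.07):
    # Mapping relation: similarity of every image to every text in the
    # batch; matched image-text pairs lie on the diagonal.
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    # Symmetric cross-entropy over both mapping directions serves as
    # the first pre-training target.
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2
```

Under this reading, image_proj and text_proj play the role of the mapping module whose adjusted parameters are carried into the second pre-training.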
6. The method of claim 5, wherein performing the second pre-training on the first model according to the training data and the pre-training parameters to obtain the second model comprises:
extracting first image sample data and second image sample data from the training data;
and inputting the first image sample data and the second image sample data into the first model, and performing the second pre-training on the first model under the condition that the mapping module in the first model loads the pre-training parameters to obtain the second model.
7. The method of claim 6, further comprising:
loading the pre-training parameters through the mapping module in the first model to obtain prior information;
in the first model, determining the category of an object to be tracked in the first image sample data and the second image sample data according to the prior information;
wherein the prior information comprises: a category label indicated by a mapping relationship between the first image data and the first text data in the same target feature space.
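One concrete reading of claim 7 is that the loaded mapping parameters retain text-side class embeddings as prior information, so the category of an object to be tracked can be assigned by nearest-neighbor matching in the shared target feature space. The sketch below assumes that reading; the shapes and the cosine criterion are illustrative, not claimed.

```python
import torch
import torch.nn.functional as F

def assign_category(object_feat: torch.Tensor,
                    class_embeddings: torch.Tensor,
                    class_labels: list) -> str:
    # object_feat: (D,) image-side feature already mapped into the
    # shared target feature space.
    # class_embeddings: (C, D) text-side embeddings retained from the
    # first pre-training, acting as the prior information.
    sims = F.cosine_similarity(object_feat.unsqueeze(0), class_embeddings, dim=-1)
    # The category label of the most similar text embedding.
    return class_labels[int(sims.argmax())]
```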
8. The method of claim 7, wherein inputting the first image sample data and the second image sample data into the first model and performing the second pre-training on the first model with the pre-training parameters loaded by the mapping module in the first model to obtain the second model comprises:
inputting the first image sample data into a first processing branch of the first model, and performing feature extraction on the first image sample data to obtain first image sample features;
inputting the second image sample data into a second processing branch of the first model, and performing feature extraction on the second image sample data to obtain second image sample features;
mapping the first image sample feature and the second image sample feature into the same target feature space through the mapping module, respectively;
in the same target feature space, performing similarity matching on features used for representing the category of the object to be tracked in the first image sample feature and the second image sample feature according to the prior information to obtain a matching result;
and obtaining a second pre-training target according to the matching result, and performing second pre-training according to the second pre-training target to obtain the second model.
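Claim 8 turns the similarity-matching result between the two mapped image samples into the second pre-training target. As an assumed concrete form, the sketch below uses a classic margin-based contrastive loss in the shared space, with the same-category signal supplied by the prior information of claim 7; the claims themselves do not prescribe this particular loss.

```python
import torch
import torch.nn.functional as F

def second_pretrain_target(feat_a: torch.Tensor,
                           feat_b: torch.Tensor,
                           same_category: torch.Tensor,
                           margin: float = 0.5) -> torch.Tensor:
    # feat_a, feat_b: (B, D) first/second image sample features mapped
    # into the same target feature space by the loaded mapping module.
    # same_category: (B,) 1.0 where the prior information assigns both
    # samples the same category label, else 0.0.
    dist = F.pairwise_distance(feat_a, feat_b)
    # Pull same-category pairs together; push different-category pairs
    # at least `margin` apart.
    pos = same_category * dist.pow(2)
    neg = (1.0 - same_category) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()
```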
9. A target tracking method, comprising:
acquiring a first image frame and an Nth image frame from video stream data, wherein N is a positive integer greater than 2;
inputting the first image frame and the Nth image frame into a second model for target tracking, wherein the second model is obtained by model training with loaded pre-training parameters;
according to the second model, identifying the categories of the objects to be tracked in the first image frame and the Nth image frame to obtain an identification result;
and tracking the target according to the identification result.
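As a usage illustration of claim 9, the hypothetical driver below pulls the first and Nth frames from a video stream with OpenCV and hands them to the trained second model. The second_model call signature is a placeholder, since the claims do not define an inference API.

```python
import cv2  # OpenCV, used only for video decoding

def recognize_in_stream(video_path: str, n: int, second_model):
    assert n > 2, "claim 9 requires N to be a positive integer greater than 2"
    cap = cv2.VideoCapture(video_path)
    first_frame, nth_frame = None, None
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        index += 1
        if index == 1:
            first_frame = frame
        if index == n:
            nth_frame = frame
            break
    cap.release()
    # Identify the categories of the objects to be tracked in both
    # frames; the returned recognition result drives the tracking step.
    return second_model(first_frame, nth_frame)
```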
10. The method of claim 9, wherein the pre-training parameters are used to characterize a target object category derived from the image-text data.
11. The method according to claim 9 or 10, wherein performing target tracking according to the identification result comprises:
determining the same object to be tracked in the first image frame and the Nth image frame according to the identification result;
and taking the same object to be tracked as a target object and performing target tracking according to the position change of the target object to obtain the current target position corresponding to the target object.
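Claim 11 then reduces to reading off the position change of the matched target object between the two frames. A small sketch, assuming (x, y, w, h) bounding boxes as the position representation:

```python
def current_target_position(box_first, box_nth):
    # box_first, box_nth: (x, y, w, h) boxes of the same object to be
    # tracked in the first and Nth image frames.
    cx0 = box_first[0] + box_first[2] / 2.0
    cy0 = box_first[1] + box_first[3] / 2.0
    cxn = box_nth[0] + box_nth[2] / 2.0
    cyn = box_nth[1] + box_nth[3] / 2.0
    # The centroid displacement summarizes the position change; the
    # Nth-frame box is the current target position.
    return box_nth, (cxn - cx0, cyn - cy0)
```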
12. A model training apparatus comprising:
the first training module is used for carrying out first pre-training on the first model according to the image-text data to obtain pre-training parameters loaded by the first model in second pre-training;
the first construction module is used for constructing training data according to the first image sample set and the second image sample set;
and the second training module is used for carrying out second pre-training on the first model according to the training data and the pre-training parameters to obtain a second model.
13. The apparatus according to claim 12, wherein the pre-training parameters are used to characterize a target object category derived from the image-text data.
14. The apparatus of claim 12 or 13, wherein the first training module is used for:
extracting first image data and first text data corresponding to the first image data from the image-text data;
inputting the first image data and the first text data into the first model;
and performing the first pre-training according to the mapping relation between the first image data and the first text data to obtain the pre-training parameters.
15. The apparatus of claim 14, wherein the first training module is used for:
in the first model, obtaining a first pre-training target according to the mapping relation between the first image data and the first text data;
and performing the first pre-training according to the first pre-training target, and performing parameter adjustment on a mapping module in the first model to obtain the pre-training parameters.
16. The apparatus of claim 15, wherein the first training module is used for:
inputting the first image data into a first processing branch of the first model, and performing feature extraction on the first image data to obtain a first image feature;
inputting the first text data into a second processing branch of the first model, and performing feature extraction on the first text data to obtain a first text feature;
mapping the first image feature and the first text feature into the same target feature space, respectively, to obtain a mapping relation between the first image data and the first text data in the same target feature space;
and obtaining the first pre-training target according to the mapping relation between the first image data and the first text data in the same target feature space.
17. The apparatus of claim 16, wherein the second training module is used for:
extracting first image sample data and second image sample data from the training data;
and inputting the first image sample data and the second image sample data into the first model, and performing the second pre-training on the first model under the condition that the mapping module in the first model loads the pre-training parameters to obtain the second model.
18. The apparatus of claim 17, further comprising a category determination module used for:
loading the pre-training parameters through the mapping module in the first model to obtain prior information;
in the first model, determining the category of an object to be tracked in the first image sample data and the second image sample data according to the prior information;
wherein the prior information comprises: a category label indicated by a mapping relationship between the first image data and the first text data in the same target feature space.
19. The apparatus of claim 18, wherein the second training module is used for:
inputting the first image sample data into a first processing branch of the first model, and performing feature extraction on the first image sample data to obtain first image sample features;
inputting the second image sample data into a second processing branch of the first model, and performing feature extraction on the second image sample data to obtain second image sample features;
mapping the first image sample feature and the second image sample feature into the same target feature space through the mapping module, respectively;
in the same target feature space, performing similarity matching on features used for representing the category of the object to be tracked in the first image sample feature and the second image sample feature according to the prior information to obtain a matching result;
and obtaining a second pre-training target according to the matching result, and performing second pre-training according to the second pre-training target to obtain the second model.
20. An object tracking device, comprising:
the first acquisition module is used for acquiring a first image frame and an Nth image frame from video stream data, wherein N is a positive integer greater than 2;
the first processing module is used for inputting the first image frame and the Nth image frame into a second model for target tracking, wherein the second model is obtained by model training with loaded pre-training parameters;
the second processing module is used for identifying the types of the objects to be tracked in the first image frame and the Nth image frame according to the second model to obtain an identification result;
and the target tracking module is used for tracking the target according to the identification result.
21. The apparatus according to claim 20, wherein the pre-training parameters are used to characterize a target object category derived from the image-text data.
22. The apparatus of claim 20 or 21, wherein the target tracking module is used for:
determining the same object to be tracked in the first image frame and the Nth image frame according to the identification result;
and taking the same object to be tracked as a target object and performing target tracking according to the position change of the target object to obtain the current target position corresponding to the target object.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
24. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-11.
25. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-11.
Application and Publication Data

Application Number: CN202210583640.4A
Filing Date: 2022-05-25
Priority Date: 2022-05-25
Title: Model training method, target tracking method and device
Publication Number: CN114998678A
Publication Date: 2022-09-02
Family ID: 83029055
Country: CN
Legal Status: Pending

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination