CN112561053B - Image processing method, training method and device of pre-training model and electronic equipment - Google Patents

Image processing method, training method and device of pre-training model and electronic equipment

Info

Publication number
CN112561053B
CN112561053B CN202011249923.2A
Authority
CN
China
Prior art keywords
training
image
model
image processing
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011249923.2A
Other languages
Chinese (zh)
Other versions
CN112561053A (en)
Inventor
李超 (Li Chao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011249923.2A priority Critical patent/CN112561053B/en
Publication of CN112561053A publication Critical patent/CN112561053A/en
Priority to JP2021178829A priority patent/JP7267379B2/en
Application granted granted Critical
Publication of CN112561053B publication Critical patent/CN112561053B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image processing method, a training method and device of a pre-training model, and an electronic device, and relates to the technical fields of deep learning and computer vision. The specific implementation scheme is as follows: a trained pre-training model is acquired, where the pre-training model has been trained with multiple frames of training images so that, for the image features output by the trained model, the difference between the first image feature distance and the second image feature distance is minimized. A corresponding image processing model is then generated from this general pre-training model according to the target image processing task, which improves the efficiency of generating the image processing model corresponding to the target processing task. The generated image processing model is used to execute the target image processing task on the target image; because the image processing model corresponds to the target image processing task, both the effect and the efficiency of image processing are improved.

Description

Image processing method, training method and device of pre-training model and electronic equipment
Technical Field
The application relates to the technical field of image processing, in particular to the technical fields of deep learning and computer vision, and provides an image processing method, a training method and device of a pre-training model, and an electronic device.
Background
Neural-network-based image processing technology has been developed for many years: according to the image processing requirements, a trained image processing model is used to perform image processing and recognition. However, different image processing tasks have different requirements, and a single fixed image processing model cannot meet the requirements of every scene. How to improve the image processing effect is therefore a technical problem to be solved urgently.
Disclosure of Invention
The application provides an image processing method for improving an image processing effect, a training method and device of a pre-training model and electronic equipment.
According to an aspect of the present application, there is provided an image processing method including:
acquiring a trained pre-training model, wherein the pre-training model is trained by adopting multiple frames of training images, so that, for the image features output by the trained pre-training model, the difference between the first image feature distance and the second image feature distance is minimized; wherein the first image feature distance is a distance between image features of training images extracted from the same video clip, and the second image feature distance is a distance between image features of training images extracted from different video clips;
Generating an image processing model for executing the target image processing task according to the pre-training model;
and executing a target image processing task on the target image by adopting the image processing model.
According to another aspect of the present application, there is provided a training method of a pre-training model, including:
acquiring a plurality of video clips;
extracting multi-frame training images from the plurality of video clips to obtain a training set; at least two frames of training images are extracted from each video clip;
performing a multi-round training process on a pre-training model for image feature extraction using the training set; wherein, each round training process includes: selecting each training image extracted from at least two video clips from the training set; inputting each training image selected in the round into the pre-training model to obtain output image characteristics; according to the image characteristics of each training image selected by the round, determining a first image characteristic distance between training images belonging to the same video segment, and determining a second image characteristic distance between training images belonging to different video segments, and according to the first image characteristic distance and the second image characteristic distance, adjusting model parameters of the pre-training model so as to minimize the difference between the first image characteristic distance and the second image characteristic distance.
According to another aspect of the present application, there is provided an image processing apparatus including:
the acquisition module is used for acquiring a trained pre-training model, the pre-training model being trained with multiple frames of training images so that, for the image features output by the trained pre-training model, the difference between the first image feature distance and the second image feature distance is minimized; wherein the first image feature distance is a distance between image features of training images extracted from the same video clip, and the second image feature distance is a distance between image features of training images extracted from different video clips;
the generation module is used for generating an image processing model for executing a target image processing task according to the pre-training model;
and the processing module is used for executing a target image processing task on the target image by adopting the image processing model.
According to another aspect of the present application, there is provided a training apparatus for a pre-training model, including: the acquisition module is used for acquiring a plurality of video clips;
the extraction module is used for extracting multi-frame training images from the plurality of video clips so as to obtain a training set; at least two frames of training images are extracted from each video clip;
The training module is used for executing a multi-round training process on the pre-training model for image feature extraction by adopting the training set; wherein, each round training process includes: selecting each training image extracted from at least two video clips from the training set; inputting each training image selected in the round into the pre-training model to obtain output image characteristics; according to the image characteristics of each training image selected by the round, determining a first image characteristic distance between training images belonging to the same video segment, and determining a second image characteristic distance between training images belonging to different video segments, and according to the first image characteristic distance and the second image characteristic distance, adjusting model parameters of the pre-training model so as to minimize the difference between the first image characteristic distance and the second image characteristic distance.
According to another aspect of the present application, there is provided an electronic apparatus including:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image processing method of the first aspect or the training method of the pre-training model of the second aspect.
According to another aspect of the present application, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the image processing method of the first aspect, or to perform the training method of the pre-training model of the other aspect.
According to another aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, performs the image processing method of the first aspect, or performs the training method of the pre-training model of the second aspect.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:
fig. 1 is a schematic flow chart of an image processing method according to an embodiment of the present application;
FIG. 2 is a flowchart of another image processing method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an image processing model according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a training method of a pre-training model according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a training device for pre-training a model according to an embodiment of the present application;
fig. 7 is a block diagram of an electronic device according to an embodiment of the application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The image processing method, the training method and the training device of the pre-training model and the electronic equipment of the embodiment of the application are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of an image processing method according to an embodiment of the present application.
As shown in fig. 1, the method comprises the steps of:
Step 101, acquiring a trained pre-training model, wherein the pre-training model is trained by adopting multiple frames of training images, so that, for the image features output by the trained pre-training model, the difference between the first image feature distance and the second image feature distance is minimized; wherein the first image feature distance is a distance between image features of training images extracted from the same video clip, and the second image feature distance is a distance between image features of training images extracted from different video clips.
The pre-training model in this embodiment is trained by deep learning, which performs better on large data sets than other machine learning methods. Multiple frames of training images extracted from a plurality of video clips are used as the training set and input into the pre-training model, and the parameters of the pre-training model are adjusted iteratively until the result output by the model satisfies a preset threshold, at which point training ends. A general pre-training model is thus generated from a large amount of image data, which improves the efficiency of subsequently generating a target image processing model from this general model.
The training method of the pre-training model will be described in detail in the following embodiments of the training method of the pre-training model, which will not be described in detail in this embodiment.
Step 102, generating an image processing model for executing the target image processing task according to the pre-training model.
The target image processing task comprises an image classification task, a target detection task or an object identification task.
In the present application, after the pre-training model is generated, it serves as a pre-generated general-purpose model, so the image processing model corresponding to the target image processing task can be generated quickly from the image set corresponding to that task, which improves the efficiency of generating the image processing model corresponding to the target image processing task.
The image processing model may be a convolutional neural network (CNN) or a deep neural network (DNN); this embodiment is not limited in this respect.
And step 103, executing a target image processing task on the target image by adopting the image processing model.
Because the image processing model in this embodiment is derived from a general pre-training model obtained through pre-training, generating the image processing model corresponding to the target image processing task is efficient.
In the image processing method of the embodiment of the application, a trained pre-training model is obtained, where the pre-training model has been trained with multiple frames of training images so that, for the image features output by the trained model, the difference between the first image feature distance and the second image feature distance is minimized. A corresponding image processing model is then generated from the general pre-training model according to the target image processing task, which improves the efficiency of generating the image processing model corresponding to the target processing task, and the generated image processing model is used to execute the target image processing task on the target image.
In the above embodiment, to improve the efficiency of image processing, the image processing model corresponding to the target image processing task is generated according to the target image processing task and the pre-training model. As another possible implementation, the pre-training model is spliced with a network layer corresponding to the target processing task and then trained to obtain the corresponding image processing model, which improves the efficiency of generating the image processing model and the effect of image processing.
For this reason, based on the above embodiment, another image processing method is provided in the present embodiment, and fig. 2 is a schematic flow chart of another image processing method provided in the embodiment of the present application, as shown in fig. 2, the step 102 includes the following steps:
step 201, a network layer corresponding to a target image processing task is acquired.
In the application, the acquired network layer and the target image processing task have a corresponding relationship.
In one scenario, the target image processing task is an image classification task, and the corresponding network layer is a classification layer used to classify the target image, for example to determine the vehicle class (car, SUV, and the like) of the vehicles contained in the image to be classified.
In another scenario, the target image processing task is a target detection task, and the corresponding network layer is a detection network used to identify a target object contained in the target image, for example to determine, for a target image to be processed, whether the image contains an obstacle, or whether multiple images contain the same target object.
In still another scenario, the target image processing task is an object recognition task, and the corresponding network layer is used to recognize objects in the image, for example to recognize the types of objects contained in different areas of the target image to be processed, or the types of objects contained in the image as a whole.
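As a non-limiting illustration of the task-to-network-layer correspondence described above, the sketch below selects a head module by task type. It is only an assumption about how an embodiment might be organized: the helper name `make_task_head`, the parameters `feature_dim` and `num_classes`, and the concrete head architectures are all hypothetical and stand in for whatever classification layer, detection network, or recognition head a real implementation would use.

```python
import torch.nn as nn

def make_task_head(task: str, feature_dim: int, num_classes: int) -> nn.Module:
    """Hypothetical helper: return a network layer matching the target image processing task."""
    if task == "classification":
        # classification layer: map image features to class scores (e.g. car, SUV, ...)
        return nn.Linear(feature_dim, num_classes)
    if task == "detection":
        # detection head sketch: predict one box (x, y, w, h) plus an objectness score
        return nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, 5))
    if task == "recognition":
        # object recognition head: scores over object types present in the image
        return nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, num_classes))
    raise ValueError(f"unknown target image processing task: {task}")
```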
Step 202, splicing the pre-training model with a network layer, wherein the input of the network layer is the image characteristic output by the pre-training model, and the output of the network layer is the target image task processing result.
In this embodiment, after the general pre-training model is generated, it is spliced with the network layer corresponding to the target image processing task; as shown in fig. 3, the trained pre-training model is spliced with the network layer to obtain the image processing model to be trained. The image features output by the pre-training model are input into the network layer, and the output of the network layer is the target image task processing result.
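A minimal sketch of the splicing in step 202, assuming a PyTorch-style setting in which the pre-training model is a backbone module whose forward pass returns image features; the class name `ImageProcessingModel` and the assumption that the backbone outputs a flat feature vector are illustrative only.

```python
import torch.nn as nn

class ImageProcessingModel(nn.Module):
    """Pre-training model spliced with the network layer for the target task (sketch)."""

    def __init__(self, backbone: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # trained general pre-training model
        self.head = head          # network layer corresponding to the target image processing task

    def forward(self, images):
        features = self.backbone(images)  # image features output by the pre-training model
        return self.head(features)        # target image task processing result
```

For example, `ImageProcessingModel(pretrained_backbone, make_task_head("classification", 2048, 10))` would splice the backbone with a hypothetical 10-class classification layer.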
Step 203, training the spliced pre-training model and the network layer by adopting a training set of the target image processing task to obtain an image processing model.
In this embodiment, for each target image processing task, the spliced pre-training model and network layer are trained with the training set corresponding to that task in order to quickly obtain the corresponding image processing model. In other words, the trained image processing model corresponds to the target image processing task: training is performed after splicing the pre-trained general model with the corresponding network layer. As one possible implementation, the parameters of the network layer can be adjusted mainly according to the requirements of the target image processing task, which improves the training efficiency of the corresponding image processing model while meeting the processing requirements of different target image processing tasks and of different scenes.
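The training of the spliced model described in step 203 could look roughly like the sketch below, assuming an image classification task with a standard cross-entropy loss; the data loader `task_loader`, the choice of optimizer and learning rate, and the option of freezing the backbone so that mainly the network-layer parameters are adjusted are all assumptions, not prescriptions of the patent.

```python
import torch
import torch.nn as nn

def finetune(model: nn.Module, task_loader, epochs: int = 10, freeze_backbone: bool = True) -> nn.Module:
    """Train the spliced pre-training model + network layer on the target-task training set (sketch).

    `model` is assumed to be the spliced model from the previous sketch (backbone + head).
    """
    if freeze_backbone:
        # one possible implementation: mainly adjust the network-layer parameters
        for p in model.backbone.parameters():
            p.requires_grad = False
    criterion = nn.CrossEntropyLoss()  # assumes an image classification task
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=1e-3)
    for _ in range(epochs):
        for images, labels in task_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```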
In the image processing method of this embodiment, the general pre-trained model is spliced with the corresponding network layer, where the input of the network layer is the image features output by the pre-training model and the output of the network layer is the target image task processing result, and the spliced model is then trained. This improves the efficiency of generating the image processing model and the effect of image processing.
In order to implement the above embodiment, the present embodiment provides a training method of a pre-training model.
Fig. 4 is a flow chart of a training method of a pre-training model according to an embodiment of the present application, as shown in fig. 4, the method includes the following steps:
in step 401, a plurality of video clips are acquired.
In one possible implementation manner of the embodiment of the present application, at least one video may be acquired, and each video is randomly segmented into a plurality of video segments.
In one possible implementation, in order to obtain more video clips, multiple videos may be acquired, and each video is segmented according to the content difference between adjacent image frames to obtain a plurality of video clips per video. In this way, the content of the frames within each segmented video clip changes continuously, which improves the continuity of frames within a clip.
In another possible implementation of the embodiment of the application, a single video may be acquired and segmented according to the content difference between adjacent image frames to obtain a plurality of video clips; here too, the content of the frames within each segmented clip changes continuously, improving the continuity of frames within a clip.
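A minimal sketch of content-difference-based segmentation, assuming OpenCV for decoding and a mean absolute grey-level difference between adjacent frames as the change measure; the threshold value and the change measure itself are assumptions, since the patent does not fix a particular criterion.

```python
import cv2
import numpy as np

def split_into_clips(video_path: str, diff_threshold: float = 30.0):
    """Split a video into clips wherever adjacent frames differ strongly in content (sketch)."""
    cap = cv2.VideoCapture(video_path)
    clips, current, prev = [], [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None and np.abs(gray - prev).mean() > diff_threshold:
            # large content difference between adjacent frames: start a new video clip
            clips.append(current)
            current = []
        current.append(frame)
        prev = gray
    if current:
        clips.append(current)
    cap.release()
    return clips
```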
As shown in fig. 3, A, B, ..., N are video clips that are different from each other.
In one scenario, the different video clips may be split from one video clip. In another scenario, the different video segments may be segmented from multiple video segments. The flexible setting can be specifically performed according to the requirements of the training scene, and the embodiment is not limited.
Step 402, extracting multi-frame training images from a plurality of video clips to obtain a training set; at least two frames of training images are extracted from each video clip.
In this embodiment, the training set is composed of training images extracted from a plurality of video clips. As one possible implementation, a certain number of training image frames are randomly extracted from each video clip, and the extracted frames of all video clips form the training set. At least two frames of training images are extracted from each video clip.
As another possible implementation, in order to improve the training effect, the same number of training image frames is extracted from each video clip. This makes the distribution of frames across video clips in the training set more uniform, so that when the training set is used to train the pre-training model, each video clip carries the same weight when the model parameters are determined, which improves the training effect of the pre-training model.
As shown in fig. 3, A, B and N are different video clips, and in this embodiment 2 frames are extracted from each video clip as training images for illustration: A1 and A2 are two frames in video clip A, B1 and B2 are two frames in video clip B, and N1 and N2 are two frames in video clip N.
For example, a video X is divided into 3 video clips, namely video clips A, B and C (in fig. 3, N corresponds to C), and two frames are extracted from each video clip for illustration.
In video clip A, the two extracted frames are A1 and A2, which are consecutive. In video clip B, the two extracted frames are B1 and B2, which are consecutive. In video clip C, the two extracted frames are C1 and C2, which are consecutive. The image frames A1, A2, B1, B2, C1 and C2 then constitute the training set.
It should be noted that, in practical application, the number of training images in the training set is not limited to the 6 frames described in this embodiment and may be set flexibly according to the accuracy requirements of training.
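A sketch of building the training set from the clips, keeping the clip index with each frame so that intra-clip and inter-clip pairs can be formed later; the choice of two consecutive frames per clip mirrors the A1/A2, B1/B2, C1/C2 example above, and the sampling strategy is an assumption.

```python
import random

def build_training_set(clips, frames_per_clip: int = 2):
    """Extract the same number of (consecutive) frames from every clip (sketch).

    Returns a list of (clip_index, frame) pairs; the clip index records which
    video clip each training image was extracted from.
    """
    training_set = []
    for clip_idx, clip in enumerate(clips):
        # assumes each clip holds at least `frames_per_clip` frames
        start = random.randint(0, max(0, len(clip) - frames_per_clip))
        for frame in clip[start:start + frames_per_clip]:
            training_set.append((clip_idx, frame))
    return training_set
```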
Step 403, performing a multi-round training process on the pre-training model for image feature extraction by using the training set. Each round of training proceeds as follows: training images extracted from at least two video clips are selected from the training set; the training images selected for the round are input into the pre-training model to obtain the output image features; according to the image features of the training images selected for the round, a first image feature distance between training images belonging to the same video clip and a second image feature distance between training images belonging to different video clips are determined; and the model parameters of the pre-training model are adjusted according to the first image feature distance and the second image feature distance so as to minimize the difference between them. The pre-training model obtained in this way can identify the association relationship between different video clips and thus serve as a general pre-training model.
In this embodiment, the training set is used to perform a multi-round training process on the pre-training model. In each round, the training effect is evaluated from the recognition result and the parameters of the pre-training model are adjusted accordingly, until the model converges and can accurately generate the image features of the training images.
It should be understood that because the training set contains video clips belonging to the same video as well as video clips belonging to different videos, the training images selected in each round are extracted from at least two video clips that may belong to the same video or to different videos. The extracted training images can therefore be used to learn the association relationship between different video clips, so that the model can serve as a general pre-training model with improved robustness.
In the training method of the pre-training model, at least two frames of training images are extracted from each of the acquired video clips to obtain multiple frames of training images forming a training set, and the training set is used to perform a multi-round training process on the pre-training model for image feature extraction. In each round, image features are obtained from the training images; a first image feature distance is computed between images belonging to the same video clip and a second image feature distance between images belonging to different video clips; and the parameters of the pre-training model are adjusted continuously so that the difference between the first and second image feature distances is minimized. This realizes the training of a general pre-training model and improves the reliability of the image features produced by the pre-training model.
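One round of the training process could be sketched as follows, assuming two training images per selected clip and a PyTorch model whose forward pass returns image features. The helpers `intra_clip_distance` and `inter_clip_distance` are sketched after the corresponding distance formulas below, and the loss `dist_in - dist_inter` is one plausible reading of "minimizing the difference between the first and second image feature distances" (pull intra-clip features together, push inter-clip features apart); the patent itself does not spell out the loss function.

```python
import torch

def training_round(model, batch, optimizer):
    """One round of pre-training (sketch).

    `batch` maps each selected video clip to a tensor of its training images,
    e.g. {clip_id: images of shape (2, C, H, W)}.
    """
    features = {cid: model(frames) for cid, frames in batch.items()}
    dist_in = intra_clip_distance(features)      # first image feature distance
    dist_inter = inter_clip_distance(features)   # second image feature distance
    loss = dist_in - dist_inter                  # assumed reading of the training objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```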
Based on the above embodiments, this embodiment provides another training method of the pre-training model, which illustrates how to determine the first image feature distance between training images belonging to the same video clip so as to improve the precision of the first-image-feature-distance calculation. Specifically, this may be implemented by the following steps:
For the training images input into the pre-training model in the current round, the intra-class feature distances between the image features of different training images belonging to the same video clip are determined, and the sum of the intra-class feature distances over the at least two video clips selected from the training set is determined to obtain the first image feature distance. The first image feature distance thus indicates the association relationship between the image features of different training images belonging to the same video clip.
In one possible implementation of the embodiment of the application, for example, the selected training images i1 and i2 belong to the same video clip i. The training images i1 and i2 are input into the pre-training model to obtain the image features of each training image, denoted h_i1 and h_i2 respectively. The intra-class feature distance d(h_i1, h_i2) between the image features of the training images belonging to the same video clip i is then calculated, and the sum of the intra-class feature distances over the at least two video clips selected from the training set in the current round is determined to obtain the first image feature distance dist(in), which can be realized by the following formula:
dist(in) = Σ_{i=1}^{n} d(h_i1, h_i2)
wherein i indexes the video clips, i.e. i is a natural number from 1 to n, and n is greater than or equal to 2.
In another possible implementation of the embodiment of the application, in order to meet the requirements of different scenes, the image features of different training images belonging to the same video clip may be classified, that is, divided into different categories, to realize refined feature recognition; for example, image features belonging to the person category, to buildings, or to the nose category are determined. Then, for different training images, the per-category feature distances are calculated between the image features of any two training images corresponding to the same category, and the feature distances over all categories are summed to obtain the intra-class feature distance within the same video clip. The sum of the intra-class feature distances over the at least two video clips selected from the training set in the current round is then determined to obtain the first image feature distance. This refines the calculation of the first image feature distance and improves its accuracy.
It should be noted that the image feature distance may be calculated as a Euclidean distance or a cosine distance.
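A sketch of the first image feature distance dist(in) that matches the formula above, using the Euclidean distance mentioned in the note (a cosine distance could be substituted); the dictionary layout of `features` is an assumption carried over from the training-round sketch.

```python
import torch

def intra_clip_distance(features: dict) -> torch.Tensor:
    """First image feature distance dist(in): sum of intra-clip feature distances (sketch).

    `features` maps each clip id to a (2, D) tensor holding the image features
    h_i1 and h_i2 of the two training images extracted from that clip.
    """
    total = torch.tensor(0.0)
    for h in features.values():
        total = total + torch.dist(h[0], h[1])  # Euclidean distance d(h_i1, h_i2)
    return total
```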
Based on the above embodiments, this embodiment provides another training method of the pre-training model, which illustrates how to determine the second image feature distance between training images belonging to different video clips so as to improve the precision of the second-image-feature-distance calculation. Specifically, this may be implemented by the following steps:
For the training images input into the pre-training model in the current round, the inter-class feature distances between the image features of different training images belonging to different video clips are determined, and the sum of the inter-class feature distances over the at least two video clips selected from the training set is determined to obtain the second image feature distance. The second image feature distance thus indicates the association relationship between the image features of training images that do not belong to the same video clip.
In one possible implementation of the embodiment of the application, for example, the selected training images i1 and i2 belong to the same video clip i, and the training images j1 and j2 belong to the same video clip j. The training images i1 and i2 are input into the pre-training model to obtain image features denoted h_i1 and h_i2 respectively, and the training images j1 and j2 are input into the pre-training model to obtain image features denoted h_j1 and h_j2 respectively. The inter-class feature distances between the image features of training images belonging to the different video clips i and j are then calculated, and the sum of the inter-class feature distances over the at least two video clips selected from the training set in the current round is determined to obtain the second image feature distance dist(inter), which can be realized by the following formula:
dist(inter) = Σ_{i≠j} [ d(h_i1, h_j1) + d(h_i1, h_j2) + d(h_i2, h_j1) + d(h_i2, h_j2) ]
wherein i and j index different video clips, n is greater than or equal to 2, and d(h_i1, h_j1), d(h_i1, h_j2), d(h_i2, h_j1) and d(h_i2, h_j2) are the inter-class feature distances between the image features of the training images in the different video clips i and j.
In this embodiment, two training images selected from each video clip are used as an example; in practical application, the number of training images selected from each video clip can be set flexibly according to the training requirements, and this embodiment is not limited in this respect.
In another possible implementation of the embodiment of the application, in order to meet the requirements of different scenes, the image features of different training images belonging to different video clips may be classified, that is, divided into different categories, to realize refined feature recognition; for example, image features belonging to the person category, to buildings, or to the nose category are determined. Then, for training images belonging to different video clips, the per-category feature distances are calculated between the image features of any two training images corresponding to the same category, and the feature distances over all categories are summed to obtain the inter-class feature distance between training images belonging to different video clips. The sum of the inter-class feature distances over the at least two video clips selected from the training set in the current round is then determined to obtain the second image feature distance. This refines the calculation of the second image feature distance and improves its accuracy.
It should be noted that the image feature distance may be calculated as a Euclidean distance or a cosine distance.
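A matching sketch of the second image feature distance dist(inter): every pair of distinct clips contributes the four cross-clip distances named in the formula above; again the Euclidean distance is used and the `features` layout is the same assumed dictionary as before.

```python
import itertools
import torch

def inter_clip_distance(features: dict) -> torch.Tensor:
    """Second image feature distance dist(inter): sum of inter-clip feature distances (sketch)."""
    total = torch.tensor(0.0)
    for hi, hj in itertools.combinations(features.values(), 2):
        # d(h_i1,h_j1) + d(h_i1,h_j2) + d(h_i2,h_j1) + d(h_i2,h_j2) for clips i != j
        for a in hi:
            for b in hj:
                total = total + torch.dist(a, b)
    return total
```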
In order to achieve the above embodiments, the present application also provides an image processing apparatus.
Fig. 5 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application.
As shown in fig. 5, includes: an acquisition module 51, a generation module 52 and a processing module 53.
The obtaining module 51 is configured to obtain a trained pre-training model, where the pre-training model is trained with multiple frames of training images so that, for the image features output by the trained pre-training model, the difference between a first image feature distance and a second image feature distance is minimized, where the first image feature distance is a distance between image features of training images extracted from the same video clip, and the second image feature distance is a distance between image features of training images extracted from different video clips.
The generating module 52 is configured to generate an image processing model for performing the target image processing task according to the pre-training model.
A processing module 53 for performing a target image processing task on the target image using the image processing model.
Further, in one possible implementation manner of the embodiment of the present application, the generating module 52 is specifically configured to: and acquiring a network layer corresponding to the target image processing task, splicing the pre-training model with the network layer, wherein the input of the network layer is the image characteristics output by the pre-training model, the output of the network layer is the target image task processing result, and training the spliced pre-training model and the network layer by adopting a training set of the target image processing task to obtain the image processing model.
In one possible implementation of an embodiment of the present application, the target image processing task includes an image classification task, a target detection task, or an object recognition task.
It should be noted that the foregoing explanation of the embodiment of the image processing method is also applicable to the image processing apparatus of this embodiment, and the principle is the same, and will not be repeated here.
In the image processing device provided by the embodiment of the application, a trained pre-training model is obtained, where the pre-training model has been trained with multiple frames of training images so that, for the image features output by the trained model, the difference between the first image feature distance and the second image feature distance is minimized. A corresponding image processing model is then generated from the general pre-training model according to the target image processing task, which improves the efficiency of generating the image processing model corresponding to the target processing task, and the generated image processing model is used to execute the target image processing task on the target image.
In order to implement the above embodiment, the present embodiment provides a training device for a pre-training model.
Fig. 6 is a schematic structural diagram of a training device for pre-training a model according to an embodiment of the present application. As shown in fig. 6, the apparatus includes: an acquisition module 61, an extraction module 62 and a training module 63.
The obtaining module 61 is configured to obtain a plurality of video clips.
The extracting module 62 is configured to extract a plurality of training images from a plurality of video segments to obtain a training set, where at least two training images are extracted from each video segment.
The training module 63 is configured to perform a plurality of training processes on the pre-training model for image feature extraction using a training set, where each training process includes: selecting each extracted training image from at least two video clips from the training set; inputting each training image selected in the round into a pre-training model to obtain output image characteristics; according to the image characteristics of each training image selected by the round, determining a first image characteristic distance between training images belonging to the same video segment, and determining a second image characteristic distance between training images belonging to different video segments, and according to the first image characteristic distance and the second image characteristic distance, adjusting model parameters of a pre-training model so as to minimize the difference between the first image characteristic distance and the second image characteristic distance.
In one possible implementation of the embodiment of the present application, the training module 63 is specifically configured to: inputting training images of the pre-training model in the training process of the round, and determining intra-class feature distances among image features of different training images belonging to the same video segment; and determining the sum of the feature distances in the class for at least two video clips selected from the training set in the training process of the round, so as to obtain the first image feature distance.
In one possible implementation of the embodiment of the present application, the training module 63 is specifically configured to:
inputting training images of the pre-training model in the training process of the round, and determining the inter-class feature distance between the image features of different training images belonging to different video clips; and determining the sum of the feature distances between the classes for at least two video clips selected from the training set in the training process of the round, so as to obtain the second image feature distance.
In a possible implementation manner of the embodiment of the present application, the number of frames of the training image extracted from each video clip is the same.
In one possible implementation manner of the embodiment of the present application, the obtaining module 61 is specifically configured to: acquiring a plurality of videos; and performing segmentation processing according to the content difference between adjacent image frames in each video to obtain a plurality of video fragments of each video.
In the training device of the pre-training model, at least two frames of training images are extracted from each of the acquired video clips to obtain multiple frames of training images forming a training set, and the training set is used to perform a multi-round training process on the pre-training model for image feature extraction. In each round, image features are obtained from the training images; a first image feature distance is computed between images belonging to the same video clip and a second image feature distance between images belonging to different video clips; and the parameters of the pre-training model are adjusted continuously so that the difference between the first and second image feature distances is minimized. This realizes the training of a general pre-training model and improves the reliability of the image features produced by the pre-training model.
In order to achieve the above embodiments, an embodiment of the present application provides an electronic device, including:
at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image processing method of the foregoing method embodiment or the training method of the foregoing method embodiment.
In order to achieve the above-described embodiments, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the image processing method described in the above-described method embodiments or to execute the training method described in the above-described method embodiments.
In order to achieve the above embodiments, embodiments of the present application provide a computer program product, which includes a computer program that, when executed by a processor, performs the image processing method described in the foregoing method embodiment, or performs the training method of the pre-training model described in the foregoing method embodiment.
According to an embodiment of the present application, the present application also provides an electronic device and a readable storage medium.
Fig. 7 is a block diagram of an electronic device according to an embodiment of the application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 7, the electronic device includes: one or more processors 701, a memory 702, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 701 is illustrated in fig. 7.
Memory 702 is a non-transitory computer readable storage medium provided by the present application. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the image processing method provided by the present application. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to execute the image processing method provided by the present application.
The memory 702 is used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules (e.g., the acquisition module 51, the generation module 52, and the processing module 53 shown in fig. 5) corresponding to the image processing method according to the embodiment of the present application. The processor 701 executes various functional applications of the server and data processing, i.e., implements the image processing method in the above-described method embodiment, by running non-transitory software programs, instructions, and modules stored in the memory 702.
Memory 702 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to the use of the electronic device of the image processing method, and the like. In addition, the memory 702 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 702 may optionally include memory remotely located relative to the processor 701, which may be connected to the image processing method's electronics via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the image processing method may further include: an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703 and the output device 704 may be connected by a bus or otherwise, in fig. 7 by way of example.
The input device 703 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device of the image processing method, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer stick, one or more mouse buttons, a track ball, a joystick, etc. The output device 704 may include a display apparatus, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibration motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASIC (application specific integrated circuit), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service expansibility of traditional physical hosts and VPS services.
According to the technical scheme provided by the embodiment of the application, a trained pre-training model is obtained, where the pre-training model has been trained with multiple frames of training images so that, for the image features output by the trained model, the difference between the first image feature distance and the second image feature distance is minimized. A corresponding image processing model is then generated from the general pre-training model according to the target image processing task, which improves the efficiency of generating the image processing model corresponding to the target processing task, and the generated image processing model is used to execute the target image processing task on the target image.
It should be noted that, the electronic device may also implement the training method of the pre-training model of the present application, and the principle is the same, and will not be described herein again.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed embodiments are achieved, and are not limited herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (16)

1. An image processing method, comprising:
acquiring a trained pre-training model, wherein the pre-training model is trained by adopting multiple frames of training images, so that, for the image features output by the trained pre-training model, the difference between the first image feature distance and the second image feature distance is minimized; wherein the first image feature distance is a distance between image features of training images extracted from the same video clip, and the second image feature distance is a distance between image features of training images extracted from different video clips;
generating, according to the pre-training model, an image processing model for executing a target image processing task;
executing the target image processing task on a target image using the image processing model;
wherein the number of training image frames extracted from each video clip is the same.
2. The image processing method of claim 1, wherein the generating, according to the pre-training model, an image processing model for executing a target image processing task comprises:
acquiring a network layer corresponding to the target image processing task;
splicing the pre-training model with the network layer; wherein the input of the network layer is the image features output by the pre-training model, and the output of the network layer is the processing result of the target image processing task;
and training the spliced pre-training model and the network layer by adopting the training set of the target image processing task to obtain the image processing model.
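As a non-limiting sketch of the splicing and training described in claim 2 above (assuming a PyTorch-style framework; pretrained_backbone, task_train_loader, and all hyper-parameters are hypothetical placeholders, not specified by the claim):

    import torch
    import torch.nn as nn

    def build_and_finetune(pretrained_backbone, task_train_loader,
                           num_features=16, num_classes=10, epochs=1):
        # Network layer corresponding to the target image processing task.
        head = nn.Linear(num_features, num_classes)
        # Splice: the backbone's output (image features) feeds the network layer,
        # whose output is the processing result of the target task.
        model = nn.Sequential(pretrained_backbone, head)
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
        criterion = nn.CrossEntropyLoss()
        # Train the spliced model on the training set of the target task.
        for _ in range(epochs):
            for images, labels in task_train_loader:
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()
        return model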
3. The image processing method according to claim 1 or 2, wherein the target image processing task includes an image classification task, a target detection task, or an object recognition task.
4. A training method of a pre-training model, comprising:
acquiring a plurality of video clips;
extracting multi-frame training images from the plurality of video clips to obtain a training set; at least two frames of training images are extracted from each video clip;
performing a multi-round training process on a pre-training model for image feature extraction using the training set; wherein each round of training includes: selecting, from the training set, training images extracted from at least two video clips; inputting the training images selected in the round into the pre-training model to obtain output image features; determining, according to the image features of the training images selected in the round, a first image feature distance between training images belonging to the same video clip and a second image feature distance between training images belonging to different video clips; and adjusting model parameters of the pre-training model according to the first image feature distance and the second image feature distance so as to minimize the difference between the first image feature distance and the second image feature distance;
wherein the number of training image frames extracted from each video clip is the same.
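A minimal sketch of one round of the training process of claim 4 above, under stated assumptions: a PyTorch-style framework, Euclidean feature distances, and reading "minimize the difference" as minimizing (first distance minus second distance); model, optimizer, and the frame batches are hypothetical placeholders.

    import torch

    def training_round(model, optimizer, clip_a_frames, clip_b_frames):
        # clip_a_frames / clip_b_frames: the same number of frames from two different clips.
        feats_a = model(clip_a_frames)   # image features of frames from clip A
        feats_b = model(clip_b_frames)   # image features of frames from clip B

        # First image feature distance: frames belonging to the same video clip.
        first = torch.cdist(feats_a, feats_a).sum() + torch.cdist(feats_b, feats_b).sum()
        # Second image feature distance: frames belonging to different video clips.
        second = torch.cdist(feats_a, feats_b).sum()

        loss = first - second            # adjust model parameters to minimize the difference
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()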
5. The training method of claim 4, wherein the determining a first image feature distance between training images belonging to the same video clip comprises:
determining, for the training images input into the pre-training model in the current round of training, intra-class feature distances between image features of different training images belonging to the same video clip;
and determining the sum of the intra-class feature distances over the at least two video clips selected from the training set in the current round of training, so as to obtain the first image feature distance.
6. The training method of claim 4, wherein the determining a second image feature distance between training images belonging to different video clips comprises:
determining, for the training images input into the pre-training model in the current round of training, inter-class feature distances between image features of different training images belonging to different video clips;
and determining the sum of the inter-class feature distances over the at least two video clips selected from the training set in the current round of training, so as to obtain the second image feature distance.
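As an illustrative reading of claims 5 and 6 above (a NumPy sketch; the Euclidean distance and the dictionary layout are assumptions, not specified by the claims):

    import numpy as np

    def intra_and_inter_distance_sums(features_by_clip):
        # features_by_clip: {clip_id: [feature vectors of that clip's training images]}
        clips = list(features_by_clip)
        intra = 0.0  # sum of intra-class feature distances -> first image feature distance
        inter = 0.0  # sum of inter-class feature distances -> second image feature distance
        for c in clips:
            feats = features_by_clip[c]
            for i in range(len(feats)):
                for j in range(i + 1, len(feats)):
                    intra += np.linalg.norm(feats[i] - feats[j])
        for a in range(len(clips)):
            for b in range(a + 1, len(clips)):
                for fa in features_by_clip[clips[a]]:
                    for fb in features_by_clip[clips[b]]:
                        inter += np.linalg.norm(fa - fb)
        return intra, inter

    # Tiny example: two clips, two frames each.
    feats = {"clip_0": [np.array([0.0, 0.0]), np.array([0.1, 0.0])],
             "clip_1": [np.array([1.0, 1.0]), np.array([1.1, 1.0])]}
    print(intra_and_inter_distance_sums(feats))  # small intra-class sum, larger inter-class sum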
7. The training method of any of claims 4-6, wherein the acquiring a plurality of video clips comprises:
acquiring a plurality of videos;
and performing segmentation processing according to the content difference between adjacent image frames in each video to obtain a plurality of video clips of each video.
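A possible sketch of the segmentation of claim 7 above (NumPy; the mean-absolute-pixel-difference measure and the threshold value are assumptions, not specified by the claim):

    import numpy as np

    def segment_video(frames, threshold=30.0):
        # frames: list of decoded frames as uint8 arrays of identical shape.
        clips, current = [], [frames[0]]
        for prev, cur in zip(frames, frames[1:]):
            diff = np.abs(cur.astype(np.float32) - prev.astype(np.float32)).mean()
            if diff > threshold:      # large content difference -> start a new video clip
                clips.append(current)
                current = []
            current.append(cur)
        clips.append(current)
        return clips  # a plurality of video clips of the video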
8. An image processing apparatus comprising:
the acquisition module is used for acquiring a trained pre-training model, wherein the pre-training model is trained with multiple frames of training images so that the image features output by the trained pre-training model minimize the difference between the first image feature distance and the second image feature distance; wherein the first image feature distance is a distance between image features of training images extracted from the same video clip, and the second image feature distance is a distance between image features of training images extracted from different video clips;
the generation module is used for generating an image processing model for executing a target image processing task according to the pre-training model;
the processing module is used for executing a target image processing task on a target image by adopting the image processing model;
wherein the number of training image frames extracted from each video clip is the same.
9. The image processing device according to claim 8, wherein the generation module is specifically configured to:
acquiring a network layer corresponding to the target image processing task;
splicing the pre-training model with the network layer; wherein the input of the network layer is the image features output by the pre-training model, and the output of the network layer is the processing result of the target image processing task;
and training the spliced pre-training model and the network layer by adopting the training set of the target image processing task to obtain the image processing model.
10. The image processing apparatus according to claim 8 or 9, wherein the target image processing task includes an image classification task, a target detection task, or an object recognition task.
11. A training device for a pre-training model, comprising:
the acquisition module is used for acquiring a plurality of video clips;
the extraction module is used for extracting multi-frame training images from the plurality of video clips so as to obtain a training set; at least two frames of training images are extracted from each video clip;
the training module is used for performing a multi-round training process on the pre-training model for image feature extraction using the training set; wherein each round of training includes: selecting, from the training set, training images extracted from at least two video clips; inputting the training images selected in the round into the pre-training model to obtain output image features; determining, according to the image features of the training images selected in the round, a first image feature distance between training images belonging to the same video clip and a second image feature distance between training images belonging to different video clips; and adjusting model parameters of the pre-training model according to the first image feature distance and the second image feature distance so as to minimize the difference between the first image feature distance and the second image feature distance;
wherein the number of training image frames extracted from each video clip is the same.
12. The training device according to claim 11, wherein the training module is specifically configured to:
determining, for the training images input into the pre-training model in the current round of training, intra-class feature distances between image features of different training images belonging to the same video clip;
and determining the sum of the intra-class feature distances over the at least two video clips selected from the training set in the current round of training, so as to obtain the first image feature distance.
13. The training device according to claim 11, wherein the training module is specifically configured to:
determining, for the training images input into the pre-training model in the current round of training, inter-class feature distances between image features of different training images belonging to different video clips;
and determining the sum of the inter-class feature distances over the at least two video clips selected from the training set in the current round of training, so as to obtain the second image feature distance.
14. The training device according to any of claims 11-13, wherein the acquisition module is specifically configured to:
acquiring a plurality of videos; and performing segmentation processing according to the content difference between adjacent image frames in each video to obtain a plurality of video clips of each video.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image processing method of any one of claims 1-3 or the training method of the pre-training model of any one of claims 4-7.
16. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the image processing method of any one of claims 1-3, or to perform the training method of the pre-training model of any one of claims 4-7.
CN202011249923.2A 2020-11-10 2020-11-10 Image processing method, training method and device of pre-training model and electronic equipment Active CN112561053B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011249923.2A CN112561053B (en) 2020-11-10 2020-11-10 Image processing method, training method and device of pre-training model and electronic equipment
JP2021178829A JP7267379B2 (en) 2020-11-10 2021-11-01 Image processing method, pre-trained model training method, device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011249923.2A CN112561053B (en) 2020-11-10 2020-11-10 Image processing method, training method and device of pre-training model and electronic equipment

Publications (2)

Publication Number Publication Date
CN112561053A CN112561053A (en) 2021-03-26
CN112561053B true CN112561053B (en) 2023-09-08

Family

ID=75041883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011249923.2A Active CN112561053B (en) 2020-11-10 2020-11-10 Image processing method, training method and device of pre-training model and electronic equipment

Country Status (2)

Country Link
JP (1) JP7267379B2 (en)
CN (1) CN112561053B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11501105B2 (en) * 2018-03-02 2022-11-15 Zoox, Inc. Automatic creation and updating of maps
CN116664966B (en) * 2023-03-27 2024-02-20 北京鹰之眼智能健康科技有限公司 Infrared image processing system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008136466A1 (en) 2007-05-01 2008-11-13 Dep Co., Ltd. Dynamic image editing device
US10970823B2 (en) * 2018-07-06 2021-04-06 Mitsubishi Electric Research Laboratories, Inc. System and method for detecting motion anomalies in video
CN109523526B (en) 2018-11-08 2021-10-22 腾讯科技(深圳)有限公司 Tissue nodule detection and model training method, device, equipment and system thereof

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108235770A (en) * 2017-12-29 2018-06-29 深圳前海达闼云端智能科技有限公司 image identification method and cloud system
CN109063719A (en) * 2018-04-23 2018-12-21 湖北工业大学 A kind of image classification method of co-ordinative construction similitude and category information
CN109101602A (en) * 2018-08-01 2018-12-28 腾讯科技(深圳)有限公司 Image encrypting algorithm training method, image search method, equipment and storage medium
CN109165566A (en) * 2018-08-01 2019-01-08 中国计量大学 A kind of recognition of face convolutional neural networks training method based on novel loss function
US10551846B1 (en) * 2019-01-25 2020-02-04 StradVision, Inc. Learning method and learning device for improving segmentation performance to be used for detecting road user events using double embedding configuration in multi-camera system and testing method and testing device using the same
CN111488879A (en) * 2019-01-25 2020-08-04 斯特拉德视觉公司 Method and apparatus for improving segmentation performance using dual-embedding configuration
CN109918532A (en) * 2019-03-08 2019-06-21 苏州大学 Image search method, device, equipment and computer readable storage medium
CN110188641A (en) * 2019-05-20 2019-08-30 北京迈格威科技有限公司 Image recognition and the training method of neural network model, device and system
CN111242026A (en) * 2020-01-13 2020-06-05 中国矿业大学 Remote sensing image target detection method based on spatial hierarchy perception module and metric learning
CN111553428A (en) * 2020-04-30 2020-08-18 北京百度网讯科技有限公司 Method, device, equipment and readable storage medium for training discriminant model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Beyond sharing weights for deep domain adaptation; Artem Rozantsev; IEEE Transactions on Pattern Analysis and Machine Intelligence; full text *

Also Published As

Publication number Publication date
JP7267379B2 (en) 2023-05-01
CN112561053A (en) 2021-03-26
JP2022006189A (en) 2022-01-12

Similar Documents

Publication Publication Date Title
CN111753727B (en) Method, apparatus, device and readable storage medium for extracting structured information
CN111914628B (en) Training method and device of face recognition model
CN111783647B (en) Training method of face fusion model, face fusion method, device and equipment
KR20220113829A (en) Vehicle tracking methods, devices and electronic devices
CN110659600B (en) Object detection method, device and equipment
CN110717933B (en) Post-processing method, device, equipment and medium for moving object missed detection
CN111709873B (en) Training method and device for image conversion model generator
CN112561053B (en) Image processing method, training method and device of pre-training model and electronic equipment
CN111539347B (en) Method and device for detecting target
EP4080470A2 (en) Method and apparatus for detecting living face
CN112288699B (en) Method, device, equipment and medium for evaluating relative definition of image
JP7389824B2 (en) Object identification method and device, electronic equipment and storage medium
CN111783639A (en) Image detection method and device, electronic equipment and readable storage medium
CN111738263A (en) Target detection method and device, electronic equipment and storage medium
CN111444819B (en) Cut frame determining method, network training method, device, equipment and storage medium
CN111640123A (en) Background-free image generation method, device, equipment and medium
US20220004812A1 (en) Image processing method, method for training pre-training model, and electronic device
CN112508004A (en) Character recognition method and device, electronic equipment and storage medium
CN112561879A (en) Ambiguity evaluation model training method, image ambiguity evaluation method and device
CN111862031A (en) Face synthetic image detection method and device, electronic equipment and storage medium
CN111950345A (en) Camera identification method and device, electronic equipment and storage medium
CN111709288A (en) Face key point detection method and device and electronic equipment
CN112560772B (en) Face recognition method, device, equipment and storage medium
CN111932530B (en) Three-dimensional object detection method, device, equipment and readable storage medium
CN111783644B (en) Detection method, detection device, detection equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant