JP7267379B2

JP7267379B2 - Image processing method, pre-trained model training method, device and electronic equipment

Info

Publication number: JP7267379B2
Application number: JP2021178829A
Authority: JP
Inventors: リ，チョウ
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-11-10
Filing date: 2021-11-01
Publication date: 2023-05-01
Anticipated expiration: 2041-11-01
Also published as: JP2022006189A; CN112561053B; CN112561053A

Description

The present application relates to the field of image processing technology, specifically to the fields of deep learning and computer vision, and in particular to an image processing method, a training method for a pre-trained model, apparatuses, and an electronic device.

Image processing technology based on neural networks has been developing for many years, and trained image processing models are used to process and recognize images according to image processing needs. However, different image processing tasks have different image processing needs, and processing images with a fixed image processing model cannot meet the needs of different scenarios. How to improve the effect of image processing is therefore a technical problem in urgent need of a solution.

The present application provides an image processing method, a training method for a pre-trained model, apparatuses, and an electronic device for improving the effect of image processing.

According to one aspect of the present application, an image processing method is provided. The method includes: acquiring a trained pre-trained model, where the pre-trained model is trained using multiple frames of training images such that the image features output by the trained pre-trained model satisfy the condition that the difference between a first image feature distance and a second image feature distance is minimal, the first image feature distance being the distance between image features of training images extracted from the same video clip, and the second image feature distance being the distance between image features of training images extracted from different video clips; generating, based on the pre-trained model, an image processing model that performs a target image processing task; and performing the target image processing task on a target image using the image processing model.

According to another aspect of the present application, a training method for a pre-trained model is provided. The method includes: acquiring multiple video clips; extracting multiple frames of training images from the video clips to obtain a training set, where at least two frames of training images are extracted from each video clip; and performing multiple rounds of training on a pre-trained model for image feature extraction using the training set. Each round of training includes: selecting, from the training set, training images extracted from at least two video clips; inputting each training image selected in this round into the pre-trained model to obtain the output image features; determining, based on the image features of the training images selected in this round, a first image feature distance between training images belonging to the same video clip and a second image feature distance between training images belonging to different video clips; and adjusting the model parameters of the pre-trained model, based on the first image feature distance and the second image feature distance, such that the difference between the first image feature distance and the second image feature distance is minimized.

According to another aspect of the present application, an image processing apparatus is provided. The apparatus includes: an acquisition module for acquiring a trained pre-trained model, where the pre-trained model is trained using multiple frames of training images such that the image features output by the trained pre-trained model satisfy the condition that the difference between a first image feature distance and a second image feature distance is minimal, the first image feature distance being the distance between image features of training images extracted from the same video clip, and the second image feature distance being the distance between image features of training images extracted from different video clips; a generation module for generating, based on the pre-trained model, an image processing model that performs a target image processing task; and a processing module for performing the target image processing task on a target image using the image processing model.

According to another aspect of the present application, a training apparatus for a pre-trained model is provided. The apparatus includes: an acquisition module for acquiring multiple video clips; an extraction module for extracting multiple frames of training images from the video clips to obtain a training set, where at least two frames of training images are extracted from each video clip; and a training module for performing multiple rounds of training on a pre-trained model for image feature extraction using the training set. Each round of training includes: selecting, from the training set, training images extracted from at least two video clips; inputting each training image selected in this round into the pre-trained model to obtain the output image features; determining, based on the image features of the training images selected in this round, a first image feature distance between training images belonging to the same video clip and a second image feature distance between training images belonging to different video clips; and adjusting the model parameters of the pre-trained model, based on the first image feature distance and the second image feature distance, such that the difference between the first image feature distance and the second image feature distance is minimized.

According to another aspect of the present application, an electronic device is provided. The electronic device includes at least one processor and a memory communicatively connected to the at least one processor. The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the image processing method according to the one aspect or the training method for a pre-trained model according to the other aspect.

According to another aspect of the present application, a non-transitory computer-readable storage medium storing computer instructions is provided. The computer instructions cause a computer to perform the image processing method according to the one aspect or the training method for a pre-trained model according to the other aspect.

According to another aspect of the present application, a computer program is provided. The computer program causes a computer to perform the image processing method according to the one aspect or the training method for a pre-trained model according to the other aspect.

Note that the content described in this section is not intended to limit the key or important features of the embodiments of the present application, nor is it intended to limit the scope of the present application. Other features of the present application will be readily understood from the following description.

The drawings are provided for a better understanding of the present technical solution and do not limit the present application.

FIG. 1 is a schematic flowchart of an image processing method provided by an embodiment of the present application.
FIG. 2 is a schematic flowchart of another image processing method provided by an embodiment of the present application.
FIG. 3 is a schematic structural diagram of an image processing model provided by an embodiment of the present application.
FIG. 4 is a schematic flowchart of a training method for a pre-trained model provided by an embodiment of the present application.
FIG. 5 is a schematic structural diagram of an image processing apparatus provided by an embodiment of the present application.
FIG. 6 is a schematic structural diagram of a training apparatus for a pre-trained model provided by an embodiment of the present application.
FIG. 7 is a block diagram of an electronic device according to an embodiment of the present application.

Exemplary embodiments of the present application are described below with reference to the drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Accordingly, those skilled in the art can make various changes and modifications to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.

An image processing method, a training method for a pre-trained model, apparatuses, and an electronic device according to embodiments of the present application are described below with reference to the drawings.

FIG. 1 is a schematic flowchart of an image processing method provided by an embodiment of the present application.

As shown in FIG. 1, the method includes the following steps 101 to 103.

Step 101: acquire a trained pre-trained model. The pre-trained model is trained using multiple frames of training images such that the image features output by the trained pre-trained model satisfy the condition that the difference between a first image feature distance and a second image feature distance is minimal. Here, the first image feature distance is the distance between image features of training images extracted from the same video clip, and the second image feature distance is the distance between image features of training images extracted from different video clips.

The pre-trained model in this embodiment can be trained by deep learning, which performs better on large data sets than other machine learning methods. Multiple frames of training images extracted from multiple video clips are input to the pre-trained model as a training set, and the parameters of the pre-trained model are adjusted continuously through iterative training until the result output by the pre-trained model satisfies a preset threshold, at which point training ends. A general-purpose pre-trained model is thereby generated from a large amount of image data, and this general-purpose pre-trained model can subsequently improve the efficiency of generating a corresponding target image processing model.

The training method for the pre-trained model is described in detail in the later embodiment on that training method and is not repeated here.

Step 102: generate, based on the pre-trained model, an image processing model that performs a target image processing task.

The target image processing task includes an image classification task, a target detection task, or an object recognition task.

In the present application, the pre-trained model is a general-purpose model generated in advance. Once it has been generated, an image processing model that performs a given target image processing task can therefore be generated quickly from the image set corresponding to that task, which improves the efficiency of generating the image processing model for the target image processing task.

Here, the image processing model may be a convolutional neural network (Convolutional Neural Networks, CNN) model or a deep neural network (Deep Neural Networks, DNN) model, and is not limited in this embodiment.

Step 103: perform the target image processing task on a target image using the image processing model.

The image processing model of this embodiment is generated, for the target image processing task, from the general-purpose pre-trained model obtained by pre-training, which improves the efficiency of model generation. Using this image processing model to perform the target image processing task on the target image improves both the effect and the processing efficiency of the task.

In the image processing method according to the embodiment of the present application, a trained pre-trained model is acquired, the pre-trained model having been trained using multiple frames of training images such that the image features it outputs satisfy the condition that the difference between the first image feature distance and the second image feature distance is minimal. A corresponding image processing model is then generated from the general-purpose pre-trained model and the target image processing task, which improves the efficiency of generating the image processing model for that task. The generated image processing model is used to perform the target image processing task on the target image. Because the image processing model corresponds to the target image processing task, both the effect and the efficiency of image processing are improved.

In the above embodiment, to improve the efficiency of image processing, an image processing model corresponding to the target image processing task is generated based on the target image processing task and the pre-trained model. In one implementation, the pre-trained model is trained on the image processing task to generate the corresponding image processing model, which improves the efficiency of image processing. In another possible implementation, the pre-trained model is spliced with a network layer corresponding to the target image processing task and then trained to obtain the corresponding image processing model, which improves both the efficiency of generating the image processing model and the effect of image processing.

Accordingly, building on the above embodiment, this embodiment provides another image processing method. FIG. 2 is a schematic flowchart of another image processing method provided by an embodiment of the present application. As shown in FIG. 2, step 102 above includes the following steps 201 to 203.

Step 201: acquire a network layer corresponding to the target image processing task.

In the present application, the acquired network layer corresponds to the target image processing task.

In one scenario, when the target image processing task is an image classification task, the corresponding network layer is a classification layer used to classify the target image. For example, for a vehicle contained in an image to be classified, the layer determines the corresponding vehicle category, such as passenger car or SUV.

In another scenario, when the target image processing task is a target detection task, the corresponding network layer is a detection network used to recognize a target object contained in the target image. For example, for a target image to be processed, it detects whether the image contains an obstacle, or whether multiple images contain the same target object.

In yet another scenario, when the target image processing task is an object recognition task, the corresponding network layer is used to recognize objects in the image. For example, for a target image to be processed, it recognizes the categories of objects contained in different regions of the image, or the types of objects contained in the image.

Step 202: splice the pre-trained model and the network layer. Here, the input of the network layer is the image features output by the pre-trained model, and the output of the network layer is the processing result of the target image processing task.

In this embodiment, after the general-purpose pre-trained model has been generated, it is spliced with the network layer corresponding to the target image processing task. As shown in FIG. 3, the pre-trained model obtained by training is spliced with the network layer to obtain the image processing model to be trained. The image features output by the pre-trained model are input to the network layer, and the output of the network layer is the processing result of the target image processing task.
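The splicing described above can be pictured as plain function composition. The following is a minimal sketch, not the patent's implementation: the class name `SplicedModel`, the toy backbone, and the toy classification head are all hypothetical stand-ins, and a real system would use neural network modules instead of these hand-written functions.

```python
from typing import Callable, List, Sequence

Features = List[float]


class SplicedModel:
    """A feature extractor followed by a task-specific network layer."""

    def __init__(
        self,
        backbone: Callable[[Sequence[float]], Features],
        head: Callable[[Features], str],
    ) -> None:
        self.backbone = backbone  # plays the role of the pre-trained model
        self.head = head          # plays the role of the task network layer

    def __call__(self, image: Sequence[float]) -> str:
        features = self.backbone(image)  # image -> image features
        return self.head(features)       # image features -> task result


def toy_backbone(image: Sequence[float]) -> Features:
    """Hypothetical feature extractor: [mean, max] of the pixel values."""
    return [sum(image) / len(image), max(image)]


def toy_classification_head(features: Features) -> str:
    """Hypothetical classification layer: thresholds the first feature."""
    return "bright" if features[0] > 0.5 else "dark"


model = SplicedModel(toy_backbone, toy_classification_head)
print(model([0.9, 0.8, 0.7]))
print(model([0.1, 0.2, 0.3]))
```

Swapping `toy_classification_head` for a different head is all that changes between tasks in this sketch, which mirrors the idea that one general-purpose feature extractor can be paired with different task layers.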

Step 203: train the spliced pre-trained model and network layer using the training set of the target image processing task to obtain the image processing model.

In this embodiment, to quickly obtain an image processing model for each different target image processing task, the spliced pre-trained model and network layer are trained using the training set corresponding to that task. That is, the image processing model obtained by training corresponds to the target image processing task: the general-purpose pre-trained model completed in advance is spliced with the corresponding network layer and then trained. In one possible implementation, training mainly adjusts the parameters of the network layer to the requirements of the target image processing task, which improves the training efficiency of the corresponding image processing model while meeting the processing needs of different target image processing tasks and of different scenarios.
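As an illustration of training that mainly adjusts the network layer, the sketch below keeps a feature extractor fixed and fits only a small head on top of it. Everything here is an assumption made for illustration: the `frozen_backbone` function, the choice of a logistic-regression head, the learning rate, and the toy data are not taken from the source.

```python
import math
from typing import List, Sequence, Tuple


def frozen_backbone(image: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for the pre-trained model; never updated here."""
    return [sum(image) / len(image), max(image) - min(image)]


def train_head(
    data: List[Tuple[Sequence[float], int]], epochs: int = 200, lr: float = 0.5
) -> Tuple[List[float], float]:
    """Fit a logistic-regression head on features from the frozen backbone."""
    weights, bias = [0.0, 0.0], 0.0
    features = [(frozen_backbone(image), label) for image, label in data]
    for _ in range(epochs):
        for x, y in features:
            z = sum(w * v for w, v in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y                     # gradient of the log loss w.r.t. z
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(weights: List[float], bias: float, image: Sequence[float]) -> int:
    x = frozen_backbone(image)
    return int(sum(w * v for w, v in zip(weights, x)) + bias > 0)


# Toy task data: label 1 for bright images, 0 for dark ones.
data = [([0.9, 0.8], 1), ([0.8, 0.9], 1), ([0.1, 0.2], 0), ([0.2, 0.1], 0)]
weights, bias = train_head(data)
print([predict(weights, bias, image) for image, _ in data])
```

Only `weights` and `bias` change during training, so the amount of computation per step depends on the size of the head rather than on the size of the feature extractor.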

In the image processing method of this embodiment, the general-purpose pre-trained model completed in advance is spliced with the corresponding network layer, where the input of the network layer is the image features output by the pre-trained model and the output of the network layer is the processing result of the target image processing task, and the spliced model is then trained. Because this training mainly targets the network corresponding to the target image processing task, the amount of training data required is small, which improves the training efficiency of the corresponding image processing model.

To implement the above embodiments, this embodiment provides a training method for a pre-trained model.

FIG. 4 is a schematic flowchart of a training method for a pre-trained model provided by an embodiment of the present application. As shown in FIG. 4, the method includes the following steps 401 to 403.

Step 401: acquire multiple video clips.

In one possible implementation of the embodiment of the present application, at least one video is acquired and each video is randomly split into multiple video clips.

In one possible implementation, to obtain more video clips, multiple videos are acquired and each video is split according to the content differences between its adjacent image frames, yielding multiple video clips per video. That is, when a video is split into clips, the content of the frames within each resulting clip changes continuously, which improves the continuity of the frames in the clip.

In another possible implementation of the embodiment of the present application, a single video is acquired and split according to the content differences between its adjacent image frames, yielding multiple video clips. That is, when the video is split into clips, the content of the frames within each resulting clip changes continuously, which improves the continuity of the frames in the clip.
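Splitting by content difference between adjacent frames can be sketched as follows. The source does not say how the difference is measured or where the cut is made, so the mean absolute pixel difference, the fixed `threshold`, and the representation of a frame as a flat list of intensities are all assumptions chosen for illustration.

```python
from typing import List, Sequence

Frame = Sequence[float]  # hypothetical: a frame flattened to pixel intensities


def frame_difference(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def split_into_clips(frames: List[Frame], threshold: float) -> List[List[int]]:
    """Return clips as lists of frame indices, cutting where content jumps."""
    if not frames:
        return []
    clips = [[0]]
    for i in range(1, len(frames)):
        if frame_difference(frames[i - 1], frames[i]) > threshold:
            clips.append([i])    # large change: start a new clip
        else:
            clips[-1].append(i)  # small change: content is still continuous
    return clips


video = [[0.0, 0.0], [0.1, 0.0], [0.9, 1.0], [1.0, 1.0], [0.2, 0.1]]
print(split_into_clips(video, threshold=0.5))  # prints [[0, 1], [2, 3], [4]]
```

Within each resulting clip, consecutive frames differ by at most the threshold, which is one concrete way to read "the content of the frames changes continuously".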

As shown in FIG. 3, A, B, ..., N are different video clips.

In one scenario, these different video clips may be obtained by splitting a single video. In another scenario, they may be obtained by splitting multiple videos. This can be set flexibly according to the needs of the training scenario and is not limited in this embodiment.

Step 402: extract multiple frames of training images from the video clips to obtain a training set, where at least two frames of training images are extracted from each video clip.

In this embodiment, the training set consists of multiple frames of training images extracted from multiple video clips. In one possible implementation, a fixed number of frames is randomly extracted from each video clip as training images, and the training set is built from the extracted frames. At least two frames of training images are extracted from each video clip.

In another possible implementation, to improve the training effect of the model, the same number of frames is extracted from each video clip as training images. This improves the uniformity of the distribution of frames across clips in the training set, so that when the pre-trained model is trained on this training set each video clip carries the same weight in determining the model parameters, which improves the subsequent training effect of the pre-trained model.

As shown in FIG. 3, A, B, and N are different video clips. This embodiment is described using the example of extracting two frames from each video clip as training images. Here, A1 and A2 are two frames of video clip A, B1 and B2 are two frames of video clip B, and N1 and N2 are two frames of video clip N.

For example, one video X is split into three video clips A, B, and C, so that N in FIG. 3 is C, and two frames are extracted from each video clip.

In video clip A, the two extracted frames are A1 and A2, which are two consecutive frames. In video clip B, the two extracted frames are B1 and B2, which are two consecutive frames. In video clip C, the two extracted frames are C1 and C2, which are two consecutive frames. The image frames A1, A2, B1, B2, C1, and C2 then form the training set.
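The A/B/C example can be reproduced with a few lines of code. This is a sketch under assumptions: frames are represented by placeholder strings, the helper name `build_training_set` is hypothetical, and the starting position of the consecutive frames is chosen at random because the source does not say which two frames are taken.

```python
import random
from typing import Dict, List, Tuple


def build_training_set(
    clips: Dict[str, List[str]], frames_per_clip: int = 2, seed: int = 0
) -> List[Tuple[str, str]]:
    """Pick `frames_per_clip` consecutive frames from every clip.

    Returns (clip_id, frame) pairs so that each training image keeps the
    identity of the clip it came from.
    """
    if frames_per_clip < 2:
        raise ValueError("at least two frames per clip are required")
    rng = random.Random(seed)
    training_set = []
    for clip_id, frames in clips.items():
        if len(frames) < frames_per_clip:
            raise ValueError(f"clip {clip_id!r} is too short")
        # Random start, then a run of consecutive frames from that clip.
        start = rng.randrange(len(frames) - frames_per_clip + 1)
        for frame in frames[start:start + frames_per_clip]:
            training_set.append((clip_id, frame))
    return training_set


clips = {
    "A": ["A_f0", "A_f1", "A_f2"],
    "B": ["B_f0", "B_f1"],
    "C": ["C_f0", "C_f1", "C_f2", "C_f3"],
}
training_set = build_training_set(clips)
print(len(training_set))  # prints 6: two frames from each of the three clips
```

Keeping the clip identifier next to each frame matters later, because the two feature distances are defined by whether a pair of images comes from the same clip or from different clips.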

Note that in practical applications the number of training images in the training set is not limited to the six frames described in this embodiment and can be set flexibly according to the accuracy required of the training.

Step 403: perform multiple rounds of training on a pre-trained model for image feature extraction using the training set. Each round of training includes: selecting, from the training set, training images extracted from at least two video clips; inputting each training image selected in this round into the pre-trained model to obtain the output image features; determining, based on the image features of the training images selected in this round, a first image feature distance between training images belonging to the same video clip and a second image feature distance between training images belonging to different video clips; and adjusting the model parameters of the pre-trained model, based on the first and second image feature distances, such that the difference between the first image feature distance and the second image feature distance is minimized. The pre-trained model obtained by this training thus serves as a general-purpose pre-trained model capable of recognizing the association relationships between different video clips.
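The two distances used in each round can be computed as in the sketch below. The source only states that the parameters are adjusted so that the difference between the first and second image feature distances is minimal. The Euclidean metric, the averaging over all pairs, and the signed difference `first - second` used as the objective are assumptions made here to obtain something concrete, and the feature vectors are made-up numbers rather than model outputs.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Sample = Tuple[str, Sequence[float]]  # (clip_id, image feature vector)


def feature_distances(batch: List[Sample]) -> Tuple[float, float]:
    """Return the (first, second) image feature distances for one round.

    first:  mean distance over pairs of images from the same clip
    second: mean distance over pairs of images from different clips
    """
    same, different = [], []
    for (clip_a, feat_a), (clip_b, feat_b) in combinations(batch, 2):
        d = math.dist(feat_a, feat_b)  # Euclidean distance (assumed metric)
        (same if clip_a == clip_b else different).append(d)
    if not same or not different:
        raise ValueError("batch needs a same-clip pair and a cross-clip pair")
    return sum(same) / len(same), sum(different) / len(different)


def objective(batch: List[Sample]) -> float:
    """Assumed reading of the objective: first distance minus second."""
    first, second = feature_distances(batch)
    return first - second


batch = [
    ("A", (0.0, 0.0)), ("A", (1.0, 0.0)),    # features of A1, A2
    ("B", (10.0, 0.0)), ("B", (11.0, 0.0)),  # features of B1, B2
]
print(feature_distances(batch))  # prints (1.0, 10.0)
print(objective(batch))          # prints -9.0
```

Under this reading, lowering the objective pulls features from the same clip together and pushes features from different clips apart. A training loop would differentiate this quantity with respect to the model parameters, which is left out of the sketch.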

本実施例では、トレーニングセットを使用して、事前トレーニングモデルに対してマルチラウンドのトレーニングを実行し、各ラウンドのトレーニングにおいて、モデルが収束するまで、認識結果に基づいてトレーニング効果を決定して、事前トレーニングモデルのパラメータを調整することで、事前トレーニングモデルがトレーニング画像の画像特徴を正確に生成できる。本実施例では、トレーニングセットにおけるトレーニング画像により、事前にトレーニングして汎用的な事前トレーニングモデルを取得し、事前トレーニングモデルから出力された画像特徴は、画像認識の汎用結果として、後続のターゲット画像認識タスクと組み合わせて、ターゲット画像認識タスクに対応する画像処理モデルを迅速に取得することを容易にし、画像処理モデルの生成効率を向上させることができる。 In this example, the training set is used to perform multiple rounds of training on the pre-trained model, and in each round of training, until the model converges, the training effect is determined based on the recognition results, By adjusting the parameters of the pre-trained model, the pre-trained model can accurately generate the image features of the training images. In this example, with the training images in the training set, we pre-train to obtain a generic pre-training model, and the image features output from the pre-training model are used as the generic result of image recognition for subsequent target image recognition. In combination with the task, it can facilitate the rapid acquisition of the image processing model corresponding to the target image recognition task, and improve the generation efficiency of the image processing model.

なお、トレーニングセットには、同じビデオに属する複数のビデオクリップが含まれ、異なるビデオに属する複数のビデオクリップも含まれるため、各ラウンドのトレーニングにおいて、トレーニングセットから、少なくとも２つのビデオクリップから抽出された各トレーニング画像を選択し、ここで、２つのビデオクリップは同じビデオに属してもよいし、異なるビデオに属してもよく、抽出されたトレーニング画像を使用して、異なるビデオクリップ間の関連関係を認識して、汎用的な事前トレーニングモデルとし、汎用的なモデルのロバスト性を向上させる。 Note that the training set contains multiple video clips belonging to the same video as well as multiple video clips belonging to different videos. In each round of training, the training images extracted from at least two video clips are therefore selected from the training set, where the two video clips may belong to the same video or to different videos. The extracted training images are used to recognize the association relationships between different video clips, which yields a general-purpose pre-trained model and improves its robustness.

本出願の実施例に係る事前トレーニングモデルのトレーニング方法では、取得された複数のビデオクリップからそれぞれ少なくとも２フレームのトレーニング画像を抽出して、複数フレームのトレーニング画像を取得して、トレーニングセットを取得し、トレーニングセットを通じて画像特徴抽出のための事前トレーニングモデルに対してマルチラウンドのトレーニングを実行し、各ラウンドのトレーニングにおいて、トレーニング画像に基づいて、画像特徴を取得し、同じビデオクリップに属する画像の画像特徴に基づいて、画像間の第１の画像特徴距離を取得し、異なるビデオクリップに属する画像の画像特徴に基づいて、画像間の第２の画像特徴距離を取得し、第１の画像特徴距離と第２の画像特徴距離の差が最小となるように、事前トレーニングモデルのパラメータを継続的に調整することで、汎用的な事前トレーニングモデルのトレーニングを実現し、事前トレーニングモデルによって認識された画像特徴の信頼性を向上させる。 In the pre-trained model training method according to the embodiment of the present application, at least two frames of training images are extracted from each of the acquired video clips to obtain multiple frames of training images that form a training set, and the training set is used to perform multiple rounds of training on the pre-trained model for image feature extraction. In each round of training, image features are obtained from the training images; a first image feature distance between images is obtained from the image features of images belonging to the same video clip; a second image feature distance between images is obtained from the image features of images belonging to different video clips; and the parameters of the pre-trained model are adjusted continually so that the difference between the first image feature distance and the second image feature distance is minimized. This trains a general-purpose pre-trained model and improves the reliability of the image features it recognizes.

上記の実施例に基づいて、本実施例は、別の事前トレーニングモデルのトレーニング方法を提供し、第１の画像特徴距離の計算の精度を向上させるために、同じビデオクリップに属するトレーニング画像間の第１の画像特徴距離をどのように決定するかを説明し、具体的には以下のステップによって実現することができる。 Based on the above example, this example provides another training method of pre-training model, in order to improve the accuracy of calculating the first image feature distance, between training images belonging to the same video clip We describe how to determine the first image feature distance, which can be specifically achieved by the following steps.

このラウンドのトレーニングで事前トレーニングモデルに入力されたトレーニング画像に対して、同じビデオクリップに属する異なるトレーニング画像の画像特徴間のクラス内特徴距離を決定し、このラウンドのトレーニングでトレーニングセットから選択された少なくとも２つのビデオクリップに対して、クラス内特徴距離の合計を決定して、第１の画像特徴距離を取得し、第１の画像特徴距離によって同じビデオクリップに属する異なるトレーニング画像の画像特徴間の関連関係を示すことを実現する。 For the training images input to the pre-trained model in this round of training, the intra-class feature distances between the image features of different training images belonging to the same video clip are determined. For the at least two video clips selected from the training set in this round of training, the sum of the intra-class feature distances is determined to obtain the first image feature distance, so that the first image feature distance indicates the association relationship between the image features of different training images belonging to the same video clip.

本出願の実施形態の可能な一実施形態では、例えば、選択されたトレーニング画像ｉ１及びｉ２は同じビデオクリップｉに属し、トレーニング画像ｉ１及びｉ２を事前トレーニングモデルに入力して、各トレーニング画像の画像特徴を取得し、それぞれｈｉ１及びｈｉ２として示す。さらに、同じビデオクリップｉに属するトレーニング画像ｉ１とｉ２の画像特徴ｈｉ１とｈｉ２との間のクラス内特徴距離ｄ（ｉ１，ｉ２）を計算し、さらに、このラウンドのトレーニングでトレーニングセットから選択された少なくとも２つのビデオクリップに対して、クラス内特徴距離の合計を決定して、第１画像特徴距離ｄｉｓｔ（内）を取得し、具体的には、以下の式によって実現される。 In one possible implementation of the embodiment of the present application, for example, the selected training images i1 and i2 belong to the same video clip i. Training images i1 and i2 are input to the pre-trained model to obtain the image features of each training image, denoted hi1 and hi2 respectively. The intra-class feature distance d(i1, i2) between the image features hi1 and hi2 of training images i1 and i2 belonging to the same video clip i is then calculated. Finally, for the at least two video clips selected from the training set in this round of training, the sum of the intra-class feature distances is determined to obtain the first image feature distance dist(intra), specifically by the following formula.

ここで、ｉはビデオクリップであり、すなわち、ビデオクリップは１～ｎの自然数であり、ｎは２以上である。 Here, i indexes the video clips, that is, i is a natural number from 1 to n, where n is 2 or more.
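The formula itself does not survive in this text. A form consistent with the surrounding definitions (two training images per clip, clips indexed 1 to n, d as the intra-class feature distance) would be the following; it is a reconstruction offered as an assumption, not the published equation.

```latex
\[
  \mathrm{dist}_{\text{intra}} \;=\; \sum_{i=1}^{n} d\left(h_{i1},\, h_{i2}\right)
\]
```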

本出願の実施例の別の可能な実施形態では、異なるシナリオのニーズを満たすために、同じビデオクリップに属する異なるトレーニング画像の画像特徴に対して、異なるトレーニング画像の画像特徴を分類し、すなわち、異なるトレーニング画像の画像特徴を異なるカテゴリに分割して、細分化された特徴認識を実現する。例えば、人物カテゴリに属する画像特徴、建物に属する画像特徴、または鼻カテゴリに属する画像特徴を決定し、さらに、異なるトレーニング画像に対して、任意の２つのトレーニング画像の画像特徴における同じカテゴリに対応する特徴に対してそれぞれカテゴリ間特徴距離を計算し、さらに、すべてのカテゴリ間特徴距離を合計して、同じビデオクリップに属するクラス内特徴距離を取得する。さらに、このラウンドのトレーニングでトレーニングセットから選択された少なくとも２つのビデオクリップに対して、クラス内特徴距離の合計を決定して、第１の画像特徴距離を取得し、第１の画像特徴距離の計算の精度を実現し、第１の画像特徴距離の計算の正確性を向上させる。 In another possible embodiment of the examples of the present application, to meet the needs of different scenarios, classify the image features of different training images with respect to the image features of different training images belonging to the same video clip, i.e. The image features of different training images are divided into different categories to achieve refined feature recognition. For example, determine which image features belong to the People category, which belong to Buildings, or which belong to the Nose category, and furthermore, for different training images, we determine which image features correspond to the same category in the image features of any two training images. Compute the inter-category feature distance for each feature, and sum all the inter-category feature distances to get the intra-class feature distance belonging to the same video clip. Further, for at least two video clips selected from the training set in this round of training, determine the sum of the within-class feature distances to obtain the first image feature distance, Calculation precision is achieved to improve the accuracy of the calculation of the first image feature distance.
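The per-category variant described above might look like the sketch below. The representation is an assumption made for illustration: each image's features are held as a dictionary from a category name to a feature vector, and only categories present in both images contribute to the distance.

```python
import math


def per_category_distance(feat_a, feat_b):
    """Sum the feature distances over categories that both images share."""
    shared = feat_a.keys() & feat_b.keys()
    return sum(math.dist(feat_a[c], feat_b[c]) for c in shared)


# Hypothetical categorized features of two frames from the same video clip.
i1 = {"person": [0.0, 0.0], "building": [1.0, 1.0]}
i2 = {"person": [3.0, 4.0], "building": [1.0, 2.0], "nose": [9.0, 9.0]}
d = per_category_distance(i1, i2)
```

Here the "nose" features are ignored because only one of the two images has them; how unmatched categories should be treated is not specified in the text.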

なお、上記画像特徴距離は、ユークリッド距離またはコサイン距離に基づいて計算できる。 Note that the image feature distance can be calculated based on Euclidean distance or cosine distance.
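The two distance measures named above can be written out as follows. This is a plain-Python sketch; the function names are chosen for the example, and cosine distance is taken as one minus cosine similarity, which is a common convention rather than something the text specifies.

```python
import math


def euclidean_distance(u, v):
    """Square root of the summed squared differences between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)
```

Euclidean distance is sensitive to the magnitude of the feature vectors, whereas cosine distance depends only on their direction.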

上記実施例に基づいて、本実施例は、別の事前トレーニングモデルのトレーニング方法を提供し、第２の画像特徴距離の計算の精度を向上させるために、異なるビデオクリップに属するトレーニング画像間の第２の画像特徴距離をどのように決定するかを説明し、具体的には、以下のステップによって実現することができる。 Based on the above embodiment, this embodiment provides another training method for a pre-trained model and, in order to improve the accuracy of calculating the second image feature distance, describes how to determine the second image feature distance between training images belonging to different video clips. Specifically, this can be realized by the following steps.

このラウンドのトレーニングで事前トレーニングモデルに入力されたトレーニング画像に対して、異なるビデオクリップに属する異なるトレーニング画像の画像特徴間のクラス間特徴距離を決定し、このラウンドのトレーニングでトレーニングセットから選択された少なくとも２つのビデオクリップに対して、クラス間特徴距離の合計を決定して、第２の画像特徴距離を取得し、第２の画像特徴距離によって異なるビデオクリップに属する異なるトレーニング画像の画像特徴間の関連関係を示すことを実現する。 For the training images input to the pre-trained model in this round of training, the inter-class feature distances between the image features of different training images belonging to different video clips are determined. For the at least two video clips selected from the training set in this round of training, the sum of the inter-class feature distances is determined to obtain the second image feature distance, so that the second image feature distance indicates the association relationship between the image features of different training images belonging to different video clips.

本出願の実施例の可能な一実施形態では、例えば、選択されたトレーニング画像ｉ１及びｉ２は同じビデオクリップｉに属し、トレーニング画像ｊ１及びｊ２は同じビデオクリップｊに属し、トレーニング画像ｉ１及びｉ２を事前トレーニングモデルに入力して、各トレーニング画像の画像特徴を取得し、それぞれｈｉ１及びｈｉ２として示し、トレーニング画像ｊ１及びｊ２を事前トレーニングモデルに入力して、対応する画像特徴を取得し、それぞれｈｊ１及びｈｊ２として示す。さらに、異なるビデオクリップｉ及びｊに属するトレーニング画像の画像特徴間のクラス間特徴距離を計算し、さらに、このラウンドのトレーニングでトレーニングセットから選択された少なくとも２つのビデオクリップに対して、クラス間特徴距離の合計を決定して、第２の画像特徴距離ｄｉｓｔ（間）を取得する。具体的には、以下の式によって実現されることができる。 In one possible implementation of the embodiment of the present application, for example, the selected training images i1 and i2 belong to the same video clip i, and training images j1 and j2 belong to the same video clip j. Training images i1 and i2 are input to the pre-trained model to obtain their image features, denoted hi1 and hi2 respectively, and training images j1 and j2 are input to the pre-trained model to obtain the corresponding image features, denoted hj1 and hj2 respectively. The inter-class feature distances between the image features of training images belonging to the different video clips i and j are then calculated. Finally, for the at least two video clips selected from the training set in this round of training, the sum of the inter-class feature distances is determined to obtain the second image feature distance dist(inter). Specifically, this can be realized by the following formula.

ここで、ｉ及びｊは異なるビデオクリップであり、ｎは２以上であり、ｄ（ｈｉ１，ｈｊ１）は異なるビデオクリップｉ及びｊにおけるトレーニング画像の画像特徴ｈｉ１とｈｊ１との間のクラス間特徴距離であり、ｄ（ｈｉ１，ｈｊ２）とｄ（ｈｉ２，ｈｊ１）とｄ（ｈｉ２，ｈｊ２）は異なるビデオクリップｉ及びｊにおけるトレーニング画像の画像特徴間のクラス間特徴距離である。 where i and j are different video clips, n is greater than or equal to 2, and d(hi1, hj1) is the interclass feature distance between image features hi1 and hj1 of training images in different video clips i and j. where d(hi1, hj2), d(hi2, hj1) and d(hi2, hj2) are interclass feature distances between image features of training images in different video clips i and j.
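The formula itself does not survive in this text. A form consistent with the four distance terms listed above would be the following. It is a reconstruction offered as an assumption: in particular, the published equation may sum over all ordered pairs with i ≠ j rather than over unordered pairs i < j, which would only change the result by a constant factor of 2.

```latex
\[
  \mathrm{dist}_{\text{inter}} \;=\; \sum_{1 \le i < j \le n}
  \Bigl[\, d\left(h_{i1}, h_{j1}\right) + d\left(h_{i1}, h_{j2}\right)
        + d\left(h_{i2}, h_{j1}\right) + d\left(h_{i2}, h_{j2}\right) \Bigr]
\]
```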

なお、本実施例では、各ビデオクリップから２つのトレーニング画像を選択することを例として説明し、実際の応用では、各ビデオクリップにおける選択されたトレーニング画像の数は、トレーニングの需要に応じて柔軟に設定することができ、本実施例では限定されない。 It should be noted that in this embodiment, the selection of two training images from each video clip is taken as an example, and in actual application, the number of selected training images in each video clip is flexible according to the training needs. , and is not limited in this embodiment.

本出願の実施例の他の可能な実施形態では、異なるシナリオのニーズを満たすために、異なるビデオクリップに属する異なるトレーニング画像の画像特徴に対して、異なるトレーニング画像の画像特徴を分類することができ、すなわち、異なるトレーニング画像の画像特徴を異なるカテゴリに分割して、細分化された特徴認識を実現することができる。例えば、人物カテゴリに属する画像特徴、建物に属する画像特徴、または鼻カテゴリに属する画像特徴を決定し、さらに、異なるビデオクリップに属するトレーニング画像に対して、任意の２つのトレーニング画像の画像特徴における同じカテゴリに対応する特徴に対してそれぞれカテゴリ間特徴距離を計算し、さらに、すべてのカテゴリ間特徴距離を合計して、異なるビデオクリップに属する異なるトレーニング画像のクラス間特徴距離を取得する。さらに、このラウンドのトレーニングでトレーニングセットから選択された少なくとも２つのビデオクリップに対して、クラス間特徴距離の合計を決定して、第２の画像特徴距離を取得し、第２の画像特徴距離の計算の精度を実現し、第２の画像特徴距離の計算の正確性を向上させる。 In another possible embodiment of the embodiments of the present application, image features of different training images can be classified for image features of different training images belonging to different video clips to meet the needs of different scenarios. That is, the image features of different training images can be divided into different categories to achieve refined feature recognition. For example, determine which image features belong to the person category, which belong to buildings, or which belong to the nose category, and furthermore, for training images belonging to different video clips, for any two training images, the same Calculate the inter-category feature distances respectively for the features corresponding to the categories, and sum all the inter-category feature distances to obtain the inter-class feature distances of different training images belonging to different video clips. Further, for at least two video clips selected from the training set in this round of training, determine the sum of the inter-class feature distances to obtain a second image feature distance, Calculation precision is achieved to improve the accuracy of the calculation of the second image feature distance.

上記実施例を実現するために、本出願は、画像処理装置をさらに提供する。 In order to implement the above embodiments, the present application further provides an image processing device.

図５は、本出願の実施例によって提供される画像処理装置の概略構成図である。 FIG. 5 is a schematic configuration diagram of an image processing device provided by an embodiment of the present application.

図５に示すように、取得モジュール５１と、生成モジュール５２と、処理モジュール５３とを含む。 As shown in FIG. 5, it includes an acquisition module 51 , a generation module 52 and a processing module 53 .

取得モジュール５１は、トレーニングされた事前トレーニングモデルを取得し、当該事前トレーニングモデルは、トレーニングされた事前トレーニングモデルから出力された画像特徴が、第１の画像特徴距離と第２の画像特徴距離との差が最小であることを満たすように、複数のフレームのトレーニング画像を使用してトレーニングされ、第１の画像特徴距離は、同じビデオクリップから抽出されたトレーニング画像の画像特徴間の距離であり、第２の画像特徴距離は、異なるビデオクリップから抽出されたトレーニング画像の画像特徴間の距離である。 The acquisition module 51 acquires a trained pre-trained model. The pre-trained model is trained using multiple frames of training images so that the image features output by the trained pre-trained model satisfy the condition that the difference between a first image feature distance and a second image feature distance is minimized. The first image feature distance is the distance between the image features of training images extracted from the same video clip, and the second image feature distance is the distance between the image features of training images extracted from different video clips.

生成モジュール５２は、事前トレーニングモデルに基づいて、ターゲット画像処理タスクを実行する画像処理モデルを生成する。 Generation module 52 generates an image processing model that performs a target image processing task based on the pre-trained model.

処理モジュール５３は、画像処理モデルを使用して、ターゲット画像に対してターゲット画像処理タスクを実行する。 The processing module 53 uses the image processing model to perform target image processing tasks on the target image.

さらに、本出願の実施例の可能な一実施形態では、生成モジュール５２は、具体的には、ターゲット画像処理タスクに対応するネットワーク層を取得し、事前トレーニングモデルとネットワーク層をスプライシングし、ここで、ネットワーク層の入力は、事前トレーニングモデルから出力された画像特徴であり、ネットワーク層の出力は、ターゲット画像タスクの処理結果であり、ターゲット画像処理タスクのトレーニングセットを使用して、スプライシングされた事前トレーニングモデル及びネットワーク層をトレーニングして、画像処理モデルを取得する。 Further, in one possible embodiment of the examples of the present application, the generation module 52 specifically obtains the network layer corresponding to the target image processing task, splices the pre-trained model and the network layer, where , the input of the network layer is the image features output from the pre-trained model, the output of the network layer is the processing result of the target image task, and the training set of the target image processing task is used to generate the spliced pre- The training model and network layers are trained to obtain an image processing model.
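The splicing of the pre-trained model with a task-specific network layer can be pictured as simple function composition. The sketch below assumes both parts are plain callables; `toy_backbone` and `toy_classification_head` are hypothetical placeholders, and the subsequent fine-tuning on the target task's training set is omitted.

```python
from typing import Callable, List, Sequence

Features = List[float]


def splice(backbone: Callable[[Sequence[float]], Features],
           head: Callable[[Features], int]) -> Callable[[Sequence[float]], int]:
    """Feed the backbone's image features into the task-specific layer."""
    def image_processing_model(image: Sequence[float]) -> int:
        return head(backbone(image))
    return image_processing_model


def toy_backbone(image: Sequence[float]) -> Features:
    """Placeholder feature extractor: mean and maximum of the pixel values."""
    return [sum(image) / len(image), max(image)]


def toy_classification_head(features: Features) -> int:
    """Placeholder classification layer: thresholds the first feature."""
    return 1 if features[0] > 0.5 else 0


model = splice(toy_backbone, toy_classification_head)
label = model([0.9, 0.8, 0.7])
```

Swapping in a different head (for example, one for detection) while reusing the same backbone is what makes the general-purpose pre-trained model reusable across target tasks.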

本出願の実施例の可能な一実施形態では、ターゲット画像処理タスクは、画像分類タスク、ターゲット検出タスク、またはオブジェクト認識タスクを含む。 In one possible embodiment of the examples of the present application, the target image processing task comprises an image classification task, a target detection task or an object recognition task.

なお、上記画像処理方法の実施例についての説明は、本実施例の画像処理装置にも適用されており、原理は同じであり、ここでは説明を省略する。 The description of the embodiment of the image processing method is also applied to the image processing apparatus of this embodiment, and the principle is the same, so the description is omitted here.

本出願の実施例に係る画像処理装置では、トレーニングされた事前トレーニングモデルを取得し、ここで、当該事前トレーニングモデルは、トレーニングされた事前トレーニングモデルから出力された画像特徴が、第１の画像特徴距離と第２の画像特徴距離との差が最小であることを満たすように、複数のフレームのトレーニング画像を使用してトレーニングされ、さらに、汎用的な事前トレーニングモデル及びターゲット画像処理タスクに基づいて、対応する画像処理モデルを生成し、ターゲット処理タスクに対応する画像処理モデルの生成効率を向上させ、生成された画像処理モデルを使用して、ターゲット画像に対してターゲット画像処理タスクを実行し、画像処理モデルがターゲット画像処理タスクに対応するため、画像処理の効果と効率を向上させる。 An image processing apparatus according to an embodiment of the present application obtains a trained pre-trained model, wherein the pre-trained model is such that the image features output from the trained pre-trained model are first image features trained using multiple frames of training images to satisfy the minimum difference between the distance and the second image feature distance, and based on a generic pre-trained model and a target image processing task , generate a corresponding image processing model, improve the generation efficiency of the image processing model corresponding to the target processing task, use the generated image processing model to perform the target image processing task on the target image, Improving the effectiveness and efficiency of image processing because the image processing model corresponds to the target image processing task.

上記実施例を実現するために、本実施例は、事前トレーニングモデルのトレーニング装置を提供する。 In order to implement the above embodiments, this embodiment provides a pre-trained model training device.

図６は、本出願の実施例によって提供される事前トレーニングモデルのトレーニング装置の概略構成図である。図６に示すように、この装置は、取得モジュール６１と、抽出モジュール６２と、トレーニングモジュール６３とを含む。 FIG. 6 is a schematic structural diagram of a pre-trained model training device provided by an embodiment of the present application. As shown in FIG. 6, the device includes an acquisition module 61 , an extraction module 62 and a training module 63 .

取得モジュール６１は、複数のビデオクリップを取得する。 Acquisition module 61 acquires a plurality of video clips.

抽出モジュール６２は、複数のビデオクリップから複数フレームのトレーニング画像を抽出して、トレーニングセットを取得し、ここで、各ビデオクリップから少なくとも２フレームのトレーニング画像を抽出する。 Extraction module 62 extracts multiple frames of training images from multiple video clips to obtain a training set, where it extracts at least two frames of training images from each video clip.

トレーニングモジュール６３は、トレーニングセットを使用して、画像特徴抽出のための事前トレーニングモデルに対してマルチラウンドのトレーニングを実行し、ここで、各ラウンドのトレーニングは、トレーニングセットから、少なくとも２つのビデオクリップから抽出された各トレーニング画像を選択することと、このラウンドで選択された各トレーニング画像を事前トレーニングモデルに入力して、出力された画像特徴を取得することと、このラウンドで選択された各トレーニング画像の画像特徴に基づいて、同じビデオクリップに属するトレーニング画像間の第１の画像特徴距離、及び異なるビデオクリップに属するトレーニング画像間の第２の画像特徴距離を決定し、第１の画像特徴距離及び第２の画像特徴距離に基づいて、第１の画像特徴距離と第２の画像特徴距離との差が最小となるように、事前トレーニングモデルのモデルパラメータを調整することと、を含む。 Training module 63 uses the training set to perform multiple rounds of training on the pre-trained model for image feature extraction, where each round of training consists of at least two video clips from the training set. and inputting each training image selected in this round into the pre-training model to obtain the output image features, and each training image selected in this round Based on the image features of the images, determine a first image feature distance between training images belonging to the same video clip and a second image feature distance between training images belonging to different video clips; and adjusting model parameters of the pre-trained model based on the second image feature distance such that the difference between the first image feature distance and the second image feature distance is minimized.

本出願の実施例の可能な一実施形態では、トレーニングモジュール６３は、具体的に、このラウンドのトレーニングで前記事前トレーニングモデルに入力されたトレーニング画像に対して、同じビデオクリップに属する異なるトレーニング画像の画像特徴間のクラス内特徴距離を決定し、このラウンドのトレーニングで前記トレーニングセットから選択された少なくとも２つのビデオクリップに対して、前記クラス内特徴距離の合計を決定して、前記第１の画像特徴距離を取得する。 In one possible embodiment of the examples of the present application, the training module 63 specifically stores different training images belonging to the same video clip for training images input to said pre-trained model in this round of training. and for at least two video clips selected from the training set in this round of training, determining the sum of the intra-class feature distances between the image features of the first Get the image feature distance.

本出願の実施例の可能な一実施形態では、トレーニングモジュール６３は、具体的に、このラウンドのトレーニングで前記事前トレーニングモデルに入力されたトレーニング画像に対して、異なるビデオクリップに属する異なるトレーニング画像の画像特徴間のクラス間特徴距離を決定し、このラウンドのトレーニングで前記トレーニングセットから選択された少なくとも２つのビデオクリップに対して、前記クラス間特徴距離の合計を決定して、前記第２の画像特徴距離を取得する。 In one possible embodiment of the examples of the present application, the training module 63 specifically stores different training images belonging to different video clips for the training images input to said pre-trained model in this round of training. and for at least two video clips selected from the training set in this round of training, determining the sum of the interclass feature distances between the image features of the second Get the image feature distance.

本出願の実施例の可能な一実施形態では、各前記ビデオクリップから抽出されたトレーニング画像のフレーム数は同じである。 In one possible embodiment of the implementation of the present application, the number of frames of training images extracted from each said video clip is the same.

本出願の可能な一実施形態では、取得モジュール６１は、具体的に、複数のビデオを取得し、各前記ビデオにおける隣接する画像フレーム間のコンテンツの違いに基づいて分割処理を行って、各前記ビデオの複数のビデオクリップを取得する。 In one possible embodiment of the present application, the acquisition module 61 specifically acquires a plurality of videos and performs a segmentation process based on content differences between adjacent image frames in each said video to obtain each said video. Get multiple video clips of a video.
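Splitting a video into clips by the content difference between adjacent frames might be sketched as follows. The difference measure (mean absolute pixel difference) and the fixed `threshold` are assumptions chosen for the example; the text does not specify how the difference is computed.

```python
def split_into_clips(frames, threshold):
    """Start a new clip whenever adjacent frames differ by more than threshold.

    Each frame is a flat list of pixel values of equal length.
    """
    if not frames:
        return []
    clips = [[frames[0]]]
    for prev, cur in zip(frames, frames[1:]):
        # Mean absolute difference between the two adjacent frames.
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if diff > threshold:
            clips.append([cur])    # content changed: open a new clip
        else:
            clips[-1].append(cur)  # similar content: extend the current clip
    return clips


# Hypothetical two-pixel frames with two abrupt content changes.
frames = [[10, 10], [11, 10], [200, 190], [201, 191], [50, 60]]
clips = split_into_clips(frames, threshold=30)
```

With these values the video is split into three clips of two, two, and one frame.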

本出願の実施例に係る事前トレーニングモデルのトレーニング装置では、取得された複数のビデオクリップからそれぞれ少なくとも２フレームのトレーニング画像を抽出して、複数フレームのトレーニング画像を取得して、トレーニングセットを取得し、トレーニングセットを通じて画像特徴抽出のための事前トレーニングモデルに対してマルチラウンドのトレーニングを実行し、各ラウンドのトレーニングにおいて、トレーニング画像に基づいて、画像特徴を取得し、同じビデオクリップに属する画像の画像特徴に基づいて、画像間の第１の画像特徴距離を取得し、異なるビデオクリップに属する画像の画像特徴に基づいて、画像間の第２の画像特徴距離を取得し、第１の画像特徴距離と第２の画像特徴距離の差が最小となるように、事前トレーニングモデルのパラメータを継続的に調整し、汎用的な事前トレーニングモデルのトレーニングを実現し、事前トレーニングモデルによって認識された画像特徴の信頼性を向上させる。 In the pre-trained model training device according to the embodiment of the present application, at least two frames of training images are extracted from each of the acquired video clips to obtain multiple frames of training images that form a training set, and the training set is used to perform multiple rounds of training on the pre-trained model for image feature extraction. In each round of training, image features are obtained from the training images; a first image feature distance between images is obtained from the image features of images belonging to the same video clip; a second image feature distance between images is obtained from the image features of images belonging to different video clips; and the parameters of the pre-trained model are adjusted continually so that the difference between the first image feature distance and the second image feature distance is minimized. This trains a general-purpose pre-trained model and improves the reliability of the image features it recognizes.

上記実施例を実現するために、本出願の実施例は、電子機器を提供し、少なくとも１つのプロセッサと、前記少なくとも１つのプロセッサに通信可能に接続されるメモリと、を含み、前記メモリには、前記少なくとも１つのプロセッサによって実行可能な命令が記憶され、前記命令は、前記少なくとも１つのプロセッサが前記方法の実施例に記載の画像処理方法、または前記方法の実施例に記載のトレーニング方法を実行できるように、前記少なくとも１つのプロセッサによって実行される。 To implement the above embodiments, an embodiment of the present application provides an electronic device that includes at least one processor and a memory communicatively connected to the at least one processor. The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the image processing method or the training method described in the method embodiments above.

上記実施例を実現するために、本出願の実施例は、コンピュータ命令が記憶されている非一時的なコンピュータ読み取り可能な記憶媒体をさらに提供し、前記コンピュータ命令は、コンピュータに前記方法の実施例に記載の画像処理方法、または前記方法の実施例に記載のトレーニング方法を実行させる。 To implement the above embodiments, an embodiment of the present application further provides a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions cause a computer to perform the image processing method or the training method described in the method embodiments above.

本出願の実施例によれば、本出願は、電子機器及び読み取り可能な記憶媒体をさらに提供する。 According to embodiments of the present application, the present application further provides an electronic device and a readable storage medium.

本出願の実施例によれば、本出願は、コンピュータプログラムを提供し、コンピュータプログラムは、コンピュータに本出願によって提供される画像処理方法、または事前トレーニングモデルのトレーニング方法を実行させる。 According to an embodiment of the present application, the present application provides a computer program that causes a computer to perform the image processing method or the pre-trained model training method provided by the present application.

図７に示すように、本出願の実施形態に係る電子機器のブロック図である。電子機器は、ラップトップコンピュータ、デスクトップコンピュータ、ワークステーション、パーソナルデジタルアシスタント、サーバ、ブレードサーバ、メインフレームコンピュータ、及び他の適切なコンピュータなどの様々な形態のデジタルコンピュータを表すことを目的とする。電子機器は、パーソナルデジタルプロセッサ、携帯電話、スマートフォン、ウェアラブルデバイス、他の同様のコンピューティングデバイスなどの様々な形態のモバイルデバイスを表すこともできる。本明細書で示されるコンポーネント、それらの接続と関係、及びそれらの機能は単なる例であり、本明細書の説明及び／又は要求される本出願の実現を制限することを意図したものではない。 As shown in FIG. 7, it is a block diagram of an electronic device according to an embodiment of the present application. Electronic equipment is intended to represent various forms of digital computers such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronics can also represent various forms of mobile devices such as personal digital processors, mobile phones, smart phones, wearable devices, and other similar computing devices. The components, their connections and relationships, and their functionality illustrated herein are merely examples and are not intended to limit the description and/or required implementation of the application herein.

図７に示すように、当該電子機器は、１つ又は複数のプロセッサ７０１と、メモリ７０２と、高速インターフェースと低速インターフェースを含む、各コンポーネントを接続するためのインターフェースと、を含む。各コンポーネントは、異なるバスで相互に接続され、共通のマザーボードに取り付けられるか、又は必要に応じて他の方式で取り付けることができる。プロセッサは、外部入力／出力装置（インターフェースに結合されたディスプレイデバイスなど）にＧＵＩの図形情報をディスプレイするためにメモリ内又はメモリに記憶されている命令を含む、電子機器内で実行される命令を処理することができる。他の実施形態では、必要であれば、複数のプロセッサ及び／又は複数のバスを、複数のメモリとともに使用することができる。同様に、複数の電子機器を接続することができ、各電子機器は、部分的な必要な操作（例えば、サーバアレイ、ブレードサーバ、又はマルチプロセッサシステムとする）を提供する。図７では、１つのプロセッサ７０１を例とする。 As shown in FIG. 7, the electronic device includes one or more processors 701, memory 702, and interfaces for connecting components, including high speed and low speed interfaces. Each component is interconnected by a different bus and can be mounted on a common motherboard or otherwise mounted as desired. The processor executes instructions executed within the electronic device, including instructions in or stored in memory to display graphical information of the GUI on an external input/output device (such as a display device coupled to the interface). can be processed. In other embodiments, multiple processors and/or multiple buses can be used along with multiple memories, if desired. Similarly, multiple electronic devices can be connected, each providing a partial required operation (eg, being a server array, blade server, or multi-processor system). In FIG. 7, one processor 701 is taken as an example.

メモリ７０２は、本出願により提供される非一時的なコンピュータ読み取り可能な記憶媒体である。ここで、前記メモリには、少なくとも１つのプロセッサが本出願により提供される画像処理方法を実行するように、前記少なくとも１つのプロセッサによって実行可能な命令が記憶されている。本出願の非一時的なコンピュータ読み取り可能な記憶媒体には、コンピュータに本出願により提供される画像処理方法を実行させるためのコンピュータ命令が記憶されている。 Memory 702 is a non-transitory computer-readable storage medium provided by the present application. Here, the memory stores instructions executable by the at least one processor so that the at least one processor performs the image processing method provided by the present application. A non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the image processing method provided by the present application.

メモリ７０２は、非一時的なコンピュータ読み取り可能な記憶媒体として、本出願の実施例における画像処理方法に対応するプログラム命令／モジュール（例えば、図５に示す取得モジュール５１、生成モジュール５２、処理モジュール５３）のような、非一時的なソフトウェアプログラム、非一時的なコンピュータ実行可能なプログラム及びモジュールを記憶する。プロセッサ７０１は、メモリ７０２に記憶されている非一時的なソフトウェアプログラム、命令及びモジュールを実行することによって、サーバの様々な機能アプリケーション及びデータ処理を実行し、すなわち上記方法の実施例における画像処理方法を実現する。 As a non-transitory computer-readable storage medium, the memory 702 stores non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the image processing method in the embodiments of the present application (for example, the acquisition module 51, the generation module 52, and the processing module 53 shown in FIG. 5). The processor 701 performs the various functional applications and data processing of the server by executing the non-transitory software programs, instructions, and modules stored in the memory 702, thereby implementing the image processing method in the above method embodiments.

メモリ７０２は、プログラム記憶領域とデータ記憶領域とを含むことができ、ここで、プログラム記憶領域は、オペレーティングシステム、少なくとも１つの機能に必要なアプリケーションプログラムを記憶することができ、データ記憶領域は、画像処理方法の電子機器の使用によって作成されたデータなどを記憶することができる。また、メモリ７０２は、高速ランダムアクセスメモリを含むことができ、非一時的なメモリをさらに含むことができ、例えば、少なくとも１つのディスクストレージデバイス、フラッシュメモリデバイス、又は他の非一時的なソリッドステートストレージデバイスである。いくつかの実施例では、メモリ７０２は、プロセッサ７０１に対して遠隔に設定されたメモリを選択的に含むことができ、これらの遠隔メモリは、ネットワークを介してこの電子機器に接続されることができる。上記ネットワークの例は、インターネット、イントラネット、ローカルエリアネットワーク、モバイル通信ネットワーク、及びその組み合わせを含むが、これらに限定されない。 The memory 702 can include a program storage area and a data storage area, where the program storage area can store an operating system, application programs required for at least one function, and the data storage area can: Data created by the use of image processing method electronics and the like can be stored. Memory 702 may also include high speed random access memory and may further include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state memory device. A storage device. In some embodiments, memory 702 can optionally include memory configured remotely to processor 701, and these remote memories can be connected to the electronic device via a network. can. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

画像処理方法の電子機器は、入力装置７０３と出力装置７０４とをさらに含むことができる。プロセッサ７０１、メモリ７０２、入力装置７０３、及び出力装置７０４は、バス又は他の方式を介して接続することができ、図７では、バスによる接続を例とする。 The electronic device for the image processing method can further include an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703, and the output device 704 can be connected via a bus or in other ways; FIG. 7 takes connection by a bus as an example.

入力装置７０３は、入力された数字又は文字情報を受信し、この電子機器のユーザ設定及び機能制御に関するキー信号入力を生成することができ、例えば、タッチスクリーン、キーパッド、マウス、トラックパッド、タッチパッド、ポインティングスティック、１つ又は複数のマウスボタン、トラックボール、ジョイスティックなどの入力装置である。出力装置７０４は、ディスプレイデバイス、補助照明デバイス（例えば、ＬＥＤ）、及び触覚フィードバックデバイス（例えば、振動モータ）などを含むことができる。当該ディスプレイデバイスは、液晶ディスプレイ（ＬＣＤ）、発光ダイオード（ＬＥＤ）ディスプレイ、及びプラズマディスプレイを含むことができるが、これらに限定されない。いくつかの実施形態では、ディスプレイデバイスは、タッチスクリーンであってもよい。 The input device 703 can receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device; examples include a touch screen, keypad, mouse, trackpad, touchpad, pointing stick, one or more mouse buttons, trackball, and joystick. The output device 704 can include a display device, an auxiliary lighting device (for example, an LED), a tactile feedback device (for example, a vibration motor), and the like. The display device can include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some embodiments, the display device may be a touch screen.

本明細書で説明されるシステムと技術の様々な実施形態は、デジタル電子回路システム、集積回路システム、特定用途向けＡＳＩＣ（特定用途向け集積回路）、コンピュータハードウェア、ファームウェア、ソフトウェア、及び／又はそれらの組み合わせで実現することができる。これらの様々な実施形態は、１つ又は複数のコンピュータプログラムで実施され、当該１つ又は複数のコンピュータプログラムは、少なくとも１つのプログラマブルプロセッサを含むプログラム可能なシステムで実行及び／又は解釈されることができ、当該プログラマブルプロセッサは、専用又は汎用のプログラマブルプロセッサであってもよく、ストレージシステム、少なくとも１つの入力装置、及び少なくとも１つの出力装置からデータ及び命令を受信し、データ及び命令を当該ストレージシステム、当該少なくとも１つの入力装置、及び当該少なくとも１つの出力装置に伝送することができる。 Various embodiments of the systems and techniques described herein may be digital electronic circuit systems, integrated circuit systems, application specific integrated circuits (ASICs), computer hardware, firmware, software, and/or can be realized by a combination of These various embodiments are embodied in one or more computer programs, which can be executed and/or interpreted in a programmable system including at least one programmable processor. The programmable processor may be a dedicated or general-purpose programmable processor that receives data and instructions from the storage system, at least one input device, and at least one output device, and transmits data and instructions to the storage system, It can be transmitted to the at least one input device and the at least one output device.

These computing programs (also called programs, software, software applications, or code) include machine instructions for a programmable processor and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., a magnetic disk, an optical disk, a memory, or a programmable logic device (PLD)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide interaction with a user, the systems and techniques described herein can be implemented on a computer that has a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or haptic feedback), and input from the user can be received in any form (including acoustic input, speech input, and tactile input).

The systems and techniques described herein can be implemented in a computing system that includes a back-end component (e.g., a data server), a computing system that includes a middleware component (e.g., an application server), a computing system that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which the user can interact with embodiments of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, and front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

A computer system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other.

According to the technical solution of the embodiments of the present application, a trained pre-trained model is obtained, where the pre-trained model is trained using multiple frames of training images such that the image features it outputs satisfy the condition that the difference between the first image feature distance and the second image feature distance is minimal. A corresponding image processing model is then generated based on the general-purpose pre-trained model and the target image processing task, which improves the efficiency of generating the image processing model for the target image processing task. The generated image processing model is used to perform the target image processing task on the target image; because the image processing model corresponds to the target image processing task, both the effect and the efficiency of image processing are improved.
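As a purely illustrative aid, the training criterion summarized above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the present application: it assumes Euclidean distance between feature vectors, takes the first image feature distance as the sum of pairwise distances within the same video clip and the second as the sum of pairwise distances across different video clips, and reads "the difference is minimal" as minimizing the first distance minus the second. The function names `euclidean`, `feature_distances`, and `round_objective` are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def feature_distances(features, clip_ids):
    """Return (first_distance, second_distance) for one training round.

    features: one feature vector per training image selected in this round.
    clip_ids: identifier of the video clip each training image came from.
    """
    first = 0.0   # sum of distances between images from the same clip
    second = 0.0  # sum of distances between images from different clips
    for i, j in combinations(range(len(features)), 2):
        d = euclidean(features[i], features[j])
        if clip_ids[i] == clip_ids[j]:
            first += d
        else:
            second += d
    return first, second


def round_objective(features, clip_ids):
    """Quantity to minimize in this sketch: first distance minus second."""
    first, second = feature_distances(features, clip_ids)
    return first - second
```

In an actual training round, the feature vectors would be the outputs of the pre-trained model for the selected training images, and the model parameters would be adjusted, for example by gradient descent, to reduce this value.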

It should be noted that this electronic device can also implement the training method for a pre-trained model of the present application; the principle is the same, and the description is omitted here.

It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order; no limitation is imposed herein as long as the desired result of the technical solution disclosed in the present application can be achieved.

The above specific embodiments do not limit the protection scope of the present application. Those skilled in the art can make various modifications, combinations, sub-combinations, and substitutions depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims

An image processing method, comprising:
obtaining a trained pre-trained model, wherein the pre-trained model is trained using a plurality of frames of training images such that the image features output from the trained pre-trained model satisfy that the difference between a first image feature distance and a second image feature distance is minimal, the first image feature distance being the distance between image features of training images extracted from the same video clip, and the second image feature distance being the distance between image features of training images extracted from different video clips;
generating, based on the pre-trained model, an image processing model that performs a target image processing task; and
using the image processing model to perform the target image processing task on a target image,
wherein the number of frames of training images extracted from each of the video clips is the same.

The image processing method according to claim 1, wherein generating, based on the pre-trained model, the image processing model that performs the target image processing task comprises:
obtaining a network layer corresponding to the target image processing task;
splicing the pre-trained model and the network layer, wherein the input of the network layer is the image features output from the pre-trained model, and the output of the network layer is the processing result of the target image processing task; and
training the spliced pre-trained model and network layer using a training set for the target image processing task to obtain the image processing model.

The image processing method according to claim 1, wherein the target image processing task comprises an image classification task, a target detection task, or an object recognition task.

A training method for a pre-trained model, comprising:
obtaining a plurality of video clips;
extracting a plurality of frames of training images from the plurality of video clips to obtain a training set, wherein at least two frames of the training images are extracted from each of the video clips; and
performing multiple rounds of training on a pre-trained model for image feature extraction using the training set,
wherein each round of training comprises: selecting, from the training set, the training images extracted from at least two video clips; inputting each training image selected in this round into the pre-trained model to obtain the output image features; determining, based on the image features of each training image selected in this round, a first image feature distance between training images belonging to the same video clip and a second image feature distance between training images belonging to different video clips; and adjusting the model parameters of the pre-trained model based on the first image feature distance and the second image feature distance such that the difference between the first image feature distance and the second image feature distance is minimized, and
wherein the number of frames of training images extracted from each of the video clips is the same.

The training method according to claim 4, wherein determining the first image feature distance between training images belonging to the same video clip comprises:
for the training images input to the pre-trained model in this round of training, determining intra-class feature distances between image features of different training images belonging to the same video clip; and
determining the sum of the intra-class feature distances for the at least two video clips selected from the training set in this round of training to obtain the first image feature distance.

The training method according to claim 4, wherein determining the second image feature distance between training images belonging to different video clips comprises:
for the training images input to the pre-trained model in this round of training, determining inter-class feature distances between image features of different training images belonging to different video clips; and
determining the sum of the inter-class feature distances for the at least two video clips selected from the training set in this round of training to obtain the second image feature distance.

The training method according to any one of claims 4 to 6, wherein obtaining the plurality of video clips comprises:
obtaining a plurality of videos; and
performing a splitting process based on content differences between adjacent image frames in each of the videos to obtain a plurality of video clips of each of the videos.

An image processing apparatus, comprising:
an acquisition module for acquiring a trained pre-trained model, wherein the pre-trained model is trained using a plurality of frames of training images such that the image features output from the trained pre-trained model satisfy that the difference between a first image feature distance and a second image feature distance is minimal, the first image feature distance being the distance between image features of training images extracted from the same video clip, and the second image feature distance being the distance between image features of training images extracted from different video clips;
a generation module for generating, based on the pre-trained model, an image processing model that performs a target image processing task; and
a processing module for performing the target image processing task on a target image using the image processing model,
wherein the number of frames of training images extracted from each of the video clips is the same.

The image processing apparatus according to claim 8, wherein the generation module is configured to:
obtain a network layer corresponding to the target image processing task;
splice the pre-trained model and the network layer, wherein the input of the network layer is the image features output from the pre-trained model, and the output of the network layer is the processing result of the target image processing task; and
train the spliced pre-trained model and network layer using a training set for the target image processing task to obtain the image processing model.

The image processing apparatus according to claim 8, wherein the target image processing task comprises an image classification task, a target detection task, or an object recognition task.

A training apparatus for a pre-trained model, comprising:
an acquisition module for acquiring a plurality of video clips;
an extraction module for extracting a plurality of frames of training images from the plurality of video clips to obtain a training set, wherein at least two frames of the training images are extracted from each of the video clips; and
a training module for performing multiple rounds of training on a pre-trained model for image feature extraction using the training set,
wherein each round of training comprises: selecting, from the training set, the training images extracted from at least two video clips; inputting each training image selected in this round into the pre-trained model to obtain the output image features; determining, based on the image features of each training image selected in this round, a first image feature distance between training images belonging to the same video clip and a second image feature distance between training images belonging to different video clips; and adjusting the model parameters of the pre-trained model based on the first image feature distance and the second image feature distance such that the difference between the first image feature distance and the second image feature distance is minimized, and
wherein the number of frames of training images extracted from each of the video clips is the same.

The training apparatus according to claim 11, wherein the training module is configured to:
for the training images input to the pre-trained model in this round of training, determine intra-class feature distances between image features of different training images belonging to the same video clip; and
determine the sum of the intra-class feature distances for the at least two video clips selected from the training set in this round of training to obtain the first image feature distance.

The training apparatus according to claim 11, wherein the training module is configured to:
for the training images input to the pre-trained model in this round of training, determine inter-class feature distances between image features of different training images belonging to different video clips; and
determine the sum of the inter-class feature distances for the at least two video clips selected from the training set in this round of training to obtain the second image feature distance.

The training apparatus according to any one of claims 11 to 13, wherein the acquisition module is configured to obtain a plurality of videos and perform a splitting process based on content differences between adjacent image frames in each of the videos to obtain a plurality of video clips of each of the videos.

An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor,
wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the image processing method according to any one of claims 1 to 3 or the training method for a pre-trained model according to any one of claims 4 to 7.

A non-transitory computer-readable storage medium having computer instructions stored thereon, wherein the computer instructions cause a computer to perform the image processing method according to any one of claims 1 to 3 or the training method for a pre-trained model according to any one of claims 4 to 7.

A computer program that causes a computer to execute the image processing method according to any one of claims 1 to 3 or the training method for a pre-trained model according to any one of claims 4 to 7.