CN113627354B - Model training and video processing method, apparatus, device, and storage medium - Google Patents

Model training and video processing method, apparatus, device, and storage medium Download PDF

Info

Publication number
CN113627354B
CN113627354B CN202110926860.8A
Authority
CN
China
Prior art keywords
video
feature
video segment
target
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110926860.8A
Other languages
Chinese (zh)
Other versions
CN113627354A (en)
Inventor
吴文灏
黄登
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110926860.8A priority Critical patent/CN113627354B/en
Publication of CN113627354A publication Critical patent/CN113627354A/en
Application granted granted Critical
Publication of CN113627354B publication Critical patent/CN113627354B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a model training method, a video processing method, an apparatus, a device, and a storage medium, relates to the field of artificial intelligence, in particular to the technical fields of computer vision and deep learning, and can be used in smart city and intelligent traffic scenarios. The specific implementation scheme is as follows: extracting a first video segment, a second video segment, and a third video segment from a sample video set, wherein the first video segment and the second video segment are similar in appearance, and the second video segment and the third video segment have the same playing speed; extracting features of the first video segment, the second video segment, and the third video segment respectively using the target model to obtain a first feature, a second feature, and a third feature; determining a loss function based on a first distance between the first feature and the second feature and a second distance between the second feature and the third feature; and training the target model according to the loss function. This implementation can improve the quality of the extracted features and the performance of downstream tasks.

Description

Model training and video processing method, apparatus, device, and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to the technical fields of computer vision and deep learning, and more particularly to a model training method, a video processing method, an apparatus, a device, and a storage medium, which can be used in smart city and intelligent traffic scenarios.
Background
Video representation learning is a technique that helps a system automatically learn discriminative features from raw video. With the popularity of smartphones, recording video has become unprecedentedly easy, and video analysis has become one of the most active research topics. However, obtaining high-quality video labels requires a large amount of manual annotation, which consumes considerable manpower, material, and financial resources. In contrast, millions of unlabeled videos are freely available on the internet. Learning meaningful video representations from unlabeled video is therefore critical to video content understanding.
Disclosure of Invention
The present disclosure provides a model training and video processing method, apparatus, device, and storage medium.
According to a first aspect, there is provided a model training method comprising: extracting a first video segment, a second video segment and a third video segment from a sample video set, wherein the similarity of the appearance of the first video segment and the appearance of the second video segment is larger than a first preset threshold value, and the playing speed of the second video segment and the playing speed of the third video segment are the same; respectively extracting the characteristics of the first video segment, the second video segment and the third video segment by utilizing the target model to obtain the first characteristics of the first video segment, the second characteristics of the second video segment and the third characteristics of the third video segment; determining a loss function based on a first distance between the first feature and the second feature, and a second distance between the second feature and the third feature; and training the target model according to the loss function.
According to a second aspect, there is provided a video processing method comprising: acquiring a target video; extracting features of a target video by using a target model trained by the model training method as described in the first aspect, and determining target features of the target video; and processing the target video according to the target characteristics.
According to a third aspect, there is provided a model training apparatus comprising: the video clip extraction unit is configured to extract a first video clip, a second video clip and a third video clip from the sample video set, wherein the similarity of the appearance of the first video clip and the appearance of the second video clip is larger than a first preset threshold, and the playing speed of the second video clip and the playing speed of the third video clip are the same; the video feature extraction unit is configured to extract features of the first video segment, the second video segment and the third video segment respectively by utilizing the target model to obtain a first feature of the first video segment, a second feature of the second video segment and a third feature of the third video segment; a loss function determining unit configured to determine a loss function based on a first distance between the first feature and the second feature, and a second distance between the second feature and the third feature; and a target model training unit configured to train the target model according to the loss function.
According to a fourth aspect, there is provided a video processing apparatus comprising: a video acquisition unit configured to acquire a target video; a feature extraction unit configured to extract features of a target video using a target model trained by the model training method as described in the first aspect, and determine target features of the target video; and the video processing unit is configured to process the target video according to the target characteristics.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the first aspect or the method as described in the second aspect.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method as described in the first aspect or the method as described in the second aspect.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method as described in the first aspect or the method as described in the second aspect.
According to the technique of the present disclosure, the model can be trained in the feature space, so that more video-related information is retained, the quality of the features learned from unlabeled video data is improved, and the performance of downstream tasks is improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are intended to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of a model training method according to the present disclosure;
FIG. 3 is a flow chart of another embodiment of a model training method according to the present disclosure;
FIG. 4 is a schematic diagram of one embodiment of a model training method according to the present disclosure;
FIG. 5 is a flow chart of one embodiment of a video processing method according to the present disclosure;
FIG. 6 is a schematic diagram of one application scenario of a model training method, video processing method, according to the present disclosure;
FIG. 7 is a schematic diagram of the structure of one embodiment of a model training apparatus according to the present disclosure;
FIG. 8 is a schematic diagram of the architecture of one embodiment of a video processing device according to the present disclosure;
fig. 9 is a block diagram of an electronic device used to implement the model training method, video processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the model training method, the video processing method, the model training apparatus, or the video processing apparatus of the present disclosure may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a video playback class application, a video processing class application, and the like, may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices, including but not limited to smartphones, tablets, in-vehicle computers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. This is not specifically limited here.
The server 105 may be a server providing various services, such as a background server providing support for the models used on the terminal devices 101, 102, 103. The background server may train the model with training samples to obtain a target model, and feed the target model back to the terminal devices 101, 102, 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. This is not specifically limited here.
It should be noted that, the model training method provided in the embodiment of the present disclosure is generally performed by the server 105, and the video processing method may be performed by the terminal devices 101, 102, 103, or may be performed by the server 105. Accordingly, the model training apparatus is generally provided in the server 105, and the video processing apparatus may be provided in the terminal devices 101, 102, 103, or may be provided in the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a model training method according to the present disclosure is shown. The model training method of the embodiment comprises the following steps:
step 201, extracting a first video clip, a second video clip and a third video clip from a sample video set.
In this embodiment, the execution subject of the model training method (e.g., the server 105 shown in fig. 1) may first acquire a sample video set. The sample video set may include a plurality of sample videos. The execution body may extract the first video clip, the second video clip, and the third video clip from the sample video set. Here, the number of video frames included in each of the first video clip, the second video clip, and the third video clip may be relatively small. The similarity between the appearance of the first video clip and the appearance of the second video clip is greater than a first preset threshold, and the playing speed of the second video clip is the same as that of the third video clip. Video clips of similar appearance are understood to include substantially the same elements with similar relative positions between the elements. The playing speed may be understood as the speed at which the video clip is played to completion, which is related to the number of display frames per second (frames per second, FPS).
Step 202, extracting features of the first video segment, the second video segment and the third video segment by using the target model, so as to obtain a first feature of the first video segment, a second feature of the second video segment and a third feature of the third video segment.
In this embodiment, the execution body may use the target model to extract the features of the first video clip, the second video clip, and the third video clip, respectively. Here, the target model is the model to be trained, which may be used to extract features of video clips. The execution body may input the first video clip, the second video clip, and the third video clip into the target model, respectively, and the obtained outputs are the features of each video clip. Here, the feature of the first video clip is referred to as the first feature, the feature of the second video clip as the second feature, and the feature of the third video clip as the third feature.
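As an illustration of this step, the following is a minimal PyTorch sketch that uses a small 3D convolutional network as the target model; the class name TargetModel, the layer sizes, and the 128-dimensional output are assumptions made for readability, not the architecture required by the disclosure.

import torch
import torch.nn as nn

class TargetModel(nn.Module):
    """Hypothetical 3D-CNN encoder standing in for the target model."""
    def __init__(self, feature_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.projector = nn.Linear(32, feature_dim)

    def forward(self, clip):
        # clip: (batch, channels, frames, height, width)
        return self.projector(self.encoder(clip))

model = TargetModel()
# Dummy clips of shape (batch, 3, frames, height, width), for illustration only.
first_clip = torch.randn(4, 3, 8, 112, 112)
second_clip = torch.randn(4, 3, 8, 112, 112)
third_clip = torch.randn(4, 3, 8, 112, 112)
first_feature = model(first_clip)    # first feature
second_feature = model(second_clip)  # second feature
third_feature = model(third_clip)    # third feature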
Step 203, determining a loss function according to a first distance between the first feature and the second feature, and a second distance between the second feature and the third feature.
The execution body may calculate a first distance between the first feature and the second feature and a second distance between the second feature and the third feature, respectively, and determine a loss function according to the first distance and the second distance. Specifically, the execution body may weight the first distance and the second distance to determine the loss function. The weighting coefficients can be set according to the actual application scenario.
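A minimal sketch of this step, assuming the two distances are L2 distances (as in the embodiment of fig. 3) and that the weighting coefficients alpha and beta are hypothetical hyperparameters chosen for illustration:

import torch

def training_loss(first_feature, second_feature, third_feature, alpha=1.0, beta=1.0):
    # First distance: appearance term between the first and second features.
    first_distance = torch.norm(first_feature - second_feature, p=2, dim=-1)
    # Second distance: playing-speed term between the second and third features.
    second_distance = torch.norm(second_feature - third_feature, p=2, dim=-1)
    # Weighted sum; minimizing it pulls both feature pairs closer together.
    return (alpha * first_distance + beta * second_distance).mean()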
Step 204, training a target model according to the loss function.
After determining the loss function, the executing body can iteratively adjust parameters of the target model according to the loss function value until the training termination condition is met. It will be appreciated that the smaller the loss function value, the higher the performance of the target model.
The model training method provided by this embodiment of the present disclosure can train the model in the feature space, so that more video-related information is retained, the quality of the features learned from unlabeled video data is improved, and the performance of downstream tasks is improved.
With continued reference to fig. 3, a flow 300 of another embodiment of a model training method according to the present disclosure is shown. As shown in fig. 3, the method of the present embodiment may include the steps of:
step 301, a first sample video and a second sample video are selected from a set of sample videos.
In this embodiment, the executing body may first select the first sample video and the second sample video from the sample video set. Here, the first sample video and the second sample video are two different videos whose appearance similarity is greater than a second preset threshold. For example, the first sample video and the second sample video may be videos of the same target taken from different angles. Specifically, the execution subject may select the first sample video and the second sample video from the sample video set according to the titles of the sample videos, the shooting time, elements in the videos, and the like.
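As one illustrative sketch of this selection, assume each sample video carries a precomputed appearance descriptor (e.g., an averaged frame feature) and that cosine similarity stands in for the appearance similarity; the descriptor field and the threshold value are assumptions, not part of the disclosure:

import itertools
import numpy as np

def select_sample_pair(sample_videos, second_threshold=0.8):
    # sample_videos: list of dicts, e.g. {"path": str, "descriptor": np.ndarray}
    for a, b in itertools.combinations(sample_videos, 2):
        da, db = a["descriptor"], b["descriptor"]
        similarity = float(np.dot(da, db) / (np.linalg.norm(da) * np.linalg.norm(db) + 1e-8))
        if similarity > second_threshold:
            return a, b  # first sample video, second sample video
    return None  # no pair exceeds the second preset threshold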
In step 302, a first video segment and a second video segment are extracted from a first sample video.
In this embodiment, the execution body may extract the first video clip and the second video clip from the first sample video. Specifically, the executing body may extract two runs of consecutive video frames from the first sample video as the first video clip and the second video clip, respectively. For example, the execution body may use the 1st to 10th video frames of the first sample video as the first video clip and the 21st to 30th video frames as the second video clip.
In some optional implementations of this embodiment, the executing body may obtain the first video segment and the second video segment by:
in step 3021, a plurality of consecutive video frames are selected from the first sample video.
In step 3022, the plurality of video frames are divided into two video segments having the same number, so as to obtain a first video segment and a second video segment.
In this implementation, the executing body may select a plurality of consecutive video frames from the first sample video, for example the 1st to 16th video frames, and then divide these video frames into two video clips of equal length to obtain the first video clip and the second video clip. For example, the 1st to 8th video frames are used as the first video clip and the 9th to 16th video frames as the second video clip. Alternatively, the execution body may use the odd-numbered video frames of the 1st to 16th video frames as the first video clip and the even-numbered video frames as the second video clip.
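A minimal sketch of steps 3021-3022 under the 16-frame example, covering both the contiguous split and the odd/even interleaved variant mentioned above (frame lists are zero-indexed in the code):

def split_into_two_clips(frames, interleave=False):
    # frames: e.g. the 1st to 16th consecutive frames of the first sample video
    if interleave:
        first_clip = frames[0::2]   # odd-numbered frames (1st, 3rd, 5th, ...)
        second_clip = frames[1::2]  # even-numbered frames (2nd, 4th, 6th, ...)
    else:
        half = len(frames) // 2
        first_clip, second_clip = frames[:half], frames[half:]  # 1st-8th / 9th-16th
    return first_clip, second_clip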
In step 303, a third video segment is extracted from the second sample video.
In this embodiment, the executing body may extract the third video clip from the second sample video. Specifically, the execution body may set the number of video frames in the third video clip to be the same as the number of video frames in the first video clip or the second video clip.
In some optional implementations of this embodiment, the executing entity may obtain the third video segment by:
In step 3031, the number of display frames per second of the second video segment is determined.
In step 3032, the second sample video is sampled at the display frame number per second to obtain a third video segment.
In this implementation, the executing body may first determine the number of display frames per second for the second video clip. Then, the second sample video is sampled at the display frame number per second to obtain a third video segment. It will be appreciated that the play speeds of two video clips sampled at the same number of display frames per second are the same.
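An illustrative sketch of steps 3031-3032, where the per-second display frame count is represented by a temporal sampling stride; equating the two is an assumption made to keep the example concrete:

def sample_third_clip(second_sample_frames, clip_length, stride, start=0):
    # stride: the same temporal sampling interval used for the second video clip,
    # so the resulting third clip has the same playing speed.
    indices = range(start, start + clip_length * stride, stride)
    return [second_sample_frames[i] for i in indices]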
Step 304, performing data enhancement on the first video segment and the second video segment, and extracting features of the data-enhanced first video segment, the data-enhanced second video segment, and the third video segment using the target model to obtain the first feature, the second feature, and the third feature.
After the execution body obtains the three video clips, it may perform data enhancement on the first video clip and the second video clip, respectively. Here, data enhancement may include, but is not limited to: random cropping, random color disturbance, random blurring, and random flipping. Then, the execution body may use the target model to extract features from the data-enhanced first video segment, the data-enhanced second video segment, and the third video segment to obtain the first feature, the second feature, and the third feature. In this embodiment, applying data enhancement to the first video clip and the second video clip strengthens the model's ability to extract features.
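A minimal sketch of this data enhancement using torchvision transforms applied frame by frame; the parameter values are illustrative assumptions, and in practice the same random parameters would typically be shared across all frames of a clip so that the clip stays temporally consistent:

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(112),           # random cropping
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),  # random color disturbance
    transforms.GaussianBlur(kernel_size=5),      # random blurring
    transforms.RandomHorizontalFlip(),           # random flipping
    transforms.ToTensor(),
])

def enhance_clip(frames):
    # frames: list of PIL images forming the first or second video clip
    return [augment(frame) for frame in frames]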
Step 305, determining a loss function based on a first distance between the first feature and the second feature, and a second distance between the second feature and the third feature.
In this embodiment, the first distance represents the distance between two video clips in the appearance feature space, and the second distance represents the distance between two video clips in the playing speed feature space. Both distances may be represented by L2 distances. By minimizing the first distance, the execution body pulls the first feature and the second feature closer in the appearance feature space; by minimizing the second distance, it pulls the second feature and the third feature closer in the playing speed feature space.
Step 306, training the target model according to the loss function.
In this embodiment, the executing entity may use stochastic gradient descent (SGD) to optimize the network, continuously updating the network weights until the loss function converges, at which point training stops.
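A minimal sketch of this optimization, reusing the hypothetical TargetModel and training_loss from the earlier sketches and assuming a clip_loader that yields batches of (first, second, third) clips; the learning rate, momentum, and the simple convergence check are illustrative, not prescribed by the disclosure:

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for first_clip, second_clip, third_clip in clip_loader:  # assumed DataLoader
    loss = training_loss(model(first_clip), model(second_clip), model(third_clip))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()        # continuously update the network weights
    if loss.item() < 1e-3:  # stop once the loss has (approximately) converged
        break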
Fig. 4 shows a schematic diagram of the processing of a first video segment (a), a second video segment (b), and a third video segment (c). The first video segment (a) and the second video segment (b) are similar in appearance, and the playing speeds of the second video segment (b) and the third video segment (c) are the same. The execution subject calculates the distance between the features of the first video clip (a) and the second video clip (b) and the distance between the features of the second video clip (b) and the third video clip (c), respectively, generates a loss function from these distances, and trains the target model accordingly.
Step 307, fine-tuning the trained target model according to sample data of the downstream task.
In this embodiment, the execution body may further fine-tune the trained target model according to sample data of the downstream task. Specifically, if the downstream task is a classification task, the executing body may attach a classifier after the trained target model and optimize the network on the data set of the downstream task with a smaller learning rate. This prevents an excessively large learning rate from changing the parameters of the target model too much, which would instead degrade its performance.
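An illustrative sketch of this fine-tuning for a classification downstream task, again reusing the hypothetical model defined earlier; the class count, the labeled downstream_loader, and the two learning rates are assumptions. The design point is simply that the pretrained backbone is updated with a smaller learning rate than the newly added classifier head:

import torch
import torch.nn as nn

classifier = nn.Linear(128, 10)  # hypothetical 10-class downstream task
finetune_optimizer = torch.optim.SGD(
    [{"params": model.parameters()},                   # backbone: small global lr
     {"params": classifier.parameters(), "lr": 1e-3}],  # new head: larger lr
    lr=1e-4, momentum=0.9,
)
criterion = nn.CrossEntropyLoss()

for clips, labels in downstream_loader:  # assumed labeled downstream data
    logits = classifier(model(clips))
    loss = criterion(logits, labels)
    finetune_optimizer.zero_grad()
    loss.backward()
    finetune_optimizer.step()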
The model training method provided by this embodiment of the present disclosure does not require manually annotated video labels during training, which saves manpower and material resources and makes it possible to train on large-scale unlabeled data sets; more video-related information can be retained in the feature space during training; and the method does not rely on negative samples, which saves memory and reduces training cost, while fine-tuning on downstream tasks further improves the performance of the model.
With continued reference to fig. 5, a flow 500 of one embodiment of a video processing method according to the present disclosure is shown. As shown in fig. 5, the method of the present embodiment may include the steps of:
step 501, a target video is acquired.
In this embodiment, the execution subject may first acquire the target video. Here, the target video is a video to be processed.
Step 502, extracting characteristics of a target video by using a target model obtained through training by a model training method, and determining target characteristics of the target video.
In this embodiment, the executing body may input the target video into the target model obtained by the model training method described in the embodiment of fig. 2 or fig. 3, and obtain the feature of the target video, which is recorded as the target feature.
In step 503, the target video is processed according to the target feature.
The execution subject may continue to process the target video after the target feature is obtained. For example, a video search may be performed to determine whether the target video is similar to other videos. Alternatively, the target video may be classified to determine the category to which it belongs.
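As a sketch of one such downstream use, the following retrieves videos similar to the target video by cosine similarity between the target feature and a gallery of previously extracted features; the gallery tensor and the top-k value are assumptions made for illustration:

import torch
import torch.nn.functional as F

def retrieve_similar(target_feature, gallery_features, top_k=5):
    # target_feature: (D,) feature of the target video
    # gallery_features: (N, D) features of candidate videos
    sims = F.cosine_similarity(target_feature.unsqueeze(0), gallery_features)
    scores, indices = sims.topk(top_k)
    return list(zip(indices.tolist(), scores.tolist()))  # (gallery index, similarity)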
According to the video processing method provided by the embodiment of the disclosure, the high-quality characteristics of the target video can be extracted by using the trained target model, so that the accuracy of the downstream task result can be improved.
With continued reference to fig. 6, a schematic diagram of one application scenario of the model training method, video processing method according to the present disclosure is shown. In the application scenario of fig. 6, the server 601 performs the processing of steps 201 to 204 using a plurality of sample videos, and then obtains a trained object model. The target model is then sent to the terminal 602. The terminal 602 may perform video retrieval using the above-described object model to obtain a plurality of similar videos for viewing by the user.
With further reference to fig. 7, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of a model training apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 7, the model training apparatus 700 of the present embodiment includes: a video clip extraction unit 701, a video feature extraction unit 702, a loss function determination unit 703, and a target model training unit 704.
The video clip extraction unit 701 is configured to extract a first video clip, a second video clip, and a third video clip from the sample video set. The similarity of the appearance of the first video clip and the appearance of the second video clip is larger than a first preset threshold value, and the playing speed of the second video clip and the playing speed of the third video clip are the same.
The video feature extraction unit 702 is configured to extract features of the first video segment, the second video segment, and the third video segment by using the object model, so as to obtain a first feature of the first video segment, a second feature of the second video segment, and a third feature of the third video segment.
The loss function determining unit 703 is configured to determine a loss function according to a first distance between the first feature and the second feature, and a second distance between the second feature and the third feature.
The object model training unit 704 is configured to train the object model according to the loss function.
In some optional implementations of the present embodiment, the video clip extraction unit 701 may be further configured to: selecting a first sample video and a second sample video from the sample video set, wherein the appearance similarity of the first sample video and the second sample video is larger than a second preset threshold value; extracting a first video segment and a second video segment from the first sample video; and extracting a third video segment from the second sample video.
In some optional implementations of the present embodiment, the video clip extraction unit 701 may be further configured to: selecting a plurality of continuous video frames from the first sample video; dividing the plurality of video frames into two video clips with the same number to obtain a first video clip and a second video clip.
In some optional implementations of the present embodiment, the video clip extraction unit 701 may be further configured to: determining a number of frames per second of the second video segment; and sampling the second sample video with the display frame number per second to obtain a third video segment.
In some optional implementations of the present embodiment, the video feature extraction unit 702 may be further configured to: performing data enhancement on the first video segment and the second video segment; and extracting the characteristics of the first video segment after data enhancement, the second video segment after data enhancement and the third video segment by using the target model to obtain a first characteristic, a second characteristic and a third characteristic.
In some optional implementations of the present embodiment, the apparatus 700 may further include a fine tuning unit, not shown in fig. 7, configured to: and fine tuning the trained target model according to the sample data of the downstream task.
It should be understood that the units 701 to 704 described in the model training apparatus 700 correspond to the respective steps in the method described with reference to fig. 2. Thus, the operations and features described above with respect to the model training method are equally applicable to the apparatus 700 and the units contained therein, and are not described in detail herein.
With further reference to fig. 8, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of a video processing apparatus, which corresponds to the method embodiment shown in fig. 5, and which may be specifically applied to various electronic devices.
As shown in fig. 8, the video processing apparatus 800 of the present embodiment includes: a video acquisition unit 801, a feature extraction unit 802, and a video processing unit 803.
The video acquisition unit 801 is configured to acquire a target video.
The feature extraction unit 802 is configured to extract features of the target video using the target model trained by the model training method described in the embodiment of fig. 2 or fig. 3, and determine target features of the target video.
The video processing unit 803 is configured to process the target video according to the target feature.
It should be understood that the units 801 to 803 described in the video processing apparatus 800 correspond to the respective steps in the method described with reference to fig. 5. Thus, the operations and features described above with respect to the video processing method are equally applicable to the apparatus 800 and the units contained therein, and are not described in detail here.
In the technical scheme of the present disclosure, the acquisition, storage, and application of the user personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 shows a block diagram of an electronic device 900 that performs the model training method and the video processing method according to an embodiment of the disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the electronic device 900 includes a processor 901 that can perform various suitable actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a memory 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the electronic device 900 can also be stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An I/O interface (input/output interface) 905 is also connected to the bus 904.
A number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; memory 908, such as a magnetic disk, optical disk, etc.; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
Processor 901 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of processor 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 901 performs the various methods and processes described above, such as model training methods, video processing methods. For example, in some embodiments, the model training method, the video processing method may be implemented as a computer software program tangibly embodied on a machine-readable storage medium, such as the memory 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into RAM 903 and executed by processor 901, one or more steps of the model training method, video processing method described above may be performed. Alternatively, in other embodiments, processor 901 may be configured to perform model training methods, video processing methods, in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code described above may be packaged into a computer program product. These program code or computer program product may be provided to a processor or controller of a general purpose computer, special purpose computer or other programmable data processing apparatus such that the program code, when executed by the processor 901, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable storage medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions, improvements, etc. that are within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. A model training method, comprising:
selecting a first sample video and a second sample video from a sample video set, wherein the appearance similarity of the first sample video and the second sample video is larger than a second preset threshold;
selecting a plurality of continuous video frames from the first sample video;
dividing the plurality of video frames into two video clips with the same number to obtain a first video clip and a second video clip;
determining a number of frames per second of display of the second video segment;
sampling the second sample video with the display frame number per second to obtain a third video segment;
respectively extracting the characteristics of the first video segment, the second video segment and the third video segment by using a target model to obtain a first characteristic of the first video segment, a second characteristic of the second video segment and a third characteristic of the third video segment;
determining a loss function according to a first distance between the first feature and the second feature and a second distance between the second feature and the third feature;
and training the target model according to the loss function.
2. The method of claim 1, wherein the extracting the features of the first video segment, the second video segment, and the third video segment using the object model to obtain the first feature of the first video segment, the second feature of the second video segment, and the third feature of the third video segment, respectively, comprises:
performing data enhancement on the first video segment and the second video segment;
and extracting features of the first video segment after data enhancement, the second video segment after data enhancement and the third video segment by using the target model to obtain the first feature, the second feature and the third feature.
3. The method of claim 1, wherein the method further comprises:
and fine tuning the trained target model according to the sample data of the downstream task.
4. A video processing method, comprising:
acquiring a target video;
extracting features of the target video by using a target model trained by the model training method according to any one of claims 1 to 3, and determining target features of the target video;
and processing the target video according to the target characteristics.
5. A model training apparatus comprising:
a sample video selecting unit configured to select a first sample video and a second sample video from a sample video set, wherein an appearance similarity of the first sample video and the second sample video is greater than a second preset threshold;
a video frame selection unit configured to select a plurality of consecutive video frames from the first sample video;
the video frame dividing unit is configured to divide the plurality of video frames into two video clips with the same number to obtain a first video clip and a second video clip;
a display frame number determining unit configured to determine a display frame number per second of the second video clip;
a video frame sampling unit configured to sample the second sample video at the display frame number per second to obtain a third video segment;
a video feature extraction unit configured to extract features of the first video segment, the second video segment, and the third video segment, respectively, using a target model, to obtain a first feature of the first video segment, a second feature of the second video segment, and a third feature of the third video segment;
a loss function determining unit configured to determine a loss function according to a first distance between the first feature and the second feature and a second distance between the second feature and the third feature;
and a target model training unit configured to train the target model according to the loss function.
6. The apparatus of claim 5, wherein the video feature extraction unit is further configured to:
performing data enhancement on the first video segment and the second video segment;
and extracting features of the first video segment after data enhancement, the second video segment after data enhancement and the third video segment by using the target model to obtain the first feature, the second feature and the third feature.
7. The apparatus of claim 5, wherein the apparatus further comprises a trimming unit configured to:
and fine tuning the trained target model according to the sample data of the downstream task.
8. A video processing apparatus comprising:
a video acquisition unit configured to acquire a target video;
a feature extraction unit configured to extract features of the target video using the target model trained by the model training method of any one of claims 1 to 3, and determine target features of the target video;
and the video processing unit is configured to process the target video according to the target characteristics.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-3 or to perform the method of claim 4.
10. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-3 or the method of claim 4.
CN202110926860.8A 2021-08-12 2021-08-12 Model training and video processing method, apparatus, device, and storage medium Active CN113627354B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110926860.8A CN113627354B (en) 2021-08-12 2021-08-12 Model training and video processing method, apparatus, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110926860.8A CN113627354B (en) 2021-08-12 2021-08-12 Model training and video processing method, apparatus, device, and storage medium

Publications (2)

Publication Number Publication Date
CN113627354A CN113627354A (en) 2021-11-09
CN113627354B true CN113627354B (en) 2023-08-08

Family

ID=78385142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110926860.8A Active CN113627354B (en) 2021-08-12 2021-08-12 Model training and video processing method, apparatus, device, and storage medium

Country Status (1)

Country Link
CN (1) CN113627354B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154137A (en) * 2018-01-18 2018-06-12 厦门美图之家科技有限公司 Video features learning method, device, electronic equipment and readable storage medium storing program for executing
CN108427713A (en) * 2018-02-01 2018-08-21 宁波诺丁汉大学 A kind of video summarization method and system for homemade video
US10541000B1 (en) * 2015-06-26 2020-01-21 Amazon Technologies, Inc. User input-based video summarization
WO2020019926A1 (en) * 2018-07-27 2020-01-30 腾讯科技(深圳)有限公司 Feature extraction model training method and apparatus, computer device, and computer readable storage medium
CN110769314A (en) * 2019-11-20 2020-02-07 三星电子(中国)研发中心 Video playing method and device and computer readable storage medium
WO2020087979A1 (en) * 2018-10-30 2020-05-07 北京字节跳动网络技术有限公司 Method and apparatus for generating model
CN112650885A (en) * 2021-01-22 2021-04-13 百度在线网络技术(北京)有限公司 Video classification method, device, equipment and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7606425B2 (en) * 2004-09-09 2009-10-20 Honeywell International Inc. Unsupervised learning of events in a video sequence

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10541000B1 (en) * 2015-06-26 2020-01-21 Amazon Technologies, Inc. User input-based video summarization
CN108154137A (en) * 2018-01-18 2018-06-12 厦门美图之家科技有限公司 Video features learning method, device, electronic equipment and readable storage medium storing program for executing
CN108427713A (en) * 2018-02-01 2018-08-21 宁波诺丁汉大学 A kind of video summarization method and system for homemade video
WO2020019926A1 (en) * 2018-07-27 2020-01-30 腾讯科技(深圳)有限公司 Feature extraction model training method and apparatus, computer device, and computer readable storage medium
WO2020087979A1 (en) * 2018-10-30 2020-05-07 北京字节跳动网络技术有限公司 Method and apparatus for generating model
CN110769314A (en) * 2019-11-20 2020-02-07 三星电子(中国)研发中心 Video playing method and device and computer readable storage medium
CN112650885A (en) * 2021-01-22 2021-04-13 百度在线网络技术(北京)有限公司 Video classification method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Du Pengfei et al. "A Survey of Multimodal Visual-Language Representation Learning." Journal of Software, 2021, Vol. 32, No. 2, pp. 327-348. *

Also Published As

Publication number Publication date
CN113627354A (en) 2021-11-09

Similar Documents

Publication Publication Date Title
CN113326764B (en) Method and device for training image recognition model and image recognition
CN113159010B (en) Video classification method, device, equipment and storage medium
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
CN113033537A (en) Method, apparatus, device, medium and program product for training a model
US11164004B2 (en) Keyframe scheduling method and apparatus, electronic device, program and medium
CN113627536B (en) Model training, video classification method, device, equipment and storage medium
EP3865996A2 (en) Method and apparatus for testing response speed of on-board equipment, device and storage medium
CN113378855A (en) Method for processing multitask, related device and computer program product
CN110349161A (en) Image partition method, device, electronic equipment and storage medium
CN114186681A (en) Method, apparatus and computer program product for generating model clusters
CN113724398A (en) Augmented reality method, apparatus, device and storage medium
CN116935287A (en) Video understanding method and device
CN114882334B (en) Method for generating pre-training model, model training method and device
CN113627354B (en) A model training and video processing method, which comprises the following steps, apparatus, device, and storage medium
CN113139463B (en) Method, apparatus, device, medium and program product for training a model
CN112966723B (en) Video data augmentation method, video data augmentation device, electronic device and readable storage medium
CN115457174A (en) Model training method, model training device, motion mapping method, motion mapping device, electronic equipment and storage medium
CN114842541A (en) Model training and face recognition method, device, equipment and storage medium
CN114724144A (en) Text recognition method, model training method, device, equipment and medium
CN113963167A (en) Method, device and computer program product applied to target detection
CN115312042A (en) Method, apparatus, device and storage medium for processing audio
CN113920404A (en) Training method, image processing method, device, electronic device and storage medium
CN113657248A (en) Training method and device for face recognition model and computer program product
CN113033373A (en) Method and related device for training face recognition model and recognizing face
CN116866669A (en) Video recommendation method, apparatus and computer program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant