CN115147754A - Video frame processing method, video frame processing device, electronic device, storage medium, and program product


Info

Publication number
CN115147754A
Authority
CN
China
Prior art keywords
node
modal
video
image
nodes
Prior art date
Legal status
Pending
Application number
CN202210602387.2A
Other languages
Chinese (zh)
Inventor
舒秀军
许良晟
文伟
谯睿智
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210602387.2A
Publication of CN115147754A
Legal status: Pending

Classifications

    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V10/7635 Image or video recognition using machine-learning clustering based on graphs, e.g. graph cuts or spectral clustering
    • G06V10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V40/168 Human faces: feature extraction; face representation
    • G06V40/70 Multimodal biometrics, e.g. combining information from different biometric modalities

Abstract

The embodiment of the invention discloses a video frame processing method, a video frame processing device, an electronic device, a storage medium and a program product. The method includes: obtaining a plurality of modal images corresponding to each video frame in a video to be processed; performing modal feature extraction on the modal images to obtain the image features of each modal image; determining the nodes corresponding to the image features in the image feature space of each modality; determining the valid neighbor nodes of each node; performing node clustering according to each node and its valid neighbor nodes to obtain at least one node cluster set; generating the video content set corresponding to each node cluster set based on the video frames corresponding to the nodes in that set; and determining and fusing the video content sets corresponding to the plurality of modalities of the same object to obtain a processing result of at least one object in the video to be processed. The video content set of the same object can thus be determined from the video, and the accuracy of processing the video frames is improved.

Description

Video frame processing method, video frame processing device, electronic device, storage medium, and program product
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a video frame processing method, apparatus, electronic device, storage medium, and program product.
Background
Video usually contains rich content information, and the same video may contain video frames of multiple objects. In recent years, with the popularization of artificial intelligence, processing the video frames of different objects in a video has found wide application in scenarios such as media and search, for example, celebrity search and automatic video editing.
At present, the main method adopted when determining the video frames corresponding to different objects in a video is to cluster the video frames in the video based on the facial features of the objects. However, this method relies on single-modality facial features, so a large number of otherwise valid video frames, such as those showing a lowered head, a long-range view, or a person's back, cannot be processed. Therefore, the effect of determining the video frames of different objects in a video in the related art is not ideal, and the processing accuracy is low.
Disclosure of Invention
Embodiments of the present invention provide a video frame processing method, an apparatus, an electronic device, a storage medium, and a program product, which can determine a video content set of a same object from a video by combining features of multiple modes, thereby improving accuracy of determining video frames of each object in the video.
The embodiment of the invention provides a video frame processing method, which comprises the following steps:
acquiring a plurality of modal images corresponding to each video frame in a video to be processed based on the video to be processed, wherein the plurality of modal images of the same video frame respectively represent video contents corresponding to the video frames in different modalities;
performing modal feature extraction on the modal images to obtain image features corresponding to the modal images of all the modalities, and determining nodes corresponding to the image features under all the modalities in an image feature space corresponding to each modality;
determining valid neighbor nodes of each node in each image feature space, wherein a node and its valid neighbor nodes are mutually associated nodes, and if one node is an associated node of another node, the distance between the two nodes needs to meet a preset node association condition;
performing node clustering on each node in each image feature space according to each node and the effective neighbor node corresponding to the node to obtain at least one node clustering set in each mode;
generating a video content set corresponding to each node cluster set based on the video frames corresponding to the nodes in each node cluster set;
determining video content sets corresponding to multiple modalities of the same object, and fusing the video content sets corresponding to the same object to obtain a processing result of at least one object in the video to be processed.
Correspondingly, an embodiment of the present invention further provides a video frame processing apparatus, including:
the image acquisition unit is used for acquiring a plurality of modal images corresponding to each video frame in the video to be processed based on the video to be processed, wherein the plurality of modal images of the same video frame respectively represent video contents corresponding to the video frames in different modalities;
the characteristic extraction unit is used for carrying out modal characteristic extraction on the modal images to obtain image characteristics corresponding to the modal images of all the modalities, and determining nodes corresponding to the image characteristics under all the modalities in an image characteristic space corresponding to each modality;
a node determining unit, configured to determine valid neighbor nodes of each node in each image feature space, where a node and its valid neighbor nodes are mutually associated nodes, and if one node is an associated node of another node, the distance between the two nodes needs to satisfy a preset node association condition;
the node clustering unit is used for carrying out node clustering on each node in each image characteristic space according to each node and the effective neighbor node corresponding to the node to obtain at least one node clustering set under each mode;
the set generating unit is used for generating a video content set corresponding to each node cluster set based on the video frames corresponding to the nodes in each node cluster set;
and the video processing unit is used for determining video content sets corresponding to a plurality of modalities of the same object, and fusing the video content sets corresponding to the same object to obtain a processing result of at least one object in the video to be processed.
Optionally, the node clustering unit is configured to generate a subgraph in which each node in each image feature space is a central node and the effective neighboring node corresponding to the node is another node according to each node and the effective neighboring node corresponding to the node;
acquiring a graph clustering network corresponding to each mode;
and carrying out graph clustering processing on each subgraph through the graph clustering networks of corresponding modes to obtain at least one node clustering set under each mode.
Optionally, the node clustering unit is configured to perform node feature update on each node according to each node and the valid neighbor node corresponding to the node, so as to obtain each updated node;
and performing node clustering on each updated node in each image feature space based on the similarity between each updated node and other updated nodes in the image feature space to obtain at least one node clustering set in each mode.
Optionally, the node clustering unit is configured to calculate, according to each node and the valid neighbor node corresponding to the node, a spatial distance between each node and the corresponding valid neighbor node in the corresponding image feature space;
determining an association weight between each of the nodes and the corresponding valid neighbor node based on each of the spatial distances;
and carrying out node feature updating calculation according to the node features corresponding to the nodes, the node features of the effective neighbor nodes corresponding to the nodes and the association weights to obtain updated nodes.
Optionally, the node clustering unit is configured to determine, based on each updated node, a preset number of updated nodes as cluster center nodes in each image feature space;
obtaining the similarity between each updated node and each cluster center node in each image feature space;
dividing each updated node into a cluster where a corresponding target center node is located, wherein the similarity between the updated node and the corresponding target center node is not lower than a preset similarity threshold;
selecting a new cluster center node of each cluster based on the updated nodes in each cluster, and returning to the step of obtaining the similarity between each updated node and each cluster center node in each image feature space until a clustering end condition is met;
and respectively determining the updated nodes in each cluster as a node cluster set.
Optionally, the video frame processing apparatus provided in the embodiment of the present invention further includes a graph network training unit, configured to perform graph clustering processing on sample sub-graphs in each sample image feature space through a graph clustering network to be trained corresponding to each modality, to obtain at least one training node cluster set in each modality, where each sample sub-graph is annotated with a reference cluster set result;
based on the training node clustering set and the reference clustering set result in each mode, respectively calculating the loss of the graph clustering network to be trained corresponding to each mode;
and adjusting the network parameters of each graph clustering network to be trained according to the loss to obtain the trained graph clustering network corresponding to each mode.
Optionally, the feature extraction unit is configured to map the modal images into modal feature vector spaces corresponding to different modalities through feature mapping parameters of a shared modal feature extraction model, and obtain image features corresponding to the modal images of the different modalities based on a mapping result, where the shared modal feature extraction model is obtained by training based on a multi-modal image sample set, where the multi-modal image sample set includes a plurality of sample modal images representing image contents in different modalities.
Optionally, the video frame processing apparatus provided in the embodiment of the present invention further includes a model training unit, configured to obtain a multi-modal image sample set, where the multi-modal image sample set includes a plurality of sample modal images representing image contents in different modalities, and each sample modal image is labeled with a reference modality;
performing modal feature extraction on each sample modal image in the multi-modal image sample set through a shared modal feature extraction model to be trained to obtain sample image features corresponding to each sample modal image;
performing modal classification on the sample image features through a modal classification model to obtain training modalities corresponding to the sample image features;
calculating the loss of the shared modal feature extraction network to be trained based on the training modality and the reference modality of each sample image feature;
and adjusting the model parameters of the shared modal feature extraction model to be trained according to the loss to obtain the trained shared modal feature extraction network.
Optionally, the model training unit is configured to calculate, based on the training modality and the reference modality of each sample image feature, a corresponding loss of the to-be-trained shared modality feature extraction network in each modality respectively;
and calculating the total loss of the shared modal feature extraction network as the loss of the shared modal feature extraction network to be trained on the basis of the corresponding loss in each mode.
Optionally, the video processing unit is configured to determine video time information of each video content set according to video frame ordering information of a video frame corresponding to each video content set in the video to be processed;
and performing matching according to the video time information of each video content set, and taking the successfully matched video content sets in different modalities as the video content sets corresponding to the plurality of modalities of the same object.
Optionally, the image obtaining unit is configured to obtain, based on a video to be processed, each video frame in the video to be processed and audio information corresponding to each video frame;
extracting a face image and a limb image from each video frame to obtain a face modal image and a limb modal image corresponding to each video frame;
performing voiceprint analysis on the audio information corresponding to each video frame to obtain an audio mode image corresponding to each audio information;
and taking the face modal image, the limb modal image and the audio modal image corresponding to each video frame as a plurality of modal images corresponding to each video frame.
Optionally, the set generating unit is configured to determine, based on the nodes in each node cluster set under the face modality, the face modality extraction area of each face modality image corresponding to each node in the corresponding video frame;
determining, based on the nodes in each node cluster set under the limb modality, the limb modality extraction area of each limb modality image corresponding to each node in the corresponding video frame;
calculating the intersection ratio between each facial modality extraction area and each limb modality extraction area;
determining a video frame set of the same object based on the intersection ratio, the video frame corresponding to each face modal image and the video frame corresponding to each limb modal image;
determining an audio information set corresponding to each node cluster set according to a video frame corresponding to a node in each node cluster set in an audio mode;
and taking the video frame set and the audio information set as a video content set.
Correspondingly, the embodiment of the invention also provides electronic equipment, which comprises a memory and a processor; the memory stores an application program, and the processor is used for running the application program in the memory to execute the steps in any video frame processing method provided by the embodiment of the invention.
Accordingly, the embodiment of the present invention further provides a computer-readable storage medium, where a plurality of instructions are stored, and the instructions are suitable for being loaded by a processor to perform the steps in any one of the video frame processing methods provided by the embodiment of the present invention.
Furthermore, the present invention also provides a computer program product, which includes a computer program or instructions, and when the computer program or instructions are executed by a processor, the computer program or instructions implement the steps in any video frame processing method provided by the embodiments of the present invention.
By adopting the scheme of the embodiment of the invention, a plurality of modal images corresponding to each video frame in a video to be processed can be obtained based on the video to be processed, where the plurality of modal images of the same video frame respectively represent the video content of that video frame in different modalities; modal feature extraction is performed on the modal images to obtain the image features corresponding to the modal images of each modality, and the nodes corresponding to the image features of each modality are determined in the image feature space corresponding to that modality; the valid neighbor nodes of each node are determined in each image feature space, where a node and its valid neighbor nodes are mutually associated nodes, and if one node is an associated node of another node, the distance between the two nodes needs to meet a preset node association condition; node clustering is performed on the nodes in each image feature space according to each node and its corresponding valid neighbor nodes to obtain at least one node cluster set in each modality; a video content set corresponding to each node cluster set is generated based on the video frames corresponding to the nodes in that node cluster set; and the video content sets corresponding to the plurality of modalities of the same object are determined and fused to obtain a processing result of at least one object in the video to be processed. In the embodiment of the invention, multi-modal features are combined, and the nodes in the same image feature space are preliminarily screened based on the node association conditions under different modalities, so that unreliable erroneous connections between nodes in the clustering process are reduced, the video content set of the same object can be determined from the video, and the accuracy of processing the video frames is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a scene schematic diagram of a video frame processing method according to an embodiment of the present invention;
fig. 2 is a flowchart of a video frame processing method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a training process of a shared feature extraction model according to an embodiment of the present invention;
fig. 4 is a schematic diagram of determining valid neighbor nodes according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a technical implementation of video frame processing according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating a user login using authorization according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a video frame processing apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a video frame processing apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
The embodiment of the invention provides a video frame processing method and device, electronic equipment and a computer readable storage medium. In particular, embodiments of the present invention provide a video frame processing method suitable for a video frame processing apparatus, which may be integrated in an electronic device.
The electronic device may be a terminal or other devices, including but not limited to a mobile terminal and a fixed terminal, for example, the mobile terminal includes but is not limited to a smart phone, a smart watch, a tablet computer, a notebook computer, a smart car, and the like, wherein the fixed terminal includes but is not limited to a desktop computer, a smart television, and the like.
The electronic device may also be a device such as a server, and the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), and a big data and artificial intelligence platform, but is not limited thereto.
The video frame processing method of the embodiment of the invention can be realized by a server or realized by a terminal and the server together.
The following describes the method by taking an example in which the terminal and the server implement the video frame processing method together.
As shown in fig. 1, the video frame processing system provided by the embodiment of the present invention includes a terminal 10, a server 20, and the like; the terminal 10 and the server 20 are connected through a network, for example, a wired or wireless network connection, wherein the terminal 10 may be a terminal for a user to initiate a video frame processing request, and is configured to send the video frame processing request to the server 20.
Alternatively, the terminal 10 may exist as a terminal that transmits a video to be processed to the server 20.
The server 20 may be configured to obtain a plurality of modal images corresponding to each video frame in a video to be processed based on the video to be processed, where the plurality of modal images of the same video frame respectively represent the video content of that video frame in different modalities; perform modal feature extraction on the modal images to obtain the image features corresponding to the modal images of each modality, and determine the nodes corresponding to the image features of each modality in the image feature space corresponding to that modality; determine the valid neighbor nodes of each node in each image feature space, where a node and its valid neighbor nodes are mutually associated nodes, and if one node is an associated node of another node, the distance between the two nodes needs to satisfy a preset node association condition; and perform node clustering on the nodes in each image feature space according to each node and its corresponding valid neighbor nodes, to obtain at least one node cluster set in each modality.
The server 20 may be configured to generate a video content set corresponding to each node cluster set based on video frames corresponding to nodes in each node cluster set, determine video content sets corresponding to multiple modalities of the same object, and fuse the video content sets corresponding to the same object to obtain a processing result of at least one object in the video to be processed.
It is understood that, in some embodiments, the steps of the video frame processing performed by the server 20 may also be performed by the terminal 10, which is not limited by the embodiment of the present invention.
The following are detailed descriptions. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
Embodiments of the present invention will be described from the perspective of a video frame processing apparatus, which may be specifically integrated in a server or a terminal.
As shown in fig. 2, a specific flow of the video frame processing method of this embodiment may be as follows:
201. based on a video to be processed, a plurality of modal images corresponding to each video frame in the video to be processed are obtained, wherein the plurality of modal images of the same video frame respectively represent video contents corresponding to the video frames in different modalities.
The video to be processed is a video from which the video content corresponding to at least one object needs to be determined. The video to be processed may include video frames related to at least one object. For example, the video to be processed may include two different objects, namely a person and a cat.
Specifically, the content in the video to be processed may be a person, an animal, a plant, and the like, and the content type of the video to be processed is not limited in the embodiment of the present invention.
In the embodiment of the present invention, a modality may be understood as a dimension capable of representing part or all of features of an object in a video to be processed. For example, for an animal, the modality may be the animal's head, torso, limbs, voice, etc.; for plants, the modality may be a trunk, flower, fruit, leaf, etc.
The modal image is an image representing video content of a video frame of the video to be processed in a certain modality. Generally, only the content in one modality is characterized in the same modality image.
The video content may be content in a video frame of the video to be processed, or may be audio content in the video to be processed.
It should be noted that there may be only one modal image corresponding to a certain video frame or some video frames in the to-be-processed video. For example, a video to be processed is processed by using a person as a processing standard, but some video frames only contain a landscape and do not contain a person, and at this time, a modality image corresponding to the video frame may only be one modality image obtained according to an audio corresponding to the video frame.
In some examples, taking the modalities including a face modality, a limb modality, and a sound modality as an example, the step "obtaining, based on the video to be processed, a plurality of modality images corresponding to each video frame in the video to be processed" may specifically include:
acquiring each video frame in the video to be processed and audio information corresponding to each video frame based on the video to be processed;
performing voiceprint analysis on the audio information corresponding to each video frame to obtain an audio mode image corresponding to each audio information;
extracting a face image and a limb image from each video frame to obtain a face modal image and a limb modal image corresponding to each video frame;
and taking the face modal image, the limb modal image and the audio modal image corresponding to each video frame as a plurality of modal images corresponding to each video frame.
The face mode image is an image representing the video content of a video frame of the video to be processed in the face mode; the body modal image is an image representing the video content of a video frame of the video to be processed in the body mode; the audio mode image is an image representing video content corresponding to a video frame of the video to be processed in the audio mode.
For example, as shown in fig. 3, a face modality image for a face modality, a limb modality image for a limb modality, and an audio modality image corresponding to the video frame may be separated from the video frame.
Specifically, the voiceprint analysis of the audio information corresponding to each video frame may be performed by extracting the voiceprint through a voiceprint analysis program and then converting the obtained voiceprint into a logarithmic Mel spectrum, so as to obtain an audio modal image in a picture data format. Alternatively, the extracted sound spectrum may be subjected to processing such as framing, windowing, filtering and Fourier transform, and then converted into an image format.
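As an illustration of this step, the sketch below converts the audio segment aligned with a video frame into a log-Mel spectrogram image; librosa is used here only as an example toolkit, and the parameter values (sampling rate, FFT size, number of Mel bands) are illustrative assumptions rather than values prescribed by this embodiment.

```python
# A minimal sketch (not the claimed implementation): turn the audio aligned with a
# video frame into a log-Mel spectrogram "audio modal image".
# librosa and all parameter values below are illustrative assumptions.
import numpy as np
import librosa

def audio_to_modal_image(audio: np.ndarray, sr: int = 16000, n_mels: int = 64) -> np.ndarray:
    # Short-time spectral analysis with a Mel filter bank (framing/windowing handled internally).
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=512, hop_length=160, n_mels=n_mels)
    # Convert power to decibels, i.e. the logarithmic Mel spectrum.
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Normalize to [0, 255] so it can be stored and consumed as picture data.
    img = (log_mel - log_mel.min()) / (log_mel.ptp() + 1e-8) * 255.0
    return img.astype(np.uint8)
```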
It is understood that several face modality images and several limb modality images may be extracted from the same video frame, and the number of the face modality images and the number of the limb modality images extracted from the same video frame may be the same or different.
For example, if the whole body of multiple persons can be included in the same video frame, face modality images and limb modality images corresponding to the multiple persons can be obtained; or, only one person's limb may be included in the same video frame, and then only one person's corresponding limb modality image is obtained.
In the embodiment of the present invention, in order to implement processing of video frames, an artificial intelligence technique may be applied. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, and the like.
202. And performing modal feature extraction on the modal images to obtain image features corresponding to the modal images of the various modalities, and determining nodes corresponding to the image features under the various modalities in an image feature space corresponding to each modality.
The image feature space is generally understood to be a vector space constructed based on the modality of the modality image. In some optional examples, the image feature space may be a multi-modal shared vector space. In other examples, each image feature space may correspond to one modality.
In some embodiments, the step of "performing modal feature extraction on the modal image" may be implemented by applying a Scale-Invariant Feature Transform (SIFT) feature extraction method, a Histogram of Oriented Gradients (HOG) feature extraction method, or the like.
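For instance, a hand-crafted HOG descriptor for a single modal image could be computed as sketched below; scikit-image is used purely as an example library, and the cell and block sizes are illustrative assumptions.

```python
# A minimal sketch of hand-crafted modal feature extraction using HOG.
# scikit-image is an example choice; the parameters are illustrative assumptions.
from skimage.color import rgb2gray
from skimage.feature import hog

def extract_hog_feature(modal_image):
    gray = rgb2gray(modal_image)                      # HOG is computed on a grayscale image here
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)
```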
In other embodiments, the step of "performing modal feature extraction on the modal image" may be implemented by a modal feature extraction model. For example, different modality feature extraction models may be set for modality images of different modalities to perform feature extraction on the modality images.
For example, the modal feature extraction model may include a facial modal feature extraction model that performs feature extraction on a modal image of a facial modality, a limb modal feature extraction model that performs feature extraction on a modal image of a limb modality, an audio modal feature extraction model that performs feature extraction on a modal image of an audio modality, and the like.
It can be understood that if a corresponding modality feature extraction model is set for each modality, in an actual application process, a large amount of storage space is occupied, and the requirement on computing resources during feature extraction is also high. Therefore, in an embodiment of the present invention, a method for extracting features of images in different modalities through a same shared modality feature extraction model is provided, that is, the step "performing modality feature extraction on a modality image to obtain an image feature corresponding to a modality image in each modality" includes:
the method comprises the steps of mapping modal images to modal feature vector spaces corresponding to different modalities through feature mapping parameters of a shared modal feature extraction model, obtaining image features corresponding to the modal images of the modalities based on mapping results, training the shared modal feature extraction model based on a multi-modal image sample set, and enabling the multi-modal image sample set to comprise a plurality of sample modal images representing image contents under the different modalities.
The shared modal feature extraction model is a model capable of realizing modal feature extraction aiming at different modalities.
The feature extraction scheme in this embodiment focuses on extracting the features of modal images of different modalities (for example, modalities such as human faces, human bodies and voiceprints) with a single, shared modal feature extraction model, so that feature discriminability is maintained while the overall number of model parameters is reduced, which facilitates deployment.
In the practical application process, the shared modal feature extraction model is obtained through pre-training. Through the pre-training process, parameters and the like of the shared modal feature extraction model can be adjusted, so that the shared modal feature extraction model can achieve better feature extraction performance. Therefore, before the step of mapping the modal image into the modal feature vector space corresponding to different modalities by sharing the feature mapping parameters of the modal feature extraction model, the video frame processing method provided in the embodiment of the present invention may further include:
acquiring a multi-modal image sample set, wherein the multi-modal image sample set comprises a plurality of sample modal images representing image contents in different modalities, and each sample modal image is marked with a reference modality;
performing modal feature extraction on each sample modal image in the multi-modal image sample set through a shared modal feature extraction model to be trained to obtain sample image features corresponding to each sample modal image;
performing modality classification on each sample image feature through a modality classification model to obtain the training modality corresponding to each sample image feature;
calculating the loss of the shared modal feature extraction network to be trained based on the training modality and the reference modality of each sample image feature;
and adjusting the model parameters of the shared modal feature extraction model to be trained according to the loss to obtain the trained shared modal feature extraction network.
Wherein the multi-modal image sample set comprises sample modal images of different modalities.
In particular, the shared modal feature extraction model may include a normalization layer, a multi-head attention layer, a forward mapping layer, and so on. As shown in FIG. 3, in the model training phase, each mini-batch drawn from the multi-modal image sample set may simultaneously contain samples of the three modalities. The shared modal feature extraction model may segment each input sample modal image x into K non-overlapping blocks (patches), and then map the K patches into another high-dimensional space to obtain K tokens, denoted as {E_k, k ∈ [1, K]}.
The shared modal feature extraction model may concatenate a class token z_cls of the same dimension with the K tokens and add a position code p to obtain the feature z to be mapped,
where z = [z_cls, E_1, E_2, ..., E_K] + p.
z is input into a model stacked from multiple Transformer blocks, each block consisting of Multi-head Attention (MSA), Layer Normalization (LN), a forward mapping layer (Feed Forward), and so on, as shown in fig. 3. The mathematical operation of the l-th block can be expressed as:
y_l = MSA(LN(z_l)) + z_l
z_{l+1} = FF(LN(y_l)) + y_l
where MSA denotes multi-head attention, LN denotes layer normalization, and FF denotes the forward mapping.
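A compact sketch of one such block (pre-norm multi-head attention followed by a feed-forward mapping, with residual connections) is given below; PyTorch and all dimension choices are illustrative assumptions and do not reproduce the parameters of the filed model.

```python
# A minimal sketch of the block y_l = MSA(LN(z_l)) + z_l, z_{l+1} = FF(LN(y_l)) + y_l.
# PyTorch, the embedding dimension, head count and FF width are illustrative assumptions.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, ff_dim: int = 1024):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:    # z: (batch, K + 1, dim)
        h = self.ln1(z)
        y = self.msa(h, h, h, need_weights=False)[0] + z    # y_l = MSA(LN(z_l)) + z_l
        return self.ff(self.ln2(y)) + y                     # z_{l+1} = FF(LN(y_l)) + y_l
```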
Optionally, the loss of the shared modal feature extraction model may be obtained by solving through a cross entropy function, a gradient descent method, and the like, which is not limited in the embodiment of the present invention.
In order to achieve the goal of extracting representations of data of multiple modalities through a single model, the embodiment of the invention provides a modality-aware training strategy, so that the shared modal feature extraction model, after being jointly trained on the data of multiple modalities, can even exceed the performance of models trained independently for each modality. That is, the step "calculating the loss of the shared modality feature extraction network to be trained based on the training modality and the reference modality of each sample image feature" may specifically include:
respectively calculating the loss of the shared modal feature extraction network to be trained under each modality based on the training modality and the reference modality of each sample image feature;
and calculating the total loss of the shared modal feature extraction network as the loss of the shared modal feature extraction network to be trained based on the corresponding loss under each modality.
The training modality is the modality of the sample modal image obtained after the modality classification model performs modality classification on each sample image feature. The reference modality is the modality actually labeled on the sample modal image.
That is, in the embodiment of the present invention, the classifier predictions of different modalities may be separated, and each modality employs a corresponding loss function. Finally, the total loss of the shared modal feature extraction model is calculated according to the losses of the different modalities, and the shared modal feature extraction model is adjusted according to the total loss.
Specifically, taking three different modalities, i.e. face, limbs and audio, as an example, the total loss of the shared modality feature extraction model can be expressed by the following formula:
L = λ_f · L_f + λ_b · L_b + λ_v · L_v
wherein L_f, L_b and L_v are the losses corresponding to the face modality, the limb modality and the audio modality, respectively, and λ_f, λ_b and λ_v are the weights of the corresponding losses for the face modality, the limb modality and the audio modality, respectively.
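By way of illustration only, the weighted combination above could be computed as sketched below; the per-modality classifier heads, the cross-entropy choice and the weight values are assumptions made for the sketch, not details fixed by this disclosure.

```python
# A minimal sketch of the modality-aware total loss L = λ_f*L_f + λ_b*L_b + λ_v*L_v.
# Separate classifier heads per modality and the weight values are illustrative assumptions.
import torch
import torch.nn.functional as F

def total_loss(feats, labels, heads, weights=None):
    """feats/labels: dicts keyed by modality name; heads: per-modality classifier modules."""
    weights = weights or {"face": 1.0, "body": 1.0, "voice": 1.0}
    loss = 0.0
    for m in feats:                                    # e.g. "face", "body", "voice"
        logits = heads[m](feats[m])                    # classifier predictions kept separate per modality
        loss = loss + weights[m] * F.cross_entropy(logits, labels[m])
    return loss
```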
In the embodiment of the present invention, determining the node corresponding to each image feature in the image feature space corresponding to each modality may be directly taking the image feature as the node, or may be taking a feature obtained by further mapping the image feature as the node. Those skilled in the art can make this choice according to the actual application.
203. And determining valid neighbor nodes of each node in each image feature space, wherein a node and its valid neighbor nodes are mutually associated nodes, and if one node is an associated node of another node, the distance between the two nodes needs to meet a preset node association condition.
For a node, all nodes in the same image feature space as the node are referred to as neighbor nodes of the node.
At present, graph-based supervised clustering algorithms, or unsupervised algorithms such as KNN and K-means, are often adopted for clustering nodes.
Generally, a graph-based supervised clustering algorithm will use the k-nearest-neighbor graph construction method to build a relationship graph between nodes. The k-nearest-neighbor graph construction method selects a fixed number k of neighbor nodes for each node. As shown in fig. 4, k = 3 for nodes a, b and c. The circles in the figure represent the possible neighborhoods of each node, and a line segment between two points indicates that a connection exists between them. However, a fixed number k of neighbor nodes may introduce many false connections. For example, in fig. 4, with a as the central node, b and c are both 3-nearest neighbors of a, but a may thus connect to node c, which is actually far away from it in space, forming an unreliable connection.
In the unsupervised KNN algorithm, the Kmeans algorithm and the like, clustering is performed only by depending on the characteristic vector of each node in the node clustering process, but the information content of the nodes is limited, and information such as the relationship among the nodes does not play a role in the clustering process.
In order to solve the problems that unreliable connections affect the graph clustering process and that the similarity relations between nodes play no role in the clustering process, the embodiment of the invention constructs the relationship graph between nodes from a variable number of valid neighbor nodes. That is, if two nodes are each neighbor nodes of the other that satisfy the node association condition, the two nodes are valid neighbor nodes of each other.
In other words, a valid neighbor node of a certain node is a neighbor node of that node with which it is mutually associated: the node is an associated node of its valid neighbor node, and the valid neighbor node is also an associated node of the node.
If the distance between a node A and a node B satisfies the preset node association condition with respect to node B, then node A is an associated node of node B.
Specifically, the node association condition may be being among the N nodes closest to a certain node among its neighbor nodes, or being among the N closest neighbor nodes of the node whose distance to it does not exceed a certain range, and so on.
For example, when the node association condition is that 3 nodes closest to the node are located in the neighbor nodes of a certain node, it is assumed that the neighbor nodes of the node 1 include nodes 2 to 10, where the 3 nodes closest to the node 1 are node 2, node 4, and node 7; the neighbor nodes of the node 2 comprise a node 1 and nodes 3-10, wherein 3 nodes closest to the node 2 are a node 3, a node 5 and a node 6; the neighbor nodes of the node 4 include nodes 1 to 3 and nodes 5 to 10, wherein 3 nodes closest to the node 4 are the node 1, the node 3 and the node 8.
At this time, the associated nodes of the node 1 include the node 2, the node 4, and the node 7, the associated nodes of the node 2 include the node 3, the node 5, and the node 6, and the associated nodes of the node 4 include the node 1, the node 3, and the node 8. That is, node 2 is a node associated with node 1, but node 1 is not a node associated with node 2, so node 1 and node 2 are not nodes associated with each other, and node 2 is not a valid neighbor node of node 1.
However, since the node 4 is a node associated with the node 1, and the node 1 is also a node associated with the node 4, the node 1 and the node 4 are associated with each other, and at this time, the node 4 is a valid neighbor node of the node 1, and the node 1 is also a valid neighbor node of the node 4.
It should be understood that the fact that node A is an associated node of node B does not mean that node B is necessarily also an associated node of node A. Taking fig. 4 as an example, if the node association condition is being among the 3 nodes closest to a certain node among its neighbor nodes, then node c is an associated node of node a, but node a is not an associated node of node c.
In the practical application process, if a certain node does not have an effective neighbor node which is a mutual association node with the node, at this time, the association node of the node can be directly used as the effective neighbor node of the node. For example, the associated node of the node m is only the node n, but the node m is not the associated node of the node n, at this time, the node m does not have an effective neighbor node which is an associated node with the node m, and then the node n can be regarded as an effective neighbor node of the node m to participate in subsequent node clustering.
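The valid-neighbor selection described above amounts to a mutual k-nearest-neighbor test; a sketch under that reading is given below (NumPy only, with the cosine-distance choice, k = 3 and the fallback rule taken as illustrative assumptions).

```python
# A minimal sketch of valid-neighbor selection: node j is a valid neighbor of node i
# only if i and j are each among the other's k nearest neighbors (mutually associated).
# If a node has no mutual neighbor, its own k-NN list is used as a fallback.
import numpy as np

def valid_neighbors(X: np.ndarray, k: int = 3):
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - Xn @ Xn.T                            # cosine distance between nodes
    np.fill_diagonal(dist, np.inf)                    # exclude the node itself
    knn = [set(np.argsort(dist[i])[:k]) for i in range(len(X))]
    valid = []
    for i, neigh in enumerate(knn):
        mutual = {j for j in neigh if i in knn[j]}    # associated in both directions
        valid.append(mutual if mutual else neigh)     # fallback to one-way associated nodes
    return valid
```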
Therefore, the embodiment of the invention can process the neighbor nodes of each node, and the effective neighbor nodes can play a role in the graph clustering process.
204. And carrying out node clustering on each node in each image characteristic space according to each node and the effective neighbor node corresponding to the node to obtain at least one node clustering set under each mode.
In some alternative embodiments, node clustering may be implemented by a supervised clustering algorithm of the graph, as shown in fig. 5. At this time, the image feature space may be a graph network space, and step 204 may specifically include:
generating a subgraph which takes each node in each graph network space as a central node and takes the effective neighbor node corresponding to the node as other nodes according to each node and the effective neighbor node corresponding to the node;
acquiring a graph clustering network corresponding to each mode;
and carrying out graph clustering processing on each subgraph through a graph clustering network of a corresponding mode to obtain at least one node clustering set under each mode.
Specifically, a subgraph refers to a graph network formed by implementing only a portion of the connections between nodes in a graph network space. For example, a node in the graph network space may be used as a central node, a part of nodes except for the node used as the central node in the graph network space may be used as neighboring nodes, and the central node may be connected to the neighboring nodes, so as to obtain a subgraph.
For example, for a node A and its corresponding valid neighbor nodes B and C, A can be used as the central node, B and C as the other nodes, and A–B and A–C are connected to obtain a subgraph.
The graph clustering network can be realized based on a Graph Convolutional Network (GCN). GCNs can be used in scenarios such as supervised node classification, link prediction and recommendation systems. Given a graph, a GCN obtains node embeddings layer by layer using graph convolution operations: in each layer, to obtain the embedding of a node, the embeddings of its adjacent nodes are aggregated and then passed through one or more layers of linear transformation and nonlinear activation; finally, the embedding vector obtained in the last layer is used for the final task. For example, in the node classification problem, the embedding vector of the last layer is passed to a classifier to predict node labels, so as to realize the classification of the nodes.
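For reference, one graph-convolution layer of the kind such a graph clustering network could stack is sketched below; the symmetric normalization with self-loops is the standard GCN formulation, and PyTorch plus the layer sizes are illustrative assumptions rather than details taken from this disclosure.

```python
# A minimal sketch of one graph convolution layer H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W),
# the basic building block a GCN-based graph clustering network may stack.
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)   # add self-loops
        deg = a_hat.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt                 # symmetric normalization
        return torch.relu(norm_adj @ self.linear(feats))           # aggregate neighbors, then transform
```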
The graph clustering network may be obtained through training. Through the training process, parameters and the like of the graph clustering network can be adjusted, so that the graph clustering network can achieve better clustering performance. Before the step of "obtaining a graph clustering network corresponding to each modality", the video frame processing method provided by the embodiment of the present invention may further include:
carrying out graph clustering processing on the sample subgraphs in each sample image feature space through the graph clustering network to be trained corresponding to each modality, to obtain at least one training node cluster set in each modality, wherein each sample subgraph is annotated with a reference cluster set result;
respectively calculating the loss of the graph clustering network to be trained corresponding to each mode based on the training node clustering set and the reference clustering set result in each mode;
and adjusting the network parameters of each graph clustering network to be trained according to the loss to obtain the trained graph clustering network corresponding to each mode.
Optionally, the loss of the graph clustering network may be obtained by solving through a cross entropy function, a gradient descent method, and the like, which is not limited in the embodiment of the present invention.
The network parameters of the graph clustering network may specifically include the number of layers of convolutional layers for node embedding in the graph clustering network, parameters of a linear or nonlinear transformation function, and the like. For example, if the convolutional layers are included in the graph clustering network, the network parameters of the graph clustering network may include the number of convolutional layers, the size of convolutional cores in the convolutional layers, and/or the number of input channels corresponding to each convolutional layer, and so on.
In other alternative embodiments, node clustering may be implemented by an unsupervised clustering algorithm, as shown in FIG. 5. In the node clustering process, the KNN algorithm, the Kmeans algorithm and the like in the related technology only depend on the characteristic vector of each node for clustering, but the information content of the nodes is limited, and the information such as the relationship among the nodes does not play a role in the clustering process.
In order to enable the characteristics of each node to contain more information and enhance the influence of the relationship between nodes in the clustering process when node clustering is carried out, the nodes can be updated based on each node and effective neighbor nodes thereof, and the updated nodes are adopted for node clustering. That is to say, the step "performing node clustering on each node in each image feature space according to each node and an effective neighbor node corresponding to the node to obtain at least one node clustering set in each modality" may specifically include:
according to each node and the effective neighbor node corresponding to the node, node characteristic updating is carried out on each node to obtain each updated node;
and performing node clustering on each updated node in each image feature space based on the similarity between each updated node and other updated nodes in the image feature space to obtain at least one node clustering set in each mode.
The similarity can be obtained by calculating the cosine distance, the Euclidean distance or the like between each updated node and the other updated nodes in the image feature space.
In some examples, each node may be updated by calculating the distance values between the node and its valid neighbor nodes, combining these distance values into a vector, and concatenating this vector with the feature vector corresponding to the node to obtain the updated node.
In other examples, the weight between each node and its effective neighboring node may be calculated based on the distance between each node and its effective neighboring node, and then the updated node may be obtained by performing weighted calculation on the node and its effective neighboring node based on the weight. That is, the step "performing node feature update on each node according to each node and an effective neighbor node corresponding to the node to obtain each updated node" may specifically include:
respectively calculating the space distance between each node and the corresponding effective neighbor node in the corresponding image feature space according to each node and the effective neighbor node corresponding to the node;
determining an association weight between each node and a corresponding effective neighbor node based on each spatial distance;
and updating and calculating the node characteristics according to the node characteristics corresponding to the nodes, the node characteristics of the effective neighbor nodes corresponding to the nodes and the association weights to obtain the updated nodes.
Specifically, assume that the nodes in an image feature space are X = [x_1, x_2, ..., x_N] ∈ R^{N×D}, where N and D refer to the number of nodes and the feature dimension, respectively. For each node x_i (1 ≤ i ≤ N), the cosine similarity between the node and the other nodes is calculated, and k valid neighbor nodes are screened out. The node update is a weighted aggregation of each node with its valid neighbor nodes, where w_{i,j} refers to the association weight between node x_i and its k valid neighbor nodes, and d_{i,j} refers to the cosine distance between x_i and x_j. After all nodes are updated, the updated nodes are used for unsupervised clustering.
Optionally, the reciprocal of the spatial distance may be directly used as the weighting coefficient, so that effective neighbor nodes farther from node x_i receive lower weight coefficients, and so on.
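A minimal sketch of the weighted update, assuming cosine distance and an exponential weighting exp(-d) normalized to sum to 1; the exact weighting function is an assumption made here for illustration, and the reciprocal-of-distance option mentioned above would slot in the same way:

import numpy as np

def update_by_weighted_neighbors(X, neighbor_idx):
    # X: (N, D) node features; neighbor_idx: (N, k) effective-neighbor indices.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    X_new = np.empty_like(X)
    for i, nbrs in enumerate(neighbor_idx):
        d = 1.0 - Xn[nbrs] @ Xn[i]              # cosine distances d_ij
        w = np.exp(-d)                          # assumed weighting; 1/d is the alternative above
        w_all = np.concatenate([[1.0], w])      # the node itself plus its effective neighbors
        w_all /= w_all.sum()                    # association weights normalized to sum to 1
        X_new[i] = w_all @ np.vstack([X[i], X[nbrs]])
    return X_new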
In some examples, in order to improve the efficiency of computation and save computation resources, the step "performing node clustering on each updated node in each image feature space based on the similarity between each updated node and other updated nodes in the image feature space where the updated node is located to obtain at least one node cluster set in each modality" may specifically include:
determining a preset number of updated nodes as cluster center nodes in each image feature space based on each updated node;
acquiring the similarity between each updated node and each cluster center node in each image feature space;
dividing each updated node into a cluster where a corresponding target center node is located, wherein the similarity between each updated node and the corresponding target center node is not lower than a preset similarity threshold;
selecting a new cluster center node of each cluster based on the updated nodes in each cluster, and returning to the step of acquiring the similarity between each updated node and each cluster center node in each image feature space until the clustering end condition is met;
and respectively determining the updated nodes in each cluster as a node cluster set.
In the embodiment of the present invention, a cluster is a set of a group of nodes generated based on a clustering process.
The cluster center nodes are determined when the updated nodes in each image feature space are clustered; the number of cluster center nodes may be a preset fixed number, and the cluster center nodes selected in the first round of clustering are generally selected at random. Each time new cluster center nodes are selected, another updated node in each cluster may be chosen as the new cluster center node of that cluster.
In another example, the clustering process may be implemented in the form of a clustering model, and the clustering model may cluster updated nodes in each image feature space by adjusting the number of central nodes of the clustering cluster multiple times to determine the most accurate clustering cluster.
Optionally, the clustering end condition may be that the updated nodes in each cluster no longer change, that the cluster center node corresponding to each cluster no longer changes, or that the step of obtaining the similarity between each updated node and each cluster center node in each image feature space has been returned to a preset number of times during the clustering process, and so on.
Specifically, the similarity between each updated node and each cluster center node can be obtained by calculating the distance between each updated node and each cluster center node. Or, the similarity between each updated node and each cluster center node can be obtained by calculating the association weight between each updated node and each cluster center node, and the like.
The preset similarity threshold may be set by a technician according to actual needs, which is not limited in the embodiment of the present invention.
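A minimal sketch of this iterative clustering over updated nodes, assuming cosine similarity, randomly chosen initial cluster center nodes, and a fixed maximum iteration count as one clustering end condition; re-computing each center as the cluster mean is one simple choice, while the embodiment equally allows selecting another updated node as the new center:

import numpy as np

def cluster_updated_nodes(X, num_centers=5, sim_threshold=0.5, max_iters=20, seed=0):
    # X: (N, D) updated node features. Returns a list of index arrays, one per cluster.
    rng = np.random.default_rng(seed)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = Xn[rng.choice(len(Xn), num_centers, replace=False)]  # first round: random
    assign = np.full(len(Xn), -1)
    for _ in range(max_iters):                         # end condition: preset iteration count
        sims = Xn @ centers.T                          # similarity to every cluster center node
        assign = sims.argmax(axis=1)
        assign[sims.max(axis=1) < sim_threshold] = -1  # below the preset threshold: unassigned
        new_centers = np.stack([
            Xn[assign == c].mean(axis=0) if np.any(assign == c) else centers[c]
            for c in range(num_centers)])
        new_centers /= np.linalg.norm(new_centers, axis=1, keepdims=True)
        if np.allclose(new_centers, centers):          # end condition: centers no longer change
            break
        centers = new_centers
    return [np.where(assign == c)[0] for c in range(num_centers)]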
205. And generating a video content set corresponding to each node cluster set based on the video frames corresponding to the nodes in each node cluster set.
The video content set is a set formed by video contents acquired from the video to be processed according to the video frames corresponding to the nodes. In particular, the set of video content may include at least one of a set of video frames and a set of audio information.
For example, consider a node cluster set in the audio modality. Because an audio frame in a video corresponds to a video frame, an audio modality image may be obtained by acquiring and processing the audio frame set corresponding to a certain group of consecutive video frames in the video to be processed. A natural correspondence therefore exists between the node corresponding to each audio modality image and video frames in the video to be processed. Through this correspondence, the audio frames corresponding to each node can be obtained from the original video to be processed, thereby generating an audio information set as a video content set.
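A minimal sketch of generating audio information sets from the node cluster sets, assuming bookkeeping mappings (hypothetical names) from each node to the video frames it was extracted from and from each frame to the corresponding audio segment:

def audio_sets_from_clusters(node_clusters, node_to_frames, frame_to_audio):
    # node_clusters: list of node-id lists from the audio-modality clustering;
    # node_to_frames: node id -> indices of the consecutive video frames its
    #                 audio modality image was computed from;
    # frame_to_audio: frame index -> audio segment of the original video.
    # All three mappings are assumed bookkeeping kept while building the graph.
    audio_sets = []
    for cluster in node_clusters:
        frames = sorted({f for n in cluster for f in node_to_frames[n]})
        audio_sets.append([frame_to_audio[f] for f in frames])
    return audio_sets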
In an actual application process, taking three modalities including a face modality, a limb modality, and an audio modality as an example, since the face and the limb are generally in one-to-one correspondence, a video content set corresponding to the face modality and the limb modality may be merged based on nodes in the face modality and the limb modality. In some embodiments, step 205 may specifically include:
based on the nodes in each node cluster set under the face modality, determining a face modality extraction area corresponding to each face modality image corresponding to each node in the corresponding video frame;
based on the nodes in each node cluster set under the limb modality, determining a limb modality extraction area corresponding to each limb modality image corresponding to each node in the corresponding video frame;
calculating the intersection ratio between each face modal extraction area and each limb modal extraction area;
determining a video frame set of the same object based on the intersection ratio, the video frames corresponding to the face modal images and the video frames corresponding to the limb modal images;
determining an audio information set corresponding to each node cluster set according to a video frame corresponding to a node in each node cluster set in an audio mode;
and taking the video frame set and the audio information set as a video content set.
The Intersection over Union (IoU) function is used for calculating the ratio of the intersection to the union of two bounding boxes; IoU measures the relative size of the overlap between the two boxes.
In this embodiment, the intersection over union (IoU) of the face and the limbs can be calculated using the coordinates of the face modality extraction area and the limb modality extraction area. The pair with the largest IoU is found to obtain the matched face and limbs, so that the video frames corresponding to the face modality and the limb modality can be merged.
Specifically, when the video frames corresponding to the face modality and the limb modality are merged, they may be directly spliced according to the order of the video frames in the video to be processed.
Because some video frames corresponding to the face modality images and some video frames corresponding to the limb modality images may be the same video frame, in order to avoid repeated video frames existing in the obtained video frame set, a union operation may be performed on the video frames corresponding to the face modality images and the video frames corresponding to the limb modality images, and the repeated video frames are removed.
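A minimal sketch of the IoU computation and of matching each face modality extraction area to the limb modality extraction area with the largest IoU; the box coordinate convention and the matching helper are illustrative assumptions:

def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2). Ratio of intersection area to union area.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def match_face_to_limb(face_box, limb_boxes):
    # Return the index of the limb extraction area with the largest IoU
    # against the face extraction area, or None if nothing overlaps.
    if not limb_boxes:
        return None
    scores = [iou(face_box, lb) for lb in limb_boxes]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] > 0 else None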
206. Determining video content sets corresponding to multiple modalities of the same object, and fusing the video content sets corresponding to the same object to obtain a processing result of at least one object in the video to be processed.
In some optional embodiments, if the multiple modalities include only a face modality and a limb modality, then the corresponding video content sets of the same object in the face modality and the limb modality may be determined by calculating IoU for the video frames corresponding to the face modality and the limb modality based on the nodes in the face modality and the limb modality. The specific determination steps are similar to the aforementioned step of determining the video frame set of the same object based on IoU, and are not repeated herein in the embodiments of the present invention.
In other embodiments, due to the influence of environmental factors and the like, video content of the same object may be clustered into different video content sets, at this time, video content sets of different modalities may be matched according to video time information corresponding to the video content sets, and if matching is successful, the video content sets of the same object are determined. That is, the step "determining a set of video contents corresponding to multiple modalities of the same object" may specifically include:
determining video time information of each video content set according to video frame sequencing information of a video frame corresponding to each video content set in a video to be processed;
and matching according to the video time information of each video content set, and taking the successfully matched video content sets in different modalities as video content sets corresponding to a plurality of modalities of the same object.
The video time information may be a set of time values corresponding to the video content in the video content set in the video, or a time corresponding to the video content in the video content set in the video.
For example, for a video content set corresponding to a certain face modality, the video time information thereof may be 3 minutes 01 seconds to 3 minutes 15 seconds, for a video content set corresponding to another face modality, the video time information thereof may be 3 minutes 23 seconds to 3 minutes 40 seconds, and for a video content set corresponding to a certain audio modality, the video time information thereof may be 2 minutes 59 seconds to 3 minutes 47 seconds.
At this time, the video time information of the video content sets corresponding to the two face modalities may be considered to match the video time information of the video content set corresponding to the audio modality, and the video content sets corresponding to the two face modalities and the video content set corresponding to the audio modality may be regarded as video content sets corresponding to a plurality of modalities of the same object.
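A minimal sketch of matching video content sets by their video time information, treating each set's time information as a (start, end) interval in seconds; the interval representation and the example values (taken from the times above) are assumptions for illustration:

def overlaps(span_a, span_b):
    # Spans in seconds as (start, end); true if the two intervals intersect.
    return span_a[0] <= span_b[1] and span_b[0] <= span_a[1]

def match_by_time(face_sets, audio_sets):
    # face_sets / audio_sets: lists of (set_id, (start_sec, end_sec)) pairs.
    # Returns the id pairs whose video time information matches (overlaps).
    return [(f_id, a_id)
            for f_id, f_span in face_sets
            for a_id, a_span in audio_sets
            if overlaps(f_span, a_span)]

# The example above: two face sets (3:01-3:15 and 3:23-3:40) both fall inside
# the audio set spanning 2:59-3:47, so all three are attributed to one object.
print(match_by_time([("face_1", (181, 195)), ("face_2", (203, 220))],
                    [("audio_1", (179, 227))]))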
Taking modalities including the face, limbs, and audio as an example, large differences in external factors such as illumination and resolution may cause the video content of the same object to be clustered into different video content sets. Audio features are largely unaffected by these factors and can therefore serve as complementary information. In this case, sets with high confidence may be selected from the audio information sets, and the audio information of each selected set may be concatenated in the time dimension to obtain a plurality of time series. Across different scenes and different time points, objects appearing in the same time series are likely to be the same object. Based on this principle, the video frame sets in which faces and limbs have already been fused can be fused again, so that video content sets of different types but the same identity are combined.
The confidence of an audio information set can be calculated based on the node cluster set corresponding to the audio information set. For example, the compactness of the node cluster set may be evaluated by NMI (Normalized Mutual Information), pairwise F-score, and the like, and the evaluation result is taken as the confidence. Alternatively, a high-confidence audio information set can be obtained by acquiring the audio information corresponding to a batch of nodes closer to the cluster center in the node cluster set.
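A minimal sketch of the second option, taking the confidence of an audio node cluster set as the mean cosine similarity of its nodes to the cluster center; the exact score definition is an assumption made here for illustration:

import numpy as np

def cluster_confidence(X, cluster_idx):
    # Confidence of one audio node cluster set, taken here as the mean cosine
    # similarity of its nodes to the cluster center (one option named above;
    # NMI / pairwise F-score evaluation would be an alternative).
    nodes = X[cluster_idx]
    nodes = nodes / np.linalg.norm(nodes, axis=1, keepdims=True)
    center = nodes.mean(axis=0)
    center = center / np.linalg.norm(center)
    return float((nodes @ center).mean())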
The embodiment of the invention compares, through experiments, the face clustering effect of each algorithm on the multi-modal character clustering data set VideoX. The compared algorithms are unsupervised clustering algorithms such as K-means, DBSCAN, HDBSCAN, Spectral Clustering, and Agglomerative Clustering, and supervised clustering algorithms such as GCN-D, L-GCN, GCN-V, and STAR-FC. For the algorithm adopted by this scheme, L-GCN is used as the baseline model, the traditional k-nearest-neighbor graph construction method is then gradually replaced by a graph construction method based on mutual k-nearest neighbors, and the original features (from VGGFace2) are replaced by the features generated by the shared modal feature extraction model. The results are shown in the following table.
Note: r-k-NN refers to the graph construction method based on mutual k-nearest neighbors, and T refers to the features generated by the shared modal feature extraction model.
[Table: face clustering results of each compared algorithm on VideoX]
In the table above, the first/second part shows the results of the unsupervised/supervised clustering algorithms. Unsupervised clustering algorithms tend to rely on specific data assumptions, which causes them to perform poorly on VideoX. In general, supervised algorithms can learn data characteristics under different data distributions and therefore perform better than unsupervised algorithms. L-GCN combined with the graph construction method based on mutual k-nearest neighbors surpasses all other methods in the table, and replacing the original features with the features generated by UMT on this basis further greatly improves the performance of the algorithm, achieving excellent results of 77.2 F_P, 83.6 F_B, and 82.5 NMI on the three indexes. The experiment strongly demonstrates the great potential of this method in single-modality clustering.
[Table: clustering results of each single modality and of the fused multi-modal result on VideoX]
As shown in the table above, for the unsupervised approach, the experiment clusters the three modal features generated by UMT using the K-means algorithm; for the supervised approach, the experiment trains STAR-FC on the training set of VideoX and reports the results on its test set. The experimental results show that the performance after fusing the multi-modal clustering results is clearly superior to that of any single-modality result, which indicates that the fusion algorithm of this scheme can effectively combine the clustering results of the various modalities.
As can be seen from the above, in the embodiments of the present invention, a plurality of modal images corresponding to each video frame in a video to be processed may be obtained based on the video to be processed, where the plurality of modal images of the same video frame respectively represent the video contents corresponding to the video frame in different modalities. Modal feature extraction is performed on the modal images to obtain the image features corresponding to the modal images of the respective modalities, and a node corresponding to each image feature under each modality is determined in the image feature space corresponding to that modality. Effective neighbor nodes of each node are determined in each image feature space, where a node and its effective neighbor nodes are mutually associated nodes, and if one node is an associated node of another node, the distance between the two nodes needs to satisfy a preset node association condition. According to each node and the effective neighbor nodes corresponding to the node, node clustering is performed on each node in each image feature space to obtain at least one node cluster set in each modality. A video content set corresponding to each node cluster set is generated based on the video frames corresponding to the nodes in each node cluster set, video content sets corresponding to multiple modalities of the same object are determined, and the video content sets corresponding to the same object are fused to obtain a processing result of at least one object in the video to be processed.
In the embodiment of the invention, the multi-modal characteristics are combined, and the nodes in the same image characteristic space are preliminarily selected based on the node association conditions under different modes, so that unreliable error connection among the nodes in the clustering process is reduced, the video content set of the same object can be determined from the video, and the accuracy of processing the video frames is improved.
The method according to the previous embodiment is further illustrated in detail by way of example.
In this embodiment, the system of fig. 1 will be described in connection with modalities including a face modality, a limb modality, and an audio modality.
As shown in fig. 6, the specific flow of the video frame processing method of this embodiment may be as follows:
601. The terminal receives a video processing viewing operation from a user and sends a video processing viewing request to the server, the video processing viewing request indicating the video to be processed.
602. The server determines a video to be processed, and acquires a plurality of modal images corresponding to each video frame in the video to be processed, wherein the plurality of modal images of the same video frame respectively represent video contents corresponding to the video frames in different modalities.
Specifically, step 602 may include: acquiring each video frame in the video to be processed and audio information corresponding to each video frame based on the video to be processed;
extracting a face image and a limb image from each video frame to obtain a face modal image and a limb modal image corresponding to each video frame;
performing voiceprint analysis on the audio information corresponding to each video frame to obtain an audio mode image corresponding to each audio information;
and taking the face modal image, the limb modal image and the audio modal image corresponding to each video frame as a plurality of modal images corresponding to each video frame.
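As one assumed realization of the voiceprint analysis step above — not necessarily the one used in the embodiment — a log-mel spectrogram of the audio aligned with a group of consecutive video frames can serve as the 2-D audio modality image (the sketch assumes the librosa library):

import numpy as np
import librosa

def audio_modality_image(audio_segment, sr=16000):
    # One assumed realization of the voiceprint analysis above: a log-mel
    # spectrogram of the audio aligned with a group of consecutive video
    # frames, used as the 2-D "audio modality image".
    mel = librosa.feature.melspectrogram(y=audio_segment, sr=sr, n_mels=64)
    return librosa.power_to_db(mel, ref=np.max)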
603. The server maps the modal images into modal feature vector spaces corresponding to different modalities by sharing feature mapping parameters of the modal feature extraction model, and obtains image features corresponding to the modal images of the modalities based on mapping results.
The shared modal feature extraction model is obtained by training based on a multi-modal image sample set, and the multi-modal image sample set comprises a plurality of sample modal images representing image contents in different modalities.
Specifically, the shared modal feature extraction model can be obtained by training through the following steps:
acquiring a multi-modal image sample set, wherein the multi-modal image sample set comprises a plurality of sample modal images representing image contents under different modalities, and each sample modal image is marked with a reference modality;
performing modal feature extraction on each sample modal image in the multi-modal image sample set through a shared modal feature extraction model to be trained to obtain sample image features corresponding to each sample modal image;
performing modal classification on the image features of each sample through a modal classification model to obtain a training mode corresponding to the image features of each sample;
respectively calculating the corresponding loss of the shared modal feature extraction network to be trained under each mode based on the training mode and the reference mode of each sample image feature;
calculating the total loss of the shared modal feature extraction network as the loss of the shared modal feature extraction network to be trained based on the corresponding loss in each mode;
and according to the loss, adjusting the model parameters of the shared modal feature extraction model to be trained to obtain the trained shared modal feature extraction network.
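A minimal sketch of one training step for the procedure above, assuming PyTorch; the extractor architecture, image shapes, three modality classes, and optimizer settings are illustrative assumptions rather than the concrete models of the embodiment:

import torch
import torch.nn as nn

# Minimal sketch of one training step: the shared extractor maps every sample
# modality image to a feature, a modality classifier predicts its modality,
# the loss is computed per modality and summed into the total loss that is
# back-propagated into the extractor.
extractor = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))  # shared modal feature extraction model (assumed shape)
classifier = nn.Linear(256, 3)                                        # face / limb / audio modality classes
optimizer = torch.optim.Adam(
    list(extractor.parameters()) + list(classifier.parameters()), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, ref_modality):
    # images: (B, 3, 64, 64) sample modality images; ref_modality: (B,) labels
    features = extractor(images)                     # sample image features
    logits = classifier(features)                    # predicted training modality
    per_modality = [criterion(logits[ref_modality == m], ref_modality[ref_modality == m])
                    for m in ref_modality.unique()]  # loss under each modality
    total_loss = torch.stack(per_modality).sum()     # total loss of the network
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()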
604. And the server performs node clustering on each node in each image characteristic space according to each node and an effective neighbor node corresponding to the node to obtain at least one node clustering set under each mode.
In the embodiment of the present invention, the nodes may be clustered by using a supervised graph clustering algorithm, an unsupervised clustering algorithm, or the like.
605. The server calculates the intersection ratio between each face modality extraction area and each limb modality extraction area.
Specifically, the server may determine, based on the nodes in each node cluster set under the face modality, the face modality extraction area corresponding to each face modality image in the corresponding video frame, and determine, based on the nodes in each node cluster set under the limb modality, the limb modality extraction area corresponding to each limb modality image in the corresponding video frame.
606. And the server determines a video frame set of the same object based on the intersection ratio, the video frame corresponding to each face modal image and the video frame corresponding to each limb modal image.
In this embodiment, the intersection over union (IoU) of the face and the limbs can be calculated using the coordinates of the face modality extraction area and the limb modality extraction area. The pair with the largest IoU is found to obtain the matched face and limbs, so that the video content sets corresponding to the face modality and the limb modality can be merged.
607. And the server determines an audio information set corresponding to each node cluster set according to the video frames corresponding to the nodes in each node cluster set in the audio mode.
Taking modalities including the face, limbs, and audio as an example, large differences in external factors such as lighting and resolution may cause the video content of the same object to be clustered into different video content sets. Audio features are largely unaffected by these factors and can therefore serve as complementary information. In this case, sets with high confidence may be selected from the audio information sets, and the audio information of each selected set may be concatenated in the time dimension to obtain a plurality of time series. Across different scenes and different time points, objects appearing in the same time series are likely to be the same object. Based on this principle, the video frame sets in which faces and limbs have already been fused can be fused again, so that video content sets of different types but the same identity are combined.
608. And the server fuses the video content sets corresponding to the same object to obtain a processing result of at least one object in the video to be processed.
In some optional examples, the identity of the object corresponding to each processing result of the video to be processed may be determined directly. For example, the video to be processed may include only one object whose identity is already known when the video is processed, so that the identity of the object in the processing result can be determined directly.
Or, object feature information of each object in the video to be processed may be obtained, and the processing result may be matched with the object feature of each object, so as to determine the identity of the object corresponding to each processing result. For example, after obtaining the processing result, the server may search the internet for related information of the object, such as an image, etc., obtain object feature information of the object, etc.
In order to verify the effectiveness of the effective neighbor node algorithm, the experiment retrains and tests several baseline models on the face images of VideoX and on two other face clustering data sets, CASIA and IJB-B. The results are shown in the following table.
[Table: results of the baseline models, with and without the effective neighbor node algorithm, on VideoX, CASIA, and IJB-B]
As can be seen from the table: (1) For the unsupervised algorithms, the effective neighbor node algorithm brings a remarkable improvement in most cases, which to some extent demonstrates that the representations used for clustering are improved. (2) The L-GCN algorithm uses a graph convolutional neural network to model the context information of a subgraph and the confidence of the predicted nodes. The effective neighbor node algorithm helps L-GCN construct a sparser and more reliable subgraph, which reduces the computational overhead and improves the performance of the algorithm. (3) The STAR-FC algorithm aims at clustering large-scale face data through graph pruning and graph updating. The effective neighbor node algorithm helps STAR-FC filter noise nodes in the subgraph, thereby achieving a better effect than the traditional k-nearest-neighbor graph construction method. The experimental results show that the effective neighbor node algorithm can bring a remarkable performance improvement to both supervised and unsupervised clustering methods.
Therefore, the method and the device can combine the characteristics of multiple modes, carry out preliminary selection on the nodes in the same image characteristic space under different modes based on the node association conditions, and reduce unreliable error connection among the nodes in the clustering process, so that the video content set of the same object can be determined from the video, and the accuracy of processing the video frames is improved.
In order to better implement the method, correspondingly, the embodiment of the invention also provides a video frame processing device.
Referring to fig. 7, the apparatus includes:
the image obtaining unit 701 may be configured to obtain, based on a video to be processed, a plurality of modality images corresponding to each video frame in the video to be processed, where the plurality of modality images of the same video frame respectively represent video contents corresponding to the video frames in different modalities;
a feature extraction unit 702, configured to perform modality feature extraction on a modality image to obtain an image feature corresponding to the modality image of each modality, and determine a node corresponding to the image feature in each modality in an image feature space corresponding to each modality;
the node determining unit 703 may be configured to determine valid neighbor nodes of each node in each image feature space, where a node and a valid neighbor node thereof are associated nodes with each other, and if one node is an associated node of another node, a distance between the node and the another node needs to satisfy a preset node association condition;
a node clustering unit 704, configured to perform node clustering on each node in each image feature space according to each node and an effective neighbor node corresponding to the node, to obtain at least one node clustering set in each modality;
the set generating unit 705 may be configured to generate a video content set corresponding to each node cluster set based on a video frame corresponding to a node in each node cluster set;
the video processing unit 706 may be configured to determine video content sets corresponding to multiple modalities of the same object according to video time overlap ratios of the video content sets, and fuse the video content sets corresponding to the same object to obtain a processing result of at least one object in the video to be processed.
In some optional embodiments, the node clustering unit 704 may be configured to generate a subgraph in which each node in each image feature space is a central node and effective neighboring nodes corresponding to the nodes are other nodes according to each node and the effective neighboring nodes corresponding to the node;
acquiring a graph clustering network corresponding to each mode;
and carrying out graph clustering processing on each subgraph through a graph clustering network of a corresponding mode to obtain at least one node clustering set under each mode.
In some optional embodiments, the node clustering unit 704 may be configured to perform node feature update on each node according to each node and an effective neighbor node corresponding to the node, so as to obtain each updated node;
and performing node clustering on each updated node in each image feature space based on the similarity between each updated node and other updated nodes in the image feature space to obtain at least one node clustering set in each mode.
In some optional embodiments, the node clustering unit 704 may be configured to calculate, according to each node and an effective neighboring node corresponding to the node, a spatial distance between each node and the corresponding effective neighboring node in the corresponding image feature space, respectively;
determining an association weight between each node and a corresponding effective neighbor node based on each spatial distance;
and updating and calculating the node characteristics according to the node characteristics corresponding to the nodes, the node characteristics of the effective neighbor nodes corresponding to the nodes and the association weights to obtain the updated nodes.
In some optional embodiments, the node clustering unit may be configured to determine, based on each updated node, a preset number of updated nodes as cluster center nodes in each image feature space;
obtaining the similarity between each updated node and each cluster center node in each image feature space;
dividing each updated node into a cluster where a corresponding target center node is located, wherein the similarity between each updated node and the corresponding target center node is not lower than a preset similarity threshold;
selecting a new cluster center node of each cluster based on the updated nodes in each cluster, and returning to the step of acquiring the similarity between each updated node and each cluster center node in each image feature space until a clustering end condition is met;
and respectively determining the updated nodes in each cluster as a node cluster set.
In some optional embodiments, as shown in fig. 8, the video frame processing apparatus according to an embodiment of the present invention may further include a graph network training unit 707, configured to perform graph clustering processing on sample subgraphs in each sample image feature space through a graph clustering network to be trained corresponding to each modality, to obtain at least one training node clustering set in each modality, where each sample subgraph is annotated with a reference clustering set result;
respectively calculating the loss of the graph clustering network to be trained corresponding to each mode based on the training node clustering set and the reference clustering set result in each mode;
and adjusting the network parameters of each graph clustering network to be trained according to the loss to obtain the trained graph clustering network corresponding to each mode.
In some optional embodiments, the feature extraction unit 702 may be configured to map the modality images into modality feature vector spaces corresponding to different modalities through feature mapping parameters of a shared modality feature extraction model, and obtain image features corresponding to the modality images of the modalities based on a mapping result, where the shared modality feature extraction model is obtained by training based on a multi-modality image sample set, where the multi-modality image sample set may include multiple sample modality images representing image contents in different modalities.
In some optional embodiments, the video frame processing apparatus provided in the embodiments of the present invention may further include a model training unit 708, which is configured to obtain a multi-modal image sample set, where the multi-modal image sample set may include multiple sample modal images representing image contents in different modalities, and each sample modal image is labeled with a reference modality;
performing modal feature extraction on each sample modal image in the multi-modal image sample set through a shared modal feature extraction model to be trained to obtain sample image features corresponding to each sample modal image;
performing modal classification on the image features of each sample through a modal classification model to obtain a training mode corresponding to the image features of each sample;
calculating the loss of the shared mode feature extraction network to be trained based on the training mode and the reference mode of each sample image feature;
and according to the loss, adjusting the model parameters of the shared modal feature extraction model to be trained to obtain the trained shared modal feature extraction network.
In some optional embodiments, the model training unit may be configured to calculate, based on a training modality and a reference modality of each sample image feature, a corresponding loss of the shared modality feature extraction network to be trained in each modality respectively;
and calculating the total loss of the shared modal characteristic extraction network as the loss of the shared modal characteristic extraction network to be trained on the basis of the corresponding loss in each mode.
In some optional embodiments, the video processing unit 706 may be configured to determine video time information of each video content set according to video frame ordering information of a video frame corresponding to each video content set in a video to be processed;
and matching according to the video time information of each video content set, and taking the successfully matched video content sets in different modalities as video content sets corresponding to a plurality of modalities of the same object.
In some optional embodiments, the image obtaining unit may be configured to obtain, based on the video to be processed, each video frame in the video to be processed and audio information corresponding to each video frame;
extracting a face image and a limb image from each video frame to obtain a face modal image and a limb modal image corresponding to each video frame;
performing voiceprint analysis on the audio information corresponding to each video frame to obtain an audio mode image corresponding to each audio information;
and taking the face modal image, the limb modal image and the audio modal image corresponding to each video frame as a plurality of modal images corresponding to each video frame.
In some optional embodiments, the set generating unit 705 may be configured to determine, based on the nodes in each node cluster set under the face modality, a face modality extraction region corresponding to each face modality image corresponding to each node in a corresponding video frame;
determine, based on the nodes in each node cluster set under the limb modality, a limb modality extraction region corresponding to each limb modality image corresponding to each node in a corresponding video frame;
calculating the intersection ratio between each face modal extraction area and each limb modal extraction area;
determining a video frame set of the same object based on the intersection ratio, the video frames corresponding to the face modal images and the video frames corresponding to the limb modal images;
determining an audio information set corresponding to each node cluster set according to a video frame corresponding to a node in each node cluster set in an audio mode;
and taking the video frame set and the audio information set as a video content set.
Therefore, the video frame processing device can combine the characteristics of multiple modes, and carry out preliminary selection on the nodes in the same image characteristic space under different modes based on the node association conditions, so that unreliable error connection among the nodes in the clustering process is reduced, the video content set of the same object can be determined from the video, and the accuracy of processing the video frames is improved.
In addition, an embodiment of the present invention further provides an electronic device, where the electronic device may be a terminal or a server, and as shown in fig. 9, a schematic structural diagram of the electronic device according to the embodiment of the present invention is shown, specifically:
the electronic device may include Radio Frequency (RF) circuitry 901, memory 902 including one or more computer-readable storage media, input unit 903, display unit 904, sensor 905, audio circuitry 906, wireless Fidelity (WiFi) module 907, processor 908 including one or more processing cores, and power supply 909. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 9 does not constitute a limitation of the electronic device and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. Wherein:
RF circuit 901 can be used for receiving and transmitting signals during a message transmission or communication session, and in particular, for receiving downlink information from a base station and processing the received downlink information by one or more processors 908; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuit 901 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 901 can also communicate with a network and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), general Packet Radio Service (GPRS), code Division Multiple Access (CDMA), wideband Code Division Multiple Access (WCDMA), long Term Evolution (LTE), email, short Message Service (SMS), and the like.
The memory 902 may be used to store software programs and modules, and the processor 908 executes various functional applications and data processing by operating the software programs and modules stored in the memory 902. The memory 902 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phone book, etc.) created according to the use of the electronic device, and the like. Further, the memory 902 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 902 may also include a memory controller to provide access to the memory 902 by the processor 908 and the input unit 903.
The input unit 903 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in one particular embodiment, the input unit 903 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (e.g., operations by a user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 908, and receives and executes commands from the processor 908. In addition, touch sensitive surfaces may be implemented using various types of resistive, capacitive, infrared, and surface acoustic waves. The input unit 903 may include other input devices in addition to a touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 904 may be used to display information input by or provided to a user and various graphical user interfaces of the electronic device, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 904 may include a Display panel, and optionally, the Display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is communicated to the processor 908 to determine the type of touch event, and the processor 908 provides a corresponding visual output on the display panel according to the type of touch event. Although in FIG. 9 the touch sensitive surface and the display panel are two separate components to implement input and output functions, in some embodiments the touch sensitive surface may be integrated with the display panel to implement input and output functions.
The electronic device may also include at least one sensor 905, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that turns off the display panel and/or the backlight when the electronic device is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when the mobile phone is stationary, and can be used for applications of recognizing the posture of the mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which may be further configured to the electronic device, detailed descriptions thereof are omitted.
Audio circuitry 906, a speaker, and a microphone may provide an audio interface between a user and the electronic device. The audio circuit 906 may transmit the electrical signal converted from the received audio data to a speaker, and the electrical signal is converted into a sound signal by the speaker and output; on the other hand, the microphone converts the collected sound signal into an electric signal, which is received by the audio circuit 906 and converted into audio data, which is then processed by the audio data output processor 908, and then transmitted to, for example, another electronic device via the RF circuit 901, or the audio data is output to the memory 902 for further processing. The audio circuitry 906 may also include an earbud jack to provide communication of a peripheral headset with the electronic device.
WiFi belongs to short-distance wireless transmission technology, and the electronic equipment can help a user to receive and send emails, browse webpages, access streaming media and the like through the WiFi module 907, and provides wireless broadband internet access for the user. Although fig. 9 shows the WiFi module 907, it is understood that it does not belong to the essential constitution of the electronic device, and may be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 908 is a control center of the electronic device, connects various parts of the entire cellular phone using various interfaces and lines, and performs various functions of the electronic device and processes data by operating or executing software programs and/or modules stored in the memory 902 and calling data stored in the memory 902. Optionally, processor 908 may include one or more processing cores; preferably, the processor 908 may integrate an application processor, which primarily handles operating system, user interface, applications, etc. with a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 908.
The electronic device also includes a power supply 909 (e.g., a battery) that provides power to the various components, which may preferably be logically coupled to the processor 908 via a power management system, such that the functions of managing charging, discharging, and power consumption are performed via the power management system. The power supply 909 may also include any component, such as one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
Although not shown, the electronic device may further include a camera, a bluetooth module, and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 908 in the electronic device loads an executable file corresponding to a process of one or more application programs into the memory 902 according to the following instructions, and the processor 908 runs the application programs stored in the memory 902, thereby implementing various functions as follows:
acquiring a plurality of modal images corresponding to each video frame in the video to be processed based on the video to be processed, wherein the plurality of modal images of the same video frame respectively represent video contents corresponding to the video frames in different modalities;
performing modal feature extraction on the modal images to obtain image features corresponding to the modal images of the modalities, and determining nodes corresponding to the image features under the modalities in an image feature space corresponding to each modality;
determining effective neighbor nodes of each node in each image feature space, wherein the nodes and the effective neighbor nodes are associated nodes, and if one node is the associated node of another node, the distance between the node and the another node needs to meet a preset node association condition;
carrying out node clustering on each node in each image feature space according to each node and an effective neighbor node corresponding to each node to obtain at least one node clustering set in each mode;
generating a video content set corresponding to each node cluster set based on the video frames corresponding to the nodes in each node cluster set;
and determining video content sets corresponding to a plurality of modalities of the same object according to the video time coincidence degree of each video content set, and fusing the video content sets corresponding to the same object to obtain a processing result of at least one object in the video to be processed.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the present invention provides a computer-readable storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the steps in any one of the video frame processing methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
acquiring a plurality of modal images corresponding to each video frame in the video to be processed based on the video to be processed, wherein the plurality of modal images of the same video frame respectively represent video contents corresponding to the video frames in different modalities;
performing modal feature extraction on the modal images to obtain image features corresponding to the modal images of the modalities, and determining nodes corresponding to the image features under the modalities in an image feature space corresponding to each modality;
determining effective neighbor nodes of each node in each image feature space, wherein the nodes and the effective neighbor nodes are mutually associated nodes, and if one node is the associated node of the other node, the distance between the node and the other node needs to meet a preset node association condition;
carrying out node clustering on each node in each image characteristic space according to each node and an effective neighbor node corresponding to each node to obtain at least one node clustering set under each mode;
generating a video content set corresponding to each node cluster set based on the video frames corresponding to the nodes in each node cluster set;
and determining video content sets corresponding to a plurality of modalities of the same object according to the video time coincidence degree of each video content set, and fusing the video content sets corresponding to the same object to obtain a processing result of at least one object in the video to be processed.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the computer-readable storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the computer-readable storage medium can execute the steps in any video frame processing method provided in the embodiments of the present invention, the beneficial effects that can be achieved by any video frame processing method provided in the embodiments of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
According to an aspect of the application, there is also provided a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the method provided in the various alternative implementations in the above embodiments.
The video frame processing method, apparatus, electronic device, storage medium, and program product provided by the embodiments of the present invention are described in detail above, and a specific example is applied in the present disclosure to explain the principle and the implementation of the present invention, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for those skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (16)

1. A method for processing video frames, comprising:
acquiring a plurality of modal images corresponding to each video frame in a video to be processed based on the video to be processed, wherein the plurality of modal images of the same video frame respectively represent video contents corresponding to the video frames in different modalities;
performing modal feature extraction on the modal images to obtain image features corresponding to the modal images of the various modalities, and determining nodes corresponding to the image features under the various modalities in an image feature space corresponding to each modality;
determining effective neighbor nodes of each node in each image feature space, wherein the nodes and the effective neighbor nodes are associated nodes, and if one node is an associated node of another node, the distance between the node and the another node needs to meet a preset node association condition;
performing node clustering on each node in each image feature space according to each node and the effective neighbor node corresponding to the node to obtain at least one node clustering set in each mode;
generating a video content set corresponding to each node cluster set based on the video frames corresponding to the nodes in each node cluster set;
determining video content sets corresponding to multiple modalities of the same object, and fusing the video content sets corresponding to the same object to obtain a processing result of at least one object in the video to be processed.
2. The method according to claim 1, wherein the performing node clustering on each node in each image feature space according to each node and the valid neighbor node corresponding to the node to obtain at least one node cluster set in each modality includes:
generating a subgraph which takes each node in each image feature space as a central node and the effective neighbor node corresponding to the node as other nodes according to each node and the effective neighbor node corresponding to the node;
acquiring a graph clustering network corresponding to each mode;
and carrying out graph clustering processing on each sub-graph through the graph clustering network of the corresponding mode to obtain at least one node clustering set in each mode.
3. The method according to claim 1, wherein the performing node clustering on each node in each image feature space according to each node and the valid neighbor node corresponding to the node to obtain at least one node cluster set in each modality includes:
according to each node and the effective neighbor node corresponding to the node, node characteristic updating is carried out on each node to obtain each updated node;
and performing node clustering on each updated node in each image feature space based on the similarity between each updated node and other updated nodes in the image feature space to obtain at least one node clustering set in each mode.
4. The method of claim 3, wherein the node feature updating is performed on each node according to each node and the valid neighboring node corresponding to the node to obtain each updated node, and the method comprises:
respectively calculating the space distance between each node and the corresponding effective neighbor node in the corresponding image feature space according to each node and the effective neighbor node corresponding to the node;
determining an association weight between each of the nodes and the corresponding valid neighbor node based on each of the spatial distances;
and carrying out node feature updating calculation according to the node features corresponding to the nodes, the node features of the effective neighbor nodes corresponding to the nodes and the association weights to obtain updated nodes.
5. The method according to claim 3, wherein the performing node clustering on each updated node in each image feature space based on the similarity between each updated node and other updated nodes in the image feature space to obtain at least one node cluster set in each modality includes:
determining a preset number of updated nodes as cluster center nodes in each image feature space based on each updated node;
obtaining the similarity between each updated node and each cluster center node in each image feature space;
dividing each updated node into a cluster where a corresponding target center node is located, wherein the similarity between the updated node and the corresponding target center node is not lower than a preset similarity threshold;
selecting a new cluster center node of each cluster based on the updated nodes in each cluster, and returning to the step of obtaining the similarity between each updated node and each cluster center node in each image feature space until a clustering end condition is met;
and respectively determining the updated nodes in each cluster as a node cluster set.
6. The method according to claim 2, wherein before the obtaining the graph clustering network corresponding to each modality, the method further comprises:
carrying out graph clustering processing on sample sub-graphs in each sample image feature space through a graph clustering network to be trained corresponding to each mode to obtain at least one training node clustering set in each mode, wherein each sample sub-graph is annotated with a reference clustering set result;
based on the training node clustering set and the reference clustering set result in each mode, respectively calculating the loss of the graph clustering network to be trained corresponding to each mode;
and adjusting the network parameters of each graph clustering network to be trained according to the loss to obtain the trained graph clustering network corresponding to each mode.
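A hedged sketch of the training step in claim 6, written with PyTorch. The single-layer aggregation-plus-projection network and the cross-entropy loss against the annotated reference cluster labels are placeholders; the claim only requires some graph clustering network, some loss, and a parameter update.

import torch
import torch.nn as nn

class GraphClusterNet(nn.Module):
    # toy stand-in for the per-modality graph clustering network
    def __init__(self, feat_dim, n_clusters):
        super().__init__()
        self.proj = nn.Linear(feat_dim, n_clusters)

    def forward(self, node_feats, adj):
        # one round of neighbor aggregation over the subgraph adjacency,
        # followed by a projection to per-cluster logits
        aggregated = adj @ node_feats
        return self.proj(aggregated)

def train_step(model, optimizer, node_feats, adj, reference_labels):
    # supervised update against the annotated reference cluster assignments
    logits = model(node_feats, adj)
    loss = nn.functional.cross_entropy(logits, reference_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()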
7. The method according to any one of claims 1 to 6, wherein the performing modal feature extraction on the modal images to obtain the image features corresponding to the modal images of each modality comprises:
mapping the modal images into the modal feature vector spaces corresponding to the different modalities through the feature mapping parameters of a shared modal feature extraction model, and obtaining the image features corresponding to the modal images of each modality based on the mapping results, wherein the shared modal feature extraction model is trained based on a multi-modal image sample set, and the multi-modal image sample set comprises a plurality of sample modal images representing image contents in different modalities.
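The shared modal feature extraction model of claim 7 could, for illustration, be organized as one shared backbone whose feature mapping parameters are reused for every modality, plus one lightweight head per modality that maps into that modality's feature vector space. The tiny convolutional backbone, the head layout, and the modality names below are assumptions made only to keep the sketch runnable.

import torch
import torch.nn as nn

class SharedModalFeatureExtractor(nn.Module):
    # the convolutional backbone is shared across modalities (its feature
    # mapping parameters are reused), while a small per-modality head maps
    # the shared feature into that modality's feature vector space
    def __init__(self, modalities=("face", "limb", "audio"), out_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.heads = nn.ModuleDict({m: nn.Linear(16, out_dim) for m in modalities})

    def forward(self, modal_image, modality):
        shared = self.backbone(modal_image)      # shared feature mapping
        return self.heads[modality](shared)      # modality-specific vector space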
8. The method according to claim 7, wherein before the mapping the modal images into the modal feature vector spaces corresponding to the different modalities through the feature mapping parameters of the shared modal feature extraction model, the method further comprises:
acquiring a multi-modal image sample set, wherein the multi-modal image sample set comprises a plurality of sample modal images representing image contents in different modalities, and each sample modal image is annotated with a reference modality;
performing modal feature extraction on each sample modal image in the multi-modal image sample set through the shared modal feature extraction model to be trained to obtain the sample image feature corresponding to each sample modal image;
performing modal classification on each sample image feature through a modal classification model to obtain the training modality corresponding to each sample image feature;
calculating the loss of the shared modal feature extraction model to be trained based on the training modality and the reference modality of each sample image feature;
and adjusting the model parameters of the shared modal feature extraction model to be trained according to the loss to obtain the trained shared modal feature extraction model.
9. The method according to claim 8, wherein the calculating the loss of the shared modal feature extraction model to be trained based on the training modality and the reference modality of each sample image feature comprises:
calculating, based on the training modality and the reference modality of each sample image feature, the loss of the shared modal feature extraction model to be trained under each modality;
and calculating, based on the losses under each modality, the total loss of the shared modal feature extraction model as the loss of the shared modal feature extraction model to be trained.
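A sketch of the loss computation in claims 8 and 9, assuming the modal classification model outputs one logit per modality, that the per-modality loss is a cross-entropy restricted to the samples whose reference modality is that modality, and that the total loss is their unweighted sum; the grouping and the sum are assumptions, not taken from the claims.

import torch
import torch.nn as nn

def shared_extractor_loss(modality_logits, reference_modalities, num_modalities):
    # modality_logits: (N, num_modalities) outputs of the modal classification model
    # reference_modalities: (N,) annotated reference modality index of each sample feature
    per_modality_losses = {}
    for m in range(num_modalities):
        mask = reference_modalities == m
        if mask.any():
            per_modality_losses[m] = nn.functional.cross_entropy(
                modality_logits[mask], reference_modalities[mask])
    total_loss = sum(per_modality_losses.values())   # assumed unweighted sum over modalities
    return total_loss, per_modality_losses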
10. The method according to claim 1, wherein the determining the video content sets corresponding to a plurality of modalities of the same object comprises:
determining video time information of each video content set according to the video frame ordering information, in the video to be processed, of the video frames corresponding to each video content set;
and performing matching according to the video time information of each video content set, and taking successfully matched video content sets in different modalities as the video content sets corresponding to a plurality of modalities of the same object.
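For claim 10, one plausible reading of "matching according to the video time information" is a temporal-overlap test between the time spans covered by two content sets. The "span" dictionary key and the overlap threshold in the sketch are hypothetical.

def temporal_overlap_ratio(span_a, span_b):
    # span = (start_time, end_time), e.g. in seconds
    start = max(span_a[0], span_b[0])
    end = min(span_a[1], span_b[1])
    intersection = max(0.0, end - start)
    union = (span_a[1] - span_a[0]) + (span_b[1] - span_b[0]) - intersection
    return intersection / union if union > 0 else 0.0

def match_content_sets(sets_modality_a, sets_modality_b, threshold=0.5):
    # pair content sets from two modalities whose time spans overlap enough
    # to be treated as belonging to the same object
    matches = []
    for i, a in enumerate(sets_modality_a):
        for j, b in enumerate(sets_modality_b):
            if temporal_overlap_ratio(a["span"], b["span"]) >= threshold:
                matches.append((i, j))
    return matches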
11. The method according to any one of claims 1 to 10, wherein the obtaining, based on the video to be processed, a plurality of modal images corresponding to each video frame in the video to be processed comprises:
acquiring, based on the video to be processed, each video frame in the video to be processed and the audio information corresponding to each video frame;
extracting a face image and a limb image from each video frame to obtain a face modal image and a limb modal image corresponding to each video frame;
performing voiceprint analysis on the audio information corresponding to each video frame to obtain an audio modal image corresponding to each piece of audio information;
and taking the face modal image, the limb modal image and the audio modal image corresponding to each video frame as the plurality of modal images corresponding to the video frame.
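A sketch of the per-frame modal image extraction of claim 11. The face_detector and limb_detector callables are hypothetical placeholders for any detector that returns a single (x1, y1, x2, y2) box, and the magnitude spectrogram used as the "audio modal image" is only one possible outcome of a voiceprint analysis.

import numpy as np

def frame_to_modal_images(frame, audio_chunk, face_detector, limb_detector):
    # frame: H x W x 3 array; audio_chunk: 1-D waveform aligned with the frame
    fx1, fy1, fx2, fy2 = face_detector(frame)
    lx1, ly1, lx2, ly2 = limb_detector(frame)
    face_image = frame[fy1:fy2, fx1:fx2]
    limb_image = frame[ly1:ly2, lx1:lx2]
    # toy "voiceprint image": short-window magnitude spectrogram of the audio
    n = (len(audio_chunk) // 256) * 256
    audio_image = np.abs(np.fft.rfft(audio_chunk[:n].reshape(-1, 256), axis=1))
    return {"face": face_image, "limb": limb_image, "audio": audio_image}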
12. The method according to claim 11, wherein the generating a video content set corresponding to each node cluster set based on the video frames corresponding to the nodes in each node cluster set comprises:
determining, based on the nodes in each node cluster set under the face modality, the face modality extraction area of the face modal image corresponding to each node in the corresponding video frame;
determining, based on the nodes in each node cluster set under the limb modality, the limb modality extraction area of the limb modal image corresponding to each node in the corresponding video frame;
calculating the intersection ratio between each face modality extraction area and each limb modality extraction area;
determining a video frame set of the same object based on the intersection ratios, the video frames corresponding to the face modal images and the video frames corresponding to the limb modal images;
determining, according to the video frames corresponding to the nodes in each node cluster set under the audio modality, the audio information set corresponding to each node cluster set;
and taking the video frame set and the audio information set as the video content set.
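The "intersection ratio" in claim 12 is read here as the standard intersection-over-union (IoU) of the two extraction areas in the same video frame; the sketch assumes axis-aligned (x1, y1, x2, y2) boxes.

def box_iou(box_a, box_b):
    # boxes are (x1, y1, x2, y2) extraction areas in the same video frame
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

A high ratio indicates that a face region and a limb region lie on the same person in that frame, which is how the two modalities' clusters can be linked to one object.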
13. A video frame processing apparatus, comprising:
an image acquisition unit, configured to acquire, based on a video to be processed, a plurality of modal images corresponding to each video frame in the video to be processed, wherein the plurality of modal images of a same video frame respectively represent the video content corresponding to the video frame in different modalities;
a feature extraction unit, configured to perform modal feature extraction on the modal images to obtain the image features corresponding to the modal images of each modality, and to determine, in the image feature space corresponding to each modality, the node corresponding to each image feature under the modality;
a node determining unit, configured to determine the valid neighbor nodes of each node in each image feature space, wherein a node and its valid neighbor nodes are associated nodes of each other, and a node is an associated node of another node only if the distance between the two nodes satisfies a preset node association condition;
a node clustering unit, configured to perform node clustering on each node in each image feature space according to each node and the valid neighbor nodes corresponding to the node to obtain at least one node cluster set in each modality;
a set generating unit, configured to generate a video content set corresponding to each node cluster set based on the video frames corresponding to the nodes in each node cluster set;
and a video processing unit, configured to determine the video content sets corresponding to a plurality of modalities of the same object, and to fuse the video content sets corresponding to the same object to obtain a processing result of at least one object in the video to be processed.
14. An electronic device comprising a memory and a processor; the memory stores an application program, and the processor is configured to run the application program in the memory to perform the steps of the video frame processing method according to any one of claims 1 to 12.
15. A computer-readable storage medium storing instructions adapted to be loaded by a processor to perform the steps of the video frame processing method according to any one of claims 1 to 12.
16. A computer program product comprising a computer program or instructions, characterized in that the computer program or instructions, when executed by a processor, implement the steps of the video frame processing method according to any one of claims 1 to 12.
CN202210602387.2A, filed 2022-05-30 (priority 2022-05-30): Video frame processing method, video frame processing device, electronic device, storage medium, and program product (status: Pending; publication CN115147754A)

Priority Applications (1)

Application Number: CN202210602387.2A
Priority Date / Filing Date: 2022-05-30
Title: Video frame processing method, video frame processing device, electronic device, storage medium, and program product

Publications (1)

Publication Number: CN115147754A
Publication Date: 2022-10-04

Family

ID: 83406959

Country Status (1)

Country: CN
Link: CN115147754A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination