CN113157965B - Audio visualization model training and audio visualization method, device and equipment - Google Patents


Info

Publication number
CN113157965B
Authority
CN
China
Prior art keywords
audio
video
target
node
feature representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110493845.9A
Other languages
Chinese (zh)
Other versions
CN113157965A
Inventor
展丽霞
肖强
孔昭阳
董家骥
李勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Netease Cloud Music Technology Co Ltd
Original Assignee
Hangzhou Netease Cloud Music Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Netease Cloud Music Technology Co Ltd filed Critical Hangzhou Netease Cloud Music Technology Co Ltd
Priority to CN202110493845.9A priority Critical patent/CN113157965B/en
Publication of CN113157965A publication Critical patent/CN113157965A/en
Application granted granted Critical
Publication of CN113157965B publication Critical patent/CN113157965B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/64 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/61 Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an audio visualization model training method, an audio visualization method, and corresponding devices and equipment, comprising the following steps: acquiring a training sample including user information, a user history playing video, a target audio, a target video, and a relation label indicating whether the target audio and the target video are related; inputting the training sample into an audio visualization model, and performing feature extraction on the target audio to obtain a first feature representation of the target audio; performing feature extraction on the user information and the user history playing video to obtain user features and user interest expression features, performing feature extraction on the target video to obtain a second feature representation, and jointly processing the user features, the user interest expression features and the second feature representation to obtain a third feature representation; determining a similarity between the first feature representation and the third feature representation; and updating parameters of the audio visualization model according to the similarity and the relation label in the training sample. The invention can perform personalized video collocation for the same audio and meet diversified user requirements.

Description

Audio visualization model training and audio visualization method, device and equipment
Technical Field
The invention relates to the technical field of audio and video, in particular to an audio visualization model training and audio visualization method, device and equipment.
Background
During audio playback, a user moves from sensory perception to rational appreciation of an audio work through sound perception, emotional feeling, image association and rational understanding. Audio has the characteristic of evoking imagery: accompanied by emotion, listeners form images of sound, life scenes and artistic conception through imaginative association, and audio visualization derives from this. Audio visualization mainly interprets musical emotion through video animation, integrating audio material with video.
An audio playing scenario proposed in the related art automatically matches a dynamic video to the audio currently played by a user, so that the user is not only moved through the auditory channel but also given an impact through the visual channel.
In the related art, dynamic videos are automatically collocated with the played audio mainly by establishing mapping rules between video type labels and song styles; during audio playback, videos under the corresponding video type label are collocated according to the song style of the audio and the mapping rules. When determining the mapping rules, the related work mainly studies complex processing such as audio content understanding, emotion detection, and graphic image translation, scaling, rotation and shearing, finally presenting the expressiveness of the audio and bringing strong visual stimulation to the user. However, this technology ignores the user's interest preferences: video collocation is not driven by the user's personalized preferences, so diversified user requirements are difficult to meet.
Disclosure of Invention
Embodiments of the present invention provide an audio visualization method, apparatus, device, and medium, which can implement personalized video matching on the same audio according to interest preferences of users, and meet diversified user requirements.
In a first aspect, an embodiment of the present invention provides an audio visualization model training method, where the method includes:
acquiring a training sample, wherein the training sample comprises user information, a user history playing video, a target audio, a target video and a relation label representing whether the target audio and the target video are associated or not;
inputting the training sample into an audio visual model, and performing feature extraction on the target audio to obtain a first feature representation of the target audio;
performing feature extraction on the user information and on the relationship between the user history playing video and the target video to obtain user features and user interest expression features, performing feature extraction on the target video to obtain a second feature representation, and jointly processing the user features, the user interest expression features and the second feature representation to obtain a third feature representation;
determining a similarity between the first feature representation and the third feature representation;
and updating the parameters of the audio visual model according to the similarity and the relation label in the training sample.
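For illustration only, the following is a minimal sketch, in PyTorch-style Python, of how such a training step could be wired together. The module structure, layer sizes, and the use of cosine similarity with binary cross-entropy are assumptions made for readability, not the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualizationModel(nn.Module):
    """Hypothetical two-branch model: one branch embeds the target audio, the other
    fuses user features, user interest expression features and the target video."""
    def __init__(self, audio_dim, video_dim, user_dim, hidden_dim=128):
        super().__init__()
        self.audio_net = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU(),
                                       nn.Linear(hidden_dim, hidden_dim))
        self.video_net = nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.ReLU(),
                                       nn.Linear(hidden_dim, hidden_dim))
        self.user_net = nn.Linear(user_dim, hidden_dim)
        self.interest_net = nn.Linear(video_dim, hidden_dim)
        self.fuse = nn.Linear(3 * hidden_dim, hidden_dim)

    def forward(self, audio_vec, video_vec, user_vec, history_video_vec):
        first = self.audio_net(audio_vec)                    # first feature representation
        second = self.video_net(video_vec)                   # second feature representation
        user = self.user_net(user_vec)                       # user features
        interest = self.interest_net(history_video_vec)      # user interest expression features
        third = self.fuse(torch.cat([user, interest, second], dim=-1))  # third feature representation
        # similarity between the first and third feature representations, squashed to (0, 1)
        return torch.sigmoid(F.cosine_similarity(first, third, dim=-1))

def train_step(model, optimizer, batch):
    sim = model(batch["audio"], batch["video"], batch["user"], batch["history"])
    # compare the similarity with the relation label, then update the model parameters
    loss = F.binary_cross_entropy(sim, batch["relation_label"].float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```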
As an optional embodiment, the training sample further comprises a knowledge graph; performing feature extraction on the target audio/target video, including:
determining a target node corresponding to the target audio/target video in the knowledge graph, and determining a neighbor node establishing an association relation with the target node through an edge;
extracting the characteristics of the attribute information of the neighbor node and the incidence relation corresponding to the edge of the neighbor node connected with the target node to obtain the relation expression characteristics of the target node and the neighbor node in the knowledge graph;
the relation expression features of the target node and the neighbor nodes in the knowledge graph include the first feature representation expressing the relation between the target audio and its neighbor nodes in the knowledge graph, or the second feature representation expressing the relation between the target video and its neighbor nodes in the knowledge graph;
the knowledge graph is a graph constructed by defining entities as nodes, connecting the nodes with incidence relations through edges, determining the types of the edges according to the types of the incidence relations, and filling attribute information of the nodes according to the relevant information of the nodes, wherein the entities comprise audio and video.
By constructing a complex knowledge graph that fuses rich content information such as audio and video, the relevance between videos and songs across richer attributes is fully considered, and the relational feature expression of audio and video is enhanced.
As an alternative embodiment, the knowledge graph is constructed as follows:
defining entity types, entity attribute information, edges corresponding to different types of incidence relations and rules for judging the incidence relations of the types, wherein the entity types comprise video types and audio types;
according to the defined entity type and the entity attribute information, extracting entities with different entity types from a source database as nodes, and extracting the attribute information of the nodes from the related information of the nodes;
and determining whether the incidence relation exists between different nodes according to the rule for judging the incidence relation of each type, and connecting the different nodes by utilizing edges of corresponding types according to the type of the incidence relation when determining that the incidence relation exists.
By the method, the knowledge graph rich in audio and video relevance can be constructed, corresponding entities, edge types and extraction rules can be defined according to specific requirements, and the complex knowledge graph fusing rich content information such as audio and video can be automatically constructed.
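The construction flow above can be pictured with a small sketch in plain Python; the in-memory representation, the rule function and the source-database interface shown here are illustrative assumptions rather than the patented rule set.

```python
# Entity types and one example association rule (assumptions for illustration).
ENTITY_TYPES = {"audio", "video", "artist"}

def same_artist_rule(node_a, node_b):
    """Example rule for judging one association type: both entities share an artist."""
    return node_a["attrs"].get("artist_id") is not None and \
           node_a["attrs"].get("artist_id") == node_b["attrs"].get("artist_id")

EDGE_RULES = {"same-artist": same_artist_rule}

def build_graph(source_db):
    nodes, edges = {}, []
    # 1) extract entities of the defined types as nodes and fill their attribute information
    for record in source_db:
        if record["type"] in ENTITY_TYPES:
            nodes[record["id"]] = {"type": record["type"], "attrs": record["info"]}
    # 2) apply each rule to decide whether an association exists, then connect the
    #    two nodes with an edge of the corresponding association type
    ids = list(nodes)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            for edge_type, rule in EDGE_RULES.items():
                if rule(nodes[a], nodes[b]):
                    edges.append((a, b, edge_type))
    return nodes, edges
```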
As an optional implementation, the method further comprises:
storing the extracted nodes, the attribute information of the extracted nodes, the result of whether the determined incidence relation exists and the connection information of the edges by using different tables respectively;
and taking the extracted node as an index entry, and fusing the different tables to obtain entry contents of the node, wherein the entry contents comprise attribute information of the extracted node, neighbor nodes related to the extracted node, and types of association relations between the neighbor nodes related to the extracted node and the extracted node.
Through the fusion mode, all information related to each node can be integrated together, and when the node is used as an index, various relationships can be acquired.
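A minimal sketch of this fusion step, assuming the node, attribute and edge tables have already been loaded into simple Python structures (the table layouts are assumptions):

```python
def fuse_tables(node_table, attribute_table, edge_table):
    """Merge the separate tables into one entry per node, keyed by node id."""
    entries = {}
    for node_id in node_table:
        entries[node_id] = {
            "attributes": attribute_table.get(node_id, {}),
            "neighbors": [],   # list of (neighbor id, association type) pairs
        }
    for src, dst, edge_type in edge_table:
        entries[src]["neighbors"].append((dst, edge_type))
        entries[dst]["neighbors"].append((src, edge_type))
    return entries

# Usage: entries["song_42"] gathers that node's attribute information, its neighbor
# nodes, and the type of each association, retrievable with a single index lookup.
```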
As an optional implementation manner, performing feature extraction on the attribute information of the neighbor node and the association relationship corresponding to the edge of the neighbor node connected to the target node to obtain the relationship expression feature of the target node and the neighbor node in the knowledge graph, includes:
determining isomorphic neighbor nodes belonging to the same entity type as the target node, and performing feature extraction on the association relation corresponding to the attribute information of the isomorphic neighbor nodes and the edges of the isomorphic neighbor nodes connected with the target node by utilizing a first feature extraction layer to obtain first relation expression features of the target node and the isomorphic neighbor nodes in the knowledge graph;
determining heterogeneous neighbor nodes belonging to different entity types from the target node, and performing feature extraction on the attribute information of the heterogeneous neighbor nodes and the incidence relation corresponding to the edges of the heterogeneous neighbor nodes connected with the target node by using a first feature extraction layer to obtain a second relation expression feature of the target node connected with the heterogeneous neighbor in the knowledge graph;
and converting the first relational expression characteristic and the second relational expression characteristic into the same vector space by using a second characteristic extraction layer to obtain the relational expression characteristic of the target node and the neighbor node in the knowledge graph.
Through this feature extraction process, for each target node, not only the relational expressions between nodes of the same type and the target node can be extracted, but also the relational expressions between nodes of different types and the target node. Because nodes of different types differ greatly at the feature level and in network topology, after the relational expression features are extracted by the first feature extraction layer, they are further processed by the second feature extraction layer so as to be converted into the same vector space.
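The two-stage extraction can be sketched as follows; mean-pooling aggregation and the linear layers are simplifying assumptions, whereas the patent's figures suggest a GAT-style attention module for the first stage.

```python
import torch
import torch.nn as nn

class NeighborRelationEncoder(nn.Module):
    """Sketch: a first feature extraction layer per neighbor group (isomorphic vs.
    heterogeneous), then a second layer mapping both into the same vector space."""
    def __init__(self, attr_dim, edge_dim, hidden_dim):
        super().__init__()
        self.homo_layer = nn.Linear(attr_dim + edge_dim, hidden_dim)    # first feature extraction layer
        self.hetero_layer = nn.Linear(attr_dim + edge_dim, hidden_dim)  # first feature extraction layer
        self.align = nn.Linear(hidden_dim, hidden_dim)                  # second feature extraction layer

    def forward(self, homo_feats, hetero_feats):
        # homo_feats / hetero_feats: [num_neighbors, attr_dim + edge_dim], built from each
        # neighbor's attribute information and the association type of its connecting edge
        first_rel = torch.relu(self.homo_layer(homo_feats)).mean(dim=0)      # first relation expression feature
        second_rel = torch.relu(self.hetero_layer(hetero_feats)).mean(dim=0) # second relation expression feature
        # project both relation expressions into the same vector space and combine them
        return self.align(first_rel) + self.align(second_rel)
```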
As an alternative implementation, determining the similarity between the first feature representation and the third feature representation includes:
inputting the first feature representation into a song tower layer (the song-side tower of a two-tower structure), and performing regularization processing on the first feature representation by using the song tower layer;
inputting the third feature representation into a video tower layer (the video-side tower of the two-tower structure), and performing regularization processing on the third feature representation by using the video tower layer;
determining a similarity between the regularized first feature representation and the third feature representation.
By the regularization process, the first feature representation and the third feature representation can be guaranteed to be in the same order of magnitude.
As an alternative implementation, determining the similarity between the first feature representation and the third feature representation includes:
performing regularization processing on the first feature representation by using three LeakyReLU layers in the song tower layer;
performing regularization processing on the third feature representation by using three LeakyReLU layers in the video tower layer;
and determining the similarity between the regularized first feature representation and third feature representation through a sigmoid function connecting the song tower layer and the video tower layer.
Performing the regularization on the first feature representation/the third feature representation with three LeakyReLU layers prevents the gradient from vanishing while keeping the gradient well-behaved, thereby meeting the requirement of network convergence.
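The following sketch shows one way such a two-tower structure with three LeakyReLU layers per tower and a connecting sigmoid could look; the layer widths and the inner-product scoring are assumptions.

```python
import torch
import torch.nn as nn

def tower(dim):
    """Three LeakyReLU layers, as described above (layer widths are assumptions)."""
    return nn.Sequential(
        nn.Linear(dim, dim), nn.LeakyReLU(),
        nn.Linear(dim, dim), nn.LeakyReLU(),
        nn.Linear(dim, dim), nn.LeakyReLU(),
    )

class TwoTowerSimilarity(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.song_tower = tower(dim)    # regularizes the first feature representation
        self.video_tower = tower(dim)   # regularizes the third feature representation

    def forward(self, first_repr, third_repr):
        a = self.song_tower(first_repr)
        b = self.video_tower(third_repr)
        # a sigmoid over the inner product connects the two towers and yields a similarity in (0, 1)
        return torch.sigmoid((a * b).sum(dim=-1))
```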
As an optional implementation, updating parameters of the audio visualization model according to the similarity and the relationship label in the training sample includes:
determining a first loss function according to the similarity and a relation label in the training sample;
determining a second loss function according to the similarity and the similarity between the first feature representation and the third feature representation which are fitted according to the knowledge graph, wherein the greater the number of shared neighbor nodes of the target node corresponding to the target audio and the target node corresponding to the target video is, the greater the fitted similarity is;
and updating the parameters of the audio visual model according to the first loss function and the second loss function.
By adjusting the parameters of the audio visualization model with two loss functions, two objectives can be taken into account simultaneously: the degree of matching between the song and the video, and the user's acceptance of the joint distribution of the song and the video.
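As a sketch under stated assumptions, the two objectives could be combined as below; the use of binary cross-entropy and mean squared error, the weighting, and the exact mapping from shared-neighbor counts to a fitted similarity are all illustrative choices, not the patented formulation.

```python
import torch
import torch.nn.functional as F

def fitted_similarity(audio_neighbors, video_neighbors):
    """Fitted target that grows with the number of neighbor nodes shared in the
    knowledge graph by the audio node and the video node (mapping is an assumption)."""
    shared = len(set(audio_neighbors) & set(video_neighbors))
    return shared / (shared + 1.0)   # monotonically increasing, squashed into [0, 1)

def total_loss(similarity, relation_label, graph_fitted_similarity, weight=0.5):
    # first loss: agreement between the predicted similarity and the relation label
    loss_label = F.binary_cross_entropy(similarity, relation_label.float())
    # second loss: agreement with the similarity fitted from the knowledge graph
    loss_graph = F.mse_loss(similarity, graph_fitted_similarity)
    return loss_label + weight * loss_graph
```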
As an optional implementation manner, determining a neighbor node that establishes an association relationship with the target node through an edge includes:
determining neighbor nodes which establish association relation with the target node through edges within a set hop count in the knowledge graph;
the hop count is the number of edges required to connect to the target node starting from a neighbor node.
The hop count can be set as required, so that the relational expression of the node with a strong association relation with the target node can be obtained, the relational expression of the node with a relatively weak association relation with the target node can also be obtained, and the relational expression characteristics of the target node are enriched.
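Collecting the neighbors within a set hop count is a breadth-first traversal; a minimal sketch over the fused node entries described earlier (structure assumed as node id mapped to a "neighbors" list):

```python
from collections import deque

def neighbors_within_hops(entries, target_node, max_hops):
    """Return neighbor nodes reachable from the target node within the set hop count;
    the hop count is the number of edges needed to connect back to the target node."""
    visited = {target_node}
    frontier = deque([(target_node, 0)])
    result = []
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for neighbor, _edge_type in entries[node]["neighbors"]:
            if neighbor not in visited:
                visited.add(neighbor)
                result.append(neighbor)
                frontier.append((neighbor, hops + 1))
    return result
```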
As an optional implementation, obtaining the relationship label in the training sample includes:
acquiring play behavior feedback of a user in the process of matching a target audio with a target video for playing;
and determining, according to the play behavior feedback, the relation label representing whether the target audio and the target video are associated.
Different playing behavior feedbacks represent the relevance of the audio and video, and the method can determine whether the online recommended audio and video collocation is relevant or not through the actual behavior feedback of the user, thereby obtaining the corresponding relation label in the training sample.
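The patent does not fix a concrete rule for turning feedback into labels, so the following is a purely hypothetical mapping, shown only to make the idea concrete; the field names and thresholds are assumptions.

```python
def relation_label_from_feedback(feedback):
    """Hypothetical rule set: derive the relation label from play behavior feedback."""
    if feedback.get("skipped_within_seconds", float("inf")) < 5:
        return 0   # an early skip is treated as "not associated"
    if feedback.get("completed") or feedback.get("liked"):
        return 1   # a completed play or an explicit like is treated as "associated"
    return 0
```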
As an optional implementation, before inputting the training sample into the audio visualization model, the method further includes:
semantic understanding is carried out on the text data of the target video by utilizing a video content understanding model to obtain a corresponding video text content vector;
extracting image frames of the target video, and performing content understanding on the target video by using an image content understanding model to obtain a corresponding image content vector;
and extracting audio frames of the target audio, and performing content understanding on the target audio by using an audio frame content prediction model to obtain a corresponding audio content vector.
Through the above processes, audio and video can be converted into corresponding vectors, so that the model can understand the content of the corresponding audio and video from the input vectors.
In a second aspect, an embodiment of the present invention provides an audio visualization method, where the method includes:
responding to the audio and video collocation request, and acquiring user information, candidate audio and candidate video;
inputting the user information, the candidate audio and the candidate video into the audio visual model obtained by training by the method provided by the first aspect;
performing feature extraction on the candidate audio by using the audio visualization model to obtain a first feature representation of the candidate audio;
performing feature extraction on the user information by using the audio visual model to obtain user features, performing feature extraction on the candidate video to obtain a second feature representation, and combining the user features and the second feature representation to obtain a third feature representation;
determining the similarity between the first feature representation and the third feature representation, and predicting the probability of whether each candidate audio is associated with each candidate video according to the similarity;
and selecting the candidate audio and the candidate video with the probability values larger than the preset value to carry out combined playing according to the determined probability of whether each candidate audio is associated with each candidate video.
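A minimal serving sketch of this second aspect, reusing the hypothetical model interface from the earlier training sketch; the brute-force scoring of all audio/video pairs and the 0.8 threshold are illustrative assumptions.

```python
import itertools

def collocate(model, user_vec, history_vec, candidate_audios, candidate_videos, threshold=0.8):
    """Score every candidate audio/video pair for this user and keep those whose
    predicted association probability exceeds the preset value."""
    pairs = []
    for (a_id, a_vec), (v_id, v_vec) in itertools.product(candidate_audios, candidate_videos):
        prob = float(model(a_vec, v_vec, user_vec, history_vec))   # association probability
        if prob > threshold:
            pairs.append((a_id, v_id, prob))
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```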
In a third aspect, an embodiment of the present invention provides an audio visualization model training apparatus, where the apparatus includes:
the system comprises a sample acquisition module, a comparison module and a comparison module, wherein the sample acquisition module is used for acquiring a training sample, and the training sample comprises user information, a user history playing video, a target audio, a target video and a relation label for representing whether the target audio and the target video are associated or not;
the first feature extraction module is used for inputting the training samples into an audio visualization model, and performing feature extraction on the target audio to obtain a first feature representation of the target audio;
the third feature extraction module is used for performing feature extraction on the user information and on the relationship between the user history playing video and the target video to obtain user features and user interest expression features, performing feature extraction on the target video to obtain a second feature representation, and jointly processing the user features, the user interest expression features and the second feature representation to obtain a third feature representation;
a similarity determination module for determining a similarity between the first feature representation and the third feature representation;
and the parameter updating module is used for updating the parameters of the audio visual model according to the similarity and the relation label in the training sample.
As an optional embodiment, the training sample further comprises a knowledge graph; the first feature extraction module performs feature extraction on the target audio/the third feature extraction module performs feature extraction on the target video, and the feature extraction method comprises the following steps:
determining a target node corresponding to the target audio/target video in the knowledge graph, and determining a neighbor node establishing an association relation with the target node through an edge;
extracting the characteristics of the attribute information of the neighbor node and the incidence relation corresponding to the edge of the neighbor node connected with the target node to obtain the relation expression characteristics of the target node and the neighbor node in the knowledge graph;
the relation expression characteristics of the target node and the adjacent nodes in the knowledge graph comprise the first characteristic representation of the relation expression of the target audio and the adjacent nodes in the knowledge graph or the second characteristic representation of the relation expression of the target video and the adjacent nodes in the knowledge graph;
the knowledge graph is a graph constructed by defining entities as nodes, connecting the nodes with incidence relation through edges, determining the types of the edges according to the types of the incidence relation and filling attribute information of the nodes according to the relevant information of the nodes, wherein the entities comprise audio and video.
As an optional implementation, the apparatus further comprises:
the knowledge graph building module is used for building the knowledge graph in the following way:
defining entity types, entity attribute information, edges corresponding to different types of incidence relations and rules for judging the incidence relations of the types, wherein the entity types comprise video types and audio types;
extracting entities with different entity types from a source database as nodes according to the defined entity types and the entity attribute information, and extracting the attribute information of the nodes from the related information of the nodes;
and determining whether the incidence relation exists between different nodes according to the rule for judging the incidence relation of each type, and connecting the different nodes by utilizing edges of corresponding types according to the type of the incidence relation when determining that the incidence relation exists.
As an optional implementation, the apparatus further comprises:
the storage module is used for respectively storing the extracted nodes, the attribute information of the extracted nodes, the result of whether the determined incidence relation exists and the connection information of the edges by using different tables;
and the fusion module is used for fusing the different tables to obtain the table entry content of the node by taking the extracted node as an index entry, wherein the table entry content comprises the attribute information of the extracted node, the neighbor node associated with the extracted node, and the type of the association relationship between the neighbor node associated with the extracted node and the extracted node.
As an optional implementation manner, the feature extraction performed by the first feature extraction module/the third feature extraction module on the attribute information of the neighbor node and the association relationship corresponding to the edge of the neighbor node connected to the target node to obtain the relationship expression feature of the target node and the neighbor node in the knowledge graph includes:
determining isomorphic neighbor nodes belonging to the same entity type as the target node, and performing feature extraction on the association relationship corresponding to the attribute information of the isomorphic neighbor nodes and the edges of the isomorphic neighbor nodes connected with the target node by utilizing a first feature extraction layer to obtain first relationship expression features of the target node and the isomorphic neighbor nodes in the knowledge graph;
determining heterogeneous neighbor nodes belonging to different entity types from the target node, and performing feature extraction on the attribute information of the heterogeneous neighbor nodes and the incidence relation corresponding to the edges of the heterogeneous neighbor nodes connected with the target node by using a first feature extraction layer to obtain a second relation expression feature of the target node connected with the heterogeneous neighbor in the knowledge graph;
and converting the first relational expression characteristic and the second relational expression characteristic into the same vector space by using a second characteristic extraction layer to obtain the relational expression characteristic of the target node and the neighbor node in the knowledge graph.
As an optional implementation, the similarity determination module determines a similarity between the first feature representation and the third feature representation, including:
inputting the first feature representation into a song tower layer (the song-side tower of a two-tower structure), and performing regularization processing on the first feature representation by using the song tower layer;
inputting the third feature representation into a video tower layer (the video-side tower of the two-tower structure), and performing regularization processing on the third feature representation by using the video tower layer;
determining a similarity between the regularized first feature representation and the third feature representation.
As an optional implementation, the similarity determination module determines a similarity between the first feature representation and the third feature representation, including:
performing regularization processing on the first feature representation by using three LeakyReLU layers in the song tower layer;
performing regularization processing on the third feature representation by using three LeakyReLU layers in the video tower layer;
and determining the similarity between the regularized first feature representation and third feature representation through a sigmoid function connecting the song tower layer and the video tower layer.
As an optional implementation manner, the updating the parameters of the audio visualization model by the parameter updating module according to the similarity and the relationship label in the training sample includes:
determining a first loss function according to the similarity and a relation label in the training sample;
determining a second loss function according to the similarity and the similarity between the first feature representation and the third feature representation which are fitted according to the knowledge graph, wherein the greater the number of shared neighbor nodes of the target node corresponding to the target audio and the target node corresponding to the target video is, the greater the fitted similarity is;
and updating the parameters of the audio visual model according to the first loss function and the second loss function.
As an optional implementation manner, the determining, by the first feature extraction module/the third feature extraction module, a neighbor node that establishes an association relationship with the target node through an edge includes:
determining neighbor nodes establishing an association relationship with the target node through edges within a set hop count in the knowledge graph;
the hop count is the number of edges required to connect to the target node starting from a neighbor node.
As an optional implementation manner, the obtaining, by the sample obtaining module, a relationship label in a training sample includes:
acquiring play behavior feedback of a user in the process of matching a target audio with a target video for playing;
and determining a relation label representing whether the target audio and the target video are related or not according to the play behavior feedback.
As an optional implementation, the apparatus further comprises:
a vector conversion module to perform, prior to inputting the training samples into an audio visualization model:
semantic understanding is carried out on the text data of the target video by utilizing a video content understanding model to obtain a corresponding video text content vector;
extracting image frames of the target video, and performing content understanding on the target video by using an image content understanding model to obtain a corresponding image content vector;
and extracting audio frames of the target audio, and performing content understanding on the target audio by using an audio frame content prediction model to obtain a corresponding audio content vector.
In a fourth aspect, an embodiment of the present invention provides an audio visualization apparatus, where the apparatus includes:
the information acquisition module is used for responding to the audio and video collocation request and acquiring user information, candidate audio and candidate video;
a model input module, configured to input the user information, the candidate audio, and the candidate video into an audio visualization model trained by the method provided in the first aspect;
the first feature extraction module is used for performing feature extraction on the candidate audio by using the audio visualization model to obtain a first feature representation of the candidate audio;
the third feature extraction module is used for performing feature extraction on the user information by using the audio visualization model to obtain user features, performing feature extraction on the candidate video to obtain second feature representation, and combining the user features and the second feature representation to obtain third feature representation;
a probability determination module, configured to determine a similarity between the first feature representation and the third feature representation, and predict, according to the similarity, a probability of whether each candidate audio is associated with each candidate video;
and the audio and video collocation module is used for selecting the candidate audio and the candidate video with the probability value larger than the preset value to carry out combined playing according to the determined probability of whether each candidate audio is associated with each candidate video.
In a fifth aspect, an embodiment of the present invention provides an audio visualization model training apparatus, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the audio visualization model training method provided by the first aspect.
In a sixth aspect, an embodiment of the present invention provides an audio visualization apparatus, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the audio visualization method provided by the second aspect.
In a seventh aspect, an embodiment of the present invention provides a storage medium, where the instructions in the storage medium, when executed by a processor of a device, enable the device to execute the audio visualization model training method provided in the first aspect or the audio visualization method provided in the second aspect.
With the audio visualization model training and audio visualization method, device and equipment provided by the embodiments of the invention, user features and user interest expression features are added to the features input during audio visualization model training. This not only satisfies the user's need for emotional interaction with visualized music, but also drives video collocation with the user's personalized preferences, so diversified user requirements can be met and personalized matching for hundreds of millions of users can be realized.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention and are not to be construed as limiting the invention.
FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present invention;
FIG. 2 is a flow diagram illustrating a method of audio visualization model training in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating conversion of text data of a target video to a video text content vector according to an exemplary embodiment;
FIG. 4 is a flow diagram illustrating the conversion of image frames of a target video into image content vectors in accordance with an exemplary embodiment;
FIG. 5 is a flowchart illustrating conversion of audio frames of a target audio into an audio content vector according to an example embodiment;
FIG. 6 is a diagram illustrating a software architecture employed to implement a method according to an embodiment of the present invention, in accordance with an illustrative embodiment;
FIG. 7 is a schematic diagram of the graph neural network structure employed by the audio visualization model, according to an exemplary embodiment;
FIG. 8 is a schematic diagram illustrating the structure of a GAT module in the graph neural network, according to an exemplary embodiment;
FIG. 9 is a detailed flow diagram illustrating audio visualization model training in accordance with an exemplary embodiment;
FIG. 10 is a flowchart of an audio visualization method shown in accordance with an exemplary embodiment;
FIG. 11 is a schematic diagram illustrating an audio visualization model training apparatus according to an exemplary embodiment;
FIG. 12 is a schematic structural diagram of an audio visualization device shown in accordance with an exemplary embodiment;
FIG. 13 is a schematic diagram of an audio visualization model training apparatus shown in accordance with an exemplary embodiment;
FIG. 14 is a schematic structural diagram of an audio visualization device shown in accordance with an exemplary embodiment;
fig. 15 is a schematic diagram of a program product according to an embodiment of the present invention.
Detailed Description
The principles and spirit of the present invention will be described with reference to several exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a method, apparatus, device, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
In this document, it is to be understood that any number of elements in the figures are provided by way of illustration and not limitation, and any nomenclature is used for differentiation only and not in any limiting sense.
Summary of The Invention
The inventors find that the music visualization technology in the related art mainly studies audio content understanding and complex processing such as graphic image translation, scaling, rotation and shearing, finally presenting musical expressiveness and bringing strong visual stimulation to viewers. However, this technique ignores user interest preferences and can hardly satisfy diversified user needs.
In view of this, embodiments of the present invention provide an audio visualization model training method, an audio visualization method, and corresponding devices and equipment, in which user features and user interest expression features are added to the features input during audio visualization model training, so that not only is the user's need for emotional interaction with visualized audio satisfied, but personalized matching for hundreds of millions of users is also achieved.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Application scene overview
Fig. 1 is a schematic view of an application scenario according to an embodiment of the present invention. The user 10 logs in the web server 12 through a client installed in the user device 11, where the client may be a browser of a web page or an application client installed in a mobile user device, such as a mobile phone, a tablet computer, or the like.
The user equipment 11 and the network server 12 are communicatively connected through a network, which may be a local area network, a wide area network, or the like. The user device 11 may be a portable device (e.g., a mobile phone, a tablet, a notebook, etc.) or a Personal Computer (PC), and the network server 12 may be any device capable of providing internet services.
The network server 12 may send the requested audio data to the client for playing according to the request of the client, and send the audio data to the client to match the audio data with the corresponding video data, so as to implement audio visualization. The network server 12 may train the audio visualization model using the training samples, and match the audio data and the video data using the audio visualization model.
In the embodiment of the present invention, the network server 12 obtains a training sample in a model training stage, where the training sample includes user information, a user history playing video, a target audio, a target video, and a relationship label representing whether the target audio and the target video are associated; inputting the training sample into an audio visual model, and performing feature extraction on the target audio to obtain a first feature representation of the target audio; performing feature extraction on the relationship between the user information, the user historical playing video and the target video to obtain user features and user interest expression features, performing feature extraction on the target video to obtain second feature representation, and performing combined processing on the user features, the user interest expression features and the second feature representation to obtain third feature representation; determining a similarity between the first feature representation and the third feature representation; and updating the parameters of the audio visual model according to the similarity and the relation label in the training sample.
In the embodiment of the invention, the network server 12 responds to the audio and video collocation request to acquire user information, candidate audio and candidate video in an audio/video recommendation stage; inputting the user information, the candidate audio and the candidate video into an audio visual model obtained through training; performing feature extraction on the candidate audio by using the audio visualization model to obtain a first feature representation of the candidate audio; performing feature extraction on the user information by using the audio visual model to obtain user features, performing feature extraction on the candidate video to obtain a second feature representation, and combining the user features and the second feature representation to obtain a third feature representation; determining the similarity between the first feature representation and the third feature representation, and predicting the probability of whether each candidate audio is associated with each candidate video according to the similarity; and selecting the candidate audio and the candidate video with the probability values larger than the preset value to perform combined playing according to the determined probability of whether each candidate audio is associated with each candidate video.
Exemplary method
In the following, in connection with the application scenario of fig. 1, an audio visualization model training method according to an exemplary embodiment of the present invention is described with reference to fig. 2. It should be noted that the above application scenario is only presented to facilitate understanding of the spirit and principles of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, the embodiments of the present invention may be applied to any applicable scenario.
As shown in fig. 2, an audio visualization model training method provided in an embodiment of the present invention includes:
step 201, obtaining a training sample, where the training sample includes user information, a user history playing video, a target audio, a target video, and a relationship label representing whether the target audio and the target video are associated;
in the embodiment of the invention, the training sample is obtained from a source database, and the data in the source database mainly has three types: audio data type, video data type and service data type related to user service operation.
The service data types related to the user service operation comprise user information and user history playing videos. The user information may include, but is not limited to, user basic information such as user age, gender, etc.; the user history playing video is video information played in the user history for a period of time, and may include information such as an ID of the video, content of the video, and the like.
The audio data and the video data accumulate some basic content characteristics through manual labeling or algorithm rules. For example, the video data itself carries manually labeled semantic tags and category tags.
As an optional implementation manner, before the training sample is input into the audio visualization model, the embodiment of the present invention first converts the data input into the model into a corresponding vector, and may specifically perform the following steps to obtain the corresponding vector:
1) semantic understanding is carried out on the text data of the target video by utilizing a video content understanding model to obtain a corresponding video text content vector;
the content understanding of video data includes both video text content understanding and video image content understanding.
The process of understanding the content of the video text in this embodiment is specifically shown in fig. 3, and mainly includes the following processes:
determining each target video in the training sample;
acquiring basic text data such as video titles, manually labeled video labels, video descriptions and the like of all target videos from a source data video library, and acquiring subtitle text information of the target videos by utilizing an OCR (optical character recognition) technology to obtain text data of the target videos;
merging the text data into comprehensive description information for each target video, and performing text segmentation on the comprehensive description information by using a segmentation tool to generate a video text dictionary;
performing one-hot (oneHot) encoding on the words in the video text dictionary to preprocess the dictionary data, where each encoded word is represented by a binary vector in which a single bit is set;
according to the encoded values of the words corresponding to each target video and based on a sliding-window mechanism, constructing positive samples from every pair of words within the window (i.e., words similar to the content of the target video) and negative samples from words not similar to the content of the target video, and training a three-layer neural network language model.
After model training is completed, word coding values corresponding to text data of each target video are input into the neural network language model to obtain corresponding video text content vectors, and specifically, hidden layer weight matrixes learned by the neural network language model are used as the video text content vectors of the target videos.
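The three-layer language model described above is essentially a word2vec-style skip-gram network whose hidden-layer weight matrix is reused as the text representation. A minimal sketch, assuming PyTorch and a hidden size of 128 (both assumptions):

```python
import torch
import torch.nn as nn

class ThreeLayerLanguageModel(nn.Module):
    """Sketch: one-hot input, hidden layer, output layer over the vocabulary.
    nn.Embedding is equivalent to multiplying a one-hot vector by the hidden weight matrix."""
    def __init__(self, vocab_size, hidden_dim=128):
        super().__init__()
        self.hidden = nn.Embedding(vocab_size, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, center_word_ids):
        # trained with cross-entropy on (center word, context word) pairs taken
        # from the sliding window, plus negative samples
        return self.output(self.hidden(center_word_ids))

def video_text_content_vector(model, word_ids):
    # average the learned hidden-layer rows of a video's words as its text content vector
    return model.hidden(torch.tensor(word_ids)).mean(dim=0)
```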
2) Extracting image frames of the target video, and performing content understanding on the target video by using an image content understanding model to obtain a corresponding image content vector;
the process of understanding the content of the video image in this embodiment is specifically shown in fig. 4, and mainly includes the following processes:
determining each target video in the training sample;
extracting image frames of each target video, wherein the target video can be divided into a plurality of sections on average according to the total duration of the target video, and key image frames are extracted within each section of duration;
Taking the manually labeled video semantic type label as a target, the extracted image frame information is input into an image content understanding model and the model parameters are fine-tuned; the model may be, but is not limited to, an EfficientNet network structure model. In order to verify the validity and accuracy of the content features of the image frames, during the adjustment of model parameters, the content features of videos with similar semantics are clustered into clusters according to the semantic labels of the videos, and similarity is computed between the vectors corresponding to the image frames of two videos within a cluster. When the vector similarity of the image frames in the same cluster is high, the content features of the video have high confidence.
And after the model parameters are adjusted, inputting the image frame into the image content understanding model to obtain the image content vector predicted by the model.
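A sketch of this fine-tuning and validation flow, assuming torchvision's EfficientNet-B0 as the backbone (the patent only says the model may be, but is not limited to, an EfficientNet structure):

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0

def build_image_understanding_model(num_semantic_labels):
    """Replace the classification head so the model predicts the manually labeled
    video semantic type labels during fine-tuning."""
    model = efficientnet_b0()
    model.classifier[1] = nn.Linear(model.classifier[1].in_features, num_semantic_labels)
    return model

def image_content_vector(model, frames):
    # pooled backbone features of the key image frames serve as the image content vector
    with torch.no_grad():
        feats = model.avgpool(model.features(frames)).flatten(1)   # [num_frames, feature_dim]
    return feats.mean(dim=0)

def cluster_confidence(vec_a, vec_b):
    # validity check: frame vectors of two semantically similar videos in the same
    # cluster should have a high cosine similarity
    return torch.cosine_similarity(vec_a, vec_b, dim=0)
```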
3) And extracting audio frames of the target audio, and performing content understanding on the target audio by using an audio frame content prediction model to obtain a corresponding audio content vector.
The process of understanding the audio content in this embodiment is specifically shown in fig. 5, and mainly includes the following processes:
determining each target audio in the training sample;
extracting audio frames of each target audio, wherein the target audio can be divided into a plurality of sections on average according to the total duration of the target audio, and key audio frames are extracted within each section of duration;
Taking the manually labeled audio semantic type label as a target, the extracted audio frame information is input into an audio frame content prediction model and the model parameters are fine-tuned; the model may be, but is not limited to, a YamNet network structure model. In order to verify the validity and accuracy of the content features of the audio frames, in the embodiment of the invention, during the adjustment of model parameters, the content features of audios with similar semantics are clustered into clusters according to the semantic labels of the audios, and similarity is computed between the vectors corresponding to the audio frames of two audios within a cluster. When the vector similarity of the audio frames in the same cluster is high, the content features of the audio have high confidence.
And after the model parameters are adjusted, inputting the audio frames into the audio frame content prediction model to obtain the audio content vector predicted by the model.
Step 202, inputting the training sample into an audio visual model, and performing feature extraction on the target audio to obtain a first feature representation of the target audio;
in implementation, the audio content vector corresponding to the target audio in the training sample may be input into the audio visualization model, and the feature extraction layer of the audio visualization model is used to perform feature extraction on the audio content vector to obtain the first feature representation of the target audio.
Step 203, extracting the characteristics of the user information and the relationship between the user history playing video and the target video to obtain user characteristics and user interest expression characteristics, extracting the characteristics of the target video to obtain a second characteristic representation, and performing combined processing on the user characteristics, the user interest expression characteristics and the second characteristic representation to obtain a third characteristic representation;
In implementation, the user information, the user history playing video and the target video in the training sample can be converted into corresponding vectors and input into the audio visualization model; a feature extraction layer of the audio visualization model is used to perform feature extraction on the vector corresponding to the user information to obtain user features, to perform feature extraction on the relationship between the user history playing video and the target video to obtain user interest expression features, and to perform feature extraction on the video content vector to obtain a second feature representation of the target video.
The user features, the user interest expression features and the second feature representation are jointly processed to obtain a third feature representation, which reflects the current user's degree of interest in the target video together with the content expression features of the target video.
Step 204, determining the similarity between the first feature representation and the third feature representation;
and step 205, updating parameters of the audio visual model according to the similarity and the relation label in the training sample.
With the audio visualization model training method provided by the embodiment of the invention, features are extracted from the target audio and the target video respectively during model training; meanwhile, features are extracted from the user information and from the relationship between the user history playing video and the target video to obtain user features and user interest expression features, and the user features, the user interest expression features and the second feature representation are jointly processed to obtain a third feature representation. The jointly processed features reflect the user's degree of interest in the target video, so the trained audio visualization model is able to compute audio-video association probabilities according to user interest. This not only satisfies the user's need for emotional interaction with visualized music, but also drives video collocation with the user's personalized preferences, meeting diversified user requirements and realizing personalized matching for hundreds of millions of users.
As an optional implementation manner, the training sample in the embodiment of the present invention further includes a knowledge graph; the knowledge graph is a graph constructed by defining entities as nodes, connecting the nodes with incidence relations through edges, determining the types of the edges according to the types of the incidence relations, and filling attribute information of the nodes according to the relevant information of the nodes, wherein the entities comprise audio and video.
The embodiment of the invention carries out feature extraction on the target audio/target video, and comprises the following steps:
determining a target node corresponding to the target audio in the knowledge graph, and determining a neighbor node establishing an association relation with the target node through an edge; determining a target node corresponding to a target video in the knowledge graph, and determining a neighbor node establishing an association relation with the target node through an edge;
aiming at a target audio, carrying out feature extraction on attribute information of a neighbor node of the target audio and an incidence relation corresponding to an edge of the neighbor node connected with the target node to obtain a relation expression feature of the target node and the neighbor node in the knowledge graph, namely the first feature expression of the target audio expressed in relation with the neighbor node in the knowledge graph;
and aiming at a target video, carrying out feature extraction on attribute information of a neighbor node of the target video and an incidence relation corresponding to an edge of the neighbor node connected with the target node to obtain a relation expression feature of the target node and the neighbor node in the knowledge graph, namely the second feature expression of the target video expressed in relation with the neighbor node in the knowledge graph.
As shown in fig. 6, in order to implement the audio visualization model training method, based on the source database, the software architecture may be divided into a content understanding module, a knowledge graph building module, and a model training module.
The source database comprises the audio data, the video data and the service data.
The content understanding module is mainly used for converting the audio data in the source database into corresponding audio content vectors and converting the video data in the source database into corresponding video content vectors.
The knowledge graph construction module is used for constructing the knowledge graph based on the source database and the vectors obtained by the content understanding module, fusing various kinds of knowledge by means of a complex network, and exploiting the natural advantage of links in associating knowledge to construct a complex heterogeneous graph network.
The model training module is mainly used for constructing a training sample, constructing a graph neural network adopted by the audio visualization model and training the constructed graph neural network by utilizing the training sample.
The manner in which the content understanding module obtains the audio content vector and the video content vector is described in the above embodiments and is not repeated here; detailed embodiments of the knowledge graph construction module and the model training module are given below.
Constructing a knowledge graph:
in the embodiment of the invention, the content understanding module is used to obtain the audio content vector of each audio and the video content vector of each video, and the knowledge graph is constructed mainly through the following steps:
1) Ontology design
An ontology is a collection of terms used to describe a domain; its organizational structure is hierarchical and it can serve as the skeleton and foundation of a knowledge base. An ontology acquires, describes and represents knowledge of the related field, provides a common understanding of that knowledge, and fixes the vocabulary commonly recognized in the field.
An ontology comprises the basic elements of entities, relations and attributes. Applied to the embodiment of the invention, the following basic elements are defined: entity types (including a video type and an audio type), entity attribute information, edges corresponding to different types of association relations, and rules for judging each type of association relation.
Further, entity types may also include other types associated with the video type or the audio type; entities mainly cover objectively existing objects such as single songs, videos and artists. The different types of association relations mainly describe the associations between single songs and videos, and may include, for example, single song-content-similar single song, video-content-similar video, single song-artist and video-artist relations; several of these are association relations between heterogeneous nodes. The entity attribute information may include, for example, the video occurrence area, video semantic tags, the video text representation, and the video content vector obtained in the foregoing embodiments.
2) Knowledge extraction
The purpose of knowledge extraction is mainly to perform entity extraction and relation extraction from data of different sources and structures. In the embodiment of the invention, entity extraction mainly completes the extraction of audio and video entities and the filling of their basic attributes to obtain the corresponding nodes. Relation extraction mainly analyzes the business data and computes the relations defined in the ontology using distributed computing, generating edges of the various association types.
In an embodiment, according to the defined entity type and the defined entity attribute information, entities of different entity types are extracted from a source database as nodes, and attribute information of the nodes is extracted from related information of the nodes. And determining whether the incidence relation exists between different nodes according to the rule for judging the incidence relation of each type, and connecting the different nodes by utilizing edges of corresponding types according to the type of the incidence relation when determining that the incidence relation exists.
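The following Python sketch illustrates this knowledge-extraction step under stated assumptions: the class and function names (Node, Edge, extract_nodes, extract_edges, the rule predicates) are illustrative and not identifiers from the patent. It only shows the general pattern of turning source records into typed, attributed nodes and connecting them with typed edges when a rule decides an association relation exists.

```python
# Illustrative sketch of knowledge extraction (names are assumptions, not from the patent).
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    entity_type: str              # e.g. "audio" or "video"
    attributes: dict = field(default_factory=dict)

@dataclass
class Edge:
    src: str
    dst: str
    relation_type: str            # e.g. "single song-artist", "video-content-similar video"

def extract_nodes(source_records, entity_type, attribute_keys):
    """Turn raw source-database records into typed nodes with filled attributes."""
    nodes = []
    for rec in source_records:
        attrs = {k: rec[k] for k in attribute_keys if k in rec}
        nodes.append(Node(node_id=rec["id"], entity_type=entity_type, attributes=attrs))
    return nodes

def extract_edges(nodes, relation_rules):
    """relation_rules maps a relation type to a predicate deciding whether two nodes are associated."""
    edges = []
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            for rel_type, rule in relation_rules.items():
                if rule(a, b):
                    edges.append(Edge(a.node_id, b.node_id, rel_type))
    return edges
```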
The data in the source database of the embodiment of the invention are rich in type and comprise audio and video data, text data and numerical data. Before knowledge extraction, these data can be uniformly converted into numerical data using deep learning techniques.
3) Knowledge fusion
The knowledge extraction step produces the attributed node data and the various types of edge data of the knowledge graph. Knowledge fusion then disambiguates and fuses the entities and relations from the multiple data sources, finally constructing a knowledge graph that can fully describe the relations between music and videos.
As an optional implementation manner, in the embodiment of the present invention, different tables are used to store the extracted nodes, the attribute information of the extracted nodes, the result of determining whether an association relationship exists, and the connection information of edges, respectively; and taking the extracted node as an index entry, and fusing the different tables to obtain the entry content of the node, wherein the entry content comprises the attribute information of the extracted node, the neighbor node associated with the extracted node, and the type of the association relationship between the neighbor node associated with the extracted node and the extracted node.
Through the fusion mode, all information related to each node can be integrated together, and when the node is used as an index, various relationships can be acquired.
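A minimal sketch of the fusion step described above, assuming the node table, attribute table and edge table are plain Python structures; the entry layout mirrors the description (attributes, neighbor nodes, and the type of the association relation to each neighbor), with the node id serving as the index entry.

```python
# Sketch only: table layouts and field names are assumptions for illustration.
def fuse_tables(node_ids, attribute_table, edge_table):
    entries = {nid: {"attributes": attribute_table.get(nid, {}),
                     "neighbors": []} for nid in node_ids}
    for src, dst, rel in edge_table:             # each edge: (src, dst, relation_type)
        entries[src]["neighbors"].append({"node": dst, "relation": rel})
        entries[dst]["neighbors"].append({"node": src, "relation": rel})
    return entries                               # keyed by node id (the index entry)
```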
A model training module:
the purpose of the model training module is, given the video viewing behavior sequence S(u) = {v1, v2, ..., vn} of a user u and the above knowledge graph that deeply represents the single song-video relationship, to learn a function F that predicts the probability that user u will play video v to completion under the collocation of single song s with video v.
1) Constructing training samples
A training sample is constructed based on the user's historical joint audio-video playing behavior: the audio played in one historical playing behavior of the user is taken as the target audio, the jointly played video is taken as the target video, and the user's playing behavior feedback during the joint playing of the target audio with the target video is obtained. Whether the relationship label indicates that the target audio and the target video are associated is then determined from this playing behavior feedback, and the user information and the videos played by the user within a historical period (the video viewing behavior sequence) are acquired, yielding a training sample. In implementation, the relation tag may be set to 1 when the video playing duration exceeds a set duration, and to 0 otherwise.
In an example of this embodiment, a training sample obtained from the user's joint audio-video playing behavior may be represented as <u, v, s, Useq, Yuvs>, where u denotes the user, v the target video and s the target audio. Yuvs is the real label of user u playing video v under background music s: Yuvs is 1 when the video playing duration exceeds 60 s, otherwise Yuvs is 0. Useq is the video viewing behavior sequence of user u, for example the id sequence of videos viewed by the user over the last 30 days.
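The sketch below shows how one such training sample could be assembled from a joint playing record, following the labeling rule above (label 1 when the joint play exceeds the set duration, here 60 s). Field and function names are illustrative assumptions.

```python
# Illustrative assembly of a training sample <u, v, s, Useq, Yuvs>.
from dataclasses import dataclass
from typing import List

@dataclass
class TrainingSample:
    user_id: str                 # u
    target_video: str            # v
    target_audio: str            # s
    watch_sequence: List[str]    # Useq: ids of videos watched over the last 30 days
    label: int                   # Yuvs

def build_sample(user_id, video_id, audio_id, watch_sequence, play_seconds, threshold=60):
    label = 1 if play_seconds > threshold else 0   # relation label from playing feedback
    return TrainingSample(user_id, video_id, audio_id, watch_sequence, label)
```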
The more training samples there are, the higher the confidence of the learned function F that predicts whether user u will play video v to completion under the collocation of single song s with video v; the number of training samples can be determined according to requirements.
2) Building a graph neural network
As shown in fig. 7, the graph neural network constructed according to the embodiment of the present invention mainly includes a data layer, a conversion layer, a double-tower layer, and an output layer, and each neural network layer is described in detail below.
2.1) data layer
The data layer is used for receiving the vectors input into the graph neural network, specifically the target audio, the target video, the constructed knowledge graph and the video viewing behavior sequence. The knowledge graph is a complex network containing audio and video; it carries rich content understanding information for videos and single songs, and the links between videos and single songs are established based on business and content.
The personalized recommendation mode makes capturing the user's intention behavior important. If only the matching of audio and video were considered, all users playing the same single song would easily end up watching the same video. The user-related features mainly cover two aspects: the user features corresponding to the user information, and the user interest expression features corresponding to the video viewing behavior sequence. Adding user-related features guides the model, under the constraint of strong audio-video content association, to preferentially recommend videos the user prefers, so that music visualization is, to a certain extent, driven with the user at the center. The user features mainly include the user's age, region, preference for video styles and preference for video languages. The user interest expression features contain rich information characterizing the user's interests.
2.2) conversion layer
The conversion layer is used for extracting the features of the vectors input into the graph neural network and is a feature extraction layer.
In the embodiment of the invention, the conversion layer extracts features from the user information to obtain user features, and extracts features from the relation between the user's historical playing videos and the target video (the video viewing behavior sequence) to obtain user interest expression features. Based on the knowledge graph, it extracts features from the attribute information of the neighbor nodes of the audio target node and from the association relations corresponding to the edges connecting those neighbor nodes to the audio target node, obtaining the first feature representation expressing the relation of the audio target node with its neighbor nodes in the knowledge graph; it likewise extracts features from the attribute information of the neighbor nodes of the video target node and from the association relations corresponding to the edges connecting those neighbor nodes to the video target node, obtaining the second feature representation expressing the relation of the video target node with its neighbor nodes in the knowledge graph. The user features, the user interest expression features and the second feature representation are then jointly processed to obtain the third feature representation.
The conversion layer may specifically include a Transformer module and GAT (Graph Attention Networks) modules (a single-song GAT module and a video GAT module); each module is described in detail below.
2.2.1) Transformer Module
The Transformer module is specifically used for extracting features of a relation (video watching behavior sequence) between a user historical playing video and a target video to obtain user interest expression features.
The Transformer module adopts the Attention mechanism; traditional CNNs and RNNs are abandoned, and the whole network structure is built entirely on Attention. As shown in fig. 7, the core components of the Transformer module are a Multi-Head Self-Attention part (hereinafter referred to as Self-Attention) and a Feed-Forward neural network, together with a residual connection and normalization part (Add & Norm); the Self-Attention part is the most essential. Self-Attention can capture semantic features between videos in the same sequence: it is a mechanism in which the sequence attends to itself to achieve a better feature representation. Self-Attention computes dependencies directly, regardless of the distance between videos, and easily captures long-range interdependencies within the sequence, thereby learning its internal structure. The Transformer module in the embodiment of the invention applies Self-Attention to learn the sequence information of each video item in the video viewing behavior sequence, and captures the relation between the user's historical playing videos and the target video to obtain the user's behavioral interest expression.
Self-Attention adopts a QKV model: Q is the current question, K are the questions stored in historical memory and V their answers; the model computes which K is most similar to Q and synthesizes the answer to the current question from the V corresponding to the similar K. Multi-Head Self-Attention further projects Q, K and V through h different linear transformations.
Applied to the embodiment of the invention, Self-Attention is explained by taking the representation of the user's video behavior sequence information as an example. The video viewing behavior sequence is defined as F_v = {v1, v2, ..., vn}, and Q, K, V are defined as follows:
Q = W_Q·F_v, K = W_K·F_v, V = W_V·F_v
where W_Q, W_K and W_V are the embedding feature conversion matrices for the video id embeddings. Self-Attention is computed by scaled dot-product, with the following formula:
Attention(Q, K, V) = softmax(QK^T / √d_k)·V
Finally, the MultiHead(Q, K, V) output is taken as the final expression of the Transformer layer, giving the user interest expression features.
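The following PyTorch sketch shows the Self-Attention computation described above over the watch-sequence embeddings F_v: the sequence is projected to Q, K, V and aggregated by scaled dot-product attention. The single-head simplification, mean pooling and dimensions are assumptions for illustration; the patent describes a multi-head Transformer block.

```python
# Simplified single-head Self-Attention over the video viewing behavior sequence (illustrative only).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionPooling(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.w_q = nn.Linear(embed_dim, embed_dim, bias=False)   # W_Q
        self.w_k = nn.Linear(embed_dim, embed_dim, bias=False)   # W_K
        self.w_v = nn.Linear(embed_dim, embed_dim, bias=False)   # W_V

    def forward(self, seq_embeddings):             # (batch, seq_len, embed_dim)
        q = self.w_q(seq_embeddings)
        k = self.w_k(seq_embeddings)
        v = self.w_v(seq_embeddings)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # scaled dot-product
        attn = F.softmax(scores, dim=-1)
        context = attn @ v                          # (batch, seq_len, embed_dim)
        return context.mean(dim=1)                  # pooled user-interest expression
```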
2.2.2) GAT module
Aiming at a target node corresponding to audio or video, a conversion layer determines isomorphic neighbor nodes belonging to the same entity type as the target node, and performs feature extraction on the association relation corresponding to the attribute information of the isomorphic neighbor nodes and the edges of the isomorphic neighbor nodes connected with the target node by utilizing a first feature extraction layer to obtain first relation expression features of the target node and the isomorphic neighbor nodes in the knowledge graph;
aiming at a target node corresponding to audio or video, a conversion layer determines heterogeneous neighbor nodes belonging to different entity types with the target node, and performs feature extraction on the attribute information of the heterogeneous neighbor nodes and the association relation corresponding to the edges of the heterogeneous neighbor nodes connected with the target node by using a first feature extraction layer to obtain a second relation expression feature of the target node connected with the heterogeneous neighbor in the knowledge graph;
and the first relation expression feature and the second relation expression feature are converted into the same vector space using a second feature extraction layer, obtaining the relation expression feature of the target node with its neighbor nodes in the knowledge graph, specifically the first feature representation expressing the relation of the audio target node with its neighbor nodes, or the second feature representation expressing the relation of the video target node with its neighbor nodes.
Because different types of nodes differ greatly in both feature level and network topology, the embodiment of the invention obtains the first and second feature representations with GAT modules, namely a single-song GAT module and a video GAT module. Their main goal is to enrich the content expression of single songs and videos based on the knowledge graph and to learn the relevance of audio content and video content through a Transformer graph convolution function. The single-song GAT module extracts isomorphic and heterogeneous neighbors of the target audio node from the knowledge graph and aggregates their information with a Transformer graph convolution kernel, obtaining the first and second relation expression features of the target audio node with its neighbor nodes in the knowledge graph; the video GAT module likewise extracts isomorphic and heterogeneous neighbors of the target video node and aggregates their information with a Transformer graph convolution kernel, obtaining the first and second relation expression features of the target video node with its neighbor nodes in the knowledge graph.
The single-song GAT module (and likewise the video GAT module) serves as the first feature extraction layer; to convert the first and second relation expression features into the same vector space, a second feature extraction layer (not shown in the figure, see fig. 8 in particular) is further used to perform feature extraction and obtain the relation expression feature of the audio/video target node with its neighbor nodes in the knowledge graph.
Fig. 8 is a schematic structural diagram of the GAT module in the embodiment of the invention. The GAT module uses a two-layer Transformer structure to aggregate information from the isomorphic neighbors and the heterogeneous neighbors respectively (characterizing the corresponding attribute information and the association relations of the edges connecting the neighbor nodes to the target node); the specific structure of the Transformer layer has been described above and is not repeated here. The first Transformer layer is divided into two parts: one aggregates information from the target node and its first-order isomorphic neighbor nodes to obtain the first relation expression feature, and the other aggregates information from the target node and its heterogeneous neighbor nodes to obtain the second relation expression feature. Both the single-song GAT and the video GAT contain such a two-part first-layer Transformer, which is connected to a second-layer Transformer. The second Transformer layer mainly performs feature extraction on the isomorphic aggregation information and the heterogeneous aggregation information to obtain the final relation expression feature of the target audio node or target video node with its neighbor nodes in the knowledge graph.
As an optional implementation, for the single-song GAT module, the target audio node, its isomorphic neighbors and its heterogeneous neighbors, determined from the target audio and the knowledge graph, are used as the input of the single-song GAT; for the video GAT module, the target video node, its isomorphic neighbors and its heterogeneous neighbors, determined from the target video and the knowledge graph, are used as the input of the video GAT.
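The sketch below illustrates the two-level aggregation of the GAT module, reusing the SelfAttentionPooling block from the earlier sketch as a stand-in for the Transformer graph convolution kernel: isomorphic and heterogeneous neighbors are aggregated separately by the first layer and then fused by the second layer into one relation expression feature. Class and argument names are illustrative assumptions, not the patent's own identifiers.

```python
# Illustrative GAT-module aggregation (assumes SelfAttentionPooling from the previous sketch).
import torch
import torch.nn as nn

class GATModule(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.homo_agg = SelfAttentionPooling(embed_dim)    # first layer, isomorphic part
        self.hetero_agg = SelfAttentionPooling(embed_dim)  # first layer, heterogeneous part
        self.fuse = SelfAttentionPooling(embed_dim)        # second layer

    def forward(self, target, homo_neighbors, hetero_neighbors):
        # target: (batch, dim); neighbors: (batch, n, dim)
        homo_in = torch.cat([target.unsqueeze(1), homo_neighbors], dim=1)
        hetero_in = torch.cat([target.unsqueeze(1), hetero_neighbors], dim=1)
        h_homo = self.homo_agg(homo_in)            # first relation expression feature
        h_hetero = self.hetero_agg(hetero_in)      # second relation expression feature
        stacked = torch.stack([h_homo, h_hetero], dim=1)
        return self.fuse(stacked)                  # fused relation expression feature
```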
As an optional embodiment, the determining a neighbor node that establishes an association relationship with the target node through an edge includes:
and determining neighbor nodes establishing an association relationship with the target node through edges in the set hop count in the knowledge graph, wherein the hop count is the number of the edges required for connecting the neighbor nodes to the target node. For the audio target node, isomorphic neighbor nodes and heterogeneous neighbor nodes within a set hop count are specifically determined, and for the video target node, isomorphic neighbor nodes and heterogeneous neighbor nodes within the set hop count are determined.
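A minimal sketch of collecting neighbor nodes within a set hop count, assuming the fused per-node entries from the knowledge-graph sketch above (each entry lists its neighbors); the breadth-first traversal counts how many edges separate a neighbor from the target node.

```python
# Illustrative breadth-first collection of neighbors within max_hops edges of the target node.
from collections import deque

def neighbors_within_hops(entries, target_id, max_hops):
    seen, result = {target_id}, []
    queue = deque([(target_id, 0)])
    while queue:
        node_id, hops = queue.popleft()
        if hops == max_hops:
            continue
        for nb in entries[node_id]["neighbors"]:
            if nb["node"] not in seen:
                seen.add(nb["node"])
                result.append(nb["node"])
                queue.append((nb["node"], hops + 1))
    return result
```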
2.3) double-tower layer (regularization processing)
The two-layer Transformer can also be understood as two layers of Self-Attention. After the first feature representation of the audio target node and the second feature representation of the video target node are obtained with the two-layer Self-Attention of the conversion layer, the second feature representation, the user features and the user interest expression features are jointly processed to obtain the third feature representation; specifically, the second feature representation, the user features and the user interest expression features can be concatenated as vectors.
In the embodiment of the invention, the regularization processing layer regularizes the first feature representation and the third feature representation and determines the similarity between them. The regularization processing layer adopts a double-tower structure, comprising a single-song double-tower layer, a video double-tower layer, and an activation function (not shown in the figure) connecting them. The first feature representation is input into the single-song double-tower layer and regularized by it; the third feature representation is input into the video double-tower layer and regularized by it; the similarity between the regularized first feature representation and the regularized third feature representation is then determined.
As an alternative implementation, determining the similarity between the first feature representation and the third feature representation includes:
utilizing three layers of LeakyReLU in the single-song double-tower layer to carry out regularization processing on the first feature representation;
utilizing three layers of LeakyReLU in the video double-tower layer to carry out regularization processing on the third feature representation;
and determining the similarity between the regularized first feature representation and the regularized third feature representation by a sigmoid function connecting the single-song double-tower layer and the video double-tower layer.
When the sigmoid function is used to determine the similarity between the regularized first and third feature representations, the first and third feature representations of a positive sample (a training sample with relation label 1) are expected to be closer in distance, and the opposite for a negative sample. The final fitting function of the algorithm is defined as:
ŷ_uvs = nn(L_uv · L_s)
where nn is the Sigmoid activation function, L_uv is the third feature representation, and L_s is the first feature representation.
The three LeakyReLU layers perform data transformation in the neural network, preventing vanishing and exploding gradients and ensuring network convergence.
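The sketch below shows one way the double-tower layer could be realized: three LeakyReLU-activated layers per tower and a sigmoid over the inner product of the two regularized representations, matching the fitting function above. Hidden sizes and class names are illustrative assumptions.

```python
# Illustrative double-tower layer: single-song tower, video tower, sigmoid similarity.
import torch
import torch.nn as nn

def make_tower(in_dim, hidden_dim):
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim), nn.LeakyReLU(),
        nn.Linear(hidden_dim, hidden_dim), nn.LeakyReLU(),
        nn.Linear(hidden_dim, hidden_dim), nn.LeakyReLU(),
    )

class DoubleTower(nn.Module):
    def __init__(self, audio_dim, video_dim, hidden_dim=128):
        super().__init__()
        self.audio_tower = make_tower(audio_dim, hidden_dim)   # single-song tower
        self.video_tower = make_tower(video_dim, hidden_dim)   # video tower

    def forward(self, first_feature, third_feature):
        l_s = self.audio_tower(first_feature)      # regularized first feature representation
        l_uv = self.video_tower(third_feature)     # regularized third feature representation
        return torch.sigmoid((l_s * l_uv).sum(dim=-1))   # predicted association probability
```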
The embodiment of the invention applies two-layer Self-Attention convolution on the graph network to obtain representations of the target audio and the target video in the same vector space, computes the video/single-song correlation by vector similarity, adds the user behavior sequence features to capture the user's behavioral intention, and finally realizes the joint recommendation of user, single song and video.
2.4) output layer
In order to consider both the audio-video matching degree and the user's acceptance of the joint audio-video distribution, the loss function in the embodiment of the invention is built from a supervised CTR loss and an unsupervised graph-node expression similarity.
A first loss function, the supervised CTR loss, is determined according to the similarity and the relation label in the training sample;
and a second loss function, the graph-node expression similarity, is determined according to the similarity and the similarity between the first and third feature representations fitted from the knowledge graph, where the more neighbor nodes the target node of the target audio and the target node of the target video share, the greater the fitted similarity. Graph-node expression similarity can be understood as pushing the expressions of more similar nodes closer together: the closer the target node and the video node are in the knowledge graph, the more similar their fitted node expressions should be.
The supervised CTR loss is defined as:
L_ctr = -Σ_(u,v,s) [ Y_uvs·log ŷ_uvs + (1 - Y_uvs)·log(1 - ŷ_uvs) ]
where Y_uvs is the relation label in the training sample and ŷ_uvs is the label predicted by the model.
The graph-node expression similarity loss encourages adjacent nodes to have similar expressions; the specific loss is defined as:
L_node_similarity = -Σ_i [ log σ(H_i·H_j) + log σ(-H_i·H_k) ]
where H_i is the relation expression feature of the target node (the video or audio of the training-sample triplet (u, v, s)) output by the single-song/video GAT module, H_j is the relation expression feature of a first-order isomorphic neighbor of the target node, and H_k is the relation expression feature of a corresponding negative-sampling node in the knowledge graph; a negative-sampling node can be defined as a node more than the set hop count away from the target node.
The final loss of the graph neural network is:
L = L_ctr + α·L_node_similarity + β·‖W‖²
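A sketch of how this combined loss could be computed, assuming binary cross-entropy for the supervised CTR part, a negative-sampling form for the graph-node similarity part, and an L2 term over the model weights for ‖W‖²; these exact functional forms, and the names used, are assumptions consistent with the description above rather than the patent's definitive formulas.

```python
# Illustrative combined loss: CTR loss + node-similarity loss + L2 regularization.
import torch
import torch.nn.functional as F

def total_loss(pred, label, h_i, h_j, h_k, weights, alpha=0.1, beta=1e-4):
    l_ctr = F.binary_cross_entropy(pred, label.float())   # supervised CTR loss
    pos = torch.sigmoid((h_i * h_j).sum(dim=-1))           # target vs. first-order isomorphic neighbor
    neg = torch.sigmoid(-(h_i * h_k).sum(dim=-1))          # target vs. negative-sampled node
    l_node = -(torch.log(pos + 1e-8) + torch.log(neg + 1e-8)).mean()
    l2 = sum((w ** 2).sum() for w in weights)              # ||W||^2 regularization term
    return l_ctr + alpha * l_node + beta * l2
```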
3) model training
Model training is performed in a GPU environment based on the graph neural network built above, to learn the final model parameters. Fig. 9 shows the specific model training process, which comprises the following steps:
acquiring a training sample set and a constructed knowledge graph, wherein the training sample set comprises a plurality of training samples, and the training samples comprise user information, a user history playing video, a target audio, a target video and a relation label for representing whether the target audio is associated with the target video;
initializing the model parameters of the graph neural network built in the above manner;
judging whether the training end condition is met; specifically, determining whether the loss values of the first loss function and the second loss function meet the requirement, or whether the number of used training samples has reached a set number;
if training is not finished, inputting a number of training samples that have not yet participated in training, together with the knowledge graph, into the graph neural network; specifically, a set number of samples are taken in turn from the training sample set and input into the graph neural network;
the isomorphic and heterogeneous neighbor nodes of the target nodes corresponding to the target audio/video in each sample are input into the graph neural network from the knowledge graph, and after processing by the conversion layer and the double-tower layer of the graph neural network they reach the output layer;
at the output layer, the association probability of the target video and the target audio in the sample is calculated by forward propagation; the first and second loss functions are then evaluated, and back propagation is used to compute the gradients of the graph neural network parameters and update them according to the loss values of the first and second loss functions.
Once the training end condition is met, the audio visual model is obtained and training ends.
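A condensed training-loop sketch matching the process of fig. 9: batches of training samples together with their knowledge-graph neighbors are fed through the network, the combined loss is evaluated, and the parameters are updated until the end condition is met. The model interface, data loader and the total_loss helper are assumptions carried over from the earlier sketches, not the patent's own code.

```python
# Illustrative training loop (assumes the total_loss sketch above and a model returning
# the predicted probability plus the H_i, H_j, H_k relation expression features).
import torch

def train(model, data_loader, optimizer, max_steps=10000):
    step = 0
    for batch in data_loader:                       # samples plus their graph neighbors
        pred, h_i, h_j, h_k = model(batch)          # forward pass through conversion + double-tower layers
        loss = total_loss(pred, batch["label"], h_i, h_j, h_k,
                          [p for p in model.parameters() if p.dim() > 1])
        optimizer.zero_grad()
        loss.backward()                             # back-propagate gradients
        optimizer.step()                            # update model parameters
        step += 1
        if step >= max_steps:                       # simplified end-of-training condition
            break
    return model
```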
To ensure that user interest preferences are captured in time, the knowledge graph can be updated at a set time interval, which in turn triggers an update of the offline-trained model.
After offline training of the audio visual model is completed, the audio/video recommendation stage can be entered for online recommendation. As shown in fig. 6, the recommendation module is mainly used for model prediction, recommendation and distribution, so that the calculation and distribution of the user-preferred music-video matching relationships can be completed in an online environment. An embodiment of the present invention provides an audio visualization method, as shown in fig. 10; the method includes:
1001, responding to an audio and video collocation request, and acquiring user information, candidate audio and candidate video;
for each single-song audio, a certain number of videos, e.g. 60, may be extracted as the user's candidate video set under that single-song audio.
Step 1002, inputting the user information, the candidate audio and the candidate video into a trained audio visual model;
step 1003, performing feature extraction on the candidate audio by using the audio visualization model to obtain a first feature representation of the candidate audio;
step 1004, performing feature extraction on the user information by using the audio visualization model to obtain user features, performing feature extraction on the candidate video to obtain a second feature representation, and combining the user features and the second feature representation to obtain a third feature representation;
step 1005, determining the similarity between the first feature representation and the third feature representation, and predicting the probability whether each candidate audio is associated with each candidate video according to the similarity;
the candidate videos can be scored and sorted for a single-song video according to the probability of whether the single-song video is associated with each candidate video, wherein the higher the associated probability value is, the higher the score is.
Step 1006, selecting the candidate audio and the candidate video with the probability value larger than the preset value for joint playing according to the determined probability of whether each candidate audio is associated with each candidate video.
Specifically, the highest-scoring candidate video can be selected for a single-song audio, or, conversely, the top-ranked single song can be selected as the final background music for a video. Because different users have different song-listening preferences, the same video can thus be matched with personalized background music.
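The sketch below illustrates this online scoring step: every candidate video for one single-song audio is scored with the trained model, the candidates are sorted by predicted association probability, and those above the threshold are kept. The predict helper and parameter names are assumptions for illustration.

```python
# Illustrative online scoring and ranking of candidate videos for one single-song audio.
def recommend_videos(model, user_info, audio_id, candidate_videos, threshold=0.5):
    scored = []
    for video_id in candidate_videos:
        prob = model.predict(user_info, audio_id, video_id)   # assumed inference helper
        if prob > threshold:                                  # keep only probable matches
            scored.append((video_id, prob))
    scored.sort(key=lambda x: x[1], reverse=True)             # higher probability, higher rank
    return scored
```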
Music visualization is a complex cross-modal interaction. Related technologies mainly select background music by establishing mapping rules between video type labels and single-song styles, ignoring the relevance of videos and single songs across richer attributes. Compared with such techniques, the audio visualization method provided by the embodiment of the invention constructs a complex knowledge graph fusing rich content information of audio, video and other entities, and fully considers the relevance of videos and single songs across richer attributes. It solves the single song-video background-music matching problem using deep learning, adds basic user features and user sequence features, and comprehensively considers the user's listening preferences. It combines audio feature recognition, speech processing and other multimedia technologies to extract music features, performs content understanding on existing BGM (Background music)-free videos, and builds a cross-domain learning model over users, videos and single songs according to the content, emotion and rhythm expressed by the music and the videos. In this way it realizes efficient, personalized distribution of multi-modal resources, meets the emotional interaction requirements between users and visualized music, and achieves personalized matching for hundreds of millions of users.
Automatic collocation of video background music in the related art is mainly realized by learning the matching patterns of other videos of the same type that already have background music, so it handles new types of videos poorly and cannot carry out effective transfer learning. The embodiment of the invention instead learns the complex matching function between single songs and videos over multi-dimensional attributes such as style, rhythm, theme and mood, and can automatically match any new single song and new video well, greatly saving labor cost, lowering the production threshold of video background music and substantially improving work efficiency.
Exemplary device
Having described the methods of the exemplary embodiments of the present invention, reference is next made to FIG. 11, which illustrates an audio visualization model training apparatus in accordance with an exemplary embodiment of the present invention.
As shown in fig. 11, based on the same inventive concept, an embodiment of the present invention further provides an audio visualization model training apparatus, including:
a sample obtaining module 1101, configured to obtain a training sample, where the training sample includes user information, a user history playing video, a target audio, a target video, and a relationship label indicating whether the target audio and the target video are associated;
a first feature extraction module 1102, configured to input the training sample into an audio visualization model, perform feature extraction on the target audio, and obtain a first feature representation of the target audio;
a third feature extraction module 1103, configured to perform feature extraction on the relationship between the user information and the user history playing video and the target video to obtain a user feature and a user interest expression feature, perform feature extraction on the target video to obtain a second feature representation, and perform joint processing on the user feature, the user interest expression feature and the second feature representation to obtain a third feature representation;
a similarity determination module 1104 for determining a similarity between the first feature representation and the third feature representation;
and a parameter updating module 1105, configured to update the parameters of the audio visualization model according to the similarity and the relationship labels in the training samples.
As an optional embodiment, the training sample further comprises a knowledge graph; the first feature extraction module performs feature extraction on the target audio/the third feature extraction module performs feature extraction on the target video, and the feature extraction method comprises the following steps:
determining a target node corresponding to the target audio/target video in the knowledge graph, and determining a neighbor node establishing an association relation with the target node through an edge;
extracting the characteristics of the attribute information of the neighbor node and the incidence relation corresponding to the edge of the neighbor node connected with the target node to obtain the relation expression characteristics of the target node and the neighbor node in the knowledge graph;
the relation of the target node in the knowledge graph with the neighbor nodes expresses characteristics, including the first characteristic representation of the relation expression of the target audio in the knowledge graph with the neighbor nodes or the second characteristic representation of the relation expression of the target video in the knowledge graph with the neighbor nodes;
the knowledge graph is a graph constructed by defining entities as nodes, connecting the nodes with incidence relation through edges, determining the types of the edges according to the types of the incidence relation and filling attribute information of the nodes according to the relevant information of the nodes, wherein the entities comprise audio and video.
As an optional implementation, the apparatus further comprises:
the knowledge graph building module is used for building the knowledge graph in the following way:
defining entity types, entity attribute information, edges corresponding to different types of incidence relations and rules for judging the incidence relations of the types, wherein the entity types comprise video types and audio types;
according to the defined entity type and the entity attribute information, extracting entities with different entity types from a source database as nodes, and extracting the attribute information of the nodes from the related information of the nodes;
and determining whether the incidence relation exists between different nodes according to the rule for judging the incidence relation of each type, and connecting the different nodes by utilizing edges of corresponding types according to the type of the incidence relation when determining that the incidence relation exists.
As an optional implementation, the apparatus further comprises:
the storage module is used for respectively storing the extracted nodes, the attribute information of the extracted nodes, the result of whether the determined incidence relation exists and the connection information of the edges by using different tables;
and the fusion module is used for fusing the different tables to obtain the table entry content of the node by taking the extracted node as an index entry, wherein the table entry content comprises the attribute information of the extracted node, the neighbor node associated with the extracted node, and the type of the association relationship between the neighbor node associated with the extracted node and the extracted node.
As an optional implementation manner, the feature extraction performed by the first feature extraction module/the third feature extraction module on the attribute information of the neighbor node and the association relationship corresponding to the edge of the neighbor node connected to the target node to obtain the relationship expression feature of the target node and the neighbor node in the knowledge graph includes:
determining isomorphic neighbor nodes belonging to the same entity type as the target node, and performing feature extraction on the association relationship corresponding to the attribute information of the isomorphic neighbor nodes and the edges of the isomorphic neighbor nodes connected with the target node by utilizing a first feature extraction layer to obtain first relationship expression features of the target node and the isomorphic neighbor nodes in the knowledge graph;
determining heterogeneous neighbor nodes belonging to different entity types from the target node, and performing feature extraction on the attribute information of the heterogeneous neighbor nodes and the incidence relation corresponding to the edges of the heterogeneous neighbor nodes connected with the target node by using a first feature extraction layer to obtain a second relation expression feature of the target node connected with the heterogeneous neighbor in the knowledge graph;
and converting the first relational expression characteristic and the second relational expression characteristic into the same vector space by using a second characteristic extraction layer to obtain the relational expression characteristic of the target node and the neighbor node in the knowledge graph.
As an optional implementation, the similarity determination module determines a similarity between the first feature representation and the third feature representation, including:
inputting the first feature representation into a single-song double-tower layer, and utilizing the single-song double-tower layer to carry out regularization processing on the first feature representation;
inputting the third feature representation into a video double-tower layer, and utilizing the video double-tower layer to carry out regularization processing on the third feature representation;
determining a similarity between the regularized first feature representation and the third feature representation.
As an optional implementation, the similarity determination module determines a similarity between the first feature representation and the third feature representation, including:
utilizing three layers of LeakyReLU in the single-song double-tower layer to carry out regularization processing on the first feature representation;
utilizing three layers of LeakyReLU in the video double-tower layer to carry out regularization processing on the third feature representation;
and determining the similarity between the regularized first feature representation and the regularized third feature representation by a sigmoid function connecting the single-song double-tower layer and the video double-tower layer.
As an optional implementation manner, the updating the parameters of the audio visualization model by the parameter updating module according to the similarity and the relationship label in the training sample includes:
determining a first loss function according to the similarity and a relation label in the training sample;
determining a second loss function according to the similarity and the similarity between the first feature representation and the third feature representation which are fitted according to the knowledge graph, wherein the greater the number of shared neighbor nodes of the target node corresponding to the target audio and the target node corresponding to the target video is, the greater the fitted similarity is;
and updating the parameters of the audio visual model according to the first loss function and the second loss function.
As an optional implementation manner, the determining, by the first feature extraction module/the third feature extraction module, a neighbor node that establishes an association relationship with the target node through an edge includes:
determining neighbor nodes establishing an association relationship with the target node through edges within a set hop count in the knowledge graph;
the hop count is the number of edges required to connect to the target node starting from a neighbor node.
As an optional implementation manner, the obtaining, by the sample obtaining module, a relationship label in a training sample includes:
acquiring play behavior feedback of a user in the process of matching a target audio with a target video for playing;
and determining whether the relationship label representing the target audio and the target video is associated or not according to the play behavior feedback.
As an optional implementation, the apparatus further comprises:
a vector conversion module for performing, prior to inputting the training samples into an audio visualization model:
semantic understanding is carried out on the text data of the target video by utilizing a video content understanding model to obtain a corresponding video text content vector;
extracting image frames of the target video, and performing content understanding on the target video by using an image content understanding model to obtain a corresponding image content vector;
and extracting audio frames of the target audio, and performing content understanding on the target audio by using an audio frame content prediction model to obtain a corresponding audio content vector.
As shown in fig. 12, based on the same inventive concept, an embodiment of the present invention further provides an audio visualization apparatus, including:
the information acquisition module 1201 is used for responding to the audio and video collocation request and acquiring user information, candidate audio and candidate video;
a model input module 1202, configured to input the user information, the candidate audio, and the candidate video into the audio visualization model trained by the method described in embodiment 1 above;
a first feature extraction module 1203, configured to perform feature extraction on the candidate audio by using the audio visualization model, so as to obtain a first feature representation of the candidate audio;
a third feature extraction module 1204, configured to perform feature extraction on the user information by using the audio visualization model to obtain a user feature, perform feature extraction on the candidate video to obtain a second feature representation, and combine the user feature and the second feature representation to obtain a third feature representation;
a probability determination module 1205 for determining a similarity between the first feature representation and the third feature representation, and predicting a probability of whether each candidate audio is associated with each candidate video according to the similarity;
and the audio and video collocation module 1206 is used for selecting the candidate audio and the candidate video with the probability value larger than the preset value to carry out combined playing according to the determined probability of whether each candidate audio is associated with each candidate video.
The audio visualization model training apparatus 130 according to this embodiment of the present invention is described below with reference to fig. 13. The audio visualization model training apparatus shown in fig. 13 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.
As shown in FIG. 13, the audio visualization model training device 130 may take the form of a general purpose computing device, which may be a terminal device, for example. The components of the audio visualization model training device 130 may include, but are not limited to: the at least one processor 131, the at least one memory 132, and a bus 133 that connects the various system components (including the memory 132 and the processor 131). The processor is configured to execute the instructions to implement the audio visualization method described in the above exemplary method.
Bus 133 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus architectures.
The memory 132 may include readable media in the form of volatile memory, such as Random Access Memory (RAM)1321 and/or cache memory 1322, and may further include Read Only Memory (ROM) 1323. The memory is used for storing processor executable instructions, and the processor is configured to execute the instructions to implement the audio visualization model training method in the above-mentioned embodiments.
Memory 132 may also include programs/utilities 1325 having a set (at least one) of program modules 1324, such program modules 1324 including but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The audio visualization model training device 130 may also communicate with one or more external devices 134 (e.g., keyboard, pointing device, etc.), with one or more devices that enable a user to interact with the audio visualization model training device 130, and/or with any devices (e.g., router, modem, etc.) that enable the audio visualization model training device 130 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 135. Also, the audio visualization model training device 130 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via a network adapter 136. As shown in FIG. 13, the network adapter 136 communicates with the other modules of the audio visualization model training device 130 via the bus 133. It should be appreciated that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the audio visualization model training device 130, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
An audio visualization device 140 according to this embodiment of the invention is described below with reference to fig. 14. The audio visualization device shown in fig. 14 is only an example, and should not bring any limitations to the function and the scope of use of the embodiments of the present invention.
As shown in fig. 14, the audio visualization device 140 may be in the form of a general purpose computing device, which may be a terminal device, for example. The components of the audio visualization device 140 may include, but are not limited to: the at least one processor 141, the at least one memory 142, and a bus 143 that couples various system components including the memory 142 and the processor 141. The memory is used for storing processor executable instructions, the processor being configured to execute the instructions to implement the audio visualization method in the above embodiments.
Bus 143 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus architectures.
The memory 142 may include readable media in the form of volatile memory, such as Random Access Memory (RAM)1421 and/or cache memory 1422, and may further include Read Only Memory (ROM) 1423.
Memory 142 may also include a program/utility 1425 having a set (at least one) of program modules 1424, such program modules 1424 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The audio visualization device 140 may also communicate with one or more external devices 144 (e.g., keyboard, pointing device, etc.), may also communicate with one or more devices that enable a user to interact with the audio visualization device 140, and/or any devices (e.g., router, modem, etc.) that enable the audio visualization device 140 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 145. Also, the audio visualization device 140 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 146. As shown in fig. 14, the network adapter 146 communicates with the other modules of the audio visualization device 140 over a bus 143. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the audio visualization device 140, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Exemplary program product
In some possible embodiments, the various aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps of the audio visualization model training method according to the various exemplary embodiments of the invention described in the "exemplary methods" section above of this description, or to perform the steps of the audio visualization method according to the various exemplary embodiments of the invention, when the program product is run on the terminal device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As shown in fig. 15, a program product 150 for audio visualization model training or audio visualization is depicted, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer, in accordance with an embodiment of the present invention. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device over any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., over the internet using an internet service provider).
It should be noted that although several modules or sub-modules of the system are mentioned in the above detailed description, such partitioning is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the modules described above may be embodied in one module according to embodiments of the invention. Conversely, the features and functions of one module described above may be further divided into embodiments by a plurality of modules.
Moreover, although the operations of the modules of the system of the present invention are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain operations may be omitted, operations combined into one operation execution, and/or operations broken down into multiple operation executions.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor is the division of aspects, which is for convenience only as the features in such aspects may not be combined to benefit. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (27)

1. A method for training an audio visualization model, the method comprising:
acquiring a training sample, wherein the training sample comprises user information, a user history playing video, a target audio, a target video and a relation label representing whether the target audio and the target video are associated or not;
inputting the training sample into an audio visual model, and performing feature extraction on the target audio to obtain a first feature representation of the target audio;
performing feature extraction on the user information, the user history playing video and its relationship to the target video to obtain user features and user interest expression features, performing feature extraction on the target video to obtain a second feature representation, and combining the user features, the user interest expression features and the second feature representation to obtain a third feature representation;
determining a similarity between the first feature representation and the third feature representation;
and updating the parameters of the audio visual model according to the similarity and the relation label in the training sample.
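As a reading aid only, the following is a minimal sketch of the training step described in claim 1, written in Python with PyTorch. The module names, feature dimensions, dot-product similarity and binary cross-entropy loss are illustrative assumptions; the claim itself does not fix these choices.

```python
import torch
import torch.nn as nn

class AudioVisualModel(nn.Module):
    # Hypothetical layer sizes; only the overall data flow follows claim 1.
    def __init__(self, audio_dim=128, user_dim=32, video_dim=128, hidden=64):
        super().__init__()
        self.audio_tower = nn.Sequential(nn.Linear(audio_dim, hidden), nn.LeakyReLU())
        self.user_net = nn.Sequential(nn.Linear(user_dim, hidden), nn.LeakyReLU())
        self.video_tower = nn.Sequential(nn.Linear(video_dim, hidden), nn.LeakyReLU())
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, audio, user, video):
        first = self.audio_tower(audio)                     # first feature representation (target audio)
        user_feat = self.user_net(user)                     # user / interest features
        second = self.video_tower(video)                    # second feature representation (target video)
        third = self.fuse(torch.cat([user_feat, second], dim=-1))  # combined third representation
        return (first * third).sum(dim=-1)                  # similarity score between first and third

model = AudioVisualModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Toy training sample: random features plus a 0/1 relation label.
audio, user, video = torch.randn(8, 128), torch.randn(8, 32), torch.randn(8, 128)
label = torch.randint(0, 2, (8,)).float()

similarity = model(audio, user, video)
loss = loss_fn(similarity, label)     # compare similarity against the relation label
optimizer.zero_grad()
loss.backward()
optimizer.step()                      # update the model parameters
```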
2. The method of claim 1, wherein the training sample further comprises a knowledge graph, and performing feature extraction on the target audio or the target video comprises:
determining a target node corresponding to the target audio or the target video in the knowledge graph, and determining neighbor nodes that establish an association relation with the target node through edges;
performing feature extraction on the attribute information of the neighbor nodes and on the association relations corresponding to the edges connecting the neighbor nodes to the target node, to obtain a relation expression feature of the target node and the neighbor nodes in the knowledge graph;
wherein the relation expression feature of the target node and the neighbor nodes in the knowledge graph comprises the first feature representation, expressing the relation between the target audio and its neighbor nodes in the knowledge graph, or the second feature representation, expressing the relation between the target video and its neighbor nodes in the knowledge graph;
and the knowledge graph is constructed by defining entities as nodes, connecting nodes having association relations through edges, determining the types of the edges according to the types of the association relations, and filling in the attribute information of the nodes according to information related to the nodes, wherein the entities comprise audio and video.
3. The method of claim 2, wherein the knowledge-graph is constructed as follows:
defining entity types, entity attribute information, edges corresponding to different types of association relations, and rules for judging each type of association relation, wherein the entity types comprise a video type and an audio type;
according to the defined entity types and entity attribute information, extracting entities of different entity types from a source database as nodes, and extracting the attribute information of the nodes from information related to the nodes;
and determining whether an association relation exists between different nodes according to the rule for judging each type of association relation, and, when an association relation is determined to exist, connecting the nodes through edges of the corresponding type according to the type of the association relation.
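A plain-Python sketch of the construction steps in claim 3 follows. The entity fields, the two example association rules and their names are hypothetical; the claim only requires that typed nodes be extracted from a source database and connected by typed edges when a rule-based association is found.

```python
# Edge type -> rule for judging whether that type of association exists.
# Both rules and all attribute names are illustrative assumptions.
EDGE_RULES = {
    "same_artist": lambda a, b: a["attrs"].get("artist") == b["attrs"].get("artist"),
    "used_as_bgm": lambda a, b: a["id"] in b["attrs"].get("bgm_ids", []),
}

def build_graph(source_rows):
    # Extract entities of the defined types (audio, video) as nodes with attributes.
    nodes = {row["id"]: {"id": row["id"], "type": row["type"], "attrs": row["attrs"]}
             for row in source_rows}
    edges = []
    ids = list(nodes)
    for i, a_id in enumerate(ids):
        for b_id in ids[i + 1:]:
            for edge_type, rule in EDGE_RULES.items():
                # Connect the two nodes with an edge of the corresponding type.
                if rule(nodes[a_id], nodes[b_id]):
                    edges.append((a_id, b_id, edge_type))
    return nodes, edges

nodes, edges = build_graph([
    {"id": "a1", "type": "audio", "attrs": {"artist": "X"}},
    {"id": "v1", "type": "video", "attrs": {"artist": "X", "bgm_ids": ["a1"]}},
])  # -> two typed edges between a1 and v1
```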
4. The method of claim 3, further comprising:
storing the extracted nodes, the attribute information of the extracted nodes, the determination results of whether association relations exist, and the connection information of the edges in separate tables;
and taking each extracted node as an index entry and fusing the different tables to obtain entry contents for the node, wherein the entry contents comprise the attribute information of the node, the neighbor nodes associated with the node, and the types of the association relations between the node and its associated neighbor nodes.
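Claim 4's storage layout can be pictured with the small sketch below: separate tables for nodes, attributes and edges are fused into one entry per node. The table shapes are assumptions made only for illustration.

```python
def fuse_tables(node_table, attr_table, edge_table):
    # node_table: iterable of node ids; attr_table: id -> attributes;
    # edge_table: list of (source_id, target_id, association_type).
    entries = {}
    for node_id in node_table:
        neighbors = [(dst, etype) for src, dst, etype in edge_table if src == node_id]
        neighbors += [(src, etype) for src, dst, etype in edge_table if dst == node_id]
        entries[node_id] = {
            "attrs": attr_table.get(node_id, {}),   # attribute information of the node
            "neighbors": neighbors,                 # associated neighbor + relation type
        }
    return entries
```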
5. The method according to claim 2, wherein performing feature extraction on the attribute information of the neighbor nodes and on the association relations corresponding to the edges connecting the neighbor nodes to the target node, to obtain the relation expression feature of the target node and the neighbor nodes in the knowledge graph, comprises:
determining homogeneous neighbor nodes that belong to the same entity type as the target node, and performing feature extraction, by using a first feature extraction layer, on the attribute information of the homogeneous neighbor nodes and on the association relations corresponding to the edges connecting the homogeneous neighbor nodes to the target node, to obtain a first relation expression feature of the target node and the homogeneous neighbor nodes in the knowledge graph;
determining heterogeneous neighbor nodes that belong to an entity type different from that of the target node, and performing feature extraction, by using the first feature extraction layer, on the attribute information of the heterogeneous neighbor nodes and on the association relations corresponding to the edges connecting the heterogeneous neighbor nodes to the target node, to obtain a second relation expression feature of the target node and the heterogeneous neighbor nodes in the knowledge graph;
and converting the first relation expression feature and the second relation expression feature into the same vector space by using a second feature extraction layer, to obtain the relation expression feature of the target node and the neighbor nodes in the knowledge graph.
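The sketch below illustrates the neighbor-based relation features of claims 2 and 5, assuming each neighbor is already represented by a vector built from its attribute information and the embedding of its edge type. The layer shapes and the mean-pooling over each neighbor group are assumptions.

```python
import torch
import torch.nn as nn

class NeighborEncoder(nn.Module):
    def __init__(self, feat_dim=64, out_dim=64):
        super().__init__()
        # First feature extraction layer, applied separately to same-type
        # (homogeneous) and cross-type (heterogeneous) neighbors of the target node.
        self.homo_layer = nn.Linear(feat_dim, out_dim)
        self.hetero_layer = nn.Linear(feat_dim, out_dim)
        # Second feature extraction layer: map both results into one vector space.
        self.project = nn.Linear(2 * out_dim, out_dim)

    def forward(self, homo_feats, hetero_feats):
        first_rel = self.homo_layer(homo_feats).mean(dim=0)       # first relation expression feature
        second_rel = self.hetero_layer(hetero_feats).mean(dim=0)  # second relation expression feature
        return self.project(torch.cat([first_rel, second_rel], dim=-1))

encoder = NeighborEncoder()
# 5 homogeneous and 3 heterogeneous neighbor vectors for one target node.
relation_feature = encoder(torch.randn(5, 64), torch.randn(3, 64))
```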
6. The method of claim 1 or 2, wherein determining the similarity between the first feature representation and the third feature representation comprises:
inputting the first feature representation into a song double-tower layer, and performing regularization processing on the first feature representation by using the song double-tower layer;
inputting the third feature representation into a video double-tower layer, and performing regularization processing on the third feature representation by using the video double-tower layer;
and determining the similarity between the regularized first feature representation and the regularized third feature representation.
7. The method of claim 6, wherein determining the similarity between the first feature representation and the third feature representation comprises:
performing regularization processing on the first feature representation by using three LeakyReLU layers in the song double-tower layer;
performing regularization processing on the third feature representation by using three LeakyReLU layers in the video double-tower layer;
and determining the similarity between the regularized first feature representation and the regularized third feature representation through a sigmoid function connecting the song double-tower layer and the video double-tower layer.
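Claims 6 and 7 can be read as a standard two-tower arrangement; the sketch below uses three Linear + LeakyReLU blocks per tower and a sigmoid over the dot product of the two outputs. The dot-product readout is an assumption about how the two towers are connected.

```python
import torch
import torch.nn as nn

def three_layer_tower(dim=64, hidden=64):
    # Three LeakyReLU blocks, as recited for each double-tower branch.
    return nn.Sequential(
        nn.Linear(dim, hidden), nn.LeakyReLU(),
        nn.Linear(hidden, hidden), nn.LeakyReLU(),
        nn.Linear(hidden, hidden), nn.LeakyReLU(),
    )

song_tower = three_layer_tower()    # regularizes the first (audio) feature representation
video_tower = three_layer_tower()   # regularizes the third (user + video) feature representation

first = torch.randn(8, 64)
third = torch.randn(8, 64)
similarity = torch.sigmoid((song_tower(first) * video_tower(third)).sum(dim=-1))
```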
8. The method according to claim 1 or 2, wherein updating the parameters of the audio visualization model according to the similarity and the relationship labels in the training samples comprises:
determining a first loss function according to the similarity and a relation label in the training sample;
determining a second loss function according to the similarity and a fitted similarity between the first feature representation and the third feature representation that is fitted from the knowledge graph, wherein the larger the number of neighbor nodes shared by the target node corresponding to the target audio and the target node corresponding to the target video, the larger the fitted similarity;
and updating the parameters of the audio visual model according to the first loss function and the second loss function.
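A possible reading of claim 8's objective is sketched below: a label loss on the predicted similarity plus a second loss that pulls the similarity toward a value fitted from the knowledge graph, where more shared neighbors yield a larger fitted value. The Jaccard-style fit and the 0.5 weight are assumptions.

```python
import torch
import torch.nn.functional as F

def fitted_similarity(audio_neighbors, video_neighbors):
    # More shared neighbor nodes -> larger fitted similarity (Jaccard-style, assumed).
    shared = len(audio_neighbors & video_neighbors)
    total = len(audio_neighbors | video_neighbors) or 1
    return shared / total

def total_loss(similarity, label, fitted, weight=0.5):
    first_loss = F.binary_cross_entropy(similarity, label)   # similarity vs. relation label
    second_loss = F.mse_loss(similarity, fitted)              # similarity vs. graph-fitted value
    return first_loss + weight * second_loss

similarity = torch.sigmoid(torch.randn(2))
label = torch.tensor([1.0, 0.0])
fitted = torch.tensor([fitted_similarity({"a", "b", "c"}, {"b", "c"}),   # many shared neighbors
                       fitted_similarity({"a"}, {"d"})])                 # none shared
loss = total_loss(similarity, label, fitted)
```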
9. The method of claim 2, wherein determining neighbor nodes that establish associations with the target node through edges comprises:
determining neighbor nodes establishing an association relationship with the target node through edges within a set hop count in the knowledge graph;
the hop count is the number of edges required to connect to the target node starting from a neighbor node.
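Claim 9's hop-limited neighborhood is a breadth-first walk over the graph; a small sketch follows, assuming the graph is available as an adjacency dictionary.

```python
from collections import deque

def neighbors_within_hops(adjacency, target, max_hops=2):
    # One hop = one edge traversed starting from the target node.
    seen, result = {target}, set()
    frontier = deque([(target, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                result.add(nxt)
                frontier.append((nxt, hops + 1))
    return result

graph = {"audio1": ["video1"], "video1": ["audio1", "artist1"], "artist1": ["video1"]}
print(neighbors_within_hops(graph, "audio1", max_hops=2))  # {'video1', 'artist1'}
```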
10. The method of claim 1 or 2, wherein obtaining the relationship labels in the training sample comprises:
acquiring play behavior feedback of a user during playback of the target audio matched with the target video;
and determining, according to the play behavior feedback, the relation label representing whether the target audio and the target video are associated.
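One way to derive the relation label of claim 10 from play behavior feedback is sketched below; the completion-ratio threshold and the feedback fields ("liked", "skipped") are assumptions about what the feedback contains.

```python
def relation_label(feedback):
    # Positive label if the user finished most of the paired playback or liked it,
    # and did not skip it; otherwise negative.
    completion = feedback["played_seconds"] / max(feedback["duration_seconds"], 1)
    positive = (completion >= 0.8 or feedback.get("liked", False)) and not feedback.get("skipped", False)
    return 1 if positive else 0

label = relation_label({"played_seconds": 55, "duration_seconds": 60, "liked": False})  # -> 1
```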
11. The method of claim 1 or 2, wherein prior to inputting the training samples into an audio visualization model, further comprising:
performing semantic understanding on text data of the target video by using a video content understanding model to obtain a corresponding video text content vector;
extracting image frames of the target video, and performing content understanding on the target video by using an image content understanding model to obtain a corresponding image content vector;
and extracting audio frames of the target audio, and performing content understanding on the target audio by using an audio frame content prediction model to obtain a corresponding audio content vector.
12. A method for audio visualization, the method comprising:
in response to an audio and video collocation request, acquiring user information, candidate audio and candidate video;
inputting the user information, candidate audio and candidate video into an audio visualization model trained by the method of any one of claims 1 to 11;
performing feature extraction on the candidate audio by using the audio visualization model to obtain a first feature representation of the candidate audio;
performing feature extraction on the user information by using the audio visual model to obtain user features, performing feature extraction on the candidate video to obtain second feature representation, and combining the user features and the second feature representation to obtain third feature representation;
determining the similarity between the first feature representation and the third feature representation, and predicting the probability of whether each candidate audio is associated with each candidate video according to the similarity;
and selecting, according to the determined probability of whether each candidate audio is associated with each candidate video, candidate audio and candidate video whose probability values are greater than a preset value for combined playing.
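At serving time, claim 12 amounts to scoring every candidate audio/video pair for the requesting user and keeping the pairs above a preset probability. A sketch, assuming a trained model like the claim 1 sketch that returns a similarity score per (audio, user, video) triple and a threshold of 0.5:

```python
import torch

def match_pairs(model, user, candidate_audios, candidate_videos, threshold=0.5):
    selected = []
    for a_idx, audio in enumerate(candidate_audios):
        for v_idx, video in enumerate(candidate_videos):
            score = model(audio.unsqueeze(0), user.unsqueeze(0), video.unsqueeze(0))
            prob = torch.sigmoid(score).item()     # probability that this pair is associated
            if prob > threshold:
                selected.append((a_idx, v_idx, prob))
    # Highest-probability audio/video combinations first.
    return sorted(selected, key=lambda pair: -pair[2])
```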
13. An audio visualization model training apparatus, the apparatus comprising:
the system comprises a sample acquisition module, a comparison module and a comparison module, wherein the sample acquisition module is used for acquiring a training sample, and the training sample comprises user information, a user history playing video, a target audio, a target video and a relation label for representing whether the target audio and the target video are associated or not;
the first feature extraction module is used for inputting the training samples into an audio visualization model, and performing feature extraction on the target audio to obtain a first feature representation of the target audio;
the third feature extraction module is used for performing feature extraction on the user information, the user history playing video and its relationship to the target video to obtain user features and user interest expression features, performing feature extraction on the target video to obtain a second feature representation, and combining the user features, the user interest expression features and the second feature representation to obtain a third feature representation;
a similarity determination module for determining a similarity between the first feature representation and the third feature representation;
and the parameter updating module is used for updating the parameters of the audio visual model according to the similarity and the relation label in the training sample.
14. The apparatus of claim 13, wherein the training sample further comprises a knowledge graph, and the first feature extraction module performing feature extraction on the target audio, or the third feature extraction module performing feature extraction on the target video, comprises:
determining a target node corresponding to the target audio or the target video in the knowledge graph, and determining neighbor nodes that establish an association relation with the target node through edges;
performing feature extraction on the attribute information of the neighbor nodes and on the association relations corresponding to the edges connecting the neighbor nodes to the target node, to obtain a relation expression feature of the target node and the neighbor nodes in the knowledge graph;
wherein the relation expression feature of the target node and the neighbor nodes in the knowledge graph comprises the first feature representation, expressing the relation between the target audio and its neighbor nodes in the knowledge graph, or the second feature representation, expressing the relation between the target video and its neighbor nodes in the knowledge graph;
and the knowledge graph is constructed by defining entities as nodes, connecting nodes having association relations through edges, determining the types of the edges according to the types of the association relations, and filling in the attribute information of the nodes according to information related to the nodes, wherein the entities comprise audio and video.
15. The apparatus of claim 14, further comprising:
the knowledge graph building module is used for building the knowledge graph in the following way:
defining entity types, entity attribute information, edges corresponding to different types of association relations, and rules for judging each type of association relation, wherein the entity types comprise a video type and an audio type;
according to the defined entity types and entity attribute information, extracting entities of different entity types from a source database as nodes, and extracting the attribute information of the nodes from information related to the nodes;
and determining whether an association relation exists between different nodes according to the rule for judging each type of association relation, and, when an association relation is determined to exist, connecting the nodes through edges of the corresponding type according to the type of the association relation.
16. The apparatus of claim 15, further comprising:
the storage module is used for storing the extracted nodes, the attribute information of the extracted nodes, the determination results of whether association relations exist, and the connection information of the edges in separate tables;
and the fusion module is used for taking each extracted node as an index entry and fusing the different tables to obtain entry contents for the node, wherein the entry contents comprise the attribute information of the node, the neighbor nodes associated with the node, and the types of the association relations between the node and its associated neighbor nodes.
17. The apparatus according to claim 14, wherein the first or third feature extraction module performing feature extraction on the attribute information of the neighbor nodes and on the association relations corresponding to the edges connecting the neighbor nodes to the target node, to obtain the relation expression feature of the target node and the neighbor nodes in the knowledge graph, comprises:
determining homogeneous neighbor nodes that belong to the same entity type as the target node, and performing feature extraction, by using a first feature extraction layer, on the attribute information of the homogeneous neighbor nodes and on the association relations corresponding to the edges connecting the homogeneous neighbor nodes to the target node, to obtain a first relation expression feature of the target node and the homogeneous neighbor nodes in the knowledge graph;
determining heterogeneous neighbor nodes that belong to an entity type different from that of the target node, and performing feature extraction, by using the first feature extraction layer, on the attribute information of the heterogeneous neighbor nodes and on the association relations corresponding to the edges connecting the heterogeneous neighbor nodes to the target node, to obtain a second relation expression feature of the target node and the heterogeneous neighbor nodes in the knowledge graph;
and converting the first relation expression feature and the second relation expression feature into the same vector space by using a second feature extraction layer, to obtain the relation expression feature of the target node and the neighbor nodes in the knowledge graph.
18. The apparatus of claim 13 or 14, wherein the similarity determination module determines a similarity between the first feature representation and the third feature representation, comprising:
inputting the first feature representation into a song double-tower layer, and performing regularization processing on the first feature representation by using the song double-tower layer;
inputting the third feature representation into a video double-tower layer, and performing regularization processing on the third feature representation by using the video double-tower layer;
and determining the similarity between the regularized first feature representation and the regularized third feature representation.
19. The apparatus of claim 18, wherein the similarity determination module determines a similarity between the first feature representation and the third feature representation, comprising:
performing regularization processing on the first feature representation by using three LeakyReLU layers in the song double-tower layer;
performing regularization processing on the third feature representation by using three LeakyReLU layers in the video double-tower layer;
and determining the similarity between the regularized first feature representation and the regularized third feature representation through a sigmoid function connecting the song double-tower layer and the video double-tower layer.
20. The apparatus according to claim 13 or 14, wherein the parameter updating module updates the parameters of the audio visualization model according to the similarity and the relationship label in the training sample, and comprises:
determining a first loss function according to the similarity and a relation label in the training sample;
determining a second loss function according to the similarity and a fitted similarity between the first feature representation and the third feature representation that is fitted from the knowledge graph, wherein the larger the number of neighbor nodes shared by the target node corresponding to the target audio and the target node corresponding to the target video, the larger the fitted similarity;
and updating the parameters of the audio visual model according to the first loss function and the second loss function.
21. The apparatus of claim 14, wherein the first/third feature extraction module determines neighbor nodes associated with the target node by edges, comprising:
determining neighbor nodes establishing an association relationship with the target node through edges within a set hop count in the knowledge graph;
the hop count is the number of edges required to connect to the target node starting from a neighbor node.
22. The apparatus of claim 13 or 14, wherein the sample obtaining module obtains the relationship label in the training sample, comprising:
acquiring play behavior feedback of a user during playback of the target audio matched with the target video;
and determining, according to the play behavior feedback, the relation label representing whether the target audio and the target video are associated.
23. The apparatus of claim 13 or 14, further comprising:
a vector conversion module to perform, prior to inputting the training samples into an audio visualization model:
performing semantic understanding on text data of the target video by using a video content understanding model to obtain a corresponding video text content vector;
extracting image frames of the target video, and performing content understanding on the target video by using an image content understanding model to obtain a corresponding image content vector;
and extracting audio frames of the target audio, and performing content understanding on the target audio by using an audio frame content prediction model to obtain a corresponding audio content vector.
24. An audio visualization device, the device comprising:
the information acquisition module is used for acquiring user information, candidate audio and candidate video in response to an audio and video collocation request;
a model input module, configured to input the user information, the candidate audio and the candidate video into an audio visualization model trained by the method according to any one of claims 1 to 11;
the first feature extraction module is used for performing feature extraction on the candidate audio by using the audio visualization model to obtain a first feature representation of the candidate audio;
the third feature extraction module is used for performing feature extraction on the user information by using the audio visualization model to obtain user features, performing feature extraction on the candidate video to obtain second feature representation, and combining the user features and the second feature representation to obtain third feature representation;
a probability determination module, configured to determine a similarity between the first feature representation and the third feature representation, and predict, according to the similarity, a probability of whether each candidate audio is associated with each candidate video;
and the audio and video collocation module is used for selecting, according to the determined probability of whether each candidate audio is associated with each candidate video, candidate audio and candidate video whose probability values are greater than a preset value for combined playing.
25. An audio visualization model training apparatus, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the audio visualization model training method of any of claims 1 to 11.
26. An audio visualization device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the audio visualization method of claim 12.
27. A storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the audio visualization model training method of any of claims 1 to 11, or the audio visualization method of claim 12.
CN202110493845.9A 2021-05-07 2021-05-07 Audio visual model training and audio visual method, device and equipment Active CN113157965B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110493845.9A CN113157965B (en) 2021-05-07 2021-05-07 Audio visual model training and audio visual method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110493845.9A CN113157965B (en) 2021-05-07 2021-05-07 Audio visual model training and audio visual method, device and equipment

Publications (2)

Publication Number Publication Date
CN113157965A CN113157965A (en) 2021-07-23
CN113157965B true CN113157965B (en) 2022-05-20

Family

ID=76873729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110493845.9A Active CN113157965B (en) 2021-05-07 2021-05-07 Audio visual model training and audio visual method, device and equipment

Country Status (1)

Country Link
CN (1) CN113157965B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113676599B (en) * 2021-08-20 2024-03-22 上海明略人工智能(集团)有限公司 Network call quality detection method, system, computer equipment and storage medium
CN113569088B (en) * 2021-09-27 2021-12-21 腾讯科技(深圳)有限公司 Music recommendation method and device and readable storage medium
CN114661830B (en) * 2022-03-09 2023-03-24 苏州工业大数据创新中心有限公司 Data processing method, device, terminal and storage medium
CN114638307A (en) * 2022-03-21 2022-06-17 北京达佳互联信息技术有限公司 Information detection method, information detection device, electronic equipment and storage medium
CN114610905B (en) * 2022-03-23 2024-04-26 腾讯科技(深圳)有限公司 Data processing method and related device
CN114465737B (en) * 2022-04-13 2022-06-24 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN114999611B (en) * 2022-07-29 2022-12-20 支付宝(杭州)信息技术有限公司 Model training and information recommendation method and device
CN118568290A (en) * 2023-02-28 2024-08-30 腾讯科技(深圳)有限公司 Model training method for audio and video matching, audio and video matching method and device
CN117097909B (en) * 2023-10-20 2024-02-02 深圳市星易美科技有限公司 Distributed household audio and video processing method and system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9053750B2 (en) * 2011-06-17 2015-06-09 At&T Intellectual Property I, L.P. Speaker association with a visual representation of spoken content
CN105868292A (en) * 2016-03-23 2016-08-17 中山大学 Video visualization processing method and system
CN110941740B (en) * 2019-11-08 2023-07-14 深圳市雅阅科技有限公司 Video recommendation method and computer-readable storage medium
CN111274440B (en) * 2020-01-19 2022-03-25 浙江工商大学 Video recommendation method based on visual and audio content relevancy mining
CN111324769B (en) * 2020-01-20 2024-07-16 腾讯科技(北京)有限公司 Training method of video information processing model, video information processing method and device
CN111401259B (en) * 2020-03-18 2024-02-02 南京星火技术有限公司 Model training method, system, computer readable medium and electronic device

Also Published As

Publication number Publication date
CN113157965A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN113157965B (en) Audio visual model training and audio visual method, device and equipment
CN110717017B (en) Method for processing corpus
CN113762322B (en) Video classification method, device and equipment based on multi-modal representation and storage medium
US20240281462A1 (en) Content summarization leveraging systems and processes for key moment identification and extraction
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
JP7106802B2 (en) Resource sorting method, method for training a sorting model and corresponding apparatus
CN114694076A (en) Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion
WO2021190174A1 (en) Information determining method and apparatus, computer device, and storage medium
WO2021218028A1 (en) Artificial intelligence-based interview content refining method, apparatus and device, and medium
CN113705299A (en) Video identification method and device and storage medium
US11876986B2 (en) Hierarchical video encoders
CN113392265A (en) Multimedia processing method, device and equipment
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN112100375A (en) Text information generation method and device, storage medium and equipment
US20230008897A1 (en) Information search method and device, electronic device, and storage medium
WO2023197749A1 (en) Background music insertion time point determining method and apparatus, device, and storage medium
CN113705315A (en) Video processing method, device, equipment and storage medium
CN114282055A (en) Video feature extraction method, device and equipment and computer storage medium
CN117093687A (en) Question answering method and device, electronic equipment and storage medium
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
CN114661951A (en) Video processing method and device, computer equipment and storage medium
CN114138989A (en) Relevance prediction model training method and device and relevance prediction method
CN114169408A (en) Emotion classification method based on multi-mode attention mechanism
CN118051635A (en) Conversational image retrieval method and device based on large language model
CN115734024A (en) Audio data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant