CN112182281A - Audio recommendation method and device and storage medium


Info

Publication number
CN112182281A
Authority
CN
China
Prior art keywords
image
information
text
audio
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910604904.8A
Other languages
Chinese (zh)
Other versions
CN112182281B (en)
Inventor
刘龙坡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910604904.8A priority Critical patent/CN112182281B/en
Publication of CN112182281A publication Critical patent/CN112182281A/en
Application granted granted Critical
Publication of CN112182281B publication Critical patent/CN112182281B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval of audio data
    • G06F16/63 Querying
    • G06F16/635 Filtering based on additional data, e.g. user or group profiles
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 Retrieval characterised by using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an audio recommendation method and device and a storage medium, relates to the field of computer technologies, and aims to improve the accuracy of audio recommendation. In the method, after image feature information that describes the semantic content of the image information to be scored (i.e. the image information awaiting background music) is determined, feature similarity is calculated between the image feature information and text feature information that describes the semantic features of audios, and audios that can serve as background music for the image information to be scored are recommended according to the calculated feature similarities. In this way, by matching the semantics of the image information to be scored with the semantics of the audios, audios with the same or similar semantics are recommended to the user, making the audio recommendation more accurate; at the same time, the complexity of the user's operation of picking music out of many audios irrelevant to the image information to be scored is reduced, saving the user time when selecting a soundtrack.

Description

Audio recommendation method and device and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an audio recommendation method, an audio recommendation device, and a storage medium.
Background
With the rise of various instant messaging applications and information-sharing platform applications, users can share image information such as pictures and videos on these platforms. When sharing image information, a user usually adds background music, i.e. a soundtrack, to the shared image information.
However, in the prior art, when adding a soundtrack to the shared image information, the user needs to select the audio from an audio library. Most of the audio automatically recommended to the user from the audio library is currently popular music, which does not meet the user's actual needs well, so the accuracy of the recommended audio still needs to be improved.
Disclosure of Invention
Embodiments of the application provide an audio recommendation method and device and a storage medium, which are used to improve the accuracy of audio recommendation.
In a first aspect, an audio recommendation method is provided, including:
acquiring image information to be scored, the image information to be scored including at least one frame of image;
performing feature extraction on the image information to be scored to obtain image feature information that describes the semantic content of the image information to be scored;
determining the feature similarity between the image feature information and each piece of text feature information in a plurality of pieces of text feature information, where each piece of text feature information corresponds to an audio and is used to describe the semantic features of that audio;
and selecting, according to the feature similarities, a specified number of audios from the plurality of audios as recommended audios for the image information to be scored.
In one embodiment, if the image information to be scored includes multiple frames of images, performing feature extraction on the image information to be scored to obtain image feature information that describes its semantic content includes:
performing feature extraction on each frame of image in the image information to be scored based on a residual network model to obtain a feature vector of each frame of image;
determining an average feature vector of the image information to be scored from the feature vectors of the frames;
performing convolution calculation between the average feature vector and each of a preset number of first convolution kernels and adding the corresponding bias parameters to obtain a feature vector of a specified dimensionality as the image feature information; the convolution parameters of different first convolution kernels are not all the same.
In one embodiment, a text corresponding to each audio is stored in advance, and the method further includes acquiring the text feature information of each audio as follows:
performing word segmentation on the text corresponding to an audio and then removing stop words to obtain the word segments of that text;
splicing the word segments in their order in the text to obtain a word-segment sequence;
performing word embedding on each word segment in the sequence to obtain, for each word segment, a word vector that captures its contextual relations, the word vectors together serving as the text vector of the audio;
performing convolution calculation between the text vector and each of a preset number of second convolution kernels and adding the corresponding bias parameters to obtain a feature vector of the specified dimensionality as the text feature information of the audio; the convolution parameters of different second convolution kernels are not all the same, and the number of first convolution kernels is the same as the number of second convolution kernels.
In one embodiment, determining the feature similarity between the image feature information and each piece of text feature information includes:
determining the cosine similarity between the image feature information and each piece of text feature information.
In one embodiment, the convolution parameters and bias parameters of the first convolution kernels and the second convolution kernels are obtained through the following training:
obtaining training samples, where each training sample includes image feature information of sample image information and text feature information of a sample text, and the label of the training sample is the subjective similarity between the image feature information and the text feature information in that sample;
determining the feature similarity between the image feature information and the text feature information in each training sample;
determining the error between the feature similarity of each training sample and the corresponding subjective similarity;
and propagating the error to the first and second convolution kernels through a back-propagation algorithm to update their convolution parameters and the corresponding bias parameters until convolution parameters and bias parameters that satisfy specified conditions are obtained.
In a second aspect, an audio recommendation apparatus is provided, including:
a first acquisition module, configured to acquire image information to be scored, the image information to be scored including at least one frame of image;
a feature extraction module, configured to perform feature extraction on the image information to be scored to obtain image feature information that describes the semantic content of the image information to be scored;
a first similarity determining module, configured to determine the feature similarity between the image feature information and each piece of text feature information in a plurality of pieces of text feature information, where each piece of text feature information corresponds to an audio;
and a recommending module, configured to select, according to the feature similarities, a specified number of audios from the plurality of audios as recommended audios for the image information to be scored.
In one embodiment, if the image information to be scored includes multiple frames of images, the feature extraction module includes:
a feature extraction unit, configured to perform feature extraction on each frame of image in the image information to be scored based on a residual network model to obtain a feature vector of each frame of image;
a feature averaging unit, configured to determine an average feature vector of the image information to be scored from the feature vectors of the frames;
and a calculation unit, configured to perform convolution calculation between the average feature vector and each of a preset number of first convolution kernels and add the corresponding bias parameters to obtain a feature vector of a specified dimensionality as the image feature information; the convolution parameters of different first convolution kernels are not all the same.
In one embodiment, a text corresponding to each audio is stored in advance, and the apparatus further includes the following modules for acquiring the text feature information of each audio:
a processing module, configured to perform word segmentation on the text corresponding to an audio and then remove stop words to obtain the word segments of that text;
a splicing module, configured to splice the word segments in their order in the text to obtain a word-segment sequence;
a word embedding module, configured to perform word embedding on each word segment in the sequence to obtain, for each word segment, a word vector that captures its contextual relations, the word vectors together serving as the text vector of the audio;
and a calculation module, configured to perform convolution calculation between the text vector and each of a preset number of second convolution kernels and add the corresponding bias parameters to obtain a feature vector of the specified dimensionality as the text feature information of the audio; the convolution parameters of different second convolution kernels are not all the same, and the number of first convolution kernels is the same as the number of second convolution kernels.
In one embodiment, the first similarity determining module is specifically configured to determine the cosine similarity between the image feature information and each piece of text feature information.
In one embodiment, the convolution parameters and bias parameters of the first convolution kernels and the second convolution kernels are obtained through training by the following modules:
a second acquisition module, configured to obtain training samples, where each training sample includes image feature information of sample image information and text feature information of a sample text, and the label of the training sample is the subjective similarity between the image feature information and the text feature information in that sample;
a second similarity determining module, configured to determine the feature similarity between the image feature information and the text feature information in each training sample;
an error determining module, configured to determine the error between the feature similarity of each training sample and the corresponding subjective similarity;
and a parameter updating module, configured to propagate the error to the first and second convolution kernels through a back-propagation algorithm to update their convolution parameters and the corresponding bias parameters until convolution parameters and bias parameters that satisfy specified conditions are obtained.
In a third aspect, a computing device is provided, including at least one processing unit and at least one storage unit, where the storage unit stores a computer program that, when executed by the processing unit, causes the processing unit to perform the steps of any of the above audio recommendation methods.
In one embodiment, the computing device may be a server or a terminal device.
In a fourth aspect, a computer-readable medium is provided, which stores a computer program executable by a terminal device; when the program runs on the terminal device, it causes the terminal device to perform the steps of any of the above audio recommendation methods.
According to the audio recommendation method and device and the storage medium provided by the embodiments of the application, after the image feature information that describes the semantic content of the image information to be scored is determined, feature similarity is calculated between the image feature information and text feature information that describes the semantic features of audios, and audios that can serve as background music for the image information to be scored are recommended according to the calculated feature similarities. In this way, by matching the semantics of the image information to be scored with the semantics of the audios, audios with the same or similar semantics are recommended to the user, making the audio recommendation more accurate. At the same time, the audios recommended to the user are related to the image information the user is about to publish, so the user does not need to filter out unrelated audios manually; this reduces the user's operation cost and saves the user time when selecting a soundtrack.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
Fig. 1 is a schematic view of an application scenario of the audio recommendation method in an embodiment of the present application;
Fig. 2 is a flow chart of the training of the prediction model in an embodiment of the present application;
Fig. 3 is a schematic flowchart of acquiring image feature information in an embodiment of the present application;
Fig. 4 is a schematic flowchart of acquiring text feature information in an embodiment of the present application;
Fig. 5 is a schematic flow chart of the prediction-model training in an embodiment of the present application;
Fig. 6 is a schematic flow chart of audio recommendation in an embodiment of the present application;
Fig. 7 is a schematic structural diagram of an audio recommendation apparatus in an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a terminal device in an embodiment of the present application.
Detailed Description
In order to solve the problems in the prior art that, when a user selects a soundtrack for image information, manually picking the audio imposes a high operation cost and the narrow range of candidate audios makes the selected audio monotonous, embodiments of the application provide an audio recommendation method and device and a storage medium. For a better understanding of the technical solution provided by the embodiments of the application, its basic principle is briefly described below:
Take selecting background music for a short video as an example. The music currently pushed to the user is popular music with no relation to the short video the user is about to publish, and the user has to audition the music in the audio library one by one from a music list. Since the ordering of the audios in the music list is unrelated to the short video, the user may audition many audios and still not find suitable background music. In view of this, embodiments of the application provide an audio recommendation method and device and a storage medium. In the method, after image feature information that describes the semantic content of the image information to be scored is determined, feature similarity is calculated between the image feature information and text feature information that describes the semantic features of audios, and audios that can serve as background music for the image information to be scored are recommended according to the calculated feature similarities. In this way, by matching the semantics of the image information to be scored with those of the audios, audios with the same or similar semantics are recommended to the user, making the audio recommendation more accurate. At the same time, the complexity of the user's soundtrack-selection operation is reduced, i.e. the user no longer needs to filter irrelevant audios manually, which saves the user's operation time and reduces the processing resources consumed in handling user operation instructions on irrelevant audios.
For the convenience of understanding, the technical solutions provided in the present application are further described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an application scenario of an audio recommendation method in an embodiment of the present application. The scenario includes a terminal device 101 of a user 100 and a server 102.
The user 100 wants to select a soundtrack for a short video on the terminal device 101; the short video to be scored can be sent, through an instant messaging tool, an information-sharing tool, or another application on the terminal device 101, to the server 102 that publishes short videos.
The server 102 performs feature extraction on the received short video to obtain its image feature information; calculates the feature similarity between the obtained image feature information and the text feature information pre-stored in a database; selects a specified number of audios according to the calculated feature similarities; and sends the audio identifiers to the instant messaging tool, information-sharing tool, or other application on the terminal device 101.
The instant messaging tool, information-sharing tool, or other application on the terminal device 101 determines the recommended audios from the audio identifiers and displays them on the operation interface of the terminal device 101.
In one embodiment, the terminal device 101 may be a cell phone, a tablet, a personal computer, or a terminal device simulated by a virtual machine or an emulator.
In one embodiment, the communication between the terminal device 101 and the server 102 is wireless communication or ethernet communication or the like.
In one embodiment, the above operations may also be performed by the instant messaging tool, information-sharing tool, or other application on the terminal device 101, that is: the processing device of the application on the terminal device 101 performs feature extraction on the short video to obtain its image feature information and sends the server 102 a request for the text feature information. The server 102 returns the text feature information to the application on the terminal device 101 according to the received request.
From the received text feature information, the processing device of the application on the terminal device 101 calculates the feature similarity between the image feature information and the text feature information, selects a specified number of audios as recommended audios according to the calculated feature similarities, and displays the recommended audios on the operation interface of the terminal device 101.
In one embodiment, to improve the efficiency of audio recommendation, feature extraction on the short video may also be completed by the instant messaging tool or information-sharing tool on the terminal device 101, with the extracted image feature information sent to the server 102, which then performs the feature similarity calculation.
Therefore, in the embodiment of the present application, the scheme may be applied to the terminal device 101, the server 102, or a combination of the terminal device 101 and the server 102, which is not limited in the present application.
In the embodiment of the application, the method for automatically recommending audio to the user can be divided into two parts: a training part and a prediction part. The training part trains the extraction of image feature information and text feature information; the prediction part predicts which audios are related to the image information to be scored. The two parts are explained separately below.
I. Training part:
In the embodiment of the application, to automatically recommend audio to the user, a prediction model needs to be established first and trained on training samples so that it can predict suitable audios for image information. As shown in fig. 2, the training specifically includes the following steps:
step 201: obtaining a training sample; each training sample comprises image characteristic information of sample image information and text characteristic information of a sample text, and the label of the training sample is the subjective similarity between the image characteristic information and the text characteristic information in the training sample.
Step 202: and determining the feature similarity of the image feature information and the text feature information in each training sample.
The image feature information is used to describe the semantic content of the image information. For example, if the image information displays a snowing scene, the image feature information extracted from it indicates snowing.
If an audio has lyrics, the lyrics are used as the text corresponding to that audio; if an audio has no lyrics, description information describing the semantic content the audio conveys is used as its text. The description information can be determined from human experience and musical knowledge. For example, if an audio has no lyrics and expresses homesickness, its description information is a sentence about homesickness.
The text feature information is used to describe the semantic content of the audio, which is obtained from the lyrics of the audio, or, if the audio has no lyrics, from the description information describing it. For example, if the lyrics of an audio are about snow (e.g. the lyric "I love you, snow of Saibei"), the text feature information indicates snowing.
In the embodiment of the application, the subjective similarity may be determined from human experience and knowledge. For example, if the semantic content described by the image feature information of the sample image information is a snowing scene and the semantic content described by the text feature information of the sample text is related to snow (such as snowing or watching snow), the subjective similarity is set to 1; if the semantic content of the sample text is unrelated to snow (such as raining or thunder), the subjective similarity is set to 0. In general, the subjective similarity ranges between 0 and 1, and a larger value indicates higher subjective similarity, i.e. the text and the image information are more relevant. Of course, for simplicity, the subjective similarity may also take only the two values 0 and 1, where 0 represents irrelevant and 1 represents relevant.
Besides dividing the subjective similarity into two categories (0 or 1), it may also be classified more finely, for example into three categories: irrelevant (subjective similarity 0), relevant (subjective similarity 0.5), and very relevant (subjective similarity 1). With finer classification, the degree of correlation between the image information and the audio can be captured, so audios that better match the image information are recommended to the user, improving the accuracy of audio recommendation.
In the embodiment of the application, the cosine of the angle between the image feature information and the text feature information can be calculated to determine their cosine similarity.
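As a minimal sketch (assuming PyTorch; the 512-dimensional random vectors are purely illustrative), the cosine similarity of the two pieces of feature information can be computed as follows:

```python
import torch
import torch.nn.functional as F

image_feat = torch.randn(1, 512)   # image feature information (illustrative)
text_feat = torch.randn(1, 512)    # text feature information (illustrative)

# Cosine of the angle between the two feature vectors.
cos_sim = F.cosine_similarity(image_feat, text_feat, dim=1)

# Equivalent explicit form: dot product divided by the product of the norms.
manual = (image_feat * text_feat).sum() / (image_feat.norm() * text_feat.norm())
```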
Step 203: determining the error between the feature similarity of each training sample and the corresponding subjective similarity.
Step 204: propagating the error to the first and second convolution kernels through a back-propagation algorithm to update their convolution parameters and the corresponding bias parameters until convolution parameters and bias parameters that satisfy specified conditions are obtained.
The first convolution kernels are the convolution kernels used for obtaining the image feature information, and the second convolution kernels are those used for obtaining the text feature information.
The back-propagation (BP) algorithm serves to reduce the error; that is, the convolution parameters in the first and second convolution kernels and the corresponding bias parameters are updated so that the error becomes smaller and the model meets the requirements.
In this way, the trained prediction model can better interpret the image information, find the audios that suit it, and recommend them to the user, making the audio recommendation more accurate. At the same time, the user's operation cost when selecting a soundtrack is reduced and the user's time is saved. In addition, since the prediction model selects among all the audios in the audio library, the selected audio is more diverse than the automatically recommended popular music, increasing the variety of selectable audio.
As described above, in the embodiment of the present application, the training samples include image feature information and text feature information, and how to obtain the image feature information and the text feature information is described in detail below.
(1) Image feature information:
In the embodiment of the application, to acquire image feature information, the image information needs to be acquired first; the image information may be video information, picture information, or image information composed of multiple pictures. After the image information is acquired, feature extraction is performed on it to obtain the image feature information, which may be implemented as steps A1-A3:
Step A1: performing feature extraction on each frame of image in the sample image information based on a residual network model to obtain a feature vector of each frame of image.
The sample image information comprises multiple frames of images.
In the embodiment of the application, a residual network (ResNet) model is used for feature extraction. The residual network model is a convolutional neural network (CNN) model; it is easy to optimize, and adding considerable depth improves the accuracy of the extracted image feature information. Its core idea is to overcome the side effect (the degradation problem) caused by increasing depth in conventional convolutional neural network models, so that network performance can be improved simply by increasing network depth.
In addition, in the embodiment of the application, other convolutional neural network models may be used for feature extraction, and network models such as a recurrent neural network (RNN) or a long short-term memory (LSTM) network may also be used. Any model that can extract the semantic features of image information is applicable to this embodiment.
Step A2: determining the average feature vector of the sample image information from the feature vectors of the frames.
If the sample image information contains only one picture, the feature vector extracted from that picture directly serves as its average feature vector.
In one embodiment, the feature vectors for each frame of image may be added and averaged to obtain an average feature vector for the sample image information.
In one embodiment, the feature vectors of each frame of image may be weighted and summed, and then averaged to obtain the average feature vector of the sample image information.
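The two averaging options can be sketched as follows (assuming PyTorch; the frame count and the weighting scheme are illustrative assumptions):

```python
import torch

frame_feats = torch.randn(8, 512)            # feature vector of each of 8 frames

# Plain average: add the per-frame feature vectors and divide by the count.
avg = frame_feats.mean(dim=0)

# Weighted average: weight and sum the per-frame vectors, then normalize.
weights = torch.linspace(0.5, 1.5, steps=8)  # e.g. emphasize later frames
weighted_avg = (frame_feats * weights.unsqueeze(1)).sum(dim=0) / weights.sum()
```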
Step A3: performing convolution calculation between the average feature vector and each of a preset number of first convolution kernels and adding the corresponding bias parameters to obtain a feature vector of a specified dimensionality as the image feature information; the convolution parameters of different first convolution kernels are not all the same.
The convolution parameters and bias parameters of the first convolution kernels can be corrected with the error obtained in step 203, so that the resulting image feature information better matches the semantic content described by the sample image.
The number of first convolution kernels can be set according to the actual situation, for example 512 or 1000. If there are 512 first convolution kernels, a 512-dimensional feature vector is obtained.
In this way, by extracting features from the image information through the neural network model, the obtained image feature information describes the semantic content of the image information more accurately, so audios that meet the requirements can be recommended for the image information.
Fig. 3 is a schematic flow chart of acquiring image feature information. After each frame of the image information is obtained (an image frame is one frame of the image information), feature extraction is performed on each frame through the residual network model 301 to obtain its feature vector; the per-frame feature vectors are averaged to obtain the average feature vector of the image information; the average feature vector is input into the fully-connected layer 302 for calculation (the fully-connected layer implements step A3: the average feature vector is convolved with each first convolution kernel of the layer and the corresponding bias parameters are added), yielding the image feature information of the image information.
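The following sketch illustrates this image branch (a minimal, non-authoritative rendering, assuming PyTorch and a recent torchvision; the class name ImageBranch, the ResNet-18 backbone, and the 512-dimensional output are illustrative assumptions, and the "N first convolution kernels plus bias" applied to a vector is expressed as its equivalent, a single fully-connected layer):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageBranch(nn.Module):
    def __init__(self, out_dim: int = 512):
        super().__init__()
        resnet = models.resnet18(weights=None)   # residual network backbone
        # Keep everything up to global average pooling; drop the classifier.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        # Fully-connected layer 302: equivalent to convolving the average
        # feature vector with out_dim first convolution kernels plus bias.
        self.fc = nn.Linear(512, out_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, H, W), the frames of one piece of image information
        feats = self.backbone(frames).flatten(1)  # feature vector of each frame
        avg = feats.mean(dim=0, keepdim=True)     # average feature vector
        return self.fc(avg)                       # image feature information

# Usage: 8 frames of a short video at 224x224 resolution.
video_frames = torch.randn(8, 3, 224, 224)
image_feature_info = ImageBranch()(video_frames)  # shape (1, 512)
```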
(2) Text feature information:
In the embodiment of the application, to acquire text feature information, the text corresponding to the audio needs to be acquired first. After the text corresponding to the audio is obtained, feature extraction is performed on it to obtain the text feature information of the audio, which may be implemented as steps B1-B4:
step B1: and after word segmentation processing is carried out on the text corresponding to the audio, word-stop-removing processing is carried out to obtain each word segmentation in the text corresponding to the audio.
For example, if the text is "i prefer to do a trivial person", after performing the word segmentation process, the result is "i/prefer/do/one/trivial/person"; wherein, if 'would' and 'one' are stop words, the result of the participle processing is processed by stop words, and the obtained participles are 'I/doing/trivial/people'.
Step B2: splicing the word segments in their order in the text to obtain a word-segment sequence.
Splicing in text order preserves the contextual relations in the text. Continuing the example above, the spliced word-segment sequence is "I be ordinary person".
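Steps B1 and B2 can be sketched as follows (assuming the jieba segmenter; the sentence is an illustrative Chinese rendering of the example above, and the stop-word list is a hypothetical choice, not taken from the patent):

```python
import jieba

STOP_WORDS = {"愿", "一个", "的"}       # illustrative stop words

lyrics = "我愿做一个平凡的人"            # "I am willing to be an ordinary person"
segments = [w for w in jieba.lcut(lyrics) if w not in STOP_WORDS]
sequence = " ".join(segments)           # word segments spliced in text order
```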
Step B3: performing word embedding on each word segment in the sequence to obtain, for each word segment, a word vector that captures its contextual relations; these word vectors form the text vector of the audio.
Word embedding is a collective term for language models and representation-learning techniques in natural language processing (NLP). Conceptually, it embeds a high-dimensional space, with one dimension per word, into a continuous vector space of much lower dimension, mapping each word or phrase to a vector over the real numbers. Word embedding can be performed with word2vec (word-vector) models.
In the embodiment of the application, after word embedding is performed on each word segment, the word vector of each segment is obtained; feature extraction is then performed on the word vectors to obtain the text vector of the text.
Step B4: performing convolution calculation between the text vector and each of a preset number of second convolution kernels and adding the corresponding bias parameters to obtain a feature vector of the specified dimensionality as the text feature information of the audio; the convolution parameters of different second convolution kernels are not all the same, and the number of first convolution kernels is the same as the number of second convolution kernels.
The convolution parameters and bias parameters of the second convolution kernels can be corrected with the error obtained in step 203, so that the obtained text feature information better matches the semantic content of the audio's text. The number of second convolution kernels can also be set according to the actual situation, and the dimensionality of the resulting feature vector equals the number of second convolution kernels.
The text feature information corresponds to the audio and is used to describe the audio's semantic features. The correspondence between text feature information and audio may be one-to-one or one-to-many. For example, if two audios without lyrics both express homesickness, their corresponding texts may be identical, and their text feature information is then also identical.
In this way, by extracting features from the text corresponding to the audio through the word-vector model, the semantic content described by the obtained text feature information is more accurate, so audios that meet the requirements can be recommended for the image information.
In one embodiment, the lyrics of an audio cover richer content than the description text of an audio without lyrics, so text feature information extracted from lyrics is more accurate and describes the information the audio expresses more comprehensively.
Fig. 4 is a schematic flowchart of a process for acquiring text feature information. After the audio text is obtained, performing word segmentation processing and word deactivation processing on the audio text to obtain each word segmentation in the audio text; outputting each participle of the audio text to a word vector model 401 for calculation (wherein the word vector model represents the content in step B3, that is, performing a word embedding operation on each participle in the audio text), so as to obtain a word vector of each participle; extracting the characteristics of the word vectors of the word segments, and reducing the dimension to obtain the characteristic vectors of the audio text; inputting the feature vector of the audio text into the fully-connected layer 402 (wherein the fully-connected layer represents what is stated in step B4, that is, in the fully-connected layer, the feature vector of the text and the second convolution kernel of the fully-connected layer are respectively subjected to convolution calculation and corresponding bias parameters are added), so as to obtain text feature information of the audio text.
Note that the fully-connected layer 302 in fig. 3 and the fully-connected layer 402 in fig. 4 are two distinct fully-connected layers.
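A condensed sketch of this text branch (assuming gensim's word2vec implementation; the toy corpus, the 128-dimensional word vectors, the mean-pooling reduction, and the 512-dimensional output are illustrative assumptions, with the "second convolution kernels plus bias" again expressed as a fully-connected layer):

```python
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# Toy corpus of segmented lyrics; a real system would train on the whole library.
corpus = [["我", "做", "平凡", "人"], ["我", "爱", "你", "塞北", "雪"]]
w2v = Word2Vec(sentences=corpus, vector_size=128, window=3, min_count=1)

def text_vector(segments):
    # Mean-pool the word vectors into one text vector (one reasonable
    # reduction; the patent leaves the exact reduction step open).
    vecs = [w2v.wv[w] for w in segments if w in w2v.wv]
    return torch.from_numpy(np.stack(vecs)).mean(dim=0, keepdim=True)

# Fully-connected layer 402: equivalent to the second convolution kernels
# plus bias, projecting to the same dimensionality as the image branch.
fc_text = nn.Linear(128, 512)
text_feature_info = fc_text(text_vector(["我", "做", "平凡", "人"]))  # (1, 512)
```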
After the image feature information and text feature information are obtained, the image feature information and text feature information in the training samples are input into the prediction model for training until a prediction model meeting the requirements is obtained. Fig. 5 is a schematic flow chart of the prediction-model training. The image information is processed through the flow shown in fig. 3 to obtain image feature information; the text corresponding to the audio is processed through the flow shown in fig. 4 to obtain text feature information; similarity calculation on the obtained image feature information and text feature information yields the feature similarity; the difference between the feature similarity and the subjective similarity of the training sample is computed to determine the error between them. Once the error is determined, it is judged whether the error is smaller than a preset error. If not, the error is propagated into the two fully-connected layers (layer 302 and layer 402) through the back-propagation algorithm to update their parameters, producing new image feature information and new text feature information; if so, training of the prediction model is deemed complete and stops.
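The training step of fig. 5 can be condensed into the following sketch (assuming PyTorch; the MSE loss, the SGD optimizer, and the layer dimensions are illustrative choices, since the patent only requires an error signal and back-propagation into the two fully-connected layers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

fc_image = nn.Linear(512, 512)  # layer 302 (first convolution kernels + bias)
fc_text = nn.Linear(128, 512)   # layer 402 (second convolution kernels + bias)
optimizer = torch.optim.SGD(
    list(fc_image.parameters()) + list(fc_text.parameters()), lr=0.01)

def train_step(avg_image_vec, text_vec, subjective_similarity):
    image_feat = fc_image(avg_image_vec)          # image feature information
    text_feat = fc_text(text_vec)                 # text feature information
    feature_sim = F.cosine_similarity(image_feat, text_feat, dim=1)
    # Error between the feature similarity and the human-labelled similarity.
    loss = F.mse_loss(feature_sim, subjective_similarity)
    optimizer.zero_grad()
    loss.backward()    # back-propagate the error into both layers
    optimizer.step()   # update convolution (weight) and bias parameters
    return loss.item()

# One training sample: an averaged frame vector, a text vector, and label 1.0.
err = train_step(torch.randn(1, 512), torch.randn(1, 128), torch.tensor([1.0]))
```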
II. Prediction part:
In the embodiment of the application, after the prediction model is trained, audio recommendation may be performed for a piece of image information to be scored through the prediction model. As shown in fig. 6, this specifically includes the following steps:
Step 601: acquiring the image information to be scored, the image information to be scored including at least one frame of image.
Step 602: performing feature extraction on the image information to be scored to obtain image feature information that describes its semantic content.
Step 603: determining the feature similarity between the image feature information and each piece of text feature information in a plurality of pieces of text feature information, where each piece of text feature information corresponds to an audio and is used to describe the semantic features of that audio.
The text feature information can be obtained by performing feature extraction on the texts of the audios in the audio library at the time of recommendation. Alternatively, feature extraction can be performed on the texts corresponding to the audios in the audio library in advance and the extracted features stored in a database; in this way, when the text feature information is needed, it can be retrieved from the database directly.
Step 604: selecting, according to the feature similarities, a specified number of audios from the plurality of audios as recommended audios for the image information to be scored.
According to the magnitude of the feature similarity, the specified number of audios can be selected in descending order of similarity as the recommended audios for the image information to be scored, or the specified number can be selected at random from the qualifying audios, where a qualifying audio is one whose feature similarity is greater than a preset similarity.
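A minimal sketch of step 604 (assuming PyTorch and that the text feature information of the whole audio library is precomputed and stored as one matrix; the library size and the value of k are illustrative):

```python
import torch
import torch.nn.functional as F

library_text_feats = torch.randn(10000, 512)  # one row per audio in the library
image_feat = torch.randn(1, 512)              # image feature information

# Feature similarity against every audio, then the k most similar audios.
sims = F.cosine_similarity(image_feat, library_text_feats, dim=1)
top_k = torch.topk(sims, k=10)
recommended_audio_ids = top_k.indices.tolist()
```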
In this way, by intelligently matching the image feature information of the image information to be scored with the text feature information describing the semantic features of the audios, the user's operation cost when selecting a soundtrack can be reduced and the variety of selectable audio increased.
In the embodiment of the application, in actual operation the part of the prediction model that acquires text feature information is not invoked; instead, feature extraction is performed on the texts of the audios in advance and the resulting text feature information is stored. When the prediction model needs the text feature information, it is retrieved directly from the stored text feature information.
Based on the same inventive concept, an embodiment of the application further provides an audio recommendation apparatus. As shown in fig. 7, the apparatus includes:
a first acquisition module 701, configured to acquire image information to be scored, the image information to be scored including at least one frame of image;
a feature extraction module 702, configured to perform feature extraction on the image information to be scored to obtain image feature information that describes the semantic content of the image information to be scored;
a first similarity determining module 703, configured to determine the feature similarity between the image feature information and each piece of text feature information in a plurality of pieces of text feature information, where each piece of text feature information corresponds to an audio;
and a recommending module 704, configured to select, according to the feature similarities, a specified number of audios from the plurality of audios as recommended audios for the image information to be scored.
Further, if the image information to be scored includes multiple frames of images, the feature extraction module 702 includes:
a feature extraction unit, configured to perform feature extraction on each frame of image in the image information to be scored based on a residual network model to obtain a feature vector of each frame of image;
a feature averaging unit, configured to determine an average feature vector of the image information to be scored from the feature vectors of the frames;
and a calculation unit, configured to perform convolution calculation between the average feature vector and each of a preset number of first convolution kernels and add the corresponding bias parameters to obtain a feature vector of a specified dimensionality as the image feature information; the convolution parameters of different first convolution kernels are not all the same.
Further, a text corresponding to each audio is stored in advance, and the apparatus further includes the following modules for acquiring the text feature information of each audio:
a processing module, configured to perform word segmentation on the text corresponding to an audio and then remove stop words to obtain the word segments of that text;
a splicing module, configured to splice the word segments in their order in the text to obtain a word-segment sequence;
a word embedding module, configured to perform word embedding on each word segment in the sequence to obtain, for each word segment, a word vector that captures its contextual relations, the word vectors together serving as the text vector of the audio;
and a calculation module, configured to perform convolution calculation between the text vector and each of a preset number of second convolution kernels and add the corresponding bias parameters to obtain a feature vector of the specified dimensionality as the text feature information of the audio; the convolution parameters of different second convolution kernels are not all the same, and the number of first convolution kernels is the same as the number of second convolution kernels.
Further, the first similarity determining module 703 is specifically configured to determine the cosine similarity between the image feature information and each piece of text feature information.
Further, the convolution parameters and bias parameters of the first convolution kernels and the second convolution kernels are obtained through training by the following modules:
a second acquisition module, configured to obtain training samples, where each training sample includes image feature information of sample image information and text feature information of a sample text, and the label of the training sample is the subjective similarity between the image feature information and the text feature information in that sample;
a second similarity determining module, configured to determine the feature similarity between the image feature information and the text feature information in each training sample;
an error determining module, configured to determine the error between the feature similarity of each training sample and the corresponding subjective similarity;
and a parameter updating module, configured to propagate the error to the first and second convolution kernels through a back-propagation algorithm to update their convolution parameters and the corresponding bias parameters until convolution parameters and bias parameters that satisfy specified conditions are obtained.
Based on the same technical concept, the application further provides a terminal device 800. Referring to fig. 8, the terminal device 800 is configured to implement the methods described in the above method embodiments, for example the embodiment shown in fig. 6, and may include a memory 801, a processor 802, an input unit 803, and a display panel 804.
The memory 801 is used for storing the computer programs executed by the processor 802. The memory 801 may mainly include a program storage area and a data storage area; the program storage area may store an operating system, the application programs required for at least one function, and the like, while the data storage area may store data created according to the use of the terminal device 800, and the like. The processor 802 may be a central processing unit (CPU), a digital processing unit, or the like. The input unit 803 may be used to acquire user instructions input by the user. The display panel 804 is configured to display information input by the user or information provided to the user; in the embodiment of the application, it is mainly used to display the display interface of each application program in the terminal device and the controls displayed in each display interface. Optionally, the display panel 804 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.
The embodiment of the present application does not limit the specific connection medium among the memory 801, the processor 802, the input unit 803, and the display panel 804. In the embodiment of the present application, the memory 801, the processor 802, the input unit 803, and the display panel 804 are connected by the bus 805 in fig. 8, the bus 805 is represented by a thick line in fig. 8, and the connection manner between other components is merely illustrative and not limited. The bus 805 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.
The memory 801 may be a volatile memory such as a random-access memory (RAM); the memory 801 may also be a non-volatile memory such as, but not limited to, a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or the memory 801 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 801 may also be a combination of the above memories.
The processor 802 is configured to implement the embodiment shown in fig. 6 by invoking the computer program stored in the memory 801.
An embodiment of the present application further provides a computer-readable storage medium that stores the computer-executable instructions required by the processor, including the program to be executed by the processor.
In some possible embodiments, aspects of a method for audio recommendation provided by the present application may also be implemented in the form of a program product, which includes program code for causing a terminal device to perform the steps of a method for audio recommendation according to various exemplary embodiments of the present application described above in this specification, when the program product is run on the terminal device. For example, the terminal device may perform the embodiment as shown in fig. 6.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product for audio recommendation of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a computing device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device over any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, over the Internet using an Internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the present application, the features and functions of two or more units described above may be embodied in one unit. Conversely, the features and functions of one unit described above may be further divided among, and embodied by, a plurality of units.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into a single step for execution, and/or one step may be broken down into multiple steps for execution.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (12)

1. An audio recommendation method, characterized in that the method comprises:
acquiring image information to be dubbed music; the image information to be dubbed music comprises at least one frame of image;
extracting features of the image information to be dubbed music to obtain image feature information capable of describing the semantic content of the image information to be dubbed music;
determining a feature similarity of the image feature information and each piece of text feature information in a plurality of pieces of text feature information; each piece of text feature information has a corresponding relation with an audio;
and selecting a specified number of audios from the plurality of audios as recommended audios of the image information to be dubbed music according to the feature similarity.
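By way of illustration, the flow of claim 1 can be sketched in Python; the function names (recommend_audio, cosine_similarity), the 128-dimensional features, and the random toy inputs below are assumptions made for the example, not elements of the claim.

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two feature vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def recommend_audio(image_feature, text_features, top_k=3):
    # rank every candidate audio by feature similarity and keep the top-k
    scored = [(audio_id, cosine_similarity(image_feature, vec))
              for audio_id, vec in text_features.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [audio_id for audio_id, _ in scored[:top_k]]

# toy usage: one 128-dim image feature against three candidate audios
rng = np.random.default_rng(0)
image_feature = rng.normal(size=128)
text_features = {"audio_%d" % i: rng.normal(size=128) for i in range(3)}
print(recommend_audio(image_feature, text_features, top_k=2))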
2. The method according to claim 1, wherein, if the image information to be dubbed music comprises a plurality of frames of images, the extracting features of the image information to be dubbed music to obtain image feature information capable of describing the semantic content of the image information to be dubbed music comprises:
extracting features of each frame of image in the image information to be dubbed music based on a residual network model to obtain a feature vector of each frame of image;
determining an average feature vector of the image information to be dubbed music according to the feature vector of each frame of image;
performing convolution calculation on the average feature vector with a preset number of first convolution kernels respectively, and adding corresponding bias parameters, to obtain a feature vector of a specified dimensionality as the image feature information; wherein the convolution parameters of different first convolution kernels are not identical.
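A minimal sketch of the image branch of claim 2, assuming PyTorch, a ResNet-50 as the residual network model, 128 as the specified dimensionality, and frames already decoded to 224x224 tensors; modelling the preset number of first convolution kernels as a 1x1 convolution (one kernel and one bias parameter per output channel) is an illustrative choice, not mandated by the claim.

import torch
import torchvision.models as models

# ResNet-50 backbone with the classifier head removed: each frame maps to a
# 2048-dim vector (downloads pretrained weights; torchvision >= 0.13)
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])
backbone.eval()

# the "preset number of first convolution kernels" modelled as a 1x1
# convolution: each of the 128 output channels is one kernel with its own bias
project = torch.nn.Conv2d(in_channels=2048, out_channels=128, kernel_size=1)

def image_feature(frames):
    # frames: (num_frames, 3, 224, 224) -> one 128-dim image feature vector
    with torch.no_grad():
        per_frame = backbone(frames)                 # (num_frames, 2048, 1, 1)
    averaged = per_frame.mean(dim=0, keepdim=True)   # average feature vector
    return project(averaged).flatten()               # specified dimensionality

print(image_feature(torch.rand(5, 3, 224, 224)).shape)  # torch.Size([128])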
3. The method of claim 2, wherein a text corresponding to each audio is pre-stored, the method further comprising:
acquiring the text feature information of each audio according to the following method:
performing word segmentation processing on the text corresponding to an audio and then removing stop words to obtain the word segments of the text corresponding to the audio;
splicing the word segments according to their order in the text corresponding to the audio to obtain a word segmentation sequence;
performing word embedding on each word segment in the word segmentation sequence to obtain a word vector of each word segment that reflects its contextual relations, as the text vector of the audio;
performing convolution calculation on the text vector with a preset number of second convolution kernels, and adding corresponding bias parameters, to obtain a feature vector of the specified dimensionality of the audio as the text feature information; wherein the convolution parameters of different second convolution kernels are not completely the same, and the number of the first convolution kernels is the same as the number of the second convolution kernels.
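The text branch of claim 3 admits a similar sketch; the whitespace tokenizer, toy stop-word list, and four-word vocabulary below stand in for a real word segmenter and a trained word embedding, and the 128 second convolution kernels match the number of first convolution kernels as the claim requires. Max-pooling over positions is an assumed way of reducing the convolution output to a fixed-dimension vector.

import torch

STOP_WORDS = {"the", "a", "of", "for"}                    # assumed stop-word list
VOCAB = {"sunset": 0, "beach": 1, "calm": 2, "piano": 3}  # toy vocabulary

embed = torch.nn.Embedding(num_embeddings=len(VOCAB), embedding_dim=64)
# "preset number of second convolution kernels" slid over the word-vector
# sequence; 128 kernels, each with its own bias parameter
conv = torch.nn.Conv1d(in_channels=64, out_channels=128, kernel_size=2)

def text_feature(text):
    # segment, drop stop words, and embed, keeping the original word order
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    ids = torch.tensor([VOCAB[w] for w in words if w in VOCAB])
    vectors = embed(ids)                   # (seq_len, 64): the text vector
    seq = vectors.t().unsqueeze(0)         # (1, 64, seq_len) for Conv1d
    return conv(seq).max(dim=2).values.flatten()  # pooled 128-dim feature

print(text_feature("a calm piano for the sunset beach").shape)  # [128]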
4. The method of claim 1, wherein the determining of the feature similarity of the image feature information and each piece of text feature information in the plurality of pieces of text feature information comprises:
determining the cosine similarity of the image feature information and each piece of text feature information.
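For concreteness, a worked instance of the cosine similarity in claim 4, on two-dimensional toy vectors: cos(u, v) = (u · v) / (‖u‖ ‖v‖). With u = (1, 0) and v = (1, 1): u · v = 1, ‖u‖ = 1, ‖v‖ = √2, so cos(u, v) = 1/√2 ≈ 0.707. A value of 1 indicates semantically aligned image and text features; values near 0 indicate unrelated features.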
5. The method of claim 3, wherein the convolution parameters of the first convolution kernels and the second convolution kernels and the corresponding bias parameters are obtained through the following training:
obtaining training samples; each training sample comprises image feature information of sample image information and text feature information of a sample text, and the label of the training sample is the subjective similarity between the image feature information and the text feature information in the training sample;
determining the feature similarity of the image feature information and the text feature information in each training sample;
determining the error between the feature similarity corresponding to each training sample and the corresponding subjective similarity;
and propagating the error to the first convolution kernels and the second convolution kernels through a back propagation algorithm, so as to update the convolution parameters in the first convolution kernels and the second convolution kernels and the corresponding bias parameters until convolution parameters and bias parameters meeting the specified conditions are obtained.
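A minimal training-step sketch for claim 5, assuming PyTorch, a squared error against the subjective-similarity label, and plain SGD; the branch dimensions reuse the illustrative choices from the earlier sketches and are not prescribed by the claim.

import torch

first_kernels = torch.nn.Conv2d(2048, 128, kernel_size=1)  # image branch
second_kernels = torch.nn.Conv1d(64, 128, kernel_size=2)   # text branch
params = list(first_kernels.parameters()) + list(second_kernels.parameters())
optimizer = torch.optim.SGD(params, lr=1e-2)

def train_step(avg_image_vec, text_word_vecs, subjective_similarity):
    # one update from a single (sample image, sample text, label) triple
    img = first_kernels(avg_image_vec.view(1, 2048, 1, 1)).flatten()
    txt = second_kernels(text_word_vecs.t().unsqueeze(0)).max(dim=2).values.flatten()
    predicted = torch.nn.functional.cosine_similarity(img, txt, dim=0)
    loss = (predicted - subjective_similarity) ** 2  # error vs. the label
    optimizer.zero_grad()
    loss.backward()        # back propagation into both sets of kernels
    optimizer.step()       # update convolution and bias parameters
    return float(loss)

print(train_step(torch.rand(2048), torch.rand(6, 64), torch.tensor(0.8)))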
6. An audio recommendation apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring image information to be dubbed music; the image information to be dubbed music comprises at least one frame of image;
the feature extraction module is used for extracting features of the image information to be dubbed music to obtain image feature information capable of describing the semantic content of the image information to be dubbed music;
a first similarity determining module, configured to determine a feature similarity between the image feature information and each piece of text feature information in a plurality of pieces of text feature information; each piece of text feature information has a corresponding relation with an audio;
and the recommending module is used for selecting a specified number of audios from the plurality of audios as the recommended audios of the image information to be dubbed music according to the feature similarity.
7. The apparatus of claim 6, wherein, if the image information to be dubbed music comprises a plurality of frames of images, the feature extraction module comprises:
the feature extraction unit is used for extracting features of each frame of image in the image information to be dubbed music based on a residual network model to obtain the feature vector of each frame of image;
the feature averaging unit is used for determining the average feature vector of the image information to be dubbed music according to the feature vector of each frame of image;
the calculation unit is used for performing convolution calculation on the average feature vector with a preset number of first convolution kernels respectively and adding corresponding bias parameters to obtain a feature vector of a specified dimensionality as the image feature information; wherein the convolution parameters of different first convolution kernels are not identical.
8. The apparatus of claim 7, wherein a text corresponding to each audio is pre-stored, the apparatus further comprising:
the text feature information of each audio is acquired by the following modules:
the processing module is used for performing word segmentation processing on the text corresponding to the audio and then removing stop words to obtain the word segments of the text corresponding to the audio;
the splicing module is used for splicing the word segments according to their order in the text corresponding to the audio to obtain a word segmentation sequence;
the word embedding module is used for performing word embedding on each word segment in the word segmentation sequence to obtain a word vector of each word segment that reflects its contextual relations, as the text vector of the audio;
the calculation module is used for performing convolution calculation on the text vector with a preset number of second convolution kernels and adding corresponding bias parameters to obtain a feature vector of the specified dimensionality of the audio as the text feature information; wherein the convolution parameters of different second convolution kernels are not completely the same, and the number of the first convolution kernels is the same as the number of the second convolution kernels.
9. The apparatus of claim 6, wherein the first similarity determining module is specifically configured to determine the cosine similarity of the image feature information and each piece of text feature information.
10. The apparatus of claim 8, wherein the convolution parameters of the first convolution kernels and the second convolution kernels and the corresponding bias parameters are obtained through training with the following modules:
the second acquisition module is used for acquiring training samples; each training sample comprises image feature information of sample image information and text feature information of a sample text, and the label of the training sample is the subjective similarity between the image feature information and the text feature information in the training sample;
the second similarity determining module is used for determining the feature similarity of the image feature information and the text feature information in each training sample;
the error determining module is used for determining the error between the feature similarity corresponding to each training sample and the corresponding subjective similarity;
and the parameter updating module is used for propagating the error to the first convolution kernels and the second convolution kernels through a back propagation algorithm, so as to update the convolution parameters in the first convolution kernels and the second convolution kernels and the corresponding bias parameters until convolution parameters and bias parameters meeting the specified conditions are obtained.
11. A computing device comprising at least one processing unit and at least one memory unit, wherein the memory unit stores a computer program that, when executed by the processing unit, causes the processing unit to perform the steps of the method of any of claims 1 to 5.
12. A computer-readable medium, in which a computer program executable by a terminal device is stored, which program, when run on the terminal device, causes the terminal device to carry out the steps of the method according to any one of claims 1 to 5.
CN201910604904.8A 2019-07-05 2019-07-05 Audio recommendation method, device and storage medium Active CN112182281B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910604904.8A CN112182281B (en) 2019-07-05 2019-07-05 Audio recommendation method, device and storage medium

Publications (2)

Publication Number Publication Date
CN112182281A true CN112182281A (en) 2021-01-05
CN112182281B CN112182281B (en) 2023-09-19

Family

ID=73915959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910604904.8A Active CN112182281B (en) 2019-07-05 2019-07-05 Audio recommendation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN112182281B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100050064A1 (en) * 2008-08-22 2010-02-25 At & T Labs, Inc. System and method for selecting a multimedia presentation to accompany text
US8737817B1 (en) * 2011-02-08 2014-05-27 Google Inc. Music soundtrack recommendation engine for videos
KR20140137030A (en) * 2013-05-21 2014-12-02 삼성전자주식회사 Method and apparatus for providing information using messenger
CN103927366A (en) * 2014-04-21 2014-07-16 苏州大学 Method and system for automatically playing songs according to pictures
US20190205393A1 (en) * 2016-07-11 2019-07-04 Peking University Shenzhen Graduate School A cross-media search method
CN108319686A (en) * 2018-02-01 2018-07-24 北京大学深圳研究生院 Antagonism cross-media retrieval method based on limited text space
CN108647350A (en) * 2018-05-16 2018-10-12 中国人民解放军陆军工程大学 A kind of picture and text associative search method based on binary channels network
CN109063163A (en) * 2018-08-14 2018-12-21 腾讯科技(深圳)有限公司 A kind of method, apparatus, terminal device and medium that music is recommended
CN109783655A (en) * 2018-12-07 2019-05-21 西安电子科技大学 A kind of cross-module state search method, device, computer equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377971A (en) * 2021-05-31 2021-09-10 北京达佳互联信息技术有限公司 Multimedia resource generation method and device, electronic equipment and storage medium
CN113377971B (en) * 2021-05-31 2024-02-27 北京达佳互联信息技术有限公司 Multimedia resource generation method and device, electronic equipment and storage medium
CN113836343A (en) * 2021-09-14 2021-12-24 深圳Tcl新技术有限公司 Audio recommendation method and device, electronic equipment and storage medium
CN115662467A (en) * 2022-09-30 2023-01-31 恩平市炫音电子科技有限公司 Music intelligent playing control system and method based on big data


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant