CN115734024A - Audio data processing method, device, equipment and storage medium


Info

Publication number
CN115734024A
Authority
CN
China
Prior art keywords
audio
audio data
target
data
candidate
Prior art date
Legal status
Pending
Application number
CN202111017197.6A
Other languages
Chinese (zh)
Inventor
李伟卫
张逾
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111017197.6A priority Critical patent/CN115734024A/en
Publication of CN115734024A publication Critical patent/CN115734024A/en


Abstract

The embodiments of this application disclose an audio data processing method, apparatus, device and storage medium, relating to machine learning technology in artificial intelligence. The method includes: acquiring object feature information of a target object, video feature information of target video data belonging to the target object, and audio feature information of at least two candidate audio data associated with the target video data; fusing the audio feature information of the at least two candidate audio data with the video feature information of the target video data and the object feature information of the target object, respectively, to obtain audio fusion feature information of the at least two candidate audio data; performing audio recognition on the audio fusion feature information and the audio feature information with at least two target audio recognition models, respectively, to obtain target audio data for dubbing the target video data; and recommending the target audio data to the target object. The method and device can effectively improve the accuracy of recommending audio data.

Description

Audio data processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of machine learning technology in artificial intelligence, and in particular, to an audio data processing method, apparatus, device, and storage medium.
Background
With the development of Internet technology, people can record and publish video data (such as short videos) anytime and anywhere, and can watch video data published by others. Generally, when publishing video data, a user needs to select, on the terminal, audio data (such as background music) that fits the theme of the video data, and then use that audio data as its soundtrack. The audio data reinforces the theme of the video data, helping viewers grasp it more intuitively, and also enhances the appeal and rhythm of the video data. At present, the audio data used to score video data is usually selected manually; if the user lacks professional knowledge about audio, it is difficult to choose suitable audio data, so the accuracy of the selected audio data is low.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present application is to provide an audio data processing method, apparatus, device and storage medium, which can effectively improve the accuracy of recommending audio data.
An aspect of the present embodiment provides an audio data processing method, including:
acquiring object characteristic information of a target object, video characteristic information of target video data belonging to the target object, and audio characteristic information of at least two candidate audio data associated with the target video data;
respectively fusing the audio characteristic information of the at least two candidate audio data with the video characteristic information of the target video data and the object characteristic information of the target object to obtain audio fusion characteristic information of the at least two candidate audio data;
respectively carrying out audio recognition on the audio fusion characteristic information of the at least two candidate audio data and the audio characteristic information of the at least two candidate audio data by adopting at least two target audio recognition models to obtain target audio data for carrying out music matching on the target video data; the target audio data belongs to the at least two candidate audio data;
recommending the target audio data to the target object.
An aspect of the present embodiment provides an audio data processing method, including:
acquiring object characteristic information of a sample object, sample video data belonging to the sample object, sample audio data for dubbing music on the sample video data and a labeled audio matching degree of the sample audio data; the labeled audio matching degree is used for reflecting the matching degree between the sample audio data and the sample object and between the sample audio data and the sample video data;
performing video characteristic extraction on the sample video data to obtain video characteristic information of the sample video data, and performing audio characteristic extraction on the sample audio data to obtain audio characteristic information of the sample audio data;
fusing the audio characteristic information of the sample audio data with the video characteristic information of the sample video data and the object characteristic information of the sample object to obtain audio fusion characteristic information of the sample audio data;
and respectively adjusting at least two candidate audio recognition models according to the matching degree of the marked audio, the audio characteristic information of the sample audio data and the audio fusion characteristic information of the sample audio data to obtain at least two target audio recognition models.
An aspect of an embodiment of the present application provides an audio data processing apparatus, including:
an obtaining module, configured to obtain object feature information of a target object, video feature information of target video data belonging to the target object, and audio feature information of at least two candidate audio data associated with the target video data;
the fusion module is used for respectively fusing the audio characteristic information of the at least two candidate audio data with the video characteristic information of the target video data and the object characteristic information of the target object to obtain the audio fusion characteristic information of the at least two candidate audio data;
the identification module is used for respectively carrying out audio identification on the audio fusion characteristic information of the at least two candidate audio data and the audio characteristic information of the at least two candidate audio data by adopting at least two target audio identification models to obtain target audio data for carrying out music matching on the target video data; the target audio data belongs to the at least two candidate audio data;
and the recommending module is used for recommending the target audio data to the target object.
An embodiment of the present application provides an audio data processing apparatus, including:
the system comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring object characteristic information of a sample object, sample video data belonging to the sample object, sample audio data for dubbing music on the sample video data and the labeled audio matching degree of the sample audio data; the labeled audio matching degree is used for reflecting the matching degree between the sample audio data and the sample object and between the sample audio data and the sample video data;
the extraction module is used for performing video characteristic extraction on the sample video data to obtain video characteristic information of the sample video data, and performing audio characteristic extraction on the sample audio data to obtain audio characteristic information of the sample audio data;
the fusion module is used for fusing the audio characteristic information of the sample audio data with the video characteristic information of the sample video data and the object characteristic information of the sample object to obtain audio fusion characteristic information of the sample audio data;
and the adjusting module is used for respectively adjusting the at least two candidate audio recognition models according to the matching degree of the marked audio, the audio feature information of the sample audio data and the audio fusion feature information of the sample audio data to obtain at least two target audio recognition models.
One aspect of the present application provides a computer device, comprising: a processor and a memory;
wherein, the memory is used for storing computer programs, and the processor is used for calling the computer programs to execute the following steps:
acquiring object characteristic information of a target object, video characteristic information of target video data belonging to the target object, and audio characteristic information of at least two candidate audio data associated with the target video data;
respectively fusing the audio characteristic information of the at least two candidate audio data with the video characteristic information of the target video data and the object characteristic information of the target object to obtain audio fusion characteristic information of the at least two candidate audio data;
respectively carrying out audio recognition on the audio fusion characteristic information of the at least two candidate audio data and the audio characteristic information of the at least two candidate audio data by adopting at least two target audio recognition models to obtain target audio data for carrying out music matching on the target video data; the target audio data belongs to the at least two candidate audio data;
recommending the target audio data to the target object.
Wherein, the memory is used for storing computer programs, and the processor is used for calling the computer programs to execute the following steps:
acquiring object characteristic information of a sample object, sample video data belonging to the sample object, sample audio data for dubbing the sample video data and a labeled audio matching degree of the sample audio data; the labeled audio matching degree is used for reflecting the matching degree between the sample audio data and the sample object and between the sample audio data and the sample video data;
performing video characteristic extraction on the sample video data to obtain video characteristic information of the sample video data, and performing audio characteristic extraction on the sample audio data to obtain audio characteristic information of the sample audio data;
fusing the audio characteristic information of the sample audio data with the video characteristic information of the sample video data and the object characteristic information of the sample object to obtain audio fusion characteristic information of the sample audio data;
and respectively adjusting at least two candidate audio recognition models according to the matching degree of the marked audio, the audio characteristic information of the sample audio data and the audio fusion characteristic information of the sample audio data to obtain at least two target audio recognition models.
An aspect of the embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, where the computer program includes program instructions, and the program instructions, when executed by a processor, perform the steps of the method.
An aspect of the embodiments of the present application provides a computer program product, which includes a computer program or instructions that, when executed by a processor, implement the steps of the above method.
In the application, the audio fusion characteristic information of the at least two candidate audio data is obtained by fusing the audio characteristic information of the at least two candidate audio data with the object characteristic information of the target object and the video characteristic information of the target video data, namely, the multi-modal characteristic information is fused, so that more information can be provided for the recommended audio data, and the accuracy of the recommended audio data is improved. Further, identifying audio fusion characteristic information of at least two candidate audio data and audio characteristic information of at least two candidate audio data by adopting at least two target audio identification models respectively to obtain target audio data for matching music of the target video data, and recommending the target audio data to a target object; the audio recognition results of the multi-modal audio recognition model are comprehensively considered, and the audio data are automatically recommended to the target object, so that the efficiency of recommending the audio data can be improved; meanwhile, the advantages of different audio recognition models are fully utilized, the problem that the accuracy of recommended audio data is low due to deviation of a single model can be effectively avoided, and the recommended audio data is more stable, more accurate and more reliable.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a block diagram of an audio data processing system provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a scenario of single-modality-based audio data recommendation provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a scenario of multi-modality-based audio data recommendation provided in an embodiment of the present application;
FIG. 4 is a flow diagram of an audio data processing method provided in an embodiment of the present application;
FIG. 5 is a flow diagram of an audio data processing method provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a scenario of acquiring object feature information of a target object provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a scenario of acquiring audio feature information of candidate audio data provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of a scenario of acquiring video feature information of target video data provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a scenario of audio fusion feature information of candidate audio data provided in an embodiment of the present application;
FIG. 10 is a flow diagram of an audio data processing method provided in an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an audio data processing apparatus provided in an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an audio data processing apparatus provided in an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
This application relates generally to machine learning techniques in Artificial Intelligence (AI). Artificial intelligence is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation and the like.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and teaching-based learning.
In order to facilitate a clearer understanding of the present application, an audio data processing system implementing the audio data processing method of the present application is first introduced. As shown in fig. 1, the audio data processing system includes a server and a terminal.
The terminal may refer to a user-oriented device, and the terminal may include a multimedia application platform (i.e., a multimedia application program) for playing multimedia data (such as audio and video data); the multimedia application platform may refer to a multimedia website platform (such as forum and post), a social application platform, a shopping application platform, a content interaction platform (such as an audio-video playing application platform), and the like. The server may refer to a device for providing a multimedia background service, and may be specifically configured to identify audio data for dubbing video data and recommend the audio data to a user.
The server may be an independent physical server, a server cluster or distributed system formed by at least two physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), big data and artificial intelligence platforms. The terminal may be, but is not limited to, an intelligent vehicle-mounted terminal, a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent sound box, a sound box with a screen, a smart watch, a smart television, and the like. Each terminal and each server may be directly or indirectly connected through wired or wireless communication, and the number of terminals and servers may be one or at least two, which is not limited herein.
The audio data recommendation method can be realized based on the audio data processing system, and comprises a single-mode-based audio data recommendation method and a multi-mode-based audio data recommendation method. As shown in fig. 2, the audio data recommendation method based on the single modality is to analyze audio feature information of at least two candidate audio data, video feature information of target video data, and object feature information of a target object by using an audio recognition model to obtain target audio data for performing music matching on the target video data. Specifically, as shown in fig. 2, the audio data recommendation method based on single modality includes a training process of a candidate audio recognition model and a process of recognizing target audio data by using a target audio recognition model. As shown in fig. 2, the training process of the candidate audio recognition model includes the following steps 1-2:
1. The server obtains training samples for training the candidate audio recognition models. A candidate audio recognition model is a model that has yet to be trained to obtain the score for video data, that is, a recognition model whose audio data recognition accuracy is still relatively low. The candidate audio recognition model may be a classifier, which may be one of a machine learning model, a deep learning model, a graph network model, and the like. Machine learning models include SVM (Support Vector Machine), FM (Factorization Machines) and XGBoost (eXtreme Gradient Boosting); deep learning models include DNN (Deep Neural Networks), W&D (Wide & Deep Learning for Recommender Systems) and the like; graph network models include DeepWalk, GNN (Graph Neural Networks), GCN (Graph Convolutional Networks) and the like. In order to improve the audio data recognition accuracy of the candidate audio recognition model, the server may first acquire object feature information of a sample object, sample video data belonging to the sample object, and sample audio data for dubbing the sample video data from the terminal. The sample object may refer to a user who has published video data on the multimedia application platform, and the object feature information of the sample object includes the age, gender, hobbies and the like of the sample object. The sample video data may be video data published by the sample object on the multimedia application platform in a historical time period; it may be obtained by the sample object shooting a video, or by the sample object clipping video data downloaded from the Internet. The sample audio data refers to the audio data (e.g., background music, or speech data such as a verse recitation) with which the sample object dubbed the sample video data when publishing it. As shown in fig. 2, user 1 publishes video data 1 on the multimedia application platform, and the background music of video data 1 is music 1; ...; user N publishes video data N on the multimedia application platform, and the background music of video data N is music N. Video data 1 to video data N and audio data 1 to audio data N can be filtered, the filtered video data being used as sample video data and the filtered audio data being determined as sample audio data. The filtering process includes copyright filtering, quality filtering and tonality filtering: copyright filtering filters out audio data and video data without copyright, quality filtering filters out video data and audio data of relatively low quality (such as low definition), and tonality filtering filters out audio data whose rhythm does not meet the condition, such as audio data with excessive noise.
It should be noted that the object feature information in this scheme may refer to user portrait data obtained after authorization of a user, the audio data in this scheme may refer to audio data authorized by a creator of the audio data, and the video data in this scheme may refer to original video data or video data authorized by a creator. Further, the server may perform video feature extraction on the sample video data to obtain video feature information (i.e., a video portrait) of the sample video data; the video feature information of the sample video data is used to reflect the subject information, scene, color information, quality information, and the like of the sample video data. Similarly, the server may perform audio feature extraction on the sample audio data to obtain audio feature information (i.e., an audio portrait) of the sample audio data, where the audio feature information of the sample audio data is used to reflect lyric information of the sample audio data, score information, object feature information of a creator of the sample audio data, and the like. Then, the degree of matching of the annotation audio of the sample audio data is obtained, and the degree of matching of the annotation audio can be used for reflecting the degree of matching between the sample audio data and the sample object, sample video data. And determining the video characteristic information of the sample video data, the audio characteristic information of the sample audio data, the object characteristic information of the sample object and the labeled audio matching degree as training samples for training the candidate audio recognition model.
2. The server trains the candidate audio recognition model with the training samples to obtain a target audio recognition model. The server uses the candidate audio recognition model to perform audio prediction on the video feature information of the sample video data, the audio feature information of the sample audio data and the object feature information of the sample object, so as to obtain a predicted audio matching degree of the sample audio data; the candidate audio recognition model is then adjusted according to the predicted audio matching degree and the labeled audio matching degree, and the adjusted candidate audio recognition model is determined as the target audio recognition model.
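To make the training in step 2 concrete, below is a minimal Python sketch of fitting candidate audio recognition models against the labeled audio matching degree. The scikit-learn classifiers stand in for the candidate models, and all feature dimensions, array names and the synthetic data are illustrative assumptions rather than the application's prescribed implementation.

```python
# Minimal sketch of step 2: training candidate audio recognition models on
# (object, video, audio) features against the labeled audio matching degree.
# The classifiers, feature dimensions and synthetic data are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_samples = 1000

object_feat = rng.normal(size=(n_samples, 16))      # sample-object portrait features
video_feat = rng.normal(size=(n_samples, 32))       # sample-video features
audio_feat = rng.normal(size=(n_samples, 24))       # sample-audio features
labeled_match = rng.integers(0, 2, size=n_samples)  # labeled audio matching degree

# One simple choice: concatenate the three modalities as the model input.
X = np.hstack([object_feat, video_feat, audio_feat])

# Two candidate audio recognition models with different network attributes.
candidate_models = {
    "linear": LogisticRegression(max_iter=1000),
    "dnn": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500),
}

target_models = {}
for name, model in candidate_models.items():
    model.fit(X, labeled_match)                     # adjust the candidate model
    predicted_match = model.predict_proba(X)[:, 1]  # predicted audio matching degree
    target_models[name] = model                     # adjusted model -> target model
```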
As shown in fig. 2, the process of identifying target audio data by using the target audio identification model includes the following steps 3-5:
3. the server acquires object feature information of a target object, target video data belonging to the target object, and at least two candidate audio data associated with the target video data. The target object refers to a user (e.g., user W in fig. 2) who needs to publish video data to the multimedia application platform, the target video data may refer to video data (e.g., video W in fig. 2) to be published to the multimedia application platform, the target video data may be captured by the target object, or the target video data may be clipped by the target object on video data downloaded from the internet. The at least two candidate audio data may refer to audio data that matches attribute information of the subject information, scene, and the like of the target video data, and the at least two candidate audio data refer to audio data for which the target object has usage rights.
4. The server acquires video characteristic information of the target video data and audio characteristic information of at least two candidate audio data. The server can extract the video characteristics of the target video data to obtain the video characteristic information of the target video data; the video feature information of the target video data is used to reflect subject information, scenes, color information, quality information, and the like of the target video data. Similarly, the server may perform audio feature extraction on each candidate audio data to obtain audio feature information of each candidate audio data, where the audio feature information of the candidate audio data is used to reflect lyric information of the candidate audio data, score information, object feature information of a creator of the candidate audio data, and the like.
5. The server may identify the target audio data using the target audio recognition model. The server may perform audio recognition on the audio feature information of the at least two candidate audio data, the video feature information of the target video data, and the object feature information of the target object by using the target audio recognition model, to obtain target audio data for dubbing the target video data, and recommend the target audio data to the target object.
In practice, it is found that the audio recommendation result of the audio data recommendation method based on the single mode completely depends on the knowledge accumulation of the candidate audio recognition model, and if the process of the knowledge accumulation (i.e. the training process) of the candidate audio recognition model has a deviation, the accuracy of the recommended audio data is low. Based on this, the present application provides a multi-modal-based audio data recommendation method, as shown in fig. 3, the multi-modal-based audio data recommendation method refers to analyzing audio feature information of at least two candidate audio data, video feature information of target video data, and object feature information of a target object by using at least two audio recognition models to obtain target audio data for matching music with the target video data. As shown in fig. 3, compared with the audio data recommendation method based on a single modality, the audio data recommendation method based on multiple modalities makes the following improvements:
a. The improvements made to step 2 in the training process of the audio recognition model include the following: 1. The server may obtain at least two candidate audio recognition models, which may include at least two of a machine learning model, a deep learning model, a graph network model, and the like. The candidate audio recognition models have different network attributes, where a network attribute includes at least one of a network structure, network parameters, a network algorithm and the like; because the network attributes of the candidate audio recognition models differ, their feature processing capabilities also differ. For example, FM-based candidate audio recognition models are good at mining associations between feature information, while XGBoost-based candidate audio recognition models are good at mining key split points (e.g., key feature points in video data), and so on. 2. Multi-modal feature fusion: fusing the audio feature information of the sample audio data with the video feature information of the sample video data and the object feature information of the sample object to obtain audio fusion feature information of the sample audio data; the fused feature information of the sample audio data is used to reflect the preference of the sample object for the sample audio data, the association relationship between the sample audio data and the sample video data, and the like. 3. Multi-modal training: training the at least two candidate audio recognition models respectively with the fused audio feature information of the sample audio data and the audio feature information of the sample audio data to obtain at least two target audio recognition models.
b. The improvements made to step 5 in the process of identifying target audio data with the target audio recognition models include the following: 1. Multi-modal feature fusion: respectively fusing the audio feature information of the at least two candidate audio data with the video feature information of the target video data and the object feature information of the target object to obtain fused audio feature information of the at least two candidate audio data; the fused feature information of the candidate audio data is used to reflect the preference of the target object for the candidate audio data, the association relationship between the candidate audio data and the target video data, and the like. 2. Recommendation decision fusion: identifying the fused feature information of the at least two candidate audio data and the audio feature information of the at least two candidate audio data with the at least two target audio recognition models, respectively, to obtain target audio data for dubbing the target video data; and recommending the target audio data to the target object by synthesizing the audio recognition results of all the target audio recognition models, thereby realizing recommendation decision fusion.
In summary, in the multi-modal-based audio data recommendation method, the at least two candidate audio recognition models are trained with the fused audio feature information of the sample audio data and the audio feature information of the sample audio data to obtain at least two target audio recognition models, which avoids the problem of low accuracy of recommended audio data caused by a deviation of a single audio recognition model during knowledge accumulation. The audio feature information of the at least two candidate audio data is fused with the object feature information of the target object and the video feature information of the target video data, so that the target audio recognition models can mine the implicit relationships among the feature information, further improving the accuracy of recommending audio data. Recommending audio data by comprehensively considering the recognition results of the at least two target audio recognition models avoids the problem of low accuracy of recommended audio data caused by over-reliance on a single audio recognition model, and improves the accuracy of the recommended audio data.
In the present application, a modality refers to a source or form of information: each source or form of information may be called a modality. For example, humans have touch, hearing, vision and smell; information media include speech, video, text and so on; and there is a wide variety of sensors such as radar, infrared sensors and accelerometers. Each of these may be referred to as a modality. Thus, the multi-modal features of the present application may include at least two of video feature information, audio feature information and object feature information. The multi-modal audio recognition approach in the present application also relates to MultiModal Machine Learning (MMML), which aims to give machines the ability to process and understand multi-modal information through machine learning methods and to build models that can process and associate information from multiple modalities; it is a vigorous multidisciplinary field with extraordinary potential.
Further, please refer to fig. 4, which is a flowchart illustrating an audio data processing method according to an embodiment of the present application. As shown in fig. 4, the method may be performed by a computer device, which may refer to the terminal in fig. 1, or the computer device may refer to the server in fig. 1, or the computer device includes the terminal and the server in fig. 1, that is, the method may be performed by both the terminal and the server in fig. 1. The audio data processing method may include the following steps S101 to S104:
S101, acquiring object characteristic information of a target object, video characteristic information of target video data belonging to the target object and audio characteristic information of at least two candidate audio data associated with the target video data.
In this application, when a user needs to publish video data on a multimedia application platform, the user may be referred to as a target object, and the video data needing to be published may be referred to as target video data. In order to select appropriate background music for the target video, the computer device may acquire object feature information of the target object, target video data belonging to the target object, and at least two candidate audio data associated with the target video data. Further, video feature extraction is carried out on the target video data to obtain video feature information of the target video data, and audio feature extraction is carried out on at least two candidate audio data respectively to obtain audio feature information of the at least two candidate audio data.
The object characteristic information of the target object may be at least one of basic portrait feature information, multimedia portrait feature information and portrait association feature information of the target object. The basic portrait feature information is used to reflect basic information such as the age and gender of the target object; the multimedia portrait feature information is used to reflect the multimedia preferences of the target object, such as favorite movies, poems, music and favorite singers; the portrait association feature information is used to reflect the relationship between the basic portrait feature information and the multimedia portrait feature information, for example, the relative preference for singer A among users aged between [18,25] years. The video feature information of the target video data is used to reflect the subject information, scene, color information, quality information and the like of the target video data, and the audio feature information of the candidate audio data may be used to reflect the lyric information of the candidate audio data, the object feature information of its creator, score information and the like. The at least two candidate audio data may refer to audio data that matches the subject information, scene and the like of the target video data, or audio data played by the target object within a historical time period (e.g., the last week or the last month), or audio data composed by the target object, or currently popular music, such as audio data whose current play count is greater than a play count threshold. It should be noted that the audio data referred to in the present application may be music, speech data of verse recitation, speech data of storytelling, and the like.
S102, respectively fusing the audio characteristic information of the at least two candidate audio data with the video characteristic information of the target video data and the object characteristic information of the target object to obtain audio fusion characteristic information of the at least two candidate audio data.
In the application, the computer device may fuse audio feature information of a first candidate audio data of the at least two candidate audio data with video feature information of a target video data and object feature information of the target object to obtain audio fusion feature information of the first candidate audio data; similarly, the audio feature information of the second candidate audio data in the at least two candidate audio data is fused with the video feature information of the target video data and the object feature information of the target object, so as to obtain the audio fusion feature information of the second candidate audio data. By analogy, the audio fusion feature information of each candidate audio data in the at least two candidate audio data may be obtained.
It should be noted that the fusion here can be implemented as direct fusion or process fusion. Direct fusion may refer to directly merging two or more pieces of feature information into one piece of fused feature information; for example, assuming that the audio feature information of a candidate audio data is (1,2,3) and the video feature information of the target video data is (4,5,6), directly merging them yields the audio fusion feature information (1,2,3,4,5,6) of the candidate audio data. Direct fusion may also refer to combining two or more feature parameters that have an association relationship into one piece of fused feature information; for example, if audio feature parameter 2 in the audio feature information of the candidate audio data is associated with video feature parameter 5 in the video feature information of the target video data, combining the associated feature parameters yields the audio fusion feature information (2,5) of the candidate audio data. Process fusion refers to averaging two or more pieces of feature information, or taking their maximum values, to obtain one piece of fused feature information. For example, assuming that the audio feature information of the candidate audio data is (1,2,3) and the video feature information of the target video data is (4,5,6), averaging them yields the audio fusion feature information (2.5,3.5,4.5) of the candidate audio data, while maximum-value extraction yields the audio fusion feature information (4,5,6) of the candidate audio data.
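The following NumPy sketch simply reproduces the fusion modes described above on the toy vectors from the example; the array names are illustrative only.

```python
import numpy as np

audio_feat = np.array([1, 2, 3], dtype=float)  # audio feature information of a candidate audio
video_feat = np.array([4, 5, 6], dtype=float)  # video feature information of the target video

# Direct fusion: merge the two feature vectors -> (1, 2, 3, 4, 5, 6)
direct_fused = np.concatenate([audio_feat, video_feat])

# Direct fusion of associated parameters only: suppose audio parameter 2 is
# associated with video parameter 5 -> (2, 5)
associated_fused = np.array([audio_feat[1], video_feat[1]])

# Process fusion by averaging -> (2.5, 3.5, 4.5)
avg_fused = (audio_feat + video_feat) / 2

# Process fusion by element-wise maximum-value extraction -> (4, 5, 6)
max_fused = np.maximum(audio_feat, video_feat)
```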
S103, respectively carrying out audio recognition on the audio fusion characteristic information of the at least two candidate audio data and the audio characteristic information of the at least two candidate audio data by adopting at least two target audio recognition models to obtain target audio data for carrying out music matching on the target video data; the target audio data belongs to the at least two candidate audio data.
In the application, the computer device may respectively perform audio recognition on the audio fusion feature information of the at least two candidate audio data and the audio feature information of the at least two candidate audio data by using at least two target audio recognition models to obtain at least two audio recognition results, and determine target audio data for performing dubbing music on the target video data from the at least two candidate audio data according to the at least two audio recognition results; the target audio data are determined by fusing the audio recognition results of the plurality of target audio recognition models, the advantages of different audio recognition models are fully utilized, the problem that the accuracy of recommended audio data is low due to deviation of a single model can be effectively avoided, and the recommended audio data can be improved to be more stable, accurate and reliable.
It should be noted that the audio recognition manner includes undifferentiated recognition and differentiated recognition. Undifferentiated recognition means that every target audio recognition model processes both kinds of feature information. For example, if the at least two target audio recognition models include a first target audio recognition model and a second target audio recognition model, the computer device may perform audio recognition on the audio fusion feature information with the first target audio recognition model to obtain a first audio recognition result, and then perform audio recognition on the audio feature information of the at least two candidate audio data with the first target audio recognition model to obtain a second audio recognition result. Similarly, the audio fusion feature information is recognized with the second target audio recognition model to obtain a third audio recognition result, and the audio feature information of the at least two candidate audio data is recognized with the second target audio recognition model to obtain a fourth audio recognition result. Further, target audio data for dubbing the target video data is determined according to the first, second, third and fourth audio recognition results. Here, the first audio recognition result and the third audio recognition result are used to reflect the audio joint matching degree between each candidate audio data and the target object and the target video data, where the audio joint matching degree specifically reflects the preference degree of the target object for the candidate audio data and the matching degree between the candidate audio data and the target video data. The second audio recognition result and the fourth audio recognition result are used to reflect the suitability of each candidate audio data for use as a soundtrack (i.e., the audio self-matching degree).
Differentiated recognition means that different target audio recognition models process different kinds of feature information. For example, the computer device may perform audio recognition on the audio fusion feature information of the at least two candidate audio data with the first target audio recognition model to obtain a fifth audio recognition result, and perform audio recognition on the audio feature information of the at least two candidate audio data with the second target audio recognition model to obtain a sixth audio recognition result. Further, target audio data for dubbing the target video data is determined according to the fifth audio recognition result and the sixth audio recognition result; here, the fifth audio recognition result is used to reflect the audio joint matching degree between each candidate audio data and the target object and the target video data, and the sixth audio recognition result is used to reflect the suitability of each candidate audio data for use as a soundtrack (i.e., the audio self-matching degree).
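As one possible illustration of how the recognition results of several target models could be combined in the undifferentiated case, the sketch below assumes scikit-learn-style classifiers exposing predict_proba and combines the results with a plain unweighted sum; neither the interface nor the weighting is prescribed by the application.

```python
import numpy as np

def recognize_target_audio(target_models, fusion_feats, audio_feats):
    """Undifferentiated-recognition sketch: every target audio recognition model
    scores both the audio fusion features (audio joint matching degree) and the
    plain audio features (audio self-matching degree); the scores of all models
    are accumulated and the best-scoring candidate audio data is selected."""
    total_match = np.zeros(len(fusion_feats))
    for model in target_models:
        total_match += model.predict_proba(fusion_feats)[:, 1]  # joint matching degree
        total_match += model.predict_proba(audio_feats)[:, 1]   # self-matching degree
    return int(np.argmax(total_match))  # index of the target audio data
```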
S104, recommending the target audio data to the target object.
In this application, the number of the target audio data may be one or more, and when the number of the target audio data is one, the computer device may display the target audio data in a distribution interface of the target video data, and in response to a selection request for the target audio data, match a music with the target video data using the target audio data. When the number of the target audio data is multiple, the computer device may sequentially display each target audio data in the distribution interface of the target video data according to a total matching degree of each target audio data (where the total matching degree may be determined according to the above-mentioned audio joint matching degree and audio self-matching degree). For example, the target audio data may be displayed in the distribution interface of the target video data simultaneously in the descending order of the total matching degree of the target audio data, or the target audio data may be displayed in the distribution interface of the target video data in a rolling manner in the descending order of the total matching degree of the target audio data, for example, the target audio data with the total matching degree sorted in the range of 1-10 is displayed in the distribution interface of the target video data at the first time, and the target audio data with the total matching degree sorted in the range of 11-20 is displayed at the second time. Then, in response to a selection operation for any one of the plurality of target audio data, the target video data may be dubbed with the selected target audio data. Through the audio recognition model, the audio data can be automatically recommended to the target object, the accuracy of recommending the audio data is improved, manual participation is not needed, and the efficiency of recommending the audio data is improved.
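A small sketch of the ordering logic above; treating the total matching degree as the plain sum of the joint and self-matching degrees, and a page size of ten, are assumptions made for illustration.

```python
def rank_target_audio(candidates, page_size=10):
    """Rank target audio data for display: candidates is a list of
    (audio_id, joint_matching_degree, self_matching_degree) tuples; the total
    matching degree is their sum, and results are paged ten at a time
    (ranks 1-10 first, then 11-20, and so on)."""
    ranked = sorted(candidates, key=lambda c: c[1] + c[2], reverse=True)
    return [ranked[i:i + page_size] for i in range(0, len(ranked), page_size)]
```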
Optionally, each target audio recognition model may be obtained by training each candidate audio recognition model according to sample video data belonging to a sample object, sample audio data for dubbing the sample video data, object feature information of the sample object, and a labeled audio matching degree, where the labeled audio matching degree may be determined according to object behavior data about the sample video data, and the object behavior data includes at least one of the like count, follow count, share count, favorite count, click count and the like of audience users for the sample video data. That is, the object behavior data reflects to some extent the preference of audience users for the sample video data and the sample audio data, and a target audio recognition model trained in this way has the ability to recommend audio data to a creator (the creator of video data) based on audience users' preferences for video data and audio data. In summary, the audio recognition results output by the target audio recognition models can reflect not only the preference degree of the target object (the creator of the target video) for the candidate audio data, the matching degree between the candidate audio data and the target video data, and the suitability of the candidate audio data for use as a soundtrack, but also, to some extent, the audience users' preference for the candidate audio data. Therefore, recommending target audio data by synthesizing the audio recognition results of the target audio recognition models conveys the multimedia (i.e., audio data and video data) preferences of audience users to creators, effectively breaks down barriers between creators and audience users, broadens creators' creative ideas, and at the same time helps produce, under the guidance of the recommendation, more works liked by both creators and audience users.
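One possible (entirely assumed) way to turn such object behavior data into a labeled audio matching degree is sketched below; the weights and threshold are invented for illustration and are not specified by the application.

```python
def label_audio_matching(likes, follows, shares, favorites, clicks):
    """Derive a hypothetical labeled audio matching degree from viewer behavior
    data on a sample video; the weights and threshold are illustrative only."""
    engagement = likes + 2 * follows + 3 * shares + 2 * favorites
    rate = engagement / max(clicks, 1)  # engagement relative to click count
    return 1 if rate >= 0.5 else 0      # 1 = audio judged well matched
```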
In the application, the audio fusion characteristic information of the at least two candidate audio data is obtained by fusing the audio characteristic information of the at least two candidate audio data with the object characteristic information of the target object and the video characteristic information of the target video data, namely, the multi-modal characteristic information is fused, so that more information can be provided for the recommended audio data, and the accuracy of the recommended audio data is improved. Further, identifying audio fusion characteristic information of at least two candidate audio data and audio characteristic information of at least two candidate audio data by adopting at least two target audio identification models respectively to obtain target audio data for matching music of the target video data, and recommending the target audio data to a target object; the audio recognition results of the multi-modal audio recognition model are comprehensively considered, and the audio data are automatically recommended to the target object, so that the efficiency of recommending the audio data can be improved; meanwhile, the advantages of different audio recognition models are fully utilized, the problem that the accuracy of recommended audio data is low due to deviation of a single model can be effectively avoided, and the recommended audio data is more stable, more accurate and more reliable.
Further, please refer to fig. 5, which is a flowchart illustrating an audio data processing method according to an embodiment of the present application. As shown in fig. 5, the method may be performed by a computer device, which may refer to the terminal in fig. 1, or the computer device may refer to the server in fig. 1, or the computer device includes the terminal and the server in fig. 1, that is, the method may be performed by both the terminal and the server in fig. 1. The audio data processing method may include the following steps S201 to S206:
S201, acquiring object characteristic information of a target object, video characteristic information of target video data belonging to the target object and audio characteristic information of at least two candidate audio data associated with the target video data.
Optionally, the acquiring the object feature information of the target object in step S201 may include the following steps S11 to S13:
s11, acquiring the basic portrait feature information and the multimedia portrait feature information of the target object.
s12, performing portrait association recognition on the basic portrait feature information and the multimedia portrait feature information of the target object to obtain portrait association feature information.
s13, determining the basic portrait feature information, the multimedia portrait feature information and the portrait association feature information of the target object as the object feature information of the target object.
In steps s11 to s13, as shown in fig. 6, the computer device may acquire the basic portrait feature information and the multimedia portrait feature information of the target object; the basic portrait feature information includes basic attribute features such as age and gender, and the multimedia portrait feature information includes the singers, movie actors, favorite movies, songs and the like that the target object likes. Further, an association recognition model may perform portrait association recognition on the basic portrait feature information and the multimedia portrait feature information of the target object to obtain portrait association feature information, which reflects the implicit association relationship between the basic portrait feature information and the multimedia portrait feature information of the target object. As shown in fig. 6, the association recognition model may be a deep neural network composed of a plurality of neural network layers, in which the output of one layer is propagated forward as the input of the next layer; through the deep neural network, the implicit relationship between the basic portrait feature information and the multimedia portrait feature information of the target object can be mined, improving the expressive power of the feature information. Then, the basic portrait feature information, the multimedia portrait feature information and the portrait association feature information of the target object are determined as the object feature information of the target object. By mining the implicit relationship between the basic portrait feature information and the multimedia portrait feature information of the target object, a rich amount of information is provided for recommending audio data, improving the accuracy of recommending audio data.
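A minimal NumPy sketch of the association recognition model as a feed-forward deep neural network; the layer sizes, ReLU activations and random toy weights are chosen purely for illustration.

```python
import numpy as np

def portrait_association(basic_portrait, multimedia_portrait, layers):
    """Concatenate the two portrait feature vectors and propagate them forward
    layer by layer to obtain portrait association feature information."""
    x = np.concatenate([basic_portrait, multimedia_portrait])
    for W, b in layers:
        x = np.maximum(W @ x + b, 0.0)  # output of one layer feeds the next
    return x

# Toy usage with random (hypothetical) weights, a 4-dim basic portrait and an
# 8-dim multimedia portrait.
rng = np.random.default_rng(1)
layers = [(rng.normal(size=(16, 12)), np.zeros(16)),
          (rng.normal(size=(8, 16)), np.zeros(8))]
assoc_feat = portrait_association(rng.normal(size=4), rng.normal(size=8), layers)
```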
Optionally, the obtaining of the audio feature information of the at least two candidate audio data associated with the target video data in step S201 may include the following steps S21 to S25:
and s21, acquiring at least two candidate audio data associated with the target video data.
And s22, determining object characteristic information of creators of the at least two candidate audio data.
And s23, performing lyric feature extraction on the at least two candidate audio data to obtain lyric feature information of the at least two candidate audio data.
And s24, performing score feature extraction on the at least two candidate audio data to obtain score feature information of the at least two candidate audio data.
And s25, fusing the object characteristic information of the creator, the lyric characteristic information of the at least two candidate audio data and the score characteristic information of the at least two candidate audio data to obtain the audio characteristic information of the at least two candidate audio data.
In steps s21 to s25, as shown in fig. 7, when the candidate audio data is music, the computer device may acquire video attributes of the target video data, such as theme information and scene information (e.g., the shooting scene), and acquire at least two candidate audio data associated with the target video data according to the video attributes. Then, creator information (namely singer information) corresponding to the creators of the at least two candidate audio data is acquired; the creator information includes the basic portrait feature information and the multimedia portrait feature information of the creator. An association recognition model (such as a deep neural network) is adopted to perform association identification on the basic portrait feature information and the multimedia portrait feature information of the creator to obtain the portrait association feature information of the creator, and the portrait association feature information, the basic portrait feature information and the multimedia portrait feature information of the creator are determined as the object feature information of the creator; the object feature information of the creator may be referred to as a song meta-information vector. Then, text conversion may be performed on the at least two candidate audio data to obtain text information of the at least two candidate audio data, word segmentation may be performed on the text information to obtain a plurality of segmented words, and subject entity words of each candidate audio data may be extracted from the segmented words by using a word statistical method such as TF-IDF (Term Frequency-Inverse Document Frequency) or WordRank; a subject entity word may refer to a keyword of the candidate audio data, that is, a word representing the subject of the candidate audio data. The subject entity words of the candidate audio data are converted into lyric vectors by adopting a word vector conversion model such as Word2Vec or BERT, and the lyric vectors may be called lyric feature information. Then, the at least two candidate audio data may be pre-emphasized, framed, and so on to obtain the score feature information of the at least two candidate audio data, and the object feature information of the creator, the lyric feature information of the at least two candidate audio data, and the score feature information of the at least two candidate audio data are fused to obtain the audio feature information of the at least two candidate audio data.
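The lyric branch described above can be pictured with a short sketch. It assumes scikit-learn's TfidfVectorizer as a stand-in for the TF-IDF statistics and whitespace tokenization in place of the word segmentation step; the function name and the number of keywords are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer  # stand-in TF-IDF implementation
import numpy as np

def subject_entity_words(lyrics_texts, k=5):
    """Pick the k words with the highest TF-IDF weight in each lyric text
    as its subject entity words (keywords representing the song's subject)."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(lyrics_texts).toarray()
    vocab = np.array(vectorizer.get_feature_names_out())
    return [vocab[np.argsort(row)[::-1][:k]].tolist() for row in tfidf]

# The keyword lists would then be mapped to lyric vectors by a word-vector
# model (e.g. Word2Vec or BERT) and concatenated with the creator features
# and score features to form the audio feature information.
print(subject_entity_words(["sunset on the beach we dance all night",
                            "rainy city lights and a lonely train"]))
```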
Optionally, the step s24 may include steps s31 to s33 as follows:
s31, performing frame division processing on the candidate audio data Yi in the at least two candidate audio data to obtain at least two frames of audio data belonging to the candidate audio data Yi; i is a positive integer less than or equal to N, and N is the number of the candidate audio data in the at least two candidate audio data.
And s32, performing frequency domain transformation on at least two frames of audio data belonging to the candidate audio data Yi to obtain frequency domain information of the candidate audio data Yi.
And s33, performing score feature extraction on the frequency domain information of the candidate audio data Yi to obtain score feature information of the at least two candidate audio data.
In steps s31 to s33, as shown in fig. 7, the computer device may pre-emphasize each candidate audio data to obtain processed candidate audio data; the pre-emphasis processing serves to eliminate the effect of the vocal cords and lips during sound production, so as to compensate the high-frequency part of the speech signal suppressed by the production system and to highlight the high-frequency formants. Then, framing processing is performed on each processed candidate audio data according to framing parameters to obtain at least two frames of audio data for each candidate audio data; the framing parameters may include a frame length and a frame shift, for example, the frame length may be 20 to 40 ms and the frame shift may be 10 ms. Then, windowing may be performed on each frame of audio data so that the attenuation at both ends of each frame approaches zero; frequency domain transformation is performed on each windowed frame to obtain the frequency domain information of each candidate audio data, which reflects the frequency and amplitude of the candidate audio data. Then, score feature extraction may be performed on the frequency domain information of each candidate audio data to obtain the score feature information of each candidate audio data, which reflects parameters such as the frequency and energy of the candidate audio data. Obtaining the score feature information of the candidate audio data through pre-emphasis, frequency domain transformation, and the like reduces the complexity of acquiring the score feature information and improves its significance.
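A compact NumPy sketch of the pre-emphasis, framing, windowing, and frequency-domain transformation chain described above; the pre-emphasis coefficient, frame length, and frame shift are the illustrative values mentioned in the text, not mandated parameters.

```python
import numpy as np

def frame_and_transform(signal, sr, frame_ms=25, hop_ms=10, pre_emph=0.97):
    """Pre-emphasis -> framing -> Hamming window -> per-frame FFT magnitude."""
    # pre-emphasis compensates the suppressed high-frequency part of the signal
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)              # windowing: taper both ends toward zero
    return np.abs(np.fft.rfft(frames, axis=1))   # frequency-domain information per frame

spectrum = frame_and_transform(np.random.randn(16000), sr=16000)
```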
Optionally, the step s33 may include steps s41 to s43 as follows:
and s41, determining energy information of the candidate audio data Yi according to the frequency domain information of the candidate audio data Yi.
And s42, filtering the energy information of the candidate audio data Yi to obtain filtered energy information.
And s43, determining the energy information after the filtering processing as score characteristic information of the at least two candidate audio data.
In steps s41 to s43, as shown in fig. 7, the computer device may determine the energy information of the candidate audio data Yi according to the frequency domain information of the candidate audio data Yi. Because the range of frequencies that the human ear can perceive is limited, audio components at frequencies the human ear cannot perceive can be regarded as noise; therefore, a filter may be generated according to the auditory characteristics of the human ear, and the energy information of the candidate audio data Yi may be filtered by the filter to obtain the filtered energy information. Further, the filtered energy information is determined as the score feature information of the at least two candidate audio data. By filtering the energy information of the candidate audio data, the problem of low accuracy of the acquired score feature information caused by noise interference can be effectively avoided; subsequent processing of invalid noise is also avoided, which saves processing resources.
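Continuing the sketch, the energy and filtering steps might look like the following, assuming librosa's mel filterbank as one possible filter "generated according to the auditory characteristics of the human ear"; the library choice and the number of mel bands are assumptions made for illustration.

```python
import numpy as np
import librosa  # assumed available; any perceptual filterbank implementation would do

def filterbank_score_features(magnitude_frames, sr, n_mels=40):
    """Energy information -> filtered energy -> score feature information.
    The mel filterbank plays the role of the filter in step s42: components
    outside what the human ear perceives contribute little after filtering."""
    power = magnitude_frames ** 2                        # energy information per frame
    n_fft = 2 * (magnitude_frames.shape[1] - 1)
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.log(power @ mel_fb.T + 1e-10)              # filtered (log) energies

# e.g. applied to per-frame FFT magnitudes like those produced by the framing sketch above
mags = np.abs(np.fft.rfft(np.random.randn(100, 400), axis=1))
score_features = filterbank_score_features(mags, sr=16000)
```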
Optionally, the acquiring the video feature information of the target video data belonging to the target object in step S201 may include the following steps S51 to S54:
and s51, acquiring target video data belonging to the target object.
And s52, extracting at least two key video frames of the target video data.
And s53, performing video feature extraction on the at least two key video frames to obtain video feature information of the at least two key video frames.
And s54, fusing the video characteristic information of the at least two key video frames to obtain the video characteristic information of the target video data.
In steps s51 to s54, as shown in fig. 8, the computer device may acquire the target video data belonging to the target object and extract at least two key video frames (i.e., representative frames) of the target video data, where a key video frame may refer to a video frame in the target video data that reflects the subject information of the target video data. Further, video feature extraction is performed on the at least two key video frames by a video feature extraction network to obtain the video feature information of the at least two key video frames. As shown in fig. 8, the video feature extraction network may be a Convolutional Neural Network (CNN) composed of a plurality of convolutional layers and pooling layers. In a convolutional layer, the input to each node is only a small block of the previous layer (typically of size 3×3 or 5×5), and the layer analyzes each small block more deeply to obtain features with a higher level of abstraction. A pooling layer does not change the depth of the three-dimensional matrix, but it reduces the size of the matrix, which in turn reduces the number of nodes in the final fully connected layer and thus the number of parameters in the whole network. In this way, the convolutional neural network can extract deeper video feature information with low redundancy. Then, the video feature information of the at least two key video frames can be fused to obtain the video feature information of the target video data. By extracting video features from the key video frames, the video feature information implicit in the target video can be mined and the redundancy of the video feature information reduced.
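A toy PyTorch version of the key-frame encoder: a couple of 3×3 convolutions and pooling layers followed by mean-pooling over the key frames, purely to illustrate the shapes involved. A real system would use a much deeper backbone, and all dimensions here are assumptions.

```python
import torch
import torch.nn as nn

class KeyFrameEncoder(nn.Module):
    """Illustrative CNN feature extractor for key video frames."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, out_dim)

    def forward(self, frames):                  # frames: (num_keyframes, 3, H, W)
        feats = self.fc(self.conv(frames).flatten(1))
        return feats.mean(dim=0)                # fuse key-frame features into one video vector

video_feature = KeyFrameEncoder()(torch.randn(8, 3, 224, 224))
```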
S202, respectively fusing the audio characteristic information of the at least two candidate audio data with the video characteristic information of the target video data and the object characteristic information of the target object to obtain audio fusion characteristic information of the at least two candidate audio data.
Optionally, step S202 may include the following steps S61 to S63:
and s61, fusing the audio feature information of the at least two candidate audio data with the video feature information of the target video data to obtain first fusion feature information, and fusing the audio feature information of the at least two candidate audio data with the object feature information of the target object to obtain second fusion feature information.
And s62, fusing the audio characteristic information of the at least two candidate audio data, the video characteristic information of the target video data and the object characteristic information of the target object to obtain third fusion characteristic information.
And s63, determining the first fusion characteristic information, the second fusion characteristic information and the third fusion characteristic information as the audio fusion characteristic information of the at least two candidate audio data.
In steps s61 to s63, the computer device may fuse the audio feature information of the at least two candidate audio data and the video feature information of the target video data by using a direct fusion method or a processing fusion method to obtain first fusion feature information, and fuse the audio feature information of the at least two candidate audio data and the object feature information of the target object by using a direct fusion method or a processing fusion method to obtain second fusion feature information. Similarly, the audio feature information of the at least two candidate audio data, the video feature information of the target video data, and the object feature information of the target object are fused by using a direct fusion mode or a processing fusion mode to obtain third fusion feature information. The first fusion feature information, the second fusion feature information, and the third fusion feature information may be determined as audio fusion feature information of the at least two candidate audio data; alternatively, the first fusion feature information and the second fusion feature information may be determined as audio fusion feature information of the at least two candidate audio data; alternatively, the first fusion feature information and the third fusion feature information may be determined as audio fusion feature information of the at least two candidate audio data, or the second fusion feature information and the third fusion feature information may be determined as audio fusion feature information of the at least two candidate audio data.
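In the simplest "direct fusion" reading, the three kinds of audio fusion feature information reduce to concatenations of the corresponding vectors, for example (all dimensions are arbitrary):

```python
import torch

def fuse(*features):
    """Direct fusion: concatenate feature vectors along the last dimension."""
    return torch.cat(features, dim=-1)

audio_feat  = torch.randn(64)   # per candidate audio data
video_feat  = torch.randn(128)  # target video data
object_feat = torch.randn(112)  # target object

first_fusion  = fuse(audio_feat, video_feat)               # audio + video
second_fusion = fuse(audio_feat, object_feat)              # audio + object
third_fusion  = fuse(audio_feat, video_feat, object_feat)  # audio + video + object
```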
Optionally, when the manner of extracting the associated parameter (i.e. the direct fusion manner) is selected to fuse the feature information, step s61 may include the following steps s71 to s74:
s71, acquiring a first video characteristic parameter and a first audio characteristic parameter which have an association relation; the first video characteristic parameter belongs to video characteristic information of the target video data, and the first audio characteristic parameter belongs to audio characteristic information of the at least two candidate audio data;
s72, generating first fusion feature information according to the first video feature parameter and the first audio feature parameter;
s73, acquiring a first object characteristic parameter and a second audio characteristic parameter which have an incidence relation; the first object feature parameter belongs to object feature information of the target object, and the second audio feature parameter belongs to audio feature information of the at least two candidate audio data;
and s74, generating second fusion characteristic information according to the first object characteristic parameter and the second audio characteristic parameter.
In steps s71 to s74, the computer device may obtain a first video feature parameter and a first audio feature parameter having an association relationship; the first video feature parameter and the first audio feature parameter having an association relationship may refer to a video feature parameter and an audio feature parameter that have a positive effect on recommending audio data, and first fusion feature information may be generated according to the first video feature parameter and the first audio feature parameter. Similarly, a first object feature parameter and a second audio feature parameter having an association relationship are acquired; these may refer to an object feature parameter and an audio feature parameter that have a positive effect on recommending audio data. Then, second fusion feature information is generated according to the first object feature parameter and the second audio feature parameter. Extracting the video feature parameters and audio feature parameters that have an association relationship from the video feature information and audio feature information helps to mine the implicit information and implicit relationships in them, greatly reduces the dependence on manual work, and improves the accuracy of recommending audio data.
Optionally, when the manner of extracting the associated parameters is selected to merge the feature information, step s62 may include the following steps s75 to s76:
s75, acquiring a second object characteristic parameter, a second video characteristic parameter and a third audio characteristic parameter which have an incidence relation; the second object feature parameter belongs to object feature information of the target object, the second video feature parameter belongs to video feature information of the target video data, and the third audio feature parameter belongs to audio feature information of the at least two candidate audio data.
And s76, generating third fusion characteristic information according to the second object characteristic parameter, the second video characteristic parameter and the third audio characteristic parameter.
In steps s75 to s76, the computer device may obtain a second object feature parameter, a second video feature parameter, and a third audio feature parameter having an association relationship; these refer to an object feature parameter, a video feature parameter, and an audio feature parameter that have a positive effect on recommending audio data. Then, third fusion feature information may be generated according to the second object feature parameter, the second video feature parameter, and the third audio feature parameter. Extracting the video feature parameters, audio feature parameters, and object feature parameters that have an association relationship from the video feature information, audio feature information, and object feature information helps to mine the implicit information and implicit relationships in them, greatly reduces the dependence on manual work, and improves the accuracy of recommending audio data.
S203, respectively determining a first target audio recognition model matching the audio fusion feature information of the at least two candidate audio data and a second target audio recognition model matching the audio feature information of the at least two candidate audio data from the at least two target audio recognition models.
In this application, the computer device may process different characteristic information using different target audio recognition models, and specifically, the computer device may select the target audio recognition model in a random selection manner or in a feature processing capability selection manner. For example, when the computer device employs a random selection manner, a target audio recognition model is randomly selected from the at least two target audio recognition models as a first target audio recognition model matching the audio fusion feature information of the at least two candidate audio data, and a target audio recognition model is randomly selected from the remaining target audio recognition models as a second target audio recognition model matching the audio feature information of the at least two candidate audio data.
Alternatively, when the target audio recognition model is selected according to the feature processing capability, step S203 may include the following steps: acquiring feature processing capability information of the at least two target audio recognition models; and, according to the feature processing capability information, respectively determining, from the at least two target audio recognition models, a first target audio recognition model matching the audio fusion feature information of the at least two candidate audio data and a second target audio recognition model matching the audio feature information of the at least two candidate audio data.
The computer device may acquire feature processing capability information of each of the at least two target audio recognition models, where the feature processing capability information reflects the kind of feature information the target audio recognition model excels at processing, and may then determine, based on the feature processing capability information, a first target audio recognition model matching the audio fusion feature information of the at least two candidate audio data and a second target audio recognition model matching the audio feature information of the at least two candidate audio data. Selecting the target audio recognition model that processes each kind of feature information according to the feature processing capability information improves the accuracy of processing the feature information. For example, an FM-based target audio recognition model is good at mining the association relationships between feature information, while an XGBoost-based target audio recognition model is good at mining key split points. Thus, the FM-based target audio recognition model may be chosen as the model matching the audio fusion feature information of the at least two candidate audio data, so as to mine the implicit relationship among the audio feature information, the video feature information, and the object feature information, while the XGBoost-based target audio recognition model is chosen as the model matching the audio feature information of the at least two candidate audio data, so as to mine the key audio feature information (namely, the implicit information in the audio feature information).
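One way to picture the capability-based selection is a small registry that maps each model's advertised strength to the feature information it should receive; the names and structure below are illustrative assumptions, not an API of any particular library.

```python
# Hypothetical registry: each recognition model advertises the kind of
# feature information it is good at processing (all names are illustrative).
model_registry = {
    "fm_model":      {"capability": "cross_feature_interactions"},
    "xgboost_model": {"capability": "key_split_points"},
}

capability_to_feature = {
    "cross_feature_interactions": "audio_fusion_features",  # -> first target model
    "key_split_points":           "audio_features",         # -> second target model
}

def select_models(registry):
    """Match each feature type with the model best suited to process it."""
    return {capability_to_feature[info["capability"]]: name
            for name, info in registry.items()}

print(select_models(model_registry))
# {'audio_fusion_features': 'fm_model', 'audio_features': 'xgboost_model'}
```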
S204, carrying out audio joint relation recognition on the audio fusion characteristic information of the at least two candidate audio data by adopting the first target audio recognition model to obtain an audio joint matching degree; and carrying out audio autocorrelation relation recognition on the audio characteristic information of the at least two candidate audio data by adopting the second target audio recognition model to obtain an audio self-matching degree.
In this application, the computer device may perform audio joint relation recognition on the audio fusion feature information of the at least two candidate audio data by using the first target audio recognition model to obtain an audio joint matching degree. The audio joint matching degree reflects the association relationship between the candidate audio data and the target object and between the candidate audio data and the target video data; specifically, it reflects the target object's degree of preference for the candidate audio data and the degree of matching between the candidate audio data and the target video data. Further, the second target audio recognition model may be used to perform audio autocorrelation relation recognition on the audio feature information of the at least two candidate audio data to obtain an audio self-matching degree, which reflects how suitable the candidate audio data is for dubbing. Performing audio joint relation recognition on the audio fusion feature information through the first target audio recognition model helps to mine the hidden relationships among the audio feature information, the video feature information, and the object feature information; performing audio autocorrelation relation recognition on the audio feature information through the second target audio recognition model helps to mine the implicit information in the audio feature information. Mining this deeper information in the feature information improves the accuracy of recommending audio data.
And S205, selecting target audio data for dubbing the target video data from the at least two candidate audio data according to the audio joint matching degree and the audio self-matching degree.
In this application, the computer device may determine the total matching degree of each candidate audio data according to the audio joint matching degree and the audio self-matching degree, and select, from the at least two candidate audio data, target audio data for dubbing the target video data according to the total matching degree of each candidate audio data. Determining the target audio data by synthesizing the audio recognition results of the multi-modal audio recognition models can effectively avoid the problem of low accuracy of the recommended audio data caused by the deviation of a single model, and makes the recommended audio data more stable, accurate, and reliable.
Optionally, step S205 may include the following steps S81 to S82:
and s81, summing the audio joint matching degree and the audio self-matching degree to obtain a matching degree sum.
And s82, determining the candidate audio data, of the at least two candidate audio data, whose matching degree sum is greater than the matching degree threshold as the target audio data for dubbing the target video data.
In steps s81 to s82, the computer device may accumulate the audio joint matching degree and the audio self-matching degree to obtain a matching degree sum; alternatively, the audio joint matching degree and the audio self-matching degree may be subjected to weighted summation to obtain the matching degree sum. Further, the candidate audio data, of the at least two candidate audio data, whose matching degree sum is greater than the matching degree threshold may be determined as the target audio data for dubbing the target video data. Determining the target audio data by summing the audio recognition results of the multi-modal audio recognition models can effectively avoid the problem of low accuracy of the recommended audio data caused by the deviation of a single model, and makes the recommended audio data more stable, accurate, and reliable.
Optionally, when the computer device performs weighted summation processing on the audio joint matching degree and the audio self-matching degree to obtain a matching degree sum, the step s81 may include: acquiring the identification weight of the first target audio identification model and the identification weight of the second target audio identification model; weighting the audio joint matching degree by adopting the recognition weight of the first target audio recognition model to obtain the weighted audio joint matching degree; weighting the audio self-matching degree by adopting the recognition weight of the second target audio recognition model to obtain the weighted audio self-matching degree; and summing the weighted audio joint matching degree and the weighted audio self-matching degree to obtain a matching degree sum.
The computer device may obtain an identification weight of the first target audio recognition model and an identification weight of the second target audio recognition model; the identification weight of the first target audio identification model and the identification weight of the second target audio identification model can be determined according to the audio identification accuracy of the corresponding target audio identification model; or the identification weight of the first target audio identification model and the identification weight of the second target audio identification model can be set according to an application scene; for example, the target video data is obtained by cutting a certain movie, and the creator of the target video data expects that the target video data can obtain more clicks, so that the recognition weight of the first target audio recognition model can be set to be higher than that of the second target audio recognition model, which is beneficial to highlighting the audio joint matching degree and recommending audio data liked by the public. For example, the computer device may calculate the sum of the matching degrees of the respective candidate audio data using the following formula (1):
P_j = Σ_{i=1}^{N} w_i · Q_{ji}    (1)

Wherein, in formula (1), P_j represents the total matching degree of the jth candidate audio data, i.e., P_j is the final inference score given by the whole multi-modal set of target audio recognition models (i.e., the at least two target audio recognition models); Q_{ji} is the audio recognition result output by the ith target audio recognition model (which may be a classifier) when identifying the jth candidate audio data, i.e., Q_{ji} is the score of a single classifier for the candidate audio data; w_i is the recognition weight of the ith target audio recognition model; and N is the number of models in the at least two target audio recognition models.
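Read this way, formula (1) is a weighted sum over the per-model scores, which can be sketched as follows (the weights and scores below are made-up numbers):

```python
import numpy as np

def total_matching_degree(Q, w):
    """Formula (1): P_j = sum_i w_i * Q_ji, the weighted sum of every model's
    score for candidate j. Q has shape (num_candidates, num_models)."""
    return Q @ np.asarray(w)

Q = np.array([[0.8, 0.6],    # candidate 0: joint matching degree, self-matching degree
              [0.4, 0.9]])   # candidate 1
P = total_matching_degree(Q, w=[0.7, 0.3])   # recognition weights (illustrative)
recommended = np.argsort(P)[::-1]            # rank candidates by total matching degree
```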
And S206, recommending the target audio data to the target object.
For example, as shown in fig. 9, the computer device analyzes the at least two candidate audio data, the target video data, and the target object to obtain feature information of modalities 1 to N, where the feature information of the modalities differs; for example, modality 1 is video feature information, modality 2 is first audio feature information (including lyric feature information, score feature information, and singer information), ..., and modality N is jth audio feature information (including lyric feature information and singer feature information). Further, feature parameters may be extracted from the feature information of the multiple modalities; for example, in fig. 9, the feature parameters of modality 1 and modality 2 are fused to obtain audio fusion feature information 1, lyric feature information and singer features are extracted from modality 2 as audio feature information 2, and so on. The computer device includes N target audio recognition models, model 1 to model N, where model 1 may perform audio recognition on audio fusion feature information 1 to obtain audio recognition result 1 (namely, a matching degree), model 2 performs audio recognition on audio feature information 1 to obtain audio recognition result 2, ..., and model N performs audio recognition on the corresponding audio feature information to obtain audio recognition result N. Then, audio recognition result 1 to audio recognition result N may be fused (summed) to obtain the matching degree sum of each candidate audio data, and the candidate audio data whose matching degree sums rank in the top 10 may be selected from the at least two candidate audio data as the target audio data for dubbing the target video data and recommended to the target object.
In the application, the audio recognition result of the multi-modal audio recognition model is comprehensively considered, and the audio data is automatically recommended to the target object, so that the efficiency of recommending the audio data can be improved; meanwhile, the advantages of different audio recognition models are fully utilized, the problem that the accuracy of recommended audio data is low due to deviation of a single model can be effectively avoided, and the recommended audio data is more stable, more accurate and more reliable.
Further, please refer to fig. 10, which is a flowchart illustrating an audio data processing method according to an embodiment of the present application. As shown in fig. 10, the method may be performed by a computer device, which may refer to the terminal in fig. 1, or the computer device may refer to the server in fig. 1, or the computer device includes the terminal and the server in fig. 1, that is, the method may be performed by both the terminal and the server in fig. 1. The audio data processing method may include the following steps S301 to S304:
s301, acquiring object characteristic information of a sample object, sample video data belonging to the sample object, sample audio data for dubbing music on the sample video data and a labeling audio matching degree of the sample audio data; the labeled audio matching degree is used for reflecting the matching degree between the sample audio data and the sample object and between the sample audio data and the sample video data.
In the present application, a sample object refers to a user who has issued video data in a multimedia application platform, the issued video data is referred to as sample video data, and audio data for dubbing music on the sample video data is referred to as sample audio data. The computer device can obtain object feature information of a sample object, sample video data belonging to the sample object, sample audio data for dubbing the sample video data, and an annotation audio matching degree of the sample audio data from a multimedia application platform. The labeled audio matching degree is used for reflecting the matching degree between the sample audio data and the sample object and between the sample audio data and the sample video data, and the labeled audio matching degree can be obtained by labeling the sample audio data by a plurality of professional users; alternatively, the labeled audio matching degree may be obtained according to object behavior data of the sample video data, where the object behavior data includes at least one of an amount of praise, an amount of interest, an amount of forwarding, an amount of collection, an amount of click, and the like for the sample video data.
It should be noted that, in the present application, the sample video data may be a short video or a non-short video, where a short video may refer to video data whose playing duration is less than a duration threshold. Meanwhile, the sample video data may be obtained by screening candidate video data according to the attribute information of the candidate video data, where the attribute information may refer to, for example, definition, duration, and whether the video data is original; that is, the sample video data may refer to original video data, video data with higher definition, and the like. Optionally, the labeled audio matching degree is obtained according to the object behavior data of the sample video data; specifically, the computer device may obtain object behavior data about the sample video data and determine the labeled audio matching degree of the sample audio data according to the object behavior data about the sample video data.
The computer device can obtain object behavior data about the sample video data from the multimedia application platform, where the object behavior data includes at least one of the praise amount, the attention amount, the forwarding amount, the collection amount, the click amount, and the like for the sample video data, and determine the labeled audio matching degree of the sample audio data according to the object behavior data about the sample video data. The labeled audio matching degree increases with at least one of the praise amount, the attention amount, the forwarding amount, the collection amount, the click amount, and the like of the sample video data; in particular, if the sample video data is clicked, followed, forwarded, and so on, the sample video data is taken as a positive sample, and otherwise, if it is not clicked, followed, forwarded, and so on, it is taken as a negative sample.
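A minimal sketch of deriving the label from the object behavior data, assuming a simple dictionary of interaction counts with hypothetical field names; a real system could instead produce a graded matching degree rather than a 0/1 label.

```python
def label_from_behavior(behavior):
    """Derive the labeled audio matching degree from viewer behavior on the
    sample video (field names are assumptions): any interaction -> positive sample."""
    interacted = (behavior.get("clicks", 0) > 0 or
                  behavior.get("likes", 0) > 0 or
                  behavior.get("forwards", 0) > 0 or
                  behavior.get("follows", 0) > 0 or
                  behavior.get("favorites", 0) > 0)
    return 1 if interacted else 0

label = label_from_behavior({"clicks": 3, "likes": 1, "forwards": 0})  # -> 1 (positive sample)
```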
The labeled audio matching degree not only reflects the matching degree between the sample audio data and the sample object and between the sample audio data and the sample video data, but also reflects the preference of audience users for the sample video data and the sample audio data. Therefore, determining the labeled audio matching degree of the sample audio data according to the object behavior data about the sample video data has the following beneficial effects: 1. Training the candidate audio recognition models according to the labeled audio matching degree gives the trained target audio recognition models the ability to recommend audio data to creators (the authors of the video data) based on the audience users' preferences for the video data and the audio data; that is, the multimedia preferences of audience users can be transmitted to the creator, which effectively breaks down the barrier between creators and audience users, expands the creator's creative ideas, and, under the guidance of the recommendation, helps produce more works favored by both audience users and creators. 2. The sample audio data does not need to be labeled manually, so the problems of missing labels, label deviation, and the like in manual labeling can be avoided, and both the accuracy of the labeled audio matching degree and the efficiency of obtaining it can be improved.
S302, performing video feature extraction on the sample video data to obtain video feature information of the sample video data, and performing audio feature extraction on the sample audio data to obtain audio feature information of the sample audio data.
S303, fusing the audio feature information of the sample audio data with the video feature information of the sample video data and the object feature information of the sample object to obtain audio fusion feature information of the sample audio data.
In this application, the computer device may use a direct fusion method or a processing fusion method to fuse the audio feature information of the sample audio data, the video feature information of the sample video data, and the object feature information of the sample object, so as to obtain the audio fusion feature information of the sample audio data.
S304, respectively adjusting at least two candidate audio recognition models according to the labeled audio matching degree, the audio feature information of the sample audio data and the audio fusion feature information of the sample audio data to obtain at least two target audio recognition models.
In the present application, the labeled audio matching degree, the audio feature information of the sample audio data, and the audio fusion feature information of the sample audio data are used as training data, and at least two candidate audio recognition models are respectively subjected to iterative training to obtain the at least two target audio recognition models. By training the candidate audio recognition models, the accuracy of recommending audio data can be improved.
Optionally, the step S304 may include the following steps S91 to S92:
and s91, respectively adopting the at least two candidate audio recognition models to perform audio matching prediction on the audio feature information of the sample audio data and the audio fusion feature information of the sample audio data to obtain a predicted audio matching degree.
And s92, respectively adjusting the at least two candidate audio recognition models according to the predicted audio matching degree and the labeled audio matching degree to obtain the at least two target audio recognition models.
In steps s91 to s92, the training modes of the candidate audio recognition models include a non-difference training mode and a difference training mode, where the non-difference training mode is to train each candidate audio recognition model by using the same feature information, for example, at least two candidate audio recognition models include a first candidate audio recognition model and a second candidate audio recognition model, and the first candidate audio recognition model may be used to perform audio joint relation prediction on the audio fusion feature information of the sample audio data to obtain a first prediction result, and the first candidate audio recognition model may be used to perform audio autocorrelation relation recognition on the audio feature information of the sample audio data to obtain a second prediction result. And determining the predicted audio matching degree of the first candidate audio recognition model according to the first prediction result and the second prediction result, and adjusting the first candidate audio recognition model according to the labeled audio matching degree and the predicted audio matching degree of the first candidate audio recognition model to obtain a first target audio recognition model. Similarly, the second candidate audio recognition model may be used to perform audio joint relation prediction on the audio fusion feature information of the sample audio data to obtain a third prediction result, and the second candidate audio recognition model may be used to perform audio autocorrelation relation recognition on the audio feature information of the sample audio data to obtain a fourth prediction result. Determining the predicted audio matching degree of the second candidate audio recognition model according to the third prediction result and the fourth prediction result; and adjusting the second candidate audio recognition model according to the labeled audio matching degree and the predicted audio matching degree of the second candidate audio recognition model to obtain a second target audio recognition model.
Similarly, the differential training uses different feature information to train each candidate audio recognition model, for example, the candidate audio recognition model may be trained according to the feature processing capability of the candidate audio recognition model. For example, a first candidate audio recognition model is good at processing the fused audio feature information, and a second candidate audio recognition model is good at processing the audio feature information; therefore, the first candidate audio recognition model can be used for carrying out audio joint relation prediction on the audio fusion characteristic information of the sample audio data to obtain the predicted audio matching degree of the first candidate audio recognition model, and the first candidate audio recognition model is adjusted according to the labeled audio matching degree and the predicted audio matching degree of the first candidate audio recognition model to obtain the first target audio recognition model. And performing audio autocorrelation system identification on the audio characteristic information of the sample audio data by adopting the second candidate audio identification model to obtain the predicted audio matching degree of the second candidate audio identification model, and adjusting the second candidate audio identification model according to the labeled audio matching degree and the predicted audio matching degree of the second candidate audio identification model to obtain a second target audio identification model.
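A rough sketch of the differential training mode on synthetic data, with scikit-learn's LogisticRegression standing in for an FM-style model over the fusion features and XGBoost handling the plain audio features; the libraries, dimensions, and labels are all assumptions made for illustration, not the disclosed training procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression   # stand-in for an FM-style model
from xgboost import XGBClassifier                      # assumed available

rng = np.random.default_rng(0)
fusion_feats = rng.normal(size=(500, 304))   # audio fusion features of sample audio data
audio_feats  = rng.normal(size=(500, 64))    # audio features of sample audio data
labels       = rng.integers(0, 2, size=500)  # labeled audio matching degree (0/1)

# Differential training: each candidate model sees the feature type it handles best.
first_model  = LogisticRegression(max_iter=1000).fit(fusion_feats, labels)
second_model = XGBClassifier(n_estimators=50).fit(audio_feats, labels)

# At inference time the two scores would be combined per formula (1).
joint_match = first_model.predict_proba(fusion_feats[:2])[:, 1]   # audio joint matching degree
self_match  = second_model.predict_proba(audio_feats[:2])[:, 1]   # audio self-matching degree
```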
It should be noted that, when the target audio recognition model is obtained by a non-differential training mode, the audio recognition mode is a non-differential recognition mode; when the target audio recognition model is obtained through the differential training mode, the audio recognition mode is the differential recognition mode.
Optionally, the step s92 may include: respectively determining the prediction errors of the at least two candidate audio recognition models according to the predicted audio matching degree and the labeled audio matching degree; and if the prediction error is not in a convergence state, respectively adjusting the at least two candidate audio recognition models according to the prediction error to obtain the at least two target audio recognition models.
If the difference between the predicted audio matching degree and the labeled audio matching degree is small, the audio recognition accuracy of the candidate audio recognition model is high (namely, the prediction error is low); if the difference between the predicted audio matching degree and the labeled audio matching degree is large, the audio recognition accuracy of the candidate audio recognition model is low (namely, the prediction error is high). Therefore, the computer device may respectively determine the prediction errors of the at least two candidate audio recognition models according to the predicted audio matching degree and the labeled audio matching degree. If the prediction error is in a convergence state, the audio recognition accuracy of the candidate audio recognition model is already relatively high, and the candidate audio recognition model can be used as the target audio recognition model. If the prediction error is not in a convergence state, the audio recognition accuracy of the candidate audio recognition model is still low, and the at least two candidate audio recognition models are respectively adjusted according to the prediction error to obtain the at least two target audio recognition models.
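The convergence test can be sketched as a simple loop that keeps adjusting a candidate model until the prediction error stops improving; the fit_step/predict interface and the tolerance are assumptions made purely for illustration.

```python
def train_until_converged(model, data, labels, tol=1e-4, max_rounds=100):
    """Adjust the candidate model until the prediction error is in a convergence
    state. `model` is any object exposing fit_step(data, labels) and
    predict(data) -- an assumed interface, not a real library API."""
    prev_error = float("inf")
    for _ in range(max_rounds):
        model.fit_step(data, labels)
        error = ((model.predict(data) - labels) ** 2).mean()  # prediction error
        if prev_error - error < tol:      # error no longer improving: convergence state
            break
        prev_error = error
    return model
```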
In the application, at least two candidate audio recognition models are trained by adopting the fusion audio feature information of the sample audio data and the audio feature information of the sample audio data to obtain at least two target audio recognition models, so that the problem that the accuracy of recommended audio data is lower due to the fact that a single audio recognition model has deviation in the knowledge accumulation process can be solved.
Fig. 11 is a schematic structural diagram of an audio data processing apparatus according to an embodiment of the present application. The audio data processing means may be a computer program (including program code) running on a computer device, for example, the audio data processing means is an application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. As shown in fig. 11, the audio data processing apparatus may include: an acquisition module 111, a fusion module 112, an identification module 113, and a recommendation module 114.
An obtaining module, configured to obtain object feature information of a target object, video feature information of target video data belonging to the target object, and audio feature information of at least two candidate audio data associated with the target video data;
the fusion module is used for respectively fusing the audio characteristic information of the at least two candidate audio data with the video characteristic information of the target video data and the object characteristic information of the target object to obtain the audio fusion characteristic information of the at least two candidate audio data;
the identification module is used for respectively carrying out audio identification on the audio fusion characteristic information of the at least two candidate audio data and the audio characteristic information of the at least two candidate audio data by adopting at least two target audio identification models to obtain target audio data for carrying out music matching on the target video data; the target audio data belongs to the at least two candidate audio data;
and the recommending module is used for recommending the target audio data to the target object.
Optionally, the fusing module respectively fuses the audio feature information of the at least two candidate audio data, the video feature information of the target video data, and the object feature information of the target object to obtain audio fusion feature information of the at least two candidate audio data, and the method includes:
fusing the audio characteristic information of the at least two candidate audio data with the video characteristic information of the target video data to obtain first fusion characteristic information, and fusing the audio characteristic information of the at least two candidate audio data with the object characteristic information of the target object to obtain second fusion characteristic information;
fusing the audio characteristic information of the at least two candidate audio data, the video characteristic information of the target video data and the object characteristic information of the target object to obtain third fused characteristic information;
and determining the first fusion characteristic information, the second fusion characteristic information and the third fusion characteristic information as audio fusion characteristic information of the at least two candidate audio data.
Optionally, the fusing module fuses the audio feature information of the at least two candidate audio data and the video feature information of the target video data to obtain first fused feature information, and fuses the audio feature information of the at least two candidate audio data and the object feature information of the target object to obtain second fused feature information, where the method includes:
acquiring a first video characteristic parameter and a first audio characteristic parameter which have an incidence relation; the first video characteristic parameter belongs to the video characteristic information of the target video data, and the first audio characteristic parameter belongs to the audio characteristic information of the at least two candidate audio data;
generating first fusion characteristic information according to the first video characteristic parameter and the first audio characteristic parameter;
acquiring a first object characteristic parameter and a second audio characteristic parameter which have an incidence relation; the first object feature parameter belongs to object feature information of the target object, and the second audio feature parameter belongs to audio feature information of the at least two candidate audio data;
and generating second fusion characteristic information according to the first object characteristic parameter and the second audio characteristic parameter.
Optionally, the fusing module fuses the audio feature information of the at least two candidate audio data, the video feature information of the target video data, and the object feature information of the target object to obtain third fused feature information, where the third fused feature information includes:
acquiring a second object characteristic parameter, a second video characteristic parameter and a third audio characteristic parameter which have an incidence relation; the second object feature parameter belongs to object feature information of the target object, the second video feature parameter belongs to video feature information of the target video data, and the third audio feature parameter belongs to audio feature information of the at least two candidate audio data;
and generating third fusion characteristic information according to the second object characteristic parameter, the second video characteristic information and the third audio characteristic parameter.
Optionally, the identifying module performs audio identification on the audio fusion feature information of the at least two candidate audio data and the audio feature information of the target audio data by using the at least two target audio identification models respectively to obtain target audio data for performing dubbing music on the target video data, and the method includes:
determining a first target audio recognition model matched with the audio fusion characteristic information of the at least two candidate audio data and a second target audio recognition model matched with the audio characteristic information of the at least two candidate audio data from the at least two target audio recognition models respectively;
adopting the first target audio recognition model to carry out audio joint relation recognition on the audio fusion characteristic information of the at least two candidate audio data to obtain an audio joint matching degree; and adopting the second target audio recognition model to carry out audio autocorrelation relation recognition on the audio characteristic information of the target audio data to obtain an audio self-matching degree;
and selecting target audio data for matching the target video data according to the audio joint matching degree and the audio self-matching degree from the at least two candidate audio data.
Optionally, the determining, by the identifying module, a first target audio recognition model matching the audio fusion feature information of the at least two candidate audio data and a second target audio recognition model matching the audio feature information of the at least two candidate audio data from the at least two target audio recognition models respectively includes:
acquiring feature processing capacity information of the at least two target audio recognition models;
according to the feature processing capability information, a first target audio recognition model matched with the audio fusion feature information of the at least two candidate audio data and a second target audio recognition model matched with the audio feature information of the target audio data are respectively determined from the at least two target audio recognition models.
Optionally, the selecting, by the identification module, target audio data for dubbing the target video data from the at least two candidate audio data according to the audio joint matching degree and the audio self-matching degree includes:
summing the audio frequency joint matching degree and the audio frequency self-matching degree to obtain a matching degree sum;
and determining the candidate audio data with the matching degree sum larger than the matching degree threshold value in the at least two candidate audio data as target audio data for carrying out music matching on the target video data.
Optionally, the summing, by the identification module, the audio joint matching degree and the audio self-matching degree to obtain a matching degree sum, where the matching degree sum includes:
acquiring the identification weight of the first target audio identification model and the identification weight of the second target audio identification model;
weighting the audio joint matching degree by adopting the identification weight of the first target audio identification model to obtain the weighted audio joint matching degree;
weighting the audio self-matching degree by adopting the identification weight of the second target audio identification model to obtain the weighted audio self-matching degree;
and summing the weighted audio joint matching degree and the weighted audio self-matching degree to obtain a matching degree sum.
Optionally, the obtaining module obtains object feature information of the target object, including:
acquiring basic portrait characteristic information and multimedia portrait characteristic information of the target object;
performing portrait correlation recognition on the basic portrait feature information of the target object and the multimedia portrait feature information to obtain portrait correlation feature information;
and determining basic portrait characteristic information, multimedia portrait characteristic information and portrait related characteristic information of the target object as object characteristic information of the target object.
Optionally, the obtaining module obtains audio feature information of at least two candidate audio data associated with the target video data, including:
acquiring at least two candidate audio data associated with the target video data;
determining object feature information of creators of the at least two candidate audio data;
performing lyric feature extraction on the at least two candidate audio data to obtain lyric feature information of the at least two candidate audio data;
performing score feature extraction on the at least two candidate audio data to obtain score feature information of the at least two candidate audio data;
and fusing the object characteristic information of the author, the lyric characteristic information of the at least two candidate audio data and the music score characteristic information of the at least two candidate audio data to obtain the audio characteristic information of the at least two candidate audio data.
Optionally, the obtaining module performs score feature extraction on the at least two candidate audio data to obtain score feature information of the at least two candidate audio data, including:
performing frame division processing on candidate audio data Yi in the at least two candidate audio data to obtain at least two frames of audio data belonging to the candidate audio data Yi; i is a positive integer less than or equal to N, N being the number of candidate audio data in the at least two candidate audio data;
performing frequency domain transformation on at least two frames of audio data belonging to the candidate audio data Yi to obtain frequency domain information of the candidate audio data Yi;
and performing score feature extraction on the frequency domain information of the candidate audio data Yi to obtain score feature information of the at least two candidate audio data.
Optionally, the obtaining module performs score feature extraction on the frequency domain information of the candidate audio data Yi to obtain score feature information of the at least two candidate audio data, including:
determining energy information of the candidate audio data Yi according to the frequency domain information of the candidate audio data Yi;
filtering the energy information of the candidate audio data Yi to obtain filtered energy information;
and determining the filtered energy information as the score characteristic information of the at least two candidate audio data.
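The framing, frequency-domain transformation, energy computation and filtering described above may, purely as an illustration, be sketched as follows; the frame length, hop size, use of an FFT and the triangular filterbank are assumed choices and are not prescribed by the embodiments.

```python
import numpy as np

# Illustrative sketch of the framing -> frequency-domain transform -> energy ->
# filtering pipeline. Frame length, hop size, FFT and the triangular filterbank
# are assumptions; the embodiments only require framing, a frequency-domain
# transform, energy computation and filtering.

def score_features(signal: np.ndarray, sr: int = 16000,
                   frame_len: int = 400, hop: int = 160,
                   n_filters: int = 26) -> np.ndarray:
    # Frame division: split candidate audio data Yi into at least two frames
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    frames = np.stack(frames) * np.hamming(frame_len)
    # Frequency-domain transformation of each frame
    spectrum = np.fft.rfft(frames, axis=1)
    # Energy information derived from the frequency-domain information
    energy = np.abs(spectrum) ** 2
    # Filtering the energy information (triangular filterbank as an assumed choice)
    n_bins = energy.shape[1]
    filters = np.zeros((n_filters, n_bins))
    centers = np.linspace(2, n_bins - 3, n_filters + 2).astype(int)
    for m in range(1, n_filters + 1):
        left, center, right = centers[m - 1], centers[m], centers[m + 1]
        filters[m - 1, left:center] = np.linspace(0, 1, center - left)
        filters[m - 1, center:right] = np.linspace(1, 0, right - center)
    filtered_energy = np.log(energy @ filters.T + 1e-10)
    return filtered_energy  # used as the score feature information

# Hypothetical usage with one second of synthetic audio
features = score_features(np.random.randn(16000))
print(features.shape)
```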
Optionally, the obtaining module obtains video feature information of target video data belonging to the target object, including:
acquiring target video data belonging to the target object;
extracting at least two key video frames of the target video data;
performing video feature extraction on the at least two key video frames to obtain video feature information of the at least two key video frames;
and fusing the video characteristic information of the at least two key video frames to obtain the video characteristic information of the target video data.
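A minimal sketch of the key-frame extraction and fusion described above is given below; the uniform key-frame sampling, the placeholder frame encoder and the mean-pooling fusion are assumptions made only for illustration.

```python
import numpy as np

# Illustrative sketch: uniform key-frame sampling, a placeholder frame encoder
# and mean-pooling fusion are assumed choices; the embodiments do not fix these.

def encode_frame(frame: np.ndarray) -> np.ndarray:
    """Placeholder for a CNN/transformer frame encoder (hypothetical)."""
    return np.array([frame.mean(), frame.std(), frame.max(), frame.min()])

def video_features(video: np.ndarray, n_key_frames: int = 8) -> np.ndarray:
    # Extract at least two key video frames (uniform sampling as an assumption)
    idx = np.linspace(0, len(video) - 1, n_key_frames).astype(int)
    key_frames = video[idx]
    # Video feature extraction for each key frame
    frame_feats = np.stack([encode_frame(f) for f in key_frames])
    # Fuse the key-frame features into video feature information (mean pooling assumed)
    return frame_feats.mean(axis=0)

# Hypothetical usage: 120 frames of 64x64 grayscale video
video = np.random.rand(120, 64, 64)
print(video_features(video).shape)  # (4,)
```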
According to an embodiment of the present application, the steps involved in the audio data processing method shown in fig. 4 may be performed by respective modules in the audio data processing apparatus shown in fig. 11. For example, step S101 shown in fig. 4 may be performed by the obtaining module 111 in fig. 11, and step S102 shown in fig. 4 may be performed by the fusing module 112 in fig. 11; step S103 shown in fig. 4 may be performed by the identification module 113 in fig. 11; and step S104 shown in fig. 4 may be performed by the recommendation module 114 in fig. 11.
According to an embodiment of the present application, the modules in the audio data processing apparatus shown in fig. 11 may be separately or wholly combined into one or several units to constitute the apparatus, or one or more of the units may be further split into at least two functionally smaller sub-units; either arrangement can implement the same operations without affecting the technical effects of the embodiments of the present application. The above modules are divided based on logical functions; in practical applications, the function of one module may also be implemented by at least two units, or the functions of at least two modules may be implemented by one unit. In other embodiments of the present application, the audio data processing apparatus may also include other units; in practical applications, these functions may also be implemented with the assistance of other units, and may be implemented cooperatively by at least two units.
According to an embodiment of the present application, the audio data processing apparatus shown in fig. 11 may be constructed, and the audio data processing method of the embodiments of the present application implemented, by running a computer program (including program code) capable of executing the steps of the corresponding method shown in fig. 4 on a general-purpose computer device, such as a computer including a processing element and a storage element, for example a Central Processing Unit (CPU), a Random Access Memory (RAM), and a Read-Only Memory (ROM). The computer program may be recorded on, for example, a computer-readable recording medium, loaded into the above computing device via the computer-readable recording medium, and executed therein.
In the present application, the audio fusion feature information of the at least two candidate audio data is obtained by fusing the audio feature information of the at least two candidate audio data with the object feature information of the target object and the video feature information of the target video data; that is, by fusing multi-modal feature information, more information can be provided for recommending audio data, which can improve the accuracy of the recommended audio data. Further, at least two target audio recognition models are respectively adopted to recognize the audio fusion feature information of the at least two candidate audio data and the audio feature information of the at least two candidate audio data, so as to obtain target audio data for dubbing the target video data, and the target audio data is recommended to the target object. Because the audio recognition results of multiple audio recognition models are comprehensively considered and the audio data is automatically recommended to the target object, the efficiency of recommending audio data can be improved; meanwhile, the advantages of different audio recognition models are fully utilized, which effectively avoids the low recommendation accuracy caused by the deviation of a single model and makes the recommended audio data more stable, accurate and reliable.
Fig. 12 is a schematic structural diagram of an audio data processing apparatus according to an embodiment of the present application. The audio data processing apparatus may be a computer program (including program code) running in a computer device; for example, the audio data processing apparatus is application software. The apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. As shown in fig. 12, the audio data processing apparatus may include: an acquisition module 121, an extraction module 122, a fusion module 123, and an adjustment module 124.
The acquisition module is used for acquiring object characteristic information of a sample object, sample video data belonging to the sample object, sample audio data for dubbing the sample video data, and a labeled audio matching degree of the sample audio data; the labeled audio matching degree is used for reflecting the matching degree between the sample audio data and the sample object and between the sample audio data and the sample video data;
the extraction module is used for performing video feature extraction on the sample video data to obtain video characteristic information of the sample video data, and performing audio feature extraction on the sample audio data to obtain audio characteristic information of the sample audio data;
the fusion module is used for fusing the audio characteristic information of the sample audio data with the video characteristic information of the sample video data and the object characteristic information of the sample object to obtain audio fusion characteristic information of the sample audio data;
and the adjusting module is used for respectively adjusting the at least two candidate audio recognition models according to the labeled audio matching degree, the audio feature information of the sample audio data and the audio fusion feature information of the sample audio data to obtain at least two target audio recognition models.
Optionally, the obtaining module obtains the labeled audio matching degree of the sample audio data, including:
obtaining object behavior data with respect to the sample video data;
and determining the labeled audio matching degree of the sample audio data according to the object behavior data related to the sample video data.
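As a purely illustrative sketch of deriving the labeled audio matching degree from object behavior data, one possible formulation is shown below; the particular behavior signals (play completion, like rate, share rate) and their weights are assumptions and are not specified by the embodiments.

```python
# Illustrative sketch: the behaviour signals and their weights are hypothetical;
# the embodiments only state that the labeled audio matching degree is derived
# from object behaviour data on the sample video data.

def labeled_match_degree(behavior: dict) -> float:
    completion = behavior.get("avg_play_completion", 0.0)   # 0..1
    like_rate = behavior.get("like_rate", 0.0)               # likes / views
    share_rate = behavior.get("share_rate", 0.0)             # shares / views
    score = 0.5 * completion + 0.3 * min(like_rate * 10, 1.0) + 0.2 * min(share_rate * 20, 1.0)
    return max(0.0, min(1.0, score))

# Hypothetical usage
print(labeled_match_degree({"avg_play_completion": 0.7, "like_rate": 0.05, "share_rate": 0.01}))
```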
Optionally, the adjusting module adjusts at least two candidate audio recognition models respectively according to the labeled audio matching degree, the audio feature information of the sample audio data, and the audio fusion feature information of the sample audio data to obtain the at least two target audio recognition models, including:
respectively adopting the at least two candidate audio recognition models to carry out audio matching prediction on the audio characteristic information of the sample audio data and the audio fusion characteristic information of the sample audio data to obtain a predicted audio matching degree;
and respectively adjusting at least two candidate audio recognition models according to the predicted audio matching degree and the labeled audio matching degree to obtain at least two target audio recognition models.
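A minimal sketch of adjusting two candidate audio recognition models from the predicted audio matching degree and the labeled audio matching degree is given below; the linear scorers, the squared-error objective and the training loop are stand-ins chosen only for illustration and do not represent the model structure used by the embodiments.

```python
import numpy as np

# Illustrative sketch of adjusting two candidate audio recognition models from
# the predicted vs. labeled audio matching degree. Linear scorers and a squared
# error loss are stand-ins; the embodiments do not prescribe the model family
# or the loss.

class LinearScorer:
    def __init__(self, dim: int, lr: float = 0.05):
        self.w = np.zeros(dim)
        self.lr = lr

    def predict(self, x: np.ndarray) -> float:
        return float(1.0 / (1.0 + np.exp(-self.w @ x)))     # predicted matching degree in (0, 1)

    def adjust(self, x: np.ndarray, label: float) -> None:
        pred = self.predict(x)
        grad = (pred - label) * pred * (1 - pred) * x        # gradient of the squared error
        self.w -= self.lr * grad

# One candidate model consumes the audio fusion feature information,
# the other consumes the audio feature information of the sample audio data.
fusion_model = LinearScorer(dim=16)
audio_model = LinearScorer(dim=8)

for _ in range(100):                                         # hypothetical training loop
    audio_fusion_feat = np.random.rand(16)
    audio_feat = np.random.rand(8)
    labeled_degree = 0.9                                     # labeled audio matching degree
    fusion_model.adjust(audio_fusion_feat, labeled_degree)
    audio_model.adjust(audio_feat, labeled_degree)
```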
According to an embodiment of the present application, the steps involved in the audio data processing method shown in fig. 10 may be performed by respective modules in the audio data processing apparatus shown in fig. 12. For example, step S301 shown in fig. 10 may be performed by the obtaining module 121 in fig. 12, and step S302 shown in fig. 10 may be performed by the extracting module 122 in fig. 12; step S303 shown in fig. 10 may be performed by the fusion module 123 in fig. 12; step S304 shown in fig. 10 may be performed by the adjusting module 124 in fig. 12.
According to an embodiment of the present application, the modules in the audio data processing apparatus shown in fig. 12 may be separately or wholly combined into one or several units to constitute the apparatus, or one or more of the units may be further split into at least two functionally smaller sub-units; either arrangement can implement the same operations without affecting the technical effects of the embodiments of the present application. The above modules are divided based on logical functions; in practical applications, the function of one module may also be implemented by at least two units, or the functions of at least two modules may be implemented by one unit. In other embodiments of the present application, the audio data processing apparatus may also include other units; in practical applications, these functions may also be implemented with the assistance of other units, and may be implemented cooperatively by at least two units.
According to an embodiment of the present application, the audio data processing apparatus shown in fig. 12 may be constructed, and the audio data processing method of the embodiments of the present application implemented, by running a computer program (including program code) capable of executing the steps of the corresponding method shown in fig. 10 on a general-purpose computer device, such as a computer including a processing element and a storage element, for example a Central Processing Unit (CPU), a Random Access Memory (RAM), and a Read-Only Memory (ROM). The computer program may be recorded on, for example, a computer-readable recording medium, loaded into the above computing device via the computer-readable recording medium, and executed therein.
In the present application, at least two candidate audio recognition models are trained by adopting the audio fusion feature information of the sample audio data and the audio feature information of the sample audio data to obtain at least two target audio recognition models, which can alleviate the problem of reduced accuracy of recommended audio data caused by deviation of a single audio recognition model in the knowledge accumulation process.
Fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 13, the computer device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005; the computer device 1000 may further include: an object interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The object interface 1003 may include a display screen (Display) and a keyboard (Keyboard); optionally, the object interface 1003 may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a standard wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one magnetic disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the processor 1001. As shown in fig. 13, the memory 1005, which is a computer-readable storage medium, may include an operating system, a network communication module, an object interface module, and a device control application program.
In the computer device 1000 shown in fig. 13, the network interface 1004 may provide a network communication function; the object interface 1003 is an interface for providing input to an object; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:
acquiring object characteristic information of a target object, video characteristic information of target video data belonging to the target object, and audio characteristic information of at least two candidate audio data associated with the target video data;
respectively fusing the audio characteristic information of the at least two candidate audio data with the video characteristic information of the target video data and the object characteristic information of the target object to obtain audio fusion characteristic information of the at least two candidate audio data;
respectively carrying out audio recognition on the audio fusion characteristic information of the at least two candidate audio data and the audio characteristic information of the at least two candidate audio data by adopting at least two target audio recognition models to obtain target audio data for carrying out music matching on the target video data; the target audio data belongs to the at least two candidate audio data;
recommending the target audio data to the target object.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:
fusing the audio characteristic information of the at least two candidate audio data with the video characteristic information of the target video data to obtain first fusion characteristic information, and fusing the audio characteristic information of the at least two candidate audio data with the object characteristic information of the target object to obtain second fusion characteristic information;
fusing the audio characteristic information of the at least two candidate audio data, the video characteristic information of the target video data and the object characteristic information of the target object to obtain third fused characteristic information;
and determining the first fusion characteristic information, the second fusion characteristic information and the third fusion characteristic information as the audio fusion characteristic information of the at least two candidate audio data.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:
acquiring a first video characteristic parameter and a first audio characteristic parameter which have an incidence relation; the first video characteristic parameter belongs to the video characteristic information of the target video data, and the first audio characteristic parameter belongs to the audio characteristic information of the at least two candidate audio data;
generating first fusion characteristic information according to the first video characteristic parameter and the first audio characteristic parameter;
acquiring a first object characteristic parameter and a second audio characteristic parameter which have an incidence relation; the first object feature parameter belongs to object feature information of the target object, and the second audio feature parameter belongs to audio feature information of the at least two candidate audio data;
and generating second fusion characteristic information according to the first object characteristic parameter and the second audio characteristic parameter.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:
acquiring a second object characteristic parameter, a second video characteristic parameter and a third audio characteristic parameter which have an incidence relation; the second object feature parameter belongs to object feature information of the target object, the second video feature parameter belongs to video feature information of the target video data, and the third audio feature parameter belongs to audio feature information of the at least two candidate audio data;
and generating third fusion characteristic information according to the second object characteristic parameter, the second video characteristic information and the third audio characteristic parameter.
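A minimal sketch of producing the first, second and third fusion characteristic information is given below; the use of simple concatenation as the fusion operation and the feature dimensionalities are assumptions made only for illustration.

```python
import numpy as np

# Illustrative sketch of producing the first, second and third fusion feature
# information. Concatenation of associated feature parameters is an assumed
# fusion operation; the embodiments only require that associated video, object
# and audio feature parameters are combined.

def fuse(*parts: np.ndarray) -> np.ndarray:
    return np.concatenate(parts)

audio_feat = np.random.rand(8)    # audio feature information of one candidate audio data
video_feat = np.random.rand(8)    # video feature information of the target video data
object_feat = np.random.rand(8)   # object feature information of the target object

first_fusion = fuse(video_feat, audio_feat)                 # video + audio
second_fusion = fuse(object_feat, audio_feat)               # object + audio
third_fusion = fuse(object_feat, video_feat, audio_feat)    # object + video + audio

audio_fusion_feature_info = (first_fusion, second_fusion, third_fusion)
print([f.shape for f in audio_fusion_feature_info])  # [(16,), (16,), (24,)]
```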
Optionally, the processor 1001 may be configured to call a device control application stored in the memory 1005 to implement:
determining a first target audio recognition model matched with the audio fusion characteristic information of the at least two candidate audio data and a second target audio recognition model matched with the audio characteristic information of the at least two candidate audio data from the at least two target audio recognition models respectively;
performing audio joint relation recognition on the audio fusion characteristic information of the at least two candidate audio data by adopting the first target audio recognition model to obtain an audio joint matching degree; and performing audio self-correlation recognition on the audio characteristic information of the target audio data by adopting the second target audio recognition model to obtain an audio self-matching degree;
and selecting target audio data for dubbing the target video data from the at least two candidate audio data according to the audio joint matching degree and the audio self-matching degree.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:
acquiring feature processing capacity information of the at least two target audio recognition models;
according to the feature processing capability information, a first target audio recognition model matched with the audio fusion feature information of the at least two candidate audio data and a second target audio recognition model matched with the audio feature information of the target audio data are respectively determined from the at least two target audio recognition models.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:
summing the audio joint matching degree and the audio self-matching degree to obtain a matching degree sum;
and determining the candidate audio data with the matching degree sum larger than the matching degree threshold value in the at least two candidate audio data as target audio data for carrying out music matching on the target video data.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:
acquiring the identification weight of the first target audio identification model and the identification weight of the second target audio identification model;
weighting the audio joint matching degree by adopting the identification weight of the first target audio identification model to obtain the weighted audio joint matching degree;
weighting the audio self-matching degree by adopting the identification weight of the second target audio identification model to obtain the weighted audio self-matching degree;
and summing the weighted audio joint matching degree and the weighted audio self-matching degree to obtain a matching degree sum.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:
acquiring basic portrait characteristic information and multimedia portrait characteristic information of the target object;
performing portrait correlation recognition on the basic portrait feature information of the target object and the multimedia portrait feature information to obtain portrait correlation feature information;
and determining the basic portrait characteristic information, the multimedia portrait characteristic information and the portrait related characteristic information of the target object as the object characteristic information of the target object.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:
acquiring at least two candidate audio data associated with the target video data;
determining object feature information of creators of the at least two candidate audio data;
performing lyric feature extraction on the at least two candidate audio data to obtain lyric feature information of the at least two candidate audio data;
performing score feature extraction on the at least two candidate audio data to obtain score feature information of the at least two candidate audio data;
and fusing the object characteristic information of the author, the lyric characteristic information of the at least two candidate audio data and the music score characteristic information of the at least two candidate audio data to obtain the audio characteristic information of the at least two candidate audio data.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:
performing frame division processing on candidate audio data Yi in the at least two candidate audio data to obtain at least two frames of audio data belonging to the candidate audio data Yi; i is a positive integer less than or equal to N, wherein N is the number of candidate audio data in the at least two candidate audio data;
performing frequency domain transformation on at least two frames of audio data belonging to the candidate audio data Yi to obtain frequency domain information of the candidate audio data Yi;
and performing score feature extraction on the frequency domain information of the candidate audio data Yi to obtain score feature information of the at least two candidate audio data.
Optionally, the processor 1001 may be configured to call a device control application stored in the memory 1005 to implement:
determining energy information of the candidate audio data Yi according to the frequency domain information of the candidate audio data Yi;
filtering the energy information of the candidate audio data Yi to obtain filtered energy information;
and determining the filtered energy information as the score characteristic information of the at least two candidate audio data.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:
acquiring target video data belonging to the target object;
extracting at least two key video frames of the target video data;
performing video feature extraction on the at least two key video frames to obtain video feature information of the at least two key video frames;
and fusing the video characteristic information of the at least two key video frames to obtain the video characteristic information of the target video data.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:
acquiring object characteristic information of a sample object, sample video data belonging to the sample object, sample audio data for dubbing the sample video data, and a labeled audio matching degree of the sample audio data; the labeled audio matching degree is used for reflecting the matching degree between the sample audio data and the sample object and between the sample audio data and the sample video data;
performing video feature extraction on the sample video data to obtain video characteristic information of the sample video data, and performing audio feature extraction on the sample audio data to obtain audio characteristic information of the sample audio data;
fusing the audio characteristic information of the sample audio data with the video characteristic information of the sample video data and the object characteristic information of the sample object to obtain audio fusion characteristic information of the sample audio data;
and respectively adjusting at least two candidate audio recognition models according to the labeled audio matching degree, the audio characteristic information of the sample audio data, and the audio fusion characteristic information of the sample audio data to obtain at least two target audio recognition models.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:
obtaining object behavior data with respect to the sample video data;
and determining the labeling audio matching degree of the sample audio data according to the object behavior data related to the sample video data.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:
respectively adopting the at least two candidate audio recognition models to carry out audio matching prediction on the audio characteristic information of the sample audio data and the audio fusion characteristic information of the sample audio data to obtain a predicted audio matching degree;
and respectively adjusting at least two candidate audio recognition models according to the predicted audio matching degree and the labeled audio matching degree to obtain at least two target audio recognition models.
In the present application, the audio fusion feature information of the at least two candidate audio data is obtained by fusing the audio feature information of the at least two candidate audio data with the object feature information of the target object and the video feature information of the target video data; that is, by fusing multi-modal feature information, more information can be provided for recommending audio data, which can improve the accuracy of the recommended audio data. Further, at least two target audio recognition models are respectively adopted to recognize the audio fusion feature information of the at least two candidate audio data and the audio feature information of the at least two candidate audio data, so as to obtain target audio data for dubbing the target video data, and the target audio data is recommended to the target object. Because the audio recognition results of multiple audio recognition models are comprehensively considered and the audio data is automatically recommended to the target object, the efficiency of recommending audio data can be improved; meanwhile, the advantages of different audio recognition models are fully utilized, which effectively avoids the low recommendation accuracy caused by the deviation of a single model and makes the recommended audio data more stable, accurate and reliable.
It should be understood that the computer device 1000 described in this embodiment of the present application may perform the description of the audio data processing method in the embodiment corresponding to fig. 4 and fig. 10, and may also perform the description of the audio data processing apparatus in the embodiment corresponding to fig. 11 and fig. 12, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, here, it is to be noted that: an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program executed by the aforementioned audio data processing apparatus, and the computer program includes program instructions, and when the processor executes the program instructions, the descriptions of the audio data processing method in the embodiments corresponding to fig. 4 and fig. 10 can be executed, so that the descriptions of the audio data processing method will not be repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in embodiments of the computer-readable storage medium referred to in the present application, reference is made to the description of embodiments of the method of the present application.
As an example, the program instructions described above may be executed on one computer device, or on at least two computer devices located at one site, or on at least two computer devices distributed over at least two sites and interconnected by a communication network, and the at least two computer devices distributed over at least two sites and interconnected by the communication network may constitute a blockchain network.
The computer-readable storage medium may be an internal storage unit of the audio data processing apparatus provided in any of the foregoing embodiments or of the computer device, such as a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the computer device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
An embodiment of the present application further provides a computer program product, which includes a computer program/instructions; when the computer program/instructions are executed by a processor, the descriptions of the audio data processing method in the embodiments corresponding to fig. 4 and fig. 10 are implemented, and therefore details are not repeated here. In addition, the description of the beneficial effects of the same method is not repeated. For technical details not disclosed in the embodiments of the computer program product referred to in the present application, reference is made to the description of the method embodiments of the present application.
The terms "first," "second," and the like in the description and in the claims and drawings of the embodiments of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprises" and any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or apparatus that comprises a list of steps or elements is not limited to the listed steps or modules, but may alternatively include other steps or modules not listed or inherent to such process, method, apparatus, product, or apparatus.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flow and/or block in the flowchart and/or the block diagram can be specifically implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block or blocks.
The above disclosure describes only preferred embodiments of the present application and is not intended to limit the scope of the present application; the present application is therefore not limited thereto, and all equivalent variations and modifications remain within the scope of the present application.

Claims (20)

1. A method of audio data processing, comprising:
acquiring object characteristic information of a target object, video characteristic information of target video data belonging to the target object, and audio characteristic information of at least two candidate audio data associated with the target video data;
respectively fusing the audio characteristic information of the at least two candidate audio data with the video characteristic information of the target video data and the object characteristic information of the target object to obtain audio fusion characteristic information of the at least two candidate audio data;
respectively carrying out audio recognition on the audio fusion characteristic information of the at least two candidate audio data and the audio characteristic information of the at least two candidate audio data by adopting at least two target audio recognition models to obtain target audio data for carrying out music matching on the target video data; the target audio data belongs to the at least two candidate audio data;
recommending the target audio data to the target object.
2. The method of claim 1, wherein the fusing the audio feature information of the at least two candidate audio data with the video feature information of the target video data and the object feature information of the target object to obtain the audio fusion feature information of the at least two candidate audio data respectively comprises:
fusing the audio characteristic information of the at least two candidate audio data with the video characteristic information of the target video data to obtain first fusion characteristic information, and fusing the audio characteristic information of the at least two candidate audio data with the object characteristic information of the target object to obtain second fusion characteristic information;
fusing the audio characteristic information of the at least two candidate audio data, the video characteristic information of the target video data and the object characteristic information of the target object to obtain third fused characteristic information;
and determining the first fusion characteristic information, the second fusion characteristic information and the third fusion characteristic information as the audio fusion characteristic information of the at least two candidate audio data.
3. The method as claimed in claim 2, wherein the fusing the audio feature information of the at least two candidate audio data with the video feature information of the target video data to obtain a first fused feature information, and the fusing the audio feature information of the at least two candidate audio data with the object feature information of the target object to obtain a second fused feature information comprises:
acquiring a first video characteristic parameter and a first audio characteristic parameter which have an incidence relation; the first video characteristic parameter belongs to the video characteristic information of the target video data, and the first audio characteristic parameter belongs to the audio characteristic information of the at least two candidate audio data;
generating first fusion characteristic information according to the first video characteristic parameter and the first audio characteristic parameter;
acquiring a first object characteristic parameter and a second audio characteristic parameter which have an incidence relation; the first object feature parameter belongs to object feature information of the target object, and the second audio feature parameter belongs to audio feature information of the at least two candidate audio data;
and generating second fusion characteristic information according to the first object characteristic parameter and the second audio characteristic parameter.
4. The method of claim 2, wherein the fusing the audio feature information of the at least two candidate audio data, the video feature information of the target video data, and the object feature information of the target object to obtain third fused feature information comprises:
acquiring a second object characteristic parameter, a second video characteristic parameter and a third audio characteristic parameter which have an incidence relation; the second object feature parameter belongs to object feature information of the target object, the second video feature parameter belongs to video feature information of the target video data, and the third audio feature parameter belongs to audio feature information of the at least two candidate audio data;
and generating third fusion characteristic information according to the second object characteristic parameter, the second video characteristic information and the third audio characteristic parameter.
5. The method of claim 1, wherein the audio recognition of the audio fusion feature information of the at least two candidate audio data and the audio feature information of the target audio data by using the at least two target audio recognition models respectively to obtain target audio data for dubbing music on the target video data comprises:
determining a first target audio recognition model matched with the audio fusion characteristic information of the at least two candidate audio data and a second target audio recognition model matched with the audio characteristic information of the at least two candidate audio data from the at least two target audio recognition models respectively;
performing audio joint relation recognition on the audio fusion characteristic information of the at least two candidate audio data by adopting the first target audio recognition model to obtain an audio joint matching degree; performing audio self-correlation recognition on the audio characteristic information of the target audio data by adopting the second target audio recognition model to obtain an audio self-matching degree;
and selecting, from the at least two candidate audio data, target audio data for dubbing the target video data according to the audio joint matching degree and the audio self-matching degree.
6. The method of claim 5, wherein the determining, from the at least two target audio recognition models, a first target audio recognition model that matches audio fusion feature information of the at least two candidate audio data and a second target audio recognition model that matches audio feature information of the at least two candidate audio data, respectively, comprises:
acquiring feature processing capacity information of the at least two target audio recognition models;
according to the feature processing capability information, a first target audio recognition model matched with the audio fusion feature information of the at least two candidate audio data and a second target audio recognition model matched with the audio feature information of the target audio data are respectively determined from the at least two target audio recognition models.
7. The method of claim 5, wherein the selecting, from the at least two candidate audio data, target audio data for dubbing the target video data according to the audio joint matching degree and the audio self-matching degree comprises:
summing the audio joint matching degree and the audio self-matching degree to obtain a matching degree sum;
and determining the candidate audio data with the matching degree sum larger than the matching degree threshold value in the at least two candidate audio data as target audio data for carrying out music matching on the target video data.
8. The method of claim 7, wherein the summing the audio joint matching degree and the audio self-matching degree to obtain a matching degree sum comprises:
acquiring the identification weight of the first target audio identification model and the identification weight of the second target audio identification model;
weighting the audio joint matching degree by adopting the identification weight of the first target audio identification model to obtain the weighted audio joint matching degree;
weighting the audio self-matching degree by adopting the identification weight of the second target audio identification model to obtain the weighted audio self-matching degree;
and summing the weighted audio joint matching degree and the weighted audio self-matching degree to obtain a matching degree sum.
9. The method of claim 1, wherein the obtaining object feature information of the target object comprises:
acquiring basic portrait characteristic information and multimedia portrait characteristic information of the target object;
performing portrait correlation recognition on the basic portrait feature information of the target object and the multimedia portrait feature information to obtain portrait correlation feature information;
and determining the basic portrait characteristic information, the multimedia portrait characteristic information and the portrait related characteristic information of the target object as the object characteristic information of the target object.
10. The method of claim 1, wherein the obtaining audio feature information of at least two candidate audio data associated with the target video data comprises:
acquiring at least two candidate audio data associated with the target video data;
determining object feature information of creators of the at least two candidate audio data;
performing lyric feature extraction on the at least two candidate audio data to obtain lyric feature information of the at least two candidate audio data;
performing score feature extraction on the at least two candidate audio data to obtain score feature information of the at least two candidate audio data;
and fusing the object characteristic information of the author, the lyric characteristic information of the at least two candidate audio data and the music score characteristic information of the at least two candidate audio data to obtain the audio characteristic information of the at least two candidate audio data.
11. The method of claim 10, wherein the performing score feature extraction on the at least two candidate audio data to obtain score feature information of the at least two candidate audio data comprises:
performing frame division processing on candidate audio data Yi in the at least two candidate audio data to obtain at least two frames of audio data belonging to the candidate audio data Yi; i is a positive integer less than or equal to N, wherein N is the number of candidate audio data in the at least two candidate audio data;
performing frequency domain transformation on at least two frames of audio data belonging to the candidate audio data Yi to obtain frequency domain information of the candidate audio data Yi;
and performing score feature extraction on the frequency domain information of the candidate audio data Yi to obtain score feature information of the at least two candidate audio data.
12. The method as claimed in claim 11, wherein the performing score feature extraction on the frequency domain information of the candidate audio data Yi to obtain score feature information of the at least two candidate audio data comprises:
determining energy information of the candidate audio data Yi according to the frequency domain information of the candidate audio data Yi;
filtering the energy information of the candidate audio data Yi to obtain filtered energy information;
and determining the filtered energy information as the score characteristic information of the at least two candidate audio data.
13. The method of claim 1, wherein the obtaining video feature information of target video data belonging to the target object comprises:
acquiring target video data belonging to the target object;
extracting at least two key video frames of the target video data;
performing video feature extraction on the at least two key video frames to obtain video feature information of the at least two key video frames;
and fusing the video characteristic information of the at least two key video frames to obtain the video characteristic information of the target video data.
14. A method of audio data processing, comprising:
acquiring object characteristic information of a sample object, sample video data belonging to the sample object, sample audio data for dubbing the sample video data, and a labeled audio matching degree of the sample audio data; the labeled audio matching degree is used for reflecting the matching degree between the sample audio data and the sample object and between the sample audio data and the sample video data;
performing video feature extraction on the sample video data to obtain video characteristic information of the sample video data, and performing audio feature extraction on the sample audio data to obtain audio characteristic information of the sample audio data;
fusing the audio characteristic information of the sample audio data with the video characteristic information of the sample video data and the object characteristic information of the sample object to obtain audio fusion characteristic information of the sample audio data;
adjusting at least two candidate audio recognition models respectively according to the labeled audio matching degree, the audio feature information of the sample audio data and the audio fusion feature information of the sample audio data to obtain the at least two target audio recognition models as claimed in any one of claims 1 to 13.
15. The method of claim 14, wherein the obtaining the annotated audio match score for the sample audio data comprises:
obtaining object behavior data with respect to the sample video data;
and determining the labeling audio matching degree of the sample audio data according to the object behavior data related to the sample video data.
16. The method of claim 14, wherein the adjusting at least two candidate audio recognition models according to the labeled audio matching degree, the audio feature information of the sample audio data, and the audio fusion feature information of the sample audio data to obtain the at least two target audio recognition models comprises:
respectively adopting the at least two candidate audio recognition models to carry out audio matching prediction on the audio characteristic information of the sample audio data and the audio fusion characteristic information of the sample audio data to obtain a predicted audio matching degree;
and respectively adjusting at least two candidate audio recognition models according to the predicted audio matching degree and the labeled audio matching degree to obtain at least two target audio recognition models.
17. An audio data processing apparatus, comprising:
an obtaining module, configured to obtain object feature information of a target object, video feature information of target video data belonging to the target object, and audio feature information of at least two candidate audio data associated with the target video data;
the fusion module is used for respectively fusing the audio characteristic information of the at least two candidate audio data with the video characteristic information of the target video data and the object characteristic information of the target object to obtain audio fusion characteristic information of the at least two candidate audio data;
the identification module is used for respectively carrying out audio identification on the audio fusion characteristic information of the at least two candidate audio data and the audio characteristic information of the at least two candidate audio data by adopting at least two target audio identification models to obtain target audio data for carrying out music matching on the target video data; the target audio data belongs to the at least two candidate audio data;
and the recommending module is used for recommending the target audio data to the target object.
18. A computer device, comprising: a processor and a memory;
the processor is connected with the memory; the memory is for storing program code, and the processor is for calling the program code to perform the method of any of claims 1 to 16.
19. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any of claims 1-16.
20. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the steps of the method of any one of claims 1 to 16.
CN202111017197.6A 2021-08-31 2021-08-31 Audio data processing method, device, equipment and storage medium Pending CN115734024A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111017197.6A CN115734024A (en) 2021-08-31 2021-08-31 Audio data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111017197.6A CN115734024A (en) 2021-08-31 2021-08-31 Audio data processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115734024A true CN115734024A (en) 2023-03-03

Family

ID=85291830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111017197.6A Pending CN115734024A (en) 2021-08-31 2021-08-31 Audio data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115734024A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629456A (en) * 2023-07-20 2023-08-22 杭银消费金融股份有限公司 Method, system and storage medium for predicting overdue risk of service
CN116629456B (en) * 2023-07-20 2023-10-13 杭银消费金融股份有限公司 Method, system and storage medium for predicting overdue risk of service
CN116884429A (en) * 2023-09-05 2023-10-13 深圳市极客空间科技有限公司 Audio processing method based on signal enhancement
CN116884429B (en) * 2023-09-05 2024-01-16 深圳市极客空间科技有限公司 Audio processing method based on signal enhancement

Similar Documents

Publication Publication Date Title
CN111581437A (en) Video retrieval method and device
CN104813674B (en) System and method for optimizing video
CN113157965B (en) Audio visual model training and audio visual method, device and equipment
CN111626049B (en) Title correction method and device for multimedia information, electronic equipment and storage medium
CN111258995B (en) Data processing method, device, storage medium and equipment
WO2024001646A1 (en) Audio data processing method and apparatus, electronic device, program product, and storage medium
WO2021190174A1 (en) Information determining method and apparatus, computer device, and storage medium
KR20140108180A (en) systems and methods for accessing multi-media content
CN113573161B (en) Multimedia data processing method, device, equipment and storage medium
CN111310041B (en) Image-text publishing method, model training method and device and storage medium
CN112883731B (en) Content classification method and device
CN113469152B (en) Similar video detection method and device
CN111368141B (en) Video tag expansion method, device, computer equipment and storage medium
CN113766299B (en) Video data playing method, device, equipment and medium
CN115734024A (en) Audio data processing method, device, equipment and storage medium
CN112231563B (en) Content recommendation method, device and storage medium
CN110990600A (en) Multimedia file recommendation method, multimedia file recommendation device, multimedia file parameter adjustment device, multimedia file recommendation medium and electronic equipment
CN111883131B (en) Voice data processing method and device
WO2023197749A9 (en) Background music insertion time point determining method and apparatus, device, and storage medium
CN114443899A (en) Video classification method, device, equipment and medium
CN112015928A (en) Information extraction method and device of multimedia resource, electronic equipment and storage medium
CN114661951A (en) Video processing method and device, computer equipment and storage medium
Matsumoto et al. Music video recommendation based on link prediction considering local and global structures of a network
CN116976327A (en) Data processing method, device, computer equipment and readable storage medium
CN114363660B (en) Video collection determining method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40082748

Country of ref document: HK