CN113573161B - Multimedia data processing method, device, equipment and storage medium

Info

Publication number: CN113573161B
Authority: CN (China)
Prior art keywords: audio data, refrain, audio, frame, target
Legal status: Active
Application number: CN202111104105.8A
Other languages: Chinese (zh)
Other versions: CN113573161A (en)
Inventor: 冯鑫
Current assignee: Tencent Technology Shenzhen Co Ltd
Original assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111104105.8A
Publication of CN113573161A
Application granted; publication of CN113573161B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 Monomedia components thereof
    • H04N21/8106 Monomedia components thereof involving special audio data, e.g. different tracks for different languages
    • H04N21/8113 Monomedia components thereof involving special audio data, e.g. different tracks for different languages, comprising music, e.g. song in MP3 format
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose a multimedia data processing method, apparatus, device and storage medium, relating to machine learning technologies in artificial intelligence. The method includes the following steps: acquiring target audio data matched with target video data; performing audio feature extraction on the target audio data to obtain audio feature information of the target audio data; performing refrain identification on the target audio data according to the audio feature information to obtain a refrain segment of the target audio data; and extracting a key video clip from the target video data and fusing the key video clip with the refrain segment to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data. Through the present application, the efficiency and accuracy of obtaining the refrain segment can be effectively improved, which in turn improves the efficiency of obtaining the multimedia data.

Description

Multimedia data processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of machine learning technology in artificial intelligence, and in particular to a multimedia data processing method, apparatus, device and storage medium.
Background
With the development of internet technology, people can record and publish multimedia data (such as short videos) anytime and anywhere, and can watch multimedia data published by others. Generally, when creating multimedia data, a user needs to select background music that fits the theme of the video data and then fuse the background music with the video data to obtain the multimedia data. The background music can reinforce the theme of the multimedia data; for example, if the multimedia data is a dance video, the background music can strengthen the sense of rhythm of the dance, so that viewers can grasp the theme of the uploaded multimedia data more intuitively. Since the refrain segment of audio data has a strong sense of rhythm and is highly representative of the song, more and more users select the refrain segment of audio data as background music. At present, the refrain segment of audio data is mainly clipped manually: a user can obtain the refrain segment only by editing the audio data multiple times, so the refrain segment is obtained inefficiently, which in turn makes the multimedia data inefficient to obtain. Meanwhile, influenced by the subjective perception of the human ear, different users understand the refrain segment of the same audio data somewhat differently, so the accuracy of the manually obtained refrain segment is low.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present application is to provide a multimedia data processing method, apparatus, device and storage medium that can effectively improve the efficiency and accuracy of obtaining a refrain segment, and thereby improve the efficiency of obtaining multimedia data.
An embodiment of the present application provides a multimedia data processing method, including:
acquiring target audio data matched with target video data;
performing audio feature extraction on the target audio data to obtain audio feature information of the target audio data;
performing refrain identification on the target audio data according to the audio feature information of the target audio data to obtain a refrain segment of the target audio data; and
extracting a key video clip from the target video data, and fusing the key video clip with the refrain segment of the target audio data to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data.
An embodiment of the present application provides a multimedia data processing apparatus, including:
an acquisition module, configured to acquire target audio data matched with target video data;
an extraction module, configured to perform audio feature extraction on the target audio data to obtain audio feature information of the target audio data;
an identification module, configured to perform refrain identification on the target audio data according to the audio feature information of the target audio data to obtain a refrain segment of the target audio data; and
a fusion module, configured to extract a key video clip from the target video data and fuse the key video clip with the refrain segment of the target audio data to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data.
One aspect of the present application provides a computer device, comprising: a processor and a memory;
wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program to perform the following steps:
acquiring target audio data matched with target video data;
performing audio feature extraction on the target audio data to obtain audio feature information of the target audio data;
performing refrain identification on the target audio data according to the audio feature information of the target audio data to obtain a refrain segment of the target audio data; and
extracting a key video clip from the target video data, and fusing the key video clip with the refrain segment of the target audio data to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data.
An aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, perform the following steps:
acquiring target audio data matched with target video data;
performing audio feature extraction on the target audio data to obtain audio feature information of the target audio data;
performing refrain identification on the target audio data according to the audio feature information of the target audio data to obtain a refrain segment of the target audio data; and
extracting a key video clip from the target video data, and fusing the key video clip with the refrain segment of the target audio data to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data.
An aspect of the embodiments of the present application provides a computer program product comprising a computer program/instructions that, when executed by a processor, implement the steps of the above method.
In the present application, when multimedia data needs to be generated for target video data, a computer device can acquire target audio data matched with the target video data and perform audio feature extraction on the target audio data to obtain its audio feature information; the refrain segment of the target audio data is then identified automatically from this audio feature information, without manual participation, which improves the efficiency and accuracy of obtaining the refrain segment. Meanwhile, the audio feature information of the target audio data reflects its frequency parameter and energy parameter, that is, it reflects the music score information of the target audio data, and any audio data contains music score information, so the method applies to audio data of any type. After the refrain segment of the target audio data is obtained, the computer device can extract the key video clip of the target video data and fuse the key video clip with the refrain segment to obtain the multimedia data, thereby enhancing the sense of rhythm and the theme of the multimedia data.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be derived from these drawings without creative effort.
FIG. 1 is a block diagram of a multimedia data processing system provided by the present application;
FIG. 2 is a schematic diagram of a data interaction scenario between devices in a multimedia data processing system provided by the present application;
FIG. 3 is a flowchart of a multimedia data processing method provided by the present application;
FIG. 4 is a schematic diagram of a scenario for acquiring the audio feature information of each frame of audio data in target audio data provided by the present application;
FIG. 5 is a schematic diagram of another scenario for acquiring the audio feature information of each frame of audio data in target audio data provided by the present application;
FIG. 6 is a schematic diagram of a scenario for acquiring a refrain segment of target audio data based on a refrain recognition model provided by the present application;
FIG. 7 is a schematic diagram of a scenario for smoothing the candidate confidences of the frames of audio data in target audio data provided by the present application;
FIG. 8 is a flowchart of a multimedia data processing method provided by the present application;
FIG. 9 is a schematic structural diagram of a multimedia data processing apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
When a user creates multimedia data, in order to reinforce the theme of the multimedia data and strengthen its sense of rhythm, a refrain segment of audio data is usually selected as the background music of the video data, and the background music and the video data are then fused to obtain the multimedia data. If the refrain segment of the audio data is clipped manually, the user can obtain it only by editing the audio data multiple times, so the refrain segment is obtained inefficiently. Meanwhile, influenced by the subjective perception of the human ear, different users understand the refrain segment of the same audio data somewhat differently, so the accuracy of the obtained refrain segment is low. Based on this, the present application uses machine learning technology in artificial intelligence to extract the audio features of the target audio data, obtain its audio feature information, and automatically identify the refrain segment of the target audio data from this audio feature information. This way of identifying the refrain segment is more intelligent and automated, requires no manual participation, improves the efficiency of obtaining the refrain segment of the audio data, and thereby improves the efficiency of obtaining the multimedia data. Meanwhile, it avoids the low accuracy caused by the subjective perception of the human ear, improves the accuracy of obtaining the refrain segment, and thereby enhances the theme and rhythm of the multimedia data.
Artificial Intelligence (AI) is a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive discipline of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, intelligent transportation, and the like.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
To facilitate a clearer understanding of the present application, the multimedia data processing system implementing the multimedia data processing method of the present application is first introduced. As shown in fig. 1, the multimedia data processing system includes a server and a terminal.
The terminal may be a user-facing device and may include a multimedia data application platform (i.e., a multimedia data application program) for playing multimedia data and uploading captured video data and produced audio data. The multimedia data platform may be a multimedia website platform (such as a forum or post bar), a social application platform, a shopping application platform, a content interaction platform (such as a video application platform), and the like. The server may be a device that provides background services for multimedia data; specifically, it may process the audio data and video data on the terminal to obtain multimedia data, and upload the multimedia data to the multimedia data platform so that users can play it there.
The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), big data and artificial intelligence platforms. The terminal may be a smartphone, tablet computer, notebook computer, desktop computer, smart speaker, smart watch, smart television, vehicle-mounted terminal, or the like, but is not limited thereto. Each terminal and each server may be connected directly or indirectly through wired or wireless communication, and the number of terminals and servers may each be one or more, which is not limited here. The service scenarios to which the multimedia data processing system is applicable may specifically include video-on-demand scenarios, video teaching scenarios, live video scenarios, self-media video playing scenarios, and the like; the applicable service scenarios are not listed one by one here.
In the present application, multimedia data is a channel through which users acquire and transmit information; it may specifically be a short video or a non-short video, where a short video is video data whose playing duration is less than a playing duration threshold, and a non-short video is video data whose playing duration is greater than or equal to the playing duration threshold. That is, the multimedia data comprises a refrain segment of audio data and a key video clip of video data. The refrain segment of audio data is the climax part of the audio data; it has a strong sense of rhythm and is highly representative, and is usually composed of one lyric-and-melody passage, or a repeated one, in the audio data. Typically, a song plays from the first verse into the refrain, then connects the second verse back into the refrain, and so on. When a refrain is repeated, the lyrics of each repetition are usually identical, although some audio data makes certain changes to the lyrics in the repeated parts of the refrain. Much audio data ends with a refrain segment, or with the last line of a repeated refrain. The verse of the audio data is the part before the climax (the refrain segment); its role is to let the melody build slowly toward the climax while clearly laying out the story background expressed by the song.
Similarly, a key video clip of video data is a video clip that reflects the main content of the video data. The main content may reflect the central idea of the video data, or may be the video segments that attract users to watch the video data; that is, the main content may also be called the highlights of the video data. A highlight collection is a collection of highlight moments (i.e., highlight images) in the video data, and may be one or more frames of images in the video data. For example, in a video-on-demand scenario, the key video clips may be the video program segments a user selects out of interest from the program list, such as television series segments or movie segments the user is interested in. For another example, in a self-media video playing scenario, the key video clips may be short videos the user is interested in, including food short videos, travel short videos, daily life short videos, and the like. For another example, in a live video scenario, the key video clips may be live video segments of interest, including shopping live segments, conference live segments, sports event live segments, and the like.
For ease of understanding, please refer to fig. 2, which is a schematic diagram of a data interaction scenario provided in an embodiment of the present application. As shown in fig. 2, the data interaction process includes the following steps 1-3:
(1) The server obtains a key video clip of the target video data. The server may obtain the target video data from the terminal. The target video data is video data that needs to be edited to generate multimedia data (such as a short video); it may be video data captured by the terminal, or video data acquired by the terminal from the internet. In one implementation, the server may perform color feature extraction on the target video data to obtain its color feature information, and extract a key video clip from the target video data according to that color feature information; for example, the key video clip is a video clip with rich color feature information. In another implementation, the server may recognize the conversation content of the objects in the target video data to obtain the conversation content of each object, and extract the key video clip according to that conversation content; for example, the key video clip is a video clip that includes a classic line.
(2) The server obtains the refrain segment of the target audio data. The server may obtain the theme information of the target video data and, according to it, obtain the target audio data matched with the target video data. The theme information describes the main content expressed by the target video data; for example, it may be the title of the target video data, or a keyword in its conversation content. As shown in fig. 2, after the target audio data associated with the target video data is obtained, the server may perform audio feature extraction on the target audio data to obtain its audio feature information. The audio feature information reflects the spectral features of the target audio data, which include an energy parameter (or amplitude parameter) related to the loudness of the target audio data and a frequency parameter related to its pitch. The audio feature information may therefore describe the loudness, the pitch, or both. Pitch reflects how high or low a sound is and is mainly related to the frequency of the target audio data: if the frequency is high, the pitch is high; conversely, if the frequency is low, the pitch is low. Loudness reflects the strength of a sound and is related to the amplitude of the target audio data: the larger the vibration amplitude, the greater the energy and the greater the loudness; conversely, the smaller the vibration amplitude, the smaller the energy and the loudness. After obtaining the audio feature information of the target audio data, the server may perform refrain identification on the target audio data according to it to obtain the refrain segment of the target audio data.
(3) The server generates the multimedia data from the refrain segment of the target audio data and the key video clip of the target video data. After obtaining the refrain segment, the server may fuse it with the key video clip to obtain the multimedia data. Specifically, the playing duration of the key video clip and that of the refrain segment can be obtained: if they are the same, the key video clip and the refrain segment can be aligned to obtain the multimedia data; if they are different, the refrain segment can be scaled to obtain a processed refrain segment, and the processed refrain segment is aligned with the key video clip to obtain the multimedia data, as sketched below. After the multimedia data is obtained, the server can send it to the multimedia data application platform, from which each terminal can download and play it.
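A minimal sketch of this duration alignment follows. It is illustrative only: the function name is assumed, and trimming is taken as one possible reading of the scaling operation, which the patent does not specify further.

```python
def align_refrain_to_video(video_duration_s, refrain_start_s, refrain_end_s):
    """Align a refrain segment to a key video clip of known duration.

    Returns the (start, end) of the refrain audio to use as background
    music. If the durations differ, the refrain segment is scaled here
    by trimming (one possible reading of the patent's scaling step).
    """
    refrain_duration = refrain_end_s - refrain_start_s
    if refrain_duration == video_duration_s:
        return refrain_start_s, refrain_end_s              # already aligned
    if refrain_duration > video_duration_s:
        # Trim the refrain to the video length, keeping its beginning.
        return refrain_start_s, refrain_start_s + video_duration_s
    # Refrain shorter than the video: keep it whole; a caller could
    # instead loop or time-stretch it to fill the remaining duration.
    return refrain_start_s, refrain_end_s
```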
It should be noted that the above process of generating multimedia data may be executed by the server in fig. 1, by any terminal in fig. 1, or jointly by the server and the terminal; the present application does not limit this. For the process of generating multimedia data jointly by the terminal and the server, refer to the process of generating multimedia data by the server in fig. 2; repeated details are not described again. In particular, when the terminal and the server generate the multimedia data together, they each execute different steps of the generation, and the multimedia data is generated by the distributed system composed of the terminal and the server, which effectively reduces the multimedia processing pressure on each device and improves the efficiency of generating the multimedia data. For example, the server may extract the audio features of the target audio data to obtain its audio feature information, perform refrain identification on the target audio data according to that information to obtain the refrain segment, and send the refrain segment to the terminal; the terminal may then extract the key video clip from the target video data and fuse it with the refrain segment to obtain the multimedia data.
In summary, the present application identifies the refrain segment of the target audio data from the target audio data through its audio feature information, extracts the key video clip of the target video data, and fuses the key video clip with the refrain segment to obtain the multimedia data. This method can automatically identify the refrain segment without manual participation, improving the efficiency of obtaining the refrain segment and, in turn, the efficiency of obtaining the multimedia data. Meanwhile, it avoids the low accuracy caused by the subjective perception of the human ear, improves the accuracy of obtaining the refrain segment, and thereby enhances the theme and rhythm of the multimedia data.
Further, please refer to fig. 3, which is a flowchart of a multimedia data processing method according to an embodiment of the present application. As shown in fig. 3, the method may be performed by a computer device, which may be the terminal in fig. 1, the server in fig. 1, or both; that is, the method may be performed jointly by the terminal and the server in fig. 1. The multimedia data processing method may include the following steps S101-S104:
and S101, acquiring target audio data matched with the target video data.
In the application, the computer device may obtain the theme information of the target video data and the melody of the candidate audio data, and determine the candidate audio data in which the melody matches the theme information of the target video data in the candidate audio data as the target audio data matching the target video data. For example, the target video data is live video data of a sports event, and the target audio data may be songs with more exciting melody; for another example, if the target video data is a tour short video, the target audio data may be classical music with a relatively light and fast melody.
S102, performing audio feature extraction on the target audio data to obtain audio feature information of the target audio data.
Optionally, the computer device may perform frequency domain transformation on the target audio data to obtain the audio feature information of the target audio data. Specifically, step S102 may include the following steps s11-s14:
s11, performing framing processing on the target audio data to obtain multiple frames of audio data;
s12, performing frequency domain transformation on the multiple frames of audio data to obtain frequency domain information of each frame of audio data;
s13, performing audio feature extraction on the frequency domain information of each frame of audio data to obtain the audio feature information of each frame of audio data;
s14, determining the audio feature information of each frame of audio data as the audio feature information of the target audio data.
In steps s11-s14, as shown in fig. 4, to obtain finer-grained audio feature information of the target audio data, the computer device may obtain framing parameters and perform framing processing on the target audio data according to them to obtain multiple frames of audio data. The framing parameters include the frame length and the frame shift; they may be determined dynamically according to the target audio data or set manually. Each frame of audio data is a time domain signal; time domain signals are complex, making the audio feature information of the target audio data difficult to obtain, while frequency domain signals are simple, making it easy to obtain. Therefore, the computer device may perform frequency domain transformation on the multiple frames of audio data to obtain the frequency domain information of each frame, which reflects the amplitude parameter and the frequency parameter of the target audio data. Further, audio feature extraction may be performed on the frequency domain information of each frame of audio data to obtain the audio feature information of each frame, which is determined as the audio feature information of the target audio data. Framing the target audio data helps obtain finer-grained audio feature information and provides a richer amount of information for identifying the refrain segment; transforming each frame of audio data into the frequency domain reduces the difficulty of obtaining the audio feature information.
For example, as shown in fig. 4, assuming the target audio data is a song with a playing duration of 4 minutes, the frame length is 100 ms and the frame shift is 10 ms, the computer device may take the audio segment from the starting position of the target audio data to the 100th ms as the first audio frame, the audio segment from the 110th ms to the 210th ms as the second audio frame, and so on. After obtaining the multiple frames of audio data, the computer device may apply a window function (such as a Hamming window) to them to obtain the processed frames, and perform a Fourier transform on the processed frames to obtain the frequency domain information of each frame. The frequency domain information of each frame of audio data reflects the phase parameter and the amplitude parameter of that frame, that is, the relationship between its amplitude and phase. Further, audio feature extraction may be performed on the frequency domain information of each frame to obtain the audio feature information of each frame, which is determined as the audio feature information of the target audio data. A sketch of this framing and transformation follows.
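The Python sketch below frames a waveform, windows each frame with a Hamming window, and applies a Fourier transform, as in steps s11-s12. It is a minimal sketch under assumptions: the function name and parameter defaults are illustrative, and the frame shift is read in the conventional way as the hop between successive frame starts.

```python
import numpy as np

def frame_and_transform(samples, sample_rate, frame_len_ms=100, frame_shift_ms=10):
    """Split audio into overlapping frames and return per-frame spectra."""
    frame_len = int(sample_rate * frame_len_ms / 1000)
    hop = int(sample_rate * frame_shift_ms / 1000)
    window = np.hamming(frame_len)                  # windowing of each frame (s11)
    spectra = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        spectra.append(np.fft.rfft(frame))          # frequency domain information (s12)
    return np.array(spectra)    # shape: (num_frames, frame_len // 2 + 1)
```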
Optionally, step s13 may include the following steps s21-s22:
s21, determining the energy information of each frame of audio data according to the frequency domain information of each frame of audio data;
s22, filtering the energy information of each frame of audio data to obtain the audio feature information of each frame of audio data.
In steps s21-s22, the computer device may generate an energy spectrum curve of each frame of audio data according to its frequency domain information; the energy spectrum curve reflects the relationship between the frequency parameter and the energy parameter of each frame, and the energy information of each frame can be obtained from it. Because the target audio data includes noise, to avoid noise interference the computer device may filter the energy information of each frame to obtain its audio feature information. Filtering the energy information of each frame avoids noise interference and improves the accuracy of identifying the refrain segment; it also avoids subsequent processing of invalid noise, which saves processing resources of the computer device.
For example, as shown in fig. 5, the computer device may generate the energy spectrum curve of each frame of audio data according to its frequency domain information. Because the range of frequencies the human ear can perceive is limited, audio corresponding to frequencies the human ear cannot perceive is generally regarded as noise; the computer device can therefore build a filter according to the auditory characteristics of the human ear and use it to filter the energy spectrum of each frame, obtaining the audio feature information of each frame. That is, the energy information of audio frames whose frequencies lie within the filter's passbands is retained, and the energy information of audio frames whose frequencies lie outside is filtered out. The filter may be a Mel filter, which corresponds to a bank of triangular band-pass filters, as shown in fig. 5, and may be used to frequency-filter the audio data.
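A hedged sketch of steps s21-s22 follows, using the Mel filter bank from librosa as one common realization of the triangular band filters described above; the use of librosa and all parameter values are assumptions, not part of the patent.

```python
import numpy as np
import librosa

def mel_filter_energies(spectra, sample_rate, n_mels=40):
    """Filter the per-frame energy with a Mel filter bank (steps s21-s22)."""
    n_fft = (spectra.shape[1] - 1) * 2           # frame length used for the rfft
    power = np.abs(spectra) ** 2                 # energy information per frame (s21)
    mel_fb = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_mels)
    # Each row of mel_fb is one triangular band filter; energy at
    # frequencies outside the bank is attenuated, mimicking the
    # limited frequency range of human hearing (s22).
    return power @ mel_fb.T                      # shape: (num_frames, n_mels)
```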
S103, performing refrain identification on the target audio data according to the audio feature information of the target audio data to obtain the refrain segment of the target audio data.
In steps S102 and S103, the audio feature information of the target audio data reflects its frequency parameter and energy parameter; meanwhile, the audio feature information of the refrain segment differs considerably from that of the verse segments. For example, the refrain segment of the target audio data has relatively high frequency and energy, while the verse segments have relatively low frequency and energy. Therefore, the computer device can extract the audio features of the target audio data to obtain its audio feature information and perform refrain identification according to that information to obtain the refrain segment, which improves the accuracy of obtaining the refrain segment and can be applied flexibly in various refrain-extraction scenarios.
It should be noted that, classified by whether it includes lyrics, target audio data is of either a pure music type or a non-pure-music type: the pure music type includes no lyrics, only audio (i.e., the music score), while the non-pure-music type includes both lyrics and audio. If the refrain segment is obtained by analyzing the lyrics of the target audio data, the method is applicable only to target audio data of the non-pure-music type; for target audio data of the pure music type, it would yield a refrain segment of low accuracy, or no refrain segment at all. In contrast, in the present application, target audio data of every type has audio feature information, so obtaining the refrain segment by analyzing the audio feature information applies to a wider range of scenarios and improves the accuracy of obtaining the refrain segment.
It should be noted that the computer device may obtain the refrain segment of the target audio data in mode a or mode b. Mode a: analyze the energy parameter and frequency parameter of each frame of audio data to obtain the refrain segment of the target audio data. Mode b: obtain the refrain segment of the target audio data through a refrain recognition model.
Optionally, the audio feature information of each frame of audio data includes the energy parameter and the frequency parameter of that frame. When the computer device obtains the refrain segment of the target audio data in mode a, step S103 may include the following steps s31-s33:
s31, determining, from the multiple frames of audio data, multiple target audio frames whose frequency parameter is greater than a frequency threshold and whose energy parameter is greater than an energy threshold;
s32, determining the positional relationships among the target audio frames in the target audio data;
s33, generating the refrain segment of the target audio data from the target audio frames that have a continuously adjacent positional relationship.
In steps s31-s33, the refrain segment has higher frequency and higher energy, and its playing duration is related to the playing duration of the target audio data. Accordingly, the computer device may determine, from the multiple frames of audio data, multiple target audio frames whose frequency parameter is greater than the frequency threshold and whose energy parameter is greater than the energy threshold, and obtain the positional relationship of each target audio frame in the target audio data. The positional relationships include the continuously adjacent positional relationship and the non-continuously adjacent positional relationship: the continuously adjacent positional relationship means adjacency between target audio frames whose count is greater than a frame-count threshold, and the non-continuously adjacent positional relationship means adjacency between target audio frames whose count is less than the frame-count threshold, or target audio frames that are not adjacent at all. For example, with a frame-count threshold of 10 frames, if the 10th through 25th frames are all adjacent in the target audio data, those target audio frames are said to have a continuously adjacent positional relationship; if only the 10th through 15th frames are adjacent, those target audio frames are said to have a non-continuously adjacent positional relationship. The refrain segment of the target audio data can therefore be generated from the target audio frames with a continuously adjacent positional relationship; specifically, those target audio frames may be spliced in sequence according to their positions in the target audio data to obtain the refrain segment, as sketched below. The refrain segment is thus identified from multi-dimensional information, namely the frequency parameter, the energy parameter, and the positional relationships among the frames of audio data; that is, the scheme comprehensively considers the frequency and energy parameters of the preceding and following (i.e., adjacent) audio frames of each frame, which effectively improves the accuracy of identifying the refrain segment.
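A minimal sketch of mode a follows. The per-frame frequency and energy inputs, the threshold values, and the frame-count threshold are illustrative assumptions.

```python
import numpy as np

def refrain_frames_mode_a(freq, energy, freq_thresh, energy_thresh, min_run=10):
    """Mode a: pick frames whose frequency and energy exceed the thresholds
    (s31), then keep only runs of adjacent frames longer than the
    frame-count threshold, i.e. the continuously adjacent positional
    relationship (s32-s33)."""
    is_target = (freq > freq_thresh) & (energy > energy_thresh)
    segments, run_start = [], None
    for i, flag in enumerate(np.append(is_target, False)):   # sentinel ends runs
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start > min_run:           # enough adjacent target frames
                segments.append((run_start, i))   # [start, end) frame indices
            run_start = None
    return segments    # splice these frame ranges in order to form the refrain
```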
Optionally, when the computer device obtains the refrain segment of the target audio data in mode b, step S103 may include the following steps s41-s43:
s41, invoking the coding layer of the refrain recognition model to code the audio feature information of each frame of audio data, obtaining the coded value of each frame of audio data;
s42, invoking the confidence recognition layer of the refrain recognition model to recognize the coded value of each frame of audio data, obtaining the confidence that each frame of audio data belongs to an audio frame in the refrain segment;
s43, invoking the refrain recognition layer of the refrain recognition model to determine the refrain segment of the target audio data from the multiple frames of audio data according to the confidence of each frame of audio data.
In steps s41-s43, as shown in fig. 6, the refrain recognition model may include a feature extraction layer, a coding layer, a confidence recognition layer and a refrain recognition layer. The feature extraction layer extracts the audio feature information of the target audio data; the coding layer codes the audio feature information to obtain coded values; the confidence recognition layer determines, based on the coded value of each frame, the confidence that each frame belongs to an audio frame in the refrain segment; and the refrain recognition layer identifies the refrain segment of the target audio data based on the confidence of each frame. Specifically, as shown in fig. 6, the coding layer, the confidence recognition layer and the refrain recognition layer may be networks based on the attention mechanism, and may of course be other types of networks, which the present application does not limit. The computer device can obtain the refrain segment of the target audio data by invoking the refrain recognition model: it may invoke the feature extraction layer to extract the audio feature information of the target audio data, and invoke the coding layer to code the audio feature information of each frame to obtain the coded value of each frame. Further, the confidence recognition layer can be invoked to recognize the coded value of each frame and obtain the confidence that each frame belongs to an audio frame in the refrain segment; the confidence reflects the probability of this, that is, the higher the confidence, the higher the probability that the audio frame belongs to the refrain segment, and conversely, the lower the confidence, the lower that probability. The computer device can then determine the refrain segment from the multiple frames of audio data according to the confidence of each frame via the refrain recognition layer. Analyzing the audio feature information of each frame through the refrain recognition model automatically obtains the refrain segment of the target audio data without manual participation, improving the efficiency and accuracy of obtaining the refrain segment and saving resources.
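As one hedged possibility for the attention-based network of fig. 6, the PyTorch sketch below stacks a coding layer, a self-attention block and a per-frame confidence layer. The layer sizes, head count and activation choices are assumptions, not the patent's disclosed architecture.

```python
import torch
import torch.nn as nn

class RefrainRecognizer(nn.Module):
    """Illustrative coding + confidence layers in the spirit of fig. 6.
    Input: (batch, frames, feature_dim) audio feature information."""
    def __init__(self, feature_dim=40, hidden_dim=128, num_heads=4):
        super().__init__()
        self.encoder = nn.Linear(feature_dim, hidden_dim)        # coding layer
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads,
                                               batch_first=True)  # attention block
        self.confidence = nn.Linear(hidden_dim, 1)               # confidence layer

    def forward(self, features):
        coded = torch.relu(self.encoder(features))               # coded values (s41)
        attended, _ = self.attention(coded, coded, coded)        # self-attention
        # Per-frame confidence of belonging to the refrain segment (s42).
        return torch.sigmoid(self.confidence(attended)).squeeze(-1)
```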
It should be noted that, to improve the accuracy of the refrain recognition model, the computer device may train an initial refrain recognition model to obtain the refrain recognition model. Specifically, the computer device may obtain sample audio data and its labeled refrain segment, which may be obtained by having multiple users label the sample audio data. The initial refrain recognition model is used to predict the refrain of the sample audio data, yielding a predicted refrain segment; the initial model is then adjusted according to the predicted refrain segment and the labeled refrain segment, and the adjusted initial model is taken as the refrain recognition model. Because the labeled refrain segment can be obtained from multiple users' annotations, the low accuracy that would result from an individual user's deviating understanding is avoided; training the initial refrain recognition model on such labeled refrain segments therefore improves the accuracy of the refrain recognition model.
Further, adjusting the initial refrain recognition model according to the predicted refrain segment and the labeled refrain segment includes: the computer device may determine the recognition error of the initial refrain recognition model from the predicted refrain segment and the labeled refrain segment; if the recognition error is not in a converged state, the initial model is adjusted according to the recognition error to obtain an adjusted initial refrain recognition model; if the recognition error is in a converged state, the initial refrain recognition model is taken as the refrain recognition model.
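This training procedure can be illustrated with a standard supervised loop. The sketch below uses binary cross-entropy as the recognition error and a simple loss-change test for convergence; both are assumptions, since the patent specifies neither the loss nor the optimizer.

```python
import torch

def train_refrain_model(model, loader, epochs=10, lr=1e-3, tol=1e-4):
    """Fit the initial refrain recognition model to labeled samples.
    `loader` yields (features, labels) with per-frame labels in {0, 1}."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCELoss()                  # recognition error (assumed)
    prev_loss = float("inf")
    for _ in range(epochs):
        total = 0.0
        for features, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels.float())
            loss.backward()                       # adjust the initial model
            optimizer.step()
            total += loss.item()
        if abs(prev_loss - total) < tol:          # error is in a converged state
            break
        prev_loss = total
    return model                                  # the refrain recognition model
```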
Optionally, step s42 may include the following steps s51-s53:
s51, invoking the coding layer of the refrain recognition model to generate the audio feature vector of each frame of audio data according to its audio feature information;
s52, determining, according to the audio feature vector of each frame of audio data, the candidate confidence that each frame belongs to an audio frame in the refrain segment;
s53, smoothing the candidate confidences to obtain the confidence that each frame of audio data belongs to an audio frame in the refrain segment.
In steps s51-s53, the computer device may invoke the coding layer to generate the audio feature vector of each frame of audio data according to its audio feature information; that is, the audio feature vector is a two-dimensional vector comprising the frequency domain parameter and the energy parameter of the audio frame. The candidate confidence that each frame belongs to an audio frame in the refrain segment can then be determined from the audio feature vectors. Because the candidate confidences contain outliers, an accurate refrain segment is difficult to obtain from them directly; the computer device therefore smooths the candidate confidences to obtain the confidence that each frame belongs to an audio frame in the refrain segment. Smoothing the candidate confidences improves their accuracy and, in turn, the accuracy of obtaining the refrain segment.
Optionally, step s52 may include the following steps s61-s62:
s61, determining the inner products between the audio feature vectors as the audio weights of each frame of audio data;
s62, obtaining the candidate confidence that each frame of audio data belongs to an audio frame in the refrain segment according to the dot product between the audio weights and the audio feature vectors.
In steps s61-s62, the computer device may obtain the inner products between the audio feature vectors and determine them as the audio weights of each frame of audio data; the audio weights reflect the similarity between audio frames belonging to the refrain segment. The candidate confidence that each frame belongs to the refrain segment is then obtained from the dot product between the audio weights and the audio feature vectors. The inner products between the audio feature vectors reflect the relationship between an audio frame and the frames before and after it, so determining the candidate confidence from these inner products improves the accuracy of the candidate confidence and, in turn, the accuracy of obtaining the refrain segment.
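A numpy sketch of one possible reading of steps s61-s62 follows; the softmax normalization of the inner products and the final reduction to a scalar score are assumptions added to make the computation concrete.

```python
import numpy as np

def candidate_confidences(features):
    """features: (num_frames, dim) audio feature vectors.
    Inner products between frames give the audio weights (s61); the
    weighted features are then reduced to one score per frame (s62)."""
    weights = features @ features.T               # inner products (s61)
    weights = np.exp(weights - weights.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True) # normalization (assumption)
    attended = weights @ features                 # dot product with features (s62)
    scores = attended.mean(axis=1)                # reduce to a scalar (assumption)
    return 1.0 / (1.0 + np.exp(-scores))          # map to (0, 1) candidate confidences
```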
Optionally, the computer device may adjust the confidence of the audio frame to smooth the candidate confidence of each frame of audio data; specifically, the step s53 may include the following steps s 71-s 73:
and s71, acquiring the frame length of the audio frames in the multi-frame audio data, and generating a plurality of audio detection time periods according to the frame length.
And s72, counting the total candidate confidence of the audio frame in each audio detection time segment in the plurality of audio detection time segments.
s73, adjusting the candidate confidence of the audio frame in each audio detection time segment according to the total candidate confidence to obtain the confidence of the audio frame of each frame of audio data belonging to the refrain segment.
In steps s 71-s 73, the computer device may obtain the frame lengths of the audio frames in the multiple frames of audio data, where the lengths of the audio frames of the frames of audio data are the same, and may generate multiple audio detection time periods according to the frame lengths, where for example, the frame length is 0.1s, and the audio detection time periods may include [0, 1] s, [1, 2] s, [2, 3] s, and so on. Further, the total candidate confidence of the audio frames in each audio detection time period may be counted, and if the total candidate confidence is greater than the candidate confidence threshold, the candidate confidence of the audio frames in the audio detection time period is adjusted to the first candidate confidence; if the total candidate confidence is less than or equal to the candidate confidence threshold, adjusting the candidate confidence of the audio frame in the audio detection time period to be a second candidate confidence; the first candidate confidence is greater than the second candidate confidence. By adjusting the candidate confidence degrees of the audio frames in each audio detection time period according to the total candidate confidence degree, the problem that the accuracy of the acquired refrain fragment is low due to abnormal candidate confidence degrees can be avoided, and the acquisition accuracy of the refrain fragment is improved.
It should be noted that the computer device may instead determine the average candidate confidence of the audio frames in each audio detection time period from the total candidate confidence, and adjust the candidate confidences accordingly to obtain the confidence that each frame of audio data belongs to an audio frame in the refrain segment. For example, as shown in fig. 7, the computer device may count the total candidate confidence of the audio frames within each audio detection time period and divide it by the number of audio frames in that period to obtain the average candidate confidence. If the average candidate confidence is greater than the candidate confidence threshold, the candidate confidence of the audio frames in the period is adjusted to the first candidate confidence; otherwise it is adjusted to the second candidate confidence. Assuming the candidate confidence threshold is 0.8, as shown in fig. 7, if the average candidate confidence of the audio frames within an audio detection time period is greater than 0.8, the candidate confidences of all audio frames in that period may be adjusted to 1; if it is less than or equal to 0.8, they may all be adjusted to 0.
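A rough sketch of this windowed smoothing follows, using the 0.1s frame length and 0.8 threshold from the example above; the function and parameter names are assumptions, and the window length is taken from the [k, k+1]s periods in the example.

```python
import numpy as np

def smooth_confidences(cand_conf: np.ndarray, frame_len: float = 0.1,
                       window_len: float = 1.0, threshold: float = 0.8) -> np.ndarray:
    """Smooth candidate confidences over fixed audio detection time periods."""
    # Frames per detection period, e.g. ten 0.1 s frames per [k, k+1] s period.
    per_window = max(1, int(round(window_len / frame_len)))
    conf = np.zeros(len(cand_conf))
    for start in range(0, len(cand_conf), per_window):
        window = cand_conf[start:start + per_window]
        # Total candidate confidence of the period, then its average.
        avg = float(np.sum(window)) / len(window)
        # All frames in a period above the threshold become 1, else 0.
        conf[start:start + per_window] = 1.0 if avg > threshold else 0.0
    return conf
```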
Optionally, the computer device may determine the refrain segment according to the confidence of the audio frames and the position relationship between the audio frames; specifically, the step s43 may include the following steps s81-s83:
s81, calling the refrain recognition layer of the refrain recognition model to determine, from the multi-frame audio data, a plurality of candidate audio frames whose confidence is greater than the confidence threshold.
s82, determining the position relation among the candidate audio frames in the target audio data.
s83, generating the refrain segment of the target audio data according to the candidate audio frames with continuous adjacent position relation in the candidate audio frames.
In steps s81-s83, as shown in fig. 7, the computer device may invoke the refrain recognition layer to determine a plurality of candidate audio frames from the multi-frame audio data, where the confidence of each candidate audio frame is greater than the confidence threshold, and may then acquire the position relationship between the plurality of candidate audio frames in the target audio data. The position relationship includes a continuous adjacent position relationship and a discontinuous adjacent position relationship: a continuous adjacent position relationship means that a run of positionally adjacent audio frames is longer than the frame number threshold, while a discontinuous adjacent position relationship means that the run is shorter than the frame number threshold or that the audio frames are not positionally adjacent at all. The computer device can therefore generate the refrain segment of the target audio data from the candidate audio frames having a continuous adjacent position relationship. Identifying the refrain segment from multi-dimensional information such as the per-frame confidence and the position relationship effectively improves the accuracy of refrain segment identification.
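The grouping of candidate frames into segments could look like the sketch below; the confidence threshold and the frame-number threshold are free parameters here, not values taken from the patent.

```python
def refrain_segments(conf, conf_threshold=0.5, min_frames=10):
    """Group candidate audio frames into refrain segments.

    A frame is a candidate when its confidence exceeds conf_threshold;
    runs of positionally adjacent candidates longer than min_frames (the
    frame-number threshold) have a continuous adjacent position
    relationship and are emitted as (start_frame, end_frame) pairs.
    """
    segments, run_start = [], None
    for i, c in enumerate(list(conf) + [0.0]):  # sentinel closes a final run
        if c > conf_threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > min_frames:
                segments.append((run_start, i - 1))
            run_start = None
    return segments
```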
S104, extracting a key video clip from the target video data, and fusing the key video clip with the refrain segment of the target audio data to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data.
In the application, the computer device may extract a key video clip from the target video data, the key video clip being a video clip that reflects the main content of the target video data. The refrain segment of the target audio data may then be used as the background music of the key video clip: the refrain segment and the key video clip are fused to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data.
In the application, when multimedia data needs to be generated for the target video data, the computer device can acquire target audio data matched with the target video data, perform audio feature extraction on the target audio data to obtain its audio feature information, and then automatically identify the refrain segment of the target audio data from that audio feature information, without manual intervention, which improves both the efficiency and the accuracy of acquiring the refrain segment. Meanwhile, the audio feature information of the target audio data reflects its frequency parameters and energy parameters, that is, it reflects the music score information of the target audio data, and any audio data includes music score information. After the refrain segment of the target audio data is obtained, the computer device can extract the key video clip of the target video data and fuse the key video clip with the refrain segment to obtain the multimedia data, which enhances the sense of rhythm and the theme of the multimedia data.
Further, please refer to fig. 8, which is a flowchart illustrating a multimedia data processing method according to an embodiment of the present application. As shown in fig. 8, the method may be performed by a computer device, which may be the terminal in fig. 1, the server in fig. 1, or both, that is, the method may be performed jointly by the terminal and the server in fig. 1. The multimedia data processing method may include the following steps S201-S206:
S201, acquiring target audio data matched with the target video data.
S202, extracting audio features of the target audio data to obtain audio feature information of the target audio data.
S203, performing refrain identification on the target audio data according to the audio feature information of the target audio data, to obtain a refrain segment of the target audio data.
It should be noted that, in the present application, the explanation of step S201 may refer to that of step S101 in fig. 3, the explanation of step S202 to that of step S102 in fig. 3, and the explanation of step S203 to that of step S103 in fig. 3; repeated details are not described again.
S204, extracting color feature information of each frame of video data in the target video data.
S205, determining a key video clip from the target video data according to the color feature information.
In steps S204-S205, the color feature information of the video data includes brightness, saturation, chroma, and the like, and the key video frames in a key video clip are rich in color feature information. The computer device can therefore extract the color feature information of each frame of video data in the target video data and determine the key video clip in the target video data according to that information. Extracting the key video clip through the per-frame color feature information improves the accuracy of key video clip extraction, lets a user quickly grasp the main content of the target video data, and highlights its theme.
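As a rough illustration of selecting a key clip by color richness: the score below sums hue spread (a proxy for chroma), mean saturation and mean brightness per HSV frame, and slides a fixed-length window over the video. The scoring formula and the clip length are assumptions; the patent only states that key video frames are rich in color feature information.

```python
import numpy as np

def key_clip(hsv_frames, clip_len=90):
    """Pick the window of frames with the richest color feature information.

    hsv_frames: a sequence of HSV frames, each shaped (H, W, 3).
    Returns (start_frame, end_frame) of the selected key video clip.
    """
    # Per-frame color richness: hue spread + mean saturation + mean brightness.
    scores = np.array([f[..., 0].std() + f[..., 1].mean() + f[..., 2].mean()
                       for f in hsv_frames])
    clip_len = min(clip_len, len(scores))
    # Sliding-window sum: the clip with the highest total score wins.
    window_sums = np.convolve(scores, np.ones(clip_len), mode="valid")
    start = int(window_sums.argmax())
    return start, start + clip_len - 1
```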
S206, fusing the key video clip with the refrain segment of the target audio data to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data.
Optionally, the computer device may fuse the key video clip and the refrain segment according to the playing duration of each, to obtain the multimedia data; specifically, the step S206 may include the following steps s91-s93:
s91, obtaining a first playing duration of the key video clip and a second playing duration of the refrain segment of the target audio data.
s92, if the first playing duration is different from the second playing duration, scaling the refrain segment of the target audio data according to the first playing duration to obtain a processed refrain segment, and aligning the processed refrain segment with the key video clip to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data.
s93, if the first playing duration is the same as the second playing duration, aligning the refrain segment with the key video clip to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data.
In steps s91-s93, the computer device may obtain the first playing duration of the key video clip and the second playing duration of the refrain segment of the target audio data. If the two durations differ, the playing duration of the refrain segment is greater or less than that of the key video clip, so the refrain segment may be scaled according to the first playing duration to obtain a processed refrain segment, which is then aligned with the key video clip to obtain the multimedia data. If the two durations are the same, the refrain segment is aligned with the key video clip directly. Scaling the refrain segment according to the first playing duration makes the playing duration of the processed refrain segment equal to that of the key video clip, which helps improve the playing effect of the multimedia data.
Optionally, the step s92 may include the following steps s94 and s95:
s94, if the second playing duration is greater than the first playing duration, cutting the refrain segment of the target audio data according to the first playing duration to obtain a processed refrain segment.
s95, if the second playing duration is less than the first playing duration, extending the refrain segment of the target audio data according to the first playing duration to obtain a processed refrain segment.
In steps s94 and s95, if the second playing duration is greater than the first playing duration, a refrain sub-segment whose playing length equals the first playing duration may be cut from the starting moment of the refrain segment and used as the processed refrain segment. If the second playing duration is less than the first playing duration, a second refrain sub-segment is cut from the refrain segment according to the first playing duration and added at the middle or at the end of the refrain segment to obtain the processed refrain segment; in both cases the audio at the starting position of the refrain segment is kept unchanged. Because the audio at the start of the refrain seizes the audience's interest, and the audio frames within the refrain transition smoothly, keeping the start unchanged helps capture that interest and lets the refrain push the plot of the key video clip smoothly to its climax.
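A minimal sketch of this duration alignment on raw audio samples follows, assuming the extension is appended at the end (the patent also allows inserting it in the middle); the function name and the sample-level representation are illustrative.

```python
import numpy as np

def align_refrain(refrain: np.ndarray, sr: int, first_play_duration: float) -> np.ndarray:
    """Scale a refrain segment to match the key video clip's playing duration.

    refrain: 1-D array of audio samples; sr: sample rate in Hz;
    first_play_duration: the key video clip's duration in seconds.
    """
    target = int(round(first_play_duration * sr))
    if len(refrain) >= target:
        # Cut: keep the opening of the refrain, which hooks the listener.
        return refrain[:target]
    out = refrain
    while len(out) < target:
        # Extend: append a second sub-segment cut from the refrain itself,
        # keeping the audio at the starting position unchanged.
        out = np.concatenate([out, refrain[:target - len(out)]])
    return out
```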
In the application, when multimedia data needs to be generated for the target video data, the computer device can acquire target audio data matched with the target video data, perform audio feature extraction on the target audio data to obtain its audio feature information, and then automatically identify the refrain segment of the target audio data from that audio feature information, without manual intervention, which improves both the efficiency and the accuracy of acquiring the refrain segment. Meanwhile, the audio feature information of the target audio data reflects its frequency parameters and energy parameters, that is, it reflects the music score information of the target audio data, and any audio data includes music score information. After the refrain segment of the target audio data is obtained, the computer device can extract color feature information of the target video data, extract the key video clip of the target video data according to the color feature information, and fuse the key video clip with the refrain segment of the target audio data to obtain the multimedia data, which enhances the sense of rhythm and the theme of the multimedia data.
Please refer to fig. 9, which is a schematic structural diagram of a multimedia data processing apparatus according to an embodiment of the present application. The multimedia data processing apparatus may be a computer program (including program code) running on a computer device, for example, the multimedia data processing apparatus is an application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. As shown in fig. 9, the multimedia data processing apparatus may include: an acquisition module 901, an extraction module 902, a recognition module 903, and a fusion module 904.
The acquisition module is used for acquiring target audio data matched with the target video data;
the extraction module is used for extracting audio features of the target audio data to obtain audio feature information of the target audio data;
the identification module is used for carrying out refrain identification on the target audio data according to the audio characteristic information of the target audio data to obtain a refrain segment of the target audio data;
and the fusion module is used for extracting a key video clip from the target video data, and fusing the key video clip with the refrain segment of the target audio data to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data.
Optionally, the extracting module performs audio feature extraction on the target audio data to obtain audio feature information of the target audio data, including:
performing framing processing on the target audio data to obtain multi-frame audio data;
performing frequency domain transformation on the multi-frame audio data to obtain frequency domain information of each frame of audio data in the multi-frame audio data;
performing audio characteristic extraction on frequency domain information of each frame of audio data in the multi-frame of audio data to obtain audio characteristic information of each frame of audio data;
and determining the audio characteristic information of each frame of audio data as the audio characteristic information of the target audio data.
Optionally, the extracting module performs audio feature extraction on the frequency domain information of each frame of audio data in the multi-frame audio data to obtain the audio feature information of each frame of audio data, including:
determining energy information of each frame of audio data according to the frequency domain information of each frame of audio data in the multi-frame of audio data;
and filtering the energy information of each frame of audio data to obtain the audio feature information of each frame of audio data (a rough sketch of this feature-extraction pipeline follows).
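The extraction module's pipeline (framing, frequency-domain transformation, energy computation, filtering) might be sketched as follows; the 0.1s frame length, the rectangular framing and the log filter are assumptions standing in for whatever windowing and filter bank an implementation would actually choose.

```python
import numpy as np

def audio_features(signal: np.ndarray, sr: int, frame_len: float = 0.1) -> np.ndarray:
    """Per-frame (frequency parameter, energy parameter) features.

    signal: 1-D array of audio samples; sr: sample rate in Hz.
    """
    # Framing: split the signal into fixed-length frames.
    n = int(sr * frame_len)
    n_frames = len(signal) // n
    frames = signal[:n_frames * n].reshape(n_frames, n)
    # Frequency-domain transformation of each frame.
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    # Energy information derived from the frequency-domain information.
    energy = spectrum ** 2
    # A simple log compression stands in for the filtering step; the
    # patent does not name a specific filter.
    filtered = np.log1p(energy)
    # One frequency parameter (dominant bin) and one energy parameter per frame.
    dominant_freq = filtered.argmax(axis=1) * sr / n
    frame_energy = energy.sum(axis=1)
    return np.stack([dominant_freq, frame_energy], axis=1)
```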
Optionally, the audio feature information of each frame of audio data includes an energy parameter and a frequency parameter of each frame of audio data, and the identifying module performs refrain identification on the target audio data according to the audio feature information of the target audio data to obtain a refrain segment of the target audio data, including:
determining a plurality of target audio frames of which the frequency parameters are greater than a frequency threshold and the energy parameters are greater than an energy threshold from the multi-frame audio data;
determining a positional relationship in the target audio data between target audio frames of the plurality of target audio frames;
and generating a refrain fragment of the target audio data according to the target audio frames with continuous adjacent position relation in the plurality of target audio frames.
Optionally, the identifying module performs refrain identification on the target audio data according to the audio feature information of the target audio data to obtain a refrain segment of the target audio data, including:
calling a coding layer of a refrain identification model to code the audio characteristic information of each frame of audio data to obtain a coding value of each frame of audio data;
calling a confidence coefficient recognition layer of the refrain recognition model to recognize the coding value of each frame of audio data to obtain the confidence coefficient of each frame of audio data belonging to the audio frame in the refrain fragment;
and calling a refrain identification layer of the refrain identification model to determine a refrain segment of the target audio data from the multi-frame audio data according to the confidence coefficient of each frame of audio data.
Optionally, the step of the recognition module calling a coding layer of a refrain recognition model to code the audio feature information of each frame of audio data to obtain the coding value of each frame of audio data includes:
calling a coding layer of a refrain identification model to generate audio feature vectors of the audio data of each frame according to the audio feature information of the audio data of each frame;
determining the candidate confidence of the audio frame of each frame of audio data belonging to the refrain fragment according to the audio characteristic vector of each frame of audio data;
and smoothing the candidate confidence coefficient to obtain the confidence coefficient of the audio frame of each frame of audio data belonging to the refrain fragment.
Optionally, the determining, by the identification module, the candidate confidence that each frame of audio data belongs to an audio frame in a refrain fragment according to the audio feature vector of each frame of audio data includes:
determining the inner product between the audio feature vectors as the audio weight of each frame of audio data;
and obtaining the candidate confidence that each frame of audio data belongs to an audio frame in the refrain segment according to the dot product between the audio weight and the audio feature vector.
Optionally, the smoothing processing is performed on the candidate confidence degrees by the identification module to obtain the confidence degree that each frame of audio data belongs to an audio frame in a refrain fragment, where the smoothing processing includes:
acquiring the frame length of an audio frame in the multi-frame audio data, and generating a plurality of audio detection time periods according to the frame length;
counting the total candidate confidence of the audio frames in each audio detection time period in the plurality of audio detection time periods;
and adjusting the candidate confidence of the audio frame in each audio detection time period according to the total candidate confidence to obtain the confidence that the audio data of each frame belongs to the audio frame in the refrain fragment.
Optionally, the invoking, by the recognition module, of the refrain recognition layer of the refrain recognition model to determine the refrain segment of the target audio data from the multi-frame audio data according to the confidence of each frame of audio data includes:
calling the refrain identification layer to determine a plurality of candidate audio frames with confidence degrees larger than a confidence degree threshold value from the multi-frame audio data;
determining a positional relationship in the target audio data between candidate audio frames of the plurality of candidate audio frames;
and generating the refrain fragment of the target audio data according to the candidate audio frames with continuous adjacent position relation in the candidate audio frames.
Optionally, the fusing module fuses the key video clip and the refrain segment of the target audio data to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data, including:
acquiring a first playing duration of the key video clip and a second playing duration of the refrain segment of the target audio data;
if the first playing duration is different from the second playing duration, scaling the refrain segment of the target audio data according to the first playing duration to obtain a processed refrain segment, and aligning the processed refrain segment with the key video clip to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data;
and if the first playing duration is the same as the second playing duration, aligning the refrain segment with the key video clip to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data.
Optionally, the scaling, by the fusion module, of the refrain segment of the target audio data according to the first playing duration to obtain a processed refrain segment includes:
if the second playing duration is greater than the first playing duration, cutting the refrain segment of the target audio data according to the first playing duration to obtain a processed refrain segment;
and if the second playing duration is less than the first playing duration, extending the refrain segment of the target audio data according to the first playing duration to obtain a processed refrain segment.
Optionally, the extracting, by the fusion module, a key video clip from the target video data includes:
extracting color characteristic information of each frame of video data in the target video data;
and determining a key video clip from the target video data according to the color characteristic information.
According to an embodiment of the present application, the steps involved in the multimedia data processing method shown in fig. 3 may be performed by various modules in the multimedia data processing apparatus shown in fig. 9. For example, step S101 shown in fig. 3 may be performed by the obtaining module 901 in fig. 9, and step S102 shown in fig. 3 may be performed by the extracting module 902 in fig. 9; step S103 shown in fig. 3 may be performed by the identification module 903 in fig. 9; step S104 shown in fig. 3 may be performed by the fusion module 904 in fig. 9.
Similarly, the steps involved in the multimedia data processing method shown in fig. 8 may be performed by various modules in the multimedia data processing apparatus shown in fig. 9. For example, step S201 shown in fig. 8 may be performed by the obtaining module 901 in fig. 9, and step S202 shown in fig. 8 may be performed by the extracting module 902 in fig. 9; step S203 shown in fig. 8 may be performed by the identifying module 903 in fig. 9; steps S204-S206 shown in FIG. 8 can be performed by the fusion module 904 of FIG. 9.
According to an embodiment of the present application, the modules of the multimedia data processing apparatus shown in fig. 9 may be combined, separately or entirely, into one or several units, or one of the units may be further split into multiple sub-units of smaller function, which can implement the same operations without affecting the technical effects of the embodiments of the present application. The modules are divided based on logical functions; in practical applications, the function of one module may be realized by multiple units, or the functions of multiple modules may be realized by one unit. In other embodiments of the present application, the multimedia data processing apparatus may also include other units, and in practical applications these functions may be realized with the assistance of other units or through the cooperation of multiple units.
According to an embodiment of the present application, the multimedia data processing apparatus shown in fig. 9 may be constructed by running a computer program (including program code) capable of executing the steps of the methods shown in fig. 3 and fig. 8 on a general-purpose computer device that includes processing elements and storage elements such as a Central Processing Unit (CPU), a random access memory (RAM) and a read-only memory (ROM), thereby implementing the multimedia data processing method of the embodiments of the present application. The computer program may be recorded on, for example, a computer-readable recording medium, loaded into the computer device via that medium, and executed therein.
In the application, when multimedia data needs to be generated for the target video data, the computer device can acquire target audio data matched with the target video data, perform audio feature extraction on the target audio data to obtain its audio feature information, and then automatically identify the refrain segment of the target audio data from that audio feature information, without manual intervention, which improves both the efficiency and the accuracy of acquiring the refrain segment. Meanwhile, the audio feature information of the target audio data reflects its frequency parameters and energy parameters, that is, it reflects the music score information of the target audio data, and any audio data includes music score information. After the refrain segment of the target audio data is obtained, the computer device can extract the key video clip of the target video data and fuse the key video clip with the refrain segment to obtain the multimedia data, which enhances the sense of rhythm and the theme of the multimedia data.
Fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 10, the computer apparatus 1000 may include: the processor 1001, the network interface 1004, and the memory 1005, and the computer apparatus 1000 may further include: a user interface 1003, and at least one communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display) and a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory). The memory 1005 may optionally be at least one memory device located remotely from the processor 1001. As shown in fig. 10, a memory 1005, which is a kind of computer-readable storage medium, may include therein an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 10, the network interface 1004 may provide a network communication function; the user interface 1003 is an interface for providing a user with input; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:
acquiring target audio data matched with the target video data;
performing audio feature extraction on the target audio data to obtain audio feature information of the target audio data;
according to the audio feature information of the target audio data, performing refrain identification on the target audio data to obtain a refrain segment of the target audio data;
extracting a key video clip from the target video data, and fusing the key video clip with the refrain clip of the target audio data to obtain multimedia data comprising the key video clip and the refrain clip of the target audio data.
Optionally, the processor 1001 may be configured to invoke a device control application program stored in the memory 1005, so as to perform audio feature extraction on the target audio data, and obtain audio feature information of the target audio data, where the method includes:
performing framing processing on the target audio data to obtain multi-frame audio data;
performing frequency domain transformation on the multi-frame audio data to obtain frequency domain information of each frame of audio data in the multi-frame audio data;
performing audio characteristic extraction on frequency domain information of each frame of audio data in the multi-frame of audio data to obtain audio characteristic information of each frame of audio data;
and determining the audio characteristic information of each frame of audio data as the audio characteristic information of the target audio data.
Optionally, the processor 1001 may be configured to invoke a device control application program stored in the memory 1005, so as to implement audio feature extraction on the frequency domain information of each frame of audio data in the multi-frame audio data to obtain the audio feature information of each frame of audio data, including:
determining energy information of each frame of audio data according to the frequency domain information of each frame of audio data in the multi-frame of audio data;
and filtering the energy information of each frame of audio data to obtain the audio characteristic information of each frame of audio data.
Optionally, the audio feature information of each frame of audio data includes an energy parameter and a frequency parameter of each frame of audio data, and the processor 1001 may be configured to invoke a device control application program stored in the memory 1005, so as to implement refrain identification on the target audio data according to the audio feature information of the target audio data, to obtain a refrain segment of the target audio data, including:
determining a plurality of target audio frames of which the frequency parameters are greater than a frequency threshold and the energy parameters are greater than an energy threshold from the multi-frame audio data;
determining a positional relationship in the target audio data between target audio frames of the plurality of target audio frames;
and generating a refrain fragment of the target audio data according to the target audio frames with continuous adjacent position relation in the plurality of target audio frames.
Optionally, the processor 1001 may be configured to invoke the device control application stored in the memory 1005, so as to implement refrain identification on the target audio data according to the audio feature information of the target audio data, and obtain a refrain segment of the target audio data, where the method includes:
calling a coding layer of a refrain identification model to code the audio characteristic information of each frame of audio data to obtain a coding value of each frame of audio data;
calling a confidence coefficient recognition layer of the refrain recognition model to recognize the coding value of each frame of audio data to obtain the confidence coefficient of each frame of audio data belonging to the audio frame in the refrain fragment;
and calling a refrain identification layer of the refrain identification model to determine a refrain segment of the target audio data from the multi-frame audio data according to the confidence coefficient of each frame of audio data.
Optionally, the processor 1001 may be configured to invoke a device control application program stored in the memory 1005, so as to implement invoking the coding layer of the refrain identification model to encode the audio feature information of each frame of audio data to obtain the coding value of each frame of audio data, including:
calling a coding layer of a refrain identification model to generate audio feature vectors of the audio data of each frame according to the audio feature information of the audio data of each frame;
determining the candidate confidence of the audio frame of each frame of audio data belonging to the refrain fragment according to the audio characteristic vector of each frame of audio data;
and smoothing the candidate confidence coefficient to obtain the confidence coefficient of the audio frame of each frame of audio data belonging to the refrain fragment.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005, so as to determine the candidate confidence level that each frame of audio data belongs to an audio frame in a refrain segment according to the audio feature vector of each frame of audio data, where the determining includes:
determining the inner product between the audio feature vectors as the audio weight of each frame of audio data;
and obtaining the candidate confidence that each frame of audio data belongs to an audio frame in the refrain segment according to the dot product between the audio weight and the audio feature vector.
Optionally, the processor 1001 may be configured to invoke a device control application program stored in the memory 1005, so as to implement smoothing processing on the candidate confidence level, and obtain the confidence level that each frame of audio data belongs to an audio frame in a refrain fragment, where the method includes:
acquiring the frame length of an audio frame in the multi-frame audio data, and generating a plurality of audio detection time periods according to the frame length;
counting the total candidate confidence of the audio frames in each audio detection time period in the plurality of audio detection time periods;
and adjusting the candidate confidence of the audio frame in each audio detection time period according to the total candidate confidence to obtain the confidence that the audio data of each frame belongs to the audio frame in the refrain fragment.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005, so as to implement invoking the refrain recognition layer of the refrain recognition model to determine the refrain segment of the target audio data from the multi-frame audio data according to the confidence of each frame of audio data, including:
calling the refrain identification layer to determine a plurality of candidate audio frames with confidence degrees larger than a confidence degree threshold value from the multi-frame audio data;
determining a positional relationship in the target audio data between candidate audio frames of the plurality of candidate audio frames;
and generating the refrain fragment of the target audio data according to the candidate audio frames with continuous adjacent position relation in the candidate audio frames.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005, so as to implement fusion of the key video clip and the refrain clip of the target audio data, and obtain multimedia data including the key video clip and the refrain clip of the target audio data, including:
acquiring a first playing duration of the key video clip and a second playing duration of the refrain segment of the target audio data;
if the first playing duration is different from the second playing duration, scaling the refrain segment of the target audio data according to the first playing duration to obtain a processed refrain segment, and aligning the processed refrain segment with the key video clip to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data;
and if the first playing duration is the same as the second playing duration, aligning the refrain segment with the key video clip to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data.
Optionally, the processor 1001 may be configured to invoke a device control application program stored in the memory 1005, so as to implement scaling of the refrain segment of the target audio data according to the first playing duration to obtain a processed refrain segment, including:
if the second playing duration is greater than the first playing duration, cutting the refrain segment of the target audio data according to the first playing duration to obtain a processed refrain segment;
and if the second playing duration is less than the first playing duration, extending the refrain segment of the target audio data according to the first playing duration to obtain a processed refrain segment.
Optionally, the processor 1001 may be configured to call a device control application stored in the memory 1005, so as to extract a key video clip from the target video data, including:
extracting color characteristic information of each frame of video data in the target video data;
and determining a key video clip from the target video data according to the color characteristic information.
In the application, when multimedia data needs to be generated for the target video data, the computer device can acquire target audio data matched with the target video data, perform audio feature extraction on the target audio data to obtain its audio feature information, and then automatically identify the refrain segment of the target audio data from that audio feature information, without manual intervention, which improves both the efficiency and the accuracy of acquiring the refrain segment. Meanwhile, the audio feature information of the target audio data reflects its frequency parameters and energy parameters, that is, it reflects the music score information of the target audio data, and any audio data includes music score information. After the refrain segment of the target audio data is obtained, the computer device can extract the key video clip of the target video data and fuse the key video clip with the refrain segment to obtain the multimedia data, which enhances the sense of rhythm and the theme of the multimedia data.
It should be understood that the computer device 1000 described in this embodiment of the present application may perform the description of the multimedia data processing method in the embodiment corresponding to fig. 3 and fig. 8, and may also perform the description of the multimedia data processing apparatus in the embodiment corresponding to fig. 9, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, here, it is to be noted that: an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program executed by the aforementioned multimedia data processing apparatus, and the computer program includes program instructions, and when the processor executes the program instructions, the description of the multimedia data processing method in the embodiment corresponding to fig. 3 and fig. 8 can be executed, so that details are not repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in embodiments of the computer-readable storage medium referred to in the present application, reference is made to the description of embodiments of the method of the present application.
By way of example, the program instructions described above may be executed on one computer device, or on multiple computer devices located at one site, or distributed across multiple sites and interconnected by a communication network, which may comprise a blockchain network.
An embodiment of the present application further provides a computer program product, which includes a computer program/instruction, and when the computer program/instruction is executed by a processor, the description of the multimedia data processing method in the embodiment corresponding to fig. 3 and fig. 8 is implemented, and therefore, details will not be repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in the embodiments of the computer program product referred to in the present application, reference is made to the description of the method embodiments of the present application. It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not to be construed as limiting the scope of the present application, so that the present application is not limited thereto, and all equivalent variations and modifications can be made to the present application.

Claims (13)

1. A method for processing multimedia data, comprising:
acquiring sample audio data and a marked refrain segment of the sample audio data; the marked refrain segment is obtained by marking the sample audio data by a plurality of objects;
carrying out refrain prediction on the sample audio data by adopting an initial refrain recognition model to obtain a predicted refrain segment of the sample audio data;
adjusting the initial refrain identification model according to the predicted refrain segment and the marked refrain segment, and taking the adjusted initial refrain identification model as a refrain identification model;
acquiring target audio data matched with the target video data; the target audio data are audio data matched with the subject information of the target video data;
performing audio feature extraction on the target audio data to obtain audio feature information of the target audio data;
calling the refrain identification model, and carrying out refrain identification on the target audio data according to the audio characteristic information of the target audio data to obtain a refrain segment of the target audio data; the refrain segment of the target audio data is composed of a plurality of audio frames which have frequency parameters larger than a frequency threshold value and energy parameters larger than an energy threshold value and have continuous adjacent position relations in the target audio data;
extracting color feature information of each frame of video data in the target video data, determining a key video clip from the target video data according to the color feature information, and fusing the key video clip with the refrain segment of the target audio data to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data; the key video clip is a video clip that reflects the subject content of the target video data.
2. The method of claim 1, wherein the performing audio feature extraction on the target audio data to obtain audio feature information of the target audio data comprises:
performing framing processing on the target audio data to obtain multi-frame audio data;
performing frequency domain transformation on the multi-frame audio data to obtain frequency domain information of each frame of audio data in the multi-frame audio data;
performing audio characteristic extraction on frequency domain information of each frame of audio data in the multi-frame of audio data to obtain audio characteristic information of each frame of audio data;
and determining the audio characteristic information of each frame of audio data as the audio characteristic information of the target audio data.
3. The method as claimed in claim 2, wherein said performing audio feature extraction on the frequency domain information of each frame of audio data in the multiple frames of audio data to obtain the audio feature information of each frame of audio data comprises:
determining energy information of each frame of audio data according to the frequency domain information of each frame of audio data in the multi-frame of audio data;
and filtering the energy information of each frame of audio data to obtain the audio characteristic information of each frame of audio data.
4. The method according to claim 2 or 3, wherein the invoking the refrain identification model to perform refrain identification on the target audio data according to the audio feature information of the target audio data to obtain a refrain segment of the target audio data comprises:
calling a coding layer of a refrain identification model to code the audio characteristic information of each frame of audio data to obtain a coding value of each frame of audio data;
calling a confidence coefficient recognition layer of the refrain recognition model to recognize the coding value of each frame of audio data to obtain the confidence coefficient of each frame of audio data belonging to the audio frame in the refrain fragment;
and calling a refrain identification layer of the refrain identification model to determine a refrain segment of the target audio data from the multi-frame audio data according to the confidence degree of each frame of audio data.
5. The method as claimed in claim 4, wherein the invoking the coding layer of the refrain identification model to code the audio feature information of each frame of audio data to obtain the coding value of each frame of audio data comprises:
calling a coding layer of a refrain identification model to generate audio feature vectors of the audio data of each frame according to the audio feature information of the audio data of each frame;
the calling a confidence recognition layer of the refrain recognition model to recognize the coding value of each frame of audio data to obtain the confidence of each frame of audio data belonging to the audio frame in the refrain segment includes:
determining the candidate confidence of the audio frame of each frame of audio data belonging to the refrain fragment according to the audio characteristic vector of each frame of audio data;
and smoothing the candidate confidence coefficient to obtain the confidence coefficient of the audio frame of each frame of audio data belonging to the refrain fragment.
6. The method of claim 5, wherein the determining the candidate confidence that each frame of audio data belongs to an audio frame in a refrain segment according to the audio feature vector of each frame of audio data comprises:
determining the audio weight of each frame of audio data according to the inner product between the audio characteristic vector of each frame of audio data and the audio characteristic vector of the adjacent frame of audio data;
and obtaining the candidate confidence that each frame of audio data belongs to an audio frame in the refrain segment according to the dot product between the audio weight and the audio feature vector.
7. The method of claim 5, wherein the smoothing the candidate confidence levels to obtain the confidence level that each frame of audio data belongs to an audio frame in a refrain section comprises:
acquiring the frame length of an audio frame in the multi-frame audio data, and generating a plurality of audio detection time periods according to the frame length;
counting the total candidate confidence of the audio frames in each audio detection time period in the plurality of audio detection time periods;
and adjusting the candidate confidence of the audio frame in each audio detection time period according to the total candidate confidence to obtain the confidence that the audio data of each frame belongs to the audio frame in the refrain fragment.
8. The method as claimed in claim 4, wherein the invoking the refrain identification layer of the refrain identification model to determine the refrain segment of the target audio data from the multi-frame audio data according to the confidence of each frame of audio data comprises:
calling the refrain identification layer to determine a plurality of candidate audio frames with confidence degrees larger than a confidence degree threshold value from the multi-frame audio data;
determining a positional relationship in the target audio data between candidate audio frames of the plurality of candidate audio frames;
and generating the refrain fragment of the target audio data according to the candidate audio frames with continuous adjacent position relation in the candidate audio frames.
9. The method of claim 1, wherein the fusing the key video clip with the refrain segment of the target audio data to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data comprises:
acquiring a first playing duration of the key video clip and a second playing duration of the refrain segment of the target audio data;
if the first playing duration is different from the second playing duration, scaling the refrain segment of the target audio data according to the first playing duration to obtain a processed refrain segment; aligning the processed refrain segment with the key video clip to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data;
and if the first playing duration is the same as the second playing duration, aligning the refrain segment with the key video clip to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data.
10. The method of claim 9, wherein the scaling the refrain segment of the target audio data according to the first playing duration to obtain a processed refrain segment comprises:
if the second playing duration is greater than the first playing duration, cutting the refrain segment of the target audio data according to the first playing duration to obtain a processed refrain segment;
and if the second playing duration is less than the first playing duration, extending the refrain segment of the target audio data according to the first playing duration to obtain a processed refrain segment.
11. A multimedia data processing apparatus, comprising:
the acquisition module is used for acquiring target audio data matched with the target video data; the target audio data are audio data matched with the subject information of the target video data;
the extraction module is used for extracting audio features of the target audio data to obtain audio feature information of the target audio data;
the identification module is used for calling a refrain identification model, and performing refrain identification on the target audio data according to the audio characteristic information of the target audio data to obtain a refrain segment of the target audio data; the refrain segment of the target audio data is composed of a plurality of audio frames which have frequency parameters larger than a frequency threshold value and energy parameters larger than an energy threshold value and have continuous adjacent position relations in the target audio data;
the fusion module is used for extracting color feature information of each frame of video data in the target video data, determining a key video clip from the target video data according to the color feature information, and fusing the key video clip with the refrain segment of the target audio data to obtain multimedia data comprising the key video clip and the refrain segment of the target audio data; the key video clip is a video clip that reflects the subject content of the target video data;
the refrain identification model is obtained through the following steps:
acquiring sample audio data and a marked refrain segment of the sample audio data; carrying out refrain prediction on the sample audio data by adopting an initial refrain recognition model to obtain a predicted refrain segment of the sample audio data; adjusting the initial refrain identification model according to the predicted refrain segment and the marked refrain segment, and taking the adjusted initial refrain identification model as a refrain identification model; the labeled refrain segment is obtained by labeling the sample audio data by a plurality of objects.
12. A computer device, comprising: a processor and a memory;
the processor is connected with the memory; the memory is for storing program code, and the processor is for calling the program code to perform the method of any of claims 1 to 10.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-10.
CN202111104105.8A 2021-09-22 2021-09-22 Multimedia data processing method, device, equipment and storage medium Active CN113573161B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111104105.8A CN113573161B (en) 2021-09-22 2021-09-22 Multimedia data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111104105.8A CN113573161B (en) 2021-09-22 2021-09-22 Multimedia data processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113573161A CN113573161A (en) 2021-10-29
CN113573161B true CN113573161B (en) 2022-02-08

Family

ID=78173885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111104105.8A Active CN113573161B (en) 2021-09-22 2021-09-22 Multimedia data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113573161B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114339392B (en) * 2021-11-12 2023-09-12 腾讯科技(深圳)有限公司 Video editing method, device, computer equipment and storage medium
CN114420075A (en) * 2022-01-24 2022-04-29 腾讯科技(深圳)有限公司 Audio processing method and device, equipment and computer readable storage medium
CN114465737B (en) * 2022-04-13 2022-06-24 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN115033734B (en) * 2022-08-11 2022-11-11 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9691429B2 (en) * 2015-05-11 2017-06-27 Mibblio, Inc. Systems and methods for creating music videos synchronized with an audio track
CN110136729B (en) * 2019-03-27 2021-08-20 北京奇艺世纪科技有限公司 Model generation method, audio processing method, device and computer-readable storage medium
CN110740262A (en) * 2019-10-31 2020-01-31 维沃移动通信有限公司 Background music adding method and device and electronic equipment
CN111182347B (en) * 2020-01-07 2021-03-23 腾讯科技(深圳)有限公司 Video clip cutting method, device, computer equipment and storage medium
CN111491211B (en) * 2020-04-17 2022-01-28 维沃移动通信有限公司 Video processing method, video processing device and electronic equipment
CN111901626B (en) * 2020-08-05 2021-12-14 腾讯科技(深圳)有限公司 Background audio determining method, video editing method, device and computer equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code (country: HK; legal event code: DE; document number: 40055746)