WO2024001646A1 - Audio data processing method and apparatus, electronic device, program product, and storage medium - Google Patents

Audio data processing method and apparatus, electronic device, program product, and storage medium

Info

Publication number
WO2024001646A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
audio
segment
recommended
feature
Prior art date
Application number
PCT/CN2023/097205
Other languages
English (en)
French (fr)
Inventor
冯鑫
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2024001646A1 publication Critical patent/WO2024001646A1/zh


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Definitions

  • the present application relates to computer technology, and in particular to an audio data processing method, device, electronic equipment, program product and storage medium.
  • Online multimedia (such as video or audio) playback platforms need to mark some special data segments in the multimedia data, called recommended segments, such as highlight data segments, popular data segments, etc., to facilitate user viewing.
  • Embodiments of the present application provide an audio data processing method, device, electronic equipment, computer program product, and computer-readable storage medium, which can accurately identify recommended segments from audio data.
  • Embodiments of the present application provide an audio data processing method, which is executed by an electronic device, including:
  • extracting audio track data corresponding to at least one source type from the audio data, wherein the audio data includes multiple data segments;
  • determining at least one time segment related to the source type in the playback timeline of each audio track data, and determining the time segments respectively included in each data segment in the audio data;
  • assigning a corresponding weight value to each data segment based on the length of the included time segments, and combining the weight values into a weight value sequence of the audio data;
  • extracting audio features from each data segment, combining the audio features of the data segments into an audio feature sequence of the audio data, and encoding the audio feature sequence to obtain an attention parameter sequence of the audio data;
  • fusing the attention parameter sequence and the weight value sequence to obtain fusion parameters of each data segment, and determining recommended parameters of each data segment based on the fusion parameters;
  • determining recommended segments in the audio data based on the recommended parameters of each data segment.
  • An embodiment of the present application provides an audio data processing device, including:
  • a source separation module, configured to extract audio track data corresponding to at least one source type from the audio data, wherein the audio data includes multiple data segments;
  • a weight configuration module, configured to determine at least one time segment related to the source type in the playback timeline of each audio track data, determine the time segments respectively included in each data segment in the audio data, and assign a corresponding weight value to each data segment based on the length of the included time segments to form a weight value sequence of the audio data;
  • a feature extraction module, configured to extract audio features from each data segment, combine the audio features of the data segments into an audio feature sequence of the audio data, and encode the audio feature sequence to obtain an attention parameter sequence of the audio data;
  • a parameter prediction module configured to fuse the attention parameter sequence and the weight value sequence to obtain the fusion parameters of each of the data segments, and determine the recommended parameters of each of the data segments based on each of the fusion parameters;
  • the parameter prediction module is further configured to determine recommended segments in the audio data based on recommended parameters of each data segment.
  • An embodiment of the present application provides an electronic device, including:
  • a memory, configured to store executable instructions;
  • the processor is configured to implement the audio data processing method provided by the embodiment of the present application when executing executable instructions stored in the memory.
  • Embodiments of the present application provide a computer-readable storage medium that stores executable instructions for causing the processor to implement the audio data processing method provided by the embodiments of the present application when executed.
  • An embodiment of the present application provides a computer program product, which includes a computer program or instructions; when the computer program or instructions are executed by a processor, the audio data processing method provided by the embodiment of the present application is implemented.
  • the attention parameter sequence can be used to highlight the importance of the data fragments related to the source in the audio features from the frequency domain level.
  • Figure 1 is a schematic diagram of the application mode of the audio data processing method provided by the embodiment of the present application.
  • Figure 2 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Figure 3A is a first flow diagram of the audio data processing method provided by the embodiment of the present application.
  • Figure 3B is a second flow diagram of the audio data processing method provided by the embodiment of the present application.
  • FIG. 3C is a third flowchart of the audio data processing method provided by the embodiment of the present application.
  • Figure 3D is a fourth schematic flowchart of the audio data processing method provided by the embodiment of the present application.
  • FIG. 3E is a fifth flowchart of the audio data processing method provided by the embodiment of the present application.
  • Figure 4A is a schematic diagram of audio data extracted from a video provided by an embodiment of the present application.
  • Figure 4B is a schematic diagram of audio track data provided by an embodiment of the present application.
  • Figure 4C is a schematic diagram of time segments provided by the embodiment of the present application.
  • FIG. 5 is an optional flow diagram of the audio data processing method provided by the embodiment of the present application.
  • Figure 6A is a first schematic diagram of the audio processing model provided by an embodiment of the present application.
  • Figure 6B is a second schematic diagram of the audio processing model provided by the embodiment of the present application.
  • Figure 7 is a schematic diagram of the pyramid scene parsing network provided by the embodiment of the present application.
  • Figure 8 is a schematic diagram of the audio semantic information extraction module provided by the embodiment of the present application.
  • Figure 9 is a schematic diagram of the coding principle in the attention module provided by the embodiment of the present application.
  • Figure 10A is a first schematic diagram of the playback interface provided by an embodiment of the present application.
  • Figure 10B is a second schematic diagram of the playback interface provided by the embodiment of the present application.
  • Figure 11 is a schematic diagram of audio data provided by an embodiment of the present application.
  • The terms "first/second/third" are only used to distinguish similar objects and do not represent a specific ordering of objects. It should be understood that, where appropriate, the specific order or sequence of "first/second/third" may be interchanged so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein.
  • Where the embodiments of this application involve user information, user feedback data and other related data (for example, multimedia data, voice, audio track data, etc.), user permission or consent is required, and the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
  • Pyramid Scene Parsing Network (PSPN): a network whose function is to predict the label, location and shape of objects of interest.
  • the network includes a Pyramid Pooling Module.
  • the Pyramid Pooling Module can aggregate local context information to form global context information, and more comprehensively implement positioning, classification and other processing.
  • Source separation: audio data (such as audio data extracted from the audio track of video data, or from an audio file) contains one or more audio signals (digital audio signals, obtained by sampling and encoding an analog audio signal). A source is the origin of a sound signal, a source type is the type of sound source, and each audio signal corresponds to a source type (for example, the source type corresponding to speech is human). Source separation extracts, through signal processing or other algorithms, the sequence of audio signals from a specified source, finally producing audio track data of different source types, such as voice track data and background track data.
  • Time domain and frequency domain: basic properties of audio data. The different perspectives from which audio data is analyzed are called domains; the time domain and frequency domain are two dimensions for measuring audio characteristics. In the time domain, the sampling points of the audio data are displayed and processed over time, and there is a correspondence between the signal and time. The frequency domain is used to analyze the energy distribution of the audio data across frequency bands, which characterizes the audio data to a certain extent.
  • Mel frequency: a nonlinear frequency scale based on the human ear's perception of equidistant pitch changes; it is an artificially defined frequency scale that better matches the human auditory experience in signal processing. Many basic audio features are computed on the mel frequency scale.
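  • As an illustration of how a frequency domain feature map on the mel scale can be computed, the following is a minimal sketch using the librosa library; the file path, sampling rate and mel parameters are illustrative assumptions and are not specified by this application.

```python
import librosa
import numpy as np

# Load a mono waveform (time domain signal); "audio.wav" is a placeholder path.
y, sr = librosa.load("audio.wav", sr=16000)

# Mel-scaled power spectrogram: STFT energies pooled into n_mels bands spaced
# on the nonlinear mel frequency scale.
mel_power = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                           hop_length=512, n_mels=64)

# Convert power to decibels to obtain a log-mel spectrogram, a common basic
# audio feature computed through the mel frequency.
log_mel = librosa.power_to_db(mel_power, ref=np.max)
print(log_mel.shape)  # (n_mels, number of frames)
```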
  • Convolutional neural network (CNN): a type of feedforward neural network (FNN) that includes convolutional computation and has a deep structure; it is one of the representative algorithms of deep learning. Convolutional neural networks have representation learning capabilities and can perform shift-invariant classification of input images according to their hierarchical structure. They contain artificial neurons that respond to surrounding units within part of their coverage area. Typically, a convolutional neural network consists of one or more convolutional layers and a top fully connected layer (as in a classic neural network), together with associated weights and pooling layers.
  • Attention mechanism: a problem-solving method proposed to imitate human attention, which can quickly filter out high-value information from a large amount of information. The attention mechanism is mainly used to address the difficulty of obtaining a reasonable final vector representation when the input sequence of Long Short-Term Memory (LSTM) and Recurrent Neural Network (RNN) models is long. Its approach is to retain the intermediate results of the long short-term memory network and use a new model to learn the correlation between the intermediate results and the output results, so as to determine the more important information in the output results and thereby achieve information screening.
  • Time segment: an interval on the playback timeline of multimedia data. For example, for a 10-minute video, the interval from the 5th minute to the 8th minute on the playback timeline can be called a time segment.
  • Data segment: the data corresponding to a time segment in multimedia data. For example, for a 10-minute video, the data corresponding to the time segment from the 5th minute to the 8th minute on the playback timeline can be called a data segment; data segments can be further divided into data segments of the audio track and data segments of the video track. A video can be divided into multiple data segments of equal length.
  • Recommended segment: a data segment in multimedia data that contains key information to be expressed or strong emotions (such as sadness or happiness); its playback time corresponds to a time segment on the playback timeline. The multimedia data can be a video, song, audio novel, radio drama, etc., and recommended clips can be of types such as highlight clips containing key plot points in movies or sad clips expressing sad emotions in songs.
  • Recommendation parameter: a quantitative representation of the probability that a data segment belongs to a specific type of recommended segment. For example, a recommendation parameter may represent the probability that the data segment is a highlight segment in the multimedia data.
  • Embodiments of the present application provide an audio data processing method, audio data processing device, electronic equipment, computer program products, and computer-readable storage media, which can accurately obtain recommended segments in audio data.
  • Figure 1 is a schematic diagram of the application mode of the audio data processing method provided by the embodiment of the present application. As an example, the servers involved include an identification server 201 and a media server 202, where the media server 202 can be a server of a video platform, a music platform, an audio novel or radio drama platform, etc. Figure 1 also shows a network 300 and a terminal device 401.
  • the identification server 201 and the media server 202 communicate through the network 300 or through other means.
  • the terminal device 401 connects to the media server 202 through the network 300.
  • the network 300 can be a wide area network or a local area network, or a combination of the two.
  • The media server 202 sends the audio data (e.g., audio novels, online music) to the recognition server 201. The recognition server 201 determines the recommendation parameters of each data segment in the audio data (e.g., the probability that the data segment belongs to a highlight segment, a sad segment, a funny segment, etc.; the recommendation parameter is positively related to the degree of excitement, sadness, funniness, etc.), generates the recommendation parameter curve based on the recommendation parameters, and determines the recommended segments in the audio data.
  • the recommended parameter curve and the recommended segment are sent to the media server 202.
  • the media server 202 sends the recommended parameter curve and the recommended segment position tag to the terminal device 401.
  • the terminal device 401 runs the player 402.
  • the recommended parameter curve and the recommended segment position label are displayed. Users can easily determine the recommended parameters of each data segment in the audio data based on the recommended parameter curve, and jump to the corresponding position for playback based on the recommended segment position tag, making it easy to locate the recommended segment.
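  • The application does not fix a particular rule for turning the per-segment recommendation parameters into position tags; the following minimal sketch assumes equal-length data segments and a simple threshold rule, for illustration only.

```python
from typing import List, Tuple

def recommended_segments(
    scores: List[float],          # one recommendation parameter per data segment, in time order
    segment_seconds: float,       # playback duration of each data segment (assumed equal length)
    threshold: float = 0.8,       # illustrative cut-off, not specified by the application
) -> Tuple[List[float], List[Tuple[float, float]]]:
    """Return the recommendation parameter curve and (start, end) position tags
    for runs of consecutive segments whose parameter exceeds the threshold."""
    curve = scores  # the curve is simply the per-segment parameters over the timeline
    tags, run_start = [], None
    for i, s in enumerate(scores + [float("-inf")]):  # sentinel closes a trailing run
        if s >= threshold and run_start is None:
            run_start = i
        elif s < threshold and run_start is not None:
            tags.append((run_start * segment_seconds, i * segment_seconds))
            run_start = None
    return curve, tags

# Example: six data segments of 10 s each
curve, tags = recommended_segments([0.1, 0.4, 0.9, 0.95, 0.85, 0.3], 10.0)
print(tags)  # [(20.0, 50.0)] -> one recommended segment from 20 s to 50 s
```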
  • the audio data is obtained by segmenting the audio data from the audio track of the video data (eg, online video or local video).
  • the timelines of the audio data and the video images are aligned, and the highlights of the audio data correspond to the highlights of the video data one-to-one.
  • Recommended snippets can be respectively wonderful data snippets, sad data snippets, funny data snippets, etc.
  • Users can be viewers who watch videos, or users who use video data as material for secondary creation. Users can quickly determine the highlight data clips in the video through the recommended parameter curves and position tags of the highlight data clips, and then watch the highlight data clips, or cut the highlight data clips from the video data for secondary video creation.
  • the recognition server 201 and the media server 202 can be integrated together to implement a unified server, or they can be set up separately.
  • the embodiments of this application can be implemented through blockchain technology.
  • The recommended parameter curve obtained by the audio data processing method in the embodiment of this application can be used as the detection result. The detection result can be uploaded to the blockchain for storage, and the reliability of the detection result can be guaranteed through the consensus algorithm.
  • Blockchain is a new application model of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a chain of data blocks generated using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • Blockchain can include the underlying platform of the blockchain, the platform product service layer and the application service layer.
  • A database can be regarded as an electronic file cabinet, a place where electronic files are stored; users can perform operations such as adding, querying, updating and deleting the data in the files. A "database" is a collection of data that is stored together in a certain way, can be shared by multiple users, has as little redundancy as possible, and is independent of the application.
  • Database Management System is a computer software system designed for managing databases. It generally has basic functions such as storage, interception, security, and backup. Database management systems can be classified according to the database models they support, such as relational, XML (Extensible Markup Language, extensible markup language); or according to the types of computers they support, such as server clusters and mobile phones; Or they can be classified according to the query language used, such as Structured Query Language (SQL, Structured Query Language), XQuery; or they can be classified according to the focus of performance impact, such as maximum scale, maximum running speed; or other classification methods. Regardless of the classification scheme used, some DBMSs are able to span categories, for example, supporting multiple query languages simultaneously.
  • The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, big data and artificial intelligence platforms.
  • the terminal device can be a smartphone, tablet, laptop, desktop computer, smart speaker, smart watch, smart TV, vehicle-mounted terminal, etc., but is not limited to this.
  • the terminal device and the server may be connected directly or indirectly through wired or wireless communication methods, which are not limited in the embodiments of this application.
  • Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology, etc. based on the cloud computing business model. It can form a resource pool that is used on demand, flexibly and conveniently, and cloud computing technology will become an important support. The background services of technical network systems, such as video websites, picture websites and other portal websites, require a large amount of computing and storage resources. Each item may have its own hash code identification mark, which needs to be transmitted to the backend system for logical processing; data at different levels will be processed separately, and all types of industry data require strong system support, which can only be achieved through cloud computing.
  • FIG. 2 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • The electronic device 400 can be the terminal device 401 in Figure 1, or can be a server (the identification server 201, the media server 202, or a hybrid of the two).
  • the electronic device 400 includes: at least one processor 410, a memory 450, and at least one network interface 420.
  • the various components in electronic device 400 are coupled together by bus system 440 .
  • the bus system 440 is used to implement connection communication between these components.
  • the bus system 440 also includes a power bus, a control bus, and a status signal bus.
  • the various buses are labeled bus system 440 in FIG. 2 .
  • The processor 410 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor can be a microprocessor or any conventional processor.
  • Memory 450 may be removable, non-removable, or a combination thereof.
  • Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, etc.
  • Memory 450 optionally includes one or more storage devices physically located remotely from processor 410 .
  • Memory 450 includes volatile memory or non-volatile memory, and may include both volatile and non-volatile memory.
  • Non-volatile memory can be read-only memory (ROM, Read Only Memory), and volatile memory can be random access memory (RAM, Random Access Memory).
  • the memory 450 described in the embodiments of this application is intended to include any suitable type of memory.
  • the memory 450 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplarily described below.
  • The operating system 451 includes system programs used to process various basic system services and perform hardware-related tasks, such as the framework layer, core library layer and driver layer, which are used to implement various basic services and process hardware-based tasks.
  • Network communication module 452 for reaching other electronic devices via one or more (wired or wireless) network interfaces 420.
  • Exemplary network interfaces 420 include: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), etc.
  • the audio data processing device provided by the embodiment of the present application can be implemented in a software manner.
  • Figure 2 shows the audio data processing device 455 stored in the memory 450, which can be in the form of a program, a plug-in, etc.
  • The software includes the following software modules: a source separation module 4551, a weight configuration module 4552, a feature extraction module 4553, and a parameter prediction module 4554. These modules are logical, so they can be combined or further split according to the functions implemented. The functions of each module are explained below.
  • a terminal device or server can implement the audio data processing method provided by the embodiments of this application by running a computer program.
  • The computer program can be a native program or software module in the operating system; it can be a native application (APP), that is, a program that needs to be installed in the operating system to run, such as a video APP or an audio APP; it can also be a mini program, that is, a program that only needs to be downloaded into the browser environment to run.
  • the computer program described above can be any form of application, module or plug-in.
  • FIG. 3A is a first schematic flowchart of an audio data processing method provided by an embodiment of the present application. This method can be executed by an electronic device and will be described in conjunction with the steps shown in FIG. 3A.
  • step 301 audio track data corresponding to at least one source type is extracted from the audio data.
  • audio track data files (or audio track data packets) corresponding to different source types are separated from the audio data file (or audio data packet).
  • the audio data includes multiple data segments, each data segment may be continuous, and the playback duration of each data segment may be the same or different. For example: divide the audio data into multiple data segments with the same playback duration, or divide the audio data into multiple data segments with different playback durations.
  • the audio data can be native audio data (such as audio novels, radio dramas, etc.), or it can be extracted from video data.
  • Recommended parameters can include: excitement, sadness, comedy, passion, etc.
  • the corresponding recommended snippets are wonderful data snippets, sad data snippets, funny data snippets, etc.
  • Step 301 is implemented in the following manner: performing feature extraction on the audio data to obtain global features of the audio data; and using the global features as a mask, performing source separation on the audio data to obtain the audio track data corresponding to each source type in the audio data.
  • the boundaries of the mask are used to characterize the boundaries between audio data corresponding to different source types.
  • Feature extraction for audio data includes: multiple-level feature extraction for audio data, and fusing the features obtained at each level into global features.
  • Source separation can be achieved through the Pyramid Scene Parsing Network (PSPN, Pyramid Scene Parsing Network). Feature extraction and source separation are explained below.
  • Performing feature extraction on the audio data to obtain the global features of the audio data is achieved in the following manner: performing feature extraction on the audio data to obtain the original features of the audio data; performing multiple levels of pooling on the original features to obtain multiple local features of the audio data; and merging the multiple local features to obtain the global features of the audio data.
  • the pooling process can be implemented through the Pyramid Pooling Module of the Pyramid Scene Parsing Network (PSPN, Pyramid Scene Parsing Network).
  • Figure 7 is a schematic diagram of the Pyramid Scene Parsing Network provided by the embodiment of the present application. As detailed below, the pyramid scene parsing network includes: a convolutional neural network 701, a pooling layer 703, a pyramid pooling module (the pyramid pooling module in Figure 7 includes convolution layer 1, convolution layer 2, convolution layer 3 and convolution layer 4), an upsampling layer 704, and a convolution layer 706.
  • the convolutional neural network 701 performs feature extraction on the audio data to obtain the original features 702 of the audio data.
  • The pyramid pooling module placed after the pooling (pool) layer 703 can be configured with more sizes according to the required extraction accuracy in a specific implementation. Assuming that the pyramid has N levels in total, a 1×1 convolution (CONV) is applied after each level to reduce the number of channels at that level to 1/N of the original.
  • the low-dimensional feature map is then upsampled directly through the upsampling layer 704 through bilinear interpolation to obtain a feature map of the same size as the original feature map.
  • Each layer of the pyramid pooling module of the pyramid scene parsing network outputs local features of different sizes, and the feature maps of different levels are merged and processed (concat) to obtain the final global features.
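  • A minimal sketch of a pyramid pooling module of this kind is shown below (PyTorch); the bin sizes and channel counts are illustrative assumptions rather than the exact configuration of Figure 7.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pool the original feature map at several levels, reduce each level's
    channels to 1/N with a 1x1 convolution, upsample back with bilinear
    interpolation, and concatenate everything into a global feature."""

    def __init__(self, channels: int, bins=(1, 2, 3, 6)):
        super().__init__()
        n = len(bins)
        self.bins = bins
        self.reduce = nn.ModuleList(
            [nn.Conv2d(channels, channels // n, kernel_size=1) for _ in bins]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        levels = [
            F.interpolate(conv(F.adaptive_avg_pool2d(x, b)), size=(h, w),
                          mode="bilinear", align_corners=False)
            for b, conv in zip(self.bins, self.reduce)
        ]
        # Merge (concat) the original features with the local context of every
        # level to form the global feature used as a mask for source separation.
        return torch.cat([x] + levels, dim=1)

# Example: a 64-channel feature map extracted from a spectrogram-like input.
feats = torch.randn(1, 64, 128, 128)
print(PyramidPooling(64)(feats).shape)  # torch.Size([1, 128, 128, 128])
```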
  • Using the global features as a mask to perform source separation on the audio data is implemented as follows: the global features, used as a mask, are convolved through the convolution layer 706 with the initial-level features extracted by the pyramid scene parsing network to obtain the feature map corresponding to the audio track data of each source type in the audio data. The features are represented in the form of feature matrices. The mask is a feature matrix with the same size as the initial-level features extracted by the pyramid scene parsing network: the mask value of the part corresponding to the global features is 1, and the mask value of the other parts is 0. Convolving the global features, as a mask, with the initial-level features can distinguish the spectral boundaries between audio data of different source types, thereby representing the boundaries between the spectrograms of different sources in the audio spectrogram, and separating the sub-audio data of different source types from the overall audio data to obtain the audio track data corresponding to each source type.
  • Source types include: background sound and voice.
  • In this way, multiple levels of feature extraction are performed on the audio data through the pyramid scene parsing network, which improves the accuracy of feature extraction; the extracted global features are then convolved with the initial-level feature extraction results, which improves the accuracy of source separation.
  • step 302 at least one time segment related to the source type in the playback timeline of each audio track data is determined, and the time segments included in each data segment in the audio data are determined.
  • a time segment is a segment on the timeline of audio data.
  • At least one time segment related to the source type refers to the time segment in which the source corresponding to the source type emits sound. Whether the source corresponding to the source type emits sound can be determined by analyzing the short-term energy of the audio track data corresponding to the source type.
  • Each data segment can contain time segments of at least one type of source. For example, a data segment may contain a time segment of speech and a time segment of background sound whose lengths are the same as the playback duration of the data segment; alternatively, a data segment may contain a time segment of speech whose length is half of its playback duration.
  • Figure 3B is a second flow diagram of the audio data processing method provided by the embodiment of the present application.
  • Step 302 in Figure 3A is implemented through steps 3021 and 3022 in Figure 3B, which are described in detail below.
  • the execution order of step 3021 and step 3022 is not limited.
  • step 3021 when the source type corresponding to the audio track data is speech, the time period in the audio track data in which the short-term energy is greater than the energy threshold and the zero-crossing rate is less than the zero-crossing rate threshold is used as a time period related to speech.
  • voice-related time segments can be obtained through the Voice Activity Detection (VAD, Voice Activity Detection) algorithm.
  • The short-term energy, that is, the energy of a frame of the speech signal, is the sum of squares of the signal within the frame; the zero-crossing rate is the number of times the time domain signal of a frame of speech crosses zero (the zero-amplitude axis).
  • the principle of the voice activity detection algorithm is that the short-term energy of speech data segments is relatively large, but the zero-crossing rate is relatively small; conversely, the short-term energy of non-speech data segments is relatively small, but the zero-crossing rate is relatively large.
  • the speech signal and the non-speech signal can be judged, that is, the parts of the audio track data that make sounds and the parts that do not make sounds.
  • the short-term energy of the audio data is less than the short-term energy threshold and the zero-crossing rate is greater than the zero-crossing rate threshold, the audio segment is noise.
  • the short-term energy of the audio data is greater than the short-term energy threshold and the zero-crossing rate is less than the zero-crossing rate threshold, the audio segment is speech.
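  • A minimal sketch of this energy/zero-crossing-rate rule is shown below; the frame size, hop and both thresholds are illustrative assumptions.

```python
import numpy as np

def vad_frames(signal: np.ndarray, frame_len: int = 400, hop: int = 160,
               energy_thr: float = 1e-3, zcr_thr: float = 0.15) -> np.ndarray:
    """Mark a frame as speech when its short-term energy is greater than the
    energy threshold and its zero-crossing rate is less than the ZCR threshold."""
    flags = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = float(np.sum(frame ** 2))                          # sum of squares within the frame
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))   # fraction of sign changes
        flags.append(energy > energy_thr and zcr < zcr_thr)
    return np.array(flags)  # True = speech frame, False = non-speech frame

# Consecutive True frames can then be merged into speech-related time segments
# on the playback timeline of the voice track data.
```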
  • step 3022 when the source type corresponding to the audio track data is background sound, the time segment in the audio track data that satisfies the filtering condition is used as the time segment related to the background sound.
  • the filtering conditions include any of the following (see the sketch after these conditions):
  • the loudness corresponding to the time segment is greater than the loudness lower limit. For example, if the duration is too short or the sound is too low, it may be noise rather than background music.
  • The lower limit of loudness can be determined as a preset multiple (greater than 0 and less than 1) of the loudness median of the audio data. For example, the average of the maximum and minimum loudness values is taken as the loudness median, 0.5 times the loudness median is taken as the loudness lower limit, and time segments in the audio data whose loudness is less than the lower limit are determined as segments that do not meet the filtering conditions.
  • the length of the time segment is greater than the length lower limit.
  • the lower limit value of the length is determined based on the time length of the audio data. For example, the lower limit value of the length is one percent of the audio data.
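  • A minimal sketch of these filtering conditions, using the example values above (0.5 times the loudness median, one percent of the audio length); combining the two conditions with a logical AND is an assumption made for illustration.

```python
def satisfies_filter_conditions(seg_loudness: float, seg_length: float,
                                max_loudness: float, min_loudness: float,
                                audio_length: float) -> bool:
    # Loudness condition: greater than a preset multiple (0.5 in the example
    # above) of the loudness median, taken as the average of the maximum and
    # minimum loudness of the audio data.
    loudness_lower = 0.5 * (max_loudness + min_loudness) / 2
    # Length condition: greater than one percent of the audio length, per the
    # example above.
    length_lower = 0.01 * audio_length
    # The text lists these as individual filtering conditions; requiring both
    # here is an illustrative choice.
    return seg_loudness > loudness_lower and seg_length > length_lower
```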
  • Figure 11 is a schematic diagram of audio data provided by an embodiment of the present application.
  • the time length of the audio data 1101 is 0 to T6, and the audio data 1101 is divided into 6 data segments (data segment 1 to data segment 6).
  • the background sound track 1102 of the audio data includes a time segment from T3 to T6 when the background sound source emits sound
  • the voice track 1103 of the audio data includes a time segment from T1 to T5 when the speech source emits sound.
  • Each paragraph in the audio track is distinguished by type, such as voice and background sound, so that the voice data segments in the audio data can be located directly, and voice data segments are assigned higher weight values than other types of data segments. This strengthens the recognition of the semantic information of the voice data segments and greatly increases the proportion of speech semantic information in locating highlight clips.
  • Each data segment in the audio data is assigned a corresponding weight value based on the length of the time segments it includes, and the weight values are combined to form the weight value sequence of the audio data.
  • the audio data is divided into multiple data segments in advance according to the number of frames or duration.
  • the length of the data segments is a preset number of frames or a preset duration.
  • For example, if the time length of the speech time segment in a data segment is 0, no weight value corresponding to the speech type is assigned; if the time length of the background sound time segment equals the playback duration of the data segment, the data segment is assigned the preconfigured weight value corresponding to the background sound; and if the time length of the background sound time segment is half of the playback duration of the data segment, half of the preconfigured weight value is used as the weight value of the data segment.
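  • A minimal sketch of constructing the weight value sequence is shown below; the preconfigured weight values, the proportional scaling by overlap, and giving speech priority over background sound in overlapping segments are illustrative assumptions consistent with the examples above.

```python
from typing import List, Tuple

def segment_weights(
    segments: List[Tuple[float, float]],        # (start, end) of each data segment
    speech_spans: List[Tuple[float, float]],    # speech-related time segments
    background_spans: List[Tuple[float, float]],
    speech_weight: float = 1.0,                 # preconfigured weights; the values
    background_weight: float = 0.5,             # here are illustrative assumptions
) -> List[float]:
    """Each data segment gets the preconfigured weight scaled by the fraction of
    its playback duration covered by a source-related time segment; segments with
    no source overlap get zero."""

    def overlap(seg, spans):
        s, e = seg
        covered = sum(max(0.0, min(e, b) - max(s, a)) for a, b in spans)
        return covered / (e - s)

    weights = []
    for seg in segments:
        speech_frac = overlap(seg, speech_spans)
        if speech_frac > 0:
            weights.append(speech_weight * speech_frac)
        else:
            weights.append(background_weight * overlap(seg, background_spans))
    return weights

# Figure 11 style example: six 10 s data segments, speech from 15 s to 50 s,
# background sound from 30 s to 60 s.
segs = [(i * 10.0, (i + 1) * 10.0) for i in range(6)]
print(segment_weights(segs, [(15.0, 50.0)], [(30.0, 60.0)]))
# [0.0, 0.5, 1.0, 1.0, 1.0, 0.5]
```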
  • Figure 3C is a third flow diagram of the audio data processing method provided by the embodiment of the present application.
  • Step 303 in Figure 3A is implemented through steps 3031 to 3033 in Figure 3C; steps 3031 to 3033 are performed for each data segment and are described in detail below. The execution order of steps 3031 to 3033 is not limited.
  • step 3031 when the data segment belongs to a speech-related time segment, the weight value corresponding to the data segment is determined based on the parameters of the speech corresponding to the data segment.
  • the weight value is positively related to the parameters, and the parameters include at least one of the following: speaking speed, intonation, and loudness.
  • For example, the audio data of a film or television drama includes voice and background sound, where the voice part is usually the part performed by actors, and the exciting data segments (recommended segments) in the film or television drama are usually within the time segments in which voice is present. Parameters such as speech speed, intonation and loudness of the speech can be used as a basis for determining exciting data segments, and the weight value corresponding to the data segment can be determined based on at least one of these parameters.
  • step 3032 when the data segment belongs to a time segment related to the background sound, the preset value is used as the weight value corresponding to the data segment.
  • the preset value is smaller than the weight value of any voice-related data segment.
  • the audio data of film and television drama videos includes voice and background sound.
  • The voice part is usually the part performed by actors, and the parts with only background sound are usually data segments such as cutscenes in film and television drama videos. Since the highlight data segments are in the voice parts, the time segments related only to the background sound can be assigned a weight value smaller than that of the speech-related data segments.
  • step 3033 when the data segment does not belong to a time segment related to any source type, zero is used as the weight value corresponding to the data segment.
  • When the data segment is not in a time segment of any source type, the data segment may be a silent or noisy data segment, and the accuracy of the recommended parameters can be improved by setting the weight value of such a data segment to zero.
  • There are three possible cases: the data segment is in a time segment of one source type; the data segment is not in a time segment of any source type; or the data segment is in time segments of multiple source types at the same time (for example, the time segment where the data segment is located in the playback timeline includes both voice track data and background sound track data).
  • Figure 11 is a schematic diagram of audio data provided by an embodiment of the present application, in which the time length of the audio data 1101 is 0 to T6, and the audio data 1101 is divided into 6 data segments (data segment 1 to data segment 6). The background sound track 1102 of the audio data contains the time segment of the background sound from T3 to T6, and the voice track 1103 of the audio data contains the time segment of the speech from the midpoint between T1 and T2 to T5. The time interval corresponding to data segment 1 is 0 to T1, that of data segment 2 is T1 to T2, that of data segment 3 is T2 to T3, that of data segment 4 is T3 to T4, that of data segment 5 is T4 to T5, and that of data segment 6 is T5 to T6.
  • Data segment 1 is not related to any source, so its weight value is 0. Data segments 2 and 3 belong to speech-related time segments, and their weight values are obtained through step 3031 above. Since the duration of the speech-related time segment contained in data segment 2 is half of the duration of data segment 2, half of the weight value obtained according to step 3031 is used as the weight value q2 of data segment 2; the weight values of data segments 2 and 3 are denoted q2 and q3 respectively. Data segments 4 and 5 belong to both speech-related and background-sound-related time segments, and their weight values for the different types are obtained through steps 3031 and 3032 respectively; data segment 6 belongs only to a background-sound-related time segment, and its weight value q6 is obtained through step 3032. The weight value sequence of the audio data 1101 is therefore [0, q2, q3, q4, q5, q6].
  • the weight values q3 to q5 determined based on the parameters corresponding to the speech are higher than 0 and q6.
  • In this way, different methods are selected to determine the weight value corresponding to a data segment according to the type of the data segment: for data segments belonging to background sound, a preset weight value is assigned, and for data segments belonging to silence or noise, the weight value is set to zero, saving the computing resources needed to obtain the weight values of the data segments.
  • the weight value of the data segment is calculated based on the speech-related parameters, which improves the accuracy of obtaining the weight value of the speech data segment.
  • the weight value corresponding to the speech data segment is higher than that of the non-speech-related data segments.
  • recommended segments are usually data segments with voice. Increasing the weight of voice data segments improves the accuracy of predicting the recommended parameters of each data segment.
  • Figure 3D is a fourth schematic flowchart of the audio data processing method provided by the embodiment of the present application.
  • Step 303 in Figure 3A is implemented through steps 3034 to 3035 in Figure 3D.
  • the following steps 3034 and 3035 are processed for each data fragment, which will be described in detail below.
  • The execution order of step 3034 and step 3035 is not limited.
  • step 3034 when the time segment contained in the data segment belongs to the time segment related to the background sound, the weight value corresponding to the data segment is determined based on the parameters of the background sound corresponding to the data segment.
  • the weight value is positively related to the parameter, and the parameter includes at least one of the following: loudness, pitch.
  • For example, if the audio data is the audio data of a concert, it may include only background sound sources and not necessarily speech. Parameters such as loudness and pitch can then be used as the basis for determining the exciting data segments, and the weight value corresponding to a data segment can be determined based on at least one of these parameters.
  • step 3035 when the time segment contained in the data segment does not belong to the time segment related to any source type, zero is used as the weight value corresponding to the data segment.
  • step 3035 and step 3033 are the same and will not be described again here.
  • When there is no voice in the multimedia data, a preset weight value is assigned to the data segments belonging to the background sound, and the weight value is set to zero for the data segments belonging to silence or noise, which saves the computing resources needed to obtain the weight values of the data segments.
  • In step 304, audio features are extracted from each data segment, the audio features of the data segments are combined into an audio feature sequence of the audio data, and the audio feature sequence is encoded to obtain the attention parameter sequence of the audio data.
  • extracting audio features from each data segment is achieved by performing feature extraction on the audio data to obtain individual frequency domain features or individual time domain features.
  • The audio features can be obtained in the following manner, performing the following processing on each data segment of the audio data: extract the time domain signal features and the frequency domain signal features of the data segment; based on the at least one time segment related to the source type in the playback timeline of each audio track data, determine a one-dimensional time domain weight value corresponding to the time domain signal features and a two-dimensional frequency domain weight value corresponding to the frequency domain signal features; perform multiple levels of convolution on the product of the one-dimensional time domain weight value and the time domain signal features to obtain time domain audio features; perform multiple levels of convolution on the product of the two-dimensional frequency domain weight value and the frequency domain signal features to obtain frequency domain audio features; scale the time domain audio features to obtain two-dimensional time domain audio features; and fuse the two-dimensional time domain audio features with the frequency domain audio features to obtain the audio features of the data segment. The time domain audio features are one-dimensional features; scaling them facilitates the fusion of the time domain features and the frequency domain features. For example, the reshape function can be used to transform the one-dimensional features into two-dimensional features with an unchanged number of elements.
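  • A minimal sketch of such a dual-branch (time domain and frequency domain) feature extractor is shown below (PyTorch); the layer counts, channel sizes and the interpolation used to scale the one-dimensional time domain features into a two-dimensional map are illustrative assumptions, not the exact configuration of the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamFeatures(nn.Module):
    """Sketch of a dual-stream (time domain + frequency domain) feature extractor."""

    def __init__(self):
        super().__init__()
        # Time domain branch: 1D convolutions over the weighted sample sequence.
        self.time_branch = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=4, padding=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=9, stride=4, padding=4), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        # Frequency domain branch: 2D convolutions over the weighted log-mel map.
        self.freq_branch = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, samples, logmel, w_time, w_freq):
        # Multiply the signals by the time/frequency domain weight values derived
        # from the weight value sequence before the convolutions.
        t = self.time_branch((samples * w_time).unsqueeze(1))   # (B, 32, L)
        f = self.freq_branch((logmel * w_freq).unsqueeze(1))    # (B, 32, H, W)
        # Scale (reshape) the 1D time domain features into a 2D map of the same
        # size as the frequency domain features so the two branches can be fused.
        t = F.interpolate(t, size=f.shape[-2] * f.shape[-1], mode="linear",
                          align_corners=False)
        t2d = t.reshape(f.shape)
        return torch.cat([t2d, f], dim=1)                        # fused feature map

# Example: 10 s of 16 kHz audio per data segment, 64-band log-mel with 100 frames.
model = DualStreamFeatures()
fused = model(torch.randn(2, 160000), torch.randn(2, 64, 100),
              torch.ones(2, 160000), torch.ones(2, 64, 100))
print(fused.shape)  # e.g. torch.Size([2, 64, 32, 50])
```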
  • FIG. 6B is a second schematic diagram of the audio processing model provided by the embodiment of the present application; step 304 can be implemented through the audio semantic information extraction module 605 in Figure 6B.
  • The structure of the audio semantic information extraction module 605 is a dual-stream type, including a time domain branch 606 and a frequency domain branch 607.
  • the time domain information and weight value sequence of the audio data are input to the time domain branch 606.
  • The time domain branch 606 includes multiple one-dimensional convolution layers (one-dimensional convolution layer 1, ..., one-dimensional convolution layer n), and the frequency domain information and the weight value sequence of the audio data are input to the frequency domain branch 607, which includes multiple two-dimensional convolution layers (two-dimensional convolution layer 1, ..., two-dimensional convolution layer n).
  • the feature fusion layer 608 is used to fuse the frequency domain features or time domain features output by the convolutional layers of each level in the two branches.
  • Figure 8 is a schematic diagram of the audio semantic information extraction module provided by the embodiment of the present application.
  • Figure 8 is a detailed structural diagram of the audio semantic information extraction module 605 in Figure 6B; the input of the audio semantic information extraction module is the original audio data of the video (characterized as a sequence of audio sample points).
  • the audio data is divided into multiple data segments (for example, divided in such a way that each data segment includes at least one frame, or each data segment is equal in length).
  • a basic feature map (logmel) is generated based on the audio data as frequency domain information and input to the frequency domain branch 607 , and the audio sampling point sequence (time domain information) of the audio data is input to the time domain branch 606 .
  • The weight value sequence output by the weight allocation unit 604 is processed by the fully connected layer 801 and the fully connected layer 802 to generate weight vectors with the same dimensions as the time domain signal features and the frequency domain signal features respectively, which are then multiplied element-wise (at corresponding positions) with the time domain signal features and the frequency domain signal features respectively.
  • The time domain branch 606 includes a number of one-dimensional convolution layers (one-dimensional convolution layer 803, one-dimensional convolution layer 804, one-dimensional convolution layer 806, one-dimensional convolution layer 808) and one-dimensional maximum pooling layers (one-dimensional maximum pooling layer 805, one-dimensional maximum pooling layer 807, one-dimensional maximum pooling layer 809). Using a large number of convolution layers on the time domain signal features makes it possible to directly learn the time domain characteristics of the audio data, especially information such as audio loudness and sample point amplitude.
  • The generated one-dimensional sequence is scaled (resized) through the deformation layer 810 into a feature map in the form of a two-dimensional graph (wave graph). This processing makes the features output by all channels of the time domain branch and the frequency domain branch the same size, which facilitates fusion processing.
  • the intermediate result is scaled into a two-dimensional graph (wave graph) through the deformation layer (deformation layer 811, deformation layer 812), and passed through the merging layer (for example: merging layer 813 , merging layer 815), a two-dimensional convolution layer (for example: a two-dimensional convolution layer 814, a two-dimensional convolution layer 816) and the intermediate results of the frequency domain branch 607 are merged at multiple levels, so that the final audio features are obtained It can integrate frequency domain features and time domain features of different sizes and levels.
  • The frequency domain information input to the frequency domain branch 607 can be a log-mel spectrum based on the mel frequency scale.
  • The frequency domain branch 607 includes a number of two-dimensional convolution layers (two-dimensional convolution layer 821, two-dimensional convolution layer 823, two-dimensional convolution layer 825) and two-dimensional maximum pooling layers (two-dimensional maximum pooling layer 822, two-dimensional maximum pooling layer 824). Using a large number of convolution layers on the frequency domain signal features makes it possible to directly learn the frequency domain characteristics of the audio data.
  • a two-dimensional feature map is obtained.
  • the dimension of the two-dimensional feature map is the same as the dimension of the feature map output by the time domain branch 606.
  • The intermediate results pass through the merging layers (for example, merging layer 813 and merging layer 815) and the two-dimensional convolution layers (for example, two-dimensional convolution layer 814 and two-dimensional convolution layer 816) and are merged at multiple levels with the intermediate results of the time domain branch 606, so that the final audio features can integrate frequency domain features and time domain features of different sizes and levels.
  • The deformation layer transforms the feature map through the reshape function (a function that transforms a specified matrix into a matrix of a specified dimension while keeping the number of elements unchanged; the function can readjust the number of rows, columns and dimensions of the matrix).
  • Fusing the two-dimensional time domain audio features and the frequency domain audio features to obtain the audio features of a data segment is implemented in the following manner: superimpose the two-dimensional time domain audio features and the frequency domain audio features; perform a two-dimensional convolution on the superposition features obtained by the superposition processing to obtain a two-dimensional convolution result; obtain the maximum superposition feature (max) and the average superposition feature (mean) of the two-dimensional convolution result; and perform linear activation on the sum of the two to obtain the audio features of the data segment.
  • The two-dimensional time domain audio features and the frequency domain audio features can each be represented as feature matrices. The feature matrix of the two-dimensional time domain audio features and the feature matrix of the frequency domain audio features are linearly added to obtain the superposition features, which are also represented in the form of a feature matrix.
  • audio features are represented in the form of vectors.
  • Linear activation means activating the sum of the maximum superposition feature and the average superposition feature through the ReLU function to obtain the audio features of the data segment.
  • the merging layer 817 in the feature fusion module 608 merges the feature maps output by the two branches.
  • the merging process enables the time domain and frequency domain to maintain complementary information, and also allows the high-level network to perceive the underlying network information.
  • The merging layer 817 outputs the two-dimensional frequency domain feature map of each data segment, and the two-dimensional frequency domain feature map is input into the two-dimensional convolutional neural network layer 818 to obtain the two-dimensional convolution result. The mean and maximum (max) of the last dimension of the output of the two-dimensional convolutional neural network layer are added through the merging layer 819, and the sum is passed through a linear activation function (ReLU) in the activation layer 820 to generate the final audio semantic feature vector (audio feature).
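  • A minimal sketch of this final fusion step is shown below; the convolution configuration and feature shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_to_semantic_vector(time_feat_2d: torch.Tensor,
                            freq_feat: torch.Tensor,
                            conv: nn.Conv2d) -> torch.Tensor:
    """Superimpose (add) the two-dimensional time domain and frequency domain
    features, apply a 2D convolution, take the mean and max over the last
    dimension, add them, and pass the sum through a ReLU to obtain the audio
    semantic feature vector."""
    superposed = time_feat_2d + freq_feat          # element-wise superposition
    conv_out = conv(superposed)                    # two-dimensional convolution result
    max_feat = conv_out.max(dim=-1).values         # maximum superposition feature
    mean_feat = conv_out.mean(dim=-1)              # average superposition feature
    return F.relu(max_feat + mean_feat)            # ReLU activation of the sum

# Example: per-segment 2D features of shape (batch, channels, H, W).
conv = nn.Conv2d(32, 32, kernel_size=3, padding=1)
vec = fuse_to_semantic_vector(torch.randn(4, 32, 16, 25),
                              torch.randn(4, 32, 16, 25), conv)
print(vec.shape)  # torch.Size([4, 32, 16]) -> stacked in time order into the feature sequence
```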
  • the audio semantic feature vectors of each data segment are combined according to the time sequence corresponding to the data segments to obtain an audio feature sequence.
  • In this way, the time domain features are converted into features of the same dimension as the frequency domain features, which reduces the complexity of fusing the time domain features and frequency domain features of the audio data, saves computing resources, and improves the efficiency and accuracy of feature fusion. Through the fusion of frequency domain features and time domain features, the information contained in the audio can be captured from different aspects, making the information represented by the audio features more comprehensive and improving the accuracy of the obtained audio features.
  • only frequency domain features or time domain features of audio data can be collected as audio features of the audio. By collecting only features of one domain, the calculation speed can be improved and computing resources can be saved.
  • Figure 3E is a fifth flow diagram of the audio data processing method provided by the embodiment of the present application.
• the audio feature sequence is encoded to obtain the attention parameter sequence of the audio data, which is implemented through steps 3041 to 3043 in Figure 3E.
  • step 3041 the following processing is performed for each audio feature in the audio feature sequence: based on the attention mechanism, the audio feature is separately fused with each audio feature of other data segments to obtain each weighted correlation corresponding to the audio feature.
  • the other data pieces are data pieces in the audio data other than the data piece for which the weighted correlation is currently acquired.
  • audio feature A is used as an example below.
• the fusion process takes the inner product of the embedding vector of audio feature A with the embedding vector of the audio feature of any other data segment, and multiplies the inner product result with audio feature A to obtain one weighted correlation of audio feature A; performing this for the audio feature of every other data segment yields each weighted correlation corresponding to audio feature A.
  • the embedding vector of each audio feature is determined in the following manner: the audio features of each data segment of the audio data are fully connected through a fully connected layer to obtain the embedding vector of each audio feature. .
• the audio feature sequence includes multiple audio features from a₁ to aₙ.
• the audio features corresponding to every two data segments are processed through the fully connected layer to obtain the one-dimensional embedding vectors corresponding to the audio features (the order of the two vectors is the same).
• step 3041 is implemented in the following manner: for the audio feature and each audio feature of the other data segments, the embedding vector of the audio feature is multiplied by the embedding vector of the other data segment to obtain the correlation between the audio feature and the audio feature of the other data segment; the audio feature is then multiplied by the correlation to obtain the weighted correlation corresponding to the audio feature.
  • the audio feature is represented in the form of a one-dimensional embedding vector.
  • the inner product of the two one-dimensional embedding vectors is calculated to obtain the correlation degree m.
• the first audio feature in the audio feature sequence is represented as a₁ and the i-th audio feature as aᵢ, and the correlation obtained by multiplying audio feature a₁ and audio feature aᵢ is m₁ᵢ.
• the correlation degree m₁ᵢ is multiplied by the audio feature a₁ to obtain the weighted correlation degree c₁ᵢ = m₁ᵢ × a₁.
  • each weighted correlation is added to obtain the attention parameter corresponding to the audio feature.
• the attention parameter W corresponding to each audio feature can be obtained; for example, the attention parameter of audio feature a₁ is W₁ = c₁₂ + c₁₃ + … + c₁ₙ.
  • each attention parameter is combined to form an attention parameter sequence of the audio data based on the order of the data segments corresponding to each audio feature.
  • the order of data segments refers to the time sequence of the data segments in the audio data.
  • Each attention parameter in the attention parameter sequence corresponds to each data segment one-to-one.
• each attention parameter is combined into the attention parameter sequence according to the time sequence of its corresponding data segment.
• each weight value in the weight value sequence also corresponds one-to-one to a data segment, and each weight value is likewise combined into the weight value sequence according to the time sequence of its corresponding data segment.
• the attention parameters are obtained by fusing the audio features based on the attention mechanism, so the recommended parameters can be determined more accurately from the attention parameters, improving the accuracy of determining recommended segments and solving the problem that recommended clips are difficult to determine for audio data or video data that lack playback record data.
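To make the encoding step concrete, here is a minimal sketch, assuming the audio features are row vectors and a single fully connected layer produces the embeddings; the dimensions and layer sizes are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

def attention_parameters(features: torch.Tensor, embed: nn.Linear) -> torch.Tensor:
    """features: (n_segments, feat_dim) audio feature sequence a_1..a_n.
    Returns an attention parameter vector W_i for every data segment."""
    emb = embed(features)                          # embedding vector of each audio feature
    corr = emb @ emb.T                             # m_ij: inner products between embeddings
    corr = corr - torch.diag(torch.diag(corr))     # only "other" data segments contribute
    # weighted correlation c_ij = m_ij * a_i, summed over j gives W_i = (sum_j m_ij) * a_i
    return corr.sum(dim=1, keepdim=True) * features

feat_dim = 128
embed = nn.Linear(feat_dim, feat_dim)              # fully connected layer producing embeddings
audio_feature_sequence = torch.randn(10, feat_dim) # 10 data segments
attention_sequence = attention_parameters(audio_feature_sequence, embed)
```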
  • step 305 the attention parameter sequence and the weight value sequence are fused to obtain the fusion parameters of each data segment, and the recommended parameters of each data segment are determined based on each fusion parameter.
  • the fusion process is to multiply the attention parameter sequence and the weight value sequence, and the number of elements contained in the attention parameter sequence and the weight value sequence is the same.
• step 305 is implemented in the following manner: for each data segment, the attention parameter corresponding to the data segment is obtained from the attention parameter sequence, and the weight value of the data segment is multiplied by the attention parameter of the data segment to obtain the fusion parameter of the data segment; the fusion parameter is normalized to obtain the recommended parameter of the data segment.
• for example, the fusion parameter of the first data segment in the audio data is Q₁ × Z₁, that is, the product of the weight value and the attention parameter of the first data segment.
  • the normalization process is to perform confidence prediction through the softmax function.
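A minimal sketch of step 305, assuming the attention parameters have already been reduced to one scalar per data segment; the values below are placeholders.

```python
import torch

attention_params = torch.tensor([0.8, 1.6, 0.3, 2.1])    # Z_1..Z_n from the attention module
weight_values    = torch.tensor([1.0, 2.0, 0.0, 2.0])    # Q_1..Q_n from the weight configuration

fusion_params = weight_values * attention_params          # fusion parameter of each data segment
recommended_params = torch.softmax(fusion_params, dim=0)  # normalization (confidence prediction)
print(recommended_params)   # one recommended parameter (e.g. excitement degree) per data segment
```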
  • the recommended parameter is the degree of excitement.
  • the part with voice in the video has a higher probability of being an exciting data segment.
• the audio track data of the voice source is therefore assigned a higher weight value than the background sound part, so that the confidence level of the excitement degree corresponding to voice data segments is higher than that corresponding to background sound data segments.
• in this way, the recommended parameters can represent the information of the audio data more comprehensively and quantitatively, improving the accuracy of determining the recommended parameters.
  • step 306 recommended segments in the audio data are determined based on the recommendation parameters of each data segment.
• recommended segments of the audio data are determined in either of the following ways: sorting the data segments in descending order of their recommended parameters and taking a preset number of data segments from the head of the order as recommended segments, where the preset number is positively related to the total number of data segments of the audio data (for example, one percent of the total number of data segments); or taking the data segments whose recommended parameter is greater than a recommended parameter threshold as recommended segments, where the threshold can be the median value of the recommended parameters of the data segments, or a preset multiple of the median value (for example, 1.5 times, with 1 < preset multiple < 2).
• for example, if the maximum recommended parameter is 0.9 and the minimum recommended parameter is 0, the median value 0.45 is taken as the recommended parameter threshold, and the data segments whose excitement degree is greater than 0.45 are regarded as exciting data segments.
• alternatively, if the maximum recommended parameter is 0.9 and the minimum recommended parameter is 0, and 1.1 times the median value is taken as the recommended parameter threshold, the recommended parameter threshold is 0.495.
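The two selection rules above can be sketched as follows; the "median" is computed as the midpoint of the maximum and minimum, as in the example, and the one-percent head count and the 1.1 multiple are just the example values from the text.

```python
import numpy as np

def select_recommended(recommended: np.ndarray, multiple: float = 1.1):
    """Return recommended-segment indices by threshold and by head of a descending sort."""
    median = (recommended.max() + recommended.min()) / 2   # e.g. (0.9 + 0) / 2 = 0.45
    threshold = multiple * median                          # e.g. 1.1 * 0.45 = 0.495
    by_threshold = np.where(recommended > threshold)[0]
    top_k = max(1, len(recommended) // 100)                # preset number: about 1% of the segments
    by_rank = np.argsort(recommended)[::-1][:top_k]
    return by_threshold, by_rank
```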
• the correlation between each data segment in the audio data and the source is quantified through the recommendation parameters, which characterize the probability that a data segment belongs to a specific type of recommended segment, and the data segments with the highest recommendation parameters are selected, so that the selected recommended clips represent positions of the specific type in the audio data. Compared with predictions made purely at the frequency domain and time domain level, combining the recognition of different sources is more comprehensive, so the recommendation parameters of each data segment can accurately identify valuable recommended segments and provide users with accurate reference information.
  • a recommended parameter curve of the audio data can also be generated based on the recommended parameters of each data segment; in response to the playback triggering operation, the recommended parameter curve of the audio data is displayed on the playback interface.
• the abscissa of the recommended parameter curve is the playback time of the audio data, and the ordinate is the recommended parameter.
  • FIG 10A is a first schematic diagram of a playback interface provided by an embodiment of the present application.
  • the playback interface 101A is the playback interface of the video player.
  • the recommended parameter is the excitement level.
  • the excitement level curve 106A is displayed in the area that does not block the video screen, and the highlight data segment 107A is marked.
  • the position of the slider 103A in the progress bar 105A is the position corresponding to the moment when the video is currently played.
  • the progress bar 105A can represent the playback time.
  • the level of the excitement level curve 106A can represent the level of excitement.
  • the play trigger operation can be for audio or video.
  • the playback interface can be an audio playback interface or a video playback interface.
• the audio playback interface corresponds to the audio playback scene and plays the audio data, while the video playback interface corresponds to the video playback scene, in which the audio data is extracted from the video data.
• the tags of the recommended clips can also be displayed on the playback interface, where a tag is used to characterize the time segment of a recommended clip; in response to a selection operation on any tag, playback jumps to the starting point of the recommended clip corresponding to the selected tag.
  • the selection operation may be a click operation, or an operation of dragging the progress bar slider to a label.
  • FIG. 10B is a second schematic diagram of the playback interface provided by an embodiment of the present application.
  • the slider 103A is dragged to the position of the label 104A, and the video screen is switched to the screen of the starting point of the highlight data segment 107A.
  • the audio data processing method provided by the embodiments of the present application is implemented through an audio processing model.
  • Source separation is implemented by calling the pyramid scene parsing network of the audio processing model.
  • Audio feature extraction from each data segment is implemented by calling the audio semantic information extraction module of the audio processing model.
• Encoding and fusion processing are implemented by calling the attention module of the audio processing model.
  • FIG. 6A is a first schematic diagram of an audio processing model provided by an embodiment of the present application.
• the audio processing model includes a pyramid scene parsing network 601, a weight configuration module 610, an audio semantic information extraction module 605, and an attention module 609.
  • the pyramid scene parsing network 601 is used to perform step 301
  • the weight configuration module 610 is used to perform step 303
  • the audio semantic information extraction module 605 is used to perform step 304
• the attention module 609 is used to perform step 305 and step 306.
  • the audio data is input to the pyramid scene analysis network 601.
  • the pyramid scene analysis network 601 performs source separation on the audio data into audio track data corresponding to at least one source type.
  • the weight configuration module 610 is used to implement step 303 above.
• the weight configuration module 610 determines the time segments associated with the source in the audio track data, assigns corresponding weight values to the time segments, and outputs the weight values to the audio semantic information extraction module 605 and the attention module 609.
• the audio data is also input to the audio semantic information extraction module 605 (for the specific structure of the audio semantic information extraction module, refer to Figure 6B and Figure 8).
  • the audio semantic information extraction module 605 extracts features of the audio data in both the time domain and the frequency domain.
  • the attention module 609 is an algorithm module used to run the attention mechanism.
• the attention module 609 performs parameter prediction based on the weight value sequence and the audio feature sequence using the attention mechanism, obtains the recommended parameters, and produces the recommended parameter curve.
• the audio processing model is trained in the following way: the label values of the actual recommended segments of the audio data (a label value is the recommendation parameter of an actual recommended segment; the label value of a positive sample is 1) are combined to form the actual recommended parameter sequence of the audio data; the recommended parameters of each data segment of the audio data are combined to form the predicted recommended parameter sequence of the audio data; the cross-entropy loss of the audio processing model is obtained based on the actual recommended parameter sequence and the predicted recommended parameter sequence; the cross-entropy loss is divided by the number of data segments of the audio data to obtain the average prediction loss, and backpropagation is performed on the audio processing model based on the average prediction loss to obtain the updated audio processing model.
  • the training data has manually labeled label values.
• the label values represent which data segments are actually recommended segments (exciting data segments): recommended segments are labeled 1 (positive samples) and non-recommended segments are labeled 0 (negative samples).
• all label values corresponding to a video form the actual recommended parameter sequence (a sequence composed of 0s and 1s). For example, the video is divided into N data segments (N is a positive integer) and the recommended segments are highlight segments; the highlight segments in the video are labeled manually, and the label values are combined in the order of the data segments from front to back in the video to form the actual recommended parameter sequence, which is represented as [1, 0, 1, ..., 0].
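A minimal sketch of this training objective, assuming the model outputs one predicted recommended parameter in [0, 1] per data segment; the values below are placeholders.

```python
import torch
import torch.nn.functional as F

def average_prediction_loss(predicted: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """predicted: predicted recommended parameter sequence;
    labels: actual recommended parameter sequence, e.g. [1, 0, 1, ..., 0]."""
    cross_entropy = F.binary_cross_entropy(predicted, labels, reduction="sum")
    return cross_entropy / labels.numel()          # divide by the number of data segments

predicted = torch.tensor([0.9, 0.2, 0.7, 0.1], requires_grad=True)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = average_prediction_loss(predicted, labels)
loss.backward()                                    # backpropagation updates the audio processing model
```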
• the highlight data segments can also be determined based on the audio features combined with the image information, in the following way: image features are extracted from the image data of the video and fused with the corresponding audio features to obtain fused video features; the attention mechanism is executed based on the video features to obtain the attention parameter sequence, and the recommended parameter sequence is determined based on the attention parameter sequence and the weight value sequence.
• the recommended segments identified from the audio data can also be optimized based on the image features of the video. This is achieved by performing image recognition on the image data of the video: based on the recognized video frames that include people, the time segments of the video that include people are determined, and the video data segments whose recommended parameters are greater than the recommended parameter threshold and whose corresponding frames include people are used as recommended segments.
• alternatively, the attention parameters can be obtained based on the image semantic feature sequence to form an attention parameter sequence, from which the recommended parameters corresponding to the video pictures are obtained; the recommended parameters of the video pictures and the recommended parameters of the audio data are then weighted and summed to obtain weighted recommendation parameters, and the video data segments whose weighted recommendation parameters are greater than the weighted recommendation parameter threshold are used as recommended segments.
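A sketch of the weighted summation described above; the weighting coefficient and threshold are assumptions, since the text does not fix their values.

```python
import numpy as np

def weighted_recommendation(audio_rec: np.ndarray, video_rec: np.ndarray,
                            alpha: float = 0.5, threshold: float = 0.5) -> np.ndarray:
    """Combine per-segment recommended parameters from the audio data and the video
    pictures, then keep the segments whose weighted value exceeds the threshold."""
    weighted = alpha * audio_rec + (1.0 - alpha) * video_rec
    return np.where(weighted > threshold)[0]       # indices of the recommended segments
```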
• the embodiments of this application analyze the entire video in multiple domains and with multiple layers of information from the perspective of the audio side, and can quickly locate the recommended segments in the entire audio (for example, exciting data segments, passionate data segments, sad data segments, or funny data segments), so that the audio-based recommended segments determine the position of the recommended segments' time segments in the video timeline. Therefore, without relying on the playback record data of the audio data, the recommended clips can be accurately identified, providing users with accurate reference information and improving the user experience. The recommendation parameter curve of the video can be provided to the player so that the audience can jump the playback progress bar from the current playback position to the position of a recommended clip, improving the audience's experience of using the player.
  • the popularity information associated with the video's timeline progress bar can be displayed in the player.
• Popularity information is usually calculated based on the video's playback record data (play volume, clicks, comments, etc.), but videos of newly released movies or TV series have no playback record data, and niche videos may not have enough playback record data to determine popularity.
  • the audio data processing method provided by the embodiment of the present application can generate a recommended parameter curve to replace the popularity information.
• the recommended parameter can be the excitement degree: the exciting data segments and the excitement degree curve in the video are shown to the user, and the user can jump directly to the exciting data segments for watching or listening according to the excitement degree curve or the clip tags, improving the user's viewing experience.
• the audio data processing method can also provide secondary-creation (derivative content) users with a highlight curve. Such users can clearly determine the highlight data segments in the video based on the curve, and locate and extract the highlight data segments in the entire video with one click; they can then directly perform subsequent short-video production based on the extracted results, greatly improving efficiency and avoiding the time wasted in manually identifying exciting data segments.
  • Figure 5 is an optional flow diagram of the audio data processing method provided by the embodiment of the present application. The following will be described in conjunction with the steps of Figure 5, taking the electronic device as the execution subject.
  • step 501 the video file to be processed is obtained.
  • the video file to be processed may be a video file of a TV series or movie.
• a video file consists of video frames and audio data, and audio track data corresponding to at least one source type can be extracted from the audio data.
  • Figure 4A is a schematic diagram of the audio data extracted from the video provided by the embodiment of the present application; from top to bottom in Figure 4A are a schematic diagram of the video frame (representing the preview screen of the video) and the audio features of the audio data.
  • step 502 the audio processing model is called based on the audio data of the video file to perform highlight confidence prediction processing, and a highlight confidence curve and highlight data fragments of the audio data are obtained.
  • FIG 6A is a first schematic diagram of an audio processing model provided by an embodiment of the present application.
  • the audio processing model includes a pyramid scene parsing network 601, a weight configuration module 610, an audio semantic information extraction module 605, and an attention module 609.
  • the audio data is input to the pyramid scene analysis network 601.
  • the pyramid scene analysis network 601 performs source separation on the audio data into audio track data corresponding to at least one source type.
• the weight configuration module 610 determines the time segments associated with the source in the audio track data, assigns corresponding weight values to the time segments, and outputs the weight values to the audio semantic information extraction module 605 and the attention module 609.
• the audio data is also input to the audio semantic information extraction module 605.
• the audio semantic information extraction module 605 performs feature extraction on the audio data in both the time domain and the frequency domain, and outputs the audio feature sequence that integrates the time domain and frequency domain information to the attention module 609.
  • the attention module performs parameter prediction processing based on the weight value sequence and the audio feature sequence, obtains recommended parameters, and creates a recommended parameter curve.
  • FIG. 6B is a second schematic diagram of the audio processing model provided by the embodiment of the present application.
• the pyramid scene parsing network 601 and the speech positioning unit 603 in the weight configuration module 610 perform millisecond-level positioning of the speech paragraphs in the entire audio track.
  • the voice positioning unit 603 adopts a voice activity detection algorithm
  • the pyramid scene parsing network 601 is a pyramid scene parsing network (PSPN, Pyramid Scene Parsing Network).
• the pyramid scene parsing network can more accurately separate different features in the audio spectrogram; in particular, the small convolution layers in the pyramid convolution layers can learn the boundaries between the spectrograms of different sources in the audio spectrogram and use the edges of the features of the different sources as masks to separate the spectrograms, making the separated audio track data of the different sources more accurate.
  • the original audio track of the video is input to the pyramid scene parsing network 601 and output as audio track data such as separated background sound track and voice track (audio track data 602 in Figure 6B). Then the voice activity detection open source algorithm is used to locate the speech segments in the speech track, thereby obtaining the time segments of the speech in the entire audio track.
  • the pyramid scene parsing network 601 separates the audio track of the entire video based on the source separation model built by the pyramid scene parsing network, splits the voice information and background sound information in the audio track, and stores them separately as audio track data ( audio track file).
  • the speech positioning unit 603 locates the speech data segments in the speech track data based on the speech activity detection algorithm to obtain the time segments in which speech exists, and the weight allocation unit 604 sets the weight of each speech time segment. Time segments of speech are assigned a higher weight value than time segments of pure background sound.
  • the feature maps of different levels generated by the pyramid pooling module are finally merged by the merging layer (concat), and the merged feature maps are spliced together and then input to the fully connected layer for classification.
• the pyramid scene parsing network outputs local information at different scales and between different sub-regions through the multi-level convolution layers of the pyramid pooling module, and constructs global prior information on the final convolution layer feature map of the pyramid scene parsing network.
  • This global prior information aims to eliminate the fixed-size input constraints of convolutional neural networks for image classification.
• Figure 7 is a schematic diagram of the pyramid scene parsing network provided by the embodiment of the present application; as described in detail below, Figure 7 is a detailed structural schematic diagram of the pyramid scene parsing network 601 in Figure 6A and Figure 6B.
• the convolutional neural network 701 performs feature extraction on the audio data to obtain the original features 702 of the audio data.
• the pyramid module 703 (including convolution layer 1, convolution layer 2, convolution layer 3, and convolution layer 4; more levels can be set according to the required extraction accuracy in a specific implementation) can fuse features of four different pyramid scales: convolution layer 1 highlights the single global pooling output at the coarsest level, while the convolution layers of different sizes at the other levels of the pyramid module divide the original feature map into different sub-regions and form local features for different locations.
• the different levels of convolution layers in the pyramid module output local features of different sizes.
• a 1×1 convolution is used after each level to reduce the number of channels at the corresponding level to 1/N of the original.
  • the low-dimensional feature map is then directly upsampled through the upsampling layer 704 through bilinear interpolation to obtain a feature map of the same size as the original feature map.
  • the feature maps 705 of different levels output by the pyramid module are merged and processed (concat), and the results of the merged processing are convolved through the convolution layer 706 to obtain the final pyramid global feature.
  • the architecture of the pyramid scene analysis model is in a pyramid shape. After the model inputs the image, it uses the pre-trained atrous convolution layer to extract the feature map.
  • Atrous convolutions are also called dilated convolutions (Dilated Convolutions), and the dilation rate is introduced in the convolution layer.
  • the expansion rate defines the spacing between each data value when the convolution kernel processes data. Since the introduction of a pooling layer will lead to the loss of global information, the role of the dilated convolutional layer is to provide a larger receptive field without using a pooling layer.
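For illustration, a dilated (atrous) convolution can be declared in PyTorch as below; the kernel size, dilation rate, and channel count are arbitrary example values.

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation rate 2: the kernel samples inputs two positions apart,
# enlarging the receptive field to 5x5 without introducing a pooling layer.
atrous = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, dilation=2, padding=2)
x = torch.randn(1, 64, 32, 32)
print(atrous(x).shape)   # torch.Size([1, 64, 32, 32]) - spatial size is preserved
```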
  • the final feature map size is 1/8 of the input image, and then the feature is input into the pyramid pooling module.
  • the model uses the pyramid pooling module in the pyramid scene parsing network to collect contextual information.
  • the pyramid pooling module has a 4-layer pyramid structure, and the pooling kernel covers all, half and small parts of the image.
• the previously obtained global feature map and the original feature map are combined and then convolved (using the global features as masks to separate the speech and background sound in the original features) to generate the final segmentation feature maps of the speech and the background sound.
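The pyramid pooling stage described above can be sketched as a PSPNet-style module; the pooling scales, channel counts, and layer sizes below are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pool the original features at several scales, reduce channels with 1x1
    convolutions, upsample back, concatenate with the original features, and convolve."""
    def __init__(self, channels: int = 256, scales=(1, 2, 3, 6)):
        super().__init__()
        self.levels = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(s),                          # pooling kernels of different coverage
                nn.Conv2d(channels, channels // len(scales), 1),  # 1x1 conv: channels reduced to 1/N
            )
            for s in scales
        )
        self.fuse = nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1)

    def forward(self, original: torch.Tensor) -> torch.Tensor:
        h, w = original.shape[-2:]
        pyramids = [
            F.interpolate(level(original), size=(h, w), mode="bilinear", align_corners=False)
            for level in self.levels                              # bilinear upsampling back to full size
        ]
        merged = torch.cat([original] + pyramids, dim=1)          # merge global and original features
        return self.fuse(merged)                                  # final convolution

features = torch.randn(1, 256, 32, 32)            # original features of the audio spectrogram
global_features = PyramidPooling()(features)
```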
  • Figure 4B is a schematic diagram of the audio track data provided by the embodiment of the present application; the upper picture in Figure 4B is the audio track waveform diagram (sampling sequence diagram), and the lower diagram is the audio track feature diagram corresponding to the voice.
  • the blank part is the discarded noise part.
  • the source separation model built through the pyramid scene analysis network can separate the audio track data corresponding to the speech and background sounds in the original audio track.
  • a voice activity detection algorithm (such as the WebRTC voice activity detection algorithm) can be used to locate specific audio impulse signal segments.
  • the voice activity detection algorithm is an algorithm that determines whether the audio is speech based on short-time energy (STE, Short Time Energy) and zero-crossing rate (ZCC, Zero Cross Counter).
• the short-time energy, that is, the energy of a frame of speech signal, is the sum of the squares of the signal within the frame, and the zero-crossing rate is the number of times the time-domain signal of a frame of speech crosses 0 (the time axis).
  • the principle of the voice activity detection algorithm is that the short-term energy of speech data segments is relatively large, but the zero-crossing rate is relatively small; conversely, the short-term energy of non-speech data segments is relatively small, but the zero-crossing rate is relatively large.
• speech signals and non-speech signals can be distinguished by measuring these two parameters of the signal and comparing them with the corresponding thresholds: when the short-term energy of the audio data is greater than the short-term energy threshold and the zero-crossing rate is less than the zero-crossing rate threshold, the audio segment is speech; otherwise, the audio segment is treated as noise.
  • Figure 4C is a schematic diagram of a time segment provided by an embodiment of the present application; the time segment circled in box 401C is a time segment of speech. Similarly, the waveform corresponding to each box circled in Figure 4C is a speech time segment. Time paragraph.
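A minimal sketch of the short-time-energy / zero-crossing-rate decision described above (not the WebRTC implementation), assuming a mono signal that has already been split into frames and hand-picked thresholds:

```python
import numpy as np

def is_speech_frame(frame: np.ndarray, energy_threshold: float, zcr_threshold: int) -> bool:
    """Speech frames have relatively high short-time energy and a relatively low zero-crossing rate."""
    short_time_energy = np.sum(frame.astype(np.float64) ** 2)      # sum of squares within the frame
    zero_crossings = np.sum(np.abs(np.diff(np.sign(frame))) > 0)   # times the signal crosses zero
    return short_time_energy > energy_threshold and zero_crossings < zcr_threshold

signal = np.random.randn(16000)                  # placeholder: 1 second of audio at 16 kHz
frame_len = 320                                  # 20 ms frames
frames = signal[: len(signal) // frame_len * frame_len].reshape(-1, frame_len)
speech_flags = [is_speech_frame(f, energy_threshold=100.0, zcr_threshold=100) for f in frames]
```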
  • the structure of the audio semantic information extraction module 605 is a dual-stream type, including a time domain branch 606 and a frequency domain branch 607.
  • the time domain information and weight value sequence of the audio data are input to the time domain branch 606.
• the time domain branch 606 includes multiple one-dimensional convolution layers (one-dimensional convolution layer 1, ..., one-dimensional convolution layer n).
  • the frequency domain information and weight value sequence of the audio data are input to frequency domain branch 607.
• the frequency domain branch 607 includes multiple two-dimensional convolution layers (two-dimensional convolution layer 1, ..., two-dimensional convolution layer n).
  • the feature fusion layer 608 is used to fuse the frequency domain features or time domain features output by the convolutional layers of each level in the two branches.
  • Figure 8 is a schematic diagram of the audio semantic information extraction module provided by the embodiment of the present application;
  • the input of the audio semantic information extraction module is the original audio data of the video (characterized as a sequence of audio sampling points).
  • the audio data is divided into multiple data segments (for example, by: each data segment includes at least one frame, or each data segment is equal in length).
  • a basic feature map (logmel) is generated based on the audio data as frequency domain information and input to the frequency domain branch 607 , and the audio sampling point sequence (time domain information) of the audio data is input to the time domain branch 606 .
• the weight value sequence output by the weight allocation unit 604 is processed by the fully connected layer 801 and the fully connected layer 802 to generate weight vectors with the same dimensions as the time domain signal features and the frequency domain signal features respectively, and these weight vectors are then multiplied position-by-position with the time domain signal features and the frequency domain signal features respectively.
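A sketch of how the weight value sequence might be projected by the two fully connected layers and multiplied position-by-position with the branch features; the feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

n_segments = 8
weight_values = torch.rand(n_segments)                    # weight value sequence from unit 604

time_feat = torch.randn(n_segments, 1024)                 # time-domain signal features per segment
freq_feat = torch.randn(n_segments, 64, 128)              # frequency-domain (logmel) features per segment

fc_time = nn.Linear(1, time_feat.shape[-1])               # fully connected layer 801
fc_freq = nn.Linear(1, freq_feat.shape[-1])               # fully connected layer 802

time_weights = fc_time(weight_values.unsqueeze(-1))               # (n_segments, 1024)
freq_weights = fc_freq(weight_values.unsqueeze(-1)).unsqueeze(1)  # (n_segments, 1, 128)

weighted_time = time_feat * time_weights                  # multiply corresponding positions
weighted_freq = freq_feat * freq_weights                  # broadcast over the mel-frequency axis
```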
• the time domain branch 606 includes a number of one-dimensional convolution layers (one-dimensional convolution layer 803, one-dimensional convolution layer 804, one-dimensional convolution layer 806, one-dimensional convolution layer 808) and one-dimensional maximum pooling layers (one-dimensional max pooling layer 805, one-dimensional max pooling layer 807, one-dimensional max pooling layer 809); applying multiple convolution layers to the time domain signal features can directly learn the time domain characteristics of the audio data, including the audio loudness and sample-point amplitude information.
• the generated one-dimensional sequence is reshaped through the deformation layer 810 into a feature map in the form of a two-dimensional graph (wave graph); this processing gives the features output by the time domain branch and the frequency domain branch the same size, which facilitates fusion processing.
• the intermediate results are reshaped into two-dimensional graphs (wave graphs) through the deformation layers (deformation layer 811, deformation layer 812) and are merged at multiple levels with the intermediate results of the frequency domain branch 607 through the merging layers (for example, merging layer 813 and merging layer 815) and the two-dimensional convolution layers (for example, two-dimensional convolution layer 814 and two-dimensional convolution layer 816), so that the final audio features integrate frequency domain features and time domain features of different sizes and levels.
  • the frequency domain information output by the frequency domain branch 607 can be a logmel spectrum using the Mel frequency domain.
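For reference, a logmel spectrogram of a data segment could be computed as below, assuming librosa is available; the sample rate, FFT size, and mel-band count are placeholder values.

```python
import numpy as np
import librosa

sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # placeholder: 1 second of a 440 Hz tone

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
logmel = librosa.power_to_db(mel)                  # log-scaled Mel spectrogram (frequency-domain input)
print(logmel.shape)                                # (n_mels, number_of_frames)
```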
• the frequency domain branch 607 includes a number of two-dimensional convolution layers (two-dimensional convolution layer 821, two-dimensional convolution layer 823, two-dimensional convolution layer 825) and two-dimensional maximum pooling layers (two-dimensional maximum pooling layer 822, two-dimensional maximum pooling layer 824); applying multiple convolution layers to the frequency domain signal features can directly learn the frequency domain characteristics of the audio data.
  • a two-dimensional feature map is obtained.
  • the dimension of the two-dimensional feature map is the same as the dimension of the feature map output by the time domain branch 606.
• the intermediate results pass through the merging layers (for example, merging layer 813 and merging layer 815) and the two-dimensional convolution layers (for example, two-dimensional convolution layer 814 and two-dimensional convolution layer 816) and are merged at multiple levels with the intermediate results of the time domain branch 606, so that the final audio features integrate frequency domain features and time domain features of different sizes and levels.
• the deformation layer transforms the feature map through the reshape function (a function that transforms a given matrix into a matrix of a specified shape while keeping the number of elements unchanged; the function can re-adjust the number of rows, columns, and dimensions of the matrix).
  • the merging layer 817 in the feature fusion module 608 merges the feature maps output by the two branches.
  • the merging process enables the time domain and the frequency domain to maintain complementary information, while also allowing the high-level network to perceive the underlying network information.
• the merging layer 817 outputs the two-dimensional frequency domain feature map of each data segment, and the two-dimensional frequency domain feature map is input into the two-dimensional convolutional neural network layer 818 to obtain the two-dimensional convolution result; the average value (Mean) and the maximum value (Max) of the two-dimensional convolution result are determined, the average value and the maximum value are added through the merging layer 819, and the sum is activated through the activation layer 820 using the ReLU function to generate the final audio semantic feature vector (audio feature).
  • the audio semantic feature vectors of each data segment are combined to obtain an audio feature sequence.
  • the attention module 609 receives the weight value sequence and the audio feature sequence.
  • the attention module obtains the attention parameter sequence based on the audio feature sequence encoding, and predicts the recommended parameters of each data segment based on the attention parameter sequence and the weight value sequence.
  • Figure 9 is a schematic diagram of the coding principle in the attention module provided by an embodiment of the present application.
• the audio features corresponding to every two data segments are processed through the fully connected layer to obtain the one-dimensional embedding vectors corresponding to the audio features (the order of the two vectors is the same), and the inner product of the two one-dimensional embedding vectors is calculated to obtain the correlation degree m.
• for example, the correlation degree between audio feature a₁ and audio feature aᵢ is m₁ᵢ.
  • the correlation is multiplied by the vector corresponding to the audio feature to obtain the weighted correlation information c (the weighted correlation above).
• the weighted correlation information amount between audio feature a₁ and audio feature aᵢ is c₁ᵢ = m₁ᵢ × a₁.
• by adding the weighted correlations, the attention parameter W corresponding to each audio feature can be obtained; for example, the attention parameter of audio feature a₁ is W₁ = c₁₂ + c₁₃ + … + c₁ₙ.
• for the feature sequence Q (the granularity of the feature sequence Q can be frame level), the feature nodes of each granularity are normalized through a binary classification layer: the binary classification labels are 1 and 0, and the posterior probability of category 1 is the confidence level (excitement degree) of the feature node, that is, the probability that the feature node is exciting; normalization processing (for example, through the softmax function) is performed on the entire recommended parameter sequence to obtain the excitement degree curve.
• a corresponding excitement threshold can be set; the data segments whose excitement degree is greater than the threshold are regarded as exciting data segments, and those whose excitement degree is less than the threshold are regarded as non-exciting data segments.
  • the training data has manually labeled labels (labels) during the training process.
• the labels represent which data segments are actually recommended segments (highlight data segments), where recommended segments are labeled 1 (positive samples) and non-recommended segments are labeled 0 (negative samples).
  • all labels corresponding to a video can form a 0-1 sequence.
  • the audio processing model can be trained through backpropagation based on the prediction loss.
  • step 503 in response to the playback triggering operation, the recommended parameter curve of the video file is displayed on the playback interface.
  • the recommended parameter curve of the playback interface is bound to the progress bar of the timeline of the playback interface.
• the excitement degree curve is displayed above the progress bar; the higher the excitement degree, the higher the corresponding value of the curve, and users can drag the progress bar based on the excitement degree curve to locate the exciting data segments for viewing.
• the embodiment of this application uses audio information to automatically identify exciting data segments. Automated positioning can locate exciting data segments quickly and at industrial scale; in practical applications, especially for the heat curve (excitement degree curve) on the playback side, it enables fast batch production, improves production efficiency, and reduces production costs.
• the embodiment of this application uses the full audio information as the feature input for locating exciting data segments, which makes up for the problem that data segments without high-intensity pictures but with high-intensity background music cannot be located (such as sitcoms). In particular, locating highlight data segments from pictures can only locate the most popular shots in the whole picture and cannot ensure the integrity of the entire highlight data segment, whereas using audio can locate the complete segment. Moreover, common picture processing models have a large number of parameters and cannot quickly predict exciting data segments, while the audio network has fewer parameters and is faster and more convenient.
• the embodiment of this application uses a pyramid scene parsing network to build a source separation system and then uses a voice activity detection algorithm to locate the speech paragraphs. This method can not only detect complete speech information, but also allows the entire source separation system to learn more complete positioning information of the speech data segments.
  • the embodiment of this application uses the time segment information of the speech to determine the weight information of each node in the entire audio track.
• the embodiments of the present application can directly locate voice data segments and assign corresponding weight values to them, which enhances the recognition of the semantic information of the voice data segments and greatly increases the proportion of voice semantic information in locating exciting data segments.
  • the embodiment of the present application uses a multi-domain and multi-layer method to extract semantic features, which can supplement information in different network layers through time domain and frequency domain.
  • Frequency domain information is added to time domain features.
  • Time domain information is added to the frequency domain features.
• the software modules may include: a source separation module 4551, configured to perform source separation on the audio data to obtain audio track data corresponding to at least one source type; a weight configuration module 4552, configured to determine, in the playback timeline of each audio track data, at least one time segment related to the source type, assign a corresponding weight value to each data segment in the audio data based on the length of the time segments it contains, and combine the weight values to form the weight value sequence of the audio data; a feature extraction module 4553, configured to combine the audio features extracted from each data segment to form the audio feature sequence of the audio data, and encode the audio feature sequence to obtain the attention parameter sequence of the audio data; and a parameter prediction module 4554, configured to fuse the attention parameter sequence and the weight value sequence to obtain the fusion parameter of each data segment, determine the recommended parameter of each data segment based on each fusion parameter, and determine the recommended segments in the audio data based on the recommended parameters of the data segments.
  • the source separation module 4551 is configured to perform feature extraction processing on the audio data to obtain global features of the audio data; use the global features as masks to perform source separation on the audio data to obtain each type of audio data. Audio track data corresponding to different source types, where the boundaries of the mask are used to represent the boundaries between audio data corresponding to different source types.
  • the source separation module 4551 is configured to perform feature extraction processing on the audio data to obtain the original features of the audio data; perform multiple levels of pooling processing on the original features to obtain multiple local features of the audio data; Multiple local features are combined to obtain global features of the audio data.
  • the weight configuration module 4552 is configured to determine at least one time segment related to the source type in the following manner: when the source type corresponding to the audio track data is speech, the short-term energy in the audio track data is greater than The time segment with the energy threshold and the zero-crossing rate is less than the zero-crossing rate threshold is regarded as the time segment related to speech; when the source type corresponding to the audio track data is background sound, the time segment that meets the filtering conditions in the audio track data is regarded as the time segment related to the speech. Time segments related to background sounds, where the filtering conditions include any of the following: the loudness corresponding to the time segment is greater than the lower limit of loudness; the length of the time segment is greater than the lower limit of length.
  • the weight configuration module 4552 is configured to perform the following processing on each data segment when audio track data of two source types, voice and background sound, are obtained through source separation: when the data segment belongs to voice-related When a time segment occurs, the weight value corresponding to the data segment is determined based on the speech parameters corresponding to the data segment, where the weight value is positively related to the parameter, and the parameters include at least one of the following: speech speed, intonation, loudness; when the data segment is related to background sound When the time segment is, the preset value is used as the weight value corresponding to the data segment, where the preset value is smaller than the weight value of any voice-related data segment; when the data segment does not belong to the time segment related to any source type, the preset value is used as the weight value corresponding to the data segment. Zero is used as the weight value corresponding to the data segment.
  • the weight configuration module 4552 is configured to perform the following processing on each data segment when only one source type of background sound track data is obtained through source separation: when the time segment contained in the data segment belongs to When the background sound is related to the time segment, the weight value corresponding to the data segment is determined based on the parameters of the background sound corresponding to the data segment, where the weight value is positively related to the parameter, and the parameter includes at least one of the following: loudness, pitch; when the data segment contains When a time segment does not belong to a time segment related to any source type, zero is used as the weight value corresponding to the data segment.
• the feature extraction module 4553 is configured to perform the following processing for each data segment in the audio data: extract the time domain signal features and frequency domain signal features of the data segment; based on the at least one time segment related to the source type in the playback timeline of each audio track data, determine the one-dimensional time domain weight value corresponding to the time domain signal features and the two-dimensional frequency domain weight value corresponding to the frequency domain signal features; perform multi-level convolution on the product of the one-dimensional time domain weight value and the time domain signal features to obtain the time domain audio features; perform multi-level convolution on the product of the two-dimensional frequency domain weight value and the frequency domain signal features to obtain the frequency domain audio features; scale the time domain audio features to obtain two-dimensional time domain audio features; and fuse the two-dimensional time domain audio features and the frequency domain audio features to obtain the audio features of the data segment.
• the feature extraction module 4553 is configured to superimpose the two-dimensional time domain audio features and the frequency domain audio features, perform two-dimensional convolution on the superimposed features to obtain a two-dimensional convolution result, obtain the maximum superposition feature and the average superposition feature of the two-dimensional convolution result, and perform linear activation on the sum of the maximum superposition feature and the average superposition feature to obtain the audio features of the data segment.
  • the parameter prediction module 4554 is configured to perform the following processing for each audio feature in the audio feature sequence: fuse the audio feature with each audio feature of other data segments based on the attention mechanism to obtain the audio feature. Each corresponding weighted correlation degree; add each weighted correlation degree to obtain the attention parameter corresponding to the audio feature, where other data fragments are data fragments in the audio data other than the data fragment; based on each audio The order of the data segments corresponding to the features, and each attention parameter is combined to form the attention parameter sequence of the audio data.
• the parameter prediction module 4554 is configured to fuse each audio feature with each audio feature of the other data segments based on the attention mechanism to obtain each weighted correlation corresponding to the audio feature, in the following manner: the audio features of each data segment are fully connected to obtain the embedding vector of each audio feature; then, for the audio feature and each audio feature of the other data segments, the embedding vector of the audio feature is multiplied by the embedding vector of the other data segment to obtain the correlation between the audio feature and the audio feature of the other data segment, and the audio feature is multiplied by the correlation to obtain the weighted correlation corresponding to the audio feature.
  • the parameter prediction module 4554 is configured to perform the following processing for each data segment: obtain the attention parameter corresponding to the data segment from the attention parameter sequence, and compare the weight value of the data segment with the attention parameter of the data segment. Multiply to obtain the fusion parameters of the data fragments; normalize the fusion parameters to obtain the recommended parameters of the data fragments.
  • the parameter prediction module 4554 is configured to determine recommended segments of audio data in any of the following ways: sorting each data segment in descending order based on the recommended parameters of each data segment, sorting the header of the descending order At least one data segment is used as a recommended segment of audio data; the data segment with a recommended parameter greater than the recommended parameter threshold is used as a recommended segment.
  • the parameter prediction module 4554 is configured to generate a recommended parameter curve of the audio data based on the recommended parameters of each data segment; in response to the playback triggering operation, display the recommended parameter curve of the audio data on the playback interface, where the recommendation The abscissa of the parameter curve is the playback time of the audio data, and the ordinate of the recommended parameter curve is the recommended parameters.
• the parameter prediction module 4554 is configured to display the tags of the recommended clips on the playback interface, where a tag is used to characterize the time segment of a recommended clip, and, in response to a selection operation on any tag, to start playback from the starting point of the recommended clip corresponding to the selected tag.
• source separation is implemented by calling the pyramid scene parsing network of the audio processing model, audio features are extracted from each data segment by calling the audio semantic information extraction module of the audio processing model, and encoding and fusion processing are implemented by calling the attention module of the audio processing model; the audio processing model is trained in the following way: the label values of the actual recommended segments of the audio data are combined to form the actual recommended parameter sequence of the audio data; the recommended parameters of each data segment of the audio data are combined to form the predicted recommended parameter sequence of the audio data; the cross-entropy loss of the audio processing model is obtained based on the actual recommended parameter sequence and the predicted recommended parameter sequence; and the cross-entropy loss is divided by the number of data segments of the audio data to obtain the average prediction loss, based on which backpropagation is performed on the audio processing model to obtain the updated audio processing model.
  • Embodiments of the present application provide a computer program product or computer program.
  • the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the audio data processing method described in the embodiments of the present application.
• Embodiments of the present application provide a computer-readable storage medium in which executable instructions are stored. When executed by a processor, the executable instructions cause the processor to execute the audio data processing method provided by the embodiments of the present application, for example, the audio data processing method shown in Figure 3A.
• the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; it may also be any of various devices including one of, or any combination of, the above memories.
• executable instructions may take the form of a program, software, a software module, a script, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
• executable instructions may be deployed to be executed on one electronic device, on multiple electronic devices located at one site, or on multiple electronic devices distributed across multiple sites and interconnected by a communication network.
• from the perspective of the audio side, the entire video is analyzed in multiple domains and with multiple layers of information, and the recommended segments in the entire audio (for example, exciting data segments, passionate data segments, sad data segments, or funny data segments) can be quickly located, so that the audio-based recommended segments determine the position of the recommended segments' time segments in the video timeline. Therefore, without relying on the playback record data of the audio data, the recommended clips can be accurately identified, providing users with accurate reference information and improving the user experience. The recommendation parameter curve of the video can be provided to the player so that the audience can jump the playback progress bar from the current playback position to the position of a recommended clip, improving the audience's experience of using the player.

Abstract

An audio data processing method and apparatus, an electronic device, a program product, and a storage medium. The method includes: extracting, from audio data, audio track data corresponding to at least one source type (301); determining at least one time segment related to the source type in the playback timeline of each audio track data, and determining the time segments contained in each data segment of the audio data (302); assigning a corresponding weight value to each data segment in the audio data, and combining the weight values to form a weight value sequence of the audio data (303); extracting audio features from each data segment, combining the audio features of the data segments to form an audio feature sequence of the audio data, and encoding the audio feature sequence to obtain an attention parameter sequence of the audio data (304); fusing the attention parameter sequence with the weight value sequence to obtain a fusion parameter of each data segment, and determining a recommended parameter of each data segment based on each fusion parameter (305); and determining recommended data segments in the audio data based on each recommended parameter (306).

Description

Audio data processing method and apparatus, electronic device, program product, and storage medium
Cross-Reference to Related Applications
This application is based on and claims priority to Chinese patent application No. 202210747175.3, filed on June 29, 2022, the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to computer technology, and in particular to an audio data processing method and apparatus, an electronic device, a program product, and a storage medium.
Background
Online multimedia (for example, video or audio) playback platforms need to mark certain special data segments in multimedia data, called recommended segments, such as highlight data segments and popular data segments, to facilitate user viewing.
In the related art, recommended data segments of a video or audio are determined from its playback record data; however, newly released videos or audios have no playback record data, so the recommended segments can only be annotated manually. For example, the highlight data segments of an entire episode of a series are located by manual annotation. Manual annotation, however, depends heavily on the annotator's subjective perception, so the annotated recommended segments vary from annotator to annotator; it is also time-consuming and inefficient, and cannot support fast batch production.
In summary, for multimedia data without a large amount of playback record data, there is currently no good way to identify recommended data segments.
发明内容
本申请实施例提供一种音频数据的处理方法、装置、电子设备、计算机程序产品及计算机可读存储介质、计算机程序产品,能够从音频数据中准确识别出推荐片段。
本申请实施例的技术方案是这样实现的:
本申请实施例提供一种音频数据的处理方法,由电子设备执行,包括:
从音频数据提取得到至少一种信源类型分别对应的音轨数据,其中,所述音频数据包含多个数据片段;
确定每个所述音轨数据的播放时间轴中与所述信源类型相关的至少一个时间段落,并确定所述音频数据中每个所述数据片段中分别包含的时间段落;
对所述音频数据中的每个数据片段基于所包含的所述时间段落长度分配对应的权重值,并将每个所述权重值组合形成所述音频数据的权重值序列;
从所述每个数据片段提取音频特征,将所述每个数据片段的音频特征组合形成所述音频数据的音频特征序列,并对所述音频特征序列进行编码,得到所述音频数据的注意力参数序列;
将所述注意力参数序列与所述权重值序列融合得到每个所述数据片段的融合参数,并基于每个所述融合参数确定每个所述数据片段的推荐参数;
基于每个所述数据片段的推荐参数,确定所述音频数据中的推荐片段。
本申请实施例提供一种音频数据的处理装置,包括:
信源分离模块,配置为从音频数据提取得到至少一种信源类型分别对应的音轨数据, 其中,所述音频数据包含多个数据片段;
权重配置模块,配置为确定每个所述音轨数据的播放时间轴中与所述信源类型相关的至少一个时间段落,并确定所述音频数据中每个所述数据片段中分别包含的时间段落;
对所述音频数据中的每个数据片段基于所包含的所述时间段落长度分配对应的权重值,并将每个所述权重值组合形成所述音频数据的权重值序列;
特征提取模块,配置为从所述每个数据片段提取音频特征,将所述每个数据片段的音频特征组合形成所述音频数据的音频特征序列,并对所述音频特征序列进行编码,得到所述音频数据的注意力参数序列;
参数预测模块,配置为将所述注意力参数序列与所述权重值序列融合得到每个所述数据片段的融合参数,并基于每个所述融合参数确定每个所述数据片段的推荐参数;
所述参数预测模块,还配置为基于每个所述数据片段的推荐参数,确定所述音频数据中的推荐片段。
本申请实施例提供一种电子设备,包括:
存储器,用于存储可执行指令;
处理器,用于执行所述存储器中存储的可执行指令时,实现本申请实施例提供的音频数据的处理方法。
本申请实施例提供一种计算机可读存储介质,存储有可执行指令,用于引起处理器执行时,实现本申请实施例提供的音频数据的处理方法。
本申请实施例提供一种计算机程序产品,包括计算机程序或指令,所述计算机程序或指令被处理器执行时实现本申请实施例提供的音频数据的处理方法。
本申请实施例具有以下有益效果:
一方面,通过对音频数据提取至少一种信源对应的音轨数据,基于信源相关的时间段落对数据片段分配对应的权重值,从而,在将各个数据片段的权重值组成的权重值序列与注意力参数序列融合时,能够从时域突出与信源对应的数据片段的重要程度;另一方面,通过注意力参数序列来从频域层面突出音频特征中与信源相关的数据片段的特征,这样,通过对音频数据的音轨的时域、频域两个方面的信息进行量化,预测音频数据中每个数据片段属于某一类型的数据片段的概率(推荐参数),相较于单纯从频域的层面来预测,识别更加全面,从而基于每个数据片段的推荐参数可以准确识别出有价值的推荐片段,为用户提供准确的参考信息。
附图说明
图1是本申请实施例提供的音频数据的处理方法的应用模式示意图;
图2是本申请实施例提供的电子设备的结构示意图;
图3A是本申请实施例提供的音频数据的处理方法的第一流程示意图;
图3B是本申请实施例提供的音频数据的处理方法的第二流程示意图;
图3C是本申请实施例提供的音频数据的处理方法的第三流程示意图;
图3D是本申请实施例提供的音频数据的处理方法的第四流程示意图;
图3E是本申请实施例提供的音频数据的处理方法的第五流程示意图;
图4A是本申请实施例提供的视频中提取的音频数据的示意图;
图4B是本申请实施例提供的音轨数据示意图;
图4C是本申请实施例提供的时间段落示意图;
图5是本申请实施例提供的音频数据的处理方法的一个可选的流程示意图;
图6A是本申请实施例提供的音频处理模型的第一示意图;
图6B是本申请实施例提供的音频处理模型的第二示意图;
图7是本申请实施例提供的金字塔场景解析网络的示意图;
图8是本申请实施例提供的音频语义信息提取模块的示意图;
图9是本申请实施例提供的注意力模块中编码的原理示意图;
图10A是本申请实施例提供的播放界面的第一示意图;
图10B是本申请实施例提供的播放界面的第二示意图;
图11是本申请实施例提供的音频数据的示意图。
具体实施方式
为了使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请作进一步地详细描述,所描述的实施例不应视为对本申请的限制,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。
在以下的描述中,涉及到“一些实施例”,其描述了所有可能实施例的子集,但是可以理解,“一些实施例”可以是所有可能实施例的相同子集或不同子集,并且可以在不冲突的情况下相互结合。
在以下的描述中,所涉及的术语“第一\第二\第三”仅仅是区别类似的对象,不代表针对对象的特定排序,可以理解地,“第一\第二\第三”在允许的情况下可以互换特定的顺序或先后次序,以使这里描述的本申请实施例能够以除了在这里图示或描述的以外的顺序实施。
需要指出,在本申请实施例中,涉及到用户信息、用户反馈数据等相关的数据(例如:多媒体数据、语音、音轨数据等),当本申请实施例运用到具体产品或技术中时,需要获得用户许可或者同意,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。
对本申请实施例进行进一步详细说明之前,对本申请实施例中涉及的名词和术语进行说明,本申请实施例中涉及的名词和术语适用于如下的解释。
1)金字塔场景解析网络(PSPN,Pyramid Scene Parsing Network),金字塔场景解析网络的作用是预测所关注对象的标签(label)、位置(location)和形状(shape)。该网络中包括金字塔池化模块(Pyramid Pooling Module),金字塔池化模块可以将局部的上下文信息进行聚合,形成全局的上下文信息,更全面地实现定位、分类等处理。
2)信源分离,在音频数据(例如从视频数据的音频轨道中提取的音频数据,或,从音频文件中提取的音频数据)中,可能会承载一种或多种音频信号(即数字音频信号的简称,数字音频信号是对模拟音频信号进行采样和编码得到的),信源是发出声音信号的来源,信源类型是发声来源的类型,每种音频信号对应一种信源类型(例如语音对应的信源类型为人类),而信源分离就是通过信号处理或者其他算法进行分离处理,提取出指定信源的音频信号的序列,最终生成由不同信源类型的音频信号的序列分别构成的音轨数据,例如:语音音轨数据,背景音轨数据。
3)语音活动检测(VAD,Voice Activity Detection)算法,用于检测音频中语音/非语音(非语音/静音)的算法。广泛应用于语音编码、降噪和自动语音识别等场景(ASR,Automatic Speech Recognition)中。
4)时域和频域,时域和频域是音频数据的基本性质,用来分析音频数据的不同角度称为域,是衡量音频特征的两个维度概念。时域维度下,将音频数据的采样点在时间上进行展示处理,信号与时间之间存在相应的关系。通过傅里叶变换可以把信号从时域 转换到频域。频域用于分析音频数据在各个频带上的能量分布,包含音频数据一定程度上的特征表现。
5)梅尔(Mel)频率,一种基于人耳对等距的音高(pitch)变化的感官判断而定的非线性频率刻度,是在进行信号处理时,更能够迎合人耳的听觉感受阈值变化来人为设定的频率刻度,在音频处理领域,有很多基础音频特征是通过mel频率来进行计算的。
6)卷积神经网络(CNN,Convolutional Neural Networks),是一类包含卷积计算且具有深度结构的前馈神经网络(FNN,Feed forward Neural Networks),是深度学习(Deep Learning)的代表算法之一。卷积神经网络具有表征学习(Representation Learning)能力,能够按其阶层结构对输入图像进行平移不变分类(Shift-invariant Classification)。卷积神经网络的人工神经元可以响应一部分覆盖范围内的周围单元。卷积神经网络由一个或多个卷积层和顶端的全连接层(对应经典的神经网络)组成,同时也包括关联权重和池化层(pooling layer)。
7)注意力(Attention)机制,模仿人类注意力而提出的一种解决问题的办法,能够从大量信息中快速筛选出高价值信息。注意力机制用于主要用于解决长短期记忆网络(LSTM,Long Short-Term Memory)、循环神经网络(RNN,Recurrent Neural Network)模型输入序列较长的时候很难获得最终合理的向量表示问题,做法是保留长短期记忆网络的中间结果,用新的模型对中间结果与输出结果之间的关联性进行学习,从而确定输出结果中精彩程度较高的信息,从而达到信息筛选的目的。
8)时间段落,多媒体数据的播放时间轴上的一个区间,例如时长为10分钟的视频,在播放时间轴上从第5分钟至第8分钟的区间,可以称为1个时间段落。
9)数据片段,多媒体数据中对应时间段落的数据。例如时长为10分钟的视频,在播放时间轴上从第5分钟至第8分钟的时间段落所对应的数据,可以称为1个数据片段,可以区分为音轨轨道的数据片段和视频轨道的数据片段。一个视频可以划分为多个时长相等的数据片段。
10)推荐片段,多媒体数据中包括待表达的关键信息或极性情感(例如悲伤、愉快)的数据片段,在播放时间上对应播放时间轴上的一个时间数据片段,多媒体数据可以是视频、歌曲、有声小说和广播剧等,推荐片段可以是以下类型:电影中包括关键情节的精彩片段,歌曲中抒发悲伤情感的悲伤片段等。
11)推荐参数,量化表征一个数据片段属于某个特定类型的推荐片段的概率,例如:推荐参数表征推荐片段是多媒体数据中的精彩片段的概率。
本申请实施例提供一种音频数据的处理方法、音频数据的处理装置、电子设备、计算机程序产品和计算机可读存储介质,能够准确获取音频数据中的推荐片段。
参考图1,图1是本申请实施例提供的音频数据的处理方法的应用模式示意图;示例的,涉及的服务器包括:识别服务器201与媒体服务器202,其中,媒体服务器202可以是视频平台的服务器、音乐平台的服务器、有声小说或者广播剧平台的服务器等。图1中还示出了网络300及终端设备401。识别服务器201与媒体服务器202之间通过网络300进行通信,或者通过其他方式进行通信,终端设备401通过网络300连接媒体服务器202,网络300可以是广域网或者局域网,又或者是二者的组合。
在一些实施例中,媒体服务器202将音频数据(例如有声小说、在线音乐)发送给识别服务器201,识别服务器201确定音频数据中每个数据片段的推荐参数(例如,数据片段属于精彩片段、悲伤片段、搞笑数据片段的概率,推荐参数与精彩程度、悲伤程度、搞笑程度等正相关),并基于推荐参数生成推荐参数曲线、确定音频数据中的推荐片段。将推荐参数曲线与推荐片段发送至媒体服务器202,媒体服务器202将推荐参数曲线、推荐片段位置标签发送给终端设备401,终端设备401运行播放器402,当播放 器402播放对应的音频数据时,显示推荐参数曲线、推荐片段位置标签。用户基于推荐参数曲线能够方便地确定音频数据中每个数据片段的推荐参数,以及基于推荐片段位置标签能够跳转到对应的位置进行播放,便于定位推荐片段。
在一些实施例中,从视频数据(例如在线视频或本地视频)的音频轨道中分割出音频数据,得到音频数据。音频数据与视频画面的时间轴是对齐的,音频数据的精彩数据片段与视频数据的精彩数据片段一一对应。推荐片段可以分别是精彩数据片段、悲伤数据片段、搞笑数据片段等。下面以推荐参数是数据片段属于精彩片段的概率,推荐片段是精彩数据片段举例说明。用户可以是观看视频的观众,或者将视频数据作为素材进行二次创作的用户。用户通过精彩数据片段的推荐参数曲线、位置标签可以快速确定视频中的精彩数据片段,进而观看精彩数据片段,或者将精彩数据片段从视频数据中剪切出来,进行视频二次创作。
在一些实施例中,识别服务器201与媒体服务器202可以集成在一起实施为一个统一的服务器,也可以分开设置。
本申请实施例可以通过区块链技术实现,可以将本申请实施例的音频数据的处理方法得到的推荐参数曲线为检测结果,将检测结果上传到区块链中存储,通过共识算法保证检测结果的可靠性。区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Block chain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层。
本申请实施例可以通过数据库技术实现,数据库(Database),简而言之可视为电子化的文件柜存储电子文件的处所,用户可以对文件中的数据进行新增、查询、更新、删除等操作。所谓“数据库”是以一定方式储存在一起、能与多个用户共享、具有尽可能小的冗余度、与应用程序彼此独立的数据集合。
数据库管理系统(Database Management System,DBMS)是为管理数据库而设计的电脑软件系统,一般具有存储、截取、安全保障、备份等基础功能。数据库管理系统可以依据它所支持的数据库模型来作分类,例如关系式、XML(Extensible Markup Language,即可扩展标记语言);或依据所支持的计算机类型来作分类,例如服务器群集、移动电话;或依据所用查询语言来作分类,例如结构化查询语言(SQL,Structured Query Language)、XQuery;或依据性能冲量重点来作分类,例如最大规模、最高运行速度;亦或其他的分类方式。不论使用哪种分类方式,一些DBMS能够跨类别,例如,同时支持多种查询语言。
在一些实施例中,服务器可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN、以及大数据和人工智能平台等基础云计算服务的云服务器。终端设备可以是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表、智能电视、车载终端等,但并不局限于此。终端设备以及服务器之间可以通过有线或无线通信方式进行直接或间接地连接,本申请实施例中不做限制。
本申请实施例,还可以通过云技术实现,云技术(Cloud technology)基于云计算商业模式应用的网络技术、信息技术、整合技术、管理平台技术、应用技术等的总称,可以组成资源池,按需所用,灵活便利。云计算技术将变成重要支撑。技术网络系统的后台服务需要大量的计算、存储资源,如视频网站、图片类网站和更多的门户网站。伴随着互联网行业的高度发展和应用,以及搜索服务、社会网络、移动商务和开放协作等需 求的推动,将来每个物品都有可能存在自己的哈希编码识别标志,都需要传输到后台系统进行逻辑处理,不同程度级别的数据将会分开处理,各类行业数据皆需要强大的系统后盾支撑,只能通过云计算来实现。
参见图2,图2是本申请实施例提供的电子设备的结构示意图,该电子设备400可以是图1中的终端设备401,也可以是服务器(识别服务器201、媒体服务器202,或者二者的结合体)。该电子设备400包括:至少一个处理器410、存储器450、至少一个网络接口420。电子设备400中的各个组件通过总线系统440耦合在一起。可理解,总线系统440用于实现这些组件之间的连接通信。总线系统440除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图2中将各种总线都标为总线系统440。
处理器410可以是一种集成电路芯片,具有信号的处理能力,例如通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,其中,通用处理器可以是微处理器或者任何常规的处理器等。
存储器450可以是可移除的,不可移除的或其组合。示例性的硬件设备包括固态存储器,硬盘驱动器,光盘驱动器等。存储器450可选地包括在物理位置上远离处理器410的一个或多个存储设备。
存储器450包括易失性存储器或非易失性存储器,也可包括易失性和非易失性存储器两者。非易失性存储器可以是只读存储器(ROM,Read Only Memory),易失性存储器可以是随机存取存储器(RAM,Random Access Memory)。本申请实施例描述的存储器450旨在包括任意适合类型的存储器。
在一些实施例中,存储器450能够存储数据以支持各种操作,这些数据的示例包括程序、模块和数据结构或者其子集或超集,下面示例性说明。
操作系统451,包括用于处理各种基本系统服务和执行硬件相关任务的系统程序,例如框架层、核心库层、驱动层等,用于实现各种基础业务以及处理基于硬件的任务。
网络通信模块452,用于经由一个或多个(有线或无线)网络接口420到达其他电子设备,示例性的网络接口420包括:蓝牙、无线相容性认证(WiFi)、和通用串行总线(USB,Universal Serial Bus)等。
在一些实施例中,本申请实施例提供的音频数据的处理装置可以采用软件方式实现,图2示出了存储在存储器450中的音频数据的处理装置455,其可以是程序和插件等形式的软件,包括以下软件模块:信源分离模块4551、权重配置模块4552、特征提取模块4553、参数预测模块4554,这些模块是逻辑上的,因此根据所实现的功能可以进行任意的组合或进一步拆分。将在下文中说明各个模块的功能。
在一些实施例中,终端设备或服务器可以通过运行计算机程序来实现本申请实施例提供的音频数据的处理方法。举例来说,计算机程序可以是操作系统中的原生程序或软件模块;可以是本地(Native)应用程序(APP,Application),即需要在操作系统中安装才能运行的程序,如视频APP、音频APP;也可以是小程序,即只需要下载到浏览器环境中就可以运行的程序。总而言之,上述计算机程序可以是任意形式的应用程序、模块或插件。
参见图3A,图3A是本申请实施例提供的音频数据的处理方法的第一流程示意图,该方法可以由电子设备执行,将结合图3A示出的步骤进行说明。
在步骤301中,从音频数据提取得到至少一种信源类型分别对应的音轨数据。
示例的,音频数据文件(或者音频数据包)中分离出不同信源类型分别对应的音轨数据文件(或者,音轨数据包)。
示例的,所述音频数据包含多个数据片段,每个数据片段之间可以是连续的,每个数据片段的播放时长可以是相同或者不同的。例如:将音频数据划分为播放时长相同的多个数据片段,或者,将音频数据划分为播放时长不等的多个数据片段。
示例的,音频数据可以是原生的音频数据(例如:有声小说、广播剧等),也可以是从视频数据中提取的。推荐参数可以包括:精彩程度、悲伤程度、搞笑程度、热血程度等。对应的推荐片段分别是精彩数据片段、悲伤数据片段、搞笑数据片段等。
在一些实施例中,步骤301通过以下方式实现:对音频数据进行特征提取,得到音频数据的全局特征;以全局特征为掩膜,对音频数据进行信源分离,得到音频数据中每种信源类型分别对应的音轨数据。这里,掩膜的边界用于表征不同信源类型对应的音频数据之间的边界。
针对音频数据进行特征提取包括:针对音频数据进行多个层次特征提取,将每个层次得到的特征融合为全局特征。信源分离可以通过金字塔场景解析网络(PSPN,Pyramid Scene Parsing Network)实现,以下分别对特征提取、信源分离进行解释说明。
在一些实施例中,对音频数据进行特征提取处理,得到音频数据的全局特征,通过以下方式实现:对音频数据进行特征提取处理,得到音频数据的原始特征;对原始特征进行多个层次的池化处理,得到音频数据的多个局部特征;对多个局部特征进行合并处理,得到音频数据的全局特征。
示例的,池化处理可以通过金字塔场景解析网络(PSPN,Pyramid Scene Parsing Network)的金字塔池化模块(Pyramid Pooling Module)实现,参考图7,图7是本申请实施例提供的金字塔场景解析网络的示意图;以下具体说明,金字塔场景解析网络包括:卷积神经网络701、池化层703以及金字塔池化模块(图7中的金字塔池化模块包括卷积层1、卷积层2、卷积层3以及卷积层4)、上采样层704、卷积层706。
其中,卷积神经网络701对音频数据进行特征提取,得到音频数据的原始特征702,池化(pool)层703后设置的金字塔池化模块,具体实施中可以根据提取精度设置更多的尺寸。假设金字塔共有N个级别,则在每个级别后使用1×1卷积(CONV),将对应级别的通道数量降为原本的1/N。然后通过双线性插值直接通过上采样层704对低维特征图进行上采样(up sample),得到与原始特征映射相同尺寸的特征图。金字塔场景解析网络的金字塔池化模块的每层输出不同尺寸的局部特征,对不同级别的特征图705合并处理(concat),得到最终的全局特征。
继续参考图7,以全局特征为掩膜对音频数据进行信源分离的实现方式为:将全局特征作为掩膜与金字塔场景解析网络提取到的初始层次的特征通过卷积层706进行卷积,得到音频数据中每种信源类型分别对应的音轨数据对应的特征图。
示例的,假设特征以特征矩阵的形式表征,以全局特征为掩膜,掩膜是与金字塔场景解析网络提取到的初始层次的特征的尺寸相同的特征矩阵,其中全局特征对应的部分的掩膜值为1,其他部分的掩膜值为0,将全局特征作为掩膜与金字塔场景解析网络提取到的初始层次的特征进行卷积,能够区分不同信源类型的音频数据的频谱之间的边界,从而在音频频谱图中表征不同信源的频谱图之间的分界,并将不同信源类型的子音频数据从整体的音频数据中单独分离出,得到每种信源类型分别对应的音轨数据。信源类型包括:背景音、语音。
本申请实施例中,通过金字塔场景解析网络对音频数据进行多个层次的特征提取处理,提升了特征提取的精确度,进而基于提取到的全局特征与初始层次的特征提取结果进行卷积,提升了分离不同信源类型对应的音频数据的准确性,进而能够根据不同音源类型确定音频数据的权重值序列,获取音频数据在音源类型方面的信息,以提升获取推荐参数的精确度。
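为便于理解上述多个层次的池化、1×1卷积降通道、上采样与合并得到全局特征以及基于全局特征输出信源掩膜的流程,下面给出一个示意性的代码草图(并非本申请实施例的实际实现,其中的类名、通道数与池化级别等均为假设):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """金字塔池化模块:多个层次的池化 -> 1x1卷积降通道 -> 上采样 -> 合并,得到全局特征。"""
    def __init__(self, in_channels, levels=(1, 2, 3, 6)):
        super().__init__()
        reduced = in_channels // len(levels)  # 每个级别将通道数量降为原本的1/N(假设可整除)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(level),                      # 不同尺度的池化
                nn.Conv2d(in_channels, reduced, kernel_size=1),   # 1x1卷积
                nn.ReLU(inplace=True),
            )
            for level in levels
        ])

    def forward(self, x):                      # x: [B, C, F, T] 的音频频谱特征图
        size = x.shape[2:]
        feats = [x]
        for branch in self.branches:
            y = branch(x)
            # 双线性插值上采样回原始特征图尺寸
            y = F.interpolate(y, size=size, mode="bilinear", align_corners=False)
            feats.append(y)
        return torch.cat(feats, dim=1)         # 合并(concat)得到全局特征

class SeparationMaskHead(nn.Module):
    """以全局特征为基础输出各信源类型(如语音、背景音)的掩膜,属于示意性简化。"""
    def __init__(self, in_channels, num_sources=2):
        super().__init__()
        self.ppm = PyramidPooling(in_channels)
        self.conv = nn.Conv2d(in_channels * 2, num_sources, kernel_size=3, padding=1)

    def forward(self, feature_map):
        global_feat = self.ppm(feature_map)
        return torch.sigmoid(self.conv(global_feat))  # 每种信源一张掩膜,用于分离对应的频谱
```

实际实现中,掩膜还需与初始层次的特征进行卷积,以将不同信源类型的子音频数据从整体音频数据中分离出来,上述草图仅示意其中的关键步骤。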
在步骤302中,确定每个音轨数据的播放时间轴中与信源类型相关的至少一个时间段落,并确定音频数据中每个数据片段中分别包含的时间段落。
示例的,时间段落是音频数据的时间轴上的一段。与信源类型相关的至少一个时间段落是指信源类型对应的信源发出声音的时间段落,信源类型对应的信源是否发出声音,可以通过对信源类型对应音轨数据的短时能量确定。每个数据片段可以包含至少一个类型信源的时间段落,例如:数据片段中包含时间长度与数据片段的播放时长相同的语音的时间段落、背景音的时间段落。或者,数据片段中包含时间长度为播放时长一半的语音的时间段落。
在一些实施例中,参考图3B,图3B是本申请实施例提供的音频数据的处理方法的第二流程示意图,在图3A中的步骤302通过图3B中的步骤3021、步骤3022实现,以下具体说明。示例的,不限定步骤3021与步骤3022的执行顺序。
在步骤3021中,当音轨数据对应的信源类型为语音时,将音轨数据中短时能量大于能量阈值且过零率小于过零率阈值的时间段落,作为与语音相关的时间段落。
示例的,可以通过语音活动检测(VAD,Voice Activity Detection)算法获取语音相关的时间段落。短时能量,即一帧语音信号的能量,是帧内信号的平方和,过零率,即一帧语音时域信号穿过0(时间轴的0点)的次数。语音活动检测算法的原理是,语音数据片段的短时能量相对较大,而过零率相对较小;反之,非语音数据片段的短时能量相对较小,但是过零率相对较大。因为语音信号能量绝大部分包含在低频带内,而噪音信号通常能量较小且含有较高频段的信息。故而可以通过测量语音信号的这两个参数并且与参数分别对应的阈值进行对比,从而判断语音信号与非语音信号,也即判断音轨数据中发出声音的部分和没有发出声音的部分。当音频数据的短时能量小于短时能量阈值且过零率大于过零率阈值,则该段音频为噪音。反之,音频数据的短时能量大于短时能量阈值且过零率小于过零率阈值时,该段音频是语音。
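上述基于短时能量与过零率的判断逻辑可以用如下示意性代码表示(帧长、帧移与两个阈值均为假设值,并非某一开源实现的实际参数):

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """按帧长/帧移切分一维音频采样序列。"""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def simple_vad(x, energy_thresh=0.01, zcr_thresh=0.15, frame_len=400, hop=160):
    frames = frame_signal(np.asarray(x, dtype=np.float32), frame_len, hop)
    # 短时能量:帧内信号的平方和(此处做了帧长归一化)
    ste = (frames ** 2).sum(axis=1) / frame_len
    # 过零率:一帧信号穿过零值的次数占比
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    # 短时能量大于能量阈值且过零率小于过零率阈值的帧判定为语音
    return (ste > energy_thresh) & (zcr < zcr_thresh)

# 用法示例:speech_flags[i] 为 True 表示第 i 帧属于语音相关的时间段落
# speech_flags = simple_vad(audio_samples)
```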
在步骤3022中,当音轨数据对应的信源类型为背景音时,将音轨数据中满足筛选条件的时间段落作为与背景音相关的时间段落。
其中,筛选条件包括以下任意一项:
条件1、时间段落对应的响度大于响度下限值。示例的,持续时间太短或者声音太小则有可能是杂音,而不是背景音乐。响度下限值可以是音频数据对应的响度的中位值的预设倍数(大于0且小于1)确定,例如:响度最大值与最小值的加和的平均值为响度中位值,响度中位值的0.5倍为响度下限值,将音频数据中响度小于下限值的时间段落确定为不满足筛选条件的段落。
条件2、时间段落的长度大于长度下限值。长度下限值基于音频数据的时间长度确定,例如:长度下限值为音频数据的百分之一。
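上述两项筛选条件可以用如下示意性代码表示(响度中位值与阈值系数沿用上文示例中的假设;草图中按两项条件同时满足来保留时间段落,实际也可调整为满足任意一项):

```python
def filter_background_segments(segments, total_duration):
    """segments: [(start, end, loudness), ...],时间单位为秒;返回满足筛选条件的时间段落。"""
    if not segments:
        return []
    loudness_values = [s[2] for s in segments]
    # 响度中位值按上文示例取最大值与最小值加和的平均值,响度下限值取其0.5倍
    loudness_floor = 0.5 * (max(loudness_values) + min(loudness_values)) / 2
    length_floor = 0.01 * total_duration      # 长度下限值:音频总时长的百分之一
    kept = []
    for start, end, loudness in segments:
        if loudness > loudness_floor and (end - start) > length_floor:
            kept.append((start, end))         # 同时满足响度与长度条件的背景音时间段落
    return kept
```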
图11,图11是本申请实施例提供的音频数据的示意图,其中,音频数据1101的时间长度为0至T6,音频数据1101被划分为6个数据片段(数据片段1至数据片段6),音频数据的背景音音轨1102中存在T3至T6的背景音信源发出声音的时间段落,音频数据的语音音轨1103中存在T1至T5的语音信源发出声音的时间段落。
本申请实施例中,用语音、背景音等类型区分音轨中每个段落,进而能够直接定位出音频数据中的语音数据片段,并相较于其他类型的数据片段对语音数据片段分配更高的权重值,能够加强识别到语音数据片段的语义信息,极大的提升了语音语义信息在精彩片段定位中的占比。
继续参考图3A,在步骤303中,对所述音频数据中的每个数据片段基于所包含的所述时间段落长度分配对应的权重值,并将每个所述权重值组合形成所述音频数据的权重值序列。
示例的,为便于对音频数据进行处理,预先将音频数据按照帧数或者时长划分为多个数据片段,例如:数据片段的长度为预设帧数,或者预设时长。通过确定数据片段包含的时间段落的长度,为数据片段分配对应的权重值。例如:语音信源的时间段落的时间长度为0,不分配语音类型对应的权重值,背景音的时间段落对应的时间长度与数据片段的播放时长相同,对数据片段分配背景音对应的预配置权重值,再例如:背景音的时间段落对应的时间长度为数据片段的播放时长的一半,则将预配置权重值的一半作为数据片段的权重值。
在一些实施例中,当通过信源分离得到语音和背景音两种信源类型的音轨数据时,参考图3C,图3C是本申请实施例提供的音频数据的处理方法的第三流程示意图,图3A中的步骤303通过图3C中的步骤3031至步骤3033实现,针对每个数据片段进行以下步骤3031至步骤3033处理,以下具体说明。
示例的,不限定步骤3031至步骤3033的执行顺序。
在步骤3031中,当数据片段属于语音相关的时间段落时,基于数据片段对应的语音的参数确定数据片段对应的权重值。
这里,权重值与参数正相关,参数包括以下至少之一:语速、语调、响度。
示例的,以影视剧视频为例进行说明,影视剧视频的音频数据中包括语音与背景音,语音部分通常是由演员表演的部分,影视剧中的精彩数据片段(推荐片段)通常处于存在语音的时间段落。语音的语速、语调、响度等参数可以作为确定精彩数据片段的依据,可以基于参数中至少一项确定数据片段对应的权重值。
在步骤3032中,当数据片段属于背景音相关的时间段落时,将预设数值作为数据片段对应的权重值。
这里,预设数值小于任意一个语音相关的数据片段的权重值。
示例的,影视剧视频的音频数据中包括语音与背景音,语音部分通常是由演员表演的部分,仅存在背景音的部分通常是影视剧视频中过场等数据片段,可以对背景音相关时间段落分配小于语音相关数据片段的权重值。再例如:有声小说的音频数据中,精彩数据片段处于语音部分,仅有背景音的时间段落可以分配更少的权重值。
在步骤3033中,当数据片段不属于任意信源类型相关的时间段落时,将零作为数据片段对应的权重值。
示例的,当数据片段不处于任意信源类型的时间段落时,该数据片段可能是静音或者噪声数据片段,可以通过将数据片段的权重值置零提升获取推荐参数的准确性。
在一些实施例中,存在以下情况:数据片段处于任意一种信源类型的时间段落、数据片段未处于任意信源类型的时间段落、数据片段同时处于多种信源类型的时间段落(例如:数据片段在播放时间轴中所处的时间段落,既存在语音音轨数据也存在背景音音轨数据)。当数据片段同时处于多种信源类型的时间段落时,获取数据片段在不同的信源类型下对应的权重值,并每种权重值加权求和,得到数据片段的权重值。
示例的,为便于理解获取数据片段的权重值的过程,以下结合附图进行说明。参考图11,图11是本申请实施例提供的音频数据的示意图,其中,音频数据1101的时间长度为0至T6,音频数据1101被划分为6个数据片段(数据片段1至数据片段6),音频数据的背景音音轨1102中存在T3至T6的背景音的时间段落,音频数据的语音音轨1103中存在T1与T2的中点位置至T5的语音的时间段落,数据片段1对应的时间区间为0至T1,数据片段2对应的时间区间为T1至T2,数据片段3对应的时间区间为T2至T3,数据片段4对应的时间区间为T3至T4,数据片段5对应的时间区间为T4至T5,数据片段6对应的时间区间为T5至T6。
针对数据片段1,数据片段1与任意信源均不相关,则数据片段1的权重值是0;针对数据片段2和3,数据片段2和3属于语音相关的时间段落,通过上文中步骤3031获取权重值,此处不再赘述,数据片段2中包含的语音相关的时间段落的时长是数据片段2时长的一半,因此将根据步骤3031获取到的权重值的一半作为数据片段2的权重值q2,假设数据片段2和3的权重值分别是q2,q3。针对数据片段4和5,数据片段4和5既属于语音相关的时间段落,也属于背景音相关的时间段落,以数据片段4为例,分别通过步骤3031和步骤3032的方式获取数据片段针对不同的信源类型的权重值,将每个信源类型的权重值加权求和,得到数据片段4的权重值q4=(aY+bB),其中,Y是数据片段4的语音的权重值,B是数据片段4的背景音的权重值,a和b分别是系数。针对数据片段6,数据片段6仅和背景音相关,则获取背景音对应的预设值作为数据片段6的权重值q6,根据数据片段的先后时间顺序组合每个数据片段的权重值,得到音频数据1101的权重值序列[0,q2,q3,q4,q5,q6]。其中,基于语音对应的参数确定的权重值q3至q5相较于0和q6更高。
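上述根据数据片段所包含的时间段落长度分配权重值并组合为权重值序列的过程,可以用如下示意性代码表示(其中语音权重的计算函数speech_weight_fn、背景音预设权重以及系数a、b均为假设):

```python
def overlap(seg, interval):
    """返回数据片段 seg=(s,e) 与时间段落 interval=(s,e) 的重叠时长。"""
    return max(0.0, min(seg[1], interval[1]) - max(seg[0], interval[0]))

def weight_sequence(segments, speech_intervals, bgm_intervals,
                    speech_weight_fn, bgm_weight=0.2, a=0.7, b=0.3):
    weights = []
    for seg in segments:
        duration = seg[1] - seg[0]
        speech_ratio = sum(overlap(seg, iv) for iv in speech_intervals) / duration
        bgm_ratio = sum(overlap(seg, iv) for iv in bgm_intervals) / duration
        y = speech_ratio * speech_weight_fn(seg)   # 语音部分:权重与语速、语调、响度等参数正相关
        bg = bgm_ratio * bgm_weight                # 背景音部分:预设数值
        if speech_ratio > 0 and bgm_ratio > 0:
            w = a * y + b * bg                     # 同时包含两类信源:加权求和
        elif speech_ratio > 0:
            w = y
        elif bgm_ratio > 0:
            w = bg
        else:
            w = 0.0                                # 不属于任何信源相关的时间段落:权重值置零
        weights.append(w)
    return weights                                 # 按数据片段的先后时间顺序组成权重值序列
```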
本申请实施例中,根据数据片段对应的类型选择不同的方式确定数据片段对应的权重值,当数据片段为背景音分配预设的权重值、当数据片段属于静音或者噪声数据片段则权重值置零,节约了获取数据片段的权重值的计算资源。当数据片段属于语音相关的时间段落,基于语音相关的参数计算数据片段的权重值,提升了获取语音数据片段的权重值的准确性。通过将非语音相关数据片段的权重值设置为预设值或者零,而语音数据片段的权重值根据语音相关参数确定,使得语音数据片段对应的权重值相较于非语音相关数据片段更高,在视频、音频中,推荐片段通常是存在语音的数据片段,提升语音数据片段的权重值提升了预测每个数据片段的推荐参数的准确性。
在一些实施例中,当通过信源分离仅得到背景音一种信源类型的音轨数据时,参考图3D,图3D是本申请实施例提供的音频数据的处理方法的第四流程示意图,图3A中的步骤303通过图3D中的步骤3034至步骤3035实现,针对每个数据片段进行以下步骤3034和步骤3035的处理,以下具体说明。
示例的,不限定步骤3034与步骤3035的执行顺序。
在步骤3034中,当数据片段包含的时间段落属于背景音相关的时间段落时,基于数据片段对应的背景音的参数确定数据片段对应的权重值。
这里,权重值与参数正相关,参数包括以下至少之一:响度、音调。
示例的,假设音频数据是音乐会的音频数据,则仅包括背景音信源而不一定存在语音,语调、响度等参数可以作为确定精彩数据片段的依据,可以基于参数中至少一项确定数据片段对应的权重值。
在步骤3035中,当数据片段包含的时间段落不属于任意信源类型相关的时间段落时,将零作为数据片段对应的权重值。
示例的,步骤3035与步骤3033的内容相同,此处不再赘述。
本申请实施例中,当多媒体数据中不存在语音时,针对属于背景音的数据片段分配预设的权重值、针对属于静音或者噪声的数据片段则权重值置零,节约了获取数据片段的权重值的计算资源。
继续参考图3A,在步骤304中,从每个数据片段提取音频特征,将每个数据片段的音频特征组合形成音频数据的音频特征序列,并对音频特征序列进行编码,得到音频数据的注意力参数序列。
示例的,从每个数据片段中提取音频特征通过以下方式实现:对音频数据进行特征提取,得到单独的频域特征或者单独的时域特征。
在一些实施例中,在步骤304之前,可以通过以下方式获取音频特征,针对音频数据中每个数据片段进行以下处理:提取数据片段的时域信号特征与频域信号特征;基于 每个音轨数据的播放时间轴中与信源类型相关的至少一个时间段落,确定时域信号特征对应的一维时域权重值,以及确定频域信号特征对应的二维频域权重值;对一维时域权重值与时域信号特征的乘积进行多个层次的卷积,得到时域音频特征;对二维频域权重值与频域信号特征的乘积进行多个层次的卷积,得到频域音频特征;对时域音频特征进行缩放,得到二维时域音频特征;对二维时域音频特征与频域音频特征进行融合处理,得到数据片段的音频特征。
示例的,时域音频特征为一维特征,可以通过对时域音频特征进行缩放,便于将时域特征与频域特征进行融合,例如:通过reshape函数对一维特征进行处理,得到不改变元素数量的二维特征。
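上述缩放处理的一个极简示例如下(特征长度与目标尺寸均为假设值,仅示意元素数量不变的变形):

```python
import numpy as np

time_feat = np.random.randn(1, 4096)         # 假设的一维时域特征(批大小为1)
wave_graph = time_feat.reshape(1, 64, 64)    # reshape 为 64×64 的二维图谱形式,元素个数不变
assert time_feat.size == wave_graph.size
```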
参考图6B,图6B是本申请实施例提供的音频处理模型的第二示意图;步骤304可以通过图6B中的音频语义信息提取模块605实现,音频语义信息提取模块605的结构为双流型,包括时域支路606以及频域支路607,音频数据的时域信息、权重值序列输入时域支路606,时域支路606包括多个一维卷积层(一维卷积层1、……一维卷积层n),音频数据的频域信息、权重值序列输入频域支路607,频域支路607包括多个二维卷积层(二维卷积层1、……二维卷积层n)。特征融合层608用于融合两条支路中各个层次的卷积层输出的频域特征或者时域特征。
以下具体说明,参考图8,图8是本申请实施例提供的音频语义信息提取模块的示意图,图8是图6B中音频语义信息提取模块605的细化结构图;音频语义信息提取模块的输入为视频的原始音频数据(表征为音频采样点序列)。音频数据被划分为多个数据片段(例如:按照以下方式划分:每个数据片段包括至少一帧,或者每个数据片段的长度相等)。将基于音频数据生成基础特征图(logmel)作为频域信息,并输入到频域支路607,音频数据的音频采样点序列(时域信息)被输入时域支路606。权重分配单元604输出的权重值序列通过全连接层801、全连接层802的处理,分别生成与时域信号特征和频域信号特征相同维度的时间权重向量,然后分别与时域信号特征和频域信号特征进行对应位置相乘。
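作为频域信息的基础特征图(logmel)的一种常见计算方式如下(示意性代码,使用开源库librosa,采样率、梅尔滤波器数量与帧移等参数均为假设值):

```python
import numpy as np
import librosa

def logmel_features(samples: np.ndarray, sr: int = 16000, n_mels: int = 64) -> np.ndarray:
    mel = librosa.feature.melspectrogram(y=samples, sr=sr, n_mels=n_mels, hop_length=160)
    return librosa.power_to_db(mel)   # 取对数得到 log-mel 频谱,作为频域信号特征

# 时域信息则直接使用音频采样点序列 samples 本身,二者分别送入频域支路与时域支路。
```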
时域支路606包括大量的一维卷积层(一维卷积层803、一维卷积层804、一维卷积层806、一维卷积层808)以及一维最大池化层(一维最大池化层805、一维最大池化层807、一维最大池化层809),在时域信号特征中使用大量的卷积层能够直接学习到音频数据的时域特性,尤其是像音频响度和采样点幅度的信息。经过大量的一维卷积层后,把生成的一维序列通过变形层810缩放(resize)成为一个二维图谱(wave graph)形式的特征图,这种处理使得时域支路与频域支路输出的特征的尺寸相同,便于进行融合处理。
示例的,在时域支路的一维卷积的过程中,中间结果通过变形层(变形层811、变形层812)缩放为二维图谱(wave graph),通过合并层(例如:合并层813、合并层815)、二维卷积层(例如:二维卷积层814、二维卷积层816)与频域支路607的中间结果进行多个层次的合并,使得最终得到的音频特征能够融合不同尺寸、层次的频域特征与时域特征。
频域支路607输出的频域信息可以为采用梅尔频域的log-mel频谱,频域支路607包括大量的二维卷积层(二维卷积层821、二维卷积层823、二维卷积层825)以及二维最大池化层(二维最大池化层822、二维最大池化层824),在频域信号特征中使用大量的卷积层能够直接学习到音频数据的频域特性。经过大量的二维卷积层后,得到二维特征图,二维特征图的维度与时域支路606输出的特征图的维度相同。
示例的,在频域支路进行二维卷积的过程中,中间结果通过合并层(例如:合并层813、合并层815)、二维卷积层(例如:二维卷积层814、二维卷积层816)与时域支路606的中间结果进行多个层次的合并,使得最终得到的音频特征能够融合不同尺寸、层次的频域特征与时域特征。
示例的,变形层可以通过reshape函数(将指定的矩阵变换成特定维数矩阵一种函数,且矩阵中元素个数不变,函数可以重新调整矩阵的行数、列数、维数。)对特征图进行变形。
在一些实施例中,对二维时域音频特征与频域音频特征进行融合处理,得到数据片段的音频特征,通过以下方式实现:对二维时域音频特征与频域音频特征进行叠加处理,对叠加处理得到的叠加特征进行二维卷积,得到二维卷积结果,获取二维卷积结果的最大叠加特征(Max)与平均叠加特征(Mean);对最大叠加特征与平均叠加特征之间的加和进行线性激活,得到数据片段的音频特征。
作为叠加处理的示例,二维时域音频特征与频域音频特征可以分别表征为特征矩阵,对二维时域音频特征的特征矩阵、频域音频特征的特征矩阵进行线性相加,得到叠加特征,采用特征矩阵的形式表征叠加特征。
示例的,本申请实施例中音频特征以向量形式表征。线性激活,也即通过Relu函数对最大叠加特征与平均叠加特征之间的加和进行激活处理,得到数据片段的音频特征。继续参考图8,特征融合模块608中的合并层817合并两个支路输出的特征图,合并处理使得时域和频域保持信息上的互补,同时还能够让高层网络感知到底层网络信息。合并层817输出每个数据片段的二维频域特征图,将二维频域特征图输入到二维卷积神经网络层818中。得到二维卷积结果,基于二维卷积结果获取二维卷积神经网络层最后一维度特征的平均值(mean)和最大值(max),将求得的平均值与最大值通过合并层819进行相加,加和通过激活层820利用线性激活函数(relu),生成最终的音频语义特征向量(音频特征)。将每个数据片段的音频语义特征向量根据数据片段对应的时间顺序组合,得到音频特征序列。
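上述合并、二维卷积、取最大值与平均值并线性激活的融合流程,可以用如下示意性代码表示(通道数与特征图尺寸为假设;此处以合并(concat)方式表示两路特征的叠加,属于实现上的示意性选择):

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, in_channels=64, out_channels=128):
        super().__init__()
        self.conv = nn.Conv2d(in_channels * 2, out_channels, kernel_size=3, padding=1)

    def forward(self, time_feat, freq_feat):
        # time_feat / freq_feat: [B, C, F, T],两路特征图尺寸相同
        x = torch.cat([time_feat, freq_feat], dim=1)     # 合并两条支路输出的特征图
        x = self.conv(x)                                 # 二维卷积
        max_feat = x.max(dim=-1).values                  # 最后一维度的最大值(Max)
        mean_feat = x.mean(dim=-1)                       # 最后一维度的平均值(Mean)
        fused = torch.relu(max_feat + mean_feat)         # 相加后线性激活(ReLU)
        return fused.flatten(1)                          # 得到该数据片段的音频语义特征向量

# 用法示例:
# fusion = FeatureFusion()
# vec = fusion(torch.randn(1, 64, 16, 32), torch.randn(1, 64, 16, 32))
```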
本申请实施中,将时域特征转换为与频域特征相同的维度的特征,降低了对音频数据的时域特征、频域特征进行融合的复杂度,节约了计算资源,提升了特征融合的准确性,通过频域特征与时域特征相互融合,能够从不同方面获取音频中所包含的信息,使得音频特征表征的信息量更全面,提升了获取音频特征的精确度。
在一些实施例中,也可以仅采集音频数据的频域特征或者时域特征作为音频的音频特征,通过仅采集一种域的特征的方式,可以提升计算速度,节约计算资源。
在一些实施例中,参考图3E,图3E是本申请实施例提供的音频数据的处理方法的第五流程示意图,图3A中的步骤304中的对音频特征序列进行编码,得到音频数据的注意力参数序列,通过图3E中的步骤3041至步骤3043实现。
在步骤3041中,针对音频特征序列中每个音频特征执行以下处理:基于注意力机制,将音频特征与其他数据片段的每个音频特征分别融合得到音频特征对应的每个加权相关度。
这里,其他数据片段是音频数据中除当前获取加权相关度的数据片段之外的数据片段。
示例的,以下以音频特征A进行举例,融合处理是将音频特征A的嵌入向量与任意一个其他数据片段的音频特征的嵌入向量进行内积,并将内积结果与音频特征A相乘,得到音频特征A的一个加权相关度,针对每个其他数据片段的音频特征获取加权相关度,则得到音频特征A对应的每个加权相关度。
在一些实施例中,在步骤3041之前,通过以下方式确定每个音频特征的嵌入向量:通过全连接层对音频数据的每个数据片段的音频特征进行全连接,得到每个音频特征的嵌入向量。
示例的,参考图9,图9是本申请实施例提供的注意力模块中编码的原理示意图。假设:音频特征序列包括a1至an等多个音频特征,将每两个数据片段对应的音频特征通过全连接层进行处理,得到音频特征对应的一维嵌入(embedding)向量(两个向量的阶数相同)。
在一些实施例中,步骤3041通过以下方式实现:针对音频特征与其他数据片段的每个音频特征执行以下处理:对音频特征的嵌入向量与其他数据片段的嵌入向量相乘,得到音频特征与其他数据片段的音频特征之间的相关度;将音频特征与相关度相乘,得到音频特征对应的加权相关度。
示例的,继续参考图9,音频特征表征为一维嵌入向量形式,对两个一维嵌入向量进行内积计算,得到相关度m,例如:音频特征序列中的第1个音频特征表征为a1与第i个音频特征表征为ai,音频特征a1与音频特征ai相乘得到的相关度为m1i。将相关度与音频特征a1进行相乘,得到加权相关度c1i
在步骤3042中,将每个加权相关度相加得到音频特征对应的注意力参数。
示例的,继续参考图9,音频特征a1与音频特征ai之间的加权相关度c1i满足m1i×a1=c1i。将同一个音频特征对应的每个加权相关度相加,可以得到该音频特征对应的注意力参数W。例如:音频特征a1的注意力参数W1=c12+c13+…+c1n。
在步骤3043中,基于每个音频特征对应的数据片段的顺序,将每个注意力参数组合形成音频数据的注意力参数序列。
示例的,数据片段的顺序是指数据片段在音频数据中的时间先后顺序,注意力参数序列中每个注意力参数与每个数据片段一一对应,每个注意力参数是根据注意力参数对应的数据片段的时间先后顺序组合为注意力参数序列的,权重值序列中每个权重值也与每个数据片段一一对应,每个权重值是根据权重值对应的数据片段的时间先后顺序组合为权重值序列的。
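上述编码过程可以用如下示意性代码表示(嵌入向量维度等均为假设,仅示意嵌入向量内积得到相关度、加权相关度求和得到注意力参数的计算方式):

```python
import torch
import torch.nn as nn

def attention_parameters(features: torch.Tensor, embed_dim: int = 64) -> torch.Tensor:
    """features: [n, d],n 个数据片段的音频特征;返回 [n, d] 的注意力参数序列。"""
    n, d = features.shape
    fc = nn.Linear(d, embed_dim)       # 全连接层得到嵌入向量(实际为模型的可训练参数,此处仅示意)
    emb = fc(features)                 # [n, embed_dim]
    m = emb @ emb.t()                  # 相关度 m_ij:第 i 个与第 j 个嵌入向量的内积
    attn = []
    for i in range(n):
        mask = torch.ones(n, dtype=torch.bool)
        mask[i] = False                # 只与其他数据片段的音频特征计算加权相关度
        # 加权相关度 c_ij = m_ij × a_i,对所有 j 求和得到注意力参数 W_i
        w_i = (m[i, mask].unsqueeze(-1) * features[i]).sum(dim=0)
        attn.append(w_i)
    return torch.stack(attn)           # 按数据片段的时间先后顺序组成注意力参数序列

# 用法示例:
# attn_seq = attention_parameters(torch.randn(10, 128))
```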
本申请实施例,通过基于注意力机制对音频特征进行融合获取注意力参数,基于注意力参数能够更准确地确定推荐参数,进而提升了确定推荐片段的准确性,解决了缺乏播放记录数据的音频数据或者视频数据难以确定推荐片段的问题。
继续参考图3A,在步骤305中,将注意力参数序列与权重值序列融合得到每个数据片段的融合参数,并基于每个融合参数确定每个数据片段的推荐参数。
示例的,融合处理是将注意力参数序列与权重值序列进行相乘,注意力参数序列与权重值序列中所包含的元素的数量是相同的。
在一些实施例中,步骤305通过以下方式实现:针对每个数据片段执行以下处理:从注意力参数序列获取数据片段对应的注意力参数,将数据片段的权重值与数据片段的注意力参数相乘,得到数据片段的融合参数;对融合参数进行归一化处理,得到数据片段的推荐参数。
示例的,假设音频数据的权重值序列是[Q1,Q2……Qn],注意力参数序列是[Z1,Z2……Zn],则音频数据中的第一个数据片段的融合参数是Q1*Z1,也即,第一个数据片段的权重值与注意力参数的乘积。
示例的,归一化处理是通过softmax函数进行置信度预测。以推荐参数是精彩程度进行举例,针对影视剧视频,视频中存在语音的部分为精彩数据片段的概率更高,基于语音信源的音轨数据分配对应的权重值,语音信源的权重值高于背景音部分的权重值,使得语音数据片段对应的精彩程度置信度相较于背景音数据片段对应的精彩程度置信度更高。
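上述融合与归一化过程可以用如下示意性代码表示(其中以线性层实现的二分类以及softmax归一化均为示意性假设):

```python
import torch
import torch.nn as nn

def recommend_parameters(attn_seq: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """attn_seq: [n, d] 注意力参数序列;weights: [n] 权重值序列;返回 [n] 的推荐参数。"""
    fused = attn_seq * weights.unsqueeze(-1)        # 融合参数:注意力参数与权重值相乘
    classifier = nn.Linear(attn_seq.shape[-1], 2)   # 二分类层(实际为模型的可训练参数,此处仅示意)
    logits = classifier(fused)                      # [n, 2]
    probs = torch.softmax(logits, dim=-1)           # softmax 归一化
    return probs[:, 1]                              # 1 类别的后验概率即推荐参数(置信度)

# 用法示例:
# scores = recommend_parameters(torch.randn(10, 128), torch.rand(10))
```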
本申请实施例中,通过将注意力参数与权重值进行融合,基于融合参数确定推荐参数,将频域时域的信息、信源类型的信息结合,使得推荐参数能够更全面地量化表征音 频数据的信息,提升了确定推荐参数的精确度。
在步骤306中,基于每个数据片段的推荐参数,确定音频数据中的推荐片段。
在一些实施例中,通过以下任意一种方式确定音频数据的推荐片段:
1、基于每个数据片段的推荐参数对每个数据片段进行降序排序,将降序排序的头部的至少一个数据片段作为音频数据的推荐片段。例如:对每个数据片段的精彩程度进行降序排序,将头部的预设数量的数据片段作为精彩数据片段,预设数量与音频数据的数据片段总数正相关,例如:预设数量为数据片段总数的百分之一。
2、将推荐参数大于推荐参数阈值的数据片段作为推荐片段。示例的,推荐参数阈值可以是每个数据片段的推荐参数的中位值,或者中位值的预设倍数(例如:1.5倍,1<预设倍数<2),假设,最大的推荐参数为0.9,最小的推荐参数为0,取中位值0.45为推荐参数阈值,将精彩程度大于0.45的数据片段作为精彩数据片段。再假设,最大的推荐参数为0.9,最小的推荐参数为0,取中位值的1.1倍为推荐参数阈值,则推荐参数阈值为0.495。
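上述两种确定推荐片段的方式可以用如下示意性代码表示(头部比例、阈值系数等均为假设值):

```python
def select_by_topk(scores, ratio=0.01):
    """方式1:降序排序后取头部的至少一个数据片段作为推荐片段,比例为假设值。"""
    k = max(1, int(len(scores) * ratio))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def select_by_threshold(scores, factor=1.0):
    """方式2:以推荐参数中位值的 factor 倍作为推荐参数阈值,大于阈值的数据片段作为推荐片段。"""
    threshold = factor * (max(scores) + min(scores)) / 2
    return [i for i, s in enumerate(scores) if s > threshold]

# 用法示例:
# highlight_indices = select_by_threshold([0.1, 0.8, 0.3, 0.9], factor=1.0)
```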
本申请实施例中,通过推荐参数量化音频数据中每个数据片段与信源之间的相关程度,通过推荐参数表征音频数据属于某个特定类型的推荐片段的概率,选取推荐参数最高多个数据片段作为推荐片段,选取得到的推荐片段可以表征音频数据中的特定类型的位置,相较于单纯从频域、时域的层面来预测,结合了不同信源识别更加全面,从而基于每个数据片段的推荐参数可以准确识别出有价值的推荐片段,为用户提供准确的参考信息。
在一些实施例中,在步骤305之后,还可以基于每个数据片段的推荐参数,生成音频数据的推荐参数曲线;响应于播放触发操作,在播放界面显示音频数据的推荐参数曲线。
这里,推荐参数曲线的横坐标为音频数据的播放时间,推荐参数曲线的纵坐标为推荐参数。
示例的,推荐参数曲线的横坐标与音频数据的播放时间一一对应,推荐参数曲线的纵坐标越高,则推荐参数越大。参考图10A,图10A是本申请实施例提供的播放界面的第一示意图。播放界面101A为视频播放器的播放界面,推荐参数为精彩程度,精彩程度曲线106A显示在不遮挡视频画面的区域,精彩数据片段107A被标注出。进度条105A中的滑块103A所在位置是视频当前播放的时刻对应的位置。进度条105A可以表征播放时间。精彩程度曲线106A的高低可以表征精彩程度的大小。
示例的,播放触发操作可以是针对音频或者视频的。播放界面可以是音频播放界面或者视频播放界面,则音频播放界面,播放音频数据(对应音频播放场景,音频数据),视频播放界面,对应视频播放场景,音频数据是从视频数据提取的。
在一些实施例中,在步骤306之后,还可以在播放界面显示推荐片段的标签,其中,标签用于表征推荐片段的时间段落;响应于针对任意一个标签的选择操作,跳转到选中的标签对应的推荐片段的起点开始播放。
示例的,选择操作可以是点击操作,或者将进度条滑块拖动到标签的操作,参考图10B,图10B是本申请实施例提供的播放界面的第二示意图。滑块103A被拖动到标签104A的位置,视频画面切换为精彩数据片段107A的起点位置的画面。
在一些实施例中,本申请实施例提供的音频数据的处理方法通过音频处理模型实现。信源分离通过调用音频处理模型的金字塔场景解析网络实现,从每个数据片段提取音频特征通过调用音频处理模型的音频语义信息提取模块实现,编码与融合处理通过调用音频处理模型的注意力模块实现。
参考图6A,图6A是本申请实施例提供的音频处理模型的第一示意图。音频处理模型包括金字塔场景解析网络601、权重配置模块610、音频语义信息提取模块605以及注意力模块609。金字塔场景解析网络601用于执行步骤301,权重配置模块610用于执行步骤303,音频语义信息提取模块605用于执行步骤304,注意力模块609用于执行步骤305和步骤306。
音频数据输入金字塔场景解析网络601,金字塔场景解析网络601对音频数据进行信源分离,得到至少一种信源类型对应的音轨数据,权重配置模块610用于实现上文中的步骤303,权重配置模块610确定音轨数据中与信源关联的时间段落,并对时间段落分配对应的权重值,将权重值输出到音频语义信息提取模块605、注意力模块609。音频数据输入到音频语义信息提取模块605(音频语义信息提取模块的具体结构参考上文中图6B以及图8),音频语义信息提取模块605对音频数据进行时域、频域两方面的特征提取处理,并将融合时域、频域信息的音频特征序列输出到注意力模块609,注意力模块609是用于运行注意力机制的算法模块,注意力模块609通过注意力机制基于权重值序列与音频特征序列进行参数预测,得到推荐参数,制作推荐参数曲线。
其中,通过以下方式训练音频处理模型:基于音频数据的每个实际推荐片段的标签值(标签值也即实际推荐片段的推荐参数,正样本的标签值为1),组合形成音频数据的实际推荐参数序列;基于音频数据的每个数据片段的推荐参数,组合形成音频数据的预测推荐参数序列;基于实际推荐参数序列与预测推荐参数序列获取音频处理模型的交叉熵损失;将交叉熵损失除以音频数据的数据片段数量,得到平均预测损失,基于平均预测损失对音频处理模型进行反向传播,得到更新后的音频处理模型。
示例的,训练数据存在人工标注的标签值,标签值能够表征实际上哪些数据片段为推荐片段(精彩数据片段)的概率,其中,推荐片段被标注为1(正样本),非推荐片段被标注为0(负样本),在进行损失函数计算时,一个视频对应的所有的标签值可以组成一个实际推荐参数序列(由0、1组成的序列)。例如:视频分划分为N个数据片段,N是正整数,推荐片段是精彩片段,人工标注出视频中的精彩片段,根据每个数据片段在视频中的时间从前至后的顺序组合标签值为实际推荐参数序列,实际推荐参数序列表征为[1、0、1……0]。
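上述训练方式可以用如下示意性代码表示(优化器与损失函数的具体形式为示意性假设,此处以二元交叉熵逐片段求和后除以数据片段数量表示平均预测损失):

```python
import torch
import torch.nn.functional as F

def training_step(pred_scores: torch.Tensor, labels: torch.Tensor, optimizer):
    """pred_scores: [n] 预测推荐参数序列(取值在0到1之间,且由模型计算图产生);
    labels: [n] 实际推荐参数序列(0/1标签值)。"""
    # 交叉熵损失逐片段求和后除以数据片段数量,得到平均预测损失
    loss = F.binary_cross_entropy(pred_scores, labels.float(), reduction="sum") / labels.numel()
    optimizer.zero_grad()
    loss.backward()        # 基于平均预测损失进行反向传播,更新音频处理模型
    optimizer.step()
    return loss.item()
```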
在一些实施例中,当音频数据为视频中截取的音频数据时,可以在音频特征的基础上结合图像信息确定精彩数据片段。可以通过以下方式实现:对视频的图像数据进行图像特征提取,将图像特征与对应的音频特征进行融合,得到融合的视频特征,基于视频特征执行注意力机制,得到注意力参数序列,基于注意力参数序列与权重值序列确定推荐参数序列。
在一些实施例中,当音频数据为视频中截取的音频数据时,可以基于视频的图像特征识别到的推荐片段优化基于音频数据识别到的推荐片段,通过以下方式实现:对视频的图像数据进行图像识别,基于识别得到的包括人物的视频图像,确定视频中包括人物的数据片段时间。将推荐参数大于推荐参数阈值,且对应的视频数据片段中包括人物的视频数据片段作为推荐片段。
示例的,还可以通过以下方式确定视频数据的精彩数据片段:对视频的图像数据(视频画面)进行特征提取处理,得到视频的图像语义特征序列;对视频的图像数据进行图像识别,得到视频中包括人物的数据片段时间,并基于人物数据片段时间对视频分配对应的权重值序列。基于图像语义特征序列获取注意力参数,得到注意力参数序列,基于图像数据的注意力参数序列与权重序列得到视频画面对应的推荐参数。对视频画面的推荐参数与音频数据的推荐参数加权求和,得到加权推荐参数,将加权推荐参数大于加权推荐参数阈值的视频数据片段作为推荐片段。
本申请实施例从音频侧的角度来对整个视频进行多个域内以及多层信息的分析,能 够快速的定位出整个音频中的推荐片段(例如:精彩数据片段、热血数据片段、悲伤数据片段或者搞笑数据片段等),从而基于音频的推荐片段能够判断出视频中的推荐片段的时间段落在时间轴中的位置。从而在不依赖音频数据的播放记录数据的情况下,就可以准确识别推荐片段,为用户提供准确的参考信息,提升了用户体验。能够为播放器提供视频推荐参数曲线,以供观众将播放进度条由当前播放位置跳转到推荐片段的位置,提升观众对播放器的使用体验。
下面,将说明本申请实施例在一个实际的应用场景中的示例性应用,本申请实施例提供的音频数据的处理方法可以应用在如下应用场景中:
1、在不同平台端(pc\tv\android\ios)播放长视频的过程中,在播放器中能够显示视频的时间轴进度条关联的热度信息。热度信息通常是基于视频的播放记录数据(播放量、点击量、弹幕或者评论数量等)计算得到的,但是针对于新上映的电影或者影视剧的视频,这些视频没有播放记录数据。或者,针对于小众视频没有足够播放记录数据确定热度。本申请实施例提供的音频数据的处理方法可以生成推荐参数曲线来替代热度信息,推荐参数可以是精彩程度,向用户展示视频中的精彩数据片段与精彩程度曲线,用户根据精彩程度曲线或者精彩数据片段标签可直接跳转到精彩数据片段进行观看或收听,提升用户的观看体验。
2、针对于某些短视频平台中的影视剧二创短视频制作,用户往往是先自己观看影视剧之后再从整个剧集中定位精彩数据片段,得到精彩数据片段锦集。基于定位得到的精彩数据片段锦集,进行二创短视频集锦类制作。本申请实施例提供的音频数据的处理方法可以为二创用户提供精彩程度曲线,用户可以根据曲线一目了然地确定视频中的精彩数据片段,一键定位、截取整个视频中的精彩数据片段的画面,然后二创用户可以直接根据截取的结果来进行接下来的短视频生成工作,大幅度提升效率,避免了人工分辨精彩数据片段而浪费时间。
下面,以推荐参数为精彩程度,以音频数据为影视剧的视频的音频数据为例进行说明。参考图5,图5是本申请实施例提供的音频数据的处理方法的一个可选的流程示意图,下面以电子设备为执行主体,将结合图5的步骤进行说明。
步骤501中,获取待处理的视频文件。
示例的,待处理的视频文件可以是影视剧或者电影的视频文件。视频文件由视频画面帧与音频数据组成,音频数据中可以提取到至少一种信源类型对应的音轨数据。参考图4A,图4A是本申请实施例提供的视频中提取的音频数据的示意图;图4A中由上至下,分别是视频画面帧的示意图(表征视频的预览画面)、音频数据的音频特征图、音轨数据的音频采样序列图以及推荐参数曲线的示意图。推荐参数曲线的横坐标表示时间,纵坐标表示推荐参数。
步骤502中,基于视频文件的音频数据调用音频处理模型进行精彩置信度预测处理,得到音频数据的精彩置信度曲线以及精彩数据片段。
参考图6A,图6A是本申请实施例提供的音频处理模型的第一示意图。音频处理模型包括金字塔场景解析网络601、权重配置模块610、音频语义信息提取模块605以及注意力模块609。音频数据输入金字塔场景解析网络601,金字塔场景解析网络601对音频数据进行信源分离,得到至少一种信源类型对应的音轨数据,权重配置模块610确定音轨数据中与信源关联的时间段落,并对时间段落分配对应的权重值,将权重值输出到音频语义信息提取模块605、注意力模块609。音频数据输入到音频语义信息提取模块605,音频语义信息提取模块605对音频数据进行时域、频域两方面的特征提取处理,并将融合时域、频域信息的音频特征序列输出到注意力模块609,注意力模块基于权重值序列与音频特征序列进行参数预测处理,得到推荐参数,制作推荐参数曲线。
以下对音频处理模型中各模块进行解释说明,参考图6B,图6B是本申请实施例提供的音频处理模型的第二示意图。金字塔场景解析网络601与权重配置模块610中的语音定位单元603配合,对整条音轨中的语音段落进行毫秒级别的定位。语音定位单元603采用语音活动检测算法,金字塔场景解析网络601为金字塔场景解析网络(PSPN,Pyramid Scene Parsing Network),通过金字塔形式的卷积层网络,由大到小的感受域能够更好地对分离细节进行识别定位。使用金字塔场景解析网络能够更精准地将音频频谱图中不同的特征进行分离,尤其是在金字塔卷积层中的小卷积层,能够学习到在音频频谱图中不同信源的频谱图之间分界的边缘性,以不同信源的特征的边缘为掩膜对频谱图进行分离处理,使得分离得到的不同信源的音轨数据更准确。视频的原始音轨被输入金字塔场景解析网络601,输出为分离的背景音音轨和语音音轨等音轨数据(图6B中的音轨数据602)。然后采用语音活动检测开源算法对语音音轨中的语音段落进行定位,从而得到整个音轨中的语音的时间段落。
示例的,金字塔场景解析网络601基于金字塔场景解析网络搭建的信源分离模型对整个视频的音轨进行分离,将音轨中的语音信息和背景音信息分别进行分裂,单独存储成为音轨数据(音轨文件)。语音定位单元603基于语音活动检测算法对语音的音轨数据中的语音数据片段进行定位,得到存在语音的时间段落,权重分配单元604对每个语音的时间段落的权重进行设置。语音的时间段落被分配的权重值相较于纯背景音的时间段落更高。
本申请实施例中,在金字塔场景解析网络中,金字塔池化模块(Pyramid Pooling Module)生成的不同层次的特征图最终被合并层(concat)合并,并将合并得到的特征图拼接起来,再输入到全连接层以进行分类。金字塔场景解析网络通过金字塔池化模块的多个层次的卷积层输出包含不同尺度、不同子区域间的局部信息,并在金字塔场景解析网络的最终的卷积层特征图上构造全局先验信息。该全局先验信息旨在消除卷积神经网络对图像分类输入大小固定的限制。
参考图7,图7是本申请实施例提供的金字塔场景解析网络的示意图;以下具体说明,图7的是金字塔场景解析网络图6A以及图6B中金字塔场景解析网络601的细化结构示意图,卷积神经网络701对音频数据进行特征提取,得到音频数据的原始特征702,池化(pool)层703后设置的金字塔模块(包括卷积层1、卷积层2、卷积层3以及卷积层4,具体实施中可以根据提取精度设置更多的尺寸)可以融合四种不同金字塔尺度的特征:卷积层1突出显示的是最粗糙级别的单个全局池化输出,金字塔模块的多个层次不同尺寸的卷积层将原始特征映射划分为不同的子区域,并形成针对不同位置的局部特征。金字塔模块中不同层次的卷积层输出不同尺寸的局部特征。为了维护全局特性的权重,假设金字塔共有N个级别,则在每个级别后使用1×1卷积(CONV),将对应级别的通道数量降为原本的1/N。然后通过双线性插值直接通过上采样层704对低维特征图进行上采样(up sample),得到与原始特征映射相同尺寸的特征图。最后,将金字塔模块输出的不同级别的特征图705合并处理(concat),通过卷积层706对合并处理的结果进行卷积,得到最终的金字塔全局特征。继续参考图7,可以看出金字塔场景解析模型的架构呈一个金字塔形状。该模型输入图像后,使用预训练的带空洞卷积层提取特征图,空洞卷积(Atrous Convolutions)又称为扩张卷积(Dilated Convolutions),在卷积层中引入了扩张率(dilation rate),扩张率定义了卷积核处理数据时各数据值的间距。由于引入池化层会导致全局信息的损失,空洞卷积层的作用是在不使用池化层的情况下提供更大的感受野。最终的特征映射大小是输入图像的1/8,然后将该特征输入到金字塔池化模块中,模型使用金字塔场景解析网络中金字塔池化模块来收集上下文信息。金字塔池化模块为4层金字塔结构,池化内核覆盖了图像的全部、一半和小部分。它们被融合 为全局先验信息(全局特征),在最后部分将之前的全局特征映射与原始特征映射合并起来再进行卷积(以全局特征为掩膜,分离原始特征中的语音与背景音),生成语音、背景音的最终分割特征图。
参考图4B,图4B是本申请实施例提供的音轨数据示意图;图4B中上图为音轨波形图(采样序列图),下图为语音对应的音轨特征图,音轨特征图中空白部分为舍弃的噪音部分。示例的,通过金字塔场景解析网络搭建的信源分离模型可以分离出原始音轨中的语音、背景音分别对应的音轨数据。基于音轨数据可以使用语音活动检测算法(例如:WebRTC语音活动检测算法)对具体的音频冲激信号段落进行定位。语音活动检测算法,是基于短时能量(STE,Short Time Energy)和过零率(ZCC,Zero Cross Counter)确定音频是否为语音的算法。短时能量,即一帧语音信号的能量,是帧内信号的平方和,过零率,即一帧语音时域信号穿过0(时间轴)的次数。语音活动检测算法的原理是,语音数据片段的短时能量相对较大,而过零率相对较小;反之,非语音数据片段的短时能量相对较小,但是过零率相对较大。因为语音信号能量绝大部分包含在低频带内,而噪音信号通常能量较小且含有较高频段的信息。故而可以通过测量语音信号的这两个参数并且与参数分别对应的阈值进行对比,从而判断语音信号与非语音信号。当音频数据的短时能量小于短时能量阈值且过零率大于过零率阈值,则该段音频为噪音。反之,音频数据的短时能量大于短时能量阈值且过零率小于过零率阈值时,该段音频是语音。参考图4C,图4C是本申请实施例提供的时间段落示意图;框401C圈出的时间段落为语音的时间段落,同理地,图4C中圈出的每个框对应的波形均为语音的时间段落。
继续参考图6B,音频语义信息提取模块605的结构为双流型,包括时域支路606以及频域支路607,音频数据的时域信息、权重值序列输入时域支路606,时域支路606包括多个一维卷积层(一维卷积层1、……一维卷积层n),音频数据的频域信息、权重值序列输入频域支路607,频域支路607包括多个二维卷积层(二维卷积层1、……二维卷积层n)。特征融合层608用于融合两条支路中各个层次的卷积层输出的频域特征或者时域特征。
以下具体说明,参考图8,图8是本申请实施例提供的音频语义信息提取模块的示意图;音频语义信息提取模块的输入为视频的原始音频数据(表征为音频采样点序列)。音频数据被划分为多个数据片段(例如:通过以下方式划分:每个数据片段包括至少一帧,或者每个数据片段的长度相等)。将基于音频数据生成基础特征图(logmel)作为频域信息,并输入到频域支路607,音频数据的音频采样点序列(时域信息)被输入时域支路606。权重分配单元604输出的权重值序列通过全连接层801、全连接层802的处理,分别生成与时域信号特征和频域信号特征相同维度的时间权重向量,然后分别与时域信号特征和频域信号特征进行对应位置相乘。
时域支路606包括大量的一维卷积层(一维卷积层803、一维卷积层804、一维卷积层806、一维卷积层808)以及一维最大池化层(一维最大池化层805、一维最大池化层807、一维最大池化层809),在时域信号特征中使用大量的卷积层能够直接学习到音频数据的时域特性,包括音频响度和采样点幅度的信息。经过大量的一维卷积层后,把生成的一维序列通过变形层810缩放(resize)成为一个二维图谱(wave graph)形式的特征图,这种处理使得时域支路与频域支路输出的特征的尺寸相同,便于进行融合处理。
示例的,在时域支路进行一维卷积的过程中,中间结果通过变形层(变形层811、变形层812)缩放为二维图谱(wave graph),通过合并层(例如:合并层813、合并层815)、二维卷积层(例如:二维卷积层814、二维卷积层816)与频域支路607的中间结果进行多个层次的合并,使得最终得到的音频特征能够融合不同尺寸、层次的频域特征与时域特征。
频域支路607输出的频域信息可以为采用梅尔频域的logmel频谱,频域支路607包括大量的二维卷积层(二维卷积层821、二维卷积层823、二维卷积层825)以及二维最大池化层(二维最大池化层822、二维最大池化层824),在频域信号特征中使用大量的卷积层能够直接学习到音频数据的频域特性。经过大量的二维卷积层后,得到二维特征图,二维特征图的维度与时域支路606输出的特征图的维度相同。
示例的,在频域支路进行二维卷积的过程中,中间结果通过合并层(例如:合并层813、合并层815)、二维卷积层(例如:二维卷积层814、二维卷积层816)与时域支路606的中间结果进行多个层次的合并,使得最终得到的音频特征能够融合不同尺寸、层次的频域特征与时域特征。
示例的,变形层可以通过reshape函数(将指定的矩阵变换成特定维数矩阵一种函数,且矩阵中元素个数不变,函数可以重新调整矩阵的行数、列数、维数。)对特征图进行变形。
特征融合模块608中的合并层817合并两个支路输出的特征图,合并处理使得时域和频域保持信息上的互补,同时还能够让高层网络感知到底层网络信息。合并层817输出每个数据片段的二维频域特征图,将二维频域特征图输入到二维卷积神经网络层818中。得到二维卷积结果,确定二维卷积结果的平均值(Mean)和最大值(Max),将求得的平均值与最大值通过合并层819进行相加,加和通过激活层820利用线性激活函数(Relu),生成最终的音频语义特征向量(音频特征)。将每个数据片段的音频语义特征向量组合,得到音频特征序列。
继续参考图6B,注意力模块609接收权重值序列与音频特征序列,注意力模块基于音频特征序列编码得到注意力参数序列,基于注意力参数序列与权重值序列预测每个数据片段的推荐参数。参考图9,图9是本申请实施例提供的注意力模块中编码的原理示意图。
示例的,假设音频特征序列包括a1至an等多个音频特征,将每两个数据片段对应的音频特征通过全连接层进行处理,得到音频特征对应的一维嵌入(embedding)向量(两个向量的阶数相同),对两个一维嵌入向量进行内积的计算,得到相关度m,例如:音频特征a1与音频特征ai之间的相关度为m1i。将相关度与音频特征对应的向量进行相乘,得到加权相关度信息量c(上文的加权相关度)。再例如:音频特征a1与音频特征ai之间的加权相关度信息量c1i满足m1i×a1=c1i。将音频特征对应的每个加权相关度信息量相加,可以得到音频特征对应的注意力参数W。例如:音频特征a1的注意力参数W1=c12+c13+…+c1n。
通过上述方式获取所有数据片段的音频特征对应的注意力参数,针对每个数据片段,将该数据片段对应的注意力参数W与该数据片段对应的权重值L进行相乘,得到最终的输出的特征序列Q(特征序列Q的粒度可以为帧级别),通过二分类层对每个粒度的特征节点进行归一化处理:二分类的标签为1-0,1类别的后验概率为该特征节点的置信度(精彩程度),也即,代表该特征节点的特征为精彩的概率;针对整个推荐参数序列执行归一化处理(例如通过softmax函数),即可得到精彩程度曲线。可以设置对应的精彩程度阈值,将精彩程度大于精彩程度阈值的数据片段作为精彩数据片段,小于精彩程度阈值的数据片段作为非精彩数据片段。
在一些实施例中,在训练过程中训练数据存在人工标注的标签(label),标签能够表征实际上哪些数据片段为推荐片段(精彩数据片段),其中,推荐片段被标注为1(正样本),非推荐片段被标注为0(负样本),在进行损失函数计算时,一个视频对应的所有的标签可以组成一个0-1序列。基于推荐参数序列与标签序列计算交叉熵损失函数(置信度序列长度与标签序列长度相同),对整个序列的损失函数求平均值,得到模型的预测损失。基于预测损失可以通过反向传播的方式对音频处理模型进行训练。
步骤503中,响应于播放触发操作,在播放界面显示视频文件的推荐参数曲线。
示例的,播放界面的推荐参数曲线与播放界面的时间轴的进度条绑定,视频在播放界面播放时,在进度条的上方显示精彩程度曲线,精彩程度越高,则曲线对应的数值越高,用户可以根据精彩程度曲线拉取进度条,定位到精彩数据片段进行观看。
本申请实施例提供的音频数据的处理方法的有益效果在于:
1、本申请实施例使用音频的信息来进行自动化的精彩数据片段识别,自动化的定位能够快速工业化的定位出精彩数据片段,在一些落地应用中,尤其是像播放端的热度曲线(精彩程度曲线),可以快速批量化的生产,提升生产效率并降低生产成本。
2、本申请实施例使用全音频信息来做精彩数据片段定位的特征输入,能够弥补画面非高燃但背景音乐高燃的数据片段定位不到的问题(比如情景剧),尤其是使用画面来定位精彩数据片段只能定位出整个画面中最高燃的几个镜头,无法完善整个精彩数据片段的完整性,但是使用音频能够将整个数据片段定位出。并且普遍的画面处理模型参数量较大,不能快速地预测出精彩数据片段,音频网络参数较小,更为快速便捷。
3、本申请实施例使用金字塔场景解析网络搭建信源分离系统,然后再使用语音活动检测算法进行对语音段落定位。该方法能够将完全的语音进行检测,不仅仅是语音信息,能够让整个信源分离系统得知更完整的语音数据片段定位信息。
4、本申请实施例使用语音的时间段落信息来确定整个音轨中每个节点权重信息。本申请实施例能够直接定位出语音数据片段并对语音数据片段分配对应的权重值,能够加强识别到语音数据片段的语义信息,极大的提升了语音语义信息在精彩数据片段定位中的占比。
5、本申请实施例使用多域多层的方法来提取语义特征,能够通过时域和频域在不同的网络层中互相补充信息,在时域特征中添加了频域信息,同样地,在频域特征中添加了时域信息。使得高层网络感知到底层网络特征,提升整个模型的感受域以及不同特征间的互补,从而提升整个音频处理模型的定位性能。
下面继续说明本申请实施例提供的音频数据的处理装置455的实施为软件模块的示例性结构,在一些实施例中,如图2所示,存储在存储器450的音频数据的处理装置455中的软件模块可以包括:信源分离模块4551,配置为对音频数据进行信源分离,得到至少一种信源类型分别对应的音轨数据;权重配置模块4552,配置为基于每个音轨数据的播放时间轴中与信源类型相关的至少一个时间段落,对音频数据中的每个数据片段基于所包含的所述时间段落长度分配对应的权重值,并将每个权重值组合形成音频数据的权重值序列;特征提取模块4553,配置为基于从每个数据片段提取的音频特征,组合形成音频数据的音频特征序列,并对音频特征序列进行编码,得到音频数据的注意力参数序列;参数预测模块4554,配置为对注意力参数序列与权重值序列进行融合处理,得到每个数据片段的融合参数,并基于每个融合参数确定每个数据片段的推荐参数;参数预测模块4554,还配置为基于每个数据片段的推荐参数,确定音频数据中的推荐片段。
在一些实施例中,信源分离模块4551,配置为对音频数据进行特征提取处理,得到音频数据的全局特征;以全局特征为掩膜,对音频数据进行信源分离,得到音频数据中每种信源类型分别对应的音轨数据,其中,掩膜的边界用于表征不同信源类型对应的音频数据之间的边界。
在一些实施例中,信源分离模块4551,配置为对音频数据进行特征提取处理,得到音频数据的原始特征;对原始特征进行多个层次的池化处理,得到音频数据的多个局部特征;对多个局部特征进行合并处理,得到音频数据的全局特征。
在一些实施例中,权重配置模块4552,配置为通过以下方式确定与信源类型相关的至少一个时间段落:当音轨数据对应的信源类型为语音时,将音轨数据中短时能量大于 能量阈值且过零率小于过零率阈值的时间段落,作为与语音相关的时间段落;当音轨数据对应的信源类型为背景音时,将音轨数据中满足筛选条件的时间段落作为与背景音相关的时间段落,其中,筛选条件包括以下任意一项:时间段落对应的响度大于响度下限值;时间段落的长度大于长度下限值。
在一些实施例中,权重配置模块4552,配置为当通过信源分离得到语音和背景音两种信源类型的音轨数据时,针对每个数据片段进行以下处理:当数据片段属于语音相关的时间段落时,基于数据片段对应的语音的参数确定数据片段对应的权重值,其中,权重值与参数正相关,参数包括以下至少之一:语速、语调、响度;当数据片段属于背景音相关的时间段落时,将预设数值作为数据片段对应的权重值,其中,预设数值小于任意一个语音相关的数据片段的权重值;当数据片段不属于任意信源类型相关的时间段落时,将零作为数据片段对应的权重值。
在一些实施例中,权重配置模块4552,配置为当通过信源分离仅得到背景音一种信源类型的音轨数据时,针对每个数据片段进行以下处理:当数据片段包含的时间段落属于背景音相关的时间段落时,基于数据片段对应的背景音的参数确定数据片段对应的权重值,其中,权重值与参数正相关,参数包括以下至少之一:响度、音调;当数据片段包含的时间段落不属于任意信源类型相关的时间段落时,将零作为数据片段对应的权重值。
在一些实施例中,特征提取模块4553,配置为针对音频数据中每个数据片段进行以下处理:提取数据片段的时域信号特征与频域信号特征;基于每个音轨数据的播放时间轴中与信源类型相关的至少一个时间段落,确定时域信号特征对应的一维时域权重值,以及确定频域信号特征对应的二维频域权重值;对一维时域权重值与时域信号特征的乘积进行多个层次的卷积,得到时域音频特征;对二维频域权重值与频域信号特征的乘积进行多个层次的卷积,得到频域音频特征;对时域音频特征进行缩放,得到二维时域音频特征;对二维时域音频特征与频域音频特征进行融合处理,得到数据片段的音频特征。
在一些实施例中,特征提取模块4553,配置为对二维时域音频特征与频域音频特征进行叠加处理,对叠加处理得到的叠加特征进行二维卷积,得到二维卷积结果,获取二维卷积结果的最大叠加特征与平均叠加特征;对最大叠加特征与平均叠加特征之间的加和进行线性激活,得到数据片段的音频特征。
在一些实施例中,参数预测模块4554,配置为针对音频特征序列中每个音频特征执行以下处理:基于注意力机制对音频特征与其他数据片段的每个音频特征分别进行融合处理,得到音频特征对应的每个加权相关度;将每个加权相关度相加,得到音频特征对应的注意力参数,其中,其他数据片段是音频数据中除所述数据片段之外的数据片段;基于每个音频特征对应的数据片段的顺序,将每个注意力参数组合形成音频数据的注意力参数序列。
在一些实施例中,参数预测模块4554,配置为在基于注意力机制对音频特征与其他数据片段的每个音频特征分别进行融合处理,得到音频特征对应的每个加权相关度之前,对音频数据的每个数据片段的音频特征进行全连接,得到每个音频特征的嵌入向量;针对音频特征与其他数据片段的每个音频特征执行以下处理:对音频特征的嵌入向量与其他数据片段的嵌入向量相乘,得到音频特征与其他数据片段的音频特征之间的相关度;将音频特征与相关度相乘,得到音频特征对应的加权相关度。
在一些实施例中,参数预测模块4554,配置为针对每个数据片段执行以下处理:从注意力参数序列获取数据片段对应的注意力参数,将数据片段的权重值与数据片段的注意力参数相乘,得到数据片段的融合参数;对融合参数进行归一化处理,得到数据片段的推荐参数。
在一些实施例中,参数预测模块4554,配置为通过以下任意一种方式确定音频数据的推荐片段:基于每个数据片段的推荐参数对每个数据片段进行降序排序,将降序排序的头部的至少一个数据片段作为音频数据的推荐片段;将推荐参数大于推荐参数阈值的数据片段作为推荐片段。
在一些实施例中,参数预测模块4554,配置为基于每个数据片段的推荐参数,生成音频数据的推荐参数曲线;响应于播放触发操作,在播放界面显示音频数据的推荐参数曲线,其中,推荐参数曲线的横坐标为音频数据的播放时间,推荐参数曲线的纵坐标为推荐参数。
在一些实施例中,参数预测模块4554,配置为在播放界面显示推荐片段的标签,其中,标签用于表征推荐片段的时间段落;响应于针对任意一个标签的选择操作,跳转到选中的标签对应的推荐片段的起点开始播放。
在一些实施例中,信源分离通过调用音频处理模型的金字塔场景解析网络实现,从每个数据片段提取音频特征通过调用音频处理模型的音频语义信息提取模块实现,编码与融合处理通过调用音频处理模型的注意力模块实现;其中,通过以下方式训练音频处理模型:基于音频数据的每个实际推荐片段的标签值,组合形成音频数据的实际推荐参数序列;基于音频数据的每个数据片段的推荐参数,组合形成音频数据的预测推荐参数序列;基于实际推荐参数序列与预测推荐参数序列获取音频处理模型的交叉熵损失;将交叉熵损失除以音频数据的数据片段数量,得到平均预测损失,基于平均预测损失对音频处理模型进行反向传播处理,得到更新后的音频处理模型。
本申请实施例提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。电子设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该电子设备执行本申请实施例上述的音频数据的处理方法。
本申请实施例提供一种存储有可执行指令的计算机可读存储介质,其中存储有可执行指令,当可执行指令被处理器执行时,将引起处理器执行本申请实施例提供的音频数据的处理方法,例如,如图3A示出的音频数据的处理方法。
在一些实施例中,计算机可读存储介质可以是FRAM、ROM、PROM、EPROM、EEPROM、闪存、磁表面存储器、光盘、或CD-ROM等存储器;也可以是包括上述存储器之一或任意组合的各种设备。
在一些实施例中,可执行指令可以采用程序、软件、软件模块、脚本或代码的形式,按任意形式的编程语言(包括编译或解释语言,或者声明性或过程性语言)来编写,并且其可按任意形式部署,包括被部署为独立的程序或者被部署为模块、组件、子例程或者适合在计算环境中使用的其它单元。
作为示例,可执行指令可被部署为在一个电子设备上执行,或者在位于一个地点的多个电子设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个电子设备上执行。
综上所述,通过本申请实施例从音频侧的角度来对整个视频进行多个域内以及多层信息的分析,能够快速的定位出整个音频中的推荐片段(例如:精彩数据片段、热血数据片段、悲伤数据片段或者搞笑数据片段等),从而基于音频的推荐片段能够判断出视频中的推荐片段的时间段落在时间轴中的位置。从而在不依赖音频数据的播放记录数据的情况下,就可以准确识别推荐片段,为用户提供准确的参考信息,提升了用户体验。能够为播放器提供视频推荐参数曲线,以供观众将播放进度条由当前播放位置跳转到推荐片段的位置,提升观众对播放器的使用体验。
以上所述,仅为本申请的实施例而已,并非用于限定本申请的保护范围。凡在本申请的精神和范围之内所作的任何修改、等同替换和改进等,均包含在本申请的保护范围之内。

Claims (18)

  1. 一种音频数据的处理方法,由电子设备执行,所述方法包括:
    从音频数据提取得到至少一种信源类型分别对应的音轨数据,其中,所述音频数据包含多个数据片段;
    确定每个所述音轨数据的播放时间轴中与所述信源类型相关的至少一个时间段落,并确定所述音频数据中每个所述数据片段中分别包含的时间段落;
    对所述音频数据中的每个数据片段基于所包含的所述时间段落长度分配对应的权重值,并将每个所述权重值组合形成所述音频数据的权重值序列;
    从所述每个数据片段提取音频特征,将所述每个数据片段的音频特征组合形成所述音频数据的音频特征序列,并对所述音频特征序列进行编码,得到所述音频数据的注意力参数序列;
    将所述注意力参数序列与所述权重值序列融合得到每个所述数据片段的融合参数,并基于每个所述融合参数确定每个所述数据片段的推荐参数;
    基于每个所述数据片段的推荐参数,确定所述音频数据中的推荐片段。
  2. 如权利要求1所述的方法,其中,所述从音频数据提取得到至少一种信源类型分别对应的音轨数据,包括:
    从所述音频数据提取得到所述音频数据的全局特征;
    以所述全局特征为掩膜,对所述音频数据进行信源分离,得到所述音频数据中每种所述信源类型分别对应的音轨数据,其中,所述掩膜的边界用于表征不同信源类型对应的音频数据之间的边界。
  3. 如权利要求2所述的方法,其中,所述从所述音频数据提取得到所述音频数据的全局特征,包括:
    对所述音频数据进行特征提取处理,得到所述音频数据的原始特征;
    对所述原始特征进行多个层次的池化处理,得到所述音频数据的多个局部特征;
    将所述多个局部特征合并得到所述音频数据的全局特征。
  4. 如权利要求1至3任一项所述的方法,其中,
    所述确定每个所述音轨数据的播放时间轴中与所述信源类型相关的至少一个时间段落,并确定所述音频数据中每个所述数据片段中分别包含的时间段落,包括:
    通过以下方式确定与所述信源类型相关的至少一个时间段落:
    当所述音轨数据对应的信源类型为语音时,将所述音轨数据中短时能量大于能量阈值且过零率小于过零率阈值的时间段落,作为与所述语音相关的时间段落;
    当所述音轨数据对应的信源类型为背景音时,将所述音轨数据中满足筛选条件的时间段落作为与所述背景音相关的时间段落,其中,所述筛选条件包括以下任意一项:所述时间段落对应的响度大于响度下限值;所述时间段落的长度大于长度下限值。
  5. 如权利要求1至4任一项所述的方法,其中,
    当从所述音频数据提取得到语音和背景音两种信源类型的音轨数据时,所述对所述音频数据中的每个数据片段基于所包含的所述时间段落长度分配对应的权重值,包括:
    针对每个所述数据片段进行以下处理:
    当所述数据片段属于所述语音相关的所述时间段落时,基于所述数据片段对应的语音的参数确定所述数据片段对应的权重值,其中,所述权重值与所述参数正相关,所述参数包括以下至少之一:语速、语调、响度;
    当所述数据片段属于所述背景音相关的所述时间段落时,将预设数值作为所述数据片段对应的权重值,其中,所述预设数值小于任意一个所述语音相关的数据片段的权重值;
    当所述数据片段不属于任意所述信源类型相关的时间段落时,将零作为所述数据片段对应的权重值。
  6. 如权利要求1至4任一项所述的方法,其中,当从所述音频数据提取得到背景音一种信源类型的音轨数据时,所述对所述音频数据中的每个数据片段基于所包含的所述时间段落长度分配对应的权重值,包括:
    针对每个所述数据片段进行以下处理:
    当所述数据片段包含的时间段落属于所述背景音相关的所述时间段落时,基于所述数据片段对应的背景音的参数确定所述数据片段对应的权重值,其中,所述权重值与所述参数正相关,所述参数包括以下至少之一:响度、音调;
    当所述数据片段包含的时间段落不属于任意所述信源类型相关的时间段落时,将零作为所述数据片段对应的权重值。
  7. 如权利要求1至6任一项所述的方法,其中,在所述将从所述每个数据片段提取的音频特征组合形成所述音频数据的音频特征序列之前,所述方法还包括:
    针对所述音频数据中每个所述数据片段进行以下处理:
    提取所述数据片段的时域信号特征与频域信号特征;
    基于每个所述音轨数据的播放时间轴中与所述信源类型相关的至少一个时间段落,确定所述时域信号特征对应的一维时域权重值,以及确定所述频域信号特征对应的二维频域权重值;
    对所述一维时域权重值与所述时域信号特征的乘积进行多个层次的卷积,得到时域音频特征;
    对所述二维频域权重值与所述频域信号特征的乘积进行多个层次的卷积,得到频域音频特征;
    对所述时域音频特征进行缩放,得到二维时域音频特征;
    将所述二维时域音频特征与所述频域音频特征融合得到所述数据片段的音频特征。
  8. 如权利要求7所述的方法,其中,所述将所述二维时域音频特征与所述频域音频特征融合得到所述数据片段的音频特征,包括:
    确定所述二维时域音频特征与所述频域音频特征的叠加特征,对所述的叠加特征进行二维卷积,得到二维卷积结果,获取所述二维卷积结果的最大叠加特征与平均叠加特征;
    对所述最大叠加特征与所述平均叠加特征之间的加和进行线性激活,得到所述数据片段的音频特征。
  9. 如权利要求1至8任一项所述的方法,其中,所述将所述音频特征序列进行编码,得到所述音频数据的注意力参数序列,包括:
    针对所述音频特征序列中每个所述音频特征执行以下处理:基于注意力机制,将所述音频特征与其他数据片段的每个所述音频特征分别融合得到所述音频特征对应的每个加权相关度;将每个所述加权相关度相加得到所述音频特征对应的注意力参数,其中,所述其他数据片段是所述音频数据中除所述数据片段之外的数据片段;
    基于每个所述音频特征对应的数据片段的顺序,将每个所述注意力参数组合形成所述音频数据的注意力参数序列。
  10. 如权利要求9所述的方法,其中,在所述基于注意力机制,将所述音频特征与其他数据片段的每个所述音频特征分别融合得到所述音频特征对应的每个加权相关度之前,所述方法还包括:
    对所述音频数据的每个所述数据片段的音频特征进行全连接,得到每个所述音频特征的嵌入向量;
    所述基于注意力机制,将所述音频特征与其他数据片段的每个所述音频特征分别融合得到所述音频特征对应的每个加权相关度,包括:
    针对所述音频特征与其他数据片段的每个所述音频特征执行以下处理:
    对所述音频特征的嵌入向量与其他数据片段的嵌入向量相乘,得到所述音频特征与其他数据片段的音频特征之间的相关度;
    将所述音频特征与所述相关度相乘,得到所述音频特征对应的加权相关度。
  11. 如权利要求1至10任一项所述的方法,其中,
    所述将所述注意力参数序列与所述权重值序列融合得到每个所述数据片段的融合参数,包括:
    针对每个所述数据片段执行以下处理:
    从所述注意力参数序列获取所述数据片段对应的注意力参数,将所述数据片段的权重值与所述数据片段的注意力参数相乘,得到所述数据片段的融合参数;
    所述基于每个所述融合参数确定每个所述数据片段的推荐参数,包括:
    对每个所述数据片段的融合参数进行归一化处理,得到每个所述数据片段的推荐参数。
  12. 如权利要求1至11任一项所述的方法,其中,所述基于每个所述数据片段的推荐参数,确定所述音频数据中的推荐片段,包括:
    通过以下任意一种方式确定所述音频数据的推荐片段:
    基于每个所述数据片段的推荐参数对每个所述数据片段进行降序排序,将降序排序结果中从头部开始的至少一个数据片段作为所述音频数据的推荐片段;
    将推荐参数大于推荐参数阈值的数据片段作为推荐片段。
  13. 如权利要求1至12任一项所述的方法,其中,
    在所述将所述注意力参数序列与所述权重值序列融合得到每个所述数据片段的融合参数,并基于每个所述融合参数确定每个所述数据片段的推荐参数之后,所述方法还包括:
    基于所述每个所述数据片段的推荐参数,生成所述音频数据的推荐参数曲线;
    响应于播放触发操作,在播放界面显示所述音频数据的推荐参数曲线,其中,所述推荐参数曲线的横坐标为所述音频数据的播放时间,所述推荐参数曲线的纵坐标为所述推荐参数。
  14. 如权利要求13所述的方法,其中,在所述基于每个所述数据片段的推荐参数,确定所述音频数据中的推荐片段之后,所述方法还包括:
    在所述播放界面显示所述推荐片段的标签,其中,所述标签用于表征所述推荐片段的时间段落;
    响应于针对任意一个所述标签的选择操作,跳转到选中的所述标签对应的推荐片段的起点开始播放。
  15. 如权利要求1至14任一项所述的方法,其中,所述音频数据的处理方法通过调用音频处理模型实现,所述方法还包括:
    通过以下方式训练所述音频处理模型:
    基于所述音频数据的每个实际推荐片段的标签值,组合形成所述音频数据的实际推荐参数序列;
    基于所述音频数据的每个所述数据片段的推荐参数,组合形成所述音频数据的预测推荐参数序列;
    基于所述实际推荐参数序列与所述预测推荐参数序列获取所述音频处理模型的交叉熵损失;
    将所述交叉熵损失除以所述音频数据的数据片段数量,得到平均预测损失,基于所述平均预测损失对所述音频处理模型进行反向传播处理,得到更新后的所述音频处理模型。
  16. 一种音频数据的处理装置,所述装置包括:
    信源分离模块,配置为从音频数据提取得到至少一种信源类型分别对应的音轨数据;
    权重配置模块,配置为基于每个所述音轨数据的播放时间轴中与所述信源类型相关的至少一个时间段落,对所述音频数据中的每个所述数据片段分配对应的权重值,并将每个所述权重值组合形成所述音频数据的权重值序列;
    特征提取模块,配置为从所述每个数据片段提取音频特征,将所述每个数据片段的音频特征组合形成所述音频数据的音频特征序列,并对所述音频特征序列进行编码,得到所述音频数据的注意力参数序列;
    参数预测模块,配置为将所述注意力参数序列与所述权重值序列融合得到每个所述数据片段的融合参数,并基于每个所述融合参数确定每个所述数据片段的推荐参数;
    所述参数预测模块,还配置为基于每个所述数据片段的推荐参数,确定所述音频数据中的推荐片段。
  17. 一种电子设备,所述电子设备包括:
    存储器,用于存储可执行指令;
    处理器,用于执行所述存储器中存储的可执行指令时,实现权利要求1至15任一项所述的方法。
  18. 一种计算机可读存储介质,存储有可执行指令,所述可执行指令被处理器执行时实现权利要求1至15任一项所述的方法。
PCT/CN2023/097205 2022-06-29 2023-05-30 音频数据的处理方法、装置、电子设备、程序产品及存储介质 WO2024001646A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210747175.3A CN114822512B (zh) 2022-06-29 2022-06-29 音频数据的处理方法、装置、电子设备及存储介质
CN202210747175.3 2022-06-29

Publications (1)

Publication Number Publication Date
WO2024001646A1 true WO2024001646A1 (zh) 2024-01-04

Family

ID=82522855

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/097205 WO2024001646A1 (zh) 2022-06-29 2023-05-30 音频数据的处理方法、装置、电子设备、程序产品及存储介质

Country Status (2)

Country Link
CN (1) CN114822512B (zh)
WO (1) WO2024001646A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114822512B (zh) * 2022-06-29 2022-09-02 腾讯科技(深圳)有限公司 音频数据的处理方法、装置、电子设备及存储介质
CN115033734B (zh) * 2022-08-11 2022-11-11 腾讯科技(深圳)有限公司 一种音频数据处理方法、装置、计算机设备以及存储介质
CN115766883A (zh) * 2022-11-04 2023-03-07 重庆长安汽车股份有限公司 一种多媒体数据调用的方法、装置、设备及介质
CN116230015B (zh) * 2023-03-14 2023-08-08 哈尔滨工程大学 一种基于音频时序信息加权的频域特征表示异音检测方法
CN116450881B (zh) * 2023-06-16 2023-10-27 北京小糖科技有限责任公司 基于用户偏好推荐兴趣分段标签的方法、装置及电子设备
CN116524883B (zh) * 2023-07-03 2024-01-05 腾讯科技(深圳)有限公司 音频合成方法、装置、电子设备和计算机可读存储介质
CN117036834B (zh) * 2023-10-10 2024-02-23 腾讯科技(深圳)有限公司 基于人工智能的数据分类方法、装置及电子设备

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6973256B1 (en) * 2000-10-30 2005-12-06 Koninklijke Philips Electronics N.V. System and method for detecting highlights in a video program using audio properties
CN101398826A (zh) * 2007-09-29 2009-04-01 三星电子株式会社 自动提取体育节目精彩片断的方法和设备
CN107154264A (zh) * 2017-05-18 2017-09-12 北京大生在线科技有限公司 在线教学精彩片段提取的方法
CN111782863A (zh) * 2020-06-30 2020-10-16 腾讯音乐娱乐科技(深圳)有限公司 音频分段方法、装置、存储介质及电子设备
CN112380377A (zh) * 2021-01-14 2021-02-19 腾讯科技(深圳)有限公司 一种音频推荐方法、装置、电子设备及计算机存储介质
US20210321172A1 (en) * 2020-04-14 2021-10-14 Sony Interactive Entertainment Inc. Ai-assisted sound effect generation for silent video
CN114822512A (zh) * 2022-06-29 2022-07-29 腾讯科技(深圳)有限公司 音频数据的处理方法、装置、电子设备及存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050125223A1 (en) * 2003-12-05 2005-06-09 Ajay Divakaran Audio-visual highlights detection using coupled hidden markov models
GB2533924A (en) * 2014-12-31 2016-07-13 Nokia Technologies Oy An apparatus, a method, a circuitry, a multimedia communication system and a computer program product for selecting field-of-view of interest
US11025985B2 (en) * 2018-06-05 2021-06-01 Stats Llc Audio processing for detecting occurrences of crowd noise in sporting event television programming
CN111901627B (zh) * 2020-05-28 2022-12-30 北京大米科技有限公司 视频处理方法、装置、存储介质及电子设备
CN113920534A (zh) * 2021-10-08 2022-01-11 北京领格卓越科技有限公司 一种视频精彩片段提取方法、系统和存储介质

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6973256B1 (en) * 2000-10-30 2005-12-06 Koninklijke Philips Electronics N.V. System and method for detecting highlights in a video program using audio properties
CN101398826A (zh) * 2007-09-29 2009-04-01 三星电子株式会社 自动提取体育节目精彩片断的方法和设备
CN107154264A (zh) * 2017-05-18 2017-09-12 北京大生在线科技有限公司 在线教学精彩片段提取的方法
US20210321172A1 (en) * 2020-04-14 2021-10-14 Sony Interactive Entertainment Inc. Ai-assisted sound effect generation for silent video
CN111782863A (zh) * 2020-06-30 2020-10-16 腾讯音乐娱乐科技(深圳)有限公司 音频分段方法、装置、存储介质及电子设备
CN112380377A (zh) * 2021-01-14 2021-02-19 腾讯科技(深圳)有限公司 一种音频推荐方法、装置、电子设备及计算机存储介质
CN114822512A (zh) * 2022-06-29 2022-07-29 腾讯科技(深圳)有限公司 音频数据的处理方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN114822512A (zh) 2022-07-29
CN114822512B (zh) 2022-09-02

Similar Documents

Publication Publication Date Title
WO2024001646A1 (zh) 音频数据的处理方法、装置、电子设备、程序产品及存储介质
Nagrani et al. Voxceleb: Large-scale speaker verification in the wild
Tzinis et al. Improving universal sound separation using sound classification
Chung et al. Out of time: automated lip sync in the wild
Kotsakis et al. Investigation of broadcast-audio semantic analysis scenarios employing radio-programme-adaptive pattern classification
US10108709B1 (en) Systems and methods for queryable graph representations of videos
US20180109843A1 (en) Methods and systems for aggregation and organization of multimedia data acquired from a plurality of sources
JP6967059B2 (ja) 映像を生成するための方法、装置、サーバ、コンピュータ可読記憶媒体およびコンピュータプログラム
US20140245463A1 (en) System and method for accessing multimedia content
CN112199548A (zh) 一种基于卷积循环神经网络的音乐音频分类方法
WO2020028760A1 (en) System and method for neural network orchestration
CN106919652B (zh) 基于多源多视角直推式学习的短视频自动标注方法与系统
TW201717062A (zh) 基於多模態融合之智能高容錯視頻識別系統及其識別方法
CN112418011A (zh) 视频内容的完整度识别方法、装置、设备及存储介质
CN114443899A (zh) 视频分类方法、装置、设备及介质
CN114329041A (zh) 一种多媒体数据处理方法、装置以及可读存储介质
Pandeya et al. Music video emotion classification using slow–fast audio–video network and unsupervised feature representation
CN111816170A (zh) 一种音频分类模型的训练和垃圾音频识别方法和装置
WO2023173539A1 (zh) 一种视频内容处理方法、系统、终端及存储介质
CN114661951A (zh) 一种视频处理方法、装置、计算机设备以及存储介质
CN114420097A (zh) 语音定位方法、装置、计算机可读介质及电子设备
Nguyen et al. Meta-transfer learning for emotion recognition
US20230070957A1 (en) Methods and systems for detecting content within media streams
Jitaru et al. Lrro: a lip reading data set for the under-resourced romanian language
Schuller et al. New avenues in audio intelligence: towards holistic real-life audio understanding

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23829831

Country of ref document: EP

Kind code of ref document: A1