CN118077004A - Management of professionally generated and user generated audio content - Google Patents

Management of professionally generated and user generated audio content

Publication number: CN118077004A
Application number: CN202280065631.4A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 杨少凡, 李凯
Assignee: Dolby Laboratories Licensing Corp
Application filed by Dolby Laboratories Licensing Corp
Priority claimed from PCT/US2022/040089 (WO2023018889A1)
Legal status: Pending
Abstract

A system for managing User Generated Content (UGC) and Professionally Generated Content (PGC) is disclosed. The system is programmed to receive digital audio data having two channels from a social media platform. The system is programmed to extract spatial features from the digital audio data, the spatial features capturing differences between the two channels. The system is further programmed to extract temporal features, spectral features, and background features from the digital audio data. The system is then programmed to use the extracted features to determine whether to treat the digital audio data as UGC or PGC prior to playback.

Description

Management of professionally generated and user generated audio content
Cross Reference to Related Applications
This application claims priority to International Patent Application No. PCT/CN2021/112543, filed on August 13, 2021; U.S. Provisional Patent Application No. 63/243,634, filed on September 13, 2021; and U.S. Provisional Patent Application No. 63/288,521, filed on December 10, 2021, each of which is incorporated herein by reference in its entirety.
Technical Field
The application relates to audio processing and playback.
Background
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
Technological advances have made it easier to make and share digital media content. Accordingly, the amount and variety of digital media content available for consumption is increasing, ranging from traditional television programming, movies, and music to modern video blogs, podcasts, and audio books. Today, Professionally Generated Content (PGC) and User Generated Content (UGC) are two broad categories of produced digital media content, both widely available on social media platforms. PGC refers to digital media content that is first recorded in a recording studio with professional equipment and then post-produced by a professional engineer or artist. UGC refers to digital media content recorded in a non-professional environment (e.g., a home or office), often using a user device such as a tablet, smartphone, or laptop. The following discussion focuses on digital audio content.
The manner in which digital audio content is generated directly affects how the digital audio content should be processed for playback. To properly convey the sound effects created during production, PGC needs to be processed through a well-designed signal processing chain before being transmitted to an output device (e.g., a speaker or headphones). For example, such a signal processing chain may include a virtualizer, a dialog enhancer, a volume adjuster, or an equalizer. On the other hand, UGC often presents quality problems because noise or reverberation present in the recording environment, or limitations of the recording equipment, are not addressed by any post-production. Therefore, UGC typically needs to be enhanced to repair such defects before it can be delivered to an output device for consumption. Sometimes, digital audio content generated using user equipment or in a non-professional environment is also post-produced using audio editing or mixing tools. Such digital audio content may be considered PGC for the purpose of determining how to process the digital audio content prior to playback.
Because digital audio content is typically submitted to a social media platform without accompanying information about how the digital audio content was made, it is beneficial to determine whether such digital audio content is PGC or UGC in order to provide the user with an optimal playback experience.
Disclosure of Invention
A computer-implemented method of classifying audio into UGC and PGC is disclosed. The method includes receiving, by a processor, digital audio content having two channels in a time-frequency representation over a plurality of frames and a plurality of frequency bands. The method further includes calculating, by the processor, for each frame of at least a subset of the plurality of frames and each frequency band of the plurality of frequency bands, a respective set of values for a corresponding set of spatial indicators to obtain a set of values for each frequency band, the set of spatial indicators being applied to the two channels and including at least one of an inter-aural level difference (ILD), an inter-aural phase difference (IPD), or an inter-aural coherence (IC). In addition, the method includes calculating a set of statistical features from the set of values for each of the plurality of frequency bands, the set of statistical features including a first statistical feature for only one of the plurality of frequency bands and a second statistical feature over a number of the plurality of frequency bands. The method further includes executing a classification model with the set of statistical features as input data and with an indication of whether the digital audio content is UGC or PGC as output data; and transmitting the output data.
The techniques described in this specification may be advantageous over conventional audio processing techniques. For example, the method enables effective audio playback by identifying the appropriate processing pipeline based on how the audio was produced. The method achieves classification accuracy by considering different types of audio features that capture the differences between UGC and PGC in various audio domains. In particular, the consideration of spatial features is directly relevant to the dual-channel playback experience.
Drawings
Exemplary embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements, and in which:
FIG. 1 illustrates an example networked computer system in which various embodiments may be practiced.
FIG. 2 illustrates example components of an audio management computer system according to an embodiment of the present disclosure.
FIG. 3 shows the distribution of the values of each spatial index of interest in each predetermined frequency band for sample UGC.
FIG. 4 shows the distribution of the values of each spatial index of interest in each predetermined frequency band for sample PGC.
FIG. 5 shows probability curves indicating, for sample UGC and sample PGC, the probability that the value of each spatial index is equal to the value corresponding to the peak of the distribution in each predetermined frequency band.
FIG. 6 illustrates an example process performed by an audio management computer system according to some embodiments described herein.
FIG. 7 is a block diagram that illustrates a computer system upon which an embodiment of the invention may be implemented.
Detailed Description
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the example embodiments of the invention. It may be evident, however, that the exemplary embodiments may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the example embodiments.
Embodiments will be described in sections according to the following summary:
1. General overview
2. Computing environment example
3. Computer component examples
4. Description of the functionality
4.1. Spatial feature extraction
4.2. Temporal and spectral feature extraction
4.3. Background feature extraction
4.4. Construction and execution of classification models
5. Example flow
6. Hardware implementation
1. General overview
A system for managing user-generated content and professionally-generated content is disclosed. In some embodiments, the system is programmed to receive digital audio data having two channels from a social media platform. The system is programmed to extract spatial features from the digital audio data, the spatial features capturing differences between the two channels. The system is further programmed to extract temporal features, spectral features, and background features from the digital audio data. The system is then programmed to use the extracted features to determine whether to treat the digital audio data as UGC or PGC prior to playback.
In some embodiments, the system first builds a digital model for determining whether given digital audio data is UGC or PGC. Examples of PGC include tracks or albums recorded in a studio, while examples of UGC include sounds recorded using a smartphone or other user device. The system can collect training data that includes UGC segments and PGC segments that meet certain requirements. For example, each segment may be required to be no shorter than a minimum length, and the segments may be required to cover multiple types of sound effects or recording environments.
In some embodiments, the system first extracts features from each segment in the training data. The system extracts spatial features relating to how the sound source resides at different locations, temporal features relating to how the source and properties of the sound change over time, and spectral features relating to how the sound exists at different frequencies. The system also separates the background portion from the digital audio data and extracts temporal and spectral features exclusively from the background portion as background features.
In some embodiments, the system extracts the spatial features of each segment, converted to a time-frequency representation, by calculating statistics of the values of spatial indices over a plurality of frames for each of a plurality of frequency bands. The spatial indices include the inter-aural level difference (ILD), the inter-aural phase difference (IPD), or the inter-aural coherence (IC). The statistics are treated as spatial features and include the mean, variance, or other aggregate value for each band, as well as more complex aggregate values across all bands.
In some embodiments, the system also extracts temporal features from each segment or background portion thereof. Example temporal characteristics include energy flux, zero-crossing rate, or maximum amplitude. The system also extracts spectral features from each segment converted to the frequency domain or from the background portion converted to the spectral domain. Example spectral features include spectral centroid, spectral flux, spectral density, spectral roll-off, or mel-frequency cepstral coefficient (MFCC).
In some embodiments, the system combines the extracted features into a feature vector to represent each segment. For example, the extracted features may include multiple spatial features for each frequency band as well as additional spatial features across all frequency bands. The system may create a feature vector that includes an index for each frequency band, a plurality of spatial features for the frequency band, and the additional spatial features. The feature vector may be enhanced by temporal, spectral or background features.
In some embodiments, the system may then build a digital model using at least the set of feature vectors created from the training data, the digital model generating a label indicating whether the digital audio data is UGC or PGC, or a probability that the digital audio data is UGC or PGC. For a digital model based on supervised training, each feature vector is used as input data, and the label indicating whether the underlying segment is UGC or PGC is used as the desired output data, to train the digital model. The system may store the digital model for future use or send the digital model to another device for use.
In some embodiments, with the digital model stored, the system may receive new audio data having two channels from the social media platform. The system may extract all features from the new audio data in the same manner as from each segment in the training set, generating a new feature vector. The system may then run the digital model using the new feature vector as input data to generate a label or probability as output data. The system may then send the label to another device, such as a display device or an audio processing system, so that the new audio data can be enhanced for playback based on whether it is determined to be UGC or PGC.
The system has several technical benefits. The system solves the technical problem of classifying audio as UGC or PGC, which determines how the audio should be further processed. The system enables effective audio playback by identifying the appropriate processing pipeline based on how the audio was produced. The system achieves classification accuracy by considering different types of audio features that capture the differences between UGC and PGC in various audio domains. In particular, the consideration of spatial features is directly relevant to the dual-channel playback experience.
2. Computing environment example
FIG. 1 illustrates an example networked computer system in which various embodiments may be practiced. For purposes of illustrating a clear example, fig. 1 is shown in a simplified schematic format, and other embodiments may include more, fewer, or different elements.
In some embodiments, the networked computer system includes an audio management computer system ("system") 102, a social media platform 104 or additional social media platforms, and an audio processing device 110 or additional audio processing devices, which are communicatively coupled by a direct physical connection or by one or more networks 118.
In some embodiments, system 102 is programmed or configured with data structures and/or database records arranged to host or perform functions related to analyzing audio data to distinguish UGC from PGC. The system 102 may include a server farm, a cloud computing platform, or parallel computers. The system 102 may also include a cellular telephone, tablet computer, laptop computer, personal digital assistant, or any other computing device having sufficient computing power in terms of data processing, data storage, and network communications for the above-described functions.
In some embodiments, social media platform 104 is configured to receive and host digital media, including digital audio data. Digital media may come from a variety of sources, including systems or devices associated with professional studios or ordinary consumers. The social media platform 104 may also be configured to provide a user interface for accessing digital media data in either raw or processed form. In some embodiments, the system 102 is integrated into the social media platform 104. Social media platform 104 may include a server farm, a cloud computing platform, a parallel computer, or any other computing facility with sufficient computing power in terms of data processing, data storage, and network communications for the above-described functions.
In some embodiments, the audio processing device 110 is configured to process the audio data to prepare for playback, depending on how the audio data was made. Audio processing device 110 may utilize separate processing pipelines for UGC and PGC. In some embodiments, the audio processing device 110 is integrated into the system 102. The audio processing device 110 may include a server farm, a cloud computing platform, a parallel computer, or any other computing facility with sufficient computing power in terms of data processing, data storage, and network communications for the above functions.
The one or more networks 118 may be implemented by any medium or mechanism that provides for the exchange of data between the various elements of fig. 1. Examples of network 118 include, but are not limited to, one or more of a cellular network, a Near Field Communication (NFC) network, a Local Area Network (LAN), a Wide Area Network (WAN), the internet, a terrestrial or satellite link, and the like.
In some embodiments, the system 102 is programmed to receive audio data from the social media platform 104. The audio data is assumed to have at least two channels for future playback. The system 102 is programmed to build a digital model for classifying audio data as UGC or PGC based on training data received from the social media platform 104 or one or more other audio sources. For new audio data subsequently received from the social media platform 104, the system 102 is programmed to classify the new audio data using the digital model and send the classified new audio data to the audio processing device 110 for further processing in preparation for playback. The system 102 may also send the digital model to the audio processing device 110, and the audio processing device 110 may then use the digital model to obtain a classification for any new audio data received from the system 102 or directly from the social media platform 104.
3. Computer component examples
FIG. 2 illustrates example components of an audio management computer system according to an embodiment of the present disclosure. The diagram is for illustration purposes only, and system 102 may include fewer or more functional or storage components. Each functional component may be implemented as a software component, a general-purpose or special-purpose hardware component, a firmware component, or any combination thereof. Each functional component may also be coupled with one or more storage components (not shown). A storage component may be implemented using any of a relational database, an object database, a flat file system, or JSON storage. A storage component may be connected to the functional components locally or over a network using programmatic calls, Remote Procedure Call (RPC) facilities, or a messaging bus. The components may or may not be independent. The components may be functionally or physically centralized or distributed, depending on the particular implementation or other considerations.
In some embodiments, system 102 includes spatial feature extraction instructions 202, additional feature extraction instructions 204, classification model construction instructions 206, classification model execution instructions 208, and communication interface instructions 210. The system 102 also includes a database 220.
In some embodiments, the spatial feature extraction instructions 202 enable computation of spatial features from given audio data in the time domain for distinguishing UGC from PGC. Spatial characteristics are related to how the sound source resides in different locations. Spatial features may be extracted by converting given audio data into a time-frequency representation.
In some embodiments, the additional feature extraction instructions 204 enable calculation of additional features from the given audio data or the determined background portion of the given audio data in the time domain for distinguishing UGC from PGC. Additional features may include temporal and spectral features. The temporal characteristics relate to how the source and nature of sound changes over time. The spectral characteristics are related to how sound is present at different frequencies. Spectral features may be extracted by converting given audio data into the frequency domain.
In some embodiments, classification model construction instructions 206 enable construction of a classification model for distinguishing UGC from PGC. The classification model is configured to receive the extracted features as input data and optionally a tag of the UGC or PGC as expected output data. Construction may include supervised learning or unsupervised learning.
In some embodiments, the classification model execution instructions 208 enable execution of the classification model for distinguishing UGC from PGC. Given new audio data, the specific features described above can be extracted and used as input data to execute the classification model to generate a UGC or PGC label.
In some embodiments, the communication interface instructions 210 enable communication with other systems or devices over a computer network. The communication may include receiving audio data from the social media platform 104 or other sound source. The communication may also include sending the audio data classification or additional data to the audio processing device 110 or the display device.
In some embodiments, database 220 is programmed or configured to manage the storage and access of relevant data, such as received audio data, digital models, features extracted from received audio data, or results of executing digital models.
4. Description of the functionality
The capabilities and arrangement of the recording device, the nature of the recording environment, and the capabilities of the post-production tools, alone or in combination, result in differences between PGCs and UGCs in terms of spatial features, temporal and spectral features, and background features. Thus, in some embodiments, the system 102 extracts these features from digital audio data having at least two channels and uses the extracted features to classify the digital audio data.
4.1. Spatial feature extraction
The capabilities and arrangement of the recording device and the capabilities of the post-production tools may significantly determine the spatial characteristics of the digital audio data. Spatial features may be represented relative to the two channels that reach the two ears, or relative to more channels corresponding to alternative configurations of audio reception. For example, whether recording is done in a stereo setting with two or more microphones, or in a binaural setting with two microphones arranged as if they were worn on the two ears of a head, may affect the spatial characteristics of the recorded audio.
More specifically, for PGC, each of the various sound sources of the audio mix typically has a defined position and spectral bandwidth in the sound field. The placement of sound sources (possibly including the creation of virtual sound sources) is optimized mainly using panning techniques, which involve creative choices based on human perception of sound localization under given technical constraints. For UGC, sound is typically recorded directly by the user device, without post-production applying panning techniques. Thus, PGC tends to have more dynamic and diversified spatial cues than UGC, which results in greater differences between the two channels that reach the ears.
In some embodiments, the system 102 receives digital audio data having two channels ("dual-channel audio data") and a plurality of frames in the time domain that were originally generated by one or more input devices (e.g., one or more microphones located in a recording studio or embedded in a mobile device) and that may have been post-produced. The system 102 then generates dual-channel content as a time-frequency representation (TFR), i.e., a view of the signal represented over both time and frequency, from the dual-channel audio data by a known transformation, such as a discrete Short-Time Fourier Transform (STFT) or a Complex Quadrature Mirror Filter (CQMF).
In some embodiments, the dual-channel content as a TFR includes first channel content and second channel content corresponding to the two channels, which may be denoted as S1(f, l) and S2(f, l), respectively, where f ∈ [1, F] represents a frequency index, l represents a frame index, and S1(f, l) or S2(f, l) represents the complex frequency response (including amplitude and phase) of frame l at frequency f. For example, the set of frequency bands may correspond to frequency intervals within the range of human hearing. The system 102 considers a plurality of spatial indices to extract spatial patterns or features from the dual-channel content. Each spatial index is used to compare the two channels across the frequency bands.
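To make the notation concrete, the following Python sketch (an illustrative assumption, not code from this disclosure) obtains S1(f, l) and S2(f, l) from two-channel time-domain audio with a discrete STFT; the sampling rate, window, and FFT size are example values.

    import numpy as np
    from scipy.signal import stft

    def to_tfr(stereo: np.ndarray, sample_rate: int = 48000, n_fft: int = 2048):
        """stereo: array of shape (num_samples, 2). Returns S1, S2 of shape (F, L),
        the complex frequency responses per frequency index f and frame index l."""
        _, _, s1 = stft(stereo[:, 0], fs=sample_rate, nperseg=n_fft)
        _, _, s2 = stft(stereo[:, 1], fs=sample_rate, nperseg=n_fft)
        return s1, s2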
FIG. 3 shows the distribution of the values of each spatial index of interest in each predetermined frequency band for sample UGC (e.g., 80 hours of audio recorded with mobile phones). FIG. 4 shows the distribution of the values of each spatial index of interest in each predetermined frequency band for sample PGC (e.g., 100 hours of movie audio). In FIG. 3 or FIG. 4, for each frequency band along the y-axis, each point represents the probability that the spatial index takes the value on the x-axis according to the legend on the right, and the probabilities sum to 1. For example, the probability may be estimated by a normalized count in the sample data. FIG. 5 shows probability curves indicating, for the sample UGC and the sample PGC, the probability that the value of each spatial index is equal to the value corresponding to the peak of the distribution in each predetermined frequency band. FIGS. 3-5 illustrate why the spatial indices can be effectively used to distinguish UGC from PGC.
In some embodiments, the plurality of spatial indices includes the inter-aural level difference (ILD). Each predetermined frequency band k contains a set of frequency coefficients or values fk_i, where i is a positive integer. For example, 1024 Fast Fourier Transform (FFT) coefficients may be calculated for each frame of the waveform when the sampling rate of the content is 48 kHz. 41 frequency bands can then be obtained based on the Equivalent Rectangular Bandwidth (ERB) scale, with the lower bands containing fewer FFT coefficients, which results in only one coefficient in the lowest band. The ILD indicates the energy ratio of the first channel content to the second channel content, and may be calculated for each frequency band k and frame l as follows:

ILD(k,l)=10·log10[∑i|S1(fk_i,l)|²/∑i|S2(fk_i,l)|²]    (2)

wherein the ILD in each band k and each frame l is computed from the ratio, between the two channels, of the energy summed over the frequency values in that band. The ILD distribution in each band may be estimated as probabilities from a normalized histogram over all frames of that band in the sample data. Plot 302 in FIG. 3 shows the ILD distribution of the sample UGC, where the x-axis represents a set of possible energy values and the y-axis represents the set of frequency bands. In plot 302, the distribution is centered around 0 and appears uniform across the frequency bands. Thus, the distribution peaks at 0, and the UGC line of plot 502 in FIG. 5 has a relatively constant value across the frequency bands. Plot 402 in FIG. 4 shows the ILD distribution for the sample PGC, with the x-axis being the set of possible energy values and the y-axis being the set of frequency bands. In plot 402, the distribution is also centered around 0, but more tightly so for the higher frequency bands. Thus, the distribution peaks at 0, and the PGC line of plot 502 in FIG. 5 trends upward when moving from the lower frequency bands to the higher frequency bands. As shown in plot 502, the difference in ILD distribution between the sample UGC and the sample PGC may be due to dynamic panning of sound sources over time when the PGC is produced. For most sound sources, energy generally decays as frequency rises. When a sound source is panned in a direction other than straight ahead, producing an energy difference between the two channels reaching the two ears, the lower the frequency, the greater the likelihood that the ILD deviates from 0. Thus, the probability at 0 (corresponding to no difference between the two channels) increases with increasing frequency.
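As an illustrative sketch (an assumption, not reference code from this disclosure), the per-band ILD can be computed as in the reconstructed equation above, where bands holds the FFT bin indices fk_i of each ERB-style band:

    import numpy as np

    def ild(s1: np.ndarray, s2: np.ndarray, bands: list, eps: float = 1e-12) -> np.ndarray:
        """s1, s2: complex STFTs of shape (F, L); bands: list of integer index arrays.
        Returns the ILD in dB with shape (num_bands, L)."""
        e1 = np.abs(s1) ** 2
        e2 = np.abs(s2) ** 2
        out = np.empty((len(bands), s1.shape[1]))
        for k, bins in enumerate(bands):
            # ratio of band energies between the two channels, in dB
            out[k] = 10.0 * np.log10((e1[bins].sum(axis=0) + eps) / (e2[bins].sum(axis=0) + eps))
        return out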
In some embodiments, the plurality of spatial indices includes the inter-aural phase difference (IPD). The IPD represents the phase difference between the first channel content and the second channel content, and is calculated for each frequency band k and frame l as follows:

IPD(k,l)=Phase[∑iS1(fk_i,l)·S2*(fk_i,l)]    (3)

wherein * denotes the complex conjugate, and Phase[·] denotes the phase of a complex value.
Plot 304 in FIG. 3 shows the IPD distribution of the sample UGC, with the x-axis being the set of possible phase values and the y-axis being the set of frequency bands. In plot 304, the distribution is centered around 0, except for the lowest frequency band, and appears uniform across the other frequency bands. Thus, the distribution peaks at 0, and the UGC line of plot 504 in FIG. 5 has a relatively constant value across the frequency bands, but drops from near 1 to about 0.4 to 0.6 for the lowest frequency band. Plot 404 in FIG. 4 shows the IPD distribution of the sample PGC, with the x-axis being the set of phase values and the y-axis being the set of frequency bands. In plot 404, the distribution is also centered around 0, except for the lowest frequency band, but is wider for the lower frequency bands. Thus, the distribution peaks at 0, and the PGC line of plot 504 in FIG. 5 trends upward when moving from the lower frequency bands to the higher frequency bands, but for the lowest frequency band it drops rapidly from near 1 to about 0.2. As described above, the IPD reflects the phase difference between the two channels reaching the two ears; the larger the energy difference, the larger the phase difference. Thus, similar to the difference in ILD distribution, the difference in IPD distribution between the sample UGC and the sample PGC can be attributed to dynamic panning of sound sources over time when the PGC is produced. However, since the lowest frequency band is designed to contain only one frequency coefficient, as described above, the IPD there is constantly 0, and thus the IPD distribution peaks at 0 for the lowest frequency band of both UGC and PGC. In general, both ILD and IPD cues are indicative of the direction of the sound source, so they are used in pairs. FIG. 5 shows the statistical differences in IPD and ILD between PGC and UGC over all data frames. Although plots 502 and 504 appear to have similar trends, they may have different characteristics within an analysis window over a small number of frames. The classifier discussed below automatically learns and selects the most discriminative features by comparing the IPD, the ILD, and the other features mentioned below in different frequency bands.
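A corresponding sketch for the IPD of equation (3), under the same assumptions as the ILD sketch above:

    import numpy as np

    def ipd(s1: np.ndarray, s2: np.ndarray, bands: list) -> np.ndarray:
        """Inter-aural phase difference per band and frame, following equation (3)."""
        cross = s1 * np.conj(s2)                        # S1 · S2*
        out = np.empty((len(bands), s1.shape[1]))
        for k, bins in enumerate(bands):
            out[k] = np.angle(cross[bins].sum(axis=0))  # phase of the summed cross-spectrum
        return out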
In some embodiments, the plurality of spatial indices includes the inter-aural coherence (IC). A widely used metric for the IC, originally introduced to characterize noise fields, is the magnitude-squared coherence (MSC). The IC indicates the similarity between the first channel content and the second channel content, and is calculated for each frequency band k and frame l as follows:

IC(k,l)=|∑iS1(fk_i,l)·S2*(fk_i,l)|²/[∑i|S1(fk_i,l)|²·∑i|S2(fk_i,l)|²]    (4)

Plot 306 in FIG. 3 shows the IC distribution of the sample UGC, with the x-axis being the set of possible unitless values and the y-axis being the set of frequency bands. In plot 306, the distribution is more concentrated around 1 for the lower frequency bands. Thus, the distribution generally peaks at 1, and the UGC line of plot 506 in FIG. 5 trends downward when moving from the lower frequency bands to the higher frequency bands. Plot 406 in FIG. 4 shows the IC distribution of the sample PGC, with the x-axis being the set of possible unitless values and the y-axis being the set of frequency bands. In plot 406, the distribution is only lightly concentrated around 1 and appears uniform across the entire set of frequency bands. Thus, the distribution peaks at 1 with a lower probability, and the PGC line of plot 506 in FIG. 5 has a relatively constant value. The difference in IC distribution between the sample UGC and the sample PGC can be attributed to the nature of the recording device and the presence of noise when the UGC is made. Two diffuse noise channels (found in non-professional environments) captured directly by two microphones (found in user equipment) result in an MSC that is a frequency-dependent function MSC(f)=sinc(2πfd/c), where d represents the microphone distance, c represents the speed of sound (in m/s), and f represents the frequency. For frequencies above f0=c/2d, the MSC becomes very low, so the noise between the two channels can be considered uncorrelated. For frequencies below f0=c/2d, the MSC becomes high, so the noise between the two channels is highly correlated, corresponding to an IC near 1. Thus, the IC profile of UGC has high values in the lower frequency bands and low values in the higher frequency bands.
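A corresponding sketch for the per-band magnitude-squared coherence of the reconstructed equation above, under the same assumptions:

    import numpy as np

    def ic_msc(s1: np.ndarray, s2: np.ndarray, bands: list, eps: float = 1e-12) -> np.ndarray:
        """Magnitude-squared coherence per band and frame, one reading of the IC described above."""
        cross = s1 * np.conj(s2)
        e1 = np.abs(s1) ** 2
        e2 = np.abs(s2) ** 2
        out = np.empty((len(bands), s1.shape[1]))
        for k, bins in enumerate(bands):
            num = np.abs(cross[bins].sum(axis=0)) ** 2
            den = e1[bins].sum(axis=0) * e2[bins].sum(axis=0) + eps
            out[k] = num / den
        return out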
In some embodiments, the system 102 calculates statistics related to the spatial indices of the dual-channel content. For each band, the system 102 applies a moving window of N frames to the dual-channel content and calculates a value for each spatial index for each window. For example, N may be 128. Each window and calculated value is associated with a current frame. The N frames may include the current frame and the immediately preceding frames. The N frames may also include immediately following frames when look-ahead is feasible.
In some embodiments, system 102 then calculates the aggregate value for each spatial index over all frames having an associated window for the entire dual channel content. The aggregate value forms a statistical feature of the dual channel content that can be used to distinguish UGC from PGC. The aggregate value may include a mean, variance, or estimated peak for each frequency band, as discussed further below. For example, the system applies a 128 frame moving window to the dual channel content. For the first frequency band, the system calculates a first value for the ILD on frames 1 through 128, a second value for the ILD on frames 2 through 129, and so on. For the first frequency band, the system calculates the mean of all the ILD values as a statistical feature. The aggregate value may also include a ratio of estimated peaks across the frequency band, as also discussed further below.
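One possible implementation of these window statistics (an assumption; here the per-frame index values from the earlier sketches are simply averaged over each N-frame window, and the mean and variance of the windowed values are then taken per band):

    import numpy as np

    def windowed_stats(index_values: np.ndarray, n: int = 128):
        """index_values: per-frame values of one spatial index (e.g. ILD), shape (num_bands, L).
        Returns the windowed values, plus the per-band mean and variance over all windows."""
        bands, frames = index_values.shape
        if frames < n:
            raise ValueError("need at least N frames")
        kernel = np.ones(n) / n
        windowed = np.stack([np.convolve(index_values[k], kernel, mode="valid")
                             for k in range(bands)])   # value associated with each current frame
        return windowed, windowed.mean(axis=1), windowed.var(axis=1)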
As shown in FIGS. 3-5, the distribution typically peaks at 0 dB for the ILD, at 0 rad for the IPD, and at 1 for the IC. The mean value is expected to approach the value at which the distribution peak is located. The mean value is expected to be different between UGC and PGC for the IC in different frequency bands. The variance is expected to be different between UGC and PGC for all of the ILD, the IPD, and the IC.
In some embodiments, the estimated peak may be calculated from the dual-channel content as follows:

PeakV(k)=(1/N)·∑l: th1≤V(k,l)≤th2 V(k,l)    (5)

where V(k,l) represents the ILD, IPD, or IC value (over the associated window) for band k and frame l, th1 and th2 represent the lower and upper limits of a range around the value at which the distribution of V(k,l) peaks, and N is the count of the values V(k,l) under consideration that fall within that range. For example, for the ILD, th1=-0.5 and th2=0.5; for the IPD, th1=-0.0314 and th2=0.0314; for the IC, th1=0.99 and th2=1. Thus, the estimated peak is an average of the spatial index values falling within a range around the value at which the value distribution peaks. As shown in FIGS. 3-5, the estimated peak is expected to be different between UGC and PGC for the IC in different frequency bands.
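A sketch following the reconstructed equation above (the mean of the windowed values that fall inside the peak range); the fallback value for an empty range is an assumption:

    import numpy as np

    def estimated_peak(windowed: np.ndarray, th1: float, th2: float) -> np.ndarray:
        """windowed: windowed index values of shape (num_bands, num_windows).
        Returns the per-band estimated peak, e.g. with th1=-0.5, th2=0.5 for the ILD in dB."""
        peaks = np.empty(windowed.shape[0])
        for k, row in enumerate(windowed):
            inside = row[(row >= th1) & (row <= th2)]
            peaks[k] = inside.mean() if inside.size else 0.0  # assumption: 0 when no value falls in range
        return peaks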
In some embodiments, in addition to or instead of calculating the aggregate values for comparing UGC and PGC on a per-band basis, the system 102 calculates a first ratio of estimated peaks across the frequency bands as follows:

Ratio1(V)=∑k=L1..L2 PeakV(k) / ∑k=H1..H2 PeakV(k)    (6)

where L1 and L2 are the cut-off band indices of the low frequency bands, and H1 and H2 are the cut-off band indices of the high frequency bands. As shown in FIGS. 3-5, the first ratio is expected to be different between UGC and PGC for all of the ILD, the IPD, and the IC.
In some embodiments, the system 102 calculates a second ratio of estimated peaks across all frequency bands as follows:

Ratio2(V)=∑k=L1..L2 PeakV(k) / ∑k=1..K PeakV(k)    (7)

where all K frequency bands are considered in the denominator. As shown in FIGS. 3-5, the second ratio is expected to be different between UGC and PGC for all of the ILD, the IPD, and the IC.
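A sketch of both ratios under the reconstructed equations above; the band index ranges for the low and high groups are illustrative assumptions:

    import numpy as np

    def peak_ratios(peaks: np.ndarray, low=(0, 10), high=(30, 40)) -> tuple:
        """peaks: per-band estimated peaks for one spatial index (e.g. 41 ERB-style bands)."""
        low_sum = peaks[low[0]:low[1] + 1].sum()
        ratio1 = low_sum / peaks[high[0]:high[1] + 1].sum()  # low bands vs. high bands
        ratio2 = low_sum / peaks.sum()                       # low bands vs. all bands
        return ratio1, ratio2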
In some embodiments, the system 102 creates a feature vector from the statistical features for classification purposes. The statistical features may be weighted before forming the feature vector. For example, the ratios may be considered more discriminative and thus be given greater weight.
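A minimal sketch of assembling such a weighted feature vector; the feature names, the dictionary layout, and the weighting factor are assumptions:

    import numpy as np

    def spatial_feature_vector(per_band_stats: dict, cross_band_ratios: dict,
                               ratio_weight: float = 2.0) -> np.ndarray:
        """per_band_stats maps names such as "ILD_mean" or "IC_peak" to per-band arrays;
        cross_band_ratios maps "ILD", "IPD", "IC" to their two cross-band ratios."""
        parts = [np.asarray(per_band_stats[name]) for name in sorted(per_band_stats)]
        parts += [ratio_weight * np.asarray(cross_band_ratios[name]) for name in sorted(cross_band_ratios)]
        return np.concatenate(parts)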
4.2. Temporal and spectral feature extraction
The capabilities and arrangement of the recording device and the capabilities of the post-production tools may determine the temporal or spectral characteristics of the digital audio data. For example, in terms of temporal characteristics, PGC may exhibit more variation over time as a result of applying panning techniques during post-production. As another example, in terms of spectral characteristics, UGC may have significant values only in a limited frequency range due to the relatively low sensitivity of the microphone in a user device.
In some embodiments, the system 102 calculates temporal features from the dual-channel audio data. The system 102 first downmixes the dual-channel audio data into single-channel audio data in the time domain by taking the average of the signals in the two channels. The system 102 then calculates known temporal features, such as the energy distribution, energy flux, zero-crossing rate, or maximum amplitude, from the single-channel audio data. The system 102 then similarly creates a feature vector from the temporal features for classification purposes.
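An illustrative sketch of a few of these time-domain features (frame and hop sizes are assumptions):

    import numpy as np

    def temporal_features(stereo: np.ndarray, frame: int = 1024, hop: int = 512) -> np.ndarray:
        """Downmix to mono and compute short-time energy flux, zero-crossing rate and maximum amplitude."""
        mono = stereo.mean(axis=1)
        frames = [mono[i:i + frame] for i in range(0, len(mono) - frame + 1, hop)]
        energy = np.array([np.sum(f ** 2) for f in frames])
        energy_flux = np.abs(np.diff(energy)).mean()
        zcr = np.mean([np.mean(np.abs(np.diff(np.signbit(f).astype(np.int8)))) for f in frames])
        max_amp = np.abs(mono).max()
        return np.array([energy_flux, zcr, max_amp])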
In some embodiments, system 102 then also converts the single-channel audio data into spectral audio data in the frequency domain. System 102 then calculates known spectral features, such as spectral centroid, spectral flux, spectral density, spectral roll-off, or mel-frequency cepstral coefficients (MFCCs), from the spectral audio data. The system 102 then similarly creates feature vectors from the spectral features for classification purposes.
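A corresponding sketch of a few spectral features on the mono downmix; the 85% roll-off threshold and frame size are assumptions, and MFCCs would typically be computed with a dedicated audio library:

    import numpy as np
    from scipy.signal import stft

    def spectral_features(stereo: np.ndarray, fs: int = 48000) -> np.ndarray:
        """Returns the averaged spectral centroid, spectral flux, and spectral roll-off."""
        mono = stereo.mean(axis=1)
        freqs, _, spec = stft(mono, fs=fs, nperseg=1024)
        mag = np.abs(spec)                                          # shape (F, L)
        centroid = (freqs[:, None] * mag).sum(axis=0) / (mag.sum(axis=0) + 1e-12)
        flux = np.abs(np.diff(mag, axis=1)).sum(axis=0)
        cum = np.cumsum(mag, axis=0)
        rolloff = freqs[np.argmax(cum >= 0.85 * cum[-1], axis=0)]   # first bin reaching 85% of energy
        return np.array([centroid.mean(), flux.mean(), rolloff.mean()])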
4.3. Background feature extraction
The nature of the recording environment and the functionality of the post-production tool may significantly determine the background characteristics of the digital audio data. For example, UGC may have more noise, while PGCs may have more background sound effects.
In some embodiments, the system 102 first extracts background audio data from the dual-channel audio data in the time domain using a background separation method known to those skilled in the art, such as the REpeating Pattern Extraction Technique (REPET). The system 102 then uses the same methods described in the previous section to calculate temporal and spectral features ("background features") of the background audio data. The system 102 similarly creates a feature vector from the background features for classification purposes.
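REPET itself estimates a repeating period of the spectrogram; as a much-simplified stand-in (an assumption, not the method of this disclosure), the temporally stable part of each channel's spectrogram can be treated as the background:

    import numpy as np
    from scipy.signal import stft, istft
    from scipy.ndimage import median_filter

    def separate_background(stereo: np.ndarray, fs: int = 48000, nperseg: int = 1024) -> np.ndarray:
        """Crude background estimate: a long median filter along time approximates the
        repeating/stationary part; a soft mask is applied to each channel and inverted."""
        background = np.zeros(stereo.shape)
        for ch in range(stereo.shape[1]):
            _, _, spec = stft(stereo[:, ch], fs=fs, nperseg=nperseg)
            mag = np.abs(spec)
            bg_mag = median_filter(mag, size=(1, 51))           # smooth over ~51 frames per bin
            mask = np.minimum(bg_mag / (mag + 1e-12), 1.0)      # soft background mask
            _, rec = istft(spec * mask, fs=fs, nperseg=nperseg)
            n = min(len(rec), stereo.shape[0])
            background[:n, ch] = rec[:n]
        return background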
4.4. Construction and execution of classification models
In some embodiments, the system 102 builds a digital model for classifying given dual-channel audio data as UGC or PGC. A digital "model" herein refers to a collection of digitally stored executable instructions and data values, associated with one another, which are capable of receiving and responding to programmatic or other digital calls, invocations, or requests for resolution based on specified input values, to produce one or more stored or calculated output values that may serve as the basis for computer-implemented recommendations, output data displays, machine control, or the like. The digital model can be trained using a set of UGC segments and a set of PGC segments, each UGC segment having two channels and being associated with a UGC label, and each PGC segment having two channels and being associated with a PGC label. The system 102 extracts spatial, temporal, spectral, or background features from each UGC segment and PGC segment to generate a set of feature vectors, as discussed in sections 4.1 through 4.3. The system 102 then trains the digital model, using the set of feature vectors as input data and optionally the corresponding set of labels as expected output data, to generate a UGC or PGC label for given dual-channel audio data. The digital model may be a known classification model, such as a Gaussian Mixture Model (GMM), an adaptive boosting (AdaBoost) algorithm, a Support Vector Machine (SVM), or a Deep Neural Network (DNN).
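A minimal training sketch, assuming an SVM via scikit-learn and a 1-for-UGC / 0-for-PGC label encoding (both assumptions; a GMM, AdaBoost, or a DNN would be alternatives as noted above):

    import numpy as np
    from sklearn.svm import SVC

    def train_ugc_pgc_classifier(feature_vectors: np.ndarray, labels: np.ndarray) -> SVC:
        """feature_vectors: (num_segments, num_features) built as in sections 4.1-4.3;
        labels: 1 for UGC, 0 for PGC."""
        model = SVC(kernel="rbf", probability=True)
        return model.fit(feature_vectors, labels)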
In some embodiments, the system 102 executes the digital model on particular dual-channel audio data. The system 102 extracts a set of feature vectors from the particular dual-channel audio data as discussed in sections 4.1 through 4.3. The system 102 then executes the digital model using the set of feature vectors as input data to generate a UGC or PGC label as output data. The system 102 may cause the label to be displayed. The system 102 may also send the particular dual-channel audio data to the appropriate processing system based on the label. For the UGC label, the particular dual-channel audio data is sent to a processing system configured to enhance UGC data. For the PGC label, the particular dual-channel audio data is sent to a processing system configured to enhance PGC data. The system 102 may further process the particular dual-channel audio data based on the label.
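Continuing the sketch above, classification and routing to a pipeline might look as follows (the pipeline callables are assumptions):

    def classify_and_route(model, feature_vector, ugc_pipeline, pgc_pipeline, audio):
        """Run the trained model on one feature vector and hand the audio to the
        corresponding processing pipeline."""
        is_ugc = bool(model.predict(feature_vector.reshape(1, -1))[0])
        return ugc_pipeline(audio) if is_ugc else pgc_pipeline(audio)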
5. Example flow
FIG. 6 illustrates an example process performed by an audio management computer system according to some embodiments described herein. For purposes of illustrating a clear example, FIG. 6 is shown in simplified schematic format, and other embodiments may include more, fewer, or different elements connected in various ways. FIG. 6 is intended to disclose an algorithm, plan, or outline that may be used to implement one or more computer programs or other software elements which, when executed, cause performance of the functional improvements and technical advances described herein. Furthermore, the flow diagrams herein are described at the same level of detail that persons of ordinary skill in the art ordinarily use to communicate with one another about algorithms, plans, or specifications forming a basis of software programs that they plan to code or implement using their accumulated skill and knowledge.
In step 602, the system 102 is programmed to receive digital audio content having two channels in a time-frequency representation over a plurality of frames and a plurality of frequency bands.
In step 604, the system 102 is programmed to calculate a set of values for a set of spatial indicators for each frame of the subset of the plurality of frames and each frequency band of the plurality of frequency bands to obtain a set of values for each frequency band, the set of spatial indicators being applied to the two channels and comprising an inter-aural level difference (ILD), an inter-aural phase difference (IPD), or an inter-aural coherence (IC).
In some embodiments, the system 102 is specifically programmed to apply a moving window covering the current frame to the digital audio content and calculate a set of values for the set of spatial indicators over all frames covered by the moving window.
In step 606, the system 102 is programmed to calculate a statistical feature list from the set of values for each of the plurality of frequency bands, the statistical feature list including a first statistical feature for only one of the plurality of frequency bands and a second statistical feature over all of the plurality of frequency bands.
In some embodiments, the first statistical feature is, for each spatial indicator in the set of spatial indicators, a mean or variance of the values of the spatial indicator over the subset of frames. In other embodiments, the second statistical feature is a ratio, wherein the numerator of the ratio relates to the most frequently occurring value of the spatial indicator for each frequency band in a lowest subset of the plurality of frequency bands, and the denominator of the ratio relates to the most frequently occurring value of the spatial indicator for each frequency band in a highest subset of the plurality of frequency bands or for each of the plurality of frequency bands.
In some embodiments, system 102 is programmed to receive digital audio data in a time domain comprising a plurality of frames and generate digital audio content from the digital audio data. The system 102 is also programmed to calculate a set of temporal features from the digital audio data. In other embodiments, the system 102 is programmed to generate processed audio data in the frequency domain from the digital audio data and calculate a set of spectral features from the processed audio data. In further embodiments, the system is further programmed to extract a background portion in the time domain from the digital audio data, generate a spectral portion in the frequency domain from the background portion, and calculate a particular set of temporal features from the background portion and a particular set of spectral features from the spectral portion.
In some embodiments, the system 102 is programmed to receive a segment group comprising a plurality of UGC segments and a plurality of PGC segments, each segment in the segment group having two channels in a time-frequency representation. The system 102 is programmed to further calculate the set of values of the set of spatial indicators for each of the plurality of frequency bands and each of the plurality of frames in each segment of the segment group, to obtain a group of sets of values for each frequency band. The system 102 is then programmed to calculate a list of statistical features from the group of sets of values for each of the plurality of frequency bands for each segment in the segment group, to obtain a set of statistical feature lists. The system 102 is finally programmed to build a classification model from the set of statistical feature lists. In some embodiments, the classification model is a Gaussian mixture model, an adaptive boosting (AdaBoost) algorithm, a support vector machine, or a deep neural network.
In some embodiments, system 102 is programmed to incorporate the first statistical feature, an index of one frequency band associated with the first statistical feature, and the second statistical feature into the feature vector.
In step 608, the system 102 is programmed to execute the classification model with the statistical feature list or the feature vector as input data and an indication of whether the digital audio content is UGC or PGC as output data.
In some embodiments, the system 102 is programmed to execute the classification model with the set of temporal features as the first additional input data. In other embodiments, the system 102 is programmed to execute the classification model with the set of spectral features as the second additional input data. In an additional embodiment, the system 102 is programmed to execute the classification model using the set of specific temporal features and the set of specific spectral features as third additional input data.
In some embodiments, the system 102 is further programmed to process the digital audio content based on the results of the determination and send the processed results to the playback device.
In step 610, the system 102 is programmed to transmit output data.
6. Hardware implementation
According to one embodiment, the techniques described herein are implemented by at least one computing device. The techniques may be implemented, in whole or in part, using a combination of at least one server computer and/or other computing devices coupled using a network, such as a packet data network. The computing device may be hardwired to perform the techniques, or may include a digital electronic device, such as at least one Application Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA) that is permanently programmed to perform the techniques, or may include at least one general purpose hardware processor programmed to perform the techniques according to program instructions in firmware, memory, other storage, or a combination. Such computing devices may also incorporate custom hardwired logic, ASICs, or FPGAs in combination with custom programming to accomplish the described techniques. The computing device may be a server computer, workstation, personal computer, portable computer system, handheld device, mobile computing device, wearable device, body mounted or implantable device, smart phone, smart appliance, networking device, autonomous or semi-autonomous device (e.g., a robot) or unmanned ground vehicle or aircraft, any other electronic device that incorporates hardwired and/or program logic to implement the described technology, one or more virtual computing machines or instances in a data center, and/or a network of server computers and/or personal computers.
Various aspects of the disclosed embodiments may be understood from the following Enumerated Example Embodiments (EEEs):
EEE 1. A computer-implemented method of classifying audio as User Generated Content (UGC) or Professionally Generated Content (PGC) includes receiving, by a processor, digital audio content having two channels in a time-frequency representation over a plurality of frames and a plurality of frequency bands; calculating, by the processor, for each frame of at least a subset of the plurality of frames and each frequency band of the plurality of frequency bands, a respective set of values for a corresponding set of spatial indicators to obtain a set of values for each frequency band, the set of spatial indicators being applied to the two channels and comprising at least one of an inter-aural level difference (ILD), an inter-aural phase difference (IPD), or an inter-aural coherence (IC); calculating a set of statistical features from the set of values for each of the plurality of frequency bands, the set of statistical features including a first statistical feature for only one of the plurality of frequency bands and a second statistical feature over a number of frequency bands of the plurality of frequency bands; executing a classification model using the set of statistical features as input data and an indication of whether the digital audio content is UGC or PGC as output data; and transmitting the output data.
EEE 2. The computer-implemented method of EEE 1, further comprising: processing the digital audio content based on a result of the executing; and transmitting a result of the processing to a playback device.
EEE 3. The computer-implemented method of EEE 1 or 2, the calculating comprising applying a moving window covering a current frame to the digital audio content and calculating the set of values of the set of spatial indicators over all frames covered by the moving window.
EEE 4. The computer-implemented method of any of EEEs 1-3, the first statistical feature being, for each spatial indicator in the set of spatial indicators, a mean or variance of values of the spatial indicator over the subset of frames.
EEE 5. The computer-implemented method of any of EEEs 1-4, the second statistical feature being a ratio, wherein a numerator of the ratio relates to the most frequently occurring value of a spatial indicator for each frequency band in a lowest subset of the plurality of frequency bands, and a denominator of the ratio relates to the most frequently occurring value of the spatial indicator for each frequency band in a highest subset of the plurality of frequency bands or for each frequency band of the plurality of frequency bands.
EEE 6. The computer-implemented method of any of EEEs 1-5, the executing comprising incorporating the first statistical feature, an index of the one frequency band associated with the first statistical feature, and the second statistical feature into a feature vector.
EEE 7. The computer-implemented method of any of EEEs 1-6, further comprising: receiving digital audio data in a time domain comprising the plurality of frames; and generating the digital audio content from the digital audio data.
EEE 8. The computer-implemented method of EEE 7, further comprising calculating a set of temporal features from the digital audio data, the executing being performed with the set of temporal features as first additional input data.
EEE 9. The computer-implemented method of EEE 7, further comprising: generating processed audio data in a frequency domain from the digital audio data; and calculating a set of spectral features from the processed audio data, the executing being performed with the set of spectral features as second additional input data.
EEE 10. The computer-implemented method of EEE 7, further comprising: extracting a background portion in the time domain from the digital audio data; generating a spectral portion in the frequency domain from the background portion; and calculating a specific set of temporal features from the background portion and a specific set of spectral features from the spectral portion, the executing being performed with the specific set of temporal features and the specific set of spectral features as third additional input data.
EEE 11. The computer-implemented method of any of EEEs 1-10, further comprising: receiving a segment group comprising a plurality of UGC segments and a plurality of PGC segments, each segment in the segment group having two channels in a time-frequency representation; calculating the set of values of the set of spatial indicators for each of the plurality of frequency bands and each of the plurality of frames in each of the segments in the segment group to obtain a group of sets of values for each frequency band; calculating a statistical feature list from the group of sets of values for each of the plurality of frequency bands for each segment in the segment group to obtain a set of statistical feature lists; and building the classification model from the set of statistical feature lists.
EEE 12. The computer-implemented method of any of EEEs 1-11, the classification model being a Gaussian mixture model, an adaptive boosting (AdaBoost) algorithm, a support vector machine, or a deep neural network.
EEE 13. A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform the operations of any of EEEs 1-12.
EEE 14. A computer system for classifying audio as User Generated Content (UGC) or Professionally Generated Content (PGC), comprising: a memory; and one or more processors coupled to the memory and configured to perform: receiving a segment group comprising a plurality of UGC segments and a plurality of PGC segments, each segment in the segment group having two channels in a time-frequency representation, each segment of the plurality of UGC segments being associated with a UGC label, and each segment of the plurality of PGC segments being associated with a PGC label; calculating, for each frame of at least a subset of the plurality of frames in each segment of the segment group and each frequency band of the plurality of frequency bands, a respective set of values for a corresponding set of spatial indicators to obtain a group of sets of values for each frequency band, the set of spatial indicators being applied to the two channels of each segment in the segment group and comprising at least one of an inter-aural level difference (ILD), an inter-aural phase difference (IPD), or an inter-aural coherence (IC); calculating, for each segment in the segment group, a set of statistical features from the set of values for each of the plurality of frequency bands to obtain a set of statistical feature lists, the set of statistical features including a first statistical feature for only one of the plurality of frequency bands and a second statistical feature over a number of frequency bands of the plurality of frequency bands; building a classification model from the set of statistical feature lists; receiving digital audio content having two channels in a time-frequency representation; and assigning a UGC label or a PGC label to the digital audio content using the classification model.
EEE 15. The computer system of claim 14, the one or more processors further configured to perform the sending of the classification model.
EEE 16. The computer system of claim 14 or 15, the calculating comprising applying a moving window covering a current frame to a segment and calculating the set of values of the set of spatial indicators over all frames covered by the moving window.
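A moving window as recited in EEE 16 can be sketched as follows; the 11-frame window length and the simple averaging over the covered frames are assumptions for illustration.

```python
# Hedged sketch: the window length and the uniform averaging kernel are assumptions.
import numpy as np

def windowed_band_values(band_values: np.ndarray, win: int = 11) -> np.ndarray:
    """Average per-frame indicator values of one band over all frames covered by a
    moving window centered on the current frame."""
    half = win // 2
    padded = np.pad(band_values, (half, half), mode="edge")
    return np.convolve(padded, np.ones(win) / win, mode="valid")       # same length as input

# Example: smooth one band of ILD values across frames.
ild_band = np.random.default_rng(1).normal(size=100)
print(windowed_band_values(ild_band).shape)                             # (100,)
```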
EEE 17. The computer system of any of claims 14-16, the one or more processors further configured to perform: receiving a set of audio items in the time domain; and generating the segment group from the set of audio items.
EEE 18. The computer system of claim 17, the one or more processors further configured to perform calculating a set of temporal features from each audio item in the set of audio items to obtain a set of temporal feature sets, the establishing being further performed according to the set of temporal feature sets.
EEE 19. The computer system of claim 17, the one or more processors further configured to perform: generating a set of processed audio items in the frequency domain from the set of audio items; and calculating a set of spectral features from each processed audio item of the set of processed audio items to obtain a set of spectral feature sets, the establishing being further performed according to the set of spectral feature sets.
EEE 20. The computer system of claim 17, the one or more processors further configured to perform: extracting a background portion in the time domain from each audio item in the set of audio items to obtain a set of background portions; generating a spectral portion in the frequency domain from each background portion of the set of background portions to obtain a set of spectral portions; and calculating a set of temporal features from each background portion of the set of background portions to obtain a set of temporal feature sets, and a set of spectral features from each spectral portion of the set of spectral portions to obtain a set of spectral feature sets, the establishing being further performed according to the set of temporal feature sets and the set of spectral feature sets.
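Tying EEEs 17-20 together, a hedged sketch of how one training row per audio item might be assembled is shown below; it reuses the illustrative helpers from the earlier sketches (spatial_indicators, statistical_features, spectral_features, extract_background, background_features), all of which are assumptions rather than the disclosed implementation.

```python
# Hedged sketch: one feature row per audio item, combining the illustrative spatial,
# spectral, temporal, and background-derived features defined in the sketches above.
import numpy as np

def item_feature_row(left: np.ndarray, right: np.ndarray, sr: int) -> np.ndarray:
    mono = 0.5 * (left + right)                                        # simple downmix (assumption)
    ild, ipd, ic = spatial_indicators(left, right)
    spatial = statistical_features(ild, ipd, ic)
    spectral = spectral_features(mono, sr)
    temporal, bg_spectral = background_features(extract_background(mono))
    return np.concatenate([spatial, spectral, temporal, bg_spectral])
```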
FIG. 7 is a block diagram illustrating an example computer system that may be used to implement embodiments. In the example of FIG. 7, computer system 700 and instructions for implementing the disclosed techniques in hardware, software, or a combination of hardware and software are schematically represented as, for example, blocks and circles, at the same level of detail that is commonly used by persons of ordinary skill in the art to which this disclosure pertains to communicate computer architecture and computer system implementations.
Computer system 700 includes an input/output (I/O) subsystem 702, which may include a bus and/or other communication mechanism for communicating information and/or instructions among the components of computer system 700 via electronic signal paths. The I/O subsystem 702 may include an I/O controller, a memory controller, and at least one I/O port. The electrical signal paths are schematically represented in the figures as, for example, lines, unidirectional arrows, or bidirectional arrows.
At least one hardware processor 704 is coupled to the I/O subsystem 702 for processing information and instructions. The hardware processor 704 may include, for example, a general purpose microprocessor or microcontroller and/or a special purpose microprocessor, such as an embedded system or a Graphics Processing Unit (GPU) or a digital signal processor or an ARM processor. The processor 704 may include an integrated Arithmetic Logic Unit (ALU) or may be coupled to a separate ALU.
Computer system 700 includes one or more units of memory 706, such as main memory, coupled to I/O subsystem 702 for electronically and digitally storing data and instructions to be executed by processor 704. The memory 706 may include volatile memory, such as various forms of Random Access Memory (RAM) or other dynamic storage devices. Memory 706 may also be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 704. When stored in a non-transitory computer-readable storage medium accessible to processor 704, such instructions can render computer system 700 into a special-purpose machine that is customized to perform the operations specified in the instructions.
Computer system 700 also includes a non-volatile memory, such as Read Only Memory (ROM) 708 or other static storage device, coupled to I/O subsystem 702 for storing information and instructions for processor 704. ROM 708 may include various forms of Programmable ROM (PROM), such as an Erasable PROM (EPROM) or an Electrically Erasable PROM (EEPROM). A unit of persistent storage 710 may include various forms of non-volatile RAM (NVRAM), such as flash memory, or solid-state storage, magnetic disks, or optical disks such as CD-ROM or DVD-ROM, and may be coupled to I/O subsystem 702 for storing information and instructions. Storage device 710 is an example of a non-transitory computer-readable medium that may be used to store instructions and data that, when executed by processor 704, cause computer-implemented methods to be performed to execute the techniques herein.
The instructions in memory 706, ROM 708, or storage 710 may include one or more sets of instructions organized as a module, method, object, function, routine, or call. The instructions may be organized as one or more computer programs, operating system services, or applications including mobile applications. The instructions may include an operating system and/or system software; one or more libraries supporting multimedia, programming, or other functions; data protocol instructions or stacks for implementing TCP/IP, HTTP or other communication protocols; file processing instructions for interpreting and rendering files encoded using HTML, XML, JPEG, MPEG or PNG; user interface instructions for rendering or interpreting commands of a Graphical User Interface (GUI), a command line interface, or a text user interface; applications such as office suites, internet access applications, design and manufacturing applications, graphics applications, audio applications, software engineering applications, educational applications, games, or other applications. The instructions may implement a web server, a web application server, or a web client. The instructions may be organized as a presentation layer, an application layer, and a data store layer, such as a relational database system using Structured Query Language (SQL) or NoSQL, an object repository, a graph database, a flat file system, or other data store.
Computer system 700 may be coupled to at least one output device 712 via I/O subsystem 702. In one embodiment, output device 712 is a digital computer display. Examples of displays that may be used in various embodiments include touch screen displays or Light Emitting Diode (LED) displays or Liquid Crystal Displays (LCDs) or electronic paper displays. Instead of, or in addition to, a display device, computer system 700 may include other types of output devices 712. Examples of other output devices 712 include printers, ticket printers, plotters, projectors, sound or video cards, speakers, buzzers or piezoelectric devices, or other sound emitting devices, lights or LED or LCD indicators, haptic devices, actuators, or servos.
At least one input device 714 is coupled to the I/O subsystem 702 for communicating signals, data, command selections, or gestures to the processor 704. Examples of input devices 714 include touch screens, microphones, still and video digital cameras, alphanumeric and other keys, keypads, keyboards, tablets, image scanners, joysticks, clocks, switches, buttons, dials, slides, and/or various types of sensors, such as force sensors, motion sensors, thermal sensors, accelerometers, gyroscopes, and Inertial Measurement Unit (IMU) sensors, and/or various types of transceivers, such as wireless, cellular or Wi-Fi, radio Frequency (RF) or Infrared (IR) transceivers, and Global Positioning System (GPS) transceivers.
Another type of input device is control device 716, which may perform cursor control or other automatic control functions, such as navigation in a graphical interface on a display screen, in lieu of or in addition to input functions. The control 716 may be a touchpad, a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to the processor 704 and for controlling cursor movement on the display 712. The input device may have at least two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), which allows the device to specify positions in a plane. Another type of input device is a wired, wireless or optical control device, such as a joystick, stick, console, steering wheel, pedal, gear change mechanism or other type of control device. The input device 714 may include a combination of a plurality of different input devices such as a camera and a depth sensor.
In another embodiment, computer system 700 may include an internet of things (IoT) device in which one or more of the output device 712, input device 714, and control device 716 are omitted. In such embodiments, the input device 714 may include one or more cameras, motion detectors, thermometers, microphones, seismic detectors, other sensors or detectors, measurement devices or encoders, and the output device 712 may include a dedicated display, such as a single-line LED or LCD display, one or more indicators, a display panel, a meter, a valve, a solenoid, an actuator, or a servo.
When computer system 700 is a mobile computing device, the input device 714 may comprise a Global Positioning System (GPS) receiver coupled to a GPS module that is capable of triangulating to a plurality of GPS satellites, determining and generating geo-location or position data such as latitude-longitude values for the geophysical location of computer system 700. Output device 712 may include hardware, software, firmware, and interfaces for generating position reporting packets, notifications, pulse or heartbeat signals, or other recurring data transmissions that specify the position of computer system 700, alone or in combination with other application-specific data, directed toward host 724 or server 730.
Computer system 700 may implement the techniques described herein using customized hard-wired logic, at least one ASIC or FPGA, firmware, and/or program instructions or logic which, when loaded and used or executed in combination with the computer system, cause or program the computer system to operate as a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 700 in response to processor 704 executing at least one sequence of at least one instruction contained in main memory 706. Such instructions may be read into main memory 706 from another storage medium, such as storage device 710. Execution of the sequences of instructions contained in main memory 706 causes processor 704 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
The term "storage medium" as used herein refers to any non-transitory medium that stores data and/or instructions that cause a machine to operate in a specific manner. Such storage media may include non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as memory 710. Volatile media includes dynamic memory, such as memory 706. Common forms of storage media include, for example, a hard disk, a solid state drive, a flash memory drive, a magnetic data storage medium, any optical or physical data storage medium, a memory chip, and so forth.
Storage media are different from, but may be used in conjunction with, transmission media. Transmission media participate in the transfer of information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise the bus of I/O subsystem 702. Transmission media can also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.
Various forms of media may be involved in carrying at least one sequence of at least one instruction to processor 704 for execution. For example, the instructions may initially be carried on a magnetic disk or solid state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a communication link, such as an optical or coaxial cable or a telephone line, using a modem. A modem or router local to computer system 700 can receive the data on the communication link and convert the data for reading by computer system 700. For example, a receiver such as a radio frequency antenna or an infrared detector may receive data carried in the wireless or optical signals and appropriate circuitry may provide the data to the I/O subsystem 702, e.g., placing the data on a bus. The I/O subsystem 702 transfers data to the memory 706, and the processor 704 retrieves and executes instructions from the memory 706. The instructions received by memory 706 may optionally be stored on storage device 710 either before or after execution by processor 704.
Computer system 700 also includes a communication interface 718 coupled to I/O subsystem 702. Communication interface 718 provides a two-way data communication coupling to a network link 720, which network link 720 connects directly or indirectly to at least one communication network, such as network 722 or a public or private cloud on the Internet. For example, communication interface 718 may be an Ethernet network interface, an Integrated Services Digital Network (ISDN) card, a cable modem, a satellite modem, or a modem to provide a data communication connection to a corresponding type of communication line (e.g., an Ethernet cable or any type of metal cable or fiber optic line or telephone line). Network 722 broadly represents a Local Area Network (LAN), Wide Area Network (WAN), campus network, internetwork, or any combination thereof. Communication interface 718 may include a LAN card to provide a data communication connection to a compatible LAN, or a cellular radiotelephone interface wired to transmit or receive cellular data according to a cellular radiotelephone wireless network standard, or a satellite wireless interface wired to transmit or receive digital data according to a satellite wireless network standard. In any such implementation, communication interface 718 sends and receives electrical, electromagnetic or optical signals over signal paths that carry digital data streams representing various types of information.
Network link 720 typically provides electrical, electromagnetic, or optical data communication using, for example, satellite, cellular, wi-Fi, or bluetooth techniques, directly or through at least one network to other data devices. For example, network link 720 may provide a connection through network 722 to a host computer 724.
In addition, network link 720 may provide a connection through network 722 or through internetworking devices and/or computers operated by an Internet Service Provider (ISP) 726 to other computing devices. ISP 726 provides data communication services through the world wide packet data communication network represented by Internet 728. A server computer 730 may be coupled to the internet 728. Server 730 broadly represents any computer, data center, virtual machine or virtual computing instance, with or without a hypervisor, or computer executing a containerized program system such as DOCKER or KUBERNETES. Server 730 may represent an electronic digital service implemented using more than one computer or instance and accessed and used by transmitting network service requests, Uniform Resource Locator (URL) strings with parameters in HTTP payloads, API calls, application service calls, or other service calls. Computer system 700 and server 730 may form elements of a distributed computing system including other computers, processing clusters, server farms, or other computer organizations that cooperate to perform tasks or execute applications or services. Server 730 may include one or more sets of instructions organized as a module, method, object, function, routine, or call. The instructions may be organized as one or more computer programs, operating system services, or applications including mobile applications. The instructions may include an operating system and/or system software; one or more libraries supporting multimedia, programming, or other functions; data protocol instructions or stacks for implementing TCP/IP, HTTP or other communication protocols; file format processing instructions for interpreting or rendering files encoded using HTML, XML, JPEG, MPEG or PNG; user interface instructions for rendering or interpreting commands of a Graphical User Interface (GUI), a command line interface, or a text user interface; applications such as office suites, internet access applications, design and manufacturing applications, graphics applications, audio applications, software engineering applications, educational applications, games, or other applications. Server 730 may include a web application server hosting a presentation layer, an application layer, and a data store layer, such as a relational database system using Structured Query Language (SQL) or NoSQL, an object store, a graph database, a flat file system, or other data store.
Computer system 700 can send messages and receive data, including program code, through the network(s), network link 720 and communication interface 718. In the Internet example, a server 730 might transmit requested code for an application program through Internet 728, ISP 726, local network 722 and communication interface 718. The received code may be executed by processor 704 as it is received, and/or stored in storage device 710, or other non-volatile storage for later execution.
Execution of the instructions described in this section may implement a process in the form of an instance of a computer program that is being executed and that consists of program code and its current activity. Depending on the Operating System (OS), a process may be made up of multiple threads of execution that execute instructions concurrently. In this context, a computer program is a passive collection of instructions, while a process may be the actual execution of those instructions. Multiple processes may be associated with the same program; for example, opening multiple instances of the same program typically means that multiple processes are executing. Multitasking may be implemented to allow multiple processes to share the processor 704. While each processor 704 or core of the processor executes a single task at a time, computer system 700 may be programmed to implement multitasking to allow each processor to switch between tasks that are being executed without having to wait for each task to complete. In embodiments, switches may be performed when a task performs an input/output operation, when a task indicates that it can be switched, or on a hardware interrupt. Time sharing may be implemented to allow fast response for interactive user applications by rapidly performing context switches that provide the appearance that multiple processes are executing concurrently. In an embodiment, for security and reliability, the operating system may prevent direct communication between independent processes, providing strictly mediated and controlled inter-process communication functionality.
7. Extensions and alternatives
In the foregoing specification, embodiments of the disclosure have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the disclosure, and what is intended by the applicants to be the scope of the disclosure, is the literal and equivalent scope of the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.

Claims (20)

1. A computer-implemented method of classifying audio as User Generated Content (UGC) or Professional Generated Content (PGC), comprising:
receiving, by a processor, digital audio content having two channels in a time-frequency representation over a plurality of frames and a plurality of frequency bands;
calculating, by the processor, for each frame of at least a subset of the plurality of frames and each frequency band of the plurality of frequency bands, a respective set of values for a corresponding set of spatial indicators to obtain a set of values for each frequency band, the set of spatial indicators being applied to the two channels and comprising at least one of inter-aural level difference (ILD), inter-aural phase difference (IPD), or inter-aural coherence (IC);
calculating a set of statistical features from the set of values for each of the plurality of frequency bands, the set of statistical features including a first statistical feature for only one of the plurality of frequency bands and a second statistical feature over a number of frequency bands of the plurality of frequency bands;
executing a classification model using the set of statistical features as input data and an indication of whether the digital audio content is UGC or PGC as output data; and
transmitting the output data.
2. The computer-implemented method of claim 1, further comprising:
processing the digital audio content based on a result of the executing; and
sending a result of the processing to a playback device.
3. The computer-implemented method of claim 1 or 2, the calculating comprising applying a moving window covering a current frame to the digital audio content and calculating the set of values of the set of spatial indicators over all frames covered by the moving window.
4. The computer-implemented method of any of claims 1-3, the first statistical feature being, for each spatial indicator in the set of spatial indicators, a mean or variance of the values of the spatial indicator over the subset of frames.
5. The computer-implemented method of any of claims 1-4, the second statistical feature being a ratio, wherein a numerator of the ratio relates to a value of a spatial indicator that occurs most frequently for each frequency band in a lowest frequency-band subset of the plurality of frequency bands, and a denominator of the ratio relates to a value of the spatial indicator that occurs most frequently for each frequency band in a highest frequency-band subset of the plurality of frequency bands or for each frequency band of the plurality of frequency bands.
6. The computer-implemented method of any of claims 1-5, the executing comprising incorporating the first statistical feature, an index of the one frequency band associated with the first statistical feature, and the second statistical feature into a feature vector.
7. The computer-implemented method of any of claims 1-6, further comprising:
receiving digital audio data in the time domain comprising the plurality of frames; and
generating the digital audio content from the digital audio data.
8. The computer-implemented method of claim 7, further comprising:
calculating a set of temporal features from the digital audio data,
the executing being performed using the set of temporal features as first additional input data.
9. The computer-implemented method of claim 7, further comprising:
generating processed audio data in the frequency domain from the digital audio data; and
calculating a set of spectral features from the processed audio data, the executing being performed using the set of spectral features as second additional input data.
10. The computer-implemented method of claim 7, further comprising:
extracting a background portion in the time domain from the digital audio data;
generating a spectral portion in the frequency domain from the background portion;
calculating a specific set of temporal features from the background portion and a specific set of spectral features from the spectral portion; and
the executing being performed using the specific set of temporal features and the specific set of spectral features as third additional input data.
11. The computer-implemented method of any of claims 1-10, further comprising:
receiving a segment group comprising a plurality of UGC segments and a plurality of PGC segments, each segment in the segment group having two channels in a time-frequency representation;
calculating a set of values of the set of spatial indicators for each of the plurality of frequency bands and each of the plurality of frames in each segment in the segment group to obtain a group of sets of values for each frequency band;
calculating a set of statistical features from the set of values for each of the plurality of frequency bands for each segment in the segment group to obtain a set of statistical feature lists; and
establishing the classification model from the set of statistical feature lists.
12. The computer-implemented method of any of claims 1-11, the classification model being a Gaussian mixture model, an adaptive boosting algorithm, a support vector machine, or a deep neural network.
13. A non-transitory computer-readable medium storing instructions which, when executed by one or more processors, cause the one or more processors to perform the method of any of claims 1-12.
14. A computer system for classifying audio as User Generated Content (UGC) or Professional Generated Content (PGC), comprising:
A memory;
one or more processors coupled to the memory and configured to perform:
receiving a segment group comprising a plurality of UGC segments and a plurality of PGC segments,
each segment in the segment group having two channels in a time-frequency representation,
each of the plurality of UGC segments being associated with a UGC tag,
each of the plurality of PGC segments being associated with a PGC tag;
calculating, for each of a plurality of frames and each of a plurality of frequency bands in each segment of the segment group, a respective set of values for a corresponding set of spatial indicators to obtain a group of sets of values for each frequency band, the set of spatial indicators being applied to the two channels of each segment in the segment group and comprising at least one of inter-aural level difference (ILD), inter-aural phase difference (IPD), or inter-aural coherence (IC);
calculating, for each segment in the segment group, a set of statistical features from the set of values for each of the plurality of frequency bands to obtain a set of statistical feature lists, the set of statistical features including a first statistical feature for only one of the plurality of frequency bands and a second statistical feature over a number of frequency bands of the plurality of frequency bands;
establishing a classification model from the set of statistical feature lists;
receiving digital audio content having two channels in a time-frequency representation; and
assigning a UGC tag or a PGC tag to the digital audio content using the classification model.
15. The computer system of claim 14, the one or more processors further configured to perform the sending of the classification model.
16. The computer system of claim 14 or 15, the calculating comprising applying a moving window covering a current frame to a segment and calculating the set of values of the set of spatial indicators over all frames covered by the moving window.
17. The computer system of any of claims 14-16, the one or more processors further configured to perform:
receiving a set of audio items in the time domain; and
generating the segment group from the set of audio items.
18. The computer system of claim 17, the one or more processors further configured to perform:
calculating a set of temporal features from each audio item in the set of audio items to obtain a set of temporal feature sets,
the establishing being further performed according to the set of temporal feature sets.
19. The computer system of claim 17, the one or more processors further configured to perform:
generating a set of processed audio items in the frequency domain from the set of audio items; and
calculating a set of spectral features from each processed audio item of the set of processed audio items to obtain a set of spectral feature sets, the establishing being further performed according to the set of spectral feature sets.
20. The computer system of claim 17, the one or more processors further configured to perform:
extracting a background portion in the time domain from each audio item in the set of audio items to obtain a set of background portions;
generating a spectral portion in the frequency domain from each background portion of the set of background portions to obtain a set of spectral portions;
calculating a set of temporal features from each background portion of the set of background portions to obtain a set of temporal feature sets, and calculating a set of spectral features from each spectral portion of the set of spectral portions to obtain a set of spectral feature sets,
the establishing being further performed according to the set of temporal feature sets and the set of spectral feature sets.
CN202280065631.4A 2021-08-13 2022-08-11 Management of professionally generated and user generated audio content Pending CN118077004A (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CNPCT/CN2021/112543 2021-08-13
CN2021112543 2021-08-13
US63/243,634 2021-09-13
US202163288521P 2021-12-10 2021-12-10
US63/288,521 2021-12-10
PCT/US2022/040089 WO2023018889A1 (en) 2021-08-13 2022-08-11 Management of professionally generated and user-generated audio content

Publications (1)

Publication Number Publication Date
CN118077004A true CN118077004A (en) 2024-05-24

Family

ID=91104459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280065631.4A Pending CN118077004A (en) 2021-08-13 2022-08-11 Management of professionally generated and user generated audio content

Country Status (1)

Country Link
CN (1) CN118077004A (en)

Similar Documents

Publication Publication Date Title
US10679676B2 (en) Automatic generation of video and directional audio from spherical content
CN113767434B (en) Tagging video by correlating visual features with sound tags
EP2891955B1 (en) In-vehicle gesture interactive spatial audio system
US20150370474A1 (en) Multiple view interface for video editing system
US20140173440A1 (en) Systems and methods for natural interaction with operating systems and application graphical user interfaces using gestural and vocal input
KR20200107757A (en) Method and apparatus for sound object following
US9288594B1 (en) Auditory environment recognition
US11996095B2 (en) Augmented reality enabled command management
WO2022094293A1 (en) Deep-learning based speech enhancement
US10761965B2 (en) Detecting method calls based on stack trace data
US20150147045A1 (en) Computer ecosystem with automatically curated video montage
US10700935B1 (en) Automatic configuration and operation of complex systems
CN104900236A (en) Audio signal processing
US11133023B1 (en) Robust detection of impulsive acoustic event onsets in an audio stream
WO2022111579A1 (en) Voice wakeup method and electronic device
US11269789B2 (en) Managing connections of input and output devices in a physical room
WO2021081002A1 (en) Deep source separation architecture
US20170206898A1 (en) Systems and methods for assisting automatic speech recognition
CN118077004A (en) Management of professionally generated and user generated audio content
WO2023018889A1 (en) Management of professionally generated and user-generated audio content
CN109785558A (en) Door and window alarm method based on artificial intelligence
CN113269301B (en) Method and system for estimating parameters of multi-target tracking system based on neural network
US20240056761A1 (en) Three-dimensional (3d) sound rendering with multi-channel audio based on mono audio input
US20230134400A1 (en) Automatic adaptation of multi-modal system components
WO2023018880A1 (en) Reverb and noise robust voice activity detection based on modulation domain attention

Legal Events

Date Code Title Description
PB01 Publication