CN116803105A - Audio content identification

Audio content identification

Info

Publication number
CN116803105A
Authority
CN
China
Prior art keywords
audio
confidence score
confidence
classifier
audio signal
Prior art date
Legal status
Pending
Application number
CN202180062659.8A
Other languages
Chinese (zh)
Inventor
王贵平
芦烈
Current Assignee
Dolby Laboratories Licensing Corp
Original Assignee
Dolby Laboratories Licensing Corp
Priority date
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corp filed Critical Dolby Laboratories Licensing Corp
Priority claimed from PCT/US2021/046454 external-priority patent/WO2022040282A1/en
Publication of CN116803105A publication Critical patent/CN116803105A/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/81: Detection of presence or absence of voice signals for discriminating voice from music


Abstract

A method of audio content identification uses a two-stage classifier. The first stage includes a preexisting classifier and the second stage includes a novel classifier. Outputs of the first stage computed over different time periods are combined to generate a pilot signal. The final classification result is derived from the pilot signal and a combination of the output of the first stage and the output of the second stage. In this way, new classes of classifiers can be added without disrupting the behavior of the existing classifiers.

Description

Audio content identification
Cross Reference to Related Applications
The present application claims priority from the following applications: PCT International Application PCT/CN2020/109744, filed August 18, 2020; U.S. Provisional Application 63/074,621, filed September 4, 2020; and EP Application 20200318.2, filed October 6, 2020.
Technical Field
The present disclosure relates to audio processing, and in particular to audio content recognition.
Background
Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
With the advent of consumer entertainment devices such as smartphones, tablets, and PCs, audio playback has become ubiquitous. There is also a vast number of audio applications, such as high-fidelity playback, streaming media, games, podcasts, short videos, and live user broadcasts. Accordingly, to improve the overall quality of audio and provide differentiated user experiences, different audio processing algorithms exist to enhance the audio signal for various purposes. Typical examples of audio processing algorithms include dialog enhancement and intelligent equalization.
Dialog enhancement generally enhances speech signals. Dialog is an important component for understanding the story in movies. Dialog enhancement provides a method for enhancing dialog to improve its clarity and intelligibility, especially for elderly listeners with reduced hearing ability.
Intelligent equalization typically performs dynamic adjustment of the audio tone. It is typically applied to musical content to provide consistency of spectral balance (the so-called "tone" or "timbre"). It achieves this consistency by continuously monitoring the spectral balance of the audio, comparing it to a desired tone, and dynamically adjusting an equalization filter to transform the original tone of the audio into the desired tone.
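For illustration only, the following is a minimal sketch of this monitor-compare-adjust loop, not Dolby's implementation; the band edges, flat target tone, and gain limit are assumptions.

```python
# Minimal sketch of the intelligent-equalization idea: per-band gains nudge the
# measured spectral balance toward a desired target tone. Band edges, target
# curve, and the gain limit are illustrative assumptions.
import numpy as np

def ieq_band_gains(frame, sr=48000, target_db=None, max_gain_db=6.0):
    """Estimate per-band EQ gains that move the frame's tone toward a target."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    edges = [0, 300, 1000, 3000, 6000, sr / 2]          # illustrative bands
    band_db = np.array([
        10 * np.log10(spec[(freqs >= lo) & (freqs < hi)].sum() + 1e-12)
        for lo, hi in zip(edges[:-1], edges[1:])
    ])
    if target_db is None:                                # flat target tone (assumption)
        target_db = np.full_like(band_db, band_db.mean())
    gains_db = np.clip(target_db - band_db, -max_gain_db, max_gain_db)
    return gains_db                                      # to be applied via an EQ filter bank
```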
Generally, audio processing algorithms have their own application scenarios/contexts. That is, the audio processing algorithm may only be applicable to a specific set of content and not to all possible audio signals, as different content may need to be processed in different ways. For example, the dialogue enhancement method is generally applied to movie contents. If applied to music without conversation, it may erroneously boost certain frequency subbands and introduce severe timbre variations and perceived inconsistencies. Also, if an intelligent equalization method is applied to movie content, tone artifacts will be audible. However, for an audio processing system, its input may be any possible type of audio signal. It is therefore important to identify or distinguish the content being processed so as to apply the most appropriate algorithm (or the most appropriate parameters of each algorithm) to the corresponding content.
A general content-adaptive audio processing system includes three functions: audio content identification, guidance, and audio processing.
Audio content identification automatically identifies the audio type of the content being played. Audio classification techniques may be applied to identify audio content through signal processing, machine learning, and pattern recognition. Confidence scores are estimated that represent the probability that the audio content belongs to each of a set of predefined target audio types.
The guidance generally directs the behavior of the audio processing algorithm. It estimates the most appropriate parameters for the corresponding audio processing algorithm based on the results obtained from the audio content recognition.
Audio processing typically applies audio processing using the estimated parameters to an input audio signal to generate an output audio signal.
Disclosure of Invention
With the ever-increasing variety of audio content and new applications, especially user-generated content and corresponding applications (e.g., chat, streaming, live broadcast, short video, etc.), it becomes necessary to improve the audio identifier (classifier) and guidance algorithms in existing systems to meet the performance requirements of new content or new use cases. Taking music as an example, pop music genres, including jazz, country, rock, and Latin, have in the past tended to be the mainstay across different applications. Thus, the general music classifier in many existing systems is primarily directed at identifying the above music genres and accurately generating confidence scores for subsequent guidance algorithms and audio processing algorithms. As fashion trends change, many people prefer to listen to different music genres, such as rap/hip-hop, electronic music, or combinations of different music styles. In particular, rap music is mainly composed of (rhythmic) speech, which is difficult to distinguish from ordinary dialog. In many existing cases, the original music classifier typically does not provide sufficient accuracy for classifying rap music or a cappella (unaccompanied) music. As a result, some segments/frames of rap music may be erroneously identified as speech and subsequently boosted by the dialog enhancer, triggering audible artifacts.
Furthermore, as demand from customers increases, the audio processing system may need to provide new functionality, which further requires the audio classifier to identify certain audio content types. Both of the above scenarios require a new classifier. While the new audio classifier provides more classification results, it is also desirable that the classification results for the originally supported content types (e.g., dialog or music) still be similar to the classification results from the old classifier so that no significant tuning of other audio processing algorithms (e.g., dialog enhancement and intelligent equalization) is required after the new classifier is used.
In view of the above, there is a need to add new classes of classifiers to existing classification systems while keeping the audio processing behavior close to the original. Whether the original classifier is modified for particular new content or new functionality is added, it is often not easy to transparently update or replace the old classifier with the new one. After the identifier is replaced, the entire system may not work optimally in a straightforward manner. In many cases, subsequent guidance algorithms and audio processing algorithms may also require corresponding refinement or tuning after the identifier is updated; furthermore, behavioral tests on previous content that users expect the original music identifier to satisfy may no longer pass. This may introduce a significant amount of additional retuning effort to fully integrate the new components, which is undesirable.
In this disclosure, we propose a method to improve original content recognition for new content while minimizing additional effort for development or validation. Described herein are techniques related to using a two-stage audio classifier.
According to an embodiment, an audio processing method includes receiving an audio signal and performing feature extraction on the audio signal to extract a plurality of features. The method further includes classifying the plurality of features according to a first audio classification model to generate a first set of confidence scores and classifying the plurality of features according to a second audio classification model to generate a second confidence score. The method further includes calculating a pilot signal by combining a first confidence score of the first set of confidence scores and another confidence score of the first set of confidence scores. The method further includes calculating a final confidence score from the pilot signal, the first set of confidence scores, and the second confidence score. The method further includes outputting a classification of the audio signal according to the final confidence score.
According to another embodiment, an apparatus includes a processor and a memory. The processor is configured to control the apparatus to implement one or more of the methods described herein. The apparatus may additionally include details similar to those of one or more of the methods described herein.
According to another embodiment, a non-transitory computer readable medium stores a computer program that, when executed by a processor, controls an apparatus to perform a process comprising one or more of the methods described herein.
The following detailed description and the accompanying drawings provide further understanding of the nature and advantages of the various embodiments.
Drawings
Fig. 1 is a block diagram of an audio classifier 100.
Fig. 2 is a block diagram showing an arrangement of classifiers into a two-stage classifier 200.
Fig. 3 is a block diagram of an audio processing system 300.
Fig. 4 is a block diagram of a device 400 that may be used to implement the audio classifier 100 (see fig. 1) and the like.
Fig. 5 is a flow chart of an audio processing method 500.
Detailed Description
Techniques related to audio content identification are described herein. In the following description, for purposes of explanation, numerous examples and specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure as defined by the claims may include some or all of the features of the examples, alone or in combination with other features described below, and may further include modifications and equivalents of the features and concepts described herein.
In the following description, various methods, procedures, and programs are described in detail. Although certain steps may be described in a certain order, this order is primarily for convenience and clarity. Certain steps may be repeated more than once, may occur before or after other steps (even if those steps are otherwise described in another order), and may occur in parallel with other steps. A second step is required to follow a first step only when the first step must be completed before the second step begins. Such situations will be specifically pointed out when not clear from the context.
In this document, the terms "and", "or", and "and/or" are used. Such terms should be understood as having an inclusive meaning. For example, "A and B" may mean at least the following: "both A and B", "at least both A and B". As another example, "A or B" may mean at least the following: "at least A", "at least B", "both A and B", "at least both A and B". As another example, "A and/or B" may mean at least the following: "A and B", "A or B". When an exclusive or is intended, this will be specifically noted (e.g., "either A or B", "at most one of A and B").
This document describes various processing functions associated with structures, such as blocks, elements, components, circuits, and the like. In general, these structures may be implemented by a processor controlled by one or more computer programs.
Fig. 1 is a block diagram of an audio classifier 100. The audio classifier 100 generally receives the input audio signal 102, performs classification of the input audio signal 102 using various models, and outputs a confidence score 128. The audio classifier 100 includes a feature extractor 110, a first set of classifiers 112 (also referred to as original classifiers), a second set of classifiers 114 (also referred to as new classes of classifiers), a context detector 116, and a confidence determiner 118. The audio classifier 100 may also be collectively referred to as a two-stage audio classifier or a two-stage music classifier. Alternatively, the classifiers 112 and 114, the context detector 116, and the confidence determiner 118 (e.g., excluding the feature extractor 110) may be collectively referred to as a two-stage audio classifier or a two-stage music classifier.
The feature extractor 110 receives the audio signal 102, performs feature extraction on the audio signal 102, and generates extracted features 120. The particular features extracted are typically selected according to the models implemented by the classifiers 112 and 114. As an example, the extracted features 120 may correspond to spectral energy in subbands of the audio signal 102.
The classifier 112 generally forms the first stage of the audio classifier 100. The classifier 112 receives the extracted features 120, performs classification of the extracted features 120 using one or more models, and generates a set of confidence scores 122 (also referred to as raw confidence scores). The set of confidence scores 122 may include one or more confidence scores (e.g., corresponding to one or more models).
The classifier 112 generally corresponds to a set of existing classifiers. In general, the set of existing classifiers has been developed to classify existing audio genres, but may not be able to accurately classify new audio genres. The classifier 112 may include one or more classifiers, including a speech classifier, a music classifier, a sound effect classifier, a noise classifier, and the like. The classifiers 112 may each comprise one or more different types of classifiers, e.g., two or more types of music classifiers, each developed to classify a particular genre of music (e.g., a jazz classifier, a rock classifier, etc.). The speech classifier typically evaluates whether the audio signal 102 corresponds to speech (e.g., dialog) rather than music, sound effects, etc. The sound effect classifier typically evaluates whether the audio signal 102 corresponds to a sound effect (e.g., an audiovisual effect such as a car crash, an explosion, etc.), rather than speech (e.g., dialog) or music (e.g., background music, emotional music, etc.). The noise classifier typically evaluates whether the audio signal 102 contains noise (e.g., constant or repetitive sounds such as buzzing, humming, squeaking, electric drills, alarms, waterfalls, rain, etc.).
Classifier 112 may be implemented by a machine learning system that performs various classifications using various models of various audio types. Classifier 112 may implement an adaptive boosting (AdaBoost) or deep neural network machine learning process. The AdaBoost process may be implemented in devices that use small model sizes or have limited ability to perform complex calculations. The deep neural network process may be implemented in a more capable device that allows for larger model sizes and performs complex calculations. Typically, the model of classifier 112 is developed offline (offline) by performing machine learning on a set of training data.
The classifier 114 generally forms the second stage of the audio classifier 100. The classifier 114 receives the extracted features 120, performs classification of the extracted features 120 using one or more models, and generates a set of confidence scores 124 (also referred to as new confidence scores). The confidence scores 124 may include one or more confidence scores (e.g., corresponding to one or more models).
The classifier 114 generally corresponds to a set of novel classifiers. In general, the novel classifiers have been developed to classify new audio genres. For example, the training data used to develop the models for the original classifier 112 may not include audio data for a new music genre, such that the original classifier 112 performs poorly in identifying the new genre. As described in more detail below, the novel classifier 114 includes a rap classifier.
Classifier 114 may be implemented by a machine learning system that performs various classifications using various models of various audio types. Classifier 114 may implement an adaptive boosting (AdaBoost) or deep neural network machine learning process. Typically, the model of classifier 114 is developed offline by performing machine learning on a set of training data.
Classifier 114 may also receive information from classifier 112, such as the set of confidence scores 122. For example, the classifier 114 may receive an indication from the classifier 112 that the audio signal 102 corresponds to speech or music (rather than sound effects or noise).
The context detector 116 receives the set of confidence scores 122 and generates a pilot signal 126. The context detector 116 may receive information from the classifier 112 indicating that the audio signal 102 contains neither speech nor music. In general, the context detector 116 evaluates the components of the set of confidence scores 122 over various time frames and uses a smoothed confidence score to reduce the impact of misclassification over a short period of time. The context detector 116 generates a pilot signal 126 to weight the impact of each set of confidence scores 122 and 124 by subsequent components. Further details of the context detector 116 and the pilot signal 126 are provided below.
The confidence determiner 118 receives the sets of confidence scores 122 and 124 and the pilot signal 126 and generates a final confidence score 128. In general, the confidence determiner 118 smoothly transitions the audio classifier 100 from using only the classifier 112 to also using the classifier 114, as appropriate according to the confidence score 124. Further details of the confidence determiner 118 are provided below.
Classification of rap music
The following section discusses a specific use case: rap music classification by the classifier 114. Compared to existing music genres, rap music has similarities with both dialog and music. Thus, with existing classifiers there is a risk of classifying rap as dialog and applying one set of audio processing algorithms, or classifying rap as music and applying another set of audio processing algorithms, neither of which may be appropriate for rap. In addition, existing classifiers may switch rapidly between the dialog classification and the music classification, resulting in rapid switching between the two processing algorithms and an inconsistent listening experience. Adding a rap classifier and integrating it with an existing classifier to form a two-stage classifier results in an improved listening experience without disrupting the existing classifier.
Subband-based spectral energy
For rap music, the new features extracted by the feature extractor 110 are developed based on spectral energy, which reflects the energy fluctuation characteristics of different content in the frequency domain. First, the input audio signal is transformed into spectral coefficients by a time-frequency transform (e.g., a Quadrature Mirror Filter (QMF) bank, a Fast Fourier Transform (FFT), etc.), and an energy spectrum is then calculated from the spectral coefficients; the present disclosure further divides the entire energy spectrum into four subbands.
The first subband energy, representing the low-frequency energy distribution below 300Hz, is used to detect the onset of bass or drum beats. The second subband energy, representing the energy distribution between 300Hz and 1kHz, is used to measure the fluctuation of the vocal pitch. The third subband energy, representing the energy distribution between 1kHz and 3kHz, is used to measure the fluctuation of the vocal harmonics. The fourth subband energy, representing the energy distribution between 3kHz and 6kHz, is used to detect unvoiced signals or the fluctuation of the snare drum.
All subband spectral energies are calculated over short-term frames (e.g., 20 ms) and then stored in a memory buffer until the expected window length, e.g., 5 s, is reached. Finally, higher-level features can be derived from the subband spectral energies over the window length described above.
The number of subbands, frequency range of each subband, frame length, and window length may be adjusted as desired. For example, to classify a different new genre, a model for another new classifier 114 may be generated using subbands appropriate for the new genre.
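A minimal sketch of this subband energy extraction follows, assuming an FFT-based analysis (rather than a QMF bank), a 48 kHz sample rate, 20 ms frames, and a 5 s window; these specifics are illustrative, not the patented implementation.

```python
# Sketch of the four-subband energy extraction described above, using an FFT in
# place of a QMF bank. Frame length (20 ms), window length (5 s) and the band
# edges follow the text; the buffering details are assumptions.
import numpy as np
from collections import deque

BAND_EDGES_HZ = [(0, 300), (300, 1000), (1000, 3000), (3000, 6000)]

def frame_subband_energies(frame, sr=48000):
    """Energy in each of the four subbands for one short-term frame (~20 ms)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in BAND_EDGES_HZ])

class SubbandEnergyBuffer:
    """Keeps a rolling window (e.g. 250 frames of 20 ms = 5 s) of subband energies."""
    def __init__(self, window_frames=250):
        self.buf = deque(maxlen=window_frames)

    def push(self, frame, sr=48000):
        self.buf.append(frame_subband_energies(frame, sr))
        return len(self.buf) == self.buf.maxlen   # True once the window is full

    def window(self):
        return np.stack(self.buf)                 # shape: (frames, 4 subbands)
```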
Characteristics of rap music
Typical rap music has several significant differences compared to general music, including the vocal tempo, rhythmic lyrics, the regularity of music bars, etc. Based on the above subband spectral energies, we introduce a peak/valley tracking method to find cues that reflect the characteristics of the vocal tempo, the rhythmic meter, and the regularity of the music bars.
For typical rap music, the tempo is generally about 100 to 150 beats per minute (BPM), typically with a 4/4 time signature; lyrics are typically sung regularly over a fixed period of time such that the number of syllables in each sentence is roughly the same. Thus, some new features are derived accordingly:
The first feature is the statistical nature of the subband spectral energy distribution. During a fixed period of time, the spectral energy parameters are divided into several sections; in each section, the peak/valley spectral energy can be calculated and the number of peaks/valleys counted. Features indicating the statistical properties of the spectral energy described above (e.g., mean, standard deviation, etc.) may be used to distinguish rap music from general speech content.
The second feature is the spacing of peak/valley positions of the subband spectral energy. A vocal syllable consists of voiced and unvoiced sounds, which to some extent correspond to peaks and valleys of the spectral energy, so that the peak/valley positions of typical rap music occur at regular intervals. For natural dialog, however, there is no obvious and regular interval between voiced and unvoiced sounds. Thus, the peak/valley positions, represented by their indexes in the spectral energy over the window length, are recorded in a continuous manner, and the interval between each pair of adjacent peak positions is calculated. Finally, the uniformity of the distribution of these intervals is used as a key feature for rap music (see the sketch following this list of features).
The third feature is the contrast of peak-to-valley spectral energy. The contrast between the peaks and valleys of the vocal energy in rap music is not much different from that of typical speech or dialog in a movie or program, which can also be an important cue indicating whether the audio content is dialog content.
The fourth feature is the rhyme feature. Most rap lyrics are written with a specific tempo and rhyme scheme. Unfortunately, without semantic recognition, correctly segmenting lyrics into syllable units may not be computationally feasible. In addition, in rap music the rhyming is sometimes incomplete, especially when the final metrical foot lacks one or more syllables.
The fifth feature is a rhythmic feature. A rhythmic feature is calculated on the subband energies of the various spectral ranges described above; it represents the frequency and strength of musical onsets and the regularity and contrast of the rhythm. Separately, one measurement may be based on the first and fourth subband spectral energies and another measurement on the second and third subband spectral energies.
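As one concrete illustration of the second feature above (the spacing of peak/valley positions), the sketch below computes a simple interval-regularity measure for one subband energy track; the peak-picking rule and the regularity formula are assumptions, not the patent's definition.

```python
# Illustrative peak-interval regularity for one subband energy track over the
# analysis window. The local-maximum peak picking and the std/mean regularity
# measure are assumptions used only to make the idea concrete.
import numpy as np

def peak_interval_regularity(energy_track):
    """energy_track: 1-D array of one subband's energy over the analysis window."""
    e = np.log10(np.asarray(energy_track) + 1e-12)
    # simple local-maximum peak picking
    peaks = np.where((e[1:-1] > e[:-2]) & (e[1:-1] > e[2:]))[0] + 1
    if len(peaks) < 3:
        return 0.0
    intervals = np.diff(peaks)
    # small relative spread -> regular intervals -> more rap-like vocal cadence
    return 1.0 / (1.0 + np.std(intervals) / (np.mean(intervals) + 1e-12))
```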
Selection of data and features for training a two-stage music classifier
Before training the rap classifier, it is necessary to prepare a set of training data and to finalize the features and classifier algorithms. The training database is made up of various content types, such as speech, rap music, non-rap music, sound effects, noise, etc., which are collected from various applications over time and manually labeled with their corresponding audio types. These labels represent the ground truth of the audio content. To meet the needs of different application scenarios, feature sets may be selected jointly or separately from the old and new features. Similarly, the new model may be trained independently or in combination with multiple models using different learning algorithms.
Depending on the requirements of the new classifier and the system tolerance, there are different combinations of old features/training data and new features/training data. Unfortunately, it is difficult to find the optimal solution among these combinations, since we cannot enumerate all possibilities. In practice, we manually split the training data set into two data chunks, one representing rap music content and the other representing non-rap content. For the feature set, we select both the original and the new features to train the rap music classifier, while retaining the old features for the old music classifier. Thus, there are two independent music classifiers: the original music classifier serves as the first stage (e.g., the set of classifiers 112) for general music content recognition, and the newly trained rap music classifier serves as the second stage (e.g., the set of classifiers 114) specifically for distinguishing rap songs from dialog content.
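A hedged sketch of this offline training step follows, using scikit-learn's AdaBoost as one possible realization of the adaptive boosting process mentioned earlier; the feature matrix, labels, and hyperparameters are placeholders rather than the patent's actual setup.

```python
# Sketch of offline training of the second-stage rap classifier on the rap /
# non-rap data chunks. AdaBoost stands in for "an adaptive boosting process";
# feature extraction, labels, and n_estimators are illustrative placeholders.
from sklearn.ensemble import AdaBoostClassifier

def train_rap_classifier(features, labels):
    """features: (n_clips, n_features) with old + new features; labels: 1 = rap, 0 = non-rap."""
    model = AdaBoostClassifier(n_estimators=100)
    model.fit(features, labels)
    return model

# Usage sketch: the original (first-stage) music classifier is left untouched;
# only the second-stage rap model is trained on the new data split.
# rap_model = train_rap_classifier(X_train, y_train)
# new_conf = rap_model.predict_proba(X_frame)[:, 1]   # second-stage confidence
```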
Arrangement of the classifiers into a two-stage classifier
Fig. 2 is a block diagram showing the arrangement of the classifiers 112 and 114 (see fig. 1) into a two-stage classifier 200. The classifier 112 forms the first stage and includes a speech classifier 202, a music classifier 204, a sound effect classifier 206, and a noise classifier 208. The classifier 112 receives the extracted features 120 and generates a speech confidence score 212, a music confidence score 214, a sound effect confidence score 216, and a noise confidence score 218, respectively, which together constitute the set of confidence scores 122.
The classifier 114 forms the second stage and includes a rap classifier 230. The second stage further comprises a decision stage 232. The decision stage 232 receives the set of confidence scores 122. When the set of confidence scores 122 indicates that the audio signal 102 does not correspond to speech or music (e.g., the values of the speech confidence score 212 and the music confidence score 214 are low, or the value of the sound effect confidence score 216 or the noise confidence score 218 is high), the two-stage classifier 200 outputs the set of confidence scores 122. When the set of confidence scores 122 indicates that the audio signal 102 does correspond to speech or music (e.g., the value of the speech confidence score 212 or the music confidence score 214 is high), the decision stage indicates this to the rap classifier 230.
The rap classifier 230 receives the extracted features 120 and the indication of speech or music from the decision stage 232. To reduce computational complexity, the rap classifier 230 does not have to run all the time for all content. Instead, the classifier 112 and the classifier 114 are arranged as a two-stage cascade of classifiers. First, a confidence score is calculated for each audio type in the first stage, and the audio type with the greatest confidence score is determined. If that audio type is speech or music, the condition is met and an indication is provided to the rap classifier 230 to perform further recognition. The two-stage classifier 200 then outputs the confidence score 124 resulting from the operation of the rap classifier 230. If the output type of the first-stage classifier is sound effect or noise, the rap classifier 230 may be bypassed.
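The cascade logic can be sketched as follows; the dictionary keys and function signatures are assumptions for illustration only.

```python
# Sketch of the two-stage cascade in Fig. 2: the rap classifier runs only when
# the first stage's top class is speech or music; otherwise it is bypassed.
def two_stage_classify(first_stage_scores, rap_classifier, features):
    """first_stage_scores: dict with 'speech', 'music', 'effect', 'noise' confidences."""
    top_type = max(first_stage_scores, key=first_stage_scores.get)
    if top_type in ("speech", "music"):
        # condition met: run the second-stage rap classifier (decision stage 232)
        rap_conf = rap_classifier(features)
        return first_stage_scores, rap_conf
    # sound effect or noise: bypass the rap classifier
    return first_stage_scores, None
```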
Context detector 116
The context detector 116 (see fig. 1) typically monitors the change in the confidence values over time. Both the original classifier (e.g., the classifier 112) and the new classifier (e.g., the classifier 114) may make errors in the short term. Thus, the context detector 116 evaluates long-term, continuous context information. For example, listening to music over a period of several minutes results in context information that tends toward a high music confidence score by the end of the period, which helps correct abrupt false positives caused by misclassification over short periods. The context detector 116 considers both long-term and short-term context. The long-term context information is a slowly smoothed music confidence score (e.g., the music confidence score 214). For example, the slow smoothing may be determined over 8 to 12 seconds, such as 10 seconds. The long-term context information p̄(t) can then be calculated according to the following formula (1):

p̄(t) = α_context · p̄(t-1) + (1 - α_context) · p(t)  (1)

where p(t) is the confidence score of the music classifier (e.g., the music confidence score 214) at the current frame t of the audio signal 102, p̄(t-1) is the long-term context information at the previous frame, and α_context is a long-term smoothing coefficient.

In a similar manner, the short-term context information is a quickly smoothed non-music confidence score (e.g., the larger of the sound effect confidence score 216 and the noise confidence score 218). For example, the fast smoothing may be determined over 4 to 6 seconds, such as 5 seconds. The short-term context information q̄(t) can then be calculated according to the following formula (2):

q̄(t) = β_context · q̄(t-1) + (1 - β_context) · q(t)  (2)

where q(t) is the maximum of the sound effect confidence score 216 and the noise confidence score 218 at the current frame t of the audio signal 102, q̄(t-1) is the short-term context information at the previous frame, and β_context is a short-term smoothing coefficient.

Given the above context signals p̄(t) and q̄(t), the pilot signal s(t) can be determined by a nonlinear function h(·). For example, the obtained context signals may be mapped to the expected pilot signal (ranging from 0 to 1) using sigmoid functions according to the following formula (3):

s(t) = h(p̄(t), q̄(t))  (3)

wherein h(·) is composed of sigmoid functions h_1 and h_2 applied to the long-term and short-term context signals, respectively, each according to the following formula (4):

h(x) = 1 / (1 + e^(-(A·x + B)))  (4)

where x is the obtained context confidence (e.g., p̄(t) or q̄(t)), and A and B are two parameters.
The output of the context detector 116 is a pilot signal 126 that is used as a weighting factor for subsequent processing by the confidence determiner 118. The pilot signal 126 ranges from a soft value of 0.0 to 1.0, where a value of 0 indicates a non-music context and a value of 1.0 indicates a music context. Between 0 and 1, the larger the value, the more likely the audio signal 102 is in a music context.
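Putting formulas (1) through (4) together, a sketch of the context detector might look like the following; the smoothing coefficients, the sigmoid parameters, and the particular way h_1 and h_2 are combined into s(t) are assumptions, not the patented definition.

```python
# Sketch of the context detector: formulas (1) and (2) are one-pole smoothers of
# the music and non-music confidences; formulas (3)-(4) map them through sigmoids
# into a 0..1 pilot signal. Coefficients and the product combination are assumed.
import math

def sigmoid(x, a, b):
    """Formula (4): an S-shaped mapping with parameters A and B."""
    return 1.0 / (1.0 + math.exp(-(a * x + b)))

class ContextDetector:
    def __init__(self, alpha=0.998, beta=0.995, a1=10.0, b1=-5.0, a2=10.0, b2=-5.0):
        self.alpha, self.beta = alpha, beta      # long- and short-term smoothing coefficients
        self.a1, self.b1, self.a2, self.b2 = a1, b1, a2, b2
        self.p_bar = 0.0                         # slowly smoothed music confidence
        self.q_bar = 0.0                         # quickly smoothed non-music confidence

    def update(self, music_conf, nonmusic_conf):
        self.p_bar = self.alpha * self.p_bar + (1 - self.alpha) * music_conf    # formula (1)
        self.q_bar = self.beta * self.q_bar + (1 - self.beta) * nonmusic_conf   # formula (2)
        # formula (3): map the two context signals into a pilot signal in [0, 1];
        # the product form below is one plausible combination, not the patented one.
        return sigmoid(self.p_bar, self.a1, self.b1) * (1.0 - sigmoid(self.q_bar, self.a2, self.b2))
```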
Confidence determiner 118
The confidence determiner 118 (see fig. 1) generates the final music confidence score 128 by jointly considering the pilot signal 126, the set of confidence scores 122, and the confidence score 124. To achieve a smooth transition between switching the rap music classification on and off, a blending procedure is used when w(t) ∈ (0, 1). That is, the final output is a blend of the confidence score of the old music classifier (e.g., the confidence score 122 only) and that of the new music classifier (e.g., a combination of the confidence scores 122 and 124). Given the confidence score x_new(t) of the new music classifier, the confidence score x_old(t) of the old music classifier [e.g., the confidence score 122], and the pilot signal s(t) discussed above [e.g., the pilot signal 126], x_new(t) can be calculated according to the following formula (5):

x_new(t) = x_old(t) + (1 - x_old(t)) · new_conf(t)  (5)

where new_conf(t) is the second-stage (rap) music confidence output (e.g., the confidence score 124).

The final output confidence score y(t) [e.g., the final confidence score 128] can then be expressed according to the following formulas (6) and (7):

y(t) = w(t) · x_new(t) + (1 - w(t)) · x_old(t)  (6)

where formula (7) defines the weight w(t) from the pilot signal s(t) and the second-stage confidence new_conf(t): w(t) corresponds to the pilot signal s(t) when new_conf(t) is below a threshold, and otherwise to a combination of the pilot signal s(t) and new_conf(t). The threshold may be determined via a statistical summary of the training data; according to an embodiment, a threshold of 0.9 works well.
Extension of additional novel classifier
In the present disclosure, the rap classifier has been described as an example use case for constructing a two-stage music classifier. The two-stage music classifier not only maintains the original behavior on existing audio content (such as speech, non-rap music, sound effects, and noise), but also improves the overall listening experience for rap music by greatly improving the classification accuracy for rap songs. Notably, the proposed method can easily be extended or directly applied to audio systems for various use cases of music content classification, such as building novel classifiers for a cappella music, certain background music in games, and reverberant speech in podcasts. More broadly, the proposed method can also be extended to general audio systems for general content classification. The following paragraphs discuss several specific use cases, scenarios, and applications in which an old content identifier needs to be extended with a new type of content identifier.
One example use case is reverberation detection. For example, reverberant speech in content such as podcasts or user-generated audio may need special processing before being encoded into a bitstream. While supporting the new data type, the new detector may need to generate similar results for old data types to maintain backward compatibility. In this case, a reverberant speech classifier may be added to the classifier 114 (see fig. 1).
Another example use case is gunshot detection. In gaming applications, other types of sound effects (e.g., gunshot) may be utilized to update the sound effect detector. In this case, a gunshot classifier may be added to classifier 114.
Another example use case is noise detection. As customer demand increases, the audio processing system may need to provide more functionality (e.g., noise compensation for mobile devices), which further requires the noise classifier to identify more audio content types (e.g., stationary noise while in motion). While the novel noise classifier provides more classification results, it is desirable that the classification results for the originally supported content types (e.g., noise or sound effects) still be similar to the classification results from the old classifier, so that no significant tuning of other audio processing algorithms (e.g., noise suppression and volume leveling) is required after the novel classifier is used. In this case, a new noise classifier may be added to the classifier 114.
In summary, the proposed method can be generalized from the following four considerations when it is desired to construct or improve a new classifier.
The first consideration is the relationship between the new and old use cases. This consideration defines the relationship between the new and old classifiers and thus determines the structure of the model assembly. When the new use case is a subset of the types of the old use case, the new classifier may be combined with the old classifier into a cascaded multi-stage structure. If the new use case is an independent requirement, the new classifier can be parallel to the old classifier. In addition, such considerations help in deciding when to trigger or activate a new classifier and how to combine the results of the new classifier with the confidence score of the old classifier in the original system.
The second consideration is the new characteristics of the new use case. Such considerations aim to find typical features representing the essential characteristics of the new schema, which are used to distinguish the target type from other content types.
The third consideration is the training of models for the new use case. This consideration prepares training data and label data for the target audio types according to the new requirements, then extracts features and trains the models of the new classifier offline using corresponding machine learning techniques.
The fourth consideration is the integration of the new classifier. This consideration aims to integrate new features and classifiers into the original system and tune the appropriate parameters to minimize the behavioral differences of the old use case.
In order to differentiate audio content and apply the optimal parameters or optimal audio processing algorithms accordingly, different use-case profiles may be required and pre-designed, and system developers may select a profile for the application context being deployed. A profile typically encodes a set of audio processing algorithms and/or their optimal parameters to be applied, such as a "file-based" profile designed for high-performance applications and a "portable" profile designed for resource-constrained use cases (e.g., mobile). The main difference between the file-based profile and the portable profile is the computational complexity due to feature selection and model selection, with the extended functionality enabled in the file-based profile and disabled in the portable profile.
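One way such profiles might be encoded is sketched below; the profile names follow the text, while the individual flags and parameter values are invented placeholders.

```python
# Illustrative encoding of the use-case profiles described above. Only the
# "file_based" / "portable" split follows the text; every flag and value here
# is a placeholder, not the actual product configuration.
PROFILES = {
    "file_based": {                     # high-performance use cases
        "extended_features": True,      # full feature set and larger models
        "rap_classifier": True,
        "dialog_enhancement_gain_db": 6.0,
        "intelligent_eq": True,
    },
    "portable": {                       # resource-constrained (e.g. mobile) use cases
        "extended_features": False,     # reduced complexity: fewer features, smaller models
        "rap_classifier": False,
        "dialog_enhancement_gain_db": 4.0,
        "intelligent_eq": True,
    },
}

def select_profile(deployment_context):
    """System developers pick the profile matching the application context."""
    return PROFILES["portable" if deployment_context == "mobile" else "file_based"]
```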
Avoiding the influence on the given use case
When we extend the original system with new requirements, the new system should not have a large impact on existing application cases. This leads to the following three suggestions.
The first proposal relates to feature/model selection of the old use case. The overall goal is to keep the original features and classifiers as unchanged as possible and to add or train separate classifiers for new requests, which is a basic guarantee that significant impact on existing use cases is avoided.
The second proposal relates to a determination regarding the use of a novel classifier. To reduce unnecessary false positives, the determination conditions using the new classifier should be fine-tuned, which means that for old use cases, the confidence score is calculated using the original classifier and the result is output, whereas for new use cases only, the new classifier will be used to identify the audio content type.
The third proposal relates to the confidence determiner between the old and new classifiers. Different smoothing schemes may be used to determine the final output between the old confidence score and the new result. For example, to avoid abrupt changes and to allow smoother estimation of the parameters in the audio processing algorithms, the confidence score may be further smoothed. One common smoothing method is based on a weighted average, e.g., according to the following formulas (8) and (9):

conf(t) = α · old_conf(t) + (1 - α) · new_conf(t)  (8)

smoothConf(t) = β · smoothConf(t-1) + (1 - β) · conf(t)  (9)

where t is the timestamp, α and β are weights, and conf and smoothConf are the confidence before and after smoothing, respectively.

The smoothing algorithm may also be "asymmetric", i.e., use different smoothing weights for different situations. For example, if we care more about the original output when the old confidence score increases, we can design a smoothing algorithm according to equation (10), which allows the smoothed confidence score to respond quickly to the current state as the old confidence score increases and to smooth slowly as the old confidence score decreases. Variations of the smoothing function may be generated in a similar manner.
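A sketch of the weighted-average smoothing of formulas (8) and (9), together with one plausible asymmetric variant in the spirit of equation (10), follows; the switching condition and coefficient values are assumptions.

```python
# Sketch of formulas (8)-(9) plus an assumed asymmetric variant of (10): smooth
# fast (small beta) while the old confidence is rising and slowly (large beta)
# while it is falling. The exact condition and coefficients are assumptions.
def smooth_confidence(old_conf, new_conf, prev_smooth, prev_old,
                      alpha=0.5, beta_fast=0.2, beta_slow=0.9):
    conf = alpha * old_conf + (1 - alpha) * new_conf             # formula (8)
    beta = beta_fast if old_conf > prev_old else beta_slow       # asymmetric weight, cf. (10)
    return beta * prev_smooth + (1 - beta) * conf                # formula (9)
```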
Fig. 3 is a block diagram of an audio processing system 300. The audio processing system 300 comprises an audio classifier 100 (see fig. 1) and a processing component 310, the processing component 310 comprising a dialog enhancer 312, an intelligent equalizer 314 and a rap music enhancer 316.
The audio classifier 100 receives the audio signal 102 and operates as discussed above to generate the final confidence score 128. The processing component 310 receives the final confidence score 128 and processes the audio signal 102 using the appropriate components based on the final confidence score 128. For example, when the final confidence score 128 indicates that the audio signal 102 is dialog, the dialog enhancer 312 may be used to process the audio signal 102. When the final confidence score 128 indicates that the audio signal 102 has an uneven spectral balance, the intelligent equalizer 314 may be used to process the audio signal 102. When the final confidence score 128 indicates that the audio signal 102 is rap music, the rap music enhancer 316 may be used to process the audio signal 102. The processing component 310 generates a processed audio signal 320 that corresponds to the audio signal 102 as processed by the selected component.
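A sketch of this dispatch logic follows; the class labels and enhancer callables are placeholders rather than the actual system interfaces.

```python
# Sketch of the dispatch in Fig. 3: the final classification steers the audio
# signal to dialog enhancement, intelligent equalization, or rap music
# enhancement. Labels and callables are illustrative placeholders.
def process(audio, classification, enhancers):
    """enhancers: dict mapping 'dialog', 'music', 'rap' to processing callables."""
    if classification == "dialog":
        return enhancers["dialog"](audio)       # dialog enhancer 312
    if classification == "rap":
        return enhancers["rap"](audio)          # rap music enhancer 316
    return enhancers["music"](audio)            # intelligent equalizer 314
```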
Fig. 4 is a block diagram of an apparatus 400 that may be used to implement the audio classifier 100 (see fig. 1), the two-stage classifier 200 (see fig. 2), the audio processing system 300 (see fig. 3), and so on. The device 400 may be a computer (desktop computer, laptop computer, etc.), game console, portable device (e.g., mobile phone, media player, etc.), or the like. The device 400 includes a processor 402, a memory 404, one or more input components 406, one or more output components 408, and one or more communication components 410 connected by a bus 412.
Processor 402 typically controls the operation of device 400, for example, according to the execution of one or more computer programs. The processor 402 may implement one or more of the functions described herein, such as the functions of the feature extractor 110 (see fig. 1), the classifiers 112 and 114, the context detector 116, the confidence determiner 118, the audio processing component 310 (see fig. 3), the equations (1) through (10), the method 500 (see fig. 5), and so forth. The processor 402 may interact with the memory 404 to store data, computer programs, and the like.
Memory 404 typically stores data operated on by device 400. For example, the memory 404 may store the input signal 102 (see FIG. 1; e.g., as a data frame of a streaming signal, as a stored data file, etc.), the extracted features 120, the models used by the classifiers 112 and 114, the confidence scores 122 and 124, the pilot signal 126, the final confidence score 128, the results of equations (1) through (10), etc. Memory 404 may also store computer programs that are executed by processor 402.
Input component 406 typically enables input to device 400. The details of the input component 406 may vary based on the particular form factor of the device 400. For example, the input component 406 of the mobile phone may include a touch screen, microphone, motion sensor, camera, control buttons, and the like. The input components 406 of the game console may include control buttons, powered motion sensors, microphones, game controllers, and the like.
The output component 408 generally enables output of the device 400. The details of the output component 408 may vary based on the particular form factor of the device 400. For example, the output component 408 of the mobile phone may include a screen, speakers, haptic mechanisms, light emitting diodes, and the like. The output component 408 of the game console may include a screen, speakers, etc.
The communication component 410 typically enables wired or wireless communication between the device 400 and other devices. Thus, the communication component 410 includes additional input and output components similar to the input component 406 and the output component 408. The wireless components include radios, such as a cellular radio, an IEEE 802.15.1 radio (e.g., a Bluetooth™ radio), an IEEE 802.11 radio (e.g., a Wi-Fi™ radio), etc. The wired components include wired connections to peripherals such as a keyboard, a mouse, a game controller, sensors, etc. The details of the input component 406 and the output component 408 may vary based on the particular form factor of the device 400. For example, a mobile phone may include a cellular radio to receive the input signal 102 as a streaming media signal and an IEEE 802.15.1 radio to transmit the processed audio signal 320 to a pair of wireless earpieces for output as sound.
Fig. 5 is a flow chart of an audio processing method 500. The method 500 may be implemented by a device (e.g., the device 400 of fig. 4), such as controlled by execution of one or more computer programs.
At 502, an audio signal is received. For example, the audio signal 102 (see fig. 1) may be received by the communication component 410 (see fig. 4) of the device 400. As another example, the audio signal 102 may be received from the memory 404 where the audio signal has been previously stored.
At 504, feature extraction is performed on the audio signal to extract a plurality of features. For example, the feature extractor 110 (see fig. 1) may perform feature extraction on the audio signal 102 to generate extracted features 120. The details of the feature extraction performed and the resulting extracted features may vary based on the relevance of these particular features to the model used for classification. For example, the subband energy of the input signal 102 may be related to a rap classification model.
At 506, the plurality of features are classified according to a first audio classification model to generate a first set of confidence scores. For example, the classifier 112 (see fig. 1) may classify the extracted features 120 according to a music classification model, a speech classification model, a noise classification model, an audio classification model, or the like, thereby generating the corresponding confidence scores 122.
At 508, the plurality of features are classified according to a second audio classification model to generate a second confidence score. For example, the classifier 114 (see fig. 1) may classify the extracted features 120 according to a rap classification model to generate the rap confidence score 124.
At 510, a pilot signal is calculated by combining a first component of the first set of confidence scores smoothed over a first time period and a second component of the first set of confidence scores smoothed over a second time period, wherein the second time period is shorter than the first time period. For example, the context detector 116 (see fig. 1) may generate the pilot signal 126 according to equation (3) using the long-term context information according to equation (1) and the short-term context information according to equation (2).
At 512, a final confidence score is calculated from the pilot signal, the first set of confidence scores, and the second confidence score. For example, the confidence determiner 118 (see fig. 1) may generate the final confidence score 128 from the pilot signal 126, the confidence score 122, and the confidence score 124. The final confidence score may correspond to a weighted combination of confidence scores 122 and 124, e.g., calculated according to equation (6).
At 514, a classification of the audio signal is output according to the final confidence score. For example, the confidence determiner 118 (see fig. 1) may output the final confidence score 128 for use by other components of the device 400.
At 516, one of a first process and a second process is selectively performed on the audio signal to generate a processed audio signal based on the classification, wherein the first process is performed when the classification is a first classification and the second process is performed when the classification is a second classification. For example, when the audio signal 102 (see fig. 1) corresponds to speech, the dialog enhancer 312 (see fig. 3) may be used to generate the processed audio signal 320. When the audio signal 102 corresponds to a rap, the rap music enhancer 316 may be used to generate a processed audio signal 320.
At 518, the processed audio signal is output as sound. For example, the speaker of device 400 may output processed audio signal 320 as audible sound.
Details of implementation
Embodiments may be implemented in hardware, executable modules stored on a computer readable medium, or a combination of both (e.g., a programmable logic array). Unless otherwise indicated, the steps performed by an embodiment need not be inherently related to any particular computer or other apparatus, although they may be relevant in certain embodiments. In particular, various general-purpose machines may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus (e.g., an integrated circuit) to perform the required method steps. Thus, embodiments may be implemented in one or more computer programs executing on one or more programmable computer systems each comprising at least one processor, at least one data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device or port, and at least one output device or port. Program code is applied to input data to perform the functions described herein and generate output information. The output information is applied to one or more output devices in a known manner.
Each such computer program is preferably stored on or downloaded to a storage media or device (e.g., solid state memory or media, or magnetic or optical media) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer system to perform the procedures described herein. The system of the present invention may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer system to operate in a specific and predefined manner to perform the functions described herein. (software itself and intangible or transient signals are excluded in the sense that they are not patentable subject matter.)
The above description illustrates various embodiments of the disclosure and examples of how aspects of the disclosure may be implemented. The above examples and embodiments should not be considered as the only embodiments, but are presented to illustrate the flexibility and advantages of the present disclosure as defined by the appended claims. Other arrangements, examples, implementations, and equivalents will be apparent to those skilled in the art based on the foregoing disclosure and appended claims and may be employed without departing from the spirit and scope of the present disclosure as defined by the claims.
Aspects of the invention may be understood from the enumerated example embodiments (EEEs) listed below:
EEE1. A method of audio processing, the method comprising:
receiving an audio signal;
performing feature extraction on the audio signal to extract a plurality of features;
classifying the plurality of features according to a first audio classification model to generate a first confidence score;
classifying the plurality of features according to a second audio classification model to generate a second confidence score;
calculating a pilot signal by combining a first component of the first confidence score and a second component of the first confidence score;
calculating a final confidence score from the pilot signal, the first confidence score, and the second confidence score; and
and outputting the classification of the audio signal according to the final confidence score.
EEE2. The method of EEE 1, wherein a plurality of models includes a first set of models and the second audio classification model, wherein the first set of models includes the first audio classification model, and wherein classifying the plurality of features according to the first audio classification model to generate the first confidence score includes:
The plurality of features are classified according to the first set of models to generate the first confidence score.
EEE3. The method of EEE 2, wherein the first set of models comprises a speech classification model and a music classification model.
EEE4. The method of any one of EEEs 1-3, wherein the second audio classification model is a rap classification model.
EEE5. The method of any one of EEEs 1-4, wherein performing feature extraction comprises determining a plurality of subband energies for a plurality of subbands of the audio signal.
EEE6. The method of EEE5 wherein the plurality of subbands comprises a first subband below 300Hz, a second subband between 300Hz and 1000Hz, a third subband between 1kHz and 3kHz, and a fourth subband between 3kHz and 6 kHz.
EEE7. The method of any one of EEEs 1-6, wherein classifying the plurality of features according to the first audio classification model comprises:
the plurality of features are classified according to the first audio classification model using at least one of an adaptive lifting machine learning process and a deep neural network machine learning process.
EEE8. The method of any one of EEEs 1-7, wherein calculating the pilot signal comprises:
The pilot signal is calculated by combining a first component of the first confidence score smoothed over a first time period and a second component of the first confidence score smoothed over a second time period, wherein the second time period is shorter than the first time period.
EEE9. The method of EEE 8, wherein the first period of time is at least twice the second period of time.
EEE10. The method of EEE 8 wherein the first period of time is between 8 and 12 seconds and wherein the second period of time is between 4 and 6 seconds.
EEE11. The method of any one of EEEs 8-10, wherein the first component of the first confidence score smoothed over the first period of time is calculated based on a first smoothing coefficient, a current music confidence score of a current frame of the audio signal, and a previously smoothed music confidence score of a previous frame of the audio signal; and
wherein the second component of the first confidence score smoothed over the second time period is calculated based on a second smoothing coefficient, a current sound effect and noise confidence score of the current frame of the audio signal, and a previously smoothed sound effect and noise confidence score of the previous frame of the audio signal.
EEE12. The method of any one of EEEs 1-11, wherein calculating the pilot signal comprises:
applying a first sigmoid function to a first component of the first confidence score smoothed over the first time period; and
a second sigmoid function is applied to a second component of the first confidence score smoothed over the second time period.
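A hedged sketch of EEE 12 together with the combining step of EEE 8: each smoothed component passes through its own sigmoid before the two results are merged into the pilot signal. The sigmoid gain and offset and the use of a product as the merge operation are assumptions for illustration.

```python
import math

# Illustrative sigmoid mapping and pilot-signal combination (EEEs 8 and 12).
# Gains, offsets and the product-based merge are assumed, not recited.

def sigmoid(x, gain=10.0, offset=0.5):
    return 1.0 / (1.0 + math.exp(-gain * (x - offset)))

def pilot_signal(music_smoothed_long, noise_smoothed_short):
    a = sigmoid(music_smoothed_long)          # first sigmoid, long-term component
    b = sigmoid(1.0 - noise_smoothed_short)   # second sigmoid, short-term component
    return a * b                              # assumed combining operation

print(pilot_signal(0.8, 0.1))
```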
EEE13. The method of any one of EEEs 1-12, wherein the final confidence score is calculated based on a combination of a new confidence component and an old confidence component,
wherein the new confidence component is calculated based on applying a first weight to a combination of the first confidence score and the second confidence score.
EEE14. The method of EEE 13, wherein the old confidence component is calculated based on applying a second weight to the first confidence score.
EEE15. The method of EEE 14, wherein the sum of the first weight and the second weight is one.
EEE16. The method of EEE 13, wherein the first weight selectively corresponds to one of the pilot signal and a combination of the pilot signal and the second confidence score, and
wherein the first weight corresponds to the pilot signal when the second confidence score is less than a threshold.
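EEEs 13-16 fix only part of the arithmetic (the two weights sum to one, and the first weight falls back to the pilot signal when the second confidence score is below a threshold). The sketch below fills the unspecified pieces, namely the threshold value, the pilot/second-score combination, and how the two confidence scores are merged, with assumptions purely for illustration.

```python
# Illustrative final-confidence combination for EEEs 13-16. The threshold,
# the pilot*second weighting, and max() as the score merge are assumptions.

THRESHOLD = 0.2  # assumed

def final_confidence(pilot, first_conf, second_conf):
    if second_conf < THRESHOLD:
        w_new = pilot                   # EEE 16: weight is the pilot signal alone
    else:
        w_new = pilot * second_conf     # assumed pilot/second-score combination
    w_old = 1.0 - w_new                 # EEE 15: weights sum to one
    new_component = w_new * max(first_conf, second_conf)  # assumed merge (EEE 13)
    old_component = w_old * first_conf                    # EEE 14
    return new_component + old_component

print(final_confidence(pilot=0.9, first_conf=0.7, second_conf=0.6))
```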
EEE17. The method of any one of EEEs 1-16, further comprising:
selectively performing one of a first process and a second process on the audio signal based on the classification to generate a processed audio signal, wherein the first process is performed when the classification is a first classification and the second process is performed when the classification is a second classification.
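A minimal dispatch sketch for EEE 17, in which a different placeholder process runs depending on the classification; the actual first and second processes are not specified in this excerpt, so the trivial gains below are stand-ins only.

```python
import numpy as np

# Selective post-processing in the spirit of EEE 17; both processes are
# placeholders (simple gains), chosen only so the example runs.

def process_for_speech(audio):
    return audio * 1.2          # e.g. dialogue-oriented processing (placeholder)

def process_for_music(audio):
    return audio * 0.9          # e.g. music-oriented processing (placeholder)

def apply_processing(audio, classification):
    if classification == "speech":
        return process_for_speech(audio)
    return process_for_music(audio)

audio = np.random.default_rng(0).standard_normal(1024)
print(apply_processing(audio, "speech").shape)
```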
EEE18. A non-transitory computer readable medium storing a computer program which, when executed by a processor, controls a device to perform a process comprising the method of any one of EEEs 1-17.
EEE19. An apparatus for audio processing, the apparatus comprising:
a processor; and
a memory,
wherein the processor is configured to control the apparatus to receive an audio signal,
wherein the processor is configured to control the apparatus to perform feature extraction on the audio signal to extract a plurality of features,
wherein the processor is configured to control the apparatus to classify the plurality of features according to a first audio classification model to generate a first confidence score,
wherein the processor is configured to control the apparatus to classify the plurality of features according to a second audio classification model to generate a second confidence score,
wherein the processor is configured to control the apparatus to calculate a pilot signal by combining a first component of the first confidence score smoothed over a first time period and a second component of the first confidence score smoothed over a second time period, wherein the second time period is shorter than the first time period,
wherein the processor is configured to control the apparatus to calculate a final confidence score from the pilot signal, the first confidence score, and the second confidence score, and
wherein the processor is configured to control the apparatus to output a classification of the audio signal according to the final confidence score.
EEE20. The apparatus of EEE 19, wherein the second audio classification model is a singing classification model,
wherein performing feature extraction includes determining a plurality of subband energies for a plurality of subbands of the audio signal, and
wherein the plurality of subbands includes a first subband below 300 Hz, a second subband between 300 Hz and 1000 Hz, a third subband between 1 kHz and 3 kHz, and a fourth subband between 3 kHz and 6 kHz.

Claims (16)

1. A method of audio processing, the method comprising:
receiving an audio signal;
performing feature extraction on the audio signal to extract a plurality of features;
classifying the plurality of features according to a first audio classification model to generate a first set of confidence scores;
classifying the plurality of features according to a second audio classification model to generate a second confidence score;
calculating a pilot signal by combining a first confidence score of the first set of confidence scores and another confidence score of the first set of confidence scores;
calculating a final confidence score from the pilot signal, the first set of confidence scores, and the second confidence score; and
outputting a classification of the audio signal according to the final confidence score.
2. The method of claim 1, wherein a plurality of models includes a first set of models and the second audio classification model, wherein the first set of models includes the first audio classification model, wherein classifying the plurality of features according to the first audio classification model to generate the first set of confidence scores includes:
classifying the plurality of features according to the first set of models to generate the first set of confidence scores.
3. The method of claim 2, wherein the first set of models includes a speech classification model and a music classification model.
4. The method of any one of claims 1 to 3, wherein the second audio classification model is a vocal classification model.
5. The method of any of claims 1 to 4, wherein performing feature extraction comprises determining a plurality of subband energies for a plurality of subbands of the audio signal.
6. The method of claim 5, wherein the plurality of subbands comprises a first subband below 300 Hz, a second subband between 300 Hz and 1000 Hz, a third subband between 1 kHz and 3 kHz, and a fourth subband between 3 kHz and 6 kHz.
7. The method of any of claims 1 to 6, wherein classifying the plurality of features according to the first audio classification model comprises:
classifying the plurality of features according to the first audio classification model using at least one of an adaptive boosting machine learning process and a deep neural network machine learning process.
8. The method of any of claims 1-7, wherein calculating the pilot signal comprises:
calculating the pilot signal by combining the first confidence score of the first set of confidence scores smoothed over a first time period and the other confidence score of the first set of confidence scores smoothed over a second time period, wherein the second time period is shorter than the first time period.
9. The method of claim 8, wherein the first time period is at least twice the second time period.
10. The method of claim 8 or claim 9, wherein the first confidence score in the first set of confidence scores smoothed over the first time period is calculated based on a first smoothing coefficient, a current music confidence score for a current frame of the audio signal, and a previously smoothed music confidence score for a previous frame of the audio signal; and
wherein the other confidence score in the first set of confidence scores smoothed over the second time period is calculated based on a second smoothing coefficient, a current sound effect and noise confidence score of a current frame of the audio signal, and a previously smoothed sound effect and noise confidence score of a previous frame of the audio signal.
11. The method of any of claims 1 to 10, wherein calculating the pilot signal comprises:
applying a first sigmoid function to the first confidence score of the first set of confidence scores smoothed over the first time period; and
applying a second sigmoid function to the other confidence score in the first set of confidence scores smoothed over the second time period.
12. The method of any one of claims 1 to 11, further comprising:
selectively performing one of a first process and a second process on the audio signal based on the classification to generate a processed audio signal, wherein the first process is performed when the classification is a first classification and the second process is performed when the classification is a second classification.
13. A non-transitory computer readable medium storing a computer program which, when executed by a processor, controls an apparatus to perform a process comprising the method of any of claims 1 to 12.
14. An apparatus for audio processing, the apparatus comprising:
a processor; and
a memory,
wherein the processor is configured to control the apparatus to receive an audio signal,
wherein the processor is configured to control the apparatus to perform feature extraction on the audio signal to extract a plurality of features,
wherein the processor is configured to control the apparatus to classify the plurality of features according to a first audio classification model to generate a first set of confidence scores,
wherein the processor is configured to control the apparatus to classify the plurality of features according to a second audio classification model to generate a second confidence score,
wherein the processor is configured to control the apparatus to calculate a pilot signal by combining a first confidence score of the first set of confidence scores and another confidence score of the first set of confidence scores,
wherein the processor is configured to control the apparatus to calculate a final confidence score from the pilot signal, the first set of confidence scores, and the second confidence score, and
wherein the processor is configured to control the apparatus to output a classification of the audio signal according to the final confidence score.
15. The apparatus of claim 14, wherein the second audio classification model is a singing classification model,
wherein performing feature extraction includes determining a plurality of subband energies for a plurality of subbands of the audio signal, and
wherein the plurality of subbands includes a first subband below 300 Hz, a second subband between 300 Hz and 1000 Hz, a third subband between 1 kHz and 3 kHz, and a fourth subband between 3 kHz and 6 kHz.
16. The apparatus of claim 14 or 15, wherein calculating the pilot signal comprises:
calculating the pilot signal by combining the first confidence score of the first set of confidence scores smoothed over a first time period and the other confidence score of the first set of confidence scores smoothed over a second time period, wherein the second time period is shorter than the first time period.