JP2022513184A

JP2022513184A - Dual-ended media intelligence

Info

Publication number: JP2022513184A
Application number: JP2021532235A
Authority: JP
Inventors: バイ，イエンニーン; ウィリアムジェラード，マーク; ハン，リチャード; ヴォルタース，マルティン
Original assignee: ドルビーラボラトリーズライセンシングコーポレイション; ドルビー・インターナショナル・アーベー
Priority date: 2018-12-13
Filing date: 2019-12-10
Publication date: 2022-02-07
Anticipated expiration: 2039-12-10
Also published as: US20220059102A1; BR112021009667A2; RU2768224C1; CN113168839B; KR20210102899A; JP7455836B2; CN113168839A; EP3895164B1; EP3895164A1; WO2020123424A1

Abstract

A method of encoding audio content comprises performing a content analysis of the audio content, generating classification information indicative of a content type of the audio content based on the content analysis, encoding the audio content and the classification information in a bitstream, and outputting the bitstream. A method of decoding audio content from a bitstream that includes the audio content and classification information for the audio content, the classification information being indicative of a content classification of the audio content, comprises receiving the bitstream, decoding the audio content and the classification information, and selecting, based on the classification information, a post-processing mode for performing post-processing of the decoded audio content. Selecting the post-processing mode can involve calculating one or more control weights for post-processing of the decoded audio content based on the classification information.

Description

The present disclosure relates to methods of encoding audio content into a bitstream and methods of decoding audio content from a bitstream. It relates in particular to such methods in which classification information indicative of a content type of the audio content is transmitted in the bitstream.

The perceived benefit of audio signal post-processing can be improved when the audio signal processing algorithm is aware of the content being processed. For example, accurate dialog detection by a dialog enhancer is improved when there is a measured high confidence of dialog in the current audio frame. Also, a virtualizer may be disabled in the presence of music content to preserve the timbre of the music, or a dynamic equalizer designed to timbre-match music (e.g., Dolby® Volume Intelligent Equalizer) may be disabled in the presence of dialog in a movie to preserve the timbre of the speech.

Typically, users may be required to switch profiles such as "movie" or "music" to get the best settings on their playback device, but this often requires access to advanced settings or UIs that many users may not be aware of or comfortable with.

One approach to addressing this problem would be to use a content analysis tool (e.g., Dolby's Media Intelligence) to detect features in the audio signal in order to determine how likely it is that certain content types are present in the audio stream.

Current playback devices, such as mobile phones, which can play a variety of content including movies and music, can use a content analysis tool (e.g., Dolby's Media Intelligence) to determine confidence values for the presence of certain content types in the audio stream. The content analysis tool can return confidence values (confidence scores) for the presence of "music", "speech", or "background effects". These confidence values can then be used in combination to return algorithm steering weights, which in turn can be used to control certain post-processing features (e.g., their intensity).

The method described above is a "single-ended" solution that can be performed within the decoder or within a separate post-processing library that takes in PCM audio data. This single-ended implementation can be effective at steering post-processing algorithms, but it adds significant computational complexity to the playback device, so the real-time content analysis may be limited by the capabilities available on the playback device.

Thus, there is a need for improved methods and devices for content-aware processing of audio content.

The present disclosure provides methods of encoding audio content and methods of decoding audio content, having the features of the respective independent claims.

One aspect of the disclosure relates to a method of encoding audio content. The method may include performing a content analysis of the audio content. The content analysis may be performed, for example, by applying Dolby's Media Intelligence tool. The content analysis may also be performed for each of a plurality of consecutive windows, each window including a predetermined number of consecutive (audio) frames. The content analysis may be based on one or more calculations of likelihood or confidence that rely on determinable features within the audio content. These calculations may be dynamic and can be adjusted to amplify or de-amplify a specific likelihood. In more general terms, the content analysis may be adaptive and/or may have been trained beforehand using predetermined audio content. The content analysis may use a look-ahead buffer to reduce latency. Additionally or alternatively, encoding latency may be introduced to accommodate the processing time required for the content analysis. The content analysis may also be performed in multiple passes. The method may further include generating classification information indicative of a content type of the audio content based on (a result of) the content analysis. Generating the classification information may also be based on a detection of scene transitions in the audio content (or a manual indication of scene transitions). For example, the rate of change of the confidence values included in the classification information may be higher (i.e., greater than in the steady state) if a scene transition is detected or indicated. The method may further include encoding the audio content and the classification information, for example the confidence values, into a bitstream. The encoded audio content and the encoded classification information may be multiplexed. The method may further include outputting the bitstream.
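The per-window analysis described above could be sketched as follows. This is only an illustration under stated assumptions: `analyze_in_windows` and the `analyze` callback are hypothetical names, the callback stands in for an unspecified content analysis tool, and dropping a trailing incomplete window is a simplification that the disclosure does not prescribe.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
Result = TypeVar("Result")


def analyze_in_windows(
    frames: Sequence[Frame],
    frames_per_window: int,
    analyze: Callable[[Sequence[Frame]], Result],
) -> List[Result]:
    """Run `analyze` once per window of `frames_per_window` consecutive frames.

    `analyze` is a hypothetical stand-in for the content analysis tool.
    A trailing group of fewer than `frames_per_window` frames is skipped
    (a simplifying assumption of this sketch).
    """
    if frames_per_window <= 0:
        raise ValueError("frames_per_window must be positive")
    results = []
    last_start = len(frames) - frames_per_window
    for start in range(0, last_start + 1, frames_per_window):
        window = frames[start:start + frames_per_window]
        results.append(analyze(window))
    return results
```

In a real encoder the callback would return classification information (for example, confidence values) for each window rather than an arbitrary result.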

In the context of the present disclosure, a "content type" of audio content means a type of content that can be played on a playback device and that can be distinguished by the human ear through one or more audio characteristics of that content type. For example, music can be distinguished from speech or noise because it involves a different audio frequency bandwidth, a different power distribution of the audio signal over frequency, different tone durations, different types and numbers of fundamental and dominant frequencies, and so on.

By performing the content analysis on the encoder side and encoding the resulting classification information into the bitstream, the computational burden on the decoder can be significantly relaxed. In addition, the superior computational capabilities of the encoder can be used to perform more complex and more accurate content analysis. Apart from accommodating the different computational capabilities of encoder and decoder, the proposed method gives the decoder side additional flexibility in the audio post-processing of the decoded audio. For example, the post-processing may be customized according to the device type of the device implementing the decoder and/or the user's personal preferences.

In some embodiments, the content analysis may be based at least in part on metadata for the audio content. This provides additional control over the content analysis, for example by the content creator. At the same time, the accuracy of the content analysis can be improved by providing appropriate metadata.

Another aspect of the disclosure relates to a further method of encoding audio content. The method may include receiving a user input relating to a content type of the audio content. The user input may include, for example, manual labels or manual confidence values. The method may further include generating classification information indicative of the content type of the audio content based on the user input. The method may further include encoding the audio content and the classification information into a bitstream. For example, the labels or confidence values may be encoded in the bitstream. The method may further include outputting the bitstream. This method provides additional control over the content analysis, for example by the content creator.

In some embodiments, the user input may include one or more of a label indicating that the audio content is of a given content type, and one or more confidence values, each confidence value being associated with a respective content type and giving an indication of the likelihood that the audio content is of the respective content type. The user of the encoder can thereby be given additional control over the post-processing performed on the decoder side. This makes it possible, for example, to ensure that the content creator's artistic intent is preserved by the post-processing.

Another aspect of the disclosure relates to a further method of encoding audio content. The audio content may be provided in a stream of audio content as part of an audio program. The method may include receiving a service type indication that indicates a service type (e.g., audio program type) of the audio content. The service type may be, for example, a music service or a news (newscast) service or channel. The method may further include performing a content analysis of the audio content based at least in part on the service type indication. The method may further include generating classification information indicative of a content type of the audio content based on (a result of) the content analysis. Confidence values, as examples of classification information, may also be provided directly by the content creator together with the audio content. Whether confidence values and the like provided by the content creator are taken into account may depend on the service type indication. The method may further include encoding the audio content and the classification information into a bitstream. The method may further include outputting the bitstream.

By taking the service type indication into account, the encoder can be assisted in performing the content analysis. Moreover, the user on the encoder side can be given additional control over the audio post-processing on the decoder side, which makes it possible, for example, to ensure that the content creator's artistic intent is preserved by the post-processing.

In some embodiments, the method may further include determining, based on the service type indication, whether the service type of the audio content is a music service. The method may further include, in response to determining that the service type of the audio content is a music service, generating classification information indicating that the content type of the audio content is music content (content type "music"). This amounts to setting the confidence value for content type "music" to the highest possible value (e.g., 1) and setting any other confidence values to zero.
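A minimal sketch of this rule, assuming three content types and confidence values in the range 0 to 1. The function name, the string `"music"` for the service type, and the label set are illustrative assumptions, not identifiers from the disclosure.

```python
from typing import Dict, Optional

# Assumed label set for this sketch.
CONTENT_TYPES = ("music", "speech", "effects")


def classify_from_service_type(service_type: str) -> Optional[Dict[str, float]]:
    """Return fixed confidence values for a music service.

    For a music service, the "music" confidence is set to the maximum (1.0)
    and all other confidences to zero. For any other service type, None is
    returned so that the caller can fall back to regular content analysis.
    """
    if service_type == "music":
        return {ct: (1.0 if ct == "music" else 0.0) for ct in CONTENT_TYPES}
    return None
```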

In some embodiments, the method may further include determining, based on the service type indication, whether the service type of the audio content is a newscast service. The method may further include, in response to determining that the service type of the audio content is a newscast service, adapting the content analysis so that it is more likely to indicate that the audio content is speech content. This may be achieved by adapting one or more calculations (calculation algorithms) of the content analysis to increase the likelihood or confidence for speech content (content type "speech") in the result of the content analysis, and/or by adapting the one or more calculations to decrease the likelihood or confidence for content types other than speech content.
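One simple way to picture such an adaptation is a gain applied to the confidence values, as in the sketch below. This is an assumption for illustration only: the disclosure speaks of adapting the calculations of the content analysis itself, whereas this sketch scales the outputs, and the gain values 1.5 and 0.5 are arbitrary.

```python
from typing import Dict


def adapt_for_newscast(
    confidences: Dict[str, float],
    speech_gain: float = 1.5,   # assumed boost for "speech"
    other_gain: float = 0.5,    # assumed attenuation for other types
) -> Dict[str, float]:
    """Bias confidence values towards speech and clamp them to [0, 1]."""
    adapted = {}
    for content_type, value in confidences.items():
        gain = speech_gain if content_type == "speech" else other_gain
        adapted[content_type] = min(1.0, max(0.0, value * gain))
    return adapted
```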

In some embodiments, the service type indication may be provided on a frame-by-frame basis.

Another aspect of the disclosure relates to a further method of encoding audio content. The audio content may be provided on a file basis, and the method may be performed on a file basis. The files may include metadata for their respective audio content. The metadata may include markers, labels, tags, and the like. The method may include performing a content analysis of the audio content based at least in part on the metadata for the audio content. The method may further include generating classification information indicative of a content type of the audio content based on (a result of) the content analysis. The method may further include encoding the audio content and the classification information into a bitstream. The method may further include outputting the bitstream.

By taking the file metadata into account, the encoder can be assisted in performing the content analysis. Moreover, the user on the encoder side can be given additional control over the audio post-processing on the decoder side, which makes it possible, for example, to ensure that the content creator's artistic intent is preserved by the post-processing.

In some embodiments, the metadata may include a file content type indication that indicates a file content type of the file. The file content type may be a music file (file content type "music file"), a newscast file or clip (file content type "newscast file"), or a file containing dynamic (non-static or mixed-source) content (file content type "dynamic content"; for example, a movie in the musical genre that transitions frequently, e.g., once every few minutes, between spoken scenes and music or song scenes). The file content type may be the same (uniform) for the whole file or may change between portions of the file. The content analysis may then be based at least in part on the file content type indication.

In some embodiments, the method may further include determining, based on the file content type indication, whether the file content type of the file is a music file. The method may further include, in response to determining that the file content type of the file is a music file, generating classification information indicating that the content type of the audio content is music content.

In some embodiments, the method may further include determining, based on the file content type indication, whether the file content type of the file is a newscast file. The method may further include, in response to determining that the file content type of the file is a newscast file, adapting the content analysis so that it is more likely to indicate that the audio content is speech content. This may be achieved by adapting one or more calculations (calculation algorithms) of the content analysis to increase the likelihood or confidence for speech content in the content analysis, and/or by adapting the one or more calculations to decrease the likelihood or confidence for content types other than speech content.

In some embodiments, the method may further include determining, based on the file content type indication, whether the file content type of the file is dynamic content. The method may further include, in response to determining that the file content type of the file is dynamic content, adapting the content analysis to allow for a higher transition rate between different content types. For example, the content type may be allowed to transition more frequently (i.e., more frequently than in the steady state) between content types, for example between music and non-music. Further, smoothing (time smoothing) of the classification information may be disabled for dynamic content (i.e., dynamic file content).

In some embodiments, in the method according to any of the above aspects or embodiments, the classification information may include one or more confidence values. Each confidence value may be associated with a respective content type and may give an indication of the likelihood that the audio content is of the respective content type.

In some embodiments, in the method according to any of the above aspects or embodiments, the content types may include one or more of music content, speech content, or effects (e.g., background effects) content. The content types may further include crowd noise or cheering.

In some embodiments, the method according to any of the above aspects or embodiments may further include encoding an indication of scene transitions in the audio content into the bitstream. The indication of scene transitions may include one or more scene reset flags, each indicating a respective scene transition. The scene transitions may be detected at the encoder or may be provided externally, for example by the content creator. In the former case, the method would include a step of detecting scene transitions in the audio content; in the latter case, it would include a step of receiving a (manual) indication of scene transitions in the audio content. By indicating scene transitions in the bitstream, audible artifacts on the decoder side that could result from inappropriate post-processing across scene transitions can be avoided.

In some embodiments, the method according to any of the above aspects or embodiments may further include smoothing (time smoothing) of the classification information before encoding. For example, the confidence values may be smoothed over time. Smoothing may be disabled depending on circumstances, for example at scene transitions, for audio content flagged as dynamic (non-static), or in accordance with control input or metadata. Smoothing the classification information can improve the stability and continuity of the audio post-processing on the decoder side.
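As an illustration of time smoothing that is bypassed at scene transitions, the sketch below applies a one-pole (exponential) smoother to a sequence of confidence values. The choice of a one-pole filter, the coefficient `alpha`, and the representation of scene transitions as a set of indices are assumptions of this sketch; the disclosure does not specify the smoothing filter.

```python
from typing import Iterable, List, Optional, Sequence


def smooth_confidences(
    values: Sequence[float],
    alpha: float = 0.9,
    resets: Optional[Iterable[int]] = None,
) -> List[float]:
    """Exponentially smooth confidence values over time.

    alpha:  weight of the previous smoothed value (assumed parameter).
    resets: indices at which a scene transition is flagged; at these
            indices smoothing is bypassed and the raw value is used.
    """
    reset_indices = set(resets or ())
    smoothed: List[float] = []
    previous: Optional[float] = None
    for index, value in enumerate(values):
        if previous is None or index in reset_indices:
            previous = value  # start of stream or scene reset
        else:
            previous = alpha * previous + (1.0 - alpha) * value
        smoothed.append(previous)
    return smoothed
```

Without a reset, a jump in confidence is only followed gradually; with a reset at the jump, the smoothed value follows immediately, which matches the idea that the rate of change may be higher at scene transitions.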

In some embodiments, the method according to any of the above aspects or embodiments may further include quantizing the classification information before encoding. For example, the confidence values may be quantized. This reduces the bandwidth required for transmitting the classification information in the bitstream.
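A uniform quantizer is one plausible way to do this. The sketch below assumes confidence values in the range 0 to 1 and a 3-bit index; both the bit width and the uniform step size are assumptions, since the text here does not specify the quantization scheme.

```python
def quantize_confidence(confidence: float, bits: int = 3) -> int:
    """Map a confidence in [0, 1] to an integer index in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    clamped = min(1.0, max(0.0, confidence))
    # Adding 0.5 before truncation rounds to the nearest index.
    return int(clamped * levels + 0.5)


def dequantize_confidence(index: int, bits: int = 3) -> float:
    """Map an integer index back to a confidence in [0, 1]."""
    levels = (1 << bits) - 1
    return index / levels
```

With this scheme the reconstruction error is at most half a quantization step, that is 1/14 for 3 bits.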

In some embodiments, the method according to any of the above aspects or embodiments may further include encoding the classification information into a specific data field in a packet of the bitstream. The bitstream may be, for example, an AC-4 (Dolby® AC-4) bitstream. The specific data field may be a Media Intelligence (MI) data field. The MI data field may include any, some, or all of the following fields: b_mi_data_present, music_confidence, speech_confidence, effects_confidence, b_prog_switch, b_more_mi_data_present, more_mi_data.
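To make the field list concrete, the following sketch serializes the named fields into a string of bits and parses them back. Only the field names come from the text above. The field order (taken from the order of the list), the 3-bit width of each confidence field, the single-bit flags, and the omission of `more_mi_data` are assumptions of this sketch and should not be read as the actual AC-4 syntax.

```python
from typing import Dict, Optional, Union

CONF_FIELDS = ("music_confidence", "speech_confidence", "effects_confidence")


def pack_mi_data(music: int, speech: int, effects: int,
                 prog_switch: bool, conf_bits: int = 3) -> str:
    """Serialize MI fields as a '0'/'1' string (assumed layout and widths)."""
    bits = "1"  # b_mi_data_present
    for value in (music, speech, effects):
        if not 0 <= value < (1 << conf_bits):
            raise ValueError("confidence index out of range")
        bits += format(value, f"0{conf_bits}b")
    bits += "1" if prog_switch else "0"  # b_prog_switch
    bits += "0"  # b_more_mi_data_present (no more_mi_data in this sketch)
    return bits


def unpack_mi_data(bits: str, conf_bits: int = 3
                   ) -> Optional[Dict[str, Union[int, bool]]]:
    """Parse the layout written by pack_mi_data; None if no MI data present."""
    if not bits or bits[0] != "1":
        return None
    position = 1
    fields: Dict[str, Union[int, bool]] = {}
    for name in CONF_FIELDS:
        fields[name] = int(bits[position:position + conf_bits], 2)
        position += conf_bits
    fields["b_prog_switch"] = bits[position] == "1"
    fields["b_more_mi_data_present"] = bits[position + 1] == "1"
    return fields
```

Under these assumptions one MI payload occupies 12 bits, which illustrates why quantized confidence values keep the bitstream overhead small.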

Another aspect of the disclosure relates to a method of decoding audio content from a bitstream that includes the audio content and classification information for the audio content. The classification information may be indicative of a content classification of the audio content. The content classification may be based on a content analysis and, optionally, on a user input relating, for example, to a content type of the audio content (where both the content analysis and the provision of input by the user are performed at the encoder). The method may include receiving the bitstream. The method may further include decoding the audio content and the classification information. The method may further include selecting, based on the classification information, a post-processing mode for performing post-processing of the decoded audio content. In other words, the decoding method may select a post-processing of the decoded audio content based on the classification information.

Providing the classification information to the decoder allows the decoder to forgo the content analysis, which significantly relaxes the computational burden on the decoder. Further, the decoder is given additional flexibility in that it can decide on a suitable post-processing mode based on the classification information. In doing so, additional information such as the device type and the user's preferences may be taken into account.

In some embodiments, the decoding method may further include calculating one or more control weights for the post-processing of the decoded audio content based on the classification information.

In some embodiments, the selection of the post-processing mode may further be based on a user input.

In some embodiments, the audio content is channel-based. For example, the audio content may be audio content with two or more channels. The post-processing of the decoded audio content may include upmixing the channel-based audio content to upmixed channel-based audio content. For example, two-channel audio content may be upmixed to 5.1-channel, 7.1-channel, or 9.1-channel audio content. The method may further include applying a virtualizer to the upmixed channel-based audio content to obtain virtualized upmixed channel-based audio content for a speaker array of a desired number of channels. For example, the virtualization may provide upmixed 5.1-channel, 7.1-channel, or 9.1-channel audio content to a two-channel speaker array, such as headphones. More generally, the virtualization may provide upmixed 5.1-channel audio content to a 2-channel or 5.1-channel speaker array, upmixed 7.1-channel audio content to a 2-channel, 5.1-channel, or 7.1-channel speaker array, and upmixed 9.1-channel audio content to a 2-channel, 5.1-channel, 7.1-channel, or 9.1-channel speaker array.

In some embodiments, the method may further include calculating one or more control weights for the post-processing of the decoded audio content based on the classification information.

In some embodiments, the classification information (encoded in the bitstream received by the decoder) may include one or more confidence values, each confidence value being associated with a respective content type and giving an indication of the likelihood that the audio content is of the respective content type. The control weights may be calculated based on the confidence values.

In some embodiments, the method may further include routing an output of the virtualizer to the speaker array, and calculating respective control weights for the upmixer and the virtualizer based on the classification information.

In some embodiments, the method may further include, after applying the virtualizer, applying a crossfader to the channel-based audio content and the virtualized upmixed audio content, and routing the output of the crossfader to the speaker array. In this embodiment, the method may further include calculating respective control weights for the upmixer and the crossfader based on the classification information.

In some embodiments, the control weights may control modules other than the upmixer, crossfader, or virtualizer. Likewise, several alternative ways of calculating the control weights are possible. Embodiments concerning the number and types of control weights, and the ways of calculating them, are described below in connection with the other aspects of the present disclosure that follow. These embodiments are not limited to those aspects, however, and apply to any method of decoding audio content disclosed herein.

Another aspect of the present disclosure relates to a further method of decoding audio content from a bitstream that includes the audio content and classification information for the audio content. The classification information may indicate a content classification of the audio content. The method may include receiving the bitstream. The method may further include decoding the audio content and the classification information. The method may further include calculating, based on the classification information, one or more control weights for post-processing of the decoded audio content. The control weights may be control weights for post-processing algorithms/modules and may be referred to as algorithm steering weights. The control weights may control the strength of the respective post-processing algorithms.

In some embodiments, the classification information may include one or more confidence values. Each confidence value is associated with a respective content type and indicates the likelihood that the audio content is of that content type. The control weights may be calculated based on the confidence values.

In some embodiments, the control weights may be control weights for respective modules (algorithms) for post-processing of the decoded audio content. The modules (algorithms) for post-processing may include, for example, one or more of: an (intelligent/dynamic) equalizer, an (adaptive) virtualizer, a surround processing module, a dialogue enhancer, an upmixer, and a crossfader.

In some embodiments, the control weights may include one or more of: a control weight for an equalizer, a control weight for a virtualizer, a control weight for a surround processor, a control weight for a dialogue enhancer, a control weight for an upmixer, and a control weight for a crossfader. The equalizer may be, for example, an intelligent equalizer (IEQ). The virtualizer may be, for example, an adaptive virtualizer.

In some embodiments, the calculation of the control weights may depend on the device type of the device that performs the decoding. In other words, the calculation may be endpoint-specific or personalized. For example, the decoder side may implement a set of endpoint-specific processes/modules/algorithms for post-processing, and the parameters (control weights) for these processes/modules/algorithms may be determined from the confidence values in an endpoint-specific manner. The individual capabilities of each device can thereby be taken into account when performing audio post-processing. For example, a mobile device and a soundbar device may apply different post-processing.

In some embodiments, the calculation of the control weights may further be based on user input. The user input may override, or partially override, the confidence-value-based calculation. For example, virtualization may be applied to speech if the user so desires, or stereo widening, upmixing, and/or virtualization may be applied for a PC user if the user so desires.
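A partial user override of confidence-based control weights could be sketched as follows. This is only an illustration: the function name `apply_user_overrides`, the dictionary representation of the weights, and the clamping to the range 0 to 1 are assumptions and are not taken from the disclosure.

```python
from typing import Dict, Optional


def apply_user_overrides(
    computed: Dict[str, float],
    user_overrides: Dict[str, Optional[float]],
) -> Dict[str, float]:
    """Return control weights in which user-supplied values replace computed ones.

    Hypothetical helper: a weight is overridden only when the user supplied a
    value for it (a partial override); all other weights keep the value
    computed from the confidence values.
    """
    merged = dict(computed)
    for name, value in user_overrides.items():
        if value is not None:
            # Clamp to [0, 1], assuming control weights live in that range.
            merged[name] = min(1.0, max(0.0, value))
    return merged
```

For instance, a user who wants speech to be virtualized could pass `{"virtualizer": 1.0}` and leave every other weight to the confidence-based calculation.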

In some embodiments, the calculation of the control weights may further be based on the number of channels of the audio content. The calculation of the control weights may also be based on one or more bitstream parameters (e.g., parameters carried by the bitstream and extractable from it).

In some embodiments, the method may include performing a content analysis of the audio content to determine one or more additional confidence values (e.g., confidence values for content types not considered on the encoder side). This content analysis may proceed in the same manner as described above for the encoder side. The calculation of the control weights may then further be based on the one or more additional confidence values.

In some embodiments, the control weights may include a control weight for the virtualizer. The control weight for the virtualizer may be calculated such that the virtualizer is disabled if the classification information indicates that the content type of the audio content is music, or is likely to be music. This may be the case, for example, if the confidence value for music exceeds a given threshold. The musical timbre can thereby be preserved.

In some embodiments, the control weight for the virtualizer may be calculated such that the coefficients of the virtualizer scale between pass-through and full virtualization. For example, the control weight for the virtualizer may be calculated as 1 - music_confidence * {1 - max[effects_confidence, speech_confidence]^2}.
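The example formula for the virtualizer control weight can be evaluated as in the following minimal sketch. The function name and the reading of the weight as 0 for pass-through and 1 for full virtualization are assumptions made for illustration; the three confidence inputs are taken to lie between 0 and 1.

```python
def virtualizer_weight(
    music_confidence: float,
    speech_confidence: float,
    effects_confidence: float,
) -> float:
    """Evaluate 1 - music * (1 - max(effects, speech)^2).

    Assumed interpretation: 0.0 means pass-through, 1.0 means full
    virtualization.
    """
    dominant_non_music = max(effects_confidence, speech_confidence)
    return 1.0 - music_confidence * (1.0 - dominant_non_music ** 2)
```

With this formula, content classified purely as music (music confidence 1, others 0) yields a weight of 0, while content with no music confidence yields a weight of 1 regardless of the other values.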

In some embodiments, the control weight for the virtualizer may further depend on (e.g., be determined based on) the number of channels in the audio content (i.e., the channel count) or on other bitstream parameter(s). For example, the control weight (weighting factor) for virtualization may be determined from the confidence values only for stereo content, while a fixed control weight (e.g., equal to 1) may be applied to all multichannel content other than stereo content (i.e., for channel counts greater than 2).
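The channel-count dependence described above might be expressed as the following sketch. The function name and parameters are hypothetical. The text does not say how content with fewer than two channels is handled, so rejecting it here is purely an assumption.

```python
def virtualizer_weight_for_channels(
    channel_count: int,
    confidence_based_weight: float,
    fixed_weight: float = 1.0,
) -> float:
    """Select the virtualizer control weight according to the channel count.

    Stereo content uses the weight derived from the confidence values; content
    with more than two channels uses a fixed weight (1.0 in the example above).
    """
    if channel_count < 2:
        # Assumption: this case is not covered by the description.
        raise ValueError("channel_count must be at least 2")
    if channel_count == 2:
        return confidence_based_weight
    return fixed_weight
```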

In some embodiments, the control weights may include a control weight for a dialogue enhancer. The control weight for the dialogue enhancer may be calculated such that dialogue enhancement by the dialogue enhancer is enabled or strengthened if the classification information indicates that the content type of the audio content is speech, or is likely to be speech. This may be the case, for example, if the confidence value for speech exceeds a given threshold. Dialogue enhancement can thereby be restricted to the sections of audio content that actually benefit from it, which also saves computational capacity.

In some embodiments, the control weights may include a control weight for a dynamic equalizer. The control weight for the dynamic equalizer may be calculated such that the dynamic equalizer is disabled if the classification information indicates that the content type of the audio content is speech, or is likely to be speech. This may be the case, for example, if the confidence value for speech exceeds a given threshold. Unwanted changes to the timbre of speech can thereby be avoided.
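The threshold behaviour described for the dialogue enhancer and the dynamic equalizer can be illustrated as follows. The threshold value of 0.5, the function names, and the choice of hard 0/1 weights are assumptions; the text only states that a given threshold on the speech confidence value may be used.

```python
SPEECH_THRESHOLD = 0.5  # hypothetical value; the text only says "a given threshold"


def dialogue_enhancer_weight(
    speech_confidence: float, threshold: float = SPEECH_THRESHOLD
) -> float:
    """Enable dialogue enhancement (weight 1.0) only when speech is likely."""
    return 1.0 if speech_confidence > threshold else 0.0


def dynamic_equalizer_weight(
    speech_confidence: float, threshold: float = SPEECH_THRESHOLD
) -> float:
    """Disable the dynamic equalizer (weight 0.0) when speech is likely."""
    return 0.0 if speech_confidence > threshold else 1.0
```

The two weights move in opposite directions for the same speech confidence: likely speech switches the dialogue enhancer on and the dynamic equalizer off.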

In some embodiments, the method may further include smoothing (time smoothing) of the control weights. Smoothing may be disabled depending on circumstances, for example at scene transitions, or for audio content flagged as dynamic (non-static), in accordance with control input, metadata, and the like. Smoothing the control weights can improve the stability and continuity of audio post-processing.

In some embodiments, the smoothing of a control weight may depend on the particular control weight being smoothed. That is, the smoothing may differ between at least two control weights. For example, there may be no smoothing, or only slight smoothing, for the dialogue enhancer control weight, and/or stronger smoothing for the virtualizer control weight.
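One way to realize weight-specific time smoothing is a first-order recursive (one-pole) filter with a separate coefficient per control weight, as in the sketch below. The filter type, the class name `WeightSmoother`, the coefficient values, and the `reset` flag (used here to bypass smoothing, e.g. at a scene transition) are all assumptions chosen for illustration.

```python
from typing import Dict


class WeightSmoother:
    """Per-weight one-pole smoother: y[n] = a * y[n-1] + (1 - a) * x[n].

    A coefficient of 0.0 means no smoothing; values near 1.0 mean strong
    smoothing. Weights without a configured coefficient are not smoothed.
    """

    def __init__(self, coefficients: Dict[str, float]) -> None:
        self.coefficients = dict(coefficients)
        self.state: Dict[str, float] = {}

    def update(self, weights: Dict[str, float], reset: bool = False) -> Dict[str, float]:
        smoothed = {}
        for name, target in weights.items():
            a = self.coefficients.get(name, 0.0)
            previous = self.state.get(name)
            if reset or previous is None:
                # No history yet, or smoothing disabled for this update.
                value = target
            else:
                value = a * previous + (1.0 - a) * target
            self.state[name] = value
            smoothed[name] = value
        return smoothed
```

A decoder could, for example, configure `{"virtualizer": 0.9, "dialogue_enhancer": 0.0}` so that the virtualizer weight changes gradually while the dialogue enhancer weight follows its target immediately.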

In some embodiments, the smoothing of the control weights may depend on the device type of the device that performs the decoding. For example, the smoothing of the virtualizer control weight may differ between a mobile phone and a TV set.

In some embodiments, the method may further include applying a non-linear mapping function to the control weights to increase their continuity (e.g., stability). This may involve applying to the control weights a mapping function (e.g., a sigmoid function) that maps values close to the boundaries of the domain range of the control weights closer to the boundaries of the image range. The stability and continuity of audio post-processing can thereby be improved further.
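A sigmoid-style mapping with the property described above could look like the following sketch. The logistic function centred at 0.5, the `steepness` parameter, and the rescaling that pins 0 to 0 and 1 to 1 are assumptions; the text only names a sigmoid as one possible mapping function.

```python
import math


def sigmoid_map(weight: float, steepness: float = 10.0) -> float:
    """Map a control weight in [0, 1] to [0, 1] with a rescaled logistic curve.

    Values near 0 are pushed closer to 0 and values near 1 closer to 1, so
    small fluctuations near the boundaries have less effect on the output.
    """

    def logistic(x: float) -> float:
        return 1.0 / (1.0 + math.exp(-steepness * (x - 0.5)))

    low, high = logistic(0.0), logistic(1.0)
    # Rescale so that the endpoints of the domain map exactly to 0 and 1.
    return (logistic(weight) - low) / (high - low)
```

With the default steepness, an input of 0.9 maps to a value above 0.9 and an input of 0.1 maps to a value below 0.1, while 0.5 stays at 0.5.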

Another aspect of the present disclosure relates to a method of decoding audio content from a bitstream that includes two-channel audio content and classification information for the two-channel audio content. The bitstream may be, for example, an AC-4 bitstream. The classification information may indicate a content classification of the two-channel audio content. The method may include receiving the bitstream. The method may further include decoding the two-channel audio content and the classification information. The method may further include upmixing the two-channel audio content to obtain upmixed 5.1-channel audio content. The method may further include applying a virtualizer to the upmixed 5.1-channel audio content for 5.1 virtualization for a two-channel speaker array. The method may further include applying a crossfader to the two-channel audio content and the virtualized upmixed 5.1-channel audio content. The method may further include routing the output of the crossfader to the two-channel speaker array. Here, the method may include calculating respective control weights for the virtualizer and/or the crossfader based on the classification information. The virtualizer and the crossfader may operate under the control of their respective control weights.
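The crossfading step between the original two-channel signal and the virtualized upmixed signal can be sketched for a single channel as below. A linear crossfade is assumed here for simplicity; the disclosure does not specify the crossfade law, and the function name and the per-channel list representation are hypothetical.

```python
from typing import List, Sequence


def crossfade(dry: Sequence[float], wet: Sequence[float], weight: float) -> List[float]:
    """Linearly blend one channel of the original (dry) and virtualized (wet) signal.

    weight = 0.0 returns the original signal, weight = 1.0 returns the
    virtualized signal. Both inputs must contain the same number of samples.
    """
    if len(dry) != len(wet):
        raise ValueError("dry and wet signals must have the same length")
    return [(1.0 - weight) * d + weight * w for d, w in zip(dry, wet)]
```

In a two-channel setup the function would be called once for the left channel and once for the right channel, with the crossfader control weight derived from the classification information.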

Another aspect of the present disclosure relates to a further method of decoding audio content from a bitstream that includes two-channel audio content and classification information for the two-channel audio content. The bitstream may be, for example, an AC-4 bitstream. The classification information may indicate a content classification of the two-channel audio content. The method may include receiving the bitstream. The method may further include decoding the two-channel audio content and the classification information. The method may further include applying an upmixer to the two-channel audio content to upmix it to upmixed 5.1-channel audio content. The method may further include applying a virtualizer to the upmixed 5.1-channel audio content for 5.1 virtualization for a five-channel speaker array. The method may further include routing the output of the virtualizer to the five-channel speaker array. Here, the method may include calculating respective control weights for the upmixer and/or the virtualizer based on the classification information. The upmixer and the virtualizer may operate under the control of their respective control weights. The control weight for the upmixer may relate to an upmix weight.

Another aspect relates to an apparatus (e.g., an encoder or a decoder) that includes a processor coupled to a memory storing instructions for the processor. The processor may be adapted to carry out the methods according to any of the above aspects and their embodiments. Further aspects relate to a computer program including instructions that, when executed, cause a processor to carry out the methods according to any of the above aspects and their embodiments, and to a computer-readable storage medium storing that computer program.

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which like reference numbers indicate like or similar elements. The drawings schematically illustrate, in order:
An example of an encoder-decoder system according to embodiments of the present disclosure.
An example of a bitstream to which embodiments of the present disclosure can be applied.
An example of a data field for storing classification information of audio content according to embodiments of the present disclosure.
An example, in flowchart form, of a method of encoding audio content according to embodiments of the present disclosure.
An example of content analysis of audio content according to embodiments of the present disclosure.
Another example, in flowchart form, of a method of encoding audio content according to embodiments of the present disclosure.
Another example, in flowchart form, of a method of encoding audio content according to embodiments of the present disclosure.
Another example of content analysis of audio content according to embodiments of the present disclosure.
Yet another example, in flowchart form, of a method of encoding audio content according to embodiments of the present disclosure.
Yet another example of content analysis of audio content according to embodiments of the present disclosure.
An example, in flowchart form, of a method of decoding audio content according to embodiments of the present disclosure.
Another example, in flowchart form, of a method of decoding audio content according to embodiments of the present disclosure.
An example of control weight calculation according to embodiments of the present disclosure.
Another example, in flowchart form, of a method of decoding audio content according to embodiments of the present disclosure.
An example of the use of control weights in a decoder according to embodiments of the present disclosure.
Yet another example, in flowchart form, of a method of decoding audio content according to embodiments of the present disclosure.
Another example of the use of control weights in a decoder according to embodiments of the present disclosure.

As noted above, identical or like reference numbers in the present disclosure indicate identical or like elements, and repeated descriptions of them may be omitted for brevity.

Broadly speaking, the present disclosure proposes moving content analysis from the audio decoder to the audio encoder, thereby creating a dual-ended approach to audio post-processing. That is, at least part of the content analysis module is moved from the decoder to the encoder, and the audio stream (bitstream) is updated to carry the classification information (e.g., confidence values, confidence labels, or confidence scores) generated by (the part of) the content analysis module in the encoder. The weight calculation remains in the decoder, which operates on the classification information received with the audio stream.

An example of an encoder-decoder system 100 that implements the above scheme is shown in block diagram form in FIG. 1. The encoder-decoder system 100 includes an (audio) encoder 105 and an (audio) decoder 115. It is understood that the modules of the encoder 105 and the decoder 115 described below may be implemented, for example, by respective processors of respective computing devices.

The encoder 105 includes a content analysis module 120 and a multiplexer 130. As noted above, content analysis is thus now part of the encoder stage. The encoder 105 receives input audio content 101 to be encoded, possibly together with associated metadata and/or user input. The input audio content 101 is provided to the content analysis module 120 and to the multiplexer 130. The content analysis module 120 performs a content analysis of the audio content 101 (e.g., by applying Dolby's media intelligence tools) and derives classification information 125 for the audio content. The classification information 125 indicates the content type of the input audio content 101, as estimated by the content analysis. As described in more detail below, the classification information 125 may include one or more confidence values associated with respective content types (e.g., confidence values for "music", "speech", and "background effects"). In some embodiments, the confidence values may have a finer granularity. For example, instead of or in addition to a confidence value for the content type "music", the classification information 125 may include confidence values for one or more music genres (e.g., confidence values for the content types "classical music", "rock/pop music", "acoustic music", "electronic music", and so on). In some embodiments, the content analysis may further be based on metadata for the audio content and/or on user input (e.g., control input from a content creator).

The multiplexer 130 multiplexes the audio content and the classification information 125 into a bitstream 110. The audio content may be encoded according to a known audio coding method, for example according to the AC-4 coding standard. As a result, the audio content 101 and the classification information 125 may be said to be encoded in the bitstream 110, and the bitstream may be said to include the audio content and the associated classification information for that audio content. The bitstream 110 may then be provided to the decoder 115.

In some implementations, the content analysis in the encoder 105 of the encoder-decoder system 100 may be performed for each of a plurality of consecutive windows, each window including a predetermined number of consecutive (audio) frames.
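Grouping frames into consecutive analysis windows might be done as in the sketch below. The windows are assumed to be non-overlapping and a trailing incomplete window is dropped; neither detail is stated in the text, and the function name is hypothetical.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def group_frames_into_windows(
    frames: Sequence[Frame], frames_per_window: int
) -> List[List[Frame]]:
    """Split a frame sequence into consecutive windows of a fixed frame count.

    Assumptions: windows do not overlap, and frames left over at the end that
    do not fill a complete window are discarded.
    """
    if frames_per_window <= 0:
        raise ValueError("frames_per_window must be positive")
    last_start = len(frames) - frames_per_window
    return [
        list(frames[start:start + frames_per_window])
        for start in range(0, last_start + 1, frames_per_window)
    ]
```

Each returned window would then be passed to the content analysis to obtain one set of confidence values per window.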

The content analysis may be based on one or more calculations of the likelihood/confidence of respective content types, based on determinable features within the audio content.

For example, the content analysis may include the steps of pre-processing the audio content, feature extraction, and calculation of confidence values. The pre-processing may be optional and may include downmixing, re-framing, calculation of the amplitude spectrum, and the like. The feature extraction may extract/calculate a plurality of features (e.g., several hundred features) from the audio content. These features may include Mel-Frequency Cepstral Coefficients (MFCC), MFCC flux, zero-crossing rate, chroma, autocorrelation, and so on. The calculation that ultimately yields the confidence values may be performed, for example, by a trained machine learning network.
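As an illustration of one of the simpler features listed above, the zero-crossing rate of a frame can be computed as in the following sketch. This uses a common textbook definition (the fraction of adjacent sample pairs whose signs differ) and is not necessarily the variant used in the disclosed content analysis.

```python
from typing import Sequence


def zero_crossing_rate(samples: Sequence[float]) -> float:
    """Return the fraction of adjacent sample pairs that change sign.

    Samples equal to zero are treated as non-negative. Frames with fewer than
    two samples have no sample pairs and return 0.0.
    """
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)
```

Noisy or unvoiced material tends to produce a high value, whereas low-pitched tonal material produces a low one, which is why this feature is often used alongside spectral features.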

The calculations performed in the context of the content analysis (e.g., by a machine learning network) may be variable/adaptive. If the calculations are variable, adjusting them makes it possible to derive the classification information according to a preference for certain content types. For example, for given audio content, the (default) content analysis may return a confidence value of 0.7 for the content type "music", 0.15 for the content type "speech", and 0.15 for the content type "effects" (note that the confidence values in this example sum to 1). If the content analysis is adapted to have some preference for the content type "music" (i.e., if its calculations are adapted to this end), the adapted content analysis/calculations may yield, for example, a confidence value of 0.8 for the content type "music", 0.1 for the content type "speech", and 0.1 for the content type "effects". Further non-limiting examples in which the calculations are adapted are described later.

Furthermore, the content analysis (e.g., the machine learning network) may be adaptive and/or may have been trained in advance using predetermined audio content. For example, in a dual-ended system such as the encoder-decoder system 100, the content analysis can be developed further over time to improve the accuracy of feature labeling. Advances may come from increased complexity that becomes affordable through increased computational capacity on the encoding server and/or improvements in computer processor capability. The content analysis may also be improved over time through manual labeling of specific content types.

The encoder-side content analysis may use a look-ahead buffer or the like to reduce the latency of content type decisions. This addresses a known limitation of single-ended implementations, which need a fairly large audio frame to make a robust decision. For example, a 700 ms audio frame may be required to decide whether dialogue is present, at which point the dialogue confidence score lags the onset of speech by 700 ms and the beginning of a spoken phrase may be missed. Additionally or alternatively, encoding latency may be introduced to accommodate the processing time required for the content analysis.

In some implementations, the content analysis may be performed in multiple passes to improve the accuracy of content type decisions.

In general, the generation of the classification information may also be based on detection of scene transitions in the audio content (or on manual indication of scene transitions). To this end, the encoder 105 may include an additional reset detector for detecting such scene transitions/resets in the audio content. Manual labeling or additional reset/scene detection may be used to influence the rate of change of the confidence values of the content analysis. For example, the rate of change of the confidence values included in the classification information may be higher when a scene transition is detected/indicated (i.e., higher than in the steady state). In other words, when the audio program changes, the confidence values may be allowed to adapt faster than during the steady state of the audio program, to ensure that audible transitions between post-processing effects are minimized. In accordance with the scene detection, indications of scene transitions (e.g., one or more reset flags (scene transition flags), each indicating a respective scene transition) may be encoded/multiplexed into the bitstream 110 together with the classification information 125 (e.g., the confidence values).

The decoder 115 in the encoder-decoder system 100 includes a demultiplexer 160, a weight calculation module 170, and a post-processing module 180. The bitstream 110 received by the decoder 115 is demultiplexed in the demultiplexer 160, and the classification information 125 and the audio content are extracted after decoding in accordance with a known audio decoding method, such as decoding in accordance with the AC-4 coding standard. As a result, the audio content and the classification information 125 may be said to be decoded from the bitstream 110. The decoded audio content is provided to the post-processing module 180, which performs post-processing of the decoded audio content. For this purpose, the decoder 115 selects a post-processing mode for the post-processing module 180 based on the classification information 125 extracted from the bitstream 110. More specifically, the classification information 125 extracted from the bitstream 110 is provided to the weight calculation module 170, which calculates one or more control weights 175 for the post-processing of the decoded audio content based on the classification information 125. Each control weight may be, for example, a number between 0 and 1 and may determine the strength of a respective process/module/algorithm for the post-processing. The one or more control weights 175 are provided to the post-processing module 180. The post-processing module 180 can select/apply a post-processing mode in accordance with the control weights 175 to post-process the decoded audio content. In some embodiments, the selection of the post-processing mode may further be based on user input. Post-processing of the decoded audio content by the post-processing module 180, using the selected post-processing mode, may yield the output audio signal 102 that is output by the decoder 115.
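As an illustration of the weight calculation step, the sketch below maps decoded confidence values to control weights in the range 0 to 1. The function name `compute_control_weights`, the module keys, and in particular the mapping formulas are hypothetical; this passage only states that the weights are calculated from the classification information, not how.

```python
from typing import Dict


def clamp01(x: float) -> float:
    """Limit a value to the range [0, 1] used for control weights."""
    return min(1.0, max(0.0, x))


def compute_control_weights(confidences: Dict[str, float]) -> Dict[str, float]:
    """Derive one weight per post-processing (sub)module from confidence values.

    The formulas are illustrative assumptions, not the mapping of the disclosure.
    """
    music = confidences.get("music", 0.0)
    speech = confidences.get("speech", 0.0)
    effects = confidences.get("effects", 0.0)
    return {
        # Assumed: enhance dialog in proportion to the speech confidence.
        "dialog_enhancer": clamp01(speech),
        # Assumed: virtualize speech/effects content, but back off for music.
        "virtualizer": clamp01(max(speech, effects) * (1.0 - music)),
        # Assumed: steer the equalizer by the music confidence.
        "equalizer": clamp01(music),
    }
```

Each returned weight would then set the strength of the corresponding (sub)module in the post-processing module.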

The one or more calculated control weights 175 may be control weights for post-processing algorithms executed by the post-processing module 180 and may therefore be referred to as algorithm steering weights. Thus, the one or more control weights 175 can provide steering for the post-processing algorithms in the post-processing module 180. In this sense, the control weights 175 may be control weights for respective (sub)modules for the post-processing of the decoded audio content. For example, the post-processing module 180 may include one or more respective (sub)modules, such as an (intelligent/dynamic) equalizer, an (adaptive) virtualizer, a surround processor, a dialog enhancer, an upmixer, and/or a cross-fader. The control weights 175 may be control weights for these (sub)modules, which may operate under the control of their respective control weights. Accordingly, the control weights 175 may include one or more of a control weight for an equalizer (e.g., an intelligent equalizer (IEQ)), a control weight for a virtualizer (e.g., an adaptive virtualizer), a control weight for a surround processor, a control weight for a dialog enhancer, a control weight for an upmixer, and/or a control weight for a cross-fader. Here, an intelligent equalizer is understood to adjust a plurality of frequency bands using a target spectral profile. Its gain curve is adapted depending on the audio content to which the intelligent equalizer is applied.

Determining the classification information 125 at the encoder 105 and providing it to the decoder 115 as part of the bitstream 110 can reduce the computational load at the decoder 115. Moreover, the higher computational capacity of the encoder can be used to make the content analysis more robust (e.g., more accurate).

FIG. 2 schematically shows an AC-4 bitstream as an example implementation of the bitstream 110. The bitstream 110 includes a plurality of frames (AC-4 frames) 205. Each frame 205 includes a sync word, a frame word, a raw frame 210 (AC-4 frame), and a CRC word. The raw frame 210 includes a table of contents (TOC) field and a plurality of substreams as indicated in the TOC field. Each substream includes an audio data field 211 and a metadata field 212. The audio data field 211 may contain the encoded audio content, and the metadata field 212 may contain the classification information 125.

Given such a bitstream structure, the classification information 125 may be encoded in a specific data field within a packet of the bitstream. FIG. 3 schematically shows an example of a data field in (a frame of) the bitstream for carrying the classification information 125. This data field may be referred to as the MI data field. The data field may include a plurality of subfields 310 to 370. For example, the data field may include any, some, or all of: a b_mi_data_present field 310 indicating whether classification information (media information or media intelligence) is present in the frame; a music_confidence field 320 containing a confidence value for the content type "music"; a speech_confidence field 330 containing a confidence value for the content type "speech"; an effects_confidence field 340 containing a confidence value for the content type "effects"; a b_prog_switch field 350; a b_more_mi_data_present field 360 indicating whether further classification information (media information) is present; and a more_mi_data field 370 containing further classification information (e.g., a confidence value for crowd noise). Because the classification information (e.g., confidence values) is determined by long-term analysis (content analysis), it may change relatively slowly. The classification information therefore need not be encoded for every packet/frame and may instead be encoded, for example, in one out of every N frames, with N ≥ 2.
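The layout of the MI data field can be sketched as a simple bit packer. The subfield names and their order follow the description of FIG. 3 above, but the bit widths are not given there: the 3-bit confidence codes (`CONF_BITS`), the 1-bit flags, the interpretation of more_mi_data as a single confidence code, and all function names are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional

CONF_BITS = 3  # hypothetical width of each confidence code; not given in the text


@dataclass
class MiData:
    music_confidence: int               # quantized code, 0 .. 2**CONF_BITS - 1
    speech_confidence: int
    effects_confidence: int
    b_prog_switch: bool = False
    more_mi_data: Optional[int] = None  # assumed: e.g. a crowd-noise confidence code


def _code(value: int) -> str:
    """Render one confidence code as a fixed-width bit string."""
    if not 0 <= value < (1 << CONF_BITS):
        raise ValueError(f"code {value} does not fit in {CONF_BITS} bits")
    return format(value, f"0{CONF_BITS}b")


def pack_mi_data(mi: Optional[MiData]) -> str:
    """Serialize the field as a string of '0'/'1' characters."""
    if mi is None:
        return "0"                       # b_mi_data_present = 0, nothing else follows
    bits = "1"                           # b_mi_data_present = 1
    bits += _code(mi.music_confidence)
    bits += _code(mi.speech_confidence)
    bits += _code(mi.effects_confidence)
    bits += "1" if mi.b_prog_switch else "0"
    if mi.more_mi_data is None:
        bits += "0"                      # b_more_mi_data_present = 0
    else:
        bits += "1" + _code(mi.more_mi_data)
    return bits


def unpack_mi_data(bits: str) -> Optional[MiData]:
    """Inverse of pack_mi_data."""
    if bits[0] == "0":
        return None
    pos = 1
    codes = []
    for _ in range(3):
        codes.append(int(bits[pos:pos + CONF_BITS], 2))
        pos += CONF_BITS
    prog_switch = bits[pos] == "1"
    pos += 1
    more = None
    if bits[pos] == "1":
        more = int(bits[pos + 1:pos + 1 + CONF_BITS], 2)
    return MiData(codes[0], codes[1], codes[2], prog_switch, more)


def carries_mi_data(frame_index: int, n: int) -> bool:
    """Send classification information in one out of every n frames (n >= 2)."""
    return frame_index % n == 0
```

For frames where `carries_mi_data` is false, an encoder following this sketch would write only the single b_mi_data_present bit set to 0.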

Alternatively, the classification information 125 (e.g., confidence values) may be encoded in a presentation substream of the AC-4 bitstream. Furthermore, for file-based audio content, the classification information 125 (e.g., confidence values) need not be encoded for each frame and may instead be encoded in an appropriate data field of the bitstream so as to be valid for all frames in the file.

FIG. 4 is a flowchart showing an example of a method 400 of encoding audio content. The method 400 may be performed, for example, by the encoder 105 in the encoder-decoder system 100 of FIG. 1.

In step S410, content analysis of the audio content is performed.
In step S420, classification information indicating the content type of the audio content is generated based on (a result of) the content analysis.
In step S430, the audio content and the classification information are encoded in a bitstream.
Finally, in step S440, the bitstream is output.

In particular, the steps of the method 400 may be performed in the manner described above for the encoder-decoder system 100.

As noted above, generating the classification information may further be based on detection of scene transitions in the audio content (or on a manual indication of a scene transition). Accordingly, the method 400 (or any of the methods 600, 700, or 900 described below) may further include detecting a scene transition in the audio content (or receiving an input of a manual indication of a scene transition in the audio content) and encoding an indication of the scene transition in the audio content in the bitstream.

Next, details of the content analysis (e.g., the content analysis performed by the content analysis module 120 of the encoder 105, or the content analysis performed in step S410 of the method 400) are described with reference to FIG. 5.

As noted above, the content analysis generates classification information 125 indicating the content type of the audio content 101. In some embodiments of the present disclosure, the classification information 125 includes one or more confidence values (feature confidence values, confidence scores). Each of these confidence values is associated with a respective content type and gives an indication of the likelihood that the audio content is of that content type. The content types may include one or more of music content, speech content, and effects (e.g., background effects) content. In some implementations, the content types may further include crowd noise content (e.g., cheering). That is, the classification information 125 may include one or more of: a music confidence value indicating the confidence (likelihood) that the audio content 101 is of content type "music"; a speech confidence value indicating the confidence (likelihood) that the audio content 101 is of content type "speech"; an effects confidence value indicating the confidence (likelihood) that the audio content 101 is of content type "effects"; and possibly a crowd noise confidence value indicating the confidence (likelihood) that the audio content 101 is of content type "crowd noise".

In the following, the confidence values are assumed to be normalized to the range from 0 to 1. Here, 0 indicates zero likelihood (0%) that the audio content is of the respective content type, and 1 indicates certainty (full likelihood, 100%) that the audio content is of the respective content type. It is understood that the value "0" is a non-limiting example of a confidence value indicating zero likelihood, and the value "1" is a non-limiting example of a confidence value indicating full likelihood.

In the example of FIG. 5, the content analysis of the audio content 101 returns a (raw) music confidence value 125a, a (raw) speech confidence value 125b, and a (raw) effects confidence value 125c. In principle, these raw confidence values 125a, 125b, 125c could be used directly for encoding in the bitstream 110 as (part of) the classification information 125. Alternatively, the classification information 125 (i.e., the raw confidence values 125a, 125b, 125c) may be subjected to smoothing (e.g., time smoothing) prior to encoding, to obtain substantially continuous confidence values. This may be done by respective smoothing modules 140a, 140b, 140c, which output smoothed confidence values 145a, 145b, 145c, respectively. The different smoothing modules may apply different smoothing, for example using different parameters/coefficients for the smoothing.

In line with the above, the method 400 (or any of the methods 600, 700, or 900 described below) may further include smoothing the classification information (e.g., the confidence values) prior to multiplexing/encoding.

Smoothing of the classification information (e.g., confidence values) can cause audible distortion under certain circumstances, for example when the smoothing is performed across a scene transition. The smoothing may therefore be disabled depending on the circumstances, for example at scene transitions. Furthermore, as described in more detail below, the smoothing may be disabled for dynamic (non-static) audio content, or in accordance with a control input or metadata.

In some implementations, the smoothed music confidence value 145a, the smoothed speech confidence value 145b, and the smoothed effects confidence value 145c may further be quantized prior to encoding. This may be done by respective quantizers 150a, 150b, 150c, which output quantized confidence values 155a, 155b, 155c, respectively. The different quantizers may apply different quantization, for example using different parameters for the quantization.
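A uniform scalar quantizer is one simple way to realize the quantization of confidence values before encoding. The sketch below is an assumption for illustration only: the disclosure does not specify the quantizer type, and the default of 3 bits and the function names are hypothetical. A different `bits` value per content type would correspond to using different quantization parameters per quantizer.

```python
def quantize_confidence(value: float, bits: int = 3) -> int:
    """Map a confidence value in [0, 1] to an integer code of the given width."""
    levels = (1 << bits) - 1                   # highest code, e.g. 7 for 3 bits
    clamped = min(1.0, max(0.0, value))        # guard against out-of-range input
    return int(clamped * levels + 0.5)         # round to the nearest code


def dequantize_confidence(code: int, bits: int = 3) -> float:
    """Map an integer code back to a confidence value in [0, 1]."""
    levels = (1 << bits) - 1
    return code / levels
```

The reconstruction error of this scheme is bounded by half a quantization step, i.e. 1/14 for the 3-bit default.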

In line with the above, the method 400 (or any of the methods 600, 700, or 900 described below) may further include quantizing the classification information (e.g., the confidence values) prior to multiplexing/encoding.

Smoothing the classification information 125 can improve the continuity and stability of the post-processing at the decoder, and hence the listening experience. Quantizing the classification information 125 can improve the bandwidth efficiency of the bitstream 110.

As noted above, determining the classification information 125 at the encoder 105 and providing it to the decoder 115 as part of the bitstream 110 can be advantageous in terms of computational capacity. In addition, doing so can allow some encoder-side control over the decoder-side audio post-processing, by setting the confidence values transmitted in the audio stream to certain desired values. For example, by making the classification information depend (at least in part) on user input at the encoder side, an encoder-side user (e.g., a content creator) can be given control over the audio post-processing at the decoder side. Some example implementations that allow additional encoder-side control over the decoder-side audio post-processing are described next.

FIG. 6 schematically shows, in flowchart form, an example of a method 600 of encoding audio content that allows such encoder-side control of the decoder-side audio post-processing based on user input. The method 600 may be performed, for example, by the encoder 105 in the encoder-decoder system 100 of FIG. 1.

In step S610, user input is received. The user may be, for example, a content creator. The user input may include manual labels for labeling the audio content as relating to a certain content type, or it may relate, for example, to manual confidence values.

In step S620, classification information indicating the content type of the audio content is generated based, at least in part, on the user input. For example, the manual labels and/or manual confidence values may be used directly as the classification information. If the audio content has been manually labeled as being of a certain content type, the confidence value for that content type may be set to 1 (assuming confidence values between 0 and 1), and the other confidence values may be set to zero. In this case, the content analysis would be bypassed. In alternative implementations, the output of the content analysis may be used together with the user input to derive the classification information. For example, final confidence values may be calculated based on the confidence values generated by the content analysis and the manual confidence values. This may be done by averaging these confidence values or by any other suitable combination.
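The options in step S620 can be sketched as a single function. This is a minimal illustration under stated assumptions: the function name `classify`, the fixed set of three content types, and the choice of a plain arithmetic mean (one of the "suitable combinations" the text allows) are all hypothetical.

```python
from typing import Dict, Optional

CONTENT_TYPES = ("music", "speech", "effects")  # assumed set of content types


def classify(
    analysis: Optional[Dict[str, float]],
    manual_label: Optional[str] = None,
    manual_confidences: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Derive classification information from user input and/or content analysis."""
    if manual_label is not None:
        # Hard label: confidence 1 for the labeled type, 0 otherwise.
        # The content analysis is bypassed, so `analysis` may be None.
        return {t: 1.0 if t == manual_label else 0.0 for t in CONTENT_TYPES}
    if manual_confidences is not None and analysis is not None:
        # Combine both sources; averaging is just one possible combination.
        return {t: 0.5 * (analysis[t] + manual_confidences[t]) for t in CONTENT_TYPES}
    if manual_confidences is not None:
        # Manual confidence values used directly as classification information.
        return dict(manual_confidences)
    if analysis is None:
        raise ValueError("need user input or content analysis results")
    return dict(analysis)
```

For example, `classify(None, manual_label="speech")` yields a speech confidence of 1 and zero for the other types without running any analysis.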

In step S630, the audio content and the classification information are encoded in a bitstream.

Finally, in step S640, the bitstream is output.

Additional encoder-side control can be achieved by making the content classification decision on the encoder side depend, at least in part, on metadata associated with the audio content. Two examples of such encoder-side processing are described below. The first example is described with reference to FIGS. 7 and 8. In the first example, the audio content is provided in a stream of audio content (e.g., a linear continuous stream) as part of an audio program. The metadata for the audio content includes at least an indication of the service type of the audio content (i.e., of the audio program). The service type may therefore also be referred to as the audio program type. Examples of service types include a music service (e.g., a music streaming service or a music broadcast) and a news (newscast) service (e.g., the audio component of a news channel). The service type indication may be provided on a per-frame basis or may be the same (uniform/static) for the audio stream. The second example is described with reference to FIGS. 9 and 10. In the second example, the audio content is provided on a file basis. Each file may contain metadata for its audio content. The metadata may include the file content type of (the audio content of) the file. The metadata may further include markers, labels, tags, and the like. Examples of file content types include an indication that the file is a music file, an indication that the file is a news/newscast file (news clip), and an indication that the file contains dynamic (non-static) content (e.g., a movie of the musical genre that frequently transitions between scenes with speech and music/song scenes). The file content type may be the same (uniform/static) for the entire file or may vary between portions of the file. The processing in the second example may be file-based. "Tagging" files with metadata indicating the file content type can be said to assist the encoder in deriving the classification information (in addition to providing the encoder side with additional control over the audio post-processing at the decoder side).

Reference is now made to FIG. 7. FIG. 7 shows, in flowchart form, a method 700 of encoding audio content that is provided in a stream of audio content as part of an audio program. The method 700 takes metadata of the audio content into account when deriving the classification information. The method 700 may be performed, for example, by the encoder 105 in the encoder-decoder system 100 of FIG. 1.

In step S710, a service type indication is received. As noted above, the service type indication indicates the service type of the audio content.
In step S720, content analysis of the audio content is performed based, at least in part, on the service type indication. Non-limiting examples of such content analysis are described below with reference to FIG. 8.
In step S730, classification information indicating the content type of the audio content is generated based on (a result of) the content analysis.
In step S740, the audio content and the classification information are encoded in a bitstream.
Finally, in step S750, the bitstream is output.

FIG. 8 schematically shows examples of the content analysis of the audio content in step S720 of the method 700. The upper row 810 of FIG. 8 relates to the example of a music service, i.e., a service type indication indicating that the audio content is of service type "music service". In this case, the confidence value for "music" may be set to 1, and the confidence values for the other content types (e.g., "speech", "effects", and possibly "crowd noise") are set to 0. In other words, the content type "music" may be hard-coded into the classification information. Thus, the method 700 may include determining, based on the service type indication, whether the service type of the audio content is a music service. Then, in response to determining that the service type of the audio content is a music service, the classification information may be generated to indicate that the content type of the audio content is music content.

The lower row 820 of FIG. 8 relates to the example of a news service, i.e., a service type indication indicating that the audio content is of service type "news service" (or newscast service, news channel). In this case, the calculations used in the content analysis are adapted so that there is a clear preference for speech and less preference for, e.g., music (for example, the confidence value for speech content (content type "speech") given by the content analysis may be increased, and the confidence values for music content (content type "music") and possibly any remaining content types may be decreased). In other words, adapting the calculations reduces the likelihood of a false indication of content type "music". Thus, the method 700 may include determining, based on the service type indication, whether the service type of the audio content is a newscast service. Then, in response to determining that the service type of the audio content is a newscast service, the content analysis in step S720 may be adapted to have a higher likelihood of indicating that the audio content is speech content. Furthermore, the content analysis in step S720 may be adapted to have a lower likelihood of indicating that the audio content is of any other content type.
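The two rows of FIG. 8 can be illustrated with a small adjustment function applied to the raw confidence values. This is a sketch only: the function name `apply_service_type`, the service type strings, the bias parameter `news_speech_bias`, and the specific adjustment formula are assumptions; the disclosure states only that speech confidence may be increased and other confidences decreased for a news service.

```python
from typing import Dict


def apply_service_type(
    raw: Dict[str, float],
    service_type: str,
    news_speech_bias: float = 0.3,  # hypothetical strength of the speech preference
) -> Dict[str, float]:
    """Adapt raw confidence values according to the service type indication."""
    if service_type == "music":
        # Music service: hard-code content type "music" into the classification.
        out = {t: 0.0 for t in raw}
        out["music"] = 1.0
        return out
    if service_type == "news":
        out = {}
        for content_type, conf in raw.items():
            if content_type == "speech":
                # Move the speech confidence toward 1.
                out[content_type] = conf + news_speech_bias * (1.0 - conf)
            else:
                # Scale down every other content type.
                out[content_type] = conf * (1.0 - news_speech_bias)
        return out
    # Any other service type: leave the analysis result unchanged.
    return dict(raw)
```

Because the news adjustment moves speech toward 1 and scales the others toward 0, all results stay within the range 0 to 1 as long as the inputs do.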

In some implementations, one or more confidence values for the audio content may be provided directly by user input (e.g., by a content creator) or as part of the metadata. Whether these confidence values are taken into account may then depend on the service type indication. For example, the confidence values provided by user input or metadata may be used for encoding as the classification information if (and only if) the service type of the audio content is of a certain type. In some alternative implementations, the confidence values provided by user input or metadata may be used as part of the classification information unless the service type of the audio content is of a certain type. For example, the confidence values provided by user input or metadata may be used unless the service type indication indicates that the service type of the audio content is a music service. If the service type indication indicates that the service type of the audio content is a music service, the confidence value for music content may be set to 1 regardless of the confidence values provided by user input or metadata.
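The two gating variants described above can be written out as follows. The function names, the service type strings, and the `accepted_service_types` parameter are hypothetical. In the music-service case the sketch overrides only the music confidence, since that is all this paragraph states; whether the other values are also reset is left open here.

```python
from typing import Dict, FrozenSet, Optional


def use_if_service_type(
    service_type: str,
    provided: Dict[str, float],
    accepted_service_types: FrozenSet[str],
) -> Optional[Dict[str, float]]:
    """Variant 1: use provided confidence values if and only if the service type matches.

    Returns None when the provided values are not to be used.
    """
    if service_type in accepted_service_types:
        return dict(provided)
    return None


def use_unless_music_service(
    service_type: str,
    provided: Dict[str, float],
) -> Dict[str, float]:
    """Variant 2: use provided values unless the service type is a music service."""
    resolved = dict(provided)
    if service_type == "music":
        # Music confidence is forced to 1 regardless of the provided value.
        resolved["music"] = 1.0
    return resolved
```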

Reference is now made to FIG. 9. FIG. 9 shows, in flowchart form, a method 900 of encoding audio content that is provided on a file basis. Accordingly, the method 900 may be performed on a file basis. The method 900 takes file metadata of the audio content into account when deriving the classification information. The method 900 may be performed, for example, by the encoder 105 in the encoder-decoder system 100 of FIG. 1.

In step S910, content analysis of the audio content is performed based, at least in part, on (file) metadata for the audio content. For example, the metadata may include a file content type indication indicating the file content type of the file. The content analysis may then be based, at least in part, on the file content type indication. Non-limiting examples of such content analysis based, at least in part, on the content type of the file are described below with reference to FIG. 10.

In step S920, classification information indicating the content type of the audio content is generated based on (a result of) the content analysis.

In step S930, the audio content and the classification information are encoded in a bitstream.

Finally, in step S940, the bitstream is output.

FIG. 10 schematically shows examples of the content analysis of the audio content in step S910 of the method 900. The upper row 1010 of FIG. 10 relates to the example of a music file, i.e., a file content type indication indicating that the file content is of file content type "music". In this case, the content type "music" may be hard-coded into the classification information. Furthermore, the classification information may be uniform (static) for the entire file. Accordingly, the method 900 may further include determining, based on the file content type indication, whether the file content type of the file is a music file. Then, in response to determining that the file content type of the file is a music file, the classification information may be generated to indicate that the content type of the audio content is music content.

図10の中段1020は、ニュース・ファイルの例、すなわち、ファイル・コンテンツがファイル・コンテンツ型「ニュース」であることを示すファイル・コンテンツ型指示に関する。この場合、方法900は、ファイル・コンテンツ型指示に基づいて、ファイルのファイル・コンテンツ型がニュースキャスト・ファイルであるかどうかを判定することをさらに含んでいてもよい。次いで、ファイルのファイル・コンテンツ型がニュースキャスト・ファイルであるという判定に応答して、コンテンツ解析は、オーディオ・コンテンツが発話コンテンツであることを示す、より高い可能性をもつように適応されてもよい。これは、コンテンツ解析における発話コンテンツについての確からしさ／信頼度を高めるよう、コンテンツ解析の一つまたは複数の計算（計算アルゴリズム）を適応することによって、および／または、発話コンテンツ以外のコンテンツ型についての確からしさ／信頼度を下げるよう、前記一つまたは複数の計算を適応することによって達成されてもよい。ここでもまた、分類情報は、ファイル全体について一様（静的）にされてもよい。 The middle row 1020 of FIG. 10 relates to the example of a news file, that is, a file content type indication indicating that the file content is of file content type “news”. In this case, method 900 may further include determining, based on the file content type indication, whether the file content type of the file is a newscast file. Then, in response to a determination that the file content type of the file is a newscast file, the content analysis may be adapted to have a higher likelihood of indicating that the audio content is speech content. This may be achieved by adapting one or more calculations (calculation algorithms) of the content analysis to increase the likelihood/confidence for speech content in the content analysis, and/or by adapting the one or more calculations to decrease the likelihood/confidence for content types other than speech content. Again, the classification information may be made uniform (static) for the entire file.

図10の下段1030は、動的（非静的）ファイルの例（たとえば、発話のあるシーンと音楽／歌シーンとの間で頻繁に遷移する映画のミュージカル・ジャンル）、すなわち、ファイル・コンテンツがファイル・コンテンツ型「動的」であることを示すファイル・コンテンツ型指示に関する。この場合、方法900は、ファイルのファイル・コンテンツ型指示に基づいて、ファイルのファイル・コンテンツ型が動的コンテンツ（すなわち、動的なファイル・コンテンツ）であるかどうかを判定することをさらに含んでいてもよい。次いで、ファイルのファイル・コンテンツ型が動的コンテンツ（すなわち、動的なファイル・コンテンツ）であるとの判定に応答して、コンテンツ解析は、異なるコンテンツ型間のより高い遷移レートを許容するように適応されてもよい。たとえば、コンテンツ型は、コンテンツ型間で、たとえば、音楽と非音楽の間で、より頻繁に（すなわち、定常状態についてよりも頻繁に）遷移することが許容されうる。よって、分類情報は、たとえばファイルの音楽セクションと非音楽セクションとの間で切り換わることが許容されうる。図10の最初の2段1010および1020とは異なり、これは、分類情報がファイル全体について均一（静的）に保たれないことを意味する。 The bottom row 1030 of FIG. 10 relates to the example of a dynamic (non-static) file (for example, a movie in the musical genre that frequently transitions between spoken scenes and music/song scenes), that is, a file content type indication indicating that the file content is of file content type “dynamic”. In this case, method 900 may further include determining, based on the file content type indication of the file, whether the file content type of the file is dynamic content (i.e., dynamic file content). Then, in response to a determination that the file content type of the file is dynamic content (i.e., dynamic file content), the content analysis may be adapted to allow a higher transition rate between different content types. For example, the content type may be allowed to transition between content types, for example between music and non-music, more frequently (i.e., more frequently than for steady state). Thus, the classification information may be allowed to switch between, for example, music sections and non-music sections of the file. Unlike the top two rows 1010 and 1020 of FIG. 10, this means that the classification information is not kept uniform (static) for the entire file.

また、動的コンテンツ（すなわち、動的ファイル・コンテンツ）は、ファイル内の異なるコンテンツ型のセクション間でシャープな遷移を有することができることも理解される。たとえば、音楽セクションと非音楽セクションとの間には、シャープな遷移が存在しうる。そのような場合、分類情報（たとえば、信頼値）に対して時間平滑化を適用することは意味をなさないことがありうる。いくつかの実装では、よって、分類情報の平滑化（時間平滑化）は、動的コンテンツ（すなわち、動的ファイル・コンテンツ）については無効にされてもよい。 It is also understood that dynamic content (ie, dynamic file content) can have sharp transitions between sections of different content types within a file. For example, there can be sharp transitions between the music section and the non-music section. In such cases, it may not make sense to apply time smoothing to the classification information (eg, confidence values). In some implementations, therefore, classification information smoothing (time smoothing) may be disabled for dynamic content (ie, dynamic file content).
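
The three cases of FIG. 10 can be illustrated by the following minimal sketch, which is offered under stated assumptions and not as the disclosed implementation. The function names, the dictionary layout of the confidence values, and the additive `speech_boost` rule are hypothetical; the description above only states that “music” may be hard-coded for music files, that the analysis may be biased toward speech for newscast files, and that smoothing may be disabled for dynamic content.

```python
from typing import Dict


def apply_file_type_hint(confidences: Dict[str, float], file_type: str,
                         speech_boost: float = 0.2) -> Dict[str, float]:
    """Adjust per-content-type confidence values using a file content type indication.

    All names and the additive `speech_boost` rule are illustrative assumptions.
    """
    if file_type == "music":
        # Music file: hard-code "music" in the classification information.
        return {k: (1.0 if k == "music" else 0.0) for k in confidences}
    if file_type == "news":
        # Newscast file: raise the speech confidence and lower all others,
        # clamping every value to the range [0, 1].
        return {
            k: min(1.0, v + speech_boost) if k == "speech"
            else max(0.0, v - speech_boost)
            for k, v in confidences.items()
        }
    # Dynamic (or unknown) file type: leave the analysis result unchanged.
    return dict(confidences)


def smoothing_enabled(file_type: str) -> bool:
    """Temporal smoothing of the classification information may be disabled
    for dynamic content, which can have sharp transitions."""
    return file_type != "dynamic"
```

In this sketch the static cases (music, news) keep smoothing enabled, while a file flagged as dynamic bypasses it so that sharp transitions between sections are preserved.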

次に、オーディオ・コンテンツおよび該オーディオ・コンテンツについての分類情報を含むビットストリームからのオーディオ・コンテンツのデコードに関する実施形態および実装について説明する。分類情報は、オーディオ・コンテンツのコンテンツ分類（コンテンツ型に関する）を示すことが理解される。また、コンテンツ分類は、エンコーダ側で実行されたコンテンツ解析に基づいてもよいことも理解される。 Next, an embodiment and implementation of decoding audio content from a bitstream containing audio content and classification information about the audio content will be described. It is understood that the classification information indicates the content classification (with respect to the content type) of the audio content. It is also understood that the content classification may be based on the content analysis performed on the encoder side.

図11は、ビットストリームからオーディオ・コンテンツをデコードする一般的な方法1100をフローチャートの形で示す。方法1100は、たとえば、図1のエンコーダ‐デコーダ・システム100内のデコーダ115によって実行されてもよい。 FIG. 11 shows, in the form of a flowchart, a general method 1100 of decoding audio content from a bitstream. Method 1100 may be performed, for example, by the decoder 115 in the encoder-decoder system 100 of FIG. 1.

ステップS1110では、ビットストリームは、たとえば無線または有線伝送によって、またはビットストリームを記憶する記憶媒体を介して、受領される。 In step S1110, the bitstream is received, for example, by wireless or wired transmission, or via a storage medium that stores the bitstream.
ステップS1120では、オーディオ・コンテンツおよび分類情報がビットストリームからデコードされる。 In step S1120, the audio content and the classification information are decoded from the bitstream.
ステップS1130では、デコードされたオーディオ・コンテンツの（オーディオ）後処理を実行するための後処理モードが、ステップS1120で得られた分類情報に基づいて選択される。いくつかの実装では、後処理モードの選択は、さらにユーザー入力に基づくことができる。 In step S1130, a post-processing mode for performing (audio) post-processing of the decoded audio content is selected based on the classification information obtained in step S1120. In some implementations, the selection of the post-processing mode can be further based on user input.

さらに、方法1100は、（たとえば、エンコーダ側によって考慮されていないコンテンツ型について）一つまたは複数の追加の信頼値を決定するために、オーディオ・コンテンツのコンテンツ解析を実行することをさらに含んでいてもよい。このコンテンツ解析は、方法400におけるステップS410を参照して上述したのと同じ仕方で進行してもよい。次いで、後処理モードの選択は、さらに該一つまたは複数の追加の信頼値に基づいてもよい。たとえば、デコーダが（レガシー）エンコーダによって考慮されていなかったコンテンツ型についての検出器を含む場合、デコーダは、このコンテンツ型についての信頼値を計算し、後処理モードを選択するために、この信頼値を、分類情報において伝送される任意の信頼値と一緒に使用することができる。 Further, method 1100 may include performing content analysis of the audio content to determine one or more additional confidence values (e.g., for content types not considered by the encoder side). This content analysis may proceed in the same manner as described above with reference to step S410 of method 400. The selection of the post-processing mode may then be further based on the one or more additional confidence values. For example, if the decoder includes a detector for a content type that was not considered by the (legacy) encoder, the decoder can calculate a confidence value for this content type and use this confidence value, together with any confidence values transmitted in the classification information, to select the post-processing mode.
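
As a minimal sketch of the selection in step S1130, assuming that a post-processing mode is simply named after the dominant content type, the transmitted confidence values can be merged with decoder-side values before a mode is chosen. The function name, the mode names, and the rule that explicit user input takes full precedence are assumptions made for illustration only.

```python
from typing import Dict, Optional


def select_post_processing_mode(transmitted: Dict[str, float],
                                local: Optional[Dict[str, float]] = None,
                                user_mode: Optional[str] = None) -> str:
    """Pick a post-processing mode, named here after the dominant content type."""
    if user_mode is not None:
        # Assumption: explicit user input takes precedence over classification.
        return user_mode
    merged = dict(transmitted)
    for content_type, value in (local or {}).items():
        # Decoder-side values only fill in content types that the
        # (legacy) encoder did not consider.
        merged.setdefault(content_type, value)
    # Choose the content type with the highest confidence value.
    return max(merged, key=merged.get)
```

Transmitted values are kept whenever both sides report the same content type, which reflects the idea that the decoder-side analysis only supplements the encoder-side classification.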

図1のコンテキストにおいて上述したように、後処理は、たとえば、（インテリジェント／動的）等化器、（適応）仮想化器、サラウンドプロセッサ、ダイアログ向上器、アップミキサー、またはクロスフェーダーを実装するそれぞれのアルゴリズムのような後処理アルゴリズムを使用して実行されてもよい。よって、後処理を実行するためのモードを選択することは、後処理のためのそれぞれのプロセス／モジュール／アルゴリズムについての一つまたは複数の制御重み（操縦重み、アルゴリズム操縦重み、アルゴリズム制御重み）を決定すること（たとえば、計算すること）に対応すると言える。 As described above in the context of FIG. 1, post-processing may be performed using post-processing algorithms, such as respective algorithms implementing, for example, an (intelligent/dynamic) equalizer, an (adaptive) virtualizer, a surround processor, a dialog enhancer, an upmixer, or a crossfader. Accordingly, selecting a mode for performing post-processing can be said to correspond to determining (e.g., calculating) one or more control weights (steering weights, algorithm steering weights, algorithm control weights) for the respective processes/modules/algorithms for post-processing.

対応する方法1200は、図12のフローチャートによって示される。ここでもまた、この方法1200は、たとえば、図1のエンコーダ‐デコーダ・システム100内のデコーダ115によって実行されてもよい。ステップS1210およびステップS1220は、それぞれ、方法1100のステップS1110およびステップS1120と同一である。 A corresponding method 1200 is shown by the flowchart of FIG. 12. Again, method 1200 may be performed, for example, by the decoder 115 in the encoder-decoder system 100 of FIG. 1. Step S1210 and step S1220 are identical to step S1110 and step S1120 of method 1100, respectively.

ステップS1230では、ステップS1220で得られた分類情報に基づいて、デコードされたオーディオ・コンテンツの後処理のための一つまたは複数の制御重みが決定される（たとえば、計算される）。 In step S1230 , one or more control weights for post-processing of the decoded audio content are determined (eg, calculated) based on the classification information obtained in step S1220.

制御重み（操縦重み）の代わりに信頼値を送信すること、すなわち、重み計算モジュールをエンコーダに移す代わりにデコーダに残すことは、デコーダにおける計算資源の節約を可能にするだけでなく、重み計算がパーソナル化できる、カスタマイズ可能で柔軟なデコーダを可能にすることもできる。たとえば、重み計算は、装置型および／またはユーザーの個人的な選好に依存することができる。これは、デコードされたオーディオ・コンテンツについてどのオーディオ後処理が実行されるべきかについての具体的な命令をデコーダがエンコーダから受け取る従来のアプローチとは対照的である。 Sending confidence values instead of control weights (steering weights), that is, leaving the weight calculation module in the decoder instead of moving it to the encoder, not only enables savings of computational resources at the decoder, but can also enable a customizable and flexible decoder in which the weight calculation can be personalized. For example, the weight calculation can depend on the device type and/or the user's personal preferences. This is in contrast to conventional approaches, in which the decoder receives from the encoder specific instructions as to which audio post-processing is to be performed on the decoded audio content.

すなわち、オーディオ後処理の要件は、デコードされたオーディオ・コンテンツが再生される装置の装置型に依存することがある。たとえば、2つだけのスピーカーをもつモバイル装置（たとえば、携帯電話）のスピーカーによるデコードされたオーディオ・コンテンツの再生は、5つ以上のスピーカーをもつサウンドバー装置によるデコードされたオーディオ・コンテンツの再生とは異なるオーディオ後処理を要求することがある。よって、いくつかの実装では、制御重みの計算は、デコードを実行する装置の装置型に依存する。言い換えれば、計算は、エンドポイントに固有であってもよいし、パーソナル化されていてもよい。たとえば、デコーダ側は、後処理のための一組のエンドポイント固有のプロセス／モジュール／アルゴリズムを実装してもよく、これらのプロセス／モジュール／アルゴリズムのためのパラメータ（制御重み）は、エンドポイント固有の仕方で信頼値に基づいて決定されてもよい。 That is, the requirements for audio post-processing may depend on the device type of the device by which the decoded audio content is played back. For example, playback of decoded audio content by the speakers of a mobile device (e.g., a mobile phone) with only two speakers may require different audio post-processing than playback of the decoded audio content by a soundbar device with five or more speakers. Thus, in some implementations, the calculation of the control weights depends on the device type of the device performing the decoding. In other words, the calculation may be endpoint-specific or personalized. For example, the decoder side may implement a set of endpoint-specific processes/modules/algorithms for post-processing, and the parameters (control weights) for these processes/modules/algorithms may be determined based on the confidence values in an endpoint-specific manner.

さらに、異なるユーザーは、オーディオ後処理について異なる選好をもつ可能性がある。たとえば、発話は、典型的には仮想化されないが、ユーザーの選好に基づいて、発話の多いオーディオ・コンテンツを仮想化することを決定することができる（すなわち、ユーザーが望むならば、仮想化が発話に適用されてもよい）。別の例として、パーソナルコンピュータでのオーディオ再生については、典型的には、ステレオ拡張、アップミックス、および仮想化はない。しかしながら、ユーザーの選好に依存して、ステレオ拡張、アップミックス、および／または仮想化がこの場合に適用されてもよい（すなわち、ユーザーが望むならば、ステレオ拡張、アップミックス、および／または仮想化がPCユーザーのために適用されてもよい）。よって、いくつかの実装では、制御重みの計算は、さらに、ユーザー選好またはユーザー入力（たとえば、ユーザー選好を示すユーザー入力）に基づく。よって、ユーザー入力は、分類情報ベースの計算をオーバーライドする、または部分的にオーバーライドすることができる。 Further, different users may have different preferences for audio post-processing. For example, speech is typically not virtualized, but based on user preference it may be decided to virtualize speech-heavy audio content (i.e., virtualization may be applied to speech if the user so desires). As another example, for audio playback on a personal computer there is typically no stereo widening, upmixing, or virtualization. However, depending on the user's preferences, stereo widening, upmixing, and/or virtualization may be applied in this case (i.e., stereo widening, upmixing, and/or virtualization may be applied for PC users if the user so desires). Thus, in some implementations, the calculation of the control weights is further based on user preference or user input (e.g., user input indicating the user preference). Accordingly, user input can override, or partially override, the classification-information-based calculation.
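
The dependence of the weight calculation on device type and user preference can be sketched as follows. This is a minimal illustration only: the table `VIRTUALIZE_BY_DEFAULT`, the function name, and the rule `1 - speech_confidence` are hypothetical and are not taken from the disclosure, which only states that speech is typically not virtualized, that PC playback typically uses no virtualization, and that user input may override the classification-based result.

```python
from typing import Optional

# Hypothetical per-device defaults: whether virtualization is applied at all.
VIRTUALIZE_BY_DEFAULT = {"mobile": True, "soundbar": True, "pc": False}


def personalized_virtualizer_weight(speech_confidence: float, device_type: str,
                                    user_wants_virtualization: Optional[bool] = None) -> float:
    """Return a virtualizer control weight in [0, 1] for a given endpoint."""
    if user_wants_virtualization is True:
        # User input overrides the classification-based calculation.
        return 1.0
    if user_wants_virtualization is False:
        return 0.0
    if not VIRTUALIZE_BY_DEFAULT[device_type]:
        # Endpoint-specific default, e.g. no virtualization on a PC.
        return 0.0
    # Assumed rule: back off virtualization as the speech confidence rises.
    return 1.0 - speech_confidence
```

A partial override, which the text also allows, could be modeled by blending the user value with the classification-based value instead of replacing it.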

分類情報が、それぞれがそれぞれのコンテンツ型に関連付けられた信頼値（信頼スコア）を含み、上述のように、オーディオ・コンテンツがそれぞれのコンテンツ型である確からしさの指標を与える場合、制御重みは、これらの信頼値に基づいて計算されうる。そのような計算の限定しない例はのちに記載される。 If the classification information contains a confidence value (confidence score), each associated with each content type, and, as described above, gives an indicator of the certainty that the audio content is each content type, the control weight is It can be calculated based on these confidence values. Non-limiting examples of such calculations will be described later.

さらに、方法1200は、（たとえば、エンコーダ側によって考慮されていないコンテンツ型についての）一つまたは複数の追加の信頼値を決定するために、オーディオ・コンテンツのコンテンツ解析を実行することをさらに含んでいてもよい。このコンテンツ解析は、方法400のステップS410を参照して上述したのと同じ仕方で進行してもよい。次いで、制御重みモードの計算は、前記一つまたは複数の追加の信頼値にさらに基づいてもよい。たとえば、デコーダが、（レガシー）エンコーダによって考慮されていなかったコンテンツ型についての検出器を有する場合、デコーダは、このコンテンツ型についての信頼値を計算し、この信頼値を、制御重みを計算するために、分類情報において伝送される任意の信頼値と一緒に、使用することができる。 Further, method 1200 may include performing content analysis of the audio content to determine one or more additional confidence values (e.g., for content types not considered by the encoder side). This content analysis may proceed in the same manner as described above with reference to step S410 of method 400. The calculation of the control weights may then be further based on the one or more additional confidence values. For example, if the decoder has a detector for a content type that was not considered by the (legacy) encoder, the decoder can calculate a confidence value for this content type and use this confidence value, together with any confidence values transmitted in the classification information, to calculate the control weights.

上述のように、信頼値は、エンコードされるコンテンツを正確かつ安定的に反映するために、デュアルエンドのエンコーダ‐デコーダ・システムにおいてエンコーダ側で平滑化されてもよい。代替的または追加的に、デコーダ側での重み計算は、制御重み（アルゴリズム操縦重み）を決定する際に、さらなる平滑化を提供してもよい。それにより、各後処理アルゴリズムが、可聴な歪みを回避するよう、適切なレベルの連続性を有することが保証できる。たとえば、仮想化器は、空間的像における望ましくない変動を避けるためにゆっくりとした変化を望むことがあるが、一方、ダイアログ向上器は、ダイアログ・フレームには反応されるが非ダイアログ・フレームは任意の誤ったダイアログ向上を最小化することを保証するために速い変化を望むことがありうる。よって、方法1200は、制御重みを平滑化（時間平滑化）するステップをさらに含んでいてもよい。 As described above, the confidence values may be smoothed at the encoder side in a dual-ended encoder-decoder system so that they reflect the encoded content accurately and stably. Alternatively or additionally, the weight calculation at the decoder side may provide further smoothing when determining the control weights (algorithm steering weights). This can ensure that each post-processing algorithm has an appropriate level of continuity to avoid audible distortion. For example, a virtualizer may want slow changes to avoid unwanted fluctuations in the spatial image, whereas a dialog enhancer may want fast changes to ensure that dialog frames are reacted to while any erroneous dialog enhancement of non-dialog frames is minimized. Thus, method 1200 may further include a step of smoothing (temporally smoothing) the control weights.

平滑化は、デコードを実行する装置の装置型に依存してもよい。たとえば、モバイル装置（たとえば、携帯電話）のための仮想化器制御重みと、TVセットまたはサウンドバー装置のための仮想化器制御重みとの間には、異なる平滑化が存在しうる。そこでは、平滑化は、平滑化を決定する一組の平滑係数、たとえば平滑化の時定数に関して異なる場合がある。 The smoothing may depend on the device type of the device performing the decoding. For example, there may be different smoothing for the virtualizer control weight for a mobile device (e.g., a mobile phone) than for the virtualizer control weight for a TV set or a soundbar device. The smoothing may differ with respect to a set of smoothing coefficients that determine the smoothing, such as a time constant of the smoothing.

さらに、平滑化は、平滑化される特定の制御重みにも依存しうる。すなわち、平滑化は、少なくとも2つの制御重みの間で異なることがある。たとえば、ダイアログ向上器制御重みについての平滑化は全くまたはほとんどない、および／または仮想化器制御重みについてのより強力な平滑化があることがありうる。 Further, the smoothing may also depend on the specific control weight that is smoothed. That is, the smoothing may differ between at least two control weights. For example, there may be no or little smoothing for the dialog enhancer control weight and/or stronger smoothing for the virtualizer control weight.

最後に、状況によっては平滑化が無効化されてもよいことを注意しておく。上述のように、平滑化は、動的（非静的）としてフラグ付けされたオーディオ・コンテンツについて、あるいはシーン遷移においては、逆効果になることがある。また、制御入力および／またはメタデータに従って平滑化が無効にされてもよい。 Finally, note that smoothing may be disabled in some situations. As mentioned above, smoothing can be counterproductive for audio content flagged as dynamic (non-static), or for scene transitions. Also, smoothing may be disabled according to control inputs and / or metadata.
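
A minimal sketch of such per-weight smoothing is given below, assuming a one-pole (exponential) smoother. The disclosure does not specify the smoothing filter, so the recursion, the function name, and the parameter `alpha` are assumptions; `alpha` plays the role of a smoothing coefficient, and `enabled=False` models the case in which smoothing is disabled (e.g., for content flagged as dynamic).

```python
from typing import List


def smooth_weights(raw: List[float], alpha: float, enabled: bool = True) -> List[float]:
    """One-pole smoothing: y[n] = alpha * y[n-1] + (1 - alpha) * x[n].

    `alpha` close to 1 gives slow changes, `alpha` close to 0 gives fast changes.
    """
    if not enabled:
        # Smoothing disabled: pass the control weights through unchanged.
        return list(raw)
    smoothed: List[float] = []
    for w in raw:
        # The first value initializes the smoother state.
        previous = smoothed[-1] if smoothed else w
        smoothed.append(alpha * previous + (1.0 - alpha) * w)
    return smoothed
```

Under these assumptions, a virtualizer control weight might use a large `alpha` (slow changes in the spatial image), while a dialog enhancer control weight might use a small `alpha` or no smoothing at all.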

制御重みの（よってオーディオ後処理の）連続性／安定性を改善するための別のアプローチは、制御重みに非線形マッピングΦを適用することである。制御重みの値は、0から1の範囲であってもよい。非線形マッピングΦは、写像Φ:[0,1]→[0,1]であってもよい。好ましくは、非線形マッピングΦは、制御重みの値範囲（すなわち、[0,1]のような定義域範囲）の境界に近い制御値の値を、マッピングされた値の値範囲（すなわち、[0,1]のような像範囲）のそれぞれの境界に、より近くマッピングする。すなわち、Φは、値0＋ε（ε≪1）を0に近づけるようにマッピングしてもよく、すなわちΦ(0＋ε)＜(0＋ε)であり、値1－εを1に近づけるようにマッピングしてもよく、すなわちΦ(1－ε)＞(1－ε)である。そのような非線形マッピングΦの例は、シグモイド関数である。 Another approach to improving the continuity/stability of the control weights (and thereby of the audio post-processing) is to apply a non-linear mapping Φ to the control weights. The values of the control weights may be in the range from 0 to 1. The non-linear mapping Φ may be a mapping Φ: [0,1] → [0,1]. Preferably, the non-linear mapping Φ maps values of the control weights that are close to a boundary of the value range of the control weights (i.e., the domain range, such as [0,1]) closer to the respective boundary of the value range of the mapped values (i.e., the image range, such as [0,1]). That is, Φ may map a value 0 + ε (ε ≪ 1) closer to 0, i.e., Φ(0 + ε) < (0 + ε), and may map a value 1 − ε closer to 1, i.e., Φ(1 − ε) > (1 − ε). An example of such a non-linear mapping Φ is a sigmoid function.
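
One possible sigmoid-type mapping with these properties is sketched below. The disclosure names a sigmoid function only as an example and gives no formula, so the rescaled logistic curve, the function names, and the `steepness` parameter are assumptions. The rescaling ensures that 0 maps to 0 and 1 maps to 1; for the default steepness, values near the boundaries are pushed closer to them.

```python
import math


def _logistic(x: float) -> float:
    """Standard logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_map(weight: float, steepness: float = 10.0) -> float:
    """Map [0, 1] onto [0, 1] with a logistic curve centered at 0.5.

    The curve is shifted and rescaled so that the endpoints are preserved.
    """
    lo = _logistic(-steepness / 2.0)
    hi = _logistic(steepness / 2.0)
    return (_logistic(steepness * (weight - 0.5)) - lo) / (hi - lo)
```

With this choice, a control weight that hovers just above 0 or just below 1 is held near the boundary, which reduces small audible fluctuations, while mid-range values change more quickly.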

図13は、上述の考察に従って動作する重み計算モジュール170の例を概略的に示す。以下に説明される重み計算モジュール170は、たとえば、計算装置のプロセッサによって実装されてもよいことが理解される。 FIG. 13 schematically shows an example of a weight calculation module 170 that operates according to the above considerations. It is understood that the weight calculation module 170 described below may be implemented, for example, by a processor of a computing device.

限定を意図することなく、この実施例の重み計算モジュール170は、インテリジェント／動的等化器のための制御重みおよび仮想化器のための制御重みを決定する。他の制御重みも重み計算モジュール170によって計算されてもよいことが理解される。 Without intention of limitation, the weight calculation module 170 of this embodiment determines the control weights for the intelligent / dynamic equalizer and the control weights for the virtualizer. It is understood that other control weights may also be calculated by the weight calculation module 170.

重み計算モジュール170は、入力として信頼値（すなわち、分類情報125）を受領する。信頼値に基づいて、インテリジェント／動的等化器のための制御重みがブロック1310で計算される。等化は発話の音色を変化させる可能性があり、よって、典型的には、発話については望ましくないので、いくつかの実装では、分類情報（たとえば、信頼値）が、デコードされたオーディオ・コンテンツのコンテンツ型が発話である、または発話である可能性が高いことを示す場合（たとえば、発話信頼値がある閾値を上回る場合）には、等化が無効にされるよう、インテリジェント／動的等化器についての制御重み（等化器制御重み）が計算されうる。任意的に、ブロック1330において、等化器制御重みは平滑化されてもよい。平滑化は、等化器制御重みの平滑化に固有でありうる等化器制御重み平滑係数1335に依存してもよい。最終的に、（平滑化された）等化器制御重み175aが重み計算モジュール170によって出力される。 The weight calculation module 170 receives the confidence values (i.e., the classification information 125) as input. Based on the confidence values, the control weight for the intelligent/dynamic equalizer is calculated in block 1310. Because equalization can change the timbre of speech and is therefore typically undesirable for speech, in some implementations the control weight for the intelligent/dynamic equalizer (equalizer control weight) may be calculated such that equalization is disabled if the classification information (e.g., the confidence values) indicates that the content type of the decoded audio content is speech, or is likely to be speech (e.g., if the speech confidence value exceeds a certain threshold). Optionally, the equalizer control weight may be smoothed in block 1330. The smoothing may depend on equalizer control weight smoothing coefficients 1335, which may be specific to the smoothing of the equalizer control weight. Finally, the (smoothed) equalizer control weight 175a is output by the weight calculation module 170.

信頼値は、ブロック1320において、仮想化器のための制御重み（仮想化器制御重み）を計算するためにも使用される。仮想化は、音楽的な音色を変化させる可能性があり、よって、典型的には、音楽については望ましくないので、いくつかの実装では、分類情報（たとえば、信頼値）が、デコードされたオーディオ・コンテンツのコンテンツ型が音楽である、または音楽である可能性が高いことを示す場合（たとえば、音楽信頼値がある閾値を超えている場合）は仮想化（スピーカー仮想化）が無効にされるように、仮想化器のための制御重みが計算されてもよい。また、仮想化器のための制御重みは、仮想化器の係数が素通し（処理なし）と完全な仮想化との間でスケールするように計算されてもよい。一例として、仮想化器の制御重みは、音楽信頼値music_confidence、発話信頼値speech_confidence、および効果信頼値effects_confidenceに基づいて、次式により計算されうる。 The confidence values are also used in block 1320 to calculate the control weight for the virtualizer (virtualizer control weight). Because virtualization can change the musical timbre and is therefore typically undesirable for music, in some implementations the control weight for the virtualizer may be calculated such that virtualization (speaker virtualization) is disabled if the classification information (e.g., the confidence values) indicates that the content type of the decoded audio content is music, or is likely to be music (e.g., if the music confidence value exceeds a certain threshold). The control weight for the virtualizer may also be calculated such that the coefficients of the virtualizer scale between pass-through (no processing) and full virtualization. As an example, the virtualizer control weight may be calculated from the music confidence value music_confidence, the speech confidence value speech_confidence, and the effects confidence value effects_confidence as
1 − music_confidence * {1 − max[effects_confidence, speech_confidence]^2} (式1 / Equation 1)
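
Equation 1 can be evaluated directly, as in the following short sketch. The function name is hypothetical; the parameter names follow the equation, and the square applies to the maximum of the effects and speech confidence values.

```python
def virtualizer_control_weight(music_confidence: float, speech_confidence: float,
                               effects_confidence: float) -> float:
    """Equation 1: 1 - music * (1 - max(effects, speech) ** 2)."""
    dominant_non_music = max(effects_confidence, speech_confidence)
    return 1.0 - music_confidence * (1.0 - dominant_non_music ** 2)
```

For pure music (music_confidence = 1, other values 0) the weight is 0, i.e., pass-through. Without music the weight is 1, i.e., full virtualization. For music_confidence = 0.5 and speech_confidence = 0.5 the weight is 1 − 0.5 × (1 − 0.25) = 0.625.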

任意的に、仮想化器制御重みはブロック1340で平滑化されてもよい。平滑化は、仮想化器制御重みの平滑化に固有であってもよい仮想化器制御重み平滑係数1345に依存してもよい。 Optionally, the virtualizer control weight may be smoothed in block 1340. The smoothing may depend on virtualizer control weight smoothing coefficients 1345, which may be specific to the smoothing of the virtualizer control weight.

さらに、任意的に、仮想化器制御重みの安定性／連続性を改善するために、ブロック1350において、（平滑化された）仮想化器制御重みは、たとえばシグモイド関数によって増幅されてもよい。それにより、後処理されたオーディオ・コンテンツのレンダリングされた表現における可聴アーチファクトを低減することができる。増幅は、上述の非線形マッピングに従って進行してもよい。 Further, optionally, the (smoothed) virtualizer control weight may be amplified in block 1350, for example by a sigmoid function, in order to improve the stability/continuity of the virtualizer control weight. This can reduce audible artifacts in the rendered representation of the post-processed audio content. The amplification may proceed according to the non-linear mapping described above.

最終的に、（平滑化および／または増幅された）仮想化器制御重み175bが、重み計算モジュール170によって出力される。 Finally, the (smoothed and/or amplified) virtualizer control weight 175b is output by the weight calculation module 170.

信頼値は、ダイアログ向上器のための制御重み（ダイアログ向上器制御重み；図示せず）を計算するためにも使用できる。ダイアログ向上器は、周波数領域において、ダイアログを含む時間‐周波数タイルを検出してもよい。そして、これらの時間‐周波数タイルが選択的に向上されることができ、それによりダイアログを強調することができる。ダイアログ向上器の主な目的は、ダイアログを強調することであり、ダイアログ向上をダイアログのないコンテンツに適用することは、よくて、計算資源の無駄であるため、分類情報が、オーディオ・コンテンツのコンテンツ型が発話である、または発話である可能性が高いことを示す場合（かつ、その場合にのみ）、ダイアログ向上器によるダイアログ向上が有効にされるように、ダイアログ向上器制御重みが計算されてもよい。これは、たとえば、発話についての信頼値が所与の閾値を超える場合に当てはまりうる。等化器制御重みおよび仮想化器制御重みについてと同様に、ダイアログ向上器制御重みも平滑化および／または増幅の対象となりうる。 The confidence values can also be used to calculate a control weight for a dialog enhancer (dialog enhancer control weight; not shown). The dialog enhancer may detect, in the frequency domain, time-frequency tiles that contain dialog. These time-frequency tiles can then be selectively enhanced, thereby enhancing the dialog. Because the main purpose of the dialog enhancer is to enhance dialog, and applying dialog enhancement to dialog-free content is at best a waste of computational resources, the dialog enhancer control weight may be calculated such that dialog enhancement by the dialog enhancer is enabled if (and only if) the classification information indicates that the content type of the audio content is speech, or is likely to be speech. This may be the case, for example, if the confidence value for speech exceeds a given threshold. As with the equalizer control weight and the virtualizer control weight, the dialog enhancer control weight can also be subject to smoothing and/or amplification.
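
The threshold-based gating described for the equalizer and for the dialog enhancer can be sketched as two mirrored functions. The function names, the default threshold of 0.5, and the use of binary (0 or 1) control weights are assumptions for illustration; the disclosure only states that a threshold on the speech confidence value may be used.

```python
def equalizer_control_weight(speech_confidence: float, threshold: float = 0.5) -> float:
    """Disable equalization (weight 0) when the content is likely speech."""
    return 0.0 if speech_confidence > threshold else 1.0


def dialog_enhancer_control_weight(speech_confidence: float, threshold: float = 0.5) -> float:
    """Enable dialog enhancement (weight 1) only when the content is likely speech."""
    return 1.0 if speech_confidence > threshold else 0.0
```

The two weights respond in opposite directions to the same speech confidence value, and either could subsequently be smoothed and/or amplified as described above.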

さらに、信頼値は、サラウンドプロセッサ（サラウンドプロセッサ制御重み；図示せず）、アップミキサー、および／またはクロスフェーダーのための制御重みを計算するために使用できる。 In addition, confidence values can be used to calculate control weights for surround processors (surround processor control weights; not shown), upmixers, and / or crossfaders.

図14は、本開示の実施形態による、2つのスピーカーをもつモバイル装置（たとえば、携帯電話）による再生のための2チャネル（たとえばステレオ）オーディオ・コンテンツの特殊な場合における、ビットストリームからオーディオ・コンテンツをデコードする方法1400をフローチャートの形で示す。ビットストリームは、分類情報または2チャネル・オーディオ・コンテンツを含み、分類情報は、2チャネル・オーディオ・コンテンツの（たとえば、コンテンツ型に関する）コンテンツ分類を示すことが理解される。方法1400は、2つのスピーカーをもつモバイル装置のデコーダによって実行されてもよい。このデコーダは、図1のエンコーダ‐デコーダ・システム100におけるデコーダ115と同じ基本構成を有してもよく、たとえば、重み計算および後処理の具体的な実装を備えていてもよい。 FIG. 14 shows, in the form of a flowchart, a method 1400 of decoding audio content from a bitstream, according to an embodiment of the present disclosure, for the special case of two-channel (e.g., stereo) audio content for playback by a mobile device (e.g., a mobile phone) with two speakers. It is understood that the bitstream includes classification information or two-channel audio content, and that the classification information indicates a content classification (e.g., with respect to content type) of the two-channel audio content. Method 1400 may be performed by a decoder of a mobile device with two speakers. This decoder may have the same basic configuration as the decoder 115 in the encoder-decoder system 100 of FIG. 1, for example with specific implementations of the weight calculation and the post-processing.

ステップS1410では、AC-4ビットストリームが受領される。 In step S1410, the AC-4 bitstream is received.
ステップS1420では、2チャネル・オーディオ・コンテンツおよび分類情報が、ビットストリームからデコード／多重分離される。 In step S1420, the 2-channel audio content and the classification information are decoded/demultiplexed from the bitstream.
ステップS1430では、ステップS1420でデコードされた2チャネル・オーディオ・コンテンツはアップミックスされて、アップミックスされた5.1チャネル・オーディオ・コンテンツにされる。 In step S1430, the 2-channel audio content decoded in step S1420 is upmixed into upmixed 5.1-channel audio content.
ステップS1440では、2チャネルのスピーカー・アレイのための5.1仮想化のために、アップミックスされた5.1チャネル・オーディオ・コンテンツに対して仮想化器が適用される。仮想化器は、それぞれの制御重みの制御の下で動作する。仮想化器のための制御重みは、分類情報（たとえば、信頼値）に基づいて計算される。これは、たとえば、図13を参照して上述した仕方で行なうことができる。 In step S1440, a virtualizer is applied to the upmixed 5.1-channel audio content for 5.1 virtualization for a 2-channel speaker array. The virtualizer operates under the control of a respective control weight. The control weight for the virtualizer is calculated based on the classification information (e.g., the confidence values). This can be done, for example, in the manner described above with reference to FIG. 13.
ステップS1450では、クロスフェーダーが、2チャネル・オーディオ・コンテンツおよび仮想化されたアップミックスされた5.1チャネル・オーディオ・コンテンツに適用される。クロスフェーダーは、それぞれの制御重みの制御の下で動作する。クロスフェーダーのための制御重みは、分類情報（たとえば信頼値）に基づいて計算される。 In step S1450, a crossfader is applied to the 2-channel audio content and the virtualized upmixed 5.1-channel audio content. The crossfader operates under the control of a respective control weight. The control weight for the crossfader is calculated based on the classification information (e.g., the confidence values).
最後に、ステップS1460では、クロスフェーダーの出力は、2チャネル・スピーカー・アレイにルーティングされる。 Finally, in step S1460, the output of the crossfader is routed to the 2-channel speaker array.

図15は、本開示の実施形態による、方法1400を実行しうる2スピーカー・モバイル装置1505のデコーダ1500の例を概略的に示す。以下に説明されるデコーダ1500のモジュールは、たとえば、計算装置のプロセッサによって実装されてもよいことが理解される。 FIG. 15 schematically shows an example of a decoder 1500 of a two-speaker mobile device 1505 that may perform method 1400, according to an embodiment of the present disclosure. It is understood that the modules of the decoder 1500 described below may be implemented, for example, by a processor of a computing device.

デコーダ1500は、ビットストリーム110（たとえば、AC-4ビットストリーム）を受領し、次いで、該ビットストリームはAC-4（モバイル）デコーダ・モジュール1510によってデコード／多重分離される。AC-4（モバイル）デコーダ・モジュール1510は、デコードされた2チャネル・オーディオ・コンテンツ1515およびデコードされた分類情報125を出力する。デコードされた分類情報125は、分類情報125（たとえば、信頼値）に基づいてクロスフェード制御重み1575を計算する仮想化器クロスフェード重み計算モジュール1570に提供される。クロスフェード制御重み1575は、クロスフェード・モジュール1540によって組み合わされる2つの信号の相対重みを決定するパラメータであってもよい。デコードされた2チャネル・オーディオ・コンテンツ1515は、アップミックス・モジュール1520によって2.0チャネルから5.1チャネルにアップミックスされ、該アップミックス・モジュールは、アップミックスされた5.1チャネル・オーディオ・コンテンツ1525を出力する。次いで、ステレオスピーカーのための5.1仮想化が、仮想化モジュール（仮想化器）1530によって、アップミックスされた5.1チャネル・オーディオ・コンテンツ1525に適用される。仮想化モジュールは、仮想化されたアップミックスされた5.1チャネル・オーディオ・コンテンツ1535を出力し、それは、クロスフェード・モジュール1540によってもとのデコードされた2チャネル・オーディオ・コンテンツと組み合わされる。クロスフェード・モジュール1540は、クロスフェード制御重み1575の制御の下で動作し、最終的に、後処理された2チャネル・オーディオ・コンテンツ102を、モバイル装置1505のスピーカーにルーティングするために出力する。 The decoder 1500 receives a bitstream 110 (e.g., an AC-4 bitstream), which is then decoded/demultiplexed by the AC-4 (mobile) decoder module 1510. The AC-4 (mobile) decoder module 1510 outputs decoded 2-channel audio content 1515 and decoded classification information 125. The decoded classification information 125 is provided to the virtualizer crossfade weight calculation module 1570, which calculates the crossfade control weight 1575 based on the classification information 125 (e.g., the confidence values). The crossfade control weight 1575 may be a parameter that determines the relative weights of the two signals combined by the crossfade module 1540. The decoded 2-channel audio content 1515 is upmixed from 2.0 channels to 5.1 channels by the upmix module 1520, which outputs the upmixed 5.1-channel audio content 1525. 5.1 virtualization for stereo speakers is then applied to the upmixed 5.1-channel audio content 1525 by the virtualization module (virtualizer) 1530. The virtualization module outputs virtualized upmixed 5.1-channel audio content 1535, which is combined with the original decoded 2-channel audio content by the crossfade module 1540. The crossfade module 1540 operates under the control of the crossfade control weight 1575 and finally outputs the post-processed 2-channel audio content 102 for routing to the speakers of the mobile device 1505.
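
The role of the crossfade control weight can be illustrated by a minimal sketch, assuming a linear crossfade and assuming that the virtualized signal has already been rendered to two channels for the stereo speakers. The function name, the `StereoFrame` type, and the linear blending rule are hypothetical; the disclosure only states that the weight determines the relative weights of the two combined signals.

```python
from typing import List, Tuple

StereoFrame = Tuple[float, float]  # (left, right) sample pair


def crossfade(original: List[StereoFrame], virtualized: List[StereoFrame],
              weight: float) -> List[StereoFrame]:
    """Linear crossfade: weight 0 keeps the original 2-channel content,
    weight 1 keeps the virtualized content."""
    return [
        ((1.0 - weight) * o_l + weight * v_l, (1.0 - weight) * o_r + weight * v_r)
        for (o_l, o_r), (v_l, v_r) in zip(original, virtualized)
    ]
```

Under this assumption, a weight near 0 (e.g., for music) leaves the decoded stereo content essentially untouched, while a weight near 1 routes mostly the virtualized content to the speakers.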

図には示されていないが、デコーダ1500は、分類情報125（たとえば、信頼値）に基づいて、仮想化モジュール1530のための仮想化器制御重みを計算するためのモジュールをも含んでいてもよい。さらに、デコーダ1500は、分類情報125（たとえば、信頼値）に基づいて、アップミックス・モジュール1520のためのアップミックス制御重みを計算するためのモジュールを含んでいてもよい。 Although not shown in the figure, the decoder 1500 may also include a module for calculating a virtualizer control weight for the virtualization module 1530 based on the classification information 125 (e.g., the confidence values). Further, the decoder 1500 may include a module for calculating an upmix control weight for the upmix module 1520 based on the classification information 125 (e.g., the confidence values).

図16は、本開示の実施形態による、たとえばサウンドバー装置の5（またはそれ以上）スピーカー・アレイによる再生のための、2チャネル（たとえばステレオ）のオーディオ・コンテンツの特殊な場合における、ビットストリームからオーディオ・コンテンツをデコードする方法1600をフローチャートの形で示す。ここでも、ビットストリームは、分類情報または2チャネル・オーディオ・コンテンツを含み、分類情報は、2チャネル・オーディオ・コンテンツの（たとえばコンテンツ型に関する）コンテンツ分類を示すことが理解される。方法1600は、たとえば、サウンドバー装置のような、5（またはそれ以上）スピーカー・アレイをもつ装置のデコーダによって実行されてもよい。このデコーダは、図1のエンコーダ‐デコーダ・システム100におけるデコーダ115と同じ基本構成を有してもよく、たとえば、重み計算および後処理の特定の実装を備えていてもよい。 FIG. 16 shows, in the form of a flowchart, a method 1600 of decoding audio content from a bitstream, according to an embodiment of the present disclosure, for the special case of two-channel (e.g., stereo) audio content for playback by a five (or more) speaker array, for example of a soundbar device. Again, it is understood that the bitstream includes classification information or two-channel audio content, and that the classification information indicates a content classification (e.g., with respect to content type) of the two-channel audio content. Method 1600 may be performed by a decoder of a device with a five (or more) speaker array, such as a soundbar device. This decoder may have the same basic configuration as the decoder 115 in the encoder-decoder system 100 of FIG. 1, for example with specific implementations of the weight calculation and the post-processing.

At step S1610, the AC-4 bitstream is received.
At step S1620, the two-channel audio content and the classification information are decoded/demultiplexed from the bitstream.
At step S1630, an upmixer is applied to the two-channel audio content to upmix it to upmixed 5.1-channel audio content. The upmixer operates under the control of respective control weights, which are calculated based on the classification information (e.g., the confidence values). The control weights for the upmixer may relate, for example, to upmix weights.
At step S1640, a virtualizer is applied to the upmixed 5.1-channel audio content for 5.1 virtualization for a five-channel speaker array. The virtualizer operates under the control of respective control weights, which are calculated based on the classification information (e.g., the confidence values). This may be done, for example, in the manner described above with reference to FIG. 13.
Finally, at step S1650, the output of the virtualizer is routed to the five-channel speaker array.
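By way of non-limiting illustration, the order of steps S1630 to S1650 may be sketched as follows for a frame that has already been received and decoded (steps S1610 and S1620). Everything named here is hypothetical: the `weight_from_confidence` rule, the `toy_upmixer` and `toy_virtualizer` stand-ins, and the channel labels are assumptions chosen only to make the control flow runnable, and they are not the upmixer or virtualizer of the disclosure.

```python
from typing import Callable, Dict, List

Channels = Dict[str, List[float]]  # channel name -> samples


def weight_from_confidence(confidences: Dict[str, float]) -> float:
    """Toy rule (assumption): back off processing as music confidence rises."""
    music = min(max(confidences.get("music", 0.0), 0.0), 1.0)
    return 1.0 - music


def decode_for_soundbar(
    stereo: Channels,
    confidences: Dict[str, float],
    upmixer: Callable[[Channels, float], Channels],
    virtualizer: Callable[[Channels, float], Channels],
) -> Channels:
    """Apply steps S1630 to S1650 to an already decoded two-channel frame."""
    upmix_weight = weight_from_confidence(confidences)
    virtualizer_weight = weight_from_confidence(confidences)
    upmixed = upmixer(stereo, upmix_weight)                   # S1630
    speaker_feeds = virtualizer(upmixed, virtualizer_weight)  # S1640
    return speaker_feeds                                      # S1650: routed to the array


def toy_upmixer(stereo: Channels, weight: float) -> Channels:
    """Stand-in upmixer: weighted centre channel, silent LFE and surrounds."""
    left, right = stereo["L"], stereo["R"]
    centre = [weight * 0.5 * (l + r) for l, r in zip(left, right)]
    silence = [0.0] * len(left)
    return {"L": left, "R": right, "C": centre,
            "LFE": silence, "Ls": silence, "Rs": silence}


def toy_virtualizer(upmixed: Channels, weight: float) -> Channels:
    """Stand-in virtualizer: drops LFE and scales the surround feeds."""
    feeds = {name: list(upmixed[name]) for name in ("L", "R", "C")}
    for name in ("Ls", "Rs"):
        feeds[name] = [weight * s for s in upmixed[name]]
    return feeds
```

The two processing stages are passed in as callables so that the sketch fixes only the ordering and the fact that each stage receives its own control weight derived from the classification information.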

FIG. 17 schematically illustrates an example of a decoder 1700 of a soundbar device 1705 that may perform method 1600, according to an embodiment of the present disclosure. It is understood that the modules of the decoder 1700 described below may be implemented, for example, by a processor of a computing device. The decoder 1700 receives a bitstream 110 (e.g., an AC-4 bitstream), which is decoded/demultiplexed by an AC-4 (soundbar) decoder module 1710. The AC-4 (soundbar) decoder module 1710 outputs decoded two-channel audio content 1715 and decoded classification information 125. The decoded classification information 125 is provided to an upmix weight calculation module 1770, which calculates upmix control weights 1775 based on the classification information 125 (e.g., the confidence values). The upmix control weights 1775 may be, for example, upmix weights. The decoded two-channel audio content 1715 is upmixed from 2.0 channels to 5.1 channels by an upmix module 1720, which outputs upmixed 5.1-channel audio content 1725. The upmix module 1720 operates under the control of the upmix control weights 1775. For example, different upmixing (with different upmix control weights) may be performed for music and for speech. A virtualization module (virtualizer) 1730 then applies 5.1 virtualization for a five-channel speaker array to the upmixed 5.1-channel audio content 1725 and outputs virtualized upmixed 5.1-channel audio content. The virtualized upmixed 5.1-channel audio content is finally output as post-processed 5.1-channel audio content 102 for routing to the speakers of the soundbar device 1705.

Although not shown in the figure, the decoder 1700 may also include a module for calculating virtualizer control weights for the virtualization module 1730 based on the classification information 125 (e.g., the confidence values), for example in the manner described above with reference to FIG. 13. Notably, methods 1400 and 1600, as well as the corresponding decoders 1500 and 1700, are examples of endpoint-specific audio post-processing.
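As a purely illustrative sketch of what endpoint-specific post-processing can mean in practice, the stage order may be looked up from the type of playback endpoint. The endpoint names and stage labels below are hypothetical; the two stage lists are assumed to correspond to the two-speaker chain (upmixer, virtualizer, crossfader) and the five-speaker chain (upmixer, virtualizer) recited in EEE 37 and EEE 38.

```python
from typing import Dict, List

# Hypothetical endpoint names mapped to an ordered list of post-processing stages.
POST_PROCESSING_CHAINS: Dict[str, List[str]] = {
    "two_speaker_device": ["upmixer", "virtualizer", "crossfader"],
    "five_speaker_soundbar": ["upmixer", "virtualizer"],
}


def post_processing_chain(endpoint: str) -> List[str]:
    """Return the ordered post-processing stages for a given endpoint type."""
    if endpoint not in POST_PROCESSING_CHAINS:
        raise ValueError(f"no post-processing chain defined for {endpoint!r}")
    return list(POST_PROCESSING_CHAINS[endpoint])
```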

Various aspects of the invention may be understood from the following enumerated example embodiments (EEEs).
[EEE1]
A method of encoding audio content, the method comprising:
performing a content analysis of the audio content;
generating classification information indicative of a content type of the audio content based on the content analysis;
encoding the audio content and the classification information in a bitstream; and
outputting the bitstream.
[EEE2]
The method according to EEE 1, wherein the content analysis is based at least in part on metadata for the audio content.
[EEE3]
A method of encoding audio content, the method comprising:
receiving a user input relating to a content type of the audio content;
generating classification information indicative of the content type of the audio content based on the user input;
encoding the audio content and the classification information in a bitstream; and
outputting the bitstream.
[EEE4]
The method according to EEE 3, wherein the user input includes one or more of:
a label indicating that the audio content is of a given content type; and
one or more confidence values, each confidence value being associated with a respective content type and giving an indication of a likelihood that the audio content is of the respective content type.
[EEE5]
A method of encoding audio content, wherein the audio content is provided in a stream of audio content as part of an audio program, the method comprising:
receiving a service type indication that indicates a service type of the audio content;
performing a content analysis of the audio content based at least in part on the service type indication;
generating classification information indicative of a content type of the audio content based on the content analysis;
encoding the audio content and the classification information in a bitstream; and
outputting the bitstream.
[EEE6]
The method according to EEE 5, further comprising:
determining, based on the service type indication, whether the service type of the audio content is a music service; and
in response to determining that the service type of the audio content is a music service, generating the classification information to indicate that the content type of the audio content is music content.
[EEE7]
The method according to EEE 5 or 6, further comprising:
determining, based on the service type indication, whether the service type of the audio content is a newscast service; and
in response to determining that the service type of the audio content is a newscast service, adapting the content analysis to have a higher likelihood of indicating that the audio content is speech content.
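By way of illustration only, the use of a service type indication in EEE 6 and EEE 7 might be sketched as below. The service type strings, the choice of a confidence of 1.0 for music, and the additive `speech_bias` are all assumptions made for this sketch; the embodiments do not prescribe how the content analysis is adapted.

```python
from typing import Dict, Optional


def classify_with_service_type(
    analysis_confidences: Dict[str, float],
    service_type: Optional[str],
    speech_bias: float = 0.2,
) -> Dict[str, float]:
    """Adjust analyser confidences using a (hypothetical) service type string."""
    confidences = dict(analysis_confidences)
    if service_type == "music":
        # EEE 6: a music service leads directly to a music classification.
        names = set(confidences) | {"music"}
        return {name: (1.0 if name == "music" else 0.0) for name in names}
    if service_type == "newscast":
        # EEE 7: make a speech result more likely (additive bias is an assumption).
        confidences["speech"] = min(1.0, confidences.get("speech", 0.0) + speech_bias)
    return confidences
```

Any other service type leaves the analyser output unchanged in this sketch.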
[EEE8]
The method according to any one of EEEs 5 to 7, wherein the service type indication is provided on a frame-by-frame basis.
[EEE9]
A method of encoding audio content, wherein the audio content is provided on a file basis and the files include metadata for their respective audio content, the method comprising:
performing a content analysis of the audio content based at least in part on the metadata for the audio content;
generating classification information indicative of a content type of the audio content based on the content analysis;
encoding the audio content and the classification information in a bitstream; and
outputting the bitstream.
[EEE10]
The method according to EEE 9,
wherein the metadata includes a file content type indication that indicates a file content type of the file; and
wherein the content analysis is based at least in part on the file content type indication.
[EEE11]
The method according to EEE 10, further comprising:
determining, based on the file content type indication, whether the file content type of the file is a music file; and
in response to determining that the file content type of the file is a music file, generating the classification information to indicate that the content type of the audio content is music content.
[EEE12]
The method according to EEE 10 or 11, further comprising:
determining, based on the file content type indication, whether the file content type of the file is a newscast file; and
in response to determining that the file content type of the file is a newscast file, adapting the content analysis to have a higher likelihood of indicating that the audio content is speech content.
[EEE13]
The method according to any one of EEEs 10 to 12, further comprising:
determining, based on the file content type indication, whether the file content type of the file is dynamic content; and
in response to determining that the file content type of the file is dynamic content, adapting the content analysis to allow for a higher transition rate between different content types.
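One possible reading of "a higher transition rate" in EEE 13 is that the content analysis follows changes in the per-frame confidence more quickly for dynamic content. The sketch below assumes a one-pole smoother and two hypothetical smoothing coefficients; neither the smoother nor the values 0.5 and 0.1 are taken from the disclosure.

```python
from typing import List


def smooth_confidences(frames: List[float], alpha: float) -> List[float]:
    """One-pole smoother; alpha in (0, 1], larger alpha follows changes faster."""
    smoothed: List[float] = []
    state = None
    for value in frames:
        state = value if state is None else (1.0 - alpha) * state + alpha * value
        smoothed.append(state)
    return smoothed


def alpha_for_file_type(file_content_type: str) -> float:
    """Hypothetical coefficients: dynamic content is allowed to switch faster."""
    return 0.5 if file_content_type == "dynamic" else 0.1
```

With a step from 0 to 1 in the raw confidence, the dynamic setting reaches 0.75 after two frames, whereas the slower setting reaches only about 0.19.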
[EEE14]
The method according to any one of EEEs 1 to 13, wherein the classification information comprises one or more confidence values, each confidence value being associated with a respective content type and giving an indication of a likelihood that the audio content is of the respective content type.
[EEE15]
The method according to any one of EEEs 1 to 14, wherein the content types include one or more of music content, speech content, or effects content.
[EEE16]
The method according to any one of EEEs 1 to 15, further comprising encoding an indication of scene transitions in the audio content into the bitstream.
[EEE17]
The method according to any one of EEEs 1 to 16, further comprising smoothing the classification information before encoding.
[EEE18]
The method according to any one of EEEs 1 to 17, further comprising quantizing the classification information before encoding.
[EEE19]
The method according to any one of EEEs 1 to 18, further comprising encoding the classification information into a specific data field in a packet of the bitstream.
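As a minimal sketch of EEE 18 and EEE 19, confidence values in the range 0 to 1 may be quantized to a small number of bits and packed into a single integer field. The 4-bit width, the uniform quantizer and the most-significant-first packing order are assumptions made for illustration; the actual bitstream syntax is not specified here.

```python
from typing import List


def quantize_confidence(value: float, bits: int = 4) -> int:
    """Uniformly quantize a confidence in [0, 1] to an integer code."""
    levels = (1 << bits) - 1
    clamped = min(max(value, 0.0), 1.0)
    return int(round(clamped * levels))


def dequantize_confidence(code: int, bits: int = 4) -> float:
    """Map an integer code back to a confidence in [0, 1]."""
    return code / ((1 << bits) - 1)


def pack_confidences(values: List[float], bits: int = 4) -> int:
    """Pack quantized confidences into one integer, first value in the top bits."""
    field = 0
    for value in values:
        field = (field << bits) | quantize_confidence(value, bits)
    return field


def unpack_confidences(field: int, count: int, bits: int = 4) -> List[float]:
    """Recover `count` confidences from a field written by pack_confidences."""
    mask = (1 << bits) - 1
    codes = [(field >> (bits * (count - 1 - i))) & mask for i in range(count)]
    return [dequantize_confidence(code, bits) for code in codes]
```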
[EEE20]
A method of decoding audio content from a bitstream including audio content and classification information for the audio content, the classification information indicating a content classification of the audio content, the method comprising:
receiving the bitstream;
decoding the audio content and the classification information; and
selecting, based on the classification information, a post-processing mode for performing post-processing of the decoded audio content.
[EEE21]
The method according to EEE 20, wherein selecting the post-processing mode is further based on a user input.
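For illustration, the mode selection of EEE 20 and EEE 21 could be as simple as choosing the content type with the highest confidence unless the user has chosen a mode. Treating the user input as an override, naming modes after content types, and the `"default"` fallback are assumptions of this sketch.

```python
from typing import Dict, Optional


def select_post_processing_mode(
    confidences: Dict[str, float],
    user_mode: Optional[str] = None,
) -> str:
    """Pick a post-processing mode from decoded confidences and optional user input."""
    if user_mode is not None:
        return user_mode      # assumption: an explicit user choice takes priority
    if not confidences:
        return "default"      # hypothetical fallback when no classification is present
    return max(confidences, key=confidences.get)
```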
[EEE22]
A method of decoding audio content from a bitstream including audio content and classification information for the audio content, the classification information indicating a content classification of the audio content, the method comprising:
receiving the bitstream;
decoding the audio content and the classification information; and
calculating one or more control weights for post-processing of the decoded audio content based on the classification information.
[EEE23]
The method according to EEE 22,
wherein the classification information comprises one or more confidence values, each confidence value being associated with a respective content type and giving an indication of a likelihood that the audio content is of the respective content type; and
wherein the control weights are calculated based on the confidence values.
[EEE24]
The method according to EEE 22 or 23, wherein the control weights are control weights for respective modules for post-processing of the decoded audio content.
[EEE25]
The method according to any one of EEEs 22 to 24, wherein the control weights include one or more of a control weight for an equalizer, a control weight for a virtualizer, a control weight for a surround processor, and a control weight for a dialog enhancer.
[EEE26]
The method according to any one of EEEs 22 to 25, wherein the calculation of the control weights depends on a device type of a device that performs the decoding.
[EEE27]
The method according to any one of EEEs 22 to 26, wherein the calculation of the control weights is further based on a user input.
[EEE28]
The method according to any one of EEEs 22 to 27, wherein the calculation of the control weights is further based on a number of channels of the audio content.
[EEE29]
The method according to any one of EEEs 22 to 28,
wherein the control weights include a control weight for a virtualizer; and
wherein the control weight for the virtualizer is calculated such that the virtualizer is disabled if the classification information indicates that the content type of the audio content is music or is likely to be music.
[EEE30]
The method according to any one of EEEs 22 to 29,
wherein the control weights include a control weight for a virtualizer; and
wherein the control weight for the virtualizer is calculated such that the coefficients of the virtualizer scale between pass-through and full virtualization.
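The behaviour of EEE 29 and EEE 30 may be illustrated with a toy weight rule and a linear blend of coefficient sets. The threshold `disable_at`, the linear ramp below it, and the representation of the virtualizer as a flat list of coefficients are assumptions; this is not the calculation described with reference to FIG. 13.

```python
from typing import List


def virtualizer_weight(music_confidence: float, disable_at: float = 0.9) -> float:
    """Return 0.0 (virtualizer disabled) for likely music, up to 1.0 otherwise."""
    c = min(max(music_confidence, 0.0), 1.0)
    if c >= disable_at:
        return 0.0
    return 1.0 - c / disable_at


def scale_coefficients(
    pass_through: List[float], full: List[float], weight: float
) -> List[float]:
    """Blend coefficients: weight 0 gives pass-through, weight 1 full virtualization."""
    return [(1.0 - weight) * p + weight * f for p, f in zip(pass_through, full)]
```

A weight of 0.5 applied to pass-through coefficients `[1.0, 0.0]` and full-virtualization coefficients `[0.5, 0.5]` yields `[0.75, 0.25]`.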
[EEE31]
The method according to any one of EEEs 22 to 30,
wherein the control weights include a control weight for a dialog enhancer; and
wherein the control weight for the dialog enhancer is calculated such that dialog enhancement by the dialog enhancer is increased if the classification information indicates that the content type of the audio content is speech or is likely to be speech.
[EEE32]
The method according to any one of EEEs 22 to 31,
wherein the control weights include a control weight for a dynamic equalizer; and
wherein the control weight for the dynamic equalizer is calculated such that the dynamic equalizer is disabled if the classification information indicates that the content type of the audio content is speech or is likely to be speech.
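A minimal sketch of the two speech-driven weights of EEE 31 and EEE 32 follows. Using the speech confidence directly as the dialog enhancer weight, and gating the dynamic equalizer with a hypothetical threshold `disable_at`, are assumptions made only to show the direction of each dependency.

```python
def _clamp(value: float) -> float:
    """Limit a confidence to the range [0, 1]."""
    return min(max(value, 0.0), 1.0)


def dialog_enhancer_weight(speech_confidence: float) -> float:
    """More confident speech gives stronger dialog enhancement."""
    return _clamp(speech_confidence)


def dynamic_equalizer_weight(speech_confidence: float, disable_at: float = 0.9) -> float:
    """Simple on/off gate: the dynamic equalizer is disabled for likely speech."""
    return 0.0 if _clamp(speech_confidence) >= disable_at else 1.0
```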
[EEE33]
The method according to any one of EEEs 22 to 32, further comprising smoothing the control weights.
[EEE34]
The method according to EEE 33, wherein the smoothing of the control weights depends on the specific control weight that is smoothed.
[EEE35]
The method according to EEE 33 or 34, wherein the smoothing of the control weights depends on a device type of a device that performs the decoding.
[EEE36]
The method according to any one of EEEs 33 to 35, further comprising applying a non-linear mapping function to the control weights to increase the continuity of the control weights.
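To illustrate EEE 33, EEE 34 and EEE 36 together, the sketch below passes a raw control weight through a sigmoid and then smooths it with a coefficient that depends on which weight is being smoothed. The sigmoid, its `steepness` and `midpoint`, and the per-weight coefficients in `SMOOTHING` are all hypothetical choices; the embodiments require only some non-linear mapping and some weight-dependent smoothing.

```python
import math

# Hypothetical per-weight smoothing coefficients (larger = faster response).
SMOOTHING = {"virtualizer": 0.05, "dialog_enhancer": 0.3}


def sigmoid_map(weight: float, steepness: float = 10.0, midpoint: float = 0.5) -> float:
    """Non-linear mapping that pushes weights towards 0 or 1 away from the midpoint."""
    return 1.0 / (1.0 + math.exp(-steepness * (weight - midpoint)))


def smooth_weight(name: str, previous: float, target: float) -> float:
    """Move the previous weight a fraction of the way towards the new target."""
    alpha = SMOOTHING[name]
    return previous + alpha * (target - previous)
```

A typical per-frame update under these assumptions would be `w = smooth_weight("virtualizer", w, sigmoid_map(raw_weight))`.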
[EEE37]
A method of decoding audio content from a bitstream including two-channel audio content and classification information for the two-channel audio content, the classification information indicating a content classification of the two-channel audio content, the method comprising:
receiving the AC-4 bitstream;
decoding the two-channel audio content and the classification information;
upmixing the two-channel audio content to upmixed 5.1-channel audio content;
applying a virtualizer to the upmixed 5.1-channel audio content for 5.1 virtualization for a two-channel speaker array;
applying a crossfader to the two-channel audio content and the virtualized upmixed 5.1-channel audio content; and
routing an output of the crossfader to the two-channel speaker array,
wherein the method further comprises calculating respective control weights for the virtualizer and the crossfader based on the classification information.
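The crossfader stage of EEE 37 may be pictured as a weighted mix of the decoded two-channel content and the virtualized signal, both rendered for two speakers. The linear fade law and the stereo tuple layout below are assumptions of this sketch; other fade laws would serve equally well.

```python
from typing import List, Tuple

Stereo = Tuple[List[float], List[float]]  # (left samples, right samples)


def cross_fade(original: Stereo, virtualized: Stereo, weight: float) -> Stereo:
    """Linear crossfade: weight 0 keeps the original, weight 1 the virtualized signal."""
    w = min(max(weight, 0.0), 1.0)

    def mix(a: List[float], b: List[float]) -> List[float]:
        return [(1.0 - w) * x + w * y for x, y in zip(a, b)]

    return mix(original[0], virtualized[0]), mix(original[1], virtualized[1])
```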
[EEE38]
A method of decoding audio content from a bitstream including two-channel audio content and classification information for the two-channel audio content, the classification information indicating a content classification of the two-channel audio content, the method comprising:
receiving the bitstream;
decoding the two-channel audio content and the classification information;
applying an upmixer to the two-channel audio content to upmix the two-channel audio content to upmixed 5.1-channel audio content;
applying a virtualizer to the upmixed 5.1-channel audio content for 5.1 virtualization for a five-channel speaker array; and
routing an output of the virtualizer to the five-channel speaker array,
wherein the method further comprises calculating respective control weights for the upmixer and the virtualizer based on the classification information.
[EEE39]
An encoder for encoding audio content, the encoder comprising a processor coupled to a memory storing instructions for the processor, wherein the processor is adapted to perform the method according to any one of EEEs 1 to 19.
[EEE40]
A decoder for decoding audio content, the decoder comprising a processor coupled to a memory storing instructions for the processor, wherein the processor is adapted to perform the method according to any one of EEEs 20 to 38.
[EEE41]
A computer program comprising instructions that, when executed by a processor, cause the processor to perform the method according to any one of EEEs 1 to 38.
[EEE42]
A computer-readable storage medium storing the computer program according to EEE 41.

Claims

A method of encoding audio content, the method comprising:
performing a content analysis of the audio content;
generating classification information indicative of a content type of the audio content based on the content analysis, wherein the classification information comprises one or more confidence values, each confidence value being associated with a respective content type and giving an indication of a likelihood that the audio content is of the respective content type;
encoding the audio content and the classification information in a bitstream; and
outputting the bitstream.

The method according to claim 1, wherein the content analysis is based at least in part on metadata for the audio content.

The method according to claim 1 or 2, further comprising receiving a user input relating to the content type of the audio content,
wherein generating the classification information is based on the user input.

The method according to claim 3, wherein the user input includes a label indicating that the audio content is of a given content type.

The method according to any one of claims 1 to 4, wherein the audio content is provided in a stream of audio content as part of an audio program, the method further comprising:
receiving a service type indication that indicates a service type of the audio content; and
performing the content analysis of the audio content based at least in part on the service type indication,
wherein generating the classification information indicative of the content type of the audio content is based on the content analysis.

The method according to claim 5, further comprising:
determining, based on the service type indication, whether the service type of the audio content is a music service; and
in response to determining that the service type of the audio content is a music service, generating the classification information to indicate that the content type of the audio content is music content.

The method according to claim 5 or 6, further comprising:
determining, based on the service type indication, whether the service type of the audio content is a newscast service; and
in response to determining that the service type of the audio content is a newscast service, adapting the content analysis so that the likelihood value indicating that the audio content is speech content is higher than a predetermined threshold.

The method according to any one of claims 5 to 7, wherein the service type indication is provided on a frame-by-frame basis.

The method according to any one of claims 1 to 7, insofar as dependent on claim 2, wherein the audio content is provided on a file basis and the files include metadata for their respective audio content.

The method according to claim 9, wherein the metadata includes a file content type indication that indicates a file content type of the file, and the content analysis is based at least in part on the file content type indication.

The method according to claim 10, further comprising:
determining, based on the file content type indication, whether the file content type of the file is a music file; and
in response to determining that the file content type of the file is a music file, generating the classification information to indicate that the content type of the audio content is music content.

The method according to claim 10 or 11, further comprising:
determining, based on the file content type indication, whether the file content type of the file is a newscast file; and
in response to determining that the file content type of the file is a newscast file, adapting the content analysis so that the likelihood value indicating that the audio content is speech content is higher than a predetermined threshold.

The method according to any one of claims 10 to 12, further comprising:
determining, based on the file content type indication, whether the file content type of the file is dynamic content; and
in response to determining that the file content type of the file is dynamic content, adapting the content analysis to allow for a higher transition rate between different content types.

The method according to any one of claims 1 to 13, wherein the content types include one or more content types selected from the group consisting of music content, speech content, effects content, and crowd noise.

The method according to any one of claims 1 to 14, further comprising encoding an indication of scene transitions in the audio content into the bitstream.

The method according to any one of claims 1 to 15, further comprising smoothing the classification information before encoding.

The method according to any one of claims 1 to 16, further comprising quantizing the classification information before encoding.

The method according to any one of claims 1 to 17, further comprising encoding the classification information into a specific data field in a packet of the bitstream.

An encoder for encoding audio content, the encoder comprising a processor coupled to a memory storing instructions for the processor, wherein the processor is adapted to perform the method according to any one of claims 1 to 18.

A method of decoding audio content from a bitstream including audio content and classification information for the audio content, wherein the classification information indicates a content type of the audio content and comprises one or more confidence values, each confidence value being associated with a respective content type and giving an indication of a likelihood that the audio content is of the respective content type, the method comprising:
receiving the bitstream;
decoding the audio content and the classification information;
selecting, based on the classification information, a post-processing mode for performing post-processing of the decoded audio content; and
calculating one or more control weights for the post-processing of the decoded audio content based on the classification information, wherein the control weights are calculated based on the confidence values.

The method according to claim 20, wherein the bitstream includes channel-based audio content and the post-processing comprises:
upmixing the channel-based audio content to upmixed channel-based audio content; and
applying a virtualizer to the upmixed channel-based audio content to obtain virtualized upmixed channel-based audio content for virtualization for a speaker array having a desired number of channels.

The method according to claim 20 or 21, wherein selecting the post-processing mode is further based on a user input.

The method according to claim 21 or 22, further comprising:
routing an output of the virtualizer to the speaker array; and
calculating respective control weights for the upmixer and the virtualizer based on the classification information.

The method according to claim 21 or 22, further comprising, after applying the virtualizer:
applying a crossfader to the channel-based audio content and the virtualized upmixed audio content;
routing an output of the crossfader to the speaker array; and
calculating respective control weights for the upmixer and the crossfader based on the classification information.

The method according to any one of claims 20 to 24, wherein the control weight is a control weight for each module for post-processing of the decoded audio content.

The control weights include one or more of a control weight for an equalizer, a control weight for a virtualizer, a control weight for a surround processor, and a control weight for a dialog improver. The method according to any one of claims 20 to 25.

The method according to any one of claims 20 to 26, wherein the calculation of the control weight depends on the device type of the device that performs the decoding.

The method according to any one of claims 20 to 27, wherein the calculation of the control weight is further based on user input.

The method according to any one of claims 20 to 28, wherein the calculation of the control weight is further based on the number of channels of the audio content.

The method according to any one of claims 20 to 29, wherein the control weights include a control weight for a virtualizer, and the control weight for the virtualizer is calculated so that the virtualizer is disabled if the classification information indicates that the content type of the audio content is or is likely to be music.

The method according to any one of claims 20 to 30, wherein the control weights include a control weight for a virtualizer, and the control weight for the virtualizer is calculated so that the coefficients of the virtualizer scale between pass-through and full virtualization.
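
As an illustration of the two virtualizer claims above, the following minimal sketch shows one way a virtualizer control weight could be derived from a music confidence value and then used to move between pass-through and full virtualization. The function names, the 0.5 threshold, and the linear fall-off are assumptions made for this example and are not taken from the claims. Blending the dry and virtualized signals is used here as a stand-in for scaling the coefficients of a linear virtualizer.

```python
def virtualizer_weight(music_confidence: float, disable_threshold: float = 0.5) -> float:
    """Hypothetical rule: 0.0 disables the virtualizer, 1.0 means full virtualization."""
    if not 0.0 <= music_confidence <= 1.0:
        raise ValueError("music_confidence must be in [0, 1]")
    # Assumption: content that is, or is likely to be, music is not virtualized.
    if music_confidence >= disable_threshold:
        return 0.0
    # Assumption: below the threshold the weight falls off linearly.
    return 1.0 - music_confidence / disable_threshold


def apply_virtualizer(dry, virtualized, weight):
    """Blend per sample between pass-through (weight 0) and full virtualization (weight 1)."""
    if len(dry) != len(virtualized):
        raise ValueError("signals must have the same length")
    return [(1.0 - weight) * d + weight * v for d, v in zip(dry, virtualized)]
```

With these assumed rules, a music confidence of 0.9 yields a weight of 0.0, so the dry signal passes through unchanged.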

The method according to any one of claims 20 to 31, wherein the control weights include a control weight for a dialog improver, and the control weight for the dialog improver is calculated so that dialog improvement by the dialog improver is strengthened if the classification information indicates that the content type of the audio content is or is likely to be an utterance.

The method according to any one of claims 20 to 32, wherein the control weights include a control weight for a dynamic equalizer, and the control weight for the dynamic equalizer is calculated so that the dynamic equalizer is disabled if the classification information indicates that the content type of the audio content is or is likely to be an utterance.
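
The utterance-dependent behaviour of the dialog improver and the dynamic equalizer can be sketched as follows. This is only an illustrative sketch: the function name, the dictionary keys, the 0.5 threshold, and the specific mapping from confidence to weight are all assumptions and are not specified in the claims.

```python
def dialog_and_equalizer_weights(utterance_confidence: float,
                                 threshold: float = 0.5) -> dict:
    """Hypothetical mapping from an utterance (speech) confidence to two control weights."""
    if not 0.0 <= utterance_confidence <= 1.0:
        raise ValueError("utterance_confidence must be in [0, 1]")
    likely_utterance = utterance_confidence >= threshold
    return {
        # Assumption: dialog improvement grows with the utterance confidence.
        "dialog_improver": utterance_confidence,
        # Assumption: a weight of 0.0 disables the dynamic equalizer.
        "dynamic_equalizer": 0.0 if likely_utterance else 1.0 - utterance_confidence,
    }
```

For example, a confidence of 0.8 gives a dialog improver weight of 0.8 and a dynamic equalizer weight of 0.0 under these assumptions.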

The method according to any one of claims 20 to 33, further comprising smoothing the control weights.

The method of claim 34, wherein the smoothing of the control weights depends on the particular control weight to be smoothed.

The method of claim 34 or 35, wherein the smoothing of the control weights depends on the device type of the device performing the decoding.
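
One possible reading of the smoothing claims is a first-order recursive smoother whose factor is looked up by control weight and device type. The sketch below assumes this form. The table `SMOOTHING_FACTORS`, its keys and values, and the class `WeightSmoother` are hypothetical names and numbers chosen for illustration; the claims do not prescribe a particular smoothing method.

```python
# Hypothetical smoothing factors per (control weight, device type).
# A larger factor means slower changes; the values are illustrative only.
SMOOTHING_FACTORS = {
    ("virtualizer", "mobile"): 0.5,
    ("virtualizer", "soundbar"): 0.75,
    ("dialog_improver", "mobile"): 0.25,
    ("dialog_improver", "soundbar"): 0.5,
}


class WeightSmoother:
    """First-order recursive smoother for a single control weight."""

    def __init__(self, weight_name: str, device_type: str, initial: float = 0.0):
        self.alpha = SMOOTHING_FACTORS[(weight_name, device_type)]
        self.value = initial

    def update(self, target: float) -> float:
        # Move part of the way from the previous value towards the new target.
        self.value = self.alpha * self.value + (1.0 - self.alpha) * target
        return self.value
```

Calling `update` once per frame with the newly calculated weight returns the smoothed weight that would be handed to the post-processing module.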

The method of any one of claims 33-36, further comprising applying a non-linear mapping function to the control weights to increase the continuity of the control weights.
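
The claim does not name a specific non-linear mapping function. As a hedged example, the sketch below uses the well-known smoothstep polynomial, which maps the range 0 to 1 onto itself and flattens near both ends, so small fluctuations of a weight close to 0 or 1 produce even smaller changes after mapping. The choice of smoothstep and the function name are assumptions for illustration.

```python
def map_control_weight(weight: float) -> float:
    """Smoothstep mapping 3w^2 - 2w^3 on [0, 1] (an assumed choice of non-linear map)."""
    # Clamp first so out-of-range inputs cannot leave the [0, 1] interval.
    w = min(1.0, max(0.0, weight))
    return w * w * (3.0 - 2.0 * w)
```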

The method according to any one of claims 21 to 37, wherein the bitstream is an AC-4 bitstream, and the method comprises:
decoding two-channel audio content and the classification information;
upmixing the two-channel audio content into upmixed 5.1-channel audio content;
applying the virtualizer to the upmixed 5.1-channel audio content for 5.1 virtualization for a two-channel speaker array;
applying a crossfader to the two-channel audio content and the virtualized upmixed 5.1-channel audio content; and
routing the output of the crossfader to the two-channel speaker array,
the method further comprising calculating respective control weights for the virtualizer and the crossfader based on the classification information.
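
The order of stages in the two-channel case above (upmix, virtualize, crossfade, route) can be sketched for a single stereo sample as follows. AC-4 decoding is not modeled. The upmix and virtualizer bodies are deliberately simple placeholders rather than real signal processing, and every function name, gain, and the rule deriving both weights from a music confidence are assumptions made for this example.

```python
def control_weights(classification: dict) -> dict:
    """Hypothetical rule: less virtualization the more music-like the content is."""
    w = 1.0 - classification.get("music", 0.0)
    return {"virtualizer": w, "crossfader": w}


def upmix_to_5_1(left: float, right: float) -> tuple:
    """Placeholder upmix returning (L, R, C, LFE, Ls, Rs) for one sample."""
    center = 0.5 * (left + right)
    return (left, right, center, 0.0, left, right)


def virtualize_5_1(channels: tuple, weight: float) -> tuple:
    """Placeholder virtualizer: folds 5.1 to two channels.

    weight 0.0 passes the front pair through; weight 1.0 adds the full
    centre and surround contribution (LFE is ignored in this sketch).
    """
    l, r, c, _lfe, ls, rs = channels
    return (l + weight * 0.5 * (c + ls), r + weight * 0.5 * (c + rs))


def crossfade(original: tuple, virtualized: tuple, weight: float) -> tuple:
    """Blend the original two-channel sample with the virtualized one."""
    return tuple((1.0 - weight) * o + weight * v
                 for o, v in zip(original, virtualized))


def post_process(stereo_sample: tuple, classification: dict) -> tuple:
    """Run one sample through upmix, virtualizer and crossfader."""
    weights = control_weights(classification)
    upmixed = upmix_to_5_1(*stereo_sample)
    virtualized = virtualize_5_1(upmixed, weights["virtualizer"])
    # The returned pair is what would be routed to the two-channel speaker array.
    return crossfade(stereo_sample, virtualized, weights["crossfader"])
```

With a music confidence of 1.0 both weights are 0.0 under these assumptions, so the original two-channel sample reaches the speakers unchanged.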

The method according to any one of claims 21 to 38, wherein the bitstream includes two-channel audio content and classification information about the two-channel audio content, the classification information indicating a content classification of the two-channel audio content, and the method comprises:
decoding the two-channel audio content and the classification information;
applying the upmixer to the two-channel audio content so that the two-channel audio content is upmixed into upmixed 5.1-channel audio content;
applying the virtualizer to the upmixed 5.1-channel audio content for 5.1 virtualization for a 5-channel speaker array; and
routing the output of the virtualizer to the 5-channel speaker array,
the method further comprising calculating respective control weights for the upmixer and the virtualizer based on the classification information.

A decoder for decoding audio content, wherein the decoder has a processor, the processor is coupled to a memory storing instructions for the processor, and the processor is adapted to perform the method according to any one of claims 20 to 39.

A computer program comprising instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 39.