JP6683231B2

JP6683231B2 - Information processing apparatus and information processing method

Info

Publication number: JP6683231B2
Application number: JP2018188776A
Authority: JP
Inventors: 隆一難波; 金章藤下
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2018-10-04
Filing date: 2018-10-04
Publication date: 2020-04-15
Anticipated expiration: 2034-11-04
Also published as: JP2019020743A

Description

The present disclosure relates to an information processing device and an information processing method.

For information having a given duration, such as audio information and video information, there is a demand to grasp an outline of the content without listening to or watching all of it. For example, Patent Document 1 discloses a technique that detects, from feature amounts indicating the characteristics of audio information, a climax part that is a noteworthy scene in the audio information, and assigns an index to that climax part. According to this technique, reproducing only the indexed parts of the audio information yields a digest of the audio information in which only the climax parts are extracted.

Patent Document 1: JP 2004-191780 A

Consider, for example, generating a digest of audio information recorded at a meeting. On the one hand, a user may want the digest to include the scenes where the meeting is most animated, that is, scenes in which the discussion becomes heated, in order to grasp an outline of the meeting. On the other hand, a user may want the digest to include the voices of as many people as possible in order to identify the participants. Thus, what users want from a digest varies with their purpose. Because the technique described in Patent Document 1 is specialized for detecting climax parts, it is considered difficult for that technique to meet such diverse user demands.

Therefore, the present disclosure proposes a new and improved information processing device capable of further improving user convenience.

According to the present disclosure, there is provided an information processing apparatus including: a feature amount extraction unit that extracts a feature amount of sound included in audio information; and a digest section determination unit that determines, from the audio information, digest sections constituting a digest of the audio information, based on a sound source type calculated according to the feature amount. The digest section determination unit determines the sound source type of the sound to be included in the digest based on a preset mode, and the preset mode includes at least a single sound source mode in which the digest is generated so as to include the sound source type designated to be preferentially included in the digest.

Further, according to the present disclosure, there is provided an information processing apparatus including: a feature amount extraction unit that extracts a feature amount of sound included in audio information; and a digest section determination unit that determines, from the audio information, digest sections constituting a digest of the audio information, based on a sound source type calculated according to the feature amount. The digest section determination unit determines the sound source type of the sound to be included in the digest based on a preset mode, and the preset mode includes at least a multiple sound source mode in which the digest is generated so as to include sound of a plurality of sound source types at a predetermined ratio.

Further, according to the present disclosure, there is provided an information processing apparatus including: a feature amount extraction unit that extracts a feature amount of sound included in audio information; and a digest section determination unit that determines, from the audio information, digest sections constituting a digest of the audio information, based on a sound source type calculated according to the feature amount. The digest section determination unit determines the sound source type of the sound to be included in the digest based on a preset mode, and the preset mode includes at least a diversity reflection mode in which the digest is generated so as to include a variety of sounds from among the sounds classified into the same sound source type.

According to the present disclosure, a feature amount of the sound included in audio information is extracted, and digest sections constituting a digest of the audio information are determined from the audio information based on a sound source type calculated according to the feature amount. It therefore becomes possible to generate a digest that meets diverse user demands according to sound source type, which further improves user convenience.

As described above, according to the present disclosure, user convenience can be further improved. Note that the above effect is not necessarily limiting; together with or in place of the above effect, any of the effects described in this specification, or other effects that can be understood from this specification, may be achieved.

A functional block diagram showing an example of the functional configuration of the information processing apparatus according to the present embodiment.
A diagram showing an example of the sound source type scores calculated by the sound source type score calculation unit.
An explanatory diagram for explaining the relationship between audio information and a digest.
A flowchart showing an example of the processing procedure of offline processing.
A flowchart showing an example of the processing procedure of digest section determination processing in the single sound source mode in offline processing.
A flowchart showing an example of the processing procedure of digest section determination processing in the single sound source mode in offline processing.
An explanatory diagram for explaining high score section determination processing in offline processing.
A flowchart showing an example of the processing procedure of high score section determination processing in offline processing.
A flowchart showing an example of the processing procedure of high score section determination processing in offline processing.
A flowchart showing an example of the processing procedure of digest section determination processing in the multiple sound source mode in offline processing.
A flowchart showing an example of the processing procedure of digest section determination processing in the multiple sound source mode in offline processing.
A functional block diagram showing an example of the functional configuration of an information processing apparatus that executes each process in the diversity reflection mode.
A flowchart showing an example of the processing procedure of digest section determination processing in the diversity reflection mode in offline processing.
A flowchart showing an example of the processing procedure of digest section determination processing in the diversity reflection mode in offline processing.
A flowchart showing an example of the processing procedure of diversity-based digest section deletion processing in offline processing.
A flowchart showing an example of the processing procedure of online processing.
A flowchart showing an example of the processing procedure of digest section determination processing in the single sound source mode in offline processing.
A flowchart showing an example of the processing procedure of frame deletion processing in the single sound source mode in online processing.
An explanatory diagram for explaining high score section determination processing in online processing.
A flowchart showing an example of the processing procedure of high score section determination processing in online processing.
A flowchart showing an example of the processing procedure of high score section determination processing in online processing.
A flowchart showing an example of the processing procedure of high score section determination processing in online processing.
A flowchart showing an example of the processing procedure of digest section determination processing in the multiple sound source mode in online processing.
A flowchart showing an example of the processing procedure of frame deletion processing in the multiple sound source mode in online processing.
A flowchart showing an example of the processing procedure of frame deletion processing in the diversity reflection mode in online processing.
A flowchart showing an example of the processing procedure of diversity-based deletion frame selection processing in online processing.
A functional block diagram showing an example of the functional configuration of an information processing apparatus according to a modification provided with a sound pickup function.
A functional block diagram showing an example of the functional configuration of an information processing apparatus according to a modification provided with a digest generation function.
A functional block diagram showing an example of the functional configuration of an information processing apparatus according to a modification provided with an audio information database.
A block diagram showing an example of the hardware configuration of the information processing apparatus according to the present embodiment.

Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. In this specification and the drawings, constituent elements having substantially the same functional configuration are denoted by the same reference numerals, and redundant description is omitted.

The description will be given in the following order.
1. Examination of existing technologies
2. Device configuration
3. Details of offline processing
3-1. Overall processing procedure
3-2. Single sound source mode
3-2-1. Processing procedure of digest section determination processing
3-2-2. High score section determination processing
3-3. Multiple sound source mode
3-3-1. Processing procedure of digest section determination processing
3-4. Diversity reflection mode
3-4-1. Functional configuration
3-4-2. Processing procedure of digest section determination processing
3-4-4. Diversity-based digest section deletion processing
4. Details of online processing
4-1. Overall processing procedure
4-2. Single sound source mode
4-2-1. Digest section determination processing
4-2-2. Frame deletion processing
4-2-3. High score section determination processing
4-3. Multiple sound source mode
4-3-1. Processing procedure of digest section determination processing
4-3-2. Frame deletion processing
4-4. Diversity reflection mode
4-4-1. Processing procedure of frame deletion processing
4-4-2. Diversity-based deletion frame selection processing
5. Modifications
6. Hardware configuration
7. Conclusion

(1. Examination of existing technologies)
Before describing a preferred embodiment of the present disclosure, the results of the present inventors' examination of existing general techniques will be described, together with the background that led the present inventors to conceive of the present disclosure.

In general, techniques for generating a digest have been developed so that an outline of audio information, video information, and the like can be grasped easily. In particular, many techniques relating to video information have been proposed, such as generating a digest of a recorded television program. However, many techniques for generating a digest from video information assume a multimodal framework that uses both feature amounts calculated from video and feature amounts calculated from audio. Compared with video information, which carries a large amount of information, appropriately generating a digest from audio information alone is considered more difficult.

Possible general methods for generating a digest of audio information include simply extracting the beginning, middle, and end portions of the audio information, or extracting the sections with high volume. Some existing IC recorders also have a function of playing back the first 5 seconds of a selected audio file. However, a method that extracts predetermined sections regardless of the content of the audio information is likely to produce a digest that contains no significant information. A volume-based method, meanwhile, may include sections in the digest that are not necessarily useful, such as sections with loud noise.

Another technique for generating a digest of audio information is the one described in Patent Document 1 above. As noted, however, that technique is specialized for extracting climax parts to generate a digest. Because what a user wants to grasp from a digest is not necessarily limited to climax parts, it is difficult for that technique to meet the diverse demands users place on a digest.

The results of the present inventors' examination of existing general techniques have been described above. As explained, among techniques for generating a digest of audio information, a more convenient technique that can meet diverse user demands has been desired. Based on the above examination, the present inventors diligently studied techniques capable of further improving user convenience and, as a result, arrived at the embodiment of the present disclosure described below. A preferred embodiment of the present disclosure conceived by the present inventors is described in detail below.

(2. Device configuration)
The functional configuration of an information processing apparatus according to an embodiment of the present disclosure will be described with reference to FIG. 1. FIG. 1 is a functional block diagram showing an example of the functional configuration of the information processing apparatus according to the present embodiment.

Referring to FIG. 1, the information processing apparatus 110 according to the present embodiment includes, as its functions, a feature amount extraction unit 111, a sound source type score calculation unit 113, and a digest section determination unit 115. The information processing apparatus 110 takes arbitrary audio information as input, determines digest sections, which are the sections of the audio information that constitute a digest of that audio information, and outputs information about the digest sections (digest section information).

The input source of the audio information to the information processing apparatus 110 may be arbitrary. For example, the audio information input to the information processing apparatus 110 may be stored in a storage unit (not shown) provided in the information processing apparatus 110, or may be input from an external device different from the information processing apparatus 110. Alternatively, when the information processing apparatus 110 has a sound pickup unit that picks up external sound, the audio information may be input via the sound pickup unit (such a configuration is described in detail in (5-1. Modification provided with a sound pickup function) below).

The feature amount extraction unit 111 extracts feature amounts of the audio information. Various physical quantities indicating the characteristics of the audio information can be calculated as the feature amounts. For example, power, spectral envelope shape, zero-crossing count, pitch (fundamental frequency), MFCC (Mel-Frequency Cepstrum Coefficients), and the like may be calculated. For audio information picked up by microphones arranged at mutually different positions, the correlation between the pickup positions may be calculated as a feature amount, and the sound source direction may further be calculated based on that correlation. The feature amount extraction unit 111 can calculate at least one of these feature amounts.

Because various methods commonly used in audio analysis may be used for the processing by which the feature amount extraction unit 111 extracts feature amounts from the audio information, a detailed description of that processing is omitted. The feature amounts calculated by the feature amount extraction unit 111 are not limited to those listed above; the feature amount extraction unit 111 may calculate any of the various feature amounts commonly calculated in audio analysis.

The feature amounts calculated by the feature amount extraction unit 111 can be expressed, for example, as a vector (feature amount vector) in a space (feature amount space) whose number of dimensions equals the number of types of feature amounts calculated. The feature amount extraction unit 111 provides information about the calculated feature amounts (that is, information about the feature amount vector) to the sound source type score calculation unit 113.
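
As a rough illustration of feature amounts and the feature amount vector, the following sketch computes two of the listed quantities, power and zero-crossing count, for fixed-length frames and returns one two-dimensional vector per frame. The function names, the non-overlapping framing, and the choice of only two features are assumptions made for illustration; the disclosure does not prescribe a particular extraction method.

```python
from typing import List, Sequence, Tuple


def frame_power(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossings(frame: Sequence[float]) -> int:
    """Number of sign changes between consecutive samples (zero counts as non-negative)."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def feature_vectors(samples: Sequence[float], frame_len: int) -> List[Tuple[float, int]]:
    """Split samples into non-overlapping frames and return (power, zero crossings) per frame.

    Trailing samples that do not fill a whole frame are dropped (an assumption of this sketch).
    """
    vectors = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        vectors.append((frame_power(frame), zero_crossings(frame)))
    return vectors
```

Each tuple is one point in a two-dimensional feature amount space; adding further feature amounts such as pitch would add dimensions.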

Based on the feature amounts of the audio information extracted by the feature amount extraction unit 111, the sound source type score calculation unit 113 calculates sound source type scores indicating the likelihood of each sound source type for the sound included in the audio information. Here, a sound source type is one of several categories into which sound sources are classified. For example, the sound source type scores include a music score indicating music-likeness, a voice score indicating human-voice-likeness, and/or a noise score indicating noise-likeness. When the voice score is calculated, finer-grained scores may also be calculated: a male voice score indicating male-voice-likeness, a female voice score indicating female-voice-likeness, a child voice score indicating child-voice-likeness, and/or a specific voice score indicating the likelihood that the sound is uttered by a specific person.

The sound source type score calculation unit 113 calculates at least one of the sound source type scores described above for each predetermined section of the audio information. Hereinafter, the time unit for which the sound source type score calculation unit 113 calculates a sound source type score is referred to as a score calculation section. A score calculation section may be, for example, a section corresponding to a frame.

Various classifiers commonly used in audio analysis may be used to calculate the sound source type scores. Such a classifier, trained by machine learning for example, can calculate each sound source type score according to the feature amount vector of the audio information being analyzed, that is, according to its coordinates in the feature amount space. When it is difficult to train the classifier by machine learning in advance, the sound source type score calculation unit 113 can calculate a sound source type score according to the distance from an average speaker characteristic derived from past calculations. For example, the sound source type score calculation unit 113 outputs a higher sound source type score as the distance from the past speaker characteristic increases.
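
The fallback for the case where no pre-trained classifier is available can be sketched as follows. This is a minimal sketch that assumes a Euclidean distance and a running mean of past feature amount vectors as the "average speaker characteristic"; the disclosure specifies neither the distance measure nor how the average is maintained, and the class and method names are hypothetical.

```python
import math
from typing import List, Sequence


class NoveltyScorer:
    """Scores a feature vector by its distance from the mean of past vectors."""

    def __init__(self, dim: int) -> None:
        self.mean: List[float] = [0.0] * dim
        self.count = 0

    def score(self, vec: Sequence[float]) -> float:
        """Return the distance to the current mean, then fold vec into the mean."""
        if self.count == 0:
            dist = 0.0  # assumption: with no history, the first vector scores zero
        else:
            dist = math.dist(vec, self.mean)
        # Incremental update of the running mean.
        self.count += 1
        self.mean = [m + (v - m) / self.count for m, v in zip(self.mean, vec)]
        return dist
```

A vector far from what has been observed so far receives a high score, which matches the behavior described above of outputting a higher value as the distance grows.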

FIG. 2 is a diagram showing an example of the sound source type scores calculated by the sound source type score calculation unit 113. In FIG. 2, the horizontal axis represents time within the audio information, the vertical axis represents the sound source type score calculated for each score calculation section, and the relationship between the two is plotted. In the example shown in FIG. 2, three types of sound source type scores are calculated by the sound source type score calculation unit 113.

The sound source type score calculation unit 113 provides information about the sound source type scores calculated for each score calculation section to the digest section determination unit 115.

Based on the sound source type scores calculated by the sound source type score calculation unit 113, the digest section determination unit 115 determines, from the audio information, digest sections, which are the time sections constituting a digest of the audio information. The relationship between audio information and a digest will now be described with reference to FIG. 3. FIG. 3 is an explanatory diagram for explaining the relationship between audio information and a digest.

As shown in FIG. 3, a digest is composed of at least one time section within the audio information. In the illustrated example, four time sections within the audio information (digest sections 1 to 4) are determined as the time sections constituting the digest (digest sections), and the digest is formed by joining these digest sections together.

In the following description, the duration of each digest section is referred to as the digest section length, and the duration of the digest is referred to as the digest length. The digest length, for example one minute, is set in advance by the user, the designer of the information processing apparatus 110, or the like as the desired length of the digest. The digest sections are determined so that the sum of the digest section lengths substantially matches the digest length.
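
The relationship between digest sections, digest section lengths, and the digest can be sketched as below. Representing a section as a pair of sample indices with an exclusive end, and joining the sections in chronological order, are assumptions of this sketch; the names are hypothetical.

```python
from typing import List, Sequence, Tuple

Section = Tuple[int, int]  # (start, end) in samples, end exclusive


def total_length(sections: Sequence[Section]) -> int:
    """Sum of the digest section lengths."""
    return sum(end - start for start, end in sections)


def build_digest(samples: Sequence[float], sections: Sequence[Section]) -> List[float]:
    """Join the digest sections, in chronological order, into one digest."""
    digest: List[float] = []
    for start, end in sorted(sections):
        digest.extend(samples[start:end])
    return digest
```

For non-overlapping sections, the length of the list returned by `build_digest` equals `total_length(sections)`, which is the quantity that should approximately match the configured digest length.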

The digest section determination unit 115 basically determines time sections of the audio information that have higher sound source type scores as the digest sections. However, as shown in FIG. 2, a plurality of sound source type scores can each be calculated independently for the audio information. It is therefore necessary to set in advance which sound source type score is used to determine the digest sections.

Which sound source type score should be used preferentially to determine the digest sections can vary with what the user wants. For example, for a user who wants to extract only male voices from the audio information, it is desirable to focus on the male voice score and determine the time sections with higher male voice scores as the digest sections. Alternatively, for a user who wants to extract the various sounds included in the audio information evenly, it is desirable that, for each sound source type, the time sections with high scores for that type be determined as digest sections in a balanced manner.

Therefore, in the present embodiment, a mode is set for the digest to be generated, and the digest section determination unit 115 determines the digest sections according to the set mode. The mode may be fixed in advance, or may be switched arbitrarily in response to an operation input by the user via an input unit (not shown) of the information processing apparatus 110. Mode information indicating the set mode is input to the digest section determination unit 115. The digest section determination unit 115 determines the sound source type of the sound to be included in the digest based on the set mode, and can determine, as the digest sections, the sections of the audio information that have higher sound source type scores for the determined sound source type.

For example, the modes include a single sound source mode in which the digest is generated so as to include only sound of a single sound source type, a multiple sound source mode in which the digest is generated so as to include sound of a plurality of sound source types at a predetermined ratio, and/or a diversity reflection mode in which the digest is generated so as to include a variety of sounds from among the sounds classified into the same sound source type.
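
One way to represent the three modes and the accompanying mode information in code is sketched below. The type names, field names, and string values are hypothetical and chosen only for illustration; the disclosure does not define a data format for mode information.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class DigestMode(Enum):
    SINGLE_SOURCE = "single"    # only one sound source type
    MULTI_SOURCE = "multi"      # several sound source types at a given ratio
    DIVERSITY = "diversity"     # varied sounds within one sound source type


@dataclass
class ModeInfo:
    mode: DigestMode
    # Hypothetical field: sound source type to prioritize in single sound source mode.
    target_type: Optional[str] = None
    # Hypothetical field: ratio per sound source type in multiple sound source mode.
    ratios: Dict[str, float] = field(default_factory=dict)
```

For instance, `ModeInfo(DigestMode.SINGLE_SOURCE, target_type="male_voice")` would describe a digest made only of sections with a high male voice score.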

In the single sound source mode, the mode information includes information designating the sound source type to be preferentially included in the digest. In this mode, the digest section determination unit 115 determines, as digest sections, the sections in which the sound source type score for the one designated sound source type is higher.

In the multiple sound source mode, the mode information includes information designating the ratio of the sound source types to be included in the digest. In this mode, the digest section determination unit 115 sets, for each sound source type, the time length of audio to include in the digest based on the designated ratio. It then determines, as digest sections, the sections that have a higher sound source type score for each sound source type and whose total length for that type does not exceed the time length set for it.
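The allocation described above can be sketched as follows. This is only an illustrative sketch: the function names, the representation of a candidate section as a `(start_s, end_s, score)` tuple, and the greedy highest-score-first selection are assumptions made for the example, not details given in this description.

```python
def allocate_type_lengths(digest_length_s, ratios):
    """Split the overall digest length among sound source types by ratio."""
    total = sum(ratios.values())
    if total <= 0:
        raise ValueError("ratios must sum to a positive value")
    return {name: digest_length_s * r / total for name, r in ratios.items()}


def pick_sections(candidates, budget_s):
    """Greedily keep the highest-scoring sections that fit in the time budget.

    candidates: list of (start_s, end_s, score) tuples for ONE sound source type.
    Returns the kept sections in chronological order.
    """
    chosen, used = [], 0.0
    for start, end, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        length = end - start
        if used + length <= budget_s:  # total must not exceed the per-type length
            chosen.append((start, end, score))
            used += length
    return sorted(chosen)


lengths = allocate_type_lengths(60.0, {"voice": 3, "music": 1, "noise": 0})
voice_sections = pick_sections(
    [(0.0, 10.0, 0.9), (20.0, 50.0, 0.8), (60.0, 65.0, 0.7)], budget_s=15.0
)
```

Setting a ratio of 0 for a type such as noise gives it a zero-length budget, so none of its sections are selected.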

The ratio can be designated by the user as appropriate as part of the mode information. This lets the user choose, according to his or her needs, which sound source types to include preferentially in the digest. Conversely, the user can also set a low ratio for sound source types, such as noise, that he or she does not want in the digest.

Note that the ratio of the sound source types to be included in the digest need not be input from outside as mode information; it may instead be set automatically by the information processing apparatus 110. For example, the total time length of the sections with a relatively high sound source type score may be calculated for each sound source type, the ratio may be determined as the ratio of those totals among the sound source types, and the type digest length (the digest length for each type) may be determined from it. A ratio determined in this way can reflect the probability with which audio of each sound source type appears in the audio information.
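A minimal sketch of this automatic setting is shown below. It assumes per-frame scores and treats a frame as "relatively high" when its score is at or above a fixed value `high_score`; that criterion, the frame duration `frame_s`, and the function name are hypothetical choices for illustration.

```python
def auto_ratios(scores_by_type, frame_s, high_score):
    """Derive per-type ratios from the total duration of high-score frames.

    scores_by_type: dict mapping a sound source type to its per-frame scores.
    frame_s: duration of one frame in seconds.
    high_score: frames scoring at or above this value count as "high" (assumed rule).
    """
    durations = {
        name: sum(1 for s in scores if s >= high_score) * frame_s
        for name, scores in scores_by_type.items()
    }
    total = sum(durations.values())
    if total == 0:
        return {name: 0.0 for name in durations}
    return {name: d / total for name, d in durations.items()}


ratios = auto_ratios(
    {"voice": [0.9, 0.8, 0.1, 0.7], "music": [0.1, 0.2, 0.9, 0.1]},
    frame_s=0.5,
    high_score=0.6,
)
```

The resulting ratios can then be multiplied by the overall digest length to obtain a digest length for each type.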

In the diversity reflection mode, the digest section determination unit 115 calculates, within the same sound source type, the variation in feature amounts and the variation in the times at which sounds were emitted, and determines digest sections so that both variations become larger.

For example, sounds classified into the same sound source type on the basis of the sound source type score may in fact be the voices of different people. When digest sections are determined so that the variation in feature amounts within the same sound source type becomes larger, the digest includes sounds that fall under the same sound source type in terms of the score but have comparatively different feature amounts, so a wider variety of sounds is included in the digest.

Similarly, even when sounds are classified into the same sound source type on the basis of the sound source type score and are likely to be the voice of the same person, statements made some time apart may differ completely in content. When digest sections are determined so that the variation in the times at which sounds were emitted within the same sound source type becomes larger, the digest includes sounds that fall under the same sound source type in terms of the score but were emitted at widely separated times, so audio with more varied content is included in the digest.
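One simple way to favor both kinds of variation is sketched below. It is a stand-in, not the procedure of this embodiment: it uses greedy farthest-point selection over candidate sections of one sound source type, with a distance that adds a feature-space distance to a normalized time gap. The function name, the `(time_s, feature_vector)` representation, and the `time_weight` parameter are all assumptions for illustration.

```python
import math


def pick_diverse(candidates, k, time_weight=1.0):
    """Pick k candidates that are spread out in both feature space and time.

    candidates: list of (time_s, feature_vector) for ONE sound source type.
    Returns the indices of the selected candidates in ascending order.
    """
    if not candidates or k <= 0:
        return []
    times = [c[0] for c in candidates]
    t_span = (max(times) - min(times)) or 1.0  # avoid division by zero

    def dist(a, b):
        feature_gap = math.dist(a[1], b[1])   # Euclidean distance of features
        time_gap = abs(a[0] - b[0]) / t_span  # time gap scaled to [0, 1]
        return feature_gap + time_weight * time_gap

    chosen = [0]  # start from the first candidate (arbitrary choice)
    while len(chosen) < min(k, len(candidates)):
        remaining = [i for i in range(len(candidates)) if i not in chosen]
        # Take the candidate whose nearest already-chosen neighbor is farthest.
        best = max(
            remaining,
            key=lambda i: min(dist(candidates[i], candidates[j]) for j in chosen),
        )
        chosen.append(best)
    return sorted(chosen)


picked = pick_diverse(
    [(0.0, (0.0, 0.0)), (1.0, (0.1, 0.0)), (100.0, (0.0, 0.1)), (50.0, (1.0, 1.0))],
    k=3,
)
```

In this example the second candidate is skipped because it is close to the first one in both time and features.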

The more specific contents of the digest section determination processing in each of the single sound source mode, the multiple sound source mode, and the diversity reflection mode are described in detail below in (3-2. Single sound source mode), (3-3. Multiple sound source mode), (3-4. Diversity reflection mode), (4-2. Single sound source mode), (4-3. Multiple sound source mode), and (4-4. Diversity reflection mode).

Once it has determined the digest sections, the digest section determination unit 115 outputs information about them (digest section information). The digest section information includes, for example, the start time, end time, and length of each digest section, and an index assigned to each digest section (digest section index). In other words, the digest section information identifies the positions of the digest sections within the audio information, and the digest can be generated based on the audio information and the digest section information.
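As an illustration of how such information can be enough to produce a digest, the sketch below models one digest section as a small record and concatenates the matching sample ranges. The class name, field names, and the `cut_digest` helper are hypothetical, and the audio is assumed to be a plain list of samples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DigestSection:
    index: int      # digest section index
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds

    @property
    def length_s(self):
        """Digest section length, derived from the start and end times."""
        return self.end_s - self.start_s


def cut_digest(samples, sample_rate, sections):
    """Concatenate the samples of each digest section in chronological order."""
    digest = []
    for sec in sorted(sections, key=lambda s: s.start_s):
        first = int(sec.start_s * sample_rate)
        last = int(sec.end_s * sample_rate)
        digest.extend(samples[first:last])
    return digest


sections = [DigestSection(1, 7.0, 9.0), DigestSection(0, 2.0, 4.0)]
digest = cut_digest(list(range(10)), sample_rate=1, sections=sections)
```

Because the record only stores positions, the same section information can be applied by a different device that holds the audio information.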

The digest section determination unit 115 may output the digest section information to any destination. For example, it may output the digest section information to a storage unit (not shown) provided in the information processing apparatus 110, or to an external device different from the information processing apparatus 110.

When the digest section information is stored in the information processing apparatus 110, the apparatus may further have a function of generating the digest based on the digest section information and the audio information (this configuration is described in detail below in (5-2. Modification provided with a digest generation function)). When the digest section information is output to an external device, that external device may have the function of generating the digest based on the digest section information and the audio information. Thus, in the present embodiment, the information processing apparatus 110 is configured to have at least the function of generating digest section information; the function of actually generating the digest afterwards need not be provided in the information processing apparatus 110.

The functional configuration of the information processing apparatus according to the present embodiment has been described above with reference to FIG. 1. As described, according to the present embodiment, a sound source type score indicating the likelihood of the sound source type of the audio contained in the audio information is calculated, and the digest sections that make up the digest of the audio information are determined from the audio information based on that score. This makes it possible to generate digests that meet diverse user needs, for example including only music in the digest, including only human voices, or including music and human voices in good balance. The series of processes performed by the feature amount extraction unit 111, the sound source type score calculation unit 113, and the digest section determination unit 115 may be started in response to a user's instruction via an input unit (not shown), or processing of the audio information may start automatically when the audio information is input to the information processing apparatus 110.

The specific device configuration of the information processing apparatus 110 may be arbitrary. For example, the information processing apparatus 110 may be any of various processors such as a CPU (Central Processing Unit), a DSP (Digital Signal Processor), or an ASIC (Application Specific Integrated Circuit). Alternatively, the information processing apparatus 110 may be a device on which such processors are mounted, such as a PC, a server, a smartphone, or a tablet PC. It may also be a device with sound collection and recording functions, such as an IC recorder. The functions of the information processing apparatus 110 shown in FIG. 1 can be executed by the processors operating according to a predetermined program.

Further, the functions of the information processing apparatus 110 (the feature amount extraction unit 111, the sound source type score calculation unit 113, and the digest section determination unit 115) need not all be executed by a single device. For example, the functions corresponding to these units may be distributed across and implemented in a plurality of information processing devices (for example, a plurality of processors), and the functions of the information processing apparatus 110 described above may be realized by those devices being communicably connected to one another and operating in cooperation. The information processing apparatus 110 may be a local information processing apparatus operated directly by the user, or a so-called cloud-based information processing apparatus connected to the user's terminal via a network. For example, when a user's terminal such as a smartphone or an IC recorder has a recording function, the audio information recorded by the terminal may be transmitted from the terminal to the information processing apparatus 110 on the cloud, the information processing apparatus 110 may apply the various kinds of processing described above to the audio information, and the processing result, namely the digest section information or the audio information of the digest, may be transmitted from the information processing apparatus 110 to the terminal.

A computer program for realizing the functions of the information processing apparatus 110 according to the present embodiment as described above can be created and installed on a PC or the like. A computer-readable recording medium storing such a computer program can also be provided. The recording medium is, for example, a magnetic disk, an optical disk, a magneto-optical disk, or a flash memory. The computer program may also be distributed without using a recording medium, for example via a network.

The processing executed by the information processing apparatus 110 is described in more detail below. In the present embodiment, this processing can be broadly divided into two kinds according to its processing form. In one, the information processing apparatus 110 performs the feature amount extraction processing, the sound source type score calculation processing, and the digest section determination processing on audio information that has already been acquired in its entirety. This kind of processing is referred to below as offline processing.

In the other, the information processing apparatus 110 performs the feature amount extraction processing, the sound source type score calculation processing, and the digest section determination processing as needed on audio information that is still being acquired. In this case, the digest section information is updated as needed for as long as the audio information continues to be acquired. This kind of processing is referred to below as online processing.

The detailed processing contents can differ between offline processing and online processing, and within each of them the detailed contents of the digest section determination processing can differ according to the mode described above. The following therefore describes the detailed processing contents of offline processing and online processing in turn, including, for each, the digest section determination processing in each mode.

The following description takes as an example the case where the score calculation section is a frame section, that is, the case where the sound source type score is calculated for each frame. The present embodiment is not limited to this example, however, and a section consisting of a plurality of frames may be set as the score calculation section. For simplicity, the sound source type score may be referred to below simply as the score.

(3. Details of offline processing)
(3-1. Overall processing procedure)
The processing procedure of offline processing is described with reference to FIG. 4. FIG. 4 is a flowchart showing an example of the processing procedure of offline processing. The procedure shown in FIG. 4 corresponds to the overall procedure of the information processing method executed by the information processing apparatus 110 shown in FIG. 1 during offline processing. In offline processing, the scores of all frames of the audio information are calculated first, and the digest sections are then determined from the audio information based on those scores.

Referring to FIG. 4, in offline processing, the feature amounts of the audio information are extracted first (step S101). In step S101, various physical quantities indicating characteristics of the audio information, such as power and spectral envelope shape, are calculated as its feature amounts. The processing in step S101 corresponds, for example, to the processing performed by the feature amount extraction unit 111 shown in FIG. 1.
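As a small illustration of one such feature amount, the sketch below computes the mean power of each frame. The non-overlapping framing, the dropping of a trailing partial frame, and the function name are assumptions for the example; the description does not specify how frames are cut.

```python
def frame_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers


powers = frame_power([1.0, -1.0, 2.0, 2.0, 3.0], frame_len=2)
```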

Next, the sound source type score of each frame is calculated based on the extracted feature amounts (step S103). In step S103, for example, a classifier that identifies the sound source type of audio from the feature amounts of the audio information calculates, for each frame, a sound source type score indicating the likelihood of the sound source type of that audio. Multiple kinds of sound source type scores, such as a sound score, a voice score, and a noise score, may be calculated at this time. The processing in step S103 corresponds, for example, to the processing performed by the sound source type score calculation unit 113 shown in FIG. 1.

When the score calculation section is not a single frame section but consists of a plurality of frame sections, step S103 may also include processing that smooths the sound source type scores of the individual frames to calculate a sound source type score for each score calculation section.
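The description does not say which smoothing is used. A minimal sketch, assuming the smoothing is a plain average over consecutive, non-overlapping groups of frames, might look like this (the function and parameter names are hypothetical):

```python
def section_scores(frame_scores, frames_per_section):
    """Average per-frame scores over consecutive groups of frames.

    The last group may be shorter than frames_per_section.
    """
    if frames_per_section <= 0:
        raise ValueError("frames_per_section must be positive")
    result = []
    for start in range(0, len(frame_scores), frames_per_section):
        chunk = frame_scores[start:start + frames_per_section]
        result.append(sum(chunk) / len(chunk))
    return result


smoothed = section_scores([1.0, 3.0, 2.0, 4.0, 5.0], frames_per_section=2)
```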

Next, the digest sections are determined from the audio information based on the calculated sound source type scores (step S105). For example, in step S105, time sections of the audio information with a higher sound source type score are determined as digest sections. Because the specific contents of step S105 differ by mode, they are described in more detail for each mode below in (3-2. Single sound source mode), (3-3. Multiple sound source mode), and (3-4. Diversity reflection mode). The digest section information for the determined digest sections is then output, and the series of processes ends. The processing in step S105 corresponds, for example, to the processing performed by the digest section determination unit 115 shown in FIG. 1.

The processing procedure of offline processing has been described above with reference to FIG. 4.

(3-2. Single sound source mode)
(3-2-1. Processing procedure of digest section determination processing)
In the single sound source mode, one sound source type is designated, and the sections in which the sound source type score for that designated type is higher are determined as digest sections.

The processing procedure of the digest section determination processing in the single sound source mode in offline processing is described with reference to FIGS. 5 and 6. FIGS. 5 and 6 are flowcharts showing an example of that processing procedure.

Referring to FIGS. 5 and 6, in the digest section determination processing in the single sound source mode in offline processing, the score threshold theoretical upper limit value is first set as the score threshold upper limit value (step S201). Next, the score threshold is set to a value lower than the score threshold upper limit value (step S203).

As described in detail later, the digest section determination processing first determines sections of the audio information with higher scores (high score sections) as digest sections (the high score section determination processing shown in step S205). It then adjusts the lengths and the number of the digest sections so that the total of their time lengths (digest section lengths) fits the digest length.

The score threshold is the threshold used in the high score section determination processing to decide whether each frame is included in a high score section (that is, in a digest section). The score threshold is changed as appropriate during the digest section determination processing, as in steps S213 and S219 described later, in order to adjust the total digest section length to the digest length. If the score threshold is changed to a higher value, fewer frames are included in the digest sections and the digest section length becomes shorter. Conversely, if the score threshold is changed to a lower value, more frames are included in the digest sections and the digest section length becomes longer.

The score threshold upper limit value defines the upper bound of the score threshold as it is changed. If the score threshold becomes too high, the number of frames included in the digest sections decreases, and the total digest section length may fall far short of the digest length. The score threshold upper limit value is set to prevent this (see the processing in step S217 described later).

The score threshold theoretical upper limit value is the theoretical maximum that the score can take, and is set according to, for example, the performance of the classifier used to calculate the score. As noted above, in step S201 the score threshold theoretical upper limit value is set as the initial value of the score threshold upper limit value.

After the processing in steps S201 and S203, processing is performed to determine sections of the audio information with higher scores (high score sections) as digest sections (the high score section determination processing) (step S205). A high score section is a section of the audio information in which the score stays high continuously. In the present embodiment, however, when a low-score section is extremely short, that section is also included in the high score section. This is because such a short low-score section, for example a breath taken in the middle of a person's utterance, can be regarded in terms of content as continuous with the sections before and after it.
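The idea of bridging very short low-score gaps can be sketched as follows. The representation of a section as a half-open range of frame indices, the `max_gap` parameter (the longest low-score run, in frames, that is still absorbed), and the use of "score at or above the threshold" as the test are assumptions of this sketch; the actual procedure of this embodiment is the one described with reference to the later figures.

```python
def high_score_sections(scores, threshold, max_gap=0):
    """Return half-open frame ranges [start, end) of high score sections.

    A run of at most max_gap low-score frames between two high-score frames
    is absorbed into the surrounding section.
    """
    sections = []
    start = None      # first frame of the section being built
    last_high = None  # most recent high-score frame seen
    for i, score in enumerate(scores):
        if score < threshold:
            continue
        if start is None:
            start = i
        elif i - last_high - 1 > max_gap:
            # The low-score gap was too long: close the section and start anew.
            sections.append((start, last_high + 1))
            start = i
        last_high = i
    if start is not None:
        sections.append((start, last_high + 1))
    return sections


scores = [0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1, 0.7]
bridged = high_score_sections(scores, threshold=0.5, max_gap=1)
strict = high_score_sections(scores, threshold=0.5, max_gap=0)
```

With `max_gap=1` the single low-score frame at index 3 is absorbed, while the three-frame gap before the last frame still splits the sections.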

In offline processing, the digest section determination processing regards the high score sections determined in step S205 as digest sections, and the subsequent processing adjusts the time lengths and the number of the digest sections so that the total digest section length matches the digest length. In that sense, the high score sections determined in the high score section determination processing are candidates for the digest sections that are finally determined.

The high score section determination processing is described in more detail later with reference to FIGS. 7 to 9.

Once the high score sections have been determined in step S205, they are regarded as digest sections, and the average score within each digest section (section average score) is calculated (step S207). The section average score may be included in the digest section information together with the start time, end time, and index of each high score section (that is, each digest section) determined in the high score section determination processing.
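For illustration, the section average score can be computed as below, assuming per-frame scores and sections given as half-open frame index ranges (both are conventions chosen for this sketch):

```python
def section_average_scores(scores, sections):
    """Mean per-frame score inside each half-open frame range [start, end)."""
    return [sum(scores[start:end]) / (end - start) for start, end in sections]


averages = section_average_scores([0.0, 1.0, 0.5, 0.0, 1.0], [(1, 3), (4, 5)])
```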

Next, it is determined whether the total digest section length is significantly shorter than the digest length (step S209). Specifically, step S209 determines whether the total digest section length falls below the allowable range of deviation set around the digest length. Because it is difficult to determine digest sections whose total length matches the digest length exactly, the present embodiment sets such an allowable range and judges whether the total digest section length is appropriate by whether it falls within that range. The allowable range may be set as appropriate, for example by the designer of the information processing apparatus 110, as a range of deviation within which a user listening to the digest does not find it unnatural that the actual digest length is longer or shorter than the set digest length.
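The check against the allowable range can be written as a three-way comparison. The sketch below assumes a symmetric tolerance in seconds around the digest length; the description does not state whether the range is symmetric, and the names are hypothetical.

```python
def classify_total(total_s, digest_s, tolerance_s):
    """Compare the total digest section length with the target digest length."""
    if total_s < digest_s - tolerance_s:
        return "too_short"  # below the allowable range
    if total_s > digest_s + tolerance_s:
        return "too_long"   # above the allowable range
    return "ok"             # within the allowable range
```

For a 60-second digest with a 5-second tolerance, totals from 55 to 65 seconds are accepted as they are.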

When step S209 determines that the total digest section length is significantly shorter than the digest length, the processing proceeds to steps S211 to S213, which lengthen the total digest section length.

Specifically, in step S211, the current score threshold is set as the score threshold upper limit value. A total digest section length significantly shorter than the digest length suggests that the current score threshold is too high relative to an appropriate value, so this setting ensures that when the score threshold is changed in later processing, it does not exceed the current score threshold.

Next, a value lower than the current score threshold is set as the new score threshold (step S213). The processing then returns to step S205, and the high score section determination processing is performed again with the new score threshold. Because the new, lower score threshold increases the number of frames included in the high score sections, the total digest section length becomes longer and can be brought closer to the digest length.

When step S209 determines that the total digest section length is not significantly shorter than the digest length, the processing proceeds to step S215. Step S215 determines the opposite condition, namely whether the total digest section length is significantly longer than the digest length.

When step S215 determines that the total digest section length is not significantly longer than the digest length, the digest section determination processing ends. That is, the current digest sections determined by the high score section determination processing are confirmed as the final digest sections. This is because, when step S209 finds that the total is not significantly shorter than the digest length and step S215 finds that it is not significantly longer, the total digest section length lies within the allowable range of the digest length.

一方、ステップＳ２１５でダイジェスト区間長の合計がダイジェスト長よりも大幅に長いと判断された場合には、ステップＳ２１７に進む。ステップＳ２１７以降の処理では、ダイジェスト区間長の合計をより短くするための処理が行われる。 On the other hand, when it is determined in step S215 that the total digest section length is significantly longer than the digest length, the process proceeds to step S217. In the processing after step S217, processing for further shortening the total digest section length is performed.

ステップＳ２１７では、スコア閾値がスコア閾値上限値よりも小さいかどうかが判断される。ステップＳ２１７でスコア閾値がスコア閾値上限値よりも小さいと判断された場合には、ステップＳ２１９に進む。ステップＳ２１９では、新たなスコア閾値として、現在のスコア閾値よりも高い値が設定される。そして、ステップＳ２０７に進み、新たなスコア閾値を用いて高スコア区間決定処理が再度行われる。より高い値に設定された新たなスコア閾値を用いて高スコア区間決定処理が行われることにより、高スコア区間に含まれるフレームの数が減るため、ダイジェスト区間長の合計が短くなり、ダイジェスト区間長の合計をよりダイジェスト長に近付けることができる。 In step S217, it is determined whether the score threshold is smaller than the score threshold upper limit value. If it is, the process proceeds to step S219, where a value higher than the current score threshold is set as the new score threshold. The process then proceeds to step S207, and the high score section determination process is performed again using the new score threshold. Because a higher score threshold reduces the number of frames included in the high score sections, the total digest section length becomes shorter and can be brought closer to the digest length.

ステップＳ２１７でスコア閾値がスコア閾値上限値よりも小さくないと判断された場合には、ステップＳ２２１に進む。この場合には、スコア閾値を現在の値以上に高くすることができないため、スコア閾値を変更することによりダイジェスト区間長の合計を短くすることはできない。従って、ステップＳ２２１以降の処理では、現在のダイジェスト区間の中からフレームを削除する、又は現在のダイジェスト区間の数を減らすことにより、ダイジェスト区間長の合計を短くする処理が行われる。 When it is determined in step S217 that the score threshold value is not smaller than the score threshold upper limit value, the process proceeds to step S221. In this case, since the score threshold cannot be set higher than the current value, the total digest section length cannot be shortened by changing the score threshold. Therefore, in the processing from step S221, processing is performed to shorten the total digest section length by deleting a frame from the current digest section or reducing the number of current digest sections.
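The branching in steps S209 to S221 can be summarized as a small decision function. The following Python sketch is illustrative only: the function name `next_threshold`, the symmetric tolerance `tol` standing in for the allowable range, and the fixed `step` size are assumptions that the description does not specify, and any check performed between steps S209 and S213 is not modeled.

```python
def next_threshold(total, target, tol, thr, thr_upper, step):
    """One pass of the length check (hypothetical helper, not from the source).

    total     -- current total digest section length
    target    -- requested digest length
    tol       -- assumed half-width of the allowable range around target
    thr       -- current score threshold
    thr_upper -- score threshold upper limit value
    step      -- assumed amount by which the threshold is moved
    Returns (action, threshold), where action is "retry", "done" or "shorten".
    """
    if total < target - tol:
        # S209 -> S213: digest is much too short, so lower the threshold.
        return "retry", thr - step
    if total <= target + tol:
        # S215: within the allowable range, so the sections are final.
        return "done", thr
    if thr < thr_upper:
        # S217 -> S219: digest is much too long, so raise the threshold.
        return "retry", min(thr + step, thr_upper)
    # S217 -> S221: the threshold cannot be raised; shorten sections instead.
    return "shorten", thr
```

A caller would rerun the high score section determination whenever the action is "retry", and hand over to the section shortening of step S221 onward when it is "shorten".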

具体的には、ステップＳ２２１では、各ダイジェスト区間について、ダイジェスト区間長の短縮が可能かどうかが判断される。ここで、ダイジェスト区間長の短縮が可能かどうかは、ダイジェスト区間長と連続区間最低長とを比較することによって行われる。連続区間最低長は、音声として出力した際に人が当該音声の意味を認識可能な最小区間として設定される。ダイジェスト区間長が連続最低長以下であると、ダイジェストを聴いた際に、当該ダイジェスト区間に対応する部分の意味を把握できないため、ダイジェストとして有意なものではなくなってしまう。従って、ステップＳ２２１に示す判断処理を行うことにより、ダイジェスト区間長が連続最低長よりも大きくなるようにダイジェスト区間が決定されるようにしているのである。 Specifically, in step S221, it is determined for each digest section whether its digest section length can be shortened. This determination is made by comparing the digest section length with the minimum continuous section length. The minimum continuous section length is set as the shortest section whose meaning a person can recognize when it is output as sound. If a digest section is no longer than the minimum continuous section length, a listener cannot grasp the meaning of the corresponding part of the digest, so the section is no longer meaningful as part of the digest. The determination in step S221 therefore ensures that the digest sections are determined so that each digest section length remains greater than the minimum continuous section length.

ステップＳ２２１でいずれかのダイジェスト区間においてダイジェスト区間長の短縮が可能と判断された場合には、ステップＳ２２３〜ステップＳ２２７に進む。ステップＳ２２３〜ステップＳ２２７では、現在のダイジェスト区間の中からフレームを削除することによりダイジェスト区間長の合計を短くする処理が行われる。 When it is determined in step S221 that the digest section length can be shortened in any of the digest sections, the process proceeds to steps S223 to S227. In steps S223 to S227, processing is performed to shorten the total digest section length by deleting the frame from the current digest section.

具体的には、ステップＳ２２３では、ダイジェスト区間長の短縮が可能と判断されたダイジェスト区間（すなわちダイジェスト区間長が連続最低長よりも長いダイジェスト区間）の中で、区間平均スコアがより低いダイジェスト区間のダイジェスト区間長が短縮される。ダイジェスト区間長を短縮する際には、例えば、短縮対象であるダイジェスト区間の先頭の所定の数のフレーム及び終端の所定の数のフレームのうち、スコアの平均値が低い方がダイジェスト区間から除外される。 Specifically, in step S223, among the digest sections determined to be shortenable (that is, digest sections whose length is greater than the minimum continuous section length), the digest section with the lower section average score is shortened. To shorten a digest section, for example, the mean score of a predetermined number of frames at its beginning is compared with the mean score of a predetermined number of frames at its end, and the group of frames with the lower mean score is excluded from the digest section.

次に、フレームが削除されダイジェスト区間長が短縮されたダイジェスト区間の区間平均スコアが更新される（ステップＳ２２５）。そして、ダイジェスト区間長の合計がダイジェスト長と略一致するかどうかが判断される（ステップＳ２２７）。ステップＳ２２７では、具体的には、ダイジェスト区間長の合計が、ダイジェスト長に設定されている許容範囲に含まれるかどうかが判断される。 Next, the section average score of the digest section in which the frame is deleted and the digest section length is shortened is updated (step S225). Then, it is determined whether or not the total digest section length substantially matches the digest length (step S227). In step S227, specifically, it is determined whether or not the total digest section length is included in the allowable range set for the digest length.

ステップＳ２２７でダイジェスト区間長の合計がダイジェスト長と略一致していると判断された場合には、ダイジェスト区間決定処理の一連の処理を終了する。つまり、現在のダイジェスト区間が、最終的なダイジェスト区間として確定される。 If it is determined in step S227 that the total digest section length is substantially equal to the digest length, the series of digest section determination processing ends. That is, the current digest section is confirmed as the final digest section.

一方、ステップＳ２２７でダイジェスト区間長の合計がダイジェスト長と略一致していないと判断された場合には、ステップＳ２２１に戻り、再度、各ダイジェスト区間について、ダイジェスト区間長の短縮が可能かどうかが判断される。 On the other hand, if it is determined in step S227 that the total digest section length does not substantially match the digest length, the process returns to step S221, and it is determined again for each digest section whether the digest section length can be shortened.

ステップＳ２２１でいずれのダイジェスト区間においてもダイジェスト区間長の短縮が不可能と判断された場合には、ステップＳ２２９〜ステップＳ２３１に進む。ステップＳ２２９〜ステップＳ２３１では、現在のダイジェスト区間の数を減らすことによりダイジェスト区間長の合計を短くする処理が行われる。 When it is determined in step S221 that the digest section length cannot be shortened in any of the digest sections, the process proceeds to steps S229 to S231. In steps S229 to S231, a process of shortening the total digest section length by reducing the number of current digest sections is performed.

具体的には、ステップＳ２２９では、現在のダイジェスト区間の中から、区間平均スコアのより低いダイジェスト区間が削除される。そして、ダイジェスト区間長の合計がダイジェスト長と略一致するかどうかが判断される（ステップＳ２３１）。ステップＳ２３１では、ステップＳ２２７と同様に、ダイジェスト区間長の合計が、ダイジェスト長に設定されている許容範囲に含まれるかどうかが判断される。 Specifically, in step S229, a digest section having a lower section average score is deleted from the current digest sections. Then, it is determined whether or not the total digest section length is substantially equal to the digest length (step S231). In step S231, similarly to step S227, it is determined whether or not the total digest section length is included in the allowable range set in the digest length.

ステップＳ２３１でダイジェスト区間長の合計がダイジェスト長と略一致していると判断された場合には、ダイジェスト区間決定処理の一連の処理を終了する。つまり、現在のダイジェスト区間が、最終的なダイジェスト区間として確定される。 If it is determined in step S231 that the total digest section length is substantially equal to the digest length, the series of digest section determination processing ends. That is, the current digest section is confirmed as the final digest section.
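As a rough illustration of steps S221 to S231, the Python sketch below trims or removes digest sections until the total length is no longer too long. It is a minimal sketch under stated assumptions, not the patented procedure itself: sections are represented as half-open frame ranges `(start, end)`, all lengths are counted in frames, the names `shorten_sections`, `tol` and `trim` are hypothetical, and the loop simply repeats until the total is at most `target + tol`, which simplifies the "substantially matches" checks of steps S227 and S231.

```python
from statistics import mean

def shorten_sections(sections, scores, target, tol, min_len, trim=1):
    """Shorten digest sections (hypothetical helper, lengths in frames)."""
    secs = [list(sec) for sec in sections]

    def avg(sec):
        # Section average score, recomputed on demand (stands in for S225).
        return mean(scores[sec[0]:sec[1]])

    while secs and sum(e - s for s, e in secs) > target + tol:
        # S221: a section can be shortened only while it is longer than
        # the minimum continuous section length.
        can_trim = [sec for sec in secs if sec[1] - sec[0] > min_len]
        if can_trim:
            # S223: shorten the section with the lowest average score.
            sec = min(can_trim, key=avg)
            s, e = sec
            n = min(trim, e - s - min_len)  # assumed: never go below min_len
            # Drop whichever edge (head or tail) has the lower mean score.
            if mean(scores[s:s + n]) <= mean(scores[e - n:e]):
                sec[0] += n
            else:
                sec[1] -= n
        else:
            # S229: no section can be trimmed, so remove the whole section
            # with the lowest average score.
            secs.remove(min(secs, key=avg))
    return [tuple(sec) for sec in secs]
```

Whether a section may be trimmed down to exactly the minimum continuous section length, and what happens if removing a section undershoots the allowable range, are not settled by the description; the sketch picks the simplest behavior in both cases.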

（３−２−２．高スコア区間決定処理） (3-2-2. High score section determination process)
ここで、図７−図９を参照して、詳細な説明を省略していたステップＳ２０５に示す、オフライン処理での高スコア区間決定処理について詳しく説明する。図７は、オフライン処理での高スコア区間決定処理について説明するための説明図である。図８及び図９は、オフライン処理での高スコア区間決定処理の処理手順の一例を示すフロー図である。 Here, the high score section determination processing in the offline processing shown in step S205, whose detailed description was omitted above, will be described in detail with reference to FIGS. 7 to 9. FIG. 7 is an explanatory diagram for explaining the high score section determination processing in the offline processing. FIG. 8 and FIG. 9 are flowcharts showing an example of the processing procedure of the high score section determination processing in the offline processing.

以下の高スコア区間決定処理についての説明では現在フレーム、現ダイジェスト区間、連続区間及び不連続区間という用語を用いる。高スコア区間決定処理の具体的な処理手順について説明する前に、図７を参照して、これらの用語が示す概念について説明する。 In the following description of the high score section determination process, the terms current frame, current digest section, continuous section and discontinuous section are used. Before describing the specific processing procedure of the high score section determination processing, the concept represented by these terms will be described with reference to FIG. 7.

図７では、横軸に音声情報の時間を取り、縦軸にフレームごとに算出されたスコアを取り、両者の関係性をプロットしている。高スコア区間決定処理では、フレームごとに、時系列に従って、当該フレームをダイジェスト区間に含めるかどうかの判断が行われる。図中、現在フレームは、現在判断処理の対象としているフレームを示している。 In FIG. 7, the horizontal axis represents the time of audio information, the vertical axis represents the score calculated for each frame, and the relationship between the two is plotted. In the high score section determination processing, it is determined for each frame in time series whether or not the frame is included in the digest section. In the figure, the current frame indicates the frame currently targeted for the determination process.

現ダイジェスト区間は、現在フレームを含めるかどうかを判断する対象としているダイジェスト区間を意味する。連続区間は、現ダイジェスト区間内でスコアがスコア閾値を連続的に超えている区間を意味している。不連続区間は、現ダイジェスト区間内で直前の連続区間の終了時刻から現在フレームまでの区間を意味している。現ダイジェスト区間、連続区間及び不連続区間の時間長さのことを、それぞれ、現ダイジェスト区間長、連続区間長及び不連続区間長とも呼称する。 The current digest section means a digest section which is a target for determining whether to include the current frame. The continuous section means a section in which the score continuously exceeds the score threshold in the current digest section. The discontinuous section means a section from the end time of the immediately preceding continuous section in the current digest section to the current frame. The time lengths of the current digest section, the continuous section, and the discontinuous section are also referred to as the current digest section length, the continuous section length, and the discontinuous section length, respectively.

図８及び図９を参照して、オフライン処理における高スコア区間決定処理の具体的な処理手順について説明する。図８及び図９を参照すると、オフライン処理における高スコア区間決定処理では、まず、フレームインデックスがゼロに設定される（ステップＳ３０１）。また、ダイジェスト区間インデックスがゼロに設定される（ステップＳ３０３）。フレームインデックスは、音声情報の各フレームに対して時系列順に付されるものであり、フレームインデックスがゼロのフレームは音声情報の先頭のフレームを指している。ステップＳ３０１及びステップＳ３０３に示す処理は、現在フレームをフレーム＃０とし、現ダイジェスト区間をダイジェスト区間＃０にする処理に対応している。 A specific processing procedure of the high score section determination processing in the offline processing will be described with reference to FIGS. 8 and 9. Referring to FIGS. 8 and 9, in the high score section determination processing in the offline processing, first, the frame index is set to zero (step S301). Further, the digest section index is set to zero (step S303). The frame index is attached to each frame of the audio information in chronological order, and the frame having a frame index of zero indicates the first frame of the audio information. The processing shown in steps S301 and S303 corresponds to the processing in which the current frame is set to frame # 0 and the current digest section is set to digest section # 0.

次に、現在フレームのスコアがスコア閾値よりも大きいかどうかが判断される（ステップＳ３０５）。ステップＳ３０５で現在フレームのスコアがスコア閾値以下と判断された場合には、現在フレームをダイジェスト区間には含めずに、ステップＳ３１９に進む。この場合には、現在フレームは不連続区間に追加されることになる。ステップＳ３１９における処理については後述する。 Next, it is determined whether the score of the current frame is larger than the score threshold (step S305). When the score of the current frame is determined to be equal to or lower than the score threshold value in step S305, the current frame is not included in the digest section, and the process proceeds to step S319. In this case, the current frame will be added to the discontinuous section. The process in step S319 will be described later.

一方、ステップＳ３０５で現在フレームのスコアがスコア閾値よりも大きいと判断された場合には、ステップＳ３０７に進む。ステップＳ３０７〜ステップＳ３１７では、現在フレームをダイジェスト区間に含めるための処理が行われる。 On the other hand, if it is determined in step S305 that the score of the current frame is higher than the score threshold value, the process proceeds to step S307. In steps S307 to S317, processing for including the current frame in the digest section is performed.

まず、ステップＳ３０７において、不連続区間長が不連続区間最大長よりも小さいかどうかが判断される。ここで、不連続区間最大長とは、不連続区間が、ダイジェスト区間に含めるべき有意な区間であるかどうかを判断する基準となる時間長さである。上述したように、不連続区間は、直前の連続区間の終了時刻から現在フレームまでの区間であるため、連続区間には含まれない、スコアが連続的に低い区間であると言える。従って、不連続区間は、ダイジェストに含める対象としている音源種別の音声がほぼ発せられていない沈黙の区間であると考えられるが、例えば不連続区間が極短い場合には、当該区間は、例えばある人物の一連の発言の最中の息継ぎ等、情報の内容の観点からは、前後の区間と一連の区間である可能性が高い。不連続区間最大長は、このような観点から、不連続区間に対応する沈黙の区間が、一連の音声中の極短い沈黙なのか、あるいは例えば話者の変更を伴うような長い沈黙なのかを判断するための時間長さとして設定され得る。 First, in step S307, it is determined whether the discontinuous section length is smaller than the maximum discontinuous section length. Here, the maximum discontinuous section length is a time length that serves as a criterion for judging whether a discontinuous section is a significant section that should be included in the digest section. As described above, a discontinuous section runs from the end time of the immediately preceding continuous section to the current frame, so it is a section that is not included in the continuous section and in which the score stays low. A discontinuous section is therefore considered to be a silent section in which almost no sound of the sound source type targeted for the digest is produced. However, when the discontinuous section is extremely short, for example a pause for breath in the middle of one person's remarks, it is highly likely that, in terms of content, it forms one continuous passage with the sections before and after it. From this viewpoint, the maximum discontinuous section length can be set as a time length for judging whether the silence corresponding to a discontinuous section is a very short silence within a continuous utterance or a long silence that accompanies, for example, a change of speaker.

ステップＳ３０７で不連続区間長が不連続区間最大長よりも小さいと判断された場合には、ステップＳ３０９に進む。この場合、上述したように、不連続区間はその直前の連続区間と一連の区間と考えられるべきである。よって、ステップＳ３０９では、現ダイジェスト区間に不連続区間及び現在フレームを接続する（すなわち、不連続区間及び現在フレームを現ダイジェスト区間の終端に加える）処理が行われる。このように、不連続期間が極短い場合に、当該不連続期間まで含むようにダイジェスト区間が決定されることにより、一連の音声が途切れることなくダイジェストに含まれることとなり、内容把握の観点からより有用なダイジェストを生成することが可能となる。なお、この際、フレームインデックスが１つ小さいフレーム（すなわち時系列的に１つ前のフレーム）に対してもステップＳ３０９に示す処理が行われた場合には、既に不連続区間は現ダイジェスト区間に含まれているため、現在フレームのみが現ダイジェスト区間に接続される。ステップＳ３０９に示す処理を終えると、ステップＳ３１９に進む。 When it is determined in step S307 that the discontinuous section length is smaller than the maximum discontinuous section length, the process proceeds to step S309. In this case, as described above, the discontinuous section should be regarded as continuous with the continuous section immediately before it. Therefore, in step S309, the discontinuous section and the current frame are connected to the current digest section (that is, they are appended to the end of the current digest section). When a discontinuous section is extremely short, determining the digest section so that it includes that discontinuous section keeps a continuous utterance in the digest without interruption, which makes the digest more useful for grasping the content. Note that if the process of step S309 was also performed on the frame whose frame index is one smaller (that is, the immediately preceding frame in time series), the discontinuous section is already included in the current digest section, so only the current frame is connected to the current digest section. When the process of step S309 ends, the process proceeds to step S319.

一方、ステップＳ３０７で不連続区間長が不連続区間最大長以上であると判断された場合には、ステップＳ３１１に進む。ステップＳ３１１では、不連続区間前の連続区間長が連続区間最低長以上であるかどうかが判断される。図６のステップＳ２２１に示す処理について説明する際に言及したように、連続区間最低長とは、音声として出力した際に人が当該音声の意味を認識可能な最小区間として設定される時間長さである。つまり、ステップＳ３１１に示す処理は、連続区間が有意な区間であるかどうかを時間長さの観点から判断する処理であると言える。 On the other hand, if it is determined in step S307 that the discontinuous section length is greater than or equal to the maximum discontinuous section length, the process proceeds to step S311. In step S311, it is determined whether the length of the continuous section before the discontinuous section is greater than or equal to the minimum continuous section length. As mentioned in the description of step S221 in FIG. 6, the minimum continuous section length is a time length set as the shortest section whose meaning a person can recognize when it is output as sound. In other words, the process of step S311 determines, from the viewpoint of time length, whether the continuous section is a significant section.

ステップＳ３１１で不連続区間前の連続区間長が連続区間最低長以上であると判断された場合には、ステップＳ３１３〜ステップＳ３１５に進む。この場合は、不連続区間が不連続区間最大長以上であり、かつ、連続区間が連続区間最低長以上である場合（すなわち、不連続区間が有意な区間でなく、かつ、不連続区間の前の連続区間が有意な区間である場合）であるため、不連続区間を破棄する（ダイジェスト区間に含めない）とともに、不連続区間の前の連続区間を採用する（ダイジェスト区間に含める）処理が行われる。 When it is determined in step S311 that the length of the continuous section before the discontinuous section is greater than or equal to the minimum continuous section length, the process proceeds to steps S313 to S315. In this case, the discontinuous section is at least as long as the maximum discontinuous section length and the continuous section is at least as long as the minimum continuous section length (that is, the discontinuous section is not a significant section, but the continuous section before it is). Therefore, the discontinuous section is discarded (not included in the digest section), and the continuous section before it is adopted (included in the digest section).

具体的には、ステップＳ３１３では、不連続区間前の連続区間が１つのダイジェスト区間として確定される。次いで、ステップＳ３１５では、ダイジェスト区間インデックスが１つ繰り上げられ（すなわち処理対象である現ダイジェスト区間が新たに設定され）、現在フレームがその新たな現ダイジェスト区間の開始時刻に設定される。ステップＳ３１５に示す処理を終えると、ステップＳ３１９に進む。 Specifically, in step S313, the continuous section before the discontinuous section is determined as one digest section. Next, in step S315, the digest interval index is incremented by 1 (that is, the current digest interval to be processed is newly set), and the current frame is set to the start time of the new current digest interval. When the process of step S315 ends, the process proceeds to step S319.

一方、ステップＳ３１１で不連続区間前の連続区間長が連続区間最低長よりも小さいと判断された場合には、ステップＳ３１７に進む。この場合は、不連続区間が不連続区間最大長以上であり、かつ、連続区間が連続区間最低長よりも小さい場合（すなわち、不連続区間が有意な区間でなく、かつ、不連続区間の前の連続区間も有意でない場合）であるため、不連続区間と、不連続区間の前の連続区間を、ともに破棄する（ダイジェスト区間に含めない）処理が行われる。このように、連続期間が人によって認識できないほど短い場合に、当該連続期間を含まないようにダイジェスト区間が決定されることにより、ダイジェストを聴いた際にユーザにとって耳障りとなるような、内容把握の意味の薄い区間をダイジェストから省くことができ、より品質の高いダイジェストを生成することが可能となる。 On the other hand, if it is determined in step S311 that the length of the continuous section before the discontinuous section is smaller than the minimum continuous section length, the process proceeds to step S317. In this case, the discontinuous section is at least as long as the maximum discontinuous section length and the continuous section is shorter than the minimum continuous section length (that is, neither the discontinuous section nor the continuous section before it is a significant section). Therefore, both the discontinuous section and the continuous section before it are discarded (not included in the digest section). When a continuous section is too short for a person to recognize, determining the digest section so that it excludes that section removes from the digest passages that contribute little to understanding the content and that would be jarring to the user, which makes it possible to generate a higher quality digest.

具体的には、ステップＳ３１７では、不連続区間前の連続区間が破棄され、現在フレームが現ダイジェスト区間の開始時刻に設定される。ステップＳ３１７に示す処理を終えると、ステップＳ３１９に進む。 Specifically, in step S317, the continuous section before the discontinuous section is discarded, and the current frame is set to the start time of the current digest section. When the process of step S317 ends, the process proceeds to step S319.

ステップＳ３１９では、音声情報が終端かどうかが判断される。ステップＳ３１９で音声情報が終端でないと判断された場合には、フレームインデックスが１つ繰り上げられ（すなわち処理対象であるフレームが１つ先のフレームに設定され）（ステップＳ３２１）、ステップＳ３０５以降の処理が繰り返し実行される。 In step S319, it is determined whether the end of the audio information has been reached. If the end has not been reached, the frame index is incremented by one (that is, the next frame becomes the frame to be processed) (step S321), and the processes from step S305 onward are repeated.

一方、ステップＳ３１９で音声情報が終端であると判断された場合には、ステップＳ３２３に進む。ステップＳ３２３では、現ダイジェスト区間長が連続区間最低長よりも大きいかどうかが判断される。つまり、ステップＳ３２３では、最後に処理対象であったダイジェスト区間が、時間長さの観点から有意な区間であるかどうか（すなわち音声の識別が可能な程度の時間長さを有しているかどうか）が判断される。 On the other hand, if it is determined in step S319 that the end of the audio information has been reached, the process proceeds to step S323. In step S323, it is determined whether the current digest section length is greater than the minimum continuous section length. That is, step S323 determines whether the digest section that was processed last is a significant section from the viewpoint of time length (that is, whether it is long enough for the sound to be identified).

ステップＳ３２３で現ダイジェスト区間長が連続区間最低長よりも大きいと判断された場合には、現ダイジェスト区間は時間長さ的に有意な区間であると考えられるため、当該ダイジェスト区間を採用し、一連の処理を終了する。一方、ステップＳ３２３で現ダイジェスト区間長が連続区間最低長以下であると判断された場合には、現ダイジェスト区間は時間長さ的に有意な区間でないと考えられるため、当該ダイジェスト区間を破棄し、一連の処理を終了する。 If it is determined in step S323 that the current digest section length is greater than the minimum continuous section length, the current digest section is considered to be significant in terms of time length, so the digest section is adopted and the series of processes ends. On the other hand, if it is determined in step S323 that the current digest section length is less than or equal to the minimum continuous section length, the current digest section is considered not to be significant in terms of time length, so the digest section is discarded and the series of processes ends.
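The frame loop of steps S301 to S323 can be sketched in Python as follows. This is a simplified illustration under several assumptions rather than the flow of FIGS. 8 and 9 itself: lengths are counted in frames, sections are returned as half-open ranges `(start, end)`, the "continuous section before the discontinuous section" is taken to be the whole current digest section up to its last above-threshold frame, and silence before the very first above-threshold frame is never bridged. The function and parameter names are hypothetical.

```python
def high_score_sections(scores, thr, max_gap, min_len):
    """Return digest sections as [start, end) frame ranges (illustrative)."""
    sections = []
    start = None  # first frame of the current digest section
    end = None    # one past its last above-threshold frame
    for i, score in enumerate(scores):
        if score <= thr:
            # S305: the frame is left out and extends the discontinuous section.
            continue
        if start is None:
            # Assumed handling of the first qualifying frame: open a section.
            start = i
        elif i - end >= max_gap:
            # S307: the gap is too long to bridge.
            if end - start >= min_len:
                # S311 -> S313: the preceding section is long enough; keep it.
                sections.append((start, end))
            # S315 / S317: the current frame starts a new current section
            # (in the S317 case the short preceding section is dropped).
            start = i
        # S309: extend the current section over any short gap and this frame.
        end = i + 1
    # S323: keep the last section only if it is longer than the minimum.
    if start is not None and end - start > min_len:
        sections.append((start, end))
    return sections
```

The comparison in step S311 is inclusive while the one in step S323 is strict, as in the description above; the sketch keeps that difference rather than choosing one convention.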

以上、オフライン処理における、単一音源モードでのダイジェスト区間決定処理の処理手順について説明した。 The processing procedure of the digest section determination processing in the single sound source mode in the offline processing has been described above.

（３−３．複数音源モード） (3-3. Multiple sound source mode)
（３−３−１．ダイジェスト区間決定処理の処理手順） (3-3-1. Processing procedure of digest section determination processing)
複数音源モードでは、指定された割合に基づいてダイジェストに含める音声の時間長さが音源種別ごとに設定され、音源種別ごとに音源種別スコアがより高い区間であって当該区間の合計長さが設定した音源種別ごとの時間長さ以下となるような区間が、ダイジェスト区間として決定される。 In the multiple sound source mode, the time length of the sound to be included in the digest is set for each sound source type based on a specified ratio. Then, for each sound source type, sections that have higher sound source type scores and whose total length does not exceed the time length set for that sound source type are determined as the digest sections.

図１０及び図１１を参照して、オフライン処理における、複数音源モードでのダイジェスト区間決定処理の処理手順について説明する。図１０及び図１１は、オフライン処理における、複数音源モードでのダイジェスト区間決定処理の処理手順の一例を示すフロー図である。 With reference to FIG. 10 and FIG. 11, the processing procedure of the digest section determination processing in the multiple sound source mode in the offline processing will be described. 10 and 11 are flowcharts showing an example of the processing procedure of the digest section determination processing in the multiple sound source mode in the offline processing.

なお、図１０及び図１１に示す複数音源モードでのダイジェスト区間決定処理は、図５−図９を参照して説明した単一音源モードでのダイジェスト区間決定処理における各処理が音源種別ごとに行われるものであり、各処理の内容自体は、単一音源モードでのダイジェスト区間決定処理と略同様であり得る。ただし、単一音源モードでのダイジェスト区間決定処理では、１つの音源種別しか対象にしていなかったため、上述したステップＳ２０９及びステップＳ２１５において、その音源種別に係るスコアに基づいて決定されたダイジェスト区間長の合計値がダイジェスト長と比較されていたが、複数音源モードでのダイジェスト区間決定処理では、各音源種別に係るスコアに基づいて決定されたダイジェスト区間長の合計値が、ダイジェストに含める各音源種別の音声の時間長さ（以下、種別ダイジェスト長とも呼称する。）と比較される。 In the digest section determination process in the multiple sound source mode shown in FIGS. 10 and 11, each process of the digest section determination process in the single sound source mode described with reference to FIGS. 5 to 9 is performed for each sound source type, and the content of each process can be substantially the same as in the single sound source mode. However, because the single sound source mode targets only one sound source type, steps S209 and S215 described above compare the total digest section length determined from the score for that sound source type with the digest length. In the multiple sound source mode, by contrast, the total digest section length determined from the score for each sound source type is compared with the time length of the sound of that sound source type to be included in the digest (hereinafter also referred to as the type digest length).

以下の複数音源モードでのダイジェスト区間決定処理の処理手順についての説明では、単一音源モードでのダイジェスト区間決定処理の処理手順と相違する事項について主に説明し、重複する事項についてはその詳細な説明を省略する。 In the following description of the processing procedure of the digest section determination processing in the multiple sound source mode, matters that are different from the processing procedure of the digest section determination processing in the single sound source mode will be mainly described, and detailed descriptions of overlapping matters will be given. The description is omitted.

図１０及び図１１を参照すると、オフライン処理における複数音源モードでのダイジェスト区間決定処理では、まず、スコア閾値上限値としてスコア閾値理論上限値が設定される（ステップＳ４０１）。次いで、スコア閾値上限値よりも低い値としてスコア閾値が設定される（ステップＳ４０３）。これらの処理は、図５及び図６に示すステップＳ２０１及びステップＳ２０３における処理と同様である。 Referring to FIGS. 10 and 11, in the digest section determination process in the multiple sound source mode in the offline process, first, the score threshold theoretical upper limit value is set as the score threshold upper limit value (step S401). Next, the score threshold is set as a value lower than the score threshold upper limit value (step S403). These processes are similar to the processes in steps S201 and S203 shown in FIGS. 5 and 6.

次に、種別ダイジェスト長が設定される（ステップＳ４０５）。例えば、種別ダイジェスト長は、モード情報に基づいて設定され得る。例えば、モード情報には、ダイジェストに含める音源種別の割合を指定する旨の情報が含まれている。ステップＳ４０５に示す処理では、ダイジェスト長に当該割合を乗じることにより、音源種別ごとにその種別ダイジェスト長が算出される。 Next, the type digest length is set (step S405). For example, the type digest length may be set based on the mode information. For example, the mode information includes information for designating the ratio of the sound source types included in the digest. In the process shown in step S405, the type digest length is calculated for each sound source type by multiplying the digest length by the ratio.

ただし、ステップＳ４０５に示す処理はかかる例に限定されず、ダイジェストに含める音源種別の割合は、モード情報として外部から入力されるのではなく、情報処理装置１１０によって自動的に設定されてもよい。例えば、何らかの機会に図８及び図９に示す高スコア区間決定処理が各音源種別に対して既に１度実行されており、各種別音源に対して、高スコア区間が決定されている場合であれば、当該高スコア区間についての情報を用いて、上記割合が決定され、種別ダイジェスト長が決定されてもよい。 However, the process shown in step S405 is not limited to this example, and the ratio of the sound source types to be included in the digest may be set automatically by the information processing apparatus 110 instead of being input from the outside as mode information. For example, if the high score section determination process shown in FIGS. 8 and 9 has already been executed once for each sound source type on some occasion, so that high score sections have been determined for each sound source type, the ratio and the type digest lengths may be determined using information about those high score sections.

具体的には、高スコア区間決定処理の結果から、音源種別ごとに、決定された高スコア区間の時間長さの総和が算出され、その比率が計算される。そして、計算された比率をダイジェスト長に乗じることにより、音源種別ごとにその種別ダイジェスト長が算出され得る。このように高スコア区間の時間長さに基づいて決定される割合は、音声情報内における音源種別ごとの音声の出現確率が反映されたものであり得る。 Specifically, from the result of the high score section determination processing, the sum of the time lengths of the determined high score sections is calculated for each sound source type, and the ratio among these sums is calculated. The type digest length for each sound source type can then be calculated by multiplying the digest length by the calculated ratio. A ratio determined in this way from the time lengths of the high score sections can reflect how frequently the sound of each sound source type appears in the audio information.
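The calculation of the type digest lengths in step S405 amounts to normalizing per-type weights and multiplying by the digest length. The following Python sketch shows that arithmetic; the function name `type_digest_lengths` and the dictionary representation are assumptions made for illustration. The weights may be either the ratios given in the mode information or the per-type sums of high score section lengths.

```python
def type_digest_lengths(digest_len, weights):
    """Split digest_len across sound source types in proportion to weights.

    weights maps a sound source type name to a non-negative number, for
    example a user-specified ratio or the total high score section length
    for that type (hypothetical representation).
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: digest_len * w / total for name, w in weights.items()}
```

For instance, with a 60-second digest and high score section totals of 30 seconds of speech and 10 seconds of music, the sketch allots 45 seconds to speech and 15 seconds to music.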

なお、モード情報に基づく場合、及び高スコア区間に基づく場合ともに、算出された種別ダイジェスト長が連続区間最低長を下回る場合には、その長さを調整する処理が適宜行われる。種別ダイジェスト長が連続区間最低長を下回る場合には、当該種別ダイジェスト長が短過ぎ、その音声が、人によって有意に認識されないからである。具体的には、連続区間最低長を下回る種別ダイジェスト長を連続区間最低長まで増加させるとともに、他の連続区間最低長を上回る種別ダイジェスト長からその増加分を減じる処理が行われる。 Whether the type digest length is calculated from the mode information or from the high score sections, if a calculated type digest length is shorter than the minimum continuous section length, processing for adjusting that length is performed as appropriate. This is because a type digest length shorter than the minimum continuous section length is too short for a person to recognize the sound meaningfully. Specifically, a type digest length that is shorter than the minimum continuous section length is increased to the minimum continuous section length, and the amount of that increase is subtracted from the other type digest lengths that exceed the minimum continuous section length.
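The adjustment described above can be sketched as follows. The description does not say how the subtracted amount is divided among the longer type digest lengths, so this Python sketch assumes it is shared in proportion to each type's surplus over the minimum continuous section length, which guarantees that no type is pushed below the minimum. The function name `enforce_minimum` and the error raised when the digest is too short are likewise assumptions.

```python
def enforce_minimum(lengths, min_len):
    """Raise short type digest lengths to min_len and take the increase
    from the types that exceed min_len (hypothetical helper)."""
    deficit = sum(min_len - v for v in lengths.values() if v < min_len)
    if deficit == 0:
        return dict(lengths)  # nothing to adjust
    surplus = sum(v - min_len for v in lengths.values() if v > min_len)
    if deficit > surplus:
        raise ValueError("digest length too short to give every type min_len")
    adjusted = {}
    for name, v in lengths.items():
        if v < min_len:
            adjusted[name] = min_len
        else:
            # Assumed rule: reduce in proportion to the surplus over min_len.
            adjusted[name] = v - deficit * (v - min_len) / surplus
    return adjusted
```

The sum of the type digest lengths is unchanged by the adjustment, so the overall digest length is preserved.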

種別ダイジェスト長が決定されると、次に、音声情報の中でより高いスコアを有する区間（高スコア区間）をダイジェスト区間として決定する処理（高スコア区間決定処理）が行われる（ステップＳ４０７）。ステップＳ４０７に示す処理は、図５及び図６に示すステップＳ２０５における処理、すなわち、図８及び図９に示す一連の処理と同様であるため、その詳細な説明を省略する。 When the type digest length is determined, a process of determining a section having a higher score (high score section) in the voice information as a digest section (high score section determination processing) is performed (step S407). The process shown in step S407 is the same as the process in step S205 shown in FIGS. 5 and 6, that is, the series of processes shown in FIGS. 8 and 9, and thus detailed description thereof will be omitted.

以降、ステップＳ４０９〜ステップＳ４３３に示す処理は、音源種別ごとに実行される点を除けば、図５及び図６に示すステップＳ２０７〜ステップＳ２３１における処理と同様の処理であるため、その詳細な説明を省略する。ステップＳ４１１〜ステップＳ４２１に示す処理は、図５及び図６に示すステップＳ２０９〜ステップＳ２１９における処理に対応する。ステップＳ４１１〜ステップＳ４２１に示す処理では、音源種別ごとに、ダイジェスト区間長の合計が種別ダイジェスト長と大幅に異なっていないかが判断され、スコア閾値が調整されることにより、ダイジェスト区間長の合計が種別ダイジェスト長の許容範囲に含まれるように、各ダイジェスト区間長が調整される。 After that, the processing shown in steps S409 to S433 is the same as the processing in steps S207 to S231 shown in FIGS. 5 and 6 except that it is executed for each sound source type, and therefore detailed description thereof will be given. Is omitted. The processing shown in steps S411 to S421 corresponds to the processing in steps S209 to S219 shown in FIGS. 5 and 6. In the processing shown in steps S411 to S421, it is determined for each sound source type whether the total digest section length is significantly different from the type digest length, and the score threshold is adjusted, so that the total digest section length is the type. Each digest section length is adjusted so that it is included in the allowable range of the digest length.

ステップＳ４２３〜ステップＳ４３３に示す処理は、図５及び図６に示すステップＳ２２１〜ステップＳ２３１における処理に対応する。ステップＳ４２３〜ステップＳ４３３に示す処理は、スコア閾値の調整がそれ以上できなくなった場合に行われる処理であり、ステップＳ４２３以降の処理では、現在のダイジェスト区間の中からフレームを削除する、又は現在のダイジェスト区間の数を減らすことにより、ダイジェスト区間長の合計を短くする処理が行われる。ただし、図５及び図６に示すステップＳ２２１〜ステップＳ２３１における処理では、フレーム又は区間数の削除対象となるダイジェスト区間は単一の音源種別に係るものであったが、ステップＳ４２３〜ステップＳ４３３に示す処理では、フレーム又は区間数の削除対象となるダイジェスト区間は、複数の音源種別に係るダイジェスト区間が混合されたものである。 The processing shown in steps S423 to S433 corresponds to the processing in steps S221 to S231 shown in FIGS. 5 and 6. It is performed when the score threshold can no longer be adjusted: from step S423 onward, the total digest section length is shortened by deleting frames from the current digest sections or by reducing the number of current digest sections. However, whereas in steps S221 to S231 shown in FIGS. 5 and 6 the digest sections from which frames or sections are removed relate to a single sound source type, in steps S423 to S433 they are a mixture of digest sections relating to a plurality of sound source types.

以上、図１０及び図１１を参照して、オフライン処理における、複数音源モードでのダイジェスト区間決定処理の処理手順について説明した。 The processing procedure of the digest section determination processing in the multiple sound source mode in the offline processing has been described above with reference to FIGS. 10 and 11.

（３−４．多様性反映モード）
多様性反映モードでは、同一の音源種別に分類される音声の中から多様な音声が含まれるようにダイジェストが生成される。具体的には、多様性反映モードでは、同一の音源種別内での音声の特徴量のばらつき及び同一の音源種別内での音声の時間的ばらつきがより大きくなるように、ダイジェスト区間が決定される。 (3-4. Diversity reflection mode)
In the diversity reflection mode, a digest is generated so that a variety of sounds is included from among the sounds classified into the same sound source type. Specifically, in the diversity reflection mode, the digest sections are determined so that both the variation in the feature amounts of the sounds within the same sound source type and the temporal variation of the sounds within the same sound source type become larger.

（３−４−１．機能構成）
ここで、上述した単一音源モード及び複数音源モードにおける各処理は、図１に示す情報処理装置１１０の機能構成によって実行され得る。ただし、多様性反映モードにおける各処理は、図１に示す情報処理装置１１０とは若干異なる機能構成によって実行され得る。 (3-4-1. Functional configuration)
Here, each process in the single sound source mode and the multiple sound source mode described above can be executed by the functional configuration of the information processing apparatus 110 shown in FIG. 1. However, each process in the diversity reflection mode may be executed by a functional configuration slightly different from that of the information processing apparatus 110 shown in FIG. 1.

図１２を参照して、多様性反映モードにおける各処理を実行する情報処理装置の機能構成について説明する。図１２は、多様性反映モードにおける各処理を実行する情報処理装置の機能構成の一例を示す機能ブロック図である。 The functional configuration of the information processing apparatus that executes each process in the diversity reflection mode will be described with reference to FIG. 12. FIG. 12 is a functional block diagram showing an example of the functional configuration of the information processing apparatus that executes each process in the diversity reflection mode.

図１２を参照すると、多様性反映モードに対応する情報処理装置１２０は、その機能として、特徴量抽出部１１１と、音源種別スコア算出部１１３と、ダイジェスト区間決定部１１５と、を有する。ここで、特徴量抽出部１１１、音源種別スコア算出部１１３及びダイジェスト区間決定部１１５の機能は、図１に示す情報処理装置１１０におけるこれらの機能ブロックの機能と同様であるため、その詳細な説明は省略する。 Referring to FIG. 12, the information processing apparatus 120 corresponding to the diversity reflection mode has, as its functions, a feature amount extraction unit 111, a sound source type score calculation unit 113, and a digest section determination unit 115. The functions of the feature amount extraction unit 111, the sound source type score calculation unit 113, and the digest section determination unit 115 are the same as those of the corresponding functional blocks in the information processing apparatus 110 shown in FIG. 1, and a detailed description thereof is therefore omitted.

情報処理装置１２０では、情報処理装置１１０と異なり、特徴量抽出部１１１によって算出された音声情報の特徴量についての情報が、ダイジェスト区間決定部１１５にも提供される。ダイジェスト区間決定部１１５は、当該特徴量についての情報を用いて、多様性を考慮してダイジェスト区間を決定することができる（後述する図１４のステップＳ５３１に示す処理を参照）。 In the information processing apparatus 120, unlike the information processing apparatus 110, information about the feature amount of the audio information calculated by the feature amount extraction unit 111 is also provided to the digest section determination unit 115. Using this feature amount information, the digest section determination unit 115 can determine the digest sections in consideration of diversity (see the processing shown in step S531 of FIG. 14, described later).

（３−４−２．ダイジェスト区間決定処理の処理手順）
図１３及び図１４を参照して、図１２に示す情報処理装置１２０によって実行され得る、オフライン処理における、多様性反映モードでのダイジェスト区間決定処理の処理手順について説明する。図１３及び図１４は、オフライン処理における、多様性反映モードでのダイジェスト区間決定処理の処理手順の一例を示すフロー図である。 (3-4-2. Processing procedure of digest section determination processing)
With reference to FIGS. 13 and 14, a processing procedure of the digest section determination processing in the diversity reflection mode in the offline processing, which can be executed by the information processing apparatus 120 illustrated in FIG. 12, will be described. 13 and 14 are flowcharts showing an example of the processing procedure of the digest section determination processing in the diversity reflection mode in the offline processing.

なお、多様性反映モードは、同一音源種別内での多様性を考慮してダイジェスト区間を決定するものであるため、ダイジェストに含める対象とする音源種別は、単一の音源種別であってもよいし、複数の音源種別であってもよい。図１３及び図１４では、一例として、ダイジェストに複数の音源種別からなる音声を含める場合における処理手順を図示している。 Since the diversity reflection mode determines the digest sections in consideration of the diversity within the same sound source type, the sound source types to be included in the digest may be either a single sound source type or a plurality of sound source types. FIGS. 13 and 14 show, as an example, the processing procedure for the case where the digest includes sounds of a plurality of sound source types.

ここで、多様性反映モードでのダイジェスト区間決定処理における各処理は、後述するステップＳ５３１に示す処理を除き、図１０及び図１１を参照して説明した複数音源モードでのダイジェスト区間決定処理における各処理と同様である。従って、以下の多様性反映モードでのダイジェスト区間決定処理における各処理についての説明では、複数音源モードでのダイジェスト区間決定処理における各処理と相違する事項について主に説明し、重複する事項についてはその詳細な説明を省略する。なお、ダイジェストに単一の音源種別からなる音声を含める場合における多様性反映モードでのダイジェスト区間決定処理の処理手順は、図５及び図６に示す単一音源モードでのダイジェスト区間決定処理の処理手順において、ステップＳ２２９に示す処理の代わりに後述するステップＳ５３１に示す処理が行われるものに対応する。 Here, each process in the digest section determination processing in the diversity reflection mode is the same as the corresponding process in the digest section determination processing in the multiple sound source mode described with reference to FIGS. 10 and 11, except for the process shown in step S531 described later. Accordingly, the following description of the digest section determination processing in the diversity reflection mode mainly covers the points that differ from the multiple sound source mode, and a detailed description of the overlapping points is omitted. Note that, when the digest includes sounds of a single sound source type, the processing procedure of the digest section determination processing in the diversity reflection mode corresponds to the processing procedure in the single sound source mode shown in FIGS. 5 and 6 with the process shown in step S229 replaced by the process shown in step S531 described later.

図１３及び図１４を参照すると、多様性反映モードでのダイジェスト区間決定処理において、ステップＳ５０１〜ステップＳ５２１における処理は、図１０及び図１１に示すステップＳ４０１〜ステップＳ４２１における処理と同様の処理である。またステップＳ５２３以降の処理も、複数音源モードでのダイジェスト区間決定処理と同様に、スコア閾値の調整がそれ以上できなくなった場合に行われる処理である。ステップＳ５２３以降の処理では、現在のダイジェスト区間の中からフレームを削除する、又は現在のダイジェスト区間の数を減らすことにより、ダイジェスト区間長の合計を短くする処理が行われる。 Referring to FIGS. 13 and 14, in the digest section determination process in the diversity reflection mode, the processes in steps S501 to S521 are the same as the processes in steps S401 to S421 shown in FIGS. 10 and 11. . In addition, the process from step S523 is also a process performed when the score threshold cannot be adjusted any more, like the digest section determination process in the multiple sound source mode. In the processing from step S523, processing is performed to shorten the total digest section length by deleting a frame from the current digest section or reducing the number of current digest sections.

ここで、多様性反映モードにおいて、ステップＳ５２３で各ダイジェスト区間についてダイジェスト区間長の短縮が可能であると判断された場合に、より区間平均スコアが低いダイジェスト区間からフレームを削除することによりダイジェスト区間長の合計を短くする一連の処理（ステップＳ５２５〜ステップＳ５２９に示す処理）は、複数音源モードにおけるこれらの処理（ステップＳ４２５〜ステップＳ４２９に示す処理）と同様である。 Here, in the diversity reflection mode, when it is determined in step S523 that the digest section length can be shortened for the digest sections, the series of processes that shortens the total digest section length by deleting frames from digest sections having lower section average scores (the processes shown in steps S525 to S529) is the same as the corresponding processes in the multiple sound source mode (the processes shown in steps S425 to S429).

一方、多様性反映モードにおいては、ステップＳ５２３でいずれのダイジェスト区間においてもダイジェスト区間長の短縮が不可能と判断された場合に、ダイジェスト区間の数が減じられる処理の詳細が、複数音源モードとは異なる。具体的には、複数音源モードでは、区間平均スコアの低いダイジェスト区間が削除されていた（図１１のステップＳ４３１に示す処理を参照）。一方、多様性反映モードでは、多様性に基づいてダイジェスト区間を削除する処理（多様性に基づくダイジェスト区間削除処理）が行われる（ステップＳ５３１）。ダイジェスト区間が削除された後に、ダイジェスト区間長の合計がダイジェスト長と略一致するかどうかが判断され（ステップＳ５３３）、ダイジェスト区間長の合計がダイジェスト長と略一致するまで、ステップＳ５３１に示す多様性に基づくダイジェスト区間削除処理が実行される。 On the other hand, in the diversity reflection mode, the details of the process that reduces the number of digest sections when it is determined in step S523 that the digest section length cannot be shortened in any digest section differ from those in the multiple sound source mode. Specifically, in the multiple sound source mode, a digest section having a low section average score is deleted (see the processing shown in step S431 of FIG. 11). In the diversity reflection mode, by contrast, a process of deleting a digest section based on diversity (diversity-based digest section deletion processing) is performed (step S531). After a digest section is deleted, it is determined whether the total digest section length substantially matches the digest length (step S533), and the diversity-based digest section deletion processing shown in step S531 is repeated until the total digest section length substantially matches the digest length.

（３−４−３．多様性に基づくダイジェスト区間削除処理）
図１５を参照して、図１４のステップＳ５３１に示す多様性に基づくダイジェスト区間削除処理について詳しく説明する。図１５は、オフライン処理における、多様性に基づくダイジェスト区間削除処理の処理手順の一例を示すフロー図である。 (3-4-3. Digest section deletion processing based on diversity)
With reference to FIG. 15, the digest section deletion processing based on diversity shown in step S531 of FIG. 14 will be described in detail. FIG. 15 is a flowchart showing an example of a processing procedure of digest section deletion processing based on diversity in the offline processing.

図１５を参照すると、オフライン処理における多様性に基づくダイジェスト区間削除処理では、まず、各ダイジェスト区間の特徴量ベクトルの平均（平均特徴量ベクトル）が算出される（ステップＳ６０１）。 Referring to FIG. 15, in the diversity-based digest section deletion processing in the offline processing, first, the average of the feature quantity vectors of each digest section (average feature quantity vector) is calculated (step S601).

次に、全ダイジェスト区間の場合と、任意の１つのダイジェスト区間を除いた場合の、ｎ通りの特徴量空間における平均特徴量ベクトルの分散が計算される（ステップＳ６０３）。 Next, the variance of the average feature quantity vectors in the feature quantity space is calculated for each of n cases, namely the case of all digest sections and the cases in which any one digest section is excluded (step S603).

次に、各ダイジェスト区間の平均時刻が算出される（ステップＳ６０５）。平均時刻は、例えば、各ダイジェスト区間の開始時刻と終了時刻との中間の時刻として計算される。 Next, the average time of each digest section is calculated (step S605). The average time is calculated, for example, as an intermediate time between the start time and the end time of each digest section.

次に、全ダイジェスト区間の場合と、任意の１つのダイジェスト区間を除いた場合の、ｎ通りの各ダイジェスト区間の平均時刻の分散が計算される（ステップＳ６０７）。 Next, the variance of the average times of the digest sections is calculated for each of the same n cases, namely the case of all digest sections and the cases in which any one digest section is excluded (step S607).

次に、平均特徴量ベクトルの分散及び平均時刻の分散に重み付けを行った上でその総和が計算され、全ダイジェスト区間の場合の値からの低減量が最も少ない場合に除外されたダイジェスト区間が、削除するダイジェスト区間として決定される（ステップＳ６０９）。つまり、ステップＳ６０９に示す処理では、平均特徴量ベクトル及び平均時刻の分散の計算に用いられなかった場合に最も影響の少ない平均特徴量ベクトル及び平均時刻を有するダイジェスト区間が、削除するダイジェスト区間として決定される。これにより、平均特徴量ベクトル及び平均時刻の分散がより大きくなるように、ダイジェストに含めるダイジェスト区間が選択されることとなる。最後に、決定されたダイジェスト区間が削除される（ステップＳ６１１）。 Next, the variance of the average feature quantity vectors and the variance of the average times are weighted and summed, and the digest section whose exclusion yields the smallest reduction from the value obtained with all digest sections is determined as the digest section to be deleted (step S609). In other words, in the processing shown in step S609, the digest section whose average feature quantity vector and average time have the least effect on the variances when left out of the calculation is determined as the digest section to be deleted. As a result, the digest sections to be included in the digest are selected so that the variances of the average feature quantity vectors and the average times become larger. Finally, the determined digest section is deleted (step S611).
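The selection in steps S603 to S609 can be illustrated with a minimal Python sketch. It rests on several assumptions that the description does not specify: the variance of the average feature quantity vectors is taken as the sum of the per-dimension variances, the two weights `w_feat` and `w_time` are free parameters, and the function name `select_section_to_delete` is hypothetical.

```python
import numpy as np

def select_section_to_delete(mean_features, mean_times, w_feat=1.0, w_time=1.0):
    """Return the index of the digest section whose removal reduces the
    weighted diversity measure the least (illustrating steps S603 to S609)."""
    feats = np.asarray(mean_features, dtype=float)  # shape (n_sections, dim)
    times = np.asarray(mean_times, dtype=float)     # shape (n_sections,)

    def diversity(f, t):
        # Assumption: feature variance is summed over the feature dimensions,
        # then combined with the variance of the average times by weights.
        return w_feat * f.var(axis=0).sum() + w_time * t.var()

    baseline = diversity(feats, times)  # value with all digest sections
    best_index, smallest_reduction = None, None
    for i in range(len(times)):
        keep = np.arange(len(times)) != i  # leave section i out
        reduction = baseline - diversity(feats[keep], times[keep])
        if smallest_reduction is None or reduction < smallest_reduction:
            best_index, smallest_reduction = i, reduction
    return best_index

def section_mean_time(start, end):
    # Average time of a section taken as the midpoint of start and end (S605).
    return (start + end) / 2.0
```

With three sections whose average features are 0, 1 and 10 and whose average times are 0, 10 and 100, the sketch picks the middle section, since removing it leaves the two most widely separated sections in the digest.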

以上、図１３及び図１４を参照して、オフライン処理における、多様性反映モードでのダイジェスト区間決定処理の処理手順について説明した。また、図１５を参照して、ステップＳ５３１に示す多様性に基づくダイジェスト区間削除処理について説明した。 The processing procedure of the digest section determination processing in the diversity reflection mode in the offline processing has been described above with reference to FIGS. 13 and 14. Also, with reference to FIG. 15, the digest section deletion processing based on diversity shown in step S531 has been described.

以上説明したように、多様性反映モードでは、同一の音源種別に分類される音声について特徴量ベクトル及び時刻の多様性が確保されるように、ダイジェスト区間が決定される。特徴量ベクトルの多様性が確保されることにより、同一の音源種別に分類されてはいるが実際には別人の声が存在する場合に、これらの声をともにダイジェストに含めることが可能となる。また、時刻の多様性が確保されることにより、同一の音源種別に分類されている音声が時間的に離れた場所で発言をしている場合に、これらの声をともにダイジェストに含めることが可能となる。 As described above, in the diversity reflection mode, the digest sections are determined so that diversity in both the feature quantity vectors and the times is secured for the sounds classified into the same sound source type. Securing diversity in the feature quantity vectors makes it possible, when voices of different people are classified into the same sound source type, to include all of these voices in the digest. Securing diversity in time makes it possible, when sounds classified into the same sound source type occur at positions far apart in time, to include all of them in the digest.

（４．オンライン処理の詳細）
（４−１．全体の処理手順）
図１６を参照して、オンライン処理の処理手順について説明する。図１６は、オンライン処理の処理手順の一例を示すフロー図である。図１６に示す処理手順は、オンライン処理時における、図１に示す情報処理装置１１０によって実行される情報処理方法全体の処理手順に対応している。 (4. Details of online processing)
(4-1. Overall processing procedure)
The processing procedure of the online processing will be described with reference to FIG. 16. FIG. 16 is a flowchart showing an example of the processing procedure of the online processing. The processing procedure shown in FIG. 16 corresponds to the processing procedure of the entire information processing method executed by the information processing apparatus 110 shown in FIG. 1 during online processing.

オンライン処理では、音声情報のフレームが新たに入力される度に、その新たに入力されたフレーム（入力フレーム）のスコアが算出され、当該スコアに基づいて音声情報の中からダイジェスト区間が決定される。つまり、オンライン処理では、音声情報が入力されている間、図１６に示す一連の処理が、フレームが新たに入力される度に実行され、ダイジェスト区間情報が更新される。 In the online processing, each time a frame of audio information is newly input, the score of the newly input frame (input frame) is calculated, and the digest sections are determined from the audio information based on that score. That is, in the online processing, while the audio information is being input, the series of processes shown in FIG. 16 is executed every time a frame is newly input, and the digest section information is updated.

なお、スコア算出区間がフレーム区間ではなく、複数のフレーム区間からなる場合には、図１６に示す一連の処理は、スコア算出区間に対応する複数のフレームが入力される度に実行され得る。 If the score calculation section is not a frame section but a plurality of frame sections, the series of processes shown in FIG. 16 may be executed every time a plurality of frames corresponding to the score calculation section are input.

図１６を参照すると、オンライン処理では、まず、これまでに取得されている音声情報の特徴量が抽出される（ステップＳ７０１）。ステップＳ７０１に示す処理では、音声情報の特徴量として、例えばパワーやスペクトル包絡形状等、音声情報の特性を示す各種の物理量が算出される。ステップＳ７０１に示す処理は、例えば図１に示す特徴量抽出部１１１によって行われる処理に対応している。 Referring to FIG. 16, in the online processing, first, the feature amounts of the audio information acquired so far are extracted (step S701). In the processing shown in step S701, various physical quantities indicating characteristics of the audio information, such as power and spectral envelope shape, are calculated as the feature amounts of the audio information. The processing shown in step S701 corresponds to, for example, the processing performed by the feature amount extraction unit 111 shown in FIG. 1.

次に、抽出された特徴量に基づいて、入力フレームの音源種別スコアが算出される（ステップＳ７０３）。ステップＳ７０３に示す処理では、例えば、音声情報の特徴量に応じて音声の音源種別を識別する識別器によって、入力フレームにおける当該音声の音源種別の蓋然性を示す音源種別スコアが算出される。この際、音声スコア、声スコア、ノイズスコア等、複数の種類の音源種別スコアが算出されてよい。ステップＳ７０３に示す処理は、例えば図１に示す音源種別スコア算出部１１３によって行われる処理に対応している。 Next, the sound source type score of the input frame is calculated based on the extracted feature amount (step S703). In the process shown in step S703, the sound source type score indicating the probability of the sound source type of the sound in the input frame is calculated by, for example, the classifier that identifies the sound source type of the sound according to the feature amount of the sound information. At this time, a plurality of sound source type scores such as a voice score, a voice score, and a noise score may be calculated. The process shown in step S703 corresponds to the process performed by the sound source type score calculation unit 113 shown in FIG. 1, for example.

なお、スコア算出区間がフレーム区間ではなく、複数のフレーム区間からなる場合には、ステップＳ７０３において、各フレームの音源種別スコアを平滑化してスコア算出区間としての音源種別スコアを算出する処理が行われてもよい。 If the score calculation section is not a single frame section but consists of a plurality of frame sections, a process of smoothing the sound source type scores of the individual frames to calculate the sound source type score of the score calculation section may be performed in step S703.

次に、算出された音源種別スコアに基づいて、音声情報の中からダイジェスト区間が決定される（ステップＳ７０５）。ステップＳ７０５に示す処理は、例えば図１に示すダイジェスト区間決定部１１５によって行われる処理に対応している。 Next, the digest sections are determined from the audio information based on the calculated sound source type score (step S705). The processing shown in step S705 corresponds to, for example, the processing performed by the digest section determination unit 115 shown in FIG. 1.

ステップＳ７０５に示す処理では、これまでに取得された音声情報の時間長さがダイジェスト長（ダイジェストの時間長さの設定値）よりも短い場合には、入力フレームが無条件でダイジェストに追加される。一方、これまでに取得された音声情報の時間長さがダイジェスト長以上である場合には、入力フレームがダイジェストに追加されるとともに、その代わりに、ダイジェストの中から例えばよりスコアの低いフレームが削除される。 In the processing shown in step S705, if the time length of the audio information acquired so far is shorter than the digest length (the set value of the time length of the digest), the input frame is unconditionally added to the digest. On the other hand, if the time length of the audio information acquired so far is equal to or longer than the digest length, the input frame is added to the digest and, in exchange, a frame with, for example, a lower score is deleted from the digest.

なお、ステップＳ７０５における具体的な処理内容はモードに応じて異なるため、その詳細な処理内容については、下記（４−２．単一音源モード）、（４−３．複数音源モード）及び（４−４．多様性反映モード）においてモードごとにより詳細に説明する。 Since the specific processing in step S705 differs depending on the mode, the details are described for each mode in (4-2. Single sound source mode), (4-3. Multiple sound source mode), and (4-4. Diversity reflection mode) below.

次に、音声情報の入力が終了したかどうかが判断される（ステップＳ７０７）。ステップＳ７０７で音声情報の入力が終了したと判断された場合には、決定されたダイジェスト区間についてのダイジェスト区間情報を出力して、一連の処理が終了する。一方、ステップＳ７０７で音声情報の入力が終了していないと判断された場合には、次のフレームの入力を待機し（ステップＳ７０９）、新たに入力されたフレームに対して、ステップＳ７０１以降の処理が繰り返し実行される。 Next, it is determined whether the input of voice information is completed (step S707). When it is determined in step S707 that the input of the voice information is completed, the digest section information about the determined digest section is output, and the series of processes is completed. On the other hand, if it is determined in step S707 that the input of voice information is not completed, the input of the next frame is awaited (step S709), and the processing of step S701 and subsequent steps is performed on the newly input frame. Is repeatedly executed.
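The per-frame loop of FIG. 16 (steps S701 to S709) can be sketched as follows. This is only an illustration: the function name `run_online` and the three callables passed to it are hypothetical placeholders for the feature amount extraction, the sound source type score calculation, and the mode-dependent digest update, none of which are specified at this level in the description.

```python
def run_online(frames, extract_features, score_frame, update_digest):
    """Process frames one at a time and return the resulting digest.

    frames           -- any iterable that yields frames as they arrive
    extract_features -- hypothetical callable for step S701
    score_frame      -- hypothetical callable for step S703
    update_digest    -- hypothetical callable for step S705; it mutates the
                        digest list in place according to the selected mode
    """
    digest = []
    for index, frame in enumerate(frames):   # S709: wait for the next frame
        features = extract_features(frame)   # S701: feature extraction
        score = score_frame(features)        # S703: sound source type score
        update_digest(digest, index, score)  # S705: digest section decision
    return digest                            # S707: input finished, output
```

Because `frames` may be a generator, the loop naturally blocks until the next frame is available, which mirrors the wait in step S709.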

以上、図１６を参照して、オンライン処理の処理手順について説明した。 The processing procedure of the online processing has been described above with reference to FIG. 16.

（４−２．単一音源モード）
（４−２−１．ダイジェスト区間決定処理）
図１７を参照して、オンライン処理における、単一音源モードでのダイジェスト区間決定処理の処理手順について説明する。図１７は、オンライン処理における、単一音源モードでのダイジェスト区間決定処理の処理手順の一例を示すフロー図である。 (4-2. Single sound source mode)
(4-2-1. Digest section determination process)
With reference to FIG. 17, the processing procedure of the digest section determination processing in the single sound source mode in the online processing will be described. FIG. 17 is a flowchart showing an example of the processing procedure of the digest section determination processing in the single sound source mode in the online processing.

図１７を参照すると、オンライン処理における単一音源モードでのダイジェスト区間決定処理では、まず、現在のダイジェスト長が、ダイジェスト長よりも短いかどうかが判断される（ステップＳ８０１）。ステップＳ８０１で、現在のダイジェスト長がダイジェスト長よりも短いと判断された場合には、入力フレームがダイジェストに追加されるとともに、ダイジェスト全体としての平均スコア（ダイジェスト平均スコア）が更新される（ステップＳ８０３）。そして、ダイジェスト区間決定処理を終了し、次の入力フレームを待つ。 Referring to FIG. 17, in the digest section determination processing in the single sound source mode in the online processing, it is first determined whether the current digest length is shorter than the set digest length (step S801). If it is determined in step S801 that the current digest length is shorter than the set digest length, the input frame is added to the digest and the average score of the digest as a whole (digest average score) is updated (step S803). The digest section determination processing then ends, and the next input frame is awaited.

ステップＳ８０１及びステップＳ８０３に示す処理は、これまでに入力された音声情報の時間長さがダイジェスト長に満たない場合には、入力フレームを無条件でダイジェストに追加する処理に対応している。 The processes shown in steps S801 and S803 correspond to the process of unconditionally adding the input frame to the digest when the time length of the voice information input so far is less than the digest length.

ステップＳ８０１で、現在のダイジェスト長がダイジェスト長以上であると判断された場合には、ステップＳ８０５に進む。ステップＳ８０５では、入力フレームのスコアがダイジェスト平均スコア以上であるかどうかが判断される。ステップＳ８０５で入力フレームのスコアがダイジェスト平均スコアよりも小さいと判断された場合には、当該入力フレームをダイジェストに追加することなく、ダイジェスト区間決定処理を終了する。つまり、スコアのより低いフレームはダイジェストに含まれないようにする。 If it is determined in step S801 that the current digest length is equal to or longer than the set digest length, the process proceeds to step S805. In step S805, it is determined whether the score of the input frame is equal to or higher than the digest average score. If it is determined in step S805 that the score of the input frame is lower than the digest average score, the digest section determination processing ends without adding the input frame to the digest. In other words, frames with lower scores are kept out of the digest.

一方、ステップＳ８０５で入力フレームのスコアがダイジェスト平均スコア以上であると判断された場合には、入力フレームがダイジェストに追加され、ダイジェスト平均スコアが更新される（ステップＳ８０７）。ただし、この場合には、入力フレームをダイジェストに追加したことにより、現在のダイジェスト長が、１フレームに対応する時間長さ分、ダイジェスト長を超過してしまっている。従って、ステップＳ８０７に示す処理に次いで、ダイジェストの中からフレームを削除する処理（フレーム削除処理）が行われる（ステップＳ８０９）。フレーム削除処理では、例えばダイジェストの中から、よりスコアの低いフレームが削除される。なお、ステップＳ８０９に示すフレーム削除処理の詳細については、図１８を参照して後述する。 On the other hand, if it is determined in step S805 that the score of the input frame is equal to or higher than the digest average score, the input frame is added to the digest and the digest average score is updated (step S807). In this case, however, adding the input frame causes the current digest length to exceed the set digest length by the time length corresponding to one frame. Therefore, following the processing shown in step S807, a process of deleting a frame from the digest (frame deletion processing) is performed (step S809). In the frame deletion processing, for example, a frame with a lower score is deleted from the digest. The details of the frame deletion processing shown in step S809 will be described later with reference to FIG. 18.

フレームが削除されると、ダイジェスト平均スコアが更新され（ステップＳ８１１）、ダイジェスト区間決定処理を終了する。 When the frame is deleted, the digest average score is updated (step S811), and the digest section determination process ends.
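The flow of FIG. 17 (steps S801 to S811) can be sketched in Python as below. The sketch makes simplifying assumptions: the digest is a list of `(frame_index, score)` tuples, the digest length is counted in frames (`max_frames`, assumed to be at least 1), and the frame deletion of step S809 is replaced by a stand-in that simply removes the lowest-scoring frame. The function name `update_digest` is hypothetical.

```python
def update_digest(digest, frame_index, score, max_frames):
    """Update the digest with one input frame and return the digest average score.

    digest is a list of (frame_index, score) tuples kept in arrival order
    and is modified in place.
    """
    def average(entries):
        return sum(s for _, s in entries) / len(entries)

    # S801/S803: while the digest is shorter than the set length,
    # the input frame is added unconditionally.
    if len(digest) < max_frames:
        digest.append((frame_index, score))
        return average(digest)

    # S805: a frame scoring below the digest average is not added.
    digest_average = average(digest)
    if score < digest_average:
        return digest_average

    # S807: add the frame; the digest is now one frame too long.
    digest.append((frame_index, score))

    # S809 (simplified stand-in): delete the lowest-scoring frame.
    lowest = min(range(len(digest)), key=lambda i: digest[i][1])
    del digest[lowest]

    # S811: the digest average score after the deletion.
    return average(digest)
```

For example, with `max_frames=3` and input scores 0.2, 0.8, 0.5, 0.1, 0.9, the fourth frame is rejected because it is below the average, and the fifth frame replaces the first one.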

（４−２−２．フレーム削除処理）
ここで、図１８を参照して、図１７のステップＳ８０９に示すフレーム削除処理の詳細について説明する。図１８は、オンライン処理における、単一音源モードでのフレーム削除処理の処理手順の一例を示すフロー図である。 (4-2-2. Frame deletion processing)
Here, with reference to FIG. 18, details of the frame deletion processing shown in step S809 of FIG. 17 will be described. FIG. 18 is a flowchart showing an example of a processing procedure of frame deletion processing in the single sound source mode in the online processing.

図１８を参照すると、オンライン処理における単一音源モードでのフレーム削除処理では、まず、スコア閾値として、ダイジェスト平均スコアが設定される（ステップＳ９０１）。そして、設定されたスコア閾値を用いて、ダイジェストの中でより高いスコアを有する区間（高スコア区間）をダイジェスト区間として決定する処理（高スコア区間決定処理）が行われる（ステップＳ９０３）。 Referring to FIG. 18, in the frame deletion process in the single sound source mode in the online process, first, the digest average score is set as the score threshold (step S901). Then, using the set score threshold value, a process of determining a section having a higher score (high score section) in the digest as a digest section (high score section determination processing) is performed (step S903).

ステップＳ９０３に示す高スコア区間決定処理では、図５のステップＳ２０５に示すオフライン処理での高スコア区間決定処理と略同様の処理が行われるが、一部の処理はオフライン処理のそれとは相違する。具体的には、オフライン処理では、音声情報全体を対象にして、当該音声情報の中でダイジェスト区間を決定するために高スコア区間決定処理が行われる。一方、図１７を参照して説明したように、オンライン処理では、これまでに取得された音声情報の時間長さがダイジェスト長に至るまでの間は、無条件に入力フレームがダイジェストに追加されるため、高スコア区間決定処理を行う前に、既に、いわば仮のダイジェストが生成されている。オンライン処理では、入力フレームが追加され現在のダイジェスト長が１フレーム分だけダイジェスト長の設定値よりも長くなっている場合に、そのダイジェストの中からよりスコアの低い区間を見付けて削除するフレームを決定するために、高スコア区間決定処理が行われるのである。つまり、オンライン処理では、ダイジェストを対象として高スコア区間決定処理が行われる。 In the high score section determination processing shown in step S903, substantially the same processing as the high score section determination processing in the offline processing shown in step S205 of FIG. 5 is performed, but some processing is different from that of the offline processing. Specifically, in the offline processing, a high score section determination process is performed on the entire voice information to determine a digest section in the voice information. On the other hand, as described with reference to FIG. 17, in the online processing, the input frame is unconditionally added to the digest until the time length of the voice information acquired so far reaches the digest length. Therefore, a so-called temporary digest has already been generated before performing the high score section determination process. In online processing, if an input frame is added and the current digest length is one frame longer than the digest length setting value, a section with a lower score is found from the digests and the frame to be deleted is determined. In order to do so, the high score section determination process is performed. That is, in the online processing, the high score section determination processing is performed for the digest.

また、上記の事情から、オフライン処理では、音声情報の中で高スコア区間として決定されなかった区間は、当然ダイジェスト区間として採用されない。一方、オンライン処理では、ダイジェストの中で高スコア区間として決定されなかった区間が存在した場合であっても、ダイジェストから削除される区間は１フレーム分の区間であるため、その高スコア区間として決定されなかった区間全てをダイジェストから削除することはできない。つまり、オンライン処理では、高スコア区間決定処理の結果高スコア区間として決定されなかった区間が、ダイジェスト内に残存し得る。以下の説明では、このような高スコア区間として決定されなかった区間のことを削除対象区間と呼称する。削除対象区間の中から、例えば最もスコアの低いフレームが、削除されるフレームとして選択されることになる。このように、削除対象区間は、現在はダイジェスト内に存在するが、随時音声情報が入力され、ダイジェストが更新されるにつれていずれ削除されるべき区間であるとも言える。 For the same reason, in the offline processing, a section of the audio information that is not determined as a high score section is naturally not adopted as a digest section. In the online processing, on the other hand, even if the digest contains sections that are not determined as high score sections, only one frame is deleted from the digest at a time, so not all of those sections can be deleted from the digest. That is, in the online processing, sections that were not determined as high score sections by the high score section determination processing may remain in the digest. In the following description, such a section that was not determined as a high score section is referred to as a deletion target section. From the deletion target sections, for example, the frame with the lowest score is selected as the frame to be deleted. A deletion target section can thus be regarded as a section that currently exists in the digest but should eventually be deleted as further audio information is input and the digest is updated.

また、オンライン処理では、上記のように、ダイジェストに入力フレームが追加されるとともに、いずれかのフレームが削除されていくこととなるため、ダイジェスト内の各フレームにおけるスコアを時系列順に並べた際に、スコアが不連続になる点が存在し得る。上述したオフライン処理での高スコア区間決定処理では、音声情報全体が処理対象であり、このようなスコアの不連続点は考慮する必要がなかったが、オンライン処理での高スコア区間決定処理では、当該不連続点に対処するための追加的な処理が必要となる。 In addition, in the online processing, as described above, input frames are added to the digest while other frames are deleted, so when the scores of the frames in the digest are arranged in chronological order, there may be points at which the scores become discontinuous. In the high score section determination processing in the offline processing described above, the entire audio information is the processing target and such score discontinuities did not need to be considered, but the high score section determination processing in the online processing requires additional processing to deal with these discontinuities.

なお、ステップＳ９０３に示すオンライン処理における高スコア区間決定処理のより詳細な処理内容については、図１９−図２２を参照して後程改めて説明する。 Note that the more detailed processing contents of the high score section determination processing in the online processing shown in step S903 will be described later again with reference to FIGS. 19 to 22.

ステップＳ９０３において高スコア区間が決定されると、高スコア区間決定処理の結果、高スコア区間として決定されなかった削除対象区間が存在するかどうかが判断される（ステップＳ９０５）。ステップＳ９０５において削除対象区間が存在すると判断された場合には、その削除対象区間からスコアのより低いフレームが１つ選択される（ステップＳ９０７）。そして、選択されたそのフレームがダイジェストから削除される（ステップＳ９１１）。 When the high score sections are determined in step S903, it is determined whether there is a deletion target section, that is, a section that was not determined as a high score section by the high score section determination processing (step S905). If it is determined in step S905 that a deletion target section exists, one frame with a lower score is selected from the deletion target section (step S907). The selected frame is then deleted from the digest (step S911).

一方、ステップＳ９０５において削除対象区間が存在しないと判断された場合には、ダイジェストからスコアのより低いフレームが１つ選択される（ステップＳ９０９）。そして、選択されたそのフレームがダイジェストから削除される（ステップＳ９１１）。 On the other hand, if it is determined in step S905 that no deletion target section exists, one frame with a lower score is selected from the digest as a whole (step S909). The selected frame is then deleted from the digest (step S911).
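The branch structure of FIG. 18 (steps S901 to S911) can be sketched as follows. The high score section determination of step S903 is passed in as a callable, and `frames_above` is only a hypothetical stand-in that ignores the continuous and discontinuous section rules. The sketch also assumes that "a frame with a lower score" means the lowest-scoring candidate, which the description gives only as an example. The names `delete_one_frame` and `frames_above` are illustrative.

```python
def delete_one_frame(digest, find_high_score_frames):
    """Remove exactly one frame from digest and return it.

    digest is a non-empty list of (frame_index, score) tuples.
    find_high_score_frames(digest, threshold) must return the set of list
    positions that belong to high score sections.
    """
    # S901: the digest average score is used as the score threshold.
    threshold = sum(s for _, s in digest) / len(digest)
    # S903: determine which positions belong to high score sections.
    high = find_high_score_frames(digest, threshold)
    # Positions outside every high score section form the deletion target sections.
    candidates = [i for i in range(len(digest)) if i not in high]
    if not candidates:
        # S905 "no" branch, S909: fall back to the whole digest.
        candidates = list(range(len(digest)))
    # S907/S909: pick the lowest-scoring candidate (assumption, see above).
    victim = min(candidates, key=lambda i: digest[i][1])
    # S911: delete the selected frame from the digest.
    return digest.pop(victim)

def frames_above(digest, threshold):
    # Hypothetical stand-in for S903: every frame above the threshold
    # counts as part of a high score section.
    return {i for i, (_, s) in enumerate(digest) if s > threshold}
```

Passing the section determination as a parameter keeps the deletion logic independent of how high score sections are actually found.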

（４−２−３．高スコア区間決定処理）
ここで、図１９−図２２を参照して、詳細な説明を省略していた図１８のステップＳ９０３に示す、オンライン処理での高スコア区間決定処理について詳しく説明する。図１９は、オンライン処理での高スコア区間決定処理について説明するための説明図である。図２０−図２２は、オンライン処理での高スコア区間決定処理の処理手順の一例を示すフロー図である。 (4-2-3. High score section determination processing)
Here, with reference to FIGS. 19 to 22, the high score section determination processing in the online processing, which is shown in step S903 in FIG. 18 and whose detailed description is omitted, will be described in detail. FIG. 19 is an explanatory diagram for explaining the high score section determination processing in the online processing. 20 to 22 are flowcharts showing an example of the processing procedure of the high score section determination processing in the online processing.

図１９では、横軸に音声情報の時間を取り、縦軸にフレームごとに算出されたスコアを取り、両者の関係性をプロットしている。高スコア区間決定処理では、フレームごとに、時系列に従って、当該フレームをダイジェスト区間に含めるかどうかの判断が行われる。現在フレーム、現ダイジェスト区間、連続区間及び不連続区間の意味は、図７に示すオフライン処理での高スコア区間決定処理と同様である。 In FIG. 19, the horizontal axis represents the time of audio information, the vertical axis represents the score calculated for each frame, and the relationship between the two is plotted. In the high score section determination processing, it is determined for each frame in time series whether or not the frame is included in the digest section. The meanings of the current frame, the current digest section, the continuous section, and the discontinuous section are the same as those in the high score section determination processing in the offline processing shown in FIG. 7.

ただし、上述したように、オンライン処理では、オフライン処理とは異なり、その処理対象がダイジェストである。従って、図示するように、ダイジェスト内からフレームが削除されることにより、ダイジェスト内の各フレームにおけるスコアを時系列順に並べた際にスコアが不連続になる点（不連続点）が存在し得る。また、これも上述したように、高スコア区間決定処理が行われた結果、高スコア区間（すなわちダイジェスト区間）としては決定されなかったがダイジェスト内に存在する区間である削除対象区間がダイジェスト内に存在し得る。 However, as described above, the online processing differs from the offline processing in that its processing target is the digest. Therefore, as illustrated, because frames are deleted from the digest, there may be points at which the scores become discontinuous (discontinuity points) when the scores of the frames in the digest are arranged in chronological order. Also as described above, the digest may contain deletion target sections, that is, sections that exist in the digest but were not determined as high score sections (that is, digest sections) by the high score section determination processing.

図２０−図２２を参照して、オンライン処理における高スコア区間決定処理の具体的な処理手順について説明する。なお、図２０−図２２に示すオンライン処理における高スコア区間決定処理の処理手順は、処理対象が音声情報全体ではなくダイジェストであることと、後述するステップＳ１１１９〜ステップＳ１１２３に示す処理が追加されたことを除けば、図８及び図９を参照して説明したオフライン処理における高スコア区間決定処理の処理手順と略同様である。従って、以下のオンライン処理における高スコア区間決定処理の処理手順についての説明では、オフライン処理における高スコア区間決定処理の処理手順と重複する事項についてはその詳細な説明を省略し、相違する事項について主に説明する。 A specific processing procedure of the high score section determination processing in the online processing will be described with reference to FIGS. 20 to 22. The processing procedure of the high score section determination processing in the online processing shown in FIGS. 20 to 22 is substantially the same as that of the high score section determination processing in the offline processing described with reference to FIGS. 8 and 9, except that the processing target is the digest rather than the entire audio information and that the processing shown in steps S1119 to S1123 described later is added. Accordingly, in the following description, a detailed description of the points that overlap with the offline processing is omitted, and the differences are mainly described.

図２０−図２２を参照すると、オンライン処理における高スコア区間決定処理では、まず、フレームインデックスがゼロに設定され（ステップＳ１１０１）、ダイジェスト区間インデックスがゼロに設定される（ステップＳ１１０３）。これらの処理は、図８及び図９に示すステップＳ３０１及びステップＳ３０３に示す処理と同様である。 Referring to FIGS. 20 to 22, in the high score section determination processing in the online processing, first, the frame index is set to zero (step S1101), and the digest section index is set to zero (step S1103). These processes are the same as the processes in steps S301 and S303 shown in FIGS. 8 and 9.

以降のステップＳ１１０５〜ステップＳ１１１７に示す処理は、図８及び図９に示すステップＳ３０５〜ステップＳ３１７に示す処理と同様である。具体的には、ステップＳ１１０５において、現在フレームのスコアがスコア閾値よりも大きいかどうかが判断される。現在フレームのスコアがスコア閾値以下と判断された場合には、現在フレームをダイジェスト区間には含めずに、ステップＳ１１１９に進む。一方、現在フレームのスコアがスコア閾値よりも大きいと判断された場合には、ステップＳ１１０７〜ステップＳ１１１７に進み、現在フレームをダイジェスト区間に含めるための処理が行われる。 The subsequent processing in steps S1105 to S1117 is the same as the processing in steps S305 to S317 shown in FIGS. 8 and 9. Specifically, in step S1105, it is determined whether the score of the current frame is larger than the score threshold. If the score of the current frame is determined to be less than or equal to the score threshold, the current frame is not included in a digest section, and the process proceeds to step S1119. On the other hand, if the score of the current frame is determined to be larger than the score threshold, the process proceeds to steps S1107 to S1117, where processing for including the current frame in a digest section is performed.

ステップＳ１１０７〜ステップＳ１１１７では、不連続区間長が不連続区間最大長よりも小さい場合には、現ダイジェスト区間に不連続区間及び現在フレームが接続される（ステップＳ１１０９）。また、不連続区間長が不連続区間最大長以上であり、かつ不連続区間前の連続区間が連続区間最低長以上である場合には、不連続区間前の連続区間を１つのダイジェスト区間として確定するとともに、ダイジェスト区間インデックスが１つ繰り上げられ、現在フレームがその新たな現ダイジェスト区間の開始時刻に設定される（ステップＳ１１１３、Ｓ１１１５）。また、不連続区間長が不連続区間最大長以上であり、かつ不連続区間前の連続区間が連続区間最低長よりも小さい場合には、不連続区間前の連続区間が破棄され（すなわち削除対象区間とされ）、現在フレームが現ダイジェスト区間の開始時刻に設定される（ステップＳ１１１７）。ステップＳ１１０９、ステップＳ１１１５及びステップＳ１１１７のいずれかの処理が終了すると、ステップＳ１１１９に進む。 In steps S1107 to S1117, when the discontinuous section length is smaller than the maximum discontinuous section length, the discontinuous section and the current frame are connected to the current digest section (step S1109). When the length of the discontinuous section is greater than or equal to the maximum length of the discontinuous section and the continuous section before the discontinuous section is greater than or equal to the minimum length of the continuous section, the continuous section before the discontinuous section is determined as one digest section. At the same time, the digest section index is incremented by 1, and the current frame is set to the start time of the new current digest section (steps S1113 and S1115). If the length of the discontinuous section is greater than or equal to the maximum length of the discontinuous section and the continuous section before the discontinuous section is smaller than the minimum length of the continuous section, the continuous section before the discontinuous section is discarded (that is, to be deleted). The current frame is set to the start time of the current digest section (step S1117). When any one of steps S1109, S1115, and S1117 ends, the process proceeds to step S1119.

ステップＳ１１１９では、現在フレームが不連続点かどうかが判断される。ステップＳ１１１９で現在フレームが不連続点でないと判断された場合には、特段の処理は行われず、ステップＳ１１２５に進む。 In step S1119, it is determined whether the current frame is a discontinuous point. If it is determined in step S1119 that the current frame is not a discontinuous point, no special process is performed and the process proceeds to step S1125.

一方、ステップＳ１１１９で現在フレームが不連続点であると判断された場合には、ステップＳ１１２３に進む。ステップＳ１１２３では、現ダイジェスト区間長が連続区間最低長よりも大きいかどうかが判断される。つまり、ステップＳ１１２３では、不連続点直前のダイジェスト区間が、時間長さの観点から有意な区間であるかどうか（すなわち音声の識別が可能な程度の時間長さを有しているかどうか）が判断される。 On the other hand, if it is determined in step S1119 that the current frame is a discontinuity point, the process proceeds to step S1123. In step S1123, it is determined whether the current digest section length is larger than the minimum continuous section length. In other words, step S1123 determines whether the digest section immediately before the discontinuity point is significant in terms of time length (that is, whether it is long enough for the sound to be identified).

ステップＳ１１２３で現ダイジェスト区間長が連続区間最低長よりも大きいと判断された場合には、現ダイジェスト区間は時間長さ的に有意な区間であると考えられるため、当該ダイジェスト区間を採用し、ステップＳ１１２５に進む。一方、ステップＳ１１２３で現ダイジェスト区間長が連続区間最低長以下であると判断された場合には、現ダイジェスト区間は時間長さ的に有意な区間でないと考えられるため、当該ダイジェスト区間を破棄し（すなわち削除対象区間とし）、ステップＳ１１２５に進む。 If it is determined in step S1123 that the current digest section length is larger than the minimum continuous section length, the current digest section is considered significant in terms of time length, so the digest section is adopted and the process proceeds to step S1125. On the other hand, if it is determined in step S1123 that the current digest section length is less than or equal to the minimum continuous section length, the current digest section is considered not significant in terms of time length, so the digest section is discarded (that is, treated as a deletion target section) and the process proceeds to step S1125.

以降のステップＳ１１２５〜ステップＳ１１３１に示す処理は、図８及び図９に示すステップＳ３１９〜ステップＳ３２５に示す処理と同様である。具体的には、ステップＳ１１２５では、音声情報が終端かどうかが判断される。ステップＳ１１２５で音声情報が終端でないと判断された場合には、フレームインデックスが１つ繰り上げられ（すなわち処理対象であるフレームが１つ先のフレームに設定され）（ステップＳ１１２７）、ステップＳ１１０５以降の処理が繰り返し実行される。 The subsequent processing in steps S1125 to S1131 is the same as the processing in steps S319 to S325 shown in FIGS. 8 and 9. Specifically, in step S1125, it is determined whether the end of the audio information has been reached. If it is determined in step S1125 that the end has not been reached, the frame index is incremented by one (that is, the next frame becomes the frame to be processed) (step S1127), and the processing from step S1105 onward is repeated.

一方、ステップＳ１１２５で音声情報が終端であると判断された場合には、ステップＳ１１２１に進み、現ダイジェスト区間長が連続区間最低長よりも大きいかどうか、すなわち最後に処理対象であったダイジェスト区間が、時間長さの観点から有意な区間であるかどうかが判断される。 On the other hand, if it is determined in step S1125 that the end of the audio information has been reached, the process proceeds to step S1121, where it is determined whether the current digest section length is larger than the minimum continuous section length, that is, whether the digest section processed last is significant in terms of time length.

ステップＳ１１２１で現ダイジェスト区間長が連続区間最低長よりも大きいと判断された場合には、現ダイジェスト区間は時間長さ的に有意な区間であると考えられるため、当該ダイジェスト区間を採用し、一連の処理を終了する。一方、ステップＳ１１２１で現ダイジェスト区間長が連続区間最低長以下であると判断された場合には、現ダイジェスト区間は時間長さ的に有意な区間でないと考えられるため、当該ダイジェスト区間を破棄し（すなわち削除対象区間とし）、一連の処理を終了する。 If it is determined in step S1121 that the current digest section length is larger than the minimum continuous section length, the current digest section is considered significant in terms of time length, so the digest section is adopted and the series of processes ends. On the other hand, if it is determined in step S1121 that the current digest section length is less than or equal to the minimum continuous section length, the current digest section is considered not significant in terms of time length, so the digest section is discarded (that is, treated as a deletion target section) and the series of processes ends.
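The high score section determination described above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment. The function name `high_score_sections`, the per-frame flag list `breaks_after` (true when a discontinuity point follows the frame), and the measurement of all lengths in frames are assumptions made for illustration only.

```python
def high_score_sections(scores, breaks_after, threshold, max_gap, min_len):
    """Split digest frames into adopted and discarded sections.

    scores[i]       -- score of frame i (frames in chronological order)
    breaks_after[i] -- hypothetical flag: a discontinuity point follows frame i
    Returns (kept, dropped), each a list of inclusive (start, end) index pairs.
    """
    kept, dropped = [], []
    start = end = None  # current candidate digest section

    def close(strict):
        # Adopt or discard the current candidate section, then reset it.
        nonlocal start, end
        if start is not None:
            length = end - start + 1
            ok = length > min_len if strict else length >= min_len
            (kept if ok else dropped).append((start, end))
        start = end = None

    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = end = i          # first frame of a new section
            elif i - end - 1 < max_gap:
                end = i                  # bridge the short low-score gap
            else:
                close(strict=False)      # gap too long: settle previous section
                start = end = i
        if breaks_after[i]:
            close(strict=True)           # section cannot span a discontinuity
    close(strict=True)                   # final check at the end of the digest
    return kept, dropped
```

The two comparison modes mirror the text: the section before a long gap is kept when its length is greater than or equal to the minimum, whereas the checks at a discontinuity point and at the end require it to be strictly larger.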

以上、オンライン処理における、単一音源モードでのダイジェスト区間決定処理の処理手順について説明した。 The processing procedure of the digest section determination processing in the single sound source mode in the online processing has been described above.

（４−３．複数音源モード） (4-3. Multiple sound source mode)
（４−３−１．ダイジェスト区間決定処理の処理手順） (4-3-1. Processing procedure of digest section determination processing)
図２３を参照して、オンライン処理における、複数音源モードでのダイジェスト区間決定処理の処理手順について説明する。図２３は、オンライン処理における、複数音源モードでのダイジェスト区間決定処理の処理手順の一例を示すフロー図である。 With reference to FIG. 23, the processing procedure of the digest section determination processing in the multiple sound source mode in the online processing will be described. FIG. 23 is a flowchart showing an example of the processing procedure of the digest section determination processing in the multiple sound source mode in the online processing.

なお、図２３に示す複数音源モードでのダイジェスト区間決定処理は、図１７を参照して説明した単一音源モードでのダイジェスト区間決定処理に対して、一部の処理（具体的には後述するステップＳ１２０５に示す処理）が変更されたものであり、その他の処理は、単一音源モードでのダイジェスト区間決定処理と略同様である。従って、以下の複数音源モードでのダイジェスト区間決定処理の処理手順についての説明では、単一音源モードでのダイジェスト区間決定処理の処理手順と重複する事項についてはその詳細な説明を省略し、相違する事項について主に説明する。 Note that the digest section determination processing in the multiple sound source mode shown in FIG. 23 is obtained by changing part of the digest section determination processing in the single sound source mode described with reference to FIG. 17 (specifically, the processing shown in step S1205, described later); the other processing is substantially the same as in the single sound source mode. Therefore, in the following description, detailed description of matters that overlap with the single sound source mode is omitted, and the differences are mainly described.

図２３を参照すると、複数音源モードでのダイジェスト区間決定処理では、まず、現在のダイジェスト長が、ダイジェスト長（ダイジェストの時間長さの設定値）よりも短いかどうかが判断され（ステップＳ１２０１）、現在のダイジェスト長がダイジェスト長よりも短いと判断された場合には、入力フレームがダイジェストに追加され、ダイジェスト平均スコアが更新される（ステップＳ１２０３）。ステップＳ１２０１及びステップＳ１２０３に示す処理は、図１７に示すステップＳ８０１及びステップＳ８０３における処理と同様である。 Referring to FIG. 23, in the digest section determination processing in the multiple sound source mode, it is first determined whether the current digest length is shorter than the set digest length (the set value of the time length of the digest) (step S1201). If the current digest length is determined to be shorter, the input frame is added to the digest and the digest average score is updated (step S1203). The processing in steps S1201 and S1203 is the same as the processing in steps S801 and S803 shown in FIG. 17.

ステップＳ１２０１で、現在のダイジェスト長がダイジェスト長以上であると判断された場合には、ステップＳ１２０５に進む。ステップＳ１２０５では、音源種別ごとに入力フレームのスコアとダイジェスト平均スコアとが比較され、いずれかの音源種別において、入力フレームのスコアがダイジェスト平均スコア以上であるかどうかが判断される。ステップＳ１２０５で、いずれの音源種別においても、入力フレームのスコアがダイジェスト平均スコアよりも小さいと判断された場合には、当該入力フレームをダイジェストに追加することなく、ダイジェスト区間決定処理を終了する。 If it is determined in step S1201 that the current digest length is equal to or longer than the set digest length, the process proceeds to step S1205. In step S1205, the score of the input frame is compared with the digest average score for each sound source type, and it is determined whether the score of the input frame is equal to or higher than the digest average score for at least one sound source type. If it is determined in step S1205 that the score of the input frame is smaller than the digest average score for every sound source type, the digest section determination processing ends without adding the input frame to the digest.

一方、ステップＳ１２０５で、いずれかの音源種別において入力フレームのスコアがダイジェスト平均スコア以上であると判断された場合には、ステップＳ１２０７に進む。以降のステップＳ１２０７〜ステップＳ１２１１に示す処理は、図１７に示すステップＳ８０７〜ステップＳ８１１における処理と同様である。すなわち、入力フレームがダイジェストに追加されダイジェスト平均スコアが更新される（ステップＳ１２０７）。次いで、フレーム削除処理（ステップＳ１２０９）が行われ、フレームが削除されると、ダイジェスト平均スコアが更新され（ステップＳ１２１１）、ダイジェスト区間決定処理を終了する。 On the other hand, if it is determined in step S1205 that the score of the input frame is equal to or higher than the digest average score for at least one sound source type, the process proceeds to step S1207. The subsequent processing in steps S1207 to S1211 is the same as the processing in steps S807 to S811 shown in FIG. 17. That is, the input frame is added to the digest and the digest average score is updated (step S1207). Next, the frame deletion processing is performed (step S1209); once a frame has been deleted, the digest average score is updated (step S1211), and the digest section determination processing ends.
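The per-frame update in the multiple sound source mode can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment itself: each frame is represented as a dict mapping a sound source type to its score, the digest length is counted in frames, `max_len` is at least 1, and the frame deletion processing of step S1209 is passed in as a hypothetical callback `pick_victim` that returns the index of the frame to remove.

```python
from statistics import fmean


def digest_averages(digest):
    """Digest average score for each sound source type."""
    return {t: fmean(f[t] for f in digest) for t in digest[0]}


def update_digest(digest, frame, max_len, pick_victim):
    """Process one input frame; return True if it was added to the digest."""
    if len(digest) < max_len:
        digest.append(frame)             # digest not yet full: always add
        return True
    avg = digest_averages(digest)
    # Accept the frame if it reaches the average for at least one type.
    if not any(frame[t] >= avg[t] for t in frame):
        return False                     # below average for every type
    digest.append(frame)
    digest.pop(pick_victim(digest))      # frame deletion keeps the length fixed
    return True
```

In this sketch the averages are recomputed on demand instead of being updated incrementally, which keeps the example short at the cost of extra computation.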

（４−３−２．フレーム削除処理） (4-3-2. Frame deletion processing)
ここで、図２４を参照して、図２３のステップＳ１２０９に示すフレーム削除処理の詳細について説明する。図２４は、オンライン処理における、複数音源モードでのフレーム削除処理の処理手順の一例を示すフロー図である。 Here, with reference to FIG. 24, details of the frame deletion processing shown in step S1209 of FIG. 23 will be described. FIG. 24 is a flowchart showing an example of a processing procedure of frame deletion processing in the multiple sound source mode in the online processing.

図２４を参照すると、オンライン処理における複数音源モードでのフレーム削除処理では、まず、音源種別ごとに、スコア閾値として、ダイジェスト平均スコアが設定される（ステップＳ１３０１）。次いで、種別ダイジェスト長が設定される（ステップＳ１３０３）。なお、ステップＳ１３０３に示す処理では、種別ダイジェスト長は、図１０に示す、オフライン処理における複数音源モードでのダイジェスト区間決定処理のステップＳ４０５に示す処理と同様の方法によって設定されてよい。 Referring to FIG. 24, in the frame deletion processing in the multiple sound source mode in the online processing, the digest average score is first set as the score threshold for each sound source type (step S1301). Next, the type digest length is set (step S1303). In step S1303, the type digest length may be set by the same method as in step S405 of the digest section determination processing in the multiple sound source mode in the offline processing shown in FIG. 10.

そして、設定されたスコア閾値を用いて、ダイジェストの中でより高いスコアを有する区間（高スコア区間）をダイジェスト区間として決定する処理（高スコア区間決定処理）が行われる（ステップＳ１３０５）。ステップＳ１３０５に示す処理は、図１８に示すステップＳ９０３における処理、すなわち、図２０−図２２に示す一連の処理と同様であるため、その詳細な説明を省略する。ただし、複数音源モードでのフレーム削除処理では、高スコア区間決定処理が、音源種別ごとに行われる。 Then, using the set score threshold value, a process of determining a section having a higher score (high score section) in the digest as a digest section (high score section determination processing) is performed (step S1305). Since the processing shown in step S1305 is similar to the processing in step S903 shown in FIG. 18, that is, the series of processing shown in FIGS. 20 to 22, detailed description thereof will be omitted. However, in the frame deletion process in the multiple sound source mode, the high score section determination process is performed for each sound source type.

ステップＳ１３０５において高スコア区間が決定されると、高スコア区間決定処理の結果、いずれかの音源種別において、削除対象区間が存在するかどうかが判断される（ステップＳ１３０７）。ステップＳ１３０７においていずれかの音源種別において削除対象区間が存在すると判断された場合には、その音源種別の削除対象区間からスコアのより低いフレームが１つ選択される（ステップＳ１３０９）。そして、選択されたそのフレームがダイジェストから削除される（ステップＳ１３１５）。 When the high score sections have been determined in step S1305, it is determined, based on the result of the high score section determination processing, whether a deletion target section exists for any sound source type (step S1307). If it is determined in step S1307 that a deletion target section exists for some sound source type, one frame with a lower score is selected from the deletion target section of that sound source type (step S1309). The selected frame is then deleted from the digest (step S1315).

一方、ステップＳ１３０７において、いずれの音源種別にも削除対象区間が存在しないと判断された場合には、ダイジェスト区間長の合計が種別ダイジェスト長を最も超過している音源種別が選択される（ステップＳ１３１１）。そして、選択された音源種別について、そのスコアのより低いフレームが１つ選択される（ステップＳ１３１３）。そして、選択されたそのフレームがダイジェストから削除される（ステップＳ１３１５）。 On the other hand, if it is determined in step S1307 that there is no deletion target section in any sound source type, a sound source type whose total digest section length most exceeds the type digest length is selected (step S1311). ). Then, for the selected sound source type, one frame having a lower score is selected (step S1313). Then, the selected frame is deleted from the digest (step S1315).
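The two branches of the frame selection in steps S1307 to S1313 can be sketched as below. The data layout and all names (`scores`, `drop_frames`, `section_frames`, `type_budget`) are hypothetical. The sketch also assumes that lengths are counted in frames, that "a frame with a lower score" means the lowest-scoring frame, and that when several sound source types have deletion target sections the first one found is used, a point the text leaves open.

```python
def pick_frame_to_delete(scores, drop_frames, section_frames, type_budget):
    """Return the index of the digest frame to delete.

    scores[t][i]      -- score of digest frame i for sound source type t
    drop_frames[t]    -- frame indices in deletion target sections of type t
    section_frames[t] -- frame indices in digest sections of type t
    type_budget[t]    -- type digest length of type t, in frames
    """
    # Branch 1: some type has a deletion target section.
    for t, frames in drop_frames.items():
        if frames:
            return min(frames, key=lambda i: scores[t][i])
    # Branch 2: pick the type that exceeds its type digest length the most.
    worst = max(section_frames,
                key=lambda t: len(section_frames[t]) - type_budget[t])
    return min(section_frames[worst], key=lambda i: scores[worst][i])
```

The second branch assumes the selected type has at least one frame in its digest sections, which holds whenever that type actually exceeds its budget.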

以上、オンライン処理における、複数音源モードでのダイジェスト区間決定処理の処理手順について説明した。 The processing procedure of the digest section determination processing in the multiple sound source mode in the online processing has been described above.

（４−４．多様性反映モード） (4-4. Diversity reflection mode)
オンライン処理における多様性反映モードでのダイジェスト区間決定処理の処理手順は、図２３を参照して説明したオンライン処理における複数音源モードでのダイジェスト区間決定処理の処理手順と同様である。ただし、多様性反映モードでは、図２３のステップＳ１２０９に示すフレーム削除処理の詳細が、複数音源モードとは異なる。従って、以下のオンライン処理における多様性反映モードでのダイジェスト区間決定処理についての説明では、フレーム削除処理の詳細について主に説明する。 The processing procedure of the digest section determination processing in the diversity reflection mode in the online processing is the same as the processing procedure of the digest section determination processing in the multiple sound source mode in the online processing described with reference to FIG. 23. However, in the diversity reflection mode, the details of the frame deletion processing shown in step S1209 in FIG. 23 differ from those in the multiple sound source mode. Therefore, the following description of the digest section determination processing in the diversity reflection mode in the online processing mainly covers the details of the frame deletion processing.

なお、オンライン処理においても、オフライン処理と同様に、多様性反映モードにおける各処理は、図１２に示す情報処理装置１２０によって実行され得る。 In the online processing, as in the offline processing, each process in the diversity reflection mode can be executed by the information processing apparatus 120 shown in FIG. 12.

（４−４−１．フレーム削除処理の処理手順） (4-4-1. Processing procedure of frame deletion processing)
図２５を参照して、オンライン処理における、多様性反映モードでのフレーム削除処理の処理手順について説明する。図２５は、オンライン処理における、多様性反映モードでのフレーム削除処理の処理手順の一例を示すフロー図である。 With reference to FIG. 25, a processing procedure of frame deletion processing in the diversity reflection mode in online processing will be described. FIG. 25 is a flowchart showing an example of a processing procedure of frame deletion processing in the diversity reflection mode in online processing.

ここで、多様性反映モードは、同一音源種別内での多様性を考慮してダイジェスト区間を決定するものであるため、ダイジェストに含める対象とする音源種別は、単一の音源種別であってもよいし、複数の音源種別であってもよい。図２５では、一例として、ダイジェストに複数の音源種別からなる音声を含める場合における処理手順を図示している。 Here, because the diversity reflection mode determines digest sections in consideration of the diversity within the same sound source type, the sound source types to be included in the digest may be a single sound source type or a plurality of sound source types. As an example, FIG. 25 illustrates the processing procedure for the case where the digest includes sounds of a plurality of sound source types.

なお、多様性反映モードでのフレーム削除処理における各処理は、後述するステップＳ１４１３に示す処理を除き、図２４を参照して説明した複数音源モードでのフレーム削除処理における各処理と同様である。従って、以下の多様性反映モードでのフレーム削除処理の処理手順についての説明では、複数音源モードでのフレーム削除処理の処理手順と相違する事項について主に説明し、重複する事項についてはその詳細な説明を省略する。 Note that each process in the frame deletion process in the diversity reflection mode is the same as each process in the frame deletion process in the multiple sound source mode described with reference to FIG. 24, except for the process shown in step S1413 described later. Therefore, in the following description of the processing procedure of the frame deletion processing in the diversity reflection mode, matters that are different from the processing procedure of the frame deletion processing in the multiple sound source mode will be mainly described, and the overlapping matters will be described in detail. The description is omitted.

図２５を参照すると、オンライン処理における多様性反映モードでのフレーム削除処理では、まず、音源種別ごとに、スコア閾値としてダイジェスト平均スコアが設定され（ステップＳ１４０１）、次いで、種別ダイジェスト長が設定される（ステップＳ１４０３）。そして、設定されたスコア閾値を用いて、音源種別ごとに、高スコア区間決定処理が行われる（ステップＳ１４０５）。これらの処理は、図２４に示すステップＳ１３０１〜ステップＳ１３０５における処理と同様である。 Referring to FIG. 25, in the frame deletion processing in the diversity reflection mode in the online processing, the digest average score is first set as the score threshold for each sound source type (step S1401), and then the type digest length is set (step S1403). The high score section determination processing is then performed for each sound source type using the set score threshold (step S1405). These processes are the same as the processes in steps S1301 to S1305 shown in FIG. 24.

次に、高スコア区間決定処理の結果、いずれかの音源種別において、削除対象区間が存在するかどうかが判断される（ステップＳ１４０７）。いずれかの音源種別において削除対象区間が存在すると判断された場合には、その音源種別の削除対象区間からスコアのより低いフレームが１つ選択され（ステップＳ１４０９）、選択されたそのフレームがダイジェストから削除される（ステップＳ１４１５）。これらの処理は、図２４に示すステップＳ１３０７、ステップＳ１３０９、ステップＳ１３１５における処理と同様である。 Next, based on the result of the high score section determination processing, it is determined whether a deletion target section exists for any sound source type (step S1407). If a deletion target section is determined to exist for some sound source type, one frame with a lower score is selected from the deletion target section of that sound source type (step S1409), and the selected frame is deleted from the digest (step S1415). These processes are the same as the processes in steps S1307, S1309, and S1315 shown in FIG. 24.

一方、ステップＳ１４０７において、いずれの音源種別にも削除対象区間が存在しないと判断された場合には、ダイジェスト区間長の合計が種別ダイジェスト長を最も超過している音源種別が選択される（ステップＳ１４１１）。そして、選択された音源種別について、当該音源種別内での多様性を考慮して削除するフレームを選択する処理（多様性に基づく削除フレーム選択処理）が行われる（ステップＳ１４１３）。そして、選択されたそのフレームがダイジェストから削除される（ステップＳ１４１５）。 On the other hand, if it is determined in step S1407 that the deletion target section does not exist in any sound source type, the sound source type whose sum of digest section lengths most exceeds the type digest length is selected (step S1411). ). Then, with respect to the selected sound source type, a process of selecting a frame to be deleted in consideration of diversity within the sound source type (deleted frame selection process based on diversity) is performed (step S1413). Then, the selected frame is deleted from the digest (step S1415).

（４−４−２．多様性に基づく削除フレーム選択処理） (4-4-2. Deleted frame selection processing based on diversity)
図２６を参照して、図２５のステップＳ１４１３に示す多様性に基づく削除フレーム選択処理について詳しく説明する。図２６は、オンライン処理における、多様性に基づく削除フレーム選択処理の処理手順の一例を示すフロー図である。 With reference to FIG. 26, the diversity-based deletion frame selection processing shown in step S1413 of FIG. 25 will be described in detail. FIG. 26 is a flowchart showing an example of the processing procedure of diversity-based deletion frame selection processing in online processing.

図２６を参照すると、オンライン処理における多様性に基づく削除フレーム選択処理では、まず、全フレームの場合と、任意の１つのフレームを除いた場合の、ｎ通りの特徴量空間における特徴量ベクトルの分散が計算される（ステップＳ１５０１）。 Referring to FIG. 26, in the deletion frame selection processing based on diversity in the online processing, first, the variance of the feature amount vectors in the n feature amount spaces in the case of all frames and the case of excluding one arbitrary frame Is calculated (step S1501).

次に、全フレームの場合と、任意の１つのフレームを除いた場合の、ｎ通りのフレームの時刻の分散が計算される（ステップＳ１５０３）。 Next, the variances of the times of the n frames are calculated for all the frames and the case where any one frame is excluded (step S1503).

次に、特徴量ベクトルの分散及び時刻の分散に重み付けを行った上でその総和が計算され、全フレームの場合の値からの低減量が最も少ない場合に除外されたフレームが、削除するフレームとして決定される（ステップＳ１５０５）。つまり、ステップＳ１５０５に示す処理では、特徴量ベクトル及び時刻の分散の計算に用いられなかった場合に最も影響の少ない特徴量ベクトル及び時刻を有するフレームが、削除するフレームとして決定される。これにより、特徴量ベクトル及び時刻の分散がより大きくなるように、ダイジェストに含めるフレームが選択されることとなる。 Next, the variance of the feature amount vectors and the variance of the times are weighted and summed, and the frame whose exclusion reduces this sum the least relative to the all-frames value is determined as the frame to be deleted (step S1505). In other words, in step S1505, the frame whose feature amount vector and time have the least influence on the variances when left out of the calculation is determined as the frame to be deleted. As a result, the frames retained in the digest are selected so that the variances of the feature amount vectors and of the times become larger.
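The leave-one-out selection in steps S1501 to S1505 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the embodiment itself: the variance of the feature amount vectors is taken to be the sum of the per-dimension population variances, the weights `w_feat` and `w_time` are hypothetical parameters, and at least two frames are assumed to be present.

```python
from statistics import pvariance


def spread(features, times, w_feat, w_time):
    """Weighted sum of feature-vector variance and time variance."""
    dims = range(len(features[0]))
    feat_var = sum(pvariance([f[d] for f in features]) for d in dims)
    return w_feat * feat_var + w_time * pvariance(times)


def pick_redundant_frame(features, times, w_feat=1.0, w_time=1.0):
    """Index of the frame whose removal reduces the spread the least."""
    full = spread(features, times, w_feat, w_time)

    def reduction(k):
        # Spread lost when frame k is left out (negative if spread grows).
        rest_f = features[:k] + features[k + 1:]
        rest_t = times[:k] + times[k + 1:]
        return full - spread(rest_f, rest_t, w_feat, w_time)

    return min(range(len(times)), key=reduction)
```

A frame whose removal increases the spread yields a negative reduction and is therefore preferred for deletion, which matches the stated aim of keeping the variance of the remaining frames large.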

以上、図２５を参照して、オンライン処理における、多様性反映モードでのダイジェスト区間決定処理の処理手順について説明した。また、図２６を参照して、図２５のステップＳ１４１３に示す多様性に基づく削除フレーム選択処理について説明した。 The procedure of the digest section determination process in the diversity reflection mode in the online process has been described above with reference to FIG. In addition, with reference to FIG. 26, the deletion frame selection processing based on diversity shown in step S1413 of FIG. 25 has been described.

（５．変形例） (5. Modified examples)
以上説明した実施形態のいくつかの変形例について説明する。なお、以上説明した実施形態及び以下に説明する各変形例に記載される事項は、可能な範囲で互いに組み合わされてよい。 Some modifications of the embodiment described above will be described. The matters described in the above-described embodiment and each modification described below may be combined with each other within a possible range.

（５−１．音声収音機能が設けられる変形例） (5-1. Modified example in which a sound collection function is provided)
図２７を参照して、情報処理装置に音声収音機能が設けられる変形例について説明する。図２７は、音声収音機能が設けられる変形例に係る情報処理装置の機能構成の一例を示す機能ブロック図である。 With reference to FIG. 27, a modification in which the information processing apparatus is provided with a sound collection function will be described. FIG. 27 is a functional block diagram showing an example of the functional configuration of the information processing apparatus according to the modification provided with the sound collection function.

図２７を参照すると、本変形例に係る情報処理装置１３０は、その機能として、特徴量抽出部１１１と、音源種別スコア算出部１１３と、ダイジェスト区間決定部１１５と、音声収音部１３１と、を有する。ここで、特徴量抽出部１１１、音源種別スコア算出部１１３及びダイジェスト区間決定部１１５の機能は、図１に示す情報処理装置１１０におけるこれらの機能ブロックの機能と同様であるため、その詳細な説明は省略する。 Referring to FIG. 27, the information processing apparatus 130 according to this modification has, as its functions, a feature amount extraction unit 111, a sound source type score calculation unit 113, a digest section determination unit 115, and a sound collection unit 131. The functions of the feature amount extraction unit 111, the sound source type score calculation unit 113, and the digest section determination unit 115 are the same as those of the corresponding functional blocks in the information processing apparatus 110 shown in FIG. 1, so their detailed description is omitted.

音声収音部１３１は、例えばマイクロフォン等の収音装置によって構成され、外部の音声を収音し、音声情報として情報処理装置１３０に入力する機能を有する。音声収音部１３１は、収音した外部音声に係る音声情報を、特徴量抽出部１１１に提供する。特徴量抽出部１１１、音源種別スコア算出部１１３及びダイジェスト区間決定部１１５は、音声収音部１３１から提供された音声情報に対して、以上説明した実施形態に係る各種の処理（特徴量抽出処理、音源種別スコア算出処理及びダイジェスト区間決定処理）を行う。 The sound collection unit 131 is configured by a sound collecting device such as a microphone, and has a function of collecting external sound and inputting it to the information processing apparatus 130 as audio information. The sound collection unit 131 provides the audio information on the collected external sound to the feature amount extraction unit 111. The feature amount extraction unit 111, the sound source type score calculation unit 113, and the digest section determination unit 115 perform the various processes according to the embodiment described above (feature amount extraction processing, sound source type score calculation processing, and digest section determination processing) on the audio information provided from the sound collection unit 131.

なお、音声収音部１３１は、１つのマイクロフォンによって構成されてもよいし、互いに異なる位置に配置される複数のマイクロフォンによって構成されてもよい。音声収音部１３１が、互いに異なる位置に配置される複数のマイクロフォンによって構成される場合には、特徴量抽出部１１１は、収音位置間の相関や音源方位等、マイクロフォンが複数存在することによって算出可能となる各種の特徴量を算出することができる。 The sound collection unit 131 may be configured by a single microphone or by a plurality of microphones arranged at mutually different positions. When the sound collection unit 131 is configured by a plurality of microphones arranged at different positions, the feature amount extraction unit 111 can calculate various feature amounts that become computable only when multiple microphones are present, such as the correlation between sound collection positions and the sound source direction.

以上、図２７を参照して、情報処理装置に音声収音機能が設けられる変形例について説明した。以上説明したように、本変形例によれば、情報処理装置１３０自体が外部の音声を収音する収音機能を有し、収音した外部音声に係る音声情報のダイジェスト区間情報を出力することができる。このような情報処理装置１３０は、例えばＩＣレコーダーや外部音声を録音するアプリケーションソフトが搭載されたスマートフォン等であり得る。 The modification in which the information processing apparatus is provided with a sound collection function has been described above with reference to FIG. 27. As described above, according to this modification, the information processing apparatus 130 itself has a function of collecting external sound and can output digest section information for the audio information on the collected external sound. Such an information processing apparatus 130 may be, for example, an IC recorder or a smartphone equipped with application software for recording external sound.

（５−２．ダイジェスト生成機能が設けられる変形例） (5-2. Modified example in which a digest generation function is provided)
図２８を参照して、情報処理装置にダイジェスト生成機能が設けられる変形例について説明する。図２８は、ダイジェスト生成機能が設けられる変形例に係る情報処理装置の機能構成の一例を示す機能ブロック図である。 With reference to FIG. 28, a modification in which the digest generating function is provided in the information processing device will be described. FIG. 28 is a functional block diagram showing an example of the functional configuration of the information processing apparatus according to the modified example provided with the digest generation function.

図２８を参照すると、本変形例に係る情報処理装置１４０は、その機能として、特徴量抽出部１１１と、音源種別スコア算出部１１３と、ダイジェスト区間決定部１１５と、出力音声生成部１４１と、を有する。ここで、特徴量抽出部１１１、音源種別スコア算出部１１３及びダイジェスト区間決定部１１５の機能は、図１に示す情報処理装置１１０におけるこれらの機能ブロックの機能と同様であるため、その詳細な説明は省略する。 Referring to FIG. 28, the information processing apparatus 140 according to this modification has, as its functions, a feature amount extraction unit 111, a sound source type score calculation unit 113, a digest section determination unit 115, and an output audio generation unit 141. The functions of the feature amount extraction unit 111, the sound source type score calculation unit 113, and the digest section determination unit 115 are the same as those of the corresponding functional blocks in the information processing apparatus 110 shown in FIG. 1, so their detailed description is omitted.

出力音声生成部１４１は、各種のプロセッサによって構成され、音声情報と、ダイジェスト区間決定部１１５によって生成されるダイジェスト区間情報と、に基づいて、当該音声情報のダイジェストを、音声出力機器で出力可能なデータ形式で生成する。出力音声生成部１４１は、ダイジェストを生成する際に、ダイジェスト区間同士のつなぎ目に対してクロスフェード処理を施す等、ユーザの聴き心地を考慮して、各種の公知の音声処理を適宜行ってもよい。出力音声生成部１４１は、生成したダイジェストに対応する音声情報（出力音声情報）を、例えばスピーカ等の音声出力機器に出力する。当該音声出力機器によってダイジェストが音声として出力される。 The output audio generation unit 141 is configured by various processors and, based on the audio information and the digest section information generated by the digest section determination unit 115, generates a digest of the audio information in a data format that an audio output device can output. When generating the digest, the output audio generation unit 141 may apply various known audio processes as appropriate for the user's listening comfort, such as crossfading the joints between digest sections. The output audio generation unit 141 outputs the audio information corresponding to the generated digest (output audio information) to an audio output device such as a speaker, and the audio output device outputs the digest as sound.
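As one example of the known audio processing mentioned above, a linear crossfade at the joint between two digest sections could look like the following sketch. The text does not specify the fade shape or length; the linear ramp, the function name `crossfade_join`, and the representation of each section as a list of float samples are assumptions for illustration. The overlap must not exceed the length of either section.

```python
def crossfade_join(a, b, overlap):
    """Join sample lists a and b, blending the last `overlap` samples of a
    with the first `overlap` samples of b using a linear ramp."""
    if overlap == 0:
        return a + b
    head, tail = a[:-overlap], a[-overlap:]
    mixed = []
    for k, (x, y) in enumerate(zip(tail, b[:overlap])):
        w = (k + 1) / (overlap + 1)      # weight of b rises from near 0 to near 1
        mixed.append(x * (1 - w) + y * w)
    return head + mixed + b[overlap:]
```

The joined signal is shorter than the plain concatenation by `overlap` samples, because the overlapping region is mixed rather than played twice.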

以上、図２８を参照して、情報処理装置にダイジェスト生成機能が設けられる変形例について説明した。以上説明したように、本変形例によれば、情報処理装置１４０自身がダイジェストを生成する機能を有し、生成したダイジェストを、情報処理装置１４０自身に設けられる音声出力機器又は情報処理装置１４０の外部の音声出力機器から出力することができる。 The modification in which the information processing apparatus is provided with a digest generation function has been described above with reference to FIG. 28. As described above, according to this modification, the information processing apparatus 140 itself has a function of generating a digest, and the generated digest can be output from an audio output device provided in the information processing apparatus 140 itself or from an audio output device external to the information processing apparatus 140.

なお、情報処理装置１４０自身が音声出力機器を有し、ダイジェストを再生可能である場合には、情報処理装置１４０は、音声情報を取得したら自動的にダイジェストを生成してもよい。また、その場合、情報処理装置１４０では、例えば、表示画面上の音声情報を表すファイル名にポインタを載せる等のＧＵＩ（ＧｒａｐｈｉｃａｌＵｓｅｒＩｎｔｅｒｆａｃｅ）を用いた操作や、プレビュー操作等の簡易な操作によって、ダイジェストが再生されてもよい。情報処理装置１４０がこのように構成されることにより、ユーザは、ダイジェスト生成のための操作をわざわざ行わなくてもよく、また、簡易な操作でダイジェストを聴くことができるため、あたかも映像情報におけるサムネイルを確認するような感覚で音声情報のダイジェストを確認することができ、ユーザの利便性がより向上する。 When the information processing device 140 itself has a voice output device and can reproduce the digest, the information processing device 140 may automatically generate the digest when the voice information is acquired. Further, in that case, in the information processing apparatus 140, for example, by an operation using a GUI (Graphical User Interface) such as placing a pointer on a file name indicating audio information on the display screen, or a simple operation such as a preview operation, The digest may be played. With the information processing device 140 configured in this manner, the user does not have to bother to perform the operation for generating the digest, and since the user can listen to the digest with a simple operation, it is as if the thumbnail in the video information. It is possible to confirm the digest of the voice information as if confirming, and the convenience for the user is further improved.

（５−３．音声情報データベースが設けられる変形例） (5-3. Modified example in which an audio information database is provided)
図２９を参照して、情報処理装置に音声情報データベースが設けられる変形例について説明する。図２９は、音声情報データベースが設けられる変形例に係る情報処理装置の機能構成の一例を示す機能ブロック図である。 With reference to FIG. 29, a modification in which the information processing apparatus is provided with an audio information database will be described. FIG. 29 is a functional block diagram showing an example of the functional configuration of the information processing apparatus according to the modification in which the audio information database is provided.

Referring to FIG. 29, the information processing apparatus 150 according to this modification includes, as its functions, a feature amount extraction unit 111, a sound source type score calculation unit 113, a digest section determination unit 115, and an audio information database 151. The functions of the feature amount extraction unit 111, the sound source type score calculation unit 113, and the digest section determination unit 115 are the same as those of the corresponding functional blocks in the information processing apparatus 110 shown in FIG. 1, so a detailed description of them is omitted.

The audio information database 151 is implemented by a storage device such as an HDD and stores audio information organized as a database. By accessing the audio information database 151, the feature amount extraction unit 111 can extract feature amounts from any audio information in the audio information database 151. That is, according to this modification, the feature amount extraction unit 111, the sound source type score calculation unit 113, and the digest section determination unit 115 perform the various processes according to the embodiment described above (feature amount extraction, sound source type score calculation, and digest section determination) on the audio information organized as a database in the storage unit provided in the information processing apparatus 150.

The modification in which the information processing apparatus is provided with an audio information database has been described above with reference to FIG. 29. As described above, according to this modification, the information processing apparatus 150 itself has a database in which audio information is stored and can output digest section information for the audio information in that database.
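The flow of this modification (database, then feature amount extraction, then score calculation, then digest section determination) can be illustrated with a minimal Python sketch. Every name here is hypothetical and does not appear in the specification: `frame_power` stands in for the feature amount extraction unit 111, `normalize` is a toy substitute for the sound source type score calculation unit 113 (an actual system would presumably use a trained classifier), and `sections_above` stands in for the digest section determination unit 115 with an assumed fixed threshold.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Section = Tuple[int, int]  # (first frame, one past the last frame)


def frame_power(samples: Sequence[float], frame_len: int = 4) -> List[float]:
    """Feature extraction stand-in: mean power of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def normalize(values: List[float]) -> List[float]:
    """Scoring stand-in: scale the features to [0, 1] (assumption, not a classifier)."""
    peak = max(values, default=0.0)
    return [v / peak if peak > 0 else 0.0 for v in values]


def sections_above(scores: List[float], threshold: float = 0.5) -> List[Section]:
    """Section decision stand-in: contiguous runs of frames scoring above the threshold."""
    sections: List[Section] = []
    start = None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, len(scores)))
    return sections


def digest_sections_for_database(
    database: Dict[str, Sequence[float]],
    extract: Callable[[Sequence[float]], List[float]] = frame_power,
    score: Callable[[List[float]], List[float]] = normalize,
    decide: Callable[[List[float]], List[Section]] = sections_above,
) -> Dict[str, List[Section]]:
    """Run extraction, scoring, and section decision on every stored recording."""
    return {name: decide(score(extract(samples))) for name, samples in database.items()}
```

Passing the three stages as parameters mirrors the point that the functional blocks are unchanged from FIG. 1 and only the source of the audio information differs; the dictionary used as the database is likewise only an illustration.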

(6. Hardware configuration)
Next, the hardware configuration of the information processing apparatus according to the present embodiment will be described with reference to FIG. 30. FIG. 30 is a block diagram showing an example of the hardware configuration of the information processing apparatus according to the present embodiment. The information processing apparatus 900 shown in FIG. 30 can realize, for example, the functional configurations of the information processing apparatuses 110, 120, 130, 140, and 150 shown in FIGS. 1, 12, and 27 to 29.

情報処理装置９００は、ＣＰＵ９０１、ＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）９０３及びＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）９０５を備える。また、情報処理装置９００は、ホストバス９０７、ブリッジ９０９、外部バス９１１、インターフェース９１３、入力装置９１５、出力装置９１７、ストレージ装置９１９、通信装置９２１、ドライブ９２３及び接続ポート９２５を備えてもよい。情報処理装置９００は、ＣＰＵ９０１に代えて、又はこれとともに、ＤＳＰ若しくはＡＳＩＣと呼ばれるような処理回路を有してもよい。 The information processing apparatus 900 includes a CPU 901, a ROM (Read Only Memory) 903, and a RAM (Random Access Memory) 905. Further, the information processing device 900 may include a host bus 907, a bridge 909, an external bus 911, an interface 913, an input device 915, an output device 917, a storage device 919, a communication device 921, a drive 923, and a connection port 925. The information processing apparatus 900 may have a processing circuit called DSP or ASIC instead of or in addition to the CPU 901.

ＣＰＵ９０１は、演算処理装置及び制御装置として機能し、ＲＯＭ９０３、ＲＡＭ９０５、ストレージ装置９１９又はリムーバブル記録媒体９２９に記録された各種のプログラムに従って、情報処理装置９００内の動作全般又はその一部を制御する。ＲＯＭ９０３は、ＣＰＵ９０１が使用するプログラムや演算パラメータ等を記憶する。ＲＡＭ９０５は、ＣＰＵ９０１の実行において使用するプログラムや、その実行時のパラメータ等を一次記憶する。ＣＰＵ９０１、ＲＯＭ９０３及びＲＡＭ９０５は、ＣＰＵバス等の内部バスにより構成されるホストバス９０７により相互に接続されている。更に、ホストバス９０７は、ブリッジ９０９を介して、ＰＣＩ（ＰｅｒｉｐｈｅｒａｌＣｏｍｐｏｎｅｎｔＩｎｔｅｒｃｏｎｎｅｃｔ／Ｉｎｔｅｒｆａｃｅ）バス等の外部バス９１１に接続されている。ＣＰＵ９０１は、例えば、上述した実施形態における特徴量抽出部１１１、音源種別スコア算出部１１３、ダイジェスト区間決定部１１５及び出力音声生成部１４１を構成し得る。 The CPU 901 functions as an arithmetic processing unit and a control unit, and controls the overall operation of the information processing apparatus 900 or a part thereof according to various programs recorded in the ROM 903, the RAM 905, the storage device 919, or the removable recording medium 929. The ROM 903 stores programs used by the CPU 901, calculation parameters, and the like. The RAM 905 temporarily stores programs used in the execution of the CPU 901, parameters at the time of execution, and the like. The CPU 901, the ROM 903, and the RAM 905 are mutually connected by a host bus 907 configured by an internal bus such as a CPU bus. Further, the host bus 907 is connected to an external bus 911 such as a PCI (Peripheral Component Interconnect / Interface) bus via a bridge 909. The CPU 901 can configure, for example, the feature amount extraction unit 111, the sound source type score calculation unit 113, the digest section determination unit 115, and the output sound generation unit 141 in the above-described embodiment.


The input device 915 is a device operated by the user, such as a mouse, a keyboard, a touch panel, a button, a switch, or a lever. The input device 915 may also be, for example, a remote control device (a so-called remote controller) that uses infrared rays or other radio waves, or an externally connected device 931, such as a mobile phone or a PDA, that supports operation of the information processing apparatus 900. The input device 915 further includes, for example, an input control circuit that generates an input signal based on information entered by the user with the above operation means and outputs the signal to the CPU 901. By operating the input device 915, the user of the information processing apparatus 900 can input various data to the information processing apparatus 900 and instruct it to perform processing operations. In the present embodiment, for example, an instruction to start the digest section determination process or a mode switching instruction may be input to the information processing apparatuses 110, 120, 130, 140, and 150 via the input device 915.

また、入力装置９１５は、周囲の音声を収音し、当該周囲の音声を音声情報として情報処理装置９００に入力するマイクロフォンであってもよい。入力装置９１５がマイクロフォンである場合には、当該入力装置９１５は、上述した実施形態における音声収音部１３１を構成し得る。 In addition, the input device 915 may be a microphone that picks up surrounding sounds and inputs the surrounding sounds into the information processing device 900 as sound information. When the input device 915 is a microphone, the input device 915 can configure the sound pickup unit 131 in the above-described embodiment.

The output device 917 is a device capable of notifying the user of acquired information visually or audibly. Examples of such a device include display devices such as CRT display devices, liquid crystal display devices, plasma display devices, EL display devices, and lamps; audio output devices such as speakers and headphones; and printer devices. The output device 917 outputs, for example, results obtained by various processes performed by the information processing apparatus 900. Specifically, a display device visually displays the results obtained by various processes performed by the information processing apparatus 900 in various formats such as text, images, tables, and graphs. An audio output device, on the other hand, converts an audio signal composed of reproduced sound data, acoustic data, and the like into an analog signal and outputs it audibly. In the present embodiment, for example, a digest of audio information generated by the information processing apparatus 140 may be output via the audio output device. The display device may also show a GUI for inputting various instructions via the input device 915.

ストレージ装置９１９は、情報処理装置９００の記憶部の一例として構成されたデータ格納用の装置である。ストレージ装置９１９は、例えば、ＨＤＤ等の磁気記憶部デバイス、半導体記憶デバイス、光記憶デバイス又は光磁気記憶デバイス等により構成される。このストレージ装置９１９は、ＣＰＵ９０１が実行するプログラムや各種データ及び外部から取得した各種のデータ等を格納する。ストレージ装置９１９は、例えば、上述した実施形態における音声情報データベース１５１を構成し得る。 The storage device 919 is a device for storing data configured as an example of a storage unit of the information processing device 900. The storage device 919 includes, for example, a magnetic storage device such as an HDD, a semiconductor storage device, an optical storage device, a magneto-optical storage device, or the like. The storage device 919 stores programs executed by the CPU 901, various data, various data acquired from the outside, and the like. The storage device 919 can configure, for example, the voice information database 151 in the above-described embodiment.

The communication device 921 is a communication interface composed of, for example, a communication device for connecting to a communication network (network) 927. The communication device 921 is, for example, a communication card for a wired or wireless LAN (Local Area Network), Bluetooth (registered trademark), or WUSB (Wireless USB). The communication device 921 may also be a router for optical communication, a router for ADSL (Asymmetric Digital Subscriber Line), a modem for various kinds of communication, or the like. The communication device 921 can transmit and receive signals and the like to and from the Internet and other communication devices in accordance with a predetermined protocol such as TCP/IP. The network 927 connected to the communication device 921 is a network connected by wire or wirelessly, and may be, for example, the Internet, a home LAN, infrared communication, radio wave communication, or satellite communication. In the present embodiment, for example, the information processing apparatuses 110, 120, 130, 140, and 150 may exchange the various kinds of information that they take as input or produce as output, such as audio information, digest section information, and output audio information, with external devices via the communication device 921.

The drive 923 is a reader/writer for recording media and is built into or externally attached to the information processing apparatus 900. The drive 923 reads information recorded on a mounted removable recording medium 929, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, and outputs the information to the RAM 905. The drive 923 can also write information to the mounted removable recording medium 929. The removable recording medium 929 is, for example, a DVD medium, an HD-DVD medium, or a Blu-ray (registered trademark) medium. The removable recording medium 929 may also be a CompactFlash (registered trademark) (CF) card, a flash memory, an SD memory card (Secure Digital memory card), or the like. Further, the removable recording medium 929 may be, for example, an IC card (Integrated Circuit card) equipped with a contactless IC chip, or an electronic device. In the present embodiment, for example, various kinds of information processed by the information processing apparatuses 110, 120, 130, 140, and 150 may be read from the removable recording medium 929 or written to the removable recording medium 929 by the drive 923.

The connection port 925 is a port for connecting a device directly to the information processing apparatus 900. Examples of the connection port 925 include a USB (Universal Serial Bus) port, an IEEE 1394 port, and a SCSI (Small Computer System Interface) port. Other examples include an RS-232C port, an optical audio terminal, and an HDMI (registered trademark) (High-Definition Multimedia Interface) port. When the externally connected device 931 is connected to the connection port 925, the information processing apparatus 900 can acquire various data directly from the externally connected device 931 and provide various data to it. In the present embodiment, for example, various kinds of information processed by the information processing apparatuses 110, 120, 130, 140, and 150 may be acquired from the externally connected device 931 or output to the externally connected device 931 via the connection port 925.

以上、本実施形態に係る情報処理装置９００の機能を実現可能なハードウェア構成の一例を示した。上記の各構成要素は、汎用的な部材を用いて構成されていてもよいし、各構成要素の機能に特化したハードウェアにより構成されていてもよい。従って、本実施形態を実施する時々の技術レベルに応じて、適宜、利用するハードウェア構成を変更することが可能である。 The example of the hardware configuration capable of realizing the function of the information processing device 900 according to the present embodiment has been described above. Each component described above may be configured by using a general-purpose member, or may be configured by hardware specialized for the function of each component. Therefore, it is possible to appropriately change the hardware configuration to be used according to the technical level at the time of implementing the present embodiment.

なお、上述のような本実施形態に係る情報処理装置９００の各機能を実現するためのコンピュータプログラムを作製し、ＰＣ等に実装することが可能である。また、このようなコンピュータプログラムが格納された、コンピュータで読み取り可能な記録媒体も提供することができる。記録媒体は、例えば、磁気ディスク、光ディスク、光磁気ディスク、フラッシュメモリ等である。また、上記のコンピュータプログラムは、記録媒体を用いずに、例えばネットワークを介して配信されてもよい。 A computer program for realizing each function of the information processing apparatus 900 according to the present embodiment as described above can be created and mounted on a PC or the like. It is also possible to provide a computer-readable recording medium in which such a computer program is stored. The recording medium is, for example, a magnetic disk, an optical disk, a magneto-optical disk, a flash memory, or the like. Further, the above computer program may be distributed, for example, via a network without using a recording medium.

(7. Summary)
As described above, according to the present embodiment, a sound source type score indicating the probability of the sound source type of the sound included in audio information is calculated, and the digest sections that make up a digest of the audio information are determined from the audio information based on the sound source type score. It is therefore possible to generate a digest that meets diverse user needs, for example including only music in the digest, including only human voices, or including music and human voices in a balanced manner. User convenience can thus be further improved.
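As a minimal illustration of determining digest sections from sound source type scores, the following Python sketch picks the sections that score highest for one designated sound source type. It assumes the audio information has already been divided into fixed-length sections, each with a score per sound source type; the function name `pick_sections`, the dictionary layout, and the score values are hypothetical and are not taken from the specification.

```python
from typing import Dict, List


def pick_sections(scores: List[Dict[str, float]], source_type: str, count: int) -> List[int]:
    """Return the indices of the `count` sections scoring highest for `source_type`.

    The indices are returned in chronological order so that the digest
    plays back in the same order as the original recording.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i][source_type], reverse=True)
    return sorted(ranked[:count])


# Illustrative scores for four consecutive sections (made-up values).
example_scores = [
    {"music": 0.9, "voice": 0.1, "noise": 0.0},
    {"music": 0.2, "voice": 0.8, "noise": 0.0},
    {"music": 0.7, "voice": 0.3, "noise": 0.0},
    {"music": 0.1, "voice": 0.6, "noise": 0.3},
]
music_digest = pick_sections(example_scores, "music", 2)
voice_digest = pick_sections(example_scores, "voice", 2)
```

Changing only the `source_type` argument yields a music-only or a voice-only digest from the same scores, which is the kind of flexibility the summary describes.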

In addition, by setting a mode and adjusting as appropriate the sound source types of the sound included in the digest, a digest that better matches the user's intention can be generated. For example, by setting the mode appropriately, such as setting a low value for the proportion of the digest occupied by sound associated with the noise score in the multiple sound source mode, it is possible to generate a digest with reduced noise that is easier for the user to listen to.
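The effect of giving noise a low proportion in the multiple sound source mode can be sketched as follows. This is only an illustration under stated assumptions: the sections are fixed-length, the proportions are given as a hypothetical `ratios` dictionary, each quota is obtained by simple rounding (so the quotas may not always add up exactly to `count`), and a section already chosen for one sound source type is not reused for another. None of these details is prescribed by the specification.

```python
from typing import Dict, List


def pick_mixed_sections(
    scores: List[Dict[str, float]], ratios: Dict[str, float], count: int
) -> List[int]:
    """Pick about `count` sections, split across source types according to `ratios`."""
    chosen: List[int] = []
    for source_type, ratio in ratios.items():
        quota = round(ratio * count)  # number of sections allotted to this type
        candidates = [i for i in range(len(scores)) if i not in chosen]
        candidates.sort(key=lambda i: scores[i][source_type], reverse=True)
        chosen.extend(candidates[:quota])
    return sorted(chosen)  # chronological order for playback


# Illustrative scores for six consecutive sections (made-up values).
example_scores = [
    {"voice": 0.9, "music": 0.1, "noise": 0.0},
    {"voice": 0.2, "music": 0.7, "noise": 0.1},
    {"voice": 0.1, "music": 0.1, "noise": 0.8},
    {"voice": 0.8, "music": 0.1, "noise": 0.1},
    {"voice": 0.6, "music": 0.3, "noise": 0.1},
    {"voice": 0.3, "music": 0.6, "noise": 0.1},
]
# A noise proportion of zero keeps the noisy section (index 2) out of the digest.
low_noise_digest = pick_mixed_sections(
    example_scores, {"voice": 0.75, "music": 0.25, "noise": 0.0}, 4
)
```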

In general, an overview of video information can be conveyed to the user visually, for example by displaying a thumbnail. However, when sound is recorded with an audio recording device that mainly acquires audio information rather than video information (for example, an IC recorder, a smartphone with recording application software installed, or a wearable device that has no camera function or is used in a situation where the camera function cannot be used), the file name, recording date and time, and the like of the audio information can be displayed visually, but it is difficult for the user to grasp an overview of the audio information from such information. Moreover, even when video information is available together with the audio information, there are situations in which the display screen cannot be viewed and visual confirmation is not possible, for example during an event in a dark room where the user hesitates to turn on the backlight of the display screen.

このような場合、音声情報（又は、音声情報及び映像情報）の内容を把握するためには、ユーザは、実際に当該音声情報を試聴する必要がある。しかしながら、音声情報の時間長さが長い場合には、内容確認のために当該音声情報を一通り聞くことは、時間的な負荷が大きく、ユーザにとって大きな負担となる。 In such a case, in order to understand the content of the audio information (or the audio information and the video information), the user needs to actually listen to the audio information. However, when the time length of the voice information is long, listening to the voice information in order to confirm the content causes a large time load and a heavy burden on the user.

一方、本実施形態によれば、上述したように、ユーザの要望に沿った音声情報のダイジェストを作成することが可能になる。従って、例えば数秒のダイジェストを試聴するだけで音声情報の内容を把握することができ、これまでは多大な時間を要していた内容確認に掛かる時間を、大幅に短縮することができる。 On the other hand, according to the present embodiment, as described above, it becomes possible to create a digest of audio information that meets the user's request. Therefore, for example, the content of the voice information can be grasped only by listening to the digest for several seconds, and the time required for the content confirmation, which has required a great amount of time until now, can be greatly shortened.

Further, according to the present embodiment, a digest may be generated automatically for acquired audio information, for example by the device that recorded the sound or by another device that manages the audio information after it has been moved to storage. When a digest is generated automatically for the acquired audio information, the digest may be played back by a simple operation, for example an operation using a GUI, such as placing a pointer over a file name representing the audio information on the display screen, or a preview operation. The user can thus check the digest more casually, without performing troublesome operations.

The technology according to the present embodiment is also well suited to analyzing so-called big data. For example, by applying the technology according to the present embodiment to call records collected by call centers, investigative agencies, and the like, and generating digests of the call records, the contents of an enormous volume of call records can be checked in a shorter time. Analysis of the call records therefore becomes easier.

Even when video information is available together with the audio information, there may be situations in which it is difficult to grasp the content by a visual method such as thumbnails based on the video information. Such situations include, for example, a case where there are multiple files whose video is similar but whose audio differs greatly, a case where the video information cannot be used because of implementation constraints such as the processing speed of the device, and a case where the sound source does not appear in the video because the video comes from a fixed-point camera or the like (that is, a case where the speaker cannot be identified). The technology according to the present embodiment can also be suitably applied when video information cannot be used effectively to grasp the content in this way.

Furthermore, the technology according to the present embodiment is effective for easily grasping the content of the audio information that serves as source material before editing, in work that involves editing audio information, such as editing a moving image. For example, in recent years there have been services that generate and provide photographs with audio information, which combine a still image with sound. The technology according to the present embodiment can also be used effectively when editing the audio portion while generating a file in such a format that combines a still image with sound.

以上、添付図面を参照しながら本開示の好適な実施形態について詳細に説明したが、本開示の技術的範囲はかかる例に限定されない。本開示の技術分野における通常の知識を有する者であれば、特許請求の範囲に記載された技術的思想の範疇内において、各種の変更例または修正例に想到し得ることは明らかであり、これらについても、当然に本開示の技術的範囲に属するものと了解される。 Although the preferred embodiments of the present disclosure have been described above in detail with reference to the accompanying drawings, the technical scope of the present disclosure is not limited to such examples. It is obvious that a person having ordinary knowledge in the technical field of the present disclosure can conceive various changes or modifications within the scope of the technical idea described in the claims. It is understood that the above also naturally belongs to the technical scope of the present disclosure.

また、本明細書に記載された効果は、あくまで説明的又は例示的なものであって限定的なものではない。つまり、本開示に係る技術は、上記の効果とともに、又は上記の効果に代えて、本明細書の記載から当業者には明らかな他の効果を奏し得る。 Further, the effects described in the present specification are merely illustrative or exemplary, and are not limitative. That is, the technology according to the present disclosure may have other effects that are apparent to those skilled in the art from the description of the present specification, in addition to or instead of the above effects.

In this specification, expressions such as "less than or equal to" and "greater than" are used in the determination steps of each processing procedure, for example when a score is compared with a threshold. These expressions are merely examples and do not limit the boundary conditions of the determination. In the present embodiment, how the magnitude relationship is judged when a value such as a score is equal to the threshold may be set arbitrarily. In this specification, "less than or equal to" may be read as "less than" and vice versa, and "greater than" may be read as "greater than or equal to" and vice versa.
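One way to keep the boundary condition configurable, as this paragraph allows, is to make the treatment of equality an explicit parameter of the comparison. The following Python sketch is an assumption for illustration only; the function name `exceeds` and the `inclusive` flag do not appear in the specification.

```python
import operator


def exceeds(score: float, threshold: float, inclusive: bool = False) -> bool:
    """Compare a score with a threshold.

    With inclusive=False the test is "greater than"; with inclusive=True
    it is "greater than or equal to", so only the equal case changes.
    """
    compare = operator.ge if inclusive else operator.gt
    return compare(score, threshold)
```

With this arrangement, `exceeds(0.5, 0.5)` and `exceeds(0.5, 0.5, inclusive=True)` differ only in how a score exactly equal to the threshold is judged, while all other scores are handled identically.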

なお、以下のような構成も本開示の技術的範囲に属する。
（１）音声情報に含まれる音声の音源種別の蓋然性を示す音源種別スコアを算出する音源種別スコア算出部と、算出された前記音源種別スコアに基づいて、前記音声情報の中から、前記音声情報のダイジェストを構成するダイジェスト区間を決定するダイジェスト区間決定部と、を備える、情報処理装置。
（２）前記音源種別スコアは、音楽らしさを示す音楽スコア、人の声らしさを示す声スコア及び雑音らしさを示すノイズスコアの少なくともいずれかを含む、前記（１）に記載の情報処理装置。
（３）前記声スコアは、男性の声らしさを示す男性声スコア、女性の声らしさを示す女性声スコア、子どもの声らしさを示す子ども声スコア、及び前記音声を発している特定の人物らしさを示す特定声スコアの少なくともいずれかを更に含む、前記（２）に記載の情報処理装置。
（４）前記音源種別スコア算出部は、前記音声情報の特徴を示す特徴量に基づいて、前記音源種別スコアを算出する、前記（１）〜（３）のいずれか１項に記載の情報処理装置。
（５）前記特徴量は、前記音声情報についての、パワー、スペクトル包絡形状、ゼロ交差数、ピッチ、ＭＦＣＣ、収音位置間での相関、及び音源方位の特性を示す物理量のうちの少なくとも１つを含む、前記（４）に記載の情報処理装置。
（６）前記ダイジェスト区間決定部は、生成する前記ダイジェストのモードに基づいて前記ダイジェストに含める前記音声の音源種別を決定し、前記音声情報の中で、決定した音源種別に係る前記音源種別スコアがより高い区間を、前記ダイジェスト区間として決定する、前記（１）〜（５）のいずれか１項に記載の情報処理装置。
（７）前記モードは、単一の音源種別の前記音声のみを含むように前記ダイジェストを生成する単一音源モード、複数の音源種別の前記音声を所定の割合で含むように前記ダイジェストを生成する複数音源モード、及び、同一の前記音源種別に分類される前記音声の中から多様な前記音声が含まれるように前記ダイジェストを生成する多様性反映モード、の少なくともいずれかから選択される、前記（６）に記載の情報処理装置。
（８）前記モードが前記単一音源モードである場合には、前記ダイジェスト区間決定部は、指定された一の音源種別に係る前記音源種別スコアがより高い区間を、前記ダイジェスト区間として決定する、前記（７）に記載の情報処理装置。
（９）前記モードが前記複数音源モードである場合には、前記ダイジェスト区間決定部は、前記ダイジェストに含める前記音声の時間長さを音源種別ごとに設定し、音源種別ごとに前記音源種別スコアがより高い区間であって当該区間の合計長さが設定した音源種別ごとの前記時間長さと略等しくなるような前記区間を、前記ダイジェスト区間として決定する、前記（７）に記載の情報処理装置。
（１０）前記モードが前記多様性反映モードである場合には、前記ダイジェスト区間決定部は、同一の音源種別内での前記音声情報の特徴を示す特徴量のばらつき及び同一の前記音源種別内での前記音声が発せられた時刻のばらつきを算出し、前記特徴量のばらつき及び前記時刻のばらつきがより大きくなるように、前記ダイジェスト区間を決定する、前記（７）に記載の情報処理装置。
（１１）前記ダイジェスト区間決定部は、前記音源種別スコアが所定のしきい値よりも高い第１の区間と、前記音源種別スコアが所定のしきい値よりも低い第２の区間と、が連続して存在しており、かつ、前記第２の区間の時間長さが所定の時間よりも短い場合には、前記第１及び第２の区間をともに含むように前記ダイジェスト区間を決定する、前記（６）〜（１０）のいずれか１項に記載の情報処理装置。
（１２）前記ダイジェスト区間決定部は、前記音源種別スコアが所定のしきい値よりも高い第１の区間の時間長さが、人にとって音声として認識できない長さである場合には、前記第１の区間を含まないように前記ダイジェスト区間を決定する、前記（６）〜（１１）のいずれか１項に記載の情報処理装置。
（１３）前記音源種別スコア算出部は、予め全てが取得されている前記音声情報について、前記音源種別スコアを算出し、前記ダイジェスト区間決定部は、予め全てが取得されている前記音声情報の前記ダイジェストを生成する、前記（１）〜（１２）のいずれか１項に記載の情報処理装置。
（１４）前記音源種別スコア算出部は、現在まさに取得され続けている前記音声情報について、前記ダイジェスト区間以下の長さの時間からなるスコア算出区間に対応する時間長さの音声情報が新たに取得される度に、前記スコア算出区間ごとに前記音源種別スコアを算出し、前記ダイジェスト区間決定部は、前記音声情報が取得されている間、前記音声情報の前記ダイジェストを随時更新しながら生成する、前記（１）〜（１２）のいずれか１項に記載の情報処理装置。
（１５）前記ダイジェスト区間決定部は、これまでに取得された前記音声情報の時間長さが、前記ダイジェストの時間長さの設定値よりも短い場合には、新たに取得された前記音声情報を前記ダイジェストに追加し、これまでに取得された前記音声情報の時間長さが、前記ダイジェストの時間長さの設定値以上である場合には、新たに取得された前記スコア算出区間分の前記音声情報を前記ダイジェストに追加するとともに、前記ダイジェストの中から前記スコア算出区間分の時間長さの区間であって前記音源種別スコアがより低い区間を削除する、前記（１４）に記載の情報処理装置。
（１６）外部の音声を収音する音声収音部、を更に備え、前記音声情報は、前記音声収音部によって収音された外部音声に係る音声情報である、前記（１）〜（１５）のいずれか１項に記載の情報処理装置。
（１７）データベース化された前記音声情報が保存される記憶部、を更に備え、前記音源種別スコア算出部は、データベース化された前記音声情報に対して音源種別スコアを算出し、前記ダイジェスト区間決定部は、データベース化された前記音声情報に対して前記ダイジェスト区間を決定する、前記（１）〜（１５）のいずれか１項に記載の情報処理装置。
（１８）前記音声情報と、前記ダイジェスト区間決定部によって決定されたダイジェスト区間についての情報と、に基づいて、前記音声情報のダイジェストを、音声出力機器で出力可能なデータ形式で生成する出力音声生成部、を更に備える、前記（１）〜（１７）のいずれか１項に記載の情報処理装置。
（１９）プロセッサが、音声情報に含まれる音声の音源種別の蓋然性を示す音源種別スコアを算出することと、算出された前記音源種別スコアに基づいて、前記音声情報の中から、前記音声情報のダイジェストを構成するダイジェスト区間を決定することと、を含む、情報処理方法。
（２０）コンピュータのプロセッサに、音声情報に含まれる音声の音源種別の蓋然性を示す音源種別スコアを算出する機能と、算出された前記音源種別スコアに基づいて、前記音声情報の中から、前記音声情報のダイジェストを構成するダイジェスト区間を決定する機能と、を実現させる、プログラム。
（２１）音声情報に含まれる音声の特徴量を抽出する特徴量抽出部と、前記特徴量に応じて算出される音源種別に基づいて、前記音声情報の中から、前記音声情報のダイジェストを構成するダイジェスト区間を決定するダイジェスト区間決定部と、を備え、前記ダイジェスト区間決定部は、予め設定されるモードに基づいて前記ダイジェストに含める前記音声の音源種別を決定し、前記予め設定されるモードには、少なくとも同一の前記音源種別に分類される前記音声の中から多様な前記音声が含まれるように前記ダイジェストを生成する多様性反映モードを有し、前記モードが前記多様性反映モードである場合には、前記ダイジェスト区間決定部は、同一の前記音源種別内での前記音声情報の特徴を示す特徴量及び/または同一の前記音源種別内での前記音声が発せられた時刻に基づいて、前記ダイジェスト区間を決定する、情報処理装置として機能させるためのプログラム。
Note that the following configurations also belong to the technical scope of the present disclosure.
(1) Based on the calculated sound source type score, a sound source type score calculation unit that calculates a sound source type score indicating the probability of the sound source type of the sound included in the sound information; An information processing apparatus, comprising: a digest section determining unit that determines a digest section that configures the digest of.
(2) The information processing device according to (1), wherein the sound source type score includes at least one of a music score indicating a music likeness, a voice score indicating a human voice likeness, and a noise score indicating a noise likeness.
(3) The voice score includes a male voice score indicating a male voice likelihood, a female voice score indicating a female voice likelihood, a child voice score indicating a child voice likelihood, and a specific person likelihood making the voice. The information processing apparatus according to (2), further including at least one of the specific voice scores shown.
(4) The information processing according to any one of (1) to (3), wherein the sound source type score calculation unit calculates the sound source type score based on a feature amount indicating a feature of the voice information. apparatus.
(5) The feature amount is at least one of a power, a spectrum envelope shape, a number of zero crossings, a pitch, an MFCC, a correlation between sound pickup positions, and a physical amount indicating a characteristic of a sound source direction for the voice information. The information processing apparatus according to (4), including:
(6) The digest section determination unit determines a sound source type of the sound to be included in the digest based on a mode of the generated digest, and in the sound information, the sound source type score related to the determined sound source type is The information processing apparatus according to any one of (1) to (5), wherein a higher section is determined as the digest section.
(7) The mode is a single sound source mode in which the digest is generated so as to include only the sound of a single sound source type, and the digest is generated so as to include the sound in a plurality of sound source types at a predetermined ratio. A plurality of sound source modes, and a diversity reflection mode in which the digest is generated so that various sounds are included from among the sounds classified into the same sound source type; The information processing device according to 6).
(8) The information processing device according to (7), wherein, when the mode is the single sound source mode, the digest section determination unit determines, as the digest section, a section in which the sound source type score for one designated sound source type is higher.
(9) The information processing apparatus according to (7), wherein, when the mode is the plural sound source mode, the digest section determination unit sets, for each sound source type, a time length of the voice to be included in the digest, and determines, as the digest section, sections in which the sound source type score for each sound source type is higher and whose total length is substantially equal to the time length set for that sound source type.
(10) The information processing apparatus according to (7), wherein, when the mode is the diversity reflection mode, the digest section determination unit calculates, within the same sound source type, the variation in feature amounts indicating features of the voice information and the variation in the times at which the voice is uttered, and determines the digest section so that the variation in the feature amounts and the variation in the times become larger.
(11) The information processing apparatus according to any one of (6) to (10), wherein, when a first section in which the sound source type score is higher than a predetermined threshold and a second section in which the sound source type score is lower than a predetermined threshold are consecutive, and the time length of the second section is shorter than a predetermined time, the digest section determination unit determines the digest section so as to include both the first and second sections.
(12) The information processing apparatus according to any one of (6) to (11), wherein, when the time length of a first section in which the sound source type score is higher than a predetermined threshold is too short for a person to recognize as voice, the digest section determination unit determines the digest section so as not to include the first section.
(13) The information processing apparatus according to any one of (1) to (12), wherein the sound source type score calculation unit calculates the sound source type score for voice information that has been acquired in its entirety in advance, and the digest section determination unit generates the digest of the voice information that has been acquired in its entirety in advance.
(14) The information processing apparatus according to any one of (1) to (12), wherein, for voice information that is currently being acquired, the sound source type score calculation unit calculates the sound source type score for each score calculation section every time voice information of a time length corresponding to the score calculation section, which is equal to or shorter than the digest section, is newly acquired, and the digest section determination unit generates the digest of the voice information while updating it as needed during acquisition of the voice information.
(15) The information processing device according to (14), wherein, when the time length of the voice information acquired so far is shorter than a set value of the time length of the digest, the digest section determination unit adds the newly acquired voice information to the digest, and when the time length of the voice information acquired so far is equal to or greater than the set value of the time length of the digest, the digest section determination unit adds the newly acquired voice information for the score calculation section to the digest and deletes from the digest a section that has a time length corresponding to the score calculation section and a lower sound source type score.
(16) A voice pickup unit that picks up an external voice is further provided, and the voice information is voice information relating to an external voice picked up by the voice pickup unit. ) The information processing device according to any one of 1).
(17) The information processing device according to any one of (1) to (15), further including a storage unit that stores the voice information organized into a database, wherein the sound source type score calculation unit calculates the sound source type score for the voice information stored in the database, and the digest section determination unit determines the digest section for the voice information stored in the database.
(18) The information processing apparatus according to any one of (1) to (17), further including an output voice generation unit that generates the digest of the voice information in a data format that can be output by a voice output device, based on the voice information and on information about the digest section determined by the digest section determination unit.
(19) An information processing method including: calculating, by a processor, a sound source type score indicating the probability of the sound source type of voice included in voice information; and determining, by the processor, from the voice information, a digest section constituting a digest of the voice information, based on the calculated sound source type score.
(20) A program that causes a processor of a computer to realize: a function of calculating a sound source type score indicating the probability of the sound source type of voice included in voice information; and a function of determining, from the voice information, a digest section constituting a digest of the voice information, based on the calculated sound source type score.
(21) A program for causing a computer to function as an information processing device including: a feature amount extraction unit that extracts a feature amount of voice included in voice information; and a digest section determination unit that determines, from the voice information, a digest section constituting a digest of the voice information, based on a sound source type calculated according to the feature amount, wherein the digest section determination unit determines a sound source type of the voice to be included in the digest based on a preset mode, the preset mode includes at least a diversity reflection mode in which the digest is generated so as to include diverse voices from among the voices classified into the same sound source type, and, when the mode is the diversity reflection mode, the digest section determination unit determines the digest section based on a feature amount indicating a feature of the voice information within the same sound source type and/or a time at which the voice is uttered within the same sound source type.
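The real-time update described in (14) and (15) can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the `Section` type, the `update_digest` function, and all parameter names are hypothetical, and the sketch assumes that every score calculation section has the same length and that the newly added section is itself eligible for deletion if it has the lowest score.

```python
from dataclasses import dataclass


@dataclass
class Section:
    start: float   # start time in seconds (hypothetical field)
    length: float  # length of the score calculation section in seconds
    score: float   # sound source type score for the targeted type


def update_digest(digest, new_section, acquired_length, digest_length):
    """Return the digest after one newly acquired score calculation section.

    acquired_length is the time length of voice information acquired
    before new_section arrived; digest_length is the set value of the
    digest time length.
    """
    updated = digest + [new_section]
    if acquired_length >= digest_length:
        # Digest is already full: drop the section with the lowest score.
        # Assumption: the new section may be the one that is dropped.
        lowest = min(updated, key=lambda s: s.score)
        updated.remove(lowest)
    return sorted(updated, key=lambda s: s.start)


# Usage: five one-second sections arrive; the digest is limited to 3 s.
digest, acquired = [], 0.0
for i, score in enumerate([0.2, 0.9, 0.5, 0.7, 0.1]):
    section = Section(start=float(i), length=1.0, score=score)
    digest = update_digest(digest, section, acquired, digest_length=3.0)
    acquired += section.length
```

Because one section is added and one is removed once the digest is full, the digest length stays constant under the equal-length assumption.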

110, 120, 130, 140, 150 Information processing device
111 Feature amount extraction unit
113 Sound source type score calculation unit
115 Digest section determination unit
131 Voice pickup unit
141 Output voice generation unit
151 Voice information database (DB)

Claims

An information processing device comprising:
a feature amount extraction unit that extracts a feature amount of voice included in voice information; and
a digest section determination unit that determines, from the voice information, a digest section constituting a digest of the voice information, based on a sound source type calculated according to the feature amount,
wherein the digest section determination unit determines a sound source type of the voice to be included in the digest based on a preset mode, and the preset mode includes at least a diversity reflection mode in which the digest is generated so as to include diverse voices from among the voices classified into the same sound source type, and
when the mode is the diversity reflection mode, the digest section determination unit determines the digest section based on a feature amount indicating a feature of the voice information within the same sound source type and/or a time at which the voice is uttered within the same sound source type.

The information processing apparatus according to claim 1, wherein the digest section determination unit determines the digest section such that the variation in the feature amount and/or the variation in the time becomes larger.
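One way to picture selection that makes the variation in feature amounts and times larger is greedy farthest-point selection, sketched below in Python. This is an illustrative substitute, not the method of the claims: the function `select_diverse`, the candidate format, the per-dimension scaling, and the choice of the first candidate as the starting point are all assumptions.

```python
import math


def select_diverse(candidates, k):
    """Pick k candidate sections that are spread out in time and feature space.

    candidates: list of (time_sec, feature_vector) tuples, all assumed to
    belong to the same sound source type. Returns sorted candidate indices.
    """
    if k <= 0 or not candidates:
        return []
    # Combine time and features into one point per candidate.
    points = [[t, *features] for t, features in candidates]
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    # Scale every dimension to [0, 1] so that no single unit dominates.
    scaled = [
        [(p[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
         for d in range(dims)]
        for p in points
    ]
    chosen = [0]  # assumption: start from the first candidate
    while len(chosen) < min(k, len(scaled)):
        remaining = [i for i in range(len(scaled)) if i not in chosen]
        # Take the candidate farthest from everything already chosen.
        best = max(
            remaining,
            key=lambda i: min(math.dist(scaled[i], scaled[j]) for j in chosen),
        )
        chosen.append(best)
    return sorted(chosen)


# Usage: (time in seconds, [pitch in Hz]) for four voice sections.
sections = [(0.0, [100.0]), (1.0, [105.0]), (50.0, [220.0]), (100.0, [110.0])]
picked = select_diverse(sections, k=3)
```

In this example the second section is skipped because it is close to the first in both time and pitch.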

The information processing apparatus according to claim 1, wherein the digest section determination unit determines a sound source type of the voice to be included in the digest based on a preset mode, and the preset mode includes at least a single sound source mode in which the digest is generated so as to preferentially include one designated sound source type.

The information processing apparatus according to claim 3, wherein, when the mode is the single sound source mode, the digest section determination unit determines, as the digest section, a section having a higher sound source type score for the one designated sound source type.
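The single sound source mode can be sketched in Python as ranking sections by the score of the designated type and filling the digest up to its time length. The function `pick_single_source`, the dictionary layout of a section, and the rule of skipping sections that would overflow the digest are assumptions made for illustration only.

```python
def pick_single_source(sections, source_type, digest_length):
    """Select the highest-scoring sections for one designated source type.

    sections: list of dicts with 'start', 'length' (seconds) and 'scores'
    (a dict mapping sound source type to score). Hypothetical layout.
    """
    ranked = sorted(
        sections, key=lambda s: s["scores"][source_type], reverse=True
    )
    chosen, total = [], 0.0
    for section in ranked:
        # Skip sections that would make the digest longer than requested.
        if total + section["length"] > digest_length:
            continue
        chosen.append(section)
        total += section["length"]
    # Play back in chronological order.
    return sorted(chosen, key=lambda s: s["start"])


# Usage: four 10-second sections, 20-second digest of human voice.
sections = [
    {"start": 0.0, "length": 10.0, "scores": {"voice": 0.9}},
    {"start": 10.0, "length": 10.0, "scores": {"voice": 0.2}},
    {"start": 20.0, "length": 10.0, "scores": {"voice": 0.8}},
    {"start": 30.0, "length": 10.0, "scores": {"voice": 0.4}},
]
digest = pick_single_source(sections, "voice", digest_length=20.0)
```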

The information processing apparatus according to claim 1, wherein the digest section determination unit determines a sound source type of the voice to be included in the digest based on a preset mode, and the preset mode includes at least a plural sound source mode in which the digest is generated so as to include the voices of a plurality of sound source types at a predetermined ratio.

The information processing device according to claim 5, wherein, when the mode is the plural sound source mode, the digest section determination unit sets, for each sound source type, a time length of the voice to be included in the digest, and determines, as the digest section, sections that have a higher sound source type score for each sound source type and whose total length is substantially equal to the time length set for that sound source type.
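A minimal Python sketch of the plural sound source mode follows. It assumes a time length has already been set for each sound source type and fills each budget with that type's highest-scoring sections. The function `pick_plural_source`, the data layout, and the rule that a section is used for at most one type are hypothetical choices for illustration.

```python
def pick_plural_source(sections, lengths):
    """Select sections per sound source type within per-type time lengths.

    sections: list of dicts with 'start', 'length' (seconds) and 'scores'
    (dict mapping sound source type to score). Hypothetical layout.
    lengths: dict mapping sound source type to the time length (seconds)
    of that type to include in the digest.
    Returns a dict mapping each type to the start times chosen for it.
    """
    used = set()  # assumption: one section serves only one type
    result = {}
    for source_type, budget in lengths.items():
        ranked = sorted(
            range(len(sections)),
            key=lambda i: sections[i]["scores"][source_type],
            reverse=True,
        )
        total, starts = 0.0, []
        for i in ranked:
            if i in used or total + sections[i]["length"] > budget:
                continue
            used.add(i)
            total += sections[i]["length"]
            starts.append(sections[i]["start"])
        result[source_type] = sorted(starts)
    return result


# Usage: 20 s of voice and 10 s of music out of six 10-second sections.
scores = [(0.9, 0.1), (0.2, 0.8), (0.7, 0.3), (0.1, 0.9), (0.6, 0.2), (0.3, 0.4)]
sections = [
    {"start": 10.0 * i, "length": 10.0, "scores": {"voice": v, "music": m}}
    for i, (v, m) in enumerate(scores)
]
digest = pick_plural_source(sections, {"voice": 20.0, "music": 10.0})
```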

The information processing device according to claim 1, wherein the digest section determination unit determines a sound source type of the voice to be included in the digest based on a mode of the digest to be generated, and determines, as the digest section, a section of the voice information having a higher sound source type score for the determined sound source type.

The information processing device according to claim 7, wherein, when a first section in which the sound source type score is higher than a predetermined threshold and a second section in which the sound source type score is lower than a predetermined threshold are consecutive, and the time length of the second section is shorter than a predetermined time, the digest section determination unit determines the digest section so as to include both the first and second sections.

The information processing device according to claim 7, wherein, when the time length of a first section in which the sound source type score is higher than a predetermined threshold is too short for a person to recognize as voice, the digest section determination unit determines the digest section so as not to include the first section.
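The two adjustments above (bridging a short low-score section and dropping a high-score section that is too short to be recognized) can be sketched together in Python. The function `find_digest_sections`, the per-frame score list, the single shared threshold, and the order of the two steps (merge first, then discard) are assumptions; the claims do not fix any of them.

```python
def find_digest_sections(scores, frame_len, threshold, max_gap, min_len):
    """Turn per-frame scores into (start_sec, end_sec) digest sections.

    scores: sound source type score per frame; frame_len: seconds per frame.
    max_gap: low-score gaps shorter than this (seconds) are bridged.
    min_len: sections shorter than this (seconds) are discarded.
    """
    # Step 1: collect runs of consecutive frames above the threshold.
    runs, start = [], None
    for i, score in enumerate(scores):
        if score > threshold and start is None:
            start = i
        elif score <= threshold and start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(scores)])

    # Step 2: merge runs separated by a low-score gap shorter than max_gap.
    merged = []
    for run in runs:
        if merged and (run[0] - merged[-1][1]) * frame_len < max_gap:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # Step 3: drop sections too short to be recognized as voice.
    return [
        (a * frame_len, b * frame_len)
        for a, b in merged
        if (b - a) * frame_len >= min_len
    ]


# Usage: one-second frames; bridge gaps under 2 s, drop sections under 2 s.
frame_scores = [0.9, 0.9, 0.1, 0.9, 0.9, 0.1, 0.1, 0.1, 0.9, 0.1]
sections = find_digest_sections(frame_scores, 1.0, 0.5, max_gap=2.0, min_len=2.0)
```

Merging before discarding means a short burst can survive if it sits next to a longer section; the opposite order would be equally consistent with the text.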

The information processing device according to claim 1, further comprising a sound source type score calculation unit that calculates a sound source type score indicating the probability of the sound source type of the voice included in the voice information,
wherein the sound source type score calculation unit calculates the sound source type score based on the feature amount indicating the feature of the voice information.

The information processing device according to claim 10, wherein the sound source type score calculation unit calculates the sound source type score for voice information that has been acquired in its entirety in advance, and
the digest section determination unit generates the digest of the voice information that has been acquired in its entirety in advance.

The information processing device according to claim 10, wherein, for voice information that is currently being acquired, the sound source type score calculation unit calculates the sound source type score for each score calculation section every time voice information of a time length corresponding to the score calculation section, which is equal to or shorter than the digest section, is newly acquired, and
the digest section determination unit generates the digest of the voice information while updating it as needed during acquisition of the voice information.

The information processing apparatus according to claim 12, wherein, when the time length of the voice information acquired so far is shorter than a set value of the time length of the digest, the digest section determination unit adds the newly acquired voice information to the digest, and
when the time length of the voice information acquired so far is equal to or greater than the set value of the time length of the digest, the digest section determination unit adds the newly acquired voice information for the score calculation section to the digest and deletes from the digest a section that has a time length corresponding to the score calculation section and a lower sound source type score.

The information processing apparatus according to claim 10, further comprising a storage unit that stores the voice information in a database,
wherein the sound source type score calculation unit calculates the sound source type score for the voice information stored in the database, and
the digest section determination unit determines the digest section for the voice information stored in the database.

The sound source type score includes at least one of a music score indicating music likeness, a voice score indicating human voice likeness, and a noise score indicating noise likeness,
The information processing device according to any one of claims 7 to 13.

The information processing device according to claim 15, wherein the voice score further includes at least one of a male voice score indicating the likelihood of a male voice, a female voice score indicating the likelihood of a female voice, a child voice score indicating the likelihood of a child's voice, and a specific voice score indicating the likelihood that the voice was uttered by a specific person.

The feature amount includes at least one of a power, a spectrum envelope shape, a number of zero crossings, a pitch, an MFCC, a correlation between sound collection positions, and a physical amount indicating a characteristic of a sound source direction, for the voice information.
The information processing apparatus according to any one of claims 1 to 16.
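Two of the listed feature amounts, power and number of zero crossings, are simple enough to sketch in Python for a single frame of samples. The function names, the use of mean squared amplitude as "power", and the convention that a sample of exactly zero counts as non-negative are assumptions made for illustration.

```python
def frame_power(samples):
    """Mean squared amplitude of one frame (one common definition of power)."""
    return sum(x * x for x in samples) / len(samples)


def zero_crossings(samples):
    """Count sign changes between consecutive samples in one frame.

    Assumption: a sample equal to zero is treated as non-negative.
    """
    return sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )


# Usage: a frame that alternates between +1 and -1.
frame = [1.0, -1.0, 1.0, -1.0]
features = (frame_power(frame), zero_crossings(frame))
```

A high zero-crossing count at moderate power is often associated with noise-like or unvoiced sound, which is why such features are plausible inputs to a sound source type score.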

A voice pickup unit for picking up external voice is further provided,
The voice information is voice information about an external voice collected by the voice collecting unit.
The information processing apparatus according to any one of claims 1 to 17.

The information processing apparatus according to claim 1, further comprising an output voice generation unit that generates the digest of the voice information in a data format that can be output by a voice output device, based on the voice information and on information about the digest section determined by the digest section determination unit.

An information processing method comprising:
extracting a feature amount of voice included in voice information; and
determining, from the voice information, a digest section constituting a digest of the voice information, based on a sound source type calculated according to the feature amount,
wherein a sound source type of the voice to be included in the digest is determined based on a preset mode, and the preset mode includes at least a diversity reflection mode in which the digest is generated so as to include diverse voices from among the voices classified into the same sound source type, and
when the mode is the diversity reflection mode, the digest section is determined based on a feature amount indicating a feature of the voice information within the same sound source type and/or a time at which the voice is uttered within the same sound source type.