JP2016090774A

JP2016090774A - Information processing device, information processing method and program

Info

Publication number: JP2016090774A
Application number: JP2014224159A
Authority: JP
Inventors: 隆一難波; Ryuichi Nanba; 金章藤下; Kanaaki Fujishita
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2014-11-04
Filing date: 2014-11-04
Publication date: 2016-05-23
Anticipated expiration: 2034-11-04
Also published as: JP6413653B2

Abstract

PROBLEM TO BE SOLVED: To improve user convenience.

SOLUTION: An information processing device comprises a sound source type score calculation unit that calculates a sound source type score showing a probability of a sound source type of voice included in voice information, and a digest section determination unit that determines a digest section constituting a digest of the voice information, out of the voice information, on the basis of the calculated sound source type score.

SELECTED DRAWING: Figure 1

Description

The present disclosure relates to an information processing apparatus, an information processing method, and a program.

For information that has a certain time length, such as audio information and video information, there is a demand to grasp an overview of the content without viewing or listening to all of it. For example, Patent Document 1 discloses a technique that detects, from feature amounts indicating the features of audio information, a climax part, which is a noteworthy scene in the audio information, and assigns an index to that climax part. According to this technique, a digest of the audio information in which only the climax parts are extracted can be generated by reproducing only the indexed parts of the audio information.

Patent Document 1: JP 2004-191780 A

Consider, for example, generating a digest of audio information in which a meeting has been recorded. On the one hand, there may be a desire to include in the digest the lively scenes, that is, the scenes in which the discussion is contentious, in order to grasp an outline of the meeting. On the other hand, there may also be a desire to generate the digest so that it contains the voices of as many people as possible, in order to grasp who participated in the meeting. Thus, what a user wants from a digest varies with the purpose. Because the technique described in Patent Document 1 is specialized for detecting climax parts, it is considered difficult for that technique to meet such diverse user demands.

Therefore, the present disclosure proposes a new and improved information processing apparatus, information processing method, and program capable of further improving user convenience.

According to the present disclosure, there is provided an information processing apparatus including: a sound source type score calculation unit that calculates a sound source type score indicating the probability of a sound source type of sound included in audio information; and a digest section determination unit that determines, from the audio information and on the basis of the calculated sound source type score, a digest section constituting a digest of the audio information.

Further, according to the present disclosure, there is provided an information processing method including: calculating, by a processor, a sound source type score indicating the probability of a sound source type of sound included in audio information; and determining, from the audio information and on the basis of the calculated sound source type score, a digest section constituting a digest of the audio information.

Further, according to the present disclosure, there is provided a program that causes a processor of a computer to realize: a function of calculating a sound source type score indicating the probability of a sound source type of sound included in audio information; and a function of determining, from the audio information and on the basis of the calculated sound source type score, a digest section constituting a digest of the audio information.

According to the present disclosure, a sound source type score indicating the probability of a sound source type of sound included in audio information is calculated, and a digest section constituting a digest of the audio information is determined from the audio information on the basis of the sound source type score. It therefore becomes possible to generate digests that meet diverse user demands relating to sound source types, and user convenience can be further improved.

As described above, according to the present disclosure, user convenience can be further improved. Note that the above effect is not necessarily limiting. Together with or instead of the above effect, any of the effects described in this specification, or other effects that can be understood from this specification, may be achieved.

A functional block diagram showing an example of the functional configuration of the information processing apparatus according to the present embodiment.
A diagram showing an example of the sound source type score calculated by the sound source type score calculation unit.
An explanatory diagram for explaining the relationship between audio information and a digest.
A flowchart showing an example of the processing procedure of offline processing.
A flowchart showing an example of the processing procedure of digest section determination processing in the single sound source mode in offline processing.
A flowchart showing an example of the processing procedure of digest section determination processing in the single sound source mode in offline processing.
An explanatory diagram for explaining high score section determination processing in offline processing.
A flowchart showing an example of the processing procedure of high score section determination processing in offline processing.
A flowchart showing an example of the processing procedure of high score section determination processing in offline processing.
A flowchart showing an example of the processing procedure of digest section determination processing in the multiple sound source mode in offline processing.
A flowchart showing an example of the processing procedure of digest section determination processing in the multiple sound source mode in offline processing.
A functional block diagram showing an example of the functional configuration of an information processing apparatus that executes each process in the diversity reflection mode.
A flowchart showing an example of the processing procedure of digest section determination processing in the diversity reflection mode in offline processing.
A flowchart showing an example of the processing procedure of digest section determination processing in the diversity reflection mode in offline processing.
A flowchart showing an example of the processing procedure of digest section deletion processing based on diversity in offline processing.
A flowchart showing an example of the processing procedure of online processing.
A flowchart showing an example of the processing procedure of digest section determination processing in the single sound source mode in offline processing.
A flowchart showing an example of the processing procedure of frame deletion processing in the single sound source mode in online processing.
An explanatory diagram for explaining high score section determination processing in online processing.
A flowchart showing an example of the processing procedure of high score section determination processing in online processing.
A flowchart showing an example of the processing procedure of high score section determination processing in online processing.
A flowchart showing an example of the processing procedure of high score section determination processing in online processing.
A flowchart showing an example of the processing procedure of digest section determination processing in the multiple sound source mode in online processing.
A flowchart showing an example of the processing procedure of frame deletion processing in the multiple sound source mode in online processing.
A flowchart showing an example of the processing procedure of frame deletion processing in the diversity reflection mode in online processing.
A flowchart showing an example of the processing procedure of deletion frame selection processing based on diversity in online processing.
A functional block diagram showing an example of the functional configuration of an information processing apparatus according to a modification provided with a sound collection function.
A functional block diagram showing an example of the functional configuration of an information processing apparatus according to a modification provided with a digest generation function.
A functional block diagram showing an example of the functional configuration of an information processing apparatus according to a modification provided with an audio information database.
A block diagram showing an example of the hardware configuration of the information processing apparatus according to the present embodiment.

Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are denoted by the same reference numerals, and redundant description of them is omitted.

The description will be given in the following order.
1. Examination of existing technology
2. Device configuration
3. Details of offline processing
3-1. Overall processing procedure
3-2. Single sound source mode
3-2-1. Processing procedure of digest section determination processing
3-2-2. High score section determination processing
3-3. Multiple sound source mode
3-3-1. Processing procedure of digest section determination processing
3-4. Diversity reflection mode
3-4-1. Functional configuration
3-4-2. Processing procedure of digest section determination processing
3-4-4. Digest section deletion processing based on diversity
4. Details of online processing
4-1. Overall processing procedure
4-2. Single sound source mode
4-2-1. Digest section determination processing
4-2-2. Frame deletion processing
4-2-3. High score section determination processing
4-3. Multiple sound source mode
4-3-1. Processing procedure of digest section determination processing
4-3-2. Frame deletion processing
4-4. Diversity reflection mode
4-4-1. Processing procedure of frame deletion processing
4-4-2. Deletion frame selection processing based on diversity
5. Modifications
6. Hardware configuration
7. Summary

(1. Examination of existing technology)
Before describing a preferred embodiment of the present disclosure, the results of the present inventors' examination of existing general techniques are described, together with the background that led the present inventors to conceive of the present disclosure.

In general, techniques for generating a digest have been developed so that an overview of audio information, video information, and the like can be grasped easily. In particular, many techniques relating to video information have been proposed, such as generating a digest of a recorded television program. However, many techniques that generate a digest from video information presuppose a multimodal framework that uses both feature amounts calculated from the video and feature amounts calculated from the audio. Compared with video information, which carries a large amount of information, appropriately generating a digest of audio information on the basis of the audio information alone is considered more difficult.

For example, conceivable general methods of generating a digest of audio information include simply extracting the beginning, middle, and end of the audio information, and extracting sections with a high volume. In addition, some existing IC recorders have a function for reproducing the first 5 seconds of a selected audio file. However, a method that extracts predetermined sections regardless of the content of the audio information is highly likely to leave significant information out of the digest. A volume-based method, in turn, may include in the digest sections that are not necessarily useful, such as sections with loud noise.

Another technique for generating a digest of audio information is the one described in Patent Document 1 above. However, as described above, that technique is specialized for generating a digest by extracting climax parts. Because what a user wants to grasp from a digest is not necessarily limited to climax parts, it is difficult for that technique to meet the diverse demands that users place on digests.

The results of the present inventors' examination of existing general techniques have been described above. As described, for techniques that generate a digest of audio information, a more convenient technique capable of meeting diverse user demands has been desired. On the basis of the above examination of existing techniques, the present inventors intensively studied techniques capable of further improving user convenience and, as a result, arrived at the embodiment of the present disclosure described below. A preferred embodiment of the present disclosure conceived by the present inventors is described in detail below.

(2. Device configuration)
With reference to FIG. 1, the functional configuration of an information processing apparatus according to an embodiment of the present disclosure will be described. FIG. 1 is a functional block diagram showing an example of the functional configuration of the information processing apparatus according to the present embodiment.

Referring to FIG. 1, the information processing apparatus 110 according to the present embodiment has, as its functions, a feature amount extraction unit 111, a sound source type score calculation unit 113, and a digest section determination unit 115. The information processing apparatus 110 takes arbitrary audio information as input, determines digest sections, which are the sections of the audio information that constitute its digest, and outputs information about the digest sections (digest section information).

The source of the audio information input to the information processing apparatus 110 may be arbitrary. For example, the audio information may be stored in a storage unit (not shown) provided in the information processing apparatus 110, or may be input from an external device other than the information processing apparatus 110. Alternatively, when the information processing apparatus 110 has a sound collection unit that collects external sound, the audio information may be input via the sound collection unit (such a configuration is described in detail in (5-1. Modification provided with a sound collection function) below).

The feature amount extraction unit 111 extracts feature amounts of the audio information. Various physical quantities indicating the characteristics of the audio information can be calculated as the feature amounts. For example, power, spectral envelope shape, number of zero crossings, pitch (fundamental frequency), MFCC (Mel-Frequency Cepstrum Coefficients), and the like may be calculated. For audio information collected by microphones arranged at mutually different positions, the correlation between the sound collection positions may also be calculated as a feature amount, and the sound source direction may further be calculated on the basis of that correlation. The feature amount extraction unit 111 can calculate at least one of these feature amounts.

Because the various methods generally used in the analysis of audio information may be used for the feature amount extraction performed by the feature amount extraction unit 111, a detailed description of the specific processing is omitted. The feature amounts calculated by the feature amount extraction unit 111 are not limited to those listed above. The feature amount extraction unit 111 may calculate any of the various feature amounts that can generally be calculated in the analysis of audio information.

The feature amounts calculated by the feature amount extraction unit 111 can be expressed, for example, as a vector (feature amount vector) in a space (feature amount space) whose number of dimensions equals the number of types of calculated feature amounts. The feature amount extraction unit 111 provides information about the calculated feature amounts (that is, information about the feature amount vector) to the sound source type score calculation unit 113.
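As an illustration of the feature amount vector described above, the following is a minimal sketch that computes two of the listed feature amounts (power and number of zero crossings) for each frame. The function name `frame_features`, the fixed non-overlapping frame length, and the choice of these two features are assumptions made for illustration only. The disclosure does not prescribe a specific extraction method.

```python
from typing import List, Tuple


def frame_features(samples: List[float], frame_len: int) -> List[Tuple[float, int]]:
    """Return one (power, zero_crossings) feature vector per full frame.

    Hypothetical helper: frames are non-overlapping and a trailing
    partial frame is ignored.
    """
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Mean squared amplitude of the frame.
        power = sum(x * x for x in frame) / frame_len
        # Count sign changes between adjacent samples.
        zero_crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        features.append((power, zero_crossings))
    return features


signal = [1.0, -1.0, 1.0, -1.0, 0.5, 0.5, 0.5, 0.5]
vectors = frame_features(signal, frame_len=4)
```

Each tuple here is a point in a two-dimensional feature amount space. Adding further feature amounts such as pitch or MFCC would simply increase the number of dimensions.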

The sound source type score calculation unit 113 calculates, on the basis of the feature amounts extracted by the feature amount extraction unit 111, a sound source type score indicating the probability of a sound source type of the sound included in the audio information. Here, sound source types are categories into which sound sources are classified. For example, the sound source type scores include a music score indicating music-likeness, a voice score indicating human-voice-likeness, and/or a noise score indicating noise-likeness. When the voice score is calculated, more detailed scores may also be calculated, such as a male voice score indicating male-voice-likeness, a female voice score indicating female-voice-likeness, a child voice score indicating child-voice-likeness, and/or a specific voice score indicating the likelihood that the voice belongs to a specific person.

The sound source type score calculation unit 113 calculates at least one of the above sound source type scores for each predetermined section of the audio information. Hereinafter, the time unit for which the sound source type score calculation unit 113 calculates a sound source type score is referred to as a score calculation section. The score calculation section may be, for example, a section corresponding to a frame.

The various classifiers generally used in the analysis of audio information may be used to calculate the sound source type scores. Such a classifier, trained by machine learning for example, can calculate each sound source type score according to the feature amount vector of the audio information being analyzed, that is, according to its coordinates in the feature amount space. When it is difficult to train the classifier by machine learning in advance, the sound source type score calculation unit 113 can calculate the sound source type score according to the distance from the average speaker characteristics derived from past calculations. For example, the sound source type score calculation unit 113 outputs a higher sound source type score as the distance from the past speaker characteristics increases.
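The distance-based alternative mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the "average speaker characteristics" are approximated by the running mean of the feature amount vectors seen so far, the distance is Euclidean, and the first vector is given a score of 0 because no history exists yet. The function name `distance_scores` is hypothetical.

```python
import math
from typing import List, Sequence


def distance_scores(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Score each vector by its distance from the mean of earlier vectors.

    A larger distance from the past average yields a higher score.
    """
    scores: List[float] = []
    mean: List[float] = []
    count = 0  # number of vectors already folded into the mean
    for vec in vectors:
        if count == 0:
            scores.append(0.0)  # assumption: no history, so score 0
            mean = list(vec)
        else:
            dist = math.sqrt(sum((v - m) ** 2 for v, m in zip(vec, mean)))
            scores.append(dist)
            # Incremental update of the running mean.
            mean = [m + (v - m) / (count + 1) for m, v in zip(mean, vec)]
        count += 1
    return scores


scores = distance_scores([(0.0, 0.0), (3.0, 4.0), (1.5, 2.0)])
```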

FIG. 2 is a diagram showing an example of the sound source type scores calculated by the sound source type score calculation unit 113. In FIG. 2, the horizontal axis represents time within the audio information, the vertical axis represents the sound source type score calculated for each score calculation section, and the relationship between the two is plotted. In the example shown in FIG. 2, three types of sound source type scores are calculated by the sound source type score calculation unit 113.

The sound source type score calculation unit 113 provides information about the sound source type scores calculated for each score calculation section to the digest section determination unit 115.

On the basis of the sound source type scores calculated by the sound source type score calculation unit 113, the digest section determination unit 115 determines, from the audio information, digest sections, which are the time sections that constitute the digest of the audio information. The relationship between audio information and a digest is described here with reference to FIG. 3. FIG. 3 is an explanatory diagram for explaining the relationship between audio information and a digest.

As shown in FIG. 3, a digest is composed of at least one time section of the audio information. In the illustrated example, four time sections of the audio information (digest sections 1 to 4) are determined as the time sections constituting the digest (digest sections), and the digest is formed by joining these digest sections together.

In the following description, the time length of each digest section is referred to as the digest section length, and the time length of the digest is referred to as the digest length. The digest length is set in advance by the user, the designer of the information processing apparatus 110, or the like as the desired length of the digest, for example 1 minute. The digest sections are determined so that the sum of the digest section lengths substantially matches the digest length.
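The relationship between digest sections, digest section length, and digest length can be illustrated with a short sketch. It assumes, for illustration only, that time is measured in frame indices and that a digest section is a half-open interval `(start, end)`. The helper names `frames_to_sections` and `digest_length` are hypothetical.

```python
from typing import Iterable, List, Tuple


def frames_to_sections(frames: Iterable[int]) -> List[Tuple[int, int]]:
    """Merge selected frame indices into half-open sections (start, end)."""
    sections: List[Tuple[int, int]] = []
    for f in sorted(set(frames)):
        if sections and f == sections[-1][1]:
            # Frame is adjacent to the current section, so extend it.
            sections[-1] = (sections[-1][0], f + 1)
        else:
            sections.append((f, f + 1))
    return sections


def digest_length(sections: List[Tuple[int, int]]) -> int:
    """Sum of the digest section lengths, in frames."""
    return sum(end - start for start, end in sections)


sections = frames_to_sections([0, 1, 2, 5, 6, 9])
total = digest_length(sections)
```

Joining the sections in time order corresponds to the concatenation shown in FIG. 3, and `total` is the quantity that should substantially match the preset digest length.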

Basically, the digest section determination unit 115 determines, as digest sections, the time sections of the audio information in which the sound source type score is higher. However, as shown in FIG. 2, a plurality of sound source type scores can be calculated independently for the audio information. It is therefore necessary to set in advance which sound source type score is used to determine the digest sections.

Which sound source type score should be given priority in determining the digest sections can vary with what the user wants. For example, for a user who wants to extract only male voices from the audio information, it is desirable to focus on the male voice score and to determine the time sections with a higher male voice score as the digest sections. For a user who wants to extract the various sounds included in the audio information evenly, it is desirable that, for each sound source type, the time sections with a high score for that type be determined as digest sections in a well-balanced manner.

Therefore, in the present embodiment, a mode is set for the digest to be generated, and the digest section determination unit 115 determines the digest sections in accordance with the set mode. A predetermined mode may be set in advance, or the mode may be switched freely in response to an operation input by the user via an input unit (not shown) of the information processing apparatus 110. Mode information indicating the set mode is input to the digest section determination unit 115. The digest section determination unit 115 can determine, on the basis of the set mode, the sound source types of the sound to be included in the digest, and can determine, as digest sections, the sections of the audio information in which the sound source type scores for the determined sound source types are higher.

For example, the modes include a single sound source mode, in which the digest is generated so as to include only sound of a single sound source type; a multiple sound source mode, in which the digest is generated so as to include sound of a plurality of sound source types at a predetermined ratio; and/or a diversity reflection mode, in which the digest is generated so as to include a variety of sounds from among the sounds classified into the same sound source type.

When the mode is the single sound source mode, the mode information includes information designating the sound source type to be included preferentially in the digest. In this case, the digest section determination unit 115 determines, as digest sections, the sections in which the sound source type score for the one designated sound source type is higher.
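A deliberately simplified sketch of the single sound source mode follows. It assumes that scores are given per frame for the designated sound source type and that the digest length is expressed as a number of frames. It greedily keeps the highest-scoring frames, which is a stand-in chosen for illustration and not the high score section determination processing that the disclosure describes in later sections. The name `pick_single_source_frames` is hypothetical.

```python
from typing import List


def pick_single_source_frames(scores: List[float], digest_frames: int) -> List[int]:
    """Return, in time order, the indices of the highest-scoring frames.

    At most `digest_frames` frames are kept, so the selection never
    exceeds the preset digest length.
    """
    # sorted() is stable, so equal scores keep their original time order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:digest_frames])


male_voice_scores = [0.1, 0.9, 0.8, 0.2, 0.7]
selected = pick_single_source_frames(male_voice_scores, digest_frames=3)
```

Selecting isolated frames tends to produce a choppy digest, which is one reason a practical implementation would operate on longer sections rather than single frames.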

When the mode is the multiple sound source mode, the mode information includes information designating the ratio of the sound source types to be included in the digest. In this mode, the digest section determination unit 115 sets, based on the designated ratio, the time length of audio to be included in the digest for each sound source type. It then determines, as digest sections, those sections that have a higher sound source type score for each sound source type and whose total length does not exceed the time length set for that sound source type.
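The allocation described for the multiple sound source mode can be sketched as follows. This is only an illustrative sketch under stated assumptions: the names `allocate_lengths` and `pick_sections`, the tuple layout of a section, and the greedy selection by descending score are all hypothetical and are not taken from the embodiment.

```python
from typing import Dict, List, Tuple

# Hypothetical section representation: (start_sec, end_sec, mean_score)
Section = Tuple[float, float, float]


def allocate_lengths(digest_len: float, ratios: Dict[str, float]) -> Dict[str, float]:
    """Split the overall digest length among sound source types by ratio."""
    total = sum(ratios.values())
    return {name: digest_len * r / total for name, r in ratios.items()}


def pick_sections(candidates: List[Section], budget: float) -> List[Section]:
    """Greedily take the highest-scoring sections whose total length
    stays at or below the time budget of one sound source type."""
    chosen: List[Section] = []
    used = 0.0
    for start, end, score in sorted(candidates, key=lambda s: s[2], reverse=True):
        length = end - start
        if used + length <= budget:
            chosen.append((start, end, score))
            used += length
    return sorted(chosen)  # restore chronological order


budgets = allocate_lengths(60, {"music": 2, "voice": 1})
music = pick_sections([(0, 10, 0.9), (20, 35, 0.8), (40, 45, 0.7)], 16)
```

With a 60-second digest and a 2:1 ratio, music receives 40 seconds and voice 20 seconds; the selection is then run once per sound source type against its own budget.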

The ratio can be designated by the user as appropriate as mode information. This allows the user to select the sound source types to be preferentially included in the digest according to their own needs. Conversely, the user can also set a low ratio for types of sound that should not be included in the digest, such as noise.

Note that the ratio of the sound source types to be included in the digest need not be input from outside as mode information; it may instead be set automatically by the information processing apparatus 110. For example, for each sound source type, the total time length of the sections in which the sound source type score is relatively high may be calculated, the ratio may be determined as the proportion of these totals among the sound source types, and the per-type digest length may be determined accordingly. A ratio determined in this way can reflect the appearance probability of audio of each sound source type within the audio information.
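As a minimal sketch of the automatic setting described above, the ratio can be computed from the total duration of the relatively high-score sections of each sound source type. The function name `auto_ratios` and the input layout (a mapping from type name to a list of `(start, end)` pairs) are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def auto_ratios(high_sections: Dict[str, List[Tuple[float, float]]]) -> Dict[str, float]:
    """Return, for each sound source type, its share of the total
    duration of high-score sections across all types."""
    totals = {name: sum(end - start for start, end in secs)
              for name, secs in high_sections.items()}
    grand_total = sum(totals.values())
    if grand_total == 0:
        # No high-score sections at all: no type gets a share.
        return {name: 0.0 for name in totals}
    return {name: t / grand_total for name, t in totals.items()}


ratios = auto_ratios({"music": [(0, 30)], "voice": [(30, 40), (50, 60)]})
```

Here music accounts for 30 of 50 seconds and voice for 20 of 50 seconds, so the resulting shares are 0.6 and 0.4.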

また、モードが多様性反映モードである場合には、ダイジェスト区間決定部１１５は、同一の音源種別内での特徴量のばらつき及び同一の音源種別内での音声が発せられた時刻のばらつきを算出し、当該特徴量のばらつき及び当該時刻のばらつきがより大きくなるように、ダイジェスト区間を決定する。 In addition, when the mode is the diversity reflection mode, the digest section determination unit 115 calculates the variation of the feature amount within the same sound source type and the variation of the time when the sound is emitted within the same sound source type. Then, the digest section is determined so that the variation in the feature amount and the variation in the time become larger.

For example, even audio classified into the same sound source type from the viewpoint of the sound source type score may actually be the voices of different persons. When the digest sections are determined so that the variation of the feature amounts within the same sound source type becomes larger, the digest includes sounds that are classified into the same sound source type by score but whose feature amounts differ relatively widely, so that a greater variety of sounds is included in the digest.

Also, for example, even when utterances are classified into the same sound source type from the viewpoint of the sound source type score and are likely to be the voice of the same person, utterances made at separate times may differ completely in content. When the digest sections are determined so that the variation in the times at which sounds within the same sound source type were emitted becomes larger, the digest includes sounds that are classified into the same sound source type by score but were emitted at widely separated times, so that audio with more diverse content is included in the digest.
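One possible way to favor larger variation in both feature amount and time is a greedy selection, sketched below. This is not the procedure of the embodiment, which is described in its own later sections; the scalar feature value, the population variance as the measure of variation, the equal weights, and the names `spread` and `pick_diverse` are all assumptions.

```python
from statistics import pvariance
from typing import List, Tuple

# Hypothetical candidate: (time_sec, feature_value) for one section
Candidate = Tuple[float, float]


def spread(items: List[Candidate], w_time: float = 1.0, w_feat: float = 1.0) -> float:
    """Weighted sum of the variance of times and the variance of features."""
    if len(items) < 2:
        return 0.0
    times = [t for t, _ in items]
    feats = [f for _, f in items]
    return w_time * pvariance(times) + w_feat * pvariance(feats)


def pick_diverse(candidates: List[Candidate], k: int) -> List[Candidate]:
    """Start from the first candidate and repeatedly add the candidate
    that makes the combined variation of the chosen set largest."""
    if not candidates or k <= 0:
        return []
    chosen = [candidates[0]]
    remaining = list(candidates[1:])
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda c: spread(chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
    return sorted(chosen)


picked = pick_diverse([(0.0, 1.0), (1.0, 1.1), (100.0, 5.0)], 2)
```

In this example the second pick is the section at 100 seconds, which differs most from the first in both time and feature value. In practice the two variances would need to be put on comparable scales before being summed.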

More specific details of the digest section determination process in each of the single sound source mode, the multiple sound source mode, and the diversity reflection mode are described in (3-2. Single sound source mode), (3-3. Multiple sound source mode), (3-4. Diversity reflection mode), (4-2. Single sound source mode), (4-3. Multiple sound source mode), and (4-4. Diversity reflection mode) below.

ダイジェスト区間決定部１１５は、ダイジェスト区間を決定すると、決定したダイジェスト区間についての情報（ダイジェスト区間情報）を出力する。ダイジェスト区間情報は、例えば、ダイジェスト区間の開始時刻、終了時刻、ダイジェスト区間長、ダイジェスト区間に付されるインデックス（ダイジェスト区間インデックス）等についての情報を含む。つまり、ダイジェスト区間情報は、音声情報内でのダイジェスト区間の位置を特定するための情報であり、音声情報及びダイジェスト区間情報に基づいてダイジェストが生成され得る。 When the digest section determination unit 115 determines the digest section, it outputs information about the determined digest section (digest section information). The digest section information includes, for example, information on the start time and end time of the digest section, the digest section length, the index (digest section index) attached to the digest section, and the like. That is, the digest section information is information for specifying the position of the digest section in the voice information, and a digest can be generated based on the voice information and the digest section information.
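The digest section information and its use can be illustrated with a small sketch. The class name `DigestSection`, its field names, and the function `build_digest` are hypothetical; the text only states that start time, end time, section length, and an index may be included, and that a digest can be generated from the audio information together with this information.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DigestSection:
    index: int     # digest section index
    start: float   # start time in seconds
    end: float     # end time in seconds

    @property
    def length(self) -> float:
        """Digest section length in seconds."""
        return self.end - self.start


def build_digest(samples: Sequence[int], sample_rate: int,
                 sections: List[DigestSection]) -> List[int]:
    """Concatenate the samples of each digest section in time order."""
    digest: List[int] = []
    for sec in sorted(sections, key=lambda s: s.start):
        first = int(sec.start * sample_rate)
        last = int(sec.end * sample_rate)
        digest.extend(samples[first:last])
    return digest


sections = [DigestSection(1, 6, 8), DigestSection(0, 1, 3)]
digest = build_digest(list(range(10)), 1, sections)
```

Because the section information alone identifies the positions in the audio, the digest itself can be produced later or on a different device.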

ダイジェスト区間決定部１１５によるダイジェスト区間情報の出力先は任意であってよい。例えば、ダイジェスト区間決定部１１５は、情報処理装置１１０に設けられる記憶部（図示せず）にダイジェスト区間情報を出力してもよいし、情報処理装置１１０とは異なる外部の機器にダイジェスト区間情報を出力してもよい。 The output destination of the digest section information by the digest section determination unit 115 may be arbitrary. For example, the digest section determination unit 115 may output the digest section information to a storage unit (not shown) provided in the information processing apparatus 110, or the digest section information may be output to an external device different from the information processing apparatus 110. It may be output.

When the digest section information is stored in the information processing apparatus 110, the information processing apparatus 110 may further have a function of generating a digest based on the digest section information and the audio information (such a configuration is described in detail in (5-2. Modification in which a digest generation function is provided) below). When the digest section information is output to an external device, the external device may have a function of generating a digest based on the digest section information and the audio information. Thus, in the present embodiment, the information processing apparatus 110 is configured to have at least a function of generating digest section information, and the function of subsequently generating the actual digest need not be provided in the information processing apparatus 110.

The functional configuration of the information processing apparatus according to the present embodiment has been described above with reference to FIG. 1. As described above, according to the present embodiment, a sound source type score indicating the probability of the sound source type of the audio included in the audio information is calculated, and the digest sections constituting the digest of the audio information are determined from within the audio information based on the sound source type score. It therefore becomes possible to generate digests that meet diverse user requests, for example, including only music in the digest, including only human voices in the digest, or including music and human voices in the digest in a balanced manner. Note that the series of processes by the feature amount extraction unit 111, the sound source type score calculation unit 113, and the digest section determination unit 115 may be started in response to a user instruction via an input unit (not shown), or processing of the audio information may start automatically when the audio information is input to the information processing apparatus 110.

ここで、情報処理装置１１０の具体的な装置構成は任意であってよい。例えば、情報処理装置１１０は、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）やＤＳＰ（ＤｉｇｉｔａｌＳｉｇｎａｌＰｒｏｃｅｓｓｏｒ）、ＡＳＩＣ（ＡｐｐｌｉｃａｔｉｏｎＳｐｅｃｉｆｉｃＩｎｔｅｇｒａｔｅｄＣｉｒｃｕｉｔ）等の各種のプロセッサであってよい。あるいは、情報処理装置１１０は、各種のプロセッサが実装されたＰＣやサーバ、スマートフォン、タブレットＰＣ等の装置であってよい。また、あるいは、情報処理装置１１０は、ＩＣレコーダー等の収音、録音機能を有する装置であってもよい。各種のプロセッサが所定のプログラムに従って動作することにより、図１に示す情報処理装置１１０の機能が実行され得る。 Here, the specific apparatus configuration of the information processing apparatus 110 may be arbitrary. For example, the information processing apparatus 110 may be a variety of processors such as a CPU (Central Processing Unit), a DSP (Digital Signal Processor), and an ASIC (Application Specific Integrated Circuit). Alternatively, the information processing apparatus 110 may be an apparatus such as a PC, a server, a smartphone, or a tablet PC on which various processors are mounted. Alternatively, the information processing apparatus 110 may be an apparatus having a sound collecting and recording function such as an IC recorder. Functions of the information processing apparatus 110 shown in FIG. 1 can be executed by various processors operating according to a predetermined program.

Also, for example, the functions of the information processing apparatus 110 (the feature amount extraction unit 111, the sound source type score calculation unit 113, and the digest section determination unit 115) need not all be executed by a single device. For example, the functions corresponding to the feature amount extraction unit 111, the sound source type score calculation unit 113, and the digest section determination unit 115 may be distributed across a plurality of information processing apparatuses (for example, a plurality of processors), and the functions of the information processing apparatus 110 described above may be realized by these apparatuses being communicably connected to one another and operating in cooperation. Further, the information processing apparatus 110 may be a local information processing apparatus operated directly by the user, or may be an information processing apparatus on a so-called cloud that is connected to the user's terminal via a network. For example, when a user terminal such as a smartphone or an IC recorder has a recording function, audio information recorded by the terminal may be transmitted from the terminal to the information processing apparatus 110 on the cloud, the information processing apparatus 110 may perform the various processes described above on the audio information, and the resulting digest section information or the audio information of the digest may be transmitted from the information processing apparatus 110 to the terminal.

なお、上述のような本実施形態に係る情報処理装置１１０の各機能を実現するためのコンピュータプログラムを作製し、ＰＣ等に実装することが可能である。また、このようなコンピュータプログラムが格納された、コンピュータで読み取り可能な記録媒体も提供することができる。記録媒体は、例えば、磁気ディスク、光ディスク、光磁気ディスク、フラッシュメモリ等である。また、上記のコンピュータプログラムは、記録媒体を用いずに、例えばネットワークを介して配信されてもよい。 Note that a computer program for realizing each function of the information processing apparatus 110 according to the present embodiment as described above can be produced and mounted on a PC or the like. In addition, a computer-readable recording medium storing such a computer program can be provided. The recording medium is, for example, a magnetic disk, an optical disk, a magneto-optical disk, a flash memory, or the like. Further, the above computer program may be distributed via a network, for example, without using a recording medium.

以下、情報処理装置１１０によって実行される処理についてより詳細に説明する。ここで、本実施形態では、情報処理装置１１０が行う処理を、その処理形態から大きく２つに分けることができる。一方の処理では、情報処理装置１１０は、予めその全てが取得されている音声情報に対して、特徴量抽出処理、音源種別スコア算出処理及びダイジェスト区間決定処理を行う。以下、このような処理のことをオフライン処理と呼ぶ。 Hereinafter, the process executed by the information processing apparatus 110 will be described in more detail. Here, in the present embodiment, the processing performed by the information processing apparatus 110 can be broadly divided into two according to the processing mode. In one process, the information processing apparatus 110 performs a feature amount extraction process, a sound source type score calculation process, and a digest section determination process on audio information that has been acquired in advance. Hereinafter, such processing is referred to as offline processing.

一方、他方の処理では、情報処理装置１１０は、現在まさに取得され続けている音声情報に対して、特徴量抽出処理、音源種別スコア算出処理及びダイジェスト区間決定処理を随時行う。この場合には、音声情報が取得され続けている間、ダイジェスト区間情報が随時更新されることとなる。以下、このような処理のことをオンライン処理と呼ぶ。 On the other hand, in the other process, the information processing apparatus 110 performs a feature amount extraction process, a sound source type score calculation process, and a digest section determination process as needed for the audio information that is just being acquired. In this case, while the voice information is continuously acquired, the digest section information is updated as needed. Hereinafter, such processing is referred to as online processing.

オフライン処理とオンライン処理とでは、その詳細な処理内容が異なるものとなり得る。そこで、以下では、オフライン処理及びオンライン処理のそれぞれについて、その詳細な処理内容について説明する。また、オフライン処理及びオンライン処理のそれぞれについて、上述したモードに応じて、ダイジェスト区間決定処理の詳細な処理内容が異なるものとなり得る。そこで、以下では、オフライン処理及びオンライン処理のそれぞれについて、モードに応じたダイジェスト区間決定処理の詳細な処理内容について説明する。 The detailed processing contents can be different between the offline processing and the online processing. Therefore, in the following, detailed processing contents of each of the offline processing and the online processing will be described. In addition, for each of the offline processing and the online processing, the detailed processing content of the digest section determination processing may be different depending on the mode described above. Therefore, in the following, detailed processing contents of the digest section determination processing according to the mode will be described for each of the offline processing and the online processing.

なお、以下の説明では、一例として、スコア算出区間がフレーム区間である場合について説明する。つまり、フレームごとに音源種別スコアが算出される場合について説明する。ただし、本実施形態はかかる例に限定されず、複数のフレームからなる区間がスコア算出区間として設定されてもよい。また、以下の説明では、簡単のため、音源種別スコアのことを単にスコアと呼ぶ場合がある。 In the following description, a case where the score calculation section is a frame section will be described as an example. That is, a case where the sound source type score is calculated for each frame will be described. However, the present embodiment is not limited to such an example, and a section including a plurality of frames may be set as the score calculation section. In the following description, for the sake of simplicity, the sound source type score may be simply referred to as a score.

(3. Details of offline processing)
(3-1. Overall processing procedure)
The processing procedure of offline processing will be described with reference to FIG. 4. FIG. 4 is a flowchart illustrating an example of the processing procedure of offline processing. The processing procedure shown in FIG. 4 corresponds to the overall procedure of the information processing method executed by the information processing apparatus 110 shown in FIG. 1 during offline processing. In offline processing, the scores of all frames of the audio information are calculated first, and the digest sections are then determined from the audio information based on those scores.

図４を参照すると、オフライン処理では、まず、音声情報の特徴量が抽出される（ステップＳ１０１）。ステップＳ１０１に示す処理では、音声情報の特徴量として、例えばパワーやスペクトル包絡形状等、音声情報の特性を示す各種の物理量が算出される。ステップＳ１０１に示す処理は、例えば図１に示す特徴量抽出部１１１によって行われる処理に対応している。 Referring to FIG. 4, in the off-line processing, first, feature values of audio information are extracted (step S101). In the process shown in step S101, various physical quantities indicating the characteristics of the speech information such as power and spectrum envelope shape are calculated as the feature quantities of the speech information. The process shown in step S101 corresponds to, for example, the process performed by the feature amount extraction unit 111 shown in FIG.

Next, the sound source type score of each frame is calculated based on the extracted feature amounts (step S103). In the process shown in step S103, for example, a discriminator that identifies the sound source type of audio according to the feature amounts of the audio information calculates, for each frame, a sound source type score indicating the probability of the sound source type of that audio. At this time, a plurality of kinds of sound source type scores, such as an audio score, a voice score, and a noise score, may be calculated. The process shown in step S103 corresponds, for example, to the process performed by the sound source type score calculation unit 113 shown in FIG. 1.

When the score calculation section is not a single frame section but consists of a plurality of frame sections, a process of smoothing the sound source type scores of the individual frames to calculate a sound source type score for the score calculation section may be performed in step S103.
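The smoothing mentioned here could, for instance, be a simple average over each group of frames. The text does not specify the smoothing method, so the block-wise mean below and the name `smooth_scores` are assumptions made purely for illustration.

```python
from typing import List


def smooth_scores(frame_scores: List[float], frames_per_section: int) -> List[float]:
    """Average per-frame scores over consecutive, non-overlapping blocks
    so that each score calculation section gets one score."""
    section_scores: List[float] = []
    for i in range(0, len(frame_scores), frames_per_section):
        block = frame_scores[i:i + frames_per_section]
        section_scores.append(sum(block) / len(block))
    return section_scores


section_scores = smooth_scores([1.0, 3.0, 2.0, 4.0, 5.0], 2)
```

The last block may contain fewer frames than the others; it is averaged over the frames it actually has.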

Next, the digest sections are determined from the audio information based on the calculated sound source type scores (step S105). For example, in the process shown in step S105, time sections of the audio information with higher sound source type scores are determined as digest sections. Because the specific processing of step S105 differs depending on the mode, its details are described for each mode in (3-2. Single sound source mode), (3-3. Multiple sound source mode), and (3-4. Diversity reflection mode) below. Digest section information for the determined digest sections is then output, and the series of processes ends. Note that the process shown in step S105 corresponds, for example, to the process performed by the digest section determination unit 115 shown in FIG. 1.

以上、図４を参照して、オフライン処理の処理手順について説明した。 The processing procedure of the offline processing has been described above with reference to FIG.

(3-2. Single sound source mode)
(3-2-1. Processing procedure of digest section determination process)
In the single sound source mode, one sound source type is designated, and sections with a higher sound source type score for the designated sound source type are determined as digest sections.

With reference to FIGS. 5 and 6, the processing procedure of the digest section determination process in the single sound source mode in offline processing will be described. FIGS. 5 and 6 are flowcharts illustrating an example of this processing procedure.

図５及び図６を参照すると、オフライン処理における単一音源モードでのダイジェスト区間決定処理では、まず、スコア閾値上限値としてスコア閾値理論上限値が設定される（ステップＳ２０１）。次いで、スコア閾値上限値よりも低い値としてスコア閾値が設定される（ステップＳ２０３）。 Referring to FIGS. 5 and 6, in the digest section determination process in the single sound source mode in the offline process, first, the score threshold theoretical upper limit value is set as the score threshold upper limit value (step S201). Next, the score threshold is set as a value lower than the score threshold upper limit (step S203).

ここで、詳しくは後述するが、ダイジェスト区間決定処理では、音声情報の中からよりスコアの高い区間（高スコア区間）をダイジェスト区間として決定する処理（ステップＳ２０５に示す高スコア区間決定処理）が行われ、その後、それらのダイジェスト区間の時間長さ（ダイジェスト区間長）の合計がダイジェスト長に適合するように、ダイジェスト区間長の長さやダイジェスト区間の数が調整される。 Here, as will be described in detail later, in the digest section determination process, a process (a high score section determination process shown in step S205) of determining a section having a higher score (high score section) from the speech information as a digest section is performed. Thereafter, the length of the digest section and the number of digest sections are adjusted so that the sum of the time lengths of the digest sections (digest section length) matches the digest length.

The score threshold is the threshold used in the high score section determination process to decide whether each frame is included in a high score section (that is, whether it is included in a digest section). As in steps S213 and S219 described later, the score threshold is changed as appropriate during the digest section determination process in order to adjust the total digest section length to the digest length. If the score threshold is changed to a lower value, the number of frames included in the digest sections increases and the digest section length becomes longer. Conversely, if the score threshold is changed to a higher value, the number of frames included in the digest sections decreases and the digest section length becomes shorter.

The score threshold upper limit value defines the upper limit of the score threshold when it is changed. If the score threshold becomes too high, the number of frames included in the digest sections decreases, and the total digest section length may fall far short of the digest length. The score threshold upper limit value is set to prevent this situation (see the process shown in step S217 described later).

スコアしきい値理論上限値は、例えば、スコアの計算に用いられた識別器の性能等に応じて設定される、スコアが取り得る理論上の上限値である。上記のように、ステップＳ２０１において、スコア閾値上限値の初期値として、スコアしきい値理論上限値が設定される。 The score threshold theoretical upper limit value is a theoretical upper limit value that can be taken by the score, which is set according to, for example, the performance of the discriminator used for calculating the score. As described above, in step S201, the score threshold theoretical upper limit is set as the initial value of the score threshold upper limit.

After the processes shown in steps S201 and S203, a process of determining sections with higher scores in the audio information (high score sections) as digest sections (the high score section determination process) is performed (step S205). A high score section is a section of the audio information in which the score remains high continuously. In the present embodiment, however, when a low-score section is extremely short, that section is also included in the high score section. This is because an extremely short low-score section, such as a breath taken in the middle of a person's continuous utterance, can be regarded as forming a single continuous section with the sections before and after it from the viewpoint of information content.
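The idea of a high score section that absorbs very short low-score gaps can be sketched as below. The detailed procedure of the embodiment is described separately with reference to its own figures; this sketch is only an assumption-based illustration, and the name `high_score_sections`, the frame-index representation, and the `max_gap` parameter are hypothetical.

```python
from typing import List, Tuple


def high_score_sections(scores: List[float], threshold: float,
                        max_gap: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges whose scores are at or above the
    threshold, merging ranges separated by at most max_gap low frames."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i          # a new run of high-score frames begins
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(scores)))

    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] <= max_gap:
            # Short low-score gap (e.g. a breath): treat as one section.
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


found = high_score_sections([0.9, 0.8, 0.1, 0.9, 0.2, 0.1, 0.1, 0.7], 0.5, 1)
```

In the example, the single low frame at index 2 is absorbed into the first section, while the three low frames at indices 4 to 6 keep the last section separate.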

In offline processing, the digest section determination process regards the high score sections determined in step S205 as digest sections, and subsequent processing adjusts the time lengths and the number of the digest sections so that the total digest section length matches the digest length. The high score sections determined in the high score section determination process can thus be regarded as candidates for the digest sections that are finally determined.

なお、高スコア区間決定処理のより詳細な処理内容については、図７−９を参照して、後程改めて説明する。 Details of the high score section determination process will be described later with reference to FIGS. 7-9.

ステップＳ２０５において高スコア区間が決定されると、これらの区間をダイジェスト区間とみなして、各ダイジェスト区間の区間内での平均スコア（区間平均スコア）が算出される（ステップＳ２０７）。区間平均スコアは、高スコア区間決定処理において決定される、高スコア区間（すなわちダイジェスト区間）の開始時刻や終了時刻、インデックスとともに、ダイジェスト区間情報に含まれてよい。 When high score sections are determined in step S205, these sections are regarded as digest sections, and an average score (section average score) within each digest section is calculated (step S207). The section average score may be included in the digest section information together with the start time, end time, and index of the high score section (that is, the digest section) determined in the high score section determination process.
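The section average score of step S207 amounts to a mean over the frames of each section. A minimal sketch, assuming sections are given as half-open frame ranges and using the hypothetical name `section_average_scores`:

```python
from typing import List, Tuple


def section_average_scores(scores: List[float],
                           sections: List[Tuple[int, int]]) -> List[float]:
    """Mean per-frame score inside each [start, end) digest section."""
    return [sum(scores[start:end]) / (end - start) for start, end in sections]


averages = section_average_scores([0.75, 0.25, 0.0, 1.0], [(0, 2), (3, 4)])
```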

Next, it is determined whether the total digest section length is significantly shorter than the digest length (step S209). Specifically, in step S209, it is determined whether the total digest section length falls below the allowable range of deviation set for the digest length. Because it is difficult to determine digest sections whose total length exactly matches the digest length, the present embodiment sets such an allowable range and judges whether the total digest section length is appropriate according to whether it falls within that range. The allowable range may be set as appropriate by the designer of the information processing apparatus 110 or the like, as a range of deviation within which the user listening to the digest does not feel a sense of incongruity because the actual digest length is longer or shorter than the set digest length.
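The comparison against the allowable range can be expressed compactly. The symmetric tolerance and the names `length_status` and `tolerance` are assumptions; the text only says that an allowable range of deviation from the digest length is set.

```python
def length_status(total_len: float, digest_len: float, tolerance: float) -> str:
    """Classify the total digest section length against the allowable
    range [digest_len - tolerance, digest_len + tolerance]."""
    if total_len < digest_len - tolerance:
        return "too_short"   # corresponds to a positive result in step S209
    if total_len > digest_len + tolerance:
        return "too_long"    # corresponds to a positive result in step S215
    return "ok"


status = length_status(25, 30, 3)
```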

ステップＳ２０９でダイジェスト区間長の合計がダイジェスト長よりも大幅に短いと判断された場合には、ステップＳ２１１〜ステップＳ２１３に進む。ステップＳ２１１〜ステップＳ２１３では、ダイジェスト区間長の合計をより長くするための処理が行われる。 If it is determined in step S209 that the total digest section length is significantly shorter than the digest length, the process proceeds to steps S211 to S213. In step S211 to step S213, a process for making the total digest section length longer is performed.

Specifically, in step S211, the current score threshold is set as the score threshold upper limit value. A total digest section length significantly shorter than the digest length means that the current score threshold is too high compared with an appropriate value, so this setting ensures that, when the score threshold is changed in subsequent processing, it does not exceed the current score threshold.

Next, a value lower than the current score threshold is set as the new score threshold (step S213). The process then proceeds to step S207, and the high score section determination process is performed again using the new score threshold. Because the new score threshold is lower, the number of frames included in the high score sections increases, the total digest section length becomes longer, and the total can be brought closer to the digest length.

ステップＳ２０９でダイジェスト区間長の合計がダイジェスト長よりも大幅に短くはないと判断された場合には、ステップＳ２１５に進む。ステップＳ２１５では、逆に、ダイジェスト区間長の合計がダイジェスト長よりも大幅に長いかどうかが判断される。 If it is determined in step S209 that the total digest section length is not significantly shorter than the digest length, the process proceeds to step S215. Conversely, in step S215, it is determined whether or not the total digest section length is significantly longer than the digest length.

ステップＳ２１５でダイジェスト区間長の合計がダイジェスト長よりも大幅に長くはないと判断された場合には、ダイジェスト区間決定処理の一連の処理を終了する。つまり、高スコア区間決定処理で決定された現在のダイジェスト区間が、最終的なダイジェスト区間として確定される。ステップＳ２０９でダイジェスト区間長の合計がダイジェスト長よりも大幅に短くはないと判断され、かつ、ステップＳ２１５でダイジェスト区間長の合計がダイジェスト長よりも大幅に長くはないと判断された場合には、ダイジェスト区間長の合計は、ダイジェスト長の許容範囲に含まれているからである。 If it is determined in step S215 that the total digest section length is not significantly longer than the digest length, the series of digest section determination processing ends. That is, the current digest section determined by the high score section determination process is determined as the final digest section. If it is determined in step S209 that the total digest length is not significantly shorter than the digest length and it is determined in step S215 that the total digest length is not significantly longer than the digest length, This is because the total digest section length is included in the allowable range of digest length.

一方、ステップＳ２１５でダイジェスト区間長の合計がダイジェスト長よりも大幅に長いと判断された場合には、ステップＳ２１７に進む。ステップＳ２１７以降の処理では、ダイジェスト区間長の合計をより短くするための処理が行われる。 On the other hand, if it is determined in step S215 that the total digest section length is significantly longer than the digest length, the process proceeds to step S217. In the processing after step S217, processing for shortening the total digest section length is performed.

In step S217, it is determined whether the score threshold is smaller than the score threshold upper limit value. If it is determined in step S217 that the score threshold is smaller than the score threshold upper limit value, the process proceeds to step S219. In step S219, a value higher than the current score threshold is set as the new score threshold. The process then proceeds to step S207, and the high score section determination process is performed again using the new score threshold. Because the new score threshold is higher, the number of frames included in the high score sections decreases, the total digest section length becomes shorter, and the total can be brought closer to the digest length.

ステップＳ２１７でスコア閾値がスコア閾値上限値よりも小さくないと判断された場合には、ステップＳ２２１に進む。この場合には、スコア閾値を現在の値以上に高くすることができないため、スコア閾値を変更することによりダイジェスト区間長の合計を短くすることはできない。従って、ステップＳ２２１以降の処理では、現在のダイジェスト区間の中からフレームを削除する、又は現在のダイジェスト区間の数を減らすことにより、ダイジェスト区間長の合計を短くする処理が行われる。 If it is determined in step S217 that the score threshold is not smaller than the score threshold upper limit value, the process proceeds to step S221. In this case, since the score threshold cannot be made higher than the current value, the total digest section length cannot be shortened by changing the score threshold. Therefore, in the processing after step S221, processing is performed to shorten the total digest section length by deleting a frame from the current digest section or reducing the number of current digest sections.
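
As an illustration only, the threshold-raising loop of steps S217 to S219 can be sketched in Python as follows. The function name, the fixed `step` increment, and the simplification of counting frames above the threshold (instead of running the full high score section determination process) are assumptions made for this sketch and do not appear in the disclosure.

```python
def raise_threshold_until_fits(scores, threshold, upper, step, target, tol):
    """Raise the score threshold until the digest is no longer too long.

    Hypothetical helper: the total digest length is approximated by the
    number of frames whose score exceeds the threshold.
    Returns (threshold, total_frames).
    """
    while True:
        total = sum(1 for s in scores if s > threshold)
        # Stop when the total fits (S215) or the threshold cannot be
        # raised any further (S217 answers "no").
        if total <= target + tol or threshold >= upper:
            return threshold, total
        # S219: set a higher threshold, capped at the upper limit.
        threshold = min(threshold + step, upper)
```

If the returned total still exceeds `target + tol`, the threshold has reached its upper limit, which corresponds to the case in which the process moves on to step S221.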

具体的には、ステップＳ２２１では、各ダイジェスト区間について、ダイジェスト区間長の短縮が可能かどうかが判断される。ここで、ダイジェスト区間長の短縮が可能かどうかは、ダイジェスト区間長と連続区間最低長とを比較することによって行われる。連続区間最低長は、音声として出力した際に人が当該音声の意味を認識可能な最小区間として設定される。ダイジェスト区間長が連続最低長以下であると、ダイジェストを聴いた際に、当該ダイジェスト区間に対応する部分の意味を把握できないため、ダイジェストとして有意なものではなくなってしまう。従って、ステップＳ２２１に示す判断処理を行うことにより、ダイジェスト区間長が連続最低長よりも大きくなるようにダイジェスト区間が決定されるようにしているのである。 Specifically, in step S221, it is determined for each digest section whether its digest section length can be shortened. This is determined by comparing the digest section length with the minimum continuous section length. The minimum continuous section length is set as the shortest section whose meaning a person can recognize when it is output as sound. If a digest section is no longer than the minimum continuous section length, a listener cannot grasp the meaning of the corresponding part of the digest, so that part is no longer meaningful as a digest. The determination in step S221 therefore ensures that the digest sections are determined so that each digest section length remains longer than the minimum continuous section length.

ステップＳ２２１でいずれかのダイジェスト区間においてダイジェスト区間長の短縮が可能と判断された場合には、ステップＳ２２３〜ステップＳ２２７に進む。ステップＳ２２３〜ステップＳ２２７では、現在のダイジェスト区間の中からフレームを削除することによりダイジェスト区間長の合計を短くする処理が行われる。 If it is determined in step S221 that the digest section length can be shortened in any of the digest sections, the process proceeds to steps S223 to S227. In steps S223 to S227, processing is performed to shorten the total digest section length by deleting the frame from the current digest section.

具体的には、ステップＳ２２３では、ダイジェスト区間長の短縮が可能と判断されたダイジェスト区間（すなわちダイジェスト区間長が連続最低長よりも長いダイジェスト区間）の中で、区間平均スコアがより低いダイジェスト区間のダイジェスト区間長が短縮される。ダイジェスト区間長を短縮する際には、例えば、短縮対象であるダイジェスト区間の先頭の所定の数のフレーム及び終端の所定の数のフレームのうち、スコアの平均値が低い方がダイジェスト区間から除外される。 Specifically, in step S223, among the digest sections determined to be shortenable (that is, digest sections whose length is longer than the minimum continuous section length), the digest section with the lower section average score is shortened. When a digest section is shortened, for example, the average score of a predetermined number of frames at the beginning of the section is compared with that of a predetermined number of frames at the end, and the group with the lower average score is excluded from the digest section.

次に、フレームが削除されダイジェスト区間長が短縮されたダイジェスト区間の区間平均スコアが更新される（ステップＳ２２５）。そして、ダイジェスト区間長の合計がダイジェスト長と略一致するかどうかが判断される（ステップＳ２２７）。ステップＳ２２７では、具体的には、ダイジェスト区間長の合計が、ダイジェスト長に設定されている許容範囲に含まれるかどうかが判断される。 Next, the section average score of the digest section in which the frame is deleted and the digest section length is shortened is updated (step S225). Then, it is determined whether or not the sum of the digest section lengths substantially matches the digest length (step S227). In step S227, specifically, it is determined whether the total digest section length is included in the allowable range set in the digest length.

ステップＳ２２７でダイジェスト区間長の合計がダイジェスト長と略一致していると判断された場合には、ダイジェスト区間決定処理の一連の処理を終了する。つまり、現在のダイジェスト区間が、最終的なダイジェスト区間として確定される。 If it is determined in step S227 that the total digest section length is substantially equal to the digest length, the series of digest section determination processing ends. That is, the current digest section is determined as the final digest section.

一方、ステップＳ２２７でダイジェスト区間長の合計がダイジェスト長と略一致していないと判断された場合には、ステップＳ２２１に戻り、再度、各ダイジェスト区間について、ダイジェスト区間長の短縮が可能かどうかが判断される。 On the other hand, if it is determined in step S227 that the total digest section length does not substantially match the digest length, the process returns to step S221, and it is determined again for each digest section whether its digest section length can be shortened.

ステップＳ２２１でいずれのダイジェスト区間においてもダイジェスト区間長の短縮が不可能と判断された場合には、ステップＳ２２９〜ステップＳ２３１に進む。ステップＳ２２９〜ステップＳ２３１では、現在のダイジェスト区間の数を減らすことによりダイジェスト区間長の合計を短くする処理が行われる。 If it is determined in step S221 that the digest section length cannot be shortened in any digest section, the process proceeds to steps S229 to S231. In steps S229 to S231, processing is performed to shorten the total digest section length by reducing the number of current digest sections.

具体的には、ステップＳ２２９では、現在のダイジェスト区間の中から、区間平均スコアのより低いダイジェスト区間が削除される。そして、ダイジェスト区間長の合計がダイジェスト長と略一致するかどうかが判断される（ステップＳ２３１）。ステップＳ２３１では、ステップＳ２２７と同様に、ダイジェスト区間長の合計が、ダイジェスト長に設定されている許容範囲に含まれるかどうかが判断される。 Specifically, in step S229, a digest section with a lower section average score is deleted from the current digest section. Then, it is determined whether or not the total digest section length substantially matches the digest length (step S231). In step S231, as in step S227, it is determined whether or not the total digest section length is included in the allowable range set in the digest length.

ステップＳ２３１でダイジェスト区間長の合計がダイジェスト長と略一致していると判断された場合には、ダイジェスト区間決定処理の一連の処理を終了する。つまり、現在のダイジェスト区間が、最終的なダイジェスト区間として確定される。 If it is determined in step S231 that the total digest length is substantially equal to the digest length, a series of digest segment determination processing ends. That is, the current digest section is determined as the final digest section.
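
The shortening procedure of steps S221 to S231 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: sections are hypothetical `(start, end)` frame ranges with `end` exclusive, `trim` stands for the "predetermined number of frames", a section is treated as shortenable only if it stays longer than `min_len` after trimming, and the loop stops as soon as the total is no longer above `target + tol` so that it always terminates.

```python
from statistics import mean


def shorten_sections(sections, scores, target, tol, min_len, trim=1):
    """Shorten or drop digest sections until the total length fits."""
    sections = list(sections)

    def avg(sec):
        # Section average score over the frames the section covers.
        return mean(scores[sec[0]:sec[1]])

    def total():
        return sum(e - s for s, e in sections)

    while sections and total() > target + tol:
        # S221: which sections can still be shortened?
        shortenable = [(s, e) for s, e in sections if (e - s) - trim > min_len]
        if shortenable:
            # S223: take the shortenable section with the lowest average
            # score and drop its lower-scoring edge (head or tail).
            s, e = min(shortenable, key=avg)
            head = mean(scores[s:s + trim])
            tail = mean(scores[e - trim:e])
            new = (s + trim, e) if head <= tail else (s, e - trim)
            # S225 is implicit: avg() is recomputed from the new range.
            sections[sections.index((s, e))] = new
        else:
            # S229: nothing can be trimmed, so delete the section with
            # the lowest average score.
            sections.remove(min(sections, key=avg))
    return sections
```

For example, with `scores = [0.9, 0.8, 0.7, 0.2, 0.0, 0.6, 0.6, 0.6, 0.0, 0.0]`, sections `[(0, 4), (5, 8)]`, a target of 5 frames, and `min_len = 2`, the sketch first trims the low-scoring tail of the first section and then drops the second section, which has the lower average score.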

（３−２−２．高スコア区間決定処理） (3-2-2. High score section determination processing)
ここで、図７−図９を参照して、詳細な説明を省略していたステップＳ２０５に示す、オフライン処理での高スコア区間決定処理について詳しく説明する。図７は、オフライン処理での高スコア区間決定処理について説明するための説明図である。図８及び図９は、オフライン処理での高スコア区間決定処理の処理手順の一例を示すフロー図である。 Here, with reference to FIGS. 7 to 9, the high score section determination process in the offline process shown in step S205, whose detailed description was omitted above, will be described in detail. FIG. 7 is an explanatory diagram for describing the high score section determination process in the offline process. FIGS. 8 and 9 are flowcharts showing an example of the processing procedure of the high score section determination process in the offline process.

以下の高スコア区間決定処理についての説明では現在フレーム、現ダイジェスト区間、連続区間及び不連続区間という用語を用いる。高スコア区間決定処理の具体的な処理手順について説明する前に、図７を参照して、これらの用語が示す概念について説明する。 In the following description of the high score section determination process, the terms “current frame”, “current digest section”, “continuous section”, and “discontinuous section” are used. Prior to describing the specific processing procedure of the high score section determination process, the concept represented by these terms will be described with reference to FIG.

図７では、横軸に音声情報の時間を取り、縦軸にフレームごとに算出されたスコアを取り、両者の関係性をプロットしている。高スコア区間決定処理では、フレームごとに、時系列に従って、当該フレームをダイジェスト区間に含めるかどうかの判断が行われる。図中、現在フレームは、現在判断処理の対象としているフレームを示している。 In FIG. 7, the horizontal axis represents the time of the audio information, the vertical axis represents the score calculated for each frame, and the relationship between the two is plotted. In the high score section determination process, for each frame, it is determined whether to include the frame in the digest section in time series. In the figure, the current frame indicates a frame that is currently subject to determination processing.

現ダイジェスト区間は、現在フレームを含めるかどうかを判断する対象としているダイジェスト区間を意味する。連続区間は、現ダイジェスト区間内でスコアがスコア閾値を連続的に超えている区間を意味している。不連続区間は、現ダイジェスト区間内で直前の連続区間の終了時刻から現在フレームまでの区間を意味している。現ダイジェスト区間、連続区間及び不連続区間の時間長さのことを、それぞれ、現ダイジェスト区間長、連続区間長及び不連続区間長とも呼称する。 The current digest section means a digest section for which it is determined whether or not to include the current frame. The continuous section means a section where the score continuously exceeds the score threshold in the current digest section. The discontinuous section means a section from the end time of the immediately preceding continuous section to the current frame in the current digest section. The time lengths of the current digest section, the continuous section, and the discontinuous section are also referred to as the current digest section length, the continuous section length, and the discontinuous section length, respectively.

図８及び図９を参照して、オフライン処理における高スコア区間決定処理の具体的な処理手順について説明する。図８及び図９を参照すると、オフライン処理における高スコア区間決定処理では、まず、フレームインデックスがゼロに設定される（ステップＳ３０１）。また、ダイジェスト区間インデックスがゼロに設定される（ステップＳ３０３）。フレームインデックスは、音声情報の各フレームに対して時系列順に付されるものであり、フレームインデックスがゼロのフレームは音声情報の先頭のフレームを指している。ステップＳ３０１及びステップＳ３０３に示す処理は、現在フレームをフレーム＃０とし、現ダイジェスト区間をダイジェスト区間＃０にする処理に対応している。 A specific processing procedure of the high score section determination process in the offline process will be described with reference to FIGS. 8 and 9. Referring to FIGS. 8 and 9, in the high score section determination process in the offline process, the frame index is first set to zero (step S301). The digest section index is also set to zero (step S303). The frame index is assigned to each frame of the audio information in chronological order, and the frame whose frame index is zero is the first frame of the audio information. The processes of steps S301 and S303 correspond to setting the current frame to frame #0 and the current digest section to digest section #0.

次に、現在フレームのスコアがスコア閾値よりも大きいかどうかが判断される（ステップＳ３０５）。ステップＳ３０５で現在フレームのスコアがスコア閾値以下と判断された場合には、現在フレームをダイジェスト区間には含めずに、ステップＳ３１９に進む。この場合には、現在フレームは不連続区間に追加されることになる。ステップＳ３１９における処理については後述する。 Next, it is determined whether the score of the current frame is larger than the score threshold (step S305). If it is determined in step S305 that the score of the current frame is equal to or lower than the score threshold value, the process proceeds to step S319 without including the current frame in the digest section. In this case, the current frame is added to the discontinuous section. The process in step S319 will be described later.

一方、ステップＳ３０５で現在フレームのスコアがスコア閾値よりも大きいと判断された場合には、ステップＳ３０７に進む。ステップＳ３０７〜ステップＳ３１７では、現在フレームをダイジェスト区間に含めるための処理が行われる。 On the other hand, if it is determined in step S305 that the score of the current frame is greater than the score threshold, the process proceeds to step S307. In steps S307 to S317, a process for including the current frame in the digest section is performed.

まず、ステップＳ３０７において、不連続区間長が不連続区間最大長よりも小さいかどうかが判断される。ここで、不連続区間最大長とは、不連続区間が、ダイジェスト区間に含めるべき有意な区間であるかどうかを判断する基準となる時間長さである。上述したように、不連続区間は、直前の連続区間の終了時刻から現在フレームまでの区間であるため、連続区間には含まれない、スコアが連続的に低い区間であると言える。従って、不連続区間は、ダイジェストに含める対象としている音源種別の音声がほぼ発せられていない沈黙の区間であると考えられるが、例えば不連続区間が極短い場合には、当該区間は、例えばある人物の一連の発言の最中の息継ぎ等、情報の内容の観点からは、前後の区間と一連の区間である可能性が高い。不連続区間最大長は、このような観点から、不連続区間に対応する沈黙の区間が、一連の音声中の極短い沈黙なのか、あるいは例えば話者の変更を伴うような長い沈黙なのかを判断するための時間長さとして設定され得る。 First, in step S307, it is determined whether the discontinuous section length is smaller than the discontinuous section maximum length. Here, the discontinuous section maximum length is a time length used as a criterion for determining whether the discontinuous section is a significant section that should be included in the digest section. As described above, the discontinuous section runs from the end time of the immediately preceding continuous section to the current frame, so it is a section that is not part of the continuous section and in which the score is continuously low. The discontinuous section can therefore be regarded as a silent section in which almost no sound of the sound source type targeted for the digest is emitted. However, when the discontinuous section is extremely short, for example a pause for breath in the middle of a person's utterance, it is likely to form a single continuous passage with the sections before and after it in terms of content. From this viewpoint, the discontinuous section maximum length can be set as a time length for judging whether the silence corresponding to the discontinuous section is a very short silence within a continuous passage of sound or a long silence such as one accompanying a change of speaker.

ステップＳ３０７で不連続区間長が不連続区間最大長よりも小さいと判断された場合には、ステップＳ３０９に進む。この場合、上述したように、不連続区間はその直前の連続区間と一連の区間と考えられるべきである。よって、ステップＳ３０９では、現ダイジェスト区間に不連続区間及び現在フレームを接続する（すなわち、不連続区間及び現在フレームを現ダイジェスト区間の終端に加える）処理が行われる。このように、不連続期間が極短い場合に、当該不連続期間まで含むようにダイジェスト区間が決定されることにより、一連の音声が途切れることなくダイジェストに含まれることとなり、内容把握の観点からより有用なダイジェストを生成することが可能となる。なお、この際、フレームインデックスが１つ小さいフレーム（すなわち時系列的に１つ前のフレーム）に対してもステップＳ３０９に示す処理が行われた場合には、既に不連続区間は現ダイジェスト区間に含まれているため、現在フレームのみが現ダイジェスト区間に接続される。ステップＳ３０９に示す処理を終えると、ステップＳ３１９に進む。 If it is determined in step S307 that the discontinuous section length is smaller than the discontinuous section maximum length, the process proceeds to step S309. In this case, as described above, the discontinuous section should be regarded as forming a single continuous passage with the continuous section immediately before it. Therefore, in step S309, the discontinuous section and the current frame are connected to the current digest section (that is, they are added to the end of the current digest section). When the discontinuous section is extremely short, determining the digest section so that it includes that discontinuous section means that a continuous passage of sound is included in the digest without interruption, which yields a digest that is more useful for grasping the content. Note that if the process of step S309 was also performed for the frame whose frame index is one smaller (that is, the immediately preceding frame in time series), the discontinuous section is already included in the current digest section, so only the current frame is connected to the current digest section. When the process of step S309 ends, the process proceeds to step S319.

一方、ステップＳ３０７で不連続区間長が不連続区間最大長以上であると判断された場合には、ステップＳ３１１に進む。ステップＳ３１１では、不連続区間前の連続区間長が連続区間最低長以上であるかどうかが判断される。図６のステップＳ２２１に示す処理について説明する際に言及したように、連続区間最低長とは、音声として出力した際に人が当該音声の意味を認識可能な最小区間として設定される時間長さである。つまり、ステップＳ３１１に示す処理は、連続区間が有意な区間であるかどうかを時間長さの観点から判断する処理であると言える。 On the other hand, if it is determined in step S307 that the discontinuous section length is greater than or equal to the discontinuous section maximum length, the process proceeds to step S311. In step S311, it is determined whether the length of the continuous section before the discontinuous section is greater than or equal to the minimum continuous section length. As mentioned in the description of step S221 of FIG. 6, the minimum continuous section length is the time length set as the shortest section whose meaning a person can recognize when it is output as sound. In other words, the process of step S311 determines, from the viewpoint of time length, whether the continuous section is a significant section.

ステップＳ３１１で不連続区間前の連続区間長が連続区間最低長以上であると判断された場合には、ステップＳ３１３〜ステップＳ３１５に進む。この場合は、不連続区間が不連続区間最大長以上であり、かつ、連続区間が連続区間最低長以上である場合（すなわち、不連続区間が有意な区間でなく、かつ、不連続区間の前の連続区間が有意な区間である場合）であるため、不連続区間を破棄する（ダイジェスト区間に含めない）とともに、不連続区間の前の連続区間を採用する（ダイジェスト区間に含める）処理が行われる。 If it is determined in step S311 that the length of the continuous section before the discontinuous section is greater than or equal to the minimum continuous section length, the process proceeds to steps S313 to S315. In this case, the discontinuous section is at least as long as the discontinuous section maximum length and the continuous section is at least as long as the minimum continuous section length (that is, the discontinuous section is not a significant section, but the continuous section before it is). Therefore, the discontinuous section is discarded (not included in a digest section), and the continuous section before it is adopted (included in a digest section).

具体的には、ステップＳ３１３では、不連続区間前の連続区間が１つのダイジェスト区間として確定される。次いで、ステップＳ３１５では、ダイジェスト区間インデックスが１つ繰り上げられ（すなわち処理対象である現ダイジェスト区間が新たに設定され）、現在フレームがその新たな現ダイジェスト区間の開始時刻に設定される。ステップＳ３１５に示す処理を終えると、ステップＳ３１９に進む。 Specifically, in step S313, the continuous section before the discontinuous section is determined as one digest section. Next, in step S315, the digest section index is incremented by 1 (that is, the current digest section to be processed is newly set), and the current frame is set to the start time of the new current digest section. When the process shown in step S315 is completed, the process proceeds to step S319.

一方、ステップＳ３１１で不連続区間前の連続区間長が連続区間最低長よりも小さいと判断された場合には、ステップＳ３１７に進む。この場合は、不連続区間が不連続区間最大長以上であり、かつ、連続区間が連続区間最低長よりも小さい場合（すなわち、不連続区間が有意な区間でなく、かつ、不連続区間の前の連続区間も有意でない場合）であるため、不連続区間と、不連続区間の前の連続区間を、ともに破棄する（ダイジェスト区間に含めない）処理が行われる。このように、連続期間が人によって認識できないほど短い場合に、当該連続期間を含まないようにダイジェスト区間が決定されることにより、ダイジェストを聴いた際にユーザにとって耳障りとなるような、内容把握の意味の薄い区間をダイジェストから省くことができ、より品質の高いダイジェストを生成することが可能となる。 On the other hand, if it is determined in step S311 that the length of the continuous section before the discontinuous section is smaller than the minimum continuous section length, the process proceeds to step S317. In this case, the discontinuous section is at least as long as the discontinuous section maximum length and the continuous section is shorter than the minimum continuous section length (that is, neither the discontinuous section nor the continuous section before it is a significant section). Therefore, both the discontinuous section and the continuous section before it are discarded (not included in a digest section). When the continuous section is too short to be recognized by a person, determining the digest sections so that they exclude it removes from the digest those sections that contribute little to understanding the content and would be grating to the listener, making it possible to generate a higher quality digest.

具体的には、ステップＳ３１７では、不連続区間前の連続区間が破棄され、現在フレームが現ダイジェスト区間の開始時刻に設定される。ステップＳ３１７に示す処理を終えると、ステップＳ３１９に進む。 Specifically, in step S317, the continuous section before the discontinuous section is discarded, and the current frame is set as the start time of the current digest section. When the process shown in step S317 is completed, the process proceeds to step S319.

ステップＳ３１９では、音声情報が終端かどうかが判断される。ステップＳ３１９で音声情報が終端でないと判断された場合には、フレームインデックスが１つ繰り上げられ（すなわち処理対象であるフレームが１つ先のフレームに設定され）（ステップＳ３２１）、ステップＳ３０５以降の処理が繰り返し実行される。 In step S319, it is determined whether the audio information is at the end. If it is determined in step S319 that the audio information is not the end, the frame index is incremented by 1 (that is, the frame to be processed is set as the next frame) (step S321), and the processing after step S305 is performed. Is repeatedly executed.

一方、ステップＳ３１９で音声情報が終端であると判断された場合には、ステップＳ３２３に進む。ステップＳ３２３では、現ダイジェスト区間長が連続区間最低長よりも大きいかどうかが判断される。つまり、ステップＳ３２３では、最後に処理対象であったダイジェスト区間が、時間長さの観点から有意な区間であるかどうか（すなわち音声の識別が可能な程度の時間長さを有しているかどうか）が判断される。 On the other hand, if it is determined in step S319 that the audio information is the end, the process proceeds to step S323. In step S323, it is determined whether the current digest section length is larger than the continuous section minimum length. That is, in step S323, whether or not the digest section that was the last processing target is a significant section from the viewpoint of time length (that is, whether or not the digest section has a time length that allows voice identification). Is judged.

ステップＳ３２３で現ダイジェスト区間長が連続区間最低長よりも大きいと判断された場合には、現ダイジェスト区間は時間長さ的に有意な区間であると考えられるため、当該ダイジェスト区間を採用し、一連の処理を終了する。一方、ステップＳ３２３で現ダイジェスト区間長が連続区間最低長以下であると判断された場合には、現ダイジェスト区間は時間長さ的に有意な区間でないと考えられるため、当該ダイジェスト区間を破棄し、一連の処理を終了する。 If it is determined in step S323 that the current digest section length is larger than the minimum continuous section length, the current digest section is considered to be a significant section in terms of time length, so the digest section is adopted and the series of processes ends. On the other hand, if it is determined in step S323 that the current digest section length is less than or equal to the minimum continuous section length, the current digest section is considered not to be a significant section in terms of time length, so the digest section is discarded and the series of processes ends.
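
The frame-by-frame procedure of steps S301 to S323 can be sketched in Python as follows. This is one possible reading, offered as a sketch rather than the disclosed implementation: lengths are counted in frames, sections are hypothetical `(start, end)` ranges with `end` exclusive, and "the continuous section before the discontinuous section" is interpreted as the current digest section accumulated so far, including any short gaps already bridged in step S309. The function and parameter names are assumptions.

```python
def high_score_sections(scores, threshold, max_gap, min_len):
    """Return digest sections as (start, end) frame ranges, end exclusive."""
    sections = []
    start = end = None  # current digest section; None until a frame qualifies

    for i, score in enumerate(scores):      # S301, S319, S321: walk frames
        if score <= threshold:              # S305: frame joins the gap
            continue
        if start is None:                   # first frame above the threshold
            start, end = i, i + 1
        elif i - end < max_gap:             # S307 -> S309: bridge a short gap
            end = i + 1
        else:
            if end - start >= min_len:      # S311 -> S313: fix the section
                sections.append((start, end))
            # S315 / S317: the current frame starts a new current section;
            # a too-short previous section is simply discarded.
            start, end = i, i + 1

    # S323: keep the last section only if it is longer than min_len.
    if start is not None and end - start > min_len:
        sections.append((start, end))
    return sections
```

Note that the sketch keeps the two comparisons as the text states them: step S311 accepts a section whose length is equal to the minimum continuous section length, whereas step S323 requires the last section to be strictly longer.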

以上、オフライン処理における、単一音源モードでのダイジェスト区間決定処理の処理手順について説明した。 The processing procedure of the digest section determination process in the single sound source mode in the offline process has been described above.

（３−３．複数音源モード） (3-3. Multiple sound source mode)
（３−３−１．ダイジェスト区間決定処理の処理手順） (3-3-1. Procedure of digest section determination process)
複数音源モードでは、指定された割合に基づいてダイジェストに含める音声の時間長さが音源種別ごとに設定され、音源種別ごとに音源種別スコアがより高い区間であって当該区間の合計長さが設定した音源種別ごとの時間長さ以下となるような区間が、ダイジェスト区間として決定される。 In the multiple sound source mode, the time length of the sound to be included in the digest is set for each sound source type based on a specified ratio. For each sound source type, sections that have a higher sound source type score and whose total length is no more than the time length set for that sound source type are determined as digest sections.

図１０及び図１１を参照して、オフライン処理における、複数音源モードでのダイジェスト区間決定処理の処理手順について説明する。図１０及び図１１は、オフライン処理における、複数音源モードでのダイジェスト区間決定処理の処理手順の一例を示すフロー図である。 With reference to FIG.10 and FIG.11, the process sequence of the digest area determination process in multiple sound source mode in an offline process is demonstrated. 10 and 11 are flowcharts showing an example of a processing procedure of digest section determination processing in the multiple sound source mode in offline processing.

なお、図１０及び図１１に示す複数音源モードでのダイジェスト区間決定処理は、図５−図９を参照して説明した単一音源モードでのダイジェスト区間決定処理における各処理が音源種別ごとに行われるものであり、各処理の内容自体は、単一音源モードでのダイジェスト区間決定処理と略同様であり得る。ただし、単一音源モードでのダイジェスト区間決定処理では、１つの音源種別しか対象にしていなかったため、上述したステップＳ２０９及びステップＳ２１５において、その音源種別に係るスコアに基づいて決定されたダイジェスト区間長の合計値がダイジェスト長と比較されていたが、複数音源モードでのダイジェスト区間決定処理では、各音源種別に係るスコアに基づいて決定されたダイジェスト区間長の合計値が、ダイジェストに含める各音源種別の音声の時間長さ（以下、種別ダイジェスト長とも呼称する。）と比較される。 Note that in the digest section determination process in the multiple sound source mode shown in FIGS. 10 and 11, each process of the digest section determination process in the single sound source mode described with reference to FIGS. 5 to 9 is performed for each sound source type, and the content of each process can be substantially the same as in the single sound source mode. However, because the single sound source mode targets only one sound source type, steps S209 and S215 described above compare the total digest section length determined from the score of that sound source type with the digest length. In the multiple sound source mode, by contrast, the total digest section length determined from the score of each sound source type is compared with the time length of the sound of that sound source type to be included in the digest (hereinafter also referred to as the type digest length).

以下の複数音源モードでのダイジェスト区間決定処理の処理手順についての説明では、単一音源モードでのダイジェスト区間決定処理の処理手順と相違する事項について主に説明し、重複する事項についてはその詳細な説明を省略する。 In the following description of the digest section determination process procedure in the multiple sound source mode, the differences from the digest section determination process procedure in the single sound source mode will be mainly described, and detailed explanations will be given for overlapping items. Description is omitted.

図１０及び図１１を参照すると、オフライン処理における複数音源モードでのダイジェスト区間決定処理では、まず、スコア閾値上限値としてスコア閾値理論上限値が設定される（ステップＳ４０１）。次いで、スコア閾値上限値よりも低い値としてスコア閾値が設定される（ステップＳ４０３）。これらの処理は、図５及び図６に示すステップＳ２０１及びステップＳ２０３における処理と同様である。 Referring to FIGS. 10 and 11, in the digest section determination process in the multiple sound source mode in the offline process, first, the score threshold theoretical upper limit is set as the score threshold upper limit (step S401). Next, the score threshold is set as a value lower than the score threshold upper limit (step S403). These processes are the same as the processes in steps S201 and S203 shown in FIGS.

次に、種別ダイジェスト長が設定される（ステップＳ４０５）。例えば、種別ダイジェスト長は、モード情報に基づいて設定され得る。例えば、モード情報には、ダイジェストに含める音源種別の割合を指定する旨の情報が含まれている。ステップＳ４０５に示す処理では、ダイジェスト長に当該割合を乗じることにより、音源種別ごとにその種別ダイジェスト長が算出される。 Next, the type digest length is set (step S405). For example, the type digest length can be set based on the mode information. For example, the mode information includes information indicating that the ratio of sound source types to be included in the digest is specified. In the process shown in step S405, the digest length is calculated for each sound source type by multiplying the digest length by the ratio.

ただし、ステップＳ４０５に示す処理はかかる例に限定されず、ダイジェストに含める音源種別の割合は、モード情報として外部から入力されるのではなく、情報処理装置１１０によって自動的に設定されてもよい。例えば、何らかの機会に図８及び図９に示す高スコア区間決定処理が各音源種別に対して既に１度実行されており、各種別音源に対して、高スコア区間が決定されている場合であれば、当該高スコア区間についての情報を用いて、上記割合が決定され、種別ダイジェスト長が決定されてもよい。 However, the process of step S405 is not limited to this example. The ratio of sound source types to be included in the digest need not be input from outside as mode information and may instead be set automatically by the information processing apparatus 110. For example, if the high score section determination process shown in FIGS. 8 and 9 has already been executed once for each sound source type on some occasion and high score sections have been determined for each sound source type, the ratio may be determined using information about those high score sections, and the type digest lengths may be determined accordingly.

具体的には、高スコア区間決定処理の結果から、音源種別ごとに、決定された高スコア区間の時間長さの総和が算出され、その比率が計算される。そして、計算された比率をダイジェスト長に乗じることにより、音源種別ごとにその種別ダイジェスト長が算出され得る。このように高スコア区間の時間長さに基づいて決定される割合は、音声情報内における音源種別ごとの音声の出現確率が反映されたものであり得る。 Specifically, the sum of the time lengths of the determined high score section is calculated for each sound source type from the result of the high score section determination process, and the ratio is calculated. Then, by multiplying the digest length by the calculated ratio, the type digest length can be calculated for each sound source type. Thus, the ratio determined based on the time length of the high score section may reflect the appearance probability of the sound for each sound source type in the sound information.

なお、モード情報に基づく場合、及び高スコア区間に基づく場合ともに、算出された種別ダイジェスト長が連続区間最低長を下回る場合には、その長さを調整する処理が適宜行われる。種別ダイジェスト長が連続区間最低長を下回る場合には、当該種別ダイジェスト長が短過ぎ、その音声が、人によって有意に認識されないからである。具体的には、連続区間最低長を下回る種別ダイジェスト長を連続区間最低長まで増加させるとともに、他の連続区間最低長を上回る種別ダイジェスト長からその増加分を減じる処理が行われる。 In addition, when based on the mode information and based on the high score section, when the calculated type digest length is less than the minimum length of the continuous section, processing for adjusting the length is appropriately performed. This is because when the type digest length is less than the minimum length of the continuous section, the type digest length is too short and the voice is not significantly recognized by a person. Specifically, the type digest length below the minimum continuous section length is increased to the minimum continuous section length, and the increase is subtracted from the type digest length exceeding the other continuous section minimum length.
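
The calculation of the type digest lengths in step S405, including the adjustment for lengths that fall below the minimum continuous section length, can be sketched in Python as follows. The function names and the dictionary layout are hypothetical. The text does not say how the increase is taken from the other sound source types, so taking it in proportion to each type's surplus above `min_len` is an assumption of this sketch.

```python
def ratios_from_sections(total_lengths):
    """Derive per-type ratios from the total high score section length per type."""
    total = sum(total_lengths.values())
    return {k: v / total for k, v in total_lengths.items()}


def type_digest_lengths(digest_length, ratios, min_len):
    """Multiply the digest length by each ratio, then lift short types to min_len."""
    lengths = {k: digest_length * r for k, r in ratios.items()}
    deficit = sum(min_len - v for v in lengths.values() if v < min_len)
    if deficit == 0:
        return lengths  # no type is shorter than min_len
    surplus = sum(v - min_len for v in lengths.values() if v > min_len)
    if surplus < deficit:
        raise ValueError("digest length too short for every type to reach min_len")
    adjusted = {}
    for k, v in lengths.items():
        if v < min_len:
            adjusted[k] = min_len  # raise to the minimum length
        else:
            # Assumption: donors give up time in proportion to their surplus,
            # so no donor drops below min_len and the total stays unchanged.
            adjusted[k] = v - deficit * (v - min_len) / surplus
    return adjusted
```

For instance, with a digest length of 80 seconds, ratios of 0.75, 0.2 and 0.05, and a minimum length of 5 seconds, the third type is raised from about 4 to 5 seconds and the other two types together give up the same amount, so the sum remains 80 seconds.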

種別ダイジェスト長が決定されると、次に、音声情報の中でより高いスコアを有する区間（高スコア区間）をダイジェスト区間として決定する処理（高スコア区間決定処理）が行われる（ステップＳ４０７）。ステップＳ４０７に示す処理は、図５及び図６に示すステップＳ２０５における処理、すなわち、図８及び図９に示す一連の処理と同様であるため、その詳細な説明を省略する。 When the type digest length is determined, next, a process (high score section determination process) of determining a section (high score section) having a higher score in the speech information as a digest section is performed (step S407). Since the process shown in step S407 is the same as the process in step S205 shown in FIGS. 5 and 6, that is, the series of processes shown in FIGS. 8 and 9, detailed description thereof will be omitted.

以降、ステップＳ４０９〜ステップＳ４３３に示す処理は、音源種別ごとに実行される点を除けば、図５及び図６に示すステップＳ２０７〜ステップＳ２３１における処理と同様の処理であるため、その詳細な説明を省略する。ステップＳ４１１〜ステップＳ４２１に示す処理は、図５及び図６に示すステップＳ２０９〜ステップＳ２１９における処理に対応する。ステップＳ４１１〜ステップＳ４２１に示す処理では、音源種別ごとに、ダイジェスト区間長の合計が種別ダイジェスト長と大幅に異なっていないかが判断され、スコア閾値が調整されることにより、ダイジェスト区間長の合計が種別ダイジェスト長の許容範囲に含まれるように、各ダイジェスト区間長が調整される。 The subsequent processes of steps S409 to S433 are the same as the processes of steps S207 to S231 shown in FIGS. 5 and 6, except that they are executed for each sound source type, so their detailed description is omitted. The processes of steps S411 to S421 correspond to the processes of steps S209 to S219 shown in FIGS. 5 and 6. In steps S411 to S421, it is determined for each sound source type whether the total digest section length differs significantly from the type digest length, and the score threshold is adjusted so that each digest section length is adjusted and the total digest section length falls within the allowable range of the type digest length.

ステップＳ４２３〜ステップＳ４３３に示す処理は、図５及び図６に示すステップＳ２２１〜ステップＳ２３１における処理に対応する。ステップＳ４２３〜ステップＳ４３３に示す処理は、スコア閾値の調整がそれ以上できなくなった場合に行われる処理であり、ステップＳ４２３以降の処理では、現在のダイジェスト区間の中からフレームを削除する、又は現在のダイジェスト区間の数を減らすことにより、ダイジェスト区間長の合計を短くする処理が行われる。ただし、図５及び図６に示すステップＳ２２１〜ステップＳ２３１における処理では、フレーム又は区間数の削除対象となるダイジェスト区間は単一の音源種別に係るものであったが、ステップＳ４２３〜ステップＳ４３３に示す処理では、フレーム又は区間数の削除対象となるダイジェスト区間は、複数の音源種別に係るダイジェスト区間が混合されたものである。 The processes of steps S423 to S433 correspond to the processes of steps S221 to S231 shown in FIGS. 5 and 6. They are performed when the score threshold can no longer be adjusted: from step S423 onward, the total digest section length is shortened by deleting frames from the current digest sections or by reducing the number of current digest sections. However, whereas in steps S221 to S231 shown in FIGS. 5 and 6 the digest sections from which frames or sections are removed relate to a single sound source type, in steps S423 to S433 they are a mixture of digest sections relating to a plurality of sound source types.

以上、図１０及び図１１を参照して、オフライン処理における、複数音源モードでのダイジェスト区間決定処理の処理手順について説明する。 The processing procedure of the digest section determination process in the multiple sound source mode in the offline process has been described above with reference to FIGS. 10 and 11.

（３−４．多様性反映モード） (3-4. Diversity reflection mode)
多様性反映モードでは、同一の音源種別に分類される音声の中から多様な音声が含まれるようにダイジェストが生成される。具体的には、多様性反映モードでは、同一の音源種別内での音声の特徴量のばらつき及び同一の音源種別内での音声の時間的ばらつきがより大きくなるように、ダイジェスト区間が決定される。 In the diversity reflection mode, a digest is generated so that it includes a variety of sounds from among the sounds classified into the same sound source type. Specifically, in the diversity reflection mode, the digest sections are determined so that both the variation in sound feature amounts within the same sound source type and the temporal variation of sounds within the same sound source type become larger.

（３−４−１．機能構成） (3-4-1. Functional configuration)
ここで、上述した単一音源モード及び複数音源モードにおける各処理は、図１に示す情報処理装置１１０の機能構成によって実行され得る。ただし、多様性反映モードにおける各処理は、図１に示す情報処理装置１１０とは若干異なる機能構成によって実行され得る。 Here, each process in the single sound source mode and the multiple sound source mode described above can be executed by the functional configuration of the information processing apparatus 110 illustrated in FIG. 1. However, each process in the diversity reflection mode can be executed by a functional configuration slightly different from that of the information processing apparatus 110 shown in FIG. 1.

図１２を参照して、多様性反映モードにおける各処理を実行する情報処理装置の機能構成について説明する。図１２は、多様性反映モードにおける各処理を実行する情報処理装置の機能構成の一例を示す機能ブロック図である。 With reference to FIG. 12, a functional configuration of the information processing apparatus that executes each process in the diversity reflection mode will be described. FIG. 12 is a functional block diagram illustrating an example of a functional configuration of the information processing apparatus that executes each process in the diversity reflection mode.

Referring to FIG. 12, the information processing apparatus 120 for the diversity reflection mode has, as its functions, a feature amount extraction unit 111, a sound source type score calculation unit 113, and a digest section determination unit 115. The functions of these units are the same as those of the corresponding functional blocks in the information processing apparatus 110 shown in FIG. 1, so a detailed description is omitted.

In the information processing apparatus 120, unlike the information processing apparatus 110, information on the feature amounts of the audio information calculated by the feature amount extraction unit 111 is also provided to the digest section determination unit 115. Using this feature amount information, the digest section determination unit 115 can determine the digest sections with diversity taken into account (see the process shown in step S531 of FIG. 14, described later).

(3-4-2. Processing procedure of the digest section determination process)
With reference to FIGS. 13 and 14, the processing procedure of the digest section determination process in the diversity reflection mode in offline processing, which can be executed by the information processing apparatus 120 shown in FIG. 12, will be described. FIGS. 13 and 14 are flowcharts showing an example of that processing procedure.

Because the diversity reflection mode determines the digest sections in consideration of diversity within the same sound source type, the sound source types to be included in the digest may be a single sound source type or a plurality of sound source types. As an example, FIGS. 13 and 14 illustrate the processing procedure for the case in which the digest includes sounds of a plurality of sound source types.

Each process in the digest section determination process in the diversity reflection mode is the same as the corresponding process in the digest section determination process in the multiple sound source mode described with reference to FIGS. 10 and 11, except for the process shown in step S531 described later. Accordingly, the following description mainly covers the points that differ from the multiple sound source mode, and a detailed description of overlapping points is omitted. Note that the processing procedure of the digest section determination process in the diversity reflection mode when the digest includes sounds of a plurality of sound source types corresponds to the processing procedure of the digest section determination process in the single sound source mode shown in FIGS. 5 and 6 in which the process shown in step S531, described later, is performed instead of the process shown in step S229.

Referring to FIGS. 13 and 14, in the digest section determination process in the diversity reflection mode, the processes in steps S501 to S521 are the same as the processes in steps S401 to S421 shown in FIGS. 10 and 11. As in the multiple sound source mode, the processes from step S523 onward are performed when the score threshold can no longer be adjusted. From step S523 onward, the total digest section length is shortened either by deleting frames from the current digest sections or by reducing the number of current digest sections.

In the diversity reflection mode, when it is determined in step S523 that the digest section length can be shortened for the digest sections, the series of processes that shortens the total digest section length by deleting frames from digest sections with lower section average scores (the processes shown in steps S525 to S529) is the same as the corresponding processes in the multiple sound source mode (the processes shown in steps S425 to S429).

On the other hand, when it is determined in step S523 that the digest section length cannot be shortened in any digest section, the details of the process that reduces the number of digest sections differ from those of the multiple sound source mode. Specifically, in the multiple sound source mode, digest sections with low section average scores are deleted (see the process shown in step S431 of FIG. 11). In the diversity reflection mode, by contrast, a process that deletes a digest section on the basis of diversity (diversity-based digest section deletion process) is performed (step S531). After a digest section is deleted, it is determined whether the total digest section length substantially matches the digest length (step S533), and the diversity-based digest section deletion process shown in step S531 is repeated until the total digest section length substantially matches the digest length.

(3-4-3. Diversity-based digest section deletion process)
With reference to FIG. 15, the diversity-based digest section deletion process shown in step S531 of FIG. 14 will be described in detail. FIG. 15 is a flowchart showing an example of the processing procedure of the diversity-based digest section deletion process in offline processing.

Referring to FIG. 15, in the diversity-based digest section deletion process in offline processing, first, the average of the feature vectors of each digest section (average feature vector) is calculated (step S601).

Next, the variance of the average feature vectors in the feature space is calculated for each of n cases, namely the case of all digest sections and the cases in which any one digest section is excluded (step S603).

Next, the average time of each digest section is calculated (step S605). The average time is calculated, for example, as the midpoint between the start time and the end time of the digest section.

Next, the variance of the average times of the digest sections is calculated for each of the n cases, namely the case of all digest sections and the cases in which any one digest section is excluded (step S607).

Next, the variance of the average feature vectors and the variance of the average times are weighted and summed, and the digest section whose exclusion yields the smallest reduction from the value for all digest sections is determined as the digest section to be deleted (step S609). In other words, in step S609, the digest section whose average feature vector and average time have the least influence when left out of the variance calculation is determined as the digest section to be deleted. As a result, the digest sections to be included in the digest are selected so that the variances of the average feature vectors and the average times become larger. Finally, the determined digest section is deleted (step S611).
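Steps S601 to S609 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the disclosure: the function names, the weights `w_feat` and `w_time`, and the choice to reduce the variance of the average feature vectors to a scalar by summing per-dimension population variances are all hypothetical, since the text does not specify them.

```python
from statistics import pvariance


def diversity(mean_features, mean_times, w_feat=1.0, w_time=1.0):
    """Weighted sum of the variance of the average feature vectors and the
    variance of the average times (the quantity compared in step S609)."""
    # Assumption: the vector variance is reduced to a scalar by summing the
    # per-dimension population variances.
    feat_var = sum(pvariance(dim) for dim in zip(*mean_features))
    return w_feat * feat_var + w_time * pvariance(mean_times)


def pick_section_to_delete(mean_features, mean_times, w_feat=1.0, w_time=1.0):
    """Return the index of the digest section whose exclusion reduces the
    diversity measure the least (steps S603, S607 and S609)."""
    n = len(mean_times)
    if n < 2:
        raise ValueError("need at least two digest sections")
    full = diversity(mean_features, mean_times, w_feat, w_time)
    reductions = []
    for i in range(n):
        # Leave digest section i out and recompute the measure.
        feats = mean_features[:i] + mean_features[i + 1:]
        times = mean_times[:i] + mean_times[i + 1:]
        reductions.append(full - diversity(feats, times, w_feat, w_time))
    return min(range(n), key=lambda i: reductions[i])


# Sections 0 and 1 are similar in features and close in time; section 2 differs.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
times = [0.0, 1.0, 100.0]
victim = pick_section_to_delete(features, times)
```

In this example, one of the two near-duplicate sections (index 1) is selected for deletion, so the remaining sections stay spread out in both feature space and time. Because feature variance and time variance generally have different scales, the weights would need to be chosen accordingly in practice.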

The processing procedure of the digest section determination process in the diversity reflection mode in offline processing has been described above with reference to FIGS. 13 and 14. The diversity-based digest section deletion process shown in step S531 has also been described with reference to FIG. 15.

As described above, in the diversity reflection mode, the digest sections are determined so that diversity in feature vectors and in time is ensured for sounds classified into the same sound source type. Ensuring diversity in feature vectors makes it possible, when sounds are classified into the same sound source type but actually contain the voices of different people, to include all of those voices in the digest. Ensuring diversity in time makes it possible, when sounds classified into the same sound source type are uttered at temporally distant positions, to include all of those utterances in the digest.

(4. Details of online processing)
(4-1. Overall processing procedure)
With reference to FIG. 16, the processing procedure of online processing will be described. FIG. 16 is a flowchart showing an example of the processing procedure of online processing. The processing procedure shown in FIG. 16 corresponds to the overall processing procedure of the information processing method executed by the information processing apparatus 110 shown in FIG. 1 during online processing.

In online processing, each time a frame of audio information is newly input, the score of that newly input frame (input frame) is calculated, and the digest sections are determined from the audio information on the basis of that score. In other words, while audio information is being input, the series of processes shown in FIG. 16 is executed each time a new frame is input, and the digest section information is updated.

When the score calculation section is not a single frame section but consists of a plurality of frame sections, the series of processes shown in FIG. 16 can be executed each time the plurality of frames corresponding to one score calculation section has been input.

Referring to FIG. 16, in online processing, first, the feature amounts of the audio information acquired so far are extracted (step S701). In step S701, various physical quantities indicating characteristics of the audio information, such as power and spectral envelope shape, are calculated as the feature amounts. The process shown in step S701 corresponds, for example, to the process performed by the feature amount extraction unit 111 shown in FIG. 1.

Next, the sound source type score of the input frame is calculated on the basis of the extracted feature amounts (step S703). In step S703, for example, a classifier that identifies the sound source type of a sound according to the feature amounts of the audio information calculates a sound source type score indicating the probability of the sound source type of the sound in the input frame. A plurality of kinds of sound source type scores, such as an audio score, a voice score, and a noise score, may be calculated. The process shown in step S703 corresponds, for example, to the process performed by the sound source type score calculation unit 113 shown in FIG. 1.

When the score calculation section is not a single frame section but consists of a plurality of frame sections, step S703 may include a process of smoothing the sound source type scores of the individual frames to calculate the sound source type score of the score calculation section.
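The text does not specify the smoothing method. As one possible reading, the following Python sketch averages the per-frame sound source type scores over each score calculation section; the function name and the use of a plain arithmetic mean are assumptions made for illustration.

```python
def section_scores(frame_scores, frames_per_section):
    """Average per-frame scores over consecutive score calculation sections.

    Assumption: 'smoothing' is taken to be a plain mean over the frames of
    each section; a trailing partial section is averaged over its own frames.
    """
    if frames_per_section < 1:
        raise ValueError("frames_per_section must be at least 1")
    result = []
    for start in range(0, len(frame_scores), frames_per_section):
        chunk = frame_scores[start:start + frames_per_section]
        result.append(sum(chunk) / len(chunk))
    return result


smoothed = section_scores([0.25, 0.75, 0.5, 0.5, 1.0], frames_per_section=2)
```

Other smoothing choices, such as a weighted or median filter, would fit the description equally well.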

Next, the digest sections are determined from the audio information on the basis of the calculated sound source type score (step S705). The process shown in step S705 corresponds, for example, to the process performed by the digest section determination unit 115 shown in FIG. 1.

In step S705, when the time length of the audio information acquired so far is shorter than the digest length (the set value of the time length of the digest), the input frame is added to the digest unconditionally. When the time length of the audio information acquired so far is equal to or longer than the digest length, the input frame is added to the digest and, in exchange, a frame with a lower score, for example, is deleted from the digest.

Because the specific content of the process in step S705 differs depending on the mode, it is described in more detail for each mode in (4-2. Single sound source mode), (4-3. Multiple sound source mode), and (4-4. Diversity reflection mode) below.

Next, it is determined whether the input of audio information has ended (step S707). If it is determined in step S707 that the input has ended, the digest section information for the determined digest sections is output, and the series of processes ends. If it is determined in step S707 that the input has not ended, the process waits for the next frame to be input (step S709), and the processes from step S701 onward are repeated for the newly input frame.

The processing procedure of online processing has been described above with reference to FIG. 16.

(4-2. Single sound source mode)
(4-2-1. Digest section determination process)
With reference to FIG. 17, the processing procedure of the digest section determination process in the single sound source mode in online processing will be described. FIG. 17 is a flowchart showing an example of that processing procedure.

Referring to FIG. 17, in the digest section determination process in the single sound source mode in online processing, first, it is determined whether the current digest length is shorter than the set digest length (step S801). If it is determined in step S801 that the current digest length is shorter than the set digest length, the input frame is added to the digest, and the average score of the digest as a whole (digest average score) is updated (step S803). The digest section determination process then ends and waits for the next input frame.

The processes shown in steps S801 and S803 correspond to adding the input frame to the digest unconditionally while the time length of the audio information input so far is less than the digest length.

If it is determined in step S801 that the current digest length is equal to or longer than the set digest length, the process proceeds to step S805. In step S805, it is determined whether the score of the input frame is equal to or greater than the digest average score. If it is determined in step S805 that the score of the input frame is smaller than the digest average score, the digest section determination process ends without adding the input frame to the digest. In other words, frames with lower scores are kept out of the digest.

On the other hand, if it is determined in step S805 that the score of the input frame is equal to or greater than the digest average score, the input frame is added to the digest and the digest average score is updated (step S807). In this case, however, adding the input frame causes the current digest length to exceed the set digest length by the time length of one frame. Therefore, following step S807, a process of deleting a frame from the digest (frame deletion process) is performed (step S809). In the frame deletion process, for example, a frame with a lower score is deleted from the digest. Details of the frame deletion process shown in step S809 will be described later with reference to FIG. 18.

When the frame has been deleted, the digest average score is updated (step S811), and the digest section determination process ends.
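The flow of steps S801 to S811 can be sketched in Python as follows. This is a simplified illustration, not the procedure of FIG. 17 itself: the digest length is counted in frames, the digest is a list of hypothetical `(frame_index, score)` pairs, and the frame deletion process of step S809 is replaced by simply dropping the lowest-scoring frame, whereas the text bases that step on a high score section determination.

```python
def update_digest(digest, frame, digest_len):
    """Update `digest` in place with one input frame and return it.

    digest:     list of (frame_index, score) pairs in input order
    frame:      (frame_index, score) pair for the input frame
    digest_len: set digest length, counted in frames (assumption)
    """
    if len(digest) < digest_len:                  # S801
        digest.append(frame)                      # S803: add unconditionally
        return digest

    average = sum(score for _, score in digest) / len(digest)
    if frame[1] < average:                        # S805: score too low
        return digest

    digest.append(frame)                          # S807: digest is one frame too long
    # S809 (simplified assumption): delete the lowest-scoring frame.
    worst = min(range(len(digest)), key=lambda i: digest[i][1])
    del digest[worst]
    return digest                                 # S811: average is recomputed on the next call


digest = []
for index, score in enumerate([0.2, 0.5, 0.1, 0.9, 0.0, 0.6]):
    update_digest(digest, (index, score), digest_len=3)
```

With these six input scores and a digest length of three frames, the digest ends up holding frames 1, 3 and 5. The digest average score is recomputed from the stored frames on each call rather than maintained incrementally, which keeps the sketch short at the cost of efficiency.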

(4-2-2. Frame deletion process)
With reference to FIG. 18, details of the frame deletion process shown in step S809 of FIG. 17 will now be described. FIG. 18 is a flowchart showing an example of the processing procedure of the frame deletion process in the single sound source mode in online processing.

Referring to FIG. 18, in the frame deletion process in the single sound source mode in online processing, first, the digest average score is set as the score threshold (step S901). Then, using the set score threshold, a process of determining the sections of the digest that have higher scores (high score sections) as digest sections (high score section determination process) is performed (step S903).

The high score section determination process shown in step S903 is substantially the same as the high score section determination process in offline processing shown in step S205 of FIG. 5, but differs from it in part. Specifically, in offline processing, the high score section determination process is performed on the entire audio information in order to determine the digest sections within that audio information. In online processing, by contrast, as described with reference to FIG. 17, input frames are added to the digest unconditionally until the time length of the audio information acquired so far reaches the digest length, so a provisional digest, so to speak, has already been generated before the high score section determination process is performed. In online processing, the high score section determination process is performed when an input frame has been added and the current digest length exceeds the set digest length by one frame, in order to find a lower-scoring section in the digest and determine the frame to be deleted. In other words, in online processing, the high score section determination process is performed on the digest.

For the same reason, in offline processing, a section of the audio information that was not determined as a high score section is naturally not adopted as a digest section. In online processing, on the other hand, even if the digest contains a section that was not determined as a high score section, only one frame is deleted from the digest at a time, so the whole of that section cannot be deleted from the digest. That is, in online processing, a section that was not determined as a high score section as a result of the high score section determination process can remain in the digest. In the following description, such a section is called a deletion target section. From the deletion target sections, for example, the frame with the lowest score is selected as the frame to be deleted. A deletion target section can thus be regarded as a section that currently exists in the digest but should eventually be deleted as audio information continues to be input and the digest is updated.

In addition, in online processing, input frames are added to the digest and frames are deleted from it as described above, so when the scores of the frames in the digest are arranged in chronological order, there can be points at which the score becomes discontinuous. In the high score section determination process in offline processing described above, the entire audio information is processed and such discontinuity points need not be considered, but the high score section determination process in online processing requires additional processing to deal with them.

The high score section determination process in online processing shown in step S903 will be described in more detail later with reference to FIGS. 19 to 22.

When the high score sections have been determined in step S903, it is determined whether there is a deletion target section, that is, a section that was not determined as a high score section as a result of the high score section determination process (step S905). If it is determined in step S905 that a deletion target section exists, one frame with a lower score is selected from the deletion target section (step S907). The selected frame is then deleted from the digest (step S911).

On the other hand, if it is determined in step S905 that no deletion target section exists, one frame with a lower score is selected from the digest as a whole (step S909). The selected frame is then deleted from the digest (step S911).
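The selection in steps S905 to S909 can be sketched in Python as follows. The sketch assumes that the result of the high score section determination process is available as one boolean flag per digest frame, and that "a frame with a lower score" means the lowest-scoring candidate; both the function name and this data layout are hypothetical.

```python
def select_frame_to_delete(scores, in_high_score_section):
    """Return the position in the digest of the frame to delete.

    scores:                per-frame scores of the digest, in time order
    in_high_score_section: per-frame flags, True if the frame belongs to a
                           high score section (assumed representation)
    """
    # S905: frames outside every high score section form the deletion target sections.
    candidates = [i for i, keep in enumerate(in_high_score_section) if not keep]
    if not candidates:
        # S909: no deletion target section, so choose from the whole digest.
        candidates = list(range(len(scores)))
    # S907 / S909: take the lowest-scoring candidate (assumption).
    return min(candidates, key=lambda i: scores[i])


scores = [0.9, 0.1, 0.8, 0.3]
with_target = select_frame_to_delete(scores, [True, True, True, False])
without_target = select_frame_to_delete(scores, [True, True, True, True])
```

In the first call, frame 1 has the lowest score but lies inside a high score section (for example, as a bridged gap), so frame 3 in the deletion target section is chosen instead; in the second call, no deletion target section exists and frame 1 is chosen.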

(4-2-3. High score section determination process)
With reference to FIGS. 19 to 22, the high score section determination process in online processing shown in step S903 of FIG. 18, whose detailed description was omitted above, will now be described in detail. FIG. 19 is an explanatory diagram for describing the high score section determination process in online processing. FIGS. 20 to 22 are flowcharts showing an example of the processing procedure of the high score section determination process in online processing.

In FIG. 19, the horizontal axis represents the time of the audio information, the vertical axis represents the score calculated for each frame, and the relationship between the two is plotted. In the high score section determination process, it is determined for each frame, in chronological order, whether to include that frame in a digest section. The meanings of the current frame, the current digest section, the continuous section, and the discontinuous section are the same as in the high score section determination process in offline processing shown in FIG. 7.

However, as described above, online processing differs from offline processing in that the digest is the processing target. Therefore, as illustrated, deleting frames from the digest can produce points at which the score becomes discontinuous (discontinuity points) when the scores of the frames in the digest are arranged in chronological order. Also as described above, the digest can contain deletion target sections, that is, sections that exist in the digest but were not determined as high score sections (digest sections) as a result of the high score section determination process.

With reference to FIGS. 20 to 22, the specific processing procedure of the high score section determination process in online processing will be described. This procedure is substantially the same as the processing procedure of the high score section determination process in offline processing described with reference to FIGS. 8 and 9, except that the processing target is the digest rather than the entire audio information and that the processes shown in steps S1119 to S1123, described later, are added. Accordingly, the following description omits details that overlap with the offline procedure and mainly covers the differences.

Referring to FIGS. 20 to 22, in the high score section determination process in online processing, first, the frame index is set to zero (step S1101) and the digest section index is set to zero (step S1103). These processes are the same as the processes shown in steps S301 and S303 of FIGS. 8 and 9.

The subsequent processes shown in steps S1105 to S1117 are the same as the processes shown in steps S305 to S317 of FIGS. 8 and 9. Specifically, in step S1105, it is determined whether the score of the current frame is greater than the score threshold. If the score of the current frame is determined to be equal to or lower than the score threshold, the process proceeds to step S1119 without including the current frame in a digest section. If the score of the current frame is determined to be greater than the score threshold, the process proceeds to steps S1107 to S1117, where the current frame is included in a digest section.

In steps S1107 to S1117, if the discontinuous section length is smaller than the maximum discontinuous section length, the discontinuous section and the current frame are connected to the current digest section (step S1109). If the discontinuous section length is greater than or equal to the maximum discontinuous section length and the continuous section before the discontinuous section is greater than or equal to the minimum continuous section length, the continuous section before the discontinuous section is fixed as one digest section, the digest section index is incremented by one, and the current frame is set as the start time of the new current digest section (steps S1113 and S1115). If the discontinuous section length is greater than or equal to the maximum discontinuous section length and the continuous section before the discontinuous section is smaller than the minimum continuous section length, the continuous section before the discontinuous section is discarded (that is, it becomes a deletion target section), and the current frame is set as the start time of the current digest section (step S1117). When any one of steps S1109, S1115, and S1117 ends, the process proceeds to step S1119.

In step S1119, it is determined whether the current frame is a discontinuity point. If the current frame is determined not to be a discontinuity point, no special processing is performed, and the process proceeds to step S1125.

If, on the other hand, the current frame is determined in step S1119 to be a discontinuity point, the process proceeds to step S1123. In step S1123, it is determined whether the current digest section length is greater than the minimum continuous section length. In other words, step S1123 determines whether the digest section immediately before the discontinuity point is significant in terms of time length (that is, whether it is long enough for the sound to be identified).

If the current digest section length is determined in step S1123 to be greater than the minimum continuous section length, the current digest section is considered significant in terms of time length, so the digest section is adopted and the process proceeds to step S1125. If the current digest section length is determined in step S1123 to be less than or equal to the minimum continuous section length, the current digest section is not considered significant in terms of time length, so the digest section is discarded (that is, it becomes a deletion target section) and the process proceeds to step S1125.

The subsequent processes shown in steps S1125 to S1131 are the same as those shown in steps S319 to S325 of FIGS. 8 and 9. Specifically, in step S1125, it is determined whether the end of the audio information has been reached. If the end has not been reached, the frame index is incremented by one (that is, the frame to be processed is set to the next frame) (step S1127), and the processing from step S1105 onward is repeated.

If, on the other hand, it is determined in step S1125 that the end of the audio information has been reached, the process proceeds to step S1121, where it is determined whether the current digest section length is greater than the minimum continuous section length, that is, whether the digest section processed last is significant in terms of time length.

If the current digest section length is determined in step S1121 to be greater than the minimum continuous section length, the current digest section is considered significant in terms of time length, so the digest section is adopted and the series of processes ends. If the current digest section length is determined in step S1121 to be less than or equal to the minimum continuous section length, the current digest section is not considered significant in terms of time length, so the digest section is discarded (that is, it becomes a deletion target section) and the series of processes ends.
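The flow above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the function name, the representation of the digest as (original frame index, score) pairs, and the reading of a "discontinuity point" as a frame whose successor in the digest is not its successor in the original audio are all assumptions made for illustration.

```python
def high_score_sections(frames, threshold, max_gap, min_len):
    """Hypothetical sketch of the online high score section determination.

    frames: list of (orig_index, score) pairs sorted by orig_index
            (the digest; indices may be non-contiguous).
    Returns (kept, discarded): lists of (start, end) inclusive index pairs.
    """
    kept, discarded = [], []
    start = end = None  # current digest section, in original frame indices

    def close(strict):
        # Adopt or discard the current section based on its length.
        nonlocal start, end
        if start is None:
            return
        length = end - start + 1
        ok = length > min_len if strict else length >= min_len
        (kept if ok else discarded).append((start, end))
        start = end = None

    for pos, (idx, score) in enumerate(frames):
        if score > threshold:                      # cf. step S1105
            if start is None:
                start = end = idx
            elif idx - end - 1 < max_gap:          # short gap: bridge it
                end = idx
            else:                                  # long gap: settle old section
                close(strict=False)
                start = end = idx
        # Assumed meaning of "discontinuity point": the next digest frame
        # is not adjacent in the original audio (or the input has ended).
        last = pos == len(frames) - 1
        if last or frames[pos + 1][0] != idx + 1:
            close(strict=True)
    return kept, discarded
```

In this sketch, sections closed by a long gap use the "greater than or equal to" comparison, while sections closed at a discontinuity point or at the end of the input use the strict "greater than" comparison, mirroring the wording of the respective steps.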

The processing procedure of the digest section determination process in the single sound source mode in online processing has been described above.

(4-3. Multiple sound source mode)
(4-3-1. Processing procedure of the digest section determination process)
The processing procedure of the digest section determination process in the multiple sound source mode in online processing will be described with reference to FIG. 23. FIG. 23 is a flowchart showing an example of the processing procedure of the digest section determination process in the multiple sound source mode in online processing.

Note that the digest section determination process in the multiple sound source mode shown in FIG. 23 is obtained by changing part of the digest section determination process in the single sound source mode described with reference to FIG. 17 (specifically, the process shown in step S1205 described later); the other processes are substantially the same as in the single sound source mode. Accordingly, the following description omits details of matters that overlap with the single sound source mode procedure and mainly covers the differences.

Referring to FIG. 23, in the digest section determination process in the multiple sound source mode, it is first determined whether the current digest length is shorter than the digest length (the set value of the time length of the digest) (step S1201). If the current digest length is determined to be shorter than the digest length, the input frame is added to the digest and the digest average score is updated (step S1203). The processes shown in steps S1201 and S1203 are the same as those in steps S801 and S803 of FIG. 17.

If the current digest length is determined in step S1201 to be greater than or equal to the digest length, the process proceeds to step S1205. In step S1205, the score of the input frame is compared with the digest average score for each sound source type, and it is determined whether the score of the input frame is greater than or equal to the digest average score for at least one sound source type. If the score of the input frame is determined in step S1205 to be smaller than the digest average score for every sound source type, the digest section determination process ends without adding the input frame to the digest.

If, on the other hand, the score of the input frame is determined in step S1205 to be greater than or equal to the digest average score for at least one sound source type, the process proceeds to step S1207. The subsequent processes shown in steps S1207 to S1211 are the same as those in steps S807 to S811 of FIG. 17. That is, the input frame is added to the digest and the digest average score is updated (step S1207). Next, the frame deletion process is performed (step S1209); once a frame has been deleted, the digest average score is updated (step S1211), and the digest section determination process ends.
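As a rough illustration of steps S1201 to S1211, the following Python sketch processes one input frame. The names `update_digest` and `drop_lowest`, the representation of a frame as a dictionary from sound source type to score, and the placeholder deletion rule are assumptions for illustration only; the frame deletion process of step S1209 is described separately in the text. The sketch also assumes a target length of at least one frame and recomputes the averages on demand instead of updating them incrementally.

```python
def drop_lowest(digest):
    # Placeholder for the frame deletion process (step S1209):
    # removes the frame with the lowest summed score. Hypothetical rule.
    victim = min(range(len(digest)), key=lambda i: sum(digest[i].values()))
    del digest[victim]


def update_digest(digest, frame, target_len, delete_one=drop_lowest):
    """Hypothetical sketch of one pass of the multi-source digest update.

    digest: list of frames, each a dict {sound_source_type: score}.
    frame:  the input frame, same shape.
    Returns True if the frame was added to the digest.
    """
    if len(digest) < target_len:            # cf. step S1201
        digest.append(frame)                # cf. step S1203
        return True
    # Per-type digest average score, recomputed from the current digest.
    averages = {t: sum(f[t] for f in digest) / len(digest) for t in frame}
    if not any(frame[t] >= averages[t] for t in frame):   # cf. step S1205
        return False                        # frame is not added
    digest.append(frame)                    # cf. step S1207
    delete_one(digest)                      # cf. step S1209
    return True
```

Because one frame is added and one is removed once the target length has been reached, the digest length stays constant from that point on in this sketch.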

(4-3-2. Frame deletion process)
The frame deletion process shown in step S1209 of FIG. 23 will now be described in detail with reference to FIG. 24. FIG. 24 is a flowchart showing an example of the processing procedure of the frame deletion process in the multiple sound source mode in online processing.

Referring to FIG. 24, in the frame deletion process in the multiple sound source mode in online processing, the digest average score is first set as the score threshold for each sound source type (step S1301). Next, the type digest length is set (step S1303). In step S1303, the type digest length may be set by the same method as in step S405 of the digest section determination process in the multiple sound source mode in offline processing shown in FIG. 10.

Then, using the set score threshold, a process of determining sections having higher scores within the digest (high score sections) as digest sections (the high score section determination process) is performed (step S1305). The process shown in step S1305 is the same as the process in step S903 of FIG. 18, that is, the series of processes shown in FIGS. 20 to 22, so its detailed description is omitted. In the frame deletion process in the multiple sound source mode, however, the high score section determination process is performed for each sound source type.

When the high score sections have been determined in step S1305, it is determined, based on the result of the high score section determination process, whether a deletion target section exists for any sound source type (step S1307). If it is determined in step S1307 that a deletion target section exists for some sound source type, one frame with a lower score is selected from the deletion target section of that sound source type (step S1309). The selected frame is then deleted from the digest (step S1315).

If, on the other hand, it is determined in step S1307 that no deletion target section exists for any sound source type, the sound source type whose total digest section length exceeds its type digest length by the largest amount is selected (step S1311). For the selected sound source type, one frame with a lower score is selected (step S1313). The selected frame is then deleted from the digest (step S1315).
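The selection logic of steps S1307 to S1313 might be sketched in Python as follows. The function name and the data layout are hypothetical. The text only says that a frame "with a lower score" is selected; this sketch assumes the lowest-scoring candidate is taken, and in the second branch it assumes the candidates are all frames currently in the digest.

```python
def choose_frame_to_delete(scores, deletion_targets, section_total, type_quota):
    """Hypothetical sketch of choosing one frame to delete.

    scores:           {type: [score of each digest frame]}
    deletion_targets: {type: [positions of frames in deletion target sections]}
    section_total:    {type: total digest section length, in frames}
    type_quota:       {type: type digest length, in frames}
    Returns the position (index into the digest) of the frame to delete.
    """
    # cf. steps S1307 and S1309: a deletion target section exists for some type.
    for t, positions in deletion_targets.items():
        if positions:
            return min(positions, key=lambda i: scores[t][i])
    # cf. step S1311: the type that exceeds its type digest length the most.
    worst = max(type_quota, key=lambda t: section_total[t] - type_quota[t])
    # cf. step S1313: assumed to pick the lowest score of that type overall.
    return min(range(len(scores[worst])), key=lambda i: scores[worst][i])
```

The returned position would then be removed from the digest, corresponding to step S1315.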

The processing procedure of the digest section determination process in the multiple sound source mode in online processing has been described above.

(4-4. Diversity reflection mode)
The processing procedure of the digest section determination process in the diversity reflection mode in online processing is the same as that of the digest section determination process in the multiple sound source mode in online processing described with reference to FIG. 23. In the diversity reflection mode, however, the details of the frame deletion process shown in step S1209 of FIG. 23 differ from those in the multiple sound source mode. Accordingly, the following description of the digest section determination process in the diversity reflection mode in online processing mainly covers the details of the frame deletion process.

Note that, in online processing as in offline processing, each process in the diversity reflection mode can be executed by the information processing apparatus 120 shown in FIG. 12.

(4-4-1. Processing procedure of the frame deletion process)
The processing procedure of the frame deletion process in the diversity reflection mode in online processing will be described with reference to FIG. 25. FIG. 25 is a flowchart showing an example of the processing procedure of the frame deletion process in the diversity reflection mode in online processing.

Because the diversity reflection mode determines digest sections in consideration of diversity within the same sound source type, the sound source types to be included in the digest may be a single sound source type or a plurality of sound source types. As an example, FIG. 25 illustrates the processing procedure for the case where sounds of a plurality of sound source types are included in the digest.

Note that each process in the frame deletion process in the diversity reflection mode is the same as the corresponding process in the frame deletion process in the multiple sound source mode described with reference to FIG. 24, except for the process shown in step S1413 described later. Accordingly, the following description mainly covers matters that differ from the frame deletion procedure in the multiple sound source mode and omits details of overlapping matters.

Referring to FIG. 25, in the frame deletion process in the diversity reflection mode in online processing, the digest average score is first set as the score threshold for each sound source type (step S1401), and then the type digest length is set (step S1403). The high score section determination process is then performed for each sound source type using the set score threshold (step S1405). These processes are the same as those in steps S1301 to S1305 of FIG. 24.

Next, it is determined, based on the result of the high score section determination process, whether a deletion target section exists for any sound source type (step S1407). If a deletion target section is determined to exist for some sound source type, one frame with a lower score is selected from the deletion target section of that sound source type (step S1409), and the selected frame is deleted from the digest (step S1415). These processes are the same as those in steps S1307, S1309, and S1315 of FIG. 24.

If, on the other hand, it is determined in step S1407 that no deletion target section exists for any sound source type, the sound source type whose total digest section length exceeds its type digest length by the largest amount is selected (step S1411). For the selected sound source type, a process of selecting the frame to be deleted in consideration of diversity within that sound source type (the diversity-based deletion frame selection process) is performed (step S1413). The selected frame is then deleted from the digest (step S1415).

(4-4-2. Diversity-based deletion frame selection process)
The diversity-based deletion frame selection process shown in step S1413 of FIG. 25 will be described in detail with reference to FIG. 26. FIG. 26 is a flowchart showing an example of the processing procedure of the diversity-based deletion frame selection process in online processing.

Referring to FIG. 26, in the diversity-based deletion frame selection process in online processing, the variance of the feature vectors in the feature space is first calculated for each of n cases, namely the case of all frames and the cases in which any one frame is excluded (step S1501).

Next, the variance of the frame times is calculated for each of the same n cases, namely the case of all frames and the cases in which any one frame is excluded (step S1503).

Next, the variance of the feature vectors and the variance of the times are weighted and summed, and the frame whose exclusion yields the smallest reduction from the value for all frames is determined as the frame to be deleted (step S1505). In other words, in step S1505, the frame whose feature vector and time have the least influence on the variances when left out of the calculation is determined as the frame to be deleted. As a result, the frames to be included in the digest are selected so that the variances of the feature vectors and the times become larger.
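Steps S1501 to S1505 amount to a leave-one-out comparison of a weighted variance sum. The Python sketch below shows one way this could look; the function names, the use of population variance summed over dimensions, and the default weights are assumptions, since the text does not specify them. At least two frames are assumed.

```python
def variance(vectors):
    # Population variance of a set of vectors, summed over dimensions.
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    return sum(sum((v[d] - mean[d]) ** 2 for d in range(dim)) for v in vectors) / n


def select_frame_by_diversity(features, times, w_feat=1.0, w_time=1.0):
    """Hypothetical sketch: return the index of the frame to delete.

    features: list of feature vectors, one per frame of the selected type.
    times:    list of frame times, same order.
    """
    def spread(indices):
        # Weighted sum of feature variance and time variance (cf. S1505).
        f = [features[i] for i in indices]
        t = [[times[i]] for i in indices]
        return w_feat * variance(f) + w_time * variance(t)

    everything = list(range(len(features)))
    full = spread(everything)
    # Reduction caused by leaving out frame k; the smallest reduction wins.
    return min(
        everything,
        key=lambda k: full - spread([i for i in everything if i != k]),
    )
```

For example, with features `[[0.0], [0.1], [5.0]]` and times `[0, 1, 10]`, the second frame is selected, because the remaining two frames are the most spread out in both feature space and time.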

The processing procedure of the digest section determination process in the diversity reflection mode in online processing has been described above with reference to FIG. 25. The diversity-based deletion frame selection process shown in step S1413 of FIG. 25 has also been described with reference to FIG. 26.

(5. Modifications)
Several modifications of the embodiment described above will now be described. The matters described in the above embodiment and in each of the modifications below may be combined with one another to the extent possible.

(5-1. Modification provided with a sound collection function)
A modification in which the information processing apparatus is provided with a sound collection function will be described with reference to FIG. 27. FIG. 27 is a functional block diagram showing an example of the functional configuration of an information processing apparatus according to the modification provided with a sound collection function.

Referring to FIG. 27, the information processing apparatus 130 according to this modification has, as its functions, a feature amount extraction unit 111, a sound source type score calculation unit 113, a digest section determination unit 115, and a sound collection unit 131. The functions of the feature amount extraction unit 111, the sound source type score calculation unit 113, and the digest section determination unit 115 are the same as those of the corresponding functional blocks in the information processing apparatus 110 shown in FIG. 1, so their detailed description is omitted.

The sound collection unit 131 is constituted by a sound collection device such as a microphone, and has a function of collecting external sound and inputting it to the information processing apparatus 130 as audio information. The sound collection unit 131 provides the audio information on the collected external sound to the feature amount extraction unit 111. The feature amount extraction unit 111, the sound source type score calculation unit 113, and the digest section determination unit 115 perform the various processes according to the embodiment described above (the feature amount extraction process, the sound source type score calculation process, and the digest section determination process) on the audio information provided from the sound collection unit 131.

Note that the sound collection unit 131 may be constituted by a single microphone or by a plurality of microphones arranged at mutually different positions. When the sound collection unit 131 is constituted by a plurality of microphones arranged at mutually different positions, the feature amount extraction unit 111 can calculate various feature amounts that become available only with multiple microphones, such as the correlation between sound collection positions and the sound source direction.

The modification in which the information processing apparatus is provided with a sound collection function has been described above with reference to FIG. 27. As described, according to this modification, the information processing apparatus 130 itself has a function of collecting external sound and can output digest section information for the audio information on the collected external sound. Such an information processing apparatus 130 can be, for example, an IC recorder or a smartphone on which application software for recording external sound is installed.

(5-2. Modification provided with a digest generation function)
A modification in which the information processing apparatus is provided with a digest generation function will be described with reference to FIG. 28. FIG. 28 is a functional block diagram showing an example of the functional configuration of an information processing apparatus according to the modification provided with a digest generation function.

Referring to FIG. 28, the information processing apparatus 140 according to this modification has, as its functions, a feature amount extraction unit 111, a sound source type score calculation unit 113, a digest section determination unit 115, and an output sound generation unit 141. The functions of the feature amount extraction unit 111, the sound source type score calculation unit 113, and the digest section determination unit 115 are the same as those of the corresponding functional blocks in the information processing apparatus 110 shown in FIG. 1, so their detailed description is omitted.

The output sound generation unit 141 is constituted by various processors and, based on the audio information and the digest section information generated by the digest section determination unit 115, generates a digest of the audio information in a data format that can be output by an audio output device. When generating the digest, the output sound generation unit 141 may perform various known audio processes as appropriate in consideration of listening comfort for the user, such as applying a crossfade to the joints between digest sections. The output sound generation unit 141 outputs the audio information corresponding to the generated digest (output audio information) to an audio output device such as a speaker, and the audio output device outputs the digest as sound.
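The text does not specify how the crossfade is performed. As one possible illustration, the Python sketch below joins digest sections with a simple linear crossfade; the function name, the representation of each section as a list of float samples, and the linear fade shape are all assumptions.

```python
def crossfade_concat(sections, fade):
    """Hypothetical sketch: join sections with a linear crossfade.

    sections: list of digest sections, each a list of float samples.
    fade:     overlap length in samples at each joint.
    """
    out = list(sections[0]) if sections else []
    for sec in sections[1:]:
        # The overlap cannot be longer than either side of the joint.
        n = min(fade, len(out), len(sec))
        base = len(out) - n
        for k in range(n):
            w = (k + 1) / (n + 1)  # weight of the incoming section
            out[base + k] = out[base + k] * (1 - w) + sec[k] * w
        out.extend(sec[n:])
    return out
```

With `fade=0` this reduces to plain concatenation; a larger `fade` shortens the output by the overlap at each joint.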

The modification in which the information processing apparatus is provided with a digest generation function has been described above with reference to FIG. 28. As described, according to this modification, the information processing apparatus 140 itself has a function of generating a digest and can output the generated digest from an audio output device provided in the information processing apparatus 140 itself or from an audio output device external to the information processing apparatus 140.

Note that when the information processing apparatus 140 itself has an audio output device and can reproduce the digest, the information processing apparatus 140 may generate the digest automatically upon acquiring the audio information. In that case, the information processing apparatus 140 may reproduce the digest in response to a simple operation, for example an operation using a GUI (Graphical User Interface) such as placing a pointer over a file name representing the audio information on a display screen, or a preview operation. With the information processing apparatus 140 configured in this way, the user does not need to perform a separate operation to generate the digest and can listen to the digest with a simple operation. The user can therefore check the digest of audio information much as one would check a thumbnail of video information, which further improves user convenience.

(5-3. Modification provided with an audio information database)
A modification in which the information processing apparatus is provided with an audio information database will be described with reference to FIG. 29. FIG. 29 is a functional block diagram showing an example of the functional configuration of an information processing apparatus according to the modification provided with an audio information database.

Referring to FIG. 29, the information processing apparatus 150 according to this modification has, as its functions, a feature amount extraction unit 111, a sound source type score calculation unit 113, a digest section determination unit 115, and an audio information database 151. The functions of the feature amount extraction unit 111, the sound source type score calculation unit 113, and the digest section determination unit 115 are the same as those of the corresponding functional blocks in the information processing apparatus 110 shown in FIG. 1, so their detailed description is omitted.

The audio information database 151 is constituted by a storage device such as an HDD and stores audio information organized as a database. By accessing the audio information database 151, the feature amount extraction unit 111 can extract feature amounts from any audio information in the audio information database 151. That is, according to this modification, the feature amount extraction unit 111, the sound source type score calculation unit 113, and the digest section determination unit 115 perform the various processes according to the embodiment described above (the feature amount extraction process, the sound source type score calculation process, and the digest section determination process) on the audio information stored as a database in a storage unit provided in the information processing apparatus 150.

以上、図２９を参照して、情報処理装置に音声情報データベースが設けられる変形例について説明した。以上説明したように、本変形例によれば、情報処理装置１５０自身が音声情報が格納されたデータベースを有し、当該データベース内の音声情報のダイジェスト区間情報を出力することができる。 The modification in which the information processing apparatus is provided with an audio information database has been described above with reference to FIG. 29. As described above, according to this modification, the information processing apparatus 150 itself has a database in which audio information is stored, and can output digest section information for the audio information in that database.

（６．ハードウェア構成） (6. Hardware configuration)
次に、図３０を参照して、本実施形態に係る情報処理装置のハードウェア構成について説明する。図３０は、本実施形態に係る情報処理装置のハードウェア構成の一例を示すブロック図である。なお、図３０に示す情報処理装置９００は、例えば、図１、図１２、図２７−図２９に示す情報処理装置１１０、１２０、１３０、１４０、１５０の機能構成を実現し得る。 Next, a hardware configuration of the information processing apparatus according to the present embodiment will be described with reference to FIG. 30. FIG. 30 is a block diagram illustrating an example of the hardware configuration of the information processing apparatus according to the present embodiment. The information processing apparatus 900 illustrated in FIG. 30 can realize, for example, the functional configurations of the information processing apparatuses 110, 120, 130, 140, and 150 illustrated in FIGS. 1, 12, and 27 to 29.

情報処理装置９００は、ＣＰＵ９０１、ＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）９０３及びＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）９０５を備える。また、情報処理装置９００は、ホストバス９０７、ブリッジ９０９、外部バス９１１、インターフェース９１３、入力装置９１５、出力装置９１７、ストレージ装置９１９、通信装置９２１、ドライブ９２３及び接続ポート９２５を備えてもよい。情報処理装置９００は、ＣＰＵ９０１に代えて、又はこれとともに、ＤＳＰ若しくはＡＳＩＣと呼ばれるような処理回路を有してもよい。 The information processing apparatus 900 includes a CPU 901, a ROM (Read Only Memory) 903, and a RAM (Random Access Memory) 905. The information processing apparatus 900 may include a host bus 907, a bridge 909, an external bus 911, an interface 913, an input device 915, an output device 917, a storage device 919, a communication device 921, a drive 923, and a connection port 925. The information processing apparatus 900 may include a processing circuit called a DSP or an ASIC instead of or together with the CPU 901.

ＣＰＵ９０１は、演算処理装置及び制御装置として機能し、ＲＯＭ９０３、ＲＡＭ９０５、ストレージ装置９１９又はリムーバブル記録媒体９２９に記録された各種のプログラムに従って、情報処理装置９００内の動作全般又はその一部を制御する。ＲＯＭ９０３は、ＣＰＵ９０１が使用するプログラムや演算パラメータ等を記憶する。ＲＡＭ９０５は、ＣＰＵ９０１の実行において使用するプログラムや、その実行時のパラメータ等を一次記憶する。ＣＰＵ９０１、ＲＯＭ９０３及びＲＡＭ９０５は、ＣＰＵバス等の内部バスにより構成されるホストバス９０７により相互に接続されている。更に、ホストバス９０７は、ブリッジ９０９を介して、ＰＣＩ（ＰｅｒｉｐｈｅｒａｌＣｏｍｐｏｎｅｎｔＩｎｔｅｒｃｏｎｎｅｃｔ／Ｉｎｔｅｒｆａｃｅ）バス等の外部バス９１１に接続されている。ＣＰＵ９０１は、例えば、上述した実施形態における特徴量抽出部１１１、音源種別スコア算出部１１３、ダイジェスト区間決定部１１５及び出力音声生成部１４１を構成し得る。 The CPU 901 functions as an arithmetic processing unit and a control unit, and controls all or a part of the operation in the information processing apparatus 900 according to various programs recorded in the ROM 903, the RAM 905, the storage apparatus 919, or the removable recording medium 929. The ROM 903 stores programs used by the CPU 901, calculation parameters, and the like. The RAM 905 temporarily stores programs used in the execution of the CPU 901, parameters at the time of execution, and the like. The CPU 901, the ROM 903, and the RAM 905 are connected to each other by a host bus 907 configured by an internal bus such as a CPU bus. Further, the host bus 907 is connected to an external bus 911 such as a PCI (Peripheral Component Interconnect / Interface) bus via a bridge 909. The CPU 901 can configure, for example, the feature amount extraction unit 111, the sound source type score calculation unit 113, the digest section determination unit 115, and the output audio generation unit 141 in the above-described embodiment.

入力装置９１５は、例えば、マウス、キーボード、タッチパネル、ボタン、スイッチ及びレバー等、ユーザによって操作される装置によって構成される。また、入力装置９１５は、例えば、赤外線やその他の電波を利用したリモートコントロール装置（いわゆる、リモコン）であってもよいし、情報処理装置９００の操作に対応した携帯電話やＰＤＡ等の外部接続機器９３１であってもよい。更に、入力装置９１５は、例えば、上記の操作手段を用いてユーザにより入力された情報に基づいて入力信号を生成し、ＣＰＵ９０１に出力する入力制御回路などから構成されている。情報処理装置９００のユーザは、この入力装置９１５を操作することにより、情報処理装置９００に対して各種のデータを入力したり処理動作を指示したりすることができる。本実施形態では、入力装置９１５を介して、例えばダイジェスト区間決定処理を開始する旨の指示や、モードの切り替え指示等が、情報処理装置１１０、１２０、１３０、１４０、１５０に入力されてよい。 The input device 915 is configured by a device operated by the user, such as a mouse, a keyboard, a touch panel, a button, a switch, or a lever. The input device 915 may also be, for example, a remote control device (a so-called remote controller) that uses infrared rays or other radio waves, or an external connection device 931 such as a mobile phone or a PDA that supports operation of the information processing apparatus 900. Furthermore, the input device 915 includes, for example, an input control circuit that generates an input signal based on information input by the user using the above operation means and outputs the input signal to the CPU 901. By operating the input device 915, a user of the information processing apparatus 900 can input various data to the information processing apparatus 900 and instruct it to perform processing operations. In the present embodiment, for example, an instruction to start the digest section determination processing, a mode switching instruction, or the like may be input to the information processing apparatuses 110, 120, 130, 140, and 150 via the input device 915.

また、入力装置９１５は、周囲の音声を収音し、当該周囲の音声を音声情報として情報処理装置９００に入力するマイクロフォンであってもよい。入力装置９１５がマイクロフォンである場合には、当該入力装置９１５は、上述した実施形態における音声収音部１３１を構成し得る。 Further, the input device 915 may be a microphone that picks up surrounding sounds and inputs the surrounding sounds to the information processing apparatus 900 as sound information. When the input device 915 is a microphone, the input device 915 can constitute the sound collection unit 131 in the above-described embodiment.

出力装置９１７は、取得した情報をユーザに対して視覚的又は聴覚的に通知することが可能な装置で構成される。このような装置として、ＣＲＴディスプレイ装置、液晶ディスプレイ装置、プラズマディスプレイ装置、ＥＬディスプレイ装置及びランプ等の表示装置や、スピーカ及びヘッドホン等の音声出力装置や、プリンタ装置等がある。出力装置９１７は、例えば、情報処理装置９００が行った各種処理により得られた結果を出力する。具体的には、表示装置は、情報処理装置９００が行った各種処理により得られた結果を、テキスト、イメージ、表、グラフ等、様々な形式で視覚的に表示する。他方、音声出力装置は、再生された音声データや音響データ等からなるオーディオ信号をアナログ信号に変換して聴覚的に出力する。本実施形態では、当該音声出力装置を介して、例えば、情報処理装置１４０によって生成される音声情報のダイジェストが出力されてよい。また、当該表示装置には、入力装置９１５を介して各種の指示を入力するためのＧＵＩに係る表示が表示されてもよい。 The output device 917 is a device that can notify the user of the acquired information visually or audibly. Examples of such devices include CRT display devices, liquid crystal display devices, plasma display devices, EL display devices, display devices such as lamps, audio output devices such as speakers and headphones, printer devices, and the like. For example, the output device 917 outputs results obtained by various processes performed by the information processing apparatus 900. Specifically, the display device visually displays results obtained by various processes performed by the information processing device 900 in various formats such as text, images, tables, and graphs. On the other hand, the audio output device converts an audio signal composed of reproduced audio data, acoustic data, and the like into an analog signal and outputs it aurally. In the present embodiment, for example, a digest of voice information generated by the information processing apparatus 140 may be output via the voice output apparatus. Further, a display related to a GUI for inputting various instructions via the input device 915 may be displayed on the display device.

ストレージ装置９１９は、情報処理装置９００の記憶部の一例として構成されたデータ格納用の装置である。ストレージ装置９１９は、例えば、ＨＤＤ等の磁気記憶部デバイス、半導体記憶デバイス、光記憶デバイス又は光磁気記憶デバイス等により構成される。このストレージ装置９１９は、ＣＰＵ９０１が実行するプログラムや各種データ及び外部から取得した各種のデータ等を格納する。ストレージ装置９１９は、例えば、上述した実施形態における音声情報データベース１５１を構成し得る。 The storage device 919 is a data storage device configured as an example of a storage unit of the information processing device 900. The storage device 919 includes, for example, a magnetic storage device such as an HDD, a semiconductor storage device, an optical storage device, or a magneto-optical storage device. The storage device 919 stores programs executed by the CPU 901, various data, various data acquired from the outside, and the like. The storage device 919 can constitute, for example, the audio information database 151 in the above-described embodiment.

通信装置９２１は、例えば、通信網（ネットワーク）９２７に接続するための通信デバイス等で構成された通信インターフェースである。通信装置９２１は、例えば、有線若しくは無線ＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）、Ｂｌｕｅｔｏｏｔｈ（登録商標）又はＷＵＳＢ（ＷｉｒｅｌｅｓｓＵＳＢ）用の通信カード等である。また、通信装置９２１は、光通信用のルータ、ＡＤＳＬ（ＡｓｙｍｍｅｔｒｉｃＤｉｇｉｔａｌＳｕｂｓｃｒｉｂｅｒＬｉｎｅ）用のルータ又は各種通信用のモデム等であってもよい。この通信装置９２１は、例えば、インターネットや他の通信機器との間で、例えばＴＣＰ／ＩＰ等の所定のプロトコルに則して信号等を送受信することができる。また、通信装置９２１に接続されるネットワーク９２７は、有線又は無線によって接続されたネットワーク等により構成され、例えば、インターネット、家庭内ＬＡＮ、赤外線通信、ラジオ波通信又は衛星通信等であってもよい。本実施形態では、例えば、情報処理装置１１０、１２０、１３０、１４０、１５０が、通信装置９２１を介して、音声情報やダイジェスト区間情報、出力音声情報等の、情報処理装置１１０、１２０、１３０、１４０、１５０の入出力である各種の情報を、外部の機器との間でやり取りしてよい。 The communication device 921 is a communication interface configured by, for example, a communication device for connecting to a communication network (network) 927. The communication device 921 is, for example, a communication card for wired or wireless LAN (Local Area Network), Bluetooth (registered trademark), or WUSB (Wireless USB). The communication device 921 may also be a router for optical communication, a router for ADSL (Asymmetric Digital Subscriber Line), a modem for various kinds of communication, or the like. The communication device 921 can transmit and receive signals and the like to and from the Internet or other communication devices in accordance with a predetermined protocol such as TCP/IP. The network 927 connected to the communication device 921 is configured by a network connected by wire or wirelessly, and may be, for example, the Internet, a home LAN, infrared communication, radio wave communication, or satellite communication. In the present embodiment, for example, the information processing apparatuses 110, 120, 130, 140, and 150 may exchange, with external devices via the communication device 921, the various kinds of information that are their inputs and outputs, such as audio information, digest section information, and output audio information.

ドライブ９２３は、記録媒体用リーダライタであり、情報処理装置９００に内蔵、あるいは外付けされる。ドライブ９２３は、装着されている磁気ディスク、光ディスク、光磁気ディスク又は半導体メモリ等のリムーバブル記録媒体９２９に記録されている情報を読み出して、ＲＡＭ９０５に出力する。また、ドライブ９２３は、装着されている磁気ディスク、光ディスク、光磁気ディスク又は半導体メモリ等のリムーバブル記録媒体９２９に情報を書き込むことも可能である。リムーバブル記録媒体９２９は、例えば、ＤＶＤメディア、ＨＤ−ＤＶＤメディア、Ｂｌｕ−ｒａｙ（登録商標）メディア等である。また、リムーバブル記録媒体９２９は、コンパクトフラッシュ（登録商標）（ＣｏｍｐａｃｔＦｌａｓｈ：ＣＦ）、フラッシュメモリ又はＳＤメモリカード（ＳｅｃｕｒｅＤｉｇｉｔａｌｍｅｍｏｒｙｃａｒｄ）等であってもよい。また、リムーバブル記録媒体９２９は、例えば、非接触型ＩＣチップを搭載したＩＣカード（ＩｎｔｅｇｒａｔｅｄＣｉｒｃｕｉｔｃａｒｄ）又は電子機器等であってもよい。本実施形態では、例えば情報処理装置１１０、１２０、１３０、１４０、１５０によって処理される各種の情報が、ドライブ９２３によってリムーバブル記録媒体９２９から読み出されたり、リムーバブル記録媒体９２９に書き込まれたりしてもよい。 The drive 923 is a reader/writer for recording media, and is built into or externally attached to the information processing apparatus 900. The drive 923 reads information recorded on a mounted removable recording medium 929, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, and outputs the information to the RAM 905. The drive 923 can also write information to the mounted removable recording medium 929. The removable recording medium 929 is, for example, a DVD medium, an HD-DVD medium, or a Blu-ray (registered trademark) medium. The removable recording medium 929 may also be a CompactFlash (registered trademark) (CF) card, a flash memory, an SD memory card (Secure Digital memory card), or the like. The removable recording medium 929 may further be, for example, an IC card (Integrated Circuit card) on which a non-contact IC chip is mounted, or an electronic device. In the present embodiment, for example, various kinds of information processed by the information processing apparatuses 110, 120, 130, 140, and 150 may be read from the removable recording medium 929 or written to the removable recording medium 929 by the drive 923.

接続ポート９２５は、機器を情報処理装置９００に直接接続するためのポートである。接続ポート９２５の一例として、ＵＳＢ（ＵｎｉｖｅｒｓａｌＳｅｒｉａｌＢｕｓ）ポート、ＩＥＥＥ１３９４ポート及びＳＣＳＩ（ＳｍａｌｌＣｏｍｐｕｔｅｒＳｙｓｔｅｍＩｎｔｅｒｆａｃｅ）ポート等がある。接続ポート９２５の別の例として、ＲＳ−２３２Ｃポート、光オーディオ端子及びＨＤＭＩ（登録商標）（Ｈｉｇｈ−ＤｅｆｉｎｉｔｉｏｎＭｕｌｔｉｍｅｄｉａＩｎｔｅｒｆａｃｅ）ポート等がある。この接続ポート９２５に外部接続機器９３１を接続することで、情報処理装置９００は、外部接続機器９３１から直接各種のデータを取得したり、外部接続機器９３１に各種のデータを提供したりする。本実施形態では、例えば情報処理装置１１０、１２０、１３０、１４０、１５０によって処理される各種の情報が、接続ポート９２５を介して外部接続機器９３１から取得されたり、外部接続機器９３１に出力されたりしてもよい。 The connection port 925 is a port for connecting a device directly to the information processing apparatus 900. Examples of the connection port 925 include a USB (Universal Serial Bus) port, an IEEE 1394 port, and a SCSI (Small Computer System Interface) port. Other examples of the connection port 925 include an RS-232C port, an optical audio terminal, and an HDMI (registered trademark) (High-Definition Multimedia Interface) port. When the external connection device 931 is connected to the connection port 925, the information processing apparatus 900 acquires various data directly from the external connection device 931 and provides various data to the external connection device 931. In the present embodiment, for example, various kinds of information processed by the information processing apparatuses 110, 120, 130, 140, and 150 may be acquired from the external connection device 931 or output to the external connection device 931 via the connection port 925.

以上、本実施形態に係る情報処理装置９００の機能を実現可能なハードウェア構成の一例を示した。上記の各構成要素は、汎用的な部材を用いて構成されていてもよいし、各構成要素の機能に特化したハードウェアにより構成されていてもよい。従って、本実施形態を実施する時々の技術レベルに応じて、適宜、利用するハードウェア構成を変更することが可能である。 Heretofore, an example of the hardware configuration capable of realizing the functions of the information processing apparatus 900 according to the present embodiment has been shown. Each component described above may be configured using a general-purpose member, or may be configured by hardware specialized for the function of each component. Therefore, it is possible to change the hardware configuration to be used as appropriate according to the technical level at the time of carrying out this embodiment.

なお、上述のような本実施形態に係る情報処理装置９００の各機能を実現するためのコンピュータプログラムを作製し、ＰＣ等に実装することが可能である。また、このようなコンピュータプログラムが格納された、コンピュータで読み取り可能な記録媒体も提供することができる。記録媒体は、例えば、磁気ディスク、光ディスク、光磁気ディスク、フラッシュメモリ等である。また、上記のコンピュータプログラムは、記録媒体を用いずに、例えばネットワークを介して配信されてもよい。 Note that a computer program for realizing each function of the information processing apparatus 900 according to the present embodiment as described above can be produced and mounted on a PC or the like. In addition, a computer-readable recording medium storing such a computer program can be provided. The recording medium is, for example, a magnetic disk, an optical disk, a magneto-optical disk, a flash memory, or the like. Further, the above computer program may be distributed via a network, for example, without using a recording medium.

（７．まとめ） (7. Summary)
以上説明したように、本実施形態によれば、音声情報に含まれる音声の音源種別の蓋然性を示す音源種別スコアが算出され、当該音源種別スコアに基づいて、当該音声情報の中から当該音声情報のダイジェストを構成するダイジェスト区間が決定される。従って、例えば、音楽のみをダイジェストに含めたい、人の声のみをダイジェストに含めたい、音楽と人の声とをバランスよくダイジェストに含めたい等、ユーザの多様な要望に応じたダイジェストを生成することが可能になる。よって、ユーザの利便性をより向上させることができる。 As described above, according to the present embodiment, a sound source type score indicating the probability of the sound source type of the sound included in audio information is calculated, and the digest sections that constitute a digest of the audio information are determined from within the audio information on the basis of that score. It is therefore possible to generate digests that meet a variety of user needs: for example, a digest that contains only music, a digest that contains only human voices, or a digest that contains music and human voices in a balanced manner. User convenience can thus be further improved.
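As a rough illustration of determining digest sections from a sound source type score, the sketch below keeps the highest-scoring frames for one chosen sound source type. It is a minimal sketch under assumptions, not the procedure of the embodiment: the function name `pick_digest_frames`, the per-frame score list, and the fixed frame count are all hypothetical.

```python
from typing import List, Sequence


def pick_digest_frames(scores: Sequence[float], n_frames: int) -> List[int]:
    """Return the indices of the n_frames highest-scoring frames, in time order."""
    # Rank frame indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the best n_frames and restore chronological order for playback.
    return sorted(ranked[:n_frames])


# Hypothetical per-frame music scores for a five-frame recording.
music_scores = [0.1, 0.9, 0.8, 0.2, 0.7]
digest = pick_digest_frames(music_scores, n_frames=3)
```

Passing the voice scores instead of the music scores would give a voice-only digest with the same code, which is the sense in which one score-based mechanism can serve different user requests.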

また、モードが設定され、ダイジェストに含まれる音声の音源種別が適宜調整されることにより、よりユーザの意向に沿ったダイジェストを生成することが可能になる。例えば、複数音源モードにおいてノイズスコアに係る音声がダイジェストに含まれる割合を低い値に設定する等、モードを適宜設定することで、ノイズが低減された、よりユーザにとって聞き取りやすいダイジェストを生成することが可能である。 In addition, by setting a mode and appropriately adjusting the sound source types of the sound included in the digest, it is possible to generate a digest that better matches the user's intention. For example, by setting the mode appropriately, such as setting a low value for the proportion of the digest occupied by sound associated with the noise score in the multiple sound source mode, it is possible to generate a digest with reduced noise that is easier for the user to listen to.
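The effect of giving noise a low share in the multiple sound source mode can be sketched as follows. This is an illustration under assumptions only: the function name `multi_source_digest`, the ratio dictionary, and the greedy per-type selection are hypothetical and are not the selection procedure defined in the embodiment.

```python
from typing import Dict, List, Sequence


def multi_source_digest(
    scores_by_type: Dict[str, Sequence[float]],
    ratios: Dict[str, float],
    n_frames: int,
) -> List[int]:
    """Pick frames so that each sound source type fills roughly its share of the digest."""
    chosen = set()
    for kind, ratio in ratios.items():
        quota = round(ratio * n_frames)  # number of frames allotted to this type
        scores = scores_by_type[kind]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        taken = 0
        for i in ranked:
            if taken == quota:
                break
            if i not in chosen:  # a frame is used at most once
                chosen.add(i)
                taken += 1
    return sorted(chosen)


# Hypothetical per-frame scores for a six-frame recording.
scores_by_type = {
    "music": [0.9, 0.8, 0.1, 0.2, 0.1, 0.3],
    "voice": [0.1, 0.2, 0.9, 0.7, 0.2, 0.1],
    "noise": [0.0, 0.0, 0.0, 0.1, 0.7, 0.6],
}
# Noise is given a zero share, so the noisy frames 4 and 5 are left out.
digest = multi_source_digest(
    scores_by_type, {"music": 0.5, "voice": 0.5, "noise": 0.0}, n_frames=4
)
```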

ここで、一般的に、映像情報については、例えばサムネイルを表示することにより、当該映像情報の概要を視覚的にユーザに対して通知することができる。しかしながら、主に映像情報ではなく音声情報を取得する音声収録機器（例えばＩＣレコーダー、録音アプリケーションソフトが搭載されたスマートフォン、カメラ機能が搭載されていない又はカメラ機能が使用できない状況下でのウェアラブル機器等）で音声を収録した場合、その音声情報のファイル名、収音日時等は視覚的に表示され得るが、ユーザにとって、これらの情報から、その音声情報の概要を視覚的に把握することは困難である。また、音声情報とともに映像情報を有する場合であっても、例えば暗い室内でのイベント中で表示画面のバックライトを点灯することが憚られる場合等、状況によっては、表示画面を見ることができず視覚的な確認ができない場合もある。 Here, in general, an overview of video information can be presented to the user visually, for example by displaying thumbnails. However, when sound is recorded with an audio recording device that mainly acquires audio information rather than video information (for example, an IC recorder, a smartphone equipped with recording application software, or a wearable device that has no camera function or is used in a situation where the camera function cannot be used), the file name, recording date and time, and so on of the audio information can be displayed visually, but it is difficult for the user to grasp an overview of the audio information from these items. Moreover, even when video information is available together with the audio information, there are situations in which the display screen cannot be viewed and visual confirmation is not possible, for example when the user is reluctant to turn on the backlight of the display screen during an event in a dark room.

このような場合、音声情報（又は、音声情報及び映像情報）の内容を把握するためには、ユーザは、実際に当該音声情報を試聴する必要がある。しかしながら、音声情報の時間長さが長い場合には、内容確認のために当該音声情報を一通り聞くことは、時間的な負荷が大きく、ユーザにとって大きな負担となる。 In such a case, in order to grasp the contents of the audio information (or of the audio information and the video information), the user needs to actually listen to the audio information. However, when the audio information is long, listening to all of it just to confirm its contents takes a great deal of time and places a heavy burden on the user.

一方、本実施形態によれば、上述したように、ユーザの要望に沿った音声情報のダイジェストを作成することが可能になる。従って、例えば数秒のダイジェストを試聴するだけで音声情報の内容を把握することができ、これまでは多大な時間を要していた内容確認に掛かる時間を、大幅に短縮することができる。 On the other hand, according to the present embodiment, as described above, it is possible to create a digest of voice information in accordance with a user's request. Therefore, for example, it is possible to grasp the contents of the audio information only by listening to a digest of several seconds, and it is possible to greatly reduce the time required to confirm the contents, which has taken a long time until now.

また、本実施形態によれば、例えば、音声を収録した装置本体、又はストレージに移動された後の音声情報を管理する他の装置等により、取得された音声情報に対して、自動的にダイジェストが生成されてもよい。また、取得された音声情報に対して自動的にダイジェストが生成される場合には、例えば、表示画面上の音声情報を表すファイル名にポインタを載せる等のＧＵＩを用いた操作や、プレビュー操作等の簡易な操作によって、ダイジェストが再生されてもよい。これにより、ユーザは、煩わしい操作を行うことなく、より気楽にダイジェストを確認することができる。 Further, according to the present embodiment, a digest may be generated automatically for acquired audio information, for example by the device that recorded the sound, or by another device that manages the audio information after it has been moved to storage. When a digest is generated automatically for the acquired audio information, the digest may be played back by a simple operation, such as a GUI operation of hovering a pointer over the file name representing the audio information on the display screen, or a preview operation. This allows the user to check the digest more casually, without performing troublesome operations.

また、本実施形態に係る技術は、いわゆるビッグデータを解析する用途にも好適に適用可能である。例えば、コールセンターや捜査機関等で収集される通話記録に対して本実施形態に係る技術を適用し、通話記録のダイジェストを生成することにより、膨大な量の通話記録の内容をより短時間で確認することが可能となる。従って、通話記録の解析がより容易になる。 Further, the technology according to the present embodiment is also well suited to the analysis of so-called big data. For example, by applying the technology according to the present embodiment to call records collected at call centers, investigative agencies, and the like, and generating digests of those call records, the contents of a huge volume of call records can be checked in a shorter time. Analysis of the call records therefore becomes easier.

また、音声情報とともに映像情報を有する場合であっても、映像情報に基づくサムネイル等を用いた視覚的な方法では、内容の把握が難しい状況が考えられる。例えば、似通った映像に対して音声部分のみが大きく異なる複数のファイルが存在する場合や、装置の処理速度等の実装的な制約から映像情報を利用できない場合、定点カメラ等による映像であるために映像内に音源が映っていない場合（すなわち話者が特定できない場合）等が、このような状況に該当し得る。本実施形態に係る技術は、このような、内容の把握のために映像情報が有効に利用できない場合にも好適に適用され得る。 Further, even when video information is available together with the audio information, there may be situations in which it is difficult to grasp the contents by a visual method such as thumbnails based on the video information. Such situations include, for example, a case where there are multiple files whose video is similar and only the audio differs greatly, a case where the video information cannot be used because of implementation constraints such as the processing speed of the device, and a case where the sound source does not appear in the video because the video comes from a fixed-point camera or the like (that is, the speaker cannot be identified). The technology according to the present embodiment can also be suitably applied in such cases, where video information cannot be used effectively to grasp the contents.

更に、本実施形態に係る技術は、動画を編集する場合等、音声情報を編集する作業においても、編集前の素材となる音声情報の内容を容易に把握する上で、有効である。例えば、近年、静止画像と音声とを組み合わせた、音声情報付きの写真を生成、提供するサービスが存在する。このような、静止画像と音声とを組み合わせたフォーマットのファイルを生成する際に、音声部分を編集する際にも、本実施形態に係る技術が有効に活用され得る。 Furthermore, the technology according to the present embodiment is also effective in work that involves editing audio information, such as editing a moving image, because it makes it easy to grasp the contents of the audio information that serves as the source material before editing. For example, in recent years there have been services that generate and provide photographs with audio information, in which a still image and sound are combined. The technology according to the present embodiment can also be used effectively to edit the audio part when generating a file in such a format that combines a still image and sound.

以上、添付図面を参照しながら本開示の好適な実施形態について詳細に説明したが、本開示の技術的範囲はかかる例に限定されない。本開示の技術分野における通常の知識を有する者であれば、特許請求の範囲に記載された技術的思想の範疇内において、各種の変更例または修正例に想到し得ることは明らかであり、これらについても、当然に本開示の技術的範囲に属するものと了解される。 The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings, but the technical scope of the present disclosure is not limited to these examples. It is clear that a person having ordinary knowledge in the technical field of the present disclosure could conceive of various changes or modifications within the scope of the technical idea described in the claims, and it is understood that these naturally also belong to the technical scope of the present disclosure.

また、本明細書に記載された効果は、あくまで説明的又は例示的なものであって限定的なものではない。つまり、本開示に係る技術は、上記の効果とともに、又は上記の効果に代えて、本明細書の記載から当業者には明らかな他の効果を奏し得る。 In addition, the effects described in this specification are merely explanatory or illustrative, and are not limiting. That is, the technology according to the present disclosure can exhibit, in addition to or instead of the above effects, other effects that are apparent to those skilled in the art from the description of this specification.

ここで、本明細書では、各処理の処理手順での判断処理において、スコアをしきい値と比較する際等に、「以下」や「よりも大きい」等の表現を用いているが、これらの表現はあくまで例示であり、当該判断処理における境界条件を限定するものではない。本実施形態では、スコア等の値がしきい値と等しい場合に、その大小関係をどのように判断するかは任意に設定可能であってよい。本明細書における「以下」との表現は「よりも小さい」との表現と互いに適宜読み替えることが可能であるし、「よりも大きい」との表現は「以上」との表現と互いに適宜読み替えることが可能である。 Here, in this specification, expressions such as "less than or equal to" and "greater than" are used, for example when a score is compared with a threshold in the determination processing of each processing procedure. These expressions are merely examples and do not limit the boundary conditions in the determination processing. In the present embodiment, how the magnitude relationship is judged when a value such as a score is equal to the threshold may be set arbitrarily. In this specification, the expression "less than or equal to" and the expression "less than" can be read interchangeably as appropriate, and the expression "greater than" and the expression "greater than or equal to" can likewise be read interchangeably as appropriate.
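The point that the handling of the boundary case is configurable can be illustrated with a small comparison helper. The name `exceeds` and the `inclusive` flag are assumptions made for this sketch; the specification does not define such a function.

```python
def exceeds(score: float, threshold: float, inclusive: bool = False) -> bool:
    """Compare a score with a threshold; `inclusive` decides the case score == threshold."""
    if inclusive:
        return score >= threshold  # "greater than or equal to"
    return score > threshold       # "greater than"


# Both readings agree away from the boundary and differ only when the values are equal.
strict_at_boundary = exceeds(0.5, 0.5)
inclusive_at_boundary = exceeds(0.5, 0.5, inclusive=True)
```

Keeping the boundary rule in one place like this means a whole processing procedure can be switched between the two readings without touching the individual comparisons.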

なお、以下のような構成も本開示の技術的範囲に属する。
（１）音声情報に含まれる音声の音源種別の蓋然性を示す音源種別スコアを算出する音源種別スコア算出部と、算出された前記音源種別スコアに基づいて、前記音声情報の中から、前記音声情報のダイジェストを構成するダイジェスト区間を決定するダイジェスト区間決定部と、を備える、情報処理装置。
（２）前記音源種別スコアは、音楽らしさを示す音楽スコア、人の声らしさを示す声スコア及び雑音らしさを示すノイズスコアの少なくともいずれかを含む、前記（１）に記載の情報処理装置。
（３）前記声スコアは、男性の声らしさを示す男性声スコア、女性の声らしさを示す女性声スコア、子どもの声らしさを示す子ども声スコア、及び前記音声を発している特定の人物らしさを示す特定声スコアの少なくともいずれかを更に含む、前記（２）に記載の情報処理装置。
（４）前記音源種別スコア算出部は、前記音声情報の特徴を示す特徴量に基づいて、前記音源種別スコアを算出する、前記（１）〜（３）のいずれか１項に記載の情報処理装置。
（５）前記特徴量は、前記音声情報についての、パワー、スペクトル包絡形状、ゼロ交差数、ピッチ、ＭＦＣＣ、収音位置間での相関、及び音源方位の特性を示す物理量のうちの少なくとも１つを含む、前記（４）に記載の情報処理装置。
（６）前記ダイジェスト区間決定部は、生成する前記ダイジェストのモードに基づいて前記ダイジェストに含める前記音声の音源種別を決定し、前記音声情報の中で、決定した音源種別に係る前記音源種別スコアがより高い区間を、前記ダイジェスト区間として決定する、前記（１）〜（５）のいずれか１項に記載の情報処理装置。
（７）前記モードは、単一の音源種別の前記音声のみを含むように前記ダイジェストを生成する単一音源モード、複数の音源種別の前記音声を所定の割合で含むように前記ダイジェストを生成する複数音源モード、及び、同一の前記音源種別に分類される前記音声の中から多様な前記音声が含まれるように前記ダイジェストを生成する多様性反映モード、の少なくともいずれかから選択される、前記（６）に記載の情報処理装置。
（８）前記モードが前記単一音源モードである場合には、前記ダイジェスト区間決定部は、指定された一の音源種別に係る前記音源種別スコアがより高い区間を、前記ダイジェスト区間として決定する、前記（７）に記載の情報処理装置。
（９）前記モードが前記複数音源モードである場合には、前記ダイジェスト区間決定部は、前記ダイジェストに含める前記音声の時間長さを音源種別ごとに設定し、音源種別ごとに前記音源種別スコアがより高い区間であって当該区間の合計長さが設定した音源種別ごとの前記時間長さと略等しくなるような前記区間を、前記ダイジェスト区間として決定する、前記（７）に記載の情報処理装置。
（１０）前記モードが前記多様性反映モードである場合には、前記ダイジェスト区間決定部は、同一の音源種別内での前記音声情報の特徴を示す特徴量のばらつき及び同一の前記音源種別内での前記音声が発せられた時刻のばらつきを算出し、前記特徴量のばらつき及び前記時刻のばらつきがより大きくなるように、前記ダイジェスト区間を決定する、前記（７）に記載の情報処理装置。
（１１）前記ダイジェスト区間決定部は、前記音源種別スコアが所定のしきい値よりも高い第１の区間と、前記音源種別スコアが所定のしきい値よりも低い第２の区間と、が連続して存在しており、かつ、前記第２の区間の時間長さが所定の時間よりも短い場合には、前記第１及び第２の区間をともに含むように前記ダイジェスト区間を決定する、前記（６）〜（１０）のいずれか１項に記載の情報処理装置。
（１２）前記ダイジェスト区間決定部は、前記音源種別スコアが所定のしきい値よりも高い第１の区間の時間長さが、人にとって音声として認識できない長さである場合には、前記第１の区間を含まないように前記ダイジェスト区間を決定する、前記（６）〜（１１）のいずれか１項に記載の情報処理装置。
（１３）前記音源種別スコア算出部は、予め全てが取得されている前記音声情報について、前記音源種別スコアを算出し、前記ダイジェスト区間決定部は、予め全てが取得されている前記音声情報の前記ダイジェストを生成する、前記（１）〜（１２）のいずれか１項に記載の情報処理装置。
（１４）前記音源種別スコア算出部は、現在まさに取得され続けている前記音声情報について、前記ダイジェスト区間以下の長さの時間からなるスコア算出区間に対応する時間長さの音声情報が新たに取得される度に、前記スコア算出区間ごとに前記音源種別スコアを算出し、前記ダイジェスト区間決定部は、前記音声情報が取得されている間、前記音声情報の前記ダイジェストを随時更新しながら生成する、前記（１）〜（１２）のいずれか１項に記載の情報処理装置。
（１５）前記ダイジェスト区間決定部は、これまでに取得された前記音声情報の時間長さが、前記ダイジェストの時間長さの設定値よりも短い場合には、新たに取得された前記音声情報を前記ダイジェストに追加し、これまでに取得された前記音声情報の時間長さが、前記ダイジェストの時間長さの設定値以上である場合には、新たに取得された前記スコア算出区間分の前記音声情報を前記ダイジェストに追加するとともに、前記ダイジェストの中から前記スコア算出区間分の時間長さの区間であって前記音源種別スコアがより低い区間を削除する、前記（１４）に記載の情報処理装置。
（１６）外部の音声を収音する音声収音部、を更に備え、前記音声情報は、前記音声収音部によって収音された外部音声に係る音声情報である、前記（１）〜（１５）のいずれか１項に記載の情報処理装置。
（１７）データベース化された前記音声情報が保存される記憶部、を更に備え、前記音源種別スコア算出部は、データベース化された前記音声情報に対して音源種別スコアを算出し、前記ダイジェスト区間決定部は、データベース化された前記音声情報に対して前記ダイジェスト区間を決定する、前記（１）〜（１５）のいずれか１項に記載の情報処理装置。
（１８）前記音声情報と、前記ダイジェスト区間決定部によって決定されたダイジェスト区間についての情報と、に基づいて、前記音声情報のダイジェストを、音声出力機器で出力可能なデータ形式で生成する出力音声生成部、を更に備える、前記（１）〜（１７）のいずれか１項に記載の情報処理装置。
（１９）プロセッサが、音声情報に含まれる音声の音源種別の蓋然性を示す音源種別スコアを算出することと、算出された前記音源種別スコアに基づいて、前記音声情報の中から、前記音声情報のダイジェストを構成するダイジェスト区間を決定することと、を含む、情報処理方法。
（２０）コンピュータのプロセッサに、音声情報に含まれる音声の音源種別の蓋然性を示す音源種別スコアを算出する機能と、算出された前記音源種別スコアに基づいて、前記音声情報の中から、前記音声情報のダイジェストを構成するダイジェスト区間を決定する機能と、を実現させる、プログラム。
The following configurations also belong to the technical scope of the present disclosure.
(1) An information processing apparatus including: a sound source type score calculation unit that calculates a sound source type score indicating the probability of the sound source type of sound included in audio information; and a digest section determination unit that determines, from within the audio information, a digest section constituting a digest of the audio information on the basis of the calculated sound source type score.
(2) The information processing apparatus according to (1), wherein the sound source type score includes at least one of a music score indicating likeness to music, a voice score indicating likeness to a human voice, and a noise score indicating likeness to noise.
(3) The information processing apparatus according to (2), wherein the voice score further includes at least one of a male voice score indicating likeness to a male voice, a female voice score indicating likeness to a female voice, a child voice score indicating likeness to a child's voice, and a specific voice score indicating likeness to a specific person who is producing the sound.
(4) The information processing apparatus according to any one of (1) to (3), wherein the sound source type score calculation unit calculates the sound source type score on the basis of a feature amount indicating a feature of the audio information.
(5) The information processing apparatus according to (4), wherein the feature amount includes at least one of physical quantities indicating characteristics of the audio information in terms of power, spectral envelope shape, number of zero crossings, pitch, MFCC, correlation between sound collection positions, and sound source direction.
(6) The information processing apparatus according to any one of (1) to (5), wherein the digest section determination unit determines the sound source type of the sound to be included in the digest on the basis of a mode of the digest to be generated, and determines, as the digest section, a section of the audio information in which the sound source type score for the determined sound source type is higher.
(7) The information processing apparatus according to (6), wherein the mode is selected from at least one of a single sound source mode in which the digest is generated so as to include only sound of a single sound source type, a multiple sound source mode in which the digest is generated so as to include sound of a plurality of sound source types at a predetermined ratio, and a diversity reflection mode in which the digest is generated so as to include a variety of sounds from among the sounds classified into the same sound source type.
(8) The information processing apparatus according to (7), wherein, when the mode is the single sound source mode, the digest section determination unit determines, as the digest section, a section in which the sound source type score for one designated sound source type is higher.
(9) The information processing apparatus according to (7), wherein, when the mode is the multiple sound source mode, the digest section determination unit sets, for each sound source type, a time length of the sound to be included in the digest, and determines, as the digest sections, sections in which the sound source type score is higher for each sound source type and whose total length is substantially equal to the time length set for that sound source type.
(10) The information processing apparatus according to (7), wherein, when the mode is the diversity reflection mode, the digest section determination unit calculates variation in the feature amount indicating a feature of the audio information within the same sound source type and variation in the times at which the sounds were produced within the same sound source type, and determines the digest section so that the variation in the feature amount and the variation in the times become larger.
(11) The digest section determination unit includes a first section in which the sound source type score is higher than a predetermined threshold and a second section in which the sound source type score is lower than a predetermined threshold. And when the time length of the second section is shorter than a predetermined time, the digest section is determined to include both the first and second sections, The information processing apparatus according to any one of (6) to (10).
(12) If the time length of the first section in which the sound source type score is higher than a predetermined threshold is a length that cannot be recognized as speech by a person, the digest section determination unit The information processing apparatus according to any one of (6) to (11), wherein the digest section is determined so as not to include the section.
(13) The information processing apparatus according to any one of (1) to (12), wherein the sound source type score calculation unit calculates the sound source type score for audio information that has been acquired in its entirety in advance, and the digest section determination unit generates the digest of the audio information that has been acquired in its entirety in advance.
(14) The information processing apparatus according to any one of (1) to (12), wherein, each time audio information having a time length corresponding to a score calculation section, which is equal to or shorter than the digest section, is newly acquired, the sound source type score calculation unit calculates the sound source type score for that score calculation section, and the digest section determination unit generates the digest while updating it as needed as the audio information is acquired.
(15) The information processing apparatus according to (14), wherein, when the time length of the audio information acquired so far is shorter than the set value of the time length of the digest, the digest section determination unit adds the newly acquired audio information to the digest, and when the time length of the audio information acquired so far is equal to or longer than the set value of the time length of the digest, the digest section determination unit adds the newly acquired audio information for the score calculation section to the digest and deletes from the digest a section that has a time length corresponding to the score calculation section and a lower sound source type score.
(16) The information processing apparatus according to any one of the above, further including a sound collecting unit that collects external sound, wherein the audio information is audio information related to the external sound collected by the sound collecting unit.
(17) The information processing apparatus according to any one of (1) to (15), further including a storage unit that stores the audio information in a database, wherein the sound source type score calculation unit calculates the sound source type score for the audio information stored in the database, and the digest section determination unit determines the digest section for the audio information stored in the database.
(18) The information processing apparatus according to any one of (1) to (17), further including an output audio generation unit that generates the digest of the audio information, in a data format that can be output by an audio output device, based on the audio information and information on the digest section determined by the digest section determination unit.
(19) An information processing method including: calculating, by a processor, a sound source type score indicating the probability of the sound source type of the sound included in audio information; and determining, by the processor, based on the calculated sound source type score, a digest section that constitutes a digest of the audio information from the audio information.
(20) A program that causes a processor of a computer to realize: a function of calculating a sound source type score indicating the probability of the sound source type of the sound included in audio information; and a function of determining, based on the calculated sound source type score, a digest section that constitutes a digest of the audio information from the audio information.

110, 120, 130, 140, 150 Information processing device
111 Feature amount extraction unit
113 Sound source type score calculation unit
115 Digest section determination unit
131 Audio sound collection unit
141 Output audio generation unit
151 Audio information database (DB)

Claims

A sound source type score calculating unit that calculates a sound source type score indicating the probability of the sound source type of the audio included in the audio information;
Based on the calculated sound source type score, a digest section determination unit that determines a digest section that constitutes a digest of the voice information from the voice information;
An information processing apparatus comprising:

The sound source type score includes at least one of a music score indicating the likelihood of music, a voice score indicating the likelihood of human voice, and a noise score indicating the likelihood of noise.
The information processing apparatus according to claim 1.

The voice score further includes at least one of a male voice score indicating the likelihood of a man's voice, a female voice score indicating the likelihood of a woman's voice, a child voice score indicating the likelihood of a child's voice, and a specific voice score indicating the likelihood that the voice is produced by a specific person,
The information processing apparatus according to claim 2.

The sound source type score calculating unit calculates the sound source type score based on a feature amount indicating a feature of the audio information;
The information processing apparatus according to claim 1.

The feature amount includes at least one of power, spectrum envelope shape, number of zero crossings, pitch, MFCC, correlation between sound collection positions, and physical quantity indicating sound source azimuth characteristics for the audio information.
The information processing apparatus according to claim 4.
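
As an illustration of two of the feature amounts listed above, the following minimal sketch computes the power and the number of zero crossings of one frame of samples. The function names and the choice of mean squared amplitude as "power" are assumptions made for this example; the source does not specify how each feature amount is computed.

```python
def frame_power(samples):
    """Mean squared amplitude of one frame (assumed definition of 'power')."""
    return sum(s * s for s in samples) / len(samples)


def zero_crossings(samples):
    """Count sign changes between consecutive samples in one frame."""
    return sum(
        1
        for a, b in zip(samples, samples[1:])
        if (a >= 0) != (b >= 0)
    )


# Hypothetical frame: a signal alternating between +0.5 and -0.5.
frame = [0.5, -0.5, 0.5, -0.5]
power = frame_power(frame)
crossings = zero_crossings(frame)
```

A frame with many zero crossings relative to its power tends to be noise-like, which is one reason such features can feed a sound source type score.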

The digest section determination unit determines a sound source type of the voice to be included in the digest based on a mode of the digest to be generated, and determines, as the digest section, a section of the voice information having a higher sound source type score for the determined sound source type,
The information processing apparatus according to claim 1.

The mode includes a single sound source mode for generating the digest so as to include only the sound of a single sound source type, a multiple sound source mode for generating the digest so as to include the sound of a plurality of sound source types at a predetermined ratio, and a diversity reflection mode for generating the digest so that various voices are included from among the voices classified into the same sound source type,
The information processing apparatus according to claim 6.

When the mode is the single sound source mode, the digest section determination unit determines a section having a higher sound source type score related to one designated sound source type as the digest section.
The information processing apparatus according to claim 7.
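
The single sound source mode described above can be sketched as a top-N selection. This is a minimal illustration that assumes the audio has already been split into equal-length sections and that each section carries a score per sound source type; `pick_single_source` and its parameters are hypothetical names, not taken from the source.

```python
def pick_single_source(scores, target_type, section_len, digest_len):
    """Return indices of the highest-scoring sections for one sound source type.

    scores: list of dicts, one per section, mapping sound source type -> score.
    section_len, digest_len: lengths in seconds (assumed equal-length sections).
    """
    n_pick = int(digest_len // section_len)
    ranked = sorted(
        range(len(scores)),
        key=lambda i: scores[i][target_type],
        reverse=True,
    )
    # Restore chronological order so the digest plays back in sequence.
    return sorted(ranked[:n_pick])


example_scores = [{"music": 0.1}, {"music": 0.9}, {"music": 0.5}, {"music": 0.8}]
selected = pick_single_source(example_scores, "music", section_len=1.0, digest_len=2.0)
```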

When the mode is the multiple sound source mode, the digest section determination unit sets, for each sound source type, a time length of the voice to be included in the digest, and determines, as the digest section, sections having a higher sound source type score for each sound source type such that the total length of those sections is approximately equal to the time length set for that sound source type,
The information processing apparatus according to claim 7.
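
One way to picture the multiple sound source mode is to give each sound source type a time budget and fill it with that type's highest-scoring sections. The sketch below assumes equal-length sections and assigns each section to at most one type; both are simplifications introduced for this example, and `pick_multi_source` is a hypothetical name.

```python
def pick_multi_source(scores, budgets, section_len):
    """Fill a per-type time budget with the highest-scoring sections.

    scores: list of dicts, one per section, mapping sound source type -> score.
    budgets: dict mapping sound source type -> seconds to include in the digest.
    """
    chosen = {}
    used = set()  # assumption: a section is counted for only one type
    for src_type, seconds in budgets.items():
        n_pick = int(seconds // section_len)
        ranked = sorted(
            (i for i in range(len(scores)) if i not in used),
            key=lambda i: scores[i][src_type],
            reverse=True,
        )
        picked = ranked[:n_pick]
        used.update(picked)
        chosen[src_type] = sorted(picked)
    return chosen


example_scores = [
    {"music": 0.9, "voice": 0.1},
    {"music": 0.2, "voice": 0.8},
    {"music": 0.7, "voice": 0.3},
    {"music": 0.1, "voice": 0.9},
]
result = pick_multi_source(example_scores, {"music": 2.0, "voice": 1.0}, section_len=1.0)
```

Here the 2:1 budget corresponds to the "predetermined ratio" of sound source types in the digest.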

When the mode is the diversity reflection mode, the digest section determination unit calculates, within the same sound source type, the variation in the feature amount indicating the feature of the audio information and the variation in the time at which the sound is generated, and determines the digest section so that the variation in the feature amount and the variation in the time become larger,
The information processing apparatus according to claim 7.
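
The source does not state how the variation is maximized. As one possible reading, the sketch below uses greedy farthest-point selection over (time, feature) pairs, which tends to spread the chosen sections in both time and feature value. The technique, the name `pick_diverse`, the single scalar feature, the normalization to [0, 1], and the arbitrary choice of the first candidate as the seed are all assumptions made for illustration.

```python
def pick_diverse(candidates, k):
    """Greedily pick k candidates that are spread out in time and feature value.

    candidates: list of (time, feature) pairs for sections of one sound source
    type, both assumed to be normalized to the range [0, 1].
    """
    if not candidates or k <= 0:
        return []
    chosen = [0]  # assumption: seed with the first candidate
    while len(chosen) < min(k, len(candidates)):

        def spread(i):
            # Squared distance to the nearest already-chosen candidate.
            return min(
                (candidates[i][0] - candidates[j][0]) ** 2
                + (candidates[i][1] - candidates[j][1]) ** 2
                for j in chosen
            )

        remaining = (i for i in range(len(candidates)) if i not in chosen)
        chosen.append(max(remaining, key=spread))
    return sorted(chosen)


example = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.5, 0.5)]
diverse = pick_diverse(example, 3)
```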

When a first section in which the sound source type score is higher than a predetermined threshold and a second section in which the sound source type score is lower than the predetermined threshold are contiguous and the time length of the second section is shorter than a predetermined time, the digest section determination unit determines the digest section so as to include both the first section and the second section,
The information processing apparatus according to claim 6.

When the time length of the first section in which the sound source type score is higher than a predetermined threshold is too short for a person to recognize the section as speech, the digest section determination unit determines the digest section so as not to include the first section,
The information processing apparatus according to claim 6.
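
The two threshold rules above (bridging a short low-score gap, and discarding a high-score section too short to be recognized) can be sketched together. This example assumes equal-length sections and applies the bridging step before the length filter; the source does not fix that order, and the function and parameter names are hypothetical.

```python
def threshold_sections(scores, threshold, section_len, max_gap, min_len):
    """Return (start, end) times in seconds of the sections kept for the digest."""
    # 1. Collect runs of consecutive sections whose score exceeds the threshold.
    runs = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold and start is None:
            start = i
        elif score <= threshold and start is not None:
            runs.append([start * section_len, i * section_len])
            start = None
    if start is not None:
        runs.append([start * section_len, len(scores) * section_len])

    # 2. Bridge low-score gaps shorter than max_gap so both sides stay together.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < max_gap:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # 3. Drop runs that are too short to be recognized by a listener.
    return [(a, b) for a, b in merged if b - a >= min_len]


example = [0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.1, 0.9, 0.1]
kept = threshold_sections(example, threshold=0.5, section_len=1.0, max_gap=2.0, min_len=2.0)
```

In this example the one-second dip at index 2 is bridged, while the isolated one-second peak at index 7 is discarded.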

The sound source type score calculation unit calculates the sound source type score for the audio information that has been acquired in advance,
The digest section determination unit generates the digest of the audio information that has been acquired in advance.
The information processing apparatus according to claim 1.

Each time audio information having a time length corresponding to a score calculation section, which is equal to or shorter than the digest section, is newly acquired, the sound source type score calculation unit calculates the sound source type score for that score calculation section,
The digest section determination unit generates the digest while updating it as needed as the audio information is acquired,
The information processing apparatus according to claim 1.

When the time length of the audio information acquired so far is shorter than the set value of the time length of the digest, the digest section determination unit adds the newly acquired audio information to the digest,
When the time length of the audio information acquired so far is equal to or longer than the set value of the time length of the digest, the digest section determination unit adds the newly acquired audio information for the score calculation section to the digest and deletes from the digest a section that has a time length corresponding to the score calculation section and a lower sound source type score,
The information processing apparatus according to claim 14.
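
The incremental update described above can be sketched as a fixed-capacity list: new score calculation sections are appended until the digest reaches its set length, after which each addition evicts the lowest-scoring section. The sketch assumes every score calculation section has the same length, so the set digest length is expressed as a section count; `update_digest` and `max_sections` are hypothetical names.

```python
def update_digest(digest, new_section, max_sections):
    """Add a newly scored section and evict the lowest-scoring one if over capacity.

    digest: list of (section_index, score) tuples in chronological order.
    new_section: (section_index, score) for the newly acquired section.
    max_sections: set digest length expressed as a number of sections.
    """
    updated = digest + [new_section]
    if len(updated) > max_sections:
        lowest = min(updated, key=lambda item: item[1])
        updated.remove(lowest)
    return updated


digest = []
for index, score in enumerate([0.5, 0.2, 0.9, 0.1]):
    digest = update_digest(digest, (index, score), max_sections=2)
```

Note that under this sketch a newly acquired section is itself evicted when it has the lowest score, which leaves the digest unchanged.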

Further comprising a sound collecting unit that collects external sound,
The audio information is audio information related to the external sound collected by the sound collecting unit,
The information processing apparatus according to claim 1.

Further comprising a storage unit that stores the voice information in a database,
The sound source type score calculation unit calculates the sound source type score for the voice information stored in the database,
The digest section determination unit determines the digest section for the voice information stored in the database,
The information processing apparatus according to claim 1.

Further comprising an output voice generation unit that generates a digest of the voice information, in a data format that can be output by a voice output device, based on the voice information and information on the digest section determined by the digest section determination unit,
The information processing apparatus according to claim 1.

An information processing method comprising:
calculating, by a processor, a sound source type score indicating the probability of the sound source type of the audio included in audio information; and
determining, by the processor, a digest section that constitutes a digest of the audio information from the audio information based on the calculated sound source type score.

A program that causes a processor of a computer to realize:
a function of calculating a sound source type score indicating the probability of the sound source type of the audio included in audio information; and
a function of determining a digest section that constitutes a digest of the audio information from the audio information based on the calculated sound source type score.