JP2020056918A

JP2020056918A - Sound data learning system, sound data learning method and sound data learning device

Info

Publication number: JP2020056918A
Application number: JP2018187599A
Authority: JP
Inventors: 昭年泉; Akitoshi Izumi
Original assignee: Panasonic Intellectual Property Management Co Ltd
Current assignee: Panasonic Intellectual Property Management Co Ltd
Priority date: 2018-10-02
Filing date: 2018-10-02
Publication date: 2020-04-09
Also published as: WO2020071015A1

Abstract

To help achieve both a reduction in the processing load of learning processing and the efficient generation of a highly accurate learned model suited to its use. SOLUTION: The sound data learning system extracts J feature quantities of a sound signal for learning and generates N feature quantity reduction blocks, each having parameters that define whether each feature quantity is used. The sound data learning system executes preparation processing to prepare data sets of k different patterns using the sound signal for learning, and feature quantity reduction processing that repeats, for each of the N feature quantity reduction blocks and for each of the k data set patterns, learning processing using the corresponding feature quantity reduction block and a part of the data sets, and evaluation processing on the use of each of the J feature quantities targeting the remaining data sets. Based on the feature quantity reduction processing result of each of the N feature quantity reduction blocks, each having k kinds of evaluation results, the sound data learning system selects one of the feature quantity reduction blocks as reduction information. SELECTED DRAWING: Figure 4

Description

The present disclosure relates to a sound data learning system, a sound data learning method, and a sound data learning device that execute processing related to learning of input sound data.

Patent Literature 1 discloses a device that detects abnormal machine sounds using learning data, given in advance, of machine sounds during normal operation. Specifically, this device separates an input frequency-domain signal into two or more types of signals whose sound properties differ from each other, and extracts an acoustic feature amount for each of the two or more types of signals. The device then calculates the degree of abnormality of each of the two or more types of signals using the extracted acoustic feature amounts and a previously learned normal-state model for each of those signals, and determines whether the frequency-domain signal is abnormal based on an integrated degree of abnormality obtained by integrating these degrees of abnormality.

In recent years, affective computing technology, which analyzes (identifies) human emotions and feelings using a model based on machine learning, has attracted attention for purposes such as marketing and tagging of video data. The most actively researched emotion analysis technology is image-based analysis, and research on analyzing emotions from facial expressions in face images has advanced to the point where a certain degree of accuracy is obtained. However, human emotions are expressed not only in facial expressions but also in uttered speech. In telephone reception at a call center or the like, it is difficult to obtain a face image of the other party, so it is difficult for the operator to recognize the other party's emotions from an image. Reading emotions from the other party's uttered speech is therefore considered important, but unlike analysis of face images, technology for analyzing emotions from uttered speech has not yet matured and is still being actively researched.

JP 2017-90606 A

To improve the accuracy of analyzing human emotions and feelings through machine learning, it is essential to generate a more suitable learned model by machine learning. In the related art, including Patent Literature 1 described above, however, generating such a learned model requires a large amount of input data as learning data, and feature amounts of a very large number of dimensions have to be handled during machine learning. As a result, a heavy processing load is imposed on the device that performs the machine learning, and it is difficult to generate an efficient and highly accurate learned model suited to the application.

The present disclosure has been devised in view of the conventional circumstances described above, and an object thereof is to provide a sound data learning system, a sound data learning method, and a sound data learning device that help achieve both a reduction in the processing load imposed when learning processing is executed and the generation of an efficient and highly accurate learned model suited to the application.

The present disclosure provides a sound data learning system including: a microphone that collects a learning sound signal; an extraction unit that extracts J (J: a positive integer) feature amounts of the learning sound signal; a generation unit that generates N (N: an integer of 2 or more) feature amount reduction blocks, each composed of J parameters that define whether each of the J feature amounts is used; a feature amount reduction unit that executes preparation processing for preparing data sets of k (k: a prescribed value of 2 or more) different patterns using data of the learning sound signal, and feature amount reduction processing that repeats, for each of the N feature amount reduction blocks and for each of the k data sets of the different patterns, learning processing using the corresponding feature amount reduction block and a partial data set of the data of the learning sound signal, and evaluation processing that uses a learning result of the learning processing to evaluate the use of each of the J feature amounts on the remaining data set of the data of the learning sound signal; and a learning control unit that selects one of the feature amount reduction blocks as reduction information on the J feature amounts, based on the feature amount reduction processing results of the N feature amount reduction blocks, each of which has k types of evaluation results.

The present disclosure also provides a sound data learning method in a sound data learning system having a microphone, the method including: a step of collecting a learning sound signal with the microphone; a step of extracting J (J: a positive integer) feature amounts of the learning sound signal; a step of generating N (N: an integer of 2 or more) feature amount reduction blocks, each composed of J parameters that define whether each of the J feature amounts is used; a step of executing preparation processing for preparing data sets of k (k: a prescribed value of 2 or more) different patterns using data of the learning sound signal, and feature amount reduction processing that repeats, for each of the N feature amount reduction blocks and for each of the k data sets of the different patterns, learning processing using the corresponding feature amount reduction block and a partial data set of the data of the learning sound signal, and evaluation processing that uses a learning result of the learning processing to evaluate the use of each of the J feature amounts on the remaining data set of the data of the learning sound signal; and a step of selecting one of the feature amount reduction blocks as reduction information on the J feature amounts, based on the feature amount reduction processing results of the N feature amount reduction blocks, each of which has k types of evaluation results.

The present disclosure also provides a sound data learning device including: an extraction unit that is connected to a microphone that collects a learning sound signal and that extracts J (J: a positive integer) feature amounts of the learning sound signal; a generation unit that generates N (N: an integer of 2 or more) feature amount reduction blocks, each composed of J parameters that define whether each of the J feature amounts is used; a feature amount reduction unit that executes preparation processing for preparing data sets of k (k: a prescribed value of 2 or more) different patterns using data of the learning sound signal, and feature amount reduction processing that repeats, for each of the N feature amount reduction blocks and for each of the k data sets of the different patterns, learning processing using the corresponding feature amount reduction block and a partial data set of the data of the learning sound signal, and evaluation processing that uses a learning result of the learning processing to evaluate the use of each of the J feature amounts on the remaining data set of the data of the learning sound signal; and a learning control unit that selects one of the feature amount reduction blocks as reduction information on the J feature amounts, based on the feature amount reduction processing results of the N feature amount reduction blocks, each of which has k types of evaluation results.
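The preparation, learning, evaluation, and selection steps recited above can be pictured with the following minimal Python sketch. It is an illustration under stated assumptions, not the disclosed implementation: the function names `k_fold_patterns` and `select_block` and the callable `train_and_score` are hypothetical, the k patterns are assumed to be ordinary k-fold splits, and the selection rule (highest mean of the k evaluation results) is an assumption, since the text only says that the selection is based on the feature amount reduction processing results.

```python
from statistics import mean


def k_fold_patterns(samples, k):
    """Prepare k data set patterns; pattern i holds out fold i for evaluation."""
    folds = [samples[i::k] for i in range(k)]
    patterns = []
    for i in range(k):
        held_out = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        patterns.append((train, held_out))
    return patterns


def select_block(blocks, samples, k, train_and_score):
    """Return the feature amount reduction block with the best mean score.

    blocks: N lists of J flags (1 = use the feature amount, 0 = do not use it).
    train_and_score(block, train, held_out): hypothetical callable that learns
    on `train` and returns an evaluation value measured on `held_out`.
    """
    def block_score(block):
        # Each block ends up with k evaluation results, one per pattern.
        scores = [train_and_score(block, train, held_out)
                  for train, held_out in k_fold_patterns(samples, k)]
        return mean(scores)  # assumption: average the k results

    return max(blocks, key=block_score)
```

With N blocks and k patterns, this sketch runs N × k learning and evaluation passes, which is why reducing the number of feature amounts handled in each pass matters for the processing load.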

Any combination of the above components, and any conversion of the expressions of the present disclosure among a method, a device, a system, a recording medium, a computer program, a learned model (learning model), and the like, are also effective as aspects of the present disclosure.

According to the present disclosure, it is possible to help achieve both a reduction in the processing load imposed when learning processing is executed and the generation of an efficient and highly accurate learned model suited to the application.

FIG. 1 is a block diagram showing a configuration example of the sound data learning system according to Embodiment 1.
FIG. 2 is an explanatory diagram of an example of learning processing and evaluation processing by the classifier of the learning processing unit using cross-validation.
FIG. 3 is a conceptual diagram explaining feature amount reduction processing using a genetic algorithm (GA).
FIG. 4 is a flowchart explaining an example of the overall operation procedure of the sound data learning processing.
FIG. 5 is a flowchart explaining an example of the operation procedure of learning processing using the reduced feature amounts.
FIG. 6 is a flowchart explaining an example of the operation procedure of sound data analysis processing using the learned model.
FIG. 7 is a graph that associates input sound data with an example of the analysis processing result for that sound data.
FIG. 8 is a block diagram showing a configuration example of a sound data learning system according to a modification of Embodiment 1.

Hereinafter, an embodiment that specifically discloses the configuration and operation of the sound data learning system, the sound data learning method, and the sound data learning device according to the present disclosure will be described in detail with reference to the drawings as appropriate. However, descriptions that are more detailed than necessary may be omitted. For example, detailed descriptions of well-known matters and redundant descriptions of substantially identical configurations may be omitted. This is to keep the following description from becoming unnecessarily redundant and to facilitate understanding by those skilled in the art. The accompanying drawings and the following description are provided so that those skilled in the art can fully understand the present disclosure, and are not intended to limit the subject matter recited in the claims.

(Embodiment 1)
FIG. 1 is a block diagram showing a configuration example of the sound data learning system 200 according to Embodiment 1. The sound data learning system 200 according to Embodiment 1 collects, with the microphone 110, sounds or speech (hereinafter collectively referred to as "sound") emitted by a target person, and inputs the collected sound data to the information processing device 140 via the audio interface 120. To generate a learned model capable of analyzing, for example, the emotions or feelings (hereinafter collectively referred to as "emotions") contained in the sound emitted by the target person, the sound data learning system 200 executes, in the information processing device 140, learning processing (described later) such as machine learning or deep learning using the input sound data.

As shown in FIG. 1, the sound data learning system 200 includes the microphone 110, the audio interface 120, and the information processing device 140. In FIG. 1, the microphone 110 is abbreviated as "MIC" and the interface is abbreviated as "I/F". Although only one microphone 110 is shown in FIG. 1, a plurality of microphones 110 may be connected to the audio interface 120. Alternatively, one audio interface 120 may be provided for each microphone 110, in which case the number of audio interfaces 120 equals the number of installed microphones 110.

In the sound data learning system 200, the configuration of the audio interface 120 may be incorporated into the information processing device 140, in which case the information processing device 140 and the microphone 110 may be connected directly.

The microphone 110 includes a sound collection device that receives the sound waves constituting the sound emitted by the target person, and generates and outputs an audio signal as an electric signal.

The audio interface 120 is an interface device for sound data input that converts the audio signal generated by the microphone 110 into digital data on which various kinds of signal processing can be performed. The audio interface 120 has an input unit 121, an AD converter 122, a buffer 123, and a communication unit 124. In FIG. 1, the AD converter 122 is abbreviated as "ADC".

The input unit 121 has an input terminal for receiving the audio signal.

The AD converter 122 converts the analog audio signal into digital sound data using a predetermined number of quantization bits and a predetermined sampling frequency. The sampling frequency of the AD converter 122 is, for example, 48 kHz.

The buffer 123 has a storage capacity (memory) capable of temporarily storing sound data, and buffers sound data for a predetermined length of time. The buffer capacity of the buffer 123 is, for example, about 40 msec. Using such a relatively small buffer capacity makes it possible to reduce delays in the learning processing and other processing in the sound data learning system 200.
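As a worked example of the figures above, the following Python sketch converts the 40 msec buffer length into samples and bytes at the 48 kHz sampling frequency. The 16-bit quantization and the single channel are assumptions made only for illustration; the text does not specify the number of quantization bits or channels.

```python
SAMPLE_RATE_HZ = 48_000  # sampling frequency given in the text
BUFFER_MS = 40           # buffer length given in the text
BITS_PER_SAMPLE = 16     # assumption: not specified in the text
CHANNELS = 1             # assumption: a single microphone


def buffer_samples(sample_rate_hz, buffer_ms):
    """Number of samples per channel held in the buffer."""
    return sample_rate_hz * buffer_ms // 1000


def buffer_bytes(sample_rate_hz, buffer_ms, bits_per_sample, channels):
    """Buffer size in bytes for the given sample format."""
    samples = buffer_samples(sample_rate_hz, buffer_ms)
    return samples * channels * bits_per_sample // 8


print(buffer_samples(SAMPLE_RATE_HZ, BUFFER_MS))  # 1920
print(buffer_bytes(SAMPLE_RATE_HZ, BUFFER_MS, BITS_PER_SAMPLE, CHANNELS))  # 3840
```

Under these assumptions, each buffer holds 1920 samples (3840 bytes), so downstream processing receives a new block every 40 msec.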

The communication unit 124 is configured using a communication circuit having a communication interface such as USB (Universal Serial Bus), and transmits and receives data to and from external devices such as the information processing device 140. The communication unit 124 transmits the sound data input from the buffer 123 to the information processing device 140.

The information processing device 140 is configured by, for example, a PC (Personal Computer) having a processor and a memory, and executes learning processing or detection processing of sound data. Hereinafter, the input data used for the learning processing in the sound data learning system 200 (that is, sound data input for learning) is referred to as "learning sound data". The input data used for the detection processing (that is, processing for analyzing a person's emotions or feelings) in the sound data learning system 200 (that is, sound data input for detection) is referred to as "detection sound data". Instead of a PC, the information processing device 140 may be configured by a device capable of transmitting and receiving data to and from the audio interface 120, such as a tablet terminal or a smartphone. The information processing device 140 has a communication unit 141, a processing unit 142, a storage unit 143, an operation input unit 144, and a display unit 145.

The communication unit 141 is configured using a communication circuit having a communication interface such as USB (Universal Serial Bus), and transmits and receives data to and from external devices such as the audio interface 120. The communication unit 141 receives the sound data sent from the audio interface 120 and passes it to the processing unit 142.

The processing unit 142 is configured by a processor such as a CPU (Central Processing Unit), a DSP (Digital Signal Processor), or an FPGA (Field Programmable Gate Array). The processing unit 142 executes various kinds of processing in accordance with predetermined programs and data stored in the storage unit 143, such as a memory, and executes the learning processing of learning sound data or the detection processing of detection sound data described later. During operation, the processing unit 142 uses the RAM and ROM (described later) of the storage unit 143, and temporarily stores the data or information that it generates or acquires in the RAM.

By cooperating with the RAM and ROM in the storage unit 143, the processing unit 142 can realize, as functional components, a control unit 151 that comprehensively controls the execution of the various kinds of processing, a learning processing unit 152 that executes learning processing, a detection processing unit 153 that executes detection processing, and a determination processing unit 154 that executes determination processing. Details of the processing in the control unit 151, the learning processing unit 152, the detection processing unit 153, and the determination processing unit 154 are described later.

The storage unit 143 includes semiconductor memory such as RAM (Random Access Memory) and ROM (Read Only Memory), and a storage device such as an SSD (Solid State Drive) or an HDD (Hard Disk Drive). The ROM of the storage unit 143 stores the programs and data for executing the functions of the learning processing, the detection processing, and the determination processing, as well as various setting data referred to when each of these kinds of processing is executed. The RAM, SSD, or HDD of the storage unit 143 stores the sound data to be analyzed and the various data generated during the learning processing, detection processing, or determination processing of that sound data.

The operation input unit 144 has an input device such as a keyboard, a mouse, a touchpad, or a touch panel. The operation input unit 144 accepts the user's input operations while the sound data learning system 200 is in use and outputs them to the processing unit 142.

The display unit 145 has a display device such as a liquid crystal display (LCD) or an organic EL (Electroluminescence) display. The display unit 145 displays a display screen (not shown) while the processing unit 142 executes processing such as the learning processing, the detection processing, and the determination processing.

The control unit 151 functions as a controller that governs the overall operation of the information processing device 140, and performs control processing for supervising the operation of each unit in the processing unit 142, data input/output processing with each unit of the information processing device 140, data computation (calculation) processing, and data storage processing. When the control unit 151 acquires learning sound data sent from the communication unit 141 based on a user operation using the operation input unit 144, it passes the learning sound data to the learning processing unit 152 and instructs it to execute the learning processing. Likewise, when the control unit 151 acquires analysis sound data sent from the communication unit 141 based on a user operation using the operation input unit 144, it passes the analysis sound data to the detection processing unit 153 and instructs it to execute the detection processing (that is, the analysis processing).

The learning processing unit 152, as an example of an extraction unit, extracts, in accordance with an instruction from the control unit 151, feature amount data of a predetermined J dimensions (J: a positive integer, for example 988 as one example of a value of about 1000; the same applies hereinafter) from the learning sound data (one form of a learning sound signal) passed from the control unit 151. The learning processing unit 152, as an example of a learning unit, executes learning processing (described later) using the extracted feature amount data of the learning sound data (one form of a learning sound signal), thereby generating a learned model for analyzing the human emotions contained in detection sound data (one form of a detection sound signal).

In Embodiment 1, the information processing device 140 aims both to reduce the load when the learning processing is executed and to generate an efficient and highly accurate learned model. To this end, the learning processing unit 152 does not perform learning processing using all of the high-dimensional (for example, 988-dimensional) feature amounts as in conventional machine learning, but instead executes the learning processing after removing some of the feature amounts from the high-dimensional set.

A feature amount is characteristic data contained in the learning sound data or the detection sound data, for example "loudness", "intensity", "MFCC (Mel-Frequency Cepstrum Coefficients)", "ZCR (Zero Crossing Ratio)", and "F0 (Fundamental Frequency)". Loudness is the strength of sound as perceived by human hearing, and is one kind of sensory quantity. Intensity is the physical energy of sound and indicates the degree of physical stimulus. MFCC is a feature amount that represents vocal tract characteristics. ZCR is the number of times the signal level of the learning sound data or the detection sound data crosses 0 (zero) within a fixed period of time, and this value tends to be larger in speech sections. F0 is the fundamental frequency of the voice. Five (five-dimensional) feature amounts are given here by way of example, but the types of feature amounts are not limited to these.
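To make two of the feature amounts above concrete, the following Python sketch counts zero crossings in a frame and computes a simple mean-square energy as a stand-in for intensity. It is only an illustration: the function names are hypothetical, treating a sample of exactly 0 as non-negative is an assumption, and the text does not specify how the disclosed system actually computes ZCR or intensity.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples in one frame.

    Assumption: a sample equal to 0 is treated as non-negative.
    """
    count = 0
    for prev, cur in zip(frame, frame[1:]):
        if (prev >= 0) != (cur >= 0):
            count += 1
    return count


def mean_square_energy(frame):
    """Mean of squared samples, a simple proxy for physical sound energy."""
    if not frame:
        raise ValueError("frame must contain at least one sample")
    return sum(x * x for x in frame) / len(frame)


frame = [0.5, -0.5, -0.2, 0.3]
print(zero_crossings(frame))      # 2
print(mean_square_energy([1.0, -1.0, 1.0, -1.0]))  # 1.0
```

In a system that extracts on the order of 1000 feature amounts, values like these would typically be combined with statistics over many frames, although that detail is not stated in the text.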

The outline of the learning processing in the learning processing unit 152 is as follows. Details are described later with reference to FIGS. 2 and 3.

(1) To reduce the number of dimensions of the feature amounts, the learning processing unit 152 uses a genetic algorithm (GA), and uses a DNN (Deep Neural Network) as the classifier (identifier, that is, the identification algorithm) in the learning processing.
(2) When executing the genetic algorithm, the learning processing unit 152 generates a plurality of individuals (described later), each having genes (described later), for the current generation, and calculates the evaluation value of each individual using the DNN.
(3) Based on the evaluation value of each individual in the current generation calculated by the DNN, the learning processing unit 152 selects, from the individuals in the current generation, a plurality of (for example, two) individuals to serve as parents for generating the individuals of the next generation. The learning processing unit 152 generates the individuals of the next generation by combining the selected individuals to produce new individuals.

The learning processing unit 152 treats the series of processes (1) to (3) as the processing for one generation and repeats the same series of processes up to the M-th generation, thereby obtaining individuals with relatively high evaluation values. M is, for example, a prescribed integer of 2 or more.
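The generation loop described in (1) to (3) can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the disclosed implementation: each individual is represented as a list of 0/1 flags, the hypothetical callable `evaluate` stands in for the DNN-based evaluation, the two parents are assumed to be the two highest-scoring individuals, and the one-point crossover, the mutation rate, and the carrying over of both parents are all assumptions that the text does not specify.

```python
import random


def evolve(num_features, population_size, generations, evaluate,
           rng=None, mutation_rate=0.1):
    """Run a simple genetic algorithm over 0/1 feature-use flags.

    Requires num_features >= 2 and population_size >= 2.
    evaluate(individual) is a hypothetical stand-in for the evaluation value
    that the text says is calculated by the DNN.
    """
    if rng is None:
        rng = random.Random(0)
    population = [[rng.randint(0, 1) for _ in range(num_features)]
                  for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=evaluate, reverse=True)
        parent_a, parent_b = ranked[0], ranked[1]  # assumption: top two
        children = [parent_a[:], parent_b[:]]      # assumption: keep parents
        while len(children) < population_size:
            cut = rng.randint(1, num_features - 1)  # one-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            if rng.random() < mutation_rate:        # occasional bit flip
                i = rng.randrange(num_features)
                child[i] = 1 - child[i]
            children.append(child)
        population = children
    return max(population, key=evaluate)
```

Because both parents are carried into the next generation in this sketch, the best evaluation value never decreases from one generation to the next; whether the disclosed system preserves parents in this way is not stated.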

Here, a gene of an individual is a parameter that defines whether a feature amount of the learning sound data or the detection sound data is used by the DNN, and specifically takes the value "1" or "0". "1" is a parameter that instructs use of the feature amount. "0" is a parameter that instructs non-use of the feature amount.
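The effect of such genes on a feature vector can be illustrated with the short Python sketch below. The function name `apply_genes` and the sample values are hypothetical; the sketch only shows the 1 = use, 0 = do not use convention described above.

```python
def apply_genes(features, genes):
    """Keep only the feature amounts whose gene is 1."""
    if len(features) != len(genes):
        raise ValueError("features and genes must have the same length")
    return [value for value, gene in zip(features, genes) if gene == 1]


features = [0.8, 0.1, 0.4, 0.9, 0.3]  # hypothetical 5-dimensional features
genes = [1, 0, 0, 1, 1]               # use the 1st, 4th, and 5th features
print(apply_genes(features, genes))   # [0.8, 0.9, 0.3]
```

An individual with fewer genes set to 1 passes fewer dimensions to the classifier, which is how the reduction lowers the load of the learning processing.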

The algorithm or technique for realizing the learning processing executed in the learning processing unit 152 may use, for example, one or more statistical classification techniques. Examples of statistical classification techniques include linear classifiers, support vector machines, quadratic classifiers, kernel density estimation, decision trees, artificial neural networks, Bayesian techniques and/or networks, hidden Markov models, binary classifiers, multi-class classifiers, clustering, random forests, logistic regression, linear regression, and gradient boosting. However, the statistical classification techniques used are not limited to these. Furthermore, the learning processing may be performed by the processing unit 142 in the information processing device 140, or may be performed by, for example, a server device 340 (see FIG. 8) communicably connected to the information processing device 140 over a network.

検知部の一例としての検知処理部１５３は、制御部１５１からの指示に従い、制御部１５１から渡された検知用音データ（サウンド信号の一態様）を、学習処理部１５２により生成された学習済みモデルに入力することで、検知用音データ（検知用サウンド信号の一態様）に含まれる人の感情を分析する処理（つまり検知処理）を実行する。 The detection processing unit 153, which is an example of a detection unit, follows an instruction from the control unit 151 and inputs the detection sound data (one form of a sound signal) passed from the control unit 151 into the learned model generated by the learning processing unit 152, thereby executing processing (that is, detection processing) that analyzes the emotion of a person contained in the detection sound data (one form of a detection sound signal).

判定処理部１５４は、検知処理部１５３の検知処理の検知結果を、検知用音データに含まれる人の感情の判定結果として出力する。判定処理部１５４による判定結果の出力先は、記憶部１４３あるいは表示部１４５でもよいし、情報処理装置１４０と通信可能に接続される外部機器（図示略）であってもよい。また、判定処理部１５４の処理は検知処理部１５３により実行されてもよく、この場合には判定処理部１５４の構成は省略されてよい。 The determination processing unit 154 outputs the detection result of the detection processing of the detection processing unit 153 as the determination result of the emotion of the person included in the detection sound data. The output destination of the determination result by the determination processing unit 154 may be the storage unit 143 or the display unit 145, or may be an external device (not shown) communicably connected to the information processing device 140. Further, the processing of the determination processing unit 154 may be executed by the detection processing unit 153, and in this case, the configuration of the determination processing unit 154 may be omitted.

次に、学習処理部１５２における学習処理の詳細について図２および図３を参照して説明する。図２は、交差検証を用いた学習処理部１５２の識別器による学習処理および評価処理の一例の説明図である。図３は、遺伝的アルゴリズム（ＧＡ）を用いた特徴量削減処理の説明概念図である。 Next, details of the learning processing in the learning processing unit 152 will be described with reference to FIGS. 2 and 3. FIG. 2 is an explanatory diagram illustrating an example of the learning processing and evaluation processing performed by the classifier of the learning processing unit 152 using cross-validation. FIG. 3 is a conceptual diagram explaining the feature amount reduction processing using a genetic algorithm (GA).

図２および図３の説明を分かり易くするために、学習用音データＶＣ１は、例えば５人の人物が試験用に発声した声の音データとして、５個のデータブロックＢＬＫ１，ＢＬＫ２，ＢＬＫ３，ＢＬＫ４，ＢＬＫ５により構成される。例えば、データブロックＢＬＫ１は１人目、データブロックＢＬＫ２は２人目、データブロックＢＬＫ３は３人目、データブロックＢＬＫ４は４人目、データブロックＢＬＫ５は５人目のそれぞれの人物の発声した声の音データでもよい。あるいは、データブロックＢＬＫ１，ＢＬＫ２，ＢＬＫ３，ＢＬＫ４，ＢＬＫ５は、学習用音データＶＣ１の全体を単に５等分したデータであってもよい。また、図２に示す学習用音データＶＣ１は、学習用音データＶＣ１の既定のＪ次元（例えば９８８次元）の特徴量として置き換えられてもよい。Ｊは既定の正の整数である。この場合には、データブロックＢＬＫ１〜ＢＬＫ５は、９８８次元の特徴量が５等分に分割された特徴量のデータブロックとなる。 To make the description of FIGS. 2 and 3 easier to follow, the learning sound data VC1 is assumed to consist of five data blocks BLK1, BLK2, BLK3, BLK4, and BLK5, for example as sound data of voices uttered for testing by five persons. For example, the data blocks BLK1, BLK2, BLK3, BLK4, and BLK5 may be the sound data of the voices uttered by the first, second, third, fourth, and fifth persons, respectively. Alternatively, the data blocks BLK1, BLK2, BLK3, BLK4, and BLK5 may be data obtained by simply dividing the whole of the learning sound data VC1 into five equal parts. The learning sound data VC1 shown in FIG. 2 may also be replaced with the predetermined J-dimensional (for example, 988-dimensional) feature amounts of the learning sound data VC1, where J is a predetermined positive integer. In this case, the data blocks BLK1 to BLK5 are data blocks of feature amounts obtained by dividing the 988-dimensional feature amounts into five equal parts.

図２に示すように、特徴量削減部の一例としての学習処理部１５２は、制御部１５１から学習用音データＶＣ１を受け取ると、交差検証を用いて、標本データに対応する学習用音データＶＣ１をｋ（ｋ：２以上の規定値）個のデータブロックＢＬＫ１，ＢＬＫ２，ＢＬＫ３，ＢＬＫ４，ＢＬＫ５に分割する。ここでは、例えばｋ＝５としている。 As shown in FIG. 2, upon receiving the learning sound data VC1 from the control unit 151, the learning processing unit 152, which is an example of a feature amount reduction unit, uses cross-validation to divide the learning sound data VC1, which corresponds to the sample data, into k (k: a specified value of 2 or more) data blocks BLK1, BLK2, BLK3, BLK4, and BLK5. Here, for example, k = 5.

更に、特徴量削減部の一例としての学習処理部１５２は、学習用音データＶＣ１のｋ個のデータブロックＢＬＫ１〜ＢＬＫ５への分割に伴い、その分割数と同数であるｋ種類のパターン（言い換えると役割）の異なるデータセットをそれぞれ準備して生成する（準備処理）。交差検証とは、統計学において、標本データ（例えば学習用音データＶＣ１）を所定個に分割し、その一部を先ず解析（例えば学習処理）して、残る部分でその解析（例えば学習処理）のテスト（例えば学習結果を用いた評価処理）を行い、解析自身の妥当性の検証あるいは確認を行うための手法である。 Furthermore, in dividing the learning sound data VC1 into the k data blocks BLK1 to BLK5, the learning processing unit 152, which is an example of a feature amount reduction unit, prepares and generates k data sets, the same number as the number of divisions, each with a different pattern (in other words, a different assignment of roles) (preparation processing). Cross-validation is a technique in statistics in which sample data (for example, the learning sound data VC1) is divided into a predetermined number of parts, one portion is analyzed first (for example, by learning processing), and the remaining portion is used to test that analysis (for example, by evaluation processing using the learning result), in order to verify or confirm the validity of the analysis itself.

具体的には、学習処理部１５２は、第１パターンの学習用音データのデータセット、第２パターンの学習用音データのデータセット、第３パターンの学習用音データのデータセット、第４パターンの学習用音データのデータセット、第５パターンの学習用音データのデータセットをそれぞれ生成する。第１パターン〜第５パターンのそれぞれの学習用音データのデータセットは、学習処理部１５２における学習処理および評価処理の対象となるデータブロックが異なる。 Specifically, the learning processing unit 152 generates a data set of learning sound data of a first pattern, a data set of learning sound data of a second pattern, a data set of learning sound data of a third pattern, a data set of learning sound data of a fourth pattern, and a data set of learning sound data of a fifth pattern. The data sets of the first to fifth patterns differ in which data blocks are the targets of the learning processing and of the evaluation processing in the learning processing unit 152.

第１パターンの学習用音データのデータセットでは、データブロックＢＬＫ１に対応するデータセットＥＳＴ１が評価処理に使用され、データブロックＢＬＫ２〜ＢＬＫ５に対応するデータセットＬＲＮ１が学習処理に使用される。 In the data set of the learning sound data of the first pattern, the data set EST1 corresponding to the data block BLK1 is used for the evaluation processing, and the data set LRN1 corresponding to the data blocks BLK2 to BLK5 is used for the learning processing.

第２パターンの学習用音データのデータセットでは、データブロックＢＬＫ２に対応するデータセットＥＳＴ２が評価処理に使用され、データブロックＢＬＫ１，ＢＬＫ３〜ＢＬＫ５に対応するデータセットＬＲＮ２が学習処理に使用される。 In the data set of the learning sound data of the second pattern, the data set EST2 corresponding to the data block BLK2 is used for the evaluation processing, and the data set LRN2 corresponding to the data blocks BLK1, BLK3 to BLK5 is used for the learning processing.

第３パターンの学習用音データのデータセットでは、データブロックＢＬＫ３に対応するデータセットＥＳＴ３が評価処理に使用され、データブロックＢＬＫ１，ＢＬＫ２，ＢＬＫ４，ＢＬＫ５に対応するデータセットＬＲＮ３が学習処理に使用される。 In the data set of the learning sound data of the third pattern, the data set EST3 corresponding to the data block BLK3 is used for the evaluation processing, and the data set LRN3 corresponding to the data blocks BLK1, BLK2, BLK4, and BLK5 is used for the learning processing.

第４パターンの学習用音データのデータセットでは、データブロックＢＬＫ４に対応するデータセットＥＳＴ４が評価処理に使用され、データブロックＢＬＫ１〜ＢＬＫ３，ＢＬＫ５に対応するデータセットＬＲＮ４が学習処理に使用される。 In the data set of the learning sound data of the fourth pattern, the data set EST4 corresponding to the data block BLK4 is used for the evaluation processing, and the data set LRN4 corresponding to the data blocks BLK1 to BLK3 and BLK5 is used for the learning processing.

第５パターンの学習用音データのデータセットでは、データブロックＢＬＫ５に対応するデータセットＥＳＴ５が評価処理に使用され、データブロックＢＬＫ１〜ＢＬＫ４に対応するデータセットＬＲＮ５が学習処理に使用される。 In the data set of the learning sound data of the fifth pattern, the data set EST5 corresponding to the data block BLK5 is used for the evaluation processing, and the data set LRN5 corresponding to the data blocks BLK1 to BLK4 is used for the learning processing.
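The five patterns described above can be sketched as follows. This is a minimal illustration only: the function name `make_patterns` and the use of plain strings to stand in for data blocks are assumptions made for the example, not part of the disclosure.

```python
def make_patterns(blocks):
    """Build k (learning set, evaluation set) patterns from k data blocks.

    Pattern i holds out block i for evaluation (EST) and uses the
    remaining k-1 blocks for learning (LRN).
    """
    patterns = []
    for i, held_out in enumerate(blocks):
        learning = [b for j, b in enumerate(blocks) if j != i]
        patterns.append((learning, [held_out]))
    return patterns


# Strings stand in for the actual data blocks (hypothetical placeholder data).
blocks = ["BLK1", "BLK2", "BLK3", "BLK4", "BLK5"]
patterns = make_patterns(blocks)
for n, (lrn, est) in enumerate(patterns, start=1):
    print(f"pattern {n}: LRN{n}={lrn} EST{n}={est}")
```

Each block is used for evaluation exactly once and for learning k-1 times, which is what makes the k evaluation results comparable across patterns.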

生成部の一例としての学習処理部１５２は、図３に示すように、学習用音データＶＣ１の学習処理に用いる特徴量を削減するために、進化的アルゴリズム（例えば遺伝的アルゴリズム）を採用し、遺伝的アルゴリズムにおける第１世代の複数（例えばＮ：２以上の整数）個の異なる個体１，個体２，…，個体Ｎ（特徴量削減ブロックの一態様）を生成する（例えば図４のステップＳｔ５参照）。 As shown in FIG. 3, the learning processing unit 152, which is an example of a generation unit, employs an evolutionary algorithm (for example, a genetic algorithm) to reduce the feature amounts used in the learning processing of the learning sound data VC1, and generates a plurality of (for example, N, an integer of 2 or more) different first-generation individuals of the genetic algorithm, namely individual 1, individual 2, ..., individual N (one form of a feature amount reduction block) (see, for example, step St5 in FIG. 4).

進化的アルゴリズム（ＥＡ：Evolutionary Algorithm）とは、進化的計算の一分野を意味し、人工知能（ＡＩ：Artificial Intelligent）の一部を構成し、個体群ベースのメタヒューリスティックな最適化アルゴリズムの総称である。例えば、進化的アルゴリズムは、そのメカニズムとして、生殖、突然変異、遺伝子組み換え、自然淘汰、適者生存等の進化の仕組みに着想を得た計算アルゴリズムである。進化的アルゴリズムは、前述した遺伝的アルゴリズムに限定されず、遺伝的アルゴリズムの代わりに遺伝的プログラミングあるいは進化的プログラミングが採用されてもよい。遺伝的アルゴリズムは、問題の解を探索するにあたって数値の列（例えば２進数）を使用し、選択と変異に加えて個体を次々に生成する。遺伝的プログラミングは、基本的に遺伝的アルゴリズムと同じであるが、解は木構造の形式で表し、数式やプログラムコードを表現する。進化的プログラミングは、解の適応度関数に、集団中におけるその解の優位性を表した確率的な関数を用いる。 An evolutionary algorithm (EA) belongs to the field of evolutionary computation, forms part of artificial intelligence (AI), and is a general term for population-based metaheuristic optimization algorithms. For example, an evolutionary algorithm is a computational algorithm whose mechanisms are inspired by evolutionary mechanisms such as reproduction, mutation, genetic recombination, natural selection, and survival of the fittest. The evolutionary algorithm is not limited to the genetic algorithm described above, and genetic programming or evolutionary programming may be employed instead of the genetic algorithm. A genetic algorithm uses sequences of numbers (for example, binary digits) when searching for a solution to a problem and, in addition to selection and mutation, generates individuals one after another. Genetic programming is basically the same as a genetic algorithm, but a solution is represented as a tree structure that expresses a mathematical formula or program code. Evolutionary programming uses, as the fitness function of a solution, a stochastic function that represents the superiority of that solution within the population.

それぞれの個体は、既定の特徴量の数（例えば９８８次元）と同数分のビット列からなるビットサイズを有し、例えば２進数のパラメータ「０」および「１」がランダムに羅列された数列として構成される。前述したように、「１」は、特徴量の使用を指示するパラメータである。「０」は、特徴量の不使用を指示するパラメータである。例えば、学習用音データＶＣ１の特徴量が「ラウドネス，インテンシティ，ＭＦＣＣ，ＺＣＲ，Ｆ０，…」であり、かつ個体が「１，０，１，１，０，…」のビット列で構成される場合、少なくとも「ラウドネス，ＭＦＣＣ，ＺＣＲ」の特徴量は使用され、少なくとも「インテンシティ，Ｆ０」の特徴量は使用されない。なお、特徴量の配列（つまり、前述した「ラウドネス，インテンシティ，ＭＦＣＣ，ＺＣＲ，Ｆ０，…」の並び）は予め決まっているとする。 Each individual has a bit size equal to the predetermined number of feature amounts (for example, 988 dimensions) and is configured, for example, as a sequence in which the binary parameters “0” and “1” are arranged at random. As described above, “1” is a parameter instructing the use of a feature amount, and “0” is a parameter instructing non-use of a feature amount. For example, if the feature amounts of the learning sound data VC1 are “loudness, intensity, MFCC, ZCR, F0, ...” and the individual consists of the bit string “1, 0, 1, 1, 0, ...”, then at least the feature amounts “loudness, MFCC, ZCR” are used and at least the feature amounts “intensity, F0” are not used. The ordering of the feature amounts (that is, the order “loudness, intensity, MFCC, ZCR, F0, ...” given above) is assumed to be fixed in advance.
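The relationship between an individual and the feature amounts it enables can be sketched as below. The helper names `random_individual` and `select_features`, and the five-element feature list, are hypothetical and chosen only to mirror the example in the text.

```python
import random

# Fixed, predetermined ordering of feature amounts (truncated to five for illustration).
FEATURE_NAMES = ["loudness", "intensity", "MFCC", "ZCR", "F0"]


def random_individual(num_features, rng):
    """Generate one individual: a random bit string with one gene per feature amount."""
    return [rng.randint(0, 1) for _ in range(num_features)]


def select_features(individual, values):
    """Keep only the entries whose gene is 1 (use); drop those whose gene is 0."""
    if len(individual) != len(values):
        raise ValueError("individual and feature vector must have equal length")
    return [v for gene, v in zip(individual, values) if gene == 1]


individual = [1, 0, 1, 1, 0]
used = select_features(individual, FEATURE_NAMES)
unused = select_features([1 - g for g in individual], FEATURE_NAMES)
print("used:", used)
print("unused:", unused)

# A first-generation individual for the 988-dimensional case.
first_gen_individual = random_individual(988, random.Random(0))
```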

特徴量削減部の一例としての学習処理部１５２は、先ず第１世代の個体１から、前述した第１パターンの学習用音データのうち学習処理に使用されるデータセットＬＲＮ１（つまりデータブロックＢＬＫ２〜ＢＬＫ５）に対応する序数の部分のビット列を抽出する。特徴量削減部の一例としての学習処理部１５２は、この抽出されたビット列を構成するそれぞれのパラメータ（つまり「０」，「１」）と学習処理に使用されるデータブロックＢＬＫ２〜ＢＬＫ５に対応する特徴量のそれぞれとをＡＮＤ演算することで、学習処理に用いる特徴量を削減する。そして、特徴量削減部の一例としての学習処理部１５２は、削減済みの特徴量を用いて、学習処理部１５２における識別器（つまり識別用アルゴリズム）の学習処理を実行する。特徴量削減部の一例としての学習処理部１５２は、学習処理の後、学習処理後の識別器を用いて、前述した第１パターンの学習用音データのうち評価処理に使用されるデータセットＥＳＴ１（つまりデータブロックＢＬＫ１）の評価処理を実行し、評価結果（特徴量削減処理結果の一態様）としてのスコア（例えば「０．６」、図３参照）を出力する。 The learning processing unit 152, which is an example of a feature amount reduction unit, first extracts from the first-generation individual 1 the bit string at the ordinal positions corresponding to the data set LRN1 (that is, the data blocks BLK2 to BLK5) used for the learning processing among the learning sound data of the first pattern described above. The learning processing unit 152 reduces the feature amounts used in the learning processing by ANDing each parameter (that is, “0” or “1”) of the extracted bit string with the corresponding feature amount of the data blocks BLK2 to BLK5 used for the learning processing. The learning processing unit 152 then executes the learning processing of the classifier (that is, the identification algorithm) in the learning processing unit 152 using the reduced feature amounts. After the learning processing, the learning processing unit 152 uses the trained classifier to execute the evaluation processing on the data set EST1 (that is, the data block BLK1) used for the evaluation processing among the learning sound data of the first pattern, and outputs a score (for example, “0.6”; see FIG. 3) as an evaluation result (one form of a feature amount reduction processing result).

次に、特徴量削減部の一例としての学習処理部１５２は、第１世代の個体１から、前述した第２パターンの学習用音データのうち学習処理に使用されるデータセットＬＲＮ２（つまりデータブロックＢＬＫ１，ＢＬＫ３〜ＢＬＫ５）に対応する序数の部分のビット列を抽出する。特徴量削減部の一例としての学習処理部１５２は、この抽出されたビット列を構成するそれぞれのパラメータ（つまり「０」，「１」）と学習処理に使用されるデータブロックＢＬＫ１，ＢＬＫ３〜ＢＬＫ５に対応する特徴量のそれぞれとをＡＮＤ演算することで、学習処理に用いる特徴量を削減する。そして、特徴量削減部の一例としての学習処理部１５２は、削減済みの特徴量を用いて、学習処理部１５２における識別器（つまり識別用アルゴリズム）の学習処理を実行する。特徴量削減部の一例としての学習処理部１５２は、学習処理の後、学習処理後の識別器を用いて、前述した第２パターンの学習用音データのうち評価処理に使用されるデータセットＥＳＴ２（つまりデータブロックＢＬＫ２）の評価処理を実行し、評価結果としてのスコア（例えば「０．７」、図３参照）を出力する。学習処理部１５２は、第３パターン、第４パターンのそれぞれの学習用音データについても同様な処理を繰り返す。 Next, the learning processing unit 152, which is an example of a feature amount reduction unit, extracts from the first-generation individual 1 the bit string at the ordinal positions corresponding to the data set LRN2 (that is, the data blocks BLK1 and BLK3 to BLK5) used for the learning processing among the learning sound data of the second pattern described above. The learning processing unit 152 reduces the feature amounts used in the learning processing by ANDing each parameter (that is, “0” or “1”) of the extracted bit string with the corresponding feature amount of the data blocks BLK1 and BLK3 to BLK5 used for the learning processing. The learning processing unit 152 then executes the learning processing of the classifier (that is, the identification algorithm) in the learning processing unit 152 using the reduced feature amounts. After the learning processing, the learning processing unit 152 uses the trained classifier to execute the evaluation processing on the data set EST2 (that is, the data block BLK2) used for the evaluation processing among the learning sound data of the second pattern, and outputs a score (for example, “0.7”; see FIG. 3) as an evaluation result. The learning processing unit 152 repeats the same processing for the learning sound data of the third pattern and the fourth pattern.

同様にして、特徴量削減部の一例としての学習処理部１５２は、先ず第１世代の個体１から、前述した第５パターンの学習用音データのうち学習処理に使用されるデータセットＬＲＮ５（つまりデータブロックＢＬＫ１〜ＢＬＫ４）に対応する序数の部分のビット列を抽出する。特徴量削減部の一例としての学習処理部１５２は、この抽出されたビット列を構成するそれぞれのパラメータ（つまり「０」，「１」）と学習処理に使用されるデータブロックＢＬＫ１〜ＢＬＫ４に対応する特徴量のそれぞれとをＡＮＤ演算することで、学習処理に用いる特徴量を削減する。そして、特徴量削減部の一例としての学習処理部１５２は、削減済みの特徴量を用いて、学習処理部１５２における識別器（つまり識別用アルゴリズム）の学習処理を実行する。特徴量削減部の一例としての学習処理部１５２は、学習処理の後、学習処理後の識別器を用いて、前述した第５パターンの学習用音データのうち評価処理に使用されるデータセットＥＳＴ５（つまりデータブロックＢＬＫ５）の評価処理を実行し、評価結果としてのスコア（例えば「０．６」、図３参照）を出力する。 Similarly, the learning processing unit 152, which is an example of a feature amount reduction unit, extracts from the first-generation individual 1 the bit string at the ordinal positions corresponding to the data set LRN5 (that is, the data blocks BLK1 to BLK4) used for the learning processing among the learning sound data of the fifth pattern described above. The learning processing unit 152 reduces the feature amounts used in the learning processing by ANDing each parameter (that is, “0” or “1”) of the extracted bit string with the corresponding feature amount of the data blocks BLK1 to BLK4 used for the learning processing. The learning processing unit 152 then executes the learning processing of the classifier (that is, the identification algorithm) in the learning processing unit 152 using the reduced feature amounts. After the learning processing, the learning processing unit 152 uses the trained classifier to execute the evaluation processing on the data set EST5 (that is, the data block BLK5) used for the evaluation processing among the learning sound data of the fifth pattern, and outputs a score (for example, “0.6”; see FIG. 3) as an evaluation result.
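The per-pattern learn-then-evaluate cycle for one individual can be sketched as follows. This is an illustration under several assumptions: the classifier is a toy nearest-centroid model standing in for the unspecified classifier of the learning processing unit 152, each sample is a (feature vector, label) pair, the individual is applied as a mask over the feature columns of every sample, and all function names and data values are hypothetical.

```python
from statistics import mean


def mask(features, individual):
    """Keep only the feature amounts whose gene in the individual is 1."""
    return [v for gene, v in zip(individual, features) if gene == 1]


def train_centroids(samples):
    """Toy classifier: one mean vector (centroid) per label."""
    sums, counts = {}, {}
    for x, y in samples:
        if y not in sums:
            sums[y], counts[y] = [0.0] * len(x), 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(model, x):
    """Return the label whose centroid is nearest to x (squared Euclidean distance)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))


def cross_validate(individual, blocks):
    """Return the k per-pattern scores (accuracy) and their average for one individual."""
    scores = []
    for i, evaluation_block in enumerate(blocks):
        learning = [s for j, b in enumerate(blocks) if j != i for s in b]
        model = train_centroids([(mask(x, individual), y) for x, y in learning])
        hits = sum(predict(model, mask(x, individual)) == y for x, y in evaluation_block)
        scores.append(hits / len(evaluation_block))
    return scores, mean(scores)


# Hypothetical data: feature 0 separates the labels, feature 1 is misleading noise.
blocks = [
    [([0.1, 9.0], "a"), ([0.9, -9.0], "b")],
    [([0.0, -9.0], "a"), ([1.0, 9.0], "b")],
    [([0.2, 9.0], "a"), ([0.8, -9.0], "b")],
]
scores, final_score = cross_validate([1, 0], blocks)
print(scores, final_score)
```

With the individual `[1, 0]` the noisy second feature is dropped before both learning and evaluation, which is the effect the AND operation in the text is meant to achieve.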

学習処理部１５２は、第１世代の個体１を用いた学習処理および評価処理により、計５個のスコア（例えば、「０．６」，「０．７」，「０．７」，「０．５」，「０．６」）を算出し、平均値（ＡＶＥ）である「０．６２」を、第１世代の個体１による最終評価スコアとして出力する。 Through the learning processing and evaluation processing using the first-generation individual 1, the learning processing unit 152 calculates a total of five scores (for example, “0.6”, “0.7”, “0.7”, “0.5”, and “0.6”) and outputs their average value (AVE), “0.62”, as the final evaluation score of the first-generation individual 1.

また、学習処理部１５２は、第１世代の個体２から個体Ｎまでのそれぞれの個体についても、同様に前述した個体１を用いた学習処理および評価処理を同様に実行する。図２および図３の説明において、それぞれの個体を用いた学習処理部１５２による学習処理および評価処理の手順は同様であるため、詳細な手順の説明は割愛する。 In addition, the learning processing unit 152 similarly executes the learning process and the evaluation process using the individual 1 described above for each individual from the first generation individual 2 to the individual N. In the description of FIG. 2 and FIG. 3, the procedure of the learning process and the evaluation process by the learning processing unit 152 using each individual is the same, and thus the detailed description of the procedure is omitted.

学習処理部１５２は、第１世代の個体２を用いた学習処理および評価処理（前述参照）により、計５個のスコア（例えば、「０．４」，「０．４」，「０．５」，「０．８」，「０．４」）を算出し、平均値（ＡＶＥ）である「０．５０」を、第１世代の個体２による最終評価スコアとして出力する。 Through the learning processing and evaluation processing (see above) using the first-generation individual 2, the learning processing unit 152 calculates a total of five scores (for example, “0.4”, “0.4”, “0.5”, “0.8”, and “0.4”) and outputs their average value (AVE), “0.50”, as the final evaluation score of the first-generation individual 2.

同様に、学習処理部１５２は、第１世代の個体Ｎを用いた学習処理および評価処理（前述参照）により、計５個のスコア（例えば、「０．６」，「０．４」，「０．６」，「０．８」，「０．５」）を算出し、平均値（ＡＶＥ）である「０．５８」を、第１世代の個体Ｎによる最終評価スコアとして出力する。 Similarly, through the learning processing and evaluation processing (see above) using the first-generation individual N, the learning processing unit 152 calculates a total of five scores (for example, “0.6”, “0.4”, “0.6”, “0.8”, and “0.5”) and outputs their average value (AVE), “0.58”, as the final evaluation score of the first-generation individual N.
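The final evaluation scores of the first-generation individuals can be reproduced from the example scores quoted above. The sketch below only restates that arithmetic; the dictionary keys are labels chosen for the example.

```python
from statistics import mean

# Per-pattern scores for three first-generation individuals, as quoted in the text.
fold_scores = {
    "individual 1": [0.6, 0.7, 0.7, 0.5, 0.6],
    "individual 2": [0.4, 0.4, 0.5, 0.8, 0.4],
    "individual N": [0.6, 0.4, 0.6, 0.8, 0.5],
}

# Final evaluation score = average (AVE) of the k = 5 per-pattern scores.
final_scores = {name: round(mean(s), 2) for name, s in fold_scores.items()}

# The individual with the highest final evaluation score in this generation.
best = max(final_scores, key=final_scores.get)
print(final_scores)
print("best:", best)
```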

次に、学習処理部１５２は、第１世代のそれぞれの個体１〜個体Ｎによる最終評価スコアに応じた確率分布に従い、第１世代のＮ個の個体から２個の個体をランダムに選択し、それぞれの個体を構成するパラメータ（要素）をＡＮＤ演算することで次世代（例えば第２世代）の個体を次々に生成して、第１世代と同様にＮ個生成する。例えば、図３に示すように、学習処理部１５２は、第１世代の個体１および個体２を選択して、第２世代の個体１を生成する。学習処理部１５２は、第１世代の個体２および個体Ｎを選択して、第２世代の個体２を生成する。同様にして、学習処理部１５２は、第１世代の個体１および個体Ｎを選択して、第２世代の個体Ｎを生成する。このように、学習処理部１５２は、遺伝的アルゴリズムにおいて第２世代のＮ個の個体を生成する。 Next, the learning processing unit 152 randomly selects two individuals from the first generation N individuals according to a probability distribution according to the final evaluation score of each of the first generation individuals 1 to N, By performing an AND operation on parameters (elements) constituting each individual, individuals of the next generation (for example, the second generation) are generated one after another, and N are generated similarly to the first generation. For example, as shown in FIG. 3, the learning processing unit 152 selects the first generation individual 1 and the individual 2 and generates the second generation individual 1. The learning processing unit 152 generates the second generation individual 2 by selecting the first generation individual 2 and the individual N. Similarly, the learning processing unit 152 selects the first generation individual 1 and the individual N, and generates the second generation individual N. Thus, the learning processing unit 152 generates N individuals of the second generation by the genetic algorithm.
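The generation step described above, score-weighted random selection of two parents followed by an element-wise AND, can be sketched as below. The function names are hypothetical, the two parents are assumed to be distinct individuals, and at least two individuals are assumed to have a positive final evaluation score.

```python
import random


def pick_two_parents(population, final_scores, rng):
    """Draw two distinct parents with probability proportional to their final scores."""
    indices = list(range(len(population)))
    first = rng.choices(indices, weights=final_scores, k=1)[0]
    # Draw the second parent from the remaining individuals only.
    rest = [i for i in indices if i != first]
    second = rng.choices(rest, weights=[final_scores[i] for i in rest], k=1)[0]
    return population[first], population[second]


def crossover_and(parent_a, parent_b):
    """Child gene is 1 only where both parents have 1."""
    return [a & b for a, b in zip(parent_a, parent_b)]


def next_generation(population, final_scores, rng):
    """Generate as many children as there are individuals in the current generation."""
    return [
        crossover_and(*pick_two_parents(population, final_scores, rng))
        for _ in range(len(population))
    ]


# Hypothetical first generation with the final evaluation scores quoted in the text.
population = [[1, 0, 1, 1, 0], [1, 1, 0, 1, 0], [0, 1, 1, 1, 0]]
final_scores = [0.62, 0.50, 0.58]
children = next_generation(population, final_scores, random.Random(0))
print(children)
```

A bitwise AND can only clear bits, so under this operation the number of feature amounts in use never increases from one generation to the next, which matches the goal of feature amount reduction.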

この後、学習処理部１５２は、第２世代のＮ個の個体のそれぞれについても、第１世代のＮ個の個体のそれぞれを用いた学習処理および評価処理と同様な学習処理および評価処理を実行する。 Thereafter, for each of the N individuals of the second generation, the learning processing unit 152 executes the same learning processing and evaluation processing as those executed using each of the N individuals of the first generation.

学習処理部１５２は、第２世代の個体１を用いた学習処理および評価処理により、計５個のスコア（例えば、「０．７」，「０．７」，「０．６」，「０．５」，「０．７」）を算出し、平均値（ＡＶＥ）である「０．６４」を、第２世代の個体１による最終評価スコアとして出力する。 Through the learning processing and evaluation processing using the second-generation individual 1, the learning processing unit 152 calculates a total of five scores (for example, “0.7”, “0.7”, “0.6”, “0.5”, and “0.7”) and outputs their average value (AVE), “0.64”, as the final evaluation score of the second-generation individual 1.

学習処理部１５２は、第２世代の個体２を用いた学習処理および評価処理（前述参照）により、計５個のスコア（例えば、「０．６」，「０．４」，「０．６」，「０．６」，「０．５」）を算出し、平均値（ＡＶＥ）である「０．５４」を、第２世代の個体２による最終評価スコアとして出力する。 Through the learning processing and evaluation processing (see above) using the second-generation individual 2, the learning processing unit 152 calculates a total of five scores (for example, “0.6”, “0.4”, “0.6”, “0.6”, and “0.5”) and outputs their average value (AVE), “0.54”, as the final evaluation score of the second-generation individual 2.

同様に、学習処理部１５２は、第２世代の個体Ｎを用いた学習処理および評価処理（前述参照）により、計５個のスコア（例えば、「０．７」，「０．５」，「０．７」，「０．７」，「０．６」）を算出し、平均値（ＡＶＥ）である「０．６２」を、第２世代の個体Ｎによる最終評価スコアとして出力する。 Similarly, through the learning processing and evaluation processing (see above) using the second-generation individual N, the learning processing unit 152 calculates a total of five scores (for example, “0.7”, “0.5”, “0.7”, “0.7”, and “0.6”) and outputs the average value (AVE), given as “0.62”, as the final evaluation score of the second-generation individual N.

また、学習処理部１５２は、第２世代以降のそれぞれの個体１〜個体Ｎによる最終評価スコアに応じた確率分布に従い、第２世代以降のＮ個の個体から２個の個体をランダムに選択し、それぞれの個体を構成するパラメータ（要素）をＡＮＤ演算することで次世代（例えば第３世代以降）の個体を次々に生成して、第１世代や第２世代と同様にＮ個生成する。このように、学習処理部１５２は、遺伝的アルゴリズムにおいて既定の上限として設定された第Ｍ世代のＮ個の個体を生成する。Ｍは３以上の既定の整数である。同様に、学習処理部１５２は、第Ｍ世代のＮ個の個体のそれぞれについても、それぞれの個体を用いた学習処理および評価処理を実行することで、それぞれの個体による最終評価スコアを求めて出力する。 Furthermore, the learning processing unit 152 randomly selects two individuals from the N individuals of the second or a later generation in accordance with a probability distribution based on the final evaluation scores of individuals 1 to N of that generation, and ANDs the parameters (elements) of the two individuals to generate individuals of the next generation (for example, the third generation onward) one after another until N have been generated, as in the first and second generations. In this way, the learning processing unit 152 generates individuals up to the N individuals of the M-th generation, which is set as the predetermined upper limit in the genetic algorithm. M is a predetermined integer of 3 or more. Similarly, for each of the N individuals of the M-th generation, the learning processing unit 152 executes the learning processing and evaluation processing using that individual, and obtains and outputs the final evaluation score of each individual.

学習制御部の一例としての学習処理部１５２は、第１世代〜第Ｍ世代のそれぞれの個体による最終評価スコアの中で最も良好であった個体を選択し、その個体（つまり、特徴量の使用の有無を規定するパラメータの数列により構成されるビット列）を削減情報として出力する。 The learning processing unit 152, which is an example of a learning control unit, selects the individual with the best final evaluation score among the individuals of the first to M-th generations, and outputs that individual (that is, the bit string consisting of the sequence of parameters that define whether each feature amount is used) as reduction information.

次に、実施の形態１に係る音データ学習システム２００による音データ学習処理の全体動作手順について、図４を参照して説明する。図４は、音データ学習処理の全体動作手順例を説明するフローチャートである。図４に示す処理は、主に情報処理装置１４０の処理部１４２により実行される。 Next, an overall operation procedure of the sound data learning process by the sound data learning system 200 according to the first embodiment will be described with reference to FIG. FIG. 4 is a flowchart illustrating an example of an overall operation procedure of the sound data learning process. The processing illustrated in FIG. 4 is mainly executed by the processing unit 142 of the information processing device 140.

図４において、情報処理装置１４０は、マイクロホン１１０により収音された学習用音データを、オーディオインターフェース１２０を介して、処理部１４２において入力して取得する（Ｓｔ１）。情報処理装置１４０は、ステップＳｔ１において取得された学習用音データから、既知の次元数（例えば９８８次元）となる特徴量のデータを学習処理部１５２において抽出する（Ｓｔ２）。 In FIG. 4, the information processing apparatus 140 inputs and acquires the learning sound data collected by the microphone 110 in the processing unit 142 via the audio interface 120 (St1). The information processing apparatus 140 uses the learning processing unit 152 to extract feature amount data having a known number of dimensions (for example, 988 dimensions) from the learning sound data acquired in step St1 (St2).

情報処理装置１４０は、例えば記憶部１４３のＲＡＭに保持されている情報等を参照しながら、現在が遺伝的アルゴリズムのどの世代に到達しているかを学習処理部１５２において判定する（Ｓｔ３）。情報処理装置１４０は、現在が遺伝的アルゴリズムの第Ｍ世代に到達していないと判定した場合には（Ｓｔ３、ＮＯ）、現在の変数ｉが「１」でない場合には変数ｉ＝１を学習処理部１５２において設定し、現在が第１世代であることを示す情報を処理部１４２において記憶部１４３のＲＡＭに一時的に保存する（Ｓｔ４）。 The information processing device 140 determines, in the learning processing unit 152, which generation of the genetic algorithm has currently been reached, referring to, for example, information held in the RAM of the storage unit 143 (St3). If the information processing device 140 determines that the M-th generation of the genetic algorithm has not yet been reached (St3, NO), then, if the current variable i is not “1”, it sets the variable i = 1 in the learning processing unit 152 and temporarily stores, via the processing unit 142, information indicating that the current generation is the first generation in the RAM of the storage unit 143 (St4).

情報処理装置１４０は、ステップＳｔ４の後、学習処理部１５２における識別器（つまり識別用アルゴリズム）の学習処理に用いる特徴量を削減するための処理（つまり特徴量削減処理）として、遺伝的アルゴリズムにおける第ｉ世代（例えば第１世代）の複数の個体（例えば個体１〜個体Ｎ）を学習処理部１５２において生成する（Ｓｔ５、図３参照）。 After step St4, the information processing device 140 performs a process for reducing the feature amount used for the learning process of the classifier (that is, the identification algorithm) in the learning processing unit 152 (that is, the feature amount reduction process) in the genetic algorithm. A plurality of individuals (for example, individuals 1 to N) of the i-th generation (for example, the first generation) are generated in the learning processing unit 152 (St5, see FIG. 3).

情報処理装置１４０は、ステップＳｔ５において生成された第ｉ世代（例えば第１世代）の複数の個体のそれぞれを用いた場合の、交差検証を用いた学習処理部１５２における識別器（つまり識別用アルゴリズム）の学習処理および評価処理を学習処理部１５２においてそれぞれ実行する（Ｓｔ６）。ステップＳｔ５，Ｓｔ６の詳細については、図２および図３を参照して説明したので、ここでは詳細な内容の説明は省略する。 The information processing apparatus 140 uses a cross-validation-based classifier (that is, a discrimination algorithm) in the learning processing unit 152 using each of a plurality of individuals of the i-th generation (for example, the first generation) generated in step St5. The learning process and the evaluation process are performed by the learning processing unit 152 (St6). Since the details of steps St5 and St6 have been described with reference to FIGS. 2 and 3, detailed description thereof will be omitted here.

情報処理装置１４０は、第ｉ世代（例えば第１世代）の個体１〜個体Ｎのそれぞれによる最終評価スコアの中で最も良好な最終評価スコアが得られた個体を学習処理部１５２において選択する（Ｓｔ７）。 The information processing device 140 selects, in the learning processing unit 152, the individual who has obtained the best final evaluation score among the final evaluation scores of the individual 1 to individual N of the i-th generation (for example, the first generation) ( St7).

情報処理装置１４０は、ステップＳｔ７において選択された個体による最終評価スコアが所定値以上であるか否かを処理部１４２において判定する（Ｓｔ８）。情報処理装置１４０は、ステップＳｔ７において選択された個体による最終評価スコアが所定値以上であると判定した場合には（Ｓｔ８、ＹＥＳ）、その所定値以上となった最終評価スコアに対応する個体を削減情報として記憶部１４３に保存するとともに出力する（Ｓｔ９）。 The information processing device 140 causes the processing unit 142 to determine whether or not the final evaluation score of the individual selected in step St7 is equal to or greater than a predetermined value (St8). If the information processing device 140 determines that the final evaluation score of the individual selected in step St7 is equal to or more than the predetermined value (St8, YES), the information processing apparatus 140 determines the individual corresponding to the final evaluation score equal to or more than the predetermined value. The information is stored in the storage unit 143 as reduction information and output (St9).

一方、ステップＳｔ７において選択された個体による最終評価スコアが所定値以上ではないと判定された場合には（Ｓｔ８、ＮＯ）、情報処理装置１４０の処理はステップＳｔ３に戻る。 On the other hand, when it is determined in step St7 that the final evaluation score of the selected individual is not greater than or equal to the predetermined value (St8, NO), the processing of the information processing device 140 returns to step St3.

一方、情報処理装置１４０は、現在が遺伝的アルゴリズムの第Ｍ世代に到達したと判定した場合には（Ｓｔ３、ＹＥＳ）、第１世代〜第Ｍ世代の各世代における複数の個体１〜個体Ｎのそれぞれによる最終評価スコアの中で最も良好な最終評価スコアが得られた個体を学習処理部１５２において選択する（Ｓｔ１０）。情報処理装置１４０は、ステップＳｔ１０において選択された個体を削減情報として記憶部１４３に保存するとともに出力する（Ｓｔ９）。 On the other hand, if the information processing device 140 determines that the M-th generation of the genetic algorithm has been reached (St3, YES), it selects, in the learning processing unit 152, the individual with the best final evaluation score among the final evaluation scores of individuals 1 to N in each of the first to M-th generations (St10). The information processing device 140 stores the individual selected in step St10 in the storage unit 143 as reduction information and outputs it (St9).
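The control flow of FIG. 4 (generate a generation, evaluate it, stop early when the best final evaluation score reaches the predetermined value, otherwise continue up to the M-th generation and return the best individual seen) can be sketched as follows. This is a simplified reading of steps St3 to St10, not the flowchart itself: the fitness function is a caller-supplied stand-in for the cross-validated learning and evaluation of St6, `toy_fitness` is invented for the demonstration, and all names and parameter values are hypothetical.

```python
import random


def breed(population, scores, rng):
    """Next generation: score-weighted choice of two distinct parents, then bitwise AND."""
    weights = [max(s, 0.0) + 1e-9 for s in scores]  # keep every weight positive
    idx = list(range(len(population)))
    children = []
    for _ in population:
        a = rng.choices(idx, weights=weights, k=1)[0]
        rest = [i for i in idx if i != a]
        b = rng.choices(rest, weights=[weights[i] for i in rest], k=1)[0]
        children.append([x & y for x, y in zip(population[a], population[b])])
    return children


def search_reduction_info(num_features, n, m, threshold, fitness, rng):
    """Return (individual, final score, generation) chosen as reduction information."""
    population = [[rng.randint(0, 1) for _ in range(num_features)] for _ in range(n)]
    best, best_score = None, float("-inf")
    for generation in range(1, m + 1):
        scores = [fitness(ind) for ind in population]      # St6: evaluate each individual
        top = max(range(n), key=lambda i: scores[i])       # St7: best of this generation
        if scores[top] > best_score:
            best, best_score = population[top], scores[top]
        if scores[top] >= threshold:                       # St8 YES: good enough, stop
            return population[top], scores[top], generation
        if generation < m:
            population = breed(population, scores, rng)    # St5 for the next generation
    return best, best_score, m                             # St10: best over all generations


def toy_fitness(individual):
    """Hypothetical score: features 0 and 2 help, every other enabled feature costs a little."""
    useful = individual[0] + individual[2]
    extra = sum(individual) - useful
    return useful / 2 - 0.05 * extra


individual, score, generation = search_reduction_info(
    num_features=8, n=10, m=5, threshold=0.9, fitness=toy_fitness, rng=random.Random(1)
)
print(individual, score, generation)
```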

次に、実施の形態１に係る音データ学習システム２００による削減済み特徴量を用いた学習処理の動作手順について、図５を参照して説明する。図５は、削減済み特徴量を用いた学習処理の動作手順例を説明するフローチャートである。図５に示す処理は、図４に示す処理が全て実行された後で、主に情報処理装置１４０の学習処理部１５２により実行される。 Next, an operation procedure of the learning processing using the reduced feature amounts by the sound data learning system 200 according to the first embodiment will be described with reference to FIG. 5. FIG. 5 is a flowchart illustrating an example of the operation procedure of the learning processing using the reduced feature amounts. The processing shown in FIG. 5 is executed mainly by the learning processing unit 152 of the information processing device 140 after all of the processing shown in FIG. 4 has been executed.

図５において、情報処理装置１４０は、図４に示すステップＳｔ１の時点とは異なる時点にマイクロホン１１０により収音された学習用音データを、オーディオインターフェース１２０を介して、処理部１４２において入力して取得する（Ｓｔ１１）。情報処理装置１４０は、ステップＳｔ１において取得された学習用音データに対し、所定のデータ拡張処理を実行する（Ｓｔ１２）。ステップＳｔ１２において実行されるデータ拡張処理は、例えば次の６通りのうち少なくとも一つであるが、これらに限定されない。 In FIG. 5, the information processing device 140 inputs and acquires, in the processing unit 142 via the audio interface 120, learning sound data collected by the microphone 110 at a time different from that of step St1 shown in FIG. 4 (St11). The information processing device 140 executes predetermined data extension processing on the learning sound data acquired in step St1 (St12). The data extension processing executed in step St12 is, for example, at least one of the following six types, but is not limited to these.

学習処理部１５２は、第１のデータ拡張処理として、学習用音データにエコー成分を付加する。これにより、学習処理部１５２は、第１のデータ拡張処理を実行しない場合に比べて、高精度な学習済みモデルの生成に資する学習用音データを生成できる。 The learning processing unit 152 adds an echo component to the learning sound data as a first data extension process. Accordingly, the learning processing unit 152 can generate learning sound data that contributes to generation of a highly accurate learned model, as compared to a case where the first data expansion processing is not performed.

学習処理部１５２は、第２のデータ拡張処理として、第１のデータ拡張処理に加え、所定（例えば屋内の居室にいる環境と同等）のノイズ成分を付加する。これにより、学習処理部１５２は、第１のデータ拡張処理を実行する場合に比べて、高精度な学習済みモデルの生成に資する学習用音データを生成できる。 The learning processing unit 152 adds a predetermined noise component (e.g., equivalent to an indoor environment) in addition to the first data expansion process as the second data expansion process. Thereby, the learning processing unit 152 can generate learning sound data that contributes to generation of a highly accurate learned model, as compared with the case where the first data expansion processing is performed.

As a third data extension process, the learning processing unit 152 reduces, on top of the second data extension process, the speech rate (that is, the speed of a person's speaking voice) to 90%. The learning processing unit 152 can thereby generate learning sound data that contributes to a more accurate learned model than when the first or second data extension process is executed.

As a fourth data extension process, the learning processing unit 152 increases, on top of the third data extension process, the speech rate (that is, the speed of a person's speaking voice) to 110%. The learning processing unit 152 can thereby generate learning sound data that contributes to a more accurate learned model than when any one of the first to third data extension processes is executed.

As a fifth data extension process, the learning processing unit 152 adds, on top of the fourth data extension process, white noise (for example, at a signal-to-noise ratio of 24 dB). The learning processing unit 152 can thereby generate learning sound data that contributes to a more accurate learned model than when any one of the first to fourth data extension processes is executed.

As a sixth data extension process, the learning processing unit 152 adds, on top of the fifth data extension process, white noise (for example, at a signal-to-noise ratio of 36 dB). The learning processing unit 152 can thereby generate learning sound data that contributes to a more accurate learned model than when any one of the first to fifth data extension processes is executed.
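The building blocks of the six data extension processes (echo, speech-rate change, white noise at a target signal-to-noise ratio) can be sketched as follows. This is only an illustrative sketch: the function names, the echo delay and gain, and the use of linear-interpolation resampling are assumptions, not details given in the publication. Note that this naive resampling changes pitch along with speed.

```python
import numpy as np

def add_echo(x, sr, delay_s=0.1, gain=0.4):
    """Add one delayed, attenuated copy of the signal (assumed echo model)."""
    d = int(delay_s * sr)
    y = x.copy()
    if 0 < d < len(x):
        y[d:] += gain * x[:-d]
    return y

def add_white_noise(x, snr_db, rng):
    """Add Gaussian white noise whose power gives the requested SNR in dB."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)

def change_speed(x, rate):
    """Resample by linear interpolation: rate 0.9 slows down, 1.1 speeds up."""
    n_out = int(round(len(x) / rate))
    src = np.linspace(0, len(x) - 1, n_out)
    return np.interp(src, np.arange(len(x)), x)
```

Because each process in the text is applied on top of the previous one, the variants could be built by chaining these functions, for example `add_white_noise(change_speed(add_echo(x, sr), 0.9), 24.0, rng)`.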

The information processing device 140 reads the reduction information stored in the storage unit 143 (see step St9 in FIG. 4) and inputs it to the learning processing unit 152 (St13).

Using the learning sound data that underwent the data extension process in step St12 and the reduction information input in step St13, the information processing device 140 causes the learning processing unit 152 to extract, from the predetermined array of feature amounts (see above), only the feature amount items that correspond to the parameter "1" in the reduction information (individual). The information processing device 140 then causes the learning processing unit 152 to acquire the feature amount data corresponding to the extracted items (St14).
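Step St14 amounts to applying a binary mask to the feature array. A minimal sketch, assuming the features are held as a NumPy matrix with one row per sample and one column per feature amount (the function name and data layout are hypothetical):

```python
import numpy as np

def apply_reduction(features, reduction_info):
    """Keep only the feature columns whose reduction parameter is 1."""
    mask = np.asarray(reduction_info, dtype=bool)
    if mask.shape[0] != features.shape[1]:
        raise ValueError("reduction_info must have one parameter per feature")
    return features[:, mask]

# Example: 3 samples with J = 4 features; features 0 and 3 are kept.
features = np.arange(12).reshape(3, 4)
reduced = apply_reduction(features, [1, 0, 0, 1])
```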

Using the reduced feature amount data acquired in step St14, the information processing device 140 causes the learning processing unit 152 to execute the learning process of the classifier (that is, the identification algorithm) (St15). As a result of the learning process in step St15, the information processing device 140 generates or updates a learned model, which is the classifier (that is, the identification algorithm), and stores it in the storage unit 143 (St16).

Next, the operation procedure of the sound data analysis process in the sound data learning system 200 according to the first embodiment will be described with reference to FIGS. 6 and 7. FIG. 6 is a flowchart illustrating an example of the operation procedure of the sound data analysis process using a learned model. FIG. 7 is a graph that associates input sound data with an example of the analysis result for that sound data. The process shown in FIG. 6 is executed mainly by the detection processing unit 153 of the information processing device 140.

In FIG. 6, the information processing device 140 causes the detection processing unit 153 to read the learned model stored in the storage unit 143 (St21). The information processing device 140 acquires, at the processing unit 142 and via the audio interface 120, detection sound data collected by the microphone 110 (St22).

Using the learned model read in step St21, the information processing device 140 causes the detection processing unit 153 to execute the analysis process (identification process) on the detection sound data acquired in step St22 (St23). Through this analysis process (identification process), the detection processing unit 153 determines, for example, the emotion of the person who uttered the voice corresponding to the detection sound data.

The information processing device 140 outputs the identification result (see FIG. 7) of the analysis process (identification process) in step St23 to the display unit 145 (St24). The horizontal axis in FIG. 7 is time. The vertical axis of the upper graph in FIG. 7 indicates the magnitude of the sound data (for example, sound pressure), and the vertical axis of the lower graph indicates the normalized identification result. The upper graph in FIG. 7 is the voice waveform of the input detection sound data; in this example, the subject was asked to speak while feeling "joy". The lower graph in FIG. 7 shows the result of the analysis process (identification process) by the detection processing unit 153 as a line graph of the degree of three emotions: "joy" (hap: abbreviation of happy), "normal" (nor: abbreviation of normal), and "anger" (ang: abbreviation of anger).

For example, as shown in FIG. 7, a false detection of "anger" occurs momentarily at a break between utterances in the latter half, but the identification result for the utterance as a whole is mostly "joy", which shows that a highly accurate analysis result is obtained. The falsely detected portion is one where the utterance was interrupted by a breath or the like, and the false detection can be attributed to that portion containing fewer features than sections that are clearly spoken.

As described above, the sound data learning system 200 according to the first embodiment has the microphone 110, which collects a learning sound signal, and an extraction unit that extracts J (J: a positive integer) feature amounts of the learning sound signal. The sound data learning system 200 has a generation unit that generates N (N: an integer of 2 or more) feature amount reduction blocks, each composed of J parameters that define whether each of the J feature amounts is used (for example, individuals having genes, such as parameters of "0" or "1", in each generation of a genetic algorithm). The sound data learning system 200 has a feature amount reduction unit that executes a feature amount reduction process. The feature amount reduction process repeats a preparation process of preparing k (k: a specified value of 2 or more) kinds of different data sets from the data of the learning sound signal; a learning process that, for each of the N feature amount reduction blocks (for example, individuals) and for each of the different data sets, uses the corresponding feature amount reduction block and some of the data sets of the learning sound signal data; and an evaluation process that uses the learning result of the learning process to evaluate the use of each of the J feature amounts on the remaining data set of the learning sound signal data. The sound data learning system 200 has a learning control unit that, based on the feature amount reduction processing results for each of the N feature amount reduction blocks, each of which has k kinds of evaluation results, selects one of the feature amount reduction blocks as reduction information for the J feature amounts.
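One way to read the learning and evaluation steps summarized above is as k-fold cross-validation of each feature amount reduction block: train on some of the k data sets, evaluate on the remaining one, and collect k evaluation results per block. The sketch below follows that reading. The nearest-centroid classifier, the accuracy metric, and all function names are assumptions chosen to keep the example self-contained; the publication does not specify them here.

```python
import numpy as np

def fit_centroids(X, y):
    """Train a nearest-centroid classifier (stand-in for the real classifier)."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_centroids(model, X):
    classes, centroids = model
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

def evaluate_block(X, y, block, k=5, seed=0):
    """Return k accuracy scores for one feature reduction block (0/1 mask)."""
    Xr = X[:, np.asarray(block, dtype=bool)]      # drop unused features
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)                # k data sets
    scores = []
    for i in range(k):
        test = folds[i]                           # remaining data set
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit_centroids(Xr[train], y[train])
        pred = predict_centroids(model, Xr[test])
        scores.append(float((pred == y[test]).mean()))
    return scores
```

A learning control step could then rank the N blocks by, for example, the mean of their k scores and keep the best one as the reduction information; that ranking rule is likewise an assumption.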

As a result, the information processing device 140 (one aspect of the sound data learning device) of the sound data learning system 200 can reduce the processing load of the learning process, because it can reduce the feature amounts of the learning sound data when executing the learning process of the classifier on that data. In addition, for a specific use such as analyzing human emotions, the information processing device 140 can use only the reduced feature amounts effectively and efficiently, without using all of the high-dimensional (for example, 988-dimensional) feature amounts of the related art, and can therefore generate a highly accurate learned model. Accordingly, the sound data learning system 200 or the information processing device 140 can help achieve both a reduced processing load when executing the learning process and efficient, highly accurate generation of a learned model suited to the use.

The learning processing unit 152 of the information processing device 140 stores the selected feature amount reduction block in the storage unit 143 (one aspect of a memory) as the reduction information. Because the sound data learning system 200 or the information processing device 140 can thus retain the feature amount reduction block selected by the feature amount reduction process, it can carefully select the necessary feature amounts of the learning sound data in the learning process for generating a learned model, and can therefore generate the learned model efficiently.

The learning processing unit 152 of the information processing device 140 generates the N feature amount reduction blocks so that 1 ≦ u < J is satisfied, where u (u: a positive integer) is the number of parameters indicating non-use of a feature amount (one aspect of the first parameter). As a result, the information processing device 140 does not need to use all of the high-dimensional (for example, 988-dimensional) feature amounts of the related art as the feature amounts of the learning sound data in the learning process for generating a learned model, and can therefore reduce the processing load of that learning process.
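The constraint 1 ≦ u < J means every block switches off at least one feature amount and keeps at least one. A minimal sketch of generating N such blocks at random follows; drawing u uniformly and the function name are assumptions for illustration only.

```python
import numpy as np

def generate_blocks(n_blocks, n_features, rng):
    """Create n_blocks random 0/1 masks, each with u zeros where 1 <= u < J."""
    blocks = np.ones((n_blocks, n_features), dtype=np.int8)
    for block in blocks:                       # each row is a view into blocks
        u = rng.integers(1, n_features)        # upper bound is exclusive
        unused = rng.choice(n_features, size=u, replace=False)
        block[unused] = 0                      # 0 marks a feature as not used
    return blocks

blocks = generate_blocks(8, 988, np.random.default_rng(0))
```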

The learning processing unit 152 of the information processing device 140 generates the N feature amount reduction blocks corresponding to the next generation of the genetic algorithm used for the feature amount reduction process, according to a probability distribution based on the feature amount reduction processing results for each of the N feature amount reduction blocks corresponding to the current generation. The information processing device 140 can thereby use evolutionary programming to generate feature amount reduction blocks (that is, individuals) in which genes that evolve with each generation (that is, parameters of either "0" or "1") are arranged, and can therefore select individuals with superior genes (that is, individuals that allow the feature amounts used in the learning process to be reduced further).
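Producing the next generation from a probability distribution over the current generation's results can be sketched with fitness-proportional (roulette-wheel) parent selection. The choice of roulette-wheel selection, uniform crossover, and bit-flip mutation, as well as the mutation rate, are assumptions here; the text only states that a probability distribution based on the results is used.

```python
import numpy as np

def next_generation(blocks, fitness, rng, mutation_rate=0.01):
    """Build N child masks from N parent masks using fitness-based sampling."""
    blocks = np.asarray(blocks)
    fitness = np.asarray(fitness, dtype=float)
    n, j = blocks.shape
    probs = fitness / fitness.sum()            # assumes fitness >= 0, sum > 0
    children = np.empty_like(blocks)
    for i in range(n):
        pa, pb = rng.choice(n, size=2, p=probs)        # pick two parents
        take_a = rng.random(j) < 0.5                   # uniform crossover
        child = np.where(take_a, blocks[pa], blocks[pb])
        flip = rng.random(j) < mutation_rate           # bit-flip mutation
        children[i] = np.where(flip, 1 - child, child)
    return children
```

A fuller version would also repair any child that violates 1 ≦ u < J; that step is omitted to keep the sketch short.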

The information processing device 140 further has a learning unit that executes a reduction process on the J feature amounts based on the reduction information, and generates a learned model by executing the learning process using the feature amounts that remain after the reduction process from the data of the learning sound signal. The information processing device 140 can thereby efficiently generate a highly accurate learned model suited to analyzing human emotions.

The information processing device 140 also applies a predetermined data extension process to the data of the learning sound signal. Compared with the case where no data extension process is executed, the information processing device 140 can thereby generate, relatively efficiently, a highly accurate learned model suited to analyzing human emotions.

The information processing device 140 further has a detection unit that uses the generated learned model to detect information about human emotions from the data of a sound signal collected by the microphone 110. Using the learned model generated from the learning sound data, the information processing device 140 can thereby analyze with high accuracy the human emotions contained in detection sound data collected in real time by the microphone 110.

The information processing device 140 may further have a detection unit that uses the generated learned model to detect information about the type of a mechanical sound from the data of a sound signal collected by the microphone 110. Using the learned model generated from the learning sound data, the information processing device 140 can thereby analyze with high accuracy the type of mechanical sound (for example, sound generated by a machine installed in a factory or the like) contained in detection sound data collected in real time by the microphone 110.

(Modification of First Embodiment)
FIG. 8 is a block diagram showing a configuration example of a sound data learning system 200A according to a modification of the first embodiment. The modification of the first embodiment describes an example in which, instead of the information processing device 140 according to the first embodiment, a server device 340 that is communicably connected to an information processing device 140A via a network or a communication line executes the learning process and evaluation process using learning sound data, the generation process of a learned model, and the detection process on detection sound data.

The sound data learning system 200A includes the microphone 110, the audio interface 120, the information processing device 140A, and the server device 340. The information processing device 140A has a communication unit 141, a processing unit 142A, a storage unit 143, an operation input unit 144, a display unit 145, and a communication unit 146, and the processing unit 142A has the function of the control unit 151. The communication unit 146 has a wired or wireless communication interface and communicates with the external server device 340. The information processing device 140A is connected to the server device 340 via a communication path 300 such as a wired or wireless network or a communication line. The rest of the configuration is the same as that of the sound data learning system 200 shown in FIG. 1, and only the differences are described here.

The server device 340 is an information processing device (computer) having a processor and a memory, and executes various processes related to the learning process and evaluation process using learning sound data, the generation process of a learned model, and the detection process on detection sound data. The server device 340 has a communication unit 341, a processing unit 342, and a storage unit 243. The communication unit 341 transmits and receives various data, such as sound data and learning data, to and from the information processing device 140A. The processing unit 342 has a processor such as a CPU (Central Processing Unit) or a DSP (Digital Signal Processor). The processing unit 342 executes processing according to a predetermined program and implements functions such as the learning process and evaluation process using learning sound data, the generation process of a learned model, and the detection process on detection sound data described above. As its functional configuration, the processing unit 342 has a control unit 351 that executes various kinds of control, a learning processing unit 352 that executes the learning process, a detection processing unit 353 that executes the detection process, and a determination processing unit 354 that executes the determination process. The learning processing unit 352, the detection processing unit 353, and the determination processing unit 354 perform the same processing as the learning processing unit 152, the detection processing unit 153, and the determination processing unit 154 of the processing unit 142 of the information processing device 140 in the configuration example of FIG. 1 described above. Part of the processing of the learning processing unit 352, the detection processing unit 353, and the determination processing unit 354 may be executed by the server device 340, and the rest by the processing unit 142A of the information processing device 140A.

In the modification of the first embodiment, the processing according to the first embodiment is distributed across a plurality of information processing devices connected via a network, a communication line, or the like. In particular, executing the learning process and evaluation process using learning sound data and the generation process of a learned model on an information processing device with high processing capability, such as the server device 340, makes it easier to handle complex algorithmic computation and high-speed processing. The processing of the learning processing unit, the detection processing unit, and the determination processing unit may be assigned, process by process as appropriate, to a local information processing device connected to the audio interface or to a remote information processing device connected via a communication path. For example, each process according to the first embodiment can be executed on a suitable information processing device according to conditions such as the system configuration, usage environment, data processing algorithm, data volume, data characteristics, and output form.

Various embodiments have been described above with reference to the drawings, but it goes without saying that the present disclosure is not limited to these examples. It is clear that those skilled in the art can conceive of various changes, modifications, substitutions, additions, deletions, and equivalents within the scope of the claims, and it is understood that these also naturally belong to the technical scope of the present disclosure. The components of the embodiments described above may also be combined in any way without departing from the spirit of the invention.

In the first embodiment and its modification described above, the learned model was described as a learning model for analyzing human emotions contained in the learning sound data or the detection sound data, but its use is not limited to analyzing human emotions. For example, the learned model may be a learning model used to analyze the type of a mechanical sound generated by a machine installed in a factory or the like, or of a tapping sound produced when a person strikes something with a hammer or the like, contained in the learning sound data or the detection sound data.

For example, a power spectrum can be used as the feature amount for analyzing mechanical sounds or tapping sounds. When the learning sound data or detection sound data input from the audio interface 120 is sampled at 48000 Hz, for instance, the information processing device 140 can obtain a 512-dimensional feature amount per 0.021 seconds (1024/48000). The sound data learning systems 200 and 200A according to the first embodiment or its modification can likewise use an evolutionary algorithm (for example, a genetic algorithm) to reduce these 512 dimensions to the feature amounts that are efficient for analyzing the operating sound of a machine.
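The arithmetic above (1024 samples at 48000 Hz is about 0.021 seconds, yielding 512 dimensions) can be reproduced with a framed FFT. This is a sketch under assumptions: the Hann window, non-overlapping frames, and dropping the Nyquist bin to get exactly 512 values are illustrative choices not stated in the publication.

```python
import numpy as np

SAMPLE_RATE = 48000   # Hz, as in the text
FRAME_LEN = 1024      # samples per frame, about 0.021 s

def power_spectrum_frames(signal):
    """Return one 512-dimensional power spectrum per 1024-sample frame."""
    n_frames = len(signal) // FRAME_LEN
    frames = np.reshape(signal[: n_frames * FRAME_LEN], (n_frames, FRAME_LEN))
    windowed = frames * np.hanning(FRAME_LEN)      # assumed window choice
    spectrum = np.fft.rfft(windowed, axis=1)       # 513 bins, DC to Nyquist
    return np.abs(spectrum[:, :512]) ** 2          # keep 512 bins (assumption)

# Example: one second of a 3000 Hz tone.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
features = power_spectrum_frames(np.sin(2 * np.pi * 3000 * t))
```

Each bin spans 48000/1024 = 46.875 Hz, so the 3000 Hz tone in the example falls in bin 64.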

The present disclosure is useful as a sound data learning system, a sound data learning method, and a sound data learning device that help achieve both a reduced processing load when executing a learning process and efficient, highly accurate generation of a learned model suited to the use.

110 Microphone
120 Audio interface
121 Input unit
122 AD converter
123 Buffer
124, 141 Communication unit
140 Information processing device
142 Processing unit
143 Storage unit
144 Operation input unit
145 Display unit
151 Control unit
152 Learning processing unit
153 Detection processing unit
154 Determination processing unit
200, 200A Sound data learning system

Claims

A microphone that picks up the learning sound signal,
An extracting unit for extracting J (J: a positive integer) feature amounts of the learning sound signal;
A generation unit configured to generate N (N: an integer equal to or greater than 2) feature amount reduction blocks each including J parameters that define whether or not each of the J feature amounts is used;
A feature amount reduction unit that executes a feature amount reduction process of repeating a preparation process of preparing data sets of k (k: a specified value of 2 or more) kinds of different patterns using the data of the learning sound signal, a learning process that, for each of the N feature amount reduction blocks and for each of the data sets of the different patterns, uses the corresponding feature amount reduction block and some of the data sets of the data of the learning sound signal, and an evaluation process that uses the learning result of the learning process to evaluate the use of each of the J feature amounts on the remaining data set of the data of the learning sound signal;
A learning control unit that selects one of the feature amount reduction blocks as reduction information for the J feature amounts, based on the feature amount reduction processing results for each of the N feature amount reduction blocks, each of which has k kinds of evaluation results;
Sound data learning system.

The learning control unit stores the selected feature amount reduction block in a memory as the reduction information.
The sound data learning system according to claim 1.

When the number of first parameters indicating non-use of a feature amount is u (u: a positive integer), the generation unit generates the N feature amount reduction blocks so as to satisfy 1 ≦ u < J,
The sound data learning system according to claim 1.

The generation unit generates N feature amount reduction blocks corresponding to the next generation of a genetic algorithm used for the feature amount reduction process, according to a probability distribution based on the feature amount reduction processing results for each of the N feature amount reduction blocks corresponding to the current generation of the genetic algorithm,
The sound data learning system according to claim 1.

Further comprising a learning unit that executes a reduction process on the J feature amounts based on the reduction information, and generates a learned model by executing the learning process using the feature amounts that remain after the reduction process from the data of the learning sound signal,
The sound data learning system according to claim 1.

The learning unit performs a predetermined data extension process on the data of the learning sound signal,
The sound data learning system according to claim 5.

Using the generated learned model, further comprising a detection unit that detects information related to human emotions from data of a sound signal collected by the microphone,
The sound data learning system according to claim 5.

Using the generated learned model, further comprising a detection unit that detects information about the type of mechanical sound from the data of the sound signal collected by the microphone,
The sound data learning system according to claim 5.

A sound data learning method in a sound data learning system having a microphone,
Collecting the learning sound signal by the microphone;
Extracting J (J: positive integer) feature amounts of the learning sound signal;
Generating N (N: an integer equal to or greater than 2) feature amount reduction blocks, each of which includes J parameters defining whether or not each of the J feature amounts is used;
Executing a feature amount reduction process of repeating a preparation process of preparing data sets of k (k: a specified value of 2 or more) kinds of different patterns using the data of the learning sound signal, a learning process that, for each of the N feature amount reduction blocks and for each of the data sets of the different patterns, uses the corresponding feature amount reduction block and some of the data sets of the data of the learning sound signal, and an evaluation process that uses the learning result of the learning process to evaluate the use of each of the J feature amounts on the remaining data set of the data of the learning sound signal; and
Selecting one of the feature amount reduction blocks as reduction information for the J feature amounts, based on the feature amount reduction processing results for each of the N feature amount reduction blocks, each of which has k kinds of evaluation results,
Sound data learning method.

Connected to a microphone that picks up the learning sound signal,
An extracting unit for extracting J (J: a positive integer) feature amounts of the learning sound signal;
A generation unit configured to generate N (N: an integer equal to or greater than 2) feature amount reduction blocks each including J parameters that define whether or not each of the J feature amounts is used;
A feature amount reduction unit that executes a feature amount reduction process of repeating a preparation process of preparing data sets of k (k: a specified value of 2 or more) kinds of different patterns using the data of the learning sound signal, a learning process that, for each of the N feature amount reduction blocks and for each of the data sets of the different patterns, uses the corresponding feature amount reduction block and some of the data sets of the data of the learning sound signal, and an evaluation process that uses the learning result of the learning process to evaluate the use of each of the J feature amounts on the remaining data set of the data of the learning sound signal;
A learning control unit that selects one of the feature amount reduction blocks as reduction information for the J feature amounts, based on the feature amount reduction processing results for each of the N feature amount reduction blocks, each of which has k kinds of evaluation results;
Sound data learning device.