JP2020064253A

JP2020064253A - Learning device, detection device, learning method, learning program, detection method, and detection program

Info

Publication number: JP2020064253A
Application number: JP2018197718A
Authority: JP
Inventors: 祐介木田; Yusuke Kida; 高史前角; Takashi Maesumi
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2018-10-19
Filing date: 2018-10-19
Publication date: 2020-04-23
Anticipated expiration: 2038-10-19
Also published as: JP7212718B2; JP2021121875A; JP6892426B2

Abstract

To improve the detection accuracy of speech sections. SOLUTION: A learning device according to this invention comprises: an acquisition unit for acquiring speech information including target speech to be detected; and a learning unit for having a model learn the terminal end of the target speech and the time elapsed from the start of the target speech. In addition, a detection device according to this invention comprises: an acquisition unit for acquiring the speech information; and a detection unit for detecting the start of the target speech from the speech information acquired by the acquisition unit by using a model that has learned the terminal end of the target speech and the time elapsed from the start of the target speech. SELECTED DRAWING: Figure 1

Description

The present invention relates to a learning device, a detection device, a learning method, a learning program, a detection method, and a detection program.

In recent years, techniques using automatic speech recognition (ASR) have become known. As one example, a technique is known in which a user's utterance is converted into text data and various kinds of information processing are executed using the converted text data. In addition, to improve recognition accuracy, voice activity detection (VAD) techniques are known that detect, from an input acoustic signal, a voice section containing the user's utterance.

Japanese Unexamined Patent Application Publication No. 2008-139654 (JP 2008-139654 A)

Such voice section detection techniques could be used to detect a voice section containing a predetermined voice. For example, a predetermined voice could be extracted from an acoustic signal by using a model, such as a DNN (Deep Neural Network), that has learned whether or not the frame being processed is a voice section containing voice.

However, such a technique leaves room for improvement in the detection accuracy of the voice section.

For example, when attempting to extract a voice section containing a keyword composed of a plurality of words, or a keyword that contains an unvoiced section partway through, the technique described above may detect a section containing only part of the keyword as the voice section.

The present application has been made in view of the above, and an object thereof is to improve the detection accuracy of the voice section.

A learning device according to the present application is characterized by including an acquisition unit that acquires voice information including a target voice to be detected, and a learning unit that causes a model to learn the end of the target voice and the period elapsed from the start of the target voice.

According to one aspect of the embodiment, the detection accuracy of the voice section can be improved.

FIG. 1 is a diagram illustrating an example of processing executed by the information providing device and the terminal device according to the embodiment.
FIG. 2 is a diagram illustrating a configuration example of the information providing device according to the embodiment.
FIG. 3 is a diagram illustrating an example of information registered in the learning data database according to the embodiment.
FIG. 4 is a diagram illustrating a configuration example of the terminal device according to the embodiment.
FIG. 5 is a diagram illustrating an example of information output by the model according to the embodiment.
FIG. 6 is a flowchart illustrating an example of the flow of the learning process executed by the information providing device according to the embodiment.
FIG. 7 is a flowchart illustrating an example of the flow of the detection process executed by the terminal device according to the embodiment.
FIG. 8 is a diagram illustrating an example of the hardware configuration.

Hereinafter, modes for carrying out the learning device, detection device, learning method, learning program, detection method, and detection program according to the present application (hereinafter referred to as "embodiments") will be described in detail with reference to the drawings. Note that the learning device, detection device, learning method, learning program, detection method, and detection program according to the present application are not limited by these embodiments. The embodiments can be combined as appropriate as long as the processing contents do not contradict each other. In each of the following embodiments, the same parts are denoted by the same reference numerals, and duplicate description is omitted.

[1. Information providing device and terminal device]
First, an example of the learning process executed by the information providing device 10, which is an example of a learning device, and an example of the detection process executed by the terminal device 100, which is an example of a detection device, will be described with reference to FIG. 1. FIG. 1 is a diagram illustrating an example of processing executed by the information providing device and the terminal device according to the embodiment. FIG. 1 shows an example of a learning process in which the information providing device 10 trains a model used to extract a predetermined target voice to be detected from voice information including a user's utterance. FIG. 1 also shows an example of a detection process in which the terminal device 100 detects a keyword section containing a predetermined keyword from voice information including a user's utterance.

The information providing device 10 illustrated in FIG. 1 is an information processing device that performs the learning process, and is realized by, for example, a server device or a cloud system. For example, the information providing device 10 uses learning data provided by the data server DS to train a model used to extract, from voice data, a keyword section containing a predetermined keyword.

The data server DS is an information processing device that manages various kinds of data, and is realized by, for example, a server device or a cloud system. For example, the data server DS manages the learning data that the information providing device 10 uses in the learning process. The details of the learning data managed by the data server DS will be described later.

The terminal device 100 is an input/output device that has an acquisition device, such as a microphone, that acquires ambient sound and an output device, such as a speaker, that can output arbitrary sound; it is, for example, a device called a smart speaker. For example, the terminal device 100 can use the output device to output music and to provide information by voice. The terminal device 100 also has a reception function for accepting sound input and an output function that, when a voice uttered by the user is acquired, outputs a sound corresponding to the content of the acquired voice.

For example, when the user utters the title of a given song, the terminal device 100 identifies the title indicated by the voice using various voice analysis techniques, and acquires the data of the song with the identified title from a predetermined external server OS (see, for example, FIG. 2) via the network N (see, for example, FIG. 2). The terminal device 100 then plays back the acquired song.

The terminal device 100 also has, for example, a function of identifying the content of a voice uttered by the user U using various voice analysis techniques and outputting a response corresponding to the identified content. For example, when the terminal device 100 acquires a voice from the user U such as "What is the weather today?", it acquires various kinds of weather information, such as the weather and temperature, from the external server OS and reads the acquired weather information aloud, thereby providing the weather information to the user U. In addition to the processing described above, the terminal device 100 is a smart speaker that can perform various other kinds of processing, such as ordering products listed in an online shopping mall, controlling home appliances such as air conditioners and lighting devices, and reading out emails and schedules.

The terminal device 100 may perform voice analysis in cooperation with the external server OS. For example, the terminal device 100 acquires ambient sound using a microphone or the like and, when the acquired sound satisfies a predetermined condition, transmits the acquired sound to the external server OS. In such a case, the external server OS identifies the content of the acquired voice using various voice analysis techniques and transmits the identification result to the terminal device 100. The terminal device 100 may then execute various kinds of processing corresponding to the identification result. That is, the terminal device 100 may be a stand-alone smart speaker or a smart speaker that cooperates with an external server such as a cloud service.

Here, the terminal device 100 may have a plurality of acquisition devices (for example, microphones) mounted at different positions, and may execute the various kinds of processing described above using the voice received through each acquisition device. The terminal device 100 may be any device, such as a smart device or a recording device, as long as it has a plurality of acquisition devices mounted at different positions. The terminal device 100 may also be a device that is connected, via wireless communication such as a wireless LAN (Local Area Network) or Bluetooth (registered trademark), to a plurality of acquisition devices installed at physically separate positions, and that collects the voice acquired by each acquisition device.

[1-1. Keyword detection]
Here, when operating a smart speaker or the like, the user utters a predetermined keyword and then makes an utterance indicating the processing to be executed (hereinafter referred to as a "processing utterance"). In such a case, the terminal device 100 determines whether or not the acquired voice contains the predetermined keyword. When it determines that the predetermined keyword is contained, the terminal device 100 identifies the content of the user's utterance by voice analysis from the section of the voice data containing the processing utterance that the user uttered following the keyword.

Such a keyword may be used not only as a wake-up voice for processing but also to clarify the processing utterance that follows. For example, a keyword section containing the keyword can be extracted from the voice data, features can be extracted from the voice contained in the extracted keyword section, and, based on the extracted features, the user's utterance can be emphasized in the subsequent voice, thereby reducing the influence of noise such as music or television sound. A beamforming technique is also conceivable in which keyword sections are extracted from each of a plurality of voice data streams acquired with a plurality of microphones, the direction in which the user is located is estimated based on the time differences at which the extracted keyword sections were measured, and the voice from the estimated direction is emphasized to reduce the influence of noise. For this reason, if the keyword section can be detected appropriately, not only can the presence or absence of the wake-up voice be determined appropriately, but the recognition accuracy of the processing utterance can also be improved.

One conceivable approach is to have a model that operates as a classifier, such as an SVM (Support Vector Machine) or a DNN (Deep Neural Network), learn the features of the keyword, and to use the trained model to detect the keyword in collected voice. However, if the model simply learns the features of the keyword's voice, it is not clear from which point within the keyword the model judges the input to be the keyword, and it therefore becomes difficult to estimate where in the voice data the keyword section containing the keyword begins and ends.

[1-2. Learning process]
The information providing device 10 therefore executes the following learning process. First, the information providing device 10 acquires voice information including a target voice to be detected. For example, the information providing device 10 acquires voice information that includes, as the target voice, a voice such as a keyword for causing a given terminal device 100 to execute a predetermined operation. The information providing device 10 then causes the model to learn at least the end of the target voice and the period elapsed from the start of the target voice. More specifically, the information providing device 10 causes the model to learn the features between the start and the end of the target voice, that is, the features of the target voice, together with the period from the start of the voice to each section of the target voice. For example, the information providing device 10 divides the voice information into a plurality of sections and, for each section, causes the model to learn whether or not the section contains the end of the target voice and the period from the start of the target voice to the section being processed.

In other words, the information providing device 10 trains a model that outputs an indication that the keyword has been detected near the end of the keyword. For example, the information providing device 10 divides the voice of the voice data into a plurality of frames and inputs the voice information contained in each frame into the model in chronological order. The information providing device 10 then trains the model so that, when the voice information contained in the frame that includes the end of the keyword, or in a frame near the end, is input, the model outputs information indicating that the end of the keyword has been detected. Here, so that the model calculates the posterior probability that newly input voice is the end of the keyword while taking into account the features of previously input voice, the information providing device 10 trains a model having a recurrent neural network configuration, such as an RNN (Recurrent Neural Network) or an LSTM (Long Short-Term Memory).
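The framing step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_into_frames`, the use of non-overlapping frames, and the dropping of a trailing partial frame are all assumptions made for the example. The 20 ms default follows the frame length that the description uses as an example.

```python
from typing import List, Sequence


def split_into_frames(
    samples: Sequence[float], sample_rate: int, frame_ms: int = 20
) -> List[List[float]]:
    """Split a mono signal into consecutive, non-overlapping frames.

    Assumption: frames do not overlap and a trailing partial frame is
    dropped. The source text does not specify either choice.
    """
    frame_len = sample_rate * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    # Frames are returned in chronological order, ready to be fed to a
    # recurrent model one at a time.
    return [
        list(samples[i * frame_len:(i + 1) * frame_len])
        for i in range(n_frames)
    ]
```

For instance, one second of audio sampled at 16 kHz (a hypothetical rate) yields 50 frames of 320 samples each.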

In addition to this learning, the information providing device 10 adds a task of classifying each input frame into a class according to its distance from the start of the keyword. That is, the information providing device 10 causes the model to learn how much time has elapsed since the start of the keyword at each frame, in other words, how long after the start of the keyword the voice contained in each frame is observed.

For example, when one frame is 20 milliseconds long and the keyword is uttered over roughly 100 frames, the information providing device 10 assigns a different class to every 10 frames from the start of the keyword. The information providing device 10 then causes the model to learn whether or not each frame contains the end of the keyword and to which class each input frame is assigned. That is, the information providing device 10 has the model perform multitask learning.
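The labeling scheme in this example can be sketched as below. This is a hedged illustration under stated assumptions: the names `make_labels` and `multitask_loss` are hypothetical, class 0 is reserved here for frames outside the keyword, and the loss is an unweighted sum of a binary cross-entropy term and a categorical cross-entropy term. The source text does not specify a loss function or a class numbering.

```python
import math
from typing import List, Sequence, Tuple


def make_labels(
    num_frames: int, keyword_start: int, keyword_end: int,
    frames_per_class: int = 10,
) -> Tuple[List[int], List[int]]:
    """Build per-frame end labels and elapsed-time class labels.

    Assumption: class 0 means "outside the keyword"; class k >= 1 covers
    the k-th block of `frames_per_class` frames counted from the start.
    """
    if not 0 <= keyword_start <= keyword_end < num_frames:
        raise ValueError("keyword span must lie inside the utterance")
    end_labels = [0] * num_frames
    class_labels = [0] * num_frames
    for t in range(keyword_start, keyword_end + 1):
        class_labels[t] = (t - keyword_start) // frames_per_class + 1
    end_labels[keyword_end] = 1  # only the end frame is a positive
    return end_labels, class_labels


def multitask_loss(
    end_prob: float, class_probs: Sequence[float],
    end_label: int, class_label: int, eps: float = 1e-12,
) -> float:
    """Sum of the end-detection loss and the elapsed-class loss for one frame."""
    end_loss = -(
        end_label * math.log(end_prob + eps)
        + (1 - end_label) * math.log(1.0 - end_prob + eps)
    )
    class_loss = -math.log(class_probs[class_label] + eps)
    return end_loss + class_loss  # equal weighting is an assumption
```

With a keyword spanning frames 10 to 109 of a 120-frame utterance, the 100 keyword frames fall into ten classes of 10 frames each, and only frame 109 carries the end label.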

Through the learning process described above, the information providing device 10 can train a model that appropriately detects the end of the keyword and can also estimate the period from the start of the detected keyword to the detected end. For example, through the learning process described above, the information providing device 10 realizes a model that detects the end of the keyword based on the features of the entire keyword (for example, from its start to its end). That is, the information providing device 10 trains the model to detect the section containing the end of the keyword based on the order in which the features of each section of the keyword appear. As a result, the information providing device 10 can improve the accuracy with which the end of the keyword is detected.

For example, rather than using only the voice contained in frames near the end of the keyword as learning data, the information providing device 10 inputs the voice of every frame of the entire keyword into the model in chronological order, so that the model learns both the features of the voice in each frame of the keyword and the order in which those voices appear. When such learning is performed, the model determines that the keyword has been detected when a voice is input whose features in each frame from beginning to end, and whose time-series order of appearance of those features, are similar to those of the keyword. As a result, the information providing device 10 can appropriately detect a keyword containing a plurality of words or silent sections.

For example, when the keyword is "Hey_Yahoo", which contains the word "Hey" and the word "Yahoo", the information providing device 10 causes the model to learn the features of the whole series of sounds making up "Hey_Yahoo" as the keyword voice. More specifically, the information providing device 10 inputs each frame of the voice "Hey_Yahoo" into the model in order of appearance and trains the model so that it outputs an indication that the keyword has been detected when the last frame, that is, the end frame, is input. For example, the information providing device 10 inputs each frame of the voice "Hey_Yahoo" into the model in order of appearance and trains the model so that it outputs "0" each time the voice of a frame other than the end frame is input and outputs "1" when the end frame is input.

When such learning is performed, the model does not output an indication that the keyword has been detected (that is, "1") merely because the word "Yahoo" is input; it does so when each frame of the voice "Hey_Yahoo" is input in order of appearance. Likewise, such a model does not output an indication that the keyword has been detected when a voice that resembles only part of the keyword, such as "Oi_Yahoo" or "Hey_Yasuko", is input, or when a voice in which the sounds appear in a different order from the keyword, such as "Yahoo_Hey", is input. It detects the end of the keyword only when frames of a voice similar to the entire keyword are input in the same order as in the keyword.

In contrast, if the model were made to learn only the features of the voice contained in the end frame of the keyword, it might erroneously determine that the keyword has been detected merely because a voice such as "Yahoo" or "hoo" was input. The information providing device 10 therefore trains the model to detect the end of the keyword from the features of the entire keyword, which makes it possible to obtain a model that can appropriately detect the end of a keyword containing a plurality of words or silent sections.

The information providing device 10 also causes the model that detects the end of the keyword to learn the characteristics of the period from the start of the keyword to the detected end. When the frames of actually measured voice data are input in chronological order into a model trained in this way (hereinafter referred to as the "learning model"), the learning model outputs whether or not the input frame contains the end of the keyword (or whether or not it is near the end of the keyword), and also outputs information indicating the class of the input frame, that is, period information indicating how much time has elapsed from the start of the keyword to the input frame.

Here, the start of the keyword is estimated to be contained in the frame reached by going back, from the frame that the learning model determined to be the end, by the period corresponding to the class to which that frame belongs, or in the vicinity of that frame. As a result, the information providing device 10 can train a learning model that can extract the keyword section accurately.

A learning model trained by the learning process described above estimates the end of the keyword based on the features of each frame of the keyword input in chronological order. The learning model can therefore estimate the end of the keyword appropriately even when the keyword contains a plurality of words or a silent section.

In the above description, the information providing device 10 causes the model to learn the features of the entire keyword and the period elapsed from the start of the keyword, but the embodiment is not limited to this. For example, the information providing device 10 need only cause the model to learn at least the features near the end of the keyword and the period elapsed from the start of the keyword. When such learning is performed, the model outputs an indication that the end of the keyword has been detected when a voice similar to the end of the keyword is input, and also outputs information indicating the period from the start of the keyword to the detected end. From such output as well, going back from the frame detected as the end by the period estimated by the model makes it possible to detect a section containing the keyword or a voice that partly resembles the keyword. After such a section has been detected, whether or not it actually contains the keyword may be determined using another model or the like.

[1-3. Detection process]
Meanwhile, the terminal device 100 uses the learning model trained by the information providing device 10 to detect the keyword section from the user's utterance. For example, the terminal device 100 acquires voice information including the user's utterance using a microphone or the like. The terminal device 100 then detects the start of the target voice from the acquired voice information by using a model that has learned the end of the target voice to be detected and the period elapsed from the start of the target voice, that is, the learning model trained by the information providing device 10.

For example, the terminal device 100 divides the voice information acquired using a microphone or the like into a plurality of frames and inputs the frames into the learning model in chronological order. When a frame is input into the learning model trained by the learning process described above, the learning model outputs information indicating whether or not the input frame contains the end (for example, a confidence that the end is contained, or a binary value indicating whether or not the end is contained), and also outputs information indicating how long after the start of the keyword the voice contained in the input frame occurs, that is, information indicating a class corresponding to the time elapsed from the start. For example, the learning model outputs the posterior probability that the input frame belongs to each class (that is, the confidence of belonging to each class). In other words, the learning model simultaneously performs a classification of whether or not the frame contains the end (hereinafter sometimes referred to as "end class classification") and a classification according to the time elapsed from the start (hereinafter sometimes referred to as "elapsed class classification").

このような学習モデルを用いて、端末装置１００は、入力されたフレームに終端が含まれているか否かを特定するとともに、入力されたフレームに含まれる音声が始端からどれくらい経過した音声であるかを特定する。例えば、端末装置１００は、学習モデルによる終端クラス分類の結果に基づいて、あるフレームにキーワードの終端が含まれている旨を特定した場合は、そのフレームの経過クラス分類の結果を特定する。そして、端末装置１００は、特定したクラスに応じた期間だけ遡ったフレームにキーワードの始端が含まれていると推定し、キーワードの始端が含まれているフレームから、キーワードの終端が含まれているフレームまでをキーワード区間として抽出する。このような処理の結果、端末装置１００は、キーワード区間を精度良く検出することができる。 Using such a learning model, the terminal device 100 identifies whether or not the input frame includes the end, and identifies how much time has elapsed from the start to the voice included in the input frame. For example, when the terminal device 100 determines, based on the result of the end class classification by the learning model, that a certain frame includes the end of the keyword, it identifies the result of the elapsed class classification for that frame. Then, the terminal device 100 estimates that the start of the keyword is included in the frame that lies earlier by the period corresponding to the identified class, and extracts the span from the frame including the start of the keyword to the frame including the end of the keyword as the keyword section. As a result of such processing, the terminal device 100 can accurately detect the keyword section.

〔１−４．処理の一例〕
続いて、図１を用いて、情報提供装置１０が実行する学習処理の一例、および、端末装置１００が実行する検出処理の一例について説明する。例えば、情報提供装置１０は、データサーバＤＳからモデルの学習に用いる学習データを取得する（ステップＳ１）。そして、情報提供装置１０は、キーワードの終端と始端から各区間までの経過時間とをモデルに学習させる（ステップＳ２）。 [1-4. Example of processing]
Next, an example of the learning process executed by the information providing device 10 and an example of the detection process executed by the terminal device 100 will be described with reference to FIG. 1. For example, the information providing apparatus 10 acquires learning data used for model learning from the data server DS (step S1). Then, the information providing apparatus 10 causes the model to learn the end of the keyword and the elapsed time from the start to each section (step S2).

例えば、情報提供装置１０は、学習データとして、キーワードの発話音声を含む音声データと、音声データの各区間にキーワードの終端が含まれるか否かを示す終端ラベルと、各区間が属するクラスを含むクラスラベルとを含む学習データＬＤ１を取得する。なお、キーワードに複数の単語が含まれる場合や、無音の区間が含まれる場合は、複数の単語を発声した音声、又は、無音の区間を含む音声を対象音声として含む音声データを学習データとして取得することとなる。 For example, the information providing apparatus 10 acquires, as learning data, learning data LD1 including voice data that contains an uttered keyword, an end label indicating whether or not each section of the voice data includes the end of the keyword, and a class label indicating the class to which each section belongs. When the keyword includes a plurality of words or a silent section, voice data whose target voice is the utterance of the plurality of words, or a voice including the silent section, is acquired as the learning data.

例えば、図１に示す例では、学習データＬＤ１は、始端Ｓ１と終端Ｅ１とを有するキーワードを含む音声データを有する。また、学習データＬＤ１において、音声データは、区間「１」〜「２３」に分割されている。また、学習データＬＤ１は、各区間ごとに、キーワードの終端Ｅ１が含まれているか否かを示す終端ラベルが付与されている。例えば、学習データＬＤ１の各区間「１」〜「２３」には、終端Ｅ１が含まれていない旨を示す値「０」、若しくは、終端Ｅ１が含まれている旨を示す値「１」が付与されている。 For example, in the example shown in FIG. 1, the learning data LD1 has voice data including a keyword having a start S1 and an end E1. Further, in the learning data LD1, the voice data is divided into sections “1” to “23”. Each section of the learning data LD1 is given an end label indicating whether or not the section includes the end E1 of the keyword. For example, each of the sections “1” to “23” of the learning data LD1 is given either the value “0”, indicating that the end E1 is not included, or the value “1”, indicating that the end E1 is included.

また、学習データＬＤ１は、各区間ごとに、始端Ｓ１から経過した期間に応じたクラスを示すクラスラベルが付与されている。例えば、図１に示す例では、始端Ｓ１が区間「３」に含まれている。このような場合、学習データＬＤ１の区間「１」、「２」には、クラスラベル「０」が付与されており、区間「３」〜「２１」には、順にクラスラベル「１」〜「１９」が付与されている。 Further, each section of the learning data LD1 is given a class label indicating a class according to the period elapsed from the start S1. For example, in the example shown in FIG. 1, the start S1 is included in the section “3”. In this case, the class label “0” is given to the sections “1” and “2” of the learning data LD1, and the class labels “1” to “19” are given in order to the sections “3” to “21”.

ここで、学習データＬＤ１において、終端が含まれる区間よりも後の区間には、クラスラベル「０」が付与されている。例えば、学習データの区間「２１」には、キーワードの終端が含まれているため、終端ラベル「１」が付与されており、区間「２１」よりも後の区間「２２」、「２３」には、クラスラベル「０」が付与されている。 Here, in the learning data LD1, the class label “0” is given to the sections after the section including the end. For example, since the section “21” of the learning data includes the end of the keyword, it is given the end label “1”, and the sections “22” and “23”, which follow the section “21”, are given the class label “0”.

なお、キーワードが平均して２０区間程度で発話される場合、クラスレベルの最大値を２０としてもよい。また、図１に示す例では、キーワードの終端が含まれる区間よりも後の区間に対し、クラスラベル「０」を付与したが、実施形態は、これに限定されるものではない。例えば、終端が含まれる区間よりも後の区間に対しても、連続する一連のクラスラベルが付与されてもよく、クラスラベルの最大値を超えた区間については、前の区間と同一のクラスラベルが付与されてもよい。例えば、区間「２２」、「２３」には、クラスラベル「２０」、「２１」が付与されてもよく、同一のクラスラベル「２０」が付与されてもよい。 When the keywords are spoken in about 20 sections on average, the maximum class level value may be set to 20. Further, in the example shown in FIG. 1, the class label “0” is given to the section after the section including the end of the keyword, but the embodiment is not limited to this. For example, a continuous series of class labels may be given to a section after the section including the end, and a section exceeding the maximum class label value has the same class label as the previous section. May be given. For example, the class labels “20” and “21” may be given to the sections “22” and “23”, or the same class label “20” may be given.
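The labelling scheme described for FIG. 1 can be sketched as follows. This is only an illustration under stated assumptions: the function name `make_labels` and the parameters `after_end` and `max_label` are hypothetical and do not appear in the source. Sections are numbered from 1, the section containing the start receives class label 1, and sections after the end receive either label 0 or a continued (optionally capped) label.

```python
from typing import List, Optional, Tuple


def make_labels(
    num_sections: int,
    start: int,
    end: int,
    after_end: str = "zero",
    max_label: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Build per-section end labels and class labels (sections are 1-indexed).

    after_end="zero":     sections after the end get class label 0 (FIG. 1).
    after_end="continue": numbering continues past the end, optionally
                          capped at max_label.
    """
    end_labels: List[int] = []
    class_labels: List[int] = []
    for section in range(1, num_sections + 1):
        # End label is 1 only for the section that contains the end.
        end_labels.append(1 if section == end else 0)
        if section < start or (section > end and after_end == "zero"):
            label = 0
        else:
            # The section containing the start itself gets label 1.
            label = section - start + 1
            if max_label is not None:
                label = min(label, max_label)
        class_labels.append(label)
    return end_labels, class_labels


# FIG. 1 example: 23 sections, start in section 3, end in section 21.
end_labels, class_labels = make_labels(23, start=3, end=21)
print(class_labels[2], class_labels[20], class_labels[21:])  # 1 19 [0, 0]
```

Calling `make_labels(23, 3, 21, after_end="continue")` would give sections 22 and 23 the labels 20 and 21, and adding `max_label=20` would give both the label 20, matching the two alternatives mentioned in the text.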

なお、図１に示す学習データＬＤ１は、２３個の区間に分割されているが、実施形態は、これに限定されるものではない。図１に示す学習データＬＤ１は、発明の理解を容易にするために模式的に示したものであり、実際には、より多くの区間に分割されることとなる。具体的な例を挙げると、音声データを処理する際のフレームが２０ミリ秒であり、学習データＬＤ１に含まれる音声データが３秒のデータである場合、音声データは、１５０個のフレームに分割されることとなる。 The learning data LD1 shown in FIG. 1 is divided into 23 sections, but the embodiment is not limited to this. The learning data LD1 shown in FIG. 1 is a schematic illustration intended to make the invention easier to understand; in practice, the data is divided into more sections. As a specific example, when the frame used in processing the audio data is 20 milliseconds long and the audio data included in the learning data LD1 is 3 seconds long, the audio data is divided into 150 frames.
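The frame count in this example (3 s divided into 20 ms frames gives 150 frames) can be checked with a short sketch. The sampling rate of 16 kHz and the use of non-overlapping frames are assumptions made only for illustration; the source states neither. The helper name `split_into_frames` is hypothetical.

```python
from typing import List, Sequence


def split_into_frames(
    samples: Sequence[float], sample_rate_hz: int, frame_ms: int = 20
) -> List[Sequence[float]]:
    """Split audio samples into consecutive, non-overlapping frames."""
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


sample_rate_hz = 16000                 # assumed, not given in the source
audio = [0.0] * (3 * sample_rate_hz)   # 3 seconds of silence as a placeholder
frames = split_into_frames(audio, sample_rate_hz)
print(len(frames))  # 150
```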

図１に示す学習データＬＤ１の各区間は、１つのフレームに対応するものであってもよく、複数のフレームに対応してもよい。また、終端ラベルやクラスラベルは、任意の単位で各区間に付与されていてよい。例えば、終端ラベルは、各フレームごとに付与され、クラスラベルは、複数のフレームごとに付与されるものであってもよい。また、クラスラベルは、キーワード区間と対応する各フレームに対し、フレームごとに異なる値が付与されていてもよい。 Each section of the learning data LD1 shown in FIG. 1 may correspond to one frame or may correspond to a plurality of frames. Further, the end label and the class label may be given to each section in arbitrary units. For example, the end label may be given to each frame and the class label may be given to each of a plurality of frames. Further, as the class label, a different value may be given to each frame corresponding to the keyword section.

まず、情報提供装置１０は、ＬＳＴＭの構造を有するモデルＭを準備する。そして、情報提供装置１０は、学習データＬＤ１に含まれる音声データの各フレームを時系列順にモデルに入力した際に、入力されたフレームに付与された終端ラベルとクラスラベルとを出力するように、モデルＭの学習を行う。なお、このような学習は、例えば、バックプロパゲーションや確率的勾配降下法等、ＬＳＴＭの学習を実現する任意の学習手法が採用可能である。 First, the information providing apparatus 10 prepares a model M having an LSTM structure. Then, the information providing apparatus 10 trains the model M so that, when each frame of the audio data included in the learning data LD1 is input to the model in chronological order, the model outputs the end label and the class label given to the input frame. For such training, any learning method that realizes LSTM learning, such as backpropagation or stochastic gradient descent, can be adopted.

例えば、情報提供装置１０は、区間「３」に含まれるフレームをモデルＭに入力した場合は、モデルＭが終端ラベル「０」とクラスラベル「１」とを出力するように、モデルＭの学習を行う。同様に、情報提供装置１０は、各フレームを時系列順にモデルＭに入力し、各フレームと対応する終端ラベルとクラスラベルとを出力するように、モデルＭの学習を行う。なお、情報提供装置１０は、適切な学習を行うため、例えば、終端ラベルが「０」となるフレーム等、一部の学習データをランダムな順序で入力してもよい。 For example, the information providing apparatus 10 trains the model M so that, when a frame included in the section “3” is input to the model M, the model M outputs the end label “0” and the class label “1”. Similarly, the information providing apparatus 10 inputs the frames into the model M in chronological order and trains the model M so that it outputs the end label and the class label corresponding to each frame. Note that, in order to perform appropriate learning, the information providing apparatus 10 may input some of the learning data, such as frames whose end label is “0”, in a random order.

このように、情報提供装置１０は、所定の区間に含まれる音声を前記モデルに入力した際に、その所定の区間に対象音声の終端が含まれているか否かを示す終端情報と、対象音声の始端からその所定の区間までの期間を示す期間情報とを出力するように、モデルＭの学習を行う。また、情報提供装置１０は、音声情報を複数の区間に分割し、所定の区間に含まれる音声を入力した際に、対象音声の始端から所定の区間までの期間に応じた分類結果を出力するよう、モデルの学習を行う。 As described above, the information providing apparatus 10 trains the model M so that, when the voice included in a predetermined section is input to the model, the model outputs end information indicating whether or not the end of the target voice is included in the predetermined section and period information indicating the period from the start of the target voice to the predetermined section. Further, the information providing apparatus 10 divides the voice information into a plurality of sections and trains the model so that, when the voice included in a predetermined section is input, the model outputs a classification result according to the period from the start of the target voice to the predetermined section.

なお、情報提供装置１０は、学習データＬＤ１のみならず、複数の学習データを用いて、モデルＭの学習を行う。ここで、情報提供装置１０は、モデルＭによる処理精度を向上させるため、様々な利用者により発話されたキーワードを含む学習データを用いてよい。また、情報提供装置１０は、テレビジョンから発せられた音声や他の利用者の発話、ホワイトノイズ等の各種雑音を付加した音声データを含む学習データを用いて、モデルＭの学習を行ってよい。 The information providing device 10 trains the model M using not only the learning data LD1 but also a plurality of other pieces of learning data. Here, in order to improve the processing accuracy of the model M, the information providing apparatus 10 may use learning data including keywords spoken by various users. Further, the information providing apparatus 10 may train the model M using learning data that includes voice data to which various noises have been added, such as sound emitted from a television, utterances of other users, and white noise.

そして、情報提供装置１０は、学習が行われた学習モデルＭを端末装置１００に提供する（ステップＳ３）。このような場合、端末装置１００は、利用者の発話を受付ける（ステップＳ４）。例えば、端末装置１００は、利用者が順に発話したキーワードおよび処理発話の音声をマイクを用いて取得する。そして、端末装置１００は、学習モデルＭを用いて、取得した音声からキーワードの終端を推定し、学習モデルＭにより推定されたキーワードの終端までの経過期間に基づいて、キーワード区間の始端を推定する（ステップＳ５）。 Then, the information providing device 10 provides the trained learning model M to the terminal device 100 (step S3). In this case, the terminal device 100 receives the utterance of the user (step S4). For example, the terminal device 100 uses a microphone to acquire the voice of the keyword and the processing utterance that the user speaks in sequence. Then, the terminal device 100 uses the learning model M to estimate the end of the keyword from the acquired voice, and estimates the start of the keyword section based on the elapsed period up to the end of the keyword estimated by the learning model M (step S5).

例えば、端末装置１００は、利用者から取得した音声（以下、「発話音声」と記載する）を複数の区間に分割し、各区間の音声を時系列順に学習モデルＭに入力する。そして、端末装置１００は、各区間ごとに、学習モデルＭが出力した終端ラベルとクラスラベルとを取得する。そして、端末装置１００は、区間「１９」の音声を入力した際に、学習モデルＭ１が終端ラベル「１」を出力した場合は、キーワード区間の終端が区間「１９」であると推定する。また、端末装置１００は、区間「１９」の音声を入力した際に、学習モデルＭ１がクラスラベル「１５」を出力した場合は、区間「１９」から「１５」クラス分前の区間、すなわち、区間「４」にキーワードの始端が含まれていると推定する。そして、端末装置１００は、区間「４」から区間「１９」までの間がキーワード区間であると推定する。 For example, the terminal device 100 divides the voice acquired from the user (hereinafter referred to as “uttered voice”) into a plurality of sections, and inputs the voice of each section to the learning model M in chronological order. Then, the terminal device 100 acquires, for each section, the end label and the class label output by the learning model M. When the learning model M1 outputs the end label “1” upon input of the voice of the section “19”, the terminal device 100 estimates that the end of the keyword section is the section “19”. Further, when the learning model M1 outputs the class label “15” upon input of the voice of the section “19”, the terminal device 100 estimates that the start of the keyword is included in the section 15 classes before the section “19”, that is, the section “4”. Then, the terminal device 100 estimates that the span from the section “4” to the section “19” is the keyword section.
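The backtracking step in this example can be sketched as below. The function name `detect_keyword_section` and the representation of the model outputs as a list of `(end_label, class_label)` pairs are assumptions for illustration; the model itself is not reproduced here. The sketch follows the example in the text literally and goes back as many sections as the class label (19 − 15 = 4). If the section containing the start is itself labelled 1, as in the FIG. 1 labelling, an implementation might instead go back one section fewer; the source does not settle this, so the choice here is an assumption.

```python
from typing import List, Optional, Tuple


def detect_keyword_section(
    outputs: List[Tuple[int, int]]
) -> Optional[Tuple[int, int]]:
    """Return (start_section, end_section), or None if no end was found.

    outputs[i] holds the (end_label, class_label) pair that the model
    produced for section i + 1 (sections are 1-indexed).
    """
    for index, (end_label, class_label) in enumerate(outputs):
        section = index + 1
        if end_label == 1:
            # Go back by the number of sections given by the class label.
            start_section = max(1, section - class_label)
            return start_section, section
    return None


# Section 19 yields end label 1 and class label 15; other sections yield 0.
outputs = [(0, 0)] * 18 + [(1, 15)] + [(0, 0)] * 4
print(detect_keyword_section(outputs))  # (4, 19)
```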

続いて、端末装置１００は、推定したキーワード区間に含まれる音声を用いて、所定の処理を実行する（ステップＳ６）。例えば、端末装置１００は、キーワード区間に含まれる音声の解析を行い、キーワードが発話されたか否かを判定してもよく、ビームフォーミング等を実行し、後続する処理発話の強調等を行ってもよい。また、端末装置１００は、単に、キーワード区間に後続する処理発話の解析を行い、解析結果と対応する処理を実行してもよい。そして、端末装置１００は、処理結果を利用者に対して提供する（ステップＳ７）。 Then, the terminal device 100 performs a predetermined process using the voice included in the estimated keyword section (step S6). For example, the terminal device 100 may analyze the voice included in the keyword section to determine whether or not the keyword was uttered, or may perform beamforming or the like to emphasize the subsequent processing utterance. In addition, the terminal device 100 may simply analyze the processing utterance that follows the keyword section and execute the process corresponding to the analysis result. Then, the terminal device 100 provides the processing result to the user (step S7).

このように、端末装置１００は、学習対象となった音声情報である学習情報に含まれる各区間ごとに、対象音声の終端が含まれているか否かと、対象音声の始端から処理対象の区間までの期間とを学習させたモデルを用いて、発話音声から対象音声の始端を含む区間を検出する。例えば、端末装置１００は、再帰型ニューラルネットワークの構成を有する学習モデルＭに対し、発話音声の各区間に含まれる音声を先頭から順に入力し、学習モデルＭが出力した終端情報と期間情報とに基づいて、対象音声の始端を含む区間を検出する。 As described above, the terminal device 100 detects the section including the start of the target voice from the uttered voice by using a model that has learned, for each section included in the learning information (the voice information used for learning), whether or not the end of the target voice is included and the period from the start of the target voice to the section being processed. For example, the terminal device 100 inputs the voice included in each section of the uttered voice, in order from the beginning, to the learning model M having the configuration of a recurrent neural network, and detects the section including the start of the target voice based on the end information and the period information output by the learning model M.

すなわち、端末装置１００は、所定の区間に含まれる音声が入力された場合にその所定の区間に対象音声の終端が含まれているか否かを示す終端情報と、対象音声の始端からその所定の区間までの期間を示す期間情報とを出力するように学習が行われた学習モデルＭを用いて、発話音声から対象音声の始端を含む区間を検出する。また、端末装置１００は、発話音声を複数の区間に分割し、分割した区間のうち、音声を入力した際に対象音声の終端が含まれている旨を示す終端情報を学習モデルＭが出力した区間を特定し、特定した区間について学習モデルＭが出力した期間情報に基づいて、対象音声の始端が含まれる区間を検出する。 That is, the terminal device 100 detects the section including the start of the target voice from the uttered voice by using the learning model M, which has been trained so that, when the voice included in a predetermined section is input, it outputs end information indicating whether or not the end of the target voice is included in the predetermined section and period information indicating the period from the start of the target voice to the predetermined section. Further, the terminal device 100 divides the uttered voice into a plurality of sections, identifies, among the divided sections, the section for which the learning model M outputs end information indicating that the end of the target voice is included, and detects the section including the start of the target voice based on the period information output by the learning model M for the identified section.

このような処理の結果、端末装置１００は、１つの学習モデルＭにより、キーワードの検出に加えて、キーワード区間を適切に推定することができる。また、端末装置１００は、ＬＳＴＭにより構成される学習モデルＭを用いて、キーワードの終端を推定し、推定したキーワードの終端から遡ってキーワードの始端を推定する。ここで、ＬＳＴＭ等の再帰型ニューラルネットワークにおいては、それまでに入力されたデータの特徴を考慮して、新たに入力されたデータが所定の条件を満たすか否かを判定することができる。このため、端末装置１００は、キーワード全体の発話を待って、キーワード区間の検出を行うことができるので、キーワード区間を精度よく検出することができる。 As a result of such processing, the terminal device 100 can appropriately estimate the keyword section in addition to the keyword detection by using one learning model M. In addition, the terminal device 100 estimates the end of the keyword by using the learning model M configured by the LSTM, and estimates the start of the keyword by tracing back from the estimated end of the keyword. Here, in a recursive neural network such as LSTM, it is possible to determine whether or not the newly input data satisfies a predetermined condition, by taking into consideration the characteristics of the data input so far. Therefore, the terminal device 100 can detect the keyword section after waiting for the utterance of the entire keyword, and thus can detect the keyword section with high accuracy.

また、キーワードの終端を検出するタスクとともに、キーワードの始端から各区間までの経過期間とを推定するタスクとのマルチタスク学習を行わせた場合、音声が有する特徴のうち各タスクを実現するための特徴をモデルが多角的に学習することとなる。このような処理の結果、学習モデルＭにおいては、キーワードの終端を検出するタスクのみを学習させたモデルよりも、キーワードの終端をより精度よく検出することができる。 In addition, when the model is made to perform multitask learning of the task of detecting the end of the keyword together with the task of estimating the elapsed period from the start of the keyword to each section, the model learns, from multiple angles, the features of the voice that serve each task. As a result, the learning model M can detect the end of the keyword more accurately than a model trained only on the task of detecting the end of the keyword.
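One common way to realize such multitask learning is to add the losses of the two tasks for each frame. The source does not specify a loss function, so the following is only a sketch under that assumption: binary cross-entropy for the end task plus categorical cross-entropy for the elapsed class task. The function name `multitask_loss` and the parameter `class_weight` are hypothetical.

```python
import math
from typing import Sequence


def multitask_loss(
    end_prob: float,
    class_probs: Sequence[float],
    end_label: int,
    class_label: int,
    class_weight: float = 1.0,
) -> float:
    """Per-frame loss: end-detection loss plus weighted elapsed-class loss.

    end_prob    : predicted probability that the frame contains the end
    class_probs : predicted probability of each elapsed class
    """
    eps = 1e-12  # avoids log(0)
    # Binary cross-entropy for the end task.
    end_loss = -(
        end_label * math.log(end_prob + eps)
        + (1 - end_label) * math.log(1.0 - end_prob + eps)
    )
    # Categorical cross-entropy for the elapsed class task.
    class_loss = -math.log(class_probs[class_label] + eps)
    return end_loss + class_weight * class_loss


good = multitask_loss(0.9, [0.1, 0.9], end_label=1, class_label=1)
bad = multitask_loss(0.1, [0.9, 0.1], end_label=1, class_label=1)
print(good < bad)  # True
```

Because both terms are computed from outputs of the same model, a gradient step on the sum would update the shared parameters for both tasks at once, which is the sense in which the two tasks are learned together in this sketch.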

〔１−５．モデルについて〕
上述した説明では、情報提供装置１０は、ＬＳＴＭの構造を有するモデルを学習モデルＭとした。しかしながら、実施形態は、これに限定されるものではない。例えば、情報提供装置１０は、ＢｉｄｉｒｅｃｔｉｏｎａｌＬＳＴＭ等、ＬＳＴＭから派生した各種のニューラルネットワークであってもよく、各種ＲＮＮであってもよい。また、情報提供装置１０は、入力された音声の区間にキーワードの終端が含まれているか否かと、キーワードの始端から入力された音声の区間までの期間とを同時に学習させるのであれば、ＳＶＭ（Support Vector Machine）やＤＮＮ（Deep Neural Network）、ＣＮＮ（Convolutional Neural Network）等といった任意の構成を有するモデルを採用してよい。 [1-5. About the model]
In the above description, the information providing apparatus 10 uses a model having the LSTM structure as the learning model M. However, the embodiment is not limited to this. For example, the information providing apparatus 10 may use any of various neural networks derived from LSTM, such as a bidirectional LSTM, or any of various RNNs. Further, as long as the model simultaneously learns whether or not the end of the keyword is included in the input voice section and the period from the start of the keyword to the input voice section, the information providing apparatus 10 may adopt a model having an arbitrary configuration, such as an SVM (Support Vector Machine), a DNN (Deep Neural Network), or a CNN (Convolutional Neural Network).

また、情報提供装置１０は、複数のモデルを用いて、学習を行ってもよい。例えば、情報提供装置１０は、キーワードの終端を検出するように第１モデルの学習を行うとともに、キーワードの始端から各区間までの経過期間を第２モデルに学習させる。そして、端末装置１００は、このような第１モデルと第２モデルとに対して、個別に発話音声の各区間を入力し、第１モデルが終端であると判定した区間から、第２モデルが出力した経過期間分だけ遡った区間を、キーワードの始端を含む区間としてもよい。 Moreover, the information providing apparatus 10 may perform learning using a plurality of models. For example, the information providing apparatus 10 trains a first model to detect the end of the keyword, and causes a second model to learn the elapsed period from the start of the keyword to each section. Then, the terminal device 100 may input each section of the uttered voice separately to the first model and the second model, and may treat the section reached by going back, from the section that the first model determines to be the end, by the elapsed period output by the second model as the section including the start of the keyword.

〔１−６．区間について〕
上述した例では、情報提供装置１０は、学習データを複数の区間に分割し、区間ごとに終端ラベルの値とクラスラベルの値とをモデルに学習させた。しかしながら、実施形態は、これに限定されるものではない。例えば、情報提供装置１０は、学習データを所定長のフレームに分割し、フレームごとに終端ラベルの値を学習させるとともに、複数のフレームを含む区間ごとにクラスラベルの値を学習させてもよい。すなわち、情報提供装置１０は、キーワードの終端についてはフレームごとの学習を行い、経過期間については、複数のフレームごとの学習を行ってもよい。また、入力されたフレームをいくつのクラスに分類するかについては、任意の態様が採用可能である。 [1-6. About section]
In the example described above, the information providing apparatus 10 divides the learning data into a plurality of sections, and causes the model to learn the value of the end label and the value of the class label for each section. However, the embodiment is not limited to this. For example, the information providing apparatus 10 may divide the learning data into frames of a predetermined length, learn the value of the end label for each frame, and learn the value of the class label for each section including a plurality of frames. That is, the information providing apparatus 10 may perform learning on a frame-by-frame basis regarding the end of the keyword and may perform learning on a plurality of frames regarding the elapsed period. In addition, as to how many classes the input frame is classified into, any mode can be adopted.

〔１−７．学習処理について〕
上述した例では、キーワードの終端について「１」若しくは「０」といった２値の値を出力するようにモデルの学習を行い、経過期間（すなわち、クラス）について「１」〜「２０」といった整数値を出力するようにモデルの学習を行う例について記載した。 [1-7. About learning processing]
In the example described above, the model is trained to output a binary value, “1” or “0”, for the end of the keyword, and to output an integer value, such as “1” to “20”, for the elapsed period (that is, the class).

ここで、実際には、情報提供装置１０は、入力されたフレームにキーワードの終端が含まれている確度を出力するようにモデルの学習を行う。このような場合、端末装置１００は、あるフレームを学習モデルＭに入力した際に、学習モデルＭが出力した確度が所定の閾値を超える場合は、そのフレームにキーワードの終端が含まれていると推定してもよい。 Here, in reality, the information providing apparatus 10 learns the model so as to output the probability that the keyword end is included in the input frame. In such a case, when a certain frame is input to the learning model M and the accuracy output from the learning model M exceeds a predetermined threshold, the terminal device 100 determines that the frame includes the end of the keyword. It may be estimated.

また、情報提供装置１０は、入力されたフレームが各クラスに属する確度をそれぞれ出力するようにモデルの学習を行う。このような場合、端末装置１００は、あるフレームを学習モデルＭに入力した際に、学習モデルＭが出力した確度が所定の閾値を超えるクラスを、入力したフレームが属するクラスと判定してもよい。換言すると、端末装置１００は、各経過期間のうち、学習モデルＭ１が出力した確度が所定の閾値を超える経過期間を特定し、入力されたフレームが、キーワードの始端から特定した経過期間だけ後のフレームであると推定してもよい。 The information providing apparatus 10 also trains the model so that it outputs, for each class, the probability that the input frame belongs to that class. In this case, when a certain frame is input to the learning model M, the terminal device 100 may determine that the class whose probability output by the learning model M exceeds a predetermined threshold is the class to which the input frame belongs. In other words, the terminal device 100 may identify, among the elapsed periods, the elapsed period whose probability output by the learning model M1 exceeds the predetermined threshold, and may estimate that the input frame is the frame that comes after the start of the keyword by the identified elapsed period.

なお、入力されたフレームが、終端クラス分類や経過クラス分類の各クラスごとに確度を出力するように学習モデルの学習を行う場合、所定の閾値を超えるクラスが複数存在する事象が生じうる。そこで、情報提供装置１０は、各クラスの確度の最大値を特定し、確度が最も高いクラスを採用することとしてもよい。すなわち、情報提供装置１０は、各クラスの確度に関してａｒｇｍａｘを取ることによってクラスの決定を行ってもよい。また、情報提供装置１０は、このようなａｒｇｍａｘの処理を行う出力層を備えたモデルの学習を行ってもよい。また、情報提供装置１０は、確度が所定の閾値を超えたクラスのうち、確度が最大となるクラスにフレームの分類を行うように、学習モデルの学習を行ってもよい。 When the learning model is trained to output, for an input frame, a probability for each class of the end class classification and the elapsed class classification, a plurality of classes may exceed the predetermined threshold. Therefore, the information providing device 10 may identify the maximum of the class probabilities and adopt the class with the highest probability. That is, the information providing apparatus 10 may determine the class by taking the argmax over the class probabilities. The information providing apparatus 10 may also train a model having an output layer that performs such argmax processing. Further, the information providing apparatus 10 may train the learning model so that the frame is classified into the class with the highest probability among the classes whose probability exceeds a predetermined threshold.
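The threshold and argmax decisions described here can be sketched as follows. The function names `has_end` and `decide_class`, and the default threshold of 0.5, are assumptions for illustration; the source refers only to "a predetermined threshold".

```python
from typing import Optional, Sequence


def has_end(end_prob: float, threshold: float = 0.5) -> bool:
    """Treat the frame as containing the end when the probability exceeds the threshold."""
    return end_prob > threshold


def decide_class(
    class_probs: Sequence[float], threshold: Optional[float] = None
) -> Optional[int]:
    """Pick the class with the highest probability (argmax).

    If a threshold is given, the class is accepted only when its
    probability exceeds the threshold; otherwise None is returned.
    """
    best = max(range(len(class_probs)), key=lambda k: class_probs[k])
    if threshold is not None and class_probs[best] <= threshold:
        return None
    return best


probs = [0.10, 0.35, 0.40, 0.15]
print(decide_class(probs))                 # 2
print(decide_class(probs, threshold=0.5))  # None
```

Taking the argmax always yields exactly one class, which sidesteps the case in which several classes exceed the threshold at once.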

なお、情報提供装置１０は、経過期間に関しては、回帰問題で解いてもよい。例えば、情報提供装置１０は、経過時間のクラス分類ではなく、始端から経過したと推定される期間を示す数値そのものを出力するように、モデルの学習を行ってもよい。例えば、情報提供装置１０は、クラスラベルに代えて、キーワードの始端から各フレームまでの経過時間を含む学習データの特徴をモデルに学習させてもよい。 The information providing apparatus 10 may solve the elapsed period by a regression problem. For example, the information providing apparatus 10 may perform the model learning so as to output the numerical value itself indicating the period estimated to have passed from the start end, instead of classifying the elapsed time. For example, the information providing apparatus 10 may cause the model to learn the characteristics of the learning data including the elapsed time from the beginning of the keyword to each frame, instead of the class label.
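In the regression variant, the model would output the elapsed time as a number, and the start could be located by converting that time into a number of frames. The sketch below assumes a frame length of 20 ms (the figure used in the earlier example), 0-indexed frames, and a hypothetical helper name `start_frame_from_elapsed`; none of these details is fixed by the source.

```python
def start_frame_from_elapsed(
    end_frame: int, elapsed_seconds: float, frame_ms: int = 20
) -> int:
    """Estimate the frame containing the start from a regressed elapsed time.

    end_frame       : index of the frame judged to contain the end
    elapsed_seconds : model's estimate of the time elapsed since the start
    """
    frames_back = round(elapsed_seconds * 1000 / frame_ms)
    return max(0, end_frame - frames_back)


# An end at frame 100 with an estimated 0.5 s elapsed: 25 frames back.
print(start_frame_from_elapsed(100, 0.5))  # 75
```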

〔１−７．適用対象について〕
上述した例では、情報提供装置１０は、起動音声となるキーワードの検出を行うモデルの学習を行った。しかしながら、実施形態は、これに限定されるものではない。情報提供装置１０は、検出目的となる音声であれば、任意の音声の検出を行うモデルの学習を行ってよい。すなわち、情報提供装置１０は、各種の音声データの中から、所定の機械音、環境音、ノイズ等、検出目的となる音を含む区間を検出するため、検出目的となる音の終端と、検出目的となる音の始端から経過した期間とをモデルに学習させるのであれば、任意の音を検出目的として良い。 [1-7. About application target]
In the example described above, the information providing apparatus 10 has learned the model for detecting the keyword that is the activation voice. However, the embodiment is not limited to this. The information providing apparatus 10 may perform learning of a model for detecting any voice as long as the voice is a detection target. That is, the information providing apparatus 10 detects a section including a sound to be detected, such as a predetermined mechanical sound, environmental sound, or noise, from various audio data, and therefore, the end of the sound to be detected and the detection are performed. Any sound may be used as the detection target as long as the model learns the period elapsed from the start of the target sound.

〔１−８．実行主体について〕
上述した例では、情報提供装置１０により学習処理が行われ、端末装置１００により検出処理が実行された。しかしながら、実施形態は、これに限定されるものではない。例えば、学習処理および検出処理は、情報提供装置１０により実行されてもよい。このような場合、情報提供装置１０は、端末装置１００が取得した発話音声を受付け、学習モデルＭを用いて、受付けた発話音声からキーワード区間を検出することとなる。また、上述した学習処理および検出処理は、端末装置１００によって実現されてもよい。 [1-8. Execution subject]
In the example described above, the learning process is performed by the information providing device 10, and the detection process is performed by the terminal device 100. However, the embodiment is not limited to this. For example, the learning process and the detection process may be executed by the information providing device 10. In such a case, the information providing apparatus 10 receives the uttered voice acquired by the terminal device 100, and uses the learning model M to detect the keyword section from the received uttered voice. The learning process and the detection process described above may be realized by the terminal device 100.

〔２．機能構成の一例〕
以下、上記した学習処理を実現する情報提供装置１０が有する機能構成の一例、および、上述した検出処理を実現する端末装置１００が有する機能構成の一例について説明する。 [2. Example of functional configuration]
Hereinafter, an example of a functional configuration of the information providing apparatus 10 that implements the learning process described above and an example of a functional configuration of the terminal device 100 that implements the detection process described above will be described.

〔２−１．情報提供装置の機能構成の一例について〕
まず、図２を用いて、情報提供装置１０が有する機能構成の一例を説明する。図２は、実施形態に係る情報提供装置の構成例を示す図である。図２に示すように、情報提供装置１０は、通信部２０、記憶部３０、および制御部４０を有する。 [2-1. Regarding an example of the functional configuration of the information providing device]
First, an example of the functional configuration of the information providing apparatus 10 will be described with reference to FIG. FIG. 2 is a diagram illustrating a configuration example of the information providing apparatus according to the embodiment. As shown in FIG. 2, the information providing device 10 includes a communication unit 20, a storage unit 30, and a control unit 40.

通信部２０は、例えば、ＮＩＣ（Network Interface Card）等によって実現される。そして、通信部２０は、ネットワークＮと有線または無線で接続され、例えば、端末装置１００、データサーバＤＳおよび外部サーバＯＳとの間で情報の送受信を行う。 The communication unit 20 is realized by, for example, a NIC (Network Interface Card) or the like. Then, the communication unit 20 is connected to the network N by wire or wirelessly, and transmits / receives information to / from the terminal device 100, the data server DS, and the external server OS, for example.

記憶部３０は、例えば、ＲＡＭ（Random Access Memory)、フラッシュメモリ（Flash Memory）等の半導体メモリ素子、または、ハードディスク、光ディスク等の記憶装置によって実現される。また、記憶部３０は、学習データデータベース３１およびモデルデータベース３２を記憶する。 The storage unit 30 is realized by, for example, a semiconductor memory device such as a RAM (Random Access Memory) or a flash memory (Flash Memory), or a storage device such as a hard disk or an optical disk. The storage unit 30 also stores a learning data database 31 and a model database 32.

学習データデータベース３１は、学習データが登録される。例えば、図３は、実施形態に係る学習データデータベースに登録される情報の一例を示す図である。図３に示すように、学習データデータベース３１には、「学習データＩＤ（Identifier）」、「区間」、「音声データ」、「終端タグ」、および「クラスラベル」といった項目を有する情報が登録される。なお、図３に示す例では、「区間」ごとに音声データ、終端タグ、およびクラスラベルが格納される例について記載したが、実際には、フレームごとに音声データ、終端タグ、およびクラスラベルが格納されていてもよい。 Learning data is registered in the learning data database 31. For example, FIG. 3 is a diagram showing an example of information registered in the learning data database according to the embodiment. As shown in FIG. 3, information having items such as “learning data ID (Identifier)”, “section”, “voice data”, “end tag”, and “class label” is registered in the learning data database 31. In the example shown in FIG. 3, the voice data, the end tag, and the class label are stored for each “section”, but in practice they may be stored for each frame.

ここで、「学習データＩＤ」とは、学習データの識別子である。また、「区間」とは、学習データとなる音声データを分割した各区間を識別するための情報であり、例えば、区間に付与された一連の番号である。また、「音声データ」とは、対応付けられた「区間」が示す区間に含まれる音声データ、すなわち音響信号である。また、「終端タグ」とは、対応付けられた「区間」にキーワードの終端が含まれているか否かを示す情報である。また、「クラスラベル」は、対応付けられた「区間」に含まれる音声が、キーワードの始端からどれくらい経過した際の音声であるかを示す区間情報であり、対応付けられた「区間」が属するクラスを示す情報である。 Here, "learning data ID" is an identifier of learning data. Further, the “section” is information for identifying each section obtained by dividing the audio data that is the learning data, and is, for example, a series of numbers given to the section. Further, the “voice data” is voice data included in the section indicated by the associated “section”, that is, an acoustic signal. The “end tag” is information indicating whether the associated “section” includes the end of the keyword. Further, the “class label” is section information indicating how long the voice included in the associated “section” is the voice after the beginning of the keyword, and the associated “section” belongs to the “class label”. This is information indicating a class.

例えば、図３に示す例では、学習データデータベース３１には学習データＩＤ「ＬＤ１」、区間「１」、音声データ「ＳＤ１」、終端タグ「０」、およびクラスラベル「０」が対応付けて登録されている。このような情報は、学習データＩＤ「ＬＤ１」が示す学習データのうち、区間「１」に含まれるの音声データとして音声データ「ＳＤ１」が登録されており、区間「１」における終端タグの値が「０」であり、クラスラベルの値が「０」である旨を示す。 For example, in the example shown in FIG. 3, the learning data ID “LD1”, the section “1”, the voice data “SD1”, the end tag “0”, and the class label “0” are associated and registered in the learning data database 31. Has been done. For such information, voice data “SD1” is registered as voice data included in the section “1” of the learning data indicated by the learning data ID “LD1”, and the value of the end tag in the section “1”. Is "0", indicating that the value of the class label is "0".

なお、図３に示す例では、「ＳＤ１」といった概念的な値を記載したが、実際には、学習データデータベース３１には、音声データとして各フレームの音声の音量や周波数分布等を示す情報が登録されることとなる。また、学習データデータベース３１には、「区間」に代えて、フレーム番号等が登録されていてもよい。また、図３に示す情報以外にも、学習データデータベース３１には、任意の情報が登録されていてよい。 Note that in the example shown in FIG. 3, a conceptual value such as “SD1” is described, but in reality, the learning data database 31 includes information indicating the sound volume and frequency distribution of the sound of each frame as sound data. It will be registered. Further, in the learning data database 31, instead of the “section”, a frame number or the like may be registered. In addition to the information shown in FIG. 3, arbitrary information may be registered in the learning data database 31.
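One row of the learning data database could be represented as a small record type, as sketched below. The class name `LearningRecord`, the field names, and the use of a list of floats for the voice data are assumptions for illustration; the source states only that the voice data holds information such as volume and frequency distribution.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LearningRecord:
    """One row of the learning data database (per section or per frame)."""
    learning_data_id: str    # e.g. "LD1"
    section: int             # section number (or frame number)
    voice_data: List[float]  # placeholder for acoustic features of the section
    end_tag: int             # 1 if the section contains the keyword end, else 0
    class_label: int         # class according to the time elapsed from the start


# The row for section "1" of learning data "LD1": end tag 0, class label 0.
record = LearningRecord(
    learning_data_id="LD1", section=1, voice_data=[], end_tag=0, class_label=0
)
print(record.end_tag, record.class_label)  # 0 0
```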

図２に戻り、説明を続ける。モデルデータベース３２には、学習モデルが登録される。すなわち、モデルデータベース３２には、検出対象となる対象音声の終端と、対象音声の始端から経過した期間とを学習させた学習モデルＭのデータが登録される。例えば、モデルデータベース３２には、学習モデルＭ１のデータとして、それぞれが１つ又は複数のノードを含む多段の層を構成するノードの情報と、各ノード間の接続関係を示す情報と、ノード間で情報を伝達する際の重みである接続係数とが登録される。 Returning to FIG. 2, the description will be continued. Learning models are registered in the model database 32. That is, in the model database 32, the data of the learning model M in which the end of the target voice to be detected and the period elapsed from the start of the target voice are learned are registered. For example, in the model database 32, as data of the learning model M1, information on nodes forming a multi-tiered layer each including one or a plurality of nodes, information indicating a connection relationship between each node, A connection coefficient, which is a weight for transmitting information, is registered.

Here, the learning model M1 has an input layer to which an acoustic signal serving as learning data is input. The learning model M1 also has an output layer that outputs end information, which indicates whether the input acoustic signal contains the end of the target voice, and period information, which indicates how much time had elapsed from the start of the target voice when the input acoustic signal occurred, that is, information indicating the class into which the input acoustic signal is classified.

The learning model M1 also includes a first element that belongs to any layer from the input layer to the output layer other than the output layer, and a second element whose value is calculated based on the first element and the weight of the first element. Treating each element belonging to each layer other than the output layer as a first element, the learning model M1 causes a computer to perform, on the information input to the input layer, an operation based on the first element and the weight of the first element, and thereby to output from the output layer information corresponding to the information input to the input layer.

Such a learning model M1 causes a computer to output the end information and the period information from the output layer when voice data is input to the input layer, for example, both at learning time and at measurement time. At learning time, the information providing device 10 corrects the connection coefficients of the learning model M1 so that the end information and the period information output by the learning model M1 match the end information and the period information corresponding to the input voice data.

Here, when the learning model M1 is realized by an SVM or a regression model, the learning model M1 can be regarded as a simple perceptron having an input layer and an output layer. In that case, the first element corresponds to one of the nodes of the input layer, and the second element can be regarded as a node of the output layer. When the learning model M1 is realized by a neural network having one or more intermediate layers, such as a DNN, the first element can be regarded as one of the nodes of the input layer or an intermediate layer; the second element corresponds to the node to which a value is passed from the node corresponding to the first element, that is, a node in the next stage; and the weight of the first element is the weight applied to the value passed from the node corresponding to the first element to the node corresponding to the second element, that is, the connection coefficient.

Here, the information providing device 10 uses the learning data registered in the learning data database 31 to generate the learning model M1 for executing the detection process described above. That is, the learning data registered in the learning data database 31 is data that includes an input layer to which an acoustic signal is input, an output layer, a first element that belongs to any layer from the input layer to the output layer other than the output layer, and a second element whose value is calculated based on the first element and the weight of the first element. Treating each element belonging to each layer other than the output layer as a first element, this data causes a computer to perform, on the acoustic signal input to the input layer, an operation based on the first element and on the weight of the first element, which reflects the features of the target voice and the features of the period from the start of the target voice to each section of the target voice, and thereby to output the end information and the period information from the output layer.

The control unit 40 is a controller and is realized, for example, by a processor such as a CPU (Central Processing Unit) or an MPU (Micro Processing Unit) executing various programs stored in a storage device inside the information providing device 10, using a RAM or the like as a work area. The control unit 40 may also be realized by an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).

As shown in FIG. 2, the control unit 40 includes a data acquisition unit 41, a learning unit 42, and a providing unit 43. The data acquisition unit 41 acquires voice information that contains the target voice to be detected. For example, the data acquisition unit 41 acquires from the data server DS, as learning data, information that associates voice data divided into a plurality of sections with an end tag indicating whether the voice in each section contains the end of the target voice and with period information indicating the period from the start of the target voice to the voice in each section, that is, class data. The data acquisition unit 41 then registers the acquired learning data in the learning data database 31.

The data acquisition unit 41 may acquire voice information that contains, as the target voice, a voice for causing the terminal device 100 to perform a predetermined operation, that is, a keyword serving as an activation voice. The data acquisition unit 41 may also acquire voice information that contains, as the target voice, a voice uttering a plurality of words or a voice including a silent section. Any kind of voice can thus be set as the target voice; however, so that the set target voice is detected appropriately, the data acquisition unit 41 desirably acquires, as learning data, voices whose features are similar to those of the voice to be detected.

The learning unit 42 causes the model to learn the end of the target voice and the period elapsed from the start of the target voice. For example, the learning unit 42 trains a model that detects the end of a keyword based on the features of the entire target voice. More specifically, the learning unit 42 trains the model to detect the section containing the end of the keyword based on the order in which the features of the keyword's sections appear.

For example, the learning unit 42 divides the voice information into a plurality of sections and causes the model to learn, for each section, whether the section contains the end of the target voice and the period from the start of the target voice to that section. More specifically, the learning unit 42 trains the model so that, when the voice in a given section is input to the model, the model outputs end information indicating whether that section contains the end of the target voice and period information indicating the period from the start of the target voice to that section. In other words, the learning unit 42 may divide the voice information into a plurality of sections and train the model so that, when the voice in a given section is input, the model outputs a classification result corresponding to the period from the start of the target voice to that section.
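The per-section supervision described here can be sketched as a small labeling routine. This is an illustrative sketch under stated assumptions, not the procedure of the embodiment: the function name `label_sections`, the grouping of three sections per class, and the use of `-1` for sections outside the target voice are all choices made for this example.

```python
from typing import List, Tuple


def label_sections(num_sections: int, start: int, end: int,
                   sections_per_class: int = 3) -> List[Tuple[int, int]]:
    """Return (end_tag, class_label) for sections 0..num_sections-1.

    start / end are the section indices holding the start and the end of the
    target voice. Sections outside [start, end] get class label -1, which is
    a convention of this sketch only.
    """
    labels = []
    for i in range(num_sections):
        end_tag = 1 if i == end else 0
        if start <= i <= end:
            # Class label grows with the period elapsed since the start.
            class_label = (i - start) // sections_per_class
        else:
            class_label = -1
        labels.append((end_tag, class_label))
    return labels


labels = label_sections(num_sections=10, start=2, end=8)
```

Each tuple pairs the end tag with a class label that encodes the elapsed period, which is the information the model is trained to reproduce for every section.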

When an ordinary DNN or the like is used, the model may end up detecting the end of the target voice based only on the features near the end, rather than on all of the features of the target voice. In that case, a voice whose features resemble the end of the target voice may be detected as the end of the target voice. Therefore, so that the end of the target voice is detected based on the features of the entire target voice, the learning unit 42 may cause a model having a recurrent neural network configuration to learn the end of the target voice and the period elapsed from the start of the target voice.

For example, the learning unit 42 generates a model having an LSTM configuration and reads one piece of learning data to be processed from the learning data database 31. The learning unit 42 then executes the following processing for each section of the read learning data in time-series order (that is, in ascending order of section number). First, the learning unit 42 inputs the voice data of the section being processed into the model. For example, the learning unit 42 may input the frequency, amplitude, and the like of the voice indicated by the voice data, or may input features of the voice. The learning unit 42 then corrects the connection coefficients of the model so that the output of the model to which the voice was input indicates the end tag and the class label of the section being processed.

The following describes an example in which the end tag “1” and the class label “18” are registered in association with the voice data “SD10”. The learning unit 42 inputs the voice data “SD10” into the model. In this case, the learning unit 42 corrects the connection coefficients of the model so that, among the nodes of the model's output layer, the node for outputting the end information outputs a value indicating a confidence at or above a predetermined threshold (that is, a value corresponding to the end tag “1”), and the node corresponding to the class label “18” outputs a value indicating a confidence at or above a predetermined threshold. The learning unit 42 performs the same processing on the other learning data. The learning unit 42 then registers the model in the model database 32 as the learning model M.
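One way to picture the training target for the “SD10” example is as a vector with one entry for the end node and one entry per class node, scored against the model output with a loss. The description does not name a loss function or an output layout; the binary cross-entropy, the node ordering, and the figure of 20 classes below are assumptions made purely for illustration.

```python
import math
from typing import List


def make_target(end_tag: int, class_label: int, num_classes: int) -> List[float]:
    """Target vector: index 0 is the end node, indices 1.. are class nodes (assumed layout)."""
    target = [0.0] * (1 + num_classes)
    target[0] = float(end_tag)
    target[1 + class_label] = 1.0
    return target


def binary_cross_entropy(output: List[float], target: List[float]) -> float:
    """Mean binary cross-entropy between node outputs in (0, 1) and targets."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for o, t in zip(output, target):
        total -= t * math.log(o + eps) + (1.0 - t) * math.log(1.0 - o + eps)
    return total / len(target)


# Section "SD10": end tag 1, class label 18 (20 classes assumed here).
target = make_target(end_tag=1, class_label=18, num_classes=20)

confident = [0.01] * 21
confident[0], confident[19] = 0.99, 0.99   # end node and class-18 node fire
undecided = [0.5] * 21                     # a model that has learned nothing

loss_confident = binary_cross_entropy(confident, target)
loss_undecided = binary_cross_entropy(undecided, target)
```

Correcting the connection coefficients then amounts to moving the output from something like `undecided` toward `confident`, which lowers the loss.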

The providing unit 43 provides the learning model to the terminal device 100. For example, in response to a request from the terminal device 100, the providing unit 43 reads the learning model M from the model database 32 and transmits it to the terminal device 100.

[2-2. Example of the functional configuration of the terminal device]
Next, an example of the functional configuration of the terminal device 100 will be described with reference to FIG. 4. FIG. 4 is a diagram illustrating a configuration example of the terminal device according to the embodiment. As shown in FIG. 4, the terminal device 100 includes a communication unit 120, a storage unit 130, a control unit 140, a microphone MC, and a speaker SP.

The communication unit 120 is realized by, for example, a NIC or the like. The communication unit 120 is connected to the network N by wire or wirelessly and transmits and receives information to and from, for example, the information providing device 10, the data server DS, and the external server OS.

The storage unit 130 is realized by, for example, a semiconductor memory element such as a RAM or a flash memory, or a storage device such as a hard disk or an optical disk. The storage unit 130 stores the learning model M distributed from the information providing device 10.

The microphone MC is an input device that receives sounds emitted around the terminal device 100, such as the voice uttered by the user, that is, the uttered voice. The speaker SP is an output device for outputting various sounds. The terminal device 100 may have a plurality of microphones MC and a plurality of speakers SP.

The control unit 140 is a controller and is realized, for example, by a processor such as a CPU or an MPU executing various programs stored in a storage device inside the terminal device 100, using a RAM or the like as a work area. The control unit 140 may also be realized by an integrated circuit such as an ASIC or an FPGA.

The control unit 140 includes a voice acquisition unit 141, a detection unit 142, and a processing unit 143. The voice acquisition unit 141 acquires voice information. For example, the voice acquisition unit 141 acquires the user's uttered voice or the like as voice information via the microphone MC.

The detection unit 142 detects the start of the target voice from the voice information acquired by the voice acquisition unit 141, using a model that has learned the end of the target voice to be detected and the period elapsed from the start of the target voice. For example, the detection unit 142 reads the learning model M registered in the storage unit 130. The detection unit 142 then inputs the voice information acquired by the voice acquisition unit 141 into the learning model M section by section, in the order in which it was acquired. Based on the output of the learning model M, the detection unit 142 detects the end and the start of the target voice and identifies the range from the detected start to the detected end as the keyword section.

For example, the detection unit 142 divides the voice information acquired by the voice acquisition unit 141 into a plurality of sections and identifies, among the divided sections, the section for which the learning model M output end information indicating that the end of the target voice is included when the voice in that section was input. The detection unit 142 then detects the section containing the start of the target voice based on the period information that the learning model M1 output for the identified section. The detection unit 142 notifies the processing unit 143 of the range from the detected start to the end as the keyword section.

That is, the detection unit 142 detects the section containing the start of the target voice using a model that has learned, for each section of the learning information (the voice information used for learning), whether the section contains the end of the target voice and the period from the start of the target voice to that section. The detection unit 142 also detects the section containing the start of the target voice using a model trained so that, when the voice in a given section is input, it outputs end information indicating whether that section contains the end of the target voice and period information indicating the period from the start of the target voice to that section. Further, the detection unit 142 inputs the voice in each section of the voice information, in order from the beginning, into a model having a recurrent neural network configuration, and detects the section containing the start of the target voice based on the end information and the period information output by the model.

For example, FIG. 5 is a diagram showing an example of the information output by the model according to the embodiment. The example in FIG. 5 shows the information that the learning model M outputs when a voice containing an utterance of the keyword is divided into 72 sections and the voice in each section is input in time-series order. In this example, the learning model M has 17 nodes in its output layer, numbered “0” to “16”, and each node is configured to output a value from 0 to 1 inclusive, that is, a confidence. In FIG. 5, sections in which the value output by a node is below a predetermined first threshold are shown in white, sections in which the value is at or above the first threshold and below a second threshold are shown with hatching rising to the right, and sections in which the value is at or above the second threshold are shown with hatching falling to the right. In this example, a node is judged to have output “1” when its confidence is at or above the second threshold.
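The three display bands of FIG. 5 and the rule for treating an output as “1” can be written as a short sketch. The threshold values 0.3 and 0.7 and the function names are assumptions chosen for illustration; the description only states that a first and a second threshold exist.

```python
def band(value: float, first_threshold: float, second_threshold: float) -> str:
    """Map a node output in [0, 1] to one of the three bands drawn in FIG. 5."""
    if value < first_threshold:
        return "white"
    if value < second_threshold:
        return "hatch-up"      # first threshold <= value < second threshold
    return "hatch-down"        # value >= second threshold


def fired(value: float, second_threshold: float) -> bool:
    """A node is judged to have output "1" at or above the second threshold."""
    return value >= second_threshold


FIRST, SECOND = 0.3, 0.7   # assumed values
bands = [band(v, FIRST, SECOND) for v in (0.1, 0.3, 0.69, 0.7, 1.0)]
```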

For example, in the example shown in FIG. 5, node 0 is a node trained to output a value close to “1” when the input voice is not the end of the keyword and a value close to “0” when the input voice is the end of the keyword. Node 1 is a node trained to output a value close to “0” when the input voice is not the end of the keyword and a value close to “1” when the input voice is the end of the keyword. Node 2 is a node trained to output a value close to “0” when the input voice is within the keyword section and a value close to “1” when the input voice is not within the keyword section.

In the example shown in FIG. 5, nodes 4 to 16 each correspond to a different class and are trained to output the period information. For example, node 4 is trained to output “1” when the input voice is within 3 sections of the start of the keyword and “0” otherwise. Node 5 is trained to output “1” when 3 or more sections have elapsed from the start of the keyword and the input voice falls within 6 sections of the start, and “0” otherwise. Node 6 is trained to output “1” when 6 or more sections have elapsed from the start of the keyword and the input voice falls within 9 sections of the start, and “0” otherwise. The other nodes are likewise trained to output “1” when the voice lies in their respective ranges of sections from the start of the keyword. The learning model M may have nodes corresponding to more classes than those shown in FIG. 5.

When the acquired voice is input to such a learning model M in time-series order, nodes 0 to 16 output values such as those shown in FIG. 5. For example, in the example shown in FIG. 5, node 1 outputs a value close to “1” in section “49” as a result of detecting the end of the keyword section. The detection unit 142 therefore estimates that the end of the keyword was detected in section “49”, in which node 1 output a value exceeding the second threshold.

Next, the detection unit 142 refers to the outputs of nodes 4 to 16 and identifies the class to which the voice in section “49” belongs. In the example shown in FIG. 5, node 13 begins to output a value in section “49”. The detection unit 142 therefore classifies the voice in section “49” into the class corresponding to node 13. There are 10 classes from the class corresponding to node 4 to the class corresponding to node 13, and three sections are associated with each class. The detection unit 142 therefore estimates that 30 sections lie between the start of the keyword and section “49”, and that the start of the keyword is contained in section “19”, obtained by subtracting 30 from “49”. As a result, the detection unit 142 can detect that the keyword section runs from section “19” to section “49”.
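The arithmetic in this example (10 classes × 3 sections = 30, and 49 − 30 = 19) can be reproduced with a short sketch. The node numbering follows FIG. 5 as described; the function name, the choice of the strongest class node, the clamping at section 0, and the threshold value are assumptions of this example rather than details given in the description.

```python
from typing import List, Optional, Tuple

END_NODE = 1            # node that fires at the end of the keyword
FIRST_CLASS_NODE = 4    # node 4 is the first period class
SECTIONS_PER_CLASS = 3  # each class spans three sections in FIG. 5


def estimate_keyword_section(outputs: List[List[float]],
                             second_threshold: float) -> Optional[Tuple[int, int]]:
    """Return (start_section, end_section), or None if no end is detected.

    outputs[t][n] is the value of output node n for section t.
    """
    for t, nodes in enumerate(outputs):
        if nodes[END_NODE] < second_threshold:
            continue
        class_values = nodes[FIRST_CLASS_NODE:]
        # Assumption: take the class node with the largest output.
        class_index = max(range(len(class_values)), key=class_values.__getitem__)
        elapsed = (class_index + 1) * SECTIONS_PER_CLASS
        return max(t - elapsed, 0), t
    return None


# 72 sections x 17 nodes, all silent except section 49 (values are made up).
outputs = [[0.0] * 17 for _ in range(72)]
outputs[49][1] = 0.9    # end node fires
outputs[49][13] = 0.8   # node 13 is the active period class
result = estimate_keyword_section(outputs, second_threshold=0.7)
```

With these inputs the sketch yields the same keyword section as the worked example, from section 19 to section 49.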

Returning to FIG. 4, the description continues. The processing unit 143 executes various processes according to the voice contained in the section detected by the detection unit 142. For example, the processing unit 143 analyzes the voice in the detected section and executes various processes according to the analysis result. The processing unit 143 then outputs a sound indicating the result of the process from the speaker SP.

[3. Flow of the processes executed by the information providing device and the terminal device]
Next, an example of the flow of the processes executed by the information providing device 10 and the terminal device 100 will be described with reference to FIGS. 6 and 7. FIG. 6 is a flowchart showing an example of the flow of the learning process executed by the information providing device according to the embodiment. FIG. 7 is a flowchart showing an example of the flow of the detection process executed by the terminal device according to the embodiment.

First, an example of the flow of the learning process will be described with reference to FIG. 6. The information providing device 10 acquires voice information containing the target voice as learning data (step S101) and causes the model to learn, for each section of the voice information, whether the section contains the end of the target voice and the time elapsed from the start of the target voice (step S102). The information providing device 10 then provides the learning model to the terminal device 100 (step S103) and ends the process.

Next, an example of the flow of the detection process will be described with reference to FIG. 7. First, the terminal device 100 determines whether an uttered voice has been received (step S201); if not (step S201: No), it executes step S201 again. When an uttered voice has been received (step S201: Yes), the terminal device 100 inputs the uttered voice into the learning model and estimates the end of the target voice (step S202). The terminal device 100 then estimates the start based on the elapsed period estimated by the learning model (step S203). Finally, the terminal device 100 extracts the keyword section, executes a process according to the voice contained in the extracted keyword section (step S204), and ends the process.
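The control flow of steps S201 to S204 can be sketched as a loop over incoming sections. This is a simplified illustration: the `Model` callable is a hypothetical, stateless stand-in for the learning model (the embodiment uses a recurrent model that carries state between sections), `None` stands for "no uttered voice received", and the threshold value is assumed.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical model interface: section features -> (end confidence, elapsed sections)
Model = Callable[[List[float]], Tuple[float, int]]


def detect(sections: Iterable[Optional[List[float]]], model: Model,
           threshold: float = 0.7) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the keyword section, or None if none is found."""
    for index, features in enumerate(sections):
        if features is None:                   # S201: No, so keep waiting
            continue
        end_conf, elapsed = model(features)    # S202: estimate the end
        if end_conf >= threshold:
            start = max(index - elapsed, 0)    # S203: estimate the start
            return start, index                # S204 would process this range
    return None


def stub_model(features: List[float]) -> Tuple[float, int]:
    """Toy model: the first feature is the end confidence, elapsed is fixed at 2."""
    return features[0], 2


found = detect([None, [0.1], [0.2], [0.9]], stub_model)
```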

[4. Modifications]
The above describes an example of the learning process and the detection process performed by the information providing device 10. However, the embodiment is not limited to this. Variations of the learning process and the detection process executed by the information providing device 10 and the terminal device 100 are described below.

[4-1. Sections corresponding to a class]
In the description of FIG. 5 above, three sections are associated with one class. However, the embodiment is not limited to this. For example, the information providing device 10 may perform learning in which one section is associated with one class, or learning in which ten sections are associated with one class. Here, one section may correspond to one frame or to a plurality of frames. The information providing device 10 may also set an upper limit on the number of classes.

[4-2. Device configuration]
The databases 31 and 32 registered in the storage unit 30 may be held in an external storage server. The information providing device 10 and the terminal device 100 may carry out the learning process and the detection process described above in cooperation with each other, or either device may execute them alone.

[4-3. Others]
Of the processes described in the above embodiment, all or part of the processes described as being performed automatically may be performed manually, and conversely, all or part of the processes described as being performed manually may be performed automatically by a known method. In addition, the processing procedures, specific names, and information including various data and parameters shown in the above description and drawings may be changed as desired unless otherwise specified. For example, the various kinds of information shown in each drawing are not limited to the illustrated information.

Each component of each illustrated device is a functional concept and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to that shown in the drawings, and all or part of each device may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.

Further, the embodiments described above can be combined as appropriate to the extent that their processing contents do not contradict each other.

[4-4. Program]
The information providing device 10 according to the above-described embodiment is realized by, for example, a computer 1000 having the configuration shown in FIG. 8. FIG. 8 is a diagram illustrating an example of the hardware configuration. The computer 1000 is connected to an output device 1010 and an input device 1020, and has a form in which an arithmetic device 1030, a primary storage device 1040, a secondary storage device 1050, an output IF (Interface) 1060, an input IF 1070, and a network IF 1080 are connected by a bus 1090.

The arithmetic device 1030 operates based on a program stored in the primary storage device 1040 or the secondary storage device 1050, a program read from the input device 1020, or the like, and executes various processes. The primary storage device 1040 is a memory device, such as a RAM, that temporarily stores data used by the arithmetic device 1030 for various calculations. The secondary storage device 1050 is a storage device in which data used by the arithmetic device 1030 for various calculations and various databases are registered, and is realized by a ROM (Read Only Memory), an HDD (Hard Disk Drive), a flash memory, or the like.

The output IF 1060 is an interface for transmitting information to be output to the output device 1010, such as a monitor or a printer, that outputs various types of information. It is realized by a connector conforming to a standard such as USB (Universal Serial Bus), DVI (Digital Visual Interface), or HDMI (registered trademark) (High Definition Multimedia Interface). The input IF 1070 is an interface for receiving information from various input devices 1020 such as a mouse, a keyboard, and a scanner, and is realized by, for example, USB.

The input device 1020 may be, for example, a device that reads information from an optical recording medium such as a CD (Compact Disc), a DVD (Digital Versatile Disc), or a PD (Phase change rewritable Disk), a magneto-optical recording medium such as an MO (Magneto-Optical disk), a tape medium, a magnetic recording medium, a semiconductor memory, or the like. The input device 1020 may also be an external storage medium such as a USB memory.

The network IF 1080 receives data from another device via the network N and sends the data to the arithmetic device 1030, and also transmits data generated by the arithmetic device 1030 to another device via the network N.

The arithmetic device 1030 controls the output device 1010 and the input device 1020 via the output IF 1060 and the input IF 1070. For example, the arithmetic device 1030 loads a program from the input device 1020 or the secondary storage device 1050 onto the primary storage device 1040, and executes the loaded program.

For example, when the computer 1000 functions as the information providing device 10, the arithmetic device 1030 of the computer 1000 realizes the functions of the control unit 40 by executing the program or data (for example, the learning model M1) loaded on the primary storage device 1040. The arithmetic device 1030 of the computer 1000 reads these programs or data (for example, the learning model M1) from the primary storage device 1040 and executes them; as another example, it may acquire these programs from another device via the network N.

[5. Effects]
As described above, the information providing device 10 acquires voice information including the target voice to be detected, and causes a model to learn the end of the target voice and the period elapsed from the start end of the target voice. The information providing device 10 thereby trains a model that can appropriately detect the section containing the target voice from input voice information, and as a result can improve the detection accuracy of the section containing the target voice.

Further, the information providing device 10 divides the voice information into a plurality of sections and causes the model to learn, for each section, whether the end of the target voice is included and the period from the start end of the target voice to that section. The information providing device 10 also trains the model so that, when the voice included in a predetermined section is input to the model, the model outputs end information indicating whether the end of the target voice is included in that predetermined section and period information indicating the period from the start end of the target voice to that predetermined section. Further, the information providing device 10 divides the voice information into a plurality of sections and trains the model so that, when the voice included in a predetermined section is input, the model outputs a classification result according to the period from the start end of the target voice to that predetermined section.
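
The per-section teacher data described above can be illustrated with a minimal sketch. The function name `make_section_labels`, the tuple layout, and the conventions used here are assumptions made for illustration: the period is counted in sections, inclusive of the section containing the start end, and sections outside the target voice are given a period of 0.

```python
from typing import List, Tuple


def make_section_labels(num_sections: int,
                        start: int,
                        end: int) -> List[Tuple[int, int]]:
    """Build (end_info, period_info) labels for every section.

    start, end: indices of the sections containing the start end and the
        end of the target voice (hypothetical annotation format).
    end_info: 1 only for the section containing the end, otherwise 0.
    period_info: number of sections from the start-end section up to and
        including the current section; 0 outside the target voice.
    """
    if not 0 <= start <= end < num_sections:
        raise ValueError("expected 0 <= start <= end < num_sections")
    labels = []
    for i in range(num_sections):
        end_info = 1 if i == end else 0
        period_info = i - start + 1 if start <= i <= end else 0
        labels.append((end_info, period_info))
    return labels


# Six sections, with the target voice spanning sections 1 to 3.
labels = make_section_labels(6, start=1, end=3)
```

In this sketch the period is a raw section count; it could instead be converted to a class index before training if the model is to output a classification result.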

In this way, the information providing device 10 trains a model that detects the end of the target voice and makes it possible to detect the start end of the target voice by going back from the detected end by the period indicated in the period information. As a result, the information providing device 10 detects the section containing the target voice using the features of the entire target voice, and can therefore improve the detection accuracy of the section containing the target voice.

Further, the information providing device 10 causes a model having a recurrent neural network configuration to learn the end of the target voice and the period elapsed from the start end of the target voice. The information providing device 10 also acquires voice information including, as the target voice, a voice for causing a predetermined terminal device to perform a predetermined operation. In addition, the information providing device 10 acquires voice information including, as the target voice, a voice in which a plurality of words are uttered or a voice including a silent section.

Further, the information providing device 10 trains the model so as to detect the end of the target voice based on the features of the entire target voice. For example, the information providing device 10 trains the model so as to detect the section including the end of the target voice based on the order in which the features of the respective sections of the target voice appear. As a result of the processing described above, the information providing device 10 can improve the detection accuracy of the section containing the target voice.

Further, the terminal device 100 acquires voice information. The terminal device 100 then detects the start end of the target voice from the voice information acquired by the acquisition unit, using a model that has learned the end of the target voice to be detected and the period elapsed from the start end of the target voice. The terminal device 100 can therefore improve the detection accuracy of the section containing the target voice.

Further, the terminal device 100 detects the section including the start end of the target voice from the voice information, using a model that has learned, for each section included in the learning information (the voice information used for learning), whether the end of the target voice is included and the period from the start end of the target voice to that section. The terminal device 100 also detects the section including the start end of the target voice from the voice information, using a model trained to output, when the voice included in a predetermined section is input, end information indicating whether the end of the target voice is included in that predetermined section and period information indicating the period from the start end of the target voice to that predetermined section.

Further, the terminal device 100 divides the acquired voice information into a plurality of sections, identifies, among the divided sections, the section for which the model, when given the voice included in that section, output end information indicating that the end of the target voice is included, and detects the section including the start end of the target voice based on the period information that the model output for the identified section. The terminal device 100 also inputs the voice included in each section of the voice information, in order from the beginning, to a model having a recurrent neural network configuration, and detects the section including the start end of the target voice based on the end information and the period information output by the model. As a result of such processing, the terminal device 100 detects the section containing the target voice based on the features of the entire target voice, and can therefore improve detection accuracy.
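
The detection procedure described above, identifying the section for which the model reports the end and then going back by the reported period to find the start end, can be sketched as follows. This is a hedged illustration, not the embodiment's implementation: the function name `detect_target_span`, the representation of model outputs as `(end_score, period)` pairs, the 0.5 threshold, and the inclusive period convention are all assumptions.

```python
from typing import Optional, Sequence, Tuple


def detect_target_span(outputs: Sequence[Tuple[float, int]],
                       threshold: float = 0.5) -> Optional[Tuple[int, int]]:
    """Return (start_section, end_section) of the target voice, or None.

    outputs: one (end_score, period) pair per section, in input order,
        standing in for what a trained model would emit per section.
    end_score: score that the section contains the end of the target voice.
    period: number of sections from the start-end section up to and
        including this section (assumed convention).
    """
    for i, (end_score, period) in enumerate(outputs):
        if end_score >= threshold:
            # Go back from the end section by the reported period.
            start = max(0, i - period + 1)
            return (start, i)
    return None


# Hypothetical per-section outputs: the end is reported at section 3
# with a period of 3 sections, so the start end lies in section 1.
span = detect_target_span([(0.1, 0), (0.2, 1), (0.1, 2), (0.9, 3), (0.1, 0)])
```

Only the period reported at the end section is used here, which reflects the idea that the start end is derived from the detected end rather than detected independently.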

Some embodiments of the present application have been described above in detail with reference to the drawings, but these are examples, and the present invention can be implemented in other forms with various modifications and improvements based on the knowledge of those skilled in the art, including the aspects described in the disclosure of the invention.

Further, the term "unit (section, module, unit)" used above can be read as "means" or "circuit". For example, the detection unit can be read as detection means or a detection circuit.

10 Information providing device
20 Communication unit
30 Storage unit
31 Learning data database
32 Model database
40 Control unit
41 Learning unit
42 Acquisition unit
43 Detection unit
44 Response generation unit
45 Providing unit
100 User terminal

Claims

A learning device comprising:
an acquisition unit that acquires voice information including a target voice to be detected; and
a learning unit that causes a model to learn the end of the target voice and the period elapsed from the start end of the target voice.

The learning device according to claim 1, wherein the learning unit divides the voice information into a plurality of sections and causes the model to learn, for each section, whether the end of the target voice is included and the period from the start end of the target voice to that section.

The learning device according to claim 2, wherein the learning unit trains the model so that, when a voice included in a predetermined section is input to the model, the model outputs end information indicating whether the end of the target voice is included in the predetermined section and period information indicating the period from the start end of the target voice to the predetermined section.

The learning device according to any one of claims 1 to 3, wherein the learning unit divides the voice information into a plurality of sections and trains the model so that, when a voice included in a predetermined section is input, the model outputs a classification result according to the period from the start end of the target voice to the predetermined section.

The learning device, wherein the learning unit causes a model having a recurrent neural network configuration to learn the end of the target voice and the period elapsed from the start end of the target voice.

The learning device according to any one of claims 1 to 5, wherein the acquisition unit acquires voice information including, as the target voice, a voice for causing a predetermined terminal device to perform a predetermined operation.

The learning device, wherein the acquisition unit acquires voice information including, as the target voice, a voice in which a plurality of words are uttered or a voice including a silent section.

The learning device according to claim 1, wherein the learning unit trains the model so as to detect the end of the target voice based on features of the entire target voice.

The learning device, wherein the learning unit trains the model so as to detect the section including the end of the target voice based on the order in which the features of the respective sections of the target voice appear.

A detection device comprising:
an acquisition unit that acquires voice information; and
a detection unit that detects the start end of a target voice to be detected from the voice information acquired by the acquisition unit, using a model that has learned the end of the target voice and the period elapsed from the start end of the target voice.

The detection device according to claim 10, wherein the detection unit detects a section including the start end of the target voice from the voice information acquired by the acquisition unit, using a model that has learned, for each section included in learning information that is voice information used for learning, whether the end of the target voice is included and the period from the start end of the target voice to that section.

The detection device according to claim 11, wherein the detection unit detects a section including the start end of the target voice from the voice information acquired by the acquisition unit, using a model trained to output, when a voice included in a predetermined section is input, end information indicating whether the end of the target voice is included in the predetermined section and period information indicating the period from the start end of the target voice to the predetermined section.

The detection device, wherein the detection unit divides the voice information acquired by the acquisition unit into a plurality of sections, identifies, among the divided sections, the section for which the model, when the voice included in that section was input, output end information indicating that the end of the target voice is included, and detects the section including the start end of the target voice based on the period information output by the model for the identified section.

The detection device according to claim 11 or 12, wherein the detection unit inputs the voice included in each section of the voice information acquired by the acquisition unit, in order from the beginning, to a model having a recurrent neural network configuration, and detects a section including the start end of the target voice based on the end information and the period information output by the model.

A learning method executed by a learning device, comprising:
an acquisition step of acquiring voice information including a target voice to be detected; and
a learning step of causing a model to learn the end of the target voice and the period elapsed from the start end of the target voice.

A learning program for causing a computer to execute:
an acquisition procedure of acquiring voice information including a target voice to be detected; and
a learning procedure of causing a model to learn the end of the target voice and the period elapsed from the start end of the target voice.

A detection method executed by a detection device, comprising:
an acquisition step of acquiring voice information; and
a detection step of detecting the start end of a target voice to be detected from the acquired voice information, using a model that has learned the end of the target voice and the period elapsed from the start end of the target voice.

A detection program for causing a computer to execute:
an acquisition procedure of acquiring voice information; and
a detection procedure of detecting the start end of a target voice to be detected from the acquired voice information, using a model that has learned the end of the target voice and the period elapsed from the start end of the target voice.