WO2020250797A1 - Information processing device, information processing method, and program - Google Patents

Information processing device, information processing method, and program

Info

Publication number
WO2020250797A1
WO2020250797A1 (PCT/JP2020/022107, JP2020022107W)
Authority
WO
WIPO (PCT)
Prior art keywords
information processing
information
voice
processing device
vector
Prior art date
Application number
PCT/JP2020/022107
Other languages
French (fr)
Japanese (ja)
Inventor
Yuichiro Koyama
Original Assignee
Sony Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Corporation
Publication of WO2020250797A1 publication Critical patent/WO2020250797A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use, for comparison or discrimination
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 - Circuits for transducers, loudspeakers or microphones

Definitions

  • This technology relates to information processing devices, information processing methods, and programs that can be applied to detect voice and the like.
  • Patent Document 1 describes an acoustic signal processing device that estimates the direction of a sound source.
  • In this device, the surrounding sound is captured by a plurality of microphones, and a plurality of acoustic signals is generated.
  • The cross-correlation value between the acoustic signals of the microphones is calculated as a sound spatial feature.
  • The sound source direction of the target sound is estimated using this sound spatial feature.
  • The reliability of the sound source direction estimate is calculated using a higher-order statistic of the sound spatial feature (Patent Document 1, paragraphs [0035], [0040], [0044], FIG. 2, etc.).
  • In view of the above, an object of the present technology is to provide an information processing device, an information processing method, and a program capable of accurately detecting the direction of a target wave together with other incidental information.
  • The information processing device includes an acquisition unit and an output unit.
  • The acquisition unit acquires feature data of a plurality of signals obtained by observing a target wave. Based on the acquired feature data, the output unit outputs a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.
  • That is, a three-dimensional vector is output from the feature data of the plurality of signals obtained by observing the target wave.
  • This three-dimensional vector represents both the direction information of the arrival direction of the target wave and the incidental information about the target wave. In this way, the direction information and the incidental information are collectively output as one vector, which makes it possible to detect the direction of the target wave and the other incidental information with high accuracy.
  • The output unit may output the three-dimensional vector so that the direction of the three-dimensional vector represents the direction information and the magnitude of the three-dimensional vector represents the incidental information.
  • The output unit may output the three-dimensional vector so that the direction information and the incidental information are calculated by performing polar coordinate conversion on the three-dimensional vector.
  • The direction information may include a horizontal angle and an elevation angle indicating the arrival direction of the target wave.
  • The output unit may perform polar coordinate conversion of the three-dimensional vector to calculate the direction information and the incidental information.
  • The target wave may be voice.
  • The plurality of signals may be sound signals obtained by observing the voice.
  • The direction information may be information indicating the arrival direction of the voice.
  • The incidental information may include any one of the volume of the voice, the existence probability of the voice, or the reliability regarding the arrival direction of the voice.
  • The output unit may output the three-dimensional vector representing the direction information and the incidental information for each frequency component included in the sound signal.
  • The incidental information may be the volume of the voice for each frequency component.
  • The output unit may calculate a voice signal representing the amplitude spectrum of the voice based on the three-dimensional vectors output for each frequency component.
  • The output unit may synthesize the three-dimensional vectors output for each frequency component to calculate a first vector representing the arrival direction of the voice.
  • The output unit may calculate the direction information and the incidental information based on the first vector.
  • The output unit may calculate a second vector by synthesizing the three-dimensional vectors output within a predetermined period, and calculate the arrival direction of the voice in the predetermined period based on the second vector.
  • The plurality of signals may be the sound signals detected by each of a plurality of sound collectors arranged at different positions from each other.
  • The feature data may include the amplitude spectrum of each of the plurality of signals and the phase difference spectrum between the plurality of signals.
  • The output unit may be a learner that outputs the three-dimensional vector corresponding to input data and that is trained using an error based on the Euclidean distance between the output three-dimensional vector and the answer vector corresponding to that input data.
  • The information processing method is an information processing method executed by a computer system, and includes acquiring feature data of a plurality of signals obtained by observing a target wave, and outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.
  • A program according to the present technology causes a computer system to execute these steps.
  • FIG. 10 is a data plot showing the volume of the voice calculated from the three-dimensional vector shown in FIG. 9. FIG. 11 is a graph of the overall vector calculated from the three-dimensional vector shown in FIG. 9. FIG. 12 is a graph of the sound source direction and the volume calculated from the overall vector shown in FIG. 11.
  • FIG. 1 is a block diagram showing a configuration example of a processing unit according to a first embodiment of the present technology.
  • The processing unit 100 is a calculation unit that calculates information on a specific sound to be observed from sound signals obtained by observing sound (sound waves). As will be described later, the processing unit 100 executes a calculation that computes the information of the voice 2, with the voice 2 of the human 1 as the observation target.
  • the processing unit 100 is used by being connected to the microphone array 10.
  • the microphone array 10 has a plurality of microphones 11.
  • the microphone 11 is an element that detects surrounding sounds and outputs a sound signal corresponding to the detected sound, and functions as a sound collector.
  • The sound signal output from the microphone 11 is an electric signal whose amplitude changes with time according to the surrounding sound. The time variation of this amplitude represents the pitch, loudness, waveform, and the like of the sound.
  • The sound signal is typically output as an analog signal and converted into a digital signal using an A/D converter or the like (not shown).
  • The specific configuration of the microphone 11 is not limited; any element capable of detecting the surrounding sound and outputting a corresponding sound signal may be used as the microphone 11.
  • Around the microphone array 10, the voice 2 emitted by the human 1 is generated. The plurality of signals output from the microphone array 10 are therefore sound signals obtained by observing the voice 2. In addition to the voice 2, other sounds such as the noise 3 are also generated around the microphone array 10, so the sound signals 5 include components corresponding to the noise 3 and the like in addition to the voice 2.
  • the voice 2 and the noise 3 generated around the microphone array 10 are schematically illustrated by using arrows.
  • The plurality of microphones 11 constituting the microphone array 10 are arranged at positions different from one another, so the plurality of signals output from the microphone array 10 are sound signals detected at different positions. For example, even when the same voice 2 is detected, the timing at which the voice 2 is detected, the loudness of the detected voice 2, and the like differ for each microphone 11. The sound signal output by each microphone 11 is thus a signal corresponding to the position at which that microphone 11 is arranged.
  • the microphone array 10 is mounted on, for example, a robot or the like.
  • a plurality of microphones 11 are arranged in a housing such as a robot.
  • the microphone array 10 may be mounted on a stationary device or the like.
  • a plurality of microphones 11 may be arranged in an indoor space, a vehicle interior space, or the like to form a microphone array 10.
  • the microphone array 10 may include at least two microphones 11.
  • the microphone array 10 is composed of four or more microphones 11.
  • the specific configuration of the microphone array 10 is not limited.
  • the processing unit 100 has a hardware configuration required for a computer such as a CPU and a memory (RAM, ROM). Various processes are executed by the CPU loading the program stored in the ROM into the RAM and executing the program.
  • the program is installed in the processing unit 100, for example, via various recording media. Alternatively, the program may be installed via the Internet or the like.
  • As the processing unit 100, a device such as a PLD (Programmable Logic Device) such as an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit), may be used.
  • the processing unit 100 corresponds to an information processing device.
  • the pre-processing unit 20, the vector estimation unit 21, and the post-processing unit 22 are realized as functional blocks. Then, the information processing method according to the present embodiment is executed by these functional blocks.
  • dedicated hardware such as an IC (integrated circuit) may be appropriately used.
  • The preprocessing unit 20 acquires the feature data 6 of the plurality of sound signals 5 in which the voice 2 is observed. Specifically, the preprocessing unit 20 reads the plurality of sound signals 5 (multi-channel sound signals 5) output from the microphone array 10 and calculates the feature data 6 based on the read sound signals 5. That is, the preprocessing unit 20 acquires the feature data 6 by calculating it from the plurality of sound signals 5. In the present embodiment, the preprocessing unit 20 corresponds to the acquisition unit.
  • The feature data 6 is data that represents features of the plurality of sound signals 5. For example, a predetermined conversion process is executed on the sound signals 5 to generate data representing their characteristics; the sound signals 5 themselves can also be used as the feature data 6. In the present embodiment, a Fourier transform is executed on the sound signals 5, and the amplitude spectrum, the phase difference spectrum, and the like of the sound signals 5 are calculated as the feature data 6. This point will be described in detail later with reference to FIG. 4 and the like.
  • The vector estimation unit 21 outputs a three-dimensional vector P representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave, based on the feature data 6 acquired by the preprocessing unit 20.
  • The vector estimation unit 21 is configured by a learner trained to receive the feature data 6 as input and output a three-dimensional vector P representing the direction information and the incidental information.
  • The target wave is the sound (sound wave) to be observed by the processing unit 100.
  • In the present embodiment, the voice 2 emitted by the human 1 is set as the observation target; that is, the target wave is the voice 2.
  • All the voices 2 emitted by each human 1 are target waves.
  • The target wave is not limited to the voice 2 of an unspecified number of humans 1; for example, the voice 2 of a specific human 1 can also be the target wave.
  • A specific sound such as a clap or the ringing of a bell may be set as the target wave.
  • The ambient noise 3 and the like may also be set as the target wave.
  • The target wave can be set arbitrarily according to the purpose of the processing unit 100 and the like.
  • The direction information is information indicating the arrival direction of the voice 2. That is, the direction information indicates the direction (sound source direction) in which the human 1 who emitted the voice 2 is located. In the following, the arrival direction of the voice 2 may be simply described as the sound source direction.
  • Reference coordinates are set in the microphone array 10 described above.
  • The direction information indicates the direction from which the voice 2 arrives with respect to the origin of the reference coordinates, that is, the direction in which the human 1 who emitted the voice 2 is located as viewed from the reference coordinates.
  • The method of setting the reference coordinates is not limited and can be set arbitrarily.
  • The incidental information is information obtained along with the target voice 2, and is expressed as a one-dimensional value (the norm of the vector).
  • In the present embodiment, the incidental information is set to the volume of the voice 2, more specifically, the magnitude (power) of the voice 2 emitted by the human 1 at a certain timing. Besides this, the probability indicating the presence or absence of the voice 2, or the reliability regarding the arrival direction (direction information) of the voice 2, can be set as the incidental information.
  • The specific content of the incidental information is not limited; for example, an arbitrary one-dimensional quantity that can be calculated from the feature data 6 may be set as the incidental information.
  • FIG. 2 is a schematic diagram for explaining the three-dimensional vector P.
  • FIG. 2 illustrates a Cartesian coordinate system represented by the X-axis, Y-axis, and Z-axis that are orthogonal to each other. This Cartesian coordinate system becomes the reference coordinate.
  • the thick arrow in the figure is an example of the three-dimensional vector P output from the vector estimation unit 21.
  • The components of the three-dimensional vector P along the X-axis, Y-axis, and Z-axis are denoted x, y, and z, respectively.
  • The vector estimation unit 21 outputs the components (x, y, z) as the three-dimensional vector P.
  • The vector estimation unit 21 (learner) is trained so that the direction of the three-dimensional vector P is the arrival direction of the voice 2 and the magnitude I of the three-dimensional vector P is the value of the incidental information. That is, the vector estimation unit 21 outputs the three-dimensional vector P so that its direction represents the direction information and its magnitude represents the incidental information. For example, as viewed from the origin O, the direction indicated by the three-dimensional vector P is the sound source direction, and its magnitude represents the value of the incidental information (such as the volume of the voice 2). In other words, the components x, y, and z do not directly represent the direction information or the incidental information; rather, the direction information and the incidental information are expressed by the vector that the components represent.
  • Accordingly, the vector estimation unit 21 is trained so that the horizontal angle θ in the sound source direction, the elevation angle φ in the sound source direction, and the incidental information value I are obtained by converting the three-dimensional vector P into polar coordinates.
  • The horizontal angle θ is the angle representing the azimuth of the vector with respect to the X-axis in the XY plane.
  • the elevation angle ⁇ is an angle representing the inclination of the vector with respect to the XY plane.
  • the direction information includes the horizontal angle ⁇ and the elevation angle ⁇ indicating the sound source direction (the direction of arrival of the sound 2).
  • The horizontal angle θ, the elevation angle φ, and the incidental information value I are expressed by the following equations, respectively:

        θ = atan2(y, x)                   (1)
        φ = atan2(z, √(x² + y²))          (2)
        I = √(x² + y² + z²)               (3)
  • The vector estimation unit 21 outputs the three-dimensional vector P so that the direction information and the incidental information are calculated by performing polar coordinate conversion on the three-dimensional vector P. Therefore, the angles θ and φ representing the sound source direction and the incidental information value I can be easily calculated by converting the three-dimensional vector P into polar coordinates according to equations (1) to (3), as sketched below.
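  • For reference, a minimal sketch of this conversion, assuming the vector components are held in NumPy arrays; the function name is illustrative, not from the patent:

```python
# Minimal sketch of equations (1)-(3): converting the estimated vector
# P = (x, y, z) into the horizontal angle, elevation angle, and the
# incidental information value (the vector's magnitude).
import numpy as np

def polar_decompose(x, y, z):
    theta = np.arctan2(y, x)                  # (1) horizontal angle
    phi = np.arctan2(z, np.hypot(x, y))       # (2) elevation angle
    i_val = np.sqrt(x**2 + y**2 + z**2)       # (3) incidental information
    return theta, phi, i_val
```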
  • Learning data generated based on the sound signal is used for learning of the vector estimation unit 21 (learner).
  • For the input, feature data of sound signals to which teacher labels are attached is used. For example, the feature data (amplitude spectrum and phase difference spectrum) of a sound signal containing a human voice becomes the input data, and a three-dimensional vector P representing the arrival direction (sound source direction) of the voice and the incidental information (volume, etc.) is attached to that feature data as a teacher label. This makes it possible to train the learner so that the vector representing the sound source direction and the incidental information can be estimated via polar coordinate conversion.
  • The method of generating learning data is not limited. For example, sound signals in which the position of the sound source is varied can be simulated by performing a convolution with impulse responses; by repeating this while changing the type of voice, learning data with many different teacher labels can be prepared easily (see the sketch after this list).
  • Learning data obtained by actually recording the target sound may also be used.
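  • A minimal sketch of this data-generation idea, assuming pre-measured (or simulated) multi-channel room impulse responses are available; the function name and array shapes are illustrative assumptions, not from the patent:

```python
# Sketch: simulate an M-channel observation of a dry (clean) voice signal by
# convolving it with the impulse response measured at each microphone for a
# known source position. The known position yields the teacher label.
import numpy as np

def simulate_observation(dry_voice, impulse_responses):
    """dry_voice: (num_samples,) clean source signal.
    impulse_responses: (M, ir_len) one room impulse response per microphone,
    measured or simulated for a given source position (an assumption here)."""
    return np.stack([np.convolve(dry_voice, ir) for ir in impulse_responses])
```

  • A teacher label (answer vector) pointing in the known source direction, with its magnitude set to the chosen incidental quantity, is then attached to the feature data computed from the simulated signals.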
  • NN: Neural Network
  • MLP: Multilayer Perceptron
  • CNN: Convolutional Neural Network
  • RNN: Recurrent Neural Network
  • LSTM: Long Short-Term Memory Network
  • the learner may be configured by using an arbitrary algorithm applicable to estimation of the sound source direction and the like.
  • the learner can be regarded as a function that converts the feature data 6 into a three-dimensional vector P (hereinafter referred to as a function A). Therefore, it can be said that training the learner is a process of optimizing the function A so that the three-dimensional vector P can be calculated appropriately.
  • The post-processing unit 22 converts the three-dimensional vector P into polar coordinates to calculate the direction information and the incidental information. Specifically, the horizontal angle θ and the elevation angle φ in the sound source direction and the incidental information value I are calculated from the three-dimensional vector P according to equations (1) to (3). The post-processing unit 22 can also execute various other operations using the three-dimensional vector P.
  • The vector estimation unit 21 and the post-processing unit 22 described above function as the output unit according to the present embodiment.
  • FIG. 3 is a flowchart showing the basic operation of the processing unit 100.
  • the process shown in FIG. 3 is a process that is repeatedly executed at a predetermined processing rate.
  • the preprocessing unit 20 calculates the feature data 6 of the sound signal 5 (step 101).
  • the vector estimation unit 21 outputs the three-dimensional vector P with the feature data 6 as an input (step 102).
  • The post-processing unit 22 converts the three-dimensional vector P into polar coordinates and calculates the direction information (θ and φ) and the incidental information (I) of the voice 2 (step 103).
  • In this way, the processing unit 100 takes as input quantities representing the features of the multi-channel sound signals, outputs a three-dimensional vector P = (x, y, z) expressing the direction information and the other incidental information, and calculates the information (θ, φ, I) related to the voice 2 by polar coordinate conversion.
  • the processing unit 100 continuously calculates the direction information and the attached information at a predetermined processing rate. This makes it possible to constantly monitor the direction in which the voice 2 is emitted.
  • Steps 101, 102, and 103 will now be described in detail, taking as an example the case where the volume of the voice 2 is set as the incidental information (I). The following description also applies when the incidental information is set to another value.
  • In step 101, the amplitude spectrum of each of the plurality of sound signals 5 and the phase difference spectrum between the plurality of sound signals 5 are calculated as the feature data 6 of the plurality of sound signals 5.
  • the amplitude spectrum is a spectrum representing the intensity of each frequency component.
  • the phase difference spectrum is a spectrum representing the phase difference for each frequency component.
  • The preprocessing unit 20 reads the sound signals 5 (M-channel sound signals 5) output from the M microphones 11, records them in a storage unit such as a buffer, and performs a short-time Fourier transform on each sound signal 5.
  • In the short-time Fourier transform, each target sound signal 5 is divided into sections of a predetermined interval (time frames t), and the Fourier transform is executed on the signal contained in each divided section.
  • The time frames t may overlap one another or may be disjoint.
  • The sound signal 5 output from the m-th microphone 11 at sampling time τ is denoted s_m(τ), and the complex spectrum calculated by the short-time Fourier transform of s_m(τ) is denoted S_m(t, f).
  • The amplitude spectrum |S_m(t, f)| of each complex spectrum S_m(t, f) is calculated. That is, M channels of amplitude spectra are calculated from the M channels of complex spectra.
  • The phase difference spectrum is calculated as arg(S_m(t, f) / S_j(t, f)), where arg is a function that calculates the argument (declination) of a complex number, j is a reference channel, and m represents the channels other than j. That is, M-1 channels of phase difference spectra are calculated from the M channels of complex spectra.
  • The input section length Ti is set longer than the interval of the time frames t, so the input section contains a plurality of time frames t. The preprocessing unit 20 therefore outputs spectrum data for 2M-1 channels, consisting of amplitude spectra for M channels and phase difference spectra for M-1 channels.
  • The data size of the spectrum data is the number of channels × the section length Ti × the number of frequency bins F. The input data Di is thus expressed as Di(c, t, f), where c is an index indicating each channel of the spectrum data. A sketch of this feature extraction follows.
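  • A minimal sketch of this feature extraction, assuming SciPy's STFT; the function name, sampling rate, and window length are illustrative assumptions, not values from the patent:

```python
# Sketch: compute the input data Di(c, t, f) from M-channel sound signals,
# i.e. M amplitude spectra plus M-1 phase-difference spectra (2M-1 channels).
import numpy as np
from scipy.signal import stft

def compute_feature_data(signals, fs=16000, nperseg=512, ref_ch=0):
    """signals: (M, num_samples) multi-channel sound signals s_m(tau)."""
    # Complex spectra S_m(t, f) via the short-time Fourier transform.
    _, _, S = stft(signals, fs=fs, nperseg=nperseg)      # shape (M, F, T)
    amplitude = np.abs(S)                                # M amplitude spectra
    # Phase-difference spectra arg(S_m / S_j) against reference channel j.
    phase_diff = np.angle(S / (S[ref_ch] + 1e-12))       # shape (M, F, T)
    phase_diff = np.delete(phase_diff, ref_ch, axis=0)   # keep M-1 channels
    Di = np.concatenate([amplitude, phase_diff], axis=0) # (2M-1, F, T)
    return np.transpose(Di, (0, 2, 1))                   # Di(c, t, f)
```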
  • FIG. 4 is a data plot showing an example of feature data.
  • FIG. 4 shows plots representing the spectrum data of the amplitude spectra and the phase difference spectra.
  • The four plots from the top are the amplitude spectra (|S_m(t, f)|), and the three plots below them are the phase difference spectra (arg(S_m(t, f) / S_j(t, f))).
  • In each plot, the horizontal axis is time (time frame t) and the vertical axis is frequency (frequency bin f).
  • The gray-scale color of each point represents the amplitude or the phase difference.
  • In sections where the amplitude spectrum is dark (large amplitude), the observed sound includes, for example, the voice 2 that is the target wave, the ambient noise 3, and the like, and a phase difference corresponding to the deviation in the timing at which each microphone 11 detects the sound appears.
  • In sections where the amplitude spectrum is gray (small amplitude), the sound is relatively quiet or only the noise 3 is present; in this case, the phase difference for each frequency is substantially random.
  • The data section included in the input section length Ti is indicated by the solid black border.
  • The data of each plot included in this section becomes the input data Di(c, t, f) input to the vector estimation unit 21.
  • In the example shown in FIG. 4, M = 4, so the number of channels is 2M-1 = 7 and the data size of the input data Di(c, t, f) is 7 × Ti × F.
  • The three-dimensional vector P is estimated by the vector estimation unit 21 to which the input data Di(c, t, f) is input.
  • The three-dimensional vector P is output over a section of interval length To (hereinafter described as the output section length To).
  • That is, the learner constituting the vector estimation unit 21 functions as a function A that converts the input Di into the output Do.
  • the function A is optimized and determined by a machine learning algorithm such as deep learning. It should be noted that among the parameters constituting the function A, there may be a parameter for accumulating the past processing results. By using such past processing results for the optimization of the function A, it is possible to improve the estimation accuracy of the sound source direction and the detection accuracy of the attached information.
  • FIG. 5 is a graph of the three-dimensional vector P output from the feature data shown in FIG. 4.
  • graphs of each component x (t), y (t), and z (t) of the three-dimensional vector P are shown in order from the top.
  • the horizontal axis of each graph is time, and the vertical axis is the size of each component.
  • the scale of the vertical axis is appropriately set for each graph.
  • The direction information and the incidental information are calculated from the three-dimensional vector P(t) by the following polar coordinate conversion; equations (4) to (6) correspond to equations (1) to (3) described with reference to FIG. 2:

        θ(t) = atan2(y(t), x(t))                      (4)
        φ(t) = atan2(z(t), √(x(t)² + y(t)²))          (5)
        I(t) = √(x(t)² + y(t)² + z(t)²)               (6)

  • Equation (4) gives the horizontal angle θ(t) in the sound source direction in time frame t, equation (5) gives the elevation angle φ(t) in the sound source direction in time frame t, and equation (6) gives the incidental information value I(t) in time frame t, here the volume of the voice 2.
  • In this way, the post-processing unit 22 calculates the sound source direction and the incidental information (the volume of the voice 2) for each frame from the three-dimensional vector P(t).
  • Let V_m(t, f) denote the component of the voice 2 in the complex spectrum S_m(t, f) (equation (7)). The vector estimation unit 21 (function A) is trained so that I(t), the magnitude of the three-dimensional vector P(t), becomes the power of the voice 2 at a specific microphone 11 (here, the k-th); in this case, I(t) is expressed by equation (8).
  • That is, the function A is optimized so that I(t) calculated by equation (6) satisfies the relationship of equation (8).
  • Ideally, even if a sound signal 5 disturbed by the noise 3 is input, the incidental information I(t) output from the vector estimation unit 21 represents only the power (volume) of the voice 2, regardless of the power of the noise 3. This corresponds to detection of the voice 2. Therefore, by setting the power of the voice 2 as the incidental information, it is possible to realize voice activity detection (VAD) or the like, which detects the sections in which the voice 2 occurs.
  • For example, the power may be set to 0 when the voice 2 is absent, and expressed on a logarithmic scale when the voice 2 is present; in this case, I(t) is expressed by equation (9).
  • the method of expressing the volume of the voice 2 is not limited.
  • FIG. 6 is a graph of the sound source direction and the volume of the voice 2 calculated from the three-dimensional vector shown in FIG. 5.
  • FIG. 6 shows graphs of the horizontal angle ⁇ (t) in the sound source direction, the elevation angle ⁇ (t) in the sound source direction, and the volume I (t) of the sound 2 in this order from the top.
  • the horizontal axis of each graph is time.
  • the vertical axis of the graph of the horizontal angle ⁇ (t) and the elevation angle ⁇ (t) is the angle.
  • the vertical axis of the graph of the volume I (t) of the voice 2 represents the loudness (power) of the sound.
  • The values of θ(t) and φ(t) change from 0° to a constant angle at each peak of the graph of I(t). In the example shown in FIG. 6, therefore, the voice 2 emitted by the human 1 from the same direction is detected. If, for example, conversations of humans 1 at different positions were observed, the direction of the human 1 who uttered each voice 2 would be estimated as the sound source direction for each peak of the voice 2. As described above, in the present embodiment, it is possible to accurately detect the direction in which the human 1 who emitted the voice 2 is present, together with the volume of the voice 2.
  • the target section corresponds to a predetermined period.
  • The aggregation of the three-dimensional vectors P is executed by the post-processing unit 22. Specifically, the sum of each component x(t), y(t), and z(t) of the three-dimensional vectors P(t) output in the target section is calculated. For example, to acquire the sound source direction of the utterance immediately before a certain time t_c, a time t_p earlier than t_c is used, and the sums x_u, y_u, and z_u of the components are calculated as:

        x_u = Σ x(t),  y_u = Σ y(t),  z_u = Σ z(t)   (t = t_p, ..., t_c)   (10)
  • The time t_p corresponds to the start time of the target section, and the time t_c corresponds to its end time. Therefore, x_u, y_u, and z_u are the components of a vector (hereinafter referred to as the aggregate vector) obtained by synthesizing the three-dimensional vectors P output in the target section.
  • In the present embodiment, the aggregate vector corresponds to the second vector.
  • Polar coordinate conversion is executed on the aggregate vector whose components x_u, y_u, and z_u are calculated according to equation (10). The horizontal angle θ_u and the elevation angle φ_u in the sound source direction of the utterance immediately before the time t_c are calculated as:

        θ_u = atan2(y_u, x_u)                     (11)
        φ_u = atan2(z_u, √(x_u² + y_u²))          (12)
  • Here the sum of each component over the target section is calculated, but the average of each component over the target section may be calculated instead; that is, x_u, y_u, and z_u in equation (10) are divided by the number of time frames included in the target section.
  • The vector represented by these averages is likewise an aggregate vector calculated by synthesizing the three-dimensional vectors P.
  • In this way, the aggregate vector is calculated by synthesizing the three-dimensional vectors P output in the target section, and the arrival direction of the voice 2 in the target section is calculated based on the aggregate vector, as in the sketch below.
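  • A minimal sketch of this time aggregation (equations (10) to (12)), under the same illustrative NumPy conventions as the earlier sketches:

```python
# Sketch: aggregate the per-frame vectors P(t) over the target section
# t_p..t_c (eq. (10)) and convert the aggregate vector to a direction
# (eqs. (11), (12)). Frames without voice contribute near-zero vectors.
import numpy as np

def utterance_direction(P, t_p, t_c):
    """P: (3, T) array of per-frame vectors P(t)."""
    x_u, y_u, z_u = P[:, t_p:t_c + 1].sum(axis=1)
    theta_u = np.arctan2(y_u, x_u)               # (11) horizontal angle
    phi_u = np.arctan2(z_u, np.hypot(x_u, y_u))  # (12) elevation angle
    return theta_u, phi_u
```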
  • The target section (time t_p to time t_c) may include sections without the voice 2.
  • In sections without the voice 2, the values of x(t), y(t), and z(t) are sufficiently small, ideally 0. The components from such sections therefore have little influence on the calculation result, and the sound source direction for the sections of the target section that contain the voice 2 can be acquired with high accuracy.
  • Alternatively, a method is conceivable in which the sections of the target section that correspond to the voice 2 are identified first and the direction is then estimated based on the result.
  • In that case, heuristic processing using various parameters, empirical rules, and the like for judging the certainty of the voice sections may be required, and the estimation accuracy may be lowered.
  • In the present embodiment, the sound source direction for the sections containing the voice 2 is easily calculated simply by synthesizing the three-dimensional vectors P over the target section. That is, no heuristic processing for determining the voice sections is necessary, and the sound source direction can be estimated with high accuracy.
  • The vector estimation unit 21 (function A) may instead be trained so that I(t), the magnitude of the three-dimensional vector P(t) in equation (6), is the existence probability of the voice 2.
  • The existence probability of the voice 2 is a probability indicating whether or not the voice 2 is present.
  • In this case, I(t) is expressed by equation (13).
  • The vector estimation unit 21 optimized according to equation (13) outputs, for example, a three-dimensional vector P having a magnitude of 0 to 1.
  • The three-dimensional vector P may be output as it is, with I(t) taking a value from 0 to 1. This makes it possible to realize an application that performs a predetermined process when the voice 2 is likely to exist (for example, when the existence probability is 0.5 or more). Alternatively, the output may be controlled so that I(t) takes a value of either 0 or 1, which simplifies the subsequent processing.
  • the method of setting the existence probability of voice 2 is not limited.
  • As the power used to define the existence probability, the average of the voice 2 power over the plurality of microphones 11 included in the microphone array 10 may be used.
  • In this case, the function A is optimized so that the existence probability of the voice 2 becomes 1 when the average power is larger than a predetermined threshold ε.
  • The predetermined threshold ε can be set arbitrarily according to the configuration of the microphones 11 and the like; a sketch of such a labeling rule follows.
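  • A minimal sketch of such a teacher-label rule; the function name and inputs are illustrative assumptions:

```python
# Sketch: teacher value for I(t) as an existence probability, set to 1 when
# the voice power averaged over the M microphones exceeds a threshold.
import numpy as np

def existence_label(voice_powers, epsilon):
    """voice_powers: (M,) per-microphone voice power in one time frame."""
    return 1.0 if np.mean(voice_powers) > epsilon else 0.0
```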
  • Further, I(t), the magnitude of the three-dimensional vector P(t), may be set to the signal-to-noise ratio between the voice 2 and the noise 3. In this case, I(t) is expressed by, for example, equation (14).
  • The estimation accuracy of the sound source direction generally correlates with the signal-to-noise ratio: when the signal-to-noise ratio is small the estimation accuracy tends to be low, and when it is large the accuracy tends to be high. Therefore, by setting the power ratio between the voice 2 and the noise 3 as the incidental information, the output value I(t) can be interpreted as the reliability of the sound source direction estimate for each time frame. That is, using equation (14) sets the reliability regarding the arrival direction of the voice 2 as the incidental information.
  • the method for expressing the signal-to-noise ratio is not limited to the method represented by equation (14).
  • the signal-to-noise ratio may be expressed using the average values of the powers of the voice 2 and the noise 3 detected by the plurality of microphones 11 included in the microphone array 10.
  • More generally, an arbitrary parameter capable of expressing the reliability of the arrival direction of the voice 2 may be used as I(t).
  • If an erroneous sound source direction estimate is adopted, the quality of the user experience may be significantly impaired.
  • One example is an application in which a robot turns toward the user when the user speaks. In this case, if the estimated sound source direction is erroneous, the robot may turn in an unrelated direction when the user speaks.
  • When the reliability is low, alternative processing is therefore executed without adopting the sound source direction estimate at that time.
  • a process of notifying the user that the sound source direction could not be estimated or that the reliability is low is executed. Examples of the notification method include execution of a gesture indicating that the voice 2 could not be heard, display of a message, lighting of a lamp, and the like. This avoids the situation where the robot turns in an unrelated direction.
  • a process of switching the method of estimating the direction in which the user is located from a method using the microphone 11 to another method such as a method using the camera is executed. That is, when it is difficult to estimate the direction by the sound signal due to the influence of noise 3 or the like, a process of searching for a user by using image recognition or the like is executed. This makes it possible to properly detect the direction in which the user is, even when the estimation of the sound source direction does not work. In this way, by performing the alternative processing based on the reliability of the sound source direction estimation, it is possible to sufficiently avoid the deterioration of the quality of the user experience.
  • Input data with a teacher label is used for learning of the learner constituting the vector estimation unit 21.
  • This teacher label is a vector (answer vector) representing the sound source direction, volume, etc., which should be estimated from the corresponding input data.
  • The accuracy of the learner is evaluated by comparing the three-dimensional vector P that the learner outputs for the input data with the answer vector.
  • the Euclidean distance between the three-dimensional vector P and the answer vector is calculated.
  • The Euclidean distance is the distance in the three-dimensional Euclidean space represented by the three-dimensional Cartesian coordinate system described with reference to FIG. 2.
  • This Euclidean distance can represent the amount of deviation of the three-dimensional vector P with respect to the answer vector representing the correct answer.
  • In the present embodiment, the mean squared error (MSE: Mean Squared Error) is calculated using this Euclidean distance.
  • the method of expressing the error is not limited.
  • That is, the vector estimation unit 21 is a learner that outputs the three-dimensional vector P corresponding to the input data and uses an error based on the Euclidean distance between the output three-dimensional vector P and the answer vector corresponding to that input data for learning.
  • When the Euclidean distance is small, the error of the learner is small; when the Euclidean distance is large, the error of the learner is large.
  • Since the output format of the vector estimation unit 21 (learner) is a three-dimensional vector P that expresses the sound source direction and the incidental information in an integrated manner, the error of the three-dimensional vector P can be easily calculated by computing the Euclidean distance from the answer vector.
  • Moreover, the three parameters of horizontal angle θ, elevation angle φ, and incidental information can be evaluated at the same time. In a learner that outputs the horizontal angle θ, the elevation angle φ, etc. directly, a rule for identifying 0° with 360° must be provided, and heuristic processing is required to calculate the error. By using the format that outputs a three-dimensional vector P as in the present disclosure, such heuristic processing can be avoided and highly accurate error evaluation can be performed, which makes it possible to dramatically improve the learning accuracy of the learner.
  • For learning, an error backpropagation method that adjusts the weights using the error may be used. Even with such an algorithm, expressing the sound source direction not as angles but as a three-dimensional vector P in three-dimensional Euclidean space enables stable error backpropagation, making algorithms using backpropagation easy to implement. A sketch of the loss follows.
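  • A minimal sketch of this error calculation; names are illustrative:

```python
# Sketch: mean squared error of the Euclidean distance between the output
# three-dimensional vectors and the answer vectors.
import numpy as np

def euclidean_mse(pred, answer):
    """pred, answer: (batch, 3) arrays of three-dimensional vectors."""
    return np.mean(np.sum((pred - answer) ** 2, axis=-1))
```

  • Because this loss is a plain distance in three-dimensional Euclidean space, there is no 0°/360° wrap-around to handle, which is the advantage described above.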
  • As described above, in the processing unit 100 according to the present embodiment, a three-dimensional vector P is output from the feature data 6 of the plurality of sound signals 5 in which the voice 2 is observed.
  • This three-dimensional vector P represents both the direction information of the arrival direction of the voice 2 and the incidental information regarding the voice 2.
  • The direction information and the incidental information are thus collectively output as one vector, which makes it possible to detect the direction of the voice 2 and the other incidental information (the volume of the voice 2) with high accuracy.
  • When sound source direction estimation and voice detection are configured as separate algorithms, it is generally difficult to optimize both as a whole. For example, if the voice could be detected in advance, the direction could be estimated with higher accuracy; conversely, if the direction of the voice could be estimated in advance, the voice could be detected with higher accuracy. Each optimization thus requires the other's processing result, and as a result, algorithms individually optimized for each process may have to be adopted.
  • In the present embodiment, the vector estimation unit 21 outputs a single three-dimensional vector P representing both the sound source direction and the volume (incidental information) of the voice 2.
  • The three-dimensional vector P is thus a vector representing both the estimation result of the sound source direction and the detection result of voice detection. That is, by outputting the three-dimensional vector P, a plurality of problems can be solved optimally at the same time. As a result, the estimation accuracy of the sound source direction and the detection accuracy of the voice 2 can be significantly improved, and the calculation efficiency can be sufficiently increased. In addition, separate algorithms need not be developed, so development costs can be significantly reduced.
  • The present inventor evaluated the sound source direction estimates obtained with the three-dimensional vector P according to the present technology, using data (sound signals 5) detected by a microphone array 10 mounted on a specific device.
  • For the evaluation, the proportion of cases in which the error of the horizontal angle θ fell within a predetermined angle range was measured in multiple environments and compared with other sound source direction estimation methods. As the predetermined angle range, a range set based on the angle of view of a camera was adopted.
  • The method of expressing the sound source direction and the incidental information with one vector was found to greatly improve the estimation accuracy of the sound source direction. This makes it possible to improve the operating accuracy of systems that perform voice processing and the like, and to provide highly reliable voice applications using the present technology.
  • FIG. 7 is a block diagram showing a configuration example of the processing unit 200 according to the second embodiment.
  • the processing unit 200 is an arithmetic unit that calculates information of voice 2, and has a pre-processing unit 220, a vector estimation unit 221 and a post-processing unit 222.
  • the preprocessing unit 220 is configured in the same manner as the preprocessing unit 20 shown in FIG. 1, for example, and outputs the feature data 6 of the plurality of sound signals 5 output from the microphone array 10. Note that in FIG. 7, the microphone array is not shown.
  • The vector estimation unit 221 outputs a three-dimensional vector P representing the direction information and the incidental information for each frequency component included in the sound signals 5, based on the feature data 6. Specifically, the learner constituting the vector estimation unit 221 is trained to output a three-dimensional vector P for each frequency bin f. The mean squared error between the three-dimensional vector P and the answer vector, calculated for each frequency bin f, is used for the learning.
  • The post-processing unit 222 executes conversion and aggregation processing on the three-dimensional vectors P output for each frequency component (frequency bin), and calculates the direction information indicating the sound source direction and the incidental information regarding the voice 2.
  • FIG. 8 is a data plot showing an example of feature data.
  • The feature data 6 (amplitude spectra and phase difference spectra) is calculated by the preprocessing unit 220 in the same manner as described with reference to FIG. 4.
  • FIG. 8 shows plots representing the spectrum data of the amplitude spectra and the phase difference spectra.
  • The output data Do is expressed as Do(c, t, f), where c is an index representing each component of the three-dimensional vector P.
  • The data size of Do(c, t, f) is 3 × To × F.
  • That is, the vector estimation unit 221 functions as a function B that converts the input Di into the output Do.
  • The case where the volume of the voice 2 is set as the incidental information targeted by the function B will be described below as an example.
  • FIG. 9 is a data plot showing the three-dimensional vector P output from the feature data shown in FIG. 8.
  • FIG. 9 shows, in order from the top, data plots of each component x(t, f), y(t, f), and z(t, f) of the three-dimensional vector P(t, f).
  • In each plot, the horizontal axis is time and the vertical axis is frequency; the value of each component is shown in gray scale.
  • The data section included in the output section length To is indicated by the solid black border. The data of each plot included in this section becomes the output data Do(c, t, f) output from the vector estimation unit 221 (function B).
  • the sound source direction and attached information are calculated from the output data Do (c, t, f).
  • For each time frame t and frequency bin f, polar coordinate conversion is executed as in the following equations (15) to (17), which correspond to equations (1) to (3) described with reference to FIG. 2:

        θ(t, f) = atan2(y(t, f), x(t, f))                          (15)
        φ(t, f) = atan2(z(t, f), √(x(t, f)² + y(t, f)²))           (16)
        I(t, f) = √(x(t, f)² + y(t, f)² + z(t, f)²)                (17)

  • Equation (15) gives the horizontal angle θ(t, f) in the sound source direction, equation (16) gives the elevation angle φ(t, f) in the sound source direction, and equation (17) gives the incidental information value I(t, f), here the volume of the voice 2.
  • In this way, the post-processing unit 222 calculates the sound source direction and the incidental information (the volume of the voice 2) for each time frame and frequency from the three-dimensional vector P(t, f).
  • FIG. 10 is a data plot showing the volume of voice 2 calculated from the three-dimensional vector P shown in FIG.
  • the horizontal axis of FIG. 10 is time, and the vertical axis is frequency. Further, the volume (power) of the voice 2 in each time frame t and the frequency bin f is shown in gray scale.
  • Here, the function B is optimized so that I(t, f) becomes the power (spectrogram) of the voice 2 for each frequency bin at a specific microphone (here, the k-th).
  • Using the component V_k(t, f) of the voice 2 in the complex spectrum, introduced in equation (7), I(t, f) is expressed by equation (18), and the function B is optimized so that I(t, f) calculated by equation (17) satisfies the relationship of equation (18).
  • As a result, even if a sound signal 5 disturbed by noise is input, the output incidental information ideally represents the power (volume) of the voice 2 for each frequency bin, regardless of the presence or absence of the noise 3.
  • That is, the data plot shown in FIG. 10 represents a voice signal containing only the response of the voice 2, extracted from the original sound signal that included the noise 3 and the like.
  • The frequency distribution of I(t, f) calculated according to equation (17) is therefore the frequency distribution of the power of the voice 2 in time frame t, that is, the amplitude spectrum of the voice 2.
  • This amplitude spectrum does not include the spectrum of the noise 3 or the like.
  • In this way, a voice signal representing the amplitude spectrum of the voice 2 is calculated based on the three-dimensional vectors P output for each frequency component. This makes it possible to perform highly accurate voice recognition or the like using a voice signal in which the noise 3 is suppressed, and to significantly improve the processing accuracy of various applications using the voice 2.
  • This speech enhancement process (the process of extracting the voice signal) can also be regarded as voice activity detection (VAD) for each frequency bin. Therefore, in the present embodiment, when the volume of the voice 2 is set as the incidental information, speech enhancement, voice activity detection, and sound source direction estimation are solved by one calculation, providing a single, totally optimized algorithm that performs three processes at once. A sketch of the extraction follows.
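  • A minimal sketch of recovering the voice spectrogram from the per-frequency vectors, assuming P(t, f) is stored as a (3, T, F) array:

```python
# Sketch: the magnitude of each per-frequency vector (eq. (17)) gives the
# voice power I(t, f) for each time frame and frequency bin, with the
# noise contribution ideally suppressed.
import numpy as np

def voice_spectrogram(P):
    """P: (3, T, F) array of three-dimensional vectors P(t, f)."""
    return np.sqrt(np.sum(P ** 2, axis=0))  # I(t, f), shape (T, F)
```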
  • Next, an overall three-dimensional vector representing the overall sound source direction and the overall incidental information in a certain time frame t is calculated.
  • In the following, this vector calculated from the three-dimensional vectors P(t, f) will be referred to as the overall vector P(t).
  • In the present embodiment, the overall vector P(t) corresponds to the first vector.
  • The three-dimensional vectors P(t, f) output from the vector estimation unit 221 are summed over the frequency bins, so that the components x(t), y(t), and z(t) of the overall vector P(t) are given by:

        x(t) = Σ_f x(t, f),  y(t) = Σ_f y(t, f),  z(t) = Σ_f z(t, f)   (20)
  • The direction of the overall vector P(t) calculated by equation (20) represents the arrival direction (sound source direction) of the voice 2 generated at timing t.
  • In this way, the overall vector P(t) representing the arrival direction of the voice 2 is calculated by synthesizing the three-dimensional vectors P(t, f) output for each frequency component.
  • The magnitude of the overall vector P(t) represents the overall value I(t) of the incidental information regarding the voice 2.
  • FIG. 11 is a graph of the entire vector P (t) calculated from the three-dimensional vector P (t, f) shown in FIG.
  • graphs of each component x (t), y (t), and z (t) of the entire vector P (t) are shown in order from the top.
  • the horizontal axis of each graph is time, and the vertical axis is the size of each component.
  • the scale of the vertical axis is appropriately set for each graph.
  • The graphs shown in FIG. 11 are obtained by adding the components output individually for each frequency bin in the frequency direction, and correspond to the components of the three-dimensional vector P(t) described with reference to FIG. 5. That is, by synthesizing the three-dimensional vectors P(t, f) in the post-processing unit 222, a vector (the overall vector P(t)) similar to the three-dimensional vector P(t) output by the vector estimation unit 21 (function A) of the first embodiment can be calculated.
  • The post-processing unit 222 then executes polar coordinate conversion on the overall vector P(t) to calculate the horizontal angle θ(t), the elevation angle φ(t), and the incidental information I(t) of the voice 2.
  • θ(t), φ(t), and I(t) are obtained from the components of the overall vector P(t) in the same way as equations (1) to (3); a sketch follows.
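  • A minimal sketch of computing the overall vector and its polar decomposition, under the same (3, T, F) convention:

```python
# Sketch: sum the per-frequency vectors over the frequency bins (eq. (20))
# to obtain the overall vector P(t), then convert it to the sound source
# direction and the overall incidental information value.
import numpy as np

def overall_direction(P):
    """P: (3, T, F) array of three-dimensional vectors P(t, f)."""
    x, y, z = P.sum(axis=2)                      # eq. (20): sum over f
    theta = np.arctan2(y, x)                     # horizontal angle
    phi = np.arctan2(z, np.hypot(x, y))          # elevation angle
    i_val = np.sqrt(x**2 + y**2 + z**2)          # overall incidental info
    return theta, phi, i_val
```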
  • FIG. 12 is a graph of the sound source direction and the volume calculated from the entire vector shown in FIG.
  • FIG. 12 shows graphs of the horizontal angle ⁇ (t) in the sound source direction, the elevation angle ⁇ (t) in the sound source direction, and the volume I (t) of the sound 2 in this order from the top.
  • the horizontal axis of each graph is time.
  • the vertical axis of the graph of the horizontal angle ⁇ (t) and the elevation angle ⁇ (t) is the angle.
  • the vertical axis of the graph of the volume I (t) of the voice 2 represents the loudness (power) of the sound.
  • At each peak of I(t), the volume of the voice 2 is high, and it can be seen that the voice 2 is detected. Further, the values of θ(t) and φ(t) change from 0° to a constant angle at each peak of I(t); the voices 2 detected as the peaks of I(t) are therefore all emitted from the same direction.
  • Note that when the magnitude I(t, f) of the three-dimensional vector P(t, f) is set to the power of the voice 2 shown in equation (18), the magnitude I(t) of the overall vector P(t) corresponds to the power of the voice 2 shown in equation (19), and can be regarded as the power of the voice 2 shown in equation (9).
  • The present technology is not limited to sound waves. For example, vibration detectors that detect vibrations on or under the ground may be placed at multiple locations, and the feature data (amplitude spectra and phase difference spectra) of the vibration signals output from each detector input to the learner.
  • The learner is trained in advance to output a three-dimensional vector representing the arrival direction of a seismic wave and its intensity based on the feature data of the vibration signals. This makes it possible to detect the arrival direction and intensity of seismic waves with high accuracy.
  • More generally, the present technology can be applied to various wave phenomena that propagate in space, such as electromagnetic waves and gravitational waves.
  • the information processing device may be realized by an arbitrary computer that is configured separately from the processing unit and is connected to the processing unit via wire or wirelessly.
  • the information processing method according to the present technology may be executed by a cloud server.
  • the information processing method according to the present technology may be executed in conjunction with the processing unit and another computer.
  • the information processing method and program according to the present technology can be executed not only in a computer system composed of a single computer but also in a computer system in which a plurality of computers operate in conjunction with each other.
  • the system means a set of a plurality of components (devices, modules (parts), etc.), and it does not matter whether or not all the components are in the same housing. Therefore, a plurality of devices housed in separate housings and connected via a network, and one device in which a plurality of modules are housed in one housing are both systems.
  • The execution of the information processing method and the program according to the present technology by the computer system includes both the case where, for example, the acquisition of the feature data and the output of the three-dimensional vector are executed by a single computer, and the case where each process is executed by a different computer. The execution of each process by a predetermined computer also includes having another computer execute part or all of the process and acquiring the result.
  • the information processing method and program related to this technology can be applied to a cloud computing configuration in which one function is shared by a plurality of devices via a network and processed jointly.
  • In the present disclosure, “same”, “equal”, “orthogonal”, and the like are concepts that include “substantially the same”, “substantially equal”, “substantially orthogonal”, and the like. For example, states included within a predetermined range (for example, a range of ±10%) based on “completely the same”, “completely equal”, “completely orthogonal”, and the like are also included.
  • this technology can also adopt the following configurations.
(1) An information processing device, including: an acquisition unit that acquires feature data of a plurality of signals observing a target wave; and an output unit that outputs, based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.
(2) The information processing device according to (1), wherein the output unit outputs the three-dimensional vector so that the direction of the three-dimensional vector represents the direction information and the magnitude of the three-dimensional vector represents the incidental information.
(3) The information processing device according to (1) or (2), wherein the output unit outputs the three-dimensional vector so that the direction information and the incidental information are calculated by performing polar coordinate conversion on the three-dimensional vector.
(4) The information processing device according to any one of (1) to (3), wherein the direction information includes a horizontal angle and an elevation angle indicating the arrival direction of the target wave.
(5) The information processing device according to any one of (1) to (4), wherein the output unit converts the three-dimensional vector into polar coordinates to calculate the direction information and the incidental information.
(6) The information processing device according to any one of (1) to (5), wherein the target wave is voice, and the plurality of signals are sound signals observing the voice.
(7) The information processing device according to (6), wherein the direction information indicates the arrival direction of the voice.
(8) The information processing device according to (6) or (7), wherein the incidental information includes any one of the volume of the voice, the existence probability of the voice, and the reliability of the arrival direction of the voice.
(9) The information processing device according to any one of (6) to (8), wherein the output unit outputs the three-dimensional vector representing the direction information and the incidental information for each frequency component included in the sound signal.
(10) The information processing device according to (9), wherein the incidental information is the volume of the voice for each frequency component, and the output unit calculates an audio signal representing the amplitude spectrum of the voice based on the three-dimensional vectors output for each frequency component.
(11) The information processing device according to (9) or (10), wherein the output unit synthesizes the three-dimensional vectors output for each frequency component and calculates a first vector representing the arrival direction of the voice.
(12) The information processing device according to (11), wherein the output unit calculates the direction information and the incidental information based on the first vector.
(13) The information processing device according to any one of (6) to (12), wherein the output unit calculates a second vector by synthesizing the three-dimensional vectors output within a predetermined period, and calculates the arrival direction of the voice in the predetermined period based on the second vector.
(14) The information processing device according to any one of (6) to (13), wherein the plurality of signals are the sound signals detected by each of a plurality of sound collectors arranged at positions different from each other.
(15) The information processing device according to any one of (1) to (14), wherein the feature data includes an amplitude spectrum of each of the plurality of signals and a phase difference spectrum between the plurality of signals.
(16) The information processing device according to any one of (1) to (15), wherein the output unit is a learner that outputs the three-dimensional vector corresponding to input data and uses an error according to the Euclidean distance between the output three-dimensional vector and an answer vector corresponding to the input data for learning.
(17) An information processing method executed by a computer system, including: acquiring feature data of a plurality of signals observing a target wave; and outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.
(18) A program that causes a computer system to execute: a step of acquiring feature data of a plurality of signals observing a target wave; and a step of outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.

Abstract

The information processing device according to one embodiment of the present invention is provided with an acquisition unit and an output unit. The acquisition unit acquires feature data on a plurality of signals obtained through observation of a target wave. On the basis of the acquired feature data, the output unit outputs a three-dimensional vector that represents direction information indicative of an arrival direction of the target wave and attribute information pertaining to the target wave.

Description

Information processing device, information processing method, and program
This technology relates to information processing devices, information processing methods, and programs that can be applied to detect voice and the like.
Patent Document 1 describes an acoustic signal processing device that estimates the direction of a sound source. In this device, surrounding sounds are captured by a plurality of microphones, and a plurality of acoustic signals are generated. In addition, the cross-correlation value of the acoustic signals of the microphones is calculated as a sound space feature amount. The sound source direction of the target sound is estimated using this sound space feature amount. Further, in the acoustic signal processing device, the reliability of the sound source direction estimate is calculated by using a higher-order statistic of the sound space feature amount (see paragraphs [0035], [0040], and [0044] and FIG. 2 of the specification of Patent Document 1).
Japanese Unexamined Patent Publication No. 2011-139409
By observing waves that propagate through space, such as the acoustic waves described above, it is possible to estimate the direction from which a wave arrives, the characteristics of the wave, and so on, and various applications are expected. For this reason, there is a demand for a technique for detecting the direction of a target wave and other attached information with high accuracy.
In view of the above circumstances, an object of the present technology is to provide an information processing device, an information processing method, and a program capable of accurately detecting the direction of a target wave and other attached information.
In order to achieve the above object, an information processing device according to one embodiment of the present technology includes an acquisition unit and an output unit.
The acquisition unit acquires feature data of a plurality of signals observing a target wave.
Based on the acquired feature data, the output unit outputs a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.
In this information processing device, a three-dimensional vector is output with the feature data of a plurality of signals observing the target wave as input. This three-dimensional vector represents the direction information of the arrival direction of the target wave and the incidental information regarding the target wave. In this way, the direction information and the incidental information are output together as one vector. This makes it possible to detect the direction of the target wave and other attached information with high accuracy.
The output unit may output the three-dimensional vector so that the direction of the three-dimensional vector represents the direction information and the magnitude of the three-dimensional vector represents the incidental information.
The output unit may output the three-dimensional vector so that the direction information and the incidental information are calculated by performing polar coordinate conversion on the three-dimensional vector.
The direction information may include a horizontal angle and an elevation angle indicating the arrival direction of the target wave.
The output unit may convert the three-dimensional vector into polar coordinates to calculate the direction information and the incidental information.
The target wave may be voice. In this case, the plurality of signals may be sound signals observing the voice.
The direction information may be information indicating the arrival direction of the voice.
The incidental information may include any one of the volume of the voice, the existence probability of the voice, and the reliability of the arrival direction of the voice.
The output unit may output the three-dimensional vector representing the direction information and the incidental information for each frequency component included in the sound signal.
The incidental information may be the volume of the voice for each frequency component. In this case, the output unit may calculate an audio signal representing the amplitude spectrum of the voice based on the three-dimensional vectors output for each frequency component.
The output unit may synthesize the three-dimensional vectors output for each frequency component to calculate a first vector representing the arrival direction of the voice.
The output unit may calculate the direction information and the incidental information based on the first vector.
The output unit may calculate a second vector by synthesizing the three-dimensional vectors output within a predetermined period, and may calculate the arrival direction of the voice in the predetermined period based on the second vector.
The plurality of signals may be the sound signals detected by each of a plurality of sound collectors arranged at positions different from each other.
The feature data may include an amplitude spectrum of each of the plurality of signals and a phase difference spectrum between the plurality of signals.
The output unit may be a learner that outputs the three-dimensional vector corresponding to input data and uses an error according to the Euclidean distance between the output three-dimensional vector and an answer vector corresponding to the input data for learning.
An information processing method according to one embodiment of the present technology is an information processing method executed by a computer system, and includes acquiring feature data of a plurality of signals observing a target wave.
Based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave is output.
A program according to one embodiment of the present technology causes a computer system to execute the following steps:
a step of acquiring feature data of a plurality of signals observing a target wave; and
a step of outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.
FIG. 1 is a block diagram showing a configuration example of a processing unit according to a first embodiment of the present technology.
FIG. 2 is a schematic diagram for explaining a three-dimensional vector.
FIG. 3 is a flowchart showing the basic operation of the control unit.
FIG. 4 is a data plot showing an example of feature data.
FIG. 5 is a graph of the three-dimensional vectors output from the feature data shown in FIG. 4.
FIG. 6 is a graph of the sound source direction and the voice volume calculated from the three-dimensional vectors shown in FIG. 5.
FIG. 7 is a block diagram showing a configuration example of a processing unit according to a second embodiment.
FIG. 8 is a data plot showing an example of feature data.
FIG. 9 is a data plot showing the three-dimensional vectors output from the feature data shown in FIG. 8.
FIG. 10 is a data plot showing the voice volume calculated from the three-dimensional vectors shown in FIG. 9.
FIG. 11 is a graph of the whole vector calculated from the three-dimensional vectors shown in FIG. 9.
FIG. 12 is a graph of the sound source direction and volume calculated from the whole vector shown in FIG. 11.
Hereinafter, embodiments according to the present technology will be described with reference to the drawings.
<First Embodiment>
[Processing unit configuration]
FIG. 1 is a block diagram showing a configuration example of a processing unit according to the first embodiment of the present technology. The processing unit 100 is an arithmetic unit that calculates information on a specific sound to be observed from sound signals obtained by observing sound (sound waves). As will be described later, the processing unit 100 executes calculations that compute information on the voice 2, with the voice 2 of a human 1 as the observation target.
As shown in FIG. 1, the processing unit 100 is used by being connected to a microphone array 10. The microphone array 10 has a plurality of microphones 11. The microphone 11 is an element that detects surrounding sounds and outputs a sound signal corresponding to the detected sound, and functions as a sound collector. The sound signal output from the microphone 11 is an electric signal whose amplitude changes with time according to the surrounding sounds. The time variation of this amplitude represents the pitch, loudness, waveform, and the like of the sound. The sound signal is typically output as an analog signal and converted into a digital signal using an A/D converter or the like (not shown). The specific configuration of the microphone 11 is not limited, and any element capable of detecting surrounding sounds and outputting a sound signal may be used as the microphone 11.
Around the microphone array 10, voice 2 emitted by a human 1 is generated. Therefore, the plurality of signals output from the microphone array 10 are sound signals obtained by observing the voice 2. In addition, not only the voice 2 but also other sounds such as noise 3 are generated around the microphone array 10. Therefore, the sound signals 5 include signals corresponding to the noise 3 and the like in addition to the voice 2. In FIG. 1, the voice 2 and the noise 3 generated around the microphone array 10 are schematically illustrated by using arrows.
Further, the plurality of microphones 11 constituting the microphone array 10 are arranged at positions different from each other. Therefore, the plurality of signals output from the microphone array 10 are sound signals detected by each of the plurality of microphones 11 arranged at different positions. For this reason, for example, even when the same voice 2 is detected, the timing at which the voice 2 is detected, the loudness of the detected voice 2, and the like differ for each microphone 11. Therefore, the sound signal output by each microphone 11 is a signal corresponding to the position where the microphone 11 is arranged.
The microphone array 10 is mounted on, for example, a robot or the like. In this case, the plurality of microphones 11 are arranged in a housing such as that of a robot. Further, for example, the microphone array 10 may be mounted on a stationary device or the like. Alternatively, a plurality of microphones 11 may be arranged in an indoor space, a vehicle interior space, or the like to form the microphone array 10.
Note that the microphone array 10 only needs to include at least two microphones 11. In the present embodiment, the microphone array 10 is composed of four or more microphones 11. For example, by arranging four or more microphones 11 so as not to be included in the same plane, it is possible to properly detect the direction of a sound source and the like. Other than this, the specific configuration of the microphone array 10 is not limited.
The processing unit 100 has the hardware configuration required for a computer, such as a CPU and memory (RAM, ROM). Various processes are executed by the CPU loading the program stored in the ROM into the RAM and executing it. The program is installed in the processing unit 100, for example, via various recording media. Alternatively, the program may be installed via the Internet or the like.
As the processing unit 100, for example, a device such as a PLD (Programmable Logic Device) such as an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit), may be used. In this embodiment, the processing unit 100 corresponds to an information processing device.
When the CPU of the processing unit 100 executes the program according to the present embodiment, a preprocessing unit 20, a vector estimation unit 21, and a post-processing unit 22 are realized as functional blocks. The information processing method according to the present embodiment is executed by these functional blocks. In order to realize each functional block, dedicated hardware such as an IC (integrated circuit) may be used as appropriate.
The preprocessing unit 20 acquires the feature data 6 of the plurality of sound signals 5 in which the voice 2 is observed. Specifically, the preprocessing unit 20 reads the plurality of sound signals 5 (multi-channel sound signals 5) output from the microphone array 10 and calculates the feature data 6 based on each of the read sound signals 5. That is, the preprocessing unit 20 acquires the feature data 6 by calculating the feature data 6 from the plurality of sound signals 5. In the present embodiment, the preprocessing unit 20 corresponds to the acquisition unit.
In the present disclosure, the feature data 6 is data that can represent features of, for example, the plurality of sound signals 5. For example, a predetermined conversion process is executed on the sound signals 5, and data representing the features of the sound signals 5 is generated. Further, for example, the sound signals 5 themselves can be used as the feature data 6. In the present embodiment, a Fourier transform is executed on the sound signals 5, and the amplitude spectrum, the phase difference spectrum, and the like of the sound signals 5 are calculated as the feature data 6. This point will be described in detail later with reference to FIG. 4 and the like.
The vector estimation unit 21 outputs, based on the feature data 6 acquired by the preprocessing unit 20, a three-dimensional vector P representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave. Specifically, the vector estimation unit 21 is configured by a learner trained to take the feature data 6 as input and output the three-dimensional vector P representing the direction information and the incidental information.
In this embodiment, the target wave is a sound (sound wave) to be observed by the processing unit 100. In the processing unit 100, among the sounds (sound waves) detected by the microphones 11, the voice 2 emitted by a human 1 is set as the observation target. That is, the target wave is the voice 2. For example, when there are a plurality of humans 1 around the microphone array 10, all the voices 2 emitted by each human 1 are target waves.
Note that the target wave is not limited to the voices 2 of an unspecified number of humans 1; for example, the voice 2 of a specific human 1 can also be set as the target wave. Further, in addition to the voice 2, a specific sound such as a hand clap or the ringing of a bell may be set as the target wave. Ambient noise 3 or the like may also be set as the target wave. In addition, the target wave can be set arbitrarily according to the purpose of the processing unit 100 and the like.
The direction information is information indicating the arrival direction of the voice 2. That is, it can be said that the direction information is information indicating the direction in which the human 1 who emitted the voice 2 is located (the sound source direction). In the following, the arrival direction of the voice 2 may be simply described as the sound source direction. For example, reference coordinates are set for the microphone array 10 described above. Information indicating the direction from which the voice 2 arrives with respect to the origin of the reference coordinates, that is, the direction in which the human 1 who emitted the voice 2 is located as seen from the reference coordinates, becomes the direction information. The method of setting the reference coordinates is not limited and can be set arbitrarily.
The incidental information is information obtained incidentally to the target voice 2, and is expressed using a one-dimensional value (norm). For example, the incidental information is set to the volume of the voice 2. More specifically, the magnitude (power) of the voice 2 emitted by the human 1 at a certain timing is set as the incidental information. In addition to this, it is possible to set the probability representing the presence or absence of the voice 2, the reliability regarding the arrival direction (direction information) of the voice 2, or the like as the incidental information. The specific content of the incidental information is not limited, and, for example, an arbitrary one-dimensional quantity that can be calculated from the feature data 6 may be set as the incidental information.
FIG. 2 is a schematic diagram for explaining the three-dimensional vector P. FIG. 2 illustrates an orthogonal coordinate system represented by an X-axis, a Y-axis, and a Z-axis that are orthogonal to each other. This orthogonal coordinate system serves as the reference coordinates. The thick arrow in the figure is an example of the three-dimensional vector P output from the vector estimation unit 21. In the following, the X-axis, Y-axis, and Z-axis components of the three-dimensional vector P are denoted x, y, and z, respectively. The vector estimation unit 21 outputs each component of the vector P(x, y, z) as the three-dimensional vector P.
The vector estimation unit 21 (learner) is trained so that the direction of the three-dimensional vector P is the arrival direction of the voice 2 and the magnitude I of the three-dimensional vector P is the value of the incidental information. That is, the vector estimation unit 21 outputs the three-dimensional vector P so that the direction of the three-dimensional vector P represents the direction information and the magnitude of the three-dimensional vector P represents the incidental information. For example, as seen from the origin O, the direction indicated by the three-dimensional vector P is the sound source direction, and its magnitude represents the value of the incidental information (the volume of the voice 2 or the like). In other words, the individual components x, y, and z of the three-dimensional vector P do not directly represent the direction information or the incidental information, but the direction information and the incidental information are expressed by the vector represented by the components x, y, and z.
Further, the vector estimation unit 21 is trained so that the horizontal angle θ of the sound source direction, the elevation angle φ of the sound source direction, and the value I of the incidental information are obtained by converting the three-dimensional vector P into polar coordinates. Here, the horizontal angle θ is an angle representing the orientation (azimuth) of the vector with respect to the X-axis in the XY plane. The elevation angle φ is an angle representing the inclination of the vector with respect to the XY plane. As described above, in the present embodiment, the direction information includes the horizontal angle θ and the elevation angle φ indicating the sound source direction (the arrival direction of the voice 2). The horizontal angle θ, the elevation angle φ, and the incidental information I are expressed by the following equations, respectively.
$$\theta = \tan^{-1}\!\left(\frac{y}{x}\right) \tag{1}$$

$$\phi = \tan^{-1}\!\left(\frac{z}{\sqrt{x^{2}+y^{2}}}\right) \tag{2}$$

$$I = \sqrt{x^{2}+y^{2}+z^{2}} \tag{3}$$
In this way, the vector estimation unit 21 outputs the three-dimensional vector P so that the direction information and the incidental information are calculated by performing polar coordinate conversion on the three-dimensional vector P. Therefore, for example, when calculating the direction information and the incidental information, the angles θ and φ representing the sound source direction and the value I of the incidental information can be easily calculated by converting the three-dimensional vector P into polar coordinates according to equations (1) to (3).
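For illustration only, the polar coordinate conversion of equations (1) to (3) can be sketched in Python with NumPy as follows. This is a minimal sketch, not part of the present disclosure; the function name vector_to_direction is an assumption introduced here, and arctan2 is used in place of the plain arctangent so that all quadrants are handled.

```python
import numpy as np

def vector_to_direction(p):
    """Convert a 3D vector P = (x, y, z) into (theta, phi, I) per Eqs. (1)-(3).

    theta: horizontal angle in the XY plane, phi: elevation from the XY plane,
    I: vector magnitude carrying the incidental information (e.g. voice power).
    """
    x, y, z = p
    theta = np.arctan2(y, x)             # horizontal angle (quadrant-safe tan^-1)
    phi = np.arctan2(z, np.hypot(x, y))  # elevation angle
    intensity = np.linalg.norm(p)        # magnitude I
    return theta, phi, intensity

# Example: a vector at 45 degrees azimuth with magnitude sqrt(3)
theta, phi, I = vector_to_direction(np.array([1.0, 1.0, 1.0]))
```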
Learning data generated based on sound signals is used for training the vector estimation unit 21 (learner). As the learning data, feature data of sound signals to which teacher labels are attached is used. For example, the feature data (amplitude spectra and phase difference spectra) of sound signals including human voice becomes the input data. Further, a three-dimensional vector P representing the arrival direction (sound source direction) of the voice and the incidental information (volume or the like) is attached to the feature data as a teacher label. This makes it possible to train the learner so as to estimate a vector that expresses the sound source direction and the incidental information via polar coordinate conversion.
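The disclosure describes the learner as using an error according to the Euclidean distance between the output three-dimensional vector and the answer (teacher) vector (see configuration (16) above). A minimal sketch of such a loss follows; the batch layout and the mean reduction are assumptions of this sketch.

```python
import numpy as np

def euclidean_loss(p_pred, p_true):
    """Error according to the Euclidean distance between output 3D vectors
    and teacher vectors; both arguments have shape (batch, 3)."""
    return np.mean(np.linalg.norm(p_pred - p_true, axis=-1))
```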
The method of generating the learning data is not limited. For example, by performing the convolution operations used in techniques such as impulse responses, it is possible to simulate sound signals in which the position of the sound source or the like is changed. By doing this while changing the type of voice, it is possible to easily prepare a plurality of pieces of learning data with teacher labels. When a sound other than voice is used as the target wave, learning data obtained by sampling the target sound (applause, a bell, or the like) may be used.
An arbitrary algorithm used in machine learning such as deep learning can be applied to the learner constituting the vector estimation unit 21. For example, as algorithms using a neural network (NN), it is possible to use algorithms (learning models) such as the perceptron, the multilayer perceptron (MLP), the convolutional neural network (CNN), the recurrent neural network (RNN), and the LSTM network (Long Short-Term Memory network). In addition to these, the learner may be configured using any algorithm applicable to estimation of the sound source direction and the like.
The learner can be regarded as a function that converts the feature data 6 into the three-dimensional vector P (hereinafter referred to as function A). Therefore, it can be said that training the learner is a process of optimizing function A so that the three-dimensional vector P can be calculated appropriately.
Returning to FIG. 1, the post-processing unit 22 converts the three-dimensional vector P into polar coordinates to calculate the direction information and the incidental information. Specifically, according to equations (1) to (3) above, the horizontal angle θ and the elevation angle φ of the sound source direction and the value I of the incidental information are calculated from the three-dimensional vector P. In addition, the post-processing unit 22 can execute various calculations using the three-dimensional vector P. The vector estimation unit 21 and the post-processing unit 22 described above function as the output unit according to the present embodiment.
FIG. 3 is a flowchart showing the basic operation of the control unit. The process shown in FIG. 3 is repeatedly executed at a predetermined processing rate. First, the preprocessing unit 20 calculates the feature data 6 of the sound signals 5 (step 101). Next, the vector estimation unit 21 outputs the three-dimensional vector P with the feature data 6 as input (step 102). Then, the post-processing unit 22 converts the three-dimensional vector P into polar coordinates, and the direction information (θ and φ) and the incidental information (I) of the voice 2 are calculated (step 103).
In this way, the processing unit 100 takes quantities representing the features of the multi-channel sound signals as input, outputs a three-dimensional vector P(x, y, z) expressing the direction information and the other incidental information, and calculates (θ, φ, I) as information on the voice 2 by polar coordinate conversion. When the process in step 103 is completed, the process for the next timing is started. Therefore, the processing unit 100 continuously calculates the direction information and the incidental information at a predetermined processing rate. This makes it possible to constantly monitor the direction in which the voice 2 is emitted.
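As a rough illustration of the repeated execution of steps 101 to 103, the following loop sketches the pipeline. The helper names compute_features and estimate_vector are placeholders standing in for the preprocessing unit 20 and the vector estimation unit 21; they are assumptions of this sketch, not part of the present disclosure.

```python
import numpy as np

def run_pipeline(frames, compute_features, estimate_vector):
    """Repeat steps 101-103 at the processing rate: feature data -> 3D vector
    -> polar conversion into (theta, phi, I) for each incoming block."""
    results = []
    for block in frames:                      # one block per processing step
        d_i = compute_features(block)         # step 101: feature data D_i
        p = estimate_vector(d_i)              # step 102: 3D vector P(t)
        theta = np.arctan2(p[1], p[0])        # step 103: polar conversion
        phi = np.arctan2(p[2], np.hypot(p[0], p[1]))
        intensity = np.linalg.norm(p)
        results.append((theta, phi, intensity))
    return results
```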
In the following, the operations of steps 101, 102, and 103 will be specifically described, taking as an example the case where the volume of the voice 2 is set as the incidental information (I). The following description is also applicable when the incidental information is set to another value.
[Calculation of feature data]
In step 101, the amplitude spectrum of each of the plurality of sound signals 5 and the phase difference spectra between the plurality of sound signals 5 are calculated as the feature data 6 of the plurality of sound signals 5. The amplitude spectrum is a spectrum representing the intensity of each frequency component. The phase difference spectrum is a spectrum representing the phase difference for each frequency component. These two types of data become the data input to the vector estimation unit 21.
First, the preprocessing unit 20 reads the sound signals 5 output from the M microphones 11 (M-channel sound signals 5), records them in a storage unit such as a buffer, and executes a short-time Fourier transform on each sound signal 5. In the short-time Fourier transform, the target signal (sound signal 5) is divided at a predetermined time Δ, and the Fourier transform is executed on the signal included in each divided section. Here, a section delimited by the predetermined time Δ is described as a time frame t (t = 1, 2, ...). The sections of the time frames t may overlap or may be separate.
Further, the sound signal 5 output from the m-th of the M microphones 11 at a sampling time τ is written s_m(τ), and the complex spectrum calculated by the short-time Fourier transform of s_m(τ) is written S_m(t, f). Here, m is an index representing each microphone 11 and is a natural number of M or less (m = 1, 2, ..., M). Also, f is an index representing each frequency bin in the short-time Fourier transform and is an integer equal to or less than the number of frequency bins F (f = 1, 2, ..., F).
As the amplitude spectrum, the absolute value |S_m(t, f)| of the complex spectrum S_m(t, f) is calculated. That is, M channels of amplitude spectra are calculated from the M channels of complex spectra. As the phase difference spectrum, the phase difference arg(S_m(t, f) / S_j(t, f)) with respect to a specific reference channel (the microphone 11 with m = j) is calculated for each of the other channels. Here, arg is a function that calculates the argument (angle), and m represents a channel other than j. That is, M-1 channels of phase difference spectra are calculated from the M channels of complex spectra.
As the data input to the vector estimation unit 21 (input data D_i), data in which the amplitude spectra and phase difference spectra described above are grouped over a certain section length T_i (hereinafter referred to as the input section length T_i) is used. The input section length T_i is set to a section longer than the interval Δ of the time frames t described above; that is, the input section length T_i contains a plurality of time frames t. Therefore, the preprocessing unit 20 outputs spectrum data for 2M-1 channels, including the amplitude spectra for M channels and the phase difference spectra for M-1 channels.
The data size of the spectrum data is the number of channels × the section length T_i × the number of frequency bins F. Therefore, the input data D_i is expressed as D_i(c, t, f), where c is an index indicating each channel of the spectrum data. Here, the channel index c is (c = 1, 2, ..., 2M-1), the time frame t is (t = 1, 2, ..., T_i), and the frequency bin f is (f = 1, 2, ..., F).
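The construction of the input data D_i(c, t, f) described above can be sketched as follows, assuming NumPy and SciPy for the short-time Fourier transform. The function name, the default STFT parameters, and the small epsilon guard against division by zero are all assumptions of this sketch.

```python
import numpy as np
from scipy.signal import stft

def compute_input_data(signals, fs, ref_ch=0, nperseg=512):
    """Build D_i(c, t, f): M amplitude spectra plus M-1 phase-difference
    spectra relative to a reference microphone, i.e. 2M-1 channels.

    signals: array of shape (M, num_samples), one row per microphone.
    """
    specs = []
    for s in signals:                       # complex spectra S_m(t, f)
        _, _, Z = stft(s, fs=fs, nperseg=nperseg)
        specs.append(Z.T)                   # -> (time frames, freq bins)
    specs = np.stack(specs)                 # (M, T, F)

    amplitude = np.abs(specs)               # |S_m(t, f)|, M channels
    ref = specs[ref_ch]
    others = np.delete(specs, ref_ch, axis=0)
    phase_diff = np.angle(others / (ref + 1e-12))  # arg(S_m / S_j), M-1 channels

    return np.concatenate([amplitude, phase_diff], axis=0)  # (2M-1, T, F)
```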
FIG. 4 is a data plot showing an example of the feature data. FIG. 4 shows band-shaped plots representing the spectrum data of the amplitude spectra and the phase difference spectra. FIG. 4 is an example in which four microphones 11 (M = 4) are used; the top four plots are the amplitude spectra (|S_m(t, f)|), and the three plots below them are the phase difference spectra (arg(S_m(t, f) / S_j(t, f))). In each data plot, the horizontal axis is time (time frame t) and the vertical axis is frequency (frequency bin f). The gray-scale color of each point represents the amplitude or the phase difference.
For example, at a time when a black plot with strong amplitude is detected in the amplitude spectrum, it is considered that a sound with a volume corresponding to the amplitude is occurring. The sounds represented by the black plots include, for example, the voice 2 that is the target wave, the surrounding noise 3, and the like. In a state where sound is occurring in this way, as shown in the phase difference spectra, a phase difference corresponding to the difference in the timing at which the sound is detected, and the like, is detected. On the other hand, in regions where the amplitude spectrum is gray, the sound is relatively quiet, or only the noise 3 is occurring. In this case, the phase difference for each frequency is substantially random.
Also, in FIG. 4, the data section included in the input section length T_i is illustrated by a solid black frame. The data of each data plot included in this section becomes the input data D_i(c, t, f) input to the vector estimation unit 21. For example, in FIG. 4, since M = 4, the number of channels is 7, and the data size of the input data D_i(c, t, f) is 7 × T_i × F.
[Estimation of the 3D vector]
In step 102, the three-dimensional vector P is estimated by the vector estimation unit 21 to which the input data D_i(c, t, f) is input. In the present embodiment, the learner is configured so that data in which the three-dimensional vectors P are grouped over a certain section length T_o (hereinafter referred to as the output section length T_o) is output as the output data D_o.
The vector estimation unit 21 outputs, as the output data D_o, the three-dimensional vector P(t) = (x(t), y(t), z(t)) in each time frame t, collected over the number of frames included in the output section length T_o. Therefore, the output data D_o is expressed as D_o(c, t), where c is an index representing each component of the three-dimensional vector P. Here, the component index c is (c = 1, 2, 3), and the time frame t is (t = 1, 2, ..., T_o). Therefore, the data size of D_o(c, t) is 3 × T_o.
In this way, the vector estimation unit 21 functions as a function A that converts the input D_i into the output D_o. As described above, in algorithm development, function A is optimized and determined by a machine learning algorithm such as deep learning. Among the parameters constituting function A, there may be parameters that accumulate past processing results. By using such past processing results for the optimization of function A, it is possible to improve the estimation accuracy of the sound source direction and the detection accuracy of the incidental information.
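The disclosure leaves the concrete form of function A open. Purely as one hedged possibility, the following PyTorch sketch maps the input D_i to per-frame three-dimensional vectors and keeps a recurrent state corresponding to the parameters that accumulate past processing results. The architecture, the layer sizes, and the assumption that T_o equals T_i here are all choices of this sketch, not of the present disclosure.

```python
import torch
import torch.nn as nn

class VectorEstimator(nn.Module):
    """Sketch of function A: maps D_i of shape (2M-1, T_i, F) to per-frame
    3D vectors (x, y, z); the GRU state carries past processing results."""

    def __init__(self, channels, freq_bins, hidden=128):
        super().__init__()
        self.conv = nn.Conv2d(channels, 16, kernel_size=3, padding=1)
        self.gru = nn.GRU(16 * freq_bins, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 3)     # (x, y, z) per time frame

    def forward(self, d_i, state=None):
        # d_i: (batch, channels, T_i, F)
        h = torch.relu(self.conv(d_i))       # (batch, 16, T_i, F)
        h = h.permute(0, 2, 1, 3).flatten(2) # (batch, T_i, 16*F)
        h, state = self.gru(h, state)        # recurrent state across calls
        return self.head(h), state          # (batch, T_i, 3)
```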
FIG. 5 is a graph of the three-dimensional vectors output from the feature data shown in FIG. 4. In FIG. 5, graphs of the components x(t), y(t), and z(t) of the three-dimensional vector P are shown in order from the top. The horizontal axis of each graph is time, and the vertical axis is the magnitude of each component. The scale of the vertical axis is set appropriately for each graph.
Also, in FIG. 5, the data section included in the output section length T_o is illustrated by a solid black frame. The data of each graph included in this section becomes the output data D_o(c, t) output from the vector estimation unit 21 (function A). The output section length T_o is set, for example, to the same length (Δ) as one time frame t. In this case, the number of frames included in the output data D_o(c, t) is 1, and one three-dimensional vector P(t) = (x(t), y(t), z(t)) is output. When the output section length T_o includes a plurality of time frames t, a three-dimensional vector P(t) is output for each time frame t.
[Calculation of the sound source direction and incidental information]
In step 103, the sound source direction and the incidental information are calculated from the output data D_o(c, t). Specifically, the post-processing unit 22 performs polar coordinate conversion on the three-dimensional vector P(t) = (x(t), y(t), z(t)) included in the output data D_o(c, t), as shown in the following equations.
$$\theta(t) = \tan^{-1}\!\left(\frac{y(t)}{x(t)}\right) \tag{4}$$

$$\phi(t) = \tan^{-1}\!\left(\frac{z(t)}{\sqrt{x(t)^{2}+y(t)^{2}}}\right) \tag{5}$$

$$I(t) = \sqrt{x(t)^{2}+y(t)^{2}+z(t)^{2}} \tag{6}$$
Equations (4) to (6) correspond to equations (1) to (3) described with reference to FIG. 2, respectively. Equation (4) is the horizontal angle θ(t) of the sound source direction in time frame t. Equation (5) is the elevation angle φ(t) of the sound source direction in time frame t. Equation (6) is the value I of the incidental information in time frame t, which is the volume of the voice 2. In this way, the post-processing unit 22 calculates the sound source direction and the incidental information (the volume of the voice 2) for each frame from the three-dimensional vector P(t).
[When the volume of the voice 2 is set as the incidental information]
Here, the case where the volume of the voice 2 is set as the incidental information will be described. The complex spectrum S_m(t, f): (f = 1, 2, ..., F) in time frame t at the M microphones 11 is composed of a component V_m(t, f) of the voice 2 (the human voice) and a component N_m(t, f) of the other noise 3, as shown in the following equation.
$$S_m(t, f) = V_m(t, f) + N_m(t, f) \tag{7}$$
The vector estimation unit 21 (function A) is trained targeting the component V_m(t, f) of the voice 2 in the complex spectrum S_m(t, f). Specifically, in equation (6) above, I(t), which is the magnitude of the three-dimensional vector P(t), is made to be the power of the voice at a specific microphone 11 (here, the k-th microphone). In this case, I(t) is expressed by the following equation.
$$I(t) = \sum_{f=1}^{F} \left|V_k(t, f)\right|^{2} \tag{8}$$
When training the vector estimation unit 21, function A is optimized so that I(t) calculated by equation (6) satisfies the relationship of equation (8). As a result, the incidental information I(t) output from the vector estimation unit 21 ideally represents only the power (volume) of the voice 2, independent of the power of the noise 3, even when a sound signal 5 disturbed by the noise 3 is input. This amounts to being able to detect the voice 2. Therefore, by setting the power of the voice 2 as the incidental information, it is possible to realize voice activity detection (VAD), which detects the sections in which the voice 2 occurs, and the like.
Further, for example, the power may be 0 when the voice 2 is not present, and the power may be expressed on a logarithmic scale when the voice 2 is present. In this case, I(t) is expressed by the following equation.
$$I(t) = \log\!\left(1 + \sum_{f=1}^{F} \left|V_k(t, f)\right|^{2}\right) \tag{9}$$
By setting I(t) in this way and optimizing function A, it becomes possible to separate the noise 3 and the voice 2 with high accuracy. Moreover, since the power becomes 0 in the state where the voice 2 is not occurring, the voice 2 can be detected easily. Other than this, the method of expressing the volume of the voice 2 is not limited.
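Assuming the reconstructed forms of equations (8) and (9) above, the regression target I(t) could be computed from the clean speech spectrum V_k(t, f) as in the following sketch; the function name and the log1p form are assumptions of this sketch.

```python
import numpy as np

def voice_power_target(v_k, log_scale=False):
    """Target magnitude I(t) from the clean speech spectrum V_k of the k-th
    microphone, shape (T, F): per-frame power (Eq. (8)), optionally on a
    logarithmic scale so that I(t) = 0 when no speech is present (Eq. (9))."""
    power = np.sum(np.abs(v_k) ** 2, axis=-1)  # per-frame power over bins
    return np.log1p(power) if log_scale else power
```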
FIG. 6 is a graph of the sound source direction and the volume of the voice 2 calculated from the three-dimensional vectors shown in FIG. 5. FIG. 6 shows, in order from the top, graphs of the horizontal angle θ(t) of the sound source direction, the elevation angle φ(t) of the sound source direction, and the volume I(t) of the voice 2. The horizontal axis of each graph is time. The vertical axis of the graphs of the horizontal angle θ(t) and the elevation angle φ(t) is the angle. The vertical axis of the graph of the volume I(t) of the voice 2 represents the loudness (power) of the sound.
For example, in the graph of I(t), in the sections where peaks are detected, a human 1 around the microphone array 10 is speaking and the voice 2 is occurring. Conversely, in the graph of I(t), in the sections where the volume is substantially 0, the voice 2 is not occurring. In this way, by referring to I(t), it is possible to detect the voice 2 occurring around the microphone array 10 with high accuracy.
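Since I(t) ideally tracks only the power of the voice 2, a simple voice activity detection can be obtained by thresholding it. The following sketch is illustrative only; the function name and the threshold value are arbitrary assumptions.

```python
import numpy as np

def detect_voice_sections(intensity, threshold=0.1):
    """Return (start, end) frame index pairs of sections where I(t) exceeds
    a threshold, i.e. a minimal VAD on the estimated voice power."""
    active = intensity > threshold
    edges = np.flatnonzero(np.diff(active.astype(int)))  # run boundaries
    bounds = np.concatenate([[0], edges + 1, [len(active)]])
    return [(b, e) for b, e in zip(bounds[:-1], bounds[1:]) if active[b]]
```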
Also, in FIG. 6, the values of θ(t) and φ(t) each change from 0° to a constant angle corresponding to each peak of the graph of I(t). Therefore, in the example shown in FIG. 6, voices 2 emitted by a human 1 in the same direction are detected. Further, for example, when a conversation between humans 1 at mutually different positions is observed, the direction in which the human 1 who uttered each voice 2 exists is estimated as the sound source direction for each peak of the voice 2. In this way, in the present embodiment, it is possible to accurately detect the direction of the person who emitted the voice 2 together with the volume of the voice 2.
[Processing targeting utterance sections and the like]
For example, when a human 1 is speaking, it may be important to estimate the sound source direction and the like over the period of the utterance (the utterance section). In this way, when one is interested in the sound source direction over a certain section length, such as an utterance section, a method of calculating the sound source direction and the like using the three-dimensional vectors P aggregated over that section length is also effective. In the following, the section over which the aggregation is performed is described as the target section. In the present embodiment, the target section corresponds to the predetermined period.
The aggregation of the three-dimensional vectors P is executed by the post-processing unit 22. Specifically, the sums of the components x(t), y(t), and z(t) of the three-dimensional vectors P(t) output within the target section are each calculated. For example, when it is desired to acquire the sound source direction for the immediately preceding utterance at a certain time t_c, using a time t_p earlier than the time t_c, the sums x_u, y_u, and z_u of the components are calculated as follows.
$$x_u = \sum_{t=t_p}^{t_c} x(t), \qquad y_u = \sum_{t=t_p}^{t_c} y(t), \qquad z_u = \sum_{t=t_p}^{t_c} z(t) \tag{10}$$
Here, the time tp corresponds to the start time of the target section, and the time tc corresponds to its end time. Accordingly, xu, yu, and zu can be said to be the components of a vector obtained by combining the three-dimensional vectors P output within the target section (hereinafter referred to as the aggregate vector). In the present embodiment, the aggregate vector corresponds to the second vector.
A polar coordinate transformation is then applied to the aggregate vector whose components xu, yu, and zu were calculated according to equation (10). As a result, the horizontal angle θu and the elevation angle φu of the sound source direction for the utterance immediately preceding the time tc are calculated as follows.
$$\theta_u = \tan^{-1}\frac{y_u}{x_u} \tag{11}$$

$$\phi_u = \tan^{-1}\frac{z_u}{\sqrt{x_u^2 + y_u^2}} \tag{12}$$
Although equation (10) calculates the sum of each component over the target section, the average of each component over the target section may be calculated instead. That is, the average of each component is obtained by dividing xu, yu, and zu in equation (10) by the number of time frames included in the target section. A vector represented by these component averages is also an aggregate vector calculated by combining the three-dimensional vectors P.
As described above, in the present embodiment, the aggregate vector is calculated by combining the three-dimensional vectors P output within the target section, and the arrival direction of the voice 2 in the target section is calculated based on the aggregate vector.
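As a minimal sketch of this aggregation, the following Python snippet sums hypothetical per-frame vectors over a target section and converts the aggregate vector to a direction according to equations (10) to (12); the array layout and frame indices are illustrative assumptions, not part of the embodiment.

```python
import numpy as np

def aggregate_direction(P, t_p, t_c):
    """Estimate the sound source direction over a target section.

    P    : ndarray of shape (T, 3) holding the per-frame vectors P(t)
    t_p  : start frame of the target section
    t_c  : end frame of the target section
    Returns (theta_u, phi_u) in radians, per equations (10) to (12).
    """
    # Equation (10): component-wise sums over the target section.
    x_u, y_u, z_u = P[t_p:t_c + 1].sum(axis=0)
    # Equations (11) and (12): polar coordinate transformation.
    theta_u = np.arctan2(y_u, x_u)               # horizontal angle
    phi_u = np.arctan2(z_u, np.hypot(x_u, y_u))  # elevation angle
    return theta_u, phi_u
```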
With this method, the target section (from time tp to time tc) may include sections without the voice 2. In such sections, as shown in each graph of FIG. 5, the values of x(t), y(t), and z(t) are sufficiently small, ideally zero. Consequently, the component values in the sections without the voice 2 have little influence on the calculation result, and the sound source direction for the sections of the target section that do contain the voice 2 can be acquired with high accuracy.
As a method of estimating, for example, the direction of the speaker in the target section, one could conceive of identifying which parts of the target section correspond to the voice 2 and estimating the direction based on that result. However, identifying the sections corresponding to the voice 2 may involve heuristic processing that relies on various parameters and empirical rules for judging likelihood, which risks degrading the estimation accuracy.
In contrast, in the present embodiment, the sound source direction for the sections containing the voice 2 is easily calculated simply by combining the three-dimensional vectors P over the target section. That is, no heuristic processing for determining which sections contain the voice 2 is needed, and the sound source direction can be estimated with high accuracy.
[Setting the existence probability of the voice 2 as the incidental information]
A case where the existence probability of the voice 2 is set as the incidental information will now be described. Here, as shown in equation (7), the complex spectra Sm(t, f) (f = 1, 2, …, F) of the M microphones 11 in a time frame t are assumed to consist of a voice-2 component Vm(t, f) and a noise-3 component Nm(t, f).
The vector estimation unit 21 (function A) is trained so that, in equation (6) above, I(t), the magnitude of the three-dimensional vector P(t), becomes the existence probability of the voice 2. Here, the existence probability of the voice 2 is a probability expressing whether or not the voice 2 is present. Specifically, the function A is optimized so that the existence probability becomes 1 when the power (volume) of the voice 2 at a specific microphone 11 (here, the k-th) is larger than a predetermined threshold ε. In this case, I(t) is expressed by the following equation.
$$I(t) = \begin{cases} 1 & \text{if } \displaystyle\sum_{f=1}^{F} |V_k(t,f)|^2 > \varepsilon \\[4pt] 0 & \text{otherwise} \end{cases} \tag{13}$$
The vector estimation unit 21 optimized according to equation (13) outputs, for example, a three-dimensional vector P with a magnitude between 0 and 1. At inference time, the three-dimensional vector P may be output as-is so that I(t) takes values from 0 to 1. This makes it possible to implement applications that perform a predetermined process when the voice 2 is likely to be present (for example, when the existence probability is 0.5 or higher). Alternatively, the output may be controlled so that I(t) takes a value of either 0 or 1, which simplifies the downstream processing.
The method of setting the existence probability of the voice 2 is not limited to this. For example, instead of the power of the voice 2 at a specific microphone 11, the average power of the voice 2 over the plurality of microphones 11 included in the microphone array 10 may be used. In this case, the function A is optimized so that the existence probability of the voice 2 becomes 1 when the average power is larger than the predetermined threshold ε. The threshold ε can be set arbitrarily according to the configuration of the microphones 11 and the like.
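As an illustration of how an application might consume this output, the following hedged sketch binarizes a probability-like I(t) and fires a callback; the 0.5 threshold and the callback name are assumptions made for the example, not specified by the embodiment.

```python
import numpy as np

def on_voice_detected(t):
    # Hypothetical application hook; replace with real handling.
    print(f"voice detected at frame {t}")

def detect_voice(P, prob_threshold=0.5):
    """Treat the magnitude of each per-frame 3D vector as an
    existence probability and fire a callback when it is high.

    P : ndarray of shape (T, 3), per-frame vectors P(t)."""
    I = np.linalg.norm(P, axis=1)          # I(t) = |P(t)|, in [0, 1]
    for t, p in enumerate(I):
        if p >= prob_threshold:
            on_voice_detected(t)
    return I >= prob_threshold             # boolean voice-activity mask
```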
[Setting the power ratio between the voice 2 and the noise 3 as the incidental information]
A case where the power ratio between the voice 2 and the noise 3 is set as the incidental information will now be described. That is, the signal-to-noise ratio (S/N ratio) of the voice 2 is set as the incidental information. Here again, as shown in equation (7), the complex spectra Sm(t, f) (f = 1, 2, …, F) of the M microphones 11 in a time frame t are assumed to consist of a voice-2 component Vm(t, f) and a noise-3 component Nm(t, f).
The vector estimation unit 21 (function A) is trained so that, in equation (6) above, I(t), the magnitude of the three-dimensional vector P(t), becomes the signal-to-noise ratio between the voice 2 and the noise 3. Specifically, the function A is optimized so that I(t) represents the ratio of the power of the voice 2 to the power of the noise 3 at a specific microphone 11 (here, the k-th). In this case, I(t) is expressed, for example, by the following equation.
$$I(t) = \frac{\displaystyle\sum_{f=1}^{F} |V_k(t,f)|^2}{\displaystyle\sum_{f=1}^{F} |N_k(t,f)|^2} \tag{14}$$
The estimation accuracy of the sound source direction generally correlates with the signal-to-noise ratio: when the signal-to-noise ratio is small the estimation accuracy tends to be low, and when it is large the accuracy tends to be high. Therefore, by setting the power ratio between the voice 2 and the noise 3 as the incidental information, the output value I(t) can be interpreted as the reliability of the sound source direction estimate for each time frame. In other words, using equation (14) amounts to setting, as the incidental information, a reliability regarding the arrival direction of the voice 2.
The method of expressing the signal-to-noise ratio is not limited to equation (14). For example, the signal-to-noise ratio may be expressed using the average powers of the voice 2 and the noise 3 detected by the plurality of microphones 11 included in the microphone array 10. Besides the signal-to-noise ratio, any parameter capable of expressing the reliability of the arrival direction of the voice 2 may be set as I(t).
In an application that uses sound source direction estimation, attempting a desired operation with an erroneous direction estimate can significantly impair the quality of the user experience. One example is an application in which a robot turns toward the user when the user speaks. In this case, if the estimated sound source direction is wrong, the robot may turn in an unrelated direction when the user speaks.
Such situations can be avoided by using the reliability of the sound source direction estimate. For example, when the reliability is low, an alternative process is executed instead of adopting the sound source direction estimate at that moment. As the alternative process, for example, the user is notified that the sound source direction could not be estimated or that the reliability is low. Notification methods include performing a gesture indicating that the voice 2 could not be heard, displaying a message, lighting a lamp, and the like. This avoids situations in which the robot turns in an unrelated direction.
As another alternative process, the method of estimating the direction of the user may be switched from the method using the microphones 11 to a different method, such as one using a camera. That is, when estimating the direction from the sound signal is difficult due to the influence of the noise 3 or the like, a process of searching for the user using image recognition or the like is executed. This makes it possible to properly detect the direction of the user even when the sound source direction estimation does not work. By performing such alternative processing based on the reliability of the sound source direction estimate, degradation of the user experience can be sufficiently avoided.
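A minimal sketch of such reliability-gated control might look as follows; the threshold value and the camera fallback function are hypothetical stand-ins for whatever the application actually provides.

```python
import numpy as np

RELIABILITY_THRESHOLD = 1.0  # assumed S/N cutoff; tune per device

def face_direction_by_camera():
    # Hypothetical fallback, e.g. face detection on a camera image.
    return None

def decide_turn_direction(P_t):
    """Turn toward the estimated direction only when the magnitude
    of P(t), interpreted as an S/N-based reliability, is high."""
    x, y, z = P_t
    reliability = np.sqrt(x**2 + y**2 + z**2)     # I(t)
    if reliability >= RELIABILITY_THRESHOLD:
        return np.arctan2(y, x)                   # horizontal angle
    return face_direction_by_camera()             # alternative process
```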
[Calculation of error]
Input data with teacher labels is used to train the learner constituting the vector estimation unit 21. Each teacher label is a vector (answer vector) representing the sound source direction, volume, and so on that should be estimated from the corresponding input data. During training, the accuracy of the learner is evaluated by comparing the three-dimensional vector P that the learner outputs for the input data with the answer vector.
Specifically, the Euclidean distance between the three-dimensional vector P and the answer vector is calculated. Here, the Euclidean distance is the distance in the three-dimensional Euclidean space represented by the three-dimensional Cartesian coordinate system described with reference to FIG. 2. This Euclidean distance expresses the deviation of the three-dimensional vector P from the answer vector representing the correct answer.
For example, the mean squared error (MSE) based on this Euclidean distance is calculated as the output error (loss) of the learner. The method of expressing the error is not limited to this. In this way, the vector estimation unit 21 is a learner that outputs the three-dimensional vector P corresponding to the input data and uses, for training, an error corresponding to the Euclidean distance between the output three-dimensional vector P and the answer vector corresponding to the input data.
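As a sketch, this loss can be written directly on the predicted and answer vectors; the batched array shapes below are assumptions made for illustration.

```python
import numpy as np

def euclidean_mse_loss(P_pred, P_true):
    """Mean squared Euclidean distance between predicted 3D vectors
    and answer vectors, each of shape (batch, 3). A single scalar
    jointly penalizes errors in the direction and in the incidental
    information carried by the vector magnitude."""
    sq_dist = np.sum((P_pred - P_true) ** 2, axis=1)  # squared distance per sample
    return sq_dist.mean()
```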
For example, when the Euclidean distance is small, the error of the learner can be evaluated as small, and when the Euclidean distance is large, the error can be evaluated as large. In other words, because the output format of the vector estimation unit 21 (the learner) is a three-dimensional vector P that can express the sound source direction and the incidental information in an integrated manner, the error of the three-dimensional vector P can easily be calculated as the Euclidean distance to the answer vector.
Also, when evaluating the error during training of the machine learning algorithm, evaluating the loss of a single vector evaluates the three parameters, the horizontal angle θ, the elevation angle φ, and the incidental information, simultaneously. By contrast, a learner that directly outputs the horizontal angle θ, the elevation angle φ, and the like needs rules for, for example, identifying 0° with 360°, so heuristic processing becomes necessary for calculating the error. Using the format that outputs a three-dimensional vector P as shown in the present disclosure avoids such heuristic processing and enables highly accurate error evaluation. This makes it possible to dramatically improve the training accuracy of the learner.
Furthermore, algorithms such as neural networks may use backpropagation, which adjusts weights using the error. Even when training such an algorithm, expressing the sound source direction information not as angles but as a three-dimensional vector P in three-dimensional Euclidean space enables stable backpropagation of the error. This makes it easy to implement algorithms that rely on backpropagation.
As described above, in the processing unit 100 according to the present embodiment, the three-dimensional vector P is output by taking as input the feature data 6 of the plurality of sound signals 5 in which the voice 2 is observed. This three-dimensional vector P represents the direction information of the arrival direction of the voice 2 and the incidental information regarding the voice 2. Because the direction information and the incidental information are output together as a single vector, the direction of the voice 2 and the other associated information (the volume of the voice 2) can be detected with high accuracy.
With sound source direction estimation algorithms that estimate the arrival direction of voice and the like, how to integrate information other than the sound source direction is often a practical problem. For example, to implement an application in which a robot turns toward the user when the user speaks, an algorithm that detects the user's voice must be integrated with an algorithm that estimates the arrival direction of the voice.
When the sound source direction estimation and voice detection algorithms are configured separately in this way, it is generally difficult to optimize both as a whole. For example, if the voice could be detected in advance, the direction could be estimated with higher accuracy, and if the direction of the voice could be estimated in advance, the voice could be detected with higher accuracy. In this case, optimizing each process requires the result of the other, so in practice one may be forced to adopt algorithms optimized individually for each process.
In such a configuration, even if an individually optimal algorithm is adopted for each process, there is no guarantee that the overall processing, including sound source direction estimation and voice detection, is optimized. Using separate algorithms therefore raises concerns about accuracy, since the whole is not jointly optimal, and about development efficiency, since each algorithm must be developed independently and development costs increase.
In the present embodiment, the vector estimation unit 21 outputs a three-dimensional vector P representing the sound source direction and the volume of the voice 2 (the incidental information). By expressing multiple pieces of information with a single vector in this way, separate algorithms for sound source direction estimation and voice detection, and an integration algorithm for combining their results, become unnecessary.
The three-dimensional vector P is a vector representing both the estimation result for the sound source direction and the detection result for the voice. That is, outputting the three-dimensional vector P makes it possible to solve multiple problems in a jointly optimal manner. This significantly improves the estimation accuracy of the sound source direction and the detection accuracy of the voice 2, while sufficiently improving computational efficiency. It also removes the need to develop separate algorithms, greatly reducing development costs.
The present inventor evaluated the sound source direction estimation using the three-dimensional vector P according to the present technology, using data (sound signals 5) detected by a microphone array 10 mounted on a specific device. The estimation results were evaluated by measuring, in multiple environments, the proportion of cases in which the error of the horizontal angle θ fell within a predetermined angle range, and comparing it with other sound source direction estimation methods. The predetermined angle range was set based on the angle of view of a camera.
In this evaluation, the evaluation environments were set to have an extremely low signal-to-noise ratio, that is, relatively loud noise, so the other methods could not achieve sufficient accuracy. In contrast, the method using the three-dimensional vector P estimated the direction with markedly higher accuracy. Specifically, across the multiple environments, the other methods achieved a correct answer rate (the proportion falling within the predetermined angle range) of around 40%, whereas the method using the present technology achieved approximately 80% or higher in each environment.
In this way, the technique of expressing the sound source direction and the incidental information with a single vector can greatly improve the estimation accuracy of the sound source direction. This improves the operating accuracy of systems that perform voice processing and the like, and makes it possible to provide highly reliable voice applications using the present technology.
<Second embodiment>
The processing unit 200 of the second embodiment according to the present technology will now be described. In the following description, explanations of parts similar in configuration and operation to the processing unit 100 described in the above embodiment are omitted or simplified.
FIG. 7 is a block diagram showing a configuration example of the processing unit 200 according to the second embodiment. The processing unit 200 is an arithmetic unit that calculates information about the voice 2, and has a preprocessing unit 220, a vector estimation unit 221, and a post-processing unit 222. The preprocessing unit 220 is configured, for example, in the same manner as the preprocessing unit 20 shown in FIG. 1, and outputs the feature data 6 of the plurality of sound signals 5 output from the microphone array 10. In FIG. 7, the microphone array is not shown.
Based on the feature data 6, the vector estimation unit 221 outputs a three-dimensional vector P representing the direction information and the incidental information for each frequency component included in the sound signals 5. Specifically, the learner constituting the vector estimation unit 221 is trained to output a three-dimensional vector P for each frequency bin f. For training the learner, the mean squared error between the three-dimensional vector P and the answer vector is calculated for each frequency bin f.
The post-processing unit 222 executes conversion and aggregation processing on the three-dimensional vectors P output for each frequency component (frequency bin), and calculates the direction information indicating the sound source direction and the incidental information regarding the voice 2.
FIG. 8 is a data plot showing an example of the feature data. The feature data 6 (amplitude spectra and phase difference spectra) is calculated by the preprocessing unit 220 in the same manner as the processing described with reference to FIG. 4 and the like.
FIG. 8 shows band-shaped plots representing the spectral data of the amplitude spectra and the phase difference spectra. FIG. 8 is an example using four microphones 11 (M = 4): the top four plots are the amplitude spectra (|Sm(t, f)|), and the three plots below them are the phase difference spectra (arg(Sm(t, f)/Sj(t, f))). In FIG. 8, the data section included in the input section length Ti is indicated by a solid black frame. The data of each plot included in this section becomes the input data Di(c, t, f) input to the vector estimation unit 221.
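As a rough sketch of how such an input tensor could be assembled from multichannel STFTs, the following stacks M amplitude spectra with M−1 phase difference spectra taken against the first microphone; the STFT parameters and the choice of reference channel are assumptions made here for illustration, not specified by the embodiment.

```python
import numpy as np
from scipy.signal import stft

def build_feature_data(signals, fs, nperseg=512):
    """signals : ndarray of shape (M, n_samples), one row per microphone.
    Returns Di of shape (2*M - 1, T, F): M amplitude spectra followed
    by M - 1 phase difference spectra against microphone 0."""
    # Complex spectra S_m(t, f) for every microphone, as (M, T, F).
    S = np.stack([stft(s, fs=fs, nperseg=nperseg)[2].T for s in signals])
    amplitude = np.abs(S)                          # |S_m(t, f)|
    # arg(S_m * conj(S_0)) equals arg(S_m / S_0) and avoids division.
    phase_diff = np.angle(S[1:] * np.conj(S[:1]))
    return np.concatenate([amplitude, phase_diff], axis=0)
```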
[Estimation of the three-dimensional vector for each frequency component]
In the present embodiment, the vector estimation unit 221 estimates a three-dimensional vector P for each frequency component. The vector estimation unit 221 outputs the three-dimensional vectors P collected over an output section length To as output data Do. That is, within the output section length To, a three-dimensional vector P(t, f) = (x(t, f), y(t, f), z(t, f)) is output for each frequency bin and each time frame. This corresponds to adding a frequency dimension to the output of the vector estimation unit 21 shown in FIG. 1, and makes it possible to output the arrival direction (sound source direction) of the voice 2 and the incidental information regarding the voice 2 for each time frame t and frequency bin f.
Writing c for the index of the components of the three-dimensional vector P, the output data Do is expressed as Do(c, t, f). Here, the component index c is (c = 1, 2, 3), the time frame t is (t = 1, 2, …, To), and the frequency bin f is (f = 1, 2, …, F), where F is the total number of frequency bins. The data size of Do(c, t, f) is therefore 3 × To × F. In this way, the vector estimation unit 221 functions as a function B that converts the input Di into the output Do. In the following, the case where the volume of the voice 2 is set as the incidental information targeted by the function B is described as an example. Of course, arbitrary information such as the existence probability of the voice 2 or the reliability of the sound source direction can also be set as the incidental information.
FIG. 9 is a data plot showing the three-dimensional vectors P output from the feature data shown in FIG. 8. From top to bottom, FIG. 9 shows data plots of the components x(t, f), y(t, f), and z(t, f) of the three-dimensional vector P(t, f). The horizontal axis of each plot is time and the vertical axis is frequency, with the value of each component shown in grayscale. In FIG. 9, the data section included in the output section length To is indicated by a solid black frame. The data of each plot included in this section becomes the output data Do(c, t, f) output from the vector estimation unit 221 (function B).
[Calculation of the sound source direction and the incidental information]
The sound source direction and the incidental information are calculated from the output data Do(c, t, f). Specifically, the post-processing unit 222 applies a polar coordinate transformation to the three-dimensional vectors P(t, f) = (x(t, f), y(t, f), z(t, f)) included in the output data Do(c, t, f), as shown in the following equations.
$$\theta(t,f) = \tan^{-1}\frac{y(t,f)}{x(t,f)} \tag{15}$$

$$\phi(t,f) = \tan^{-1}\frac{z(t,f)}{\sqrt{x(t,f)^2 + y(t,f)^2}} \tag{16}$$

$$I(t,f) = \sqrt{x(t,f)^2 + y(t,f)^2 + z(t,f)^2} \tag{17}$$
Equations (15) to (17) correspond respectively to equations (1) to (3) described with reference to FIG. 2, and are calculated for each time frame t and frequency bin f. Equation (15) gives the horizontal angle θ(t, f) of the sound source direction, equation (16) gives the elevation angle φ(t, f) of the sound source direction, and equation (17) gives the value I(t, f) of the incidental information, the volume of the voice 2. In this way, the post-processing unit 222 calculates the sound source direction and the incidental information (the volume of the voice 2) for each time frame and frequency from the three-dimensional vectors P(t, f).
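A compact, vectorized sketch of this per-bin conversion, assuming the output data Do is held as an array of shape (3, To, F), might read:

```python
import numpy as np

def per_bin_direction_and_volume(Do):
    """Do : ndarray of shape (3, T, F) holding x, y, z per bin.
    Returns theta, phi (radians) and I, each of shape (T, F),
    per equations (15) to (17)."""
    x, y, z = Do
    theta = np.arctan2(y, x)             # horizontal angle, eq. (15)
    phi = np.arctan2(z, np.hypot(x, y))  # elevation angle, eq. (16)
    I = np.sqrt(x**2 + y**2 + z**2)      # volume per bin, eq. (17)
    return theta, phi, I
```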
FIG. 10 is a data plot showing the volume of the voice 2 calculated from the three-dimensional vectors P shown in FIG. 9. The horizontal axis is time and the vertical axis is frequency, with the volume (power) of the voice 2 in each time frame t and frequency bin f shown in grayscale.
When the volume of the voice 2 is set as the incidental information, the function B is optimized so that I(t, f) in equation (17) becomes the power (spectrogram) of the voice 2 for each frequency bin at a specific microphone (here, the k-th). In this case, using equation (7), which represents the complex spectrum of the voice 2, I(t, f) is expressed by the following equation.
$$I(t,f) = |V_k(t,f)|^2 \tag{18}$$
When the vector estimation unit 221 is trained, the function B is optimized so that the I(t, f) calculated as in equation (6) satisfies the relation of equation (18). As a result, ideally, the output incidental information represents the power (volume) of the voice 2 for each frequency bin regardless of the presence or absence of the noise 3, even when a sound signal 5 corrupted by noise is input.
This amounts to estimating a voice signal in which only the voice 2 is detected, without the noise 3. This processing can also be regarded as speech enhancement processing that emphasizes only the voice 2, or as noise suppression processing that reduces the noise 3. Accordingly, the data plot shown in FIG. 10 represents a voice signal containing the response of the voice 2 alone, extracted from the original sound signal that includes the noise 3 and the like.
For example, in a certain time frame t, the frequency distribution of I(t, f) calculated according to equation (17) is the frequency distribution of the power of the voice 2 in that time frame, that is, the amplitude spectrum of the voice 2. This spectrum does not include the spectrum of the noise 3 and the like. By performing such processing for each time frame, a voice signal in which only the voice 2 is detected, as shown in FIG. 10, can be extracted.
Thus, in the present embodiment, a voice signal representing the amplitude spectrum of the voice 2 is calculated based on the three-dimensional vectors P output for each frequency component by the vector estimation unit 221. This enables highly accurate speech recognition and the like using a voice signal in which the noise 3 is suppressed, and greatly improves the processing accuracy of various applications that use the voice 2.
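One plausible way to consume this per-bin output is to resynthesize an enhanced waveform by pairing the estimated magnitude with the phase of a reference microphone. The following is a sketch under stated assumptions: the reference-channel STFT, the square root used to go from power to magnitude, and the STFT parameters are all choices made here, not specified by the embodiment.

```python
import numpy as np
from scipy.signal import istft

def resynthesize_enhanced_voice(I, S_ref, fs, nperseg=512):
    """I     : ndarray (T, F), per-bin voice power I(t, f) per eq. (18)
    S_ref : ndarray (T, F), complex STFT of a reference microphone
    Returns an enhanced time-domain signal using the noisy phase."""
    magnitude = np.sqrt(np.maximum(I, 0.0))   # power -> magnitude
    phase = np.exp(1j * np.angle(S_ref))      # keep the reference phase
    Z = (magnitude * phase).T                 # istft expects (F, T)
    _, x = istft(Z, fs=fs, nperseg=nperseg)
    return x
```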
The speech enhancement processing (the processing of extracting the voice signal) can also be regarded as voice activity detection (VAD) per frequency bin. Therefore, in the present embodiment, when the volume of the voice 2 is set as the incidental information, speech enhancement, voice activity detection, and sound source direction estimation are solved in a single computation. This makes it possible to provide a single, globally optimized algorithm that performs the three processes at once.
As with equation (9), the volume can also be output on a logarithmic scale. That is, the power may be 0 when the voice 2 is absent and expressed on a logarithmic scale when the voice 2 is present. In this case, I(t, f) is expressed by the following equation.
$$I(t,f) = \log\!\left(1 + |V_k(t,f)|^2\right) \tag{19}$$
[Calculation of the overall sound source direction and incidental information]
When the sound source direction and the incidental information for each time frame t are of interest, the three-dimensional vectors P(t, f) are first summed in the frequency direction, and the polar coordinate transformation is then executed.
First, from the three-dimensional vectors P(t, f) calculated for each frequency component, an overall three-dimensional vector P(t) representing the overall sound source direction and the overall incidental information in a certain time frame t is calculated. In the following, the overall three-dimensional vector P(t) calculated from the vectors P(t, f) is referred to as the overall vector P(t). In the present embodiment, the overall vector P(t) corresponds to the first vector.
For example, when the three-dimensional vectors P(t, f) are output from the vector estimation unit 221, the post-processing unit 222 combines them in the frequency direction to calculate the overall vector P(t). That is, the overall vector P(t) is the vector obtained by combining the three-dimensional vectors P(t, f) over the frequency bins f = 1 to F in the time frame t. Specifically, the components x(t), y(t), and z(t) of the overall vector P(t) are expressed as follows.
$$x(t) = \sum_{f=1}^{F} x(t,f), \qquad y(t) = \sum_{f=1}^{F} y(t,f), \qquad z(t) = \sum_{f=1}^{F} z(t,f) \tag{20}$$
The direction of the overall vector P(t) calculated by equation (20) represents the arrival direction (sound source direction) of the voice 2 produced at time t. In this way, in the present embodiment, the overall vector P(t) representing the arrival direction of the voice 2 is calculated by combining the three-dimensional vectors P(t, f) output for each frequency component. The magnitude of the overall vector P(t) represents the overall value I(t) of the incidental information regarding the voice 2.
FIG. 11 shows graphs of the overall vector P(t) calculated from the three-dimensional vectors P(t, f) shown in FIG. 9. From top to bottom, FIG. 11 shows graphs of the components x(t), y(t), and z(t) of the overall vector P(t). The horizontal axis of each graph is time, and the vertical axis is the magnitude of each component; the scale of the vertical axis is set appropriately for each graph.
The graphs shown in FIG. 11 are obtained by adding, in the frequency direction, the components output individually for each frequency bin, and correspond to the components of the three-dimensional vector P(t) described with reference to FIG. 5. That is, by combining the three-dimensional vectors P(t, f) in the post-processing unit 222, a vector (the overall vector P(t)) similar to the three-dimensional vector P(t) output by the vector estimation unit 21 (function A) of the first embodiment can be calculated.
Once the overall vector P(t) is calculated, the direction information and the incidental information are calculated based on it. Specifically, the post-processing unit 222 executes a polar coordinate transformation on the overall vector P(t) to calculate the horizontal angle θ(t), the elevation angle φ(t), and the incidental information I(t) of the voice 2, as expressed by the following equations.
$$\theta(t) = \tan^{-1}\frac{y(t)}{x(t)} \tag{21}$$

$$\phi(t) = \tan^{-1}\frac{z(t)}{\sqrt{x(t)^2 + y(t)^2}} \tag{22}$$

$$I(t) = \sqrt{x(t)^2 + y(t)^2 + z(t)^2} \tag{23}$$
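A short sketch combining equations (20) to (23), under the same array-layout assumption as the earlier snippets:

```python
import numpy as np

def overall_direction_per_frame(Do):
    """Do : ndarray of shape (3, T, F) of per-bin vectors P(t, f).
    Sums over frequency (eq. (20)) and converts each frame's overall
    vector to direction and incidental information (eqs. (21)-(23))."""
    x, y, z = Do.sum(axis=2)             # overall vector P(t) per frame
    theta = np.arctan2(y, x)             # horizontal angle, eq. (21)
    phi = np.arctan2(z, np.hypot(x, y))  # elevation angle, eq. (22)
    I = np.sqrt(x**2 + y**2 + z**2)      # overall volume, eq. (23)
    return theta, phi, I
```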
FIG. 12 shows graphs of the sound source direction and the volume calculated from the overall vectors shown in FIG. 11. From top to bottom, FIG. 12 shows the horizontal angle θ(t) of the sound source direction, the elevation angle φ(t) of the sound source direction, and the volume I(t) of the voice 2. The horizontal axis of each graph is time; the vertical axes of the θ(t) and φ(t) graphs are angles, and the vertical axis of the I(t) graph represents loudness (power).
For example, in the sections of the I(t) graph where a peak is detected, the volume of the voice 2 is large, showing that the voice 2 is detected. Also, the values of θ(t) and φ(t) each shift from 0° to a fixed angle in correspondence with each peak of I(t). It follows that the voices 2 detected as peaks of I(t) are all emitted from the same direction.
When the magnitude I(t, f) of the three-dimensional vector P(t, f) is set to the power of the voice 2 shown in equation (18), the magnitude I(t) of the overall vector P(t) can be regarded as the power of the voice 2 shown in equation (8). Similarly, when the power of the voice 2 shown in equation (19) is set, the magnitude of the overall vector P(t) can be regarded as the power of the voice 2 shown in equation (9).
In this way, even when the three-dimensional vectors P(t, f) are calculated for each frequency component, the overall sound source direction and incidental information for each time frame can still be calculated. This makes it possible to extract a voice signal in which the noise 3 and the like are suppressed while also performing detection of the voice 2. As a result, a globally optimized, highly versatile voice processing system can be constructed.
<Other Embodiments>
The present technology is not limited to the embodiments described above, and various other embodiments can be realized.
The above description dealt with processing that estimates the arrival direction of voice and the like contained in sound (sound waves). Besides sound waves, the present technology can be applied to vibration waves propagating inside an object and the like. For example, a seismic wave propagating underground when an earthquake occurs may be set as the target wave. In that case, a three-dimensional vector is output that expresses direction information indicating the arrival direction of the seismic wave, that is, the direction of the epicenter, and incidental information such as the intensity (amplitude) of the seismic wave.
For example, vibration detectors that detect vibrations of the ground or underground are placed at multiple locations, and the feature data (amplitude spectra and phase difference spectra) of the vibration signals output from the detectors is input to the learner. The learner is trained in advance to output, based on the feature data of the vibration signals, a three-dimensional vector representing the arrival direction of the seismic wave and its intensity. This makes it possible to detect the arrival direction, intensity, and the like of seismic waves with high accuracy. Beyond this, the present technology is applicable to various wave phenomena that propagate through space, such as electromagnetic waves and gravitational waves.
In the above, a single processing unit was taken as an example of an embodiment of the information processing device according to the present technology. However, the information processing device according to the present technology may be realized by an arbitrary computer configured separately from the processing unit and connected to it by wire or wirelessly. For example, the information processing method according to the present technology may be executed by a cloud server, or the processing unit and another computer may operate in conjunction to execute it.
That is, the information processing method and the program according to the present technology can be executed not only by a computer system composed of a single computer but also by a computer system in which multiple computers operate in conjunction. In the present disclosure, a system means a set of multiple components (devices, modules (parts), etc.), regardless of whether all the components are in the same housing. Therefore, multiple devices housed in separate housings and connected via a network, and a single device in which multiple modules are housed in one housing, are both systems.
Execution of the information processing method and the program according to the present technology by a computer system includes both the case where, for example, the acquisition of the feature data and the output of the three-dimensional vector are executed by a single computer and the case where each process is executed by a different computer. Execution of each process by a given computer also includes having another computer execute part or all of the process and acquiring the result.
That is, the information processing method and the program according to the present technology can also be applied to a cloud computing configuration in which a single function is shared and jointly processed by multiple devices via a network.
Of the characteristic features according to the present technology described above, at least two can be combined. That is, the various characteristic features described in each embodiment may be combined arbitrarily, without distinction between the embodiments. The various effects described above are merely examples and are not limiting, and other effects may also be exhibited.
In the present disclosure, terms such as "same", "equal", and "orthogonal" are concepts that include "substantially the same", "substantially equal", and "substantially orthogonal". For example, states within a predetermined range (for example, ±10%) of "completely the same", "completely equal", "completely orthogonal", and so on are also included.
In addition, this technology can also adopt the following configurations.
(1) An acquisition unit that acquires feature data of multiple signals that observe the target wave, and
An information processing device including a direction information indicating an arrival direction of the target wave and an output unit for outputting a three-dimensional vector representing ancillary information about the target wave based on the acquired feature data.
(2) The information processing device according to (1).
The output unit is an information processing device that outputs the three-dimensional vector so that the direction of the three-dimensional vector represents the direction information and the magnitude of the three-dimensional vector represents the accessory information.
(3) The information processing device according to (2).
The output unit is an information processing device that outputs the three-dimensional vector so that the direction information and the accessory information are calculated by performing polar coordinate conversion on the three-dimensional vector.
(4) The information processing device according to (3).
The direction information is an information processing device including a horizontal angle and an elevation angle indicating the arrival direction of the target wave.
(5) The information processing device according to (3) or (4).
The output unit is an information processing device that converts the three-dimensional vector into polar coordinates to calculate the direction information and the accessory information.
(6) The information processing device according to any one of (1) to (5).
The target wave is voice and
The plurality of signals are information processing devices that are sound signals obtained by observing the voice.
(7) The information processing device according to (6).
The direction information is an information processing device that indicates the direction of arrival of the voice.
(8) The information processing device according to (6) or (7).
The information processing device includes any one of the volume of the voice, the existence probability of the voice, and the reliability of the arrival direction of the voice.
(9) The information processing device according to any one of (6) to (8).
The output unit is an information processing device that outputs the three-dimensional vector representing the direction information and the accessory information for each frequency component included in the sound signal.
(10) The information processing apparatus according to (9).
The attached information is the volume of the voice for each frequency component.
The output unit is an information processing device that calculates an audio signal representing the amplitude spectrum of the audio based on the three-dimensional vector output for each frequency component.
(11) The information processing device according to (9) or (10).
The output unit is an information processing device that synthesizes the three-dimensional vectors output for each frequency component and calculates a first vector representing the arrival direction of the voice.
(12) The information processing apparatus according to (11).
The output unit is an information processing device that calculates the direction information and the accessory information based on the first vector.
(13) The information processing apparatus according to any one of (6) to (12).
The output unit calculates a second vector by synthesizing the three-dimensional vectors output within a predetermined period, and calculates the arrival direction of the voice in the predetermined period based on the second vector. Information processing device.
(14) The information processing device according to any one of (6) to (13), wherein
the plurality of signals are the sound signals detected by each of a plurality of sound collectors arranged at mutually different positions.
(15) The information processing device according to any one of (1) to (14), wherein
the feature data includes an amplitude spectrum of each of the plurality of signals and a phase difference spectrum between the plurality of signals (a sketch of this feature extraction follows this list).
(16) The information processing device according to any one of (1) to (15), wherein
the output unit is a learner that outputs the three-dimensional vector corresponding to input data and that is trained using an error corresponding to the Euclidean distance between the output three-dimensional vector and an answer vector corresponding to the input data (a sketch of this error follows this list).
(17) An information processing method executed by a computer system, the method comprising:
acquiring feature data of a plurality of signals in which a target wave is observed; and
outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.
(18) A program that causes a computer system to execute:
a step of acquiring feature data of a plurality of signals in which a target wave is observed; and
a step of outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.
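For illustration only (this sketch is not part of the publication or its claims): a minimal example of the polar coordinate transformation referred to in (3) to (5), turning the estimated three-dimensional vector into a horizontal angle, an elevation angle, and a magnitude that carries the incidental information. The function name vector_to_direction and the axis convention (x forward, y left, z up) are assumptions; the publication does not fix them.

    import numpy as np

    def vector_to_direction(p):
        # p is the estimated 3-D vector P = (x, y, z).
        # Assumed axis convention: x forward, y left, z up.
        x, y, z = p
        magnitude = np.linalg.norm(p)            # incidental information, e.g. volume
        azimuth = np.degrees(np.arctan2(y, x))   # horizontal angle in [-180, 180]
        elevation = np.degrees(np.arctan2(z, np.hypot(x, y)))  # elevation in [-90, 90]
        return azimuth, elevation, magnitude

    # Example: a vector pointing 45 degrees to the left at ear level,
    # with magnitude ~0.8 representing the incidental information.
    az, el, mag = vector_to_direction(np.array([0.5657, 0.5657, 0.0]))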
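Likewise for illustration only: a sketch of the per-frequency and per-period aggregation of (9) to (13). It assumes that "combining" the vectors means summation and that the estimator emits an array of shape (frames, frequency bins, 3); both are assumptions, not statements of the publication's method.

    import numpy as np

    def aggregate(vectors):
        # vectors: (n_frames, n_bins, 3) per-frequency 3-D vectors.
        # (10): the magnitude per bin is the per-frequency volume,
        # i.e. an estimated amplitude spectrum of the voice.
        amplitude_spectrum = np.linalg.norm(vectors, axis=-1)   # (n_frames, n_bins)
        # (11): combining (here: summing) over frequency yields a
        # "first vector" whose direction is the per-frame arrival direction.
        first_vector = vectors.sum(axis=1)                      # (n_frames, 3)
        # (13): combining over a predetermined period (all frames here)
        # yields a "second vector" for that period.
        second_vector = vectors.sum(axis=(0, 1))                # (3,)
        return amplitude_spectrum, first_vector, second_vector

If summation is indeed the combining rule, bins that agree on a direction reinforce one another, so the first and second vectors tend to point toward the dominant voice even when individual bins are noisy.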
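A sketch of the feature data of (15), again illustrative only: the amplitude spectrum of each channel and the phase-difference spectrum between channels, computed with a short-time Fourier transform. The window, FFT length, hop size, and the choice of channel 0 as the phase reference are assumptions.

    import numpy as np

    def extract_features(signals, n_fft=512, hop=256):
        # signals: (n_channels, n_samples) array of microphone signals,
        # assumed to contain at least n_fft samples.
        window = np.hanning(n_fft)
        n_ch, n_samples = signals.shape
        frames = range(0, n_samples - n_fft + 1, hop)
        # Complex STFT per channel: (n_ch, n_frames, n_bins)
        spectra = np.stack([
            np.stack([np.fft.rfft(signals[ch, t:t + n_fft] * window)
                      for t in frames])
            for ch in range(n_ch)
        ])
        amplitude = np.abs(spectra)  # amplitude spectrum of each signal
        # Phase difference of every channel relative to channel 0.
        phase_diff = np.angle(spectra[1:] * np.conj(spectra[0]))
        return amplitude, phase_diff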
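Finally, a sketch of the training error of (16): the Euclidean distance between the three-dimensional vector output by the learner and the answer vector attached to the input data. This is a plain-NumPy stand-in; the publication does not specify the learner or the optimizer.

    import numpy as np

    def euclidean_loss(predicted, answer):
        # Mean Euclidean distance over a batch of 3-D vectors.
        return np.mean(np.linalg.norm(predicted - answer, axis=-1))

In training, the answer vector would presumably point in the true arrival direction with magnitude equal to the true incidental information (e.g. volume), so that minimizing this loss fits both quantities at once.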
P … three-dimensional vector
2 … voice
5 … sound signal
6 … feature data
11 … microphone
20, 220 … pre-processing unit
21, 221 … vector estimation unit
22, 222 … post-processing unit
100, 200 … processing unit

Claims (18)

1. An information processing device, comprising:
an acquisition unit that acquires feature data of a plurality of signals in which a target wave is observed; and
an output unit that outputs, based on the acquired feature data, a three-dimensional vector representing direction information indicating an arrival direction of the target wave and incidental information regarding the target wave.
2. The information processing device according to claim 1, wherein
the output unit outputs the three-dimensional vector such that the direction of the three-dimensional vector represents the direction information and the magnitude of the three-dimensional vector represents the incidental information.
3. The information processing device according to claim 2, wherein
the output unit outputs the three-dimensional vector such that the direction information and the incidental information are obtained by applying a polar coordinate transformation to the three-dimensional vector.
4. The information processing device according to claim 3, wherein
the direction information includes a horizontal angle and an elevation angle indicating the arrival direction of the target wave.
5. The information processing device according to claim 3, wherein
the output unit applies the polar coordinate transformation to the three-dimensional vector to calculate the direction information and the incidental information.
6. The information processing device according to claim 1, wherein
the target wave is a voice, and
the plurality of signals are sound signals in which the voice is observed.
7. The information processing device according to claim 6, wherein
the direction information is information indicating the arrival direction of the voice.
8. The information processing device according to claim 6, wherein
the incidental information includes any one of the volume of the voice, the existence probability of the voice, and the reliability of the arrival direction of the voice.
9. The information processing device according to claim 6, wherein
the output unit outputs, for each frequency component included in the sound signals, the three-dimensional vector representing the direction information and the incidental information.
10. The information processing device according to claim 9, wherein
the incidental information is the volume of the voice for each frequency component, and
the output unit calculates a voice signal representing an amplitude spectrum of the voice based on the three-dimensional vectors output for the respective frequency components.
11. The information processing device according to claim 9, wherein
the output unit combines the three-dimensional vectors output for the respective frequency components to calculate a first vector representing the arrival direction of the voice.
12. The information processing device according to claim 11, wherein
the output unit calculates the direction information and the incidental information based on the first vector.
13. The information processing device according to claim 6, wherein
the output unit calculates a second vector by combining the three-dimensional vectors output within a predetermined period, and calculates the arrival direction of the voice in the predetermined period based on the second vector.
14. The information processing device according to claim 6, wherein
the plurality of signals are the sound signals detected by each of a plurality of sound collectors arranged at mutually different positions.
15. The information processing device according to claim 1, wherein
the feature data includes an amplitude spectrum of each of the plurality of signals and a phase difference spectrum between the plurality of signals.
16. The information processing device according to claim 1, wherein
the output unit is a learner that outputs the three-dimensional vector corresponding to input data and that is trained using an error corresponding to the Euclidean distance between the output three-dimensional vector and an answer vector corresponding to the input data.
17. An information processing method executed by a computer system, the method comprising:
acquiring feature data of a plurality of signals in which a target wave is observed; and
outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating an arrival direction of the target wave and incidental information regarding the target wave.
18. A program that causes a computer system to execute:
a step of acquiring feature data of a plurality of signals in which a target wave is observed; and
a step of outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating an arrival direction of the target wave and incidental information regarding the target wave.
PCT/JP2020/022107 2019-06-14 2020-06-04 Information processing device, information processing method, and program WO2020250797A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2019-110917 2019-06-14
JP2019110917 2019-06-14

Publications (1)

Publication Number Publication Date
WO2020250797A1 true WO2020250797A1 (en) 2020-12-17

Family

ID=73780749

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/022107 WO2020250797A1 (en) 2019-06-14 2020-06-04 Information processing device, information processing method, and program

Country Status (1)

Country Link
WO (1) WO2020250797A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012039275A (en) * 2010-08-05 2012-02-23 Nippon Telegr & Teleph Corp <Ntt> Reflection sound information estimation equipment, reflection sound information estimation method, and program
JP2013008031A (en) * 2011-06-24 2013-01-10 Honda Motor Co Ltd Information processor, information processing system, information processing method and information processing program
JP2015050610A (en) * 2013-08-30 2015-03-16 本田技研工業株式会社 Sound processing device, sound processing method and sound processing program
JP2015166764A (en) * 2014-03-03 2015-09-24 富士通株式会社 Speech processing device, noise suppression method, and program
JP2018032001A (en) * 2016-08-26 2018-03-01 日本電信電話株式会社 Signal processing device, signal processing method and signal processing program

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4050602A1 (en) * 2021-02-24 2022-08-31 GN Audio A/S Conference device with voice direction estimation
US11778374B2 (en) 2021-02-24 2023-10-03 Gn Audio A/S Conference device with voice direction estimation
WO2024009746A1 (en) * 2022-07-07 2024-01-11 ソニーグループ株式会社 Model generation device, model generation method, signal processing device, signal processing method, and program

Similar Documents

Publication Publication Date Title
US10063965B2 (en) Sound source estimation using neural networks
JP6279181B2 (en) Acoustic signal enhancement device
US20060204019A1 (en) Acoustic signal processing apparatus, acoustic signal processing method, acoustic signal processing program, and computer-readable recording medium recording acoustic signal processing program
US9961460B2 (en) Vibration source estimation device, vibration source estimation method, and vibration source estimation program
CN108962231B (en) Voice classification method, device, server and storage medium
JP2017044916A (en) Sound source identifying apparatus and sound source identifying method
KR102191736B1 (en) Method and apparatus for speech enhancement with artificial neural network
WO2020250797A1 (en) Information processing device, information processing method, and program
JP2008236077A (en) Target sound extracting apparatus, target sound extracting program
JP6236282B2 (en) Abnormality detection apparatus, abnormality detection method, and computer-readable storage medium
JP7214798B2 (en) AUDIO SIGNAL PROCESSING METHOD, AUDIO SIGNAL PROCESSING DEVICE, ELECTRONIC DEVICE, AND STORAGE MEDIUM
KR20210137146A (en) Speech augmentation using clustering of queues
WO2022218134A1 (en) Multi-channel speech detection system and method
Pertilä Online blind speech separation using multiple acoustic speaker tracking and time–frequency masking
EP2745293B1 (en) Signal noise attenuation
CN116868265A (en) System and method for data enhancement and speech processing in dynamic acoustic environments
US20220262342A1 (en) System and method for data augmentation and speech processing in dynamic acoustic environments
Dov et al. Multimodal kernel method for activity detection of sound sources
JP2023550434A (en) Improved acoustic source positioning method
JP2011139409A (en) Audio signal processor, audio signal processing method, and computer program
Mirbagheri et al. C-SL: Contrastive Sound Localization with Inertial-Acoustic Sensors
US11783826B2 (en) System and method for data augmentation and speech processing in dynamic acoustic environments
CN113724692B (en) Telephone scene audio acquisition and anti-interference processing method based on voiceprint features
Firoozabadi et al. Estimating the Number of Speakers by Novel Zig-Zag Nested Microphone Array Based on Wavelet Packet and Adaptive GCC Method
US20230230582A1 (en) Data augmentation system and method for multi-microphone systems

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20822277; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 20822277; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: JP)