WO2020250797A1 - Information processing device, information processing method, and program - Google Patents

Information processing device, information processing method, and program

Info

Publication number
WO2020250797A1
WO2020250797A1 (PCT/JP2020/022107, JP2020022107W)
Authority
WO
WIPO (PCT)
Prior art keywords
information processing
information
voice
processing device
vector
Prior art date
Application number
PCT/JP2020/022107
Other languages
French (fr)
Japanese (ja)
Inventor
Yuichiro Koyama
Original Assignee
Sony Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Corporation
Publication of WO2020250797A1 publication Critical patent/WO2020250797A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use, for comparison or discrimination
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 - Circuits for transducers, loudspeakers or microphones

Definitions

  • This technology relates to information processing devices, information processing methods, and programs that can be applied to detect voice and the like.
  • Patent Document 1 describes an acoustic signal processing device that estimates the direction of a sound source.
  • In this device, the surrounding sound is captured by a plurality of microphones, and a plurality of acoustic signals is generated.
  • The cross-correlation value between the acoustic signals of the microphones is calculated as a sound spatial feature.
  • The sound source direction of the target sound is estimated using this sound spatial feature.
  • The reliability of the sound source direction estimate is calculated using a higher-order statistic of the sound spatial feature (Patent Document 1, paragraphs [0035], [0040], [0044], FIG. 2, etc.).
  • In view of the above, an object of the present technology is to provide an information processing device, an information processing method, and a program capable of accurately detecting the direction of a target wave together with other incidental information.
  • The information processing device includes an acquisition unit and an output unit.
  • The acquisition unit acquires feature data of a plurality of signals obtained by observing a target wave. Based on the acquired feature data, the output unit outputs a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.
  • That is, a three-dimensional vector is output from the feature data of the plurality of signals obtained by observing the target wave.
  • This three-dimensional vector represents both the direction information of the arrival direction of the target wave and the incidental information about the target wave. In this way, the direction information and the incidental information are collectively output as one vector, which makes it possible to detect the direction of the target wave and the other incidental information with high accuracy.
  • The output unit may output the three-dimensional vector so that the direction of the three-dimensional vector represents the direction information and the magnitude of the three-dimensional vector represents the incidental information.
  • The output unit may output the three-dimensional vector so that the direction information and the incidental information are calculated by performing polar coordinate conversion on the three-dimensional vector.
  • The direction information may include a horizontal angle and an elevation angle indicating the arrival direction of the target wave.
  • The output unit may perform polar coordinate conversion of the three-dimensional vector to calculate the direction information and the incidental information.
  • The target wave may be voice.
  • The plurality of signals may be sound signals obtained by observing the voice.
  • The direction information may be information indicating the arrival direction of the voice.
  • The incidental information may include any one of the volume of the voice, the existence probability of the voice, or the reliability regarding the arrival direction of the voice.
  • The output unit may output the three-dimensional vector representing the direction information and the incidental information for each frequency component included in the sound signal.
  • The incidental information may be the volume of the voice for each frequency component.
  • The output unit may calculate a voice signal representing the amplitude spectrum of the voice based on the three-dimensional vectors output for each frequency component.
  • The output unit may synthesize the three-dimensional vectors output for each frequency component to calculate a first vector representing the arrival direction of the voice.
  • The output unit may calculate the direction information and the incidental information based on the first vector.
  • The output unit may calculate a second vector by synthesizing the three-dimensional vectors output within a predetermined period, and calculate the arrival direction of the voice in the predetermined period based on the second vector.
  • The plurality of signals may be the sound signals detected by each of a plurality of sound collectors arranged at different positions from each other.
  • The feature data may include the amplitude spectrum of each of the plurality of signals and the phase difference spectrum between the plurality of signals.
  • The output unit may be a learner that outputs the three-dimensional vector corresponding to input data and that is trained using an error based on the Euclidean distance between the output three-dimensional vector and the answer vector corresponding to that input data.
  • The information processing method is an information processing method executed by a computer system, and includes acquiring feature data of a plurality of signals obtained by observing a target wave, and outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.
  • A program according to the present technology causes a computer system to execute these steps.
  • FIG. 10 is a data plot showing the volume of the voice calculated from the three-dimensional vector shown in FIG. 9. FIG. 11 is a graph of the overall vector calculated from the three-dimensional vector shown in FIG. 9. FIG. 12 is a graph of the sound source direction and the volume calculated from the overall vector shown in FIG. 11.
  • FIG. 1 is a block diagram showing a configuration example of a processing unit according to a first embodiment of the present technology.
  • The processing unit 100 is a calculation unit that calculates information on a specific sound to be observed from sound signals obtained by observing sound (sound waves). As will be described later, the processing unit 100 executes a calculation that computes the information of the voice 2, with the voice 2 of the human 1 as the observation target.
  • the processing unit 100 is used by being connected to the microphone array 10.
  • the microphone array 10 has a plurality of microphones 11.
  • the microphone 11 is an element that detects surrounding sounds and outputs a sound signal corresponding to the detected sound, and functions as a sound collector.
  • The sound signal output from the microphone 11 is an electric signal whose amplitude changes with time according to the surrounding sound. The time variation of this amplitude represents the pitch, loudness, waveform, and the like of the sound.
  • The sound signal is typically output as an analog signal and converted into a digital signal using an A/D converter or the like (not shown).
  • The specific configuration of the microphone 11 is not limited; any element capable of detecting the surrounding sound and outputting a corresponding sound signal may be used as the microphone 11.
  • Around the microphone array 10, the voice 2 emitted by the human 1 is generated. The plurality of signals output from the microphone array 10 are therefore sound signals obtained by observing the voice 2. In addition to the voice 2, other sounds such as the noise 3 are also generated around the microphone array 10, so the sound signals 5 include components corresponding to the noise 3 and the like in addition to the voice 2.
  • the voice 2 and the noise 3 generated around the microphone array 10 are schematically illustrated by using arrows.
  • The plurality of microphones 11 constituting the microphone array 10 are arranged at positions different from one another, so the plurality of signals output from the microphone array 10 are sound signals detected at different positions. For example, even when the same voice 2 is detected, the timing at which the voice 2 is detected, the loudness of the detected voice 2, and the like differ for each microphone 11. The sound signal output by each microphone 11 is thus a signal corresponding to the position at which that microphone 11 is arranged.
  • the microphone array 10 is mounted on, for example, a robot or the like.
  • a plurality of microphones 11 are arranged in a housing such as a robot.
  • the microphone array 10 may be mounted on a stationary device or the like.
  • a plurality of microphones 11 may be arranged in an indoor space, a vehicle interior space, or the like to form a microphone array 10.
  • the microphone array 10 may include at least two microphones 11.
  • the microphone array 10 is composed of four or more microphones 11.
  • the specific configuration of the microphone array 10 is not limited.
  • the processing unit 100 has a hardware configuration required for a computer such as a CPU and a memory (RAM, ROM). Various processes are executed by the CPU loading the program stored in the ROM into the RAM and executing the program.
  • the program is installed in the processing unit 100, for example, via various recording media. Alternatively, the program may be installed via the Internet or the like.
  • As the processing unit 100, a device such as a PLD (Programmable Logic Device) such as an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit), may be used.
  • the processing unit 100 corresponds to an information processing device.
  • the pre-processing unit 20, the vector estimation unit 21, and the post-processing unit 22 are realized as functional blocks. Then, the information processing method according to the present embodiment is executed by these functional blocks.
  • dedicated hardware such as an IC (integrated circuit) may be appropriately used.
  • The preprocessing unit 20 acquires the feature data 6 of the plurality of sound signals 5 in which the voice 2 is observed. Specifically, the preprocessing unit 20 reads the plurality of sound signals 5 (multi-channel sound signals 5) output from the microphone array 10 and calculates the feature data 6 based on the read sound signals 5. That is, the preprocessing unit 20 acquires the feature data 6 by calculating it from the plurality of sound signals 5. In the present embodiment, the preprocessing unit 20 corresponds to the acquisition unit.
  • The feature data 6 is data that represents features of the plurality of sound signals 5. For example, a predetermined conversion process is executed on the sound signals 5 to generate data representing their characteristics; the sound signals 5 themselves can also be used as the feature data 6. In the present embodiment, a Fourier transform is executed on the sound signals 5, and the amplitude spectrum, the phase difference spectrum, and the like of the sound signals 5 are calculated as the feature data 6. This point will be described in detail later with reference to FIG. 4 and the like.
  • The vector estimation unit 21 outputs a three-dimensional vector P representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave, based on the feature data 6 acquired by the preprocessing unit 20.
  • The vector estimation unit 21 is configured by a learner trained to receive the feature data 6 as input and output a three-dimensional vector P representing the direction information and the incidental information.
  • The target wave is the sound (sound wave) to be observed by the processing unit 100.
  • In the present embodiment, the voice 2 emitted by the human 1 is set as the observation target; that is, the target wave is the voice 2.
  • All the voices 2 emitted by each human 1 are target waves.
  • The target wave is not limited to the voice 2 of an unspecified number of humans 1; for example, the voice 2 of a specific human 1 can also be the target wave.
  • A specific sound such as a clap or the ringing of a bell may be set as the target wave.
  • The ambient noise 3 and the like may also be set as the target wave.
  • The target wave can be set arbitrarily according to the purpose of the processing unit 100 and the like.
  • The direction information is information indicating the arrival direction of the voice 2. That is, the direction information indicates the direction (sound source direction) in which the human 1 who emitted the voice 2 is located. In the following, the arrival direction of the voice 2 may be simply described as the sound source direction.
  • Reference coordinates are set in the microphone array 10 described above.
  • The direction information indicates the direction from which the voice 2 arrives with respect to the origin of the reference coordinates, that is, the direction in which the human 1 who emitted the voice 2 is located as viewed from the reference coordinates.
  • The method of setting the reference coordinates is not limited and can be set arbitrarily.
  • The incidental information is information obtained along with the target voice 2, and is expressed as a one-dimensional value (the norm of the vector).
  • In the present embodiment, the incidental information is set to the volume of the voice 2, more specifically, the magnitude (power) of the voice 2 emitted by the human 1 at a certain timing. Besides this, the probability indicating the presence or absence of the voice 2, or the reliability regarding the arrival direction (direction information) of the voice 2, can be set as the incidental information.
  • The specific content of the incidental information is not limited; for example, an arbitrary one-dimensional quantity that can be calculated from the feature data 6 may be set as the incidental information.
  • FIG. 2 is a schematic diagram for explaining the three-dimensional vector P.
  • FIG. 2 illustrates a Cartesian coordinate system represented by the X-axis, Y-axis, and Z-axis that are orthogonal to each other. This Cartesian coordinate system becomes the reference coordinate.
  • the thick arrow in the figure is an example of the three-dimensional vector P output from the vector estimation unit 21.
  • The components of the three-dimensional vector P along the X-axis, Y-axis, and Z-axis are denoted x, y, and z, respectively.
  • The vector estimation unit 21 outputs the components (x, y, z) as the three-dimensional vector P.
  • The vector estimation unit 21 (learner) is trained so that the direction of the three-dimensional vector P is the arrival direction of the voice 2 and the magnitude I of the three-dimensional vector P is the value of the incidental information. That is, the vector estimation unit 21 outputs the three-dimensional vector P so that its direction represents the direction information and its magnitude represents the incidental information. For example, as viewed from the origin O, the direction indicated by the three-dimensional vector P is the sound source direction, and its magnitude represents the value of the incidental information (such as the volume of the voice 2). In other words, the components x, y, and z do not directly represent the direction information or the incidental information; rather, the direction information and the incidental information are expressed by the vector that the components represent.
  • Accordingly, the vector estimation unit 21 is trained so that the horizontal angle θ in the sound source direction, the elevation angle φ in the sound source direction, and the incidental information value I are obtained by converting the three-dimensional vector P into polar coordinates.
  • The horizontal angle θ is the angle representing the azimuth of the vector with respect to the X-axis in the XY plane.
  • the elevation angle ⁇ is an angle representing the inclination of the vector with respect to the XY plane.
  • the direction information includes the horizontal angle ⁇ and the elevation angle ⁇ indicating the sound source direction (the direction of arrival of the sound 2).
  • The horizontal angle θ, the elevation angle φ, and the incidental information value I are expressed by the following equations, respectively:

        θ = atan2(y, x)                   (1)
        φ = atan2(z, √(x² + y²))          (2)
        I = √(x² + y² + z²)               (3)
  • The vector estimation unit 21 outputs the three-dimensional vector P so that the direction information and the incidental information are calculated by performing polar coordinate conversion on the three-dimensional vector P. Therefore, the angles θ and φ representing the sound source direction and the incidental information value I can be easily calculated by converting the three-dimensional vector P into polar coordinates according to equations (1) to (3), as sketched below.
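  • For reference, a minimal sketch of this conversion, assuming the vector components are held in NumPy arrays; the function name is illustrative, not from the patent:

```python
# Minimal sketch of equations (1)-(3): converting the estimated vector
# P = (x, y, z) into the horizontal angle, elevation angle, and the
# incidental information value (the vector's magnitude).
import numpy as np

def polar_decompose(x, y, z):
    theta = np.arctan2(y, x)                  # (1) horizontal angle
    phi = np.arctan2(z, np.hypot(x, y))       # (2) elevation angle
    i_val = np.sqrt(x**2 + y**2 + z**2)       # (3) incidental information
    return theta, phi, i_val
```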
  • Learning data generated based on the sound signal is used for learning of the vector estimation unit 21 (learner).
  • For the input, feature data of sound signals to which teacher labels are attached is used. For example, the feature data (amplitude spectrum and phase difference spectrum) of a sound signal containing a human voice becomes the input data, and a three-dimensional vector P representing the arrival direction (sound source direction) of the voice and the incidental information (volume, etc.) is attached to that feature data as a teacher label. This makes it possible to train the learner so that the vector representing the sound source direction and the incidental information can be estimated via polar coordinate conversion.
  • The method of generating learning data is not limited. For example, sound signals in which the position of the sound source is varied can be simulated by performing a convolution with impulse responses; by repeating this while changing the type of voice, learning data with many different teacher labels can be prepared easily (see the sketch after this list).
  • Learning data obtained by actually recording the target sound may also be used.
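  • A minimal sketch of this data-generation idea, assuming pre-measured (or simulated) multi-channel room impulse responses are available; the function name and array shapes are illustrative assumptions, not from the patent:

```python
# Sketch: simulate an M-channel observation of a dry (clean) voice signal by
# convolving it with the impulse response measured at each microphone for a
# known source position. The known position yields the teacher label.
import numpy as np

def simulate_observation(dry_voice, impulse_responses):
    """dry_voice: (num_samples,) clean source signal.
    impulse_responses: (M, ir_len) one room impulse response per microphone,
    measured or simulated for a given source position (an assumption here)."""
    return np.stack([np.convolve(dry_voice, ir) for ir in impulse_responses])
```

  • A teacher label (answer vector) pointing in the known source direction, with its magnitude set to the chosen incidental quantity, is then attached to the feature data computed from the simulated signals.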
  • NN: Neural Network
  • MLP: Multilayer Perceptron
  • CNN: Convolutional Neural Network
  • RNN: Recurrent Neural Network
  • LSTM: Long Short-Term Memory Network
  • the learner may be configured by using an arbitrary algorithm applicable to estimation of the sound source direction and the like.
  • the learner can be regarded as a function that converts the feature data 6 into a three-dimensional vector P (hereinafter referred to as a function A). Therefore, it can be said that training the learner is a process of optimizing the function A so that the three-dimensional vector P can be calculated appropriately.
  • The post-processing unit 22 converts the three-dimensional vector P into polar coordinates to calculate the direction information and the incidental information. Specifically, the horizontal angle θ and the elevation angle φ in the sound source direction and the incidental information value I are calculated from the three-dimensional vector P according to equations (1) to (3). The post-processing unit 22 can also execute various other operations using the three-dimensional vector P.
  • The vector estimation unit 21 and the post-processing unit 22 described above function as the output unit according to the present embodiment.
  • FIG. 3 is a flowchart showing the basic operation of the processing unit 100.
  • the process shown in FIG. 3 is a process that is repeatedly executed at a predetermined processing rate.
  • the preprocessing unit 20 calculates the feature data 6 of the sound signal 5 (step 101).
  • the vector estimation unit 21 outputs the three-dimensional vector P with the feature data 6 as an input (step 102).
  • The post-processing unit 22 converts the three-dimensional vector P into polar coordinates and calculates the direction information (θ and φ) and the incidental information (I) of the voice 2 (step 103).
  • In this way, the processing unit 100 takes as input quantities representing the features of the multi-channel sound signals, outputs a three-dimensional vector P = (x, y, z) expressing the direction information and the other incidental information, and calculates the information (θ, φ, I) related to the voice 2 by polar coordinate conversion.
  • the processing unit 100 continuously calculates the direction information and the attached information at a predetermined processing rate. This makes it possible to constantly monitor the direction in which the voice 2 is emitted.
  • Steps 101, 102, and 103 will now be described in detail, taking as an example the case where the volume of the voice 2 is set as the incidental information (I). The following description also applies when the incidental information is set to another value.
  • In step 101, the amplitude spectrum of each of the plurality of sound signals 5 and the phase difference spectrum between the plurality of sound signals 5 are calculated as the feature data 6 of the plurality of sound signals 5.
  • the amplitude spectrum is a spectrum representing the intensity of each frequency component.
  • the phase difference spectrum is a spectrum representing the phase difference for each frequency component.
  • The preprocessing unit 20 reads the sound signals 5 (M-channel sound signals 5) output from the M microphones 11, records them in a storage unit such as a buffer, and performs a short-time Fourier transform on each sound signal 5.
  • In the short-time Fourier transform, each target sound signal 5 is divided into sections of a predetermined interval (time frames t), and the Fourier transform is executed on the signal contained in each divided section.
  • The time frames t may overlap one another or may be disjoint.
  • The sound signal 5 output from the m-th microphone 11 at sampling time τ is denoted s_m(τ), and the complex spectrum calculated by the short-time Fourier transform of s_m(τ) is denoted S_m(t, f).
  • The amplitude spectrum |S_m(t, f)| of each complex spectrum S_m(t, f) is calculated. That is, M channels of amplitude spectra are calculated from the M channels of complex spectra.
  • The phase difference spectrum is calculated as arg(S_m(t, f) / S_j(t, f)), where arg is a function that calculates the argument (declination) of a complex number, j is a reference channel, and m represents the channels other than j. That is, M-1 channels of phase difference spectra are calculated from the M channels of complex spectra.
  • The input section length Ti is set longer than the interval of the time frames t, so the input section contains a plurality of time frames t. The preprocessing unit 20 therefore outputs spectrum data for 2M-1 channels, consisting of amplitude spectra for M channels and phase difference spectra for M-1 channels.
  • The data size of the spectrum data is the number of channels × the section length Ti × the number of frequency bins F. The input data Di is thus expressed as Di(c, t, f), where c is an index indicating each channel of the spectrum data. A sketch of this feature extraction follows.
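  • A minimal sketch of this feature extraction, assuming SciPy's STFT; the function name, sampling rate, and window length are illustrative assumptions, not values from the patent:

```python
# Sketch: compute the input data Di(c, t, f) from M-channel sound signals,
# i.e. M amplitude spectra plus M-1 phase-difference spectra (2M-1 channels).
import numpy as np
from scipy.signal import stft

def compute_feature_data(signals, fs=16000, nperseg=512, ref_ch=0):
    """signals: (M, num_samples) multi-channel sound signals s_m(tau)."""
    # Complex spectra S_m(t, f) via the short-time Fourier transform.
    _, _, S = stft(signals, fs=fs, nperseg=nperseg)      # shape (M, F, T)
    amplitude = np.abs(S)                                # M amplitude spectra
    # Phase-difference spectra arg(S_m / S_j) against reference channel j.
    phase_diff = np.angle(S / (S[ref_ch] + 1e-12))       # shape (M, F, T)
    phase_diff = np.delete(phase_diff, ref_ch, axis=0)   # keep M-1 channels
    Di = np.concatenate([amplitude, phase_diff], axis=0) # (2M-1, F, T)
    return np.transpose(Di, (0, 2, 1))                   # Di(c, t, f)
```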
  • FIG. 4 is a data plot showing an example of feature data.
  • FIG. 4 shows plots representing the spectrum data of the amplitude spectra and the phase difference spectra.
  • The four plots from the top are the amplitude spectra (|S_m(t, f)|), and the three plots below them are the phase difference spectra (arg(S_m(t, f) / S_j(t, f))).
  • In each plot, the horizontal axis is time (time frame t) and the vertical axis is frequency (frequency bin f).
  • The gray-scale color of each point represents the amplitude or the phase difference.
  • In sections where the amplitude spectrum is dark (large amplitude), the observed sound includes, for example, the voice 2 that is the target wave, the ambient noise 3, and the like, and a phase difference corresponding to the deviation in the timing at which each microphone 11 detects the sound appears.
  • In sections where the amplitude spectrum is gray (small amplitude), the sound is relatively quiet or only the noise 3 is present; in this case, the phase difference for each frequency is substantially random.
  • The data section included in the input section length Ti is indicated by the solid black border.
  • The data of each plot included in this section becomes the input data Di(c, t, f) input to the vector estimation unit 21.
  • In the example shown in FIG. 4, M = 4, so the number of channels is 2M-1 = 7 and the data size of the input data Di(c, t, f) is 7 × Ti × F.
  • The three-dimensional vector P is estimated by the vector estimation unit 21 to which the input data Di(c, t, f) is input.
  • The three-dimensional vector P is output over a section of interval length To (hereinafter described as the output section length To).
  • That is, the learner constituting the vector estimation unit 21 functions as a function A that converts the input Di into the output Do.
  • the function A is optimized and determined by a machine learning algorithm such as deep learning. It should be noted that among the parameters constituting the function A, there may be a parameter for accumulating the past processing results. By using such past processing results for the optimization of the function A, it is possible to improve the estimation accuracy of the sound source direction and the detection accuracy of the attached information.
  • FIG. 5 is a graph of the three-dimensional vector P output from the feature data shown in FIG. 4.
  • graphs of each component x (t), y (t), and z (t) of the three-dimensional vector P are shown in order from the top.
  • the horizontal axis of each graph is time, and the vertical axis is the size of each component.
  • the scale of the vertical axis is appropriately set for each graph.
  • The direction information and the incidental information are calculated from the three-dimensional vector P(t) by the following polar coordinate conversion; equations (4) to (6) correspond to equations (1) to (3) described with reference to FIG. 2:

        θ(t) = atan2(y(t), x(t))                      (4)
        φ(t) = atan2(z(t), √(x(t)² + y(t)²))          (5)
        I(t) = √(x(t)² + y(t)² + z(t)²)               (6)

  • Equation (4) gives the horizontal angle θ(t) in the sound source direction in time frame t, equation (5) gives the elevation angle φ(t) in the sound source direction in time frame t, and equation (6) gives the incidental information value I(t) in time frame t, here the volume of the voice 2.
  • In this way, the post-processing unit 22 calculates the sound source direction and the incidental information (the volume of the voice 2) for each frame from the three-dimensional vector P(t).
  • Let V_m(t, f) denote the component of the voice 2 in the complex spectrum S_m(t, f) (equation (7)). The vector estimation unit 21 (function A) is trained so that I(t), the magnitude of the three-dimensional vector P(t), becomes the power of the voice 2 at a specific microphone 11 (here, the k-th); in this case, I(t) is expressed by equation (8).
  • That is, the function A is optimized so that I(t) calculated by equation (6) satisfies the relationship of equation (8).
  • Ideally, even if a sound signal 5 disturbed by the noise 3 is input, the incidental information I(t) output from the vector estimation unit 21 represents only the power (volume) of the voice 2, regardless of the power of the noise 3. This corresponds to detection of the voice 2. Therefore, by setting the power of the voice 2 as the incidental information, it is possible to realize voice activity detection (VAD) or the like, which detects the sections in which the voice 2 occurs.
  • For example, the power may be set to 0 when the voice 2 is absent, and expressed on a logarithmic scale when the voice 2 is present; in this case, I(t) is expressed by equation (9).
  • the method of expressing the volume of the voice 2 is not limited.
  • FIG. 6 is a graph of the sound source direction and the volume of the voice 2 calculated from the three-dimensional vector shown in FIG. 5.
  • FIG. 6 shows graphs of the horizontal angle ⁇ (t) in the sound source direction, the elevation angle ⁇ (t) in the sound source direction, and the volume I (t) of the sound 2 in this order from the top.
  • the horizontal axis of each graph is time.
  • the vertical axis of the graph of the horizontal angle ⁇ (t) and the elevation angle ⁇ (t) is the angle.
  • the vertical axis of the graph of the volume I (t) of the voice 2 represents the loudness (power) of the sound.
  • The values of θ(t) and φ(t) change from 0° to a constant angle at each peak of the graph of I(t). In the example shown in FIG. 6, therefore, the voice 2 emitted by the human 1 from the same direction is detected. If, for example, conversations of humans 1 at different positions were observed, the direction of the human 1 who uttered each voice 2 would be estimated as the sound source direction for each peak of the voice 2. As described above, in the present embodiment, it is possible to accurately detect the direction in which the human 1 who emitted the voice 2 is present, together with the volume of the voice 2.
  • the target section corresponds to a predetermined period.
  • The aggregation of the three-dimensional vectors P is executed by the post-processing unit 22. Specifically, the sum of each component x(t), y(t), and z(t) of the three-dimensional vectors P(t) output in the target section is calculated. For example, to acquire the sound source direction of the utterance immediately before a certain time t_c, a time t_p earlier than t_c is used, and the sums x_u, y_u, and z_u of the components are calculated as:

        x_u = Σ x(t),  y_u = Σ y(t),  z_u = Σ z(t)   (t = t_p, ..., t_c)   (10)
  • The time t_p corresponds to the start time of the target section, and the time t_c corresponds to its end time. Therefore, x_u, y_u, and z_u are the components of a vector (hereinafter referred to as the aggregate vector) obtained by synthesizing the three-dimensional vectors P output in the target section.
  • In the present embodiment, the aggregate vector corresponds to the second vector.
  • Polar coordinate conversion is executed on the aggregate vector whose components x_u, y_u, and z_u are calculated according to equation (10). The horizontal angle θ_u and the elevation angle φ_u in the sound source direction of the utterance immediately before the time t_c are calculated as:

        θ_u = atan2(y_u, x_u)                     (11)
        φ_u = atan2(z_u, √(x_u² + y_u²))          (12)
  • Here the sum of each component over the target section is calculated, but the average of each component over the target section may be calculated instead; that is, x_u, y_u, and z_u in equation (10) are divided by the number of time frames included in the target section.
  • The vector represented by these averages is likewise an aggregate vector calculated by synthesizing the three-dimensional vectors P.
  • In this way, the aggregate vector is calculated by synthesizing the three-dimensional vectors P output in the target section, and the arrival direction of the voice 2 in the target section is calculated based on the aggregate vector, as in the sketch below.
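  • A minimal sketch of this time aggregation (equations (10) to (12)), under the same illustrative NumPy conventions as the earlier sketches:

```python
# Sketch: aggregate the per-frame vectors P(t) over the target section
# t_p..t_c (eq. (10)) and convert the aggregate vector to a direction
# (eqs. (11), (12)). Frames without voice contribute near-zero vectors.
import numpy as np

def utterance_direction(P, t_p, t_c):
    """P: (3, T) array of per-frame vectors P(t)."""
    x_u, y_u, z_u = P[:, t_p:t_c + 1].sum(axis=1)
    theta_u = np.arctan2(y_u, x_u)               # (11) horizontal angle
    phi_u = np.arctan2(z_u, np.hypot(x_u, y_u))  # (12) elevation angle
    return theta_u, phi_u
```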
  • The target section (time t_p to time t_c) may include sections without the voice 2.
  • In sections without the voice 2, the values of x(t), y(t), and z(t) are sufficiently small, ideally 0. The components from such sections therefore have little influence on the calculation result, and the sound source direction for the sections of the target section that contain the voice 2 can be acquired with high accuracy.
  • Alternatively, a method is conceivable in which the sections of the target section that correspond to the voice 2 are identified first and the direction is then estimated based on the result.
  • In that case, heuristic processing using various parameters, empirical rules, and the like for judging the certainty of the voice sections may be required, and the estimation accuracy may be lowered.
  • In the present embodiment, the sound source direction for the sections containing the voice 2 is easily calculated simply by synthesizing the three-dimensional vectors P over the target section. That is, no heuristic processing for determining the voice sections is necessary, and the sound source direction can be estimated with high accuracy.
  • The vector estimation unit 21 (function A) may instead be trained so that I(t), the magnitude of the three-dimensional vector P(t) in equation (6), is the existence probability of the voice 2.
  • The existence probability of the voice 2 is a probability indicating whether or not the voice 2 is present.
  • In this case, I(t) is expressed by equation (13).
  • The vector estimation unit 21 optimized according to equation (13) outputs, for example, a three-dimensional vector P having a magnitude of 0 to 1.
  • The three-dimensional vector P may be output as it is, with I(t) taking a value from 0 to 1. This makes it possible to realize an application that performs a predetermined process when the voice 2 is likely to exist (for example, when the existence probability is 0.5 or more). Alternatively, the output may be controlled so that I(t) takes a value of either 0 or 1, which simplifies the subsequent processing.
  • the method of setting the existence probability of voice 2 is not limited.
  • As the power used to define the existence probability, the average of the voice 2 power over the plurality of microphones 11 included in the microphone array 10 may be used.
  • In this case, the function A is optimized so that the existence probability of the voice 2 becomes 1 when the average power is larger than a predetermined threshold ε.
  • The predetermined threshold ε can be set arbitrarily according to the configuration of the microphones 11 and the like; a sketch of such a labeling rule follows.
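  • A minimal sketch of such a teacher-label rule; the function name and inputs are illustrative assumptions:

```python
# Sketch: teacher value for I(t) as an existence probability, set to 1 when
# the voice power averaged over the M microphones exceeds a threshold.
import numpy as np

def existence_label(voice_powers, epsilon):
    """voice_powers: (M,) per-microphone voice power in one time frame."""
    return 1.0 if np.mean(voice_powers) > epsilon else 0.0
```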
  • Further, I(t), the magnitude of the three-dimensional vector P(t), may be set to the signal-to-noise ratio between the voice 2 and the noise 3. In this case, I(t) is expressed by, for example, equation (14).
  • The estimation accuracy of the sound source direction generally correlates with the signal-to-noise ratio: when the signal-to-noise ratio is small the estimation accuracy tends to be low, and when it is large the accuracy tends to be high. Therefore, by setting the power ratio between the voice 2 and the noise 3 as the incidental information, the output value I(t) can be interpreted as the reliability of the sound source direction estimate for each time frame. That is, using equation (14) sets the reliability regarding the arrival direction of the voice 2 as the incidental information.
  • the method for expressing the signal-to-noise ratio is not limited to the method represented by equation (14).
  • the signal-to-noise ratio may be expressed using the average values of the powers of the voice 2 and the noise 3 detected by the plurality of microphones 11 included in the microphone array 10.
  • More generally, an arbitrary parameter capable of expressing the reliability of the arrival direction of the voice 2 may be used as I(t).
  • If an erroneous sound source direction estimate is adopted, the quality of the user experience may be significantly impaired.
  • One example is an application in which a robot turns toward the user when the user speaks. In this case, if the estimated sound source direction is erroneous, the robot may turn in an unrelated direction when the user speaks.
  • When the reliability is low, alternative processing is therefore executed without adopting the sound source direction estimate at that time.
  • a process of notifying the user that the sound source direction could not be estimated or that the reliability is low is executed. Examples of the notification method include execution of a gesture indicating that the voice 2 could not be heard, display of a message, lighting of a lamp, and the like. This avoids the situation where the robot turns in an unrelated direction.
  • a process of switching the method of estimating the direction in which the user is located from a method using the microphone 11 to another method such as a method using the camera is executed. That is, when it is difficult to estimate the direction by the sound signal due to the influence of noise 3 or the like, a process of searching for a user by using image recognition or the like is executed. This makes it possible to properly detect the direction in which the user is, even when the estimation of the sound source direction does not work. In this way, by performing the alternative processing based on the reliability of the sound source direction estimation, it is possible to sufficiently avoid the deterioration of the quality of the user experience.
  • Input data with a teacher label is used for learning of the learner constituting the vector estimation unit 21.
  • This teacher label is a vector (answer vector) representing the sound source direction, volume, etc., which should be estimated from the corresponding input data.
  • The accuracy of the learner is evaluated by comparing the three-dimensional vector P that the learner outputs for the input data with the answer vector.
  • the Euclidean distance between the three-dimensional vector P and the answer vector is calculated.
  • The Euclidean distance is the distance in the three-dimensional Euclidean space represented by the three-dimensional Cartesian coordinate system described with reference to FIG. 2.
  • This Euclidean distance can represent the amount of deviation of the three-dimensional vector P with respect to the answer vector representing the correct answer.
  • In the present embodiment, the mean squared error (MSE: Mean Squared Error) is calculated using this Euclidean distance.
  • the method of expressing the error is not limited.
  • That is, the vector estimation unit 21 is a learner that outputs the three-dimensional vector P corresponding to the input data and uses an error based on the Euclidean distance between the output three-dimensional vector P and the answer vector corresponding to that input data for learning.
  • When the Euclidean distance is small, the error of the learner is small; when the Euclidean distance is large, the error of the learner is large.
  • Since the output format of the vector estimation unit 21 (learner) is a three-dimensional vector P that expresses the sound source direction and the incidental information in an integrated manner, the error of the three-dimensional vector P can be easily calculated by computing the Euclidean distance from the answer vector.
  • Moreover, the three parameters of horizontal angle θ, elevation angle φ, and incidental information can be evaluated at the same time. In a learner that outputs the horizontal angle θ, the elevation angle φ, etc. directly, a rule for identifying 0° with 360° must be provided, and heuristic processing is required to calculate the error. By using the format that outputs a three-dimensional vector P as in the present disclosure, such heuristic processing can be avoided and highly accurate error evaluation can be performed, which makes it possible to dramatically improve the learning accuracy of the learner.
  • For learning, an error backpropagation method that adjusts the weights using the error may be used. Even with such an algorithm, expressing the sound source direction not as angles but as a three-dimensional vector P in three-dimensional Euclidean space enables stable error backpropagation, making algorithms using backpropagation easy to implement. A sketch of the loss follows.
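  • A minimal sketch of this error calculation; names are illustrative:

```python
# Sketch: mean squared error of the Euclidean distance between the output
# three-dimensional vectors and the answer vectors.
import numpy as np

def euclidean_mse(pred, answer):
    """pred, answer: (batch, 3) arrays of three-dimensional vectors."""
    return np.mean(np.sum((pred - answer) ** 2, axis=-1))
```

  • Because this loss is a plain distance in three-dimensional Euclidean space, there is no 0°/360° wrap-around to handle, which is the advantage described above.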
  • As described above, in the processing unit 100 according to the present embodiment, a three-dimensional vector P is output from the feature data 6 of the plurality of sound signals 5 in which the voice 2 is observed.
  • This three-dimensional vector P represents both the direction information of the arrival direction of the voice 2 and the incidental information regarding the voice 2.
  • The direction information and the incidental information are thus collectively output as one vector, which makes it possible to detect the direction of the voice 2 and the other incidental information (the volume of the voice 2) with high accuracy.
  • When sound source direction estimation and voice detection are configured as separate algorithms, it is generally difficult to optimize both as a whole. For example, if the voice could be detected in advance, the direction could be estimated with higher accuracy; conversely, if the direction of the voice could be estimated in advance, the voice could be detected with higher accuracy. Each optimization thus requires the other's processing result, and as a result, algorithms individually optimized for each process may have to be adopted.
  • In the present embodiment, the vector estimation unit 21 outputs a single three-dimensional vector P representing both the sound source direction and the volume (incidental information) of the voice 2.
  • The three-dimensional vector P is thus a vector representing both the estimation result of the sound source direction and the detection result of voice detection. That is, by outputting the three-dimensional vector P, a plurality of problems can be solved optimally at the same time. As a result, the estimation accuracy of the sound source direction and the detection accuracy of the voice 2 can be significantly improved, and the calculation efficiency can be sufficiently increased. In addition, separate algorithms need not be developed, so development costs can be significantly reduced.
  • The present inventor evaluated the sound source direction estimates obtained with the three-dimensional vector P according to the present technology, using data (sound signals 5) detected by a microphone array 10 mounted on a specific device.
  • For the evaluation, the proportion of cases in which the error of the horizontal angle θ fell within a predetermined angle range was measured in multiple environments and compared with other sound source direction estimation methods. As the predetermined angle range, a range set based on the angle of view of a camera was adopted.
  • The method of expressing the sound source direction and the incidental information with one vector was found to greatly improve the estimation accuracy of the sound source direction. This makes it possible to improve the operating accuracy of systems that perform voice processing and the like, and to provide highly reliable voice applications using the present technology.
  • FIG. 7 is a block diagram showing a configuration example of the processing unit 200 according to the second embodiment.
  • the processing unit 200 is an arithmetic unit that calculates information of voice 2, and has a pre-processing unit 220, a vector estimation unit 221 and a post-processing unit 222.
  • the preprocessing unit 220 is configured in the same manner as the preprocessing unit 20 shown in FIG. 1, for example, and outputs the feature data 6 of the plurality of sound signals 5 output from the microphone array 10. Note that in FIG. 7, the microphone array is not shown.
  • The vector estimation unit 221 outputs a three-dimensional vector P representing the direction information and the incidental information for each frequency component included in the sound signals 5, based on the feature data 6. Specifically, the learner constituting the vector estimation unit 221 is trained to output a three-dimensional vector P for each frequency bin f. The mean squared error between the three-dimensional vector P and the answer vector, calculated for each frequency bin f, is used for the learning.
  • The post-processing unit 222 executes conversion and aggregation processing on the three-dimensional vectors P output for each frequency component (frequency bin), and calculates the direction information indicating the sound source direction and the incidental information regarding the voice 2.
  • FIG. 8 is a data plot showing an example of feature data.
  • The feature data 6 (amplitude spectra and phase difference spectra) is calculated by the preprocessing unit 220 in the same manner as described with reference to FIG. 4.
  • FIG. 8 shows plots representing the spectrum data of the amplitude spectra and the phase difference spectra.
  • The output data Do is expressed as Do(c, t, f), where c is an index representing each component of the three-dimensional vector P.
  • The data size of Do(c, t, f) is 3 × To × F.
  • That is, the vector estimation unit 221 functions as a function B that converts the input Di into the output Do.
  • The case where the volume of the voice 2 is set as the incidental information targeted by the function B will be described below as an example.
  • FIG. 9 is a data plot showing the three-dimensional vector P output from the feature data shown in FIG. 8.
  • FIG. 9 shows, in order from the top, data plots of each component x(t, f), y(t, f), and z(t, f) of the three-dimensional vector P(t, f).
  • In each plot, the horizontal axis is time and the vertical axis is frequency; the value of each component is shown in gray scale.
  • The data section included in the output section length To is indicated by the solid black border. The data of each plot included in this section becomes the output data Do(c, t, f) output from the vector estimation unit 221 (function B).
  • the sound source direction and attached information are calculated from the output data Do (c, t, f).
  • For each time frame t and frequency bin f, polar coordinate conversion is executed as in the following equations (15) to (17), which correspond to equations (1) to (3) described with reference to FIG. 2:

        θ(t, f) = atan2(y(t, f), x(t, f))                          (15)
        φ(t, f) = atan2(z(t, f), √(x(t, f)² + y(t, f)²))           (16)
        I(t, f) = √(x(t, f)² + y(t, f)² + z(t, f)²)                (17)

  • Equation (15) gives the horizontal angle θ(t, f) in the sound source direction, equation (16) gives the elevation angle φ(t, f) in the sound source direction, and equation (17) gives the incidental information value I(t, f), here the volume of the voice 2.
  • In this way, the post-processing unit 222 calculates the sound source direction and the incidental information (the volume of the voice 2) for each time frame and frequency from the three-dimensional vector P(t, f).
  • FIG. 10 is a data plot showing the volume of voice 2 calculated from the three-dimensional vector P shown in FIG.
  • the horizontal axis of FIG. 10 is time, and the vertical axis is frequency. Further, the volume (power) of the voice 2 in each time frame t and the frequency bin f is shown in gray scale.
  • Here, the function B is optimized so that I(t, f) becomes the power (spectrogram) of the voice 2 for each frequency bin at a specific microphone (here, the k-th).
  • Using the component V_k(t, f) of the voice 2 in the complex spectrum, introduced in equation (7), I(t, f) is expressed by equation (18), and the function B is optimized so that I(t, f) calculated by equation (17) satisfies the relationship of equation (18).
  • As a result, even if a sound signal 5 disturbed by noise is input, the output incidental information ideally represents the power (volume) of the voice 2 for each frequency bin, regardless of the presence or absence of the noise 3.
  • That is, the data plot shown in FIG. 10 represents a voice signal containing only the response of the voice 2, extracted from the original sound signal that included the noise 3 and the like.
  • The frequency distribution of I(t, f) calculated according to equation (17) is therefore the frequency distribution of the power of the voice 2 in time frame t, that is, the amplitude spectrum of the voice 2.
  • This amplitude spectrum does not include the spectrum of the noise 3 or the like.
  • In this way, a voice signal representing the amplitude spectrum of the voice 2 is calculated based on the three-dimensional vectors P output for each frequency component. This makes it possible to perform highly accurate voice recognition or the like using a voice signal in which the noise 3 is suppressed, and to significantly improve the processing accuracy of various applications using the voice 2.
  • This speech enhancement process (the process of extracting the voice signal) can also be regarded as voice activity detection (VAD) for each frequency bin. Therefore, in the present embodiment, when the volume of the voice 2 is set as the incidental information, speech enhancement, voice activity detection, and sound source direction estimation are solved by one calculation, providing a single, totally optimized algorithm that performs three processes at once. A sketch of the extraction follows.
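  • A minimal sketch of recovering the voice spectrogram from the per-frequency vectors, assuming P(t, f) is stored as a (3, T, F) array:

```python
# Sketch: the magnitude of each per-frequency vector (eq. (17)) gives the
# voice power I(t, f) for each time frame and frequency bin, with the
# noise contribution ideally suppressed.
import numpy as np

def voice_spectrogram(P):
    """P: (3, T, F) array of three-dimensional vectors P(t, f)."""
    return np.sqrt(np.sum(P ** 2, axis=0))  # I(t, f), shape (T, F)
```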
  • Next, an overall three-dimensional vector representing the overall sound source direction and the overall incidental information in a certain time frame t is calculated.
  • In the following, this vector calculated from the three-dimensional vectors P(t, f) will be referred to as the overall vector P(t).
  • In the present embodiment, the overall vector P(t) corresponds to the first vector.
  • The three-dimensional vectors P(t, f) output from the vector estimation unit 221 are summed over the frequency bins, so that the components x(t), y(t), and z(t) of the overall vector P(t) are given by:

        x(t) = Σ_f x(t, f),  y(t) = Σ_f y(t, f),  z(t) = Σ_f z(t, f)   (20)
  • The direction of the overall vector P(t) calculated by equation (20) represents the arrival direction (sound source direction) of the voice 2 generated at timing t.
  • In this way, the overall vector P(t) representing the arrival direction of the voice 2 is calculated by synthesizing the three-dimensional vectors P(t, f) output for each frequency component.
  • The magnitude of the overall vector P(t) represents the overall value I(t) of the incidental information regarding the voice 2.
  • FIG. 11 is a graph of the entire vector P (t) calculated from the three-dimensional vector P (t, f) shown in FIG.
  • graphs of each component x (t), y (t), and z (t) of the entire vector P (t) are shown in order from the top.
  • the horizontal axis of each graph is time, and the vertical axis is the size of each component.
  • the scale of the vertical axis is appropriately set for each graph.
  • The graphs shown in FIG. 11 are obtained by adding the components output individually for each frequency bin in the frequency direction, and correspond to the components of the three-dimensional vector P(t) described with reference to FIG. 5. That is, by synthesizing the three-dimensional vectors P(t, f) in the post-processing unit 222, a vector (the overall vector P(t)) similar to the three-dimensional vector P(t) output by the vector estimation unit 21 (function A) of the first embodiment can be calculated.
  • The post-processing unit 222 then executes polar coordinate conversion on the overall vector P(t) to calculate the horizontal angle θ(t), the elevation angle φ(t), and the incidental information I(t) of the voice 2.
  • θ(t), φ(t), and I(t) are obtained from the components of the overall vector P(t) in the same way as equations (1) to (3); a sketch follows.
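  • A minimal sketch of computing the overall vector and its polar decomposition, under the same (3, T, F) convention:

```python
# Sketch: sum the per-frequency vectors over the frequency bins (eq. (20))
# to obtain the overall vector P(t), then convert it to the sound source
# direction and the overall incidental information value.
import numpy as np

def overall_direction(P):
    """P: (3, T, F) array of three-dimensional vectors P(t, f)."""
    x, y, z = P.sum(axis=2)                      # eq. (20): sum over f
    theta = np.arctan2(y, x)                     # horizontal angle
    phi = np.arctan2(z, np.hypot(x, y))          # elevation angle
    i_val = np.sqrt(x**2 + y**2 + z**2)          # overall incidental info
    return theta, phi, i_val
```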
  • FIG. 12 is a graph of the sound source direction and the volume calculated from the entire vector shown in FIG.
  • FIG. 12 shows graphs of the horizontal angle ⁇ (t) in the sound source direction, the elevation angle ⁇ (t) in the sound source direction, and the volume I (t) of the sound 2 in this order from the top.
  • the horizontal axis of each graph is time.
  • the vertical axis of the graph of the horizontal angle ⁇ (t) and the elevation angle ⁇ (t) is the angle.
  • the vertical axis of the graph of the volume I (t) of the voice 2 represents the loudness (power) of the sound.
  • At each peak of I(t), the volume of the voice 2 is high, and it can be seen that the voice 2 is detected. Further, the values of θ(t) and φ(t) change from 0° to a constant angle at each peak of I(t); the voices 2 detected as the peaks of I(t) are therefore all emitted from the same direction.
  • Note that when the magnitude I(t, f) of the three-dimensional vector P(t, f) is set to the power of the voice 2 shown in equation (18), the magnitude I(t) of the overall vector P(t) corresponds to the power of the voice 2 shown in equation (19), and can be regarded as the power of the voice 2 shown in equation (9).
  • The present technology is not limited to sound waves. For example, vibration detectors that detect vibrations on or under the ground may be placed at multiple locations, and the feature data (amplitude spectra and phase difference spectra) of the vibration signals output from each detector input to the learner.
  • The learner is trained in advance to output a three-dimensional vector representing the arrival direction of a seismic wave and its intensity based on the feature data of the vibration signals. This makes it possible to detect the arrival direction and intensity of seismic waves with high accuracy.
  • More generally, the present technology can be applied to various wave phenomena that propagate in space, such as electromagnetic waves and gravitational waves.
  • the information processing device may be realized by an arbitrary computer that is configured separately from the processing unit and is connected to the processing unit via wire or wirelessly.
  • the information processing method according to the present technology may be executed by a cloud server.
  • the information processing method according to the present technology may be executed in conjunction with the processing unit and another computer.
  • the information processing method and program according to the present technology can be executed not only in a computer system composed of a single computer but also in a computer system in which a plurality of computers operate in conjunction with each other.
  • the system means a set of a plurality of components (devices, modules (parts), etc.), and it does not matter whether or not all the components are in the same housing. Therefore, a plurality of devices housed in separate housings and connected via a network, and one device in which a plurality of modules are housed in one housing are both systems.
  • The execution of the information processing method and the program according to the present technology by the computer system includes both the case where, for example, the acquisition of the feature data and the output of the three-dimensional vector are executed by a single computer, and the case where each process is executed by a different computer. The execution of each process by a predetermined computer also includes having another computer execute part or all of the process and acquiring the result.
  • the information processing method and program related to this technology can be applied to a cloud computing configuration in which one function is shared by a plurality of devices via a network and processed jointly.
  • In the present disclosure, “same”, “equal”, “orthogonal”, and the like are concepts that include “substantially the same”, “substantially equal”, “substantially orthogonal”, and the like. For example, states included within a predetermined range (for example, a range of ±10%) based on “completely the same”, “completely equal”, “completely orthogonal”, and the like are also included.
  • this technology can also adopt the following configurations.
(1) An information processing device, including: an acquisition unit that acquires feature data of a plurality of signals observing a target wave; and an output unit that outputs, based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.
(2) The information processing device according to (1), wherein the output unit outputs the three-dimensional vector so that the direction of the three-dimensional vector represents the direction information and the magnitude of the three-dimensional vector represents the incidental information.
(3) The information processing device according to (1) or (2), wherein the output unit outputs the three-dimensional vector so that the direction information and the incidental information are calculated by performing polar coordinate conversion on the three-dimensional vector.
(4) The information processing device according to any one of (1) to (3), wherein the direction information includes a horizontal angle and an elevation angle indicating the arrival direction of the target wave.
(5) The information processing device according to any one of (1) to (4), wherein the output unit converts the three-dimensional vector into polar coordinates to calculate the direction information and the incidental information.
(6) The information processing device according to any one of (1) to (5), wherein the target wave is voice, and the plurality of signals are sound signals observing the voice.
(7) The information processing device according to (6), wherein the direction information indicates the arrival direction of the voice.
(8) The information processing device according to (6) or (7), wherein the incidental information includes any one of the volume of the voice, the existence probability of the voice, and the reliability of the arrival direction of the voice.
(9) The information processing device according to any one of (6) to (8), wherein the output unit outputs the three-dimensional vector representing the direction information and the incidental information for each frequency component included in the sound signal.
(10) The information processing device according to (9), wherein the incidental information is the volume of the voice for each frequency component, and the output unit calculates an audio signal representing the amplitude spectrum of the voice based on the three-dimensional vectors output for each frequency component.
(11) The information processing device according to (9) or (10), wherein the output unit synthesizes the three-dimensional vectors output for each frequency component and calculates a first vector representing the arrival direction of the voice.
(12) The information processing device according to (11), wherein the output unit calculates the direction information and the incidental information based on the first vector.
(13) The information processing device according to any one of (6) to (12), wherein the output unit calculates a second vector by synthesizing the three-dimensional vectors output within a predetermined period, and calculates the arrival direction of the voice in the predetermined period based on the second vector.
(14) The information processing device according to any one of (6) to (13), wherein the plurality of signals are the sound signals detected by each of a plurality of sound collectors arranged at positions different from each other.
(15) The information processing device according to any one of (1) to (14), wherein the feature data includes an amplitude spectrum of each of the plurality of signals and a phase difference spectrum between the plurality of signals.
(16) The information processing device according to any one of (1) to (15), wherein the output unit is a learner that outputs the three-dimensional vector corresponding to input data and uses an error according to the Euclidean distance between the output three-dimensional vector and an answer vector corresponding to the input data for learning.
(17) An information processing method executed by a computer system, including: acquiring feature data of a plurality of signals observing a target wave; and outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.
(18) A program that causes a computer system to execute: a step of acquiring feature data of a plurality of signals observing a target wave; and a step of outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.

Abstract

The information processing device according to one embodiment of the present invention is provided with an acquisition unit and an output unit. The acquisition unit acquires feature data on a plurality of signals obtained through observation of a target wave. On the basis of the acquired feature data, the output unit outputs a three-dimensional vector that represents direction information indicative of an arrival direction of the target wave and attribute information pertaining to the target wave.

Description

Information processing device, information processing method, and program
This technology relates to information processing devices, information processing methods, and programs that can be applied to detect voice and the like.
Patent Document 1 describes an acoustic signal processing device that estimates the direction of a sound source. In this device, surrounding sounds are captured by a plurality of microphones, and a plurality of acoustic signals are generated. In addition, the cross-correlation value of the acoustic signals of the microphones is calculated as a sound space feature amount. The sound source direction of the target sound is estimated using this sound space feature amount. Further, in the acoustic signal processing device, the reliability of the sound source direction estimate is calculated by using a higher-order statistic of the sound space feature amount (see paragraphs [0035], [0040], and [0044] and FIG. 2 of the specification of Patent Document 1).
Japanese Unexamined Patent Publication No. 2011-139409
By observing waves that propagate through space, such as the acoustic waves described above, it is possible to estimate the direction from which a wave arrives, the characteristics of the wave, and so on, and various applications are expected. For this reason, there is a demand for a technique for detecting the direction of a target wave and other attached information with high accuracy.
In view of the above circumstances, an object of the present technology is to provide an information processing device, an information processing method, and a program capable of accurately detecting the direction of a target wave and other attached information.
In order to achieve the above object, an information processing device according to one embodiment of the present technology includes an acquisition unit and an output unit.
The acquisition unit acquires feature data of a plurality of signals observing a target wave.
Based on the acquired feature data, the output unit outputs a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.
In this information processing device, a three-dimensional vector is output with the feature data of a plurality of signals observing the target wave as input. This three-dimensional vector represents the direction information of the arrival direction of the target wave and the incidental information regarding the target wave. In this way, the direction information and the incidental information are output together as one vector. This makes it possible to detect the direction of the target wave and other attached information with high accuracy.
The output unit may output the three-dimensional vector so that the direction of the three-dimensional vector represents the direction information and the magnitude of the three-dimensional vector represents the incidental information.
The output unit may output the three-dimensional vector so that the direction information and the incidental information are calculated by performing polar coordinate conversion on the three-dimensional vector.
The direction information may include a horizontal angle and an elevation angle indicating the arrival direction of the target wave.
The output unit may convert the three-dimensional vector into polar coordinates to calculate the direction information and the incidental information.
The target wave may be voice. In this case, the plurality of signals may be sound signals observing the voice.
The direction information may be information indicating the arrival direction of the voice.
The incidental information may include any one of the volume of the voice, the existence probability of the voice, and the reliability of the arrival direction of the voice.
The output unit may output the three-dimensional vector representing the direction information and the incidental information for each frequency component included in the sound signal.
The incidental information may be the volume of the voice for each frequency component. In this case, the output unit may calculate an audio signal representing the amplitude spectrum of the voice based on the three-dimensional vectors output for each frequency component.
The output unit may synthesize the three-dimensional vectors output for each frequency component to calculate a first vector representing the arrival direction of the voice.
The output unit may calculate the direction information and the incidental information based on the first vector.
The output unit may calculate a second vector by synthesizing the three-dimensional vectors output within a predetermined period, and may calculate the arrival direction of the voice in the predetermined period based on the second vector.
The plurality of signals may be the sound signals detected by each of a plurality of sound collectors arranged at positions different from each other.
The feature data may include an amplitude spectrum of each of the plurality of signals and a phase difference spectrum between the plurality of signals.
The output unit may be a learner that outputs the three-dimensional vector corresponding to input data and uses an error according to the Euclidean distance between the output three-dimensional vector and an answer vector corresponding to the input data for learning.
An information processing method according to one embodiment of the present technology is an information processing method executed by a computer system, and includes acquiring feature data of a plurality of signals observing a target wave.
Based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave is output.
A program according to one embodiment of the present technology causes a computer system to execute the following steps:
a step of acquiring feature data of a plurality of signals observing a target wave; and
a step of outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.
FIG. 1 is a block diagram showing a configuration example of a processing unit according to a first embodiment of the present technology.
FIG. 2 is a schematic diagram for explaining a three-dimensional vector.
FIG. 3 is a flowchart showing the basic operation of the control unit.
FIG. 4 is a data plot showing an example of feature data.
FIG. 5 is a graph of the three-dimensional vectors output from the feature data shown in FIG. 4.
FIG. 6 is a graph of the sound source direction and the voice volume calculated from the three-dimensional vectors shown in FIG. 5.
FIG. 7 is a block diagram showing a configuration example of a processing unit according to a second embodiment.
FIG. 8 is a data plot showing an example of feature data.
FIG. 9 is a data plot showing the three-dimensional vectors output from the feature data shown in FIG. 8.
FIG. 10 is a data plot showing the voice volume calculated from the three-dimensional vectors shown in FIG. 9.
FIG. 11 is a graph of the whole vector calculated from the three-dimensional vectors shown in FIG. 9.
FIG. 12 is a graph of the sound source direction and volume calculated from the whole vector shown in FIG. 11.
Hereinafter, embodiments according to the present technology will be described with reference to the drawings.
<First Embodiment>
[Processing unit configuration]
FIG. 1 is a block diagram showing a configuration example of a processing unit according to the first embodiment of the present technology. The processing unit 100 is an arithmetic unit that calculates information on a specific sound to be observed from sound signals obtained by observing sound (sound waves). As will be described later, the processing unit 100 executes calculations that compute information on the voice 2, with the voice 2 of a human 1 as the observation target.
As shown in FIG. 1, the processing unit 100 is used by being connected to a microphone array 10. The microphone array 10 has a plurality of microphones 11. The microphone 11 is an element that detects surrounding sounds and outputs a sound signal corresponding to the detected sound, and functions as a sound collector. The sound signal output from the microphone 11 is an electric signal whose amplitude changes with time according to the surrounding sounds. The time variation of this amplitude represents the pitch, loudness, waveform, and the like of the sound. The sound signal is typically output as an analog signal and converted into a digital signal using an A/D converter or the like (not shown). The specific configuration of the microphone 11 is not limited, and any element capable of detecting surrounding sounds and outputting a sound signal may be used as the microphone 11.
Around the microphone array 10, voice 2 emitted by a human 1 is generated. Therefore, the plurality of signals output from the microphone array 10 are sound signals obtained by observing the voice 2. In addition, not only the voice 2 but also other sounds such as noise 3 are generated around the microphone array 10. Therefore, the sound signals 5 include signals corresponding to the noise 3 and the like in addition to the voice 2. In FIG. 1, the voice 2 and the noise 3 generated around the microphone array 10 are schematically illustrated by using arrows.
Further, the plurality of microphones 11 constituting the microphone array 10 are arranged at positions different from each other. Therefore, the plurality of signals output from the microphone array 10 are sound signals detected by each of the plurality of microphones 11 arranged at different positions. For this reason, for example, even when the same voice 2 is detected, the timing at which the voice 2 is detected, the loudness of the detected voice 2, and the like differ for each microphone 11. Therefore, the sound signal output by each microphone 11 is a signal corresponding to the position where the microphone 11 is arranged.
The microphone array 10 is mounted on, for example, a robot or the like. In this case, the plurality of microphones 11 are arranged in a housing such as that of a robot. Further, for example, the microphone array 10 may be mounted on a stationary device or the like. Alternatively, a plurality of microphones 11 may be arranged in an indoor space, a vehicle interior space, or the like to form the microphone array 10.
Note that the microphone array 10 only needs to include at least two microphones 11. In the present embodiment, the microphone array 10 is composed of four or more microphones 11. For example, by arranging four or more microphones 11 so as not to be included in the same plane, it is possible to properly detect the direction of a sound source and the like. Other than this, the specific configuration of the microphone array 10 is not limited.
The processing unit 100 has the hardware configuration required for a computer, such as a CPU and memory (RAM, ROM). Various processes are executed by the CPU loading the program stored in the ROM into the RAM and executing it. The program is installed in the processing unit 100, for example, via various recording media. Alternatively, the program may be installed via the Internet or the like.
As the processing unit 100, for example, a device such as a PLD (Programmable Logic Device) such as an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit), may be used. In this embodiment, the processing unit 100 corresponds to an information processing device.
When the CPU of the processing unit 100 executes the program according to the present embodiment, a preprocessing unit 20, a vector estimation unit 21, and a post-processing unit 22 are realized as functional blocks. The information processing method according to the present embodiment is executed by these functional blocks. In order to realize each functional block, dedicated hardware such as an IC (integrated circuit) may be used as appropriate.
The preprocessing unit 20 acquires the feature data 6 of the plurality of sound signals 5 in which the voice 2 is observed. Specifically, the preprocessing unit 20 reads the plurality of sound signals 5 (multi-channel sound signals 5) output from the microphone array 10 and calculates the feature data 6 based on each of the read sound signals 5. That is, the preprocessing unit 20 acquires the feature data 6 by calculating the feature data 6 from the plurality of sound signals 5. In the present embodiment, the preprocessing unit 20 corresponds to the acquisition unit.
In the present disclosure, the feature data 6 is data that can represent features of, for example, the plurality of sound signals 5. For example, a predetermined conversion process is executed on the sound signals 5, and data representing the features of the sound signals 5 is generated. Further, for example, the sound signals 5 themselves can be used as the feature data 6. In the present embodiment, a Fourier transform is executed on the sound signals 5, and the amplitude spectrum, the phase difference spectrum, and the like of the sound signals 5 are calculated as the feature data 6. This point will be described in detail later with reference to FIG. 4 and the like.
The vector estimation unit 21 outputs, based on the feature data 6 acquired by the preprocessing unit 20, a three-dimensional vector P representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave. Specifically, the vector estimation unit 21 is configured by a learner trained to take the feature data 6 as input and output the three-dimensional vector P representing the direction information and the incidental information.
In this embodiment, the target wave is a sound (sound wave) to be observed by the processing unit 100. In the processing unit 100, among the sounds (sound waves) detected by the microphones 11, the voice 2 emitted by a human 1 is set as the observation target. That is, the target wave is the voice 2. For example, when there are a plurality of humans 1 around the microphone array 10, all the voices 2 emitted by each human 1 are target waves.
Note that the target wave is not limited to the voices 2 of an unspecified number of humans 1; for example, the voice 2 of a specific human 1 can also be set as the target wave. Further, in addition to the voice 2, a specific sound such as a hand clap or the ringing of a bell may be set as the target wave. Ambient noise 3 or the like may also be set as the target wave. In addition, the target wave can be set arbitrarily according to the purpose of the processing unit 100 and the like.
The direction information is information indicating the arrival direction of the voice 2. That is, it can be said that the direction information is information indicating the direction in which the human 1 who emitted the voice 2 is located (the sound source direction). In the following, the arrival direction of the voice 2 may be simply described as the sound source direction. For example, reference coordinates are set for the microphone array 10 described above. Information indicating the direction from which the voice 2 arrives with respect to the origin of the reference coordinates, that is, the direction in which the human 1 who emitted the voice 2 is located as seen from the reference coordinates, becomes the direction information. The method of setting the reference coordinates is not limited and can be set arbitrarily.
The incidental information is information obtained incidentally to the target voice 2, and is expressed using a one-dimensional value (norm). For example, the incidental information is set to the volume of the voice 2. More specifically, the magnitude (power) of the voice 2 emitted by the human 1 at a certain timing is set as the incidental information. In addition to this, it is possible to set the probability representing the presence or absence of the voice 2, the reliability regarding the arrival direction (direction information) of the voice 2, or the like as the incidental information. The specific content of the incidental information is not limited, and, for example, an arbitrary one-dimensional quantity that can be calculated from the feature data 6 may be set as the incidental information.
FIG. 2 is a schematic diagram for explaining the three-dimensional vector P. FIG. 2 illustrates an orthogonal coordinate system represented by an X-axis, a Y-axis, and a Z-axis that are orthogonal to each other. This orthogonal coordinate system serves as the reference coordinates. The thick arrow in the figure is an example of the three-dimensional vector P output from the vector estimation unit 21. In the following, the X-axis, Y-axis, and Z-axis components of the three-dimensional vector P are denoted x, y, and z, respectively. The vector estimation unit 21 outputs each component of the vector P(x, y, z) as the three-dimensional vector P.
The vector estimation unit 21 (learner) is trained so that the direction of the three-dimensional vector P is the arrival direction of the voice 2 and the magnitude I of the three-dimensional vector P is the value of the incidental information. That is, the vector estimation unit 21 outputs the three-dimensional vector P so that the direction of the three-dimensional vector P represents the direction information and the magnitude of the three-dimensional vector P represents the incidental information. For example, as seen from the origin O, the direction indicated by the three-dimensional vector P is the sound source direction, and its magnitude represents the value of the incidental information (the volume of the voice 2 or the like). In other words, the individual components x, y, and z of the three-dimensional vector P do not directly represent the direction information or the incidental information, but the direction information and the incidental information are expressed by the vector represented by the components x, y, and z.
Further, the vector estimation unit 21 is trained so that the horizontal angle θ of the sound source direction, the elevation angle φ of the sound source direction, and the value I of the incidental information are obtained by converting the three-dimensional vector P into polar coordinates. Here, the horizontal angle θ is an angle representing the orientation (azimuth) of the vector with respect to the X-axis in the XY plane. The elevation angle φ is an angle representing the inclination of the vector with respect to the XY plane. As described above, in the present embodiment, the direction information includes the horizontal angle θ and the elevation angle φ indicating the sound source direction (the arrival direction of the voice 2). The horizontal angle θ, the elevation angle φ, and the incidental information I are expressed by the following equations, respectively.
$$\theta = \tan^{-1}\!\left(\frac{y}{x}\right) \tag{1}$$

$$\phi = \tan^{-1}\!\left(\frac{z}{\sqrt{x^{2}+y^{2}}}\right) \tag{2}$$

$$I = \sqrt{x^{2}+y^{2}+z^{2}} \tag{3}$$
In this way, the vector estimation unit 21 outputs the three-dimensional vector P so that the direction information and the incidental information are calculated by performing polar coordinate conversion on the three-dimensional vector P. Therefore, for example, when calculating the direction information and the incidental information, the angles θ and φ representing the sound source direction and the value I of the incidental information can be easily calculated by converting the three-dimensional vector P into polar coordinates according to equations (1) to (3).
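For illustration only, the polar coordinate conversion of equations (1) to (3) can be sketched in Python with NumPy as follows. This is a minimal sketch, not part of the present disclosure; the function name vector_to_direction is an assumption introduced here, and arctan2 is used in place of the plain arctangent so that all quadrants are handled.

```python
import numpy as np

def vector_to_direction(p):
    """Convert a 3D vector P = (x, y, z) into (theta, phi, I) per Eqs. (1)-(3).

    theta: horizontal angle in the XY plane, phi: elevation from the XY plane,
    I: vector magnitude carrying the incidental information (e.g. voice power).
    """
    x, y, z = p
    theta = np.arctan2(y, x)             # horizontal angle (quadrant-safe tan^-1)
    phi = np.arctan2(z, np.hypot(x, y))  # elevation angle
    intensity = np.linalg.norm(p)        # magnitude I
    return theta, phi, intensity

# Example: a vector at 45 degrees azimuth with magnitude sqrt(3)
theta, phi, I = vector_to_direction(np.array([1.0, 1.0, 1.0]))
```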
Learning data generated based on sound signals is used for training the vector estimation unit 21 (learner). As the learning data, feature data of sound signals to which teacher labels are attached is used. For example, the feature data (amplitude spectra and phase difference spectra) of sound signals including human voice becomes the input data. Further, a three-dimensional vector P representing the arrival direction (sound source direction) of the voice and the incidental information (volume or the like) is attached to the feature data as a teacher label. This makes it possible to train the learner so as to estimate a vector that expresses the sound source direction and the incidental information via polar coordinate conversion.
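The disclosure describes the learner as using an error according to the Euclidean distance between the output three-dimensional vector and the answer (teacher) vector (see configuration (16) above). A minimal sketch of such a loss follows; the batch layout and the mean reduction are assumptions of this sketch.

```python
import numpy as np

def euclidean_loss(p_pred, p_true):
    """Error according to the Euclidean distance between output 3D vectors
    and teacher vectors; both arguments have shape (batch, 3)."""
    return np.mean(np.linalg.norm(p_pred - p_true, axis=-1))
```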
The method of generating the learning data is not limited. For example, by performing the convolution operations used in techniques such as impulse responses, it is possible to simulate sound signals in which the position of the sound source or the like is changed. By doing this while changing the type of voice, it is possible to easily prepare a plurality of pieces of learning data with teacher labels. When a sound other than voice is used as the target wave, learning data obtained by sampling the target sound (applause, a bell, or the like) may be used.
An arbitrary algorithm used in machine learning such as deep learning can be applied to the learner constituting the vector estimation unit 21. For example, as algorithms using a neural network (NN), it is possible to use algorithms (learning models) such as the perceptron, the multilayer perceptron (MLP), the convolutional neural network (CNN), the recurrent neural network (RNN), and the LSTM network (Long Short-Term Memory network). In addition to these, the learner may be configured using any algorithm applicable to estimation of the sound source direction and the like.
The learner can be regarded as a function that converts the feature data 6 into the three-dimensional vector P (hereinafter referred to as function A). Therefore, it can be said that training the learner is a process of optimizing function A so that the three-dimensional vector P can be calculated appropriately.
Returning to FIG. 1, the post-processing unit 22 converts the three-dimensional vector P into polar coordinates to calculate the direction information and the incidental information. Specifically, according to equations (1) to (3) above, the horizontal angle θ and the elevation angle φ of the sound source direction and the value I of the incidental information are calculated from the three-dimensional vector P. In addition, the post-processing unit 22 can execute various calculations using the three-dimensional vector P. The vector estimation unit 21 and the post-processing unit 22 described above function as the output unit according to the present embodiment.
FIG. 3 is a flowchart showing the basic operation of the control unit. The process shown in FIG. 3 is repeatedly executed at a predetermined processing rate. First, the preprocessing unit 20 calculates the feature data 6 of the sound signals 5 (step 101). Next, the vector estimation unit 21 outputs the three-dimensional vector P with the feature data 6 as input (step 102). Then, the post-processing unit 22 converts the three-dimensional vector P into polar coordinates, and the direction information (θ and φ) and the incidental information (I) of the voice 2 are calculated (step 103).
In this way, the processing unit 100 takes quantities representing the features of the multi-channel sound signals as input, outputs a three-dimensional vector P(x, y, z) expressing the direction information and the other incidental information, and calculates (θ, φ, I) as information on the voice 2 by polar coordinate conversion. When the process in step 103 is completed, the process for the next timing is started. Therefore, the processing unit 100 continuously calculates the direction information and the incidental information at a predetermined processing rate. This makes it possible to constantly monitor the direction in which the voice 2 is emitted.
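As a rough illustration of the repeated execution of steps 101 to 103, the following loop sketches the pipeline. The helper names compute_features and estimate_vector are placeholders standing in for the preprocessing unit 20 and the vector estimation unit 21; they are assumptions of this sketch, not part of the present disclosure.

```python
import numpy as np

def run_pipeline(frames, compute_features, estimate_vector):
    """Repeat steps 101-103 at the processing rate: feature data -> 3D vector
    -> polar conversion into (theta, phi, I) for each incoming block."""
    results = []
    for block in frames:                      # one block per processing step
        d_i = compute_features(block)         # step 101: feature data D_i
        p = estimate_vector(d_i)              # step 102: 3D vector P(t)
        theta = np.arctan2(p[1], p[0])        # step 103: polar conversion
        phi = np.arctan2(p[2], np.hypot(p[0], p[1]))
        intensity = np.linalg.norm(p)
        results.append((theta, phi, intensity))
    return results
```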
In the following, the operations of steps 101, 102, and 103 will be specifically described, taking as an example the case where the volume of the voice 2 is set as the incidental information (I). The following description is also applicable when the incidental information is set to another value.
[Calculation of feature data]
In step 101, the amplitude spectrum of each of the plurality of sound signals 5 and the phase difference spectra between the plurality of sound signals 5 are calculated as the feature data 6 of the plurality of sound signals 5. The amplitude spectrum is a spectrum representing the intensity of each frequency component. The phase difference spectrum is a spectrum representing the phase difference for each frequency component. These two types of data become the data input to the vector estimation unit 21.
First, the preprocessing unit 20 reads the sound signals 5 output from the M microphones 11 (M-channel sound signals 5), records them in a storage unit such as a buffer, and executes a short-time Fourier transform on each sound signal 5. In the short-time Fourier transform, the target signal (sound signal 5) is divided at a predetermined time Δ, and the Fourier transform is executed on the signal included in each divided section. Here, a section delimited by the predetermined time Δ is described as a time frame t (t = 1, 2, ...). The sections of the time frames t may overlap or may be separate.
Further, the sound signal 5 output from the m-th of the M microphones 11 at a sampling time τ is written s_m(τ), and the complex spectrum calculated by the short-time Fourier transform of s_m(τ) is written S_m(t, f). Here, m is an index representing each microphone 11 and is a natural number of M or less (m = 1, 2, ..., M). Also, f is an index representing each frequency bin in the short-time Fourier transform and is an integer equal to or less than the number of frequency bins F (f = 1, 2, ..., F).
As the amplitude spectrum, the absolute value |S_m(t, f)| of the complex spectrum S_m(t, f) is calculated. That is, M channels of amplitude spectra are calculated from the M channels of complex spectra. As the phase difference spectrum, the phase difference arg(S_m(t, f) / S_j(t, f)) with respect to a specific reference channel (the microphone 11 with m = j) is calculated for each of the other channels. Here, arg is a function that calculates the argument (angle), and m represents a channel other than j. That is, M-1 channels of phase difference spectra are calculated from the M channels of complex spectra.
As the data input to the vector estimation unit 21 (input data D_i), data in which the amplitude spectra and phase difference spectra described above are grouped over a certain section length T_i (hereinafter referred to as the input section length T_i) is used. The input section length T_i is set to a section longer than the interval Δ of the time frames t described above; that is, the input section length T_i contains a plurality of time frames t. Therefore, the preprocessing unit 20 outputs spectrum data for 2M-1 channels, including the amplitude spectra for M channels and the phase difference spectra for M-1 channels.
The data size of the spectrum data is the number of channels × the section length T_i × the number of frequency bins F. Therefore, the input data D_i is expressed as D_i(c, t, f), where c is an index indicating each channel of the spectrum data. Here, the channel index c is (c = 1, 2, ..., 2M-1), the time frame t is (t = 1, 2, ..., T_i), and the frequency bin f is (f = 1, 2, ..., F).
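The construction of the input data D_i(c, t, f) described above can be sketched as follows, assuming NumPy and SciPy for the short-time Fourier transform. The function name, the default STFT parameters, and the small epsilon guard against division by zero are all assumptions of this sketch.

```python
import numpy as np
from scipy.signal import stft

def compute_input_data(signals, fs, ref_ch=0, nperseg=512):
    """Build D_i(c, t, f): M amplitude spectra plus M-1 phase-difference
    spectra relative to a reference microphone, i.e. 2M-1 channels.

    signals: array of shape (M, num_samples), one row per microphone.
    """
    specs = []
    for s in signals:                       # complex spectra S_m(t, f)
        _, _, Z = stft(s, fs=fs, nperseg=nperseg)
        specs.append(Z.T)                   # -> (time frames, freq bins)
    specs = np.stack(specs)                 # (M, T, F)

    amplitude = np.abs(specs)               # |S_m(t, f)|, M channels
    ref = specs[ref_ch]
    others = np.delete(specs, ref_ch, axis=0)
    phase_diff = np.angle(others / (ref + 1e-12))  # arg(S_m / S_j), M-1 channels

    return np.concatenate([amplitude, phase_diff], axis=0)  # (2M-1, T, F)
```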
FIG. 4 is a data plot showing an example of the feature data. FIG. 4 shows band-shaped plots representing the spectrum data of the amplitude spectra and the phase difference spectra. FIG. 4 is an example in which four microphones 11 (M = 4) are used; the top four plots are the amplitude spectra (|S_m(t, f)|), and the three plots below them are the phase difference spectra (arg(S_m(t, f) / S_j(t, f))). In each data plot, the horizontal axis is time (time frame t) and the vertical axis is frequency (frequency bin f). The gray-scale color of each point represents the amplitude or the phase difference.
For example, at a time when a black plot with strong amplitude is detected in the amplitude spectrum, it is considered that a sound with a volume corresponding to the amplitude is occurring. The sounds represented by the black plots include, for example, the voice 2 that is the target wave, the surrounding noise 3, and the like. In a state where sound is occurring in this way, as shown in the phase difference spectra, a phase difference corresponding to the difference in the timing at which the sound is detected, and the like, is detected. On the other hand, in regions where the amplitude spectrum is gray, the sound is relatively quiet, or only the noise 3 is occurring. In this case, the phase difference for each frequency is substantially random.
Also, in FIG. 4, the data section included in the input section length T_i is illustrated by a solid black frame. The data of each data plot included in this section becomes the input data D_i(c, t, f) input to the vector estimation unit 21. For example, in FIG. 4, since M = 4, the number of channels is 7, and the data size of the input data D_i(c, t, f) is 7 × T_i × F.
[Estimation of the 3D vector]
In step 102, the three-dimensional vector P is estimated by the vector estimation unit 21 to which the input data D_i(c, t, f) is input. In the present embodiment, the learner is configured so that data in which the three-dimensional vectors P are grouped over a certain section length T_o (hereinafter referred to as the output section length T_o) is output as the output data D_o.
The vector estimation unit 21 outputs, as the output data D_o, the three-dimensional vector P(t) = (x(t), y(t), z(t)) in each time frame t, collected over the number of frames included in the output section length T_o. Therefore, the output data D_o is expressed as D_o(c, t), where c is an index representing each component of the three-dimensional vector P. Here, the component index c is (c = 1, 2, 3), and the time frame t is (t = 1, 2, ..., T_o). Therefore, the data size of D_o(c, t) is 3 × T_o.
In this way, the vector estimation unit 21 functions as a function A that converts the input D_i into the output D_o. As described above, in algorithm development, function A is optimized and determined by a machine learning algorithm such as deep learning. Among the parameters constituting function A, there may be parameters that accumulate past processing results. By using such past processing results for the optimization of function A, it is possible to improve the estimation accuracy of the sound source direction and the detection accuracy of the incidental information.
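The disclosure leaves the concrete form of function A open. Purely as one hedged possibility, the following PyTorch sketch maps the input D_i to per-frame three-dimensional vectors and keeps a recurrent state corresponding to the parameters that accumulate past processing results. The architecture, the layer sizes, and the assumption that T_o equals T_i here are all choices of this sketch, not of the present disclosure.

```python
import torch
import torch.nn as nn

class VectorEstimator(nn.Module):
    """Sketch of function A: maps D_i of shape (2M-1, T_i, F) to per-frame
    3D vectors (x, y, z); the GRU state carries past processing results."""

    def __init__(self, channels, freq_bins, hidden=128):
        super().__init__()
        self.conv = nn.Conv2d(channels, 16, kernel_size=3, padding=1)
        self.gru = nn.GRU(16 * freq_bins, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 3)     # (x, y, z) per time frame

    def forward(self, d_i, state=None):
        # d_i: (batch, channels, T_i, F)
        h = torch.relu(self.conv(d_i))       # (batch, 16, T_i, F)
        h = h.permute(0, 2, 1, 3).flatten(2) # (batch, T_i, 16*F)
        h, state = self.gru(h, state)        # recurrent state across calls
        return self.head(h), state          # (batch, T_i, 3)
```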
FIG. 5 is a graph of the three-dimensional vectors output from the feature data shown in FIG. 4. In FIG. 5, graphs of the components x(t), y(t), and z(t) of the three-dimensional vector P are shown in order from the top. The horizontal axis of each graph is time, and the vertical axis is the magnitude of each component. The scale of the vertical axis is set appropriately for each graph.
Also, in FIG. 5, the data section included in the output section length T_o is illustrated by a solid black frame. The data of each graph included in this section becomes the output data D_o(c, t) output from the vector estimation unit 21 (function A). The output section length T_o is set, for example, to the same length (Δ) as one time frame t. In this case, the number of frames included in the output data D_o(c, t) is 1, and one three-dimensional vector P(t) = (x(t), y(t), z(t)) is output. When the output section length T_o includes a plurality of time frames t, a three-dimensional vector P(t) is output for each time frame t.
[Calculation of the sound source direction and incidental information]
In step 103, the sound source direction and the incidental information are calculated from the output data D_o(c, t). Specifically, the post-processing unit 22 performs polar coordinate conversion on the three-dimensional vector P(t) = (x(t), y(t), z(t)) included in the output data D_o(c, t), as shown in the following equations.
$$\theta(t) = \tan^{-1}\!\left(\frac{y(t)}{x(t)}\right) \tag{4}$$

$$\phi(t) = \tan^{-1}\!\left(\frac{z(t)}{\sqrt{x(t)^{2}+y(t)^{2}}}\right) \tag{5}$$

$$I(t) = \sqrt{x(t)^{2}+y(t)^{2}+z(t)^{2}} \tag{6}$$
Equations (4) to (6) correspond to equations (1) to (3) described with reference to FIG. 2, respectively. Equation (4) is the horizontal angle θ(t) of the sound source direction in time frame t. Equation (5) is the elevation angle φ(t) of the sound source direction in time frame t. Equation (6) is the value I of the incidental information in time frame t, which is the volume of the voice 2. In this way, the post-processing unit 22 calculates the sound source direction and the incidental information (the volume of the voice 2) for each frame from the three-dimensional vector P(t).
[When the volume of the voice 2 is set as the incidental information]
Here, the case where the volume of the voice 2 is set as the incidental information will be described. The complex spectrum S_m(t, f): (f = 1, 2, ..., F) in time frame t at the M microphones 11 is composed of a component V_m(t, f) of the voice 2 (the human voice) and a component N_m(t, f) of the other noise 3, as shown in the following equation.
$$S_m(t, f) = V_m(t, f) + N_m(t, f) \tag{7}$$
The vector estimation unit 21 (function A) is trained targeting the component V_m(t, f) of the voice 2 in the complex spectrum S_m(t, f). Specifically, in equation (6) above, I(t), which is the magnitude of the three-dimensional vector P(t), is made to be the power of the voice at a specific microphone 11 (here, the k-th microphone). In this case, I(t) is expressed by the following equation.
$$I(t) = \sum_{f=1}^{F} \left|V_k(t, f)\right|^{2} \tag{8}$$
When training the vector estimation unit 21, function A is optimized so that I(t) calculated by equation (6) satisfies the relationship of equation (8). As a result, the incidental information I(t) output from the vector estimation unit 21 ideally represents only the power (volume) of the voice 2, independent of the power of the noise 3, even when a sound signal 5 disturbed by the noise 3 is input. This amounts to being able to detect the voice 2. Therefore, by setting the power of the voice 2 as the incidental information, it is possible to realize voice activity detection (VAD), which detects the sections in which the voice 2 occurs, and the like.
Further, for example, the power may be 0 when the voice 2 is not present, and the power may be expressed on a logarithmic scale when the voice 2 is present. In this case, I(t) is expressed by the following equation.
$$I(t) = \log\!\left(1 + \sum_{f=1}^{F} \left|V_k(t, f)\right|^{2}\right) \tag{9}$$
By setting I(t) in this way and optimizing function A, it becomes possible to separate the noise 3 and the voice 2 with high accuracy. Moreover, since the power becomes 0 in the state where the voice 2 is not occurring, the voice 2 can be detected easily. Other than this, the method of expressing the volume of the voice 2 is not limited.
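Assuming the reconstructed forms of equations (8) and (9) above, the regression target I(t) could be computed from the clean speech spectrum V_k(t, f) as in the following sketch; the function name and the log1p form are assumptions of this sketch.

```python
import numpy as np

def voice_power_target(v_k, log_scale=False):
    """Target magnitude I(t) from the clean speech spectrum V_k of the k-th
    microphone, shape (T, F): per-frame power (Eq. (8)), optionally on a
    logarithmic scale so that I(t) = 0 when no speech is present (Eq. (9))."""
    power = np.sum(np.abs(v_k) ** 2, axis=-1)  # per-frame power over bins
    return np.log1p(power) if log_scale else power
```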
FIG. 6 is a graph of the sound source direction and the volume of the voice 2 calculated from the three-dimensional vectors shown in FIG. 5. FIG. 6 shows, in order from the top, graphs of the horizontal angle θ(t) of the sound source direction, the elevation angle φ(t) of the sound source direction, and the volume I(t) of the voice 2. The horizontal axis of each graph is time. The vertical axis of the graphs of the horizontal angle θ(t) and the elevation angle φ(t) is the angle. The vertical axis of the graph of the volume I(t) of the voice 2 represents the loudness (power) of the sound.
For example, in the graph of I(t), in the sections where peaks are detected, a human 1 around the microphone array 10 is speaking and the voice 2 is occurring. Conversely, in the graph of I(t), in the sections where the volume is substantially 0, the voice 2 is not occurring. In this way, by referring to I(t), it is possible to detect the voice 2 occurring around the microphone array 10 with high accuracy.
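Since I(t) ideally tracks only the power of the voice 2, a simple voice activity detection can be obtained by thresholding it. The following sketch is illustrative only; the function name and the threshold value are arbitrary assumptions.

```python
import numpy as np

def detect_voice_sections(intensity, threshold=0.1):
    """Return (start, end) frame index pairs of sections where I(t) exceeds
    a threshold, i.e. a minimal VAD on the estimated voice power."""
    active = intensity > threshold
    edges = np.flatnonzero(np.diff(active.astype(int)))  # run boundaries
    bounds = np.concatenate([[0], edges + 1, [len(active)]])
    return [(b, e) for b, e in zip(bounds[:-1], bounds[1:]) if active[b]]
```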
Also, in FIG. 6, the values of θ(t) and φ(t) each change from 0° to a constant angle corresponding to each peak of the graph of I(t). Therefore, in the example shown in FIG. 6, voices 2 emitted by a human 1 in the same direction are detected. Further, for example, when a conversation between humans 1 at mutually different positions is observed, the direction in which the human 1 who uttered each voice 2 exists is estimated as the sound source direction for each peak of the voice 2. In this way, in the present embodiment, it is possible to accurately detect the direction of the person who emitted the voice 2 together with the volume of the voice 2.
[Processing targeting utterance sections and the like]
For example, when a human 1 is speaking, it may be important to estimate the sound source direction and the like over the period of the utterance (the utterance section). In this way, when one is interested in the sound source direction over a certain section length, such as an utterance section, a method of calculating the sound source direction and the like using the three-dimensional vectors P aggregated over that section length is also effective. In the following, the section over which the aggregation is performed is described as the target section. In the present embodiment, the target section corresponds to the predetermined period.
The aggregation of the three-dimensional vectors P is executed by the post-processing unit 22. Specifically, the sums of the components x(t), y(t), and z(t) of the three-dimensional vectors P(t) output within the target section are each calculated. For example, when it is desired to acquire the sound source direction for the immediately preceding utterance at a certain time t_c, using a time t_p earlier than the time t_c, the sums x_u, y_u, and z_u of the components are calculated as follows.
$$x_u = \sum_{t=t_p}^{t_c} x(t), \qquad y_u = \sum_{t=t_p}^{t_c} y(t), \qquad z_u = \sum_{t=t_p}^{t_c} z(t) \tag{10}$$
Here, the time tp corresponds to the start time of the target section, and the time tc corresponds to its end time. Accordingly, xu, yu, and zu can be said to be the components of a vector obtained by combining the three-dimensional vectors P output within the target section (hereinafter referred to as the aggregate vector). In the present embodiment, the aggregate vector corresponds to the second vector.
A polar coordinate transformation is then applied to the aggregate vector whose components xu, yu, and zu were calculated according to equation (10). As a result, the horizontal angle θu and the elevation angle φu of the sound source direction for the utterance immediately preceding the time tc are calculated as follows.
$$\theta_u = \tan^{-1}\frac{y_u}{x_u} \tag{11}$$

$$\phi_u = \tan^{-1}\frac{z_u}{\sqrt{x_u^2 + y_u^2}} \tag{12}$$
Although equation (10) calculates the sum of each component over the target section, the average of each component over the target section may be calculated instead. That is, the average of each component is obtained by dividing xu, yu, and zu in equation (10) by the number of time frames included in the target section. A vector represented by these component averages is also an aggregate vector calculated by combining the three-dimensional vectors P.
As described above, in the present embodiment, the aggregate vector is calculated by combining the three-dimensional vectors P output within the target section, and the arrival direction of the voice 2 in the target section is calculated based on the aggregate vector.
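As a minimal sketch of this aggregation, the following Python snippet sums hypothetical per-frame vectors over a target section and converts the aggregate vector to a direction according to equations (10) to (12); the array layout and frame indices are illustrative assumptions, not part of the embodiment.

```python
import numpy as np

def aggregate_direction(P, t_p, t_c):
    """Estimate the sound source direction over a target section.

    P    : ndarray of shape (T, 3) holding the per-frame vectors P(t)
    t_p  : start frame of the target section
    t_c  : end frame of the target section
    Returns (theta_u, phi_u) in radians, per equations (10) to (12).
    """
    # Equation (10): component-wise sums over the target section.
    x_u, y_u, z_u = P[t_p:t_c + 1].sum(axis=0)
    # Equations (11) and (12): polar coordinate transformation.
    theta_u = np.arctan2(y_u, x_u)               # horizontal angle
    phi_u = np.arctan2(z_u, np.hypot(x_u, y_u))  # elevation angle
    return theta_u, phi_u
```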
With this method, the target section (from time tp to time tc) may include sections without the voice 2. In such sections, as shown in each graph of FIG. 5, the values of x(t), y(t), and z(t) are sufficiently small, ideally zero. Consequently, the component values in the sections without the voice 2 have little influence on the calculation result, and the sound source direction for the sections of the target section that do contain the voice 2 can be acquired with high accuracy.
As a method of estimating, for example, the direction of the speaker in the target section, one could conceive of identifying which parts of the target section correspond to the voice 2 and estimating the direction based on that result. However, identifying the sections corresponding to the voice 2 may involve heuristic processing that relies on various parameters and empirical rules for judging likelihood, which risks degrading the estimation accuracy.
In contrast, in the present embodiment, the sound source direction for the sections containing the voice 2 is easily calculated simply by combining the three-dimensional vectors P over the target section. That is, no heuristic processing for determining which sections contain the voice 2 is needed, and the sound source direction can be estimated with high accuracy.
[Setting the existence probability of the voice 2 as the incidental information]
A case where the existence probability of the voice 2 is set as the incidental information will now be described. Here, as shown in equation (7), the complex spectra Sm(t, f) (f = 1, 2, …, F) of the M microphones 11 in a time frame t are assumed to consist of a voice-2 component Vm(t, f) and a noise-3 component Nm(t, f).
The vector estimation unit 21 (function A) is trained so that, in equation (6) above, I(t), the magnitude of the three-dimensional vector P(t), becomes the existence probability of the voice 2. Here, the existence probability of the voice 2 is a probability expressing whether or not the voice 2 is present. Specifically, the function A is optimized so that the existence probability becomes 1 when the power (volume) of the voice 2 at a specific microphone 11 (here, the k-th) is larger than a predetermined threshold ε. In this case, I(t) is expressed by the following equation.
$$I(t) = \begin{cases} 1 & \text{if } \displaystyle\sum_{f=1}^{F} |V_k(t,f)|^2 > \varepsilon \\[4pt] 0 & \text{otherwise} \end{cases} \tag{13}$$
The vector estimation unit 21 optimized according to equation (13) outputs, for example, a three-dimensional vector P with a magnitude between 0 and 1. At inference time, the three-dimensional vector P may be output as-is so that I(t) takes values from 0 to 1. This makes it possible to implement applications that perform a predetermined process when the voice 2 is likely to be present (for example, when the existence probability is 0.5 or higher). Alternatively, the output may be controlled so that I(t) takes a value of either 0 or 1, which simplifies the downstream processing.
The method of setting the existence probability of the voice 2 is not limited to this. For example, instead of the power of the voice 2 at a specific microphone 11, the average power of the voice 2 over the plurality of microphones 11 included in the microphone array 10 may be used. In this case, the function A is optimized so that the existence probability of the voice 2 becomes 1 when the average power is larger than the predetermined threshold ε. The threshold ε can be set arbitrarily according to the configuration of the microphones 11 and the like.
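As an illustration of how an application might consume this output, the following hedged sketch binarizes a probability-like I(t) and fires a callback; the 0.5 threshold and the callback name are assumptions made for the example, not specified by the embodiment.

```python
import numpy as np

def on_voice_detected(t):
    # Hypothetical application hook; replace with real handling.
    print(f"voice detected at frame {t}")

def detect_voice(P, prob_threshold=0.5):
    """Treat the magnitude of each per-frame 3D vector as an
    existence probability and fire a callback when it is high.

    P : ndarray of shape (T, 3), per-frame vectors P(t)."""
    I = np.linalg.norm(P, axis=1)          # I(t) = |P(t)|, in [0, 1]
    for t, p in enumerate(I):
        if p >= prob_threshold:
            on_voice_detected(t)
    return I >= prob_threshold             # boolean voice-activity mask
```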
[Setting the power ratio between the voice 2 and the noise 3 as the incidental information]
A case where the power ratio between the voice 2 and the noise 3 is set as the incidental information will now be described. That is, the signal-to-noise ratio (S/N ratio) of the voice 2 is set as the incidental information. Here again, as shown in equation (7), the complex spectra Sm(t, f) (f = 1, 2, …, F) of the M microphones 11 in a time frame t are assumed to consist of a voice-2 component Vm(t, f) and a noise-3 component Nm(t, f).
The vector estimation unit 21 (function A) is trained so that, in equation (6) above, I(t), the magnitude of the three-dimensional vector P(t), becomes the signal-to-noise ratio between the voice 2 and the noise 3. Specifically, the function A is optimized so that I(t) represents the ratio of the power of the voice 2 to the power of the noise 3 at a specific microphone 11 (here, the k-th). In this case, I(t) is expressed, for example, by the following equation.
$$I(t) = \frac{\displaystyle\sum_{f=1}^{F} |V_k(t,f)|^2}{\displaystyle\sum_{f=1}^{F} |N_k(t,f)|^2} \tag{14}$$
The estimation accuracy of the sound source direction generally correlates with the signal-to-noise ratio: when the signal-to-noise ratio is small the estimation accuracy tends to be low, and when it is large the accuracy tends to be high. Therefore, by setting the power ratio between the voice 2 and the noise 3 as the incidental information, the output value I(t) can be interpreted as the reliability of the sound source direction estimate for each time frame. In other words, using equation (14) amounts to setting, as the incidental information, a reliability regarding the arrival direction of the voice 2.
The method of expressing the signal-to-noise ratio is not limited to equation (14). For example, the signal-to-noise ratio may be expressed using the average powers of the voice 2 and the noise 3 detected by the plurality of microphones 11 included in the microphone array 10. Besides the signal-to-noise ratio, any parameter capable of expressing the reliability of the arrival direction of the voice 2 may be set as I(t).
In an application that uses sound source direction estimation, attempting a desired operation with an erroneous direction estimate can significantly impair the quality of the user experience. One example is an application in which a robot turns toward the user when the user speaks. In this case, if the estimated sound source direction is wrong, the robot may turn in an unrelated direction when the user speaks.
Such situations can be avoided by using the reliability of the sound source direction estimate. For example, when the reliability is low, an alternative process is executed instead of adopting the sound source direction estimate at that moment. As the alternative process, for example, the user is notified that the sound source direction could not be estimated or that the reliability is low. Notification methods include performing a gesture indicating that the voice 2 could not be heard, displaying a message, lighting a lamp, and the like. This avoids situations in which the robot turns in an unrelated direction.
As another alternative process, the method of estimating the direction of the user may be switched from the method using the microphones 11 to a different method, such as one using a camera. That is, when estimating the direction from the sound signal is difficult due to the influence of the noise 3 or the like, a process of searching for the user using image recognition or the like is executed. This makes it possible to properly detect the direction of the user even when the sound source direction estimation does not work. By performing such alternative processing based on the reliability of the sound source direction estimate, degradation of the user experience can be sufficiently avoided.
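A minimal sketch of such reliability-gated control might look as follows; the threshold value and the camera fallback function are hypothetical stand-ins for whatever the application actually provides.

```python
import numpy as np

RELIABILITY_THRESHOLD = 1.0  # assumed S/N cutoff; tune per device

def face_direction_by_camera():
    # Hypothetical fallback, e.g. face detection on a camera image.
    return None

def decide_turn_direction(P_t):
    """Turn toward the estimated direction only when the magnitude
    of P(t), interpreted as an S/N-based reliability, is high."""
    x, y, z = P_t
    reliability = np.sqrt(x**2 + y**2 + z**2)     # I(t)
    if reliability >= RELIABILITY_THRESHOLD:
        return np.arctan2(y, x)                   # horizontal angle
    return face_direction_by_camera()             # alternative process
```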
[Calculation of error]
Input data with teacher labels is used to train the learner constituting the vector estimation unit 21. Each teacher label is a vector (answer vector) representing the sound source direction, volume, and so on that should be estimated from the corresponding input data. During training, the accuracy of the learner is evaluated by comparing the three-dimensional vector P that the learner outputs for the input data with the answer vector.
Specifically, the Euclidean distance between the three-dimensional vector P and the answer vector is calculated. Here, the Euclidean distance is the distance in the three-dimensional Euclidean space represented by the three-dimensional Cartesian coordinate system described with reference to FIG. 2. This Euclidean distance expresses the deviation of the three-dimensional vector P from the answer vector representing the correct answer.
For example, the mean squared error (MSE) based on this Euclidean distance is calculated as the output error (loss) of the learner. The method of expressing the error is not limited to this. In this way, the vector estimation unit 21 is a learner that outputs the three-dimensional vector P corresponding to the input data and uses, for training, an error corresponding to the Euclidean distance between the output three-dimensional vector P and the answer vector corresponding to the input data.
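As a sketch, this loss can be written directly on the predicted and answer vectors; the batched array shapes below are assumptions made for illustration.

```python
import numpy as np

def euclidean_mse_loss(P_pred, P_true):
    """Mean squared Euclidean distance between predicted 3D vectors
    and answer vectors, each of shape (batch, 3). A single scalar
    jointly penalizes errors in the direction and in the incidental
    information carried by the vector magnitude."""
    sq_dist = np.sum((P_pred - P_true) ** 2, axis=1)  # squared distance per sample
    return sq_dist.mean()
```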
For example, when the Euclidean distance is small, the error of the learner can be evaluated as small, and when the Euclidean distance is large, the error can be evaluated as large. In other words, because the output format of the vector estimation unit 21 (the learner) is a three-dimensional vector P that can express the sound source direction and the incidental information in an integrated manner, the error of the three-dimensional vector P can easily be calculated as the Euclidean distance to the answer vector.
Also, when evaluating the error during training of the machine learning algorithm, evaluating the loss of a single vector evaluates the three parameters, the horizontal angle θ, the elevation angle φ, and the incidental information, simultaneously. By contrast, a learner that directly outputs the horizontal angle θ, the elevation angle φ, and the like needs rules for, for example, identifying 0° with 360°, so heuristic processing becomes necessary for calculating the error. Using the format that outputs a three-dimensional vector P as shown in the present disclosure avoids such heuristic processing and enables highly accurate error evaluation. This makes it possible to dramatically improve the training accuracy of the learner.
Furthermore, algorithms such as neural networks may use backpropagation, which adjusts weights using the error. Even when training such an algorithm, expressing the sound source direction information not as angles but as a three-dimensional vector P in three-dimensional Euclidean space enables stable backpropagation of the error. This makes it easy to implement algorithms that rely on backpropagation.
As described above, in the processing unit 100 according to the present embodiment, the three-dimensional vector P is output by taking as input the feature data 6 of the plurality of sound signals 5 in which the voice 2 is observed. This three-dimensional vector P represents the direction information of the arrival direction of the voice 2 and the incidental information regarding the voice 2. Because the direction information and the incidental information are output together as a single vector, the direction of the voice 2 and the other associated information (the volume of the voice 2) can be detected with high accuracy.
With sound source direction estimation algorithms that estimate the arrival direction of voice and the like, how to integrate information other than the sound source direction is often a practical problem. For example, to implement an application in which a robot turns toward the user when the user speaks, an algorithm that detects the user's voice must be integrated with an algorithm that estimates the arrival direction of the voice.
When the sound source direction estimation and voice detection algorithms are configured separately in this way, it is generally difficult to optimize both as a whole. For example, if the voice could be detected in advance, the direction could be estimated with higher accuracy, and if the direction of the voice could be estimated in advance, the voice could be detected with higher accuracy. In this case, optimizing each process requires the result of the other, so in practice one may be forced to adopt algorithms optimized individually for each process.
In such a configuration, even if an individually optimal algorithm is adopted for each process, there is no guarantee that the overall processing, including sound source direction estimation and voice detection, is optimized. Using separate algorithms therefore raises concerns about accuracy, since the whole is not jointly optimal, and about development efficiency, since each algorithm must be developed independently and development costs increase.
In the present embodiment, the vector estimation unit 21 outputs a three-dimensional vector P representing the sound source direction and the volume of the voice 2 (the incidental information). By expressing multiple pieces of information with a single vector in this way, separate algorithms for sound source direction estimation and voice detection, and an integration algorithm for combining their results, become unnecessary.
The three-dimensional vector P is a vector representing both the estimation result for the sound source direction and the detection result for the voice. That is, outputting the three-dimensional vector P makes it possible to solve multiple problems in a jointly optimal manner. This significantly improves the estimation accuracy of the sound source direction and the detection accuracy of the voice 2, while sufficiently improving computational efficiency. It also removes the need to develop separate algorithms, greatly reducing development costs.
The present inventor evaluated the sound source direction estimation using the three-dimensional vector P according to the present technology, using data (sound signals 5) detected by a microphone array 10 mounted on a specific device. The estimation results were evaluated by measuring, in multiple environments, the proportion of cases in which the error of the horizontal angle θ fell within a predetermined angle range, and comparing it with other sound source direction estimation methods. The predetermined angle range was set based on the angle of view of a camera.
In this evaluation, the evaluation environments were set to have an extremely low signal-to-noise ratio, that is, relatively loud noise, so the other methods could not achieve sufficient accuracy. In contrast, the method using the three-dimensional vector P estimated the direction with markedly higher accuracy. Specifically, across the multiple environments, the other methods achieved a correct answer rate (the proportion falling within the predetermined angle range) of around 40%, whereas the method using the present technology achieved approximately 80% or higher in each environment.
In this way, the technique of expressing the sound source direction and the incidental information with a single vector can greatly improve the estimation accuracy of the sound source direction. This improves the operating accuracy of systems that perform voice processing and the like, and makes it possible to provide highly reliable voice applications using the present technology.
<Second embodiment>
The processing unit 200 of the second embodiment according to the present technology will now be described. In the following description, explanations of parts similar in configuration and operation to the processing unit 100 described in the above embodiment are omitted or simplified.
FIG. 7 is a block diagram showing a configuration example of the processing unit 200 according to the second embodiment. The processing unit 200 is an arithmetic unit that calculates information about the voice 2, and has a preprocessing unit 220, a vector estimation unit 221, and a post-processing unit 222. The preprocessing unit 220 is configured, for example, in the same manner as the preprocessing unit 20 shown in FIG. 1, and outputs the feature data 6 of the plurality of sound signals 5 output from the microphone array 10. In FIG. 7, the microphone array is not shown.
Based on the feature data 6, the vector estimation unit 221 outputs a three-dimensional vector P representing the direction information and the incidental information for each frequency component included in the sound signals 5. Specifically, the learner constituting the vector estimation unit 221 is trained to output a three-dimensional vector P for each frequency bin f. For training the learner, the mean squared error between the three-dimensional vector P and the answer vector is calculated for each frequency bin f.
The post-processing unit 222 executes conversion and aggregation processing on the three-dimensional vectors P output for each frequency component (frequency bin), and calculates the direction information indicating the sound source direction and the incidental information regarding the voice 2.
FIG. 8 is a data plot showing an example of the feature data. The feature data 6 (amplitude spectra and phase difference spectra) is calculated by the preprocessing unit 220 in the same manner as the processing described with reference to FIG. 4 and the like.
FIG. 8 shows band-shaped plots representing the spectral data of the amplitude spectra and the phase difference spectra. FIG. 8 is an example using four microphones 11 (M = 4): the top four plots are the amplitude spectra (|Sm(t, f)|), and the three plots below them are the phase difference spectra (arg(Sm(t, f)/Sj(t, f))). In FIG. 8, the data section included in the input section length Ti is indicated by a solid black frame. The data of each plot included in this section becomes the input data Di(c, t, f) input to the vector estimation unit 221.
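As a rough sketch of how such an input tensor could be assembled from multichannel STFTs, the following stacks M amplitude spectra with M−1 phase difference spectra taken against the first microphone; the STFT parameters and the choice of reference channel are assumptions made here for illustration, not specified by the embodiment.

```python
import numpy as np
from scipy.signal import stft

def build_feature_data(signals, fs, nperseg=512):
    """signals : ndarray of shape (M, n_samples), one row per microphone.
    Returns Di of shape (2*M - 1, T, F): M amplitude spectra followed
    by M - 1 phase difference spectra against microphone 0."""
    # Complex spectra S_m(t, f) for every microphone, as (M, T, F).
    S = np.stack([stft(s, fs=fs, nperseg=nperseg)[2].T for s in signals])
    amplitude = np.abs(S)                          # |S_m(t, f)|
    # arg(S_m * conj(S_0)) equals arg(S_m / S_0) and avoids division.
    phase_diff = np.angle(S[1:] * np.conj(S[:1]))
    return np.concatenate([amplitude, phase_diff], axis=0)
```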
[Estimation of the three-dimensional vector for each frequency component]
In the present embodiment, the vector estimation unit 221 estimates a three-dimensional vector P for each frequency component. The vector estimation unit 221 outputs the three-dimensional vectors P collected over an output section length To as output data Do. That is, within the output section length To, a three-dimensional vector P(t, f) = (x(t, f), y(t, f), z(t, f)) is output for each frequency bin and each time frame. This corresponds to adding a frequency dimension to the output of the vector estimation unit 21 shown in FIG. 1, and makes it possible to output the arrival direction (sound source direction) of the voice 2 and the incidental information regarding the voice 2 for each time frame t and frequency bin f.
Writing c for the index of the components of the three-dimensional vector P, the output data Do is expressed as Do(c, t, f). Here, the component index c is (c = 1, 2, 3), the time frame t is (t = 1, 2, …, To), and the frequency bin f is (f = 1, 2, …, F), where F is the total number of frequency bins. The data size of Do(c, t, f) is therefore 3 × To × F. In this way, the vector estimation unit 221 functions as a function B that converts the input Di into the output Do. In the following, the case where the volume of the voice 2 is set as the incidental information targeted by the function B is described as an example. Of course, arbitrary information such as the existence probability of the voice 2 or the reliability of the sound source direction can also be set as the incidental information.
FIG. 9 is a data plot showing the three-dimensional vectors P output from the feature data shown in FIG. 8. From top to bottom, FIG. 9 shows data plots of the components x(t, f), y(t, f), and z(t, f) of the three-dimensional vector P(t, f). The horizontal axis of each plot is time and the vertical axis is frequency, with the value of each component shown in grayscale. In FIG. 9, the data section included in the output section length To is indicated by a solid black frame. The data of each plot included in this section becomes the output data Do(c, t, f) output from the vector estimation unit 221 (function B).
[Calculation of the sound source direction and the incidental information]
The sound source direction and the incidental information are calculated from the output data Do(c, t, f). Specifically, the post-processing unit 222 applies a polar coordinate transformation to the three-dimensional vectors P(t, f) = (x(t, f), y(t, f), z(t, f)) included in the output data Do(c, t, f), as shown in the following equations.
$$\theta(t,f) = \tan^{-1}\frac{y(t,f)}{x(t,f)} \tag{15}$$

$$\phi(t,f) = \tan^{-1}\frac{z(t,f)}{\sqrt{x(t,f)^2 + y(t,f)^2}} \tag{16}$$

$$I(t,f) = \sqrt{x(t,f)^2 + y(t,f)^2 + z(t,f)^2} \tag{17}$$
Equations (15) to (17) correspond respectively to equations (1) to (3) described with reference to FIG. 2, and are calculated for each time frame t and frequency bin f. Equation (15) gives the horizontal angle θ(t, f) of the sound source direction, equation (16) gives the elevation angle φ(t, f) of the sound source direction, and equation (17) gives the value I(t, f) of the incidental information, the volume of the voice 2. In this way, the post-processing unit 222 calculates the sound source direction and the incidental information (the volume of the voice 2) for each time frame and frequency from the three-dimensional vectors P(t, f).
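A compact, vectorized sketch of this per-bin conversion, assuming the output data Do is held as an array of shape (3, To, F), might read:

```python
import numpy as np

def per_bin_direction_and_volume(Do):
    """Do : ndarray of shape (3, T, F) holding x, y, z per bin.
    Returns theta, phi (radians) and I, each of shape (T, F),
    per equations (15) to (17)."""
    x, y, z = Do
    theta = np.arctan2(y, x)             # horizontal angle, eq. (15)
    phi = np.arctan2(z, np.hypot(x, y))  # elevation angle, eq. (16)
    I = np.sqrt(x**2 + y**2 + z**2)      # volume per bin, eq. (17)
    return theta, phi, I
```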
FIG. 10 is a data plot showing the volume of the voice 2 calculated from the three-dimensional vectors P shown in FIG. 9. The horizontal axis is time and the vertical axis is frequency, with the volume (power) of the voice 2 in each time frame t and frequency bin f shown in grayscale.
When the volume of the voice 2 is set as the incidental information, the function B is optimized so that I(t, f) in equation (17) becomes the power (spectrogram) of the voice 2 for each frequency bin at a specific microphone (here, the k-th). In this case, using equation (7), which represents the complex spectrum of the voice 2, I(t, f) is expressed by the following equation.
$$I(t,f) = |V_k(t,f)|^2 \tag{18}$$
When the vector estimation unit 221 is trained, the function B is optimized so that the I(t, f) calculated as in equation (6) satisfies the relation of equation (18). As a result, ideally, the output incidental information represents the power (volume) of the voice 2 for each frequency bin regardless of the presence or absence of the noise 3, even when a sound signal 5 corrupted by noise is input.
This amounts to estimating a voice signal in which only the voice 2 is detected, without the noise 3. This processing can also be regarded as speech enhancement processing that emphasizes only the voice 2, or as noise suppression processing that reduces the noise 3. Accordingly, the data plot shown in FIG. 10 represents a voice signal containing the response of the voice 2 alone, extracted from the original sound signal that includes the noise 3 and the like.
For example, in a certain time frame t, the frequency distribution of I(t, f) calculated according to equation (17) is the frequency distribution of the power of the voice 2 in that time frame, that is, the amplitude spectrum of the voice 2. This spectrum does not include the spectrum of the noise 3 and the like. By performing such processing for each time frame, a voice signal in which only the voice 2 is detected, as shown in FIG. 10, can be extracted.
Thus, in the present embodiment, a voice signal representing the amplitude spectrum of the voice 2 is calculated based on the three-dimensional vectors P output for each frequency component by the vector estimation unit 221. This enables highly accurate speech recognition and the like using a voice signal in which the noise 3 is suppressed, and greatly improves the processing accuracy of various applications that use the voice 2.
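One plausible way to consume this per-bin output is to resynthesize an enhanced waveform by pairing the estimated magnitude with the phase of a reference microphone. The following is a sketch under stated assumptions: the reference-channel STFT, the square root used to go from power to magnitude, and the STFT parameters are all choices made here, not specified by the embodiment.

```python
import numpy as np
from scipy.signal import istft

def resynthesize_enhanced_voice(I, S_ref, fs, nperseg=512):
    """I     : ndarray (T, F), per-bin voice power I(t, f) per eq. (18)
    S_ref : ndarray (T, F), complex STFT of a reference microphone
    Returns an enhanced time-domain signal using the noisy phase."""
    magnitude = np.sqrt(np.maximum(I, 0.0))   # power -> magnitude
    phase = np.exp(1j * np.angle(S_ref))      # keep the reference phase
    Z = (magnitude * phase).T                 # istft expects (F, T)
    _, x = istft(Z, fs=fs, nperseg=nperseg)
    return x
```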
The speech enhancement processing (the processing of extracting the voice signal) can also be regarded as voice activity detection (VAD) per frequency bin. Therefore, in the present embodiment, when the volume of the voice 2 is set as the incidental information, speech enhancement, voice activity detection, and sound source direction estimation are solved in a single computation. This makes it possible to provide a single, globally optimized algorithm that performs the three processes at once.
As with equation (9), the volume can also be output on a logarithmic scale. That is, the power may be 0 when the voice 2 is absent and expressed on a logarithmic scale when the voice 2 is present. In this case, I(t, f) is expressed by the following equation.
$$I(t,f) = \log\!\left(1 + |V_k(t,f)|^2\right) \tag{19}$$
[Calculation of the overall sound source direction and incidental information]
When the sound source direction and the incidental information for each time frame t are of interest, the three-dimensional vectors P(t, f) are first summed in the frequency direction, and the polar coordinate transformation is then executed.
First, from the three-dimensional vectors P(t, f) calculated for each frequency component, an overall three-dimensional vector P(t) representing the overall sound source direction and the overall incidental information in a certain time frame t is calculated. In the following, the overall three-dimensional vector P(t) calculated from the vectors P(t, f) is referred to as the overall vector P(t). In the present embodiment, the overall vector P(t) corresponds to the first vector.
For example, when the three-dimensional vectors P(t, f) are output from the vector estimation unit 221, the post-processing unit 222 combines them in the frequency direction to calculate the overall vector P(t). That is, the overall vector P(t) is the vector obtained by combining the three-dimensional vectors P(t, f) over the frequency bins f = 1 to F in the time frame t. Specifically, the components x(t), y(t), and z(t) of the overall vector P(t) are expressed as follows.
$$x(t) = \sum_{f=1}^{F} x(t,f), \qquad y(t) = \sum_{f=1}^{F} y(t,f), \qquad z(t) = \sum_{f=1}^{F} z(t,f) \tag{20}$$
The direction of the overall vector P(t) calculated by equation (20) represents the arrival direction (sound source direction) of the voice 2 produced at time t. In this way, in the present embodiment, the overall vector P(t) representing the arrival direction of the voice 2 is calculated by combining the three-dimensional vectors P(t, f) output for each frequency component. The magnitude of the overall vector P(t) represents the overall value I(t) of the incidental information regarding the voice 2.
FIG. 11 shows graphs of the overall vector P(t) calculated from the three-dimensional vectors P(t, f) shown in FIG. 9. From top to bottom, FIG. 11 shows graphs of the components x(t), y(t), and z(t) of the overall vector P(t). The horizontal axis of each graph is time, and the vertical axis is the magnitude of each component; the scale of the vertical axis is set appropriately for each graph.
The graphs shown in FIG. 11 are obtained by adding, in the frequency direction, the components output individually for each frequency bin, and correspond to the components of the three-dimensional vector P(t) described with reference to FIG. 5. That is, by combining the three-dimensional vectors P(t, f) in the post-processing unit 222, a vector (the overall vector P(t)) similar to the three-dimensional vector P(t) output by the vector estimation unit 21 (function A) of the first embodiment can be calculated.
Once the overall vector P(t) is calculated, the direction information and the incidental information are calculated based on it. Specifically, the post-processing unit 222 executes a polar coordinate transformation on the overall vector P(t) to calculate the horizontal angle θ(t), the elevation angle φ(t), and the incidental information I(t) of the voice 2, as expressed by the following equations.
$$\theta(t) = \tan^{-1}\frac{y(t)}{x(t)} \tag{21}$$

$$\phi(t) = \tan^{-1}\frac{z(t)}{\sqrt{x(t)^2 + y(t)^2}} \tag{22}$$

$$I(t) = \sqrt{x(t)^2 + y(t)^2 + z(t)^2} \tag{23}$$
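A short sketch combining equations (20) to (23), under the same array-layout assumption as the earlier snippets:

```python
import numpy as np

def overall_direction_per_frame(Do):
    """Do : ndarray of shape (3, T, F) of per-bin vectors P(t, f).
    Sums over frequency (eq. (20)) and converts each frame's overall
    vector to direction and incidental information (eqs. (21)-(23))."""
    x, y, z = Do.sum(axis=2)             # overall vector P(t) per frame
    theta = np.arctan2(y, x)             # horizontal angle, eq. (21)
    phi = np.arctan2(z, np.hypot(x, y))  # elevation angle, eq. (22)
    I = np.sqrt(x**2 + y**2 + z**2)      # overall volume, eq. (23)
    return theta, phi, I
```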
FIG. 12 shows graphs of the sound source direction and the volume calculated from the overall vectors shown in FIG. 11. From top to bottom, FIG. 12 shows the horizontal angle θ(t) of the sound source direction, the elevation angle φ(t) of the sound source direction, and the volume I(t) of the voice 2. The horizontal axis of each graph is time; the vertical axes of the θ(t) and φ(t) graphs are angles, and the vertical axis of the I(t) graph represents loudness (power).
For example, in the sections of the I(t) graph where a peak is detected, the volume of the voice 2 is large, showing that the voice 2 is detected. Also, the values of θ(t) and φ(t) each shift from 0° to a fixed angle in correspondence with each peak of I(t). It follows that the voices 2 detected as peaks of I(t) are all emitted from the same direction.
When the magnitude I(t, f) of the three-dimensional vector P(t, f) is set to the power of the voice 2 shown in equation (18), the magnitude I(t) of the overall vector P(t) can be regarded as the power of the voice 2 shown in equation (8). Similarly, when the power of the voice 2 shown in equation (19) is set, the magnitude of the overall vector P(t) can be regarded as the power of the voice 2 shown in equation (9).
In this way, even when the three-dimensional vectors P(t, f) are calculated for each frequency component, the overall sound source direction and incidental information for each time frame can still be calculated. This makes it possible to extract a voice signal in which the noise 3 and the like are suppressed while also performing detection of the voice 2. As a result, a globally optimized, highly versatile voice processing system can be constructed.
<Other Embodiments>
The present technology is not limited to the embodiments described above, and various other embodiments can be realized.
The above description dealt with processing that estimates the arrival direction of voice and the like contained in sound (sound waves). Besides sound waves, the present technology can be applied to vibration waves propagating inside an object and the like. For example, a seismic wave propagating underground when an earthquake occurs may be set as the target wave. In that case, a three-dimensional vector is output that expresses direction information indicating the arrival direction of the seismic wave, that is, the direction of the epicenter, and incidental information such as the intensity (amplitude) of the seismic wave.
For example, vibration detectors that detect vibrations of the ground or underground are placed at multiple locations, and the feature data (amplitude spectra and phase difference spectra) of the vibration signals output from the detectors is input to the learner. The learner is trained in advance to output, based on the feature data of the vibration signals, a three-dimensional vector representing the arrival direction of the seismic wave and its intensity. This makes it possible to detect the arrival direction, intensity, and the like of seismic waves with high accuracy. Beyond this, the present technology is applicable to various wave phenomena that propagate through space, such as electromagnetic waves and gravitational waves.
In the above, a single processing unit was taken as an example of an embodiment of the information processing device according to the present technology. However, the information processing device according to the present technology may be realized by an arbitrary computer configured separately from the processing unit and connected to it by wire or wirelessly. For example, the information processing method according to the present technology may be executed by a cloud server, or the processing unit and another computer may operate in conjunction to execute it.
That is, the information processing method and the program according to the present technology can be executed not only by a computer system composed of a single computer but also by a computer system in which multiple computers operate in conjunction. In the present disclosure, a system means a set of multiple components (devices, modules (parts), etc.), regardless of whether all the components are in the same housing. Therefore, multiple devices housed in separate housings and connected via a network, and a single device in which multiple modules are housed in one housing, are both systems.
Execution of the information processing method and the program according to the present technology by a computer system includes both the case where, for example, the acquisition of the feature data and the output of the three-dimensional vector are executed by a single computer and the case where each process is executed by a different computer. Execution of each process by a given computer also includes having another computer execute part or all of the process and acquiring the result.
That is, the information processing method and the program according to the present technology can also be applied to a cloud computing configuration in which a single function is shared and jointly processed by multiple devices via a network.
Of the characteristic features according to the present technology described above, at least two can be combined. That is, the various characteristic features described in each embodiment may be combined arbitrarily, without distinction between the embodiments. The various effects described above are merely examples and are not limiting, and other effects may also be exhibited.
In the present disclosure, terms such as "same", "equal", and "orthogonal" are concepts that include "substantially the same", "substantially equal", and "substantially orthogonal". For example, states within a predetermined range (for example, ±10%) of "completely the same", "completely equal", "completely orthogonal", and so on are also included.
In addition, this technology can also adopt the following configurations.
(1) An acquisition unit that acquires feature data of multiple signals that observe the target wave, and
An information processing device including a direction information indicating an arrival direction of the target wave and an output unit for outputting a three-dimensional vector representing ancillary information about the target wave based on the acquired feature data.
(2) The information processing device according to (1).
The output unit is an information processing device that outputs the three-dimensional vector so that the direction of the three-dimensional vector represents the direction information and the magnitude of the three-dimensional vector represents the accessory information.
(3) The information processing device according to (2).
The output unit is an information processing device that outputs the three-dimensional vector so that the direction information and the accessory information are calculated by performing polar coordinate conversion on the three-dimensional vector.
(4) The information processing device according to (3).
The direction information is an information processing device including a horizontal angle and an elevation angle indicating the arrival direction of the target wave.
(5) The information processing device according to (3) or (4).
The output unit is an information processing device that converts the three-dimensional vector into polar coordinates to calculate the direction information and the accessory information.
(6) The information processing device according to any one of (1) to (5).
The target wave is voice and
The plurality of signals are information processing devices that are sound signals obtained by observing the voice.
(7) The information processing device according to (6).
The direction information is an information processing device that indicates the direction of arrival of the voice.
(8) The information processing device according to (6) or (7).
The information processing device includes any one of the volume of the voice, the existence probability of the voice, and the reliability of the arrival direction of the voice.
(9) The information processing device according to any one of (6) to (8).
The output unit is an information processing device that outputs the three-dimensional vector representing the direction information and the accessory information for each frequency component included in the sound signal.
(10) The information processing apparatus according to (9).
The attached information is the volume of the voice for each frequency component.
The output unit is an information processing device that calculates an audio signal representing the amplitude spectrum of the audio based on the three-dimensional vector output for each frequency component.
(11) The information processing device according to (9) or (10).
The output unit is an information processing device that synthesizes the three-dimensional vectors output for each frequency component and calculates a first vector representing the arrival direction of the voice.
(12) The information processing apparatus according to (11).
The output unit is an information processing device that calculates the direction information and the accessory information based on the first vector.
(13) The information processing apparatus according to any one of (6) to (12).
The output unit calculates a second vector by synthesizing the three-dimensional vectors output within a predetermined period, and calculates the arrival direction of the voice in the predetermined period based on the second vector. Information processing device.
(14) The information processing device according to any one of (6) to (13), wherein
the plurality of signals are the sound signals detected by each of a plurality of sound collectors arranged at mutually different positions.
(15) The information processing device according to any one of (1) to (14), wherein
the feature data includes an amplitude spectrum of each of the plurality of signals and a phase difference spectrum between the plurality of signals (a sketch of this feature extraction follows this list).
(16) The information processing device according to any one of (1) to (15), wherein
the output unit is a learner that outputs the three-dimensional vector corresponding to input data and that is trained using an error corresponding to the Euclidean distance between the output three-dimensional vector and an answer vector corresponding to the input data (a sketch of this error follows this list).
(17) An information processing method executed by a computer system, the method comprising:
acquiring feature data of a plurality of signals in which a target wave is observed; and
outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.
(18) A program that causes a computer system to execute:
a step of acquiring feature data of a plurality of signals in which a target wave is observed; and
a step of outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and incidental information regarding the target wave.
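For illustration only (this sketch is not part of the publication or its claims): a minimal example of the polar coordinate transformation referred to in (3) to (5), turning the estimated three-dimensional vector into a horizontal angle, an elevation angle, and a magnitude that carries the incidental information. The function name vector_to_direction and the axis convention (x forward, y left, z up) are assumptions; the publication does not fix them.

    import numpy as np

    def vector_to_direction(p):
        # p is the estimated 3-D vector P = (x, y, z).
        # Assumed axis convention: x forward, y left, z up.
        x, y, z = p
        magnitude = np.linalg.norm(p)            # incidental information, e.g. volume
        azimuth = np.degrees(np.arctan2(y, x))   # horizontal angle in [-180, 180]
        elevation = np.degrees(np.arctan2(z, np.hypot(x, y)))  # elevation in [-90, 90]
        return azimuth, elevation, magnitude

    # Example: a vector pointing 45 degrees to the left at ear level,
    # with magnitude ~0.8 representing the incidental information.
    az, el, mag = vector_to_direction(np.array([0.5657, 0.5657, 0.0]))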
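Likewise for illustration only: a sketch of the per-frequency and per-period aggregation of (9) to (13). It assumes that "combining" the vectors means summation and that the estimator emits an array of shape (frames, frequency bins, 3); both are assumptions, not statements of the publication's method.

    import numpy as np

    def aggregate(vectors):
        # vectors: (n_frames, n_bins, 3) per-frequency 3-D vectors.
        # (10): the magnitude per bin is the per-frequency volume,
        # i.e. an estimated amplitude spectrum of the voice.
        amplitude_spectrum = np.linalg.norm(vectors, axis=-1)   # (n_frames, n_bins)
        # (11): combining (here: summing) over frequency yields a
        # "first vector" whose direction is the per-frame arrival direction.
        first_vector = vectors.sum(axis=1)                      # (n_frames, 3)
        # (13): combining over a predetermined period (all frames here)
        # yields a "second vector" for that period.
        second_vector = vectors.sum(axis=(0, 1))                # (3,)
        return amplitude_spectrum, first_vector, second_vector

If summation is indeed the combining rule, bins that agree on a direction reinforce one another, so the first and second vectors tend to point toward the dominant voice even when individual bins are noisy.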
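A sketch of the feature data of (15), again illustrative only: the amplitude spectrum of each channel and the phase-difference spectrum between channels, computed with a short-time Fourier transform. The window, FFT length, hop size, and the choice of channel 0 as the phase reference are assumptions.

    import numpy as np

    def extract_features(signals, n_fft=512, hop=256):
        # signals: (n_channels, n_samples) array of microphone signals,
        # assumed to contain at least n_fft samples.
        window = np.hanning(n_fft)
        n_ch, n_samples = signals.shape
        frames = range(0, n_samples - n_fft + 1, hop)
        # Complex STFT per channel: (n_ch, n_frames, n_bins)
        spectra = np.stack([
            np.stack([np.fft.rfft(signals[ch, t:t + n_fft] * window)
                      for t in frames])
            for ch in range(n_ch)
        ])
        amplitude = np.abs(spectra)  # amplitude spectrum of each signal
        # Phase difference of every channel relative to channel 0.
        phase_diff = np.angle(spectra[1:] * np.conj(spectra[0]))
        return amplitude, phase_diff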
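Finally, a sketch of the training error of (16): the Euclidean distance between the three-dimensional vector output by the learner and the answer vector attached to the input data. This is a plain-NumPy stand-in; the publication does not specify the learner or the optimizer.

    import numpy as np

    def euclidean_loss(predicted, answer):
        # Mean Euclidean distance over a batch of 3-D vectors.
        return np.mean(np.linalg.norm(predicted - answer, axis=-1))

In training, the answer vector would presumably point in the true arrival direction with magnitude equal to the true incidental information (e.g. volume), so that minimizing this loss fits both quantities at once.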
P … three-dimensional vector
2 … voice
5 … sound signal
6 … feature data
11 … microphone
20, 220 … pre-processing unit
21, 221 … vector estimation unit
22, 222 … post-processing unit
100, 200 … processing unit

Claims (18)

1. An information processing device, comprising:
an acquisition unit that acquires feature data of a plurality of signals in which a target wave is observed; and
an output unit that outputs, based on the acquired feature data, a three-dimensional vector representing direction information indicating an arrival direction of the target wave and incidental information regarding the target wave.
2. The information processing device according to claim 1, wherein
the output unit outputs the three-dimensional vector such that the direction of the three-dimensional vector represents the direction information and the magnitude of the three-dimensional vector represents the incidental information.
3. The information processing device according to claim 2, wherein
the output unit outputs the three-dimensional vector such that the direction information and the incidental information are obtained by applying a polar coordinate transformation to the three-dimensional vector.
4. The information processing device according to claim 3, wherein
the direction information includes a horizontal angle and an elevation angle indicating the arrival direction of the target wave.
5. The information processing device according to claim 3, wherein
the output unit applies the polar coordinate transformation to the three-dimensional vector to calculate the direction information and the incidental information.
6. The information processing device according to claim 1, wherein
the target wave is a voice, and
the plurality of signals are sound signals in which the voice is observed.
7. The information processing device according to claim 6, wherein
the direction information is information indicating the arrival direction of the voice.
8. The information processing device according to claim 6, wherein
the incidental information includes any one of the volume of the voice, the existence probability of the voice, and the reliability of the arrival direction of the voice.
9. The information processing device according to claim 6, wherein
the output unit outputs, for each frequency component included in the sound signals, the three-dimensional vector representing the direction information and the incidental information.
10. The information processing device according to claim 9, wherein
the incidental information is the volume of the voice for each frequency component, and
the output unit calculates a voice signal representing an amplitude spectrum of the voice based on the three-dimensional vectors output for the respective frequency components.
11. The information processing device according to claim 9, wherein
the output unit combines the three-dimensional vectors output for the respective frequency components to calculate a first vector representing the arrival direction of the voice.
12. The information processing device according to claim 11, wherein
the output unit calculates the direction information and the incidental information based on the first vector.
13. The information processing device according to claim 6, wherein
the output unit calculates a second vector by combining the three-dimensional vectors output within a predetermined period, and calculates the arrival direction of the voice in the predetermined period based on the second vector.
14. The information processing device according to claim 6, wherein
the plurality of signals are the sound signals detected by each of a plurality of sound collectors arranged at mutually different positions.
15. The information processing device according to claim 1, wherein
the feature data includes an amplitude spectrum of each of the plurality of signals and a phase difference spectrum between the plurality of signals.
16. The information processing device according to claim 1, wherein
the output unit is a learner that outputs the three-dimensional vector corresponding to input data and that is trained using an error corresponding to the Euclidean distance between the output three-dimensional vector and an answer vector corresponding to the input data.
17. An information processing method executed by a computer system, the method comprising:
acquiring feature data of a plurality of signals in which a target wave is observed; and
outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating an arrival direction of the target wave and incidental information regarding the target wave.
18. A program that causes a computer system to execute:
a step of acquiring feature data of a plurality of signals in which a target wave is observed; and
a step of outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating an arrival direction of the target wave and incidental information regarding the target wave.
PCT/JP2020/022107 2019-06-14 2020-06-04 Information processing device, information processing method, and program WO2020250797A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2019-110917 2019-06-14
JP2019110917 2019-06-14

Publications (1)

Publication Number Publication Date
WO2020250797A1 true WO2020250797A1 (en) 2020-12-17

Family

ID=73780749

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/022107 WO2020250797A1 (en) 2019-06-14 2020-06-04 Information processing device, information processing method, and program

Country Status (1)

Country Link
WO (1) WO2020250797A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012039275A (en) * 2010-08-05 2012-02-23 Nippon Telegr & Teleph Corp <Ntt> Reflection sound information estimation equipment, reflection sound information estimation method, and program
JP2013008031A (en) * 2011-06-24 2013-01-10 Honda Motor Co Ltd Information processor, information processing system, information processing method and information processing program
JP2015050610A (en) * 2013-08-30 2015-03-16 本田技研工業株式会社 Sound processing device, sound processing method and sound processing program
JP2015166764A (en) * 2014-03-03 2015-09-24 富士通株式会社 Speech processing device, noise suppression method, and program
JP2018032001A (en) * 2016-08-26 2018-03-01 日本電信電話株式会社 Signal processing device, signal processing method and signal processing program

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4050602A1 (en) * 2021-02-24 2022-08-31 GN Audio A/S Conference device with voice direction estimation
US11778374B2 (en) 2021-02-24 2023-10-03 Gn Audio A/S Conference device with voice direction estimation
WO2024009746A1 (en) * 2022-07-07 2024-01-11 ソニーグループ株式会社 Model generation device, model generation method, signal processing device, signal processing method, and program

Similar Documents

Publication Publication Date Title
US10063965B2 (en) Sound source estimation using neural networks
JP6279181B2 (en) Acoustic signal enhancement device
US20060204019A1 (en) Acoustic signal processing apparatus, acoustic signal processing method, acoustic signal processing program, and computer-readable recording medium recording acoustic signal processing program
US9961460B2 (en) Vibration source estimation device, vibration source estimation method, and vibration source estimation program
CN108962231B (en) Voice classification method, device, server and storage medium
JP2017044916A (en) Sound source identifying apparatus and sound source identifying method
KR102191736B1 (en) Method and apparatus for speech enhancement with artificial neural network
WO2020250797A1 (en) Information processing device, information processing method, and program
JP2008236077A (en) Target sound extracting apparatus, target sound extracting program
JP6236282B2 (en) Abnormality detection apparatus, abnormality detection method, and computer-readable storage medium
JP7214798B2 (en) AUDIO SIGNAL PROCESSING METHOD, AUDIO SIGNAL PROCESSING DEVICE, ELECTRONIC DEVICE, AND STORAGE MEDIUM
KR20210137146A (en) Speech augmentation using clustering of queues
WO2022218134A1 (en) Multi-channel speech detection system and method
Pertilä Online blind speech separation using multiple acoustic speaker tracking and time–frequency masking
EP2745293B1 (en) Signal noise attenuation
CN116868265A (en) System and method for data enhancement and speech processing in dynamic acoustic environments
US20220262342A1 (en) System and method for data augmentation and speech processing in dynamic acoustic environments
Dov et al. Multimodal kernel method for activity detection of sound sources
JP2023550434A (en) Improved acoustic source positioning method
JP2011139409A (en) Audio signal processor, audio signal processing method, and computer program
Mirbagheri et al. C-SL: Contrastive Sound Localization with Inertial-Acoustic Sensors
US11783826B2 (en) System and method for data augmentation and speech processing in dynamic acoustic environments
CN113724692B (en) Telephone scene audio acquisition and anti-interference processing method based on voiceprint features
Firoozabadi et al. Estimating the Number of Speakers by Novel Zig-Zag Nested Microphone Array Based on Wavelet Packet and Adaptive GCC Method
US20230230582A1 (en) Data augmentation system and method for multi-microphone systems

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20822277; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 20822277; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: JP)