WO2020250797A1 - Information processing device, information processing method, and program - Google Patents


Info

Publication number
WO2020250797A1
Authority
WO
WIPO (PCT)
Prior art keywords
information processing
information
voice
processing device
vector
Application number
PCT/JP2020/022107
Other languages
English (en)
Japanese (ja)
Inventor
Yuichiro Koyama
Original Assignee
Sony Corporation
Application filed by Sony Corporation
Publication of WO2020250797A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use, for comparison or discrimination
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00: Circuits for transducers, loudspeakers or microphones

Definitions

  • This technology relates to information processing devices, information processing methods, and programs that can be applied to detect voice and the like.
  • Patent Document 1 describes an acoustic signal processing device that estimates the direction of a sound source.
  • In this device, the surrounding sound is captured by a plurality of microphones to generate a plurality of acoustic signals.
  • The cross-correlation values of the acoustic signals of the microphones are calculated as a sound space feature.
  • The sound source direction of the target sound is estimated using this sound space feature.
  • The reliability of the estimated sound source direction is calculated using a higher-order statistic of the sound space feature (Patent Document 1, paragraphs [0035], [0040], [0044], FIG. 2, etc.).
  • An object of the present technology is to provide an information processing device, an information processing method, and a program capable of accurately detecting the direction of a target wave together with other attached information.
  • the information processing device includes an acquisition unit and an output unit.
  • The acquisition unit acquires feature data of a plurality of signals obtained by observing the target wave. Based on the acquired feature data, the output unit outputs a three-dimensional vector representing direction information indicating the arrival direction of the target wave and attached information regarding the target wave.
  • In the information processing device, the three-dimensional vector is output by inputting the feature data of the plurality of signals obtained by observing the target wave.
  • This three-dimensional vector is a vector that represents the direction information of the arrival direction of the target wave and the attached information regarding the target wave. In this way, the direction information and the attached information are collectively output as one vector, which makes it possible to detect the direction of the target wave and other attached information with high accuracy.
  • The output unit may output the three-dimensional vector so that the direction of the three-dimensional vector represents the direction information and the magnitude of the three-dimensional vector represents the attached information.
  • The output unit may output the three-dimensional vector so that the direction information and the attached information are calculated by performing polar coordinate conversion on the three-dimensional vector.
  • The direction information may include a horizontal angle and an elevation angle indicating the arrival direction of the target wave.
  • The output unit may perform polar coordinate transformation of the three-dimensional vector to calculate the direction information and the attached information.
  • the target wave may be voice.
  • the plurality of signals may be sound signals obtained by observing the voice.
  • the direction information may be information indicating the arrival direction of the voice.
  • the attached information may include any one of the volume of the voice, the existence probability of the voice, or the reliability regarding the arrival direction of the voice.
  • The output unit may output the three-dimensional vector representing the direction information and the attached information for each frequency component included in the sound signal.
  • The attached information may be the volume of the voice for each frequency component.
  • The output unit may calculate a voice signal representing the amplitude spectrum of the voice based on the three-dimensional vectors output for each frequency component.
  • The output unit may synthesize the three-dimensional vectors output for each frequency component to calculate a first vector representing the arrival direction of the voice.
  • The output unit may calculate the direction information and the attached information based on the first vector.
  • The output unit may calculate a second vector by synthesizing the three-dimensional vectors output within a predetermined period, and may calculate the arrival direction of the voice in the predetermined period based on the second vector.
  • the plurality of signals may be the sound signals detected by each of the plurality of sound collectors arranged at different positions from each other.
  • the feature data may include the amplitude spectrum of each of the plurality of signals and the phase difference spectrum between the plurality of signals.
  • The output unit may be a learner that outputs the three-dimensional vector corresponding to input data and that is trained using an error according to the Euclidean distance between the output three-dimensional vector and the answer vector corresponding to the input data.
  • The information processing method is an information processing method executed by a computer system, and includes acquiring feature data of a plurality of signals obtained by observing a target wave, and outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating the arrival direction of the target wave and attached information regarding the target wave.
  • A program according to the present technology causes a computer system to execute these steps.
  • FIG. 10 is a data plot showing the volume of the voice calculated from the three-dimensional vector shown in FIG. 9. FIG. 11 is a graph of the whole vector calculated from the three-dimensional vector shown in FIG. 9. FIG. 12 is a graph of the sound source direction and the volume calculated from the whole vector shown in FIG. 11.
  • FIG. 1 is a block diagram showing a configuration example of a processing unit according to a first embodiment of the present technology.
  • The processing unit 100 is a calculation unit that calculates, from sound signals obtained by observing sound (sound waves), information on a specific sound to be observed. As will be described later, the processing unit 100 executes a calculation of the information of the voice 2, with the voice 2 of the human 1 as the observation target.
  • the processing unit 100 is used by being connected to the microphone array 10.
  • the microphone array 10 has a plurality of microphones 11.
  • the microphone 11 is an element that detects surrounding sounds and outputs a sound signal corresponding to the detected sound, and functions as a sound collector.
  • the sound signal output from the microphone 11 is an electric signal whose amplitude changes with time according to the surrounding sounds. The time variation of this amplitude represents the pitch, loudness, sound waveform, and the like.
  • the sound signal is typically output as an analog signal and converted into a digital signal using an A / D converter or the like (not shown).
  • The specific configuration of the microphone 11 is not limited, and any element capable of detecting the surrounding sound and outputting a corresponding sound signal may be used as the microphone 11.
  • Around the microphone array 10, the voice 2 emitted by the human 1 is generated. The plurality of signals output from the microphone array 10 are therefore sound signals obtained by observing the voice 2. In addition, not only the voice 2 but also other sounds such as the noise 3 are generated around the microphone array 10, so the sound signals 5 include components corresponding to the noise 3 and the like in addition to the voice 2.
  • the voice 2 and the noise 3 generated around the microphone array 10 are schematically illustrated by using arrows.
  • The plurality of microphones 11 constituting the microphone array 10 are arranged at positions different from each other, so the plurality of signals output from the microphone array 10 are sound signals detected by microphones 11 at mutually different positions. For this reason, even when the same voice 2 is detected, the timing at which the voice 2 is detected, the loudness of the detected voice 2, and the like differ for each microphone 11; the sound signal output by each microphone 11 is thus a signal corresponding to the position where that microphone 11 is arranged.
  • the microphone array 10 is mounted on, for example, a robot or the like.
  • a plurality of microphones 11 are arranged in a housing such as a robot.
  • the microphone array 10 may be mounted on a stationary device or the like.
  • a plurality of microphones 11 may be arranged in an indoor space, a vehicle interior space, or the like to form a microphone array 10.
  • the microphone array 10 may include at least two microphones 11.
  • the microphone array 10 is composed of four or more microphones 11.
  • the specific configuration of the microphone array 10 is not limited.
  • the processing unit 100 has a hardware configuration required for a computer such as a CPU and a memory (RAM, ROM). Various processes are executed by the CPU loading the program stored in the ROM into the RAM and executing the program.
  • the program is installed in the processing unit 100, for example, via various recording media. Alternatively, the program may be installed via the Internet or the like.
  • As the processing unit 100, for example, a device such as a PLD (Programmable Logic Device) such as an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit), may be used.
  • the processing unit 100 corresponds to an information processing device.
  • In the processing unit 100, the pre-processing unit 20, the vector estimation unit 21, and the post-processing unit 22 are realized as functional blocks, and the information processing method according to the present embodiment is executed by these functional blocks.
  • dedicated hardware such as an IC (integrated circuit) may be appropriately used.
  • The preprocessing unit 20 acquires the feature data 6 of the plurality of sound signals 5 obtained by observing the voice 2. Specifically, the preprocessing unit 20 reads the plurality of sound signals 5 (multi-channel sound signals 5) output from the microphone array 10 and calculates the feature data 6 based on the read sound signals 5. That is, the preprocessing unit 20 acquires the feature data 6 by calculating it from the plurality of sound signals 5. In the present embodiment, the preprocessing unit 20 corresponds to the acquisition unit.
  • The feature data 6 are data representing the features of the plurality of sound signals 5. For example, a predetermined conversion process is executed on the sound signals 5 to generate data representing their characteristics. The sound signal 5 itself can also be used as the feature data 6. In the present embodiment, a Fourier transform is executed on the sound signals 5, and the amplitude spectra, the phase difference spectra, and the like of the sound signals 5 are calculated as the feature data 6. This point will be described in detail later with reference to FIG. 4 and the like.
  • Based on the feature data 6 acquired by the preprocessing unit 20, the vector estimation unit 21 outputs a three-dimensional vector P representing direction information indicating the arrival direction of the target wave and attached information regarding the target wave.
  • Specifically, the vector estimation unit 21 is configured by a learner trained to receive the feature data 6 as input and to output the three-dimensional vector P representing the direction information and the attached information.
  • the target wave is a sound (sound wave) to be observed by the processing unit 100.
  • the voice 2 emitted by the human 1 is set as the observation target. That is, the target wave is voice 2.
  • For example, all the voices 2 emitted by each human 1 are target waves.
  • The target wave is not limited to the voices 2 of an unspecified number of humans 1; for example, the voice 2 of a specific human 1 can also be set as the target wave.
  • a specific sound such as a clap sound or a bell ringing sound may be set as the target wave.
  • ambient noise 3 and the like may be set as the target wave.
  • the target wave can be arbitrarily set according to the purpose of the processing unit 100 and the like.
  • the direction information is information indicating the direction of arrival of the voice 2. That is, it can be said that the direction information is information indicating the direction (sound source direction) in which the human 1 who has emitted the voice 2 is located. In the following, the direction of arrival of the voice 2 may be simply described as the sound source direction.
  • reference coordinates are set in the microphone array 10 described above.
  • the direction information is information indicating the direction in which the voice 2 arrives with respect to the origin of the reference coordinates, that is, the direction in which the human 1 who emits the voice 2 is located when viewed from the reference coordinates.
  • the method of setting the reference coordinates is not limited and can be set arbitrarily.
  • The attached information is information obtained incidentally to the target voice 2, and is expressed using a one-dimensional value (the norm of the three-dimensional vector).
  • In the present embodiment, the attached information is set to the volume of the voice 2; more specifically, the magnitude (power) of the voice 2 emitted by the human 1 at a certain timing is set as the attached information. Besides this, it is also possible to set, as the attached information, a probability indicating the presence or absence of the voice 2, or the reliability regarding the arrival direction (direction information) of the voice 2.
  • the specific content of the attached information is not limited, and for example, an arbitrary one-dimensional amount that can be calculated from the feature data 6 may be set as the attached information.
  • FIG. 2 is a schematic diagram for explaining the three-dimensional vector P.
  • FIG. 2 illustrates a Cartesian coordinate system represented by the X-axis, Y-axis, and Z-axis that are orthogonal to each other. This Cartesian coordinate system becomes the reference coordinate.
  • the thick arrow in the figure is an example of the three-dimensional vector P output from the vector estimation unit 21.
  • The components of the three-dimensional vector P along the X-axis, Y-axis, and Z-axis will be referred to as x, y, and z, respectively.
  • The vector estimation unit 21 outputs the components (x, y, z) of the three-dimensional vector P.
  • The vector estimation unit 21 (learner) is trained so that the direction of the three-dimensional vector P is the arrival direction of the voice 2 and the magnitude I of the three-dimensional vector P is the value of the attached information. That is, the vector estimation unit 21 outputs the three-dimensional vector P so that its direction represents the direction information and its magnitude represents the attached information. For example, when viewed from the origin O, the direction indicated by the three-dimensional vector P is the sound source direction, and its magnitude represents the value of the attached information (such as the volume of the voice 2). The individual components x, y, z of the three-dimensional vector P thus do not directly represent the direction information or the attached information; rather, the direction information and the attached information are expressed by the vector that the components x, y, z represent as a whole.
  • The vector estimation unit 21 is trained so that the horizontal angle θ of the sound source direction, the elevation angle φ of the sound source direction, and the value I of the attached information can be obtained by converting the three-dimensional vector P into polar coordinates.
  • The horizontal angle θ is the angle representing the direction of the vector with respect to the X-axis in the XY plane.
  • The elevation angle φ is the angle representing the inclination of the vector with respect to the XY plane.
  • In the present embodiment, the direction information includes the horizontal angle θ and the elevation angle φ indicating the sound source direction (the arrival direction of the voice 2).
  • The horizontal angle θ, the elevation angle φ, and the value I of the attached information are expressed using the following equations, respectively:
        θ = atan2(y, x) … (1)
        φ = atan2(z, √(x² + y²)) … (2)
        I = √(x² + y² + z²) … (3)
  • In this way, the vector estimation unit 21 outputs the three-dimensional vector P so that the direction information and the attached information are calculated by performing polar coordinate conversion on the three-dimensional vector P. Therefore, when calculating the direction information and the attached information, the angles θ and φ representing the sound source direction and the value I of the attached information can be easily calculated by converting the three-dimensional vector P into polar coordinates according to equations (1) to (3).
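  • As a concrete illustration, the polar coordinate conversion of equations (1) to (3) can be written in a few lines of code. The sketch below is illustrative only (NumPy is assumed; the function and variable names are not from the patent):

        import numpy as np

        def polar_decompose(x, y, z):
            # Convert a three-dimensional vector P = (x, y, z) into the
            # horizontal angle, elevation angle, and attached-information
            # value, following equations (1)-(3).
            theta = np.arctan2(y, x)             # horizontal angle in the XY plane
            phi = np.arctan2(z, np.hypot(x, y))  # elevation angle from the XY plane
            i = np.sqrt(x**2 + y**2 + z**2)      # magnitude = attached information
            return theta, phi, i

        # Example: a vector pointing 45 degrees in the horizontal plane,
        # tilted slightly upward, with attached-information value of about 1.5.
        theta, phi, i = polar_decompose(1.0, 1.0, 0.5)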
  • Learning data generated based on the sound signal is used for learning of the vector estimation unit 21 (learner).
  • For learning, feature data of sound signals to which teacher labels are attached are used. For example, the feature data (amplitude spectra and phase difference spectra) of a sound signal including a human voice becomes the input data, and a three-dimensional vector P representing the arrival direction (sound source direction) and the attached information (volume, etc.) of that voice is attached to the feature data as the teacher label. This makes it possible to train the learner to estimate a vector from which the sound source direction and the attached information can be obtained by polar coordinate transformation.
  • The method of generating the learning data is not limited. For example, it is possible to simulate sound signals in which the position of the sound source is changed, by performing convolution operations with impulse responses. By repeating this while changing the type of voice, it is possible to easily prepare learning data with a plurality of teacher labels, as sketched below.
  • learning data obtained by sampling the target sound may be used.
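  • As a rough sketch of the simulation approach described above, a dry (anechoic) voice recording can be convolved with per-microphone impulse responses measured for a known source position, and labeled with the corresponding answer vector. The code below is a hedged illustration (NumPy/SciPy assumed; the impulse-response data and the label convention are hypothetical):

        import numpy as np
        from scipy.signal import fftconvolve

        def simulate_observation(dry_voice, impulse_responses):
            # Simulate the M-channel sound signals for a source at a known
            # position by convolving a dry voice with per-microphone impulse
            # responses (impulse_responses: array of shape (M, ir_length)).
            return np.stack([fftconvolve(dry_voice, ir)[:len(dry_voice)]
                             for ir in impulse_responses])

        def answer_vector(theta, phi, volume):
            # Teacher label: a 3D vector whose direction is the known source
            # direction and whose magnitude is the attached information.
            return volume * np.array([np.cos(phi) * np.cos(theta),
                                      np.cos(phi) * np.sin(theta),
                                      np.sin(phi)])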
  • NN: Neural Network
  • MLP: Multilayer Perceptron
  • CNN: Convolutional Neural Network
  • RNN: Recurrent Neural Network
  • LSTM: Long Short-Term Memory Network
  • the learner may be configured by using an arbitrary algorithm applicable to estimation of the sound source direction and the like.
  • The learner can be regarded as a function that converts the feature data 6 into the three-dimensional vector P (hereinafter referred to as function A). Training the learner can therefore be said to be a process of optimizing function A so that the three-dimensional vector P is calculated appropriately, as sketched below.
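  • Purely as an illustrative sketch of function A (the patent does not specify an architecture), an MLP mapping the flattened feature data to the three vector components might look as follows, here written with PyTorch (an assumption, as are the layer sizes):

        import torch.nn as nn

        class VectorEstimator(nn.Module):
            # Function A: maps feature data D_i(c, t, f) to a 3D vector P.
            def __init__(self, n_channels, t_in, n_freq, hidden=256):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Flatten(),  # (B, C, T_i, F) -> (B, C*T_i*F)
                    nn.Linear(n_channels * t_in * n_freq, hidden),
                    nn.ReLU(),
                    nn.Linear(hidden, hidden),
                    nn.ReLU(),
                    nn.Linear(hidden, 3),  # the components x, y, z
                )

            def forward(self, d_in):
                return self.net(d_in)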
  • The post-processing unit 22 converts the three-dimensional vector P into polar coordinates to calculate the direction information and the attached information. Specifically, according to the above equations (1) to (3), the horizontal angle θ and the elevation angle φ of the sound source direction and the value I of the attached information are calculated from the three-dimensional vector P. In addition, the post-processing unit 22 can execute various other operations using the three-dimensional vector P.
  • the vector estimation unit 21 and the post-processing unit 22 described above function as output units according to the present embodiment.
  • FIG. 3 is a flowchart showing the basic operation of the processing unit 100.
  • the process shown in FIG. 3 is a process that is repeatedly executed at a predetermined processing rate.
  • the preprocessing unit 20 calculates the feature data 6 of the sound signal 5 (step 101).
  • the vector estimation unit 21 outputs the three-dimensional vector P with the feature data 6 as an input (step 102).
  • The post-processing unit 22 converts the three-dimensional vector P into polar coordinates and calculates the direction information (θ and φ) and the attached information (I) of the voice 2 (step 103).
  • In this way, the processing unit 100 receives as input quantities representing the features of the multi-channel sound signals, outputs a three-dimensional vector P = (x, y, z) expressing the direction information and the other attached information, and obtains the information related to the voice 2 (θ, φ, I) by polar coordinate conversion.
  • the processing unit 100 continuously calculates the direction information and the attached information at a predetermined processing rate. This makes it possible to constantly monitor the direction in which the voice 2 is emitted.
  • Hereinafter, steps 101 to 103 will be specifically described, taking as an example the case where the volume of the voice 2 is set as the attached information (I). The following description is also applicable when the attached information is set to another value.
  • In step 101, the amplitude spectrum of each of the plurality of sound signals 5 and the phase difference spectra between the plurality of sound signals 5 are calculated as the feature data 6.
  • the amplitude spectrum is a spectrum representing the intensity of each frequency component.
  • the phase difference spectrum is a spectrum representing the phase difference for each frequency component.
  • The preprocessing unit 20 reads the sound signals 5 (M-channel sound signals 5) output from the M microphones 11, records them in a storage unit such as a buffer, and executes a short-time Fourier transform on each sound signal 5.
  • In the short-time Fourier transform, the target signal (sound signal 5) is divided into sections of a predetermined time width (time frames t), and the Fourier transform is executed on the signal included in each divided section. The time frames t may overlap each other or may be separated.
  • The sound signal output from the m-th microphone 11 at sampling time τ is denoted by s_m(τ), and the complex spectrum calculated by the short-time Fourier transform of s_m(τ) is denoted by S_m(t, f), where t is the time frame and f is the frequency bin.
  • The amplitude spectrum |S_m(t, f)| of the complex spectrum S_m(t, f) is calculated; that is, M-channel amplitude spectra are calculated from the M-channel complex spectra.
  • The phase difference spectrum is calculated as arg(S_m(t, f) / S_j(t, f)), where arg is the function that calculates the argument (declination) of a complex number, j is a reference channel, and m represents a channel other than j. That is, M-1 phase difference spectra are calculated from the M-channel complex spectra.
  • The input section length T_i is set longer than the interval of the time frames t described above; that is, the input section length T_i includes a plurality of time frames t. The preprocessing unit 20 therefore outputs spectrum data for 2M-1 channels, consisting of amplitude spectra for M channels and phase difference spectra for M-1 channels.
  • The data size of the spectrum data is (number of channels) × (section length T_i) × (number of frequency bins F). The input data D_i is therefore expressed as D_i(c, t, f), where c is an index indicating each channel of the spectrum data.
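  • A minimal sketch of this preprocessing step, assuming NumPy/SciPy, channel 0 as the reference channel j, and illustrative STFT parameters:

        import numpy as np
        from scipy.signal import stft

        def feature_data(signals, fs, nperseg=512):
            # Compute the (2M-1)-channel spectrum data from M sound signals:
            # M amplitude spectra plus M-1 phase difference spectra relative
            # to the reference channel j = 0.
            specs = [stft(s, fs=fs, nperseg=nperseg)[2] for s in signals]  # S_m(t, f)
            amplitude = [np.abs(s) for s in specs]                         # |S_m(t, f)|
            # arg(S_m / S_j) computed as arg(S_m * conj(S_j)) to avoid division
            phase_diff = [np.angle(s * np.conj(specs[0])) for s in specs[1:]]
            return np.stack(amplitude + phase_diff)  # shape: (2M-1, F, T_i)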
  • FIG. 4 is a data plot showing an example of feature data.
  • FIG. 4 shows plots representing the spectrum data of the amplitude spectra and the phase difference spectra.
  • The four plots from the top are the amplitude spectra (|S_m(t, f)|), and the three plots below them are the phase difference spectra (arg(S_m(t, f) / S_j(t, f))).
  • the horizontal axis is time (time frame t) and the vertical axis is frequency (frequency bin f).
  • the color of each point shown in gray scale represents the amplitude or phase difference.
  • In sections where the amplitude spectra are plotted dark, sound is present; this sound includes, for example, the voice 2 that is the target wave, the ambient noise 3, and the like.
  • In such sections, a phase difference corresponding to the deviation in the timing at which each microphone 11 detects the sound is observed.
  • In sections where the amplitude spectra are gray, the sound is relatively quiet or only the noise 3 is present; in this case, the phase difference for each frequency is substantially random.
  • The data sections included in the input section length T_i are indicated by the solid black border.
  • The data of each plot included in this section become the input data D_i(c, t, f) input to the vector estimation unit 21.
  • In the example shown in FIG. 4, M = 4, so the number of channels is 2M-1 = 7, and the data size of the input data D_i(c, t, f) is 7 × T_i × F.
  • The three-dimensional vector P is estimated by the vector estimation unit 21 to which the input data D_i(c, t, f) is input. The vector P is output over a section of length T_o (hereinafter described as the output section length T_o). The learner constituting the vector estimation unit 21 is configured so that, in this way, the vector estimation unit 21 functions as a function A that converts the input D_i into the output D_o.
  • the function A is optimized and determined by a machine learning algorithm such as deep learning. It should be noted that among the parameters constituting the function A, there may be a parameter for accumulating the past processing results. By using such past processing results for the optimization of the function A, it is possible to improve the estimation accuracy of the sound source direction and the detection accuracy of the attached information.
  • FIG. 5 is a graph of the three-dimensional vector output from the feature data shown in FIG. 4.
  • graphs of each component x (t), y (t), and z (t) of the three-dimensional vector P are shown in order from the top.
  • the horizontal axis of each graph is time, and the vertical axis is the size of each component.
  • the scale of the vertical axis is appropriately set for each graph.
  • The post-processing unit 22 performs the polar coordinate conversion on the three-dimensional vector P(t) for each time frame t:
        θ(t) = atan2(y(t), x(t)) … (4)
        φ(t) = atan2(z(t), √(x(t)² + y(t)²)) … (5)
        I(t) = √(x(t)² + y(t)² + z(t)²) … (6)
    Equations (4) to (6) correspond to equations (1) to (3) described with reference to FIG. 2, respectively.
  • Equation (4) gives the horizontal angle θ(t) of the sound source direction in time frame t.
  • Equation (5) gives the elevation angle φ(t) of the sound source direction in time frame t.
  • Equation (6) gives the value I(t) of the attached information in time frame t, here the volume of the voice 2.
  • the post-processing unit 22 calculates the sound source direction and attached information (volume of voice 2) for each frame from the three-dimensional vector P (t).
  • In the present embodiment, the vector estimation unit 21 (function A) is trained on the component V_m(t, f) of the voice 2 contained in the complex spectrum S_m(t, f) (equation (7)). Specifically, I(t), the magnitude of the three-dimensional vector P(t) in equation (6), is trained to become the voice power of a specific microphone 11 (here, the k-th). In this case, I(t) is expressed by the following equation:
        I(t) = Σ_f |V_k(t, f)|² … (8)
  • That is, the function A is optimized so that I(t) calculated by equation (6) satisfies the relationship of equation (8).
  • As a result, even if a sound signal 5 disturbed by the noise 3 is input, the attached information I(t) output from the vector estimation unit 21 ideally represents only the power (volume) of the voice 2, regardless of the power of the noise 3. This corresponds to detection of the voice 2. Therefore, by setting the power of the voice 2 as the attached information, it is possible to realize voice activity detection (VAD), which detects the sections in which the voice 2 occurs.
  • For example, the power may be set to 0 when the voice 2 does not exist, and may be expressed on a logarithmic scale when the voice 2 exists. In this case, I(t) is expressed by a logarithmic-scale counterpart of equation (8) (equation (9)).
  • the method of expressing the volume of the voice 2 is not limited.
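  • As a sketch of how voice activity detection could then be read off the attached-information track I(t) (the threshold value is an illustrative assumption):

        import numpy as np

        def voice_sections(i_t, threshold=0.1):
            # Return (start, end) frame-index pairs of the sections where the
            # attached information I(t), here the voice power, exceeds a threshold.
            active = i_t > threshold
            edges = np.flatnonzero(np.diff(active.astype(int)))
            bounds = np.concatenate(([0], edges + 1, [len(i_t)]))
            return [(bounds[k], bounds[k + 1])
                    for k in range(len(bounds) - 1) if active[bounds[k]]]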
  • FIG. 6 is a graph of the sound source direction and the volume of the voice 2 calculated from the three-dimensional vector shown in FIG. 5.
  • FIG. 6 shows graphs of the horizontal angle θ(t) of the sound source direction, the elevation angle φ(t) of the sound source direction, and the volume I(t) of the voice 2, in this order from the top.
  • The horizontal axis of each graph is time.
  • The vertical axis of the graphs of the horizontal angle θ(t) and the elevation angle φ(t) is the angle.
  • The vertical axis of the graph of the volume I(t) of the voice 2 represents the loudness (power) of the sound.
  • The values of θ(t) and φ(t) change from 0° to a constant angle at each peak of the graph of I(t). Therefore, in the example shown in FIG. 6, voices 2 emitted by a human 1 located in the same direction are detected. Further, for example, when conversations of humans 1 at different positions are observed, the direction in which the human 1 who uttered each voice 2 exists is estimated as the sound source direction for each peak of the voice 2. As described above, in the present embodiment, it is possible to accurately detect both the direction of the person who emitted the voice 2 and the volume of the voice 2.
  • the target section corresponds to a predetermined period.
  • The aggregation of the three-dimensional vectors P is executed by the post-processing unit 22. Specifically, the sums of the components x(t), y(t), and z(t) of the three-dimensional vectors P(t) output in the target section are calculated. For example, to acquire the sound source direction of the utterance immediately before a certain time t_c, a time t_p earlier than t_c is used, and the sums x_u, y_u, z_u of the components are calculated as:
        x_u = Σ_{t=t_p..t_c} x(t), y_u = Σ_{t=t_p..t_c} y(t), z_u = Σ_{t=t_p..t_c} z(t) … (10)
  • The time t_p corresponds to the start time of the target section, and the time t_c corresponds to the end time of the target section. Therefore, x_u, y_u, and z_u can be said to be the components of a vector (hereinafter referred to as the aggregate vector) obtained by synthesizing the three-dimensional vectors P output in the target section.
  • the aggregate vector corresponds to the second vector.
  • Polar coordinate transformation is then executed on the aggregate vector whose components are x_u, y_u, and z_u calculated according to equation (10).
  • For example, the horizontal angle θ_u and the elevation angle φ_u of the sound source direction for the utterance immediately before time t_c are calculated as:
        θ_u = atan2(y_u, x_u) … (11)
        φ_u = atan2(z_u, √(x_u² + y_u²)) … (12)
  • In the above, the sums of the components over the target section are calculated, but the averages of the components over the target section may be calculated instead. That is, the average of each component is obtained by dividing x_u, y_u, and z_u in equation (10) by the number of time frames included in the target section.
  • The vector represented by the averages of the components is likewise an aggregate vector calculated by synthesizing the three-dimensional vectors P.
  • In this way, the aggregate vector is calculated by synthesizing the three-dimensional vectors P output in the target section, and the arrival direction of the voice 2 in the target section is calculated based on the aggregate vector, as sketched below.
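  • A sketch of this aggregation, following equations (10) to (12) (NumPy assumed; the frame indices t_p and t_c are assumed to be given):

        import numpy as np

        def utterance_direction(p, t_p, t_c):
            # p: per-frame 3D vectors P(t), shape (T, 3). Aggregate over the
            # target section [t_p, t_c) and return the sound source direction.
            x_u, y_u, z_u = p[t_p:t_c].sum(axis=0)       # equation (10)
            theta_u = np.arctan2(y_u, x_u)               # horizontal angle (11)
            phi_u = np.arctan2(z_u, np.hypot(x_u, y_u))  # elevation angle (12)
            return theta_u, phi_u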
  • the target section (time t p to time t c ) may include a section without voice 2.
  • In sections without the voice 2, the values of x(t), y(t), and z(t) are sufficiently small, ideally 0. Therefore, the values of the components in sections without the voice 2 do not greatly influence the calculation result, and the sound source direction for the sections containing the voice 2 within the target section can be acquired with high accuracy.
  • As another method, one could identify which sections of the target section correspond to the voice 2 and estimate the direction based on that result.
  • In that case, heuristic processing using various parameters for judging certainty, empirical rules, and the like would be required, and the estimation accuracy could be lowered.
  • In contrast, in the present embodiment, the sound source direction over a section of the voice 2 is easily calculated simply by synthesizing the three-dimensional vectors P over the target section. That is, no heuristic processing for determining the sections of the voice 2 is necessary, and the sound source direction can be estimated with high accuracy.
  • In another configuration, the vector estimation unit 21 (function A) is trained so that I(t), the magnitude of the three-dimensional vector P(t) in equation (6), becomes the existence probability of the voice 2.
  • the existence probability of the voice 2 is a probability indicating whether or not the voice 2 is generated.
  • In this case, I(t) is expressed by an equation (equation (13)) whose value is 1 when the voice 2 exists and 0 when it does not.
  • The vector estimation unit 21 optimized according to equation (13) outputs, for example, a three-dimensional vector P having a magnitude between 0 and 1.
  • The three-dimensional vector P may be output as it is, with I(t) taking values from 0 to 1. This makes it possible to realize an application that performs predetermined processing when the voice 2 is likely to exist (for example, when the existence probability is 0.5 or more). Alternatively, the output may be controlled so that I(t) takes a value of either 0 or 1, which makes it possible to simplify subsequent processing.
  • the method of setting the existence probability of voice 2 is not limited.
  • For example, the average value of the power of the voice 2 across the plurality of microphones 11 included in the microphone array 10 may be used.
  • In that case, the function A is optimized so that the existence probability of the voice 2 becomes 1 when the average power is larger than a predetermined threshold value δ.
  • The predetermined threshold value δ can be set arbitrarily according to the configuration of the microphones 11 and the like.
  • As yet another example, the vector estimation unit 21 may be trained so that I(t), the magnitude of the three-dimensional vector P(t), becomes the signal-to-noise ratio between the voice 2 and the noise 3.
  • In this case, I(t) is expressed by, for example, equation (14), the power ratio between the voice 2 and the noise 3.
  • In general, the estimation accuracy of the sound source direction correlates with the signal-to-noise ratio: when the signal-to-noise ratio is small, the estimation accuracy tends to be low, and when it is large, the accuracy tends to be high. Therefore, by setting the power ratio between the voice 2 and the noise 3 as the attached information, the output value I(t) can be interpreted as the reliability of the sound source direction estimation for each time frame. That is, by using equation (14), the reliability regarding the arrival direction of the voice 2 is set as the attached information.
  • the method for expressing the signal-to-noise ratio is not limited to the method represented by equation (14).
  • the signal-to-noise ratio may be expressed using the average values of the powers of the voice 2 and the noise 3 detected by the plurality of microphones 11 included in the microphone array 10.
  • an arbitrary parameter capable of expressing the reliability with respect to the arrival direction of the voice 2 may be set to I (t).
  • If an erroneous estimate of the sound source direction is adopted, the quality of the user experience may be significantly impaired.
  • One example is an application in which the robot looks back in the direction of the user when the user speaks. In this case, if the estimated sound source direction is an erroneous value, there is a possibility that the robot may look back in an unrelated direction when the user speaks.
  • In such a case, when the reliability is low, an alternative process is executed without adopting the sound source direction estimate at that time.
  • For example, a process of notifying the user that the sound source direction could not be estimated, or that its reliability is low, is executed. Examples of the notification method include a gesture indicating that the voice 2 could not be heard, display of a message, lighting of a lamp, and the like. This avoids the situation in which the robot turns in an unrelated direction.
  • a process of switching the method of estimating the direction in which the user is located from a method using the microphone 11 to another method such as a method using the camera is executed. That is, when it is difficult to estimate the direction by the sound signal due to the influence of noise 3 or the like, a process of searching for a user by using image recognition or the like is executed. This makes it possible to properly detect the direction in which the user is, even when the estimation of the sound source direction does not work. In this way, by performing the alternative processing based on the reliability of the sound source direction estimation, it is possible to sufficiently avoid the deterioration of the quality of the user experience.
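  • A hedged sketch of such a fallback policy (the threshold value and the returned actions are hypothetical, not from the patent):

        def handle_direction_estimate(theta, phi, reliability, threshold=1.0):
            # Adopt the estimated sound source direction only when the attached
            # information, interpreted as reliability (e.g. the signal-to-noise
            # ratio of equation (14)), is high enough; otherwise fall back.
            if reliability >= threshold:
                return ("turn_towards", theta, phi)      # adopt the estimate
            return ("notify_or_use_camera", None, None)  # alternative process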
  • Input data with a teacher label is used for learning of the learner constituting the vector estimation unit 21.
  • This teacher label is a vector (answer vector) representing the sound source direction, volume, etc., which should be estimated from the corresponding input data.
  • the accuracy of the learning device is evaluated by comparing the three-dimensional vector P output by the learning device based on the input data with the answer vector.
  • the Euclidean distance between the three-dimensional vector P and the answer vector is calculated.
  • the Euclidean distance is a distance in a three-dimensional Euclidean space as represented by the three-dimensional Cartesian coordinate system described with reference to FIG.
  • This Euclidean distance can represent the amount of deviation of the three-dimensional vector P with respect to the answer vector representing the correct answer.
  • the mean square error (MSE: Mean Squared Error) is calculated using this Euclidean distance.
  • the method of expressing the error is not limited.
  • That is, the vector estimation unit 21 is a learner that outputs the three-dimensional vector P corresponding to input data and that is trained using an error according to the Euclidean distance between the output three-dimensional vector P and the answer vector corresponding to the input data.
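  • A sketch of this loss, consistent with the learner sketch above (PyTorch assumed; the squared Euclidean distance averaged over the batch is one possible reduction convention):

        import torch

        def euclidean_mse(p_pred, p_ans):
            # Mean squared error based on the Euclidean distance between the
            # output 3D vectors and the answer vectors (shape: batch x 3).
            return ((p_pred - p_ans) ** 2).sum(dim=-1).mean()

        loss = euclidean_mse(torch.zeros(8, 3), torch.ones(8, 3))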
  • When the Euclidean distance is small, the error of the learner is small; when the Euclidean distance is large, the error of the learner is large.
  • Since the output format of the vector estimation unit 21 (learner) is a three-dimensional vector P that can express the sound source direction and the attached information in an integrated manner, the error of the three-dimensional vector P can be easily calculated by computing the Euclidean distance from the answer vector.
  • The evaluation of the three parameters, the horizontal angle θ, the elevation angle φ, and the attached information, can thus be performed at the same time. For example, in a learner that calculates the horizontal angle θ, the elevation angle φ, and so on directly, it is necessary to provide a rule for distinguishing between 0° and 360°, and heuristic processing is required to calculate the error. On the other hand, by using a format that outputs a three-dimensional vector P as in the present disclosure, it is possible to avoid such heuristic processing and perform highly accurate error evaluation. This makes it possible to dramatically improve the learning accuracy of the learner.
  • For learning, an error backpropagation method that adjusts weights using errors may be used. Even with such an algorithm, stable error backpropagation is possible by expressing the sound source direction not as an angle but as a three-dimensional vector P in three-dimensional Euclidean space. This makes it possible to easily implement algorithms using error backpropagation.
  • As described above, in the processing unit 100 according to the present embodiment, the three-dimensional vector P is output by inputting the feature data 6 of the plurality of sound signals 5 obtained by observing the voice 2.
  • This three-dimensional vector P is a vector representing the direction information of the arrival direction of the voice 2 and the attached information regarding the voice 2.
  • In this way, the direction information and the attached information are collectively output as one vector. This makes it possible to detect the direction of the voice 2 and other attached information (such as the volume of the voice 2) with high accuracy.
  • When the sound source direction estimation and the voice detection algorithms are configured individually, it is generally difficult to optimize both as a whole. For example, if the voice can be detected in advance, the direction can be estimated with higher accuracy; conversely, if the direction of the voice can be estimated in advance, the voice can be detected with higher accuracy. In this case, the optimization of each process requires the other's processing result, and as a result it may be necessary to adopt algorithms individually optimized for each process.
  • In contrast, in the present embodiment, the vector estimation unit 21 outputs a single three-dimensional vector P representing both the sound source direction and the volume of the voice 2 (the attached information).
  • the three-dimensional vector P is a vector representing the estimation result of the sound source direction and the detection result of voice detection. That is, by outputting the three-dimensional vector P, it is possible to optimally solve a plurality of problems at the same time. As a result, the estimation accuracy of the sound source direction and the detection accuracy of the voice 2 can be significantly improved, and the calculation efficiency can be sufficiently improved. In addition, it is not necessary to develop separate algorithms, and development costs can be significantly reduced.
  • The present inventor evaluated the estimation of the sound source direction using the three-dimensional vector P according to the present technology, using data (sound signals 5) detected by a microphone array 10 mounted on a specific device.
  • For the evaluation, the ratio of horizontal-angle errors θ falling within a predetermined angle range was measured in multiple environments and compared with other methods for estimating the sound source direction. As the predetermined angle range, a range set based on the angle of view of a camera was adopted.
  • the method of expressing the sound source direction and attached information using one vector can greatly improve the estimation accuracy of the sound source direction. This makes it possible to improve the operating accuracy of the system that performs voice processing and the like. Further, by using this technology, it is possible to provide a highly reliable voice application or the like.
  • FIG. 7 is a block diagram showing a configuration example of the processing unit 200 according to the second embodiment.
  • The processing unit 200 is an arithmetic unit that calculates information on the voice 2, and has a pre-processing unit 220, a vector estimation unit 221, and a post-processing unit 222.
  • the preprocessing unit 220 is configured in the same manner as the preprocessing unit 20 shown in FIG. 1, for example, and outputs the feature data 6 of the plurality of sound signals 5 output from the microphone array 10. Note that in FIG. 7, the microphone array is not shown.
  • The vector estimation unit 221 outputs a three-dimensional vector P representing the direction information and the attached information for each frequency component included in the sound signals 5, based on the feature data 6. Specifically, the learner constituting the vector estimation unit 221 is trained to output the three-dimensional vector P for each frequency bin f. Further, the mean square error between the three-dimensional vector P and the answer vector, calculated for each frequency bin f, is used for training the learner.
  • the post-processing unit 222 executes conversion processing and aggregation processing of the three-dimensional vector P output for each frequency component (frequency bin), and calculates direction information indicating the sound source direction and attached information regarding the sound 2.
  • FIG. 8 is a data plot showing an example of feature data.
  • The feature data 6 (amplitude spectra and phase difference spectra) are calculated by the preprocessing unit 220 in the same manner as the processing described with reference to FIG. 4, for example.
  • FIG. 8 shows plots representing the spectrum data of the amplitude spectra and the phase difference spectra.
  • The output data D_o is expressed as D_o(c, t, f), where c is an index representing each component of the three-dimensional vector P.
  • The data size of D_o(c, t, f) is 3 × T_o × F.
  • The vector estimation unit 221 functions as a function B that converts the input D_i into the output D_o.
  • Hereinafter, the case where the volume of the voice 2 is set as the attached information targeted by the function B will be described as an example.
  • FIG. 9 is a data plot showing the three-dimensional vector P output from the feature data shown in FIG.
  • FIG. 9 shows data plots of each component x(t, f), y(t, f), and z(t, f) of the three-dimensional vector P(t, f), in order from the top.
  • the horizontal axis of each graph is time, and the vertical axis is frequency. The values of each component are shown in gray scale.
  • The data sections included in the output section length T_o are indicated by the solid black border. The data of each plot included in this section become the output data D_o(c, t, f) output from the vector estimation unit 221 (function B).
  • the sound source direction and attached information are calculated from the output data Do (c, t, f).
  • The polar coordinate transformation is executed for each time frame t and frequency bin f as shown in the following equations, which correspond to equations (1) to (3) described with reference to FIG. 2:
        θ(t, f) = atan2(y(t, f), x(t, f)) … (15)
        φ(t, f) = atan2(z(t, f), √(x(t, f)² + y(t, f)²)) … (16)
        I(t, f) = √(x(t, f)² + y(t, f)² + z(t, f)²) … (17)
  • Equation (15) gives the horizontal angle θ(t, f) of the sound source direction.
  • Equation (16) gives the elevation angle φ(t, f) of the sound source direction.
  • Equation (17) gives the value I(t, f) of the attached information, here the volume of the voice 2.
  • the post-processing unit 222 calculates the sound source direction and attached information (volume of voice 2) for each time frame and frequency from the three-dimensional vector P (t, f).
  • FIG. 10 is a data plot showing the volume of the voice 2 calculated from the three-dimensional vector P shown in FIG. 9.
  • the horizontal axis of FIG. 10 is time, and the vertical axis is frequency. Further, the volume (power) of the voice 2 in each time frame t and the frequency bin f is shown in gray scale.
  • In the present embodiment, the function B is optimized so that I(t, f) becomes the power (spectrogram) of the voice 2 for each frequency bin of a specific microphone (here, the k-th).
  • Using the component V_k(t, f) of the voice 2 in the complex spectrum (equation (7)), I(t, f) is expressed by the following equation:
        I(t, f) = |V_k(t, f)|² … (18)
  • That is, the function B is optimized so that I(t, f) calculated by equation (17) satisfies the relationship of equation (18).
  • In this way, even if a sound signal 5 disturbed by noise is input, the output attached information ideally represents the power (volume) of the voice 2 for each frequency bin, regardless of the presence or absence of the noise 3.
  • the data plot shown in FIG. 10 is a plot representing a voice signal showing a response of only voice 2 extracted from the original sound signal including noise 3 and the like.
  • the frequency distribution of I (t, f) calculated according to the equation (17) becomes the frequency distribution of the power of the voice 2 in the time frame t, that is, the amplitude spectrum of the voice 2.
  • This amplitude spectrum does not include a spectrum such as noise 3.
  • the vector estimation unit 221 calculates the voice signal representing the amplitude spectrum of the voice 2 based on the three-dimensional vector P output for each frequency component. As a result, it becomes possible to perform highly accurate voice recognition or the like using a voice signal in which noise 3 is suppressed, and it is possible to significantly improve the processing accuracy of various applications using voice 2.
  • This speech enhancement process (the process of extracting the voice signal) can be regarded as voice activity detection (VAD) performed for each frequency bin. Therefore, in the present embodiment, when the volume of the voice 2 is set as the attached information, voice enhancement, voice activity detection, and sound source direction estimation are solved by one calculation. This makes it possible to provide a single, totally optimized algorithm that performs three processes at once.
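  • A sketch of this enhancement step: the per-frequency magnitudes I(t, f) form a noise-suppressed amplitude spectrogram that can be combined with the observed phase of a reference microphone to resynthesize a waveform (NumPy/SciPy assumed; the ISTFT parameters and the phase-reuse convention are illustrative assumptions):

        import numpy as np
        from scipy.signal import istft

        def enhance_voice(p_tf, s_ref, fs, nperseg=512):
            # p_tf: per-frequency 3D vectors, shape (3, T, F).
            # s_ref: complex spectrum of a reference microphone, shape (F, T).
            i_tf = np.sqrt((p_tf ** 2).sum(axis=0))           # I(t, f), equation (17)
            enhanced = i_tf.T * np.exp(1j * np.angle(s_ref))  # keep observed phase
            _, x = istft(enhanced, fs=fs, nperseg=nperseg)
            return x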
  • Next, the whole three-dimensional vector P(t), representing the overall sound source direction and the overall attached information in a certain time frame t, is calculated.
  • Hereinafter, this vector calculated from the three-dimensional vectors P(t, f) will be referred to as the whole vector P(t).
  • In the present embodiment, the whole vector P(t) corresponds to the first vector.
  • When the three-dimensional vectors P(t, f) are output from the vector estimation unit 221, the components x(t), y(t), and z(t) of the whole vector P(t) are calculated as follows:
        x(t) = Σ_f x(t, f), y(t) = Σ_f y(t, f), z(t) = Σ_f z(t, f) … (20)
  • The direction of the whole vector P(t) calculated by equation (20) represents the arrival direction (sound source direction) of the voice 2 occurring at timing t.
  • In this way, the whole vector P(t) representing the arrival direction of the voice 2 is calculated by synthesizing the three-dimensional vectors P(t, f) output for each frequency component.
  • Further, the magnitude of the whole vector P(t) represents the overall value I(t) of the attached information regarding the voice 2.
  • FIG. 11 is a graph of the whole vector P(t) calculated from the three-dimensional vectors P(t, f) shown in FIG. 9.
  • Graphs of each component x(t), y(t), and z(t) of the whole vector P(t) are shown in order from the top.
  • the horizontal axis of each graph is time, and the vertical axis is the size of each component.
  • the scale of the vertical axis is appropriately set for each graph.
  • The graphs shown in FIG. 11 are obtained by adding, in the frequency direction, the components individually output for each frequency bin, and correspond to the components of the three-dimensional vector P(t) described with reference to FIG. 5. That is, by synthesizing the three-dimensional vectors P(t, f) in the post-processing unit 222, a vector (the whole vector P(t)) similar to the three-dimensional vector P(t) output by the vector estimation unit 21 (function A) of the first embodiment can be calculated.
  • The post-processing unit 222 then executes the polar coordinate transformation on the whole vector P(t) to calculate the horizontal angle θ(t), the elevation angle φ(t), and the attached information I(t) of the voice 2:
        θ(t) = atan2(y(t), x(t))
        φ(t) = atan2(z(t), √(x(t)² + y(t)²))
        I(t) = √(x(t)² + y(t)² + z(t)²)
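  • A sketch of equation (20) and the subsequent polar decomposition (NumPy assumed; the array layout is an illustrative convention):

        import numpy as np

        def whole_vector_direction(p_tf):
            # Synthesize the whole vector P(t) by summing the per-frequency
            # vectors P(t, f) over frequency. p_tf shape: (3, T, F).
            x, y, z = p_tf.sum(axis=2)           # equation (20), shape (3, T)
            theta = np.arctan2(y, x)             # sound source direction
            phi = np.arctan2(z, np.hypot(x, y))
            i = np.sqrt(x**2 + y**2 + z**2)      # overall attached information
            return theta, phi, i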
  • FIG. 12 is a graph of the sound source direction and the volume calculated from the whole vector shown in FIG. 11.
  • FIG. 12 shows graphs of the horizontal angle θ(t) of the sound source direction, the elevation angle φ(t) of the sound source direction, and the volume I(t) of the voice 2, in this order from the top.
  • the horizontal axis of each graph is time.
  • The vertical axis of the graphs of the horizontal angle θ(t) and the elevation angle φ(t) is the angle.
  • the vertical axis of the graph of the volume I (t) of the voice 2 represents the loudness (power) of the sound.
  • At each peak of I(t), the volume of the voice 2 is high, and it can be seen that the voice 2 is detected. Further, the values of θ(t) and φ(t) change from 0° to a constant angle at each peak of I(t). It can therefore be seen that the voices 2 detected as the peaks of I(t) are all emitted from the same direction.
  • When the magnitude I(t, f) of the three-dimensional vector P(t, f) is set to the power of the voice 2 shown in equation (18), the magnitude I(t) of the whole vector P(t) obtained by equation (19) can be regarded as the power of the voice 2 shown in equation (9).
  • For example, vibration detectors that detect vibrations on or under the ground are placed at multiple locations, and the feature data (amplitude spectra and phase difference spectra) of the vibration signals output from each vibration detector are input to the learner.
  • The learner is trained in advance to output a three-dimensional vector representing the arrival direction of a seismic wave and its intensity based on the feature data of the vibration signals. This makes it possible to detect the arrival direction and intensity of seismic waves with high accuracy.
  • this technology can be applied to various wave phenomena that propagate in space such as electromagnetic waves and gravitational waves.
  • the information processing device may be realized by an arbitrary computer that is configured separately from the processing unit and is connected to the processing unit via wire or wirelessly.
  • the information processing method according to the present technology may be executed by a cloud server.
  • the information processing method according to the present technology may be executed in conjunction with the processing unit and another computer.
  • the information processing method and program according to the present technology can be executed not only in a computer system composed of a single computer but also in a computer system in which a plurality of computers operate in conjunction with each other.
  • the system means a set of a plurality of components (devices, modules (parts), etc.), and it does not matter whether or not all the components are in the same housing. Therefore, a plurality of devices housed in separate housings and connected via a network, and one device in which a plurality of modules are housed in one housing are both systems.
  • The execution of the information processing method and the program according to the present technology by the computer system includes both the case where, for example, the acquisition of feature data and the output of the three-dimensional vector are executed by a single computer, and the case where each process is executed by a different computer. Further, the execution of each process by a predetermined computer includes causing another computer to execute part or all of the process and acquiring the result.
  • the information processing method and program related to this technology can be applied to a cloud computing configuration in which one function is shared by a plurality of devices via a network and processed jointly.
  • In the present disclosure, “same”, “equal”, “orthogonal”, and the like are concepts including “substantially the same”, “substantially equal”, “substantially orthogonal”, and the like; for example, states included within a predetermined range (for example, a range of ±10%) are also included.
  • The present technology can also adopt the following configurations.
  • (1) An information processing device including: an acquisition unit that acquires feature data of a plurality of signals observing a target wave; and an output unit that, based on the acquired feature data, outputs a three-dimensional vector representing direction information indicating an arrival direction of the target wave and incidental information regarding the target wave.
  • (2) The information processing device according to (1), in which the output unit outputs the three-dimensional vector so that the direction of the three-dimensional vector represents the direction information and the magnitude of the three-dimensional vector represents the incidental information.
  • (3) The information processing device according to (1) or (2), in which the output unit outputs the three-dimensional vector so that the direction information and the incidental information are calculated by performing polar coordinate conversion on the three-dimensional vector.
  • (4) The information processing device according to any one of (1) to (3), in which the direction information includes a horizontal angle and an elevation angle indicating the arrival direction of the target wave.
  • (5) The information processing device according to any one of (1) to (4), in which the output unit converts the three-dimensional vector into polar coordinates to calculate the direction information and the incidental information.
  • (6) The information processing device according to any one of (1) to (5), in which the target wave is a voice, and the plurality of signals are sound signals obtained by observing the voice.
  • (7) The information processing device according to (6), in which the direction information indicates the arrival direction of the voice.
  • (8) The information processing device according to (7), in which the incidental information includes any one of the volume of the voice, the existence probability of the voice, and the reliability of the arrival direction of the voice.
  • (9) The information processing device according to any one of (6) to (8), in which the output unit outputs the three-dimensional vector representing the direction information and the incidental information for each frequency component included in the sound signals.
  • (10) The information processing device according to (9), in which the incidental information is the volume of the voice for each frequency component, and the output unit calculates an audio signal representing the amplitude spectrum of the voice based on the three-dimensional vectors output for each frequency component.
  • (11) The information processing device according to (9) or (10), in which the output unit synthesizes the three-dimensional vectors output for each frequency component and calculates a first vector representing the arrival direction of the voice.
  • (12) The information processing device according to (11), in which the output unit calculates the direction information and the incidental information based on the first vector.
  • (13) The information processing device according to any one of (6) to (12), in which the output unit calculates a second vector by synthesizing the three-dimensional vectors output within a predetermined period, and calculates the arrival direction of the voice in the predetermined period based on the second vector.
  • (14) The information processing device according to any one of (6) to (13), in which the plurality of signals are the sound signals detected by each of a plurality of sound collectors arranged at different positions.
  • (15) The information processing device according to any one of (1) to (14), in which the feature data includes an amplitude spectrum of each of the plurality of signals and a phase difference spectrum between the plurality of signals.
  • (16) The information processing device according to any one of (1) to (15), in which the output unit is a learner that outputs the three-dimensional vector corresponding to input data and is trained using an error according to the Euclidean distance between the output three-dimensional vector and an answer vector corresponding to the input data (a minimal training sketch follows this list).
  • (17) An information processing method executed by a computer system, including: acquiring feature data of a plurality of signals observing a target wave; and outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating an arrival direction of the target wave and incidental information regarding the target wave.
  • (18) A program that causes a computer system to execute: a step of acquiring feature data of a plurality of signals observing a target wave; and a step of outputting, based on the acquired feature data, a three-dimensional vector representing direction information indicating an arrival direction of the target wave and incidental information regarding the target wave.
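The following is a minimal training sketch corresponding to configuration (16), written in PyTorch. The network architecture, feature dimensionality, and optimizer settings are illustrative assumptions; the configuration only requires that the learner output a three-dimensional vector and be trained with an error according to the Euclidean distance to the answer vector.

```python
import torch
import torch.nn as nn

FEATURE_DIM = 771  # assumed flattened size of the amplitude/phase-difference features

# A small fully connected learner mapping feature data to a 3D vector.
model = nn.Sequential(
    nn.Linear(FEATURE_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, 3),  # output: the three-dimensional vector
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(features, answer_vectors):
    """features: (batch, FEATURE_DIM); answer_vectors: (batch, 3)."""
    predicted = model(features)
    # Error according to the Euclidean distance between output and answer vector
    loss = torch.linalg.vector_norm(predicted - answer_vectors, dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Any architecture producing a 3D output could be substituted; the defining choice is the single Euclidean-distance loss rather than separate losses on angle and volume.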

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The information processing device according to one embodiment of the present invention includes an acquisition unit and an output unit. The acquisition unit acquires feature data of a plurality of signals obtained by observing a target wave. Based on the acquired feature data, the output unit outputs a three-dimensional vector representing direction information indicating an arrival direction of the target wave and incidental information regarding the target wave.
PCT/JP2020/022107 2019-06-14 2020-06-04 Dispositif de traitement d'informations, procédé de traitement d'informations, et programme WO2020250797A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2019-110917 2019-06-14
JP2019110917 2019-06-14

Publications (1)

Publication Number Publication Date
WO2020250797A1 true WO2020250797A1 (fr) 2020-12-17

Family

ID=73780749

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/022107 WO2020250797A1 (fr) 2019-06-14 2020-06-04 Dispositif de traitement d'informations, procédé de traitement d'informations, et programme

Country Status (1)

Country Link
WO (1) WO2020250797A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012039275A * 2010-08-05 2012-02-23 Nippon Telegr & Teleph Corp <Ntt> Reflected sound information estimation device, reflected sound information estimation method, and program
JP2013008031A * 2011-06-24 2013-01-10 Honda Motor Co Ltd Information processing device, information processing system, information processing method, and information processing program
JP2015050610A * 2013-08-30 2015-03-16 Honda Motor Co Ltd Acoustic processing device, acoustic processing method, and acoustic processing program
JP2015166764A * 2014-03-03 2015-09-24 Fujitsu Ltd Speech processing device, noise suppression method, and program
JP2018032001A * 2016-08-26 2018-03-01 Nippon Telegraph And Telephone Corp Signal processing device, signal processing method, and signal processing program

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4050602A1 * 2021-02-24 2022-08-31 GN Audio A/S Conference device with voice direction estimation
US11778374B2 (en) 2021-02-24 2023-10-03 Gn Audio A/S Conference device with voice direction estimation
WO2024009746A1 * 2022-07-07 2024-01-11 Sony Group Corporation Model generation device, model generation method, signal processing device, signal processing method, and program

Similar Documents

Publication Publication Date Title
US10063965B2 (en) Sound source estimation using neural networks
JP6279181B2 Acoustic signal enhancement device
US20060204019A1 (en) Acoustic signal processing apparatus, acoustic signal processing method, acoustic signal processing program, and computer-readable recording medium recording acoustic signal processing program
US9961460B2 (en) Vibration source estimation device, vibration source estimation method, and vibration source estimation program
CN108962231B A speech classification method, device, server, and storage medium
JP2017044916A Sound source identification device and sound source identification method
KR102191736B1 Speech enhancement method and apparatus using artificial neural network
WO2020250797A1 (fr) Dispositif de traitement d'informations, procédé de traitement d'informations, et programme
JP2006194700A Sound source direction estimation system, sound source direction estimation method, and sound source direction estimation program
JP2008236077A Target sound extraction device and target sound extraction program
JP6236282B2 Anomaly detection device, anomaly detection method, and computer-readable storage medium
JP7214798B2 Audio signal processing method, audio signal processing device, electronic apparatus, and storage medium
KR20210137146A Speech enhancement using clustering of cues
WO2022218134A1 Multichannel system and method for speech detection
Pertilä Online blind speech separation using multiple acoustic speaker tracking and time–frequency masking
EP2745293B1 Noise attenuation in a signal
CN116868265A System and method for data augmentation and speech processing in dynamic acoustic environments
US20220262342A1 (en) System and method for data augmentation and speech processing in dynamic acoustic environments
Dov et al. Multimodal kernel method for activity detection of sound sources
JP2023550434A Improved acoustic source localization method
JP2011139409A Acoustic signal processing device, acoustic signal processing method, and computer program
Mirbagheri et al. C-SL: Contrastive Sound Localization with Inertial-Acoustic Sensors
US11783826B2 (en) System and method for data augmentation and speech processing in dynamic acoustic environments
CN113724692B Telephone-scene audio acquisition and anti-interference processing method based on voiceprint features
Firoozabadi et al. Estimating the Number of Speakers by Novel Zig-Zag Nested Microphone Array Based on Wavelet Packet and Adaptive GCC Method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20822277

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20822277

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP