WO2020144857A1

WO2020144857A1 - Information processing device, program, and information processing method

Info

Publication number: WO2020144857A1
Application number: PCT/JP2019/000722
Authority: WO
Inventors: 政人土屋; 利行花澤
Original assignee: 三菱電機株式会社
Priority date: 2019-01-11
Filing date: 2019-01-11
Publication date: 2020-07-16
Also published as: JP6833147B2; JPWO2020144857A1

Abstract

The present invention is provided with: an audio speech likelihood calculation unit (103) that calculates audio speech likelihood from an audio signal including the voice of a subject person; a video speech likelihood calculation unit (104) that calculates video speech likelihood from a video signal indicating a video including the subject person; an environmental information determination unit (105) that determines audio reliability indicating the reliability of the audio signal and video reliability indicating the reliability of the video signal; a speech section detection unit (108) that adds a heavier weight to the audio speech likelihood when the audio reliability is higher, adds a heavier weight to the video speech likelihood when the video reliability is higher, calculates, by using the audio speech likelihood and the video speech likelihood, speech likelihood indicating a probability that the subject person is uttering speech in the audio signal and in the video signal, and detects, as a section of speech, a section where the calculated speech likelihood is higher than a predetermined threshold; and an audio recognition unit (109) that executes audio recognition on the audio signal in the section of speech.

Description

Information processing apparatus, program, and information processing method

The present invention relates to an information processing device, a program, and an information processing method.

ㆍMultimodal is a method to output some recognition processing result with multiple signals as input. In general, multimodal has higher system performance and tends to be more robust against signal noise than unimodal in which processing is performed using only one signal.

For example, in a system using an acoustic signal and an image signal, when acoustic noise is strong, a robust recognition result can be obtained by processing so that the image signal is used for recognition. Such a mechanism is called adaptive noise suppression.

In the conventional adaptive noise suppression method, for example, as described in Patent Document 1, a model learned by a general-purpose data set is erroneously recognized by a signal including noise in a use environment. There is a method of re-learning to reduce the number.

Japanese Patent Laid-Open No. 2002-169586

However, in the conventional method, for example, by combining existing person detection technology, it is better to use human prior knowledge such as "it is better not to use the image signal unless there is a person who can be acoustic noise in the vicinity", and the reliability of the signal is improved. It is difficult to design a flexible system that adjusts the characteristics.

Therefore, it is an object of one or more aspects of the present invention to make it possible to perform strong signal processing in a more noisy environment by determining the reliability of a signal.

An information processing apparatus according to an aspect of the present invention is a voice utterance likelihood calculation unit that calculates a voice utterance likelihood indicating a probability that the target person utters in the voice signal from a voice signal including a voice of the subject. And an image utterance likelihood calculation unit that calculates an image utterance likelihood indicating a probability that the subject is uttered in the image signal from an image signal indicating an image including the subject, and reliability of the voice signal. , And an environment information determination unit that determines image reliability indicating the reliability of the image signal, and the higher the audio reliability, the heavier the weight of the voice utterance likelihood, the higher the image reliability. Is higher, the weight of the image utterance likelihood is heavier, and the utterance indicating the probability that the subject is uttering in the voice signal and the image signal using the voice utterance likelihood and the image utterance likelihood. An utterance section detecting unit that calculates a likelihood and detects a section in which the calculated utterance likelihood is equal to or more than a predetermined threshold value as a utterance section, and a voice in the utterance section with respect to the voice signal. And a voice recognition unit that executes recognition.

A program according to one aspect of the present invention causes a computer to calculate a voice utterance likelihood calculating a voice utterance likelihood indicating a probability that the target person utters in the voice signal from a voice signal including a voice of the target person. Unit, an image utterance likelihood calculation unit that calculates an image utterance likelihood indicating a probability that the target person utters in the image signal from an image signal indicating an image including the target person, and reliability of the voice signal. , And an environment information determination unit that determines the image reliability indicating the reliability of the image signal, the higher the voice reliability, the heavier the weight of the voice utterance likelihood, the image reliability is The higher the higher the weight of the image utterance likelihood, the more the utterance likelihood indicating the probability that the subject is uttering in the voice signal and the image signal using the voice utterance likelihood and the image utterance likelihood. Utterance section detecting unit that calculates a degree and detects a section in which the calculated utterance likelihood is equal to or more than a predetermined threshold as a utterance section, and a voice in the utterance section with respect to the voice signal. It is characterized in that it functions as a voice recognition unit that executes recognition.

An information processing method according to an aspect of the present invention calculates a voice utterance likelihood indicating a probability that the target person speaks in the voice signal from a voice signal including a voice of the target person, and includes the target person. From an image signal indicating an image, an image utterance likelihood indicating the probability that the target person is uttering in the image signal is calculated, a voice reliability indicating the reliability of the voice signal, and a reliability of the image signal. Is determined, and the higher the voice reliability is, the heavier the weight of the voice utterance likelihood is, and the higher the image reliability is, the heavier the weight of the image utterance likelihood is, the higher the voice utterance likelihood is. Degree and the image utterance likelihood, the utterance likelihood indicating the probability that the subject is uttered in the voice signal and the image signal is calculated, and the calculated utterance likelihood is a predetermined threshold value. The above-mentioned section is detected as an utterance section, and voice recognition is performed on the voice signal in the utterance section.

According to an aspect of the present invention, it is possible to perform strong signal processing in a more noisy environment by determining the reliability of a signal.

It is a block diagram which shows roughly the structure of the speech recognition apparatus which concerns on embodiment. 1 is a schematic diagram of a vehicle-mounted voice recognition system including a voice recognition device according to an embodiment. It is a block diagram which shows the structure of an environment information determination part schematically. It is a schematic diagram showing an example of an utterance list of one passenger. It is a block diagram which shows roughly the hardware constitutions of the speech recognition apparatus which concerns on embodiment. 4 is a flowchart showing a flow of operations of the voice recognition device according to the exemplary embodiment.

FIG. 1 is a block diagram schematically showing a configuration of a voice recognition device 100 which is an information processing device according to an embodiment.
The voice recognition device 100 includes an interface unit (hereinafter referred to as I/F unit) 101, a voice signal processing unit 102, a voice utterance likelihood calculation unit 103, an image utterance likelihood calculation unit 104, and an environment information determination unit 105. And a speech section detection unit 108 and a voice recognition unit 109.

The voice recognition device 100 according to the embodiment is included in a vehicle-mounted voice recognition system 120, as shown in FIG. 2, for example.
The voice recognition system 120 includes the voice recognition device 100, N microphones 121 ₁ , 121 ₂ ,..., 121 _N as sound collection devices, a camera 122 as an imaging device, and a vehicle speed meter 123. In the present embodiment, the voice recognition system 120 is an in-vehicle voice recognition system in an in-vehicle environment equipped with a camera 122 for monitoring a passenger.
Here, N is an integer of 1 or more. In the present embodiment, N is equal to or larger than the number of seats M (M is an integer of 1 or more) provided in the vehicle 130 in which the voice recognition system 120 is installed. In the example of FIG. 2, N≧M and M=4.
The microphones 121 ₁ , 121 ₂ ,..., 121 _N are referred to as microphones 121 when there is no particular need to distinguish between them.

The microphone 121 generates a voice analog signal which is an analog signal indicating the voice inside the vehicle 130.
In the present embodiment, one microphone 121 is an omnidirectional microphone, and an array microphone is configured by arranging _N microphones 121 ₁ , 121 ₂ ,..., 121 _N at regular intervals. It has been done. Then, the N microphones 121 ₁ , 121 ₂ ,..., 121 _N obtain the N voice analog signals S ₁ , S ₂ ,..., S obtained by the voices of M passengers of the vehicle 130. _N is acquired. In other words, the audio analog signal _S _1, S 2, ..., _{S N} is the microphone ₁₂₁ _1, 121 2, ..., in one-to-one correspondence with 121 _N.

The configuration of the microphone 121 is not limited to such an example. The microphone 121 may have any configuration as long as it can generate a voice signal indicating the voice of the passenger of the vehicle 130. For example, N microphones 121 ₁ , 121 ₂ ,..., 121 _N may be arranged in front of the seat of the vehicle 130, with one microphone as a directional microphone. The microphone 121 may be installed in any place as long as it can acquire the sounds of all the passengers seated in the seat.

The camera 122 generates an image signal V showing an image inside the vehicle 130 in order to monitor an occupant.
The camera 122 is installed in an orientation having an angle of view such that the face of the passenger in the vehicle 130 is captured. The camera 122 may be a visible light camera or an infrared camera. When an infrared camera is used as the camera 122, it may be an active type in which a passenger is irradiated with infrared rays from a light emitting diode (not shown) installed in the vicinity and the reflected light is observed. ..
It should be noted that a plurality of cameras 122 may be installed in the vehicle 130 in order to image the faces of all passengers.

The vehicle speed meter 123 is a measuring device that measures the traveling speed of the vehicle 130, and generates speed information C indicating the traveling speed of the vehicle 130. For example, the vehicle speedometer 123 can acquire the vehicle speed from a system that controls the operation of the vehicle 130 through a communication line called a CAN bus to which an in-vehicle module such as a door meter is connected.

Returning to FIG. 1, the I/F unit 101 receives input of analog audio signals S ₁ to S _N from the microphone 121, image signal V from the camera 122, and speed information C from the vehicle speedometer 123. Then, the I/F unit 101 supplies the voice analog signals S ₁ to S _N from the microphone 121 to the voice signal processing unit 102, and the image signal V from the camera 122 to the image utterance likelihood calculation unit 104 and the environment information determination unit. The speed information C from the vehicle speedometer 123 is given to the environment information determination unit 105.

The audio signal processing unit 102 generates an audio digital signal by performing analog/digital conversion (hereinafter, “A/D conversion”) on each of the audio analog signals S ₁ to _SN output by the microphone 121. Then, the voice signal processing unit 102 performs voice signal processing, which is a process of emphasizing the uttered voice of the occupant who is the target of voice recognition, on the voice digital signal, so that the voice signals SS ₁ to SS _M are processed. To generate.

In the following, among M passengers, a passenger who is a target for voice recognition is a target person.
Further, it is assumed that each of the integers 1 to M is associated with one seat. It is assumed that an element with a subscript of “1”, for example, the audio signal SS ₁ is associated with the seat identified by “1”. Therefore, it can be said that the audio signal SS ₁ is associated with the passenger in the seat identified by “1”. Note that the symbol i is an arbitrary integer of 1 or more and M or less.

The audio signal processing unit 102 removes, from the components included in each of the N audio digital signals, a component corresponding to a voice different from the voice uttered by the target person (hereinafter referred to as “noise component”). Further, in the voice recognition unit 109 in the latter stage, the M passengers seated in each of the M voice recognition target seats can perform voice recognition independently for each of the M passengers. M audio signals SS ₁ to SS _M are extracted by extracting only the respective voices. The audio signal processing unit 102, the generated audio signals _SS 1 ~ SS _M, giving a speech utterance likelihood calculation unit 103, environmental information determination unit 105 and a voice recognition unit 109.

The noise component includes, for example, a component corresponding to noise generated by the traveling of the vehicle 130, a component corresponding to a voice uttered by a passenger other than the target person, and the like. Various known methods such as a beam forming method, a binary masking method, or a spectral subtraction method can be used to remove the noise component in the audio signal processing unit 102. Therefore, detailed description of the removal of the noise component in the audio signal processing unit 102 will be omitted.

The audio signal processing unit 102 may separate the M audio signals SS ₁ to SS _M from the N audio digital signals by using a blind audio separation technique such as independent component analysis. However, when this blind audio separation technology is used, the number of sound sources corresponding to the number of passengers is required, and therefore, for example, the image utterance likelihood calculation unit 104 uses the image indicated by the image signal V obtained from the camera 122. It is necessary to detect the number of passengers from the above and notify the audio signal processing unit 102 of the number of passengers. The image signal V may be input to the audio signal processing unit 102, and the audio signal processing unit 102 may detect the number of passengers.

Speech utterance likelihood calculation unit 103, in order to perform a voice activity detection as preprocessing for speech recognition, from each of the sound signals SS 1 _~ SS _M, subject is uttered in each of the audio signals SS 1 _~ SS _M The likelihood of speech utterance indicating the probability of being present is calculated. The voice utterance likelihood is also a probability indicating the utterance likelihood based on voice.

Various methods have been proposed in the past as methods for calculating the likelihood of speech utterance. For example, the STFT (Short-Time Fourier Transform) spectrum and the MFCC (Mel-Frequency Cepstrum Coefficients) coefficients at the time of utterance and at the time of non-speech are learned by the GMM (Gaussian Mixture Model) when each is input to the GMM (Gaussian Mixture Model) and are learned. Of the logarithmic likelihood Score of sound is used as the likelihood of speech utterance. Speech utterance likelihood calculation unit 103, from the audio signals _SS 1 ~ SS _M corresponding to the respective M's occupant, and calculates the voice utterance likelihood _AF 1 ~ AF _M corresponding to the respective M's passenger .. The calculated speech utterance likelihoods AF ₁ to AF _M are provided to the utterance section detection unit 108.

The image utterance likelihood calculation unit 104, in the same manner as the voice utterance likelihood calculation unit 103, detects an image utterance likelihood indicating the probability that the subject is uttering in the image signal V, in order to detect the utterance section. It is calculated from the signal V. The image utterance likelihood is also a probability of indicating utterance likelihood based on an image.

The method of calculating the image utterance likelihood is, for example, a method of learning the distribution of the gradient vector of the face parts dictionary and calculating the degree of mouth opening by combining a plurality of learning models as the image utterance likelihood. is there. The image utterance likelihood calculation unit 104 also generates image utterance likelihoods VF ₁ to VF _M corresponding to the M passengers. Then, the image utterance likelihood calculation unit 104 gives the generated image utterance likelihoods VF ₁ to VF _M to the utterance section detection unit 108.

The environment information determination unit 105 receives the voice signals SS ₁ to SS _M received from the voice signal processing unit 102, in which the utterance of the passenger is emphasized, the image signal V received from the camera 122, and the speed received from the vehicle speedometer 123. From the information C, the reliability X ₁ to X _M (hereinafter also referred to as audio reliability) of the audio signals SS ₁ to SS _{M and} the reliability Y ₁ to Y _M (hereinafter also referred to as image reliability) of the image signal V are obtained. calculate.

FIG. 3 is a block diagram schematically showing the configuration of the environment information determination unit 105.
The environment information determination unit 105 includes a passenger presence/absence determination unit 106 and a reliability determination unit 107.

The passenger presence/absence determining unit 106 determines the presence/absence of a person for each seat provided in the vehicle 130 from the image represented by the image signal V, and is the binary signal indicating the presence/absence of a person for each seat. The determination result signals E ₁ to E _M are generated. The passenger presence/absence determination unit 106 supplies the reliability determination unit 107 with the passenger presence/absence determination result signals E ₁ to E _M , which are binary signals indicating the determination result of the presence/absence of a person.

A lot of people detection algorithms have been proposed as means for determining the presence or absence of people, and those existing technologies can be used. The occupant presence/absence determining unit 106 receives, instead of the image signal V, weight information indicating weight, which is a detection value detected by a weight scale (not shown) provided in the seat, and based on the weight information, It may be determined whether there is an occupant in each seat.

The reliability determination unit 107 receives the passenger presence/absence determination result signals E ₁ to E _M from the passenger presence/absence determination unit 106, the speed information C from the vehicle speedometer 123, and the audio signal SS ₁ to from the audio signal processing unit 102. receive SS _M, calculates each reliability _X 1 _~ _X _M, a Y 1 ~ _{Y M} audio signals _SS 1 ~ SS _M and the image signal V.

Here, the reliability X ₁ to X _M is a parameter indicating the reliability of the audio signals SS ₁ to SS _M , and the reliability Y ₁ to Y _M is a parameter indicating the reliability of the image signal V.
The reliability X ₁ to X _M and Y ₁ to Y _M of each signal can be calculated by the following calculation method, for example.

Considering that the reliability X _1,t to X _M, t of the audio signals SS ₁ to SS _{M at} the time t is more likely to contain noise in the sound as the vehicle speed increases, the reliability determination unit 107 The higher the vehicle speed, the lower the reliability X _1,t to X _M,t . For example, if it is assumed that the value Ct of the vehicle speed meter at time t is proportional to a negative exponential function, the reliability X _1,t to X _M,t can be calculated by the following equation (1).

In addition, if there are passengers other than the target passengers, it goes without saying that the utterances that are not the recognition target increase, the reliability of the audio signals SS ₁ to SS _M decreases, and the reliability of the image signal V relatively increases. Conceivable. Therefore, the reliability determination unit 107 lowers the reliability Y _1,t to Y _M,t as the number of passengers in the vehicle 130 increases. For example, the reliability determination unit 107 calculates the reliability Y _1,t to Y _M,t by the expressions (2) and (3) shown below.

However, j is an identification number for identifying each passenger, and here j=1, 2,..., M. Further, δ(i≠j) is a function that becomes 1 only when the passenger i and the passenger j are different.

Returning to FIG. 1, the utterance section detection unit 108 determines from the voice utterance likelihoods AF ₁ to AF _M , the image utterance likelihoods VF ₁ to VF _M , the reliabilities X ₁ to X _M , and the reliabilities Y ₁ to Y _M. For each target person, the time of the section in which the utterance is performed is estimated, and an utterance list, which is section information indicating the time in the section in which the utterance is performed for each target person, is generated. For example, the utterance section detection unit 108 weights the corresponding voice utterance likelihood AF _i higher as the corresponding reliability X _i increases, and the corresponding image utterance likelihood VF _i increases as the corresponding image reliability Y _i increases. Of the corresponding voice utterance likelihood AF _i and the corresponding image utterance likelihood VF _i using the corresponding voice utterance likelihood AF _i and the corresponding image utterance likelihood VF _i to indicate the probability that the subject is uttering the corresponding voice signal SS _i and image signal V. The likelihood is calculated, and a section in which the calculated speech likelihood is equal to or greater than a predetermined threshold is detected as a speech section. Then, the utterance section detection unit 108 gives the generated section information to the voice recognition unit 109.

The estimation of the time of the section in which the utterance is performed is performed as follows.
First, the utterance section detection unit 108 uses the reliability X _i,t , Y _{i,t at} the passenger i and time t according to the softmax function shown in the following expressions (4) and (5) to obtain The weight W _i,t ^A and the weight W _i,t ^V for each signal at the passenger i and the time t are calculated. W _i,t ^A is a voice weight as a weight of the voice signal SS _i , and W _i,t ^V is an image weight as a weight of the image signal V.

Next, the utterance section detection unit 108 calculates a final utterance likelihood S _(i,t) . The utterance likelihood S _(i,t) is the probability that the passenger i is uttering at time t. The utterance likelihood S _(i,t) is obtained from a value obtained by multiplying the voice utterance likelihood AF _i,t and the image utterance likelihood VF _i,t at time _t by a weight as in the following expression (6). And

According to the equation (6), the voice utterance likelihood is multiplied by a value obtained by multiplying the voice utterance likelihood by which the weight is increased as the voice reliability is higher, and the image utterance likelihood is multiplied by the image weight by which the weight is increased as the image reliability is higher. The utterance likelihood is calculated by multiplying by the calculated value.

The utterance section detection unit 108 detects, for the utterance likelihood S _{(i, t)} thus calculated, a section that is equal to or greater than a predetermined threshold as a utterance section, and thus the utterance list for each passenger. U ₁ to U _M can be generated.
FIG. 4 is a schematic diagram showing an example of the utterance list U of one passenger.
The utterance list U# is table information including an utterance section string U#1, a start time string U#2, and an end time string U#3.
The utterance section string U#1 stores utterance section identification information for identifying the detected utterance section.
The start time column U#2 shows the start time of the detected utterance section.
The end time string U#3 shows the end time of the detected utterance section.

The method for the utterance section detection unit 108 to calculate the final utterance likelihood S _{(i, t)} is not limited to the above equation (6). For example, when the utterance likelihood S _(i,t) is obtained from the past state sequence, the value obtained by multiplying the voice utterance likelihood and the image utterance likelihood by weight, and the state transition table σ, the utterance section detection unit 108 can calculate the utterance likelihood S _(i,t) by the following equation (7).

However, the state transition table σ is assumed to be a function that returns a unique state from the past state transition sequence and the current voice utterance likelihood and image utterance likelihood.
Therefore, according to the equation (7), a value obtained by multiplying the voice utterance likelihood by a voice weight with a larger weight as the voice reliability is higher, and an image weight with a higher weight as the image reliability is higher than the image utterance likelihood. The utterance likelihood is calculated from a value obtained by multiplying by and a predetermined function having the utterance likelihood calculated in the past as a variable.

The speech recognition unit 109, for each subject, with respect to the corresponding audio signals _SS 1 ~ SS _M, in a corresponding utterance Listing _U 1 ~ speech intervals indicated by _{U M,} executes speech recognition. The voice recognition is performed, for example, by extracting a feature amount for voice recognition and using the extracted feature amount.

For the voice recognition processing, various known acoustic models such as HMM (Hidden Markov Model) can be used. The voice recognition unit 109 independently executes voice recognition for each passenger, and for each passenger, the voice recognition result of detecting the utterance section and the reliability of the voice recognition result (hereinafter, voice recognition). Output).
The voice recognition score may be a value that considers both the output probability of the acoustic model and the output probability of the language model, or may be the acoustic score of only the output probability of the acoustic model.

Note that the constituent elements of the voice recognition device 100 may be distributed to a server on a network, a mobile terminal such as a smartphone, or an in-vehicle device.

FIG. 5 is a block diagram schematically showing the hardware configuration of the voice recognition device 100 according to the embodiment.
The hardware of the voice recognition device 100 includes a memory 150, a processor 151, a voice interface (hereinafter referred to as voice I/F) 152, an image interface (hereinafter referred to as image I/F) 153, and a vehicle state interface (hereinafter referred to as “vehicle state interface”). , A vehicle state I/F) 154 and a network interface (hereinafter, referred to as network I/F) 155.

The memory 150 stores programs that function as the voice signal processing unit 102, the voice utterance likelihood calculation unit 103, the image utterance likelihood calculation unit 104, the environment information determination unit 105, the utterance section detection unit 108, and the voice recognition unit 109. There is. The memory 150 is, for example, a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Memory, etc.) or an EEPROM (Electrically Only Memory), or an EEPROM (Electrically Only Memory). A storage device using a disk, an optical disk, a magneto-optical disk, or the like.

The processor 151 is a program that functions as the voice signal processing unit 102, the voice utterance likelihood calculation unit 103, the environment information determination unit 105, the image utterance likelihood calculation unit 104, the utterance section detection unit 108, and the voice recognition unit 109 from the memory 150. Read and execute the program. The processor 151 is, for example, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a microprocessor, a microcontroller, a DSP (Digital Signal Processor), or the like.

The voice I/F 152 is a voice input interface for receiving the voice analog signals S ₁ to S _M from the microphone 121 in multiple channels. When a natural language for interactively controlling a car or an air conditioner with an occupant is output from a speaker (not shown) as a voice recognition result, the voice I/F 152 also functions as a voice output interface. .. If the configuration does not require speaker output, the function as an audio output interface is not required.

The image I/F 153 is an image input interface for receiving the image signal V from the camera 122. Further, when receiving the final result of the voice recognition of the voice recognition unit 109 and notifying the occupant of necessary information by text or image display using a display device (not shown) such as a monitor, the image I/ The F153 also functions as an image output interface. If the display device does not require display, the function as an image output interface is unnecessary.

The vehicle status I/F 154 is an input interface for receiving the speed information C measured by the vehicle speedometer 123. The vehicle state I/F 154 can also acquire information on the current state of the vehicle such as the open/closed state of the door, not limited to the vehicle speed.

The network I/F 155 is an interface for communicating when performing voice recognition using a voice recognition service published on the cloud of the Internet instead of using the voice recognition unit 109. Further, the network I/F 155 is also an interface used as a connected car for P2P (Peer to Pear) communication with a nearby car or for communicating with a base station to execute navigation. The network I/F 155 is unnecessary if it has a configuration that does not require communication.

The I/F unit 101 shown in FIG. 1 can be realized by a voice I/F 152, an image I/F 153, a vehicle state I/F 154, or a network I/F 155.

Although the memory 150 is arranged inside the voice recognition device 100 in FIG. 5, it may be configured to connect an external memory such as a USB (Universal Serial Bus) memory to read a program or data. Good. Further, the memory in the device and the external memory may be used together.

FIG. 6 is a flowchart showing a flow of operations of the voice recognition device 100 according to the embodiment.
First, the audio signal processing unit 102 generates an audio digital signal by performing A/D conversion on the audio analog signals S ₁ to _SN from the microphone 121, and outputs an audio digital signal to the audio digital signal. The speech uttered by the target person who obtains the speech is emphasized to generate speech signals SS ₁ to SS _M (S10). For example, if four passengers are seated in the driver's seat, passenger seat, left rear seat, and right rear seat in the car 130, and all of the seats are voice recognition target seats, the voice signal processing unit 102 Emphasizes the sound from each of these four directions. The audio signal processing unit 102 provides the audio signals _SS 1 ~ SS _M to the voice utterance likelihood calculation unit 103, environmental information determination unit 105 and a voice recognition unit 109.

The speech utterance likelihood calculation unit 103 calculates a speech utterance likelihood _AF 1 ~ AF _M from each of the sound signals _{_{SS 1 ~ SS M (S11)}} .
Next, the image utterance likelihood calculation unit 104 calculates the image utterance likelihoods VF ₁ to VF _M from the image indicated by the image signal V (S12).

Next, the environment information determination unit 105 receives, from the audio signal processing unit 102, the audio signals SS ₁ to SS _M in which the utterance of the passenger is emphasized, the image signal V received from the camera 122, and the vehicle speedometer 123. from the speed information C received and calculates the image confidence _Y 1 ~ _{Y M} audio signals _SS 1 ~ SS voice reliability _X 1 ~ of _M _{X M} and the image signal V (S13).

Next, the utterance section detection unit 108 uses the voice utterance likelihoods AF ₁ to AF _M , the image utterance likelihoods VF ₁ to VF _M , the voice reliability X ₁ to X _M, and the image reliability Y ₁ to Y _M to board. The time of the section in which the utterance is being performed is estimated for each person, and the utterance section in which the utterance is being performed is detected for each passenger (S14). Then, the utterance section detection unit 108 provides the speech recognition unit 109 with the utterance lists U ₁ to U _M including the start time and end time of the detected utterance section.

The voice recognition unit 109 extracts the feature amount for voice recognition from the voice signal SS _i corresponding to the target person in the utterance section indicated by the utterance list U _i corresponding to the target person, and the extracted feature amount. Is used to execute voice recognition (S15). Then, the voice recognition unit 109 outputs the voice recognition result.

As described above, according to the present embodiment, it is possible to perform strong signal processing in a more noisy environment by determining the reliability of the signal.

The voice recognition device 100 described above can be applied to a navigation system, an integrated cockpit system including a meter display for a driver, a PC, a tablet PC, or a mobile information terminal such as a smartphone.

100 voice recognition device, 101 I/F unit, 102 voice signal processing unit, 103 voice utterance likelihood calculation unit, 104 image utterance likelihood calculation unit, 105 environment information determination unit, 106 passenger presence/absence determination unit, 107 reliability determination Section, 108 utterance section detection section, 109 voice recognition section, 120 voice recognition system, 121 microphone, 122 camera, 123 speedometer, 130 vehicles.

Claims

From a voice signal including the voice of the target person, a voice utterance likelihood calculation unit that calculates a voice utterance likelihood indicating the probability that the target person utters in the voice signal,
From an image signal showing an image including the target person, an image utterance likelihood calculation unit that calculates an image utterance likelihood indicating the probability that the target person utters in the image signal,
An audio information reliability indicating the reliability of the audio signal, and an environment information determination unit that determines the image reliability indicating the reliability of the image signal,
The higher the voice reliability is, the heavier the weight of the voice utterance likelihood is, the higher the image reliability is, the heavier the weight of the image utterance likelihood is, the voice utterance likelihood and the image utterance likelihood are used. Then, the utterance likelihood indicating the probability that the target person is uttering in the voice signal and the image signal is calculated, and the section in which the calculated utterance likelihood is equal to or more than a predetermined threshold is uttered. An utterance section detection unit that detects a section,
An information processing device, comprising: a voice recognition unit that performs voice recognition on the voice signal in the utterance section.
A voice signal processing unit that generates a voice digital signal by performing analog/digital conversion on a voice analog signal input from a microphone, and generates the voice signal by removing a noise component from the voice digital signal. The information processing apparatus according to claim 1, further comprising:
The information processing device is mounted on a vehicle,
The environment information determination unit lowers the audio reliability as the speed of the vehicle increases, and decreases the image reliability as the number of passengers on the vehicle increases. The information processing apparatus according to claim 1.
The information processing apparatus according to claim 3, wherein the environment information determination unit detects the number of the passengers from the image represented by the image signal.
The information processing apparatus according to claim 3, wherein the environment information determination unit detects the number of the passengers based on a detection value of a weight scale installed in a seat of the vehicle.
The utterance section detection unit calculates a value obtained by multiplying the voice utterance likelihood by a voice weight with a higher weight as the voice reliability is higher, and an image weight with a higher weight as the image reliability is higher. The information processing apparatus according to any one of claims 1 to 5, wherein the utterance likelihood is calculated by multiplying the utterance likelihood with a value obtained by multiplying the degree.
The utterance section detection unit is a value obtained by multiplying the voice utterance likelihood by a voice weight with a higher weight as the voice reliability is higher, and an image weight with a higher weight as the image reliability is higher. The utterance likelihood is calculated from a value obtained by multiplying the utterance likelihood and a previously-determined function having the utterance likelihood calculated in the past as a variable. The information processing device described.
Computer,
From a voice signal containing the voice of the target person, a voice utterance likelihood calculation unit that calculates a voice utterance likelihood indicating a probability that the target person utters in the voice signal.
From an image signal showing an image including the target person, an image utterance likelihood calculation unit that calculates an image utterance likelihood indicating a probability that the target person utters in the image signal,
An audio information reliability indicating the reliability of the audio signal, and an environment information determination unit determining the image reliability indicating the reliability of the image signal,
The higher the voice reliability is, the heavier the weight of the voice utterance likelihood is, the higher the image reliability is, the heavier the weight of the image utterance likelihood is, the voice utterance likelihood and the image utterance likelihood are used. Then, the utterance likelihood indicating the probability that the target person is uttering in the voice signal and the image signal is calculated, and the section in which the calculated utterance likelihood is equal to or more than a predetermined threshold is uttered. An utterance section detection unit that detects a section, and
A program causing a voice recognition unit to perform voice recognition on the voice signal in the utterance section.
From a voice signal including the voice of the target person, calculating a voice utterance likelihood indicating the probability that the target person is uttering in the voice signal,
From an image signal showing an image including the target person, calculating an image utterance likelihood indicating a probability that the target person utters in the image signal,
Audio reliability indicating the reliability of the audio signal, and determining image reliability indicating the reliability of the image signal,
The higher the voice reliability is, the heavier the weight of the voice utterance likelihood is, the higher the image reliability is, the heavier the weight of the image utterance likelihood is, the voice utterance likelihood and the image utterance likelihood are used. Then, the utterance likelihood indicating the probability that the target person is uttering in the voice signal and the image signal is calculated, and the section in which the calculated utterance likelihood is equal to or more than a predetermined threshold is uttered. Detected as an interval,
An information processing method comprising performing voice recognition on the voice signal in the utterance section.