WO2020144857A1 - Information processing device, program, and information processing method - Google Patents

Information processing device, program, and information processing method Download PDF

Info

Publication number
WO2020144857A1
WO2020144857A1 (PCT/JP2019/000722)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
image
utterance
reliability
signal
Prior art date
Application number
PCT/JP2019/000722
Other languages
French (fr)
Japanese (ja)
Inventor
政人 土屋
利行 花澤
Original Assignee
三菱電機株式会社 (Mitsubishi Electric Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corporation (三菱電機株式会社)
Priority to PCT/JP2019/000722 priority Critical patent/WO2020144857A1/en
Priority to JP2020564014A priority patent/JP6833147B2/en
Publication of WO2020144857A1 publication Critical patent/WO2020144857A1/en

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/04 — Segmentation; Word boundary detection

Definitions

  • the present invention relates to an information processing device, a program, and an information processing method.
  • A method that takes multiple signals as input and outputs some recognition result is called multimodal.
  • In general, multimodal processing offers higher system performance and tends to be more robust against signal noise than unimodal processing, which uses only one signal.
  • An information processing apparatus according to one aspect includes: a voice utterance likelihood calculation unit that calculates, from a voice signal including the voice of a target person, a voice utterance likelihood indicating the probability that the target person is speaking in the voice signal; an image utterance likelihood calculation unit that calculates, from an image signal representing an image including the target person, an image utterance likelihood indicating the probability that the target person is speaking in the image signal; an environment information determination unit that determines a voice reliability indicating the reliability of the voice signal and an image reliability indicating the reliability of the image signal; an utterance section detection unit that weights the voice utterance likelihood more heavily the higher the voice reliability and the image utterance likelihood more heavily the higher the image reliability, uses the two likelihoods to calculate an utterance likelihood indicating the probability that the target person is speaking in the voice signal and the image signal, and detects sections in which the calculated utterance likelihood is equal to or greater than a predetermined threshold as utterance sections; and a voice recognition unit that executes voice recognition on the voice signal within the utterance sections.
  • A program according to one aspect causes a computer to function as: a voice utterance likelihood calculation unit that calculates, from a voice signal including the voice of a target person, a voice utterance likelihood indicating the probability that the target person is speaking in the voice signal; an image utterance likelihood calculation unit that calculates, from an image signal representing an image including the target person, an image utterance likelihood indicating the probability that the target person is speaking in the image signal; an environment information determination unit that determines a voice reliability indicating the reliability of the voice signal and an image reliability indicating the reliability of the image signal; an utterance section detection unit that weights the voice utterance likelihood more heavily the higher the voice reliability and the image utterance likelihood more heavily the higher the image reliability, uses the two likelihoods to calculate an utterance likelihood indicating the probability that the target person is speaking in the voice signal and the image signal, and detects sections in which the calculated utterance likelihood is equal to or greater than a predetermined threshold as utterance sections; and a voice recognition unit that executes voice recognition on the voice signal within the utterance sections.
  • An information processing method according to one aspect calculates, from a voice signal including the voice of a target person, a voice utterance likelihood indicating the probability that the target person is speaking in the voice signal; calculates, from an image signal representing an image including the target person, an image utterance likelihood indicating the probability that the target person is speaking in the image signal; determines a voice reliability indicating the reliability of the voice signal and an image reliability indicating the reliability of the image signal; weights the voice utterance likelihood more heavily the higher the voice reliability and the image utterance likelihood more heavily the higher the image reliability; uses the two likelihoods to calculate an utterance likelihood indicating the probability that the target person is speaking in the voice signal and the image signal; detects sections in which the calculated utterance likelihood is equal to or greater than a predetermined threshold as utterance sections; and executes voice recognition on the voice signal within the utterance sections.
  • FIG. 1 is a block diagram schematically showing the configuration of the voice recognition device according to the embodiment. FIG. 2 is a schematic diagram of a vehicle-mounted voice recognition system including the voice recognition device according to the embodiment. FIG. 3 is a block diagram schematically showing the configuration of the environment information determination unit. FIG. 4 is a schematic diagram showing an example of the utterance list of one passenger. FIG. 5 is a block diagram schematically showing the hardware configuration of the voice recognition device according to the embodiment. FIG. 6 is a flowchart showing the flow of operations of the voice recognition device according to the embodiment.
  • FIG. 1 is a block diagram schematically showing a configuration of a voice recognition device 100 which is an information processing device according to an embodiment.
  • the voice recognition device 100 includes an interface unit (hereinafter, I/F unit) 101, a voice signal processing unit 102, a voice utterance likelihood calculation unit 103, an image utterance likelihood calculation unit 104, an environment information determination unit 105, an utterance section detection unit 108, and a voice recognition unit 109.
  • the voice recognition device 100 is included in a vehicle-mounted voice recognition system 120, as shown in FIG. 2, for example.
  • the voice recognition system 120 includes the voice recognition device 100, N microphones 121 1 , 121 2 ,..., 121 N as sound collection devices, a camera 122 as an imaging device, and a vehicle speed meter 123.
  • the voice recognition system 120 is an in-vehicle voice recognition system for a cabin environment equipped with a camera 122 for monitoring the passengers.
  • N is an integer of 1 or more. In the present embodiment, N is equal to or larger than the number of seats M (M is an integer of 1 or more) provided in the vehicle 130 in which the voice recognition system 120 is installed. In the example of FIG. 2, N ≥ M and M = 4.
  • the microphones 121 1 , 121 2 ,..., 121 N are referred to as microphones 121 when there is no particular need to distinguish between them.
  • the microphone 121 generates a voice analog signal which is an analog signal indicating the voice inside the vehicle 130.
  • in the present embodiment, each microphone 121 is an omnidirectional microphone, and an array microphone is configured by arranging the N microphones 121 1 , 121 2 ,..., 121 N at regular intervals. The N microphones 121 1 , 121 2 ,..., 121 N acquire N voice analog signals S 1 , S 2 ,..., S N from the voices of the M passengers of the vehicle 130.
  • in other words, the voice analog signals S 1 , S 2 ,..., S N correspond one-to-one to the microphones 121 1 , 121 2 ,..., 121 N .
  • the configuration of the microphone 121 is not limited to such an example.
  • the microphone 121 may have any configuration as long as it can generate a voice signal indicating the voice of the passenger of the vehicle 130.
  • for example, each microphone may instead be a directional microphone, with the N microphones 121 1 , 121 2 ,..., 121 N arranged in front of the seats of the vehicle 130.
  • the microphone 121 may be installed in any place as long as it can acquire the sounds of all the passengers seated in the seat.
  • the camera 122 generates an image signal V showing an image inside the vehicle 130 in order to monitor an occupant.
  • the camera 122 is installed in an orientation having an angle of view such that the face of the passenger in the vehicle 130 is captured.
  • the camera 122 may be a visible light camera or an infrared camera. When an infrared camera is used as the camera 122, it may be an active type that irradiates the passengers with infrared light from a light emitting diode (not shown) installed nearby and observes the reflected light. Note that a plurality of cameras 122 may be installed in the vehicle 130 in order to capture the faces of all passengers.
  • the vehicle speed meter 123 is a measuring device that measures the traveling speed of the vehicle 130, and generates speed information C indicating the traveling speed of the vehicle 130.
  • for example, the vehicle speedometer 123 can acquire the vehicle speed from the system that controls the operation of the vehicle 130 through a communication line called a CAN bus, to which in-vehicle modules such as a door meter are connected.
  • the I/F unit 101 receives the voice analog signals S 1 to S N from the microphones 121, the image signal V from the camera 122, and the speed information C from the vehicle speedometer 123. The I/F unit 101 then gives the voice analog signals S 1 to S N to the voice signal processing unit 102, gives the image signal V to the image utterance likelihood calculation unit 104 and the environment information determination unit 105, and gives the speed information C to the environment information determination unit 105.
  • the voice signal processing unit 102 generates voice digital signals by performing analog/digital conversion (hereinafter, A/D conversion) on each of the voice analog signals S 1 to S N output by the microphones 121. The voice signal processing unit 102 then performs voice signal processing on the voice digital signals, a process that emphasizes the speech uttered by the passengers targeted for voice recognition, to generate the voice signals SS 1 to SS M .
  • hereinafter, among the M passengers, a passenger targeted for voice recognition is referred to as the target person. Each of the integers 1 to M is associated with one seat: an element with the subscript "1", for example the voice signal SS 1 , is associated with the seat identified by "1". The voice signal SS 1 can therefore be said to be associated with the passenger in the seat identified by "1". The symbol i denotes an arbitrary integer from 1 to M.
  • the voice signal processing unit 102 removes, from the components included in each of the N voice digital signals, components corresponding to sounds other than the voice uttered by the target person (hereinafter, "noise components"). It also generates M voice signals SS 1 to SS M by extracting only the voice of each of the M passengers seated in the M voice recognition target seats, so that the voice recognition unit 109 in the later stage can execute voice recognition independently for each of the M passengers. The voice signal processing unit 102 gives the generated voice signals SS 1 to SS M to the voice utterance likelihood calculation unit 103, the environment information determination unit 105, and the voice recognition unit 109.
  • the noise components include, for example, components corresponding to noise generated by the traveling of the vehicle 130 and components corresponding to speech uttered by passengers other than the target person. Various known methods, such as beamforming, binary masking, or spectral subtraction, can be used by the voice signal processing unit 102 to remove the noise components, so a detailed description of the noise removal is omitted. An illustrative sketch of one of these methods follows below.
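The patent names beamforming, binary masking, and spectral subtraction without detailing any of them. As an illustration only, here is a minimal spectral-subtraction sketch in Python; the frame length, hop size, spectral floor, and overlap-add resynthesis are our choices, not the patent's:

```python
import numpy as np

def spectral_subtraction(noisy, noise_only, frame=512, hop=256, floor=0.01):
    """Minimal spectral subtraction: estimate an average noise magnitude
    spectrum from a noise-only segment, subtract it from each frame of the
    noisy signal, and resynthesize by overlap-add."""
    window = np.hanning(frame)
    noise_frames = [noise_only[i:i + frame] * window
                    for i in range(0, len(noise_only) - frame, hop)]
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)

    out = np.zeros(len(noisy))
    for i in range(0, len(noisy) - frame, hop):
        spec = np.fft.rfft(noisy[i:i + frame] * window)
        mag = np.abs(spec) - noise_mag                 # subtract the noise estimate
        mag = np.maximum(mag, floor * np.abs(spec))    # spectral floor limits musical noise
        out[i:i + frame] += np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame)
    return out
```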
  • alternatively, the voice signal processing unit 102 may separate the M voice signals SS 1 to SS M from the N voice digital signals by using a blind source separation technique such as independent component analysis. When such a technique is used, however, the number of sound sources corresponding to the number of passengers is required; for example, the image utterance likelihood calculation unit 104 must detect the number of passengers from the image indicated by the image signal V obtained from the camera 122 and notify the voice signal processing unit 102 of that number. The image signal V may instead be input to the voice signal processing unit 102 so that the voice signal processing unit 102 itself detects the number of passengers.
  • to perform utterance section detection as preprocessing for voice recognition, the voice utterance likelihood calculation unit 103 calculates, from each of the voice signals SS 1 to SS M , a voice utterance likelihood indicating the probability that the target person is speaking in that voice signal. The voice utterance likelihood can be viewed as a probability expressing how speech-like the audio is. From the voice signals SS 1 to SS M corresponding to the M passengers, the voice utterance likelihood calculation unit 103 calculates the voice utterance likelihoods AF 1 to AF M corresponding to the M passengers, and gives them to the utterance section detection unit 108.
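The description elsewhere mentions learning speech and non-speech GMMs on STFT/MFCC features and using the acoustic log-likelihood as the voice utterance likelihood. A hedged sketch of that idea follows; the use of scikit-learn, the synthetic stand-in features, and the sigmoid squashing are our assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-ins for MFCC frames of speech and non-speech training data
# (in practice these would come from labelled recordings).
mfcc_speech = rng.normal(1.0, 1.0, size=(500, 13))
mfcc_nonspeech = rng.normal(-1.0, 1.0, size=(500, 13))

gmm_speech = GaussianMixture(n_components=8, random_state=0).fit(mfcc_speech)
gmm_nonspeech = GaussianMixture(n_components=8, random_state=0).fit(mfcc_nonspeech)

def voice_utterance_likelihood(mfcc_frames):
    """Per-frame log-likelihood ratio, squashed to (0, 1) as a stand-in for AF."""
    score = (gmm_speech.score_samples(mfcc_frames)
             - gmm_nonspeech.score_samples(mfcc_frames))
    return 1.0 / (1.0 + np.exp(-score))
```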
  • like the voice utterance likelihood calculation unit 103, the image utterance likelihood calculation unit 104 supports utterance section detection by calculating, from the image signal V, an image utterance likelihood indicating the probability that the target person is speaking in the image signal V. The image utterance likelihood can be viewed as a probability expressing how speech-like the image is. One method of calculating the image utterance likelihood is, for example, to learn the distribution of gradient vectors from a face parts dictionary and to use as the image utterance likelihood a mouth opening degree computed by combining a plurality of learned models. The image utterance likelihood calculation unit 104 generates the image utterance likelihoods VF 1 to VF M corresponding to the M passengers and gives them to the utterance section detection unit 108.
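The text computes a mouth opening degree from a learned face parts dictionary. As a simplified stand-in, the sketch below derives an opening degree from mouth landmarks supplied by some upstream face-landmark detector; the landmark keys and the 0.5 "wide open" ratio are assumptions of ours:

```python
import numpy as np

def mouth_opening_degree(landmarks):
    """Crude image utterance likelihood from facial landmarks.

    landmarks: dict of (x, y) points; the keys are our own naming,
    assuming an upstream face-landmark detector provides them.
    """
    top, bottom = np.array(landmarks["lip_top"]), np.array(landmarks["lip_bottom"])
    left, right = np.array(landmarks["lip_left"]), np.array(landmarks["lip_right"])
    # Vertical opening normalized by mouth width (aspect ratio).
    aspect = np.linalg.norm(top - bottom) / (np.linalg.norm(left - right) + 1e-9)
    return float(np.clip(aspect / 0.5, 0.0, 1.0))  # 0.5: assumed "wide open" ratio
```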
  • the environment information determination unit 105 calculates the reliabilities X 1 to X M of the voice signals SS 1 to SS M (hereinafter also called the voice reliabilities) and the reliabilities Y 1 to Y M of the image signal V (hereinafter also called the image reliabilities) from the voice signals SS 1 to SS M received from the voice signal processing unit 102, in which the passengers' speech has been emphasized, the image signal V received from the camera 122, and the speed information C received from the vehicle speedometer 123.
  • FIG. 3 is a block diagram schematically showing the configuration of the environment information determination unit 105.
  • the environment information determination unit 105 includes a passenger presence/absence determination unit 106 and a reliability determination unit 107.
  • the passenger presence/absence determination unit 106 determines, from the image represented by the image signal V, whether a person is present in each seat of the vehicle 130, and generates the passenger presence/absence determination result signals E 1 to E M , binary signals indicating the presence or absence of a person in each seat. The passenger presence/absence determination unit 106 gives these signals to the reliability determination unit 107. Many person detection algorithms have been proposed in the past, and any of those existing techniques can be used for this determination.
  • instead of the image signal V, the passenger presence/absence determination unit 106 may receive weight information indicating the weight detected by a scale (not shown) provided in each seat, and determine from that weight information whether a passenger is present in each seat.
  • the reliability determination unit 107 receives the passenger presence/absence determination result signals E 1 to E M from the passenger presence/absence determination unit 106, the speed information C from the vehicle speedometer 123, and the voice signals SS 1 to SS M from the voice signal processing unit 102, and calculates the reliabilities X 1 to X M of the voice signals SS 1 to SS M and the reliabilities Y 1 to Y M of the image signal V. Here, the reliabilities X 1 to X M are parameters indicating the reliability of the voice signals SS 1 to SS M , and the reliabilities Y 1 to Y M are parameters indicating the reliability of the image signal V. They can be calculated, for example, as follows.
  • considering that noise enters the audio more easily as the vehicle speeds up, the reliability determination unit 107 lowers the reliabilities X 1,t to X M,t at time t as the vehicle speed increases. For example, assuming that the reliability is proportional to a negative exponential of the speedometer value C t at time t, the reliabilities X 1,t to X M,t can be calculated by equation (1).
  • also, if there are passengers other than the target person, utterances that are not recognition targets naturally increase, lowering the reliability of the voice signals SS 1 to SS M and relatively raising the reliability of the image signal V. The reliability determination unit 107 lowers the reliabilities Y 1,t to Y M,t as the number of passengers in the vehicle 130 increases, calculating them, for example, by equations (2) and (3). Here, j is an identification number identifying each passenger (j = 1, 2, ..., M), and δ(i ≠ j) is a function that equals 1 only when passenger i and passenger j differ.
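Equations (1) to (3) are not reproduced in this text. One reconstruction consistent with the surrounding description (the decay constants α and β are our assumptions) is:

```latex
X_{i,t} = e^{-\alpha C_t} \tag{1}
Y_{i,t} = e^{-\beta\, n_{i,t}} \tag{2}
n_{i,t} = \sum_{j=1}^{M} \delta(i \neq j)\, E_{j,t} \tag{3}
```

where C_t is the speedometer value at time t and E_{j,t} is the presence/absence determination result for passenger j.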
  • from the voice utterance likelihoods AF 1 to AF M , the image utterance likelihoods VF 1 to VF M , and the reliabilities X 1 to X M and Y 1 to Y M , the utterance section detection unit 108 estimates, for each target person, the times of the sections in which the person is speaking, and generates for each target person an utterance list, section information indicating those times. For example, the utterance section detection unit 108 weights the corresponding voice utterance likelihood AF i more heavily the higher the corresponding voice reliability X i , and the corresponding image utterance likelihood VF i more heavily the higher the corresponding image reliability Y i ; using these likelihoods, it calculates an utterance likelihood indicating the probability that the target person is speaking in the corresponding voice signal SS i and the image signal V, and detects sections in which the calculated utterance likelihood is equal to or greater than a predetermined threshold as utterance sections. The utterance section detection unit 108 then gives the generated section information to the voice recognition unit 109.
  • the estimation of the times of the utterance sections is performed as follows. The utterance section detection unit 108 applies the reliabilities X i,t and Y i,t for passenger i at time t to the softmax function shown in equations (4) and (5) to calculate the per-signal weights W i,t A and W i,t V for passenger i at time t. Here, W i,t A is the voice weight applied to the voice signal SS i , and W i,t V is the image weight applied to the image signal V.
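Equations (4) and (5) are likewise not reproduced here; a standard softmax over the two reliabilities, which matches the description, would be:

```latex
W^{A}_{i,t} = \frac{e^{X_{i,t}}}{e^{X_{i,t}} + e^{Y_{i,t}}} \tag{4}
\qquad
W^{V}_{i,t} = \frac{e^{Y_{i,t}}}{e^{X_{i,t}} + e^{Y_{i,t}}} \tag{5}
```

This is a plausible reading rather than the patent's literal formula.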
  • next, the utterance section detection unit 108 calculates the final utterance likelihood S (i,t) , the probability that passenger i is speaking at time t. As in equation (6), S (i,t) is obtained from the voice utterance likelihood AF i,t and the image utterance likelihood VF i,t at time t, each multiplied by its weight: the voice utterance likelihood is multiplied by the voice weight, which grows as the voice reliability rises, and the image utterance likelihood is multiplied by the image weight, which grows as the image reliability rises.
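As a concrete illustration of this fusion step, here is a short Python sketch. The softmax weights follow the reading of equations (4) and (5) above, and the additive combination is our reading of equation (6), which the source shows only as an image:

```python
import numpy as np

def utterance_likelihood(af, vf, x_rel, y_rel):
    """Fused utterance likelihood S_(i,t) for one passenger over T frames.

    af, vf:       voice / image utterance likelihoods, arrays of shape (T,)
    x_rel, y_rel: voice / image reliabilities, arrays of shape (T,)
    """
    w_a = np.exp(x_rel) / (np.exp(x_rel) + np.exp(y_rel))  # voice weight, eq. (4)
    w_v = 1.0 - w_a                                        # image weight, eq. (5)
    return w_a * af + w_v * vf                             # our reading of eq. (6)

# Example: one passenger, three frames.
s = utterance_likelihood(np.array([0.9, 0.8, 0.2]),
                         np.array([0.7, 0.6, 0.1]),
                         np.array([0.5, 0.5, 0.5]),
                         np.array([0.3, 0.3, 0.3]))
is_utterance = s >= 0.5  # frames at or above the threshold form utterance sections
```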
  • for the utterance likelihood S (i,t) calculated in this way, the utterance section detection unit 108 detects sections at or above a predetermined threshold as utterance sections, and can thereby generate the utterance lists U 1 to U M for the individual passengers.
  • FIG. 4 is a schematic diagram showing an example of the utterance list U of one passenger.
  • the utterance list U# is table information including an utterance section column U#1, a start time column U#2, and an end time column U#3. The utterance section column U#1 stores identification information for the detected utterance sections, the start time column U#2 shows the start time of each detected utterance section, and the end time column U#3 shows its end time.
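A minimal sketch of turning the thresholded likelihood sequence into an utterance list with the columns of FIG. 4; the frame length and the output format are our choices:

```python
def detect_utterance_sections(s, threshold, frame_ms=10):
    """Build an utterance list (section id, start time, end time) from the
    utterance likelihood sequence s and a threshold; times in milliseconds."""
    sections, start = [], None
    for t, value in enumerate(s):
        if value >= threshold and start is None:
            start = t                                # utterance begins
        elif value < threshold and start is not None:
            sections.append({"id": len(sections) + 1,
                             "start_ms": start * frame_ms,
                             "end_ms": t * frame_ms})
            start = None                             # utterance ends
    if start is not None:                            # still speaking at signal end
        sections.append({"id": len(sections) + 1,
                         "start_ms": start * frame_ms,
                         "end_ms": len(s) * frame_ms})
    return sections
```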
  • the method by which the utterance section detection unit 108 calculates the final utterance likelihood S (i,t) is not limited to equation (6) above. For example, the utterance section detection unit 108 can calculate the utterance likelihood S (i,t) by equation (7). Here, the state transition table η is assumed to be a function that returns a unique state from the past state-transition sequence and the current voice and image utterance likelihoods. In this case, the utterance likelihood is calculated from the voice utterance likelihood multiplied by the voice weight, which grows with the voice reliability, the image utterance likelihood multiplied by the image weight, which grows with the image reliability, and a predetermined function taking the utterance likelihoods calculated in the past as variables.
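Equation (7) is also not reproduced. One plausible form consistent with the text, where g is the predetermined function of past utterance likelihoods (the exponential smoothing and the constant γ are our assumptions), is:

```latex
S_{(i,t)} = W^{A}_{i,t}\,AF_{i,t} + W^{V}_{i,t}\,VF_{i,t}
          + g\bigl(S_{(i,t-1)}, S_{(i,t-2)}, \dots\bigr),
\qquad g(\cdot) = \gamma\, S_{(i,t-1)},\ \ 0 < \gamma < 1 \tag{7}
```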
  • for each target person, the voice recognition unit 109 executes voice recognition on the corresponding voice signal SS 1 to SS M within the utterance sections indicated by the corresponding utterance list U 1 to U M . Voice recognition is performed, for example, by extracting feature amounts for voice recognition and using the extracted features. The voice recognition unit 109 executes voice recognition independently for each passenger and outputs, for each passenger, the voice recognition result for the detected utterance sections and the reliability of that result (hereinafter, the voice recognition score). The voice recognition score may be a value that considers both the output probability of the acoustic model and the output probability of the language model, or may be an acoustic score based only on the output probability of the acoustic model.
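A conventional way to combine the two output probabilities into a single score, shown here only as an illustration (the language-model weight λ is our assumption, not the patent's):

```latex
\mathrm{Score}(W) = \log P_{\mathrm{AM}}(O \mid W) + \lambda\, \log P_{\mathrm{LM}}(W)
```

where O is the observed feature sequence and W a candidate word sequence; setting λ = 0 gives the acoustic-only score mentioned above.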
  • the constituent elements of the voice recognition device 100 may be distributed to a server on a network, a mobile terminal such as a smartphone, or an in-vehicle device.
  • FIG. 5 is a block diagram schematically showing the hardware configuration of the voice recognition device 100 according to the embodiment.
  • the hardware of the voice recognition device 100 includes a memory 150, a processor 151, a voice interface (hereinafter, voice I/F) 152, an image interface (hereinafter, image I/F) 153, a vehicle state interface (hereinafter, vehicle state I/F) 154, and a network interface (hereinafter, network I/F) 155.
  • the memory 150 stores the programs that function as the voice signal processing unit 102, the voice utterance likelihood calculation unit 103, the image utterance likelihood calculation unit 104, the environment information determination unit 105, the utterance section detection unit 108, and the voice recognition unit 109.
  • the memory 150 is, for example, a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Read-Only Memory), or an EEPROM (Electrically Erasable Programmable Read-Only Memory).
  • the processor 151 reads from the memory 150 and executes the programs that function as the voice signal processing unit 102, the voice utterance likelihood calculation unit 103, the image utterance likelihood calculation unit 104, the environment information determination unit 105, the utterance section detection unit 108, and the voice recognition unit 109.
  • the processor 151 is, for example, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a microprocessor, a microcontroller, a DSP (Digital Signal Processor), or the like.
  • the voice I/F 152 is a multichannel voice input interface for receiving the voice analog signals S 1 to S N from the microphones 121. When the voice recognition result is output as sound through a speaker (not shown), the voice I/F 152 also functions as a voice output interface; in a configuration that does not require speaker output, the voice output function is unnecessary.
  • the image I/F 153 is an image input interface for receiving the image signal V from the camera 122. Further, when the final voice recognition result of the voice recognition unit 109 is received and necessary information is presented to the occupants as text or images on a display device (not shown) such as a monitor, the image I/F 153 also functions as an image output interface. In a configuration that does not require such display, the image output function is unnecessary.
  • the vehicle status I/F 154 is an input interface for receiving the speed information C measured by the vehicle speedometer 123.
  • the vehicle state I/F 154 is not limited to the vehicle speed; it can also acquire information on the current state of the vehicle, such as the open/closed state of the doors.
  • the network I/F 155 is an interface for communicating when voice recognition is performed using a voice recognition service published on a cloud on the Internet instead of the voice recognition unit 109. The network I/F 155 is also used, in a connected car, for P2P (peer-to-peer) communication with nearby cars or for communicating with a base station to execute navigation. In a configuration that does not require communication, the network I/F 155 is unnecessary.
  • the I/F unit 101 shown in FIG. 1 can be realized by a voice I/F 152, an image I/F 153, a vehicle state I/F 154, or a network I/F 155.
  • although the memory 150 is arranged inside the voice recognition device 100 in FIG. 5, an external memory such as a USB (Universal Serial Bus) memory may be connected so that programs or data are read from it. The memory inside the device and the external memory may also be used together.
  • FIG. 6 is a flowchart showing a flow of operations of the voice recognition device 100 according to the embodiment.
  • first, the voice signal processing unit 102 performs A/D conversion on the voice analog signals S 1 to S N from the microphones 121 to generate voice digital signals, and emphasizes the speech uttered by the target persons in those digital signals to generate the voice signals SS 1 to SS M (S10). For example, with the four seats shown in FIG. 2, the voice signal processing unit 102 emphasizes the sound arriving from each of the four seat directions.
  • the voice signal processing unit 102 gives the voice signals SS 1 to SS M to the voice utterance likelihood calculation unit 103, the environment information determination unit 105, and the voice recognition unit 109.
  • next, the voice utterance likelihood calculation unit 103 calculates the voice utterance likelihoods AF 1 to AF M from the voice signals SS 1 to SS M (S11).
  • the image utterance likelihood calculation unit 104 calculates the image utterance likelihoods VF 1 to VF M from the image indicated by the image signal V (S12).
  • the environment information determination unit 105 calculates the voice reliabilities X 1 to X M of the voice signals SS 1 to SS M and the image reliabilities Y 1 to Y M of the image signal V from the voice signals SS 1 to SS M received from the voice signal processing unit 102, in which the passengers' speech has been emphasized, the image signal V received from the camera 122, and the speed information C received from the vehicle speedometer 123 (S13).
  • using the voice utterance likelihoods AF 1 to AF M , the image utterance likelihoods VF 1 to VF M , the voice reliabilities X 1 to X M , and the image reliabilities Y 1 to Y M , the utterance section detection unit 108 estimates, for each passenger, the times of the sections in which utterances are being made, and detects the utterance sections for each passenger (S14).
  • the utterance section detection unit 108 provides the speech recognition unit 109 with the utterance lists U 1 to U M including the start time and end time of the detected utterance section.
  • the voice recognition unit 109 extracts voice recognition features from the voice signal SS i corresponding to each target person within the utterance sections indicated by the corresponding utterance list U i , and executes voice recognition using the extracted features (S15). The voice recognition unit 109 then outputs the voice recognition result.
  • the voice recognition device 100 described above can be applied to a navigation system, an integrated cockpit system including a meter display for a driver, a PC, a tablet PC, or a mobile information terminal such as a smartphone.
  • 100 voice recognition device, 101 I/F unit, 102 voice signal processing unit, 103 voice utterance likelihood calculation unit, 104 image utterance likelihood calculation unit, 105 environment information determination unit, 106 passenger presence/absence determination unit, 107 reliability determination unit, 108 utterance section detection unit, 109 voice recognition unit, 120 voice recognition system, 121 microphone, 122 camera, 123 vehicle speedometer, 130 vehicle.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present invention is provided with: an audio speech likelihood calculation unit (103) that calculates audio speech likelihood from an audio signal including the voice of a subject person; a video speech likelihood calculation unit (104) that calculates video speech likelihood from a video signal indicating a video including the subject person; an environmental information determination unit (105) that determines audio reliability indicating the reliability of the audio signal and video reliability indicating the reliability of the video signal; a speech section detection unit (108) that adds a heavier weight to the audio speech likelihood when the audio reliability is higher, adds a heavier weight to the video speech likelihood when the video reliability is higher, calculates, by using the audio speech likelihood and the video speech likelihood, speech likelihood indicating a probability that the subject person is uttering speech in the audio signal and in the video signal, and detects, as a section of speech, a section where the calculated speech likelihood is higher than a predetermined threshold; and an audio recognition unit (109) that executes audio recognition on the audio signal in the section of speech.

Description

Information processing device, program, and information processing method

The present invention relates to an information processing device, a program, and an information processing method.

A method that takes multiple signals as input and outputs some recognition result is called multimodal. In general, compared with unimodal processing, which uses only one signal, multimodal processing offers higher system performance and tends to be more robust against signal noise.

For example, in a system that uses both an acoustic signal and an image signal, a robust recognition result can be obtained when acoustic noise is strong by relying more on the image signal for recognition. Such a mechanism is called adaptive noise suppression.

A conventional adaptive noise suppression technique, described for example in Patent Document 1, retrains a model learned on a general-purpose data set, using signals that include the noise of the environment in which the model is used, so that misrecognition is reduced.

Patent Document 1: Japanese Patent Laid-Open No. 2002-169586

With conventional techniques, however, it is difficult to design a flexible system that adjusts the reliability of each signal by combining, for example, existing person detection technology with human prior knowledge such as "if there is no person nearby who could become a source of acoustic noise, it is better not to use the image signal".

Accordingly, one or more aspects of the present invention aim to enable signal processing that is more robust in noisy environments by determining the reliability of each signal.
An information processing device according to one aspect of the present invention includes: a voice utterance likelihood calculation unit that calculates, from a voice signal including the voice of a target person, a voice utterance likelihood indicating the probability that the target person is speaking in the voice signal; an image utterance likelihood calculation unit that calculates, from an image signal representing an image including the target person, an image utterance likelihood indicating the probability that the target person is speaking in the image signal; an environment information determination unit that determines a voice reliability indicating the reliability of the voice signal and an image reliability indicating the reliability of the image signal; an utterance section detection unit that weights the voice utterance likelihood more heavily the higher the voice reliability and the image utterance likelihood more heavily the higher the image reliability, uses the two likelihoods to calculate an utterance likelihood indicating the probability that the target person is speaking in the voice signal and the image signal, and detects sections in which the calculated utterance likelihood is equal to or greater than a predetermined threshold as utterance sections; and a voice recognition unit that executes voice recognition on the voice signal within the utterance sections.

A program according to one aspect of the present invention causes a computer to function as: a voice utterance likelihood calculation unit that calculates, from a voice signal including the voice of a target person, a voice utterance likelihood indicating the probability that the target person is speaking in the voice signal; an image utterance likelihood calculation unit that calculates, from an image signal representing an image including the target person, an image utterance likelihood indicating the probability that the target person is speaking in the image signal; an environment information determination unit that determines a voice reliability indicating the reliability of the voice signal and an image reliability indicating the reliability of the image signal; an utterance section detection unit that weights the voice utterance likelihood more heavily the higher the voice reliability and the image utterance likelihood more heavily the higher the image reliability, uses the two likelihoods to calculate an utterance likelihood indicating the probability that the target person is speaking in the voice signal and the image signal, and detects sections in which the calculated utterance likelihood is equal to or greater than a predetermined threshold as utterance sections; and a voice recognition unit that executes voice recognition on the voice signal within the utterance sections.

An information processing method according to one aspect of the present invention calculates, from a voice signal including the voice of a target person, a voice utterance likelihood indicating the probability that the target person is speaking in the voice signal; calculates, from an image signal representing an image including the target person, an image utterance likelihood indicating the probability that the target person is speaking in the image signal; determines a voice reliability indicating the reliability of the voice signal and an image reliability indicating the reliability of the image signal; weights the voice utterance likelihood more heavily the higher the voice reliability and the image utterance likelihood more heavily the higher the image reliability; uses the two likelihoods to calculate an utterance likelihood indicating the probability that the target person is speaking in the voice signal and the image signal; detects sections in which the calculated utterance likelihood is equal to or greater than a predetermined threshold as utterance sections; and executes voice recognition on the voice signal within the utterance sections.
According to one aspect of the present invention, determining the reliability of each signal makes it possible to perform signal processing that is more robust in noisy environments.

FIG. 1 is a block diagram schematically showing the configuration of the voice recognition device according to the embodiment. FIG. 2 is a schematic diagram of a vehicle-mounted voice recognition system including the voice recognition device according to the embodiment. FIG. 3 is a block diagram schematically showing the configuration of the environment information determination unit. FIG. 4 is a schematic diagram showing an example of the utterance list of one passenger. FIG. 5 is a block diagram schematically showing the hardware configuration of the voice recognition device according to the embodiment. FIG. 6 is a flowchart showing the flow of operations of the voice recognition device according to the embodiment.
FIG. 1 is a block diagram schematically showing the configuration of a voice recognition device 100, which is an information processing device according to an embodiment.

The voice recognition device 100 includes an interface unit (hereinafter, I/F unit) 101, a voice signal processing unit 102, a voice utterance likelihood calculation unit 103, an image utterance likelihood calculation unit 104, an environment information determination unit 105, an utterance section detection unit 108, and a voice recognition unit 109.
The voice recognition device 100 according to the embodiment is included in a vehicle-mounted voice recognition system 120, as shown in FIG. 2, for example.

The voice recognition system 120 includes the voice recognition device 100, N microphones 121 1 , 121 2 ,..., 121 N as sound collection devices, a camera 122 as an imaging device, and a vehicle speedometer 123. In the present embodiment, the voice recognition system 120 is an in-vehicle voice recognition system for a cabin environment equipped with a camera 122 for monitoring the passengers.

Here, N is an integer of 1 or more. In the present embodiment, N is equal to or larger than the number of seats M (M is an integer of 1 or more) provided in the vehicle 130 in which the voice recognition system 120 is installed. In the example of FIG. 2, N ≥ M and M = 4.

The microphones 121 1 , 121 2 ,..., 121 N are simply referred to as microphones 121 when there is no particular need to distinguish between them.
The microphone 121 generates a voice analog signal, an analog signal representing the sound inside the vehicle 130.

In the present embodiment, each microphone 121 is an omnidirectional microphone, and an array microphone is configured by arranging the N microphones 121 1 , 121 2 ,..., 121 N at regular intervals. The N microphones 121 1 , 121 2 ,..., 121 N acquire N voice analog signals S 1 , S 2 ,..., S N from the voices of the M passengers of the vehicle 130. In other words, the voice analog signals S 1 , S 2 ,..., S N correspond one-to-one to the microphones 121 1 , 121 2 ,..., 121 N .

The configuration of the microphones 121 is not limited to this example. The microphones 121 may have any configuration as long as they can generate voice signals representing the voices of the passengers of the vehicle 130. For example, each microphone may be a directional microphone, with the N microphones 121 1 , 121 2 ,..., 121 N arranged in front of the seats of the vehicle 130. The microphones 121 may be installed in any location from which the voices of all passengers seated in the seats can be acquired.
The camera 122 generates an image signal V representing an image of the inside of the vehicle 130 in order to monitor the passengers.

The camera 122 is installed with an orientation and angle of view such that the faces of the passengers in the vehicle 130 are captured. The camera 122 may be a visible light camera or an infrared camera. When an infrared camera is used as the camera 122, it may be an active type that irradiates the passengers with infrared light from a light emitting diode (not shown) installed nearby and observes the reflected light.

Note that a plurality of cameras 122 may be installed in the vehicle 130 in order to capture the faces of all passengers.
The vehicle speedometer 123 is a measuring device that measures the traveling speed of the vehicle 130 and generates speed information C indicating that speed. For example, the vehicle speedometer 123 can acquire the vehicle speed from the system that controls the operation of the vehicle 130 through a communication line called a CAN bus, to which in-vehicle modules such as a door meter are connected.
Returning to FIG. 1, the I/F unit 101 receives the voice analog signals S 1 to S N from the microphones 121, the image signal V from the camera 122, and the speed information C from the vehicle speedometer 123. The I/F unit 101 then gives the voice analog signals S 1 to S N to the voice signal processing unit 102, gives the image signal V to the image utterance likelihood calculation unit 104 and the environment information determination unit 105, and gives the speed information C to the environment information determination unit 105.
The voice signal processing unit 102 generates voice digital signals by performing analog/digital conversion (hereinafter, A/D conversion) on each of the voice analog signals S 1 to S N output by the microphones 121. The voice signal processing unit 102 then performs voice signal processing on the voice digital signals, a process that emphasizes the speech uttered by the passengers targeted for voice recognition, to generate the voice signals SS 1 to SS M .
In the following, among the M passengers, a passenger targeted for voice recognition is referred to as the target person.

Each of the integers 1 to M is associated with one seat: an element with the subscript "1", for example the voice signal SS 1 , is associated with the seat identified by "1". The voice signal SS 1 can therefore be said to be associated with the passenger in the seat identified by "1". The symbol i denotes an arbitrary integer from 1 to M.

The voice signal processing unit 102 removes, from the components included in each of the N voice digital signals, components corresponding to sounds other than the voice uttered by the target person (hereinafter, "noise components"). It also generates M voice signals SS 1 to SS M by extracting only the voice of each of the M passengers seated in the M voice recognition target seats, so that the voice recognition unit 109 in the later stage can execute voice recognition independently for each of the M passengers. The voice signal processing unit 102 then gives the generated voice signals SS 1 to SS M to the voice utterance likelihood calculation unit 103, the environment information determination unit 105, and the voice recognition unit 109.
The noise components include, for example, components corresponding to noise generated by the traveling of the vehicle 130 and components corresponding to speech uttered by passengers other than the target person. Various known methods, such as beamforming, binary masking, or spectral subtraction, can be used by the voice signal processing unit 102 to remove the noise components, so a detailed description of the noise removal is omitted.

The voice signal processing unit 102 may also separate the M voice signals SS 1 to SS M from the N voice digital signals by using a blind source separation technique such as independent component analysis. When such a technique is used, however, the number of sound sources corresponding to the number of passengers is required; for example, the image utterance likelihood calculation unit 104 must detect the number of passengers from the image indicated by the image signal V obtained from the camera 122 and notify the voice signal processing unit 102 of that number. The image signal V may instead be input to the voice signal processing unit 102 so that the voice signal processing unit 102 itself detects the number of passengers.

To perform utterance section detection as preprocessing for voice recognition, the voice utterance likelihood calculation unit 103 calculates, from each of the voice signals SS 1 to SS M , a voice utterance likelihood indicating the probability that the target person is speaking in that voice signal. The voice utterance likelihood can be viewed as a probability expressing how speech-like the audio is.

Various methods for calculating the voice utterance likelihood have been proposed in the past. For example, there is a method of learning the STFT (Short-Time Fourier Transform) spectra and MFCC (Mel-Frequency Cepstrum Coefficients) of utterance and non-utterance periods with separate GMMs (Gaussian Mixture Models), and using the acoustic log-likelihood score obtained by inputting the voice signal to each GMM as the voice utterance likelihood. The voice utterance likelihood calculation unit 103 calculates, from the voice signals SS 1 to SS M corresponding to the M passengers, the voice utterance likelihoods AF 1 to AF M corresponding to the M passengers. The calculated voice utterance likelihoods AF 1 to AF M are given to the utterance section detection unit 108.
Like the voice utterance likelihood calculation unit 103, the image utterance likelihood calculation unit 104 supports utterance section detection by calculating, from the image signal V, an image utterance likelihood indicating the probability that the target person is speaking in the image signal V. The image utterance likelihood can be viewed as a probability expressing how speech-like the image is.

One method of calculating the image utterance likelihood is, for example, to learn the distribution of gradient vectors from a face parts dictionary and to use as the image utterance likelihood a mouth opening degree computed by combining a plurality of learned models. The image utterance likelihood calculation unit 104 generates the image utterance likelihoods VF 1 to VF M corresponding to the M passengers and gives them to the utterance section detection unit 108.

The environment information determination unit 105 calculates the reliabilities X 1 to X M of the voice signals SS 1 to SS M (hereinafter also called the voice reliabilities) and the reliabilities Y 1 to Y M of the image signal V (hereinafter also called the image reliabilities) from the voice signals SS 1 to SS M received from the voice signal processing unit 102, in which the passengers' speech has been emphasized, the image signal V received from the camera 122, and the speed information C received from the vehicle speedometer 123.
 図3は、環境情報判定部105の構成を概略的に示すブロック図である。
 環境情報判定部105は、搭乗者有無判定部106と、信頼性判定部107とを備える。
FIG. 3 is a block diagram schematically showing the configuration of the environment information determination unit 105.
The environment information determination unit 105 includes a passenger presence/absence determination unit 106 and a reliability determination unit 107.
 搭乗者有無判定部106は、画像信号Vで示される画像から、車130に設けられている座席毎に人の有無を判定し、その座席毎に人の有無を示すバイナリ信号である搭乗者有無判定結果信号E~Eを生成する。搭乗者有無判定部106は、人の有無の判定結果を示すバイナリ信号である搭乗者有無判定結果信号E~Eを信頼性判定部107に与える。 The passenger presence/absence determining unit 106 determines the presence/absence of a person for each seat provided in the vehicle 130 from the image represented by the image signal V, and is the binary signal indicating the presence/absence of a person for each seat. The determination result signals E 1 to E M are generated. The passenger presence/absence determination unit 106 supplies the reliability determination unit 107 with the passenger presence/absence determination result signals E 1 to E M , which are binary signals indicating the determination result of the presence/absence of a person.
 人の有無を判定する手段は、人検出アルゴリズムとして過去多数提案されており、それら既存技術を用いることができる。搭乗者有無判定部106は、画像信号Vの代わりに、座席に設けられた体重計(図示せず)で検出された検出値である体重を示す体重情報を受け取り、その体重情報に基づいて、各座席に搭乗者が存在しているかを判断してもよい。 A lot of people detection algorithms have been proposed as means for determining the presence or absence of people, and those existing technologies can be used. The occupant presence/absence determining unit 106 receives, instead of the image signal V, weight information indicating weight, which is a detection value detected by a weight scale (not shown) provided in the seat, and based on the weight information, It may be determined whether there is an occupant in each seat.
The reliability determination unit 107 receives the passenger presence/absence determination result signals E_1 to E_M from the passenger presence/absence determination unit 106, the speed information C from the vehicle speedometer 123, and the voice signals SS_1 to SS_M from the voice signal processing unit 102, and calculates the reliabilities X_1 to X_M of the voice signals SS_1 to SS_M and the reliabilities Y_1 to Y_M of the image signal V.
Here, the reliabilities X_1 to X_M are parameters expressing the reliability of the voice signals SS_1 to SS_M, and the reliabilities Y_1 to Y_M are parameters expressing the reliability of the image signal V.
The reliabilities X_1 to X_M and Y_1 to Y_M of each signal can be calculated, for example, as follows.
For the reliabilities X_{1,t} to X_{M,t} of the voice signals SS_1 to SS_M at time t, the reliability determination unit 107 takes into account that noise contaminates the audio more easily as the vehicle speeds up, and therefore lowers X_{1,t} to X_{M,t} as the vehicle speed increases. For example, if the reliability is assumed to follow a negative exponential of the vehicle-speed reading C_t at time t, X_{1,t} to X_{M,t} can be calculated by equation (1) below, where a is assumed to be a predetermined positive constant (the original equation image is unavailable, so this form is reconstructed from the description):

    X_{i,t} = exp(-a * C_t),  i = 1, ..., M    (1)
Further, if passengers other than the target passenger are present, utterances that are not recognition targets naturally increase, so the reliability of the voice signals SS_1 to SS_M falls and the reliability of the image signal V rises relatively. For this reason, the reliability determination unit 107 lowers Y_{1,t} to Y_{M,t} as the number of passengers in the vehicle 130 increases. For example, the reliability determination unit 107 calculates Y_{1,t} to Y_{M,t} by equations (2) and (3) below, where b is assumed to be a predetermined positive constant (as with equation (1), the original equation images are unavailable and the forms are reconstructed from the description):

    N_{i,t} = Σ_{j=1}^{M} δ(i ≠ j) · E_{j,t}    (2)
    Y_{i,t} = exp(-b * N_{i,t})                  (3)

Here, j is an identification number identifying each passenger (j = 1, 2, ..., M), E_{j,t} is the presence/absence determination result for passenger j at time t, and δ(i ≠ j) is a function that equals 1 only when passenger i and passenger j differ.
Returning to FIG. 1, the utterance section detection unit 108 estimates, for each target person, the times of the sections in which speech occurs, from the voice utterance likelihoods AF_1 to AF_M, the image utterance likelihoods VF_1 to VF_M, the reliabilities X_1 to X_M, and the reliabilities Y_1 to Y_M, and generates an utterance list, which is section information indicating the times of the speech sections for each target person. For example, the utterance section detection unit 108 weights the corresponding voice utterance likelihood AF_i more heavily the higher the corresponding reliability X_i is, and weights the corresponding image utterance likelihood VF_i more heavily the higher the corresponding image reliability Y_i is; using the voice utterance likelihood AF_i and the image utterance likelihood VF_i, it calculates an utterance likelihood indicating the probability that the target person is speaking in the corresponding voice signal SS_i and the image signal V, and detects as an utterance section each section in which the calculated utterance likelihood is at or above a predetermined threshold. The utterance section detection unit 108 then supplies the generated section information to the voice recognition unit 109.
The times of the sections in which speech occurs are estimated as follows.
First, the utterance section detection unit 108 calculates, from the reliabilities X_{i,t} and Y_{i,t} for passenger i at time t, the weights W^A_{i,t} and W^V_{i,t} for each signal for passenger i at time t, according to the softmax function shown in equations (4) and (5) below (a direct reconstruction of the stated softmax over the two reliabilities). W^A_{i,t} is the voice weight, i.e. the weight given to the voice signal SS_i, and W^V_{i,t} is the image weight, i.e. the weight given to the image signal V:

    W^A_{i,t} = exp(X_{i,t}) / (exp(X_{i,t}) + exp(Y_{i,t}))    (4)
    W^V_{i,t} = exp(Y_{i,t}) / (exp(X_{i,t}) + exp(Y_{i,t}))    (5)
Next, the utterance section detection unit 108 calculates the final utterance likelihood S_(i,t). The utterance likelihood S_(i,t) is the probability that passenger i is speaking at time t. S_(i,t) is obtained from the voice utterance likelihood AF_{i,t} and the image utterance likelihood VF_{i,t} at time t, each multiplied by its weight, as in equation (6) below (the original equation image is unavailable; the form is reconstructed from the description):

    S_(i,t) = (W^A_{i,t} · AF_{i,t}) × (W^V_{i,t} · VF_{i,t})    (6)

According to equation (6), the utterance likelihood is calculated by multiplying two values: the voice utterance likelihood multiplied by the voice weight, which grows as the voice reliability increases, and the image utterance likelihood multiplied by the image weight, which grows as the image reliability increases.
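The weighting and fusion of equations (4)–(6), as reconstructed above, in a short sketch:

```python
import math

def fused_utterance_likelihood(af: float, vf: float, x: float, y: float) -> float:
    """Equations (4)-(6): softmax the two reliabilities into weights, then
    multiply the weighted voice and image utterance likelihoods."""
    z = math.exp(x) + math.exp(y)
    w_a = math.exp(x) / z           # voice weight, eq. (4)
    w_v = math.exp(y) / z           # image weight, eq. (5)
    return (w_a * af) * (w_v * vf)  # eq. (6)

s = fused_utterance_likelihood(af=0.9, vf=0.7, x=0.8, y=0.4)
```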
By detecting, for the utterance likelihood S_(i,t) calculated in this way, the sections at or above a predetermined threshold as sections in which speech occurs, the utterance section detection unit 108 can generate the utterance lists U_1 to U_M, one per passenger.
FIG. 4 is a schematic diagram showing an example of the utterance list U of one passenger.
The utterance list U# is table information comprising an utterance section column U#1, a start time column U#2, and an end time column U#3.
The utterance section column U#1 stores utterance section identification information for identifying each detected utterance section.
The start time column U#2 indicates the start time of each detected utterance section.
The end time column U#3 indicates the end time of each detected utterance section.
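A minimal sketch of turning a per-frame likelihood sequence into such a list; the frame period and the threshold value are assumptions.

```python
def detect_utterance_sections(s, threshold=0.5, frame_sec=0.01):
    """Return [(section_id, start_time, end_time), ...] for runs of frames
    where the utterance likelihood s[t] stays at or above the threshold."""
    sections, start = [], None
    for t, value in enumerate(s):
        if value >= threshold and start is None:
            start = t                                  # a section opens
        elif value < threshold and start is not None:
            sections.append((len(sections) + 1, start * frame_sec, t * frame_sec))
            start = None                               # the section closes
    if start is not None:                              # still open at the end
        sections.append((len(sections) + 1, start * frame_sec, len(s) * frame_sec))
    return sections

# Two sections, with start/end times in seconds (10 ms frames assumed)
print(detect_utterance_sections([0.1, 0.7, 0.8, 0.2, 0.9, 0.9]))
```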
The method by which the utterance section detection unit 108 calculates the final utterance likelihood S_(i,t) is not limited to equation (6) above. For example, if S_(i,t) is taken to be determined from the past state sequence, the weighted voice and image utterance likelihoods, and a state transition table σ, the utterance section detection unit 108 can calculate S_(i,t) by equation (7) below (again reconstructed; the original equation image is unavailable):

    S_(i,t) = σ(S_(i,1:t-1), W^A_{i,t} · AF_{i,t}, W^V_{i,t} · VF_{i,t})    (7)

Here, the state transition table σ is a function that returns a unique state from the past state transition sequence and the current voice and image utterance likelihoods.
Thus, according to equation (7), the utterance likelihood is calculated by a predetermined function whose variables are the voice utterance likelihood multiplied by the voice weight (which grows as the voice reliability increases), the image utterance likelihood multiplied by the image weight (which grows as the image reliability increases), and the utterance likelihood calculated in the past.
The voice recognition unit 109 executes, for each target person, voice recognition on the corresponding voice signal SS_1 to SS_M within the utterance sections indicated by the corresponding utterance list U_1 to U_M. Voice recognition is performed, for example, by extracting feature values for voice recognition and using the extracted feature values.
Various known acoustic models, such as an HMM (Hidden Markov Model), can be used for the voice recognition processing. The voice recognition unit 109 executes voice recognition independently for each passenger as a target person, and outputs, for each passenger, the voice recognition result for the detected utterance sections together with the reliability of that result (hereinafter called the voice recognition score).
The voice recognition score may be a value that takes into account both the output probability of the acoustic model and the output probability of the language model, or it may be an acoustic score based on the output probability of the acoustic model alone.
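A common way to combine the two probabilities is a log-linear sum; this sketch and the weight λ (lm_weight) are assumptions, not something the text specifies.

```python
import math
from typing import Optional

def recognition_score(p_acoustic: float, p_language: Optional[float] = None,
                      lm_weight: float = 0.8) -> float:
    """Log-domain score: the acoustic score alone if no language-model
    probability is supplied, otherwise a weighted log-linear combination."""
    score = math.log(p_acoustic)
    if p_language is not None:
        score += lm_weight * math.log(p_language)
    return score
```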
The constituent elements of the voice recognition device 100 may be distributed across a server on a network, a mobile terminal such as a smartphone, or an on-board unit.
FIG. 5 is a block diagram schematically showing the hardware configuration of the voice recognition device 100 according to the embodiment.
The hardware of the voice recognition device 100 can be realized by a computer comprising a memory 150, a processor 151, a voice interface (hereinafter, voice I/F) 152, an image interface (hereinafter, image I/F) 153, a vehicle state interface (hereinafter, vehicle state I/F) 154, and a network interface (hereinafter, network I/F) 155.
The memory 150 stores the programs that function as the voice signal processing unit 102, the voice utterance likelihood calculation unit 103, the image utterance likelihood calculation unit 104, the environment information determination unit 105, the utterance section detection unit 108, and the voice recognition unit 109. The memory 150 is, for example, a semiconductor memory such as a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Read Only Memory), or an EEPROM (Electrically Erasable Programmable Read-Only Memory), or a storage device using a magnetic disk, an optical disc, a magneto-optical disc, or the like.
The processor 151 reads from the memory 150 the programs functioning as the voice signal processing unit 102, the voice utterance likelihood calculation unit 103, the environment information determination unit 105, the image utterance likelihood calculation unit 104, the utterance section detection unit 108, and the voice recognition unit 109, and executes those programs. The processor 151 is, for example, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a microprocessor, a microcontroller, or a DSP (Digital Signal Processor).
The voice I/F 152 is a voice input interface for receiving the voice analog signals S_1 to S_N from the microphone 121 over multiple channels. When natural-language speech for interactively controlling the car or the air conditioner with an occupant is output from a speaker (not shown) as a voice recognition result, the voice I/F 152 also functions as a voice output interface. If the configuration does not require speaker output, the voice output function is unnecessary.
The image I/F 153 is an image input interface for receiving the image signal V from the camera 122. When information needed by the occupants, derived from the final voice recognition result of the voice recognition unit 109, is presented as text or images on a display device (not shown) such as a monitor, the image I/F 153 also functions as an image output interface. If the configuration does not require display on a display device, the image output function is unnecessary.
The vehicle state I/F 154 is an input interface for receiving the speed information C measured by the vehicle speedometer 123. The vehicle state I/F 154 can also acquire information on the current state of the vehicle other than speed, such as the open/closed state of the doors.
The network I/F 155 is an interface for communicating when voice recognition is executed using a voice recognition service published on the Internet cloud instead of using the voice recognition unit 109. The network I/F 155 is also an interface used, for example, for P2P (peer-to-peer) communication with nearby cars as a connected car, or for communicating with base stations to run navigation. The network I/F 155 is unnecessary if the configuration does not require communication.
The I/F unit 101 shown in FIG. 1 can be realized by the voice I/F 152, the image I/F 153, the vehicle state I/F 154, or the network I/F 155.
Although the memory 150 is arranged inside the voice recognition device 100 in FIG. 5, the device may be configured so that an external memory such as a USB (Universal Serial Bus) memory is connected and programs or data are read from it. The internal memory and an external memory may also be used together.
FIG. 6 is a flowchart showing the flow of operations of the voice recognition device 100 according to the embodiment.
First, the voice signal processing unit 102 performs A/D conversion on the voice analog signals S_1 to S_N from the microphone 121 to generate voice digital signals, and generates the voice signals SS_1 to SS_M by emphasizing, in those digital signals, the speech of each target person whose voice is to be acquired (S10). For example, if four passengers are seated in the vehicle 130 in the driver's seat, the front passenger seat, the rear left seat, and the rear right seat, and all of those seats are voice recognition target seats, the voice signal processing unit 102 emphasizes the sound arriving from each of these four directions. The voice signal processing unit 102 supplies the voice signals SS_1 to SS_M to the voice utterance likelihood calculation unit 103, the environment information determination unit 105, and the voice recognition unit 109.
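The text does not fix the enhancement method; as one hedged illustration, a delay-and-sum beamformer that emphasizes a single seat direction, where the per-microphone sample delays are assumed to have been derived offline from the microphone-array and seat geometry:

```python
import numpy as np

def delay_and_sum(mics: np.ndarray, delays: np.ndarray) -> np.ndarray:
    """Emphasize one direction by delaying each microphone channel so the
    target seat's wavefront aligns across channels, then averaging.

    mics: shape (n_mics, n_samples); delays: non-negative integer sample
    delays per microphone (geometry-derived, assumed known here).
    """
    n_mics, n_samples = mics.shape
    aligned = np.zeros((n_mics, n_samples))
    for m in range(n_mics):
        d = int(delays[m])
        aligned[m, d:] = mics[m, : n_samples - d]  # shift channel by its delay
    return aligned.mean(axis=0)                    # constructive sum toward target
```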
Next, the voice utterance likelihood calculation unit 103 calculates the voice utterance likelihoods AF_1 to AF_M from the voice signals SS_1 to SS_M (S11).
Next, the image utterance likelihood calculation unit 104 calculates the image utterance likelihoods VF_1 to VF_M from the image represented by the image signal V (S12).
Next, the environment information determination unit 105 calculates the voice reliabilities X_1 to X_M of the voice signals SS_1 to SS_M and the image reliabilities Y_1 to Y_M of the image signal V, from the passenger-speech-enhanced voice signals SS_1 to SS_M received from the voice signal processing unit 102, the image signal V received from the camera 122, and the speed information C received from the vehicle speedometer 123 (S13).
Next, the utterance section detection unit 108 estimates, for each passenger, the times of the sections in which speech occurs, from the voice utterance likelihoods AF_1 to AF_M, the image utterance likelihoods VF_1 to VF_M, the voice reliabilities X_1 to X_M, and the image reliabilities Y_1 to Y_M, and detects the utterance sections for each passenger (S14). The utterance section detection unit 108 then supplies the utterance lists U_1 to U_M, containing the start and end times of the detected utterance sections, to the voice recognition unit 109.
The voice recognition unit 109 extracts feature values for voice recognition from the voice signal SS_i corresponding to each target person, within the utterance sections indicated by the corresponding utterance list U_i, and executes voice recognition using the extracted feature values (S15). The voice recognition unit 109 then outputs the voice recognition result.
As described above, according to the present embodiment, judging the reliability of each signal makes it possible to perform signal processing that is more robust in noisy environments.
The voice recognition device 100 described above can be applied to a navigation system, an integrated cockpit system that also includes the driver's meter display, a PC, a tablet PC, or a portable information terminal such as a smartphone.
100 voice recognition device, 101 I/F unit, 102 voice signal processing unit, 103 voice utterance likelihood calculation unit, 104 image utterance likelihood calculation unit, 105 environment information determination unit, 106 passenger presence/absence determination unit, 107 reliability determination unit, 108 utterance section detection unit, 109 voice recognition unit, 120 voice recognition system, 121 microphone, 122 camera, 123 vehicle speedometer, 130 vehicle.

Claims (9)

1.  An information processing device comprising:
    a voice utterance likelihood calculation unit that calculates, from a voice signal including a voice of a target person, a voice utterance likelihood indicating a probability that the target person is speaking in the voice signal;
    an image utterance likelihood calculation unit that calculates, from an image signal representing an image including the target person, an image utterance likelihood indicating a probability that the target person is speaking in the image signal;
    an environment information determination unit that determines a voice reliability indicating reliability of the voice signal and an image reliability indicating reliability of the image signal;
    an utterance section detection unit that weights the voice utterance likelihood more heavily as the voice reliability is higher and weights the image utterance likelihood more heavily as the image reliability is higher, calculates, using the voice utterance likelihood and the image utterance likelihood, an utterance likelihood indicating a probability that the target person is speaking in the voice signal and the image signal, and detects, as an utterance section, a section in which the calculated utterance likelihood is equal to or greater than a predetermined threshold; and
    a voice recognition unit that executes voice recognition on the voice signal in the utterance section.
2.  The information processing device according to claim 1, further comprising a voice signal processing unit that generates a voice digital signal by performing analog/digital conversion on a voice analog signal input from a microphone, and generates the voice signal by removing noise components from the voice digital signal.
3.  The information processing device according to claim 1 or 2, wherein the information processing device is mounted on a vehicle, and the environment information determination unit lowers the voice reliability as the speed of the vehicle increases and lowers the image reliability as the number of passengers in the vehicle increases.
4.  The information processing device according to claim 3, wherein the environment information determination unit detects the number of passengers from the image represented by the image signal.
5.  The information processing device according to claim 3, wherein the environment information determination unit detects the number of passengers from detection values of weight scales installed in the seats of the vehicle.
6.  The information processing device according to any one of claims 1 to 5, wherein the utterance section detection unit calculates the utterance likelihood by multiplying a value obtained by multiplying the voice utterance likelihood by a voice weight that grows as the voice reliability is higher, and a value obtained by multiplying the image utterance likelihood by an image weight that grows as the image reliability is higher.
7.  The information processing device according to any one of claims 1 to 5, wherein the utterance section detection unit calculates the utterance likelihood from a predetermined function whose variables are a value obtained by multiplying the voice utterance likelihood by a voice weight that grows as the voice reliability is higher, a value obtained by multiplying the image utterance likelihood by an image weight that grows as the image reliability is higher, and the utterance likelihood calculated in the past.
8.  A program that causes a computer to function as:
    a voice utterance likelihood calculation unit that calculates, from a voice signal including a voice of a target person, a voice utterance likelihood indicating a probability that the target person is speaking in the voice signal;
    an image utterance likelihood calculation unit that calculates, from an image signal representing an image including the target person, an image utterance likelihood indicating a probability that the target person is speaking in the image signal;
    an environment information determination unit that determines a voice reliability indicating reliability of the voice signal and an image reliability indicating reliability of the image signal;
    an utterance section detection unit that weights the voice utterance likelihood more heavily as the voice reliability is higher and weights the image utterance likelihood more heavily as the image reliability is higher, calculates, using the voice utterance likelihood and the image utterance likelihood, an utterance likelihood indicating a probability that the target person is speaking in the voice signal and the image signal, and detects, as an utterance section, a section in which the calculated utterance likelihood is equal to or greater than a predetermined threshold; and
    a voice recognition unit that executes voice recognition on the voice signal in the utterance section.
9.  An information processing method comprising:
    calculating, from a voice signal including a voice of a target person, a voice utterance likelihood indicating a probability that the target person is speaking in the voice signal;
    calculating, from an image signal representing an image including the target person, an image utterance likelihood indicating a probability that the target person is speaking in the image signal;
    determining a voice reliability indicating reliability of the voice signal and an image reliability indicating reliability of the image signal;
    weighting the voice utterance likelihood more heavily as the voice reliability is higher and weighting the image utterance likelihood more heavily as the image reliability is higher, calculating, using the voice utterance likelihood and the image utterance likelihood, an utterance likelihood indicating a probability that the target person is speaking in the voice signal and the image signal, and detecting, as an utterance section, a section in which the calculated utterance likelihood is equal to or greater than a predetermined threshold; and
    executing voice recognition on the voice signal in the utterance section.
PCT/JP2019/000722 2019-01-11 2019-01-11 Information processing device, program, and information processing method WO2020144857A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2019/000722 WO2020144857A1 (en) 2019-01-11 2019-01-11 Information processing device, program, and information processing method
JP2020564014A JP6833147B2 (en) 2019-01-11 2019-01-11 Information processing equipment, programs and information processing methods

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/000722 WO2020144857A1 (en) 2019-01-11 2019-01-11 Information processing device, program, and information processing method

Publications (1)

Publication Number Publication Date
WO2020144857A1 (en)

Family

ID=71521151

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/000722 WO2020144857A1 (en) 2019-01-11 2019-01-11 Information processing device, program, and information processing method

Country Status (2)

Country Link
JP (1) JP6833147B2 (en)
WO (1) WO2020144857A1 (en)

Also Published As

Publication number Publication date
JP6833147B2 (en) 2021-02-24
JPWO2020144857A1 (en) 2021-03-11
