WO2021024869A1 - Speech processing device, speech processing method, and recording medium
Speech processing device, speech processing method, and recording medium
- Publication number
- WO2021024869A1 (PCT/JP2020/028955)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speaker
- speech data
- speech
- unit
- data
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Definitions
- This disclosure relates to a voice processing device, a voice processing method, and a recording medium, and particularly to a voice processing device, a voice processing method, and a recording medium that process the voice corresponding to an utterance.
- Patent Document 1 discloses reproducing the content of an utterance from a voice signal by speech recognition. Specifically, Patent Document 1 describes collecting a person's utterance with a microphone and converting the audio signal output from the microphone into text data (text information).
- Patent Document 2 discloses a technique of collating the lip pattern and voice of a speaker with pre-registered data and outputting specific character information when the collation matches.
- Patent Document 3 discloses a technique for learning the relationship between the shape of the lips and phonemes from a moving image including a voice generated by a speaker.
- The techniques of Patent Documents 1 and 2 do not take into account the magnitude of the influence of noise on the voice (utterance) produced by a person's speech. In that case, for example, when the content of the speech is reproduced from the voice, the content may not be reproduced accurately.
- This disclosure has been made in view of the above problem, and one of its purposes is to provide a voice processing device and the like that enable processing in consideration of the influence of noise on the voice produced by a person's speech.
- The voice processing device includes a speaker extraction means for extracting a speaker area from an image, a first speech data generation means for generating first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips, a second speech data generation means for generating second speech data indicating the content of the speaker's speech based on a voice signal corresponding to the speaker's speech, and a collation means for collating the first speech data with the second speech data.
- The voice processing method includes extracting a speaker area from an image, generating first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips, generating second speech data indicating the content of the speaker's speech based on a voice signal corresponding to the speaker's speech, and collating the first speech data with the second speech data.
- The recording medium stores a program that causes a computer to extract a speaker area from an image, generate first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips, generate second speech data indicating the content of the speaker's speech based on a voice signal corresponding to the speaker's speech, and collate the first speech data with the second speech data.
- FIG.: A block diagram showing an example of the configuration of the voice processing device according to the first embodiment.
- FIG.: A block diagram showing an example of the configuration of the first speech data generation unit provided in the voice processing device according to the first embodiment.
- FIG.: A flowchart showing an example of the operation flow of the voice processing device according to the first embodiment.
- FIG.: A block diagram showing an example of the configuration of the second speech data generation unit provided in the voice processing device according to the third embodiment.
- FIG.: A block diagram showing an example of the configuration of the voice processing device according to the fourth embodiment.
- FIG.: A flowchart showing an example of the operation flow of the voice processing device according to the fourth embodiment.
- A flowchart showing an example of the operation flow of the voice processing device according to the fifth embodiment.
- FIG. 1 is a block diagram showing an example of the configuration of the voice processing device 1.
- As shown in FIG. 1, the voice processing device 1 includes a speaker extraction unit 20, a first speech data generation unit 30, a collation unit 40, and a second speech data generation unit 50.
- The functions of each part of the voice processing device 1 according to the first embodiment (and of the voice processing devices according to the embodiments described later) may be realized as software by a processor executing a program read into memory, or may be realized as hardware such as an intelligent camera.
- the speaker extraction unit 20 extracts the speaker area from the image.
- the speaker extraction unit 20 is an example of the speaker extraction means.
- the speaker extraction unit 20 acquires time-series image data from a camera or the like (not shown).
- the time-series image data is an image frame of a moving image for a certain period of time.
- the time-series image data may be data of a plurality of still images taken at predetermined time intervals.
- the intelligent camera itself captures time-series image data.
- the speaker extraction unit 20 extracts a speaker area from each image data by performing image analysis on the acquired time-series image data. For example, the speaker extraction unit 20 detects a person's area from each image data by using a classifier (also called a trained model) that has learned the characteristics (personality) of the person.
- the detected area of the person is an area of the image that includes at least a part of the person.
- the area of the person is, for example, a rectangular area surrounding the face portion of the person in the image data.
- the speaker extraction unit 20 identifies the lip portion of the person from the image data of the detected area of the person.
- the speaker extraction unit 20 discriminates the same person among time-series image data by, for example, face recognition (collation) or other means.
- the speaker extraction unit 20 detects a difference (that is, a change) in the shape of the lips of the same person between the time-series image data.
- When such a change in lip shape is detected, the speaker extraction unit 20 determines that the person is a speaker.
- The speaker extraction unit 20 transmits image data including the area of the person determined to be a speaker (hereinafter also referred to as the speaker area) to the first speech data generation unit 30. This image data may hereinafter be referred to as speaker image data.
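As a rough illustration of the speaker extraction described above, the following Python sketch treats a person as a speaker when the lip landmarks of the same person move between consecutive frames. The helpers `detect_faces` and `lip_landmarks`, the person IDs, and the threshold value are assumptions for illustration only and are not specified in the disclosure.

```python
# Minimal sketch of the speaker-extraction idea: a person whose lip shape
# changes across consecutive frames is treated as a speaker.
# detect_faces() and lip_landmarks() are hypothetical helpers standing in for
# any face detector / landmark estimator; they are not part of the disclosure.
import numpy as np

def lip_change(prev_lips: np.ndarray, curr_lips: np.ndarray) -> float:
    """Mean displacement of lip landmark points between two frames."""
    return float(np.mean(np.linalg.norm(curr_lips - prev_lips, axis=1)))

def extract_speakers(frames, detect_faces, lip_landmarks, threshold=2.0):
    """Return, per frame index, the face regions judged to belong to speakers."""
    prev = {}          # person_id -> lip landmarks in the previous frame
    speakers = []      # list of (frame_index, person_id, face_region)
    for i, frame in enumerate(frames):
        for person_id, region in detect_faces(frame):   # same id = same person
            lips = lip_landmarks(frame, region)
            if person_id in prev and lip_change(prev[person_id], lips) > threshold:
                speakers.append((i, person_id, region))  # lips moved -> speaking
            prev[person_id] = lips
    return speakers
```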
- The first speech data generation unit 30 generates first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips.
- the first speech data generation unit 30 is an example of the first speech data generation means.
- the first remark data generation unit 30 may generate the first remark data by using the technique described in Patent Document 3 (Japanese Unexamined Patent Publication No. 2011-13731). Alternatively, as described below, the first speech data generation unit 30 can also generate the first speech data directly from the image data of the speaker.
- FIG. 2 is a block diagram showing an example of the configuration of the first remark data generation unit 30.
- the first speech data generation unit 30 includes a mouth shape element identification unit 31 and a mouth shape element-phoneme conversion unit 32.
- The mouth shape element identification unit 31 receives the image data of the speaker from the speaker extraction unit 20.
- the mouth shape element identification unit 31 identifies the shape of the speaker's lip based on the speaker's image data. Then, the mouth shape element identification unit 31 identifies the mouth shape element from the shape of the speaker's lip.
- the mouth shape element identification unit 31 is an example of the mouth shape element identification means.
- Mouth shape element means the shape of the speaker's lip at the moment when the speaker is uttering one phoneme.
- a phoneme is the smallest unit of speech that a listener can discern in a language. Specifically, phonemes represent vowels, consonants, or semivowels that are discriminated in one language.
- the mouth shape element identification unit 31 transmits information indicating the mouth shape element to the mouth shape element-phoneme conversion unit 32.
- The mouth shape element-phoneme conversion unit 32 receives the information indicating the mouth shape elements from the mouth shape element identification unit 31.
- The mouth shape element-phoneme conversion unit 32 converts the information indicating the mouth shape elements into phoneme data to generate first speech data including time-series data of one or more phonemes.
- The mouth shape element-phoneme conversion unit 32 is an example of a mouth shape element-phoneme conversion means.
- Specifically, the mouth shape element-phoneme conversion unit 32 refers to a mouth shape element-phoneme correspondence table (not shown) showing the correspondence between mouth shape elements and phonemes, and searches for and outputs the phoneme data corresponding to each mouth shape element identified from the shape of the speaker's lips.
- The mouth shape element-phoneme correspondence table shows a one-to-one correspondence between mouth shape elements and phonemes. In this way, the mouth shape element-phoneme conversion unit 32 executes the conversion from mouth shape elements to phonemes.
- The mouth shape element-phoneme conversion unit 32 transmits, as the first speech data, information indicating the phonemes (phoneme data) corresponding to the mouth shape elements identified from the shape of the speaker's lips and the arrangement order of those phonemes (that is, their time-series order) to the collation unit 40.
- the first speech data has a data structure in which sequence numbers (1, 2, 3 ...) are added to one or more phonemes, respectively.
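The mouth-shape-element (viseme) to phoneme conversion described above can be pictured as a table lookup followed by sequence numbering. The table entries below are illustrative assumptions; the disclosure only states that a one-to-one correspondence table exists, without giving its contents.

```python
# Sketch of the mouth-shape-element (viseme) to phoneme conversion.
# The table below is illustrative only; the actual correspondence table of the
# disclosure is not given in the text.
VISEME_TO_PHONEME = {
    "open_wide": "a",
    "spread": "i",
    "rounded": "u",
    "half_open": "e",
    "small_round": "o",
}

def to_first_speech_data(visemes):
    """Convert a time series of visemes into first speech data:
    a list of (sequence_number, phoneme) pairs."""
    return [(idx, VISEME_TO_PHONEME[v])
            for idx, v in enumerate(visemes, start=1)
            if v in VISEME_TO_PHONEME]

# Example: visemes identified from the lip shapes of one utterance.
print(to_first_speech_data(["rounded", "spread", "open_wide"]))
# -> [(1, 'u'), (2, 'i'), (3, 'a')]
```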
- the first speech data generation unit 30 may directly generate the first speech data from the image data of the speaker.
- Specifically, the first speech data generation unit 30 may train a model (for example, a neural network) by a deep learning method so that the corresponding phoneme or audio signal can be discriminated from the speaker's image data.
- the first speech data generation unit 30 inputs the image data of the speaker into the trained model.
- the trained model discriminates the corresponding phoneme or audio signal from the input speaker image data, and outputs the discriminant result.
- the first speech data generation unit 30 generates the first speech data based on the output from the trained model.
- When a plurality of speaker areas are extracted, the first speech data generation unit 30 generates first speech data for each speaker area. That is, the first speech data generation unit 30 generates a plurality of first speech data corresponding to the plurality of speakers.
- the first remark data generation unit 30 transmits the generated first remark data to the collation unit 40 shown in FIG.
- the second speech data generation unit 50 shown in FIG. 1 generates second speech data indicating the content of the speaker's speech based on the voice signal corresponding to the speaker's speech.
- the second speech data generation unit 50 is an example of the second speech data generation means.
- a voice signal corresponding to the speaker's speech is input to the second speech data generation unit 50.
- the second speech data generation unit 50 receives from the microphone an audio signal corresponding to the speaker's speech collected by the microphone.
- the image of the speaker whose speech is collected by the microphone is included in the time-series image data received by the speaker extraction unit 20.
- the second speech data generation unit 50 may acquire a pre-recorded audio signal.
- the image of the speaker who made the speech is included in the recorded time-series image data.
- the second speech data generation unit 50 generates the second speech data from the input audio signal.
- the second speech data generation unit 50 uses information indicating the phonemes corresponding to the input voice signal and the order of the phonemes (that is, the time series order of the phonemes) as the second speech data. Generate.
- the second speech data generation unit 50 uses information indicating the single notes included in the input audio signal and the arrangement order of the single notes (that is, the time series order of the single notes) as the second speech data.
- a single note is a sound that forms a syllable, and is the smallest unit of speech.
- a single note is represented by an audio signal having a fundamental frequency and an audio signal that is a multiple of the fundamental frequency.
- a phoneme is a voice that is discriminated as one single note in one language.
- Some single notes may be identified as the same phoneme. For example, [sh] and [s] are different single notes (consonants), but they are not distinguished in Japanese, so they are determined to be the same phoneme.
- In the following, the former is called phoneme data and the latter is called single note data.
- the second remark data generation unit 50 transmits the generated second remark data to the collation unit 40.
- the collation unit 40 collates the first remark data with the second remark data.
- the collation unit 40 is an example of collation means.
- the collation unit 40 receives the first remark data from the first remark data generation unit 30. Further, the collation unit 40 receives the second speech data from the second speech data generation unit 50. The collation unit 40 collates the first speech data with the second speech data.
- the collation unit 40 collates each of the plurality of first speech data with the second speech data.
- the second speech data may be either the phoneme data or the single note data described above.
- the second speech data is phoneme data, that is, the case where the second speech data is information indicating the phonemes corresponding to the voice signal and the arrangement order of the phonemes will be described below.
- the collation unit 40 generates a first feature vector in which the features of each phoneme included in the first speech data are arranged according to the sequence number added to the phoneme. Further, the collation unit 40 generates a second feature vector in which the features of each phoneme included in the second speech data are arranged according to the sequence number added to the phoneme.
- a phoneme feature vector is the amplitude, power, power spectrum, or Mel-Frequency Cepstrum Coefficients (MFCC) of a standard audio signal representing the phoneme.
- the collation unit 40 calculates the distance between the first feature vector and the second feature vector.
- the collation unit 40 calculates the degree of similarity between the first feature vector and the second feature vector based on the calculated distance magnitude.
- The similarity is represented by, for example, a single number between 0 (not similar at all) and 1 (exact match).
- When the similarity exceeds a threshold value, the collation unit 40 determines that the first speech data and the second speech data are the same (collation success). On the other hand, when the similarity is equal to or less than the threshold value, the collation unit 40 determines that the first speech data and the second speech data are not the same (collation failure).
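A minimal sketch of this first collation example might look as follows: per-phoneme features are concatenated in sequence order, the vector distance is turned into a similarity, and the similarity is compared with a threshold. The dummy `PHONEME_FEATURES` values, the `1/(1+distance)` mapping, and the assumption that both data contain the same number of phonemes are illustrative choices, not taken from the text.

```python
# Sketch of collation Example 1: build a feature vector per speech data by
# concatenating the features of each phoneme in sequence order, then turn the
# vector distance into a similarity in [0, 1].  PHONEME_FEATURES would hold,
# e.g., MFCCs of standard signals for each phoneme; the values here are dummies.
import numpy as np

PHONEME_FEATURES = {"a": np.array([1.0, 0.2]),
                    "i": np.array([0.3, 0.9]),
                    "u": np.array([0.5, 0.5])}

def feature_vector(speech_data):
    """speech_data: list of (sequence_number, phoneme), ordered by number."""
    ordered = [p for _, p in sorted(speech_data)]
    return np.concatenate([PHONEME_FEATURES[p] for p in ordered])

def collate(first_data, second_data, threshold=0.5):
    # Assumes both data contain the same number of phonemes.
    v1, v2 = feature_vector(first_data), feature_vector(second_data)
    distance = np.linalg.norm(v1 - v2)
    similarity = 1.0 / (1.0 + distance)   # 1.0 on exact match, -> 0 as distance grows
    return similarity > threshold         # True = collation success
```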
- the collation unit 40 performs matching between the individual phonemes included in the first speech data and the individual phonemes included in the second speech data.
- For example, the collation unit 40 determines, for each of the phonemes from the first to the N1-th, whether or not the phoneme of the first speech data and the corresponding phoneme of the second speech data are the same.
- the collation unit 40 counts the number of times the matching is successful, that is, the number of phonemes that are the same between the first speech data and the second speech data.
- When the number of successful matchings exceeds a predetermined number, the collation unit 40 determines that the first speech data and the second speech data are the same (collation success). On the other hand, when the number of successful matchings is less than or equal to the predetermined number, the collation unit 40 determines that the first speech data and the second speech data are not the same (collation failure).
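This second collation example reduces to counting position-wise phoneme matches; a short sketch, where the example words and the required count are made up for illustration:

```python
# Sketch of collation Example 2: compare phonemes position by position and
# count the matches; the required count is a predetermined number.
def collate_by_count(first_phonemes, second_phonemes, required_matches):
    matches = sum(1 for p1, p2 in zip(first_phonemes, second_phonemes) if p1 == p2)
    return matches > required_matches   # True = collation success

# 9 of 10 positions match, so this collation succeeds.
collate_by_count(list("konnichiwa"), list("konxichiwa"), required_matches=8)
```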
- In Example 3, the second speech data is single note data, that is, information indicating the single notes included in the voice signal and the arrangement order of the single notes. In this case, the collation unit 40 considers one or more single notes that are indistinguishable from each other in one language to be the same phoneme.
- the collation unit 40 performs matching between each of one or a plurality of single notes considered to be the same as the individual phonemes included in the first speech data and the individual single notes included in the second speech data.
- the matching method in Example 3 is the same as when the second speech data is phoneme data.
- In Example 4, a method of collating the first speech data with the second speech data when both are audio signals will be described.
- In this case, the first speech data generation unit 30 further converts the phonemes converted from the mouth shape elements by the mouth shape element-phoneme conversion unit 32 into voice signals corresponding to the phonemes.
- the first speech data generation unit 30 converts a phoneme into a corresponding voice signal by referring to a table (not shown) showing the correspondence between the phoneme and the voice signal.
- the second speech data generation unit 50 transmits the input audio signal itself as the second speech data to the collation unit 40.
- the collation unit 40 converts the voice signal, which is the first speech data, and the voice signal, which is the second speech data, into spectrograms, respectively.
- The collation unit 40 performs pattern matching between the first spectrogram representing the first speech data and the second spectrogram representing the second speech data, and calculates the similarity between the two spectrograms.
- When the similarity exceeds a threshold value, the collation unit 40 determines that the first speech data and the second speech data match (collation success). On the other hand, when the similarity is equal to or less than the threshold value, the collation unit 40 determines that the first speech data and the second speech data do not match (collation failure).
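This fourth collation example could be sketched as below, using `scipy.signal.spectrogram` and a normalized correlation as one possible pattern-matching score; the text does not prescribe a particular spectrogram computation or matching method, so these are assumptions.

```python
# Sketch of collation Example 4: both speech data are audio signals; compare
# their spectrograms.  Normalized correlation is used here as one simple
# pattern-matching score.
import numpy as np
from scipy.signal import spectrogram

def spectrogram_similarity(signal1, signal2, fs=16000):
    _, _, s1 = spectrogram(signal1, fs=fs)
    _, _, s2 = spectrogram(signal2, fs=fs)
    # Crop to a common shape so the two spectrograms can be compared directly.
    rows, cols = min(s1.shape[0], s2.shape[0]), min(s1.shape[1], s2.shape[1])
    a, b = s1[:rows, :cols].ravel(), s2[:rows, :cols].ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def collate_signals(signal1, signal2, threshold=0.8):
    return spectrogram_similarity(signal1, signal2) > threshold
```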
- the collation unit 40 collates the first remark data with the second remark data.
- the collation unit 40 outputs the collation result of the first remark data and the second remark data. For example, the collation unit 40 outputs information indicating whether or not the collation between the first remark data and the second remark data is successful as a result of the collation.
- the voice processing device 1 can perform processing in consideration of the influence of noise on the voice (that is, utterance) caused by the speaker's speech.
- the collation result by the collation unit 40 is used to associate a speaker with a statement made by the speaker (see Embodiment 4).
- the collation result by the collation unit 40 may be used to correct the second remark data by using the first remark data (see the fifth embodiment).
- FIG. 3 is a flowchart showing an example of the operation flow of the voice processing device 1.
- the speaker extraction unit 20 acquires time-series image data and extracts a speaker area from each image data (S101).
- the speaker extraction unit 20 transmits the extracted image data of the speaker to the first speech data generation unit 30.
- the first speech data generation unit 30 extracts the speaker's lip portion from the speaker's image data, and generates the first speech data based on the extracted speaker's lip shape (S102).
- the first remark data generation unit 30 transmits the generated first remark data to the collation unit 40.
- the second speech data generation unit 50 acquires an audio signal corresponding to the speaker's speech (S103).
- the second speech data generation unit 50 generates second speech data indicating the content of the speaker's speech based on the voice signal corresponding to the speaker's speech (S104).
- the second speech data generation unit 50 transmits the generated second speech data to the collation unit 40.
- the collation unit 40 receives the first remark data from the first remark data generation unit 30. Further, the collation unit 40 receives the second speech data from the second speech data generation unit 50. Then, the collation unit 40 collates the first remark data with the second remark data (S105).
- the collation unit 40 outputs the collation result in step S105.
- the collation unit 40 outputs information indicating whether or not the collation between the first remark data and the second remark data is successful as a result of the collation.
- the speaker extraction unit 20 extracts a speaker region from the image.
- the first speech data generation unit 30 generates first speech data indicating the content of the speaker's speech based on the shape of the speaker's lip.
- the second speech data generation unit 50 generates second speech data indicating the content of the speaker's speech based on the voice signal corresponding to the speaker's speech.
- the collation unit 40 collates the first remark data with the second remark data.
- For example, when reproducing the content of the speech from the second speech data, the second speech data can be corrected by using the first speech data that has been successfully collated by the collation unit 40. Specifically, even if there is noise in the second speech data, the noisy portion can be reproduced based on the first speech data. That is, the voice processing device can reproduce the speech from the voice signal with higher accuracy. Therefore, the voice processing device according to the first embodiment has the effect of enabling processing that takes into account the influence of noise on the voice produced by a person's speech.
- the configuration of the voice processing device according to the second embodiment is the same as that of the voice processing device 1 (FIG. 1) described in the first embodiment.
- the voice processing device according to the second embodiment includes a second speech data generation unit 250 (FIG. 4) instead of the second speech data generation unit 50 (FIG. 1).
- the second speech data according to the second embodiment is information indicating the phonemes corresponding to the voice signal and the order of the phonemes.
- FIG. 4 is a block diagram showing an example of the configuration of the second speech data generation unit 250 according to the second embodiment.
- the second speech data generation unit 250 includes a feature extraction unit 252 and a voice signal-phoneme conversion unit 253.
- The feature extraction unit 252 performs preprocessing such as sampling (analog-to-digital conversion) and filtering on the input audio signal, and then extracts features from the audio signal.
- the feature extraction unit 252 is an example of a feature extraction means.
- the characteristics of the audio signal are, for example, the amplitude of the audio signal, the power of the audio signal at a certain frequency, or the spectrum (spectral envelope).
- the feature extraction unit 252 transmits information indicating the features extracted from the voice signal to the voice signal-phoneme conversion unit 253.
- the voice signal-phoneme conversion unit 253 receives information indicating the characteristics of the voice signal from the feature extraction unit 252. The voice signal-phoneme conversion unit 253 inputs the received feature to the trained model.
- the trained model is a model (for example, a neural network) trained so that phonemes can be discriminated from the characteristics of the voice signal.
- a phoneme is the smallest unit of speech that can be discriminated by a listener in one language.
- the trained model outputs the discrimination result of the phoneme corresponding to the input audio signal.
- the voice signal-phoneme conversion unit 253 converts the characteristics of the voice signal into the corresponding phonemes based on the output from the trained model, and generates the second speech data including one or more phonemes.
- the voice signal-phoneme conversion unit 253 is an example of the voice signal-phoneme conversion means.
- the voice signal-phoneme conversion unit 253 transmits information indicating the input voice signal, the corresponding phoneme, and the arrangement order of the phonemes to the collation unit 40 as the second speech data.
- sequence numbers (1, 2, 3 ...) are added to one or more phonemes corresponding to the audio signal.
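One way to picture the feature extraction and voice signal-phoneme conversion of the second embodiment is the sketch below: the signal is framed, a log power spectrum is computed per frame, and a trained classifier maps each feature to a phoneme. The frame lengths, the choice of feature, and the `phoneme_classifier` placeholder are assumptions for illustration; the disclosure only requires some trained model.

```python
# Sketch of the second-speech-data generation of the second embodiment:
# extract per-frame features from the audio signal, then let a trained model
# map each feature to a phoneme.  phoneme_classifier is a placeholder for
# whatever trained model (e.g., a neural network) is actually used.
import numpy as np

def extract_features(signal, frame_len=400, hop=160):
    """Split the signal into frames and return a log power spectrum per frame."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    window = np.hanning(frame_len)
    return [np.log(np.abs(np.fft.rfft(f * window)) ** 2 + 1e-10) for f in frames]

def to_second_speech_data(signal, phoneme_classifier):
    """Return second speech data as (sequence_number, phoneme) pairs."""
    features = extract_features(np.asarray(signal, dtype=float))
    phonemes = [phoneme_classifier(feat) for feat in features]   # trained model
    return list(enumerate(phonemes, start=1))
```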
- The collation unit 40 collates the first speech data with the second speech data. Specifically, the collation unit 40 may collate the first speech data with the second speech data by using the method described as Example 1 or Example 2 of the collation method in the first embodiment. In the second embodiment, the description of the collation method is therefore omitted.
- the collation unit 40 outputs the collation result of the first remark data and the second remark data. For example, the collation unit 40 outputs information indicating whether or not the collation between the first remark data and the second remark data is successful as a result of the collation.
- the speaker extraction unit 20 extracts a speaker region from the image.
- the first speech data generation unit 30 generates first speech data indicating the content of the speaker's speech based on the shape of the speaker's lip.
- the second speech data generation unit 250 generates second speech data indicating the content of the speaker's speech based on the voice signal corresponding to the speaker's speech.
- the second speech data generation unit 250 includes a feature extraction unit 252 and a voice signal-phoneme conversion unit 253.
- the feature extraction unit 252 extracts features from the voice signal.
- the audio signal-phoneme conversion unit 253 converts the input audio signal into a phoneme corresponding to the input audio signal.
- the collation unit 40 collates the first remark data with the second remark data.
- The voice processing device makes it possible to perform processing in consideration of the influence of noise on the voice produced by a person's speech. For example, when the collation unit 40 succeeds in collation, the second speech data can be corrected by using the first speech data, so that the voice processing device according to the second embodiment can reproduce the speech from the voice signal with higher accuracy.
- the configuration of the voice processing device (not shown) according to the third embodiment is the same as that of the voice processing device 1 (FIG. 1) described in the first embodiment.
- the voice processing device according to the third embodiment includes a second speech data generation unit 350 instead of the second speech data generation unit 50.
- the second speech data according to the third embodiment is information indicating the single notes included in the voice signal and the order of the single notes.
- the second speech data generation unit 350 generates the second speech data from the input audio signal by the third method described in the first embodiment.
- FIG. 5 is a block diagram showing an example of the configuration of the second remark data generation unit 350 according to the third embodiment. As shown in FIG. 5, the second speech data generation unit 350 includes a single note extraction unit 351.
- the single note extraction unit 351 extracts a single note included in the input voice signal and generates a second speech data including one or a plurality of single notes.
- the single note extraction unit 351 is an example of a single note extraction means. As described above, a single note is represented by an audio signal having a fundamental frequency and an audio signal that is a multiple of the fundamental frequency.
- the single note extraction unit 351 transmits information indicating the single note included in the input audio signal and the order of the single notes to the collation unit 40 as the second speech data.
- sequence numbers (1, 2, 3 ...) are added to one or a plurality of single notes corresponding to the audio signal.
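Since a single note is characterized above by a fundamental frequency and its multiples, single-note extraction could be pictured as a per-frame fundamental-frequency estimate followed by a lookup. The autocorrelation-based estimator and the `classify_note` mapping below are illustrative assumptions, not the method mandated by the disclosure.

```python
# Sketch of single-note extraction: estimate the fundamental frequency of each
# frame by autocorrelation and map it to a note label via a hypothetical
# lookup (classify_note); real systems would be more elaborate.
import numpy as np

def fundamental_frequency(frame, fs=16000, fmin=60.0, fmax=400.0):
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)          # plausible pitch lag range
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

def to_single_note_data(signal, classify_note, fs=16000, frame_len=640):
    notes = []
    for i in range(0, len(signal) - frame_len + 1, frame_len):
        f0 = fundamental_frequency(np.asarray(signal[i:i + frame_len], float), fs)
        notes.append(classify_note(f0))          # hypothetical note lookup
    return list(enumerate(notes, start=1))       # sequence-numbered single notes
```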
- the collation unit 40 collates the first remark data with the second remark data. Specifically, the collation unit 40 may collate the first remark data with the second remark data by using the collation method described as Example 3 in the first embodiment.
- the collation unit 40 outputs the collation result of the first remark data and the second remark data. For example, the collation unit 40 outputs information indicating whether or not the collation between the first remark data and the second remark data is successful as a result of the collation.
- the speaker extraction unit 20 extracts a speaker region from the image.
- the first speech data generation unit 30 generates first speech data indicating the content of the speaker's speech based on the shape of the speaker's lip.
- the second speech data generation unit 350 generates second speech data indicating the content of the speaker's speech based on the voice signal corresponding to the speaker's speech.
- the second speech data generation unit 350 includes a single sound extraction unit 351 that extracts a single sound included in the input voice signal.
- the single note extraction unit 351 transmits information indicating the single notes included in the input audio signal and the arrangement order of the single notes to the collation unit 40 as the second speech data.
- the collation unit 40 collates the first remark data with the second remark data.
- the voice processing device makes it possible to perform processing in consideration of the influence of noise on the voice due to the speech of a person.
- the remark can be reproduced with higher accuracy from the voice signal by correcting the second remark data by using the first remark data.
- In the fourth embodiment, a configuration will be described in which, based on the result of collation between the first speech data and the second speech data, the second speech data corresponding to the voice signal of the speaker's speech is associated with speaker information that identifies the speaker.
- FIG. 6 is a block diagram showing an example of the configuration of the voice processing device 4.
- As shown in FIG. 6, the voice processing device 4 includes a speaker extraction unit 20, a first speech data generation unit 30, a collation unit 40, a second speech data generation unit 50, and an association unit 60. That is, the configuration of the voice processing device 4 according to the fourth embodiment differs from the configuration of the voice processing device 1 according to the first embodiment in that the association unit 60 is provided.
- the association unit 60 is connected to the storage unit 300.
- the storage unit 300 may be connected to the voice processing device 4 via a wireless or wired network. Alternatively, the storage unit 300 may be a part of the voice processing device 4.
- the storage unit 300 is an example of a storage means.
- the speaker extraction unit 20 extracts the speaker region from the time-series image data as described in the first embodiment. Further, the speaker extraction unit 20 generates speaker information that identifies the speaker extracted from the time-series image data.
- the speaker extraction unit 20 extracts the area of the speaker's face from the time-series image data. Then, the speaker extraction unit 20 generates the face image data of the speaker as the speaker information. Alternatively, the speaker extraction unit 20 may generate a feature vector representing the features of the speaker's face as speaker information.
- the speaker information is, for example, at least one of the attribute information of the speaker, the position information of the speaker, the face image of the speaker, and the first speech data.
- the speaker information is not limited to these as long as it is information for identifying the speaker.
- When a plurality of speakers are detected from the time-series image data, the speaker extraction unit 20 generates speaker information for each speaker. The speaker extraction unit 20 transmits the generated speaker information to the association unit 60.
- the first remark data generation unit 30 receives the image data of the speaker from the speaker extraction unit 20.
- the first speech data generation unit 30 generates the first speech data by the image analysis described in the first embodiment based on the received image data.
- the first speech data generation unit 30 transmits the generated first speech data to the collation unit 40.
- the second remark data generation unit 50 generates the second remark data as described in the first embodiment.
- the second speech data generation unit 50 transmits the generated second speech data to the collation unit 40.
- the second speech data may be either the phoneme data or the single note data described above.
- the collation unit 40 receives the first remark data from the first remark data generation unit 30. Further, the collation unit 40 receives the second speech data from the second speech data generation unit 50. The collation unit 40 collates the first remark data with the second remark data.
- the collation unit 40 collates the first remark data with the second remark data by using any one of Examples 1 to 4 of the collation method described in the first embodiment.
- The collation unit 40 transmits the collation result to the association unit 60.
- For example, the collation unit 40 transmits, as the result of the collation, information or a flag indicating that the collation between the first speech data and the second speech data succeeded or failed to the association unit 60.
- the association unit 60 associates the speaker information for identifying the speaker in the image with the second speech data based on the collation result.
- the association unit 60 is an example of the association means.
- the association unit 60 receives speaker information from the speaker extraction unit 20. Further, the association unit 60 receives the result of the above-mentioned collation from the collation unit 40. As described above, the result of the collation is, for example, information or a flag indicating that the collation between the first speech data and the second speech data has succeeded or failed.
- Based on the collation result from the collation unit 40, the association unit 60 associates the speaker information received from the speaker extraction unit 20 with the second speech data received from the second speech data generation unit 50.
- the association unit 60 assigns an ID (Identification) to a set of speaker information and second speech data.
- the association unit 60 stores a set of speaker information and a second speech data in the storage unit 300 shown in FIG. 6 together with an ID assigned to the set.
- the association unit 60 may store the associated speaker information and the second speech data in a network server or the like (not shown).
- The association unit 60 may convert the second speech data into a voice signal or text data corresponding to the speaker's speech, associate the converted voice signal or text data with the speaker information, and store them in the storage unit 300.
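A minimal sketch of this association step, assuming an in-memory dictionary in place of the storage unit 300 and a UUID as the assigned ID (both assumptions; the disclosure only says an ID is assigned and the pair is stored):

```python
# Sketch of the association unit: when collation succeeds, pair the speaker
# information with the second speech data, give the pair an ID, and store it.
import uuid

storage_300 = {}   # placeholder for the storage unit 300

def associate(speaker_info, second_speech_data, collation_succeeded):
    if not collation_succeeded:
        return None
    pair_id = str(uuid.uuid4())                      # ID assigned to the pair
    storage_300[pair_id] = {"speaker": speaker_info,
                            "speech": second_speech_data}
    return pair_id

# Example use with a face image as speaker information.
associate({"face_image": "speaker_01.png"}, [(1, "k"), (2, "o")], True)
```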
- FIG. 7 is a flowchart showing an example of the operation flow of the voice processing device 4. Since steps S101 to S105 shown in FIG. 7 are common to the operation flow described in the first embodiment, the description of S101 to S105 will be omitted in the fourth embodiment. In the following, it will be described from step S206 shown in FIG.
- In step S206, the association unit 60 receives the collation result from the collation unit 40.
- When the collation between the first speech data and the second speech data fails (No in S206), the process returns to the first step S101 of the operation flow shown in FIG. 7.
- When the collation succeeds (Yes in S206), the association unit 60 associates the speaker information received from the speaker extraction unit 20 with the second speech data received from the second speech data generation unit 50 (S207).
- the association unit 60 stores the associated speaker information and the second speech data in the storage unit 300. This completes the operation of the voice processing device 4 according to the fourth embodiment.
- the speaker extraction unit 20 extracts a speaker region from the image.
- the first speech data generation unit 30 generates first speech data indicating the content of the speaker's speech based on the shape of the speaker's lip.
- the second speech data generation unit 50 generates second speech data indicating the content of the speaker's speech based on the voice signal corresponding to the speaker's speech.
- the collation unit 40 collates the first remark data with the second remark data.
- the association unit 60 associates the speaker information for identifying the speaker in the image with the second speech data based on the collation result.
- the voice processing device according to the fourth embodiment can easily create, for example, the minutes data describing who said what. Further, the voice processing device according to the fourth embodiment can identify the speaker even when there are a plurality of persons.
- FIG. 8 is a block diagram showing an example of the configuration of the voice processing device 5.
- As shown in FIG. 8, the voice processing device 5 further includes a correction unit 70 in addition to the speaker extraction unit 20, the first speech data generation unit 30, the collation unit 40, and the second speech data generation unit 50.
- the configuration of the voice processing device 5 according to the fifth embodiment is different from the configuration of the voice processing device 1 according to the first embodiment in that the correction unit 70 is provided.
- the correction unit 70 receives the second speech data from the second speech data generation unit 50. In addition, the correction unit 70 receives the first speech data from the first speech data generation unit 30.
- the second speech data may be either the phoneme data or the single note data described above, as in the first embodiment.
- the correction unit 70 corrects the second speech data by using the first speech data received from the first speech data generation unit 30.
- the correction unit 70 is an example of correction means.
- the correction unit 70 may store the corrected second remark data in a storage unit (not shown), a network server, or both.
- correction unit 70 corrects the second speech data using the first speech data.
- In Example 1, the second speech data is phoneme data, that is, information indicating the phonemes corresponding to the voice signal and the arrangement order of the phonemes. In this case, the correction unit 70 compares each phoneme included in the first speech data with the corresponding phoneme included in the phoneme data which is the second speech data.
- the corresponding phoneme is a phoneme having the same added sequence number.
- For example, the correction unit 70 compares a vowel included in the first speech data with the corresponding vowel included in the phoneme data which is the second speech data.
- When the two vowels are the same, the correction unit 70 leaves the vowel of the second speech data as it is.
- When the two vowels differ, the correction unit 70 replaces the vowel included in the second speech data with the corresponding vowel in the first speech data. In this way, the correction unit 70 corrects the second speech data by using the first speech data.
- The correction unit 70 may also select, among the phonemes included in the second speech data, phonemes whose SN ratio (S/N) or likelihood is smaller than a threshold value, and replace them with the corresponding phonemes of the first speech data.
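This first correction example can be sketched as keeping the phonemes of the second speech data that appear reliable and substituting the corresponding phonemes of the first speech data elsewhere. The per-phoneme likelihoods and the threshold below are assumptions made for illustration.

```python
# Sketch of correction Example 1: replace low-likelihood (or low-SNR) phonemes
# of the second speech data with the corresponding phonemes of the first
# speech data.  Likelihoods per phoneme are assumed to come from the recognizer.
def correct(first_phonemes, second_phonemes, second_likelihoods, threshold=0.6):
    corrected = []
    for p1, p2, lik in zip(first_phonemes, second_phonemes, second_likelihoods):
        corrected.append(p2 if lik >= threshold else p1)   # replace unreliable ones
    return corrected

# The last phoneme of the audio result is unreliable, so the lip-reading
# result is used instead.
correct(list("hai"), list("hao"), [0.9, 0.8, 0.3])   # -> ['h', 'a', 'i']
```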
- In Example 2, the second speech data is the single note data described above, that is, information indicating the single notes included in the voice signal and the arrangement order of the single notes. In this case, the correction unit 70 adds, to the likelihood of each of the plurality of single note candidates included in the second speech data, a weight corresponding to the corresponding phoneme of the first speech data, and selects one of the plurality of single note candidates of the second speech data based on the weighted likelihoods.
- For example, when the first candidate corresponds to the phoneme of the first speech data, the correction unit 70 assigns a weight X (> 1) to the likelihood a of the first candidate and a weight y (< 1) to the likelihood A of the second candidate.
- The correction unit 70 then compares the weighted likelihood X × a of the first candidate with the weighted likelihood y × A of the second candidate, and selects the candidate having the larger weighted likelihood.
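This second correction example amounts to re-ranking candidates with the weights X and y. A small sketch with assumed weight values, echoing the earlier [sh]/[s] example where the audio alone slightly prefers one candidate but the lip shape favors the other:

```python
# Sketch of correction Example 2: weight each single-note candidate's
# likelihood, boosting (x > 1) the candidate that matches the corresponding
# phoneme of the first speech data and damping (y < 1) the others, then keep
# the candidate with the largest weighted value.  The weight values are
# illustrative assumptions.
def select_candidate(candidates, first_phoneme, x=1.5, y=0.8):
    """candidates: list of (single_note, likelihood) for one time position."""
    best_note, best_score = None, float("-inf")
    for note, likelihood in candidates:
        weight = x if note == first_phoneme else y
        score = weight * likelihood
        if score > best_score:
            best_note, best_score = note, score
    return best_note

# Audio alone slightly prefers 's', but the lip shape (first speech data) says 'sh'.
select_candidate([("s", 0.55), ("sh", 0.45)], first_phoneme="sh")   # -> 'sh'
```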
- FIG. 9 is a block diagram showing an example of the configuration of the voice processing device 5A according to the modified example.
- As shown in FIG. 9, the voice processing device 5A includes a speaker extraction unit 20, a first speech data generation unit 30, a collation unit 40, a second speech data generation unit 50, a correction unit 70, and an association unit 60. That is, the configuration of the voice processing device 5A according to this modification differs from the configuration of the voice processing device 5 in that the association unit 60 is further provided.
- the association unit 60 associates the speaker information for identifying the speaker in the image data with the second speech data corrected by the correction unit 70.
- the association unit 60 is an example of the association means.
- the association unit 60 receives speaker information from the speaker extraction unit 20. Further, the association unit 60 receives the corrected second speech data from the correction unit 70. Further, the association unit 60 receives information or a flag from the collation unit 40 indicating that the first remark data and the second remark data have been successfully collated.
- The association unit 60 associates the speaker information received from the speaker extraction unit 20 with the corrected second speech data received from the correction unit 70.
- the association unit 60 associates the corrected second speech data with the speaker information (for example, a speaker's face image) received from the speaker extraction unit 20, and stores the storage unit 300 (see the fourth embodiment). ) Etc.
- When the collation unit 40 collates the first speech data of a plurality of speakers with the second speech data, the association unit 60 identifies, based on the collation results, the single pair of first speech data and second speech data having the highest degree of similarity. Then, the association unit 60 associates the speaker information about the one speaker corresponding to the identified first speech data with the second speech data.
- the association unit 60 stores the associated speaker information and the second speech data in the storage unit 300 shown in FIG.
- the association unit 60 may store the associated speaker information and the second speech data in a network server or the like (not shown).
- the association unit 60 assigns an ID to each set of speaker information and second speech data. Then, the association unit 60 stores a set of speaker information and the second speech data in the storage unit 300, the network server, or both together with the ID assigned to the set.
- FIG. 10 is a flowchart showing an example of the operation flow of the voice processing device 5.
- steps S101 to S105 are common to the operation flow described in the first embodiment. Further, the following step S206 is common to the operation flow described in the fourth embodiment.
- When the collation fails (No in S206), the operation flow of the voice processing device 5 returns to step S101.
- When the collation succeeds (Yes in S206), the correction unit 70 corrects the second speech data received from the second speech data generation unit 50 by using the first speech data received from the first speech data generation unit 30 (S307).
- the correction unit 70 may output the corrected second remark data.
- the correction unit 70 transmits the corrected second statement data to the association unit 60.
- The association unit 60 associates the speaker information (for example, the face image data of the speaker) received from the speaker extraction unit 20 with the second speech data corrected by the correction unit 70, and stores them in the storage unit 300 (see FIG. 6) or the like.
- the speaker extraction unit 20 extracts a speaker region from the image.
- the first speech data generation unit 30 generates first speech data indicating the content of the speaker's speech based on the shape of the speaker's lip.
- the second speech data generation unit 50 generates second speech data indicating the content of the speaker's speech based on the voice signal corresponding to the speaker's speech.
- the collation unit 40 collates the first remark data with the second remark data. This makes it possible to perform processing in consideration of the influence of noise on the voice caused by the speech of a person.
- the correction unit 70 corrects the second remark data by using the first remark data. Therefore, the accuracy of reproducing the content of the statement from the audio signal is improved.
- FIG. 11 is a block diagram showing an example of the system configuration.
- the system includes a microphone 100, a camera 200, and a display 400 in addition to the audio processing device 6. All or part of the system according to the sixth embodiment may be realized by an intelligent camera (for example, an IP camera or a network camera having an analysis function inside. Also called a smart camera or the like).
- The voice processing device 6 according to the sixth embodiment further includes a display control unit 80 in addition to the speaker extraction unit 20, the first speech data generation unit 30, the second speech data generation unit 50, and the collation unit 40. That is, the configuration of the voice processing device 6 according to the sixth embodiment differs from the configuration of the voice processing device 1 according to the first embodiment in that the display control unit 80 is provided.
- the microphone 100 collects voices (utterances) from the speaker's remarks and generates a voice signal corresponding to the speaker's remarks.
- the microphone 100 includes one or more microphones.
- the microphone 100 transmits a voice signal corresponding to the voice of the speech to the second speech data generation unit 50.
- the camera 200 is installed in a place to be photographed (for example, in a conference room).
- the camera 200 captures a shooting target location and a person at the shooting target location, and captures time-series image data (for example, a frame image of a moving image for a certain period of time, or a plurality of images captured at predetermined time intervals).
- the collation unit 40 transmits the collation result of the first remark data and the second remark data to the speaker extraction unit 20.
- When the speaker extraction unit 20 receives from the collation unit 40 the result that the first speech data and the second speech data have been successfully collated, the speaker extraction unit 20 generates image data for superimposition (hereinafter referred to as sub-image data) that includes a figure indicating the area containing the speaker. Then, the speaker extraction unit 20 transmits the generated sub-image data to the display control unit 80.
- the speaker extraction unit 20 does not generate sub-image data when it receives a result from the collation unit 40 that the matching between the first speech data and the second speech data has failed.
- the display control unit 80 receives time-series image data from the camera 200. When the display control unit 80 does not receive the sub-image data from the speaker extraction unit 20, the display control unit 80 converts the time-series image data received from the camera 200 into a format that can be displayed on the display 400 and displays it on the display 400. Display the image.
- When the display control unit 80 receives the sub-image data from the speaker extraction unit 20, it superimposes the received sub-image data on the time-series image data, converts the result into a format that can be displayed on the display 400, and displays the superimposed image on the display 400.
- the display control unit 80 is an example of the display control means. Specific examples of the superimposed image will be described below.
- FIG. 12 shows an example of a superimposed image generated by the display control unit 80 and displayed on the display 400.
- a rectangular figure is displayed around the face of the speaker (the person in the upper right in the figure).
- This rectangular figure is an example of the above-mentioned sub-image data. The user can easily identify the speaker by looking at the superimposed image displayed on the display 400.
- The shape and form of the figure indicating the speaker are not limited to the rectangle shown in FIG. 12.
- the figure pointing to the speaker may be an arrow pointing to the speaker.
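The superimposition of a figure around the speaker could be implemented, for example, with OpenCV as in the sketch below; the rectangle color, thickness, and window name are arbitrary choices, and OpenCV itself is only one possible implementation, not one mandated by the text.

```python
# Sketch of the display control of the sixth embodiment: superimpose a
# rectangle (the sub-image data) on the frame around the speaker area and
# show the result.
import cv2

def show_speaker(frame, speaker_region, window="display_400"):
    """speaker_region: (x, y, width, height) of the area containing the speaker."""
    x, y, w, h = speaker_region
    overlaid = frame.copy()
    cv2.rectangle(overlaid, (x, y), (x + w, y + h), (0, 255, 0), thickness=2)
    cv2.imshow(window, overlaid)   # corresponds to displaying on the display 400
    cv2.waitKey(1)
    return overlaid
```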
- the speaker extraction unit 20 extracts a speaker region from the image.
- the first speech data generation unit 30 generates first speech data indicating the content of the speaker's speech based on the shape of the speaker's lip.
- the second speech data generation unit 50 generates second speech data indicating the content of the speaker's speech based on the voice signal corresponding to the speaker's speech.
- the collation unit 40 collates the first remark data with the second remark data.
- the display control unit 80 displays on the display 400 a superimposed image in which a figure indicating an area including a speaker is superimposed on the image. Therefore, the user can easily identify the speaker from the superimposed image on the display 400.
- Each component of the speech processing devices described in the first to sixth example embodiments represents a block of functional units. Some or all of these components are realized by, for example, an information processing apparatus 900 as shown in FIG. 13.
- FIG. 13 is a block diagram showing an example of the hardware configuration of the information processing apparatus 900.
- The information processing apparatus 900 represents, for example, the internal configuration of an intelligent camera.
- The information processing apparatus 900 includes, as an example, the following configuration.
- CPU (Central Processing Unit) 901
- ROM (Read Only Memory) 902
- RAM (Random Access Memory) 903
- Program 904 loaded into the RAM 903
- Storage device 905 that stores the program 904
- Drive device 907 that reads from and writes to the recording medium 906
- Communication interface 908 that connects to the communication network 909
- Input/output interface 910 for inputting and outputting data
- Bus 911 connecting the components
- Each component of the speech processing devices described in the first to sixth example embodiments is realized by the CPU 901 reading and executing the program 904 that implements these functions.
- The program 904 implementing the function of each component is stored in advance in, for example, the storage device 905 or the ROM 902, and the CPU 901 loads it into the RAM 903 and executes it as needed.
- The program 904 may be supplied to the CPU 901 via the communication network 909, or may be stored in advance in the recording medium 906, read out by the drive device 907, and supplied to the CPU 901.
- According to this configuration, the speech processing devices described in the above example embodiments are realized as hardware. Therefore, effects similar to those described in the above example embodiments can be obtained.
- A speech processing device comprising: a speaker extraction means for extracting a region of a speaker from an image; a first speech data generation means for generating first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips; a second speech data generation means for generating second speech data indicating the content of the speaker's speech based on a voice signal corresponding to the speaker's speech; and a collation means for collating the first speech data with the second speech data.
- The speech processing device according to Supplementary note 1, wherein the first speech data generation means includes: a viseme identification means for identifying a viseme from the shape of the speaker's lips; and a viseme-phoneme conversion means for converting the viseme into a phoneme and generating the first speech data including one or more phonemes.
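A viseme generally corresponds to more than one phoneme, so the conversion in the note above can be read as a one-to-many mapping. The following is a minimal sketch; the viseme labels and the mapping table are simplified assumptions, not the mapping used in the disclosure.

```python
# Illustrative viseme-to-phoneme conversion: each identified viseme is mapped
# to its candidate phonemes, and the sequence of candidate sets forms the
# first speech data used for collation. The table below is a toy example.
VISEME_TO_PHONEMES = {
    "bilabial":      ["p", "b", "m"],
    "labiodental":   ["f", "v"],
    "open_vowel":    ["a"],
    "rounded_vowel": ["o", "u"],
}

def visemes_to_first_speech_data(visemes: list[str]) -> list[list[str]]:
    """For each identified viseme, keep the list of candidate phonemes;
    a later collation step may accept any candidate as a match."""
    return [VISEME_TO_PHONEMES.get(v, []) for v in visemes]
```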
- The speech processing device according to Supplementary note 1 or 2, wherein the second speech data generation means includes: a feature extraction means for extracting features from an input voice signal; and a voice signal-phoneme conversion means for converting the features of the voice signal into corresponding phonemes to generate the second speech data including one or more phonemes.
- The speech processing device according to Supplementary note 1 or 2, wherein the second speech data generation means includes a single-sound extraction means for extracting single sounds included in an input voice signal and generating the second speech data including one or more single sounds.
- The speech processing device according to any one of Supplementary notes 1 to 4, wherein the speaker extraction means generates speaker information for identifying the speaker extracted from the image, and the speech processing device further comprises an association means for associating the speaker information with the second speech data based on the result of the collation.
- The speech processing device according to Supplementary note 5, wherein the first speech data generation means generates a plurality of pieces of first speech data based on the shapes of the lips of a plurality of speakers in the image, the collation means collates each of the plurality of pieces of first speech data with the second speech data, and the association means associates the speaker information relating to any one of the plurality of speakers with the second speech data based on the result of the collation.
- The speech processing device according to Supplementary note 5 or 6, wherein the speaker information is at least one of attribute information of the speaker, position information of the speaker, a face image of the speaker, and the first speech data.
- 1 Speech processing device
- 2 Speech processing device
- 3, 3A Speech processing device
- 4, 4A Speech processing device
- 5, 5A Speech processing device
- 6 Speech processing device
- 20 Speaker extraction unit
- 30 First speech data generation unit
- 31 Viseme identification unit
- 32 Viseme-phoneme conversion unit
- 40 Collation unit
- 50 Second speech data generation unit
- 60 Association unit
- 70 Correction unit
- 80 Display control unit
- 250 Second speech data generation unit
- 252 Feature extraction unit
- 253 Voice signal-phoneme conversion unit
- 300 Storage unit
- 350 Second speech data generation unit
- 351 Single-sound extraction unit
- 400 Display
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- Oral & Maxillofacial Surgery (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Theoretical Computer Science (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
- User Interface Of Digital Computer (AREA)
- Image Analysis (AREA)
Abstract
Description
Example Embodiment 1 will be described with reference to FIGS. 1 to 3.
The configuration of the speech processing device 1 according to the first example embodiment will be described with reference to FIG. 1. FIG. 1 is a block diagram showing an example of the configuration of the speech processing device 1. As shown in FIG. 1, the speech processing device 1 includes a speaker extraction unit 20, a first speech data generation unit 30, a collation unit 40, and a second speech data generation unit 50. The function of each unit of the speech processing device 1 according to the first example embodiment (and of the speech processing devices according to the example embodiments described later) may be realized as software by a processor executing a program loaded into a memory, or may be realized as hardware such as an intelligent camera.
Alternatively, as described above, the first speech data generation unit 30 may generate the first speech data directly from the image data of the speaker. For example, the first speech data generation unit 30 may train a model (for example, a neural network) using a deep learning technique so that the corresponding phonemes or voice signal can be determined from the image data of the speaker. In this case, the first speech data generation unit 30 inputs the image data of the speaker to the trained model. The trained model determines the corresponding phonemes or voice signal from the input image data of the speaker and outputs the determination result. The first speech data generation unit 30 generates the first speech data based on the output from the trained model.
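As a concrete illustration of such a trained model, a minimal sketch is given below, assuming PyTorch; the architecture, layer sizes, and the number of phoneme classes are assumptions for illustration only and do not represent the disclosed design.

```python
# Minimal lip-reading model sketch: maps a sequence of mouth-region frames to
# per-frame phoneme scores. Architecture and hyperparameters are illustrative.
import torch
import torch.nn as nn

class LipToPhonemeNet(nn.Module):
    def __init__(self, num_phonemes: int = 40):
        super().__init__()
        # Per-frame visual features from grayscale mouth crops.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # Temporal model over the frame sequence.
        self.rnn = nn.GRU(32 * 4 * 4, 128, batch_first=True)
        self.head = nn.Linear(128, num_phonemes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 1, H, W) mouth crops
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).flatten(1)  # (batch*time, 512)
        out, _ = self.rnn(feats.view(b, t, -1))
        return self.head(out)                                  # per-frame phoneme logits
```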
In this Example 1, the collation unit 40 generates a first feature vector in which the features of the phonemes included in the first speech data are arranged in accordance with the sequence numbers attached to the phonemes. The collation unit 40 also generates a second feature vector in which the features of the phonemes included in the second speech data are arranged in accordance with the sequence numbers attached to the phonemes. For example, the feature vector of a phoneme is the amplitude, power, power spectrum, or Mel-Frequency Cepstrum Coefficients (MFCC) of a standard voice signal representing that phoneme. These feature vectors are obtained by applying various transformations to the voice signal representing the phoneme.
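A minimal sketch of this Example 1 is given below, assuming librosa for MFCC extraction and cosine similarity as the comparison; the similarity measure, the threshold, and the assumption that both pieces of speech data contain the same number of phonemes are all assumptions for illustration.

```python
# Illustrative sketch: build ordered per-phoneme MFCC feature vectors and
# compare them. Cosine similarity and the threshold are assumed choices;
# the vectors are only comparable when both data have the same phoneme count.
import numpy as np
import librosa

def phoneme_feature(reference_audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Feature of one phoneme: mean MFCC of a standard signal for that phoneme."""
    mfcc = librosa.feature.mfcc(y=reference_audio, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)

def feature_vector(phoneme_refs: list[np.ndarray], sr: int = 16000) -> np.ndarray:
    """Concatenate per-phoneme features in sequence-number order."""
    return np.concatenate([phoneme_feature(a, sr) for a in phoneme_refs])

def collate_vectors(v1: np.ndarray, v2: np.ndarray, threshold: float = 0.8) -> bool:
    """Collation succeeds when the two ordered feature vectors are similar enough."""
    cos = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9))
    return cos >= threshold
```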
In this Example 2, the collation unit 40 performs matching between each phoneme included in the first speech data and each phoneme included in the second speech data.
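A minimal sketch of this position-by-position matching is shown below; treating collation as successful when the agreement ratio exceeds a threshold is an assumption.

```python
# Illustrative sketch of Example 2: match phonemes pairwise by their sequence
# numbers and declare the collation successful above an assumed agreement ratio.
def collate_phonemes(first: list[str], second: list[str], threshold: float = 0.7) -> bool:
    pairs = list(zip(first, second))
    if not pairs:
        return False
    matches = sum(1 for a, b in pairs if a == b)
    return matches / len(pairs) >= threshold
```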
In this Example 3, the collation unit 40 regards one or more single sounds that are not distinguished from one another in a given language as the same phoneme. The collation unit 40 performs matching between each of the one or more single sounds regarded as identical to each phoneme included in the first speech data and each single sound included in the second speech data. The matching technique in this Example 3 is the same as in the case where the second speech data is phoneme data.
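The idea of treating indistinguishable single sounds as one phoneme can be sketched with a small lookup table; the table below is a made-up example and the agreement threshold is an assumption.

```python
# Illustrative sketch of Example 3: single sounds that a given language does
# not distinguish are regarded as the same phoneme before matching.
ALLOPHONES = {
    "r": {"r", "l"},   # hypothetical: /r/ and /l/ not distinguished in the language
    "h": {"h", "f"},   # hypothetical: /h/ and /f/ not distinguished in the language
}

def same_phoneme(phoneme: str, single_sound: str) -> bool:
    return single_sound in ALLOPHONES.get(phoneme, {phoneme})

def collate_with_single_sounds(first_phonemes: list[str],
                               second_single_sounds: list[str],
                               threshold: float = 0.7) -> bool:
    pairs = list(zip(first_phonemes, second_single_sounds))
    if not pairs:
        return False
    matches = sum(1 for p, s in pairs if same_phoneme(p, s))
    return matches / len(pairs) >= threshold
```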
In this Example 4, a method of collating the first speech data and the second speech data with each other when each of them is a voice signal will be described.
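One possible way to collate two voice signals is to align their feature sequences; the sketch below uses dynamic time warping over MFCC frames, which is an assumed technique chosen for illustration, and its threshold is arbitrary.

```python
# Illustrative sketch for Example 4: compare two voice signals as MFCC frame
# sequences aligned with dynamic time warping (DTW). DTW and the threshold
# are assumptions; the disclosure does not mandate a specific method.
import numpy as np
import librosa

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """DTW cost between two feature sequences of shape (frames, dims)."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m] / (n + m))

def collate_signals(sig1: np.ndarray, sig2: np.ndarray, sr: int = 16000,
                    threshold: float = 50.0) -> bool:
    f1 = librosa.feature.mfcc(y=sig1, sr=sr, n_mfcc=13).T
    f2 = librosa.feature.mfcc(y=sig2, sr=sr, n_mfcc=13).T
    return dtw_distance(f1, f2) <= threshold
```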
The operation flow of the speech processing device 1 according to the first example embodiment will be described with reference to FIG. 3. FIG. 3 is a flowchart showing an example of the flow of the operation of the speech processing device 1.
According to the configuration of the present example embodiment, the speaker extraction unit 20 extracts a region of a speaker from an image. The first speech data generation unit 30 generates first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips. The second speech data generation unit 50 generates second speech data indicating the content of the speaker's speech based on a voice signal corresponding to the speaker's speech. The collation unit 40 collates the first speech data with the second speech data.
Example Embodiment 2 will be described with reference to FIG. 4. In the second example embodiment, the detailed configuration of the second speech data generation unit is described for the case where the second speech data is phoneme data (that is, the case where the second speech data is generated by the first method).
FIG. 4 is a block diagram showing an example of the configuration of the second speech data generation unit 250 according to the second example embodiment. As shown in FIG. 4, the second speech data generation unit 250 includes a feature extraction unit 252 and a voice signal-phoneme conversion unit 253.
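The two-stage structure of the second speech data generation unit 250 can be sketched as follows; the use of MFCC features and the `acoustic_model.predict` call are assumptions standing in for whatever feature extraction and acoustic model are actually used.

```python
# Illustrative sketch of unit 250: a feature extraction step (unit 252)
# followed by a voice signal-to-phoneme conversion step (unit 253).
import numpy as np
import librosa

def extract_features(signal: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Feature extraction unit 252: frame-level MFCC features, shape (frames, 13)."""
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T

def to_phonemes(features: np.ndarray, acoustic_model) -> list[str]:
    """Voice signal-phoneme conversion unit 253: map each frame to a phoneme and
    collapse consecutive repeats. `acoustic_model.predict` is a hypothetical API."""
    frame_phonemes = list(acoustic_model.predict(features))
    return [p for i, p in enumerate(frame_phonemes)
            if i == 0 or p != frame_phonemes[i - 1]]
```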
According to the configuration of the present example embodiment, the speaker extraction unit 20 extracts a region of a speaker from an image. The first speech data generation unit 30 generates first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips. The second speech data generation unit 250 generates second speech data indicating the content of the speaker's speech based on a voice signal corresponding to the speaker's speech.
Example Embodiment 3 will be described with reference to FIG. 5. In the third example embodiment, the details of the second speech data generation unit are described for the case where the second speech data is single-sound data (that is, the case where the second speech data is generated by the second method).
In the third example embodiment, the second speech data generation unit 350 generates the second speech data from an input voice signal by the third method described in the first example embodiment.
According to the configuration of the present example embodiment, the speaker extraction unit 20 extracts a region of a speaker from an image. The first speech data generation unit 30 generates first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips. The second speech data generation unit 350 generates second speech data indicating the content of the speaker's speech based on a voice signal corresponding to the speaker's speech.
Example Embodiment 4 will be described with reference to FIGS. 6 and 7. In the fourth example embodiment, a configuration is described in which, based on the result of collating the first speech data with the second speech data, the second speech data corresponding to the voice signal corresponding to the speaker's speech is associated with speaker information identifying the speaker.
The configuration of the speech processing device 4 according to the fourth example embodiment will be described with reference to FIG. 6. FIG. 6 is a block diagram showing an example of the configuration of the speech processing device 4.
The operation flow of the speech processing device 4 according to the fourth example embodiment will be described with reference to FIG. 7. FIG. 7 is a flowchart showing an example of the flow of the operation of the speech processing device 4. Since steps S101 to S105 shown in FIG. 7 are the same as in the operation flow described in the first example embodiment, their description is omitted in the fourth example embodiment. The description below starts from step S206 shown in FIG. 7.
According to the configuration of the present example embodiment, the speaker extraction unit 20 extracts a region of a speaker from an image. The first speech data generation unit 30 generates first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips. The second speech data generation unit 50 generates second speech data indicating the content of the speaker's speech based on a voice signal corresponding to the speaker's speech. The collation unit 40 collates the first speech data with the second speech data.
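The association described in this example embodiment can be sketched as follows: among the speakers extracted from the image, the one whose lip-derived first speech data best matches the audio-derived second speech data is associated with that second speech data. The data structure and the scoring are assumptions for illustration.

```python
# Illustrative sketch of the association in Example Embodiment 4.
from dataclasses import dataclass

@dataclass
class SpeakerInfo:
    speaker_id: int
    position: tuple[int, int, int, int]   # bounding box of the speaker in the image

def match_score(first: list[str], second: list[str]) -> float:
    pairs = list(zip(first, second))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 0.0

def associate(candidates: list[tuple[SpeakerInfo, list[str]]],
              second_speech_data: list[str], threshold: float = 0.7):
    """candidates: (speaker_info, first_speech_data) per extracted speaker.
    Returns (speaker_info, second_speech_data) for the best match, or None."""
    best = max(candidates, key=lambda c: match_score(c[1], second_speech_data),
               default=None)
    if best is None or match_score(best[1], second_speech_data) < threshold:
        return None
    return best[0], second_speech_data
```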
Example Embodiment 5 will be described with reference to FIGS. 8 to 10. In the fifth example embodiment, a configuration in which the second speech data is corrected using the first speech data is described.
The configuration of the speech processing device 5 according to the fifth example embodiment will be described with reference to FIG. 8. FIG. 8 is a block diagram showing an example of the configuration of the speech processing device 5. As shown in FIG. 8, the speech processing device 5 further includes a correction unit 70 in addition to the speaker extraction unit 20, the first speech data generation unit 30, the collation unit 40, and the second speech data generation unit 50.
In this Example 1, the correction unit 70 compares a phoneme included in the first speech data with the corresponding phoneme included in the phoneme data that is the second speech data. Corresponding phonemes are phonemes to which the same sequence number has been attached. In particular, the correction unit 70 compares a vowel included in the first speech data with the corresponding vowel included in the phoneme data that is the second speech data.
In this Example 2, the correction unit 70 replaces a phoneme included in the second speech data whose signal-to-noise ratio (S/N) or likelihood is smaller than a threshold with the corresponding phoneme of the first speech data.
In this Example 3, the correction unit 70 adds, to the likelihood of each of a plurality of candidate single sounds included in the second speech data, a weight corresponding to the corresponding phoneme of the first speech data, and selects one of the plurality of candidate single sounds of the second speech data based on the weighted likelihoods.
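The corrections in Examples 2 and 3 can be sketched as follows; the thresholds and the weighting bonus are assumptions, and the functions only illustrate the replacement and re-weighting ideas described above.

```python
# Illustrative sketch of the correction unit 70.

def correct_by_replacement(first, second, likelihoods, threshold=0.5):
    """Example 2: replace second[i] with first[i] when its likelihood (or S/N)
    falls below the threshold."""
    return [f if lh < threshold else s
            for f, s, lh in zip(first, second, likelihoods)]

def correct_by_weighting(first_phoneme, candidates, bonus=0.2):
    """Example 3: candidates is a list of (single_sound, likelihood); boost the
    candidates that agree with the lip-derived phoneme, then pick the best one."""
    weighted = [(s, lh + (bonus if s == first_phoneme else 0.0))
                for s, lh in candidates]
    return max(weighted, key=lambda c: c[1])[0]
```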
A modification of the speech processing device 5 according to the fifth example embodiment will be described with reference to FIG. 9. FIG. 9 is a block diagram showing an example of the configuration of a speech processing device 5A according to the modification.
The operation flow of the speech processing device 5 according to the fifth example embodiment will be described with reference to FIG. 10. FIG. 10 is a flowchart showing an example of the flow of the operation of the speech processing device 5.
According to the configuration of the present example embodiment, the speaker extraction unit 20 extracts a region of a speaker from an image. The first speech data generation unit 30 generates first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips. The second speech data generation unit 50 generates second speech data indicating the content of the speaker's speech based on a voice signal corresponding to the speaker's speech. The collation unit 40 collates the first speech data with the second speech data. This makes it possible to perform processing that takes into account the influence of noise on the voice produced by a person's speech.
Example Embodiment 6 will be described with reference to FIGS. 11 and 12. In the sixth example embodiment, the configuration of a system including a speech processing device 6 is described.
The configuration of the system according to the sixth example embodiment will be described with reference to FIG. 11. FIG. 11 is a block diagram showing an example of the configuration of the system. As shown in FIG. 11, the system includes a microphone 100, a camera 200, and a display 400 in addition to the speech processing device 6. All or part of the system according to the sixth example embodiment may be realized by an intelligent camera (for example, an IP camera or a network camera with a built-in analysis function, also called a smart camera or the like).
The speech processing device 6 according to the sixth example embodiment further includes a display control unit 80 in addition to the speaker extraction unit 20, the first speech data generation unit 30, and the collation unit 40. That is, the configuration of the speech processing device 6 according to the sixth example embodiment differs from that of the speech processing device 1 according to the first example embodiment in that it includes the display control unit 80.
FIG. 12 shows an example of the superimposed image generated by the display control unit 80 and displayed on the display 400.
According to the configuration of the present example embodiment, the speaker extraction unit 20 extracts a region of a speaker from an image. The first speech data generation unit 30 generates first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips. The second speech data generation unit 50 generates second speech data indicating the content of the speaker's speech based on a voice signal corresponding to the speaker's speech. The collation unit 40 collates the first speech data with the second speech data.
Example Embodiment 7 will be described below with reference to FIG. 13.
Each component of the speech processing devices described in the first to sixth example embodiments represents a block of functional units. Some or all of these components are realized by, for example, an information processing apparatus 900 as shown in FIG. 13. FIG. 13 is a block diagram showing an example of the hardware configuration of the information processing apparatus 900. The information processing apparatus 900 represents, for example, the internal configuration of an intelligent camera.
- ROM (Read Only Memory) 902
- RAM (Random Access Memory) 903
- Program 904 loaded into the RAM 903
- Storage device 905 that stores the program 904
- Drive device 907 that reads from and writes to the recording medium 906
- Communication interface 908 that connects to the communication network 909
- Input/output interface 910 for inputting and outputting data
- Bus 911 connecting the components
Each component of the speech processing devices described in the first to sixth example embodiments is realized by the CPU 901 reading and executing the program 904 that implements these functions. The program 904 implementing the function of each component is stored in advance in, for example, the storage device 905 or the ROM 902, and the CPU 901 loads it into the RAM 903 and executes it as needed. The program 904 may be supplied to the CPU 901 via the communication network 909, or may be stored in advance in the recording medium 906, read out by the drive device 907, and supplied to the CPU 901.
According to the configuration of the present example embodiment, the speech processing devices described in the above example embodiments are realized as hardware. Therefore, effects similar to those described in the above example embodiments can be obtained.
Some or all of the above example embodiments (and examples) may also be described as in the following supplementary notes, but the following supplementary notes are merely examples. Some or all of the above example embodiments (and examples) are not limited to the configurations described in the following supplementary notes.
A speech processing device comprising:
a speaker extraction means for extracting a region of a speaker from an image;
a first speech data generation means for generating first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips;
a second speech data generation means for generating second speech data indicating the content of the speaker's speech based on a voice signal corresponding to the speaker's speech; and
a collation means for collating the first speech data with the second speech data.
The speech processing device according to Supplementary note 1, wherein the first speech data generation means includes:
a viseme identification means for identifying a viseme from the shape of the speaker's lips; and
a viseme-phoneme conversion means for converting the viseme into a phoneme and generating the first speech data including one or more phonemes.
The speech processing device according to Supplementary note 1 or 2, wherein the second speech data generation means includes:
a feature extraction means for extracting features from an input voice signal; and
a voice signal-phoneme conversion means for converting the features of the voice signal into corresponding phonemes and generating the second speech data including one or more phonemes.
The speech processing device according to Supplementary note 1 or 2, wherein the second speech data generation means includes a single-sound extraction means for extracting single sounds included in an input voice signal and generating the second speech data including one or more single sounds.
The speech processing device according to any one of Supplementary notes 1 to 4, wherein the speaker extraction means generates speaker information for identifying the speaker extracted from the image, and the speech processing device further comprises an association means for associating the speaker information with the second speech data based on the result of the collation.
The speech processing device according to Supplementary note 5, wherein the first speech data generation means generates a plurality of pieces of first speech data based on the shapes of the lips of a plurality of speakers in the image, the collation means collates each of the plurality of pieces of first speech data with the second speech data, and the association means associates the speaker information relating to any one of the plurality of speakers with the second speech data based on the result of the collation.
The speech processing device according to any one of Supplementary notes 1 to 6, further comprising a correction means for correcting the second speech data using the first speech data when the collation between the first speech data and the second speech data is successful.
The speech processing device according to any one of Supplementary notes 1 to 7, further comprising a display control means for displaying, on a display, a superimposed image in which a figure indicating a region including the speaker is superimposed on the image.
A speech processing method comprising:
extracting a region of a speaker from an image;
generating first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips;
generating second speech data indicating the content of the speaker's speech based on a voice signal corresponding to the speaker's speech; and
collating the first speech data with the second speech data.
A recording medium storing a program for causing a computer to execute:
extracting a region of a speaker from an image;
generating first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips;
generating second speech data indicating the content of the speaker's speech based on a voice signal corresponding to the speaker's speech; and
collating the first speech data with the second speech data.
The speech processing device according to Supplementary note 5 or 6, wherein the speaker information is at least one of attribute information of the speaker, position information of the speaker, a face image of the speaker, and the first speech data.
2 Speech processing device
3, 3A Speech processing device
4, 4A Speech processing device
5, 5A Speech processing device
6 Speech processing device
20 Speaker extraction unit
30 First speech data generation unit
31 Viseme identification unit
32 Viseme-phoneme conversion unit
40 Collation unit
50 Second speech data generation unit
60 Association unit
70 Correction unit
80 Display control unit
250 Second speech data generation unit
252 Feature extraction unit
253 Voice signal-phoneme conversion unit
300 Storage unit
350 Second speech data generation unit
351 Single-sound extraction unit
400 Display
Claims (11)
- A speech processing device comprising: a speaker extraction means for extracting a region of a speaker from an image; a first speech data generation means for generating first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips; a second speech data generation means for generating second speech data indicating the content of the speaker's speech based on a voice signal corresponding to the speaker's speech; and a collation means for collating the first speech data with the second speech data.
- The speech processing device according to claim 1, wherein the first speech data generation means includes: a viseme identification means for identifying a viseme from the shape of the speaker's lips; and a viseme-phoneme conversion means for converting the viseme into a phoneme and generating the first speech data including one or more phonemes.
- The speech processing device according to claim 1 or 2, wherein the second speech data generation means includes: a feature extraction means for extracting features from an input voice signal; and a voice signal-phoneme conversion means for converting the features of the voice signal into corresponding phonemes and generating the second speech data including one or more phonemes.
- The speech processing device according to claim 1 or 2, wherein the second speech data generation means includes a single-sound extraction means for extracting single sounds included in an input voice signal and generating the second speech data including one or more single sounds.
- The speech processing device according to any one of claims 1 to 4, wherein the speaker extraction means generates speaker information for identifying the speaker extracted from the image, and the speech processing device further comprises an association means for associating the speaker information with the second speech data based on the result of the collation.
- The speech processing device according to claim 5, wherein the first speech data generation means generates a plurality of pieces of first speech data based on the shapes of the lips of a plurality of speakers in the image, the collation means collates each of the plurality of pieces of first speech data with the second speech data, and the association means associates the speaker information relating to any one of the plurality of speakers with the second speech data based on the result of the collation.
- The speech processing device according to any one of claims 1 to 6, further comprising a correction means for correcting the second speech data using the first speech data when the collation between the first speech data and the second speech data is successful.
- The speech processing device according to any one of claims 1 to 7, further comprising a display control means for displaying, on a display, a superimposed image in which a figure indicating a region including the speaker is superimposed on the image.
- The speech processing device according to claim 5 or 6, wherein the speaker information is at least one of attribute information of the speaker, position information of the speaker, a face image of the speaker, and the first speech data.
- A speech processing method comprising: extracting a region of a speaker from an image; generating first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips; generating second speech data indicating the content of the speaker's speech based on a voice signal corresponding to the speaker's speech; and collating the first speech data with the second speech data.
- A recording medium storing a program for causing a computer to execute: extracting a region of a speaker from an image; generating first speech data indicating the content of the speaker's speech based on the shape of the speaker's lips; generating second speech data indicating the content of the speaker's speech based on a voice signal corresponding to the speaker's speech; and collating the first speech data with the second speech data.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/630,632 US20220262363A1 (en) | 2019-08-02 | 2020-07-29 | Speech processing device, speech processing method, and recording medium |
BR112022001300A BR112022001300A2 (pt) | 2019-08-02 | 2020-07-29 | Dispositivo de processamento de fala, método de processamento de fala, e mídia de gravação |
EP20850688.1A EP4009629A4 (en) | 2019-08-02 | 2020-07-29 | VOICE PROCESSING DEVICE, METHOD AND RECORDING MEDIUM |
JP2021537252A JP7347511B2 (ja) | 2019-08-02 | 2020-07-29 | 音声処理装置、音声処理方法、およびプログラム |
CN202080055074.9A CN114175147A (zh) | 2019-08-02 | 2020-07-29 | 语音处理设备、语音处理方法和记录介质 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2019-142951 | 2019-08-02 | ||
JP2019142951 | 2019-08-02 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021024869A1 true WO2021024869A1 (ja) | 2021-02-11 |
Family
ID=74503621
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2020/028955 WO2021024869A1 (ja) | 2019-08-02 | 2020-07-29 | 音声処理装置、音声処理方法、および記録媒体 |
Country Status (6)
Country | Link |
---|---|
US (1) | US20220262363A1 (ja) |
EP (1) | EP4009629A4 (ja) |
JP (1) | JP7347511B2 (ja) |
CN (1) | CN114175147A (ja) |
BR (1) | BR112022001300A2 (ja) |
WO (1) | WO2021024869A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116110373A (zh) * | 2023-04-12 | 2023-05-12 | 深圳市声菲特科技技术有限公司 | 智能会议系统的语音数据采集方法及相关装置 |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08187368A (ja) * | 1994-05-13 | 1996-07-23 | Matsushita Electric Ind Co Ltd | ゲーム装置、入力装置、音声選択装置、音声認識装置及び音声反応装置 |
WO2007114346A1 (ja) * | 2006-03-30 | 2007-10-11 | Honda Moter Co., Ltd. | 音声認識装置 |
JP2010262424A (ja) * | 2009-05-01 | 2010-11-18 | Nikon Corp | 車載カメラシステム |
JP2011013731A (ja) | 2009-06-30 | 2011-01-20 | Sony Corp | 情報処理装置、情報処理方法、およびプログラム |
JP2018091954A (ja) * | 2016-12-01 | 2018-06-14 | オリンパス株式会社 | 音声認識装置、及び音声認識方法 |
JP2019125927A (ja) * | 2018-01-17 | 2019-07-25 | 株式会社Jvcケンウッド | 表示制御装置、通信装置、表示制御方法およびプログラム |
JP2019142951A (ja) | 2015-08-20 | 2019-08-29 | ソル − ゲル テクノロジーズ リミテッド | 過酸化ベンゾイルおよびアダパレンを含む局所投与用組成物 |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS59182687A (ja) * | 1983-04-01 | 1984-10-17 | Nippon Telegr & Teleph Corp <Ntt> | 静止画像通信会議方式 |
US5528728A (en) * | 1993-07-12 | 1996-06-18 | Kabushiki Kaisha Meidensha | Speaker independent speech recognition system and method using neural network and DTW matching technique |
JP2004024863A (ja) * | 1994-05-13 | 2004-01-29 | Matsushita Electric Ind Co Ltd | 口唇認識装置および発生区間認識装置 |
AU2001296459A1 (en) * | 2000-10-02 | 2002-04-15 | Clarity, L.L.C. | Audio visual speech processing |
US7257538B2 (en) * | 2002-10-07 | 2007-08-14 | Intel Corporation | Generating animation from visual and audio input |
US20050047664A1 (en) * | 2003-08-27 | 2005-03-03 | Nefian Ara Victor | Identifying a speaker using markov models |
JP4462339B2 (ja) * | 2007-12-07 | 2010-05-12 | ソニー株式会社 | 情報処理装置、および情報処理方法、並びにコンピュータ・プログラム |
US8798311B2 (en) * | 2009-01-23 | 2014-08-05 | Eldon Technology Limited | Scrolling display of electronic program guide utilizing images of user lip movements |
JP2011186351A (ja) * | 2010-03-11 | 2011-09-22 | Sony Corp | 情報処理装置、および情報処理方法、並びにプログラム |
JP5849761B2 (ja) * | 2012-02-22 | 2016-02-03 | 日本電気株式会社 | 音声認識システム、音声認識方法および音声認識プログラム |
US9940932B2 (en) * | 2016-03-02 | 2018-04-10 | Wipro Limited | System and method for speech-to-text conversion |
US11456005B2 (en) * | 2017-11-22 | 2022-09-27 | Google Llc | Audio-visual speech separation |
US20190371318A1 (en) * | 2018-02-15 | 2019-12-05 | DMAI, Inc. | System and method for adaptive detection of spoken language via multiple speech models |
- 2020
- 2020-07-29 CN CN202080055074.9A patent/CN114175147A/zh active Pending
- 2020-07-29 EP EP20850688.1A patent/EP4009629A4/en not_active Withdrawn
- 2020-07-29 WO PCT/JP2020/028955 patent/WO2021024869A1/ja unknown
- 2020-07-29 JP JP2021537252A patent/JP7347511B2/ja active Active
- 2020-07-29 BR BR112022001300A patent/BR112022001300A2/pt unknown
- 2020-07-29 US US17/630,632 patent/US20220262363A1/en active Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08187368A (ja) * | 1994-05-13 | 1996-07-23 | Matsushita Electric Ind Co Ltd | ゲーム装置、入力装置、音声選択装置、音声認識装置及び音声反応装置 |
WO2007114346A1 (ja) * | 2006-03-30 | 2007-10-11 | Honda Moter Co., Ltd. | 音声認識装置 |
JP2010262424A (ja) * | 2009-05-01 | 2010-11-18 | Nikon Corp | 車載カメラシステム |
JP2011013731A (ja) | 2009-06-30 | 2011-01-20 | Sony Corp | 情報処理装置、情報処理方法、およびプログラム |
JP2019142951A (ja) | 2015-08-20 | 2019-08-29 | ソル − ゲル テクノロジーズ リミテッド | 過酸化ベンゾイルおよびアダパレンを含む局所投与用組成物 |
JP2018091954A (ja) * | 2016-12-01 | 2018-06-14 | オリンパス株式会社 | 音声認識装置、及び音声認識方法 |
JP2019125927A (ja) * | 2018-01-17 | 2019-07-25 | 株式会社Jvcケンウッド | 表示制御装置、通信装置、表示制御方法およびプログラム |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116110373A (zh) * | 2023-04-12 | 2023-05-12 | 深圳市声菲特科技技术有限公司 | 智能会议系统的语音数据采集方法及相关装置 |
CN116110373B (zh) * | 2023-04-12 | 2023-06-09 | 深圳市声菲特科技技术有限公司 | 智能会议系统的语音数据采集方法及相关装置 |
Also Published As
Publication number | Publication date |
---|---|
JP7347511B2 (ja) | 2023-09-20 |
EP4009629A1 (en) | 2022-06-08 |
US20220262363A1 (en) | 2022-08-18 |
EP4009629A4 (en) | 2022-09-21 |
CN114175147A (zh) | 2022-03-11 |
BR112022001300A2 (pt) | 2022-03-22 |
JPWO2021024869A1 (ja) | 2021-02-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10878824B2 (en) | Speech-to-text generation using video-speech matching from a primary speaker | |
CN108305615B (zh) | 一种对象识别方法及其设备、存储介质、终端 | |
Aleksic et al. | Audio-visual biometrics | |
JP4867804B2 (ja) | 音声認識装置及び会議システム | |
US5621857A (en) | Method and system for identifying and recognizing speech | |
US20020116197A1 (en) | Audio visual speech processing | |
JP6654611B2 (ja) | 成長型対話装置 | |
Tao et al. | End-to-end audiovisual speech activity detection with bimodal recurrent neural models | |
JP2001092974A (ja) | 話者認識方法及びその実行装置並びに音声発生確認方法及び装置 | |
KR20010102549A (ko) | 화자 인식 방법 및 장치 | |
Potamianos et al. | Audio and visual modality combination in speech processing applications | |
US6546369B1 (en) | Text-based speech synthesis method containing synthetic speech comparisons and updates | |
WO2021024869A1 (ja) | 音声処理装置、音声処理方法、および記録媒体 | |
KR102113879B1 (ko) | 참조 데이터베이스를 활용한 화자 음성 인식 방법 및 그 장치 | |
JP4469880B2 (ja) | 音声処理装置及び方法 | |
JP7511374B2 (ja) | 発話区間検知装置、音声認識装置、発話区間検知システム、発話区間検知方法及び発話区間検知プログラム | |
US11043212B2 (en) | Speech signal processing and evaluation | |
Malcangi et al. | Audio-visual fuzzy fusion for robust speech recognition | |
Hazen et al. | Multimodal face and speaker identification for mobile devices | |
JP2002372992A (ja) | 話者識別方法 | |
Belete | College of Natural Sciences | |
US20240079027A1 (en) | Synthetic voice detection method based on biological sound, recording medium and apparatus for performing the same | |
KR102661005B1 (ko) | 다채널 다화자 환경에서 화자별 음원분리장치 및 방법 | |
JP6730636B2 (ja) | 情報処理装置,制御プログラムおよび制御方法 | |
Tao | Advances in Audiovisual Speech Processing for Robust Voice Activity Detection and Automatic Speech Recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20850688 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2021537252 Country of ref document: JP Kind code of ref document: A |
|
REG | Reference to national code |
Ref country code: BR Ref legal event code: B01A Ref document number: 112022001300 Country of ref document: BR |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 2020850688 Country of ref document: EP Effective date: 20220302 |
|
ENP | Entry into the national phase |
Ref document number: 112022001300 Country of ref document: BR Kind code of ref document: A2 Effective date: 20220124 |