WO2023276159A1 - Signal processing device, signal processing method, and signal processing program - Google Patents
Signal processing device, signal processing method, and signal processing program
- Publication number
- WO2023276159A1 (PCT/JP2021/025207)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- utterance
- speech
- recognition results
- speech recognition
- time
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
Definitions
- the present invention relates to a signal processing device, a signal processing method, and a signal processing program.
- One application of speech recognition is recognizing user utterances in interactions such as discussions and meetings.
- In such settings, each user's voice is captured by that user's own device and then recognized.
- Each user picks up sound using the microphone of the computer they are using, or a microphone connected to that computer.
- Each user's voice, collected by an individual device, is generally recognized on a server and provided to the users as minutes or real-time subtitles.
- Voice activity detection (VAD) is a widely used technology for detecting the intervals in which speech is present.
- However, because voice activity detection only distinguishes speech from non-speech, it cannot by itself reject the speech of other speakers that should not be recognized, as described above.
- In the technique described in Non-Patent Document 1, features that capture the relationship between microphone signals, such as the energy ratio between microphones, are used in addition to acoustic features to reject voices other than that of the speaker corresponding to each microphone.
- The technique described in Non-Patent Document 2 rejects voices other than that of the speaker corresponding to the microphone based on the correlation between microphones.
- In Non-Patent Document 3, a method is proposed in which each microphone signal is treated independently, without assuming synchronization between microphones, and only the voice of the person wearing the microphone is extracted from the input signal using a deep neural network. However, other literature points out that methods which process each microphone independently, without using the signals of the other microphones, perform poorly at detecting only the wearer's voice. In addition, the technique described in Non-Patent Document 3 restricts the devices that can be worn, and is not well suited to the general case in which microphones differ from user to user.
- Non-Patent Document 4 proposes an algorithm that reduces overlap between speakers in speech recognition results that occurs as a result of speaker diarization.
- The algorithm described in Non-Patent Document 4 compares the speech recognition results of each pair of utterances whose intervals, from start time to end time, overlap; when the word matching rate of the speech recognition results exceeds a threshold, it determines that both are recognition results of the same utterance and rejects the shorter one.
- In this way, the algorithm described in Non-Patent Document 4 deduplicates results in speaker diarization.
- In Non-Patent Document 4, the similarity s(W_i, W_j) of speech recognition results is expressed by Equation (1), where:
- W_i is the word string of utterance i,
- W_j is the word string of utterance j,
- |·| is the length of a word string, and
- d(·) is the Levenshtein distance.
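Equation (1) itself is not reproduced in this text; the sketch below shows a Levenshtein-based word-level similarity consistent with these definitions. The normalization by the longer word string is an assumption for illustration — the exact form in Non-Patent Document 4 may differ.

```python
def levenshtein(a, b):
    """Edit distance d(a, b) between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def word_similarity(w_i, w_j):
    """Word-level similarity of two recognition results in the spirit of Equation (1).

    Normalizing d(W_i, W_j) by the longer word string is an assumed
    normalization, not necessarily the one used in Non-Patent Document 4.
    """
    if not w_i and not w_j:
        return 1.0
    return 1.0 - levenshtein(w_i, w_j) / max(len(w_i), len(w_j))
```

Passing word lists rather than raw strings makes this a word-level comparison, which is exactly the granularity the kana/phoneme approach of the present invention later replaces.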
- The technique of Non-Patent Document 4 has the limitation that, because wraparound speech is recognized in fragments, speech recognition errors and erroneous kana-kanji conversions tend to occur. For this reason, when words containing kana and kanji are compared at the word level, the similarity is often not calculated correctly. Specific examples include "misread" and "wait and see".
- The present invention has been made in view of the above. An object of the present invention is to provide a signal processing device, a signal processing method, and a signal processing program capable of rejecting speech recognition results caused by the speech of other speakers.
- To solve the above problem, a signal processing apparatus according to the present invention receives, together with the speech recognition results of the utterance sections of the utterances input to a plurality of microphones, time information giving the start time and end time of each utterance and information on the appearance time of each word in the speech recognition results, and combines the speech recognition results of two utterances from the speech recognition results of the utterance sections into pairs.
- The apparatus comprises: a first detection unit that detects, for each pair of speech recognition results, whether the times of the utterance sections overlap; a calculation unit that calculates, for each pair whose utterance sections overlap in time, the similarity of the speech recognition results in units of kana or phonemes; and a rejection unit that compares the similarity with a predetermined threshold for each pair whose utterance sections overlap in time and, for pairs whose similarity exceeds the threshold, rejects the utterance with the shorter speech recognition result as wraparound speech.
- According to the present invention, when each speaker has a microphone and the voice picked up by each microphone is recognized, speech recognition results caused by the utterances of other speakers can be rejected.
- FIG. 1 is a diagram schematically showing an example of the configuration of a signal processing device according to an embodiment.
- FIG. 2 is a diagram schematically showing an example of the configuration of the wraparound utterance rejection unit shown in FIG. 1.
- FIG. 3 is a flow chart showing a processing procedure of signal processing according to the embodiment.
- FIG. 4 is a flow chart showing the procedure of the wraparound utterance rejection process shown in FIG. 3.
- FIG. 5 is a diagram showing performance evaluation results when the signal processing device according to the embodiment is applied.
- FIG. 6 is a diagram illustrating an example of a computer that implements a signal processing device by executing a program.
- In the present embodiment, the speech recognition results of two utterances are combined into pairs, and the following three processes are performed for each pair whose utterance sections overlap in time.
- The similarity calculation is performed not word by word but on the kana or phonemes of the speech recognition results for pairs with overlapping utterance sections, which eliminates errors caused by erroneous conversion of the recognition results and achieves a robust comparison.
- FIG. 1 is a diagram schematically showing an example of the configuration of a signal processing device according to an embodiment.
- The signal processing device 100 is realized by, for example, a computer including a ROM (Read Only Memory), a RAM (Random Access Memory), and a CPU (Central Processing Unit), with the CPU reading and executing a predetermined program. The signal processing device 100 also has a communication interface for transmitting and receiving various information to and from other devices connected by wire or via a network.
- The signal processing device 100 assumes that each of the speakers 1 to N has a microphone, and recognizes the voice (microphone signal) picked up by each microphone. Note that the signal processing device 100 assumes time synchronization between microphones on the order of several hundred milliseconds. The signal processing apparatus 100 includes speech section detection units 101-1 to 101-N (second detection unit), speech recognition units 102-1 to 102-N, and a wraparound utterance rejection unit 103.
- The speech section detection units 101-1 to 101-N use voice activity detection technology to detect and extract, from each continuously input microphone signal, the speech sections in which an utterance is present.
- The speech section detection units 101-1 to 101-N output the speech section of each utterance to the corresponding speech recognition units 102-1 to 102-N.
- Existing voice activity detection techniques can be applied in the speech section detection units 101-1 to 101-N.
- The speech section detection units 101-1 to 101-N perform speech section detection processing on the microphone signals of microphones 1, 2, ..., N, respectively.
- The speech recognition units 102-1 to 102-N perform speech recognition on the speech section of each utterance input from the corresponding speech section detection units 101-1 to 101-N.
- Existing speech recognition technology can be applied in the speech recognition units 102-1 to 102-N.
- The speech recognition units 102-1 to 102-N output the speech recognition results to the wraparound utterance rejection unit 103.
- The output speech recognition result consists of the text of the recognition result and time information indicating at what time each word in the text was uttered. That is, the output of the speech recognition units 102-1 to 102-N is, for each utterance section of the utterances input to the microphones of speakers 1 to N, the text of the speech recognition result, time information giving the start time and end time of the utterance, and the appearance time of each word in the text.
- The wraparound utterance rejection unit 103 receives the text of the speech recognition result of each utterance section of the utterances input to microphones 1 to N, the time information of the start time and end time of each utterance, and the information on the appearance time of each word in the recognition results. Based on these, it detects utterances into which another speaker's voice appears to have leaked, and rejects them.
- The wraparound utterance rejection unit 103 obtains a speech recognition result for each speaker's own utterances by rejecting wraparound utterances from the recognition results corresponding to each microphone.
- Specifically, the wraparound utterance rejection unit 103 combines the speech recognition results of two utterances from the recognition results of the utterance sections into pairs, and detects, for each pair, whether the times of the utterance sections overlap. Then, for each pair of recognition results whose utterance sections overlap in time, it calculates the similarity of the recognition results not word by word but in units of kana or phonemes, and thereby rejects utterances considered to be wraparound. Finally, the wraparound utterance rejection unit 103 outputs the speech recognition results corresponding to the utterances actually made by speakers 1 to N.
- FIG. 2 is a diagram schematically showing an example of the configuration of the wraparound utterance rejection unit 103 shown in FIG. 1.
- The wraparound utterance rejection unit 103 has a same-timing utterance detection unit 1031 (first detection unit), an utterance similarity calculation unit 1032 (calculation unit), and a rejection unit 1033.
- The same-timing utterance detection unit 1031 receives, from the speech recognition units 102-1 to 102-N, the speech recognition result of each utterance section of the utterances input to microphones 1 to N, together with the information accompanying each recognition result.
- The information accompanying a speech recognition result is the time information of the start time and end time of the utterance and the information on the appearance time of each word in the recognition result.
- The same-timing utterance detection unit 1031 combines the speech recognition results of two utterances from the recognition results of the input utterance sections to form a pair.
- The same-timing utterance detection unit 1031 creates multiple such pairs of recognition results.
- The same-timing utterance detection unit 1031 then detects, for each pair of recognition results of two utterances, whether the times of the utterance sections overlap. This is because, in a pair of recognition results whose utterance times overlap, one of the results may be a recognition result of wraparound speech.
- Concretely, the same-timing utterance detection unit 1031 determines that the utterance sections of a pair overlap in time when, in the time information of the two input recognition results, the interval from the start time to the end time of one utterance overlaps that of the other.
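The overlap test described here amounts to a standard interval-intersection check on start and end times; a minimal sketch (the tuple layout is illustrative, not the patent's data format):

```python
def segments_overlap(start_i, end_i, start_j, end_j):
    """True if utterance sections [start_i, end_i) and [start_j, end_j) overlap in time."""
    return start_i < end_j and start_j < end_i

def overlapping_pairs(utterances):
    """Index pairs of utterances whose sections overlap in time.

    `utterances` is a list of (start, end) tuples in seconds.
    """
    pairs = []
    for i in range(len(utterances)):
        for j in range(i + 1, len(utterances)):
            if segments_overlap(*utterances[i], *utterances[j]):
                pairs.append((i, j))
    return pairs
```

Only the pairs returned here proceed to the similarity calculation; non-overlapping utterances are never compared.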
- The utterance similarity calculation unit 1032 calculates the similarity of the speech recognition results using methods that apply the following first to third features. The first to third features can all be applied together, or each can be applied independently.
- the utterance similarity calculation unit 1032 compares the kana or phoneme sequences of the speech recognition results of the utterances to be compared, thereby calculating the similarity of the speech recognition results in kana or phoneme units.
- the utterance similarity calculation unit 1032 can realize similarity calculation that is robust against errors based on erroneous conversion of the speech recognition result by comparing the speech recognition results not in units of words but in units of kana or phonemes.
- Second, the utterance similarity calculation unit 1032 calculates the similarity using the overlap rate of the utterance sections of the two utterances, so that a high similarity is not computed when only a small part of the utterances overlaps.
- Third, the utterance similarity calculation unit 1032 uses the information on the time at which each word or kana was uttered, obtained from the speech recognition result, and computes the similarity by comparing only the portions that were uttered at the same time, achieving a more robust comparison. Conventionally, even when only part of the utterance sections of the utterances being compared overlaps, the entire recognition results are compared, and the similarity can become unduly high. In contrast, the utterance similarity calculation unit 1032 calculates the similarity more accurately by comparing only the portions of the recognition results that can be determined to have been uttered at the same time.
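Using the per-word appearance times that accompany each recognition result, the portion uttered during the overlap can be sliced out before comparison. A sketch, assuming each word arrives as a (kana, start, end) triple — an illustrative format, not the patent's:

```python
def overlap_window(seg_i, seg_j):
    """Intersection of two (start, end) utterance sections, or None if disjoint."""
    start, end = max(seg_i[0], seg_j[0]), min(seg_i[1], seg_j[1])
    return (start, end) if start < end else None

def kana_in_window(words, window):
    """Concatenated kana of the words whose appearance time falls inside the window.

    `words` is a list of (kana, start, end) triples from the recognizer.
    A word counts as inside the window if its midpoint lies within it
    (the midpoint rule is an assumption for illustration).
    """
    lo, hi = window
    return "".join(kana for kana, s, e in words if lo <= (s + e) / 2 <= hi)
```

The similarity is then computed on the two sliced kana strings rather than on the full recognition results.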
- The utterance similarity calculation unit 1032 calculates the similarity s(c_i, c_j) of the speech recognition results using, for example, Equation (2). Equation (2) applies all of the first to third features.
- c_i and c_j are the kana or phoneme strings of the portions of the recognition results of utterance i and utterance j uttered during the time when the two utterances overlap.
- overlap(t_i, t_j) is the overlap rate of the utterance sections of utterance i and utterance j.
- the overlap rate of the utterance segment can be obtained, for example, by dividing the length of overlap between utterance i and utterance j by the length of the shorter one of utterance i and utterance j.
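The overlap rate just described can be sketched directly: the length of the intersection divided by the length of the shorter utterance.

```python
def overlap_rate(seg_i, seg_j):
    """overlap(t_i, t_j): intersection length over the shorter utterance's length.

    `seg_i` and `seg_j` are (start, end) pairs in seconds.
    """
    inter = min(seg_i[1], seg_j[1]) - max(seg_i[0], seg_j[0])
    if inter <= 0:
        return 0.0  # no temporal overlap
    shorter = min(seg_i[1] - seg_i[0], seg_j[1] - seg_j[0])
    return inter / shorter
```

A value of 1.0 means the shorter utterance lies entirely inside the longer one — the typical shape of a wraparound fragment — while a small value indicates only a brief brush of overlap.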
- d(·) is the distance between speech recognition results; for example, the Levenshtein distance can be used.
- |·| denotes the length of a character string.
- Equation (3), which appears in Equation (2), indicates how many of the characters in the shorter of the two overlapping recognition results match the longer one.
- overlap(t_i, t_j) weights the quantity in Equation (3) by the temporal overlap rate of the utterance sections.
- The rejection unit 1033 compares the similarity calculated for each pair whose utterance sections overlap in time with a predetermined threshold, thereby determining whether a wraparound utterance is included, and rejects the wraparound utterance. For pairs whose similarity calculated by the utterance similarity calculation unit 1032 exceeds the threshold, the rejection unit 1033 determines that the utterance with the shorter speech recognition result is a wraparound utterance and rejects it.
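The rejection step can be sketched as follows, assuming a similarity function is supplied from outside. The dict layout and the threshold value 0.5 are illustrative assumptions, not values from the patent.

```python
def reject_wraparound(utterances, similarity, threshold=0.5):
    """Indices of utterances rejected as wraparound speech.

    `utterances` is a list of dicts with keys "start", "end", "kana"
    (an illustrative layout); `similarity` takes two such dicts and
    returns a score. For each time-overlapping pair whose similarity
    exceeds the threshold, the utterance with the shorter recognition
    result is rejected.
    """
    rejected = set()
    n = len(utterances)
    for i in range(n):
        for j in range(i + 1, n):
            u, v = utterances[i], utterances[j]
            if u["start"] < v["end"] and v["start"] < u["end"]:  # sections overlap
                if similarity(u, v) > threshold:
                    # reject the shorter recognition result as wraparound
                    rejected.add(i if len(u["kana"]) < len(v["kana"]) else j)
    return rejected
```

In the device, the similarity function would be the kana/phoneme-level, overlap-weighted score of Equation (2); here any callable with the same shape can be plugged in for testing.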
- FIG. 3 is a flow chart showing a processing procedure of signal processing according to the embodiment.
- Upon receiving the microphone signals picked up by the microphones of speakers 1 to N, the speech section detection units 101-1 to 101-N use voice activity detection technology to perform speech section detection processing, extracting from each continuously input microphone signal the sections in which an utterance is present (step S1).
- The speech recognition units 102-1 to 102-N perform speech recognition processing on the speech sections input from the speech section detection units 101-1 to 101-N (step S2).
- The wraparound utterance rejection unit 103 then detects utterances that appear to contain another speaker's voice, based on the text of the speech recognition result of each utterance section of the utterances input to microphones 1 to N, the time information of the start time and end time of each utterance, and the information on the appearance time of each word in the recognition results.
- FIG. 4 is a flow chart showing the procedure of the wraparound utterance rejection process shown in FIG. 3.
- The same-timing utterance detection unit 1031 obtains, from the speech recognition units 102-1 to 102-N, the speech recognition results of the utterance sections of the utterances input to microphones 1 to N, together with the accompanying information.
- It combines the speech recognition results of the input utterance sections into pairs of two.
- The same-timing utterance detection unit 1031 then performs same-timing utterance detection processing, detecting for each pair of recognition results whether the times of the utterance sections overlap (step S11).
- For each pair of recognition results whose utterance sections overlap, the utterance similarity calculation unit 1032 performs utterance similarity calculation processing, calculating the similarity of the recognition results by comparing the kana or phoneme strings of the utterances being compared (step S12).
- The rejection unit 1033 compares the similarity calculated for each pair whose utterance sections overlap in time with a predetermined threshold, determines whether a wraparound utterance is included, and performs rejection processing to reject the wraparound utterance (step S13).
- FIG. 5 is a diagram showing performance evaluation results when the signal processing apparatus 100 according to the embodiment is applied.
- FIG. 5 shows the results of evaluating the speech recognition character error rate (CER).
- FIG. 5 shows the evaluation results when speech is processed using VAD alone and when speech is processed using the technique described in Non-Patent Document 4.
- (1) in FIG. 5 shows the evaluation result when, for each pair of recognition results of two utterances, the similarity is calculated in kana units and wraparound utterances are rejected (first feature).
- (2) in FIG. 5 shows the evaluation result when, in addition, the similarity is weighted by the overlap rate of the utterance sections (combination of the first and second features). (3) in FIG. 5 shows the evaluation result when, further, only the portions of the recognition results determined to have been uttered at the same time are compared when rejecting wraparound utterances (combination of the first to third features).
- As shown in FIG. 5, the signal processing device 100 shows higher speech recognition performance than processing with VAD alone or with the technique described in Non-Patent Document 4, for both headset recording and stand-microphone recording. That is, the signal processing apparatus 100 can appropriately reject wraparound speech. Moreover, applying the first to third features further improves the rejection accuracy for wraparound speech.
- As described above, the signal processing apparatus 100 combines the speech recognition results of two utterances, from the recognition results of the utterance sections of the utterances input to a plurality of microphones, into pairs, and detects for each pair whether the times of the utterance sections overlap. Then, for each pair of recognition results whose utterance sections overlap in time, the signal processing apparatus 100 calculates the similarity of the recognition results in units of kana or phonemes.
- The signal processing apparatus 100 compares the similarity with a predetermined threshold for each pair whose utterance sections overlap in time, and for pairs of recognition results whose similarity exceeds the threshold, rejects the utterance with the shorter recognition result as a wraparound utterance.
- In this way, the signal processing device 100 performs the similarity calculation for each overlapping pair not word by word but in units of kana or phonemes of the recognition results.
- The signal processing apparatus 100 can therefore perform a comparison that is robust against errors caused by erroneous conversion in the recognition results, and reject wraparound utterances with high accuracy.
- Here, the technique described in Non-Patent Document 4 compares, and may reject, utterances whose utterance sections overlap even slightly. For this reason, it also has the limitation that utterances are occasionally rejected erroneously even though they overlap only partially. For example, when one speaker says "It's tough, isn't it?" and another speaker says "It's tough" with their utterance sections slightly overlapping, the technique described in Non-Patent Document 4 erroneously rejects one of them because the two recognition results are highly similar.
- In contrast, the signal processing device 100 performs the similarity calculation taking into account the overlap rate of the utterance sections for each overlapping pair.
- The signal processing apparatus 100 therefore does not compute a high similarity when only a small portion of the utterances overlaps, and can reduce such erroneous rejections.
- Also, the technique of Non-Patent Document 4 considers only the degree of matching between words and not the times at which they appear, so it has the limitation that an utterance can be erroneously rejected when the similarity is computed over the whole recognition results. For example, when two recognition results containing the same phrase, such as "Did you see the movie?", but uttered at different times are compared, one may be erroneously rejected.
- In contrast, the signal processing apparatus 100 calculates the similarity by comparing, for each pair whose utterance sections overlap in time, only the portions of the recognition results determined to have been uttered at the same time, thereby reducing the erroneous rejection of non-wraparound speech.
- As described above, according to the present embodiment, when each speaker has a microphone and the voice picked up by each microphone is recognized, speech recognition results caused by other speakers' voices can be appropriately rejected, and speech recognition performance can be improved.
- Each component of the signal processing device 100 is functionally conceptual and need not be physically configured as illustrated. That is, the specific form of distribution and integration of the functions of the signal processing device 100 is not limited to the illustrated one; all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions.
- each process performed in the signal processing device 100 may be realized by a CPU, a GPU (Graphics Processing Unit), and a program that is analyzed and executed by the CPU and GPU. Further, each process performed in the signal processing device 100 may be realized as hardware by wired logic.
- FIG. 6 is a diagram showing an example of a computer that implements the signal processing device 100 by executing a program.
- the computer 1000 has a memory 1010 and a CPU 1020, for example.
- Computer 1000 also has hard disk drive interface 1030 , disk drive interface 1040 , serial port interface 1050 , video adapter 1060 and network interface 1070 . These units are connected by a bus 1080 .
- the memory 1010 includes a ROM 1011 and a RAM 1012.
- the ROM 1011 stores a boot program such as BIOS (Basic Input Output System).
- Hard disk drive interface 1030 is connected to hard disk drive 1090 .
- a disk drive interface 1040 is connected to the disk drive 1100 .
- a removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100 .
- Serial port interface 1050 is connected to mouse 1110 and keyboard 1120, for example.
- Video adapter 1060 is connected to display 1130, for example.
- the hard disk drive 1090 stores an OS (Operating System) 1091, application programs 1092, program modules 1093, and program data 1094, for example. That is, a program that defines each process of the signal processing device 100 is implemented as a program module 1093 in which code executable by the computer 1000 is described. Program modules 1093 are stored, for example, on hard disk drive 1090 .
- the hard disk drive 1090 stores a program module 1093 for executing processing similar to the functional configuration of the signal processing device 100 .
- the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
- the setting data used in the processing of the above-described embodiment is stored as program data 1094 in the memory 1010 or the hard disk drive 1090, for example. Then, the CPU 1020 reads out the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 to the RAM 1012 as necessary and executes them.
- the program modules 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in a removable storage medium, for example, and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program modules 1093 and program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Program modules 1093 and program data 1094 may then be read by CPU 1020 through network interface 1070 from other computers.
Abstract
Description
In the present embodiment, the following three processes make it possible, when each speaker has a microphone and speech recognition is performed on the voice picked up by each microphone, to accurately reject the speech recognition results caused by the wraparound of other speakers' voices (wraparound utterances).
Next, the signal processing device according to the embodiment will be described. FIG. 1 is a diagram schematically showing an example of the configuration of the signal processing device according to the embodiment.
Next, the wraparound utterance rejection unit 103 will be described. FIG. 2 is a diagram schematically showing an example of the configuration of the wraparound utterance rejection unit 103 shown in FIG. 1. As shown in FIG. 2, the wraparound utterance rejection unit 103 has a same-timing utterance detection unit 1031 (first detection unit), an utterance similarity calculation unit 1032 (calculation unit), and a rejection unit 1033.
Next, the signal processing executed by the signal processing device 100 will be described. FIG. 3 is a flowchart showing the processing procedure of the signal processing according to the embodiment.
Next, the processing procedure of the wraparound utterance rejection process (step S3) shown in FIG. 3 will be described. FIG. 4 is a flowchart showing the processing procedure of the wraparound utterance rejection process shown in FIG. 3.
FIG. 5 is a diagram showing performance evaluation results when the signal processing device 100 according to the embodiment is applied. FIG. 5 shows the results of evaluating the speech recognition character error rate (CER). FIG. 5 shows the evaluation results when speech is processed with VAD alone and when speech is processed using the technique described in Non-Patent Document 4.
As described above, the signal processing device 100 according to the embodiment detects, for each pair of speech recognition results formed by combining the recognition results of two utterances from the recognition results of the utterance sections of the utterances input to a plurality of microphones, whether the times of the utterance sections overlap. The signal processing device 100 then calculates, for each pair whose utterance sections overlap in time, the similarity of the recognition results in units of kana or phonemes. The signal processing device 100 then compares the similarity with a predetermined threshold for each such pair and, for pairs of recognition results whose similarity exceeds the threshold, rejects the utterance with the shorter recognition result as a wraparound utterance.
Each component of the signal processing device 100 is functionally conceptual and need not be physically configured as illustrated. That is, the specific form of distribution and integration of the functions of the signal processing device 100 is not limited to the illustrated one; all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions.
FIG. 6 is a diagram showing an example of a computer on which the signal processing device 100 is realized by executing a program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
101-1 to 101-N Speech section detection unit
102-1 to 102-N Speech recognition unit
103 Wraparound utterance rejection unit
1031 Same-timing utterance detection unit
1032 Utterance similarity calculation unit
1033 Rejection unit
Claims (7)
- A signal processing device comprising: a first detection unit that receives, together with speech recognition results of the utterance intervals of utterances input to a plurality of microphones, time information on the start time and end time of each utterance and information on the appearance time of each word in the speech recognition results, and that detects, from the speech recognition results of the utterance intervals of the utterances input to the plurality of microphones, for each pair of speech recognition results formed by combining the speech recognition results of two utterances, whether the utterance intervals overlap in time;
a calculation unit that, for each pair of speech recognition results of utterances whose utterance intervals overlap in time, calculates a similarity of the speech recognition results in units of kana or phonemes; and
a rejection unit that, for each pair whose utterance intervals overlap in time, compares the similarity with a predetermined threshold and, for each pair whose similarity exceeds the threshold, rejects the utterance with the shorter speech recognition result as a crosstalk utterance.
- The signal processing device according to claim 1, wherein the calculation unit calculates the similarity using an overlap ratio of the utterance intervals of the respective utterances.
- The signal processing device according to claim 1 or 2, wherein the calculation unit calculates the similarity by comparing only the portions of the speech recognition results that are judged to have been uttered at the same time.
- The signal processing device according to any one of claims 1 to 3, further comprising a speech recognition unit that performs speech recognition on the audio of the utterance interval of each utterance input to the plurality of microphones.
- The signal processing device according to claim 4, further comprising a second detection unit that detects, from the audio of the utterances input to the plurality of microphones, the utterance intervals in which an utterance is present, and outputs the audio of the utterance interval of each utterance to the speech recognition unit.
- A signal processing method executed by a signal processing device, the method comprising:
a step of receiving, together with speech recognition results of the utterance intervals of utterances input to a plurality of microphones, time information on the start time and end time of each utterance and information on the appearance time of each word in the speech recognition results, and detecting, from the speech recognition results of the utterance intervals of the utterances input to the plurality of microphones, for each pair of speech recognition results formed by combining the speech recognition results of two utterances, whether the utterance intervals overlap in time;
a step of calculating, for each pair of speech recognition results of utterances whose utterance intervals overlap in time, a similarity of the speech recognition results in units of kana or phonemes; and
a step of comparing, for each pair whose utterance intervals overlap in time, the similarity with a predetermined threshold and, for each pair whose similarity exceeds the threshold, rejecting the utterance with the shorter speech recognition result as a crosstalk utterance.
- A signal processing program for causing a computer to execute:
a step of receiving, together with speech recognition results of the utterance intervals of utterances input to a plurality of microphones, time information on the start time and end time of each utterance and information on the appearance time of each word in the speech recognition results, and detecting, from the speech recognition results of the utterance intervals of the utterances input to the plurality of microphones, for each pair of speech recognition results formed by combining the speech recognition results of two utterances, whether the utterance intervals overlap in time;
a step of calculating, for each pair of speech recognition results of utterances whose utterance intervals overlap in time, a similarity of the speech recognition results in units of kana or phonemes; and
a step of comparing, for each pair whose utterance intervals overlap in time, the similarity with a predetermined threshold and, for each pair whose similarity exceeds the threshold, rejecting the utterance with the shorter speech recognition result as a crosstalk utterance.
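Claims 2 and 3 refine the similarity computation by using the overlap ratio of the two utterance intervals and by restricting the comparison to the portions uttered at the same time. As a hedged illustration of the overlap ratio (a hypothetical formula; the claims do not fix the exact expression), one could take the fraction of the shorter utterance covered by the temporal overlap:

```python
def overlap_ratio(start_a: float, end_a: float,
                  start_b: float, end_b: float) -> float:
    """Fraction of the shorter utterance covered by the temporal overlap."""
    overlap = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    shorter = min(end_a - start_a, end_b - start_b)
    return overlap / shorter if shorter > 0 else 0.0

print(overlap_ratio(1.0, 3.0, 2.0, 4.0))  # overlap 1.0 s / shorter 2.0 s = 0.5
```

A similarity score weighted by this ratio penalizes pairs that only partially overlap, which is one way to read the claim 2 refinement.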
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2023531334A JPWO2023276159A1 (ja) | 2021-07-02 | 2021-07-02 | |
PCT/JP2021/025207 WO2023276159A1 (ja) | 2021-07-02 | 2021-07-02 | Signal processing device, signal processing method, and signal processing program |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2021/025207 WO2023276159A1 (ja) | 2021-07-02 | 2021-07-02 | Signal processing device, signal processing method, and signal processing program |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023276159A1 true WO2023276159A1 (ja) | 2023-01-05 |
Family
ID=84691089
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2021/025207 WO2023276159A1 (ja) | 2021-07-02 | 2021-07-02 | Signal processing device, signal processing method, and signal processing program |
Country Status (2)
Country | Link |
---|---|
JP (1) | JPWO2023276159A1 (ja) |
WO (1) | WO2023276159A1 (ja) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010092914A1 (ja) * | 2009-02-13 | 2010-08-19 | NEC Corporation | Multi-channel acoustic signal processing method, system therefor, and program |
WO2010092913A1 (ja) * | 2009-02-13 | 2010-08-19 | NEC Corporation | Multi-channel acoustic signal processing method, system therefor, and program |
WO2021125037A1 (ja) * | 2019-12-17 | 2021-06-24 | Sony Group Corporation | Signal processing device, signal processing method, program, and signal processing system |
- 2021-07-02 JP JP2023531334A patent/JPWO2023276159A1/ja active Pending
- 2021-07-02 WO PCT/JP2021/025207 patent/WO2023276159A1/ja active Application Filing
Non-Patent Citations (1)
Title |
---|
SHOTA HORIGUCHI; YUSUKE FUJITA; KENJI NAGAMATSU: "Utterance-Wise Meeting Transcription System Using Asynchronous Distributed Microphones", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 Olin Library Cornell University Ithaca, NY 14853, 31 July 2020 (2020-07-31), XP081730026 * |
Also Published As
Publication number | Publication date |
---|---|
JPWO2023276159A1 (ja) | 2023-01-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6171617B2 (ja) | Response target speech determination device, response target speech determination method, and response target speech determination program | |
US6618702B1 (en) | Method of and device for phone-based speaker recognition | |
US20080294433A1 (en) | Automatic Text-Speech Mapping Tool | |
WO2017162053A1 (zh) | Identity authentication method and apparatus | |
Wyatt et al. | Conversation detection and speaker segmentation in privacy-sensitive situated speech data. | |
US20140337024A1 (en) | Method and system for speech command detection, and information processing system | |
US9251808B2 (en) | Apparatus and method for clustering speakers, and a non-transitory computer readable medium thereof | |
KR20170007107A (ko) | Speech recognition system and method | |
KR20240053639A (ko) | Speaker-turn-based online speaker diarization using constrained spectral clustering | |
CN113744742B (zh) | Role recognition method, device and system in dialogue scenarios | |
Arjun et al. | Automatic correction of stutter in disfluent speech | |
EP3493201B1 (en) | Information processing device, information processing method, and computer program | |
Këpuska | Wake-up-word speech recognition | |
Adi et al. | Automatic Measurement of Voice Onset Time and Prevoicing Using Recurrent Neural Networks. | |
WO2023276159A1 (ja) | Signal processing device, signal processing method, and signal processing program | |
Arsikere et al. | Computationally-efficient endpointing features for natural spoken interaction with personal-assistant systems | |
JP6526602B2 (ja) | Speech recognition device, method thereof, and program | |
Zelenák et al. | Speaker overlap detection with prosodic features for speaker diarisation | |
JP7511374B2 (ja) | Utterance interval detection device, speech recognition device, utterance interval detection system, utterance interval detection method, and utterance interval detection program | |
KR101229108B1 (ko) | Utterance verification apparatus based on per-word confidence thresholds and method therefor | |
KR20090061566A (ko) | Microphone-array-based speech recognition system and method of extracting target speech in the system | |
US20240321273A1 (en) | Signal processing device, signal processing method, and signal processing program | |
Singh et al. | Voice based login authentication for Linux | |
JP7377736B2 (ja) | Online sequential speaker diarization method, online sequential speaker diarization device, and online sequential speaker diarization system | |
Jamil et al. | Influences of age in emotion recognition of spontaneous speech: A case of an under-resourced language |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21948463 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2023531334 Country of ref document: JP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 18575327 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21948463 Country of ref document: EP Kind code of ref document: A1 |