WO2022145015A1 - Signal processing device, signal processing method, and signal processing program - Google Patents

Signal processing device, signal processing method, and signal processing program Download PDF

Info

Publication number
WO2022145015A1
WO2022145015A1 PCT/JP2020/049247 JP2020049247W WO2022145015A1 WO 2022145015 A1 WO2022145015 A1 WO 2022145015A1 JP 2020049247 W JP2020049247 W JP 2020049247W WO 2022145015 A1 WO2022145015 A1 WO 2022145015A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
speaker
feature vector
speaker feature
signal processing
Prior art date
Application number
PCT/JP2020/049247
Other languages
French (fr)
Japanese (ja)
Inventor
Keisuke Kinoshita
Tomohiro Nakatani
Marc Delcroix
Original Assignee
Nippon Telegraph and Telephone Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to PCT/JP2020/049247
Priority to JP2022572857A
Publication of WO2022145015A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/028Voice signal separating using properties of sound source
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/0308Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Definitions

  • the present invention relates to a signal processing device, a signal processing method, and a signal processing program.
  • Among sound source separation techniques, which separate an acoustic signal containing a mixture of sounds from multiple sound sources into a signal for each source, there are a first sound source separation technique that targets sound picked up by multiple microphones and a second sound source separation technique that targets sound picked up by a single microphone.
  • The second sound source separation technique is considered more difficult than the first because it cannot use information about the positions of the microphones.
  • The techniques described in Non-Patent Documents 1 to 3 are known as second sound source separation techniques that perform sound source separation based on the information in the input acoustic signal, without using microphone position information.
  • The technique described in Non-Patent Document 1 separates an input acoustic signal into a predetermined number of sound sources. By feeding the input signal to a bi-directional long short-term memory neural network (BLSTM neural network, hereinafter BLSTM), a mask for extracting each sound source can be estimated.
  • When the BLSTM parameters are trained, the parameters are updated, for a given input signal, so as to minimize the distance between the correct separated signal given in advance and the separated signal obtained by applying the estimated mask to the observation signal.
  • The technique described in Non-Patent Document 2 performs sound source separation while varying the number of separated signals according to the number of sound sources contained in the input signal.
  • As in Non-Patent Document 1, a mask is estimated using a BLSTM, but in Non-Patent Document 2 the separation mask estimated at any one time corresponds to only one sound source in the input signal.
  • Using this separation mask, the observation signal is split into a separated signal and a residual signal obtained by removing the separated signal from the observation signal.
  • It is then automatically determined whether another sound source signal still remains in the residual signal; if so, the residual signal is fed back into the BLSTM to extract another sound source. If no other sound source signal remains in the residual signal, the process ends at that point.
  • This mask estimation process is repeated until no sound source remains in the residual signal, extracting the sound sources one by one, so that sound source separation and estimation of the number of sound sources are achieved simultaneously.
  • For this determination, thresholding the volume of the residual signal, or feeding the residual signal to another neural network that checks whether another sound source remains, has been proposed.
  • The same BLSTM is used repeatedly for the mask estimation.
  • The techniques described in Non-Patent Documents 1 and 2 are batch methods that apply processing to the entire input signal, and therefore lack real-time capability. For example, when applying them to a recording of a meeting, processing cannot start until at least the recording of the meeting has finished. Consequently, these techniques cannot be used in applications where sound source separation is applied from the start of a meeting and each separated voice is transcribed sequentially with automatic speech recognition.
  • The technique described in Non-Patent Document 3 was devised in view of this problem. The input signal is divided into a plurality of time blocks (blocks of roughly 5 to 10 seconds in length), and processing is applied to the blocks sequentially.
  • The method of Non-Patent Document 2 is applied to the first block, except that, in addition to estimating the separation masks, the speaker feature vectors corresponding to the extracted speakers are computed and output as well.
  • In the second block, each of the speaker feature vectors obtained from the processing of the first block is used to extract the voices of those speakers repeatedly, in the same order as in the first block. In the third and subsequent blocks, speakers are likewise extracted in the order in which they were detected in past blocks.
  • When a new speaker who did not appear in the first block appears in the second block, the voice component of that new speaker remains in the residual signal even after all the speakers who appeared in the first block have been extracted from the observation signal of the second block. By applying the determination process described above, the presence of the new speaker can therefore be detected, and as a result the new speaker's mask and speaker feature vector can also be estimated.
  • By repeating this processing block by block, sound source separation and estimation of the number of sound sources can be performed even for long recordings, in the form of block-online processing.
  • By keeping the order of speaker extraction common across blocks, speakers can be tracked between time blocks, that is, it can be determined which of the separated sounds obtained in one time block and those obtained in a different time block belong to the same speaker.
  • With the technique described in Non-Patent Document 3, even when a speaker extracted in a past block is not speaking in the current block, that speaker's feature vector is still used to extract the corresponding signal, just as for the other speakers. If the speaker's signal is not contained in the block, the mask for that speaker becomes 0, and ideally a signal with zero sound pressure is extracted.
  • Because it cannot be known in advance whether a particular speaker is speaking in a new time block, extraction from the new time block using the speaker feature vector must be attempted for every speaker who has been extracted at least once in a past block. Whether the speaker is actually speaking can then be determined by examining the sound pressure of the output signal.
  • The technique described in Non-Patent Document 3 therefore performs extraction processing that is not actually necessary for speakers who are not speaking (silent speakers).
  • Moreover, with the technique described in Non-Patent Document 3, speakers in a new time block are extracted in the order in which they spoke in the past.
  • In general, however, the optimal order of speaker extraction is likely to differ in each block (see Non-Patent Document 1). Attempting to extract speakers in the same order in every block therefore increases the amount of processing and impairs the optimality of the processing.
  • The present invention has been made in view of the above, and its object is to provide a signal processing device, a signal processing method, and a signal processing program capable of improving the processing accuracy for an acoustic signal while reducing the amount of processing.
  • To solve the above problem and achieve the object, the signal processing device according to the present invention includes: an extraction unit that, for an input acoustic signal, repeatedly extracts a speaker feature vector for each time block, as many times as there are speakers present in the time block; an external memory unit that stores the speaker feature vectors extracted by the extraction unit; an instruction unit that, when the extraction unit extracts the speaker feature vector of a speaker who has never appeared in any time block so far, instructs writing of the speaker feature vector to an unused memory slot of the external memory unit, that, when the extraction unit extracts the speaker feature vector of a speaker who has already appeared in a time block so far, instructs writing of the speaker feature vector to the memory slot of the external memory unit corresponding to that speaker, and that instructs reading of speaker feature vectors from the external memory unit; a writing unit that writes the speaker feature vector to the external memory unit in response to an instruction from the instruction unit; a reading unit that reads the speaker feature vector from the external memory unit in response to an instruction from the instruction unit; and a processing unit that executes signal processing based on the speaker feature vector read by the reading unit.
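  • Purely for illustration, the units named in this summary can be pictured through the following hypothetical Python interfaces; every method name and signature here is invented for explanation and is not taken from the patent.

```python
# Hypothetical interface sketch of the claimed units (illustrative only).
from typing import Protocol
import numpy as np

class ExtractionUnit(Protocol):
    def initial_state(self) -> np.ndarray:
        """Return the initial state used for the first iteration of a time block."""
    def extract(self, x_block: np.ndarray, state: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
        """Return (speaker feature vector, updated internal state)."""

class InstructionUnit(Protocol):
    def address(self, feature: np.ndarray, memory: np.ndarray) -> np.ndarray:
        """Return an instruction vector: an unused slot for a new speaker,
        the existing slot for a speaker already seen in an earlier time block."""

class WritingUnit(Protocol):
    def write(self, memory: np.ndarray, w: np.ndarray, feature: np.ndarray) -> np.ndarray:
        """Write the speaker feature vector to the addressed memory slot(s)."""

class ReadingUnit(Protocol):
    def read(self, memory: np.ndarray, w: np.ndarray) -> np.ndarray:
        """Read a speaker feature vector back from the addressed memory slot(s)."""

class ProcessingUnit(Protocol):
    def process(self, x_block: np.ndarray, speaker_vector: np.ndarray) -> np.ndarray:
        """Source extraction, utterance-section detection, or speech recognition."""
```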
  • FIG. 1 is a diagram schematically showing an example of a configuration of a signal processing device according to an embodiment.
  • FIG. 2 is a diagram schematically showing another example of the configuration of the signal processing device according to the embodiment.
  • FIG. 3 is a diagram schematically showing another example of the configuration of the signal processing device according to the embodiment.
  • FIG. 4 is a diagram schematically showing another example of the configuration of the signal processing device according to the embodiment.
  • FIG. 5 is a flowchart showing a processing procedure of the data processing method according to the embodiment.
  • FIG. 6 is a diagram showing an example of a computer in which a signal processing device is realized by executing a program.
  • FIG. 1 is a diagram showing an example of a configuration of a signal processing device according to an embodiment.
  • The signal processing device 10 according to the embodiment is realized, for example, by loading a predetermined program into a computer that includes a ROM (Read Only Memory), a RAM (Random Access Memory), and a CPU (Central Processing Unit), and having the CPU execute the program.
  • The signal processing device 10 has a speaker feature vector extraction unit 11 (extraction unit), an external memory unit 12, a memory control instruction unit 13 (instruction unit), an external memory writing unit 14 (writing unit), an external memory reading unit 15 (reading unit), a sound source extraction unit 16, an utterance section detection unit 17, a voice recognition unit 18, a repetition control unit 19, and a learning unit 20.
  • When an acoustic signal is input, the signal processing device 10 divides it into time blocks and inputs the acoustic signal of each time block to the speaker feature vector extraction unit 11.
  • The speaker feature vector extraction unit 11 repeatedly extracts a speaker feature vector from the input acoustic signal (hereinafter, observation signal) for each time block, as many times as there are speakers present in that block. The number of repetitions of the speaker feature vector extraction process is set by the repetition control unit 19.
  • The speaker feature vector extraction unit 11 can use various methods to extract speaker feature vectors. For example, when the technique described in Non-Patent Document 3 or in Reference 1 is used, a speaker feature vector can be extracted in the first time block for each of an arbitrary number of speakers contained in that block.
  • Reference 1 Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu, “End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors”, arXiv: 2005.09921, 2021.
  • In a general form, the extraction of the speaker feature vector can be formulated as in Eq. (1).
  • I_b is the total number of speakers in time block b. a_{b,i} is the speaker feature vector of the i-th speaker extracted in time block b. X_b is the observation signal of time block b, and NN_embed[·] is a neural network such as a BLSTM. h_{b,0} is an initial vector (in some cases a matrix) that informs the network that this is the first iteration of the speaker feature vector extraction, and is set appropriately for NN_embed[·].
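  • As a concrete illustration only, the iterative extraction of Eq. (1) can be sketched as below, under the assumption that the recurrence has the form (a_{b,i}, h_{b,i}) = NN_embed[X_b, h_{b,i-1}]; a toy random projection stands in for the actual BLSTM-based NN_embed.

```python
# Toy sketch of the per-block extraction loop of Eq. (1) (assumed recurrence).
import numpy as np

rng = np.random.default_rng(0)

def nn_embed(x_block, h_prev, dim=16):
    """Stand-in for NN_embed[.]: returns (speaker feature vector a_{b,i}, state h_{b,i})."""
    ctx = np.concatenate([x_block.mean(axis=0), h_prev])
    a_bi = np.tanh(rng.standard_normal((dim, ctx.size)) @ ctx)
    h_bi = np.tanh(rng.standard_normal((h_prev.size, ctx.size)) @ ctx)
    return a_bi, h_bi

X_b = rng.standard_normal((500, 40))   # toy observation: 500 frames x 40 features
h = np.zeros(8)                        # h_{b,0}: marks the first iteration
speaker_vectors = []
for i in range(2):                     # I_b = 2 speakers assumed in this toy block
    a, h = nn_embed(X_b, h)
    speaker_vectors.append(a)          # a_{b,1}, a_{b,2}
```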
  • the external memory unit 12 is a memory for storing the speaker feature vector extracted by the speaker feature vector extraction unit 11.
  • the external memory unit 12 has a plurality of memory addresses, and stores a speaker feature vector of one speaker in one memory address.
  • The external memory unit 12 may be a rewritable semiconductor memory such as a RAM (Random Access Memory), a flash memory, or an NVSRAM (Non Volatile Static Random Access Memory).
  • the external memory unit 12 may be a storage device such as an HDD (Hard Disk Drive), SSD (Solid State Drive), or an optical disk.
  • Each time a speaker feature vector is extracted by the speaker feature vector extraction unit 11, the memory control instruction unit 13 receives that speaker feature vector.
  • When the speaker feature vector of a speaker who has never appeared in any time block so far is extracted, the memory control instruction unit 13 instructs the external memory unit 12 to write this speaker feature vector to a new memory address, that is, to an unused memory slot of the external memory unit 12.
  • When the speaker feature vector of a speaker who has already appeared in a time block so far is extracted, the memory control instruction unit 13 instructs the external memory unit 12 to write the speaker feature vector, in an appropriate form, to the memory address corresponding to that previously appearing speaker, that is, to the memory slot of the external memory unit 12 corresponding to this speaker.
  • The memory control instruction unit 13 issues the instructions for writing speaker feature vectors to, and reading them from, the external memory unit 12 using neural-network-based mechanisms such as those described in References 2 to 4.
  • Reference 2 A. Graves, G. Wayne, and I. Danihelka, “Neural Turing Machines”, arXiv:1410.5401, 2014.
  • Reference 3 Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez, Edward Grefenstette, and Tiago Ramalho, “Hybrid computing using a neural network with dynamic external memory”, Nature, 538 (7626): 471-476, 2016.
  • Reference 4 Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus, “End-To-End Memory Networks”, Advances in Neural Information Processing Systems 28, pp. 2440-2448, 2015.
  • the external memory writing unit 14 writes the speaker feature vector to the external memory unit 12 in response to an instruction from the memory control instruction unit 13.
  • The external memory reading unit 15 receives an instruction from the memory control instruction unit 13 and reads a speaker feature vector for each speaker from the external memory unit 12. By implementing the exchange of information with the external memory unit 12 using neural networks in this way, the signal processing device 10 can optimize the whole system with the error backpropagation method.
  • For the instructions to the external memory writing unit 14, which writes to the external memory unit 12, and to the external memory reading unit 15, which reads from the external memory unit 12, the memory control instruction unit 13 generates an instruction vector according to the procedures described in References 2 to 4 and the like.
  • Specifically, the memory control instruction unit 13 first generates a key vector k_{b,i} of size 1 × M using a neural network, based on the output a_{b,i} from the speaker feature vector extraction unit 11 and the past output (instruction vector) w_{b,i-1} of the memory control instruction unit 13. The memory control instruction unit 13 also generates an additional quantity used in the calculation of Eq. (2).
  • The memory control instruction unit 13 then measures the closeness of this key vector to each of the N addresses (slot n) of the current external memory M_{b,i}, and calculates the instruction vector w_{b,i} as in Eq. (2).
  • M_{b,i} is the external memory matrix (of size N × M) before the i-th speaker feature vector in the b-th time block is written, where N is the total number of memory addresses and M is the length of the vector that can be written to each address.
  • w_{b,i} is a 1 × N-dimensional instruction vector, and its elements w_{b,i}(n) have the property shown in Eq. (4). In Eq. (5), c is a constant for enhancing sparsity.
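  • Assuming the NTM-style content-based addressing of References 2 to 4, the computation of the instruction vector can be sketched as follows; the cosine-similarity-plus-softmax form and the sharpening constant c are illustrative stand-ins for Eqs. (2) to (5), whose exact form appears only in the patent figures.

```python
# Illustrative content-based addressing: key vector vs. every memory slot.
import numpy as np

def instruction_vector(k, M, c=10.0, eps=1e-8):
    """k: (M_dim,) key vector k_{b,i}; M: (N, M_dim) external memory M_{b,i}."""
    sims = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + eps)
    logits = c * sims                        # c: sparsity/sharpening constant (Eq. (5))
    w = np.exp(logits - logits.max())
    return w / w.sum()                       # w_{b,i}: length N, elements sum to 1

M = np.zeros((4, 16)); M[0] = 1.0            # toy memory: one used slot, three unused
w = instruction_vector(np.ones(16), M)       # peaks at the matching (used) slot
```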
  • Next, the external memory writing unit 14 writes the speaker feature vector to the external memory unit 12 and updates it, based on the instruction vector w_{b,i} output from the memory control instruction unit 13.
  • As described below, the writing process of the external memory writing unit 14 is performed as a pair of an erasing step and a writing step.
  • The speaker feature vector a_{b,i} extracted by the speaker feature vector extraction unit 11 is passed to the external memory writing unit 14 and written to the external memory unit 12 in an appropriate form.
  • First, the external memory writing unit 14 erases the memory according to Eq. (6), based on the erase vector e_{b,i}, which is a 1 × N vector, and the instruction vector w_{b,i}. The erase vector e_{b,i} is also an output of the memory control instruction unit 13. L is a 1 × N vector consisting of N ones. The erase vector e_{b,i} is often set as in Eq. (7).
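  • A minimal sketch of the erase-then-write update suggested by Eqs. (6) to (8) is shown below, assuming a per-address erasure weighted by w_{b,i}(n)e_{b,i}(n) followed by weighted addition of a_{b,i}; the exact equations in the patent may differ in detail.

```python
# Illustrative erase-then-add memory update.
import numpy as np

def write_memory(M, w, a, e=None):
    """M: (N, M_dim) memory; w: (N,) instruction vector; a: (M_dim,) speaker vector."""
    if e is None:
        e = np.ones(M.shape[0])                  # Eq. (7): erase vector often all ones
    M_erased = M * (1.0 - w * e)[:, None]        # Eq. (6): attenuate addressed rows
    return M_erased + np.outer(w, a)             # add a_{b,i} to the addressed rows

M = write_memory(np.zeros((4, 16)), np.array([0.97, 0.01, 0.01, 0.01]), np.ones(16))
```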
  • The external memory reading unit 15 receives an instruction from the memory control instruction unit 13, reads a speaker feature vector for each speaker from the external memory unit 12, and outputs it to the memory control instruction unit 13.
  • Specifically, the external memory reading unit 15 reads the updated speaker feature vector from the external memory unit 12 based on the instruction vector w_{b,i} output from the memory control instruction unit 13, as shown in Eq. (9): the speaker feature vector r_{b,i} to be read is obtained by multiplying the memory matrix M by the instruction vector w_{b,i}.
  • A value of 0 for the n-th element w_{b,i}(n) of the instruction vector corresponds to an instruction not to read information from the n-th address. Consequently, information at memory addresses whose element value is close to 1 is mainly read out, while information at addresses whose element value is close to 0 is hardly read at all.
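  • The read operation of Eq. (9) then amounts to a weighted sum of the memory rows, as in this short sketch.

```python
# Illustrative read: r_{b,i} = w_{b,i} M, so slots with weights near 1 dominate.
import numpy as np

def read_memory(M, w):
    """M: (N, M_dim) memory; w: (N,) instruction vector -> (M_dim,) read vector r_{b,i}."""
    return w @ M

r = read_memory(np.eye(4, 16), np.array([0.97, 0.01, 0.01, 0.01]))
```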
  • The sound source extraction unit 16 extracts, from the observation signal, the voice of the speaker corresponding to the speaker feature vector output from the memory control instruction unit 13, based on the observation signal and that speaker feature vector (see Non-Patent Document 3 for details).
  • Specifically, the sound source extraction unit 16 extracts the separated voice ^S_{b,i} as in Eq. (10), using the observation signal X_b and the speaker feature vector r_{b,i} read from the external memory unit 12. For NN_extract[·], it is common to use a neural network such as a BLSTM or a convolutional neural network.
  • Even if the speaker feature vector extraction unit 11 extracts the speaker feature vectors in a different order in different time blocks, the separated voices ^S_{b,i} (i = 1, ..., I_b) extracted in time block b and the separated voices ^S_{b',i} extracted in a different time block b' (b ≠ b') can still be associated with the same speakers; that is, speakers can be tracked across time blocks by using the maximum-value detection and comparison of the instruction vectors for the external memory unit 12 described above.
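  • A toy sketch of Eq. (10) and of the block-wise speaker tracking follows; the sigmoid gating used as NN_extract and the argmax-based slot comparison are illustrative simplifications of the neural extraction network and of the maximum-value comparison of instruction vectors described above.

```python
# Illustrative extraction conditioned on the read speaker vector, plus tracking.
import numpy as np

def nn_extract(X_b, r):
    """Toy stand-in for NN_extract[.]: gate the observation using r_{b,i}."""
    gate = 1.0 / (1.0 + np.exp(-X_b @ r[:X_b.shape[1]]))   # per-frame gate
    return X_b * gate[:, None]                              # separated voice ^S_{b,i}

def tracked_speaker_id(w):
    """Follow a speaker across time blocks by the memory slot its w_{b,i} points at."""
    return int(np.argmax(w))

X_b = np.random.default_rng(1).standard_normal((500, 16))
S_hat = nn_extract(X_b, np.ones(16))
speaker_id = tracked_speaker_id(np.array([0.02, 0.95, 0.02, 0.01]))   # -> slot 1
```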
  • The utterance section detection unit 17 outputs the utterance section detection result for the speaker corresponding to the speaker feature vector output from the memory control instruction unit 13, based on the observation signal and that speaker feature vector (see Reference 1 for details).
  • The voice recognition unit 18 outputs the voice recognition result for the speaker corresponding to the speaker feature vector output from the memory control instruction unit 13, based on the observation signal and that speaker feature vector (see Reference 5 for details).
  • Reference 5 Marc Delcroix, Shinji Watanabe, Tsubasa Ochiai, Keisuke Kinoshita, Shigeki Karita, Atsunori Ogawa, Tomohiro Nakatani, “End-to-end SpeakerBeam for single channel target speech recognition”, Interspeech2019, pp.451-455, 2019.
  • The sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18 are examples of processing units that process voice. The signal processing device 10 described here has all three of these processing units, but the present invention is not limited to this.
  • The signal processing device according to the embodiment may instead be a signal processing device 10A having the sound source extraction unit 16 (see FIG. 2), a signal processing device 10B having the utterance section detection unit 17 (see FIG. 3), or a signal processing device 10C having the voice recognition unit 18 (see FIG. 4).
  • The repetition control unit 19 determines the number of repetitions of the extraction process by the speaker feature vector extraction unit 11, based either on the state of the extraction process of the speaker feature vector extraction unit 11 or on the processing results of the sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18. Ideally, the speaker feature vector extraction unit 11 should extract I_b speaker feature vectors in each block b.
  • For example, the repetition control unit 19 calculates a scalar value ^f_{b,i} (0 ≤ ^f_{b,i} ≤ 1) indicating whether the repetition should be stopped, as in Eq. (11), using the internal state vector h_{b,i} output from the speaker feature vector extraction unit 11, the observation signal X_b, and the separated voice ^S_{b,i}. If the scalar value ^f_{b,i} is larger than a predetermined value, the repetition is stopped; if it is lower, the repetition is continued.
  • Each time the speaker feature vector extraction unit 11 extracts a speaker feature vector, the repetition control unit 19 inputs different auxiliary information to the neural network of the speaker feature vector extraction unit 11, causing it to output the extraction result of the speaker feature vector corresponding to a different sound source.
  • In this way, the repetition control unit 19 virtually keeps track of all the speaker feature vectors contained in the input acoustic signal and of which speaker each speaker feature vector corresponds to.
  • The repetition control unit 19 then inputs auxiliary information for the speaker feature vector of the next speaker to the speaker feature vector extraction unit 11, causing the speaker feature vector extraction unit 11 to extract the speaker feature vector of the next speaker.
  • When the extraction of speaker feature vectors for all speakers in the observation signal of the current time block is complete, the repetition control unit 19 stops the repetition of the extraction process by the speaker feature vector extraction unit 11.
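  • The stopping decision around Eq. (11) can be sketched as follows, under the assumption that some scorer maps h_{b,i}, X_b, and ^S_{b,i} to a scalar ^f_{b,i} in [0, 1] that is compared with a threshold; the scorer below is a toy stand-in, not the patented network.

```python
# Illustrative repetition-control check for one extraction iteration.
import numpy as np

def stop_flag(h_bi, X_b, S_hat_bi):
    """Toy scorer returning ^f_{b,i} in [0, 1] from pooled summaries."""
    summary = np.concatenate([h_bi, X_b.mean(axis=0), S_hat_bi.mean(axis=0)])
    return 1.0 / (1.0 + np.exp(-summary.mean()))

def should_stop(f_hat, threshold=0.5):
    """Stop the per-block loop once ^f_{b,i} exceeds the predetermined value."""
    return f_hat > threshold
```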
  • the learning unit 20 optimizes the parameters used by the signal processing device 10 using the learning data.
  • the learning unit 20 optimizes the parameters of the neural network constituting the signal processing device 10 based on a predetermined objective function.
  • The learning data consist of an input signal (observation signal), a correct clean signal for each sound source contained in the input signal, correct utterance time information, correct utterance content, and the total number of speakers I contained in the input signal.
  • The learning unit 20 learns the parameters based on the learning data so that the error between the output of the signal processing device 10 and the correct information becomes small.
  • For example, a loss function based on the error of Eq. (12) is provided so that the system output (separated voice) ^S_{b,i} becomes close to the correct separated voice S_{b,i}.
  • In addition, a cross-entropy loss function, Eq. (13), is provided for the scalar value that determines the number of repetitions, so that the correct number of sound sources is estimated.
  • The learning unit 20 also updates the parameters so that Eq. (14) becomes small at the same time, so that the total number of addresses used in the external memory becomes equal to the total number of speakers I.
  • T_count is a preset threshold value and is generally set to 1. min(·) is a function that outputs T_count when the input value is larger than T_count, and outputs the input value as it is when it is smaller than T_count.
  • Eq. (14) is thus a loss function that encourages the number of memory addresses used in the external memory to match the total number of speakers I.
  • The learning unit 20 learns the parameters of the neural networks so that the value of Eq. (15), the sum of all these loss functions, becomes small.
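  • A hedged sketch of the training objective of Eqs. (12) to (15) is given below; the mean-squared separation loss, the binary cross-entropy on the stop flags, and the clipped memory-usage penalty are illustrative forms chosen here and are not the exact equations of the patent.

```python
# Illustrative composite training loss.
import numpy as np

def separation_loss(S_hat, S):                    # stands in for Eq. (12)
    return float(np.mean((S_hat - S) ** 2))

def count_loss(f_hat, f_true, eps=1e-8):          # stands in for Eq. (13)
    f_hat = np.clip(f_hat, eps, 1.0 - eps)
    return float(-np.mean(f_true * np.log(f_hat) + (1 - f_true) * np.log(1 - f_hat)))

def memory_usage_loss(W, I, T_count=1.0):         # stands in for Eq. (14)
    # W: stacked instruction vectors (num_writes, N); per-address usage is clipped
    # at T_count and the clipped total is pushed toward the total speaker count I.
    usage = np.minimum(W.sum(axis=0), T_count)
    return float((usage.sum() - I) ** 2)

def total_loss(S_hat, S, f_hat, f_true, W, I):    # stands in for Eq. (15)
    return (separation_loss(S_hat, S)
            + count_loss(f_hat, f_true)
            + memory_usage_loss(W, I))
```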
  • For the signal processing device 10B having the utterance section detection unit 17 and the signal processing device 10C having the voice recognition unit 18 as well, the learning unit 20 may optimize the parameters of the neural networks using loss functions corresponding to the processing units provided in the signal processing device. See Reference 6 for the signal processing device 10B having the utterance section detection unit 17, and Reference 7 for the signal processing device 10C having the voice recognition unit 18.
  • Reference 6 Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, Shinji Watanabe, “End-to-End Neural Speaker Diarization with Permutation-free Objectives”, Proc. Interspeech, pp. 4300-4304, 2019.
  • Reference 7 Shigeki Karita et al., “A comparative study on transformer vs RNN in speech applications”, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
  • FIG. 5 is a flowchart showing a processing procedure of the signal processing method according to the embodiment.
  • When the acoustic signal (observation signal) of time block b is input to the speaker feature vector extraction unit 11, the speaker feature vector extraction unit 11 estimates and extracts, from the observation signal of time block b, the speaker feature vector of one speaker present in time block b (step S4).
  • The memory control instruction unit 13 determines whether the speaker feature vector extracted by the speaker feature vector extraction unit 11 belongs to a speaker who has never appeared in any time block so far (step S5).
  • First, consider the case where it is the speaker feature vector of a speaker who has never appeared in any time block so far (step S5: Yes). In this case, the memory control instruction unit 13 instructs writing of the speaker feature vector to an unused memory slot of the external memory unit 12, and the external memory writing unit 14 writes the speaker feature vector to that unused memory slot of the external memory unit 12 (step S6).
  • Next, consider the case where it is the speaker feature vector of a speaker who has already appeared in a time block so far (step S5: No). In this case, the memory control instruction unit 13 instructs writing of the speaker feature vector to the memory slot corresponding to this speaker in the external memory unit 12, and the external memory writing unit 14 writes the speaker feature vector to that memory slot of the external memory unit 12 (step S7).
  • The external memory reading unit 15 then reads the speaker feature vector corresponding to the one speaker extracted by the speaker feature vector extraction unit 11 from the external memory unit 12 according to the instruction from the memory control instruction unit 13 (step S8), and outputs it to the memory control instruction unit 13.
  • The sound source extraction unit 16 extracts the voice of the speaker corresponding to this speaker feature vector from the observation signal, based on the observation signal and the speaker feature vector of the one speaker output from the memory control instruction unit 13 (step S9).
  • The utterance section detection unit 17 outputs the utterance section detection result of the speaker corresponding to this speaker feature vector, based on the observation signal and the speaker feature vector of the one speaker output from the memory control instruction unit 13 (step S10).
  • The voice recognition unit 18 outputs the voice recognition result of the speaker corresponding to this speaker feature vector, based on the observation signal and the speaker feature vector of the one speaker output from the memory control instruction unit 13 (step S11).
  • Steps S9 to S11 may be processed in parallel or in series, and the order of processing is not particularly limited; the processing is executed according to the voice processing function units that are provided. For example, when only the sound source extraction unit 16 is provided, only the sound source extraction process of step S9 is executed, and the process then proceeds to step S12.
  • The repetition control unit 19 determines whether to stop the repetition based on the processing results of the sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18 (step S12). The repetition control unit 19 may instead determine whether to stop the repetition based on the state of the extraction process of the speaker feature vector extraction unit 11.
  • When the repetition control unit 19 determines that the repetition is not to be stopped (step S12: No), the process returns to step S4 in order to proceed with the processing for the next speaker in this time block b, and the speaker feature vector of the next speaker is extracted.
  • When the repetition control unit 19 determines that the repetition is to be stopped (step S12: Yes), the processing results for this time block b are output (step S13). The output results are the sound source extraction result, the utterance section detection result, and the voice recognition result. Alternatively, the signal processing device 10 may output the processing results of all time blocks together.
  • The signal processing device 10 then determines whether the processing of all time blocks has been completed (step S14). If it has (step S14: Yes), the signal processing device 10 ends the processing for the input acoustic signal. If it has not (step S14: No), 1 is added to the time block index b in order to process the next time block (step S15), the process returns to step S4, and the processing is continued.
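  • The flow of FIG. 5 (steps S4 to S15) can be summarized by the following hypothetical sketch, which mirrors only the control structure described above, reuses the illustrative interfaces sketched earlier, and is not an implementation of the patented method.

```python
# Hypothetical block-online driver mirroring steps S4-S15 of FIG. 5.
# `controller` stands in for the repetition control unit 19 (hypothetical interface).
def run_block_online(blocks, extractor, instructor, writer, reader, memory,
                     processors, controller):
    all_results = []
    for x_block in blocks:                                    # next block: step S15
        block_results = []
        state = extractor.initial_state()
        while True:
            feature, state = extractor.extract(x_block, state)        # step S4
            w = instructor.address(feature, memory)                   # step S5
            memory = writer.write(memory, w, feature)                 # steps S6 / S7
            r = reader.read(memory, w)                                # step S8
            block_results.append([p.process(x_block, r)
                                  for p in processors])               # steps S9-S11
            if controller.should_stop(state, x_block, block_results): # step S12
                break
        all_results.append(block_results)                     # block output: step S13
    return all_results                                        # all blocks done: step S14
```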
  • In this way, the signal processing device 10 repeatedly extracts a speaker feature vector for each time block and writes it to the external memory unit 12. When the speaker feature vector of a speaker who has never appeared in any time block so far is extracted, the signal processing device 10 writes the speaker feature vector to an unused memory slot of the external memory unit 12. When the speaker feature vector of a speaker who has already appeared in a time block so far is extracted, the signal processing device 10 writes the speaker feature vector to the memory slot of the external memory unit 12 corresponding to this speaker.
  • The signal processing device 10 therefore does not execute the extraction process at all for speakers who are not speaking (silent speakers). Unlike the conventional technique, it does not need to attempt sound source extraction for silent speakers in every time block, so the amount of processing can be reduced, and because processing is properly applied only to the speakers who are truly speaking, the processing accuracy can also be improved.
  • Furthermore, because the signal processing device 10 extracts speaker feature vectors block by block, it does not need to extract the speaker feature vectors in the same order in all time blocks as the conventional technique does, so the optimality of the processing is not impaired.
  • The signal processing device 10 can thus improve the processing accuracy for an acoustic signal while reducing the amount of processing.
  • FIG. 6 is a diagram showing an example of a computer in which the signal processing device 10 is realized by executing a program.
  • the computer 1000 has, for example, a memory 1010 and a CPU 1020.
  • the computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these parts is connected by a bus 1080.
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012.
  • the ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System).
  • the hard disk drive interface 1030 is connected to the hard disk drive 1090.
  • the disk drive interface 1040 is connected to the disk drive 1100.
  • a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100.
  • the serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120.
  • the video adapter 1060 is connected to, for example, the display 1130.
  • the hard disk drive 1090 stores, for example, an OS (Operating System) 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the signal processing device 10 is implemented as a program module 1093 in which a code that can be executed by a computer is described.
  • the program module 1093 is stored in, for example, the hard disk drive 1090.
  • a program module 1093 for executing processing similar to the functional configuration in the signal processing device 10 is stored in the hard disk drive 1090.
  • the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
  • the setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, a memory 1010 or a hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 into the RAM 1012 and executes them as needed.
  • the program module 1093 and the program data 1094 are not limited to those stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Then, the program module 1093 and the program data 1094 may be read from another computer by the CPU 1020 via the network interface 1070.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A signal processing device (10) has: a speaker feature vector extraction unit (11) that, for an input acoustic signal, repeatedly extracts a speaker feature vector for each time block as many times as there are speakers present in the time block; a memory control instruction unit (13) that instructs writing of the speaker feature vector to an unused memory slot of an external memory unit (12) when the speaker feature vector of a speaker who has not appeared in any time block so far is extracted, and instructs writing of the speaker feature vector to the memory slot of the external memory unit (12) corresponding to a speaker who has already appeared when the speaker feature vector of a speaker who has already appeared in a time block so far is extracted; and a sound source extraction unit (16), an utterance section detection unit (17), and a voice recognition unit (18) that execute signal processing on the basis of a speaker feature vector.

Description

Signal processing device, signal processing method, and signal processing program
 The present invention relates to a signal processing device, a signal processing method, and a signal processing program.
 Among sound source separation techniques, which separate an acoustic signal containing a mixture of sounds from multiple sound sources into a signal for each source, there are a first sound source separation technique that targets sound picked up by multiple microphones and a second sound source separation technique that targets sound picked up by a single microphone. The second sound source separation technique is considered more difficult than the first because it cannot use information about the positions of the microphones.
 Here, the techniques described in Non-Patent Documents 1 to 3 are known as second sound source separation techniques that perform sound source separation based on the information in the input acoustic signal, without using microphone position information.
 The technique described in Non-Patent Document 1 separates an input acoustic signal into a predetermined number of sound sources. By feeding the input signal to a bi-directional long short-term memory neural network (BLSTM neural network, hereinafter BLSTM), a mask for extracting each sound source can be estimated. When the BLSTM parameters are trained, the parameters are updated, for a given input signal, so as to minimize the distance between the correct separated signal given in advance and the separated signal obtained by applying the estimated mask to the observation signal.
 The technique described in Non-Patent Document 2 performs sound source separation while varying the number of separated signals according to the number of sound sources contained in the input signal. As in Non-Patent Document 1, a mask is estimated using a BLSTM, but in Non-Patent Document 2 the separation mask estimated at any one time corresponds to only one sound source in the input signal. Using this separation mask, the observation signal is split into a separated signal and a residual signal obtained by removing the separated signal from the observation signal. It is then automatically determined whether another sound source signal still remains in the residual signal; if so, the residual signal is fed back into the BLSTM to extract another sound source. If no other sound source signal remains in the residual signal, the process ends at that point.
 In the technique described in Non-Patent Document 2, this mask estimation process is repeated until no sound source remains in the residual signal, extracting the sound sources one by one, so that sound source separation and estimation of the number of sound sources are achieved simultaneously. For the determination, thresholding the volume of the residual signal, or feeding the residual signal to another neural network that checks whether another sound source remains, has been proposed. In the technique described in Non-Patent Document 2, the same BLSTM is used repeatedly for the mask estimation.
 The techniques described in Non-Patent Documents 1 and 2 are batch methods that apply processing to the entire input signal, and therefore lack real-time capability. For example, when applying them to a recording of a meeting, processing cannot start until at least the recording of the meeting has finished. Consequently, these techniques cannot be used in applications where sound source separation is applied from the start of a meeting and each separated voice is transcribed sequentially with automatic speech recognition.
 The technique described in Non-Patent Document 3 was devised in view of this problem. In this technique, the input signal is divided into a plurality of time blocks (blocks of roughly 5 to 10 seconds in length), and processing is applied to the blocks sequentially. The method of Non-Patent Document 2 is applied to the first block, except that, at the same time as the sound source separation masks are estimated, the speaker feature vectors corresponding to the extracted speakers are computed and output as well.
 In the technique described in Non-Patent Document 3, in the second block, each of the speaker feature vectors obtained from the processing of the first block is used to extract the voices of those speakers repeatedly, in the same order as in the first block. In the third and subsequent blocks, speakers are likewise extracted in the order in which they were detected in past blocks.
 Here, in the technique described in Non-Patent Document 3, when a new speaker who did not appear in the first block appears in the second block, the voice component of that new speaker remains in the residual signal even after all the speakers who appeared in the first block have been extracted from the observation signal of the second block. Therefore, by using the determination process described above, the presence of the new speaker can be detected, and as a result the new speaker's mask and speaker feature vector can also be estimated.
 By repeating such processing block by block, sound source separation and estimation of the number of sound sources can be performed even for long recordings, in the form of block-online processing. In addition, by keeping the order of speaker extraction common across blocks, speakers can be tracked between time blocks, that is, it can be determined which of the separated sounds obtained in one time block and those obtained in a different time block belong to the same speaker.
 Note that, with the technique described in Non-Patent Document 3, even when a speaker extracted in a past block is not speaking in the current block, that speaker's feature vector is still used to extract the corresponding signal, just as for the other speakers. If the speaker's signal is not contained in the block, the mask for that speaker becomes 0, and as a result a signal with zero sound pressure is ideally extracted.
 When the technique described in Non-Patent Document 3 is used, it cannot be known in advance whether a particular speaker is speaking in a new time block, so extraction from the new time block using the speaker feature vector must be attempted for every speaker who has been extracted at least once in a past block. Whether the speaker is speaking can then be determined by examining the sound pressure of the output signal.
 However, in a typical conversation, only about two or three speakers talk within any given time block. In other words, it is extremely rare for all speakers who have spoken in the past to speak in a new block. The technique described in Non-Patent Document 3 therefore performs extraction processing that is not actually necessary for speakers who are not speaking (silent speakers).
 Moreover, even for silent speakers, a signal with exactly zero sound pressure is rarely extracted; a signal with some sound pressure is often extracted erroneously as that speaker's signal. When there are multiple silent speakers, repeating the sound source extraction process therefore also degrades the sound source separation and extraction performance for the speakers who are truly speaking.
 Furthermore, when the technique described in Non-Patent Document 3 is used, speakers in a new time block are extracted in the order in which they spoke in the past. In general, however, the optimal order of speaker extraction is likely to differ in each block (see Non-Patent Document 1). Attempting to extract speakers in the same order in every block therefore increases the amount of processing and impairs the optimality of the processing.
 The present invention has been made in view of the above, and its object is to provide a signal processing device, a signal processing method, and a signal processing program capable of improving the processing accuracy for an acoustic signal while reducing the amount of processing.
 To solve the above problem and achieve the object, the signal processing device according to the present invention includes: an extraction unit that, for an input acoustic signal, repeatedly extracts a speaker feature vector for each time block, as many times as there are speakers present in the time block; an external memory unit that stores the speaker feature vectors extracted by the extraction unit; an instruction unit that, when the extraction unit extracts the speaker feature vector of a speaker who has never appeared in any time block so far, instructs writing of the speaker feature vector to an unused memory slot of the external memory unit, that, when the extraction unit extracts the speaker feature vector of a speaker who has already appeared in a time block so far, instructs writing of the speaker feature vector to the memory slot of the external memory unit corresponding to that speaker, and that instructs reading of speaker feature vectors from the external memory unit; a writing unit that writes the speaker feature vector to the external memory unit in response to an instruction from the instruction unit; a reading unit that reads the speaker feature vector from the external memory unit in response to an instruction from the instruction unit; and a processing unit that executes signal processing based on the speaker feature vector read by the reading unit.
 According to the present invention, it is possible to improve the processing accuracy for an acoustic signal while reducing the amount of processing.
FIG. 1 is a diagram schematically showing an example of the configuration of a signal processing device according to an embodiment. FIG. 2 is a diagram schematically showing another example of the configuration of the signal processing device according to the embodiment. FIG. 3 is a diagram schematically showing another example of the configuration of the signal processing device according to the embodiment. FIG. 4 is a diagram schematically showing another example of the configuration of the signal processing device according to the embodiment. FIG. 5 is a flowchart showing the processing procedure of the data processing method according to the embodiment. FIG. 6 is a diagram showing an example of a computer in which the signal processing device is realized by executing a program.
 Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. The present invention is not limited to this embodiment. In the description of the drawings, the same parts are denoted by the same reference numerals. In the following, the notation "^A" for a symbol A is equivalent to the symbol in which "^" is written directly above "A".
[実施形態]
[信号処理装置の構成]
 まず、図1を用いて、実施形態に係る信号処理装置の構成について説明する。図1は、実施形態に係る信号処理装置の構成の一例を示す図である。実施形態に係る信号処理装置10は、例えば、ROM(Read Only Memory)、RAM(Random Access Memory)、CPU(Central Processing Unit)等を含むコンピュータ等に所定のプログラムが読み込まれて、CPUが所定のプログラムを実行することで実現される。
[Embodiment]
[Configuration of signal processing device]
First, the configuration of the signal processing device according to the embodiment will be described with reference to FIG. FIG. 1 is a diagram showing an example of a configuration of a signal processing device according to an embodiment. In the signal processing device 10 according to the embodiment, for example, a predetermined program is read into a computer or the like including a ROM (Read Only Memory), a RAM (Random Access Memory), a CPU (Central Processing Unit), and the CPU is a predetermined CPU. It is realized by executing the program.
 図1に示すように、信号処理装置10は、話者特徴ベクトル抽出部11(抽出部)、外部メモリ部12、メモリ制御指示部13(指示部)、外部メモリ書き込み部14(書き込み部)、外部メモリ読み込み部15(読み込み部)、音源抽出部16、発話区間検出部17、音声認識部18、繰り返し制御部19及び学習部20を有する。信号処理装置10は、音響信号が入力されると、時間ブロックに分割し、分割した各時間ブロックの音響信号を話者特徴ベクトル抽出部11に入力する。 As shown in FIG. 1, the signal processing device 10 includes a speaker feature vector extraction unit 11 (extraction unit), an external memory unit 12, a memory control instruction unit 13 (instruction unit), and an external memory writing unit 14 (writing unit). It has an external memory reading unit 15 (reading unit), a sound source extraction unit 16, an utterance section detection unit 17, a voice recognition unit 18, a repetition control unit 19, and a learning unit 20. When the acoustic signal is input, the signal processing device 10 divides it into time blocks, and inputs the acoustic signal of each divided time block to the speaker feature vector extraction unit 11.
 話者特徴ベクトル抽出部11は、入力された音響信号(以降、観測信号とする)に対し、時間ブロックごとに話者特徴ベクトルを、そのブロックに存在する話者の数の分だけ繰り返し抽出する。話者特徴ベクトル抽出処理の繰り返し数は、繰り返し制御部19によって設定される。 The speaker feature vector extraction unit 11 repeatedly extracts, from the input acoustic signal (hereinafter referred to as the observation signal), a speaker feature vector for each time block, as many times as there are speakers present in that block. The number of repetitions of the speaker feature vector extraction process is set by the repetition control unit 19.
 話者特徴ベクトル抽出部11は、話者特徴ベクトルの抽出方法として、各種方法を用いることができる。例えば、非特許文献3に記載の技術または参考文献1に記載の技術を用いる場合には、最初の時間ブロックにおいては、同ブロックに含まれる任意の数の話者のそれぞれに関して話者特徴ベクトルを抽出できる。
参考文献1:Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu, “End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors”, arXiv:2005.09921, 2021.
The speaker feature vector extraction unit 11 can use various methods to extract the speaker feature vector. For example, when the technique described in Non-Patent Document 3 or the technique described in Reference 1 is used, a speaker feature vector can be extracted, in the first time block, for each of an arbitrary number of speakers included in that block.
Reference 1: Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu, “End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors”, arXiv: 2005.09921, 2021.
 ここで、一般的な形式として、話者特徴ベクトルの抽出は、式(1)のように定式化できる。 Here, as a general form, the extraction of the speaker feature vector can be formulated as in Eq. (1).
[数1] (Equation (1))
 I_bは時間ブロックbにおける話者の総数である。a_{b,i}は、時間ブロックbにおいて抽出されたi番目の話者に関する話者特徴ベクトルである。X_bは、時間ブロックbの観測信号、NN_embed[・]は、BLSTMなどのニューラルネットワークである。h_{b,0}は、話者特徴ベクトル抽出における最初の繰り返しであることをネットワークに知らせるための初期ベクトル(場合によってはマトリックス)であり、NN_embed[・]に合わせて適切に設定するものである。 I_b is the total number of speakers in time block b. a_{b,i} is the speaker feature vector of the i-th speaker extracted in time block b. X_b is the observation signal of time block b, and NN_embed[・] is a neural network such as a BLSTM. h_{b,0} is an initial vector (or, in some cases, a matrix) that tells the network that this is the first iteration of the speaker feature vector extraction, and is set appropriately for NN_embed[・].
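The iteration in Eq. (1) can be illustrated with a short sketch. This is a minimal NumPy sketch, not the embodiment's network: nn_embed, toy_nn_embed, and the fixed speaker count are illustrative placeholders standing in for NN_embed[・] and for the repetition count I_b that is set by the repetition control unit 19.

import numpy as np

def extract_speaker_vectors(X_b, nn_embed, h_b0, num_speakers):
    # Repeat Eq. (1): a_{b,i}, h_{b,i} = NN_embed(X_b, h_{b,i-1}) for i = 1, ..., I_b.
    vectors, h = [], h_b0
    for _ in range(num_speakers):
        a_bi, h = nn_embed(X_b, h)
        vectors.append(a_bi)
    return vectors

def toy_nn_embed(X_b, h):
    # Toy stand-in: a state-dependent projection of the block's mean feature.
    a = np.tanh(X_b.mean(axis=0) + h)
    return a, a

X_b = np.random.randn(160, 64)                     # one time block (frames x feature dims)
feats = extract_speaker_vectors(X_b, toy_nn_embed, np.zeros(64), num_speakers=2)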
 外部メモリ部12は、話者特徴ベクトル抽出部11によって抽出された話者特徴ベクトルを格納するためのメモリである。外部メモリ部12は、複数のメモリ番地を有し、1つのメモリ番地に1人の話者の話者特徴ベクトルを格納する。外部メモリ部12は、RAM(Random Access Memory)、フラッシュメモリ、NVSRAM(Non Volatile Static Random Access Memory)等のデータを書き換え可能な半導体メモリであってもよい。外部メモリ部12は、HDD(Hard Disk Drive)、SSD(Solid State Drive)、光ディスク等の記憶装置であってもよい。 The external memory unit 12 is a memory for storing the speaker feature vector extracted by the speaker feature vector extraction unit 11. The external memory unit 12 has a plurality of memory addresses, and stores a speaker feature vector of one speaker in one memory address. The external memory unit 12 may be a semiconductor memory in which data such as RAM (Random Access Memory), flash memory, NVSRAM (Non Volatile Static Random Access Memory) can be rewritten. The external memory unit 12 may be a storage device such as an HDD (Hard Disk Drive), SSD (Solid State Drive), or an optical disk.
 メモリ制御指示部13は、話者特徴ベクトル抽出部11によって話者特徴ベクトルが抽出されるごとに、その話者特徴ベクトルを受信する。 The memory control instruction unit 13 receives the speaker feature vector each time the speaker feature vector is extracted by the speaker feature vector extraction unit 11.
 メモリ制御指示部13は、受信した話者特徴ベクトルが、新規話者のものである場合、外部メモリ部12の新しいメモリ番地にそのベクトルを書き込むように指示を行う。すなわち、メモリ制御指示部13は、今までの時間ブロックには出現したことのない話者の話者特徴量ベクトルが抽出された場合は、外部メモリ部12の未使用のメモリスロットに、この話者特徴ベクトルを書き込むことを指示する。 When the received speaker feature vector is that of a new speaker, the memory control instruction unit 13 instructs that the vector be written to a new memory address of the external memory unit 12. That is, when a speaker feature vector of a speaker who has not appeared in any previous time block is extracted, the memory control instruction unit 13 instructs that this speaker feature vector be written to an unused memory slot of the external memory unit 12.
 一方、メモリ制御指示部13は、受信した話者特徴ベクトルが、過去に出現した話者のものである場合、外部メモリ部12のメモリ番地のうち、その過去に出現した話者に対応するメモリ番地に、その話者特徴ベクトルを適切な形で書き込むように指示を行う。すなわち、メモリ制御指示部13は、今までの時間ブロックにすでに出現したことのある話者の話者特徴量ベクトルが抽出された場合、外部メモリ部12のこの話者に対応するメモリスロットに同話者特徴ベクトルを書き込むことを指示する。 On the other hand, when the received speaker feature vector is that of a speaker who has appeared in the past, the memory control instruction unit 13 instructs that the speaker feature vector be written, in an appropriate form, to the memory address of the external memory unit 12 that corresponds to that previously appearing speaker. That is, when a speaker feature vector of a speaker who has already appeared in a previous time block is extracted, the memory control instruction unit 13 instructs that the speaker feature vector be written to the memory slot of the external memory unit 12 corresponding to this speaker.
 メモリ制御指示部13は、参考文献2~4に記載されたニューラルネットワークに基づくものを用いて、外部メモリ部12への話者特徴ベクトルの書き込み、読み込み指示を行う。
参考文献2:A. Graves, G. Wayne, and I. Danihelka. “Neural Turing Machines”, arxiv:1410.5401., 2014.
参考文献3:Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez; Edward Grefenstette, and Tiago Ramalho, “Hybrid computing using a neural network with dynamic external memory”. Nature. 538 (7626): 471-476. (2016-10-12).
参考文献4:Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus, “End-To-End Memory Networks”, Advances in Neural Information Processing Systems 28, pp,2440-2448, 2015
The memory control instruction unit 13 issues instructions for writing speaker feature vectors to, and reading them from, the external memory unit 12, using a neural-network-based mechanism such as those described in References 2 to 4.
Reference 2: A. Graves, G. Wayne, and I. Danihelka. “Neural Turing Machines”, arxiv: 1410.5401., 2014.
Reference 3: Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez; Edward Grefenstette, and Tiago Ramalho, “Hybrid computing using a neural network with dynamic external memory”. Nature. 538 (7626): 471-476. (2016-10-12).
Reference 4: Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus, “End-To-End Memory Networks”, Advances in Neural Information Processing Systems 28, pp, 2440-2448, 2015
 外部メモリ書き込み部14は、メモリ制御指示部13からの指示を受けて外部メモリ部12に話者特徴ベクトルの書き込みを行う。外部メモリ読み込み部15は、メモリ制御指示部13からの指示を受けて、外部メモリ部12から各話者に関する話者特徴ベクトルを読み込む。このように、信号処理装置10は、ニューラルネットワークを用いて外部メモリ部12との情報のやり取りを実装することで、全てのシステムを、誤差伝搬法により最適化することが可能となる。 The external memory writing unit 14 writes the speaker feature vector to the external memory unit 12 in response to an instruction from the memory control instruction unit 13. The external memory reading unit 15 receives an instruction from the memory control instruction unit 13 and reads a speaker feature vector for each speaker from the external memory unit 12. In this way, the signal processing device 10 can optimize all the systems by the error propagation method by implementing the exchange of information with the external memory unit 12 using the neural network.
 メモリ制御指示部13は、外部メモリ部12への書き込みを行う外部メモリ書き込み部14、外部メモリ部12への読み込みを行う外部メモリ読み込み部15に対する指示のために、参考文献2~4等に記載の手順に倣って、指示ベクトルを生成する。 Following the procedures described in References 2 to 4 and the like, the memory control instruction unit 13 generates an instruction vector in order to instruct the external memory writing unit 14, which writes to the external memory unit 12, and the external memory reading unit 15, which reads from the external memory unit 12.
 例えば、メモリ制御指示部13は、まず、話者特徴ベクトル抽出部からの出力a_{b,i}と、メモリ制御指示部13の過去の出力(指示ベクトル)w_{b,i-1}とを基に、ニューラルネットワークを用いて1×Mのサイズのキーベクトルk_{b,i}を生成する。また、メモリ制御指示部13は、同様に、式(2)の計算に用いるβ_{b,i}も生成する。その上で、メモリ制御指示部13は、同キーベクトルと現在の外部メモリM_{b,i}の各列nとの近さを測り、指示ベクトルw_{b,i}を式(2)のように計算する。 For example, the memory control instruction unit 13 first generates, using a neural network, a key vector k_{b,i} of size 1×M based on the output a_{b,i} from the speaker feature vector extraction unit and the past output (instruction vector) w_{b,i-1} of the memory control instruction unit 13. The memory control instruction unit 13 likewise generates β_{b,i}, which is used in the calculation of Eq. (2). The memory control instruction unit 13 then measures the closeness between this key vector and each column n of the current external memory M_{b,i}, and computes the instruction vector w_{b,i} as in Eq. (2).
[数2] (Equation (2))
 式(2)において、M_{b,i}は、b番目の時間ブロックにおけるi番目の話者特徴ベクトルを書き込む前の外部メモリ行列(N×Mのサイズ)である。Nをメモリ番地の総数とし、Mを番地に書き込めるベクトルの長さとする。 In Eq. (2), M_{b,i} is the external memory matrix (of size N×M) before the i-th speaker feature vector of the b-th time block is written. N is the total number of memory addresses, and M is the length of the vector that can be written to each address.
 キーベクトルとメモリの各列との近さを測る尺度としては、式(3)に示すコサイン類似度などが考えられる。 As a measure for measuring the closeness between the key vector and each column of the memory, the cosine similarity shown in the equation (3) can be considered.
[数3] (Equation (3))
 w_{b,i}を1×N次元の指示ベクトルとすると、w_{b,i}(n)は、その各要素であり、式(4)に示す特徴を有する。 Let w_{b,i} be a 1×N-dimensional instruction vector; w_{b,i}(n) denotes each of its elements, and it has the properties shown in Eq. (4).
[数4] (Equation (4))
 また、指示ベクトルw_{b,i}中の特定の要素が1に近くなるように(ベクトルがスパースになるように)、式(5)に示す操作を加える。 Further, the operation shown in Eq. (5) is applied so that a specific element of the instruction vector w_{b,i} approaches 1 (that is, so that the vector becomes sparse).
[数5] (Equation (5))
 式(5)において、cは、スパース性を高めるための定数である。 In equation (5), c is a constant for enhancing sparsity.
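A minimal NumPy sketch of the addressing in Eqs. (2)-(5), under the assumption that memory slots are stored as rows of an N×M matrix and that the key k_{b,i} and scalar β_{b,i} have already been produced by the controller network. The exact form of Eq. (5) is not reproduced here; exponentiation by the constant c followed by renormalization is used only as one plausible sharpening choice.

import numpy as np

def cosine_similarity(k, M):
    # Eq. (3): cosine similarity between the key and every memory slot.
    return (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)

def instruction_vector(k, beta, M, c=2.0):
    sim = cosine_similarity(k, M)
    w = np.exp(beta * sim)
    w = w / w.sum()            # Eq. (2), with the properties of Eq. (4): w(n) >= 0, sums to 1
    w = w ** c
    return w / w.sum()         # Eq. (5)-style sharpening with constant c (assumed form)

M = np.random.randn(8, 64)     # N = 8 memory addresses, each holding an M = 64-dim vector
w = instruction_vector(np.random.randn(64), beta=5.0, M=M)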
 外部メモリ書き込み部14は、メモリ制御指示部13から出力された指示ベクトルwb,iを基に、話者特徴ベクトルの外部メモリ部12への書き込み及び更新を行う。 The external memory writing unit 14 writes and updates the speaker feature vector to the external memory unit 12 based on the instruction vectors w b and i output from the memory control instruction unit 13.
 外部メモリ書き込み部14による書き込み処理は、以下に説明するように、消去処理と書き込み処理とを対として行われる。話者特徴ベクトル抽出部11において抽出された話者特徴ベクトルa_{b,i}は、外部メモリ書き込み部14に渡され、適切な形で外部メモリ部12に書き出される。例えば、外部メモリ書き込み部14は、1×Nのベクトルである消去ベクトルe_{b,i}と指示ベクトルw_{b,i}とを基に、式(6)のようにメモリの消去を行う。なお、この消去ベクトルe_{b,i}もメモリ制御指示部13からの出力である。 The writing process performed by the external memory writing unit 14 is carried out as a pair of an erase process and a write process, as described below. The speaker feature vector a_{b,i} extracted by the speaker feature vector extraction unit 11 is passed to the external memory writing unit 14 and written into the external memory unit 12 in an appropriate form. For example, the external memory writing unit 14 erases the memory as in Eq. (6), based on the erase vector e_{b,i}, which is a 1×N vector, and the instruction vector w_{b,i}. This erase vector e_{b,i} is also an output of the memory control instruction unit 13.
[数6] (Equation (6))
 lは、N個の1からなる1×Nのベクトルである。外部メモリと[・]内の成分との掛け算は、メモリスロットにおいて、point-wiseの掛け算を行う。 l is a 1×N vector consisting of N ones. The multiplication between the external memory and the term in [・] is performed point-wise for each memory slot.
 なお、消去ベクトルe_{b,i}は、式(7)のように設定することが多い。 The erase vector e_{b,i} is often set as in Eq. (7).
[数7] (Equation (7))
 つまり、式(6)、式(7)は、消去ベクトルの要素e_{b,i}(n)と、指示ベクトルの要素w_{b,i}(n)の両方が1の値である時に、そのメモリスロットに関する情報は0にリセットされることを示している。そして、上記の消去処理を行った上で、式(8)に示すように新しい情報ベクトルa_{b,i}(ここでは、話者特徴ベクトル)が、外部メモリ書き込み部14によって外部メモリ部12に書き込まれる。 That is, Eqs. (6) and (7) indicate that when both the element e_{b,i}(n) of the erase vector and the element w_{b,i}(n) of the instruction vector take the value 1, the information in that memory slot is reset to 0. Then, after the above erase process has been performed, a new information vector a_{b,i} (here, the speaker feature vector) is written to the external memory unit 12 by the external memory writing unit 14, as shown in Eq. (8).
[数8] (Equation (8))
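A sketch of the erase-and-write pair of Eqs. (6)-(8), following the description above that a slot is cleared when both w_{b,i}(n) and e_{b,i}(n) are close to 1; the matrix layout (slots as rows) and the variable names are assumptions, not the embodiment's implementation.

import numpy as np

def erase_and_write(M, w, e, a):
    # Eqs. (6)/(7): scale slot n by (1 - w(n) e(n)); Eq. (8): add w(n) * a to slot n.
    M_tilde = M * (1.0 - (w * e))[:, None]
    return M_tilde + np.outer(w, a)

M = np.zeros((8, 64))
w = np.eye(8)[2]               # instruction vector addressing slot 2
M = erase_and_write(M, w, e=np.ones(8), a=np.random.randn(64))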
 外部メモリ読み込み部15は、メモリ制御指示部13からの指示を受けて、外部メモリ部12から各話者に関する話者特徴ベクトルを読み込み、メモリ制御指示部13に出力する。外部メモリ読み込み部15は、メモリ制御指示部13から出力された指示ベクトルwb,iを基に、外部メモリ部12から、更新済みの話者特徴ベクトルの読み込みを行う。 The external memory reading unit 15 receives an instruction from the memory control instruction unit 13, reads a speaker feature vector for each speaker from the external memory unit 12, and outputs the speaker feature vector to the memory control instruction unit 13. The external memory reading unit 15 reads the updated speaker feature vector from the external memory unit 12 based on the instruction vectors w b and i output from the memory control instruction unit 13.
 外部メモリ読み込み部15は、指示ベクトルwb,iを用いることで、式(9)に示すように、外部メモリ部12からデータを読み込むことができる。 The external memory reading unit 15 can read data from the external memory unit 12 as shown in the equation (9) by using the instruction vectors w b and i .
[数9] (Equation (9))
 式(9)に示すように、行列Mと指示ベクトルw_{b,i}との乗算で、読み込み対象の話者特徴ベクトルr_{b,i}を出力することができる。指示ベクトルのn番目の要素w_{b,i}(n)の値が0であることは、n番目の番地からは情報を読み込まないという指示に対応するため、指示ベクトルの要素w_{b,i}(n)の値が1に近いメモリ番地の情報が主に読み込まれ、0に近いメモリ番地の情報は相対的に読み込まれない。 As shown in Eq. (9), the speaker feature vector r_{b,i} to be read can be obtained by multiplying the matrix M by the instruction vector w_{b,i}. Since a value of 0 for the n-th element w_{b,i}(n) of the instruction vector corresponds to an instruction not to read information from the n-th address, information at memory addresses whose instruction-vector element w_{b,i}(n) is close to 1 is mainly read, while information at addresses whose element is close to 0 is read relatively little.
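Reading (Eq. (9)) is then just the w-weighted combination of the memory slots; a one-line sketch under the same assumed row layout as above:

import numpy as np

def read_speaker_vector(M, w):
    # Eq. (9): r_{b,i} = w_{b,i} M; slots with w(n) near 0 contribute almost nothing.
    return w @ M

r = read_speaker_vector(np.random.randn(8, 64), np.eye(8)[2])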
 信号処理装置10では、このように、外部メモリ書き込み部14による書き込み処理、外部メモリ読み込み部15による読み込み処理を、i=1からi=I_bまで繰り返すことで、外部メモリ部12への書き込み及び読み込みを全ての話者について行うことができる。そして、これら処理が時間ブロックごとに実行されるため、全ての時間ブロックの処理後の外部メモリ部12の書き込みが行われた番地の数は、全時間ブロック中に現れた話者数に相当する。したがって、全ての時間ブロックの処理後に外部メモリ部12の状態を確認することによって、全時間ブロック中に現れた話者数を確認することができる。 In the signal processing device 10, by repeating the writing process by the external memory writing unit 14 and the reading process by the external memory reading unit 15 from i = 1 to i = I_b in this way, writing to and reading from the external memory unit 12 can be performed for all speakers. Since these processes are executed for each time block, the number of addresses of the external memory unit 12 that have been written to after all time blocks have been processed corresponds to the number of speakers that appeared over all the time blocks. Therefore, by checking the state of the external memory unit 12 after all time blocks have been processed, the number of speakers that appeared over all the time blocks can be determined.
 音源抽出部16は、観測信号と、メモリ制御指示部13から出力された話者特徴ベクトルに基づき、この話者特徴ベクトルに対応する話者の音声を観測信号から抽出する(詳細は、非特許文献3参照)。 Based on the observation signal and the speaker feature vector output from the memory control instruction unit 13, the sound source extraction unit 16 extracts, from the observation signal, the voice of the speaker corresponding to this speaker feature vector (for details, see Non-Patent Document 3).
 音源抽出部16は、観測信号X_bおよび外部メモリ部12から読みだされた話者特徴ベクトルr_{b,i}を用いて、式(10)に示すように分離音声^S_{b,i}を抽出する。 Using the observation signal X_b and the speaker feature vector r_{b,i} read from the external memory unit 12, the sound source extraction unit 16 extracts the separated speech ^S_{b,i} as shown in Eq. (10).
[数10] (Equation (10))
 NN_extract[・]には、BLSTMや畳み込みニューラルネットワークなどのニューラルネットワークを用いるのが一般的である。 A neural network such as a BLSTM or a convolutional neural network is generally used for NN_extract[・].
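A minimal interface sketch of Eq. (10); the toy mask-based stand-in for NN_extract is only one common choice and is not the network of the embodiment.

import numpy as np

def extract_source(X_b, r_bi, nn_extract):
    # Eq. (10): ^S_{b,i} = NN_extract(X_b, r_{b,i}).
    return nn_extract(X_b, r_bi)

def toy_nn_extract(X, r):
    # Toy stand-in: a speaker-conditioned sigmoid mask applied to the observation.
    mask = 1.0 / (1.0 + np.exp(-(X + r)))
    return mask * X

S_hat = extract_source(np.random.randn(160, 64), np.random.randn(64), toy_nn_extract)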
 なお、時間ブロックbにおいて抽出された分離音声^S_{b,i}(i=1,・・・,I_b)と、時間ブロックb´(b≠b´)で抽出された分離音声^S_{b´,i}(i=1,・・・,I_{b´})とが、同じ話者のものであるか否かは、外部メモリ部12に対する指示ベクトルw_{b,i}(i=1,・・・,I_b)と、指示ベクトルw_{b´,i}(i=1,・・・,I_{b´})の各iにおいて最大値が検出されるインデックスが同じか否かを調べることで判定できる。言い換えれば、メモリ制御指示部13は、話者照合の役割を担っている。このため、話者特徴ベクトル抽出部11において、各時間ブロック間で異なる順序で話者特徴ベクトルが抽出されても、前述の外部メモリ部12への指示ベクトル内の最大値検出と比較処理とを用いることで、時間ブロック間で話者を追従することができる。 Whether the separated speech ^S_{b,i} (i = 1, ..., I_b) extracted in time block b and the separated speech ^S_{b´,i} (i = 1, ..., I_{b´}) extracted in time block b´ (b ≠ b´) belong to the same speaker can be determined by checking whether, for each i, the index at which the maximum value is detected is the same for the instruction vector w_{b,i} (i = 1, ..., I_b) and the instruction vector w_{b´,i} (i = 1, ..., I_{b´}) given to the external memory unit 12. In other words, the memory control instruction unit 13 plays the role of speaker verification. Therefore, even if the speaker feature vector extraction unit 11 extracts the speaker feature vectors in a different order in different time blocks, speakers can be tracked across time blocks by using the above-described detection and comparison of the maximum value in the instruction vectors for the external memory unit 12.
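The cross-block speaker matching described above reduces to comparing the slot indices at which the instruction vectors peak; a minimal sketch, with illustrative vectors:

import numpy as np

def same_speaker(w_bi, w_bpi):
    # Same speaker <=> the two instruction vectors peak at the same memory slot.
    return int(np.argmax(w_bi)) == int(np.argmax(w_bpi))

print(same_speaker(np.array([0.9, 0.05, 0.05]), np.array([0.8, 0.1, 0.1])))   # True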
 発話区間検出部17は、観測信号と、メモリ制御指示部13から出力された話者特徴ベクトルに基づき、この話者特徴ベクトルに対応する話者の発話区間検出結果を出力する(詳細は、参考文献1参照)。 Based on the observation signal and the speaker feature vector output from the memory control instruction unit 13, the utterance section detection unit 17 outputs an utterance section detection result for the speaker corresponding to this speaker feature vector (for details, see Reference 1).
 音声認識部18は、観測信号と、メモリ制御指示部13から出力された話者特徴ベクトルに基づき、この話者特徴ベクトルに対応する話者の音声認識結果を出力する(詳細は、参考文献5参照)。
参考文献5:Marc Delcroix, Shinji Watanabe, Tsubasa Ochiai,Keisuke Kinoshita, Shigeki Karita, Atsunori Ogawa, Tomohiro Nakatani, “End-to-end SpeakerBeam for single channel target speech recognition”, Interspeech2019, pp.451-455, 2019.
Based on the observation signal and the speaker feature vector output from the memory control instruction unit 13, the voice recognition unit 18 outputs a speech recognition result for the speaker corresponding to this speaker feature vector (for details, see Reference 5).
Reference 5: Marc Delcroix, Shinji Watanabe, Tsubasa Ochiai, Keisuke Kinoshita, Shigeki Karita, Atsunori Ogawa, Tomohiro Nakatani, “End-to-end SpeakerBeam for single channel target speech recognition”, Interspeech2019, pp.451-455, 2019.
 なお、音源抽出部16、発話区間検出部17及び音声認識部18は、音声を処理する処理部の一例である。また、実施形態に係る信号処理装置として、音源抽出部16、発話区間検出部17及び音声認識部18の複数の処理部を有する信号処理装置10について説明したが、これに限らない。例えば、実施形態に係る信号処理装置は、音源抽出部16を有する信号処理装置10A(図2参照)、発話区間検出部17を有する信号処理装置10B(図3参照)、または、音声認識部18を有する信号処理装置10C(図4参照)であってもよい。 The sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18 are examples of processing units that process speech. Although the signal processing device 10 having the plurality of processing units, namely the sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18, has been described as the signal processing device according to the embodiment, the embodiment is not limited to this. For example, the signal processing device according to the embodiment may be a signal processing device 10A having the sound source extraction unit 16 (see FIG. 2), a signal processing device 10B having the utterance section detection unit 17 (see FIG. 3), or a signal processing device 10C having the voice recognition unit 18 (see FIG. 4).
 繰り返し制御部19は、話者特徴ベクトル抽出部11の抽出処理の状態、または、音源抽出部16と発話区間検出部17と音声認識部18とによる処理結果を基に、話者特徴ベクトル抽出部11による抽出処理の繰り返し数を決定する。言い換えると、繰り返し制御部19は、例えば、話者特徴ベクトル抽出部11の話者特徴ベクトルの抽出処理の状態を基に、話者特徴ベクトル抽出部11による抽出処理の繰り返し数を決定する。または、繰り返し制御部19は、発話区間検出部17及び音声認識部18からの出力結果を用いて、話者特徴ベクトル抽出部11の話者特徴ベクトルの抽出処理の繰り返し数を決定する。なお、理想的には、話者特徴ベクトル抽出部11は、各ブロックbにおいてI_b個の話者特徴ベクトルを抽出することが望ましい。 The repetition control unit 19 determines the number of repetitions of the extraction process by the speaker feature vector extraction unit 11, based on the state of the extraction process of the speaker feature vector extraction unit 11 or on the processing results of the sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18. In other words, the repetition control unit 19 determines the number of repetitions of the extraction process by the speaker feature vector extraction unit 11 based on, for example, the state of the speaker feature vector extraction process of the speaker feature vector extraction unit 11. Alternatively, the repetition control unit 19 determines the number of repetitions of the speaker feature vector extraction process of the speaker feature vector extraction unit 11 using the output results from the utterance section detection unit 17 and the voice recognition unit 18. Ideally, the speaker feature vector extraction unit 11 should extract I_b speaker feature vectors in each block b.
 例えば、音源抽出部16のみを有する信号処理装置10A(図2参照)の場合、繰り返し制御部19は、話者特徴ベクトル抽出部11から出力される内部状態ベクトルh_{b,i}、観測信号X_b、分離音声^S_{b,i}を用いて、以下の式(11)のような繰り返しを停止すべきか否かを示すスカラー値^f_{b,i}(0≦^f_{b,i}≦1)を算出する。スカラー値^f_{b,i}が所定の値よりも大きければ繰り返しを止め、所定の値よりも低ければ繰り返しを継続する。 For example, in the case of the signal processing device 10A (see FIG. 2) having only the sound source extraction unit 16, the repetition control unit 19 uses the internal state vector h_{b,i} output from the speaker feature vector extraction unit 11, the observation signal X_b, and the separated speech ^S_{b,i} to calculate a scalar value ^f_{b,i} (0 ≦ ^f_{b,i} ≦ 1) that indicates whether the repetition should be stopped, as in Eq. (11) below. If the scalar value ^f_{b,i} is larger than a predetermined value, the repetition is stopped; if it is lower than the predetermined value, the repetition is continued.
[数11] (Equation (11))
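A sketch of the stop decision around Eq. (11): a scalar ^f_{b,i} in [0, 1] is computed from the current state and compared with a threshold. The stand-in stop network and the threshold value 0.5 are assumptions, not values from the embodiment.

import numpy as np

def should_stop(h_bi, X_b, S_hat_bi, stop_net, threshold=0.5):
    f_bi = stop_net(h_bi, X_b, S_hat_bi)       # Eq. (11): 0 <= ^f_{b,i} <= 1
    return f_bi > threshold

toy_stop_net = lambda h, X, S: 1.0 / (1.0 + np.exp(-h.mean()))
stop = should_stop(np.random.randn(64), np.random.randn(160, 64),
                   np.random.randn(160, 64), toy_stop_net)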
 繰り返し制御部19は、話者特徴ベクトル抽出部11が話者特徴ベクトルを抽出する度に、異なる補助情報を話者特徴ベクトル抽出部11のニューラルネットワークに入力することで、話者特徴ベクトル抽出部11に、異なる音源に対応する話者特徴ベクトルの抽出結果を出力させる。 Each time the speaker feature vector extraction unit 11 extracts a speaker feature vector, the repetition control unit 19 inputs different auxiliary information to the neural network of the speaker feature vector extraction unit 11, thereby causing the speaker feature vector extraction unit 11 to output extraction results of speaker feature vectors corresponding to different sound sources.
 例えば、繰り返し制御部19は、入力される音響信号に含まれる全ての話者特徴ベクトルを仮想的に認識している場合を例に説明する。繰り返し制御部19は、話者特徴ベクトル抽出部11が話者特徴ベクトルを抽出する度に、この話者特徴ベクトルがいずれの話者に対応するかを認識する。そして、繰り返し制御部19は、未抽出の話者がある場合には、次の話者の話者特徴ベクトルに関する補助情報を話者特徴ベクトル抽出部11に入力して、話者特徴ベクトル抽出部11に次の話者の話者特徴ベクトルを抽出させる。繰り返し制御部19は、この時間ブロックの観測信号の全話者に対する話者特徴ベクトルの抽出処理が終われば、話者特徴ベクトル抽出部11による抽出処理の繰り返しを停止する。 For example, consider the case where the repetition control unit 19 virtually recognizes all the speaker feature vectors contained in the input acoustic signal. Each time the speaker feature vector extraction unit 11 extracts a speaker feature vector, the repetition control unit 19 recognizes which speaker this speaker feature vector corresponds to. Then, if there is a speaker who has not yet been extracted, the repetition control unit 19 inputs auxiliary information on the speaker feature vector of the next speaker to the speaker feature vector extraction unit 11, and causes the speaker feature vector extraction unit 11 to extract the speaker feature vector of the next speaker. When the speaker feature vector extraction process has been completed for all speakers in the observation signal of this time block, the repetition control unit 19 stops the repetition of the extraction process by the speaker feature vector extraction unit 11.
 学習部20は、学習用データを用いて、信号処理装置10が使用するパラメータを最適化する。学習部20は、信号処理装置10を構成するニューラルネットワークのパラメータを所定の目的関数に基づき最適化する。 The learning unit 20 optimizes the parameters used by the signal processing device 10 using the learning data. The learning unit 20 optimizes the parameters of the neural network constituting the signal processing device 10 based on a predetermined objective function.
 学習データは、入力信号(観測信号)と、この入力信号に含まれる各音源に対応する正解クリーン信号、正解発話時間情報、正解発話内容と、入力信号に含まれる人数の総数情報(総話者数)Iとから成る。学習部20は、学習データを基に、信号処理装置10の出力と正解情報との誤差が小さくなるようにパラメータを学習する。 The learning data consists of an input signal (observation signal); the correct clean signal, correct utterance time information, and correct utterance content corresponding to each sound source contained in this input signal; and the total number of speakers contained in the input signal (total speaker count) I. Based on the learning data, the learning unit 20 learns the parameters so that the error between the output of the signal processing device 10 and the correct information becomes small.
 具体的には、図2の信号処理装置10Aのパラメータを最適化するために学習処理を行う場合、入力信号X_b(b=1,・・・,B)を与えた際に、各時間ブロックにおいてシステムの出力結果(分離音声)^S_{b,i}が正解の分離音声S_{b,i}に近くなるように、式(12)の自乗誤差基準の損失関数を設ける。 Specifically, when training is performed to optimize the parameters of the signal processing device 10A of FIG. 2, a squared-error-based loss function as in Eq. (12) is provided so that, when an input signal X_b (b = 1, ..., B) is given, the system output (separated speech) ^S_{b,i} in each time block becomes close to the correct separated speech S_{b,i}.
[数12] (Equation (12))
 また、各時間ブロックbにおいて、正しい音源数を推定するために繰り返し回数を決定するためのスカラー値について、式(13)のクロスエントロピー損失関数を設ける。 Further, in each time block b, the cross entropy loss function of the equation (13) is provided for the scalar value for determining the number of repetitions in order to estimate the correct number of sound sources.
[数13] (Equation (13))
 なお、f_{b,i}は、i=I_bの場合のみ1を取り、それ以外は0を取る値である。 Note that f_{b,i} takes the value 1 only when i = I_b, and takes the value 0 otherwise.
 そして、学習部20は、外部メモリ内で使われるアドレスの総数が総話者数Iと同じ値となるように、式(14)も同時に小さくなるようにパラメータを更新する。 Then, the learning unit 20 updates the parameter so that the total number of addresses used in the external memory becomes the same value as the total number of speakers I, and the equation (14) also becomes small at the same time.
[数14] (Equation (14))
 式(14)において、Tcountは事前設定した閾値であり、一般的には1を設定する。min(・)は、入力の値がTcountよりも大きい場合はTcountを出力し、Tcountよりも小さい場合、そのままの値を出力する関数である。式(14)は、外部メモリの中で使用されるメモリ番地の数が総話者数Iと合致するように促す損失関数である。 In the equation (14), T count is a preset threshold value, and is generally set to 1. min (・) is a function that outputs a T count when the input value is larger than the T count , and outputs the value as it is when the input value is smaller than the T count . Equation (14) is a loss function that prompts the number of memory addresses used in the external memory to match the total number of speakers I.
 学習部20は、最終的には、全ての損失関数の値を合算した式(15)の値が小さくなるようにニューラルネットワークのパラメータを学習する。 Finally, the learning unit 20 learns the parameters of the neural network so that the value of the equation (15), which is the sum of the values of all the loss functions, becomes small.
[数15] (Equation (15))
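A sketch of how the losses of Eqs. (12)-(15) could be combined, assuming a plain squared-error signal loss, a binary cross-entropy on the stop flag, and a clipped slot-usage penalty; the exact forms, any permutation handling, and weighting are not reproduced from the embodiment and are assumptions.

import numpy as np

def total_loss(S_hat, S, f_hat, f, slot_usage, I_total, t_count=1.0):
    l_signal = np.mean((S_hat - S) ** 2)                                   # Eq. (12), squared error
    l_stop = -np.mean(f * np.log(f_hat + 1e-8)
                      + (1 - f) * np.log(1 - f_hat + 1e-8))                # Eq. (13), cross entropy
    used = np.sum(np.minimum(slot_usage, t_count))                         # min(., T_count) as in Eq. (14)
    l_count = (used - I_total) ** 2                                        # push used slot count toward I
    return l_signal + l_stop + l_count                                     # Eq. (15): sum of all losses

loss = total_loss(np.zeros((160, 64)), np.zeros((160, 64)),
                  np.array([0.2, 0.9]), np.array([0.0, 1.0]),
                  slot_usage=np.array([1.2, 0.8, 0.0]), I_total=2)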
 なお、学習部20は、発話区間検出部17を有する信号処理装置10B、音声認識部18を有する信号処理装置10C、信号処理装置10に対しては、具備する処理部に応じた損失関数を用いて、ニューラルネットワークのパラメータを最適化すればよい。発話区間検出部17を有する信号処理装置10Bについては参考文献6を、音声認識部18を有する信号処理装置10Cについては参考文献7を参照されたい。
参考文献6:Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, Shinji Watanabe, “End-to-End Neural Speaker Diarization with Permutation-free Objectives”, Proc. Interspeech, pp. 4300-4304, 2019.
参考文献7:Shigeki Karita et al., “A comparative study on transformer vs RNN in speech applications”, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
For the signal processing device 10B having the utterance section detection unit 17, the signal processing device 10C having the voice recognition unit 18, and the signal processing device 10, the learning unit 20 may optimize the parameters of the neural networks using a loss function corresponding to the processing unit(s) provided. See Reference 6 for the signal processing device 10B having the utterance section detection unit 17, and Reference 7 for the signal processing device 10C having the voice recognition unit 18.
Reference 6: Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, Shinji Watanabe, “End-to-End Neural Speaker Diarization with Permutation-free Objectives”, Proc. Interspeech, pp. 4300-4304, 2019.
Reference 7: Shigeki Karita et al., “A comparative study on transformer vs RNN in speech applications”, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
[信号処理の処理手順]
 次に、本実施形態に係る信号処理方法の処理手順について説明する。図5は、実施形態に係る信号処理方法の処理手順を示すフローチャートである。
[Signal processing procedure]
Next, the processing procedure of the signal processing method according to the present embodiment will be described. FIG. 5 is a flowchart showing a processing procedure of the signal processing method according to the embodiment.
 図5に示すように、信号処理装置10は、音響信号の入力を受け付けると(ステップS1)、音響信号を時間ブロックに分割する(ステップS2)。そして、信号処理装置10は、時間ブロックの数bをb=1に初期化する(ステップS3)。 As shown in FIG. 5, when the signal processing device 10 receives the input of the acoustic signal (step S1), the signal processing device 10 divides the acoustic signal into time blocks (step S2). Then, the signal processing device 10 initializes the number b of the time blocks to b = 1 (step S3).
 時間ブロックbの音響信号(観測信号)が話者特徴ベクトル抽出部11に入力されると、話者特徴ベクトル抽出部11は、時間ブロックbの観測信号から、この時間ブロックbに存在するある一人の話者の話者特徴ベクトルを推定し、抽出する(ステップS4)。 When the acoustic signal (observation signal) of time block b is input to the speaker feature vector extraction unit 11, the speaker feature vector extraction unit 11 estimates and extracts, from the observation signal of time block b, the speaker feature vector of one speaker present in this time block b (step S4).
 そして、メモリ制御指示部13は、話者特徴ベクトル抽出部11が抽出した話者特徴ベクトルが、今までの時間ブロックには出現したことのない話者の話者特徴量ベクトルであるか否かを判定する(ステップS5)。 Then, the memory control instruction unit 13 determines whether the speaker feature vector extracted by the speaker feature vector extraction unit 11 is that of a speaker who has not appeared in any previous time block (step S5).
 今までの時間ブロックには出現したことのない話者の話者特徴量ベクトルである場合(ステップS5:Yes)について説明する。この場合、メモリ制御指示部13は、外部メモリ部12の未使用のメモリスロットに、この話者特徴ベクトルを書き込むことを指示し、外部メモリ書き込み部14は、外部メモリ部12の未使用のメモリスロットに話者特徴ベクトルを書き込む(ステップS6)。 The case where the vector is that of a speaker who has not appeared in any previous time block (step S5: Yes) is described first. In this case, the memory control instruction unit 13 instructs that this speaker feature vector be written to an unused memory slot of the external memory unit 12, and the external memory writing unit 14 writes the speaker feature vector to an unused memory slot of the external memory unit 12 (step S6).
 今までの時間ブロックに出現したことのある話者の話者特徴量ベクトルである場合(ステップS5:No)について説明する。この場合、メモリ制御指示部13は、外部メモリ部12のこの話者に対応するメモリスロットに、話者特徴ベクトルを書き込むことを指示し、外部メモリ書き込み部14は、外部メモリ部12のこの話者に対応するメモリスロットに話者特徴ベクトルを書き込む(ステップS7)。 The case where the vector is that of a speaker who has already appeared in a previous time block (step S5: No) is described next. In this case, the memory control instruction unit 13 instructs that the speaker feature vector be written to the memory slot of the external memory unit 12 corresponding to this speaker, and the external memory writing unit 14 writes the speaker feature vector to the memory slot of the external memory unit 12 corresponding to this speaker (step S7).
 そして、外部メモリ読み込み部15は、メモリ制御指示部13による指示にしたがって、外部メモリ部12から、話者特徴ベクトル抽出部11が抽出した一人の話者に対応する話者特徴ベクトルを読み込み(ステップS8)、メモリ制御指示部13に出力する。 Then, in accordance with the instruction from the memory control instruction unit 13, the external memory reading unit 15 reads, from the external memory unit 12, the speaker feature vector corresponding to the one speaker extracted by the speaker feature vector extraction unit 11 (step S8), and outputs it to the memory control instruction unit 13.
 そして、音源抽出部16は、観測信号と、メモリ制御指示部13から出力された一人の話者に対応する話者特徴ベクトルに基づき、この話者特徴ベクトルに対応する話者の音声を観測信号から抽出する(ステップS9)。 Then, based on the observation signal and the speaker feature vector corresponding to the one speaker output from the memory control instruction unit 13, the sound source extraction unit 16 extracts, from the observation signal, the voice of the speaker corresponding to this speaker feature vector (step S9).
 また、発話区間検出部17は、観測信号と、メモリ制御指示部13から出力された一人の話者に対応する話者特徴ベクトルに基づき、この話者特徴ベクトルに対応する話者の発話区間検出結果を出力する(ステップS10)。 Further, based on the observation signal and the speaker feature vector corresponding to the one speaker output from the memory control instruction unit 13, the utterance section detection unit 17 outputs an utterance section detection result for the speaker corresponding to this speaker feature vector (step S10).
 そして、音声認識部18は、観測信号と、メモリ制御指示部13から出力された一人の話者に対応する話者特徴ベクトルに基づき、この話者特徴ベクトルに対応する話者の音声認識結果を出力する(ステップS11)。 Then, based on the observation signal and the speaker feature vector corresponding to the one speaker output from the memory control instruction unit 13, the voice recognition unit 18 outputs a speech recognition result for the speaker corresponding to this speaker feature vector (step S11).
 ステップS9~ステップS11は、図5に示すように、並行に処理されるほか、直列に処理されてもよい。直列に処理する場合、処理の順は、特に限定しない。また、信号処理装置10A~10Cの場合には、具備する音声処理機能部に応じた処理を実行すればよい。例えば、信号処理装置10Aの場合には、音源抽出部16による音源抽出処理(ステップS9)を実行し、ステップS12に進む。 As shown in FIG. 5, steps S9 to S11 may be processed in parallel or in series. When processing in series, the order of processing is not particularly limited. Further, in the case of the signal processing devices 10A to 10C, processing may be executed according to the voice processing function unit provided. For example, in the case of the signal processing device 10A, the sound source extraction process (step S9) by the sound source extraction unit 16 is executed, and the process proceeds to step S12.
 そして、繰り返し制御部19は、音源抽出部16、発話区間検出部17及び音声認識部18の処理結果を基に、繰り返しを停止するか否かを判定する(ステップS12)。なお、繰り返し制御部19は、話者特徴ベクトル抽出部11の抽出処理の状態を基に、繰り返しを停止するか否かを判定してもよい。 Then, the repetition control unit 19 determines whether or not to stop the repetition based on the processing results of the sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18 (step S12). The repetition control unit 19 may determine whether or not to stop the repetition based on the extraction processing state of the speaker feature vector extraction unit 11.
 繰り返し制御部19は、繰り返しを停止しないと判定した場合(ステップS12:No)、この時間ブロックbにおける次の話者に関する処理を進めるため、ステップS4に戻り、次の話者に対する話者特徴ベクトルの抽出を行う。 When the repetition control unit 19 determines that the repetition should not be stopped (step S12: No), the process returns to step S4 in order to proceed with processing for the next speaker in this time block b, and the speaker feature vector for the next speaker is extracted.
 また、繰り返し制御部19は、繰り返しを停止すると判定した場合(ステップS12:Yes)、この時間ブロックbに対する処理結果を出力する(ステップS13)。なお、信号処理装置10の例であれば、出力結果は、音源抽出結果、発話区間検出結果、音声認識結果である。また、信号処理装置10は、全ての時間ブロックの処理結果をまとめて出力してもよい。 Further, when the repetition control unit 19 determines that the repetition is stopped (step S12: Yes), the repetition control unit 19 outputs the processing result for this time block b (step S13). In the case of the signal processing device 10, the output results are the sound source extraction result, the utterance section detection result, and the voice recognition result. Further, the signal processing device 10 may collectively output the processing results of all the time blocks.
 そして、信号処理装置10は、全ての時間ブロックの処理を終了したか否かを判定する(ステップS14)。信号処理装置10は、全ての時間ブロックの処理を終了した場合(ステップS14:Yes)、入力された音響信号に対する処理を終了する。また、信号処理装置10は、全ての時間ブロックの処理を終了していない場合(ステップS14:No)、次の時間ブロックに対する処理を行うため、時間ブロックbに1を加算し(ステップS15)、ステップS4に戻り、処理を続ける。 Then, the signal processing device 10 determines whether or not the processing of all the time blocks is completed (step S14). When the processing of all the time blocks is completed (step S14: Yes), the signal processing device 10 ends the processing for the input acoustic signal. Further, when the signal processing device 10 has not completed the processing of all the time blocks (step S14: No), 1 is added to the time block b in order to perform the processing for the next time block (step S15). The process returns to step S4 and the process is continued.
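The flow of FIG. 5 (steps S2 to S15) can be summarized in one block-processing loop. This is a structural sketch only: the callables stand in for the components sketched above (extraction, addressing, memory write/read, the extraction/detection/recognition processing, and the stop decision), and no batching or error handling is shown.

def process_signal(blocks, nn_embed, h0, address, write, read, process, stop_fn):
    results = []
    for X_b in blocks:                        # S2/S3 and S14/S15: loop over the time blocks
        h, block_out = h0, []
        while True:
            a, h = nn_embed(X_b, h)           # S4: extract one speaker's feature vector
            w = address(a)                    # S5: known speaker -> its slot, new speaker -> free slot
            write(w, a)                       # S6/S7: write to the addressed slot
            r = read(w)                       # S8: read back the (updated) speaker feature
            block_out.append(process(X_b, r)) # S9 (S10/S11 would be analogous branches)
            if stop_fn(h, X_b, block_out[-1]):# S12: repetition control
                break
        results.append(block_out)             # S13: output the results for this block
    return results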
[実施形態の効果]
 実施形態に係る信号処理装置10は、時間ブロックごとに話者特徴ベクトルを繰り返し抽出し、外部メモリ部12に書き込みを行う。この際、信号処理装置10は、今までの時間ブロックに出現したことのない話者の話者特徴量ベクトルが抽出された場合には外部メモリ部12の未使用のメモリスロットに話者特徴ベクトルを書き込む。信号処理装置10は、今までの時間ブロックに既に出現したことのある話者の話者特徴量ベクトルが抽出された場合には外部メモリ部12のこの話者に対応するメモリスロットに話者特徴ベクトルを書き込む。
[Effect of embodiment]
The signal processing device 10 according to the embodiment repeatedly extracts speaker feature vectors for each time block and writes them to the external memory unit 12. In doing so, when a speaker feature vector of a speaker who has not appeared in any previous time block is extracted, the signal processing device 10 writes the speaker feature vector to an unused memory slot of the external memory unit 12. When a speaker feature vector of a speaker who has already appeared in a previous time block is extracted, the signal processing device 10 writes the speaker feature vector to the memory slot of the external memory unit 12 corresponding to this speaker.
 したがって、信号処理装置10は、発話をしていない話者(サイレント話者)については、抽出処理自体を実行しない。このため、信号処理装置10は、従来のように、時間ブロックごとにサイレント話者について音源抽出を試みる必要がないため、従来と比して処理量を削減できるとともに、真に発話を行っている話者のみに対する処理を適切に行えるため、処理精度を向上することができる。 Therefore, the signal processing device 10 does not execute the extraction process itself for speakers who are not speaking (silent speakers). Because the signal processing device 10 does not need to attempt sound source extraction for silent speakers in every time block as in the conventional approach, the amount of processing can be reduced compared with the conventional approach, and, since processing can be appropriately performed only for speakers who are actually speaking, the processing accuracy can be improved.
 また、信号処理装置10では、時間ブロックごとに話者特徴ベクトルの抽出を行うため、従来のように全ての時間ブロックにおいて同じ順で話者特徴ベクトルを抽出する必要がないため、処理の最適性を損なうこともない。 Furthermore, since the signal processing device 10 extracts speaker feature vectors for each time block, it does not need to extract the speaker feature vectors in the same order in all time blocks as in the conventional approach, and therefore the optimality of the processing is not impaired.
 このように、信号処理装置10は、処理量を削減しながら、音響信号に対する処理精度を向上させることができる。 In this way, the signal processing device 10 can improve the processing accuracy for the acoustic signal while reducing the processing amount.
[プログラム]
 図6は、プログラムが実行されることにより、信号処理装置10が実現されるコンピュータの一例を示す図である。コンピュータ1000は、例えば、メモリ1010、CPU1020を有する。また、コンピュータ1000は、ハードディスクドライブインタフェース1030、ディスクドライブインタフェース1040、シリアルポートインタフェース1050、ビデオアダプタ1060、ネットワークインタフェース1070を有する。これらの各部は、バス1080によって接続される。
[program]
FIG. 6 is a diagram showing an example of a computer in which the signal processing device 10 is realized by executing a program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these parts is connected by a bus 1080.
 メモリ1010は、ROM(Read Only Memory)1011及びRAM1012を含む。ROM1011は、例えば、BIOS(Basic Input Output System)等のブートプログラムを記憶する。ハードディスクドライブインタフェース1030は、ハードディスクドライブ1090に接続される。ディスクドライブインタフェース1040は、ディスクドライブ1100に接続される。例えば磁気ディスクや光ディスク等の着脱可能な記憶媒体が、ディスクドライブ1100に挿入される。シリアルポートインタフェース1050は、例えばマウス1110、キーボード1120に接続される。ビデオアダプタ1060は、例えばディスプレイ1130に接続される。 The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to the hard disk drive 1090. The disk drive interface 1040 is connected to the disk drive 1100. For example, a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, the display 1130.
 ハードディスクドライブ1090は、例えば、OS(Operating System)1091、アプリケーションプログラム1092、プログラムモジュール1093、プログラムデータ1094を記憶する。すなわち、信号処理装置10の各処理を規定するプログラムは、コンピュータにより実行可能なコードが記述されたプログラムモジュール1093として実装される。プログラムモジュール1093は、例えばハードディスクドライブ1090に記憶される。例えば、信号処理装置10における機能構成と同様の処理を実行するためのプログラムモジュール1093が、ハードディスクドライブ1090に記憶される。なお、ハードディスクドライブ1090は、SSD(Solid State Drive)により代替されてもよい。 The hard disk drive 1090 stores, for example, an OS (Operating System) 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the signal processing device 10 is implemented as a program module 1093 in which a code that can be executed by a computer is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, a program module 1093 for executing processing similar to the functional configuration in the signal processing device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
 また、上述した実施形態の処理で用いられる設定データは、プログラムデータ1094として、例えばメモリ1010やハードディスクドライブ1090に記憶される。そして、CPU1020が、メモリ1010やハードディスクドライブ1090に記憶されたプログラムモジュール1093やプログラムデータ1094を必要に応じてRAM1012に読み出して実行する。 Further, the setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, a memory 1010 or a hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 into the RAM 1012 and executes them as needed.
 なお、プログラムモジュール1093やプログラムデータ1094は、ハードディスクドライブ1090に記憶される場合に限らず、例えば着脱可能な記憶媒体に記憶され、ディスクドライブ1100等を介してCPU1020によって読み出されてもよい。あるいは、プログラムモジュール1093及びプログラムデータ1094は、ネットワーク(LAN(Local Area Network)、WAN(Wide Area Network)等)を介して接続された他のコンピュータに記憶されてもよい。そして、プログラムモジュール1093及びプログラムデータ1094は、他のコンピュータから、ネットワークインタフェース1070を介してCPU1020によって読み出されてもよい。 The program module 1093 and the program data 1094 are not limited to those stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Then, the program module 1093 and the program data 1094 may be read from another computer by the CPU 1020 via the network interface 1070.
 以上、本発明者によってなされた発明を適用した実施形態について説明したが、本実施形態による本発明の開示の一部をなす記述及び図面により本発明は限定されることはない。すなわち、本実施形態に基づいて当業者等によりなされる他の実施形態、実施例及び運用技術等は全て本発明の範疇に含まれる。 Although the embodiment to which the invention made by the present inventor is applied has been described above, the present invention is not limited by the description and the drawings which form a part of the disclosure of the present invention according to the present embodiment. That is, other embodiments, examples, operational techniques, and the like made by those skilled in the art based on the present embodiment are all included in the scope of the present invention.
 10,10A,10B,10C 信号処理装置
 11 話者特徴ベクトル抽出部
 12 外部メモリ部
 13 メモリ制御指示部
 14 外部メモリ書き込み部
 15 外部メモリ読み込み部
 16 音源抽出部
 17 発話区間検出部
 18 音声認識部
 19 繰り返し制御部
 20 学習部
10, 10A, 10B, 10C Signal processing device 11 Speaker feature vector extraction unit 12 External memory unit 13 Memory control instruction unit 14 External memory writing unit 15 External memory reading unit 16 Sound source extraction unit 17 Speech section detection unit 18 Voice recognition unit 19 Repeat control unit 20 Learning unit

Claims (8)

  1.  入力された音響信号に対して、時間ブロックごとに話者特徴ベクトルを、該時間ブロックに存在する話者の数の分だけ繰り返し抽出する抽出部と、
     前記抽出部によって抽出された話者特徴ベクトルを格納する外部メモリ部と、
     今までの時間ブロックに出現したことのない話者の話者特徴量ベクトルが前記抽出部によって抽出された場合には前記外部メモリ部の未使用のメモリスロットに前記話者特徴ベクトルを書き込むことを指示し、今までの時間ブロックに既に出現したことのある話者の話者特徴量ベクトルが前記抽出部によって抽出された場合には前記外部メモリ部の前記既に出現したことのある話者に対応するメモリスロットに前記話者特徴ベクトルを書き込むことを指示し、前記外部メモリ部からの話者特徴ベクトルの読み込みを指示する指示部と、
     前記指示部から指示を受け、前記外部メモリ部への話者特徴ベクトルの書き込みを行う書き込み部と、
     前記指示部から指示を受け、前記外部メモリ部から話者特徴ベクトルの読み込みを行う読み込み部と、
     前記読み込み部によって読み込まれた話者特徴ベクトルに基づき信号処理を実行する処理部と、
     を有することを特徴とする信号処理装置。
    An extraction unit that repeatedly extracts the speaker feature vector for each time block for the input acoustic signal by the number of speakers existing in the time block.
    An external memory unit that stores the speaker feature vector extracted by the extraction unit, and
    An instruction unit that, when a speaker feature vector of a speaker who has not appeared in any previous time block is extracted by the extraction unit, instructs writing of the speaker feature vector to an unused memory slot of the external memory unit, that, when a speaker feature vector of a speaker who has already appeared in a previous time block is extracted by the extraction unit, instructs writing of the speaker feature vector to the memory slot of the external memory unit corresponding to the speaker who has already appeared, and that instructs reading of the speaker feature vector from the external memory unit.
    A writing unit that receives an instruction from the instruction unit and writes the speaker feature vector to the external memory unit.
    A reading unit that receives an instruction from the instruction unit and reads the speaker feature vector from the external memory unit.
    A processing unit that executes signal processing based on the speaker feature vector read by the reading unit.
    A signal processing device characterized by having the above units.
  2.  前記抽出部の抽出処理の状態または前記処理部による処理結果を基に、前記抽出部による抽出処理の繰り返し数を決定することを特徴とする請求項1に記載の信号処理装置。 The signal processing apparatus according to claim 1, wherein the number of repetitions of the extraction process by the extraction unit is determined based on the state of the extraction process of the extraction unit or the processing result by the processing unit.
  3.  前記抽出部、前記指示部及び前記処理部が用いるパラメータを所定の目的関数に基づき最適化する学習部をさらに有することを特徴とする請求項1または2に記載の信号処理装置。 The signal processing apparatus according to claim 1 or 2, further comprising a learning unit that optimizes the parameters used by the extraction unit, the instruction unit, and the processing unit based on a predetermined objective function.
  4.  前記処理部は、前記話者特徴ベクトルに基づき、前記話者特徴ベクトルに対応する話者の音声を前記音響信号から抽出する音源抽出部であることを特徴とする請求項1~3のいずれか一つに記載の信号処理装置。 One of claims 1 to 3, wherein the processing unit is a sound source extraction unit that extracts a speaker's voice corresponding to the speaker feature vector from the acoustic signal based on the speaker feature vector. The signal processing device according to one.
  5.  前記処理部は、前記話者特徴ベクトルに基づき、前記話者特徴ベクトルに対応する話者の音声認識結果を出力する音源抽出部であることを特徴とする請求項1~4のいずれか一つに記載の信号処理装置。 One of claims 1 to 4, wherein the processing unit is a sound source extraction unit that outputs a voice recognition result of a speaker corresponding to the speaker feature vector based on the speaker feature vector. The signal processing device according to.
  6.  前記処理部は、前記話者特徴ベクトルに基づき、前記話者特徴ベクトルに対応する話者の発話区間検出結果を出力する発話区間検出部であることを特徴とする請求項1~5のいずれか一つに記載の信号処理装置。 One of claims 1 to 5, wherein the processing unit is an utterance section detection unit that outputs a speaker's utterance section detection result corresponding to the speaker feature vector based on the speaker feature vector. The signal processing device according to one.
  7.  信号処理装置が実行する信号処理方法であって、
     前記信号処理装置は、データを格納する外部メモリを有し、
     入力された音響信号に対して、時間ブロックごとに話者特徴ベクトルを、該時間ブロックに存在する話者の数の分だけ繰り返し抽出する抽出工程と、
     前記抽出工程において抽出された話者特徴ベクトルを格納する外部メモリと、
     今までの時間ブロックに出現したことのない話者の話者特徴量ベクトルが前記抽出工程において抽出された場合には、前記外部メモリの未使用のメモリスロットに前記話者特徴ベクトルを書き込むことを指示し、今までの時間ブロックに既に出現したことのある話者の話者特徴量ベクトルが前記抽出工程において抽出された場合には、前記外部メモリの前記既に出現したことのある話者に対応するメモリスロットに前記話者特徴ベクトルを書き込むことを指示する指示工程と、
     前記指示工程における指示を受け、前記外部メモリへの話者特徴ベクトルの書き込みを行う書き込み工程と、
     前記指示工程における指示を受け、前記外部メモリから話者特徴ベクトルの読み込みを行う読み込み工程と、
     前記読み込み工程において読み込まれた話者特徴ベクトルに基づき信号処理を実行する処理工程と、
     を含んだことを特徴とする信号処理方法。
    It is a signal processing method executed by a signal processing device.
    The signal processing device has an external memory for storing data and has an external memory.
    An extraction process in which speaker feature vectors are repeatedly extracted for each time block for the input acoustic signal by the number of speakers existing in the time block.
    An external memory that stores the speaker feature vector extracted in the extraction step, and
    An instruction step of, when a speaker feature vector of a speaker who has not appeared in any previous time block is extracted in the extraction step, instructing writing of the speaker feature vector to an unused memory slot of the external memory, and, when a speaker feature vector of a speaker who has already appeared in a previous time block is extracted in the extraction step, instructing writing of the speaker feature vector to the memory slot of the external memory corresponding to the speaker who has already appeared, and
    A writing step of receiving an instruction in the instruction step and writing the speaker feature vector to the external memory.
    A reading process in which the speaker feature vector is read from the external memory in response to the instruction in the instruction process, and
    A processing step of executing signal processing based on the speaker feature vector read in the reading step, and
    A signal processing method characterized by including the above steps.
  8.  コンピュータを、請求項1~6のいずれか一つに記載の信号処理装置として機能させるための信号処理プログラム。 A signal processing program for making a computer function as the signal processing device according to any one of claims 1 to 6.
PCT/JP2020/049247 2020-12-28 2020-12-28 Signal processing device, signal processing method, and signal processing program WO2022145015A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2020/049247 WO2022145015A1 (en) 2020-12-28 2020-12-28 Signal processing device, signal processing method, and signal processing program
JP2022572857A JPWO2022145015A1 (en) 2020-12-28 2020-12-28

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/049247 WO2022145015A1 (en) 2020-12-28 2020-12-28 Signal processing device, signal processing method, and signal processing program

Publications (1)

Publication Number Publication Date
WO2022145015A1 true WO2022145015A1 (en) 2022-07-07

Family

ID=82259163

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/049247 WO2022145015A1 (en) 2020-12-28 2020-12-28 Signal processing device, signal processing method, and signal processing program

Country Status (2)

Country Link
JP (1) JPWO2022145015A1 (en)
WO (1) WO2022145015A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007233239A (en) * 2006-03-03 2007-09-13 National Institute Of Advanced Industrial & Technology Method, system, and program for utterance event separation
JP2020013034A (en) * 2018-07-19 2020-01-23 株式会社日立製作所 Voice recognition device and voice recognition method
WO2020039571A1 (en) * 2018-08-24 2020-02-27 三菱電機株式会社 Voice separation device, voice separation method, voice separation program, and voice separation system
JP2020134657A (en) * 2019-02-18 2020-08-31 日本電信電話株式会社 Signal processing device, learning device, signal processing method, learning method and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NEUMANN THILO VON; KINOSHITA KEISUKE; DELCROIX MARC; ARAKI SHOKO; NAKATANI TOMOHIRO; HAEB-UMBACH REINHOLD: "All-neural Online Source Separation, Counting, and Diarization for Meeting Analysis", ICASSP 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 12 May 2019 (2019-05-12), pages 91 - 95, XP033565103, DOI: 10.1109/ICASSP.2019.8682572 *

Also Published As

Publication number Publication date
JPWO2022145015A1 (en) 2022-07-07

Similar Documents

Publication Publication Date Title
Kreuk et al. Fooling end-to-end speaker verification with adversarial examples
EP2943951B1 (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
WO2020228173A1 (en) Illegal speech detection method, apparatus and device and computer-readable storage medium
JP2017097162A (en) Keyword detection device, keyword detection method and computer program for keyword detection
WO2016095218A1 (en) Speaker identification using spatial information
US20140350934A1 (en) Systems and Methods for Voice Identification
WO2019151507A1 (en) Learning device, learning method and learning program
Wang et al. Recurrent deep stacking networks for supervised speech separation
US10089977B2 (en) Method for system combination in an audio analytics application
CN110335608B (en) Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium
US20100324893A1 (en) System and method for improving robustness of speech recognition using vocal tract length normalization codebooks
JP2020042257A (en) Voice recognition method and device
JP6985221B2 (en) Speech recognition device and speech recognition method
JP2023539948A (en) Long context end-to-end speech recognition system
KR20210141115A (en) Method and apparatus for estimating utterance time
Qian et al. Noise robust speech recognition on aurora4 by humans and machines
CN112489623A (en) Language identification model training method, language identification method and related equipment
JP2010078877A (en) Speech recognition device, speech recognition method, and speech recognition program
KR101122591B1 (en) Apparatus and method for speech recognition by keyword recognition
CN108847251B (en) Voice duplicate removal method, device, server and storage medium
JP5670298B2 (en) Noise suppression device, method and program
Broughton et al. Improving end-to-end neural diarization using conversational summary representations
WO2022145015A1 (en) Signal processing device, signal processing method, and signal processing program
JP2012063611A (en) Voice recognition result search device, voice recognition result search method, and voice recognition result search program
KR20130126570A (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20968026

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022572857

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20968026

Country of ref document: EP

Kind code of ref document: A1