WO2022145015A1 - Signal processing device, signal processing method, and signal processing program - Google Patents

Signal processing device, signal processing method, and signal processing program Download PDF

Info

Publication number
WO2022145015A1
WO2022145015A1 PCT/JP2020/049247 JP2020049247W WO2022145015A1 WO 2022145015 A1 WO2022145015 A1 WO 2022145015A1 JP 2020049247 W JP2020049247 W JP 2020049247W WO 2022145015 A1 WO2022145015 A1 WO 2022145015A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
speaker
feature vector
speaker feature
signal processing
Prior art date
Application number
PCT/JP2020/049247
Other languages
French (fr)
Japanese (ja)
Inventor
Keisuke Kinoshita
Tomohiro Nakatani
Marc Delcroix
Original Assignee
Nippon Telegraph and Telephone Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to PCT/JP2020/049247
Priority to JP2022572857A
Publication of WO2022145015A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/028Voice signal separating using properties of sound source
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/0308Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Definitions

  • the present invention relates to a signal processing device, a signal processing method, and a signal processing program.
  • Among sound source separation techniques, which separate an acoustic signal containing a mixture of sounds from multiple sound sources into a signal for each source, there are a first sound source separation technique that targets sound picked up by multiple microphones and a second sound source separation technique that targets sound picked up by a single microphone.
  • The second sound source separation technique is considered more difficult than the first because it cannot use information about the positions of the microphones.
  • The techniques described in Non-Patent Documents 1 to 3 are known as second sound source separation techniques that perform sound source separation based on the information in the input acoustic signal, without using microphone position information.
  • The technique described in Non-Patent Document 1 separates an input acoustic signal into a predetermined number of sound sources. By feeding the input signal to a bi-directional long short-term memory neural network (BLSTM neural network, hereinafter BLSTM), a mask for extracting each sound source can be estimated.
  • When the BLSTM parameters are trained, the parameters are updated, for a given input signal, so as to minimize the distance between the correct separated signal given in advance and the separated signal obtained by applying the estimated mask to the observation signal.
  • The technique described in Non-Patent Document 2 performs sound source separation while varying the number of separated signals according to the number of sound sources contained in the input signal.
  • As in Non-Patent Document 1, a mask is estimated using a BLSTM, but in Non-Patent Document 2 the separation mask estimated at any one time corresponds to only one sound source in the input signal.
  • Using this separation mask, the observation signal is split into a separated signal and a residual signal obtained by removing the separated signal from the observation signal.
  • It is then automatically determined whether another sound source signal still remains in the residual signal; if so, the residual signal is fed back into the BLSTM to extract another sound source. If no other sound source signal remains in the residual signal, the process ends at that point.
  • This mask estimation process is repeated until no sound source remains in the residual signal, extracting the sound sources one by one, so that sound source separation and estimation of the number of sound sources are achieved simultaneously.
  • For this determination, thresholding the volume of the residual signal, or feeding the residual signal to another neural network that checks whether another sound source remains, has been proposed.
  • The same BLSTM is used repeatedly for the mask estimation.
  • The techniques described in Non-Patent Documents 1 and 2 are batch methods that apply processing to the entire input signal, and therefore lack real-time capability. For example, when applying them to a recording of a meeting, processing cannot start until at least the recording of the meeting has finished. Consequently, these techniques cannot be used in applications where sound source separation is applied from the start of a meeting and each separated voice is transcribed sequentially with automatic speech recognition.
  • The technique described in Non-Patent Document 3 was devised in view of this problem. The input signal is divided into a plurality of time blocks (blocks of roughly 5 to 10 seconds in length), and processing is applied to the blocks sequentially.
  • The method of Non-Patent Document 2 is applied to the first block, except that, in addition to estimating the separation masks, the speaker feature vectors corresponding to the extracted speakers are computed and output as well.
  • In the second block, each of the speaker feature vectors obtained from the processing of the first block is used to extract the voices of those speakers repeatedly, in the same order as in the first block. In the third and subsequent blocks, speakers are likewise extracted in the order in which they were detected in past blocks.
  • When a new speaker who did not appear in the first block appears in the second block, the voice component of that new speaker remains in the residual signal even after all the speakers who appeared in the first block have been extracted from the observation signal of the second block. By applying the determination process described above, the presence of the new speaker can therefore be detected, and as a result the new speaker's mask and speaker feature vector can also be estimated.
  • By repeating this processing block by block, sound source separation and estimation of the number of sound sources can be performed even for long recordings, in the form of block-online processing.
  • By keeping the order of speaker extraction common across blocks, speakers can be tracked between time blocks, that is, it can be determined which of the separated sounds obtained in one time block and those obtained in a different time block belong to the same speaker.
  • With the technique described in Non-Patent Document 3, even when a speaker extracted in a past block is not speaking in the current block, that speaker's feature vector is still used to extract the corresponding signal, just as for the other speakers. If the speaker's signal is not contained in the block, the mask for that speaker becomes 0, and ideally a signal with zero sound pressure is extracted.
  • Because it cannot be known in advance whether a particular speaker is speaking in a new time block, extraction from the new time block using the speaker feature vector must be attempted for every speaker who has been extracted at least once in a past block. Whether the speaker is actually speaking can then be determined by examining the sound pressure of the output signal.
  • The technique described in Non-Patent Document 3 therefore performs extraction processing that is not actually necessary for speakers who are not speaking (silent speakers).
  • Moreover, with the technique described in Non-Patent Document 3, speakers in a new time block are extracted in the order in which they spoke in the past.
  • In general, however, the optimal order of speaker extraction is likely to differ in each block (see Non-Patent Document 1). Attempting to extract speakers in the same order in every block therefore increases the amount of processing and impairs the optimality of the processing.
  • The present invention has been made in view of the above, and its object is to provide a signal processing device, a signal processing method, and a signal processing program capable of improving the processing accuracy for an acoustic signal while reducing the amount of processing.
  • To solve the above problem and achieve the object, the signal processing device according to the present invention includes: an extraction unit that, for an input acoustic signal, repeatedly extracts a speaker feature vector for each time block, as many times as there are speakers present in the time block; an external memory unit that stores the speaker feature vectors extracted by the extraction unit; an instruction unit that, when the extraction unit extracts the speaker feature vector of a speaker who has never appeared in any time block so far, instructs writing of the speaker feature vector to an unused memory slot of the external memory unit, that, when the extraction unit extracts the speaker feature vector of a speaker who has already appeared in a time block so far, instructs writing of the speaker feature vector to the memory slot of the external memory unit corresponding to that speaker, and that instructs reading of speaker feature vectors from the external memory unit; a writing unit that writes the speaker feature vector to the external memory unit in response to an instruction from the instruction unit; a reading unit that reads the speaker feature vector from the external memory unit in response to an instruction from the instruction unit; and a processing unit that executes signal processing based on the speaker feature vector read by the reading unit.
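  • Purely for illustration, the units named in this summary can be pictured through the following hypothetical Python interfaces; every method name and signature here is invented for explanation and is not taken from the patent.

```python
# Hypothetical interface sketch of the claimed units (illustrative only).
from typing import Protocol
import numpy as np

class ExtractionUnit(Protocol):
    def initial_state(self) -> np.ndarray:
        """Return the initial state used for the first iteration of a time block."""
    def extract(self, x_block: np.ndarray, state: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
        """Return (speaker feature vector, updated internal state)."""

class InstructionUnit(Protocol):
    def address(self, feature: np.ndarray, memory: np.ndarray) -> np.ndarray:
        """Return an instruction vector: an unused slot for a new speaker,
        the existing slot for a speaker already seen in an earlier time block."""

class WritingUnit(Protocol):
    def write(self, memory: np.ndarray, w: np.ndarray, feature: np.ndarray) -> np.ndarray:
        """Write the speaker feature vector to the addressed memory slot(s)."""

class ReadingUnit(Protocol):
    def read(self, memory: np.ndarray, w: np.ndarray) -> np.ndarray:
        """Read a speaker feature vector back from the addressed memory slot(s)."""

class ProcessingUnit(Protocol):
    def process(self, x_block: np.ndarray, speaker_vector: np.ndarray) -> np.ndarray:
        """Source extraction, utterance-section detection, or speech recognition."""
```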
  • FIG. 1 is a diagram schematically showing an example of a configuration of a signal processing device according to an embodiment.
  • FIG. 2 is a diagram schematically showing another example of the configuration of the signal processing device according to the embodiment.
  • FIG. 3 is a diagram schematically showing another example of the configuration of the signal processing device according to the embodiment.
  • FIG. 4 is a diagram schematically showing another example of the configuration of the signal processing device according to the embodiment.
  • FIG. 5 is a flowchart showing a processing procedure of the data processing method according to the embodiment.
  • FIG. 6 is a diagram showing an example of a computer in which a signal processing device is realized by executing a program.
  • FIG. 1 is a diagram showing an example of a configuration of a signal processing device according to an embodiment.
  • The signal processing device 10 according to the embodiment is realized, for example, by loading a predetermined program into a computer that includes a ROM (Read Only Memory), a RAM (Random Access Memory), and a CPU (Central Processing Unit), and having the CPU execute the program.
  • The signal processing device 10 has a speaker feature vector extraction unit 11 (extraction unit), an external memory unit 12, a memory control instruction unit 13 (instruction unit), an external memory writing unit 14 (writing unit), an external memory reading unit 15 (reading unit), a sound source extraction unit 16, an utterance section detection unit 17, a voice recognition unit 18, a repetition control unit 19, and a learning unit 20.
  • When an acoustic signal is input, the signal processing device 10 divides it into time blocks and inputs the acoustic signal of each time block to the speaker feature vector extraction unit 11.
  • The speaker feature vector extraction unit 11 repeatedly extracts a speaker feature vector from the input acoustic signal (hereinafter, observation signal) for each time block, as many times as there are speakers present in that block. The number of repetitions of the speaker feature vector extraction process is set by the repetition control unit 19.
  • The speaker feature vector extraction unit 11 can use various methods to extract speaker feature vectors. For example, when the technique described in Non-Patent Document 3 or in Reference 1 is used, a speaker feature vector can be extracted in the first time block for each of an arbitrary number of speakers contained in that block.
  • Reference 1 Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu, “End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors”, arXiv: 2005.09921, 2021.
  • In a general form, the extraction of the speaker feature vector can be formulated as in Eq. (1).
  • I_b is the total number of speakers in time block b. a_{b,i} is the speaker feature vector of the i-th speaker extracted in time block b. X_b is the observation signal of time block b, and NN_embed[·] is a neural network such as a BLSTM. h_{b,0} is an initial vector (in some cases a matrix) that informs the network that this is the first iteration of the speaker feature vector extraction, and is set appropriately for NN_embed[·].
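  • As a concrete illustration only, the iterative extraction of Eq. (1) can be sketched as below, under the assumption that the recurrence has the form (a_{b,i}, h_{b,i}) = NN_embed[X_b, h_{b,i-1}]; a toy random projection stands in for the actual BLSTM-based NN_embed.

```python
# Toy sketch of the per-block extraction loop of Eq. (1) (assumed recurrence).
import numpy as np

rng = np.random.default_rng(0)

def nn_embed(x_block, h_prev, dim=16):
    """Stand-in for NN_embed[.]: returns (speaker feature vector a_{b,i}, state h_{b,i})."""
    ctx = np.concatenate([x_block.mean(axis=0), h_prev])
    a_bi = np.tanh(rng.standard_normal((dim, ctx.size)) @ ctx)
    h_bi = np.tanh(rng.standard_normal((h_prev.size, ctx.size)) @ ctx)
    return a_bi, h_bi

X_b = rng.standard_normal((500, 40))   # toy observation: 500 frames x 40 features
h = np.zeros(8)                        # h_{b,0}: marks the first iteration
speaker_vectors = []
for i in range(2):                     # I_b = 2 speakers assumed in this toy block
    a, h = nn_embed(X_b, h)
    speaker_vectors.append(a)          # a_{b,1}, a_{b,2}
```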
  • the external memory unit 12 is a memory for storing the speaker feature vector extracted by the speaker feature vector extraction unit 11.
  • the external memory unit 12 has a plurality of memory addresses, and stores a speaker feature vector of one speaker in one memory address.
  • The external memory unit 12 may be a rewritable semiconductor memory such as a RAM (Random Access Memory), a flash memory, or an NVSRAM (Non Volatile Static Random Access Memory).
  • the external memory unit 12 may be a storage device such as an HDD (Hard Disk Drive), SSD (Solid State Drive), or an optical disk.
  • Each time a speaker feature vector is extracted by the speaker feature vector extraction unit 11, the memory control instruction unit 13 receives that speaker feature vector.
  • When the speaker feature vector of a speaker who has never appeared in any time block so far is extracted, the memory control instruction unit 13 instructs the external memory unit 12 to write this speaker feature vector to a new memory address, that is, to an unused memory slot of the external memory unit 12.
  • When the speaker feature vector of a speaker who has already appeared in a time block so far is extracted, the memory control instruction unit 13 instructs the external memory unit 12 to write the speaker feature vector, in an appropriate form, to the memory address corresponding to that previously appearing speaker, that is, to the memory slot of the external memory unit 12 corresponding to this speaker.
  • The memory control instruction unit 13 issues the instructions for writing speaker feature vectors to, and reading them from, the external memory unit 12 using neural-network-based mechanisms such as those described in References 2 to 4.
  • Reference 2 A. Graves, G. Wayne, and I. Danihelka, “Neural Turing Machines”, arXiv:1410.5401, 2014.
  • Reference 3 Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez, Edward Grefenstette, and Tiago Ramalho, “Hybrid computing using a neural network with dynamic external memory”, Nature, 538 (7626): 471-476, 2016.
  • Reference 4 Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus, “End-To-End Memory Networks”, Advances in Neural Information Processing Systems 28, pp. 2440-2448, 2015.
  • the external memory writing unit 14 writes the speaker feature vector to the external memory unit 12 in response to an instruction from the memory control instruction unit 13.
  • The external memory reading unit 15 receives an instruction from the memory control instruction unit 13 and reads a speaker feature vector for each speaker from the external memory unit 12. By implementing the exchange of information with the external memory unit 12 using neural networks in this way, the signal processing device 10 can optimize the whole system with the error backpropagation method.
  • For the instructions to the external memory writing unit 14, which writes to the external memory unit 12, and to the external memory reading unit 15, which reads from the external memory unit 12, the memory control instruction unit 13 generates an instruction vector according to the procedures described in References 2 to 4 and the like.
  • Specifically, the memory control instruction unit 13 first generates a key vector k_{b,i} of size 1 × M using a neural network, based on the output a_{b,i} from the speaker feature vector extraction unit 11 and the past output (instruction vector) w_{b,i-1} of the memory control instruction unit 13. The memory control instruction unit 13 also generates an additional quantity used in the calculation of Eq. (2).
  • The memory control instruction unit 13 then measures the closeness of this key vector to each of the N addresses (slot n) of the current external memory M_{b,i}, and calculates the instruction vector w_{b,i} as in Eq. (2).
  • M_{b,i} is the external memory matrix (of size N × M) before the i-th speaker feature vector in the b-th time block is written, where N is the total number of memory addresses and M is the length of the vector that can be written to each address.
  • w_{b,i} is a 1 × N-dimensional instruction vector, and its elements w_{b,i}(n) have the property shown in Eq. (4). In Eq. (5), c is a constant for enhancing sparsity.
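  • Assuming the NTM-style content-based addressing of References 2 to 4, the computation of the instruction vector can be sketched as follows; the cosine-similarity-plus-softmax form and the sharpening constant c are illustrative stand-ins for Eqs. (2) to (5), whose exact form appears only in the patent figures.

```python
# Illustrative content-based addressing: key vector vs. every memory slot.
import numpy as np

def instruction_vector(k, M, c=10.0, eps=1e-8):
    """k: (M_dim,) key vector k_{b,i}; M: (N, M_dim) external memory M_{b,i}."""
    sims = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + eps)
    logits = c * sims                        # c: sparsity/sharpening constant (Eq. (5))
    w = np.exp(logits - logits.max())
    return w / w.sum()                       # w_{b,i}: length N, elements sum to 1

M = np.zeros((4, 16)); M[0] = 1.0            # toy memory: one used slot, three unused
w = instruction_vector(np.ones(16), M)       # peaks at the matching (used) slot
```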
  • Next, the external memory writing unit 14 writes the speaker feature vector to the external memory unit 12 and updates it, based on the instruction vector w_{b,i} output from the memory control instruction unit 13.
  • As described below, the writing process of the external memory writing unit 14 is performed as a pair of an erasing step and a writing step.
  • The speaker feature vector a_{b,i} extracted by the speaker feature vector extraction unit 11 is passed to the external memory writing unit 14 and written to the external memory unit 12 in an appropriate form.
  • First, the external memory writing unit 14 erases the memory according to Eq. (6), based on the erase vector e_{b,i}, which is a 1 × N vector, and the instruction vector w_{b,i}. The erase vector e_{b,i} is also an output of the memory control instruction unit 13. L is a 1 × N vector consisting of N ones. The erase vector e_{b,i} is often set as in Eq. (7).
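  • A minimal sketch of the erase-then-write update suggested by Eqs. (6) to (8) is shown below, assuming a per-address erasure weighted by w_{b,i}(n)e_{b,i}(n) followed by weighted addition of a_{b,i}; the exact equations in the patent may differ in detail.

```python
# Illustrative erase-then-add memory update.
import numpy as np

def write_memory(M, w, a, e=None):
    """M: (N, M_dim) memory; w: (N,) instruction vector; a: (M_dim,) speaker vector."""
    if e is None:
        e = np.ones(M.shape[0])                  # Eq. (7): erase vector often all ones
    M_erased = M * (1.0 - w * e)[:, None]        # Eq. (6): attenuate addressed rows
    return M_erased + np.outer(w, a)             # add a_{b,i} to the addressed rows

M = write_memory(np.zeros((4, 16)), np.array([0.97, 0.01, 0.01, 0.01]), np.ones(16))
```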
  • The external memory reading unit 15 receives an instruction from the memory control instruction unit 13, reads a speaker feature vector for each speaker from the external memory unit 12, and outputs it to the memory control instruction unit 13.
  • Specifically, the external memory reading unit 15 reads the updated speaker feature vector from the external memory unit 12 based on the instruction vector w_{b,i} output from the memory control instruction unit 13, as shown in Eq. (9): the speaker feature vector r_{b,i} to be read is obtained by multiplying the memory matrix M by the instruction vector w_{b,i}.
  • A value of 0 for the n-th element w_{b,i}(n) of the instruction vector corresponds to an instruction not to read information from the n-th address. Consequently, information at memory addresses whose element value is close to 1 is mainly read out, while information at addresses whose element value is close to 0 is hardly read at all.
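  • The read operation of Eq. (9) then amounts to a weighted sum of the memory rows, as in this short sketch.

```python
# Illustrative read: r_{b,i} = w_{b,i} M, so slots with weights near 1 dominate.
import numpy as np

def read_memory(M, w):
    """M: (N, M_dim) memory; w: (N,) instruction vector -> (M_dim,) read vector r_{b,i}."""
    return w @ M

r = read_memory(np.eye(4, 16), np.array([0.97, 0.01, 0.01, 0.01]))
```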
  • The sound source extraction unit 16 extracts, from the observation signal, the voice of the speaker corresponding to the speaker feature vector output from the memory control instruction unit 13, based on the observation signal and that speaker feature vector (see Non-Patent Document 3 for details).
  • Specifically, the sound source extraction unit 16 extracts the separated voice ^S_{b,i} as in Eq. (10), using the observation signal X_b and the speaker feature vector r_{b,i} read from the external memory unit 12. For NN_extract[·], it is common to use a neural network such as a BLSTM or a convolutional neural network.
  • Even if the speaker feature vector extraction unit 11 extracts the speaker feature vectors in a different order in different time blocks, the separated voices ^S_{b,i} (i = 1, ..., I_b) extracted in time block b and the separated voices ^S_{b',i} extracted in a different time block b' (b ≠ b') can still be associated with the same speakers; that is, speakers can be tracked across time blocks by using the maximum-value detection and comparison of the instruction vectors for the external memory unit 12 described above.
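  • A toy sketch of Eq. (10) and of the block-wise speaker tracking follows; the sigmoid gating used as NN_extract and the argmax-based slot comparison are illustrative simplifications of the neural extraction network and of the maximum-value comparison of instruction vectors described above.

```python
# Illustrative extraction conditioned on the read speaker vector, plus tracking.
import numpy as np

def nn_extract(X_b, r):
    """Toy stand-in for NN_extract[.]: gate the observation using r_{b,i}."""
    gate = 1.0 / (1.0 + np.exp(-X_b @ r[:X_b.shape[1]]))   # per-frame gate
    return X_b * gate[:, None]                              # separated voice ^S_{b,i}

def tracked_speaker_id(w):
    """Follow a speaker across time blocks by the memory slot its w_{b,i} points at."""
    return int(np.argmax(w))

X_b = np.random.default_rng(1).standard_normal((500, 16))
S_hat = nn_extract(X_b, np.ones(16))
speaker_id = tracked_speaker_id(np.array([0.02, 0.95, 0.02, 0.01]))   # -> slot 1
```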
  • The utterance section detection unit 17 outputs the utterance section detection result for the speaker corresponding to the speaker feature vector output from the memory control instruction unit 13, based on the observation signal and that speaker feature vector (see Reference 1 for details).
  • The voice recognition unit 18 outputs the voice recognition result for the speaker corresponding to the speaker feature vector output from the memory control instruction unit 13, based on the observation signal and that speaker feature vector (see Reference 5 for details).
  • Reference 5 Marc Delcroix, Shinji Watanabe, Tsubasa Ochiai, Keisuke Kinoshita, Shigeki Karita, Atsunori Ogawa, Tomohiro Nakatani, “End-to-end SpeakerBeam for single channel target speech recognition”, Interspeech2019, pp.451-455, 2019.
  • The sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18 are examples of processing units that process voice. The signal processing device 10 described here has all three of these processing units, but the present invention is not limited to this.
  • The signal processing device according to the embodiment may instead be a signal processing device 10A having the sound source extraction unit 16 (see FIG. 2), a signal processing device 10B having the utterance section detection unit 17 (see FIG. 3), or a signal processing device 10C having the voice recognition unit 18 (see FIG. 4).
  • The repetition control unit 19 determines the number of repetitions of the extraction process by the speaker feature vector extraction unit 11, based either on the state of the extraction process of the speaker feature vector extraction unit 11 or on the processing results of the sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18. Ideally, the speaker feature vector extraction unit 11 should extract I_b speaker feature vectors in each block b.
  • For example, the repetition control unit 19 calculates a scalar value ^f_{b,i} (0 ≤ ^f_{b,i} ≤ 1) indicating whether the repetition should be stopped, as in Eq. (11), using the internal state vector h_{b,i} output from the speaker feature vector extraction unit 11, the observation signal X_b, and the separated voice ^S_{b,i}. If the scalar value ^f_{b,i} is larger than a predetermined value, the repetition is stopped; if it is lower, the repetition is continued.
  • Each time the speaker feature vector extraction unit 11 extracts a speaker feature vector, the repetition control unit 19 inputs different auxiliary information to the neural network of the speaker feature vector extraction unit 11, causing it to output the extraction result of the speaker feature vector corresponding to a different sound source.
  • In this way, the repetition control unit 19 virtually keeps track of all the speaker feature vectors contained in the input acoustic signal and of which speaker each speaker feature vector corresponds to.
  • The repetition control unit 19 then inputs auxiliary information for the speaker feature vector of the next speaker to the speaker feature vector extraction unit 11, causing the speaker feature vector extraction unit 11 to extract the speaker feature vector of the next speaker.
  • When the extraction of speaker feature vectors for all speakers in the observation signal of the current time block is complete, the repetition control unit 19 stops the repetition of the extraction process by the speaker feature vector extraction unit 11.
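  • The stopping decision around Eq. (11) can be sketched as follows, under the assumption that some scorer maps h_{b,i}, X_b, and ^S_{b,i} to a scalar ^f_{b,i} in [0, 1] that is compared with a threshold; the scorer below is a toy stand-in, not the patented network.

```python
# Illustrative repetition-control check for one extraction iteration.
import numpy as np

def stop_flag(h_bi, X_b, S_hat_bi):
    """Toy scorer returning ^f_{b,i} in [0, 1] from pooled summaries."""
    summary = np.concatenate([h_bi, X_b.mean(axis=0), S_hat_bi.mean(axis=0)])
    return 1.0 / (1.0 + np.exp(-summary.mean()))

def should_stop(f_hat, threshold=0.5):
    """Stop the per-block loop once ^f_{b,i} exceeds the predetermined value."""
    return f_hat > threshold
```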
  • the learning unit 20 optimizes the parameters used by the signal processing device 10 using the learning data.
  • the learning unit 20 optimizes the parameters of the neural network constituting the signal processing device 10 based on a predetermined objective function.
  • The learning data consist of an input signal (observation signal), a correct clean signal for each sound source contained in the input signal, correct utterance time information, correct utterance content, and the total number of speakers I contained in the input signal.
  • The learning unit 20 learns the parameters based on the learning data so that the error between the output of the signal processing device 10 and the correct information becomes small.
  • For example, a loss function based on the error of Eq. (12) is provided so that the system output (separated voice) ^S_{b,i} becomes close to the correct separated voice S_{b,i}.
  • In addition, a cross-entropy loss function, Eq. (13), is provided for the scalar value that determines the number of repetitions, so that the correct number of sound sources is estimated.
  • The learning unit 20 also updates the parameters so that Eq. (14) becomes small at the same time, so that the total number of addresses used in the external memory becomes equal to the total number of speakers I.
  • T_count is a preset threshold value and is generally set to 1. min(·) is a function that outputs T_count when the input value is larger than T_count, and outputs the input value as it is when it is smaller than T_count.
  • Eq. (14) is thus a loss function that encourages the number of memory addresses used in the external memory to match the total number of speakers I.
  • The learning unit 20 learns the parameters of the neural networks so that the value of Eq. (15), the sum of all these loss functions, becomes small.
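  • A hedged sketch of the training objective of Eqs. (12) to (15) is given below; the mean-squared separation loss, the binary cross-entropy on the stop flags, and the clipped memory-usage penalty are illustrative forms chosen here and are not the exact equations of the patent.

```python
# Illustrative composite training loss.
import numpy as np

def separation_loss(S_hat, S):                    # stands in for Eq. (12)
    return float(np.mean((S_hat - S) ** 2))

def count_loss(f_hat, f_true, eps=1e-8):          # stands in for Eq. (13)
    f_hat = np.clip(f_hat, eps, 1.0 - eps)
    return float(-np.mean(f_true * np.log(f_hat) + (1 - f_true) * np.log(1 - f_hat)))

def memory_usage_loss(W, I, T_count=1.0):         # stands in for Eq. (14)
    # W: stacked instruction vectors (num_writes, N); per-address usage is clipped
    # at T_count and the clipped total is pushed toward the total speaker count I.
    usage = np.minimum(W.sum(axis=0), T_count)
    return float((usage.sum() - I) ** 2)

def total_loss(S_hat, S, f_hat, f_true, W, I):    # stands in for Eq. (15)
    return (separation_loss(S_hat, S)
            + count_loss(f_hat, f_true)
            + memory_usage_loss(W, I))
```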
  • For the signal processing device 10B having the utterance section detection unit 17 and the signal processing device 10C having the voice recognition unit 18 as well, the learning unit 20 may optimize the parameters of the neural networks using loss functions corresponding to the processing units provided in the signal processing device. See Reference 6 for the signal processing device 10B having the utterance section detection unit 17, and Reference 7 for the signal processing device 10C having the voice recognition unit 18.
  • Reference 6 Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, Shinji Watanabe, “End-to-End Neural Speaker Diarization with Permutation-free Objectives”, Proc. Interspeech, pp. 4300-4304, 2019.
  • Reference 7 Shigeki Karita et al., “A comparative study on transformer vs RNN in speech applications”, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
  • FIG. 5 is a flowchart showing a processing procedure of the signal processing method according to the embodiment.
  • When the acoustic signal (observation signal) of time block b is input to the speaker feature vector extraction unit 11, the speaker feature vector extraction unit 11 estimates and extracts, from the observation signal of time block b, the speaker feature vector of one speaker present in time block b (step S4).
  • The memory control instruction unit 13 determines whether the speaker feature vector extracted by the speaker feature vector extraction unit 11 belongs to a speaker who has never appeared in any time block so far (step S5).
  • First, consider the case where it is the speaker feature vector of a speaker who has never appeared in any time block so far (step S5: Yes). In this case, the memory control instruction unit 13 instructs writing of the speaker feature vector to an unused memory slot of the external memory unit 12, and the external memory writing unit 14 writes the speaker feature vector to that unused memory slot of the external memory unit 12 (step S6).
  • Next, consider the case where it is the speaker feature vector of a speaker who has already appeared in a time block so far (step S5: No). In this case, the memory control instruction unit 13 instructs writing of the speaker feature vector to the memory slot corresponding to this speaker in the external memory unit 12, and the external memory writing unit 14 writes the speaker feature vector to that memory slot of the external memory unit 12 (step S7).
  • The external memory reading unit 15 then reads the speaker feature vector corresponding to the one speaker extracted by the speaker feature vector extraction unit 11 from the external memory unit 12 according to the instruction from the memory control instruction unit 13 (step S8), and outputs it to the memory control instruction unit 13.
  • The sound source extraction unit 16 extracts the voice of the speaker corresponding to this speaker feature vector from the observation signal, based on the observation signal and the speaker feature vector of the one speaker output from the memory control instruction unit 13 (step S9).
  • The utterance section detection unit 17 outputs the utterance section detection result of the speaker corresponding to this speaker feature vector, based on the observation signal and the speaker feature vector of the one speaker output from the memory control instruction unit 13 (step S10).
  • The voice recognition unit 18 outputs the voice recognition result of the speaker corresponding to this speaker feature vector, based on the observation signal and the speaker feature vector of the one speaker output from the memory control instruction unit 13 (step S11).
  • Steps S9 to S11 may be processed in parallel or in series, and the order of processing is not particularly limited; the processing is executed according to the voice processing function units that are provided. For example, when only the sound source extraction unit 16 is provided, only the sound source extraction process of step S9 is executed, and the process then proceeds to step S12.
  • The repetition control unit 19 determines whether to stop the repetition based on the processing results of the sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18 (step S12). The repetition control unit 19 may instead determine whether to stop the repetition based on the state of the extraction process of the speaker feature vector extraction unit 11.
  • When the repetition control unit 19 determines that the repetition is not to be stopped (step S12: No), the process returns to step S4 in order to proceed with the processing for the next speaker in this time block b, and the speaker feature vector of the next speaker is extracted.
  • When the repetition control unit 19 determines that the repetition is to be stopped (step S12: Yes), the processing results for this time block b are output (step S13). The output results are the sound source extraction result, the utterance section detection result, and the voice recognition result. Alternatively, the signal processing device 10 may output the processing results of all time blocks together.
  • The signal processing device 10 then determines whether the processing of all time blocks has been completed (step S14). If it has (step S14: Yes), the signal processing device 10 ends the processing for the input acoustic signal. If it has not (step S14: No), 1 is added to the time block index b in order to process the next time block (step S15), the process returns to step S4, and the processing is continued.
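  • The flow of FIG. 5 (steps S4 to S15) can be summarized by the following hypothetical sketch, which mirrors only the control structure described above, reuses the illustrative interfaces sketched earlier, and is not an implementation of the patented method.

```python
# Hypothetical block-online driver mirroring steps S4-S15 of FIG. 5.
# `controller` stands in for the repetition control unit 19 (hypothetical interface).
def run_block_online(blocks, extractor, instructor, writer, reader, memory,
                     processors, controller):
    all_results = []
    for x_block in blocks:                                    # next block: step S15
        block_results = []
        state = extractor.initial_state()
        while True:
            feature, state = extractor.extract(x_block, state)        # step S4
            w = instructor.address(feature, memory)                   # step S5
            memory = writer.write(memory, w, feature)                 # steps S6 / S7
            r = reader.read(memory, w)                                # step S8
            block_results.append([p.process(x_block, r)
                                  for p in processors])               # steps S9-S11
            if controller.should_stop(state, x_block, block_results): # step S12
                break
        all_results.append(block_results)                     # block output: step S13
    return all_results                                        # all blocks done: step S14
```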
  • In this way, the signal processing device 10 repeatedly extracts a speaker feature vector for each time block and writes it to the external memory unit 12. When the speaker feature vector of a speaker who has never appeared in any time block so far is extracted, the signal processing device 10 writes the speaker feature vector to an unused memory slot of the external memory unit 12. When the speaker feature vector of a speaker who has already appeared in a time block so far is extracted, the signal processing device 10 writes the speaker feature vector to the memory slot of the external memory unit 12 corresponding to this speaker.
  • The signal processing device 10 therefore does not execute the extraction process at all for speakers who are not speaking (silent speakers). Unlike the conventional technique, it does not need to attempt sound source extraction for silent speakers in every time block, so the amount of processing can be reduced, and because processing is properly applied only to the speakers who are truly speaking, the processing accuracy can also be improved.
  • Furthermore, because the signal processing device 10 extracts speaker feature vectors block by block, it does not need to extract the speaker feature vectors in the same order in all time blocks as the conventional technique does, so the optimality of the processing is not impaired.
  • The signal processing device 10 can thus improve the processing accuracy for an acoustic signal while reducing the amount of processing.
  • FIG. 6 is a diagram showing an example of a computer in which the signal processing device 10 is realized by executing a program.
  • the computer 1000 has, for example, a memory 1010 and a CPU 1020.
  • the computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these parts is connected by a bus 1080.
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012.
  • the ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System).
  • the hard disk drive interface 1030 is connected to the hard disk drive 1090.
  • the disk drive interface 1040 is connected to the disk drive 1100.
  • a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100.
  • the serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120.
  • the video adapter 1060 is connected to, for example, the display 1130.
  • the hard disk drive 1090 stores, for example, an OS (Operating System) 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the signal processing device 10 is implemented as a program module 1093 in which a code that can be executed by a computer is described.
  • the program module 1093 is stored in, for example, the hard disk drive 1090.
  • a program module 1093 for executing processing similar to the functional configuration in the signal processing device 10 is stored in the hard disk drive 1090.
  • the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
  • the setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, a memory 1010 or a hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 into the RAM 1012 and executes them as needed.
  • the program module 1093 and the program data 1094 are not limited to those stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Then, the program module 1093 and the program data 1094 may be read from another computer by the CPU 1020 via the network interface 1070.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A signal processing device (10) has: a speaker feature vector extraction unit (11) that, for an input acoustic signal, repeatedly extracts a speaker feature vector for each time block as many times as there are speakers present in the time block; a memory control instruction unit (13) that instructs writing of the speaker feature vector to an unused memory slot of an external memory unit (12) when the speaker feature vector of a speaker who has not appeared in any time block so far is extracted, and instructs writing of the speaker feature vector to the memory slot of the external memory unit (12) corresponding to a speaker who has already appeared when the speaker feature vector of a speaker who has already appeared in a time block so far is extracted; and a sound source extraction unit (16), an utterance section detection unit (17), and a voice recognition unit (18) that execute signal processing on the basis of a speaker feature vector.

Description

Signal processing device, signal processing method, and signal processing program
 The present invention relates to a signal processing device, a signal processing method, and a signal processing program.
 Among sound source separation techniques, which separate an acoustic signal containing a mixture of sounds from multiple sound sources into a signal for each source, there are a first sound source separation technique that targets sound picked up by multiple microphones and a second sound source separation technique that targets sound picked up by a single microphone. The second sound source separation technique is considered more difficult than the first because it cannot use information about the positions of the microphones.
 Here, the techniques described in Non-Patent Documents 1 to 3 are known as second sound source separation techniques that perform sound source separation based on the information in the input acoustic signal, without using microphone position information.
 The technique described in Non-Patent Document 1 separates an input acoustic signal into a predetermined number of sound sources. By feeding the input signal to a bi-directional long short-term memory neural network (BLSTM neural network, hereinafter BLSTM), a mask for extracting each sound source can be estimated. When the BLSTM parameters are trained, the parameters are updated, for a given input signal, so as to minimize the distance between the correct separated signal given in advance and the separated signal obtained by applying the estimated mask to the observation signal.
 The technique described in Non-Patent Document 2 performs sound source separation while varying the number of separated signals according to the number of sound sources contained in the input signal. As in Non-Patent Document 1, a mask is estimated using a BLSTM, but in Non-Patent Document 2 the separation mask estimated at any one time corresponds to only one sound source in the input signal. Using this separation mask, the observation signal is split into a separated signal and a residual signal obtained by removing the separated signal from the observation signal. It is then automatically determined whether another sound source signal still remains in the residual signal; if so, the residual signal is fed back into the BLSTM to extract another sound source. If no other sound source signal remains in the residual signal, the process ends at that point.
 In the technique described in Non-Patent Document 2, this mask estimation process is repeated until no sound source remains in the residual signal, extracting the sound sources one by one, so that sound source separation and estimation of the number of sound sources are achieved simultaneously. For the determination, thresholding the volume of the residual signal, or feeding the residual signal to another neural network that checks whether another sound source remains, has been proposed. In the technique described in Non-Patent Document 2, the same BLSTM is used repeatedly for the mask estimation.
 The techniques described in Non-Patent Documents 1 and 2 are batch methods that apply processing to the entire input signal, and therefore lack real-time capability. For example, when applying them to a recording of a meeting, processing cannot start until at least the recording of the meeting has finished. Consequently, these techniques cannot be used in applications where sound source separation is applied from the start of a meeting and each separated voice is transcribed sequentially with automatic speech recognition.
 The technique described in Non-Patent Document 3 was devised in view of this problem. In this technique, the input signal is divided into a plurality of time blocks (blocks of roughly 5 to 10 seconds in length), and processing is applied to the blocks sequentially. The method of Non-Patent Document 2 is applied to the first block, except that, at the same time as the sound source separation masks are estimated, the speaker feature vectors corresponding to the extracted speakers are computed and output as well.
 In the technique described in Non-Patent Document 3, in the second block, each of the speaker feature vectors obtained from the processing of the first block is used to extract the voices of those speakers repeatedly, in the same order as in the first block. In the third and subsequent blocks, speakers are likewise extracted in the order in which they were detected in past blocks.
 Here, in the technique described in Non-Patent Document 3, when a new speaker who did not appear in the first block appears in the second block, the voice component of that new speaker remains in the residual signal even after all the speakers who appeared in the first block have been extracted from the observation signal of the second block. Therefore, by using the determination process described above, the presence of the new speaker can be detected, and as a result the new speaker's mask and speaker feature vector can also be estimated.
 By repeating such processing block by block, sound source separation and estimation of the number of sound sources can be performed even for long recordings, in the form of block-online processing. In addition, by keeping the order of speaker extraction common across blocks, speakers can be tracked between time blocks, that is, it can be determined which of the separated sounds obtained in one time block and those obtained in a different time block belong to the same speaker.
 Note that, with the technique described in Non-Patent Document 3, even when a speaker extracted in a past block is not speaking in the current block, that speaker's feature vector is still used to extract the corresponding signal, just as for the other speakers. If the speaker's signal is not contained in the block, the mask for that speaker becomes 0, and as a result a signal with zero sound pressure is ideally extracted.
 When the technique described in Non-Patent Document 3 is used, it cannot be known in advance whether a particular speaker is speaking in a new time block, so extraction from the new time block using the speaker feature vector must be attempted for every speaker who has been extracted at least once in a past block. Whether the speaker is speaking can then be determined by examining the sound pressure of the output signal.
 However, in a typical conversation, only about two or three speakers talk within any given time block. In other words, it is extremely rare for all speakers who have spoken in the past to speak in a new block. The technique described in Non-Patent Document 3 therefore performs extraction processing that is not actually necessary for speakers who are not speaking (silent speakers).
 Moreover, even for silent speakers, a signal with exactly zero sound pressure is rarely extracted; a signal with some sound pressure is often extracted erroneously as that speaker's signal. When there are multiple silent speakers, repeating the sound source extraction process therefore also degrades the sound source separation and extraction performance for the speakers who are truly speaking.
 Furthermore, when the technique described in Non-Patent Document 3 is used, speakers in a new time block are extracted in the order in which they spoke in the past. In general, however, the optimal order of speaker extraction is likely to differ in each block (see Non-Patent Document 1). Attempting to extract speakers in the same order in every block therefore increases the amount of processing and impairs the optimality of the processing.
 The present invention has been made in view of the above, and its object is to provide a signal processing device, a signal processing method, and a signal processing program capable of improving the processing accuracy for an acoustic signal while reducing the amount of processing.
 To solve the above problem and achieve the object, the signal processing device according to the present invention includes: an extraction unit that, for an input acoustic signal, repeatedly extracts a speaker feature vector for each time block, as many times as there are speakers present in the time block; an external memory unit that stores the speaker feature vectors extracted by the extraction unit; an instruction unit that, when the extraction unit extracts the speaker feature vector of a speaker who has never appeared in any time block so far, instructs writing of the speaker feature vector to an unused memory slot of the external memory unit, that, when the extraction unit extracts the speaker feature vector of a speaker who has already appeared in a time block so far, instructs writing of the speaker feature vector to the memory slot of the external memory unit corresponding to that speaker, and that instructs reading of speaker feature vectors from the external memory unit; a writing unit that writes the speaker feature vector to the external memory unit in response to an instruction from the instruction unit; a reading unit that reads the speaker feature vector from the external memory unit in response to an instruction from the instruction unit; and a processing unit that executes signal processing based on the speaker feature vector read by the reading unit.
 According to the present invention, it is possible to improve the processing accuracy for an acoustic signal while reducing the amount of processing.
FIG. 1 is a diagram schematically showing an example of the configuration of a signal processing device according to an embodiment. FIG. 2 is a diagram schematically showing another example of the configuration of the signal processing device according to the embodiment. FIG. 3 is a diagram schematically showing another example of the configuration of the signal processing device according to the embodiment. FIG. 4 is a diagram schematically showing another example of the configuration of the signal processing device according to the embodiment. FIG. 5 is a flowchart showing the processing procedure of the data processing method according to the embodiment. FIG. 6 is a diagram showing an example of a computer in which the signal processing device is realized by executing a program.
 Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. The present invention is not limited to this embodiment. In the description of the drawings, the same parts are denoted by the same reference numerals. In the following, the notation "^A" for a symbol A is equivalent to the symbol in which "^" is written directly above "A".
[実施形態]
[信号処理装置の構成]
 まず、図1を用いて、実施形態に係る信号処理装置の構成について説明する。図1は、実施形態に係る信号処理装置の構成の一例を示す図である。実施形態に係る信号処理装置10は、例えば、ROM(Read Only Memory)、RAM(Random Access Memory)、CPU(Central Processing Unit)等を含むコンピュータ等に所定のプログラムが読み込まれて、CPUが所定のプログラムを実行することで実現される。
[Embodiment]
[Configuration of signal processing device]
First, the configuration of the signal processing device according to the embodiment will be described with reference to FIG. FIG. 1 is a diagram showing an example of a configuration of a signal processing device according to an embodiment. In the signal processing device 10 according to the embodiment, for example, a predetermined program is read into a computer or the like including a ROM (Read Only Memory), a RAM (Random Access Memory), a CPU (Central Processing Unit), and the CPU is a predetermined CPU. It is realized by executing the program.
 図1に示すように、信号処理装置10は、話者特徴ベクトル抽出部11(抽出部)、外部メモリ部12、メモリ制御指示部13(指示部)、外部メモリ書き込み部14(書き込み部)、外部メモリ読み込み部15(読み込み部)、音源抽出部16、発話区間検出部17、音声認識部18、繰り返し制御部19及び学習部20を有する。信号処理装置10は、音響信号が入力されると、時間ブロックに分割し、分割した各時間ブロックの音響信号を話者特徴ベクトル抽出部11に入力する。 As shown in FIG. 1, the signal processing device 10 includes a speaker feature vector extraction unit 11 (extraction unit), an external memory unit 12, a memory control instruction unit 13 (instruction unit), and an external memory writing unit 14 (writing unit). It has an external memory reading unit 15 (reading unit), a sound source extraction unit 16, an utterance section detection unit 17, a voice recognition unit 18, a repetition control unit 19, and a learning unit 20. When the acoustic signal is input, the signal processing device 10 divides it into time blocks, and inputs the acoustic signal of each divided time block to the speaker feature vector extraction unit 11.
 話者特徴ベクトル抽出部11は、入力された音響信号(以降、観測信号とする)に対し、時間ブロックごとに話者特徴ベクトルを、そのブロックに存在する話者の数の分だけ繰り返し抽出する。話者特徴ベクトル抽出処理の繰り返し数は、繰り返し制御部19によって設定される。 The speaker feature vector extraction unit 11 repeatedly extracts, from the input acoustic signal (hereinafter referred to as the observation signal), a speaker feature vector for each time block, as many times as there are speakers present in that block. The number of repetitions of the speaker feature vector extraction process is set by the repetition control unit 19.
 話者特徴ベクトル抽出部11は、話者特徴ベクトルの抽出方法として、各種方法を用いることができる。例えば、非特許文献3に記載の技術または参考文献1に記載の技術を用いる場合には、最初の時間ブロックにおいては、同ブロックに含まれる任意の数の話者のそれぞれに関して話者特徴ベクトルを抽出できる。
参考文献1:Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu, “End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors”, arXiv:2005.09921, 2021.
The speaker feature vector extraction unit 11 can use various methods to extract the speaker feature vector. For example, when the technique described in Non-Patent Document 3 or the technique described in Reference 1 is used, a speaker feature vector can be extracted, in the first time block, for each of an arbitrary number of speakers included in that block.
Reference 1: Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu, “End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors”, arXiv: 2005.09921, 2021.
 ここで、一般的な形式として、話者特徴ベクトルの抽出は、式(1)のように定式化できる。 Here, as a general form, the extraction of the speaker feature vector can be formulated as in Eq. (1).
[数1] (Equation (1))
 I_bは時間ブロックbにおける話者の総数である。a_{b,i}は、時間ブロックbにおいて抽出されたi番目の話者に関する話者特徴ベクトルである。X_bは、時間ブロックbの観測信号、NN_embed[・]は、BLSTMなどのニューラルネットワークである。h_{b,0}は、話者特徴ベクトル抽出における最初の繰り返しであることをネットワークに知らせるための初期ベクトル(場合によってはマトリックス)であり、NN_embed[・]に合わせて適切に設定するものである。 I_b is the total number of speakers in time block b. a_{b,i} is the speaker feature vector of the i-th speaker extracted in time block b. X_b is the observation signal of time block b, and NN_embed[・] is a neural network such as a BLSTM. h_{b,0} is an initial vector (or, in some cases, a matrix) that tells the network that this is the first iteration of the speaker feature vector extraction, and is set appropriately for NN_embed[・].
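The iteration in Eq. (1) can be illustrated with a short sketch. This is a minimal NumPy sketch, not the embodiment's network: nn_embed, toy_nn_embed, and the fixed speaker count are illustrative placeholders standing in for NN_embed[・] and for the repetition count I_b that is set by the repetition control unit 19.

import numpy as np

def extract_speaker_vectors(X_b, nn_embed, h_b0, num_speakers):
    # Repeat Eq. (1): a_{b,i}, h_{b,i} = NN_embed(X_b, h_{b,i-1}) for i = 1, ..., I_b.
    vectors, h = [], h_b0
    for _ in range(num_speakers):
        a_bi, h = nn_embed(X_b, h)
        vectors.append(a_bi)
    return vectors

def toy_nn_embed(X_b, h):
    # Toy stand-in: a state-dependent projection of the block's mean feature.
    a = np.tanh(X_b.mean(axis=0) + h)
    return a, a

X_b = np.random.randn(160, 64)                     # one time block (frames x feature dims)
feats = extract_speaker_vectors(X_b, toy_nn_embed, np.zeros(64), num_speakers=2)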
 外部メモリ部12は、話者特徴ベクトル抽出部11によって抽出された話者特徴ベクトルを格納するためのメモリである。外部メモリ部12は、複数のメモリ番地を有し、1つのメモリ番地に1人の話者の話者特徴ベクトルを格納する。外部メモリ部12は、RAM(Random Access Memory)、フラッシュメモリ、NVSRAM(Non Volatile Static Random Access Memory)等のデータを書き換え可能な半導体メモリであってもよい。外部メモリ部12は、HDD(Hard Disk Drive)、SSD(Solid State Drive)、光ディスク等の記憶装置であってもよい。 The external memory unit 12 is a memory for storing the speaker feature vector extracted by the speaker feature vector extraction unit 11. The external memory unit 12 has a plurality of memory addresses, and stores a speaker feature vector of one speaker in one memory address. The external memory unit 12 may be a semiconductor memory in which data such as RAM (Random Access Memory), flash memory, NVSRAM (Non Volatile Static Random Access Memory) can be rewritten. The external memory unit 12 may be a storage device such as an HDD (Hard Disk Drive), SSD (Solid State Drive), or an optical disk.
 メモリ制御指示部13は、話者特徴ベクトル抽出部11によって話者特徴ベクトルが抽出されるごとに、その話者特徴ベクトルを受信する。 The memory control instruction unit 13 receives the speaker feature vector each time the speaker feature vector is extracted by the speaker feature vector extraction unit 11.
 メモリ制御指示部13は、受信した話者特徴ベクトルが、新規話者のものである場合、外部メモリ部12の新しいメモリ番地にそのベクトルを書き込むように指示を行う。すなわち、メモリ制御指示部13は、今までの時間ブロックには出現したことのない話者の話者特徴量ベクトルが抽出された場合は、外部メモリ部12の未使用のメモリスロットに、この話者特徴ベクトルを書き込むことを指示する。 When the received speaker feature vector is that of a new speaker, the memory control instruction unit 13 instructs that the vector be written to a new memory address of the external memory unit 12. That is, when a speaker feature vector of a speaker who has not appeared in any previous time block is extracted, the memory control instruction unit 13 instructs that this speaker feature vector be written to an unused memory slot of the external memory unit 12.
 一方、メモリ制御指示部13は、受信した話者特徴ベクトルが、過去に出現した話者のものである場合、外部メモリ部12のメモリ番地のうち、その過去に出現した話者に対応するメモリ番地に、その話者特徴ベクトルを適切な形で書き込むように指示を行う。すなわち、メモリ制御指示部13は、今までの時間ブロックにすでに出現したことのある話者の話者特徴量ベクトルが抽出された場合、外部メモリ部12のこの話者に対応するメモリスロットに同話者特徴ベクトルを書き込むことを指示する。 On the other hand, when the received speaker feature vector is that of a speaker who has appeared in the past, the memory control instruction unit 13 instructs that the speaker feature vector be written, in an appropriate form, to the memory address of the external memory unit 12 that corresponds to that previously appearing speaker. That is, when a speaker feature vector of a speaker who has already appeared in a previous time block is extracted, the memory control instruction unit 13 instructs that the speaker feature vector be written to the memory slot of the external memory unit 12 corresponding to this speaker.
 メモリ制御指示部13は、参考文献2~4に記載されたニューラルネットワークに基づくものを用いて、外部メモリ部12への話者特徴ベクトルの書き込み、読み込み指示を行う。
参考文献2:A. Graves, G. Wayne, and I. Danihelka. “Neural Turing Machines”, arxiv:1410.5401., 2014.
参考文献3:Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez; Edward Grefenstette, and Tiago Ramalho, “Hybrid computing using a neural network with dynamic external memory”. Nature. 538 (7626): 471-476. (2016-10-12).
参考文献4:Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus, “End-To-End Memory Networks”, Advances in Neural Information Processing Systems 28, pp,2440-2448, 2015
The memory control instruction unit 13 issues instructions for writing speaker feature vectors to, and reading them from, the external memory unit 12, using a neural-network-based mechanism such as those described in References 2 to 4.
Reference 2: A. Graves, G. Wayne, and I. Danihelka. “Neural Turing Machines”, arxiv: 1410.5401., 2014.
Reference 3: Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez; Edward Grefenstette, and Tiago Ramalho, “Hybrid computing using a neural network with dynamic external memory”. Nature. 538 (7626): 471-476. (2016-10-12).
Reference 4: Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus, “End-To-End Memory Networks”, Advances in Neural Information Processing Systems 28, pp, 2440-2448, 2015
 外部メモリ書き込み部14は、メモリ制御指示部13からの指示を受けて外部メモリ部12に話者特徴ベクトルの書き込みを行う。外部メモリ読み込み部15は、メモリ制御指示部13からの指示を受けて、外部メモリ部12から各話者に関する話者特徴ベクトルを読み込む。このように、信号処理装置10は、ニューラルネットワークを用いて外部メモリ部12との情報のやり取りを実装することで、全てのシステムを、誤差伝搬法により最適化することが可能となる。 The external memory writing unit 14 writes the speaker feature vector to the external memory unit 12 in response to an instruction from the memory control instruction unit 13. The external memory reading unit 15 receives an instruction from the memory control instruction unit 13 and reads a speaker feature vector for each speaker from the external memory unit 12. In this way, the signal processing device 10 can optimize all the systems by the error propagation method by implementing the exchange of information with the external memory unit 12 using the neural network.
 メモリ制御指示部13は、外部メモリ部12への書き込みを行う外部メモリ書き込み部14、外部メモリ部12への読み込みを行う外部メモリ読み込み部15に対する指示のために、参考文献2~4等に記載の手順に倣って、指示ベクトルを生成する。 Following the procedures described in References 2 to 4 and the like, the memory control instruction unit 13 generates an instruction vector in order to instruct the external memory writing unit 14, which writes to the external memory unit 12, and the external memory reading unit 15, which reads from the external memory unit 12.
 例えば、メモリ制御指示部13は、まず、話者特徴ベクトル抽出部からの出力a_{b,i}と、メモリ制御指示部13の過去の出力(指示ベクトル)w_{b,i-1}とを基に、ニューラルネットワークを用いて1×Mのサイズのキーベクトルk_{b,i}を生成する。また、メモリ制御指示部13は、同様に、式(2)の計算に用いるβ_{b,i}も生成する。その上で、メモリ制御指示部13は、同キーベクトルと現在の外部メモリM_{b,i}の各列nとの近さを測り、指示ベクトルw_{b,i}を式(2)のように計算する。 For example, the memory control instruction unit 13 first generates, using a neural network, a key vector k_{b,i} of size 1×M based on the output a_{b,i} from the speaker feature vector extraction unit and the past output (instruction vector) w_{b,i-1} of the memory control instruction unit 13. The memory control instruction unit 13 likewise generates β_{b,i}, which is used in the calculation of Eq. (2). The memory control instruction unit 13 then measures the closeness between this key vector and each column n of the current external memory M_{b,i}, and computes the instruction vector w_{b,i} as in Eq. (2).
[数2] (Equation (2))
 式(2)において、M_{b,i}は、b番目の時間ブロックにおけるi番目の話者特徴ベクトルを書き込む前の外部メモリ行列(N×Mのサイズ)である。Nをメモリ番地の総数とし、Mを番地に書き込めるベクトルの長さとする。 In Eq. (2), M_{b,i} is the external memory matrix (of size N×M) before the i-th speaker feature vector of the b-th time block is written. N is the total number of memory addresses, and M is the length of the vector that can be written to each address.
 キーベクトルとメモリの各列との近さを測る尺度としては、式(3)に示すコサイン類似度などが考えられる。 As a measure for measuring the closeness between the key vector and each column of the memory, the cosine similarity shown in the equation (3) can be considered.
[数3] (Equation (3))
 w_{b,i}を1×N次元の指示ベクトルとすると、w_{b,i}(n)は、その各要素であり、式(4)に示す特徴を有する。 Let w_{b,i} be a 1×N-dimensional instruction vector; w_{b,i}(n) denotes each of its elements, and it has the properties shown in Eq. (4).
[数4] (Equation (4))
 また、指示ベクトルw_{b,i}中の特定の要素が1に近くなるように(ベクトルがスパースになるように)、式(5)に示す操作を加える。 Further, the operation shown in Eq. (5) is applied so that a specific element of the instruction vector w_{b,i} approaches 1 (that is, so that the vector becomes sparse).
[数5] (Equation (5))
 式(5)において、cは、スパース性を高めるための定数である。 In equation (5), c is a constant for enhancing sparsity.
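A minimal NumPy sketch of the addressing in Eqs. (2)-(5), under the assumption that memory slots are stored as rows of an N×M matrix and that the key k_{b,i} and scalar β_{b,i} have already been produced by the controller network. The exact form of Eq. (5) is not reproduced here; exponentiation by the constant c followed by renormalization is used only as one plausible sharpening choice.

import numpy as np

def cosine_similarity(k, M):
    # Eq. (3): cosine similarity between the key and every memory slot.
    return (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)

def instruction_vector(k, beta, M, c=2.0):
    sim = cosine_similarity(k, M)
    w = np.exp(beta * sim)
    w = w / w.sum()            # Eq. (2), with the properties of Eq. (4): w(n) >= 0, sums to 1
    w = w ** c
    return w / w.sum()         # Eq. (5)-style sharpening with constant c (assumed form)

M = np.random.randn(8, 64)     # N = 8 memory addresses, each holding an M = 64-dim vector
w = instruction_vector(np.random.randn(64), beta=5.0, M=M)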
 外部メモリ書き込み部14は、メモリ制御指示部13から出力された指示ベクトルwb,iを基に、話者特徴ベクトルの外部メモリ部12への書き込み及び更新を行う。 The external memory writing unit 14 writes and updates the speaker feature vector to the external memory unit 12 based on the instruction vectors w b and i output from the memory control instruction unit 13.
 外部メモリ書き込み部14による書き込み処理は、以下に説明するように、消去処理と書き込み処理とを対として行われる。話者特徴ベクトル抽出部11において抽出された話者特徴ベクトルa_{b,i}は、外部メモリ書き込み部14に渡され、適切な形で外部メモリ部12に書き出される。例えば、外部メモリ書き込み部14は、1×Nのベクトルである消去ベクトルe_{b,i}と指示ベクトルw_{b,i}とを基に、式(6)のようにメモリの消去を行う。なお、この消去ベクトルe_{b,i}もメモリ制御指示部13からの出力である。 The writing process performed by the external memory writing unit 14 is carried out as a pair of an erase process and a write process, as described below. The speaker feature vector a_{b,i} extracted by the speaker feature vector extraction unit 11 is passed to the external memory writing unit 14 and written into the external memory unit 12 in an appropriate form. For example, the external memory writing unit 14 erases the memory as in Eq. (6), based on the erase vector e_{b,i}, which is a 1×N vector, and the instruction vector w_{b,i}. This erase vector e_{b,i} is also an output of the memory control instruction unit 13.
[数6] (Equation (6))
 lは、N個の1からなる1×Nのベクトルである。外部メモリと[・]内の成分との掛け算は、メモリスロットにおいて、point-wiseの掛け算を行う。 l is a 1×N vector consisting of N ones. The multiplication between the external memory and the term in [・] is performed point-wise for each memory slot.
 なお、消去ベクトルe_{b,i}は、式(7)のように設定することが多い。 The erase vector e_{b,i} is often set as in Eq. (7).
[数7] (Equation (7))
 つまり、式(6)、式(7)は、消去ベクトルの要素e_{b,i}(n)と、指示ベクトルの要素w_{b,i}(n)の両方が1の値である時に、そのメモリスロットに関する情報は0にリセットされることを示している。そして、上記の消去処理を行った上で、式(8)に示すように新しい情報ベクトルa_{b,i}(ここでは、話者特徴ベクトル)が、外部メモリ書き込み部14によって外部メモリ部12に書き込まれる。 That is, Eqs. (6) and (7) indicate that when both the element e_{b,i}(n) of the erase vector and the element w_{b,i}(n) of the instruction vector take the value 1, the information in that memory slot is reset to 0. Then, after the above erase process has been performed, a new information vector a_{b,i} (here, the speaker feature vector) is written to the external memory unit 12 by the external memory writing unit 14, as shown in Eq. (8).
[数8] (Equation (8))
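A sketch of the erase-and-write pair of Eqs. (6)-(8), following the description above that a slot is cleared when both w_{b,i}(n) and e_{b,i}(n) are close to 1; the matrix layout (slots as rows) and the variable names are assumptions, not the embodiment's implementation.

import numpy as np

def erase_and_write(M, w, e, a):
    # Eqs. (6)/(7): scale slot n by (1 - w(n) e(n)); Eq. (8): add w(n) * a to slot n.
    M_tilde = M * (1.0 - (w * e))[:, None]
    return M_tilde + np.outer(w, a)

M = np.zeros((8, 64))
w = np.eye(8)[2]               # instruction vector addressing slot 2
M = erase_and_write(M, w, e=np.ones(8), a=np.random.randn(64))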
 外部メモリ読み込み部15は、メモリ制御指示部13からの指示を受けて、外部メモリ部12から各話者に関する話者特徴ベクトルを読み込み、メモリ制御指示部13に出力する。外部メモリ読み込み部15は、メモリ制御指示部13から出力された指示ベクトルwb,iを基に、外部メモリ部12から、更新済みの話者特徴ベクトルの読み込みを行う。 The external memory reading unit 15 receives an instruction from the memory control instruction unit 13, reads a speaker feature vector for each speaker from the external memory unit 12, and outputs the speaker feature vector to the memory control instruction unit 13. The external memory reading unit 15 reads the updated speaker feature vector from the external memory unit 12 based on the instruction vectors w b and i output from the memory control instruction unit 13.
 外部メモリ読み込み部15は、指示ベクトルwb,iを用いることで、式(9)に示すように、外部メモリ部12からデータを読み込むことができる。 The external memory reading unit 15 can read data from the external memory unit 12 as shown in the equation (9) by using the instruction vectors w b and i .
[数9] (Equation (9))
 式(9)に示すように、行列Mと指示ベクトルw_{b,i}との乗算で、読み込み対象の話者特徴ベクトルr_{b,i}を出力することができる。指示ベクトルのn番目の要素w_{b,i}(n)の値が0であることは、n番目の番地からは情報を読み込まないという指示に対応するため、指示ベクトルの要素w_{b,i}(n)の値が1に近いメモリ番地の情報が主に読み込まれ、0に近いメモリ番地の情報は相対的に読み込まれない。 As shown in Eq. (9), the speaker feature vector r_{b,i} to be read can be obtained by multiplying the matrix M by the instruction vector w_{b,i}. Since a value of 0 for the n-th element w_{b,i}(n) of the instruction vector corresponds to an instruction not to read information from the n-th address, information at memory addresses whose instruction-vector element w_{b,i}(n) is close to 1 is mainly read, while information at addresses whose element is close to 0 is read relatively little.
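Reading (Eq. (9)) is then just the w-weighted combination of the memory slots; a one-line sketch under the same assumed row layout as above:

import numpy as np

def read_speaker_vector(M, w):
    # Eq. (9): r_{b,i} = w_{b,i} M; slots with w(n) near 0 contribute almost nothing.
    return w @ M

r = read_speaker_vector(np.random.randn(8, 64), np.eye(8)[2])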
 信号処理装置10では、このように、外部メモリ書き込み部14による書き込み処理、外部メモリ読み込み部15による読み込み処理を、i=1からi=I_bまで繰り返すことで、外部メモリ部12への書き込み及び読み込みを全ての話者について行うことができる。そして、これら処理が時間ブロックごとに実行されるため、全ての時間ブロックの処理後の外部メモリ部12の書き込みが行われた番地の数は、全時間ブロック中に現れた話者数に相当する。したがって、全ての時間ブロックの処理後に外部メモリ部12の状態を確認することによって、全時間ブロック中に現れた話者数を確認することができる。 In the signal processing device 10, by repeating the writing process by the external memory writing unit 14 and the reading process by the external memory reading unit 15 from i = 1 to i = I_b in this way, writing to and reading from the external memory unit 12 can be performed for all speakers. Since these processes are executed for each time block, the number of addresses of the external memory unit 12 that have been written to after all time blocks have been processed corresponds to the number of speakers that appeared over all the time blocks. Therefore, by checking the state of the external memory unit 12 after all time blocks have been processed, the number of speakers that appeared over all the time blocks can be determined.
 音源抽出部16は、観測信号と、メモリ制御指示部13から出力された話者特徴ベクトルに基づき、この話者特徴ベクトルに対応する話者の音声を観測信号から抽出する(詳細は、非特許文献3参照)。 Based on the observation signal and the speaker feature vector output from the memory control instruction unit 13, the sound source extraction unit 16 extracts, from the observation signal, the voice of the speaker corresponding to this speaker feature vector (for details, see Non-Patent Document 3).
 音源抽出部16は、観測信号X_bおよび外部メモリ部12から読みだされた話者特徴ベクトルr_{b,i}を用いて、式(10)に示すように分離音声^S_{b,i}を抽出する。 Using the observation signal X_b and the speaker feature vector r_{b,i} read from the external memory unit 12, the sound source extraction unit 16 extracts the separated speech ^S_{b,i} as shown in Eq. (10).
[数10] (Equation (10))
 NN_extract[・]には、BLSTMや畳み込みニューラルネットワークなどのニューラルネットワークを用いるのが一般的である。 A neural network such as a BLSTM or a convolutional neural network is generally used for NN_extract[・].
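A minimal interface sketch of Eq. (10); the toy mask-based stand-in for NN_extract is only one common choice and is not the network of the embodiment.

import numpy as np

def extract_source(X_b, r_bi, nn_extract):
    # Eq. (10): ^S_{b,i} = NN_extract(X_b, r_{b,i}).
    return nn_extract(X_b, r_bi)

def toy_nn_extract(X, r):
    # Toy stand-in: a speaker-conditioned sigmoid mask applied to the observation.
    mask = 1.0 / (1.0 + np.exp(-(X + r)))
    return mask * X

S_hat = extract_source(np.random.randn(160, 64), np.random.randn(64), toy_nn_extract)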
 なお、時間ブロックbにおいて抽出された分離音声^S_{b,i}(i=1,・・・,I_b)と、時間ブロックb´(b≠b´)で抽出された分離音声^S_{b´,i}(i=1,・・・,I_{b´})とが、同じ話者のものであるか否かは、外部メモリ部12に対する指示ベクトルw_{b,i}(i=1,・・・,I_b)と、指示ベクトルw_{b´,i}(i=1,・・・,I_{b´})の各iにおいて最大値が検出されるインデックスが同じか否かを調べることで判定できる。言い換えれば、メモリ制御指示部13は、話者照合の役割を担っている。このため、話者特徴ベクトル抽出部11において、各時間ブロック間で異なる順序で話者特徴ベクトルが抽出されても、前述の外部メモリ部12への指示ベクトル内の最大値検出と比較処理とを用いることで、時間ブロック間で話者を追従することができる。 Whether the separated speech ^S_{b,i} (i = 1, ..., I_b) extracted in time block b and the separated speech ^S_{b´,i} (i = 1, ..., I_{b´}) extracted in time block b´ (b ≠ b´) belong to the same speaker can be determined by checking whether, for each i, the index at which the maximum value is detected is the same for the instruction vector w_{b,i} (i = 1, ..., I_b) and the instruction vector w_{b´,i} (i = 1, ..., I_{b´}) given to the external memory unit 12. In other words, the memory control instruction unit 13 plays the role of speaker verification. Therefore, even if the speaker feature vector extraction unit 11 extracts the speaker feature vectors in a different order in different time blocks, speakers can be tracked across time blocks by using the above-described detection and comparison of the maximum value in the instruction vectors for the external memory unit 12.
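The cross-block speaker matching described above reduces to comparing the slot indices at which the instruction vectors peak; a minimal sketch, with illustrative vectors:

import numpy as np

def same_speaker(w_bi, w_bpi):
    # Same speaker <=> the two instruction vectors peak at the same memory slot.
    return int(np.argmax(w_bi)) == int(np.argmax(w_bpi))

print(same_speaker(np.array([0.9, 0.05, 0.05]), np.array([0.8, 0.1, 0.1])))   # True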
 発話区間検出部17は、観測信号と、メモリ制御指示部13から出力された話者特徴ベクトルに基づき、この話者特徴ベクトルに対応する話者の発話区間検出結果を出力する(詳細は、参考文献1参照)。 Based on the observation signal and the speaker feature vector output from the memory control instruction unit 13, the utterance section detection unit 17 outputs an utterance section detection result for the speaker corresponding to this speaker feature vector (for details, see Reference 1).
 音声認識部18は、観測信号と、メモリ制御指示部13から出力された話者特徴ベクトルに基づき、この話者特徴ベクトルに対応する話者の音声認識結果を出力する(詳細は、参考文献5参照)。
参考文献5:Marc Delcroix, Shinji Watanabe, Tsubasa Ochiai,Keisuke Kinoshita, Shigeki Karita, Atsunori Ogawa, Tomohiro Nakatani, “End-to-end SpeakerBeam for single channel target speech recognition”, Interspeech2019, pp.451-455, 2019.
Based on the observation signal and the speaker feature vector output from the memory control instruction unit 13, the voice recognition unit 18 outputs a speech recognition result for the speaker corresponding to this speaker feature vector (for details, see Reference 5).
Reference 5: Marc Delcroix, Shinji Watanabe, Tsubasa Ochiai, Keisuke Kinoshita, Shigeki Karita, Atsunori Ogawa, Tomohiro Nakatani, “End-to-end SpeakerBeam for single channel target speech recognition”, Interspeech2019, pp.451-455, 2019.
 なお、音源抽出部16、発話区間検出部17及び音声認識部18は、音声を処理する処理部の一例である。また、実施形態に係る信号処理装置として、音源抽出部16、発話区間検出部17及び音声認識部18の複数の処理部を有する信号処理装置10について説明したが、これに限らない。例えば、実施形態に係る信号処理装置は、音源抽出部16を有する信号処理装置10A(図2参照)、発話区間検出部17を有する信号処理装置10B(図3参照)、または、音声認識部18を有する信号処理装置10C(図4参照)であってもよい。 The sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18 are examples of processing units that process speech. Although the signal processing device 10 having the plurality of processing units, namely the sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18, has been described as the signal processing device according to the embodiment, the embodiment is not limited to this. For example, the signal processing device according to the embodiment may be a signal processing device 10A having the sound source extraction unit 16 (see FIG. 2), a signal processing device 10B having the utterance section detection unit 17 (see FIG. 3), or a signal processing device 10C having the voice recognition unit 18 (see FIG. 4).
 繰り返し制御部19は、話者特徴ベクトル抽出部11の抽出処理の状態、または、音源抽出部16と発話区間検出部17と音声認識部18とによる処理結果を基に、話者特徴ベクトル抽出部11による抽出処理の繰り返し数を決定する。言い換えると、繰り返し制御部19は、例えば、話者特徴ベクトル抽出部11の話者特徴ベクトルの抽出処理の状態を基に、話者特徴ベクトル抽出部11による抽出処理の繰り返し数を決定する。または、繰り返し制御部19は、発話区間検出部17及び音声認識部18からの出力結果を用いて、話者特徴ベクトル抽出部11の話者特徴ベクトルの抽出処理の繰り返し数を決定する。なお、理想的には、話者特徴ベクトル抽出部11は、各ブロックbにおいてI_b個の話者特徴ベクトルを抽出することが望ましい。 The repetition control unit 19 determines the number of repetitions of the extraction process by the speaker feature vector extraction unit 11, based on the state of the extraction process of the speaker feature vector extraction unit 11 or on the processing results of the sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18. In other words, the repetition control unit 19 determines the number of repetitions of the extraction process by the speaker feature vector extraction unit 11 based on, for example, the state of the speaker feature vector extraction process of the speaker feature vector extraction unit 11. Alternatively, the repetition control unit 19 determines the number of repetitions of the speaker feature vector extraction process of the speaker feature vector extraction unit 11 using the output results from the utterance section detection unit 17 and the voice recognition unit 18. Ideally, the speaker feature vector extraction unit 11 should extract I_b speaker feature vectors in each block b.
 例えば、音源抽出部16のみを有する信号処理装置10A(図2参照)の場合、繰り返し制御部19は、話者特徴ベクトル抽出部11から出力される内部状態ベクトルh_{b,i}、観測信号X_b、分離音声^S_{b,i}を用いて、以下の式(11)のような繰り返しを停止すべきか否かを示すスカラー値^f_{b,i}(0≦^f_{b,i}≦1)を算出する。スカラー値^f_{b,i}が所定の値よりも大きければ繰り返しを止め、所定の値よりも低ければ繰り返しを継続する。 For example, in the case of the signal processing device 10A (see FIG. 2) having only the sound source extraction unit 16, the repetition control unit 19 uses the internal state vector h_{b,i} output from the speaker feature vector extraction unit 11, the observation signal X_b, and the separated speech ^S_{b,i} to calculate a scalar value ^f_{b,i} (0 ≦ ^f_{b,i} ≦ 1) that indicates whether the repetition should be stopped, as in Eq. (11) below. If the scalar value ^f_{b,i} is larger than a predetermined value, the repetition is stopped; if it is lower than the predetermined value, the repetition is continued.
[数11] (Equation (11))
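A sketch of the stop decision around Eq. (11): a scalar ^f_{b,i} in [0, 1] is computed from the current state and compared with a threshold. The stand-in stop network and the threshold value 0.5 are assumptions, not values from the embodiment.

import numpy as np

def should_stop(h_bi, X_b, S_hat_bi, stop_net, threshold=0.5):
    f_bi = stop_net(h_bi, X_b, S_hat_bi)       # Eq. (11): 0 <= ^f_{b,i} <= 1
    return f_bi > threshold

toy_stop_net = lambda h, X, S: 1.0 / (1.0 + np.exp(-h.mean()))
stop = should_stop(np.random.randn(64), np.random.randn(160, 64),
                   np.random.randn(160, 64), toy_stop_net)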
 繰り返し制御部19は、話者特徴ベクトル抽出部11が話者特徴ベクトルを抽出する度に、異なる補助情報を話者特徴ベクトル抽出部11のニューラルネットワークに入力することで、話者特徴ベクトル抽出部11に、異なる音源に対応する話者特徴ベクトルの抽出結果を出力させる。 Each time the speaker feature vector extraction unit 11 extracts a speaker feature vector, the repetition control unit 19 inputs different auxiliary information to the neural network of the speaker feature vector extraction unit 11, thereby causing the speaker feature vector extraction unit 11 to output extraction results of speaker feature vectors corresponding to different sound sources.
 例えば、繰り返し制御部19は、入力される音響信号に含まれる全ての話者特徴ベクトルを仮想的に認識している場合を例に説明する。繰り返し制御部19は、話者特徴ベクトル抽出部11が話者特徴ベクトルを抽出する度に、この話者特徴ベクトルがいずれの話者に対応するかを認識する。そして、繰り返し制御部19は、未抽出の話者がある場合には、次の話者の話者特徴ベクトルに関する補助情報を話者特徴ベクトル抽出部11に入力して、話者特徴ベクトル抽出部11に次の話者の話者特徴ベクトルを抽出させる。繰り返し制御部19は、この時間ブロックの観測信号の全話者に対する話者特徴ベクトルの抽出処理が終われば、話者特徴ベクトル抽出部11による抽出処理の繰り返しを停止する。 For example, consider the case where the repetition control unit 19 virtually recognizes all the speaker feature vectors contained in the input acoustic signal. Each time the speaker feature vector extraction unit 11 extracts a speaker feature vector, the repetition control unit 19 recognizes which speaker this speaker feature vector corresponds to. Then, if there is a speaker who has not yet been extracted, the repetition control unit 19 inputs auxiliary information on the speaker feature vector of the next speaker to the speaker feature vector extraction unit 11, and causes the speaker feature vector extraction unit 11 to extract the speaker feature vector of the next speaker. When the speaker feature vector extraction process has been completed for all speakers in the observation signal of this time block, the repetition control unit 19 stops the repetition of the extraction process by the speaker feature vector extraction unit 11.
 学習部20は、学習用データを用いて、信号処理装置10が使用するパラメータを最適化する。学習部20は、信号処理装置10を構成するニューラルネットワークのパラメータを所定の目的関数に基づき最適化する。 The learning unit 20 optimizes the parameters used by the signal processing device 10 using the learning data. The learning unit 20 optimizes the parameters of the neural network constituting the signal processing device 10 based on a predetermined objective function.
 学習データは、入力信号(観測信号)と、この入力信号に含まれる各音源に対応する正解クリーン信号、正解発話時間情報、正解発話内容と、入力信号に含まれる人数の総数情報(総話者数)Iとから成る。学習部20は、学習データを基に、信号処理装置10の出力と正解情報との誤差が小さくなるようにパラメータを学習する。 The learning data consists of an input signal (observation signal); the correct clean signal, correct utterance time information, and correct utterance content corresponding to each sound source contained in this input signal; and the total number of speakers contained in the input signal (total speaker count) I. Based on the learning data, the learning unit 20 learns the parameters so that the error between the output of the signal processing device 10 and the correct information becomes small.
 具体的には、図2の信号処理装置10Aのパラメータを最適化するために学習処理を行う場合、入力信号X_b(b=1,・・・,B)を与えた際に、各時間ブロックにおいてシステムの出力結果(分離音声)^S_{b,i}が正解の分離音声S_{b,i}に近くなるように、式(12)の自乗誤差基準の損失関数を設ける。 Specifically, when training is performed to optimize the parameters of the signal processing device 10A of FIG. 2, a squared-error-based loss function as in Eq. (12) is provided so that, when an input signal X_b (b = 1, ..., B) is given, the system output (separated speech) ^S_{b,i} in each time block becomes close to the correct separated speech S_{b,i}.
[数12] (Equation (12))
 また、各時間ブロックbにおいて、正しい音源数を推定するために繰り返し回数を決定するためのスカラー値について、式(13)のクロスエントロピー損失関数を設ける。 Further, in each time block b, the cross entropy loss function of the equation (13) is provided for the scalar value for determining the number of repetitions in order to estimate the correct number of sound sources.
[数13] (Equation (13))
 なお、f_{b,i}は、i=I_bの場合のみ1を取り、それ以外は0を取る値である。 Note that f_{b,i} takes the value 1 only when i = I_b, and takes the value 0 otherwise.
 そして、学習部20は、外部メモリ内で使われるアドレスの総数が総話者数Iと同じ値となるように、式(14)も同時に小さくなるようにパラメータを更新する。 Then, the learning unit 20 updates the parameter so that the total number of addresses used in the external memory becomes the same value as the total number of speakers I, and the equation (14) also becomes small at the same time.
[数14] (Equation (14))
 式(14)において、Tcountは事前設定した閾値であり、一般的には1を設定する。min(・)は、入力の値がTcountよりも大きい場合はTcountを出力し、Tcountよりも小さい場合、そのままの値を出力する関数である。式(14)は、外部メモリの中で使用されるメモリ番地の数が総話者数Iと合致するように促す損失関数である。 In the equation (14), T count is a preset threshold value, and is generally set to 1. min (・) is a function that outputs a T count when the input value is larger than the T count , and outputs the value as it is when the input value is smaller than the T count . Equation (14) is a loss function that prompts the number of memory addresses used in the external memory to match the total number of speakers I.
 学習部20は、最終的には、全ての損失関数の値を合算した式(15)の値が小さくなるようにニューラルネットワークのパラメータを学習する。 Finally, the learning unit 20 learns the parameters of the neural network so that the value of the equation (15), which is the sum of the values of all the loss functions, becomes small.
[数15] (Equation (15))
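A sketch of how the losses of Eqs. (12)-(15) could be combined, assuming a plain squared-error signal loss, a binary cross-entropy on the stop flag, and a clipped slot-usage penalty; the exact forms, any permutation handling, and weighting are not reproduced from the embodiment and are assumptions.

import numpy as np

def total_loss(S_hat, S, f_hat, f, slot_usage, I_total, t_count=1.0):
    l_signal = np.mean((S_hat - S) ** 2)                                   # Eq. (12), squared error
    l_stop = -np.mean(f * np.log(f_hat + 1e-8)
                      + (1 - f) * np.log(1 - f_hat + 1e-8))                # Eq. (13), cross entropy
    used = np.sum(np.minimum(slot_usage, t_count))                         # min(., T_count) as in Eq. (14)
    l_count = (used - I_total) ** 2                                        # push used slot count toward I
    return l_signal + l_stop + l_count                                     # Eq. (15): sum of all losses

loss = total_loss(np.zeros((160, 64)), np.zeros((160, 64)),
                  np.array([0.2, 0.9]), np.array([0.0, 1.0]),
                  slot_usage=np.array([1.2, 0.8, 0.0]), I_total=2)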
 なお、学習部20は、発話区間検出部17を有する信号処理装置10B、音声認識部18を有する信号処理装置10C、信号処理装置10に対しては、具備する処理部に応じた損失関数を用いて、ニューラルネットワークのパラメータを最適化すればよい。発話区間検出部17を有する信号処理装置10Bについては参考文献6を、音声認識部18を有する信号処理装置10Cについては参考文献7を参照されたい。
参考文献6:Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, Shinji Watanabe, “End-to-End Neural Speaker Diarization with Permutation-free Objectives”, Proc. Interspeech, pp. 4300-4304, 2019.
参考文献7:Shigeki Karita et al., “A comparative study on transformer vs RNN in speech applications”, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
For the signal processing device 10B having the utterance section detection unit 17, the signal processing device 10C having the voice recognition unit 18, and the signal processing device 10, the learning unit 20 may optimize the parameters of the neural networks using a loss function corresponding to the processing unit(s) provided. See Reference 6 for the signal processing device 10B having the utterance section detection unit 17, and Reference 7 for the signal processing device 10C having the voice recognition unit 18.
Reference 6: Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, Shinji Watanabe, “End-to-End Neural Speaker Diarization with Permutation-free Objectives”, Proc. Interspeech, pp. 4300-4304, 2019.
Reference 7: Shigeki Karita et al., “A comparative study on transformer vs RNN in speech applications”, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
[信号処理の処理手順]
 次に、本実施形態に係る信号処理方法の処理手順について説明する。図5は、実施形態に係る信号処理方法の処理手順を示すフローチャートである。
[Signal processing procedure]
Next, the processing procedure of the signal processing method according to the present embodiment will be described. FIG. 5 is a flowchart showing a processing procedure of the signal processing method according to the embodiment.
 図5に示すように、信号処理装置10は、音響信号の入力を受け付けると(ステップS1)、音響信号を時間ブロックに分割する(ステップS2)。そして、信号処理装置10は、時間ブロックの数bをb=1に初期化する(ステップS3)。 As shown in FIG. 5, when the signal processing device 10 receives the input of the acoustic signal (step S1), the signal processing device 10 divides the acoustic signal into time blocks (step S2). Then, the signal processing device 10 initializes the number b of the time blocks to b = 1 (step S3).
 時間ブロックbの音響信号(観測信号)が話者特徴ベクトル抽出部11に入力されると、話者特徴ベクトル抽出部11は、時間ブロックbの観測信号から、この時間ブロックbに存在するある一人の話者の話者特徴ベクトルを推定し、抽出する(ステップS4)。 When the acoustic signal (observation signal) of time block b is input to the speaker feature vector extraction unit 11, the speaker feature vector extraction unit 11 estimates and extracts, from the observation signal of time block b, the speaker feature vector of one speaker present in this time block b (step S4).
 そして、メモリ制御指示部13は、話者特徴ベクトル抽出部11が抽出した話者特徴ベクトルが、今までの時間ブロックには出現したことのない話者の話者特徴量ベクトルであるか否かを判定する(ステップS5)。 Then, the memory control instruction unit 13 determines whether the speaker feature vector extracted by the speaker feature vector extraction unit 11 is that of a speaker who has not appeared in any previous time block (step S5).
 今までの時間ブロックには出現したことのない話者の話者特徴量ベクトルである場合(ステップS5:Yes)について説明する。この場合、メモリ制御指示部13は、外部メモリ部12の未使用のメモリスロットに、この話者特徴ベクトルを書き込むことを指示し、外部メモリ書き込み部14は、外部メモリ部12の未使用のメモリスロットに話者特徴ベクトルを書き込む(ステップS6)。 The case where the vector is that of a speaker who has not appeared in any previous time block (step S5: Yes) is described first. In this case, the memory control instruction unit 13 instructs that this speaker feature vector be written to an unused memory slot of the external memory unit 12, and the external memory writing unit 14 writes the speaker feature vector to an unused memory slot of the external memory unit 12 (step S6).
 今までの時間ブロックに出現したことのある話者の話者特徴量ベクトルである場合(ステップS5:No)について説明する。この場合、メモリ制御指示部13は、外部メモリ部12のこの話者に対応するメモリスロットに、話者特徴ベクトルを書き込むことを指示し、外部メモリ書き込み部14は、外部メモリ部12のこの話者に対応するメモリスロットに話者特徴ベクトルを書き込む(ステップS7)。 The case where the vector is that of a speaker who has already appeared in a previous time block (step S5: No) is described next. In this case, the memory control instruction unit 13 instructs that the speaker feature vector be written to the memory slot of the external memory unit 12 corresponding to this speaker, and the external memory writing unit 14 writes the speaker feature vector to the memory slot of the external memory unit 12 corresponding to this speaker (step S7).
 そして、外部メモリ読み込み部15は、メモリ制御指示部13による指示にしたがって、外部メモリ部12から、話者特徴ベクトル抽出部11が抽出した一人の話者に対応する話者特徴ベクトルを読み込み(ステップS8)、メモリ制御指示部13に出力する。 Then, in accordance with the instruction from the memory control instruction unit 13, the external memory reading unit 15 reads, from the external memory unit 12, the speaker feature vector corresponding to the one speaker extracted by the speaker feature vector extraction unit 11 (step S8), and outputs it to the memory control instruction unit 13.
 そして、音源抽出部16は、観測信号と、メモリ制御指示部13から出力された一人の話者に対応する話者特徴ベクトルに基づき、この話者特徴ベクトルに対応する話者の音声を観測信号から抽出する(ステップS9)。 Then, based on the observation signal and the speaker feature vector corresponding to the one speaker output from the memory control instruction unit 13, the sound source extraction unit 16 extracts, from the observation signal, the voice of the speaker corresponding to this speaker feature vector (step S9).
 また、発話区間検出部17は、観測信号と、メモリ制御指示部13から出力された一人の話者に対応する話者特徴ベクトルに基づき、この話者特徴ベクトルに対応する話者の発話区間検出結果を出力する(ステップS10)。 Further, based on the observation signal and the speaker feature vector corresponding to the one speaker output from the memory control instruction unit 13, the utterance section detection unit 17 outputs an utterance section detection result for the speaker corresponding to this speaker feature vector (step S10).
 そして、音声認識部18は、観測信号と、メモリ制御指示部13から出力された一人の話者に対応する話者特徴ベクトルに基づき、この話者特徴ベクトルに対応する話者の音声認識結果を出力する(ステップS11)。 Then, based on the observation signal and the speaker feature vector corresponding to the one speaker output from the memory control instruction unit 13, the voice recognition unit 18 outputs a speech recognition result for the speaker corresponding to this speaker feature vector (step S11).
 ステップS9~ステップS11は、図5に示すように、並行に処理されるほか、直列に処理されてもよい。直列に処理する場合、処理の順は、特に限定しない。また、信号処理装置10A~10Cの場合には、具備する音声処理機能部に応じた処理を実行すればよい。例えば、信号処理装置10Aの場合には、音源抽出部16による音源抽出処理(ステップS9)を実行し、ステップS12に進む。 As shown in FIG. 5, steps S9 to S11 may be processed in parallel or in series. When processing in series, the order of processing is not particularly limited. Further, in the case of the signal processing devices 10A to 10C, processing may be executed according to the voice processing function unit provided. For example, in the case of the signal processing device 10A, the sound source extraction process (step S9) by the sound source extraction unit 16 is executed, and the process proceeds to step S12.
 そして、繰り返し制御部19は、音源抽出部16、発話区間検出部17及び音声認識部18の処理結果を基に、繰り返しを停止するか否かを判定する(ステップS12)。なお、繰り返し制御部19は、話者特徴ベクトル抽出部11の抽出処理の状態を基に、繰り返しを停止するか否かを判定してもよい。 Then, the repetition control unit 19 determines whether or not to stop the repetition based on the processing results of the sound source extraction unit 16, the utterance section detection unit 17, and the voice recognition unit 18 (step S12). The repetition control unit 19 may determine whether or not to stop the repetition based on the extraction processing state of the speaker feature vector extraction unit 11.
 繰り返し制御部19は、繰り返しを停止しないと判定した場合(ステップS12:No)、この時間ブロックbにおける次の話者に関する処理を進めるため、ステップS4に戻り、次の話者に対する話者特徴ベクトルの抽出を行う。 When the repetition control unit 19 determines that the repetition should not be stopped (step S12: No), the process returns to step S4 in order to proceed with processing for the next speaker in this time block b, and the speaker feature vector for the next speaker is extracted.
 また、繰り返し制御部19は、繰り返しを停止すると判定した場合(ステップS12:Yes)、この時間ブロックbに対する処理結果を出力する(ステップS13)。なお、信号処理装置10の例であれば、出力結果は、音源抽出結果、発話区間検出結果、音声認識結果である。また、信号処理装置10は、全ての時間ブロックの処理結果をまとめて出力してもよい。 Further, when the repetition control unit 19 determines that the repetition is stopped (step S12: Yes), the repetition control unit 19 outputs the processing result for this time block b (step S13). In the case of the signal processing device 10, the output results are the sound source extraction result, the utterance section detection result, and the voice recognition result. Further, the signal processing device 10 may collectively output the processing results of all the time blocks.
 そして、信号処理装置10は、全ての時間ブロックの処理を終了したか否かを判定する(ステップS14)。信号処理装置10は、全ての時間ブロックの処理を終了した場合(ステップS14:Yes)、入力された音響信号に対する処理を終了する。また、信号処理装置10は、全ての時間ブロックの処理を終了していない場合(ステップS14:No)、次の時間ブロックに対する処理を行うため、時間ブロックbに1を加算し(ステップS15)、ステップS4に戻り、処理を続ける。 Then, the signal processing device 10 determines whether or not the processing of all the time blocks is completed (step S14). When the processing of all the time blocks is completed (step S14: Yes), the signal processing device 10 ends the processing for the input acoustic signal. Further, when the signal processing device 10 has not completed the processing of all the time blocks (step S14: No), 1 is added to the time block b in order to perform the processing for the next time block (step S15). The process returns to step S4 and the process is continued.
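The flow of FIG. 5 (steps S2 to S15) can be summarized in one block-processing loop. This is a structural sketch only: the callables stand in for the components sketched above (extraction, addressing, memory write/read, the extraction/detection/recognition processing, and the stop decision), and no batching or error handling is shown.

def process_signal(blocks, nn_embed, h0, address, write, read, process, stop_fn):
    results = []
    for X_b in blocks:                        # S2/S3 and S14/S15: loop over the time blocks
        h, block_out = h0, []
        while True:
            a, h = nn_embed(X_b, h)           # S4: extract one speaker's feature vector
            w = address(a)                    # S5: known speaker -> its slot, new speaker -> free slot
            write(w, a)                       # S6/S7: write to the addressed slot
            r = read(w)                       # S8: read back the (updated) speaker feature
            block_out.append(process(X_b, r)) # S9 (S10/S11 would be analogous branches)
            if stop_fn(h, X_b, block_out[-1]):# S12: repetition control
                break
        results.append(block_out)             # S13: output the results for this block
    return results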
[実施形態の効果]
 実施形態に係る信号処理装置10は、時間ブロックごとに話者特徴ベクトルを繰り返し抽出し、外部メモリ部12に書き込みを行う。この際、信号処理装置10は、今までの時間ブロックに出現したことのない話者の話者特徴量ベクトルが抽出された場合には外部メモリ部12の未使用のメモリスロットに話者特徴ベクトルを書き込む。信号処理装置10は、今までの時間ブロックに既に出現したことのある話者の話者特徴量ベクトルが抽出された場合には外部メモリ部12のこの話者に対応するメモリスロットに話者特徴ベクトルを書き込む。
[Effect of embodiment]
The signal processing device 10 according to the embodiment repeatedly extracts speaker feature vectors for each time block and writes them to the external memory unit 12. In doing so, when a speaker feature vector of a speaker who has not appeared in any previous time block is extracted, the signal processing device 10 writes the speaker feature vector to an unused memory slot of the external memory unit 12. When a speaker feature vector of a speaker who has already appeared in a previous time block is extracted, the signal processing device 10 writes the speaker feature vector to the memory slot of the external memory unit 12 corresponding to this speaker.
 したがって、信号処理装置10は、発話をしていない話者(サイレント話者)については、抽出処理自体を実行しない。このため、信号処理装置10は、従来のように、時間ブロックごとにサイレント話者について音源抽出を試みる必要がないため、従来と比して処理量を削減できるとともに、真に発話を行っている話者のみに対する処理を適切に行えるため、処理精度を向上することができる。 Therefore, the signal processing device 10 does not execute the extraction process itself for speakers who are not speaking (silent speakers). Because the signal processing device 10 does not need to attempt sound source extraction for silent speakers in every time block as in the conventional approach, the amount of processing can be reduced compared with the conventional approach, and, since processing can be appropriately performed only for speakers who are actually speaking, the processing accuracy can be improved.
 また、信号処理装置10では、時間ブロックごとに話者特徴ベクトルの抽出を行うため、従来のように全ての時間ブロックにおいて同じ順で話者特徴ベクトルを抽出する必要がないため、処理の最適性を損なうこともない。 Furthermore, since the signal processing device 10 extracts speaker feature vectors for each time block, it does not need to extract the speaker feature vectors in the same order in all time blocks as in the conventional approach, and therefore the optimality of the processing is not impaired.
 このように、信号処理装置10は、処理量を削減しながら、音響信号に対する処理精度を向上させることができる。 In this way, the signal processing device 10 can improve the processing accuracy for the acoustic signal while reducing the processing amount.
[プログラム]
 図6は、プログラムが実行されることにより、信号処理装置10が実現されるコンピュータの一例を示す図である。コンピュータ1000は、例えば、メモリ1010、CPU1020を有する。また、コンピュータ1000は、ハードディスクドライブインタフェース1030、ディスクドライブインタフェース1040、シリアルポートインタフェース1050、ビデオアダプタ1060、ネットワークインタフェース1070を有する。これらの各部は、バス1080によって接続される。
[program]
FIG. 6 is a diagram showing an example of a computer in which the signal processing device 10 is realized by executing a program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these parts is connected by a bus 1080.
 メモリ1010は、ROM(Read Only Memory)1011及びRAM1012を含む。ROM1011は、例えば、BIOS(Basic Input Output System)等のブートプログラムを記憶する。ハードディスクドライブインタフェース1030は、ハードディスクドライブ1090に接続される。ディスクドライブインタフェース1040は、ディスクドライブ1100に接続される。例えば磁気ディスクや光ディスク等の着脱可能な記憶媒体が、ディスクドライブ1100に挿入される。シリアルポートインタフェース1050は、例えばマウス1110、キーボード1120に接続される。ビデオアダプタ1060は、例えばディスプレイ1130に接続される。 The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to the hard disk drive 1090. The disk drive interface 1040 is connected to the disk drive 1100. For example, a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, the display 1130.
 ハードディスクドライブ1090は、例えば、OS(Operating System)1091、アプリケーションプログラム1092、プログラムモジュール1093、プログラムデータ1094を記憶する。すなわち、信号処理装置10の各処理を規定するプログラムは、コンピュータにより実行可能なコードが記述されたプログラムモジュール1093として実装される。プログラムモジュール1093は、例えばハードディスクドライブ1090に記憶される。例えば、信号処理装置10における機能構成と同様の処理を実行するためのプログラムモジュール1093が、ハードディスクドライブ1090に記憶される。なお、ハードディスクドライブ1090は、SSD(Solid State Drive)により代替されてもよい。 The hard disk drive 1090 stores, for example, an OS (Operating System) 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the signal processing device 10 is implemented as a program module 1093 in which a code that can be executed by a computer is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, a program module 1093 for executing processing similar to the functional configuration in the signal processing device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
 また、上述した実施形態の処理で用いられる設定データは、プログラムデータ1094として、例えばメモリ1010やハードディスクドライブ1090に記憶される。そして、CPU1020が、メモリ1010やハードディスクドライブ1090に記憶されたプログラムモジュール1093やプログラムデータ1094を必要に応じてRAM1012に読み出して実行する。 Further, the setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, a memory 1010 or a hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 into the RAM 1012 and executes them as needed.
 なお、プログラムモジュール1093やプログラムデータ1094は、ハードディスクドライブ1090に記憶される場合に限らず、例えば着脱可能な記憶媒体に記憶され、ディスクドライブ1100等を介してCPU1020によって読み出されてもよい。あるいは、プログラムモジュール1093及びプログラムデータ1094は、ネットワーク(LAN(Local Area Network)、WAN(Wide Area Network)等)を介して接続された他のコンピュータに記憶されてもよい。そして、プログラムモジュール1093及びプログラムデータ1094は、他のコンピュータから、ネットワークインタフェース1070を介してCPU1020によって読み出されてもよい。 The program module 1093 and the program data 1094 are not limited to those stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Then, the program module 1093 and the program data 1094 may be read from another computer by the CPU 1020 via the network interface 1070.
 以上、本発明者によってなされた発明を適用した実施形態について説明したが、本実施形態による本発明の開示の一部をなす記述及び図面により本発明は限定されることはない。すなわち、本実施形態に基づいて当業者等によりなされる他の実施形態、実施例及び運用技術等は全て本発明の範疇に含まれる。 Although the embodiment to which the invention made by the present inventor is applied has been described above, the present invention is not limited by the description and the drawings which form a part of the disclosure of the present invention according to the present embodiment. That is, other embodiments, examples, operational techniques, and the like made by those skilled in the art based on the present embodiment are all included in the scope of the present invention.
 10,10A,10B,10C 信号処理装置
 11 話者特徴ベクトル抽出部
 12 外部メモリ部
 13 メモリ制御指示部
 14 外部メモリ書き込み部
 15 外部メモリ読み込み部
 16 音源抽出部
 17 発話区間検出部
 18 音声認識部
 19 繰り返し制御部
 20 学習部
10, 10A, 10B, 10C Signal processing device 11 Speaker feature vector extraction unit 12 External memory unit 13 Memory control instruction unit 14 External memory writing unit 15 External memory reading unit 16 Sound source extraction unit 17 Speech section detection unit 18 Voice recognition unit 19 Repeat control unit 20 Learning unit

Claims (8)

  1.  入力された音響信号に対して、時間ブロックごとに話者特徴ベクトルを、該時間ブロックに存在する話者の数の分だけ繰り返し抽出する抽出部と、
     前記抽出部によって抽出された話者特徴ベクトルを格納する外部メモリ部と、
     今までの時間ブロックに出現したことのない話者の話者特徴量ベクトルが前記抽出部によって抽出された場合には前記外部メモリ部の未使用のメモリスロットに前記話者特徴ベクトルを書き込むことを指示し、今までの時間ブロックに既に出現したことのある話者の話者特徴量ベクトルが前記抽出部によって抽出された場合には前記外部メモリ部の前記既に出現したことのある話者に対応するメモリスロットに前記話者特徴ベクトルを書き込むことを指示し、前記外部メモリ部からの話者特徴ベクトルの読み込みを指示する指示部と、
     前記指示部から指示を受け、前記外部メモリ部への話者特徴ベクトルの書き込みを行う書き込み部と、
     前記指示部から指示を受け、前記外部メモリ部から話者特徴ベクトルの読み込みを行う読み込み部と、
     前記読み込み部によって読み込まれた話者特徴ベクトルに基づき信号処理を実行する処理部と、
     を有することを特徴とする信号処理装置。
    An extraction unit that repeatedly extracts the speaker feature vector for each time block for the input acoustic signal by the number of speakers existing in the time block.
    An external memory unit that stores the speaker feature vector extracted by the extraction unit, and
    An instruction unit that, when a speaker feature vector of a speaker who has not appeared in any previous time block is extracted by the extraction unit, instructs writing of the speaker feature vector to an unused memory slot of the external memory unit, that, when a speaker feature vector of a speaker who has already appeared in a previous time block is extracted by the extraction unit, instructs writing of the speaker feature vector to the memory slot of the external memory unit corresponding to the speaker who has already appeared, and that instructs reading of the speaker feature vector from the external memory unit.
    A writing unit that receives an instruction from the instruction unit and writes the speaker feature vector to the external memory unit.
    A reading unit that receives an instruction from the instruction unit and reads the speaker feature vector from the external memory unit.
    A processing unit that executes signal processing based on the speaker feature vector read by the reading unit.
    A signal processing device characterized by having the above units.
  2.  前記抽出部の抽出処理の状態または前記処理部による処理結果を基に、前記抽出部による抽出処理の繰り返し数を決定することを特徴とする請求項1に記載の信号処理装置。 The signal processing apparatus according to claim 1, wherein the number of repetitions of the extraction process by the extraction unit is determined based on the state of the extraction process of the extraction unit or the processing result by the processing unit.
  3.  前記抽出部、前記指示部及び前記処理部が用いるパラメータを所定の目的関数に基づき最適化する学習部をさらに有することを特徴とする請求項1または2に記載の信号処理装置。 The signal processing apparatus according to claim 1 or 2, further comprising a learning unit that optimizes the parameters used by the extraction unit, the instruction unit, and the processing unit based on a predetermined objective function.
  4.  前記処理部は、前記話者特徴ベクトルに基づき、前記話者特徴ベクトルに対応する話者の音声を前記音響信号から抽出する音源抽出部であることを特徴とする請求項1~3のいずれか一つに記載の信号処理装置。 One of claims 1 to 3, wherein the processing unit is a sound source extraction unit that extracts a speaker's voice corresponding to the speaker feature vector from the acoustic signal based on the speaker feature vector. The signal processing device according to one.
  5.  前記処理部は、前記話者特徴ベクトルに基づき、前記話者特徴ベクトルに対応する話者の音声認識結果を出力する音源抽出部であることを特徴とする請求項1~4のいずれか一つに記載の信号処理装置。 One of claims 1 to 4, wherein the processing unit is a sound source extraction unit that outputs a voice recognition result of a speaker corresponding to the speaker feature vector based on the speaker feature vector. The signal processing device according to.
  6.  前記処理部は、前記話者特徴ベクトルに基づき、前記話者特徴ベクトルに対応する話者の発話区間検出結果を出力する発話区間検出部であることを特徴とする請求項1~5のいずれか一つに記載の信号処理装置。 One of claims 1 to 5, wherein the processing unit is an utterance section detection unit that outputs a speaker's utterance section detection result corresponding to the speaker feature vector based on the speaker feature vector. The signal processing device according to one.
  7.  信号処理装置が実行する信号処理方法であって、
     前記信号処理装置は、データを格納する外部メモリを有し、
     入力された音響信号に対して、時間ブロックごとに話者特徴ベクトルを、該時間ブロックに存在する話者の数の分だけ繰り返し抽出する抽出工程と、
     前記抽出工程において抽出された話者特徴ベクトルを格納する外部メモリと、
     今までの時間ブロックに出現したことのない話者の話者特徴量ベクトルが前記抽出工程において抽出された場合には、前記外部メモリの未使用のメモリスロットに前記話者特徴ベクトルを書き込むことを指示し、今までの時間ブロックに既に出現したことのある話者の話者特徴量ベクトルが前記抽出工程において抽出された場合には、前記外部メモリの前記既に出現したことのある話者に対応するメモリスロットに前記話者特徴ベクトルを書き込むことを指示する指示工程と、
     前記指示工程における指示を受け、前記外部メモリへの話者特徴ベクトルの書き込みを行う書き込み工程と、
     前記指示工程における指示を受け、前記外部メモリから話者特徴ベクトルの読み込みを行う読み込み工程と、
     前記読み込み工程において読み込まれた話者特徴ベクトルに基づき信号処理を実行する処理工程と、
     を含んだことを特徴とする信号処理方法。
    It is a signal processing method executed by a signal processing device.
    The signal processing device has an external memory for storing data and has an external memory.
    An extraction process in which speaker feature vectors are repeatedly extracted for each time block for the input acoustic signal by the number of speakers existing in the time block.
    An external memory that stores the speaker feature vector extracted in the extraction step, and
    An instruction step of, when a speaker feature vector of a speaker who has not appeared in any previous time block is extracted in the extraction step, instructing writing of the speaker feature vector to an unused memory slot of the external memory, and, when a speaker feature vector of a speaker who has already appeared in a previous time block is extracted in the extraction step, instructing writing of the speaker feature vector to the memory slot of the external memory corresponding to the speaker who has already appeared, and
    A writing step of receiving an instruction in the instruction step and writing the speaker feature vector to the external memory.
    A reading process in which the speaker feature vector is read from the external memory in response to the instruction in the instruction process, and
    A processing step of executing signal processing based on the speaker feature vector read in the reading step, and
    A signal processing method characterized by including the above steps.
  8.  コンピュータを、請求項1~6のいずれか一つに記載の信号処理装置として機能させるための信号処理プログラム。 A signal processing program for making a computer function as the signal processing device according to any one of claims 1 to 6.
PCT/JP2020/049247 2020-12-28 2020-12-28 Signal processing device, signal processing method, and signal processing program WO2022145015A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2020/049247 WO2022145015A1 (en) 2020-12-28 2020-12-28 Signal processing device, signal processing method, and signal processing program
JP2022572857A JPWO2022145015A1 (en) 2020-12-28 2020-12-28

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/049247 WO2022145015A1 (en) 2020-12-28 2020-12-28 Signal processing device, signal processing method, and signal processing program

Publications (1)

Publication Number Publication Date
WO2022145015A1 true WO2022145015A1 (en) 2022-07-07

Family

ID=82259163

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/049247 WO2022145015A1 (en) 2020-12-28 2020-12-28 Signal processing device, signal processing method, and signal processing program

Country Status (2)

Country Link
JP (1) JPWO2022145015A1 (en)
WO (1) WO2022145015A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007233239A (en) * 2006-03-03 2007-09-13 National Institute Of Advanced Industrial & Technology Method, system, and program for utterance event separation
JP2020013034A (en) * 2018-07-19 2020-01-23 株式会社日立製作所 Voice recognition device and voice recognition method
WO2020039571A1 (en) * 2018-08-24 2020-02-27 三菱電機株式会社 Voice separation device, voice separation method, voice separation program, and voice separation system
JP2020134657A (en) * 2019-02-18 2020-08-31 日本電信電話株式会社 Signal processing device, learning device, signal processing method, learning method and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NEUMANN THILO VON; KINOSHITA KEISUKE; DELCROIX MARC; ARAKI SHOKO; NAKATANI TOMOHIRO; HAEB-UMBACH REINHOLD: "All-neural Online Source Separation, Counting, and Diarization for Meeting Analysis", ICASSP 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 12 May 2019 (2019-05-12), pages 91 - 95, XP033565103, DOI: 10.1109/ICASSP.2019.8682572 *

Also Published As

Publication number Publication date
JPWO2022145015A1 (en) 2022-07-07

Similar Documents

Publication Publication Date Title
Kreuk et al. Fooling end-to-end speaker verification with adversarial examples
EP2943951B1 (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
WO2020228173A1 (en) Illegal speech detection method, apparatus and device and computer-readable storage medium
JP2017097162A (en) Keyword detection device, keyword detection method and computer program for keyword detection
WO2016095218A1 (en) Speaker identification using spatial information
US20140350934A1 (en) Systems and Methods for Voice Identification
WO2019151507A1 (en) Learning device, learning method and learning program
Wang et al. Recurrent deep stacking networks for supervised speech separation
US10089977B2 (en) Method for system combination in an audio analytics application
CN110335608B (en) Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium
US20100324893A1 (en) System and method for improving robustness of speech recognition using vocal tract length normalization codebooks
JP2020042257A (en) Voice recognition method and device
JP6985221B2 (en) Speech recognition device and speech recognition method
JP2023539948A (en) Long context end-to-end speech recognition system
KR20210141115A (en) Method and apparatus for estimating utterance time
Qian et al. Noise robust speech recognition on aurora4 by humans and machines
CN112489623A (en) Language identification model training method, language identification method and related equipment
JP2010078877A (en) Speech recognition device, speech recognition method, and speech recognition program
KR101122591B1 (en) Apparatus and method for speech recognition by keyword recognition
CN108847251B (en) Voice duplicate removal method, device, server and storage medium
JP5670298B2 (en) Noise suppression device, method and program
Broughton et al. Improving end-to-end neural diarization using conversational summary representations
WO2022145015A1 (en) Signal processing device, signal processing method, and signal processing program
JP2012063611A (en) Voice recognition result search device, voice recognition result search method, and voice recognition result search program
KR20130126570A (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20968026

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022572857

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20968026

Country of ref document: EP

Kind code of ref document: A1