CN112420021A - Learning method, speaker recognition method, and recording medium

Learning method, speaker recognition method, and recording medium

Info

Publication number
CN112420021A
CN112420021A
Authority
CN
China
Prior art keywords
speaker
data
voice
voice data
learning
Prior art date
Legal status
Pending
Application number
CN202010829027.7A
Other languages
Chinese (zh)
Inventor
土井美沙贵
釜井孝浩
板仓光佑
Current Assignee
Panasonic Intellectual Property Corp of America
Original Assignee
Panasonic Intellectual Property Corp of America
Priority date
Filing date
Publication date
Application filed by Panasonic Intellectual Property Corp of America
Publication of CN112420021A

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 — Training
    • G10L2015/0631 — Creating reference templates; Clustering

Abstract

The problem addressed by the present invention is the need for high-accuracy speaker recognition. Provided are a learning method, a speaker recognition method, and a recording medium. The learning method is a learning method of a speaker recognition model (20) that, when voice data is input, outputs speaker recognition information for recognizing a speaker of an utterance included in the voice data. In the learning method, voice feature conversion processing is performed on first voice data of a first speaker to generate second voice data of a second speaker, and learning processing of the speaker recognition model (20) is performed using the first voice data and the second voice data as learning data.

Description

Learning method, speaker recognition method, and recording medium
Technical Field
The present disclosure relates to techniques for identifying a speaker.
Background
Conventionally, a technique for recognizing a speaker using a speaker recognition model is known (for example, see non-patent document 1).
Documents of the prior art
Non-patent document
Non-patent document 1: david Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, Sanjeev Khudanpur, "X-VECTORS: ROBUST DNN EMBEDDING FOR SPEAKER RECOGNITION "ICASSP 2018: 5329-5333.
Disclosure of Invention
Problems to be solved by the invention
It is desirable to identify the speaker with high accuracy.
Means for solving the problems
A learning method according to an aspect of the present disclosure is a learning method of a speaker recognition model that outputs speaker recognition information for recognizing a speaker who utters a voice included in voice data when the voice data is input, wherein first voice data of a first speaker is subjected to voice feature conversion processing to generate second voice data of a second speaker, and the first voice data and the second voice data are used as learning data to perform the learning processing of the speaker recognition model.
A speaker recognition method according to an aspect of the present disclosure inputs voice data to the speaker recognition model that has been previously subjected to a learning process by the learning method, and causes the speaker recognition model to output the speaker recognition information.
A recording medium according to an aspect of the present disclosure is a computer-readable recording medium having a program recorded thereon, the program causing a computer to execute a process of learning a speaker recognition model that, when voice data is input, outputs speaker recognition information for recognizing a speaker of an utterance included in the voice data, the process including: a first step of generating second voice data of a second speaker by performing voice feature conversion processing on first voice data of a first speaker; and a second step of performing learning processing of the speaker recognition model using the first voice data and the second voice data as learning data.
These overall and specific aspects may be realized by a system, a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or may be realized by any combination of a system, a method, an integrated circuit, a computer program, and a recording medium.
Effects of the invention
According to the learning method and the like of the present disclosure, a speaker can be recognized with high accuracy.
Drawings
Fig. 1 is a block diagram showing a configuration example of a speaker recognition apparatus according to an embodiment.
Fig. 2 is a schematic diagram showing an example of a case where the voice data holding unit of the embodiment stores voice data and speaker identification information in association with each other.
Fig. 3 is a schematic diagram showing a case where the voice feature conversion unit of the embodiment converts voice data of one speaker into voice data of a plurality of other speakers and outputs the converted voice data.
Fig. 4 is a block diagram showing a configuration example of the sound characteristic conversion unit according to the embodiment.
Fig. 5 is a flowchart of speaker recognition model learning processing according to the embodiment.
Fig. 6 is a flowchart of the voice characteristic transformation model learning process according to the embodiment.
Fig. 7 is a flowchart of speaker recognition processing of the embodiment.
Description of the reference numerals
1 speaker recognition device
10 voice data expansion unit
11 voice data holding unit
12 first voice data acquisition unit
13 voice feature conversion unit
14 noise reverberation imparting unit
15 first feature amount calculation unit
16 comparison unit
17 voice data storage unit
18 extended voice data holding unit
20 speaker recognition model
21 third feature amount calculation unit
22 deep neural network
23 determination unit
30 learning unit
31 second voice data acquisition unit
32 second feature amount calculation unit
33 first learning unit
40 recognition target voice data acquisition unit
131 voice feature conversion learning data holding unit
132 second learning unit
133 voice feature conversion model
Detailed Description
(Underlying knowledge forming the basis of the present disclosure)
A speaker recognition technique is known that recognizes a speaker using a speaker recognition model that performs a learning process in advance using, as learning data, voice data associated with recognition information for recognizing the speaker.
Conventionally, in order to increase the number of pieces of learning data (hereinafter, "increasing the number of pieces of learning data" is also referred to as "extension of the learning data"), noise addition, reverberation addition, and the like have been performed on the original learning-use voice data. However, extension of the learning data by such conventional noise addition, reverberation addition, and the like cannot increase the variety of utterance content or of language (Japanese, English, and the like) available for one speaker. Therefore, the influence of the utterance content and the language on the learning process of the speaker recognition model may not be sufficiently reduced.
Therefore, the inventors have made extensive studies and experiments to identify a speaker with high accuracy in speaker recognition using a speaker recognition model. As a result, the inventors have conceived the following learning method and the like.
A learning method according to an aspect of the present disclosure is a learning method of a speaker recognition model that outputs speaker recognition information for recognizing a speaker who utters a voice included in voice data when the voice data is input, wherein first voice data of a first speaker is subjected to voice feature conversion processing to generate second voice data of a second speaker, and the first voice data and the second voice data are used as learning data to perform the learning processing of the speaker recognition model.
According to the above learning method, the number of the voice data of the second speaker can be increased without being limited by the utterance content and the language in the extension of the learning data in the learning process of the speaker recognition model. Therefore, the accuracy of speaker recognition by the speaker recognition model can be improved.
Therefore, according to the learning method, the speaker can be recognized with high accuracy.
In addition, the voice characteristic conversion process may be a process based on the voice data of the first speaker and the voice data of the second speaker.
The voice feature conversion process may include a process of inputting the first voice data to a voice feature conversion model and outputting the second voice data from the voice feature conversion model, and the voice feature conversion model may be previously subjected to a learning process so that when the voice data of the first speaker is input, the voice data of the second speaker is output.
The voice feature transformation model may include a deep neural network that receives the voice data in the WAV format as an input and outputs the voice data in the WAV format as an output.
In addition, the voice characteristic conversion process may be a process based on the voice data of the first speaker and the voice data of the third speaker.
The speaker recognition model may include a deep neural network that receives an utterance feature quantity indicating a feature of an utterance included in the speech data as an input and outputs a speaker-specific feature quantity indicating a feature of a speaker.
A speaker recognition method according to an aspect of the present disclosure inputs speech data to the speaker recognition model that has been previously subjected to a learning process by the learning method, and causes the speaker recognition model to output the speaker recognition information.
According to the above speaker recognition method, in the extension of the learning data in the learning process of the speaker recognition model, the number of the voice data of the second speaker can be increased without being limited by the utterance content and the language. Therefore, the accuracy of speaker recognition by the speaker recognition model can be improved.
Therefore, according to the speaker recognition method, the speaker can be recognized with high accuracy.
A recording medium according to an aspect of the present disclosure is a computer-readable recording medium having a program recorded thereon, the program causing a computer to execute a process of learning a speaker recognition model that, when voice data is input, outputs speaker recognition information for recognizing a speaker of an utterance included in the voice data, the process including: a first step of generating second voice data of a second speaker by performing voice feature conversion processing on first voice data of a first speaker; and a second step of performing learning processing of the speaker recognition model using the first voice data and the second voice data as learning data.
According to the above-described recording medium, in the extension of the learning data in the learning process of the speaker recognition model, the number of the voice data of the second speaker can be increased without being limited by the utterance content and the language. Therefore, the accuracy of speaker recognition by the speaker recognition model can be improved.
Therefore, according to the recording medium, a speaker can be recognized with high accuracy.
The general or specific aspects can be realized by a system, a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or any combination of a system, a method, an integrated circuit, a computer program, and a recording medium.
Hereinafter, embodiments of the present disclosure will be described with reference to the drawings. The embodiments described below each show a specific example of the present disclosure. The numerical values, shapes, constituent elements, steps, and order of the steps shown in the following embodiments are examples, and are not intended to limit the present disclosure. The contents of the embodiments may also be combined with one another.
(embodiment mode)
Hereinafter, a speaker recognition apparatus according to an embodiment will be described. The speaker recognition device acquires voice data and outputs recognition information for recognizing a speaker who uttered a voice included in the voice data.
< Structure >
Fig. 1 is a block diagram showing a configuration example of a speaker recognition apparatus 1 according to the embodiment.
As shown in fig. 1, the speaker recognition device 1 includes a voice data expansion unit 10, a speaker recognition model 20, a learning unit 30, and a recognition target voice data acquisition unit 40.
The voice data expansion unit 10 extends the learning data used for the learning process of the speaker recognition model 20 (i.e., increases the number of pieces of learning data). The voice data expansion unit 10 may be implemented by a computer having a microprocessor, a memory, a communication interface, and the like. In this case, the various functions of the voice data expansion unit 10 are realized by the microprocessor executing a program stored in the memory. The voice data expansion unit 10 may also be realized by, for example, distributed computing or cloud computing performed by a plurality of computers communicating with each other.
As shown in fig. 1, the voice data expansion unit 10 includes a voice data holding unit 11, a first voice data acquisition unit 12, a voice feature conversion unit 13, a noise reverberation imparting unit 14, a first feature amount calculation unit 15, a comparison unit 16, a voice data storage unit 17, and an extended voice data holding unit 18.
The learning unit 30 performs a learning process of the speaker recognition model 20 using the learning data expanded by the speech data expansion unit 10. The learning unit 30 may be implemented by a computer having a microprocessor, a memory, a communication interface, and the like. In this case, the various functions of the learning unit 30 are realized by the microprocessor executing a program stored in the memory. The learning unit 30 may be realized by, for example, distributed computing or cloud computing performed by a plurality of computers communicating with each other.
As shown in fig. 1, the learning unit 30 includes a second sound data acquisition unit 31, a second feature amount calculation unit 32, and a first learning unit 33.
When voice data is input, the speaker recognition model 20 outputs speaker recognition information for recognizing a speaker who utters a voice included in the voice data. The speaker recognition model 20 may be implemented by a computer having a microprocessor, a memory, a communication interface, and the like, for example. In this case, various functions of the speaker recognition model 20 are realized by the microprocessor executing a program stored in the memory. Moreover, the speaker recognition model 20 may be realized by, for example, distributed computing or cloud computing performed by a plurality of computers communicating with each other.
As shown in fig. 1, the speaker recognition model 20 includes a third feature amount calculation unit 21, a Deep Neural Network (DNN) 22, and a determination unit 23.
The recognition target speech data acquisition unit 40 acquires speech data to be recognized in the recognition of the speaker by the speaker recognition model 20. The recognition target audio data acquisition unit 40 may have a communication interface for communicating with an external device, for example, and may acquire audio data from the external device via the communication interface. The recognition target audio data acquisition unit 40 may have an input/output port (e.g., a USB port), for example, and acquire audio data from an external storage device (e.g., a USB memory) connected to the input/output port. The recognition target sound data acquisition unit 40 may have a microphone, for example, and may acquire sound data by converting a sound input to the microphone into an electric signal.
Hereinafter, each component constituting the voice data expansion unit 10 will be described.
The voice data holding unit 11 stores voice data and speaker identification information, which is associated with the voice data and identifies a speaker who uttered a voice included in the voice data, in association with each other.
Fig. 2 is a schematic diagram showing an example of a case where the voice data holding unit 11 stores voice data and speaker identification information in association with each other.
As shown in fig. 2, the voice data storage 11 stores a plurality of voice data associated with a plurality of speaker identification information different from each other. The voice data and speaker recognition information stored in the voice data storage 11 are used as learning data for performing a learning process of the speaker recognition model 20.
Returning again to fig. 1, the explanation of the talker identifying apparatus 1 is continued.
The voice data holding unit 11 may have, for example, a communication interface for communicating with an external device, and store voice data acquired from the external device via the communication interface and speaker identification information associated with the voice data. The audio data holding unit 11 may have an input/output port (e.g., a USB port), for example, and store audio data acquired from an external storage device (e.g., a USB memory) connected to the input/output port and speaker identification information associated with the audio data.
Here, the description will be made assuming that the audio data is in the WAV format. However, the audio data is not necessarily limited to the WAV format, and may be, for example, an AIFF format, an AAC format, or the like.
The first voice data acquisition unit 12 acquires voice data and speaker identification information associated with the voice data from the voice data storage unit 11.
The voice feature conversion unit 13 converts the voice data acquired by the first voice data acquisition unit 12 into voice data to be uttered by a speaker other than the speaker identified by the speaker identification information associated with the voice data (hereinafter, also referred to as "other speaker") and outputs the voice data. More specifically, the voice feature conversion unit 13 generates and outputs voice data uttered by another speaker by changing the frequency components of utterances included in the voice data.
The voice feature conversion unit 13 can output a plurality of voice data having the same utterance content but different speakers by converting the voice data of one speaker into the voice data of a plurality of other speakers and outputting the converted voice data. In addition, when the voice data of one speaker is voice data including japanese pronunciation, the voice feature conversion unit 13 can convert the voice data to voice data including japanese pronunciation of another speaker who does not necessarily speak japanese. That is, the voice feature conversion unit 13 can convert and output voice data of one speaker into voice data of a plurality of other speakers without being limited by the utterance content and language of the voice data before conversion.
Fig. 3 is a schematic diagram showing a case where the voice characteristic conversion unit 13 converts voice data of one speaker into voice data of a plurality of other speakers and outputs the converted voice data.
As shown in fig. 3, the voice feature conversion unit 13 can increase the number of voice data used as learning data for performing the learning process of the speaker recognition model 20, regardless of the utterance content or language.
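For illustration only (not part of the embodiment), the following is a minimal Python sketch of how such voice-feature-based expansion of learning data could be organized; the converter interface convert_voice, the speaker identifiers, and the use of the soundfile library are assumptions introduced for this sketch.

```python
# Illustrative sketch: expanding learning data for speaker recognition by
# converting one utterance into the voices of several other speakers.
# The convert_voice callable and the data layout are hypothetical.
import soundfile as sf

def expand_with_voice_conversion(utterance_path, source_speaker, target_speakers, convert_voice):
    """Return (waveform, speaker_id) pairs derived from one utterance."""
    waveform, sample_rate = sf.read(utterance_path)
    expanded = [(waveform, source_speaker)]            # keep the original sample
    for target in target_speakers:
        converted = convert_voice(waveform, sample_rate, source_speaker, target)
        expanded.append((converted, target))           # same content and language, different voice
    return expanded
```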
Returning again to fig. 1, the explanation of the talker identifying apparatus 1 is continued.
The voice feature conversion unit 13 may be realized by a widely available conventional voice feature converter, for example. Alternatively, the voice feature conversion unit 13 may be realized by using a voice feature conversion model that has been subjected to learning processing in advance so that, when voice data of a first speaker is input, voice data of a second speaker is output. In the following, the voice feature conversion unit 13 is described as being realized by using such a voice feature conversion model.
Fig. 4 is a block diagram showing a configuration example of the sound characteristic conversion unit 13.
As shown in fig. 4, the voice trait conversion unit 13 includes a voice trait conversion learning data holding unit 131, a second learning unit 132, and a voice trait conversion model 133.
The speech feature transformation model 133 is a Deep Neural Network (DNN) that performs a learning process in advance for each of a plurality of speaker pairs so that when speech data of a first speaker that is one of the speaker pairs is input, speech data of a second speaker that is the other of the speaker pairs is output, and when speech data of the second speaker is input, speech data of the first speaker is output. Here, as an example, the voice trait conversion model 133 will be described as a cycleVAE that performs learning processing in advance for each of a plurality of speaker pairs so that when voice data in the WAV format of a first speaker is input, voice data in the WAV format of a second speaker is output, and when voice data in the WAV format of a second speaker is input, voice data in the WAV format of the first speaker is output. However, the speech feature transformation model 133 is not necessarily limited to the above-described cycleVAE as long as it is DNN subjected to learning processing in advance for each of a plurality of speaker pairs so that the speech data of the second speaker is output when the speech data of the first speaker is input and the speech data of the first speaker is output when the speech data of the second speaker is input.
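As a rough, non-authoritative sketch of what a speaker-conditioned VAE used for voice conversion can look like, the following Python/PyTorch code shows an encoder-decoder with a speaker embedding; the layer sizes, the use of frame-level acoustic features instead of raw WAV samples, and the omission of the cycle-consistency term of CycleVAE are simplifications assumed purely for illustration.

```python
# Simplified, illustrative speaker-conditioned VAE (not the embodiment's CycleVAE).
import torch
import torch.nn as nn

class VoiceConversionVAE(nn.Module):
    def __init__(self, feat_dim=80, latent_dim=32, num_speakers=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.speaker_emb = nn.Embedding(num_speakers, 16)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + 16, 256), nn.ReLU(), nn.Linear(256, feat_dim)
        )

    def forward(self, frames, speaker_id):
        # frames: (num_frames, feat_dim); speaker_id: LongTensor of shape (1,)
        h = self.encoder(frames)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        spk = self.speaker_emb(speaker_id).expand(frames.size(0), -1)
        return self.decoder(torch.cat([z, spk], dim=-1)), mu, logvar

def vae_loss(reconstruction, target, mu, logvar):
    # Reconstruction term plus KL regularization; a cycle-consistency term
    # between the two speakers of a pair would be added for a CycleVAE.
    recon_term = nn.functional.mse_loss(reconstruction, target)
    kl_term = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_term + kl_term
```

In such a sketch, converting from the first speaker to the second speaker amounts to encoding the first speaker's frames and decoding them with the second speaker's embedding.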
The voice trait conversion learning data holding unit 131 stores learning data for performing a learning process of the voice trait conversion model 133. More specifically, the voice trait conversion learning data holding unit 131 stores voice data (here, voice data in the WAV format) of each of a plurality of speakers to which the voice trait conversion models 133 are applied.
The second learning unit 132 performs a learning process of the voice trait conversion model 133 for each of the plurality of speaker pairs using the learning data stored in the voice trait conversion learning data holding unit 131 so that when voice data of a first speaker, which is one of the speaker pairs, is input, voice data of a second speaker, which is the other of the speaker pairs, is output, and when voice data of the second speaker is input, voice data of the first speaker is output.
Returning again to fig. 1, the explanation of the talker identifying apparatus 1 is continued.
The noise reverberation unit 14 applies noise (for example, 4 types) and reverberation (for example, 1 type) to each of the audio data output from the audio characteristic conversion unit 13, and outputs the audio data to which the noise is applied and the audio data to which the reverberation is applied. This allows the noise reverberation unit 14 to further increase the number of audio data.
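As a hedged sketch of this kind of augmentation (the SNR value, the impulse-response source, and the NumPy-based signal handling are assumptions, not values specified by the embodiment):

```python
# Illustrative noise addition at a target SNR and reverberation by convolution
# with a room impulse response; parameters are assumed for this sketch.
import numpy as np

def add_noise(speech, noise, snr_db=10.0):
    noise = np.resize(noise, speech.shape)                    # repeat/trim noise to match length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverberation(speech, room_impulse_response):
    wet = np.convolve(speech, room_impulse_response)[: len(speech)]
    return wet / (np.max(np.abs(wet)) + 1e-12)                # normalize to avoid clipping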
The first feature value calculating unit 15 calculates utterance feature values indicating features of utterances included in the sound data, based on the sound data output from the sound characteristic converting unit 13 and the sound data output from the noise reverberation imparting unit 14. Here, as an example, the first feature value calculating unit 15 calculates MFCC (Mel-frequency cepstral Coefficients) representing the characteristics of the vocal tract of the speaker as the utterance feature value. However, the first feature value calculation unit 15 is not necessarily limited to the example of calculating the MFCC as long as it can calculate the utterance feature value indicating the feature of the speaker. The first feature amount calculation unit 15 may calculate, as the utterance feature amount, a value obtained by applying a Mel Filter Bank (Mel Filter Bank) to the uttered voice signal, for example, or may calculate, as the utterance feature amount, a spectrum of the uttered voice signal, for example.
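A minimal way to obtain such an utterance feature amount, assuming the librosa library is available (the number of coefficients and framing defaults are illustrative, not values fixed by the embodiment), is:

```python
# Illustrative MFCC extraction; a mel filter bank output or a spectrum could
# be substituted, as the description above notes.
import librosa

def compute_mfcc(waveform, sample_rate, n_mfcc=20):
    # Returns an (n_mfcc, num_frames) array of MFCCs.
    return librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)
```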
For each utterance feature amount output from the first feature amount calculation unit 15 (hereinafter also referred to as the "first speaker feature amount"), the comparison unit 16 compares the first speaker feature amount with the feature amount of the speaker of the utterance included in the voice data from which the first speaker feature amount was calculated (hereinafter also referred to as the "second speaker feature amount").
As a result of the comparison, (1) when the degree of similarity between the first speaker feature quantity and the second speaker feature quantity is within a predetermined range, the comparison unit 16 associates the voice data that is the calculation source of the first speaker feature quantity with speaker identification information that identifies the speaker of the utterance contained in the voice data. Thus, the comparison unit 16 can increase the number of pieces of voice data associated with one piece of speaker identification information. Then, the comparison unit 16 outputs the voice data and the speaker identification information associated with the voice data.
As a result of the comparison, (2) when the degree of similarity between the first speaker feature quantity and the second speaker feature quantity is not within the predetermined range, the comparison unit 16 associates the voice data that is the calculation source of the first speaker feature quantity with the identification information for identifying a third person who is different from the speaker who uttered the voice data. This enables the comparison unit 16 to increase the number of speaker identification information pieces associated with the voice data. That is, the comparison unit 16 can increase the number of speakers in the learning data for performing the learning process of the speaker recognition model 20. By increasing the number of speakers, it is possible to suppress over-learning in the learning process of the speaker recognition model 20, which will be described later. This can improve the generalization performance of the speaker recognition model 20. Then, the comparison unit 16 outputs the voice data and speaker identification information associated with the voice data.
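A compact sketch of this decision, assuming cosine similarity between speaker feature vectors and an illustrative acceptance threshold (the actual similarity measure and the "predetermined range" are not fixed by the above description), is:

```python
# Illustrative label assignment by the comparison unit: keep the original
# speaker label when the converted voice is similar enough, otherwise assign
# a new (third) speaker label. The threshold value is an assumption.
import numpy as np

def assign_speaker_label(first_feature, second_feature,
                         original_speaker_id, new_speaker_id, threshold=0.7):
    cos = np.dot(first_feature, second_feature) / (
        np.linalg.norm(first_feature) * np.linalg.norm(second_feature) + 1e-12)
    if cos >= threshold:
        return original_speaker_id      # within the predetermined range: same speaker
    return new_speaker_id               # outside the range: label as a different (third) speaker
```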
Similarly to the voice data holding unit 11, the extended voice data holding unit 18 stores the voice data and speaker identification information associated with the voice data and identifying a speaker of an utterance included in the voice data in association with each other.
The voice data storage unit 17 stores, in the extended voice data holding unit 18, the voice data output from the comparison unit 16 and the speaker identification information associated with that voice data in association with each other. The voice data storage unit 17 also stores, in the extended voice data holding unit 18, the voice data acquired by the first voice data acquisition unit 12 and the speaker identification information associated with that voice data in association with each other. Thus, in addition to the voice data stored by the voice data holding unit 11 as learning data for performing the learning process of the speaker recognition model 20, the extended voice data holding unit 18 also stores the voice data output from the comparison unit 16 as learning data for performing the learning process of the speaker recognition model 20.
Hereinafter, each component constituting the speaker recognition model 20 will be described.
The third feature value calculation unit 21 calculates an utterance feature value indicating a feature of an utterance included in the voice data, based on the voice data acquired by the recognition target voice data acquisition unit 40. Here, as an example, the third feature value calculation unit 21 calculates MFCC indicating the vocal tract characteristics of the speaker as the utterance feature value. However, the third feature value calculation unit 21 is not necessarily limited to the example of calculating the MFCC as long as it can calculate the utterance feature value indicating the feature of the speaker. The third feature amount calculation unit 21 may calculate, for example, a value obtained by applying a mel filter bank to a voiced sound signal as a vocal feature amount, or may calculate, for example, a spectrum of the voiced sound signal as a vocal feature amount.
The deep neural network 22 is a Deep Neural Network (DNN) as follows: when the utterance feature amount calculated by the third feature amount calculation unit 21 is input, a learning process is performed in advance so as to output a speaker characteristic feature amount indicating a feature of the speaker of the utterance included in the voice data from which the utterance feature amount was calculated. Here, as an example, the deep neural network 22 is described as a Kaldi-based DNN that has been subjected to learning processing in advance so that, when an MFCC representing the vocal tract characteristics of a speaker is input, an x-Vector, which is an acoustic feature quantity that maps a variable-length utterance to a fixed-dimensional embedding, is output as the speaker characteristic feature amount. However, the deep neural network 22 is not necessarily limited to the above as long as it is a DNN subjected to learning processing in advance such that, when the utterance feature amount calculated by the third feature amount calculation unit 21 is input, a speaker characteristic feature amount indicating a feature of the speaker is output. Further, details of the calculation method of the x-Vector and the like are disclosed in non-patent document 1, and therefore detailed description thereof is omitted here.
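The following Python/PyTorch code is a simplified, illustrative x-vector-style network (frame-level TDNN layers followed by statistics pooling); the layer widths, embedding size, and number of speakers are assumptions, and a practical system would follow the Kaldi recipe of non-patent document 1.

```python
# Simplified sketch of an x-vector extractor: frame-level TDNN layers,
# statistics pooling, a fixed-dimension embedding, and a speaker classifier
# head used only during training.
import torch
import torch.nn as nn

class XVectorNet(nn.Module):
    def __init__(self, feat_dim=20, emb_dim=512, num_speakers=1000):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.segment = nn.Linear(2 * 1500, emb_dim)    # after mean + std pooling
        self.classifier = nn.Linear(emb_dim, num_speakers)

    def forward(self, mfcc):                           # mfcc: (batch, feat_dim, num_frames)
        h = self.frame_layers(mfcc)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)   # statistics pooling
        x_vector = self.segment(stats)                 # fixed-dimension utterance embedding
        return x_vector, self.classifier(x_vector)
```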
The determination unit 23 determines the speaker of the utterance included in the voice data acquired by the recognition target voice data acquisition unit 40 based on the speaker characteristic feature amount output from the deep neural network 22. More specifically, the determination unit 23 stores x-Vectors of a plurality of speakers, specifies, among the stored x-Vectors, the x-Vector most similar to the x-Vector output from the deep neural network 22, and determines the speaker corresponding to the specified x-Vector as the speaker of the utterance included in the voice data acquired by the recognition target voice data acquisition unit 40. Then, the determination unit 23 outputs speaker identification information for identifying the determined speaker.
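A sketch of this determination, assuming stored enrollment x-Vectors and cosine similarity as the similarity measure (a PLDA backend is another common choice not shown here):

```python
# Illustrative speaker determination: pick the enrolled speaker whose stored
# x-vector is most similar to the x-vector of the input utterance.
import numpy as np

def identify_speaker(test_xvector, enrolled_xvectors):
    """enrolled_xvectors: dict mapping speaker_id -> enrollment x-vector."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(enrolled_xvectors, key=lambda spk: cosine(test_xvector, enrolled_xvectors[spk]))
```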
Hereinafter, each component constituting the learning unit 30 will be described.
The second sound data acquisition unit 31 acquires the sound data and the speaker identification information associated with the sound data from the extended sound data storage unit 18.
The second feature value calculation unit 32 calculates an utterance feature value indicating a feature of an utterance included in the sound data, based on the sound data acquired by the second sound data acquisition unit 31. Here, as an example, the second feature value calculation unit 32 calculates MFCC indicating the vocal tract characteristics of the speaker as the utterance feature value. However, the second feature value calculation unit 32 is not necessarily limited to the example of calculating the MFCC as long as it can calculate the utterance feature value representing the feature of the speaker. The second feature value calculation unit 32 may calculate, for example, a value obtained by applying a mel filter bank to a voiced sound signal as a vocal feature value, or may calculate, for example, a spectrum of the voiced sound signal as a vocal feature value.
The first learning unit 33 performs the learning process of the speaker recognition model 20, using as learning data the utterance feature amount calculated by the second feature amount calculation unit 32 and the speaker recognition information identifying the speaker of the utterance included in the voice data from which the utterance feature amount was calculated, so that when voice data is input, the speaker recognition model 20 outputs speaker recognition information identifying the speaker of the utterance included in the voice data.
More specifically, the first learning unit 33 performs the learning process of the deep neural network 22, using as learning data the MFCC calculated by the second feature amount calculation unit 32 and the speaker identification information corresponding to the MFCC, so that when the MFCC is input, the deep neural network 22 outputs an x-Vector indicating the features of the speaker of the utterance included in the voice data from which the MFCC was calculated.
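As a hedged sketch of this learning process (the optimizer, batching, and epoch count are illustrative choices; the model is assumed to return an embedding and speaker logits, as in the XVectorNet sketch above):

```python
# Illustrative training loop: cross-entropy over speaker labels computed from
# the expanded learning data (MFCC, speaker-ID pairs).
import torch
import torch.nn as nn

def train_speaker_model(model, data_loader, epochs=10, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for mfcc_batch, speaker_ids in data_loader:    # (batch, feat_dim, frames), (batch,)
            _, logits = model(mfcc_batch)
            loss = criterion(logits, speaker_ids)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```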
< action >
The speaker recognition device 1 configured as described above performs speaker recognition model learning processing, voice feature transformation model learning processing, and speaker recognition processing.
These processes are described below in turn with reference to the drawings.
Fig. 5 is a flowchart of the speaker recognition model learning process.
The speaker recognition model learning process is a process of performing a learning process of the speaker recognition model 20.
The speaker recognition model learning process is started, for example, by the user of the speaker recognition apparatus 1 performing an operation to start the speaker recognition model learning process on the speaker recognition apparatus 1.
When the speaker recognition model learning process is started, the first speech data acquisition unit 12 acquires one piece of speech data and one piece of speaker recognition information associated with the one piece of speech data from the speech data storage unit 11 (step S100).
When one voice data and one speaker identification information are acquired, the voice data storage unit 17 stores the one voice data and the one speaker identification information in association with each other in the extended voice data storage unit 18 (step S110).
On the other hand, the voice feature conversion unit 13 selects one speaker from the other speakers other than the speaker identified by the one speaker identification information (step S120). Then, the voice feature conversion unit 13 converts the one voice data into voice data uttered by the one speaker (step S130) and outputs the converted voice data.
When the voice data is output from the voice characteristic conversion unit 13, the noise reverberation imparting unit 14 imparts noise and reverberation to the voice data output from the voice characteristic conversion unit 13 (step S140), and outputs one or more voice data.
When one or more pieces of sound data are output from the noise reverberation imparting unit 14, the first feature amount calculating unit 15 calculates the utterance feature amount based on the sound data output from the sound characteristic converting unit 13 and the one or more pieces of sound data output from the noise reverberation imparting unit 14 (step S150).
When the utterance feature amount is calculated, the comparison unit 16 compares the calculated utterance feature amount with the utterance feature amount of the selected one speaker, and determines whether or not the similarity between the calculated utterance feature amount and the utterance feature amount of the one speaker is within a predetermined range (step S160).
When an affirmative determination is made in the processing of step S160 (yes in step S160), the comparison unit 16 associates speaker identification information for identifying the selected one speaker with the voice data that is the source of the utterance feature amount calculation of the affirmative determination (step S170). Then, the comparison unit 16 outputs the voice data and speaker identification information associated with the voice data.
When a negative determination is made in the process of step S160 (no in step S160), the comparison unit 16 associates the identification information for identifying a third person different from the selected one speaker with the voice data that is the source of the utterance feature amount of the negative determination (step S180). Then, the comparison unit 16 outputs the voice data and speaker identification information associated with the voice data.
When the comparison unit 16 executes the processing of step S170 or the processing of step S180 for all the utterance feature amounts to be compared in the processing of step S160, the speech data storage unit 17 stores the speech data output from the comparison unit 16 and the speaker identification information associated with the speech data in the extended speech data storage unit 18 in association with each other (step S190).
When the process of step S190 is completed, the voice feature conversion unit 13 determines whether or not one of the other speakers that was not selected in the process of step S120 (hereinafter, also referred to as "unselected speaker") is present (step S200).
If it is determined in the process of step S200 that there are unselected speakers (yes in step S200), the voice feature converting unit 13 selects one speaker from the unselected speakers (step S210), and the process proceeds to step S130.
When it is determined that there is no unselected speaker in the processing of step S200 (no in step S200), the first audio data acquisition unit 12 determines whether there is unacquired audio data that has not been acquired in the audio data stored in the audio data storage unit 11 (step S220).
If it is determined in the process of step S220 that there is any unacquired audio data (yes in step S220), the first audio data acquiring unit 12 acquires one audio data from the unacquired audio data (step S230), and the process proceeds to step S110.
When it is determined in the process of step S220 that there is no unacquired speech data (no in step S220), the second speech data acquisition unit 31 acquires the speech data and the speaker identification information associated with the speech data from the extended speech data storage unit 18 for all the speech data stored in the extended speech data storage unit 18 (step S240).
When the voice data and the speaker identification information associated with the voice data are acquired for all the voice data, the second feature calculating unit 32 calculates utterance features representing features of utterances included in the voice data from the voice data for all the voice data (step S250).
When the utterance feature amount is calculated for all the speech data, the first learning unit 33 performs a learning process of the speaker recognition model 20 using, as learning data, the utterance feature amount and speaker recognition information of a speaker recognizing an utterance included in the speech data that is a calculation source of the utterance feature amount for all the utterance feature amounts, so that when the speech data is input, the speaker recognition information of the speaker recognizing the utterance included in the speech data is output (step S260).
When the process of step S260 ends, the speaker recognition apparatus 1 ends the speaker recognition model learning process.
Fig. 6 is a flowchart of the voice trait transformation model learning process.
The voice trait conversion model learning process is a process of performing a learning process of the voice trait conversion model 133.
The voice characteristic transformation model learning process is started, for example, by a user of the speaker recognition apparatus 1 performing an operation to start the voice characteristic transformation model learning process on the speaker recognition apparatus 1.
When the voice trait conversion model learning process is started, the second learning unit 132 selects one speaker pair of a plurality of speakers whose target is the voice trait conversion model 133 (step S300). Then, the second learning unit 132 performs a learning process of the voice trait conversion model 133 on the selected one speaker pair using the learning data of each of the 2 speakers constituting the selected one speaker pair among the learning data held by the voice trait conversion learning data holding unit 131 so that the voice data of the second speaker, which is one of the speaker pairs, is output when the voice data of the first speaker, which is the other of the speaker pairs, is input, and the voice data of the first speaker is output when the voice data of the second speaker is input (step S310).
When the learning process of the voice trait conversion model 133 has been performed for one speaker pair, the second learning unit 132 determines whether or not there is an unselected speaker pair that has not yet been selected among the plurality of speaker pairs targeted by the voice trait conversion model 133 (step S320).
If it is determined in the process of step S320 that there is an unselected speaker pair (yes in step S320), the second learning unit 132 selects one speaker pair from the unselected speaker pairs (step S330), and the process proceeds to step S310.
If it is determined in the process of step S320 that there is no unselected speaker pair (no in step S320), the speaker recognition apparatus 1 ends the voice characteristic transformation model learning process.
Fig. 7 is a flowchart of the speaker identification process.
The speaker recognition processing is processing for recognizing a speaker of an utterance contained in the voice data. More specifically, the speaker recognition process is a process of inputting voice data to the speaker recognition model 20 that has been subjected to the learning process in advance, and causing the speaker recognition model 20 to output speaker recognition information.
The speaker recognition process is started, for example, by an operation of the speaker recognition apparatus 1 to start the speaker recognition process by the user of the speaker recognition apparatus 1.
When the speaker recognition processing is started, the recognition target speech data acquisition unit 40 acquires speech data to be recognized (step S400).
When acquiring the voice data, the third feature quantity calculation unit 21 calculates an utterance feature quantity indicating a feature of an utterance included in the voice data, based on the acquired voice data (step S410), and inputs the calculated utterance feature quantity to the deep neural network 22. Then, the deep neural network 22 outputs speaker characteristic features indicating the characteristics of the speaker of the utterance contained in the speech data that is the source of the input utterance feature calculation (step S420).
When outputting the speaker-specific feature amount, the determination unit 23 determines the speaker of the utterance contained in the speech data acquired by the recognition target speech data acquisition unit 40 based on the output speaker-specific feature amount (step S430). Then, the determination unit 23 outputs speaker identification information for identifying the determined speaker (step S440).
When the process of step S440 ends, the speaker recognition apparatus 1 ends the speaker recognition process.
< Discussion >
As described above, the speaker recognition device 1 extends the learning data, stored in the voice data holding unit 11, for the learning process of the speaker recognition model 20, without being limited by the utterance content and the language. The learning process of the speaker recognition model 20 is then performed using the extended learning data. Therefore, according to the speaker recognition device 1, the accuracy of speaker recognition using the speaker recognition model 20 can be improved. That is, the speaker recognition device 1 can recognize a speaker with high accuracy.
(supplementary notes)
Although the speaker recognition device according to the embodiment has been described above, the present disclosure is not limited to this embodiment.
For example, each processing unit included in the speaker recognition device according to the above embodiment is typically realized as an LSI, which is an integrated circuit. These units may be individually formed into single chips, or a single chip may be formed so as to include some or all of them.
The integrated circuit is not limited to an LSI, and may be realized by a dedicated circuit or a general-purpose processor. An FPGA (Field Programmable Gate Array) that can be programmed after LSI manufacturing or a reconfigurable processor that can reconfigure connection and setting of circuit cells within an LSI may be used.
The present disclosure can be realized as a method for learning a speaker recognition model executed by the speaker recognition device of the embodiment, or as a speaker recognition method.
In the above-described embodiment, each component may be configured by dedicated hardware, or may be realized by executing a software program suitable for the component. Each component may also be realized by a program execution unit such as a CPU or a processor reading out and executing a software program recorded in a recording medium such as a hard disk or a semiconductor memory.
Note that division of functional blocks in the block diagrams is an example, and a plurality of functional blocks may be implemented as one functional block, one functional block may be divided into a plurality of functional blocks, or a part of functions may be transferred to another functional block. Further, a single piece of hardware or software may process functions of a plurality of functional blocks having similar functions in parallel or in time division.
The order in which the steps in the flowcharts are executed is exemplified for the purpose of specifically describing the present disclosure, and may be an order other than the above. Further, a part of the above steps may be executed simultaneously (in parallel) with other steps.
Although the speaker recognition device according to one or more aspects has been described above based on the embodiment, the present disclosure is not limited to this embodiment. Forms obtained by applying various modifications conceived by those skilled in the art to the present embodiment, and forms constructed by combining constituent elements in different embodiments, may also be included within the scope of one or more aspects, without departing from the spirit of the present disclosure.
Industrial applicability
The present disclosure can be widely applied to devices for recognizing a speaker and the like.

Claims (8)

1. A learning method of a speaker recognition model that, when voice data is input, outputs speaker recognition information for recognizing a speaker of an utterance contained in the voice data, wherein,
generating second voice data of a second speaker by performing voice characteristic conversion processing on first voice data of a first speaker,
performing learning processing of the speaker recognition model using the first voice data and the second voice data as learning data.
2. The learning method according to claim 1, wherein,
the voice trait conversion process is a process based on the voice data of the first speaker and the voice data of the second speaker.
3. The learning method according to claim 2, wherein,
the voice characteristic conversion process includes a process of inputting the first voice data to a voice characteristic conversion model, and outputting the second voice data from the voice characteristic conversion model, the voice characteristic conversion model being subjected to a learning process in advance so that when the voice data of the first speaker is input, the voice data of the second speaker is output.
4. The learning method according to claim 3, wherein,
the voice characteristic conversion model includes a deep neural network that receives voice data in the WAV format as input and outputs voice data in the WAV format.
5. The learning method according to claim 1, wherein,
the voice trait conversion process is a process based on the voice data of the first speaker and the voice data of the third speaker.
6. The learning method according to claim 1, wherein,
the speaker recognition model includes a deep neural network that receives, as input, utterance feature quantities representing features of utterances included in the speech data and outputs speaker-specific feature quantities representing features of speakers.
7. A speaker recognition method, wherein,
inputting voice data to the speaker recognition model that has been previously subjected to a learning process by the learning method according to claim 1, and causing the speaker recognition model to output the speaker recognition information.
8. A computer-readable recording medium having a program recorded thereon, the program causing a computer to execute a process of learning a speaker recognition model that outputs speaker recognition information for recognizing a speaker who uttered a voice contained in voice data when the voice data is input,
the processing comprises the following steps:
a first step of generating second voice data of a second speaker by performing voice characteristic conversion processing on first voice data of a first speaker; and
a second step of performing learning processing of the speaker recognition model using the first voice data and the second voice data as learning data.
CN202010829027.7A 2019-08-23 2020-08-18 Learning method, speaker recognition method, and recording medium Pending CN112420021A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962890872P 2019-08-23 2019-08-23
US62/890872 2019-08-23
JP2020077113A JP2021033260A (en) 2019-08-23 2020-04-24 Training method, speaker identification method, and recording medium
JP2020-077113 2020-04-24

Publications (1)

Publication Number Publication Date
CN112420021A 2021-02-26

Family

Family ID: 74677379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010829027.7A Pending CN112420021A (en) 2019-08-23 2020-08-18 Learning method, speaker recognition method, and recording medium

Country Status (2)

Country Link
JP (1) JP2021033260A (en)
CN (1) CN112420021A (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2023002421A (en) * 2021-06-22 2023-01-10 パナソニックホールディングス株式会社 Abnormal articulation detection method, abnormal articulation detection device, and program
JP7254316B1 (en) 2022-04-11 2023-04-10 株式会社アープ Program, information processing device, and method

Also Published As

Publication number Publication date
JP2021033260A (en) 2021-03-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination