WO2016092807A1 - Speaker identification device and method for registering feature amounts of registered speech for speaker identification - Google Patents
- Publication number
- WO2016092807A1 (PCT/JP2015/006068)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- registered
- speaker
- text data
- speech
- speaker identification
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/04—Training, enrolment or model building
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/12—Score normalisation
- G10L17/22—Interactive procedures; Man-machine interfaces
- G10L17/24—Interactive procedures; Man-machine interfaces the user being prompted to utter a password or a predefined phrase
Definitions
- The present invention relates to a speaker identification device and the like, for example, to a device that identifies which previously registered speaker an input speech belongs to.
- Speaker identification refers to processing by a computer that recognizes (identifies and authenticates) an individual from a human voice. Specifically, in speaker identification, features are extracted from speech, modeled, and individual voices are identified using the modeled data.
- the speaker identification service is a service that provides speaker identification and identifies a speaker of input voice data.
- Speaker registration is also referred to as enrollment or training.
- FIGS. 9A and 9B are diagrams for explaining a general speaker identification service. As shown in FIGS. 9A and 9B, a typical speaker identification service operates in two phases: a registration phase and an identification phase.
- FIG. 9A is a diagram schematically showing the contents of the registration phase.
- FIG. 9B is a diagram schematically showing the contents of the identification phase.
- the user inputs a registered voice (actually, a speaker name and a registered voice) to the speaker identification service.
- the speaker identification service extracts feature amounts from the registered speech.
- the speaker identification service stores the speaker name and feature amount pair in the speaker identification dictionary as dictionary registration.
- The user inputs voice (specifically, the voice to be identified) to the speaker identification service.
- the speaker identification service extracts feature amounts from the identification target speech.
- The speaker identification service identifies the registered voice whose feature quantity matches that of the identification target voice by comparing the extracted feature quantity with the feature quantities registered in the speaker identification dictionary.
- the speaker identification service returns the speaker name added to the specified registered voice to the user as an identification result.
- The accuracy of speaker identification depends on the quality of the registered speech. For example, when the registered voice contains only vowels, when voices other than that of the speaker to be registered are mixed in, or when the noise level is high, the accuracy is lower than when registration is performed under ideal conditions. For this reason, practical identification accuracy may not be obtained, depending on the contents of the data stored in the identification dictionary.
- the data stored in the identification dictionary is not necessarily these feature quantities themselves.
- A method is also known in which a classifier such as a support vector machine (SVM) is generated using a set of feature data and the parameters of the classifier are registered in an identification dictionary (for example, Patent Document 1).
- In Patent Document 1, the similarity between data previously registered in the database and data to be newly registered is calculated, and registration is permitted only when the similarity is less than a reference value.
- Secondary identification is then performed to calculate the degree of similarity with the input voice (voice to be identified) more strictly.
- Patent Document 2 discloses an evaluation means using the similarity (likelihood) with biometric information registered in advance in a database. The similarity is calculated between the biometric information to be newly registered and each item of biometric information already registered in the database, and registration is permitted only when the likelihood with respect to all the registered biometric information is less than a reference value.
- Patent Documents 3 to 5 disclose techniques related to the present invention.
- Since Patent Document 2 uses the degree of similarity with registered biometric information as its criterion, when the voice to be evaluated does not differ greatly from the registered biometric information, there is a problem that another person is mistakenly determined to be the same person, or that the person cannot be identified.
- The present invention has been made in view of such circumstances, and an object thereof is to provide a speaker identification device and the like that suppress speaker identification errors caused by registered speech and identify a speaker stably and accurately.
- The speaker identification device of the present invention includes: speech recognition means for extracting, as extracted text data, text data corresponding to registered speech, which is speech input by a registered speaker reading out registration target text data (text data set in advance); registered speech evaluation means for calculating, for each registered speaker, a score indicating the similarity between the extracted text data and the registration target text data; and dictionary registration means for registering, according to the evaluation result of the registered speech evaluation means, the feature amount of the registered speech in a speaker identification dictionary that registers the feature amount of the registered speech for each registered speaker.
- In the registered voice feature quantity registration method for speaker identification of the present invention, text data corresponding to registered voice, which is voice input by a registered speaker reading out registration target text data (text data set in advance), is extracted as extracted text data; a score indicating the similarity between the extracted text data and the registration target text data is calculated for each registered speaker; and, according to the calculation result of the score, the feature amount of the registered voice is registered in a speaker identification dictionary for registering the feature amount of the registered voice for each registered speaker.
- The storage medium of the present invention stores a program for causing a computer to execute processing of: extracting, as extracted text data, text data corresponding to registered speech, which is speech input by a registered speaker reading out registration target text data set in advance; calculating, for each registered speaker, a score indicating the degree of similarity between the extracted text data and the registration target text data; and registering, according to the score calculation result, the feature amount of the registered speech in a speaker identification dictionary for registering the feature amount of the registered speech for each registered speaker.
- According to the speaker identification device or the like of the present invention, it is possible to suppress identification errors caused by registered speech and to identify a speaker stably and accurately.
- FIG. 2 is a diagram for explaining the principle of speaker identification processing according to the first embodiment of the present invention.
- the speaker identification device 500 corresponds to the speaker identification device of the present invention.
- the speaker identification device 500 presents the registration target text data 501 to the user 600. At this time, the speaker identification device 500 requests the user 600 to read out the registration target text data 501 (processing 1).
- The speaker identification device 500 is a block schematically showing the functions of the speaker identification server 100 of FIG. 1, described later.
- a microphone (not shown in FIG. 2) provided in the terminal (not shown in FIG. 2) collects the voice read out by the user 600. Then, the voice read out by the user 600 is input as the registered voice 502 to the speaker identification device 500 (processing 2).
- the speaker identification device 500 extracts the extracted text data 503 from the registered speech 502 by speech recognition (processing 3).
- The speaker identification device 500 compares the extracted text data 503 (text extraction result) extracted in process 3 with the registration target text data 501, and calculates a score based on the ratio (similarity) of the portions where they match (process 4).
- When the score obtained in process 4 is equal to or greater than a reference value, the speaker identification device 500 registers the feature amount extracted from the registered speech 502 and the speaker name in the speaker identification dictionary 504 (process 5). When the score is not equal to or greater than the reference value, the speaker identification device 500 retries the processing from process 2 onward.
- Alternatively, the entire registration target text may be divided into a plurality of partial texts (for example, sentence units), processes 1 to 4 may be repeated for each partial text, and the registration of process 5 may be performed for the corresponding user when the score exceeds the reference value for all of the partial texts.
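The flow of processes 1 to 5 described above, including the partial-text variant and the retry on a low score, can be sketched as follows. This is an illustrative sketch only: the helper callables (`record_voice`, `recognize`, `extract_features`, `score_fn`) and the threshold and retry values are hypothetical stand-ins, not part of the disclosed embodiment.

```python
# Hypothetical sketch of the registration phase (processes 1 to 5).
def register_speaker(speaker_name, target_texts, dictionary,
                     record_voice, recognize, extract_features, score_fn,
                     score_threshold=0.9, max_retries=3):
    """Register a speaker only when every partial text clears the threshold."""
    accepted_voices = []
    for text in target_texts:                  # process 1: present each partial text
        for _ in range(max_retries):
            voice = record_voice(text)         # process 2: voice read out by the user
            extracted = recognize(voice)       # process 3: speech-to-text extraction
            if score_fn(text, extracted) >= score_threshold:  # process 4: score
                accepted_voices.append(voice)  # keep in the temporary recording area
                break
        else:
            return False                       # a partial text never cleared the score
    # process 5: all partial texts passed, so register the feature amounts
    dictionary[speaker_name] = [extract_features(v) for v in accepted_voices]
    return True
```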
- FIG. 1 is a diagram showing a configuration of a speaker identification system 1000 including a speaker identification server 100.
- the speaker identification server 100 corresponds to the speaker identification device of the present invention.
- the speaker identification system 1000 includes a speaker identification server 100 and a terminal 200.
- the speaker identification server 100 and the terminal 200 are connected via a network 300 so that they can communicate with each other.
- the speaker identification server 100 is connected to a network 300.
- the speaker identification server 100 is communicatively connected to one or more terminals 200 via the network 300. More specifically, the speaker identification server 100 is a server device that performs speaker identification with respect to voice data input by the terminal 200 via the network 300.
- Any number (one or more) of terminals 200 can be connected to one speaker identification server 100.
- The speaker identification server 100 includes a text presentation unit 101, a speech recognition unit 102, a registered speech evaluation unit 103, a dictionary registration unit 104, a speaker identification unit 105, a registration target text recording unit 106, a temporary voice recording unit 107, and a speaker identification dictionary 108.
- the text presentation unit 101 is connected to a speech recognition unit 102, a registered speech evaluation unit 103, a dictionary registration unit 104, and a registration target text recording unit 106.
- the text presentation unit 101 provides registration target text data (data including characters or symbols), which is preset text data, to a registered speaker. More specifically, the text presentation unit 101 provides registration target text data to a registered speaker who uses the terminal 200 via the network 300, and prompts the registered speaker to read out the registration target text data.
- the registered speaker is a user of the terminal 200 and registers his / her voice in the speaker identification server 100.
- the registration target text data is text data set in advance and serving as reference text data. Registration target text data can be arbitrarily set in advance.
- the speech recognition unit 102 is connected to a text presentation unit 101, a registered speech evaluation unit 103, and a dictionary registration unit 104.
- The voice recognition unit 102 extracts, as extracted text data, text data corresponding to the registered voice, which is the voice input when the registration target text data is read out by the registered speaker. That is, when the registered speaker reads out the reference text data using the terminal 200, the terminal 200 transmits the voice input by the registered speaker to the speaker identification server 100 via the network 300 as the registered voice. Then, the speech recognition unit 102 extracts text data, as extracted text data, from the registered speech resulting from reading out the registration target text data, by speech recognition (speech-to-text).
- the registered voice evaluation unit 103 is connected to a text presentation unit 101, a voice recognition unit 102, a dictionary registration unit 104, a registration target text recording unit 106, and a temporary voice recording unit 107.
- the registered speech evaluation unit 103 calculates a registered speech score indicating the similarity between the extracted text data extracted by the speech recognition unit 102 and the registration target text data for each registered speaker. That is, the registered voice evaluation unit 103 calculates a registered voice score as an index indicating the quality of the registered voice by comparing the text extraction result (extracted text data) from the registered voice with the registration target text data.
- the dictionary registration unit 104 is connected to a text presentation unit 101, a speech recognition unit 102, a registered speech evaluation unit 103, a speaker identification unit 105, and a speaker identification dictionary 108.
- the dictionary registration unit 104 registers the feature amount of the registered voice in the speaker identification dictionary 108 according to the evaluation result of the registered voice evaluation unit 103. More specifically, when the registered speech score calculated by the registered speech evaluation unit 103 is larger than a predetermined reference value, the dictionary registration unit 104 registers the feature amount of the registered speech in the speaker identification dictionary 108. That is, the dictionary registration unit 104 extracts feature amounts from the registered speech whose registered speech score calculated by the registered speech evaluation unit 103 is equal to or greater than a reference value, and registers the extracted information in the speaker identification dictionary 108.
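The gate applied by the dictionary registration unit 104 can be illustrated with a minimal sketch. The function and parameter names here are hypothetical, not taken from the disclosure:

```python
def maybe_register(speaker_name, registered_voice, score, dictionary,
                   extract_features, reference_value=0.9):
    """Register the feature amount only when the registered speech score
    clears the reference value; otherwise leave the dictionary unchanged."""
    if score >= reference_value:
        dictionary[speaker_name] = extract_features(registered_voice)
        return True
    return False
```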
- the speaker identification unit 105 is connected to the dictionary registration unit 104 and the speaker identification dictionary 108.
- The speaker identification unit 105 refers to the speaker identification dictionary 108 based on the identification target speech input from the terminal 200, and identifies which registered speaker the identification target speech belongs to.
- the registration target text recording unit 106 is connected to a text presentation unit 101 and a registered voice evaluation unit 103.
- the registration target text recording unit 106 is a storage device (or a partial area in the storage device) and stores registration target text data.
- the text data to be registered is referred to by the text presentation unit 101.
- the voice temporary recording unit 107 is connected to the registered voice evaluation unit 103.
- the temporary audio recording unit 107 is a storage device (or a partial area in the storage device), and temporarily records the registered audio input by the terminal 200.
- the speaker identification dictionary 108 is connected to the dictionary registration unit 104 and the speaker identification unit 105.
- the speaker identification dictionary 108 is a dictionary for registering the feature amount of the registered voice for each registered speaker.
- the terminal 200 is connected to a network 300.
- the terminal 200 is communicatively connected to the speaker identification server 100 via the network 300.
- the terminal 200 includes an input device such as a microphone (not shown in FIG. 1) and an output device such as a liquid crystal display (not shown in FIG. 1).
- the terminal 200 has a transmission / reception function for transmitting / receiving information to / from the speaker identification server 100 via the network 300.
- the terminal 200 is, for example, a PC (Personal Computer), a telephone, a mobile phone, a smartphone, or the like.
- the configuration of the speaker identification system 1000 has been described above.
- the operation of the speaker identification server 100 includes two types of operations, a registration phase and an identification phase.
- the registration phase starts with a speaker registration operation performed on the terminal 200 by a registered speaker.
- the registration target text is composed of a plurality of texts.
- FIG. 3 is a diagram showing an operation flow of the registration phase of the speaker identification server 100.
- The speaker identification server 100 transmits registration target text data to the terminal 200 in response to the speaker registration request transmitted by the terminal 200 (step S11; hereinafter, steps are abbreviated simply as S).
- the text presentation unit 101 acquires registration target text data stored in advance in the registration target text recording unit 106 and provides this registration target text data to a registered speaker who is a user of the terminal 200.
- the processing in S11 corresponds to the text presentation processing (processing 1) in FIG.
- the terminal 200 receives the registration target text data provided by the text presentation unit 101, and requests a registered speaker who is a user of the terminal 200 to read out the registration target text data.
- the terminal 200 transmits the voice data as a result of reading out by the registered speaker to the speaker identification server 100 as registered speech.
- This process corresponds to the voice input process (process 2) of FIG.
- The registration target text data may be transmitted as a message from the speaker identification server 100 to the terminal 200, or the registration target text data may be printed in advance on paper (hereinafter referred to as a registration target text sheet).
- In the latter case, the registration target text sheet is printed with a number added to each registration target text, and in this step the number to be read out is transmitted from the speaker identification server 100 to the terminal 200.
- the speaker identification server 100 receives the registered voice transmitted by the terminal 200 (S12).
- The registered voice signal input from the terminal 200 to the speaker identification server 100 may be either a digital signal expressed by an encoding method such as PCM (Pulse Code Modulation) or G.729, or an analog audio signal.
- The audio signal input here may be converted prior to the processing from S13 onward. For example, the speaker identification server 100 may receive the speech signal in the G.729 encoding method and convert it to linear PCM between S12 and S13, so as to be compatible with the speech recognition processing (S13) and the dictionary registration processing (S18).
- the voice recognition unit 102 extracts extracted text data from the registered voice by voice recognition (S13).
- In S13, a known speech recognition technique is used. Some speech recognition techniques require user pre-registration (enrollment); the present invention uses a technique that does not require pre-registration.
- the processing in S13 corresponds to the text extraction processing (processing 3) in FIG.
- The registered speech evaluation unit 103 compares the extracted text data extracted by the speech recognition unit 102 with the registration target text data, and calculates a registered speech score indicating the similarity between the two for each registered speaker (S14).
- This S14 process corresponds to the comparison and score calculation process (process 4) in FIG. 2.
- FIG. 4 and 5 are diagrams for explaining the score calculation processing by the registered voice evaluation unit 103.
- Fig. 4 shows the case where the text data to be registered is Japanese.
- the upper part of FIG. 4 shows [A] registration target text data as correct text.
- the lower part of FIG. 4 shows a text extraction result (extracted text data) from [B] registered speech.
- the speech recognition result [B] is expressed as a kana-kanji mixed sentence in units of words using a dictionary.
- the registration target text [A] used as the correct text is recorded in the registration target text recording unit 106 in a state of being divided into units of words beforehand.
- the registered speech evaluation unit 103 compares the registration target text data [A] and the extracted text data [B] for each word.
- Then, based on the comparison result, the registered speech evaluation unit 103 calculates a registered speech score as the ratio of the number of words correctly recognized in the extracted text data [B] to the total number of words in the registration target text data [A].
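A minimal sketch of the word-level score in S14, assuming a simple position-by-position comparison of the two word sequences (a real system would more likely align them with an edit-distance alignment; the function name is hypothetical):

```python
def registered_speech_score(target_text, extracted_text):
    """Ratio of registration target words that match the recognized words,
    compared position by position after splitting into word units."""
    target_words = target_text.split()
    extracted_words = extracted_text.split()
    matched = sum(1 for t, e in zip(target_words, extracted_words) if t == e)
    return matched / len(target_words)
```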
- Fig. 5 shows the case where the registration target text is English.
- the upper part of FIG. 5 shows [A] registration target text data as correct text.
- the lower part of FIG. 5 shows a text extraction result (extracted text data) from [B] registered speech.
- the dictionary registration unit 104 determines whether or not the registered voice score calculated by the registered voice evaluation unit 103 is larger than a predetermined threshold (reference value) (S15).
- When the registered voice score is larger than the threshold, the dictionary registration unit 104 records the registered speech in the temporary voice recording unit 107 (S16).
- Otherwise, the speaker identification server 100 repeats the processing from S11 onward.
- the speaker identification server 100 determines whether or not the registered voice corresponding to all the registration target text data is stored in the voice temporary recording unit 107 for the registration target user (registered speaker) (S17).
- When the registered voices corresponding to all the registration target text data have been stored, the dictionary registration unit 104 registers the feature amounts of the registered voices in the speaker identification dictionary 108 (S18). This S18 corresponds to the dictionary registration process (process 5) of FIG. 2.
- Otherwise, the speaker identification server 100 returns to the processing of S11 and performs the processing on the remaining registration target text data.
- FIG. 6 is a diagram illustrating information stored in the audio temporary recording unit 107.
- In the state shown in FIG. 6, the speaker identification server 100 repeatedly performs the processing from S11 onward, with each of the registration target text data 3 to 5 as the target.
- FIG. 7 is a diagram showing an operation flow of the identification phase of the speaker identification server 100.
- The identification phase of the speaker identification server 100 proceeds separately from the registration phase of FIG. 3, as follows.
- the speaker identification server 100 receives a speaker identification request transmitted from the terminal 200 (S21).
- the speaker identification request includes voice data (identification target voice) recorded by the terminal 200 as a parameter.
- the speaker identification unit 105 of the speaker identification server 100 identifies the registered speaker with reference to the speaker identification dictionary 108 (S22). That is, the speaker identification unit 105 collates the feature amount of the identification target speech obtained in S21 with the feature amount of the registered speech registered in the speaker identification dictionary 108. Thereby, the speaker identification unit 105 determines whether or not the identification target voice matches the registered voice of any user ID (Identifier) in the speaker identification dictionary 108.
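The collation in S22 can be sketched as a nearest-neighbor search over the dictionary. The cosine-similarity measure and the threshold are assumptions for illustration; the actual matching method (for example, an SVM classifier, as mentioned regarding Patent Document 1) is not fixed by this step.

```python
import math

def identify_speaker(target_features, dictionary, threshold=0.8):
    """Return the user ID whose registered feature vector is most similar to
    the identification target features, or None when nothing clears the threshold."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    best_id, best_score = None, threshold
    for user_id, feature_vectors in dictionary.items():
        score = max(cosine(target_features, f) for f in feature_vectors)
        if score > best_score:
            best_id, best_score = user_id, score
    return best_id
```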
- the speaker identification server 100 transmits the identification result of the speaker identification unit 105 to the terminal 200 (S23).
- the speaker identification server 100 (speaker identification device) according to the first embodiment of the present invention includes the speech recognition unit 102, the registered speech evaluation unit 103, and the dictionary registration unit 104.
- the voice recognition unit 102 extracts text data corresponding to the registered voice as extracted text data.
- the registered voice is a voice that is input by reading out registration target text data, which is text data set in advance, by a registered speaker.
- the registered speech evaluation unit 103 calculates a score (registered speech score) indicating the similarity between the extracted text data and the registration target text data for each registered speaker.
- the dictionary registration unit 104 registers the feature amount of the registered speech in the speaker identification dictionary 108 for registering the feature amount of the registered speech for each registered speaker according to the evaluation result of the registered speech evaluation unit 103.
- That is, in the speaker identification server 100 (speaker identification device), text extraction is performed from the registered speech obtained by the registered speaker reading out the registration target text data, and the feature amount of the registered speech is registered in the speaker identification dictionary 108 based on the calculation result of the score indicating the similarity between the extracted text data (text extraction result) and the registration target text data.
- In other words, the registered voice evaluation unit 103 calculates a score (registered voice score) indicating the similarity between the extracted text data and the registration target text data, and the dictionary registration unit 104 registers the registered voice feature quantity in the speaker identification dictionary 108 for each registered speaker according to the evaluation result of the registered voice evaluation unit 103.
- The registered voice is registered in the speaker identification dictionary 108 when the evaluation result of the registered voice evaluation unit 103 is favorable, but is not registered when the evaluation result is unfavorable. Therefore, only registered speech of sufficient quality can be registered in the speaker identification dictionary 108. As a result, it is possible to suppress identification errors caused by registered speech of insufficient quality.
- As described above, according to the speaker identification server 100 (speaker identification device) in the first embodiment of the present invention, it is possible to suppress identification errors caused by registered speech of insufficient quality and to identify the speaker accurately and stably. Therefore, unlike the evaluation technique described in Patent Document 2, cases in which another person is mistakenly determined to be the same person, or in which the person cannot be identified, are reduced.
- The dictionary registration unit 104 registers the feature amount of the registered voice in the speaker identification dictionary 108 when the score (registered speech score) is greater than a predetermined reference value.
- By quantitatively determining the score (registered speech score) that serves as the criterion for registering the feature amount of the registered speech, the quality of the registered speech registered in the speaker identification dictionary 108 can be increased. Therefore, it is possible to more effectively suppress identification errors caused by registered speech of insufficient quality, and to identify speakers more stably and accurately.
- the speaker identification server 100 in the first embodiment of the present invention includes a text presentation unit 101.
- the text presentation unit 101 provides registration target text data to a registered speaker. Thereby, registration object text data can be provided to a registered speaker more smoothly.
- The registered speech evaluation unit 103 may calculate a score (registered speech score) indicating the similarity between the extracted text data and the registration target text data for each word. This allows the extracted text data and the registration target text data to be compared with higher accuracy.
- In this case, the dictionary registration unit 104 registers the feature amount of the registered voice in the speaker identification dictionary 108 when all of the per-word scores are larger than a predetermined reference value. Thereby, the quality of the registered voice registered in the speaker identification dictionary 108 can be further improved.
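The per-word check could be sketched as below; the 0/1 per-word scoring and the function names are hypothetical simplifications of "a score for each word":

```python
def per_word_scores(target_words, extracted_words):
    """One score per registration target word: 1.0 on a match, 0.0 otherwise."""
    return [1.0 if t == e else 0.0 for t, e in zip(target_words, extracted_words)]

def all_words_pass(target_text, extracted_text, reference_value=0.5):
    """True only when every per-word score is larger than the reference value."""
    target_words = target_text.split()
    extracted_words = extracted_text.split()
    if len(extracted_words) < len(target_words):
        return False  # some target words were not recognized at all
    scores = per_word_scores(target_words, extracted_words)
    return bool(scores) and all(s > reference_value for s in scores)
```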
- the feature amount registration method for registered speech for speaker identification in the first exemplary embodiment of the present invention includes a speech recognition step, a registered speech evaluation step, and a dictionary registration step.
- the speech recognition step text data corresponding to the registered speech is extracted as extracted text data.
- the registered voice is a voice that is input by reading out registration target text data, which is text data set in advance, by a registered speaker.
- In the registered speech evaluation step, a score (registered speech score) indicating the similarity between the extracted text data and the registration target text data is calculated for each registered speaker.
- the dictionary registration step the feature amount of the registered speech is registered in the speaker identification dictionary for registering the feature amount of the registered speech for each registered speaker according to the evaluation result of the registered speech evaluation step. Also by this method, the same effect as that of the speaker identification server 100 (speaker identification device) described above can be obtained.
- The registered voice feature amount registration program for speaker identification of the present invention causes a computer to execute processing including the aforementioned speech recognition step, registered speech evaluation step, and dictionary registration step.
- This program can provide the same effect as that of the speaker identification server 100 (speaker identification device) described above.
- the storage medium stores a program that causes a computer to execute processing including the above-described speech recognition step, the above-described registration speech evaluation step, and the above-described dictionary registration step. Also with this storage medium, the same effects as those of the speaker identification server 100 (speaker identification device) described above can be obtained.
- a comparison between the text data extracted from the registered speech by speech recognition and the registration target text data serving as the correct text is used as the evaluation criterion for the registered speech.
- the registration target text data serving as the correct text refers to the registration target text data in S11 of FIG.
- the types of phonemes included in the registered speech (e.g., a, i, u, e, o, k, s, ...) are used as the evaluation criterion for the registered speech. Specifically, the number of appearances of each phoneme extracted by speech recognition of the registered speech is counted; if the number of appearances reaches a reference number (for example, 5) for all phoneme types, the registered speech is determined to contain sufficient information. If this condition is not met, the user is requested to input additional registered speech, and it may then be determined whether the phoneme counts, added to those of the previous registered speech, reach the reference number (reference phoneme count).
- the registered speech evaluation unit compares the number of phonemes included in the extracted text data with a preset reference phoneme count.
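A minimal sketch of the phoneme-coverage check above, under stated assumptions: the phoneme inventory `PHONEMES` is the illustrative subset named in the text, `REFERENCE_COUNT` is the example value of 5, and the `previous_counts` parameter is a hypothetical way to model the additional-input step where earlier phoneme counts are carried forward.

```python
from collections import Counter

REFERENCE_COUNT = 5  # the "reference number" given as an example in the text
PHONEMES = ["a", "i", "u", "e", "o", "k", "s"]  # illustrative subset from the text

def has_sufficient_phonemes(phoneme_sequence, previous_counts=None):
    """Return True if every phoneme type appears at least REFERENCE_COUNT times.

    `previous_counts` (a mapping of phoneme -> count) adds counts from earlier
    registered speech, mirroring the additional-input determination described above.
    """
    counts = Counter(previous_counts or {})
    counts.update(p for p in phoneme_sequence if p in PHONEMES)
    return all(counts[p] >= REFERENCE_COUNT for p in PHONEMES)
```

For example, a recording covering only the phoneme "a" fails the check, but a follow-up recording whose counts are merged with the earlier ones can pass it, which is the behavior the additional-input step is designed to allow.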
- FIG. 8 is a diagram showing the configuration of the speaker identification server 100A according to the third embodiment of the present invention.
- constituent elements equivalent to those shown in FIGS. 1 to 7 are denoted by the same reference numerals as those shown in FIGS.
- the speaker identification server 100A includes a voice recognition unit 102, a registered voice evaluation unit 103, and a dictionary registration unit 104.
- the speech recognition unit 102, the registered speech evaluation unit 103, and the dictionary registration unit 104 are connected to each other.
- the voice recognition unit 102, the registered voice evaluation unit 103, and the dictionary registration unit 104 are the same as the corresponding components included in the speaker identification server 100 in the first embodiment. That is, the speaker identification server 100A is configured from only a subset of the components of the speaker identification server 100.
- the voice recognition unit 102 extracts text data corresponding to the registered voice as extracted text data.
- the registered speech is speech input by a registered speaker reading aloud the registration target text data, which is text data set in advance.
- the registered voice evaluation unit 103 calculates a score indicating the degree of similarity between the extracted text data and the registration target text data for each registered speaker.
- the dictionary registration unit 104 registers the feature amount of the registered speech in the speaker identification dictionary for registering the feature amount of the registered speech for each registered speaker according to the evaluation result of the registered speech evaluation unit 103.
- the speaker identification server 100A (speaker identification device) according to the third embodiment of the present invention includes the voice recognition unit 102, the registered voice evaluation unit 103, and the dictionary registration unit 104.
- the voice recognition unit 102 extracts text data corresponding to the registered voice as extracted text data.
- the registered speech is speech input by a registered speaker reading aloud the registration target text data, which is text data set in advance.
- the registered speech evaluation unit 103 calculates a score (registered speech score) indicating the similarity between the extracted text data and the registration target text data for each registered speaker.
- the dictionary registration unit 104 registers the feature amount of the registered speech in the speaker identification dictionary for registering the feature amount of the registered speech for each registered speaker according to the evaluation result of the registered speech evaluation unit 103.
- in the speaker identification server 100A (speaker identification device), text extraction is performed on the registered speech obtained by the registered speaker reading aloud the registration target text data. Then, based on the calculated score indicating the similarity between the extracted text data, which is the text extraction result, and the registration target text data, the feature amount of the registered speech is registered in the speaker identification dictionary.
- when the extracted text data coincides with the registration target text data at a high rate, it can be estimated that the registered speech corresponding to the extracted text data is clearly pronounced and that its noise level is sufficiently low.
- accordingly, the registered speech evaluation unit 103 calculates a score (registered speech score) indicating the similarity between the extracted text data and the registration target text data, and the dictionary registration unit 104, according to the evaluation result of the registered speech evaluation unit 103, registers the feature amount of the registered speech in the speaker identification dictionary for each registered speaker.
- registered speech whose evaluation result from the registered speech evaluation unit 103 is favorable is registered in the speaker identification dictionary, whereas registered speech whose evaluation result is unfavorable is not. Therefore, only registered speech of sufficiently high quality is registered in the speaker identification dictionary. As a result, identification errors caused by low-quality registered speech can be suppressed.
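The register-only-when-favorable flow above can be sketched as follows. This is a hypothetical illustration: the patent does not specify how the similarity score is computed, so `difflib.SequenceMatcher` stands in as one plausible text-similarity measure, and the threshold, function names, and dictionary representation are all assumptions.

```python
import difflib

THRESHOLD = 0.9  # assumed reference value for a "favorable" evaluation

def registration_score(extracted_text, target_text):
    """Similarity between the speech-recognition output and the registration
    target text; SequenceMatcher.ratio() returns a value in [0.0, 1.0]."""
    return difflib.SequenceMatcher(None, extracted_text, target_text).ratio()

def register_if_favorable(dictionary, speaker_id, features, extracted, target):
    """Store the speaker's feature amount only when the score clears the
    threshold; otherwise leave the dictionary untouched."""
    if registration_score(extracted, target) > THRESHOLD:
        dictionary[speaker_id] = features
        return True
    return False

speaker_dict = {}
# Clean recognition result: registered.
register_if_favorable(speaker_dict, "alice", [0.1, 0.2],
                      "good morning everyone", "good morning everyone")
# Garbled recognition result: rejected, so the low-quality sample never
# degrades later identification.
register_if_favorable(speaker_dict, "bob", [0.3, 0.4],
                      "zzz", "good morning everyone")
```

The point of the gate is asymmetry: a rejected sample costs only a re-recording request, while an accepted low-quality sample would silently degrade every later identification against that speaker.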
- according to the speaker identification server 100A (speaker identification device) in the third embodiment of the present invention, identification errors caused by low-quality registered speech can be suppressed, and the speaker can be identified accurately and stably. Therefore, unlike the evaluation technique described in Patent Document 2, the cases in which another person is mistakenly determined to be the same person, or in which the person cannot be identified at all, are reduced.
- the speaker identification technology in the first to third embodiments of the present invention can be used in all application fields of speaker identification. Specific examples include: (1) a service that identifies the other party from the call voice in voice calls such as telephone calls; (2) a device that manages entry into and exit from a building or room using voice characteristics; (3) a service that extracts pairs of speaker names and utterance content as text from teleconferences, video conferences, and video works.
- The comparison between Patent Documents 3 to 5 and the present invention is as follows.
- Patent Document 3 discloses a technique for calculating a score based on a comparison between a speech recognition result (the text obtained by speech recognition) and a correct text (the text used as a reference for comparison), together with the recognition reliability (particularly paragraphs [0009], [0011], and [0013]). However, the technique described in Patent Document 3 is a general method for evaluating speech recognition results and is not directly related to the present invention. Patent Document 3 further discloses a process in which, when the calculated score is less than a threshold, speaker registration learning is applied: the speaker to be registered is prompted to utter a specific word, and the pronunciation dictionary is updated using the result.
- Patent Document 3 does not disclose at least a technique in which, like the registered speech evaluation unit 103, a score (registered speech score) indicating the similarity between the extracted text data and the registration target text data is calculated for each word for each registered speaker.
- moreover, short sounds in units of words are not sequentially registered in the identification dictionary; rather, speech having a certain length (typically about several minutes) must be registered at once.
- Japanese Patent Application Laid-Open No. 2004-151867 discloses an operation in which speech uttered by a user and the text corresponding to that speech are input, and the correspondence between the speech feature amount, from which speaker characteristics have been removed, and the text is stored in a recognition dictionary.
- it also discloses a process for specifying a normalization parameter to be applied to a speech signal subject to speech recognition, using a speaker label that is the result of speaker recognition (particularly paragraph [0040]).
- Patent Document 4 does not disclose a technique in which, like the registered speech evaluation unit 103, a score (registered speech score) indicating the similarity between the extracted text data and the registration target text data is calculated for each word for each registered speaker.
- Patent Document 5 discloses an operation of presenting a random text to a newly registered user, prompting the corresponding voice input, and creating a personal dictionary using the result (particularly paragraph [0016]).
- it also discloses an operation of calculating a matching score, which is the result of matching voice data against an unspecified-speaker voice dictionary, and registering it as part of the personal dictionary (particularly paragraph [0022]).
- however, Patent Document 5 does not disclose a technique for presenting a plurality of partial texts to the same speaker.
- Patent Document 5 further discloses an operation for determining whether or not the user is the person in question based on the magnitude relationship between a normalized score and a threshold (particularly paragraph [0024]). This is a general operation in speaker verification (corresponding to the "identification phase" of the technique described with reference to FIG. 8 of the present application).
Abstract
Description
The configuration of the speaker identification system 1000 including the speaker identification server 100 in the first exemplary embodiment of the present invention will be described.
Next, the configuration of the speaker identification server in the second exemplary embodiment of the present invention will be described.
The configuration of the speaker identification server 100A in the third exemplary embodiment of the present invention will be described. FIG. 8 is a diagram showing the configuration of the speaker identification server 100A in the third exemplary embodiment of the present invention. In FIG. 8, constituent elements equivalent to those shown in FIGS. 1 to 7 are denoted by the same reference numerals as in FIGS. 1 to 7.
101 Text presentation unit
102 Speech recognition unit
103 Registered speech evaluation unit
104 Dictionary registration unit
105 Speaker identification unit
106 Registration target text recording unit
107 Temporary speech recording unit
108 Speaker identification dictionary
200 Terminal
300 Network
Claims (8)
- A speaker identification device comprising: speech recognition means for extracting, as extracted text data, text data corresponding to registered speech, the registered speech being speech input by a registered speaker reading aloud registration target text data, which is text data set in advance;
registered speech evaluation means for calculating, for each registered speaker, a score indicating the similarity between the extracted text data and the registration target text data; and
dictionary registration means for registering, according to an evaluation result of the registered speech evaluation means, a feature amount of the registered speech in a speaker identification dictionary for registering the feature amount of the registered speech for each registered speaker.
- The speaker identification device according to claim 1, wherein the dictionary registration means registers the feature amount of the registered speech in the speaker identification dictionary when the score is larger than a predetermined reference value.
- The speaker identification device according to claim 1 or 2, further comprising text providing means for providing the registration target text data to the registered speaker.
- The speaker identification device according to any one of claims 1 to 3, wherein the registered speech evaluation means calculates, for each word and for each registered speaker, a score indicating the similarity between the extracted text data and the registration target text data.
- The speaker identification device according to claim 4, wherein the dictionary registration means registers the feature amount of the registered speech in the speaker identification dictionary when all of the per-word scores are larger than a predetermined reference value.
- The speaker identification device according to claim 1, wherein the registered speech evaluation means compares the number of phonemes included in the extracted text data with a preset reference phoneme count.
- A feature amount registration method for registered speech for speaker identification, comprising: extracting, as extracted text data, text data corresponding to registered speech, the registered speech being speech input by a registered speaker reading aloud registration target text data, which is text data set in advance;
calculating, for each registered speaker, a score indicating the similarity between the extracted text data and the registration target text data; and
registering, according to the score calculation result, a feature amount of the registered speech in a speaker identification dictionary for registering the feature amount of the registered speech for each registered speaker.
- A storage medium storing a program that causes a computer to execute processing comprising: extracting, as extracted text data, text data corresponding to registered speech, the registered speech being speech input by a registered speaker reading aloud registration target text data, which is text data set in advance;
calculating, for each registered speaker, a score indicating the similarity between the extracted text data and the registration target text data; and
registering, according to the score calculation result, the feature amount of the registered speech in a speaker identification dictionary for registering the feature amount of the registered speech for each registered speaker.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/534,545 US20170323644A1 (en) | 2014-12-11 | 2015-12-07 | Speaker identification device and method for registering features of registered speech for identifying speaker |
JP2016563500A JP6394709B2 (ja) | 2014-12-11 | 2015-12-07 | 話者識別装置および話者識別用の登録音声の特徴量登録方法 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2014-250835 | 2014-12-11 | ||
JP2014250835 | 2014-12-11 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016092807A1 true WO2016092807A1 (ja) | 2016-06-16 |
Family
ID=56107027
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2015/006068 WO2016092807A1 (ja) | 2014-12-11 | 2015-12-07 | 話者識別装置および話者識別用の登録音声の特徴量登録方法 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20170323644A1 (ja) |
JP (1) | JP6394709B2 (ja) |
WO (1) | WO2016092807A1 (ja) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020084741A1 (ja) | 2018-10-25 | 2020-04-30 | 日本電気株式会社 | 音声処理装置、音声処理方法、及びコンピュータ読み取り可能な記録媒体 |
WO2023228542A1 (ja) * | 2022-05-27 | 2023-11-30 | パナソニックIpマネジメント株式会社 | 認証システムおよび認証方法 |
WO2024009465A1 (ja) * | 2022-07-07 | 2024-01-11 | パイオニア株式会社 | 音声認識装置、プログラム、音声認識方法、及び音声認識システム |
Families Citing this family (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106295299A (zh) * | 2016-08-15 | 2017-01-04 | 歌尔股份有限公司 | 一种智能机器人的用户注册方法和装置 |
KR20180082033A (ko) * | 2017-01-09 | 2018-07-18 | 삼성전자주식회사 | 음성을 인식하는 전자 장치 |
WO2019002831A1 (en) | 2017-06-27 | 2019-01-03 | Cirrus Logic International Semiconductor Limited | REPRODUCTIVE ATTACK DETECTION |
GB2563953A (en) | 2017-06-28 | 2019-01-02 | Cirrus Logic Int Semiconductor Ltd | Detection of replay attack |
GB201713697D0 (en) | 2017-06-28 | 2017-10-11 | Cirrus Logic Int Semiconductor Ltd | Magnetic detection of replay attack |
GB201801528D0 (en) | 2017-07-07 | 2018-03-14 | Cirrus Logic Int Semiconductor Ltd | Method, apparatus and systems for biometric processes |
GB201801530D0 (en) | 2017-07-07 | 2018-03-14 | Cirrus Logic Int Semiconductor Ltd | Methods, apparatus and systems for authentication |
GB201801526D0 (en) | 2017-07-07 | 2018-03-14 | Cirrus Logic Int Semiconductor Ltd | Methods, apparatus and systems for authentication |
GB201801532D0 (en) | 2017-07-07 | 2018-03-14 | Cirrus Logic Int Semiconductor Ltd | Methods, apparatus and systems for audio playback |
GB201801527D0 (en) | 2017-07-07 | 2018-03-14 | Cirrus Logic Int Semiconductor Ltd | Method, apparatus and systems for biometric processes |
GB201804843D0 (en) | 2017-11-14 | 2018-05-09 | Cirrus Logic Int Semiconductor Ltd | Detection of replay attack |
GB201801664D0 (en) | 2017-10-13 | 2018-03-21 | Cirrus Logic Int Semiconductor Ltd | Detection of liveness |
GB2567503A (en) | 2017-10-13 | 2019-04-17 | Cirrus Logic Int Semiconductor Ltd | Analysing speech signals |
GB201803570D0 (en) | 2017-10-13 | 2018-04-18 | Cirrus Logic Int Semiconductor Ltd | Detection of replay attack |
GB201801663D0 (en) * | 2017-10-13 | 2018-03-21 | Cirrus Logic Int Semiconductor Ltd | Detection of liveness |
GB201801874D0 (en) | 2017-10-13 | 2018-03-21 | Cirrus Logic Int Semiconductor Ltd | Improving robustness of speech processing system against ultrasound and dolphin attacks |
GB201801661D0 (en) * | 2017-10-13 | 2018-03-21 | Cirrus Logic International Uk Ltd | Detection of liveness |
GB201801659D0 (en) | 2017-11-14 | 2018-03-21 | Cirrus Logic Int Semiconductor Ltd | Detection of loudspeaker playback |
US11475899B2 (en) | 2018-01-23 | 2022-10-18 | Cirrus Logic, Inc. | Speaker identification |
US11264037B2 (en) | 2018-01-23 | 2022-03-01 | Cirrus Logic, Inc. | Speaker identification |
US11735189B2 (en) | 2018-01-23 | 2023-08-22 | Cirrus Logic, Inc. | Speaker identification |
US10733996B2 (en) * | 2018-03-30 | 2020-08-04 | Qualcomm Incorporated | User authentication |
US10720166B2 (en) * | 2018-04-09 | 2020-07-21 | Synaptics Incorporated | Voice biometrics systems and methods |
US10818296B2 (en) * | 2018-06-21 | 2020-10-27 | Intel Corporation | Method and system of robust speaker recognition activation |
US10692490B2 (en) | 2018-07-31 | 2020-06-23 | Cirrus Logic, Inc. | Detection of replay attack |
US10915614B2 (en) | 2018-08-31 | 2021-02-09 | Cirrus Logic, Inc. | Biometric authentication |
US11037574B2 (en) | 2018-09-05 | 2021-06-15 | Cirrus Logic, Inc. | Speaker recognition and speaker change detection |
KR20200129346A (ko) * | 2019-05-08 | 2020-11-18 | 삼성전자주식회사 | 디스플레이 장치 및 이의 제어 방법 |
US11355136B1 (en) * | 2021-01-11 | 2022-06-07 | Ford Global Technologies, Llc | Speech filtering in a vehicle |
JP7109113B1 (ja) * | 2021-05-20 | 2022-07-29 | 力 森 | 識別システム装置 |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH10214095A (ja) * | 1997-01-29 | 1998-08-11 | Nec Corp | 話者認識装置 |
JPH11344992A (ja) * | 1998-06-01 | 1999-12-14 | Ntt Data Corp | 音声辞書作成方法、個人認証装置および記録媒体 |
US6064957A (en) * | 1997-08-15 | 2000-05-16 | General Electric Company | Improving speech recognition through text-based linguistic post-processing |
US20040049385A1 (en) * | 2002-05-01 | 2004-03-11 | Dictaphone Corporation | Systems and methods for evaluating speaker suitability for automatic speech recognition aided transcription |
JP2007248730A (ja) * | 2006-03-15 | 2007-09-27 | Nippon Telegr & Teleph Corp <Ntt> | 音響モデル適応装置、音響モデル適応方法、音響モデル適応プログラム及び記録媒体 |
WO2007111197A1 (ja) * | 2006-03-24 | 2007-10-04 | Pioneer Corporation | 話者認識システムにおける話者モデル登録装置及び方法、並びにコンピュータプログラム |
JP2009210829A (ja) * | 2008-03-04 | 2009-09-17 | Nippon Hoso Kyokai <Nhk> | 音響モデル学習装置およびプログラム |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4363102A (en) * | 1981-03-27 | 1982-12-07 | Bell Telephone Laboratories, Incorporated | Speaker identification system using word recognition templates |
JP2003044445A (ja) * | 2001-08-02 | 2003-02-14 | Matsushita Graphic Communication Systems Inc | 認証システム、サービス提供サーバ装置および音声認証装置並びに認証方法 |
JP2007052496A (ja) * | 2005-08-15 | 2007-03-01 | Advanced Media Inc | ユーザ認証システム及びユーザ認証方法 |
US8694315B1 (en) * | 2013-02-05 | 2014-04-08 | Visa International Service Association | System and method for authentication using speaker verification techniques and fraud model |
2015
- 2015-12-07 JP JP2016563500A patent/JP6394709B2/ja active Active
- 2015-12-07 US US15/534,545 patent/US20170323644A1/en not_active Abandoned
- 2015-12-07 WO PCT/JP2015/006068 patent/WO2016092807A1/ja active Application Filing
Also Published As
Publication number | Publication date |
---|---|
US20170323644A1 (en) | 2017-11-09 |
JPWO2016092807A1 (ja) | 2017-08-31 |
JP6394709B2 (ja) | 2018-09-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6394709B2 (ja) | 話者識別装置および話者識別用の登録音声の特徴量登録方法 | |
AU2016216737B2 (en) | Voice Authentication and Speech Recognition System | |
US20160372116A1 (en) | Voice authentication and speech recognition system and method | |
JP4672003B2 (ja) | 音声認証システム | |
US6161090A (en) | Apparatus and methods for speaker verification/identification/classification employing non-acoustic and/or acoustic models and databases | |
US9183367B2 (en) | Voice based biometric authentication method and apparatus | |
TW557443B (en) | Method and apparatus for voice recognition | |
CN109410664B (zh) | 一种发音纠正方法及电子设备 | |
AU2013203139A1 (en) | Voice authentication and speech recognition system and method | |
CN104143326A (zh) | 一种语音命令识别方法和装置 | |
CN104462912B (zh) | 改进的生物密码安全 | |
EP2879130A1 (en) | Methods and systems for splitting a digital signal | |
US20140188468A1 (en) | Apparatus, system and method for calculating passphrase variability | |
CN112309406A (zh) | 声纹注册方法、装置和计算机可读存储介质 | |
KR102394912B1 (ko) | 음성 인식을 이용한 주소록 관리 장치, 차량, 주소록 관리 시스템 및 음성 인식을 이용한 주소록 관리 방법 | |
Beigi | Challenges of LargeScale Speaker Recognition | |
US20180012602A1 (en) | System and methods for pronunciation analysis-based speaker verification | |
JP7339116B2 (ja) | 音声認証装置、音声認証システム、および音声認証方法 | |
KR101598950B1 (ko) | 발음 평가 장치 및 이를 이용한 발음 평가 방법에 대한 프로그램이 기록된 컴퓨터 판독 가능한 기록 매체 | |
CN110853674A (zh) | 文本核对方法、设备以及计算机可读存储介质 | |
JP4245948B2 (ja) | 音声認証装置、音声認証方法及び音声認証プログラム | |
CN113409774A (zh) | 语音识别方法、装置及电子设备 | |
CN111785280A (zh) | 身份认证方法和装置、存储介质和电子设备 | |
WO2006027844A1 (ja) | 話者照合装置 | |
US20180012603A1 (en) | System and methods for pronunciation analysis-based non-native speaker verification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 15868492 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2016563500 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 15534545 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 15868492 Country of ref document: EP Kind code of ref document: A1 |