WO2019156427A1

WO2019156427A1 - Method for identifying utterer on basis of uttered word and apparatus therefor, and apparatus for managing voice model on basis of context and method thereof

Info

Publication number: WO2019156427A1
Application number: PCT/KR2019/001355
Authority: WO
Inventors: 이태훈
Original assignee: 주식회사 공훈
Priority date: 2018-02-09
Filing date: 2019-01-31
Publication date: 2019-08-15

Abstract

One embodiment of the present invention may provide a method for identifying an utterer on the basis of an uttered word and an apparatus therefor. Further, an apparatus for managing a voice model on the basis of context according to one embodiment of the present invention may interwork with a text-prompted utterer identification system. The apparatus and a method thereof may be configured such that an individual voice datum generated whenever a voice is received from an utterer is stored in a storage unit, and when multiple individual voice data are stored in the storage unit, the respective individual voice data are extracted from the storage unit, the similarity between the individual voice data is estimated so as to generate a voice model, and then the voice model is managed on the basis of the context of a user's utterance.

Description

Method and apparatus for identifying a speaker based on spoken words, and apparatus and method for context-based speech model management

The present invention relates to a method and apparatus for identifying a speaker based on a spoken word. More particularly, it relates to a method and apparatus that capture the voice characteristics of a speaker (for example, a user of the device) from a spoken word, compare them with the voice characteristics stored in a database (DB) built from those characteristics, and determine that the speech pattern of the word whose voice characteristics show high similarity is the speaker's updated speech pattern.

The present invention also relates to a context-based speech model management apparatus and a method of operating the apparatus. More particularly, it relates to an apparatus that manages a speech model usable in a voice authentication system by updating the model according to the speaker's context-based speech characteristics and at predetermined intervals, and to a method of operating the apparatus.

Among biometric methods, voice is vulnerable to imitation by others and to recording and playback, and it may change from moment to moment depending on the user's pronunciation state and the time. Its use as a means of recognition and authentication may therefore be restricted. However, because voice offers the best conditions for an interface between machines and humans, its range of use is gradually increasing.

As a means of accurately recognizing and authenticating a legitimate user for voice commands, which are currently used as an interface between machines and people, the speaker's voice is used in parallel with other authentication means such as iris, fingerprint, and password. This parallel use undermines the effectiveness of authentication through voice alone.

Existing speaker identification (recognition) recognizes a user by building data from feature elements common to all voices spoken by the user, which limits how far the recognition rate for the speaker can be raised.

In addition, this conventional speaker identification method takes quite a long time to identify the speaker accurately, which causes considerable inconvenience for users who need identification (authentication) information immediately.

On the other hand, a speaker's voice is not permanent. It may change temporarily or continuously over time depending on various factors, such as aging of the vocal muscles, changes in the living environment (e.g., region, workplace), and changes in health (e.g., catching a cold).

Thus, in order to identify or authenticate a speaker through voice when the constancy of the speaker's voice is not guaranteed, the voice model used for speaker verification or authentication must be updated according to changes in the speaker's voice.

In the related art, in order to reflect such a diversity of speech of a user, a method of classifying a specific user by detecting an accent of the user and the like has been studied. However, these conventional speech recognition methods have a disadvantage in that they cannot effectively track and manage the user's voice that changes with time or environment. In other words, the conventional speech recognition method or the method for managing the speech model merely changes the speech model for the speaker by analyzing the speaker's speech characteristics without considering the environment in which the speaker is placed.

With the emergence and spread of various methods of controlling electronic devices through voice, up-to-date voice models must be managed so that the user's voice is accurately recognized (identified) and appropriate actions (e.g., user authentication) are performed accordingly.

The present invention has been made as a countermeasure to the above-described problem, and is intended to enhance the effectiveness of speech recognition and authentication by increasing the accuracy of speech recognition and speaker identification (eg, authentication, etc.) for the speaker.

In other words, because the speaker's voice tone may change temporarily or for a period of time depending on the speaker's emotion, the surrounding environment (e.g., noise), or the speaker's state of health (e.g., a throat ailment), the present invention provides a method and apparatus that improve identification accuracy by reflecting the possibility of such voice changes in the speaker identification process.

In addition, as a countermeasure to the above-mentioned problem, the present invention provides a method and apparatus for updating a user's context (word) speech model in a matrix DB, which holds the user's context (word) speech models usable in a context (word) presentation system (one implementation of a voice authentication system), in consideration of whether and to what degree the voice input from the speaker has changed.

In one embodiment of the present invention, a method and apparatus for identifying a speaker based on a spoken word can be provided.

According to an embodiment of the present invention, a method for identifying a speaker based on a spoken word may include: receiving a spoken voice from a speaker; extracting a word included in the received voice and voice information of the word; searching for the word in a pre-built database (DB); if the word does not exist in the DB, adding the word and the voice information of the word to the DB, and if the word exists in the DB, comparing the voice information of the spoken word with each piece of reference voice information stored in the DB; estimating the similarity according to the comparison with each piece of reference voice information; determining an utterance pattern for the speaker's word based on the number of times voice information corresponding to the estimated similarity is received; and identifying the speaker based on the determined utterance pattern.

The voice information of the word according to an embodiment of the present invention may include at least one of a frequency, pitch, formant, speech time, and speech speed of the speech.

In the comparing step, it is determined whether the voice information of the word spoken by the speaker is similar to at least one piece of reference voice information stored in the DB, and in the similarity estimating step, the similarity is estimated according to the determination result. When the estimated similarity is less than a first reference value, new reference voice information is generated and stored in the DB. When the estimated similarity is greater than or equal to the first reference value, the matching count of the reference voice information having that similarity may be incremented.

In the step of determining a speech pattern for the speaker's word according to an embodiment of the present invention, when the matching count is less than a second reference value, a new voice spoken by the speaker is received and the similarity estimation is repeated. When the matching count is equal to or greater than the second reference value, the corresponding voice information may be determined as the speech pattern for the speaker's word.

In the step of determining the speech pattern according to an embodiment of the present invention, the speech pattern is determined by establishing a speech model of the speaker from the voice information whose matching count is greater than or equal to the second reference value. In the identifying step, the speaker of a spoken voice may be identified based on the speech pattern determined for that voice through the above-described steps.
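The pattern determination and identification steps above can be sketched as follows. This is only an illustration under stated assumptions: the source does not define how similarity is computed or what the second reference value is, so `similarity`, `SECOND_REFERENCE`, the `Reference` record, and the two-element feature tuples (pitch in Hz, speech time in seconds) are all hypothetical.

```python
from dataclasses import dataclass

SECOND_REFERENCE = 3  # assumed value; the source does not fix this number


@dataclass
class Reference:
    features: tuple       # hypothetical voice information, e.g. (pitch_hz, speech_time_s)
    match_count: int = 0  # how often similar voice information has been received


def similarity(a, b):
    """Assumed measure: 1 minus the mean relative difference, floored at 0."""
    diffs = [abs(x - y) / max(abs(x), abs(y), 1e-9) for x, y in zip(a, b)]
    return max(0.0, 1.0 - sum(diffs) / len(diffs))


def settled_pattern(references, second_reference=SECOND_REFERENCE):
    """Return the most-matched reference once its count reaches the second reference value."""
    best = max(references, key=lambda r: r.match_count, default=None)
    if best is None or best.match_count < second_reference:
        return None  # not enough evidence yet; keep collecting utterances
    return best


def identify(word, features, db):
    """Return the speaker whose settled pattern for `word` is most similar, or None."""
    best_name, best_score = None, 0.0
    for name, words in db.items():
        pattern = settled_pattern(words.get(word, []))
        if pattern is None:
            continue
        score = similarity(features, pattern.features)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Made-up per-speaker, per-word reference voice information.
db = {
    "alice": {"open": [Reference((210.0, 0.40), 5), Reference((180.0, 0.55), 1)]},
    "bob": {"open": [Reference((120.0, 0.60), 4)]},
    "carol": {"open": [Reference((205.0, 0.41), 2)]},  # count below the second reference value
}
speaker = identify("open", (208.0, 0.41), db)
```

In this sketch a speaker whose references have not yet reached the second reference value is simply skipped, which mirrors the idea that more utterances must be collected before a pattern is settled. A real system would presumably also require a minimum similarity before accepting the best candidate.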

An apparatus for identifying a speaker based on a spoken word according to an embodiment of the present invention may include: a voice receiver that receives a spoken voice from a speaker; an information extraction unit that extracts a word contained in the received voice and voice information of the word; an information retrieval unit that searches for the word in a pre-built database (DB); a comparison unit that, if the word does not exist in the DB, adds the word and the voice information of the word to the DB and, if the word exists in the DB, compares the voice information of the spoken word with each piece of reference voice information stored in the DB; a similarity estimation unit that estimates the similarity according to the comparison with each piece of reference voice information; a speech pattern determining unit that determines a speech pattern for the speaker's word based on the number of times voice information corresponding to the estimated similarity is received; and a speaker identification unit that identifies the speaker based on the determined speech pattern.

In addition, the voice information about the word may include at least one of the frequency, pitch, formant, speech time, and speech speed of the speech.

The comparison unit determines whether the voice information of the word spoken by the speaker is similar to at least one piece of reference voice information stored in the DB, and the similarity estimation unit estimates the similarity according to the result of the determination. If the estimated similarity is less than the first reference value, new reference voice information is generated and stored in the DB. If it is greater than or equal to the first reference value, the matching count of the reference voice information having that similarity may be incremented.

If the matching count is less than the second reference value, the speech pattern determination unit receives a new voice spoken by the speaker and repeats the similarity estimation. If the matching count is equal to or greater than the second reference value, the unit may determine the corresponding voice information as the speech pattern for the speaker's word.

According to one embodiment of the present invention, the speech pattern determination unit determines the speech pattern by establishing a speech model of the speaker from the voice information whose matching count is equal to or greater than the second reference value, and the speaker identification unit may identify who the speaker is based on the speech pattern determined for the spoken voice.

Meanwhile, as an embodiment of the present invention, a computer-readable recording medium having recorded thereon a program for executing the above method on a computer may be provided.

In addition, as an embodiment of the present invention, a context-based speech model management apparatus and a method of operating the apparatus may be provided.

An apparatus for managing a context-based speech model according to an embodiment of the present invention may be linked to a context-presenting speaker identification system. The apparatus may include: a storage unit that stores individual voice data generated each time a voice is received from the speaker; a similarity estimator that, when a plurality of individual voice data are stored in the storage unit, extracts each individual voice datum from the storage unit and estimates the similarity between the individual voice data; a voice model generator that generates a first voice model of the speaker from at least one individual voice datum selected based on the similarity estimated by the similarity estimator; a determination unit that determines whether a comparison voice model corresponding to the first voice model exists in a storage unit of the context-presenting speaker identification system, provides the first voice model to that storage unit for storage if no comparison voice model exists, and, if one exists, has the similarity estimator estimate a comparison similarity between the first voice model and the comparison voice model; and a voice model editing unit that replaces the comparison voice model with the first voice model when the estimated comparison similarity is equal to or greater than a predetermined reference value and generates a second voice model by combining the first voice model and the comparison voice model when it is less than the predetermined reference value. The second voice model may be provided to the determination unit and the voice model editing unit.

In addition, the context-presenting speaker identification system may include: a voice receiver that receives a voice from the speaker; a voice feature extractor that extracts voice characteristics from the received voice; a context voice model generator that generates a voice model based on the extracted voice characteristics; a storage unit in which the generated voice models are stored in matrix form; a random number generator that generates random numbers to be used for identifying a speaker; a voice model extraction unit that extracts the voice model at the position corresponding to the generated random number in the matrix-form voice model DB of the storage unit; an utterance requesting unit that asks the speaker for a predetermined utterance based on the extracted voice model; and a speaker identification unit that identifies the speaker by comparing the voice spoken by the speaker with the extracted voice model. The predetermined utterance may be a word or sentence set in advance at the position, in the matrix-form DB of the storage unit, that corresponds to the generated random number.
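As a rough illustration of the random-number-driven prompt selection described above, the following sketch maps a random number to a cell of a small matrix DB and checks a response against the model stored there. The matrix contents, the feature tuples, the `similarity` measure, and the 0.9 acceptance threshold are assumptions made for the example and are not taken from the source.

```python
import random

# Hypothetical 2x3 matrix DB: each cell holds a prompt word and the enrolled
# voice model for that word (a made-up feature tuple: pitch in Hz, speech time in s).
MATRIX_DB = [
    [("apple", (210.0, 0.42)), ("river", (198.0, 0.51)), ("seven", (205.0, 0.47))],
    [("window", (201.0, 0.58)), ("orange", (207.0, 0.55)), ("garden", (199.0, 0.60))],
]


def similarity(a, b):
    """Assumed measure: 1 minus the mean relative difference, floored at 0."""
    diffs = [abs(x - y) / max(abs(x), abs(y), 1e-9) for x, y in zip(a, b)]
    return max(0.0, 1.0 - sum(diffs) / len(diffs))


def pick_prompt(rng):
    """Draw a random number and map it to a (row, col) position in the matrix DB."""
    rows, cols = len(MATRIX_DB), len(MATRIX_DB[0])
    number = rng.randrange(rows * cols)
    row, col = divmod(number, cols)
    word, model = MATRIX_DB[row][col]
    return (row, col), word, model


def verify(spoken_word, spoken_features, word, model, threshold=0.9):
    """Accept only if the requested word was spoken and its features match the model."""
    if spoken_word != word:
        return False  # wrong text, e.g. a replayed recording of another word
    return similarity(spoken_features, model) >= threshold


rng = random.Random()
position, word, model = pick_prompt(rng)
accepted = verify(word, model, word, model)  # an ideal response matches exactly
```

Because the prompt changes with every draw, a recording of one word is unlikely to match the next request, which is the property the context-presenting approach relies on.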

The individual voice data according to an embodiment of the present invention may include at least one of the frequency, pitch, formant, speech time, and speech rate of each utterance of the speaker, and the similarity estimating unit of the context-based speech model management apparatus may estimate the similarity between individual voice data for each utterance of the speaker.

In addition, the apparatus according to an embodiment of the present invention may further include a period setting unit that sets a management period for the voice models. If all voice models are updated within the set management period, the voice model editing unit maintains the existing matrix-form voice model DB in the storage unit of the context-presenting speaker identification system. If at least one voice model is not updated within the set management period, the voice model editing unit may delete or maintain part of the existing matrix-form voice model DB based on a new first voice model associated with the speaker.

If no new first voice model associated with the speaker exists, the voice model editing unit deletes the at least one unupdated voice model from the matrix-form voice model DB. If a new first voice model exists, the at least one unupdated voice model is compared with the new first voice model. If the difference is within a predetermined range, the voice model editing unit maintains the existing matrix-form voice model DB in the storage unit of the context-presenting speaker identification system; if it is out of the range, the at least one unupdated voice model may be deleted from the matrix-form voice model DB.

A method of managing a speech model using a context-based speech model management apparatus according to an embodiment of the present invention may include the steps of: (a) generating and storing individual voice data each time a voice is received from a speaker; (b) when a plurality of individual voice data are stored, extracting each individual voice datum and estimating the similarity between the individual voice data; (c) generating a first voice model of the speaker from at least one individual voice datum selected based on the estimated similarity; (d) determining whether a comparison voice model corresponding to the first voice model exists in the storage unit of the context-presenting speaker identification system, and, if not, providing the first voice model to that storage unit for storage, or, if so, estimating a comparison similarity between the first voice model and the comparison voice model through the similarity estimator; and (e) replacing the comparison voice model with the first voice model when the comparison similarity is greater than or equal to a predetermined reference value, and generating a second voice model by combining the first voice model and the comparison voice model when it is less than the predetermined reference value. In addition, steps (d) and (e) may be repeated for the second voice model.
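Steps (b) through (e) can be sketched as below. The source does not say how samples are selected, how models are represented, or how two models are combined, so this sketch assumes feature tuples, an element-wise mean for both model generation and combination, a hypothetical `similarity` measure, and made-up threshold values; `max_rounds` is an added safeguard and is not part of the described method.

```python
def similarity(a, b):
    """Assumed measure: 1 minus the mean relative difference, floored at 0."""
    diffs = [abs(x - y) / max(abs(x), abs(y), 1e-9) for x, y in zip(a, b)]
    return max(0.0, 1.0 - sum(diffs) / len(diffs))


def build_first_model(samples, select_threshold=0.8):
    """Steps (b)-(c): keep samples that resemble the others, then average them."""
    if len(samples) < 2:
        raise ValueError("need at least two individual voice data")
    kept = []
    for i, sample in enumerate(samples):
        scores = [similarity(sample, other) for j, other in enumerate(samples) if j != i]
        if sum(scores) / len(scores) >= select_threshold:
            kept.append(sample)
    kept = kept or samples  # fall back to all samples if none stand out
    return tuple(sum(column) / len(kept) for column in zip(*kept))


def update_store(store, key, model, reference=0.95, max_rounds=10):
    """Steps (d)-(e): store if absent, replace if similar, otherwise combine and retry."""
    for _ in range(max_rounds):
        comparison = store.get(key)
        if comparison is None or similarity(model, comparison) >= reference:
            store[key] = model  # first storage, or replacement of the comparison model
            return model
        # Below the reference value: combine into a second voice model and repeat (d)-(e).
        model = tuple((m + c) / 2 for m, c in zip(model, comparison))
    return store[key]  # left unchanged if the loop did not converge


samples = [(200.0, 0.5), (202.0, 0.5), (198.0, 0.5), (300.0, 0.9)]  # last one is an outlier
first_model = build_first_model(samples)

store = {"alice/open": (200.0, 0.5)}
stored = update_store(store, "alice/open", (240.0, 0.5))
```

With these made-up numbers the outlier is dropped when the first model is built, and a new model that differs noticeably from the stored one is blended with it once before the blended (second) model replaces the stored entry.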

In addition, the method according to an embodiment of the present invention may further include setting a management period for the voice models by the period setting unit of the above-described apparatus. If all voice models are updated within the set management period, the voice model editing unit of the apparatus maintains the existing matrix-form voice model DB in the storage unit of the context-presenting speaker identification system. If at least one voice model is not updated within the set management period, the voice model editing unit may delete or maintain part of the existing matrix-form voice model DB based on a new first voice model associated with the speaker.

If no new first voice model associated with the speaker exists, the voice model editing unit deletes the at least one unupdated voice model from the matrix-form voice model DB. If a new first voice model exists, the at least one unupdated voice model is compared with the new first voice model. If the difference is within a predetermined range, the voice model editing unit maintains the existing matrix-form voice model DB in the storage unit of the context-presenting speaker identification system; if it is out of the range, the at least one unupdated voice model may be deleted from the matrix-form voice model DB.
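The period-based maintenance rule can be illustrated with the sketch below. It assumes each matrix position stores a model together with the time of its last update, measures the "difference" as one minus a hypothetical `similarity`, and uses a made-up `max_difference`; none of these details are specified in the source.

```python
from datetime import datetime, timedelta


def similarity(a, b):
    """Assumed measure: 1 minus the mean relative difference, floored at 0."""
    diffs = [abs(x - y) / max(abs(x), abs(y), 1e-9) for x, y in zip(a, b)]
    return max(0.0, 1.0 - sum(diffs) / len(diffs))


def prune(db, now, period, new_model=None, max_difference=0.1):
    """Delete stale models per the rule above; return the deleted positions.

    db maps a matrix position (row, col) to (model, last_updated).
    """
    stale = [pos for pos, (_, updated) in db.items() if now - updated > period]
    deleted = []
    for pos in stale:
        model, _ = db[pos]
        # No new first voice model: delete. Otherwise delete only if it differs too much.
        if new_model is None or 1.0 - similarity(model, new_model) > max_difference:
            del db[pos]
            deleted.append(pos)
    return deleted


now = datetime(2019, 1, 31)
period = timedelta(days=30)
db = {
    (0, 0): ((200.0, 0.50), now - timedelta(days=5)),   # updated within the period
    (0, 1): ((205.0, 0.50), now - timedelta(days=60)),  # stale, close to the new model
    (1, 0): ((120.0, 0.90), now - timedelta(days=90)),  # stale, far from the new model
}
deleted = prune(db, now, period, new_model=(204.0, 0.50))
```

If every model was updated within the period, `stale` is empty and the DB is left as it is, which corresponds to the "maintain" branch of the description.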

According to an embodiment of the present invention, the accuracy and reliability of speaker recognition and authentication can be increased by extracting and matching a user's speech pattern (e.g., the speech characteristics of each utterance) based on words common to the many utterances spoken by the user.

In other words, by repeatedly performing the method using the apparatus according to the embodiment of the present invention, it is possible to recognize an optimized speech pattern for a specific word for each speaker and to identify quickly and accurately who the speaker is based on that speech pattern.

According to one embodiment of the invention, because the speaker's voice may change continuously or for a period of time due to temporal factors (e.g., aging) or environmental factors (e.g., a concert hall), the possibility of such change is monitored and the changed voice information is continuously collected and updated, so that the speaker can be identified quickly and accurately from voice information that fully reflects the speaker's current state. Stable and reliable identification (authentication) of the speaker is thus possible regardless of temporal and environmental factors.

In addition, according to an embodiment of the present invention, the speech model usable in a speaker identification (or voice authentication) system can be kept up to date by updating it based on the speaker's speech characteristics and at predetermined intervals.

In addition, it is possible to efficiently control a variety of electronic devices through the user-specific voice.

In addition, the influence of the user's speech state (temporal or environmental factors) is minimized, so that user authentication in e-commerce and the like can be performed quickly and accurately.

FIG. 1 is a view showing a conventional speaker identification system.

FIG. 2 is a diagram illustrating a conventional context (word) presentation speaker identification system.

FIG. 3 shows a conventional leveling system for speech.

FIG. 4 is a flowchart illustrating a method for identifying a speaker based on a spoken word according to an embodiment of the present invention.

FIG. 5 is a flowchart illustrating a specific speaker identification method according to an embodiment of the present invention.

FIG. 6 is a block diagram illustrating an apparatus for identifying a speaker based on a spoken word according to an embodiment of the present invention.

FIG. 7 is a diagram illustrating a leveling system for speech according to an embodiment of the present invention.

FIG. 8 is a view showing a leveling process based on the speaker's utterance similarity according to an embodiment of the present invention.

FIG. 9 is a block diagram of an apparatus for context-based speech model management according to an embodiment of the present invention.

FIG. 10 is a block diagram of a context-based speech model management apparatus and a context-presenting speaker identification system interoperable with the context-based speech model management apparatus according to an embodiment of the present invention.

FIG. 11 shows an example of the operation of the context-presenting speaker identification system.

FIG. 12 is a flowchart illustrating an operation example of a context-based speech model management apparatus according to an embodiment of the present invention.

FIG. 13 illustrates an operation example of a context-based speech model management apparatus according to another embodiment of the present invention.

FIG. 14 is a flowchart illustrating a voice model management method using a context-based voice model management apparatus according to an embodiment of the present invention.

DETAILED DESCRIPTION

Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art may easily implement the present invention. As those skilled in the art would realize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. In the drawings, parts irrelevant to the description are omitted in order to clearly describe the present invention, and like reference numerals designate like parts throughout the specification.

Terms used herein will be briefly described and the present invention will be described in detail.

The terms used in the present invention have been selected as widely used general terms as possible in consideration of the functions in the present invention, but this may vary according to the intention or precedent of the person skilled in the art, the emergence of new technologies and the like. In addition, in certain cases, there is also a term arbitrarily selected by the applicant, in which case the meaning will be described in detail in the description of the invention. Therefore, the terms used in the present invention should be defined based on the meanings of the terms and the contents throughout the present invention, rather than the names of the simple terms.

When any part of the specification is said to "include" a component, this means that it may further include other components, rather than excluding them, unless otherwise stated. In addition, the terms "... unit", "module", etc. described in the specification mean a unit for processing at least one function or operation, which may be implemented in hardware, software, or a combination of hardware and software. In addition, when a part of the specification is said to be "connected" to another part, this includes not only being "directly connected" but also being "connected with other elements in between".

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a view showing a conventional speaker identification system.

As shown in FIG. 1, a conventional speaker identification system first obtains a plurality of voice samples from the speaker to be identified (e.g., A of FIG. 1), extracts characteristic values such as frequency and pitch from each voice, overlaps them, and levels the speech based on the overlapping portion. After leveling, a speech model is established for the speaker. Establishing a speech model may refer to collecting an acoustic signal such as a human voice, removing noise from the collected signal, and extracting the characteristics of the voice signal into a database. In other words, through the speech model establishment process for the specific speaker (A of FIG. 1), information about the specific speaker's voice may be collected in advance and a DB may be constructed (e.g., the blue dashed-line box of FIG. 1).

After a speech model that sets the comparison criterion for the speech has been established, speech characteristic parameters and the like are extracted from a newly input voice of an unspecified speaker (for example, B of FIG. 1) and turned into data in the same manner as for the speaker to be confirmed (A of FIG. 1). The data are compared with the voice model of the speaker to be confirmed, and when a predetermined threshold value is exceeded, the unspecified speaker is determined to be the same person as the speaker to be confirmed. However, as described above, the conventional voice comparison method takes a long time and does not reflect cases in which the voice of the speaker to be confirmed has changed due to temporal and environmental factors.

Conventional speaker identification systems may be classified into context (word) fixed systems, which use a sentence or word designated by the user, and context (word) free systems, which place no limitation on what the user pronounces. Context (word) fixed systems are efficient, but their security is weak because the given context (word) may be exposed and illegal methods such as recordings impersonating the user may be used. Context (word) free systems, by contrast, require a large amount of training data to identify the user, which makes them less efficient in terms of time and resource utilization.

To take the advantages of the context (word) fixed system and the context (word) free system while compensating for their disadvantages, the context (word) presentation system shown in FIG. 2 has emerged. In this context-presenting system, when the user must be confirmed, the system asks the user to pronounce a different word or sentence each time, performs speech recognition on the requested word or sentence, and checks whether the text matches. The speaker's unique feature values are then extracted from the pronunciation of the word or sentence requested of the user and compared with the predefined voice feature values of the speaker. This process reduces both the burden of remembering user-specified sentences or words and the risk of recordings impersonating the user, and in terms of performance it has the advantage of achieving the same efficiency as the context fixed system.

However, in a context-presenting system, the context (word) is generated arbitrarily from the speaker's speech model, so it may differ fundamentally from the speech the speaker originally provided, and leveling errors may occur while the speech model is formed.

FIG. 3 shows a conventional leveling system for speech.

The user's voice, a continuous waveform, can be digitized through a sampling process. In general, to generate reference data for speaker identification (or authentication), the system samples a plurality of voice data rather than a single user utterance and then generates common data (e.g., normalized data) from the digitized voice data (red region in FIG. 3). From the data generated in this way, feature values of the speech are extracted using LPC (linear predictive coding) and MFCC (mel-frequency cepstral coefficients), and the user's reference data for the speech are built. However, the voice tone the user normally speaks with, that is, its frequency and pitch, can vary with the user's emotions, surrounding conditions (e.g., noise), and health (e.g., an illness such as a cold). Although the user's voice may change in specific environments and states as described above, a voice model built from simply leveled data, as in the conventional method, distorts the common characteristic values according to the user's living environment, and this distortion can act as a barrier to accurate speaker identification.
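The distortion caused by simple leveling can be seen in a small numerical sketch. Here leveling is assumed to be an element-wise mean over feature tuples (pitch in Hz, speech time in seconds); the source does not define the leveling operation, and all values are made up.

```python
def level(samples):
    """Assumed leveling: element-wise mean of the feature tuples."""
    return tuple(sum(column) / len(samples) for column in zip(*samples))


# Three utterances in the speaker's usual state.
normal = [(200.0, 0.50), (202.0, 0.50), (198.0, 0.50)]
# The same utterances plus one spoken with a cold (lower pitch, slower speech).
with_cold = normal + [(160.0, 0.70)]

usual_reference = level(normal)
leveled_reference = level(with_cold)
```

With these numbers the mean pitch drops from 200 Hz to 190 Hz once the atypical utterance is included, so the leveled reference no longer matches the speaker's usual voice and does not match the voice with a cold either. This is the kind of distortion the passage above attributes to the conventional method.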

FIG. 4 is a flowchart illustrating a method for identifying a speaker based on a spoken word according to an embodiment of the present invention, and FIG. 5 is a flowchart illustrating a specific speaker identification method according to an embodiment of the present invention.

According to an embodiment of the present invention, a method for identifying a speaker based on a spoken word may include: receiving a spoken voice from a speaker (S110); extracting a word included in the received voice and voice information of the word (S120); searching for the word in a pre-built database (DB) (S130); if the word does not exist in the DB, adding the word and the voice information of the word to the DB, and if the word exists in the DB, comparing the voice information of the spoken word with each piece of reference voice information stored in the DB (S140); estimating the similarity according to the comparison with each piece of reference voice information (S150); determining an utterance pattern for the speaker's word based on the number of times voice information corresponding to the estimated similarity is received (S160); and identifying the speaker based on the determined utterance pattern (S170).

The voice information of the word according to an embodiment of the present invention may include at least one of the frequency, pitch, formant, speech time, and speech speed of the speech.

Pitch refers to how high or low a sound is. Voice (voiced sound) consists of the fundamental frequency component of vocal cord vibration and its harmonic components. Every vibration source has its own vibration characteristics (e.g., resonance characteristics). Human articulation organs (e.g., the vocal cords) also have resonance characteristics that change from moment to moment with articulation, and the sound produced by the vocal cords is filtered according to those resonance characteristics. Looking at the frequency spectrum of a particular sound (e.g., a vowel), it can be seen that a plurality of resonance bands appear when the resonance characteristics are expressed. These resonant frequency bands are referred to as formants.
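As a concrete illustration of the fundamental frequency mentioned above, the following sketch estimates pitch from the autocorrelation peak of a synthetic frame. Autocorrelation is a common textbook technique chosen here for illustration; the source does not say how pitch is extracted, and the function name, search range, and test signal are assumptions.

```python
import math


def estimate_pitch(signal, sample_rate, fmin=50.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        value = sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


sample_rate = 8000
# Synthetic "voiced" frame: a 200 Hz fundamental plus a weaker second harmonic.
frame = [
    math.sin(2 * math.pi * 200 * n / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 400 * n / sample_rate)
    for n in range(400)
]
pitch = estimate_pitch(frame, sample_rate)
```

On this synthetic frame the strongest autocorrelation peak falls at a lag of 40 samples, which corresponds to the 200 Hz fundamental rather than the 400 Hz harmonic. Real speech is noisier, and practical estimators usually add windowing, normalization, and voicing detection.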

Referring to FIGS. 4 and 5, according to an embodiment of the present invention, when a word does not exist in the DB, the word and the voice information of the word may be added to the DB. The added voice information serves as reference voice information, that is, as reference data against which voice information is compared when a voice from the speaker is received later. When the word exists in the DB, the voice information of the spoken word may be compared with each reference voice information stored in the DB. In the comparison step (S140), it may be determined whether the voice information of the word spoken by the speaker is similar to at least one reference voice information stored in the DB.

In the step of estimating the similarity according to the comparison with each reference voice information (S150), the similarity is estimated according to the result of the above determination, and when the estimated similarity is less than the first reference value, new reference voice information may be generated and stored in the DB. In this case, the estimated similarity may be included in the voice information and stored together with it in the DB. For example, the first reference value may be 70% (or 0.7), and it may be set variably according to the user's settings. Even when the same speaker utters the same word, the voice information may change depending on the speaker's state and on environmental conditions, so the speaker's utterance patterns need to be tracked and managed.

In addition, when the estimated similarity is greater than or equal to the first reference value, the number of matches of the reference voice information having the corresponding similarity may be incremented and counted. In other words, if the speaker repeatedly utters the same word with the same or similar voice information, the speaker is highly likely to speak in this current speech pattern again. Thus, as in an embodiment of the present invention, by collecting the frequency of the speaker's speech patterns and using it for speaker identification, a high level of accuracy and reliability can be obtained and the speaker's voice information can be kept up to date.
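The branching described above (add an unknown word, create new reference voice information below the first reference value, otherwise increment a match count) can be sketched as follows. This is only an illustrative sketch: the specification does not define the similarity measure, so cosine similarity over a feature vector is an assumption, as are the function names, the dictionary layout, the choice of the best-matching reference, and the initial match count of 0.

```python
import math

FIRST_REFERENCE = 0.7  # example first reference value given in the text

def cosine_similarity(a, b):
    """Assumed similarity measure; the specification does not name one."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def update_references(db, word, features):
    """Add the word, add a new reference, or bump the best match's count."""
    references = db.setdefault(word, [])  # unknown word: start an empty entry
    best = max(
        references,
        key=lambda ref: cosine_similarity(ref["features"], features),
        default=None,
    )
    if best is None or cosine_similarity(best["features"], features) < FIRST_REFERENCE:
        # Below the first reference value: store as new reference voice information.
        entry = {"features": list(features), "matches": 0}
        references.append(entry)
        return entry
    best["matches"] += 1  # at or above the first reference value: count the match
    return best

db = {}
update_references(db, "bank", [1.0, 0.0, 0.2])  # unknown word: stored as a reference
update_references(db, "bank", [0.9, 0.1, 0.2])  # similar utterance: count incremented
update_references(db, "bank", [0.0, 1.0, 0.0])  # dissimilar utterance: new reference
```

After these three calls, the entry for "bank" holds two references, the first with one counted match and the second with none.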

In the determining of the speech pattern for the speaker's word according to an embodiment of the present invention (S160), when the counted number of matches is less than the second reference value, a new voice spoken by the speaker may be received and the similarity estimation process may be performed repeatedly. In other words, a current speech pattern can be reliably regarded as one the speaker repeats only once the counted number of matches exceeds a certain level, so the preceding steps are repeated until that level is reached.

When the counted matching count is equal to or greater than the second reference value, the reference voice information may be determined as a speech pattern for the speaker's word. This second reference value may, for example, have a value comprised in the range of 5-10.

In the determining of the speech pattern for the speaker's word based on the number of times the voice information corresponding to the estimated similarity is received, the speech pattern may be determined by establishing a speech model of the speaker based on the voice information corresponding to a similarity whose counted number of matches is equal to or greater than the second reference value. That is, as described above, reference voice information whose counted number of matches is greater than or equal to the second reference value may be established as the speaker's voice model, and the speech pattern may thereby be determined.

In operation S170, the speaker may be identified based on the speech pattern determined for the spoken voice through the above-described steps. In other words, reference voice information that satisfies both the first reference value and the second reference value may be determined as the speech pattern of the speaker to be confirmed, and when a voice is received, whether the speaker who uttered it is the same person as the target speaker or another person can be identified quickly and accurately according to the determined speech pattern.
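A minimal sketch of steps S160 and S170 follows, assuming each reference is a dictionary with a feature vector and a match count. The names `determine_pattern` and `is_target_speaker`, the use of cosine similarity, and the reuse of 0.7 as the identification threshold are assumptions for illustration; the specification gives only the 5 to 10 range for the second reference value.

```python
import math

SECOND_REFERENCE = 5      # the text gives 5-10 as an example range
IDENTIFY_THRESHOLD = 0.7  # assumption: reuses the example first reference value

def cosine_similarity(a, b):
    """Assumed similarity measure; the specification does not name one."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def determine_pattern(references, second_reference=SECOND_REFERENCE):
    """Return the most-matched reference that reached the second reference value.

    Returns None when no reference qualifies yet, in which case more
    utterances would have to be collected (S160 repeats).
    """
    qualified = [ref for ref in references if ref["matches"] >= second_reference]
    return max(qualified, key=lambda ref: ref["matches"], default=None)

def is_target_speaker(pattern, features, threshold=IDENTIFY_THRESHOLD):
    """S170 sketch: accept the voice if it is close enough to the pattern."""
    return cosine_similarity(pattern["features"], features) >= threshold

references = [
    {"features": [1.0, 0.0], "matches": 2},
    {"features": [0.8, 0.6], "matches": 7},
]
pattern = determine_pattern(references)
```

Here only the second reference has reached the second reference value, so it becomes the speech pattern against which later utterances are checked.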

The apparatus 1000 for identifying a speaker based on a spoken word according to an embodiment of the present invention may include: a voice receiver 1100 for receiving a spoken voice from a speaker; an information extraction unit 1200 for extracting a word included in the received voice and voice information of the word; an information search unit 1300 for searching for the word in a pre-built database (DB); a comparison unit 1400 for adding the word and the voice information of the word to the DB if the word does not exist in the DB, and for comparing the voice information of the spoken word with each reference voice information stored in the DB if the word exists in the DB; a similarity estimation unit 1500 for estimating the similarity according to the comparison with each reference voice information; a speech pattern determination unit 1600 for determining the speech pattern for the speaker's word based on the number of times voice information corresponding to the estimated similarity is received; and a speaker identification unit 1700 for identifying the speaker based on the determined speech pattern.

Referring to FIG. 6, when the first user (first speaker) utters, for example, "enterprise", tag information (e.g., U000), which is an identifier for the first user, is assigned, and voice information (e.g., vector property information) V_Inof000 for the spoken "enterprise" may be stored and managed in the DB in association with the tag information U000. In addition, the matching count information described above may be stored and managed together with the tag information U000 and the voice information V_Inof000 (e.g., "2" in FIG. 6).

Similarly, when the first speaker utters, for example, "bank", the tag information (e.g., U000), which is the identifier for the first speaker, and the voice information (V_Inof003) for the spoken "bank" may be stored and managed together with the matching count information (e.g., "7" in FIG. 6).

Tag information of the second user (second speaker) may be assigned to U011, for example.
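The record layout described with reference to FIG. 6 can be pictured with a small data structure. The following is a sketch only: the class name `VoiceRecord`, its field names, and keying the DB by (tag, word) are assumptions, while the example values U000, V_Inof000, V_Inof003, 2, and 7 are the ones given in the text.

```python
from dataclasses import dataclass

@dataclass
class VoiceRecord:
    """One DB row: who spoke, which word, which voice information, how often matched."""
    tag: str            # speaker identifier, e.g. "U000"
    word: str           # spoken word (context)
    voice_info_id: str  # handle to the stored voice information
    match_count: int = 0

db = {}

def store(record):
    """Index records by (tag, word) so one speaker can hold many words."""
    db[(record.tag, record.word)] = record

store(VoiceRecord("U000", "enterprise", "V_Inof000", 2))
store(VoiceRecord("U000", "bank", "V_Inof003", 7))
```

A second speaker such as U011 would simply contribute further records under its own tag.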

The comparison unit 1400 according to an embodiment of the present invention determines whether the voice information of the word spoken by the speaker is similar to at least one reference voice information stored in the DB, and the similarity estimation unit 1500 estimates the similarity according to the result of the determination. If the estimated similarity is less than the first reference value, new reference voice information is generated and stored in the DB. If the estimated similarity is greater than or equal to the first reference value, the number of matches of the reference voice information having the corresponding similarity may be incremented and counted.

In addition, when the counted number of matches is less than the second reference value, the speech pattern determination unit 1600 receives a new voice spoken by the speaker so that the process of estimating the similarity is performed repeatedly, and when the counted number of matches is equal to or greater than the second reference value, the reference voice information may be determined as the speech pattern for the speaker's word.

According to an embodiment of the present invention, the speech pattern determination unit 1600 determines the speech pattern by establishing a speech model of the speaker based on the voice information corresponding to a similarity whose counted number of matches is equal to or greater than the second reference value. The speaker identification unit 1700 may identify who the speaker is based on the speech pattern determined for the spoken voice.

For example, the system may initially know neither the user's everyday speech pattern nor the state in which the user is speaking. Accordingly, for each voice spoken by the user, a separate reference voice information DB is constructed for each voice property. Thereafter, a newly input voice is classified by its characteristics and compared against the constructed reference voice information DBs to determine the characteristic similarity. If the similarity is equal to or greater than a predetermined reference value (for example, a third reference value), the newly input voice is not stored as a separate entry; instead, the matching count of the compared reference voice information DB is increased by 1, so that a reference voice information DB of similar voices is formed for the user and the user's voice similarity pattern can be analyzed. If the characteristic similarity is less than the third reference value, a new DB may be generated with the voice as new reference voice information.

When the above process is repeated for successive new voice inputs and a DB with a similarity above a predetermined reference value (for example, a fourth reference value) keeps appearing (for example, when its matching count is high), the corresponding reference voice information is recognized as the speech pattern for a specific context (word), and the DB of that reference voice information is used as the basic voice data for establishing the speaker's speech model. This effectively eliminates distortion errors caused by the speaker's various voice state changes and normalizes the voice pattern for a particular speaker's context (word).

Unlike the voice graph of FIG. 3, the voice data in the voice graph of FIG. 8 are similar to one another, so little difference appears between the individual voice data. A speech model may be established based on the common portion (e.g., the hatched region of FIG. 8), and a speaker may be identified by comparing and matching newly input speech from an unspecified speaker against that model.

In this case, as an exemplary criterion for the speaker identification decision, the difference between the maximum and minimum values of the corresponding voice data outside the common area (e.g., the hatched area of FIG. 8) may be applied as an error range. When the input comparison value falls within the error range, the speaker who uttered the voice may be recognized as the legitimate speaker (i.e., the same person) corresponding to the reference voice information DB.
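One way to read the error-range criterion above is as a per-point envelope: for each position along the voice data, the minimum and maximum over the stored reference data bound the acceptable input. The sketch below follows that reading. It is an assumption, as are the function names and the sample values; the specification does not fix how the range is computed or applied.

```python
def error_envelope(curves):
    """Per-position minimum and maximum over the stored voice data (assumed reading)."""
    lows = [min(points) for points in zip(*curves)]
    highs = [max(points) for points in zip(*curves)]
    return lows, highs

def within_envelope(sample, lows, highs):
    """Accept the input only if every value falls inside the error range."""
    return all(low <= value <= high for value, low, high in zip(sample, lows, highs))

# Hypothetical stored voice data for one word from one speaker.
stored = [
    [0.50, 0.80, 0.30],
    [0.55, 0.75, 0.35],
    [0.52, 0.78, 0.32],
]
lows, highs = error_envelope(stored)
same_person = within_envelope([0.53, 0.77, 0.33], lows, highs)
other_person = within_envelope([0.90, 0.40, 0.33], lows, highs)
```

With these values the first input lies inside the range at every position and is accepted, while the second falls outside it at the first two positions and is rejected.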

The numerical values described above in the present specification are provided as examples for convenience of explanation for easy understanding, and are not necessarily limited thereto.

With regard to the apparatus for identifying the speaker based on the spoken word according to an embodiment of the present invention, the above-described method may be applied. Therefore, with respect to the apparatus, the description of the same contents as those of the above-described method is omitted.

The method for identifying a speaker based on the spoken words described above can be written in a program executable in a computer, and can be implemented in a general-purpose digital computer operating the program using a computer readable medium. In addition, the structure of the data used in the above-described method can be recorded on the computer-readable medium through various means. A recording medium for recording an executable computer program or code for performing various methods of the present invention should not be understood to include temporary objects, such as carrier waves or signals. The computer readable medium may include a storage medium such as a magnetic storage medium (eg, a ROM, a floppy disk, a hard disk, etc.), an optical reading medium (eg, a CD-ROM, a DVD, etc.).

On the other hand, even for the same context (word), the general tone of speech, that is, its frequency and pitch, can differ from the user's normal speech depending on the user's feelings, the surrounding situation (e.g., noise), and the speaker's health condition (e.g., an illness such as a cold). In other words, because the speaker's voice can change temporarily or over a period of time due to time factors (e.g., aging) and environmental factors (e.g., a concert hall), the changed voice information needs to be collected and updated continuously so that the speaker can be identified quickly and accurately from voice information that sufficiently reflects the speaker's current state.

Although the voice spoken by a user may change in a specific environment and state as described above, the conventional approach of identifying the user's voice with a fixed voice model does not consider at all the possibility that the voice fluctuates with the user's living environment and the like, so the reliability of speech recognition may be seriously degraded.

According to the context-based voice model management apparatus and the method of operating the apparatus according to an embodiment of the present invention, stable and reliable voice identification (authentication) of the speaker is possible compared to the prior art, regardless of the time and environmental factors affecting the speaker.

FIG. 9 is a block diagram of a context-based speech model management apparatus according to an embodiment of the present invention, and FIG. 10 is a block diagram of a context-based speech model management apparatus and a context-presenting speaker identification system that interwork with each other according to an embodiment of the present invention. FIG. 11 shows an operation example of the context-presenting speaker identification system. FIG. 12 is a flowchart illustrating an operation example of the context-based speech model management apparatus according to an embodiment of the present invention, and FIG. 13 illustrates an operation example of the context-based speech model management apparatus according to another embodiment of the present invention.

The context-based speech model management apparatus 3000 according to an embodiment of the present invention may interwork with the context-presenting speaker identification system 4000. The apparatus 3000 may include: a storage unit 3100 in which individual voice data, generated whenever a voice is received from the speaker, are stored; a similarity estimator 3200 that, when a plurality of individual voice data are stored in the storage unit 3100, extracts each individual voice data from the storage unit 3100 and estimates the similarity between the individual voice data; a speech model generator 3300 that generates a first speech model of the speaker from at least one individual voice data selected based on the similarity estimated by the similarity estimator 3200; a determination unit 3400 that determines whether a comparison speech model corresponding to the first speech model exists in the storage unit 4400 of the context-presenting speaker identification system 4000, provides the first speech model to the storage unit 4400 for storage if no such model exists, and, if one exists, causes the similarity estimator 3200 to estimate a comparison similarity between the first speech model and the comparison speech model; and a voice model editing unit 3500 that replaces the comparison speech model with the first speech model when the comparison similarity estimated by the similarity estimator 3200 is greater than or equal to a predetermined reference value, and generates a second speech model by combining the first speech model and the comparison speech model when the comparison similarity is less than the predetermined reference value. The second speech model may be provided again to the determination unit 3400 and the voice model editing unit 3500.

In addition, the context-presenting speaker identification system 4000 may include: a voice receiver 4100 for receiving a voice from the speaker; a voice feature extractor 4200 for extracting voice characteristics from the received voice; a context-presenting speech model generator 4300 for generating a speech model based on the extracted voice characteristics; a storage unit 4400 in which the generated speech models are stored in matrix form; a random number generator 4500 for generating a random number to be used for identifying the speaker; a voice model extractor 4600 for extracting the voice model at the position corresponding to the generated random number in the matrix-form voice model DB of the storage unit; a voice speech request unit 4700 for requesting the speaker to make a predetermined utterance based on the extracted voice model; and a speaker identification unit 4800 for identifying the speaker by comparing the voice uttered by the speaker with the extracted speech model. The predetermined utterance may be the sound of a word or sentence preset at the position in the matrix-form DB of the storage unit that corresponds to the generated random number.

For example, the word "bank" and a speech model of the spoken word are stored in advance in the matrix DB of the storage unit 4400, and when the user needs to utter the word "bank" for identification (confirmation) by voice, the voice speech request unit 4700 may request the user to pronounce the word "bank". Such a request may be presented to the user by voice, picture, message, or the like. The speech model according to an embodiment of the present invention refers to a data set including a context and speech pattern information, such as the speaker's way of pronouncing that context. In addition, a context refers to a particular word (e.g., "bank") as well as to a series of sentences containing the word.

The word 'bank' and the spoken speech model of the word may be stored on a matrix position of a predetermined matrix DB. When user voice identification is required, the random number generator 4500 generates a random number, and a word on the matrix position of the matrix DB corresponding to the random number may be presented to the user as a voice speech request target word.

The context-presented speech model matrix DB according to an embodiment of the present invention may be configured in the form of NxM (where N and M are the same or different positive integers). For example, as shown in FIGS. 11 to 13, a context-presented speech model may be constructed as a DB in a 20 × 5 matrix.
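The way a random number selects the word to be prompted from the matrix DB can be sketched as follows. This is an illustrative sketch under assumptions: the function names, the (word, model) cell layout, the placeholder words, and the mapping of a single random number to a row and column with `divmod` are hypothetical; only the 20 × 5 size and the word "bank" come from the text.

```python
import random

ROWS, COLS = 20, 5  # example matrix size given in the text

def build_matrix(entries):
    """Arrange exactly ROWS * COLS (word, model) entries row by row."""
    if len(entries) != ROWS * COLS:
        raise ValueError("expected one entry per matrix cell")
    return [entries[r * COLS:(r + 1) * COLS] for r in range(ROWS)]

def pick_prompt(matrix, rng):
    """Draw a random number and return the matrix position and its word and model."""
    number = rng.randrange(ROWS * COLS)
    row, col = divmod(number, COLS)  # assumed mapping from number to position
    word, model = matrix[row][col]
    return (row, col), word, model

# Placeholder vocabulary; the models are left as None in this sketch.
entries = [(f"word{i:03d}", None) for i in range(ROWS * COLS)]
entries[7] = ("bank", None)  # lands at row 1, column 2
matrix = build_matrix(entries)
position, word, model = pick_prompt(matrix, random.Random())
```

The specification does not say which random source is used. In an authentication setting, a source intended for security purposes, such as Python's `secrets.randbelow`, may be preferable to the general-purpose `random` module.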

The context-based voice model management apparatus 3000 may communicate, through the communication unit 3700, with other electronic devices on a network. For example, the apparatus 3000 may exchange data with the communication unit 4900 of the context-presenting speaker identification system 4000. In FIG. 10, the context-based speech model management apparatus 3000 is shown separately from the context-presenting speaker identification system 4000 for convenience of description; however, the apparatus 3000 may also be implemented as a part of the system 4000.

The communication units 3700 and 4900 may include, but are not limited to, a Bluetooth communication module, a Bluetooth Low Energy (BLE) communication module, a near field communication unit, a Wi-Fi communication module, a Zigbee communication module, an Infrared Data Association (IrDA) communication module, a Wi-Fi Direct (WFD) communication module, an ultra-wideband (UWB) communication module, an Ant+ communication module, and the like.

Individual voice data according to an embodiment of the present invention include at least one of the frequency, pitch, formant, speech time, and speech rate of each of the speaker's utterances, and the similarity estimator 3200 of the context-based speech model management apparatus 3000 may evaluate the similarity between the individual voice data for each of the speaker's utterances. Pitch refers to how high or low a sound is. A voiced sound consists of the fundamental frequency component of the vocal cord vibration and its harmonic components. Every oscillation source has its own vibration characteristics (e.g., resonance characteristics). The human articulatory organs (e.g., the vocal cords) likewise have resonance characteristics that change from moment to moment with articulation, and the sound produced by the vocal cords is filtered and expressed according to those resonance characteristics. In the frequency spectrum of a particular sound (e.g., a vowel), a plurality of resonance bands can be seen where the resonance characteristics are expressed. These resonant frequency bands are referred to as formants.

For example, as shown in FIG. 11, when a predetermined word (e.g., "bank") is uttered by a specific speaker (e.g., user B of FIG. 11), the spoken voice is received by the voice receiver 4100 and its voice characteristics can be extracted. The extracted voice characteristics may constitute individual voice data. Referring to FIG. 12, the similarity estimator 3200 of the context-based voice model management apparatus 3000 may evaluate the similarity between the individual voice data for each of the speaker's utterances (e.g., for "bank", the voice uttered two weeks ago, the voice uttered one week ago, and the voice uttered yesterday). Based on at least one piece of individual voice data selected according to the similarity estimated by the similarity estimator 3200 (e.g., for "bank", the data for the voice uttered one week ago and the data for the voice uttered yesterday), the voice model generator 3300 may generate a first voice model of the speaker (e.g., user B of FIG. 11).
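The selection of mutually similar individual voice data and the generation of a first voice model from them can be sketched as below. The specification does not state the selection rule or how the model is built, so everything here is an assumption made for illustration: cosine similarity, the 0.9 selection threshold, keeping recordings that have at least one sufficiently similar peer, averaging the selected feature vectors, and the feature values themselves.

```python
import math

SELECT_THRESHOLD = 0.9  # assumed value; not given in the specification

def cosine_similarity(a, b):
    """Assumed similarity measure between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_similar(recordings, threshold=SELECT_THRESHOLD):
    """Keep recordings that have at least one other recording at or above the threshold."""
    selected = []
    for i, recording in enumerate(recordings):
        peers = (other for j, other in enumerate(recordings) if j != i)
        if any(cosine_similarity(recording, other) >= threshold for other in peers):
            selected.append(recording)
    return selected

def build_first_model(recordings, threshold=SELECT_THRESHOLD):
    """Average the selected recordings element-wise (assumed model construction)."""
    selected = select_similar(recordings, threshold)
    if not selected:
        return None
    return [sum(values) / len(selected) for values in zip(*selected)]

# Hypothetical feature vectors for "bank" uttered on three occasions.
two_weeks_ago = [0.2, 1.0, 0.1]
one_week_ago = [1.0, 0.2, 0.3]
yesterday = [0.9, 0.3, 0.3]
first_model = build_first_model([two_weeks_ago, one_week_ago, yesterday])
```

In this example the recordings from one week ago and from yesterday resemble each other and are averaged, while the recording from two weeks ago is left out, mirroring the example in the text.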

Referring to FIGS. 9, 10, and 12, the determination unit 3400 determines whether a comparison speech model corresponding to the first speech model exists in the storage unit 4400 of the context-presenting speaker identification system 4000. If none exists, the first speech model is provided to the storage unit 4400 of the context-presenting speaker identification system 4000 and stored there; if one exists, the comparison similarity between the first speech model and the comparison speech model may be estimated by the similarity estimator 3200.

When the comparison similarity estimated by the similarity estimator 3200 at the request of the determination unit 3400 is equal to or greater than a predetermined reference value, the voice model editing unit 3500 replaces the comparison voice model with the first voice model; when it is less than the predetermined reference value, the voice model editing unit 3500 may generate a second voice model by combining the first voice model and the comparison voice model. The predetermined reference value may be at least 51% (or 0.51), and preferably at least 75% (or 0.75). A voice model can be edited (replaced) reliably when the comparison similarity is at or above the predetermined reference value.

The second voice model may be provided to the determination unit 3400 and the voice model editing unit 3500 again. The determination unit 3400 determines whether a comparison speech model corresponding to the second voice model (the newly generated speech model) exists in the storage unit 4400 of the context-presenting speaker identification system 4000. If none exists, the second voice model is provided to the storage unit 4400 of the context-presenting speaker identification system 4000 for storage; if one exists, the comparison similarity between the second voice model and the comparison speech model may be estimated by the similarity estimator 3200. This process can be performed repeatedly. Through such an iterative process, a speech model optimized for the speaker's current speech state may be stored and managed in the matrix DB.
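The store, replace, or combine cycle described above can be sketched as a small loop. This is a sketch under stated assumptions: models are treated as feature vectors, similarity is cosine similarity, "combining" is taken to be an element-wise average, and a round limit is added as a safeguard. None of these choices, nor the function names, come from the specification; only the 0.75 reference value does.

```python
import math

REFERENCE = 0.75  # "preferably at least 75%" in the text

def cosine_similarity(a, b):
    """Assumed similarity measure between two models."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def combine(a, b):
    """Assumed combination rule: element-wise average of the two models."""
    return [(x + y) / 2 for x, y in zip(a, b)]

def manage_model(storage, key, first_model, reference=REFERENCE, max_rounds=10):
    """Store, replace, or combine-and-retry, as the determination and editing units do."""
    candidate = list(first_model)
    for _ in range(max_rounds):
        comparison = storage.get(key)
        if comparison is None:
            storage[key] = candidate  # no comparison model: store as is
            return "stored"
        if cosine_similarity(candidate, comparison) >= reference:
            storage[key] = candidate  # similar enough: replace the comparison model
            return "replaced"
        candidate = combine(candidate, comparison)  # second model, fed back in
    return "unchanged"

storage = {}
key = ("U000", "bank")
first = manage_model(storage, key, [1.0, 0.0])
second = manage_model(storage, key, [0.9, 0.1])
third = manage_model(storage, key, [0.0, 1.0])
```

In the third call the new model is too different from the stored one, so it is combined with the stored model twice before the result is similar enough to replace it.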

In addition, the apparatus according to an embodiment of the present invention may further include a period setting unit 3600 for setting a management period of the voice models. When all the voice models have been updated within the set management period, the voice model editing unit 3500 maintains the existing matrix-form voice model DB in the storage unit 4400 of the context-presenting speaker identification system 4000. When at least one voice model has not been updated within the set management period, the voice model editing unit 3500 may delete or maintain part of the existing matrix-form voice model DB based on a new first voice model associated with the speaker. The management period according to an embodiment of the present invention may be one day, one week, or one month, and may be set individually according to the user's intention. For example, a management period may be set so that the voice model for a certain word (e.g., "bank") is managed at weekly intervals, or management periods may be set per user so that one user has a daily period and another user has a monthly period.

If no new first voice model related to the speaker exists, the voice model editing unit 3500 deletes the at least one unupdated voice model from the matrix-form voice model DB. If a new first voice model exists, the voice model editing unit 3500 compares the at least one unupdated voice model with the new first voice model; if the difference found by the comparison is within a predetermined range, the existing matrix-form voice model DB in the storage unit of the context-presenting speaker identification system is maintained, and if it is outside that range, the at least one unupdated voice model may be deleted from the matrix-form voice model DB. The allowable range of the difference may be greater than 0 and up to 15% (or 0.15), and a specific voice model (e.g., voice model 8 of FIG. 13) may be kept or deleted depending on whether the difference falls within this range. For example, if comparing the new first voice model with the at least one unupdated voice model yields a difference of 40% (or 0.4), the unupdated voice model (e.g., voice model 8 of FIG. 13) is deleted from the matrix-form voice model DB.
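The keep-or-delete decision for voice models that were not updated within the management period can be sketched as follows. The difference measure (summed absolute difference relative to the stored model), the function names, the entry layout, and the sample models are assumptions for illustration; the 0.15 allowance and the 0.4 example come from the text.

```python
from datetime import datetime, timedelta

ALLOWED_DIFFERENCE = 0.15  # upper end of the allowable range given in the text

def difference(stored, new):
    """Assumed measure: summed absolute difference relative to the stored model."""
    total = sum(abs(s) for s in stored)
    return sum(abs(s - n) for s, n in zip(stored, new)) / total if total else 0.0

def prune(db, now, cycle, new_first_models):
    """Delete unupdated models with no new model or with too large a difference."""
    for key, entry in list(db.items()):
        if now - entry["updated"] <= cycle:
            continue  # updated within the management period: keep
        new_model = new_first_models.get(key)
        if new_model is None or difference(entry["model"], new_model) > ALLOWED_DIFFERENCE:
            del db[key]
    return db

now = datetime(2019, 1, 31)
cycle = timedelta(days=7)  # weekly management period
stale = now - timedelta(days=10)
db = {
    "model_1": {"model": [1.0, 1.0], "updated": now - timedelta(days=1)},
    "model_7": {"model": [1.0, 1.0], "updated": stale},
    "model_8": {"model": [1.0, 1.0], "updated": stale},
    "model_9": {"model": [1.0, 1.0], "updated": stale},
}
new_first_models = {"model_7": [1.1, 0.9], "model_8": [1.4, 0.6]}
prune(db, now, cycle, new_first_models)
```

Here `model_7` differs from its new first voice model by about 0.1 and is kept, `model_8` differs by about 0.4 and is deleted, and `model_9` is deleted because no new first voice model exists for it.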

According to an embodiment of the present invention, a method of managing a speech model using a context-based speech model management apparatus may include: (a) generating and storing individual voice data each time a voice is received from a speaker (S210); (b) when a plurality of individual voice data are stored, extracting each individual voice data and estimating the similarity between the individual voice data (S220); (c) generating a first speech model of the speaker from at least one individual voice data selected based on the estimated similarity (S230); (d) determining whether a comparison speech model corresponding to the first speech model exists in the storage unit of the context-presenting speaker identification system, providing the first speech model to the storage unit of the context-presenting speaker identification system for storage if none exists, and estimating a comparison similarity between the first speech model and the comparison speech model if one exists (S240); and (e) replacing the comparison speech model with the first speech model if the comparison similarity is greater than or equal to a predetermined reference value, and generating a second speech model by combining the first speech model and the comparison speech model if the comparison similarity is less than the predetermined reference value (S250). In addition, steps (d) S240 and (e) S250 described above may be performed repeatedly for the second speech model.

In addition, the method for managing a voice model according to an embodiment of the present invention may further include setting a management cycle of the voice model by the period setting unit of the aforementioned context-based voice model management apparatus (S10). The setting of the management period may be performed before S210 or may be performed such that the management period is set at any time by the user.

In addition, when all the voice models have been updated within the set management period, the voice model editing unit 3500 of the apparatus 3000 may maintain the existing matrix-form voice model DB in the storage unit of the context-presenting speaker identification system 4000. When at least one voice model has not been updated within the set management period, the voice model editing unit 3500 may delete or maintain part of the existing matrix-form voice model DB based on a new first voice model associated with the speaker.

If no new first voice model related to the speaker exists, the voice model editing unit 3500 deletes the at least one unupdated voice model from the matrix-form voice model DB. If a new first voice model exists, the voice model editing unit 3500 compares the at least one unupdated voice model with the new first voice model; if the difference is within a predetermined range, the existing matrix-form voice model DB in the storage unit of the context-presenting speaker identification system 4000 is maintained, and if the difference is outside that range, the at least one unupdated voice model may be deleted from the matrix-form voice model DB.

With regard to the operation method of the context-based speech model management apparatus according to an embodiment of the present invention, the above-described content of the context-based speech model management apparatus may be applied. Therefore, with regard to the operation method, descriptions of the same contents as those of the above-described context-based voice model management apparatus are omitted.

The above-described method of operating the context-based speech model management apparatus may be written as a program executable on a computer, and may be implemented in a general-purpose digital computer operating the program using a computer readable medium. In addition, the structure of the data used in the above-described method can be recorded on the computer-readable medium through various means. A recording medium for recording an executable computer program or code for performing various methods of the present invention should not be understood to include temporary objects, such as carrier waves or signals. The computer readable medium may include a storage medium such as a magnetic storage medium (eg, a ROM, a floppy disk, a hard disk, etc.), an optical reading medium (eg, a CD-ROM, a DVD, etc.).

The foregoing description of the present invention is intended for illustration, and those skilled in the art will understand that the present invention may be easily modified into other specific forms without changing its technical spirit or essential features. Therefore, it should be understood that the embodiments described above are exemplary in all respects and not restrictive. For example, each component described as a single type may be implemented in a distributed manner, and similarly, components described as distributed may be implemented in a combined form.

The scope of the present invention is defined by the following claims rather than by the above description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as being included in the scope of the present invention.

Claims

A method for identifying a speaker based on a spoken word,

Receiving spoken voice from the speaker;

Extracting a word included in the received voice and voice information of the word;

Retrieving the word from a pre-built database (DB);

If the word does not exist in the DB, adding the word and voice information of the word to the DB, and if the word exists in the DB, comparing the voice information of the spoken word with each reference voice information stored in the DB;

Estimating a degree of similarity according to comparison with the respective reference voice information;

Determining a speech pattern for the word of the speaker based on the number of times voice information corresponding to the estimated similarity is received; and

Identifying the speaker based on the determined utterance pattern.
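By way of non-limiting illustration only, the retrieving, comparing, and similarity-estimating steps recited above could be sketched in Python as follows. The `Reference` record, the function names, the numeric feature vectors, and the use of cosine similarity are assumptions made for this sketch; the claims do not name a data layout or a similarity metric.

```python
import math
from dataclasses import dataclass


@dataclass
class Reference:
    """One piece of reference voice information stored for a word (hypothetical layout)."""
    features: list[float]   # assumed numeric voice features of the word
    match_count: int = 1    # number of times similar voice information was received


def similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity; an assumed metric, not one recited in the claims."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def look_up_and_compare(db: dict[str, list[Reference]], word: str,
                        features: list[float]) -> list[float]:
    """Add an unknown word to the DB, or compare against each stored reference."""
    if word not in db:
        db[word] = [Reference(features)]   # word absent: add word and voice information
        return []                          # nothing to compare against yet
    # Word present: estimate one similarity per piece of reference voice information.
    return [similarity(features, ref.features) for ref in db[word]]
```

In this sketch the returned list of similarities would feed the later steps of determining the speech pattern and identifying the speaker.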
The method of claim 1,

wherein the voice information of the word includes at least one of a frequency, a pitch, a formant, a speech time, and a speech rate of the voice.
The method of claim 1,

wherein, in the comparing step, it is determined whether the voice information of the word spoken by the speaker is similar to at least one piece of reference voice information stored in the DB, and

in the estimating of the similarity, the similarity is estimated according to the result of the determination; if the estimated similarity is less than a first reference value, new reference voice information is generated and stored in the DB, and otherwise the matching count of the reference voice information having the corresponding similarity is incremented.
The method of claim 3, wherein

in the determining of the speech pattern, when the counted number of matches is less than a second reference value, a new voice spoken by the speaker is received so that the process of estimating the similarity is performed repeatedly, and when the counted number of matches is equal to or greater than the second reference value, the voice information having the corresponding similarity is determined to be the speech pattern for the word of the speaker.
The method of claim 4, wherein

the speech pattern is determined by building a speech model of the speaker based on the voice information corresponding to a similarity whose matching count is equal to or greater than the second reference value, and

in the identifying step, the identity of the speaker of the uttered voice is determined based on the determined speech pattern.
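By way of non-limiting illustration only, the role of the first and second reference values in the dependent claims above could be sketched as follows. The threshold values, the dictionary layout, and the choice to credit the single best-matching reference are assumptions of this sketch, not features recited in the claims.

```python
FIRST_REFERENCE_VALUE = 0.9   # assumed similarity threshold
SECOND_REFERENCE_VALUE = 3    # assumed minimum matching count


def record_match(references, similarities, features):
    """Update the matching counts of one word for a single utterance.

    `references` is a list of dicts {"features": [...], "count": int};
    `similarities` holds one score per reference, in the same order.
    """
    best = max(range(len(similarities)), key=similarities.__getitem__, default=None)
    if best is None or similarities[best] < FIRST_REFERENCE_VALUE:
        # No sufficiently similar reference: store new reference voice information.
        references.append({"features": features, "count": 1})
    else:
        # Similar enough: increment the matching count of that reference.
        references[best]["count"] += 1


def speech_pattern(references):
    """Return the reference whose count reached the second reference value, else None."""
    confirmed = [r for r in references if r["count"] >= SECOND_REFERENCE_VALUE]
    return max(confirmed, key=lambda r: r["count"]) if confirmed else None
```

While `speech_pattern` returns `None`, a caller would keep receiving new utterances and repeating the similarity estimation, which corresponds to the repetition described for counts below the second reference value.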
An apparatus for identifying a speaker based on a spoken word, the apparatus comprising:

A voice receiver for receiving a voice spoken by the speaker;

An information extracting unit extracting a word included in the received voice and voice information of the word;

An information search unit for searching the word in a pre-built database;

A comparison unit that adds the word and the voice information of the word to the DB if the word does not exist in the DB and, if the word exists in the DB, compares the voice information of the spoken word with each piece of reference voice information stored in the DB;

A similarity estimating unit for estimating a similarity according to comparison with the respective reference voice information;

A speech pattern determination unit that determines a speech pattern for the word of the speaker based on the number of times voice information corresponding to the estimated similarity is received; and

A speaker identification unit for identifying the speaker based on the determined speech pattern.
The apparatus of claim 6,

wherein the voice information of the word includes at least one of a frequency, a pitch, a formant, a speech time, and a speech rate of the voice.
The apparatus of claim 6,

wherein the comparison unit determines whether the voice information of the word spoken by the speaker is similar to at least one piece of reference voice information stored in the DB, and

the similarity estimating unit estimates the similarity according to the result of the determination, generates new reference voice information and stores it in the DB when the estimated similarity is less than a first reference value, and otherwise increments the matching count of the reference voice information having the corresponding similarity.
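The apparatus of claim 8,

wherein, if the counted number of matches is less than a second reference value, the speech pattern determination unit receives a new voice spoken by the speaker so that the process of estimating the similarity is performed repeatedly, and if the counted number of matches is equal to or greater than the second reference value, the speech pattern determination unit determines the voice information having the corresponding similarity to be the speech pattern for the word of the speaker.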
The apparatus of claim 9,

wherein the speech pattern determination unit determines the speech pattern by building a speech model of the speaker based on the voice information corresponding to a similarity whose matching count is equal to or greater than the second reference value, and

the speaker identification unit determines the identity of the speaker of the spoken voice based on the determined speech pattern.
A computer-readable recording medium having recorded thereon a program for implementing the method of any one of claims 1 to 5.
A context-based speech model management apparatus,

wherein the apparatus is capable of interworking with a contextual speaker identification system, the apparatus comprising:

A storage unit storing individual voice data generated whenever a voice from the speaker is received;

A similarity estimator that, when a plurality of individual voice data are stored in the storage unit, extracts the respective individual voice data from the storage unit and estimates the similarity between the individual voice data;

A speech model generator for generating a first speech model of the speaker according to at least one individual speech data selected based on the similarity estimated by the similarity estimator;

A determination unit that determines whether a comparison speech model corresponding to the first speech model exists in the storage unit of the contextual speaker identification system, provides the first speech model to the storage unit of the contextual speaker identification system for storage if no comparison speech model exists, and, if a comparison speech model exists, causes the similarity estimator to estimate a comparison similarity between the first speech model and the comparison speech model; and

A voice model editing unit that replaces the comparison speech model with the first speech model when the comparison similarity, which is the result of the estimation by the similarity estimator, is equal to or greater than a predetermined reference value, and generates a second speech model by combining the first speech model and the comparison speech model when the comparison similarity is less than the predetermined reference value,

wherein the second speech model is provided to the determination unit and the voice model editing unit.
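By way of non-limiting illustration only, the store, replace, or combine decision made by the determination unit and the voice model editing unit could be sketched as follows. The function name `manage_model`, the dictionary used as the storage unit, the default reference value, and the element-wise averaging used to combine two models are assumptions of this sketch; the claims do not specify how the models are combined.

```python
def manage_model(store, key, first_model, similarity_fn, reference_value=0.8):
    """Store, replace, or combine a speaker's speech model for one DB position.

    `store` stands in for the storage unit of the speaker identification system;
    `similarity_fn(a, b)` is any caller-supplied similarity estimate.
    """
    comparison_model = store.get(key)
    if comparison_model is None:
        # No comparison model exists: provide the first model for storage.
        store[key] = list(first_model)
    elif similarity_fn(first_model, comparison_model) >= reference_value:
        # Comparison similarity at or above the reference value: replace.
        store[key] = list(first_model)
    else:
        # Below the reference value: combine into a second model
        # (element-wise mean is an assumed combination rule).
        store[key] = [(a + b) / 2 for a, b in zip(first_model, comparison_model)]
    return store[key]
```

A caller could pass the returned second model back into `manage_model` to mirror the recitation that the second speech model is provided again to the determination unit and the voice model editing unit.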
The apparatus of claim 12,

wherein the contextual speaker identification system comprises:

A voice receiver for receiving a voice from the speaker;

A speech characteristic extractor for extracting speech characteristics from the received speech;

A contextual speech model generator for generating a speech model based on the extracted speech characteristics;

A storage unit in which the generated speech model is stored in a matrix form;

A random number generator for generating a random number to be used for identification of the speaker;

A speech model extraction unit for extracting a speech model at a position corresponding to the generated random number on the speech model DB in a matrix form of the storage unit;

A speech utterance request unit for requesting a predetermined speech utterance from the speaker based on the extracted speech model; and

A speaker identification unit for identifying the speaker by comparing the voice spoken by the speaker with the extracted voice model,

wherein the predetermined speech utterance is the reading aloud of a word or sentence that is preset at the position, on the matrix-form DB of the storage unit, corresponding to the generated random number.
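By way of non-limiting illustration only, the use of a random number to select a position on the matrix-form speech model DB, and the resulting utterance request, could be sketched as follows. The 2 x 2 example matrix, its words, the feature values, and the mapping from a random number to a row and column are assumptions of this sketch.

```python
import random


def pick_prompt(model_matrix, rng=None):
    """Pick the matrix cell addressed by a freshly generated random number."""
    rng = rng or random.Random()
    rows, cols = len(model_matrix), len(model_matrix[0])
    number = rng.randrange(rows * cols)   # random number used for identification
    row, col = divmod(number, cols)       # position on the matrix-form DB
    return (row, col), model_matrix[row][col]


# Hypothetical 2 x 2 matrix DB: each cell holds a preset word and its speech model.
matrix = [
    [{"text": "open", "model": [0.1, 0.4]}, {"text": "seven", "model": [0.3, 0.2]}],
    [{"text": "river", "model": [0.5, 0.9]}, {"text": "blue", "model": [0.7, 0.6]}],
]
position, cell = pick_prompt(matrix)
prompt = f"Please read aloud: {cell['text']}"   # the requested utterance
```

The speaker identification unit would then compare the voice spoken in response with `cell["model"]`; that comparison is outside this sketch.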
The apparatus of claim 12,

wherein the individual voice data includes at least one of a frequency, a pitch, a formant, a speech time, and a speech rate of the voice for each of the speaker's utterances, and

the similarity estimator of the context-based speech model management apparatus estimates the similarity between the individual voice data for each of the speaker's utterances.
The apparatus of claim 12,

further comprising a period setting unit for setting a management period of the voice model,

wherein, when all the voice models are updated within the set management period, the voice model editing unit maintains the existing matrix-form voice model DB on the storage unit of the contextual speaker identification system, and when at least one voice model is not updated within the set management period, the voice model editing unit deletes or maintains a part of the existing matrix-form voice model DB based on a new first voice model associated with the speaker.
The apparatus of claim 15,

wherein the voice model editing unit deletes the at least one non-updated voice model from the matrix-form voice model DB if no new first voice model associated with the speaker exists, and

if the new first voice model exists, the voice model editing unit compares the at least one non-updated voice model with the new first voice model, maintains the existing matrix-form voice model DB on the storage unit of the contextual speaker identification system if the difference is within a predetermined range, and deletes the at least one non-updated voice model from the matrix-form voice model DB if the difference is outside the range.
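By way of non-limiting illustration only, the handling of voice models that were not updated within the management period could be sketched as follows. The mean absolute difference, the `allowed_range` value, the example dates, and the dictionary layout of the matrix DB are assumptions of this sketch.

```python
from datetime import datetime, timedelta


def mean_abs_difference(a, b):
    """Assumed difference measure between two voice models (feature lists)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def prune_models(model_db, now, period, new_first_model=None, allowed_range=0.2):
    """Return the matrix DB entries that survive one management period.

    `model_db` maps a (row, column) position to
    {"model": [...], "updated": datetime}.
    """
    kept = {}
    for position, entry in model_db.items():
        if now - entry["updated"] <= period:
            kept[position] = entry   # updated within the period: maintain
        elif (new_first_model is not None and
              mean_abs_difference(entry["model"], new_first_model) <= allowed_range):
            kept[position] = entry   # not updated, but within range of the new model
        # Otherwise the non-updated model is deleted (not copied into `kept`).
    return kept


# Hypothetical DB with one recently updated model and one stale model.
now = datetime(2019, 1, 31)
period = timedelta(days=30)
model_db = {
    (0, 0): {"model": [1.0, 2.0], "updated": datetime(2019, 1, 20)},
    (0, 1): {"model": [5.0, 5.0], "updated": datetime(2018, 11, 1)},
}
remaining = prune_models(model_db, now, period)
```

With no new first voice model supplied, the stale entry at position (0, 1) is dropped, while supplying a sufficiently close new model would keep it.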
A method of managing a speech model using a context-based speech model management apparatus,

wherein the apparatus is capable of interworking with a contextual speaker identification system, the method comprising:

(a) generating and storing individual voice data each time a voice from the speaker is received;

(b) extracting each individual voice data and estimating the similarity between the individual voice data when a plurality of the individual voice data are stored;

(c) generating a first speech model of the speaker according to the at least one individual speech data selected based on the estimated similarity;

(d) determining whether a comparison speech model corresponding to the first speech model exists in the storage unit of the contextual speaker identification system, providing the first speech model to the storage unit for storage if no comparison speech model exists, and, if a comparison speech model exists, estimating a comparison similarity between the first speech model and the comparison speech model through the similarity estimator; and

(e) replacing the comparison speech model with the first speech model when the comparison similarity is equal to or greater than a predetermined reference value, and generating a second speech model by combining the first speech model and the comparison speech model when the comparison similarity is less than the predetermined reference value,

wherein steps (d) and (e) are repeated for the second speech model.
The method of claim 17,

further comprising setting, by a period setting unit of the apparatus, a management period of the voice model,

wherein, when all the voice models are updated within the set management period, the voice model editing unit of the apparatus maintains the existing matrix-form voice model DB on the storage unit of the contextual speaker identification system, and when at least one voice model is not updated within the set management period, the voice model editing unit deletes or maintains a part of the existing matrix-form voice model DB based on a new first voice model associated with the speaker.
The method of claim 18,

wherein the voice model editing unit deletes the at least one non-updated voice model from the matrix-form voice model DB if no new first voice model associated with the speaker exists, and

if the new first voice model exists, the voice model editing unit compares the at least one non-updated voice model with the new first voice model, maintains the existing matrix-form voice model DB on the storage unit of the contextual speaker identification system if the difference is within a predetermined range, and deletes the at least one non-updated voice model from the matrix-form voice model DB if the difference is outside the range.
20. A computer-readable recording medium having recorded thereon a program for implementing the method of any one of claims 17 to 19.