CN107240396B - Speaker self-adaptation method, device, equipment and storage medium - Google Patents
Speaker self-adaptation method, device, equipment and storage medium
- Publication number
- CN107240396B CN107240396B CN201710457375.4A CN201710457375A CN107240396B CN 107240396 B CN107240396 B CN 107240396B CN 201710457375 A CN201710457375 A CN 201710457375A CN 107240396 B CN107240396 B CN 107240396B
- Authority
- CN
- China
- Prior art keywords
- voice
- speaker
- training
- target speaker
- parameter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Abstract
The embodiments of the invention disclose a speaker adaptation method, apparatus, device, and storage medium. The speaker adaptation method comprises the following steps: acquiring first voice data of a target speaker; and inputting the first voice data into a pre-trained batch normalization (BN) network for adaptive training, to obtain a voice recognition model containing the voice parameters of the target speaker. By inputting the first voice data of the target speaker into the pre-trained BN network for adaptive training to obtain a voice recognition model containing the target speaker's voice parameters, the embodiments of the invention simplify the speaker adaptation process, reduce adaptation complexity, and improve adaptation performance.
Description
Technical Field
Embodiments of the invention relate to the technical field of voice recognition, and in particular to a speaker adaptation method, apparatus, device, and storage medium.
Background
Speaker adaptation technology has attracted increasing attention in recent years. It uses data from a specific speaker to modify a speaker-independent (SI) codebook, with the goal of obtaining a speaker-adaptive (SA) codebook that improves recognition performance.
When sufficient training data are available for a given speaker, a speaker-dependent (SD) codebook can be obtained with conventional training methods on that speaker's data. Such an SD codebook reflects the speaker's characteristics well and therefore performs well. In many cases, however, a speaker's data are insufficient to train a robust SD model, and adaptation is needed to avoid under-training. Whereas an SD codebook requires a large amount of data for training, speaker adaptation can achieve a large performance gain with only a small amount of data.
The essence of speaker adaptation is to adjust the SI codebook with adaptation data so that it conforms to the characteristics of the current speaker. Because an SI codebook obtained by conventional training is inevitably affected by the characteristics of the training set, the adaptation effect weakens when the training set and the adaptation data are mismatched; a codebook that is truly speaker-independent can approach the current speaker's characteristics more quickly during adaptation. Codebook training combined with adaptation models the SI codebook and the characteristics of each training-set speaker separately, yielding an SI codebook with greater speaker independence.
Currently, there are two main approaches to speaker adaptation. The first is feature-level adaptation: a transform built from the feature parameters of the voice signal maps speaker-dependent features to speaker-independent features, which are then fed into a speaker-independent model for recognition. The second is model-level adaptation: the speaker-independent model is adjusted with the speaker's voice data, a different acoustic model is adapted for each speaker, and the adapted model is then used for recognition.
However, both adaptation processes are complicated and usually require two decoding passes, so adaptation takes more time and is inefficient. Moreover, because a speaker's voice data are limited while many parameters require adaptation, the tension between the two degrades adaptation performance.
Disclosure of Invention
Embodiments of the present invention provide a speaker adaptive method, apparatus, device, and storage medium, which can simplify a speaker adaptive process, reduce adaptive complexity, and improve adaptive performance.
In a first aspect, an embodiment of the present invention provides a speaker adaptive method, where the method includes:
acquiring first voice data of a target speaker;
and inputting the first voice data into a batch normalized BN network obtained by pre-training for self-adaptive training to obtain a voice recognition model containing the voice parameters of the target speaker.
In a second aspect, an embodiment of the present invention further provides a speaker adaptive apparatus, where the apparatus includes:
the voice data acquisition module is used for acquiring first voice data of a target speaker;
and the model training module is used for inputting the first voice data into a batch standardized BN network obtained by pre-training for self-adaptive training to obtain a voice recognition model containing the voice parameters of the target speaker.
In a third aspect, an embodiment of the present invention further provides an apparatus, including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the speaker adaptation method according to any one of the embodiments of the invention.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement any one of the speaker adaptive methods according to the embodiments of the present invention.
By inputting the first voice data of the target speaker into the pre-trained batch normalization (BN) network for adaptive training to obtain a voice recognition model containing the target speaker's voice parameters, the embodiments of the invention simplify the speaker adaptation process, reduce adaptation complexity, and improve adaptation performance.
Drawings
FIG. 1 is a flow chart of a speaker adaptive method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a speaker adaptive method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of a speaker adaptive method according to a third embodiment of the present invention;
FIG. 4 is a block diagram of a speaker adaptive apparatus according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention.
It should be further noted that, for the convenience of description, only some but not all of the relevant aspects of the present invention are shown in the drawings. Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Example one
Fig. 1 is a flowchart of a speaker adaptation method according to an embodiment of the present invention. The method is applicable to speaker adaptation scenarios and can be executed by the speaker adaptation apparatus provided in an embodiment of the present invention. The apparatus can be implemented in software and/or hardware, and can be integrated in a terminal device or in an application end of the terminal device. The terminal device may be, but is not limited to, a mobile terminal (e.g., a tablet computer or smartphone).
The application end can be a plug-in of a certain client end embedded in the terminal equipment or a plug-in of an operating system of the terminal equipment, and is matched with a speaker self-adaptive client end embedded in the terminal equipment or a speaker self-adaptive application program in the operating system of the terminal equipment for use; the application end may also be an independent client end capable of providing speaker adaptation in the terminal device, which is not limited in this embodiment.
As shown in fig. 1, the method of this embodiment specifically includes:
s101, first voice data of a target speaker are obtained.
The voice data may be an original voice signal, or may be voice feature data obtained by processing the original voice signal.
Specifically, the voice data can be acquired through a voice input device or a recording device of the terminal device.
S102, inputting the first voice data into a Batch Normalization (BN) network obtained through pre-training for self-adaptive training, and obtaining a voice recognition model containing voice parameters of a target speaker.
The voice parameters are a variance and/or a mean, and can be obtained by training the BN network on the first voice data.
Specifically, the voice data are divided into m frames and the m frames are input into the BN network; a mean and a variance are obtained through the BN transformation formulas in the BN network, giving a voice recognition model containing that mean and variance:

μ_B = (1/m) · Σ_{i=1}^{m} x_i,  σ_B² = (1/m) · Σ_{i=1}^{m} (x_i − μ_B)²

wherein m is the number of frames of voice data, x_i is the i-th frame of voice data, μ_B is the mean, and σ_B² is the variance.
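As an illustration, the per-frame BN statistics described above can be sketched in Python. This is a minimal sketch, not the patent's implementation: the array layout, the scale/shift parameters gamma and beta, and the epsilon term are conventional batch-normalization details assumed here for completeness.

```python
import numpy as np

def bn_statistics(frames):
    """Per-dimension BN statistics over a batch of m frames:
    mu_B = (1/m) * sum_i x_i,  var_B = (1/m) * sum_i (x_i - mu_B)^2.
    `frames` is an (m, d) array: m speech frames, d features each."""
    mu = frames.mean(axis=0)
    var = ((frames - mu) ** 2).mean(axis=0)  # biased (population) variance
    return mu, var

def bn_transform(frames, mu, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each frame with the batch statistics, then scale and
    shift (gamma, beta, eps are assumed standard BN conventions)."""
    return gamma * (frames - mu) / np.sqrt(var + eps) + beta
```

Note that both `mu` and `var` are one-dimensional vectors of length d, which is why only a small number of parameters need adjusting during adaptation.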
Because BN-network adaptation does not require adding an extra layer, the adaptation process is simpler; and because the mean and variance obtained through the BN network are both one-dimensional vectors, fewer parameters need to be adjusted during adaptation. The voice parameters (namely the mean and variance) are obtained by adaptive training of the pre-trained BN network, without two-pass decoding.
Therefore, in the embodiment, the first speech data of the target speaker is input into the batch normalized BN network obtained by the pre-training for the adaptive training, so as to obtain the speech recognition model including the speech parameters of the target speaker, which can simplify the adaptive process of the speaker, reduce the adaptive complexity, and improve the adaptive performance.
Example two
Fig. 2 is a flowchart of a speaker adaptive method according to a second embodiment of the present invention. The embodiment is optimized based on the above embodiment, and in the embodiment, the method further includes the following steps of obtaining voice data of a reference speaker; and training according to the voice data of the reference speaker to obtain the BN network, wherein the BN network comprises global voice parameters and a voice recognition model comprising the global voice parameters.
Correspondingly, the method of the embodiment specifically includes:
s201, voice data of a reference speaker is obtained.
Wherein the number of the reference speakers is one or more.
S202, training according to the voice data of the reference speaker to obtain the batch normalization (BN) network, wherein the BN network comprises global voice parameters and a voice recognition model comprising the global voice parameters.
Wherein the global voice parameter is a variance and/or a mean. Specifically, the global voice parameter of each reference speaker can be obtained through the BN transformation formulas; these per-speaker parameters are then averaged to obtain the global voice parameters of the BN network, and a voice recognition model containing the global voice parameters is obtained by training.
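The averaging step described above might look like the following sketch; the function name and the (mean, variance) tuple layout are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def global_bn_parameters(per_speaker_stats):
    """Average each reference speaker's BN statistics into a single
    global (mean, variance) pair for the BN network.
    `per_speaker_stats` is a list of (mean, variance) array tuples,
    one tuple per reference speaker."""
    means = np.stack([m for m, _ in per_speaker_stats])
    variances = np.stack([v for _, v in per_speaker_stats])
    return means.mean(axis=0), variances.mean(axis=0)
```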
S203, first voice data of the target speaker are obtained.
S204, inputting the first voice data into the BN network for self-adaptive training to obtain a voice recognition model containing the voice parameters of the target speaker.
Specifically, the first voice data is input into the BN network to obtain the voice parameter of the target speaker, and the global voice parameter in the voice recognition model is replaced with the voice parameter of the target speaker to obtain the voice recognition model including the voice parameter of the target speaker. Or, in order to improve the speech recognition performance, the weighting of the speech parameter of the target speaker and the global speech parameter can be used as the final speech parameter of the target speaker, and the speech parameter is used to replace the global speech parameter in the speech recognition model, so as to obtain the speech recognition model containing the speech parameter of the target speaker.
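A minimal sketch of the replace-or-weight step above: with the default weights the global parameter is simply replaced by the target speaker's parameter, while non-default weights give the weighted combination. The function name and the example weight values are assumptions for illustration; the patent does not fix concrete values.

```python
import numpy as np

def adapt_parameter(speaker_param, global_param, w1=1.0, w2=0.0):
    """Adapted BN parameter. Defaults (w1=1, w2=0) implement plain
    replacement of the global parameter by the target speaker's one;
    e.g. w1=0.7, w2=0.3 gives the weighted form x1*w1 + x2*w2."""
    return w1 * np.asarray(speaker_param) + w2 * np.asarray(global_param)
```

The same mean/variance vectors are the only speaker-specific parameters, so each target speaker's recognition model differs from the others only in these values.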
Alternatively, when there are multiple target speakers, a speech recognition model specific to each target speaker can be obtained through the above-mentioned adaptive process, and the speech recognition models of each target speaker are identical except for the speech parameters (i.e., mean and variance).
In the embodiment, the BN network is obtained through training according to the voice data of the reference speaker, the BN network comprises global voice parameters and a voice recognition model comprising the global voice parameters, and then the first voice data of the target speaker is input into the BN network for self-adaptive training to obtain the voice recognition model comprising the voice parameters of the target speaker, so that the self-adaptive process of the speaker can be simplified, the self-adaptive complexity is reduced, and the self-adaptive performance is improved.
EXAMPLE III
Fig. 3 is a flowchart of a speaker adaptive method according to a third embodiment of the present invention. The present embodiment is optimized based on the above embodiment, and in the present embodiment, the method further includes the following steps: obtaining a voice parameter of the target speaker according to the second voice data of the target speaker; and inputting the voice parameters of the target speaker into the voice recognition model for recognition to obtain corresponding text information.
Correspondingly, the method of the embodiment specifically includes:
s301, first voice data of the target speaker are obtained.
S302, inputting the first voice data into a pre-trained BN network for self-adaptive training to obtain a voice recognition model containing the voice parameters of the target speaker.
And S303, obtaining the voice parameter of the target speaker according to the second voice data of the target speaker.
The first voice data and the second voice data may be the same data or different data.
Specifically, the second voice data of the target speaker is input into the BN network for adaptive training, so as to obtain the voice parameter of the target speaker. The speech parameters may be mean and variance, among others.
S304, inputting the voice parameters of the target speaker into a voice recognition model containing the voice parameters of the target speaker for recognition to obtain corresponding text information.
Specifically, the voice parameters of the target speaker can be input directly into the voice recognition model for recognition to obtain corresponding text information. Alternatively, the weighting of the target speaker's voice parameter and the global voice parameter is calculated, and the weighted result is input into the voice recognition model for recognition to obtain corresponding text information. For example, if the target speaker's voice parameter x1 has weight w1 and the global voice parameter x2 has weight w2, the weighted voice parameter is x1·w1 + x2·w2.
Because the voice recognition model of this embodiment is obtained by inputting the first voice data of the target speaker into the pre-trained BN network for adaptive training, and the BN network has high adaptation performance, this embodiment obtains the target speaker's voice parameter from the second voice data and inputs it into the voice recognition model containing the target speaker's voice parameters for recognition, obtaining corresponding text information and thereby improving voice recognition efficiency.
Example four
Fig. 4 is a block diagram of a speaker adaptive apparatus according to a fourth embodiment of the present invention. The embodiment can be suitable for the situation of speaker adaptation, the device can be implemented in a software and/or hardware manner, and the device can be integrated in the terminal equipment or an application end of the terminal equipment. The terminal device may be, but is not limited to, a mobile terminal (tablet computer or smartphone).
The application end can be a plug-in of a certain client end embedded in the terminal equipment or a plug-in of an operating system of the terminal equipment, and is matched with a speaker self-adaptive client end embedded in the terminal equipment or a speaker self-adaptive application program in the operating system of the terminal equipment for use; the application end may also be an independent client end capable of providing speaker adaptation in the terminal device, which is not limited in this embodiment.
As shown in fig. 4, the apparatus includes: a speech data acquisition module 401 and a model training module 402, wherein:
the voice data acquisition module 401 is configured to acquire first voice data of a target speaker;
the model training module 402 is configured to input the first speech data into a batch normalized BN network obtained by pre-training for adaptive training, so as to obtain a speech recognition model including speech parameters of a target speaker.
The speaker adaptive apparatus of the present embodiment is used for executing the speaker adaptive method of the above embodiments, and the technical principle and the generated technical effect are similar, and are not described herein again.
On the basis of the above embodiments, the apparatus further includes: a speech recognition module 403;
the voice recognition module 403 is configured to obtain a voice parameter of the target speaker according to the second voice data of the target speaker; and inputting the voice parameters of the target speaker into the voice recognition model for recognition to obtain corresponding text information.
On the basis of the foregoing embodiments, the voice data obtaining module 401 is further configured to: acquiring voice data of a reference speaker;
the model training module 402 is further configured to: and training according to the voice data of the reference speaker to obtain the BN network, wherein the BN network comprises global voice parameters and a voice recognition model comprising the global voice parameters.
On the basis of the foregoing embodiments, the model training module 402 is specifically configured to: and inputting the first voice data into the BN network to obtain the voice parameters of the target speaker, and replacing the global voice parameters in the voice recognition model with the voice parameters of the target speaker to obtain the voice recognition model containing the voice parameters of the target speaker.
On the basis of the foregoing embodiments, the speech recognition module 403 is specifically configured to: calculate the weighting of the voice parameter of the target speaker and the global voice parameter; and input the weighted result into the speech recognition model for recognition to obtain corresponding text information.
On the basis of the above embodiments, the speech parameters are variance and/or mean.
The speaker adaptive device provided by each embodiment can execute the speaker adaptive method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects for executing the speaker adaptive method.
EXAMPLE five
Fig. 5 is a schematic structural diagram of an apparatus according to a fifth embodiment of the present invention. FIG. 5 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in FIG. 5 is only an example and should not impose any limitations on the functionality or scope of use of embodiments of the present invention.
As shown in FIG. 5, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, and commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
The processing unit 16 executes various functional applications and data processing by executing programs stored in the system memory 28, for example, to implement the speaker adaptive method provided by the embodiment of the present invention:
acquiring first voice data of a target speaker;
and inputting the first voice data into a batch normalized BN network obtained by pre-training for self-adaptive training to obtain a voice recognition model containing the voice parameters of the target speaker.
Further, the method further comprises:
obtaining a voice parameter of the target speaker according to the second voice data of the target speaker;
and inputting the voice parameters of the target speaker into the voice recognition model for recognition to obtain corresponding text information.
Further, the method further comprises:
acquiring voice data of a reference speaker;
and training according to the voice data of the reference speaker to obtain the BN network, wherein the BN network comprises the global voice parameters and a voice recognition model comprising the global voice parameters.
Further, the inputting the first speech data into a batch normalized BN network obtained by pre-training for adaptive training to obtain a speech recognition model including the speech parameters of the target speaker includes:
and inputting the first voice data into the BN network to obtain the voice parameter of the target speaker, and replacing the global voice parameter in the voice recognition model with the voice parameter of the target speaker to obtain the voice recognition model containing the voice parameter of the target speaker.
Further, the inputting the voice parameter of the target speaker into the voice recognition model for recognition to obtain the corresponding text information includes:
calculating the weighting of the voice parameter of the target speaker and the global voice parameter;
and inputting the weighted result into the speech recognition model for recognition to obtain corresponding text information.
Further, the speech parameter is a variance and/or a mean.
EXAMPLE six
Embodiment 6 of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the speaker adaptive method according to the embodiments of the present invention:
acquiring first voice data of a target speaker;
and inputting the first voice data into a batch normalized BN network obtained by pre-training for self-adaptive training to obtain a voice recognition model containing the voice parameters of the target speaker.
Further, the method further comprises:
obtaining a voice parameter of the target speaker according to the second voice data of the target speaker;
and inputting the voice parameters of the target speaker into the voice recognition model for recognition to obtain corresponding text information.
Further, the method further comprises:
acquiring voice data of a reference speaker;
and training according to the voice data of the reference speaker to obtain the BN network, wherein the BN network comprises the global voice parameters and a voice recognition model comprising the global voice parameters.
Further, the inputting the first speech data into a batch of normalized BN networks obtained by pre-training for adaptive training to obtain a speech recognition model including the speech parameters of the target speaker includes:
and inputting the first voice data into the BN network to obtain the voice parameter of the target speaker, and replacing the global voice parameter in the voice recognition model with the voice parameter of the target speaker to obtain the voice recognition model containing the voice parameter of the target speaker.
Further, the inputting the voice parameter of the target speaker into the voice recognition model for recognition to obtain corresponding text information includes:
calculating the weighting of the voice parameter of the target speaker and the global voice parameter;
and inputting the weighted result into the speech recognition model for recognition to obtain corresponding text information.
Further, the speech parameter is a variance and/or a mean.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.
Claims (4)
1. A speaker adaptation method, comprising:
acquiring first voice data of a target speaker;
inputting the first voice data into a pre-trained batch normalization (BN) network for adaptive training, to obtain a voice recognition model containing voice parameters of the target speaker;
acquiring voice data of a reference speaker;
training the BN network on the voice data of the reference speaker, the BN network comprising global voice parameters and a voice recognition model that contains the global voice parameters;
wherein inputting the first voice data into the pre-trained BN network for adaptive training to obtain the voice recognition model containing the voice parameters of the target speaker comprises:
inputting the first voice data into the pre-trained BN network for adaptive training to obtain the voice parameters of the target speaker, and replacing the global voice parameters with the voice parameters of the target speaker, to obtain a voice recognition model containing the voice parameters of the target speaker; or,
inputting the first voice data into the pre-trained BN network for adaptive training to obtain the voice parameters of the target speaker, taking a weighted combination of the target speaker's voice parameters and the global voice parameters as final voice parameters of the target speaker, and replacing the global voice parameters in the voice recognition model with the final voice parameters, to obtain a voice recognition model containing the voice parameters of the target speaker;
the voice parameters are variance and/or mean values obtained through frame number calculation of voice data.
2. A speaker adaptive apparatus, comprising:
the voice data acquisition module is used for acquiring first voice data of a target speaker;
the model training module is used for inputting the first voice data into a pre-trained batch normalization (BN) network for adaptive training, to obtain a voice recognition model containing voice parameters of the target speaker;
the voice data acquisition module is further configured to: acquiring voice data of a reference speaker;
the model training module is further configured to: train the BN network on the voice data of the reference speaker, the BN network comprising global voice parameters and a voice recognition model that contains the global voice parameters;
the model training module is specifically configured to: input the first voice data into the pre-trained BN network for adaptive training to obtain the voice parameters of the target speaker, and replace the global voice parameters with the voice parameters of the target speaker, to obtain a voice recognition model containing the voice parameters of the target speaker; or input the first voice data into the pre-trained BN network for adaptive training to obtain the voice parameters of the target speaker, take a weighted combination of the target speaker's voice parameters and the global voice parameters as final voice parameters of the target speaker, and replace the global voice parameters in the voice recognition model with the final voice parameters, to obtain a voice recognition model containing the voice parameters of the target speaker; wherein the voice parameters are a variance and/or a mean calculated over the frames of the voice data.
3. A computer device, characterized in that the device comprises:
one or more processors;
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the speaker adaptation method of claim 1.
4. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the speaker adaptation method as claimed in claim 1.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710457375.4A CN107240396B (en) | 2017-06-16 | 2017-06-16 | Speaker self-adaptation method, device, equipment and storage medium |
US15/933,064 US10665225B2 (en) | 2017-06-16 | 2018-03-22 | Speaker adaption method and apparatus, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710457375.4A CN107240396B (en) | 2017-06-16 | 2017-06-16 | Speaker self-adaptation method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107240396A CN107240396A (en) | 2017-10-10 |
CN107240396B true CN107240396B (en) | 2023-01-17 |
Family
ID=59986433
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710457375.4A Active CN107240396B (en) | 2017-06-16 | 2017-06-16 | Speaker self-adaptation method, device, equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
US (1) | US10665225B2 (en) |
CN (1) | CN107240396B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108154235A (en) * | 2017-12-04 | 2018-06-12 | 盈盛资讯科技有限公司 | A kind of image question and answer inference method, system and device |
KR102225984B1 (en) * | 2018-09-03 | 2021-03-10 | 엘지전자 주식회사 | Device including battery |
CN109710499B (en) * | 2018-11-13 | 2023-01-17 | 平安科技(深圳)有限公司 | Computer equipment performance identification method and device |
CN112786016B (en) * | 2019-11-11 | 2022-07-19 | 北京声智科技有限公司 | Voice recognition method, device, medium and equipment |
US11183178B2 (en) | 2020-01-13 | 2021-11-23 | Microsoft Technology Licensing, Llc | Adaptive batching to reduce recognition latency |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102024455A (en) * | 2009-09-10 | 2011-04-20 | Sony Corporation | Speaker recognition system and method |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5787394A (en) * | 1995-12-13 | 1998-07-28 | International Business Machines Corporation | State-dependent speaker clustering for speaker adaptation |
JPH09179580A (en) * | 1995-12-27 | 1997-07-11 | Oki Electric Ind Co Ltd | Learning method for hidden markov model |
JP3156668B2 (en) * | 1998-06-19 | 2001-04-16 | 日本電気株式会社 | Voice recognition device |
US6275789B1 (en) * | 1998-12-18 | 2001-08-14 | Leo Moser | Method and apparatus for performing full bidirectional translation between a source language and a linked alternative language |
US8396859B2 (en) * | 2000-06-26 | 2013-03-12 | Oracle International Corporation | Subject matter context search engine |
US6606595B1 (en) * | 2000-08-31 | 2003-08-12 | Lucent Technologies Inc. | HMM-based echo model for noise cancellation avoiding the problem of false triggers |
US20110060587A1 (en) * | 2007-03-07 | 2011-03-10 | Phillips Michael S | Command and control utilizing ancillary information in a mobile voice-to-speech application |
US9633669B2 (en) * | 2013-09-03 | 2017-04-25 | Amazon Technologies, Inc. | Smart circular audio buffer |
WO2016145379A1 (en) * | 2015-03-12 | 2016-09-15 | William Marsh Rice University | Automated Compilation of Probabilistic Task Description into Executable Neural Network Specification |
US10332509B2 (en) * | 2015-11-25 | 2019-06-25 | Baidu USA, LLC | End-to-end speech recognition |
US10319076B2 (en) * | 2016-06-16 | 2019-06-11 | Facebook, Inc. | Producing higher-quality samples of natural images |
US10481863B2 (en) * | 2016-07-06 | 2019-11-19 | Baidu Usa Llc | Systems and methods for improved user interface |
CN106782510B (en) * | 2016-12-19 | 2020-06-02 | 苏州金峰物联网技术有限公司 | Place name voice signal recognition method based on continuous Gaussian mixture HMM model |
CN106683680B (en) * | 2017-03-10 | 2022-03-25 | 百度在线网络技术(北京)有限公司 | Speaker recognition method and device, computer equipment and computer readable medium |
- 2017-06-16: CN application CN201710457375.4A, granted as patent CN107240396B (active)
- 2018-03-22: US application 15/933,064, granted as patent US10665225B2 (active)
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102024455A (en) * | 2009-09-10 | 2011-04-20 | Sony Corporation | Speaker recognition system and method |
Also Published As
Publication number | Publication date |
---|---|
US20180366109A1 (en) | 2018-12-20 |
US10665225B2 (en) | 2020-05-26 |
CN107240396A (en) | 2017-10-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107240396B (en) | Speaker self-adaptation method, device, equipment and storage medium | |
JP6683234B2 (en) | Audio data processing method, device, equipment and program | |
CN110600017B (en) | Training method of voice processing model, voice recognition method, system and device | |
US20190066671A1 (en) | Far-field speech awaking method, device and terminal device | |
US10380996B2 (en) | Method and apparatus for correcting speech recognition result, device and computer-readable storage medium | |
JP2021086154A (en) | Method, device, apparatus, and computer-readable storage medium for speech recognition | |
US10672380B2 (en) | Dynamic enrollment of user-defined wake-up key-phrase for speech enabled computer system | |
CN108335694B (en) | Far-field environment noise processing method, device, equipment and storage medium | |
CN110600041B (en) | Voiceprint recognition method and device | |
WO2020207174A1 (en) | Method and apparatus for generating quantized neural network | |
CN110413812A (en) | Training method, device, electronic equipment and the storage medium of neural network model | |
CN109947924B (en) | Dialogue system training data construction method and device, electronic equipment and storage medium | |
WO2021174883A1 (en) | Voiceprint identity-verification model training method, apparatus, medium, and electronic device | |
CN114528044B (en) | Interface calling method, device, equipment and medium | |
CN111667843B (en) | Voice wake-up method and system for terminal equipment, electronic equipment and storage medium | |
CN111241043A (en) | Multimedia file sharing method, terminal and storage medium | |
CN110992975B (en) | Voice signal processing method and device and terminal | |
CN112397086A (en) | Voice keyword detection method and device, terminal equipment and storage medium | |
KR102556815B1 (en) | Electronic device and Method for controlling the electronic device thereof | |
CN107992457B (en) | Information conversion method, device, terminal equipment and storage medium | |
CN113035176B (en) | Voice data processing method and device, computer equipment and storage medium | |
JP2022116285A (en) | Voice processing method for vehicle, device, electronic apparatus, storage medium and computer program | |
JP7335460B2 (en) | clear text echo | |
CN111048096B (en) | Voice signal processing method and device and terminal | |
CN111899747B (en) | Method and apparatus for synthesizing audio |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||