CN107240396B - Speaker self-adaptation method, device, equipment and storage medium - Google Patents


Info

Publication number: CN107240396B
Application number: CN201710457375.4A
Authority: CN (China)
Original language: Chinese (zh)
Other versions: CN107240396A
Prior art keywords: voice, speaker, training, target speaker, parameter
Legal status: Active (granted)
Inventors: 黄俊, 李先刚, 蒋兵
Assignee (original and current): Beijing Baidu Netcom Science and Technology Co Ltd
Related US publication: US15/933,064 (granted as US10665225B2)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 Adaptation
    • G10L15/07 Adaptation to the speaker
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems

Abstract

Embodiments of the invention disclose a speaker adaptation method, apparatus, device, and storage medium. The speaker adaptation method includes: acquiring first voice data of a target speaker; and inputting the first voice data into a pre-trained batch normalization (BN) network for adaptive training to obtain a speech recognition model containing the voice parameters of the target speaker. By inputting the target speaker's first voice data into the pre-trained BN network for adaptive training to obtain a speech recognition model containing the target speaker's voice parameters, the embodiments simplify the speaker adaptation process, reduce adaptation complexity, and improve adaptation performance.

Description

Speaker self-adaptation method, device, equipment and storage medium
Technical Field
Embodiments of the invention relate to the technical field of speech recognition, and in particular to a speaker adaptation method, apparatus, device, and storage medium.
Background
Speaker adaptation has attracted increasing attention in recent years. The technique modifies a speaker-independent (SI) codebook using data from a specific speaker, aiming to obtain a speaker-adaptive (SA) codebook that improves recognition performance.
When a speaker's training data are sufficient, a speaker-dependent (SD) codebook can be obtained with conventional training on the current speaker's data; such an SD codebook reflects the speaker's characteristics well and therefore performs well. In many cases, however, the available data are insufficient to train a robust SD model, and adaptation is needed to avoid under-training. Whereas an SD codebook requires a large amount of training data, speaker adaptation achieves a substantial performance improvement with only a small amount of data.
The essence of speaker adaptation is to adjust the SI codebook with adaptation data so that it conforms to the characteristics of the current speaker. Because an SI codebook obtained by conventional training is inevitably affected by the characteristics of the training set, the adaptation effect diminishes when the training set and the adaptation data are mismatched; a codebook that is truly speaker-independent can approach the current speaker's characteristics more rapidly during adaptation. Codebook training combined with adaptation models the SI codebook and the characteristics of each speaker in the training set separately, yielding an SI codebook with greater speaker independence.
Currently there are two main approaches to speaker adaptation. The first is feature-level adaptation: a transformation built from the feature parameters of the speech signal maps speaker-dependent features into speaker-independent ones, which are then fed into a speaker-independent model for recognition. The second is model-level adaptation: the speaker-independent model is adjusted with each speaker's voice data to produce a different acoustic model for each speaker, and the adapted model is then used for recognition.
However, both adaptation processes are complicated and usually require two decoding passes, so adaptation takes considerable time and is inefficient. Moreover, because a speaker's voice data are limited while many parameters require adaptation, the contradiction between the two degrades adaptation performance.
Disclosure of Invention
Embodiments of the present invention provide a speaker adaptation method, apparatus, device, and storage medium that simplify the speaker adaptation process, reduce adaptation complexity, and improve adaptation performance.
In a first aspect, an embodiment of the present invention provides a speaker adaptation method, including:
acquiring first voice data of a target speaker; and
inputting the first voice data into a pre-trained batch normalization (BN) network for adaptive training to obtain a speech recognition model containing the voice parameters of the target speaker.
In a second aspect, an embodiment of the present invention further provides a speaker adaptation apparatus, including:
a voice data acquisition module, configured to acquire first voice data of a target speaker; and
a model training module, configured to input the first voice data into a pre-trained batch normalization (BN) network for adaptive training to obtain a speech recognition model containing the voice parameters of the target speaker.
In a third aspect, an embodiment of the present invention further provides a device, including:
one or more processors; and
a storage device for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the speaker adaptation method according to any embodiment of the invention.
In a fourth aspect, the present invention further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the speaker adaptation method according to any embodiment of the present invention.
By inputting the target speaker's first voice data into the pre-trained batch normalization (BN) network for adaptive training to obtain a speech recognition model containing the target speaker's voice parameters, the embodiments of the invention simplify the speaker adaptation process, reduce adaptation complexity, and improve adaptation performance.
Drawings
FIG. 1 is a flow chart of a speaker adaptive method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a speaker adaptive method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of a speaker adaptive method according to a third embodiment of the present invention;
FIG. 4 is a block diagram of a speaker adaptive apparatus according to a fourth embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention.
It should be further noted that, for the convenience of description, only some but not all of the relevant aspects of the present invention are shown in the drawings. Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Example one
Fig. 1 is a flowchart of a speaker adaptation method according to a first embodiment of the present invention. The method is applicable to speaker adaptation scenarios and can be executed by the speaker adaptation apparatus provided by the embodiments of the present invention. The apparatus may be implemented in software and/or hardware and integrated in a terminal device or in an application on the terminal device. The terminal device may be, but is not limited to, a mobile terminal such as a tablet computer or a smartphone.
The application may be a plug-in of a client embedded in the terminal device, or a plug-in of the terminal device's operating system, used together with a speaker adaptation client embedded in the terminal device or a speaker adaptation application in the operating system; the application may also be a standalone client in the terminal device that provides speaker adaptation, which is not limited in this embodiment.
As shown in fig. 1, the method of this embodiment specifically includes:
s101, first voice data of a target speaker are obtained.
The voice data may be an original voice signal, or may be voice feature data obtained by processing the original voice signal.
Specifically, the voice data can be acquired through a voice input device or a recording device of the terminal device.
S102, the first voice data are input into a pre-trained batch normalization (BN) network for adaptive training to obtain a speech recognition model containing the voice parameters of the target speaker.
The voice parameters are the variance and/or mean, and can be obtained by training the BN network on the first voice data.
Specifically, the voice data are divided into m frames and input into the BN network; the variance and mean are obtained through the BN transformation formulas, giving a speech recognition model containing the variance and mean:

μ_B = (1/m) Σ_{i=1}^{m} x_i

σ_B² = (1/m) Σ_{i=1}^{m} (x_i − μ_B)²

where m is the number of frames of voice data, x_i is the i-th frame of voice data, μ_B is the mean, and σ_B² is the variance.
BN-network adaptation requires no additional layers, so the adaptation process is simpler. Moreover, the mean and variance obtained through the BN network are both one-dimensional vectors, so few parameters need to be adjusted during adaptation, and the voice parameters (i.e., the mean and variance) are obtained by adaptively training the pre-trained BN network without two-pass decoding.
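The BN transformation formulas above can be sketched in a few lines. This is a minimal illustrative example, not the patent's implementation; the function names are assumptions:

```python
import numpy as np

def bn_speaker_stats(frames):
    """Estimate per-dimension BN statistics from a speaker's voice frames.

    frames: array of shape (m, d), i.e. m frames of d-dimensional features.
    Returns (mu_B, sigma2_B), each a length-d vector, following the BN
    transformation formulas: mu_B = (1/m) sum x_i and
    sigma_B^2 = (1/m) sum (x_i - mu_B)^2.
    """
    frames = np.asarray(frames, dtype=float)
    m = frames.shape[0]
    mu_b = frames.sum(axis=0) / m
    sigma2_b = ((frames - mu_b) ** 2).sum(axis=0) / m
    return mu_b, sigma2_b

def bn_transform(frames, mu_b, sigma2_b, eps=1e-5):
    """Normalize frames with the (adapted) BN statistics."""
    return (frames - mu_b) / np.sqrt(sigma2_b + eps)
```

Because only these two length-d vectors change per speaker, adaptation touches far fewer parameters than retraining the acoustic model.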
Therefore, in this embodiment, inputting the target speaker's first voice data into the pre-trained batch normalization (BN) network for adaptive training to obtain a speech recognition model containing the target speaker's voice parameters simplifies the speaker adaptation process, reduces adaptation complexity, and improves adaptation performance.
Example two
Fig. 2 is a flowchart of a speaker adaptation method according to a second embodiment of the present invention. This embodiment is optimized on the basis of the embodiment above; here, the method further includes: acquiring voice data of a reference speaker; and training a BN network from the reference speaker's voice data, wherein the BN network contains global voice parameters and a speech recognition model containing the global voice parameters.
Correspondingly, the method of the embodiment specifically includes:
s201, voice data of a reference speaker is obtained.
Wherein the number of the reference speakers is one or more.
S202, a batch normalization (BN) network is trained from the voice data of the reference speaker, wherein the BN network contains global voice parameters and a speech recognition model containing the global voice parameters.
The global voice parameters are the variance and/or mean. Specifically, the global voice parameters of each reference speaker can be obtained through the BN transformation formulas and then averaged to yield the global voice parameters of the BN network; a speech recognition model containing the global voice parameters is obtained through training.
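As an illustrative sketch of this averaging step (the function name and data layout are assumptions, not the patent's API), the per-speaker statistics from each reference speaker can be averaged into a single global pair:

```python
import numpy as np

def global_bn_stats(per_speaker_stats):
    """Average per-speaker BN statistics into global statistics.

    per_speaker_stats: list of (mean, variance) pairs, one pair per
    reference speaker, each entry a length-d vector. Returns the
    element-wise averages (global mean, global variance).
    """
    means = np.array([mu for mu, _ in per_speaker_stats], dtype=float)
    variances = np.array([var for _, var in per_speaker_stats], dtype=float)
    return means.mean(axis=0), variances.mean(axis=0)
```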
S203, first voice data of the target speaker are obtained.
S204, the first voice data are input into the BN network for adaptive training to obtain a speech recognition model containing the voice parameters of the target speaker.
Specifically, the first voice data are input into the BN network to obtain the target speaker's voice parameters, and the global voice parameters in the speech recognition model are replaced with the target speaker's voice parameters, yielding a speech recognition model containing the target speaker's voice parameters. Alternatively, to improve recognition performance, a weighted combination of the target speaker's voice parameters and the global voice parameters can be used as the target speaker's final voice parameters, which then replace the global voice parameters in the model.
When there are multiple target speakers, a speech recognition model specific to each target speaker can be obtained through the above adaptation process; the models of the target speakers are identical except for the voice parameters (i.e., the mean and variance).
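This per-speaker specialization can be sketched as a single shared model that swaps BN statistics in per speaker at recognition time. The class and method names below are illustrative assumptions, not part of the patent:

```python
import numpy as np

class SharedModelWithSpeakerBN:
    """A single shared model in which only BN statistics differ per speaker."""

    def __init__(self, global_mean, global_var):
        self.global_mean = np.asarray(global_mean, dtype=float)
        self.global_var = np.asarray(global_var, dtype=float)
        self.speaker_stats = {}  # speaker id -> (mean, variance)

    def adapt(self, speaker_id, mean, var):
        """Store a target speaker's adapted BN statistics."""
        self.speaker_stats[speaker_id] = (np.asarray(mean, dtype=float),
                                          np.asarray(var, dtype=float))

    def stats_for(self, speaker_id):
        """Return the speaker's BN statistics, falling back to the global ones."""
        return self.speaker_stats.get(
            speaker_id, (self.global_mean, self.global_var))
```

Only the two statistics vectors are stored per speaker; everything else in the model is shared.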
In this embodiment, a BN network is trained from the voice data of a reference speaker and contains global voice parameters and a speech recognition model containing the global voice parameters; the target speaker's first voice data are then input into the BN network for adaptive training to obtain a speech recognition model containing the target speaker's voice parameters. This simplifies the speaker adaptation process, reduces adaptation complexity, and improves adaptation performance.
Example three
Fig. 3 is a flowchart of a speaker adaptation method according to a third embodiment of the present invention. This embodiment is optimized on the basis of the embodiments above; here, the method further includes: obtaining the target speaker's voice parameters from second voice data of the target speaker; and inputting the target speaker's voice parameters into the speech recognition model for recognition to obtain corresponding text information.
Correspondingly, the method of the embodiment specifically includes:
s301, first voice data of the target speaker are obtained.
S302, the first voice data are input into a pre-trained BN network for adaptive training to obtain a speech recognition model containing the voice parameters of the target speaker.
S303, the target speaker's voice parameters are obtained from the second voice data of the target speaker.
The first voice data and the second voice data may be the same data or different data.
Specifically, the second voice data of the target speaker is input into the BN network for adaptive training, so as to obtain the voice parameter of the target speaker. The speech parameters may be mean and variance, among others.
S304, the target speaker's voice parameters are input into the speech recognition model containing the target speaker's voice parameters for recognition to obtain corresponding text information.
Specifically, the target speaker's voice parameters can be input directly into the speech recognition model for recognition to obtain corresponding text information. Alternatively, a weighted combination of the target speaker's voice parameters and the global voice parameters is computed and then input into the speech recognition model for recognition. For example, if the target speaker's voice parameter x1 has weight w1 and the global voice parameter x2 has weight w2, the weighted parameter is x1 * w1 + x2 * w2.
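The weighted combination can be sketched in a few lines; the default weights below are illustrative hyperparameters, not values specified in the patent:

```python
import numpy as np

def interpolate_params(speaker_param, global_param, w1=0.7, w2=0.3):
    """Weighted combination x1 * w1 + x2 * w2 of the target speaker's
    parameter (x1) and the global parameter (x2), element-wise."""
    x1 = np.asarray(speaker_param, dtype=float)
    x2 = np.asarray(global_param, dtype=float)
    return x1 * w1 + x2 * w2
```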
Because the speech recognition model of this embodiment is obtained by inputting the target speaker's first voice data into the pre-trained BN network for adaptive training, and the BN network has high adaptation performance, inputting the voice parameters obtained from the target speaker's second voice data into the speech recognition model containing the target speaker's voice parameters yields the corresponding text information and improves speech recognition efficiency.
Example four
Fig. 4 is a block diagram of a speaker adaptation apparatus according to a fourth embodiment of the present invention. The embodiment is applicable to speaker adaptation scenarios; the apparatus may be implemented in software and/or hardware and integrated in a terminal device or in an application on the terminal device. The terminal device may be, but is not limited to, a mobile terminal such as a tablet computer or a smartphone.
The application may be a plug-in of a client embedded in the terminal device, or a plug-in of the terminal device's operating system, used together with a speaker adaptation client embedded in the terminal device or a speaker adaptation application in the operating system; the application may also be a standalone client in the terminal device that provides speaker adaptation, which is not limited in this embodiment.
As shown in fig. 4, the apparatus includes: a speech data acquisition module 401 and a model training module 402, wherein:
the voice data acquisition module 401 is configured to acquire first voice data of a target speaker;
the model training module 402 is configured to input the first voice data into a pre-trained batch normalization (BN) network for adaptive training to obtain a speech recognition model containing the voice parameters of the target speaker.
The speaker adaptive apparatus of the present embodiment is used for executing the speaker adaptive method of the above embodiments, and the technical principle and the generated technical effect are similar, and are not described herein again.
On the basis of the above embodiments, the apparatus further includes: a speech recognition module 403;
the voice recognition module 403 is configured to obtain a voice parameter of the target speaker according to the second voice data of the target speaker; and inputting the voice parameters of the target speaker into the voice recognition model for recognition to obtain corresponding text information.
On the basis of the foregoing embodiments, the voice data obtaining module 401 is further configured to: acquiring voice data of a reference speaker;
the model training module 402 is further configured to: and training according to the voice data of the reference speaker to obtain the BN network, wherein the BN network comprises global voice parameters and a voice recognition model comprising the global voice parameters.
On the basis of the foregoing embodiments, the model training module 402 is specifically configured to: and inputting the first voice data into the BN network to obtain the voice parameters of the target speaker, and replacing the global voice parameters in the voice recognition model with the voice parameters of the target speaker to obtain the voice recognition model containing the voice parameters of the target speaker.
On the basis of the foregoing embodiments, the speech recognition module 403 is specifically configured to: calculate a weighted combination of the target speaker's voice parameters and the global voice parameters; and input the weighted parameters into the speech recognition model for recognition to obtain corresponding text information.
On the basis of the above embodiments, the speech parameters are variance and/or mean.
The speaker adaptation apparatus provided by the above embodiments can execute the speaker adaptation method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects for executing the method.
Example five
Fig. 5 is a schematic structural diagram of an apparatus according to a fifth embodiment of the present invention. FIG. 5 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in FIG. 5 is only an example and should not impose any limitations on the functionality or scope of use of embodiments of the present invention.
As shown in FIG. 5, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, and commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, computer device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown in FIG. 5, the network adapter 20 communicates with the other modules of the computer device 12 via the bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing by executing programs stored in the system memory 28, for example, to implement the speaker adaptive method provided by the embodiment of the present invention:
acquiring first voice data of a target speaker;
and inputting the first voice data into a pre-trained batch normalization (BN) network for adaptive training to obtain a speech recognition model containing the voice parameters of the target speaker.
Further, the method further comprises:
obtaining a voice parameter of the target speaker according to the second voice data of the target speaker;
and inputting the voice parameters of the target speaker into the voice recognition model for recognition to obtain corresponding text information.
Further, the method further comprises:
acquiring voice data of a reference speaker;
and training according to the voice data of the reference speaker to obtain the BN network, wherein the BN network comprises the global voice parameters and a voice recognition model comprising the global voice parameters.
Further, the inputting the first voice data into a pre-trained batch normalization (BN) network for adaptive training to obtain a speech recognition model containing the voice parameters of the target speaker includes:
and inputting the first voice data into the BN network to obtain the voice parameter of the target speaker, and replacing the global voice parameter in the voice recognition model with the voice parameter of the target speaker to obtain the voice recognition model containing the voice parameter of the target speaker.
Further, the inputting the voice parameter of the target speaker into the voice recognition model for recognition to obtain the corresponding text information includes:
calculating the weighting of the voice parameter of the target speaker and the global voice parameter;
and inputting the weighted parameters into the speech recognition model for recognition to obtain corresponding text information.
Further, the speech parameter is a variance and/or a mean.
Example six
Embodiment 6 of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the speaker adaptive method according to the embodiments of the present invention:
acquiring first voice data of a target speaker;
and inputting the first voice data into a pre-trained batch normalization (BN) network for adaptive training to obtain a speech recognition model containing the voice parameters of the target speaker.
Further, the method further comprises:
obtaining a voice parameter of the target speaker according to the second voice data of the target speaker;
and inputting the voice parameters of the target speaker into the voice recognition model for recognition to obtain corresponding text information.
Further, the method further comprises:
acquiring voice data of a reference speaker;
and training according to the voice data of the reference speaker to obtain the BN network, wherein the BN network comprises the global voice parameters and a voice recognition model comprising the global voice parameters.
Further, the inputting the first voice data into a pre-trained batch normalization (BN) network for adaptive training to obtain a speech recognition model containing the voice parameters of the target speaker includes:
inputting the first voice data into the BN network to obtain the voice parameters of the target speaker, and replacing the global voice parameters in the voice recognition model with the voice parameters of the target speaker to obtain a voice recognition model containing the voice parameters of the target speaker.
Further, the inputting the voice parameter of the target speaker into the voice recognition model for recognition to obtain corresponding text information includes:
calculating a weighted combination of the voice parameters of the target speaker and the global voice parameters;
and inputting the weighted result into the voice recognition model for recognition to obtain corresponding text information.
Further, the voice parameter is a variance and/or a mean.
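The weighting variant described above can be sketched as follows (a hypothetical illustration; the interpolation weight `alpha` and all numeric values are assumptions, not specified by the patent). The final voice parameters are a convex combination of the target speaker's statistics and the global statistics, which guards against noisy estimates when only a small amount of speaker data is available:

```python
import numpy as np

def weighted_bn_stats(speaker_mean, speaker_var, global_mean, global_var, alpha=0.7):
    """Interpolate target-speaker and global BN statistics.

    alpha is an illustrative interpolation weight (an assumption,
    not a value given by the patent): alpha=1 keeps only the speaker
    statistics, alpha=0 falls back to the global statistics.
    """
    final_mean = alpha * speaker_mean + (1.0 - alpha) * global_mean
    final_var = alpha * speaker_var + (1.0 - alpha) * global_var
    return final_mean, final_var

# Illustrative statistics for a 3-dimensional feature.
speaker_mean = np.array([0.6, -0.2, 0.1])
speaker_var = np.array([1.5, 0.8, 1.1])
global_mean = np.zeros(3)
global_var = np.ones(3)

final_mean, final_var = weighted_bn_stats(speaker_mean, speaker_var,
                                          global_mean, global_var, alpha=0.5)
print(final_mean)  # values close to [0.3, -0.1, 0.05]
print(final_var)   # values close to [1.25, 0.9, 1.05]
```

The interpolated `final_mean` and `final_var` would then replace the global statistics in the model's BN layers before recognition, as the method describes.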
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (4)

1. A speaker adaptation method, comprising:
acquiring first voice data of a target speaker;
inputting the first voice data into a pre-trained batch normalization (BN) network for adaptive training to obtain a voice recognition model containing the voice parameters of the target speaker;
acquiring voice data of a reference speaker;
training the BN network according to the voice data of the reference speaker, wherein the BN network contains global voice parameters, and obtaining a voice recognition model containing the global voice parameters;
wherein the inputting the first voice data into the pre-trained batch normalization (BN) network for adaptive training to obtain the voice recognition model containing the voice parameters of the target speaker includes:
inputting the first voice data into the pre-trained BN network for adaptive training to obtain the voice parameters of the target speaker, and replacing the global voice parameters with the voice parameters of the target speaker to obtain a voice recognition model containing the voice parameters of the target speaker; or,
inputting the first voice data into the pre-trained BN network for adaptive training to obtain the voice parameters of the target speaker, taking a weighted combination of the voice parameters of the target speaker and the global voice parameters as the final voice parameters of the target speaker, and replacing the global voice parameters in the voice recognition model with the final voice parameters to obtain a voice recognition model containing the voice parameters of the target speaker;
wherein the voice parameters are variances and/or means calculated over the frames of the voice data.
2. A speaker adaptive apparatus, comprising:
the voice data acquisition module is used for acquiring first voice data of a target speaker;
the model training module is used for inputting the first voice data into a pre-trained batch normalization (BN) network for adaptive training to obtain a voice recognition model containing the voice parameters of the target speaker;
the voice data acquisition module is further configured to: acquiring voice data of a reference speaker;
the model training module is further configured to: train the BN network according to the voice data of the reference speaker, wherein the BN network contains global voice parameters, and obtain a voice recognition model containing the global voice parameters;
the model training module is specifically configured to: input the first voice data into the pre-trained BN network for adaptive training to obtain the voice parameters of the target speaker, and replace the global voice parameters with the voice parameters of the target speaker to obtain a voice recognition model containing the voice parameters of the target speaker; or input the first voice data into the pre-trained BN network for adaptive training to obtain the voice parameters of the target speaker, take a weighted combination of the voice parameters of the target speaker and the global voice parameters as the final voice parameters of the target speaker, and replace the global voice parameters in the voice recognition model with the final voice parameters to obtain a voice recognition model containing the voice parameters of the target speaker; wherein the voice parameters are variances and/or means calculated over the frames of the voice data.
3. A computer device, characterized in that the device comprises:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speaker adaptation method as recited in claim 1.
4. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the speaker adaptation method as claimed in claim 1.
CN201710457375.4A 2017-06-16 2017-06-16 Speaker self-adaptation method, device, equipment and storage medium Active CN107240396B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710457375.4A CN107240396B (en) 2017-06-16 2017-06-16 Speaker self-adaptation method, device, equipment and storage medium
US15/933,064 US10665225B2 (en) 2017-06-16 2018-03-22 Speaker adaption method and apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710457375.4A CN107240396B (en) 2017-06-16 2017-06-16 Speaker self-adaptation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN107240396A CN107240396A (en) 2017-10-10
CN107240396B true CN107240396B (en) 2023-01-17

Family

ID=59986433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710457375.4A Active CN107240396B (en) 2017-06-16 2017-06-16 Speaker self-adaptation method, device, equipment and storage medium

Country Status (2)

Country Link
US (1) US10665225B2 (en)
CN (1) CN107240396B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154235A (en) * 2017-12-04 2018-06-12 盈盛资讯科技有限公司 A kind of image question and answer inference method, system and device
KR102225984B1 (en) * 2018-09-03 2021-03-10 엘지전자 주식회사 Device including battery
CN109710499B (en) * 2018-11-13 2023-01-17 平安科技(深圳)有限公司 Computer equipment performance identification method and device
CN112786016B (en) * 2019-11-11 2022-07-19 北京声智科技有限公司 Voice recognition method, device, medium and equipment
US11183178B2 (en) 2020-01-13 2021-11-23 Microsoft Technology Licensing, Llc Adaptive batching to reduce recognition latency

Citations (1)

Publication number Priority date Publication date Assignee Title
CN102024455A (en) * 2009-09-10 2011-04-20 索尼株式会社 Speaker recognition system and method

Family Cites Families (14)

Publication number Priority date Publication date Assignee Title
US5787394A (en) * 1995-12-13 1998-07-28 International Business Machines Corporation State-dependent speaker clustering for speaker adaptation
JPH09179580A (en) * 1995-12-27 1997-07-11 Oki Electric Ind Co Ltd Learning method for hidden markov model
JP3156668B2 (en) * 1998-06-19 2001-04-16 日本電気株式会社 Voice recognition device
US6275789B1 (en) * 1998-12-18 2001-08-14 Leo Moser Method and apparatus for performing full bidirectional translation between a source language and a linked alternative language
US8396859B2 (en) * 2000-06-26 2013-03-12 Oracle International Corporation Subject matter context search engine
US6606595B1 (en) * 2000-08-31 2003-08-12 Lucent Technologies Inc. HMM-based echo model for noise cancellation avoiding the problem of false triggers
US20110060587A1 (en) * 2007-03-07 2011-03-10 Phillips Michael S Command and control utilizing ancillary information in a mobile voice-to-speech application
US9633669B2 (en) * 2013-09-03 2017-04-25 Amazon Technologies, Inc. Smart circular audio buffer
WO2016145379A1 (en) * 2015-03-12 2016-09-15 William Marsh Rice University Automated Compilation of Probabilistic Task Description into Executable Neural Network Specification
US10332509B2 (en) * 2015-11-25 2019-06-25 Baidu USA, LLC End-to-end speech recognition
US10319076B2 (en) * 2016-06-16 2019-06-11 Facebook, Inc. Producing higher-quality samples of natural images
US10481863B2 (en) * 2016-07-06 2019-11-19 Baidu Usa Llc Systems and methods for improved user interface
CN106782510B (en) * 2016-12-19 2020-06-02 苏州金峰物联网技术有限公司 Place name voice signal recognition method based on continuous Gaussian mixture HMM model
CN106683680B (en) * 2017-03-10 2022-03-25 百度在线网络技术(北京)有限公司 Speaker recognition method and device, computer equipment and computer readable medium

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN102024455A (en) * 2009-09-10 2011-04-20 索尼株式会社 Speaker recognition system and method

Also Published As

Publication number Publication date
US20180366109A1 (en) 2018-12-20
US10665225B2 (en) 2020-05-26
CN107240396A (en) 2017-10-10

Similar Documents

Publication Publication Date Title
CN107240396B (en) Speaker self-adaptation method, device, equipment and storage medium
JP6683234B2 (en) Audio data processing method, device, equipment and program
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
US20190066671A1 (en) Far-field speech awaking method, device and terminal device
US10380996B2 (en) Method and apparatus for correcting speech recognition result, device and computer-readable storage medium
JP2021086154A (en) Method, device, apparatus, and computer-readable storage medium for speech recognition
US10672380B2 (en) Dynamic enrollment of user-defined wake-up key-phrase for speech enabled computer system
CN108335694B (en) Far-field environment noise processing method, device, equipment and storage medium
CN110600041B (en) Voiceprint recognition method and device
WO2020207174A1 (en) Method and apparatus for generating quantized neural network
CN110413812A (en) Training method, device, electronic equipment and the storage medium of neural network model
CN109947924B (en) Dialogue system training data construction method and device, electronic equipment and storage medium
WO2021174883A1 (en) Voiceprint identity-verification model training method, apparatus, medium, and electronic device
CN114528044B (en) Interface calling method, device, equipment and medium
CN111667843B (en) Voice wake-up method and system for terminal equipment, electronic equipment and storage medium
CN111241043A (en) Multimedia file sharing method, terminal and storage medium
CN110992975B (en) Voice signal processing method and device and terminal
CN112397086A (en) Voice keyword detection method and device, terminal equipment and storage medium
KR102556815B1 (en) Electronic device and Method for controlling the electronic device thereof
CN107992457B (en) Information conversion method, device, terminal equipment and storage medium
CN113035176B (en) Voice data processing method and device, computer equipment and storage medium
JP2022116285A (en) Voice processing method for vehicle, device, electronic apparatus, storage medium and computer program
JP7335460B2 (en) clear text echo
CN111048096B (en) Voice signal processing method and device and terminal
CN111899747B (en) Method and apparatus for synthesizing audio

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant