CN113035177B - Acoustic model training method and device

Acoustic model training method and device

Info

Publication number
CN113035177B
Authority
CN
China
Prior art keywords
vector corresponding
voice frame
voice
frame
acoustic model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110264782.XA
Other languages
Chinese (zh)
Other versions
CN113035177A (en)
Inventor
鄢楷强
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110264782.XA priority Critical patent/CN113035177B/en
Publication of CN113035177A publication Critical patent/CN113035177A/en
Application granted granted Critical
Publication of CN113035177B publication Critical patent/CN113035177B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses an acoustic model training method and device, wherein the acoustic model training method comprises the following steps: acquiring a plurality of voice frames from at least two channels, wherein one channel corresponds to one channel class; for each voice frame in the plurality of voice frames, determining the channel class corresponding to the channel from which the voice frame comes, and performing one-hot encoding on the channel class to obtain a one-hot encoding vector corresponding to the voice frame; acquiring a feature vector representing the voice features of the voice frame; obtaining a first vector corresponding to the voice frame according to the one-hot encoding vector corresponding to the voice frame and the feature vector corresponding to the voice frame; and performing model training on the acoustic model to be trained according to the first vector corresponding to each voice frame in the plurality of voice frames. By adopting the method and the device, computing resources can be saved and maintenance cost reduced.

Description

Acoustic model training method and device
Technical Field
The invention relates to the technical field of model training, in particular to an acoustic model training method and device.
Background
With the continuous development of speech recognition technology, more and more scenarios are empowered by it, such as intelligent hardware, telephone customer service, conference systems and vehicle-mounted scenarios. Voice signals originating from different devices may have channel differences; for example, traditional fixed-line telephone speech (8 kHz sampling rate) and mobile phone microphone speech (16 kHz sampling rate) come from different channels and have different channel characteristics. Comparing the time-domain and frequency-domain parameters of voice signals under different channels reveals obvious differences in signal frequency, broadband noise, resonance noise and the like. At present, because voice signals from different channels differ, an acoustic model is usually trained separately for each channel, which occupies additional computing resources and increases maintenance cost.
Disclosure of Invention
The embodiment of the invention provides an acoustic model training method and device, which can train a single acoustic model on a combination of voice frames from channels of various channel classes, thereby saving computing resources and reducing maintenance cost.
In a first aspect, an embodiment of the present invention provides an acoustic model training method, including:
acquiring a plurality of voice frames from at least two channels, wherein one channel corresponds to one channel class;
for each voice frame in the plurality of voice frames, determining a channel class corresponding to the channel from which the voice frame comes, and performing one-hot encoding on the channel class to obtain a one-hot encoding vector corresponding to the voice frame;
acquiring a feature vector for representing a voice feature of the voice frame;
obtaining a first vector corresponding to the voice frame according to the one-hot encoding vector corresponding to the voice frame and the feature vector corresponding to the voice frame;
and carrying out model training on the acoustic model to be trained according to the first vector corresponding to each voice frame in the plurality of voice frames.
In one possible design, the obtaining the first vector corresponding to the speech frame according to the one-hot encoding vector corresponding to the speech frame and the feature vector corresponding to the speech frame includes:
and splicing the one-hot encoding vector corresponding to the voice frame and the feature vector corresponding to the voice frame to obtain a first vector corresponding to the voice frame.
In one possible design, the obtaining the first vector corresponding to the speech frame according to the one-hot encoding vector corresponding to the speech frame and the feature vector corresponding to the speech frame includes:
processing the one-hot encoding vector corresponding to the voice frame by using an embedding layer to obtain a second vector corresponding to the voice frame;
and splicing the second vector corresponding to the voice frame with the feature vector corresponding to the voice frame to obtain the first vector corresponding to the voice frame.
In one possible design, the acoustic model includes a plurality of hidden layers connected in sequence, the method further comprising:
determining at least one selected hidden layer from the plurality of hidden layers, the at least one selected hidden layer being a hidden layer other than the first hidden layer of the plurality of hidden layers;
for each selected hidden layer in the at least one selected hidden layer, acquiring an intermediate vector corresponding to the voice frame output by the hidden layer before the selected hidden layer;
splicing the second vector corresponding to the voice frame and the intermediate vector corresponding to the voice frame to obtain a third vector corresponding to the voice frame;
performing model training on an acoustic model to be trained according to a first vector corresponding to each voice frame in the plurality of voice frames, including:
inputting a first vector corresponding to each of the plurality of speech frames into the first hidden layer of the acoustic model to be trained; and inputting a third vector corresponding to each of the plurality of speech frames into the selected hidden layer of the acoustic model to be trained, so as to perform model training on the acoustic model.
In one possible design, the method further comprises:
acquiring the dimension of the model parameter vector of the acoustic model, and adjusting the dimension of the model parameter vector of the embedding layer according to the dimension of the model parameter vector of the acoustic model;
the processing the one-hot encoding vector corresponding to the voice frame by using the embedding layer to obtain a second vector corresponding to the voice frame includes:
and processing the one-hot encoding vector corresponding to the voice frame by using the adjusted embedding layer to obtain a second vector corresponding to the voice frame.
In one possible design, the method further comprises:
acquiring a state quantity representing the degree of difference between the at least two channels, and adjusting the dimension of the model parameter vector of the embedding layer according to the state quantity;
the processing the one-hot encoding vector corresponding to the voice frame by using the embedding layer to obtain a second vector corresponding to the voice frame includes:
and processing the one-hot encoding vector corresponding to the voice frame by using the adjusted embedding layer to obtain a second vector corresponding to the voice frame.
In one possible design, the feature vector includes a Mel-frequency cepstral coefficient (MFCC) feature vector or a filter bank parameter feature vector.
In a second aspect, an embodiment of the present invention provides an acoustic model training apparatus, including:
a first obtaining unit, configured to obtain a plurality of speech frames from at least two channels, where a channel corresponds to a channel class;
a first determining unit, configured to determine, for each of the plurality of speech frames, a channel class corresponding to the channel from which the speech frame comes, and perform one-hot encoding on the channel class to obtain a one-hot encoding vector corresponding to the speech frame;
a second acquisition unit configured to acquire feature vectors representing speech features of the speech frame;
a third acquisition unit, configured to acquire a first vector corresponding to the speech frame according to the one-hot encoding vector corresponding to the speech frame and the feature vector corresponding to the speech frame;
and the model training unit is used for carrying out model training on the acoustic model to be trained according to the first vector corresponding to each voice frame in the plurality of voice frames.
In one possible design, the third obtaining unit is specifically configured to splice the one-hot encoding vector corresponding to the speech frame and the feature vector corresponding to the speech frame to obtain the first vector corresponding to the speech frame.
In one possible design, the third obtaining unit is specifically configured to process the one-hot encoding vector corresponding to the speech frame by using an embedding layer to obtain a second vector corresponding to the speech frame;
and splicing the second vector corresponding to the voice frame with the feature vector corresponding to the voice frame to obtain the first vector corresponding to the voice frame.
In one possible design, the acoustic model includes a plurality of hidden layers connected in sequence, the apparatus further comprising:
a second determining unit configured to determine at least one selected hidden layer from the plurality of hidden layers, the at least one selected hidden layer being a hidden layer other than a first hidden layer of the plurality of hidden layers;
a fourth obtaining unit, configured to obtain, for each selected hidden layer of the at least one selected hidden layer, an intermediate vector corresponding to the speech frame output by the hidden layer preceding the selected hidden layer;
a fifth obtaining unit, configured to splice the second vector corresponding to the speech frame and the intermediate vector corresponding to the speech frame, to obtain a third vector corresponding to the speech frame;
the model training unit is specifically configured to input a first vector corresponding to each of the plurality of speech frames into the first hidden layer of the acoustic model to be trained, and to input a third vector corresponding to each of the plurality of speech frames into the selected hidden layer of the acoustic model to be trained, so as to perform model training on the acoustic model.
In one possible design, the apparatus further comprises:
the first adjusting unit is used for acquiring the dimension of the model parameter vector of the acoustic model and adjusting the dimension of the model parameter vector of the embedding layer according to the dimension of the model parameter vector of the acoustic model;
the third obtaining unit is specifically configured to process the one-hot encoding vector corresponding to the speech frame by using the adjusted embedding layer to obtain a second vector corresponding to the speech frame.
In one possible design, the apparatus further comprises:
the second adjusting unit is used for acquiring a state quantity representing the degree of difference between the at least two channels and adjusting the dimension of the model parameter vector of the embedding layer according to the state quantity;
the third obtaining unit is specifically configured to process the one-hot encoding vector corresponding to the speech frame by using the adjusted embedding layer to obtain a second vector corresponding to the speech frame.
In one possible design, the feature vector includes a Mel-frequency cepstral coefficient (MFCC) feature vector or a filter bank parameter feature vector.
In a third aspect, an embodiment of the present invention provides an acoustic model training apparatus, where the acoustic model training apparatus includes a processor, a memory, and a communication interface, where the processor, the memory, and the communication interface are connected to each other, where the communication interface is configured to receive and send data, the memory is configured to store program code, and the processor is configured to invoke the program code to execute the method described in the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium storing a computer program for execution by a processor to implement the method described above.
In the embodiment of the invention, a plurality of voice frames from at least two channels are acquired, one channel corresponding to one channel class; that is, the plurality of voice frames come from channels of at least two channel classes. For each voice frame, the channel class corresponding to the channel from which the voice frame comes is determined, and the channel class is one-hot encoded to obtain a one-hot encoding vector corresponding to the voice frame. A feature vector representing the voice features of the voice frame is acquired, a first vector corresponding to the voice frame is obtained according to the one-hot encoding vector and the feature vector corresponding to the voice frame, and the acoustic model is trained according to the first vector. By adopting the embodiment of the application, the acoustic model can be trained on a combination of voice frames from channels of various channel classes, thereby saving computing resources and reducing maintenance cost.
Drawings
In order to illustrate embodiments of the invention or solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flowchart of an acoustic model training method according to an embodiment of the present invention;
fig. 2a is a schematic diagram of vector stitching according to an embodiment of the present invention;
FIG. 2b is a schematic diagram of another vector concatenation according to an embodiment of the present invention;
FIG. 3 is a flowchart of another method for training an acoustic model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a model input provided in an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an acoustic model training device according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of another acoustic model training device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
An acoustic model training method according to an embodiment of the present invention will be described in detail with reference to fig. 1 to fig. 4.
Referring to fig. 1, a flow chart of an acoustic model training method is provided in an embodiment of the present invention. As shown in fig. 1, the acoustic model training method according to the embodiment of the present invention may include the following steps S101 to S105.
S101, acquiring a plurality of voice frames from at least two channels, wherein one channel corresponds to one channel class;
In the embodiment of the present application, a channel is a channel for transmitting a voice signal, and voice signals of different channel classes are processed with different encoding/decoding and compression schemes. Illustratively, voice signals come from different types of devices, and the channel classes to which the channels they pass through belong may differ. For example, the channel class to which a voice signal from a handset microphone belongs is different from the channel class to which a voice signal from a conventional fixed-line telephone device belongs. Voice signals of different channel classes may differ in frequency, wideband noise, resonant noise, etc.
Specifically, at least two channels are determined, one of the at least two channels corresponding to each channel class. And respectively acquiring the voice signals transmitted by each channel in the at least two channels, and carrying out framing processing on the voice signals of each channel to acquire at least one voice frame of the channel.
In the embodiment of the application, at least one voice frame of each of the at least two channels is acquired, so that a plurality of voice frames from all channels are acquired.
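By way of editorial illustration (not part of the original disclosure), a minimal Python sketch of this framing step is shown below; the 25 ms frame length and 10 ms frame shift are assumed values, since the patent does not prescribe them:

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a 1-D voice signal into overlapping frames.

    The 25 ms / 10 ms defaults are illustrative assumptions; the patent
    does not fix a frame length or shift.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift_len = int(sample_rate * shift_ms / 1000)
    if len(signal) < frame_len:
        return np.empty((0, frame_len), dtype=signal.dtype)
    num_frames = 1 + (len(signal) - frame_len) // shift_len
    return np.stack([signal[i * shift_len: i * shift_len + frame_len]
                     for i in range(num_frames)])

# One second of 8 kHz fixed-line audio yields 98 frames of 200 samples each.
frames = frame_signal(np.zeros(8000, dtype=np.float32), sample_rate=8000)
```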
S102, determining a channel class corresponding to a channel from which the voice frame comes for each voice frame in the voice frames, and performing single-hot encoding on the channel class to obtain a single-hot encoding vector corresponding to the voice frame;
In one embodiment, for each of the plurality of voice frames, the channel class corresponding to the channel from which the voice frame comes is further determined. For example, if the voice frame comes from a conventional fixed-line device, the channel class corresponding to its channel is the conventional fixed-line device channel class.
One-hot encoding is performed on the channel class corresponding to the channel from which the voice frame comes, obtaining the one-hot encoding vector corresponding to the voice frame. One-hot encoding, also known as one-bit effective encoding, uses an N-bit status register to encode N states: each state has its own independent register bit, and only one bit is active at any time. Here N is the number of the at least two channels, i.e. the number of all channel categories.
In machine learning algorithms such as regression, classification and clustering, the calculation of distances or similarities between features is very important. Distances and similarities are commonly computed in Euclidean space, and cosine similarity is likewise based on Euclidean space. One-hot encoding expands the values of a discrete feature into Euclidean space, with each value of the discrete feature corresponding to a point in that space. One-hot encoding the discrete features therefore makes distance calculations between features more reasonable.
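As an illustrative sketch of S102 (an editorial addition, not part of the disclosure), the snippet below one-hot encodes a hypothetical two-class channel setup; the class names are invented for the example:

```python
import numpy as np

# Hypothetical channel classes; the patent only requires at least two.
CHANNEL_CLASSES = ["fixed_line_8k", "mobile_mic_16k"]

def one_hot(channel_class: str) -> np.ndarray:
    """N-bit one-hot encoding: one register bit per channel class,
    exactly one bit active at any time (N = number of channel classes)."""
    vec = np.zeros(len(CHANNEL_CLASSES), dtype=np.float32)
    vec[CHANNEL_CLASSES.index(channel_class)] = 1.0
    return vec

print(one_hot("fixed_line_8k"))   # [1. 0.]
print(one_hot("mobile_mic_16k"))  # [0. 1.]
```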
S103, for each voice frame in the plurality of voice frames, acquiring a feature vector for representing voice features of the voice frame;
In one embodiment, speech feature extraction is performed on each speech frame to obtain a feature vector representing the speech features of the speech frame. The speech features include, but are not limited to, Mel-frequency cepstral coefficient (Mel-scale Frequency Cepstral Coefficients, MFCC) features and filter bank parameter (Filter Bank) features.
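For illustration, the sketch below extracts both feature types with librosa; the library choice, the file name and the 40-dimensional settings are assumptions rather than anything fixed by the patent:

```python
import librosa

# "utterance.wav" is a hypothetical input file; 40 MFCCs / 40 Mel bands are
# illustrative choices only.
y, sr = librosa.load("utterance.wav", sr=None)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)           # (40, num_frames)
fbank = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40))   # (40, num_frames)

frame_feature_vectors = mfcc.T   # one 40-dimensional feature vector per frame
```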
S104, obtaining a first vector corresponding to the voice frame according to the one-hot encoding vector corresponding to the voice frame and the feature vector corresponding to the voice frame;
s105, carrying out model training on the acoustic model to be trained according to the first vector corresponding to each voice frame in the plurality of voice frames.
In this embodiment, for each speech frame, the manner of obtaining the first vector corresponding to the speech frame from the feature vector corresponding to the speech frame and the one-hot encoding vector corresponding to the speech frame includes, but is not limited to, the following two optional embodiments:
In a first alternative embodiment, the one-hot encoding vector corresponding to the speech frame and the feature vector corresponding to the speech frame are spliced to obtain the first vector corresponding to the speech frame.
As shown in fig. 2a, the feature vector corresponding to the voice frame and the one-hot encoding vector corresponding to the voice frame are directly spliced, and the spliced feature vector serves as the first vector corresponding to the voice frame. For example, if the feature vector corresponding to the voice frame is a 40-dimensional vector and there are 2 channel classes, the one-hot encoding vector corresponding to the voice frame is a 2-dimensional vector, and the spliced first vector is 42-dimensional. The spliced first vector is input into the acoustic model for training.
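A minimal numeric sketch of this direct splice, reusing the 40-dimensional feature / 2 channel-class figures from the example above:

```python
import numpy as np

feature_vec = np.random.randn(40).astype(np.float32)    # 40-dim speech features
one_hot_vec = np.array([1.0, 0.0], dtype=np.float32)    # 2 channel classes

first_vec = np.concatenate([feature_vec, one_hot_vec])  # 42-dim, as in the example
assert first_vec.shape == (42,)
```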
In a second alternative embodiment, for each speech frame, an embedding layer is used to process the one-hot encoding vector corresponding to the speech frame to obtain a second vector corresponding to the speech frame. The second vector corresponding to the speech frame is spliced with the feature vector corresponding to the speech frame to obtain the first vector corresponding to the speech frame, i.e. one first vector for each speech frame.
The second vector obtained by processing the one-hot encoding vector corresponding to the speech frame through the embedding layer is a vector that is higher-dimensional and denser than the one-hot encoding vector. The embedding allows speech signals whose vectors are close in distance to have similar meanings, and it can represent the differences between the channel categories. In the second alternative embodiment, processing the one-hot encoding vector through the embedding layer avoids the vector sparsity problem caused by directly using the one-hot encoding vector.
As shown in fig. 2b, the one-hot encoding vector corresponding to the voice frame is processed by the embedding layer to obtain a second vector corresponding to the voice frame; the feature vector corresponding to the voice frame and the second vector corresponding to the voice frame are spliced to obtain the spliced first vector corresponding to the voice frame, and the first vector corresponding to the voice frame is input into the acoustic model for training.
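A sketch of this second embodiment, assuming PyTorch as the framework; the 8-dimensional embedding is an arbitrary illustrative choice, and a bias-free linear layer stands in for the embedding layer (the two are equivalent for one-hot inputs):

```python
import torch
import torch.nn as nn

num_channel_classes = 2
emb_dim = 8   # illustrative; see the dimension-adjustment discussion below

# A bias-free linear layer applied to a one-hot vector is equivalent to an
# embedding-table lookup on the class index.
embedding = nn.Linear(num_channel_classes, emb_dim, bias=False)

one_hot_vec = torch.tensor([[1.0, 0.0]])    # (1, 2) one-hot channel vector
feature_vec = torch.randn(1, 40)            # (1, 40) frame feature vector

second_vec = embedding(one_hot_vec)                       # (1, 8) denser vector
first_vec = torch.cat([feature_vec, second_vec], dim=-1)  # (1, 48) model input
```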
It will be appreciated that in the second alternative embodiment, the dimension of the model parameter vector used in the embedding process may be adjusted according to the dimension of the model parameter vector of the acoustic model, so as to adjust the dimension of the second vector after embedding. For example, if the dimension of the acoustic model's parameter vector is relatively large, so is the dimension of the model parameter vector used in the embedding process.
Optionally, the dimension of the model parameter vector used in the embedding process can also be adjusted according to the differences between the channels, thereby adjusting the dimension of the second vector after embedding. For example, if the difference between channels is relatively large, the dimension of the model parameter vector used in the embedding process is also relatively large, and the dimension of the resulting second vector is correspondingly large. It can be appreciated that adjusting the dimension of the model parameter vector used in the embedding process in this way can strike a balance between model computation and performance.
Specifically, a state quantity representing the degree of difference between the at least two channels may be obtained, and the dimension of the model parameter vector used in the embedding process is adjusted according to the state quantity.
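As a sketch of how such a state quantity might drive the dimension adjustment (the linear mapping and its bounds below are entirely assumed; the patent does not define a concrete rule):

```python
def choose_embedding_dim(channel_difference: float,
                         min_dim: int = 4, max_dim: int = 64) -> int:
    """Map a state quantity in [0, 1] describing the degree of inter-channel
    difference to an embedding dimension. Larger differences give a larger
    dimension; the rule itself is purely an illustrative assumption."""
    channel_difference = min(max(channel_difference, 0.0), 1.0)
    return round(min_dim + channel_difference * (max_dim - min_dim))

print(choose_embedding_dim(0.1))  # 10
print(choose_embedding_dim(0.9))  # 58
```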
Referring to fig. 3, a flowchart of another acoustic model training method is provided in an embodiment of the present invention. As shown in fig. 3, the acoustic model training method according to the embodiment of the present invention may include the following steps S201 to S210.
S201, acquiring a plurality of voice frames from at least two channels, wherein one channel corresponds to one channel class;
S202, determining a channel class corresponding to the channel from which the voice frame comes for each voice frame in the plurality of voice frames, and performing one-hot encoding on the channel class to obtain a one-hot encoding vector corresponding to the voice frame;
s203, for each voice frame in the plurality of voice frames, acquiring a feature vector for representing voice features of the voice frame;
the descriptions of step S201 to step S203 refer to step S101 to step S103 shown in fig. 1, and are not repeated here.
S204, processing the one-hot encoding vector corresponding to the voice frame by using the embedding layer to obtain a second vector corresponding to the voice frame.
S205, splicing the second vector corresponding to the voice frame and the feature vector corresponding to the voice frame to obtain the first vector corresponding to the voice frame;
The one-hot encoding vector corresponding to each voice frame is processed by the embedding layer to obtain the second vector corresponding to the voice frame. The second vector corresponding to the voice frame is spliced with the feature vector corresponding to the voice frame to obtain the first vector corresponding to the voice frame, i.e. one first vector for each voice frame.
S206, inputting a first vector corresponding to each voice frame in the plurality of voice frames into a first hidden layer of the acoustic model to be trained;
Specifically, the acoustic model to be trained may include a plurality of hidden layers connected in sequence, as shown in fig. 4. The spliced first vector corresponding to the voice frame is input into the first hidden layer of the acoustic model to be trained.
S207, determining at least one selected hidden layer from the plurality of hidden layers, wherein the at least one selected hidden layer is a hidden layer except a first hidden layer of the plurality of hidden layers;
In one embodiment, the at least one selected hidden layer may be determined in a number of ways; for example, a selected hidden layer may be chosen every other hidden layer, or the last preset number of hidden layers may be chosen as the selected hidden layers, and so on.
S208, aiming at each selected hidden layer in the at least one selected hidden layer, acquiring an intermediate vector corresponding to the voice frame output by the hidden layer before the selected hidden layer;
In one embodiment, for each selected hidden layer of the at least one selected hidden layer, the respective intermediate vector for each speech frame output by the hidden layer preceding the selected hidden layer is determined.
S209, splicing the second vector corresponding to the voice frame and the intermediate vector corresponding to the voice frame to obtain a third vector corresponding to the voice frame;
Specifically, the second vector corresponding to the voice frame is spliced with the intermediate vector corresponding to the voice frame to obtain a third vector corresponding to the voice frame; each voice frame in the plurality of voice frames corresponds to one third vector.
S210, inputting a third vector corresponding to each voice frame in the plurality of voice frames into the selected hidden layer of the acoustic model to be trained so as to perform model training on the acoustic model;
Specifically, the third vector corresponding to each voice frame is input into the corresponding selected hidden layer in the acoustic model to be trained. For example, if the second hidden layer is a selected hidden layer, the intermediate vector corresponding to the voice frame output by the first hidden layer and the second vector corresponding to the voice frame are spliced to obtain the third vector corresponding to the voice frame, and the third vector corresponding to the voice frame is input into the second hidden layer.
As shown in fig. 4, the one-hot encoding vector is processed by the embedding layer to obtain the processed second vector. The processed second vector can be spliced with the feature vector to obtain the spliced first vector, which is input into the acoustic model for training, i.e. into the first hidden layer of the acoustic model. The second vector after the embedding process can also be spliced with the intermediate vectors output by the hidden layers preceding the other hidden layers of the acoustic model to obtain third vectors, which are input into those other hidden layers for acoustic model training. This prevents the features carried by the processed second vector from being lost during training when the acoustic model is deep.
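The condensed PyTorch sketch below mirrors the fig. 4 scheme under stated assumptions (four hidden layers, injection at every other hidden layer, illustrative widths and target count); it is an editorial illustration, not the patent's reference implementation:

```python
import torch
import torch.nn as nn

class ChannelAwareAcousticModel(nn.Module):
    """Sketch of the fig. 4 scheme: the embedded channel vector (second
    vector) is spliced into the input and re-spliced before each selected
    hidden layer. All sizes here are illustrative assumptions."""

    def __init__(self, feat_dim=40, emb_dim=8, hidden=256,
                 num_layers=4, num_targets=1000):
        super().__init__()
        # One option from S207: choose every other hidden layer (never the first).
        self.selected = {i for i in range(1, num_layers) if i % 2 == 0}
        self.layers = nn.ModuleList()
        in_dim = feat_dim + emb_dim                    # first vector dimension
        for i in range(num_layers):
            if i in self.selected:
                in_dim += emb_dim                      # third vector dimension
            self.layers.append(nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU()))
            in_dim = hidden
        self.out = nn.Linear(hidden, num_targets)

    def forward(self, feature_vec, second_vec):
        x = torch.cat([feature_vec, second_vec], dim=-1)   # first vector (S205/S206)
        for i, layer in enumerate(self.layers):
            if i in self.selected:
                # Splice the previous layer's intermediate vector with the
                # channel embedding to form the third vector (S208-S210).
                x = torch.cat([x, second_vec], dim=-1)
            x = layer(x)
        return self.out(x)

model = ChannelAwareAcousticModel()
logits = model(torch.randn(1, 40), torch.randn(1, 8))   # (1, 1000)
```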
It will be appreciated that in this embodiment, the dimension of the model parameter vector used in the embedding process may be adjusted according to the dimension of the model parameter vector of the acoustic model, so as to adjust the dimension of the second vector after embedding. For example, if the dimension of the acoustic model's parameter vector is relatively large, so is the dimension of the model parameter vector used in the embedding process.
Optionally, the dimension of the model parameter vector used in the embedding process can also be adjusted according to the differences between the channels, thereby adjusting the dimension of the second vector after embedding. For example, if the difference between channels is relatively large, the dimension of the model parameter vector used in the embedding process is also relatively large, and the dimension of the resulting second vector is correspondingly large. It can be appreciated that adjusting the dimension of the model parameter vector used in the embedding process in this way can strike a balance between model computation and performance.
Specifically, a state quantity representing the degree of difference between the at least two channels may be obtained, and the dimension of the model parameter vector used in the embedding process is adjusted according to the state quantity.
Referring to fig. 5, a schematic structural diagram of an acoustic model training device is provided in an embodiment of the present invention. As shown in fig. 5, the acoustic model training apparatus according to the embodiment of the present invention may include:
a first acquisition unit 10 for acquiring a plurality of speech frames from at least two channels;
a first determining unit 11, configured to determine, for each of the plurality of speech frames, a channel class corresponding to a channel from which the speech frame is derived, and perform one-hot encoding on the channel class, to obtain a one-hot encoding vector corresponding to the speech frame;
a second acquisition unit 12 for acquiring feature vectors representing speech features of the speech frame;
a third obtaining unit 13, configured to obtain a first vector corresponding to the speech frame according to the one-hot encoding vector corresponding to the speech frame and the feature vector corresponding to the speech frame;
the model training unit 14 is configured to perform model training on the acoustic model to be trained according to a first vector corresponding to each of the plurality of speech frames.
In a possible implementation manner, the third obtaining unit 13 is specifically configured to splice the one-hot encoding vector corresponding to the speech frame and the feature vector corresponding to the speech frame, so as to obtain the first vector corresponding to the speech frame.
In one possible design, the third obtaining unit 13 is specifically configured to process the one-hot encoding vector corresponding to the speech frame by using an embedding layer to obtain a second vector corresponding to the speech frame;
and splicing the second vector corresponding to the voice frame with the feature vector corresponding to the voice frame to obtain the first vector corresponding to the voice frame.
In one possible design, the acoustic model includes a plurality of hidden layers connected in sequence, the apparatus further comprising:
a second determining unit configured to determine at least one selected hidden layer from the plurality of hidden layers, the at least one selected hidden layer being a hidden layer other than a first hidden layer of the plurality of hidden layers;
a fourth obtaining unit, configured to obtain, for each selected hidden layer of the at least one selected hidden layer, an intermediate vector corresponding to the speech frame output by the hidden layer preceding the selected hidden layer;
a fifth obtaining unit, configured to splice the second vector corresponding to the speech frame and the intermediate vector corresponding to the speech frame, to obtain a third vector corresponding to the speech frame;
the model training unit 14 is specifically configured to input a first vector corresponding to each of the plurality of speech frames into the first hidden layer of the acoustic model to be trained, and to input a third vector corresponding to each of the plurality of speech frames into the selected hidden layer of the acoustic model to be trained, so as to perform model training on the acoustic model.
In one possible design, the apparatus further comprises:
the first adjusting unit is used for acquiring the dimension of the model parameter vector of the acoustic model and adjusting the dimension of the model parameter vector of the embedding layer according to the dimension of the model parameter vector of the acoustic model;
the third obtaining unit 13 is specifically configured to process the one-hot encoding vector corresponding to the speech frame by using the adjusted embedding layer to obtain a second vector corresponding to the speech frame.
In one possible design, the apparatus further comprises:
the second adjusting unit is used for acquiring a state quantity representing the degree of difference between the at least two channels and adjusting the dimension of the model parameter vector of the embedding layer according to the state quantity;
the third obtaining unit 13 is specifically configured to process the one-hot encoding vector corresponding to the speech frame by using the adjusted embedding layer to obtain a second vector corresponding to the speech frame.
In one possible design, the feature vector includes a Mel-frequency cepstral coefficient (MFCC) feature vector or a filter bank parameter feature vector.
The specific description of the embodiment of the apparatus shown in fig. 5 may refer to the specific description of the embodiment of the method shown in fig. 1 or fig. 3, which is not described herein.
Referring to fig. 6, which is a schematic structural diagram of another acoustic model training apparatus according to an embodiment of the present invention, the acoustic model training apparatus 1000 may include: at least one processor 1001, such as a CPU, at least one communication interface 1003, memory 1004, and at least one communication bus 1002. The communication bus 1002 is used to enable connected communication between these components. Communication interface 1003 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1004 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 1004 may also optionally be at least one storage device located remotely from the processor 1001. As shown in fig. 6, the memory 1004, which is a type of computer storage medium, may include an operating system, network communication modules, and program instructions.
In the acoustic model training apparatus 1000 shown in fig. 6, the processor 1001 may be configured to load program instructions stored in the memory 1004 and specifically perform the following operations:
acquiring a plurality of voice frames from at least two channels, wherein one channel corresponds to one channel class;
for each voice frame in the plurality of voice frames, determining a channel class corresponding to the channel from which the voice frame comes, and performing one-hot encoding on the channel class to obtain a one-hot encoding vector corresponding to the voice frame;
acquiring a feature vector for representing a voice feature of the voice frame;
obtaining a first vector corresponding to the voice frame according to the one-hot encoding vector corresponding to the voice frame and the feature vector corresponding to the voice frame;
and carrying out model training on the acoustic model to be trained according to the first vector corresponding to each voice frame in the plurality of voice frames.
It should be noted that, the specific implementation process may refer to the specific description of the method embodiment shown in fig. 1 or fig. 3, and will not be described herein.
Specific implementation steps may be referred to the description of the foregoing embodiments, and are not described herein in detail.
The embodiment of the present invention further provides a computer storage medium, where the computer storage medium may store a plurality of instructions, where the instructions are suitable for being loaded by a processor and executed by the processor, and the specific execution process may refer to the specific description of the embodiment shown in fig. 1 or fig. 3, and is not described herein.
Those skilled in the art will appreciate that implementing all or part of the above-described embodiment methods may be accomplished by way of a computer program, which may be stored on a computer readable storage medium, which when executed comprises the steps of embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.

Claims (10)

1. An acoustic model training method, comprising:
acquiring a plurality of voice frames from at least two channels, wherein one channel corresponds to one channel class;
for each voice frame in the plurality of voice frames, determining a channel class corresponding to the channel from which the voice frame comes, and performing one-hot encoding on the channel class to obtain a one-hot encoding vector corresponding to the voice frame;
acquiring a feature vector for representing a voice feature of the voice frame;
obtaining a first vector corresponding to the voice frame according to the one-hot encoding vector corresponding to the voice frame and the feature vector corresponding to the voice frame;
and carrying out model training on the acoustic model to be trained according to the first vector corresponding to each voice frame in the plurality of voice frames.
2. The method of claim 1, wherein the obtaining the first vector corresponding to the speech frame from the one-hot encoded vector corresponding to the speech frame and the feature vector corresponding to the speech frame comprises:
and splicing the one-hot encoding vector corresponding to the voice frame and the feature vector corresponding to the voice frame to obtain a first vector corresponding to the voice frame.
3. The method of claim 1, wherein the obtaining the first vector corresponding to the speech frame from the one-hot encoded vector corresponding to the speech frame and the feature vector corresponding to the speech frame comprises:
processing the one-hot encoding vector corresponding to the voice frame by using an embedding layer to obtain a second vector corresponding to the voice frame;
and splicing the second vector corresponding to the voice frame with the feature vector corresponding to the voice frame to obtain the first vector corresponding to the voice frame.
4. The method of claim 3, wherein the acoustic model comprises a plurality of hidden layers connected in sequence, the method further comprising:
determining at least one selected hidden layer from the plurality of hidden layers, the at least one selected hidden layer being a hidden layer other than the first hidden layer of the plurality of hidden layers;
for each selected hidden layer in the at least one selected hidden layer, acquiring an intermediate vector corresponding to the voice frame output by the hidden layer before the selected hidden layer;
splicing the second vector corresponding to the voice frame and the intermediate vector corresponding to the voice frame to obtain a third vector corresponding to the voice frame;
performing model training on an acoustic model to be trained according to a first vector corresponding to each voice frame in the plurality of voice frames, including:
inputting a first vector corresponding to each of the plurality of speech frames into the first hidden layer of the acoustic model to be trained; and inputting a third vector corresponding to each of the plurality of speech frames into the selected hidden layer of the acoustic model to be trained, so as to perform model training on the acoustic model.
5. A method as claimed in claim 3, wherein the method further comprises:
acquiring the dimension of the model parameter vector of the acoustic model, and adjusting the dimension of the model parameter vector of the embedding layer according to the dimension of the model parameter vector of the acoustic model;
the processing the one-hot encoding vector corresponding to the voice frame by using the embedding layer to obtain a second vector corresponding to the voice frame includes:
and processing the one-hot encoding vector corresponding to the voice frame by using the adjusted embedding layer to obtain a second vector corresponding to the voice frame.
6. A method as claimed in claim 3, wherein the method further comprises:
acquiring a state quantity representing the degree of difference between the at least two channels, and adjusting the dimension of the model parameter vector of the embedding layer according to the state quantity;
the processing the one-hot encoding vector corresponding to the voice frame by using the embedding layer to obtain a second vector corresponding to the voice frame includes:
and processing the one-hot encoding vector corresponding to the voice frame by using the adjusted embedding layer to obtain a second vector corresponding to the voice frame.
7. The method of any of claims 1-6, wherein the feature vector comprises a Mel-frequency cepstral coefficient (MFCC) feature vector or a filter bank parameter feature vector.
8. An acoustic model training device, comprising:
a first acquisition unit configured to acquire a plurality of speech frames from at least two channels;
a first determining unit, configured to determine, for each of the plurality of speech frames, a channel class corresponding to the channel from which the speech frame comes, and perform one-hot encoding on the channel class to obtain a one-hot encoding vector corresponding to the speech frame;
a second acquisition unit configured to acquire feature vectors representing speech features of the speech frame;
a third acquisition unit, configured to acquire a first vector corresponding to the speech frame according to the one-hot encoding vector corresponding to the speech frame and the feature vector corresponding to the speech frame;
and the model training unit is used for carrying out model training on the acoustic model to be trained according to the first vector corresponding to each voice frame in the plurality of voice frames.
9. An acoustic model training device comprising a processor, a memory and a communication interface, the processor, memory and communication interface being interconnected, wherein the communication interface is adapted to receive and transmit data, the memory is adapted to store program code, and the processor is adapted to invoke the program code to perform the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program, which is executed by a processor to implement the method of any one of claims 1 to 7.
CN202110264782.XA 2021-03-11 2021-03-11 Acoustic model training method and device Active CN113035177B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110264782.XA CN113035177B (en) 2021-03-11 2021-03-11 Acoustic model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110264782.XA CN113035177B (en) 2021-03-11 2021-03-11 Acoustic model training method and device

Publications (2)

Publication Number Publication Date
CN113035177A CN113035177A (en) 2021-06-25
CN113035177B true CN113035177B (en) 2024-02-09

Family

ID=76469783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110264782.XA Active CN113035177B (en) 2021-03-11 2021-03-11 Acoustic model training method and device

Country Status (1)

Country Link
CN (1) CN113035177B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9460711B1 (en) * 2013-04-15 2016-10-04 Google Inc. Multilingual, acoustic deep neural networks
CN108701452A (en) * 2016-02-02 2018-10-23 日本电信电话株式会社 Audio model learning method, audio recognition method, audio model learning device, speech recognition equipment, audio model learning program and speech recognition program
CN110047478A (en) * 2018-01-16 2019-07-23 中国科学院声学研究所 Multicenter voice based on space characteristics compensation identifies Acoustic Modeling method and device
WO2019212375A1 (en) * 2018-05-03 2019-11-07 Общество с ограниченной ответственностью "Центр речевых технологий" Method for obtaining speaker-dependent small high-level acoustic speech attributes
CN111383628A (en) * 2020-03-09 2020-07-07 第四范式(北京)技术有限公司 Acoustic model training method and device, electronic equipment and storage medium
CN111883113A (en) * 2020-07-30 2020-11-03 云知声智能科技股份有限公司 Voice recognition method and device

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10030105A1 (en) * 2000-06-19 2002-01-03 Bosch Gmbh Robert Speech recognition device
US9697826B2 (en) * 2015-03-27 2017-07-04 Google Inc. Processing multi-channel audio waveforms
US10657962B2 (en) * 2018-05-02 2020-05-19 International Business Machines Corporation Modeling multiparty conversation dynamics: speaker, response, addressee selection using a novel deep learning approach
CN111599344B (en) * 2020-03-31 2022-05-17 因诺微科技(天津)有限公司 Language identification method based on splicing characteristics
KR20220069776A (en) * 2020-11-20 2022-05-27 한국전자통신연구원 Method for speech generation for automatic speech recognition
CN112786028B (en) * 2021-02-07 2024-03-26 百果园技术(新加坡)有限公司 Acoustic model processing method, apparatus, device and readable storage medium
CN114898736A (en) * 2022-03-30 2022-08-12 北京小米移动软件有限公司 Voice signal recognition method and device, electronic equipment and storage medium
CN114863916A (en) * 2022-04-26 2022-08-05 北京小米移动软件有限公司 Speech recognition model training method, speech recognition device and storage medium
CN115206288A (en) * 2022-07-11 2022-10-18 新疆大学 Cross-channel language identification method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9460711B1 (en) * 2013-04-15 2016-10-04 Google Inc. Multilingual, acoustic deep neural networks
CN108701452A (en) * 2016-02-02 2018-10-23 日本电信电话株式会社 Audio model learning method, audio recognition method, audio model learning device, speech recognition equipment, audio model learning program and speech recognition program
CN110047478A (en) * 2018-01-16 2019-07-23 中国科学院声学研究所 Multicenter voice based on space characteristics compensation identifies Acoustic Modeling method and device
WO2019212375A1 (en) * 2018-05-03 2019-11-07 Общество с ограниченной ответственностью "Центр речевых технологий" Method for obtaining speaker-dependent small high-level acoustic speech attributes
CN111383628A (en) * 2020-03-09 2020-07-07 第四范式(北京)技术有限公司 Acoustic model training method and device, electronic equipment and storage medium
CN111883113A (en) * 2020-07-30 2020-11-03 云知声智能科技股份有限公司 Voice recognition method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups; G. Hinton et al.; IEEE Signal Processing Magazine; Vol. 29, No. 6; entire document *

Also Published As

Publication number Publication date
CN113035177A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
CN107393526B (en) Voice silence detection method, device, computer equipment and storage medium
KR102002681B1 (en) Bandwidth extension based on generative adversarial networks
US6003002A (en) Method and system of adapting speech recognition models to speaker environment
JP2006079079A (en) Distributed speech recognition system and its method
KR101863097B1 (en) Apparatus and method for keyword recognition
WO2013177981A1 (en) Scene recognition method, device and mobile terminal based on ambient sound
CN109644192B (en) Audio delivery method and apparatus with speech detection period duration compensation
WO2015103836A1 (en) Voice control method and device
WO2019232826A1 (en) I-vector extraction method, speaker recognition method and apparatus, device, and medium
CN110610707A (en) Voice keyword recognition method and device, electronic equipment and storage medium
KR101618512B1 (en) Gaussian mixture model based speaker recognition system and the selection method of additional training utterance
US11562735B1 (en) Multi-modal spoken language understanding systems
CN113178201B (en) Voice conversion method, device, equipment and medium based on non-supervision
CN103514882A (en) Voice identification method and system
CN116490920A (en) Method for detecting an audio challenge, corresponding device, computer program product and computer readable carrier medium for a speech input processed by an automatic speech recognition system
CN113870860A (en) End-to-end voiceprint recognition method and voiceprint recognition device
CN114338623A (en) Audio processing method, device, equipment, medium and computer program product
CN114333865A (en) Model training and tone conversion method, device, equipment and medium
US11551707B2 (en) Speech processing method, information device, and computer program product
CN115472174A (en) Sound noise reduction method and device, electronic equipment and storage medium
CN114530156A (en) Generation countermeasure network optimization method and system for short voice speaker confirmation
CN113035177B (en) Acoustic model training method and device
CN112802498B (en) Voice detection method, device, computer equipment and storage medium
CN113724698B (en) Training method, device, equipment and storage medium of voice recognition model
CN115762500A (en) Voice processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant