CN113035177B - Acoustic model training method and device

Acoustic model training method and device

Info

Publication number
CN113035177B
Authority
CN
China
Prior art keywords
vector corresponding
voice frame
voice
frame
acoustic model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110264782.XA
Other languages
Chinese (zh)
Other versions
CN113035177A (en)
Inventor
鄢楷强
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110264782.XA priority Critical patent/CN113035177B/en
Publication of CN113035177A publication Critical patent/CN113035177A/en
Application granted granted Critical
Publication of CN113035177B publication Critical patent/CN113035177B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses an acoustic model training method and device, wherein the acoustic model training method comprises the following steps: acquiring a plurality of voice frames from at least two channels, wherein one channel corresponds to one channel class; for each voice frame in the plurality of voice frames, determining the channel class corresponding to the channel from which the voice frame comes, and performing one-hot encoding on the channel class to obtain a one-hot encoding vector corresponding to the voice frame; acquiring a feature vector representing the voice features of the voice frame; obtaining a first vector corresponding to the voice frame according to the one-hot encoding vector corresponding to the voice frame and the feature vector corresponding to the voice frame; and performing model training on the acoustic model to be trained according to the first vector corresponding to each voice frame in the plurality of voice frames. By adopting the method and the device, computing resources can be saved and maintenance cost reduced.

Description

Acoustic model training method and device
Technical Field
The invention relates to the technical field of model training, in particular to an acoustic model training method and device.
Background
With the continuous development of speech recognition technology, more and more scenarios are empowered by it, such as intelligent hardware, telephone customer service, conference systems and vehicle-mounted scenarios. Voice signals originating from different devices may have channel differences; for example, traditional fixed-line telephone speech (8 kHz sampling rate) and mobile phone microphone speech (16 kHz sampling rate) come from different channels and have different channel characteristics. Comparing the time-domain and frequency-domain parameters of voice signals under different channels reveals obvious differences in signal frequency, broadband noise, resonance noise and the like. At present, because voice signals from different channels differ, an acoustic model is usually trained separately for each channel, which occupies additional computing resources and increases maintenance cost.
Disclosure of Invention
The embodiment of the invention provides an acoustic model training method and device, which can train a single acoustic model on a combination of voice frames from channels of various channel classes, thereby saving computing resources and reducing maintenance cost.
In a first aspect, an embodiment of the present invention provides an acoustic model training method, including:
acquiring a plurality of voice frames from at least two channels, wherein one channel corresponds to one channel class;
for each voice frame in the plurality of voice frames, determining a channel class corresponding to the channel from which the voice frame comes, and performing one-hot encoding on the channel class to obtain a one-hot encoding vector corresponding to the voice frame;
acquiring a feature vector for representing a voice feature of the voice frame;
obtaining a first vector corresponding to the voice frame according to the one-hot encoding vector corresponding to the voice frame and the feature vector corresponding to the voice frame;
and carrying out model training on the acoustic model to be trained according to the first vector corresponding to each voice frame in the plurality of voice frames.
In one possible design, the obtaining the first vector corresponding to the speech frame according to the one-hot encoding vector corresponding to the speech frame and the feature vector corresponding to the speech frame includes:
and splicing the one-hot encoding vector corresponding to the voice frame and the feature vector corresponding to the voice frame to obtain a first vector corresponding to the voice frame.
In one possible design, the obtaining the first vector corresponding to the speech frame according to the one-hot encoding vector corresponding to the speech frame and the feature vector corresponding to the speech frame includes:
processing the one-hot encoding vector corresponding to the voice frame by using an embedding layer to obtain a second vector corresponding to the voice frame;
and splicing the second vector corresponding to the voice frame with the feature vector corresponding to the voice frame to obtain the first vector corresponding to the voice frame.
In one possible design, the acoustic model includes a plurality of hidden layers connected in sequence, the method further comprising:
determining at least one selected hidden layer from the plurality of hidden layers, the at least one selected hidden layer being a hidden layer other than the first hidden layer of the plurality of hidden layers;
for each selected hidden layer in the at least one selected hidden layer, acquiring an intermediate vector corresponding to the voice frame output by the hidden layer before the selected hidden layer;
splicing the second vector corresponding to the voice frame and the intermediate vector corresponding to the voice frame to obtain a third vector corresponding to the voice frame;
performing model training on an acoustic model to be trained according to a first vector corresponding to each voice frame in the plurality of voice frames, including:
inputting a first vector corresponding to each of the plurality of speech frames into the first hidden layer of the acoustic model to be trained; and inputting a third vector corresponding to each of the plurality of speech frames into the selected hidden layer of the acoustic model to be trained, so as to perform model training on the acoustic model.
In one possible design, the method further comprises:
acquiring the dimension of the model parameter vector of the acoustic model, and adjusting the dimension of the model parameter vector of the embedding layer according to the dimension of the model parameter vector of the acoustic model;
the processing the one-hot encoding vector corresponding to the voice frame by using the embedding layer to obtain a second vector corresponding to the voice frame includes:
and processing the one-hot encoding vector corresponding to the voice frame by using the adjusted embedding layer to obtain a second vector corresponding to the voice frame.
In one possible design, the method further comprises:
acquiring a state quantity representing the degree of difference between the at least two channels, and adjusting the dimension of the model parameter vector of the embedding layer according to the state quantity;
the processing the one-hot encoding vector corresponding to the voice frame by using the embedding layer to obtain a second vector corresponding to the voice frame includes:
and processing the one-hot encoding vector corresponding to the voice frame by using the adjusted embedding layer to obtain a second vector corresponding to the voice frame.
In one possible design, the feature vector includes a Mel-frequency cepstral coefficient (MFCC) feature vector or a filter bank parameter feature vector.
In a second aspect, an embodiment of the present invention provides an acoustic model training apparatus, including:
a first obtaining unit, configured to obtain a plurality of speech frames from at least two channels, where a channel corresponds to a channel class;
a first determining unit, configured to determine, for each of the plurality of speech frames, a channel class corresponding to the channel from which the speech frame comes, and perform one-hot encoding on the channel class to obtain a one-hot encoding vector corresponding to the speech frame;
a second acquisition unit configured to acquire feature vectors representing speech features of the speech frame;
a third acquisition unit, configured to acquire a first vector corresponding to the speech frame according to the one-hot encoding vector corresponding to the speech frame and the feature vector corresponding to the speech frame;
and the model training unit is used for carrying out model training on the acoustic model to be trained according to the first vector corresponding to each voice frame in the plurality of voice frames.
In one possible design, the third obtaining unit is specifically configured to splice the one-hot encoding vector corresponding to the speech frame and the feature vector corresponding to the speech frame to obtain the first vector corresponding to the speech frame.
In one possible design, the third obtaining unit is specifically configured to process the one-hot encoding vector corresponding to the speech frame by using an embedding layer to obtain a second vector corresponding to the speech frame;
and splicing the second vector corresponding to the voice frame with the feature vector corresponding to the voice frame to obtain the first vector corresponding to the voice frame.
In one possible design, the acoustic model includes a plurality of hidden layers connected in sequence, the apparatus further comprising:
a second determining unit configured to determine at least one selected hidden layer from the plurality of hidden layers, the at least one selected hidden layer being a hidden layer other than a first hidden layer of the plurality of hidden layers;
a fourth obtaining unit, configured to obtain, for each selected hidden layer of the at least one selected hidden layer, an intermediate vector corresponding to the speech frame output by the hidden layer preceding the selected hidden layer;
a fifth obtaining unit, configured to splice the second vector corresponding to the speech frame and the intermediate vector corresponding to the speech frame, to obtain a third vector corresponding to the speech frame;
the model training unit is specifically configured to input a first vector corresponding to each of the plurality of speech frames into the first hidden layer of the acoustic model to be trained, and to input a third vector corresponding to each of the plurality of speech frames into the selected hidden layer of the acoustic model to be trained, so as to perform model training on the acoustic model.
In one possible design, the apparatus further comprises:
the first adjusting unit is used for acquiring the dimension of the model parameter vector of the acoustic model and adjusting the dimension of the model parameter vector of the embedding layer according to the dimension of the model parameter vector of the acoustic model;
the third obtaining unit is specifically configured to process the one-hot encoding vector corresponding to the speech frame by using the adjusted embedding layer to obtain a second vector corresponding to the speech frame.
In one possible design, the apparatus further comprises:
the second adjusting unit is used for acquiring a state quantity representing the degree of difference between the at least two channels and adjusting the dimension of the model parameter vector of the embedding layer according to the state quantity;
the third obtaining unit is specifically configured to process the one-hot encoding vector corresponding to the speech frame by using the adjusted embedding layer to obtain a second vector corresponding to the speech frame.
In one possible design, the feature vector includes a Mel-frequency cepstral coefficient (MFCC) feature vector or a filter bank parameter feature vector.
In a third aspect, an embodiment of the present invention provides an acoustic model training apparatus, where the acoustic model training apparatus includes a processor, a memory, and a communication interface, where the processor, the memory, and the communication interface are connected to each other, where the communication interface is configured to receive and send data, the memory is configured to store program code, and the processor is configured to invoke the program code to execute the method described in the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium storing a computer program for execution by a processor to implement the method described above.
In the embodiment of the invention, a plurality of voice frames from at least two channels are acquired, one channel corresponding to one channel class; that is, the plurality of voice frames come from channels of at least two channel classes. For each voice frame, the channel class corresponding to the channel from which the voice frame comes is determined, and the channel class is one-hot encoded to obtain a one-hot encoding vector corresponding to the voice frame. A feature vector representing the voice features of the voice frame is acquired, a first vector corresponding to the voice frame is obtained according to the one-hot encoding vector and the feature vector corresponding to the voice frame, and the acoustic model is trained according to the first vector. By adopting the embodiment of the application, the acoustic model can be trained on a combination of voice frames from channels of various channel classes, thereby saving computing resources and reducing maintenance cost.
Drawings
In order to illustrate embodiments of the invention or solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flowchart of an acoustic model training method according to an embodiment of the present invention;
fig. 2a is a schematic diagram of vector stitching according to an embodiment of the present invention;
FIG. 2b is a schematic diagram of another vector concatenation according to an embodiment of the present invention;
FIG. 3 is a flowchart of another method for training an acoustic model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a model input provided in an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an acoustic model training device according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of another acoustic model training device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
An acoustic model training method according to an embodiment of the present invention will be described in detail with reference to fig. 1 to fig. 4.
Referring to fig. 1, a flow chart of an acoustic model training method is provided in an embodiment of the present invention. As shown in fig. 1, the acoustic model training method according to the embodiment of the present invention may include the following steps S101 to S105.
S101, acquiring a plurality of voice frames from at least two channels, wherein one channel corresponds to one channel class;
In the embodiment of the present application, a channel is a channel for transmitting a voice signal, and voice signals of different channel classes are processed with different encoding/decoding and compression schemes. Illustratively, voice signals come from different types of devices, and the channel classes to which the channels they pass through belong may differ. For example, the channel class to which a voice signal from a handset microphone belongs is different from the channel class to which a voice signal from a conventional fixed-line telephone device belongs. Voice signals of different channel classes may differ in frequency, wideband noise, resonant noise, etc.
Specifically, at least two channels are determined, one of the at least two channels corresponding to each channel class. And respectively acquiring the voice signals transmitted by each channel in the at least two channels, and carrying out framing processing on the voice signals of each channel to acquire at least one voice frame of the channel.
In the embodiment of the application, at least one voice frame of each of the at least two channels is acquired, so that a plurality of voice frames from all channels are acquired.
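By way of editorial illustration (not part of the original disclosure), a minimal Python sketch of this framing step is shown below; the 25 ms frame length and 10 ms frame shift are assumed values, since the patent does not prescribe them:

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a 1-D voice signal into overlapping frames.

    The 25 ms / 10 ms defaults are illustrative assumptions; the patent
    does not fix a frame length or shift.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift_len = int(sample_rate * shift_ms / 1000)
    if len(signal) < frame_len:
        return np.empty((0, frame_len), dtype=signal.dtype)
    num_frames = 1 + (len(signal) - frame_len) // shift_len
    return np.stack([signal[i * shift_len: i * shift_len + frame_len]
                     for i in range(num_frames)])

# One second of 8 kHz fixed-line audio yields 98 frames of 200 samples each.
frames = frame_signal(np.zeros(8000, dtype=np.float32), sample_rate=8000)
```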
S102, determining a channel class corresponding to a channel from which the voice frame comes for each voice frame in the voice frames, and performing single-hot encoding on the channel class to obtain a single-hot encoding vector corresponding to the voice frame;
In one embodiment, for each of the plurality of voice frames, the channel class corresponding to the channel from which the voice frame comes is further determined. For example, if the voice frame comes from a conventional fixed-line device, the channel class corresponding to its channel is the conventional fixed-line device channel class.
One-hot encoding is performed on the channel class corresponding to the channel from which the voice frame comes, obtaining the one-hot encoding vector corresponding to the voice frame. One-hot encoding, also known as one-bit effective encoding, uses an N-bit status register to encode N states: each state has its own independent register bit, and only one bit is active at any time. Here N is the number of the at least two channels, i.e. the number of all channel categories.
In machine learning algorithms such as regression, classification and clustering, the calculation of distances or similarities between features is very important. Distances and similarities are commonly computed in Euclidean space, and cosine similarity is likewise based on Euclidean space. One-hot encoding expands the values of a discrete feature into Euclidean space, with each value of the discrete feature corresponding to a point in that space. One-hot encoding the discrete features therefore makes distance calculations between features more reasonable.
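As an illustrative sketch of S102 (an editorial addition, not part of the disclosure), the snippet below one-hot encodes a hypothetical two-class channel setup; the class names are invented for the example:

```python
import numpy as np

# Hypothetical channel classes; the patent only requires at least two.
CHANNEL_CLASSES = ["fixed_line_8k", "mobile_mic_16k"]

def one_hot(channel_class: str) -> np.ndarray:
    """N-bit one-hot encoding: one register bit per channel class,
    exactly one bit active at any time (N = number of channel classes)."""
    vec = np.zeros(len(CHANNEL_CLASSES), dtype=np.float32)
    vec[CHANNEL_CLASSES.index(channel_class)] = 1.0
    return vec

print(one_hot("fixed_line_8k"))   # [1. 0.]
print(one_hot("mobile_mic_16k"))  # [0. 1.]
```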
S103, for each voice frame in the plurality of voice frames, acquiring a feature vector for representing voice features of the voice frame;
In one embodiment, speech feature extraction is performed on each speech frame to obtain a feature vector representing the speech features of the speech frame. The speech features include, but are not limited to, Mel-frequency cepstral coefficient (Mel-scale Frequency Cepstral Coefficients, MFCC) features and filter bank parameter (Filter Bank) features.
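For illustration, the sketch below extracts both feature types with librosa; the library choice, the file name and the 40-dimensional settings are assumptions rather than anything fixed by the patent:

```python
import librosa

# "utterance.wav" is a hypothetical input file; 40 MFCCs / 40 Mel bands are
# illustrative choices only.
y, sr = librosa.load("utterance.wav", sr=None)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)           # (40, num_frames)
fbank = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40))   # (40, num_frames)

frame_feature_vectors = mfcc.T   # one 40-dimensional feature vector per frame
```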
S104, obtaining a first vector corresponding to the voice frame according to the one-hot encoding vector corresponding to the voice frame and the feature vector corresponding to the voice frame;
s105, carrying out model training on the acoustic model to be trained according to the first vector corresponding to each voice frame in the plurality of voice frames.
In this embodiment, for each speech frame, the manner of obtaining the first vector corresponding to the speech frame from the feature vector corresponding to the speech frame and the one-hot encoding vector corresponding to the speech frame includes, but is not limited to, the following two optional embodiments:
In a first alternative embodiment, the one-hot encoding vector corresponding to the speech frame and the feature vector corresponding to the speech frame are spliced to obtain the first vector corresponding to the speech frame.
As shown in fig. 2a, the feature vector corresponding to the voice frame and the one-hot encoding vector corresponding to the voice frame are directly spliced, and the spliced feature vector serves as the first vector corresponding to the voice frame. For example, if the feature vector corresponding to the voice frame is a 40-dimensional vector and there are 2 channel classes, the one-hot encoding vector corresponding to the voice frame is a 2-dimensional vector, and the spliced first vector is 42-dimensional. The spliced first vector is input into the acoustic model for training.
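A minimal numeric sketch of this direct splice, reusing the 40-dimensional feature / 2 channel-class figures from the example above:

```python
import numpy as np

feature_vec = np.random.randn(40).astype(np.float32)    # 40-dim speech features
one_hot_vec = np.array([1.0, 0.0], dtype=np.float32)    # 2 channel classes

first_vec = np.concatenate([feature_vec, one_hot_vec])  # 42-dim, as in the example
assert first_vec.shape == (42,)
```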
In a second alternative embodiment, for each speech frame, an embedding layer is used to process the one-hot encoding vector corresponding to the speech frame to obtain a second vector corresponding to the speech frame. The second vector corresponding to the speech frame is spliced with the feature vector corresponding to the speech frame to obtain the first vector corresponding to the speech frame, i.e. one first vector for each speech frame.
The second vector obtained by processing the one-hot encoding vector corresponding to the speech frame through the embedding layer is a vector that is higher-dimensional and denser than the one-hot encoding vector. The embedding allows speech signals whose vectors are close in distance to have similar meanings, and it can represent the differences between the channel categories. In the second alternative embodiment, processing the one-hot encoding vector through the embedding layer avoids the vector sparsity problem caused by directly using the one-hot encoding vector.
As shown in fig. 2b, the one-hot encoding vector corresponding to the voice frame is processed by the embedding layer to obtain a second vector corresponding to the voice frame; the feature vector corresponding to the voice frame and the second vector corresponding to the voice frame are spliced to obtain the spliced first vector corresponding to the voice frame, and the first vector corresponding to the voice frame is input into the acoustic model for training.
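A sketch of this second embodiment, assuming PyTorch as the framework; the 8-dimensional embedding is an arbitrary illustrative choice, and a bias-free linear layer stands in for the embedding layer (the two are equivalent for one-hot inputs):

```python
import torch
import torch.nn as nn

num_channel_classes = 2
emb_dim = 8   # illustrative; see the dimension-adjustment discussion below

# A bias-free linear layer applied to a one-hot vector is equivalent to an
# embedding-table lookup on the class index.
embedding = nn.Linear(num_channel_classes, emb_dim, bias=False)

one_hot_vec = torch.tensor([[1.0, 0.0]])    # (1, 2) one-hot channel vector
feature_vec = torch.randn(1, 40)            # (1, 40) frame feature vector

second_vec = embedding(one_hot_vec)                       # (1, 8) denser vector
first_vec = torch.cat([feature_vec, second_vec], dim=-1)  # (1, 48) model input
```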
It will be appreciated that in the second alternative embodiment, the dimension of the model parameter vector used in the embedding process may be adjusted according to the dimension of the model parameter vector of the acoustic model, so as to adjust the dimension of the second vector after embedding. For example, if the dimension of the acoustic model's parameter vector is relatively large, so is the dimension of the model parameter vector used in the embedding process.
Optionally, the dimension of the model parameter vector used in the embedding process can also be adjusted according to the differences between the channels, thereby adjusting the dimension of the second vector after embedding. For example, if the difference between channels is relatively large, the dimension of the model parameter vector used in the embedding process is also relatively large, and the dimension of the resulting second vector is correspondingly large. It can be appreciated that adjusting the dimension of the model parameter vector used in the embedding process in this way can strike a balance between model computation and performance.
Specifically, a state quantity representing the degree of difference between the at least two channels may be obtained, and the dimension of the model parameter vector used in the embedding process is adjusted according to the state quantity.
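As a sketch of how such a state quantity might drive the dimension adjustment (the linear mapping and its bounds below are entirely assumed; the patent does not define a concrete rule):

```python
def choose_embedding_dim(channel_difference: float,
                         min_dim: int = 4, max_dim: int = 64) -> int:
    """Map a state quantity in [0, 1] describing the degree of inter-channel
    difference to an embedding dimension. Larger differences give a larger
    dimension; the rule itself is purely an illustrative assumption."""
    channel_difference = min(max(channel_difference, 0.0), 1.0)
    return round(min_dim + channel_difference * (max_dim - min_dim))

print(choose_embedding_dim(0.1))  # 10
print(choose_embedding_dim(0.9))  # 58
```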
Referring to fig. 3, a flowchart of another acoustic model training method is provided in an embodiment of the present invention. As shown in fig. 3, the acoustic model training method according to the embodiment of the present invention may include the following steps S201 to S210.
S201, acquiring a plurality of voice frames from at least two channels, wherein one channel corresponds to one channel class;
S202, determining a channel class corresponding to the channel from which the voice frame comes for each voice frame in the plurality of voice frames, and performing one-hot encoding on the channel class to obtain a one-hot encoding vector corresponding to the voice frame;
s203, for each voice frame in the plurality of voice frames, acquiring a feature vector for representing voice features of the voice frame;
the descriptions of step S201 to step S203 refer to step S101 to step S103 shown in fig. 1, and are not repeated here.
S204, processing the one-hot encoding vector corresponding to the voice frame by using the embedding layer to obtain a second vector corresponding to the voice frame.
S205, splicing the second vector corresponding to the voice frame and the feature vector corresponding to the voice frame to obtain the first vector corresponding to the voice frame;
The one-hot encoding vector corresponding to each voice frame is processed by the embedding layer to obtain the second vector corresponding to the voice frame. The second vector corresponding to the voice frame is spliced with the feature vector corresponding to the voice frame to obtain the first vector corresponding to the voice frame, i.e. one first vector for each voice frame.
S206, inputting a first vector corresponding to each voice frame in the plurality of voice frames into a first hidden layer of the acoustic model to be trained;
Specifically, the acoustic model to be trained may include a plurality of hidden layers connected in sequence, as shown in fig. 4. The spliced first vector corresponding to the voice frame is input into the first hidden layer of the acoustic model to be trained.
S207, determining at least one selected hidden layer from the plurality of hidden layers, wherein the at least one selected hidden layer is a hidden layer except a first hidden layer of the plurality of hidden layers;
In one embodiment, the at least one selected hidden layer may be determined in a number of ways; for example, a selected hidden layer may be chosen every other hidden layer, or the last preset number of hidden layers may be chosen as the selected hidden layers, and so on.
S208, aiming at each selected hidden layer in the at least one selected hidden layer, acquiring an intermediate vector corresponding to the voice frame output by the hidden layer before the selected hidden layer;
In one embodiment, for each selected hidden layer of the at least one selected hidden layer, the respective intermediate vector for each speech frame output by the hidden layer preceding the selected hidden layer is determined.
S209, splicing the second vector corresponding to the voice frame and the intermediate vector corresponding to the voice frame to obtain a third vector corresponding to the voice frame;
Specifically, the second vector corresponding to the voice frame is spliced with the intermediate vector corresponding to the voice frame to obtain a third vector corresponding to the voice frame; each voice frame in the plurality of voice frames corresponds to one third vector.
S210, inputting a third vector corresponding to each voice frame in the plurality of voice frames into the selected hidden layer of the acoustic model to be trained so as to perform model training on the acoustic model;
Specifically, the third vector corresponding to each voice frame is input into the corresponding selected hidden layer in the acoustic model to be trained. For example, if the second hidden layer is a selected hidden layer, the intermediate vector corresponding to the voice frame output by the first hidden layer and the second vector corresponding to the voice frame are spliced to obtain the third vector corresponding to the voice frame, and the third vector corresponding to the voice frame is input into the second hidden layer.
As shown in fig. 4, the one-hot encoding vector is processed by the embedding layer to obtain the processed second vector. The processed second vector can be spliced with the feature vector to obtain the spliced first vector, which is input into the acoustic model for training, i.e. into the first hidden layer of the acoustic model. The second vector after the embedding process can also be spliced with the intermediate vectors output by the hidden layers preceding the other hidden layers of the acoustic model to obtain third vectors, which are input into those other hidden layers for acoustic model training. This prevents the features carried by the processed second vector from being lost during training when the acoustic model is deep.
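The condensed PyTorch sketch below mirrors the fig. 4 scheme under stated assumptions (four hidden layers, injection at every other hidden layer, illustrative widths and target count); it is an editorial illustration, not the patent's reference implementation:

```python
import torch
import torch.nn as nn

class ChannelAwareAcousticModel(nn.Module):
    """Sketch of the fig. 4 scheme: the embedded channel vector (second
    vector) is spliced into the input and re-spliced before each selected
    hidden layer. All sizes here are illustrative assumptions."""

    def __init__(self, feat_dim=40, emb_dim=8, hidden=256,
                 num_layers=4, num_targets=1000):
        super().__init__()
        # One option from S207: choose every other hidden layer (never the first).
        self.selected = {i for i in range(1, num_layers) if i % 2 == 0}
        self.layers = nn.ModuleList()
        in_dim = feat_dim + emb_dim                    # first vector dimension
        for i in range(num_layers):
            if i in self.selected:
                in_dim += emb_dim                      # third vector dimension
            self.layers.append(nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU()))
            in_dim = hidden
        self.out = nn.Linear(hidden, num_targets)

    def forward(self, feature_vec, second_vec):
        x = torch.cat([feature_vec, second_vec], dim=-1)   # first vector (S205/S206)
        for i, layer in enumerate(self.layers):
            if i in self.selected:
                # Splice the previous layer's intermediate vector with the
                # channel embedding to form the third vector (S208-S210).
                x = torch.cat([x, second_vec], dim=-1)
            x = layer(x)
        return self.out(x)

model = ChannelAwareAcousticModel()
logits = model(torch.randn(1, 40), torch.randn(1, 8))   # (1, 1000)
```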
It will be appreciated that in this embodiment, the dimension of the model parameter vector used in the embedding process may be adjusted according to the dimension of the model parameter vector of the acoustic model, so as to adjust the dimension of the second vector after embedding. For example, if the dimension of the acoustic model's parameter vector is relatively large, so is the dimension of the model parameter vector used in the embedding process.
Optionally, the dimension of the model parameter vector used in the embedding process can also be adjusted according to the differences between the channels, thereby adjusting the dimension of the second vector after embedding. For example, if the difference between channels is relatively large, the dimension of the model parameter vector used in the embedding process is also relatively large, and the dimension of the resulting second vector is correspondingly large. It can be appreciated that adjusting the dimension of the model parameter vector used in the embedding process in this way can strike a balance between model computation and performance.
Specifically, a state quantity representing the degree of difference between the at least two channels may be obtained, and the dimension of the model parameter vector used in the embedding process is adjusted according to the state quantity.
Referring to fig. 5, a schematic structural diagram of an acoustic model training device is provided in an embodiment of the present invention. As shown in fig. 5, the acoustic model training apparatus according to the embodiment of the present invention may include:
a first acquisition unit 10 for acquiring a plurality of speech frames from at least two channels;
a first determining unit 11, configured to determine, for each of the plurality of speech frames, a channel class corresponding to a channel from which the speech frame is derived, and perform one-hot encoding on the channel class, to obtain a one-hot encoding vector corresponding to the speech frame;
a second acquisition unit 12 for acquiring feature vectors representing speech features of the speech frame;
a third obtaining unit 13, configured to obtain a first vector corresponding to the speech frame according to the one-hot encoding vector corresponding to the speech frame and the feature vector corresponding to the speech frame;
the model training unit 14 is configured to perform model training on the acoustic model to be trained according to a first vector corresponding to each of the plurality of speech frames.
In a possible implementation manner, the third obtaining unit 13 is specifically configured to splice the one-hot encoding vector corresponding to the speech frame and the feature vector corresponding to the speech frame, so as to obtain the first vector corresponding to the speech frame.
In one possible design, the third obtaining unit 13 is specifically configured to process the one-hot encoding vector corresponding to the speech frame by using an embedding layer to obtain a second vector corresponding to the speech frame;
and splicing the second vector corresponding to the voice frame with the feature vector corresponding to the voice frame to obtain the first vector corresponding to the voice frame.
In one possible design, the acoustic model includes a plurality of hidden layers connected in sequence, the apparatus further comprising:
a second determining unit configured to determine at least one selected hidden layer from the plurality of hidden layers, the at least one selected hidden layer being a hidden layer other than a first hidden layer of the plurality of hidden layers;
a fourth obtaining unit, configured to obtain, for each selected hidden layer of the at least one selected hidden layer, an intermediate vector corresponding to the speech frame output by the hidden layer preceding the selected hidden layer;
a fifth obtaining unit, configured to splice the second vector corresponding to the speech frame and the intermediate vector corresponding to the speech frame, to obtain a third vector corresponding to the speech frame;
the model training unit 14 is specifically configured to input a first vector corresponding to each of the plurality of speech frames into the first hidden layer of the acoustic model to be trained, and to input a third vector corresponding to each of the plurality of speech frames into the selected hidden layer of the acoustic model to be trained, so as to perform model training on the acoustic model.
In one possible design, the apparatus further comprises:
the first adjusting unit is used for acquiring the dimension of the model parameter vector of the acoustic model and adjusting the dimension of the model parameter vector of the embedding layer according to the dimension of the model parameter vector of the acoustic model;
the third obtaining unit 13 is specifically configured to process the one-hot encoding vector corresponding to the speech frame by using the adjusted embedding layer to obtain a second vector corresponding to the speech frame.
In one possible design, the apparatus further comprises:
the second adjusting unit is used for acquiring a state quantity representing the degree of difference between the at least two channels and adjusting the dimension of the model parameter vector of the embedding layer according to the state quantity;
the third obtaining unit 13 is specifically configured to process the one-hot encoding vector corresponding to the speech frame by using the adjusted embedding layer to obtain a second vector corresponding to the speech frame.
In one possible design, the feature vector includes a Mel-frequency cepstral coefficient (MFCC) feature vector or a filter bank parameter feature vector.
The specific description of the embodiment of the apparatus shown in fig. 5 may refer to the specific description of the embodiment of the method shown in fig. 1 or fig. 3, which is not described herein.
Referring to fig. 6, which is a schematic structural diagram of another acoustic model training apparatus according to an embodiment of the present invention, the acoustic model training apparatus 1000 may include: at least one processor 1001, such as a CPU, at least one communication interface 1003, memory 1004, and at least one communication bus 1002. The communication bus 1002 is used to enable connected communication between these components. Communication interface 1003 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1004 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 1004 may also optionally be at least one storage device located remotely from the processor 1001. As shown in fig. 6, the memory 1004, which is a type of computer storage medium, may include an operating system, network communication modules, and program instructions.
In the acoustic model training apparatus 1000 shown in fig. 6, the processor 1001 may be configured to load program instructions stored in the memory 1004 and specifically perform the following operations:
acquiring a plurality of voice frames from at least two channels, wherein one channel corresponds to one channel class;
for each voice frame in the plurality of voice frames, determining a channel class corresponding to the channel from which the voice frame comes, and performing one-hot encoding on the channel class to obtain a one-hot encoding vector corresponding to the voice frame;
acquiring a feature vector for representing a voice feature of the voice frame;
obtaining a first vector corresponding to the voice frame according to the one-hot encoding vector corresponding to the voice frame and the feature vector corresponding to the voice frame;
and carrying out model training on the acoustic model to be trained according to the first vector corresponding to each voice frame in the plurality of voice frames.
It should be noted that, the specific implementation process may refer to the specific description of the method embodiment shown in fig. 1 or fig. 3, and will not be described herein.
Specific implementation steps may be referred to the description of the foregoing embodiments, and are not described herein in detail.
The embodiment of the present invention further provides a computer storage medium, where the computer storage medium may store a plurality of instructions, where the instructions are suitable for being loaded by a processor and executed by the processor, and the specific execution process may refer to the specific description of the embodiment shown in fig. 1 or fig. 3, and is not described herein.
Those skilled in the art will appreciate that implementing all or part of the above-described embodiment methods may be accomplished by way of a computer program, which may be stored on a computer readable storage medium, which when executed comprises the steps of embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.

Claims (10)

1. An acoustic model training method, comprising:
acquiring a plurality of voice frames from at least two channels, wherein one channel corresponds to one channel class;
for each voice frame in the plurality of voice frames, determining a channel class corresponding to the channel from which the voice frame comes, and performing one-hot encoding on the channel class to obtain a one-hot encoding vector corresponding to the voice frame;
acquiring a feature vector for representing a voice feature of the voice frame;
obtaining a first vector corresponding to the voice frame according to the one-hot encoding vector corresponding to the voice frame and the feature vector corresponding to the voice frame;
and carrying out model training on the acoustic model to be trained according to the first vector corresponding to each voice frame in the plurality of voice frames.
2. The method of claim 1, wherein the obtaining the first vector corresponding to the speech frame from the one-hot encoded vector corresponding to the speech frame and the feature vector corresponding to the speech frame comprises:
and splicing the one-hot encoding vector corresponding to the voice frame and the feature vector corresponding to the voice frame to obtain a first vector corresponding to the voice frame.
3. The method of claim 1, wherein the obtaining the first vector corresponding to the speech frame from the one-hot encoded vector corresponding to the speech frame and the feature vector corresponding to the speech frame comprises:
processing the one-hot encoding vector corresponding to the voice frame by using an embedding layer to obtain a second vector corresponding to the voice frame;
and splicing the second vector corresponding to the voice frame with the feature vector corresponding to the voice frame to obtain the first vector corresponding to the voice frame.
4. The method of claim 3, wherein the acoustic model comprises a plurality of hidden layers connected in sequence, the method further comprising:
determining at least one selected hidden layer from the plurality of hidden layers, the at least one selected hidden layer being a hidden layer other than the first hidden layer of the plurality of hidden layers;
for each selected hidden layer in the at least one selected hidden layer, acquiring an intermediate vector corresponding to the voice frame output by the hidden layer before the selected hidden layer;
splicing the second vector corresponding to the voice frame and the intermediate vector corresponding to the voice frame to obtain a third vector corresponding to the voice frame;
performing model training on an acoustic model to be trained according to a first vector corresponding to each voice frame in the plurality of voice frames, including:
inputting a first vector corresponding to each of the plurality of speech frames into the first hidden layer of the acoustic model to be trained; and inputting a third vector corresponding to each of the plurality of speech frames into the selected hidden layer of the acoustic model to be trained, so as to perform model training on the acoustic model.
5. A method as claimed in claim 3, wherein the method further comprises:
acquiring the dimension of the model parameter vector of the acoustic model, and adjusting the dimension of the model parameter vector of the embedding layer according to the dimension of the model parameter vector of the acoustic model;
the processing the one-hot encoding vector corresponding to the voice frame by using the embedding layer to obtain a second vector corresponding to the voice frame includes:
and processing the one-hot encoding vector corresponding to the voice frame by using the adjusted embedding layer to obtain a second vector corresponding to the voice frame.
6. A method as claimed in claim 3, wherein the method further comprises:
acquiring a state quantity representing the degree of difference between the at least two channels, and adjusting the dimension of the model parameter vector of the embedding layer according to the state quantity;
the processing the one-hot encoding vector corresponding to the voice frame by using the embedding layer to obtain a second vector corresponding to the voice frame includes:
and processing the one-hot encoding vector corresponding to the voice frame by using the adjusted embedding layer to obtain a second vector corresponding to the voice frame.
7. The method of any of claims 1-6, wherein the feature vector comprises a Mel-frequency cepstral coefficient (MFCC) feature vector or a filter bank parameter feature vector.
8. An acoustic model training device, comprising:
a first acquisition unit configured to acquire a plurality of speech frames from at least two channels;
a first determining unit, configured to determine, for each of the plurality of speech frames, a channel class corresponding to the channel from which the speech frame comes, and perform one-hot encoding on the channel class to obtain a one-hot encoding vector corresponding to the speech frame;
a second acquisition unit configured to acquire feature vectors representing speech features of the speech frame;
a third acquisition unit, configured to acquire a first vector corresponding to the speech frame according to the one-hot encoding vector corresponding to the speech frame and the feature vector corresponding to the speech frame;
and the model training unit is used for carrying out model training on the acoustic model to be trained according to the first vector corresponding to each voice frame in the plurality of voice frames.
9. An acoustic model training device comprising a processor, a memory and a communication interface, the processor, memory and communication interface being interconnected, wherein the communication interface is adapted to receive and transmit data, the memory is adapted to store program code, and the processor is adapted to invoke the program code to perform the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program, which is executed by a processor to implement the method of any one of claims 1 to 7.
CN202110264782.XA 2021-03-11 2021-03-11 Acoustic model training method and device Active CN113035177B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110264782.XA CN113035177B (en) 2021-03-11 2021-03-11 Acoustic model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110264782.XA CN113035177B (en) 2021-03-11 2021-03-11 Acoustic model training method and device

Publications (2)

Publication Number Publication Date
CN113035177A CN113035177A (en) 2021-06-25
CN113035177B true CN113035177B (en) 2024-02-09

Family

ID=76469783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110264782.XA Active CN113035177B (en) 2021-03-11 2021-03-11 Acoustic model training method and device

Country Status (1)

Country Link
CN (1) CN113035177B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9460711B1 (en) * 2013-04-15 2016-10-04 Google Inc. Multilingual, acoustic deep neural networks
CN108701452A (en) * 2016-02-02 2018-10-23 日本电信电话株式会社 Audio model learning method, audio recognition method, audio model learning device, speech recognition equipment, audio model learning program and speech recognition program
CN110047478A (en) * 2018-01-16 2019-07-23 中国科学院声学研究所 Multicenter voice based on space characteristics compensation identifies Acoustic Modeling method and device
WO2019212375A1 (en) * 2018-05-03 2019-11-07 Общество с ограниченной ответственностью "Центр речевых технологий" Method for obtaining speaker-dependent small high-level acoustic speech attributes
CN111383628A (en) * 2020-03-09 2020-07-07 第四范式(北京)技术有限公司 Acoustic model training method and device, electronic equipment and storage medium
CN111883113A (en) * 2020-07-30 2020-11-03 云知声智能科技股份有限公司 Voice recognition method and device

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10030105A1 (en) * 2000-06-19 2002-01-03 Bosch Gmbh Robert Speech recognition device
US9697826B2 (en) * 2015-03-27 2017-07-04 Google Inc. Processing multi-channel audio waveforms
US10657962B2 (en) * 2018-05-02 2020-05-19 International Business Machines Corporation Modeling multiparty conversation dynamics: speaker, response, addressee selection using a novel deep learning approach
CN111599344B (en) * 2020-03-31 2022-05-17 因诺微科技(天津)有限公司 Language identification method based on splicing characteristics
KR20220069776A (en) * 2020-11-20 2022-05-27 한국전자통신연구원 Method for speech generation for automatic speech recognition
CN112786028B (en) * 2021-02-07 2024-03-26 百果园技术(新加坡)有限公司 Acoustic model processing method, apparatus, device and readable storage medium
CN114898736A (en) * 2022-03-30 2022-08-12 北京小米移动软件有限公司 Voice signal recognition method and device, electronic equipment and storage medium
CN114863916A (en) * 2022-04-26 2022-08-05 北京小米移动软件有限公司 Speech recognition model training method, speech recognition device and storage medium
CN115206288A (en) * 2022-07-11 2022-10-18 新疆大学 Cross-channel language identification method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9460711B1 (en) * 2013-04-15 2016-10-04 Google Inc. Multilingual, acoustic deep neural networks
CN108701452A (en) * 2016-02-02 2018-10-23 日本电信电话株式会社 Audio model learning method, audio recognition method, audio model learning device, speech recognition equipment, audio model learning program and speech recognition program
CN110047478A (en) * 2018-01-16 2019-07-23 中国科学院声学研究所 Multicenter voice based on space characteristics compensation identifies Acoustic Modeling method and device
WO2019212375A1 (en) * 2018-05-03 2019-11-07 Общество с ограниченной ответственностью "Центр речевых технологий" Method for obtaining speaker-dependent small high-level acoustic speech attributes
CN111383628A (en) * 2020-03-09 2020-07-07 第四范式(北京)技术有限公司 Acoustic model training method and device, electronic equipment and storage medium
CN111883113A (en) * 2020-07-30 2020-11-03 云知声智能科技股份有限公司 Voice recognition method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups; G. Hinton et al.; IEEE Signal Processing Magazine; Vol. 29, No. 6; entire document *

Also Published As

Publication number Publication date
CN113035177A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
CN107393526B (en) Voice silence detection method, device, computer equipment and storage medium
KR102002681B1 (en) Bandwidth extension based on generative adversarial networks
US6003002A (en) Method and system of adapting speech recognition models to speaker environment
JP2006079079A (en) Distributed speech recognition system and its method
KR101863097B1 (en) Apparatus and method for keyword recognition
WO2013177981A1 (en) Scene recognition method, device and mobile terminal based on ambient sound
CN109644192B (en) Audio delivery method and apparatus with speech detection period duration compensation
WO2015103836A1 (en) Voice control method and device
WO2019232826A1 (en) I-vector extraction method, speaker recognition method and apparatus, device, and medium
CN110610707A (en) Voice keyword recognition method and device, electronic equipment and storage medium
KR101618512B1 (en) Gaussian mixture model based speaker recognition system and the selection method of additional training utterance
US11562735B1 (en) Multi-modal spoken language understanding systems
CN113178201B (en) Voice conversion method, device, equipment and medium based on non-supervision
CN103514882A (en) Voice identification method and system
CN116490920A (en) Method for detecting an audio challenge, corresponding device, computer program product and computer readable carrier medium for a speech input processed by an automatic speech recognition system
CN113870860A (en) End-to-end voiceprint recognition method and voiceprint recognition device
CN114338623A (en) Audio processing method, device, equipment, medium and computer program product
CN114333865A (en) Model training and tone conversion method, device, equipment and medium
US11551707B2 (en) Speech processing method, information device, and computer program product
CN115472174A (en) Sound noise reduction method and device, electronic equipment and storage medium
CN114530156A (en) Generation countermeasure network optimization method and system for short voice speaker confirmation
CN113035177B (en) Acoustic model training method and device
CN112802498B (en) Voice detection method, device, computer equipment and storage medium
CN113724698B (en) Training method, device, equipment and storage medium of voice recognition model
CN115762500A (en) Voice processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant