CN116364063B - Phoneme alignment method, apparatus, driving apparatus, and medium - Google Patents

Phoneme alignment method, apparatus, driving apparatus, and medium

Info

Publication number
CN116364063B
CN116364063B (application CN202310642929.3A)
Authority
CN
China
Prior art keywords
phoneme
features
phoneme alignment
feature
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310642929.3A
Other languages
Chinese (zh)
Other versions
CN116364063A (en)
Inventor
徐高鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weilai Automobile Technology Anhui Co Ltd
Original Assignee
Weilai Automobile Technology Anhui Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weilai Automobile Technology Anhui Co Ltd filed Critical Weilai Automobile Technology Anhui Co Ltd
Priority to CN202310642929.3A priority Critical patent/CN116364063B/en
Publication of CN116364063A publication Critical patent/CN116364063A/en
Application granted granted Critical
Publication of CN116364063B publication Critical patent/CN116364063B/en

Landscapes

  • Machine Translation (AREA)

Abstract

The application provides a phoneme alignment method, apparatus, driving apparatus, and medium. The method comprises: performing feature extraction on acquired target audio to obtain original audio features; and inputting the original audio features into a phoneme alignment model, trained in advance under a preset feature restoration constraint, to perform phoneme alignment and obtain a first phoneme sequence corresponding to the original audio features. Under the preset feature restoration constraint, the phoneme alignment model ensures that the features restored from the first phoneme sequence remain as consistent as possible with the original audio features, thereby avoiding the impact of the conditional-independence assumption of the CTC output layer and improving the accuracy of phoneme alignment.

Description

Phoneme alignment method, apparatus, driving apparatus, and medium
Technical Field
The application relates to the technical field of speech recognition, and in particular provides a phoneme alignment method, apparatus, driving apparatus, and medium.
Background
Phonemes are the basic units of speech and describe the speech signal at an abstract level. Determining the exact position of each phoneme in a speech signal is important in many tasks, such as speech pronunciation assessment and voice wake-up, and phonemes are also widely used in frame-level speech recognition and speech synthesis tasks. The task of time-aligning phonemes with a speech waveform is called phoneme alignment. Its main difficulties are that the pronunciation patterns of different phonemes vary widely and that the boundary between two phonemes is hard to distinguish. Traditional phoneme alignment methods mainly include the following:
1. Manual alignment
This method marks the time point of each phoneme in the speech signal by hand. Although manual alignment is accurate, it is very time-consuming and expensive, so it cannot be widely used in large-scale alignment scenarios.
2. HMM-GMM based phoneme alignment
A GMM models the emission probability of each phoneme, an HMM models the phoneme sequence, and GMM clustering is performed on each speech frame to obtain the probability that the frame belongs to each phoneme; an HMM decoding search then yields the optimal phoneme sequence over the frames.
3. CTC-based phoneme alignment
CTC (Connectionist Temporal Classification) is an algorithm commonly used in speech recognition, and its input and output may differ in length. For a given input, CTC computes the probability of an output sequence by summing over all possible phoneme alignment paths, scoring each candidate path and selecting the path with the best score.
Compared with the HMM-GMM method, this approach is simpler and easier to use, but because the outputs are assumed to be conditionally independent during computation, the dependency between output phonemes is not considered, so the alignment quality is relatively poor.
Therefore, how to improve the accuracy of phoneme alignment is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The present application has been made to overcome the above-mentioned drawbacks, and provides a phoneme alignment method, apparatus, driving apparatus, and medium that solve, or at least partially solve, the technical problem of low phoneme alignment accuracy.
In a first aspect, the present application provides a phoneme alignment method, the phoneme alignment method comprising:
extracting features of the obtained target audio to obtain original audio features;
inputting the original audio features into a pre-trained phoneme alignment model for phoneme alignment to obtain a first phoneme sequence corresponding to the original audio features;
the phoneme alignment model is obtained by training under a preset feature restoration constraint.
Further, in the above phoneme alignment method, training the phoneme alignment model under the preset feature restoration constraint includes:
extracting features of the obtained sample audio to obtain original sample audio features;
performing iterative training on the phoneme alignment model to be trained by using the original sample audio features until an iteration termination condition is met, so as to obtain the phoneme alignment model; each iteration of training may include the following steps:
inputting the original sample audio features into the current phoneme alignment model to be trained for phoneme alignment to obtain a second phoneme sequence corresponding to the original sample audio features;
performing feature restoration on the second phoneme sequence to obtain restored features corresponding to the second phoneme sequence;
if the original sample audio features and the restored features corresponding to the second phoneme sequence meet the preset feature restoration constraint, determining that the iteration termination condition is met, and taking the current phoneme alignment model to be trained as the phoneme alignment model;
if the original sample audio features and the restored features corresponding to the second phoneme sequence do not meet the preset feature restoration constraint, determining that the iteration termination condition is not met, and updating the parameters of the current phoneme alignment model to be trained to serve as the phoneme alignment model for the next round of training.
Further, in the above phoneme alignment method, performing feature restoration on the second phoneme sequence to obtain restored features corresponding to the second phoneme sequence includes:
calculating the product of the restoration weight of the feature restorer in the phoneme alignment model to be trained and the second phoneme sequence;
and taking the sum of the product and the restoration bias of the feature restorer as the restored features corresponding to the second phoneme sequence.
Further, the phoneme alignment method further includes:
calculating the absolute difference between each original sample audio feature and the restored features of the corresponding second phoneme sequence;
squaring the absolute differences and taking the mean of the squared differences over all samples;
if the mean is within a preset range, determining that the original sample audio features and the restored features corresponding to the second phoneme sequence meet the preset feature restoration constraint;
if the mean is outside the preset range, determining that the original sample audio features and the restored features corresponding to the second phoneme sequence do not meet the preset feature restoration constraint.
Further, in the above phoneme alignment method, feature extraction is performed on the obtained target audio to obtain an original audio feature, including:
and carrying out Fourier transformation on the obtained target audio to obtain the frequency spectrum characteristic corresponding to the target audio as the original audio characteristic.
Further, in the above phoneme alignment method, inputting the original audio features into a pre-trained phoneme alignment model for phoneme alignment to obtain a first phoneme sequence corresponding to the original audio features includes:
encoding the original audio features by using an encoder in the phoneme alignment model to obtain encoded features;
and decoding the encoded features by using a CTC output layer in the phoneme alignment model to obtain the first phoneme sequence.
Further, in the above phoneme alignment method, decoding the encoded features by using the CTC output layer in the phoneme alignment model to obtain the first phoneme sequence includes:
decoding the encoded features by using the CTC output layer to obtain a plurality of phoneme sequence probabilities corresponding to the encoded features;
and selecting the phoneme sequence with the highest probability as the first phoneme sequence.
In a second aspect, the present application provides a phoneme alignment apparatus comprising a processor and a storage device, the storage device being adapted to store a plurality of program codes, the program codes being adapted to be loaded and executed by the processor to perform the phoneme alignment method described above.
In a third aspect, there is provided a driving apparatus including the phoneme alignment apparatus as described above.
In a fourth aspect, there is provided a computer readable storage medium storing a plurality of program codes, wherein the program codes are adapted to be loaded and executed by a processor to perform the phoneme alignment method of any of the above.
The technical solution provided by the application has at least one or more of the following beneficial effects:
In the technical solution of the application, original audio features are obtained by performing feature extraction on the acquired target audio, and the original audio features are input into a phoneme alignment model, trained in advance under a preset feature restoration constraint, to perform phoneme alignment and obtain a first phoneme sequence corresponding to the original audio features. Under the preset feature restoration constraint, the phoneme alignment model ensures that the features restored from the first phoneme sequence remain as consistent as possible with the original audio features, thereby avoiding the impact of the conditional-independence assumption of the CTC output layer and improving the accuracy of phoneme alignment.
Drawings
The present disclosure will become more readily understood with reference to the accompanying drawings. As will be readily appreciated by those skilled in the art: the drawings are for illustrative purposes only and are not intended to limit the scope of the present application. Moreover, like numerals in the figures are used to designate like parts, wherein:
FIG. 1 is a flow chart illustrating the main steps of a phoneme alignment method in accordance with an embodiment of the present application;
FIG. 2 is a flowchart of the main steps for training a phoneme alignment model;
FIG. 3 is a main structural block diagram of a phoneme alignment apparatus according to an embodiment of the present application.
Detailed Description
Some embodiments of the application are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present application, and are not intended to limit the scope of the present application.
In the description of the present application, a "module" or "processor" may include hardware, software, or a combination of both. A module may comprise hardware circuitry, various suitable sensors, communication ports, and memory, or software components such as program code, or a combination of software and hardware. The processor may be a central processing unit, a microprocessor, an image processor, a digital signal processor, or any other suitable processor. The processor has data and/or signal processing functions and may be implemented in software, hardware, or a combination of both. Non-transitory computer readable storage media include any suitable medium that can store program code, such as magnetic disks, hard disks, optical disks, flash memory, read-only memory, and random access memory. The term "A and/or B" denotes all possible combinations of A and B, such as A alone, B alone, or A and B. The term "at least one A or B" or "at least one of A and B" has a meaning similar to "A and/or B" and may include A alone, B alone, or A and B. The singular forms "a", "an", and "the" include plural referents.
The traditional phoneme alignment methods are as summarized in the background above: manual alignment is accurate but too time-consuming and expensive for large-scale use; HMM-GMM based alignment models phoneme emission probabilities with a GMM and the phoneme sequence with an HMM; and CTC-based alignment, while simpler and easier to use than HMM-GMM, assumes its outputs are conditionally independent and therefore ignores the dependency between output phonemes, so its alignment quality is relatively poor.
Therefore, in order to improve the accuracy of phoneme alignment, the present application provides the following technical solutions:
referring to fig. 1, fig. 1 is a flowchart illustrating main steps of a phoneme alignment method according to an embodiment of the present application. Wherein the phoneme alignment method can be applied to, but is not limited to, a digital cabin voice interactive device. As shown in fig. 1, the phoneme alignment method in the embodiment of the present application mainly includes the following steps 101 to 102.
Step 101, performing feature extraction on the acquired target audio to obtain original audio features;
in one implementation, the voice capture device may be used to capture audio and take the captured audio as the target audio. After the target audio is obtained, feature extraction can be carried out on the target audio to obtain original audio features. The obtained target audio can be subjected to Fourier transformation to obtain the spectrum characteristics corresponding to the target audio as the original audio characteristics. Specifically, fourier transform can be performed with reference to the calculation formula (1):
X = F(x)    (1)

where X denotes the converted spectral features, i.e., the original audio features, x denotes the target audio, and F denotes the Fourier transform function.
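For illustration, a minimal Python sketch of this feature extraction step, framing the audio and applying a short-time Fourier transform. The frame length, hop size, window choice, and use of magnitude spectra are assumptions, since formula (1) only specifies a Fourier transform:

```python
import numpy as np

def extract_spectral_features(audio: np.ndarray,
                              frame_len: int = 400,
                              hop: int = 160) -> np.ndarray:
    """Frame the audio and take per-frame magnitude spectra (formula (1)).

    The 25 ms / 10 ms framing at 16 kHz and the Hann window are illustrative
    assumptions; the patent only specifies a Fourier transform.
    """
    n_frames = 1 + max(0, (len(audio) - frame_len) // hop)
    frames = np.stack([audio[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    window = np.hanning(frame_len)
    # X = F(x): FFT of each windowed frame, keeping the magnitude
    return np.abs(np.fft.rfft(frames * window, axis=-1))

# usage: a 1 s clip at 16 kHz yields features of shape (98, 201)
features = extract_spectral_features(np.random.randn(16000).astype(np.float32))
```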
Step 102, inputting the original audio features into a phoneme alignment model trained in advance under a preset feature restoration constraint to perform phoneme alignment, so as to obtain a first phoneme sequence corresponding to the original audio features.
In a specific implementation, step 102 may be implemented as follows:
(1) Encoding the original audio features by using an encoder in the phoneme alignment model to obtain encoded features;
The encoding process can refer to calculation formula (2):

H = Encoder(X)    (2)

where H denotes the encoded features; the Encoder may consist of a 2-layer BLSTM (a code sketch of this encoder appears in the training-loop example after step 206).
(2) Decoding the encoded features by using the CTC output layer in the phoneme alignment model to obtain the first phoneme sequence.
Specifically, the CTC output layer may be used to decode the encoded features to obtain a plurality of phoneme sequence probabilities corresponding to the encoded features; the phoneme sequence with the highest probability is then selected as the first phoneme sequence.
The process of obtaining the probabilities of the plurality of phoneme sequences corresponding to the encoded features can refer to calculation formula (3):

p(π|X) = ∏_{t=1}^{T} y^t_{π_t},    p(l|X) = Σ_{π∈B⁻¹(l)} p(π|X)    (3)

where p(π|X) denotes the probability of a single path π, y^t_{π_t} denotes the posterior probability at time t on that path, p(l|X) denotes the probability of the phoneme sequence l, and B⁻¹(l) denotes the set of all paths recognized as (i.e., collapsing to) the phoneme sequence l. The probability of each path is obtained by multiplying the posterior probabilities over time steps 1 to T, and the probabilities of all paths under a phoneme sequence are then summed to obtain the probability of that phoneme sequence.
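As a concrete illustration of formula (3), the following Python sketch enumerates every possible path on a toy example, multiplies per-frame posteriors to score each path, and sums the paths that collapse (merge repeats, then remove blanks) to a given phoneme sequence. The toy posterior values are made up, and real systems replace this exponential enumeration with the CTC forward algorithm:

```python
import itertools
import numpy as np

def collapse(path, blank=0):
    """B(pi): merge repeated symbols, then drop blanks."""
    merged = [p for p, _ in itertools.groupby(path)]
    return tuple(p for p in merged if p != blank)

def sequence_probability(posteriors: np.ndarray, label: tuple, blank=0) -> float:
    """p(l|X) = sum over paths pi with B(pi) = l of prod_t y[t, pi_t] (formula (3))."""
    T, V = posteriors.shape
    total = 0.0
    for path in itertools.product(range(V), repeat=T):  # all V^T paths
        if collapse(path, blank) == label:
            total += float(np.prod(posteriors[np.arange(T), path]))
    return total

# toy posteriors: 3 frames, vocabulary {0: blank, 1: "a", 2: "b"}
y = np.array([[0.1, 0.8, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
print(sequence_probability(y, (1, 2)))  # probability of phoneme sequence "a b"
```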
In a specific implementation process, the phoneme alignment model of this embodiment may be obtained by training under a preset feature restoration constraint.
FIG. 2 is a flowchart of the main steps of training the phoneme alignment model. As shown in FIG. 2, training the phoneme alignment model in the embodiment of the application mainly includes the following steps 201 to 206.
Step 201, extracting features of the obtained sample audio to obtain original sample audio features;
in a specific implementation process, a large number of online audio recordings can be retrieved as sample audio, and feature extraction is performed on the obtained sample audio to obtain the original sample audio features.
It should be noted that, in this step, the spectral feature of the sample audio may be obtained as the original sample audio feature by using the method of the calculation formula (1).
After the original sample audio features are obtained, iterative training is performed on the phoneme alignment model to be trained using the original sample audio features until the iteration termination condition is met, so as to obtain the phoneme alignment model; each round of iterative training may include steps 202 to 206 below.
Step 202, inputting the original sample audio features into a current phoneme alignment model to be trained for phoneme alignment, and obtaining a second phoneme sequence corresponding to the original sample audio features;
step 203, performing feature reduction on the second phoneme sequence to obtain a reduction feature corresponding to the second phoneme sequence;
in a specific implementation process, a product of the reduction weight of the feature reducer in the phoneme alignment model to be trained and the second phoneme sequence can be calculated; and taking the sum of the product and the restoring bias of the feature restorer as a restoring feature corresponding to the second phoneme sequence.
The calculation formula corresponding to this process is calculation formula (4):

X̂ = W·π + b    (4)

where X̂ denotes the output of the feature restorer, i.e., the restored features, W denotes the restoration weight of the feature restorer, π denotes the second phoneme sequence, and b denotes the restoration bias of the feature restorer.
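A minimal Python (PyTorch) sketch of such a feature restorer under formula (4). Representing the frame-level second phoneme sequence as one-hot vectors so that a single linear map (weight W, bias b) can produce restored features is an assumption, since the patent does not state how the phoneme sequence is vectorized:

```python
import torch
import torch.nn as nn

class FeatureRestorer(nn.Module):
    """Restored features X_hat = W * pi + b (formula (4))."""
    def __init__(self, n_phonemes: int = 65, feat_dim: int = 201):
        super().__init__()
        self.linear = nn.Linear(n_phonemes, feat_dim)  # holds W and b

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # one-hot encode the frame-level phoneme sequence (an assumed encoding)
        one_hot = nn.functional.one_hot(
            phoneme_ids, num_classes=self.linear.in_features).float()
        return self.linear(one_hot)  # X_hat = W * pi + b, frame by frame
```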
Step 204, detecting whether the original sample audio features and the restored features corresponding to the second phoneme sequence meet the preset feature restoration constraint; if yes, go to step 205; if no, go to step 206.
In one specific implementation, step 204 may be implemented as follows:
(11) Calculating the absolute difference between each original sample audio feature and the restored features of the corresponding second phoneme sequence;
(12) Squaring the absolute differences and taking the mean of the squared differences over all samples;
(13) If the mean is within a preset range, determining that the original sample audio features and the restored features corresponding to the second phoneme sequence meet the preset feature restoration constraint;
(14) If the mean is outside the preset range, determining that the original sample audio features and the restored features corresponding to the second phoneme sequence do not meet the preset feature restoration constraint.
Steps (11) and (12) above can be expressed as calculation formula (5):

L = (1/N) Σ_{i=1}^{N} |X_i − X̂_i|²    (5)

where L denotes the loss value, X_i denotes the original sample audio features of the i-th sample, X̂_i denotes the restored features of the corresponding second phoneme sequence, and N denotes the total number of audio samples.
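A short sketch of this loss and of the constraint check in step 204, assuming the original and restored features are stacked into tensors of the same shape; the threshold standing in for the "preset range" is illustrative:

```python
import torch

def restoration_loss(original: torch.Tensor, restored: torch.Tensor) -> torch.Tensor:
    """L = (1/N) * sum_i |X_i - X_hat_i|^2 (formula (5)).

    .mean() averages over all feature elements as well as samples, a common
    simplification of the per-sample mean in formula (5).
    """
    return (original - restored).abs().pow(2).mean()

EPSILON = 1e-3  # illustrative stand-in for the "preset range"

def constraint_satisfied(original: torch.Tensor, restored: torch.Tensor) -> bool:
    """Step 204: the restoration constraint holds when the mean loss is small enough."""
    return restoration_loss(original, restored).item() <= EPSILON
```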
Step 205, determining that the iteration termination condition is met, and taking the current phoneme alignment model to be trained as the phoneme alignment model.
If the original sample audio features and the restored features corresponding to the second phoneme sequence meet the preset feature restoration constraint, the two are essentially the same; it can then be determined that the iteration termination condition is met, and the current phoneme alignment model to be trained is taken as the phoneme alignment model.
Step 206, determining that the iteration termination condition is not met, updating the parameters of the current phoneme alignment model to be trained, taking the updated model as the phoneme alignment model for the next round of training, and returning to step 202.
If the original sample audio features and the restored features corresponding to the second phoneme sequence do not meet the preset feature restoration constraint, meaning the two differ significantly, it can be determined that the iteration termination condition is not met; after a parameter update, the current phoneme alignment model to be trained serves as the model for the next round of training, and the process returns to step 202 to continue iterating.
In a specific implementation process, although the phoneme alignment model outputs phoneme sequences during training without considering the dependency between output phonemes, the output phoneme sequence is subsequently passed through feature restoration, and the preset feature restoration constraint checks whether the restored features are consistent with the original sample audio features. The trained phoneme alignment model is accepted only when they are consistent. As a result, when the trained model is later used for phoneme alignment, the influence of the CTC output layer's conditional-independence assumption is avoided and a relatively accurate alignment result is obtained.
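Putting steps 201 to 206 together, the following Python (PyTorch) sketch assembles the pieces sketched above (extract_spectral_features, FeatureRestorer, restoration_loss, constraint_satisfied) into one training loop. The model architecture details (feature dimension, hidden size, 65 CTC classes including the blank), the greedy choice of the second phoneme sequence, the combination of the CTC loss with the restoration loss, and the dummy data batch are all assumptions for illustration; the patent specifies only the 2-layer BLSTM encoder, the CTC output layer, the linear feature restorer, and the restoration constraint as the termination test:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhonemeAlignmentModel(nn.Module):
    """Encoder (2-layer BLSTM, formula (2)) followed by a CTC output layer."""
    def __init__(self, feat_dim: int = 201, hidden_dim: int = 256, n_classes: int = 65):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.ctc_out = nn.Linear(2 * hidden_dim, n_classes)  # phonemes + blank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.encoder(x)                      # H = Encoder(X)
        return self.ctc_out(h).log_softmax(dim=-1)  # per-frame phoneme posteriors

model = PhonemeAlignmentModel()
restorer = FeatureRestorer(n_phonemes=65, feat_dim=201)  # sized for CTC vocab incl. blank
optimizer = torch.optim.Adam(list(model.parameters()) + list(restorer.parameters()))

# a dummy batch in place of a real data loader (illustrative shapes only)
loader = [(torch.randn(2, 98, 201), torch.randint(1, 65, (2, 10)),
           torch.full((2,), 98, dtype=torch.long), torch.full((2,), 10, dtype=torch.long))]

for features, labels, feat_lens, label_lens in loader:
    log_probs = model(features)                         # step 202: phoneme alignment
    ctc = F.ctc_loss(log_probs.transpose(0, 1), labels, feat_lens, label_lens)
    phoneme_ids = log_probs.argmax(dim=-1)              # greedy second phoneme sequence
    restored = restorer(phoneme_ids)                    # step 203: feature restoration
    loss = ctc + restoration_loss(features, restored)   # assumed combined objective
    if constraint_satisfied(features, restored):        # steps 204-205: accept model
        break
    optimizer.zero_grad()                               # step 206: update and iterate
    loss.backward()
    optimizer.step()
```

Note that the greedy argmax detaches the restoration loss from the encoder, so in this sketch the restoration term trains only the restorer while the CTC term trains the model; the patent does not prescribe how (or whether) the restoration gradient reaches the encoder.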
According to the above phoneme alignment method, original audio features are obtained by performing feature extraction on the acquired target audio, and the original audio features are input into a phoneme alignment model, trained in advance under a preset feature restoration constraint, to perform phoneme alignment and obtain a first phoneme sequence corresponding to the original audio features. Under the preset feature restoration constraint, the phoneme alignment model ensures that the features restored from the first phoneme sequence remain as consistent as possible with the original audio features, thereby avoiding the impact of the conditional-independence assumption of the CTC output layer and improving the accuracy of phoneme alignment.
It should be noted that, although the foregoing embodiments describe the steps in a specific order, it will be understood by those skilled in the art that, in order to achieve the effects of the present application, the steps are not necessarily performed in such an order, and may be performed simultaneously (in parallel) or in other orders, and these variations are within the scope of the present application.
It will be appreciated by those skilled in the art that all or part of the methods of the above embodiments may be implemented by instructing relevant hardware through a computer program, which may be stored in a computer readable storage medium and which, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer readable storage medium may include: any entity or device capable of carrying the computer program code, a medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory, a random access memory, an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content included in the computer readable storage medium may be appropriately added or removed according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer readable storage media do not include electrical carrier signals and telecommunications signals.
Further, the application also provides a phoneme alignment device.
Referring to fig. 3, fig. 3 is a main structural block diagram of a phoneme alignment apparatus according to an embodiment of the present application. As shown in fig. 3, the phoneme alignment apparatus in an embodiment of the application may include a processor 31 and a storage device 32.
The storage device 32 may be configured to store a program for performing the phoneme alignment method of the above method embodiment, and the processor 31 may be configured to execute the program in the storage device 32, including but not limited to the program for performing the phoneme alignment method of the above method embodiment. For convenience of explanation, only those portions relevant to the embodiments of the present application are shown; for specific technical details not disclosed, please refer to the method portions of the embodiments. The phoneme alignment apparatus may be a control apparatus formed of various electronic devices.
In one implementation, the number of storage devices 32 and processors 31 may each be more than one, and the program for performing the phoneme alignment method of the above method embodiment may be divided into a plurality of subprograms, each of which may be loaded and executed by a processor 31 to perform different steps of the phoneme alignment method. Specifically, each subprogram may be stored in a different storage device 32, and each processor 31 may be configured to execute the programs in one or more storage devices 32, so that the processors 31 jointly implement the phoneme alignment method of the above method embodiment, i.e., each processor 31 executes different steps of the method to implement it together.
The plurality of processors 31 may be processors disposed on the same device, for example, the device may be a high-performance device composed of a plurality of processors, and the plurality of processors 31 may be processors configured on the high-performance device. The plurality of processors 31 may be processors disposed on different devices, for example, the devices may be a server cluster, and the plurality of processors 31 may be processors on different servers in the server cluster.
Further, the present application also provides a driving apparatus, which may include the phoneme alignment apparatus of the above embodiment.
Further, the application also provides a computer readable storage medium. In one embodiment of a computer readable storage medium according to the present application, the computer readable storage medium may be configured to store a program for performing the phoneme alignment method of the above method embodiment, and the program may be loaded and executed by a processor to implement the phoneme alignment method. The phoneme alignment method may be applied to, but is not limited to, a digital cockpit voice interaction device. For convenience of explanation, only those portions relevant to the embodiments of the present application are shown; for specific technical details not disclosed, please refer to the method portions of the embodiments. The computer readable storage medium may be a storage device formed of various electronic devices; optionally, the computer readable storage medium in the embodiments of the present application is a non-transitory computer readable storage medium.
Further, it should be understood that, since the respective modules are merely provided to illustrate the functional units of the apparatus of the present application, the physical devices corresponding to the modules may be the processor itself, or a part of the software, hardware, or combined software and hardware in the processor. Accordingly, the number of individual modules in the figures is merely illustrative.
Those skilled in the art will appreciate that the various modules in the apparatus may be adaptively split or combined. Such splitting or combining of specific modules does not cause the technical solution to deviate from the principle of the present application, and therefore, the technical solution after splitting or combining falls within the protection scope of the present application.
It should be noted that any personal user information involved in the embodiments of the present application is obtained in strict accordance with the requirements of laws and regulations, following the principles of legality, legitimacy, and necessity, based on reasonable business-scenario purposes, and is either actively provided by the user or generated with the user's authorization while using the product/service.
The personal information processed by the applicant may vary with the specific product/service scenario and may involve the user's account information, device information, driving information, vehicle information, or other related information, depending on the specific scenario in which the user uses the product/service. The applicant treats users' personal information and its processing with great diligence.
The applicant attaches great importance to the security of users' personal information and has adopted reasonably feasible security measures that meet industry standards to protect user information and prevent personal information from unauthorized access, disclosure, use, modification, damage, or loss.
Thus far, the technical solution of the present application has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present application is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present application, and such modifications and substitutions will fall within the scope of the present application.

Claims (9)

1. A method of phoneme alignment comprising:
extracting features of the obtained target audio to obtain original audio features;
inputting the original audio features into a pre-trained phoneme alignment model for phoneme alignment to obtain a first phoneme sequence corresponding to the original audio features;
the training process of the phoneme alignment model under a preset feature restoration constraint comprises the following steps:
extracting features of the obtained sample audio to obtain original sample audio features;
performing iterative training on the phoneme alignment model to be trained by using the original sample audio features until an iteration termination condition is met, so as to obtain the phoneme alignment model, wherein each iteration of training comprises:
inputting the original sample audio features into the current phoneme alignment model to be trained for phoneme alignment to obtain a second phoneme sequence corresponding to the original sample audio features;
performing feature restoration on the second phoneme sequence to obtain restored features corresponding to the second phoneme sequence;
if the original sample audio features and the restored features corresponding to the second phoneme sequence meet the preset feature restoration constraint, determining that the iteration termination condition is met, and taking the current phoneme alignment model to be trained as the phoneme alignment model;
if the original sample audio features and the restored features corresponding to the second phoneme sequence do not meet the preset feature restoration constraint, determining that the iteration termination condition is not met, and updating parameters of the current phoneme alignment model to be trained to serve as the phoneme alignment model to be trained for the next round of training.
2. The phoneme alignment method as recited in claim 1, wherein performing feature restoration on the second phoneme sequence to obtain restored features corresponding to the second phoneme sequence comprises:
calculating the product of the restoration weight of the feature restorer in the phoneme alignment model to be trained and the second phoneme sequence;
and taking the sum of the product and the restoration bias of the feature restorer as the restored features corresponding to the second phoneme sequence.
3. The phoneme alignment method as recited in claim 1, further comprising:
calculating the absolute difference between each original sample audio feature and the restored features of the corresponding second phoneme sequence;
squaring the absolute differences and taking the mean of the squared differences over all samples;
if the mean is within a preset range, determining that the original sample audio features and the restored features corresponding to the second phoneme sequence meet the preset feature restoration constraint;
if the mean is outside the preset range, determining that the original sample audio features and the restored features corresponding to the second phoneme sequence do not meet the preset feature restoration constraint.
4. The phoneme alignment method as recited in claim 1, wherein performing feature extraction on the acquired target audio to obtain the original audio features comprises:
performing a Fourier transform on the obtained target audio to obtain the spectral features corresponding to the target audio as the original audio features.
5. The phoneme alignment method as recited in claim 1, wherein inputting the original audio features into a pre-trained phoneme alignment model for phoneme alignment to obtain a first phoneme sequence corresponding to the original audio features comprises:
encoding the original audio features by using an encoder in the phoneme alignment model to obtain encoded features;
and decoding the encoded features by using a CTC output layer in the phoneme alignment model to obtain the first phoneme sequence.
6. The phoneme alignment method as recited in claim 5, wherein decoding the encoded features by using the CTC output layer in the phoneme alignment model to obtain the first phoneme sequence comprises:
decoding the encoded features by using the CTC output layer to obtain a plurality of phoneme sequence probabilities corresponding to the encoded features;
and selecting the phoneme sequence with the highest probability as the first phoneme sequence.
7. A phoneme alignment apparatus comprising a processor and a storage device, the storage device being adapted to store a plurality of program codes, wherein the program codes are adapted to be loaded and executed by the processor to perform the phoneme alignment method as recited in any one of claims 1 to 6.
8. A driving apparatus comprising the phoneme alignment apparatus as claimed in claim 7.
9. A computer readable storage medium storing a plurality of program codes, wherein the program codes are adapted to be loaded and executed by a processor to perform the phoneme alignment method as recited in any one of claims 1 to 6.
CN202310642929.3A 2023-06-01 2023-06-01 Phoneme alignment method, apparatus, driving apparatus, and medium Active CN116364063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310642929.3A CN116364063B (en) 2023-06-01 2023-06-01 Phoneme alignment method, apparatus, driving apparatus, and medium

Publications (2)

Publication Number Publication Date
CN116364063A (en) 2023-06-30
CN116364063B (en) 2023-09-05

Family

ID=86913458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310642929.3A Active CN116364063B (en) 2023-06-01 2023-06-01 Phoneme alignment method, apparatus, driving apparatus, and medium

Country Status (1)

Country Link
CN (1) CN116364063B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11626104B2 (en) * 2020-12-08 2023-04-11 Qualcomm Incorporated User speech profile management

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004097793A1 (en) * 2003-04-30 2004-11-11 Loquendo S.P.A. Grapheme to phoneme alignment method and relative rule-set generating system
WO2019019252A1 (en) * 2017-07-28 2019-01-31 平安科技(深圳)有限公司 Acoustic model training method, speech recognition method and apparatus, device and medium
WO2021000597A1 (en) * 2019-07-03 2021-01-07 南方科技大学 Voice signal processing method and device, terminal, and storage medium
CN114373480A (en) * 2021-12-17 2022-04-19 腾讯音乐娱乐科技(深圳)有限公司 Training method of voice alignment network, voice alignment method and electronic equipment
CN115910032A (en) * 2022-12-06 2023-04-04 腾讯音乐娱乐科技(深圳)有限公司 Phoneme alignment model training method, computer equipment and computer storage medium
CN116092469A (en) * 2023-01-18 2023-05-09 珠海亿智电子科技有限公司 Model training method and voice synthesis method based on semi-supervised knowledge distillation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张德丰 (Zhang Defeng). TENSORFLOW深度学习从入门到进阶 (TensorFlow Deep Learning: From Beginner to Advanced). 机械工业出版社 (China Machine Press), 2020, pp. 280-281. *

Also Published As

Publication number Publication date
CN116364063A (en) 2023-06-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant