CN113450779B - Speech model training data set construction method and device - Google Patents


Info

Publication number
CN113450779B
CN113450779B · CN202110697465.7A
Authority
CN
China
Prior art keywords
sample
polyphone
vector
word
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110697465.7A
Other languages
Chinese (zh)
Other versions
CN113450779A (en)
Inventor
马明
刘宇
Current Assignee
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd
Priority to CN202110697465.7A
Publication of CN113450779A
Application granted
Publication of CN113450779B
Legal status: Active

Classifications

    • G: Physics
    • G10: Musical instruments; Acoustics
    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L13/00: Speech synthesis; text to speech systems
    • G10L13/06: Elementary speech units used in speech synthesisers; concatenation rules
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; stress or intonation
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L2015/088: Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

An embodiment of the application provides a method and a device for constructing a speech model training data set. The method comprises: after polyphone samples and non-polyphone samples are obtained, performing vector representation on the polyphone samples and the non-polyphone samples respectively; performing repeated sampling on the polyphone sample vector representations, and constructing new polyphone sample vector representations from the repeatedly sampled polyphone sample vector representations; and finally, merging the polyphone sample vector representations, the new polyphone sample vector representations and the non-polyphone sample vector representations to obtain the constructed speech model training data set. The method and device for constructing a speech model training data set can increase the number of polyphone sample vector representations in the training data set, avoid an unbalanced distribution of polyphone and non-polyphone training samples, improve the conversion accuracy of the trained speech model, and improve the user experience.

Description

Method and device for constructing a speech model training data set
Technical Field
The application relates to the technical field of voice interaction, and in particular to a method and a device for constructing a speech model training data set.
Background
With the development of artificial intelligence in the field of voice interaction, intelligent devices can convert text input by users into audio.
There are currently a large number of end-to-end text-to-audio speech models based on deep learning. After training the speech models with a given data set, the text to be converted is input into the trained speech models, and the corresponding audio can be obtained.
However, a core difficulty in the text-to-audio process is the pronunciation of polyphones. Because polyphone data make up only a small proportion of daily language use, the training samples used to train a speech model contain relatively few polyphone training samples, so the polyphone and non-polyphone training samples are distributed in an unbalanced manner. As a result, when a speech model trained on an existing training data set is used for text-to-audio conversion, polyphones are easily predicted as non-polyphones, the conversion accuracy is low, and the user experience is ultimately poor.
Disclosure of Invention
The application provides a method and a device for constructing a speech model training data set, to solve the problems that, when a speech model trained on an existing training data set converts text into audio, polyphones are easily predicted as non-polyphones, the conversion accuracy is low, and the user experience is ultimately poor.
In a first aspect, an embodiment of the present application provides a method for constructing a speech model training data set, where the method includes:
acquiring a speech model training sample set, wherein the speech model training sample set comprises polyphone samples and non-polyphone samples, the polyphone samples are sentences containing at least one Chinese polyphone, the non-polyphone samples are sentences containing no Chinese polyphones, and the number of non-polyphone samples is greater than the number of polyphone samples;
performing vector representation on the polyphone sample and the non-polyphone sample to obtain corresponding polyphone sample vector representation and non-polyphone sample vector representation;
performing repeated sampling processing on the polyphone sample vector characterization, and constructing a new sample according to the polyphone sample vector characterization subjected to repeated sampling to obtain a new polyphone sample vector characterization;
and combining the polyphone sample vector characterization, the non-polyphone sample vector characterization and the new polyphone sample vector characterization to obtain a constructed speech model training data set.
In a second aspect, an embodiment of the present application provides an apparatus for constructing a speech model training data set, where the apparatus includes:
a speech model training sample set obtaining unit for performing: acquiring a speech model training sample set, wherein the speech model training sample set comprises polyphone samples and non-polyphone samples, the polyphone samples are sentences containing at least one Chinese polyphone, the non-polyphone samples are sentences containing no Chinese polyphones, and the number of non-polyphone samples is greater than the number of polyphone samples;
a vector characterization unit to perform: performing vector characterization on the polyphone sample and the non-polyphone sample to obtain corresponding polyphone sample vector characterization and non-polyphone sample vector characterization;
a resampling unit to perform: performing repeated sampling processing on the polyphone sample vector representation;
a new data generation unit for performing: constructing a new sample according to the repeatedly sampled polyphone sample vector characterization to obtain a new polyphone sample vector characterization;
a data merging unit to perform: and combining the polyphone sample vector characterization, the non-polyphone sample vector characterization and the new polyphone sample vector characterization to obtain a constructed speech model training data set.
The technical scheme provided by the application has the following beneficial effects: after polyphone samples and non-polyphone samples are obtained, vector representation is performed on each respectively, yielding polyphone sample vector representations and non-polyphone sample vector representations. Repeated sampling is then performed on the polyphone sample vector representations, and new polyphone sample vector representations are constructed from the repeatedly sampled polyphone sample vector representations. Finally, the polyphone sample vector representations, the new polyphone sample vector representations and the non-polyphone sample vector representations are merged to obtain the constructed speech model training data set. The method and device for constructing a speech model training data set can increase the number of polyphone sample vector representations in the training data set, avoid an unbalanced distribution of polyphone and non-polyphone training samples, improve the conversion accuracy of the trained speech model, and improve the user experience.
Drawings
In order to describe the technical solution of the present application more clearly, the drawings required in the embodiments are briefly described below; those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic flow chart illustrating a method for constructing a speech model training data set according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating a sentence characterization method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a method for obtaining the K nearest neighbors of minority-class samples according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a new sample construction method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating an apparatus for constructing a training data set of a speech model according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Reference throughout this specification to "embodiments," "some embodiments," "one embodiment," or "an embodiment," etc., means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases "in various embodiments," "in some embodiments," "in at least one other embodiment," or "in an embodiment" or the like throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Thus, the particular features, structures, or characteristics shown or described in connection with one embodiment may be combined, in whole or in part, with the features, structures, or characteristics of one or more other embodiments, without limitation. Such modifications and variations are intended to be included within the scope of the present application.
With the development of artificial intelligence in the field of voice interaction, intelligent devices can convert text input by users into audio. There are currently a large number of end-to-end text-to-audio speech models based on deep learning. After training the speech models with a given data set, the text to be converted is input into the trained speech models, and the corresponding audio is obtained.
However, a core difficulty in the text-to-audio process is the pronunciation of polyphones. Because polyphone data make up only a small proportion of daily language use, the training samples used to train a speech model contain relatively few polyphone training samples, so the polyphone and non-polyphone training samples are distributed in an unbalanced manner. As a result, when a speech model trained on an existing training data set is used for text-to-audio conversion, polyphones are easily predicted as non-polyphones, the conversion accuracy is low, and the user experience is ultimately poor.
In order to solve the above problems, the present application provides a method for constructing a speech model training data set, which can increase the polyphone sample vector representation in the speech model training data set, avoid the situation that the polyphone training samples and the non-polyphone training samples are not distributed equally, further improve the conversion accuracy of the trained speech model, and improve the user experience.
The flow chart of the speech model training data set construction method shown in fig. 1 comprises the following steps:
and step S101, obtaining a speech model training sample set.
The speech model training sample set may be obtained from the Internet, and includes polyphone samples and non-polyphone samples. Polyphone samples are sentences that include at least one Chinese polyphone, and non-polyphone samples are sentences that include none. The Chinese polyphones included in the polyphone samples may be common Chinese polyphones obtained through statistics, such as "single", "folding" and "landing". Sentences including these common polyphones can then be searched for, to serve as polyphone samples.
Because polyphones are used less frequently in daily life than non-polyphones, the acquired speech model training sample set contains more non-polyphone samples than polyphone samples.
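The split into polyphone and non-polyphone samples described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the polyphone list and the example sentences are hypothetical.

```python
# Hypothetical subset of common Chinese polyphones obtained through statistics.
COMMON_POLYPHONES = {"长", "单", "折", "着", "了"}

def split_sample_set(sentences):
    """Partition sentences into polyphone samples (contain at least one
    common polyphone) and non-polyphone samples (contain none)."""
    polyphone, non_polyphone = [], []
    for s in sentences:
        if any(ch in COMMON_POLYPHONES for ch in s):
            polyphone.append(s)
        else:
            non_polyphone.append(s)
    return polyphone, non_polyphone

poly, non_poly = split_sample_set(["我想看电影", "这条路很长", "他是单身"])
```

In a realistic sample set the `non_poly` list would be considerably longer than `poly`, which is the imbalance the rest of the method addresses.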
Step S102: performing vector representation on the polyphone samples and non-polyphone samples obtained in step S101, to obtain corresponding polyphone sample vector representations and non-polyphone sample vector representations.
in some embodiments, both the non-polyphonic samples and the polyphonic samples are sentence samples, so the samples are vector characterized, in effect, sentences. As shown in fig. 2, the specific steps of performing vector characterization on the sentence samples may include:
firstly, word segmentation processing and word segmentation processing are carried out on sentence samples. This process may be performed using a segmentation tool, such as LAC segmentation tool, and the application is not limited to the word segmentation process and the word segmentation process.
Word segmentation splits the sentence sample into a plurality of words; for example, "I want to watch a movie" is segmented into "I, want to watch, movie". Character segmentation splits the same sentence sample into individual characters; for example, "I want to watch a movie" is split into "I, want, see, electric, shadow".
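The two segmentations can be sketched as below. The word-level split is shown as a fixed list standing in for a segmentation tool's output (a real system would call a tool such as LAC); the character-level split is simply splitting the Chinese sentence into single characters.

```python
def char_segment(sentence):
    """Character segmentation: split a Chinese sentence into single characters."""
    return list(sentence)

# Word segmentation result for "我想看电影" ("I want to watch a movie"),
# standing in for the output of a segmentation tool.
words = ["我", "想看", "电影"]
chars = char_segment("我想看电影")  # character segmentation result
```

The word list feeds the word vector model in the next step, while the character list is looked up in a character vector library.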
The words of the sentence sample are then input into a word vector representation model, such as Google's BERT model; the application does not limit the specific word vector model used. The model outputs a vector representation of each word of the sentence sample. The vector representations of all the words of the sentence sample are then averaged to obtain the word vector mean representation of the sentence sample.
For example, inputting the word segmentation result "I, want to watch, movie" of the above embodiment into the word vector representation model gives the vector representation of each word: "I" is w1, "want to watch" is w2, and "movie" is w3. The word vector mean representation of the sentence sample is w = (w1 + w2 + w3)/3.
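The averaging step can be written directly; the toy 4-dimensional vectors below stand in for real model outputs (e.g. 300- or 768-dimensional BERT vectors).

```python
import numpy as np

def word_vector_mean(word_vectors):
    """Average per-word vectors into one sentence-level representation,
    i.e. w = (w1 + w2 + w3) / 3 in the example above."""
    return np.mean(np.stack(word_vectors), axis=0)

# Toy stand-ins for the word vectors w1, w2, w3 of "I, want to watch, movie".
w1, w2, w3 = np.ones(4), 2 * np.ones(4), 3 * np.ones(4)
w = word_vector_mean([w1, w2, w3])  # each component is (1 + 2 + 3) / 3
```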
A vector representation of each character in the sentence sample is then obtained from a character vector library. The vector representations of all the characters of the sentence sample are averaged to obtain the character vector mean representation of the sentence sample.
For example, for the character segmentation result "I, want, see, electric, shadow" of the above embodiment, the vector representation of each character is obtained from the character vector library: "I" is c1, "want" is c2, "see" is c3, "electric" is c4, and "shadow" is c5. The character vector mean representation of the sentence sample is C = (c1 + c2 + c3 + c4 + c5)/5.
Finally, the word vector mean representation and the character vector mean representation of the sentence sample are concatenated to obtain the sentence sample vector representation. Applying this vector representation step to the polyphone samples and the non-polyphone samples yields the corresponding polyphone sample vector representations and non-polyphone sample vector representations.
For example, the word vector mean representation w and the character vector mean representation C of the sentence sample "I want to watch a movie" are concatenated. If w is a 1 × 300 vector and C is a 1 × 100 vector, the final concatenated sentence sample vector representation is a 1 × 400 vector.
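The concatenation is a single operation; the 300 and 100 dimensions below are the ones assumed in the example above.

```python
import numpy as np

def sentence_vector(word_mean, char_mean):
    """Concatenate the word-level and character-level mean representations
    into the final sentence sample vector representation."""
    return np.concatenate([word_mean, char_mean])

# A 1 x 300 word vector mean and a 1 x 100 character vector mean
# concatenate into a 1 x 400 sentence vector.
vec = sentence_vector(np.zeros(300), np.zeros(100))
```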
Step S103 is a process of adding polyphone sample data, which specifically includes:
and S301, repeatedly sampling the polyphone sample vector characterization obtained in the S102.
Repeated sampling, i.e., oversampling, is a method of repeatedly sampling the minority-class sample data to balance the classes. In the repeated sampling process, the polyphone sample vector representations and the non-polyphone sample vector representations are first numbered. Then only the numbers of the polyphone sample vector representations are repeatedly sampled, until the ratio of polyphone sample vector representations to non-polyphone sample vector representations reaches a threshold T, where T may be set to 1:2.
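The numbering-and-resampling loop can be sketched as follows; `oversample_indices` is a hypothetical helper, and the counts are illustrative.

```python
import random

def oversample_indices(n_poly, n_non_poly, threshold=0.5, seed=0):
    """Repeatedly sample polyphone sample numbers (with replacement) until
    the polyphone : non-polyphone count reaches the threshold T = 1:2."""
    rng = random.Random(seed)
    indices = list(range(n_poly))              # original polyphone sample numbers
    while len(indices) < threshold * n_non_poly:
        indices.append(rng.randrange(n_poly))  # draw an existing number again
    return indices

# 10 polyphone vs 100 non-polyphone representations: sampling stops at 50,
# i.e. a 1:2 ratio.
idx = oversample_indices(n_poly=10, n_non_poly=100)
```

Only the numbers are duplicated here; the new synthetic vectors themselves are constructed in step S302.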
Step S302: during the repeated sampling process, at each sampling, a new polyphone sample vector representation is constructed from the sampled polyphone sample vector representation.
In some embodiments, the SMOTE algorithm may be used to construct the new polyphone sample vector representations; that is, the SMOTE algorithm randomly generates new vector data from the existing vector data.
Specifically, a nearest neighbor algorithm is first used to compute the K nearest neighbors of each minority-class sample (polyphone sample vector representation), as shown in the K-nearest-neighbor acquisition diagram of fig. 3. The dots in fig. 3 are the more numerous samples, representing the majority class, i.e., the non-polyphone sample vector representations. The five-pointed stars in fig. 3 are the less numerous samples, representing the minority class, i.e., the polyphone sample vector representations.
K nearest neighbors means that if the majority of the K most similar samples in the neighborhood of a sample in feature space belong to a certain class, the sample also belongs to that class. A sampling ratio is then set according to the imbalance ratio between the polyphone sample vector representations and the non-polyphone sample vector representations, to determine a sampling rate N. For each polyphone sample vector representation x_i, a number of samples are randomly selected from its K nearest neighbors; suppose a selected neighbor is x̂_i. From the randomly selected neighbor x̂_i and the polyphone sample vector representation x_i, a new sample, i.e., a new polyphone sample vector representation, is constructed according to the following formula (see the new sample construction schematic shown in fig. 4):

x_new = x_i + rand(0, 1) × (x̂_i − x_i)

where rand(0, 1) is a random number in the interval (0, 1).
Repeating the above new-sample construction process N times according to the sampling rate yields a plurality of new polyphone sample vector representations.
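A minimal SMOTE sketch of the interpolation formula above, assuming Euclidean distance for the neighbor search; `smote_new_samples` and its parameters are illustrative, not the patent's implementation (a production system might use imbalanced-learn's `SMOTE` instead).

```python
import numpy as np

def smote_new_samples(x_minority, k=5, n=1, seed=0):
    """For each minority vector x_i, pick one of its k nearest neighbours
    x_hat and interpolate x_new = x_i + rand(0, 1) * (x_hat - x_i),
    repeated n times per sample (the sampling rate N above)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x_minority, dtype=float)
    # Pairwise Euclidean distances; exclude self-matches via the diagonal.
    d = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]  # indices of k nearest neighbours
    new = []
    for i in range(len(x)):
        for _ in range(n):
            j = rng.choice(neighbours[i])      # random neighbour x_hat
            gap = rng.random()                 # rand(0, 1)
            new.append(x[i] + gap * (x[j] - x[i]))
    return np.array(new)

# 20 minority vectors of dimension 4, 2 synthetic samples each -> 40 new rows.
minority = np.random.default_rng(1).normal(size=(20, 4))
synthetic = smote_new_samples(minority, k=3, n=2)
```

Each synthetic vector lies on the line segment between a minority sample and one of its neighbours, so the new data stay inside the minority-class region.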
Step S104: the non-polyphone sample vector representations, the polyphone sample vector representations and the new polyphone sample vector representations obtained through the above steps are merged to obtain the constructed speech model training data set.
Compared with the original speech model training sample set, the speech model training data set obtained through steps S101 to S104 contains additional polyphone sample vector representations, which avoids an unbalanced distribution of polyphone and non-polyphone training samples. A speech model trained on this data set converts text into audio more accurately, improving the user experience.
In some embodiments, after the speech model training data set is constructed, all the data can be randomly shuffled and then input into a constructed deep learning model in mini-batches for training. The deep learning model may use bidirectional LSTM encoding, compute a loss function through a fully connected layer, update gradients through backpropagation, and finally obtain and save a trained model.
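The shuffle-then-batch preparation can be sketched as below; the model itself (bidirectional LSTM plus fully connected layer) is omitted, and `shuffled_batches` is a hypothetical helper.

```python
import numpy as np

def shuffled_batches(features, labels, batch_size=32, seed=0):
    """Randomly shuffle the merged data set, then yield mini-batches for
    training the deep learning model."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))     # random scramble of all data
    features, labels = features[order], labels[order]
    for start in range(0, len(features), batch_size):
        yield features[start:start + batch_size], labels[start:start + batch_size]

# 100 sentence vectors in batches of 32 -> batches of 32, 32, 32, 4.
X = np.arange(100, dtype=float).reshape(100, 1)
y = np.arange(100)
batches = list(shuffled_batches(X, y, batch_size=32))
```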
An embodiment of the present application provides a speech model training data set constructing apparatus, configured to execute the embodiment corresponding to fig. 1, and as shown in fig. 5, the speech model training data set constructing apparatus provided by the present application includes:
a speech model training sample set obtaining unit 201, configured to perform: acquiring a speech model training sample set, wherein the speech model training sample set comprises polyphone samples and non-polyphone samples, the polyphone samples are sentences at least containing one Chinese polyphone, the non-polyphone samples are sentences not containing Chinese polyphone, and the number of the non-polyphone samples is more than that of the polyphone samples;
a vector characterization unit 202 configured to perform: performing vector representation on the polyphone sample and the non-polyphone sample to obtain corresponding polyphone sample vector representation and non-polyphone sample vector representation;
a resampling unit 203 for performing: performing repeated sampling processing on the polyphone sample vector representation;
a new data generation unit 204 configured to perform: constructing a new polyphonic sample vector representation according to the repeatedly sampled polyphonic sample vector representation;
a data merging unit 205 configured to perform: and combining the polyphone sample vector characterization, the non-polyphone sample vector characterization and the new polyphone sample vector characterization to obtain a constructed speech model training data set.
In some embodiments, the vector characterization unit 202 is specifically configured to perform: word segmentation and character segmentation on the sentence sample;
inputting the word-segmented sentence sample into a word vector representation model to obtain a vector representation of each word in the sentence sample, and averaging the vector representations of the words to obtain the word vector mean representation of the sentence sample;
obtaining a vector representation of each character in the sentence sample from a character vector library, and averaging the vector representations of the characters to obtain the character vector mean representation of the sentence sample;
and concatenating the word vector mean representation and the character vector mean representation of the sentence sample to obtain a sentence sample vector representation, wherein the sentence sample vector representation is one of the polyphone sample vector representations or the non-polyphone sample vector representations.
In some embodiments, the new data generating unit 204 is specifically configured to perform: and constructing a new sample according to the repeatedly sampled polyphonic sample vector characterization by utilizing a SMOTE algorithm.
What has been described above includes examples of implementations of the invention. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the claimed subject matter, but it is to be appreciated that many further combinations and permutations of the subject innovation are possible. Accordingly, the claimed subject matter is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims. Moreover, the foregoing description of illustrated implementations of the present application, including what is described in the "abstract," is not intended to be exhaustive or to limit the disclosed implementations to the precise forms disclosed. While specific implementations and examples are described herein for illustrative purposes, various modifications are possible that are considered within the scope of such implementations and examples, as those skilled in the relevant art will recognize.
Moreover, the word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any aspect or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the word "exemplary" is intended to present concepts in a concrete fashion.

Claims (8)

1. A method for constructing a speech model training data set is characterized by comprising the following steps:
obtaining a speech model training sample set, wherein the speech model training sample set comprises polyphone samples and non-polyphone samples, the polyphone samples are sentences containing at least one Chinese polyphone, the non-polyphone samples are sentences containing no Chinese polyphones, the number of non-polyphone samples is greater than the number of polyphone samples, and the polyphone samples and the non-polyphone samples are sentence samples;
performing word segmentation and character segmentation on the sentence samples;
inputting the word-segmented sentence sample into a word vector representation model to obtain the vector representation of each word in the sentence sample, and averaging the vector representations of the words to obtain the word vector mean representation of the sentence sample;
obtaining the vector representation of each character in the sentence sample from a character vector library, and averaging the vector representations of the characters to obtain the character vector mean representation of the sentence sample;
concatenating the word vector mean representation and the character vector mean representation of the sentence sample to obtain a sentence sample vector representation, which serves as the polyphone sample vector representation or the non-polyphone sample vector representation;
performing repeated sampling processing on the polyphone sample vector characterization, and constructing a new polyphone sample vector characterization according to the polyphone sample vector characterization subjected to repeated sampling;
and combining the polyphone sample vector characterization, the non-polyphone sample vector characterization and the new polyphone sample vector characterization to obtain a constructed speech model training data set.
2. The method of constructing a speech model training data set according to claim 1, wherein prior to the resampling process of the polyphonic sample vector representations, the method further comprises:
numbering the polyphonic sample vector representations and the non-polyphonic sample vector representations;
and performing repeated sampling processing on the polyphone sample vector characterization, which specifically comprises the following steps:
and according to the serial numbers, performing repeated sampling processing on the polyphone sample vector representation.
3. The method of constructing a speech model training data set according to claim 1, wherein after the repeated sampling of the polyphone sample vector representations, the ratio of the polyphone sample vector representations to the non-polyphone sample vector representations in the sampling result is 1:2.
4. The method of constructing a speech model training data set according to claim 1, comprising: constructing a new sample from the repeatedly sampled polyphone sample vector representations using the SMOTE algorithm.
5. The method of constructing a speech model training data set according to claim 1, further comprising:
and inputting the randomly disturbed voice model training data set into a built deep learning model, and training the deep learning model.
6. An apparatus for constructing a speech model training data set, comprising:
a speech model training sample set acquisition unit configured to perform: acquiring a speech model training sample set, wherein the speech model training sample set comprises polyphone samples and non-polyphone samples, the polyphone samples are sentences containing at least one Chinese polyphone, the non-polyphone samples are sentences containing no Chinese polyphone, the number of the non-polyphone samples is greater than the number of the polyphone samples, and the polyphone samples and the non-polyphone samples are sentence samples;
a vector representation unit configured to perform: character segmentation processing and word segmentation processing on the sentence samples;
inputting the character-segmented sentence sample into a character vector representation model to obtain the vector representation of each character in the sentence sample, and averaging the vector representations of the characters to obtain the character vector mean representation of the sentence sample;
obtaining the vector representation of each word in the sentence sample from a word vector library, and averaging the vector representations of the words to obtain the word vector mean representation of the sentence sample;
splicing the character vector mean representation of the sentence sample and the word vector mean representation of the sentence sample to obtain a sentence sample vector representation, wherein the sentence sample vector representations comprise the polyphone sample vector representations and the non-polyphone sample vector representations;
a resampling unit configured to perform: repeated sampling processing on the polyphone sample vector representations;
a new data generation unit configured to perform: constructing new polyphone sample vector representations from the repeatedly sampled polyphone sample vector representations;
a data merging unit configured to perform: merging the polyphone sample vector representations, the non-polyphone sample vector representations and the new polyphone sample vector representations to obtain the constructed speech model training data set.
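The units of claim 6 can be pictured as small cooperating components. The sketch below is purely illustrative: the class and method names are ours, and plain lookup tables stand in for the character vector representation model and the word vector library.

```python
import numpy as np

class VectorRepresentationUnit:
    """Splices the character-mean and word-mean vectors of a sentence."""
    def __init__(self, char_table, word_table):
        self.char_table = char_table  # stand-in for the character vector model
        self.word_table = word_table  # stand-in for the word vector library

    def represent(self, chars, words):
        char_mean = np.mean([self.char_table[c] for c in chars], axis=0)
        word_mean = np.mean([self.word_table[w] for w in words], axis=0)
        return np.concatenate([char_mean, word_mean])

class ResamplingUnit:
    """Repeatedly samples minority vectors (with replacement) to a target size."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def resample(self, minority, target_size):
        idx = self.rng.integers(0, len(minority), size=target_size)
        return minority[idx]

class DataMergingUnit:
    """Stacks original and synthetic vectors into one training data set."""
    def merge(self, *groups):
        return np.vstack(groups)

# Usage with toy two-dimensional tables.
unit = VectorRepresentationUnit({"a": [1.0, 0.0]}, {"aa": [0.0, 2.0]})
vec = unit.represent(["a"], ["aa"])                       # 4-dimensional vector
resampled = ResamplingUnit().resample(np.array([vec]), 3)
merged = DataMergingUnit().merge(np.array([vec]), resampled)
```

Each class mirrors one claimed unit, so the apparatus claim's data flow (represent, resample, generate, merge) is visible directly in the call sequence.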
7. The apparatus for constructing a speech model training data set according to claim 6, wherein after the repeated sampling processing of the polyphone sample vector representations, the ratio of polyphone sample vector representations to non-polyphone sample vector representations in the sampling result reaches 1:1.
8. The apparatus for constructing a speech model training data set according to claim 6, wherein the new data generation unit is specifically configured to perform: constructing new samples from the repeatedly sampled polyphone sample vector representations by using the SMOTE algorithm.
CN202110697465.7A 2021-06-23 2021-06-23 Speech model training data set construction method and device Active CN113450779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110697465.7A CN113450779B (en) 2021-06-23 2021-06-23 Speech model training data set construction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110697465.7A CN113450779B (en) 2021-06-23 2021-06-23 Speech model training data set construction method and device

Publications (2)

Publication Number Publication Date
CN113450779A CN113450779A (en) 2021-09-28
CN113450779B true CN113450779B (en) 2022-11-11

Family

ID=77812312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110697465.7A Active CN113450779B (en) 2021-06-23 2021-06-23 Speech model training data set construction method and device

Country Status (1)

Country Link
CN (1) CN113450779B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798834A (en) * 2020-07-03 2020-10-20 北京字节跳动网络技术有限公司 Method and device for identifying polyphone, readable medium and electronic equipment
WO2021093449A1 (en) * 2019-11-14 2021-05-20 腾讯科技(深圳)有限公司 Wakeup word detection method and apparatus employing artificial intelligence, device, and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106973057B (en) * 2017-03-31 2018-12-14 浙江大学 A kind of classification method suitable for intrusion detection
CN111199153B (en) * 2018-10-31 2023-08-25 北京国双科技有限公司 Word vector generation method and related equipment
US11631029B2 (en) * 2019-09-09 2023-04-18 Adobe Inc. Generating combined feature embedding for minority class upsampling in training machine learning models with imbalanced samples
CN111581385B (en) * 2020-05-06 2024-04-02 西安交通大学 Unbalanced data sampling Chinese text category recognition system and method
CN112036515A (en) * 2020-11-04 2020-12-04 北京淇瑀信息科技有限公司 Oversampling method and device based on SMOTE algorithm and electronic equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021093449A1 (en) * 2019-11-14 2021-05-20 腾讯科技(深圳)有限公司 Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN111798834A (en) * 2020-07-03 2020-10-20 北京字节跳动网络技术有限公司 Method and device for identifying polyphone, readable medium and electronic equipment

Also Published As

Publication number Publication date
CN113450779A (en) 2021-09-28

Similar Documents

Publication Publication Date Title
US10740564B2 (en) Dialog generation method, apparatus, and device, and storage medium
CN110782882B (en) Voice recognition method and device, electronic equipment and storage medium
CN105976812B (en) A kind of audio recognition method and its equipment
US7788098B2 (en) Predicting tone pattern information for textual information used in telecommunication systems
CN107506823B (en) Construction method of hybrid neural network model for dialog generation
CN111223498A (en) Intelligent emotion recognition method and device and computer readable storage medium
CN105261358A (en) N-gram grammar model constructing method for voice identification and voice identification system
CN111435592B (en) Voice recognition method and device and terminal equipment
CN109933809B (en) Translation method and device, and training method and device of translation model
CN110543645A (en) Machine learning model training method, medium, device and computing equipment
CN112632288A (en) Power dispatching system and method based on knowledge graph
CN107102861B (en) A kind of method and system obtaining the vector of function in Open Source Code library
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN112861548A (en) Natural language generation and model training method, device, equipment and storage medium
CN115762489A (en) Data processing system and method of voice recognition model and voice recognition method
WO2024242633A1 (en) Text image generation method and diffusion generative model training method
CN114154518A (en) Data enhancement model training method and device, electronic equipment and storage medium
CN113450779B (en) Speech model training data set construction method and device
CN113268989A (en) Polyphone processing method and device
CN113096675A (en) Audio style unifying method based on generating type countermeasure network
CN117668187A (en) Image generation, automatic question answering and conditional control model training methods
CN111312267B (en) Voice style conversion method, device, equipment and storage medium
CN115512695A (en) Voice recognition method, device, equipment and storage medium
CN110245331A (en) A kind of sentence conversion method, device, server and computer storage medium
CN114997395A (en) Training method of text generation model, method for generating text and respective devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant