CN113450779B - Speech model training data set construction method and device - Google Patents


Info

Publication number
CN113450779B
CN113450779B · CN202110697465.7A
Authority
CN
China
Prior art keywords
sample
polyphone
vector
word
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110697465.7A
Other languages
Chinese (zh)
Other versions
CN113450779A (en)
Inventor
马明
刘宇
Current Assignee
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd
Priority to CN202110697465.7A
Publication of CN113450779A
Application granted
Publication of CN113450779B
Legal status: Active

Classifications

    • G: Physics
    • G10: Musical instruments; Acoustics
    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L13/00: Speech synthesis; text to speech systems
    • G10L13/06: Elementary speech units used in speech synthesisers; concatenation rules
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; stress or intonation
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L2015/088: Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

An embodiment of the application provides a method and a device for constructing a speech model training data set. The method comprises: after polyphone samples and non-polyphone samples are obtained, performing vector representation on the polyphone samples and the non-polyphone samples respectively; performing repeated sampling on the polyphone sample vector representations, and constructing new polyphone sample vector representations from the repeatedly sampled polyphone sample vector representations; and finally, merging the polyphone sample vector representations, the new polyphone sample vector representations and the non-polyphone sample vector representations to obtain the constructed speech model training data set. The method and device for constructing a speech model training data set can increase the number of polyphone sample vector representations in the training data set, avoid an unbalanced distribution of polyphone and non-polyphone training samples, improve the conversion accuracy of the trained speech model, and improve the user experience.

Description

Method and device for constructing a speech model training data set
Technical Field
The application relates to the technical field of voice interaction, and in particular to a method and a device for constructing a speech model training data set.
Background
With the development of artificial intelligence in the field of voice interaction, intelligent devices can convert text input by users into audio.
There are currently a large number of end-to-end text-to-audio speech models based on deep learning. After training the speech models with a given data set, the text to be converted is input into the trained speech models, and the corresponding audio can be obtained.
However, a core difficulty in the text-to-audio process is the pronunciation of polyphones. Because polyphone data make up only a small proportion of daily language use, the training samples used to train a speech model contain relatively few polyphone training samples, so the polyphone and non-polyphone training samples are distributed in an unbalanced manner. As a result, when a speech model trained on an existing training data set is used for text-to-audio conversion, polyphones are easily predicted as non-polyphones, the conversion accuracy is low, and the user experience is ultimately poor.
Disclosure of Invention
The application provides a method and a device for constructing a speech model training data set, to solve the problems that, when a speech model trained on an existing training data set converts text into audio, polyphones are easily predicted as non-polyphones, the conversion accuracy is low, and the user experience is ultimately poor.
In a first aspect, an embodiment of the present application provides a method for constructing a speech model training data set, where the method includes:
acquiring a speech model training sample set, wherein the speech model training sample set comprises polyphone samples and non-polyphone samples, the polyphone samples are sentences containing at least one Chinese polyphone, the non-polyphone samples are sentences containing no Chinese polyphones, and the number of non-polyphone samples is greater than the number of polyphone samples;
performing vector representation on the polyphone sample and the non-polyphone sample to obtain corresponding polyphone sample vector representation and non-polyphone sample vector representation;
performing repeated sampling processing on the polyphone sample vector characterization, and constructing a new sample according to the polyphone sample vector characterization subjected to repeated sampling to obtain a new polyphone sample vector characterization;
and combining the polyphone sample vector characterization, the non-polyphone sample vector characterization and the new polyphone sample vector characterization to obtain a constructed speech model training data set.
In a second aspect, an embodiment of the present application provides an apparatus for constructing a speech model training data set, where the apparatus includes:
a speech model training sample set obtaining unit for performing: acquiring a speech model training sample set, wherein the speech model training sample set comprises polyphone samples and non-polyphone samples, the polyphone samples are sentences containing at least one Chinese polyphone, the non-polyphone samples are sentences containing no Chinese polyphones, and the number of non-polyphone samples is greater than the number of polyphone samples;
a vector characterization unit to perform: performing vector characterization on the polyphone sample and the non-polyphone sample to obtain corresponding polyphone sample vector characterization and non-polyphone sample vector characterization;
a resampling unit to perform: performing repeated sampling processing on the polyphone sample vector representation;
a new data generation unit for performing: constructing a new sample according to the repeatedly sampled polyphone sample vector characterization to obtain a new polyphone sample vector characterization;
a data merging unit to perform: and combining the polyphone sample vector characterization, the non-polyphone sample vector characterization and the new polyphone sample vector characterization to obtain a constructed speech model training data set.
The technical scheme provided by the application has the following beneficial effects: after polyphone samples and non-polyphone samples are obtained, vector representation is performed on each respectively, yielding polyphone sample vector representations and non-polyphone sample vector representations. Repeated sampling is then performed on the polyphone sample vector representations, and new polyphone sample vector representations are constructed from the repeatedly sampled polyphone sample vector representations. Finally, the polyphone sample vector representations, the new polyphone sample vector representations and the non-polyphone sample vector representations are merged to obtain the constructed speech model training data set. The method and device for constructing a speech model training data set can increase the number of polyphone sample vector representations in the training data set, avoid an unbalanced distribution of polyphone and non-polyphone training samples, improve the conversion accuracy of the trained speech model, and improve the user experience.
Drawings
In order to describe the technical solution of the present application more clearly, the drawings required in the embodiments are briefly described below; those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic flow chart illustrating a method for constructing a speech model training data set according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating a sentence characterization method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a method for obtaining the K nearest neighbors of minority-class samples according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a new sample construction method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating an apparatus for constructing a training data set of a speech model according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Reference throughout this specification to "embodiments," "some embodiments," "one embodiment," or "an embodiment," etc., means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases "in various embodiments," "in some embodiments," "in at least one other embodiment," or "in an embodiment" or the like throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Thus, the particular features, structures, or characteristics shown or described in connection with one embodiment may be combined, in whole or in part, with the features, structures, or characteristics of one or more other embodiments, without limitation. Such modifications and variations are intended to be included within the scope of the present application.
With the development of artificial intelligence in the field of voice interaction, intelligent devices can convert text input by users into audio. There are currently a large number of end-to-end text-to-audio speech models based on deep learning. After training the speech models with a given data set, the text to be converted is input into the trained speech models, and the corresponding audio is obtained.
However, a core difficulty in the text-to-audio process is the pronunciation of polyphones. Because polyphone data make up only a small proportion of daily language use, the training samples used to train a speech model contain relatively few polyphone training samples, so the polyphone and non-polyphone training samples are distributed in an unbalanced manner. As a result, when a speech model trained on an existing training data set is used for text-to-audio conversion, polyphones are easily predicted as non-polyphones, the conversion accuracy is low, and the user experience is ultimately poor.
In order to solve the above problems, the present application provides a method for constructing a speech model training data set, which can increase the polyphone sample vector representation in the speech model training data set, avoid the situation that the polyphone training samples and the non-polyphone training samples are not distributed equally, further improve the conversion accuracy of the trained speech model, and improve the user experience.
The flow chart of the speech model training data set construction method shown in fig. 1 comprises the following steps:
and step S101, obtaining a speech model training sample set.
The speech model training sample set may be obtained from the Internet, and includes polyphone samples and non-polyphone samples. Polyphone samples are sentences that include at least one Chinese polyphone, and non-polyphone samples are sentences that include none. The Chinese polyphones included in the polyphone samples may be common Chinese polyphones obtained through statistics, such as "single", "folding" and "landing". Sentences including these common polyphones can then be searched for, to serve as polyphone samples.
Because polyphones are used less frequently in daily life than non-polyphones, the acquired speech model training sample set contains more non-polyphone samples than polyphone samples.
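The split into polyphone and non-polyphone samples described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the polyphone list and the example sentences are hypothetical.

```python
# Hypothetical subset of common Chinese polyphones obtained through statistics.
COMMON_POLYPHONES = {"长", "单", "折", "着", "了"}

def split_sample_set(sentences):
    """Partition sentences into polyphone samples (contain at least one
    common polyphone) and non-polyphone samples (contain none)."""
    polyphone, non_polyphone = [], []
    for s in sentences:
        if any(ch in COMMON_POLYPHONES for ch in s):
            polyphone.append(s)
        else:
            non_polyphone.append(s)
    return polyphone, non_polyphone

poly, non_poly = split_sample_set(["我想看电影", "这条路很长", "他是单身"])
```

In a realistic sample set the `non_poly` list would be considerably longer than `poly`, which is the imbalance the rest of the method addresses.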
Step S102: performing vector representation on the polyphone samples and non-polyphone samples obtained in step S101, to obtain corresponding polyphone sample vector representations and non-polyphone sample vector representations.
in some embodiments, both the non-polyphonic samples and the polyphonic samples are sentence samples, so the samples are vector characterized, in effect, sentences. As shown in fig. 2, the specific steps of performing vector characterization on the sentence samples may include:
firstly, word segmentation processing and word segmentation processing are carried out on sentence samples. This process may be performed using a segmentation tool, such as LAC segmentation tool, and the application is not limited to the word segmentation process and the word segmentation process.
Word segmentation splits the sentence sample into a plurality of words; for example, "I want to watch a movie" is segmented into "I, want to watch, movie". Character segmentation splits the same sentence sample into individual characters; for example, "I want to watch a movie" is split into "I, want, see, electric, shadow".
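The two segmentations can be sketched as below. The word-level split is shown as a fixed list standing in for a segmentation tool's output (a real system would call a tool such as LAC); the character-level split is simply splitting the Chinese sentence into single characters.

```python
def char_segment(sentence):
    """Character segmentation: split a Chinese sentence into single characters."""
    return list(sentence)

# Word segmentation result for "我想看电影" ("I want to watch a movie"),
# standing in for the output of a segmentation tool.
words = ["我", "想看", "电影"]
chars = char_segment("我想看电影")  # character segmentation result
```

The word list feeds the word vector model in the next step, while the character list is looked up in a character vector library.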
The words of the sentence sample are then input into a word vector representation model, such as Google's BERT model; the application does not limit the specific word vector model used. The model outputs a vector representation of each word of the sentence sample. The vector representations of all the words of the sentence sample are then averaged to obtain the word vector mean representation of the sentence sample.
For example, inputting the word segmentation result "I, want to watch, movie" of the above embodiment into the word vector representation model gives the vector representation of each word: "I" is w1, "want to watch" is w2, and "movie" is w3. The word vector mean representation of the sentence sample is w = (w1 + w2 + w3)/3.
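The averaging step can be written directly; the toy 4-dimensional vectors below stand in for real model outputs (e.g. 300- or 768-dimensional BERT vectors).

```python
import numpy as np

def word_vector_mean(word_vectors):
    """Average per-word vectors into one sentence-level representation,
    i.e. w = (w1 + w2 + w3) / 3 in the example above."""
    return np.mean(np.stack(word_vectors), axis=0)

# Toy stand-ins for the word vectors w1, w2, w3 of "I, want to watch, movie".
w1, w2, w3 = np.ones(4), 2 * np.ones(4), 3 * np.ones(4)
w = word_vector_mean([w1, w2, w3])  # each component is (1 + 2 + 3) / 3
```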
A vector representation of each character in the sentence sample is then obtained from a character vector library. The vector representations of all the characters of the sentence sample are averaged to obtain the character vector mean representation of the sentence sample.
For example, for the character segmentation result "I, want, see, electric, shadow" of the above embodiment, the vector representation of each character is obtained from the character vector library: "I" is c1, "want" is c2, "see" is c3, "electric" is c4, and "shadow" is c5. The character vector mean representation of the sentence sample is C = (c1 + c2 + c3 + c4 + c5)/5.
Finally, the word vector mean representation and the character vector mean representation of the sentence sample are concatenated to obtain the sentence sample vector representation. Applying this vector representation step to the polyphone samples and the non-polyphone samples yields the corresponding polyphone sample vector representations and non-polyphone sample vector representations.
For example, the word vector mean representation w and the character vector mean representation C of the sentence sample "I want to watch a movie" are concatenated. If w is a 1 × 300 vector and C is a 1 × 100 vector, the final concatenated sentence sample vector representation is a 1 × 400 vector.
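The concatenation is a single operation; the 300 and 100 dimensions below are the ones assumed in the example above.

```python
import numpy as np

def sentence_vector(word_mean, char_mean):
    """Concatenate the word-level and character-level mean representations
    into the final sentence sample vector representation."""
    return np.concatenate([word_mean, char_mean])

# A 1 x 300 word vector mean and a 1 x 100 character vector mean
# concatenate into a 1 x 400 sentence vector.
vec = sentence_vector(np.zeros(300), np.zeros(100))
```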
Step S103 is a process of adding polyphone sample data, which specifically includes:
and S301, repeatedly sampling the polyphone sample vector characterization obtained in the S102.
Repeated sampling, i.e., oversampling, is a method of repeatedly sampling the minority-class sample data to balance the classes. In the repeated sampling process, the polyphone sample vector representations and the non-polyphone sample vector representations are first numbered. Then only the numbers of the polyphone sample vector representations are repeatedly sampled, until the ratio of polyphone sample vector representations to non-polyphone sample vector representations reaches a threshold T, where T may be set to 1:2.
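The numbering-and-resampling loop can be sketched as follows; `oversample_indices` is a hypothetical helper, and the counts are illustrative.

```python
import random

def oversample_indices(n_poly, n_non_poly, threshold=0.5, seed=0):
    """Repeatedly sample polyphone sample numbers (with replacement) until
    the polyphone : non-polyphone count reaches the threshold T = 1:2."""
    rng = random.Random(seed)
    indices = list(range(n_poly))              # original polyphone sample numbers
    while len(indices) < threshold * n_non_poly:
        indices.append(rng.randrange(n_poly))  # draw an existing number again
    return indices

# 10 polyphone vs 100 non-polyphone representations: sampling stops at 50,
# i.e. a 1:2 ratio.
idx = oversample_indices(n_poly=10, n_non_poly=100)
```

Only the numbers are duplicated here; the new synthetic vectors themselves are constructed in step S302.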
Step S302: during the repeated sampling process, at each sampling, a new polyphone sample vector representation is constructed from the sampled polyphone sample vector representation.
In some embodiments, the SMOTE algorithm may be used to construct the new polyphone sample vector representations; that is, the SMOTE algorithm randomly generates new vector data from the existing vector data.
Specifically, a nearest neighbor algorithm is first used to compute the K nearest neighbors of each minority-class sample (polyphone sample vector representation), as shown in the K-nearest-neighbor acquisition diagram of fig. 3. The dots in fig. 3 are the more numerous samples, representing the majority class, i.e., the non-polyphone sample vector representations. The five-pointed stars in fig. 3 are the less numerous samples, representing the minority class, i.e., the polyphone sample vector representations.
K nearest neighbors means that if the majority of the K most similar samples in the neighborhood of a sample in feature space belong to a certain class, the sample also belongs to that class. A sampling ratio is then set according to the imbalance ratio between the polyphone sample vector representations and the non-polyphone sample vector representations, to determine a sampling rate N. For each polyphone sample vector representation x_i, a number of samples are randomly selected from its K nearest neighbors; suppose a selected neighbor is x̂_i. From the randomly selected neighbor x̂_i and the polyphone sample vector representation x_i, a new sample, i.e., a new polyphone sample vector representation, is constructed according to the following formula (see the new sample construction schematic shown in fig. 4):

x_new = x_i + rand(0, 1) × (x̂_i − x_i)

where rand(0, 1) is a random number in the interval (0, 1).
Repeating the above new-sample construction process N times according to the sampling rate yields a plurality of new polyphone sample vector representations.
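A minimal SMOTE sketch of the interpolation formula above, assuming Euclidean distance for the neighbor search; `smote_new_samples` and its parameters are illustrative, not the patent's implementation (a production system might use imbalanced-learn's `SMOTE` instead).

```python
import numpy as np

def smote_new_samples(x_minority, k=5, n=1, seed=0):
    """For each minority vector x_i, pick one of its k nearest neighbours
    x_hat and interpolate x_new = x_i + rand(0, 1) * (x_hat - x_i),
    repeated n times per sample (the sampling rate N above)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x_minority, dtype=float)
    # Pairwise Euclidean distances; exclude self-matches via the diagonal.
    d = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]  # indices of k nearest neighbours
    new = []
    for i in range(len(x)):
        for _ in range(n):
            j = rng.choice(neighbours[i])      # random neighbour x_hat
            gap = rng.random()                 # rand(0, 1)
            new.append(x[i] + gap * (x[j] - x[i]))
    return np.array(new)

# 20 minority vectors of dimension 4, 2 synthetic samples each -> 40 new rows.
minority = np.random.default_rng(1).normal(size=(20, 4))
synthetic = smote_new_samples(minority, k=3, n=2)
```

Each synthetic vector lies on the line segment between a minority sample and one of its neighbours, so the new data stay inside the minority-class region.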
Step S104: the non-polyphone sample vector representations, the polyphone sample vector representations and the new polyphone sample vector representations obtained through the above steps are merged to obtain the constructed speech model training data set.
Compared with the original speech model training sample set, the speech model training data set obtained through steps S101 to S104 contains additional polyphone sample vector representations, which avoids an unbalanced distribution of polyphone and non-polyphone training samples. A speech model trained on this data set converts text into audio more accurately, improving the user experience.
In some embodiments, after the speech model training data set is constructed, all the data can be randomly shuffled and then input into a constructed deep learning model in mini-batches for training. The deep learning model may use bidirectional LSTM encoding, compute a loss function through a fully connected layer, update gradients through backpropagation, and finally obtain and save a trained model.
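The shuffle-then-batch preparation can be sketched as below; the model itself (bidirectional LSTM plus fully connected layer) is omitted, and `shuffled_batches` is a hypothetical helper.

```python
import numpy as np

def shuffled_batches(features, labels, batch_size=32, seed=0):
    """Randomly shuffle the merged data set, then yield mini-batches for
    training the deep learning model."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))     # random scramble of all data
    features, labels = features[order], labels[order]
    for start in range(0, len(features), batch_size):
        yield features[start:start + batch_size], labels[start:start + batch_size]

# 100 sentence vectors in batches of 32 -> batches of 32, 32, 32, 4.
X = np.arange(100, dtype=float).reshape(100, 1)
y = np.arange(100)
batches = list(shuffled_batches(X, y, batch_size=32))
```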
An embodiment of the present application provides a speech model training data set constructing apparatus, configured to execute the embodiment corresponding to fig. 1, and as shown in fig. 5, the speech model training data set constructing apparatus provided by the present application includes:
a speech model training sample set obtaining unit 201, configured to perform: acquiring a speech model training sample set, wherein the speech model training sample set comprises polyphone samples and non-polyphone samples, the polyphone samples are sentences at least containing one Chinese polyphone, the non-polyphone samples are sentences not containing Chinese polyphone, and the number of the non-polyphone samples is more than that of the polyphone samples;
a vector characterization unit 202 configured to perform: performing vector representation on the polyphone sample and the non-polyphone sample to obtain corresponding polyphone sample vector representation and non-polyphone sample vector representation;
a resampling unit 203 for performing: performing repeated sampling processing on the polyphone sample vector representation;
a new data generation unit 204 configured to perform: constructing a new polyphonic sample vector representation according to the repeatedly sampled polyphonic sample vector representation;
a data merging unit 205 configured to perform: and combining the polyphone sample vector characterization, the non-polyphone sample vector characterization and the new polyphone sample vector characterization to obtain a constructed speech model training data set.
In some embodiments, the vector characterization unit 202 is specifically configured to perform: word segmentation and character segmentation on the sentence sample;
inputting the word-segmented sentence sample into a word vector representation model to obtain a vector representation of each word in the sentence sample, and averaging the vector representations of the words to obtain the word vector mean representation of the sentence sample;
obtaining a vector representation of each character in the sentence sample from a character vector library, and averaging the vector representations of the characters to obtain the character vector mean representation of the sentence sample;
and concatenating the word vector mean representation and the character vector mean representation of the sentence sample to obtain a sentence sample vector representation, wherein the sentence sample vector representation is one of the polyphone sample vector representations or the non-polyphone sample vector representations.
In some embodiments, the new data generating unit 204 is specifically configured to perform: and constructing a new sample according to the repeatedly sampled polyphonic sample vector characterization by utilizing a SMOTE algorithm.
What has been described above includes examples of implementations of the invention. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the claimed subject matter, but it is to be appreciated that many further combinations and permutations of the subject innovation are possible. Accordingly, the claimed subject matter is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims. Moreover, the foregoing description of illustrated implementations of the present application, including what is described in the "abstract," is not intended to be exhaustive or to limit the disclosed implementations to the precise forms disclosed. While specific implementations and examples are described herein for illustrative purposes, various modifications are possible that are considered within the scope of such implementations and examples, as those skilled in the relevant art will recognize.
Moreover, the word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any aspect or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the word "exemplary" is intended to present concepts in a concrete fashion.

Claims (8)

1. A method for constructing a speech model training data set is characterized by comprising the following steps:
obtaining a speech model training sample set, wherein the speech model training sample set comprises polyphone samples and non-polyphone samples, the polyphone samples are sentences containing at least one Chinese polyphone, the non-polyphone samples are sentences containing no Chinese polyphones, the number of non-polyphone samples is greater than the number of polyphone samples, and the polyphone samples and the non-polyphone samples are sentence samples;
performing word segmentation and character segmentation on the sentence samples;
inputting the word-segmented sentence sample into a word vector representation model to obtain the vector representation of each word in the sentence sample, and averaging the vector representations of the words to obtain the word vector mean representation of the sentence sample;
obtaining the vector representation of each character in the sentence sample from a character vector library, and averaging the vector representations of the characters to obtain the character vector mean representation of the sentence sample;
concatenating the word vector mean representation and the character vector mean representation of the sentence sample to obtain a sentence sample vector representation, which serves as the polyphone sample vector representation or the non-polyphone sample vector representation;
performing repeated sampling processing on the polyphone sample vector characterization, and constructing a new polyphone sample vector characterization according to the polyphone sample vector characterization subjected to repeated sampling;
and combining the polyphone sample vector characterization, the non-polyphone sample vector characterization and the new polyphone sample vector characterization to obtain a constructed speech model training data set.
2. The method of constructing a speech model training data set according to claim 1, wherein prior to the resampling process of the polyphonic sample vector representations, the method further comprises:
numbering the polyphonic sample vector representations and the non-polyphonic sample vector representations;
and performing repeated sampling processing on the polyphone sample vector characterization, which specifically comprises the following steps:
and according to the serial numbers, performing repeated sampling processing on the polyphone sample vector representation.
3. The method of constructing a speech model training data set according to claim 1, wherein after the repeated sampling of the polyphone sample vector representations, the ratio of the polyphone sample vector representations to the non-polyphone sample vector representations in the sampling result is 1:2.
4. The method of constructing a speech model training data set according to claim 1, comprising: constructing a new sample from the repeatedly sampled polyphone sample vector representations using the SMOTE algorithm.
5. The method of constructing a speech model training data set according to claim 1, further comprising:
and inputting the randomly disturbed voice model training data set into a built deep learning model, and training the deep learning model.
6. An apparatus for constructing a speech model training data set, comprising:
a speech model training sample set acquisition unit configured to perform: acquiring a speech model training sample set, wherein the speech model training sample set comprises polyphone samples and non-polyphone samples, the polyphone samples are sentences containing at least one Chinese polyphone, the non-polyphone samples are sentences containing no Chinese polyphone, the number of the non-polyphone samples is greater than the number of the polyphone samples, and the polyphone samples and the non-polyphone samples are sentence samples;
a vector representation unit configured to perform: character segmentation processing and word segmentation processing on the sentence samples;
inputting the character-segmented sentence sample into a character vector representation model to obtain the vector representation of each character in the sentence sample, and averaging the vector representations of the characters to obtain the character vector mean representation of the sentence sample;
obtaining the vector representation of each word in the sentence sample from a word vector library, and averaging the vector representations of the words to obtain the word vector mean representation of the sentence sample;
splicing the character vector mean representation of the sentence sample and the word vector mean representation of the sentence sample to obtain a sentence sample vector representation, wherein the sentence sample vector representations comprise the polyphone sample vector representations and the non-polyphone sample vector representations;
a resampling unit configured to perform: repeated sampling processing on the polyphone sample vector representations;
a new data generation unit configured to perform: constructing new polyphone sample vector representations from the repeatedly sampled polyphone sample vector representations;
a data merging unit configured to perform: merging the polyphone sample vector representations, the non-polyphone sample vector representations and the new polyphone sample vector representations to obtain the constructed speech model training data set.
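The units of claim 6 can be pictured as small cooperating components. The sketch below is purely illustrative: the class and method names are ours, and plain lookup tables stand in for the character vector representation model and the word vector library.

```python
import numpy as np

class VectorRepresentationUnit:
    """Splices the character-mean and word-mean vectors of a sentence."""
    def __init__(self, char_table, word_table):
        self.char_table = char_table  # stand-in for the character vector model
        self.word_table = word_table  # stand-in for the word vector library

    def represent(self, chars, words):
        char_mean = np.mean([self.char_table[c] for c in chars], axis=0)
        word_mean = np.mean([self.word_table[w] for w in words], axis=0)
        return np.concatenate([char_mean, word_mean])

class ResamplingUnit:
    """Repeatedly samples minority vectors (with replacement) to a target size."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def resample(self, minority, target_size):
        idx = self.rng.integers(0, len(minority), size=target_size)
        return minority[idx]

class DataMergingUnit:
    """Stacks original and synthetic vectors into one training data set."""
    def merge(self, *groups):
        return np.vstack(groups)

# Usage with toy two-dimensional tables.
unit = VectorRepresentationUnit({"a": [1.0, 0.0]}, {"aa": [0.0, 2.0]})
vec = unit.represent(["a"], ["aa"])                       # 4-dimensional vector
resampled = ResamplingUnit().resample(np.array([vec]), 3)
merged = DataMergingUnit().merge(np.array([vec]), resampled)
```

Each class mirrors one claimed unit, so the apparatus claim's data flow (represent, resample, generate, merge) is visible directly in the call sequence.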
7. The apparatus for constructing a speech model training data set according to claim 6, wherein after the repeated sampling processing of the polyphone sample vector representations, the ratio of polyphone sample vector representations to non-polyphone sample vector representations in the sampling result reaches 1:1.
8. The apparatus for constructing a speech model training data set according to claim 6, wherein the new data generation unit is specifically configured to perform: constructing new samples from the repeatedly sampled polyphone sample vector representations by using the SMOTE algorithm.
CN202110697465.7A 2021-06-23 2021-06-23 Speech model training data set construction method and device Active CN113450779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110697465.7A CN113450779B (en) 2021-06-23 2021-06-23 Speech model training data set construction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110697465.7A CN113450779B (en) 2021-06-23 2021-06-23 Speech model training data set construction method and device

Publications (2)

Publication Number Publication Date
CN113450779A CN113450779A (en) 2021-09-28
CN113450779B true CN113450779B (en) 2022-11-11

Family

ID=77812312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110697465.7A Active CN113450779B (en) 2021-06-23 2021-06-23 Speech model training data set construction method and device

Country Status (1)

Country Link
CN (1) CN113450779B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798834A (en) * 2020-07-03 2020-10-20 北京字节跳动网络技术有限公司 Method and device for identifying polyphone, readable medium and electronic equipment
WO2021093449A1 (en) * 2019-11-14 2021-05-20 腾讯科技(深圳)有限公司 Wakeup word detection method and apparatus employing artificial intelligence, device, and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106973057B (en) * 2017-03-31 2018-12-14 浙江大学 A kind of classification method suitable for intrusion detection
CN111199153B (en) * 2018-10-31 2023-08-25 北京国双科技有限公司 Word vector generation method and related equipment
US11631029B2 (en) * 2019-09-09 2023-04-18 Adobe Inc. Generating combined feature embedding for minority class upsampling in training machine learning models with imbalanced samples
CN111581385B (en) * 2020-05-06 2024-04-02 西安交通大学 Unbalanced data sampling Chinese text category recognition system and method
CN112036515A (en) * 2020-11-04 2020-12-04 北京淇瑀信息科技有限公司 Oversampling method and device based on SMOTE algorithm and electronic equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021093449A1 (en) * 2019-11-14 2021-05-20 腾讯科技(深圳)有限公司 Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN111798834A (en) * 2020-07-03 2020-10-20 北京字节跳动网络技术有限公司 Method and device for identifying polyphone, readable medium and electronic equipment

Also Published As

Publication number Publication date
CN113450779A (en) 2021-09-28

Similar Documents

Publication Publication Date Title
US10740564B2 (en) Dialog generation method, apparatus, and device, and storage medium
CN110782882B (en) Voice recognition method and device, electronic equipment and storage medium
CN105976812B (en) A kind of audio recognition method and its equipment
US7788098B2 (en) Predicting tone pattern information for textual information used in telecommunication systems
CN107506823B (en) Construction method of hybrid neural network model for dialog generation
CN111223498A (en) Intelligent emotion recognition method and device and computer readable storage medium
CN105261358A (en) N-gram grammar model constructing method for voice identification and voice identification system
CN111435592B (en) Voice recognition method and device and terminal equipment
CN109933809B (en) Translation method and device, and training method and device of translation model
CN110543645A (en) Machine learning model training method, medium, device and computing equipment
CN112632288A (en) Power dispatching system and method based on knowledge graph
CN107102861B (en) A kind of method and system obtaining the vector of function in Open Source Code library
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN112861548A (en) Natural language generation and model training method, device, equipment and storage medium
CN115762489A (en) Data processing system and method of voice recognition model and voice recognition method
WO2024242633A1 (en) Text image generation method and diffusion generative model training method
CN114154518A (en) Data enhancement model training method and device, electronic equipment and storage medium
CN113450779B (en) Speech model training data set construction method and device
CN113268989A (en) Polyphone processing method and device
CN113096675A (en) Audio style unifying method based on generating type countermeasure network
CN117668187A (en) Image generation, automatic question answering and conditional control model training methods
CN111312267B (en) Voice style conversion method, device, equipment and storage medium
CN115512695A (en) Voice recognition method, device, equipment and storage medium
CN110245331A (en) A kind of sentence conversion method, device, server and computer storage medium
CN114997395A (en) Training method of text generation model, method for generating text and respective devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant