CN112201227A - Voice sample generation method and device, storage medium and electronic device


Info

Publication number
CN112201227A
Authority
CN
China
Prior art keywords
sample
voice
voice sample
target
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011044992.XA
Other languages
Chinese (zh)
Other versions
CN112201227B (en)
Inventor
葛路奇 (Ge Luqi)
赵培 (Zhao Pei)
马路 (Ma Lu)
赵欣 (Zhao Xin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haier Uplus Intelligent Technology Beijing Co Ltd
Original Assignee
Haier Uplus Intelligent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haier Uplus Intelligent Technology Beijing Co Ltd filed Critical Haier Uplus Intelligent Technology Beijing Co Ltd
Priority to CN202011044992.XA priority Critical patent/CN112201227B/en
Priority claimed from CN202011044992.XA external-priority patent/CN112201227B/en
Publication of CN112201227A publication Critical patent/CN112201227A/en
Application granted granted Critical
Publication of CN112201227B publication Critical patent/CN112201227B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L12/2803 Home automation networks
    • H04L12/2816 Controlling appliance services of a home automation network by calling their functionalities
    • H04L12/282 Controlling appliance services of a home automation network by calling their functionalities based on user interaction within the home
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0638 Interactive procedures
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Automation & Control Theory (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide a voice sample generation method and apparatus, a storage medium, and an electronic apparatus. The voice sample generation method comprises: acquiring a first voice sample and a second voice sample, wherein the first voice sample is generated while a first object is not wearing a mask and the second voice sample is generated while the first object is wearing a mask; establishing a sample generation model according to the first voice sample and the second voice sample; and acquiring a third voice sample and generating a target voice sample according to the third voice sample and the sample generation model, wherein the third voice sample is a voice sample generated while a second object is not wearing a mask. Embodiments of the invention thereby solve the problem in the related art that voice samples produced while a user is wearing a mask cannot be obtained effectively, achieving efficient acquisition of such samples.

Description

Voice sample generation method and device, storage medium and electronic device
Technical Field
The invention relates to the field of Internet of things equipment, in particular to a voice sample generation method and device, a storage medium and an electronic device.
Background
With the development of smart homes, voice control has become an essential function of most smart-home devices. Implementing it requires two parts: a voice algorithm and the voice sample data set used to train it. Real use environments contain many complex scenes, and because voice samples from complex scenes differ from those recorded in a standard scene, such samples must be collected separately to improve recognition accuracy. In the prior art, voice samples for complex scenes are mostly collected scene by scene, which consumes considerable labor and time.
Nowadays, as wearing a mask has become habitual, some voice commands are issued while the user is wearing one. Because speech produced with a mask differs from speech produced without one, improving recognition accuracy for mask-wearing users has become a requirement for most smart homes. However, collecting, at scale, the voice samples produced by mask-wearing users that are needed to train a corresponding model is prohibitively expensive in labor and time, and in some cases such samples cannot be obtained effectively at all.
For the above problem in the related art, that voice samples produced while a user is wearing a mask cannot be obtained effectively, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the invention provides a voice sample generation method and device, a storage medium and an electronic device, and at least solves the problem that a voice sample generated under the condition that a user wears a mask cannot be effectively acquired in the related art.
According to an embodiment of the present invention, there is provided a speech sample generation method including:
acquiring a first voice sample and a second voice sample, wherein the first voice sample is generated under the condition that a first object does not wear a mask, and the second voice sample is generated under the condition that the first object wears the mask;
establishing a sample generation model according to the first voice sample and the second voice sample; wherein the sample generation model is used to indicate a relationship between the first voice sample and the second voice sample;
acquiring a third voice sample, and generating a target voice sample according to the third voice sample and the sample generation model; the third voice sample is a voice sample generated under the condition that a second object does not wear a mask, and the target voice sample is used for indicating a corresponding voice sample under the condition that the second object wears the mask.
In an optional embodiment, the establishing a sample generation model according to the first speech sample and the second speech sample further includes:
acquiring a first voice segment in the first voice sample and a second voice segment in the second voice sample; wherein the first voice segment is a valid segment in the first voice sample, and the second voice segment is a valid segment in the second voice sample;
and performing time domain alignment processing on the first voice segment and the second voice segment, and establishing the sample generation model according to the first voice segment and the second voice segment.
In an optional embodiment, the establishing the sample generation model according to the first voice segment and the second voice segment includes:
converting the first voice segment and the second voice segment from a time domain to a frequency domain, and acquiring a first frequency domain value corresponding to each frame in the first voice segment and a second frequency domain value corresponding to each frame in the second voice segment;
determining a transfer coefficient corresponding to each frame according to the first frequency-domain value and the second frequency-domain value, wherein the transfer coefficient is used for indicating a relationship between the first frequency-domain value and the corresponding second frequency-domain value;
and establishing the sample generation model according to the transfer coefficient.
In an optional embodiment, the establishing the sample generation model according to the transfer coefficient includes:
clustering a plurality of transfer coefficients corresponding to the plurality of frames to determine a sample generation coefficient; wherein the sample generation coefficient is used for indicating the transfer coefficient corresponding to a center point obtained by clustering the plurality of transfer coefficients.
In an optional embodiment, the generating a target speech sample according to the third speech sample and the sample generation model includes:
converting the third voice sample from a time domain to a frequency domain to obtain a third frequency domain value corresponding to each frame in the third voice sample;
obtaining a target frequency-domain value corresponding to each frame according to the third frequency-domain value and the sample generation coefficient; wherein the target frequency-domain value is used for indicating a frequency-domain value corresponding to each frame after the target voice sample is converted from the time domain to the frequency domain;
and converting the target frequency-domain value corresponding to each frame into a time domain to obtain the target voice sample.
In an optional embodiment, converting the target frequency-domain value corresponding to each frame into the time domain to obtain the target speech sample includes:
converting the target frequency domain value corresponding to the first frame into first target time domain information, and converting the target frequency domain value corresponding to the second frame into second target time domain information; wherein the first frame and the second frame are adjacent frames;
and overlapping at least part of the first target time domain information with at least part of the second target time domain information to obtain the target voice sample.
According to another embodiment of the present invention, there is also provided a speech sample generation apparatus including:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a first voice sample and a second voice sample, the first voice sample is generated under the condition that a first object does not wear a mask, and the second voice sample is generated under the condition that the first object wears the mask;
the establishing module is used for establishing a sample generation model according to the first voice sample and the second voice sample; wherein the sample generation model is used to indicate a relationship between the first voice sample and the second voice sample;
the generating module is used for acquiring a third voice sample and generating a target voice sample according to the third voice sample and the sample generating model; the third voice sample is a voice sample generated under the condition that a second object does not wear a mask, and the target voice sample is used for indicating a corresponding voice sample under the condition that the second object wears the mask.
According to another embodiment of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above-described method embodiments when executed.
According to another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to embodiments of the invention, on the basis of obtaining the first voice sample and the second voice sample, generated by the same first object without and with a mask respectively, a sample generation model indicating the relationship between the two is established; a third voice sample, generated by a second object without a mask, is then converted through the sample generation model to generate the corresponding target voice sample, which indicates the voice of the second object wearing a mask.
Embodiments of the invention therefore do not need to collect, one by one, voice samples produced while the second object is wearing a mask; instead, samples recorded without a mask can be converted, via the pre-established sample generation model, into the corresponding mask-wearing samples. This solves the problem in the related art that voice samples produced while a user is wearing a mask cannot be obtained effectively, and achieves efficient acquisition of such samples.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of a method for generating a speech sample according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a voice sample generation method provided according to an exemplary embodiment of the present invention;
fig. 3 is a block diagram of a speech sample generation apparatus according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
To further explain the voice sample generation method and apparatus, the storage medium, and the electronic apparatus in the embodiments of the present invention, their application scenarios are described below:
in one aspect, an embodiment of the present invention provides a method for generating a speech sample, fig. 1 is a flowchart of the method for generating a speech sample according to the embodiment of the present invention, and as shown in fig. 1, the method for generating a speech sample according to the embodiment of the present invention includes:
s102, a first voice sample and a second voice sample are obtained, wherein the first voice sample is generated under the condition that the first object does not wear the mask, and the second voice sample is generated under the condition that the first object wears the mask.
In an embodiment of the present invention, there are a plurality of first voice samples and a plurality of second voice samples, in one-to-one correspondence. There may be one or more first objects; that is, the first and second voice samples may be multiple groups of voice samples generated by the same first object, or multiple groups generated by different first objects, which is not limited in the embodiments of the present invention. Each first voice sample and its corresponding second voice sample are generated by the same first object without and with a mask respectively, and generally their voice contents are the same.
To ensure consistency between the first voice sample and the second voice sample beyond having the same content, in an optional embodiment the audio similarity between them is smaller than a preset threshold. It should be noted that this audio similarity measures differences in pronunciation, for example the difference values of parameters such as speech rate, intonation, and loudness; that is, the second voice sample should have the same audio content as the first voice sample and be as close as possible, or identical, in pronunciation. In one example, user A, as the first object, may record the audio "turn on the speaker to play song N" without wearing a mask as the first voice sample, and then re-record "turn on the speaker to play song N" while wearing a mask, with the same speech rate, intonation, loudness, and so on, as the second voice sample.
S104, establishing a sample generation model according to the first voice sample and the second voice sample; wherein the sample generation model is used to indicate a relationship between the first speech sample and the second speech sample.
In the embodiment of the present invention, once the first and second voice samples are obtained, a sample generation model indicating the relationship between them can be established. Since the first and second voice samples correspond to the same audio content produced without and with a mask respectively, the sample generation model can indicate the relationship between voice samples of the same audio content in those two situations. In one example, the sample generation model may be characterized by a functional relationship, such as:
Y=K×X。
x is used to represent a voice sample frequency domain value generated by a next audio content when the mask is not worn, Y is used to represent a voice sample frequency domain value generated by the same audio content when the mask is worn, and K is used to represent a coefficient between Y and X.
The process of the above-described sample generation model creation is described below by an alternative embodiment:
in an optional embodiment, in the step S104, establishing a sample generation model according to the first speech sample and the second speech sample further includes:
acquiring a first voice segment in a first voice sample and a second voice segment in a second voice sample; the first voice segment is an effective segment in the first voice sample, and the second voice segment is an effective segment in the second voice sample;
and performing time domain alignment processing on the first voice segment and the second voice segment, and establishing a sample generation model according to the first voice segment and the second voice segment.
In the above alternative embodiment, the first and second voice samples may contain silent portions at the beginning or end of the audio, or overly long pauses in the middle; these portions are invalid. Correspondingly, the first voice segment and the second voice segment are the valid portions remaining after the invalid portions are removed from the first and second voice samples. Extracting the first and second voice segments thus avoids unnecessary signal processing and the deviations it could introduce.
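The embodiments do not prescribe a particular valid-segment detector. As a minimal sketch of the idea, the following Python function keeps frames whose short-time energy clears a relative threshold; the function name, frame length, and threshold are illustrative assumptions rather than the patented procedure:

```python
import numpy as np

def extract_valid_segment(signal, sr, frame_ms=25, energy_ratio=0.02):
    """Keep only frames whose short-time energy exceeds a threshold,
    trimming silent leading/trailing audio and overly long pauses."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    voiced = energy > energy_ratio * energy.max()  # assumed relative threshold
    return frames[voiced].reshape(-1)
```

In practice a dedicated voice activity detector could be substituted; the point is only that both samples are reduced to their valid segments before alignment.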
On the other hand, the first voice segment and the second voice segment are aligned in the time domain, so that their audio contents correspond at every moment; as a result, after both segments are converted from the time domain to the frequency domain, their frames correspond one to one. In an optional embodiment, establishing the sample generation model according to the first voice segment and the second voice segment includes:
converting the first voice segment and the second voice segment from the time domain to the frequency domain, and acquiring a first frequency-domain value corresponding to each frame in the first voice segment and a second frequency-domain value corresponding to each frame in the second voice segment;
determining a transfer coefficient corresponding to each frame according to the first frequency-domain value and the second frequency-domain value, wherein the transfer coefficient is used for indicating the relationship between the first frequency-domain value and the corresponding second frequency-domain value;
and establishing a sample generation model according to the transfer coefficient.
It should be noted that a frequency-domain value comprises an amplitude value and a phase value. Taking the first voice segment as an example, after time-frequency conversion via the Fourier transform, the amplitude and phase values at each frequency of a frame form the first frequency-domain value of that frame. The first frequency-domain value is a complex number, and its amplitude and phase can be obtained through Euler's formula. Similarly, after time-frequency conversion of the second voice segment, the amplitude and phase values at each frequency of a frame form the second frequency-domain value of that frame, which is likewise a complex number.
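A sketch of this per-frame time-frequency conversion follows; the 512-sample Hann window and 50% hop are assumptions, as the embodiments specify only a per-frame Fourier transform yielding complex amplitude-and-phase values:

```python
import numpy as np

def stft_frames(signal, frame_len=512, hop=256):
    """Frame the signal with 50% overlap and return the complex
    frequency-domain value of each frame (amplitude = np.abs,
    phase = np.angle)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    spectra = np.empty((n_frames, frame_len // 2 + 1), dtype=complex)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        spectra[i] = np.fft.rfft(frame)  # complex value per frequency bin
    return spectra
```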
Under the condition of determining the first frequency domain value and the second frequency domain value corresponding to each frame, a group of transfer coefficients corresponding to the frame can be determined according to the relationship between the first frequency domain value and the second frequency domain value corresponding to the frame. In an example, the transfer coefficient is used to indicate a proportional relationship between the first frequency-domain value and the second frequency-domain value.
It should be noted that each frame corresponds to a first frequency-domain value and a second frequency-domain value, so each frame yields a group of transfer coefficients, and multiple frames yield multiple groups. In an alternative embodiment, establishing the sample generation model according to the transfer coefficients comprises:
clustering the amplitude values (and, respectively, the phase values) of the multiple groups of transfer coefficients corresponding to the multiple frames to determine a group of sample generation coefficients; wherein the sample generation coefficients indicate the group of transfer coefficients corresponding to the center point obtained by clustering the corresponding values of the plurality of groups of transfer coefficients.
In the above process, each frame of data, once converted to the frequency domain, yields a set of values, and its transfer coefficients likewise form a set, e.g. P1 = {p1_1, p1_2, p1_3, ..., p1_n}. Multiple frames yield multiple such sets {P1, P2, P3, ..., Pm}. For each position n, the values An = (p1_n, p2_n, p3_n, ..., pm_n) taken across the sets are clustered into one or more classes, and the center point An of the most concentrated class is selected, giving the final transfer coefficients P = {A1, A2, A3, ..., An}.
In the above optional embodiment, clustering the plurality of transfer coefficients corresponding to the plurality of frames amounts to determining their distribution; specifically, the center point obtained by clustering, i.e. the transfer coefficient at the most concentrated point of the distribution, may be selected as the sample generation coefficient. In an example, the clustering may be performed with the K-means algorithm; the clustering dimension may be amplitude, or amplitude and phase may be clustered separately and then recombined.
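The following sketch estimates the transfer coefficients and condenses them by clustering, in line with the description above. The per-bin K-means over magnitudes (three clusters, center of the densest cluster kept, mean phase reattached) is one illustrative reading of the clustering step, not the only one the embodiments allow:

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_generation_coefficients(spec_nomask, spec_mask, n_clusters=3):
    """Per frame k and bin b: K[k, b] = Y[k, b] / X[k, b]. For each bin,
    cluster the magnitudes across frames and keep the center of the
    largest cluster, recombined with the mean phase (a simplification
    of clustering amplitude and phase separately and recombining)."""
    ratios = spec_mask / (spec_nomask + 1e-12)   # complex transfer coefficients
    coeffs = np.empty(ratios.shape[1], dtype=complex)
    for b in range(ratios.shape[1]):             # one coefficient per bin
        mags = np.abs(ratios[:, b]).reshape(-1, 1)
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(mags)
        densest = np.bincount(km.labels_).argmax()
        center_mag = km.cluster_centers_[densest, 0]
        mean_phase = np.angle(ratios[:, b]).mean()
        coeffs[b] = center_mag * np.exp(1j * mean_phase)
    return coeffs
```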
In this way, the most densely distributed transfer coefficients are selected as the sample generation coefficients used to establish the sample generation model; the model thus indicates the relationship between the frequency-domain values of voice signals of the same audio content produced without and with a mask, after conversion to the frequency domain.
It should be noted that, when there are a plurality of first voice samples and second voice samples, after the transfer coefficients of the frames in each sample pair are determined, the transfer coefficients across all pairs may be clustered together, further improving the accuracy of the sample generation model.
It is to be understood that a person skilled in the art may also establish a sample generation model according to the first speech sample and the second speech sample by other ways, for example, perform fitting according to the first frequency-domain value and the second frequency-domain value, and the like, which is not limited by the embodiment of the present invention.
S106, obtaining a third voice sample, and generating a target voice sample according to the third voice sample and the sample generation model; the third voice sample is a voice sample generated under the condition that the second object does not wear the mask, and the target voice sample is used for indicating the corresponding voice sample under the condition that the second object wears the mask.
In the embodiment of the present invention, the third voice sample is generated by a second object, which may be the same as or different from the first object. There may be one or more third voice samples, and one or more second objects; that is, the third voice samples may be multiple samples generated by the same second object, or samples generated by several different second objects. The third voice sample is generated while the second object is not wearing a mask; the target voice sample is the sample generated by the embodiment of the invention and simulates the voice sample the second object would produce while wearing a mask. The following illustrates, through an alternative embodiment, the process of generating a target voice sample:
in an alternative embodiment, the step S106 of generating the target speech sample according to the third speech sample and the sample generation model includes:
converting the third voice sample from the time domain to the frequency domain to obtain a third frequency domain value corresponding to each frame in the third voice sample;
obtaining a target frequency-domain value corresponding to each frame according to the third frequency-domain value and the sample generation coefficient; wherein the target frequency-domain value is used for indicating a frequency-domain value corresponding to each frame after the target voice sample is converted from the time domain to the frequency domain;
and converting the target frequency-domain value corresponding to each frame into a time domain to obtain a target voice sample.
In the above optional embodiment, after time-frequency conversion of the third voice sample via the Fourier transform, the amplitude and phase values at each frequency of a frame form the third frequency-domain value of that frame, which is likewise a complex number. Since the sample generation model indicates the relationship between the frequency-domain values of voice signals of the same audio content produced without and with a mask, once the third frequency-domain value of each frame is determined, each value can be fed into the sample generation model in turn to generate the corresponding target frequency-domain value, i.e. the frequency-domain value, frame by frame, of the voice sample that the audio content of the third voice sample would yield if spoken through a mask. The target voice sample is then obtained by converting the target frequency-domain values of the frames back to the time domain. In one example, user B, as the second object, may record the audio "turn on the air conditioner to 24 ℃" without wearing a mask as the third voice sample; after conversion by the above method, the resulting target voice sample is the audio "turn on the air conditioner to 24 ℃" simulated as if user B were wearing a mask.
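A minimal sketch of this generation step follows; the framing parameters match the earlier sketch and remain assumptions:

```python
import numpy as np

def generate_target_frames(third_signal, coeffs, frame_len=512, hop=256):
    """Convert the no-mask recording to the frequency domain, multiply
    each frame by the sample generation coefficients (Y = K * X), and
    return per-frame time-domain segments for overlap-add splicing."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(third_signal) - frame_len) // hop
    frames = np.stack([third_signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)   # third frequency-domain values
    return np.fft.irfft(spectra * coeffs, n=frame_len, axis=1)
```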
It should be noted that, in general, the number of third voice samples is much larger than the number of first and/or second voice samples, so once the sample generation model has been obtained from a small number of first and second voice samples, a large number of third voice samples can be converted into a large number of target voice samples. The third voice samples may be newly recorded, or may be taken directly from an existing voice sample database.
According to the embodiments of the invention, on the basis of obtaining the first and second voice samples, generated by the same first object without and with a mask respectively, a sample generation model indicating the relationship between them is established, and a third voice sample generated by a second object without a mask is then converted through the model into the corresponding target voice sample indicating that object's voice with a mask.
Embodiments of the invention therefore need not collect mask-wearing voice samples one by one; samples recorded without a mask are converted, via the pre-established sample generation model, into the corresponding mask-wearing samples. This solves the problem in the related art that voice samples produced while a user is wearing a mask cannot be obtained effectively, and achieves efficient acquisition of such samples.
Meanwhile, the large number of mask-wearing target voice samples obtained by this generation method can be used to train a voice model, remedying the shortage of mask-wearing data in complex acoustic scenes, so that the voice model becomes suitable for voice recognition when the user wears a mask, significantly improving user experience.
In an optional embodiment, the converting the target frequency-domain value corresponding to each frame into the time domain to obtain the target speech sample includes:
converting a target frequency domain value corresponding to the first frame into first target time domain information, and converting a target frequency domain value corresponding to the second frame into second target time domain information; the first frame and the second frame are adjacent frames;
and overlapping at least part of the first target time domain information with at least part of the second target time domain information to obtain a target voice sample.
In the above optional embodiment, because the data of two adjacent frames overlap when the voice signal is framed, the overlapping portions of the adjacent first and second frames can be superimposed and spliced during the conversion of the target frequency-domain values back to the time domain, avoiding repeated speech in the target voice sample. In an example, if the second frame follows the first frame, the second half of the first target time-domain information corresponding to the first frame may be spliced with the first half of the second target time-domain information corresponding to the second frame; this superposition is an overlap-add process.
It should be noted that the first frame and the second frame may be any two adjacent frames; that is, with the technical solution of this optional embodiment, the target time-domain information converted from the target frequency-domain values of every pair of adjacent frames is spliced in this way.
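A sketch of the overlap-add splicing just described, assuming the 50% frame overlap used in the earlier sketches:

```python
import numpy as np

def overlap_add(frames, hop=256):
    """Splice per-frame time-domain segments: the second half of each
    frame overlaps the first half of the next, and overlapping regions
    are summed rather than concatenated, avoiding repeated speech."""
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += frame
    return out
```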
To further describe the speech sample generation method in the embodiment of the present invention, the following is explained by an exemplary embodiment:
fig. 2 is a schematic flowchart of a speech sample generation method according to an exemplary embodiment of the present invention, and as shown in fig. 2, the workflow of the speech sample generation method in the exemplary embodiment is as follows:
s201, recording voice samples of the same speaker without wearing a mask and under wearing the mask as a first voice sample and a second voice sample respectively, and requiring two times of recorded audios, namely the similarity of corresponding vocalization between the first voice sample and the second voice sample is close, wherein the similarity comprises speed, tone, duration and the like; valid sound segments of the two recorded audios are retained and aligned.
S202, perform time-frequency conversion on the first and second voice samples to obtain their frequency-domain representations; for each frame, the ratio of the frequency-domain value (including amplitude and phase) of the second voice sample to that of the first voice sample is taken as the group of transfer coefficients of that frame. These coefficients are complex numbers.
S203, compute the amplitude spectrum of the coefficients of each frame and cluster the per-frame coefficients with the K-means algorithm, where the clustering dimensions may be amplitude and phase; after clustering, the center value of the coefficients is obtained and used as the transfer coefficient of the relationship model between the first and second voice samples.
S204, acquire a new recording made without a mask as the third voice sample, perform time-frequency conversion on it, and multiply the frequency-domain value of each of its frames by the transfer coefficient finally determined in step S203, obtaining the frequency-domain value of each frame of the target voice sample to be generated.
S205, convert the frequency-domain value of each frame of the target voice sample back to the time domain to generate the target voice sample, i.e. the audio corresponding to the third voice sample as if produced while wearing a mask.
In the above process of generating the target voice sample, since the data of adjacent frames overlap when the signal is framed, the target voice sample is obtained by superimposing the first half of the later frame on the second half of the earlier frame in each pair of adjacent frames.
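For orientation only, the S201-S205 workflow can be sketched end to end with SciPy's STFT/ISTFT, which handle framing, windowing, and overlap-add internally. The median-magnitude condensation stands in for the K-means clustering of S203 purely to keep the sketch short, and all parameters are assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

def build_and_apply(first, second, third, sr, nperseg=512):
    """S201-S203: estimate transfer coefficients from the aligned
    no-mask/mask pair; S204-S205: apply them to a new no-mask
    recording and return the simulated mask-wearing audio."""
    _, _, X = stft(first, fs=sr, nperseg=nperseg)    # no mask (bins x frames)
    _, _, Y = stft(second, fs=sr, nperseg=nperseg)   # mask
    n = min(X.shape[1], Y.shape[1])
    ratios = Y[:, :n] / (X[:, :n] + 1e-12)
    coeffs = (np.median(np.abs(ratios), axis=1)
              * np.exp(1j * np.angle(ratios).mean(axis=1)))
    _, _, Z = stft(third, fs=sr, nperseg=nperseg)    # third voice sample
    _, target = istft(Z * coeffs[:, None], fs=sr, nperseg=nperseg)
    return target
```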
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
On the other hand, an embodiment of the present invention further provides a voice sample generation device, used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
Fig. 3 is a block diagram of a structure of a speech sample generation apparatus according to an embodiment of the present invention, and as shown in fig. 3, the speech sample generation apparatus according to the embodiment of the present invention includes:
an obtaining module 302, configured to obtain a first voice sample and a second voice sample, where the first voice sample is a voice sample generated when the first object does not wear a mask, and the second voice sample is a voice sample generated when the first object wears the mask;
an establishing module 304, configured to establish a sample generation model according to the first voice sample and the second voice sample; wherein the sample generation model is used to indicate a relationship between the first speech sample and the second speech sample;
a generating module 306, configured to obtain a third voice sample, and generate a target voice sample according to the third voice sample and the sample generating model; the third voice sample is a voice sample generated under the condition that the second object does not wear the mask, and the target voice sample is used for indicating the corresponding voice sample under the condition that the second object wears the mask.
It should be noted that, the remaining optional embodiments and technical effects of the speech sample generation apparatus in the embodiment of the present invention are all corresponding to the aforementioned speech sample generation method, and therefore, no further description is provided herein.
In an alternative embodiment, the audio similarity between the first speech sample and the second speech sample is less than a predetermined threshold.
In an alternative embodiment, the establishing the sample generation model according to the first speech sample and the second speech sample further includes:
acquiring a first voice segment in a first voice sample and a second voice segment in a second voice sample; the first voice segment is an effective segment in the first voice sample, and the second voice segment is an effective segment in the second voice sample;
and performing time domain alignment processing on the first voice segment and the second voice segment, and establishing a sample generation model according to the first voice segment and the second voice segment.
In an alternative embodiment, building a sample generation model from a first speech segment and a second speech segment includes:
converting the first voice segment and the second voice segment from the time domain to the frequency domain, and acquiring a first frequency-domain value corresponding to each frame in the first voice segment and a second frequency-domain value corresponding to each frame in the second voice segment;
determining a transfer coefficient corresponding to each frame according to the first frequency-domain value and the second frequency-domain value, wherein the transfer coefficient is used for indicating the relationship between the first frequency-domain value and the corresponding second frequency-domain value;
and establishing a sample generation model according to the transfer coefficient.
In an alternative embodiment, the establishing of the sample generation model according to the transfer coefficients comprises:
clustering a plurality of transfer coefficients corresponding to the plurality of frames to determine a sample generation coefficient; wherein the sample generation coefficient is used for indicating the transfer coefficient corresponding to the center point obtained by clustering the plurality of transfer coefficients.
In an alternative embodiment, generating the target speech sample from the third speech sample and the sample generation model includes:
converting the third voice sample from the time domain to the frequency domain to obtain a third frequency domain value corresponding to each frame in the third voice sample;
obtaining a target frequency-domain value corresponding to each frame according to the third frequency-domain value and the sample generation coefficient; wherein the target frequency-domain value is used for indicating a frequency-domain value corresponding to each frame after the target voice sample is converted from the time domain to the frequency domain;
and converting the target frequency-domain value corresponding to each frame into a time domain to obtain a target voice sample.
In an alternative embodiment, converting the target frequency-domain value corresponding to each frame into the time domain to obtain the target speech sample includes:
converting a target frequency domain value corresponding to the first frame into first target time domain information, and converting a target frequency domain value corresponding to the second frame into second target time domain information; the first frame and the second frame are adjacent frames;
and overlapping at least part of the first target time domain information with at least part of the second target time domain information to obtain a target voice sample.
In another aspect, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program is configured to, when executed, perform the steps in any of the above method embodiments.
Alternatively, in an embodiment of the present invention, the computer-readable storage medium may be configured to store a computer program for executing the above-described embodiment.
Optionally, in an embodiment of the present invention, the computer-readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
In another aspect, an embodiment of the present invention further provides an electronic apparatus, which includes a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the computer program to perform the steps in any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in this embodiment, the processor may be configured to execute the steps in the above embodiments through a computer program.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for generating a speech sample, comprising:
acquiring a first voice sample and a second voice sample, wherein the first voice sample is generated under the condition that a first object does not wear a mask, and the second voice sample is generated under the condition that the first object wears the mask;
establishing a sample generation model according to the first voice sample and the second voice sample; wherein the sample generation model is used to indicate a relationship between the first voice sample and the second voice sample;
acquiring a third voice sample, and generating a target voice sample according to the third voice sample and the sample generation model; the third voice sample is a voice sample generated under the condition that a second object does not wear a mask, and the target voice sample is used for indicating a corresponding voice sample under the condition that the second object wears the mask.
2. The method of claim 1, wherein the audio similarity between the first speech sample and the second speech sample is less than a preset threshold.
3. The method of claim 1 or 2, wherein the establishing a sample generation model according to the first voice sample and the second voice sample further comprises:
acquiring a first voice segment in the first voice sample and a second voice segment in the second voice sample; wherein the first voice segment is a valid segment in the first voice sample, and the second voice segment is a valid segment in the second voice sample;
and performing time domain alignment processing on the first voice segment and the second voice segment, and establishing the sample generation model according to the first voice segment and the second voice segment.
4. The method of claim 3, wherein the establishing the sample generation model according to the first voice segment and the second voice segment comprises:
converting the first voice segment and the second voice segment from a time domain to a frequency domain, and acquiring a first frequency domain value corresponding to each frame in the first voice segment and a second frequency domain value corresponding to each frame in the second voice segment;
determining a transfer coefficient corresponding to each frame according to the first frequency-domain value and the second frequency-domain value, wherein the transfer coefficient is used for indicating a relationship between the first frequency-domain value and the corresponding second frequency-domain value;
and establishing the sample generation model according to the transfer coefficient.
5. The method of claim 4, wherein said building the sample generation model from the transfer coefficients comprises:
clustering a plurality of transfer coefficients corresponding to the multiple frames to determine a sample generation coefficient; wherein the sample generation coefficient is used for indicating the transfer coefficient corresponding to a center point obtained by clustering the plurality of transfer coefficients.
6. The method of claim 5, wherein generating a target speech sample from the third speech sample and the sample generation model comprises:
converting the third voice sample from a time domain to a frequency domain to obtain a third frequency domain value corresponding to each frame in the third voice sample;
obtaining a target frequency-domain value corresponding to each frame according to the third frequency-domain value and the sample generation coefficient; wherein the target frequency-domain value is used for indicating a frequency-domain value corresponding to each frame after the target voice sample is converted from the time domain to the frequency domain;
and converting the target frequency-domain value corresponding to each frame into a time domain to obtain the target voice sample.
7. The method of claim 6, wherein converting the target frequency-domain value corresponding to each frame into the time domain to obtain the target speech sample comprises:
converting the target frequency domain value corresponding to the first frame into first target time domain information, and converting the target frequency domain value corresponding to the second frame into second target time domain information; wherein the first frame and the second frame are adjacent frames;
and overlapping at least part of the first target time domain information with at least part of the second target time domain information to obtain the target voice sample.
8. A speech sample generation apparatus, comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a first voice sample and a second voice sample, the first voice sample is generated under the condition that a first object does not wear a mask, and the second voice sample is generated under the condition that the first object wears the mask;
the establishing module is used for establishing a sample generation model according to the first voice sample and the second voice sample; wherein the sample generation model is used to indicate a relationship between the first voice sample and the second voice sample;
the generating module is used for acquiring a third voice sample and generating a target voice sample according to the third voice sample and the sample generating model; the third voice sample is a voice sample generated under the condition that a second object does not wear a mask, and the target voice sample is used for indicating a corresponding voice sample under the condition that the second object wears the mask.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 7 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 7.
CN202011044992.XA 2020-09-28 Speech sample generation method and device, storage medium and electronic device Active CN112201227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011044992.XA CN112201227B (en) 2020-09-28 Speech sample generation method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN112201227A (en) 2021-01-08
CN112201227B (en) 2024-06-28

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113674737A (en) * 2021-08-09 2021-11-19 维沃移动通信(杭州)有限公司 Voice data processing method and device, electronic equipment and storage medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003022088A (en) * 2001-07-10 2003-01-24 Sharp Corp Device and method for speaker's features extraction, voice recognition device, and program recording medium
CN105448303A (en) * 2015-11-27 2016-03-30 百度在线网络技术(北京)有限公司 Voice signal processing method and apparatus
CN106571135A (en) * 2016-10-27 2017-04-19 苏州大学 Ear voice feature extraction method and system
CN108597505A (en) * 2018-04-20 2018-09-28 北京元心科技有限公司 Audio recognition method, device and terminal device
CN109120779A (en) * 2018-07-24 2019-01-01 Oppo(重庆)智能科技有限公司 Microphone blocks based reminding method and relevant apparatus
CN109473091A (en) * 2018-12-25 2019-03-15 四川虹微技术有限公司 A kind of speech samples generation method and device
CN109961794A (en) * 2019-01-14 2019-07-02 湘潭大学 A kind of layering method for distinguishing speek person of model-based clustering
CN110197665A (en) * 2019-06-25 2019-09-03 广东工业大学 A kind of speech Separation and tracking for police criminal detection monitoring
CN110910865A (en) * 2019-11-25 2020-03-24 秒针信息技术有限公司 Voice conversion method and device, storage medium and electronic device
US20200160877A1 (en) * 2018-11-20 2020-05-21 Airbus Operations Sas Method and system for processing audio signals for a microphone of an aircraft oxygen mask
CN111348499A (en) * 2020-03-02 2020-06-30 北京声智科技有限公司 Elevator control method, elevator control device, electronic equipment and computer-readable storage medium
CN111358066A (en) * 2020-03-10 2020-07-03 中国人民解放军陆军军医大学第一附属医院 Protective clothing based on speech recognition
CN111599346A (en) * 2020-05-19 2020-08-28 科大讯飞股份有限公司 Speaker clustering method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
JP6876752B2 (en) Response method and equipment
CN110415687A (en) Method of speech processing, device, medium, electronic equipment
CN112071330B (en) Audio data processing method and device and computer readable storage medium
CN106104674A (en) Mixing voice identification
CN109637525B (en) Method and apparatus for generating an on-board acoustic model
CN111354332A (en) Singing voice synthesis method and device
CN100585663C (en) Language studying system
CN110675886A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN112820315A (en) Audio signal processing method, audio signal processing device, computer equipment and storage medium
CN108039168B (en) Acoustic model optimization method and device
CN105895080A (en) Voice recognition model training method, speaker type recognition method and device
JP2014089420A (en) Signal processing device, method and program
CN106375780A (en) Method and apparatus for generating multimedia file
CN112289343B (en) Audio repair method and device, electronic equipment and computer readable storage medium
CN111142066A (en) Direction-of-arrival estimation method, server, and computer-readable storage medium
CN112837670B (en) Speech synthesis method and device and electronic equipment
CN112652309A (en) Dialect voice conversion method, device, equipment and storage medium
CN110070891B (en) Song identification method and device and storage medium
CN113450811B (en) Method and equipment for performing transparent processing on music
CN112201227B (en) Speech sample generation method and device, storage medium and electronic device
CN111103568A (en) Sound source positioning method, device, medium and equipment
CN112201227A (en) Voice sample generation method and device, storage medium and electronic device
CN114974281A (en) Training method and device of voice noise reduction model, storage medium and electronic device
CN113793623A (en) Sound effect setting method, device, equipment and computer readable storage medium
CN112164387A (en) Audio synthesis method and device, electronic equipment and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant