CN111370002A - Method and device for acquiring voice training sample, computer equipment and storage medium

Method and device for acquiring voice training sample, computer equipment and storage medium

Info

Publication number
CN111370002A
CN111370002A
Authority
CN
China
Prior art keywords
tearing
spectrogram
sound
time
spectrograms
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010093613.XA
Other languages
Chinese (zh)
Other versions
CN111370002B (en)
Inventor
马坤
赵之砚
施奕明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010093613.XA priority Critical patent/CN111370002B/en
Priority to PCT/CN2020/093092 priority patent/WO2021159635A1/en
Publication of CN111370002A publication Critical patent/CN111370002A/en
Application granted granted Critical
Publication of CN111370002B publication Critical patent/CN111370002B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Auxiliary Devices For Music (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The application discloses a method, an apparatus, computer equipment and a storage medium for acquiring voice training samples, wherein the method comprises the following steps: processing a voice signal to obtain a sound spectrogram of the voice signal; randomly selecting a time point in the time direction on the sound spectrogram; and taking the time point as a tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample. After an original voice signal is converted into a sound spectrogram, a large number of torn spectrograms, first mask spectrograms and second mask spectrograms are derived from the one sound spectrogram through tearing and masking, which solves the prior-art problem that an accurate voiceprint recognition model cannot be obtained because too few samples are available to train it.

Description

Method and device for acquiring voice training sample, computer equipment and storage medium
Technical Field
The present application relates to the field of computer neural network training, and in particular, to a method and an apparatus for obtaining a speech training sample, a computer device, and a storage medium.
Background
Identifying identity from voice, i.e., voiceprint recognition, is an important direction in the field of artificial intelligence and an important application of artificial intelligence technology in biometric recognition scenarios. Although the accuracy of voiceprint recognition has always been high under laboratory conditions, in actual service scenarios the accuracy is still not high, because voice transmission depends on a transmission channel, such as a telephone or broadband network channel, and the received voice is affected by that channel.
Because the speaking voice and the channel cannot be completely separated, the speaker features extracted during voiceprint recognition inevitably carry channel characteristics. For example, the features extracted for speaker A from a telephone recording and from network voice carry the characteristics of the telephone channel and the network channel respectively, which can cause voiceprint recognition errors. The cross-channel problem has therefore long been a difficulty in the field of voiceprint recognition.
The current mainstream solution in the industry is to collect voice data from each channel and either train a model for feature transformation between channels or expand the training set of the original model with the collected cross-channel data. The core of both approaches is collecting enough cross-channel data as samples. In actual production, however, due to the limitations of sample collection cost and collection conditions, sufficient and effective cross-channel voice data cannot be collected.
Disclosure of Invention
The main purpose of the application is to provide a method, an apparatus, computer equipment and a storage medium for acquiring voice training samples, aiming to solve the prior-art technical problem that sufficient and effective cross-channel voice data cannot be acquired as samples.
In order to achieve the above object, the present application provides a method for obtaining a speech training sample, including:
processing a voice signal to obtain a sound spectrogram of the voice signal;
randomly selecting a time point in a time direction on the sound spectrogram;
and taking the time point as a tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution [0, S], and S is a time deformation parameter.
Further, the step of adding the transition information at the fracture according to the preset rule includes:
randomly adding the transition information to the fracture of the torn spectrogram.
Further, the step of randomly selecting a time point in the time direction on the sound spectrogram comprises:
acquiring the time length of the sound spectrogram;
determining the number of tearing processes for the sound spectrogram according to the time length;
and selecting a number of time points equal to the number of tearing processes, so as to tear the sound spectrogram at the different time points.
Further, the step of selecting a number of time points equal to the number of tearing processes, so as to tear the sound spectrogram at the different time points, comprises:
evenly distributing the number of time points corresponding to the number of tearing processes within the time length, so as to tear the sound spectrogram at the different time points.
Further, after the step of taking the time point as the tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, and adding transition information at the fracture according to the preset rule to obtain the torn spectrogram, the method includes:
selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the torn spectrogram;
and applying a mask sequence to each first spectrum block to obtain a first mask spectrogram.
Further, after the step of taking the time point as the tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, and adding transition information at the fracture according to the preset rule to obtain the torn spectrogram, the method further includes:
selecting a plurality of second spectrum blocks of different frequency channels in the frequency direction on the torn spectrogram;
and applying a mask sequence to each second spectrum block to obtain a second mask spectrogram.
Further, the step of randomly selecting a time point in a time direction on the sound spectrogram comprises:
randomly adding a mask in the time direction on the sound spectrogram to obtain a third mask spectrogram;
randomly selecting the time point in the time direction on the third mask spectrogram.
The present application further provides an apparatus for obtaining a speech training sample, including:
the conversion unit is used for processing a voice signal to obtain a sound spectrogram of the voice signal;
a selection unit configured to randomly select a time point in a time direction on the sound spectrogram;
and the tearing unit is used for taking the time point as a tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution [0, S], and S is a time deformation parameter.
The present application further provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the above.
According to the method, the apparatus, the computer equipment and the storage medium for acquiring voice training samples, an original voice signal can be converted into a sound spectrogram, and a large number of torn spectrograms, first mask spectrograms and second mask spectrograms can be derived from the one sound spectrogram through tearing and masking. These spectrograms can all serve as samples for training a voiceprint recognition model, which solves the prior-art problem that an accurate voiceprint recognition model cannot be obtained because too few training samples are available. In particular, it addresses the problem that a voiceprint recognition model cannot be trained well when few samples are available in different channel scenarios.
Drawings
Fig. 1 is a schematic flowchart of a method for obtaining a speech training sample according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an apparatus for obtaining a speech training sample according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, a method for obtaining a speech training sample includes:
s1, processing a voice signal to obtain a sound spectrogram of the voice signal;
s2, randomly selecting a time point in the time direction on the sound spectrogram;
and S3, taking the time point as a tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution [0, S], and S is a time deformation parameter.
In this embodiment, a sample voice signal is first converted into a sound spectrogram, generally a mel spectrogram; the specific conversion can be implemented by any existing technique. The sound spectrogram is torn at a certain time point, i.e., separated in time at that point. The separation can be done in various ways: for example, the first side of the sound spectrogram at the tearing point is fixed while the second side moves away from the first side; or the first side and the second side each move away from the other; and so on. In one embodiment, the first side can be fixed and the second side moved a distance s away from the first side; then, on the original sound spectrogram, the second side is fixed and the first side is moved a distance s away from the second side, thereby obtaining two torn spectrograms with different moving directions from the processing of one time point. In another embodiment, a side may also be moved a specified distance in a specified direction. Further, repeating the above steps S2 and S3 with a different time point each time yields a plurality of torn spectrograms corresponding to the sound spectrogram; finally, the sound spectrogram and the plurality of torn spectrograms form a first voice training sample set. With this technical solution, a plurality of torn spectrograms can be derived from one sound spectrogram, enriching the number of voice training samples and thereby solving the prior-art problem that an accurate voiceprint recognition model cannot be obtained because too few training samples are available. For example, it addresses the problem that a voiceprint recognition model cannot be trained well when few samples are available in different channel scenarios.
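As a concrete illustration, the tearing operation can be sketched in a few lines of Python with NumPy. This is a minimal sketch under the assumptions above (the first side stays fixed while the second side moves away by s, and the gap is filled with transition information afterwards); the function and parameter names are ours, not the patent's. The mel spectrogram itself can be obtained, for example, with librosa.feature.melspectrogram.

    import numpy as np

    def tear_spectrogram(spec: np.ndarray, S: int, rng=None):
        """Tear a (freq_bins, time_steps) spectrogram at a random time point.

        The part to the right of the tear point is shifted away by a
        distance s drawn uniformly from [0, S]; the gap is left blank here
        and is expected to be filled with transition information later.
        """
        rng = rng or np.random.default_rng()
        n_freq, n_time = spec.shape
        t = int(rng.integers(1, n_time))     # random tear point in the time direction
        s = int(rng.integers(0, S + 1))      # separation distance s ~ U[0, S]

        torn = np.zeros((n_freq, n_time + s), dtype=spec.dtype)
        torn[:, :t] = spec[:, :t]            # first side stays fixed
        torn[:, t + s:] = spec[:, t:]        # second side moves away by s
        return torn, t, s                    # the fracture occupies torn[:, t:t+s]

Calling this repeatedly yields the plurality of torn spectrograms that, together with the original sound spectrogram, would form the first voice training sample set.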
The step of adding the transition information at the fracture according to the preset rule includes:
randomly adding the transition information to the fracture of the torn spectrogram.
In this embodiment, because the torn spectrogram contains a fracture, there may be a blank at the fracture. To improve the diversity of the training samples, transition information, such as different smooth signals, can be added in this blank. The transition information can be preset; generally, a plurality of different pieces of transition information are preset and one is randomly selected to be added at the fracture. If the selected transition information does not exactly fill the blank, it can be scaled up or down proportionally so that it fits exactly. In another embodiment, if S is a positive integer, S kinds of transition information are set, each kind comprising a plurality of pieces of transition information with different contents; when transition information is added, one piece is randomly selected from the corresponding kind among the S kinds, further improving the diversity of the training samples.
In another embodiment, the preset rule is to add identical data throughout the fracture, such as all 0s, all 1s, or a continuously repeating pattern such as 010101.
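The following sketch shows how the filling rules above might be realized; the rule names and the linear-interpolation "smooth" variant are illustrative assumptions, not prescribed by the application.

    import numpy as np

    def fill_break(torn: np.ndarray, t: int, s: int, rule: str = "zeros"):
        # Fill the fracture torn[:, t:t+s] according to a preset rule.
        if s == 0:
            return torn
        if rule == "zeros":                  # identical data: all 0s
            torn[:, t:t + s] = 0.0
        elif rule == "repeat01":             # continuously repeating 0/1 pattern
            torn[:, t:t + s] = np.resize([0.0, 1.0], s)
        elif rule == "smooth":               # smooth transition across the break
            left, right = torn[:, t - 1], torn[:, t + s]
            for i in range(s):
                alpha = (i + 1) / (s + 1)
                torn[:, t + i] = (1 - alpha) * left + alpha * right
        return torn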
In one embodiment, the step S2 of randomly selecting a time point in the time direction on the sound spectrogram includes:
s201, acquiring the time length of the sound spectrogram;
s202, determining the tearing processing times of the sound spectrogram according to the time length;
s203, selecting the time points with the same number of times as the tearing times so as to tear the sound spectrogram for different times.
In this embodiment, the sound spectrogram cannot be torn an unlimited number of times, so the application determines the number of tears according to the length of the time information in the sound spectrogram. Specifically, a mapping table is set in which one column holds time-length ranges and another column holds the number of tears corresponding to each range. After the time length of the sound spectrogram is determined, the range it falls into is looked up in the mapping table and the corresponding number of tears is selected. The specific time lengths and tear counts can be set manually from experience; the guiding idea is that the longer the time length, the more times the spectrogram can be torn, and vice versa.
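A minimal sketch of such a mapping table follows; the concrete duration ranges and tear counts are invented for illustration, since the application leaves them to be set manually from experience.

    # Hypothetical mapping from time-length ranges (in seconds) to tear counts.
    TEAR_COUNT_TABLE = [
        (0.0, 2.0, 1),              # up to 2 s: tear once
        (2.0, 5.0, 2),              # 2 to 5 s: tear twice
        (5.0, 10.0, 4),
        (10.0, float("inf"), 8),    # the longer the spectrogram, the more tears
    ]

    def tear_count_for(duration_s: float) -> int:
        for lo, hi, count in TEAR_COUNT_TABLE:
            if lo <= duration_s < hi:
                return count
        return 1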
In an embodiment, the step S203 of selecting a number of time points equal to the number of tearing processes, so as to tear the sound spectrogram at the different time points, includes:
evenly distributing the number of time points corresponding to the number of tearing processes within the time length, so as to tear the sound spectrogram at the different time points.
In this embodiment, the time points are evenly distributed within the time length; the distribution is fast and uniform, and the differences between samples are more even than with a random distribution.
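Even distribution of the tear points reduces to a one-liner; the helper below is an illustrative sketch.

    import numpy as np

    def evenly_spaced_tear_points(n_time: int, n_tears: int) -> np.ndarray:
        # n_tears points spaced evenly inside (0, n_time), excluding the edges.
        return np.linspace(0, n_time, n_tears + 2, dtype=int)[1:-1]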
In one embodiment, the sound spectrogram may be torn at only one time point, yielding a torn spectrogram with only one fracture; in another embodiment, a sound spectrogram may be torn at a plurality of time points simultaneously, yielding a torn spectrogram with a plurality of fractures.
In an embodiment, after the step S3 of taking the time point as the tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, and adding transition information at the fracture according to the preset rule to obtain the torn spectrogram, the method includes:
S4, selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the torn spectrogram;
and S5, applying a mask sequence to each first spectrum block to obtain a first mask spectrogram.
In this embodiment, in the time direction of the torn spectrogram, first spectrum blocks of t (a positive integer) consecutive time steps [t0, t0 + t] are selected, and a mask sequence [w1, …] is then applied to these first spectrum blocks, where each w is a number randomly selected from the uniform distribution [0, W] and W is a time mask parameter. In a specific embodiment, selecting different t yields different first mask spectrograms, so that a plurality of first mask spectrograms corresponding to the torn spectrogram are obtained; the sound spectrogram is put together with all the first mask spectrograms and all the torn spectrograms to form a second voice training sample set, further increasing the number and richness of the samples. In this embodiment, the time length represented by t is smaller than the time length of the torn spectrogram, and t0 is an arbitrary time point in the torn spectrogram, provided that the block [t0, t0 + t] fits within the torn spectrogram.
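A sketch of this time-masking step is given below. Drawing the mask sequence uniformly from [0, W] follows the description above; the number of blocks and the possibly overlapping random block positions are simplifying assumptions of ours.

    import numpy as np

    def time_mask(spec: np.ndarray, t: int, W: float, n_blocks: int = 2, rng=None):
        """Mask n_blocks blocks of t consecutive time steps [t0, t0 + t]
        with a mask sequence drawn uniformly from [0, W]."""
        rng = rng or np.random.default_rng()
        masked = spec.copy()
        n_freq, n_time = spec.shape
        for _ in range(n_blocks):                    # blocks arranged at intervals
            t0 = int(rng.integers(0, n_time - t))    # block must fit in the spectrogram
            masked[:, t0:t0 + t] = rng.uniform(0.0, W, size=(n_freq, t))
        return masked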
In an embodiment, after the step S3 of taking the time point as the tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, and adding transition information at the fracture according to the preset rule to obtain the torn spectrogram, the method further includes:
S6, selecting a plurality of second spectrum blocks of different frequency channels in the frequency direction on the torn spectrogram;
and S7, applying a mask sequence to each second spectrum block to obtain a second mask spectrogram.
In this embodiment, the second spectrum blocks are spectrum blocks in the frequency direction, not in time. Specifically, in the frequency direction of the torn spectrogram, a mask sequence [v1, …] is applied to spectrum blocks of n (a positive integer) consecutive frequency channels [m0, m0 + n], where each v is a number randomly selected from the uniform distribution [0, V] and V is a frequency mask parameter. Similarly, selecting different n yields different second mask spectrograms, so that a plurality of second mask spectrograms corresponding to the torn spectrogram are obtained; the sound spectrogram, all the second mask spectrograms and all the torn spectrograms are put together to form a third voice training sample set. In this embodiment, m0 is an arbitrary frequency channel in the torn spectrogram, provided that the block [m0, m0 + n] fits within the torn spectrogram.
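The frequency mask is the same construction rotated onto the frequency axis; as before, the block count and the uniform mask values are illustrative assumptions.

    import numpy as np

    def freq_mask(spec: np.ndarray, n: int, V: float, n_blocks: int = 2, rng=None):
        """Mask n_blocks blocks of n consecutive frequency channels
        [m0, m0 + n] with a mask sequence drawn uniformly from [0, V]."""
        rng = rng or np.random.default_rng()
        masked = spec.copy()
        n_freq, n_time = spec.shape
        for _ in range(n_blocks):
            m0 = int(rng.integers(0, n_freq - n))    # block must fit in the spectrogram
            masked[m0:m0 + n, :] = rng.uniform(0.0, V, size=(n, n_time))
        return masked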
In one embodiment, the step S2 of randomly selecting the time point in the time direction on the sound spectrogram includes:
s21, randomly adding masks in the time direction of the sound spectrogram to obtain a third mask spectrogram;
s22, randomly selecting the time point in the time direction on the third mask spectrogram.
In this embodiment, a mask is first added to the sound spectrogram, and the time point is then randomly selected in the time direction on the third mask spectrogram, so that richer samples can be obtained.
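Combining the sketches above, the order of operations in this variant would look roughly as follows; get_mel_spectrogram is a hypothetical helper (e.g., wrapping librosa.feature.melspectrogram) and the parameter values are illustrative.

    spec = get_mel_spectrogram(wav)                    # S1: voice signal -> sound spectrogram
    third_mask = time_mask(spec, t=10, W=1.0)          # S21: mask the spectrogram first
    torn, tp, s = tear_spectrogram(third_mask, S=40)   # S22 + S3: then tear at a random point
    sample = fill_break(torn, tp, s, rule="smooth")    # add transition information at the fracture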
According to the method for acquiring voice training samples, an original voice signal can be converted into a sound spectrogram, and a large number of torn spectrograms, first mask spectrograms and second mask spectrograms can be derived from the one sound spectrogram through tearing and masking. These spectrograms can be used as samples for training a voiceprint recognition model, which solves the prior-art problem that an accurate voiceprint recognition model cannot be obtained because too few samples are available. For example, when voice information is acquired separately under different channel scenarios, using it directly as training samples yields too few samples for an accurate voiceprint recognition model; with this method, a large number of training samples can be derived from the training samples corresponding to a small amount of voice information, resolving the shortage of training samples and, in particular, the problem that one voiceprint recognition model cannot be trained well because few samples are available per channel scenario.
Referring to fig. 2, an embodiment of the present application further provides an apparatus for obtaining a speech training sample, including:
the conversion unit 10 is configured to process a voice signal to obtain a sound spectrogram of the voice signal;
a selection unit 20 for randomly selecting a time point in a time direction on the sound spectrogram;
and the tearing unit 30 is configured to take the time point as a tearing point, separate the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, add transition information at the fracture according to a preset rule to obtain a torn spectrogram, and take the torn spectrogram as the voice training sample, where the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution [0, S], and S is a time deformation parameter.
In this embodiment, the converting unit 10 first converts the voice signal serving as a sample into a sound spectrogram, generally a mel spectrogram; the specific conversion can be implemented by any existing technique. After the selecting unit 20 randomly selects a time point, the tearing unit 30 tears the sound spectrogram at that time point, i.e., separates it in time at that point. The separation can be done in various ways: for example, the first side of the sound spectrogram at the tearing point is fixed while the second side moves away from the first side; or the first side and the second side each move away from the other; and so on. In one embodiment, the first side can be fixed and the second side moved a distance s away from the first side; then, on the original sound spectrogram, the second side is fixed and the first side is moved a distance s away from the second side, thereby obtaining two torn spectrograms with different moving directions from the processing of one time point. In another embodiment, a side may also be moved a specified distance in a specified direction. Further, the process of randomly selecting time points and tearing is repeated with a different time point each time, yielding a plurality of torn spectrograms corresponding to the sound spectrogram; finally, the sound spectrogram and the plurality of torn spectrograms form a first voice training sample set. With this technical solution, a plurality of torn spectrograms can be derived from one sound spectrogram, enriching the number of voice training samples and thereby solving the prior-art problem that an accurate voiceprint recognition model cannot be obtained because too few training samples are available. For example, it addresses the problem that a voiceprint recognition model cannot be trained well when few samples are available in different channel scenarios.
In one embodiment, the tearing unit 30 further includes:
and the adding unit is used for randomly adding the transition information to the fracture of the torn spectrogram; that is, the preset rule is to randomly add transition information at the fracture.
In this embodiment, because the torn spectrogram contains a fracture, there may be a blank at the fracture. To improve the diversity of the training samples, transition information, such as different smooth signals, can be added in this blank. The transition information can be preset; generally, a plurality of different pieces of transition information are preset and one is randomly selected to be added at the fracture. If the selected transition information does not exactly fill the blank, it can be scaled up or down proportionally so that it fits exactly. In another embodiment, if S is a positive integer, S kinds of transition information are set, each kind comprising a plurality of pieces of transition information with different contents; when transition information is added, one piece is randomly selected from the corresponding kind among the S kinds, further improving the diversity of the training samples.
In another embodiment, the preset rule is to add identical data throughout the fracture, such as all 0s, all 1s, or a continuously repeating pattern such as 010101.
In an embodiment, the apparatus for obtaining a speech training sample further includes:
an acquisition unit, configured to acquire a time length of the sound spectrogram;
the determining unit is used for determining the number of tearing processes for the sound spectrogram according to the time length;
and the selecting unit is used for selecting a number of time points equal to the number of tearing processes, so as to tear the sound spectrogram at the different time points.
In this embodiment, the sound spectrogram cannot be torn an unlimited number of times, so the application determines the number of tears according to the length of the time information in the sound spectrogram. Specifically, a mapping table is set in which one column holds time-length ranges and another column holds the number of tears corresponding to each range. After the time length of the sound spectrogram is determined, the range it falls into is looked up in the mapping table and the corresponding number of tears is selected. The specific time lengths and tear counts can be set manually from experience; the guiding idea is that the longer the time length, the more times the spectrogram can be torn, and vice versa.
In one embodiment, the selecting unit includes:
and the average selection module is used for evenly distributing the number of time points corresponding to the number of tearing processes within the time length, so as to tear the sound spectrogram at the different time points.
In this embodiment, the time points are evenly distributed within the time length; the distribution is fast and uniform, and the differences between samples are more even than with a random distribution.
In one embodiment, the sound spectrogram may be torn at only one time point, yielding a torn spectrogram with only one fracture; in another embodiment, a sound spectrogram may be torn at a plurality of time points simultaneously, yielding a torn spectrogram with a plurality of fractures.
In an embodiment, the apparatus for obtaining a speech training sample further includes:
the time spectrum unit is used for selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the torn spectrogram;
a first mask unit, configured to apply a mask sequence to each of the first spectrum blocks to obtain a first mask spectrogram.
In this embodiment, in the time direction of the torn spectrogram, first spectrum blocks of t (a positive integer) consecutive time steps [t0, t0 + t] are selected, and a mask sequence [w1, …] is then applied to these first spectrum blocks, where each w is a number randomly selected from the uniform distribution [0, W] and W is a time mask parameter. In a specific embodiment, selecting different t yields different first mask spectrograms, so that a plurality of first mask spectrograms corresponding to the torn spectrogram are obtained; the sound spectrogram is put together with all the first mask spectrograms and all the torn spectrograms to form a second voice training sample set, further increasing the number and richness of the samples. In this embodiment, the time length represented by t is smaller than the time length of the torn spectrogram, and t0 is an arbitrary time point in the torn spectrogram, provided that the block [t0, t0 + t] fits within the torn spectrogram.
In an embodiment, the apparatus for obtaining a speech training sample further includes:
a frequency spectrum unit, configured to select a plurality of second spectrum blocks of different frequency channels in the frequency direction on the torn spectrogram;
and the second mask unit is used for applying a mask sequence to each second spectrum block to obtain a second mask spectrogram.
In this embodiment, the second spectrum blocks are spectrum blocks in the frequency direction, not in time. Specifically, in the frequency direction of the torn spectrogram, a mask sequence [v1, …] is applied to spectrum blocks of n (a positive integer) consecutive frequency channels [m0, m0 + n], where each v is a number randomly selected from the uniform distribution [0, V] and V is a frequency mask parameter. Similarly, selecting different n yields different second mask spectrograms, so that a plurality of second mask spectrograms corresponding to the torn spectrogram are obtained; the sound spectrogram, all the second mask spectrograms and all the torn spectrograms are put together to form a third voice training sample set. In this embodiment, m0 is an arbitrary frequency channel in the torn spectrogram, provided that the block [m0, m0 + n] fits within the torn spectrogram.
In one embodiment, the selecting unit 20 includes:
the mask module is used for randomly adding masks in the time direction on the sound spectrogram to obtain a third mask spectrogram;
a selection module configured to randomly select the time point in the time direction on the third mask spectrogram.
In this embodiment, a mask is first added to the sound spectrogram, and the time point is then randomly selected in the time direction on the third mask spectrogram, so that richer samples can be obtained.
The apparatus for acquiring voice training samples can convert an original voice signal into a sound spectrogram and derive a large number of torn spectrograms, first mask spectrograms and second mask spectrograms from the one sound spectrogram through tearing and masking. These spectrograms can be used as samples for training a voiceprint recognition model, which solves the prior-art problem that an accurate voiceprint recognition model cannot be obtained because too few samples are available. For example, when voice information is acquired separately under different channel scenarios, using it directly as training samples yields too few samples for an accurate voiceprint recognition model; with this apparatus, a large number of training samples can be derived from the training samples corresponding to a small amount of voice information, resolving the shortage of training samples and, in particular, the problem that one voiceprint recognition model cannot be trained well because few samples are available per channel scenario.
Referring to fig. 3, an embodiment of the present application further provides a computer device comprising a memory and a processor, the memory storing a computer program; its internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface and a database connected by a system bus, where the processor provides computation and control capabilities. The memory comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database; the internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing data such as sample sets. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the method for acquiring voice training samples. Specifically:
A method for acquiring voice training samples comprises the following steps: processing a voice signal to obtain a sound spectrogram of the voice signal; randomly selecting a time point in the time direction on the sound spectrogram; and taking the time point as a tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution [0, S], and S is a time deformation parameter.
In one embodiment, the step of adding the transition information at the fracture according to the preset rule includes: randomly adding the transition information to the fracture of the torn spectrogram.
In one embodiment, the step of randomly selecting a time point in the time direction on the sound spectrogram comprises: acquiring the time length of the sound spectrogram; determining the number of tearing processes for the sound spectrogram according to the time length; and selecting a number of time points equal to the number of tearing processes, so as to tear the sound spectrogram at the different time points.
In one embodiment, the step of selecting a number of time points equal to the number of tearing processes, so as to tear the sound spectrogram at the different time points, comprises: evenly distributing the number of time points corresponding to the number of tearing processes within the time length, so as to tear the sound spectrogram at the different time points.
In an embodiment, after the step of taking the time point as the tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, and adding transition information at the fracture according to the preset rule to obtain the torn spectrogram, the method includes: selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the torn spectrogram; and applying a mask sequence to each first spectrum block to obtain a first mask spectrogram.
In an embodiment, after the step of taking the time point as the tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, and adding transition information at the fracture according to the preset rule to obtain the torn spectrogram, the method further includes: selecting a plurality of second spectrum blocks of different frequency channels in the frequency direction on the torn spectrogram; and applying a mask sequence to each second spectrum block to obtain a second mask spectrogram.
In one embodiment, the step of randomly selecting a time point in the time direction on the sound spectrogram comprises: randomly adding a mask in the time direction on the sound spectrogram to obtain a third mask spectrogram; and randomly selecting the time point in the time direction on the third mask spectrogram.
The computer device of the embodiment of the application can convert an original voice signal into a sound spectrogram and derive a large number of torn spectrograms, first mask spectrograms and second mask spectrograms from the one sound spectrogram through tearing and masking. These spectrograms can be used as samples for training a voiceprint recognition model, which solves the prior-art problem that an accurate voiceprint recognition model cannot be obtained because too few samples are available. For example, when voice information is acquired separately under different channel scenarios, using it directly as training samples yields too few samples for an accurate voiceprint recognition model; with this method, a large number of training samples can be derived from the training samples corresponding to a small amount of voice information, resolving the shortage of training samples and, in particular, the problem that one voiceprint recognition model cannot be trained well because few samples are available per channel scenario.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements a method for obtaining a speech training sample. Specifically, the method comprises the following steps:
A method for acquiring voice training samples comprises the following steps: processing a voice signal to obtain a sound spectrogram of the voice signal; randomly selecting a time point in the time direction on the sound spectrogram; and taking the time point as a tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution [0, S], and S is a time deformation parameter.
In one embodiment, the step of adding the transition information at the fracture according to the preset rule includes: randomly adding the transition information to the fracture of the torn spectrogram.
In one embodiment, the step of randomly selecting a time point in the time direction on the sound spectrogram comprises: acquiring the time length of the sound spectrogram; determining the number of tearing processes for the sound spectrogram according to the time length; and selecting a number of time points equal to the number of tearing processes, so as to tear the sound spectrogram at the different time points.
In one embodiment, the step of selecting a number of time points equal to the number of tearing processes, so as to tear the sound spectrogram at the different time points, comprises: evenly distributing the number of time points corresponding to the number of tearing processes within the time length, so as to tear the sound spectrogram at the different time points.
In an embodiment, after the step of taking the time point as the tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, and adding transition information at the fracture according to the preset rule to obtain the torn spectrogram, the method includes: selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the torn spectrogram; and applying a mask sequence to each first spectrum block to obtain a first mask spectrogram.
In an embodiment, after the step of taking the time point as the tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, and adding transition information at the fracture according to the preset rule to obtain the torn spectrogram, the method further includes: selecting a plurality of second spectrum blocks of different frequency channels in the frequency direction on the torn spectrogram; and applying a mask sequence to each second spectrum block to obtain a second mask spectrogram.
In one embodiment, the step of randomly selecting a time point in the time direction on the sound spectrogram comprises: randomly adding a mask in the time direction on the sound spectrogram to obtain a third mask spectrogram; and randomly selecting the time point in the time direction on the third mask spectrogram.
When the computer program is executed by a processor to implement the method for acquiring voice training samples, an original voice signal can be converted into a sound spectrogram, and a large number of torn spectrograms, first mask spectrograms and second mask spectrograms can be derived from the one sound spectrogram through tearing and masking. These spectrograms can be used as samples for training a voiceprint recognition model, which solves the prior-art problem that an accurate voiceprint recognition model cannot be obtained because too few samples are available. For example, when voice information is acquired separately under different channel scenarios, using it directly as training samples yields too few samples for an accurate voiceprint recognition model; with this method, a large number of training samples can be derived from the training samples corresponding to a small amount of voice information, resolving the shortage of training samples and, in particular, the problem that one voiceprint recognition model cannot be trained well because few samples are available per channel scenario.
It will be understood by those skilled in the art that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database or other media provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (10)

1. A method for obtaining a speech training sample is characterized by comprising the following steps:
processing a voice signal to obtain a sound spectrogram of the voice signal;
randomly selecting a time point in a time direction on the sound spectrogram;
and taking the time point as a tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution [0, S], and S is a time deformation parameter.
2. The method for acquiring voice training samples according to claim 1, wherein the step of adding the transition information at the fracture according to the preset rule comprises:
randomly adding the transition information to the fracture of the torn spectrogram.
3. The method of claim 1, wherein the step of randomly selecting the time point in the time direction on the sound spectrogram is preceded by:
acquiring the time length of the sound spectrogram;
determining the number of tearing processes for the sound spectrogram according to the time length;
and selecting a number of time points equal to the number of tearing processes, so as to tear the sound spectrogram at the different time points.
4. The method according to claim 3, wherein the step of selecting a number of time points equal to the number of tearing processes, so as to tear the sound spectrogram at the different time points, comprises:
evenly distributing the number of time points corresponding to the number of tearing processes within the time length, so as to tear the sound spectrogram at the different time points.
5. The method for acquiring voice training samples according to claim 1, wherein after the step of taking the time point as the tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, and adding transition information at the fracture according to the preset rule to obtain the torn spectrogram, the method comprises:
selecting a plurality of first spectrum blocks arranged at intervals in the time direction on the torn spectrogram;
and applying a mask sequence to each first spectrum block to obtain a first mask spectrogram.
6. The method for acquiring voice training samples according to claim 1, wherein after the step of taking the time point as the tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, and adding transition information at the fracture according to the preset rule to obtain the torn spectrogram, the method further comprises:
selecting a plurality of second spectrum blocks of different frequency channels in the frequency direction on the torn spectrogram;
and applying a mask sequence to each second spectrum block to obtain a second mask spectrogram.
7. The method of claim 1, wherein the step of randomly selecting the time point in the time direction on the sound spectrogram comprises:
randomly adding a mask in the time direction on the sound spectrogram to obtain a third mask spectrogram;
randomly selecting the time point in the time direction on the third mask spectrogram.
8. An apparatus for obtaining a speech training sample, comprising:
the conversion unit is used for processing a voice signal to obtain a sound spectrogram of the voice signal;
a selection unit configured to randomly select a time point in a time direction on the sound spectrogram;
and the tearing unit is used for taking the time point as a tearing point, separating the sound spectrogram on the two sides of the tearing point in the time direction to complete the tearing processing of the sound spectrogram, adding transition information at the fracture according to a preset rule to obtain a torn spectrogram, and taking the torn spectrogram as the voice training sample, wherein the separation distance of the sound spectrogram on the two sides of the tearing point is s, s is a number randomly selected from the uniform distribution [0, S], and S is a time deformation parameter.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010093613.XA 2020-02-14 2020-02-14 Method and device for acquiring voice training sample, computer equipment and storage medium Active CN111370002B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010093613.XA CN111370002B (en) 2020-02-14 2020-02-14 Method and device for acquiring voice training sample, computer equipment and storage medium
PCT/CN2020/093092 WO2021159635A1 (en) 2020-02-14 2020-05-29 Speech training sample obtaining method and apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010093613.XA CN111370002B (en) 2020-02-14 2020-02-14 Method and device for acquiring voice training sample, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111370002A 2020-07-03
CN111370002B CN111370002B (en) 2022-08-19

Family

ID=71206253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010093613.XA Active CN111370002B (en) 2020-02-14 2020-02-14 Method and device for acquiring voice training sample, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111370002B (en)
WO (1) WO2021159635A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112017638A (en) * 2020-09-08 2020-12-01 北京奇艺世纪科技有限公司 Voice semantic recognition model construction method, semantic recognition method, device and equipment
CN113241062A (en) * 2021-06-01 2021-08-10 平安科技(深圳)有限公司 Method, device and equipment for enhancing voice training data set and storage medium

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN115580682B (en) * 2022-12-07 2023-04-28 北京云迹科技股份有限公司 Method and device for determining connection and disconnection time of robot dialing

Citations (6)

Publication number Priority date Publication date Assignee Title
US5617507A (en) * 1991-11-06 1997-04-01 Korea Telecommunication Authority Speech segment coding and pitch control methods for speech synthesis systems
CN104408681A (en) * 2014-11-04 2015-03-11 南昌大学 Multi-image hiding method based on fractional mellin transform
CN104484872A (en) * 2014-11-27 2015-04-01 浙江工业大学 Interference image edge extending method based on directions
US20170200092A1 (en) * 2016-01-11 2017-07-13 International Business Machines Corporation Creating deep learning models using feature augmentation
CN110148400A (en) * 2018-07-18 2019-08-20 腾讯科技(深圳)有限公司 The pronunciation recognition methods of type, the training method of model, device and equipment
CN110379414A (en) * 2019-07-22 2019-10-25 出门问问(苏州)信息科技有限公司 Acoustic model enhances training method, device, readable storage medium storing program for executing and calculates equipment

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN106898357B (en) * 2017-02-16 2019-10-18 华南理工大学 A kind of vector quantization method based on normal distribution law
CN108830277B (en) * 2018-04-20 2020-04-21 平安科技(深圳)有限公司 Training method and device of semantic segmentation model, computer equipment and storage medium
CN108922560B (en) * 2018-05-02 2022-12-02 杭州电子科技大学 Urban noise identification method based on hybrid deep neural network model
CN109087632B (en) * 2018-08-17 2023-06-06 平安科技(深圳)有限公司 Speech processing method, device, computer equipment and storage medium
CN110751177A (en) * 2019-09-17 2020-02-04 阿里巴巴集团控股有限公司 Training method, prediction method and device of classification model


Non-Patent Citations (1)

Title
DANIEL S. PARK et al.: "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition", arXiv *

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN112017638A (en) * 2020-09-08 2020-12-01 北京奇艺世纪科技有限公司 Voice semantic recognition model construction method, semantic recognition method, device and equipment
CN113241062A (en) * 2021-06-01 2021-08-10 平安科技(深圳)有限公司 Method, device and equipment for enhancing voice training data set and storage medium
CN113241062B (en) * 2021-06-01 2023-12-26 平安科技(深圳)有限公司 Enhancement method, device, equipment and storage medium for voice training data set

Also Published As

Publication number Publication date
CN111370002B (en) 2022-08-19
WO2021159635A1 (en) 2021-08-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant