CN114360523A - Keyword dataset acquisition and model training methods, devices, equipment and medium - Google Patents

Keyword dataset acquisition and model training methods, devices, equipment and medium

Info

Publication number
CN114360523A
CN114360523A (application number CN202210274759.3A)
Authority
CN
China
Prior art keywords
audio
amplification
subclass
initial
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210274759.3A
Other languages
Chinese (zh)
Inventor
黄静
沙露露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yizhi Times Technology Co ltd
Original Assignee
Shenzhen Yizhi Times Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yizhi Times Technology Co ltd filed Critical Shenzhen Yizhi Times Technology Co ltd
Priority to CN202210274759.3A
Publication of CN114360523A

Abstract

The invention belongs to the technical field of audio processing and provides a keyword dataset acquisition method, a keyword detection model training method, an apparatus, a device, and a storage medium. The method comprises: obtaining an initial audio set and an initial audio dictionary corresponding to the initial audio set; performing audio subclass division on the initial audio set according to the initial audio dictionary and a preset audio subclass division rule to obtain a first audio subclass; and performing data augmentation on the audio of a second audio subclass, contained in the first audio subclass, in a preset class-by-class augmentation manner to obtain a training audio set. The complexity of the audio thus increases progressively, abrupt changes in audio features are avoided, the training audio set is obtained with as little resource consumption and workload as possible, and the quality of the training audio set is guaranteed.

Description

Keyword dataset acquisition and model training methods, devices, equipment and medium
Technical Field
The invention belongs to the technical field of audio processing, and particularly relates to a keyword dataset acquisition method, a keyword detection model training method, an apparatus, a device, and a storage medium.
Background
KWS (keyword spotting, commonly used for voice wakeup) means detecting a specific spoken segment (e.g., specific voice content such as "Hey Siri" or "Tmall Genie") in real time in a continuous speech stream. It is a small-resource keyword retrieval task and can also be regarded as a special type of speech recognition. In human-computer interaction it typically serves to activate a device or launch a system, and is especially common in phone assistants, vehicles, wearable devices, smart homes, robots, and the like.
In recent years, with the rapid development of artificial intelligence and the broad demand for human-computer voice interaction, more and more smart devices support voice wakeup, and many manufacturers have begun to develop voice products based on their own exclusive keywords. In theory, when keyword data is collected, the more speakers there are and the richer the recording scenes, the better the wake-up effect of the trained model. In practice, however, under manpower, material, and financial constraints, the number of collected audios is limited, their quality is uneven, and the recording scene may even be inconsistent with the product's intended usage scene, so the product cannot be used normally once deployed. Although existing data augmentation methods can compensate for deficiencies in data collection to a certain extent, the quality of the resulting keyword dataset is still not high enough, which in turn limits improvement of the speech model's performance.
Disclosure of Invention
The invention aims to provide a keyword dataset acquisition method, a keyword detection model training method, an apparatus, a device, and a storage medium, so as to solve the prior-art problem that keyword datasets collected under manpower, material, and financial constraints are not of sufficiently high quality.
In one aspect, the present invention provides a keyword dataset acquisition method, including the following steps:
acquiring an initial audio set and an initial audio dictionary corresponding to the initial audio set;
performing audio subclass division on the initial audio set according to the initial audio dictionary and a preset audio subclass division rule to obtain a first audio subclass;
and performing data augmentation on the audio of a second audio subclass in a preset class-by-class augmentation manner to obtain a training audio set, wherein the second audio subclass is contained in the first audio subclass.
Preferably, the initial audio dictionary comprises first audio labeling information of each keyword audio, and the first audio labeling information comprises an audio acquisition scene and a speed of speech of a speaker.
Preferably, the initial audio set includes a positive sample and a negative sample, the positive sample is keyword audio, and the negative sample includes approximate pronunciation word audio of a keyword, incomplete keyword audio, and background audio of the keyword in a corresponding application scene.
Preferably, before the step of performing data augmentation on the audio of the second audio subclass in a preset class-by-class augmentation manner, the method includes:
acquiring a data augmentation target and/or a data augmentation factor, wherein the data augmentation target and/or the data augmentation factor are determined according to the acquired audio number of each subclass in the first audio subclass;
and the step of performing data augmentation on the audio of the second audio subclass in a preset class-by-class augmentation manner includes:
performing data augmentation on the audio of the second audio subclass in the class-by-class augmentation manner according to the data augmentation target and/or the data augmentation factor.
Preferably, the initial audio set is included in the training audio set.
Preferably, the second audio subclass includes quiet audio, medium-speed audio, and speech-speed-adjusted audio, and the step of performing data augmentation on the audio of the second audio subclass in a preset class-by-class augmentation manner includes:
obtaining the quiet audio from the initial audio set, and performing noise-adding processing on the quiet audio to obtain a first audio augmentation set;
acquiring the medium-speed audio from the first audio augmentation set, and performing speech-speed adjustment on the medium-speed audio to obtain a second audio augmentation set;
acquiring the speech-speed-adjusted audio from the second audio augmentation set, and performing pitch adjustment on the speech-speed-adjusted audio to obtain a third audio augmentation set;
if a stopping condition for data augmentation is met, selecting audio from the third audio augmentation set according to a preset selection rule, and performing volume normalization on the selected audio to obtain the training audio set;
and if the stopping condition is not met, jumping back to the step of obtaining the quiet audio from the initial audio set.
Preferably, the quiet audio is determined according to the initial audio dictionary, and after the step of performing noise-adding processing on the quiet audio, the method further includes:
performing augmentation labeling on the initial audio dictionary, or updating the third audio augmentation dictionary (obtained in the previous augmentation round), to obtain a first audio augmentation dictionary corresponding to the first audio augmentation set;
the medium-speed audio is determined according to the first audio augmentation dictionary, and after the step of performing speech-speed adjustment on the medium-speed audio, the method further includes:
updating the first audio augmentation dictionary to obtain a second audio augmentation dictionary corresponding to the second audio augmentation set;
the speech-speed-adjusted audio is determined according to the second audio augmentation dictionary, and after the step of performing pitch adjustment on the speech-speed-adjusted audio, the method further includes:
updating the second audio augmentation dictionary to obtain a third audio augmentation dictionary corresponding to the third audio augmentation set;
wherein the first audio augmentation dictionary includes first audio annotation information and second annotation information of each keyword audio in the first audio augmentation set, and the second annotation information includes the noise-adding state, the speech-speed-adjustment state, and the pitch-adjustment state of the audio.
In another aspect, the present invention provides a method for training a keyword detection model, where the method includes:
acquiring a training audio set, wherein the training audio set is obtained by the method described above;
and training a preset keyword detection model based on the training audio set to obtain a trained keyword detection model.
In another aspect, the present invention provides a keyword dataset acquisition apparatus, including:
an audio set acquisition unit, configured to acquire an initial audio set and an initial audio dictionary corresponding to the initial audio set;
an audio subclass division unit, configured to perform audio subclass division on the initial audio set according to the initial audio dictionary and a preset audio subclass division rule to obtain a first audio subclass; and
an audio data augmentation unit, configured to perform data augmentation on the audio of the second audio subclass in a preset class-by-class augmentation manner to obtain a training audio set, wherein the second audio subclass is contained in the first audio subclass.
In another aspect, the present invention provides a keyword detection model training apparatus, comprising the keyword dataset acquisition apparatus described above and a model training unit, wherein
the model training unit is configured to train a preset keyword detection model based on the training audio set to obtain a trained keyword detection model.
In another aspect, the present invention further provides a keyword dataset processing apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method when executing the computer program.
In another aspect, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described above.
According to the invention, an initial audio set and a corresponding initial audio dictionary are obtained; audio subclass division is performed on the initial audio set according to the initial audio dictionary and a preset audio subclass division rule to obtain a first audio subclass; and data augmentation is performed on the audio of a second audio subclass, contained in the first audio subclass, in a preset class-by-class augmentation manner to obtain a training audio set. The complexity of the audio thus increases progressively, abrupt changes in audio features are avoided, the training audio set is obtained with minimal resource consumption and workload, and the quality of the training audio set is guaranteed.
Drawings
Fig. 1 is a flowchart of the implementation of a keyword dataset acquisition method according to the first embodiment of the present invention;
Fig. 2 is a flowchart of the implementation of a keyword detection model training method according to the second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a keyword dataset acquisition apparatus according to the third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a keyword detection model training apparatus according to the fourth embodiment of the present invention; and
Fig. 5 is a schematic structural diagram of a keyword dataset processing device according to the fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of specific implementations of the present invention is provided in conjunction with specific embodiments:
the first embodiment is as follows:
Fig. 1 shows the implementation flow of the keyword dataset acquisition method provided by the first embodiment of the present invention; for convenience of description, only the parts related to this embodiment are shown, detailed as follows:
in step S101, an initial audio set and an initial audio dictionary corresponding to the initial audio set are acquired.
The embodiment of the present invention is applicable to a keyword dataset processing device, which may be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), or another terminal device.
In the embodiment of the present invention, the initial audio set includes a plurality of keyword audios, where a keyword audio may be wake-up word audio, such as "tianmaoling" (Tmall Genie) and similar smart-assistant wake words, or command word audio that often appears in control devices or human-machine dialogue, such as "turn on the air conditioner", "turn off the television", or "change the song".
When obtaining the initial audio set, a plurality of keyword audios are collected, and each keyword audio can be collected under various environment variables according to actual needs. The audio acquisition scenes include an indoor quiet scene for collecting quiet audio and an indoor noisy scene for collecting noisy audio; the numbers of audios corresponding to speakers of different genders in the initial audio set are generally kept relatively balanced; and the speech speed of the audio in the initial audio set may include three types: fast, medium, and slow. It should be noted that "balanced" in this embodiment does not mean strictly equal in number.
If the environment variables include the type of audio acquisition device, then when collecting keyword audio, the choice of device must preserve the authenticity of the recorded audio. For example, the devices may include mobile phones, voice recorders, and the like with their noise-reduction or enhancement functions disabled, so that the training data of the keyword detection model stays consistent with the characteristics of the audio stream collected in real time when the product is actually used.
If the environment variables include the audio acquisition scene, then when collecting keyword audio, consider that most application scenarios of keyword detection are indoors. Because audio recorded in a quiet environment has less background noise, the noise intensity is easier to control during noise-adding augmentation, so high-quality audio under multiple noise conditions can be generated from it; audio recorded in a noisy environment serves as a supplement to the complex audio obtained after augmentation, ensuring that the speech model can still wake the corresponding device in a slightly noisy environment. Therefore, both an indoor quiet environment and an indoor noisy environment can be selected for keyword audio collection. The background sound in the noisy environment may be chosen according to the application setting, such as human speech, music, and the like, which is not limited here.
If the environment variables include the speaker's gender, then when collecting keyword audio, the voice characteristics of individual speakers should be considered: the more speakers the better, and the gender ratio should be as balanced as possible. Further, if the customer positioning of the keyword detection device has an age requirement, the proportion of speakers in the targeted age group should be increased accordingly.
If the environment variables include the speaker's speech speed, consider that speech speeds vary across speakers but mostly fall within a certain interval; keyword audio can therefore be collected at fast, medium, and slow speeds to facilitate later augmentation.
If the environment variables include the distance between the speaker and the audio acquisition device, consider that this distance may affect the quality of the collected keyword audio, for example its loudness and degree of reverberation; various recording distances can therefore be used to control the sound effect of the collected keyword audio. The specific acquisition distances can be set according to the application requirements of the product.
It should be noted that, in order to make the wake-up effect of the keyword audio consistent across different environments, the numbers of audios under different environment variables need to be as balanced as possible. Considering the many factors that influence recorded audio, the number of keyword audios under each environment variable cannot be kept consistent at collection time, so data balance is usually achieved through data augmentation. To facilitate augmentation, each keyword audio in the initial audio set needs to be annotated, generating an initial audio dictionary that contains the information of each keyword audio in the initial audio set. Preferably, the initial audio dictionary includes first audio annotation information for each keyword audio, comprising the audio acquisition scene and the speaker's speech speed, to meet the data augmentation requirements. The first audio annotation information may further include the type of audio acquisition device, the speaker's gender and age, the distance between the speaker and the acquisition device, and so on.
Of course, the initial audio dictionary may also include information other than the above, such as an audio identifier, which may be an audio name, etc.
Illustratively, suppose there are m audio acquisition devices, denoted 1 to m, and n scenes, denoted 1 to n; male and female speakers are denoted 0 and 1, respectively; the age groups 0-10, 10-30, 30-60, and over 60 are denoted 0, 1, 2, and 3; fast, medium, and slow speech speeds are denoted 0, 1, and 2; and speaker-to-device distances of 0-20 cm, 20 cm-1 m, and 1 m-3 m are denoted 0, 1, and 2. If a keyword audio is recorded by device 1 in scene 2 by a 20-year-old man speaking fast at a distance of 20 cm, its first audio annotation information is "1-2-0-1-0-0", i.e., annotation information in the form "device-scene-gender-age-speech speed-distance".
When generating the initial audio dictionary corresponding to the initial audio set, all collected keyword audios can be traversed, and the dictionary is generated from the identifier of each keyword audio together with its first audio annotation information. As an example, the initial audio dictionary contains the audio identifier of each keyword audio and the corresponding first audio annotation information, e.g., dict1 = {'audio1.wav': '1-2-0-1-0-0', 'audio2.wav': '0-0-1-1-0-0', ...}.
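The annotation scheme above lends itself to a straightforward implementation. The following Python sketch shows one way the annotation string and the initial audio dictionary could be built; the AudioMeta fields, the first_annotation helper, and the file names are illustrative assumptions, not code from the patent.

```python
# Hypothetical sketch of the "device-scene-gender-age-speech speed-distance"
# encoding; field names and values are assumptions based on the text above.
from dataclasses import dataclass

@dataclass
class AudioMeta:
    device: int    # 1..m
    scene: int     # 1..n
    gender: int    # 0 = male, 1 = female
    age: int       # 0: 0-10, 1: 10-30, 2: 30-60, 3: over 60
    speed: int     # 0 = fast, 1 = medium, 2 = slow
    distance: int  # 0: 0-20 cm, 1: 20 cm-1 m, 2: 1 m-3 m

def first_annotation(meta: AudioMeta) -> str:
    """Encode metadata as 'device-scene-gender-age-speech speed-distance'."""
    return "-".join(str(v) for v in (meta.device, meta.scene, meta.gender,
                                     meta.age, meta.speed, meta.distance))

# Traverse the collected audios and build the initial audio dictionary.
initial_dict = {
    "audio1.wav": first_annotation(AudioMeta(1, 2, 0, 1, 0, 0)),  # '1-2-0-1-0-0'
    "audio2.wav": first_annotation(AudioMeta(0, 0, 1, 1, 0, 0)),  # '0-0-1-1-0-0'
}
```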
Considering that, for a fixed keyword, the ideal voice-wakeup effect is a higher wake-up rate and a lower false wake-up rate, and that the main factors causing false wake-ups are words whose pronunciation approximates the wake word and incomplete keyword pronunciations, the initial audio set preferably includes positive samples and negative samples: the positive samples are keyword audio, and the negative samples include approximate pronunciation word audio of the keyword, incomplete keyword audio, and background audio of the keyword in the corresponding application scene, so as to improve the quality of the initial audio set and thus the keyword detection effect. In a specific implementation, audio acquisition may be performed in two stages, where the first stage is the keyword audio acquisition described above, i.e., the collection of positive samples, and the second stage is supplementary acquisition, i.e., the collection of negative samples, so that the choice of data augmentation methods is integrated into the data acquisition process.
An approximate pronunciation word of a keyword has two characteristics: it has the same number of characters as the keyword, and some of its syllables or tones deviate slightly. For example, when the keyword is "tianmaoling", an approximate pronunciation word may be a near-homophone differing in one syllable or tone. In a specific implementation, the audio of such approximate pronunciation words can be added to the initial dataset as negative samples, which improves the coverage of the training audio set, ensures the uniqueness of the keyword, and further reduces the false wake-up rate of voice wakeup.
Incomplete keyword audio refers to speech containing part of the keyword's pronunciation. For example, when the keyword is "tianmaoling", people may mention in conversation "go to the tianmaoling supermarket for a while to buy things"; such audio is collected so that the device does not catch the partial word in the audio stream and falsely wake the voice-wakeup device. In a specific implementation, incomplete keyword audio can be added to the initial dataset as negative samples to improve the coverage of the training audio set and reduce the false wake-up rate of voice wakeup.
The background audio refers to the background sound in the application scene of the product corresponding to the keyword. In a specific implementation, background sound can be added to the initial dataset as negative samples to improve the coverage of the training audio set and reduce false wake-ups caused by background sound.
The specific collection of approximate pronunciation word audio and incomplete keyword audio can follow the collection method for keyword audio. Background audio can be collected by placing an additional audio acquisition device at a fixed position during keyword collection and keeping only the non-keyword audio recorded by that device as background audio. After the second-stage acquisition is finished, the first audio annotation information of the audio acquired in the second stage is added to the initial audio dictionary. The audio acquired in the second stage then participates in subsequent data augmentation together with the keyword audio.
Further, the first audio annotation information also includes an audio type, determined according to the positive and negative samples. If the negative samples include approximate pronunciation word audio, incomplete keyword audio, and background audio, the audio types include keyword audio, approximate pronunciation word audio, incomplete keyword audio, and background audio, which facilitates the subsequent audio subclass division. Illustratively, the first audio annotation information is then represented in the form "word type-device-scene-gender-age-speech speed-distance", where keyword audio, approximate pronunciation word audio, incomplete word audio, and background audio are denoted 1, 2, 3, and 4, respectively.
In step S102, audio subclass division is performed on the initial audio set according to the initial audio dictionary and a preset audio subclass division rule to obtain a first audio subclass.
In the embodiment of the invention, the original data needs to be augmented when training the keyword detection model by deep learning, and different augmentation methods depend on the characteristics of the original data. In general, the standard for data augmentation is to change the data only slightly, so that the augmented data differs little from the original data, preserves its structure, and avoids producing bad data. Therefore, in this embodiment, after the initial audio set is obtained, audio subclass division is performed on it according to the initial audio dictionary and a preset audio subclass division rule, so that different audio subclasses can be augmented in their corresponding manners.
The audio subclass division rule may include division according to the initial audio dictionary and/or division according to the data augmentation method, where the data augmentation methods may include noise adding, speech-speed adjustment, pitch adjustment, and the like. For example, the subclass division rule may divide audio subclasses according to one or more of: the type of audio acquisition device in the initial audio dictionary, the audio acquisition scene, the speaker's gender, the speaker's speech speed, the distance between the speaker and the acquisition device, whether the keyword audio has been noise-added during augmentation, whether its speech speed has been adjusted, and whether its pitch has been adjusted. All audio subclasses obtained after the division constitute the first audio subclass.
For example, dividing by audio acquisition scene yields quiet audio and noisy audio; dividing by the speech speed of the keyword audio yields fast audio, medium-speed audio, and slow audio; dividing by whether the keyword audio has been noise-added yields noise-added audio and non-noise-added audio; and dividing by whether its speech speed has been adjusted yields speech-speed-adjusted audio and non-adjusted audio. If the division is performed according to the audio acquisition scene, the speaker's speech speed, and whether speech-speed adjustment has been performed during augmentation, the resulting first audio subclass may include quiet audio, noisy audio, fast audio, medium-speed audio, slow audio, noise-added audio, non-noise-added audio, speech-speed-adjusted audio, and non-adjusted audio. It should be noted that the number of audios in the first audio subclass changes dynamically during data augmentation; before any augmentation is performed, the numbers of noise-added audio, non-noise-added audio, speech-speed-adjusted audio, and non-adjusted audio are all zero.
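To make the division concrete, the sketch below groups audio identifiers into subclasses by parsing the annotation strings of the audio dictionary; the field indices, the scene convention (scene 1 = indoor quiet), and the divide_subclasses helper are assumptions layered on the annotation sketch above.

```python
# Hedged sketch of audio subclass division by parsing annotation tags of the
# form 'device-scene-gender-age-speech speed-distance' (assumed layout).
from collections import defaultdict

SCENE, SPEED = 1, 4  # annotation field indices (illustrative)

def divide_subclasses(audio_dict: dict) -> dict:
    """Group audio identifiers into subclasses by their annotation fields."""
    subclasses = defaultdict(list)
    for name, tag in audio_dict.items():
        fields = tag.split("-")
        # acquisition scene: assume scene '1' is indoor quiet, others noisy
        subclasses["quiet" if fields[SCENE] == "1" else "noisy"].append(name)
        # speech speed: 0 = fast, 1 = medium, 2 = slow
        subclasses[("fast", "medium", "slow")[int(fields[SPEED])]].append(name)
    return subclasses

first_subclass = divide_subclasses(initial_dict)  # initial_dict from the sketch above
```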
Of course, the division rule may also include rules other than those above, determined according to the actual augmentation requirements and not limited herein.
In step S103, data augmentation is performed on the audio of the second audio subclass in a preset class-by-class augmentation manner to obtain a training audio set, where the second audio subclass is contained in the first audio subclass.
In the embodiment of the invention, the audio of the second audio subclass is augmented class by class rather than by simply applying the corresponding augmentation to each audio subclass independently, so that the complexity of the audio increases progressively, abrupt changes in audio features are avoided, and the quality of the training audio set is improved. Class-by-class augmentation means performing data augmentation on each audio subclass under the second audio subclass in turn, following a preset subclass augmentation order. Because augmentation proceeds class by class, the audio of the first audio subclass and the audio of the second audio subclass both change dynamically during augmentation; each time the augmentation of one audio subclass is completed, both are updated accordingly.
Illustratively, the second audio subclass includes a first class, a second class, and a third class of audio. During data augmentation, the first-class audio is obtained from the initial audio set and noise-added to obtain a first audio augmentation set; the second-class audio is obtained from the first audio augmentation set and speech-speed-adjusted to obtain a second audio augmentation set; the third-class audio is then obtained from the second audio augmentation set and pitch-adjusted to obtain a third audio augmentation set. If data augmentation is determined to be complete, the third audio augmentation set can be used as the training audio set; if not, execution jumps back to the step of obtaining quiet audio from the initial audio set until augmentation is complete. Note that in rounds after the first, the quiet audio may be obtained directly from the third audio augmentation set, and part of that quiet audio is noise-added on the basis of the third audio augmentation set.
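A minimal, self-contained sketch of this class-by-class loop is given below; here each "audio" is just a name mapped to a dictionary of state flags, and the augmentation operations merely emit tagged copies as stand-ins for real signal processing. The helper names, the hardcoded "fast" target speed, and the two-round default are assumptions, not the patent's code.

```python
def augment_class_by_class(initial: dict, rounds: int = 2) -> dict:
    """Toy class-by-class augmentation: each entry maps an audio name to its
    state flags; real noise/speed/pitch processing is abstracted away."""
    pool = {k: dict(v) for k, v in initial.items()}
    for r in range(rounds):
        # 1) quiet-class audio -> noise adding -> first audio augmentation set
        for name, t in list(pool.items()):
            if t["scene"] == "quiet" and not t["noised"]:
                pool[f"{name}.n{r}"] = {**t, "scene": "noisy", "noised": True}
        # 2) medium-speed audio -> speech-speed adjustment -> second set
        for name, t in list(pool.items()):
            if t["speed"] == "medium" and not t["speed_adj"]:
                pool[f"{name}.s{r}"] = {**t, "speed": "fast", "speed_adj": True}
        # 3) speed-adjusted audio -> pitch adjustment -> third set
        for name, t in list(pool.items()):
            if t["speed_adj"] and not t["pitch_adj"]:
                pool[f"{name}.p{r}"] = {**t, "pitch_adj": True}
    return pool  # the third audio augmentation set of the final round

third_set = augment_class_by_class({
    "audio1.wav": {"scene": "quiet", "speed": "medium",
                   "noised": False, "speed_adj": False, "pitch_adj": False},
})
```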
Before augmenting the audio of the second audio subclass class by class, preferably a data augmentation target and/or a data augmentation factor is obtained, determined according to the number of audios in each subclass of the first audio subclass, and data augmentation is then performed on the audio of the second audio subclass class by class according to the target and/or factor, which improves the augmentation effect.
The data augmentation target may be the number of augmentation rounds; for example, it may be set to two rounds, where completing data augmentation for every subclass under the second audio subclass counts as completing one round. It should be noted that in rounds after the first, although the quiet audio is nominally obtained from the initial audio set, the first audio augmentation set of such a round is actually generated on the basis of the third audio augmentation set produced when the previous round finished.
The data augmentation target may also be the ratio of quiet audio to noisy audio after augmentation; for example, considering that noisy backgrounds are more common than quiet ones in real environments, this ratio may be set to 2:8. It should be noted that after actual augmentation the ratio may not be exactly 2:8 but a close value.
The data augmentation target can also be a proportional balance of fast audio, medium-speed audio, and slow audio in the training audio set, i.e., roughly equal proportions of the three.
The data augmentation target may also be the ratio of keyword audio to background audio. Considering that users may be especially bothered by false wake-ups caused by background audio during use of the voice-wakeup device, the ratio of the number of positive samples (i.e., keyword audio) to negative samples (i.e., approximate pronunciation word audio of the keyword, incomplete keyword audio, and background audio of the keyword in the corresponding application scene) after augmentation may be controlled at 2:1.
Of course, the data augmentation target may also combine the above approaches, as determined by the actual augmentation requirements.
The data augmentation factor may include one or more of a random selection factor and an adjustment factor. The random selection factor indicates the proportion of audio selected when augmenting a given subclass of the second audio subclass, and may include one or more of a noise-adding random selection factor, a speech-speed-adjustment random selection factor, a pitch-adjustment random selection factor, and a volume-normalization random selection factor. For example, setting the noise-adding random selection factor for quiet audio to 1/2 means that half of the quiet audio is randomly selected for noise adding. The adjustment factor indicates the adjustment ratio when audio in a subclass of the second audio subclass is adjusted (e.g., in pitch or speech speed), and may further include an adjustment step size. For example, the speech-speed adjustment factor may be taken from the interval [0.85, 1.25]: a factor smaller than 1 slows the speech down, and a factor larger than 1 speeds it up. Further, to increase the number of fast audios, a smaller step size can be used when sampling values in [1, 1.25], raising the probability of fast-speed adjustments and thereby the proportion of fast audio.
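As an illustration of the two kinds of factors, the sketch below randomly selects a share of a subclass and draws a speech-speed factor from [0.85, 1.25] with a finer step above 1.0; the grids and helper names are assumptions that follow the numeric examples in the text.

```python
import random

def pick_for_augmentation(names: list, select_factor: float = 0.5) -> list:
    """Randomly pick a share of a subclass, e.g. half of the quiet audio."""
    return random.sample(names, int(len(names) * select_factor))

def sample_speed_factor() -> float:
    """Speech-speed factor in [0.85, 1.25]; < 1 slows down, > 1 speeds up.
    The finer step above 1.0 makes speed-ups (hence fast audio) more likely."""
    slow = [0.85, 0.90, 0.95]                                  # step 0.05 below 1.0
    fast = [round(1.0 + 0.025 * i, 3) for i in range(1, 11)]   # step 0.025 up to 1.25
    return random.choice(slow + fast)
```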
Preferably, the training audio set includes the initial audio set, which dilutes the influence on model training of the small amount of poor-quality audio generated during augmentation, improves the quality of the training audio set, and prevents the keyword detection model from over-learning poor-quality audio. Poor-quality audio may include audio whose keyword is drowned out by the added noise and/or audio whose speech speed is so fast that the content is hard to make out.
Preferably, the second audio subclass includes quiet audio, medium-speed audio, and speech-speed-adjusted audio. When performing data augmentation on the audio of the second audio subclass in the preset class-by-class augmentation manner, the quiet audio is obtained from the initial audio set and noise-added to obtain a first audio augmentation set; the medium-speed audio is obtained from the first audio augmentation set and speech-speed-adjusted to obtain a second audio augmentation set; and the speech-speed-adjusted audio is obtained from the second audio augmentation set and pitch-adjusted to obtain a third audio augmentation set, thereby achieving class-by-class augmentation of the audio of the second audio subclass. In a specific implementation, when adding noise to the quiet audio, audio under different noise intensities can be simulated by controlling the signal-to-noise ratio, and either part of the quiet audio is randomly selected for noise adding according to a noise-adding random selection factor, or all of the quiet audio is noise-added. When adjusting the speech speed of the medium-speed audio, the audio can be adjusted according to the random selection factor and the speed adjustment factor, so that the augmented audio covers the speaking speeds of people in different situations as much as possible. When adjusting pitch, part of the audio can be randomly selected from the speech-speed-adjusted class according to the pitch-adjustment random selection factor; the purpose of this adjustment is to simulate the pronunciation of more different speakers, and audio distortion should be avoided during pitch adjustment.
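The signal-to-noise-ratio control mentioned above can be sketched as follows, assuming 16-bit mono PCM arrays; this is one conventional mixing formula, not necessarily the patent's implementation.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech so that the result has the requested SNR in dB."""
    noise = np.resize(noise, speech.shape)                  # loop/trim noise to length
    p_speech = np.mean(speech.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2) + 1e-12
    # scale so that p_speech / (scale^2 * p_noise) == 10^(snr_db / 10)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    mixed = speech.astype(np.float64) + scale * noise.astype(np.float64)
    return np.clip(mixed, -32768, 32767).astype(np.int16)   # back to 16-bit PCM
```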
After pitch adjustment of the speech-speed-adjusted audio, it is preferably judged whether the stopping condition for data augmentation is met. If it is, data augmentation is determined to be complete and the third audio augmentation set can be used as the training audio set; if not, execution jumps back to the step of obtaining quiet audio from the initial audio set until augmentation is complete, so that multiple rounds of data augmentation finish automatically.
After the stopping condition for data augmentation is met, audio is selected from the third audio augmentation set according to a preset selection rule and volume-normalized, and the normalized audio set is used as the training audio set. This further improves the quality of the training audio set, lets the keyword detection model learn audio at different volumes, and prevents the model from becoming sensitive to one particular volume during learning. The preset selection rule may be, for example, selection according to a preset volume-normalization random selection factor.
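The volume normalization method is not specified further in the text; a simple peak-normalization sketch, assuming a float waveform in [-1, 1], could look like this:

```python
import numpy as np

def normalize_volume(pcm: np.ndarray, target_peak: float = 0.9) -> np.ndarray:
    """Peak-normalize a float waveform in [-1, 1] to the target level."""
    peak = float(np.max(np.abs(pcm))) + 1e-12
    return (pcm / peak * target_peak).astype(np.float32)
```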
Preferably, the quiet audio is determined according to the initial audio dictionary. Specifically, in rounds of data augmentation after the first, noise may already have been added to part of the quiet audio recorded in the third audio augmentation dictionary, so the determination must combine the audio recording scene with the noise-adding state in that dictionary; to simplify this, the quiet audio can be determined from the audio recording scene in the initial audio dictionary. After the quiet audio is noise-added, the initial audio dictionary is augmentation-labeled, or the third audio augmentation dictionary is updated, to obtain a first audio augmentation dictionary corresponding to the first audio augmentation set. The medium-speed audio is determined according to the first audio augmentation dictionary, specifically according to the speech speed recorded in it; after the speech speed of the medium-speed audio is adjusted, the first audio augmentation dictionary is updated to obtain a second audio augmentation dictionary corresponding to the second audio augmentation set. The speech-speed-adjusted audio is determined according to the second audio augmentation dictionary, specifically according to the pitch-adjustment state in it; after pitch adjustment, the second audio augmentation dictionary is updated to obtain a third audio augmentation dictionary corresponding to the third audio augmentation set. Updating the audio dictionaries thus keeps the audio subclasses up to date. The first audio augmentation dictionary includes the first audio annotation information and second annotation information of each keyword audio in the first audio augmentation set, and the second annotation information includes the noise-adding state, speech-speed-adjustment state, and pitch-adjustment state of the audio, which makes it convenient to obtain the number of audios under each subclass from the audio dictionary during augmentation and guarantees targeted augmentation. Illustratively, the annotation information of the first audio augmentation dictionary takes the form "word type-device-scene-gender-age-speech speed-distance-noised-speed adjusted-pitch adjusted", where quiet and noisy audio are denoted 0 and 1, and for each of the noise-adding, speech-speed-adjustment, and pitch-adjustment states, 0 and 1 denote not applied and applied, respectively.
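The dictionary bookkeeping can be illustrated with the noise-adding step: the sketch below extends a first-annotation tag with the three augmentation-state fields and builds the next augmentation dictionary from the previous one. The tag layout follows the "noised-speed adjusted-pitch adjusted" suffix described above; the helper names and example tag are assumptions.

```python
def with_second_annotation(first_tag: str, noised=0, speed_adj=0, pitch_adj=0) -> str:
    """Extend a first-annotation tag with the three augmentation states (0/1)."""
    return f"{first_tag}-{noised}-{speed_adj}-{pitch_adj}"

def update_after_noise(prev_dict: dict, noised_copies: dict) -> dict:
    """Build the next augmentation dictionary: keep the previous entries and
    register each noise-added copy with its 'noised' flag set."""
    updated = dict(prev_dict)
    for new_name, first_tag in noised_copies.items():
        updated[new_name] = with_second_annotation(first_tag, noised=1)
    return updated

# e.g. first_aug_dict = update_after_noise(initial_dict,
#                                          {"audio1_noisy.wav": "1-1-2-0-1-0-0"})
```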
It should be noted that, before or during augmentation according to the data augmentation target and/or factor, the number of audios under each audio subclass may be obtained from the initial audio dictionary, the first audio augmentation dictionary, the second audio augmentation dictionary, and/or the third audio augmentation dictionary, and the audio used for augmenting a given subclass is then determined according to the target and/or factor.
In the embodiment of the invention, an initial audio set and a corresponding initial audio dictionary are obtained; audio subclass division is performed on the initial audio set according to the initial audio dictionary and a preset audio subclass division rule to obtain a first audio subclass; and data augmentation is performed on the audio of a second audio subclass, contained in the first audio subclass, in a preset class-by-class augmentation manner to obtain a training audio set. The complexity of the audio thus increases progressively, abrupt changes in audio features are avoided, the acquisition of the training audio set is completed with as little resource consumption and workload as possible, and the quality of the training audio set is guaranteed.
Embodiment two:
Fig. 2 shows the implementation flow of the keyword detection model training method provided by the second embodiment of the present invention; for convenience of description, only the parts related to this embodiment are shown, detailed as follows:
in step S201, a training audio set is acquired.
In the embodiment of the present invention, the training audio set can be obtained by the method described in the first embodiment.
In step S202, a preset keyword detection model is trained based on a training audio set, so as to obtain a trained keyword detection model.
In the embodiment of the invention, the keyword detection model is trained based on the training audio set to obtain a trained keyword detection model. After training is complete, the keyword detection model may be deployed to the voice-wakeup device, or to a server connected to the voice-wakeup device.
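The patent does not fix a model architecture or framework; as a hedged illustration only, a minimal training loop over spectrogram features with binary keyword/non-keyword labels might look like the following PyTorch sketch, where the tiny CNN and the data-loader interface are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Placeholder model: a tiny CNN over (batch, 1, time, freq) spectrogram
# features with two output classes (keyword / non-keyword).
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(8), nn.Flatten(),
    nn.Linear(16 * 8 * 8, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train(loader, epochs: int = 10):
    """loader yields (features, labels) batches built from the training audio set."""
    model.train()
    for _ in range(epochs):
        for feats, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(feats), labels)
            loss.backward()
            optimizer.step()
```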
In the embodiment of the invention, an initial audio set and a corresponding initial audio dictionary are obtained; audio subclass division is performed according to the initial audio dictionary and a preset audio subclass division rule to obtain a first audio subclass; data augmentation is performed on the audio of a second audio subclass, contained in the first audio subclass, in a preset class-by-class augmentation manner to obtain a training audio set; and a preset keyword detection model is trained on the training audio set to obtain a trained keyword detection model. The acquisition of the training audio set is thus completed with as little resource consumption and workload as possible, and the training effect of the keyword detection model is guaranteed.
Embodiment three:
Fig. 3 shows the structure of a keyword dataset acquisition apparatus according to the third embodiment of the present invention; for convenience of description, only the parts related to this embodiment are shown. The apparatus includes:
an audio set acquisition unit 31, configured to acquire an initial audio set and an initial audio dictionary corresponding to the initial audio set;
an audio subclass division unit 32, configured to perform audio subclass division on the initial audio set according to the initial audio dictionary and a preset audio subclass division rule to obtain a first audio subclass; and
an audio data augmentation unit 33, configured to perform data augmentation on the audio of the second audio subclass in a preset class-by-class augmentation manner to obtain a training audio set, wherein the second audio subclass is contained in the first audio subclass.
In the embodiment of the present invention, each unit of the keyword dataset acquisition apparatus may be implemented by a corresponding hardware or software unit, and each unit may be an independent software or hardware unit, or may be integrated into a software or hardware unit, which is not limited herein. For the specific implementation of each unit of the keyword dataset acquisition apparatus, reference may be made to the description of the first method embodiment, and details are not repeated here.
Embodiment four:
Fig. 4 shows the structure of a keyword detection model training apparatus according to the fourth embodiment of the present invention; for convenience of description, only the parts related to this embodiment are shown. The apparatus includes:
an audio set acquisition unit 41, configured to acquire an initial audio set and an initial audio dictionary corresponding to the initial audio set;
an audio subclass division unit 42, configured to perform audio subclass division on the initial audio set according to the initial audio dictionary and a preset audio subclass division rule to obtain a first audio subclass;
an audio data augmentation unit 43, configured to perform data augmentation on the audio of the second audio subclass in a preset class-by-class augmentation manner to obtain a training audio set, wherein the second audio subclass is contained in the first audio subclass; and
a model training unit 44, configured to train a preset keyword detection model based on the training audio set to obtain a trained keyword detection model.
In the embodiment of the present invention, each unit of the keyword detection model training apparatus may be implemented by a corresponding hardware or software unit, and each unit may be an independent software or hardware unit, or may be integrated into a software or hardware unit, which is not limited herein. The detailed implementation of each unit of the keyword detection model training apparatus can refer to the description of the first and second embodiments of the method, and will not be described herein again.
Embodiment five:
Fig. 5 shows the structure of a keyword dataset processing device according to the fifth embodiment of the present invention; for convenience of explanation, only the parts related to this embodiment are shown.
The keyword dataset processing device 5 of the embodiment of the present invention includes a processor 50, a memory 51, and a computer program 52 stored in the memory 51 and executable on the processor 50. When executing the computer program 52, the processor 50 implements the steps in the above method embodiments, such as steps S101 to S103 shown in Fig. 1, or the functions of the units in the above apparatus embodiments, such as the functions of units 31 to 33 shown in Fig. 3.
Embodiment six:
in an embodiment of the present invention, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the steps in the above-described method embodiment, for example, steps S101 to S103 shown in fig. 1. Alternatively, the computer program may be adapted to perform the functions of the units of the above-described device embodiments, such as the functions of the units 31 to 33 shown in fig. 3, when executed by the processor.
The computer readable storage medium of the embodiments of the present invention may include any entity or device capable of carrying computer program code, a recording medium, such as a ROM/RAM, a magnetic disk, an optical disk, a flash memory, or the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A keyword dataset acquisition method, the method comprising:
acquiring an initial audio set and an initial audio dictionary corresponding to the initial audio set;
performing audio subclass division on the initial audio set according to the initial audio dictionary and a preset audio subclass division rule to obtain a first audio subclass;
and performing data augmentation on the audio of a second audio subclass in a preset class-by-class augmentation manner to obtain a training audio set, wherein the second audio subclass is contained in the first audio subclass.
2. The method of claim 1, wherein the initial audio dictionary comprises first audio annotation information for each keyword audio, wherein the first audio annotation information comprises an audio capture scene and a speaker's speech rate.
3. The method of claim 2, wherein the initial audio set includes positive samples and negative samples, the positive samples being keyword audio, and the negative samples including approximate pronunciation word audio of the keyword, incomplete keyword audio, and background audio of the keyword in a corresponding application scene;
before the step of performing data augmentation on the audio of the second audio subclass in a preset class-by-class augmentation manner, the method comprises:
acquiring a data augmentation target and/or a data augmentation factor, wherein the data augmentation target and/or the data augmentation factor are determined according to the acquired audio number of each subclass in the first audio subclass;
and the step of performing data augmentation on the audio of the second audio subclass in a preset class-by-class augmentation manner comprises:
performing data augmentation on the audio of the second audio subclass in the class-by-class augmentation manner according to the data augmentation target and/or the data augmentation factor.
4. The method of claim 1, wherein the training audio set includes the initial audio set, the second audio subclass includes quiet audio, medium-speed audio, and speech-speed-adjusted audio, and the step of performing data augmentation on the audio of the second audio subclass in a preset class-by-class augmentation manner comprises:
obtaining the quiet audio from the initial audio set, and performing noise-adding processing on the quiet audio to obtain a first audio augmentation set;
acquiring the medium-speed audio from the first audio augmentation set, and performing speech-speed adjustment on the medium-speed audio to obtain a second audio augmentation set;
acquiring the speech-speed-adjusted audio from the second audio augmentation set, and performing pitch adjustment on the speech-speed-adjusted audio to obtain a third audio augmentation set;
if a stopping condition for data augmentation is met, selecting audio from the third audio augmentation set according to a preset selection rule, and performing volume normalization on the selected audio to obtain the training audio set;
and if the stopping condition is not met, jumping back to the step of obtaining the quiet audio from the initial audio set.
5. The method of claim 4, wherein the quiet audio is determined according to the initial audio dictionary, and after the step of performing noise-adding processing on the quiet audio, the method further comprises:
performing augmentation labeling on the initial audio dictionary, or updating a third audio augmentation dictionary, to obtain a first audio augmentation dictionary corresponding to the first audio augmentation set;
the medium-speed audio is determined according to the first audio augmentation dictionary, and after the step of performing speech-speed adjustment on the medium-speed audio, the method further comprises:
updating the first audio augmentation dictionary to obtain a second audio augmentation dictionary corresponding to the second audio augmentation set;
the speech-speed-adjusted audio is determined according to the second audio augmentation dictionary, and after the step of performing pitch adjustment on the speech-speed-adjusted audio, the method further comprises:
updating the second audio augmentation dictionary to obtain a third audio augmentation dictionary corresponding to the third audio augmentation set;
wherein the first audio augmentation dictionary comprises first audio annotation information and second annotation information of each keyword audio in the first audio augmentation set, and the second annotation information comprises the noise-adding state, the speech-speed-adjustment state, and the pitch-adjustment state of the audio.
6. A method for training a keyword detection model, the method comprising:
obtaining a training audio set, wherein the training audio set is obtained by the method of any one of claims 1-5;
and training a preset keyword detection model based on the training audio set to obtain a trained keyword detection model.
7. A keyword dataset acquisition apparatus, characterized in that the apparatus comprises:
an audio set acquisition unit, configured to acquire an initial audio set and an initial audio dictionary corresponding to the initial audio set;
an audio subclass division unit, configured to perform audio subclass division on the initial audio set according to the initial audio dictionary and a preset audio subclass division rule to obtain a first audio subclass; and
an audio data amplification unit, configured to perform data amplification on the audio of a second audio subclass in a preset class-by-class amplification mode to obtain a training audio set, wherein the second audio subclass is contained in the first audio subclass.
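For illustration only: one way the three claimed units could be composed in code; the class name and the callable shapes are hypothetical.

```python
class KeywordDatasetAcquisitionApparatus:
    """Hypothetical composition of the three units of claim 7."""
    def __init__(self, acquire, subclassify, augment):
        self.acquire = acquire          # audio set acquisition unit
        self.subclassify = subclassify  # audio subclass division unit
        self.augment = augment          # audio data amplification unit

    def build_training_set(self):
        audio_set, dictionary = self.acquire()
        first_subclasses = self.subclassify(audio_set, dictionary)
        return self.augment(first_subclasses)
```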
8. A keyword detection model training apparatus, comprising the keyword dataset acquisition apparatus according to claim 7 and a model training unit, wherein
the model training unit is configured to train a preset keyword detection model based on the training audio set to obtain a trained keyword detection model.
9. A keyword dataset processing apparatus comprising a memory, a processor and a computer program stored in said memory and executable on said processor, characterized in that said processor implements the steps of the method according to any of the claims 1 to 6 when executing said computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202210274759.3A 2022-03-21 2022-03-21 Keyword dataset acquisition and model training methods, devices, equipment and medium Pending CN114360523A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210274759.3A CN114360523A (en) 2022-03-21 2022-03-21 Keyword dataset acquisition and model training methods, devices, equipment and medium

Publications (1)

Publication Number Publication Date
CN114360523A (en) 2022-04-15

Family

ID=81095106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210274759.3A Pending CN114360523A (en) 2022-03-21 2022-03-21 Keyword dataset acquisition and model training methods, devices, equipment and medium

Country Status (1)

Country Link
CN (1) CN114360523A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105304078A (en) * 2015-10-28 2016-02-03 中国电子科技集团公司第三研究所 Target sound data training device and target sound data training method
CN110808033A (en) * 2019-09-25 2020-02-18 武汉科技大学 Audio classification method based on dual data enhancement strategy
CN111145730A (en) * 2019-12-30 2020-05-12 苏州思必驰信息科技有限公司 Method and system for optimizing speech recognition model
CN111209429A (en) * 2020-04-20 2020-05-29 北京海天瑞声科技股份有限公司 Unsupervised model training method and unsupervised model training device for measuring coverage of voice database
CN112185425A (en) * 2019-07-05 2021-01-05 阿里巴巴集团控股有限公司 Audio signal processing method, device, equipment and storage medium
US20210065733A1 (en) * 2019-08-29 2021-03-04 Mentor Graphics Corporation Audio data augmentation for machine learning object classification
CN112885330A (en) * 2021-01-26 2021-06-01 北京云上曲率科技有限公司 Language identification method and system based on low-resource audio
CN112951206A (en) * 2021-02-08 2021-06-11 天津大学 Tibetan Tibet dialect spoken language identification method based on deep time delay neural network
CN113223499A (en) * 2021-04-12 2021-08-06 青岛信芯微电子科技股份有限公司 Audio negative sample generation method and device
CN113450776A (en) * 2020-03-24 2021-09-28 合肥君正科技有限公司 Data enhancement method and system for improving crying detection model effect of baby

Similar Documents

Publication Publication Date Title
Schuller et al. The INTERSPEECH 2021 computational paralinguistics challenge: COVID-19 cough, COVID-19 speech, escalation & primates
US11848018B2 (en) Utterance classifier
CN105074822B (en) Device and method for audio classification and processing
CN104080024B (en) Volume leveller controller and control method and audio classifiers
CN107799126A (en) Sound end detecting method and device based on Supervised machine learning
CN109189980A (en) The method and electronic equipment of interactive voice are carried out with user
CN110475170A (en) Control method, device, mobile terminal and the storage medium of earphone broadcast state
US11823655B2 (en) Synthetic speech processing
CN108922525B (en) Voice processing method, device, storage medium and electronic equipment
US11626104B2 (en) User speech profile management
CN110728991B (en) Improved recording equipment identification algorithm
US20230377574A1 (en) Word selection for natural language interface
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
US11769491B1 (en) Performing utterance detection using convolution
CN114360523A (en) Keyword dataset acquisition and model training methods, devices, equipment and medium
US11783805B1 (en) Voice user interface notification ordering
CN114664303A (en) Continuous voice instruction rapid recognition control system
CN112017662B (en) Control instruction determining method, device, electronic equipment and storage medium
CN112509556B (en) Voice awakening method and device
CN111767083A (en) Method for collecting false wake-up audio data, playing device, electronic device and medium
WO2022218027A1 (en) Audio playing method and apparatus, and computer-readable storage medium and electronic device
US11893996B1 (en) Supplemental content output
CN117083673A (en) Context aware audio processing
WO2022232457A1 (en) Context aware audio processing
CN117649848A (en) Speech signal processing apparatus and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220415