CN114758665A - Audio data enhancement method and device, electronic equipment and storage medium - Google Patents

Audio data enhancement method and device, electronic equipment and storage medium

Info

Publication number
CN114758665A
CN114758665A
Authority
CN
China
Prior art keywords
audio
data
task
enhancement
sample data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210666591.0A
Other languages
Chinese (zh)
Other versions
CN114758665B (en)
Inventor
郑鑫江
凌明
杨作兴
艾国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen MicroBT Electronics Technology Co Ltd
Original Assignee
Shenzhen MicroBT Electronics Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen MicroBT Electronics Technology Co Ltd filed Critical Shenzhen MicroBT Electronics Technology Co Ltd
Priority to CN202210666591.0A priority Critical patent/CN114758665B/en
Publication of CN114758665A publication Critical patent/CN114758665A/en
Application granted granted Critical
Publication of CN114758665B publication Critical patent/CN114758665B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present disclosure relates to an audio data enhancement method, apparatus, electronic device, and storage medium. The method includes: determining an audio recognition task, where the audio recognition task is a keyword detection task and/or a sound event detection task; receiving audio data associated with the audio recognition task; splitting and recombining the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task; and obtaining audio training samples for the audio recognition task according to the enhancement sample data and the audio recognition task. By splitting and recombining the audio data, the obtained audio training samples have more prominent keyword features for the keyword detection task, or more prominent sound features for the sound event detection task. This can improve the speech recognition accuracy of the keyword detection task, shorten the detection response time of the sound event detection task, and thereby improve the user experience of the keyword detection task and/or the sound event detection task.

Description

Audio data enhancement method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computers, and in particular, to an audio data enhancement method and apparatus, an electronic device, and a storage medium.
Background
Currently, Keyword Spotting (KWS, also called keyword detection) and Sound Event Detection (SED) are two voice tasks commonly used on edge smart voice devices.
The keyword detection task needs to reduce the false wake-up rate while ensuring the detection rate, and the sound event detection task needs the detection delay to be as short as possible; that is, the closer the detection time point is to the time point at which the sound event occurs, the better.
Existing keyword detection tasks and/or sound event detection tasks are generally implemented with deep learning methods. Data enhancement is an important speech data processing technique adopted in deep learning methods for the keyword detection task and/or the sound event detection task. At present, data enhancement modes for these tasks mainly include adding noise, adjusting audio speed, adjusting the audio fundamental frequency, time-domain shifting, adjusting volume, and the like.
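For background, the conventional enhancement modes listed above can be sketched as follows. This is a minimal illustration assuming the waveform is a mono float numpy array; the function names and parameters are ours, not the patent's:

```python
import numpy as np

def add_noise(wav: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Mix white noise into the waveform at a given signal-to-noise ratio."""
    signal_power = np.mean(wav ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.randn(len(wav)) * np.sqrt(noise_power)
    return wav + noise

def adjust_volume(wav: np.ndarray, gain_db: float) -> np.ndarray:
    """Scale the amplitude by a gain expressed in decibels."""
    return wav * (10 ** (gain_db / 20))

def time_shift(wav: np.ndarray, max_shift: int) -> np.ndarray:
    """Shift the waveform in the time domain, zero-padding the vacated samples."""
    shift = np.random.randint(-max_shift, max_shift + 1)
    if shift > 0:
        return np.concatenate([np.zeros(shift), wav[:-shift]])
    if shift < 0:
        return np.concatenate([wav[-shift:], np.zeros(-shift)])
    return wav
```

Speed and fundamental-frequency adjustment are usually delegated to a resampling or pitch-shifting library and are omitted here.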
For the keyword detection task, these data enhancement modes can still leave the model prone to false wake-ups on incomplete keywords caused by similar pronunciation, word truncation, and the like. For example, if the wake-up word is "small and tiny", a phrase such as "smiling", which in the original language shares truncated syllables and similar pronunciation with the wake-up word, may cause the device to wake up. In that case, when the user utters the similar-sounding phrase in daily speech, the device may be falsely triggered, which reduces the accuracy of speech recognition and degrades the user experience.
For the sound event detection task, the detection response time may be too long. For example, in infant-cry event detection, the edge smart voice device and the infant are in one room while the user is, for some reason, in another room; the user relies on the edge smart voice device to detect whether the infant is crying so as to enter the infant's room and take care of it. In this case, if it takes too long to detect the crying after the infant starts to cry, the user may not learn of the crying in time and thus may not take corresponding measures in time.
Therefore, audio data enhancement still calls for further improvement and development.
Disclosure of Invention
In view of the above, the present disclosure provides an audio data enhancement method, apparatus, electronic device, and storage medium, so as to improve the speech recognition accuracy of the keyword detection task, shorten the detection response time of the sound event detection task, and improve the user experience of the keyword detection task and/or the sound event detection task.
The technical solution of the present disclosure is realized as follows:
a method of audio data enhancement, comprising:
determining an audio recognition task, wherein the audio recognition task is a keyword detection task and/or a sound event detection task;
Receiving audio data associated with the audio recognition task;
according to the audio recognition task, splitting and recombining the audio data to obtain enhanced sample data aiming at the audio recognition task;
and obtaining an audio training sample aiming at the audio recognition task according to the enhancement sample data and the audio recognition task.
Further, when the audio recognition task is a keyword detection task, splitting and recombining the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task includes:
cutting off non-voice data in the audio data;
segmenting the audio data according to the voice duration in the audio data and the number of words in the keyword associated with the keyword detection task, to obtain at least two segments of audio sub-data;
obtaining initial audio sample data according to the at least two segments of audio sub-data; and
splicing interference audio data onto both ends of the initial audio sample data to obtain enhancement sample data for the keyword detection task, wherein the interference audio data is derived from non-trigger audio data in the training data associated with the keyword detection task.
Further, the cutting off of the non-voice data in the audio data is implemented using a voice activity detection (VAD) method.
Further, obtaining the initial audio sample data according to the at least two segments of audio sub-data includes:
determining each segment of the at least two segments of audio sub-data as initial audio sample data; and
randomly arranging and splicing any two or more of the at least two segments of audio sub-data to obtain initial audio sample data.
Further, in a case where the audio recognition task is a keyword detection task, obtaining audio training samples for the audio recognition task according to the enhancement sample data and the audio recognition task includes:
in a case where the audio content in the enhancement sample data is inconsistent with the keyword of the keyword detection task, determining the annotation information associated with the enhancement sample data as a non-trigger category, and determining the enhancement sample data and its associated annotation information as a non-trigger-category audio training sample for the keyword detection task.
Further, in a case where the audio recognition task is a keyword detection task, obtaining audio training samples for the audio recognition task according to the enhancement sample data and the audio recognition task includes:
in a case where the audio content in the enhancement sample data is consistent with the keyword of the keyword detection task, determining the annotation information associated with the enhancement sample data as a trigger category, and determining the enhancement sample data and its associated annotation information as a trigger-category audio training sample for the keyword detection task.
Further, when the audio recognition task is a sound event detection task, splitting and recombining the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task includes:
in a case where the time length of the audio data is within a preset time length threshold range, acquiring audio sub-segment data satisfying a preset time length condition from the audio data, and determining the audio sub-segment data as initial audio sample data; and
splicing interference audio data onto both ends of the initial audio sample data to obtain enhancement sample data for the sound event detection task, wherein the interference audio data is derived from non-trigger audio data in the training data associated with the sound event detection task.
Further, the audio data enhancement method further includes:
discarding the audio data if the time length of the audio data is outside the time length threshold range.
Further, the time length threshold range covers time lengths greater than or equal to half of the time length of the enhancement sample data; and
the preset time length condition is being greater than or equal to half of the time length of the enhancement sample data and less than the time length of the enhancement sample data.
Further, in a case where the audio recognition task is a sound event detection task, obtaining audio training samples for the audio recognition task according to the enhancement sample data and the audio recognition task includes:
determining the annotation information associated with the enhancement sample data as a trigger category, and determining the enhancement sample data and its associated annotation information as a trigger-category audio training sample for the sound event detection task.
Further, after the audio training samples for the audio recognition task are obtained, the audio data enhancement method further comprises:
training a joint network model for executing the keyword detection task and/or the sound event detection task based on the audio training samples; and
executing at least one of the keyword detection task and the sound event detection task by using the trained joint network model.
An audio data enhancement apparatus, comprising:
a task determination module configured to determine an audio recognition task, wherein the audio recognition task is a keyword detection task and/or a sound event detection task;
a data receiving module configured to receive audio data associated with the audio recognition task;
a splitting and recombination module configured to split and recombine the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task; and
a sample acquisition module configured to obtain audio training samples for the audio recognition task according to the enhancement sample data and the audio recognition task.
An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute the executable instructions to implement the audio data enhancement method as described in any one of the above.
A computer-readable storage medium storing at least one instruction which, when executed by a processor of an electronic device, enables the electronic device to implement the audio data enhancement method as described in any one of the above.
According to the above technical solution, the received audio data is split and recombined according to the audio recognition task to obtain enhancement sample data for the audio recognition task, and audio training samples for the audio recognition task are then obtained. The resulting audio training samples have more prominent keyword features for the keyword detection task, or more prominent sound features for the sound event detection task, which can improve the speech recognition accuracy of the keyword detection task and shorten the detection response time of the sound event detection task, thereby improving the user experience of the keyword detection task and/or the sound event detection task.
Drawings
FIG. 1 is a flow diagram illustrating a method of audio data enhancement according to an exemplary embodiment;
FIG. 2 is a flowchart illustrating an audio data splitting reassembly process for an audio recognition task, in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating segmentation of audio data in accordance with an exemplary embodiment;
FIG. 4 is a flowchart illustrating an audio data splitting and reassembling process for a sound event detection task, according to an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating an enhancement sample data length relationship in accordance with an exemplary embodiment;
FIG. 6 is a schematic diagram illustrating a joint network model in accordance with an exemplary embodiment;
FIG. 7 is a flowchart illustrating an application scenario of a method of audio data enhancement according to an exemplary embodiment;
FIG. 8 is a schematic diagram illustrating the structure of an audio data enhancement apparatus according to an exemplary embodiment;
fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, the present disclosure is further described in detail below with reference to the accompanying drawings and examples.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating an audio data enhancement method according to an exemplary embodiment. As shown in Fig. 1, the audio data enhancement method of the embodiment of the present disclosure mainly includes the following steps:
Step 101, determining an audio recognition task, wherein the audio recognition task is a keyword detection task and/or a sound event detection task;
Step 102, receiving audio data associated with the audio recognition task;
Step 103, splitting and recombining the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task;
Step 104, obtaining audio training samples for the audio recognition task according to the enhancement sample data and the audio recognition task, wherein the training samples are used for training a joint network model for executing the keyword detection task and/or the sound event detection task.
According to the audio data enhancement method of the embodiment of the present disclosure, the received audio data is split and recombined according to the audio recognition task to obtain enhancement sample data for the audio recognition task, from which audio training samples for the audio recognition task are then obtained. In the technical solution of the embodiment of the present disclosure, splitting and recombining the received audio data realizes targeted reorganization of the training samples of the audio recognition task. The obtained training samples have more prominent keyword features for the keyword detection task, or more prominent sound features for the sound event detection task. Consequently, a joint network model for executing the keyword detection task and/or the sound event detection task, once trained with training samples obtained by the technical solution of the present disclosure, can improve the speech recognition accuracy of the keyword detection task and shorten the detection response time of the sound event detection task, thereby improving the user experience of the keyword detection task and/or the sound event detection task.
Fig. 2 is a flowchart illustrating an audio data splitting and recombination process for an audio recognition task according to an exemplary embodiment. As shown in Fig. 2, in some embodiments, in the case where the audio recognition task is a keyword detection task, step 103 includes:
Step 10311, cutting off non-voice data in the audio data;
Step 10312, segmenting the audio data according to the voice duration in the audio data and the number of words in the keyword associated with the keyword detection task, to obtain at least two segments of audio sub-data;
Step 10313, obtaining initial audio sample data according to the at least two segments of audio sub-data;
Step 10314, splicing interference audio data onto both ends of the initial audio sample data to obtain enhancement sample data for the keyword detection task, wherein the interference audio data is derived from non-trigger audio data in the training data associated with the keyword detection task.
During audio data collection, the speaker rarely speaks in a completely quiet environment, so the audio data normally also contains content other than the voice segments, such as background sound segments and silent blank segments. In the embodiment of the present disclosure, after the non-voice data in the audio data is cut off in step 10311, only the voice data is retained, maximizing the proportion of voice content in the audio data. In this case, the audio training samples obtained from the retained voice data alone have the most prominent voice features, and a network model for executing the keyword detection task trained on such audio training samples can achieve greatly improved recognition accuracy for the keyword detection task.
In some embodiments, step 10311 may be implemented using a Voice Activity Detection (VAD) method.
In general, cutting off the non-voice data with the voice activity detection method yields several independent fragments that contain only voice content. Therefore, in some embodiments, after the non-voice data is cut off with the voice activity detection method to obtain multiple fragments, step 10311 further includes splicing the multiple fragments in chronological order to obtain audio data containing only the complete voice content.
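A minimal sketch of this trimming-and-splicing (step 10311), assuming 16 kHz, 16-bit mono PCM and the open-source webrtcvad package; the patent does not name a specific VAD implementation, so this choice is illustrative:

```python
import webrtcvad

def trim_non_voice(pcm: bytes, sample_rate: int = 16000,
                   frame_ms: int = 30, aggressiveness: int = 2) -> bytes:
    """Drop non-speech frames and splice the remaining speech
    fragments in chronological order (cf. step 10311)."""
    vad = webrtcvad.Vad(aggressiveness)
    bytes_per_frame = sample_rate * frame_ms // 1000 * 2  # 16-bit mono PCM
    voiced = []
    for start in range(0, len(pcm) - bytes_per_frame + 1, bytes_per_frame):
        frame = pcm[start:start + bytes_per_frame]
        if vad.is_speech(frame, sample_rate):
            voiced.append(frame)  # kept in time order
    return b"".join(voiced)
```

webrtcvad only accepts 10/20/30 ms frames at 8/16/32/48 kHz, which is why the input is processed frame by frame here.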
For the keyword detection task, false recognition and false triggering are often caused by detected voice content that is merely similar to the keyword; for example, when the keyword (i.e., the wake-up word) is "small and tiny", false wake-ups occur on similar-sounding phrases (rendered in translation as "smile", "tiny", "defensive school", and the like, which are near-homophones of the keyword in the original language). Therefore, in order to recognize each word in the keyword accurately and avoid such false wake-ups, some embodiments of the present disclosure segment the audio sub-data word by word in steps 10312 and 10313, derive from the segmented words all possible non-keyword combinations that are similar to the keyword, and in subsequent steps (see the description below) build non-trigger-category audio training samples from the initial audio sample data of these possible non-keyword combinations. This ensures that speech of non-wake-words similar to the wake-word is not recognized as the keyword, thereby avoiding false wake-ups and improving the accuracy of speech recognition. On this basis, in a preferred embodiment, the voice content contained in the audio data associated with the keyword detection task is the keyword itself; for example, if the keyword is "small and tiny", the voice content contained in the audio data is "small and tiny".
In some embodiments, if the voice duration of the audio data obtained after cutting off the non-voice data is S and the number of words in the keyword is N, then in step 10312 the audio data is segmented into at least two segments of audio sub-data by dividing the voice duration S into N equal parts, each of duration S/N; after segmentation, the audio sub-data of each time segment thus basically contains one word of the keyword. In some embodiments, N ≥ 2, i.e., the keyword contains at least two words: for the keyword detection task, if the keyword contained only one word, false wake-up triggers would occur on everyday speech or environmental sounds whose pronunciation is similar or identical to that of the keyword, so the keyword should not be a single word. In addition, for a keyword detection task such as wake-up, an overly long keyword also degrades the user experience because the utterance is long, so the keyword should not be too long; for example, in an alternative embodiment, the keyword may be limited to at most 10 words, and further, in an alternative embodiment, to no more than 6 words or no more than 5 words.
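A minimal sketch of this equal-duration segmentation (step 10312), assuming the trimmed, voice-only waveform is a numpy array; any samples left over after integer division are dropped for simplicity:

```python
import numpy as np

def split_by_word_count(wav: np.ndarray, n_words: int) -> list[np.ndarray]:
    """Split the voice-only waveform into n_words equal-duration
    segments, so each segment roughly covers one keyword word."""
    assert n_words >= 2, "a wake-up keyword should contain at least two words"
    seg_len = len(wav) // n_words  # duration S/N per segment
    return [wav[i * seg_len:(i + 1) * seg_len] for i in range(n_words)]
```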
FIG. 3 is a schematic diagram illustrating the segmentation of audio data according to an exemplary embodiment. As shown in Fig. 3, the audio data contains a keyword composed of 4 words, denoted A1, A2, A3 and A4, and the voice duration of the audio data is denoted S. The voice duration S is divided into 4 equal parts (the number of parts is determined by the number of words; in the embodiment of Fig. 3 the keyword has 4 words, so there are 4 parts), yielding four segments of audio sub-data N1, N2, N3 and N4, each of equal duration S/4. Because the voice duration of individual words usually differs when a speaker utters the keyword, each segment of audio sub-data may also contain a small residual of the voice of the adjacent word. For example, as shown in Fig. 3, audio sub-data N1 contains, in addition to word A1, a small residual of the voice of word A2; since N1 is dominated by the voice of word A1, the residual of A2 does not affect N1's expression of the voice features of A1. Similarly, audio sub-data N2 contains a small residual of word A3 in addition to word A2, and audio sub-data N3 contains a small residual of word A4 in addition to word A3; in each case the residual does not affect the segment's expression of the voice features of its main word. Here, a voice residual is a fragment containing only a small part of the complete pronunciation of a word.
Taking the ABAB-structured keyword "small and tiny" as an example, in Fig. 3 word A1 corresponds to the first A of the ABAB structure, word A2 to the first B, word A3 to the second A, and word A4 to the second B.
In step 10312, the audio data of the ABAB-structured keyword is segmented to obtain 4 segments of audio sub-data.
In some embodiments, step 10313 may include:
determining each segment of the at least two segments of audio sub-data as initial audio sample data; and
randomly arranging and splicing any two or more of the at least two segments of audio sub-data to obtain initial audio sample data.
Taking the segmentation embodiment of Fig. 3 as an example, in step 10313, obtaining initial audio sample data according to the 4 segments of audio sub-data (i.e., audio sub-data N1, N2, N3 and N4) may include:
(1) using each segment of audio sub-data alone as initial audio sample data, for example: audio sub-data N1 alone as initial audio sample data (the first A of the ABAB structure), N2 alone (the first B), N3 alone (the second A), and N4 alone (the second B). Although N1 and N3 correspond to the first A and the second A of the ABAB structure respectively, the two A's are pronounced identically, so only one A (N1 or N3) may be taken as initial audio sample data; likewise, for N2 and N4, only one B (N2 or N4) may be taken as initial audio sample data. That is, in some embodiments, initial audio sample data with identical voice content may be de-duplicated;
(2) randomly arranging and splicing any two segments of audio sub-data to obtain initial audio sample data, for example: splicing N1 and N2 (N1 first, N2 second) yields initial audio sample data of AB structure, splicing N1 and N3 (N1 first, N3 second) yields initial audio sample data of AA structure, and splicing N1 and N4 (N1 first, N4 second) also yields initial audio sample data of AB structure. The AB-structure sample formed from N1 followed by N2 and the AB-structure sample formed from N1 followed by N4 have the same voice content and may be de-duplicated;
(3) randomly arranging and splicing any three segments of audio sub-data to obtain initial audio sample data, for example, splicing any three of N1, N2, N3 and N4 to form initial audio sample data of ABA, AAB, ABB, BBA, BAB and similar structures, with de-duplication applied when initial audio sample data with identical voice content appears;
(4) randomly arranging and splicing all four segments of audio sub-data to obtain initial audio sample data, for example: splicing N1, N2, N3 and N4 in various orders to form initial audio sample data of AABB, BBAA, BABA and similar structures, with de-duplication applied when initial audio sample data with identical voice content appears.
Among these, (2), (3) and (4) above belong to the case of splicing two or more segments of audio sub-data, enumerated programmatically in the sketch below.
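A minimal sketch of this enumeration with de-duplication by voice content, where each sub-data segment is mapped to its word label; the dictionary representation is illustrative:

```python
from itertools import permutations

def recombine(segments: dict[str, str]) -> dict[str, tuple[str, ...]]:
    """Enumerate all orderings of 1..len(segments) sub-data segments,
    keeping one representative per distinct word-content sequence."""
    unique: dict[str, tuple[str, ...]] = {}
    names = list(segments)
    for r in range(1, len(names) + 1):
        for combo in permutations(names, r):
            content = "".join(segments[n] for n in combo)  # e.g. "ABA"
            unique.setdefault(content, combo)              # de-duplicate
    return unique

samples = recombine({"N1": "A", "N2": "B", "N3": "A", "N4": "B"})
# samples["ABAB"] -> ('N1', 'N2', 'N3', 'N4'); other keys include
# "A", "B", "AB", "AA", "BA", "BB", "ABA", "AABB", "BABA", ...
```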
In the embodiment of the present disclosure, for the keyword detection task, the enhancement sample data obtained by splicing in step 10314 comprises the initial audio sample data and the interference audio data spliced onto its two ends, so the enhancement sample data contains both the features of the audio data associated with the keyword detection task and the features of the interference audio data. Because the interference audio data is derived from non-trigger audio data, whose characteristics determine that it does not trigger the execution of subsequent operations (e.g., does not trigger wake-up), a network model for executing the keyword detection task can be trained with enhancement sample data containing interference audio data. Depending on the training objective, the network model for executing the keyword detection task acquires the ability not to trigger subsequent operations for other word combinations similar to the specific keyword (for example, structures other than "small and tiny" described above), which can greatly reduce the false wake-up rate on incomplete keywords in practical application scenarios. Moreover, the same approach can also be used so that, depending on the training objective, the network model for executing the keyword detection task acquires the ability to trigger subsequent operations only for the specific keyword (e.g., "small and tiny" described above).
For the keyword detection task, it must be ensured that detection passes only when the detected voice completely matches the keyword, and under no other condition. For example, for the ABAB-structured keyword "small and tiny", detection passes and triggers the execution of a subsequent operation (e.g., wake-up) only when the detected voice is "small and tiny"; no voice other than "small and tiny" should trigger the execution of the subsequent operation. In this case, among all the obtained initial audio sample data, any speech other than the ABAB-structured keyword should not trigger the execution of the subsequent operation; that is, any speech other than the ABAB-structured keyword needs to be assigned to the non-trigger category to avoid false triggering. Accordingly, in the case where the audio recognition task is a keyword detection task, step 104 includes:
step 1041, in case that the audio content in the enhancement sample data is not consistent with the keyword of the keyword detection task, determining the annotation information associated with the enhancement sample data as a non-trigger category, and determining the enhancement sample data and the annotation information associated therewith as a non-trigger category audio training sample for the keyword detection task.
In some embodiments, to enhance the ability of the audio recognition task to trigger subsequent operations only by the speech of the ABAB structure keyword, step 104 may further include:
step 1042, under the condition that the audio content in the enhanced sample data is consistent with the keywords of the keyword detection task, determining the labeling information associated with the enhanced sample data as a trigger type, and determining the enhanced sample data and the labeling information associated therewith as a trigger type audio training sample for the keyword detection task.
In some embodiments, where the audio recognition task is a keyword detection task, the length of the enhancement sample data is greater than the length of the keyword audio that can trigger subsequent operations.
For the keyword detection task, the keyword is fixed, and only voice information bearing the voice features of the keyword can trigger the execution of subsequent operations (e.g., trigger wake-up). Therefore, in the audio data enhancement method of the embodiment of the present disclosure, the content of the received audio data associated with the keyword detection task need not necessarily include the keyword content: other audio data is possible, and the number of words of the speech it contains may be arbitrary. It is only required that, for the resulting enhancement sample data, the associated annotation information is determined as the non-trigger category if the content is non-keyword content, and as the trigger category if the content is the keyword content. Preferably, for the purpose of avoiding false triggering due to similar pronunciation, the content of the received audio data associated with the keyword detection task should contain the keyword content.
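A minimal sketch of the annotation rule of steps 1041 and 1042, assuming each enhancement sample carries the word-content string produced during recombination and the keyword is given as the same kind of string:

```python
def label_sample(sample_content: str, keyword: str) -> dict:
    """Annotate an enhancement sample as trigger or non-trigger
    depending on whether its audio content matches the keyword."""
    category = "trigger" if sample_content == keyword else "non-trigger"
    return {"content": sample_content, "label": category}

label_sample("ABAB", "ABAB")  # -> trigger-category training sample
label_sample("ABA",  "ABAB")  # -> non-trigger-category training sample
```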
Fig. 4 is a flowchart illustrating an audio data splitting and recombination process for a sound event detection task according to an exemplary embodiment. As shown in Fig. 4, where the audio recognition task is a sound event detection task, step 103 includes:
Step 10321, in the case where the time length of the audio data is within a preset time length threshold range, acquiring audio sub-segment data satisfying a preset time length condition from the audio data, and determining the audio sub-segment data as initial audio sample data;
Step 10322, splicing interference audio data onto both ends of the initial audio sample data to obtain enhancement sample data for the sound event detection task, wherein the interference audio data is derived from non-trigger audio data in the training data associated with the sound event detection task.
The preset time length threshold range ensures that the audio data can meet the training requirements. If the audio data is long enough, audio data meeting the length requirement can be obtained by cutting segments out of it; but if the audio data is too short, it contains too little feature information, and even repeatedly splicing over-short audio data does not yield more feature information. Enhancement sample data obtained from over-short audio data therefore does not contain enough features, and the trained network model cannot serve the purpose of the sound event detection task. On this basis, in some embodiments, the audio data enhancement method of the present disclosure further includes:
discarding the audio data in the case where the time length of the audio data is outside the preset time length threshold range.
Also, in some embodiments, the time length threshold range covers time lengths greater than or equal to half of the time length of the enhancement sample data, and the preset time length condition is being greater than or equal to half of the time length of the enhancement sample data and less than the time length of the enhancement sample data.
The time length of the enhancement sample data is set according to the set time length of the training samples. For example, if the set time length of the training samples is 3 seconds and the time length of the enhancement sample data is 3 seconds, then the time length threshold range is "greater than or equal to 1.5 seconds", and the preset time length condition is "greater than or equal to 1.5 seconds and less than 3 seconds". That is, where the time length of the audio data is at least 1.5 seconds, the length of the audio sub-segment data obtained from the audio data lies between 1.5 seconds (inclusive) and 3 seconds (exclusive).
In the embodiment of the present disclosure, for the sound event detection task, the enhancement sample data obtained by splicing in step 10322 comprises the audio data associated with the sound event detection task and the interference audio data spliced onto its two ends, so the enhancement sample data contains both the features of the audio data associated with the sound event detection task and the features of the interference audio data. The interference audio data is derived from non-trigger audio data, whose characteristics determine that it does not trigger the execution of subsequent operations (e.g., does not trigger wake-up). Meanwhile, audio data whose time length is outside the preset time length threshold range is discarded; since the time length threshold range covers at least half of the time length of the enhancement sample data, and the preset time length condition is at least half of and less than that time length, the audio data of the sound event detection task occupies at least half of the obtained enhancement sample data. The enhancement sample data thus contains at least half of the features of the audio data associated with the sound event detection task. Training a network model for executing the sound event detection task with such enhancement sample data enables the model to rapidly detect a specific sound (e.g., a child's crying) and trigger subsequent operations, so the response time for the specific sound can be greatly shortened.
Because the network model imposes a length requirement on the training samples, in some embodiments the length of the enhancement sample data is a preset fixed length. Fig. 5 is a schematic diagram illustrating the length relationship of the enhancement sample data according to an exemplary embodiment. As shown in Fig. 5, the preset fixed length of the enhancement sample data 501 is M; if the length of the initial audio sample data 502 is T and T is smaller than M, then interference audio data 503 of random lengths is spliced onto the head and tail ends of the initial audio sample data 502, with the total length of the spliced interference audio data 503 being M - T, thereby ensuring that the total length of the finally obtained enhancement sample data 501 is M.
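A minimal sketch of the length relationship of Fig. 5, assuming numpy waveforms and a pool of non-trigger audio from which the interference data is drawn; the pool and its sampling are illustrative:

```python
import numpy as np

def pad_with_interference(initial: np.ndarray, interference_pool: np.ndarray,
                          m: int) -> np.ndarray:
    """Splice random-length interference audio onto the head and tail of
    the initial audio sample so the result has fixed length M (Fig. 5):
    head + T + tail == M, with head + tail == M - T."""
    t = len(initial)
    assert t <= m and len(interference_pool) >= m - t
    head_len = np.random.randint(0, m - t + 1)  # random split of M - T
    tail_len = m - t - head_len
    head = interference_pool[:head_len]
    tail = interference_pool[head_len:head_len + tail_len]
    return np.concatenate([head, initial, tail])
```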
In some embodiments, where the audio recognition task is a sound event detection task, step 104 comprises:
determining the annotation information associated with the enhancement sample data as a trigger category, and determining the enhancement sample data and its associated annotation information as a trigger-category audio training sample for the sound event detection task.
For the sound event detection task, practical application scenarios demand rapid detection and recognition of sound, with the detection delay as short as possible, so as to meet the requirement of rapid response. Using the enhancement sample data for the sound event detection task obtained by the embodiment of the present disclosure, the network model executing the sound event detection task can learn to detect successfully even when the length of the detected audio information is less than the length of the enhancement sample data, thereby shortening the sound detection response time.
After obtaining the audio training samples for the audio recognition task, the audio data enhancement method of the embodiment of the present disclosure further includes:
training a joint network model for executing a keyword detection task and/or a sound event detection task based on the audio training sample;
and executing at least one of the keyword detection task and the sound event detection task by using the trained joint network model.
The specific training process may include:
acquiring a training sample set, where the training sample set comprises multiple pieces of enhancement sample data and the annotation information associated with each piece. The enhancement sample data comprises enhancement sample data for the keyword detection task and/or enhancement sample data for the sound event detection task. In the training sample set, the enhancement sample data for the keyword detection task further comprises non-trigger-category enhancement sample data, whose associated annotation information is non-trigger-category annotation information; the annotation information associated with the enhancement sample data of the sound event detection task is trigger-category annotation information. In some embodiments, the enhancement sample data for the keyword detection task in the training sample set may further comprise trigger-category enhancement sample data, whose associated annotation information is trigger-category annotation information;
inputting the enhancement sample data into a joint network model to obtain a result corresponding to the enhancement sample data, where the joint network model is used for executing the keyword detection task and/or the sound event detection task. Fig. 6 is a schematic diagram illustrating the joint network model according to an exemplary embodiment. As shown in Fig. 6, when used for executing both the keyword detection task and the sound event detection task, the joint network model comprises an encoding layer, a keyword detection task decoding layer, and a sound event detection task decoding layer: the same encoding layer is shared by the audio data processing of the keyword detection task and the sound event detection task, while two different decoding layers are used according to which of the two tasks is at hand. In the embodiment of the present disclosure, the enhancement sample data input during training, as well as the audio data input when executing the keyword detection task or the sound event detection task, first enters the encoding layer, which performs the encoding; the encoded data output by the encoding layer is then fed into the keyword detection task decoding layer or the sound event detection task decoding layer according to the detection task. The keyword detection task decoding layer outputs the result for the keyword detection task, and the sound event detection task decoding layer outputs the result for the sound event detection task, where the result may be an execution probability for triggering a subsequent operation (e.g., triggering wake-up); a code sketch of such a joint model is given after these training steps;
obtaining the value of a target loss function based on the result corresponding to the enhancement sample data and the annotation information associated with the enhancement sample data; and
training the joint network model by adjusting the parameters of the joint network model according to the value of the target loss function, to obtain the trained joint network model.
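A minimal sketch of such a joint network model in PyTorch; the GRU encoder, feature dimensions and class counts are illustrative assumptions, as the patent does not specify a concrete architecture:

```python
import torch
import torch.nn as nn

class JointKwsSedModel(nn.Module):
    """Shared encoding layer with separate decoding layers for the
    keyword detection (KWS) and sound event detection (SED) tasks."""
    def __init__(self, n_mels: int = 40, hidden: int = 128,
                 n_kws_classes: int = 2, n_sed_classes: int = 2):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, num_layers=2, batch_first=True)
        self.kws_decoder = nn.Linear(hidden, n_kws_classes)  # trigger / non-trigger
        self.sed_decoder = nn.Linear(hidden, n_sed_classes)

    def forward(self, feats: torch.Tensor, task: str) -> torch.Tensor:
        # feats: (batch, time, n_mels) acoustic features
        encoded, _ = self.encoder(feats)
        last = encoded[:, -1, :]  # summary state of the utterance
        head = self.kws_decoder if task == "kws" else self.sed_decoder
        return head(last)         # logits -> trigger probability after softmax

model = JointKwsSedModel()
loss_fn = nn.CrossEntropyLoss()                     # the target loss function
logits = model(torch.randn(8, 300, 40), task="kws")
loss = loss_fn(logits, torch.randint(0, 2, (8,)))   # trigger / non-trigger labels
loss.backward()                                     # adjust model parameters
```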
Fig. 7 is a flowchart illustrating an application scenario of an audio data enhancement method according to an exemplary embodiment, where the application scenario mainly includes the following steps, as shown in fig. 7.
Step 701, determining an audio recognition task, if the audio recognition task is a keyword detection task, performing step 711, and if the audio recognition task is a voice event detection task, performing step 721.
Step 711, receiving audio data, and then performing step 712.
Wherein the audio data received in step 711 is the audio data associated with the keyword detection task.
Step 712, cutting off the non-voice data in the audio data, and then performing step 713.
In some embodiments, a voice activity detection method is used to cut off the non-voice data in the audio data to obtain multiple fragments, and the multiple fragments are spliced in chronological order to obtain audio data containing only the complete voice content.
Step 713, segmenting the audio data according to the voice duration in the audio data and the number of words in the keyword associated with the keyword detection task, to obtain at least two segments of audio sub-data, and then performing step 714.
Taking an ABAB-structured keyword as an example, and referring to Fig. 3, the audio data contains a keyword composed of 4 words, denoted A1, A2, A3 and A4, where A1 corresponds to the first A of the ABAB structure, A2 to the first B, A3 to the second A, and A4 to the second B. In some embodiments, the audio data may be divided equally according to the number of words in the keyword, as shown in Fig. 3; for example, if the keyword has four words, the audio data is divided equally into four segments of audio sub-data.
Step 714, obtaining initial audio sample data according to the at least two segments of audio sub-data, and then performing step 715.
In step 714, each segment of the at least two segments of audio sub-data may be determined as initial audio sample data; and any two or more of the at least two segments of audio sub-data may be randomly arranged and spliced to obtain initial audio sample data.
Taking the ABAB-structured keyword as an example, in step 714 the splicing includes four ways: (1) using a single segment of audio sub-data alone as initial audio sample data; (2) arranging any two segments of audio sub-data in all possible orders to obtain initial audio sample data; (3) arranging any three segments of audio sub-data in all possible orders to obtain initial audio sample data; (4) arranging the four segments of audio sub-data in all possible orders to obtain initial audio sample data.
For way (1), referring to Fig. 3, four segments of audio sub-data with contents [A1], [A2], [A3] and [A4] are obtained, where the word in the keyword corresponding to the audio sub-data of the [A1] content and of the [A3] content is A, and the word corresponding to the audio sub-data of the [A2] content and of the [A4] content is B.
In some embodiments, each segment of audio sub-data with [A1], [A2], [A3] or [A4] content is determined as initial audio sample data; in other embodiments, [A1], [A2], [A3] and [A4] are de-duplicated by content before the initial audio sample data is obtained. For example, because the word in the keyword corresponding to both the audio sub-data of the [A1] content and that of the [A3] content is A, the two are de-duplicated and only one of them is retained; likewise, only one of the audio sub-data of the [A2] content and of the [A4] content is retained.
For way (2), referring to Fig. 3, multiple pieces of initial audio sample data with contents [A1, A2], [A1, A3], [A1, A4], [A2, A1], [A2, A3], [A2, A4], [A3, A1], [A3, A2], [A3, A4], [A4, A1], [A4, A2] and [A4, A3] are obtained. The words in the keyword corresponding to the initial audio sample data of the [A1, A2], [A1, A4], [A3, A2] and [A3, A4] contents are AB; those corresponding to the [A1, A3] and [A3, A1] contents are AA; those corresponding to the [A2, A1], [A2, A3], [A4, A1] and [A4, A3] contents are BA; and those corresponding to the [A2, A4] and [A4, A2] contents are BB.
In some embodiments, the initial audio sample data of all combined-form contents is retained to obtain subsequent enhancement sample data; in other embodiments, the initial audio sample data of all combined-form contents is de-duplicated by content before subsequent enhancement sample data is obtained. For example, because the words in the keyword corresponding to the initial audio sample data of the [A1, A2], [A1, A4], [A3, A2] and [A3, A4] contents are all AB, these are de-duplicated and only one of them is retained; similarly, only one of the initial audio sample data of the [A1, A3] and [A3, A1] contents is retained, only one of the [A2, A1], [A2, A3], [A4, A1] and [A4, A3] contents is retained, and only one of the [A2, A4] and [A4, A2] contents is retained.
For way (3), referring to Fig. 3, the following pieces of initial audio sample data are obtained:
[A1,A2,A3]、[A1,A3,A2]、[A2,A1,A3]、[A2,A3,A1]、[A3,A1,A2]、[A3,A2,A1]
[A1,A2,A4]、[A1,A4,A2]、[A2,A1,A4]、[A2,A4,A1]、[A4,A1,A2]、[A4,A2,A1]
[A1,A3,A4]、[A1,A4,A3]、[A3,A1,A4]、[A3,A4,A1]、[A4,A1,A3]、[A4,A3,A1]
[A2,A3,A4]、[A2,A4,A3]、[A3,A2,A4]、[A3,A4,A2]、[A4,A2,A3]、[A4,A3,A2]
In some embodiments, the initial audio sample data of all combined-form contents is retained to obtain subsequent enhancement sample data; in other embodiments, the initial audio sample data of all combined-form contents is de-duplicated by content before subsequent enhancement sample data is obtained. For example, because the words in the keyword corresponding to the initial audio sample data of the [A1, A2, A3], [A1, A4, A3], [A3, A2, A1] and [A3, A4, A1] contents are all ABA, these are de-duplicated and only one of them is retained.
For way (4), referring to Fig. 3, the following pieces of initial audio sample data are obtained:
[A1,A2,A3,A4]、[A1,A2,A4,A3]、[A1,A3,A2,A4]、[A1,A3,A4,A2]
[A1,A4,A2,A3]、[A1,A4,A3,A2]、[A2,A1,A3,A4]、[A2,A1,A4,A3]
[A2,A3,A1,A4]、[A2,A3,A4,A1]、[A2,A4,A1,A3]、[A2,A4,A3,A1]
[A3,A1,A2,A4]、[A3,A1,A4,A2]、[A3,A2,A1,A4]、[A3,A2,A4,A1]
[A3,A4,A1,A2]、[A3,A4,A2,A1]、[A4,A1,A2,A3]、[A4,A1,A3,A2]
[A4,A2,A1,A3]、[A4,A2,A3,A1]、[A4,A3,A1,A2]、[A4,A3,A2,A1]
In some embodiments, the initial audio sample data of all combined-form contents is retained to obtain subsequent enhancement sample data; in other embodiments, the initial audio sample data of all combined-form contents is de-duplicated by content before subsequent enhancement sample data is obtained. For example, because the words in the keyword corresponding to the initial audio sample data of the [A1, A3, A2, A4], [A1, A3, A4, A2], [A3, A1, A2, A4] and [A3, A1, A4, A2] contents are all AABB, these are de-duplicated and only one of them is retained.
Step 715, splicing interference audio data onto both ends of the initial audio sample data to obtain enhancement sample data for the keyword detection task, and then performing step 716.
The interference audio data is derived from non-trigger audio data in the training data associated with the keyword detection task; based on results obtained from non-trigger audio data, the network model of the keyword detection task does not trigger subsequent operations (such as triggering wake-up).
The time length of the enhancement sample data is set according to the set time length of the training samples.
Generally, the audio duration of each word of the keyword is between 0.2 and 0.4 seconds, so the audio duration of an ABAB-structured keyword is between 0.8 and 1.6 seconds. Therefore, in some embodiments, the audio duration of a training sample should be greater than 1.6 seconds; for example, it may be 2 to 3 seconds, where the audio duration of the initial audio sample data is at most 1.6 seconds and the part of the enhancement sample data other than the keyword audio (the initial audio sample data) is the interference audio data.
And 716, obtaining labeling information related to the enhanced sample data based on the audio content in the enhanced sample data and the keyword content of the keyword detection task, and determining the enhanced sample data and the labeling information related to the enhanced sample data as a trigger type audio training sample for the keyword detection task.
For the labeling information of the enhanced sample data of the keyword detection task, two situations exist: enhancing the condition that audio content in sample data is inconsistent with keywords of a keyword detection task; and secondly, enhancing the condition that the audio content in the sample data is consistent with the keywords of the keyword detection task.
In case, in step 716, the annotation information associated with the enhancement sample data is determined to be a non-trigger category, and the enhancement sample data and the annotation information associated therewith are determined to be a non-trigger category audio training sample for the keyword detection task. Based on the purpose of avoiding false triggering caused by similar pronunciation in the embodiment of the present disclosure, the determination of the non-triggering type audio training sample for the case one is a necessary step in the embodiment.
In case two, in step 716, the labeling information associated with the enhancement sample data is determined as the trigger category, and the enhancement sample data together with its associated labeling information is determined as a trigger category audio training sample for the keyword detection task. Determining trigger category audio training samples for case two is an optional step in this embodiment.
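The two cases reduce to a simple labeling rule. The sketch below assumes each enhancement sample carries a transcript of its audio content that can be compared with the task keyword; all names are illustrative:

```python
# Sketch of step 716: assign the trigger / non-trigger label.
TRIGGER, NON_TRIGGER = 1, 0

def label_enhancement_sample(sample_transcript, keyword, keep_trigger=True):
    """Return (label, keep) for one enhancement sample.

    Case one (content != keyword) always yields a non-trigger sample;
    case two (content == keyword) yields a trigger sample and is optional.
    """
    if sample_transcript != keyword:
        return NON_TRIGGER, True       # case one: mandatory hard negative
    return TRIGGER, keep_trigger       # case two: optional positive
```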
Step 721, audio data is received, followed by step 722.
Where the audio data received in step 721 is audio data associated with a sound event detection task.
Step 722, judging whether the time length of the audio data meets the requirement, if so, executing step 723, otherwise, discarding the audio data.
Overly short audio data contains few audio features, so the probability of recognition errors is high; using such data as training samples would increase the error rate of the network model. Requiring a minimum time length therefore ensures that sufficient audio features are contained and improves recognition accuracy. In some embodiments, determining in step 722 whether the time length of the audio data meets the requirement may specifically include: judging whether the time length of the audio data is within a preset time length threshold range. The preset time length threshold range is greater than or equal to half of the time length of the enhancement sample data, which ensures that the obtained enhancement sample data contains at least half of the audio features related to sound event detection.
Step 723, acquiring sub-audio segment data meeting a preset time length condition from the audio data, determining the sub-audio segment data as initial audio sample data, and then executing step 724.
To ensure that the obtained enhancement sample data contains at least half of the audio features related to sound event detection, the length of the sub-audio segment data is at least half of that of the enhancement sample data; the preset time length condition may therefore be greater than or equal to half of the time length of the enhancement sample data and less than the time length of the enhancement sample data. In this way, the obtained enhancement sample data contains at least half, and at most all, of the audio features related to sound event detection, and training the network model with such enhancement sample data can improve both the accuracy and the response timeliness of sound event detection.
In some embodiments, in step 723, any audio segment satisfying the preset time length condition may be cut out of the audio data at a random position and used as the sub-audio segment data.
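Steps 722 and 723 can be sketched together, with lengths measured in samples and the thresholds following the at-least-half rule described above; the function and parameter names are assumptions for illustration:

```python
# Sketch of steps 722-723: discard too-short audio, otherwise cut a
# random sub-segment whose length lies in [enh_len // 2, enh_len).
import numpy as np

def extract_subsegment(audio, enh_len, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    if len(audio) < enh_len // 2:          # step 722: too short, discard
        return None
    seg_len = int(rng.integers(enh_len // 2, enh_len))
    seg_len = min(seg_len, len(audio))     # cannot take more than we have
    start = int(rng.integers(0, len(audio) - seg_len + 1))
    return audio[start:start + seg_len]    # step 723: initial audio sample data
```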
Step 724, splicing the interference audio data at two ends of the initial audio sample data to obtain enhanced sample data aiming at the sound event detection task, and then executing step 725.
The interference audio data is derived from non-trigger audio data in training data related to the sound event detection task, and the time length of the enhancement sample data is the time length set for the sound event detection task.
Step 725, determining the labeling information associated with the enhancement sample data as a trigger category, and determining the enhancement sample data and the labeling information associated with the enhancement sample data as a trigger category audio training sample for the sound event detection task.
Fig. 8 is a schematic diagram illustrating a structure of an audio data enhancement apparatus according to an exemplary embodiment, and as shown in fig. 8, the audio data enhancement apparatus includes a task determination module 801, a data receiving module 802, a concatenation and reassembly module 803, and a sample acquisition module 804.
The task determining module 801 is configured to perform determining an audio recognition task, where the audio recognition task is a keyword detection task and/or a sound event detection task.
A data receiving module 802 configured to perform receiving audio data associated with an audio recognition task.
And the splicing and recombining module 803 is configured to split and recombine the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task.
The sample obtaining module 804 is configured to obtain an audio training sample for the audio recognition task according to the enhancement sample data and the audio recognition task, where the training sample is used for training a joint network model for executing the keyword detection task and/or the sound event detection task.
In some embodiments, in the case that the audio recognition task is a keyword detection task, the splicing recombination module 803 includes:
a non-speech removal submodule configured to perform removal of non-speech data in the audio data;
the audio segmentation submodule is configured to segment the audio data according to the voice time in the audio data and the word number of the keywords related to the keyword detection task to obtain at least two sections of audio subdata;
the first initial audio acquisition submodule is configured to obtain initial audio sample data according to the at least two sections of audio subdata;
and the first audio splicing submodule is configured to splice interference audio data at two ends of the initial audio sample data to obtain enhanced sample data aiming at the keyword detection task, wherein the interference audio data is derived from non-trigger audio data in training data associated with the keyword detection task.
In some embodiments, the non-speech removal sub-module removes the non-speech data in the audio data using a voice activity detection (VAD) method.
In some embodiments, the non-speech removal sub-module is further configured to, after cutting off the non-voice data using an active voice detection method to obtain a plurality of pieces of segment data, splice the plurality of pieces of segment data in chronological order to obtain audio data containing only the complete voice content.
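The disclosure names VAD as the removal method but not a particular implementation; a production system would typically use an established VAD, while the self-contained frame-energy stand-in below merely illustrates the cut-and-rejoin behavior of this sub-module (it assumes float audio roughly in [-1, 1]):

```python
# Illustrative stand-in for the non-speech removal sub-module: drop
# low-energy frames, then re-join the remaining frames in time order.
import numpy as np

def remove_non_speech(audio, frame_len=400, threshold=1e-3):
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)
    voiced = frames[energy >= threshold]
    return voiced.reshape(-1) if voiced.size else np.empty(0, dtype=audio.dtype)
```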
In some embodiments, the first initial audio acquisition sub-module is further configured to perform:
determining each section of audio subdata in the at least two sections of audio subdata as initial audio sample data; or
randomly arranging and splicing any two or more sections of the audio subdata to obtain initial audio sample data.
In some embodiments, in the case that the audio recognition task is a keyword detection task, the sample obtaining module 804 further comprises:
and the non-trigger sample acquisition sub-module is configured to determine the annotation information associated with the enhancement sample data as a non-trigger type under the condition that the audio content in the enhancement sample data is inconsistent with the keywords of the keyword detection task, and determine the enhancement sample data and the annotation information associated with the enhancement sample data as a non-trigger type audio training sample aiming at the keyword detection task.
In some embodiments, in the case that the audio recognition task is a keyword detection task, the sample obtaining module 804 further comprises:
and the trigger sample acquisition sub-module is configured to determine the annotation information associated with the enhancement sample data as a trigger type under the condition that the audio content in the enhancement sample data is consistent with the keywords of the keyword detection task, and determine the enhancement sample data and the annotation information associated with the enhancement sample data as a trigger type audio training sample aiming at the keyword detection task.
In some embodiments, in the case that the audio recognition task is a sound event detection task, the splicing recombination module 803 includes:
the second initial audio acquisition submodule is configured to acquire sub-audio segment data meeting a preset time length condition from the audio data in the case that the time length of the audio data is within a preset time length threshold range, and determine the sub-audio segment data as initial audio sample data;
and the second audio splicing submodule is configured to splice interference audio data at two ends of the initial audio sample data to obtain enhancement sample data for the sound event detection task, wherein the interference audio data are derived from non-trigger audio data in training data related to the sound event detection task, and the time length of the enhancement sample data is the time length set for the sound event detection task.
In some embodiments, the second initial audio acquisition sub-module is further configured to perform: and in the case that the time length of the audio data is beyond the time length threshold range, discarding the audio data.
In some embodiments, the temporal length threshold range is greater than or equal to half the temporal length of the enhancement sample data; the preset time length condition is greater than or equal to half of the time length of the enhancement sample data and less than the time length of the enhancement sample data.
In some embodiments, in the case that the audio recognition task is a sound event detection task, the sample acquisition module 804 is further configured to perform: and determining the marking information associated with the enhancement sample data as a trigger category, and determining the enhancement sample data and the marking information associated with the enhancement sample data as a trigger category audio training sample aiming at the sound event detection task.
In some embodiments, the audio data enhancement apparatus of the present disclosure further comprises:
a model training module configured to perform training of a joint network model performing a keyword detection task and/or a sound event detection task based on the audio training samples;
and the task execution module is configured to execute at least one of the keyword detection task and the sound event detection task by utilizing the trained joint network model.
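The disclosure leaves the joint network model's architecture open; purely as an assumed example, a shared encoder with one output head per task, trained on mixed batches of both tasks' enhancement samples, might look like the PyTorch sketch below (all sizes illustrative):

```python
# Assumed sketch of a joint model: shared encoder, two task heads.
import torch
import torch.nn as nn

class JointDetector(nn.Module):
    def __init__(self, n_mels=40, hidden=128, n_keywords=2, n_events=5):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.kws_head = nn.Linear(hidden, n_keywords)  # keyword detection
        self.sed_head = nn.Linear(hidden, n_events)    # sound event detection

    def forward(self, features):           # features: (batch, frames, n_mels)
        _, h = self.encoder(features)      # h: (1, batch, hidden)
        h = h.squeeze(0)
        return self.kws_head(h), self.sed_head(h)

def train_step(model, optimizer, feats, kws_labels, sed_labels):
    """One optimisation step on a batch carrying labels for both tasks."""
    optimizer.zero_grad()
    kws_logits, sed_logits = model(feats)
    loss = (nn.functional.cross_entropy(kws_logits, kws_labels)
            + nn.functional.cross_entropy(sed_logits, sed_labels))
    loss.backward()
    optimizer.step()
    return loss.item()
```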
According to the technical solutions of the embodiments of the present disclosure, the received audio data is split and recombined to obtain enhancement sample data for the audio recognition task, and further to obtain audio training samples for the audio recognition task, realizing targeted reorganization of the training samples of the audio recognition task. The obtained training samples have more prominent keyword features for the keyword detection task, or more prominent sound features for the sound event detection task. Consequently, a joint network model for executing the keyword detection task and/or the sound event detection task, trained with training samples obtained by the technical solution of the present disclosure, can improve the accuracy and speed of speech recognition in the keyword detection task and shorten the detection response time of the sound event detection task, thereby improving the user experience of the keyword detection task and/or the sound event detection task.
Fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure. As shown in fig. 9, the electronic device 900 may vary considerably in configuration and performance, and may include one or more processors (CPUs) 901 and one or more memories 902, where the memory 902 stores at least one program code that is loaded and executed by the processor 901 to implement the audio data enhancement method provided in each of the above embodiments. Of course, the electronic device 900 may further have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may further include other components for implementing device functions, which are not described herein again.
Embodiments of the present disclosure also provide a computer-readable storage medium, such as a memory, including at least one instruction, which is executable by a processor in a computer device to perform the audio data enhancement method in the above embodiments. Alternatively, the computer-readable storage medium may be a non-transitory computer-readable storage medium, which may include, for example, a ROM (Read-Only Memory), a RAM (Random-Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.
The above description is meant to be illustrative of the preferred embodiments of the present disclosure and not to be taken as limiting the disclosure, and any modifications, equivalents, improvements and the like that are within the spirit and scope of the present disclosure are intended to be included therein.

Claims (14)

1. A method of audio data enhancement, comprising:
determining an audio recognition task, wherein the audio recognition task is a keyword detection task and/or a sound event detection task;
receiving audio data associated with the audio recognition task;
according to the audio recognition task, the audio data are split and recombined to obtain enhancement sample data aiming at the audio recognition task;
and obtaining an audio training sample aiming at the audio recognition task according to the enhancement sample data and the audio recognition task.
2. The audio data enhancement method according to claim 1, wherein, in a case that the audio recognition task is a keyword detection task, the splitting and recombining the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task includes:
cutting off non-voice data in the audio data;
according to the voice time in the audio data and the word number of the keyword related to the keyword detection task, segmenting the audio data to obtain at least two sections of audio subdata;
obtaining initial audio sample data according to the at least two sections of audio subdata;
splicing interference audio data at two ends of the initial audio sample data to obtain enhanced sample data aiming at the keyword detection task, wherein the interference audio data is derived from non-trigger audio data in training data associated with the keyword detection task.
3. The audio data enhancement method of claim 2, wherein:
and cutting off non-voice data in the audio data by adopting an active voice detection (VAD) method.
4. The method of claim 2, wherein obtaining initial audio sample data according to the at least two segments of audio sub-data comprises:
determining each section of audio subdata in the at least two sections of audio subdata as the initial audio sample data; or
randomly arranging and splicing any two or more sections of the audio subdata to obtain the initial audio sample data.
5. The audio data enhancement method of claim 1, wherein in a case that the audio recognition task is a keyword detection task, obtaining an audio training sample for the audio recognition task according to the enhancement sample data and the audio recognition task comprises:
and under the condition that the audio content in the enhancement sample data is inconsistent with the keyword of the keyword detection task, determining the annotation information associated with the enhancement sample data as a non-trigger category, and determining the enhancement sample data and the annotation information associated with the enhancement sample data as a non-trigger category audio training sample aiming at the keyword detection task.
6. The audio data enhancement method of claim 1, wherein in a case that the audio recognition task is a keyword detection task, obtaining an audio training sample for the audio recognition task according to the enhancement sample data and the audio recognition task comprises:
and under the condition that the audio content in the enhancement sample data is consistent with the keywords of the keyword detection task, determining the labeling information associated with the enhancement sample data as a trigger category, and determining the enhancement sample data and the labeling information associated with the enhancement sample data as a trigger category audio training sample aiming at the keyword detection task.
7. The audio data enhancement method according to claim 1, wherein, in a case that the audio recognition task is a sound event detection task, the splitting and recombining the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task includes:
under the condition that the time length of the audio data is within a preset time length threshold range, acquiring sub-audio segment data meeting a preset time length condition from the audio data, and determining the sub-audio segment data as initial audio sample data;
splicing interference audio data at two ends of the initial audio sample data to obtain enhanced sample data aiming at the sound event detection task, wherein the interference audio data is derived from non-trigger audio data in training data related to the sound event detection task.
8. The audio data enhancement method of claim 7, further comprising:
and if the time length of the audio data is beyond the time length threshold range, discarding the audio data.
9. The audio data enhancement method of claim 7 or 8, wherein:
the temporal length threshold range is greater than or equal to half of a temporal length of the enhancement sample data;
the preset time length condition is greater than or equal to half of the time length of the enhancement sample data and less than the time length of the enhancement sample data.
10. The method according to claim 1, wherein in a case that the audio recognition task is a sound event detection task, obtaining an audio training sample for the audio recognition task according to the enhancement sample data and the audio recognition task comprises:
and determining the labeling information associated with the enhancement sample data as a trigger type, and determining the enhancement sample data and the labeling information associated with the enhancement sample data as a trigger type audio training sample aiming at the sound event detection task.
11. The audio data enhancement method of claim 1, wherein after obtaining audio training samples for the audio recognition task, the audio data enhancement method further comprises:
training a joint network model for executing the keyword detection task and/or the sound event detection task based on the audio training sample;
and executing at least one of the keyword detection task and the sound event detection task by using the trained joint network model.
12. An audio data enhancement device, comprising:
the task determination module is configured to perform determining an audio recognition task, wherein the audio recognition task is a keyword detection task and/or a sound event detection task;
a data receiving module configured to perform receiving audio data associated with the audio recognition task;
the splicing and recombining module is configured to split and recombine the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task;
and the sample acquisition module is configured to obtain an audio training sample for the audio recognition task according to the enhancement sample data and the audio recognition task.
13. An electronic device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute the executable instructions to implement the audio data enhancement method of any of claims 1 to 11.
14. A computer-readable storage medium, wherein at least one instruction of the computer-readable storage medium, when executed by a processor of an electronic device, enables the electronic device to implement the audio data enhancement method of any one of claims 1 to 11.
CN202210666591.0A 2022-06-14 2022-06-14 Audio data enhancement method and device, electronic equipment and storage medium Active CN114758665B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210666591.0A CN114758665B (en) 2022-06-14 2022-06-14 Audio data enhancement method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN114758665A true CN114758665A (en) 2022-07-15
CN114758665B CN114758665B (en) 2022-09-02

Family

ID=82336800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210666591.0A Active CN114758665B (en) 2022-06-14 2022-06-14 Audio data enhancement method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114758665B (en)



Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2944909A1 (en) * 2009-04-28 2010-10-29 Thales Sa Detection device for use in surveillance system to detect events in audio flow, has regrouping unit regrouping time intervals, and signaling unit signaling detection of events when rhythmic patterns are identified
CN103456312A (en) * 2013-08-29 2013-12-18 太原理工大学 Single channel voice blind separation method based on computational auditory scene analysis
EP2846328A1 (en) * 2013-09-05 2015-03-11 Thomson Licensing Method and apparatus of detection of events
CN108615526A (en) * 2018-05-08 2018-10-02 腾讯科技(深圳)有限公司 The detection method of keyword, device, terminal and storage medium in voice signal
US20200105293A1 (en) * 2018-09-28 2020-04-02 Cirrus Logic International Semiconductor Ltd. Sound event detection
CN109545190A (en) * 2018-12-29 2019-03-29 联动优势科技有限公司 A kind of audio recognition method based on keyword
CN110556110A (en) * 2019-10-24 2019-12-10 北京九狐时代智能科技有限公司 Voice processing method and device, intelligent terminal and storage medium
US20210375274A1 (en) * 2020-05-29 2021-12-02 Beijing Baidu Netcom Science And Technology Co., Ltd. Speech recognition method and apparatus, and storage medium
CN111899760A (en) * 2020-07-17 2020-11-06 北京达佳互联信息技术有限公司 Audio event detection method and device, electronic equipment and storage medium
US20220093105A1 (en) * 2020-09-18 2022-03-24 Lg Electronics Inc. Artificial intelligence apparatus and method for recognizing plurality of wake-up words
CN112259078A (en) * 2020-10-15 2021-01-22 上海依图网络科技有限公司 Method and device for training audio recognition model and recognizing abnormal audio
CN112562734A (en) * 2020-11-25 2021-03-26 中检启迪(北京)科技有限公司 Voice interaction method and device based on voice detection
CN113421554A (en) * 2021-07-05 2021-09-21 平安科技(深圳)有限公司 Voice keyword detection model processing method and device and computer equipment
CN113920988A (en) * 2021-12-03 2022-01-11 深圳比特微电子科技有限公司 Voice wake-up method and device and readable storage medium
CN114333898A (en) * 2021-12-10 2022-04-12 科大讯飞股份有限公司 Sound event detection method, device and system and readable storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
QINGFU QI ET AL: "Feature Extraction Network with Attention Mechanism for Data Enhancement and Recombination Fusion for Multimodal Sentiment Analysis", Information 2021 *
TOMOKI HAYASHI ET AL: "Duration-Controlled LSTM for Polyphonic Sound Event Detection", IEEE/ACM Transactions on Audio, Speech, and Language Processing *
WANG YUNHANG: "Research on acoustic scene classification methods based on sub-spectrograms", China Master's Theses Full-text Database, Information Science and Technology Series *
ZHAO CHENGHAO: "Spoken keyword detection based on sub-word decoding and system fusion", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115206294A (en) * 2022-09-16 2022-10-18 深圳比特微电子科技有限公司 Training method, sound event detection method, device, equipment and medium
CN115206294B (en) * 2022-09-16 2022-12-06 深圳比特微电子科技有限公司 Training method, sound event detection method, device, equipment and medium

Also Published As

Publication number Publication date
CN114758665B (en) 2022-09-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant