CN114758665B - Audio data enhancement method and device, electronic equipment and storage medium - Google Patents

Audio data enhancement method and device, electronic equipment and storage medium

Info

Publication number
CN114758665B
CN114758665B (application CN202210666591.0A)
Authority
CN
China
Prior art keywords
audio
data
task
sample data
enhancement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210666591.0A
Other languages
Chinese (zh)
Other versions
CN114758665A (en)
Inventor
郑鑫江
凌明
杨作兴
艾国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen MicroBT Electronics Technology Co Ltd
Original Assignee
Shenzhen MicroBT Electronics Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen MicroBT Electronics Technology Co Ltd filed Critical Shenzhen MicroBT Electronics Technology Co Ltd
Priority to CN202210666591.0A
Publication of CN114758665A
Application granted
Publication of CN114758665B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present disclosure relates to an audio data enhancement method, apparatus, electronic device and storage medium, including: determining an audio recognition task, wherein the audio recognition task is a keyword detection task and/or a sound event detection task; receiving audio data associated with the audio recognition task; splitting and recombining the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task; and obtaining an audio training sample for the audio recognition task according to the enhancement sample data and the audio recognition task. By splitting and recombining the audio data, the obtained audio training samples have more prominent keyword features for the keyword detection task or more prominent sound features for the sound event detection task, which can improve the speech recognition accuracy of the keyword detection task, shorten the detection response time of the sound event detection task, and improve the user experience of the keyword detection task and/or the sound event detection task.

Description

Audio data enhancement method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computers, and in particular, to an audio data enhancement method and apparatus, an electronic device, and a storage medium.
Background
Currently, Keyword Detection (KWS) and Sound Event Detection (SED) are two speech tasks commonly performed by edge smart voice devices.
The keyword detection task needs to reduce the false wake-up rate while guaranteeing the detection rate, and the sound event detection task needs a detection delay as short as possible, i.e., the closer the detection time point is to the moment the sound event occurs, the better.
The existing keyword detection task and/or sound event detection task are generally implemented by deep learning methods. Data enhancement is an important speech data processing technique used in deep learning methods for the keyword detection task and/or the sound event detection task. At present, data enhancement modes for the keyword detection task and/or the sound event detection task mainly include adding noise, adjusting audio speed, adjusting the audio fundamental frequency, time-domain shifting, adjusting volume, and the like.
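As a purely illustrative sketch of these conventional enhancement modes (not part of the claimed method), assuming 1-D floating-point waveforms in numpy; the SNR value, shift range and gain below are arbitrary example parameters:

    import numpy as np

    def add_noise(wave, noise, snr_db=10.0):
        # Mix noise into the waveform at an assumed signal-to-noise ratio.
        noise = np.resize(noise, wave.shape)
        p_signal = np.mean(wave ** 2) + 1e-12
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10.0)))
        return wave + scale * noise

    def time_shift(wave, max_shift=1600):
        # Conventional time-domain shifting by a random offset (in samples).
        return np.roll(wave, np.random.randint(-max_shift, max_shift + 1))

    def adjust_volume(wave, gain_db=6.0):
        # Conventional volume adjustment by a gain in decibels.
        return wave * (10 ** (gain_db / 20.0))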
For the keyword detection task, these data enhancement modes can still leave false wake-ups caused by similar pronunciations, word truncation and other reasons, in which an incomplete keyword wakes the device. For example, if the wake-up word is "small and tiny", a voice such as "smiling", whose pronunciation is similar to the truncated word "tiny", may cause the device to wake up. In that case, when the user speaks a phrase similar to "smiling" in daily conversation, the device may be falsely triggered, which reduces the accuracy of speech recognition and degrades the user experience.
For the sound event detection task, the detection response time may be too long. For example, in infant-cry detection, the edge smart voice device and the infant are in one room while the user is in another room for some reason, and the user relies on the edge smart voice device to detect whether the infant is crying so as to enter the infant's room and provide care. In this case, if it takes too long to detect the crying after the infant starts to cry, the user may not learn of the crying in time and may fail to take corresponding measures promptly.
Therefore, audio data enhancement techniques still need to be further improved and developed.
Disclosure of Invention
In view of the above, the present disclosure provides an audio data enhancement method, apparatus, electronic device and storage medium to improve accuracy of speech recognition of a keyword detection task, shorten detection response duration of a sound event detection task, and improve user experience of the keyword detection task and/or the sound event detection task.
The technical scheme of the disclosure is realized as follows:
a method of audio data enhancement, comprising:
determining an audio recognition task, wherein the audio recognition task is a keyword detection task and/or a sound event detection task;
receiving audio data associated with the audio recognition task;
according to the audio recognition task, splitting and recombining the audio data to obtain enhanced sample data aiming at the audio recognition task;
and obtaining an audio training sample aiming at the audio recognition task according to the enhancement sample data and the audio recognition task.
Further, when the audio recognition task is a keyword detection task, the splitting and recombining the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task includes:
cutting off non-voice data in the audio data;
according to the voice time in the audio data and the word number of the keyword related to the keyword detection task, segmenting the audio data to obtain at least two sections of audio subdata;
obtaining initial audio sample data according to the at least two sections of audio subdata;
splicing interference audio data at two ends of the initial audio sample data to obtain enhanced sample data for the keyword detection task, wherein the interference audio data are derived from non-trigger audio data in training data associated with the keyword detection task.
Further, the cutting of the non-voice data in the audio data is implemented by using a voice activity detection (VAD) method.
Further, the obtaining initial audio sample data according to the at least two segments of audio sub data includes:
determining each section of audio subdata in the at least two sections of audio subdata as the initial audio sample data;
or randomly arranging and splicing any two or more segments of the audio sub-data to obtain the initial audio sample data.
Further, in a case that the audio recognition task is a keyword detection task, obtaining an audio training sample for the audio recognition task according to the enhancement sample data and the audio recognition task includes:
and under the condition that the audio content in the enhancement sample data is inconsistent with the keyword of the keyword detection task, determining the labeling information associated with the enhancement sample data as a non-trigger type, and determining the enhancement sample data and the labeling information associated with the enhancement sample data as a non-trigger type audio training sample aiming at the keyword detection task.
Further, when the audio recognition task is a keyword detection task, obtaining an audio training sample for the audio recognition task according to the enhancement sample data and the audio recognition task, including:
and under the condition that the audio content in the enhancement sample data is consistent with the keywords of the keyword detection task, determining the labeling information associated with the enhancement sample data as a trigger category, and determining the enhancement sample data and the labeling information associated with the enhancement sample data as a trigger category audio training sample aiming at the keyword detection task.
Further, when the audio recognition task is a sound event detection task, the splitting and recombining the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task includes:
under the condition that the time length of the audio data is within a preset time length threshold value range, acquiring sub-audio segment data meeting a preset time length condition from the audio data, and determining the sub-audio segment data as initial audio sample data;
splicing interference audio data at two ends of the initial audio sample data to obtain enhanced sample data aiming at the sound event detection task, wherein the interference audio data is derived from non-trigger audio data in training data related to the sound event detection task.
Further, the audio data enhancement method further includes:
and in the case that the time length of the audio data is beyond the time length threshold range, discarding the audio data.
Further, the time length threshold range is greater than or equal to half of the time length of the enhancement sample data;
the preset time length condition is greater than or equal to half of the time length of the enhancement sample data and less than the time length of the enhancement sample data.
Further, when the audio recognition task is a sound event detection task, obtaining an audio training sample for the audio recognition task according to the enhancement sample data and the audio recognition task, including:
and determining the labeling information associated with the enhancement sample data as a trigger type, and determining the enhancement sample data and the labeling information associated with the enhancement sample data as a trigger type audio training sample aiming at the sound event detection task.
Further, after obtaining the audio training samples for the audio recognition task, the audio data enhancement method further comprises:
training a joint network model for executing the keyword detection task and/or the sound event detection task based on the audio training sample;
and executing at least one of the keyword detection task and the sound event detection task by using the trained joint network model.
An audio data enhancement apparatus comprising:
a task determination module configured to perform determining an audio recognition task, wherein the audio recognition task is a keyword detection task and/or a sound event detection task;
a data receiving module configured to perform receiving audio data associated with the audio recognition task;
a splitting and recombining module configured to perform splitting and recombining the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task;
and a sample acquisition module configured to perform obtaining an audio training sample for the audio recognition task according to the enhancement sample data and the audio recognition task.
An electronic device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute the executable instructions to implement the audio data enhancement method as defined in any one of the above.
A computer readable storage medium having at least one instruction which, when executed by a processor of an electronic device, enables the electronic device to implement an audio data enhancement method as claimed in any preceding claim.
According to the technical scheme, the received audio data are split and recombined according to the audio recognition task to obtain enhancement sample data for the audio recognition task, and an audio training sample for the audio recognition task is then obtained. The resulting training samples have more prominent keyword features for the keyword detection task or more prominent sound features for the sound event detection task, which can improve the speech recognition accuracy of the keyword detection task, shorten the detection response time of the sound event detection task, and may improve the user experience of the keyword detection task and/or the sound event detection task.
Drawings
FIG. 1 is a flow diagram illustrating a method of audio data enhancement according to an exemplary embodiment;
FIG. 2 is a flowchart illustrating an audio data splitting and reassembling process for an audio recognition task, according to an example embodiment;
FIG. 3 is a schematic diagram illustrating slicing of audio data in accordance with an exemplary embodiment;
FIG. 4 is a flowchart illustrating an audio data splitting and reassembling process for a sound event detection task, according to an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating an enhancement sample data length relationship in accordance with an exemplary embodiment;
FIG. 6 is a schematic diagram illustrating a joint network model in accordance with an exemplary embodiment;
FIG. 7 is a flowchart illustrating an application scenario of a method of audio data enhancement according to an exemplary embodiment;
FIG. 8 is a schematic diagram illustrating the structure of an audio data enhancement apparatus according to an exemplary embodiment;
fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more clearly understood, the present disclosure is further described in detail below with reference to the accompanying drawings and examples.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating an audio data enhancement method according to an exemplary embodiment, and as shown in fig. 1, the audio data enhancement method according to the embodiment of the present disclosure mainly includes the following steps:
step 101, determining an audio recognition task, wherein the audio recognition task is a keyword detection task and/or a sound event detection task;
step 102, receiving audio data associated with the audio recognition task;
step 103, splitting and recombining the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task;
and step 104, obtaining an audio training sample for the audio recognition task according to the enhancement sample data and the audio recognition task, wherein the training sample is used for training a joint network model for executing the keyword detection task and/or the sound event detection task.
In the audio data enhancement method of the embodiment of the present disclosure, the received audio data are split and recombined according to the audio recognition task to obtain enhancement sample data for the audio recognition task, from which audio training samples for the audio recognition task are obtained. In the technical scheme of the embodiment of the present disclosure, splitting and recombining the received audio data achieves targeted reorganization of the training samples for the audio recognition task: the obtained training samples have more prominent keyword features for the keyword detection task or more prominent sound features for the sound event detection task. Consequently, a joint network model for executing the keyword detection task and/or the sound event detection task, after being trained with the training samples obtained by the technical scheme of the present disclosure, can improve the speech recognition accuracy of the keyword detection task and shorten the detection response time of the sound event detection task, thereby improving the user experience of the keyword detection task and/or the sound event detection task.
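The overall flow of steps 101 to 104 can be sketched as follows. This is only a hedged illustration: the helper functions (trim_non_speech, split_equally, recombine, crop_sub_segment, pad_with_interference, label_for) are assumed placeholders, some of which are sketched later in this description, rather than functions defined by the disclosure.

    def build_training_samples(task, audio, keyword_word_count=None,
                               interference_pool=None, sample_len=None):
        # Steps 101/102: task is "kws" or "sed"; audio is the received waveform.
        if task == "kws":
            # Step 103 (KWS): trim non-voice data, split by word count,
            # recombine the segments, then splice interference audio at both ends.
            speech = trim_non_speech(audio)
            pieces = split_equally(speech, keyword_word_count)
            candidates = recombine(pieces)
            enhanced = [(label, pad_with_interference(sample, interference_pool, sample_len))
                        for label, sample in candidates]
        else:
            # Step 103 (SED): keep audio within the duration threshold, crop a
            # sub-segment, then splice interference audio to the fixed length.
            segment = crop_sub_segment(audio, sample_len)
            enhanced = [] if segment is None else [
                ("event", pad_with_interference(segment, interference_pool, sample_len))]
        # Step 104: attach trigger / non-trigger annotation information.
        return [(sample, label_for(task, label)) for label, sample in enhanced]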
Fig. 2 is a flowchart illustrating an audio data splitting and reassembling process for an audio recognition task according to an exemplary embodiment, where, as shown in fig. 2, in some embodiments, in the case that the audio recognition task is a keyword detection task, step 103 includes:
step 10311, cutting off non-voice data in the audio data;
step 10312, segmenting the audio data according to the voice time length in the audio data and the word number of the keyword related to the keyword detection task to obtain at least two sections of audio subdata;
step 10313, obtaining initial audio sample data according to the at least two sections of audio subdata;
and step 10314, splicing the interference audio data at two ends of the initial audio sample data to obtain enhanced sample data for the keyword detection task, wherein the interference audio data is derived from the non-trigger audio data in the training data associated with the keyword detection task.
In the process of collecting audio data, the speaker usually does not speak in a completely quiet environment, so the audio data normally also contains content other than the voice data segments, such as background-sound segments and silent blank segments. In the embodiment of the present disclosure, after the non-voice data in the audio data is cut off in step 10311, only the voice data in the audio data is retained, so that the proportion of voice content in the audio data is maximized. In this case, the audio training samples obtained using only the retained voice data have the most prominent voice features, and the recognition accuracy of the network model for executing the keyword detection task trained on these audio training samples can be greatly improved.
In some embodiments, step 10311 may be implemented using a Voice Activity Detection (VAD) method.
In general, after the non-voice data in the audio data is cut off by the voice activity detection method, several independent segments each containing only voice content are obtained. Therefore, in some embodiments, after the non-voice data is cut off by the voice activity detection method to obtain multiple segments of segment data, step 10311 further includes splicing the multiple segments of segment data in chronological order to obtain audio data containing only the complete voice content.
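A minimal sketch of step 10311 follows, using a simple short-time energy threshold as a stand-in for a real VAD (the frame length and threshold are illustrative assumptions); the retained voice frames are spliced in chronological order:

    import numpy as np

    def trim_non_speech(wave, frame_len=400, energy_threshold=1e-3):
        # Split the waveform into frames and keep only frames whose
        # short-time energy exceeds the assumed threshold.
        n_frames = len(wave) // frame_len
        kept = []
        for i in range(n_frames):
            frame = wave[i * frame_len:(i + 1) * frame_len]
            if np.mean(frame ** 2) > energy_threshold:
                kept.append(frame)
        # Splice the retained voice segments in time order (step 10311).
        return np.concatenate(kept) if kept else wave[:0]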
For the keyword detection task, false recognition and false triggering are often caused by detected voice content whose pronunciation is similar to the keyword; for example, when the wake-up word, i.e., the keyword, is "small and tiny", false wake-ups occur for similar-sounding content such as "smile", "tiny" or "guard school". Therefore, in order to recognize each word in the keyword accurately and avoid such false wake-ups, in some embodiments of the present disclosure each word in the audio sub-data is segmented in steps 10312 and 10313, and all possible non-keyword combinations similar to the keyword are obtained from the segmented words; in subsequent steps (see the following description), non-trigger category audio training samples are obtained based on the initial audio sample data of these possible non-keyword combinations. This ensures that voices of other non-wake-up words similar to the wake-up word are not recognized as the keyword, so false wake-ups are avoided and the accuracy of speech recognition is improved. Based on this, in a preferred embodiment, the voice content contained in the audio data associated with the keyword detection task is the keyword itself; for example, if the keyword is "small and tiny", the voice content contained in the audio data is "small and tiny".
In some embodiments, if the voice duration of the audio data obtained after cutting off the non-voice data is S and the number of words in the keyword is N, then in step 10312 the audio data is segmented into at least two segments of audio sub-data by dividing the voice duration S into N equal parts, each of duration S/N; after segmentation, each segment of audio sub-data therefore essentially contains one word of the keyword. In some embodiments, N is greater than or equal to 2, i.e., the keyword contains at least two words: for the keyword detection task, if the keyword contained only one word, a false trigger would very likely occur whenever a similar or identical pronunciation appeared in daily conversation or environmental sound, so the keyword should not be set to a single word. In addition, for a keyword detection task such as wake-up, an overly long keyword also degrades the user experience because the utterance takes long to speak, so the keyword should not be too long; for example, in an alternative embodiment the keyword may be limited to 10 words, and in further alternative embodiments to no more than 6 words or no more than 5 words.
FIG. 3 is a schematic diagram illustrating the segmentation of audio data according to an exemplary embodiment. As shown in FIG. 3, the audio data contains a keyword composed of 4 words, denoted A1, A2, A3 and A4, and the voice duration of the audio data is denoted S. The voice duration S is divided into 4 equal parts (the number of parts is determined by the number of words; in the embodiment shown in FIG. 3 the number of words is 4, so the number of parts is 4), yielding four segments of audio sub-data N1, N2, N3 and N4 of equal duration S/4. Because the speaker's pronunciation duration differs from word to word when voicing the keyword, each segment of audio sub-data may also contain a residual fragment of the voice of an adjacent word. For example, as shown in FIG. 3, audio sub-data N1 may contain, in addition to the word A1, a small residual fragment of the voice of the word A2; since N1 is mainly the voice of the word A1, this residual fragment of A2 does not affect the expression of the voice features of the word A1 by N1. Similarly, audio sub-data N2 contains, in addition to the word A2, a small residual fragment of the voice of the word A3, which does not affect the expression of the voice features of the word A2 by N2; and audio sub-data N3 contains, in addition to the word A3, a small residual fragment of the voice of the word A4, which does not affect the expression of the voice features of the word A3 by N3. A text-voice residual fragment is a fragment that contains only a small part of the complete pronunciation of a word.
Taking the ABAB-structure keyword "small and tiny" as an example, in FIG. 3 the word A1 corresponds to the first A in the ABAB structure, the word A2 corresponds to the first B, the word A3 corresponds to the second A, and the word A4 corresponds to the second B.
For the audio data of the ABAB-structure keyword, 4 segments of audio sub-data are therefore obtained by segmentation after step 10312.
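Under the assumption that the trimmed waveform contains only the keyword voice, step 10312 can be sketched as dividing the voice duration S into N equal parts, one per keyword word (N = 4 in the FIG. 3 example):

    def split_equally(speech, num_words):
        # Divide the trimmed speech into num_words segments of equal length,
        # so each segment essentially carries one word of the keyword.
        seg_len = len(speech) // num_words
        return [speech[i * seg_len:(i + 1) * seg_len] for i in range(num_words)]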
In some embodiments, step 10313 may comprise:
determining each section of audio subdata in the at least two sections of audio subdata as initial audio sample data;
or randomly arranging and splicing any two or more segments of the audio sub-data to obtain initial audio sample data.
Taking the embodiment of segmenting the audio data shown in fig. 3 as an example, in step 10313, obtaining initial audio sample data according to 4 segments of audio sub-data (i.e., audio sub-data N1, audio sub-data N2, audio sub-data N3, and audio sub-data N4), may include:
(1) each piece of audio sub-data is used alone as initial audio sample data, for example: taking the audio sub-data N1 as initial audio sample data (the first A in the ABAB structure), the audio sub-data N2 as initial audio sample data (the first B in the ABAB structure), the audio sub-data N3 as initial audio sample data (the second A in the ABAB structure), and the audio sub-data N4 as initial audio sample data (the second B in the ABAB structure). Although the audio sub-data N1 and the audio sub-data N3 correspond to the first A and the second A in the ABAB structure respectively, the two A's are pronounced the same, so only one A (the audio sub-data N1 or the audio sub-data N3) needs to be taken as initial audio sample data; similarly, for the audio sub-data N2 and the audio sub-data N4, only one B (the audio sub-data N2 or the audio sub-data N4) needs to be taken as initial audio sample data. That is, in some embodiments, initial audio sample data with the same voice content may be de-duplicated;
(2) any two segments of audio sub-data are randomly arranged and spliced to obtain initial audio sample data, for example: splicing the audio sub-data N1 and N2 (N1 first, N2 second) yields initial audio sample data with an AB structure, splicing N1 and N3 (N1 first, N3 second) yields initial audio sample data with an AA structure, and splicing N1 and N4 (N1 first, N4 second) yields initial audio sample data with an AB structure; the AB-structure initial audio sample data formed by N1 followed by N2 and the AB-structure initial audio sample data formed by N1 followed by N4 have the same voice content, so de-duplication processing may be performed on them;
(3) randomly arranging any three sections of audio subdata and splicing to obtain initial audio sample data, for example, splicing any three audio subdata of the audio subdata N1, the audio subdata N2, the audio subdata N3 and the audio subdata N4 to form initial audio sample data of an ABA structure, an AAB structure, an ABB structure, a BBA structure and a BAB structure, and carrying out deduplication processing under the condition that the initial audio sample data with the same voice content appears;
(4) randomly arranging and splicing the four sections of audio sub-data to obtain initial audio sample data, for example: the audio sub data N1, the audio sub data N2, the audio sub data N3 and the audio sub data N4 are spliced together in various combination forms to form initial audio sample data of an AABB structure, a BBAA structure and a BABA structure, and the audio sample data can be subjected to de-duplication processing under the condition that the initial audio sample data with the same voice content appears.
Modes (2), (3) and (4) above correspond to the case of two or more segments of audio sub-data.
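A sketch of step 10313 for the ABAB example follows; the per-segment letter labels and the choice of de-duplicating by the label sequence are assumptions made for illustration:

    import itertools
    import numpy as np

    def recombine(segments, labels=("A", "B", "A", "B")):
        # segments: waveforms N1..N4; labels: the keyword word each carries.
        seen, samples = set(), []
        for r in range(1, len(segments) + 1):                 # modes (1) to (4)
            for order in itertools.permutations(range(len(segments)), r):
                content = "".join(labels[i] for i in order)   # e.g. "AB", "ABA"
                if content in seen:                           # de-duplicate by content
                    continue
                seen.add(content)
                samples.append((content, np.concatenate([segments[i] for i in order])))
        return samples   # (content label, initial audio sample data) pairs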
In the embodiment of the present disclosure, for the keyword detection task, the enhancement sample data obtained by splicing in step 10314 includes the initial audio sample data and the interference audio data spliced at both ends of the initial audio sample data, so that the enhancement sample data contains both the features of the audio data associated with the keyword detection task and the features of the interference audio data. Because the interference audio data is derived from non-trigger audio data, whose characteristics ensure that it does not trigger the execution of subsequent operations (e.g., does not trigger a wake-up), the network model for executing the keyword detection task can be trained with enhancement sample data containing the interference audio data. Depending on the training purpose, the network model thereby learns not to trigger subsequent operations for other keyword-like content whose pronunciation is similar to the specific keyword (for example, structures other than the "small and tiny" structure described above), which can greatly reduce the false wake-up rate on incomplete keywords in practical application scenarios. Moreover, the same approach can be used so that, according to the training purpose, the network model for executing the keyword detection task learns to trigger subsequent operations only for the specific keyword (e.g., "small and tiny" as described above).
For the keyword detection task, it must be ensured that detection passes only when the detected voice completely matches the keyword and in no other case. For example, for the ABAB-structure keyword "small and tiny" described above, only the detection of the voice "small and tiny" should pass detection and trigger the execution of subsequent operations (e.g., wake-up); no voice other than "small and tiny" should trigger the execution of subsequent operations. In this case, among all the obtained initial audio sample data, any voice other than the ABAB-structure keyword should not trigger the execution of subsequent operations, that is, any voice other than the ABAB-structure keyword needs to be determined as a non-trigger category voice to avoid false triggering. Further, in the case where the audio recognition task is a keyword detection task, step 104 includes:
step 1041, in case that the audio content in the enhancement sample data is not consistent with the keyword of the keyword detection task, determining the annotation information associated with the enhancement sample data as a non-trigger type, and determining the enhancement sample data and the annotation information associated therewith as a non-trigger type audio training sample for the keyword detection task.
In some embodiments, to enhance the ability of the audio recognition task to trigger subsequent operations only by the speech of the ABAB structure keyword, step 104 may further include:
step 1042, under the condition that the audio content in the enhanced sample data is consistent with the keywords of the keyword detection task, determining the labeling information associated with the enhanced sample data as a trigger type, and determining the enhanced sample data and the labeling information associated therewith as a trigger type audio training sample for the keyword detection task.
In some embodiments, where the audio recognition task is a keyword detection task, the length of the enhancement sample data is greater than the length of the keyword audio that can trigger subsequent operations.
For the keyword detection task, the keyword is fixed, and only voice information with the voice features of the keyword can trigger the execution of subsequent operations (e.g., trigger a wake-up). Therefore, in the audio data enhancement method of the embodiment of the present disclosure, for the enhancement sample data of the keyword detection task, the content of the received audio data associated with the keyword detection task does not necessarily have to contain the keyword content; other audio data is possible, and the number of words of the speech text involved in that other audio data may be arbitrary. It is only required that, for the resulting enhancement sample data, the annotation information associated with the enhancement sample data is determined as the non-trigger category if its content is not the keyword, and as the trigger category if its content is the keyword. Preferably, for the purpose of avoiding false triggering caused by similar pronunciations, the content of the received audio data associated with the keyword detection task should contain the keyword content.
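The labeling rule of steps 1041 and 1042 can be sketched as a simple comparison; the content label produced by the recombination sketch above and the keyword label "ABAB" are assumed representations:

    def label_for_kws(content_label, keyword_label="ABAB"):
        # Trigger category only when the spliced content exactly matches the
        # keyword; every other combination is the non-trigger category.
        return "trigger" if content_label == keyword_label else "non_trigger"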
Fig. 4 is a flowchart illustrating an audio data splitting and reassembling process for a sound event detection task according to an exemplary embodiment. As shown in fig. 4, in the case that the audio recognition task is the sound event detection task, step 103 includes:
step 10321, under the condition that the time length of the audio data is within the preset time length threshold range, acquiring sub-audio segment data meeting the preset time length condition from the audio data, and determining the sub-audio segment data as initial audio sample data;
step 10322, splicing the interference audio data at two ends of the initial audio sample data to obtain enhanced sample data for the sound event detection task, wherein the interference audio data is derived from the non-trigger audio data in the training data associated with the sound event detection task.
The preset time length threshold range is used to ensure that the audio data can meet the training requirement. If the audio data is long enough, audio data meeting the length requirement can be obtained by extracting a segment from it; but if the audio data is too short, the feature information it contains is too sparse, and even repeatedly splicing the over-short audio data does not yield more feature information. Enhancement sample data obtained from over-short audio data therefore does not contain enough features, and the trained network model cannot fulfil the purpose of the sound event detection task. Based on this, in some embodiments, the audio data enhancement method of the present disclosure further includes:
and under the condition that the time length of the audio data is out of the preset time length threshold range, discarding the audio data.
And, in some embodiments, the time length threshold range is greater than or equal to half the time length of the enhancement sample data; the preset time length condition is greater than or equal to half of the time length of the enhancement sample data and less than the time length of the enhancement sample data.
The time length of the enhancement sample data is set according to the set time length of the training sample. For example, if the set time length of the training sample is 3 seconds, the time length of the enhancement sample data is 3 seconds, the time length threshold range is greater than or equal to 1.5 seconds, and the preset time length condition is greater than or equal to half of 3 seconds and less than 3 seconds, i.e., from 1.5 seconds (inclusive) to 3 seconds (exclusive). That is, when the time length of the audio data is at least 1.5 seconds, the length of the sub-audio segment data obtained from the audio data is between 1.5 seconds (inclusive) and 3 seconds (exclusive).
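Under the 3-second example above, the duration check and the crop of step 10321 can be sketched as follows (lengths are in samples; the random choice of crop length and start point is an assumption):

    import numpy as np

    def crop_sub_segment(wave, sample_len):
        # Discard audio shorter than half the enhancement sample length (M/2).
        min_len = sample_len // 2
        if len(wave) < min_len:
            return None                          # the audio data is discarded
        # Choose a sub-segment length in [M/2, M) that also fits the audio.
        high = min(len(wave), sample_len - 1)
        seg_len = np.random.randint(min_len, high + 1)
        start = np.random.randint(0, len(wave) - seg_len + 1)
        return wave[start:start + seg_len]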
In the embodiment of the present disclosure, for the sound event detection task, the enhancement sample data obtained by splicing in step 10322 includes the audio data associated with the sound event detection task and the interference audio data spliced at both ends of that audio data, so that the enhancement sample data contains both the features of the audio data associated with the sound event detection task and the features of the interference audio data. The interference audio data is derived from non-trigger audio data, whose characteristics ensure that it does not trigger the execution of subsequent operations (e.g., does not trigger a wake-up). Meanwhile, the audio data is discarded when its time length is outside the preset time length threshold range, the time length threshold range is greater than or equal to half of the time length of the enhancement sample data, and the preset time length condition is greater than or equal to half of, and less than, the time length of the enhancement sample data; as a result, the audio data associated with the sound event detection task occupies at least half of the obtained enhancement sample data, so the enhancement sample data contains at least half of the features of the audio data associated with the sound event detection task. Training the network model for executing the sound event detection task with such enhancement sample data enables it to quickly detect a specific sound (e.g., a child's crying) and trigger subsequent operations, so the response time for the specific sound can be greatly shortened.
Because the network model has a length requirement on the training samples, in some embodiments the length of the enhancement sample data is a preset fixed length. FIG. 5 is a schematic diagram illustrating the length relationship of the enhancement sample data according to an exemplary embodiment. As shown in FIG. 5, the preset fixed length of the enhancement sample data 501 is M; if the length of the initial audio sample data 502 is T and T is smaller than M, interference audio data 503 of random lengths are spliced at the head and tail ends of the initial audio sample data 502, and the total length of the spliced interference audio data 503 is M-T, which ensures that the total length of the finally obtained enhancement sample data 501 is M.
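A sketch of the FIG. 5 padding, equally usable for step 10314; drawing the interference audio from a pool of non-trigger waveforms and splitting the remaining M - T budget randomly between head and tail are stated assumptions:

    import numpy as np

    def pad_with_interference(sample, interference_pool, total_len):
        # sample: initial audio sample data of length T, with T < M (= total_len).
        budget = total_len - len(sample)
        head_len = np.random.randint(0, budget + 1)        # random head/tail split
        tail_len = budget - head_len
        src = interference_pool[np.random.randint(len(interference_pool))]
        head = np.resize(src, head_len)                    # non-trigger interference
        tail = np.resize(src, tail_len)
        return np.concatenate([head, sample, tail])        # total length is exactly M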
In some embodiments, where the audio recognition task is a sound event detection task, step 104 comprises:
and determining the marking information associated with the enhancement sample data as a trigger category, and determining the enhancement sample data and the marking information associated with the enhancement sample data as a trigger category audio training sample aiming at the sound event detection task.
For the sound event detection task, rapid detection and recognition of sound is required in practical application scenarios, and the detection delay needs to be as short as possible to meet the requirement of fast response. By using the enhancement sample data for the sound event detection task obtained in the embodiment of the present disclosure, the network model executing the sound event detection task can learn to detect successfully even when the length of the detected audio information is shorter than the length of the enhancement sample data, thereby shortening the sound detection response time.
After obtaining the audio training samples for the audio recognition task, the audio data enhancement method of the embodiment of the present disclosure further includes:
training a joint network model for executing a keyword detection task and/or a sound event detection task based on the audio training sample;
and executing at least one of the keyword detection task and the sound event detection task by utilizing the trained joint network model.
The specific process of training may include:
acquiring a training sample set, wherein the training sample set comprises a plurality of enhancement sample data and labeling information associated with each enhancement sample data; the enhancement sample data comprises enhancement sample data aiming at the keyword detection task and/or enhancement sample data aiming at the sound event detection task; in the training sample set, the enhanced sample data for the keyword detection task further comprises enhanced sample data of a non-trigger type, and the labeling information associated with the enhanced sample data of the non-trigger type is non-trigger type labeling information; the labeling information associated with the enhanced sample data of the sound event detection task is trigger type labeling information; in some embodiments, in the training sample set, the enhancement sample data for the keyword detection task may further include enhancement sample data of a trigger category, and the tagging information associated with the enhancement sample data of the trigger category is trigger category tagging information;
The enhancement sample data is input into a joint network model to obtain a result corresponding to the enhancement sample data, where the joint network model is used to execute the keyword detection task and/or the sound event detection task. FIG. 6 is a schematic diagram of the joint network model according to an exemplary embodiment. As shown in FIG. 6, when used to execute the keyword detection task and the sound event detection task, the joint network model includes an encoding layer, a keyword detection task decoding layer and a sound event detection task decoding layer: audio data processing for the keyword detection task and the sound event detection task shares the same encoding layer, while two different decoding layers, namely the keyword detection task decoding layer and the sound event detection task decoding layer, are used for the two different tasks. In the embodiment of the present disclosure, the enhancement sample data input into the joint network model, as well as the audio data input when executing the keyword detection task and the sound event detection task, first enter the encoding layer, which performs encoding; then, according to the detection task, the encoded data output by the encoding layer is input into the keyword detection task decoding layer or the sound event detection task decoding layer, the keyword detection task decoding layer outputs the result for the keyword detection task, and the sound event detection task decoding layer outputs the result for the sound event detection task, where the result may be the execution probability of triggering a subsequent operation (e.g., triggering a wake-up);
obtaining a value of a target loss function based on a result corresponding to the enhancement sample data and annotation information associated with the enhancement sample data;
and training the joint network model by adjusting parameters of the joint network model according to the value of the target loss function to obtain the trained joint network model.
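A minimal sketch of the joint network model of FIG. 6, assuming a PyTorch implementation; the shared convolutional encoding layer, the layer sizes and the two-class outputs are illustrative assumptions rather than the disclosed architecture:

    import torch
    import torch.nn as nn

    class JointKwsSedModel(nn.Module):
        def __init__(self, n_mels=40, hidden=64, n_kws_classes=2, n_sed_classes=2):
            super().__init__()
            # Shared encoding layer for both detection tasks.
            self.encoder = nn.Sequential(
                nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            # Separate decoding layers for the two tasks.
            self.kws_decoder = nn.Linear(hidden, n_kws_classes)
            self.sed_decoder = nn.Linear(hidden, n_sed_classes)

        def forward(self, features, task):
            # features: (batch, n_mels, time) acoustic features of the samples.
            encoded = self.encoder(features).squeeze(-1)
            logits = self.kws_decoder(encoded) if task == "kws" else self.sed_decoder(encoded)
            # Probability of triggering the subsequent operation (e.g., wake-up).
            return torch.softmax(logits, dim=-1)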
Fig. 7 is a flowchart illustrating an application scenario of an audio data enhancement method according to an exemplary embodiment, where the application scenario mainly includes the following steps, as shown in fig. 7.
Step 701, determining an audio recognition task; if the audio recognition task is a keyword detection task, performing step 711, and if the audio recognition task is a sound event detection task, performing step 721.
Step 711, receiving the audio data, and then executing step 712.
Wherein the audio data received in step 711 is audio data associated with the keyword detection task.
Step 712, cut off the non-voice data in the audio data, and then execute step 713.
In some embodiments, a voice activity detection method is used to cut off the non-voice data in the audio data to obtain multiple segments of segment data, and the multiple segments are spliced in chronological order to obtain audio data containing only the complete voice content.
Step 713, according to the voice duration in the audio data and the word number of the keyword associated with the keyword detection task, segmenting the audio data to obtain at least two sections of audio subdata, and then executing step 714.
Taking a keyword with the ABAB structure as an example and referring to FIG. 3, the audio data contains a keyword composed of 4 words, denoted A1, A2, A3 and A4, where A1 corresponds to the first A in the ABAB structure, A2 to the first B, A3 to the second A, and A4 to the second B. In some embodiments, the audio data may be divided equally according to the number of words of the keyword as shown in FIG. 3; for example, if the keyword has four words, the audio data is divided equally into four segments of audio sub-data.
Step 714, obtaining initial audio sample data according to the at least two sections of audio subdata, and then executing step 715.
In step 714, each piece of audio sub-data in the at least two segments of audio sub-data may be determined as initial audio sample data; or any two or more segments of the audio sub-data may be randomly arranged and spliced to obtain initial audio sample data.
Taking the keyword as an ABAB structure as an example, in step 714, the splicing includes four ways: (1) independently taking a section of audio subdata as initial audio sample data; (2) arranging any two sections of audio sub-data in all possible sequences to obtain initial audio sample data; (3) arranging any three sections of audio sub-data in all possible sequences to obtain initial audio sample data; (4) and arranging the four segments of audio sub-data in all possible orders to obtain initial audio sample data.
For mode (1), referring to FIG. 3, four pieces of audio sub-data with contents [A1], [A2], [A3] and [A4] are obtained, where the keyword word corresponding to the audio sub-data with content [A1] and the audio sub-data with content [A3] is A, and the keyword word corresponding to the audio sub-data with content [A2] and the audio sub-data with content [A4] is B.
In some embodiments, each piece of audio sub-data with content [A1], [A2], [A3] and [A4] is determined as initial audio sample data; in other embodiments, [A1], [A2], [A3] and [A4] are de-duplicated according to content before obtaining the initial audio sample data. For example, because the keyword word corresponding to the audio sub-data with content [A1] and the audio sub-data with content [A3] is the same A, the two are de-duplicated and only one of them is retained; similarly, only one of the audio sub-data with content [A2] and the audio sub-data with content [A4] is retained.
For mode (2), referring to FIG. 3, multiple pieces of initial audio sample data are obtained, namely [A1,A2], [A1,A3], [A1,A4], [A2,A1], [A2,A3], [A2,A4], [A3,A1], [A3,A2], [A3,A4], [A4,A1], [A4,A2], [A4,A3], where the keyword words corresponding to the initial audio sample data with contents [A1,A2], [A1,A4], [A3,A2] and [A3,A4] are AB, the keyword words corresponding to the initial audio sample data with contents [A1,A3] and [A3,A1] are AA, the keyword words corresponding to the initial audio sample data with contents [A2,A1], [A2,A3], [A4,A1] and [A4,A3] are BA, and the keyword words corresponding to the initial audio sample data with contents [A2,A4] and [A4,A2] are BB.
In some embodiments, the initial audio sample data of all combined-form contents is retained to obtain subsequent enhancement sample data; in other embodiments, the initial audio sample data of all combined-form contents is de-duplicated according to content before obtaining subsequent enhancement sample data. For example, because the keyword words corresponding to the initial audio sample data with contents [A1,A2], [A1,A4], [A3,A2] and [A3,A4] are all AB, these are de-duplicated and only one of them is retained; similarly, only one of the initial audio sample data with contents [A1,A3] and [A3,A1] is retained, only one of the initial audio sample data with contents [A2,A1], [A2,A3], [A4,A1] and [A4,A3] is retained, and only one of the initial audio sample data with contents [A2,A4] and [A4,A2] is retained.
For the mode (3), referring to fig. 3, multiple pieces of initial audio sample data are obtained, which are the following contents:
[A1,A2,A3]、[A1,A3,A2]、[A2,A1,A3]、[A2,A3,A1]、[A3,A1,A2]、[A3,A2,A1]
[A1,A2,A4]、[A1,A4,A2]、[A2,A1,A4]、[A2,A4,A1]、[A4,A1,A2]、[A4,A2,A1]
[A1,A3,A4]、[A1,A4,A3]、[A3,A1,A4]、[A3,A4,A1]、[A4,A1,A3]、[A4,A3,A1]
[A2,A3,A4]、[A2,A4,A3]、[A3,A2,A4]、[A3,A4,A2]、[A4,A2,A3]、[A4,A3,A2]
In some embodiments, the initial audio sample data of all combined-form contents is retained to obtain subsequent enhancement sample data; in other embodiments, the initial audio sample data of all combined-form contents is de-duplicated according to content before obtaining subsequent enhancement sample data. For example, because the keyword words corresponding to the initial audio sample data with contents [A1,A2,A3], [A3,A2,A1], [A1,A4,A3] and [A3,A4,A1] are all ABA, these are de-duplicated and only one of them is retained; the other content combinations are de-duplicated in the same way.
For the method (4), referring to fig. 3, multiple pieces of initial audio sample data are obtained, which are the following contents:
[A1,A2,A3,A4]、[A1,A2,A4,A3]、[A1,A3,A2,A4]、[A1,A3,A4,A2]
[A1,A4,A2,A3]、[A1,A4,A3,A2]、[A2,A1,A3,A4]、[A2,A1,A4,A3]
[A2,A3,A1,A4]、[A2,A3,A4,A1]、[A2,A4,A1,A3]、[A2,A4,A3,A1]
[A3,A1,A2,A4]、[A3,A1,A4,A2]、[A3,A2,A1,A4]、[A3,A2,A4,A1]
[A3,A4,A1,A2]、[A3,A4,A2,A1]、[A4,A1,A2,A3]、[A4,A1,A3,A2]
[A4,A2,A1,A3]、[A4,A2,A3,A1]、[A4,A3,A1,A2]、[A4,A3,A2,A1]
In some embodiments, the initial audio sample data of all combined-form contents is retained to obtain subsequent enhancement sample data; in other embodiments, the initial audio sample data of all combined-form contents is de-duplicated according to content before obtaining subsequent enhancement sample data. For example, because the keyword words corresponding to the initial audio sample data with contents [A1,A3,A2,A4], [A1,A3,A4,A2], [A3,A1,A2,A4] and [A3,A1,A4,A2] are all AABB, these are de-duplicated and only one of them is retained; the other content combinations are de-duplicated in the same way.
And 715, splicing the interference audio data at two ends of the initial audio sample data to obtain enhanced sample data aiming at the keyword detection task, and then executing 716.
The interference audio data is derived from non-trigger audio data in training data related to the keyword detection task, and a network model of the keyword detection task does not trigger subsequent operation (such as triggering and awakening) based on a result obtained by the non-trigger audio data.
Wherein, the time length of the enhancement sample data is set according to the set time length of the training sample.
Generally, the audio time length of each word of the keyword is between 0.2 and 0.4 seconds, and the audio time length of the keyword of the ABAB structure is between 0.8 and 1.6 seconds, so in some embodiments, the audio time length of the training sample should be greater than 1.6 seconds, for example, the audio time length of the training sample may be 2 to 3 seconds, wherein the audio time length of the initial audio sample data is not greater than 1.6 seconds, and the part of the enhancement sample data except the audio of the keyword (initial audio sample data) is the interference audio data.
And 716, obtaining labeling information associated with the enhancement sample data based on the audio content in the enhancement sample data and the keyword content of the keyword detection task, and determining the enhancement sample data and its associated labeling information as an audio training sample for the keyword detection task.
There are two cases for the labeling information of the enhancement sample data of the keyword detection task: case one, the audio content in the enhancement sample data is inconsistent with the keyword of the keyword detection task; case two, the audio content in the enhancement sample data is consistent with the keyword of the keyword detection task.
In case one, in step 716, the annotation information associated with the enhancement sample data is determined to be a non-trigger category, and the enhancement sample data and the annotation information associated therewith are determined to be a non-trigger category audio training sample for the keyword detection task. In the embodiment of the present disclosure, in order to avoid false triggering caused by similar pronunciation, the determination of the non-trigger category audio training samples for case one is an indispensable step in this embodiment.
In case two, in step 716, the annotation information associated with the enhancement sample data is determined as the trigger category, and the enhancement sample data and the annotation information associated therewith are determined as the trigger category audio training samples for the keyword detection task. The determination of the trigger class audio training samples for case two is an optional step in this embodiment.
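A minimal sketch of the labeling in step 716, with an assumed keyword word sequence and assumed field names: the spoken content of the enhancement sample is compared with the keyword, and the sample is labeled as trigger or non-trigger accordingly.

```python
KEYWORD_TEXT = "ABAB"   # hypothetical word sequence of the keyword

def label_keyword_sample(enhancement_audio, spoken_text):
    # step 716: trigger category if the spoken content matches the keyword, otherwise non-trigger
    label = "trigger" if spoken_text == KEYWORD_TEXT else "non_trigger"
    return {"audio": enhancement_audio, "label": label}
```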
Step 721, audio data is received, followed by step 722.
Where the audio data received in step 721 is audio data associated with a sound event detection task.
Step 722, judging whether the time length of the audio data meets the requirement, if so, executing step 723, otherwise, discarding the audio data.
Because over-short audio data contains few audio features, the probability of recognition errors is high; if such audio data is used as a training sample, the error probability of the network model increases. Only when the minimum time length meets a certain requirement can the audio data be guaranteed to contain sufficient audio features, thereby improving recognition accuracy. In some embodiments, the judging in step 722 of whether the time length of the audio data meets the requirement may specifically include: judging whether the time length of the audio data is within a preset time length threshold range. The preset time length threshold range is greater than or equal to half of the time length of the enhancement sample data, which ensures that the obtained enhancement sample data contains at least half of the audio features related to sound event detection.
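A minimal sketch of the length check in step 722, assuming a 16 kHz sample rate, a 3-second enhancement-sample length, and an upper bound chosen here purely for illustration: audio is kept only when its duration lies within the preset threshold range, whose lower bound is half of the enhancement-sample length.

```python
SAMPLE_RATE = 16000                  # assumed sample rate
ENH_LEN = 3 * SAMPLE_RATE            # set enhancement-sample length (assumed 3 s)
MIN_LEN = ENH_LEN // 2               # lower bound: half of the enhancement-sample length
MAX_LEN = 30 * SAMPLE_RATE           # upper bound of the threshold range (illustrative only)

def length_ok(audio):
    # step 722: keep the audio only when its duration lies within the threshold range
    return MIN_LEN <= len(audio) <= MAX_LEN
```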
Step 723: acquire audio sub-segment data meeting a preset time length condition from the audio data, determine the audio sub-segment data as initial audio sample data, and then execute step 724.
In order to ensure that the obtained enhancement sample data contains at least half of the audio features related to sound event detection, the length of the audio sub-segment data is at least half of the enhancement sample data, so the preset time length condition may be greater than or equal to half of the time length of the enhancement sample data and less than the time length of the enhancement sample data. In this way, the obtained enhancement sample data contains at least half, and at most all, of the audio features related to sound event detection, and training the network model with such enhancement sample data improves both the accuracy and the response timeliness of the network model for sound event detection.
In some embodiments, in step 723, any audio segment satisfying the preset time length condition may be cut out of the audio data at a random position and used as the audio sub-segment data.
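A minimal sketch of step 723, reusing the ENH_LEN constant assumed in the sketch above: a sub-segment whose length is at least half of, and less than, the enhancement-sample length is cut out of the received audio data at a random position and used as the initial audio sample data.

```python
import numpy as np

def random_sub_segment(audio, enh_len=ENH_LEN, rng=None):
    # step 723: cut out a random segment whose length is at least half of,
    # and less than, the enhancement-sample length
    rng = np.random.default_rng() if rng is None else rng
    seg_len = int(rng.integers(enh_len // 2, enh_len))
    seg_len = min(seg_len, len(audio))                   # cannot exceed the received audio
    start = int(rng.integers(0, len(audio) - seg_len + 1))
    return audio[start:start + seg_len]
```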
Step 724, splicing the interference audio data at two ends of the initial audio sample data to obtain enhanced sample data for the sound event detection task, and then executing step 725.
The interference audio data are derived from non-trigger audio data in training data related to the sound event detection task, and the time length of the enhancement sample data is the time length set for the sound event detection task.
Step 725: determine the labeling information associated with the enhancement sample data as the trigger category, and determine the enhancement sample data together with its associated labeling information as a trigger category audio training sample for the sound event detection task.
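Tying steps 721 to 725 together, the following sketch reuses the hypothetical helpers length_ok, random_sub_segment and random_crop defined in the sketches above to turn one piece of received audio into a trigger-category training sample for the sound event detection task.

```python
import numpy as np

def make_sed_training_sample(audio, interference_audio, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    if not length_ok(audio):                         # step 722: discard audio of unsuitable length
        return None
    segment = random_sub_segment(audio, rng=rng)     # step 723: initial audio sample data
    pad_total = ENH_LEN - len(segment)               # step 724: splice interference at both ends
    left = int(rng.integers(0, pad_total + 1))
    sample = np.concatenate([
        random_crop(interference_audio, left, rng),
        segment,
        random_crop(interference_audio, pad_total - left, rng),
    ])
    return {"audio": sample, "label": "trigger"}     # step 725: trigger category label
```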
Fig. 8 is a schematic structural diagram illustrating an audio data enhancement apparatus according to an exemplary embodiment, and as shown in fig. 8, the audio data enhancement apparatus includes a task determination module 801, a data receiving module 802, a splicing and recombining module 803, and a sample obtaining module 804.
The task determining module 801 is configured to perform determining an audio recognition task, where the audio recognition task is a keyword detection task and/or a sound event detection task.
A data receiving module 802 configured to perform receiving audio data associated with an audio recognition task.
And the splicing and recombining module 803 is configured to split and recombine the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task.
The sample obtaining module 804 is configured to obtain an audio training sample for the audio recognition task according to the enhancement sample data and the audio recognition task, where the training sample is used for training a joint network model for executing the keyword detection task and/or the sound event detection task.
In some embodiments, in the case that the audio recognition task is a keyword detection task, the splicing recombination module 803 includes:
a non-speech removal submodule configured to perform removal of non-speech data in the audio data;
the audio segmentation submodule is configured to segment the audio data according to the voice time in the audio data and the word number of the keywords related to the keyword detection task to obtain at least two sections of audio subdata;
the first initial audio acquisition submodule is configured to obtain initial audio sample data according to the at least two segments of audio sub-data;
and the first audio splicing submodule is configured to splice interference audio data at two ends of the initial audio sample data to obtain enhanced sample data aiming at the keyword detection task, wherein the interference audio data is derived from non-trigger audio data in training data associated with the keyword detection task.
In some embodiments, the non-speech removal sub-module removes the non-speech data in the audio data using an active voice detection (VAD) method.
In some embodiments, the non-speech removal sub-module is further configured to, after cutting off the non-speech data by the active voice detection method to obtain a plurality of pieces of segment data, splice the plurality of pieces of segment data in chronological order to obtain audio data containing only the complete speech content.
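The disclosure does not fix a particular VAD implementation; the following sketch uses a simple frame-energy threshold as a stand-in (assuming 16 kHz float audio normalized to [-1, 1]) to drop non-speech frames and splice the remaining speech frames back together in chronological order.

```python
import numpy as np

FRAME = 320   # 20 ms frames at an assumed 16 kHz sample rate

def remove_non_speech(audio, threshold=0.01):
    """Drop frames whose RMS energy falls below `threshold` (audio assumed to be
    float samples normalized to [-1, 1]) and splice the remaining speech frames
    back together in their original chronological order."""
    n_frames = len(audio) // FRAME
    frames = audio[:n_frames * FRAME].reshape(n_frames, FRAME)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return frames[rms >= threshold].reshape(-1)      # order of frames is preserved
```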
In some embodiments, the first initial audio acquisition sub-module is further configured to perform:
determining each segment of audio sub-data in the at least two segments of audio sub-data as initial audio sample data; or
randomly arranging and splicing any two or more segments of the audio sub-data to obtain initial audio sample data.
In some embodiments, in the case that the audio recognition task is a keyword detection task, the sample obtaining module 804 further comprises:
and the non-trigger sample acquisition sub-module is configured to determine the annotation information associated with the enhancement sample data as a non-trigger type under the condition that the audio content in the enhancement sample data is inconsistent with the keywords of the keyword detection task, and determine the enhancement sample data and the annotation information associated with the enhancement sample data as a non-trigger type audio training sample aiming at the keyword detection task.
In some embodiments, in the case that the audio recognition task is a keyword detection task, the sample obtaining module 804 further comprises:
and the trigger sample acquisition sub-module is configured to determine the annotation information associated with the enhancement sample data as a trigger type under the condition that the audio content in the enhancement sample data is consistent with the keywords of the keyword detection task, and determine the enhancement sample data and the annotation information associated with the enhancement sample data as a trigger type audio training sample aiming at the keyword detection task.
In some embodiments, in the case that the audio recognition task is a sound event detection task, the splicing recombination module 803 includes:
the second initial audio acquisition submodule is configured to acquire audio sub-segment data meeting a preset time length condition from the audio data under the condition that the time length of the audio data is within a preset time length threshold range, and determine the audio sub-segment data as initial audio sample data;
and the second audio splicing submodule is configured to splice interference audio data at two ends of the initial audio sample data to obtain enhancement sample data for the sound event detection task, wherein the interference audio data are derived from non-trigger audio data in training data related to the sound event detection task, and the time length of the enhancement sample data is the time length set for the sound event detection task.
In some embodiments, the second initial audio acquisition sub-module is further configured to perform: and in the case that the time length of the audio data is beyond the time length threshold range, discarding the audio data.
In some embodiments, the temporal length threshold range is greater than or equal to half the temporal length of the enhancement sample data; the preset time length condition is greater than or equal to half of the time length of the enhancement sample data and less than the time length of the enhancement sample data.
In some embodiments, in the case that the audio recognition task is a sound event detection task, the sample acquisition module 804 is further configured to perform: and determining the marking information associated with the enhancement sample data as a trigger category, and determining the enhancement sample data and the marking information associated with the enhancement sample data as a trigger category audio training sample aiming at the sound event detection task.
In some embodiments, the audio data enhancement apparatus of the present disclosure further comprises:
a model training module configured to perform training of a joint network model performing a keyword detection task and/or a sound event detection task based on the audio training samples;
and the task execution module is configured to execute at least one of the keyword detection task and the sound event detection task by utilizing the trained joint network model.
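As one possible illustration only (the disclosure does not specify the architecture of the joint network model), the following sketch shows a shared encoder with separate keyword detection and sound event detection heads, trained for one step on stand-in features and labels.

```python
import torch
import torch.nn as nn

class JointAudioModel(nn.Module):
    def __init__(self, n_mels=40, hidden=128, n_keyword_classes=2, n_event_classes=10):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)   # shared encoder
        self.kws_head = nn.Linear(hidden, n_keyword_classes)      # keyword detection head
        self.sed_head = nn.Linear(hidden, n_event_classes)        # sound event detection head

    def forward(self, feats):                                     # feats: (batch, time, n_mels)
        _, h = self.encoder(feats)
        h = h[-1]                                                  # final hidden state
        return self.kws_head(h), self.sed_head(h)

# one illustrative training step on stand-in features and labels
model = JointAudioModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
feats = torch.randn(8, 200, 40)                                   # stand-in log-mel features
kws_labels = torch.randint(0, 2, (8,))
sed_labels = torch.randint(0, 10, (8,))
kws_logits, sed_logits = model(feats)
loss = nn.functional.cross_entropy(kws_logits, kws_labels) + \
       nn.functional.cross_entropy(sed_logits, sed_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```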
According to the technical solutions of the embodiments of the present disclosure, the received audio data are split and recombined to obtain enhancement sample data for the audio recognition task and, further, audio training samples for the audio recognition task, so that the training samples of the audio recognition task are recombined in a targeted manner. The obtained training samples have more prominent keyword features for the keyword detection task, or more prominent sound features for the sound event detection task. Therefore, a joint network model for executing the keyword detection task and/or the sound event detection task, trained with the training samples obtained by the technical solutions of the present disclosure, can improve the accuracy and speed of speech recognition for the keyword detection task and shorten the detection response time of the sound event detection task, thereby improving the user experience of the keyword detection task and/or the sound event detection task.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 9, the electronic device 900 may vary considerably in configuration and performance, and may include one or more processors (CPUs) 901 and one or more memories 902, where the memory 902 stores at least one program code that is loaded and executed by the processor 901 to implement the audio data enhancement method of the foregoing embodiments. Of course, the electronic device 900 may further include components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, as well as other components for implementing device functions, which are not described herein again.
Embodiments of the present disclosure also provide a computer-readable storage medium, such as a memory, comprising at least one instruction, which is executable by a processor in a computer device to perform the audio data enhancement method in the above embodiments. Alternatively, the computer-readable storage medium may be a non-transitory computer-readable storage medium, which may include, for example, a ROM (Read-Only Memory), a RAM (Random-Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.
The above description is only exemplary of the present disclosure and should not be taken as limiting the disclosure, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (12)

1. A method of audio data enhancement, comprising:
determining an audio recognition task, wherein the audio recognition task is a keyword detection task and/or a sound event detection task;
receiving audio data associated with the audio recognition task;
according to the audio recognition task, splitting and recombining the audio data to obtain enhanced sample data aiming at the audio recognition task;
obtaining an audio training sample aiming at the audio recognition task according to the enhancement sample data and the audio recognition task;
when the audio recognition task is a keyword detection task, the splitting and recombining the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task includes:
cutting off non-voice data in the audio data;
according to the voice time in the audio data and the word number of the keyword related to the keyword detection task, segmenting the audio data to obtain at least two sections of audio subdata;
obtaining initial audio sample data according to the at least two sections of audio subdata;
splicing interference audio data at two ends of the initial audio sample data to obtain enhanced sample data aiming at the keyword detection task, wherein the interference audio data is derived from non-trigger audio data in training data related to the keyword detection task;
when the audio recognition task is a sound event detection task, the splitting and recombining the audio data according to the audio recognition task to obtain enhancement sample data for the audio recognition task includes:
under the condition that the time length of the audio data is within a preset time length threshold range, acquiring audio sub-segment data meeting a preset time length condition from the audio data, and determining the audio sub-segment data as initial audio sample data;
splicing interference audio data at two ends of the initial audio sample data to obtain enhanced sample data aiming at the sound event detection task, wherein the interference audio data is derived from non-trigger audio data in training data related to the sound event detection task.
2. The audio data enhancement method of claim 1, wherein:
and cutting off non-voice data in the audio data by adopting an active voice detection VAD method.
3. The method of claim 1, wherein obtaining initial audio sample data according to the at least two segments of audio sub-data comprises:
determining each segment of audio sub-data in the at least two segments of audio sub-data as the initial audio sample data; or
randomly arranging and splicing any two or more segments of the audio sub-data to obtain the initial audio sample data.
4. The audio data enhancement method of claim 1, wherein in a case that the audio recognition task is a keyword detection task, obtaining an audio training sample for the audio recognition task according to the enhancement sample data and the audio recognition task comprises:
and under the condition that the audio content in the enhancement sample data is inconsistent with the keyword of the keyword detection task, determining the labeling information associated with the enhancement sample data as a non-trigger type, and determining the enhancement sample data and the labeling information associated with the enhancement sample data as a non-trigger type audio training sample aiming at the keyword detection task.
5. The method according to claim 1, wherein in a case that the audio recognition task is a keyword detection task, obtaining an audio training sample for the audio recognition task according to the enhancement sample data and the audio recognition task comprises:
and under the condition that the audio content in the enhancement sample data is consistent with the keywords of the keyword detection task, determining the labeling information associated with the enhancement sample data as a trigger type, and determining the enhancement sample data and the labeling information associated with the enhancement sample data as a trigger type audio training sample aiming at the keyword detection task.
6. The audio data enhancement method of claim 1, further comprising:
and in the case that the time length of the audio data is beyond the time length threshold range, discarding the audio data.
7. The audio data enhancement method according to claim 1 or 6, characterized in that:
the temporal length threshold range is greater than or equal to half of a temporal length of the enhancement sample data;
the preset time length condition is greater than or equal to half of the time length of the enhancement sample data and less than the time length of the enhancement sample data.
8. The method according to claim 1, wherein in a case that the audio recognition task is a sound event detection task, obtaining an audio training sample for the audio recognition task according to the enhancement sample data and the audio recognition task comprises:
and determining the labeling information associated with the enhancement sample data as a trigger type, and determining the enhancement sample data and the labeling information associated with the enhancement sample data as a trigger type audio training sample aiming at the sound event detection task.
9. The audio data enhancement method of claim 1, wherein after obtaining audio training samples for the audio recognition task, the audio data enhancement method further comprises:
training a joint network model for executing the keyword detection task and/or the sound event detection task based on the audio training samples;
and executing at least one of the keyword detection task and the sound event detection task by using the trained joint network model.
10. An audio data enhancement apparatus, comprising:
a task determination module configured to determine an audio recognition task, wherein the audio recognition task is a keyword detection task and/or a sound event detection task;
a data receiving module configured to perform receiving audio data associated with the audio recognition task;
the splicing recombination module is configured to split and recombine the audio data according to the audio identification task to obtain enhanced sample data for the audio identification task;
a sample acquisition module configured to obtain an audio training sample for the audio recognition task according to the enhancement sample data and the audio recognition task;
wherein, under the condition that the audio recognition task is a keyword detection task, the splicing recombination module comprises:
a non-speech removal sub-module configured to perform removal of non-speech data in the audio data;
the audio segmentation submodule is configured to segment the audio data according to the voice time in the audio data and the word number of the keyword related to the keyword detection task to obtain at least two sections of audio subdata;
a first initial audio acquisition sub-module configured to obtain initial audio sample data according to the at least two segments of audio sub-data;
a first audio splicing sub-module configured to perform splicing of interfering audio data at both ends of the initial audio sample data, obtaining enhancement sample data for the keyword detection task, wherein the interfering audio data is derived from non-trigger audio data in training data associated with the keyword detection task;
wherein, under the condition that the audio recognition task is a sound event detection task, the splicing recombination module comprises:
a second initial audio acquisition sub-module configured to acquire audio sub-segment data satisfying a preset time length condition from the audio data and determine the audio sub-segment data as initial audio sample data, when the time length of the audio data is within a preset time length threshold range;
a second audio stitching sub-module configured to perform stitching of interfering audio data to both ends of the initial audio sample data, obtaining enhancement sample data for the sound event detection task, wherein the interfering audio data is derived from non-trigger audio data in training data associated with the sound event detection task.
11. An electronic device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute the executable instructions to implement the audio data enhancement method of any of claims 1 to 9.
12. A computer-readable storage medium having at least one instruction thereon which, when executed by a processor of an electronic device, enables the electronic device to implement the audio data enhancement method of any one of claims 1 to 9.
CN202210666591.0A 2022-06-14 2022-06-14 Audio data enhancement method and device, electronic equipment and storage medium Active CN114758665B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210666591.0A CN114758665B (en) 2022-06-14 2022-06-14 Audio data enhancement method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114758665A (en) 2022-07-15
CN114758665B (en) 2022-09-02

Family

ID=82336800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210666591.0A Active CN114758665B (en) 2022-06-14 2022-06-14 Audio data enhancement method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114758665B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115206294B (en) * 2022-09-16 2022-12-06 深圳比特微电子科技有限公司 Training method, sound event detection method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545190A (en) * 2018-12-29 2019-03-29 联动优势科技有限公司 A kind of audio recognition method based on keyword
CN111899760A (en) * 2020-07-17 2020-11-06 北京达佳互联信息技术有限公司 Audio event detection method and device, electronic equipment and storage medium
CN113421554A (en) * 2021-07-05 2021-09-21 平安科技(深圳)有限公司 Voice keyword detection model processing method and device and computer equipment
CN114333898A (en) * 2021-12-10 2022-04-12 科大讯飞股份有限公司 Sound event detection method, device and system and readable storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2944909B1 (en) * 2009-04-28 2016-07-15 Thales Sa DEVICE FOR DETECTING EVENTS IN AN AUDIO STREAM
CN103456312B (en) * 2013-08-29 2016-08-17 太原理工大学 A kind of single-channel voice blind separating method based on Computational auditory scene analysis
EP2846328A1 (en) * 2013-09-05 2015-03-11 Thomson Licensing Method and apparatus of detection of events
CN108615526B (en) * 2018-05-08 2020-07-07 腾讯科技(深圳)有限公司 Method, device, terminal and storage medium for detecting keywords in voice signal
GB2577570A (en) * 2018-09-28 2020-04-01 Cirrus Logic Int Semiconductor Ltd Sound event detection
CN110556110A (en) * 2019-10-24 2019-12-10 北京九狐时代智能科技有限公司 Voice processing method and device, intelligent terminal and storage medium
CN111640428B (en) * 2020-05-29 2023-10-20 阿波罗智联(北京)科技有限公司 Voice recognition method, device, equipment and medium
KR20220037819A (en) * 2020-09-18 2022-03-25 엘지전자 주식회사 Artificial intelligence apparatus and method for recognizing plurality of wake-up word
CN112259078A (en) * 2020-10-15 2021-01-22 上海依图网络科技有限公司 Method and device for training audio recognition model and recognizing abnormal audio
CN112562734B (en) * 2020-11-25 2021-08-27 中检启迪(北京)科技有限公司 Voice interaction method and device based on voice detection
CN113920988B (en) * 2021-12-03 2022-03-22 深圳比特微电子科技有限公司 Voice wake-up method and device and readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant