CN115810350A - Training data acquisition method, device, equipment and storage medium - Google Patents

Training data acquisition method, device, equipment and storage medium

Info

Publication number
CN115810350A
CN115810350A CN202211430866.7A
Authority
CN
China
Prior art keywords
audio
training
filtering
waveform
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211430866.7A
Other languages
Chinese (zh)
Inventor
王宁
李良斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202211430866.7A priority Critical patent/CN115810350A/en
Publication of CN115810350A publication Critical patent/CN115810350A/en
Pending legal-status Critical Current

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The application discloses a training data acquisition method, device, equipment, and storage medium, belonging to the field of artificial intelligence. The method comprises the following steps: cutting audio data collected in a target scene to obtain a training audio set, where each audio in the training audio set comprises a wake-up command word; determining an execution order of multiple filtering modes according to the environment type of the target scene, where different filtering modes filter the audio in the training audio set based on different filtering conditions; when performing filtering judgment on each audio according to the execution order, in response to any audio satisfying the filtering condition corresponding to the current filtering mode, deleting the audio from the training audio set; and using the filtered training audio set as training data for training an acoustic model. In this way, high-quality training data can be acquired; performing model training based on the high-quality training data improves the model training effect and, in turn, the accuracy of speech recognition.

Description

Training data acquisition method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a training data acquisition method, apparatus, device, and storage medium.
Background
Speech recognition is an important breakthrough in the development of artificial intelligence. In a broad sense, speech recognition takes speech as a research object, and aims to realize interaction between people and machines based on natural language. In a narrow sense, speech recognition is a technique for a machine to convert speech into text or commands through a process of recognition and understanding.
At present, voice recognition technology is widely applied to the fields of industry, household appliances, communication, automobiles, electronics, medical treatment, home service and the like. For example, in the calling system of an intelligent elevator, the voice recognition function provided by an acoustic model can realize intelligent elevator calling, that is, the call wake-up operation is executed after the wake-up command word is recognized.
The acoustic model is a machine learning model in the field of artificial intelligence, and before speech recognition is performed by applying the acoustic model, labeled training data needs to be acquired for model training. However, the quality of the training data directly affects the model training precision, and further affects the speech recognition effect. Therefore, how to acquire high-quality training data for model training to improve the model training precision and ensure the speech recognition accuracy becomes a focus of attention of those skilled in the art.
Disclosure of Invention
The embodiment of the application provides a training data acquisition method, device, equipment, and storage medium. High-quality training data can be acquired, and acoustic model training performed based on the high-quality training data not only ensures model training precision and improves the model training effect, but also makes speech recognition performed with the trained acoustic model considerably more accurate. The technical scheme is as follows:
in one aspect, a training data obtaining method is provided, and the method includes:
cutting audio data collected in a target scene to obtain a training audio set; wherein each audio in the training audio set comprises a wake-up command word;
determining execution sequences of multiple filtering modes according to the environment type of the target scene; wherein different filtering means are used for filtering the audio in the training audio set based on different filtering conditions;
when performing filtering judgment on each audio according to the execution sequence, in response to any audio satisfying the filtering condition corresponding to the current filtering mode, deleting the audio from the training audio set;
and using the filtered training audio set as training data for training the acoustic model.
In a possible implementation manner, the determining an execution sequence of multiple filtering manners according to the environment type of the target scene includes:
acquiring the place type and the peripheral infrastructure information corresponding to the target scene;
acquiring an acquisition time period of the audio data;
and determining the execution sequence of the multiple filtering modes according to the place type and the peripheral infrastructure information corresponding to the target scene and the acquisition time period of the audio data.
In a possible implementation manner, the determining, according to the place type and the surrounding infrastructure information corresponding to the target scene and the acquisition time period of the audio data, an execution sequence of the multiple filtering manners includes:
determining an initial execution sequence of the multiple filtering modes according to the place types;
and correcting the initial execution sequence according to the peripheral infrastructure information and the acquisition time period of the audio data to obtain a final execution sequence of the multiple filtering modes.
In a possible implementation manner, the deleting, in response to any audio satisfying a filtering condition corresponding to a current filtering manner, the audio from the training audio set includes:
for any audio, acquiring an audio waveform diagram of the audio;
carrying out waveform identification on the audio waveform diagram to obtain a waveform identification result;
in response to determining that there is a truncation phenomenon with respect to the audio based on the waveform recognition result, deleting the audio from the set of training audio;
wherein the truncation phenomenon means that the audio lacks some of the audio frames corresponding to the wake-up command word.
In one possible implementation, the method further includes at least one of:
in response to the waveform feature indicated by the waveform identification result being matched with the first waveform feature, determining that the audio has a truncation phenomenon; the first waveform feature corresponds to a first audio, and a truncation phenomenon exists at an audio starting position of the first audio;
in response to the waveform feature indicated by the waveform identification result being matched with a second waveform feature, determining that the audio has a truncation phenomenon; the second waveform feature corresponds to a second audio, and a truncation phenomenon exists at the audio end position of the second audio;
in response to the waveform feature indicated by the waveform identification result matching a third waveform feature, determining that the audio has a truncation phenomenon; the third waveform feature corresponds to a third audio, and a truncation phenomenon exists at the audio middle position of the third audio.
In one possible implementation manner, the deleting, in response to any audio satisfying a filtering condition corresponding to a current filtering manner, the audio from the training audio set includes:
for any audio, carrying out voiceprint recognition on the audio to obtain a voiceprint recognition result;
performing voice recognition on the audio in response to determining that the audio corresponds to a plurality of speakers based on the voiceprint recognition result and the number of audio frames corresponding to at least two speakers is greater than a frame number threshold;
in response to determining that the audio has chat speech based on speech recognition results, deleting the audio from the set of training audio.
In one possible implementation manner, the deleting, in response to any audio satisfying a filtering condition corresponding to a current filtering manner, the audio from the training audio set includes:
for any audio, acquiring the signal-to-noise ratio of the audio;
in response to a signal-to-noise ratio of the audio being less than a signal-to-noise ratio threshold, deleting the audio from the set of training audio.
In a possible implementation manner, the cutting audio data collected in a target scene to obtain a training audio set includes:
performing audio cutting on audio data acquired in a target scene to obtain an initial audio set; wherein each audio in the initial audio set comprises a wake-up command word;
performing voice endpoint detection on each audio in the initial audio set;
based on a voice endpoint detection result, removing silence segments from each audio in the initial audio set to obtain the training audio set; wherein the duration of each removed silence segment is greater than a target duration.
In another aspect, a training data acquisition apparatus is provided, the apparatus including:
the first processing module is configured to cut audio data collected in a target scene to obtain a training audio set; wherein each audio in the training audio set comprises a wake-up command word;
the second processing module is configured to determine the execution sequence of the multiple filtering modes according to the environment type of the target scene; wherein different filtering means are used for filtering the audio in the training audio set based on different filtering conditions;
a third processing module, configured to, when performing filtering judgment on each audio according to the execution sequence, in response to any audio satisfying a filtering condition corresponding to a current filtering manner, delete the audio from the training audio set; and use the filtered training audio set as training data for training the acoustic model.
In one possible implementation, the second processing module is configured to:
acquiring the place type and the peripheral infrastructure information corresponding to the target scene;
acquiring an acquisition time period of the audio data;
and determining the execution sequence of the multiple filtering modes according to the place type and the peripheral infrastructure information corresponding to the target scene and the acquisition time period of the audio data.
In one possible implementation, the second processing module is configured to:
determining an initial execution sequence of the multiple filtering modes according to the place types;
and correcting the initial execution sequence according to the peripheral infrastructure information and the acquisition time period of the audio data to obtain a final execution sequence of the multiple filtering modes.
In one possible implementation, the third processing module is configured to:
for any audio, acquiring an audio waveform diagram of the audio;
carrying out waveform identification on the audio waveform diagram to obtain a waveform identification result;
in response to determining that there is a truncation phenomenon with respect to the audio based on the waveform recognition result, deleting the audio from the set of training audio;
wherein the truncation phenomenon means that the audio lacks some of the audio frames corresponding to the wake-up command word.
In one possible implementation manner, the third processing module is configured to:
in response to the waveform feature indicated by the waveform identification result being matched with the first waveform feature, determining that the audio has a truncation phenomenon; the first waveform feature corresponds to a first audio, and a truncation phenomenon exists at an audio starting position of the first audio;
in response to the waveform feature indicated by the waveform identification result being matched with a second waveform feature, determining that the audio has a truncation phenomenon; the second waveform feature corresponds to a second audio, and a truncation phenomenon exists at the audio end position of the second audio;
in response to the waveform feature indicated by the waveform identification result matching a third waveform feature, determining that the audio has a truncation phenomenon; the third waveform feature corresponds to a third audio, and a truncation phenomenon exists at the audio middle position of the third audio.
In one possible implementation, the third processing module is configured to:
for any audio, carrying out voiceprint recognition on the audio to obtain a voiceprint recognition result;
performing voice recognition on the audio in response to determining that the audio corresponds to a plurality of speakers based on the voiceprint recognition result and the number of audio frames corresponding to at least two speakers is greater than a frame number threshold;
in response to determining that the audio has chat speech based on speech recognition results, deleting the audio from the set of training audio.
In one possible implementation manner, the third processing module is configured to:
for any audio, acquiring the signal-to-noise ratio of the audio;
in response to a signal-to-noise ratio of the audio being less than a signal-to-noise ratio threshold, deleting the audio from the set of training audio.
In one possible implementation, the first processing module is configured to:
performing audio cutting on audio data acquired in a target scene to obtain an initial audio set; wherein each audio in the initial audio set comprises a wake-up command word;
performing voice endpoint detection on each audio in the initial audio set;
based on a voice endpoint detection result, removing silence segments from each audio in the initial audio set to obtain the training audio set; wherein the duration of each removed silence segment is greater than a target duration.
In another aspect, a computer device is provided, the device comprising a processor and a memory, the memory having at least one program code stored therein, the at least one program code being loaded and executed by the processor to implement the training data acquisition method described above.
In another aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the at least one program code is loaded and executed by a processor to implement the above-mentioned training data acquisition method.
In another aspect, a computer program product or a computer program is provided, the computer program product or the computer program comprising computer program code, the computer program code being stored in a computer-readable storage medium, the computer program code being read by a processor of a computer device from the computer-readable storage medium, the computer program code being executed by the processor to cause the computer device to perform the training data acquisition method described above.
According to the method and the device, after the training audio set is obtained, the audio in the training audio set is filtered to remove low-quality audio. With this training data acquisition manner, high-quality training data can be obtained; performing acoustic model training based on the high-quality training data ensures model training precision and improves the model training effect, and performing speech recognition with the trained acoustic model greatly improves the speech recognition accuracy.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a schematic diagram of an implementation environment related to a training data acquisition method provided in an embodiment of the present application;
fig. 2 is a flowchart of a training data obtaining method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a training data acquisition apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of another computer device provided in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," and the like, in this application, are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency, nor do they define a quantity or order of execution. It will be further understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by these terms.
These terms are only used to distinguish one element from another. For example, a first element can be termed a second element, and, similarly, a second element can also be termed a first element, without departing from the scope of various examples. The first element and the second element may both be elements, and in some cases, may be separate and distinct elements. For example, at least one element may be an integer number of elements equal to or greater than one, such as one element, two elements, three elements, and the like. The plurality of elements means two or more, and for example, the plurality of elements may be two elements, three elements, or any integer of two or more.
It should be noted that the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.) and signals referred to in this application are authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data comply with the relevant laws, regulations and standards of the relevant countries and regions. For example, the audio data referred to in this application is obtained with sufficient authorization.
Fig. 1 is a schematic diagram of an implementation environment related to a training data acquisition method provided in an embodiment of the present application.
The training data acquisition method provided by the embodiment of the application is applied to training data acquisition equipment.
Referring to fig. 1, the training data acquisition device 101 is a computer device with machine learning capability. For example, the computer device may be a fixed computer device such as a personal computer or a server, or a mobile computer device such as a tablet computer or a smart phone, which is not limited in this application.
Taking the application scenario of an intelligent elevator as an example, in the calling system of an intelligent elevator, the wake-up command words spoken by users (the call command words in the intelligent elevator scenario) are generally short. Moreover, when a user calls the elevator, the audio may also contain chatting speech from other elevator users or playback speech from the advertisement screen in the elevator.
In addition, when the acoustic model of the calling system is iteratively trained, actual elevator-riding audio needs to be used as training data. The calling system continuously collects elevator-riding audio over a long period and executes the elevator call wake-up operation as soon as the call command word is recognized. The training data is segmented and labeled from the elevator-riding audio collected over that long period. Such training data may suffer from excessive noise, truncation, people chatting and similar defects, while the labeling result is only the call command word actually required by the calling system. Therefore, if the acoustic model is trained with such training data, a relatively serious negative effect is brought to the acoustic model, and the model training precision is poor.
To solve the problem, the embodiment of the present application provides a new training data acquisition method to improve the quality of training data.
It should be noted that, in addition to the above intelligent elevator scenario, the training data obtaining method provided in the embodiment of the present application may also be applied to other voice wake-up scenarios, such as in the home, in a restaurant or in a mall and other public places, which is not limited in this application.
In a possible implementation manner, the quality of the training audio is improved by adopting an image recognition technology, a voiceprint recognition technology, a signal-to-noise ratio calculation method and the like. By way of example, embodiments of the present application include, but are not limited to, the following steps:
1. By using VAD (Voice Activity Detection) technology, silence segments with a long duration in the audio are removed, which reduces the size of the audio and speeds up subsequent feature extraction and model training.
In another possible implementation, the audio mentioned in this step is cut out by annotators from audio collected over a long period of time, and the cut audio clips constitute the training audio set. Illustratively, the audio collection scenario is an intelligent elevator, which is not limited in this application.
2. For any piece of audio, if the audio is too noisy, training the acoustic model with it may negatively affect the model. Therefore, the embodiment of the application calculates the signal-to-noise ratio of each audio, and deletes from the training audio set any audio whose signal-to-noise ratio is lower than a set signal-to-noise ratio threshold.
3. Since each audio in the training audio set is cut out from audio collected over a long period of time, the audio may be truncated. Truncation can also be caused by problems during audio acquisition. The truncation phenomenon means that the audio lacks some of the audio frames corresponding to the wake-up command word; in other words, truncated audio is missing part of the speech frames of the wake-up command word, which is very disadvantageous for training the acoustic model. Therefore, the embodiment of the application judges whether each audio is truncated, and deletes audio judged to be truncated from the training audio set.
4. Audio containing chatting is judged using voiceprint recognition technology. If multiple speakers exist in a certain audio, and the number of audio frames corresponding to at least two of the speakers is greater than a set frame number threshold, the audio may contain chatting. In that case, the audio is sent to a speech recognition system, and whether chatting is present is further judged according to the speech recognition result. If the audio is found to contain chatting, it is deleted from the training audio set.
Through the above processing, high-quality training data (labeled audio) containing the wake-up command word can be screened out, reducing the negative effect of low-quality audio on the acoustic model.
In another possible implementation manner, for the filtering manners shown in steps 2 to 4 above, the embodiment of the present application further provides a scheme for determining the execution order of the multiple filtering manners according to the application scenario. In other words, the audio filtering may be performed in a different order depending on the application scenario. For example, in a public place such as an intelligent elevator, where noise is likely to be excessive or the probability of people chatting is relatively high, the noise filtering judgment or the chatting filtering judgment may be performed on each audio first; any audio that fails such a filtering judgment is deleted from the training audio set immediately. If, instead, the truncation filtering judgment were performed first and the audio passed it, the noise filtering judgment or chatting filtering judgment would still have to be performed afterwards, which wastes resources. See the embodiment shown in fig. 2 below for details.
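The following is a minimal, non-limiting sketch (not part of the original disclosure) of such scenario-ordered filtering with early deletion; the filter function names, the list-based audio container, and the use of Python are illustrative assumptions:

```python
# Illustrative sketch only. Each filter function is assumed to take one audio
# clip and return True when the clip satisfies that filter's deletion
# condition; concrete filter sketches appear in the step-by-step description
# below (steps 2031-2033).

def filter_training_set(training_set, ordered_filters):
    """Keep only the clips that pass every filter, applied in the given order."""
    kept = []
    for audio in training_set:
        # any() evaluates the filters left to right and stops at the first
        # True, so a clip rejected by an early filter never reaches the
        # later, possibly more expensive, filters.
        if not any(check(audio) for check in ordered_filters):
            kept.append(audio)
    return kept

# Example for a public place such as an intelligent elevator near a
# construction site during a daytime off-peak period (see step 202 below):
# ordered_filters = [noise_filter, chatting_filter, truncation_filter]
# training_data = filter_training_set(training_audio_set, ordered_filters)
```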
Fig. 2 is a flowchart of a training data obtaining method according to an embodiment of the present application. The execution subject of the method is a computer device, and referring to fig. 2, the method flow of the embodiment of the application includes the following steps.
201. The computer device cuts audio data collected in a target scene to obtain a training audio set; wherein each audio in the training audio set comprises a wake-up command word.
In the embodiment of the present application, the target scene is any voice wake-up scene, such as a home, a restaurant, an elevator or a public place such as a mall, which is not limited in the present application. Exemplarily, the embodiments of the present application are described by taking the above target scenario as an example of an intelligent elevator.
In one possible implementation, the audio data collected in the target scene is cut to obtain the training audio set in, but not limited to, the following way: performing audio cutting on the audio data collected in the target scene to obtain an initial audio set, wherein each audio in the initial audio set comprises a wake-up command word; performing voice endpoint detection on each audio in the initial audio set; and, based on the obtained voice endpoint detection result, removing silence segments from each audio in the initial audio set to obtain the training audio set, wherein the duration of each removed silence segment is greater than a target duration.
Taking a target scene as an intelligent elevator as an example, the audio data acquired in the target scene is the elevator-taking audio acquired for a long time. The target duration may be any set time threshold, for example, 30ms, which is not limited in this application.
In this step, removing long silence segments from the audio reduces the size of the audio, thereby speeding up subsequent feature extraction and model training.
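As a minimal sketch of this step (the patent does not prescribe a specific VAD algorithm), a simple energy-based stand-in for voice endpoint detection could remove long silence runs as follows; the 16 kHz sample rate, frame length, energy threshold and 30 ms target duration are assumptions:

```python
import numpy as np

def remove_long_silence(samples, sr=16000, frame_ms=30,
                        energy_thresh=1e-4, target_duration_ms=30):
    """Energy-based stand-in for VAD; samples is a 1-D float array in [-1, 1]."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        return samples

    # Mean energy of each frame; frames below the threshold count as silence.
    energies = np.array([
        np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)
        for i in range(n_frames)
    ])
    silent = energies < energy_thresh

    keep = np.ones(n_frames, dtype=bool)
    max_silence_frames = max(1, target_duration_ms // frame_ms)
    run_start = None
    for i in range(n_frames + 1):
        if i < n_frames and silent[i]:
            if run_start is None:
                run_start = i
        else:
            # A silence run just ended; remove it only if it is longer than
            # the target duration.
            if run_start is not None and (i - run_start) > max_silence_frames:
                keep[run_start:i] = False
            run_start = None

    kept_frames = [samples[i * frame_len:(i + 1) * frame_len]
                   for i in range(n_frames) if keep[i]]
    return np.concatenate(kept_frames) if kept_frames else samples[:0]
```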
202. The computer device determines the execution order of multiple filtering modes according to the environment type of the target scene; wherein different filtering modes are used for filtering the audio in the training audio set based on different filtering conditions.
In one possible implementation, the execution order of the multiple filtering manners is determined according to the environment type of the target scene, including but not limited to the following manners:
acquiring a place type and peripheral infrastructure information corresponding to a target scene; acquiring an acquisition time period of audio data acquired in a target scene; and determining the execution sequence of the multiple filtering modes according to the place type and the peripheral infrastructure information corresponding to the target scene and the acquisition time period of the audio data.
Place types may be classified into private places, public places, and the like. The peripheral infrastructure information is used to indicate the infrastructure around the target scene, such as whether there are overpasses, subways, construction sites or shopping malls nearby. Acquisition time periods include, but are not limited to: the commuting period during the day, the non-commuting period during the day, nighttime, and the like, which is not limited in this application.
For example, the determining of the execution sequence of the multiple filtering manners according to the location type and the surrounding infrastructure information corresponding to the target scene and the acquisition time period of the audio data may be implemented based on the following manners: determining an initial execution sequence of a plurality of filtering modes according to the place types corresponding to the target scene; and correcting the initial execution sequence according to the peripheral infrastructure information corresponding to the target scene and the acquisition time period of the audio data acquired in the target scene to obtain the final execution sequence of multiple filtering modes.
Taking the target scene of an intelligent elevator as an example, because the intelligent elevator is in a public place, the initial execution order of the multiple filtering modes can be set to chatting filtering → noise filtering → truncation filtering; assuming that the peripheral infrastructure information indicates a construction site nearby and the acquisition period is a non-commuting period during the day, the initial execution order may be corrected to noise filtering → chatting filtering → truncation filtering.
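A rule-based sketch of this ordering decision is given below; the patent does not enumerate concrete rules, so the place types, infrastructure keywords, period names and the promotion rule used here are all illustrative assumptions:

```python
def initial_order(place_type):
    # Public places: chatting by bystanders is the most likely defect,
    # so the chatting filter is checked first.
    if place_type == "public":
        return ["chatting", "noise", "truncation"]
    # Private places: background chatting is less likely.
    return ["noise", "chatting", "truncation"]

def correct_order(order, infrastructure, period):
    order = list(order)
    # A construction site nearby during a daytime off-peak period suggests
    # noise is the dominant defect, so noise filtering is promoted to the front.
    if "construction_site" in infrastructure and period == "daytime_offpeak":
        order.remove("noise")
        order.insert(0, "noise")
    return order

def determine_order(place_type, infrastructure, period):
    return correct_order(initial_order(place_type), infrastructure, period)

# determine_order("public", {"construction_site"}, "daytime_offpeak")
# -> ["noise", "chatting", "truncation"]
```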
203. When the computer device performs the filtering judgment on each audio according to the execution order, in response to any audio satisfying the filtering condition corresponding to the current filtering mode, the audio is deleted from the training audio set.
2031. For any audio in the training audio set, acquiring the signal-to-noise ratio of the audio; in response to the signal-to-noise ratio of the audio being less than a signal-to-noise ratio threshold, the audio is deleted from the set of training audio.
In a possible implementation manner, for any audio frame in the audio, the energy of the audio frame is obtained, and the ratio between the energy of the audio frame and the reference energy of the noise is used as the signal-to-noise ratio of the audio frame. For example, the average of the signal-to-noise ratios of all audio frames in the audio may be compared with a set signal-to-noise ratio threshold, which is not limited in this application.
Illustratively, each audio frame corresponds to an energy value, such as the root mean square energy of the audio signal, which represents the average energy of the audio signal waveform over a short time. In addition, a noise estimation algorithm, for example a minimum tracking algorithm, may be employed to estimate the energy of the noise (referred to herein as the reference energy).
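A minimal sketch of this signal-to-noise-ratio check follows; the running-minimum noise estimate and the 20 dB threshold are illustrative assumptions rather than values fixed by the patent:

```python
import numpy as np

def average_snr_db(samples, sr=16000, frame_ms=30):
    """Average per-frame SNR in dB; samples is a 1-D float array in [-1, 1]."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        return 0.0
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)

    # Root-mean-square energy of each frame (short-time average energy).
    energy = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)

    # Crude minimum-tracking noise estimate: the running minimum of the
    # frame energy is taken as the reference (noise) energy.
    noise_floor = np.minimum.accumulate(energy)

    snr_db = 20.0 * np.log10(energy / (noise_floor + 1e-12))
    return float(np.mean(snr_db))

def should_delete_for_noise(samples, snr_threshold_db=20.0):
    # Delete the clip when its average SNR falls below the threshold.
    return average_snr_db(samples) < snr_threshold_db
```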
2032. For any audio in the training audio set, acquiring an audio waveform diagram of the audio; performing waveform identification on the audio waveform diagram of the audio to obtain a waveform identification result; and, in response to determining that the audio has a truncation phenomenon based on the waveform identification result, deleting the audio from the training audio set.
Audio is often represented as a waveform diagram, for example with time on the horizontal axis and amplitude on the vertical axis; the horizontal axis may also be expressed in sample points. The truncation phenomenon has obvious characteristics on the waveform diagram, so whether the audio is truncated can be judged using image recognition technology. That is, the following method can be adopted when determining whether a certain piece of audio has a truncation phenomenon:
2032-1, in response to the waveform feature indicated by the waveform identification result matching the first waveform feature, determining that the audio has a truncation phenomenon.
The first waveform feature corresponds to a first audio, and a truncation phenomenon exists at the audio start position of the first audio. In other words, on the waveform diagram, truncation at the start of the audio appears as a waveform that begins directly with high-amplitude sample points and has no leading silence segment.
2032-2, in response to the waveform feature indicated by the waveform identification result matching the second waveform feature, determining that the audio has a truncation phenomenon.
The second waveform feature corresponds to a second audio, and a truncation phenomenon exists at the audio end position of the second audio. In other words, on the waveform diagram, truncation at the end of the audio appears as a waveform that ends directly with high-amplitude sample points and has no trailing silence segment.
2032-3, and determining that the audio has a truncation phenomenon in response to the matching of the waveform feature indicated by the waveform identification result and the third waveform feature.
The third waveform feature corresponds to a third audio, and a truncation phenomenon exists at the audio middle position of the third audio. In other words, on the waveform diagram, truncation in the middle of the audio appears as high-amplitude sample points immediately before and after a very short silence segment.
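The patent detects these cases by matching the waveform image against learned waveform features; the amplitude heuristic below is only a simplified stand-in for that image-recognition step, it covers only start and end truncation, and the frame length and amplitude threshold are assumptions:

```python
import numpy as np

def _silence_frames_from_start(samples, frame_len, amp_thresh):
    """Number of consecutive low-amplitude frames at the start of the clip."""
    count = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        if np.max(np.abs(samples[start:start + frame_len])) >= amp_thresh:
            break
        count += 1
    return count

def is_truncated(samples, sr=16000, frame_ms=30, amp_thresh=0.05):
    """Heuristic start/end truncation check on a float waveform in [-1, 1]."""
    frame_len = int(sr * frame_ms / 1000)
    lead = _silence_frames_from_start(samples, frame_len, amp_thresh)
    tail = _silence_frames_from_start(samples[::-1], frame_len, amp_thresh)
    # Start or end truncation: the waveform begins or ends directly with
    # high-amplitude sample points, with no silence margin at all.
    # Mid-clip truncation (high amplitude on both sides of a very short
    # silence gap) would need the pattern matching the patent describes and
    # is not covered by this simplified check.
    return lead == 0 or tail == 0
```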
2033. For any audio in the training audio set, performing voiceprint recognition on the audio to obtain a voiceprint recognition result; in response to determining, based on the voiceprint recognition result, that the audio corresponds to multiple speakers and that the number of audio frames corresponding to at least two speakers is greater than a frame number threshold, performing speech recognition on the audio; and, in response to determining that the audio contains chatting speech based on the speech recognition result, deleting the audio from the training audio set.
The frame number threshold may be any set number of frames, which is not limited in this application. In addition, voiceprint recognition of the audio is implemented using voiceprint recognition technology, and speech recognition of the audio is implemented by inputting the audio into a speech recognition model.
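A sketch of this chatting check is given below; `diarize` and `transcribe` are hypothetical stand-ins for the voiceprint recognition system and the speech recognition model (they are not APIs named in the patent), and the frame threshold and command-word list are assumptions:

```python
FRAME_COUNT_THRESHOLD = 50                    # assumed threshold, in audio frames
CALL_COMMAND_WORDS = {"call the elevator"}    # illustrative wake-up command words

def should_delete_for_chatting(audio, diarize, transcribe):
    # diarize(audio) is assumed to return {speaker_id: number_of_audio_frames}.
    speaker_frames = diarize(audio)
    busy_speakers = [spk for spk, n in speaker_frames.items()
                     if n > FRAME_COUNT_THRESHOLD]

    # Chatting is suspected only when there are multiple speakers and at
    # least two of them each occupy more frames than the threshold.
    if len(speaker_frames) < 2 or len(busy_speakers) < 2:
        return False

    # Only then is the more expensive speech recognition step run.
    transcript = transcribe(audio)
    residual = transcript
    for word in CALL_COMMAND_WORDS:
        residual = residual.replace(word, "")
    # Any substantial speech beyond the wake-up command word is treated as
    # chatting, and the clip is deleted.
    return len(residual.strip()) > 0
```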
204. The computer device uses the filtered set of training audio as training data for training the acoustic model.
Through the processing mode, the training data which is high in quality and comprises the awakening command words can be screened out, and the negative effect of low-quality audio on the acoustic model is reduced.
To sum up, after the training audio set is obtained, the audio in the training audio set is filtered to remove low-quality audio. With this training data acquisition manner, high-quality training data can be obtained; performing acoustic model training based on the high-quality training data ensures model training precision and improves the model training effect, and performing speech recognition with the trained acoustic model greatly improves the speech recognition accuracy.
In another possible implementation, besides the embodiment shown in fig. 2, the training data may be acquired in several ways as shown below.
Method one: cutting the audio data collected in the target scene to obtain a training audio set, wherein each audio in the training audio set comprises a wake-up command word; determining weights corresponding to the multiple filtering modes according to the environment type of the target scene; for any audio in the training audio set, performing filtering judgment on the audio with the multiple filtering modes respectively to obtain multiple filtering judgment results; in response to determining, according to the weights corresponding to the multiple filtering modes and the multiple filtering judgment results, that the audio satisfies the filtering condition, deleting the audio from the training audio set; and using the filtered training audio set as training data for training the acoustic model. Taking the target scene of an intelligent elevator as an example, since the intelligent elevator is in a public place, the weights of the multiple filtering modes in descending order can be chatting filtering → noise filtering → truncation filtering, which is not limited in this application. A sketch of one possible way to combine the weights is given after method three below.
Method two: cutting the audio data collected in the target scene to obtain a training audio set, wherein each audio in the training audio set comprises a wake-up command word; for any audio in the training audio set, performing filtering judgment on the audio with the multiple filtering modes respectively to obtain multiple filtering judgment results; in response to determining that the audio satisfies the filtering condition based on any one of the multiple filtering judgment results, deleting the audio from the training audio set; and using the filtered training audio set as training data for training the acoustic model.
Method three: cutting the audio data collected in the target scene to obtain a training audio set, wherein each audio in the training audio set comprises a wake-up command word; determining, among the multiple filtering modes and according to the environment type of the target scene, a target filtering mode matched with the environment type; for any audio in the training audio set, performing filtering judgment on the audio with the target filtering mode; in response to determining that the audio satisfies the filtering condition based on the filtering judgment result, deleting the audio from the training audio set; and using the filtered training audio set as training data for training the acoustic model.
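The following sketch relates to method one above. The patent does not specify how the weights and the individual filtering judgment results are combined; the weighted vote against a threshold shown here is one plausible interpretation, and the weight values and threshold are assumptions:

```python
def should_delete_weighted(audio, filters, weights, threshold=0.5):
    """filters: {name: fn(audio) -> bool}; weights: {name: float}."""
    # Sum the weights of the filters whose deletion condition is satisfied.
    hit = sum(weights[name] for name, check in filters.items() if check(audio))
    total = sum(weights.values())
    return (hit / total) >= threshold

# Example for a public place such as an intelligent elevator, where chatting
# filtering carries the largest weight, then noise, then truncation:
# weights = {"chatting": 0.5, "noise": 0.3, "truncation": 0.2}
# filters = {"chatting": chatting_filter, "noise": noise_filter,
#            "truncation": truncation_filter}
```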
Fig. 3 is a schematic structural diagram of a training data acquisition apparatus according to an embodiment of the present application. Referring to fig. 3, the apparatus includes:
the first processing module 301 is configured to cut audio data acquired in a target scene to obtain a training audio set; wherein each audio in the training audio set comprises a wake-up command word;
a second processing module 302 configured to determine an execution order of a plurality of filtering manners according to the environment type of the target scene; wherein different filtering means are used for filtering the audio in the training audio set based on different filtering conditions;
a third processing module 303, configured to, when performing the filtering judgment on each audio according to the execution order, in response to any audio satisfying the filtering condition corresponding to the current filtering mode, delete the audio from the training audio set; and use the filtered training audio set as training data for training the acoustic model.
According to the embodiment of the application, after the training audio set is obtained, the audio in the training audio set is filtered to remove low-quality audio. With this training data acquisition manner, high-quality training data can be obtained; performing acoustic model training based on the high-quality training data ensures model training precision and improves the model training effect, and performing speech recognition with the trained acoustic model greatly improves the speech recognition accuracy.
In one possible implementation, the second processing module 302 is configured to:
acquiring the place type and the peripheral infrastructure information corresponding to the target scene;
acquiring an acquisition time period of the audio data;
and determining the execution sequence of the multiple filtering modes according to the place type and the peripheral infrastructure information corresponding to the target scene and the acquisition time period of the audio data.
In one possible implementation, the second processing module 302 is configured to:
determining an initial execution sequence of the multiple filtering modes according to the place types;
and correcting the initial execution sequence according to the peripheral infrastructure information and the acquisition time period of the audio data to obtain a final execution sequence of the multiple filtering modes.
In one possible implementation, the third processing module 303 is configured to:
for any audio, acquiring an audio waveform diagram of the audio;
carrying out waveform identification on the audio waveform diagram to obtain a waveform identification result;
in response to determining that there is a truncation phenomenon in the audio based on the waveform recognition result, deleting the audio from the set of training audio;
wherein the truncation phenomenon means that the audio lacks some of the audio frames corresponding to the wake-up command word.
In one possible implementation, the third processing module 303 is configured to:
in response to the waveform feature indicated by the waveform identification result being matched with the first waveform feature, determining that the audio has a truncation phenomenon; the first waveform feature corresponds to a first audio, and a truncation phenomenon exists at an audio starting position of the first audio;
in response to the waveform feature indicated by the waveform identification result being matched with a second waveform feature, determining that the audio has a truncation phenomenon; the second waveform feature corresponds to a second audio, and a truncation phenomenon exists at the audio end position of the second audio;
in response to the waveform feature indicated by the waveform identification result matching a third waveform feature, determining that the audio has a truncation phenomenon; the third waveform feature corresponds to a third audio, and a truncation phenomenon exists at the audio middle position of the third audio.
In one possible implementation, the third processing module 303 is configured to:
for any audio, carrying out voiceprint recognition on the audio to obtain a voiceprint recognition result;
performing voice recognition on the audio in response to determining that the audio corresponds to a plurality of speakers based on the voiceprint recognition result and the number of audio frames corresponding to at least two speakers is greater than a frame number threshold;
in response to determining that the audio has chat speech based on speech recognition results, deleting the audio from the set of training audio.
In one possible implementation, the third processing module 303 is configured to:
for any audio, acquiring the signal-to-noise ratio of the audio;
in response to a signal-to-noise ratio of the audio being less than a signal-to-noise ratio threshold, deleting the audio from the set of training audio.
In one possible implementation, the first processing module 301 is configured to:
performing audio cutting on audio data acquired in a target scene to obtain an initial audio set; wherein each audio in the initial audio set comprises a wake-up command word;
performing voice endpoint detection on each audio in the initial audio set;
based on a voice endpoint detection result, removing silence segments from each audio in the initial audio set to obtain the training audio set; wherein the duration of each removed silence segment is greater than a target duration.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
It should be noted that, in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the training data acquisition apparatus provided by the above embodiment and the training data acquisition method embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application. Illustratively, the computer device 400 may be embodied as a training data acquisition device.
Generally, the computer device 400 includes: a processor 401 and a memory 402. Among other things, processor 401 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 401 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 401 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In one possible implementation, the processor 401 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed by the display screen. In one possible implementation, the processor 401 may further include an AI (Artificial Intelligence) processor for processing a computing operation related to machine learning.
Memory 402 may include one or more computer-readable storage media, which may be non-transitory. Memory 402 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In one possible implementation, a non-transitory computer readable storage medium in the memory 402 is used to store at least one program code, which is used to be executed by the processor 401 to implement the training data acquisition method provided by the method embodiment in the present application.
In one possible implementation, the computer device 400 may further include: a peripheral interface 403 and at least one peripheral. The processor 401, memory 402 and peripheral interface 403 may be connected by bus or signal lines. Each peripheral may be connected to the peripheral interface 403 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 404, a display screen 405, a camera assembly 406, an audio circuit 407, and a power supply 408.
The peripheral interface 403 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 401 and the memory 402. In one possible implementation, the processor 401, memory 402, and peripheral interface 403 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 401, the memory 402 and the peripheral interface 403 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The Radio Frequency circuit 404 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 404 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 404 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 404 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 404 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In one possible implementation, the radio frequency circuit 404 may further include a circuit related to NFC (Near Field Communication), which is not limited in this application.
The display screen 405 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 405 is a touch display screen, the display screen 405 also has the ability to capture touch signals on or over the surface of the display screen 405. The touch signal may be input to the processor 401 as a control signal for processing. At this point, the display screen 405 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In one possible implementation, the display screen 405 may be one, provided on the front panel of the computer device 400; in another possible implementation, the display screens 405 may be at least two, respectively disposed on different surfaces of the computer device 400 or in a folded design; in another possible implementation, the display screen 405 may be a flexible display screen, disposed on a curved surface or on a folded surface of the computer device 400. Even further, the display screen 405 may be arranged in a non-rectangular irregular pattern, i.e. a shaped screen. The Display screen 405 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and other materials.
The camera assembly 406 is used to capture images or video. Optionally, camera assembly 406 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In a possible implementation manner, the number of the rear cameras is at least two, and the rear cameras are respectively any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and a VR (Virtual Reality) shooting function or other fusion shooting functions. In one possible implementation, camera assembly 406 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 407 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 401 for processing, or inputting the electric signals to the radio frequency circuit 404 for realizing voice communication. The microphones may be provided in plural numbers, respectively, at different portions of the computer apparatus 400 for the purpose of stereo sound collection or noise reduction. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 401 or the radio frequency circuit 404 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In one possible implementation, the audio circuit 407 may also include a headphone jack.
The power supply 408 is used to power the various components in the computer device 400. The power source 408 may be alternating current, direct current, disposable or rechargeable. When power source 408 comprises a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
Those skilled in the art will appreciate that the configuration shown in FIG. 4 does not constitute a limitation of the computer device 400, and may include more or fewer components than those shown, or combine certain components, or employ a different arrangement of components.
Fig. 5 is a schematic structural diagram of another computer device 500 provided in the embodiment of the present application. Illustratively, the computer device 500 may be embodied as a training data acquisition device.
The computer device 500 may have a relatively large difference due to different configurations or performances, and may include one or more processors (CPUs) 501 and one or more memories 502, where the memory 502 stores at least one program code, and the at least one program code is loaded and executed by the processors 501 to implement the training data obtaining method provided by the above-mentioned method embodiments. Of course, the computer device 500 may further have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input and output, and the computer device 500 may further include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer program product or a computer program is also provided, which includes computer program code stored in a computer-readable storage medium, which is read by a processor of a computer device from the computer-readable storage medium, and which is executed by the processor to cause the computer device to execute the above-mentioned training data acquisition method.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is intended only to illustrate the alternative embodiments of the present application, and should not be construed as limiting the present application, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (12)

1. A method of training data acquisition, the method comprising:
cutting audio data collected in a target scene to obtain a training audio set; wherein each audio in the training audio set comprises a wake-up command word;
determining an execution sequence of multiple filtering modes according to the environment type of the target scene; wherein different filtering modes are used for filtering the audio in the training audio set based on different filtering conditions;
when each audio is subjected to filtering judgment according to the execution sequence, in response to any one audio meeting the filtering condition corresponding to the current filtering mode, deleting the audio from the training audio set;
and using the filtered training audio set as training data for training an acoustic model.
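As an informal illustration only (not part of the claims), the filtering flow of claim 1 can be sketched in Python roughly as follows; the `Clip` type, the placeholder filter functions, and the example filter order are assumptions made for the sketch rather than the patent's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Clip:
    audio_id: str
    samples: list               # PCM samples of one segment containing the wake-up word
    sample_rate: int = 16000


def is_truncated(clip: Clip) -> bool:
    return False                # placeholder for the waveform check of claims 4-5


def has_chat_speech(clip: Clip) -> bool:
    return False                # placeholder for the voiceprint/ASR check of claim 6


def snr_too_low(clip: Clip) -> bool:
    return False                # placeholder for the signal-to-noise-ratio check of claim 7


def filter_training_set(clips: List[Clip],
                        ordered_filters: List[Callable[[Clip], bool]]) -> List[Clip]:
    """Keep only clips that satisfy none of the filtering conditions; a clip is
    dropped as soon as the first filter in the scene-dependent order matches."""
    return [clip for clip in clips
            if not any(f(clip) for f in ordered_filters)]


# Example: an order that might be chosen for a noisy scene (SNR checked first).
training_set = filter_training_set(
    [Clip("a-001", [0.0] * 16000)],
    ordered_filters=[snr_too_low, is_truncated, has_chat_speech],
)
```

Each placeholder would be replaced by the concrete checks described in claims 4 to 7.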
2. The method according to claim 1, wherein the determining the execution sequence of the multiple filtering modes according to the environment type of the target scene comprises:
acquiring the place type and the peripheral infrastructure information corresponding to the target scene;
acquiring an acquisition time period of the audio data;
and determining the execution sequence of the multiple filtering modes according to the place type and the peripheral infrastructure information corresponding to the target scene and the acquisition time period of the audio data.
3. The method according to claim 2, wherein the determining the execution sequence of the multiple filtering modes according to the place type and the peripheral infrastructure information corresponding to the target scene and the acquisition time period of the audio data comprises:
determining an initial execution sequence of the plurality of filtering modes according to the place types;
and correcting the initial execution sequence according to the peripheral infrastructure information and the acquisition time period of the audio data to obtain a final execution sequence of the multiple filtering modes.
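Claims 2 and 3 can be read as a table-driven choice of an initial filter order followed by a correction step. The sketch below uses invented place types, infrastructure rules, and a time-of-day heuristic purely for illustration; the patent does not specify concrete rules.

```python
def initial_order(place_type: str) -> list:
    # Assumed mapping from place type to an initial filter order.
    table = {
        "hospital_lobby": ["snr", "chat", "truncation"],
        "office":         ["truncation", "chat", "snr"],
    }
    return table.get(place_type, ["snr", "truncation", "chat"])


def corrected_order(order: list, nearby_infrastructure: list, hour: int) -> list:
    order = list(order)
    # Assumed heuristic: near a road, low SNR is the dominant failure mode,
    # so the SNR filter moves to the front.
    if "road" in nearby_infrastructure and order[0] != "snr":
        order.remove("snr")
        order.insert(0, "snr")
    # Assumed heuristic: during busy daytime hours chatter is more likely,
    # so the chat filter runs before the truncation filter.
    if 9 <= hour <= 18 and order.index("chat") > order.index("truncation"):
        order.remove("chat")
        order.insert(order.index("truncation"), "chat")
    return order


final_order = corrected_order(initial_order("hospital_lobby"),
                              nearby_infrastructure=["road"], hour=14)
```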
4. The method according to claim 1, wherein the deleting the audio from the training audio set in response to any audio satisfying the filtering condition corresponding to the current filtering mode comprises:
for any audio, acquiring an audio waveform diagram of the audio;
carrying out waveform identification on the audio waveform diagram to obtain a waveform identification result;
in response to determining that there is a truncation phenomenon in the audio based on the waveform identification result, deleting the audio from the set of training audio;
wherein the truncation phenomenon means that some of the audio frames corresponding to the wake-up command word are absent from the audio.
5. The method of claim 4, further comprising at least one of:
in response to the waveform feature indicated by the waveform identification result being matched with a first waveform feature, determining that the audio has a truncation phenomenon; wherein the first waveform feature corresponds to a first audio, and a truncation phenomenon exists at an audio starting position of the first audio;
in response to the waveform feature indicated by the waveform identification result being matched with a second waveform feature, determining that the audio has a truncation phenomenon; wherein the second waveform feature corresponds to a second audio, and a truncation phenomenon exists at an audio end position of the second audio;
in response to the waveform feature indicated by the waveform identification result being matched with a third waveform feature, determining that the audio has a truncation phenomenon; wherein the third waveform feature corresponds to a third audio, and a truncation phenomenon exists at an audio middle position of the third audio.
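Claims 4 and 5 match the recognized waveform feature against reference features for start-, end- and middle-truncated audio. As a simplified stand-in for that feature matching (not the claimed method itself), a short-time-energy heuristic can flag clips whose waveform begins or ends at high energy, i.e. without the leading or trailing quiet expected around a complete wake-up word. The sketch assumes numpy and a mono float waveform.

```python
import numpy as np


def short_time_energy(samples: np.ndarray, frame_len: int = 400) -> np.ndarray:
    # Average energy per non-overlapping frame (400 samples = 25 ms at 16 kHz).
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames ** 2).mean(axis=1)


def looks_truncated(samples: np.ndarray, energy_ratio: float = 0.5) -> bool:
    energy = short_time_energy(samples)
    if len(energy) < 3:
        return True              # too short to contain a complete wake-up word
    peak = energy.max() + 1e-12
    # High energy in the very first or very last frame suggests the wake-up
    # word was cut off at the segment boundary.
    return energy[0] / peak > energy_ratio or energy[-1] / peak > energy_ratio


# Example: a sufficiently long clip of pure silence is not flagged.
print(looks_truncated(np.zeros(16000)))   # False
```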
6. The method according to claim 1, wherein the deleting the audio from the training audio set in response to any audio satisfying the filtering condition corresponding to the current filtering mode comprises:
for any audio, carrying out voiceprint recognition on the audio to obtain a voiceprint recognition result;
performing voice recognition on the audio in response to determining that the audio corresponds to a plurality of speakers based on the voiceprint recognition result and the number of audio frames corresponding to at least two speakers is greater than a frame number threshold;
in response to determining that the audio has chat speech based on speech recognition results, deleting the audio from the set of training audio.
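The decision logic of claim 6 might be organized as below. The `diarize` and `transcribe` functions are stubs standing in for voiceprint recognition and speech recognition; the wake-up phrase and the frame threshold are illustrative values only.

```python
from collections import Counter


def diarize(samples) -> list:
    """Stub: return one speaker label per audio frame."""
    return ["spk0"] * 50 + ["spk1"] * 40


def transcribe(samples) -> str:
    """Stub: return the recognized text of the clip."""
    return "hello assistant how was your weekend"


def is_chat_clip(samples, wake_phrase: str = "hello assistant",
                 frame_threshold: int = 30) -> bool:
    frames_per_speaker = Counter(diarize(samples))
    busy_speakers = [s for s, n in frames_per_speaker.items() if n > frame_threshold]
    if len(busy_speakers) < 2:
        return False             # one dominant speaker: no need to run speech recognition
    text = transcribe(samples)
    # Assumed rule: any recognized text beyond the wake-up command word counts as chat.
    return text.replace(wake_phrase, "").strip() != ""


print(is_chat_clip(samples=None))   # True for the stubbed example above
```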
7. The method according to claim 1, wherein the deleting the audio from the training audio set in response to any audio satisfying the filtering condition corresponding to the current filtering mode comprises:
for any audio, acquiring the signal-to-noise ratio of the audio;
in response to a signal-to-noise ratio of the audio being less than a signal-to-noise ratio threshold, deleting the audio from the set of training audio.
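A minimal sketch of claim 7, assuming numpy: the clip's SNR is estimated by treating its quietest frames as noise and its loudest frames as signal, and the clip is dropped when the estimate falls below a threshold. The 10 dB threshold and the 10% frame fractions are illustrative values, not taken from the patent.

```python
import numpy as np


def estimate_snr_db(samples: np.ndarray, frame_len: int = 400) -> float:
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        return 0.0
    energy = np.sort((samples[: n_frames * frame_len]
                      .reshape(n_frames, frame_len) ** 2).mean(axis=1))
    noise = energy[: max(1, n_frames // 10)].mean() + 1e-12     # quietest ~10% of frames
    signal = energy[-max(1, n_frames // 10):].mean() + 1e-12    # loudest ~10% of frames
    return 10.0 * np.log10(signal / noise)


def snr_too_low(samples: np.ndarray, threshold_db: float = 10.0) -> bool:
    return estimate_snr_db(samples) < threshold_db
```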
8. The method according to any one of claims 1 to 7, wherein the cutting the audio data collected in the target scene to obtain a training audio set comprises:
segmenting the audio data collected in the target scene to obtain an initial audio set; wherein each audio in the initial audio set comprises a wake-up command word;
performing voice endpoint detection on each audio in the initial audio set;
based on a voice endpoint detection result, eliminating silence segments from each audio in the initial audio set to obtain the training audio set; wherein the duration of each eliminated silence segment is greater than a target duration.
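The segmentation step of claim 8 could be approximated with a simple energy-based voice endpoint detector that removes silent runs longer than the target duration; a production system would more likely use a trained VAD model. A sketch, assuming numpy and a mono float waveform:

```python
import numpy as np


def remove_long_silence(samples: np.ndarray, sample_rate: int = 16000,
                        frame_len: int = 400, energy_threshold: float = 1e-4,
                        target_duration_s: float = 0.5) -> np.ndarray:
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    silent = (frames ** 2).mean(axis=1) < energy_threshold
    max_silent_frames = int(target_duration_s * sample_rate / frame_len)

    keep = np.ones(n_frames, dtype=bool)
    run_start = None
    for i in range(n_frames + 1):
        if i < n_frames and silent[i]:
            if run_start is None:
                run_start = i
        else:
            # A silent run has just ended; drop it entirely if it exceeds the target duration.
            if run_start is not None and i - run_start > max_silent_frames:
                keep[run_start:i] = False
            run_start = None
    return frames[keep].reshape(-1)
```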
9. A training data acquisition apparatus, characterized in that the apparatus comprises:
the first processing module is configured to cut audio data collected in a target scene to obtain a training audio set; wherein each audio in the training audio set comprises a wake-up command word;
the second processing module is configured to determine an execution sequence of multiple filtering modes according to the environment type of the target scene; wherein different filtering modes are used for filtering the audio in the training audio set based on different filtering conditions;
a third processing module, configured to, when each audio is subjected to filtering judgment according to the execution sequence, delete any audio from the training audio set in response to that audio satisfying the filtering condition corresponding to the current filtering mode; and use the filtered training audio set as training data for training an acoustic model.
10. A computer device, characterized in that it comprises a processor and a memory, in which at least one program code is stored, which is loaded and executed by the processor to implement the training data acquisition method according to any one of claims 1 to 8.
11. A computer-readable storage medium, wherein at least one program code is stored, which is loaded and executed by a processor to implement the training data acquisition method according to any one of claims 1 to 8.
12. A computer program product or a computer program, characterized in that the computer program product or the computer program comprises computer program code, which is stored in a computer-readable storage medium, from which a processor of a computer device reads the computer program code, the processor executing the computer program code, causing the computer device to perform the training data acquisition method as claimed in any one of claims 1 to 8.
CN202211430866.7A 2022-11-15 2022-11-15 Training data acquisition method, device, equipment and storage medium Pending CN115810350A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211430866.7A CN115810350A (en) 2022-11-15 2022-11-15 Training data acquisition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211430866.7A CN115810350A (en) 2022-11-15 2022-11-15 Training data acquisition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115810350A true CN115810350A (en) 2023-03-17

Family

ID=85483204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211430866.7A Pending CN115810350A (en) 2022-11-15 2022-11-15 Training data acquisition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115810350A (en)

Similar Documents

Publication Publication Date Title
US20210217433A1 (en) Voice processing method and apparatus, and device
CN108320751B (en) Voice interaction method, device, equipment and server
CN109062535B (en) Sound production control method and device, electronic device and computer readable medium
CN110364156A (en) Voice interactive method, system, terminal and readable storage medium storing program for executing
CN109599104A (en) Multi-beam choosing method and device
CN107888965A (en) Image present methods of exhibiting and device, terminal, system, storage medium
CN111083678A (en) Playing control method and system of Bluetooth sound box and intelligent device
CN109151366B (en) Sound processing method for video call, storage medium and server
CN118051111A (en) High-energy-efficiency display processing method and equipment
CN110931028B (en) Voice processing method and device and electronic equipment
CN112634872A (en) Voice equipment awakening method and device
CN111508531A (en) Audio processing method and device
KR20200094732A (en) Method and system for classifying time series data
CN110798327A (en) Message processing method, device and storage medium
CN111724783B (en) Method and device for waking up intelligent device, intelligent device and medium
CN108900688A (en) Sounding control method, device, electronic device and computer-readable medium
CN111312243B (en) Equipment interaction method and device
CN112291672A (en) Speaker control method, control device and electronic equipment
CN105244037B (en) Audio signal processing method and device
CN111554314A (en) Noise detection method, device, terminal and storage medium
CN106603882A (en) Incoming call sound volume adjusting method, incoming call sound volume adjusting device and terminal
CN111341317A (en) Method and device for evaluating awakening audio data, electronic equipment and medium
CN115810350A (en) Training data acquisition method, device, equipment and storage medium
CN109144461A (en) Sounding control method, device, electronic device and computer-readable medium
CN109032008A (en) Sounding control method, device and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination