WO2023193573A1 - Audio processing method and apparatus, storage medium, and electronic device - Google Patents

Audio processing method and apparatus, storage medium, and electronic device

Info

Publication number
WO2023193573A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
type
frame
output
audio frame
Prior art date
Application number
PCT/CN2023/081227
Other languages
English (en)
French (fr)
Inventor
熊伟浩
周新权
Original Assignee
Beijing ByteDance Network Technology Co., Ltd. (北京字节跳动网络技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co., Ltd. (北京字节跳动网络技术有限公司)
Publication of WO2023193573A1 publication Critical patent/WO2023193573A1/zh

Links

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 — Detection of presence or absence of voice signals
    • G10L25/84 — Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L25/03 — characterised by the type of extracted parameters
    • G10L25/18 — the extracted parameters being spectral information of each sub-band
    • G10L25/21 — the extracted parameters being power information
    • G10L25/27 — characterised by the analysis technique
    • G10L25/30 — using neural networks
    • G10L25/48 — specially adapted for particular use
    • G10L25/51 — specially adapted for particular use for comparison or discrimination

Definitions

  • the embodiments of the present disclosure relate to the field of data processing technology, for example, to an audio processing method, apparatus, storage medium, and electronic device.
  • Embodiments of the present disclosure provide an audio processing method, device, storage medium, and electronic equipment to improve audio recognition accuracy.
  • an embodiment of the present disclosure provides an audio processing method, including:
  • the current recognition threshold is adjusted according to the determination state, wherein the adjusted recognition threshold is used to identify the audio type of the next audio frame.
  • an audio processing device including:
  • a type determination module configured to obtain the audio frame to be processed and determine the audio type of the audio frame based on the current recognition threshold
  • a state determination module configured to determine the determination state of the identified audio type based on the characteristic information of the identified continuous audio frames in response to determining that the current audio frame satisfies the threshold adjustment condition
  • a threshold adjustment module configured to adjust the current recognition threshold according to the determination state, wherein the adjusted recognition threshold is used to identify the audio type of the next audio frame.
  • embodiments of the present disclosure also provide an electronic device, where the electronic device includes:
  • one or more processors
  • a storage device configured to store one or more programs
  • the one or more processors are caused to implement the audio processing method as described in any one of the embodiments of the present disclosure.
  • embodiments of the disclosure further provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the audio processing method as described in any embodiment of the disclosure.
  • Figure 1 is a schematic flow chart of an audio processing method provided by an embodiment of the present disclosure
  • Figure 2 is a schematic flow chart of an audio processing method provided by an embodiment of the present disclosure
  • Figure 3 is a schematic flow chart of an audio processing method provided by an embodiment of the present disclosure.
  • Figure 4 is a schematic flow chart of an audio processing method provided by an embodiment of the present disclosure.
  • Figure 5 is a schematic structural diagram of an audio processing device provided by an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • the term “include” and its variations are open-ended, i.e., “including but not limited to.”
  • the term “based on” means “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; and the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • Figure 1 is a schematic flow chart of an audio processing method provided by an embodiment of the present disclosure.
  • the embodiment of the present disclosure is adapted to perform audio type recognition based on an automatically adjusted threshold.
  • This method can be executed by an audio processing device provided by an embodiment of the present disclosure.
  • the audio processing device can be implemented in the form of software and/or hardware, for example, through an electronic device, and the electronic device can be a mobile terminal or a PC.
  • the method in this embodiment includes:
  • the execution subject (the above-mentioned electronic device) used to execute the audio processing method includes but is not limited to mobile phones, smart watches, computers and other devices.
  • the above-mentioned electronic device can obtain the audio frame to be processed in a variety of ways.
  • audio data can be collected in real time through an audio acquisition device, and audio frames to be processed can be extracted from the audio data.
  • Audio data can also be retrieved from a preset storage location or other devices, and audio frames to be processed can be extracted from the audio data.
  • the disclosed embodiments do not limit the method of obtaining audio frames to be processed.
  • the above audio data may include but is not limited to call audio data, audio data in videos, live broadcast audio data, etc., and is not limited to this.
  • the audio data may be a piece of audio, and the audio may contain information such as speech, noise, etc.
  • the duration of the audio data is not limited.
  • the audio data is divided into multiple audio frames, and recognition processing is performed on each audio frame.
  • the audio data is real-time data
  • the audio data collected in real time is divided into audio frames in sequence, and the obtained audio frames are recognized and processed in real time.
  • the audio data is offline data
  • multiple audio frames can be processed sequentially according to the timing of dividing the audio frames.
  • the audio frame may be audio data with a preset time length, where the duration of the audio frame may be determined based on the recognition accuracy, and is not limited to this.
  • the duration of the audio frame may be 10 ms.
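As an illustration, the frame-splitting step above can be sketched as follows. The 16 kHz sampling rate and 10 ms frame duration come from the examples in this disclosure; the function name and the choice to drop a trailing partial frame are assumptions for the sketch, not requirements of the method.

```python
def split_into_frames(samples, sample_rate=16000, frame_ms=10):
    # samples per frame: 16000 * 10 / 1000 = 160
    frame_len = sample_rate * frame_ms // 1000
    # keep only full-length frames; a trailing partial frame is dropped
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
```

Each resulting frame can then be recognized independently, in timestamp order.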
  • the audio frame to be processed may be the audio frame that currently needs to be processed.
  • audio types are identified for multiple speech frames based on dynamic recognition thresholds.
  • the current recognition threshold corresponding to the processing time of the audio frame to be processed is obtained.
  • the current recognition threshold is the judgment threshold corresponding to the processing moment of the current audio frame, and can be used to judge the audio type of the audio frame.
  • the recognition threshold is an adjustable value, and the current recognition threshold can be an adjusted recognition threshold based on the previous speech frame, or it can also be an unadjusted initial recognition threshold.
  • the audio type is identified for the audio frame.
  • the audio type can be set according to the recognition requirements.
  • the audio type is divided into speech type and noise type according to whether the audio frame contains noise; in some implementations, the object that emits the sound may also be used to divide audio frames into types such as human voice, rain, birdsong, etc.; in some embodiments, audio frames may also be divided into types such as song and speech, which is not limited here.
  • the audio type includes speech type and noise type.
  • the audio frame corresponding to the speech type is a speech frame
  • the speech frame can include speech information
  • the speech information can be real language content
  • the audio frame corresponding to the noise type is a noise frame
  • the noise frame can include noise information
  • the noise information can be interference information that has nothing to do with the language content, such as interference sounds generated in the environment.
  • the audio frame may be recognized based on a preset audio recognition algorithm to obtain the recognition probability of the audio frame, where the preset audio recognition algorithm is a recognition algorithm adapted to the recognition requirements of the audio type.
  • the audio recognition algorithm may be a machine learning model, and the machine learning model may be a neural network model, etc., which is not limited.
  • the recognition probability of the audio frame is judged against the current recognition threshold to determine the audio type of the audio frame.
  • when the recognition probability of the audio frame is greater than or equal to the current recognition threshold, the audio frame is determined to be a speech frame; when the recognition probability of the audio frame is less than the current recognition threshold, the audio frame is determined to be a noise frame.
  • the recognition threshold can be one or more data, and the number of recognition thresholds and the method of judging the recognition probability of the audio frame can be set according to the judgment requirements of the audio type, and are not limited to this.
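A minimal sketch of the thresholding rule above, for the common single-threshold case (the function name is hypothetical; the patent does not prescribe an implementation):

```python
def classify_frame(recognition_prob, current_threshold):
    # probability >= current recognition threshold -> speech frame,
    # otherwise -> noise frame
    return "speech" if recognition_prob >= current_threshold else "noise"
```

With multiple thresholds, the same comparison would simply be repeated per threshold according to the judgment requirements.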
  • the recognition threshold can be dynamically adjusted. For example, it can be adjusted in real time based on the determination status of the recognized speech frame.
  • the adjusted recognition threshold is more suitable for the determination of the next audio frame, so that the audio type obtained through the adjusted recognition threshold has better accuracy.
  • the adjusted recognition threshold can be used to identify the audio type of the next audio frame, that is, the recognition threshold continues to be adjusted during the process of identifying the next audio frame, thereby realizing dynamic adjustment of the recognition threshold to adapt to different environments and improve audio type recognition accuracy.
  • it is determined whether the threshold adjustment condition is met. If the threshold adjustment condition is met, the current recognition threshold is adjusted, and the next audio frame is recognized based on the adjusted recognition threshold; when the threshold adjustment condition is not met, the current recognition threshold is kept unchanged, and the next audio frame is recognized based on the current recognition threshold, to avoid frequent adjustments to the recognition threshold and the interference such frequent adjustments cause to audio type recognition.
  • the threshold adjustment condition may be a condition set using the characteristics of continuous audio frames of speech type, which may include but is not limited to a condition for judging whether the speech ends; that is, the threshold adjustment condition may be used to filter out interrupted or paused speech segments.
  • the threshold adjustment condition is that the audio type of the current audio frame is noise type, and the audio type of the previous audio frame is speech type.
  • if the audio type of the current audio frame is noise type and the audio type of the previous audio frame is speech type, it indicates that there may be a voice interruption in the audio piece, that is, the speech stops or terminates.
  • the threshold adjustment condition is used to determine whether a segment of speech-type audio has ended. When such a segment ends, the speech-type audio is then verified as a whole, that is, multiple audio frames are verified together, considering whether the audio is speech from an overall perspective, and the determination status of the recognized audio type is obtained to measure whether the current recognition threshold setting is appropriate. This judgment method is more reliable and can avoid misjudging the audio type based on a single audio frame.
  • the identified continuous audio frames may be continuous audio frames for which audio type recognition has been completed, that is, multiple audio frames that have been continuously identified as speech type.
  • the identified continuous audio frames are located before the current audio frame and adjoin the end of the last speech segment; that is, the last frame of the identified continuous audio frames is the audio frame immediately preceding the current audio frame.
  • the characteristic information of the identified continuous audio frames refers to the reference information used to determine whether the identified audio type is correct.
  • the characteristic information can be obtained by calculation or statistics of the basic information of the identified continuous audio frames.
  • the recognized audio type refers to the overall type of continuous audio frames, that is, the audio type corresponding to a recognized piece of audio, rather than the audio type of a single audio frame.
  • the judgment status refers to the correct or wrong status of the recognized audio type.
  • the characteristic information includes one or more of the following: length of continuous speech frames, recognition probability, pitch frequency and energy value.
  • the length of the continuous speech frames refers to their time length, which can be obtained statistically, that is, the sum of the durations of the multiple audio frames in the continuous speech frames.
  • the recognition probability refers to the probability that the continuous speech frame is of the speech type. For example, it may be the average probability of multiple speech frames in the continuous speech frame being of the speech type.
  • the fundamental frequency refers to the frequency of vocal cord vibration.
  • the fundamental frequency of a continuous speech frame may be the average of the fundamental frequencies of multiple speech frames in the continuous speech frame.
  • the energy value of the continuous speech frames may be the sum or the average of the energy values of the multiple speech frames in the continuous speech frames. It should be noted that the above characteristic information can be obtained through statistics, calculation, etc., and will not be described again here.
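The segment statistics listed above (length, average recognition probability, average pitch frequency, summed energy) might be gathered as in the following sketch. The per-frame field names and the 10 ms frame duration are assumptions for illustration:

```python
def segment_features(speech_frames, frame_ms=10):
    # speech_frames: per-frame dicts with keys 'prob', 'pitch_hz', 'energy'
    n = len(speech_frames)
    return {
        "length_ms": n * frame_ms,                                   # total duration
        "mean_prob": sum(f["prob"] for f in speech_frames) / n,      # average recognition probability
        "mean_pitch_hz": sum(f["pitch_hz"] for f in speech_frames) / n,
        "energy": sum(f["energy"] for f in speech_frames),           # summed energy
    }
```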
  • for effective speech segments (that is, continuous audio frames of speech type), a corresponding discrimination threshold, such as a pitch-frequency discrimination threshold or an energy discrimination threshold, is used to judge the characteristic information of the continuous audio frames to determine whether the judgment result of their audio type is correct, so as to obtain the determination status of the identified audio type. For example, if the feature information is greater than the corresponding set threshold, the determination status is correct; if the feature information is less than or equal to the corresponding set threshold, the determination status is wrong.
  • determining the determination status of the identified audio type based on the characteristic information of the identified continuous audio frames includes: determining the characteristic information of the continuous speech frames before the current audio frame, comparing the characteristic information with the judgment threshold of the characteristic information, and determining the determination status of the identified audio type based on the comparison result.
  • the characteristic information of the continuous speech frames before the current audio frame can be obtained through calculation, statistics, and other methods, and can include one or more items. It can be understood that, when there are multiple items of feature information, the evaluation parameters for determining the status are more abundant, which can improve the accuracy of the determination.
  • the judgment threshold of the feature information can be set based on experience and can include one or more values, that is, the feature information can have one or more judgment thresholds.
  • the feature information and its judgment thresholds are compared one by one to obtain comparison results, and the determination status of the identified audio type is then determined based on the comparison results, where there is a mapping relationship between the comparison results and the determination status. For example, if all items of feature information are greater than their judgment thresholds, the determination status of the identified audio type is correct; if at least one item of feature information is less than or equal to its judgment threshold, the determination status of the identified audio type is error. In this embodiment, status determination is performed through threshold comparison. This method is simple and efficient, and can quickly verify the audio type of the identified continuous speech frames.
  • the length of the continuous speech frames and the average recognition probability are determined; the length of the continuous speech frames is judged against the duration threshold, and the average recognition probability is judged against the recognition probability threshold.
  • when the length of the continuous speech frames is greater than the duration threshold and the average recognition probability is greater than the recognition probability threshold, the determination status of the recognized audio type is determined to be correct; when the length of the continuous speech frames is less than or equal to the duration threshold, and/or the average recognition probability is less than or equal to the recognition probability threshold, the determination status of the recognized audio type is determined to be error.
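The decision rule above can be sketched as follows. The 200 ms duration threshold and 0.6 probability threshold are illustrative placeholders, not values taken from the patent:

```python
def determination_status(length_ms, mean_prob, dur_thresh_ms=200, prob_thresh=0.6):
    # correct only when BOTH the segment length and the average recognition
    # probability exceed their thresholds; otherwise the earlier speech
    # decision is treated as an error
    if length_ms > dur_thresh_ms and mean_prob > prob_thresh:
        return "correct"
    return "error"
```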
  • the determination state includes an error state and a correct state
  • the current recognition threshold is adjusted according to the determination status, wherein the adjustment includes increasing and decreasing; for example, when the determination status is an error status, the current recognition threshold is increased; when the determination status is a correct status, the current recognition threshold is decreased.
  • the error status indicates that the determination of the recognized audio type was wrong, meaning that noise with a short duration, a small recognition probability, a low pitch frequency, or a small energy value was recognized as speech, so the current recognition threshold needs to be increased to prevent this type of noise from subsequently being identified as speech; the correct status indicates that the identified audio type was determined correctly.
  • when the determination status is correct, the current recognition threshold can be reduced and the criterion for recognition probability relaxed. It should be noted that if the current recognition threshold has been increased to the upper threshold, or reduced to the lower threshold, it will not continue to increase or decrease, to prevent over-adjustment, which may lead to a decrease in the accuracy of audio recognition.
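The clamped adjustment just described can be sketched as follows; the step size and the upper/lower bounds are illustrative assumptions, not values from the patent:

```python
def adjust_threshold(current, status, step=0.05, lower=0.3, upper=0.9):
    # error status -> raise the threshold (be stricter about calling a frame speech);
    # correct status -> lower it (relax the criterion);
    # clamp to [lower, upper] to prevent over-adjustment
    if status == "error":
        return min(current + step, upper)
    return max(current - step, lower)
```

The returned value becomes the current recognition threshold for the next audio frame.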
  • in the technical solution of the embodiment of the present disclosure, the audio frame to be processed is acquired and the audio data is processed frame by frame, so that the audio type of the audio frame is determined based on the current recognition threshold, achieving a preliminary judgment of the audio frame type; when the current audio frame satisfies the threshold adjustment condition, the determination status of the identified audio type is determined based on the characteristic information of the identified continuous audio frames, so that the audio type of the identified continuous audio frames can be re-verified; the current recognition threshold is adjusted based on the re-verified determination status, and the adjusted recognition threshold can be used to identify the audio type of the next audio frame, that is, the recognition threshold continues to be adjusted during recognition of the next audio frame, achieving dynamic adjustment of the recognition threshold; identifying audio types according to the dynamically adjusted recognition threshold can improve the accuracy of audio type recognition.
  • FIG 2 is a schematic flow chart of an audio processing method provided by an embodiment of the present disclosure.
  • the method of this embodiment can be combined with multiple example solutions of the audio processing method provided in the above embodiments.
  • the audio processing method provided in this embodiment is refined. For example, determining the audio type of the audio frame based on the current recognition threshold includes: extracting audio features of the audio frame, inputting the audio features into an audio recognition model, and obtaining the recognition probability of the audio frame; and determining the audio type of the audio frame based on the current recognition threshold and the recognition probability.
  • the method in this embodiment includes:
  • audio features can be used to determine the recognition probability of audio frames, and the audio features can be obtained through a series of calculations and transformations.
  • the audio feature extraction process may include but is not limited to windowing, fast Fourier transform, mel spectrum conversion, normalization, etc.
  • the audio recognition model can be a fully trained deep learning model, including but not limited to a recurrent neural network model, a long short-term memory recurrent neural network, etc.
  • the recognition probability of an audio frame can be used to determine whether the audio frame is speech.
  • the duration of the audio frame can be 10 ms and the sampling rate can be 16 kHz.
  • the signal x(n) is multiplied by the window function w(n) to obtain the windowed signal x(n)*w(n).
  • the window function can be a function such as Hanning window.
  • the spectrum is divided into several segments, and the energy of each segment is summed based on the Mel filter function to obtain the Mel spectrum.
  • the Mel spectrum is normalized to obtain the audio features of the audio frame; the audio features are input into the audio recognition model, and the recognition probability of the audio frame can be obtained.
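The windowing / FFT / Mel / normalization pipeline above can be sketched in simplified form. This is not the patent's implementation: a naive DFT stands in for the FFT for readability, and equal-width band sums stand in for the Mel filter bank; all function names are hypothetical:

```python
import math

def hann_window(n):
    # Hanning window, one of the window functions mentioned in the text
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def power_spectrum(frame):
    # window the frame, then take a naive DFT (an FFT would be used in practice)
    n = len(frame)
    x = [s * w for s, w in zip(frame, hann_window(n))]
    spec = []
    for k in range(n // 2 + 1):  # keep the non-redundant half of the spectrum
        re = sum(x[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
        im = -sum(x[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
        spec.append(re * re + im * im)
    return spec

def band_energies(spec, n_bands=4):
    # crude stand-in for Mel filtering: sum power over equal-width bands
    size = len(spec) // n_bands
    return [sum(spec[b * size:(b + 1) * size]) for b in range(n_bands)]

def normalize(feats):
    # scale features to [0, 1]
    peak = max(feats) or 1.0
    return [f / peak for f in feats]
```

The normalized band energies play the role of the audio features fed to the recognition model.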
  • if the recognition probability is greater than or equal to the current recognition threshold, the audio type of the audio frame is the speech type; if the recognition probability is less than the current recognition threshold, the audio type of the audio frame is the noise type.
  • the training method of the audio recognition model includes: obtaining noise-free audio and setting labels for audio segments in the noise-free audio; obtaining noise information and superimposing the noise information onto the noise-free audio to form sample audio, wherein the noise information includes at least one of steady noise, transient noise, and howling noise; and iteratively training the audio recognition model to be trained based on the sample audio until a trained audio recognition model is obtained.
  • howling noise is a feedback sound.
  • the audio recognition model is trained by superimposing the howling noise onto the noise-free audio and using the superimposed audio as sample audio, so that the audio recognition model has the ability to determine howling as noise, and this audio processing method therefore has the ability to suppress howling.
  • the audio recognition model is trained so that it has the ability to detect various noises such as steady noise, transient noise, and howling noise, which improves the robustness and applicability of the audio recognition model.
  • noise-free audio can include clean speech and blank audio. Audio frames in clean speech that exceed a preset threshold can be labeled 1, and other audio frames labeled 0. After noise information is obtained, it can be superimposed onto the noise-free audio to form sample audio; the label of the sample audio is consistent with the label of the noise-free audio, that is, the label does not change.
  • focal loss can be used as the loss function, and the audio recognition model to be trained is iteratively trained based on the sample audio until the trained audio recognition model is obtained.
  • the method further includes: adjusting the signal-to-noise ratio in the sample audio; and/or filtering the sample audio based on a preset filter.
  • the signal-to-noise ratio in the sample audio can be adjusted randomly or in a fixed setting, and the preset filters can include but are not limited to high-pass filters, low-pass filters, etc.
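Superimposing noise onto clean audio at a chosen signal-to-noise ratio could look like the following sketch; the function and parameter names are illustrative, not from the patent:

```python
def mix_at_snr(clean, noise, snr_db):
    # scale the noise so the mixture attains the requested SNR (in dB)
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(s * s for s in noise) / len(noise)
    gain = (p_clean / (p_noise * 10 ** (snr_db / 10))) ** 0.5
    return [c + gain * n for c, n in zip(clean, noise)]
```

Randomizing `snr_db` across samples is one way to realize the random SNR adjustment mentioned above.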
  • the technical solution of the embodiment of the present disclosure is to superimpose the howling noise into the noise-free audio, and use the audio superimposed with the howling noise as the sample audio to train the audio recognition model, so that the audio recognition model has the ability to detect howling noise.
  • Figure 3 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure.
  • the method of this embodiment can be combined with multiple example solutions of the audio processing method provided in the above embodiments.
  • the audio processing method provided in this embodiment is refined. For example, after the audio frame to be processed is obtained, the audio frame is added to the buffer area, where the buffer area is used to store multiple audio frames that have not been output, the current audio frame is located at the last frame of the buffer area, and the first frame in the buffer area is the frame to be output.
  • the method in this embodiment includes:
  • the buffer area can be used to store multiple unoutputted audio frames, and the multiple audio frames are arranged in the buffer area based on timestamps.
  • the embodiment of the present disclosure is a solution for adding audio frames to the buffer area in real time, that is, while performing audio type discrimination on the current audio frame, the current audio frame is added to the buffer area.
  • the current audio frame can be stored after the last recognized audio frame, that is, the newly added audio frame is always the last frame of the buffer area, and the first frame in the buffer area is set as the frame to be output, which ensures that audio is output in the order it was added, that is, according to timing, avoiding disordered voice playback.
  • the recognition threshold is dynamically adjusted to improve its accuracy; correspondingly, during the recognition of multiple audio frames at the initial stage of the audio, the recognition threshold is still in the initial adjustment stage.
  • the audio type of the currently cached audio frames in the buffer area is set to speech type, to avoid misidentifying speech frames as noise frames in the initial audio recognition stage.
  • if the audio type of the current audio frame is noise type, the noise frame has no impact on the previous audio frames.
  • the audio type of the audio frame corresponding to the current audio frame in the buffer area is set to the noise type, that is, the audio type of the last audio frame in the buffer area is set to noise type.
  • the audio type of a subsequent audio frame is used to update the audio type of previous audio frames, to avoid misrecognition of speech frames during the recognition-threshold adjustment process and ensure the normal output of speech frames.
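The buffering-with-retroactive-relabeling behavior described above can be sketched as follows. The class name, list representation, and output policy are assumptions for illustration, not the patent's data structure:

```python
class FrameBuffer:
    """Holds not-yet-output frames. A speech decision retroactively marks all
    buffered frames as speech; a noise decision marks only the newest frame."""

    def __init__(self, max_frames):
        self.max_frames = max_frames
        self.frames = []  # list of [frame, audio_type]; index 0 is next to output

    def push(self, frame, audio_type):
        self.frames.append([frame, audio_type])
        if audio_type == "speech":
            # retroactive correction of earlier buffered frames
            for entry in self.frames:
                entry[1] = "speech"
        if len(self.frames) > self.max_frames:
            return self.frames.pop(0)  # the oldest frame leaves the buffer
        return None
```

Frames thus leave the buffer in arrival order, after any later speech decision has had a chance to correct their type.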
  • the audio in this implementation can be real-time output audio such as call audio, live broadcast audio, etc.
  • the method further includes: determining the current threshold range in which the adjusted recognition threshold is located, and determining the buffer length according to the current threshold range.
  • the current threshold range can be set based on experience, and the adjusted recognition threshold can be compared with the current threshold range. Since the recognition threshold is a dynamically changing value, the comparison result also changes dynamically, and the buffer length can be dynamically adjusted based on the comparison result.
  • the buffer length refers to the length of time that audio frames can be buffered. It should be noted that the adjustment of the buffer length is actually an adjustment of the audio output delay. In the embodiment of the present disclosure, by dynamically adjusting the length of the buffer area, it is possible to avoid the situation where the speech at the beginning of the speech is misjudged as noise. It can be understood that by adjusting the length of the buffer area and buffering different numbers of audio frames in the buffer, secondary correction of audio types of different numbers of audio frames can be achieved, thereby improving the accuracy of speech recognition.
  • the current threshold range may include a first threshold and a second threshold.
  • the first threshold is greater than the second threshold. If the adjusted recognition threshold is greater than the first threshold, indicating that the current recognition error rate for audio frames is high, the buffer length is set to the first time length; if the adjusted recognition threshold is less than the second threshold, indicating that the current recognition error rate is low, the buffer length is set to the second time length.
  • in other cases, the buffer length can remain unchanged, wherein the first time length is greater than the second time length.
  • the buffer length may be set to a third time length, and the third time length may be greater than the second time length and less than the first time length.
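The mapping from the adjusted recognition threshold to a buffer length described above can be sketched as follows. This is a minimal illustration: the function name, threshold values, and millisecond lengths are assumptions for demonstration, not values given in the disclosure.

```python
def buffer_length_for_threshold(threshold, first_threshold=0.7, second_threshold=0.3,
                                first_len_ms=500, second_len_ms=100, current_len_ms=300):
    """Map the adjusted recognition threshold to a buffer length in ms.

    A high threshold suggests a high recent recognition error rate, so more
    frames are buffered for secondary correction; a low threshold allows a
    shorter output delay. All numeric values are illustrative placeholders.
    """
    if threshold > first_threshold:
        return first_len_ms      # first time length (longest delay)
    if threshold < second_threshold:
        return second_len_ms     # second time length (shortest delay)
    return current_len_ms        # otherwise the buffer length stays unchanged
```

Extending this to more than two thresholds, as the disclosure permits, would just add further branches between the two bounds.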
  • the current threshold range may include more than two thresholds, and the buffer length is adjusted through more than two thresholds.
  • the method further includes: when the audio types of the audio frames in the buffer area are all the noise type, clearing the buffer area and rebuilding a buffer area based on the current buffer area length.
  • the technical solution of the embodiment of the present disclosure can achieve delayed output of audio frames by adding audio frames to the buffer area; if the audio type of the current audio frame is the speech type, the audio types of the multiple audio frames in the buffer area are set to the speech type; if the audio type of the current audio frame is the noise type, the audio type of the last audio frame in the buffer area is set to the noise type, so as to determine whether the speech has ended; the buffer length is determined according to the current threshold range, dynamically adjusting the audio output delay, which avoids misjudging the beginning of speech and improves the accuracy of speech type recognition.
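The secondary correction of buffered frame types can be sketched as below. The function and type labels ("speech"/"noise") are assumed names for illustration only; the buffer is modeled as a list of (frame, type) pairs whose last entry is the current frame.

```python
def update_buffer_types(buffer, current_type):
    """Secondary correction of buffered audio types (illustrative sketch).

    `buffer` holds (frame, audio_type) pairs that have not been output yet.
    """
    if current_type == "speech":
        # a speech frame retroactively marks everything buffered as speech
        for i, (frame, _) in enumerate(buffer):
            buffer[i] = (frame, "speech")
    elif current_type == "noise" and buffer:
        # a noise frame only downgrades the most recent buffered frame
        frame, _ = buffer[-1]
        buffer[-1] = (frame, "noise")
    return buffer
```

A speech frame thus rescues frames that were provisionally labeled noise at the start of an utterance, while a noise frame affects only the frame just added.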
  • Figure 4 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure.
  • the method of this embodiment can be combined with multiple example solutions of the audio processing method provided in the above embodiments.
  • the audio processing method provided in this embodiment is refined.
  • the method further includes: determining an output gain of the audio frame to be output based on the audio type of the audio frame to be output; processing the audio frame to be output based on the output gain to obtain an output audio frame and outputting it.
  • the method in this embodiment includes:
  • S450: Process the audio frame to be output based on the output gain to obtain an output audio frame, and output it.
  • the audio frame to be output refers to the audio frame located at the first position in the buffer area.
  • the output gain of the audio frame to be output can be used to adjust the output state of the audio frame to be output to achieve noise reduction.
  • a fixed correspondence between audio type and output gain can be established in advance to generate a fixed relationship table, and the audio type of the audio frame to be output is matched in the fixed relationship table to obtain the corresponding output gain.
  • the output gain of the audio frame to be output can also be determined based on the gain function model.
  • the gain function model can be used to calculate the output gain of the next audio frame to be output, dynamically adjusting the output gain so that the output audio frames sound more natural.
  • determining the output gain of the audio frame to be output based on the audio type of the audio frame to be output includes: when the audio type of the audio frame to be output is a speech type , the output gain of the audio frame to be output is determined to be the first preset value; when the audio type of the audio frame to be output is the noise type, the output gain of the audio frame to be output is determined to be the second preset value, where , the first preset value is greater than the second preset value.
  • the first preset value may be 1, and the second preset value may be 0.
  • when the audio type of the audio frame to be output is the speech type, the output gain of the audio frame to be output may be set to 1 and the frame is output normally; when the audio type of the audio frame to be output is the noise type, the output gain may be set to 0 and the frame is eliminated, achieving mute noise reduction.
  • the first preset value and the second preset value here are only examples.
  • the first preset value and the second preset value can also be other values, such as 0.8 and 0.2, which are not limiting.
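The fixed-gain rule above can be sketched in a few lines. The function names and type labels are assumptions for illustration; applying the gain is modeled as scaling every sample of the frame:

```python
def output_gain(audio_type, speech_gain=1.0, noise_gain=0.0):
    """Return the output gain for a frame by its audio type (sketch).

    1.0 passes speech through unchanged; 0.0 mutes noise frames.
    The preset values are illustrative and could be e.g. 0.8 and 0.2.
    """
    return speech_gain if audio_type == "speech" else noise_gain

def apply_gain(frame, gain):
    # noise reduction: multiply every sample of the frame by the gain
    return [sample * gain for sample in frame]
```

With a gain of 1 the frame is unchanged; with a gain of 0 the frame is zeroed out, which clears the noise frame as described.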
  • determining the output gain of the audio frame to be output based on the audio type of the audio frame to be output further includes: when the audio type of the audio frame to be output is the noise type, and the output audio frames of the preset number of frames include speech-type audio frames, performing smoothing based on the output gain of the previously output audio frame to obtain the output gain of the audio frame to be output.
  • smoothing processing can obtain a gradually changing output gain.
  • this prevents the output from repeatedly jumping between audio types and removes glitches in the audio, making the output audio frames more natural.
  • the preset number of frames may be N; that is, when the audio type of the audio frame to be output is the noise type and N speech-type audio frames have been output before it, the output gain of the audio frame to be output is determined according to the output gain of the previously output audio frame, decreasing it step by step. For example, if the output gain of the last output audio frame is 0.8 and the gain adjustment value is 0.2, subtracting the gain adjustment value gives an output gain of 0.6 for the audio frame to be output. It should be noted that the above output gain values and decrement values are only examples and are not limiting.
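The step-by-step decrement can be sketched as a one-line smoothing rule; the function name, step size, and floor are illustrative assumptions:

```python
def smoothed_gain(prev_gain, step=0.2, floor=0.0):
    """Step the gain down gradually instead of cutting to zero at once.

    Used when a noise frame follows recently output speech frames, so the
    transition does not produce an audible click. `step` and `floor` are
    illustrative values (0.8 -> 0.6 -> 0.4 -> ... with the defaults).
    """
    return max(prev_gain - step, floor)
```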
  • the output gain obtained above can be used to process the audio frame to be output to obtain the output audio frame.
  • the output audio frame can refer to the audio frame that needs to be output and played.
  • the output method may include, but is not limited to, direct output from the current device, or transmission to other devices for output via wired or wireless communication.
  • the output gain can be multiplied by the parameter values corresponding to the audio frame to be output, so as to perform noise reduction processing on it. For example, if the output gain is 1, the audio frame to be output remains unchanged; if the output gain is 0, the audio frame to be output is set to zero and the noise frame is cleared.
  • the technical solution of the embodiment of the present disclosure determines the output gain of the audio frame to be output according to its audio type; the output gain can be used to adjust the output state of the audio frame to be output, achieving noise reduction processing and improving the quality of the output audio frames.
  • FIG. 5 is a schematic structural diagram of an audio processing device provided by an embodiment of the present disclosure. As shown in Figure 5, the device includes:
  • the type determination module 510 is configured to obtain the audio frame to be processed and determine the audio type of the audio frame based on the current recognition threshold;
  • the status determination module 520 is configured to determine the determination status of the identified audio type based on the characteristic information of the identified continuous audio frames when the current audio frame meets the threshold adjustment condition;
  • the threshold adjustment module 530 is configured to adjust the current recognition threshold according to the determination state, wherein the adjusted recognition threshold is used to identify the audio type of the next audio frame.
  • the audio type includes a speech type and a noise type
  • the threshold adjustment condition is that the audio type of the current audio frame is noise type, and the audio type of the previous audio frame is speech type.
  • the status determination module 520 is further configured to:
  • the determination status of the identified audio type is determined based on the comparison result.
  • the characteristic information includes one or more of the following: length of the continuous speech frame, recognition probability, pitch frequency, and energy value.
  • the determination status includes an error status and a correct status; the threshold adjustment module 530 is also configured to:
  • the current recognition threshold is reduced.
  • the audio frame is added to the buffer area, where the buffer area is used to store multiple audio frames that have not been output; the current audio frame is the last frame in the buffer area, and the first frame in the buffer area is the frame to be output.
  • the device is further configured to:
  • the audio type of the current audio frame is the speech type, then set the audio types of the multiple audio frames in the buffer area to the speech type;
  • the audio type of the current audio frame is the noise type
  • the audio type of the last audio frame in the buffer area is set to the noise type
  • the device further includes:
  • the device further includes:
  • the buffer area is cleared, and a buffer area is rebuilt based on the current buffer area length.
  • the device further includes:
  • a gain determination module configured to determine the output gain of the audio frame to be output based on the audio type of the audio frame to be output;
  • the audio output module is configured to process the audio frame to be output based on the output gain, obtain an output audio frame, and output it.
  • the gain determination module is further configured to:
  • the audio type of the audio frame to be output is a speech type
  • the output gain of the audio frame to be output is determined to be a second preset value, wherein the first preset value is greater than the second preset value.
  • the gain determination module is further configured to:
  • the audio type of the audio frame to be output is the noise type
  • the output audio frames of the preset number of frames include audio frames of the speech type
  • the type determination module 510 is further configured to:
  • Extract the audio features of the audio frame, input the audio features into the audio recognition model, and obtain the recognition probability of the audio frame;
  • the audio type of the audio frame is determined based on the current recognition threshold and the recognition probability.
  • the training device for the audio recognition model includes:
  • a label setting module configured to obtain noise-free audio, and set labels for audio segments in the noise-free audio
  • a sample production module configured to obtain noise information, and superimpose the noise information onto the noise-free audio to form sample audio, wherein the noise information includes at least one of steady noise, transient noise, and howling noise;
  • the model training module is configured to iteratively train the audio recognition model to be trained based on the sample audio until a trained audio recognition model is obtained.
  • the training device of the audio recognition model may also be configured to:
  • the audio processing device provided by the embodiments of the present disclosure can execute the audio processing method provided by any embodiment of the present disclosure, and has corresponding functional modules and beneficial effects for executing the method.
  • Terminal devices in embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (Personal Digital Assistants), PADs (tablets), PMPs (Portable Multimedia Players), and vehicle-mounted terminals (such as car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 6 is only an example and should not impose any limitations on the functions and scope of use of the embodiments of the present disclosure.
  • the electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 401, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403.
  • in the RAM 403, various programs and data required for the operation of the electronic device 400 are also stored.
  • the processing device 401, ROM 402 and RAM 403 are connected to each other via a bus 404.
  • An input/output (I/O) interface 405 is also connected to bus 404.
  • the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 407 including, for example, a liquid crystal display (LCD), speakers, vibrators, etc.; storage devices 408 including, for example, magnetic tapes, hard disks, etc.; and a communication device 409.
  • the communication device 409 may allow the electronic device 400 to communicate wirelessly or wiredly with other devices to exchange data.
  • While FIG. 6 illustrates the electronic device 400 with various means, it should be understood that implementing or providing all of the illustrated means is not required; more or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product including a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via communication device 409, or from storage device 408, or from ROM 402.
  • when the computer program is executed by the processing device 401, the above-mentioned functions defined in the method of the embodiment of the present disclosure are performed.
  • the electronic device provided by the embodiments of the present disclosure and the audio processing method provided by the above embodiments belong to the same concept.
  • Technical details that are not described in detail in this embodiment can be found in the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.
  • Embodiments of the present disclosure provide a computer storage medium on which a computer program is stored.
  • when the program is executed by a processor, the audio processing method provided by the above embodiments is implemented.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof.
  • More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communications in any form or medium (e.g., a communications network).
  • examples of communications networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; it may also exist independently without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs; when the one or more programs are executed by the electronic device, the electronic device is caused to perform the method described above.
  • the current recognition threshold is adjusted according to the determination state, wherein the adjusted recognition threshold is used to identify the audio type of the next audio frame.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (such as an Internet service provider through Internet connection).
  • each block in the flowcharts or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures; for example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks therein, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure can be implemented in software or hardware, and in some cases the name of a unit/module does not constitute a limitation on the unit itself.
  • For example, without limitation, exemplary types of hardware logic components that can be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include electrical connections based on one or more wires, portable computer disks, hard drives, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
  • Example 1 provides an audio processing method, including:
  • the current recognition threshold is adjusted according to the determination state, wherein the adjusted recognition threshold is used to identify the audio type of the next audio frame.
  • Example 2 provides an audio processing method, wherein,
  • the audio type includes speech type and noise type
  • the threshold adjustment condition is that the audio type of the current audio frame is noise type, and the audio type of the previous audio frame is speech type.
  • Example 3 provides an audio processing method, wherein,
  • the determination status of the identified audio type is determined based on the comparison result.
  • Example 4 provides an audio processing method, wherein,
  • the characteristic information includes one or more of the following: length of the continuous speech frame, recognition probability, pitch frequency and energy value.
  • Example 5 provides an audio processing method, wherein,
  • the determination status includes an error status and a correct status
  • Adjusting the current recognition threshold according to the determination status includes:
  • the current recognition threshold is reduced.
  • Example 6 provides an audio processing method
  • the current audio frame is the last frame in the buffer area, and the first frame in the buffer area is the frame to be output.
  • Example 7 provides an audio processing method. After determining the audio type of the audio frame based on the current recognition threshold, the method further includes:
  • the audio type of the current audio frame is the speech type, then set the audio types of the multiple audio frames in the buffer area to the speech type;
  • Example 8 provides an audio processing method. After adjusting the current recognition threshold according to the determination state, the method further includes:
  • Example 9 provides an audio processing method, including:
  • the buffer area is cleared, and a buffer area is rebuilt based on the current buffer area length.
  • Example 10 provides an audio processing method, further comprising:
  • the audio frame to be output is processed based on the output gain to obtain an output audio frame and output.
  • Example 11 provides an audio processing method, wherein,
  • Determining the output gain of the audio frame to be output based on the audio type of the audio frame to be output includes:
  • the audio type of the audio frame to be output is a speech type
  • the output gain of the audio frame to be output is determined to be a second preset value, wherein the first preset value is greater than the second preset value.
  • Example 12 provides an audio processing method, wherein,
  • Determining the output gain of the audio frame to be output based on the audio type of the audio frame to be output further includes:
  • the audio type of the audio frame to be output is the noise type
  • the output audio frames of the preset number of frames include audio frames of the speech type
  • Example 13 provides an audio processing method, wherein,
  • Determining the audio type of the audio frame based on the current recognition threshold includes:
  • Extract the audio features of the audio frame, input the audio features into the audio recognition model, and obtain the recognition probability of the audio frame;
  • the audio type of the audio frame is determined based on the current recognition threshold and the recognition probability.
  • Example 14 provides an audio processing method, wherein,
  • the training method of the audio recognition model includes:
  • obtaining noise information, and superimposing the noise information onto the noise-free audio to form sample audio, wherein the noise information includes at least one of steady noise, transient noise, and howling noise;
  • the audio recognition model to be trained is iteratively trained based on the sample audio until a trained audio recognition model is obtained.
  • Example 15 provides an audio processing method, further comprising at least one of the following:
  • Example 16 provides an audio processing device, which includes:
  • a type determination module configured to obtain the audio frame to be processed and determine the audio type of the audio frame based on the current recognition threshold
  • a state determination module configured to determine the determination state of the identified audio type based on the characteristic information of the identified continuous audio frames when the current audio frame meets the threshold adjustment condition
  • a threshold adjustment module configured to adjust the current recognition threshold according to the determination state, wherein the adjusted recognition threshold is used to identify the audio type of the next audio frame.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An audio processing method, apparatus, storage medium, and electronic device (400). The audio processing method includes: obtaining an audio frame to be processed, and determining the audio type of the audio frame based on a current recognition threshold (S110); in response to determining that the current audio frame meets a threshold adjustment condition, determining the determination state of the identified audio type based on characteristic information of identified continuous audio frames (S120); and adjusting the current recognition threshold according to the determination state, wherein the adjusted recognition threshold is used to identify the audio type of the next audio frame (S130).

Description

Audio processing method and apparatus, storage medium, and electronic device
This application claims priority to Chinese Patent Application No. 202210367406.8, filed with the Chinese Patent Office on April 8, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate to the field of data processing technology, and relate, for example, to an audio processing method, apparatus, storage medium, and electronic device.
Background
With the continuous development of the Internet and communication technology, audio recognition is attracting increasing attention from users in fields such as communication systems and speech recognition.
At present, audio recognition can be performed by setting a fixed threshold, but the recognition accuracy of this method is poor.
Summary
Embodiments of the present disclosure provide an audio processing method, apparatus, storage medium, and electronic device, so as to improve the accuracy of audio recognition.
In a first aspect, an embodiment of the present disclosure provides an audio processing method, including:
obtaining an audio frame to be processed, and determining the audio type of the audio frame based on a current recognition threshold;
in response to determining that the current audio frame meets a threshold adjustment condition, determining the determination state of the identified audio type based on characteristic information of identified continuous audio frames; and
adjusting the current recognition threshold according to the determination state, wherein the adjusted recognition threshold is used to identify the audio type of the next audio frame.
In a second aspect, an embodiment of the present disclosure further provides an audio processing apparatus, including:
a type determination module, configured to obtain an audio frame to be processed and determine the audio type of the audio frame based on a current recognition threshold;
a state determination module, configured to, in response to determining that the current audio frame meets a threshold adjustment condition, determine the determination state of the identified audio type based on characteristic information of identified continuous audio frames; and
a threshold adjustment module, configured to adjust the current recognition threshold according to the determination state, wherein the adjusted recognition threshold is used to identify the audio type of the next audio frame.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including:
one or more processors; and
a storage device configured to store one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the audio processing method according to any embodiment of the present disclosure.
In a fourth aspect, an embodiment of the present disclosure further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the audio processing method according to any embodiment of the present disclosure.
Brief Description of the Drawings
Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic, and that parts and elements are not necessarily drawn to scale.
FIG. 1 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an audio processing apparatus provided by an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit the steps shown. The scope of the present disclosure is not limited in this respect.
As used herein, the term "include" and its variants are open-ended, i.e., "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the following description.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules, or units, and are not used to limit the order of the functions performed by these apparatuses, modules, or units or their interdependence.
It should be noted that the modifiers "a/an" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
FIG. 1 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The embodiment of the present disclosure is suitable for recognizing audio types according to an automatically adjusted threshold. The method may be performed by the audio processing apparatus provided by the embodiment of the present disclosure, which may be implemented in the form of software and/or hardware, for example, by an electronic device, which may be a mobile terminal, a PC, or the like. As shown in FIG. 1, the method of this embodiment includes:
S110: Obtain an audio frame to be processed, and determine the audio type of the audio frame based on a current recognition threshold.
S120: When the current audio frame meets a threshold adjustment condition, determine the determination state of the identified audio type based on characteristic information of identified continuous audio frames.
S130: Adjust the current recognition threshold according to the determination state, wherein the adjusted recognition threshold is used to identify the audio type of the next audio frame.
In the embodiment of the present disclosure, the execution subject of the audio processing method (the above electronic device) includes, but is not limited to, devices such as mobile phones, smart watches, and computers. The electronic device may obtain the audio frame to be processed in a variety of ways. For example, audio data may be collected in real time by an audio collection apparatus and the audio frame to be processed may be extracted from it, or audio data may be retrieved from a preset storage location or another device and the audio frame to be processed extracted from it; the embodiments of the present disclosure do not limit the method of obtaining the audio frame to be processed. The audio data may include, but is not limited to, call audio data, audio data in videos, live-stream audio data, and the like, which is not limited here.
The audio data may be a piece of audio, which may contain information such as speech and noise. This embodiment does not limit the duration of the audio data. To improve the recognition accuracy of the audio data and the real-time performance of its processing, the audio data is divided into multiple audio frames, and each audio frame is recognized. When the audio data is real-time data, the audio data collected in real time is divided into audio frames in sequence, and the obtained audio frames are recognized in real time. When the audio data is offline data, the multiple audio frames may be processed in sequence according to the order in which they were divided. An audio frame may be audio data with a preset time length; the duration of an audio frame may be determined according to the recognition accuracy, which is not limited here. For example, the duration of an audio frame may be 10 ms. In this embodiment, the audio frame to be processed may be the audio frame that currently needs to be processed.
In this embodiment, audio types of multiple speech frames are recognized based on a dynamic recognition threshold. For an audio frame to be processed, the current recognition threshold corresponding to the processing moment of that frame is obtained; the current recognition threshold may be the judgment threshold corresponding to the processing moment of the current audio frame and may be used to judge the audio type of the frame. It should be noted that the recognition threshold is an adjustable value: the current recognition threshold may be a recognition threshold adjusted on the basis of the previous speech frame, or it may be an unadjusted initial recognition threshold.
In this embodiment, audio frames are recognized by audio type, and the audio types may be set according to recognition requirements. In some embodiments, audio types are divided into the speech type and the noise type according to whether an audio frame contains noise; in some embodiments, audio frames may also be divided into types such as human voice, rain, and birdsong according to the object producing the sound; in some embodiments, audio frames may also be divided into types such as singing and talking, which is not limited here. For example, the audio types include the speech type and the noise type. An audio frame of the speech type is a speech frame, which may contain speech information, i.e., actual language content; an audio frame of the noise type is a noise frame, which may contain noise information, i.e., interference unrelated to the language content, such as interfering sounds produced in the environment.
On the basis of the above embodiments, the audio frame may be recognized based on a preset audio recognition algorithm to obtain the recognition probability of the audio frame, where the preset audio recognition algorithm is a recognition algorithm adapted to the recognition requirements of the audio types. In some embodiments, the audio recognition algorithm may be a machine learning model, such as a neural network model, which is not limited here. The recognition probability of the audio frame is judged based on the current recognition threshold to determine the audio type of the audio frame. Taking audio types including the speech type and the noise type as an example, when the recognition probability of the audio frame is greater than or equal to the current recognition threshold, the audio frame is determined to be a speech frame; when the recognition probability is less than the current recognition threshold, the audio frame is determined to be a noise frame. It should be noted that there may be one or more recognition thresholds; the number of recognition thresholds and the manner of judging the recognition probability may be set according to the requirements for distinguishing audio types, which is not limited here.
In this embodiment, the recognition threshold can be adjusted dynamically, for example, in real time according to the determination state of the identified speech frames. The adjusted recognition threshold better fits the judgment of the next audio frame, so the audio type obtained with the adjusted recognition threshold is more accurate. The adjusted recognition threshold may be used to recognize the audio type of the next audio frame, i.e., the recognition threshold continues to be adjusted while recognizing the next audio frame, thereby achieving dynamic adjustment of the recognition threshold to adapt to different environments and improve the accuracy of audio type recognition.
It should be noted that, after each speech frame is recognized, it is determined whether the threshold adjustment condition is met. If it is met, the current recognition threshold is adjusted and the next audio frame is recognized based on the adjusted recognition threshold; if it is not met, the current recognition threshold is kept unchanged and the next audio frame is recognized based on the current recognition threshold, which avoids frequent adjustment of the recognition threshold and the interference such frequent adjustment would cause to audio type recognition.
In the embodiment of the present disclosure, the threshold adjustment condition may be a condition set using the characteristics of continuous audio frames of the speech type, and may include, but is not limited to, a condition for judging whether speech has ended; i.e., the threshold adjustment condition may be used to filter out speech segments that have been interrupted or paused. For example, the threshold adjustment condition is that the audio type of the current audio frame is the noise type and the audio type of the previous audio frame is the speech type.
It can be understood that, in a piece of audio, if the audio type of the current audio frame is the noise type and that of the previous audio frame is the speech type, there may be a speech interruption in the audio, i.e., the speaking has paused or ended. The threshold adjustment condition is used to judge whether a speech-type segment has ended. When a speech-type segment ends, the segment is verified, i.e., an overall verification of multiple audio frames is performed to consider whether the audio as a whole is speech, obtaining the determination state of the identified audio type so as to measure whether the current recognition threshold is set appropriately. This judgment method is more reliable and avoids misjudging the audio type based on a single audio frame.
The identified continuous audio frames may be a continuous audio segment whose audio type recognition has been completed, i.e., multiple audio frames continuously recognized as the speech type. The identified continuous audio frames may be the nearest ended speech segment before the current audio frame, i.e., the last of the identified continuous audio frames is the frame immediately preceding the current audio frame. The characteristic information of the identified continuous audio frames refers to reference information used to judge whether the identified audio type is correct, and may be obtained by calculating or collecting statistics on the basic information of the identified continuous audio frames. The identified audio type refers to the overall type of the continuous audio frames, i.e., the audio type of an identified audio segment rather than of a single audio frame. The determination state refers to whether the judgment of the identified audio type was right or wrong.
On the basis of the above embodiments, the characteristic information includes one or more of the following: the length of the continuous speech frames, the recognition probability, the pitch frequency, and the energy value.
The length of the continuous speech frames refers to their total time length, which can be obtained statistically, i.e., the sum of the durations of the multiple audio frames in the continuous speech frames. The recognition probability refers to the probability that the continuous speech frames are of the speech type, for example, the mean of the probabilities that the multiple speech frames in the segment are of the speech type. The pitch frequency refers to the vibration frequency of the vocal cords; for example, the pitch frequency of the continuous speech frames may be the mean pitch frequency of the multiple speech frames. The energy value of the continuous speech frames may be the sum or the mean of the energy values of the multiple speech frames. It should be noted that all of the above characteristic information can be obtained through statistics, calculation, and the like, which will not be repeated here.
For example, a valid speech segment (i.e., continuous audio frames of the speech type) is characterized by a longer time length, a higher recognition probability, a higher pitch frequency, or a larger energy value. Duration, probability, frequency, or energy discrimination thresholds may be set to judge the characteristic information of the continuous audio frames, so as to determine whether the judgment of their audio type was correct and obtain the determination state of the identified audio type. For example, if the characteristic information is greater than the corresponding set threshold, the determination state is correct; if the characteristic information is less than or equal to the corresponding set threshold, the determination state is error.
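The per-feature threshold check described above can be sketched as follows. The dictionary keys ("duration_ms", "mean_prob", etc.) and the all-features-must-pass rule are assumptions for illustration; the disclosure also allows per-feature combinations:

```python
def judge_segment(features, thresholds):
    """Judge whether a segment identified as speech really was speech (sketch).

    `features` and `thresholds` are dicts keyed by assumed names such as
    "duration_ms", "mean_prob", "pitch_hz", "energy". The segment counts as
    correctly identified only if every feature exceeds its threshold.
    """
    correct = all(features[k] > thresholds[k] for k in thresholds)
    return "correct" if correct else "error"
```

A short, low-probability segment thus yields the error state, flagging noise that was accepted as speech.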
在本公开实施例的一些示例实现方式中,所述基于已识别的连续音频帧的特征信息判定已识别音频类型的判定状态,包括:确定当前音频帧之前的连续语音帧的特征信息,基于特征信息和所述特征信息的判断阈值进行比对;基于比对结果确定所述已识别音频类型的判定状态。
例如,当前音频帧之前的连续语音帧的特征信息可以通过计算、统计等方法获取,可以包括一个或多个,即当前音频帧之前的连续语音帧可以具有一个或多个特征信息,可以理解的是,当特征信息为多个时,判定状态的评判参数更加丰富,可以提高判定状态的准确性。特征信息的判断阈值可以根据经验设定,可以包括一个或多个,即特征信息可以具有一个或多个判断阈值。在特征信息以及对应的判断阈值确定后,将特征信息与特征信息的判断阈值进行一一比对,可以得到比对结果,然后根据比对结果确定已识别音频类型的判定状态,其中,比对结果与判定状态存在着映射关系,例如,若多个特征信息都大于特征信息的判断阈值,则已识别音频类型的判定状态为正确;若多个特征信息中的至少一个特征信息小于或等于特征信息的判断阈值,则已识别音频类型的判定状态为错误。在本实施例中,通过阈值比对进行状态判定,该方法简单高效,可以快速对已识别的连续语音帧的音频类型进行验证。
在一些示例实施例中,确定连续语音帧的长度和识别概率均值,基于时长阈值对连续语音帧的长度进行判断,基于识别概率阈值对识别概率均值进行判断,在连续语音帧的长度大于时长阈值,以及识别概率均值大于识别概率阈值的情况下,确定已识别音频类型的判定状态为正确,在连续语音帧的长度小于或等于时长阈值,和/或,识别概率均值小于或等于识别概率阈值的情况下,确定已识别音频类型的判定状态为错误。
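上述通过连续语音帧长度与识别概率均值验证判定状态的逻辑,可以写成如下 Python 草图(阈值取值与数据结构均为假设,仅作说明):

```python
def verify_segment(frames, dur_thresh=0.3, prob_thresh=0.6):
    """frames 为已识别的连续语音帧列表,每个元素为(时长/秒, 识别概率)。
    连续语音帧的总长度与识别概率均值均大于对应阈值时,
    判定状态为正确(True),否则为错误(False)。"""
    total_dur = sum(d for d, _ in frames)
    mean_prob = sum(p for _, p in frames) / len(frames)
    return total_dur > dur_thresh and mean_prob > prob_thresh
```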
在本公开实施例的一些示例实现方式中,所述判定状态包括错误状态和正确状态,根据所述判定状态调节所述当前识别阈值,其中,当前识别阈值的调节方式包括增大和减小,例如,在所述判定状态为错误状态的情况下,增大所述当前识别阈值;在所述判定状态为正确状态的情况下,减小所述当前识别阈值。
其中,错误状态表示已识别音频类型的判定错误,表明将时间长度较短、识别概率较小、基音频率较低或者能量值较小等情况的噪音认定为语音,因此需要增大当前识别阈值,防止后续将该类噪音认定为语音;正确状态表示已识别音频类型的判定正确,在判定状态为正确状态的情况下,可以减小当前识别阈值,放宽对识别概率的判别标准。需要说明的是,若增大当前识别阈值到上限阈值,或者减小当前识别阈值到下限阈值,则当前识别阈值不继续增大或减小,以防止过度调节,导致音频识别正确率下降的情况发生。
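阈值的增大、减小以及上下限保护可以用如下 Python 草图示意(步长与上下限数值均为假设):

```python
def adjust_threshold(current, correct, step=0.05, lower=0.3, upper=0.9):
    """判定状态为正确(correct=True)时减小当前识别阈值,
    为错误时增大当前识别阈值,并限制在 [lower, upper] 内,防止过度调节。"""
    new = current - step if correct else current + step
    return min(max(new, lower), upper)
```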
本公开实施例的技术方案,通过获取待处理的音频帧,表明本方案为分帧处理音频数据,以便基于当前识别阈值确定音频帧的音频类型,实现对音频帧类型的初步判断;在当前音频帧满足阈值调节条件的情况下,基于已识别的连续音频帧的特征信息判定已识别音频类型的判定状态,可以实现对已识别的连续音频帧音频类型的再次验证;根据再次验证的判定状态调节当前识别阈值,调节后的识别阈值可以用于对下一音频帧进行音频类型的识别,即在识别下一音频帧的过程中继续调节识别阈值,从而实现识别阈值的动态调节,进而根据动态调节的识别阈值进行音频类型识别,可以提高音频类型识别的准确率。
参考图2,图2为本公开实施例提供的音频处理方法流程示意图,本实施例的方法与上述实施例中提供的音频处理方法中多个示例方案可以结合。本实施例提供的音频处理方法进行了细化。例如,所述基于当前识别阈值确定所述音频帧的音频类型包括:提取所述音频帧的音频特征,将所述音频特征输入至音频识别模型中,得到所述音频帧的识别概率;基于所述当前识别阈值和所述识别概率确定所述音频帧的音频类型。
如图2,本实施例的方法包括:
S210、获取待处理的音频帧。
S220、提取所述音频帧的音频特征,将所述音频特征输入至音频识别模型中,得到所述音频帧的识别概率。
S230、基于所述当前识别阈值和所述识别概率确定所述音频帧的音频类型。
S240、在当前音频帧满足阈值调节条件的情况下,基于已识别的连续音频帧的特征信息判定已识别音频类型的判定状态。
S250、根据所述判定状态调节所述当前识别阈值,其中,调节后的识别阈值用于对下一音频帧进行音频类型的识别。
其中,音频特征可以用于确定音频帧的识别概率,该音频特征可以通过一系列的计算和变换得到。音频特征提取过程可以包括但不限于加窗、快速傅里叶变换、梅尔谱转换和归一化等。音频识别模型可以是训练完成的深度学习模型,例如,包括但不限于循环神经网络模型、长短期记忆循环神经网络等。音频帧的识别概率可以用于判断该音频帧是否为语音。
示例性的,以一个具体应用场景为例,音频帧的时长可以为10ms,采样率为16kHz,一个音频帧输入为x(n),其中,n=0,1,…,159,即160个采样点的数据。在进行加窗时,将信号x(n)乘以窗函数w(n),得到加窗后的信号x(n)*w(n)。其中,窗函数可以为汉宁窗等函数。在进行快速傅里叶变换时,对加窗后的信号进行快速傅里叶变换,得到信号频谱X(m)=FFT(x(n)*w(n))。在计算梅尔谱时,将频谱划分成若干个分段,基于梅尔滤波函数对每个分段能量求和,得到梅尔谱,并对梅尔谱进行归一化操作,得到音频帧的音频特征,将该音频特征输入至音频识别模型,可以得到该音频帧的识别概率。
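上述加窗、快速傅里叶变换、分段能量求和与归一化的流程,可以用如下 Python 草图示意。其中分段采用等宽划分以简化示例,实际的梅尔谱应按梅尔刻度设置滤波器组:

```python
import numpy as np

def frame_features(x, n_bands=40):
    """对单帧信号提取简化的谱带能量特征:
    汉宁窗 -> FFT -> 频谱分段能量求和 -> 对数与归一化。"""
    w = np.hanning(len(x))                  # 加窗
    spec = np.abs(np.fft.rfft(x * w)) ** 2  # 功率谱
    bands = np.array_split(spec, n_bands)   # 频谱分段(等宽示意)
    energy = np.array([b.sum() for b in bands])
    log_e = np.log(energy + 1e-10)          # 对数能量
    return (log_e - log_e.mean()) / (log_e.std() + 1e-10)  # 归一化

feat = frame_features(np.sin(np.arange(160) * 0.1))  # 160 个采样点的一帧
```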
例如,若识别概率大于或等于当前识别阈值,则音频帧的音频类型为语音类型,若识别概率小于当前识别阈值,则音频帧的音频类型为噪音类型。
在上述实施例的基础上,所述音频识别模型的训练方法包括:获取无噪声音频,对所述无噪声音频中的音频段设置标签;获取噪声信息,将所述噪声信息叠加至所述无噪声音频中,形成样本音频,其中,所述噪声信息包括稳态噪声、瞬态噪声和啸叫噪声中的至少一项;基于所述样本音频对待训练的音频识别模型进行迭代训练,直到得到训练好的音频识别模型。
其中,啸叫噪声为一种回授音,例如,在通话场景中,如果发送端与接收端设备处于同一物理空间,容易产生啸叫现象。传统的语音活性检测算法难以识别啸叫,因此不具备啸叫抑制的能力。本实施例通过将啸叫噪声叠加至无噪声音频中,并将该叠加了啸叫噪声的音频作为样本音频,对音频识别模型进行训练,使得该音频识别模型具备将啸叫判定为噪声的能力,从而该音频处理方法具有啸叫抑制的能力。例如,通过在无噪声音频中加入稳态噪声、瞬态噪声和啸叫噪声等多种噪声形成样本音频,对音频识别模型进行训练,使得该音频识别模型具备识别稳态噪声、瞬态噪声和啸叫噪声等多种噪声的能力,提高了音频识别模型的鲁棒性和适用性。
示例性的,无噪声音频可以包括干净语音和空白音频,可以将干净语音中能量大于预设阈值的音频帧标注为1,其它音频帧标注为0;例如,获取噪声信息,可以将噪声信息叠加至无噪声音频中,形成样本音频,样本音频的标签与无噪声音频的标签一致,即标签不发生变化。
在对音频识别模型的迭代训练过程中,可以将focal loss作为损失函数,根据样本音频对待训练的音频识别模型进行迭代训练,直到得到训练好的音频识别模型。
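focal loss 的一个二分类示意实现如下(gamma 取值为假设,此处用 NumPy 直接计算,实际训练中通常使用深度学习框架内置的实现):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """二分类 focal loss:p 为模型预测的语音概率,y 为标签(0/1)。
    (1 - p_t)^gamma 对易分样本的损失进行衰减,使训练聚焦于难分样本。"""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    p_t = np.where(y == 1, p, 1 - p)
    return float(-np.mean((1 - p_t) ** gamma * np.log(p_t + 1e-10)))
```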
在上述实施例的基础上,所述方法还包括:调节所述样本音频中的信噪比;和/或,基于预设滤波器对所述样本音频进行滤波处理。
例如,通过调节样本音频中的信噪比,或者根据预设滤波器对样本音频进行不同程度的滤波处理,可以增加样本音频的数量,提高样本音频的多样化,使得训练的音频识别模型具有更强的泛化能力。其中,样本音频中的信噪比可以以随机或者固定设定的方式调整,预设滤波器可以包括但不限于高通滤波器、低通滤波器等。
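按目标信噪比将噪声叠加到无噪声音频上的做法,可以用如下 Python 草图示意(函数名与信噪比取值均为假设):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """按目标信噪比(dB)缩放噪声后叠加到无噪声音频上,形成样本音频。
    训练时随机选取 snr_db 可扩充样本数量、提高样本多样性。"""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```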
本公开实施例的技术方案,通过将啸叫噪声叠加至无噪声音频中,并将该叠加了啸叫噪声的音频作为样本音频,对音频识别模型进行训练,使得该音频识别模型具备将啸叫判定为噪声的能力,从而该音频处理方法具有啸叫抑制的能力。
参考图3,图3为本公开实施例提供的音频处理方法流程示意图,本实施例的方法与上述实施例中提供的音频处理方法中多个示例方案可以结合。本实施例提供的音频处理方法进行了细化。例如,在获取待处理的音频帧之后,将所述音频帧添加到缓存区内,其中,缓存区内用于存储未输出的多个音频帧,当前音频帧位于所述缓存区的最后一帧,所述缓存区中的第一帧为待输出帧。如图3,本实施例的方法包括:
S310、获取待处理的音频帧。
S320、将所述音频帧添加到缓存区内。
S330、基于当前识别阈值确定所述音频帧的音频类型。
S340、若当前音频帧的音频类型为语音类型,则将所述缓存区内的多个音频帧的音频类型设置为语音类型。
S350、若当前音频帧的音频类型为噪声类型,则将所述缓存区内的最后一音频帧的音频类型设置为噪声类型。
S360、在当前音频帧满足阈值调节条件的情况下,基于已识别的连续音频帧的特征信息判定已识别音频类型的判定状态。
S370、根据所述判定状态调节所述当前识别阈值,其中,调节后的识别阈值用于对下一音频帧进行音频类型的识别。
其中,缓存区内可以用于存储未输出的多个音频帧,多个音频帧在缓存区内基于时间戳进行排列。
需要说明的是,本公开实施例为将音频帧实时添加到缓存区内的方案,即在对当前音频帧进行音频类型判别的同时,将当前音频帧添加到缓存区内。可以将当前音频帧存储在上一已识别音频帧之后,即新添加的音频帧永远位于缓存区的最后一帧,并且将缓存区中的第一帧设置为待输出帧,这样可以保证音频按照添加顺序输出,即按照时序进行输出,可以避免语音播放错乱的情况发生。
由于在音频帧的不断识别过程中,动态调节识别阈值,以提高识别阈值的准确性,相应的,在音频初始阶段的多个音频帧的识别过程中,识别阈值处于初始调节阶段,为了避免音频初始阶段的多个音频帧的识别错误的情况,若当前音频帧的音频类型为语音类型,无法确定当前音频帧是否为初始语音帧,则将缓存区内当前缓存的音频帧的音频类型设置为语音类型,避免在音频初始识别阶段将语音帧误识别为噪声帧的情况。若当前音频帧的音频类型为噪声类型,该噪声帧对之前的音频帧不存在影响,则将当前音频帧在缓存区内对应的音频帧的音频类型设置为噪声类型,即将缓存区内的最后一音频帧的音频类型设置为噪声类型。本公开实施例中,通过设置缓存区,对音频输出设置一定的延迟,通过后续音频帧的音频类型更新前续音频帧的音频类型,避免识别阈值调节过程中,对语音帧误识别的情况,保证语音帧的正常输出。其中,本实施例中的音频可以是通话音频、直播音频等的实时输出音频。
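缓存与类型回写的逻辑可以用如下 Python 草图示意(数据结构为假设,仅演示当前帧为语音时回写缓存区内全部帧类型的策略):

```python
from collections import deque

def push_frame(buffer, frame, frame_type):
    """将当前音频帧追加到缓存区末尾:若当前帧判为语音类型,
    则将缓存区内全部音频帧的类型改写为语音类型,
    避免音频初始阶段将语音帧误识别为噪声帧;
    若判为噪声类型,则仅标记当前帧(即缓存区最后一帧)。"""
    buffer.append({"data": frame, "type": frame_type})
    if frame_type == "speech":
        for f in buffer:
            f["type"] = "speech"
    return buffer
```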
在本公开实施例的一些示例实现方式中,在根据所述判定状态调节所述当前识别阈值之后,所述方法还包括:确定调节后的识别阈值所在的当前阈值范围,根据所述当前阈值范围确定缓存区长度。
其中,当前阈值范围可以根据经验进行设定,该当前阈值范围可以与调节后的识别阈值进行比较,由于识别阈值为动态变化值,比较结果也为动态变化值,缓存区长度可以根据比较结果进行动态调整。缓存区长度指的是可以缓存音频帧的时间长度。需要说明的是,对缓存区长度的调整,实际是对音频输出时延的调整。在本公开实施例中,通过动态调整缓存区长度,可以避免语音开头的语音误判为噪声的情况发生。可以理解的是,通过调整缓存区长度,在缓存内缓存不同数量的音频帧,可实现不同数量的音频帧的音频类型的二次校正,从而提高语音识别的正确率。
示例性的,在一些实施例中,当前阈值范围可以包括第一阈值和第二阈值,第一阈值大于第二阈值,若调节后的识别阈值大于第一阈值,表明当前对音频帧的识别错误率较大,则将缓存区长度设置为第一时间长度;若调节后的识别阈值小于第二阈值,表明当前对音频帧的识别错误率较小,则将缓存区长度设置为第二时间长度,其中,第一时间长度大于第二时间长度;其它情况下,缓存区长度可以保持不变。
在一些实施例中,若调节后的识别阈值小于第一阈值,并大于第二阈值,则将缓存区长度可以设置为第三时间长度,第三时间长度可以大于第二时间长度,并小于第一时间长度。在一些实施例中,当前阈值范围可以包括两个以上阈值,通过两个以上阈值调节缓存区长度。
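根据调节后的识别阈值所在范围确定缓存区长度的策略,可以写成如下 Python 草图(各阈值与时间长度数值均为假设):

```python
def buffer_length(threshold, current_len, t1=0.7, t2=0.4,
                  long_len=0.5, short_len=0.1):
    """识别阈值高于第一阈值 t1 说明误判风险较大,取较长缓存(更长时延);
    低于第二阈值 t2 说明误判风险较小,取较短缓存;其它情况保持不变。"""
    if threshold > t1:
        return long_len
    if threshold < t2:
        return short_len
    return current_len
```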
在上述实施例的基础上,所述方法还包括:在所述缓存区中音频帧的音频类型均为噪声类型的情况下,清空所述缓存区,并基于当前的缓存区长度重建一缓存区。
需要说明的是,通过清空缓存区,可以将缓存区中所有噪声清除,从而避免将噪声进行输出的情况发生,并根据当前的缓存区长度可以重建一个缓存区,继续对后续音频帧进行缓存,避免语音和噪声缓存混乱的情况发生。
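清空并重建缓存区的操作可以用如下 Python 草图示意(以帧数表示缓存区长度,数据结构为假设):

```python
from collections import deque

def flush_if_all_noise(buffer, buffer_len_frames):
    """缓存区内音频帧全部为噪声类型时清空缓存区,
    并基于当前的缓存区长度重建一个空缓存区;否则保持原缓存区。"""
    if buffer and all(f["type"] == "noise" for f in buffer):
        return deque(maxlen=buffer_len_frames)
    return buffer
```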
本公开实施例的技术方案,通过将音频帧添加到缓存区内,可以实现音频帧的延迟输出;若当前音频帧的音频类型为语音类型,则将所述缓存区内的多个音频帧的音频类型设置为语音类型;若当前音频帧的音频类型为噪声类型,则将所述缓存区内的最后一音频帧的音频类型设置为噪声类型,实现对语音是否结束的判断;根据当前阈值范围确定缓存区长度,实现动态调整音频输出时延,可以避免语音开头误判的情况发生,提升语音类型的识别准确率。
参考图4,图4为本公开实施例提供的音频处理方法流程示意图,本实施例的方法与上述实施例中提供的音频处理方法中多个示例方案可以结合。本实施例提供的音频处理方法进行了细化。例如,所述方法还包括:基于待输出音频帧的音频类型确定所述待输出音频帧的输出增益;基于所述输出增益对所述待输出音频帧进行处理,得到输出音频帧,并输出。如图4,本实施例的方法包括:
S410、获取待处理的音频帧,基于当前识别阈值确定所述音频帧的音频类型。
S420、在当前音频帧满足阈值调节条件的情况下,基于已识别的连续音频帧的特征信息判定已识别音频类型的判定状态。
S430、根据所述判定状态调节所述当前识别阈值,其中,调节后的识别阈值用于对下一音频帧进行音频类型的识别。
S440、基于待输出音频帧的音频类型确定所述待输出音频帧的输出增益。
S450、基于所述输出增益对所述待输出音频帧进行处理,得到输出音频帧,并输出。
其中,待输出音频帧指的是位于缓存区第一位的音频帧。待输出音频帧的输出增益可以用于调整待输出音频帧的输出状态,以实现降噪。
例如,在一些实施例中,可以预先建立音频类型与输出增益的固定对应关系,并生成固定关系表,将待输出音频帧的音频类型在固定关系表中进行匹配,得到对应的输出增益。在一些实施例中,还可以根据增益函数模型确定待输出音频帧的输出增益,该增益函数模型可以是将当前待输出音频帧的输出增益用于下一待输出音频帧输出增益的计算,实现输出增益的动态调节,从而使待输出音频帧输出更加自然。
在本公开实施例的一些示例实现方式中,所述基于待输出音频帧的音频类型确定所述待输出音频帧的输出增益,包括:在所述待输出音频帧的音频类型为语音类型的情况下,确定待输出音频帧的输出增益为第一预设值;在所述待输出音频帧的音频类型为噪声类型的情况下,确定待输出音频帧的输出增益为第二预设值,其中,所述第一预设值大于所述第二预设值。
示例性的,第一预设值可以为1,第二预设值可以为0,在待输出音频帧的音频类型为语音类型的情况下,可以将待输出音频帧的输出增益设置为1,正常输出音频帧;在待输出音频帧的音频类型为噪声类型的情况下,可以将待输出音频帧的输出增益设置为0,将该音频帧消除,以达到静音降噪的目的。需要说明的是,此处第一预设值和第二预设值仅是举例说明,第一预设值和第二预设值还可以是其它数值,例如0.8和0.2,并非限定。
在本公开实施例的一些示例实现方式中,所述基于待输出音频帧的音频类型确定所述待输出音频帧的输出增益,还包括:在所述待输出音频帧的音频类型为噪声类型,且预设数量帧的已输出音频帧中包括语音类型的音频帧的情况下,基于上一待输出音频帧的输出增益进行平滑处理,得到待输出音频帧的输出增益。
其中,进行平滑处理可以得到逐渐变化的输出增益,与上述设置固定的第一预设值和第二预设值的方法相比,可以防止音频类型反复跳变的情况发生,去除掉音频中毛刺部分,使待输出音频帧输出更加自然。
示例性的,预设数量帧可以为N个,即在待输出音频帧的音频类型为噪声类型,且该待输出音频帧之前已输出N个语音类型的音频帧的情况下,根据上一个已输出音频帧的输出增益确定待输出音频帧的输出增益,例如,根据上一个已输出音频帧的输出增益逐级递减,得到待输出音频帧的输出增益,例如,上一个已输出音频帧的输出增益为0.8,在上一个已输出音频帧的输出增益的基础上减去增益调节值,例如可以是0.2,得到的待输出音频帧的输出增益,即为0.6。需要说明的是,上述输出增益的数值和递减数值仅是举例说明,并非限定。
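输出增益的确定与平滑递减可以用如下 Python 草图示意(第一、第二预设值取 1 和 0,递减步长 0.2 均沿用上文示例数值):

```python
def output_gain(frame_type, prev_gain, after_speech, step=0.2):
    """语音帧输出增益为 1(第一预设值);噪声帧若紧随已输出的语音段
    之后,则在上一输出增益基础上逐级递减以平滑过渡,
    否则直接取 0(第二预设值)以静音降噪。"""
    if frame_type == "speech":
        return 1.0
    if after_speech:
        return max(prev_gain - step, 0.0)
    return 0.0
```

例如,上一输出增益为 0.8 的语音段之后出现噪声帧时,增益逐级递减,平滑去除音频类型反复跳变带来的毛刺。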
例如,上述得到的输出增益可以用于对待输出音频帧进行处理,得到输出音频帧,输出音频帧指的是需要输出播放的音频帧,输出方式可以包括但不限于当前设备直接输出,或者通过有线或无线通信方式传输至其他设备进行输出等。
示例性的,输出增益可以与待输出音频帧对应的参数值进行相乘,实现对待输出音频帧进行降噪处理,例如,若输出增益为1,则待输出音频帧保持不变,若输出增益为0,则待输出音频帧置零,将该噪音帧清除。
本公开实施例的技术方案,根据待输出音频帧的音频类型确定待输出音频帧的输出增益,输出增益可以用于调整待输出音频帧的输出状态,以实现降噪处理,提高输出音频帧的质量。
图5是本公开实施例所提供的一种音频处理装置的结构示意图。如图5所示,所述装置包括:
类型确定模块510,设置为获取待处理的音频帧,基于当前识别阈值确定所述音频帧的音频类型;
状态判定模块520,设置为在当前音频帧满足阈值调节条件的情况下,基于已识别的连续音频帧的特征信息判定已识别音频类型的判定状态;
阈值调节模块530,设置为根据所述判定状态调节所述当前识别阈值,其中,调节后的识别阈值用于对下一音频帧进行音频类型的识别。
在本公开实施例的一些示例实现方式中,所述音频类型包括语音类型和噪声类型;
所述阈值调节条件为当前音频帧的音频类型为噪声类型,前一音频帧的音频类型为语音类型。
在本公开实施例的一些示例实现方式中,状态判定模块520还设置为:
确定当前音频帧之前的连续语音帧的特征信息,基于特征信息和所述特征信息的判断阈值进行比对;
基于比对结果确定所述已识别音频类型的判定状态。
在本公开实施例的一些示例实现方式中,所述特征信息包括如下的一项或多项:所述连续语音帧的长度、识别概率、基音频率和能量值。
在本公开实施例的一些示例实现方式中,所述判定状态包括错误状态和正确状态;阈值调节模块530还设置为:
在所述判定状态为错误状态的情况下,增大所述当前识别阈值;
在所述判定状态为正确状态的情况下,减小所述当前识别阈值。
在本公开实施例的一些示例实现方式中,在获取待处理的音频帧之后,将所述音频帧添加到缓存区内,其中,缓存区内用于存储未输出的多个音频帧,当前音频帧位于所述缓存区的最后一帧,所述缓存区中的第一帧为待输出帧。
在本公开实施例的一些示例实现方式中,所述装置还设置为:
若当前音频帧的音频类型为语音类型,则将所述缓存区内的多个音频帧的音频类型设置为语音类型;
若当前音频帧的音频类型为噪声类型,则将所述缓存区内的最后一音频帧的音频类型设置为噪声类型。
在本公开实施例的一些示例实现方式中,所述装置还包括:
确定调节后的识别阈值所在的当前阈值范围,根据所述当前阈值范围确定缓存区长度。
在本公开实施例的一些示例实现方式中,所述装置还包括:
在所述缓存区中音频帧的音频类型均为噪声类型的情况下,清空所述缓存区,并基于当前的缓存区长度重建一缓存区。
在本公开实施例的一些示例实现方式中,所述装置还包括:
增益确定模块,设置为基于待输出音频帧的音频类型确定所述待输出音频帧的输出增益;
音频输出模块,设置为基于所述输出增益对所述待输出音频帧进行处理,得到输出音频帧,并输出。
在本公开实施例的一些示例实现方式中,所述增益确定模块还设置为:
在所述待输出音频帧的音频类型为语音类型的情况下,确定待输出音频帧的输出增益为第一预设值;
在所述待输出音频帧的音频类型为噪声类型的情况下,确定待输出音频帧的输出增益为第二预设值,其中,所述第一预设值大于所述第二预设值。
在本公开实施例的一些示例实现方式中,所述增益确定模块还设置为:
在所述待输出音频帧的音频类型为噪声类型,且预设数量帧的已输出音频帧中包括语音类型的音频帧的情况下,基于上一待输出音频帧的输出增益进行平滑处理,得到待输出音频帧的输出增益。
在本公开实施例的一些示例实现方式中,所述类型确定模块510还设置为:
提取所述音频帧的音频特征,将所述音频特征输入至音频识别模型中,得到所述音频帧的识别概率;
基于所述当前识别阈值和所述识别概率确定所述音频帧的音频类型。
在本公开实施例的一些示例实现方式中,所述音频识别模型的训练装置包括:
标签设置模块,设置为获取无噪声音频,对所述无噪声音频中的音频段设置标签;
样本制作模块,设置为获取噪声信息,将所述噪声信息叠加至所述无噪声音频中,形成样本音频,其中,所述噪声信息包括稳态噪声、瞬态噪声和啸叫噪声中的至少一项;
模型训练模块,设置为基于所述样本音频对待训练的音频识别模型进行迭代训练,直到得到训练好的音频识别模型。
在本公开实施例的一些示例实现方式中,所述音频识别模型的训练装置还可以设置为:
调节所述样本音频中的信噪比;和/或,
基于预设滤波器对所述样本音频进行滤波处理。
本公开实施例所提供的音频处理装置可执行本公开任意实施例所提供的音频处理方法,具备执行方法相应的功能模块和有益效果。
值得注意的是,上述装置所包括的多个单元和模块只是按照功能逻辑进行划分的,但并不局限于上述的划分,只要能够实现相应的功能即可;另外,多个功能单元的具体名称也只是为了便于相互区分,并不用于限制本公开实施例的保护范围。
下面参考图6,其示出了适于用来实现本公开实施例的电子设备(例如图6中的终端设备或服务器)400的结构示意图。本公开实施例中的终端设备可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、PMP(便携式多媒体播放器)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字TV、台式计算机等等的固定终端。图6示出的电子设备仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
如图6所示,电子设备400可以包括处理装置(例如中央处理器、图形处理器等)401,其可以根据存储在只读存储器(ROM)402中的程序或者从存储装置408加载到随机访问存储器(RAM)403中的程序而执行多种适当的动作和处理。在RAM403中,还存储有电子设备400操作所需的多种程序和数据。处理装置401、ROM 402以及RAM 403通过总线404彼此相连。输入/输出(I/O)接口405也连接至总线404。
通常,以下装置可以连接至I/O接口405:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置406;包括例如液晶显示器(LCD)、扬声器、振动器等的输出装置407;包括例如磁带、硬盘等的存储装置408;以及通信装置409。通信装置409可以允许电子设备400与其他设备进行无线或有线通信以交换数据。虽然图6示出了具有多种装置的电子设备400,但是应理解的是,并不要求实施或具备所有示出的装置。可以替代地实施或具备更多或更少的装置。
根据本公开的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在非暂态计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置409从网络上被下载和安装,或者从存储装置408被安装,或者从ROM402被安装。在该计算机程序被处理装置401执行时,执行本公开实施例的方法中限定的上述功能。
本公开实施例提供的电子设备与上述实施例提供的音频处理方法属于同一构思,未在本实施例中详尽描述的技术细节可参见上述实施例,并且本实施例与上述实施例具有相同的有益效果。
本公开实施例提供了一种计算机存储介质,其上存储有计算机程序,该程序被处理器执行时实现上述实施例所提供的音频处理方法。
需要说明的是,本公开上述的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、RF(射频)等等,或者上述的任意合适的组合。
在一些实施方式中,客户端、服务器可以利用诸如HTTP(HyperText Transfer Protocol,超文本传输协议)之类的任何当前已知或未来研发的网络协议进行通信,并且可以与任意形式或介质的数字数据通信(例如,通信网络)互连。通信网络的示例包括局域网(“LAN”),广域网(“WAN”),网际网(例如,互联网)以及端对端网络(例如,ad hoc端对端网络),以及任何当前已知或未来研发的网络。
上述计算机可读介质可以是上述电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。
上述计算机可读介质承载有一个或者多个程序,当上述一个或者多个程序被该电子设备执行时,使得该电子设备:
获取待处理的音频帧,基于当前识别阈值确定所述音频帧的音频类型;
在当前音频帧满足阈值调节条件的情况下,基于已识别的连续音频帧的特征信息判定已识别音频类型的判定状态;
根据所述判定状态调节所述当前识别阈值,其中,调节后的识别阈值用于对下一音频帧进行音频类型的识别。
可以以一种或多种程序设计语言或其组合来编写用于执行本公开的操作的计算机程序代码,上述程序设计语言包括但不限于面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括局域网(LAN)或广域网(WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
附图中的流程图和框图,图示了按照本公开多种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序帧、或代码的一部分,该模块、程序帧、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
描述于本公开实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现。其中,单元/模块的名称在某种情况下并不构成对该单元本身的限定。
本文中以上描述的功能可以至少部分地由一个或多个硬件逻辑部件来执行。例如,非限制性地,可以使用的示范类型的硬件逻辑部件包括:现场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、片上系统(SOC)、复杂可编程逻辑设备(CPLD)等等。
在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备,或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。
根据本公开的一个或多个实施例,【示例一】提供了一种音频处理方法,包括:
获取待处理的音频帧,基于当前识别阈值确定所述音频帧的音频类型;
在当前音频帧满足阈值调节条件的情况下,基于已识别的连续音频帧的特征信息判定已识别音频类型的判定状态;
根据所述判定状态调节所述当前识别阈值,其中,调节后的识别阈值用于对下一音频帧进行音频类型的识别。
根据本公开的一个或多个实施例,【示例二】提供了一种音频处理方法,其中,
所述音频类型包括语音类型和噪声类型;
所述阈值调节条件为当前音频帧的音频类型为噪声类型,前一音频帧的音频类型为语音类型。
根据本公开的一个或多个实施例,【示例三】提供了一种音频处理方法,其中,
所述基于已识别的连续音频帧的特征信息判定已识别音频类型的判定状态,包括:
确定当前音频帧之前的连续语音帧的特征信息,基于特征信息和所述特征信息的判断阈值进行比对;
基于比对结果确定所述已识别音频类型的判定状态。
根据本公开的一个或多个实施例,【示例四】提供了一种音频处理方法,其中,
所述特征信息包括如下的一项或多项:所述连续语音帧的长度、识别概率、基音频率和能量值。
根据本公开的一个或多个实施例,【示例五】提供了一种音频处理方法,其中,
所述判定状态包括错误状态和正确状态;
所述根据所述判定状态调节所述当前识别阈值,包括:
在所述判定状态为错误状态的情况下,增大所述当前识别阈值;
在所述判定状态为正确状态的情况下,减小所述当前识别阈值。
根据本公开的一个或多个实施例,【示例六】提供了一种音频处理方法,
在获取待处理的音频帧之后,还包括:
将所述音频帧添加到缓存区内,其中,缓存区内用于存储未输出的多个音频帧,当前音频帧位于所述缓存区的最后一帧,所述缓存区中的第一帧为待输出帧。
根据本公开的一个或多个实施例,【示例七】提供了一种音频处理方法,在基于当前识别阈值确定所述音频帧的音频类型之后,所述方法还包括:
若当前音频帧的音频类型为语音类型,则将所述缓存区内的多个音频帧的音频类型设置为语音类型;
若当前音频帧的音频类型为噪声类型,则将所述缓存区内的最后一音频帧的音频类型设置为噪声类型。
根据本公开的一个或多个实施例,【示例八】提供了一种音频处理方法,在根据所述判定状态调节所述当前识别阈值之后,所述方法还包括:
确定调节后的识别阈值所在的当前阈值范围,根据所述当前阈值范围确定缓存区长度。
根据本公开的一个或多个实施例,【示例九】提供了一种音频处理方法,包括:
在所述缓存区中多个音频帧的音频类型均为噪声类型的情况下,清空所述缓存区,并基于当前的缓存区长度重建一缓存区。
根据本公开的一个或多个实施例,【示例十】提供了一种音频处理方法,还包括:
基于待输出音频帧的音频类型确定所述待输出音频帧的输出增益;
基于所述输出增益对所述待输出音频帧进行处理,得到输出音频帧,并输出。
根据本公开的一个或多个实施例,【示例十一】提供了一种音频处理方法,其中,
所述基于待输出音频帧的音频类型确定所述待输出音频帧的输出增益,包括:
在所述待输出音频帧的音频类型为语音类型的情况下,确定待输出音频帧的输出增益为第一预设值;
在所述待输出音频帧的音频类型为噪声类型的情况下,确定待输出音频帧的输出增益为第二预设值,其中,所述第一预设值大于所述第二预设值。
根据本公开的一个或多个实施例,【示例十二】提供了一种音频处理方法,其中,
所述基于待输出音频帧的音频类型确定所述待输出音频帧的输出增益,还包括:
在所述待输出音频帧的音频类型为噪声类型,且预设数量帧的已输出音频帧中包括语音类型的音频帧的情况下,基于上一待输出音频帧的输出增益进行平滑处理,得到待输出音频帧的输出增益。
根据本公开的一个或多个实施例,【示例十三】提供了一种音频处理方法,其中,
所述基于当前识别阈值确定所述音频帧的音频类型,包括:
提取所述音频帧的音频特征,将所述音频特征输入至音频识别模型中,得到所述音频帧的识别概率;
基于所述当前识别阈值和所述识别概率确定所述音频帧的音频类型。
根据本公开的一个或多个实施例,【示例十四】提供了一种音频处理方法,其中,
所述音频识别模型的训练方法,包括:
获取无噪声音频,对所述无噪声音频中的音频段设置标签;
获取噪声信息,将所述噪声信息叠加至所述无噪声音频中,形成样本音频,其中,所述噪声信息包括稳态噪声、瞬态噪声和啸叫噪声中的至少一项;
基于所述样本音频对待训练的音频识别模型进行迭代训练,直到得到训练好的音频识别模型。
根据本公开的一个或多个实施例,【示例十五】提供了一种音频处理方法,还包括以下至少之一:
调节所述样本音频中的信噪比;和
基于预设滤波器对所述样本音频进行滤波处理。
根据本公开的一个或多个实施例,【示例十六】提供了一种音频处理装置,该装置包括:
类型确定模块,设置为获取待处理的音频帧,基于当前识别阈值确定所述音频帧的音频类型;
状态判定模块,设置为在当前音频帧满足阈值调节条件的情况下,基于已识别的连续音频帧的特征信息判定已识别音频类型的判定状态;
阈值调节模块,设置为根据所述判定状态调节所述当前识别阈值,其中,调节后的识别阈值用于对下一音频帧进行音频类型的识别。
此外,虽然采用特定次序描绘了多种操作,但是这不应当理解为要求这些操作以所示出的特定次序或以顺序次序执行来执行。在一定环境下,多任务和并行处理可能是有利的。同样地,虽然在上面论述中包含了若干具体实现细节,但是这些不应当被解释为对本公开的范围的限制。在单独的实施例的上下文中描述的某些特征还可以组合地实现在单个实施例中。相反地,在单个实施例的上下文中描述的多种特征也可以单独地或以任何合适的子组合的方式实现在多个实施例中。

Claims (18)

  1. 一种音频处理方法,包括:
    获取待处理的音频帧,并基于当前识别阈值确定所述音频帧的音频类型;
    响应于确定当前音频帧满足阈值调节条件,基于已识别的连续音频帧的特征信息判定已识别音频类型的判定状态;
    根据所述判定状态调节所述当前识别阈值,其中,调节后的识别阈值用于对下一音频帧进行音频类型的识别。
  2. 根据权利要求1所述的方法,其中,所述音频类型包括语音类型和噪声类型;
    所述阈值调节条件为当前音频帧的音频类型为噪声类型,前一音频帧的音频类型为语音类型。
  3. 根据权利要求2所述的方法,其中,所述基于已识别的连续音频帧的特征信息判定已识别音频类型的判定状态,包括:
    确定所述当前音频帧之前的连续语音帧的特征信息,基于所述特征信息和所述特征信息的判断阈值进行比对;
    基于比对结果确定所述已识别音频类型的判定状态。
  4. 根据权利要求3所述的方法,其中,所述特征信息包括如下的一项或多项:所述连续语音帧的长度、识别概率、基音频率和能量值。
  5. 根据权利要求1所述的方法,其中,所述判定状态包括错误状态和正确状态;
    所述根据所述判定状态调节所述当前识别阈值,包括:
    响应于确定所述判定状态为所述错误状态,增大所述当前识别阈值;
    响应于确定所述判定状态为所述正确状态,减小所述当前识别阈值。
  6. 根据权利要求1所述的方法,在所述获取待处理的音频帧之后,还包括:
    将所述音频帧添加到缓存区内;
    其中,所述缓存区用于存储未输出的多个音频帧,所述当前音频帧位于所述缓存区的最后一帧,所述缓存区中的第一帧为待输出音频帧。
  7. 根据权利要求6所述的方法,在所述基于当前识别阈值确定所述音频帧的音频类型之后,还包括:
    响应于确定所述当前音频帧的音频类型为语音类型,将所述缓存区内的多个音频帧的音频类型设置为所述语音类型;
    响应于确定所述当前音频帧的音频类型为噪声类型,将所述缓存区内的最后一音频帧的音频类型设置为所述噪声类型。
  8. 根据权利要求6所述的方法,在所述根据所述判定状态调节所述当前识别阈值之后,还包括:
    确定调节后的识别阈值所在的当前阈值范围,根据所述当前阈值范围确定缓存区长度。
  9. 根据权利要求6所述的方法,还包括:
    响应于确定所述缓存区中多个音频帧的音频类型分别为噪声类型,清空所述缓存区,并基于当前的缓存区长度重建一缓存区。
  10. 根据权利要求6所述的方法,还包括:
    基于待输出音频帧的音频类型确定所述待输出音频帧的输出增益;
    基于所述输出增益对所述待输出音频帧进行处理,得到输出音频帧,并输出。
  11. 根据权利要求10所述的方法,其中,所述基于待输出音频帧的音频类型确定所述待输出音频帧的输出增益,包括:
    响应于确定所述待输出音频帧的音频类型为语音类型,确定所述待输出音频帧的输出增益为第一预设值;
    响应于确定所述待输出音频帧的音频类型为噪声类型,确定所述待输出音频帧的输出增益为第二预设值,其中,所述第一预设值大于所述第二预设值。
  12. 根据权利要求11所述的方法,其中,所述基于待输出音频帧的音频类型确定所述待输出音频帧的输出增益,还包括:
    响应于确定所述待输出音频帧的音频类型为噪声类型,且预设数量帧的已输出音频帧中包括语音类型的音频帧,基于上一待输出音频帧的输出增益进行平滑处理,得到所述待输出音频帧的输出增益。
  13. 根据权利要求1所述的方法,其中,所述基于当前识别阈值确定所述音频帧的音频类型,包括:
    提取所述音频帧的音频特征,并将所述音频特征输入至音频识别模型中,以得到所述音频帧的识别概率;
    基于所述当前识别阈值和所述识别概率确定所述音频帧的音频类型。
  14. 根据权利要求13所述的方法,其中,所述音频识别模型的训练方法,包括:
    获取无噪声音频,并对所述无噪声音频中的音频段设置标签;
    获取噪声信息,并将所述噪声信息叠加至所述无噪声音频中,以形成样本音频,其中,所述噪声信息包括稳态噪声、瞬态噪声和啸叫噪声中的至少一项;
    基于所述样本音频对待训练的音频识别模型进行迭代训练,直到得到训练好的音频识别模型。
  15. 根据权利要求14所述的方法,还包括以下至少之一:
    调节所述样本音频中的信噪比;以及
    基于预设滤波器对所述样本音频进行滤波处理。
  16. 一种音频处理装置,包括:
    类型确定模块,设置为获取待处理的音频帧,并基于当前识别阈值确定所述音频帧的音频类型;
    状态判定模块,设置为响应于确定当前音频帧满足阈值调节条件,基于已识别的连续音频帧的特征信息判定已识别音频类型的判定状态;
    阈值调节模块,设置为根据所述判定状态调节所述当前识别阈值,其中,调节后的识别阈值用于对下一音频帧进行音频类型的识别。
  17. 一种电子设备,包括:
    一个或多个处理器;
    存储装置,设置为存储一个或多个程序,
    当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如权利要求1-15中任一所述的音频处理方法。
  18. 一种包含计算机可执行指令的存储介质,所述计算机可执行指令在由计算机处理器执行时用于执行如权利要求1-15中任一所述的音频处理方法。
PCT/CN2023/081227 2022-04-08 2023-03-14 一种音频处理方法、装置、存储介质及电子设备 WO2023193573A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210367406.8A CN114743571A (zh) 2022-04-08 2022-04-08 一种音频处理方法、装置、存储介质及电子设备
CN202210367406.8 2022-04-08

Publications (1)

Publication Number Publication Date
WO2023193573A1 true WO2023193573A1 (zh) 2023-10-12

Family

ID=82279274

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/081227 WO2023193573A1 (zh) 2022-04-08 2023-03-14 一种音频处理方法、装置、存储介质及电子设备

Country Status (2)

Country Link
CN (1) CN114743571A (zh)
WO (1) WO2023193573A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743571A (zh) * 2022-04-08 2022-07-12 北京字节跳动网络技术有限公司 一种音频处理方法、装置、存储介质及电子设备

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140207447A1 (en) * 2013-01-24 2014-07-24 Huawei Device Co., Ltd. Voice identification method and apparatus
CN107863099A (zh) * 2017-10-10 2018-03-30 成都启英泰伦科技有限公司 一种新型双麦克风语音检测和增强方法
CN107924687A (zh) * 2015-09-23 2018-04-17 三星电子株式会社 语音识别设备、用户设备的语音识别方法和非暂时性计算机可读记录介质
CN109767792A (zh) * 2019-03-18 2019-05-17 百度国际科技(深圳)有限公司 语音端点检测方法、装置、终端和存储介质
US20190172451A1 (en) * 2017-12-04 2019-06-06 Samsung Electronics Co., Ltd. Electronic apparatus and control method thereof
CN111833869A (zh) * 2020-07-01 2020-10-27 中关村科学城城市大脑股份有限公司 一种应用于城市大脑的语音交互方法及系统
CN112489648A (zh) * 2020-11-25 2021-03-12 广东美的制冷设备有限公司 唤醒处理阈值调整方法、语音家电、存储介质
CN114743571A (zh) * 2022-04-08 2022-07-12 北京字节跳动网络技术有限公司 一种音频处理方法、装置、存储介质及电子设备

Also Published As

Publication number Publication date
CN114743571A (zh) 2022-07-12


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23784137

Country of ref document: EP

Kind code of ref document: A1