WO2021184732A1 - Neural network-based audio packet loss repair method, device, and system - Google Patents
- Publication number: WO2021184732A1 (PCT application PCT/CN2020/119603)
- Authority: WIPO (PCT)
- Prior art keywords: frame, frames, observation window, repair, audio data
- Prior art date
Classifications
- G10L19/005—Correction of errors induced by the transmission channel, if related to the coding algorithm
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/007—Changing voice quality, e.g. pitch or formants, characterised by the process used
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique, using neural networks
- G10L25/45—Speech or voice analysis techniques characterised by the type of analysis window
- G10L25/48—Speech or voice analysis techniques specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques specially adapted for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques for measuring the quality of voice signals

(All codes are within G10L: speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding.)
Definitions
- The invention relates to the field of audio data processing, and in particular to a method, device, and system for repairing audio packet loss based on neural networks.
- With the development of Bluetooth wireless transmission of audio-visual data, Bluetooth products such as Bluetooth speakers, Bluetooth headsets, Bluetooth mice, Bluetooth keyboards, and Bluetooth remote controls have appeared in people's lives.
- Bluetooth speakers and Bluetooth headsets mainly provide functions such as Bluetooth calls and Bluetooth music playback; when transmitting this audio, the host (mobile phone, computer, etc.) sends the audio data, one data packet after another, to the Bluetooth playback device for playback.
- In practice, the wireless transmission is often interfered with by other wireless signals, or data packets are lost in transit because of obstacles or distance. If this data is not repaired, the playback end will suffer discontinuity or noise.
- The loss of signal directly degrades the call experience and, in severe cases, even hinders communication. It is therefore necessary to repair the Bluetooth packet-loss data.
- Mute processing: replace the missing data with mute data to avoid harsh noise. This method is simple but of limited benefit; it only avoids noise, and the lost signal is not restored.
- Waveform replacement: compute a related signal from the pitch period of the speech signal (or with other related algorithms) and substitute a similar signal for the lost one.
- The principle rests on the short-term stationarity of speech, which allows a lost segment to be replaced with a similar waveform.
- A real speech signal, however, also contains vowel/consonant switching and continuous changes in speaking rate and intonation, so such a changing signal is difficult to restore by substituting a similar one.
- Moreover, the energy of the speech signal changes constantly, so more additional processing is needed for a good recovery; when signal loss is severe, repeatedly reusing similar signals also produces machine-like sounds.
- The main purpose of the present invention is to provide a neural network-based audio packet loss repair method, device, and system that repair lost audio data packets and improve repair accuracy.
- an embodiment of the present invention discloses a method for repairing audio data packet loss based on a neural network, including:
- Step S101 Obtain an audio data packet.
- the audio data packet includes several frames of audio data.
- the several audio data frames contain at least multiple voice signal frames, and the voice signal frames are audio data frames containing voice signals;
- Step S103: when there is a frame-loss situation in which a voice signal frame is lost from the several audio data frames, determine the position of the lost voice signal frame in the several audio data frames to obtain the frame-loss position information; Step S105: select, according to the frame-loss position information, a neural network model for repairing the frame loss, wherein the neural network model includes a first repair model and a second repair model, the first repair model being used to repair the voice signal frame at the first preset position and the second repair model being used to repair the voice signal frame at the second preset position; and Step S107: send the several audio data frames into the selected neural network model to repair the lost voice signal frame.
- The several audio data frames also include non-speech signal frames. Between step S101 and step S103, the method further includes: Step S102, distinguishing the voice signal frames from the non-voice signal frames in the several audio data frames according to a preset algorithm. In step S103, the frame-loss position information is the position of the lost speech signal frame within its speech signal frame group.
- The speech signal frame group includes N speech signal frames, where N is an integer greater than or equal to 5.
- Step S103 includes: Step S1031, sliding an observation window over the multiple voice signal frames in sequence so as to group them into groups of N frames; Step S1032, determining, for the voice signal frames in the observation window, whether there is frame loss; and Step S1033, after frame loss occurs among the voice signal frames in the observation window, determining the position of the lost voice signal frame within the observation window to obtain the frame-loss position information. Step S107 includes: repairing the lost speech signal frame within the observation window.
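Steps S1031 to S1033 above can be sketched in a few lines of Python (an illustration only, not the patent's implementation; lost frames are modeled as `None` entries and the function name is invented):

```python
def find_lost_positions(frames, n=5):
    """Slide an observation window of length n over the frame
    sequence one frame at a time and report, for each window,
    the in-window indices of lost frames (modeled as None)."""
    results = []
    for start in range(len(frames) - n + 1):
        window = frames[start:start + n]
        lost = [i for i, f in enumerate(window) if f is None]
        results.append((start, lost))
    return results
```

For example, with a 6-frame stream whose third frame is lost, the two 5-frame windows report the loss at in-window index 2 and then index 1.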
- the method further includes: updating the restored voice signal frame to the corresponding frame at the drop-out position in the observation window.
- Step S1031 slides the observation window in an iterative-replacement manner, so that the first K speech signal frames in the observation window slide out of the window and the next K speech signal frames outside the window slide in.
- K is an integer greater than or equal to 1.
- K is 1.
- Step S1033 includes: determining that the position of the lost speech signal frame in the observation window does not include the last frame of the window, and taking it as the first preset position. Step S105 includes: sending the speech signal frames in the observation window into the first repair model to repair the lost speech signal frame, wherein the input data of the first repair model includes the last frame in the observation window.
- Step S1033 includes: determining that at least 2 speech signal frames are lost, and that the frame-loss positions are the last frame of the observation window together with frames at other positions in the window, taken as the second preset position.
- Step S105 includes: sending the speech signal frames that precede the other-position frames in the observation window into the second repair model to repair the speech signal frames at those other positions, wherein the input data of the second repair model consists of the speech signal frames before the other-position frames in the observation window and does not include the last frame of the window.
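The two cases in the claims above amount to a position test on the in-window loss indices. A hypothetical sketch (the models themselves are stubbed out as strings; `n` is the window length):

```python
def select_model(lost_positions, n):
    """Pick the repair model per the claimed position rules:
    the first model when the window's last frame is intact,
    the second model when the last frame is lost together
    with at least one other frame."""
    if not lost_positions:
        return None              # nothing to repair
    if (n - 1) not in lost_positions:
        return "first"           # input may include the last frame
    if len(lost_positions) >= 2:
        return "second"          # input excludes the last frame
    return "unspecified"         # only the last frame lost: not covered by these two claims
```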
- The method further includes: performing fade-out envelope processing on the audio signal of the non-speech signal frames.
- An embodiment of the present invention discloses a neural network-based audio data packet loss repair device, including: a data acquisition module for acquiring audio data packets, where an audio data packet includes several audio data frames, the several audio data frames contain at least multiple speech signal frames, and a speech signal frame is an audio data frame containing a speech signal; a position determination module for determining, after a voice signal frame is lost from the several audio data frames, the position of the lost voice signal frame in the several audio data frames so as to obtain the frame-loss position information, the position being the first preset position or the second preset position; and a model selection module for selecting, according to the frame-loss position information, the neural network model used to repair the frame loss.
- the neural network model includes a first repair model and a second repair model.
- the first repair model is used to repair the speech signal frame at the first preset position
- the second repair model is used to repair the voice signal frame at the second preset position
- the data repair module is used to send several audio data frames into the selected neural network model to repair the lost voice signal frame.
- several audio data frames also include non-voice signal frames; it also includes: a signal distinguishing module for distinguishing between voice signal frames and non-voice signal frames in several audio data frames according to a preset algorithm;
- the position information is the position of the lost speech signal frame in the speech signal frame group.
- the speech signal group includes N speech signal frames, where N is an integer greater than or equal to 5.
- The position determination module includes: a sliding-window grouping unit for sliding an observation window over the multiple voice signal frames in sequence to group them into groups of N frames; a frame-loss determination unit for determining whether frame loss has occurred among the speech signal frames in the observation window; and a position acquisition unit for determining, after frame loss occurs among the speech signal frames in the observation window, the position of the lost speech signal frame within the observation window to obtain the frame-loss position information.
- a data update module configured to update the restored voice signal frame to the corresponding frame at the dropout position in the observation window.
- The sliding-window grouping unit slides the observation window in an iterative-replacement manner, so that the first K speech signal frames in the observation window slide out of the window and the next K speech signal frames outside the window slide in.
- K is an integer greater than or equal to 1.
- K is 1.
- The position acquisition unit is configured to determine that the position of the lost speech signal frame in the observation window does not include the last frame of the window, taking it as the first preset position. The model selection module is configured to send the speech signal frames in the observation window into the first repair model to repair the lost speech signal frame, wherein the input data of the first repair model includes the last frame in the observation window.
- The position acquisition unit is configured to determine that at least 2 speech signal frames are lost and that the frame-loss positions are the last frame of the observation window together with frames at other positions in the window, taken as the second preset position.
- The model selection module is configured to send the speech signal frames that precede the other-position frames in the observation window into the second repair model to repair the speech signal frames at those other positions, wherein the input data of the second repair model consists of the speech signal frames before the other-position frames in the observation window and does not include the last frame of the window.
- a fade-out envelope module for performing fade-out envelope processing on the audio signal of the non-speech signal frame.
- the neural network model includes a first repair model and a second repair model
- The neural network training device includes: a sample acquisition module for acquiring voice signal sample data to be learned, the voice signal sample data being grouped with N voice signal frames per group, where N is an integer greater than or equal to 5, and a speech signal frame is an audio data frame containing a speech signal; a first elimination module for removing the speech signal frame at the first preset position from each group of N speech signal frames to obtain a first input sample; a second elimination module for removing the speech signal frame at the second preset position from each group of N speech signal frames to obtain a second input sample, the first preset position and the second preset position being different; and a training module for feeding the first input sample and the second input sample into the first repair model and the second repair model, respectively, to train them, the first repair model being used to repair the speech signal frame at the first preset position and the second repair model being used to repair the speech signal frame at the second preset position.
- the training module for separately training the first repair model and the second repair model includes: training the first repair model and the second repair model through repeated iterations.
- The training module includes a first training unit and/or a second training unit. The first training unit iteratively trains the first repair model by: obtaining the i-th speech signal frame after the i-th iteration, where i is a positive integer; determining whether the first error between the i-th speech signal frame and the eliminated speech signal frame at the first preset position is within a preset range; and, if the first error is within the preset range, outputting the model parameters obtained in the i-th iteration to solidify the first repair model. The second training unit iteratively trains the second repair model by: obtaining the j-th speech signal frame after the j-th iteration, where j is a positive integer; determining whether the second error between the j-th speech signal frame and the eliminated speech signal frame at the second preset position is within a preset range; and, if the second error is within the preset range, outputting the model parameters obtained in the j-th iteration to solidify the second repair model.
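The iterate-until-the-error-is-in-range loop that both training units follow can be caricatured with a one-parameter toy model (the threshold, learning rate, and scalar "frame" are all illustrative assumptions, not the patent's values):

```python
def train_until_converged(target, tol=1e-3, lr=0.1, max_iters=1000):
    """Toy training loop: after each iteration produce an output
    frame, compare it against the eliminated (ground-truth) frame,
    and stop -- 'solidifying' the parameters -- once the error
    falls within the preset range."""
    w = 0.0                      # the model's single parameter
    for i in range(1, max_iters + 1):
        out = w                  # i-th iteration's repaired frame
        err = target - out       # error vs. the eliminated frame
        if abs(err) <= tol:
            return w, i          # solidified parameters, iteration count
        w += lr * err            # gradient-style update
    return w, max_iters
```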
- The sample acquisition module groups the voice signal sample data into groups of N voice signal frames through an observation window of preset length; the first elimination module, the second elimination module, and the training module operate within the observation window.
- The first preset position is within the observation window and does not include the last frame of the window; the training module trains the first repair model using the speech signal frames before and after the first preset position within the observation window.
- The first preset position is not the first frame within the observation window.
- The second preset position includes the last frame of the observation window; the training module trains the second repair model using the speech signal frames before the second preset position within the observation window.
- an embodiment of the present invention discloses an audio device, including a processor, configured to implement any method disclosed in the first aspect.
- the audio device is a headset, a mobile terminal or a smart wearable device with audio playback function.
- An embodiment of the present invention discloses an audio signal interaction system, including a first device and a second device; the first device sends audio data packets to the second device; and the second device is used to implement any method disclosed in the first aspect.
- the first device is a mobile terminal
- the second device is a headset
- an embodiment of the present invention discloses a computer-readable storage medium on which a computer program is stored, and the computer program stored in the storage medium is used to be executed to implement any method disclosed in the first aspect.
- an embodiment of the present invention discloses an audio device chip with an integrated circuit thereon, and the integrated circuit is designed to implement any method disclosed in the first aspect.
- The acquired audio data packet includes several audio data frames; when a voice signal frame is lost from the several audio data frames, the position of the lost voice signal frame in the several audio data frames is determined to obtain the frame-loss position information;
- the several audio data frames are then sent into the selected neural network model to repair the lost speech signal frame. Since the first repair model and the second repair model are suited to repairing speech signal frames at different frame-loss positions, once the position of the lost speech signal frame is determined, the corresponding repair model can be selected according to that position.
- The solution of the embodiment of the present invention can thus select the repair model adaptively, making the repair of lost speech signal frames more targeted and thereby improving repair accuracy.
- the voice signal frame and the non-voice signal frame in several audio data frames are distinguished according to a preset algorithm.
- the position information of the dropped frame is the position of the lost speech signal frame in the speech signal frame group. In this way, it is possible to repair the packet loss data of the voice signal frame, reduce the interference caused by the non-voice signal frame, and improve the repair accuracy.
- The restored speech signal frame is updated into the corresponding frame-loss position in the observation window, so that the input data of the repair model has no gaps; this enriches the reference data fed to the repair model and improves repair accuracy.
- The observation window is slid in an iterative-replacement manner, so that the first K speech signal frames in the window slide out and the next K speech signal frames outside the window slide in.
- This guarantees the amount of input data for the neural network; meanwhile, because the first K speech signal frames slide out of the window, the extra delay of the system's output data is reduced, that is, speech signal frames can be output promptly.
- The audio signal of the non-speech signal frames can be given fade-out envelope processing, which improves data processing efficiency; and since non-speech signal frames are not sent into the repair model, the burden of the neural network in repairing and training on speech signal frames is reduced and repair accuracy is improved.
- FIG. 1 is a flowchart of a method for repairing audio data packet loss based on neural network disclosed in this embodiment
- FIG. 2 is a flowchart of a method for determining the position of dropped frames through an observation window disclosed in this embodiment
- FIG. 3 is a schematic diagram of an example of a sliding window grouping repair disclosed in this embodiment
- FIG. 4 is a flowchart of a neural network training method for audio packet loss repair disclosed in this embodiment
- FIG. 5A and FIG. 5B are schematic diagrams of examples of excluding the preset positions disclosed in this embodiment, wherein FIG. 5A is a schematic diagram of an example of excluding the first preset position and FIG. 5B is a schematic diagram of an example of excluding the second preset position;
- FIG. 6 is a schematic structural diagram of a device for repairing audio data packet loss based on a neural network disclosed in this embodiment.
- This embodiment discloses a neural network-based audio data packet loss repair method; please refer to FIG. 1, the flowchart disclosed in this embodiment.
- Step S101 Acquire audio data packets.
- The so-called audio data packet includes several audio data frames, the several audio data frames include at least a plurality of voice signal frames, and a voice signal frame is an audio data frame that contains a voice signal.
- the audio data packet has header information, and different audio data frame numbers can be distinguished according to the header information.
- Step S103: when there is a frame-loss situation in which voice signal frames are missing from the several audio data frames, determine the position of the lost voice signal frame in the several audio data frames to obtain the frame-loss position information.
- Whether speech signal frames are missing can be determined by existing methods. For example, the time-frequency features of the speech signal frames can be gathered statistically and the missing features estimated from the statistics; in this way it can be determined whether a speech signal frame has been lost, and when one has, its position can be determined from the statistical result.
- the location includes the first preset location or the second preset location. Specifically, the first preset location and the second preset location are different in specific locations.
- The so-called position information can be represented by a frame number: the frame number of the lost voice signal frame among the several audio data frames, or among the multiple voice signal frames, or within its own group of speech signal frames.
- Step S105: select a neural network model for repairing the frame loss according to the frame-loss position information.
- the neural network model includes a first repair model and a second repair model, and the first repair model and the second repair model are respectively suitable for repairing speech signal frames at different frame loss positions.
- The situations are distinguished according to the position of the frame loss.
- The first repair model or the second repair model is then selected to repair the lost signal frame, so that the repair is more targeted and its accuracy is improved.
- Step S107: the several audio data frames are sent into the selected neural network model to repair the lost speech signal frames.
- The repaired voice signal data can be inserted at the position of the lost signal frame to obtain the complete voice signal, which is then sent to the audio output device for playback.
- the first repair model is used to repair the speech signal frame at the first preset position
- The second repair model is used to repair the speech signal frame at the second preset position; that is, the first repair model is trained on samples obtained by excluding the frame at the first preset position, and the second repair model is trained on samples obtained by excluding the frame at the second preset position.
- Between step S101 and step S103, the method also includes:
- Step S102 Distinguish voice signal frames and non-voice signal frames in several audio data frames according to a preset algorithm.
- The specific preset algorithm is not limited; it may be, for example, a neural network algorithm or another spectrum analysis method, as long as it can distinguish voice signal frames from non-voice signal frames.
- The acquired audio data is sent to a preceding neural network algorithm, which distinguishes whether each frame of audio data is a voice signal frame or a non-voice signal frame.
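As a stand-in for such a pre-classifier, a crude short-term-energy test is enough to illustrate the interface (purely an assumption; the patent leaves the actual algorithm open):

```python
def is_speech_frame(samples, threshold=0.01):
    """Classify a frame as speech when its mean-square energy
    exceeds a fixed threshold; a toy substitute for the
    neural-network pre-classifier."""
    energy = sum(s * s for s in samples) / len(samples)
    return energy > threshold
```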
- After step S102, the method further includes: performing fade-out envelope processing on the audio signal of the non-voice signal frames.
- the so-called fade-out envelope processing may be simple copy replacement or simple fade-in and fade-out.
- Fading out the audio signal of the non-speech signal frames with an envelope improves data processing efficiency; and since non-speech signal frames are not sent into the repair model, the burden of the neural network in repairing and training on speech signal frames is reduced and repair accuracy is improved.
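One simple reading of the fade-out envelope (an assumption; the text also allows plain copy replacement) is a linear gain ramp across the frame:

```python
def fade_out(samples):
    """Apply a linear fade-out envelope: gain falls from 1.0 at
    the first sample to 0.0 at the last."""
    n = len(samples)
    if n < 2:
        return [0.0] * n
    return [s * (1.0 - i / (n - 1)) for i, s in enumerate(samples)]
```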
- The frame-loss position information is the position of the lost speech signal frame in its speech signal frame group, and the speech signal frame group includes N speech signal frames, where N is an integer greater than or equal to 5.
- The voice signal frames can be grouped into multiple groups, each containing N voice signal frames; the position of the lost voice signal frame within its group, such as its in-group frame number, is then determined.
- each speech signal frame group can be divided by the observation window.
- Please refer to FIG. 2; step S103 includes:
- Step S1031: slide sequentially over the multiple speech signal frames through the observation window.
- An observation window of preset length N can be slid so as to group the multiple speech signal frames into groups of N frames; that is, the length of the observation window equals the number of speech signal frames in a group.
- Step S1032: determine, for the speech signal frames in the observation window, whether there is frame loss. Specifically, once the sliding observation window contains N speech signal frames, time-frequency features can be gathered statistically within the window and the missing features estimated from the statistics, so that it can be determined whether any speech signal data is lost from the speech signal frames in the window.
- Step S1033 After the voice signal frame in the observation window has frame loss, determine the position of the lost voice signal frame in the observation window to obtain the position information of the frame loss.
- The frame-loss position information refers to the position of the lost speech signal frame within the observation window; in this embodiment this position may be the in-window frame number, that is, which frame in the observation window the lost speech signal frame is.
- Accordingly, step S107 includes: repairing the lost speech signal frame within the observation window.
- The method may further include: updating the restored voice signal frame into the corresponding frame at the frame-loss position in the observation window. Specifically, after it is determined that the data of a certain voice signal frame in the window is lost, the lost data is repaired through the selected repair model, and the repaired data is then written back to the location of the lost frame. In this embodiment, updating the repaired data into the corresponding frame-loss position completes the voice signal frame data in the window, so that the input data of the repair model has no gaps; this enriches the repair model's input reference data and improves repair accuracy.
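Writing the repaired frame back into the window can be sketched as follows (the `repair` callback stands in for the selected neural network model; `neighbour_mean` is an invented toy, not the patent's model):

```python
def repair_and_update(window, lost_idx, repair):
    """Replace the lost frame in the observation window with the
    model's output, so subsequent windows see complete input."""
    fixed = list(window)
    fixed[lost_idx] = repair(fixed, lost_idx)
    return fixed

def neighbour_mean(window, idx):
    """Toy 'model': average the two neighbouring frame values."""
    left = window[idx - 1] if idx > 0 else window[idx + 1]
    right = window[idx + 1] if idx + 1 < len(window) else window[idx - 1]
    return (left + right) / 2.0
```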
- To guarantee the amount of input data for the neural network and reduce the extra output delay of the system, when performing step S1031 the observation window may be slid in an iterative-replacement manner, so that the first K speech signal frames inside the window slide out of it while the next K speech signal frames outside the window slide into it, where K is an integer greater than or equal to 1; the value of K can be 1.
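A minimal sketch of this iterative-replacement slide, assuming a fixed window length of N=7 and K=1 (the frame indices are illustrative):

```python
from collections import deque

# Iterative-replacement sliding: the first k frames slide out of the window
# while the next k incoming frames slide in, keeping the window at n frames.
def slide(window, incoming, k=1):
    win = deque(window, maxlen=len(window))  # fixed window length n
    for frame in incoming[:k]:
        win.append(frame)                    # appending evicts the oldest frame
    return list(win)

window = [1, 2, 3, 4, 5, 6, 7]               # current window state, N = 7
next_state = slide(window, [8], k=1)
# next_state == [2, 3, 4, 5, 6, 7, 8]
```

Using a `deque` with `maxlen` makes the eviction of the first K frames implicit in the append, which matches the described behavior of old frames sliding out as new frames slide in.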
- For ease of understanding, as an example the preset length of the observation window is 7 frames (N=7) and K is 1. FIG. 3 is an example schematic diagram of sliding-window grouped repair disclosed in this embodiment; in FIG. 3, the dotted box is an observation window of preset length. The processing flow of sliding-window grouped repair of speech signal frames through the observation window is as follows:
- (1) In the current state, the observation window contains speech signal frames numbered 1 to 7, and the lost frame is the 6th, at the position shown by the grid-hatched box in FIG. 3;
- (2) based on the position information of the lost frame, the first repair model or the second repair model is selected as the neural network model for repair;
- (3) the frames numbered 1-5 and 7 are input into the selected neural network model, which repairs and outputs the speech signal data of the 6th frame;
- (4) the repaired speech signal data is updated into the 6th-frame position of the current window (the grid-hatched position in FIG. 3), completing the speech signal frame data of the current window;
- (5) after the repair, the 1st speech signal frame slides out of the window while the next frame outside the window slides in; in this next state, the frames originally numbered 2 to 7 become frames 1-6 (as indicated by the arrows in FIG. 3) and the newly entered frame is numbered 7. If a lost frame needs to be repaired in this next state, the data at the original 6th-frame position (now the 5th frame, shown by the diagonally hatched box in FIG. 3) has already been updated, so the 5th frame's speech signal data in this state can also serve as input to the repair model.
- In specific embodiments, when step S1033 is performed, two cases are distinguished: the lost speech signal frames either do not include, or do include, the last frame of the observation window. Specifically:
- In one embodiment, step S1033 includes: determining that the position of the lost speech signal frame within the observation window does not include the last frame of the window; that is, the data of the last speech signal frame in the window is not lost, and what is lost is a frame in the middle section of the window. The position of that lost middle-section frame is taken as the first preset position.
- In this case, step S105 includes: feeding the speech signal frames within the observation window into the first repair model to repair the lost speech signal frame, wherein the input data of the first repair model includes the last frame of the window.
- Specifically, the speech signal frames before and after the lost frame in the observation window are fed into the first repair model, which repairs the lost frame based on those surrounding frames. As an example, suppose the data lost in the window is the 6th frame; then the speech signal frames numbered 1-5 and the 7th frame are input into the first repair model, which repairs the 6th speech signal frame based on frames 1-5 and 7.
- In another embodiment, step S1033 includes: determining that at least 2 speech signal frames are lost, the lost positions being the last frame of the observation window and another frame elsewhere in the window; that other position is taken as the second preset position.
- In this case, step S105 includes: feeding the speech signal frames within the window that precede the frame at the other position into the second repair model to repair that frame, wherein the input data of the second repair model consists of the speech signal frames in the window preceding the frame at the other position and does not include the last frame of the window.
- Specifically, the speech signal frames preceding the lost frame in the observation window are fed into the second repair model, which repairs the lost frame based on those preceding frames. As an example, suppose the data lost in the window are the 6th frame and the last (7th) frame; then the speech signal frames numbered 1-5 are input into the second repair model, which repairs the 6th speech signal frame based on frames 1-5. Of course, in some embodiments, the speech signal data of the last (7th) frame of the window may also be repaired synchronously.
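The two cases above can be sketched as a small dispatch function; the model names and the way inputs are assembled here are a simplified illustration under the assumption of 1-based frame positions within an N-frame window, not the patent's actual model interface:

```python
# Selecting the repair model from the frame-loss positions inside an n-frame
# observation window (positions are 1-based, matching the examples above).
def select_repair_inputs(window_frames, lost_positions):
    n = len(window_frames)
    if n not in lost_positions:
        # Loss does not include the last frame: use the first repair model,
        # whose input includes frames both before and after the gap
        # (e.g. frames 1-5 and 7 when frame 6 is lost).
        inputs = [f for i, f in enumerate(window_frames, 1) if i not in lost_positions]
        return "first", inputs
    # Loss includes the last frame plus another frame: use the second repair
    # model, fed only the frames before the earliest lost position.
    earliest = min(lost_positions)
    return "second", window_frames[:earliest - 1]

model, data = select_repair_inputs(list("ABCDEFG"), {6})
# model == "first", data == ['A', 'B', 'C', 'D', 'E', 'G']
model2, data2 = select_repair_inputs(list("ABCDEFG"), {6, 7})
# model2 == "second", data2 == ['A', 'B', 'C', 'D', 'E']
```

The first branch corresponds to bidirectional repair (context on both sides of the gap); the second to forward prediction from the frames preceding the gap.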
- For ease of understanding, this embodiment also discloses a neural network training method for audio packet loss repair. The trained neural network model is suitable for the above packet loss repair method; the neural network model includes a first repair model and a second repair model.
- FIG. 4 is a flowchart of a neural network training method for audio packet loss repair disclosed in this embodiment.
- the neural network training method includes:
- Step S201: Acquire speech signal sample data to be learned.
- In this embodiment, the speech signal sample data is grouped with N speech signal frames per group, where N is an integer greater than or equal to 5; a speech signal frame is an audio data frame containing a speech signal.
- In a specific implementation, speech signal frames that have already been grouped can be obtained directly as the speech signal sample data, or the sample data can be obtained first and then grouped into groups of N speech signal frames.
- Step S203: In each group of N speech signal frames, remove the speech signal frame at the first preset position to obtain the first input sample.
- The so-called first preset position refers to a position within the group; specifically, it may be represented by the sequence number of the speech signal frame in the group.
- Step S205: Remove the speech signal frame at the second preset position from each group of N speech signal frames to obtain the second input sample.
- The so-called second preset position likewise refers to a position within the group and may be represented by the frame's sequence number in the group; in this embodiment, the first preset position and the second preset position are different positions.
- Step S207: Input the first input sample and the second input sample into the first repair model and the second repair model, respectively, to train the two models respectively.
- the first repair model and the second repair model are respectively suitable for repairing speech signal frames at different frame loss positions.
- It should be noted that this embodiment does not restrict the training order of the first repair model and the second repair model; they can be trained separately or synchronously on the same PC.
- In a specific implementation, the first repair model and/or the second repair model can each be trained by, for example, repeated iteration; of course, other methods may also be used to train the two models separately, yielding their model parameters such as weights and coefficients. The first and second repair models can then be stored in a storage device, so that the relevant repair model can be invoked directly when a lost speech signal frame needs to be repaired.
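As a simplified sketch of how the two kinds of training samples might be assembled from one N-frame group (the removal positions and frame values are illustrative assumptions; the actual models, loss functions, and iteration logic are not specified here):

```python
# Building the two training samples from one n-frame group:
# the first input sample removes the frame at the first preset position and
# keeps all surrounding frames; the second input sample keeps only the
# frames before the second preset position (forward prediction).
def make_samples(group, first_pos, second_pos):
    assert first_pos != second_pos, "the two preset positions must differ"
    first_input = [f for i, f in enumerate(group, 1) if i != first_pos]
    first_target = group[first_pos - 1]
    second_input = group[:second_pos - 1]      # only frames before the gap
    second_target = group[second_pos - 1]
    return (first_input, first_target), (second_input, second_target)

group = [10, 20, 30, 40, 50, 60, 70]           # one sample group, N = 7
(first_x, first_y), (second_x, second_y) = make_samples(group, first_pos=6, second_pos=7)
# first_x == [10, 20, 30, 40, 50, 70],  first_y == 60
# second_x == [10, 20, 30, 40, 50, 60], second_y == 70
```

Each (input, target) pair would then drive the iterative training of its respective repair model until the reconstruction error falls within a preset range, as described in the training units below.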
- In a specific implementation, the samples should be grouped in the same way as in the above packet loss repair method. Specifically, step S201 includes: grouping the speech signal sample data into groups of N speech signal frames through an observation window of preset length; that is, an observation window of length N slides over the speech signal sample data, dividing it into groups of N frames. Steps S203, S205 and S207 are then executed within the observation window.
- In specific embodiments, when step S203 is performed, the first preset position is within the observation window and does not include the last frame of the window; the first repair model is trained with the speech signal frames before and after the first preset position in the window. That is, the data of the last speech signal frame in the window is not removed; what is removed is a frame in the middle section of the window.
- In this case, the speech signal frames within the observation window are fed into the first repair model to train it; that is, the first input samples used to train the first repair model include the last frame of the window.
- Specifically, the speech signal frames before and after the removed frame in the observation window are fed into the first repair model, which is trained on those surrounding frames to reproduce the removed speech signal frame.
- FIG. 5A is an example schematic diagram of removing the frame at the first preset position in this embodiment. In FIG. 5A, the dotted box is an observation window of preset length with N=7, containing speech signal frames numbered 1-7. Assuming the data removed within the window is the 6th frame, the frames numbered 1-5 and the 7th frame are taken as the first input sample and input into the first repair model, which is iteratively trained on frames 1-5 and 7 to reproduce the 6th speech signal frame.
- In an optional embodiment, the first preset position is any position other than the first frame in the observation window.
- As a preferred embodiment, since a sliding window is used to repair the lost speech signal frames and the data within the window is updated once a loss has been repaired, the first preset position may preferably be a position near the end of the observation window, such as the (N-1)-th frame.
- In specific embodiments, when step S205 is performed, the second preset position is the last frame of the observation window; the second repair model is trained with the speech signal frames preceding the second preset position in the window. That is, the data of the last speech signal frame in the window is removed.
- In this case, the speech signal frames within the observation window are fed into the second repair model to train it.
- Specifically, the speech signal frames preceding the removed frame (the second preset position) in the observation window are fed into the second repair model, which is trained on those preceding frames to reproduce the removed speech signal frame; that is, the speech signal frame at the second preset position is obtained through forward prediction.
- FIG. 5B is an example schematic diagram of removing the frame at the second preset position in this embodiment. In FIG. 5B, the dotted box is an observation window of preset length with N=6, containing speech signal frames numbered 1-6. Assuming the data removed within the window is the 6th frame, the frames numbered 1-5 preceding it are taken as the second input sample and input into the second repair model, which is iteratively trained on frames 1-5 to reproduce the 6th speech signal frame.
- This embodiment also discloses a neural-network-based audio data packet loss repair device; FIG. 6 is a schematic structural diagram of the device disclosed in this embodiment. The repair device includes a data acquisition module 701, a position determination module 702, a model selection module 703 and a data repair module 704, wherein:
- the data acquisition module 701 is used to acquire an audio data packet; the audio data packet includes several audio data frames, which contain at least a plurality of speech signal frames, a speech signal frame being an audio data frame containing a speech signal;
- the position determination module 702 is used to determine, when a speech signal frame is lost among the several audio data frames, the position of the lost frame among those frames to obtain the frame-loss position information; the model selection module 703 is used to select, according to the frame-loss position information, the neural network model used to repair the frame loss, the neural network model including a first repair model and a second repair model respectively suited to repairing speech signal frames at different loss positions; and the data repair module 704 is used to feed the several audio data frames into the selected neural network model to repair the lost speech signal frame.
- In an optional embodiment, the several audio data frames further include non-speech signal frames; the audio data packet loss repair device then further includes: a signal distinguishing module for distinguishing, according to a preset algorithm, the speech signal frames and non-speech signal frames among the several audio data frames. The frame-loss position information is the position of the lost speech signal frame within its speech signal frame group, the group including N speech signal frames, where N is an integer greater than or equal to 5.
- In an optional embodiment, the position determination module includes: a sliding-window grouping unit for sliding an observation window sequentially over the plurality of speech signal frames to group them into groups of N frames; a frame-loss determination unit for determining, for the speech signal frames within the observation window, whether frame loss exists; and a position acquisition unit for determining, when frame loss exists within the observation window, the position of the lost speech signal frame within the window to obtain the frame-loss position information. The data repair module repairs the lost speech signal frame within the observation window.
- In an optional embodiment, the audio data packet loss repair device further includes: a data update module configured to update the recovered speech signal frame into the corresponding frame-loss position in the observation window.
- In an optional embodiment, the sliding-window grouping unit slides the observation window in an iterative-replacement manner, so that the first K speech signal frames inside the window slide out of it while the next K speech signal frames outside the window slide into it, where K is an integer greater than or equal to 1.
- K is 1.
- In an optional embodiment, the position acquisition unit determines that the position of the lost speech signal frame within the observation window does not include the last frame of the window; the model selection module then feeds the speech signal frames within the window into the first repair model to repair the lost frame, wherein the input data of the first repair model includes the last frame of the window.
- In an optional embodiment, the position acquisition unit determines that at least 2 speech signal frames are lost, the lost positions being the last frame of the observation window and another frame elsewhere in the window; the model selection module then feeds the speech signal frames within the window that precede the frame at the other position into the second repair model to repair it, wherein the input data of the second repair model consists of the frames in the window preceding that frame and does not include the last frame of the window.
- the audio data packet loss repairing device further includes: a fade-out envelope module for performing fade-out envelope processing on the audio signal of the non-voice signal frame.
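The patent does not fix a particular fade-out envelope; as a loud assumption for illustration, the description elsewhere mentions simple copy replacement or a simple fade as acceptable forms, so a minimal linear fade over a non-speech frame's samples could look like:

```python
# Applying a simple linear fade-out envelope to the samples of a non-speech
# frame. The linear ramp is an illustrative choice, not the patent's mandated
# envelope; the sample values below are placeholders.
def fade_out(samples):
    n = len(samples)
    return [s * (n - 1 - i) / (n - 1) for i, s in enumerate(samples)]

faded = fade_out([1.0, 1.0, 1.0, 1.0, 1.0])
# faded == [1.0, 0.75, 0.5, 0.25, 0.0]
```

Because such frames carry little useful information and are never fed to the repair models, a cheap envelope like this keeps processing light while avoiding abrupt discontinuities.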
- This embodiment also discloses an audio device, i.e., a device with an audio data acquisition function; the audio device may be, for example, a headset, a mobile terminal or a smart wearable device. The audio device includes: a processor for implementing the neural-network-based audio data packet loss repair method disclosed in any of the above embodiments.
- This embodiment also discloses an audio signal interaction system, including: a first device and a second device;
- the first device sends the audio data packet to the second device; the second device is used to implement the neural network-based audio data packet loss repair method disclosed in any of the foregoing embodiments.
- In an optional embodiment, the first device is a mobile terminal and the second device is a headset; the mobile terminal may be a terminal with data processing functions, such as a tablet computer, a mobile phone or a notebook computer.
- This embodiment also discloses a computer-readable storage medium on which a computer program is stored; the computer program stored in the storage medium is to be executed to implement the neural-network-based audio data packet loss repair method disclosed in any of the above embodiments.
- According to the neural-network-based audio packet loss repair method, device and system disclosed in this embodiment, the acquired audio data packet includes several audio data frames; when a speech signal frame is lost among those frames, the position of the lost frame among the several audio data frames is determined to obtain the frame-loss position information; then, according to that position information, the neural network model used to repair the frame loss is selected, and the several audio data frames are fed into the selected model to repair the lost speech signal frame. Since the first repair model and the second repair model are respectively suited to repairing speech signal frames at different loss positions, once the position of the lost frame is determined the corresponding repair model can be selected according to it. Compared with the prior art, which uses the same repair model for different frame-loss situations, the solution of the embodiments of the present invention selects the repair model adaptively, making the repair of lost speech signal frames more targeted and thereby improving repair accuracy.
- As an optional solution, the speech signal frames and non-speech signal frames among the several audio data frames are distinguished according to a preset algorithm, and the frame-loss position information is the position of the lost speech signal frame within its speech signal frame group. In this way, the repair can target the lost data of speech signal frames, reducing the interference caused by non-speech signal frames and improving repair accuracy.
- As an optional solution, the recovered speech signal frame is updated into the corresponding frame-loss position in the observation window, so that the input data of the repair model is never missing; this improves the reference data fed to the repair model and thereby the repair accuracy.
- As an optional solution, the observation window is slid in an iterative-replacement manner so that the first K speech signal frames inside the window slide out of it while the next K speech signal frames outside the window slide into it, which guarantees the amount of input data for the neural network; correspondingly, sliding the first K frames out of the window reduces the extra delay of the system's output data, i.e., speech signal frames can be output in time.
- As an optional solution, since non-speech signal frames contain little useful information, performing fade-out envelope processing on their audio signals improves data processing efficiency; and because non-speech signal frames are not fed into the repair model, the burden on the neural network for repairing and training on speech signal frames is reduced and repair accuracy is improved.
Abstract
A neural-network-based audio packet loss repair method, device and system. The method includes: acquiring an audio data packet (S101), the audio data packet including several audio data frames, the several audio data frames containing at least a plurality of speech signal frames; determining the position of a lost speech signal frame among the several audio data frames to obtain frame-loss position information (S103), the position including a first preset position or a second preset position; selecting, according to the frame-loss position information, a neural network model for repairing the frame loss (S105), the neural network model including a first repair model and a second repair model; and feeding the several audio data frames into the selected neural network model to repair the lost speech signal frame (S107). The repair model can be selected adaptively, making the repair of lost speech signal frames more targeted and thereby improving repair accuracy.
Description
The present invention relates to the field of audio data processing, and in particular to a neural-network-based audio packet loss repair method, device and system.
With the popularization of audio-visual equipment and mobile communications and the development of Bluetooth technology, wireless transmission of audio-visual data has become increasingly common; ever more Bluetooth products such as Bluetooth speakers, headsets, mice, keyboards and remote controls have entered daily life.
Bluetooth speakers and headsets mainly provide Bluetooth calling and music playback. When transmitting such audio, Bluetooth sends the audio data packet by packet from a host (mobile phone, computer, etc.) to the Bluetooth playback device. Because the transmission is wireless, it is often subject to interference from other wireless signals, or packets are lost because of obstacles or distance. If the lost data is not repaired, discontinuities or noise will appear at the playback end; in Bluetooth call mode in particular, the lost signal directly degrades the call experience and, in severe cases, even hinders communication. It is therefore necessary to repair lost Bluetooth packet data.
Common traditional approaches to repairing lost Bluetooth packet data include:
1. Mute substitution: the lost data is replaced with silence to avoid harsh noise. This method is simple but of limited performance; it only avoids noise, and the lost signal is not recovered.
2. Waveform substitution: a correlated signal is computed from the pitch period of the speech signal or other related algorithms, and a similar signal is substituted. The principle relies on the short-term stationarity of speech, so a similar waveform can be substituted; however, real speech also contains vowel/consonant transitions and continuous changes of speaking rate and intonation, which are hard to recover by substituting similar signals. Moreover, speech energy changes constantly, so recovering it well requires extra processing; when losses are severe, repeated reuse of similar signals also produces robotic-sounding artifacts.
To repair lost audio data packets, the prior art often uses advanced algorithms for prediction and repair; for example, a neural network learns a nonlinear model between audio data frames and then reconstructs the lost data in the frequency domain according to that model, thereby obtaining a time-domain estimate of the currently lost data. Such nonlinear models are usually learned from the audio data frames received before the current frame, so the currently lost data is predicted from preceding frames in the time domain. Although this approach can predict and estimate the lost data and thus repair it, when the preceding audio data frames themselves suffer packet loss, the nonlinear model learned by the neural network is not accurate enough, so the lost audio data frames cannot be predicted accurately based on that model.
Therefore, how to use a new neural network model to repair lost audio data packets and improve repair accuracy has become an urgent technical problem.
Summary of the invention
In view of the above, the main object of the present invention is to provide a neural-network-based audio packet loss repair method, device and system, so as to repair lost audio data packets and improve repair accuracy.
To achieve the above object, the technical solutions adopted by the present invention are as follows:
According to a first aspect, an embodiment of the present invention discloses a neural-network-based audio data packet loss repair method, including:
step S101, acquiring an audio data packet, the audio data packet including several audio data frames, the several audio data frames containing at least a plurality of speech signal frames, a speech signal frame being an audio data frame containing a speech signal; step S103, when a frame-loss situation of losing a speech signal frame exists among the several audio data frames, determining the position of the lost speech signal frame among the several audio data frames to obtain frame-loss position information, the position including a first preset position or a second preset position; step S105, selecting, according to the frame-loss position information, a neural network model for repairing the frame loss, the neural network model including a first repair model and a second repair model, wherein the first repair model is used to repair a speech signal frame at the first preset position and the second repair model is used to repair a speech signal frame at the second preset position; and step S107, feeding the several audio data frames into the selected neural network model to repair the lost speech signal frame.
Optionally, the several audio data frames further include non-speech signal frames; between step S101 and step S103 the method further includes: step S102, distinguishing the speech signal frames and non-speech signal frames among the several audio data frames according to a preset algorithm; in step S103, the frame-loss position information is the position of the lost speech signal frame within its speech signal frame group, the group including N speech signal frames, where N is an integer greater than or equal to 5.
Optionally, step S103 includes: step S1031, sliding an observation window sequentially over the plurality of speech signal frames to group them into groups of N frames; step S1032, determining, for the speech signal frames within the observation window, whether frame loss exists; and step S1033, when frame loss exists among the speech signal frames within the observation window, determining the position of the lost speech signal frame within the window to obtain the frame-loss position information; step S107 includes: repairing the lost speech signal frame within the observation window.
Optionally, after step S107 the method further includes: updating the recovered speech signal frame into the corresponding frame-loss position in the observation window.
Optionally, in step S1031 the observation window is slid in an iterative-replacement manner, so that the first K speech signal frames inside the window slide out of it while the next K speech signal frames outside the window slide into it, where K is an integer greater than or equal to 1.
Optionally, K is 1.
Optionally, step S1033 includes: determining that the position of the lost speech signal frame within the observation window does not include the last frame of the window, and taking it as the first preset position; step S105 includes: feeding the speech signal frames within the window into the first repair model to repair the lost frame, wherein the input data of the first repair model includes the last frame of the window.
Optionally, step S1033 includes: determining that at least 2 speech signal frames are lost, the lost positions being the last frame of the observation window and another frame elsewhere in the window, the latter being taken as the second preset position; step S105 includes: feeding the speech signal frames within the window that precede the frame at the other position into the second repair model to repair it, wherein the input data of the second repair model consists of the frames in the window preceding that frame and does not include the last frame of the window.
Optionally, after step S102 the method further includes: performing fade-out envelope processing on the audio signals of the non-speech signal frames.
According to a second aspect, an embodiment of the present invention discloses a neural-network-based audio data packet loss repair device, including: a data acquisition module for acquiring an audio data packet, the audio data packet including several audio data frames, the several audio data frames containing at least a plurality of speech signal frames, a speech signal frame being an audio data frame containing a speech signal; a position determination module for determining, when a frame-loss situation of losing a speech signal frame exists among the several audio data frames, the position of the lost speech signal frame among the several audio data frames to obtain frame-loss position information, the position including a first preset position or a second preset position; a model selection module for selecting, according to the frame-loss position information, a neural network model for repairing the frame loss, the neural network model including a first repair model and a second repair model, wherein the first repair model is used to repair a speech signal frame at the first preset position and the second repair model a speech signal frame at the second preset position; and a data repair module for feeding the several audio data frames into the selected neural network model to repair the lost speech signal frame.
Optionally, the several audio data frames further include non-speech signal frames; the device further includes: a signal distinguishing module for distinguishing, according to a preset algorithm, the speech signal frames and non-speech signal frames among the several audio data frames; the frame-loss position information is the position of the lost speech signal frame within its speech signal frame group, the group including N speech signal frames, where N is an integer greater than or equal to 5.
Optionally, the position determination module includes: a sliding-window grouping unit for sliding an observation window sequentially over the plurality of speech signal frames to group them into groups of N frames; a frame-loss determination unit for determining, for the speech signal frames within the observation window, whether frame loss exists; and a position acquisition unit for determining, when frame loss exists within the observation window, the position of the lost speech signal frame within the window to obtain the frame-loss position information; the data repair module repairs the lost speech signal frame within the observation window.
Optionally, the device further includes: a data update module for updating the recovered speech signal frame into the corresponding frame-loss position in the observation window.
Optionally, the sliding-window grouping unit slides the observation window in an iterative-replacement manner, so that the first K speech signal frames inside the window slide out of it while the next K speech signal frames outside the window slide into it, where K is an integer greater than or equal to 1.
Optionally, K is 1.
Optionally, the position acquisition unit determines that the position of the lost speech signal frame within the observation window does not include the last frame of the window, and takes it as the first preset position; the model selection module feeds the speech signal frames within the window into the first repair model to repair the lost frame, wherein the input data of the first repair model includes the last frame of the window.
Optionally, the position acquisition unit determines that at least 2 speech signal frames are lost, the lost positions being the last frame of the observation window and another frame elsewhere in the window, the latter being taken as the second preset position; the model selection module feeds the speech signal frames within the window that precede the frame at the other position into the second repair model to repair it, wherein the input data of the second repair model consists of the frames in the window preceding that frame and does not include the last frame of the window.
Optionally, the device further includes: a fade-out envelope module for performing fade-out envelope processing on the audio signals of the non-speech signal frames.
Optionally, the neural network model includes a first repair model and a second repair model, and a neural network training device includes: a sample acquisition module for acquiring speech signal sample data to be learned, the sample data being grouped with N speech signal frames per group, where N is an integer greater than or equal to 5 and a speech signal frame is an audio data frame containing a speech signal; a first removal module for removing, in each group of N speech signal frames, the speech signal frame at a first preset position to obtain a first input sample; a second removal module for removing, in each group of N speech signal frames, the speech signal frame at a second preset position to obtain a second input sample, the first and second preset positions being different; and a training module for inputting the first input sample and the second input sample into the first repair model and the second repair model respectively, so as to train the two models respectively, the first repair model being used to repair a speech signal frame at the first preset position and the second repair model a speech signal frame at the second preset position.
Optionally, the training module trains the first repair model and the second repair model respectively by repeated iteration.
Optionally, the training module includes a first training unit and/or a second training unit, wherein: the first training unit iteratively trains the first repair model by: obtaining the i-th speech signal frame after the i-th iteration, where i is a positive integer; judging whether a first error between the i-th speech signal frame and the removed speech signal frame at the first preset position is within a preset range; and, if the first error is within the preset range, outputting the model parameters obtained at the i-th iteration to solidify the first repair model. The second training unit iteratively trains the second repair model by: obtaining the j-th speech signal frame after the j-th iteration, where j is a positive integer; judging whether a second error between the j-th speech signal frame and the removed speech signal frame at the second preset position is within a preset range; and, if the second error is within the preset range, outputting the model parameters obtained at the j-th iteration to solidify the second repair model.
Optionally, the sample acquisition module groups the speech signal sample data into groups of N speech signal frames through an observation window of preset length; the first removal module, the second removal module and the training module operate within the observation window.
Optionally, the first preset position is within the observation window and does not include the last frame of the window; the training module trains the first repair model with the speech signal frames before and after the first preset position in the window.
Optionally, the first preset position is any position other than the first frame in the observation window.
Optionally, the second preset position includes the last frame of the observation window; the training module trains the second repair model with the speech signal frames preceding the second preset position in the window.
According to a third aspect, an embodiment of the present invention discloses an audio device including: a processor for implementing any method disclosed in the first aspect above.
Optionally, the audio device is a headset, a mobile terminal or a smart wearable device with an audio playback function.
According to a fourth aspect, an embodiment of the present invention discloses an audio signal interaction system including a first device and a second device; the first device sends an audio data packet to the second device, and the second device is used to implement any method disclosed in the first aspect above.
Optionally, the first device is a mobile terminal and the second device is a headset.
According to a fifth aspect, an embodiment of the present invention discloses a computer-readable storage medium on which a computer program is stored; the computer program stored in the storage medium is to be executed to implement any method disclosed in the first aspect above.
According to a sixth aspect, an embodiment of the present invention discloses a chip for an audio device, on which there is an integrated circuit designed to implement any method disclosed in the first aspect above.
According to the neural network training method and the audio packet loss repair method, device and system disclosed in the embodiments of the present invention, the acquired audio data packet includes several audio data frames; when a speech signal frame is lost among those frames, the position of the lost frame among the several audio data frames is determined to obtain the frame-loss position information; then, according to that position information, the neural network model used to repair the frame loss is selected, and the several audio data frames are fed into the selected model to repair the lost speech signal frame. Since the first repair model and the second repair model are respectively suited to repairing speech signal frames at different loss positions, once the position of the lost frame is determined the corresponding repair model can be selected according to it. Compared with the prior art, which uses the same repair model for different frame-loss situations, the solution of the embodiments of the present invention selects the repair model adaptively, making the repair of lost speech signal frames more targeted and thereby improving repair accuracy.
As an optional solution, the speech signal frames and non-speech signal frames among the several audio data frames are distinguished according to a preset algorithm, and the frame-loss position information is the position of the lost speech signal frame within its speech signal frame group; in this way the repair can target the lost data of speech signal frames, reducing the interference caused by non-speech signal frames and improving repair accuracy.
As an optional solution, an observation window slides sequentially over the plurality of speech signal frames, and the recovered speech signal frame is updated into the corresponding frame-loss position in the window, so that the input data of the repair model is never missing; this improves the reference data fed to the repair model and thereby the repair accuracy.
As an optional solution, the observation window is slid in an iterative-replacement manner so that the first K speech signal frames inside the window slide out of it while the next K speech signal frames outside the window slide into it, which guarantees the amount of input data for the neural network; correspondingly, sliding the first K frames out of the window reduces the extra delay of the system's output data, i.e., speech signal frames can be output in time.
As an optional solution, since non-speech signal frames contain little useful information, performing fade-out envelope processing on their audio signals improves data processing efficiency; and because non-speech signal frames are not fed into the repair model, the burden on the neural network for repairing and training on speech signal frames is reduced and repair accuracy is improved.
Other beneficial effects of the present invention will be set forth in the detailed description through the introduction of specific technical features and solutions; from these introductions, those skilled in the art should be able to understand the beneficial technical effects brought about by the described technical features and solutions.
Embodiments of the present invention will be described below with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart of a neural-network-based audio data packet loss repair method disclosed in this embodiment;
FIG. 2 is a flowchart of a method for determining a frame-loss position through an observation window disclosed in this embodiment;
FIG. 3 is an example schematic diagram of sliding-window grouped repair disclosed in this embodiment;
FIG. 4 is a flowchart of a neural network training method for audio packet loss repair disclosed in this embodiment;
FIG. 5A and FIG. 5B are example schematic diagrams of removing frames at preset positions disclosed in this embodiment, wherein FIG. 5A illustrates removing the frame at the first preset position and FIG. 5B illustrates removing the frame at the second preset position;
FIG. 6 is a schematic structural diagram of a neural-network-based audio data packet loss repair device disclosed in this embodiment.
To repair lost audio data packets with a new neural network model and improve repair accuracy, this embodiment discloses a neural-network-based audio data packet loss repair method. Referring to FIG. 1, a flowchart of the method disclosed in this embodiment, the audio data packet loss repair method includes:
Step S101: acquire an audio data packet. In this embodiment, the audio data packet includes several audio data frames, which contain at least a plurality of speech signal frames; a speech signal frame is an audio data frame containing a speech signal. Specifically, among the several audio signal frames, some frames may be pure speech signal, some may be pure noise signal, and some may contain both speech and noise, i.e., noise signal and speech signal coexist in the same frame. In a specific implementation, the audio data packet carries header information from which the sequence numbers of the different audio data frames can be distinguished.
Step S103: when a speech signal frame is lost among the several audio data frames, determine the position of the lost frame among those frames to obtain the frame-loss position information. In specific embodiments, whether a speech signal frame has been lost can be determined by existing means; for example, time-frequency features of the speech signal frames can be analyzed statistically and the missing features estimated from the statistics, so that loss of a speech signal frame can be detected; once a loss exists, the position of the lost frame can be determined from the statistical results. The position includes a first preset position or a second preset position; specifically, the first and second preset positions are different. In this embodiment, the position information can be expressed as a frame number: the frame number of the lost speech signal frame among the several audio data frames, among the plurality of speech signal frames, or within its group of speech signal frames.
Step S105: select, according to the frame-loss position information, the neural network model used to repair the frame loss. In this embodiment, the neural network model includes a first repair model and a second repair model, respectively suited to repairing speech signal frames at different loss positions. This embodiment distinguishes the frame-loss positions and, for different positions, selects the first or the second repair model to repair the lost signal frame, making the repair more targeted and improving repair accuracy.
It should be noted that, during neural network training, the repair models can be trained with different input samples so that the weights and parameters of the first and second repair models are obtained separately; see the description below for details.
Step S107: feed the several audio data frames into the selected neural network model to repair the lost speech signal frame. In a specific implementation, after the first or second repair model has repaired the lost speech signal frame and produced the corresponding speech signal data, the repaired data can be inserted at the position of the lost frame to obtain a complete speech signal, which is sent to an audio output device for playback. In this embodiment, the first repair model is used to repair a speech signal frame at the first preset position and the second repair model a frame at the second preset position; that is, the first repair model is trained on samples from which the first preset position has been removed, and the second on samples from which the second preset position has been removed.
The several audio data frames include speech signal frames and non-speech signal frames. To filter out the speech signal frames and make the repair more targeted, in an optional embodiment (see FIG. 1) the method further includes, between steps S101 and S103:
Step S102: distinguish the speech signal frames and non-speech signal frames among the several audio data frames according to a preset algorithm. This embodiment does not restrict the specific preset algorithm; it may be, for example, a neural network algorithm or another spectrum analysis method, as long as it can distinguish speech signal frames from non-speech signal frames. As an example, after the audio data is acquired, it is fed into a front-end neural network algorithm that classifies each audio data frame as a speech signal frame or a non-speech signal frame.
To reduce the burden on the neural network for repairing and training on speech signal frames and improve repair accuracy, in an optional embodiment the method further includes, after step S102: performing fade-out envelope processing on the audio signals of the non-speech signal frames. The fade-out envelope processing may be simple copy replacement or a simple fade-in/fade-out. Since non-speech signal frames contain little useful information, this processing improves data processing efficiency; and because non-speech frames are not fed into the repair model, the burden on the neural network for repairing and training on speech signal frames is reduced and repair accuracy improved.
In an optional embodiment, in step S103 the frame-loss position information is the position of the lost speech signal frame within its speech signal frame group, the group including N speech signal frames, where N is an integer greater than or equal to 5. Specifically, once speech and non-speech frames have been distinguished, the speech signal frames can be grouped into multiple groups of N frames each; the position of the lost frame within its group, e.g. its frame number in the group, is then determined.
In a specific implementation, the speech signal frame groups can be delimited by means of an observation window. Referring to FIG. 2, a flowchart of a method for determining the frame-loss position through an observation window disclosed in this embodiment, the method includes:
Step S1031: slide an observation window sequentially over the plurality of speech signal frames. In this embodiment, an observation window of preset length N can be slid over the frames, grouping them into groups of N frames; that is, the length of the observation window equals the number of speech signal frames in a group.
Step S1032: for the speech signal frames within the observation window, determine whether frame loss exists. Specifically, once the sliding window contains N speech signal frames, time-frequency features can be computed within the window and the missing features estimated from these statistics, so that it can be determined whether any speech signal frame in the window has lost its speech signal data.
Step S1033: when frame loss exists among the speech signal frames within the observation window, determine the position of the lost frame within the window to obtain the frame-loss position information. In this embodiment, the frame-loss position information refers to the position of the lost speech signal frame within the observation window, which may be the frame number within the window, i.e., which frame of the window the lost speech signal frame occupies.
In the embodiments that use an observation window to locate the lost signal frame, the lost speech signal frame can also be repaired within the window; that is, step S107 includes: repairing the lost speech signal frame within the observation window.
In an optional embodiment, after step S107 the method may further include: updating the recovered speech signal frame into the corresponding frame-loss position in the observation window. Specifically, after the data of a certain speech signal frame in the window is determined to be lost, the lost frame's data is repaired by the selected repair model, and the repaired data is then written back to the position of the lost frame. Updating the repaired data into the corresponding frame-loss position completes the speech signal frame data in the window, so that the input data of the repair model is never missing; this improves the reference data fed to the repair model and thereby the repair accuracy.
To guarantee the amount of input data for the neural network and reduce the extra output delay of the system, in an optional embodiment the observation window may, when step S1031 is performed, be slid in an iterative-replacement manner, so that the first K speech signal frames inside the window slide out of it while the next K speech signal frames outside the window slide into it, where K is an integer greater than or equal to 1; the value of K can be 1.
For ease of understanding by those skilled in the art, as an example the preset length of the observation window is 7 frames (N=7) and K is, for example, 1. Referring to FIG. 3, an example schematic diagram of sliding-window grouped repair disclosed in this embodiment, the dotted box in FIG. 3 is an observation window of preset length, and the processing flow of sliding-window grouped repair of speech signal frames through the observation window is as follows:
(1) In the current state, the observation window contains speech signal frames numbered 1 to 7, and the lost frame is the 6th, at the position shown by the grid-hatched box in FIG. 3;
(2) based on the position information of the lost frame, the first repair model or the second repair model is selected as the neural network model for repair;
(3) the frames numbered 1-5 and 7 are input into the selected neural network model, which repairs and outputs the speech signal data of the 6th frame;
(4) the repaired speech signal data is updated into the 6th-frame position of the current window, the position shown by the grid-hatched box in FIG. 3, thereby completing the speech signal frame data of the current window;
(5) after the repair, the 1st speech signal frame slides out of the window while the next frame outside the window slides in; the next state is thus entered, in which the frames originally numbered 2 to 7 become frames 1-6 (as indicated by the arrows in FIG. 3) and the newly entered frame is numbered 7;
in this next state, if a lost speech signal frame needs to be repaired, the data at the original 6th-frame position (i.e., the 5th frame in this state, the position shown by the diagonally hatched box in FIG. 3) has already been updated, so the 5th frame's speech signal data in this state can also serve as input to the repair model.
In specific embodiments, when step S1033 is performed, two cases are distinguished: the lost speech signal frames either do not include, or do include, the last frame of the observation window. Specifically:
In one embodiment, step S1033 includes: determining that the position of the lost speech signal frame within the observation window does not include the last frame of the window; that is, the data of the last speech signal frame in the window is not lost and what is lost is a frame in the middle section of the window, whose position is taken as the first preset position. In this case, step S105 includes: feeding the speech signal frames within the window into the first repair model to repair the lost frame, wherein the input data of the first repair model includes the last frame of the window. Specifically, the speech signal frames before and after the lost frame are fed into the first repair model, which repairs the lost frame based on those surrounding frames. As an example, suppose the data lost in the window is the 6th frame; then frames numbered 1-5 and the 7th frame are input into the first repair model, which repairs the 6th speech signal frame based on frames 1-5 and 7.
In another embodiment, step S1033 includes: determining that at least 2 speech signal frames are lost, the lost positions being the last frame of the window and another frame elsewhere in the window; that other position is taken as the second preset position. In this case, step S105 includes: feeding the speech signal frames within the window that precede the frame at the other position into the second repair model to repair it, wherein the input data of the second repair model consists of the frames in the window preceding that frame and does not include the last frame of the window. Specifically, the frames preceding the lost frame are fed into the second repair model, which repairs the lost frame based on those preceding frames. As an example, suppose the data lost in the window are the 6th frame and the last (7th) frame; then frames numbered 1-5 are input into the second repair model, which repairs the 6th speech signal frame based on frames 1-5; of course, in some embodiments, the speech signal data of the last (7th) frame of the window may also be repaired synchronously.
For ease of understanding by those skilled in the art, this embodiment also discloses a neural network training method for audio packet loss repair. The trained neural network model is suitable for the above packet loss repair method, the neural network model including a first repair model and a second repair model. Referring to FIG. 4, a flowchart of the neural network training method for audio packet loss repair disclosed in this embodiment, the training method includes:
Step S201: acquire speech signal sample data to be learned. In this embodiment, the sample data is grouped with N speech signal frames per group, where N is an integer greater than or equal to 5 and a speech signal frame is an audio data frame containing a speech signal. In a specific implementation, already-grouped speech signal frames can be obtained directly as the sample data, or the sample data can be obtained first and then grouped into groups of N frames.
Step S203: in each group of N speech signal frames, remove the frame at the first preset position to obtain the first input sample. The so-called first preset position refers to a position within the group and may specifically be represented by the frame's sequence number in the group.
Step S205: in each group of N speech signal frames, remove the frame at the second preset position to obtain the second input sample. The so-called second preset position likewise refers to a position within the group and may be represented by the frame's sequence number in the group; in this embodiment, the first and second preset positions are different.
Step S207: input the first input sample and the second input sample into the first repair model and the second repair model respectively, so as to train the two models respectively. In this embodiment, the first and second repair models are respectively suited to repairing speech signal frames at different loss positions. It should be noted that this embodiment does not restrict the training order of the two models; they can be trained separately or synchronously on the same PC.
In a specific implementation, the first repair model and/or the second repair model can each be trained by, for example, repeated iteration; of course, other methods may also be used to train the two models separately, yielding their model parameters such as weights and coefficients. The two models can then be stored in a storage device so that the relevant repair model can be invoked directly when a lost speech signal frame needs to be repaired.
In a specific implementation, the samples should be grouped in the same way as in the above packet loss repair method. Specifically, step S201 includes: grouping the speech signal sample data into groups of N speech signal frames through an observation window of preset length; that is, an observation window of length N slides over the sample data, dividing it into groups of N frames; steps S203, S205 and S207 are then executed within the observation window.
In specific embodiments, when step S203 is performed, the first preset position is within the observation window and does not include the last frame of the window; the first repair model is trained with the speech signal frames before and after the first preset position in the window. That is, the data of the last speech signal frame in the window is not removed, and what is removed is a frame in the middle section of the window. The speech signal frames within the window are fed into the first repair model to train it, i.e., the first input samples used to train the first repair model include the last frame of the window; specifically, the frames before and after the removed frame are fed into the first repair model, which is trained on them to reproduce the removed frame. Referring to FIG. 5A, an example schematic diagram of removing the frame at the first preset position in this embodiment, the dotted box is an observation window of preset length with N=7 containing frames numbered 1-7; assuming the removed data is the 6th frame, frames numbered 1-5 and the 7th frame are taken as the first input sample and input into the first repair model, which is iteratively trained on frames 1-5 and 7 to reproduce the 6th frame.
In an optional embodiment, the first preset position is any position other than the first frame in the observation window. As a preferred embodiment, since a sliding window is used to repair lost speech signal frames and the data within the window is updated once a loss has been repaired, the first preset position may preferably be a position near the end of the window, such as the (N-1)-th frame.
In specific embodiments, the second preset position is the last frame of the observation window; the second repair model is trained with the speech signal frames preceding the second preset position in the window, i.e., the data of the last speech signal frame in the window is removed. The speech signal frames within the window are fed into the second repair model to train it; specifically, the frames preceding the removed frame (the second preset position) are fed into the second repair model, which is trained on them to reproduce the removed frame, i.e., the frame at the second preset position is obtained through forward prediction. Referring to FIG. 5B, an example schematic diagram of removing the frame at the second preset position in this embodiment, the dotted box is an observation window of preset length with N=6 containing frames numbered 1-6; assuming the removed data is the 6th frame, frames numbered 1-5 preceding it are taken as the second input sample and input into the second repair model, which is iteratively trained on frames 1-5 to reproduce the 6th frame.
This embodiment also discloses a neural-network-based audio data packet loss repair device. Referring to FIG. 6, a schematic structural diagram of the device disclosed in this embodiment, the audio data packet loss repair device includes a data acquisition module 701, a position determination module 702, a model selection module 703 and a data repair module 704, wherein:
the data acquisition module 701 is used to acquire an audio data packet, the packet including several audio data frames that contain at least a plurality of speech signal frames, a speech signal frame being an audio data frame containing a speech signal; the position determination module 702 is used to determine, when a speech signal frame is lost among the several audio data frames, the position of the lost frame among those frames to obtain the frame-loss position information; the model selection module 703 is used to select, according to that position information, the neural network model used to repair the frame loss, the model including a first repair model and a second repair model respectively suited to repairing speech signal frames at different loss positions; and the data repair module 704 is used to feed the several audio data frames into the selected model to repair the lost speech signal frame.
In an optional embodiment, the several audio data frames further include non-speech signal frames; the device further includes a signal distinguishing module for distinguishing, according to a preset algorithm, the speech and non-speech signal frames among the several audio data frames; the frame-loss position information is the position of the lost speech signal frame within its speech signal frame group, the group including N speech signal frames, where N is an integer greater than or equal to 5.
In an optional embodiment, the position determination module includes: a sliding-window grouping unit for sliding an observation window sequentially over the plurality of speech signal frames to group them into groups of N frames; a frame-loss determination unit for determining, for the frames within the window, whether frame loss exists; and a position acquisition unit for determining, when frame loss exists within the window, the position of the lost frame within the window to obtain the frame-loss position information; the data repair module repairs the lost speech signal frame within the observation window.
在可选的实施例中,音频数据丢包修复装置还包括:数据更新模块,用于将恢复的语音信号帧更新至观察窗内对应的丢帧位置帧。
In an optional embodiment, the sliding-window grouping unit slides the observation window in an iterative-replacement manner, so that the first K speech-signal frames within the observation window slide out of the observation window and the next K speech-signal frames outside the observation window slide into it, where K is an integer greater than or equal to 1.
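The iterative-replacement slide can be sketched as follows, with integers standing in for frames (an illustrative sketch, not the patent's code):

```python
def advance_window(window, incoming, k=1):
    """Slide by iterative replacement: the first k frames leave the window and
    the next k frames of the stream enter it; the window length stays fixed."""
    return window[k:] + incoming[:k]

new_window = advance_window([1, 2, 3, 4, 5, 6, 7], [8, 9], k=1)
# new_window == [2, 3, 4, 5, 6, 7, 8]
```

Because exactly K frames leave as K frames enter, the model always receives a full window of N frames, while the departing frames can be output immediately, which is the latency benefit described below.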
In an optional embodiment, K is 1.
In an optional embodiment, the position acquisition unit determines that the position of the lost speech-signal frame within the observation window does not include the last frame in the observation window; the model selection module feeds the speech-signal frames within the observation window into the first repair model to repair the lost speech-signal frame, wherein the input data of the first repair model includes the last frame in the observation window.
In an optional embodiment, the position acquisition unit determines that at least 2 speech-signal frames are lost, the lost-frame positions being the last frame in the observation window and a frame at another position in the observation window; the model selection module feeds the speech-signal frames within the observation window that precede the frame at the other position into the second repair model to repair the speech-signal frame at the other position, wherein the input data of the second repair model consists of the speech-signal frames within the observation window that precede the frame at the other position and does not include the last frame in the observation window.
In an optional embodiment, the audio data packet-loss repair apparatus further includes an envelope-fading module configured to apply envelope-fading processing to the audio signal of the non-speech-signal frames.
This embodiment further discloses an audio device, the audio device being a device with an audio data collection function. Specifically, the audio device may be, for example, an earphone, a mobile terminal or a smart wearable device, and includes a processor configured to implement the neural network-based audio data packet-loss repair method disclosed in any of the embodiments above.
This embodiment further discloses an audio signal interaction system, including: a first device and a second device;
The first device sends audio data packets to the second device; the second device is configured to implement the neural network-based audio data packet-loss repair method disclosed in any of the embodiments above.
In an optional embodiment, the first device is a mobile terminal and the second device is an earphone. The mobile terminal may be a tablet computer, a mobile phone, a laptop computer or another terminal with data processing capability.
This embodiment further discloses a computer-readable storage medium on which a computer program is stored, the computer program stored in the storage medium being intended to be executed to implement the neural network-based audio data packet-loss repair method disclosed in any of the embodiments above.
According to the neural network-based audio packet-loss repair method, device and system disclosed in this embodiment, the acquired audio data packet includes a number of audio data frames; when a frame-loss situation in which a speech-signal frame is lost occurs among the audio data frames, the position of the lost speech-signal frame among the audio data frames is determined to obtain frame-loss position information; then a neural network model for repairing the frame-loss situation is selected according to the frame-loss position information, and the audio data frames are fed into the selected neural network model to repair the lost speech-signal frame. Since the first repair model and the second repair model are respectively suited to repairing speech-signal frames lost at different positions, once the position of the lost speech-signal frame has been determined, the corresponding repair model can be selected according to that position. Compared with the prior art, which uses the same repair model for different frame-loss situations, the solution of the embodiments of the present invention can select the repair model adaptively, making the repair of lost speech-signal frames more targeted and thereby improving repair accuracy.
As an optional embodiment, the speech-signal frames and non-speech-signal frames among the audio data frames are distinguished by a preset algorithm, and the frame-loss position information is the position of the lost speech-signal frame within a speech-signal frame group. The repair can thus target the lost speech-signal frames, reducing the interference introduced by non-speech-signal frames and improving repair accuracy.
As an optional embodiment, the observation window slides sequentially over the plurality of speech-signal frames, and the recovered speech-signal frame is written back into the corresponding lost-frame position within the observation window. This keeps the input data of the repair model free of gaps, enriching the reference data fed to the repair model and improving repair accuracy.
As an optional embodiment, the observation window is slid in an iterative-replacement manner, so that the first K speech-signal frames within the observation window slide out of it and the next K speech-signal frames outside the observation window slide into it. This guarantees the amount of input data for the neural network; correspondingly, sliding the first K speech-signal frames out of the observation window reduces the extra latency of the system output, that is, speech-signal frames can be output in a timely manner.
As an optional embodiment, since non-speech-signal frames carry little useful information, applying envelope-fading processing to the audio signal of the non-speech-signal frames improves data processing efficiency; moreover, since non-speech-signal frames are not fed into the repair model, the load on the neural network for repairing and training on speech-signal frames is reduced, improving repair accuracy.
Those skilled in the art will understand that, provided there is no conflict, the preferred solutions above can be freely combined and stacked.
It should be understood that the embodiments described above are merely exemplary and not restrictive, and that various obvious or equivalent modifications or substitutions of the details described above that may be made by those skilled in the art without departing from the basic principles of the present invention shall all fall within the scope of the claims of the present invention.
Claims (24)
- A neural network-based audio data packet-loss repair method, characterized by comprising: step S101, acquiring an audio data packet, the audio data packet comprising a number of audio data frames, the audio data frames comprising at least a plurality of speech-signal frames, a speech-signal frame being an audio data frame containing a speech signal; step S103, when a frame-loss situation in which a speech-signal frame is lost occurs among the audio data frames, determining the position of the lost speech-signal frame among the audio data frames to obtain frame-loss position information, the position comprising a first preset position or a second preset position; step S105, selecting, according to the frame-loss position information, a neural network model for repairing the frame-loss situation, the neural network model comprising a first repair model and a second repair model, wherein the first repair model is used for repairing a speech-signal frame at the first preset position and the second repair model is used for repairing a speech-signal frame at the second preset position; and step S107, feeding the audio data frames into the selected neural network model to repair the lost speech-signal frame.
- The audio data packet-loss repair method according to claim 1, characterized in that the audio data frames further comprise non-speech-signal frames; between step S101 and step S103 the method further comprises: step S102, distinguishing, by a preset algorithm, the speech-signal frames from the non-speech-signal frames among the audio data frames; and in step S103 the frame-loss position information is the position of the lost speech-signal frame within a speech-signal frame group, the speech-signal frame group comprising N speech-signal frames, where N is an integer greater than or equal to 5.
- The audio data packet-loss repair method according to claim 2, characterized in that step S103 comprises: step S1031, sliding an observation window sequentially over the plurality of speech-signal frames so as to group the plurality of speech-signal frames N frames at a time; step S1032, determining, for the speech-signal frames within the observation window, whether a frame-loss situation exists; and step S1033, when a frame-loss situation exists among the speech-signal frames within the observation window, determining the position of the lost speech-signal frame within the observation window to obtain the frame-loss position information; and step S107 comprises: repairing the lost speech-signal frame within the observation window.
- The audio data packet-loss repair method according to claim 3, characterized by further comprising, after step S107: writing the recovered speech-signal frame back into the corresponding lost-frame position within the observation window.
- The audio data packet-loss repair method according to claim 3, characterized in that in step S1031 the observation window is slid in an iterative-replacement manner, so that the first K speech-signal frames within the observation window slide out of the observation window and the next K speech-signal frames outside the observation window slide into the observation window, where K is an integer greater than or equal to 1.
- The audio data packet-loss repair method according to claim 5, characterized in that K is 1.
- The audio data packet-loss repair method according to any one of claims 3-6, characterized in that step S1033 comprises: determining that the position of the lost speech-signal frame within the observation window does not include the last frame in the observation window, and taking it as the first preset position; and step S105 comprises: feeding the speech-signal frames within the observation window into the first repair model to repair the lost speech-signal frame, wherein the input data of the first repair model includes the last frame in the observation window.
- The audio data packet-loss repair method according to any one of claims 3-6, characterized in that step S1033 comprises: determining that at least 2 speech-signal frames are lost, the lost-frame positions being the last frame in the observation window and a frame at another position in the observation window, and taking them as the second preset position; and step S105 comprises: feeding the speech-signal frames within the observation window that precede the frame at the other position into the second repair model to repair the speech-signal frame at the other position, wherein the input data of the second repair model consists of the speech-signal frames within the observation window that precede the frame at the other position and does not include the last frame in the observation window.
- The audio data packet-loss repair method according to any one of claims 2-6, characterized by further comprising, after step S102: applying envelope-fading processing to the audio signal of the non-speech-signal frames.
- A neural network-based audio data packet-loss repair apparatus, characterized by comprising: a data acquisition module configured to acquire an audio data packet, the audio data packet comprising a number of audio data frames, the audio data frames comprising at least a plurality of speech-signal frames, a speech-signal frame being an audio data frame containing a speech signal; a position determination module configured to, when a frame-loss situation in which a speech-signal frame is lost occurs among the audio data frames, determine the position of the lost speech-signal frame among the audio data frames to obtain frame-loss position information, the position comprising a first preset position or a second preset position; a model selection module configured to select, according to the frame-loss position information, a neural network model for repairing the frame-loss situation, the neural network model comprising a first repair model and a second repair model, wherein the first repair model is used for repairing a speech-signal frame at the first preset position and the second repair model is used for repairing a speech-signal frame at the second preset position; and a data repair module configured to feed the audio data frames into the selected neural network model to repair the lost speech-signal frame.
- The audio data packet-loss repair apparatus according to claim 10, characterized in that the audio data frames further comprise non-speech-signal frames; the apparatus further comprises a signal distinguishing module configured to distinguish, by a preset algorithm, the speech-signal frames from the non-speech-signal frames among the audio data frames; and the frame-loss position information is the position of the lost speech-signal frame within a speech-signal frame group, the speech-signal frame group comprising N speech-signal frames, where N is an integer greater than or equal to 5.
- The audio data packet-loss repair apparatus according to claim 11, characterized in that the position determination module comprises: a sliding-window grouping unit configured to slide an observation window sequentially over the plurality of speech-signal frames so as to group the plurality of speech-signal frames N frames at a time; a frame-loss determination unit configured to determine, for the speech-signal frames within the observation window, whether a frame-loss situation exists; and a position acquisition unit configured to, when a frame-loss situation exists among the speech-signal frames within the observation window, determine the position of the lost speech-signal frame within the observation window to obtain the frame-loss position information; and the data repair module repairs the lost speech-signal frame within the observation window.
- The audio data packet-loss repair apparatus according to claim 12, characterized by further comprising a data update module configured to write the recovered speech-signal frame back into the corresponding lost-frame position within the observation window.
- The audio data packet-loss repair apparatus according to claim 12, characterized in that the sliding-window grouping unit slides the observation window in an iterative-replacement manner, so that the first K speech-signal frames within the observation window slide out of the observation window and the next K speech-signal frames outside the observation window slide into the observation window, where K is an integer greater than or equal to 1.
- The audio data packet-loss repair apparatus according to claim 14, characterized in that K is 1.
- The audio data packet-loss repair apparatus according to any one of claims 12-15, characterized in that the position acquisition unit determines that the position of the lost speech-signal frame within the observation window does not include the last frame in the observation window, and takes it as the first preset position; and the model selection module feeds the speech-signal frames within the observation window into the first repair model to repair the lost speech-signal frame, wherein the input data of the first repair model includes the last frame in the observation window.
- The audio data packet-loss repair apparatus according to any one of claims 12-15, characterized in that the position acquisition unit determines that at least 2 speech-signal frames are lost, the lost-frame positions being the last frame in the observation window and a frame at another position in the observation window, and takes them as the second preset position; and the model selection module feeds the speech-signal frames within the observation window that precede the frame at the other position into the second repair model to repair the speech-signal frame at the other position, wherein the input data of the second repair model consists of the speech-signal frames within the observation window that precede the frame at the other position and does not include the last frame in the observation window.
- The audio data packet-loss repair apparatus according to any one of claims 11-15, characterized by further comprising an envelope-fading module configured to apply envelope-fading processing to the audio signal of the non-speech-signal frames.
- An audio device, characterized by comprising: a processor configured to implement the method according to any one of claims 1-9.
- The audio device according to claim 19, characterized in that the audio device is an earphone, a mobile terminal or a smart wearable device with an audio playback function.
- An audio signal interaction system, characterized by comprising: a first device and a second device; the first device sends audio data packets to the second device; and the second device is configured to implement the method according to any one of claims 1-9.
- The audio signal interaction system according to claim 21, characterized in that the first device is a mobile terminal and the second device is an earphone.
- A computer-readable storage medium on which a computer program is stored, characterized in that the computer program stored in the storage medium is intended to be executed to implement the method according to any one of claims 1-9.
- A chip for an audio device, having an integrated circuit thereon, characterized in that the integrated circuit is designed to implement the method according to any one of claims 1-9.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/911,733 (US20230245668A1) | 2020-03-20 | 2020-09-30 | Neural network-based audio packet loss restoration method and apparatus, and system |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010200811.1A (CN111883173B) | 2020-03-20 | 2020-03-20 | Neural network-based audio packet loss repair method, device and system |
| CN202010200811.1 | 2020-03-20 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021184732A1 | 2021-09-23 |
Family
ID=73154241
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2020/119603 (WO2021184732A1) | Neural network-based audio packet loss repair method, device and system | 2020-03-20 | 2020-09-30 |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US20230245668A1 |
| CN (1) | CN111883173B |
| WO (1) | WO2021184732A1 |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115101088A * | 2022-06-08 | 2022-09-23 | Vivo Mobile Communication Co., Ltd. | Audio signal recovery method and apparatus, electronic device, and medium |
| CN118101632B * | 2024-04-22 | 2024-06-21 | 安徽声讯信息技术有限公司 | Artificial intelligence-based low-latency speech signal transmission method and system |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20140088974A1 * | 2012-09-26 | 2014-03-27 | Motorola Mobility Llc | Apparatus and method for audio frame loss recovery |
| CN103714821A * | 2012-09-28 | 2014-04-09 | Dolby Laboratories Licensing Corp. | Location-based hybrid-domain packet loss concealment |
| CN108111702A * | 2017-12-07 | 2018-06-01 | 瑟达智家科技(杭州)有限公司 | Method for automatic compensation of voice packet loss in a VoIP system |
| CN109637540A * | 2019-02-28 | 2019-04-16 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Bluetooth evaluation method, apparatus, device and medium for a smart voice device |
| CN110534120A * | 2019-08-31 | 2019-12-03 | 刘秀萍 | Surround sound error repair method in a mobile network environment |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9053699B2 * | 2012-07-10 | 2015-06-09 | Google Technology Holdings LLC | Apparatus and method for audio frame loss recovery |
| ES2881510T3 * | 2013-02-05 | 2021-11-29 | Ericsson Telefon Ab L M | Method and apparatus for controlling audio frame loss concealment |
| CN109218083B * | 2018-08-27 | 2021-08-13 | 广州猎游信息科技有限公司 | Voice data transmission method and apparatus |
- 2020-03-20: CN application CN202010200811.1A filed (patent CN111883173B, status: active)
- 2020-09-30: WO application PCT/CN2020/119603 filed (WO2021184732A1, application filing)
- 2020-09-30: US application US17/911,733 filed (US20230245668A1, status: pending)
Also Published As
| Publication number | Publication date |
|---|---|
| CN111883173B | 2023-09-12 |
| US20230245668A1 | 2023-08-03 |
| CN111883173A | 2020-11-03 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20925398; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22/02/2023) |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20925398; Country of ref document: EP; Kind code of ref document: A1 |