WO2020192009A1 - Neural network-based silence detection method, terminal device and medium - Google Patents
Neural network-based silence detection method, terminal device and medium
- Publication number
- WO2020192009A1 (application PCT/CN2019/103149)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- layer
- subsequence
- dimensionality reduction
- convolution
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L2025/783—Detection of presence or absence of voice signals based on threshold decision
Definitions
- This application belongs to the field of artificial intelligence technology, and in particular relates to a neural-network-based silence detection method, a terminal device, and a computer non-volatile readable storage medium.
- Silence detection refers to performing feature analysis on audio signals to distinguish speech signals from noise signals within them. It has a very wide range of applications in speech coding, speech enhancement, and speech recognition, where it is the first step; its accuracy directly determines whether subsequent speech processing can be carried out effectively.
- Traditional silence detection usually uses methods such as zero-crossing detection, correlation detection, or spectral envelope detection. These methods all need to first convert the time-domain audio signal into a frequency-domain signal, which is not only cumbersome to operate and difficult to apply, but also yields a low detection accuracy.
- The embodiments of this application provide a neural-network-based silence detection method, a terminal device, and a computer non-volatile readable storage medium, to solve the problems that existing silence detection methods are cumbersome to operate, difficult to apply, and low in detection accuracy.
- In a first aspect, a neural-network-based silence detection method includes:
- sampling an original audio signal to be detected based on a preset sampling frequency to obtain a sampling signal corresponding to the original audio signal;
- performing framing processing on the sampling signal based on a preset receptive field length to obtain at least two frames of audio subsequences;
- inputting each audio subsequence into a pre-trained silence detection model to obtain a feature value of the audio subsequence, where the silence detection model is a one-dimensional convolutional neural network model, the feature value of the audio subsequence is used to characterize the probability that the audio segment corresponding to the audio subsequence is a speech signal, and the feature value is a one-dimensional value;
- if the feature value of the audio subsequence is greater than or equal to a preset feature value threshold, determining that the audio segment corresponding to the audio subsequence is a speech signal.
- In a second aspect, a terminal device includes:
- the first sampling unit is configured to sample an original audio signal to be detected based on a preset sampling frequency to obtain a sampling signal corresponding to the original audio signal;
- the first audio processing unit is configured to perform frame division processing on the sample signal based on the preset receptive field length to obtain at least two frames of audio subsequences;
- the feature value calculation unit is configured to input the audio subsequence to a pre-trained silence detection model to obtain the feature value of the audio subsequence;
- the silence detection model is a one-dimensional convolutional neural network model, the feature value of the audio subsequence is used to characterize the probability that the audio segment corresponding to the audio subsequence is a speech signal, and the feature value is a one-dimensional value;
- the silence detection unit is configured to determine that the audio segment corresponding to the audio subsequence is a voice signal if the feature value of the audio subsequence is greater than or equal to a preset feature value threshold.
- In a third aspect, a terminal device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when the processor executes the computer-readable instructions, the steps of the above neural-network-based silence detection method are implemented.
- In a fourth aspect, a computer non-volatile readable storage medium stores computer-readable instructions which, when executed by a processor, implement the steps of the above neural-network-based silence detection method.
- This application samples the original audio signal based on a preset sampling frequency, performs framing processing on the sampled signal based on a preset receptive field length to obtain at least two frames of audio subsequences, uses a pre-trained silence detection model to perform dimensionality reduction on each audio subsequence, finally converting the audio subsequence into a one-dimensional value, and determines whether the audio segment corresponding to the audio subsequence is a speech signal based on the magnitude relationship between the one-dimensional value and a preset feature value threshold.
- When the silence detection model performs silence detection on the original audio signal, there is no need to convert the original audio signal from the time domain to the frequency domain; the signal only needs to be converted into a digital audio signal in the time domain. This simplifies the silence detection process and improves the efficiency of silence detection. Moreover, because the silence detection model is obtained by training, the parameters of the silence detection model are continuously optimized during training, thereby improving the accuracy of silence detection.
- FIG. 1 is an implementation flowchart of a neural-network-based silence detection method provided by an embodiment of the present invention;
- FIG. 2 is a specific implementation flowchart of S3 in a neural-network-based silence detection method provided by an embodiment of the present invention;
- FIG. 3 is an implementation flowchart of a neural-network-based silence detection method provided by another embodiment of the present invention;
- FIG. 4 is a structural block diagram of a terminal device provided by an embodiment of the present invention;
- FIG. 5 is a structural block diagram of a terminal device provided by another embodiment of the present invention.
- FIG. 1 is an implementation flowchart of a method for detecting silence based on a neural network according to an embodiment of the present invention.
- the execution subject of the neural network-based silence detection method is the terminal device.
- Terminal devices include but are not limited to smartphones, tablets or desktop computers.
- As shown in FIG. 1, the neural-network-based silence detection method includes the following steps:
- S1 Sampling the original audio signal to be detected based on the preset sampling frequency to obtain the sampling signal corresponding to the original audio signal.
- the original audio signal to be detected is an analog audio signal, which is usually collected by a microphone.
- the terminal device may sample the original audio signal to be detected based on the preset sampling frequency, and then obtain the sampling signal corresponding to the original audio signal.
- the sampling signal is a digital audio signal
- the length of the sampling signal is related to the duration of the original audio signal and the preset sampling frequency
- the length of the sampling signal is used to identify the number of sampling points included in the sampling signal.
- The length of the sampling signal is N = t × f, where t is the duration of the original audio signal in seconds, f is the preset sampling frequency, and N is a positive integer.
- That is, sampling an original audio signal of duration t seconds at the preset sampling frequency f yields a sampling signal that is an audio sequence of length t × f.
- the preset sampling frequency can be set according to actual needs, and there is no restriction here.
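As a concrete illustration of the relationship N = t × f (this sketch is not part of the original disclosure; the 16 kHz frequency, the 2-second duration, and the sine-wave stand-in for the analog input are all assumed):

```python
import numpy as np

f = 16_000                      # preset sampling frequency in Hz (assumed value)
t = 2.0                         # duration of the original audio signal in seconds
n = int(t * f)                  # length of the sampling signal: N = t * f sample points
times = np.arange(n) / f        # sampling instants
signal = np.sin(2 * np.pi * 440.0 * times)  # stand-in for the analog audio input
# `signal` is now the digital audio sequence of length t * f described above
```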
- S2 Perform frame division processing on the sampled signal based on the preset receptive field length to obtain at least two frames of audio subsequences.
- The preset receptive field length refers to the frame length of a single frame used when framing the sampled signal; that is, the length of each frame of audio subsequence obtained by framing the sampled signal based on the preset receptive field length is equal to the preset receptive field length.
- the preset receptive field length can be set according to actual needs, and there is no restriction here.
- For ease of description, the preset receptive field length is denoted as T.
- Because the embodiment of the present invention performs the same processing on each frame of audio subsequence in the subsequent steps, it must be ensured that the length of each frame of audio subsequence obtained by framing the sampled signal equals the preset receptive field length, which in turn requires the length of the sampled signal to be an integer multiple of the preset receptive field length. In practical applications, the length of the sampled signal is often not such an integer multiple; therefore, in the embodiment of the present invention, the terminal device also detects whether the length of the sampled signal is an integer multiple of the preset receptive field length before framing the sampled signal.
- If the terminal device detects that the length of the sampled signal is not an integer multiple of the preset receptive field length, it adjusts the length of the sampled signal based on a preset length adjustment strategy, so that the length of the sampled signal becomes an integer multiple of the preset receptive field length.
- the preset length adjustment strategy can be set according to actual needs.
- the preset length adjustment strategy can be: zero-filling the sampling signal until the length of the sampling signal is an integer multiple of the preset receptive field length.
- performing zero-filling processing on the sampled signal may specifically be: zero-filling before or after the audio sequence corresponding to the sampled signal.
- After the terminal device adjusts the length of the sampled signal to an integer multiple of the preset receptive field length, it frames the length-adjusted sampled signal based on the preset receptive field length to obtain at least two frames of audio subsequences; the length of each frame of audio subsequence is T, that is, each frame of audio subsequence consists of T sample values.
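A minimal NumPy sketch of the zero-padding and framing just described, assuming zeros are appended after the audio sequence (the disclosure permits padding before or after) and an assumed receptive field length T = 1024:

```python
import numpy as np

def frame_signal(samples: np.ndarray, receptive_field: int) -> np.ndarray:
    """Zero-pad `samples` to an integer multiple of the preset receptive field
    length, then split it into frames (audio subsequences) of that length."""
    pad = (-len(samples)) % receptive_field        # zeros needed; 0 if already aligned
    padded = np.concatenate([samples, np.zeros(pad, dtype=samples.dtype)])
    return padded.reshape(-1, receptive_field)     # one row per audio subsequence

samples = np.random.randn(32_000)                  # assumed sampled signal
frames = frame_signal(samples, receptive_field=1024)  # shape (32, 1024): frames of T = 1024
```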
- S3 Input the audio subsequence into the pre-trained silence detection model to obtain the feature value of the audio subsequence; the silence detection model is a one-dimensional convolutional neural network model, the feature value of the audio subsequence is used to characterize the probability that the audio segment corresponding to the audio subsequence is a speech signal, and the feature value is a one-dimensional value.
- The silence detection model is obtained by training a pre-built one-dimensional convolutional neural network model on a preset sample set using a deep learning algorithm.
- the input values and intermediate processing values of the one-dimensional convolutional neural network model described in the embodiment of the present invention are both one-dimensional arrays, and the output value of the one-dimensional convolutional neural network model is a one-dimensional value.
- Each piece of sample data in the preset sample set is composed of an audio sub-sequence of length T and a feature value corresponding to the audio sub-sequence.
- The feature value of the audio subsequence is used to characterize the probability that the audio segment corresponding to the audio subsequence is a speech signal. For example, if the audio signal corresponding to a certain audio subsequence is a speech signal, the feature value of that audio subsequence can be set to 1; if the audio signal corresponding to a certain audio subsequence is a noise signal, the feature value of that audio subsequence can be set to 0. It should be noted that the audio segment corresponding to the audio subsequence refers to the segment of the original audio signal from which the audio subsequence was obtained.
- The one-dimensional convolutional neural network model includes an input layer, a hidden layer, and an output layer.
- The input layer includes T input nodes, which respectively receive the T sample values contained in the audio subsequence.
- The hidden layer is composed of L cascaded layers of dimensionality reduction networks, and each layer of the dimensionality reduction network is configured with a first convolution kernel for performing dimensionality reduction on the audio subsequence. The first convolution kernel is a one-dimensional array whose length is less than the length of the audio subsequence, and the step size of the first convolution kernel may be the same as its length.
- The output layer is configured with a second convolution kernel for reducing the dimensionality of the convolved audio subsequence output by the hidden layer. The second convolution kernel is also a one-dimensional array, and its length is equal to the length of the audio subsequence output by the hidden layer.
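By way of illustration only, the described architecture might be sketched in Python with PyTorch as follows, under assumed sizes T = 1024, kernel length k = 4 for every layer, and L = 4 reduction layers; the final sigmoid is likewise an assumption, since the disclosure only states that the feature value characterizes a probability:

```python
import torch
import torch.nn as nn

class SilenceDetector(nn.Module):
    """One-dimensional CNN: L cascaded dimensionality-reduction layers whose
    stride equals their kernel length, then an output convolution whose
    kernel spans the entire remaining feature array."""
    def __init__(self, T: int = 1024, k: int = 4, L: int = 4):
        super().__init__()
        layers, length = [], T
        for _ in range(L):                            # hidden layer: L reduction networks
            layers.append(nn.Conv1d(1, 1, kernel_size=k, stride=k, bias=False))
            length //= k                              # each layer shrinks the sequence by k
        self.hidden = nn.Sequential(*layers)
        # second convolution kernel: its length equals the feature-array length
        self.output = nn.Conv1d(1, 1, kernel_size=length, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, 1, T)
        feat = self.hidden(x)                              # feature array, length T / k**L
        value = self.output(feat).squeeze(-1)              # one-dimensional feature value
        return torch.sigmoid(value)                        # assumed squashing to [0, 1]
```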
- The audio subsequence of length T contained in each piece of sample data in the preset sample set is used as the input of the one-dimensional convolutional neural network model, and the feature value contained in each piece of sample data is used as its expected output; the one-dimensional convolutional neural network model is then trained.
- Through training, the terminal device learns the convolution kernel parameters of the first convolution kernel of each layer of the dimensionality reduction network in the hidden layer of the one-dimensional convolutional neural network model, as well as the convolution kernel parameters of the second convolution kernel in the output layer of the one-dimensional convolutional neural network model.
- A convolution kernel parameter refers to the value of each element contained in the convolution kernel.
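A hypothetical training loop consistent with this description might look as follows (the Adam optimizer, the binary cross-entropy loss, and the random stand-in sample set are all assumptions; the disclosure does not specify them):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = SilenceDetector(T=1024, k=4, L=4)          # sketch defined above
optimizer = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.BCELoss()

# Assumed stand-in sample set: random subsequences with random 0/1 feature values
x = torch.randn(64, 1, 1024)
y = torch.randint(0, 2, (64, 1)).float()
loader = DataLoader(TensorDataset(x, y), batch_size=16)

for x_batch, y_batch in loader:                    # learns first and second kernels
    optimizer.zero_grad()
    loss = loss_fn(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()
```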
- After the terminal device frames the sampled signal to obtain at least two frames of audio subsequences, all the audio subsequences are input into the pre-trained silence detection model to obtain the feature value of each audio subsequence.
- S3 can be specifically implemented through S31 to S33 as shown in FIG. 2, which is described in detail as follows:
- S31 Receive the T sample values included in the audio subsequence through the T input nodes included in the input layer of the silence detection model.
- After the terminal device inputs the audio subsequence obtained by framing into the pre-trained silence detection model, the T input nodes contained in the input layer of the silence detection model respectively receive the T sample values contained in the audio subsequence, and the received audio subsequence is input into the hidden layer of the silence detection model.
- S32 After the terminal device inputs the audio subsequence received by the input layer of the silence detection model into the hidden layer of the silence detection model, convolution processing is performed in the hidden layer on the audio subsequence received by each layer of the dimensionality reduction network, based on the first convolution kernel of that layer, and the feature array of the audio subsequence is obtained at the L-th layer of the dimensionality reduction network.
- The terminal device performs convolution processing on the audio subsequence received by each layer of the dimensionality reduction network based on the first convolution kernel of that layer, which specifically includes the following steps:
- the first layer of the dimensionality reduction network performs convolution processing on the audio subsequence output by the input layer based on its first convolution kernel, and inputs the convolved audio subsequence into the second layer of the dimensionality reduction network;
- the second layer of the dimensionality reduction network performs convolution processing again on the audio subsequence output by the first layer based on its first convolution kernel, and inputs the convolved audio subsequence into the third layer of the dimensionality reduction network; and so on, until finally the L-th layer of the dimensionality reduction network performs convolution processing on the audio subsequence output by the (L-1)-th layer based on its first convolution kernel, and the feature array of the audio subsequence is obtained. It should be noted that the length of the feature array of the audio subsequence is much smaller than the length of the audio subsequence.
- Because the length of the audio subsequence input into the silence detection model is fixed, in practical applications the number of layers of the dimensionality reduction network contained in the hidden layer, and the length and step size of the first convolution kernel in each layer, can be set flexibly according to actual needs. Once these are set, the length of the feature array finally output by the L-th layer of the dimensionality reduction network is determined, and the length of the feature array of the audio subsequence in turn determines the length and step size of the second convolution kernel contained in the output layer.
- In the embodiment of the present invention, the length of the first convolution kernel is equal to its step size, and the length of the audio subsequence received by each layer of the dimensionality reduction network is an integer multiple of the length of that layer's first convolution kernel. Based on this, S32 may specifically include the following steps:
- the audio subsequences received by the dimensionality reduction network of each layer are subjected to convolution processing based on the first preset convolution formula in turn;
- the first preset convolution formula is:

$$\mathrm{Audio}_{i,j} = \sum_{m=1}^{k_i} \mathrm{Kernel}_{i,m} \times \mathrm{Audio}_{(i-1),\,(j-1)k_i+m}, \qquad j = 1, 2, \ldots, \frac{a_{i-1}}{k_i}$$

where Kernel_{i,m} is the value of the m-th element of the first convolution kernel of the i-th layer of the dimensionality reduction network, k_i is the length of that kernel, Audio_{(i-1),n} is the value of the n-th audio element contained in the audio subsequence output by the (i-1)-th layer of the dimensionality reduction network, and a_{i-1} is the length of the audio subsequence output by the (i-1)-th layer; the elements Audio_{(i-1),(j-1)k_i+1} through Audio_{(i-1),j k_i} form the j-th subsequence of that output.
- the audio subsequence after the convolution processing output by the dimensionality reduction network of the Lth layer is determined as the feature array of the audio subsequence.
- The audio subsequence received by the first layer of the dimensionality reduction network of the hidden layer is the audio subsequence output by the input layer, while the audio subsequences received by the second through L-th layers of the dimensionality reduction network are the convolved audio subsequences output by the preceding layer of the dimensionality reduction network.
- In other words, each layer of the dimensionality reduction network, starting from the first, obtains the audio subsequence it outputs by processing the audio subsequence it receives based on the first preset convolution formula.
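Concretely, one layer of the first preset convolution formula reduces to a blockwise dot product; a minimal NumPy sketch (the input and kernel values here are made up for illustration):

```python
import numpy as np

def reduce_layer(audio: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """One dimensionality-reduction layer: a non-overlapping convolution in
    which the stride equals the kernel length k_i, per the first preset
    convolution formula."""
    k = len(kernel)
    assert len(audio) % k == 0   # layer input length is an integer multiple of k
    # split into consecutive blocks of length k; each block's dot product
    # with the kernel yields one element of the layer's output
    return audio.reshape(-1, k) @ kernel

layer_out = reduce_layer(np.arange(16, dtype=float), np.array([0.5, 0.5, 0.5, 0.5]))
# 16 inputs with k = 4 -> 4 outputs; cascading L such layers yields the feature array
```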
- After the terminal device obtains the feature array of the audio subsequence at the L-th layer of the dimensionality reduction network of the hidden layer, it inputs the feature array of the audio subsequence into the output layer of the silence detection model.
- S33 Perform convolution processing on the feature array of the audio subsequence based on the second convolution kernel in the output layer of the silence detection model to obtain the feature value of the audio subsequence.
- The terminal device performs convolution processing on the feature array of the audio subsequence output by the hidden layer based on the second convolution kernel in the output layer to obtain the feature value of the audio subsequence. It should be noted that, because the length of the second convolution kernel of the output layer is equal to the length of the feature array of the audio subsequence output by the hidden layer, convolving the feature array of the audio subsequence with the second convolution kernel yields a feature value that is a one-dimensional value.
- S33 can be implemented through the following steps:
- convolution processing is performed on the feature array of the audio subsequence based on a second preset convolution formula to obtain the feature value of the audio subsequence;
- the second preset convolution formula is:

$$\mathrm{Audio}_{final} = \sum_{j=1}^{a_{final}} \mathrm{Kernel}_j \times \mathrm{Audio}_j$$

where Audio_final is the feature value of the audio subsequence, a_final is the length of the feature array of the audio subsequence, Kernel_j is the value of the j-th element of the second convolution kernel, and Audio_j is the value of the j-th audio element in the feature array of the audio subsequence.
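Because the second convolution kernel spans the entire feature array, the second preset convolution formula collapses to a single dot product, as this minimal sketch shows (example values are made up):

```python
import numpy as np

def output_layer(feature_array: np.ndarray, kernel2: np.ndarray) -> float:
    """Second preset convolution formula: kernel2 has the same length as the
    feature array, so one convolution step yields the scalar feature value."""
    assert len(kernel2) == len(feature_array)
    return float(np.dot(kernel2, feature_array))  # Audio_final = sum_j Kernel_j * Audio_j

value = output_layer(np.array([0.2, 0.8, 0.5, 0.1]), np.array([0.4, 0.3, 0.2, 0.1]))
```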
- S4 After the terminal device calculates the feature value of each audio subsequence, it compares the feature value of each audio subsequence with the preset feature value threshold. If the terminal device detects that the feature value of a certain audio subsequence is greater than or equal to the preset feature value threshold, it determines that the audio segment corresponding to the audio subsequence is a speech signal.
- the preset feature value threshold can be set according to actual needs, and there is no limitation here.
- As another embodiment of the present invention, as shown in FIG. 3, the neural-network-based silence detection method may further include S5, which is described in detail as follows:
- S5 If the terminal device detects that the feature value of a certain audio subsequence is less than the preset feature value threshold, it determines that the audio segment corresponding to the audio subsequence is a noise signal.
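The decisions in S4 and S5 amount to a threshold comparison; a minimal sketch, with the 0.5 threshold and the example feature values assumed for illustration:

```python
def classify_segment(feature_value: float, threshold: float = 0.5) -> str:
    """S4/S5: at or above the preset feature value threshold -> speech signal;
    below it -> noise signal."""
    return "speech" if feature_value >= threshold else "noise"

feature_values = [0.91, 0.12, 0.73]            # assumed per-subsequence model outputs
labels = [classify_segment(v) for v in feature_values]
# -> ['speech', 'noise', 'speech']
```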
- It can be seen from the above that the embodiment of the present invention samples the original audio signal based on the preset sampling frequency, frames the sampled signal based on the preset receptive field length to obtain at least two frames of audio subsequences, performs dimensionality reduction on each audio subsequence with the pre-trained silence detection model, finally converting the audio subsequence into a one-dimensional value, and determines whether the audio segment corresponding to the audio subsequence is a speech signal based on the magnitude relationship between the one-dimensional value and the preset feature value threshold.
- When the silence detection model performs silence detection on the original audio signal, there is no need to convert the original audio signal from the time domain to the frequency domain; the signal only needs to be converted into a digital audio signal in the time domain. This simplifies the silence detection process and improves the efficiency of silence detection. Moreover, because the silence detection model is obtained by training, the parameters of the silence detection model are continuously optimized during training, thereby improving the accuracy of silence detection.
- FIG. 4 is a structural block diagram of a terminal device according to an embodiment of the present invention.
- the terminal device in this embodiment may be a terminal device such as a smart phone or a tablet computer.
- the units included in the terminal device are used to execute the steps in the embodiments corresponding to FIGS. 1 to 3.
- the terminal device 400 includes: a first sampling unit 41, a first audio processing unit 42, a feature value calculation unit 43, and a silence detection unit 44, wherein:
- the first sampling unit 41 is configured to sample the original audio signal to be detected based on a preset sampling frequency to obtain a sampling signal corresponding to the original audio signal;
- the first audio processing unit 42 is configured to perform frame division processing on the sampled signal based on the preset receptive field length to obtain at least two frames of audio subsequences;
- the feature value calculation unit 43 is configured to input the audio subsequence into a pre-trained silence detection model to obtain the feature value of the audio subsequence;
- the silence detection model is a one-dimensional convolutional neural network model, the feature value of the audio subsequence is used to characterize the probability that the audio segment corresponding to the audio subsequence is a speech signal, and the feature value is a one-dimensional value;
- the silence detection unit 44 is configured to determine that the audio segment corresponding to the audio subsequence is a voice signal if the feature value of the audio subsequence is greater than or equal to a preset feature value threshold.
- each frame of the audio subsequence includes T sample values;
- the silence detection model includes an input layer, a hidden layer, and an output layer; the input layer includes T input nodes, the hidden layer is composed of L cascaded layers of dimensionality reduction networks, and each layer of the dimensionality reduction network is configured with a first convolution kernel;
- the feature value calculation unit 43 specifically includes: a first receiving unit, a first calculation unit, and a second calculation unit, wherein:
- the first receiving unit is configured to receive the T sample values included in the audio subsequence through the T input nodes included in the input layer of the silence detection model.
- the first calculation unit is configured to perform convolution processing on the audio subsequences received by each layer of the dimensionality reduction network based on the first convolution kernel of each layer of the dimensionality reduction network in the hidden layer of the silence detection model, Obtain the feature array of the audio subsequence in the dimensionality reduction network of the Lth layer.
- the second calculation unit is configured to perform convolution processing on the feature array of the audio subsequence based on the second convolution kernel in the output layer of the silence detection model to obtain the feature value of the audio subsequence.
- the length of the first convolution kernel is equal to its step size, and the length of the audio subsequence received by the dimensionality reduction network of each layer is an integer multiple of the length of the first convolution kernel of the layer ;
- the first calculation unit is specifically configured to:
- the audio subsequences received by the dimensionality reduction network of each layer are subjected to convolution processing based on the first preset convolution formula in turn;
- the first preset convolution formula is:

$$\mathrm{Audio}_{i,j} = \sum_{m=1}^{k_i} \mathrm{Kernel}_{i,m} \times \mathrm{Audio}_{(i-1),\,(j-1)k_i+m}, \qquad j = 1, 2, \ldots, \frac{a_{i-1}}{k_i}$$

where Kernel_{i,m} is the value of the m-th element of the first convolution kernel of the i-th layer of the dimensionality reduction network, k_i is the length of that kernel, Audio_{(i-1),n} is the value of the n-th audio element contained in the audio subsequence output by the (i-1)-th layer of the dimensionality reduction network, and a_{i-1} is the length of the audio subsequence output by the (i-1)-th layer;
- the audio subsequence after the convolution processing output by the dimensionality reduction network of the Lth layer is determined as the feature array of the audio subsequence.
- convolution processing is performed on the feature array of the audio subsequence based on a second preset convolution formula to obtain the feature value of the audio subsequence;
- the second preset convolution formula is:

$$\mathrm{Audio}_{final} = \sum_{j=1}^{a_{final}} \mathrm{Kernel}_j \times \mathrm{Audio}_j$$

where Audio_final is the feature value of the audio subsequence, a_final is the length of the feature array of the audio subsequence, Kernel_j is the value of the j-th element of the second convolution kernel, and Audio_j is the value of the j-th audio element in the feature array of the audio subsequence.
- the silence detection unit 44 is further configured to determine that the audio segment corresponding to the audio subsequence is a noise signal if the feature value of the audio subsequence is less than a preset feature value threshold.
- It can be seen from the above that the terminal device samples the original audio signal based on the preset sampling frequency, frames the sampled signal based on the preset receptive field length to obtain at least two frames of audio subsequences, performs dimensionality reduction on each audio subsequence with the pre-trained silence detection model, finally converting the audio subsequence into a one-dimensional value, and determines whether the audio segment corresponding to the audio subsequence is a speech signal based on the magnitude relationship between the one-dimensional value and the preset feature value threshold.
- When the silence detection model performs silence detection on the original audio signal, there is no need to convert the original audio signal from the time domain to the frequency domain; the signal only needs to be converted into a digital audio signal in the time domain. This simplifies the silence detection process and improves the efficiency of silence detection. Moreover, because the silence detection model is obtained by training, the parameters of the silence detection model are continuously optimized during training, thereby improving the accuracy of silence detection.
- Fig. 5 is a structural block diagram of a terminal device according to another embodiment of the present invention.
- The terminal device 5 of this embodiment includes: a processor 50, a memory 51, and a computer program 52 stored in the memory 51 and running on the processor 50, such as a program for a neural-network-based silence detection method.
- When the processor 50 executes the computer program 52, the steps in each embodiment of the above neural-network-based silence detection method are implemented, for example, S1 to S4 shown in FIG. 1.
- Alternatively, when the processor 50 executes the computer program 52, the functions of the units in the embodiment corresponding to FIG. 4 are realized, for example, the functions of units 41 to 44 shown in FIG. 4; for details, refer to the relevant description in the embodiment corresponding to FIG. 4, which will not be repeated here.
- the computer program 52 may be divided into one or more units, and the one or more units are stored in the memory 51 and executed by the processor 50 to complete the present invention.
- the one or more units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 52 in the terminal device 5.
- the computer program 52 may be divided into a first sampling unit, a first audio processing unit, a feature value calculation unit, and a silence detection unit, and the specific functions of each unit are as described above.
- the terminal device may include, but is not limited to, a processor 50 and a memory 51.
- FIG. 5 is only an example of the terminal device 5 and does not constitute a limitation on the terminal device 5; it may include more or fewer components than shown, or combine certain components, or have different components.
- the terminal device may also include input and output devices, network access devices, buses, etc.
- The so-called processor 50 may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
- the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
- the memory 51 may be an internal storage unit of the terminal device 5, such as a hard disk or a memory of the terminal device 5.
- The memory 51 may also be an external storage device of the terminal device 5, such as a plug-in hard disk equipped on the terminal device 5, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card, etc. Further, the memory 51 may also include both an internal storage unit of the terminal device 5 and an external storage device.
- the memory 51 is used to store the computer program and other programs and data required by the terminal device.
- the memory 51 can also be used to temporarily store data that has been output or will be output.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
- When this application implements all or part of the processes in the methods of the above embodiments, this can also be accomplished by instructing relevant hardware through computer-readable instructions, and the computer-readable instructions can be stored in a computer-readable storage medium.
- Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
- Volatile memory may include random access memory (RAM) or external cache memory.
- RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
Abstract
The invention relates to a neural-network-based silence detection method, a terminal device, and a computer non-volatile readable storage medium, belonging to the technical field of artificial intelligence. The method comprises: sampling, based on a preset sampling frequency, an original audio signal to be detected, so as to obtain a sampling signal corresponding to the original audio signal (S1); performing framing processing on the sampling signal based on a preset receptive field length, so as to obtain at least two frames of audio subsequences (S2); inputting the audio subsequences into a pre-trained silence detection model so as to obtain feature values of the audio subsequences, the silence detection model being a one-dimensional convolutional neural network model, the feature values of the audio subsequences being used to characterize the probabilities that the audio segments corresponding to the audio subsequences are speech signals, and the feature values being one-dimensional values (S3); and, if the feature values of the audio subsequences are greater than or equal to a preset feature value threshold, determining that the audio segments corresponding to the audio subsequences are speech signals (S4). The efficiency and accuracy of silence detection are thereby improved.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910226470.2A CN110010153A (zh) | 2019-03-25 | 2019-03-25 | 一种基于神经网络的静音检测方法、终端设备及介质 |
CN201910226470.2 | 2019-03-25 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020192009A1 true WO2020192009A1 (fr) | 2020-10-01 |
Family
ID=67167950
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/103149 WO2020192009A1 (fr) | 2019-03-25 | 2019-08-29 | Procédé de détection de silence reposant sur un réseau neuronal, et dispositif terminal et support |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110010153A (fr) |
WO (1) | WO2020192009A1 (fr) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110010153A (zh) * | 2019-03-25 | 2019-07-12 | 平安科技(深圳)有限公司 | 一种基于神经网络的静音检测方法、终端设备及介质 |
CN111181949B (zh) * | 2019-12-25 | 2023-12-12 | 视联动力信息技术股份有限公司 | 一种声音检测方法、装置、终端设备和存储介质 |
CN114446291A (zh) * | 2020-11-04 | 2022-05-06 | 阿里巴巴集团控股有限公司 | 语音识别方法、装置、智能音箱、家电、电子设备及介质 |
CN114694636A (zh) * | 2020-12-31 | 2022-07-01 | 华为技术有限公司 | 语音识别方法及装置 |
CN116469413B (zh) * | 2023-04-03 | 2023-12-01 | 广州市迪士普音响科技有限公司 | 一种基于人工智能的压缩音频静默检测方法及装置 |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10229700B2 (en) * | 2015-09-24 | 2019-03-12 | Google Llc | Voice activity detection |
US11080591B2 (en) * | 2016-09-06 | 2021-08-03 | Deepmind Technologies Limited | Processing sequences using convolutional neural networks |
CN108346433A (zh) * | 2017-12-28 | 2018-07-31 | 北京搜狗科技发展有限公司 | 一种音频处理方法、装置、设备及可读存储介质 |
CN109036459B (zh) * | 2018-08-22 | 2019-12-27 | 百度在线网络技术(北京)有限公司 | 语音端点检测方法、装置、计算机设备、计算机存储介质 |
CN109378016A (zh) * | 2018-10-10 | 2019-02-22 | 四川长虹电器股份有限公司 | 一种基于vad的关键词识别标注方法 |
- 2019
- 2019-03-25: CN application CN201910226470.2A filed; patent CN110010153A (zh), active, Pending
- 2019-08-29: WO application PCT/CN2019/103149 filed; publication WO2020192009A1 (fr), active, Application Filing
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001086633A1 (fr) * | 2000-05-10 | 2001-11-15 | Multimedia Technologies Institute - Mti S.R.L. | Detection d'activite vocale et d'extremite de mot |
CN102693724A (zh) * | 2011-03-22 | 2012-09-26 | 张燕 | 一种基于神经网络的高斯混合模型的噪声分类方法 |
CN105427870A (zh) * | 2015-12-23 | 2016-03-23 | 北京奇虎科技有限公司 | 一种针对停顿的语音识别方法和装置 |
US20180166066A1 (en) * | 2016-12-14 | 2018-06-14 | International Business Machines Corporation | Using long short-term memory recurrent neural network for speaker diarization segmentation |
CN108428448A (zh) * | 2017-02-13 | 2018-08-21 | 芋头科技(杭州)有限公司 | 一种语音端点检测方法及语音识别方法 |
CN107393526A (zh) * | 2017-07-19 | 2017-11-24 | 腾讯科技(深圳)有限公司 | 语音静音检测方法、装置、计算机设备和存储介质 |
CN109036467A (zh) * | 2018-10-26 | 2018-12-18 | 南京邮电大学 | 基于tf-lstm的cffd提取方法、语音情感识别方法及系统 |
CN109146066A (zh) * | 2018-11-01 | 2019-01-04 | 重庆邮电大学 | 一种基于语音情感识别的虚拟学习环境自然交互方法 |
CN110010153A (zh) * | 2019-03-25 | 2019-07-12 | 平安科技(深圳)有限公司 | 一种基于神经网络的静音检测方法、终端设备及介质 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116417015A (zh) * | 2023-04-03 | 2023-07-11 | 广州市迪士普音响科技有限公司 | 一种压缩音频的静默检测方法及装置 |
CN116417015B (zh) * | 2023-04-03 | 2023-09-12 | 广州市迪士普音响科技有限公司 | 一种压缩音频的静默检测方法及装置 |
Also Published As
Publication number | Publication date |
---|---|
CN110010153A (zh) | 2019-07-12 |
Legal Events
Code | Title | Description
---|---|---
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19921058; Country of ref document: EP; Kind code of ref document: A1
NENP | Non-entry into the national phase | Ref country code: DE
122 | Ep: pct application non-entry in european phase | Ref document number: 19921058; Country of ref document: EP; Kind code of ref document: A1