WO2020238681A1 - Audio processing method and apparatus, and human-computer interaction system - Google Patents
Audio processing method and apparatus, and human-computer interaction system
- Publication number
- WO2020238681A1 (PCT/CN2020/090853)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- processed
- probability
- frame
- processing method
- Prior art date
Links
- 238000003672 processing method Methods 0.000 title claims abstract description 28
- 230000002452 interceptive effect Effects 0.000 title 1
- 238000010801 machine learning Methods 0.000 claims abstract description 23
- 238000000034 method Methods 0.000 claims abstract description 16
- 230000000875 corresponding effect Effects 0.000 claims description 35
- 239000010410 layer Substances 0.000 claims description 22
- 238000012545 processing Methods 0.000 claims description 22
- 238000012549 training Methods 0.000 claims description 13
- 238000013527 convolutional neural network Methods 0.000 claims description 10
- 238000013528 artificial neural network Methods 0.000 claims description 8
- 230000006870 function Effects 0.000 claims description 8
- 230000002596 correlated effect Effects 0.000 claims description 7
- 230000003993 interaction Effects 0.000 claims description 7
- 230000000306 recurrent effect Effects 0.000 claims description 7
- 238000004590 computer program Methods 0.000 claims description 4
- 238000003058 natural language processing Methods 0.000 claims description 4
- 239000002356 single layer Substances 0.000 claims description 4
- 230000002457 bidirectional effect Effects 0.000 claims description 3
- 125000004122 cyclic group Chemical group 0.000 claims 1
- 238000005516 engineering process Methods 0.000 abstract description 8
- 238000010586 diagram Methods 0.000 description 8
- 230000015572 biosynthetic process Effects 0.000 description 2
- 238000004891 communication Methods 0.000 description 2
- 238000010606 normalization Methods 0.000 description 2
- 230000004044 response Effects 0.000 description 2
- 238000003786 synthesis reaction Methods 0.000 description 2
- 206010011224 Cough Diseases 0.000 description 1
- 230000002411 adverse Effects 0.000 description 1
- 238000013475 authorization Methods 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 230000014509 gene expression Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000005236 sound signal Effects 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
- 238000010200 validation analysis Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
- G10L19/167—Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02087—Noise filtering the noise being separate speech, e.g. cocktail party
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Definitions
- the present disclosure relates to the field of computer technology, and in particular to an audio processing method, an audio processing device, a human-computer interaction system and a non-volatile computer-readable storage medium.
- however, various noises often exist in the environment where the user is located, such as the voices of surrounding people, environmental noise, or the speaker's cough.
- after speech recognition, such noise is mistakenly recognized as a meaningless piece of text, which interferes with semantic understanding and prevents natural language processing from establishing a reasonable dialogue flow; noise therefore greatly disturbs the human-machine intelligent interaction process.
- in the related art, whether an audio file is noise or valid sound is generally determined according to the energy of the audio signal.
- an audio processing method including: determining, according to the feature information of each frame in the audio to be processed, the probability that the frame belongs to each candidate character by using a machine learning model; judging whether the candidate character corresponding to the maximum probability parameter of each frame is a blank character or a non-blank character, the maximum probability parameter being the maximum value among the probabilities that the frame belongs to the candidate characters; in the case where the candidate character corresponding to the maximum probability parameter of a frame is a non-blank character, determining the maximum probability parameter as an effective probability; and judging, according to the effective probabilities, whether the audio to be processed is valid speech or noise.
- the judging whether the audio to be processed is valid speech or noise according to the effective probabilities includes: calculating the confidence of the audio to be processed according to the weighted sum of the effective probabilities; and judging, according to the confidence, whether the audio to be processed is valid speech or noise.
- the calculating the confidence of the audio to be processed according to the weighted sum of the effective probabilities includes: calculating the confidence according to the weighted sum of the effective probabilities and the number of the effective probabilities.
- the confidence degree is positively correlated with the weighted sum of the effective probabilities, and negatively correlated with the number of the effective probabilities.
- in the case where the audio to be processed has no effective probability, the audio to be processed is determined to be noise.
- the feature information is obtained by performing short-time Fourier transform on each frame in a sliding window manner.
- the machine learning model includes a convolutional neural network layer, a recurrent neural network layer, a fully connected layer, and a Softmax layer in sequence.
- the convolutional neural network layer is a convolutional neural network with a double-layer structure
- the recurrent neural network layer is a bidirectional recurrent neural network with a single-layer structure
- the machine learning model is trained by the following steps: extracting multiple labeled speech segments of different lengths from training data as training samples, the training data being audio files collected in a customer service scenario and the corresponding manually annotated text; and training the machine learning model using a connectionist temporal classification (CTC) function as the loss function.
- the audio processing method further includes: in the case where the judgment result is valid speech, determining the text information corresponding to the to-be-processed audio according to the candidate characters corresponding to the effective probabilities determined by the machine learning model; and in the case where the judgment result is noise, discarding the audio to be processed.
- the audio processing method further includes: using a natural language processing method to perform semantic understanding on the text information; and determining the voice signal corresponding to the to-be-processed audio to be output according to the result of the semantic understanding .
- an audio processing device including: a probability determination unit configured to determine, according to the feature information of each frame in the audio to be processed, the probability that the frame belongs to each candidate character by using a machine learning model;
- the character judgment unit is used to judge whether the candidate character corresponding to the maximum probability parameter of each frame is a blank character or a non-blank character, and the maximum probability parameter is the probability of each frame belonging to each candidate character The maximum value;
- the validity determination unit is used to determine the maximum probability parameter as the effective probability when the candidate character corresponding to the maximum probability parameter of each frame is a non-blank character;
- the noise judgment unit is used to judge, according to the effective probabilities, whether the audio to be processed is valid speech or noise.
- an audio processing device including: a memory; and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the audio processing method in any of the above embodiments.
- a human-computer interaction system including: a receiving device for receiving the audio to be processed from a user; a processor for executing the audio processing method in any of the above embodiments; and an output device for outputting the speech signal corresponding to the audio to be processed.
- a non-volatile computer-readable storage medium having a computer program stored thereon, and when the program is executed by a processor, the audio processing method in any of the above embodiments is implemented.
- Figure 1 shows a flowchart of some embodiments of the audio processing method of the present disclosure
- FIG. 2 shows a schematic diagram of some embodiments of step 110 in FIG. 1;
- FIG. 3 shows a flowchart of some embodiments of step 150 in FIG. 1;
- Figure 4 shows a block diagram of some embodiments of the audio processing device of the present disclosure
- Figure 5 shows a block diagram of other embodiments of audio processing of the present disclosure
- Fig. 6 shows a block diagram of further embodiments of audio processing of the present disclosure.
- the inventors of the present disclosure found that the above-mentioned related technologies have the following problems: due to the large differences in the speaking style, voice size, and surrounding environment of different users, the energy judgment threshold is difficult to set, resulting in low accuracy of noise judgment.
- the present disclosure proposes an audio processing technical solution, which can improve the accuracy of noise judgment.
- FIG. 1 shows a flowchart of some embodiments of the audio processing method of the present disclosure.
- the method includes: step 110, determining the probability that each frame belongs to each candidate character; step 120, determining whether the corresponding candidate character is a non-blank character; step 140, determining the effective probability; and step 150, Determine whether it is valid speech or noise.
- a machine learning model is used to determine the probability that each frame belongs to each candidate character according to the feature information of each frame in the audio to be processed.
- the audio to be processed may be an audio file in a customer service scenario with an 8 kHz sampling rate and a 16-bit PCM (Pulse Code Modulation) format.
- the audio to be processed has a total of T frames {1, 2, ..., t, ..., T}, where T is a positive integer and t is a positive integer less than T.
- the candidate character set may include non-blank characters such as common Chinese characters, English letters, Arabic numerals and punctuation marks, as well as the blank character <blank>. For example, the candidate character set is W = {w_1, w_2, ..., w_i, ..., w_I}, where I is a positive integer, i is a positive integer smaller than I, and w_i is the i-th candidate character.
- the probability distribution of the t-th frame of the audio to be processed over the candidate characters is P_t(W|X) = {p_t(w_1|X), p_t(w_2|X), ..., p_t(w_i|X), ..., p_t(w_I|X)}, where p_t(w_i|X) is the probability that the t-th frame belongs to w_i.
- the characters in the candidate character set can be collected and configured according to application scenarios (such as e-commerce customer service scenarios, daily communication scenarios, etc.).
- the blank character is a meaningless character, indicating that the current frame of the audio to be processed cannot correspond to any non-blank character with practical meaning in the candidate character set.
- the probability of each frame belonging to each candidate character can be determined through the embodiment in FIG. 2.
- FIG. 2 shows a schematic diagram of some embodiments of step 110 in FIG. 1.
- the feature information of the audio to be processed can be extracted by the feature extraction module.
- the feature information of each frame of the audio to be processed can be extracted by means of a sliding window.
- short-time Fourier transform is performed on the signal in the sliding window to obtain energy distribution information (Spectrogram) at different frequencies as the characteristic information.
- the size of the sliding window can be 20ms
- the step length of the sliding can be 10ms
- the obtained feature information can be an 81-dimensional vector.
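- as an illustration of this sliding-window feature extraction, the following sketch computes one spectrogram-style feature vector per 20 ms window with a 10 ms hop; with the 8 kHz audio mentioned earlier, a 160-sample window yields 81 one-sided frequency bins, matching the 81-dimensional vector, while the Hann window and the log compression are illustrative assumptions rather than details given in the text.

```python
import numpy as np

def stft_features(audio, sample_rate=8000, win_ms=20, hop_ms=10):
    """Sliding-window short-time Fourier transform features (a sketch).

    20 ms windows with a 10 ms hop; at 8 kHz a 160-sample window gives
    81 one-sided frequency bins per frame. Hann window and log compression
    are assumptions made for illustration only.
    """
    win = int(sample_rate * win_ms / 1000)    # 160 samples at 8 kHz
    hop = int(sample_rate * hop_ms / 1000)    # 80 samples at 8 kHz
    window = np.hanning(win)
    frames = []
    for start in range(0, len(audio) - win + 1, hop):
        segment = audio[start:start + win] * window
        spectrum = np.abs(np.fft.rfft(segment))     # energy at 81 frequencies
        frames.append(np.log(spectrum + 1e-8))      # log energy distribution
    return np.stack(frames)                         # shape (T, 81): one row per frame
```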
- the extracted feature information may be input into a machine learning model to determine the probability that each frame belongs to each candidate character, that is, the probability distribution of each frame for each candidate character in the candidate character set.
- the machine learning model may include a CNN (Convolutional Neural Network) with a two-layer structure, a bidirectional RNN (Recurrent Neural Network) with a single-layer structure, an FC (fully connected) layer with a single-layer structure, and a Softmax layer.
- the CNN can use strided processing to reduce the amount of computation required by the RNN.
- the candidate character set contains 2748 candidate characters in total, so the output of the machine learning model is a 2748-dimensional vector (in which each element corresponds to the probability of one candidate character).
- the last dimension of the vector can be the probability of the <blank> character.
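- a minimal PyTorch sketch of a model with this layer order (a two-layer CNN, a single-layer bidirectional RNN, a fully connected layer and a Softmax) is shown below; only the 81-dimensional input, the 2748-way output with <blank> as the last index, the layer order and the use of striding to cut the RNN's computation come from the text, while the GRU cell, channel counts, kernel sizes, strides and hidden size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Sketch: two conv layers -> one bidirectional RNN layer -> FC -> Softmax.

    Hyperparameters (channels, kernels, strides, hidden size, GRU cell) are
    assumptions; the 81-dim features and 2748 candidate characters (with the
    last index reserved for <blank>) follow the description.
    """

    def __init__(self, feat_dim=81, num_chars=2748, hidden=512):
        super().__init__()
        # Two convolutional layers; striding along time reduces the RNN workload.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 11), stride=(2, 1), padding=(5, 5)),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(11, 11), stride=(2, 1), padding=(5, 5)),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(32 * feat_dim, hidden, num_layers=1,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_chars)

    def forward(self, x):                              # x: (batch, T, 81)
        x = self.cnn(x.unsqueeze(1))                   # (batch, 32, T', 81)
        x = x.permute(0, 2, 1, 3).flatten(2)           # (batch, T', 32 * 81)
        x, _ = self.rnn(x)                             # (batch, T', 2 * hidden)
        return torch.log_softmax(self.fc(x), dim=-1)   # per-frame log-probabilities
```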
- the audio files collected in the customer service scene and the corresponding manually labeled text may be used as training data.
- the training sample may be a plurality of labeled speech segments with different lengths (for example, 1 second to 10 seconds) extracted from the training data.
- a CTC (Connectionist Temporal Classification) function may be used as a loss function for training.
- the CTC function can make the output of the machine learning model have sparse spike characteristics, that is, the candidate characters corresponding to the maximum probability parameter of most frames are blank characters, and the candidate characters corresponding to the maximum probability parameter of only a few frames are non-blank characters. In this way, the processing efficiency of the system can be improved.
- the machine learning model can be trained in a SortaGrad manner, that is, the first epoch is trained in the order of the sample length from small to large, thereby improving the convergence speed of training. For example, after 20 epochs of training, the model with the best performance on the validation set can be selected as the final machine learning model.
- a sequential batch normalization (Seq-wise Batch Normalization) method may be used to improve the speed and accuracy of RNN training.
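- a rough training-loop sketch consistent with the recipe above (CTC loss, SortaGrad ordering of the first epoch, about 20 epochs) is given below using PyTorch's built-in CTCLoss; the Adam optimizer, learning rate and per-utterance updates are assumptions, sequence-wise batch normalization is not shown, and `model` can be any module mapping (batch, T, 81) features to per-frame log-probabilities, such as the sketch above.

```python
import torch
import torch.nn as nn

def train_ctc(model, samples, epochs=20, lr=3e-4, blank_id=2747):
    """Sketch of CTC training with SortaGrad ordering of the first epoch.

    `samples` is a list of (features, target_ids) pairs: features of shape
    (T, 81) and target_ids a 1-D LongTensor of character indices. The blank
    index 2747 assumes <blank> is the last of 2748 characters, as described;
    the optimizer, learning rate and per-utterance updates are assumptions.
    """
    ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        # SortaGrad: train the first epoch in order of increasing utterance length.
        order = sorted(samples, key=lambda s: len(s[0])) if epoch == 0 else samples
        for feats, target in order:
            log_probs = model(feats.unsqueeze(0)).permute(1, 0, 2)  # (T', 1, C) for CTCLoss
            input_len = torch.tensor([log_probs.size(0)])
            target_len = torch.tensor([len(target)])
            loss = ctc(log_probs, target.unsqueeze(0), input_len, target_len)
            opt.zero_grad()
            loss.backward()
            opt.step()
```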
- step 120 it is determined whether the candidate character corresponding to the maximum probability parameter of each frame is a blank character or a non-blank character.
- the maximum probability parameter is the maximum value among the probabilities that the frame belongs to the candidate characters. For example, the maximum value among p_t(w_1|X), p_t(w_2|X), ..., p_t(w_I|X) is the maximum probability parameter of the t-th frame.
- in the case where the candidate character corresponding to the maximum probability parameter is a non-blank character, step 140 is executed. In some embodiments, when the candidate character corresponding to the maximum probability parameter is a blank character, step 130 is executed to determine the maximum probability parameter as an invalid probability.
- step 130 the maximum probability parameter is determined as the invalid probability.
- step 140 the maximum probability parameter is determined as the effective probability.
- step 150 it is judged whether the audio to be processed is valid speech or noise according to each valid probability.
- step 150 may be implemented through the embodiment in FIG. 3.
- FIG. 3 shows a flowchart of some embodiments of step 150 in FIG. 1.
- step 150 includes: step 1510, calculating the confidence level; and step 1520, determining whether it is valid speech or noise.
- the confidence level of the audio to be processed is calculated according to the weighted sum of the effective probabilities.
- the confidence level can be calculated based on the weighted sum of each effective probability and the number of each effective probability. Confidence is positively correlated with the weighted sum of each effective probability, and negatively correlated with the number of each effective probability.
- the confidence can be calculated by a formula in which the numerator is the weighted sum of the maximum probability parameters of the frames of the audio to be processed (a maximum probability parameter corresponding to a blank character, i.e. an invalid probability, is weighted 0, and one corresponding to a non-blank character, i.e. an effective probability, is weighted 1) and the denominator is the number of effective probabilities.
- different weights (for example, weights greater than 0) can also be set according to the non-blank characters corresponding to the effective probabilities (such as their specific semantics, the application scenario, or their importance in the dialogue), thereby improving the accuracy of noise judgment.
- step 1520 according to the confidence level, it is determined whether the audio to be processed is valid speech or noise. For example, in the above situation, the greater the degree of confidence, the greater the probability that the voice to be processed will be judged as valid. Therefore, when the confidence level is greater than or equal to the threshold, it can be determined that the voice to be processed is valid; when the confidence level is less than the threshold, the voice to be processed can be judged to be noise.
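- steps 120 to 150 can be illustrated with the short sketch below: the most probable candidate character is taken for each frame, frames whose best candidate is the blank character are dropped, and the remaining effective probabilities are combined into a confidence that is compared with a threshold. Uniform weights and the 0.6 threshold are assumptions, and the "weighted sum divided by count" form of the confidence is one reading consistent with the correlations described above rather than the exact published formula; when the audio is judged to be valid speech, the non-blank characters at the retained frames then serve as the basis of the recognized text, as described in the following paragraphs.

```python
import numpy as np

def classify_audio(probs, blank_id, threshold=0.6, weights=None):
    """Sketch of steps 120-150: blank/non-blank test per frame, then confidence.

    `probs` has shape (T, C): per-frame probabilities over candidate characters.
    The uniform weights and the 0.6 threshold are illustrative assumptions; the
    confidence is the weighted sum of effective probabilities divided by their
    count, one form consistent with the description above.
    """
    best_char = probs.argmax(axis=1)              # step 120: most probable character
    best_prob = probs.max(axis=1)                 # maximum probability parameter
    effective = best_char != blank_id             # step 140: keep non-blank frames only
    if not effective.any():
        return "noise", 0.0                       # no effective probability -> noise
    w = np.ones(int(effective.sum())) if weights is None else weights
    confidence = float((w * best_prob[effective]).sum() / effective.sum())
    label = "valid speech" if confidence >= threshold else "noise"   # step 150
    return label, confidence
```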
- the text information corresponding to the audio to be processed may be determined according to the candidate characters corresponding to the valid probability determined by the machine learning model. In this way, the noise judgment and speech recognition of the audio to be processed can be completed at the same time.
- the computer can perform subsequent processing such as semantic understanding (such as natural language processing) on the determined text information, so that the computer can understand the semantics of the audio to be processed.
- a response text corresponding to the semantic understanding result can be generated based on semantic understanding, and a speech signal can be synthesized based on the response text.
- the audio to be processed may be directly discarded, and no subsequent processing is performed. In this way, the adverse effects of noise on subsequent processing such as semantic understanding and speech synthesis can be effectively reduced, thereby improving the accuracy of speech recognition and the processing efficiency of the system.
- the validity of the audio to be processed is determined, and then whether the audio to be processed is noise is determined.
- noise judgment based on the semantics of the audio to be processed can better adapt to different voice environments and the voice volume of different users, thereby improving the accuracy of noise judgment.
- Figure 4 shows a block diagram of some embodiments of the audio processing apparatus of the present disclosure.
- the audio processing device 4 includes a probability determination unit 41, a character determination unit 42, a validity determination unit 43, and a noise determination unit 44.
- the probability determination unit 41 uses a machine learning model to determine the probability that each frame belongs to each candidate character according to the feature information of each frame in the audio to be processed.
- the feature information is obtained by performing short-time Fourier transform on each frame by means of a sliding window.
- the machine learning model can sequentially include a convolutional neural network layer, a recurrent neural network layer, a fully connected layer, and a Softmax layer.
- the character judgment unit 42 judges whether the candidate character corresponding to the maximum probability parameter of each frame is a blank character or a non-blank character.
- the maximum probability parameter is the maximum value of the probability of each frame belonging to each candidate character.
- in the case where the candidate character corresponding to the maximum probability parameter of each frame is a non-blank character, the validity determining unit 43 determines the maximum probability parameter as the valid probability. In some embodiments, when the candidate character corresponding to the maximum probability parameter of each frame is a blank character, the validity determining unit 43 determines the maximum probability parameter as an invalid probability.
- the noise judging unit 44 judges whether the audio to be processed is valid speech or noise according to each effective probability. For example, in the case where there is no effective probability for the audio to be processed, the target audio is judged to be noise.
- the noise determination unit 44 calculates the confidence level of the audio to be processed according to the weighted sum of the effective probabilities.
- the noise judging unit 44 judges whether the audio to be processed is valid speech or noise according to the confidence level. For example, the noise judging unit 44 calculates the degree of confidence based on the weighted sum of each effective probability and the number of each effective probability. Confidence is positively correlated with the weighted sum of each effective probability, and negatively correlated with the number of each effective probability.
- the validity of the audio to be processed is determined, and then whether the audio to be processed is noise is determined.
- noise judgment based on the semantics of the audio to be processed can better adapt to different voice environments and the voice volume of different users, thereby improving the accuracy of noise judgment.
- Fig. 5 shows a block diagram of other embodiments of audio processing of the present disclosure.
- the audio processing device 5 of this embodiment includes a memory 51 and a processor 52 coupled to the memory 51.
- the processor 52 is configured to execute, based on instructions stored in the memory 51, the audio processing method in any one of the embodiments of the present disclosure.
- the memory 51 may include, for example, a system memory, a fixed non-volatile storage medium, and the like.
- the system memory stores, for example, an operating system, an application program, a boot loader (Boot Loader), a database, and other programs.
- Fig. 6 shows a block diagram of further embodiments of audio processing of the present disclosure.
- the audio processing device 6 of this embodiment includes a memory 610 and a processor 620 coupled to the memory 610.
- the processor 620 is configured to execute, based on instructions stored in the memory 610, the audio processing method in any one of the foregoing embodiments.
- the memory 610 may include, for example, a system memory, a fixed non-volatile storage medium, and the like.
- the system memory for example, stores an operating system, an application program, a boot loader (Boot Loader), and other programs.
- the audio processing device 6 may also include an input and output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, and the memory 610 and the processor 620 may be connected via a bus 660, for example.
- the input and output interface 630 provides connection interfaces for input and output devices such as a display, a mouse, a keyboard, a touch screen, a microphone, and a speaker.
- the network interface 640 provides a connection interface for various networked devices.
- the storage interface 650 provides a connection interface for external storage devices such as SD cards and U disks.
- the embodiments of the present disclosure may be provided as methods, systems, or computer program products. Therefore, the present disclosure may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the present disclosure may be in the form of a computer program product implemented on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes. .
- the method and system of the present disclosure may be implemented in many ways.
- the method and system of the present disclosure can be implemented by software, hardware, firmware or any combination of software, hardware, and firmware.
- the above-mentioned order of the steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above, unless otherwise specifically stated.
- the present disclosure can also be implemented as programs recorded in a recording medium, and these programs include machine-readable instructions for implementing the method according to the present disclosure.
- the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Mathematical Optimization (AREA)
- Algebra (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Mathematics (AREA)
- Molecular Biology (AREA)
- Mathematical Analysis (AREA)
- Pure & Applied Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
An audio processing method, an audio processing apparatus and a computer-readable storage medium, relating to the field of computer technology. The method includes: determining, according to the feature information of each frame in audio to be processed, the probability that the frame belongs to each candidate character by using a machine learning model; judging whether the candidate character corresponding to the maximum probability parameter of each frame is a blank character or a non-blank character, the maximum probability parameter being the maximum value among the probabilities that the frame belongs to the candidate characters; in the case where the candidate character corresponding to the maximum probability parameter of a frame is a non-blank character, determining the maximum probability parameter as an effective probability of the audio to be processed; and judging, according to the effective probabilities of the audio to be processed, whether the audio to be processed is valid speech or noise. The accuracy of noise judgment can thereby be improved.
Description
Cross-Reference to Related Application
This application is based on and claims priority to the CN application No. 201910467088.0 filed on May 31, 2019, the disclosure of which is hereby incorporated into this application in its entirety.
The present disclosure relates to the field of computer technology, and in particular to an audio processing method, an audio processing apparatus, a human-computer interaction system and a non-volatile computer-readable storage medium.
With the continuous development of technology, human-computer intelligent interaction has made great progress in recent years, and intelligent voice interaction is being applied more and more in customer service scenarios.
However, various noises often exist in the environment where the user is located (such as the voices of surrounding people, environmental noise, or the speaker's cough). After speech recognition, such noise is mistakenly recognized as a meaningless piece of text, which interferes with semantic understanding and prevents natural language processing from establishing a reasonable dialogue flow. Noise therefore greatly disturbs the human-computer intelligent interaction process.
In the related art, whether an audio file is noise or valid sound is generally determined according to the energy of the audio signal.
Summary of the Invention
According to some embodiments of the present disclosure, an audio processing method is provided, including: determining, according to feature information of each frame in audio to be processed, the probability that the frame belongs to each candidate character by using a machine learning model; judging whether the candidate character corresponding to the maximum probability parameter of each frame is a blank character or a non-blank character, the maximum probability parameter being the maximum value among the probabilities that the frame belongs to the candidate characters; in the case where the candidate character corresponding to the maximum probability parameter of a frame is a non-blank character, determining the maximum probability parameter as an effective probability; and judging, according to the effective probabilities, whether the audio to be processed is valid speech or noise.
In some embodiments, judging whether the audio to be processed is valid speech or noise according to the effective probabilities includes: calculating a confidence of the audio to be processed according to a weighted sum of the effective probabilities; and judging, according to the confidence, whether the audio to be processed is valid speech or noise.
In some embodiments, calculating the confidence of the audio to be processed according to the weighted sum of the effective probabilities includes: calculating the confidence according to the weighted sum of the effective probabilities and the number of the effective probabilities, the confidence being positively correlated with the weighted sum of the effective probabilities and negatively correlated with the number of the effective probabilities.
In some embodiments, in the case where the audio to be processed has no effective probability, the audio to be processed is judged to be noise.
In some embodiments, the feature information is obtained by performing a short-time Fourier transform on each frame in a sliding-window manner.
In some embodiments, the machine learning model includes, in sequence, a convolutional neural network layer, a recurrent neural network layer, a fully connected layer and a Softmax layer.
In some embodiments, the convolutional neural network layer is a convolutional neural network with a two-layer structure, and the recurrent neural network layer is a bidirectional recurrent neural network with a single-layer structure.
In some embodiments, the machine learning model is trained by the following steps: extracting a plurality of labeled speech segments of different lengths from training data as training samples, the training data being audio files collected in a customer service scenario and the corresponding manually annotated text; and training the machine learning model using a connectionist temporal classification (CTC) function as the loss function.
In some embodiments, the audio processing method further includes: in the case where the judgment result is valid speech, determining text information corresponding to the audio to be processed according to the candidate characters corresponding to the effective probabilities determined by the machine learning model; and in the case where the judgment result is noise, discarding the audio to be processed.
In some embodiments, the audio processing method further includes: performing semantic understanding on the text information by using a natural language processing method; and determining, according to the result of the semantic understanding, the speech signal corresponding to the audio to be processed to be output.
According to other embodiments of the present disclosure, an audio processing apparatus is provided, including: a probability determination unit configured to determine, according to feature information of each frame in audio to be processed, the probability that the frame belongs to each candidate character by using a machine learning model; a character judgment unit configured to judge whether the candidate character corresponding to the maximum probability parameter of each frame is a blank character or a non-blank character, the maximum probability parameter being the maximum value among the probabilities that the frame belongs to the candidate characters; a validity determination unit configured to determine the maximum probability parameter as an effective probability in the case where the candidate character corresponding to the maximum probability parameter of a frame is a non-blank character; and a noise judgment unit configured to judge, according to the effective probabilities, whether the audio to be processed is valid speech or noise.
According to still other embodiments of the present disclosure, an audio processing apparatus is provided, including: a memory; and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the audio processing method in any one of the above embodiments.
According to still other embodiments of the present disclosure, a human-computer interaction system is provided, including: a receiving device for receiving audio to be processed from a user; a processor for executing the audio processing method in any one of the above embodiments; and an output device for outputting the speech signal corresponding to the audio to be processed.
According to further embodiments of the present disclosure, a non-volatile computer-readable storage medium is provided, on which a computer program is stored, the program implementing the audio processing method in any one of the above embodiments when executed by a processor.
The accompanying drawings, which constitute a part of the specification, describe embodiments of the present disclosure and, together with the specification, serve to explain the principles of the present disclosure.
The present disclosure can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
Fig. 1 shows a flowchart of some embodiments of the audio processing method of the present disclosure;
Fig. 2 shows a schematic diagram of some embodiments of step 110 in Fig. 1;
Fig. 3 shows a flowchart of some embodiments of step 150 in Fig. 1;
Fig. 4 shows a block diagram of some embodiments of the audio processing apparatus of the present disclosure;
Fig. 5 shows a block diagram of other embodiments of the audio processing apparatus of the present disclosure;
Fig. 6 shows a block diagram of still other embodiments of the audio processing apparatus of the present disclosure.
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specifically stated, the relative arrangement of the components and steps, the numerical expressions and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
Meanwhile, it should be understood that, for convenience of description, the dimensions of the various parts shown in the drawings are not drawn to actual scale.
The following description of at least one exemplary embodiment is merely illustrative and in no way serves as any limitation on the present disclosure or its application or use.
Techniques, methods and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods and devices should be regarded as part of the specification.
In all examples shown and discussed here, any specific value should be interpreted as merely exemplary rather than limiting. Therefore, other examples of the exemplary embodiments may have different values.
It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it need not be further discussed in subsequent figures.
The inventors of the present disclosure have found that the above related art has the following problem: since different users differ greatly in speaking style, voice volume and surrounding environment, the energy decision threshold is difficult to set, which results in low accuracy of noise judgment.
In view of this, the present disclosure proposes an audio processing technical solution that can improve the accuracy of noise judgment.
Fig. 1 shows a flowchart of some embodiments of the audio processing method of the present disclosure.
As shown in Fig. 1, the method includes: step 110, determining the probability that each frame belongs to each candidate character; step 120, judging whether the corresponding candidate character is a non-blank character; step 140, determining the maximum probability parameter as an effective probability; and step 150, judging whether the audio is valid speech or noise.
In step 110, according to the feature information of each frame in the audio to be processed, a machine learning model is used to determine the probability that the frame belongs to each candidate character. For example, the audio to be processed may be an audio file in a customer service scenario with an 8 kHz sampling rate and a 16-bit PCM (Pulse Code Modulation) format.
In some embodiments, the audio to be processed has a total of T frames {1, 2, ..., t, ..., T}, where T is a positive integer and t is a positive integer less than T. The feature information of the audio to be processed is X = {x_1, x_2, ..., x_t, ..., x_T}, where x_t is the feature information of the t-th frame.
In some embodiments, the candidate character set may include non-blank characters such as common Chinese characters, English letters, Arabic numerals and punctuation marks, as well as the blank character <blank>. For example, the candidate character set is W = {w_1, w_2, ..., w_i, ..., w_I}, where I is a positive integer, i is a positive integer smaller than I, and w_i is the i-th candidate character.
In some embodiments, the probability distribution of the t-th frame of the audio to be processed over the candidate characters is P_t(W|X) = {p_t(w_1|X), p_t(w_2|X), ..., p_t(w_i|X), ..., p_t(w_I|X)}, where p_t(w_i|X) is the probability that the t-th frame belongs to w_i.
For example, the characters in the candidate character set can be collected and configured according to the application scenario (such as an e-commerce customer service scenario or a daily communication scenario). The blank character is a meaningless character, indicating that the current frame of the audio to be processed cannot correspond to any non-blank character with actual meaning in the candidate character set.
In some embodiments, the probability that each frame belongs to each candidate character can be determined through the embodiment in Fig. 2.
Fig. 2 shows a schematic diagram of some embodiments of step 110 in Fig. 1.
As shown in Fig. 2, the feature information of the audio to be processed can be extracted by a feature extraction module. For example, the feature information of each frame of the audio to be processed can be extracted in a sliding-window manner: a short-time Fourier transform is performed on the signal within the sliding window to obtain the energy distribution information at different frequencies (a spectrogram) as the feature information. The size of the sliding window can be 20 ms, the sliding step can be 10 ms, and the obtained feature information can be an 81-dimensional vector.
In some embodiments, the extracted feature information can be input into the machine learning model to determine the probability that each frame belongs to each candidate character, that is, the probability distribution of each frame over the candidate characters in the candidate character set. For example, the machine learning model may include a CNN (Convolutional Neural Network) with a two-layer structure, a bidirectional RNN (Recurrent Neural Network) with a single-layer structure, an FC (fully connected) layer with a single-layer structure, and a Softmax layer. The CNN can use strided processing to reduce the amount of computation required by the RNN.
In some embodiments, the candidate character set contains 2748 candidate characters in total, so the output of the machine learning model is a 2748-dimensional vector (in which each element corresponds to the probability of one candidate character). For example, the last dimension of the vector can be the probability of the <blank> character.
In some embodiments, the audio files collected in a customer service scenario and the corresponding manually annotated text can be used as training data. For example, the training samples can be a plurality of labeled speech segments of different lengths (e.g., 1 second to 10 seconds) extracted from the training data.
In some embodiments, a CTC (Connectionist Temporal Classification) function can be used as the loss function for training. The CTC function can make the output of the machine learning model exhibit sparse spike characteristics, that is, the candidate characters corresponding to the maximum probability parameters of most frames are blank characters, and only a few frames have maximum probability parameters corresponding to non-blank characters. In this way, the processing efficiency of the system can be improved.
In some embodiments, the machine learning model can be trained in the SortaGrad manner, that is, the first epoch is trained in order of increasing sample length, thereby improving the convergence speed of training. For example, after 20 epochs of training, the model that performs best on the validation set can be selected as the final machine learning model.
In some embodiments, sequence-wise batch normalization (Seq-wise Batch Normalization) can be used to improve the speed and accuracy of RNN training.
After the probability distribution is determined, noise judgment can be completed through the remaining steps in Fig. 1.
In step 120, it is judged whether the candidate character corresponding to the maximum probability parameter of each frame is a blank character or a non-blank character. The maximum probability parameter is the maximum value among the probabilities that the frame belongs to the candidate characters. For example, the maximum value among p_t(w_1|X), p_t(w_2|X), ..., p_t(w_i|X), ..., p_t(w_I|X) is the maximum probability parameter of the t-th frame.
In the case where the candidate character corresponding to the maximum probability parameter is a non-blank character, step 140 is executed. In some embodiments, in the case where the candidate character corresponding to the maximum probability parameter is a blank character, step 130 is executed to determine the maximum probability parameter as an invalid probability.
In step 130, the maximum probability parameter is determined as an invalid probability.
In step 140, the maximum probability parameter is determined as an effective probability.
In step 150, according to the effective probabilities, it is judged whether the audio to be processed is valid speech or noise.
In some embodiments, step 150 can be implemented through the embodiment in Fig. 3.
Fig. 3 shows a flowchart of some embodiments of step 150 in Fig. 1.
As shown in Fig. 3, step 150 includes: step 1510, calculating the confidence; and step 1520, judging whether the audio is valid speech or noise.
In step 1510, the confidence of the audio to be processed is calculated according to the weighted sum of the effective probabilities. For example, the confidence can be calculated according to the weighted sum of the effective probabilities and the number of the effective probabilities. The confidence is positively correlated with the weighted sum of the effective probabilities and negatively correlated with the number of the effective probabilities.
In some embodiments, the confidence can be calculated by a formula in which a weighting function F is applied to the maximum probability parameter of each frame: the numerator is the weighted sum of the maximum probability parameters of the frames of the audio to be processed, where a maximum probability parameter whose candidate character is a blank character (i.e., an invalid probability) is given a weight of 0 and a maximum probability parameter whose candidate character is a non-blank character (i.e., an effective probability) is given a weight of 1, and the denominator is the number of maximum probability parameters corresponding to non-blank characters. For example, in the case where the audio to be processed has no effective probability (i.e., the denominator is 0), the audio to be processed is judged to be noise (i.e., the confidence α is defined as 0).
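The confidence formula and the definition of the weighting function F referenced above appear only as images in the published document and are not reproduced here; a plausible reconstruction that is consistent with the surrounding description, but should not be read as the verbatim published formula, is

```latex
\alpha \;=\; \frac{\displaystyle\sum_{t=1}^{T} F\!\left(w_{t}^{*}\right)\, p_{t}\!\left(w_{t}^{*} \mid X\right)}
                  {\displaystyle\sum_{t=1}^{T} F\!\left(w_{t}^{*}\right)},
\qquad
F(w) \;=\;
\begin{cases}
1, & w \neq \langle \mathrm{blank} \rangle,\\
0, & w = \langle \mathrm{blank} \rangle,
\end{cases}
```

where w_t^* is the candidate character with the maximum probability in frame t, p_t(w_t^*|X) is the corresponding maximum probability parameter, α is the confidence, and α is defined as 0 when the denominator is 0.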
In some embodiments, different weights (for example, weights greater than 0) can also be set according to the non-blank characters corresponding to the effective probabilities (for example, according to their specific semantics, the application scenario, or their importance in the dialogue), thereby improving the accuracy of noise judgment.
In step 1520, according to the confidence, it is judged whether the audio to be processed is valid speech or noise. For example, in the above case, the greater the confidence, the more likely the audio to be processed is to be judged as valid speech. Therefore, in the case where the confidence is greater than or equal to a threshold, the audio to be processed can be judged to be valid speech; in the case where the confidence is less than the threshold, the audio to be processed can be judged to be noise.
In some embodiments, in the case where the judgment result is valid speech, the text information corresponding to the audio to be processed can be determined according to the candidate characters corresponding to the effective probabilities determined by the machine learning model. In this way, noise judgment and speech recognition of the audio to be processed can be completed at the same time.
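One way to turn the retained per-frame characters into text is the standard greedy CTC decoding sketched below; the step of collapsing repeated characters between blanks is common CTC practice and an illustrative assumption here, not a rule spelled out in this document.

```python
def greedy_ctc_decode(best_char, charset, blank_id):
    """Illustrative greedy CTC decoding: drop blanks, collapse repeats.

    `best_char` is the per-frame index of the most probable candidate character
    (the argmax of the model output) and `charset` maps indices to characters.
    Collapsing consecutive repeats is standard CTC practice, assumed here.
    """
    text = []
    prev = blank_id
    for idx in best_char:
        if idx != blank_id and idx != prev:    # keep non-blank, non-repeated frames
            text.append(charset[idx])
        prev = idx
    return "".join(text)
```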
In some embodiments, a computer can perform subsequent processing such as semantic understanding (for example, natural language processing) on the determined text information, so that the computer can understand the semantics of the audio to be processed. For example, a speech signal can be output after speech synthesis based on the semantic understanding, thereby realizing intelligent human-computer communication. For example, a response text corresponding to the result of the semantic understanding can be generated, and a speech signal can be synthesized from the response text.
In some embodiments, in the case where the judgment result is noise, the audio to be processed can be directly discarded without any subsequent processing. In this way, the adverse effects of noise on subsequent processing such as semantic understanding and speech synthesis can be effectively reduced, thereby improving the accuracy of speech recognition and the processing efficiency of the system.
In the above embodiments, the validity of the audio to be processed is determined according to the probability that the candidate character corresponding to each frame of the audio to be processed is a non-blank character, and it is then judged whether the audio to be processed is noise. In this way, noise judgment is based on the semantics of the audio to be processed, which can better adapt to different speech environments and the speech volume of different users, thereby improving the accuracy of noise judgment.
Fig. 4 shows a block diagram of some embodiments of the audio processing apparatus of the present disclosure.
As shown in Fig. 4, the audio processing apparatus 4 includes a probability determination unit 41, a character judgment unit 42, a validity determination unit 43 and a noise judgment unit 44.
The probability determination unit 41 determines, according to the feature information of each frame in the audio to be processed, the probability that the frame belongs to each candidate character by using the machine learning model. For example, the feature information is obtained by performing a short-time Fourier transform on each frame in a sliding-window manner. The machine learning model may include, in sequence, a convolutional neural network layer, a recurrent neural network layer, a fully connected layer and a Softmax layer.
The character judgment unit 42 judges whether the candidate character corresponding to the maximum probability parameter of each frame is a blank character or a non-blank character. The maximum probability parameter is the maximum value among the probabilities that the frame belongs to the candidate characters.
In the case where the candidate character corresponding to the maximum probability parameter of a frame is a non-blank character, the validity determination unit 43 determines the maximum probability parameter as an effective probability. In some embodiments, in the case where the candidate character corresponding to the maximum probability parameter of a frame is a blank character, the validity determination unit 43 determines the maximum probability parameter as an invalid probability.
The noise judgment unit 44 judges, according to the effective probabilities, whether the audio to be processed is valid speech or noise. For example, in the case where the audio to be processed has no effective probability, the audio to be processed is judged to be noise.
In some embodiments, the noise judgment unit 44 calculates the confidence of the audio to be processed according to the weighted sum of the effective probabilities, and judges, according to the confidence, whether the audio to be processed is valid speech or noise. For example, the noise judgment unit 44 calculates the confidence according to the weighted sum of the effective probabilities and the number of the effective probabilities. The confidence is positively correlated with the weighted sum of the effective probabilities and negatively correlated with the number of the effective probabilities.
In the above embodiments, the validity of the audio to be processed is determined according to the probability that the candidate character corresponding to each frame of the audio to be processed is a non-blank character, and it is then judged whether the audio to be processed is noise. In this way, noise judgment is based on the semantics of the audio to be processed, which can better adapt to different speech environments and the speech volume of different users, thereby improving the accuracy of noise judgment.
Fig. 5 shows a block diagram of other embodiments of the audio processing apparatus of the present disclosure.
As shown in Fig. 5, the audio processing apparatus 5 of this embodiment includes: a memory 51 and a processor 52 coupled to the memory 51, the processor 52 being configured to execute, based on instructions stored in the memory 51, the audio processing method in any one of the embodiments of the present disclosure.
The memory 51 may include, for example, a system memory, a fixed non-volatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader (Boot Loader), a database and other programs.
Fig. 6 shows a block diagram of still other embodiments of the audio processing apparatus of the present disclosure.
As shown in Fig. 6, the audio processing apparatus 6 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, the processor 620 being configured to execute, based on instructions stored in the memory 610, the audio processing method in any one of the foregoing embodiments.
The memory 610 may include, for example, a system memory, a fixed non-volatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader (Boot Loader) and other programs.
The audio processing apparatus 6 may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640 and 650, the memory 610 and the processor 620 may be connected, for example, via a bus 660. The input/output interface 630 provides connection interfaces for input and output devices such as a display, a mouse, a keyboard, a touch screen, a microphone and a speaker. The network interface 640 provides a connection interface for various networked devices. The storage interface 650 provides connection interfaces for external storage devices such as SD cards and USB flash drives.
Those skilled in the art should understand that the embodiments of the present disclosure may be provided as methods, systems or computer program products. Therefore, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
So far, the audio processing method, the audio processing apparatus, the human-computer interaction system and the non-volatile computer-readable storage medium according to the present disclosure have been described in detail. Some details well known in the art are not described in order to avoid obscuring the concept of the present disclosure. Those skilled in the art can fully understand how to implement the technical solutions disclosed here from the above description.
The method and system of the present disclosure may be implemented in many ways. For example, the method and system of the present disclosure can be implemented by software, hardware, firmware or any combination of software, hardware and firmware. The above order of the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above, unless otherwise specifically stated. In addition, in some embodiments, the present disclosure can also be implemented as programs recorded in a recording medium, and these programs include machine-readable instructions for implementing the method according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Although some specific embodiments of the present disclosure have been described in detail through examples, those skilled in the art should understand that the above examples are only for illustration and not for limiting the scope of the present disclosure. Those skilled in the art should understand that the above embodiments can be modified without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.
Claims (14)
- An audio processing method, comprising: determining, according to feature information of each frame in audio to be processed, a probability that the frame belongs to each candidate character by using a machine learning model; judging whether a candidate character corresponding to a maximum probability parameter of each frame is a blank character or a non-blank character, the maximum probability parameter being the maximum value among the probabilities that the frame belongs to the candidate characters; in the case where the candidate character corresponding to the maximum probability parameter of a frame is a non-blank character, determining the maximum probability parameter as an effective probability of the audio to be processed; and judging, according to the effective probabilities of the audio to be processed, whether the audio to be processed is valid speech or noise.
- The audio processing method according to claim 1, wherein judging, according to the effective probabilities of the audio to be processed, whether the audio to be processed is valid speech or noise comprises: calculating a confidence of the audio to be processed according to a weighted sum of the effective probabilities; and judging, according to the confidence, whether the audio to be processed is valid speech or noise.
- The audio processing method according to claim 2, wherein calculating the confidence of the audio to be processed according to the weighted sum of the effective probabilities of the audio to be processed comprises: calculating the confidence according to the weighted sum of the effective probabilities and the number of the effective probabilities, the confidence being positively correlated with the weighted sum of the effective probabilities and negatively correlated with the number of the effective probabilities.
- The audio processing method according to claim 1, wherein judging, according to the effective probabilities corresponding to the audio to be processed, whether the audio to be processed is valid speech or noise comprises: in the case where the audio to be processed has no effective probability, judging that the audio to be processed is noise.
- The audio processing method according to any one of claims 1-4, wherein the feature information is energy distribution information at different frequencies obtained by performing a short-time Fourier transform on each frame in a sliding-window manner.
- The audio processing method according to any one of claims 1-4, wherein the machine learning model comprises, in sequence, a convolutional neural network layer, a recurrent neural network layer, a fully connected layer and a Softmax layer.
- The audio processing method according to claim 6, wherein the convolutional neural network layer is a convolutional neural network with a two-layer structure, and the recurrent neural network layer is a bidirectional recurrent neural network with a single-layer structure.
- The audio processing method according to any one of claims 1-4, wherein the machine learning model is trained by the following steps: extracting a plurality of labeled speech segments of different lengths from training data as training samples, the training data being audio files collected in a customer service scenario and corresponding manually annotated text; and training the machine learning model using a connectionist temporal classification (CTC) function as the loss function.
- The audio processing method according to any one of claims 1-4, further comprising: in the case where the judgment result is valid speech, determining text information corresponding to the audio to be processed according to the candidate characters corresponding to the effective probabilities determined by the machine learning model; and in the case where the judgment result is noise, discarding the audio to be processed.
- The audio processing method according to claim 9, further comprising: performing semantic understanding on the text information by using a natural language processing method; and determining, according to the result of the semantic understanding, a speech signal corresponding to the audio to be processed to be output.
- A human-computer interaction system, comprising: a receiving device for receiving audio to be processed from a user; a processor for executing the audio processing method according to any one of claims 1-10; and an output device for outputting a speech signal corresponding to the audio to be processed.
- An audio processing apparatus, comprising: a probability determination unit configured to determine, according to feature information of each frame in audio to be processed, a probability that the frame belongs to each candidate character by using a machine learning model; a character judgment unit configured to judge whether a candidate character corresponding to a maximum probability parameter of each frame is a blank character or a non-blank character, the maximum probability parameter being the maximum value among the probabilities that the frame belongs to the candidate characters; a validity determination unit configured to determine the maximum probability parameter as an effective probability of the audio to be processed in the case where the candidate character corresponding to the maximum probability parameter of a frame is a non-blank character; and a noise judgment unit configured to judge, according to the effective probabilities of the audio to be processed, whether the audio to be processed is valid speech or noise.
- An audio processing apparatus, comprising: a memory; and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the audio processing method according to any one of claims 1-10.
- A non-volatile computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the audio processing method according to any one of claims 1-10.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/611,741 US20220238104A1 (en) | 2019-05-31 | 2020-05-18 | Audio processing method and apparatus, and human-computer interactive system |
JP2021569116A JP2022534003A (ja) | 2019-05-31 | 2020-05-18 | 音声処理方法、音声処理装置およびヒューマンコンピュータインタラクションシステム |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910467088.0A CN112017676B (zh) | 2019-05-31 | 2019-05-31 | 音频处理方法、装置和计算机可读存储介质 |
CN201910467088.0 | 2019-05-31 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020238681A1 true WO2020238681A1 (zh) | 2020-12-03 |
Family
ID=73501009
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/090853 WO2020238681A1 (zh) | 2019-05-31 | 2020-05-18 | 音频处理方法、装置和人机交互系统 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20220238104A1 (zh) |
JP (1) | JP2022534003A (zh) |
CN (1) | CN112017676B (zh) |
WO (1) | WO2020238681A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113593603A (zh) * | 2021-07-27 | 2021-11-02 | 浙江大华技术股份有限公司 | 音频类别的确定方法、装置、存储介质及电子装置 |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20210117488A (ko) * | 2020-03-19 | 2021-09-29 | 삼성전자주식회사 | 사용자 입력을 처리하는 전자 장치 및 방법 |
CN115394288B (zh) * | 2022-10-28 | 2023-01-24 | 成都爱维译科技有限公司 | 民航多语种无线电陆空通话的语种识别方法及系统 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1783211A (zh) * | 2004-11-25 | 2006-06-07 | Lg电子株式会社 | 语音区别方法 |
US20060155537A1 (en) * | 2005-01-12 | 2006-07-13 | Samsung Electronics Co., Ltd. | Method and apparatus for discriminating between voice and non-voice using sound model |
JP2006201287A (ja) * | 2005-01-18 | 2006-08-03 | Advanced Telecommunication Research Institute International | 雑音抑圧装置及び音声認識システム |
CN106448661A (zh) * | 2016-09-23 | 2017-02-22 | 华南理工大学 | 基于纯净语音与背景噪声两极建模的音频类型检测方法 |
CN106971741A (zh) * | 2016-01-14 | 2017-07-21 | 芋头科技(杭州)有限公司 | 实时将语音进行分离的语音降噪的方法及系统 |
CN109643552A (zh) * | 2016-09-09 | 2019-04-16 | 大陆汽车系统公司 | 用于可变噪声状况中语音增强的鲁棒噪声估计 |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012158156A1 (en) * | 2011-05-16 | 2012-11-22 | Google Inc. | Noise supression method and apparatus using multiple feature modeling for speech/noise likelihood |
WO2013132926A1 (ja) * | 2012-03-06 | 2013-09-12 | 日本電信電話株式会社 | 雑音推定装置、雑音推定方法、雑音推定プログラム及び記録媒体 |
KR101240588B1 (ko) * | 2012-12-14 | 2013-03-11 | 주식회사 좋은정보기술 | 오디오-영상 융합 음성 인식 방법 및 장치 |
CN104157290B (zh) * | 2014-08-19 | 2017-10-24 | 大连理工大学 | 一种基于深度学习的说话人识别方法 |
US10540957B2 (en) * | 2014-12-15 | 2020-01-21 | Baidu Usa Llc | Systems and methods for speech transcription |
US10332509B2 (en) * | 2015-11-25 | 2019-06-25 | Baidu USA, LLC | End-to-end speech recognition |
WO2017112813A1 (en) * | 2015-12-22 | 2017-06-29 | Sri International | Multi-lingual virtual personal assistant |
IL263655B2 (en) * | 2016-06-14 | 2023-03-01 | Netzer Omry | Automatic speech recognition |
US10403268B2 (en) * | 2016-09-08 | 2019-09-03 | Intel IP Corporation | Method and system of automatic speech recognition using posterior confidence scores |
US10490183B2 (en) * | 2017-11-22 | 2019-11-26 | Amazon Technologies, Inc. | Fully managed and continuously trained automatic speech recognition service |
CN108389575B (zh) * | 2018-01-11 | 2020-06-26 | 苏州思必驰信息科技有限公司 | 音频数据识别方法及系统 |
CN108877775B (zh) * | 2018-06-04 | 2023-03-31 | 平安科技(深圳)有限公司 | 语音数据处理方法、装置、计算机设备及存储介质 |
- 2019-05-31: CN application CN201910467088.0A filed; granted as CN112017676B (active)
- 2020-05-18: US application US 17/611,741 filed; published as US20220238104A1 (pending)
- 2020-05-18: PCT application PCT/CN2020/090853 filed; published as WO2020238681A1 (application filing)
- 2020-05-18: JP application JP2021569116A filed; published as JP2022534003A (pending)
Also Published As
Publication number | Publication date |
---|---|
CN112017676A (zh) | 2020-12-01 |
JP2022534003A (ja) | 2022-07-27 |
CN112017676B (zh) | 2024-07-16 |
US20220238104A1 (en) | 2022-07-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20812632; Country of ref document: EP; Kind code of ref document: A1 |
ENP | Entry into the national phase | Ref document number: 2021569116; Country of ref document: JP; Kind code of ref document: A |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 20812632; Country of ref document: EP; Kind code of ref document: A1 |