CN114242066A - Speech processing method, speech processing model training method, apparatus and medium - Google Patents

Speech processing method, speech processing model training method, apparatus and medium

Info

Publication number
CN114242066A
Authority
CN
China
Prior art keywords
acoustic
sample
audio signals
module
feature
Prior art date
Legal status
Pending
Application number
CN202111675104.9A
Other languages
Chinese (zh)
Inventor
吴国兵
熊世富
高建清
Current Assignee
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202111675104.9A
Publication of CN114242066A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 15/285: Memory allocation or algorithm optimisation to reduce hardware requirements
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present application provides a speech processing method, a training method of a speech processing model, a device, and a medium. The speech processing method includes: obtaining first acoustic feature information based on multiple audio signals, where the first acoustic feature information fuses the features of the multiple audio signals, and the multiple audio signals are obtained from original audio signals collected by at least two audio collection devices; and classifying the first acoustic feature information to obtain spatial position information, where the spatial position information characterizes the direction of the sound source. The technical solution of the present application can reduce cascade errors, improve the accuracy of sound source localization, and increase computation speed.

Description

Speech processing method, speech processing model training method, apparatus and medium
Technical Field
The present application relates to the field of computer technology, and in particular to a speech processing method, a training method of a speech processing model, a device, and a medium.
Background
In a voice interaction scenario, an interactive device can locate the direction of a user from the received speech, so that it can better collect the speech of the user in that direction. This avoids interference from noise in other directions during the voice interaction and improves the intelligence of the interactive device. However, existing sound source localization methods suffer from problems such as poor localization performance and high memory usage.
Disclosure of Invention
In view of this, embodiments of the present application provide a speech processing method, a training method of a speech processing model, a device, and a medium, which can reduce cascade errors, improve the accuracy of sound source localization, and increase computation speed.
In a first aspect, an embodiment of the present application provides a speech processing method, including: acquiring first acoustic characteristic information based on a plurality of paths of audio signals, wherein the first acoustic characteristic information is fused with the characteristics of the plurality of paths of audio signals, and the plurality of paths of audio signals are acquired according to original audio signals acquired by at least two audio acquisition devices; and classifying the first acoustic characteristic information to obtain spatial position information, wherein the spatial position information is used for representing the direction of the sound source.
In some embodiments of the present application, the speech processing method is applied to a speech processing model, and the speech processing model includes a feature extraction module, a feature fusion module, and an acoustic wake-up module.
In some embodiments of the present application, obtaining the first acoustic feature information based on the multiple audio signals includes: extracting characteristic information corresponding to the multi-channel audio signals and characteristic information corresponding to original audio signals acquired by one or more audio acquisition devices in at least two audio acquisition devices by using a characteristic extraction module; and fusing the characteristic information corresponding to the multi-channel audio signals and the characteristic information corresponding to the original audio signals by using a characteristic fusion module to obtain first acoustic characteristic information.
In some embodiments of the present application, obtaining the first acoustic feature information based on the multiple audio signals includes: extracting the characteristic information corresponding to the multi-channel audio signals by using a characteristic extraction module; and fusing the characteristic information corresponding to the multi-channel audio signals and the phase difference between the original audio signals acquired by any two audio acquisition devices in the at least two audio acquisition devices by using the characteristic fusion module to acquire first acoustic characteristic information.
In some embodiments of the present application, obtaining the first acoustic feature information based on the multiple audio signals includes: extracting characteristic information corresponding to the multi-channel audio signals and characteristic information corresponding to the amplitude difference between original audio signals acquired by any two audio acquisition devices in at least two audio acquisition devices by using a characteristic extraction module; and fusing the feature information corresponding to the multi-channel audio signals and the feature information corresponding to the amplitude difference by using the feature fusion module to acquire first acoustic feature information.
In some embodiments of the present application, obtaining the first acoustic feature information based on the multiple audio signals includes: extracting characteristic information corresponding to the multi-channel audio signals, characteristic information corresponding to original audio signals acquired by one or more audio acquisition devices in the at least two audio acquisition devices and characteristic information corresponding to the amplitude difference between the original audio signals acquired by any two audio acquisition devices in the at least two audio acquisition devices by using a characteristic extraction module; and fusing the characteristic information corresponding to the multi-channel audio signals, the characteristic information corresponding to the original audio signals, the characteristic information corresponding to the amplitude difference and the phase difference between the original audio signals acquired by any two audio acquisition devices in the at least two audio acquisition devices by using the characteristic fusion module to acquire first acoustic characteristic information.
In some embodiments of the present application, the speech processing method further includes: and identifying a wake-up word based on the first acoustic characteristic information by using an acoustic wake-up module.
In some embodiments of the present application, identifying a wake word based on first acoustic feature information by using the acoustic wake module includes: and identifying a wake-up word by utilizing an acoustic wake-up module in combination with the spatial position information and the first acoustic feature information.
In some embodiments of the present application, recognizing a wake-up word by using the acoustic wake-up module in combination with the spatial position information and the first acoustic feature information includes: performing gating selection on the first acoustic feature information based on the spatial position information by using the gating of the acoustic wake-up module to obtain second acoustic feature information; processing the second acoustic feature information by using the acoustic wake-up module to obtain an acoustic score, where the acoustic wake-up module is obtained through adversarial training using a gradient reversal layer, and the gradient reversal layer points to an energy classifier; and recognizing the wake-up word according to the acoustic score.
In some embodiments of the present application, identifying the wake-up word based on the acoustic score includes: the acoustic score is input to a decoding network to identify the wake-up word.
In some embodiments of the present application, the multiple audio signals are obtained by performing beamforming on original audio signals acquired by at least two audio acquisition devices according to a plurality of preset sound source directions.
In some embodiments of the present application, the speech processing model further includes a sound source classifier and a training module, and the speech processing model is obtained by a training method including: acquiring first sample acoustic characteristic information based on sample data by using a characteristic extraction module and a characteristic fusion module, wherein the sample data comprises a plurality of paths of sample audio signals, the plurality of paths of sample audio signals are acquired according to original sample audio signals acquired by at least two sample audio acquisition devices, and the first sample acoustic characteristic information fuses the characteristics of the plurality of paths of sample audio signals; acquiring spatial position information corresponding to the sample data based on the first sample acoustic characteristic information by using a sound source classifier; and training the voice processing model by utilizing a training module based on the spatial position information corresponding to the sample data and the sound source marking information corresponding to the sample data.
In certain embodiments of the present application, the sample data further comprises at least one of: the audio signal processing method comprises the steps of acquiring original sample audio signals acquired by one or more of at least two sample audio acquisition devices, acquiring sample phase differences between the original sample audio signals acquired by any two of the at least two sample audio acquisition devices, and acquiring sample amplitude differences between the original sample audio signals acquired by any two of the at least two sample audio acquisition devices.
In some embodiments of the present application, the training method further comprises: obtaining acoustic information based on the first sample acoustic characteristic information by using an acoustic awakening module; and training the acoustic awakening module by using the training module according to the acoustic information and the acoustic marking information corresponding to the sample data.
In some embodiments of the present application, obtaining the acoustic information based on the first sample acoustic feature information by using the acoustic wakeup module includes: and obtaining acoustic information based on the first sample acoustic characteristic information and the spatial position information corresponding to the sample data by using an acoustic awakening module.
In some embodiments of the present application, a gradient reversal layer, an energy classifier, an acoustic classifier, and a gate are disposed in the acoustic wake-up module, and obtaining the acoustic information based on the first sample acoustic feature information and the spatial position information corresponding to the sample data by using the acoustic wake-up module includes: obtaining second sample acoustic feature information by using the gate based on the first sample acoustic feature information and the spatial position information corresponding to the sample data; and obtaining the acoustic information based on the second sample acoustic feature information by using the acoustic classifier. The training method further includes: performing adversarial training on the acoustic wake-up module by using the gradient reversal layer based on the second sample acoustic feature information and energy labeling information, where the gradient reversal layer points to the energy classifier.
In some embodiments of the present application, the multi-channel sample audio signals are obtained by performing beamforming on original sample audio signals collected by at least two sample audio collecting devices according to a plurality of preset sound source directions.
In a second aspect, an embodiment of the present application provides a method for training a speech processing model, where the speech processing model includes a feature extraction module, a feature fusion module, a sound source classifier, and a training module, and the training method includes: acquiring first sample acoustic characteristic information based on sample data by using a characteristic extraction module and a characteristic fusion module, wherein the sample data comprises a plurality of paths of sample audio signals, the plurality of paths of sample audio signals are acquired according to original sample audio signals acquired by at least two sample audio acquisition devices, and the first sample acoustic characteristic information fuses the characteristics of the plurality of paths of sample audio signals; acquiring spatial position information corresponding to the sample data based on the first sample acoustic characteristic information by using a sound source classifier; and training the voice processing model by utilizing a training module based on the spatial position information corresponding to the sample data and the sound source marking information corresponding to the sample data.
In a third aspect, an embodiment of the present application provides a speech processing apparatus, including: the acquisition module is used for acquiring first acoustic characteristic information based on a plurality of paths of audio signals, wherein the first acoustic characteristic information is fused with the characteristics of the plurality of paths of audio signals, and the plurality of paths of audio signals are acquired according to original audio signals acquired by at least two audio acquisition devices; and the classification module is used for classifying the first acoustic characteristic information to obtain spatial position information, and the spatial position information is used for representing the direction of the sound source.
In a fourth aspect, an embodiment of the present application provides a training apparatus for a speech processing model, including: the first acquisition module is used for acquiring first sample acoustic characteristic information based on sample data, wherein the sample data comprises a plurality of paths of sample audio signals, the plurality of paths of sample audio signals are acquired according to original sample audio signals acquired by at least two sample audio acquisition devices, and the first sample acoustic characteristic information is fused with the characteristics of the plurality of paths of sample audio signals; the second acquisition module is used for acquiring the spatial position information corresponding to the sample data based on the first sample acoustic characteristic information; and the training module is used for training the voice processing model based on the spatial position information corresponding to the sample data and the sound source marking information corresponding to the sample data.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: a processor; a memory for storing processor executable instructions, wherein the processor is configured to perform the speech processing method according to the first aspect or to perform the training method of the speech processing model according to the second aspect.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium, where the storage medium stores a computer program for executing the speech processing method according to the first aspect or executing the training method of the speech processing model according to the second aspect.
The embodiments of the present application provide a speech processing method, a training method of a speech processing model, a device, and a medium. Multiple audio signals are obtained from original audio signals collected by at least two audio collection devices, the features of the multiple audio signals are fused to obtain first acoustic feature information, and the first acoustic feature information is then classified to obtain spatial position information that characterizes the direction of the sound source. Because the first acoustic feature information is obtained by fusing the features of the multiple audio signals, rather than processing each audio signal independently, the speech processing method can reduce memory usage, increase computation speed, reduce cascade errors, and improve the accuracy of sound source localization.
Drawings
Fig. 1 is a schematic system architecture diagram of a speech processing system according to an exemplary embodiment of the present application.
Fig. 2 is a flowchart illustrating a speech processing method according to an exemplary embodiment of the present application.
Fig. 3 is a schematic structural diagram of a speech processing model according to an exemplary embodiment of the present application.
Fig. 4 is a flowchart illustrating a speech processing method according to another exemplary embodiment of the present application.
Fig. 5 is a schematic structural diagram of a speech processing model according to another exemplary embodiment of the present application.
Fig. 6 is a flowchart illustrating a method for training a speech processing model according to an exemplary embodiment of the present application.
Fig. 7 is a flowchart illustrating a method for training a speech processing model according to another exemplary embodiment of the present application.
Fig. 8 is a schematic structural diagram of a speech processing apparatus according to an exemplary embodiment of the present application.
Fig. 9 is a schematic structural diagram of a training apparatus for a speech processing model according to an exemplary embodiment of the present application.
FIG. 10 is a block diagram illustrating an electronic device for performing a speech processing method or a training method of a speech processing model according to an exemplary embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Summary of the application
As devices become more intelligent, more and more of them can be controlled by voice. In some scenarios, a device can locate the position of a user by collecting the user's sound (audio signal), which makes it easier to subsequently collect the user's voice from that position, recognize the user's needs from the voice, interact with the user, and provide better service. For example, after locating a user's position by capturing the user's voice, some chat robots can continuously capture the user's voice at that position without being disturbed by noise from other positions, and recognize the user's needs by voice in order to maintain a continuous conversation with the user.
In other scenarios, the device may use the user's sound (audio signal) to locate the user's position and determine whether the user is at a specific position, so as to provide service only to the user at that position. For example, some vehicle-mounted interaction devices locate a user's position from the collected sound and determine whether that position is a specific position (such as the driver's seat). If so, the device recognizes the content of the sound and performs the corresponding operation; if not, it does not recognize the content of the sound, so that voices of users at other positions (such as the front passenger seat or the rear row) are not recognized and acted upon, which would interfere with the driver or cause confusion in controlling the vehicle.
One sound source localization method performs beamforming on the collected original audio signals to obtain multiple audio signals, processes each of the multiple audio signals separately to obtain a confidence score and an energy score for each audio signal, and then makes a decision based on the confidence scores and energy scores of the multiple audio signals to determine the direction of the sound source.
This method uses a multi-step cascade, which easily accumulates errors. Moreover, because it relies on energy scores, it cannot handle sound source localization in noisy scenes well: the direction with the highest energy may be noise interference, and combining the confidence score does not fully solve the problem. In addition, the method performs the same processing (computing a confidence score and an energy score) on every audio signal, which multiplies memory and CPU usage.
That is, this sound source localization method suffers from easy error accumulation, susceptibility to noise interference, and high memory usage.
Exemplary System
Fig. 1 is a system architecture diagram of a speech processing system 100 according to an exemplary embodiment of the present application, which illustrates an application scenario for performing sound source localization on an acquired original audio signal. The speech processing system 100 includes at least two audio capture devices 110 and a computing device 120. The computing device 120 may be communicatively connected with the audio capture device 110, or the audio capture device 110 may be integrated on the computing device 120.
In one embodiment, the audio capture device 110 may be a microphone or a sound pickup for capturing sound in the environment (e.g., the user's voice). The at least two audio capture devices 110 may be distributed in different orientations to capture the user's voice from different directions. The computing device 120 processes the sound (audio signal) captured by the audio capture devices 110 to locate the direction of the sound source. The specific processing is described in the exemplary methods section below.
Computing device 120 may be a cell phone, tablet, notebook, desktop, vehicle-mounted controller, etc. device.
Further, after locating the sound source direction, the computing device 120 may perform a wake word recognition process on the audio signal in combination with information (e.g., spatial position information) about the sound source direction, and if a wake word is recognized, perform an operation corresponding to the wake word.
It should be noted that the above application scenarios are only presented to facilitate understanding of the spirit and principles of the present application, and the embodiments of the present application are not limited thereto. Rather, embodiments of the present application may be applied to any scenario where it may be applicable.
Exemplary method
Fig. 2 is a flowchart illustrating a speech processing method according to an exemplary embodiment of the present application. The method of fig. 2 may be performed by a computing device. As shown in fig. 2, the speech processing method includes the following steps.
210: the method comprises the steps of obtaining first acoustic characteristic information based on multiple audio signals, wherein the first acoustic characteristic information fuses the characteristics of the multiple audio signals, and the multiple audio signals are obtained according to original audio signals collected by at least two audio collecting devices.
And acquiring a plurality of paths of audio signals according to the original audio signals acquired by the at least two audio acquisition devices.
In particular, the audio capturing device may be a microphone or a sound pick-up for capturing the original audio signal. The at least two audio capturing devices may be distributed in different orientations for capturing the original audio signals from different directions. For example, the at least two audio capturing devices include four microphones that are respectively provided at a main driving position, a sub-driving position, a left position of the rear seat, and a right position of the rear seat in the vehicle. There may be phase and/or amplitude differences between the raw audio signals collected by the microphones in different orientations.
And integrating the original audio signals collected by each microphone, and processing the integrated original audio signals to obtain a plurality of paths of audio signals. For example, the original audio signals collected by each microphone may be simply combined together to obtain an integrated original audio signal, or the integrated original audio signal may be obtained through a weighting process or the like.
The number of the multi-path audio signals can be the same as the number of the sound source directions which can be actually positioned by the voice processing method. For example, the number of sound source directions actually localizable by the speech processing method is 3, and the three sound source directions are a main driving direction, a secondary driving direction and a rear row direction respectively. The original audio signals collected by the four microphones are processed to obtain three paths of audio signals corresponding to three sound source directions.
It should be noted that there is a certain relationship between the number of audio acquisition devices and the number of multiple audio signals, but the relationship is not absolute. The one-to-one correspondence is not required, but the larger the number of audio acquisition devices, the larger the number of available multi-channel audio signals. For example, three audio signals may be acquired by two or more audio capture devices, while four or more audio signals may be acquired by three or more audio capture devices.
The number of audio acquisition equipment and the number of multichannel audio signal can set up according to actual need, and this application embodiment does not limit this.
Specifically, first acoustic feature information about a multi-channel audio signal may be acquired in conjunction with the multi-channel audio signal. For example, the multiple audio signals may be input to a neural network model to obtain the first acoustic feature information. Alternatively, features in the multi-channel audio signal may be extracted by some algorithm, and the features may be subjected to fusion processing to obtain the first acoustic feature information.
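For illustration, the following minimal sketch extracts 40-dimensional log filter-bank features for each beamformed audio signal and splices them into first acoustic feature information; the use of librosa, the sampling rate, and the frame parameters are assumptions for illustration only and are not prescribed by this application.

```python
import numpy as np
import librosa  # library choice is an assumption; any STFT/filter-bank implementation works


def log_fbank(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    """40-dimensional log mel filter-bank features, shape (frames, n_mels)."""
    power_spec = np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop)) ** 2
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.log(mel_fb @ power_spec + 1e-10).T


def first_acoustic_features(beam_signals):
    """Fuse (splice) per-beam features; assumes all beamformed signals have equal length."""
    feats = [log_fbank(s) for s in beam_signals]       # each: (frames, 40)
    return np.concatenate(feats, axis=-1)              # (frames, 40 * number_of_beams)
```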
220: and classifying the first acoustic characteristic information to obtain spatial position information, wherein the spatial position information is used for representing the direction of the sound source.
Specifically, the spatial location information may be obtained by performing classification processing on the first acoustic feature information by a classifier. The spatial position information may be a matrix or a vector for characterizing the sound source direction of the original audio signal, i.e. characterizing the direction in which the user corresponding to the original audio signal is located. For example, three paths of audio signals are acquired according to original audio signals acquired by at least two audio acquisition devices, first acoustic feature information is acquired based on the three paths of audio signals, and spatial position information obtained by classifying the first acoustic feature information is a three-dimensional vector. Each dimension of the vector represents a preset sound source direction, the numerical value on each dimension represents the probability that the corresponding preset sound source direction is the sound source direction of the original audio signal, and the preset sound source direction corresponding to the maximum numerical value in the vector can be determined as the finally positioned sound source direction.
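As a concrete reading of such spatial position information, the small example below picks the preset sound source direction with the largest probability; the three direction labels are illustrative assumptions.

```python
import numpy as np

preset_directions = ["driver", "front passenger", "rear row"]   # illustrative labels
spatial_position = np.array([0.82, 0.11, 0.07])                 # e.g. classifier output after softmax

located = preset_directions[int(np.argmax(spatial_position))]
print(located)   # "driver": the preset direction with the largest probability
```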
The embodiment of the application provides a voice processing method, which includes the steps of obtaining multiple paths of audio signals according to original audio signals collected by at least two audio collecting devices, fusing the characteristics of the multiple paths of audio signals to obtain first acoustic characteristic information corresponding to the multiple paths of audio signals, and then classifying the first acoustic characteristic information to obtain spatial position information capable of representing the direction of a sound source. Because the characteristics of the multiple paths of audio signals are fused to obtain the first acoustic characteristic information, rather than performing independent processing on each path of audio signal, the voice processing method can reduce the memory occupancy rate, improve the operation speed, reduce the cascade error and improve the accuracy of sound source positioning.
According to an embodiment of the present application, the multi-channel audio signal is obtained by performing beamforming on the original audio signals acquired by at least two audio acquisition devices according to a plurality of preset sound source directions.
Specifically, the plurality of preset sound source directions may be a plurality of sound source directions that the voice processing method can actually locate. For example, the original audio signals acquired by at least two audio acquisition devices may be directionally enhanced according to a plurality of preset sound source directions to obtain multi-path beamformed audio signals. Beamforming is a combination of antenna technology and digital signal processing technology, intended for directional signal transmission or reception.
Of course, other methods may also be adopted to obtain multiple channels of audio signals respectively enhanced in multiple preset sound source directions based on the original audio signals acquired by at least two audio acquisition devices.
A plurality of default sound source directions can be set according to actual need, and the embodiment of the application does not limit the specific number and the specific direction of a plurality of default sound source directions.
According to the voice processing method provided by the embodiment, the original audio signals collected by at least two audio collecting devices are subjected to beam forming processing according to a plurality of preset sound source directions to obtain a plurality of paths of audio signals, so that the audio signals in the preset sound source directions can be enhanced, the audio signals in other directions can be inhibited, and the accuracy of sound source positioning can be further improved. In addition, the number of the preset sound source directions can be reasonably set according to needs by the voice processing method provided by the embodiment, so that the adaptability of the method can be improved, and the operation speed can be improved to a certain extent.
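As one possible realization of the beamforming step, the sketch below applies a simple delay-and-sum beamformer for each preset sound source direction; the far-field assumption, the array geometry, and the parameter values are illustrative, and this application does not mandate a particular beamforming algorithm.

```python
import numpy as np


def delay_and_sum(mic_signals, mic_positions, direction_deg, sr=16000, c=343.0):
    """Enhance the signal arriving from direction_deg by aligning and averaging the microphones."""
    theta = np.deg2rad(direction_deg)
    steering = np.array([np.cos(theta), np.sin(theta)])          # unit vector toward the source
    out = np.zeros_like(mic_signals[0], dtype=float)
    for sig, pos in zip(mic_signals, mic_positions):
        delay = int(round(sr * np.dot(pos, steering) / c))       # relative delay in samples
        out += np.roll(sig, -delay)                              # align this microphone to the array origin
    return out / len(mic_signals)


# One beamformed signal per preset sound source direction, e.g. 0, 90 and 180 degrees:
# beams = [delay_and_sum(mics, positions, d) for d in (0.0, 90.0, 180.0)]
```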
According to an embodiment of the present application, the speech processing method is applied to a speech processing model, and the speech processing model includes: the device comprises a feature extraction module, a feature fusion module and an acoustic awakening module.
For example, as shown in FIG. 3, the speech processing model 300 may include a feature extraction module (e.g., a first feature extraction module 310 and a second feature extraction module 320), a feature fusion module 330, and an acoustic wakeup module. The acoustic wakeup module may further include a gate 360, a fourth feature extraction module 370, and an acoustic classifier 380.
According to an embodiment of the present application, acquiring first acoustic feature information based on a plurality of audio signals includes: extracting characteristic information corresponding to the multi-channel audio signals and characteristic information corresponding to original audio signals acquired by one or more audio acquisition devices in at least two audio acquisition devices by using a characteristic extraction module; the feature fusion module 330 is used to fuse feature information corresponding to the multiple audio signals and feature information corresponding to the original audio signal to obtain first acoustic feature information.
Specifically, the first acoustic feature information may be acquired based on the beamformed multiple audio signals and the original audio signals acquired by one or more of the at least two audio acquisition devices. Here, the beamformed multiple audio signals may be considered the main feature, and the original audio signals captured by the one or more audio capture devices may be considered auxiliary features. For example, when there are N audio capture devices, the original audio signal captured by one audio capture device, or by M audio capture devices with 1 < M ≤ N, may be used as the auxiliary feature.
In the speech processing method provided by the embodiment, the original audio signals acquired by one or more audio acquisition devices are used as auxiliary features, so that the influence on the sound source positioning result when the beam forming algorithm is abnormal can be avoided. Namely, when the beam forming algorithm is abnormal, the original audio signal can avoid a large error of the first acoustic characteristic information, and further avoid an overlarge deviation between a sound source positioning result and an actual situation. In addition, when the original audio signal collected by one audio collecting device is used as an auxiliary feature, the accuracy of sound source positioning can be improved, and meanwhile, higher operation speed is guaranteed. When the original audio signals acquired by the plurality of audio acquisition devices are used as the auxiliary features, the accuracy of sound source localization can be further improved.
According to an embodiment of the present application, acquiring first acoustic feature information based on a plurality of audio signals includes: extracting the characteristic information corresponding to the multi-channel audio signals by using a characteristic extraction module; the feature fusion module 330 is used to fuse feature information corresponding to the multiple audio signals and a phase difference between original audio signals collected by any two audio collection devices of the at least two audio collection devices, so as to obtain first acoustic feature information.
Specifically, the first acoustic feature information may be acquired based on a phase difference between the beamformed multiple audio signals and original audio signals acquired by any two of the at least two audio acquisition devices. Here, the beamformed multiple audio signals may be regarded as a main feature, and the phase difference between the original audio signals acquired by any two audio acquisition devices may be regarded as an auxiliary feature. The phase difference between the original audio signals collected by the two audio collection devices can reflect the position distribution relationship between the two audio collection devices, and the sound source direction can be estimated based on the phase difference. The multi-channel audio signal contains implicit characteristics about the direction of the sound source, while the phase difference contains explicit characteristics about the direction of the sound source relative to the multi-channel audio signal. The phase difference may thus further provide supplementary information for the determination of the direction of the sound source.
For example, when there are three audio capturing devices, a phase difference (one phase difference) between the original audio signals captured by two of the audio capturing devices may be used as the auxiliary feature, or a phase difference (three phase differences) between the original audio signals captured by every two of the three audio capturing devices may be used as the auxiliary feature. Similarly, two phase differences may also be acquired as an assist feature. When a phase difference is used as an auxiliary feature, the accuracy of sound source positioning can be improved, and meanwhile, higher operation speed is guaranteed. When a plurality of phase differences are used as the assist feature, the accuracy of sound source localization can be further improved.
According to the voice processing method provided by the embodiment, the phase difference between the original audio signals acquired by any two audio acquisition devices is used as an auxiliary feature, so that the accuracy of sound source positioning can be further improved.
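For reference, the phase difference used as an auxiliary feature can be computed per frequency bin from the short-time spectra of two raw microphone frames, as in the sketch below; the frame length is an illustrative assumption, and the quantity may be reduced or transformed before fusion.

```python
import numpy as np


def phase_difference(frame_a, frame_b, n_fft=512):
    """Per-frequency-bin phase difference between two raw microphone frames, wrapped to (-pi, pi]."""
    spec_a = np.fft.rfft(frame_a, n=n_fft)
    spec_b = np.fft.rfft(frame_b, n=n_fft)
    return np.angle(spec_a * np.conj(spec_b))
```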
According to an embodiment of the present application, acquiring first acoustic feature information based on a plurality of audio signals includes: extracting characteristic information corresponding to the multi-channel audio signals and characteristic information corresponding to the amplitude difference between original audio signals acquired by any two audio acquisition devices in at least two audio acquisition devices by using a characteristic extraction module; the feature fusion module 330 is used to fuse feature information corresponding to the multiple audio signals and feature information corresponding to the amplitude difference to obtain first acoustic feature information.
Specifically, the first acoustic feature information may be obtained based on a difference in amplitude between the beamformed multiple audio signals and original audio signals acquired by any two of the at least two audio acquisition devices. Here, the beamformed multiple audio signals may be regarded as a main feature, and the amplitude difference between the original audio signals acquired by any two audio acquisition devices may be regarded as an auxiliary feature. When the position distribution of two audio acquisition devices is different, an amplitude difference exists between original audio signals acquired by the two audio acquisition devices, namely, the amplitude difference can reflect the position distribution relationship between the two audio acquisition devices. The sound source direction can be estimated based on the amplitude difference. The multi-channel audio signal contains implicit characteristics about the direction of the sound source, while the amplitude difference contains explicit characteristics about the direction of the sound source relative to the multi-channel audio signal. The amplitude difference may thus further provide supplementary information for the determination of the direction of the sound source.
For example, when there are three audio capturing devices, the auxiliary feature may be an amplitude difference (one amplitude difference) between the original audio signals captured by two of the audio capturing devices, or an amplitude difference (three amplitude differences) between the original audio signals captured by every two of the three audio capturing devices. Similarly, two amplitude differences may also be obtained as an assist feature. When an amplitude difference is used as an auxiliary feature, the accuracy of sound source positioning can be improved, and meanwhile, higher operation speed is guaranteed. When a plurality of amplitude differences are used as the auxiliary feature, the accuracy of sound source localization can be further improved.
According to the voice processing method provided by the embodiment, the amplitude difference between the original audio signals acquired by any two audio acquisition devices is used as an auxiliary feature, so that the accuracy of sound source positioning can be further improved.
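Analogously, the amplitude difference can be taken as the difference between the log magnitudes of the two raw microphone spectra; the log-magnitude form below is one reasonable choice, not a form mandated by this application.

```python
import numpy as np


def amplitude_difference(frame_a, frame_b, n_fft=512):
    """Per-frequency-bin log-magnitude difference between two raw microphone frames."""
    mag_a = np.abs(np.fft.rfft(frame_a, n=n_fft))
    mag_b = np.abs(np.fft.rfft(frame_b, n=n_fft))
    return np.log(mag_a + 1e-10) - np.log(mag_b + 1e-10)
```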
According to an embodiment of the present application, acquiring first acoustic feature information based on a plurality of audio signals includes: extracting characteristic information corresponding to the multi-channel audio signals, characteristic information corresponding to original audio signals acquired by one or more audio acquisition devices in the at least two audio acquisition devices and characteristic information corresponding to the amplitude difference between the original audio signals acquired by any two audio acquisition devices in the at least two audio acquisition devices by using a characteristic extraction module; the feature fusion module 330 is used to fuse feature information corresponding to multiple channels of audio signals, feature information corresponding to original audio signals, feature information corresponding to amplitude differences, and phase differences between original audio signals acquired by any two audio acquisition devices of the at least two audio acquisition devices, so as to obtain first acoustic feature information.
Specifically, the first acoustic feature information may be acquired based on a plurality of channels of audio signals, original audio signals collected by one or more audio collection devices, a phase difference between the original audio signals collected by any two audio collection devices, and an amplitude difference between the original audio signals collected by any two audio collection devices. Here, the beamformed multiple audio signals may be regarded as main features, and the original audio signals, the phase differences, and the amplitude differences may be regarded as auxiliary features. The original audio signal can avoid a large error of the first acoustic characteristic information when the beam forming algorithm is abnormal, and further avoid an overlarge deviation between a sound source positioning result and an actual situation. The phase difference and the amplitude difference can reflect the position distribution relation between the two audio acquisition devices, and the sound source direction can be estimated through the phase difference and the amplitude difference. The multipath audio signals contain implicit characteristics about the direction of the sound source, while the phase and amplitude differences contain explicit characteristics about the direction of the sound source relative to the multipath audio signals. The phase difference and amplitude difference may thus further provide supplementary information for the determination of the direction of the sound source.
For example, when there are three audio acquisition devices, the original audio signals acquired by one or more of the audio acquisition devices, the phase difference (one phase difference) between the original audio signals acquired by any two of the audio acquisition devices, and the amplitude difference (one amplitude difference) between the original audio signals acquired by any two of the audio acquisition devices may be used as the auxiliary feature, or the original audio signals acquired by one or more of the three audio acquisition devices, the phase difference (three phase differences) between the original audio signals acquired by every two of the audio acquisition devices, and the amplitude difference (three amplitude differences) between the original audio signals acquired by every two of the audio acquisition devices may be used as the auxiliary feature. Of course, the number of the original audio signals, the phase difference, and the amplitude difference in the auxiliary feature may be set as needed. When the number of original audio signals, phase differences and amplitude differences in the auxiliary features is small, the accuracy of sound source positioning can be improved to a certain extent, and meanwhile, higher operation speed is guaranteed. When the number of the original audio signals, the phase difference and the amplitude difference in the auxiliary features is large, the accuracy of sound source positioning can be further improved.
The speech processing method provided by this embodiment obtains the first acoustic feature information by combining the multiple audio signals, the original audio signals, the phase difference between the original audio signals collected by the audio collecting device, and the amplitude difference between the original audio signals collected by the audio collecting device, and can further improve the accuracy of sound source positioning.
Fig. 3 is a schematic structural diagram of a speech processing model 300 according to an exemplary embodiment of the present application, where the speech processing model 300 can be used to perform the above-mentioned speech processing method, i.e. to locate the sound source direction. As shown in FIG. 3, the speech processing model 300 includes a first feature extraction module 310, a second feature extraction module 320, a feature fusion module 330, a third feature extraction module 340, and a sound source classifier 350.
The speech processing model 300 is provided with a plurality of first feature extraction modules 310; each beamformed audio signal, each original audio signal, and each amplitude difference corresponds to one first feature extraction module 310. The first feature extraction module 310 may be used to extract shallow features of the input data. The speech processing model 300 is also provided with a plurality of second feature extraction modules 320; each beamformed audio signal and each original audio signal corresponds to one second feature extraction module 320, and all amplitude differences may correspond to one or more second feature extraction modules 320. The output of a first feature extraction module 310 may serve as the input to a second feature extraction module 320, and the second feature extraction module 320 may be used to extract deep features of the input data. The second feature extraction module 320 may be a Deep Neural Network (DNN), a Long Short-Term Memory (LSTM) network, a Convolutional Neural Network (CNN), a Transformer deep learning architecture, or the like, and the first feature extraction module 310 may be implemented similarly.
The output of the second feature extraction module 320 and the phase difference may be used as the input of the feature fusion module 330, and the feature fusion module 330 may be a concat module and may splice the input feature information. The third feature extraction module 340 may further process the spliced feature information to facilitate a classification process of a subsequent classifier. The third feature extraction module 340 may be a feed-forward neural network or other suitable network. The classification result of the classifier (sound source classifier 350) may be a vector (spatial position information) of three dimensions. Here, the number of classification nodes of the classifier is 3, and the classifier may employ a discriminative loss function such as cross entropy. The parameters of the plurality of second feature extraction modules 320 may be the same or different. When the parameters of the plurality of second feature extraction modules 320 are the same, the purpose of reducing the parameter number can be achieved.
The specific process by which the speech processing model 300 locates the direction of the sound source is described by taking two audio acquisition devices and three preset sound source directions as an example. Specifically, the multiple audio signals (audio signal one, audio signal two, and audio signal three), the original audio signal collected by any one audio collection device, the phase difference between the original audio signals collected by the two audio collection devices, and the amplitude difference between the original audio signals collected by the two audio collection devices may be used as the input of the speech processing model 300. The first feature extraction module 310 may perform a Fourier transform on the input data, for example to extract frequency-domain features. The amplitude difference may be the difference between the frequency-domain features of the original audio signals acquired by the audio acquisition devices. In one implementation, each first feature extraction module 310 may output a 40-dimensional filter bank feature vector, and the phase difference between the two audio acquisition devices may be represented by a 1-dimensional vector. If the second feature extraction modules 320 do not change the dimension of the input feature information, the feature information spliced by the feature fusion module 330 is a vector with dimension 201.
Optionally, a first feature fusion module may be disposed between the first feature extraction module 310 and the second feature extraction module 320, and a second feature fusion module may be disposed between the second feature extraction module 320 and the third feature extraction module 340, that is, one second feature extraction module 320 may be disposed in the speech processing model. The output of the first feature extraction module 310 may be used as the input of the first feature fusion module, the first feature fusion module splices the input feature information, and the second feature extraction module 320 may further process the spliced feature information. The output of the second feature extraction module 320 and the phase difference may be inputs to a second feature fusion module. The second feature fusion module may be connected to the third feature extraction module 340 and the sound source classifier 350.
Similarly, taking two audio acquisition devices and three preset sound source directions as an example, each first feature extraction module 310 may output a 40-dimensional filter bank feature vector, and the first feature fusion module splices five 40-dimensional vectors (corresponding to the three beamformed audio signals, the original audio signal, and the amplitude difference, respectively) to obtain a 200-dimensional vector. If the second feature extraction module does not change the dimension of the input feature information, the second feature fusion module can splice the output of the second feature extraction module with the phase difference to obtain a vector with dimension 201. Of course, if the phase difference is omitted from the input data, the second feature fusion module may be omitted from the speech processing model.
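A minimal PyTorch sketch of the localization path in FIG. 3 is given below, assuming five 40-dimensional feature streams (three beamformed signals, one original signal, one amplitude difference) plus a 1-dimensional phase difference, so that the fused vector has dimension 201 as in the example above; all layer types and sizes are illustrative assumptions rather than values fixed by this application.

```python
import torch
import torch.nn as nn


class SoundSourceLocalizer(nn.Module):
    """Per-stream deep feature extraction, concat fusion, and a 3-way sound source classifier."""

    def __init__(self, n_streams=5, feat_dim=40, hidden=128, n_directions=3):
        super().__init__()
        # second feature extraction modules (DNN here; LSTM/CNN/Transformer are equally possible)
        self.extractors = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU()) for _ in range(n_streams)]
        )
        fused_dim = n_streams * feat_dim + 1                 # + 1-dimensional phase difference = 201
        self.third = nn.Sequential(nn.Linear(fused_dim, hidden), nn.ReLU())  # third feature extraction
        self.source_classifier = nn.Linear(hidden, n_directions)             # sound source classifier

    def forward(self, streams, phase_diff):
        # streams: list of (batch, feat_dim) shallow features; phase_diff: (batch, 1)
        deep = [m(x) for m, x in zip(self.extractors, streams)]
        fused = torch.cat(deep + [phase_diff], dim=-1)        # (batch, 201) fused acoustic features
        return self.source_classifier(self.third(fused))      # logits over the preset directions


# Training would apply nn.CrossEntropyLoss() to these logits against the sound source labels.
```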
According to an embodiment of the present application, the speech processing method further includes: and identifying a wake-up word based on the first acoustic characteristic information by using an acoustic wake-up module.
Specifically, after the sound source direction is located, the wake-up word may be further identified based on the first acoustic feature information. In one implementation, the step of identifying the wake-up word based on the first acoustic feature information may be performed when the located sound source direction satisfies a preset condition, otherwise not performed. Here, the preset condition may be that the sound source direction is the primary driving direction. After the wake-up word is identified, the related operations can be executed according to the instruction corresponding to the wake-up word. The voice processing process can execute corresponding operation according to the voice command of the user in the main driving direction, and does not react to the voice commands of the users in other directions.
The voice processing method provided by the embodiment positions the sound source direction firstly, further identifies the awakening word, can take the sound source positioning result as a prerequisite condition for awakening word identification, starts the awakening word identification process when the sound source positioning result meets a preset condition, and can avoid confusion of the control process of the interactive device.
According to an embodiment of the present application, recognizing a wakeup word based on first acoustic feature information by using an acoustic wakeup module includes: and identifying a wake-up word by utilizing an acoustic wake-up module in combination with the spatial position information and the first acoustic feature information.
Specifically, after the sound source direction is located, the first acoustic feature information may be subjected to wake-up word recognition according to the sound source direction, that is, the wake-up word is recognized by combining the spatial position information and the first acoustic feature information. For example, the acoustic feature in the sound source direction may be extracted from the first acoustic feature information, and then the wakeup word recognition may be performed.
In one implementation, the spatial position information is a three-dimensional vector (a1, a2, a3), where each dimension corresponds to one beamformed audio signal and thus to one preset sound source direction. The first acoustic feature information may be a spliced feature vector f = [f1, f2, f3, f4, f5, f6], comprising the acoustic features f1, f2, f3 of the three beamformed audio signals, the acoustic features f4 of the original audio signal, the phase difference f5, and the amplitude difference f6, where each of f1 to f6 may be a vector. The sound source direction corresponds to the maximum value ai (i = 1, 2, or 3) in the spatial position information; at this time, wake-up word recognition may be performed based on the component fi of the first acoustic feature information that corresponds to ai, that is, on the features of the preset sound source direction associated with ai. In other implementations, the spatial position information (a1, a2, a3) may be multiplied with the corresponding acoustic features to obtain second acoustic feature information, and the wake-up word is recognized based on the second acoustic feature information.
According to the voice processing method provided by the embodiment, the information positioned in the sound source direction can be extracted from the first acoustic characteristic information by combining the spatial position information to perform awakening word recognition, so that the noises in other directions can be effectively filtered, the effect of acoustic purification is achieved, and then the awakening word recognition efficiency and the awakening word recognition accuracy can be improved.
According to an embodiment of the present application, recognizing the wake-up word by using the acoustic wake-up module in combination with the spatial position information and the first acoustic feature information includes: performing gating selection on the first acoustic feature information based on the spatial position information by using the gating 360 of the acoustic wake-up module to obtain second acoustic feature information; processing the second acoustic feature information by using the acoustic wake-up module to obtain an acoustic score, wherein the acoustic wake-up module is obtained through adversarial training with a gradient reversal layer, and the gradient reversal layer points to an energy classifier; and recognizing the wake-up word according to the acoustic score.
Specifically, the spatial position information may represent the probability that each preset sound source direction is the target sound source direction. The probability that each preset sound source direction is the target sound source direction can be used as the weight of the information corresponding to each channel of audio signal in the first acoustic feature information, and the information corresponding to each channel of audio signal in the first acoustic feature information is fused according to these weights to obtain the second acoustic feature information. This is equivalent to using the spatial position information as the gating of the acoustic wake-up module to gate the first acoustic feature information. The gradient reversal layer (GRL) can point to the energy classifier and pass a reversed gradient back to the gating of the acoustic wake-up module, so that the trained acoustic wake-up module is insensitive to energy information and produces consistently distributed outputs for inputs of various energies, which alleviates the problem of low acoustic scores for quiet audio signals. Here, the energy in the energy classifier may refer to the volume level of the audio signal.
In one implementation, the spatial position information is a three-dimensional vector (a1, a2, a3), where each dimension corresponds to one beamformed audio signal and thus to one preset sound source direction. The first acoustic feature information may be a spliced feature vector f = [f1, f2, f3, f4, f5, f6], comprising the acoustic features f1, f2, f3 of the three beamformed audio signals, the acoustic features f4 of the original audio signal, the phase difference f5, and the amplitude difference f6, where each of f1 to f6 may be a vector. Using the spatial position information as the gating of the acoustic wake-up module, gating selection is performed on the first acoustic feature information to obtain the second acoustic feature information, which is the vector [a1·f1, a2·f2, a3·f3]. In other implementations, the second acoustic feature information is the vector [a1·f1, a2·f2, a3·f3, a4·f4, a5·f5, a6·f6], where the three-dimensional vector (a1, a2, a3) is the spatial position information, and a4, a5, a6 may be parameters obtained by model training or preset parameters.
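A minimal sketch of how such a gating could be realized is given below; the decomposition of the spliced feature vector into six branch features and the treatment of a4, a5, a6 as learnable parameters are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class DirectionGate(nn.Module):
    """Weights the per-branch features by the sound-source probabilities."""

    def __init__(self):
        super().__init__()
        # a4, a5, a6: weights for the original-signal features, phase difference,
        # and amplitude difference; here learned, but they could also be preset.
        self.extra_weights = nn.Parameter(torch.ones(3))

    def forward(self, branch_feats, spatial_info):
        # branch_feats: list of 6 tensors [f1, f2, f3, f4, f5, f6]
        # spatial_info: tensor (a1, a2, a3), e.g. output of the sound source classifier
        weights = torch.cat([spatial_info, self.extra_weights])  # (a1, ..., a6)
        gated = [w * f for w, f in zip(weights, branch_feats)]
        return torch.cat(gated, dim=-1)  # second acoustic feature information

# Toy usage with the dimensions discussed above (three 40-dim beams, 40-dim
# original features, 1-dim phase difference, 40-dim amplitude-difference features).
gate = DirectionGate()
feats = [torch.randn(40) for _ in range(4)] + [torch.randn(1), torch.randn(40)]
spatial = torch.softmax(torch.randn(3), dim=-1)
second_feat = gate(feats, spatial)
```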
And processing the second acoustic characteristic information by using an acoustic awakening module to obtain an acoustic score. The classification target of the acoustic wakeup module may be an acoustic unit, which may be a phoneme, syllable, or word. The acoustic score may be represented by a vector or other form. For example, the acoustic score includes probabilities of the acoustic units, and the acoustic score may be a one-dimensional or multi-dimensional vector, and the numerical value in each dimension represents the probability of the corresponding target acoustic unit. A wake word may be identified based on the acoustic score.
According to the speech processing method provided by this embodiment, the first acoustic feature information is used as the input of the acoustic wake-up module and the spatial position information is used as its gating. The acoustic wake-up module can perform gating selection on the first acoustic feature information based on this gating to obtain second acoustic feature information and recognize the wake-up word based on the second acoustic feature information, which achieves an effect of acoustic purification and improves the accuracy of wake-up word recognition. Furthermore, the acoustic wake-up module is trained with a gradient reversal layer, so that the trained acoustic wake-up module is insensitive to energy information, which effectively alleviates the problem of low acoustic scores for quiet audio signals.
According to an embodiment of the application, the method for identifying the awakening words according to the acoustic scores comprises the following steps: the acoustic score is input to a decoding network to identify the wake-up word.
Specifically, the acoustic unit may be a word, so that the wake-up word may be identified directly based on the acoustic score, for example, the target word with the highest probability may be used as the identified wake-up word, and this identification manner is simple and efficient.
Alternatively, the acoustic units may be phonemes or syllables, so that the acoustic scores may be input into a decoding network, and the decoding network may obtain the candidate word sequence according to the acoustic scores, for example, a word corresponding to a target acoustic unit with a probability greater than or equal to a threshold may be used as a candidate word, and the threshold may be set as required. Further, a confidence degree decision may be performed on each word in the candidate word sequence through the confidence degree module to obtain a confidence degree score corresponding to each word, whether the device is awakened or not is determined according to the confidence degree score, and if the device is awakened, a corresponding operation may be further performed according to the identified awakened word.
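The following simplified sketch illustrates the two-stage decision described above (candidate selection from acoustic scores, then a confidence check); the threshold values, the dictionary-based decoding, and the helper names are assumptions for illustration and stand in for a real decoding network:

```python
def detect_wake_word(acoustic_scores, unit_to_word, score_threshold=0.5,
                     confidence_threshold=0.7):
    """acoustic_scores: dict mapping acoustic unit (e.g. syllable) -> probability."""
    # Candidate words: words whose acoustic units reach the score threshold.
    candidates = [unit_to_word[u] for u, p in acoustic_scores.items()
                  if p >= score_threshold and u in unit_to_word]

    # Confidence decision: here simply the average score over the word's units.
    for word in candidates:
        units = [u for u, w in unit_to_word.items() if w == word]
        confidence = sum(acoustic_scores.get(u, 0.0) for u in units) / len(units)
        if confidence >= confidence_threshold:
            return word  # device wakes up on this word
    return None  # no wake-up

# Toy usage
scores = {"ni": 0.9, "hao": 0.8, "xiao": 0.2}
mapping = {"ni": "nihao", "hao": "nihao", "xiao": "xiaoming"}
print(detect_wake_word(scores, mapping))  # -> "nihao"
```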
Of course, other conventional methods of identifying a wake word based on an acoustic score may also be employed.
The speech processing method provided by the embodiment can improve the accuracy of the recognition result by inputting the acoustic score into the decoding network to recognize the awakening word.
Fig. 4 is a flowchart illustrating a speech processing method according to another exemplary embodiment of the present application. The embodiment of fig. 4 is an example of the embodiment of fig. 2, and the same parts are not described again to avoid repetition. The method of FIG. 4 may be performed by the speech processing model 300 of FIG. 3. As shown in fig. 4, the speech processing method includes the following.
410: first acoustic feature information is acquired based on the multi-channel audio signal.
And carrying out beam forming processing on the original audio signals acquired by the at least two audio acquisition devices according to a plurality of preset sound source directions to obtain a plurality of paths of audio signals.
Take two audio acquisition devices and three preset sound source directions as an example. The original audio signals collected by the two audio acquisition devices are subjected to beamforming processing according to the three preset sound source directions to obtain multiple channels of audio signals: a first audio signal, a second audio signal, and a third audio signal.
The first acoustic feature information fuses features of the multi-path audio signal. Specifically, as shown in fig. 3, a plurality of channels of audio signals, an original audio signal collected by any one of two audio collecting devices, a phase difference between the original audio signals collected by the two audio collecting devices, and an amplitude difference between the original audio signals collected by the two audio collecting devices may be used as the input of the speech processing model 300. Based on these inputs, the speech processing model 300 may obtain first acoustic feature information. For example, the multi-channel audio signal, the original audio signal, and the amplitude difference may be respectively input to the corresponding first feature extraction modules 310, and each first feature extraction module 310 may output a 40-dimensional filter bank feature vector. Each first feature extraction module 310 may be connected to a second feature extraction module 320, the output of the first feature extraction module 310 may be used as the input of the second feature extraction module 320, and the second feature extraction module 320 may be used to extract deep features of the input data. The output of the second feature extraction module 320 may still be a 40-dimensional feature vector.
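For concreteness, the following sketch shows one way a 40-dimensional log filter bank feature could be computed for each input branch; the choice of torchaudio and the frame parameters are assumptions, since this application does not prescribe a particular front end:

```python
import torch
import torchaudio

def fbank_features(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """waveform: (1, num_samples) single channel -> (num_frames, 40) log-mel features."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=40)(waveform)
    log_mel = torch.log(mel + 1e-6)            # (1, 40, num_frames)
    return log_mel.squeeze(0).transpose(0, 1)  # (num_frames, 40)

# Each branch (beamformed signals, original signal, amplitude difference treated as a
# signal-like input) would pass through such a front end before feature fusion.
frames = fbank_features(torch.randn(1, 16000))
print(frames.shape)  # roughly (101, 40)
```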
The phase difference may be a vector of dimension 1 for the two audio acquisition devices. The output of the second feature extraction module 320 and the phase difference may be used as the input of the feature fusion module 330, and the feature fusion module 330 may be a concat module and may splice the input feature information. The feature information (first acoustic feature information) spliced by the feature fusion module 330 is a vector with the dimension of 201.
420: and classifying the first acoustic characteristic information to obtain spatial position information, wherein the spatial position information is used for representing the direction of the sound source.
As shown in fig. 3, the third feature extraction module 340 may further process the spliced feature information, the output of the third feature extraction module 340 may be used as the input of the sound source classifier 350, and the classification result of the sound source classifier 350 may be a three-dimensional vector (spatial location information).
430: and identifying the awakening words by combining the spatial position information and the first acoustic characteristic information.
Specifically, the gating 360 of the acoustic wake-up module may perform gating selection on the first acoustic feature information based on the spatial position information to obtain second acoustic feature information; the acoustic wake-up module processes the second acoustic feature information to obtain an acoustic score; and the wake-up word is recognized according to the acoustic score. Here, the acoustic wake-up module is obtained through adversarial training with a gradient reversal layer, which points to the energy classifier. The specific process of recognizing the wake-up word based on the acoustic score can be found in the description of the above embodiment.
As shown in fig. 3, the second acoustic feature information may be used as an input to a fourth feature extraction module 370, and the fourth feature extraction module 370 may further process the second acoustic feature information (e.g., extract higher-level acoustic features) to facilitate the subsequent classification by the acoustic classifier 380. The acoustic classifier 380 may output an acoustic score. The acoustic wake-up module in this embodiment may include the gating 360, the fourth feature extraction module 370, the acoustic classifier 380, the gradient reversal layer, and the energy classifier in fig. 3. In one embodiment, the acoustic wake-up module may further include a decoding network and a confidence module; for example, the acoustic classifier 380 may be connected to the decoding network and the decoding network may be connected to the confidence module. Arranging the gradient reversal layer and the energy classifier in the acoustic wake-up module makes the acoustic wake-up module insensitive to energy information, effectively alleviates the problem of low acoustic scores for quiet audio signals, and improves the accuracy of acoustic recognition (acoustic wake-up). In other embodiments, the gradient reversal layer and the energy classifier can be omitted from the acoustic wake-up module, so that the model structure can be simplified.
Fig. 6 is a flowchart illustrating a method for training a speech processing model according to an exemplary embodiment of the present application. The method of fig. 6 may be performed by a computing device. As shown in fig. 6, the training method of the speech processing model includes the following steps.
610: and acquiring first sample acoustic characteristic information based on sample data by using a characteristic extraction module and a characteristic fusion module, wherein the sample data comprises a plurality of paths of sample audio signals, the plurality of paths of sample audio signals are acquired according to original sample audio signals acquired by at least two sample audio acquisition devices, and the first sample acoustic characteristic information fuses the characteristics of the plurality of paths of sample audio signals.
The sample audio acquisition device may be an audio acquisition device that acquires a raw sample audio signal.
620: and acquiring spatial position information corresponding to the sample data based on the first sample acoustic characteristic information by using a sound source classifier.
630: and training the voice processing model by utilizing a training module based on the spatial position information corresponding to the sample data and the sound source marking information corresponding to the sample data.
As shown in fig. 5, the speech processing model 500 may include a feature extraction module (e.g., a first feature extraction module 510 and a second feature extraction module 520), a feature fusion module 530, a sound source classifier 550, an acoustic wake-up module, and a training module (not shown). The speech processing model 500 is similar to the speech processing model 300 and the same may be referred to each other. The acoustic wakeup module may further include a gate 560, a fourth feature extraction module 570, an acoustic classifier 580, a gradient flipping layer 590, and an energy classifier 595.
Specifically, in the process of training the speech processing model, sample data may be used as input of the speech processing model, and the sound source tagging information corresponding to the sample data may be used as output of the speech processing model, so as to train the speech processing model. The sound source labeling information may be obtained by labeling sample data in advance, and may represent a target sound source direction.
After the sample data is input into the voice processing model, the voice processing model can acquire first sample acoustic feature information based on the sample data and acquire space position information corresponding to the sample data based on the first sample acoustic feature information. The speech processing model may implement a training (learning) process based on the differences between the spatial location information that it actually analyzes and the pre-labeled sound source labeling information. For example, the speech processing model may implement the training process using a discriminative loss function (e.g., cross-entropy) or other loss function.
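A minimal sketch of one such training step for the sound source localization branch, using cross-entropy as the discriminative loss (the network structure, optimizer, and dimensions are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Assumed stand-in for the localization branch: 201-dim first sample acoustic
# features -> logits over 3 preset sound source directions.
localizer = nn.Sequential(nn.Linear(201, 128), nn.ReLU(), nn.Linear(128, 3))
optimizer = torch.optim.Adam(localizer.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(first_sample_features, source_labels):
    # first_sample_features: (batch, 201); source_labels: (batch,) direction indices
    logits = localizer(first_sample_features)   # spatial position information (pre-softmax)
    loss = criterion(logits, source_labels)     # difference to sound source labeling info
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch
loss = train_step(torch.randn(8, 201), torch.randint(0, 3, (8,)))
```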
The multi-path sample audio signal is similar to the multi-path audio signal in the speech processing method described above, and the first sample acoustic feature information is similar to the first acoustic feature information in the speech processing method described above. The speech processing model obtained by the method for training the speech processing model according to the embodiment of the present application can be used to perform the speech processing method according to any of the embodiments described above, and some terms and steps in the method for training the speech processing model are similar to those in the speech processing method described above, and the similarities are mutually referred to.
The embodiment of the application provides a training method of a voice processing model, which comprises the steps of acquiring first sample acoustic characteristic information fusing the characteristics of a multi-path sample audio signal, further acquiring spatial position information corresponding to sample data based on the first sample acoustic characteristic information, and training the voice processing model based on the spatial position information corresponding to the sample data and sound source marking information corresponding to the sample data. Because the characteristics of the multi-path sample audio signals are fused to obtain the first sample acoustic characteristic information, rather than independently processing each path of sample audio signals, the training method of the speech processing model can reduce the memory occupancy rate, improve the operation speed, reduce the cascade error and improve the accuracy of sound source positioning.
According to an embodiment of the present application, the multi-channel sample audio signals are obtained by performing beamforming on original sample audio signals collected by at least two sample audio collecting devices according to a plurality of preset sound source directions.
In particular, the plurality of preset sound source directions may be a plurality of sound source directions for which the speech processing model is expected to be able to localize. For example, the original sample audio signals acquired by at least two audio acquisition devices may be directionally enhanced according to a plurality of preset sound source directions to obtain multi-path beamformed sample audio signals. Of course, other methods may also be adopted to obtain multiple paths of sample audio signals respectively enhanced in multiple preset sound source directions based on the original sample audio signals acquired by at least two audio acquisition devices.
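One common way to perform such directional enhancement is delay-and-sum beamforming; the following sketch assumes a two-microphone array with a known spacing and illustrates the idea only, rather than the specific beamforming algorithm used in this application:

```python
import numpy as np

def delay_and_sum(mic1, mic2, angle_deg, mic_distance=0.1, fs=16000, c=343.0):
    """Enhance the signal arriving from angle_deg (0 = broadside) for a 2-mic array."""
    # Time delay between the two microphones for a plane wave from this direction.
    delay = mic_distance * np.sin(np.deg2rad(angle_deg)) / c   # seconds
    shift = int(round(delay * fs))                             # samples
    aligned = np.roll(mic2, -shift)                            # align mic2 to mic1
    return 0.5 * (mic1 + aligned)

# Three preset sound source directions -> three beamformed sample audio signals.
fs = 16000
mic1, mic2 = np.random.randn(fs), np.random.randn(fs)
beams = [delay_and_sum(mic1, mic2, ang, fs=fs) for ang in (-60, 0, 60)]
```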
A plurality of default sound source directions can be set according to actual need, and the embodiment of the application does not limit the specific number and the specific direction of a plurality of default sound source directions.
According to the training method of the speech processing model provided by the embodiment, the original sample audio signals collected by at least two audio collecting devices are subjected to beam forming processing according to a plurality of preset sound source directions to obtain a plurality of paths of sample audio signals, so that the audio signals in the preset sound source directions can be enhanced, the audio signals in other directions can be inhibited, and the accuracy of model sound source positioning can be further improved. In addition, the training method of the speech processing model provided by the embodiment can reasonably set the number of the preset sound source directions according to needs, so that the adaptability of the method can be improved, and the operation speed can be improved to a certain extent.
According to an embodiment of the application, the sample data further comprises at least one of: the audio signal processing method comprises the steps of acquiring original sample audio signals acquired by one or more of at least two sample audio acquisition devices, acquiring sample phase differences between the original sample audio signals acquired by any two of the at least two sample audio acquisition devices, and acquiring sample amplitude differences between the original sample audio signals acquired by any two of the at least two sample audio acquisition devices.
Specifically, the beamformed multi-channel sample audio signals may be regarded as primary features, and the original sample audio signals acquired by the one or more audio acquisition devices may be regarded as auxiliary features. The original sample audio signal can prevent the first sample acoustic feature information from containing large errors when the beamforming algorithm behaves abnormally, and thus prevent the sound source localization result from deviating too far from the actual situation.
The sample phase difference between the original sample audio signals acquired by any two audio acquisition devices can also be regarded as an auxiliary feature. The sample phase difference between the original sample audio signals collected by two audio acquisition devices reflects the position distribution relationship between the two devices, so the sound source direction can be estimated based on the sample phase difference. The multi-channel sample audio signals contain implicit features about the sound source direction, whereas the sample phase difference contains, relative to the multi-channel sample audio signals, explicit features about the sound source direction. The sample phase difference can thus provide supplementary information for determining the sound source direction.
The sample amplitude difference between the original sample audio signals acquired by any two audio acquisition devices can likewise be regarded as an auxiliary feature. When the positions of two audio acquisition devices differ, a sample amplitude difference exists between the original sample audio signals they acquire, that is, the sample amplitude difference reflects the position distribution relationship between the two devices, and the sound source direction can be estimated based on it. The multi-channel sample audio signals contain implicit features about the sound source direction, whereas the sample amplitude difference contains, relative to the multi-channel sample audio signals, explicit features about the sound source direction. The sample amplitude difference can thus provide supplementary information for determining the sound source direction.
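To make these two auxiliary features concrete, the following sketch computes a sample phase difference and a sample amplitude difference from one frame of the two raw microphone signals; the single-frame FFT formulation and the exact definitions are assumptions for illustration:

```python
import numpy as np

def phase_and_amplitude_difference(mic1: np.ndarray, mic2: np.ndarray):
    """Return per-frequency phase difference and amplitude difference of one frame."""
    spec1, spec2 = np.fft.rfft(mic1), np.fft.rfft(mic2)
    # Inter-channel phase difference, wrapped to (-pi, pi].
    phase_diff = np.angle(spec1 * np.conj(spec2))
    # Inter-channel amplitude (magnitude) difference.
    amplitude_diff = np.abs(spec1) - np.abs(spec2)
    return phase_diff, amplitude_diff

# One 25 ms frame at 16 kHz from each sample audio acquisition device.
frame1, frame2 = np.random.randn(400), np.random.randn(400)
pd, ad = phase_and_amplitude_difference(frame1, frame2)
```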
In the training method of the speech processing model provided by this embodiment, the acoustic feature information of the first sample is obtained by combining the main feature and the auxiliary feature, so that the accuracy of model sound source localization can be further improved.
According to an embodiment of the present application, the training method further includes: obtaining acoustic information based on the first sample acoustic characteristic information by using an acoustic awakening module; and training the acoustic awakening module by using the training module according to the acoustic information and the acoustic marking information corresponding to the sample data.
Specifically, the acoustic tagging information may be tagged with respect to the sample data in advance, and may include syllables, phonemes, and/or words corresponding to the wakeup words in the sample audio signal.
The speech processing model can be used for sound source localization and further can be used for identifying awakening words. For example, the acoustic wakeup module acquires acoustic information based on the first sample acoustic feature information. The acoustic wakeup module may implement a training (learning) process based on the differences between the acoustic information that it actually analyzes and the pre-labeled acoustic labeling information. For example, the acoustic wakeup module may implement the training process using a discriminative loss function (e.g., cross entropy) or other loss function. In an embodiment, the acoustic information may be an acoustic score.
In one implementation, when the located sound source direction satisfies a preset condition, the speech processing model executes the step of training the acoustic wake-up module based on the first sample acoustic feature information and the acoustic labeling information corresponding to the sample data; otherwise, this step is not executed. Here, the preset condition may be that the sound source direction is the primary driving direction. After the wake-up word is recognized, the related operations can be executed according to the instruction corresponding to the wake-up word. The speech processing model can then execute corresponding operations according to the voice commands of the user in the primary driving direction, and does not react to the voice commands of users in other directions.
The voice processing model obtained by training through the training method can position the direction of a sound source first, and then the awakening word is recognized. The sound source positioning result is used as a prerequisite condition for awakening word identification, and the awakening word identification process is started when the sound source positioning result meets a preset condition, so that confusion of the control process of the interactive equipment can be avoided.
According to an embodiment of the present application, obtaining acoustic information based on the first sample acoustic feature information by using the acoustic wakeup module includes: and obtaining acoustic information based on the first sample acoustic characteristic information and the spatial position information corresponding to the sample data by using an acoustic awakening module.
Specifically, the acoustic wakeup module may obtain the acoustic information by combining the spatial position information corresponding to the sample data and the first sample acoustic feature information, so as to extract the acoustic information in the sound source direction from the first sample acoustic feature information. For example, the spatial position information may indicate a sound source direction, and the acoustic wakeup module is trained by extracting information related to the sound source direction from the first sample acoustic feature information. The acoustic labeling information in this embodiment may include syllables, phonemes, and/or words corresponding to the wake-up word in the audio signal in the direction of the target sound source.
The voice processing model obtained through training by the training method can be combined with the spatial position information to extract information located in the sound source direction from the first acoustic characteristic information to perform awakening word recognition, so that noise in other directions can be effectively filtered, the effect of acoustic purification is achieved, and then the awakening word recognition efficiency and the awakening word recognition accuracy can be improved.
According to an embodiment of the present application, a gradient reversal layer 590, an energy classifier 595, an acoustic classifier 580, and a gating 560 are disposed in the acoustic wake-up module, and obtaining the acoustic information based on the first sample acoustic feature information and the spatial position information corresponding to the sample data by using the acoustic wake-up module includes: obtaining second sample acoustic feature information based on the first sample acoustic feature information and the spatial position information corresponding to the sample data by using the gating 560; and obtaining the acoustic information based on the second sample acoustic feature information by using the acoustic classifier 580. The training method further includes: performing adversarial training on the acoustic wake-up module by using the gradient reversal layer 590 based on the second sample acoustic feature information and energy labeling information, wherein the gradient reversal layer points to the energy classifier 595.
Specifically, the spatial position information may represent a probability that each preset sound source direction is a target sound source direction. The probability that each preset sound source direction is the target sound source direction can be used as the weight of the information corresponding to each path of sample audio signal in the first sample acoustic feature information, and the information corresponding to each path of sample audio signal in the first sample acoustic feature information is subjected to fusion processing according to the weight to obtain the second sample acoustic feature information. The method is equivalent to using the spatial position information as the gating of the acoustic wake-up module to gate and select the first sample acoustic feature information.
The Gradient Reversal Layer 590 (GRL) may point to the energy classifier 595 and return the Gradient to the gate 560 of the acoustic wake-up module, so that the acoustic wake-up module after training is insensitive to energy information and output information corresponding to input information of various energies is distributed consistently, the problem of low acoustic score corresponding to an audio signal with a small sound can be solved, and the robustness of the model is improved. Here, the energy in the energy classifier may refer to a volume level of the audio signal. The energy labeling information may be obtained by labeling the sample data according to the volume.
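A gradient reversal layer is commonly implemented as an identity mapping in the forward pass whose backward pass negates (and optionally scales) the gradient; the following PyTorch sketch shows one such implementation, where the scaling factor and class names are assumptions:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda backward."""

    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient flows back to the gating of the acoustic wake-up module.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# The energy classifier would then be applied on top of the reversed features, e.g.:
# energy_logits = energy_classifier(grad_reverse(second_sample_features))
```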
The acoustic wakeup module further includes an acoustic classifier 580, and the acoustic classifier 580 may acquire acoustic information based on the second sample acoustic feature information. The acoustic wakeup module is trained based on the difference between the acoustic information and the acoustic tagging information, and in addition, the gradient flipping layer 590 directed to the energy classifier 595 passes the gradient back, making the acoustic wakeup module insensitive to the energy information.
The voice processing model obtained by training through the training method can achieve the effect of acoustic purification and improve the accuracy of awakening word recognition. Furthermore, the voice processing model trains the acoustic awakening module by utilizing the gradient turnover layer, so that the trained acoustic awakening module is insensitive to energy information, and the problem of low acoustic score corresponding to the audio signal with low sound can be effectively solved.
Fig. 7 is a flowchart illustrating a method for training a speech processing model according to another exemplary embodiment of the present application. The embodiment of fig. 7 is an example of the embodiment of fig. 6, and the same parts are not repeated to avoid repetition. The method of FIG. 7 may be used to train the speech processing model 500 shown in FIG. 5. As shown in fig. 7, the training method of the speech processing model includes the following steps.
710: and acquiring first sample acoustic characteristic information based on sample data, wherein the sample data comprises a plurality of paths of sample audio signals, and the first sample acoustic characteristic information fuses the characteristics of the plurality of paths of sample audio signals.
And performing beam forming processing on the original sample audio signals acquired by the at least two audio acquisition devices according to a plurality of preset sound source directions to obtain a plurality of paths of sample audio signals.
Take two audio acquisition devices and three preset sound source directions as an example. The original sample audio signals collected by the two audio acquisition devices are subjected to beamforming processing according to the three preset sound source directions to obtain multiple channels of sample audio signals: a first sample audio signal, a second sample audio signal, and a third sample audio signal. The multi-channel sample audio signals, the original sample audio signal collected by any one of the two audio acquisition devices, the sample phase difference between the original sample audio signals collected by the two audio acquisition devices, and the sample amplitude difference between the original sample audio signals collected by the two audio acquisition devices may be used as the input (sample data) of the speech processing model 500 shown in fig. 5. Based on these inputs, the speech processing model 500 may obtain the first sample acoustic feature information through the first feature extraction module 510, the second feature extraction module 520, and the feature fusion module 530.
720: and acquiring spatial position information corresponding to the sample data based on the first sample acoustic characteristic information, and training the voice processing model based on the spatial position information corresponding to the sample data and the sound source marking information corresponding to the sample data.
As shown in fig. 5, the third feature extraction module 540 may further process the spliced feature information, the output of the third feature extraction module 540 may be used as the input of the sound source classifier 550, and the classification result of the sound source classifier 550 may be a three-dimensional vector (spatial position information).
730: and training the acoustic awakening module based on the first sample acoustic characteristic information, the spatial position information corresponding to the sample data and the acoustic marking information.
An acoustic wakeup module is disposed in the speech processing model 500, and as shown in fig. 5, the acoustic wakeup module includes a gate 560, a gradient flipping layer 590, an energy classifier 595, a fourth feature extraction module 570, and an acoustic classifier 580.
The gating 560 is used for gating the first sample acoustic feature information to obtain the second sample acoustic feature information. The output of the gate 560 may serve as inputs to the gradient flipping layer 590 and the fourth feature extraction module 570. The fourth feature extraction module 570 may extract higher-level acoustic features based on the second sample acoustic feature information. Based on the output of the fourth feature extraction module 570, the acoustic classifier 580 may obtain acoustic information. The acoustic wakeup module is trained based on the difference between the acoustic information and the acoustic tagging information, and furthermore, the gradient flipping layer 590 directed to the energy classifier 595 may pass the gradient back to the gate 560 of the acoustic wakeup module, rendering the acoustic wakeup module insensitive to the energy information (e.g., rendering the acoustic wakeup module blind to the energy information).
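Putting the branches together, the overall training objective could combine the three classification losses roughly as follows; the weighting factors, and the assumption that all three branches use cross-entropy, are illustrative only, and energy_logits are assumed to be computed on features passed through a gradient reversal layer such as the one sketched above:

```python
import torch
import torch.nn.functional as F

def total_loss(source_logits, source_labels,
               acoustic_logits, acoustic_labels,
               energy_logits, energy_labels,
               w_source=1.0, w_acoustic=1.0, w_energy=0.1):
    # Sound source localization branch (sound source labeling information).
    l_source = F.cross_entropy(source_logits, source_labels)
    # Acoustic wake-up branch (acoustic labeling information).
    l_acoustic = F.cross_entropy(acoustic_logits, acoustic_labels)
    # Energy branch: energy_logits come from grad-reversed features, so minimizing
    # this loss pushes the gating to become insensitive to energy information.
    l_energy = F.cross_entropy(energy_logits, energy_labels)
    return w_source * l_source + w_acoustic * l_acoustic + w_energy * l_energy

# Toy usage: batch of 4, 3 directions, 100 acoustic units, 2 energy levels.
loss = total_loss(torch.randn(4, 3), torch.randint(0, 3, (4,)),
                  torch.randn(4, 100), torch.randint(0, 100, (4,)),
                  torch.randn(4, 2), torch.randint(0, 2, (4,)))
```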
The third feature extraction module 540 and the sound source classifier 550 in the speech processing model 500 are used for sound source localization, and the acoustic wake-up module is used for obtaining an acoustic score or recognizing a wake-up word. The speech processing model 500 provided by the embodiment of the application has the capability of processing multi-channel input, and can avoid the problem of high computation amount caused by the fact that the input of each channel needs to be processed independently in the traditional method; moreover, the speech processing model 500 can learn the sequence information, the energy information and the spatial position information among all the inputs, and improve the robustness of the model and the accuracy of sound source positioning and acoustic recognition.
The gradient reversal layer 590 of the speech processing model 500 may be active during the training process and inactive during the testing process. For example, the speech processing model 300 shown in fig. 3 corresponds to the testing stage, in which the gradient reversal layer and the energy classifier may be inactive; they are therefore not shown in fig. 3.
Exemplary devices
Fig. 8 is a schematic structural diagram of a speech processing apparatus 800 according to an exemplary embodiment of the present application. As shown in fig. 8, the speech processing apparatus 800 includes: an acquisition module 810 and a classification module 820.
The obtaining module 810 is configured to obtain first acoustic feature information based on multiple audio signals, where the first acoustic feature information is combined with features of the multiple audio signals, and the multiple audio signals are obtained according to original audio signals collected by at least two audio collecting devices. The classification module 820 is configured to perform classification processing on the first acoustic feature information to obtain spatial location information, where the spatial location information is used to characterize a sound source direction.
The embodiment of the present application provides a speech processing apparatus, which obtains multiple channels of audio signals according to original audio signals collected by at least two audio acquisition devices, fuses the features of the multiple channels of audio signals to obtain the first acoustic feature information corresponding to the multiple channels of audio signals, and then classifies the first acoustic feature information to obtain spatial position information that can characterize the sound source direction. Because the features of the multiple channels of audio signals are fused to obtain the first acoustic feature information, rather than each channel of audio signal being processed independently, the speech processing apparatus can reduce memory occupancy, increase operation speed, reduce cascade errors, and improve the accuracy of sound source localization.
According to an embodiment of the present application, the multi-channel audio signal is obtained by performing beamforming on the original audio signals acquired by at least two audio acquisition devices according to a plurality of preset sound source directions.
According to an embodiment of the present application, the obtaining module 810 is configured to obtain first acoustic feature information based on a plurality of audio signals and an original audio signal collected by one or more of at least two audio collecting devices.
According to an embodiment of the present application, the obtaining module 810 is configured to obtain first acoustic feature information based on a phase difference between a plurality of audio signals and original audio signals collected by any two audio collecting devices of at least two audio collecting devices.
According to an embodiment of the present application, the obtaining module 810 is configured to obtain the first acoustic feature information based on an amplitude difference between the multiple audio signals and original audio signals collected by any two audio collecting devices of the at least two audio collecting devices.
According to an embodiment of the present application, the obtaining module 810 is configured to obtain the first acoustic feature information based on a multi-channel audio signal, an original audio signal collected by one or more of the at least two audio collecting devices, a phase difference between original audio signals collected by any two of the at least two audio collecting devices, and an amplitude difference between original audio signals collected by any two of the at least two audio collecting devices.
According to an embodiment of the present application, the speech processing apparatus 800 further includes a recognition module 830 for recognizing the wake-up word based on the first acoustic feature information.
According to an embodiment of the present application, the recognition module 830 is configured to recognize the wake-up word by combining the spatial location information and the first acoustic feature information.
According to an embodiment of the present application, the identifying module 830 is configured to: use the spatial position information as the gating of the acoustic wake-up module to perform gating selection on the first acoustic feature information to obtain second acoustic feature information; process the second acoustic feature information by using the acoustic wake-up module to obtain an acoustic score, wherein the acoustic wake-up module is obtained through adversarial training with a gradient reversal layer, and the gradient reversal layer points to the energy classifier; and recognize the wake-up word according to the acoustic score.
According to an embodiment of the present application, the recognition module 830 is configured to input the acoustic score into the decoding network to recognize the wake-up word.
It should be understood that, for the operations and functions of the obtaining module 810, the classifying module 820 and the identifying module 830 in the foregoing embodiments, reference may be made to the description of the speech processing method provided in the foregoing embodiment of fig. 2 or fig. 4, and in order to avoid repetition, the description is not repeated here.
Fig. 9 is a schematic structural diagram of a training apparatus 900 for a speech processing model according to an exemplary embodiment of the present application. As shown in fig. 9, the training apparatus 900 includes: a first acquisition module 910, a second acquisition module 920, and a training module 930.
The first obtaining module 910 is configured to obtain first sample acoustic feature information based on sample data, where the sample data includes multiple paths of sample audio signals, the multiple paths of sample audio signals are obtained according to original sample audio signals collected by at least two sample audio collecting devices, and the first sample acoustic feature information is fused with features of the multiple paths of sample audio signals. The second obtaining module 920 is configured to obtain spatial location information corresponding to the sample data based on the first sample acoustic feature information. The training module 930 is configured to train the voice processing model based on the spatial location information corresponding to the sample data and the sound source labeling information corresponding to the sample data.
The embodiment of the application provides a training device of a voice processing model, which is used for acquiring first sample acoustic characteristic information fusing the characteristics of a multi-path sample audio signal, further acquiring space position information corresponding to sample data based on the first sample acoustic characteristic information, and training the voice processing model based on the space position information corresponding to the sample data and sound source marking information corresponding to the sample data. Because the characteristics of the multi-path sample audio signals are fused to obtain the first sample acoustic characteristic information, rather than independently processing each path of sample audio signals, the training device of the speech processing model can reduce the memory occupancy rate, improve the operation speed, reduce the cascade error and improve the accuracy of sound source positioning.
According to an embodiment of the present application, the multi-channel sample audio signals are obtained by performing beamforming on original sample audio signals collected by at least two sample audio collecting devices according to a plurality of preset sound source directions.
According to an embodiment of the application, the sample data further comprises at least one of: the audio signal processing method comprises the steps of acquiring original sample audio signals acquired by one or more of at least two sample audio acquisition devices, acquiring sample phase differences between the original sample audio signals acquired by any two of the at least two sample audio acquisition devices, and acquiring sample amplitude differences between the original sample audio signals acquired by any two of the at least two sample audio acquisition devices.
According to an embodiment of the present application, an acoustic wakeup module is disposed in the voice processing model, and the training module 930 is further configured to train the acoustic wakeup module based on the first sample acoustic feature information and the acoustic label information corresponding to the sample data.
According to an embodiment of the present application, the training module 930 is configured to train the acoustic wakeup module based on the first sample acoustic feature information, the spatial position information corresponding to the sample data, and the acoustic labeling information.
According to an embodiment of the present application, a gradient reversal layer is disposed in the acoustic wake-up module, and the training module 930 is configured to: acquire second sample acoustic feature information based on the first sample acoustic feature information and the spatial position information corresponding to the sample data; and perform adversarial training on the acoustic wake-up module by using the gradient reversal layer based on the second sample acoustic feature information and the acoustic labeling information, wherein the gradient reversal layer points to the energy classifier.
It should be understood that, in the above embodiment, the operations and functions of the first obtaining module 910, the second obtaining module 920, and the training module 930 may refer to the description in the training method of the speech processing model provided in the above embodiment of fig. 6 or fig. 7, and are not described herein again to avoid repetition.
Fig. 10 is a block diagram of an electronic device 1000 for performing a speech processing method or a training method of a speech processing model according to an exemplary embodiment of the present application.
Referring to fig. 10, electronic device 1000 includes a processing component 1010 that further includes one or more processors, and memory resources, represented by memory 1020, for storing instructions, such as application programs, that are executable by processing component 1010. The application programs stored in memory 1020 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1010 is configured to execute instructions to perform the speech processing method or the training method of the speech processing model described above.
The electronic device 1000 may also include a power supply component configured to perform power management of the electronic device 1000, a wired or wireless network interface configured to connect the electronic device 1000 to a network, and an input-output (I/O) interface. The electronic device 1000 may be operated based on an operating system stored in the memory 1020, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
A non-transitory computer readable storage medium, wherein instructions of the storage medium, when executed by a processor of the electronic device 1000, enable the electronic device 1000 to perform a speech processing method or a training method of a speech processing model. The voice processing method comprises the following steps: acquiring first acoustic characteristic information based on a plurality of paths of audio signals, wherein the first acoustic characteristic information is fused with the characteristics of the plurality of paths of audio signals, and the plurality of paths of audio signals are acquired according to original audio signals acquired by at least two audio acquisition devices; and classifying the first acoustic characteristic information to obtain spatial position information, wherein the spatial position information is used for representing the direction of the sound source. The training method of the speech processing model comprises the following steps: acquiring first sample acoustic characteristic information based on sample data, wherein the sample data comprises a plurality of paths of sample audio signals, the plurality of paths of sample audio signals are acquired according to original sample audio signals acquired by at least two sample audio acquisition devices, and the first sample acoustic characteristic information is fused with the characteristics of the plurality of paths of sample audio signals; acquiring space position information corresponding to sample data based on the first sample acoustic characteristic information; and training the voice processing model based on the spatial position information corresponding to the sample data and the sound source marking information corresponding to the sample data.
All the above optional technical solutions can be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, or the portions thereof that substantially contribute to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, in the description of the present application, the terms "first", "second", "third", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modifications, equivalents and the like that are within the spirit and principle of the present application should be included in the scope of the present application.

Claims (20)

1. A method of speech processing, comprising:
acquiring first acoustic characteristic information based on a plurality of paths of audio signals, wherein the first acoustic characteristic information fuses characteristics of the plurality of paths of audio signals, and the plurality of paths of audio signals are acquired according to original audio signals acquired by at least two audio acquisition devices;
and classifying the first acoustic characteristic information to obtain spatial position information, wherein the spatial position information is used for representing the direction of a sound source.
2. The speech processing method of claim 1, wherein the speech processing method is applied to a speech processing model, and wherein the speech processing model comprises: the device comprises a feature extraction module, a feature fusion module and an acoustic awakening module.
3. The speech processing method according to claim 2, wherein said obtaining first acoustic feature information based on the multiple audio signals comprises:
extracting feature information corresponding to the multi-channel audio signals and feature information corresponding to original audio signals acquired by one or more of the at least two audio acquisition devices by using the feature extraction module;
and fusing the feature information corresponding to the multi-channel audio signals and the feature information corresponding to the original audio signals by using the feature fusion module to acquire the first acoustic feature information.
4. The speech processing method according to claim 2, wherein said obtaining first acoustic feature information based on the multiple audio signals comprises:
extracting the characteristic information corresponding to the multi-channel audio signals by using the characteristic extraction module;
and fusing the characteristic information corresponding to the multi-channel audio signal and the phase difference between the original audio signals acquired by any two audio acquisition devices in the at least two audio acquisition devices by using the characteristic fusion module to acquire the first acoustic characteristic information.
5. The speech processing method according to claim 2, wherein the obtaining first acoustic feature information based on the multiple channels of audio signals comprises:
extracting, by using the feature extraction module, feature information corresponding to the multiple channels of audio signals and feature information corresponding to an amplitude difference between original audio signals acquired by any two of the at least two audio acquisition devices;
and fusing, by using the feature fusion module, the feature information corresponding to the multiple channels of audio signals and the feature information corresponding to the amplitude difference to obtain the first acoustic feature information.
6. The speech processing method according to claim 2, wherein the obtaining first acoustic feature information based on the multiple channels of audio signals comprises:
extracting, by using the feature extraction module, feature information corresponding to the multiple channels of audio signals, feature information corresponding to original audio signals acquired by one or more of the at least two audio acquisition devices, and feature information corresponding to an amplitude difference between the original audio signals acquired by any two of the at least two audio acquisition devices;
and fusing, by using the feature fusion module, the feature information corresponding to the multiple channels of audio signals, the feature information corresponding to the original audio signals, the feature information corresponding to the amplitude difference, and a phase difference between the original audio signals acquired by any two of the at least two audio acquisition devices to obtain the first acoustic feature information.
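For illustration only (not part of the claims): a minimal NumPy sketch of how the inter-device phase difference and amplitude difference referenced in claims 4 to 6 might be computed and fused with channel features. The STFT framing parameters, the log-magnitude features, and fusion by concatenation are assumptions made for this sketch.

```python
import numpy as np

def stft(x, frame=400, hop=160):
    """Short-time Fourier transform of a 1-D signal (Hann window)."""
    frames = [x[i:i + frame] * np.hanning(frame) for i in range(0, len(x) - frame, hop)]
    return np.fft.rfft(np.stack(frames), axis=1)          # (num_frames, num_bins)

def phase_and_amplitude_difference(mic_a, mic_b):
    """Per-frame inter-device phase difference and log-amplitude difference."""
    spec_a, spec_b = stft(mic_a), stft(mic_b)
    phase_diff = np.angle(spec_a) - np.angle(spec_b)
    amp_diff = np.log(np.abs(spec_a) + 1e-6) - np.log(np.abs(spec_b) + 1e-6)
    return phase_diff, amp_diff

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mic_a, mic_b = rng.standard_normal(16000), rng.standard_normal(16000)
    pd, ad = phase_and_amplitude_difference(mic_a, mic_b)
    channel_feats = np.log(np.abs(stft(mic_a)) + 1e-6)    # stand-in for beamformed channel features
    first_acoustic_feature = np.concatenate([channel_feats, pd, ad], axis=1)  # fusion by concatenation
    print(first_acoustic_feature.shape)                   # (num_frames, 3 * num_bins)
```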
7. The speech processing method of claim 2, further comprising:
and recognizing a wake-up word based on the first acoustic feature information by using the acoustic wake-up module.
8. The method of claim 7, wherein the recognizing, with the acoustic wake-up module, a wake-up word based on the first acoustic feature information comprises:
and recognizing the wake-up word by using the acoustic wake-up module in combination with the spatial position information and the first acoustic feature information.
9. The method of claim 8, wherein the recognizing the wake-up word by using the acoustic wake-up module in combination with the spatial position information and the first acoustic feature information comprises:
performing gating selection on the first acoustic feature information based on the spatial position information by using a gating of the acoustic wake-up module to obtain second acoustic feature information;
processing the second acoustic feature information by using the acoustic wake-up module to obtain an acoustic score, wherein the acoustic wake-up module is obtained through adversarial training using a gradient reversal layer, and the gradient reversal layer points to an energy classifier;
and recognizing the wake-up word according to the acoustic score.
10. The method of claim 9, wherein the recognizing the wake-up word according to the acoustic score comprises:
inputting the acoustic score into a decoding network to recognize the wake-up word.
11. The speech processing method according to any one of claims 1 to 10, wherein the multiple channels of audio signals are obtained by beamforming, according to a plurality of preset sound source directions, the original audio signals acquired by the at least two audio acquisition devices.
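For illustration only: a minimal Python sketch of the processing flow recited in claims 1 to 6 and 11, under assumed conditions (a two-microphone array, delay-and-sum beamforming toward a handful of preset directions, log-spectral features as the "feature information", fusion by concatenation, and an untrained linear classifier standing in for the sound source classifier). None of these specific choices are stated in the claims.

```python
import numpy as np

SAMPLE_RATE = 16000
MIC_SPACING_M = 0.05                                  # assumed spacing of the two microphones
SOUND_SPEED = 343.0                                   # speed of sound in m/s
PRESET_DIRECTIONS_DEG = [0.0, 60.0, 120.0, 180.0]     # assumed preset sound source directions

def delay_and_sum(mic_signals, direction_deg):
    """Steer a two-microphone array toward one preset direction (fractional delay via FFT)."""
    delay_s = MIC_SPACING_M * np.cos(np.deg2rad(direction_deg)) / SOUND_SPEED
    n = mic_signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / SAMPLE_RATE)
    delayed = np.fft.irfft(np.fft.rfft(mic_signals[1]) * np.exp(-2j * np.pi * freqs * delay_s), n)
    return 0.5 * (mic_signals[0] + delayed)

def spectral_features(signal, frame=400, hop=160):
    """Stand-in for the feature extraction module: per-frame log magnitude spectra."""
    frames = [signal[i:i + frame] * np.hanning(frame) for i in range(0, len(signal) - frame, hop)]
    return np.log(np.abs(np.fft.rfft(np.stack(frames), axis=1)) + 1e-6)

def fuse_features(per_channel_feats):
    """Stand-in for the feature fusion module: concatenate channel features per frame."""
    return np.concatenate(per_channel_feats, axis=1)

def classify_direction(fused, weights):
    """Stand-in for the sound source classifier: pool over frames, linear projection, argmax."""
    logits = fused.mean(axis=0) @ weights
    return int(np.argmax(logits))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    raw = rng.standard_normal((2, SAMPLE_RATE))       # original audio signals from two devices
    beams = [delay_and_sum(raw, d) for d in PRESET_DIRECTIONS_DEG]   # multiple channels of audio signals
    fused = fuse_features([spectral_features(b) for b in beams])     # first acoustic feature information
    weights = rng.standard_normal((fused.shape[1], len(PRESET_DIRECTIONS_DEG)))  # untrained stand-in
    print("estimated direction (deg):", PRESET_DIRECTIONS_DEG[classify_direction(fused, weights)])
```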
12. The speech processing method of claim 2, wherein the speech processing model further comprises a sound source classifier and a training module, and the speech processing model is obtained by a training method comprising:
acquiring first sample acoustic feature information based on sample data by using the feature extraction module and the feature fusion module, wherein the sample data comprises multiple channels of sample audio signals, the multiple channels of sample audio signals are obtained from original sample audio signals acquired by at least two sample audio acquisition devices, and the first sample acoustic feature information fuses features of the multiple channels of sample audio signals;
acquiring spatial position information corresponding to the sample data based on the first sample acoustic feature information by using the sound source classifier;
and training the speech processing model by using the training module based on the spatial position information corresponding to the sample data and sound source label information corresponding to the sample data.
13. The speech processing method according to claim 12, wherein the sample data further comprises at least one of: original sample audio signals acquired by one or more of the at least two sample audio acquisition devices, a sample phase difference between the original sample audio signals acquired by any two of the at least two sample audio acquisition devices, and a sample amplitude difference between the original sample audio signals acquired by any two of the at least two sample audio acquisition devices.
14. The speech processing method of claim 12, wherein the training method further comprises:
obtaining acoustic information based on the first sample acoustic feature information by using the acoustic wake-up module;
and training the acoustic wake-up module by using the training module according to the acoustic information and acoustic label information corresponding to the sample data.
15. The method of claim 14, wherein the obtaining acoustic information based on the first sample acoustic feature information with the acoustic wake-up module comprises:
and obtaining the acoustic information by using the acoustic wake-up module based on the first sample acoustic feature information and the spatial position information corresponding to the sample data.
16. The speech processing method according to claim 15, wherein a gradient reversal layer, an energy classifier, an acoustic classifier and a gating are disposed in the acoustic wake-up module, and the obtaining the acoustic information by using the acoustic wake-up module based on the first sample acoustic feature information and the spatial position information corresponding to the sample data comprises:
acquiring second sample acoustic feature information by using the gating based on the first sample acoustic feature information and the spatial position information corresponding to the sample data;
deriving the acoustic information based on the second sample acoustic feature information with the acoustic classifier,
wherein the training method further comprises:
and performing adversarial training on the acoustic wake-up module by using the gradient reversal layer based on the second sample acoustic feature information and energy label information, wherein the gradient reversal layer points to the energy classifier.
17. The speech processing method according to any one of claims 12 to 16, wherein the multiple channels of sample audio signals are obtained by beamforming, according to a plurality of preset sound source directions, the original sample audio signals acquired by the at least two sample audio acquisition devices.
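For illustration only: a minimal PyTorch sketch (the patent names no framework) of the gating selection driven by spatial position information and of the adversarial branch in which a gradient reversal layer points to an energy classifier, as recited in claims 9 and 16. Module names, layer sizes, and the sigmoid gate are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None

class AcousticWakeupSketch(nn.Module):
    def __init__(self, feat_dim=80, num_directions=4, num_phones=100, num_energy_bins=3):
        super().__init__()
        self.gate = nn.Linear(feat_dim + num_directions, feat_dim)       # gating
        self.acoustic_classifier = nn.Linear(feat_dim, num_phones)
        self.energy_classifier = nn.Linear(feat_dim, num_energy_bins)

    def forward(self, first_feats, direction_probs, grl_alpha=1.0):
        # Gating selection: combine features with spatial position information.
        gate = torch.sigmoid(self.gate(torch.cat([first_feats, direction_probs], dim=-1)))
        second_feats = gate * first_feats                # "second acoustic feature information"
        acoustic_scores = self.acoustic_classifier(second_feats)
        # Adversarial branch: gradient reversal layer feeding the energy classifier.
        energy_logits = self.energy_classifier(GradientReversal.apply(second_feats, grl_alpha))
        return acoustic_scores, energy_logits

# Toy usage: one training step with acoustic labels and energy labels.
model = AcousticWakeupSketch()
feats = torch.randn(8, 20, 80)                           # (batch, frames, feat_dim)
dirs = torch.softmax(torch.randn(8, 20, 4), dim=-1)      # spatial position information per frame
ac_tgt = torch.randint(0, 100, (8, 20))
en_tgt = torch.randint(0, 3, (8, 20))
ac_scores, en_logits = model(feats, dirs)
loss = nn.CrossEntropyLoss()(ac_scores.flatten(0, 1), ac_tgt.flatten()) \
     + nn.CrossEntropyLoss()(en_logits.flatten(0, 1), en_tgt.flatten())
loss.backward()
```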
18. A training method for a speech processing model, wherein the speech processing model comprises a feature extraction module, a feature fusion module, a sound source classifier and a training module, and the training method comprises:
acquiring first sample acoustic feature information based on sample data by using the feature extraction module and the feature fusion module, wherein the sample data comprises multiple channels of sample audio signals, the multiple channels of sample audio signals are obtained from original sample audio signals acquired by at least two sample audio acquisition devices, and the first sample acoustic feature information fuses features of the multiple channels of sample audio signals;
acquiring spatial position information corresponding to the sample data based on the first sample acoustic feature information by using the sound source classifier;
and training the speech processing model by using the training module based on the spatial position information corresponding to the sample data and sound source label information corresponding to the sample data.
19. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions,
wherein the processor is configured to perform the speech processing method of any one of the preceding claims 1 to 17 or the training method of the speech processing model of claim 18.
20. A computer-readable storage medium, wherein the storage medium stores a computer program for executing the speech processing method of any one of claims 1 to 17 or the training method of the speech processing model of claim 18.
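For illustration only: a minimal PyTorch sketch of the training flow of claims 12 and 18, in which fused sample features are classified into a sound source direction and the model is updated against sound source label information. The layer sizes, feature dimensions, optimizer, and number of steps are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class SpeechProcessingModelSketch(nn.Module):
    def __init__(self, per_channel_dim=80, num_channels=4, num_directions=4):
        super().__init__()
        self.feature_extractor = nn.Linear(per_channel_dim, 64)       # per-channel feature extraction
        self.feature_fusion = nn.Linear(64 * num_channels, 128)       # fuse all channels
        self.sound_source_classifier = nn.Linear(128, num_directions)

    def forward(self, multi_channel_feats):
        # multi_channel_feats: (batch, num_channels, per_channel_dim)
        extracted = torch.relu(self.feature_extractor(multi_channel_feats))
        fused = torch.relu(self.feature_fusion(extracted.flatten(1)))  # first sample acoustic features
        return self.sound_source_classifier(fused)                     # spatial position logits

model = SpeechProcessingModelSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

sample_feats = torch.randn(16, 4, 80)            # features of multiple channels of sample audio signals
source_labels = torch.randint(0, 4, (16,))       # sound source label information

for _ in range(3):                               # "training module": a few gradient steps
    optimizer.zero_grad()
    loss = criterion(model(sample_feats), source_labels)
    loss.backward()
    optimizer.step()
```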
CN202111675104.9A 2021-12-31 2021-12-31 Speech processing method, speech processing model training method, apparatus and medium Pending CN114242066A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111675104.9A CN114242066A (en) 2021-12-31 2021-12-31 Speech processing method, speech processing model training method, apparatus and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111675104.9A CN114242066A (en) 2021-12-31 2021-12-31 Speech processing method, speech processing model training method, apparatus and medium

Publications (1)

Publication Number Publication Date
CN114242066A true CN114242066A (en) 2022-03-25

Family

ID=80745425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111675104.9A Pending CN114242066A (en) 2021-12-31 2021-12-31 Speech processing method, speech processing model training method, apparatus and medium

Country Status (1)

Country Link
CN (1) CN114242066A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115356682A (en) * 2022-08-21 2022-11-18 嘉晨云控新能源(上海)有限公司 Sound source position sensing device and method based on accurate positioning
CN115512692A (en) * 2022-11-04 2022-12-23 腾讯科技(深圳)有限公司 Voice recognition method, device, equipment and storage medium
CN115512692B (en) * 2022-11-04 2023-02-28 腾讯科技(深圳)有限公司 Voice recognition method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN110852215B (en) Multi-mode emotion recognition method and system and storage medium
CN110838289A (en) Awakening word detection method, device, equipment and medium based on artificial intelligence
Zhou et al. A compact representation of visual speech data using latent variables
CN111564164A (en) Multi-mode emotion recognition method and device
CN108922564B (en) Emotion recognition method and device, computer equipment and storage medium
CN104036774A (en) Method and system for recognizing Tibetan dialects
CN114242066A (en) Speech processing method, speech processing model training method, apparatus and medium
US11443757B2 (en) Artificial sound source separation method and device of thereof
KR20210052036A (en) Apparatus with convolutional neural network for obtaining multiple intent and method therof
US20220335950A1 (en) Neural network-based signal processing apparatus, neural network-based signal processing method, and computer-readable storage medium
EP4310838A1 (en) Speech wakeup method and apparatus, and storage medium and system
KR102192678B1 (en) Apparatus and method for normalizing input data of acoustic model, speech recognition apparatus
US11238289B1 (en) Automatic lie detection method and apparatus for interactive scenarios, device and medium
CN107610720B (en) Pronunciation deviation detection method and device, storage medium and equipment
CN110569908B (en) Speaker counting method and system
CN111914803A (en) Lip language keyword detection method, device, equipment and storage medium
CN116312512A (en) Multi-person scene-oriented audiovisual fusion wake-up word recognition method and device
CN117063229A (en) Interactive voice signal processing method, related equipment and system
US11775617B1 (en) Class-agnostic object detection
CN114611546A (en) Multi-mobile sound source positioning method and system based on space and frequency spectrum time sequence information modeling
CN114566156A (en) Keyword speech recognition method and device
CN114664288A (en) Voice recognition method, device, equipment and storage medium
Dennis et al. Generalized Hough transform for speech pattern classification
CN116705013B (en) Voice wake-up word detection method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230510

Address after: 230026 No. 96, Jinzhai Road, Hefei, Anhui

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: No. 666, Wangjiang West Road, High-tech Zone, Hefei City, Anhui Province

Applicant before: IFLYTEK Co.,Ltd.