WO2021237740A1 - Voice signal processing method and related device - Google Patents
Voice signal processing method and related device
- Publication number
- WO2021237740A1 (PCT application PCT/CN2020/093523)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- signal
- user
- voice
- sensor
- vibration
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/10—Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
Definitions
- This application relates to the field of audio processing, and in particular to a voice signal processing method and related equipment.
- Human-computer interaction mainly studies the information exchange between humans and computers, and mainly includes two parts: the exchange of information from human to computer and from computer to human. It is a comprehensive discipline closely related to cognitive psychology, ergonomics, multimedia technology, and virtual reality technology.
- Multi-modal interaction devices are interactive devices that support multiple interaction modes, such as voice interaction, somatosensory interaction, and touch interaction, in parallel.
- Human-computer interaction based on multi-modal interactive devices collects user information through multiple tracking modules (face, gesture, posture, voice, and rhythm) in the interactive device and forms a virtual user expression module after understanding, processing, and management; interactive dialogue with the computer can then greatly enhance the user's interactive experience.
- voice interaction system of the smart device completes the user's instructions by recognizing the user's voice.
- A microphone is usually used to pick up the audio signal in the environment, where the audio signal is a mixed signal of the environment.
- Besides the target user's voice, the mixed signal contains other signals such as environmental noise and other people's voices.
- In order to extract the voice signal of a certain user from the mixed signal, a blind separation method can be adopted, which is essentially a statistical method for separating sound sources; it is therefore limited by the underlying modeling assumptions, and achieving robustness is very challenging.
- this application provides a voice signal processing method, the method including:
- The user voice signal should not be understood as containing only the words spoken by the user, but as a collected signal that includes the user's voice.
- "The voice signal including environmental noise" can be understood to mean that both the talking user and other environmental noise (such as other people talking) are present in the environment.
- The collected voice signal therefore includes the user's voice and environmental noise intertwined with each other.
- The relationship between the speech signal and the environmental noise should not be understood as a simple superposition; that is, the environmental noise should not be regarded as an independent signal inside the speech signal.
- The vibration signal is used to indicate the vibration characteristics of the user's body part; the body part is a part that vibrates correspondingly, based on the vocalization behavior, when the user is in the vocal state;
- The vibration signal corresponding to the voice uttered by the user may be obtained through video extraction.
- Target voice information is obtained.
- An embodiment of the present application provides a voice signal processing method, including: acquiring a user's voice signal collected by a sensor, the voice signal including environmental noise; acquiring a vibration signal corresponding to the voice when the user makes the voice, wherein the vibration signal is used to indicate the vibration characteristics of a body part of the user, and the body part is a part that vibrates correspondingly based on the vocalization behavior when the user is in the vocal state; and obtaining target voice information according to the vibration signal and the user voice signal collected by the sensor.
- The vibration signal is used as a basis for speech recognition. Since the vibration signal does not include external non-user voices mixed in through complex acoustic transmission, it is less affected by other environmental noise (such as reverberation); by suppressing this part of the noise interference relatively well, a better speech recognition effect can be achieved.
- the vibration signal is used to indicate a vibration feature corresponding to the vibration generated by the user uttering the voice.
- the body part includes at least one of the following: top of the skull, face, throat, or neck.
- The acquiring of the vibration signal corresponding to the voice made by the user includes: acquiring a video frame that includes the user; and extracting, according to the video frame, the vibration signal corresponding to the voice made by the user.
- the video frame is collected by a dynamic vision sensor and/or a high-speed camera.
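The embodiment does not spell out a specific extraction algorithm at this point, so the following is only a minimal illustrative sketch: it treats the frame-to-frame change of mean intensity inside a hypothetical facial region of interest as a proxy for the skin-surface vibration caused by vocalization. A real system built on a dynamic vision sensor or high-speed camera would more likely work from DVS event streams or motion-magnification techniques; the function name, ROI coordinates, and frame format below are assumptions.

```python
import numpy as np

def vibration_signal_from_frames(frames, roi):
    """Extract a 1-D vibration trace from a stack of video frames.

    frames: iterable of 2-D numpy arrays (grayscale frames from a
            high-speed camera or frames reconstructed from DVS events).
    roi:    (row_start, row_end, col_start, col_end) of a region such
            as the throat or cheek (hypothetical coordinates).
    Returns one sample per frame interval.
    """
    r0, r1, c0, c1 = roi
    means = np.array([f[r0:r1, c0:c1].astype(np.float64).mean() for f in frames])
    # Frame-to-frame change of the mean ROI intensity serves as a crude
    # proxy for the skin-surface vibration caused by vocalization.
    trace = np.diff(means)
    # Remove the DC component so only the oscillatory part remains.
    return trace - trace.mean()

# Example with synthetic 64x64 frames standing in for a high-frame-rate clip.
frames = [np.random.rand(64, 64) for _ in range(1000)]
vib = vibration_signal_from_frames(frames, roi=(20, 40, 20, 40))
```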
- the obtaining target voice information according to the vibration signal and the user voice signal collected by the sensor includes: obtaining a corresponding target audio signal according to the vibration signal;
- the target audio signal is filtered from the user voice signal collected by the sensor to obtain the signal to be filtered;
- the signal to be filtered is filtered from the user voice signal collected by the sensor to obtain the target voice information.
- the corresponding target audio signal can be recovered according to the vibration signal, and based on filtering, the target audio signal is filtered from the audio signal to obtain a noise signal.
- The filtered signal z'(n) does not contain the useful signal x'(n); it is essentially the external noise other than the user's target audio signal s(n). Optionally, if multiple cameras (DVS, high-speed cameras, etc.) pick up the vibrations of more than one person, the target audio signals x1'(n), x2'(n), x3'(n), x4'(n) recovered from these vibrations are, according to the above-mentioned adaptive filtering method, sequentially filtered out of the mixed audio signal z(n), yielding a mixed audio signal z'(n) from which the audio components of x1'(n), x2'(n), x3'(n), and x4'(n) have been removed.
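As a concrete illustration of the two-stage filtering described above, the sketch below uses a normalized least-mean-squares (NLMS) adaptive filter; the text does not fix a particular filter type, so the filter choice, tap count, and step size are assumptions. The signal names follow the description: z(n) is the mixed audio picked up by the sensor, x'(n) (here `x_hat`) is the target audio recovered from the vibration signal, z'(n) is the noise estimate obtained after cancelling x'(n), and the final residual approximates the user's target voice.

```python
import numpy as np

def nlms_cancel(reference, mixture, taps=64, mu=0.5, eps=1e-8):
    """Adaptively cancel `reference` from `mixture` using NLMS.

    Returns the residual after the component correlated with
    `reference` has been removed from `mixture`.
    """
    w = np.zeros(taps)
    out = np.zeros_like(mixture, dtype=np.float64)
    for n in range(len(mixture)):
        x = reference[max(0, n - taps + 1):n + 1][::-1]
        x = np.pad(x, (0, taps - len(x)))
        y = w @ x                        # estimate of the reference's contribution
        e = mixture[n] - y               # residual after cancellation
        w += mu * e * x / (x @ x + eps)  # NLMS weight update
        out[n] = e
    return out

def two_stage_enhancement(z, x_hat):
    """z: mixed audio z(n); x_hat: target audio recovered from the vibration signal."""
    z_noise = nlms_cancel(x_hat, z)   # z'(n): mixture with the user's audio removed -> noise estimate
    s_hat = nlms_cancel(z_noise, z)   # filter the noise estimate out of z(n) -> target voice
    return s_hat
```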
- the method further includes: obtaining instruction information corresponding to the user's voice signal based on the target voice information, the instruction information indicating the semantic intention contained in the user's voice signal .
- the instruction information can be used to trigger the realization of the corresponding function of the semantic intention contained in the user's voice signal, for example, opening a certain application program, making a voice call, and so on.
- the obtaining target voice information according to the vibration signal and the user voice signal collected by the sensor includes:
- the target voice information is obtained through a recurrent neural network model based on the vibration signal and the user voice signal collected by the sensor; or,
- a corresponding target audio signal is obtained according to the vibration signal; based on the target audio signal and the user voice signal collected by the sensor, the target voice information is obtained through a recurrent neural network model.
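The embodiment leaves the layout of the recurrent neural network model open, so the following PyTorch sketch only illustrates the second alternative: per-frame magnitude spectra of the sensor audio and of the target audio recovered from the vibration signal are concatenated and fed to an LSTM that predicts a mask over the noisy spectrum. The feature dimensions, the two-layer LSTM, and the masking output are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SpeechEnhancementRNN(nn.Module):
    """Recurrent model fusing noisy-audio features with vibration-derived
    audio features to estimate the target voice (illustrative sketch)."""

    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        # Input: noisy magnitude spectrum concatenated with the spectrum of
        # the audio recovered from the vibration signal.
        self.rnn = nn.LSTM(input_size=2 * n_freq, hidden_size=hidden,
                           num_layers=2, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec, vib_spec):
        # noisy_spec, vib_spec: (batch, time, n_freq) magnitude spectrograms
        x = torch.cat([noisy_spec, vib_spec], dim=-1)
        h, _ = self.rnn(x)
        m = self.mask(h)               # per-frame, per-bin mask in [0, 1]
        return m * noisy_spec          # estimated target-voice spectrum

# Example forward pass with random tensors.
model = SpeechEnhancementRNN()
noisy = torch.rand(1, 100, 257)
vib = torch.rand(1, 100, 257)
enhanced = model(noisy, vib)           # shape: (1, 100, 257)
```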
- the method further includes:
- the obtaining target voice information according to the vibration signal and the user voice signal collected by the sensor includes:
- the method further includes:
- the motion signal of the occlusion parts of the vocal tract when the user is speaking is obtained; correspondingly, the target voice information is obtained according to the vibration signal, the brain wave signal, and the user's voice signal collected by the sensor, including:
- the obtaining target voice information according to the vibration signal, the brain wave signal, and the user voice signal collected by the sensor includes:
- the target voice information is obtained through a recurrent neural network model based on the vibration signal, the brain wave signal, and the user voice signal collected by the sensor; or,
- a corresponding first target audio signal is obtained according to the vibration signal and a corresponding second target audio signal is obtained according to the brain wave signal; based on the first target audio signal, the second target audio signal, and the user's voice signal collected by the sensor, a recurrent neural network model is used to obtain the target voice information.
- the target voice information includes voiceprint features representing the voice signal of the user.
- this application provides a voice signal processing method, the method including:
- target voice information is obtained.
- the method further includes:
- the acquiring target voice information according to the brain wave signal and the user's voice signal collected by the sensor includes:
- the obtaining target voice information according to the brain wave signal and the user voice signal collected by the sensor includes:
- the method further includes:
- the obtaining target voice information according to the brain wave signal and the user voice signal collected by the sensor includes:
- the target voice information is obtained through a recurrent neural network model
- a corresponding target audio signal is obtained; based on the target audio signal and the user voice signal collected by the sensor, the target voice information is obtained through a recurrent neural network model.
- the target voice information includes voiceprint features representing the voice signal of the user.
- this application provides a voice signal processing method, the method including:
- the vibration signal is used to indicate the vibration characteristics of the user's body part; the body part is a part that vibrates correspondingly, based on the vocalization behavior, when the user is in the vocal state;
- voiceprint recognition is performed.
- the vibration signal is used to indicate a vibration characteristic corresponding to the vibration generated by the voice.
- the performing voiceprint recognition based on the user's voice signal and the vibration signal collected by the sensor includes:
- the method further includes:
- the performing voiceprint recognition based on the user's voice signal and the vibration signal collected by the sensor includes:
- the performing voiceprint recognition based on the user's voice signal, the vibration signal, and the brain wave signal collected by the sensor includes:
- a voiceprint recognition result is obtained.
- the present application provides a voice signal processing device, the device including:
- the environmental voice acquisition module is used to acquire the user's voice signal collected by the sensor
- the vibration signal acquisition module is used to acquire the corresponding vibration signal when the user utters the voice; wherein the vibration signal is used to represent the vibration characteristics of the user's body part; the body part is a part that vibrates correspondingly, based on the vocal behavior, when the user is in the vocal state; and
- the voice information acquisition module is used to obtain target voice information according to the vibration signal and the user voice signal collected by the sensor.
- the vibration signal is used to indicate a vibration feature corresponding to the vibration generated by the user uttering the voice.
- the body part includes at least one of the following: top of the skull, face, throat, or neck.
- the vibration signal acquisition module is configured to acquire a video frame that includes the user; and extract a corresponding vibration signal when the user makes a voice based on the video frame.
- the video frame is collected by a dynamic vision sensor and/or a high-speed camera.
- the voice information acquiring module is configured to acquire a corresponding target audio signal according to the vibration signal; based on filtering, filter the target audio signal from the user voice signal collected by the sensor to obtain the noise signal to be filtered; and filter the noise signal to be filtered from the user voice signal collected by the sensor to obtain the target voice information.
- the device further includes:
- the instruction information acquisition module is configured to acquire instruction information corresponding to the user's voice signal based on the target voice information, and the instruction information indicates the semantic intention contained in the user's voice signal.
- the voice information acquisition module is configured to obtain the target voice information through a recurrent neural network model based on the vibration signal and the user voice signal collected by the sensor; or, to obtain a corresponding target audio signal according to the vibration signal and, based on the target audio signal and the user voice signal collected by the sensor, obtain the target voice information through a recurrent neural network model.
- the device further includes:
- the brain wave signal acquisition module is used to acquire the user's brain wave signal corresponding to when the user utters the voice; correspondingly, the voice information acquisition module is used to obtain target voice information according to the vibration signal, the brain wave signal, and the user's voice signal collected by the sensor.
- the device further includes:
- the motion signal acquisition module is used to acquire, according to the brain wave signal, the motion signal of the occlusion parts of the vocal tract when the user makes a voice; correspondingly, the voice information acquisition module is used to obtain target voice information according to the vibration signal, the motion signal, and the user's voice signal collected by the sensor.
- the voice information acquisition module is configured to obtain the target voice information through a recurrent neural network model based on the vibration signal, the brain wave signal, and the user voice signal collected by the sensor; or,
- to obtain a corresponding first target audio signal according to the vibration signal and a corresponding second target audio signal according to the brain wave signal, and, based on the first target audio signal, the second target audio signal, and the user's voice signal collected by the sensor, obtain the target voice information through a recurrent neural network model.
- the target voice information includes voiceprint features representing the voice signal of the user.
- the present application provides a voice signal processing device, the device including:
- the environmental voice acquisition module is used to acquire the user's voice signal collected by the sensor
- a brain wave signal acquisition module for acquiring the user's brain wave signal corresponding to when the user utters the voice
- the voice information acquisition module is used to obtain target voice information according to the brain wave signal and the user voice signal collected by the sensor.
- the device further includes:
- the motion signal acquisition module is used to acquire, according to the brain wave signal, the motion signal of the occlusion parts of the vocal tract when the user is speaking; correspondingly, the voice information acquisition module is used to obtain the target voice information according to the motion signal and the user voice signal collected by the sensor.
- the voice information acquisition module is configured to acquire a corresponding target audio signal according to the brain wave signal
- the device further includes:
- the instruction information acquisition module is configured to acquire instruction information corresponding to the user's voice signal based on the target voice information, and the instruction information indicates the semantic intention contained in the user's voice signal.
- the voice information acquisition module is configured to obtain the target voice information through a recurrent neural network model based on the brainwave signal and the user voice signal collected by the sensor; or,
- a corresponding target audio signal is obtained; based on the target audio signal and the user voice signal collected by the sensor, the target voice information is obtained through a recurrent neural network model.
- the target voice information includes voiceprint features representing the voice signal of the user.
- the present application provides a voice signal processing device, the device including:
- the environmental voice acquisition module is used to acquire the user's voice signal collected by the sensor
- the vibration signal acquisition module is used to acquire the corresponding vibration signal when the user utters the voice; wherein the vibration signal is used to represent the vibration characteristics of the user's body part; the body part is a part that vibrates correspondingly, based on the vocal behavior, when the user is in the vocal state; and
- the voiceprint recognition module is configured to perform voiceprint recognition based on the user's voice signal and the vibration signal collected by the sensor.
- the vibration signal is used to indicate a vibration characteristic corresponding to the vibration generated by the voice.
- the voiceprint recognition module is configured to perform voiceprint recognition according to the user's voice signal collected by the sensor, and obtain the first confidence that the user's voice signal collected by the sensor belongs to the user;
- the device further includes:
- a brain wave signal acquisition module configured to acquire the user's brain wave signal corresponding to when the user utters the voice
- the voiceprint recognition module is configured to perform voiceprint recognition based on the user's voice signal, the vibration signal, and the brain wave signal collected by the sensor.
- the voiceprint recognition module is configured to perform voiceprint recognition according to the user's voice signal collected by the sensor, and obtain a first confidence that the user's voice signal collected by the sensor belongs to the user;
- a voiceprint recognition result is obtained.
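The paragraphs above describe obtaining a first confidence that the sensor-collected voice signal belongs to the user and then producing a voiceprint recognition result, but they do not specify how confidences are combined. The sketch below assumes cosine-similarity scoring against enrolled voiceprint features and a simple weighted-average fusion of an audio-based and a vibration-based confidence; the weights, threshold, and embedding inputs are all hypothetical.

```python
import numpy as np

def cosine_confidence(embedding, enrolled):
    """Similarity between a test embedding and the user's enrolled voiceprint,
    mapped to [0, 1] as a confidence score."""
    sim = np.dot(embedding, enrolled) / (
        np.linalg.norm(embedding) * np.linalg.norm(enrolled) + 1e-12)
    return (sim + 1.0) / 2.0

def fuse_voiceprint_result(audio_emb, vib_emb, enrolled_audio, enrolled_vib,
                           w_audio=0.6, w_vib=0.4, threshold=0.7):
    # First confidence: from the user's voice signal collected by the sensor.
    c1 = cosine_confidence(audio_emb, enrolled_audio)
    # Second confidence: from features derived from the vibration signal.
    c2 = cosine_confidence(vib_emb, enrolled_vib)
    fused = w_audio * c1 + w_vib * c2   # assumed weighted-average fusion rule
    return fused >= threshold, fused
```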
- the present application provides an autonomous driving vehicle, which may include a processor, the processor is coupled with a memory, the memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the method of the above-mentioned first aspect is implemented.
- For the steps executed by the processor of the autonomous vehicle in each possible implementation manner of the first aspect, please refer to the first aspect for details; they will not be repeated here.
- the present application provides a computer-readable storage medium in which a computer program is stored, and when it runs on a computer, the computer executes the method described in the first aspect.
- the present application provides a circuit system including a processing circuit configured to execute the method described in the first aspect.
- the present application provides a computer program that, when run on a computer, causes the computer to execute the method described in the first aspect.
- this application provides a chip system that includes a processor for supporting a server or a threshold value acquisition device to implement the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods.
- the chip system further includes a memory, and the memory is used to store necessary program instructions and data for the server or the communication device.
- the chip system can be composed of chips, and can also include chips and other discrete devices.
- An embodiment of the present application provides a voice signal processing method, including: acquiring a user's voice signal collected by a sensor, the voice signal including environmental noise; acquiring a vibration signal corresponding to the voice when the user makes the voice, wherein the vibration signal is used to indicate the vibration characteristics of a body part of the user, and the body part is a part that vibrates correspondingly based on the vocalization behavior when the user is in the vocal state; and obtaining target voice information according to the vibration signal and the user voice signal collected by the sensor.
- The vibration signal is used as the basis for speech recognition.
- Since the vibration signal does not contain external non-user voices mixed in through complex acoustic transmission, it is less affected by other environmental noise (such as the effect of reverberation); by suppressing this part of the noise interference relatively well, a better speech recognition effect can be achieved.
- Figure 1a is a schematic diagram of a smart device
- FIG. 1b is a schematic diagram of a graphical user interface of a mobile phone provided by an embodiment of this application;
- FIG. 2 is a schematic diagram of an application scenario of an embodiment of the application
- FIG. 3 and 4 are schematic diagrams of another application scenario provided by an embodiment of this application.
- Figure 5 is a schematic diagram of the structure of an electronic device
- FIG. 6 is a schematic diagram of the software structure of an electronic device according to an embodiment of the application.
- FIG. 7 is a schematic flowchart of a voice signal processing method provided in an embodiment of this application.
- Figure 8 is a schematic diagram of a system architecture
- Figure 9 is a schematic diagram of the structure of an RNN
- Figure 10 is a schematic diagram of the structure of an RNN
- Figure 11 is a schematic diagram of the structure of an RNN
- Figure 12 is a schematic diagram of the structure of an RNN
- Figure 13 is a schematic diagram of the structure of an RNN
- FIG. 14 is a schematic flowchart of a voice signal processing method provided by an embodiment of this application.
- FIG. 16 is a schematic structural diagram of a voice signal processing device provided by this application.
- FIG. 17 is a schematic structural diagram of a voice signal processing device provided by this application.
- FIG. 18 is a schematic structural diagram of a voice signal processing device provided by this application.
- FIG. 19 is a schematic structural diagram of an execution device provided by an embodiment of this application.
- FIG. 20 is a schematic structural diagram of a training device provided by an embodiment of the present application.
- FIG. 21 is a schematic diagram of a structure of a chip provided by an embodiment of the application.
- The term "component" used in this specification is used to denote computer-related entities: hardware, firmware, a combination of hardware and software, software, or software in execution.
- A component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program, and/or a computer.
- the application running on the computing device and the computing device can be components.
- One or more components may reside in processes and/or threads of execution, and components may be located on one computer and/or distributed between two or more computers.
- these components can be executed from various computer readable media having various data structures stored thereon.
- A component may communicate through local and/or remote processes based on a signal having one or more data packets (for example, data from two components interacting with another component in a local system, in a distributed system, and/or across a network, such as the Internet, that interacts with other systems through the signal).
- the voice signal processing method provided in the embodiments of the present application can be applied in scenarios such as voice recognition and human-computer interaction related to voiceprint recognition.
- the voice signal processing method of the embodiment of the present application can be applied to voice recognition and voiceprint recognition. The following briefly introduces the voice recognition scene and the voiceprint recognition scene respectively.
- Automatic speech recognition (ASR): its goal is to convert the vocabulary content of human speech into computer-readable input, such as keystrokes, binary codes, or character sequences.
- This application can be applied to a device with a voice interaction function; in this embodiment, the "voice interaction function" is a function that can be implemented on the device, which can recognize the user's voice, trigger the corresponding function based on the voice, and thereby realize voice interaction with the user.
- The devices with voice interaction function may be smart devices such as speakers, alarm clocks, watches, robots, etc., or in-vehicle devices, or portable devices such as mobile phones, tablets, augmented reality (AR) devices, or virtual reality (VR) devices.
- The device with voice interaction function may include an audio sensor and a video sensor, where the audio sensor may collect audio signals in the environment, and the video sensor may collect video of a certain area; the audio signal may include the audio emitted by one or more users when uttering as well as other noise signals in the environment, and the video may include the above-mentioned one or more users. Based on the audio signal and the video, the audio signal emitted by the one or more users when uttering can then be extracted. Furthermore, the voice interaction function with the user can be realized based on the extracted audio signal. How to extract, based on the audio signal and the video, the audio signal emitted when one or more users utter will be described in detail in the subsequent embodiments and will not be repeated here.
- the above-mentioned audio sensor and video sensor may not be used as components of the device with voice interaction function, but as independent components or components integrated on other devices;
- the device with voice interaction function may then only obtain the audio signal in the environment collected by the audio sensor, or only the video of a certain area collected by the video sensor, and can further extract, based on the audio signal and the video, the audio signal emitted when one or more users speak.
- the voice interaction function with the user can be realized based on the extracted audio signal.
- For example, the audio sensor is a component of the device with voice interaction function while the video sensor is not; or, the audio sensor is not a component of the device with voice interaction function while the video sensor is.
- the device with a voice interaction function may be a smart device as shown in FIG. 1a.
- the smart device does not perform any action when it recognizes the voice "piupiupiu". For example, if the smart device recognizes the voice "turn on the air conditioner", it will perform the corresponding action of the voice "turn on the air conditioner”: turn on the air conditioner. For example, if the smart device recognizes the sound made by the user blowing a whistle, that is, the whistle sound, it executes the action corresponding to the whistle sound: turn on the light. For example, if the smart device recognizes the voice "turn on the light”, it will not perform any action.
- For example, if the smart device recognizes the voice "sleeping" spoken in the whispering mode, it will execute the action corresponding to the whispered voice "sleeping": switch to the sleep mode.
- the voice "piupiupiu”, whistle, and whisper mode voice “sleep” are special voices.
- Voices such as "turn on the air conditioner” and "turn on the lights” are normal voices.
- Normal speech refers to a type of speech whose semantics can be recognized and for which the vocal cords vibrate when vocalizing.
- Special voice refers to a type of voice that is different from normal voice.
- a special voice refers to a type of voice that does not vibrate the vocal cords when speaking, that is, unvoiced.
- special speech refers to speech without semantics.
- a device with a voice interaction function may be a device with a display function, such as a mobile phone.
- FIG. 1b is a graphical user interface (GUI) of a mobile phone provided by an embodiment of the application.
- the GUI is a display interface when the mobile phone interacts with the user.
- the mobile phone detects the user's voice wake-up word "Xiaoyi Xiaoyi"
- the mobile phone can display the text display window 101 of the voice assistant on the desktop, and the mobile phone can remind the user "Hi, I'm listening" through the window 101.
- While the mobile phone displays the reminder text to the user through the window 101 or 102, it can also broadcast "Hi, I'm listening" to the user by voice.
- the device with voice interaction function may be a system composed of multiple devices.
- FIG. 2 shows an application scenario of an embodiment of the application.
- the application scenario in Figure 2 can also be referred to as a smart home scenario.
- the application scenario in FIG. 2 may include at least one electronic device (for example, electronic device 210, electronic device 220, electronic device 230, electronic device 240, electronic device 250), electronic device 260, and electronic device.
- the electronic device 210 in FIG. 2 may be a television.
- the electronic device 220 may be a speaker.
- the electronic device 230 may be a monitoring device.
- the electronic device 240 may be a watch.
- the electronic device 250 may be a smart microphone.
- the electronic device 260 may be a mobile phone or a tablet computer.
- the electronic device may be a wireless communication device, such as a router, a gateway device, and so on.
- the electronic device 210, the electronic device 220, the electronic device 230, the electronic device 240, the electronic device 250, and the electronic device 260 in FIG. 2 can perform uplink and downlink transmissions with the electronic device through a wireless communication protocol.
- the electronic device can send information to the electronic device 210, the electronic device 220, the electronic device 230, the electronic device 240, the electronic device 250, and the electronic device 260, and can also receive information sent by the electronic device 210, the electronic device 220, the electronic device 230, the electronic device 240, and so on.
- embodiments of the present application may be applied to an application scenario including one or more wireless communication devices and multiple electronic devices, which is not limited in this application.
- the device with voice interaction function may be any electronic device in the smart home system, for example, it may be a TV, a speaker, a watch, a smart microphone, a mobile phone or a tablet computer, and so on.
- Any electronic device in the smart home system can include an audio sensor or a video sensor.
- The audio information or video can be transmitted, via the wireless communication device, to the device with voice interaction function or to a server on the cloud side (not shown in Figure 2); the device with voice interaction function can then extract, based on the audio information and the video, the audio signals emitted by one or more users when they utter a voice.
- Furthermore, the voice interaction function with the user can be realized based on the extracted audio signal; or the server on the cloud side can extract the audio signal emitted by one or more users based on the audio information and video, and transmit the extracted audio signal to the device with the voice interaction function, and then the device with the voice interaction function can realize the voice interaction function with the user based on the extracted audio signal.
- the application scenario includes the electronic device 210, the electronic device 260, and the electronic device.
- the electronic device 210 is a TV
- the electronic device 260 is a mobile phone
- the electronic device is a router.
- the router is used to realize the wireless communication between the TV and the mobile phone.
- the device with voice interaction function can be a mobile phone
- the TV can be equipped with a video sensor
- the mobile phone can be equipped with an audio sensor. After the TV obtains the video, it can transmit the video to the mobile phone, and the mobile phone can extract, based on the audio information and the video, the audio signal emitted when one or more users speak.
- the voice interaction function with the user can be realized based on the extracted audio signal.
- the application scenario includes the electronic device 220, the electronic device 260, and the electronic device.
- the electronic device 220 is a speaker
- the electronic device 260 is a mobile phone
- the electronic device is a router.
- the router is used to realize the wireless communication between the speaker and the mobile phone.
- the device with voice interaction function can be a mobile phone.
- the mobile phone can be equipped with a video sensor
- the speaker can be equipped with an audio sensor. After the speaker obtains the audio information, the audio information can be transmitted to the mobile phone.
- the mobile phone can extract, based on the audio information and the video, the audio signal emitted when one or more users speak.
- the voice interaction function with the user can be realized based on the extracted audio signal.
- the application scenario includes the electronic device 230, the electronic device 260, and the electronic device.
- the electronic device 230 is a monitoring device
- the electronic device 260 is a mobile phone
- the electronic device is a router.
- the router is used to realize the wireless communication between the monitoring device and the mobile phone.
- the device with voice interaction function can be a mobile phone.
- the monitoring device can be equipped with a video sensor
- the mobile phone can be equipped with an audio sensor. After the monitoring device obtains the video, it can transmit the video to the mobile phone.
- the mobile phone can extract, based on the audio information and the video, the audio signal emitted when one or more users speak.
- the voice interaction function with the user can be realized based on the extracted audio signal.
- the application scenario includes the electronic device 250, the electronic device 260, and the electronic device.
- the electronic device 250 is a microphone
- the electronic device 260 is a mobile phone
- the electronic device is a router.
- the router is used to realize the wireless communication between the microphone and the mobile phone.
- the device with voice interaction function can be a microphone.
- the mobile phone can be equipped with a video sensor
- the microphone can be equipped with an audio sensor. After the mobile phone obtains the video, it can transmit the video to the microphone.
- the microphone can extract, based on the audio information and the video, the audio signal emitted when one or more users are speaking.
- the voice interaction function with the user can be realized based on the extracted audio signal.
- FIG. 3 and 4 are another application scenario provided by an embodiment of the present application.
- the application scenarios in Figure 3 and Figure 4 can also be referred to as smart driving scenarios.
- the application scenarios in FIG. 3 and FIG. 4 may include electronic equipment, which includes the device 310, the device 320, the device 330, the device 340, and the device 350.
- the electronic device can be a driving system (also called an in-vehicle system).
- the device 310 may be a display screen.
- the device 320 may be a microphone.
- the device 330 may be a speaker.
- the device 340 may be a camera.
- the device 350 may be a seat adjustment device.
- the electronic device 360 may be a mobile phone or a tablet computer.
- the electronic device can receive data sent by the device 310, the device 320, the device 330, the device 340, and the device 350.
- the electronic device and the electronic device 360 can communicate through a wireless communication protocol.
- the electronic device may send a signal to the electronic device 360, and may also receive a signal sent by the electronic device 360.
- embodiments of the present application may be applied to an application scenario including a driving system and multiple electronic devices, which is not limited in the present application.
- the application scenario includes the device 320, the device 330, the electronic device 360, and the electronic device (driving system).
- the device 320 is a microphone
- the device 340 is a camera
- the electronic device 360 is a tablet computer
- the electronic device is a driving system.
- the driving system is used for wireless communication with the mobile phone, is also used to drive the microphone to collect audio signals, and to drive the camera to collect video.
- the driving system can drive the microphone to collect audio signals and send the audio signals collected by the microphone to the tablet.
- the driving system can drive the camera to collect video and send the video collected by the camera to the tablet; the tablet can extract, based on the audio information and the video, the audio signal emitted when one or more users speak.
- the voice interaction function with the user can be realized based on the extracted audio signal.
- the video sensor can be deployed independently, for example, set at a preset position in the car so that it can collect video of a preset area; for example, the video sensor can be set on the windshield or on a seat, and can then collect video of the user in a certain seat.
- the device with voice recognition function may be a head-mounted portable device, for example, an AR/VR device, where the head-mounted portable device may be provided with an audio sensor and a brain wave collection device.
- the sensor can collect audio signals
- the brain wave collection device can collect brain wave signals
- the head-mounted portable device can extract the audio signal emitted when one or more users are uttering based on the audio signal and the brain wave signal.
- the voice interaction function with the user can be realized based on the extracted audio signal.
- The above-mentioned audio sensor and brainwave acquisition equipment may not be components of the device with voice interaction function itself, but may be independent components or components integrated on other devices; in this case, the device with voice interaction function may only obtain the audio signal in the environment collected by the audio sensor, or only the brain wave signal collected by the brain wave collection device, and can further extract, based on the audio signal and the brain wave signal, the audio signal emitted by one or more users. Furthermore, the voice interaction function with the user can be realized based on the extracted audio signal.
- For example, the audio sensor is a component of the device with voice interaction function and the brainwave acquisition device is not; or, the audio sensor is not a component of the device with voice interaction function and neither is the brain wave collection device; or, the audio sensor is not a component of the device with voice interaction function while the brain wave collection device is a component of the device with voice interaction function itself.
- Voiceprint is a sound wave spectrum that carries verbal information displayed by an electroacoustic instrument. It is a biological feature composed of more than a hundred characteristic dimensions such as wavelength, frequency, and intensity. Voiceprint recognition achieves the purpose of distinguishing unknown sounds by analyzing the characteristics of one or more speech signals. Simply put, it is a technology to distinguish whether a certain sentence is spoken by a certain person. Through the voiceprint, the identity of the speaker can be determined and targeted answers can be given.
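As a rough illustration of the kind of spectral features a voiceprint is built from, the snippet below extracts MFCCs with librosa and pools them into a fixed-length vector; the file path is hypothetical, and production voiceprint systems typically rely on trained speaker-embedding models rather than this simple pooling.

```python
import librosa
import numpy as np

# Load a short recording (path is hypothetical) and compute MFCCs, one
# common family of spectral features used to characterize a speaker's voice.
y, sr = librosa.load("user_utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape: (20, n_frames)

# A crude utterance-level voiceprint vector: per-coefficient mean and
# standard deviation over time.
voiceprint = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```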
- this application can also be applied to the scene of speech denoising.
- The speech signal processing method in this application can be used in audio input devices that require speech denoising, such as earphones and microphones (independent microphones or microphones on terminal equipment, etc.); the user can speak to the audio input device.
- The audio input device can extract the voice signal sent by the user from the audio input that includes environmental noise.
- The electronic device may be a portable electronic device that also contains other functions such as a personal digital assistant function and/or a music player function, for example a mobile phone, a tablet computer, or a wearable electronic device with a wireless communication function (such as a smart watch), etc.
- portable electronic devices include, but are not limited to, portable electronic devices equipped with or other operating systems.
- the aforementioned portable electronic device may also be other portable electronic devices, such as a laptop computer (Laptop) and the like.
- The above-mentioned electronic devices may also not be portable electronic devices, but may be desktop computers, televisions, speakers, monitoring equipment, cameras, display screens, microphones, seat adjustment devices, fingerprint recognition devices, on-board driving systems, etc.
- FIG. 5 shows a schematic structural diagram of the electronic device 100.
- the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a microphone 170C, a sensor module 180, a button 190, a camera 193, a display screen 194, and a subscriber identification module (SIM) card interface.
- the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the electronic device 100.
- the electronic device 100 may include more or fewer components than shown, or combine certain components, or split certain components, or arrange different components.
- the illustrated components can be implemented in hardware, software, or a combination of software and hardware.
- the processor 110 may include one or more processing units.
- the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
- different processing units may be independent components, or may be integrated in one or more processors.
- the electronic device 100 may also include one or more processors 110.
- the controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching instructions and executing instructions.
- a memory may be provided in the processor 110 to store instructions and data.
- the memory in the processor 110 may be a cache memory.
- the memory can store instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to use the instruction or data again, it can be directly called from the memory. In this way, repeated accesses are avoided, the waiting time of the processor 110 is reduced, and the efficiency of the electronic device 100 in processing data or executing instructions is improved.
- the processor 110 may include one or more interfaces.
- the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a SIM card interface, and/or a USB interface, etc.
- the USB interface is an interface that conforms to the USB standard specification, and specifically can be a Mini USB interface, a Micro USB interface, or a USB Type-C interface.
- the USB interface can be used to connect a charger to charge the electronic device 100, and can also be used to transfer data between the electronic device 100 and peripheral devices.
- the USB interface can also be used to connect headphones and play audio through the headphones.
- the interface connection relationship between the modules illustrated in the embodiment of the present application is merely a schematic description, and does not constitute a structural limitation of the electronic device 100.
- the electronic device 100 may also adopt an interface connection manner different from those in the foregoing embodiments, or a combination of multiple interface connection manners.
- the electronic device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
- the GPU is a microprocessor for image processing, connected to the display 194 and the application processor.
- the GPU is used to perform mathematical and geometric calculations and is used for graphics rendering.
- the processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
- the display screen 194 is used to display images, videos, and the like.
- the display screen 194 includes a display panel.
- The display panel can adopt a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Miniled, a MicroLed, a Micro-oLed, a quantum dot light-emitting diode (QLED), etc.
- the electronic device 100 may include one or more display screens 194.
- the display screen 194 of the electronic device 100 may be a flexible screen.
- the flexible screen has attracted much attention due to its unique characteristics and great potential.
- Flexible screens are highly flexible and bendable, and can provide users with new bending-based interaction methods, meeting more of users' needs for electronic devices.
- the foldable display screen on the electronic device can be switched between a small screen in a folded configuration and a large screen in an unfolded configuration at any time. Therefore, users use the split screen function on electronic devices equipped with foldable display screens more and more frequently.
- the electronic device 100 can realize a shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, and an application processor.
- the ISP is used to process the data fed back from the camera 193. For example, when taking a picture, the shutter is opened, the light is transmitted to the photosensitive element of the camera through the lens, the light signal is converted into an electrical signal, and the photosensitive element of the camera transfers the electrical signal to the ISP for processing and is converted into an image visible to the naked eye.
- ISP can also optimize the image noise, brightness, and skin color. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
- the ISP may be provided in the camera 193.
- the camera 193 is used to capture still images or videos.
- the object generates an optical image through the lens and is projected to the photosensitive element.
- the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
- the photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal.
- ISP outputs digital image signals to DSP for processing.
- DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
- the electronic device 100 may include one or more cameras 193.
- the camera 193 in the embodiment of the present application may be a high-speed camera or a dynamic vision sensor (DVS).
- Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
- Video codecs are used to compress or decompress digital video.
- the electronic device 100 may support one or more video codecs. In this way, the electronic device 100 can play or record videos in multiple encoding formats, such as: moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, and so on.
- NPU is a neural-network (NN) computing processor.
- the NPU can realize applications such as intelligent cognition of the electronic device 100, such as image recognition, face recognition, voice recognition, text understanding, and so on.
- the external memory interface 120 may be used to connect an external memory card, such as a MicroSD card, to expand the storage capacity of the electronic device 100.
- the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function. For example, save music, video and other files in an external memory card.
- the internal memory 121 may be used to store one or more computer programs, and the one or more computer programs include instructions.
- the processor 110 can execute the above-mentioned instructions stored in the internal memory 121 to enable the electronic device 100 to execute the off-screen display method provided in some embodiments of the present application, as well as various applications and data processing.
- the internal memory 121 may include a storage program area and a storage data area. Among them, the storage program area can store the operating system; the storage program area can also store one or more applications (such as photo galleries, contacts, etc.).
- the storage data area can store data (such as photos, contacts, etc.) created during the use of the electronic device 100.
- the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic disk storage components, flash memory components, universal flash storage (UFS), and the like.
- the processor 110 may execute instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor 110, to cause the electronic device 100 to execute the off-screen display method provided in the embodiments of the present application, as well as other applications and data processing.
- the electronic device 100 can implement audio functions through an audio module, a speaker, a receiver, a microphone, a headphone interface, and an application processor. For example, music playback, recording, etc.
- the sensor module 180 may include an acceleration sensor 180E, a fingerprint sensor 180H, an ambient light sensor 180L, and the like.
- the acceleration sensor 180E can detect the magnitude of the acceleration of the electronic device 100 in various directions (generally three axes). When the electronic device 100 is stationary, the magnitude and direction of gravity can be detected. It can also be used to identify the posture of electronic devices, and be used in applications such as horizontal and vertical screen switching, pedometers and so on.
- the ambient light sensor 180L is used to sense the brightness of the ambient light.
- the electronic device 100 can adaptively adjust the brightness of the display screen 194 according to the perceived brightness of the ambient light.
- the ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures.
- the ambient light sensor 180L can also cooperate with the proximity light sensor 180G to detect whether the electronic device 100 is in the pocket to prevent accidental touch.
- the fingerprint sensor 180H is used to collect fingerprints.
- the electronic device 100 can use the collected fingerprint characteristics to implement fingerprint unlocking, access application locks, fingerprint photographs, fingerprint answering calls, and so on.
- the brain wave sensor 195 can collect brain wave signals.
- the button 190 includes a power-on button, a volume button, and so on.
- the button 190 may be a mechanical button. It can also be a touch button.
- the electronic device 100 may receive key input, and generate key signal input related to user settings and function control of the electronic device 100.
- FIG. 6 is a block diagram of the software structure of the electronic device 100 according to an embodiment of the present application.
- the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. The layers communicate with each other through software interfaces.
- the Android system is divided into four layers, from top to bottom, the application layer, the application framework layer, the Android runtime and system library, and the kernel layer.
- the application layer can include a series of application packages.
- the application package may include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, etc.
- the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
- the application framework layer includes some predefined functions.
- the application framework layer can include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and so on.
- the window manager is used to manage window programs.
- the window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, take a screenshot, etc.
- the content provider is used to store and retrieve data and make these data accessible to applications.
- the data may include video, image, audio, phone calls made and received, browsing history and bookmarks, phone book, etc.
- the view system includes visual controls, such as controls that display text, controls that display pictures, and so on.
- the view system can be used to build applications.
- the display interface can be composed of one or more views.
- a display interface that includes a short message notification icon may include a view that displays text and a view that displays pictures.
- the phone manager is used to provide the communication function of the electronic device 100. For example, the management of the call status (including connecting, hanging up, etc.).
- the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
- the notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and it can automatically disappear after a short stay without user interaction.
- the notification manager is used to notify download completion, message reminders, and so on.
- the notification manager can also be a notification that appears in the status bar at the top of the system in the form of a chart or a scroll bar text, such as a notification of an application running in the background, or a notification that appears on the screen in the form of a dialog window.
- for example, prompting text information in the status bar, playing a prompt sound, vibrating the electronic device, flashing the indicator light, and so on.
- the system library can include multiple functional modules. For example: surface manager (surface manager), media library (media libraries), 3D graphics processing library (for example: OpenGL ES), 2D graphics engine (for example: SGL), etc.
- the surface manager is used to manage the display subsystem and provides a combination of 2D and 3D layers for multiple applications.
- the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
- the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
- the 3D graphics processing library is used to realize 3D graphics drawing, image rendering, synthesis, and layer processing.
- the 2D graphics engine is a graphics engine for 2D drawing.
- the kernel layer is the layer between hardware and software.
- the kernel layer contains at least display driver, camera driver, audio driver, and sensor driver.
- FIG. 7 is a schematic flowchart of a voice signal processing method provided in an embodiment of the present application.
- the voice signal processing method provided in an embodiment of the present application includes:
- the user's voice signal collected by the sensor from the environment can be obtained, and the voice signal includes environmental noise; the voice signal in the following may also be expressed as a sound signal.
- the user voice signal should not be understood as only the words spoken by the user, but should be understood as a voice signal that includes the user's voice.
- the voice signal including environmental noise can be understood as the presence of users who are talking and other environmental noise (such as other people talking) in the environment.
- the collected voice signal therefore includes the user's voice and environmental noise intertwined with each other.
- the audio sensor (for example, a microphone or a microphone array) can collect a user's voice signal from the environment.
- the user's voice signal is the mixed signal z(n) in the environment.
- the execution subject of step 701 may be a device with a voice interaction function or a voice input device. Taking the device with the voice interaction function as an example, in one implementation, the device with the voice interaction function may be integrated with an audio sensor, and the audio sensor may acquire an audio signal including the user's voice signal; in another implementation, the audio sensor may not be integrated on the device with the voice interaction function, for example, the audio sensor may be integrated on another device, and the audio sensor can transmit the collected audio signal to the device with the voice interaction function, so that the device with the voice interaction function can obtain the audio signal.
- the audio sensor can pick up the audio signal from a certain direction in a targeted manner, such as directional pickup for the user's direction, so as to eliminate a part of external noise (but there is still noise) as much as possible.
- Directional acquisition requires a microphone array or vector microphone.
- the beamforming method can be used. It can be realized by using a beamformer, which can include delay-sum beamforming and filter-sum beamforming. Specifically, if the input signal of the i-th microphone in the array is z_i(n) and the corresponding filter transfer coefficient is w_i(n), the output of the filter-sum beamformer system is y(n) = Σ_i w_i(n) * z_i(n), where * denotes convolution and the sum is taken over all microphones in the array.
- when the filters reduce to pure delays with equal weights, the filter-sum beamforming is simplified to the delay-sum beamforming, namely y(n) = (1/N) Σ_i z_i(n - τ_i), where N is the number of microphones.
- τ_i represents the time delay compensation obtained through estimation.
- the beam of the array can be directed to any direction to pick up the audio signal in that direction. If the audio signal from a certain direction is not desired, the beam pointing can be controlled so as not to include that direction. The audio signal collected after controlling the pickup direction in this way is z(n).
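- as a minimal sketch of the delay-sum pickup described above (assuming the per-microphone sample delays τ_i have already been estimated; the function and variable names are illustrative only):

```python
import numpy as np

def delay_and_sum(mic_signals, delays):
    """Delay-and-sum beamformer.

    mic_signals: array of shape (num_mics, num_samples), one row per microphone z_i(n).
    delays: integer sample delays tau_i that time-align the target direction.
    Returns the beamformed signal z(n) = (1/N) * sum_i z_i(n - tau_i).
    """
    num_mics, num_samples = mic_signals.shape
    out = np.zeros(num_samples)
    for i in range(num_mics):
        # shift each channel by its delay so the target direction adds coherently
        out += np.roll(mic_signals[i], delays[i])
    return out / num_mics

# usage sketch: two microphones, target arriving 3 samples later at microphone 1
# z = delay_and_sum(np.stack([z0, z1]), delays=[0, 3])
```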
- the description of the product form of the voice input device can refer to the product form of the device with the voice interaction function described above, which will not be repeated here.
- the vibration signal corresponding to the voice of the user can be obtained, where the vibration signal is used to indicate the vibration characteristics of the body part of the user when the voice signal is issued.
- there is no strict time sequence limitation between step 701 and step 702; step 701 can be performed before, after, or at the same time as step 702, which is not limited by this application.
- the corresponding vibration signal when the user utters a voice may be obtained based on video extraction.
- the device with voice interaction function may be integrated with a video sensor, and the video sensor may collect video frames including the user.
- the device with the voice interaction function may extract the vibration signal corresponding to the user according to the video frame.
- the video sensor can be set independently of the device with the voice interaction function, and the video sensor can collect the video frame including the user and send the video frame to the device with the voice interaction function.
- accordingly, the device with the voice interaction function can extract the vibration signal corresponding to the user according to the video frame.
- the action of extracting the vibration signal from the video frame may be executed by a server on the cloud side or another device on the end side;
- the device with voice interaction function may be integrated with a video sensor, and the video sensor may collect video frames including the user, and send the video frames to a server on the cloud side or other end-side devices, Correspondingly, the cloud-side server or other end-side device may extract the vibration signal corresponding to the user according to the video frame, and send the vibration signal to the device with a voice interaction function.
- in another implementation, the video sensor can be set independently from the device with the voice interaction function; the video sensor can collect the video frame including the user and send the video frame to the server on the cloud side or to another end-side device; correspondingly, the cloud-side server or the other end-side device may extract the vibration signal corresponding to the user according to the video frame, and send the vibration signal to the device with the voice interaction function.
- the video frame is collected by a dynamic vision sensor and/or a high-speed camera.
- the dynamic vision sensor can capture video frames including the top of the skull, face, throat, or neck when the user is speaking.
- the number of dynamic vision sensors that collect video frames may be one or more;
- the dynamic vision sensor can collect video frames that include the user's full body or partial body parts.
- the following describes an implementation in which the dynamic vision sensor collects video frames including partial body parts.
- the dynamic vision sensor can select, for video frame collection, only the part that vibrates correspondingly based on the vocal behavior when the user is in the vocal state.
- the body part can be, for example, the top of the skull, the face, the throat, or the neck.
- the video capture direction of the dynamic vision sensor can be preset.
- the dynamic vision sensor can be set at a preset position in the car, and the video collection direction of the dynamic vision sensor is set to face the user's preset body part.
- the video collection direction of the dynamic vision sensor can be towards a preset area of the driving position, which is usually the area where a person's face is located when the person sits in the driving position.
- the dynamic vision sensor can collect video frames that include the full picture of the user's body.
- the video capture direction of the dynamic vision sensor can also be preset.
- the dynamic vision sensor can be set at a preset position in the car, and the video capture direction of the dynamic vision sensor is set toward the driving position.
- when the number of dynamic vision sensors is multiple, each dynamic vision sensor can have its video capture direction preset, so that each dynamic vision sensor can capture a video frame including a body part, where the body part is the part that vibrates correspondingly based on the vocal behavior when the user is in the vocal state.
- dynamic vision sensors can be deployed on the front and back of the headrest (picking up the video frames of the people in the front and rear of the car), on the car frame (picking up the video frames of the people in the left and right directions), and under the windshield (picking up the video frames of the people in the front row).
- the same sensors can be used for the collection of video frames of different body parts, such as all using high-speed cameras, or both using dynamic vision sensors, or using these two types of sensors in combination, which is not limited in the present application.
- dynamic vision sensors can be deployed on TVs, smart large screens or smart speakers, etc.
- dynamic vision sensors can be deployed on mobile phones, for example, based on the front of the mobile phone or rear camera.
- the vibration signal represents the original characteristics of the human voice; optionally, there may be multiple vibration signals, such as the head vibration signal x1(n), the throat vibration signal x2(n), the face vibration signal x3(n), the neck vibration signal x4(n), and so on.
- the corresponding target audio signal can be recovered according to the vibration signal.
- the vibration signal is used to indicate the vibration characteristics of the body part of the user when the voice signal is sent.
- the vibration characteristics can be the vibration characteristics obtained directly from the video, or the vibration characteristics related only to sound vibration that are obtained after filtering out the interference of other actions.
- each image frame can be decomposed into an image pyramid of different scales and directions using filters in different directions. Specifically, the image can be filtered with a low-pass filter to obtain a low-pass residual image, and the residual image is continuously down-sampled into images of different scales. For the image at each scale, bandpass filters in different directions are used for filtering to obtain response maps for the different directions; the amplitude and phase of the response maps are calculated, and the local motion information of the current frame t is calculated, using the first image frame as the reference frame.
- the phase difference between the decomposition results of the current frame and the reference frame at different scales and different directions at different pixel positions can be calculated to quantify the local motion of each pixel, and calculate based on the local motion of each pixel Global motion information of the current frame.
- the global motion information can be obtained after weighting and averaging the local motion information.
- the weight is the amplitude of the corresponding scale, direction, and pixel position.
- the weighted sum of all pixels on this scale in this direction is used to obtain the global motion information of different scales and directions.
- the global motion information of the image frame can be obtained by summing the above-mentioned global motion information of the different scales and directions.
- each image frame can be calculated to obtain a motion magnitude value.
- the amplitude corresponding to each frame is used as the audio sample value to obtain the preliminary recovered audio signal. Then high-pass filtering is performed to obtain the restored audio signal x'(n).
- in the case of multiple sensors or multiple target areas, the respective target audio signals x1'(n), x2'(n), x3'(n), x4'(n) are individually restored in this way.
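- the following is a highly simplified sketch of this recovery pipeline, using a single frequency-domain orientation filter per frame instead of a full multi-scale, multi-direction pyramid, and assuming a high-speed camera whose frame rate is well above the audio band of interest (all names, the filter band, and the 20 Hz high-pass cutoff are illustrative assumptions):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def recover_audio_from_frames(frames, fps, band=(0.0, 0.1)):
    """frames: array (num_frames, H, W), grayscale video of the vibrating body part.
    Returns a high-pass filtered 1-D signal approximating x'(n)."""
    H, W = frames[0].shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    # complex band-pass filter in the frequency domain (one scale/orientation for brevity)
    mask = (np.hypot(fx, fy) > band[0]) & (np.hypot(fx, fy) < band[1]) & (fx > 0)

    def response(img):
        return np.fft.ifft2(np.fft.fft2(img) * mask)   # complex response map

    ref = response(frames[0])                           # first frame as reference
    samples = []
    for frame in frames:
        cur = response(frame)
        phase_diff = np.angle(cur * np.conj(ref))       # local motion per pixel
        weight = np.abs(cur)                            # amplitude used as the weight
        samples.append(np.sum(phase_diff * weight) / (np.sum(weight) + 1e-9))
    samples = np.asarray(samples)

    # high-pass filter to remove slow, non-audio motion
    b, a = butter(4, 20.0 / (fps / 2.0), btype="highpass")
    return filtfilt(b, a, samples - samples.mean())
```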
- in the dynamic vision sensor, each pixel independently responds to changes in light intensity by comparing the current light intensity with the light intensity at the time of the previous event; when the amount of change between the two (that is, the difference value) exceeds a threshold, a new event is generated.
- Each event includes pixel coordinates, firing time and light intensity polarity.
- the light intensity polarity characterizes the changing trend of light intensity. Usually +1 or On means light intensity increases, -1 or Off means light intensity decreases. Since the dynamic vision sensor has no concept of exposure, the pixel continuously monitors and responds to the light intensity, so its time resolution can reach the microsecond level. At the same time, the dynamic vision sensor is sensitive to motion and hardly responds to static areas.
- the dynamic vision sensor can therefore be used to capture the vibration of the object so as to achieve sound recovery. In this way, an audio signal is recovered at a certain pixel position; high-pass filtering is performed on it to remove low-frequency non-audio vibration interference, and a signal that can characterize the audio is obtained. The audio signals recovered in this way at multiple pixels, such as all pixels, can be weighted and averaged to obtain the audio signal x'(n) recovered by the dynamic vision sensor. In the case of multiple sensors or multiple target areas, they are restored separately to obtain the independently restored target audio signals x1'(n), x2'(n), x3'(n), x4'(n).
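- a minimal sketch of this event-based recovery, assuming the DVS events are available as arrays of timestamps and light-intensity polarities (the sample rate, cutoff frequency, and names are illustrative assumptions):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def recover_audio_from_events(timestamps, polarities, sample_rate=8000, duration=None):
    """Accumulate signed DVS events into fixed time bins and high-pass filter the result.

    timestamps: numpy array of event times in seconds.
    polarities: numpy array of +1 / -1 light-intensity change signs.
    Returns a 1-D signal approximating the vibration-induced audio x'(n)."""
    if duration is None:
        duration = timestamps.max()
    num_bins = int(duration * sample_rate) + 1
    signal = np.zeros(num_bins)
    bins = (timestamps * sample_rate).astype(int)
    np.add.at(signal, bins, polarities)            # signed event count per time bin

    # remove low-frequency, non-audio drift
    b, a = butter(4, 50.0 / (sample_rate / 2.0), btype="highpass")
    return filtfilt(b, a, signal)
```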
- the corresponding target audio signal can be recovered based on the vibration signal; based on filtering, the target audio signal is filtered out from the audio signal to obtain the signal to be filtered; the signal to be filtered is then filtered out from the voice signal to obtain the target voice information, where the target voice information is a voice signal obtained after environmental noise removal processing.
- the corresponding target audio signal can be recovered according to the vibration signal, and based on filtering, the target audio signal is filtered from the audio signal to obtain the signal to be filtered.
- the filtered signal z'(n) basically does not contain the useful signal x'(n); it is basically the external noise other than the user's target audio signal s(n). Optionally, if there are multiple cameras (DVS, high-speed camera, etc.) picking up the vibrations of a certain person, the target audio signals x1'(n), x2'(n), x3'(n), x4'(n) will be recovered from these vibrations; according to the above-mentioned adaptive filtering method, they are filtered out of the mixed audio signal z(n) in turn to obtain the mixed audio signal z'(n) with the x1'(n), x2'(n), x3'(n), x4'(n) audio components removed.
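- as a rough illustration of this adaptive filtering step, the following sketch uses a normalized LMS filter (the tap count and step size are illustrative assumptions) to subtract the component of z(n) that is correlated with a recovered target audio signal x'(n), leaving the residual z'(n):

```python
import numpy as np

def nlms_cancel(z, x_ref, num_taps=64, mu=0.5, eps=1e-6):
    """Remove from z(n) the component correlated with the reference x'(n).

    z: mixed microphone signal z(n); x_ref: target audio recovered from the vibration signal.
    Returns z'(n), the residual that mostly contains the remaining (noise) components."""
    w = np.zeros(num_taps)
    residual = np.zeros_like(z)
    for n in range(num_taps, len(z)):
        u = x_ref[n - num_taps:n][::-1]        # most recent reference samples
        y = np.dot(w, u)                       # estimate of the x' component in z
        e = z[n] - y                           # error = z with the x' component removed
        w += mu * e * u / (np.dot(u, u) + eps) # normalized LMS weight update
        residual[n] = e
    return residual
```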
- the signal to be filtered can be filtered out from the audio signal to obtain the user's voice signal. In one implementation, the noise spectrum is obtained (that is, z'(n) is regarded as the background noise other than the target speech signal s(n)): z'(n) is transformed to the frequency domain, for example with a fast Fourier transform (FFT), to obtain the noise spectrum; the collected audio signal z(n) is transformed to the frequency domain, for example with an FFT, to obtain its frequency spectrum; the noise spectrum is then subtracted from this spectrum to obtain the signal spectrum of the enhanced speech; finally, an inverse fast Fourier transform (IFFT) is performed on the signal spectrum to obtain the user's voice signal, that is, the voice-enhanced signal.
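- a minimal whole-signal sketch of this spectral subtraction step (in practice it is typically performed frame by frame; the function and variable names are illustrative):

```python
import numpy as np

def spectral_subtraction(z, noise_estimate):
    """Enhance speech by subtracting the magnitude spectrum of z'(n) from that of z(n)."""
    Z = np.fft.fft(z)
    N = np.fft.fft(noise_estimate, n=len(z))
    mag = np.maximum(np.abs(Z) - np.abs(N), 0.0)             # subtracted magnitude, floored at zero
    enhanced = np.fft.ifft(mag * np.exp(1j * np.angle(Z)))   # keep the phase of the mixture
    return np.real(enhanced)
```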
- the signal to be filtered is filtered out from the audio signal by means of adaptive filtering.
- the instruction information corresponding to the voice signal of the user may also be obtained based on the target voice information, and the instruction information indicates the semantic intention contained in the voice signal of the user.
- the instruction information can be used to trigger the realization of the corresponding function of the semantic intention contained in the user's voice signal, for example, opening a certain application program, making a voice call, and so on.
- the target voice information may be obtained through a neural network model based on the vibration signal and the voice signal.
- a corresponding target audio signal is obtained according to the vibration signal; based on the target audio signal and the voice signal, the target voice information is obtained through a neural network model. That is, the input of the neural network model can also be the target audio signal recovered from the vibration signal.
- an embodiment of the present invention provides a system architecture 200.
- the data collection device 260 is used to collect audio data and store it in the database 230; the audio data can include noise-free audio, vibration signals (or target audio signals recovered from vibration signals), and noisy audio signals. In a quiet environment, a person is asked to talk or audio is played, and the audio signal recorded by a normal microphone at this time is taken as the "noise-free audio", denoted as s(n).
- there can be multiple vibration sensors facing the person's head, face, throat, neck, etc.; they collect the video frames during this period and obtain the corresponding vibration signal, denoted as x(n); if there are multiple signals, they can be written as x1(n), x2(n), x3(n), x4(n), etc.
- alternatively, the target audio signal recovered from the vibration signal may be recorded instead.
- noise can be added to the "noise-free audio" to obtain a "noisy audio signal", denoted as sn(n).
- the training device 220 generates a target model/rule 201 based on the audio data maintained in the database 230. The following will describe in more detail how the training device 220 obtains the target model/rule 201 based on audio data.
- the target model/rule 201 can obtain the target voice information, or obtain the user's voice signal, based on the vibration signal and the audio signal.
- an introduction to the training process is added here as appropriate: if the inventive point is not in the training process, the following example introduction to the training process is used; if the training process is improved, the following introduction to the training process should be understood as replaced by the improved training process.
- the training device may use a deep neural network to train the data to generate the target model/rule 201.
- the work of each layer in the deep neural network can be described by the mathematical expression y = a(Wx + b): from the physical level, the work of each layer in the deep neural network can be understood as completing the transformation from the input space to the output space (that is, from the row space of the matrix to the column space) through five operations on the input space (the set of input vectors). These five operations include: 1. dimension raising/lowering; 2. enlarging/reducing; 3. rotation; 4. translation; 5. "bending". The operations 1, 2 and 3 are completed by Wx, the operation 4 is completed by +b, and the operation 5 is realized by a().
- W is a weight vector, and each value in the vector represents the weight value of a neuron in the layer of neural network.
- This vector W determines the space transformation from the input space to the output space described above, that is, the weight of each layer controls how the space is transformed.
- the purpose of training a deep neural network is to finally obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning how to control the space transformation, and more specifically, learning the weight matrices.
- since it is hoped that the output of the deep neural network is as close as possible to the value that is really desired to be predicted, the predicted value of the current network can be compared with the really desired target value, and the weight vector of each layer of the neural network is then updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer in the deep neural network). For example, if the predicted value of the network is too high, the weight vector is adjusted to make the prediction lower, and the adjustment continues until the neural network can predict the really desired target value. Therefore, it is necessary to predefine "how to compare the difference between the predicted value and the target value".
- this is the loss function (loss function) or the objective function (objective function), which is an important equation used to measure the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference; the training of the deep neural network then becomes a process of reducing this loss as much as possible.
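- as a small illustration of this loss-driven adjustment, the following sketch performs one gradient-descent update of a single linear layer under a mean-squared-error loss (purely illustrative, not the model used in this application):

```python
import numpy as np

def mse_loss(pred, target):
    return np.mean((pred - target) ** 2)

def sgd_step(W, b, x, target, lr=0.01):
    """One gradient-descent step for the layer y = W x + b."""
    pred = W @ x + b
    grad = 2 * (pred - target) / pred.size   # d(loss)/d(pred)
    W -= lr * np.outer(grad, x)              # adjust weights to reduce the loss
    b -= lr * grad                           # adjust bias to reduce the loss
    return W, b, mse_loss(pred, target)
```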
- the target model/rule obtained by the training device 220 can be applied to different systems or devices.
- the execution device 210 is configured with an I/O interface 212 to perform data interaction with external devices.
- the "user" can input data to the I/O interface 212 through the client device 240.
- the execution device 210 can call data, codes, etc. in the data storage system 250, and can also store data, instructions, etc. in the data storage system 250.
- the calculation module 211 uses the target model/rule 201 to process the input data.
- the I/O interface 212 returns the processing result (the user's instruction information or the user's voice signal) to the client device 240 and provides it to the user.
- the training device 220 can generate corresponding target models/rules 201 based on different data for different targets, so as to provide users with better results.
- the user can manually specify to input data in the execution device 210, for example, to operate in the interface provided by the I/O interface 212.
- the client device 240 can automatically input data to the I/O interface 212 and obtain the result. If the client device 240 automatically inputs data and needs the user's authorization, the user can set the corresponding authority in the client device 240.
- the user can view the result output by the execution device 210 on the client device 240, and the specific presentation form may be a specific manner such as display, sound, and action.
- the client device 240 can also serve as a data collection terminal to store the collected audio data in the database 230.
- FIG. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
- the data storage system 250 is an external memory relative to the execution device 210. In other cases, the data storage system 250 may also be placed in the execution device 210.
- the restored audio signals are x1'(n), x2'(n), x3'(n), x4'(n).
- the target model may be, for example, a recurrent neural network (recurrent neural network, RNN) or a long short-term memory network (long short-term memory, LSTM).
- a neural network can be composed of neural units.
- a neural unit can refer to an arithmetic unit that takes xs and intercept 1 as inputs.
- the output of the arithmetic unit can be: h_{W,b}(x) = f(W^T x) = f(Σ_{s=1}^{n} W_s·x_s + b), where:
- s = 1, 2, ..., n, and n is a natural number greater than 1
- Ws is the weight of xs
- b is the bias of the neural unit.
- f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
- the output signal of the activation function can be used as the input of the next convolutional layer.
- the activation function can be a sigmoid function.
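- a minimal sketch of such a neural unit with a sigmoid activation (names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def neural_unit(xs, Ws, b):
    """Output h = f(sum_s Ws[s] * xs[s] + b) with f = sigmoid."""
    return sigmoid(np.dot(Ws, xs) + b)

# usage: neural_unit(np.array([0.2, 0.7]), np.array([0.5, -1.0]), b=0.1)
```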
- a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
- the input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field.
- the local receptive field can be a region composed of several neural units.
- a deep neural network (DNN) can be understood as a neural network with many hidden layers; there is no special metric for "many" here, and the multi-layer neural network and the deep neural network that we often speak of are essentially the same thing. Dividing the DNN according to the positions of the different layers, the layers inside the DNN can be divided into three categories: the input layer, the hidden layers, and the output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and all the layers in the middle are hidden layers. The layers are fully connected, that is to say, any neuron in the i-th layer is connected to any neuron in the (i+1)-th layer. Although the DNN looks complicated, it is not complicated as far as the work of each layer is concerned.
- for example, the linear coefficient from the fourth neuron in the second layer to the second neuron in the third layer is defined as W^3_{24}.
- the superscript 3 represents the number of layers where the coefficient W is located, and the subscript corresponds to the output third-level index 2 and the input second-level index 4.
- in summary, the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W^L_{jk}. Note that the input layer has no W parameter.
- more hidden layers make the network more capable of portraying complex situations in the real world.
- a model with more parameters is more complex and has a greater "capacity", which means that it can complete more complex learning tasks.
- a convolutional neural network (convolutional neural network, CNN) is a deep neural network with a convolutional structure.
- the convolutional neural network contains a feature extractor composed of a convolutional layer and a sub-sampling layer.
- the feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve with an input image or convolution feature map.
- the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
- a neuron can be connected to only part of the neighboring neurons.
- a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights here are the convolution kernel. Weight sharing can be understood as meaning that the way of extracting image information is independent of position. The underlying principle is that the statistical information of one part of an image is the same as that of other parts, which means that the image information learned in one part can also be used in another part, so the same learned image information can be used for all positions on the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information; generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
- the convolution kernel can be initialized in the form of a matrix of random size. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, and at the same time reduce the risk of overfitting.
- an RNN (recurrent neural network) is used to process sequence data.
- in the traditional neural network model, from the input layer to the hidden layer and then to the output layer, the layers are fully connected, while the nodes within each layer are unconnected. This ordinary neural network is, however, powerless for many problems; for example, to predict what the next word of a sentence will be, the previous words are generally needed, because the preceding and following words in a sentence are not independent.
- RNNs are called recurrent neural networks, that is, the current output of a sequence is also related to the previous output.
- RNNs can process sequence data of any length.
- the training of an RNN is similar to the training of a traditional ANN (artificial neural network), except that the backpropagation through time (BPTT) algorithm is used.
- FIG. 9 is a schematic diagram of the RNN structure, in which each circle can be regarded as a unit, and each unit does the same thing, so it can be folded into the left half of the figure.
- in one sentence, an RNN is the repeated use of one unit structure.
- an RNN is a sequence-to-sequence model: assuming the input x_{t-1}, x_t, x_{t+1} is the (Chinese) sentence fragment "I am China" (我是中国), then o_{t-1} and o_t should correspond to "am" (是) and "China" (中国); to predict the most likely next word, the probability that o_{t+1} is "person" (人, completing "I am Chinese") is relatively high.
- Xt represents the input at time t
- ot represents the output at time t
- St represents the memory at time t, S_t = f(U·X_t + W·S_{t-1}); that is, the memory at the current moment is determined by the input at the current moment and the memory at the previous moment, just as your knowledge in your senior year is a combination of the knowledge learned in the senior year (the current input) and the things learned in the junior year and before (the memory).
- the f() function is the activation function in neural networks, but why add it? For example, if you learned a very good problem-solving method in college, would you still use the problem-solving method you used in junior high school? Obviously it is not needed.
- the idea of RNN is the same. Since it can remember, of course it only remembers important information, and other unimportant ones will definitely be forgotten. But what is best for filtering information in a neural network? It must be an activation function, so an activation function is applied here to make a non-linear mapping to filter information. This activation function may be tanh or other.
- o_t = softmax(V·S_t), where o_t represents the output at time t.
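- a minimal sketch of one time step of such a recurrent unit, with tanh as the activation f() (the matrices U, W, V and the choice of tanh are illustrative):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def rnn_step(x_t, s_prev, U, W, V):
    """One time step of a simple RNN: memory update followed by the output."""
    s_t = np.tanh(U @ x_t + W @ s_prev)   # S_t depends on the current input and the previous memory
    o_t = softmax(V @ s_t)                # o_t = softmax(V * S_t)
    return s_t, o_t
```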
- Convolutional neural networks can use backpropagation (BP) algorithms to modify the size of the parameters in the initial super-resolution model during the training process, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, forwarding the input signal until the output will cause error loss, and the parameters in the initial super-resolution model are updated by backpropagating the error loss information, so that the error loss is converged.
- the backpropagation algorithm is a backpropagation motion dominated by error loss, and aims to obtain the optimal super-resolution model parameters, such as a weight matrix.
- the neural network structure in the embodiment of the present application may be as follows:
- the input of the RNN is the target audio signal obtained according to the vibration signal, and the user voice signal collected by the sensor.
- the target audio signal includes multiple moments and the signal sample value corresponding to each moment
- the user voice signal collected by the sensor includes multiple moments and the signal sample value corresponding to each moment.
- the new audio signal includes multiple moments and the signal sample value corresponding to each moment, where the signal sample value corresponding to each moment is obtained by combining the signal sample value of the target audio signal and the signal sample value of the user voice signal at that moment (this application does not limit the specific combination mode).
- the new audio signal obtained after the combination can be used as the input of the cyclic neural network.
- the target audio signal can be {x_0, x_1, x_2, ..., x_t}, where x_t is the signal sample value of the target audio signal at time t, and the user voice signal can be {y_0, y_1, y_2, ..., y_t}, where y_t is the signal sample value of the user's voice signal at time t.
- the signal sample values at the corresponding time can be combined.
- {x_t, y_t} is the result of combining the signal sample values at time t; the new audio signal obtained by combining the target audio signal and the user voice signal can then be {x_0, y_0}, {x_1, y_1}, {x_2, y_2}, ..., {x_t, y_t}.
- the above-mentioned combination mode of the input audio signals is only illustrative.
- the combined audio signal can express the time sequence characteristics of the signal sample values, and the application does not limit the specific combination mode.
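- one possible combination is sketched below, stacking the two signals sample by sample (the stacking layout is an illustrative choice; as stated above, the specific combination mode is not limited):

```python
import numpy as np

def combine_signals(target_audio, sensor_audio):
    """Pair x_t and y_t at each moment: returns an array of shape (num_moments, 2)."""
    length = min(len(target_audio), len(sensor_audio))
    return np.stack([target_audio[:length], sensor_audio[:length]], axis=1)

# combined[t] == [x_t, y_t], which can then be fed to the recurrent neural network
```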
- the input of the model can be a combined audio signal.
- the input of the model can be the user's voice signal and the target audio signal collected by the sensor.
- the combination of the audio signal can be realized by the model itself.
- the input of the model may also be the user's voice signal collected by the sensor and the corresponding vibration signal when the user utters the voice.
- in this case, the process of converting the vibration signal into the target audio signal may be implemented by the model itself, that is, the model may first convert the vibration signal into a target audio signal, and then combine the audio signals.
- the target voice information can be output, and the target voice information can include multiple moments and the signal sample value corresponding to each moment.
- the target voice information can be {k_0, k_1, k_2, ..., k_l}; it should be noted that the number of signal sample values (the number of moments) included in the target voice information can be the same as or different from the number of signal sample values (the number of moments) included in the input audio signal.
- for example, when the target voice information is only the voice information related to the human voice in the user voice signal collected by the sensor, the number of signal sample values included in the target voice information may be less than the number of signal sample values included in the input audio signal.
- the data in the training sample database can be input into the initialized neural network model for training.
- the training sample database includes pairs of a "speech signal with environmental noise", a "target audio signal", and the corresponding "noise-free audio signal".
- the initialized neural network model includes weights and biases; in the K-th training process, the neural network model adjusted K-1 times is used to extract the denoised audio signal s'(n) from the audio features of the noisy audio signal and the visual vibration audio signal of the sample, where K is an integer greater than 0; after the K-th training, the error value between the denoised audio signal s'(n) extracted from the sample and the noise-free audio signal s(n) is obtained; based on the error value between the denoised audio signal extracted from the sample and the noise-free audio signal, the weights and biases used in the (K+1)-th training process are adjusted.
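- a minimal PyTorch-style sketch of this iterative adjustment is given below; the GRU-based model, the mean-squared-error loss, and the optimizer are illustrative stand-ins rather than the specific network of this application:

```python
import torch
import torch.nn as nn

class DenoiseRNN(nn.Module):
    def __init__(self, input_dim=2, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):                      # x: (batch, time, 2) = [noisy, vibration-recovered audio]
        h, _ = self.rnn(x)
        return self.out(h).squeeze(-1)         # s'(n): denoised audio sample per moment

model = DenoiseRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(noisy_plus_vibration, clean):   # one adjustment of the weights and biases
    optimizer.zero_grad()
    denoised = model(noisy_plus_vibration)
    loss = loss_fn(denoised, clean)            # error between s'(n) and s(n)
    loss.backward()                            # propagate the error value
    optimizer.step()                           # adjust weights/biases for the next pass
    return loss.item()
```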
- the audio signal obtained after the above combination includes two dimensions, namely the user voice signal collected by the sensor and the target audio signal. Since both are audio signals (the audio signal x'(n) needs to be decoded and restored first based on the vibration signal x(n)), the audio feature vector MFCC coefficients can be extracted separately.
- the MFCC feature is the most widely used basic feature.
- the MFCC feature is based on the characteristics of the human ear, that is, the human ear's perception of the sound frequency range above about 1000 Hz does not follow a linear relationship, but follows an approximate linear relationship on the logarithmic frequency coordinate.
- MFCC is a cepstrum parameter extracted in the frequency domain of the Mel scale. The Mel scale describes this non-linear characteristic of the human ear frequency.
- Pre-processing consists of pre-emphasis, frame division and windowing.
- the purpose of pre-emphasis is to eliminate the influence of nose and mouth radiation during pronunciation; the high-frequency part of the speech can be boosted through a high-pass filter.
- since the voice signal is stationary over a short period of time, the voice signal is divided into short segments through frame division and windowing, and each short segment is called a frame.
- the FFT transform changes the time-domain signal after frame division and windowing to the frequency domain, and obtains the spectral characteristic X(k).
- the spectral feature X(k) of the speech frame is filtered by the above-mentioned Mel filter bank, the energy of each subband is obtained, and the logarithm is then taken to obtain the Mel-frequency logarithmic energy spectrum S(m); the MFCC coefficients C(n) are obtained from S(m) through the discrete cosine transform (DCT).
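- a minimal sketch of this MFCC pipeline (the frame length, hop size, filter count, and mel filter construction are illustrative; an audio library could equally be used):

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    # pre-emphasis: boost the high-frequency part with a first-order high-pass filter
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # frame division and Hamming windowing
    frames = []
    for start in range(0, len(emphasized) - n_fft, hop):
        frames.append(emphasized[start:start + n_fft] * np.hamming(n_fft))
    frames = np.array(frames)
    # FFT -> power spectrum X(k)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # triangular mel filter bank
    mel_points = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_points = 700 * (10 ** (mel_points / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        fbank[m - 1, bins[m - 1]:bins[m]] = np.linspace(0, 1, bins[m] - bins[m - 1], endpoint=False)
        fbank[m - 1, bins[m]:bins[m + 1]] = np.linspace(1, 0, bins[m + 1] - bins[m], endpoint=False)
    # mel-frequency log energy spectrum S(m), then DCT -> MFCC coefficients C(n)
    S = np.log(np.maximum(power @ fbank.T, 1e-10))
    return dct(S, type=2, axis=1, norm="ortho")[:, :n_ceps]
```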
- in another implementation, the constructed audio feature vector still contains two dimensions, namely the "noisy audio signal", but the "visual vibration audio signal" is replaced by the "visual vibration signal", where the signal is acquired as follows: with a high-speed camera, for each frame of the image, four scales r (such as 1, 1/4, 1/16, 1/64) and four directions θ (such as up, down, left, and right) are used.
- the feature vector method is not used, and features are directly extracted from the original multimodal data and applied in the network.
- a deep network can be trained to learn the mapping relationship from "noisy audio signals" and "vibration signals" to noise-free speech signals.
- RNNs contain input units, and the corresponding input set is marked as {x_0, x_1, x_2, ..., x_t, x_{t+1}, ...}; the output set of the output units is marked as {o_0, o_1, o_2, ..., o_t, o_{t+1}, ...}.
- RNNs also contain hidden units, and their output set is labeled {s_0, s_1, s_2, ..., s_t, s_{t+1}, ...}.
- the concatenated multi-modal mixed signal is x_t = [sn(0), ..., sn(t), sv(0), ..., sv(t)].
- the data in the training sample database can be input into the initialized neural network model for training.
- the training sample database includes a pair of "noisy audio signals”, “visual vibration signals” and corresponding “noise-free audio signals”.
- the initialized neural network model includes weights and biases; in the K-th training process, the neural network model adjusted K-1 times is used to extract the denoised audio signal s'(n) from the noisy audio signal and the visual vibration signal of the sample, where K is an integer greater than 0; after the K-th training, the error value between the denoised audio signal s'(n) extracted from the sample and the noise-free audio signal s(n) is obtained; based on the error value between the denoised audio signal extracted from the sample and the noise-free audio signal, the weights and biases used in the (K+1)-th training process are adjusted.
- the vibration audio signal x'(n) (and/or x1'(n), x2'(n), x3'(n), x4'(n)) can be used instead of the visual vibration signal x(n) (and/or x1(n), x2(n), x3(n), x4(n)) for training, that is, the audio signal recovered from the vibration signal is fused with the audio signal collected by the microphone.
- the output of the neural network model may be the voice signal obtained after environmental noise removal processing, or the user's instruction information, where the instruction information is determined based on the user's voice signal, and the instruction information is used to indicate the semantic intention contained in the user's voice signal.
- the device with voice interaction function can trigger the corresponding function based on the instruction information, such as opening an application and so on.
- An embodiment of the present application provides a voice signal processing method, including: acquiring a user's voice signal collected by a sensor, the voice signal including environmental noise; acquiring a vibration signal corresponding to the voice when the user makes the voice; wherein the vibration The signal is used to indicate the vibration characteristics of the body part of the user; the body part is the part that vibrates correspondingly based on the sound behavior when the user is in the vocal state; and the user collected according to the vibration signal and the sensor Voice signal to obtain target voice information.
- the vibration signal is used as the basis for speech recognition.
- since the vibration signal does not contain the external non-user voice mixed in through complex acoustic transmission, it is less affected by other environmental noises (such as the effect of reverberation), so this part of noise interference can be suppressed relatively well and a better speech recognition effect can be achieved.
- the user's brain wave signal corresponding to the voice when the user utters the voice can also be obtained; accordingly, the target voice information can be obtained based on the vibration signal, the brain wave signal, and the voice signal.
- the brain wave signal of the user may be acquired based on the brain wave pickup device, where the brain wave pickup device may be earphones, glasses, or other ear-worn forms.
- the brain wave signal may be collected by brain electrical acquisition equipment, including electrodes, front-end analog amplifiers, analog-to-digital conversion, EEG signal processing modules, etc., which collects the EEG signals of multiple brain regions in different frequency bands.
- through acquisition by the acquisition device or by an optical imaging device, the mapping relationship between the brain wave signal and the motion signal of the vocal tract occlusal parts can be established for a person reading different corpus materials.
- the motion signal of the vocal tract occlusal parts corresponding to the brain wave signal may then be acquired based on the mapping relationship between the brain wave signal and the motion signal of the vocal tract.
- the brain wave signal can be converted into vocal occlusal joint motion (motion signal), and then these decoded motions can be converted into voice signals. That is, the brainwave signal is first converted into the motion of the vocal tract occlusal part, which involves the anatomical structure of the voice (such as the motion signal of the lips, tongue, larynx, and jaw).
- to do this, it is necessary to correlate a large amount of vocal tract movement with the corresponding neural activity when a person speaks. This association can be established with a recurrent neural network based on a large number of previously collected vocal tract motion and voice recording data sets, and the motion signal of the vocal tract occlusal parts can then be converted into a voice signal.
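- a highly simplified sketch of such a two-stage decoder is shown below, chaining two recurrent networks (brain wave features → articulator motion → audio frames); the architecture, dimensions, and names are purely illustrative assumptions, not the decoder of this application:

```python
import torch
import torch.nn as nn

class BrainToMotion(nn.Module):
    """Stage 1: decode brain wave features into vocal-tract articulator motion
    (e.g. lip, tongue, larynx, and jaw trajectories)."""
    def __init__(self, eeg_dim=64, motion_dim=12, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(eeg_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, motion_dim)

    def forward(self, eeg):                 # eeg: (batch, time, eeg_dim)
        h, _ = self.rnn(eeg)
        return self.out(h)

class MotionToSpeech(nn.Module):
    """Stage 2: convert the decoded articulator motion into audio frames."""
    def __init__(self, motion_dim=12, frame_dim=80, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(motion_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, frame_dim)

    def forward(self, motion):
        h, _ = self.rnn(motion)
        return self.out(h)

# speech_frames = MotionToSpeech()(BrainToMotion()(eeg_features))
```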
- the target voice information may be obtained through a neural network model based on the vibration signal, the motion signal, and the voice signal.
- the corresponding first target audio signal may be obtained according to the vibration signal; the corresponding second target audio signal may be obtained according to the motion signal; based on the first target audio signal, The second target audio signal and the voice signal obtain the target voice information through a neural network model.
- for specific implementation details, reference may be made to the descriptions related to the neural network model in the foregoing embodiment, which will not be repeated here.
- the voice signal may be directly mapped based on the brain wave signal, and further, the target voice information may be obtained through a neural network model based on the vibration signal, the brain wave signal, and the voice signal.
- the corresponding first target audio signal may be obtained according to the vibration signal; the corresponding second target audio signal may be obtained according to the brain wave signal; based on the first target audio signal , The second target audio signal and the voice signal obtain the target voice information through a neural network model.
- for specific implementation details, reference may be made to the descriptions related to the neural network model in the foregoing embodiment, which will not be repeated here.
- the target voice information includes voiceprint features representing the voice signal of the user.
- the target voice information may be obtained through a neural network model based on the vibration signal and the voice signal.
- the target voice information may be used to represent the voiceprint characteristics of the user's voice signal, and further, the target voice information may be processed based on a fully connected layer to obtain a voiceprint recognition result.
- the corresponding target audio signal may be obtained according to the vibration signal; based on the target audio signal and the voice signal, the target voice information is obtained through a neural network model.
- the target voice information can be obtained through a neural network model.
- the corresponding first target audio signal may be obtained according to the vibration signal; the corresponding second target audio signal may be obtained according to the brain wave signal; based on the first target audio signal , The second target audio signal and the voice signal obtain the target voice information through a neural network model.
- the vibration signal when the user speaks is used as the basis for voiceprint recognition. Since the vibration signal is only slightly interfered with by other noises (such as reverberation interference), it can express the original audio characteristics of the user's speech; therefore, in this application, by using vibration signals as the basis for voiceprint recognition, the recognition effect is better and the reliability is stronger.
- FIG. 14 is a schematic flowchart of a voice signal processing method provided by an embodiment of the application. As shown in FIG. 14, the method includes:
- for the specific description of step 1401, refer to the description of step 701, which will not be repeated here.
- for the specific description of step 1402, reference may be made to the description related to the brain wave signal in the foregoing embodiment, which will not be repeated here.
- target voice information can be obtained according to the brain wave signal and the voice signal.
- for the manner in which this embodiment obtains the target voice information based on the brain wave signal and the voice signal, the description of step 703 in the foregoing embodiment can be used for reference, and details are not repeated here.
- the motion signal of the occlusion part of the vocal tract when the user is speaking may also be obtained according to the brain wave signal; further, the target voice information may be obtained according to the motion signal and the voice signal.
- in one implementation, the target voice information is a voice signal obtained after environmental noise removal processing. A corresponding target audio signal may be obtained based on the brain wave signal; based on filtering, the target audio signal is filtered out from the voice signal to obtain the signal to be filtered out; the signal to be filtered out is then filtered out from the voice signal to obtain the target voice information.
- the instruction information corresponding to the voice signal of the user may be obtained based on the target voice information, and the instruction information indicates the semantic intention contained in the voice signal of the user.
- the target voice information may be obtained through a neural network model based on the brain wave signal and the voice signal; or, the corresponding target audio signal may be obtained according to the brain wave signal; Based on the target audio signal and the voice signal, the target voice information is obtained through a neural network model; wherein, the target voice information is a voice signal obtained after environmental noise removal processing or a voice signal corresponding to the user's voice signal Instruction information.
- the target voice information includes voiceprint features representing the voice signal of the user.
- An embodiment of the present application provides a voice signal processing method. The method includes: acquiring a user's voice signal collected by a sensor; acquiring the user's brain wave signal corresponding to when the user utters the voice; and obtaining target voice information according to the brain wave signal and the user's voice signal collected by the sensor.
- the brain wave signal is used as the basis for speech recognition. Because the brain wave signal does not contain the voices of external non-users mixed in during the complex acoustic transmission, it is only slightly affected by other environmental noise (such as reverberation); this part of the noise interference can therefore be suppressed relatively well, and a better speech recognition effect can be achieved.
- FIG. 15 is a schematic flowchart of a voice signal processing method provided by an embodiment of the application. As shown in FIG. 15, the method includes:
- For the specific description of step 1501, refer to the description of step 701, which will not be repeated here.
- acquire the vibration signal corresponding to when the user utters the voice; wherein the vibration signal is used to represent the vibration characteristics of the user's body part, and the body part is a part that vibrates correspondingly based on the vocalization behavior when the user is in a voiced state;
- For the specific description of step 1502, reference may be made to the description of step 702, which will not be repeated here.
- the vibration signal is used to indicate a vibration characteristic corresponding to the sounding vibration.
- voiceprint recognition is performed based on the user voice signal collected by the sensor to obtain a first confidence that the user voice signal collected by the sensor belongs to the user; voiceprint recognition is performed based on the vibration signal to obtain a second confidence that the user voice signal collected by the sensor belongs to the target user; and a voiceprint recognition result is obtained according to the first confidence and the second confidence.
- the first confidence level and the second confidence level may be weighted to obtain the voiceprint recognition result.
- the user's brain wave signal corresponding to when the user utters the voice may also be acquired; according to the brain wave signal, the motion signal of the vocal tract occlusion part when the user utters the voice is acquired; further, voiceprint recognition may be performed based on the user's voice signal collected by the sensor, the vibration signal, and the motion signal.
- voiceprint recognition may be performed based on the user voice signal collected by the sensor to obtain a first confidence that the user voice signal collected by the sensor belongs to the user; voiceprint recognition may be performed based on the vibration signal to obtain a second confidence that the user voice signal collected by the sensor belongs to the user; voiceprint recognition is performed according to the brain wave signal to obtain a third confidence that the user voice signal collected by the sensor belongs to the user; and a voiceprint recognition result is obtained according to the first confidence, the second confidence, and the third confidence. For example, the first confidence, the second confidence, and the third confidence may be weighted to obtain the voiceprint recognition result.
- the voiceprint recognition result may be obtained through a neural network model based on the audio signal, the vibration signal, and the brain wave signal.
- for example, voiceprint recognition may be performed separately on the audio signals x1'(n), x2'(n), x3'(n), x4'(n) and x'(n) recovered from the vibration signals, the audio signal y'(n) recovered from the brain wave signal, and the audio signal collected by the sensor, and the respective voiceprint recognition results are then weighted and summed to give the final result:
- VP = h1*x1 + h2*x2 + h3*x3 + h4*x4 + h5*x + h6*y + h7*s; where x1, x2, x3, x4 and x represent the recognition results of the vibration signals, y represents the recognition result of the brain wave signal, s represents the recognition result of the audio signal, and h1 to h7 represent the weights of the corresponding recognition results, which can be chosen flexibly. If the final recognition result VP exceeds a preset threshold VP_TH, the voiceprint verification based on the vibration pickup is considered passed.
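- A minimal sketch of this weighted fusion is shown below; the scores, weights h1 to h7, and threshold are hypothetical values chosen only for illustration:

```python
def fuse_voiceprint_scores(scores, weights, threshold):
    """scores: per-signal confidences that the picked-up speech belongs to the
    enrolled user (from the vibration signals, the brain wave signal, and the
    audio signal); weights: h1..h7, chosen empirically."""
    vp = sum(w * s for w, s in zip(weights, scores))
    return vp, vp >= threshold        # True -> voiceprint verification passes

vp, accepted = fuse_voiceprint_scores(
    scores=[0.91, 0.88, 0.84, 0.90, 0.76, 0.81, 0.79],   # x1..x4, x, y, s (hypothetical)
    weights=[0.20, 0.15, 0.10, 0.10, 0.20, 0.15, 0.10],  # h1..h7 (hypothetical)
    threshold=0.80,                                       # VP_TH (hypothetical)
)
```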
- the vibration signal produced when the user speaks is used as the basis for voiceprint recognition. Because the vibration signal suffers only slight interference from other noise (such as reverberation), it preserves the original audio characteristics of the user's speech; therefore, by using the vibration signal as the basis for voiceprint recognition, this application achieves a better recognition effect and stronger reliability.
- FIG. 16 is a schematic structural diagram of a voice signal processing apparatus provided by this application.
- the apparatus 1600 includes:
- the environmental voice acquisition module 1601 is used to acquire the user voice signal collected by the sensor
- the vibration signal acquisition module 1602 is used to acquire the corresponding vibration signal when the user utters the voice; wherein the vibration signal is used to represent the vibration characteristics of the user's body part, and the body part is a part that vibrates correspondingly based on the vocal behavior when the user is in the vocal state; and
- the voice information acquisition module 1603 is configured to obtain target voice information according to the vibration signal and the user voice signal collected by the sensor.
- the vibration signal is used to indicate a vibration characteristic corresponding to the vibration generated by the voice.
- the body part includes at least one of the following: top of the skull, face, throat, or neck.
- the vibration signal acquisition module 1602 is configured to acquire a video frame that includes the user; according to the video frame, extract a vibration signal corresponding to when the user makes a voice.
- the video frame is collected by a dynamic vision sensor and/or a high-speed camera.
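- A minimal sketch of one way to recover such a vibration signal from video frames is given below; it assumes a simple per-frame intensity-difference measure and a first-order high-pass filter as a stand-in for the full motion analysis, and the parameter values are illustrative:

```python
import numpy as np

def vibration_signal_from_frames(frames):
    """frames: grayscale frames of the vibrating body part, shape (T, H, W),
    captured at a high frame rate. Returns a 1-D vibration signal (one sample
    per frame) with slow, non-speech motion suppressed."""
    frames = frames.astype(np.float32)
    motion = np.abs(frames - frames[0]).mean(axis=(1, 2))  # per-frame motion vs. reference frame
    motion -= motion.mean()                                 # remove DC offset
    hp, alpha = np.empty_like(motion), 0.95                 # first-order high-pass filter
    hp[0] = motion[0]
    for n in range(1, len(motion)):
        hp[n] = alpha * (hp[n - 1] + motion[n] - motion[n - 1])
    return hp

# synthetic example: 2000 frames of 32x32 pixels with a small oscillation
t = np.arange(2000)
frames = 128 + 2 * np.sin(2 * np.pi * 0.05 * t)[:, None, None] * np.ones((1, 32, 32))
signal = vibration_signal_from_frames(frames)
```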
- the target voice information is a voice signal obtained after environmental noise removal processing
- the voice information obtaining module 1603 is configured to obtain a corresponding target audio signal according to the vibration signal; based on filtering, filter the target audio signal out of the user voice signal collected by the sensor to obtain a noise signal to be filtered out; and filter the noise signal to be filtered out of the user voice signal collected by the sensor to obtain the target voice information.
- the device further includes:
- the instruction information acquisition module is configured to acquire instruction information corresponding to the user's voice signal based on the target voice information, and the instruction information indicates the semantic intention contained in the user's voice signal.
- the voice information acquisition module 1603 is configured to obtain the target voice information through a recurrent neural network model based on the vibration signal and the user voice signal collected by the sensor; or, obtain a corresponding target audio signal according to the vibration signal, and obtain the target voice information through a recurrent neural network model based on the target audio signal and the user voice signal collected by the sensor.
- the device further includes:
- the brain wave signal acquisition module is used to acquire the user's brain wave signal corresponding to when the user utters the voice; correspondingly, the voice information acquisition module is used to obtain the target voice information according to the vibration signal, the brain wave signal, and the user voice signal collected by the sensor.
- the device further includes:
- the motion signal acquisition module is used to acquire, according to the brain wave signal, the motion signal of the vocal tract occlusion part when the user utters the voice; correspondingly, the voice information acquisition module is used to obtain the target voice information according to the vibration signal, the motion signal, and the user voice signal collected by the sensor.
- the voice information acquisition module 1603 is configured to obtain the target voice information through a recurrent neural network model based on the vibration signal, the brain wave signal, and the user voice signal collected by the sensor; or, obtain a corresponding first target audio signal according to the vibration signal and a corresponding second target audio signal according to the brain wave signal, and obtain the target voice information through a recurrent neural network model based on the first target audio signal, the second target audio signal, and the user voice signal collected by the sensor.
- the target voice information includes voiceprint features representing the voice signal of the user.
- the embodiment of the application provides a voice signal processing device.
- the device includes: an environmental voice acquisition module for acquiring a user voice signal collected by a sensor; a vibration signal acquisition module for acquiring a corresponding vibration signal when the user utters the voice, wherein the vibration signal is used to represent the vibration characteristics of the user's body part, and the body part is a part that vibrates correspondingly based on the vocal behavior when the user is in the vocal state; and a voice information acquisition module for obtaining target voice information according to the vibration signal and the user voice signal collected by the sensor.
- the vibration signal is used as the basis for speech recognition.
- Because the vibration signal does not contain the voices of external non-users mixed in during the complex acoustic transmission, it is only slightly affected by other environmental noise (such as reverberation); this part of the noise interference can therefore be suppressed relatively well, and a better speech recognition effect can be achieved.
- FIG. 17 is a schematic structural diagram of a voice signal processing apparatus provided by this application.
- the apparatus 1700 includes:
- the environmental voice acquisition module 1701 is used to acquire the user's voice signal collected by the sensor;
- the brain wave signal acquisition module 1702 is configured to acquire the user's brain wave signal corresponding to when the user utters the voice;
- the voice information obtaining module 1703 is configured to obtain target voice information according to the brain wave signal and the user voice signal collected by the sensor.
- the device further includes:
- the motion signal acquisition module is used to acquire, according to the brain wave signal, the motion signal of the vocal tract occlusion part when the user is speaking; correspondingly, the voice information acquisition module is used to obtain the target voice information according to the motion signal and the user voice signal collected by the sensor.
- the voice information acquisition module is configured to acquire a corresponding target audio signal according to the brain wave signal; based on filtering, filter the target audio signal out of the user voice signal collected by the sensor to obtain a signal to be filtered out; and filter the signal to be filtered out of the user voice signal collected by the sensor to obtain the target voice information.
- the device further includes:
- the instruction information acquisition module is configured to acquire instruction information corresponding to the user's voice signal based on the target voice information, and the instruction information indicates the semantic intention contained in the user's voice signal.
- the voice information acquisition module is configured to obtain the target voice information through a recurrent neural network model based on the brain wave signal and the user voice signal collected by the sensor; or, obtain a corresponding target audio signal according to the brain wave signal, and obtain the target voice information through a recurrent neural network model based on the target audio signal and the user voice signal collected by the sensor.
- the target voice information includes voiceprint features representing the voice signal of the user.
- the embodiment of the application provides a voice signal processing device. The device includes: an environmental voice acquisition module for acquiring a user's voice signal collected by a sensor; a brain wave signal acquisition module for acquiring the user's brain wave signal corresponding to when the user utters the voice; and a voice information acquisition module for obtaining target voice information based on the brain wave signal and the user's voice signal collected by the sensor.
- the brain wave signal is used as the basis for speech recognition. Because the brain wave signal does not contain the voices of external non-users mixed in during the complex acoustic transmission, it is hardly affected by other environmental noise (such as reverberation); this part of the noise interference can therefore be suppressed relatively well, and a better speech recognition effect can be achieved.
- FIG. 18 is a schematic structural diagram of a voice signal processing apparatus provided by this application.
- the apparatus 1800 includes:
- the environmental voice acquisition module 1801 is used to acquire the user voice signal collected by the sensor
- the vibration signal acquisition module 1802 is used to acquire the corresponding vibration signal when the user utters the voice; wherein the vibration signal is used to represent the vibration characteristics of the user's body part, and the body part is a part that vibrates correspondingly based on the vocal behavior when the user is in the vocal state; and
- the voiceprint recognition module 1803 is configured to perform voiceprint recognition based on the user's voice signal and the vibration signal collected by the sensor.
- the vibration signal is used to indicate a vibration characteristic corresponding to the vibration generated by the voice.
- the voiceprint recognition module is configured to perform voiceprint recognition according to the user's voice signal collected by the sensor to obtain a first confidence that the user's voice signal collected by the sensor belongs to the user; perform voiceprint recognition according to the vibration signal to obtain a second confidence that the user's voice signal collected by the sensor belongs to the target user; and obtain a voiceprint recognition result according to the first confidence and the second confidence.
- the device further includes:
- a brain wave signal acquisition module configured to acquire the user's brain wave signal corresponding to when the user utters the voice
- the voiceprint recognition module is configured to perform voiceprint recognition based on the user's voice signal, the vibration signal, and the brain wave signal collected by the sensor.
- the voiceprint recognition module is configured to perform voiceprint recognition according to the user's voice signal collected by the sensor to obtain a first confidence that the user's voice signal collected by the sensor belongs to the user; perform voiceprint recognition according to the vibration signal to obtain a second confidence that the user's voice signal collected by the sensor belongs to the user; perform voiceprint recognition according to the brain wave signal to obtain a third confidence that the user's voice signal collected by the sensor belongs to the user; and obtain a voiceprint recognition result according to the first confidence, the second confidence, and the third confidence.
- the embodiment of the application provides a voice signal processing device.
- the device includes: an environmental voice acquisition module for acquiring a user voice signal collected by a sensor; a vibration signal acquisition module for acquiring a corresponding vibration signal when the user utters the voice, wherein the vibration signal is used to represent the vibration characteristics of the user's body part, and the body part is a part that vibrates correspondingly based on the vocal behavior when the user is in the vocal state; and a voiceprint recognition module for performing voiceprint recognition based on the user's voice signal collected by the sensor and the vibration signal.
- the vibration signal when the user speaks is used as the basis for voiceprint recognition.
- Because the vibration signal suffers only slight interference from other noise (such as reverberation), it preserves the original audio characteristics of the user's speech; therefore, by using the vibration signal as the basis for voiceprint recognition, this application achieves a better recognition effect and stronger reliability.
- the execution device may be the device with voice interaction function or voice input device in the above embodiment.
- FIG. 19 is a schematic structural diagram of the execution device provided by an embodiment of this application.
- the execution device 1900 may be specifically represented as a mobile phone, a tablet, a notebook computer, a smart wearable device, a server, etc., which is not limited here.
- the voice signal processing apparatus described in the foregoing embodiments may be deployed on the execution device 1900 to implement the corresponding voice signal processing functions.
- the execution device 1900 includes: a receiver 1901, a transmitter 1902, a processor 1903, and a memory 1904 (the number of processors 1903 in the execution device 1900 may be one or more; one processor is taken as an example in FIG. 19), where the processor 1903 may include an application processor 19031 and a communication processor 19032. In some embodiments of the present application, the receiver 1901, the transmitter 1902, the processor 1903, and the memory 1904 may be connected by a bus or in other manners.
- the memory 1904 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1903. A part of the memory 1904 may also include a non-volatile random access memory (NVRAM).
- the memory 1904 stores a processor and operating instructions, executable modules or data structures, or a subset of them, or an extended set of them, where the operating instructions may include various operating instructions for implementing various operations.
- the processor 1903 controls the operation of the execution device.
- the various components of the execution device are coupled together through a bus system.
- the bus system may also include a power bus, a control bus, and a status signal bus.
- various buses are referred to as bus systems in the figure.
- the method disclosed in the foregoing embodiment of the present application may be applied to the processor 1903 or implemented by the processor 1903.
- the processor 1903 may be an integrated circuit chip with signal processing capabilities. In the implementation process, the steps of the foregoing method can be completed by an integrated logic circuit of hardware in the processor 1903 or instructions in the form of software.
- the above-mentioned processor 1903 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
- the processor 1903 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application.
- the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
- the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
- the software module can be located in a mature storage medium in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers.
- the storage medium is located in the memory 1904, and the processor 1903 reads the information in the memory 1904, and completes the steps of the foregoing method in combination with its hardware.
- the receiver 1901 can be used to receive input digital or character information, and to generate signal input related to the relevant settings and function control of the execution device.
- the transmitter 1902 can be used to output digital or character information through the first interface; the transmitter 1902 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 1902 can also include display devices such as a display screen .
- the processor 1903 is configured to execute the voice signal processing method executed by the execution device in the corresponding embodiment of FIG. 7, FIG. 14, and FIG. 15.
- FIG. 20 is a schematic structural diagram of the training device provided in an embodiment of the application.
- the training device 2000 is implemented by one or more servers. The training device 2000 may vary greatly due to different configurations or performance, and may include one or more central processing units (CPU) 2020 (for example, one or more processors), a memory 2032, and one or more storage media 2030 (for example, one or more mass storage devices) storing an application program 2042 or data 2044.
- the memory 2032 and the storage medium 2030 may be short-term storage or persistent storage.
- the program stored in the storage medium 2030 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the training device. Furthermore, the central processing unit 2020 may be configured to communicate with the storage medium 2030, and execute a series of instruction operations in the storage medium 2030 on the training device 2000.
- the training device 2000 may also include one or more power supplies 2026, one or more wired or wireless network interfaces 2050, one or more input/output interfaces 2058, and/or one or more operating systems 2041, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
- the central processing unit 2020 is configured to execute the steps related to the neural network model training method in the foregoing embodiment.
- the embodiments of the present application also provide a product including a computer program, which when running on a computer, causes the computer to execute the steps executed by the aforementioned execution device, or causes the computer to execute the steps executed by the aforementioned training device.
- the embodiments of the present application also provide a computer-readable storage medium that stores a program for signal processing; when the program runs on a computer, the computer is caused to execute the steps performed by the aforementioned execution device, or to execute the steps performed by the aforementioned training device.
- the execution device, training device, or terminal device provided by the embodiments of the present application may specifically be a chip.
- the chip includes a processing unit and a communication unit.
- the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, Pins or circuits, etc.
- the processing unit can execute the computer-executable instructions stored in the storage unit to make the chip in the execution device execute the data processing method described in the foregoing embodiment, or to make the chip in the training device execute the data processing method described in the foregoing embodiment.
- the storage unit may be a storage unit in the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip in the device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM), etc.
- Figure 21 is a schematic structural diagram of a chip provided by an embodiment of the application.
- the Host CPU assigns tasks to the NPU; the core part of the NPU is the arithmetic circuit 2103.
- the arithmetic circuit 2103 is controlled by the controller 2104 to extract matrix data from the memory and perform multiplication operations.
- the arithmetic circuit 2103 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 2103 is a two-dimensional systolic array. The arithmetic circuit 2103 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2103 is a general-purpose matrix processor.
- the arithmetic circuit fetches the corresponding data of matrix B from the weight memory 2102 and caches it on each PE in the arithmetic circuit.
- the arithmetic circuit fetches the data of matrix A from the input memory 2101, performs matrix operations with matrix B, and stores the partial or final result of the matrix in the accumulator 2108.
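- A minimal sketch of this accumulate-over-tiles behavior, using NumPy tiles as a stand-in for the processing units and the accumulator of the hardware, is shown below; the tile size and shapes are illustrative assumptions:

```python
import numpy as np

def npu_style_matmul(A, B, tile=16):
    """Tiled matrix multiply: tiles of B (weights) are consumed one at a time and
    partial products are summed into an accumulator, mirroring the role of the
    accumulator 2108 described above."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    acc = np.zeros((M, N), dtype=A.dtype)        # accumulator for partial results
    for k0 in range(0, K, tile):
        a_tile = A[:, k0:k0 + tile]              # data of matrix A from the input memory
        b_tile = B[k0:k0 + tile, :]              # data of matrix B from the weight memory
        acc += a_tile @ b_tile                   # partial result accumulated
    return acc

A = np.random.rand(8, 64).astype(np.float32)
B = np.random.rand(64, 4).astype(np.float32)
assert np.allclose(npu_style_matmul(A, B), A @ B, atol=1e-5)
```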
- the unified memory 2106 is used to store input data and output data.
- the weight data is transferred directly to the weight memory 2102 through a direct memory access controller (DMAC) 2105.
- the input data is also transferred to the unified memory 2106 through the DMAC.
- the BIU is the Bus Interface Unit, that is, the bus interface unit 2110, which is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer (IFB) 2109.
- the bus interface unit 2110 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 2109 to obtain instructions from the external memory, and is also used for the storage unit access controller 2105 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
- the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 2106 or to transfer the weight data to the weight memory 2102 or to transfer the input data to the input memory 2101.
- the vector calculation unit 2107 includes multiple arithmetic processing units, and if necessary, further processes the output of the arithmetic circuit 2103, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on. It is mainly used in the calculation of non-convolutional/fully connected layer networks in neural networks, such as Batch Normalization, pixel-level summation, and upsampling of feature planes.
- the vector calculation unit 2107 can store the processed output vector to the unified memory 2106.
- the vector calculation unit 2107 can apply a linear function or a nonlinear function to the output of the arithmetic circuit 2103, for example, linear interpolation of the feature planes extracted by the convolutional layers, or, for example, accumulation of a vector of values to generate an activation value.
- the vector calculation unit 2107 generates normalized values, pixel-level summed values, or both.
- the processed output vector can be used as an activation input to the arithmetic circuit 2103, for example for use in a subsequent layer in a neural network.
- the instruction fetch buffer 2109 connected to the controller 2104 is used to store instructions used by the controller 2104;
- the unified memory 2106, the input memory 2101, the weight memory 2102, and the fetch memory 2109 are all On-Chip memories.
- the external memory is private to the NPU hardware architecture.
- the processor mentioned in any of the above can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
- the device embodiments described above are only illustrative; the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units.
- the physical unit can be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- the connection relationship between the modules indicates that they have a communication connection between them, which can be specifically implemented as one or more communication buses or signal lines.
- this application can be implemented by means of software plus the necessary general-purpose hardware. Of course, it can also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and so on. In general, any function completed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to achieve the same function can be diverse, such as analog circuits, digital circuits, or dedicated circuits. However, for this application, a software program implementation is the better implementation in most cases. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product.
- the computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a training device, a network device, etc.) to execute the methods described in the embodiments of this application.
- the computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
- the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired manner (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (such as infrared, radio, or microwave).
- the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a training device or a data center, integrating one or more available media.
- the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
Abstract
一种语音信号处理方法及其相关设备,该方法可应用于音频领域,包括:获取传感器采集的用户语音信号;获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位;根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息。本申请将振动信号作为语音识别的依据,由于振动信号没有包含复杂的声学传输时混入的外界非用户的语音,受其他环境噪声的影响很小(例如混响影响),因此可以相对较好的抑制住这部分噪声干扰,可以实现更好的语音识别效果。
Description
本申请涉及音频处理领域,尤其涉及一种语音信号处理方法及其相关设备。
人机交互(human-computer interaction,HCI)主要是研究人和计算机之间的信息交换,它主要包括人到计算机和计算机到人的信息交换两部分。是与认知心理学、人机工程学、多媒体技术、虚拟现实技术等密切相关的综合学科。在人机交互技术中,多模态交互设备是语音交互、体感交互、及触控交互等多种交互模式并行的交互设备。基于多模态交互设备的人机交互:通过交互设备中的多种跟踪模块(人脸、手势、姿态、语音、及韵律)采集用户信息,并理解、处理、及管理后形成虚拟用户表达模块,与计算机进行交互对话,能够极大提升用户的交互体验。
随着语音技术的发展,很多智能设备(例如手机、陪伴型机器人、车载设备、智能音箱以及智能语音助手等等)都可以通过语音与用户进行交互。智能设备的语音交互系统通过对用户的语音进行识别,完成用户的指令。在上述智能设备中,通常用麦克风来拾取环境中的音频信号,其中音频信号是环境的混合信号,除了智能设备希望拾取的来自某一用户的语音信号之外,还有其他信号,如环境噪声,其他人的说话声等。
在现有的实现中,为了从混合信号中提取来自某一用户的语音信号,可以采取盲分离方法,其本质上是基于统计的方法来分离音源,因此受限于实际的建模方法,在鲁棒性上挑战非常大。
发明内容
第一方面,本申请提供了一种语音信号处理方法,所述方法包括:
获取传感器采集的用户语音信号;
需要说明的是,用户语音信号,不应将语音信号仅理解为用户说出的话,而是应理解为语音信号中包括用户的发出的语音。语音信号包括环境噪声可以理解为,在环境中存在正在说话的用户以及其他环境噪声(例如在说话的其他人等),此时,采集的语音信号包括相互交织在一起的用户说话声音以及环境噪声,其中语音信号和环境噪声之间的关系不应理解为简单的叠加。即,不应理解为环境噪声在语音信号中是是独立的信号存在。
获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位;
需要说明的是,用户发出语音时对应的振动信号可以是基于视频提取得到的。
根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息。
本申请实施例提供了一种语音信号处理方法,包括:获取传感器采集的用户的语音信号,所述语音信号包括环境噪声;获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下, 基于发声行为进行相应振动的部位;以及根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息。通过上述方式,将振动信号作为语音识别的依据,由于振动信号没有包含复杂的声学传输时混入的外界非用户的语音,受其他环境噪声的影响很小(例如混响影响),因此可以相对较好的抑制住这部分噪声干扰,可以实现更好的语音识别效果。
在一种可选的实现中,所述振动信号用于表示与所述用户发出所述语音产生的振动相对应的振动特征。
在一种可选的实现中,所述身体部位包括如下的至少一种:颅顶、面部、喉部或颈部。
在一种可选的实现中,所述获取所述用户发出所述语音时对应的振动信号,包括:获取包括所述用户的视频帧;根据所述视频帧,提取所述用户发出所述语音时对应的振动信号。
在一种可选的实现中,所述视频帧为通过动态视觉传感器和/或高速摄像头采集得到的。
在一种可选的实现中,所述根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息,包括:根据所述振动信号,获取对应的目标音频信号;基于滤波,从所述传感器采集的用户语音信号中滤除所述目标音频信号,得到待滤除信号;从所述传感器采集的用户语音信号中滤除所述待滤除信号,得到所述目标语音信息。
具体的,可以根据所述振动信号,恢复出对应的目标音频信号,并基于滤波,从所述音频信号中滤除所述目标音频信号,得到噪声信号,经滤波后,滤波后信号z’(n)中已经不包含有用信号x’(n),基本上是除了用户的目标音频信号s(n)的外界噪声;可选的,如果是多个摄像头(DVS,高速摄像等)拾取某一个人的振动,则将从这些振动恢复的目标音频信号x1’(n)、x2’(n)、x3’(n)、x4’(n),按照上述的自适应滤波方法,依次从混合音频信号z(n)中滤除,即得到了去除各种x1’(n)、x2’(n)、x3’(n)、x4’(n)音频成分的混合音频信号z’(n)。
在一种可选的实现中,所述方法还包括:基于所述目标语音信息,获取所述用户的语音信号对应的指令信息,所述指令信息指示所述用户的语音信号中包含的语义意图。其中,指令信息可以用来触发实现所述用户的语音信号中包含的语义意图相应的功能,例如,打开某个应用程序,进行语音通话等等。
在一种可选的实现中,所述根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息,包括:
基于所述振动信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息;或,
根据所述振动信号,获取对应的目标音频信号;基于所述目标音频信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息。
在一种可选的实现中,所述方法还包括:
获取所述用户发出所述语音时对应的所述用户的脑波信号;
相应的,所述根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息,包括:
根据所述振动信号、所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息。
在一种可选的实现中,所述方法还包括:
根据所述脑波信号,获取所述用户发出语音时声道咬合部位的运动信号;相应的,所述根据所述振动信号、所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息,包括:
根据所述振动信号、所述运动信号和所述传感器采集的用户语音信号,获得目标语音信息。
在一种可选的实现中,所述根据所述振动信号、所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息,包括:
基于所述振动信号、所述脑波信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息;或,
根据所述振动信号,获取对应的第一目标音频信号;
根据所述脑波信号,获取对应的第二目标音频信号;基于所述第一目标音频信号、所述第二目标音频信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息。
在一种可选的实现中,所述目标语音信息包括表示所述用户的语音信号的声纹特征。
第二方面,本申请提供了一种语音信号处理方法,所述方法包括:
获取传感器采集的用户的语音信号;
获取所述用户发出所述语音时对应的所述用户的脑波信号;以及
根据所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息。
在一种可选的实现中,所述方法还包括:
根据所述脑波信号,获取所述用户在发声时声道咬合部位的运动信号;相应的,所述根据所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息,包括:
根据所述运动信号和所述传感器采集的用户语音信号,获得所述目标语音信息。
在一种可选的实现中,所述根据所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息,包括:
根据所述脑波信号,获取对应的目标音频信号;
基于滤波,从所述传感器采集的用户语音信号中滤除所述目标音频信号,得到待滤除信号;
从所述传感器采集的用户语音信号中滤除所述待滤除信号,得到所述目标语音信息。
在一种可选的实现中,所述方法还包括:
基于所述目标语音信息,获取所述用户的语音信号对应的指令信息,所述指令信息指示所述用户的语音信号中包含的语义意图。
在一种可选的实现中,所述根据所述脑波信号和所述传感器采集的用户语音信号,获得 目标语音信息,包括:
基于所述脑波信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息;或,
根据所述脑波信号,获取对应的目标音频信号;基于所述目标音频信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息。
在一种可选的实现中,所述目标语音信息包括表示所述用户的语音信号的声纹特征。
第三方面,本申请提供了一种语音信号处理方法,所述方法包括:
获取传感器采集的用户的语音信号;
获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位;以及
基于所述传感器采集的用户语音信号以及所述振动信号,进行声纹识别。
在一种可选的实现中,所述振动信号用于表示与发出语音产生的振动相对应的振动特征。
在一种可选的实现中,所述基于所述传感器采集的用户语音信号以及所述振动信号,进行声纹识别,包括:
根据所述传感器采集的用户语音信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第一置信度;
根据所述振动信号进行声纹识别,得到所述传感器采集的用户语音信号属于目标用户的第二置信度;
根据所述第一置信度和所述第二置信度,得到声纹识别结果。
在一种可选的实现中,所述方法还包括:
获取所述用户发出所述语音时对应的所述用户的脑波信号;
相应的,所述基于所述传感器采集的用户语音信号以及所述振动信号,进行声纹识别,包括:
基于所述传感器采集的用户语音信号、所述振动信号以及所述脑波信号,进行声纹识别。
在一种可选的实现中,所述基于所述传感器采集的用户语音信号、所述振动信号以及所述脑波信号,进行声纹识别,包括:
根据所述传感器采集的用户语音信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第一置信度;
根据所述振动信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第二置信度;
根据所述脑波信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第三 置信度;
根据所述第一置信度、所述第二置信度和所述第三置信度,得到声纹识别结果。
第四方面,本申请提供了一种语音信号处理装置,所述装置包括:
环境语音获取模块,用于获取传感器采集的用户语音信号;
振动信号获取模块,用于获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位;以及
语音信息获取模块,用于根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息。
在一种可选的实现中,所述振动信号用于表示与所述用户发出所述语音产生的振动相对应的振动特征。
在一种可选的实现中,所述身体部位包括如下的至少一种:颅顶、面部、喉部或颈部。
在一种可选的实现中,所述振动信号获取模块,用于获取包括所述用户的视频帧;根据所述视频帧,提取所述用户发出语音时对应的振动信号。
在一种可选的实现中,所述视频帧为通过动态视觉传感器和/或高速摄像头采集得到的。
在一种可选的实现中,所述语音信息获取模块,用于根据所述振动信号,获取对应的目标音频信号;基于滤波,从所述传感器采集的用户语音信号中滤除所述目标音频信号,得到噪声待滤除信号;从所述传感器采集的用户语音信号中滤除所述待滤除噪声信号,得到所述目标语音信息。
在一种可选的实现中,所述装置还包括:
指令信息获取模块,用于基于所述目标语音信息,获取所述用户的语音信号对应的指令信息,所述指令信息指示所述用户的语音信号中包含的语义意图。
在一种可选的实现中,所述语音信息获取模块,用于基于所述振动信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息;或,根据所述振动信号,获取对应的目标音频信号;基于所述目标音频信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息。
在一种可选的实现中,所述装置还包括:
脑波信号获取模块,用于获取所述用户发出所述语音时对应的所述用户的脑波信号;相应的,所述语音信息获取模块,用于根据所述振动信号、所述脑波信号和所述传感器采集的 用户语音信号,获得目标语音信息。
在一种可选的实现中,所述装置还包括:
运动信号获取模块,用于根据所述脑波信号,获取所述用户发出语音时声道咬合部位的运动信号;相应的,所述语音信息获取模块,用于根据所述振动信号、所述运动信号和所述传感器采集的用户语音信号,获得目标语音信息。
在一种可选的实现中,所述语音信息获取模块,用于基于所述振动信号、所述脑波信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息;或,
根据所述振动信号,获取对应的第一目标音频信号;
根据所述脑波信号,获取对应的第二目标音频信号;基于所述第一目标音频信号、所述第二目标音频信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息。
在一种可选的实现中,所述目标语音信息包括表示所述用户的语音信号的声纹特征。
第五方面,本申请提供了一种语音信号处理装置,所述装置包括:
环境语音获取模块,用于获取传感器采集的用户的语音信号;
脑波信号获取模块,用于获取所述用户发出所述语音时对应的所述用户的脑波信号;以及
语音信息获取模块,用于根据所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息。
在一种可选的实现中,所述装置还包括:
运动信号获取模块,用于根据所述脑波信号,获取所述用户在发声时声道咬合部位的运动信号;相应的,所述语音信息获取模块,用于根据所述运动信号和所述传感器采集的用户语音信号,获得所述目标语音信息。
在一种可选的实现中,所述语音信息获取模块,用于根据所述脑波信号,获取对应的目标音频信号;
基于滤波,从所述传感器采集的用户语音信号中滤除所述目标音频信号,得到待滤除信号;
从所述传感器采集的用户语音信号中滤除所述待滤除信号,得到所述目标语音信息。
在一种可选的实现中,所述装置还包括:
指令信息获取模块,用于基于所述目标语音信息,获取所述用户的语音信号对应的指令信息,所述指令信息指示所述用户的语音信号中包含的语义意图。
在一种可选的实现中,所述语音信息获取模块,用于基于所述脑波信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息;或,
根据所述脑波信号,获取对应的目标音频信号;基于所述目标音频信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息。
在一种可选的实现中,所述目标语音信息包括表示所述用户的语音信号的声纹特征。
第六方面,本申请提供了一种语音信号处理装置,所述装置包括:
环境语音获取模块,用于获取传感器采集的用户语音信号;
振动信号获取模块,用于获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位;以及
声纹识别模块,用于基于所述传感器采集的用户语音信号以及所述振动信号,进行声纹识别。
在一种可选的实现中,所述振动信号用于表示与发出语音产生的振动相对应的振动特征。
在一种可选的实现中,所述声纹识别模块,用于根据所述传感器采集的用户语音信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第一置信度;
根据所述振动信号进行声纹识别,得到所述传感器采集的用户语音信号属于目标用户的第二置信度;
根据所述第一置信度和所述第二置信度,得到声纹识别结果。
在一种可选的实现中,所述装置还包括:
脑波信号获取模块,用于获取所述用户发出所述语音时对应的所述用户的脑波信号;
相应的,所述声纹识别模块,用于基于所述传感器采集的用户语音信号、所述振动信号以及所述脑波信号,进行声纹识别。
在一种可选的实现中,所述所述声纹识别模块,用于根据所述传感器采集的用户语音信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第一置信度;
根据所述振动信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第二置信度;
根据所述脑波信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第三置信度;
根据所述第一置信度、所述第二置信度和所述第三置信度,得到声纹识别结果。
第七方面,本申请提供了一种自动驾驶车辆,可以包括处理器,处理器和存储器耦合,存储器存储有程序指令,当存储器存储的程序指令被处理器执行时实现上述第一方面所述的方法。对于处理器执行第一方面的各个可能实现方式中自动驾驶车辆执行的步骤,具体均可以参阅第一方面,此处不再赘述。
第八方面,本申请提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,使得计算机执行上述第一方面所述的方法。
第九方面,本申请提供了一种电路系统,所述电路系统包括处理电路,所述处理电路配置为执行上述第一方面所述的方法。
第十方面,本申请提供了一种计算机程序,当其在计算机上运行时,使得计算机执行上述第一方面所述的方法。
第十一方面,本申请提供了一种芯片系统,该芯片系统包括处理器,用于支持服务器或门限值获取装置实现上述方面中所涉及的功能,例如,发送或处理上述方法中所涉及的数据和/或信息。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保存服务器或通信设备必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包括芯片和其他分立器件。
本申请实施例提供了一种语音信号处理方法,包括:获取传感器采集的用户的语音信号,所述语音信号包括环境噪声;获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位;以及根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息。通过上述方式,将振动信号作为语音识别的依据,由于振动信号没有包含复杂的声学传输时混入的外界非用户的语音,受其他环境噪声的影响很小(例如混响影响),因此可以相对较好的抑制住这部分噪声干扰,可以实现更好的语音识别效果。
图1a为一种智能设备示意;
图1b为本申请实施例提供的一种手机的图形用户界面示意;
图2为本申请实施例的一种应用场景示意;
图3和图4为本申请实施例提供的另一种应用场景示意;
图5为电子设备的结构示意图;
图6为本申请实施例的电子设备的软件结构框图示意;
图7为本申请实施例中提供的一种语音信号处理方法的流程示意;
图8为一种系统架构示意;
图9为一种RNN的结构示意图;
图10为一种RNN的结构示意图;
图11为一种RNN的结构示意图;
图12为一种RNN的结构示意图;
图13为一种RNN的结构示意图;
图14为本申请实施例提供的一种语音信号处理方法的流程示意;
图15为本申请实施例提供的一种语音信号处理方法的流程示意;
图16为本申请提供了一种语音信号处理装置的结构示意;
图17为本申请提供了一种语音信号处理装置的结构示意;
图18为本申请提供了一种语音信号处理装置的结构示意;
图19为本申请实施例提供的执行设备的一种结构示意图;
图20是本申请实施例提供的训练设备一种结构示意图;
图21为本申请实施例提供的芯片的一种结构示意图。
下面将结合本发明实施例中的附图,对本发明实施例进行描述。
本申请的说明书和权利要求书及所述附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
在本说明书中使用的术语“部件”、“模块”、“系统”等用于表示计算机相关的实体、硬件、固件、硬件和软件的组合、软件、或执行中的软件。例如,部件可以是但不限于,在处理器上运行的进程、处理器、对象、可执行文件、执行线程、程序和/或计算机。通过图示,在计算设备上运行的应用和计算设备都可以是部件。一个或多个部件可驻留在进程和/或执行线程中,部件可位于一个计算机上和/或分布在2个或更多个计算机之间。此外,这些部件可从在上面存储有各种数据结构的各种计算机可读介质执行。部件可例如根据具有一个或多个数据分组(例如来自与本地系统、分布式系统和/或网络间的另一部件交互的二个部件的数据,例如通过信号与其它系统交互的互联网)的信号通过本地和/或远程进程来通信。
下面将结合附图,对本申请中的技术方案进行描述。
本申请实施例提供的语音信号处理方法能够应用在语音识别以及声纹识别相关的人机交互等场景中。具体而言,本申请实施例的语音信号处理方法能够应用在语音识别和声纹识别中,下面分别对语音识别场景和声纹识别场景进行简单的介绍。
场景一、基于语音识别的人机交互:
语音识别(automatic speech recognition,ASR),也被称为自动语音识别,在一种实现中,其目标是将人类的语音中的词汇内容转换为计算机可读的输入,例如按键、二进制编码或者字符序列。
在一种场景中,本申请可以应用在具有语音交互功能的装置上;本实施例中“具有语音交互功能”可以为装置上可以实现的一种功能,其可以识别用户的语音,并基于语音触发相应的功能,进而实现与用户之间的语音交互。其中具有语音交互功能的装置可以是例如音箱、闹钟、手表、机器人等智能设备中,或者是车载设备,或者是手机、平板、AR增强现实设备或VR虚拟现实设备等便携式设备中。
在申请的一种实施例中,具有语音交互功能的装置上可以包括音频传感器和视频传感器,其中音频传感器可以采集环境中的音频信号,视频传感器可以采集一定区域内的视频;音频信号可以包括一个或多个用户发声时发出的音频信号以及环境中其他的噪声信号,视频可以包括上述发声的一个或多个用户,进而,可以基于音频信号和视频提取出一个或多个用户发声时发出的音频信号。进而,可以基于提取得到的音频信号实现和用户的语音交互功能。关于,如何基于音频信号和视频提取出一个或多个用户发声时发出的音频信号将在后续的实施例中详细描述,这里不再赘述。
在本申请的一种实施例中,上述音频传感器和视频传感器可以不作为具有语音交互功能的装置本身具有的组件,而是作为独立存在的组件或者是集成在其他装置上的组件;在这样的情况下,具有语音交互功能的装置可以仅仅获取到音频传感器采集的环境中的音频信号或者仅仅获取到视频传感器采集的一定区域内的视频,进而,可以基于音频信号和视频提取出一个或多个用户发声时发出的音频信号。进而,可以基于提取得到的音频信号实现和用户的语音交互功能。
进一步的,音频传感器作为具有语音交互功能的装置本身具有的组件,而视频传感器不作为具有语音交互功能的装置本身具有的组件;或,音频传感器不作为具有语音交互功能的装置本身具有的组件,且视频传感器不作为具有语音交互功能的装置本身具有的组件;或,音频传感器不作为具有语音交互功能的装置本身具有的组件,而视频传感器作为具有语音交互功能的装置本身具有的组件。
示例性的,具有语音交互功能的装置可以是如图1a示出的智能设备,如图1a所示,智能设备识别到语音“piupiupiu”,则不执行任何动作。比如,智能设备识别到语音“打开空调”,则执行语音“打开空调”相应的动作:打开空调。比如,智能设备识别到用户吹口哨发出的声音,即口哨声,则执行口哨声相应的动作:开灯。比如,智能设备识别到语音“开灯”,则不执行任何动作。比如,智能设备识别到悄悄话模式的语音“睡觉”,则执行悄悄话模式的语音“睡觉”相应的动作:切换为睡眠模式。其中,语音“piupiupiu”、口哨声、悄悄话模式的语音“睡觉”等为特殊语音。语音“打开空调”、“开灯”等为正常语音。其中,正常语音是指能够识别出语义,并且发声时振动声带的一类语音。特殊语音是指区别于正常语音的一类语音。比如,特殊语音是指发声时不振动声带的一类语音,即清音。再比如,特殊语音是指没有语义的语音。
示例性的,具有语音交互功能的装置可以是具有显示功能的装置,例如可以是手机,参见图1b,图1b为本申请实施例提供的一种手机的图形用户界面(graphical user interface,GUI)示意,如图1b中示出的那样,该GUI为手机与用户交互时的显示界面。当手机检测到用户的语音唤醒词“小艺小艺”后,手机可以在桌面显示语音助手的文字显示窗口101,手机可以通过窗口101提醒用户“嗨,我在听”。应理解,手机在通过窗口101或102显示文字提醒用户的同时,也可以向用户语音播报“嗨,我在听”。
在一些场景中,具有语音交互功能的装置可以为由多个装置组成的系统。
图2所示为本申请实施例的一种应用场景。图2中的应用场景还可以被称作智能家居场景。图2中的应用场景可以包括至少一个电子设备(例如电子设备210、电子设备220、电子设备230、电子设备240、电子设备250)、电子设备260和电子设备。图2中的电子设备210 可以是电视。电子设备220可以是音箱。电子设备230可以是监控设备。电子设备240可以是手表。电子设备250可以是智能麦克风。电子设备260可以是手机或平板电脑。电子设备可以是无线通信设备,例如路由器、网关设备等。图2中的电子设备210、电子设备220、电子设备230、电子设备240、电子设备250和电子设备260可以通过无线通信协议与电子设备进行上下行传输。例如,电子设备可以向电子设备210、电子设备220、电子设备230、电子设备240、电子设备250和电子设备260发送的信息,也可以接收电子设备210、电子设备220、电子设备230、电子设备240、电子设备250和电子设备260发送的信息。
需要说明的是,本申请实施例可以应用于包括一个或多个无线通信设备以及多个电子设备的应用场景中,本申请对此不进行限定。
本申请实施例中,具有语音交互功能的装置可以为智能家居系统中任意一个电子设备,例如可以是电视、音箱、手表、智能麦克风、手机或平板电脑等等。智能家居系统中任意一个电子设备可以包括音频传感器或视频传感器,在获取到环境中的音频信息或视频后,可以基于无线通信设备将音频信息或视频传输至具有语音交互功能的装置,或者传输至云侧的服务器(图2中未示出),具有语音交互功能的装置可以基于音频信息和视频提取出一个或多个用户发声时发出的音频信号。进而,可以基于提取得到的音频信号实现和用户的语音交互功能;或者云侧的服务器可以基于音频信息和视频提取出一个或多个用户发声时发出的音频信号,并将提取得到的音频信号传输至具有语音交互功能的装置,进而具有语音交互功能的装置可以基于提取得到的音频信号实现和用户的语音交互功能。
在一个示例中,应用场景包括电子设备210、电子设备260和电子设备。电子设备210为电视,电子设备260为手机,电子设备为路由器。其中,路由器用于实现电视和手机之间的无线通信。其中;具有语音交互功能的装置可以为手机,电视上可以设置有视频传感器,手机上可以设置有音频传感器,电视获取到视频后,可以将视频传输至手机,手机可以基于音频信息和视频提取出一个或多个用户发声时发出的音频信号。进而,可以基于提取得到的音频信号实现和用户的语音交互功能。
在一个示例中,应用场景包括电子设备220、电子设备260和电子设备。电子设备220为音箱,电子设备260为手机,电子设备为路由器。其中,路由器用于实现音箱和手机之间的无线通信。其中;具有语音交互功能的装置可以为手机,手机上可以设置有视频传感器,音箱上可以设置有音频传感器,音箱获取到音频信息后,可以将音频信息传输至手机,手机可以基于音频信息和视频提取出一个或多个用户发声时发出的音频信号。进而,可以基于提取得到的音频信号实现和用户的语音交互功能。
在一个示例中,应用场景包括电子设备230、电子设备260和电子设备。电子设备230为监控设备,电子设备260为手机,电子设备为路由器。其中,路由器用于实现监控设备和手机之间的无线通信。具有语音交互功能的装置可以为手机,监控设备上可以设置有视频传感器,手机上可以设置有音频传感器,监控设备获取到视频后,可以将视频传输至手机,手机可以基于音频信息和视频提取出一个或多个用户发声时发出的音频信号。进而,可以基于提取得到的音频信号实现和用户的语音交互功能。
在一个示例中,应用场景包括电子设备250、电子设备260和电子设备。电子设备250为麦克风,电子设备260为手机,电子设备为路由器。其中,路由器用于实现麦克风和手机 之间的无线通信。具有语音交互功能的装置可以为麦克风,手机上可以设置有视频传感器,麦克风上可以设置有音频传感器,手机获取到视频后,可以将视频传输至麦克风,麦克风可以基于音频信息和视频提取出一个或多个用户发声时发出的音频信号。进而,可以基于提取得到的音频信号实现和用户的语音交互功能。
需要说明的是,以上产品形态的描述仅为一种示意,在实际应用中,可以灵活设置视频传感器和音频传感器的部署形态。
图3、图4是本申请实施例提供的另一种应用场景。图3、图4中的应用场景还可以被称作智能驾驶场景。图3、图4中的应用场景可以包括电子设备,其包括装置310、装置320、装置330、装置340、装置350。电子设备可以是驾驶系统(也可以称之为车载系统)。装置310可以是显示屏。装置320可以是麦克风。装置330可以是音箱。装置340可以是摄像头。装置350可以是座椅调节装置。电子设备360可以是手机或平板电脑。电子设备可以接收装置310、装置320、装置330、装置340、装置350发送的数据。并且,电子设备和电子设备360可以通过无线通信协议进行通信。例如,电子设备可以向电子设备360发送信号,也可以接收电子设备360发送的信号。
需要说明的是,本申请实施例可以应用于包括驾驶系统以及多个电子设备的应用场景中,本申请对此不进行限定。
在一个示例中,应用场景包括装置320、装置330、电子设备360和电子设备(驾驶系统)。装置320为麦克风,装置340为摄像头,电子设备360为平板电脑,电子设备为驾驶系统。其中,驾驶系统用于与手机进行无线通信,还用于驱动麦克风采集音频信号,并驱动摄像头采集视频。驾驶系统可以驱动麦克风采集音频信号,并将麦克风采集到的音频信号发送至平板电脑,驾驶系统可以驱动摄像头采集视频,并将摄像头采集到的视频发送至平板电脑;平板电脑可以基于音频信息和视频提取出一个或多个用户发声时发出的音频信号。进而,可以基于提取得到的音频信号实现和用户的语音交互功能。
需要说明的是,在智能驾驶场景中,视频传感器可以独立部署,例如设置在车内的预设位置,以使得其可以采集到预设区域内的视频,例如,视频传感器可以设置在挡风玻璃或者是座椅上,进而可以采集某一个座位上用户的视频。
在本申请的一个实施例中,具有语音识别功能的装置可以是头戴式便携设备,例如可以是AR/VR设备,其中,头戴式便携设备可以设置有音频传感器和脑波采集设备,音频传感器可以采集音频信号,脑波采集设备可以采集脑波信号,进而头戴式便携设备可以基于音频信号和脑波信号提取出一个或多个用户发声时发出的音频信号。进而,可以基于提取得到的音频信号实现和用户的语音交互功能。
需要说明的是,上述音频传感器和脑波采集设备可以不作为具有语音交互功能的装置本身具有的组件,而是作为独立存在的组件或者是集成在其他装置上的组件;在这样的情况下,具有语音交互功能的装置可以仅仅获取到音频传感器采集的环境中的音频信号或者仅仅获取到脑波采集设备采集的一定区域内的脑波信号,进而,可以基于音频信号和脑波信号提取出一个或多个用户发声时发出的音频信号。进而,可以基于提取得到的音频信号实现和用户的语音交互功能。
进一步的,音频传感器作为具有语音交互功能的装置本身具有的组件,而脑波采集设备 不作为具有语音交互功能的装置本身具有的组件;或,音频传感器不作为具有语音交互功能的装置本身具有的组件,且脑波采集设备不作为具有语音交互功能的装置本身具有的组件;或,音频传感器不作为具有语音交互功能的装置本身具有的组件,而脑波采集设备作为具有语音交互功能的装置本身具有的组件。
可以理解的是,图1a至图4中的应用场景的只是本发明实施例中的几种示例性的实施方式,本发明实施例中的应用场景包括但不仅限于以上应用场景。
二、声纹识别场景:
声纹(voiceprint),是用电声学仪器显示的携带言语信息的声波频谱,是由波长、频率以及强度等百余种特征维度组成的生物特征。声纹识别是通过对一种或多种语音信号的特征分析来达到对未知声音辨别的目的,简单的说就是辨别某一句话是否是某一个人说的技术。通过声纹可以确定出说话人的身份,从而进行有针对性的回答。
此外,本申请还可以应用于语音去噪的场景中,本申请中的语音信号处理方法可以用在需要进行语音去噪的音频输入装置中,例如耳机、麦克风(独立麦克风或者是终端设备上的麦克风等),用户可以向音频输入说话,通过本申请中的语音信号处理方法,音频输入装置可以从包括环境噪音的音频输入中提取出用户发出的语音信号。
应当理解,此处举例仅为方便对本申请实施例的应用场景进行理解,不对本申请实施例的应用场景进行穷举,下面结合附图,对本申请的实施例进行描述。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
以下介绍了本申请实施例提供的电子设备、用于这样的电子设备的用户界面、和用于使用这样的电子设备的实施例。在一些实施例中,电子设备可以是还包含其它功能诸如个人数字助理和/或音乐播放器功能的便携式电子设备,诸如手机、平板电脑、具备无线通讯功能的可穿戴电子设备(如智能手表)等。便携式电子设备的示例性实施例包括但不限于搭载或者其它操作系统的便携式电子设备。上述便携式电子设备也可以是其它便携式电子设备,诸如膝上型计算机(Laptop)等。还应当理解的是,在其他一些实施例中,上述电子设备也可以不是便携式电子设备,而是诸如台式计算机、电视、音箱、监控设备、摄像头、显示屏、麦克风、座椅调节装置、指纹识别装置、车载驾驶系统等。
示例性的,图5示出了电子设备100的结构示意图。电子设备100可以包括处理器110,外部存储器接口120,内部存储器121,麦克风170C,传感器模块180,按键190,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口等。
可以理解的是,本申请实施例示意的结构并不构成对电子设备100的具体限定。在本申请另一些实施例中,电子设备100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,视频编解码器,数字信号处理器 (digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的部件,也可以集成在一个或多个处理器中。在一些实施例中,电子设备100也可以包括一个或多个处理器110。其中,控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。在其他一些实施例中,处理器110中还可以设置存储器,用于存储指令和数据。示例性地,处理器110中的存储器可以为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从所述存储器中直接调用。这样就避免了重复存取,减少了处理器110的等待时间,因而提高了电子设备100处理数据或执行指令的效率。
在一些实施例中,处理器110可以包括一个或多个接口。接口可以包括集成电路间(inter-integrated circuit,I2C)接口,集成电路间音频(inter-integrated circuit sound,I2S)接口,脉冲编码调制(pulse code modulation,PCM)接口,通用异步收发传输器(universal asynchronous receiver/transmitter,UART)接口,移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口,SIM卡接口,和/或USB接口等。其中,USB接口是符合USB标准规范的接口,具体可以是MiniUSB接口,MicroUSB接口,USBTypeC接口等。USB接口可以用于连接充电器为电子设备100充电,也可以用于电子设备100与外围设备之间传输数据。该USB接口也可以用于连接耳机,通过耳机播放音频。
可以理解的是,本申请实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对电子设备100的结构限定。在本申请另一些实施例中,电子设备100也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
电子设备100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像,视频等。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,电子设备100可以包括1个或多个显示屏194。
电子设备100的显示屏194可以是一种柔性屏,目前,柔性屏以其独特的特性和巨大的潜力而备受关注。柔性屏相对于传统屏幕而言,具有柔韧性强和可弯曲的特点,可以给用户提供基于可弯折特性的新交互方式,可以满足用户对于电子设备的更多需求。对于配置有可折叠显示屏的电子设备而言,电子设备上的可折叠显示屏可以随时在折叠形态下的小屏和展开形态下大屏之间切换。因此,用户在配置有可折叠显示屏的电子设备上使用分屏功能,也越来越频繁。
电子设备100可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。
ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,电子设备100可以包括1个或多个摄像头193。
本申请实施例中的摄像头193可以是高速摄像头或动态视觉传感器(dynamic vision sensor,DVS)。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当电子设备100在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。
视频编解码器用于对数字视频压缩或解压缩。电子设备100可以支持一种或多种视频编解码器。这样,电子设备100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现电子设备100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
外部存储器接口120可以用于连接外部存储卡,例如MicroSD卡,实现扩展电子设备100的存储能力。外部存储卡通过外部存储器接口120与处理器110通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。
内部存储器121可以用于存储一个或多个计算机程序,该一个或多个计算机程序包括指令。处理器110可以通过运行存储在内部存储器121的上述指令,从而使得电子设备100执行本申请一些实施例中所提供的灭屏显示的方法,以及各种应用以及数据处理等。内部存储器121可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统;该存储程序区还可以存储一个或多个应用(比如图库、联系人等)等。存储数据区可存储电子设备100使用过程中所创建的数据(比如照片,联系人等)等。此外,内部存储器121可以包括高速随机存取存储器,还可以包括非易失性存储器,例如一个或多个磁盘存储部件,闪存部件,通用闪存存储器(universal flash storage,UFS)等。在一些实施例中,处理器110可以通过运行存储在内部存储器121的指令,和/或存储在设置于处理器110中的存储器的指令,来使得电子设备100执行本申请实施例中所提供的灭屏显示的方法,以及其他应用及数据处理。电子设备100可以通过音频模块,扬声器,受话器,麦克风,耳机接口,以及应用处理器等实现音频功能。例如音乐播放,录音等。
传感器模块180可以包括加速度传感器180E、指纹传感器180H,环境光传感器180L等。
加速度传感器180E可检测电子设备100在各个方向上(一般为三轴)加速度的大小。当电 子设备100静止时可检测出重力的大小及方向。还可以用于识别电子设备姿态,应用于横竖屏切换,计步器等应用。
环境光传感器180L用于感知环境光亮度。电子设备100可以根据感知的环境光亮度自适应调节显示屏194亮度。环境光传感器180L也可用于拍照时自动调节白平衡。环境光传感器180L还可以与接近光传感器180G配合,检测电子设备100是否在口袋里,以防误触。
指纹传感器180H用于采集指纹。电子设备100可以利用采集的指纹特性实现指纹解锁,访问应用锁,指纹拍照,指纹接听来电等。
脑波传感器195可以采集脑波信号。
按键190包括开机键,音量键等。按键190可以是机械按键。也可以是触摸式按键。电子设备100可以接收按键输入,产生与电子设备100的用户设置以及功能控制有关的键信号输入。
图6是本申请实施例的电子设备100的软件结构框图。分层架构将软件分成若干个层,每一层都有清晰的角色和分工。层与层之间通过软件接口通信。在一些实施例中,将Android系统分为四层,从上至下分别为应用程序层,应用程序框架层,安卓运行时(Android runtime)和系统库,以及内核层。应用程序层可以包括一系列应用程序包。
如图6所示,应用程序包可以包括相机,图库,日历,通话,地图,导航,WLAN,蓝牙,音乐,视频,短信息等应用程序。
应用程序框架层为应用程序层的应用程序提供应用编程接口(application programming interface,API)和编程框架。应用程序框架层包括一些预先定义的函数。
如图6所示,应用程序框架层可以包括窗口管理器,内容提供器,视图系统,电话管理器,资源管理器,通知管理器等。
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。
内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频,图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。
视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。例如,包括短信通知图标的显示界面,可以包括显示文字的视图以及显示图片的视图。
电话管理器用于提供电子设备100的通信功能。例如通话状态的管理(包括接通,挂断等)。
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成,消息提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话窗口形式出现在屏幕上的通知。例如在状态栏提示文本信息,发出提示音,电子设备振动,指示灯闪烁等。
系统库可以包括多个功能模块。例如:表面管理器(surface manager),媒体库(media libraries),三维图形处理库(例如:OpenGL ES),2D图形引擎(例如:SGL)等。
表面管理器用于对显示子系统进行管理,并且为多个应用程序提供了2D和3D图层的融合。
媒体库支持多种常用的音频,视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式,例如:MPEG4,H.264,MP3,AAC,AMR,JPG,PNG等。
三维图形处理库用于实现三维图形绘图,图像渲染,合成,和图层处理等。
2D图形引擎是2D绘图的绘图引擎。
内核层是硬件和软件之间的层。内核层至少包含显示驱动,摄像头驱动,音频驱动,传感器驱动。
为了便于理解,本申请以下实施例将以具有图5和图6所示结构的装置为例,结合附图对本申请实施例提供的元素按压的方法进行具体阐述。
参照图7,图7为本申请实施例中提供的一种语音信号处理方法的流程示意,如图7中示出的那样,本申请实施例中提供的语音信号处理方法,包括:
701、获取传感器采集的用户语音信号。
本申请实施例中,可以获取传感器从环境中采集的用户的语音信号,语音信号包括环境噪声;下文中的语音信号也可以表述为语音信号。
需要说明的是,用户语音信号,不应将语音信号仅理解为用户说出的话,而是应理解为语音信号中包括用户的发出的语音。
需要说明的是,语音信号包括环境噪声可以理解为,在环境中存在正在说话的用户以及其他环境噪声(例如在说话的其他人等),此时,采集的语音信号包括相互交织在一起的用户说话声音以及环境噪声,其中语音信号和环境噪声之间的关系不应理解为简单的叠加。即,不应理解为环境噪声在语音信号中是是独立的信号存在。
本申请实施例中,音频传感器(例如麦克风或者麦克风阵列)可以从环境中采集用户语音信号。用户语音信号是环境中的混合信号z(n),除了希望拾取的用户发出的语音信号s1(n),还有其他信号,如环境噪声n(n),其他人的说话声s2(n)等,即z(n)=s1(n)+s2(n)+n(n)。在需要进行语音交互和进行语音去噪的场景中,我们希望可以从音频传感器采集的环境中的语音信号中提取出用户发出的语音信号,即从混合信号z(n)中分离出用户发出的语音信号s1(n)。
需要说明的是,步骤701的执行主体可以是具有语音交互功能的装置或者是语音输入装置;以执行主体是具有语音交互功能的装置为例,在一种实现中,具有语音交互功能的装置上可以集成有音频传感器,进而音频传感器可以获取到包括用户的语音信号的音频信号;在一种实现中,音频传感器可以不集成在具有语音交互功能的装置上,例如,音频传感器可以集成在其他装置上,或者作为一个独立的装置(例如独立的麦克风),音频传感器可以将采集到的音频信号传输至具有语音交互功能的装置,则,具有语音交互功能的装置可以获取到音频信号。
可选的,音频传感器可以有针对性的拾取一定方向传过来的音频信号,比如针对用户的方向进行定向拾音,从而尽可能的消除一部分外界噪声(但是仍然有噪声)。定向采集需要麦克风阵列或者矢量麦克风,这里以麦克风阵列为例,可以采用波束形成的方法。可以采用一 个波束形成器来实现,其可以包括延时-求和波束形成与滤波求和波束形成两种,具体的,设麦克风阵列的输入信号为z
i(n),滤波器传递系数为w
i(n),则滤波-求和波束形成器系统输出为:
其中,M为麦克风数目。当滤波器系数仅为单一加权常数时,滤波-求和波束形成简化为延时-求和波束形成,即:
其中,τ
i表示通过估计而得到的时延补偿。通过控制τ
i的数值可以将阵列的波束指向任何方向,以拾取该方向的音频信号,如果不希望拾取某一方向的音频信号,则控制波束指向不包括该方向即可,基于拾取方向控制后所采集的音频信号为z(n)。
此外,语音输入装置的产品形态说明可以参照上述具有语音交互功能的装置的产品形态,这里不再赘述。
702、获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位。
本申请实施例中,可以获取到用户发出语音时对应的振动信号,其中所述振动信号用于表示所述用户在发出所述语音信号时身体部位的振动特征。
需要说明的是,步骤701和步骤702之间并没有严格的时序限定,步骤701可以在步骤702之前或者之后,或者同时执行,本申请并不限定。
本申请实施例中,用户发出语音时对应的的振动信号可以是基于视频提取得到的;
其中,从视频帧中提取振动信号的动作可以是具有语音交互功能的装置或语音输入装置执行的;以从视频帧中提取振动信号的动作可以是具有语音交互功能的装置为例:
在一种实现中,具有语音交互功能的装置上可以集成设置有视频传感器,该视频传感器可以采集到包括所述用户的视频帧,相应的,具有语音交互功能的装置可以根据所述视频帧,提取所述用户对应的振动信号。
在一种实现中,视频传感器可以与具有语音交互功能的装置独立设置,该视频传感器可以采集到包括所述用户的视频帧,并将视频帧发送至具有语音交互功能的装置,相应的,具有语音交互功能的装置可以根据所述视频帧,提取所述用户对应的振动信号;在一种实现中,视频传感器可以与具有语音交互功能的装置独立设置,该视频传感器可以采集到包括所述用户的视频帧,并将视频帧发送至具有语音交互功能的装置,相应的,具有语音交互功能的装置可以根据所述视频帧,提取所述用户对应的振动信号。
其中,从视频帧中提取振动信号的动作可以是云侧的服务器或者是端侧的其他装置执行的;
在一种实现中,具有语音交互功能的装置上可以集成设置有视频传感器,该视频传感器可以采集到包括所述用户的视频帧,并将视频帧发送至云侧的服务器或者其他端侧装置,相应的,云侧的服务器或者其他端侧装置可以根据所述视频帧,提取所述用户对应的振动信号,并将振动信号发送至具有语音交互功能的装置。
在一种实现中,视频传感器可以与具有语音交互功能的装置独立设置,该视频传感器可以采集到包括所述用户的视频帧,并将视频帧发送至云侧的服务器或者其他端侧装置,相应的,云侧的服务器或者其他端侧装置可以根据所述视频帧,提取所述用户对应的振动信号,并将振动信号发送至具有语音交互功能的装置。
需要说明的是,以上几种关于从视频帧中提取振动信号的动作的执行主体说明仅为一些实例性的举例,本申请并不限定。
在一种实现中,视频帧为通过动态视觉传感器和/或高速摄像头采集得到的。以视频帧为通过动态视觉传感器采集得到得为例,本申请实施例中,动态视觉传感器可以捕捉到包括用户说话时颅顶、面部、喉部或颈部的视频帧。
在一种实现中,采集视频帧的动态视觉传感器的数量可以为一个或多个;
其中,在采集视频帧的动态视觉传感器的数量为一个的情况下,动态视觉传感器可以采集到包括用户身体全貌或者局部身体部位的视频帧,其中,在动态视觉传感器采集到包括局部身体部位的实现中,动态视觉传感器可以只选择当所述用户处于发声状态下,基于发声行为进行相应振动的部位进行视频帧采集,身体部位可以是例如颅顶、面部、喉部或颈部。
在一种实现中,动态视觉传感器的视频采集方向可以为预先设定的,例如,在智能驾驶系统的应用场景中,可以在车内的预设位置设置动态视觉传感器,并将动态视觉传感器的视频采集方向设置为朝向用户的预设身体部位,以预设身体部位为面部为例,动态视觉传感器的视频采集方向可以朝向驾驶位的预设区域,该预设区域通常为当该驾驶位有人员坐下时,面部所在的区域。
在一种实现中,动态视觉传感器可以采集到包括用户身体全貌的视频帧。此时,动态视觉传感器的视频采集方向也可以为预先设定的,例如,在智能驾驶系统的应用场景中,可以在车内的预设位置设置动态视觉传感器,并将动态视觉传感器的视频采集方向设置为朝向驾驶位的方向。
在一种实现中,动态视觉传感器的数量为多个,每个动态视觉传感器可以预先设定其视频采集方向,以使得每个动态视觉传感器可以采集到包括一个身体部位的视频帧,其中,身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位。例如,在智能驾驶系统的应用场景中,动态视觉传感器可以部署在头靠的前后(拾取车内前后向人员的视频帧)、车框上(左右方向人员的视频帧)、挡风玻璃下面(前排人员的视频帧)。
本申请实施例中,不同身体部位的视频帧采集,可以使用相同的传感器,如都使用高速摄像头,或者都使用动态视觉传感器,或者这两种传感器混合使用,本申请并不限定。
在智能家居的应用场景中,动态视觉传感器可以部署在电视上、智慧大屏或者智能音箱等等,在智能手机的应用场景中,动态视觉传感器可以部署在手机上,例如基于手机的前置或者后置摄像头。
本申请实施例中,振动信号代表了人声的本源特征;可选的,振动信号可以是多个,如: 头部的振动信号x1(n)、喉部的振动信号x2(n)、面部的振动信号x3(n)、脖子的振动信号x4(n)等等。本申请实施例中,可以根据所述振动信号,恢复出对应的目标音频信号。
本申请实施例中,振动信号用于表示所述用户在发出所述语音信号时身体部位的振动特征,振动特征可以是直接视频获取的振动特征也可以是其他动作干扰滤除后的仅和发声振动相关的振动特征。
针对于高速摄像头捕捉到的视频帧,可以用不同方向的滤波器分解为不同尺度,不同方向的图像金字塔,具体可以对图像先用低通滤波器滤波得到低通残差图像,在低通残差图上不断下采样为不同尺度的图像。并对每一个尺度的图像,采用不同方向的带通滤波器滤波,得到对不同方向的响应图,对响应图求振幅和相位,并计算当前帧t的局部运动信息。将第一帧图像作为参考帧。基于金字塔的结果,可以计算当前帧与参考帧的分解结果在不同尺度、不同方向上不同像素位置上的相位差,来量化每个像素的局部运动大小,并基于每个像素的局部运动大小计算当前帧的全局运动信息。可以对局部运动信息加权平均之后得到全局运动信息。权值是对应尺度、方向和像素位置的振幅大小,对这个方向这个尺度上所有像素加权求和,得到不同尺度、方向的全局运动信息,对上述全局信息求和,可以得到这一图像帧的全局运动信息,基于上述步骤,每个图像帧都能计算得到一个运动大小值,基于连续的帧频,将每个帧对应的幅值作为音频采样值,即可得到初步恢复的音频信号,然后再进行高通滤波,即得到恢复的音频信号x’(n)。可选的,如果是多个振动信号,则基于以上方法,分别单独恢复各自对应的目标音频信号x1’(n)、x2’(n)、x3’(n)、x4’(n)。
针对于动态视觉传感器捕捉到的视频帧:由于动态视觉传感器的原理是每个像素独立对光强变化做出事件响应,通过比较当前光强与上一个事件产生时刻的光强,当两者的变化量(即差分值)超过阈值时,产生一个新的事件。每个事件包括了像素坐标,发放时间和光强极性,其中光强极性表征光强的变化趋势,通常采用+1或On表示光强增强,-1或Off表示光强减弱。动态视觉传感器由于没有曝光的概念,像素持续对光强进行监测和响应,因此其时间分辨率可以做到微秒级。同时,动态视觉传感器对于运动敏感,而对静态区域几乎不做出响应,可以利用动态视觉传感器捕捉物体的振动情况,从而实现声音的恢复。这样就到了基于某一个像素位置恢复的音频信号。对其进行高通滤波,去除低频非音频振动干扰,得到信号x’(n),可以表征音频信号。可以将多个像素,比如所有像素,这样恢复的音频信号加权求和,得到加权平均后的该动态视觉传感器恢复的音频信号x’(n)。如果是多个传感器,或者多个位置目标区域,则分别恢复,得到各自独立恢复的目标音频信号x1’(n)、x2’(n)、x3’(n)、x4’(n)。
703、根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息。
本申请实施例中,可以根据所述振动信号,恢复出对应的目标音频信号;基于滤波,从所述音频信号中滤除所述目标音频信号,得到待滤除信号;从所述语音信号中滤除所述待滤除信号,得到目标语音信息,其中,目标语音信息为去环境噪声处理后得到的语音信号。
具体的,可以根据所述振动信号,恢复出对应的目标音频信号,并基于滤波,从所述音频信号中滤除所述目标音频信号,得到待滤除信号,经滤波后,滤波后信号z’(n)中已经基本上不包含有用信号x’(n),基本上是除了用户的目标音频信号s(n)的外界噪声;可选的,如果 是多个摄像头(DVS,高速摄像等)拾取某一个人的振动,则将从这些振动恢复的目标音频信号x1’(n)、x2’(n)、x3’(n)、x4’(n),按照上述的自适应滤波方法,依次从混合音频信号z(n)中滤除,即得到了去除各种x1’(n)、x2’(n)、x3’(n)、x4’(n)音频成分的混合音频信号z’(n)。
本申请实施例中,可以从所述音频信号中滤除所述待滤除信号,得到所述用户的语音信号;在一种实现中,可以获取噪声谱(即认为z’(n)是除了目标语音信号s(n)之外的背景噪声):并将z’(n)变换到频域,如快速傅里叶变换(fastfourier transform,FFT)变换,得到噪声谱;将目标音频信号z(n)变换到频域,如FFT变换,得到频率谱,之后从噪声谱中减去频率谱,得到增强后语音的信号谱,最后,对信号谱做快速傅里叶逆变换(inverse fast fourier transform,IFFT),得到所述用户的语音信号,即为语音增强后的信号。
在一种实现中,用自适应滤波的方式将待滤除信号从音频信号中滤除。
需要说明的是,以上从所述音频信号中滤除所述待滤除信号,得到去环境噪声处理后的语音信号的方式仅为一些实例,本申请并不限定。
在一种实现中,还可以基于所述目标语音信息,获取所述用户的语音信号对应的指令信息,所述指令信息指示所述用户的语音信号中包含的语义意图。其中,指令信息可以用来触发实现所述用户的语音信号中包含的语义意图相应的功能,例如,打开某个应用程序,进行语音通话等等。
在一种实现中,可以基于所述振动信号和所述语音信号,通过神经网络模型得到所述目标语音信息。
在一种实现中,根据所述振动信号,获取对应的目标音频信号;基于所述目标音频信号和所述语音信号,通过神经网络模型得到所述目标语音信息。即,神经网络模型的输入还可以是从振动信号中恢复的目标音频信号。
下面介绍本申请实施例提供的一种系统架构。
参见附图8,本发明实施例提供了一种系统架构200。数据采集设备260用于采集音频数据并存入数据库230;其中,音频数据可以包括无噪声音频、振动信号(或从振动信号中恢复的目标音频信号)以及带噪声的音频信号;其中,可以在安静环境中,让人说话/播放音频,用普通麦克风记录此时的音频信号为"无噪声音频",记为s(n)。同时用振动传感器(可以是多个)对着人的头部、脸部、喉部、脖子等处,采集这期间的视频帧并得到相应的振动信号,记为x(n);如果是多个传感器,则信号可以记为x1(n),x2(n),x3(n),x4(n)等,也可以记录从振动信号中恢复的目标音频信号。可以在"无噪声音频"上增加各种类型噪声,得到"带噪声的音频信号",记为sn(n)。
训练设备220基于数据库230中维护的音频数据生成目标模型/规则201。下面将更详细地描述训练设备220如何基于音频数据得到目标模型/规则201,目标模型/规则201能够根据所述振动信号和所述音频信号,获得所述目标语音信息,或者获得所述用户的语音信号。
训练设备可以采用深度神经网络对数据进行训练生成目标模型/规则201。深度神经网络中的每一层的工作可以用数学表达式 y = a(W·x + b) 来描述:从物理层面深度神经网络中的每一层的工作可以理解为通过五种对输入空间(输入向量的集合)的操作,完成输入空间到输出空间的变换(即矩阵的行空间到列空间),这五种操作包括:1、升维/降维;2、放大/缩小;3、旋转;4、平移;5、"弯曲"。其中1、2、3的操作由W·x完成,4的操作由+b完成,5的操作则由a()来实现。这里之所以用"空间"二字来表述是因为被分类的对象并不是单个事物,而是一类事物,空间是指这类事物所有个体的集合。其中,W是权重向量,该向量中的每一个值表示该层神经网络中的一个神经元的权重值。该向量W决定着上文所述的输入空间到输出空间的空间变换,即每一层的权重控制着如何变换空间。训练深度神经网络的目的,也就是最终得到训练好的神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。因此,神经网络的训练过程本质上就是学习控制空间变换的方式,更具体的就是学习权重矩阵。
因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到神经网络能够预测出真正想要的目标值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
训练设备220得到的目标模型/规则可以应用不同的系统或设备中。在附图8中,执行设备210配置有I/O接口212,与外部设备进行数据交互,“用户”可以通过客户设备240向I/O接口212输入数据。
执行设备210可以调用数据存储系统250中的数据、代码等,也可以将数据、指令等存入数据存储系统250中。
计算模块211使用目标模型/规则201对输入的数据进行处理。
最后,I/O接口212将处理结果(用户的指令信息或用户的语音信号)返回给客户设备240,提供给用户。
更深层地,训练设备220可以针对不同的目标,基于不同的数据生成相应的目标模型/规则201,以给用户提供更佳的结果。
在附图8中所示情况下,用户可以手动指定输入执行设备210中的数据,例如,在I/O接口212提供的界面中操作。另一种情况下,客户设备240可以自动地向I/O接口212输入数据并获得结果,如果客户设备240自动输入数据需要获得用户的授权,用户可以在客户设备240中设置相应权限。用户可以在客户设备240查看执行设备210输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备240也可以作为数据采集端将采集到的音频数据存入数据库230。
值得注意的是,附图8仅是本发明实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在附图8中,数据存储系统250相对执行设备210是外部存储器,在其它情况下,也可以将数据存储系统250置于执行设备210中。
接下来从训练侧描述本申请实施例提供的神经网络模型:
在训练数据准备阶段,可以在安静环境中,让人说话/播放音频,用普通麦克风记录此时的音频信号为“无噪声音频”,记为s(n)。同时用振动传感器(可以是多个)对着人的头部、脸部、喉部、脖子等处,采集这期间的视觉信号,记为x(n),如果是多个传感器,则信号可以记为x1(n),x2(n),x3(n),x4(n)等。用前述算法从x(n)中还原出音频信号,得到视觉麦克风采集并还原得到的“视觉音频信号”,x’(n),如果是多个传感器,则恢复的音频信号为x1’(n)、x2’(n)、x3’(n)、x4’(n)。在“无噪声音频”上增加各种类型噪声,得到“带噪声的音频信号”,记为sn(n)。
基于收集到的数据训练深度模型,学习这种“带噪声音频信号”(麦克风采集的z(n))和“视觉振动音频信号”(视觉振动传感器采集的x(n)等)到“无噪声音频信号”(增强后的语音信号s’(n))的映射关系。
可以采用循环神经网络(recurrent neural network,RNN)或者长短期记忆网络(long short-term memory,LSTM)这种考虑时序关系的深度神经网络。
首先给出本申请实施例涉及的一些技术名词的定义:
(1)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以x_s和截距1为输入的运算单元,该运算单元的输出可以为:h_{W,b}(x) = f(W^T·x) = f(∑_{s=1}^{n} W_s·x_s + b)。
其中,s=1、2、……n,n为大于1的自然数,W_s为x_s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
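上述神经单元的一个最小示意如下(此处以sigmoid作为激活函数f,输入与权重均为示例取值):

```python
import numpy as np

def neuron(xs, ws, b):
    """单个神经单元:输出 f(Σ W_s·x_s + b),此处激活函数f取sigmoid。仅为示意。"""
    z = np.dot(ws, xs) + b
    return 1.0 / (1.0 + np.exp(-z))

# 例: neuron(xs=[0.5, -1.2, 0.3], ws=[0.8, 0.1, -0.4], b=0.2)
```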
(2)深度神经网络
深度神经网络(Deep Neural Network,DNN),可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准,我们常说的多层神经网络和深度神经网络其本质上是同一个东西。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:
y = α(W·x + b),其中,x是输入向量,y是输出向量,b是偏移向量,W是权重矩阵(也称系数),α()是激活函数。每一层仅仅是对输入向量x经过如此简单的操作得到输出向量y。由于DNN层数多,则系数W和偏移向量b的数量也就很多了。那么,具体的参数在DNN中是如何定义的呢?首先我们来看看系数W的定义。以一个三层的DNN为例,如:第二层的第4个神经元到第三层的第2个神经元的线性系数定义为W^3_{24},上标3代表系数W所在的层数,而下标对应的是输出的第三层索引2和输入的第二层索引4。总结下,第L-1层的第k个神经元到第L层的第j个神经元的系数定义为W^L_{jk}。注意,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,"容量"也就越大,也就意味着它能完成更复杂的学习任务。
(3)卷积神经网络(Convolutional Neural Network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,我们都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(4)循环神经网络(RNN,Recurrent Neural Networks)
RNNs的目的是用来处理序列数据。在传统的神经网络模型中,是从输入层到隐含层再到输出层,层与层之间是全连接的,而每层内部的节点之间是无连接的。但是这种普通的神经网络对于很多问题却无能无力。例如,你要预测句子的下一个单词是什么,一般需要用到前面的单词,因为一个句子中前后单词并不是独立的。RNNs之所以称为循环神经网络,是因为一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用于当前输出的计算中,即隐藏层之间的节点不再无连接而是有连接的,并且隐藏层的输入不仅包括输入层的输出还包括上一时刻隐藏层的输出。理论上,RNNs能够对任何长度的序列数据进行处理。
RNN的训练和传统的ANN(人工神经网络)的训练一样,同样使用BP误差反向传播算法,不过有一点区别:如果将RNNs进行网络展开,那么参数W、U、V是共享的,而传统神经网络却不是;并且在使用梯度下降算法时,每一步的输出不仅依赖当前步的网络,还依赖前面若干步网络的状态。比如,在t=4时,还需要向后传递三步,即后面的三步都需要加上各自的梯度。该学习算法称为基于时间的反向传播算法Back propagation Through Time(BPTT)。
既然已经有了人工神经网络和卷积神经网络,为什么还要循环神经网络?原因很简单,无论是卷积神经网络,还是人工神经网络,它们的前提假设都是:元素之间是相互独立的,输入与输出也是独立的,比如猫和狗。但现实世界中,很多元素都是相互连接的,比如股票随时间的变化,一个人说了:我喜欢旅游,其中最喜欢的地方是云南,以后有机会一定要去__________。这里填空,人应该都知道是填"云南"。因为我们是根据上下文的内容推断出来的,但机器要做到这一步就相当难了。因此,就有了现在的循环神经网络,它的本质是:像人一样拥有记忆的能力。因此,它的输出就依赖于当前的输入和记忆。
图9为RNN的结构示意图,其中每个圆圈可以看作是一个单元,而且每个单元做的事情也是一样的,因此可以折叠呈左半图的样子。用一句话解释RNN,就是一个单元结构重复使用。
RNN是一个序列到序列的模型,假设x_{t-1}、x_t、x_{t+1}是一个输入:"我是中国",那么o_{t-1}、o_t就应该对应"是"、"中国"这两个词;预测下一个词最有可能是什么?就是o_{t+1}应该是"人"的概率比较大。
因此,我们可以做这样的定义:
X_t:表示t时刻的输入;o_t:表示t时刻的输出;S_t:表示t时刻的记忆。因为当前时刻的输出是由记忆和当前时刻的输入决定的,就像你现在大四,你的知识是由大四学到的知识(当前输入)和大三以及大三以前学到的东西(记忆)结合而成,RNN在这点上也类似,神经网络最擅长做的就是通过一系列参数把很多内容整合到一起,然后学习这个参数,因此就定义了RNN的基础:S_t = f(U·X_t + W·S_{t-1});
f()函数是神经网络中的激活函数,但为什么要加上它呢?举个例子,假如在大学学了非常好的解题方法,那初中那时候的解题方法还要用吗?显然是不用了的。RNN的想法也一样,既然能记忆了,那当然是只记重要的信息,其他不重要的,就肯定会忘记。但是在神经网络中什么最适合过滤信息呀?肯定是激活函数,因此在这里就套用一个激活函数,来做一个非线性映射,来过滤信息,这个激活函数可能为tanh,也可为其他。
假设大四快毕业了,要参加考研,请问参加考研是不是先记住你学过的内容然后去考研,还是直接带几本书去参加考研呢?很显然嘛,那RNN的想法就是预测的时候带着当前时刻的记忆S_t去预测。假如你要预测"我是中国"的下一个词出现的概率,这里已经很显然了,运用softmax来预测每个词出现的概率再合适不过了,但预测时不能直接只用记忆S_t,所以预测的时候还要带一个权重矩阵V,用公式表示为:
o_t = softmax(V·S_t),其中o_t就表示时刻t的输出。
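上述S_t = f(U·X_t + W·S_{t-1})与o_t = softmax(V·S_t)的一个numpy最小前向实现草图如下(维度、激活函数与初始化方式均为示例假设):

```python
import numpy as np

def rnn_forward(xs, U, W, V):
    """xs: [T, d_in] 输入序列;U: [d_h, d_in], W: [d_h, d_h], V: [d_out, d_h]。
    按 S_t = tanh(U·X_t + W·S_{t-1}), o_t = softmax(V·S_t) 逐时刻前向。仅为示意。"""
    d_h = U.shape[0]
    s = np.zeros(d_h)                # s_{-1} 置为0向量
    outputs = []
    for x in xs:
        s = np.tanh(U @ x + W @ s)   # 当前记忆由当前输入和上一步记忆计算
        logits = V @ s
        o = np.exp(logits - logits.max())
        outputs.append(o / o.sum())  # softmax得到各词的概率
    return np.stack(outputs), s
```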
(5)反向传播算法
卷积神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的超分辨率模型中参数的大小,使得超分辨率模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的超分辨率模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的超分辨率模型的参数,例如权重矩阵。
以本申请实施例中的神经网络模型为RNN为例,本申请实施例中的神经网络结构可以为如下:
在一种实现中,参照图10,RNN的输入为根据所述振动信号获取的目标音频信号,以及传感器采集的用户语音信号。其中,目标音频信号中包括多个时刻,以及每个时刻对应的信号采样值,传感器采集的用户语音信号中包括多个时刻,以及每个时刻对应的信号采样值,在一种实现中,可以在每个时刻,将目标音频信号的信号采样值和传感器采集的用户语音信号的信号采样值进行组合,得到一个新的音频信号,该新的音频信号包括多个时刻,以及每个时刻对应的信号采样值,其中,每个时刻对应的信号采样值是由目标音频信号的信号采样值和用户语音信号的信号采样值组合得到的(本申请并不限定具体组合方式)。组合后得到的新的音频信号可以作为循环神经网络的输入。
在一种实现中,目标音频信号可以为{x_0, x_1, x_2, …, x_t},其中,x_t为t时刻目标音频信号的信号采样值;用户语音信号可以为{y_0, y_1, y_2, …, y_t},其中,y_t为t时刻用户语音信号的信号采样值。在组合时,可以将对应时刻的信号采样值进行组合,例如,{x_t, y_t}为t时刻将信号采样值进行组合得到的结果,则将目标音频信号与用户语音信号组合后得到的新的音频信号可以为{{x_0, y_0}, {x_1, y_1}, {x_2, y_2}, …, {x_t, y_t}}。
需要说明的是,上述输入音频信号的组合方式仅为一种示意,在实际应用中,组合后的音频信号可以表达出信号采样值的时序特征即可,本申请并不限定具体的组合方式。
需要说明的是,模型的输入可以是组合后的音频信号。在另一种实现中,模型的输入可以是传感器采集的用户语音信号以及目标音频信号,此时,音频信号的组合可以是模型本身来实现的。在另一种实现中,可以是传感器采集的用户语音信号和用户发出所述语音时对应的振动信号,其中,振动信号转换为目标音频信号的过程可以是模型本身来实现的,即模型可以先将振动信号转换为目标音频信号,再将音频信号进行组合。
将上述得到的组合后的音频信号输入到RNN后,可以输出目标语音信息,目标语音信息可以包括多个时刻,以及每个时刻对应的信号采样值。例如,目标语音信息可以为{k_0, k_1, k_2, …, k_l}。需要说明的是,目标语音信息包括的信号采样值数量(时刻数量)可以和输入的音频信号包括的信号采样值数量(时刻数量)相同或不同。例如,若目标语音信息仅仅为传感器采集的用户语音信号中和人声相关的语音信息,则目标语音信息包括的信号采样值数量小于输入的音频信号包括的信号采样值数量。
s_t为隐藏层的第t步的状态,是网络的记忆单元。s_t根据当前输入层的输出x_t与上一步隐藏层的状态s_{t-1}进行计算:s_t = f(U·x_t + W·s_{t-1}),其中f一般是非线性的激活函数,如tanh或ReLU函数;在计算s_0时,即第一个时刻特征的隐藏层状态,需要用到s_{-1},但是其并不存在,在实现中一般置为0向量;o_t是第t步的输出,o_t = g(V·s_t),g是线性或者非线性函数。
在模型训练的过程中,可以将训练样本数据库中的数据输入初始化的神经网络模型进行训练,所述训练样本数据库包括成组的"带环境噪声的语音信号"、"目标音频信号"和对应的"无噪声的音频信号",所述初始化的神经网络模型包括权重和偏置;在第K次训练过程中,通过经过K-1次调整的神经网络模型,从所述样本的带噪声音频信号和视觉音频信号的音频特征中学习提取去噪之后的音频信号s'(n),所述K为大于0的整数;在第K次训练后,获取所述样本提取到的去噪之后的音频信号s'(n)和无噪声音频信号s(n)之间的误差值;基于所述误差值,调整第K+1次训练过程所使用的权重和偏置。
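下面给出上述训练过程的一个极简草图(此处以PyTorch中的LSTM与均方误差损失为例,网络结构、维度与超参数均为说明性假设,并非本申请限定的训练实现):

```python
import torch
import torch.nn as nn

class DenoiseRNN(nn.Module):
    """输入每个时刻为[带噪语音采样值, 目标音频采样值]两维,输出该时刻去噪后的采样值。"""
    def __init__(self, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(input_size=2, hidden_size=hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                      # x: [B, T, 2]
        h, _ = self.rnn(x)
        return self.out(h).squeeze(-1)         # [B, T]

def train_step(model, optim, noisy, vib_audio, clean):
    """noisy/vib_audio/clean: [B, T]。一次训练:前向得到s'(n),与s(n)求误差并更新权重与偏置。"""
    x = torch.stack([noisy, vib_audio], dim=-1)      # 按时刻组合成新的输入信号
    pred = model(x)
    loss = nn.functional.mse_loss(pred, clean)       # s'(n)与s(n)之间的误差
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

# 例: model = DenoiseRNN(); optim = torch.optim.Adam(model.parameters(), lr=1e-3)
```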
在上面的组合后得到的音频信号中,包含了两个维度,分别是传感器采集的用户语音信号和目标音频信号。由于二者都是音频信号(需要先基于振动信号x(n)解码恢复音频信号x’(n)),因此可以分别提取音频特征向量MFCC系数。在语音识别和说话人识别领域,MFCC特征是应用最为广泛的基础特征。MFCC特征基于人耳特性,即人耳对约1000Hz以上的声音频率范围的感知不遵循线性关系,而是遵循在对数频率坐标上的近似线性关系。MFCC是在Mel标度频率域上提取出来的倒谱参数,Mel标度描述了人耳频率的这种非线性特性。
MFCC特征的提取可以包括如下步骤:预处理:由预加重、分帧加窗组成。其中预加重的目的是消除发音时口鼻辐射带来的影响,通过高通滤波器,使语音高频部分得到提升。由于语音信号短时平稳,通过分帧加窗将语音信号分为一个一个的短时段,每个短时段被称为一帧。同时为了避免语音信号动态信息的丢失,相邻帧之间要有一段重叠区域。FFT变换将分帧加窗后的时域信号变换到频域,得到频谱特征X(k)。将语音帧频谱特征X(k)经梅尔滤波器组滤波后,得到每个子带的能量,然后对其取对数,得到梅尔频率对数能量谱S(m);将S(m)经离散余弦变换(DCT)得到MFCC系数C(n)。在构造特征向量时,如果视觉振动音频信号是多个,则可以采用如下任一方式:将多个信号取平均后,提取MFCC系数,作为"视觉音频信号"的特征向量;或者,对各个视觉/振动音频信号分别提取MFCC系数,然后将这些系数串接在一起,形成一个更大的特征向量;或者,对各个视觉/振动音频信号分别提取MFCC系数,然后对这些MFCC系数取平均值,将平均后的MFCC系数作为"视觉音频信号"的特征向量。
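MFCC特征提取与多信号特征融合的一个示意性草图如下(此处假设借助librosa库完成分帧加窗、梅尔滤波与DCT,预加重系数、帧长与系数个数等均为示例参数):

```python
import numpy as np
import librosa

def mfcc_feature(sig, sr=16000, n_mfcc=13):
    """预加重 -> 由librosa完成分帧加窗、FFT、梅尔滤波、取对数与DCT,得到MFCC系数。仅为示意。"""
    emphasized = np.append(sig[0], sig[1:] - 0.97 * sig[:-1])   # 预加重,提升高频
    return librosa.feature.mfcc(y=emphasized.astype(np.float32), sr=sr,
                                n_mfcc=n_mfcc, n_fft=512, hop_length=256)

def fuse_vibration_mfcc(vib_signals, sr=16000):
    """多个视觉/振动音频信号:分别提取MFCC后取平均,作为"视觉音频信号"特征向量(实现方式之一)。"""
    feats = [mfcc_feature(s, sr) for s in vib_signals]
    min_t = min(f.shape[1] for f in feats)
    return np.mean([f[:, :min_t] for f in feats], axis=0)
```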
除了上面实施例中描述的用全部的音频信息构造特征向量,还可以直接基于视频得到的振动信号联合音频信息构造。此时,在构造的音频特征向量中,仍然包含两个维度,其中一个维度仍是"带噪声音频信号",另一个维度则由"视觉振动音频信号"替换为"视觉振动信号",其中该信号获取方式如下:如果是高速摄像头,对每一帧图像,采用四个尺度r(如1、1/4、1/16、1/64),以及四个方向θ(如上、下、左、右),对每一个尺度和方向分别计算一个振动特征值,这样就得到了16个振动信息的特征值,可以形成特征向量。
如果是DVS传感器(类脑摄像头),可以对一个确定的音频帧间隔,比如T间隔,在T/N的子间隔内,随机选择一个像素的振动偏移量S(t),在每个子间隔内都如此选择,得到16个振动偏移量数值,将此偏移量数值形成特征向量,并将重新组合的特征向量,作为音频信号特征,基于所述的RNN神经网络进行训练。此外,还可以将带噪音频信号、振动恢复信号、振动信号三个信号中提取的特征组合成音频信号特征向量,进行训练。
在一种实现中,不使用特征向量方式,直接由原始的多模态数据在网络中提取特征并应用,具体的,可以训练深度网络来学习"带噪声的音频信号"和"振动信号"到无噪声的语音信号的映射关系。RNNs包含输入单元,对应输入集标记为{x_0, x_1, x_2, …, x_t, x_{t+1}, …},而输出单元的输出集则被标记为{o_0, o_1, o_2, …, o_t, o_{t+1}, …}。RNNs还包含隐藏单元,其输出集标记为{s_0, s_1, s_2, …, s_t, s_{t+1}, …}。x_t表示第t(t=1,2,3…)步的输入,对应第t时刻的融合特征信号,这里是"带噪声音频信号"sn和"视觉振动信号"sv连接得到的多模混合信号x_t = [sn(0), …, sn(t), sv(0), …, sv(t)]。
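第t时刻多模混合输入x_t的构造可以示意如下(直接将带噪声音频与视觉振动信号截至当前时刻的采样拼接,仅为说明性的简化写法):

```python
import numpy as np

def fused_input(sn, sv, t):
    """sn: 带噪声音频信号采样序列; sv: 视觉振动信号采样序列。
    返回第t时刻的多模混合输入 x_t = [sn(0),…,sn(t), sv(0),…,sv(t)]。仅为示意。"""
    return np.concatenate([sn[:t + 1], sv[:t + 1]])
```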
s_t为隐藏层的第t步的状态,是网络的记忆单元。s_t根据当前输入层的输出x_t与上一步隐藏层的状态s_{t-1}进行计算:s_t = f(U·x_t + W·s_{t-1}),其中f一般是非线性的激活函数,如tanh或ReLU函数;在计算s_0时,即第一个时刻特征的隐藏层状态,需要用到s_{-1},但是其并不存在,在实现中一般置为0向量;o_t是第t步的输出,o_t = g(V·s_t),函数g为线性或者非线性函数。
在具体的训练过程中,可以将训练样本数据库中的数据输入初始化的神经网络模型进行训练,所述训练样本数据库包括成组的"带噪声音频信号"、"视觉振动信号"和对应的"无噪声音频信号",所述初始化的神经网络模型包括权重和偏置;在第K次训练过程中,通过经过K-1次调整的神经网络模型,从所述样本的带噪声音频信号和视觉振动信号中学习提取去噪之后的音频信号s'(n),所述K为大于0的整数;在第K次训练后,获取所述样本提取到的去噪之后的音频信号s'(n)和无噪声音频信号s(n)之间的误差值;基于所述误差值,调整第K+1次训练过程所使用的权重和偏置。需要说明的是,训练模型时,可以使用振动音频信号x'(n)(和/或x1'(n)、x2'(n)、x3'(n)、x4'(n))代替视觉振动信号x(n)(和/或x1(n)、x2(n)、x3(n)、x4(n))进行训练,即用振动信号恢复出来的音频信号与麦克风采集的音频信号融合训练。
本申请实施例中,神经网络模型的输出可以为进行去环境噪声处理后得到的语音信号,或者是用户的指令信息,其中,指令信息为基于用户的语音信号确定的,指令信息用于指示用户的语音信号中携带的用户的意图,具有语音交互功能的设备可以基于该指令信息触发相应的功能,例如打开某个应用程序等等。
本申请实施例提供了一种语音信号处理方法,包括:获取传感器采集的用户的语音信号,所述语音信号包括环境噪声;获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位;以及根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息。通过上述方式,将振动信号作为语音识别的依据,由于振动信号没有包含复杂的声学传输时混入的外界非用户的语音,受其他环境噪声的影响很小(例如混响影响),因此可以相对较好的抑制住这部分噪声干扰,可以实现更好的语音识别效果。
在一种实现中,还可以获取所述用户发出所述语音时对应的所述用户的脑波信号;相应的,可以根据所述振动信号、所述脑波信号和所述语音信号,获得目标语音信息。其中,可以基于脑波拾取设备获取用户的脑波信号,其中,脑波拾取设备可以是耳机、眼镜或者其他耳戴式形态。
本实施例中,可以建立脑波信号和声道咬合部位运动的映射关系表,并采集人在朗读各种不同语素、语句时的脑波信号和声道咬合的运动信号,脑波信号由脑电采集设备(例如包括电极、前端模拟放大器、模数转换、脑电信号处理等模块)将多个脑区位置的脑电信号按照不同频段分段采集,声道咬合的运动信号可以由肌电采集设备采集或光学成像设备采集,之后可以建立起人在不同语料素材时的脑波信号和声道咬合的运动信号之间的映射关系。此时,在获取到所述用户发出所述语音时对应的所述用户的脑波信号之后,可以基于脑波信号和声道咬合的运动信号之间的映射关系,获取脑波信号对应的声道咬合的运动信号。
本实施例中,可以将脑波信号转化为声道咬合的关节运动(运动信号),然后再将这些解码的运动转化为语音信号。即,首先将脑波信号转换成声道咬合部位的运动,这其中涉及语音产生的解剖结构(如嘴唇、舌头、喉和下颌的运动信号)。为了实现脑波信号到声道咬合部位运动的转化和映射,需要将人说话时大量声道运动与其神经活动相关联。可以基于建立的循环神经网络,根据以前收集的大量声道运动和语音记录数据集来建立这种关联,并将声道咬合部位的运动信号转换成语音信号。
在一种实现中,可以基于所述振动信号、所述运动信号和所述语音信号,通过神经网络模型得到所述目标语音信息。
在一种实现中,参照图11,可以根据所述振动信号,获取对应的第一目标音频信号;根据所述运动信号,获取对应的第二目标音频信号;基于所述第一目标音频信号、所述第二目 标音频信号和所述语音信号,通过神经网络模型得到所述目标语音信息。具体的实现细节可以参照上述实施例中与神经网络模型相关的描述,这里不再赘述。
在另一种实现中,可以基于脑波信号直接映射得到语音信号,进而,可以基于所述振动信号、所述脑波信号和所述语音信号,通过神经网络模型得到所述目标语音信息。
在一种实现中,参照图11,可以根据所述振动信号,获取对应的第一目标音频信号;根据所述脑波信号,获取对应的第二目标音频信号;基于所述第一目标音频信号、所述第二目标音频信号和所述语音信号,通过神经网络模型得到所述目标语音信息。具体的实现细节可以参照上述实施例中与神经网络模型相关的描述,这里不再赘述。
在一种实现中,所述目标语音信息包括表示所述用户的语音信号的声纹特征。
在一种实现中,可以基于所述振动信号和所述语音信号,通过神经网络模型得到所述目标语音信息。目标语音信息可以用于表示所述用户的语音信号的声纹特征,进而,可以基于全连接层对目标语音信息进行处理,得到声纹识别结果。
在一种实现中,参照图12,可以根据所述振动信号,获取对应的目标音频信号;基于所述目标音频信号和所述语音信号,通过神经网络模型得到所述目标语音信息。
在一种实现中,可以基于所述振动信号、所述脑波信号和所述语音信号,通过神经网络模型得到所述目标语音信息。
在一种实现中,参照图13,可以根据所述振动信号,获取对应的第一目标音频信号;根据所述脑波信号,获取对应的第二目标音频信号;基于所述第一目标音频信号、所述第二目标音频信号和所述语音信号,通过神经网络模型得到所述目标语音信息。
关于模型的构建方式可以参照上述图10对应的实施例中的描述,这里不再赘述。
本申请实施例中,将用户说话时的振动信号作为声纹识别的依据,由于振动信号受其他噪声的干扰(例如混响干扰等等)很小,可以表达出用户说话的本源音频特征,因此,本申请通过将振动信号作为声纹识别的依据,识别效果更佳,可靠性更强。
参照图14,图14为本申请实施例提供的一种语音信号处理方法的流程示意,如图14中示出的那样,所述方法包括:
1401、获取传感器采集的用户的语音信号。
步骤1401的具体描述可以参照步骤701的描述,这里不再赘述。
1402、获取所述用户发出所述语音时对应的所述用户的脑波信号。
步骤1402的具体描述,可以参照上述实施例中与脑波信号有关的具体描述,这里不再赘述。
1403、根据所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息。
本申请实施例中,在获取到用户的语音信号以及用户的脑波信号之后,可以根据所述脑波信号和所述语音信号,获得目标语音信息。和上述步骤703中不同的是,本实施例根据的是脑波信号和所述语音信号,关于如何根据脑波信号和所述语音信号获得目标语音信息,可以借鉴上述实施例中步骤703的描述,这里不再赘述。
本申请实施例中,还可以根据所述脑波信号,获取所述用户在发声时声道咬合部位的运动信号;进而,可以根据所述运动信号和所述语音信号,获得目标语音信息。
可选的,在一种实现中,所述目标语音信息为进行去环境噪声处理后得到的语音信号,可以根据所述脑波信号,获取对应的目标音频信号;基于滤波,从所述语音信号中滤除所述目标音频信号,得到待滤除信号;从所述语音信号中滤除所述待滤除信号,得到所述目标语音信息。
可选的,在一种实现中,可以基于所述目标语音信息,获取所述用户的语音信号对应的指令信息,所述指令信息指示所述用户的语音信号中包含的语义意图。
可选的,在一种实现中,可以基于所述脑波信号和所述语音信号,通过神经网络模型得到所述目标语音信息;或,根据所述脑波信号,获取对应的目标音频信号;基于所述目标音频信号和所述语音信号,通过神经网络模型得到所述目标语音信息;其中,所述目标语音信息为进行去环境噪声处理后得到的语音信号或与所述用户的语音信号对应的指令信息。
在一种实现中,所述目标语音信息包括表示所述用户的语音信号的声纹特征。
本申请实施例提供了一种语音信号处理方法,所述方法包括:获取传感器采集的用户的语音信号;获取所述用户发出所述语音时对应的所述用户的脑波信号;以及根据所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息。通过上述方式,将脑波信号作为语音识别的依据,由于脑波信号没有包含复杂的声学传输时混入的外界非用户的语音,受其他环境噪声的影响很小(例如混响影响),因此可以相对较好的抑制住这部分噪声干扰,可以实现更好的语音识别效果。
参照图15,图15为本申请实施例提供的一种语音信号处理方法的流程示意,如图15中示出的那样,所述方法包括:
1501、获取传感器采集的用户的语音信号。
步骤1501的具体描述可以参照步骤701的描述,这里不再赘述。
1502、获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位;
步骤1502的具体描述可以参照步骤702的描述,这里不再赘述。
1503、基于所述传感器采集的用户语音信号以及所述振动信号,进行声纹识别。
在一种实现中,所述振动信号用于表示与发声振动相对应的振动特征。
在一种实现中,根据所述传感器采集的用户语音信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第一置信度;根据所述振动信号进行声纹识别,得到所述传感器采集的用户语音信号属于目标用户的第二置信度;根据所述第一置信度和所述第二置信度,得到声纹识别结果。例如可以对所述第一置信度和所述第二置信度进行加权,以得到声纹识别结果。
在一种实现中,可以获取所述用户发出所述语音时对应的所述用户的脑波信号;根据所述脑波信号,获取所述用户发出语音时声道咬合部位的运动信号;进而,可以基于所述传感器采集的用户语音信号、所述振动信号以及所述运动信号,进行声纹识别。
在一种实现中,可以根据所述传感器采集的用户语音信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第一置信度;根据所述振动信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第二置信度;根据所述脑波信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第三置信度;根据所述第一置信度、所述第二置信度和所述第三置信度,得到声纹识别结果。例如,可以对所述第一置信度、所述第二置信度和所述第三置信度进行加权,以得到声纹识别结果。
在一种实现中,可以基于所述音频信号、所述振动信号以及所述脑波信号,通过神经网络模型得到声纹识别结果。
本实施例中,如果恢复了多个音频信号(包含多个振动信息或者脑波信号恢复的多个目标音频信号),则可以先单独分别恢复音频x’(n)、y’(n)、x1’(n)、x2’(n)、x3’(n)、x4’(n)并单独进行声纹识别,然后将各自的声纹识别结果做加权和的方式提供最终结果:
VP=h1*x1+h2*x2+h3*x3+h4*x4+h5*x+h6*y+h7*s;其中,此处的x1、x2、x3、x4、x、y、s表示振动信号、脑波信号以及音频信号各自的识别结果,h1、h2、h3、h4、h5、h6、h7表示对应识别结果的加权,权重可以灵活选择。最终的识别结果VP如果超出预设门限VP_TH,则表示基于振动拾取时得到的音频声纹结果通过。
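上述加权融合的一个示意如下(各路置信度、权重与门限VP_TH均为示例取值):

```python
import numpy as np

def fuse_voiceprint(scores, weights, vp_th=0.7):
    """scores: 各路信号(振动、脑波、音频等)各自的声纹识别置信度;weights: 对应权重。
    加权求和得到最终结果VP,超过门限VP_TH则认为声纹验证通过。仅为示意。"""
    vp = float(np.dot(scores, weights))
    return vp, vp > vp_th

# 例: vp, passed = fuse_voiceprint([0.9, 0.8, 0.85], [0.4, 0.3, 0.3])
```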
本申请实施例中,将用户说话时的振动信号作为声纹识别的依据,由于振动信号受其他噪声的干扰(例如混响干扰等等)很小,可以表达出用户说话的本源音频特征,因此,本申请通过将振动信号作为声纹识别的依据,识别效果更佳,可靠性更强。
参照图16,图16为本申请提供了一种语音信号处理装置的结构示意,如图16中示出的那样,所述装置1600包括:
环境语音获取模块1601,用于获取传感器采集的用户语音信号;
振动信号获取模块1602,用于获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位;以及
语音信息获取模块1603,用于根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息。
在一种可选的实现中,所述振动信号用于表示与发出语音产生的振动相对应的振动特征。
在一种可选的实现中,所述身体部位包括如下的至少一种:颅顶、面部、喉部或颈部。
在一种可选的实现中,所述振动信号获取模块1602,用于获取包括所述用户的视频帧;根据所述视频帧,提取所述用户发出语音时对应的振动信号。
在一种可选的实现中,所述视频帧为通过动态视觉传感器和/或高速摄像头采集得到的。
在一种可选的实现中,所述目标语音信息为进行去环境噪声处理后得到的语音信号,所述语音信息获取模块1603,用于根据所述振动信号,获取对应的目标音频信号;基于滤波,从所述传感器采集的用户语音信号中滤除所述目标音频信号,得到待滤除信号;从所述传感器采集的用户语音信号中滤除所述待滤除信号,得到所述目标语音信息。
在一种可选的实现中,所述装置还包括:
指令信息获取模块,用于基于所述目标语音信息,获取所述用户的语音信号对应的指令信息,所述指令信息指示所述用户的语音信号中包含的语义意图。
在一种可选的实现中,所述语音信息获取模块1603,用于基于所述振动信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息;或,根据所述振动信号,获取对应的目标音频信号;基于所述目标音频信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息。
在一种可选的实现中,所述装置还包括:
脑波信号获取模块,用于获取所述用户发出所述语音时对应的所述用户的脑波信号;相应的,所述语音信息获取模块,用于根据所述振动信号、所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息。
在一种可选的实现中,所述装置还包括:
运动信号获取模块,用于根据所述脑波信号,获取所述用户发出语音时声道咬合部位的运动信号;相应的,所述语音信息获取模块,用于根据所述振动信号、所述运动信号和所述传感器采集的用户语音信号,获得目标语音信息。
在一种可选的实现中,所述语音信息获取模块1603,用于基于所述振动信号、所述脑波信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息;或,
根据所述振动信号,获取对应的第一目标音频信号;
根据所述脑波信号,获取对应的第二目标音频信号;基于所述第一目标音频信号、所述第二目标音频信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息。
在一种可选的实现中,所述目标语音信息包括表示所述用户的语音信号的声纹特征。
本申请实施例提供了一种语音信号处理装置,所述装置包括:环境语音获取模块,用于获取传感器采集的用户语音信号;振动信号获取模块,用于获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位;以及语音信息获取模块,用于根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息。通过上述方式,将振动信号作为语音识别的依据,由于振动信号没有包含复杂的声学传输时混入的外界非用户的语音,受其他环境噪声的影响很小(例如混响影响),因此可以相对较好的抑制住这部分噪声干扰,可以实现更好的语音识别效果。
参照图17,图17为本申请提供了一种语音信号处理装置的结构示意,如图17中示出的那样,所述装置1700包括:
环境语音获取模块1701,用于获取传感器采集的用户的语音信号;
脑波信号获取模块1702,用于获取所述用户发出所述语音时对应的所述用户的脑波信号;以及
语音信息获取模块1703,用于根据所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息。
在一种可选的实现中,所述装置还包括:
运动信号获取模块,用于根据所述脑波信号,获取所述用户在发声时声道咬合部位的运动信号;相应的,所述语音信息获取模块,用于根据所述运动信号和所述传感器采集的用户语音信号,获得所述目标语音信息。
在一种可选的实现中,所述语音信息获取模块,用于根据所述脑波信号,获取对应的目标音频信号;
基于滤波,从所述传感器采集的用户语音信号中滤除所述目标音频信号,得到待滤除信号;
从所述传感器采集的用户语音信号中滤除所述待滤除信号,得到所述目标语音信息。
在一种可选的实现中,所述装置还包括:
指令信息获取模块,用于基于所述目标语音信息,获取所述用户的语音信号对应的指令信息,所述指令信息指示所述用户的语音信号中包含的语义意图。
在一种可选的实现中,所述语音信息获取模块,用于基于所述脑波信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息;或,
根据所述脑波信号,获取对应的目标音频信号;基于所述目标音频信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息。
在一种可选的实现中,所述目标语音信息包括表示所述用户的语音信号的声纹特征。
本申请实施例提供了一种语音信号处理装置,所述装置包括:环境语音获取模块,用于获取传感器采集的用户的语音信号;脑波信号获取模块,用于获取所述用户发出所述语音时对应的所述用户的脑波信号;以及语音信息获取模块,用于根据所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息。通过上述方式,将脑波信号作为语音识别的依据,由于脑波信号没有包含复杂的声学传输时混入的外界非用户的语音,受其他环境噪声的影响很小(例如混响影响),因此可以相对较好的抑制住这部分噪声干扰,可以实现更好的语音识别效果。
参照图18,图18为本申请提供了一种语音信号处理装置的结构示意,如图18中示出的那样,所述装置1800包括:
环境语音获取模块1801,用于获取传感器采集的用户语音信号;
振动信号获取模块1802,用于获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位;以及
声纹识别模块1803,用于基于所述传感器采集的用户语音信号以及所述振动信号,进行声纹识别。
在一种可选的实现中,所述振动信号用于表示与发出语音产生的振动相对应的振动特征。
在一种可选的实现中,所述声纹识别模块,用于根据所述传感器采集的用户语音信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第一置信度;
根据所述振动信号进行声纹识别,得到所述传感器采集的用户语音信号属于目标用户的第二置信度;
根据所述第一置信度和所述第二置信度,得到声纹识别结果。
在一种可选的实现中,所述装置还包括:
脑波信号获取模块,用于获取所述用户发出所述语音时对应的所述用户的脑波信号;
相应的,所述声纹识别模块,用于基于所述传感器采集的用户语音信号、所述振动信号以及所述脑波信号,进行声纹识别。
在一种可选的实现中,所述声纹识别模块,用于根据所述传感器采集的用户语音信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第一置信度;
根据所述振动信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第二置信度;
根据所述脑波信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第三置信度;
根据所述第一置信度、所述第二置信度和所述第三置信度,得到声纹识别结果。
本申请实施例提供了一种语音信号处理装置,所述装置包括:环境语音获取模块,用于获取传感器采集的用户语音信号;振动信号获取模块,用于获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位;以及声纹识别模块,用于基于所述传感器采集的用户语音信号以及所述振动信号,进行声纹识别。本申请实施例中,将用户说话时的振动信号作为声纹识别的依据,由于振动信号受其他噪声的干扰(例如混响干扰等等)很小,可以表达出用户说话的本源音频特征,因此,本申请通过将振动信号作为声纹识别的依据,识别效果更佳,可靠性更强。
接下来介绍本申请实施例提供的一种执行设备,其中执行设备可以是上述实施例中的具有语音交互功能的装置或者语音输入设备,请参阅图19,图19为本申请实施例提供的执行设备的一种结构示意图,执行设备1900具体可以表现为手机、平板、笔记本电脑、智能穿戴设备、服务器等,此处不做限定。其中,执行设备1900上可以部署有图16至图18对应实施例中所描述的语音信号处理装置,用于实现上述实施例中语音信号处理的功能。具体的,执行设备1900包括:接收器1901、发射器1902、处理器1903和存储器1904(其中执行设备1900中的处理器1903的数量可以是一个或多个,图19中以一个处理器为例),其中,处理器1903可以包括应用处理器19031和通信处理器19032。在本申请的一些实施例中,接收器1901、发射器1902、处理器1903和存储器1904可通过总线或其它方式连接。
存储器1904可以包括只读存储器和随机存取存储器,并向处理器1903提供指令和数据。存储器1904的一部分还可以包括非易失性随机存取存储器(non-volatile random access memory,NVRAM)。存储器1904存储有处理器和操作指令、可执行模块或者数据结构,或者它们的子集,或者它们的扩展集,其中,操作指令可包括各种操作指令,用于实现各种操作。
处理器1903控制执行设备的操作。具体的应用中,执行设备的各个组件通过总线系统耦合在一起,其中总线系统除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都称为总线系统。
上述本申请实施例揭示的方法可以应用于处理器1903中,或者由处理器1903实现。处理器1903可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器1903中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1903可以是通用处理器、数字信号处理器(digital signal processing,DSP)、微处理器或微控制器,还可进一步包括专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。该处理器1903可以实现或者执行本申请实施例中公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器、闪存、只读存储器、可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1904,处理器1903读取存储器1904中的信息,结合其硬件完成上述方法的步骤。
接收器1901可用于接收输入的数字或字符信息,以及产生与执行设备的相关设置以及功能控制有关的信号输入。发射器1902可用于通过第一接口输出数字或字符信息;发射器1902还可用于通过第一接口向磁盘组发送指令,以修改磁盘组中的数据;发射器1902还可以包括显示屏等显示设备。
本申请实施例中,在一种情况下,处理器1903,用于执行图7、图14以及图15对应实施例中的执行设备执行的语音信号处理方法。
本申请实施例还提供了一种训练设备,请参阅图20,图20是本申请实施例提供的训练设备一种结构示意图,具体的,训练设备2000由一个或多个服务器实现,训练设备2000可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)2020(例如,一个或一个以上处理器)和存储器2032,一个或一个以上存储应用程序2042或数据2044的存储介质2030(例如一个或一个以上海量存储设备)。其中,存储器2032和存储介质2030可以是短暂存储或持久存储。存储在存储介质2030的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对训练设备中的一系列指令操作。更进一步地,中央处理器2020可以设置为与存储介质2030通信,在训练设备2000上执行存储介质2030中的一系列指令操作。
训练设备2000还可以包括一个或一个以上电源2026,一个或一个以上有线或无线网络接口2050,一个或一个以上输入输出接口2058;或,一个或一个以上操作系统2041,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
本申请实施例中,中央处理器2020,用于执行上述实施例中的与神经网络模型训练方法相关的步骤。
本申请实施例中还提供一种包括计算机程序产品,当其在计算机上运行时,使得计算机执行如前述执行设备所执行的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有用于进行信号处理的程序,当其在计算机上运行时,使得计算机执行如前述执行设备所执行的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。
本申请实施例提供的执行设备、训练设备或终端设备具体可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、 管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使执行设备内的芯片执行上述实施例描述的数据处理方法,或者,以使训练设备内的芯片执行上述实施例描述的数据处理方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
具体的,请参阅图21,图21为本申请实施例提供的芯片的一种结构示意图,所述芯片可以表现为神经网络处理器NPU 2100,NPU 2100作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路2103,通过控制器2104控制运算电路2103提取存储器中的矩阵数据并进行乘法运算。
在一些实现中,运算电路2103内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路2103是二维脉动阵列。运算电路2103还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路2103是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器2102中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器2101中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)2108中。
统一存储器2106用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(Direct Memory Access Controller,DMAC)2105被搬运到权重存储器2102中。输入数据也通过DMAC被搬运到统一存储器2106中。
BIU为Bus Interface Unit,即总线接口单元2110,用于AXI总线与DMAC和取指存储器(Instruction Fetch Buffer,IFB)2109的交互。
总线接口单元2110(Bus Interface Unit,简称BIU),用于取指存储器2109从外部存储器获取指令,还用于存储单元访问控制器2105从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器2106,或将权重数据搬运到权重存储器2102中,或将输入数据搬运到输入存储器2101中。
向量计算单元2107包括多个运算处理单元,在需要的情况下,对运算电路2103的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如Batch Normalization(批归一化),像素级求和,对特征平面进行上采样等。
在一些实现中,向量计算单元2107能将经处理的输出的向量存储到统一存储器2106。例如,向量计算单元2107可以将线性函数或非线性函数应用到运算电路2103的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元2107生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作运算电路2103的激活输入,例如用于在神经网络中的后续层中的使用。
控制器2104连接的取指存储器(instruction fetch buffer)2109,用于存储控制器2104使用的指令;
统一存储器2106,输入存储器2101,权重存储器2102以及取指存储器2109均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述程序执行的集成电路。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,训练设备,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、训练设备或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、训练设备或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的训练设备、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。
Claims (37)
- 一种语音信号处理方法,其特征在于,所述方法包括:获取传感器采集的用户语音信号;获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位;以及根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息。
- 根据权利要求1所述的方法,其特征在于,所述振动信号用于表示与所述用户发出所述语音产生的振动相对应的振动特征。
- 根据权利要求1或2所述的方法,其特征在于,所述身体部位包括如下的至少一种:颅顶、面部、喉部或颈部。
- 根据权利要求1至3任一所述的方法,其特征在于,所述获取所述用户发出所述语音时对应的振动信号,包括:获取包括所述用户的视频帧;根据所述视频帧,提取所述用户发出所述语音时对应的振动信号。
- 根据权利要求4所述的方法,其特征在于,所述视频帧为通过动态视觉传感器和/或高速摄像头采集得到的。
- 根据权利要求1至5任一所述的方法,其特征在于,所述根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息,包括:根据所述振动信号,获取对应的目标音频信号;基于滤波,从所述传感器采集的用户语音信号中滤除所述目标音频信号,得到待滤除信号;从所述传感器采集的用户语音信号中滤除所述待滤除信号,得到所述目标语音信息。
- 根据权利要求1至5任一所述的方法,其特征在于,所述根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息,包括:基于所述振动信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息;或,根据所述振动信号,获取对应的目标音频信号;基于所述目标音频信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息。
- 根据权利要求1至5任一所述的方法,其特征在于,所述方法还包括:获取所述用户发出所述语音时对应的所述用户的脑波信号;相应的,所述根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息,包括:根据所述振动信号、所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息。
- 根据权利要求8所述的方法,其特征在于,所述方法还包括:根据所述脑波信号,获取所述用户发出语音时声道咬合部位的运动信号;相应的,所述根据所述振动信号、所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息,包括:根据所述振动信号、所述运动信号和所述传感器采集的用户语音信号,获得目标语音信息。
- 根据权利要求8所述的方法,其特征在于,所述根据所述振动信号、所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息,包括:基于所述振动信号、所述脑波信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息;或,根据所述振动信号,获取对应的第一目标音频信号;根据所述脑波信号,获取对应的第二目标音频信号;基于所述第一目标音频信号、所述第二目标音频信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息。
- 根据权利要求1至5、7至10任一所述的方法,其特征在于,所述目标语音信息包括表示所述用户的语音信号的声纹特征。
- 一种语音信号处理方法,其特征在于,所述方法包括:获取传感器采集的用户语音信号;获取所述用户发出所述语音时对应的所述用户的脑波信号;以及根据所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息。
- 根据权利要求12所述的方法,其特征在于,所述方法还包括:根据所述脑波信号,获取所述用户在发声时声道咬合部位的运动信号;相应的,所述根据所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息,包括:根据所述运动信号和所述传感器采集的用户语音信号,获得所述目标语音信息。
- 根据权利要求12或13所述的方法,其特征在于,所述根据所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息,包括:根据所述脑波信号,获取对应的目标音频信号;基于滤波,从所述传感器采集的用户语音信号中滤除所述目标音频信号,得到待滤除信号;从所述传感器采集的用户语音信号中滤除所述待滤除信号,得到所述目标语音信息。
- 根据权利要求13所述的方法,其特征在于,所述根据所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息,包括:基于所述脑波信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息;或,根据所述脑波信号,获取对应的目标音频信号;基于所述目标音频信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息。
- 根据权利要求12、13或15所述的方法,其特征在于,所述目标语音信息包括表示所述用户的语音信号的声纹特征。
- 一种语音信号处理方法,其特征在于,所述方法包括:获取传感器采集的用户语音信号;获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位;以及基于所述传感器采集的用户语音信号以及所述振动信号,进行声纹识别。
- 根据权利要求17所述的方法,其特征在于,所述基于所述传感器采集的用户语音信号以及所述振动信号,进行声纹识别,包括:根据所述传感器采集的用户语音信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第一置信度;根据所述振动信号进行声纹识别,得到所述传感器采集的用户语音信号属于目标用户的第二置信度;根据所述第一置信度和所述第二置信度,得到声纹识别结果。
- 根据权利要求17或18所述的方法,其特征在于,所述方法还包括:获取所述用户发出所述语音时对应的所述用户的脑波信号;相应的,所述基于所述传感器采集的用户语音信号以及所述振动信号,进行声纹识别,包括:基于所述传感器采集的用户语音信号、所述振动信号以及所述脑波信号,进行声纹识别。
- 根据权利要求19所述的方法,其特征在于,所述基于所述传感器采集的用户语音信号、所述振动信号以及所述脑波信号,进行声纹识别,包括:根据所述传感器采集的用户语音信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第一置信度;根据所述振动信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第二 置信度;根据所述脑波信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第三置信度;根据所述第一置信度、所述第二置信度和所述第三置信度,得到声纹识别结果。
- 一种语音信号处理装置,其特征在于,所述装置包括:环境语音获取模块,用于获取传感器采集的用户语音信号;振动信号获取模块,用于获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位;以及语音信息获取模块,用于根据所述振动信号和所述传感器采集的用户语音信号,获得目标语音信息。
- 根据权利要求21所述的装置,其特征在于,所述振动信号用于表示与所述用户发出所述语音产生的振动相对应的振动特征。
- 根据权利要求21或22所述的装置,其特征在于,所述语音信息获取模块,用于根据所述振动信号,获取对应的目标音频信号;基于滤波,从所述传感器采集的用户语音信号中滤除所述目标音频信号,得到待滤除信号;从所述传感器采集的用户语音信号中滤除所述待滤除信号,得到所述目标语音信息。
- 根据权利要求21或22所述的装置,其特征在于,所述语音信息获取模块,用于基于所述振动信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息;或,根据所述振动信号,获取对应的目标音频信号;基于所述目标音频信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息。
- 根据权利要求21至24任一所述的装置,其特征在于,所述装置还包括:脑波信号获取模块,用于获取所述用户发出所述语音时对应的所述用户的脑波信号;相应的,所述语音信息获取模块,用于根据所述振动信号、所述脑波信号和所述传感器采集的用户语音信号,获得目标语音信息。
- 根据权利要求21、22、24或25所述的装置,其特征在于,所述目标语音信息包括表示所述用户的语音信号的声纹特征。
- 一种语音信号处理装置,其特征在于,所述装置包括:环境语音获取模块,用于获取传感器采集的用户的语音信号;脑波信号获取模块,用于获取所述用户发出所述语音时对应的所述用户的脑波信号;以及语音信息获取模块,用于根据所述脑波信号和所述传感器采集的用户语音信号,获得目 标语音信息。
- 根据权利要求27所述的装置,其特征在于,所述装置还包括:运动信号获取模块,用于根据所述脑波信号,获取所述用户在发声时声道咬合部位的运动信号;相应的,所述语音信息获取模块,用于根据所述运动信号和所述传感器采集的用户语音信号,获得所述目标语音信息。
- 根据权利要求27或28所述的装置,其特征在于,所述语音信息获取模块,用于根据所述脑波信号,获取对应的目标音频信号;基于滤波,从所述传感器采集的用户语音信号中滤除所述目标音频信号,得到待滤除信号;从所述传感器采集的用户语音信号中滤除所述待滤除信号,得到所述目标语音信息。
- 根据权利要求27至29任一所述的装置,其特征在于,所述语音信息获取模块,用于基于所述脑波信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息;或,根据所述脑波信号,获取对应的目标音频信号;基于所述目标音频信号和所述传感器采集的用户语音信号,通过循环神经网络模型得到所述目标语音信息。
- 根据权利要求27、28或30所述的装置,其特征在于,所述目标语音信息包括表示所述用户的语音信号的声纹特征。
- 一种语音信号处理装置,其特征在于,所述装置包括:环境语音获取模块,用于获取传感器采集的用户语音信号;振动信号获取模块,用于获取所述用户发出所述语音时对应的振动信号;其中所述振动信号用于表示所述用户的身体部位的振动特征;所述身体部位为当所述用户处于发声状态下,基于发声行为进行相应振动的部位;以及声纹识别模块,用于基于所述传感器采集的用户语音信号以及所述振动信号,进行声纹识别。
- 根据权利要求32所述的装置,其特征在于,所述声纹识别模块,用于根据所述传感器采集的用户语音信号进行声纹识别,得到所述传感器采集的用户语音信号属于用户的第一置信度;根据所述振动信号进行声纹识别,得到所述传感器采集的用户语音信号属于目标用户的第二置信度;根据所述第一置信度和所述第二置信度,得到声纹识别结果。
- 根据权利要求32或33所述的装置,其特征在于,所述装置还包括:脑波信号获取模块,用于获取所述用户发出所述语音时对应的所述用户的脑波信号;相应的,所述声纹识别模块,用于基于所述传感器采集的用户语音信号、所述振动信号以及所述脑波信号,进行声纹识别。
- 一种系统,其特征在于,包括处理器、存储器;所述存储器存储有程序指令,当所述存储器存储的程序指令被所述处理器执行时实现权利要求1至20中任一项所述的方法。
- 一种计算机可读存储介质,其特征在于,包括程序,当其在计算机上运行时,使得计算机执行如权利要求1至20中任一项所述的方法。
- 一种计算机程序,其特征在于,当其在计算机上运行时,使得计算机执行如权利要求1至20中任一项所述的方法。
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2020/093523 WO2021237740A1 (zh) | 2020-05-29 | 2020-05-29 | 一种语音信号处理方法及其相关设备 |
EP20938148.2A EP4141867A4 (en) | 2020-05-29 | 2020-05-29 | VOICE SIGNAL PROCESSING METHOD AND ASSOCIATED RELATED DEVICE |
CN202080026583.9A CN114072875A (zh) | 2020-05-29 | 2020-05-29 | 一种语音信号处理方法及其相关设备 |
US17/994,968 US20230098678A1 (en) | 2020-05-29 | 2022-11-28 | Speech signal processing method and related device thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2020/093523 WO2021237740A1 (zh) | 2020-05-29 | 2020-05-29 | 一种语音信号处理方法及其相关设备 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/994,968 Continuation US20230098678A1 (en) | 2020-05-29 | 2022-11-28 | Speech signal processing method and related device thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021237740A1 true WO2021237740A1 (zh) | 2021-12-02 |
Family
ID=78745413
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/093523 WO2021237740A1 (zh) | 2020-05-29 | 2020-05-29 | 一种语音信号处理方法及其相关设备 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20230098678A1 (zh) |
EP (1) | EP4141867A4 (zh) |
CN (1) | CN114072875A (zh) |
WO (1) | WO2021237740A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111667831B (zh) * | 2020-06-08 | 2022-04-26 | 中国民航大学 | 基于管制员指令语义识别的飞机地面引导系统及方法 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2010217453A (ja) * | 2009-03-16 | 2010-09-30 | Fujitsu Ltd | 音声認識用マイクロホンシステム |
CN101947152A (zh) * | 2010-09-11 | 2011-01-19 | 山东科技大学 | 仿人形义肢的脑电-语音控制系统及工作方法 |
CN103871419A (zh) * | 2012-12-11 | 2014-06-18 | 联想(北京)有限公司 | 一种信息处理方法及电子设备 |
CN110248281A (zh) * | 2018-03-07 | 2019-09-17 | 四川语文通科技有限责任公司 | 在有干扰的环境中独立出自己发声的方法之声带振动匹配 |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9767817B2 (en) * | 2008-05-14 | 2017-09-19 | Sony Corporation | Adaptively filtering a microphone signal responsive to vibration sensed in a user's face while speaking |
JP2010185975A (ja) * | 2009-02-10 | 2010-08-26 | Denso Corp | 車載音声認識装置 |
CA2899676C (en) * | 2013-01-29 | 2020-03-24 | Suzhou Institute Of Nano-Tech And Nano-Bionics (Sinano), Chinese Acade Of Sciences | Electronic skin, preparation method and use thereof |
US20160267911A1 (en) * | 2015-03-13 | 2016-09-15 | Magna Mirrors Of America, Inc. | Vehicle voice acquisition system with microphone and optical sensor |
US10635800B2 (en) * | 2016-06-07 | 2020-04-28 | Vocalzoom Systems Ltd. | System, device, and method of voice-based user authentication utilizing a challenge |
US10573323B2 (en) * | 2017-12-26 | 2020-02-25 | Intel Corporation | Speaker recognition based on vibration signals |
DK3582514T3 (da) * | 2018-06-14 | 2023-03-06 | Oticon As | Lydbehandlingsapparat |
EP3618457A1 (en) * | 2018-09-02 | 2020-03-04 | Oticon A/s | A hearing device configured to utilize non-audio information to process audio signals |
CN209642929U (zh) * | 2019-04-17 | 2019-11-15 | 科大讯飞股份有限公司 | 拾音型降噪耳机及降噪耳麦 |
CN110931031A (zh) * | 2019-10-09 | 2020-03-27 | 大象声科(深圳)科技有限公司 | 一种融合骨振动传感器和麦克风信号的深度学习语音提取和降噪方法 |
- 2020-05-29 CN CN202080026583.9A patent/CN114072875A/zh active Pending
- 2020-05-29 WO PCT/CN2020/093523 patent/WO2021237740A1/zh unknown
- 2020-05-29 EP EP20938148.2A patent/EP4141867A4/en active Pending
- 2022-11-28 US US17/994,968 patent/US20230098678A1/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2010217453A (ja) * | 2009-03-16 | 2010-09-30 | Fujitsu Ltd | 音声認識用マイクロホンシステム |
CN101947152A (zh) * | 2010-09-11 | 2011-01-19 | 山东科技大学 | 仿人形义肢的脑电-语音控制系统及工作方法 |
CN103871419A (zh) * | 2012-12-11 | 2014-06-18 | 联想(北京)有限公司 | 一种信息处理方法及电子设备 |
CN110248281A (zh) * | 2018-03-07 | 2019-09-17 | 四川语文通科技有限责任公司 | 在有干扰的环境中独立出自己发声的方法之声带振动匹配 |
Non-Patent Citations (1)
Title |
---|
See also references of EP4141867A4 * |
Also Published As
Publication number | Publication date |
---|---|
CN114072875A (zh) | 2022-02-18 |
EP4141867A4 (en) | 2023-06-14 |
EP4141867A1 (en) | 2023-03-01 |
US20230098678A1 (en) | 2023-03-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021135577A1 (zh) | 音频信号处理方法、装置、电子设备及存储介质 | |
CN110291489B (zh) | 计算上高效的人类标识智能助理计算机 | |
KR102299764B1 (ko) | 전자장치, 서버 및 음성출력 방법 | |
WO2021249053A1 (zh) | 图像处理的方法及相关装置 | |
WO2021135628A1 (zh) | 语音信号的处理方法、语音分离方法 | |
US11031005B2 (en) | Continuous topic detection and adaption in audio environments | |
WO2022156654A1 (zh) | 一种文本数据处理方法及装置 | |
CN113516990B (zh) | 一种语音增强方法、训练神经网络的方法以及相关设备 | |
KR102412523B1 (ko) | 음성 인식 서비스 운용 방법, 이를 지원하는 전자 장치 및 서버 | |
WO2022253061A1 (zh) | 一种语音处理方法及相关设备 | |
WO2022033556A1 (zh) | 电子设备及其语音识别方法和介质 | |
CN113539290B (zh) | 语音降噪方法和装置 | |
WO2023284435A1 (zh) | 生成动画的方法及装置 | |
CN114242037A (zh) | 一种虚拟人物生成方法及其装置 | |
CN113611318A (zh) | 一种音频数据增强方法及相关设备 | |
US20200098356A1 (en) | Electronic device and method for providing or obtaining data for training thereof | |
CN112384974A (zh) | 电子装置和用于提供或获得用于训练电子装置的数据的方法 | |
US20230098678A1 (en) | Speech signal processing method and related device thereof | |
CN113646838B (zh) | 在视频聊天过程中提供情绪修改的方法和系统 | |
CN115620728A (zh) | 音频处理方法、装置、存储介质及智能眼镜 | |
WO2022143314A1 (zh) | 一种对象注册方法及装置 | |
US20240046946A1 (en) | Speech denoising networks using speech and noise modeling | |
WO2023006001A1 (zh) | 视频处理方法及电子设备 | |
WO2020102943A1 (zh) | 手势识别模型的生成方法、装置、存储介质及电子设备 | |
US11997445B2 (en) | Systems and methods for live conversation using hearing devices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20938148 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2020938148 Country of ref document: EP Effective date: 20221124 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |