WO2021184315A1 - Audio collection device, audio receiving device and audio processing method - Google Patents

Info

Publication number
WO2021184315A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
audio data
control instruction
receiving device
data
Application number
PCT/CN2020/080268
Other languages
English (en)
Chinese (zh)
Inventor
边云锋
莫品西
薛政
刘洋
吴俊峰
Original Assignee
深圳市大疆创新科技有限公司
Application filed by 深圳市大疆创新科技有限公司
Priority to PCT/CN2020/080268 (WO2021184315A1)
Priority to CN202080004930.8A (CN112639963A)
Publication of WO2021184315A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/018 Audio watermarking, i.e. embedding inaudible data in the audio signal
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/26 Pre-filtering or post-filtering
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G10L2021/0135 Voice conversion or morphing
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/66 Remote control of cameras or camera parts, e.g. by remote control devices
    • H04N23/661 Transmitting camera control signals through networks, e.g. control via the Internet
    • H04N23/667 Camera operation mode switching, e.g. between still and video, sport and normal or high- and low-resolution modes
    • H04N23/67 Focus control based on electronic image sensor signals

Definitions

  • The embodiments of the present application relate to the field of information processing technology, and in particular to an audio collection device, an audio receiving device, an audio processing method, and an audio collection system.
  • Voice interaction is a common form of human-computer interaction. During voice interaction, a user can control a controlled device by voice, which frees the user's hands. However, in some scenarios, if the user is far from the controlled device, the signal-to-noise ratio of the audio data collected by the controlled device drops, and the device may be unable to accurately recognize the user's instructions. For example, when a user mounts a sports camera on a selfie stick to shoot and controls the camera by voice, the selfie stick increases the control distance, making it difficult for the camera to collect clear audio data, so the accuracy of voice recognition is greatly reduced.
  • The embodiments of the present application provide an audio collection device, an audio receiving device, an audio processing method, and an audio collection system.
  • An audio collection device is provided, including: a microphone, a processor, and a wireless transceiver;
  • the processor is configured to perform instruction recognition processing on the audio data collected by the microphone to obtain a control instruction, and is also configured to send the audio data and the control instruction to an audio receiving device through the wireless transceiver;
  • the audio data is used by one or more electronic devices to perform media processing;
  • the control instruction is used by one or more electronic devices to perform control processing;
  • the electronic device is the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • An audio collection device is provided, including: a microphone, a processor, and a wireless transceiver. The processor is configured to perform recognition processing on the audio data collected by the microphone to obtain auxiliary identification information, and is also configured to send the audio data and the auxiliary identification information to an audio receiving device through the wireless transceiver;
  • the audio data is used by one or more electronic devices to perform media processing;
  • the auxiliary identification information is used by one or more electronic devices to identify control instructions from the audio data;
  • the electronic device is the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • An audio receiving device is provided, including: a wireless transceiver and a processor;
  • the processor is configured to receive, through the wireless transceiver, audio data and control instructions sent by an audio collection device, wherein the control instructions are obtained by the audio collection device performing instruction recognition processing on the collected audio data;
  • the audio data is used by one or more electronic devices to perform media processing;
  • the control instruction is used by one or more electronic devices to perform control processing, the electronic device being the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • An audio receiving device is provided, including: a wireless transceiver and a processor;
  • the processor is configured to receive, through the wireless transceiver, audio data and auxiliary identification information sent by an audio collection device, wherein the auxiliary identification information is obtained by the audio collection device performing identification processing on the collected audio data;
  • the audio data is used by one or more electronic devices to perform media processing;
  • the auxiliary identification information is used by one or more electronic devices to identify control instructions from the audio data;
  • the electronic device is the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • An audio processing method is provided, which is applied to an audio collection device, and the method includes:
  • the audio data is used by one or more electronic devices to perform media processing;
  • the control instruction is used by one or more electronic devices to perform control processing;
  • the electronic device is the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • An audio processing method is provided, which is applied to an audio collection device, and the method includes:
  • the audio data is used by one or more electronic devices to perform media processing;
  • the auxiliary identification information is used by one or more electronic devices to identify control instructions from the audio data;
  • the electronic device is the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • An audio processing method is provided, which is applied to an audio receiving device, and the method includes:
  • the control instruction is obtained by the audio collection device performing instruction recognition processing on the collected audio data;
  • the audio data is used by one or more electronic devices to perform media processing;
  • the control instruction is used by one or more electronic devices to perform control processing;
  • the electronic device is the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • An audio processing method is provided, which is applied to an audio receiving device, and the method includes:
  • the auxiliary identification information is obtained by the audio collection device performing identification processing on the collected audio data; the audio data is used by one or more electronic devices to perform media processing, and the auxiliary identification information is used by one or more electronic devices to identify a control instruction from the audio data; the electronic device is the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • An audio collection system is provided, including:
  • an audio collection device and an audio receiving device;
  • the audio collection device is configured to perform instruction recognition processing on the collected audio data to obtain a control instruction, and to send the audio data and the control instruction to the audio receiving device via a wireless network;
  • the audio data is used by one or more electronic devices to perform media processing;
  • the control instruction is used by one or more electronic devices to perform control processing;
  • the electronic device is the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • An audio collection system is provided, including:
  • an audio collection device and an audio receiving device;
  • the audio collection device is configured to perform identification processing on the collected audio data to obtain auxiliary identification information, and to send the audio data and the auxiliary identification information to the audio receiving device via a wireless network;
  • the audio data is used by one or more electronic devices to perform media processing;
  • the auxiliary identification information is used by one or more electronic devices to identify control instructions from the audio data;
  • the electronic device is the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • In the embodiments of the present application, the audio collection device communicates with the audio receiving device through a wireless network. Because the audio collection device can be located very close to the user, it can collect clear audio data, which greatly improves the accuracy of speech recognition. In addition, signals suffer some loss during wireless communication, and wireless links are not fully stable. To ensure that the voice commands issued by the user are accurately recognized, the embodiments of the present application perform recognition on the audio data at the audio collection device, before the audio data is wirelessly transmitted, which further ensures the accuracy of speech recognition.
  • Fig. 1 is a schematic structural diagram of a first audio collection device according to an exemplary embodiment of the present application.
  • Fig. 2a is a diagram showing an application scenario according to an exemplary embodiment of the present application.
  • Fig. 2b is a diagram showing another application scenario according to an exemplary embodiment of the present application.
  • Fig. 2c is a diagram showing another application scenario according to an exemplary embodiment of the present application.
  • Fig. 2d is a diagram showing still another application scenario according to an exemplary embodiment of the present application.
  • Fig. 3a is a working flow chart of the first audio collection device according to an exemplary embodiment of the present application.
  • Fig. 3b is a working flow chart of the first audio receiving device according to an exemplary embodiment of the application.
  • Fig. 3c is a schematic diagram of the division of labor between the first audio receiving device and the electronic device according to an exemplary embodiment of this application.
  • Fig. 4a is a working flow chart of a second audio collection device according to an exemplary embodiment of the application.
  • Fig. 4b is a working flow chart of the second audio receiving device according to an exemplary embodiment of the application.
  • Fig. 5 is a schematic structural diagram of an audio collection system according to an exemplary embodiment of the application.
  • Although the terms first, second, third, etc. may be used in the embodiments of the present application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other.
  • For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information.
  • Depending on the context, the word "if" as used herein can be interpreted as "when", "upon", or "in response to determining".
  • An embodiment of the present application proposes an audio collection device that is provided with a wireless transceiver and can communicate wirelessly with the audio receiving device.
  • The wireless transceiver can be a wireless network card.
  • The wireless transceiver can also be an integrated module.
  • It can also take other hardware forms, which is not limited in this application.
  • Because the audio collection device communicates with the controlled device remotely, the user can conveniently place it nearby or wear it, so that it collects clear audio data with a sufficiently high signal-to-noise ratio and the accuracy of speech recognition is greatly improved.
  • The audio collection device can take different product forms in different scenarios.
  • The audio collection device may be a wireless microphone that the user holds for voice input.
  • The audio collection device can also be a smart speaker that the user places nearby.
  • The audio collection device may also be a device worn by the user, such as a wireless headset or a wearable speaker.
  • The audio data collected by the audio collection device can be wirelessly transmitted to the controlled device, and the controlled device performs instruction recognition processing on the received audio data to obtain the control instruction and respond to it.
  • In that approach, however, the audio data must be wirelessly transmitted before it is used for instruction recognition processing. The signal suffers some loss during wireless transmission, and wireless transmission also has stability issues (such as packet loss). As a result, the audio data received by the controlled device may be defective, so the controlled device may still be unable to accurately recognize the control instruction.
  • Refer to FIG. 1, which is a schematic structural diagram of a first audio collection device according to an exemplary embodiment of the present application.
  • The audio collection device 10 includes:
  • a microphone 101, a processor 102, and a wireless transceiver 103.
  • The processor 102 is configured to perform instruction recognition processing on the audio data collected by the microphone 101 to obtain a control instruction, and is also configured to send the audio data and the control instruction to the audio receiving device through the wireless transceiver 103. The audio data is used for media processing.
  • The audio receiving device is any device that receives the audio data and control instructions. It is provided with a wireless transceiver that can establish a wireless connection channel with the wireless transceiver of the audio collection device.
  • The audio receiving device itself may be an electronic device that needs to use the audio data and respond to control instructions.
  • For example, the audio receiving device itself can be a sports camera. The sports camera can communicate wirelessly with the audio collection device 10, receive the audio data and control instructions sent by the audio collection device 10, use the received audio data for media processing, and respond to the received control instructions.
  • The audio receiving device may also be electrically connected to an electronic device. In that case, the audio data and control instructions received by the audio receiving device are forwarded to the connected electronic device, which uses the audio data to perform media processing and executes the operation corresponding to the control instruction.
  • For example, the electronic device may be a sports camera, and the audio receiving device may be an external plug-in of the sports camera.
  • Figure 2c shows a scene that includes two audio receiving devices.
  • In this scene, each audio receiving device is itself an electronic device.
  • The first audio receiving device is a pan/tilt, and the second audio receiving device is a sports camera.
  • Both audio receiving devices can receive the audio data and control instructions sent by the audio collection device 10. The audio data can be used by the sports camera for media processing, and a control instruction can be directed at either the sports camera or the pan/tilt.
  • For example, the user can issue an instruction such as the following when shooting: "The pan/tilt rotates 45 degrees to the left, and the camera focuses on the face."
  • From the captured audio data corresponding to "the pan/tilt rotates 45 degrees to the left", the audio collection device 10 can recognize the control instruction used to control the pan/tilt.
  • The pan/tilt responds accordingly after receiving that control instruction.
  • Likewise, the audio collection device 10 can recognize the control instruction used to control the sports camera, and the sports camera responds accordingly after receiving it.
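  • The Fig. 2c scene can be illustrated with a minimal sketch in which both receivers get every recognized control instruction and each one executes only the instructions addressed to it. The text does not say how an instruction identifies its target, so the `target` field, the device labels, and the class and function names below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlInstruction:
    target: str  # hypothetical device label, e.g. "pan_tilt" or "camera"
    action: str  # hypothetical action name


class Receiver:
    """Stand-in for an audio receiving device that is itself an electronic device."""

    def __init__(self, name):
        self.name = name
        self.executed = []

    def handle(self, instruction):
        """Execute the instruction only if it is addressed to this device."""
        if instruction.target != self.name:
            return False
        self.executed.append(instruction.action)
        return True


def broadcast(instructions, receivers):
    """Deliver every instruction to every receiver, as in the two-receiver scene."""
    for instruction in instructions:
        for receiver in receivers:
            receiver.handle(instruction)
```

  • With receivers named "pan_tilt" and "camera", broadcasting one instruction for each leaves exactly one executed action on each device. Filtering on the receiving side is only one possible design; the collection device could equally address each receiver separately.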
  • In the examples above, the audio receiving device is itself an electronic device that uses the audio data and responds to control instructions.
  • Alternatively, the audio receiving device may serve only as a repeater that is electrically connected to the electronic device and communicates wirelessly with the audio collection device 10.
  • Figure 2d shows a scene that includes two audio receiving devices and two electronic devices. The two electronic devices are a sports camera and a pan/tilt, each equipped with an audio receiving device in the form of a plug-in.
  • For other details, refer to the description of FIG. 2c, which is not repeated here.
  • Both the audio collection device and the audio receiving device may have an independent on-device power supply system.
  • The media processing of audio data may specifically include audio editing and/or audio and video editing.
  • Audio editing can mean editing the audio data or changing its sound effects, including but not limited to enhancement, polishing, and voice changing. It can also mean noise reduction and filtering of the audio data, or audio recording, audio broadcasting, and similar work.
  • There are also many kinds of audio and video editing. For example, one is to encapsulate the audio data together with captured video data to generate a video file. Other audio and video editing methods exist and are not listed here.
  • The electronic device may include one or more cameras capable of capturing images.
  • Because the instruction recognition processing is moved to the audio collection device and is no longer performed by the audio receiving device or the electronic device connected to it, the audio data used for instruction recognition never passes through wireless transmission. It therefore suffers no transmission loss, and the accuracy of instruction recognition can be further improved.
  • Audio data can be encoded before transmission.
  • The processor of the audio collection device may encode the audio data before sending it to the audio receiving device through the wireless transceiver.
  • A preferred implementation is to perform instruction recognition processing on the audio data before encoding, that is, on the original audio data collected by the microphone. In this way, recognition accuracy can be maintained at a high level.
  • An optional method of using a speech recognition model for instruction recognition includes the following steps:
  • In step S1, an audio segment that may contain a voice is detected first.
  • Such a segment can be detected by a voice activity detection algorithm or by a sliding window.
  • An audio segment that may contain voice is not limited to segments containing a human voice; it may also be a segment containing non-human sound.
  • For example, the audio corresponding to a control instruction can be non-voice audio such as three taps in a short time or two claps.
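  • As an illustration of step S1, the following is a minimal sketch of sliding-window detection based on short-time energy. It is not the detection algorithm of the application, which is not specified here; the frame length, hop size, and threshold are assumed values, and the function names are hypothetical. Because it looks only at energy, it flags taps or claps as readily as speech, which matches the note that the detected segment need not contain a human voice.

```python
def frame_energies(samples, frame_len, hop):
    """Mean squared amplitude of each sliding window over the samples."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    return energies


def detect_segments(samples, frame_len=160, hop=80, threshold=0.01):
    """Return (start_sample, end_sample) spans whose frame energy exceeds threshold.

    frame_len, hop, and threshold are illustrative assumptions, not values
    taken from the application.
    """
    segments = []
    seg_start = None
    seg_end = None
    for i, energy in enumerate(frame_energies(samples, frame_len, hop)):
        start = i * hop
        if energy > threshold:
            if seg_start is None:
                seg_start = start  # first active frame opens a segment
            seg_end = start + frame_len
        elif seg_start is not None:
            segments.append((seg_start, seg_end))  # first quiet frame closes it
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, seg_end))
    return segments
```

  • The returned spans would then be passed to feature extraction. A production voice activity detector would normally add smoothing (hangover frames) and an adaptive noise floor.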
  • In step S2, there are also multiple options for the extracted audio features.
  • For example, they can be MFCC (Mel Frequency Cepstral Coefficient) features, LPC features, Fbank features, bottleneck features, and so on, which those skilled in the art can select according to actual needs.
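  • MFCC and Fbank features both rest on a mel-spaced filter bank. As a small illustration of that one ingredient (not a full MFCC pipeline), the sketch below converts between hertz and mels using one commonly used formula, mel = 2595 · log10(1 + f/700), and places filter center frequencies evenly on the mel scale. The function names and the choice of this particular mel formula are assumptions.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (one common variant of the mel scale)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters, f_min, f_max):
    """Center frequencies (Hz) of num_filters filters spaced evenly in mels."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (num_filters + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(num_filters)]
```

  • The centers are dense at low frequencies and sparse at high frequencies. A full MFCC front end would additionally apply framing, windowing, an FFT, triangular filters at these centers, a logarithm, and a discrete cosine transform.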
  • The designated speech recognition model is a pre-trained model.
  • Many speech recognition models are available, such as GMM-HMM (Gaussian Mixture Model-Hidden Markov Model), DNN (Deep Neural Network), LSTM (Long Short-Term Memory network), and CNN (Convolutional Neural Network) models.
  • In one example, the model is the aforementioned GMM-HMM model and the audio features are MFCC features, and the training process includes the following steps.
  • Step A: Use several command voices and negative samples as training data.
  • Step B: Perform MFCC feature extraction on the training data.
  • Step C: Use the extracted MFCC features to train the GMM-HMM speech recognition model to obtain the required GMM-HMM model.
  • HMM parameter estimation can use the Baum-Welch algorithm with several Gaussian components, and GMM training can use the EM (Expectation Maximization) method.
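  • To make the EM step concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture. It illustrates only the GMM part, on scalar data, and is not the application's training procedure: real MFCC features are multi-dimensional, and the HMM part (Baum-Welch) is not shown. The function names, initial values, and iteration count are assumptions.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50):
    """Fit a 1-D Gaussian mixture by EM, starting from the given parameters."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate parameters from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the density finite
            weights[j] = nj / len(data)
    return means, variances, weights
```

  • On data drawn from two well-separated clusters, the estimated means settle on the cluster centers and the weights on the cluster proportions. In a GMM-HMM, a mixture of this kind models the feature distribution of each HMM state.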
  • The recognized control instruction needs to be sent to the audio receiving device together with the audio data.
  • The control instruction may be encapsulated with the audio data into a data packet, and the wireless transceiver sends the data packet to the audio receiving device.
  • The protocol used for the wireless network transmission can be a public protocol or a private protocol, and the wireless communication frequency band is not limited to 2.4 GHz or 1.9 GHz.
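  • A minimal sketch of direct encapsulation follows: a fixed header carrying a control instruction identifier and the audio length, followed by the audio bytes. The packet layout, the magic value, and the function names are hypothetical; the application does not define a packet format.

```python
import struct

# Hypothetical header: 2-byte magic, 16-bit instruction id, 32-bit audio length,
# all little-endian with no padding (8 bytes in total).
HEADER = struct.Struct("<2sHI")
MAGIC = b"AC"


def pack_packet(instruction_id, audio_bytes):
    """Encapsulate a control instruction id and audio payload into one packet."""
    return HEADER.pack(MAGIC, instruction_id, len(audio_bytes)) + audio_bytes


def unpack_packet(packet):
    """Recover (instruction_id, audio_bytes) on the audio receiving device side."""
    magic, instruction_id, length = HEADER.unpack_from(packet)
    if magic != MAGIC:
        raise ValueError("unexpected packet magic")
    audio = packet[HEADER.size:HEADER.size + length]
    if len(audio) != length:
        raise ValueError("truncated audio payload")
    return instruction_id, audio
```

  • Carrying the instruction in a separate header field keeps the audio payload untouched, at the cost of extra bytes in every packet.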
  • If the control instruction and the audio data are directly encapsulated into a data packet for sending, more transmission bandwidth is used and the transmission of the control instruction is not sufficiently real-time. Therefore, in a preferred embodiment, the control instruction can be embedded in the audio data before the data packet is encapsulated. The audio data with the embedded control instruction is then encapsulated in a data packet, and the encapsulated data packet is sent to the audio receiving device.
  • After the audio data is sent to the audio receiving device, it is used for media processing. Therefore, embedding the control instruction should affect the audio data itself as little as possible. For this reason, in a preferred embodiment, the control instruction can be converted into an audio digital watermark and then embedded in the audio data, so that the audio data is not affected.
  • The embodiment of the application provides an implementation that uses frequency modulation to convert a control instruction into an audio digital watermark, that is, into an audio digital watermark whose frequency lies within a specified frequency range, where the specified frequency range is outside the range of human hearing.
  • The specific implementation steps are as follows:
  • Step X Convert the control command c(t) into a binary control command cb of length M, where cb(i) is the bit value of the i-th bit.
  • Step Y In order not to affect the content of the audio data itself, a frequency range outside the human auditory frequency range can be selected as the transmission frequency band of the control command, for example, a frequency range of 20kHz-24kHz can be selected.
  • the binary control instruction cb can be used to generate the audio digital watermark S(t) in the frequency range of 20kHz-24kHz.
  • the audio digital watermark S(t) after the conversion of the control command can be obtained.
  • the high-frequency signal in the audio data x(t) can be filtered out, for example, by passing the audio data x(t) through a low-pass filter.
  • the original signal in the frequency range of 20kHz-24kHz in the audio data x(t) can be filtered out, so that the audio digital watermark S(t) will not be superimposed and distorted after being combined with the audio data x(t).
  • the audio data x(t) after passing through the low-pass filter can be combined with the audio digital watermark S(t).
  • z(t) represents audio data with embedded control commands.
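  • the embedding steps above can be sketched as follows. The exact formulas for S(t) and z(t) are not reproduced in this text, so the sketch assumes a simple frequency-shift keying scheme: each bit cb(i) becomes a short tone at one of two assumed frequencies inside 20kHz-24kHz, and z(t) is the low-pass-filtered x(t) plus S(t). The sampling rate, tone frequencies, symbol length, amplitude, and filter design are all assumptions.

```python
import math

FS = 48_000                # sampling rate in Hz (assumed; must exceed 2 x 24 kHz)
F0, F1 = 21_000, 23_000    # assumed tones for bit 0 / bit 1, inside 20kHz-24kHz
SYMBOL = 480               # samples per bit (10 ms at 48 kHz), assumed
AMP = 0.05                 # watermark amplitude, assumed

def to_bits(command, m=8):
    """Step X: binary control instruction cb of length M, most significant bit first."""
    return [(command >> (m - 1 - i)) & 1 for i in range(m)]

def watermark(bits):
    """Step Y: build S(t) as one tone burst per bit (simple frequency-shift keying)."""
    s = []
    for b in bits:
        f = F1 if b else F0
        s.extend(AMP * math.sin(2 * math.pi * f * n / FS) for n in range(SYMBOL))
    return s

def lowpass_taps(cutoff=19_000, n_taps=101):
    """Hamming-windowed sinc FIR low-pass filter coefficients."""
    mid = (n_taps - 1) // 2
    taps = []
    for k in range(n_taps):
        d = k - mid
        if d == 0:
            ideal = 2 * cutoff / FS
        else:
            ideal = math.sin(2 * math.pi * cutoff * d / FS) / (math.pi * d)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (n_taps - 1))
        taps.append(ideal * window)
    return taps

def fir(x, taps):
    """Zero-phase, same-length FIR filtering with zero padding at the edges."""
    mid = len(taps) // 2
    out = []
    for n in range(len(x)):
        acc = 0.0
        for k, h in enumerate(taps):
            j = n + mid - k
            if 0 <= j < len(x):
                acc += h * x[j]
        out.append(acc)
    return out

def embed(x, command, m=8):
    """Low-pass x(t), then add S(t) to get z(t); audio shorter than S(t) truncates it."""
    s = watermark(to_bits(command, m))
    x_lp = fir(x, lowpass_taps())
    return [v + (s[n] if n < len(s) else 0.0) for n, v in enumerate(x_lp)]
```

  • the low-pass cutoff is placed slightly below 20 kHz so that the filter's transition band ends before the watermark band begins; this margin is a design choice of the sketch.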
  • the audio data into which the control instruction is embedded may be encoded audio data or audio data before encoding.
  • the audio data into which the control instruction is embedded is the audio data before encoding. In this way, once the audio data embedded with the control instruction is obtained, it can be encoded and compressed as a whole. Compared with directly compressing audio data in which the control instruction is not embedded, the compressible space is larger, which can reduce the bandwidth required for wireless transmission.
  • FIG. 3a is a working flow chart of the first audio collection device according to an exemplary embodiment of the present application.
  • the audio acquisition device converts the control instruction into an audio digital watermark and then embeds it in the audio data.
  • the audio data embedded with the control instruction is encoded and encapsulated into a data packet and sent to the audio receiving device.
  • the audio receiving device needs to perform processing that corresponds to the processing on the audio collection device side.
  • the processing of audio data by the audio collection device is shown in FIG. 3a, and the working flow of the audio receiving device can be seen in FIG. 3b.
  • the audio receiving device 20 includes a wireless transceiver 201 and a processor 202.
  • the processor 202 may receive the data packet sent by the audio collection device through the wireless transceiver 201, and decapsulate the data packet to obtain encoded audio data embedded with control instructions. Further, the processor 202 may also decode the audio data obtained by decapsulation to obtain audio data embedded with control instructions. Finally, the audio data and control instructions are separated from the audio data embedded with the control instructions.
  • the following description takes as an example the case where the control instruction is converted into a 20kHz-24kHz audio digital watermark. Separating the audio data embedded with control instructions may include the following steps:
  • Step a) the audio data embedded with the control instruction is filtered through a filter to filter out signals with a frequency above 20 kHz to obtain audio data x(t).
  • the frequency domain information is analyzed, and the corresponding binary control instruction cb is obtained.
  • the binary control instruction cb is transformed to obtain the control instruction c(t).
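  • the frequency-domain analysis and the conversion back to a control instruction can be sketched as follows. The sketch assumes the watermark was generated with two tones, one per bit value, and a fixed number of samples per bit; these parameters and the function names are assumptions, not details given in the application. The low-pass filtering of step a) is omitted here.

```python
import cmath
import math

FS = 48_000                # sampling rate in Hz (assumed)
F0, F1 = 21_000, 23_000    # assumed tones for bit 0 / bit 1
SYMBOL = 480               # samples per bit, assumed

def tone_power(frame, freq):
    """Energy of `frame` at `freq`, from a single-bin discrete Fourier transform."""
    acc = sum(v * cmath.exp(-2j * math.pi * freq * n / FS) for n, v in enumerate(frame))
    return abs(acc) ** 2

def extract_bits(z, m):
    """Recover the binary control instruction cb of length M from z(t)."""
    bits = []
    for i in range(m):
        frame = z[i * SYMBOL:(i + 1) * SYMBOL]
        bits.append(1 if tone_power(frame, F1) > tone_power(frame, F0) else 0)
    return bits

def bits_to_command(bits):
    """Transform cb back into an integer control instruction (MSB first)."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value
```

  • comparing the energy at the two tone frequencies avoids an absolute threshold, so the decision does not depend on the overall signal level.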
  • when the audio receiving device is used as a repeater, and audio data and control instructions need to be provided to the electronic device connected to the audio receiving device for processing, the tasks of data packet decapsulation, audio decoding, and watermark extraction can be flexibly distributed between the audio receiving device and the electronic device.
  • the audio receiving device may only be used to directly send the received data packet to the electronic device connected to it, and the decapsulation of the data packet and audio decoding are all performed by the electronic device.
  • the audio receiving device may also send the separated directly usable control instructions and audio data to the electronic device after the decapsulation of the data packet, audio decoding, and watermark extraction are completed.
  • two hardware links need to be configured between the audio receiving device and the electronic device to transmit audio data and control instructions respectively, which increases the cost of hardware.
  • FIG. 3c is a schematic diagram of the division of labor between an audio receiving device and an electronic device according to an exemplary embodiment of the present application.
  • the audio data embedded with the control instruction can be sent to the electronic device 30, and the electronic device 30 performs the separation work of the audio data embedded with the control instruction.
  • only one hardware link for transmitting the audio data embedded with the control instruction needs to be configured between the audio receiving device 20 and the electronic device 30.
  • a hardware link can be saved and costs can be reduced.
  • the audio collection device may also be provided with a control sensor.
  • the control sensor can be a pressed button or a touch sensing module, which can generate a corresponding control instruction according to a user's trigger.
  • the control sensor provides users with more varied control methods. In one scenario, the user wants to control a remote sports camera to press the shutter. In addition to voice control through the audio collection device, the user can also control it through the touch sensor or the buttons on the device, which is very convenient.
  • control command generated by the control sensor can also be encapsulated together with audio data into a data packet and sent to the audio receiving device.
  • control command generated by the control sensor can also be embedded in audio data, or embedded in audio data after being converted into an audio digital watermark. This part of the content can refer to the related content of the control instruction recognized by the voice, which will not be repeated here.
  • the audio segment from which the control instruction can be recognized (hereinafter referred to as the target audio segment) can be further processed.
  • the processing of the target audio segment may include audio effect processing such as muting, enhancing, and changing voice.
  • the electronic device is a drone, and the drone is shooting on the ground, and the user is using the audio collection device provided in this embodiment of the application to dub the video captured by the drone.
  • the audio collection device can mute the target audio segment after determining the target audio segment, thereby eliminating the aforementioned voice commands that the user does not want to input.
  • the target audio segment can also be enhanced or voice-changed to highlight the voice command.
  • in one implementation, this can be used for the secondary recognition of the electronic device to recognize the voice command more accurately; in another implementation, it can make it easier to eliminate the voice command when editing the video later.
  • of course, voice change processing can also be used to make the video more entertaining.
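  • the muting and enhancement of a target audio segment can be sketched as follows, assuming the audio is a list of samples and the target audio segment is given by sample indices. The function names, the gain, and the clipping limit are illustrative assumptions.

```python
def mute_segment(samples, start, end):
    """Return a copy of `samples` with the target segment [start, end) set to zero."""
    start, end = max(0, start), min(len(samples), end)
    out = list(samples)
    for n in range(start, end):
        out[n] = 0
    return out

def enhance_segment(samples, start, end, gain=2.0, limit=1.0):
    """Return a copy with the target segment amplified and clipped to +/- limit."""
    start, end = max(0, start), min(len(samples), end)
    out = list(samples)
    for n in range(start, end):
        out[n] = max(-limit, min(limit, out[n] * gain))
    return out
```

  • both helpers leave the samples outside the target audio segment untouched, so the rest of the recording remains usable for media processing.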
  • the foregoing is a detailed description of the first audio collection device provided by the embodiment of the present application.
  • the first audio collection device provided by the embodiment of the present application communicates with the audio receiving device through a wireless network. Since the audio collection device can be located very close to the user, the audio collection device can collect clear audio data, which greatly improves the accuracy of speech recognition. In addition, considering that signals suffer a certain loss during wireless communication, and that the stability of wireless communication is not high enough, in order to ensure that the voice commands issued by the user can be accurately recognized, the embodiment of the present application recognizes the audio data on the audio collection device side, before the audio data is wirelessly transmitted, which further ensures the accuracy of speech recognition.
  • the audio data received by the audio receiving device may have some defects, causing it to still not be able to accurately recognize the control command when performing the command recognition processing.
  • an embodiment of the present application provides a second audio collection device.
  • the audio collection device also includes a microphone, a processor, and a wireless transceiver.
  • the difference from the first audio collection device provided by the embodiment of the present application is that the processor performs preliminary recognition processing on the audio data collected by the microphone, and the preliminary recognition processing obtains auxiliary identification information instead of a control instruction.
  • the recognized auxiliary identification information and audio data can be sent to the audio receiving device.
  • the audio data is still used for media processing, but the auxiliary identification information is used to assist the electronic device in the secondary identification.
  • the electronic device may perform secondary identification on the received audio data according to the auxiliary identification information, thereby identifying and obtaining the control instruction.
  • the electronic equipment may be the above-mentioned audio receiving device or other electronic equipment that is communicatively connected with the audio receiving device, that is, the audio receiving device itself may be the electronic equipment mentioned, and it may also be used as a repeater to connect the electronic equipment and the audio collecting device.
  • the audio receiving device may perform secondary identification on the received audio data according to the auxiliary identification information, thereby identifying and obtaining the control instruction.
  • the auxiliary identification information is obtained by identification processing based on the audio data before wireless transmission, the identification accuracy of the auxiliary identification information is guaranteed.
  • the electronic device performs secondary recognition, although it is a command recognition processing on the audio data after wireless transmission, it can use the above-mentioned auxiliary recognition information in the recognition process, so the recognition accuracy rate will also be improved.
  • the audio collection device since the audio collection device only needs to recognize the auxiliary identification information, compared to directly recognizing the control instruction, the required computing power can be reduced.
  • the auxiliary identification information is information used to assist the electronic device in secondary identification.
  • the auxiliary identification information may be one or more of the following information: segment identification information used to indicate the audio segment corresponding to the control instruction, the type of audio data corresponding to the control instruction, and the control content information corresponding to the control instruction.
  • with the segment identification information, the electronic device can identify the audio segment corresponding to the control instruction in the audio data. Therefore, when the electronic device performs secondary identification, it can determine the audio segment corresponding to the control instruction according to the segment identification information, thereby reducing the probability of missing control instructions.
  • as for the type of audio data corresponding to the above-mentioned control instruction, in one embodiment, it can indicate whether the audio data corresponding to the control instruction is a voice type (such as a human voice) or a non-voice type (such as clapping, knocking, clicking sounds, etc.); in another embodiment, it can also indicate the language of the audio data corresponding to the control instruction, such as Chinese, English, or Japanese.
  • the control content information corresponding to the aforementioned control instruction is the specific content corresponding to the control instruction.
  • the control content information can be "adjust the lens focal length to 50mm" or "switch to anti-shake mode" and so on.
  • the control content information can help the electronic device to perform comparison and correction during the second recognition, so as to avoid recognizing the wrong control instruction.
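  • the three kinds of auxiliary identification information, and their use during secondary identification, can be sketched as follows. The field names, the JSON serialization, and the `recognizer` callable (assumed to return a ranked list of `(text, score)` candidates) are hypothetical; the application does not define a data format or a recognizer interface.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, List, Optional, Sequence, Tuple

@dataclass
class AuxiliaryInfo:
    segment_start: int                  # first sample of the audio segment
    segment_end: int                    # one past the last sample of the segment
    audio_type: str = "voice"           # e.g. "voice" or "non-voice"
    content_hint: Optional[str] = None  # control content information, if any

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "AuxiliaryInfo":
        return cls(**json.loads(text))

def secondary_recognize(
    audio: Sequence[float],
    info: AuxiliaryInfo,
    recognizer: Callable[[Sequence[float]], List[Tuple[str, float]]],
) -> Optional[str]:
    """Run the recognizer only on the indicated segment and correct it with the hint."""
    segment = audio[info.segment_start:info.segment_end]
    candidates = recognizer(segment)
    if info.content_hint is not None:
        # Prefer a candidate that agrees with the control content information.
        for text, _score in candidates:
            if text == info.content_hint:
                return text
    return candidates[0][0] if candidates else info.content_hint
```

  • restricting recognition to the indicated segment reflects the role of the segment identification information, and the comparison against `content_hint` reflects the comparison and correction role of the control content information.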
  • on the receiving side, there can be one or more electronic devices, and there can also be one or more audio receiving devices.
  • for details, refer to the related description of the first audio collection device provided in the embodiment, which will not be repeated here.
  • the auxiliary identification information can be encapsulated with the audio data into data packets, or can be converted into an audio digital watermark and embedded in the audio data, and so on.
  • see Figures 4a and 4b. Figure 4a is a working flow chart of a second audio collection device according to an exemplary embodiment of this application, and Figure 4b is a working flow chart of a second audio receiving device according to an exemplary embodiment of this application.
  • the processing performed by the processor on the audio segment corresponding to the control instruction includes, but is not limited to, one or more of the following: enhancement, noise reduction, polishing. Performing such processing on the audio segment can make the voice instructions in the audio data more prominent, so that the electronic device can be more accurate in the secondary recognition.
  • the second audio collection device uses the audio data before wireless transmission to perform identification processing to obtain auxiliary identification information.
  • the auxiliary identification information is sent to the audio receiving device side, it can assist the electronic device to perform secondary identification of the audio data, thereby improving the accuracy of the secondary identification of the electronic device.
  • this second type of audio collection device only needs to recognize auxiliary identification information and does not need to recognize control instructions, so it requires less computing power and is easier to implement.
  • the audio receiving device includes:
  • the processor is configured to receive audio data and control instructions sent by an audio collection device through the wireless transceiver; wherein the control instructions are obtained by the audio collection device performing instruction recognition processing on the collected audio data;
  • the audio data is used for one or more electronic devices to perform media processing
  • the control instruction is used for one or more electronic devices to perform control processing, the electronic device being the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • the processor is further configured to decode the received audio data.
  • control instruction is obtained by the audio collection device performing instruction recognition processing on the audio data before encoding.
  • the processor is further configured to decapsulate the data packet received through the wireless transceiver to obtain the audio data and the control instruction.
  • the audio data embedded with the control instruction is obtained by decapsulating the data packet.
  • the processor is further configured to separate the audio data embedded with the control instruction to obtain the audio digital watermark and audio data converted by the control instruction.
  • the frequency of the audio digital watermark is within a specified frequency range, wherein the specified frequency range is a frequency range outside the human auditory frequency range.
  • the audio data obtained by separating the audio data embedded with the control instruction is pre-encoding audio data.
  • the control instruction is obtained by the audio collection device intercepting audio segments containing voice from the audio data, extracting audio features of the audio segments, and then inputting the audio features into a specified voice recognition model.
  • the received control instruction further includes another control instruction generated by the control sensor of the audio collection device in response to a user's trigger.
  • the target audio segment in the received audio data is processed by the audio collection device, and the target audio segment is the audio segment corresponding to the control instruction.
  • the processing of the target audio segment includes one or more of the following: silence, enhancement, and voice change.
  • the type of audio data corresponding to the control instruction includes: a voice type and/or a non-voice type.
  • the electronic device is another electronic device communicatively connected to the audio receiving device;
  • the processor is further configured to send the received audio data and control instructions to the electronic device.
  • the electronic device is the audio receiving device
  • the processor is further configured to perform media processing using the audio data, and perform operations corresponding to the control instructions.
  • the media processing includes: audio editing and/or audio and video editing.
  • the electronic device includes one or more cameras.
  • the electronic device includes any of the following devices: drones, cameras, pan-tilts, and unmanned vehicles.
  • the embodiment of the application also provides a second type of audio receiving device.
  • the audio receiving device includes:
  • the processor is configured to receive audio data and auxiliary identification information sent by an audio collection device through the wireless transceiver; wherein the auxiliary identification information is obtained by the audio collection device performing identification processing on the collected audio data;
  • the audio data is used for one or more electronic devices to perform media processing
  • the auxiliary identification information is used for one or more electronic devices to identify control instructions from the audio data according to the auxiliary identification information
  • the electronic device is the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • the auxiliary identification information includes one or more of the following information: segment identification information used to indicate the audio segment corresponding to the control instruction, the type of audio data corresponding to the control instruction, and the control content information corresponding to the control instruction.
  • the type of audio data corresponding to the control instruction includes: a voice type and/or a non-voice type.
  • the processor is further configured to decode the received audio data.
  • the auxiliary identification information is obtained by the audio collection device performing identification processing on the audio data before encoding.
  • the processor is further configured to decapsulate the data packet received through the wireless transceiver to obtain the audio data and the auxiliary identification information.
  • the audio data embedded with the auxiliary identification information is obtained by decapsulating the data packet.
  • the processor is further configured to separate the audio data embedded with the auxiliary identification information to obtain the audio digital watermark and audio data converted by the auxiliary identification information.
  • the frequency of the audio digital watermark is within a specified frequency range, wherein the specified frequency range is a frequency range outside the human auditory frequency range.
  • the audio data obtained by separating the audio data embedded with the auxiliary identification information is pre-encoding audio data.
  • the processor is further configured to receive a control instruction sent by the audio acquisition device through the wireless transceiver; the received control instruction is a control instruction generated by a control sensor of the audio acquisition device in response to a user's trigger.
  • the target audio segment in the received audio data is processed by the audio collection device, and the target audio segment is the audio segment corresponding to the control instruction.
  • the processing of the target audio segment includes one or more of the following: enhancement, noise reduction, and polishing.
  • the media processing includes: audio editing and/or audio and video editing.
  • the electronic device is another electronic device communicatively connected to the audio receiving device;
  • the processor is further configured to send the received audio data and auxiliary identification information to the electronic device.
  • the electronic device is the audio receiving device
  • the processor is further configured to perform media processing using the audio data, and identify control instructions from the audio data according to the auxiliary identification information.
  • the electronic device includes one or more cameras.
  • the electronic device includes any of the following devices: drones, cameras, pan-tilts, and unmanned vehicles.
  • FIG. 5 shows an audio collection system implemented with modules.
  • the system includes an audio collection device 10, an audio receiving device 20 and an electronic device 30.
  • the microphone collects audio data, and the collected audio data is provided to the instruction recognition module.
  • the instruction recognition module recognizes the control instruction and provides the control instruction to the watermark embedding module.
  • the control instructions are converted into audio digital watermarks and embedded in the audio data.
  • the audio data embedded with the control instruction is encoded by the audio encoding module, then encapsulated into a data packet by the data encapsulation module, and sent to the opposite end through the wireless transceiver.
  • the audio receiving device 20 receives the data packet through the wireless transceiver, decapsulates the data packet through the data decapsulation module, and sends the decapsulated audio data embedded with the control instruction to the audio decoding module for decoding.
  • the audio receiving device 20 sends the decoded audio data embedded with the control instruction to the electronic device 30 through the hardware link, and the watermark extraction module of the electronic device 30 performs separation and conversion operations on the audio data embedded with the control instruction to obtain the audio data and the control instruction.
  • FIG. 5 is only used as an optional implementation manner, and in actual implementation, other modules may be used to implement the technical solution of the present application.
  • whichever module is used in actual implementation, it is essentially the same as the processor described in the embodiment of the present application.
  • the processor described in the embodiment of the present application can actually refer to various modules that perform corresponding functions.
  • the embodiment of the application also provides a first audio processing method, which is applied to an audio collection device, and the method includes:
  • the audio data is used for one or more electronic devices to perform media processing
  • the control instruction is used for one or more electronic devices to perform control processing
  • the electronic device is the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • the method before sending the audio data to the audio receiving device, the method further includes:
  • the audio data subjected to the instruction recognition processing is audio data before encoding.
  • sending the audio data and the control instruction to the audio receiving device includes:
  • encapsulating the audio data and the control command into a data packet includes:
  • the audio data embedded with the control instruction is encapsulated into a data packet.
  • the method before embedding the control instruction into the audio data, the method further includes:
  • the control instruction is converted into an audio digital watermark.
  • the frequency of the audio digital watermark is within a specified frequency range, wherein the specified frequency range is a frequency range outside the human auditory frequency range.
  • the audio data embedded in the control instruction is audio data before encoding.
  • performing instruction recognition processing on the collected audio data includes:
  • the audio feature is input into a designated voice recognition model, and the control instruction is recognized.
  • control instruction further includes another control instruction generated in response to a user's trigger.
  • it further includes:
  • the processing of the target audio segment includes one or more of the following: silence, enhancement, and voice change.
  • the type of audio data corresponding to the control instruction includes: a voice type and/or a non-voice type.
  • the electronic device is another electronic device communicatively connected to the audio receiving device;
  • the audio data is used by the audio receiving device to send the audio data to the electronic device to perform media processing
  • the control instruction is used by the audio receiving device to send the control instruction to the electronic device to perform control processing.
  • the electronic device is the audio receiving device
  • the audio data is used by the audio receiving device to perform media processing
  • the control instruction is used for the audio receiving device to perform control processing.
  • the electronic device includes one or more cameras.
  • the electronic device includes any of the following devices: drones, cameras, pan-tilts, and unmanned vehicles.
  • the media processing includes: audio editing and/or audio and video editing.
  • the embodiment of the application also provides a second audio processing method, which is applied to the audio collection device, and the method includes:
  • the audio data is used for one or more electronic devices to perform media processing
  • the auxiliary identification information is used for one or more electronic devices to identify control instructions from the audio data according to the auxiliary identification information
  • the electronic device is the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • the auxiliary identification information includes one or more of the following information: segment identification information used to indicate the audio segment corresponding to the control instruction, the type of audio data corresponding to the control instruction, and the control content information corresponding to the control instruction.
  • the type of audio data corresponding to the control instruction includes: a voice type and/or a non-voice type.
  • the method before sending the audio data to the audio receiving device, the method further includes:
  • the audio data for identification processing is audio data before encoding.
  • sending the audio data and the auxiliary identification information to the audio receiving device includes:
  • the audio data and the auxiliary identification information are encapsulated into a data packet and sent to the audio receiving device.
  • encapsulating the audio data and the auxiliary identification information into a data packet includes:
  • the audio data embedded with the auxiliary identification information is encapsulated into a data packet.
  • the method before embedding the auxiliary identification information into the audio data, the method further includes:
  • the auxiliary identification information is converted into an audio digital watermark.
  • the frequency of the audio digital watermark is within a specified frequency range, wherein the specified frequency range is a frequency range outside the human auditory frequency range.
  • the audio data embedded in the auxiliary identification information is audio data before encoding.
  • control instruction further includes another control instruction generated in response to a user's trigger.
  • it further includes:
  • the processing of the audio segment corresponding to the control instruction includes one or more of the following: enhancement, noise reduction, and polishing.
  • the electronic device is another electronic device communicatively connected to the audio receiving device;
  • the audio data is used by the audio receiving device to send the audio data to the electronic device to perform media processing
  • the auxiliary identification information is used by the audio receiving device to send the auxiliary identification information to the electronic device to identify a control instruction from the audio data according to the auxiliary identification information.
  • the electronic device is the audio receiving device
  • the audio data is used by the audio receiving device to perform media processing
  • the auxiliary identification information is used by the audio receiving device to identify a control instruction from the audio data according to the auxiliary identification information.
  • the electronic device includes one or more cameras.
  • the electronic device includes any of the following devices: drones, cameras, pan-tilts, and unmanned vehicles.
  • the media processing includes: audio editing and/or audio and video editing.
  • the embodiment of the application also provides a third audio processing method, which is applied to the audio receiving device, and the method includes:
  • control instruction is obtained by the audio collection device performing instruction recognition processing on the collected audio data
  • the audio data is used for one or more electronic devices to perform media processing
  • the control instruction is used for one or more electronic devices to perform control processing
  • the electronic device is the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • the method further includes:
  • control instruction is obtained by the audio collection device performing instruction recognition processing on the audio data before encoding.
  • the receiving audio data and control instructions sent by the audio collection device via a wireless network includes:
  • the data packet is received through the wireless network, and the data packet is decapsulated to obtain the audio data and the control instruction.
  • the decapsulating the data packet to obtain the audio data and the control instruction includes:
  • separating the audio data embedded with the control instruction to obtain the audio data and the control instruction includes:
  • the audio data embedded with the control instruction is separated to obtain the audio digital watermark and the audio data, and the audio digital watermark is converted to obtain the control instruction.
  • the frequency of the audio digital watermark is within a specified frequency range, wherein the specified frequency range is a frequency range outside the human auditory frequency range.
  • the audio data obtained by separating the audio data embedded with the control instruction is pre-encoding audio data.
  • the control instruction is obtained by the audio collection device intercepting audio segments containing voice from the audio data, extracting audio features of the audio segments, and then inputting the audio features into a specified voice recognition model.
  • the received control instruction further includes another control instruction generated by the control sensor of the audio collection device in response to a user's trigger.
  • the target audio segment in the received audio data is processed by the audio collection device, and the target audio segment is an audio segment corresponding to the control instruction.
  • the processing of the target audio segment includes one or more of the following: silence, enhancement, and voice change.
  • the type of audio data corresponding to the control instruction includes: a voice type and/or a non-voice type.
  • the electronic device is another electronic device communicatively connected to the audio receiving apparatus; the method further includes:
  • the electronic device is the audio receiving device; the method further includes:
  • the media processing includes: audio editing and/or audio and video editing.
  • the electronic device includes one or more cameras.
  • the electronic device includes any of the following devices: drones, cameras, pan-tilts, and unmanned vehicles.
  • the embodiment of this application also provides a fourth audio processing method, which is applied to an audio receiving device, and the method includes:
  • the auxiliary identification information is obtained by the audio collection device performing identification processing on the collected audio data; the audio data is used for one or more electronic devices to perform media processing, and the auxiliary identification information is used for one or more electronic devices to recognize a control instruction from the audio data according to the auxiliary identification information, and the electronic device is the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • the auxiliary identification information includes one or more of the following information: segment identification information used to indicate the audio segment corresponding to the control instruction, the type of audio data corresponding to the control instruction, and the control content information corresponding to the control instruction.
  • the type of audio data corresponding to the control instruction includes: voice type and/or non-voice type.
  • the method further includes:
  • the auxiliary identification information is obtained by the audio collection device performing identification processing on the audio data before encoding.
  • the receiving audio data and auxiliary identification information sent by the audio collection device via a wireless network includes:
  • the data packet is received through the wireless network, and the data packet is decapsulated to obtain the audio data and the auxiliary identification information.
  • the decapsulating the data packet to obtain the audio data and the auxiliary identification information includes:
  • separating the audio data embedded with the auxiliary identification information to obtain the audio data and the auxiliary identification information includes:
  • the audio data embedded with the auxiliary identification information is separated to obtain an audio digital watermark and audio data, and the audio digital watermark is converted to obtain the auxiliary identification information.
  • the frequency of the audio digital watermark is within a specified frequency range, wherein the specified frequency range is a frequency range outside the human auditory frequency range.
  • the audio data obtained by separating the audio data embedded with the auxiliary identification information is pre-encoding audio data.
  • the method further includes:
  • the control instruction sent by the audio acquisition device is received through a wireless network; the received control instruction is another control instruction generated by the control sensor of the audio acquisition device in response to a user's trigger.
  • the target audio segment in the received audio data is processed by the audio collection device, and the target audio segment is an audio segment corresponding to the control instruction.
  • the processing of the target audio segment includes one or more of the following: enhancement, noise reduction, and polishing.
  • the media processing includes: audio editing and/or audio and video editing.
  • the electronic device is another electronic device communicatively connected to the audio receiving apparatus; the method further includes:
  • the received audio data and auxiliary identification information are sent to the electronic device.
  • the electronic device is the audio receiving device; the method further includes:
  • the electronic device includes one or more cameras.
  • the electronic device includes any of the following devices: drones, cameras, pan-tilts, and unmanned vehicles.
  • An embodiment of the present application also provides a first audio collection system, including:
  • an audio collection device and an audio receiving device;
  • the audio collection device is used to perform instruction recognition processing on the collected audio data to obtain a control instruction, and to send the audio data and the control instruction to the audio receiving device via a wireless network;
  • the audio data is used for one or more electronic devices to perform media processing
  • the control instruction is used for one or more electronic devices to perform control processing
  • the electronic device is the audio receiving device or another electronic device communicatively connected to the audio receiving device.
  • the audio collecting device is configured to encode the collected audio data and send it to the audio receiving device;
  • the audio receiving device is used to decode the received audio data after receiving the audio data.
  • the audio data processed by the audio collection device for instruction recognition is pre-encoded audio data.
  • the audio collection device is configured to encapsulate the audio data and the control instruction into a data packet and send it to the audio receiving device;
  • the audio receiving device is configured to decapsulate the data packet after receiving the data packet to obtain the audio data and the control instruction.
  • the audio collection device is configured to embed the control instruction into the audio data, encapsulate the audio data embedded with the control instruction into a data packet and send it to the audio receiving device;
  • the audio receiving device is configured to decapsulate the data packet after receiving the data packet, and separate the audio data embedded with the control instruction obtained by the decapsulation to obtain the audio data and the control instruction.
  • the audio collection device is configured to convert the control instruction into an audio digital watermark and embed it into the audio data, encapsulate the audio data embedded with the control instruction into a data packet, and send it to the audio receiving device;
  • the audio receiving device is configured to, after receiving the data packet, decapsulate the data packet, separate the decapsulated audio data embedded with the control instruction to obtain an audio digital watermark and audio data, and convert the audio digital watermark to obtain the control instruction.
  • the frequency of the audio digital watermark is within a specified frequency range, wherein the specified frequency range is a frequency range outside the human auditory frequency range.
  • the audio data into which the audio collection device embeds the control instruction is audio data before encoding.
  • the manner in which the audio collection device performs command recognition processing on the collected audio data includes:
  • the audio feature is input into a designated voice recognition model, and the control instruction is recognized.
  • the control instruction further includes another control instruction generated by the control sensor of the audio collection device in response to a user's trigger.
  • the audio collection device is further configured to process the target audio segment for which the control instruction is recognized.
  • the processing of the target audio segment by the audio collection device includes one or more of the following: sound cancellation, enhancement, and sound change.
  • the type of audio data corresponding to the control instruction includes: a voice type and/or a non-voice type.
  • the electronic device is another electronic device communicatively connected to the audio receiving device;
  • the audio receiving device is also used to send the received audio data and control instructions to the electronic device.
  • the electronic device is the audio receiving device
  • the audio receiving device is further configured to use the audio data to perform media processing and perform operations corresponding to the control instructions.
  • the electronic device includes one or more cameras.
  • the electronic device includes any of the following devices: drones, cameras, pan-tilts, and unmanned vehicles.
  • the media processing includes: audio editing and/or audio and video editing.
  • the embodiment of the present application also provides a second audio collection system, including:
  • an audio collection device and an audio receiving device;
  • the audio collection device is configured to perform identification processing on the collected audio data to obtain auxiliary identification information; send the audio data and the auxiliary identification information to the audio receiving device via a wireless network;
  • the audio data is used for one or more electronic devices to perform media processing
  • the auxiliary identification information is used for one or more electronic devices to identify control instructions from the audio data according to the auxiliary identification information
  • the electronic device is the audio receiving device or another electronic device communicatively connected with the audio receiving device.
  • the auxiliary identification information includes one or more of the following information: segment identification information used to indicate the audio segment corresponding to the control instruction, the type of audio data corresponding to the control instruction, and the control content information corresponding to the control instruction.
  • the type of audio data corresponding to the control instruction includes: a voice type and/or a non-voice type.
  • the audio collecting device is configured to encode the collected audio data and send it to the audio receiving device;
  • the audio receiving device is used to decode the received audio data after receiving the audio data.
  • the audio data to be recognized by the audio collection device is audio data before encoding.
  • the audio collection device is configured to encapsulate the audio data and the auxiliary identification information into data packets and send them to the audio receiving device;
  • the audio receiving device is configured to decapsulate the data packet after receiving the data packet to obtain the audio data and the auxiliary identification information.
  • the audio collection device is configured to embed the auxiliary identification information into the audio data, encapsulate the audio data embedded with the auxiliary identification information into a data packet, and send it to the audio receiving device;
  • the audio receiving device is configured to decapsulate the data packet after receiving the data packet, and separate the audio data embedded with the auxiliary identification information obtained by the decapsulation to obtain the audio data and the auxiliary identification information.
  • the audio collection device is configured to convert the auxiliary identification information into an audio digital watermark and embed it into the audio data, encapsulate the audio data embedded with the auxiliary identification information into a data packet, and send it to the audio receiving device;
  • the audio receiving device is configured to decapsulate the data packet after receiving the data packet, separate the audio data embedded with the auxiliary identification information obtained by the decapsulation to obtain an audio digital watermark and audio data, and convert the audio digital watermark to obtain the auxiliary identification information.
  • the frequency of the audio digital watermark is within a specified frequency range, wherein the specified frequency range is a frequency range outside the human auditory frequency range.
  • the audio data into which the audio collection device embeds the auxiliary identification information is audio data before encoding.
  • the control instruction further includes another control instruction generated by the control sensor of the audio collection device in response to a user's trigger.
  • the audio collection device is further configured to process the target audio segment corresponding to the control instruction.
  • the processing of the audio segment corresponding to the control instruction by the audio collection device includes one or more of the following: enhancement, noise reduction, and polishing.
  • the electronic device is another electronic device communicatively connected to the audio receiving device;
  • the audio receiving device is also used to send the received audio data and auxiliary identification information to the electronic device.
  • the electronic device is the audio receiving device
  • the audio receiving device is further configured to perform media processing using the audio data, and identify control instructions from the audio data according to the auxiliary identification information.
  • the electronic device includes one or more cameras.
  • the electronic device includes any of the following devices: drones, cameras, pan-tilts, and unmanned vehicles.
  • the media processing includes: audio editing and/or audio and video editing.
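
The watermark-based embodiments listed above (convert the control instruction or auxiliary identification information into an audio digital watermark outside the human auditory range, embed it, then separate it again on the receiving side) can be illustrated with a minimal sketch. The text does not specify a watermarking scheme, so everything concrete here is an assumption made for illustration: the 48 kHz sample rate, the 21 kHz and 22 kHz carrier tones, the 10 ms bit length, the amplitude, and the function names `embed_watermark` and `separate_watermark`. The sketch uses simple two-tone keying detected with the Goertzel algorithm; it is not the applicant's method.

```python
from __future__ import annotations

import math

SAMPLE_RATE = 48_000   # Hz (assumed); Nyquist limit is 24 kHz
BIT_SAMPLES = 480      # 10 ms per bit (assumed)
FREQ_ZERO = 21_000.0   # Hz carrier for bit 0 (assumed, above ~20 kHz)
FREQ_ONE = 22_000.0    # Hz carrier for bit 1 (assumed)
AMPLITUDE = 0.05       # watermark level relative to full scale (assumed)


def _tone(bit: int, n: int) -> float:
    """Sample n of the carrier tone that encodes one bit."""
    freq = FREQ_ONE if bit else FREQ_ZERO
    return AMPLITUDE * math.sin(2 * math.pi * freq * n / SAMPLE_RATE)


def _bytes_to_bits(payload: bytes) -> list[int]:
    return [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]


def _bits_to_bytes(bits: list[int]) -> bytes:
    out = bytearray()
    for i in range(0, len(bits) - 7, 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)


def embed_watermark(audio: list[float], payload: bytes) -> list[float]:
    """Add one inaudible carrier burst per payload bit to the audio samples."""
    bits = _bytes_to_bits(payload)
    if len(bits) * BIT_SAMPLES > len(audio):
        raise ValueError("audio too short for payload")
    marked = list(audio)
    for index, bit in enumerate(bits):
        start = index * BIT_SAMPLES
        for n in range(BIT_SAMPLES):
            marked[start + n] += _tone(bit, n)
    return marked


def _goertzel_power(block: list[float], freq: float) -> float:
    """Signal power of `block` at a single frequency (Goertzel algorithm)."""
    coeff = 2 * math.cos(2 * math.pi * freq / SAMPLE_RATE)
    s_prev, s_prev2 = 0.0, 0.0
    for sample in block:
        s_prev2, s_prev = s_prev, sample + coeff * s_prev - s_prev2
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def separate_watermark(marked: list[float], num_bytes: int) -> tuple[bytes, list[float]]:
    """Recover the payload, then subtract the carriers to restore the audio."""
    audio = list(marked)
    bits = []
    for index in range(num_bytes * 8):
        start = index * BIT_SAMPLES
        block = marked[start:start + BIT_SAMPLES]
        one = _goertzel_power(block, FREQ_ONE)
        zero = _goertzel_power(block, FREQ_ZERO)
        bit = 1 if one > zero else 0
        bits.append(bit)
        for n in range(len(block)):
            audio[start + n] -= _tone(bit, n)
    return _bits_to_bytes(bits), audio
```

For example, `separate_watermark(embed_watermark(samples, b"REC"), 3)` would return the payload `b"REC"` together with samples matching the original audio. Both carriers complete a whole number of cycles per bit, which keeps them from leaking into each other's detector. A lossy codec may attenuate content this close to the Nyquist limit, so a practical design would need to check that the chosen band survives the encoder in use.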

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Quality & Reliability (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Embodiments of the present invention relate to an audio acquisition apparatus comprising a microphone, a processor, and a wireless transceiver. The processor is configured to perform instruction recognition processing on audio data acquired by the microphone to obtain a control instruction, and is further configured to send the audio data and the control instruction to an audio receiving apparatus by means of the wireless transceiver. The audio data is used by one or more electronic devices to perform media processing, the control instruction is used by one or more electronic devices to perform control processing, and the electronic device is the audio receiving apparatus or another electronic device communicatively connected to the audio receiving apparatus. The audio acquisition apparatus described in the embodiments of the present invention can acquire clear audio data, thereby improving speech recognition accuracy. In addition, because the audio data is recognized on the audio acquisition apparatus side before it is transmitted wirelessly, speech recognition accuracy is better guaranteed.
PCT/CN2020/080268 2020-03-19 2020-03-19 Audio acquisition apparatus, audio receiving apparatus, and audio processing method WO2021184315A1 (fr)
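
As an illustration of the transmission step summarized in the abstract, where the acquisition side sends both the audio data and the recognized control instruction to the receiving side, the following is a minimal sketch of packing the two into one data packet and unpacking them again. The packet layout (a 2-byte marker, a 16-bit instruction length, a 32-bit audio length, then the two payloads), the marker value `b"AC"`, and the function names `encapsulate` and `decapsulate` are hypothetical; the application does not define a wire format.

```python
from __future__ import annotations

import struct

MAGIC = b"AC"                    # hypothetical 2-byte packet marker
HEADER = struct.Struct("!2sHI")  # marker, instruction length, audio length (network byte order)


def encapsulate(audio: bytes, instruction: str = "") -> bytes:
    """Pack encoded audio and an optional control instruction into one packet."""
    encoded = instruction.encode("utf-8")
    return HEADER.pack(MAGIC, len(encoded), len(audio)) + encoded + audio


def decapsulate(packet: bytes) -> tuple[bytes, str]:
    """Split a packet back into (audio, control instruction)."""
    if len(packet) < HEADER.size:
        raise ValueError("packet shorter than header")
    magic, instr_len, audio_len = HEADER.unpack_from(packet)
    if magic != MAGIC:
        raise ValueError("unexpected packet marker")
    body = packet[HEADER.size:]
    if len(body) != instr_len + audio_len:
        raise ValueError("length fields do not match payload")
    instruction = body[:instr_len].decode("utf-8")
    audio = body[instr_len:]
    return audio, instruction
```

A receiver built this way could hand the returned audio bytes to its decoder for media processing and pass the instruction string, if non-empty, to whichever electronic device performs the control processing.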

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/080268 WO2021184315A1 (fr) 2020-03-19 2020-03-19 Audio acquisition apparatus, audio receiving apparatus, and audio processing method
CN202080004930.8A CN112639963A (zh) 2020-03-19 2020-03-19 音频采集装置、音频接收装置及音频处理方法

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/080268 WO2021184315A1 (fr) 2020-03-19 2020-03-19 Audio acquisition apparatus, audio receiving apparatus, and audio processing method

Publications (1)

Publication Number Publication Date
WO2021184315A1 true WO2021184315A1 (fr) 2021-09-23

Family

ID=75291266

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/080268 WO2021184315A1 (fr) 2020-03-19 2020-03-19 Audio acquisition apparatus, audio receiving apparatus, and audio processing method

Country Status (2)

Country Link
CN (1) CN112639963A (fr)
WO (1) WO2021184315A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023004776A1 (fr) * 2021-07-30 2023-02-02 深圳市大疆创新科技有限公司 Procédé de traitement de signal pour réseau de microphones, réseau de microphones, et système
CN114758669B (zh) * 2022-06-13 2022-09-02 深圳比特微电子科技有限公司 音频处理模型的训练、音频处理方法、装置及电子设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737629A (zh) * 2011-11-11 2012-10-17 东南大学 一种嵌入式语音情感识别方法及装置
CN104008132A (zh) * 2014-05-04 2014-08-27 深圳市北科瑞声科技有限公司 语音地图搜索方法及系统
CN104010057A (zh) * 2014-06-05 2014-08-27 深圳市易科泰科技有限公司 语音识别呼叫系统
US9542956B1 (en) * 2012-01-09 2017-01-10 Interactive Voice, Inc. Systems and methods for responding to human spoken audio
JP2019087927A (ja) * 2017-11-09 2019-06-06 東京瓦斯株式会社 赤外線操作システム
CN110409957A (zh) * 2018-04-30 2019-11-05 创科(澳门离岸商业服务)有限公司 车库门开启器系统及其控制方法

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100426691B1 (ko) * 2001-08-21 2004-04-13 주식회사 마크애니 워터마크를 제어신호로 이용하는 송수신 시스템 및 그 방법
US9787887B2 (en) * 2015-07-16 2017-10-10 Gopro, Inc. Camera peripheral device for supplemental audio capture and remote control of camera

Also Published As

Publication number Publication date
CN112639963A (zh) 2021-04-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20925213

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20925213

Country of ref document: EP

Kind code of ref document: A1