WO2021143599A1 - Speech processing method based on scene recognition, and device, medium and system therefor - Google Patents

Speech processing method based on scene recognition, and device, medium and system therefor

Info

Publication number
WO2021143599A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
audio
scene
audio data
image data
Prior art date
Application number
PCT/CN2021/070509
Other languages
English (en)
French (fr)
Inventor
李峰
刘镇亿
玄建永
Original Assignee
荣耀终端有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 荣耀终端有限公司
Publication of WO2021143599A1

Classifications

    • G10L 21/0208 - Speech enhancement: noise filtering
    • G06N 3/04 - Neural networks: architecture, e.g. interconnection topology
    • G06N 3/045 - Neural networks: combinations of networks
    • G06N 3/08 - Neural networks: learning methods
    • G06V 20/41 - Scenes in video content: higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46 - Scenes in video content: extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G10L 21/003 - Changing voice quality, e.g. pitch or formants
    • G10L 25/30 - Speech or voice analysis characterised by the analysis technique: using neural networks
    • G10L 25/51 - Speech or voice analysis specially adapted for particular use: for comparison or discrimination
    • G10L 25/57 - Speech or voice analysis specially adapted for particular use: for comparison or discrimination, for processing of video signals
    • H04N 21/439 - Client devices: processing of audio elementary streams
    • H04N 21/4394 - Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N 21/44 - Client devices: processing of video elementary streams
    • H04N 21/44008 - Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 7/14 - Television systems: systems for two-way working

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a voice processing method based on scene recognition and its device, medium and system.
  • the environmental noise collected by the microphone differs greatly due to the different environments in which the recording equipment is located, and the impact on the target sound is also different.
  • In current industry practice, noise reduction for audio collected by video-recording equipment relies on scene judgment from audio alone or images alone. Across different scenes such as indoors, studios, roads, cars, the seaside and restaurants, the types of sound are random and incidental, which makes the sound types difficult to identify accurately; this easily leads to misjudgment of the scene, a low accuracy rate, and a poor user experience.
  • the embodiments of the present application provide a voice processing method based on scene recognition and its device, medium and system.
  • an embodiment of the present application provides a voice processing method for scene recognition, and the method includes:
  • In a case where it is detected that the electronic device is performing video recording, obtaining the image data and audio data in the currently recorded video; performing feature extraction on the image data and the audio data to obtain the image features of the image data and the audio features of the audio data; recognizing the extracted image features and audio features to identify the scene category in which the electronic device is currently recording the video; and, based on the identified scene category, processing the audio data in the video recorded in real time by the electronic device, and outputting the processed audio data and the corresponding image data.
  • In this way, the recognized scene category is more accurate: misjudgment of the scene caused by recognition based on image features alone or voice features alone is avoided, and the scene recognition accuracy rate is improved.
  • Moreover, the audio data in the video recorded in real time by the electronic device can be processed to give the best experience for each scene, avoiding the damage or mis-processing that results from applying the same processing to audio data from different scenes.
  • In a possible implementation, the performing feature extraction on the image data and audio data to obtain the image features of the image data and the audio features of the audio data includes:
  • performing structured processing on the image data to obtain the image features of the image data; and
  • performing a Fourier transform on the audio data to obtain the audio features of the audio data.
  • feature extraction can be performed on the sample image through a three-dimensional convolutional neural network model.
  • pre-emphasis, framing, and other pre-processing may be performed on the audio data before the Fourier transform is performed on the audio data.
  • In a possible implementation, the processing of the audio data in the video recorded in real time by the electronic device based on the identified scene category, and the outputting of the processed audio data and the corresponding image data, include:
  • selecting, based on the identified scene category, the noise-reduction processing algorithm, equalization processing method, automatic gain control method and dynamic range control method corresponding to that scene category; processing the audio data in the video recorded in real time by the electronic device with the selected noise-reduction processing algorithm, equalization processing method, automatic gain control method and dynamic range control method; and outputting the processed audio data and the corresponding image data.
  • one or more of the above-mentioned processing methods corresponding to the scene category can be selected as needed.
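  • As an illustration of this per-scene selection, the following is a minimal sketch that maps hypothetical scene labels to ordered chains of processing stages (noise reduction followed by EQ/AGC/DRC-style adjustment); the scene names and stage functions are assumptions for illustration, not names defined by this application.

```python
# Illustrative mapping from a recognized scene category to an ordered chain of
# audio processing stages. Scene names and stage functions are hypothetical.
from typing import Callable, Dict, List

import numpy as np


def suppress_wind_noise(audio: np.ndarray) -> np.ndarray:
    return audio  # placeholder for the seaside noise-reduction algorithm


def suppress_babble_noise(audio: np.ndarray) -> np.ndarray:
    return audio  # placeholder for the restaurant noise-reduction algorithm


def seaside_eq(audio: np.ndarray) -> np.ndarray:
    return audio  # placeholder for scene-specific EQ/AGC/DRC adjustment


# One ordered pipeline (noise reduction -> EQ -> AGC -> DRC) per scene category.
SCENE_PIPELINES: Dict[str, List[Callable[[np.ndarray], np.ndarray]]] = {
    "seaside": [suppress_wind_noise, seaside_eq],
    "restaurant": [suppress_babble_noise],
}


def process_audio(scene: str, audio: np.ndarray) -> np.ndarray:
    """Apply every stage configured for the recognized scene, in order."""
    for stage in SCENE_PIPELINES.get(scene, []):
        audio = stage(audio)
    return audio
```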
  • In a possible implementation, the situation in which the electronic device performs video recording includes video shooting, video live broadcast, or a video call.
  • In a possible implementation, the method further includes: determining that the video recording performed by the electronic device is a live video broadcast or a video call, and recognizing, based on the audio data in the currently recorded video, the voice of the user conducting the live video broadcast or the video call;
  • the processing of the audio data in the video recorded in real time by the electronic device based on the identified scene category, and the outputting of the processed audio data and the corresponding image data, include: enhancing the human voice in the audio of the video recorded in real time by the electronic device, performing noise reduction on the sounds other than the human voice in the audio, and outputting the processed audio data and the corresponding image data.
  • different voice enhancement algorithms can be adapted to different scenes
  • different noise reduction processing methods can be adapted to the noise in different scenes. This can increase the signal-to-noise ratio and improve user experience.
  • the foregoing method further includes: the recognizing the voice of the user conducting the live video or the video call based on the audio data in the currently recorded video includes:
  • at least one of signal-processing methods and neural network (NN) methods is used to identify the voice of the user conducting the live video broadcast or the video call. Because the audio captured during video recording is highly random, recognizing the human voice in this way prevents other voices in the live-broadcast or video-call scene (such as the voices of passing pedestrians or onlookers) from being misjudged as the voice of the user currently broadcasting or on the call, improving the accuracy of human-voice recognition.
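  • As a rough sketch of the signal-processing option mentioned above, the following uses short-time frame energy as a crude indicator of where a voice is present; the sample rate, frame length and energy threshold are illustrative assumptions, and distinguishing the broadcaster's own voice from bystanders would in practice rely on the neural-network methods also mentioned.

```python
# Crude energy-based voice activity detection over fixed-length frames.
import numpy as np


def frames_with_voice(audio: np.ndarray, sample_rate: int = 16000,
                      frame_ms: int = 20,
                      energy_threshold: float = 0.01) -> np.ndarray:
    frame_len = int(sample_rate * frame_ms / 1000)
    num_frames = len(audio) // frame_len
    frames = audio[:num_frames * frame_len].reshape(num_frames, frame_len)
    energy = np.mean(frames.astype(float) ** 2, axis=1)  # short-time energy
    return energy > energy_threshold  # True where speech is likely present
```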
  • the foregoing method further includes: recognizing a portrait of a user who is conducting a live video broadcast or a video call based on the image data in the currently recorded video;
  • the portrait of the user is identified in the following way: when the proportion of a detected portrait in the image data exceeds a preset threshold, the portrait is determined to be the portrait of the user conducting the live video broadcast or the video call. In this way, pedestrians in the shooting scene can be prevented from being misjudged as the user currently conducting the live broadcast or the video call, improving the accuracy of portrait recognition.
  • an embodiment of the present application provides a voice processing method for scene recognition, and the method includes:
  • obtaining the video to be processed; performing feature extraction on the image data and audio data in at least part of the video to be processed to obtain the image features of the image data and the audio features of the audio data; recognizing the extracted image features and audio features to identify the scene category of the scene in the video to be processed; and processing, based on the identified scene category, the audio data in the video to be processed.
  • In this way, for an existing video (for example, a video downloaded from the network or a video previously recorded with an electronic device such as a mobile phone), the scene in the video can be identified and a corresponding audio post-processing method selected for each scene, enabling users to get the best audio-visual experience when watching the video.
  • an embodiment of the present application provides a voice processing device based on scene recognition, and the device includes:
  • the detection module is configured to obtain image data and audio data in the currently recorded video when it is detected that the electronic device is performing video recording;
  • the first feature extraction module is configured to perform feature extraction on the image data and audio data to obtain the image feature of the image data and the audio feature of the audio data;
  • the first recognition module is configured to recognize the extracted image features and audio features, and determine the scene category in which the video is currently recorded by the electronic device;
  • the first audio processing module is configured to process the audio data in the video recorded in real time by the electronic device based on the identified scene category, and output the processed audio data and corresponding image data.
  • an embodiment of the present application provides a voice processing device based on scene recognition, and the device includes:
  • the acquisition module is used to acquire the video to be processed
  • the second feature extraction module is configured to perform feature extraction on image data and audio data in at least part of the video to be processed to obtain the image feature of the image data and the audio feature of the audio data;
  • the second recognition module is configured to recognize the extracted image features and audio features, and recognize the scene category of the scene in the to-be-processed video;
  • the second audio processing module is configured to process the audio data in the to-be-processed video based on the identified scene category.
  • an embodiment of the present application provides a machine-readable medium having instructions stored thereon which, when executed on a machine, cause the machine to execute the voice processing method based on scene recognition in any of the possible implementations of the first and second aspects described above.
  • an embodiment of the present application provides a system, including:
  • a memory, used to store instructions to be executed by one or more processors of the system; and
  • a processor, which is one of the processors of the system and is used to execute the voice processing method based on scene recognition in the possible implementations of the first and second aspects described above.
  • Figure 1 shows a voice processing scenario according to some embodiments of the present application
  • Fig. 2a shows a flow chart of training a first neural network model using sample image features according to some embodiments of the present application
  • Fig. 2b shows a flow chart of joint training of a second neural network model using the image scene category and sample audio features shown in Fig. 2a according to some embodiments of the present application;
  • Fig. 2c shows a flow chart of noise reduction of a mobile phone according to some embodiments of the present application
  • Fig. 3 shows a flowchart of a voice processing method based on scene recognition provided by the present application according to some embodiments of the present application
  • Fig. 4 shows a flowchart when the scene-based voice processing method provided by the present application is applied to a live video or a video call according to some embodiments of the present application;
  • Fig. 5 shows a schematic structural diagram of a voice processing device based on scene recognition according to some embodiments of the present application
  • Fig. 6 shows a schematic structural diagram of another voice processing device based on scene recognition according to some embodiments of the present application
  • Fig. 7 shows a schematic structural diagram of a mobile phone according to some embodiments of the present application.
  • Fig. 8 shows a block diagram of a system according to some embodiments of the present application.
  • Fig. 9 shows a block diagram of a system on chip (SoC) according to some embodiments of the present application.
  • Illustrative embodiments of the present application include, but are not limited to, a voice processing method based on scene recognition and its device, medium, and system.
  • As used herein, "module" can refer to or include an application-specific integrated circuit (ASIC), an electronic circuit, a processor (shared, dedicated, or group) and/or memory that executes one or more software or firmware programs, a combinational logic circuit, and/or other suitable hardware components that provide the described functions, or can be part of these hardware components.
  • the processor may be a microprocessor, a digital signal processor, a microcontroller, etc., and/or any combination thereof.
  • the processor may be a single-core processor, a multi-core processor, etc., and/or any combination thereof.
  • a speech processing scenario 100 is disclosed.
  • Figure 1 shows a schematic diagram of this scenario.
  • Based on the video content, such as the image data collected by the camera of the mobile phone and the audio data collected by the microphone, the scene category in which the user is located can be accurately determined, for example which of several different scenes (seaside, restaurant, concert hall, meeting room, car, road, etc.) the user is currently in.
  • In different scenes, the environmental noise collected by the microphone differs considerably, and so does its impact on the sound the user cares about; unreasonable audio noise reduction degrades the user experience. Therefore, different noise-reduction processing algorithms can be matched to different scene categories.
  • For example, when the scene where the user is recording is a seaside, noise-reduction processing algorithm 1 can be used to suppress the sound of wind and ocean waves, obtaining noise-reduced audio 1 and improving the signal-to-noise ratio of the target sound. When the scene is a restaurant, where the collision of tableware and the voices of surrounding diners are noisy, noise-reduction algorithm 2 can attenuate those sounds to obtain noise-reduced audio 2. When the scene is a concert hall, in order to better appreciate the sound of the instruments or performers, noise-reduction algorithm 3 can attenuate audience conversation, phone calls and similar sounds to obtain noise-reduced audio 3. When the scene is a meeting room, so that the participants clearly hear the speaker, noise-reduction processing algorithm 4 can attenuate whispering, coughing and construction noise around the meeting room to obtain noise-reduced audio 4. In this way, the best audio effect for each scene is achieved and the user experience is improved.
  • the voice processing scene 100 shown in FIG. 1 is only exemplary and not restrictive.
  • the environment where the user is located may also be a park, a shopping mall, etc., which will not be listed here.
  • FIG. 1 is an example of a user using a mobile phone to record a voice to illustrate the voice processing method based on scene recognition in the embodiment of the present application.
  • The scene-recognition-based voice processing method of the present application can also be used when the user uses other electronic devices that have video-recording, or image- and audio-collection, functions for video live broadcast or video call applications.
  • Such electronic devices include, but are not limited to, cameras, tablet computers, desktop computers, wearable smart devices, augmented reality devices, ultra-mobile personal computers (UMPC), personal digital assistants (PDA), and so on.
  • First, feature extraction is performed on a large number of collected sample images (212), where the sample images can be images of different scenes collected by the camera of a mobile phone.
  • Specifically, feature extraction is performed on objects in the sample images to obtain sample image features corresponding to objects in different scenes. For example, when the video scene captured by the camera is a road, features of objects such as vehicles, street lights and the road can be extracted to obtain the feature data corresponding to those objects in the road scene; when the video scene captured by the camera is a seaside, features can be extracted from objects such as seawater, beaches and ships;
  • when the video scene captured by the camera is, for example, a concert hall, features can be extracted from the musical instruments, the stage, the seats in the stands and other objects in the scene to obtain the corresponding features.
  • the feature extraction of the sample image is not limited to the feature extraction of the object, and may also include the feature extraction of the spatial layout and background of the sample image (for example, sky, grass, forest, etc.).
  • In some embodiments, feature extraction of the sample images can be performed with a three-dimensional convolutional neural network model, in which the convolution operates over the time and space dimensions simultaneously; this captures the visual features of each sample image frame while also capturing the correlation between adjacent frames over time, so the feature information of the sample images can be extracted fully.
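  • As a rough sketch of such spatio-temporal feature extraction, the following small network convolves a short clip jointly over time and space with 3D convolutions; the layer sizes, clip length and feature dimension are arbitrary assumptions rather than values given in this application.

```python
# Toy 3D-convolutional feature extractor: convolution runs jointly over the
# time (frame) and space (height, width) dimensions of a short clip.
import torch
import torch.nn as nn


class Clip3DFeatureExtractor(nn.Module):
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),      # global pooling over time and space
        )
        self.fc = nn.Linear(32, feature_dim)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip shape: (batch, channels=3, frames, height, width)
        return self.fc(self.conv(clip).flatten(1))


# Example: 8 RGB frames of 112x112 pixels -> one 128-dim clip feature vector.
features = Clip3DFeatureExtractor()(torch.randn(1, 3, 8, 112, 112))
```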
  • the extracted sample image feature data is input to the first neural network model for training (214).
  • Specifically, the first neural network model is trained with the sample image features, and its output is compared with the expected result for those features, until the difference between the output of the first neural network model and the expected result is less than a certain threshold, thereby completing the training of the first neural network model.
  • the first neural network model can be any artificial neural network model, such as a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), or a binary neural network (BNN).
  • Then, image scene categories corresponding to multiple different scenes are output (216). It can be understood that, before training, different shooting scenes can be classified to obtain the image scene category corresponding to each shooting scene, and the image features in each shooting scene can be mapped one-to-one to that image scene category. For example, when there are moving vehicles in the shooting scene,
  • the feature data of the vehicles, traffic lights and other objects can be associated with the road scene, and this feature data can be input into the first neural network model for training, so that it outputs a scene category that characterizes the road scene.
  • Similarly, the feature data of tableware, diners and the like can be associated with the restaurant scene and input into the first neural network model for training, so that it outputs an image scene category that characterizes the restaurant scene.
  • Next, the image scene categories output by the first neural network model can be combined with the audio signals collected in the corresponding scenes, and a second neural network model is jointly trained, so as to reduce the scene misjudgment rate. Specifically, the process of jointly training the second neural network model using image scene categories and audio is as follows:
  • Feature extraction is performed on a large number of collected audio samples (222).
  • In some embodiments, pre-emphasis, framing and windowing can first be performed on the audio samples to ensure that the signals obtained in subsequent audio processing are more uniform and smooth, improving audio processing quality.
  • Fourier transform can be performed on the windowed audio signal to obtain sample audio characteristics such as the frequency spectrum and phase of the audio signal.
  • For example, the various audio components in the concert hall (such as piano sound, audience coughing and applause) collected through the microphone of the mobile phone shown in Figure 1 are framed, a 20 ms Hamming window is applied with a 10 ms frame shift, and a Fourier transform is performed frame by frame to obtain information such as the frequency spectrum and phase of each audio component.
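  • The framing and windowing step described above can be sketched as follows, using a 20 ms Hamming window with a 10 ms frame shift and a per-frame Fourier transform; the sample rate and pre-emphasis coefficient are assumed values for illustration.

```python
# Pre-emphasis, 20 ms Hamming-window framing with a 10 ms shift, and a
# frame-by-frame FFT, returning per-frame magnitude spectrum and phase.
import numpy as np


def audio_features(signal, sample_rate=16000, pre_emphasis=0.97):
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    frame_len = int(0.020 * sample_rate)   # 20 ms window
    frame_step = int(0.010 * sample_rate)  # 10 ms frame shift
    window = np.hamming(frame_len)

    frames = [emphasized[start:start + frame_len] * window
              for start in range(0, len(emphasized) - frame_len + 1, frame_step)]
    spectra = np.fft.rfft(np.stack(frames), axis=1)
    return np.abs(spectra), np.angle(spectra)  # frequency spectrum and phase
```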
  • Then, the multiple image scene categories output by the first neural network model and the sample audio features corresponding to each image scene category are input into the second neural network model, and the second neural network model is trained (224).
  • That is, the multiple image scene categories output by the trained first neural network model shown in FIG. 2a and the audio features extracted from the audio signal corresponding to each scene category are used for joint training.
  • For example, the audio features extracted from the various audio components in the concert hall (such as piano sound, audience coughing and applause) collected through the microphone of the mobile phone shown in Figure 1, together with data representing the concert-hall scene category (for example, a scene flag set according to a preset rule), are used to train the second neural network model, and the output of the second neural network model is compared with the expected result until the difference between them is less than a certain threshold, thereby completing the training of the second neural network model.
  • the misjudgment rate of the scene judgment can be greatly reduced.
  • the second neural network model can be any artificial neural network model, such as a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), or a binary neural network (BNN).
  • It can be understood that the scene category to which the input audio features belong corresponds to the input image scene category. For example, if the scene to which the input audio features belong is scene a and the input image scene category is also a, then the scene category output by the trained second neural network model should likewise be a.
  • Finally, scene categories corresponding to multiple different scenes are output (226).
  • For example, for a video B shot at the seaside, the image feature B1 and audio feature B2 of video B are extracted;
  • after the image feature B1 is input into the first neural network model, the scene category obtained is b;
  • the obtained scene category b and the audio feature B2 are then input into the second neural network model for joint training, and the expectation is that the scene category output by the trained second neural network model is also b.
  • It should be noted that the above description, in which the sample image features are input into the first neural network model for training to obtain image scene categories corresponding to multiple different scenes, and the multiple image scene categories output by the first neural network model are then input, together with the sample audio features corresponding to each image scene category, into the second neural network model to train it and obtain the final scene categories corresponding to the multiple different scenes, is merely exemplary and not restrictive.
  • the sample image feature and the sample audio feature can also be input into the same neural network model at the same time, and the model can be jointly trained to obtain scene categories corresponding to multiple different scenes.
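  • A hedged sketch of the second-stage joint classifier is shown below: a one-hot image scene category is concatenated with an audio feature vector and fed to a small network trained with cross-entropy loss. The dimensions, optimizer and loss are illustrative assumptions, not details specified by this application.

```python
# Second-stage joint classifier: one-hot image scene category concatenated
# with an audio feature vector, trained with cross-entropy. Sizes are assumed.
import torch
import torch.nn as nn

NUM_SCENES, AUDIO_DIM = 6, 257  # e.g. seaside, restaurant, concert hall, ...


class JointSceneClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_SCENES + AUDIO_DIM, 128), nn.ReLU(),
            nn.Linear(128, NUM_SCENES),
        )

    def forward(self, scene_onehot, audio_feature):
        return self.net(torch.cat([scene_onehot, audio_feature], dim=-1))


model = JointSceneClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One step on a dummy batch; real training iterates until the loss (the gap
# between output and expected result) falls below a chosen threshold.
scene_onehot = torch.eye(NUM_SCENES)[torch.randint(0, NUM_SCENES, (8,))]
audio_feat = torch.randn(8, AUDIO_DIM)
target = torch.randint(0, NUM_SCENES, (8,))

loss = loss_fn(model(scene_onehot, audio_feat), target)
loss.backward()
optimizer.step()
```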
  • the first neural network model and the second neural network model can be trained on a server or a computer; after their training is completed, the trained neural network models can be ported to the mobile phone.
  • When the user records a video with the mobile phone, the neural network models ported to the mobile phone can be used to identify the scene of the video currently being recorded, and once the scene is identified, noise reduction and other processing are performed based on that scene.
  • the scene recognition mark may be a mark preset to distinguish different scenes; for example:
  • the scene identification mark corresponding to the seaside is s1
  • the scene identification mark corresponding to the restaurant is s2
  • the scene identification mark corresponding to the concert hall is s3
  • the scene identification mark corresponding to the meeting room is s4.
  • the corresponding noise reduction processing algorithm can be selected based on the above scene flag (234), and the audio data in the video recorded in real time by the mobile phone can be processed based on the noise reduction processing algorithm (236).
  • For example, when the neural network model ported to the mobile phone identifies that the scene of the video currently recorded by the mobile phone is a seaside, the corresponding noise-reduction processing algorithm 1 can be selected to suppress wind noise.
  • When the scene is identified as a restaurant, the corresponding noise-reduction processing algorithm 2 can be selected to strongly suppress the noise components in the restaurant.
  • When the neural network model ported to the mobile phone identifies the scene of the video currently recorded by the mobile phone as a concert hall, in order to maintain the fidelity of the music being played, the high-frequency components in the audio collected in the concert hall can be amplified and the low-frequency components reduced.
  • When the neural network model ported to the mobile phone identifies that the scene of the video currently recorded by the mobile phone is a conference room (which is relatively quiet), the corresponding noise-reduction processing algorithm 4 can be selected to apply only weak noise reduction to the audio in the conference room, so that the voices of the speakers at the meeting are preserved with fidelity.
  • In this way, the scene-recognition-based speech processing method of this application can adaptively match different noise-reduction processing methods to different scenes, so that better audio processing results are obtained for each scenario; the damage and mis-processing caused by applying a single set of noise-reduction processing to many scenes is avoided, and the user experience is improved.
  • the noise reduction processing method for audio may adopt a digital signal processing method to perform noise reduction processing on the audio data recorded in real time by the mobile phone.
  • adaptive filtering technology may be used to filter out noise in audio signals for different scenarios, for example, Least Mean Square (LMS) adaptive filtering technology may be used for audio noise reduction.
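  • As one concrete form of the adaptive filtering mentioned above, the following is a minimal LMS noise canceller; it assumes a separate noise reference signal (for example from a secondary microphone) is available, which is an illustrative assumption rather than a requirement stated in this application.

```python
# Minimal LMS adaptive noise canceller: an FIR filter estimates the noise in
# the primary signal from a noise reference and subtracts it.
import numpy as np


def lms_denoise(primary, noise_ref, num_taps=32, mu=0.01):
    weights = np.zeros(num_taps)
    cleaned = np.zeros(len(primary))
    for n in range(num_taps, len(primary)):
        x = noise_ref[n - num_taps:n][::-1]   # most recent reference samples
        noise_estimate = np.dot(weights, x)
        error = primary[n] - noise_estimate   # error approximates clean speech
        weights += 2 * mu * error * x         # LMS weight update
        cleaned[n] = error
    return cleaned
```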
  • In addition, the audio data processed by the noise-reduction algorithm can also undergo equalization (EQ) processing (238), automatic gain control (AGC) processing (240) and dynamic range control (DRC) processing (242), and the processed audio signal is then output.
  • EQ achieves the purpose of adjusting the tone color by gaining or attenuating one or more frequency bands in the audio data.
  • parameters such as frequency and gain in EQ can be adjusted to optimize the components of each frequency band in the audio.
  • AGC can adjust the signal size of the target sound by adjusting the loudness gain factor and gain weight of the components of each frequency band in the audio data corresponding to different scenes to achieve the best listening loudness.
  • DRC can make the sound sound softer or louder by providing compression and amplification capabilities for the audio amplitude in different scenes.
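  • To illustrate the DRC step, the following toy compressor attenuates the portion of each sample that exceeds a threshold by a fixed ratio; the threshold and ratio values are assumptions, and real DRC additionally uses attack/release smoothing.

```python
# Toy hard-knee dynamic range compressor: sample magnitudes above the
# threshold are reduced toward it by the given ratio.
import numpy as np


def compress(audio, threshold=0.5, ratio=4.0):
    magnitude = np.abs(audio)
    over = magnitude > threshold
    compressed = magnitude.astype(float)
    compressed[over] = threshold + (magnitude[over] - threshold) / ratio
    return np.sign(audio) * compressed
```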
  • a user uses the mobile phone 10 to shoot a video at the seaside.
  • the process of noise reduction on the audio data in the captured video is as follows:
  • the trained neural network model ported to the mobile phone 10 recognizes the image features and audio features extracted as described above, and identifies the scene where the user is shooting the video as a seaside.
  • Then, the equalization processing mode, automatic gain control mode and dynamic range control mode for the seaside scene can be applied to the audio that has been noise-reduced by noise-reduction processing algorithm 1, further adjusting the wind-suppressed audio. In this way, when a user uses the mobile phone 10 to shoot a video at the beach, the audio in the captured video contains no strong wind noise, and the recorded sound is clearer and more pleasant to the ear, which improves the user experience.
  • a user uses the mobile phone 10 to shoot a video in a restaurant.
  • the process of real-time noise reduction on the audio data in the captured video is as follows:
  • the trained neural network model ported to the mobile phone 10 recognizes the image features and audio features extracted as described above, and identifies the scene where the user is shooting the video as a restaurant.
  • Then, the noise-reduction processing algorithm 2 corresponding to the restaurant scene is selected, and strong noise reduction is performed on the noise in the restaurant.
  • Then, the equalization processing mode, automatic gain control mode and dynamic range control mode for the restaurant scene can be applied to the audio that has been noise-reduced by noise-reduction processing algorithm 2, further adjusting the strongly noise-reduced audio. In this way, when a user uses the mobile phone 10 to shoot a video in a restaurant, the noise of diners and colliding tableware in the captured video is suppressed, and the user experience is improved.
  • The following describes in detail how the noise in the audio currently received by the mobile phone 10 and the voice of the user currently broadcasting are processed when the user uses the mobile phone 10, with the scene-recognition-based voice processing technology of the present application, for live video broadcasting.
  • For example, it can be determined, through the live-broadcast software on the mobile phone 10 (such as Douyin, Kuaishou or Volcano Video), that the video recording performed by the mobile phone 10 is a live video broadcast.
  • the composition of the environmental background noise in different scenarios is different, and the impact on the live broadcast is also different.
  • For example, in a restaurant the noise is relatively strong; with strong noise reduction, the user's voice during the live broadcast becomes more prominent and clear.
  • At the seaside, the wind noise is louder, and strong noise reduction of the wind also improves the sound of the live broadcast. Therefore, when the user is conducting a live video broadcast, the image data and audio data in the video currently recorded by the mobile phone 10 can be used both to identify the scene of the current live broadcast and to recognize the voice and portrait of the user currently broadcasting.
  • The portraits and voices in the current live video are identified separately (for example, through the neural network model ported to the mobile phone 10). In some embodiments, based on the audio data in the currently recorded video, the voice of the user conducting the live video broadcast or the video call is identified through at least one of signal-processing methods and neural network methods.
  • The reason for setting the threshold is to distinguish the portrait of the user from the portraits of other pedestrians in the scene during the live video broadcast, preventing misjudgment.
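  • A minimal sketch of such a threshold check is shown below: a detected portrait is treated as the streaming user only if it occupies a sufficiently large fraction of the frame. The bounding-box format and threshold value are assumptions for illustration.

```python
# A detected portrait counts as the streaming user only if its bounding box
# covers at least min_ratio of the frame area. Values are illustrative.
def is_streaming_user(face_box, frame_width, frame_height, min_ratio=0.05):
    x, y, w, h = face_box                      # (left, top, width, height)
    portrait_ratio = (w * h) / (frame_width * frame_height)
    # Small, distant faces (e.g. passing pedestrians) fall below the threshold.
    return portrait_ratio >= min_ratio
```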
  • Then, the human voice in the audio of the video recorded in real time by the mobile phone 10 can be enhanced, sounds other than the human voice can be noise-reduced, and the processed audio data and corresponding image data can be output. This improves the signal-to-noise ratio of the human voice during the live video broadcast and achieves a better live-broadcast effect.
  • For example, when the scene of the live video broadcast is identified as a seaside, the corresponding noise-reduction processing algorithm 1 can be selected to suppress wind noise;
  • when the scene is identified as a restaurant, the corresponding noise-reduction processing algorithm 2 can be selected to strongly suppress the noise components in the restaurant. By treating the background noise differently in different scenes, a better background-sound processing effect is achieved.
  • For example, when the user sings during the live broadcast, EQ can be used to boost or attenuate one or more frequency bands in the audio to adjust the tone, or the amplitude of the voice can be amplified or compressed through DRC to make the singing sound clearer or softer.
  • the voice processing method based on scene recognition can also be used to reduce the noise of the audio in an existing video.
  • For example, the existing video may be a video clip already recorded with the mobile phone 10 or another electronic device with a video-recording function, a video clip downloaded from the Internet, or surveillance video shot by a surveillance camera.
  • the following takes a video clip recorded on the mobile phone 10 as an example to describe in detail the process of noise reduction processing on audio in a recorded video clip using the voice processing method based on scene recognition provided in this application.
  • For example, the process of denoising the audio data in a 3-second video clip is as follows:
  • the image features corresponding to the image data can be extracted through the three-dimensional convolutional neural network model.
  • the audio features corresponding to the audio data can be extracted using the aforementioned Fourier transform method.
  • the noise reduction processing method for audio may adopt a digital signal processing method, and the audio to be processed is subjected to noise reduction processing. It is also possible to use adaptive filtering technology to filter the noise in the audio signal for different scenarios, for example, use Least Mean Square (LMS) adaptive filtering technology to reduce audio noise.
  • In addition, the audio signal processed by the noise-reduction algorithm can also undergo at least one of equalization processing, automatic gain control processing and dynamic range control processing, and the processed audio signal is then output.
  • the video recording performed by the electronic device is a live video broadcast or a video call (402).
  • For example, it can be determined through the live-broadcast software on the mobile phone 10 (such as Douyin, Kuaishou or Volcano Video) that the mobile phone 10 is currently performing a live video broadcast, or through instant-messaging software (such as WeChat, QQ, VoIP clients, Skype or FaceTime) that the mobile phone 10 is currently in a video call.
  • For the specific method of extracting features from the image data and audio data, reference can be made to the feature-extraction methods for the sample images and sample audio in FIG. 2a and FIG. 2b; the detailed description above is not repeated here.
  • The scene recognition can use the same method as in FIG. 2c, and the recognition of the user's voice can use the method described above for processing the noise in the audio currently received by the mobile phone 10 and the voice of the user currently broadcasting; the detailed description above is not repeated here.
  • Then, the human voice in the audio of the video recorded in real time by the electronic device is enhanced, sounds other than the human voice are noise-reduced, and the processed audio data and corresponding image data are output (408).
  • the signal-to-noise ratio of the human voice during a live video broadcast or a video call can be improved and user experience can be improved.
  • FIG. 5 provides a schematic structural diagram of a speech processing apparatus 500 based on scene recognition according to some embodiments of the present application.
  • a voice processing device 500 based on scene recognition includes:
  • the detection module 502 is configured to obtain image data and audio data in the currently recorded video when it is detected that the electronic device is performing video recording.
  • the feature extraction module 504 is configured to perform feature extraction on image data and audio data to obtain image features of the image data and audio features of the audio data.
  • the recognition module 506 is configured to recognize the extracted image features and audio features, and determine the scene category where the electronic device currently records the video.
  • the audio processing module 508 is configured to process the audio data in the video recorded in real time by the electronic device based on the identified scene category, and output the processed audio data and corresponding image data.
  • It can be understood that the scene-recognition-based voice processing device 500 shown in FIG. 5 corresponds to the scene-recognition-based voice processing method provided in this application; the technical details in the description of that method above still apply to the device 500, and the specific description, which can be found above, is not repeated here.
  • Fig. 6 provides a schematic structural diagram of a speech processing apparatus 600 based on scene recognition according to some embodiments of the present application.
  • the voice processing device 600 based on scene recognition includes:
  • the obtaining module 602 is used to obtain the to-be-processed video data.
  • the second feature extraction module 604 is configured to perform feature extraction on image data and audio data in at least part of the video to be processed to obtain image features of the image data and audio features of the audio data.
  • the second recognition module 606 is configured to recognize the extracted image features and audio features, and recognize the scene category of the scene in the video to be processed.
  • the second audio processing module 608 is configured to process the audio data in the video to be processed based on the identified scene category.
  • the voice processing device 600 based on scene recognition shown in FIG. 6 corresponds to the voice processing method based on scene recognition provided in this application for recorded videos. For specific description, please refer to the above, and will not be repeated here.
  • FIG. 7 shows a schematic structural diagram of a mobile phone 10 according to some embodiments of the present application.
  • the mobile phone 10 shown in FIG. 7 may be a smartphone, including a processor 110, a power module 140, a memory 180, a mobile communication module 130, a wireless communication module 120, a sensor module 190, an audio module 150, an interface module 160, buttons 101, a touch screen 102, and so on.
  • the mobile phone 10 shown in FIG. 7 also includes at least one camera 170 for recording video or collecting images.
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the mobile phone 10.
  • the mobile phone 10 may include more or fewer components than shown, or combine certain components, or disassemble certain components, or arrange different components.
  • the illustrated components can be implemented in hardware, software, or a combination of software and hardware.
  • the audio module 150 is used to convert a digital audio signal into an analog audio signal for output, or convert an analog audio input into a digital audio signal.
  • the audio module 150 can also be used to encode and decode audio signals.
  • the audio module 150 may be disposed in the processor 110, or part of the functional modules of the audio module 150 may be disposed in the processor 110.
  • the audio module 150 may include a speaker, an earpiece, an analog microphone or a digital microphone (which can realize a sound pickup function), and a headphone interface.
  • the mobile phone 10 can receive audio signals in different scenes through a microphone, and can obtain audio data in different scenes collected by the microphone through the operating system of the mobile phone, and store the audio data in the memory space.
  • the camera 170 is used to capture still images or videos.
  • the object generates an optical image through the lens and is projected to the photosensitive element.
  • the photosensitive element converts the light signal into an electrical signal, and then transfers the electrical signal to ISP (Image Signal Processing) to convert it into a digital image signal.
  • the mobile phone 10 can implement a shooting function through an ISP, a camera 170, a video codec, a GPU (Graphic Processing Unit, graphics processor), a touch screen 102, an application processor, and the like.
  • the mobile phone 10 can use the camera 170 to record videos in multiple scenes, and can also use the camera 170 to perform live video or video calls, etc., and can also record data collected by the camera 170 in multiple different scenes. Feature extraction is performed on the image data in the video for scene recognition.
  • the processor 110 may include one or more processing units, for example, may include a central processing unit (CPU), an image processor GPU (Graphics Processing Unit), a digital signal processor DSP, and a microprocessor MCU (Micro-programmed Control Unit), AI (Artificial Intelligence) processor, or programmable logic device FPGA (Field Programmable Gate Array) and other processing modules or processing circuits. Among them, the different processing units may be independent devices or integrated in one or more processors.
  • a storage unit may be provided in the processor 110 for storing instructions and data. In some embodiments, the storage unit in the processor 110 is a cache memory 180, and a trained neural network model may be embedded in the processor 110 to perform scene recognition, face recognition, and the like.
  • the power supply module 140 may include a power supply, a power management component, and the like.
  • the power source can be a battery.
  • the power management component is used to manage the charging of the power supply and the power supply to other modules.
  • the power management component includes a charging management module and a power management module.
  • the charging management module is used to receive charging input from the charger; the power management module is used to connect to a power source, and the charging management module is connected to the processor 110.
  • the power management module receives input from the power supply and/or charging management module, and supplies power to the processor 110, the display screen 102, the camera 170, and the wireless communication module 120.
  • the mobile communication module 130 may include, but is not limited to, an antenna, a power amplifier, a filter, a low-noise amplifier (LNA), and the like.
  • the mobile communication module 130 may provide a wireless communication solution including 2G/3G/4G/5G and the like applied on the mobile phone 10.
  • the mobile communication module 130 may receive electromagnetic waves by an antenna, and perform processing such as filtering, amplifying and transmitting the received electromagnetic waves to the modem processor for demodulation.
  • the mobile communication module 130 can also amplify the signal modulated by the modem processor, and convert it into electromagnetic waves for radiation by the antenna.
  • at least part of the functional modules of the mobile communication module 130 may be provided in the processor 110.
  • at least part of the functional modules of the mobile communication module 130 and at least part of the modules of the processor 110 may be provided in the same device.
  • the wireless communication module 120 may include an antenna, and transmit and receive electromagnetic waves via the antenna.
  • the wireless communication module 120 can provide applications on the mobile phone 10 including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), bluetooth (BT), and global navigation satellite systems ( Global navigation satellite system, GNSS), frequency modulation (FM), near field communication (NFC), infrared technology (infrared, IR) and other wireless communication solutions.
  • the mobile communication module 130 and the wireless communication module 120 of the mobile phone 10 may also be located in the same module.
  • the touch screen 102 is used to display human-computer interaction interfaces, images, videos, and the like.
  • the sensor module 190 may include a proximity light sensor, a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc.
  • the interface module 160 includes an external memory interface, a universal serial bus (USB) interface, a subscriber identification module (SIM) card interface, and the like.
  • the mobile phone 10 further includes a button 101, a motor, an indicator, and the like.
  • the button 101 may include a volume button, an on/off button, and the like.
  • the motor is used to cause the mobile phone 10 to generate a vibration effect, for example, when the user's mobile phone 10 is called, it generates vibration to prompt the user to answer the incoming call of the mobile phone 10.
  • the indicator can include a laser indicator, a radio frequency indicator, an LED indicator, and so on.
  • FIG. 8 shows a block diagram of a system 800 according to some embodiments of the present application.
  • Figure 8 schematically illustrates an example system 800 according to various embodiments.
  • the system 800 may include one or more processors 804, system control logic 808 connected to at least one of the processors 804, system memory 812 connected to the system control logic 808, non-volatile memory (NVM)/storage 816 connected to the system control logic 808, and a network interface 820 connected to the system control logic 808.
  • the processor 804 may include one or more single-core or multi-core processors. In some embodiments, the processor 804 may include any combination of a general-purpose processor and a special-purpose processor (for example, a graphics processor, an application processor, a baseband processor, etc.).
  • system control logic 808 may include any suitable interface controller to provide any suitable interface to at least one of the processors 804 and/or any suitable device or component in communication with the system control logic 808.
  • system control logic 808 may include one or more memory controllers to provide an interface to the system memory 812.
  • the system memory 812 can be used to load and store data and/or instructions.
  • the memory 812 of the system 800 may include any suitable volatile memory, such as a suitable dynamic random access memory (DRAM).
  • DRAM dynamic random access memory
  • the NVM/memory 816 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions.
  • the NVM/memory 816 may include any suitable non-volatile memory, such as flash memory, and/or any suitable non-volatile storage device, such as at least one of an HDD (Hard Disk Drive), a CD (Compact Disc) drive, and a DVD (Digital Versatile Disc) drive.
  • the NVM/memory 816 may include a part of the storage resources on the device where the system 800 is installed, or it may be accessed by the device, but not necessarily a part of the device.
  • the NVM/storage 816 can be accessed through the network via the network interface 820.
  • system memory 812 and the NVM/memory 816 may respectively include: a temporary copy and a permanent copy of the instruction 824.
  • the instructions 824 may include instructions that, when executed by at least one of the processors 804, cause the system 800 to implement the methods shown in FIGS. 3-5.
  • the instructions 824, hardware, firmware, and/or software components thereof may additionally/alternatively be placed in the system control logic 808, the network interface 820, and/or the processor 804.
  • the network interface 820 may include a transceiver to provide a radio interface for the system 800, and then communicate with any other suitable devices (such as a front-end module, an antenna, etc.) through one or more networks.
  • the network interface 820 may be integrated with other components of the system 800.
  • the network interface 820 may be integrated in at least one of the processor 804, the system memory 812, the NVM/memory 816, and a firmware device (not shown) storing instructions such that, when at least one of the processors 804 executes the instructions, the system 800 implements the voice processing method based on scene recognition as shown in FIG. 4.
  • the network interface 820 may further include any suitable hardware and/or firmware to provide a multiple input multiple output radio interface.
  • the network interface 820 may be a network adapter, a wireless network adapter, a telephone modem and/or a wireless modem.
  • At least one of the processors 804 may be packaged with the logic of one or more controllers for the system control logic 808 to form a system in package (SiP). In one embodiment, at least one of the processors 804 may be integrated with the logic of one or more controllers for the system control logic 808 on the same die to form a system on chip (SoC).
  • SiP system in package
  • SoC system on chip
  • the system 800 may further include: an input/output (I/O) device 832.
  • the I/O device 832 may include a user interface to enable a user to interact with the system 800; the design of the peripheral component interface enables the peripheral component to also interact with the system 800.
  • the system 800 further includes a sensor for determining at least one of environmental conditions and location information related to the system 800.
  • FIG. 9 shows a block diagram of a SoC (System on Chip) 900.
  • the SoC 900 includes: an interconnection unit 950, which is coupled to the application processor 910; a system agent unit 970; a bus controller unit 980; an integrated memory controller unit 940; a set of one or more coprocessors 920, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 930; and a direct memory access (DMA) unit 960.
  • the coprocessor 920 includes a dedicated processor, such as, for example, a network or communication processor, a compression engine, a GPGPU, a high-throughput MIC processor, an embedded processor, or the like.
  • the various embodiments of the mechanism disclosed in this application may be implemented in hardware, software, firmware, or a combination of these implementation methods.
  • the embodiments of the present application can be implemented as a computer program or program code executed on a programmable system.
  • the programmable system includes at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
  • Program codes can be applied to input instructions to perform the functions described in this application and generate output information.
  • the output information can be applied to one or more output devices in a known manner.
  • a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.
  • DSP digital signal processor
  • ASIC application specific integrated circuit
  • the program code can be implemented in a high-level programming language or an object-oriented programming language to communicate with the processing system.
  • assembly language or machine language can also be used to implement the program code.
  • the mechanism described in this application is not limited to the scope of any particular programming language. In either case, the language can be a compiled language or an interpreted language.
  • the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof.
  • the disclosed embodiments can also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors.
  • the instructions can be distributed through a network or through other computer-readable media.
  • a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (for example, a computer), including, but not limited to, floppy disks, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or tangible machine-readable storage used to transmit information over the Internet via electrical, optical, acoustic, or other forms of propagated signals (for example, carrier waves, infrared signals, digital signals, etc.). Therefore, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (for example, a computer).
  • each unit/module mentioned in each device embodiment of this application is a logical unit/module.
  • a logical unit/module can be a physical unit/module, can be a part of a physical unit/module, or can be implemented by a combination of multiple physical units/modules.
  • the physical implementation of these logical units/modules themselves is not the most important consideration.
  • the combination of functions implemented by these logical units/modules is what is key to solving the technical problem proposed by this application.
  • furthermore, in order to highlight the innovative part of this application, the above-mentioned device embodiments of this application do not introduce units/modules that are not closely related to solving the technical problem proposed by this application; this does not mean that other units/modules do not exist in the above-mentioned device embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Studio Devices (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A speech processing method based on scene recognition, and an apparatus, medium and system thereof. The speech processing method comprises: when it is detected that an electronic device is recording video, acquiring image data and audio data in the currently recorded video (302); performing feature extraction on the image data and the audio data to obtain image features of the image data and audio features of the audio data (304); recognizing the extracted image features and audio features to identify the scene category in which the electronic device is currently recording video (306); and, based on the recognized scene category, processing the audio data in the video recorded by the electronic device in real time, and outputting the processed audio data and the corresponding image data (308).

Description

基于场景识别的语音处理方法及其装置、介质和系统
本申请要求于2020年01月15日提交中国专利局、申请号为202010043607.3、申请名称为“基于场景识别的语音处理方法及其装置、介质和系统”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能领域,特别涉及一种基于场景识别的语音处理方法及其装置、介质和系统。
背景技术
在利用录像设备进行录像、视频直播或视频通话时,因录像设备所处的环境不同,麦克风采集到的环境噪声差异较大,对目标声音的影响也不同。但是,目前业界对录像设备采集的音频的降噪,仅仅基于音频或者图像进行场景判断,对于不同的场景,例如:室内,演播厅,马路上,车里,海边,餐厅等,声音种类存在随机性和偶然性,很难准确识别声音种类,且容易造成对场景的误判,准确率低,用户体验较差。
发明内容
本申请实施例提供了一种基于场景识别的语音处理方法及其装置、介质和系统。
第一方面,本申请实施例提供了一种于场景识别的语音处理方法,所述方法包括:
在检测到所述电子设备进行视频录入的情况下,获取当前录入的视频中的图像数据和音频数据;对所述图像数据和音频数据进行特征提取,得到所述图像数据的图像特征和所述音频数据的音频特征;对提取出来的所述图像特征和音频特征进行识别,识别出所述电子设备当前录入视频所处的场景类别;基于识别出的场景类别,对所述电子设备实时录入的视频中的音频数据进行处理,并输出处理后的音频数据和对应的图像数据。如此,基于图像特征和音频特征一起进行识别,识别出的场景类型更加准确,避免仅仅通过图像特征或语音特征进行识别而造成的场景误判,提交场景识别准确率。另外,根据识别出的场景类别,对电子设备实时录入的视频中的音频数据进行处理,可以达到每个场景的最优体验,避免对不同场景下的音频数据都进行同样的处理,而出现的损伤或误处理问题。
在上述第一方面的一种可能的实现中,上述方法还包括:所述对所述图像数据和音频数据进行特征提取,得到所述图像数据的图像特征和所述音频数据的音频特征,包括:
对所述图像数据进行结构化处理得到所述图像数据的图像特征,并且对所述音频数据 进行傅里叶变换得到所述音频数据的音频特征。在一些实施例中,可以通过三维卷积神经网络模型对样本图像进行特征提取。在一些实施例中,还可以在对音频数据进行傅里叶变换之前对音频数据进行预加重、分帧等预处理。
在上述第一方面的一种可能的实现中,上述方法还包括:所述基于识别出的场景类别,对所述电子设备实时录入的视频中的音频数据进行处理,并输出处理后的音频数据和对应的图像数据,包括:
基于识别出的场景类别,选择与所述场景类别对应的降噪处理算法、均衡处理方式、自动增益控制方式和动态范围控制方式;基于选择出的降噪处理算法、均衡处理方式、自动增益控制方式和动态范围控制方式对所述电子设备实时录入的视频中的音频数据进行处理;输出处理后的音频数据和对应的图像数据。在一些实施例中,当识别出场景类别后,可以根据需要选择与场景类别对应的上述处理方法中的其中一种或几种。
在上述第一方面的一种可能的实现中,上述方法还包括:所述电子设备进行视频录入的情况包括:视频拍摄、视频直播或视频通话。
在上述第一方面的一种可能的实现中,上述方法还包括:确定出所述电子设备进行视频录入的情况为视频直播或者视频通话;
基于当前录入的视频中的音频数据,识别出进行视频直播或者视频通话的用户的人声;并且
所述基于识别出的场景类别,对所述电子设备实时录入的视频中的音频数据进行处理,并输出处理后的音频数据和对应的图像数据,包括:
基于识别出的场景类别和所述用户的人声,对所述电子设备实时录入的视频中的音频中的人声进行增强处理,对所述音频中人声以外的声音做降噪处理,并输出处理后的音频数据和对应的图像数据。如此当用户在不同的场景中进行视频拍摄、视频直播或视频通话时,可以针对不同的场景适配不同的人声增强算法,还可以对不同场景中的噪声适配不同的降噪处理方法,如此可以提高信噪比,提高用户体验。
在上述第一方面的一种可能的实现中,上述方法还包括:所述基于当前录入的视频中的音频数据,识别出进行视频直播或者视频通话的用户的人声,包括:
基于当前录入的视频中的音频数据,通过信号处理和NN网络的方法中的至少一种,识别出进行视频直播或者视频通话的用户的人声。由于在视频录入的时候,音频的随机性较大,通过上述方法进行人声识别,可以避免将视频直播或视频通话场景中其他的人声(例如过往行人的声音,围观者的声音等)误判为当前正在进行视频直播或视频通话的用户的人声,提高人声识别准确率。
在上述第一方面的一种可能的实现中,上述方法还包括:基于当前录入的视频中的图像数据,识别出进行视频直播或者视频通话的用户的人像;
所述用户的人像是通过以下方式识别出来的:
对当前录入的视频中的图像数据进行识别;
当识别出所述图像数据中对应一个人像的尺寸大于预设阈值时,识别出该人像为进行视频直播或者视频通话的用户的人像。如此,可以避免将拍摄场景中的行人误判为当前在进行视频直播或视频通话的用户,提高人像识别准确率。
第二方面,本申请实施例提供了一种于场景识别的语音处理方法,所述方法包括:
获取待处理视频;对所述待处理视频中的至少部分视频中的图像数据和音频数据进行特征提取,以得到所述图像数据的图像特征和音频数据的音频特征;对提取出来的所述图像特征和音频特征进行识别,识别出所述待处理视频中场景的场景类别;基于识别出的场景类别,对所述待处理视频中的音频数据进行处理。如此,对于已有的视频(例如通过网络下载的视频或通过手机等电子设备录制好的视频),可以通过对该视频中的场景进行识别,针对不同的场景选择对应的音频后处理方法,以使用户在观看视频的时候,获取最优的视听体验。
第三方面,本申请实施例提供了一种基于场景识别的语音处理装置,所述装置包括:
检测模块,用于在检测到所述电子设备进行视频录入的情况下,获取当前录入的视频中的图像数据和音频数据;
第一特征提取模块,用于对所述图像数据和音频数据进行特征提取,得到所述图像数据的图像特征和所述音频数据的音频特征;
第一识别模块,用于对提取出来的所述图像特征和音频特征进行识别,确定出所述电子设备当前录入视频所处的场景类别;
第一音频处理模块,用于基于识别出的场景类别,对所述电子设备实时录入的视频中的音频数据进行处理,并输出处理后的音频数据和对应的图像数据。
第四方面,本申请实施例提供了一种基于场景识别的语音处理装置,所述装置包括:
获取模块,用于获取待处理视频;
第二特征提取模块,用于对所述待处理的视频中的至少部分视频中的图像数据和音频数据进行特征提取,以得到所述图像数据的图像特征和音频数据的音频特征;
第二识别模块,用于对提取出来的所述图像特征和音频特征进行识别,识别出所述待处理的视频中场景的场景类别;
第二音频处理模块,用于基于识别出的场景类别,对所述待处理视频中的音频数据进行处理。
第五方面,本申请实施例提供一种机器可读介质,所述机器可读介质上存储有指令,该指令在机器上执行时使机器执行上述第一方面及第二方面可能的各实现中的基于场景识别的语音处理方法。
第六方面,本申请实施例提供一种系统,包括:
存储器,用于存储由系统的一个或多个处理器执行的指令,以及
处理器,是系统的处理器之一,用于执行上述第一方面及第二方面可能的各实现中的基于场景识别的语音处理方法。
附图说明
图1根据本申请的一些实施例,示出了一种语音处理场景;
图2a根据本申请的一些实施例,示出了一种采用样本图像特征对第一神经网络模型进行训练的流程图;
图2b根据本申请的一些实施例,示出了一种采用图2a所示的图像场景类别和样本音频特征对第二神经网络模型进行联合训练的流程图;
图2c根据本申请的一些实施例,示出了一种手机降噪的流程图;
图3根据本申请的一些实施例,示出了本申请提供的基于场景识别的语音处理方法的流程图;
图4根据本申请的一些实施例,示出了本申请提供的基于场景的语音处理方法应用于视频直播或视频通话时的流程图;
图5根据本申请的一些实施例,示出了一种基于场景识别的语音处理装置的结构示意图;
图6根据本申请的一些实施例,示出了另一种基于场景识别的语音处理装置的结构示意图;
图7根据本申请的一些实施例,示出了一种手机的结构示意图;
图8根据本申请的一些实施例,示出了一种系统的框图;
图9根据本申请一些实施例,示出了一种片上系统(SoC)的框图。
具体实施方式
本申请的说明性实施例包括但不限于基于场景识别的语音处理方法及其装置、介质和系统。
可以理解,如本文所使用的,术语“模块”可以指代或者包括专用集成电路(ASIC)、电子电路、执行一个或多个软件或固件程序的处理器(共享、专用、或群组)和/或存储器、组合逻辑电路、和/或提供所描述的功能的其他适当硬件组件,或者可以作为这些硬件组件的一部分。
可以理解,在本申请各实施例中,处理器可以是微处理器、数字信号处理器、微控制器等,和/或其任何组合。根据另一个方面,所述处理器可以是单核处理器,多核处理器等,和/或其任何组合。
下面将结合附图对本申请的实施例作进一步地详细描述。
根据本申请的一些实施例公开了一种语音处理场景100。图1示出了该场景的示意图。如图1所示,当用户使用手机10进行录像时,通过录像内容(例如通过手机的摄像头采集到的图像数据和麦克风采集到的音频数据)可以准确判断出用户所处的场景类别,例如判断出用户处于海边、餐厅、音乐厅、会议室、车里、马路上等不同的场景中的哪个场景。由于手机在录像时,因所处的环境不同,麦克风采集到的环境噪声差异较大,对用户感兴趣的声音的影响也不同,若对音频降噪处理不合理,会影响用户体验。因此可以针对不同的场景类别,适配不同的降噪处理算法。例如,当用户录像所在的场景为海边时,风声或海浪声较大,可以通过降噪处理算法1对风声和海浪声进行抑制,得到经由降噪处理后的音频1,以提升目标声音的信噪比;当用户录像所在的场景为餐厅时,餐具碰撞的声音和周围用餐人员的声音比较嘈杂,可以通过降噪处理算法2对餐具碰撞的声音和周围用餐人员的声音进行降噪,得到经由降噪处理后的音频2;当用户录像所在的场景为音乐厅时,为了更好的欣赏乐器或表演者的声音,可以通过降噪处理算法3对台下观众交头接耳的声音及接打电话的声音等进行降噪,得到经由降噪处理后的音频3;当用户录像所在的场景为会议室时,为了使参会者清晰地听到会上讲述者的声音,可以通过降噪处理算法4对会 议室中可能存在的窃窃私语的声音、咳嗽声、会议室周围建筑施工的声音等等进行降噪,得到经由降噪处理后的音频4。从而达到对应场景的最佳音频效果,提升用户体验。
可以理解,图1所示的语音处理场景100仅仅是示例性的,并非限制性的,在其他实施例中,用户所处的环境还可以是公园、商场等场景,在此不再一一列举。
可以理解,图1所示的实施例,是以用户通过手机进行录像为例,对本申请实施例的基于场景识别的语音处理方法进行说明,在其他实施例中,还可以将本申请的基于场景识别的语音处理方法用于当用户使用其他具有录像或具有图像采集和音频采集功能的电子设备进行视频直播或视频通话等应用中。其中,电子设备包括但不限于摄像机、平板电脑、台式计算机、可穿戴智能设备、增强现实设备、超级移动个人计算机(ultra-mobile personal computer,UMPC)或者个人数字助理(personal digital assistant,PDA)等。
下面结合图1和图2,详细介绍在一些实施例中,采用本申请的基于场景识别的语音处理方法对音频进行处理的过程。
1)场景识别模型的训练
在图2a所示的实施例中,对采集到的大量的样本图像数据进行特征提取(212),其中,大量的样本图像可以为通过手机的摄像头采集的大量的不同场景下的图像,通过对大量的样本图像中的物体进行特征提取,得到对应不同场景下的物体的样本图像特征。例如,当摄像头拍摄的视频场景为在某条马路上时,则可以对该场景中的车辆、路灯、马路等物体进行特征提取,得到对应马路场景中的车辆、路灯、马路等物体的特征数据;当摄像头拍摄的视频场景为某海边时,则可以对该场景中的海水、沙滩、船只等物体进行特征提取,得到对应海边场景中的海水、沙滩、船只等物体的特征;当摄像头拍摄的的视频场景为某个音乐厅时,则可以对该场景中的乐器、舞台、看台上的座位等物体进行特征提取,得到对应该场景中的乐器、舞台、看台上的座位等的特征。
可以理解,对样本图像进行特征提取,不局限于对物体的特征进行提取,还可以包括对样本图像的空间布局和背景等(例如天空、草地、森林等)进行特征提取。
在一些实施例中,可以通过三维卷积神经网络模型对样本图像进行特征提取,同时在时间和空间维度上进行卷积操作,以在获取样本图像中的每一帧图像的视觉特征的同时,获取相邻图像帧随时间推移的关联性,能够充分提取样本图像的特征信息。
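The paragraph above extracts spatio-temporal features from sample image frames with a three-dimensional convolutional neural network, convolving over time and space at once. A minimal sketch of such an extractor follows, assuming PyTorch; the clip length, layer sizes and the class name Clip3DFeatureExtractor are illustrative assumptions rather than values taken from this application.

```python
# Illustrative only: a tiny 3D-CNN that convolves over time and space to
# produce a clip-level feature vector. All layer sizes are hypothetical.
import torch
import torch.nn as nn

class Clip3DFeatureExtractor(nn.Module):
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            # input: (batch, 3, frames, height, width)
            nn.Conv3d(3, 16, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # keep temporal resolution early
            nn.Conv3d(16, 32, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),               # pool over time and space
        )
        self.proj = nn.Linear(32, feature_dim)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        x = self.backbone(clip)      # (batch, 32, 1, 1, 1)
        x = x.flatten(1)             # (batch, 32)
        return self.proj(x)          # (batch, feature_dim)

# Example: 8 RGB frames of 112x112 pixels from one video clip.
clip = torch.randn(1, 3, 8, 112, 112)
features = Clip3DFeatureExtractor()(clip)   # shape: (1, 128)
```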
将提取的样本图像特征数据输入第一神经网络模型进行训练(214)。在提取出样本图像的特征之后,将样本图像的特征对第一神经网络模型进行训练,将第一神经网络模型的输出与样本图像特征的期望结果进行对比,直到第一神经网络模型的输出与样本图像特征的期望结果的差值小于一定的阈值,从而完成对第一神经网络模型的训练。
其中,第一神经网络模型可以为任意一种人工神经网络模型,例如CNN(Convolutional Neural Network,卷积神经网络)、DNN(Deep Neural Networks,深度神经网络)以及RNN(Recurrent Neural Networks,循环神经网络)、BNN(Binary Neural Network,二值神经网络)等。
当第一神经网络模型训练完成后,输出对应多个不同场景的图像场景类别(216)。可以理解,在训练之前,可以对不同的拍摄场景进行分类,得到不同的拍摄场景对应的图像场景类别,将每个拍摄场景中的图像特征和图像场景类别一一对应,例如拍摄场景中有行驶的车辆,交通信号灯等物体时,可以把车辆、交通信号灯等物体的特征数据和马路场景 对应起来,将马路场景中的车辆、交通信号灯等物体的特征数据输入第一神经网络模型进行训练,使得输出能够表征马路场景的场景类别。又例如拍摄场景中有大量的餐具、用餐人员等时,可以把餐具、用餐人员等的特征数据和餐厅场景对应起来,将餐具、用餐人员等的特征数据输入第一神经网络模型进行训练,使得输出能够表征餐厅场景的图像场景类别。
参考图2b,在得到利用第一神经网络模型输出的图像场景类别后,为了避免仅仅基于图像进行场景识别带来的误判的情况,可以结合图像场景类别和从该场景中采集的音频信号对第二神经网络模型进行联合训练,以降低场景误判率。具体地,利用图像场景类别和音频对第二神经网络模型进行联合训练的过程如下:
对采集到的大量的样本音频特征进行特征提取(222)。在一些实施例中,在对采集到的大量的样本音频信号进行分析和处理之前,可以对其进行预加重、分帧、加窗等预处理操作,以保证后续音频处理得到的信号更均匀、平滑,提高音频处理质量。为了提取音频信号中的各个频率成分的音频信号的特征,可以对加窗后的音频信号进行傅里叶变换,获取音频信号的频谱和相位等样本音频特征。例如,对图1中所示的通过手机的麦克风采集的音乐厅中的各种音频分量(例如钢琴声、观众咳嗽声、鼓掌声等等)进行分帧处理,加20ms汉明窗,取10ms帧移,并逐帧进行傅里叶变换,获取各个音频分量的频谱和相位等信息。
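As an illustration of the pre-emphasis, framing, windowing and Fourier-transform steps described above, the following sketch computes per-frame magnitude and phase spectra with a 20 ms Hamming window and a 10 ms hop, matching the example in the preceding paragraph. It assumes Python/NumPy; the 0.97 pre-emphasis coefficient is a common default and not a value specified in this application.

```python
import numpy as np

def audio_features(signal: np.ndarray, sample_rate: int = 16000,
                   frame_ms: float = 20.0, hop_ms: float = 10.0,
                   pre_emphasis: float = 0.97):
    """Per-frame magnitude/phase: pre-emphasis -> framing -> Hamming window -> FFT."""
    # Pre-emphasis boosts high frequencies before analysis.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    frame_len = int(sample_rate * frame_ms / 1000)   # 320 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz
    window = np.hamming(frame_len)

    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)
    mags, phases = [], []
    for i in range(n_frames):
        frame = emphasized[i * hop_len : i * hop_len + frame_len] * window
        spectrum = np.fft.rfft(frame)
        mags.append(np.abs(spectrum))
        phases.append(np.angle(spectrum))
    return np.array(mags), np.array(phases)

# Example: 1 second of audio at 16 kHz -> about 99 frames of 161 bins each.
mag, phase = audio_features(np.random.randn(16000))
```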
将经由第一神经网络模型输出的多个图像场景类别和与每个图像场景类别对应的样本音频特征输入第二神经网络模型,对第二神经网络模型进行训练(224)。即将经由图2a所示的第一神经网络模型训练输出的多个图像场景类别和对应每个场景类别的音频信号提出的音频特征进行联合训练。例如,将对图1所示的通过手机的麦克风采集的音乐厅中的各种音频分量(例如钢琴声、观众咳嗽声、鼓掌声等等)提取的音频特征和表征音乐厅这个场景类别的数据(例如可以是按照预设的规则设定的场景标志),对目标神经网络进行训练,将第二神经网络模型的输出与期望结果进行对比,直到第二神经网络模型的输出与期望结果的差值小于一定的阈值,从而完成对第二神经网络模型的训练。如此,通过先对图像特征进行训练得到场景类别,再结合音频特征和场景类别进行联合训练,生成最终的场景类别,可以大大降低场景判断的误判率。
其中,第二神经网络模型可以为任意一种人工神经网络模型,例如CNN(Convolutional Neural Network,卷积神经网络)、DNN(Deep Neural Networks,深度神经网络)以及RNN(Recurrent Neural Networks,循环神经网络)、BNN(Binary Neural Network,二值神经网络)等。
需要注意的是,在对第二神经网络模型进行训练时,输入的音频特征所属的场景类别和输入的图像场景类别一一对应,例如输入的音频特征所属的场景为A,输入的图像场景类别为a,则第二神经网络模型训练完成后输出的场景类别也是a。
在将经由图2a所示的第一神经网络模型训练输出的多个场景类别和对应每个场景类别的音频信号提出的音频特征进行联合训练后,输出对应多个不同场景的场景类别(226)。例如,在图1所示的实施例中,在海边拍摄的视频B,提取视频B的图像特征B1和音频特征B2,基于上述的训练,在对图像特征B1进行训练后,得到的场景类别为b,然后把得到的场景类别b和音频特征B2输入第二神经网络模型进行联合训练,希望第二神经网 络模型在训练完成后输出的场景类别也是b。
可以理解,上述将样本图像特征输入第一神经网络模型进行训练,得到对应多个不同场景的图像场景类别,然后再将经由第一神经网络模型输出的多个图像场景类别和与每个图像场景类别对应的样本音频特征输入第二神经网络模型,对第二神经网络模型进行训练,以得到对应多个不同场景的最终的场景类别的描述仅仅是示例性的,并非限制性的。在一些实施例中,还可以将样本图像特征和样本音频特征同时输入同一个神经网络模型,对该模型进行联合训练,以得到对应多个不同场景的场景类别。
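Whether the fusion is done by a second network fed with the image-branch scene category, or by a single network fed with both feature sets, the core operation is to combine the two modalities before the final classification. A minimal fusion-classifier sketch with one training step is shown below, assuming PyTorch; the feature dimensions, the four scene classes and the random batch are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class AudioVisualSceneClassifier(nn.Module):
    """Concatenates the image-branch scene posterior with audio features (illustrative)."""
    def __init__(self, n_scenes: int = 4, audio_dim: int = 161, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_scenes + audio_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, n_scenes),
        )

    def forward(self, image_scene_probs, audio_feat):
        return self.net(torch.cat([image_scene_probs, audio_feat], dim=-1))

model = AudioVisualSceneClassifier()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One hypothetical training step: the target label matches the image-branch
# scene, as noted above (image scene "a" -> final label "a").
image_scene_probs = torch.softmax(torch.randn(8, 4), dim=-1)
audio_feat = torch.randn(8, 161)
labels = torch.randint(0, 4, (8,))
loss = loss_fn(model(image_scene_probs, audio_feat), labels)
loss.backward()
optimizer.step()
```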
2)在识别出待识别场景后,基于该场景进行降噪等处理
在一些实施例中,上述第一神经网络模型和第二神经网络模型可以在服务器或计算机中进行训练,当上述第一神经网络模型和第二神经网络模型训练完成后,可以将训练好的神经网络模型移植到手机中,当用户使用手机进行视频拍摄、视频直播或视频通话时,可以通过移植到手机中的神经网络模型对用户当前进行视频录入的场景进行识别,在识别出待识别场景后,基于该场景进行降噪等处理。
具体地,在图2c所示的实施例中,当用户使用手机进行视频拍摄、视频直播或视频通话时,在检测到手机进行视频录入的情况下,根据录入视频中的图像数据和音频数据对应的特征图像特征和音频特征,通过移植到手机中的神经网络模型进行识别,输出最终的场景识别标志(232)。其中,场景识别标志可以为预先设定的以区分不同场景的符号等标志。例如,海边对应的场景识别标志为s1,餐厅对应的场景识别标志为s2,音乐厅对应的场景识别标志为s3,会议室对应的场景识别标志为s4。
然后可以基于上述场景标志选择对应的降噪处理算法(234),基于该降噪处理算法对手机实时录入的视频中的音频数据进行处理(236)。例如,在图1所示的实施例中,当通过移植到手机中的神经网络模型进行识别出手机当前录入视频的场景为海边时,可以选择对应的降噪处理算法1,以抑制风噪。当通过移植到手机中的神经网络模型进行识别出手机当前录入视频的场景为餐厅时,可以选择对应的降噪处理算法2,对餐厅中的噪音分量进行强抑制。当通过移植到手机中的神经网络模型进行识别出手机当前录入视频的场景为音乐厅时,为了对演奏的音乐进行保真,可以选择放大从音乐厅采集的音频中的高频分量,减小低频分量。当通过移植到手机中的神经网络模型进行识别出手机当前录入视频的场景为会议室(比较安静)时,可以选择对应的降噪处理算法4,对会议室中的音频进行弱降噪,对与会的发言人员的声音进行保真。如此,本申请基于场景识别的的语音处理方法可以针对不同的场景自适应适配不同的降噪处理方法,从而针对不同场景能得到较好的音频处理效果,避免一套降噪处理方法要兼顾多个场景的处理而出现的损伤和误处理,提升用户体验。
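The two paragraphs above turn a recognized scene flag (for example s1 for the beach, s2 for the restaurant, s3 for the concert hall, s4 for the meeting room) into a scene-specific processing choice. A small dispatch-table sketch follows, in Python; the profile fields and numeric values are purely illustrative and are not parameters disclosed in this application.

```python
# Hypothetical per-scene processing profiles keyed by the scene flag described above
# (s1 = beach, s2 = restaurant, s3 = concert hall, s4 = meeting room).
SCENE_PROFILES = {
    "s1": {"label": "beach",        "noise_reduction_db": 18.0, "wind_suppression": True},
    "s2": {"label": "restaurant",   "noise_reduction_db": 24.0, "wind_suppression": False},
    "s3": {"label": "concert hall", "noise_reduction_db": 6.0,  "wind_suppression": False},
    "s4": {"label": "meeting room", "noise_reduction_db": 10.0, "wind_suppression": False},
}

GENERIC_PROFILE = {"label": "generic", "noise_reduction_db": 12.0, "wind_suppression": False}

def select_profile(scene_flag: str) -> dict:
    """Return the processing profile for a recognized scene flag, with a fallback."""
    return SCENE_PROFILES.get(scene_flag, GENERIC_PROFILE)

print(select_profile("s2"))   # restaurant: strongest suppression of babble noise
```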
在一些实施例中,针对音频的降噪处理方法可以采用数字信号处理方法,对手机实时录入的音频数据进行降噪处理。在一些实施例中,可以针对不同的场景,采用自适应滤波技术滤除音频信号中的噪声,例如,采用最小均方(Least Mean Square,简称LMS)自适应滤波技术进行音频降噪。
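The paragraph above names least-mean-square (LMS) adaptive filtering as one way to remove noise from the recorded audio. Below is a textbook LMS noise-canceller sketch in NumPy, which adapts filter weights on a noise reference and subtracts the predicted noise from the primary signal; the filter length and step size are illustrative.

```python
import numpy as np

def lms_denoise(noisy: np.ndarray, noise_ref: np.ndarray,
                n_taps: int = 32, mu: float = 0.01) -> np.ndarray:
    """Classic LMS noise canceller: adapt weights on a noise reference and
    subtract the predicted noise from the noisy (primary) signal."""
    w = np.zeros(n_taps)
    out = np.zeros_like(noisy)
    for n in range(n_taps, len(noisy)):
        x = noise_ref[n - n_taps:n][::-1]   # most recent reference samples
        noise_est = w @ x                   # predicted noise at time n
        e = noisy[n] - noise_est            # error = cleaned sample
        w += mu * e * x                     # LMS weight update
        out[n] = e
    return out

# Example: a 440 Hz "target" tone buried in noise, with a correlated noise reference.
t = np.arange(16000) / 16000
noise = np.random.randn(16000)
clean_estimate = lms_denoise(np.sin(2 * np.pi * 440 * t) + 0.5 * noise, noise)
```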
在图2c所示的实施例中,还可以对经由降噪处理算法处理后的音频数据进行均衡(Equalizer,简称EQ)处理(238)、自动增益控制(Automatic Gain Control,简称AGC)处理(240)和动态范围控制(Dynamic Range Control,简称DRC)处理(242),并输出 处理后的音频信号。其中,EQ是通过对音频数据中的某一个或多个频段进行增益或衰减,达到调整音色的目的。针对不同的场景可以将EQ中的频率、增益等参数进行调整,以对音频中各频段的分量进行优化。AGC通过调整不同场景对应的音频数据中的各个频段的分量的响度增益因子和增益权重,可以调整目标声音的信号大小,达到最佳听感响度。DRC通过对不同场景下的音频的幅度提供压缩和放大能力,可以使声音听起来更柔和或者更大声。
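The EQ/AGC/DRC chain described above can be pictured with the following highly simplified per-block sketch, assuming NumPy. Only the AGC and DRC stages are shown, and the target level, threshold and ratio are illustrative assumptions, not the tuned per-scene parameters discussed in the text.

```python
import numpy as np

def simple_agc(block: np.ndarray, target_rms: float = 0.1,
               max_gain: float = 10.0) -> np.ndarray:
    """Scale a block toward a target loudness (crude automatic gain control)."""
    rms = np.sqrt(np.mean(block ** 2)) + 1e-12
    gain = min(target_rms / rms, max_gain)
    return block * gain

def simple_drc(block: np.ndarray, threshold: float = 0.5,
               ratio: float = 4.0) -> np.ndarray:
    """Compress samples above the threshold (crude dynamic range control)."""
    out = block.copy()
    over = np.abs(out) > threshold
    out[over] = np.sign(out[over]) * (threshold +
                                      (np.abs(out[over]) - threshold) / ratio)
    return out

# Example: denoised audio block -> AGC -> DRC.
block = np.random.randn(1024) * 0.05
processed = simple_drc(simple_agc(block))
```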
下面详细介绍手机10采用本申请的基于场景识别的语音处理技术,对手机10接收到的音频信号进行实时降噪处理的详细过程。
例如,用户在海边使用手机10拍摄视频,在拍摄的过程中,对拍摄到的视频中的音频数据进行降噪的过程如下:
1)获取手机10实时采集的视频中的图像数据和音频数据。例如,用户在海边用手机10进行录像,其中图像数据是通过手机10的摄像头进行拍摄从而获得的,音频数据是通过手机10的麦克风采集的。
2)对上述图像数据和音频数据进行特征提取,得到图像数据的特征和音频数据的音频特征,其中,图像数据对应的图像特征是通过三维卷积神经网络模型进行提取的,音频数据对应的音频特征是采用前述傅里叶变换的方法进行提取的。
3)对提取出来的图像特征和音频特征进行识别,确定手机10当前在拍摄的场景为海边。通过移植到手机10中的已经训练好的神经网络模型对前述图像特征和音频特征进行识别,识别出用户拍摄视频的场景为海边。
4)选择对应海边场景的降噪处理算法1,以抑制风噪。并且可以对经过降噪处理算法1降噪后的音频适配海边场景下的均衡处理方式、自动增益控制方式和动态范围控制方式,对抑制风噪后的音频进行调整。如此,当用户在海边使用手机10进行拍摄视频时,拍摄的视频中的音频就没有很强的风噪,拍摄的音频效果就比较清晰悦耳,提升用户体验。
又例如,用户在餐厅使用手机10拍摄视频,在拍摄的过程中,对拍摄到的视频中的音频数据进行实时降噪的过程如下:
1)获取手机10实时采集的视频中的图像数据和音频数据。例如,用户在餐厅中用手机10进行录像,其中,图像数据是通过手机10的摄像头进行拍摄从而获得的,音频数据是通过手机10的麦克风采集的。
2)对上述图像数据和音频数据进行特征提取,得到图像数据对应的图像特征和音频数据对应的音频特征,图像数据的图像特征是通过三维卷积神经网络模型进行提取的,音频特征是采用前述傅里叶变换的方法进行提取的。
3)对提取出来的图像特征和音频特征进行识别,确定手机10当前在拍摄的场景为餐厅。通过移植到手机10中的已经训练好的神经网络模型对前述提取的图像数据的图像特征和音频特征进行识别,识别出用户拍摄视频的场景为餐厅。
4)选择对应餐厅场景的降噪处理算法2,对餐厅中的噪声进行强降噪。并且可以对经过降噪处理算法2降噪后的音频适配餐厅场景下的均衡处理方式、自动增益控制方式和动态范围控制方式,对进行强降噪后的音频进行调整。如此,当用户在餐厅使用手机10进行拍摄视频时,拍摄的视频中的嘈杂的就餐人员和餐具碰撞的声音就能够被抑制掉,提升用户体验。
可以理解,上述对用户使用手机10在海边和餐厅场景下采用本申请的基于场景识别的语音处理技术进行视频拍摄的描述仅仅是示例性的,并非限制性的。
下面详细介绍用户使用手机10采用本申请的基于场景识别的语音处理技术进行视频直播时,对手机10当前接收到的音频中的噪声和当前进行直播的用户的人声进行处理的详细过程。
1)确定出手机10进行视频录入的情况为视频直播。当检测到手机10的直播软件(例如抖音、快手、火山小视频等)被打开时,可以确定手机10当前要进行视频直播。
由于当用户在使用手机10进行视频直播时,不同场景下的环境背景噪声组成不同,对直播的影响也不同,例如当用户在餐厅中进行直播时,餐厅的噪声较强,若对餐厅的噪声进行强降噪,用户在直播时,声音会更加明显、清晰。当用户在海边进行直播时,海边的风声较大,若对风声进行强降噪,则同样会使直播的声音效果更好。因此,在用户在进行视频直播时,可以通过手机10当前录入的该场景下的视频中的图像数据和音频数据,以便后续进行对当前直播的场景进行识别,以及对当前进行直播的用户的人声、人像进行识别。
2)基于手机10当前录入的视频中的图像数据和音频数据,通过对图像数据和音频数据进行特征提取后(对图像数据和音频数据进行特征提取的具体方法可以参考上述对图2a和图2b中对图像数据和音频数据进行特征提取的方法的描述,详细描述请参见上文,在此不再赘述),得到图像特征和音频特征,通过手机10中移植的已训练好的神经网络模型对该图像特征和音频特征进行识别,确定出当前用户使用手机10进行视频直播的场景类别。并且基于前述图像特征和音频特征分别对当前进行视频直播的人像和人声进行识别(例如通过移植到手机10中的神经网络模型进行识别),在一些实施例中,可以基于当前录入的视频中的音频数据,通过信号处理和NN网络的方法中的至少一种,识别出进行视频直播或者视频通话的用户的人声。当识别出图像数据中对应一个人像的尺寸大于预设的阈值时,识别出该人像为进行视频直播或者视频通话的用户的人像。
此处需要说明的是,当用户进行视频直播时,之所以设定阈值,是为了使用户的人像和用户在进行视频直播时的场景中其他行人的人像进行区分,防止误判。
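The two paragraphs above keep only a detected portrait whose size exceeds a preset threshold, so that passers-by in the background are not mistaken for the user who is livestreaming or on a video call. A minimal selection sketch follows; the bounding-box format and the 10% area threshold are illustrative assumptions.

```python
def select_primary_portrait(detections, frame_w: int, frame_h: int,
                            min_area_ratio: float = 0.10):
    """Return the largest detected portrait whose area exceeds the threshold,
    or None if every detection is too small (e.g., background passers-by)."""
    frame_area = float(frame_w * frame_h)
    best = None
    best_ratio = min_area_ratio
    for (x, y, w, h) in detections:          # bounding boxes from any detector
        ratio = (w * h) / frame_area
        if ratio >= best_ratio:
            best, best_ratio = (x, y, w, h), ratio
    return best

# Example: two small background boxes and one large foreground portrait.
boxes = [(10, 20, 60, 120), (300, 40, 50, 100), (400, 100, 500, 800)]
print(select_primary_portrait(boxes, frame_w=1080, frame_h=1920))  # (400, 100, 500, 800)
```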
如此,可以基于识别出的场景类别和用户的人声,对手机10实时录入的视频中的音频中的人声进行增强处理,对音频中人声以外的声音做降噪等处理,并输出处理后的音频数据和对应的图像数据,如此可以提高视频直播时人声的信噪比,达到较好的直播效果。例如在图1所示的实施例中,识别出视频直播的对应的场景为海边时,可以选择对应的降噪处理算法1,以抑制风噪。当识别出视频直播的对应的场景为餐厅时,可以选择对应的降噪处理算法2,对餐厅中的噪音分量进行强抑制。以针对不同的场景对背景噪声做不同的处理,达到较好的背景声音处理效果。
在一些实施例中,若进行视频直播的用户直播的内容是唱歌,为了使其歌声更加优美,还可以采用EQ对其音频中的某一个或多个频段进行增益或衰减,以调整音色,或通过DRC对其声音的幅度进行放大或压缩,使其歌声听起来较清晰或更柔和。
以上所述为对手机10接收到的音频信号进行实时降噪处理的详细过程。可以理解,在一些实施例中,还可以采用本申请提供的基于场景识别的语音处理方法对已有的一段视频中的音频进行降噪处理,例如,用手机10等具备录像功能的电子设备录好的视频片段、 从网络上下载的视频片段以及通过监控摄像头摄制的监控视频等。
下面以对手机10录好的视频片段为例,详细介绍采用本申请提供的基于场景识别的语音处理方法对一录好的视频片段中的音频进行降噪处理的过程。
例如,用户在马路边使用手机10录制的时长为3秒钟的视频片段,对该3秒钟的视频片段中的音频数据进行降噪的过程如下:
1)对该3秒钟的视频片段中的图像数据和音频数据进行特征提取,得到图像数据的特征和音频数据的音频特征,其中,图像数据对应的图像特征可以通过三维卷积神经网络模型进行提取,音频数据对应的音频特征可以采用前述傅里叶变换的方法进行提取。
2)对提取出来的图像特征和音频特征进行识别,通过移植到手机10中的已经训练好的神经网络模型对前述图像特征和音频特征进行识别,识别出该视频片段是在马路边拍摄的。
3)选择对应马路场景的降噪处理算法,以抑制过往车辆的鸣笛声、发动机的轰鸣声等比较刺耳且嘈杂的噪声。并且可以对经过降噪处理后的音频适配马路场景下的均衡处理方式、自动增益控制方式和动态范围控制方式,对已经抑制掉车辆的鸣笛声、发动机的轰鸣声的音频进行调整。如此,当用户观看此段视频时,没有车辆的鸣笛声、发动机的轰鸣声等噪声,播放的视频中用户感兴趣的音频听起来较为清晰、舒服,提升用户体验。
下面结合图1所示的语音处理场景,对本申请实施例提供的基于场景识别的语音处理方法的流程进行详细介绍,如图3所示,具体地,包括:
1)在检测到电子设备进行视频录入的情况下,获取当前录入的视频中的图像数据和音频数据(302)。可以理解,视频中的图像数据和音频数据为在同一场景下采集到的。
2)对图像数据和音频数据进行特征提取,得到图像数据的图像特征和音频数据的音频特征(304)。可以采用与上述对图2a和图2b中对样本图像和样本音频进行特征提取的相同的方法对当前录入的视频中的图像数据和音频数据进行特征提取。详细描述请参见上文,在此不再赘述。
3)对提取出来的图像特征和音频特征进行识别,确定出电子设备当前录入视频所处的场景类别(306)。场景识别的具体过程请参见上述对图2c中的相关描述,在此不再赘述。
4)基于识别出的场景类别,对电子设备实时录入的视频中的音频数据进行处理,并输出处理后的音频数据和对应的图像数据(308)。其中,针对音频的降噪处理方法可以采用数字信号处理方法,对待处理的音频进行降噪处理。也可以针对不同的场景,可以采用自适应滤波技术滤除音频信号中的噪声,例如,采用最小均方(Least Mean Square,简称LMS)自适应滤波技术进行音频降噪。在对音频进行降噪处理后,还可以对经由降噪处理算法处理后的音频信号至少进行均衡处理、自动增益控制处理和动态范围控制处理中的至少一种处理,并输出处理后的音频信号。
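Putting steps 302-308 together, the sketch below outlines the end-to-end flow: acquire frames and audio, extract features from each modality, classify the scene, and apply scene-dependent audio processing. Every function here is a simplified stand-in for the corresponding component described above, and the two-class label set and thresholds are illustrative assumptions.

```python
import numpy as np

def extract_image_features(frames: np.ndarray) -> np.ndarray:
    return frames.mean(axis=(0, 2, 3))             # stand-in for the 3D-CNN branch

def extract_audio_features(audio: np.ndarray) -> np.ndarray:
    spectrum = np.abs(np.fft.rfft(audio * np.hamming(len(audio))))
    return spectrum / (spectrum.max() + 1e-12)     # stand-in for framing + FFT

def classify_scene(img_feat: np.ndarray, aud_feat: np.ndarray) -> str:
    score = img_feat.mean() + aud_feat.mean()      # stand-in for the fused classifier
    return "s1" if score > 0.5 else "s2"           # s1 = beach, s2 = restaurant

def process_audio(audio: np.ndarray, scene: str) -> np.ndarray:
    attenuation = 0.3 if scene == "s2" else 0.7    # stronger suppression for s2
    return audio * attenuation                     # stand-in for scene-specific denoising

frames = np.random.rand(8, 3, 112, 112)            # step 302: captured video frames
audio = np.random.randn(320)                       # step 302: one 20 ms audio frame
scene = classify_scene(extract_image_features(frames),   # steps 304-306
                       extract_audio_features(audio))
out_audio = process_audio(audio, scene)            # step 308
```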
下面结合图1所示的语音处理场景,对本申请实施例提供的基于场景识别的语音处理方法应用于视频直播或视频通话的处理过程进行详细介绍,如图4所示,具体地,包括:
1)确定出电子设备进行视频录入的情况为视频直播或者视频通话(402)。例如,当检测到手机10的直播软件(例如抖音、快手、火山小视频等)被打开时,可以确定手机10当前在进行视频直播。当检测到手机10的即时通信软件(如微信、QQ、网络电话(VOIP)、Skype,Face time等)被打开时,可以确定手机10当前在进行视频通话。
2)对图像数据和音频数据进行特征提取,得到图像数据的图像特征和音频数据的音频特征(404)。其中,对图像数据和音频数据进行特征提取的具体方法可以参考上述对图2a和图2b中对样本图像和样本音频进行特征提取的方法。详细描述请参见上文,在此不再赘述。
3)对提取出来的图像特征和音频特征进行识别,确定出电子设备当前进行视频直播或视频通话的场景类别,并且识别出进行视频直播或视频通话的用户的人声(406)。其中,场景识别方法可以采用和上述图2c中相同的方法,用户人声识别的方法可以采用上述对手机10当前接收到的音频中的噪声和当前进行直播的用户的人声进行处理的详细介绍中相同的方法,详细描述请参见上文,在此不再赘述。
4)基于识别出的场景类别和用户的人声,对电子设备实时录入的视频中的音频中的人声进行增强处理,对音频中人声以外的声音做降噪处理,并输出处理后的音频数据和对应的图像数据(408)。如此可以提高视频直播或视频通话时人声的信噪比,提升用户体验。
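As a rough illustration of step 408, the sketch below boosts frequency bins flagged as the user's voice and attenuates the remaining bins. In practice the mask would come from the signal-processing or NN-based voice identification of step 406; the fixed 100-4000 Hz band and the gain values used here are illustrative assumptions.

```python
import numpy as np

def enhance_voice(frame: np.ndarray, voice_mask: np.ndarray,
                  voice_gain: float = 1.5, noise_gain: float = 0.3) -> np.ndarray:
    """Boost frequency bins flagged as the user's voice, attenuate the rest."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    gains = np.where(voice_mask, voice_gain, noise_gain)
    return np.fft.irfft(spectrum * gains, n=len(frame))

# Example: treat 100-4000 Hz as "voice" bins for a 20 ms frame at 16 kHz.
frame = np.random.randn(320)
freqs = np.fft.rfftfreq(320, d=1 / 16000)
mask = (freqs >= 100) & (freqs <= 4000)
enhanced = enhance_voice(frame, mask)
```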
图5根据本申请的一些实施例,提供了一种基于场景识别的语音处理装置500的结构示意图。如图5所示,基于场景识别的语音处理装置500包括:
检测模块502,用于在检测到电子设备进行视频录入的情况下,获取当前录入的视频中的图像数据和音频数据。
特征提取模块504,用于对图像数据和音频数据进行特征提取,得到图像数据的图像特征和音频数据的音频特征。
识别模块506,用于对提取出来的图像特征和音频特征进行识别,确定出电子设备当前录入视频所处的场景类别。
音频处理模块508,用于基于识别出的场景类别,对电子设备实时录入的视频中的音频数据进行处理,并输出处理后的音频数据和对应的图像数据。
可以理解,图5所示的基于场景识别的语音处理装置500与本申请提供的基于场景识别的语音处理方法相对应,以上关于本申请提供的基于场景识别的语音处理方法的具体描述中的技术细节依然适用于图5所示的基于场景识别的语音处理装置500,具体描述请参见上文,在此不再赘述。
图6根据本申请的一些实施例,提供了一种基于场景识别的语音处理装置600的结构示意图。如图6所示,基于场景识别的语音处理装置600包括:
获取模块602,用于获取待处理视频数据。
第二特征提取模块604,用于对待处理视频中的至少部分视频中的图像数据和音频数据进行特征提取,以得到图像数据的图像特征和音频数据的音频特征。
第二识别模块606,用于对提取出来的图像特征和音频特征进行识别,识别出待处理视频中场景的场景类别。
第二音频处理模块608,用于基于识别出的场景类别,对待处理视频中的音频数据进行处理。
图6所示的基于场景识别的语音处理装置600与针对录好的视频采用本申请提供的基于场景识别的语音处理方法相对应,具体描述请参见上文,在此不再赘述。
图7根据本申请的一些实施例,示出了一种手机10的结构示意图。如图7所示的手机10可以是一台智能手机,包括处理器110、电源模块140、存储器180,移动通信模块 130、无线通信模块120、传感器模块190、音频模块150、接口模块160、按键101以及触摸显示屏102等。图7所示的手机10还包括至少一个摄像头170,用以录入视频或采集图像。
可以理解的是,本发明实施例示意的结构并不构成对手机10的具体限定。在本申请另一些实施例中,手机10可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
音频模块150用于将数字音频信号转换成模拟音频信号输出,或者将模拟音频输入转换为数字音频信号。音频模块150还可以用于对音频信号编码和解码。在一些实施例中,音频模块150可以设置于处理器110中,或将音频模块150的部分功能模块设置于处理器110中。在一些实施例中,音频模块150可以包括扬声器、听筒、模拟的麦克风或数字的麦克风(可实现拾音功能)以及耳机接口。在本申请实施例中,手机10可以通过麦克风接收不同场景中的音频信号,并且可通过手机的操作系统获取到麦克风采集到的不同场景下的音频数据,并保存在内存空间上。
摄像头170用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件把光信号转换成电信号,之后将电信号传递给ISP(Image Signal Processing,图像信号处理)转换成数字图像信号。手机10可以通过ISP,摄像头170,视频编解码器,GPU(Graphic Processing Unit,图形处理器),触摸显示屏102以及应用处理器等实现拍摄功能。在本申请实施例中,手机10可以通过摄像头170在多个场景下进行录像,也可以通过打开摄像头170进行视频直播或视频通话等,并且还可以对通过摄像头170在多个不同场景下采集到的视频中的图像数据进行特征提取,以进行场景识别。
处理器110可以包括一个或多个处理单元,例如,可以包括中央处理器CPU(Central Processing Unit)、图像处理器GPU(Graphics Processing Unit)、数字信号处理器DSP、微处理器MCU(Micro-programmed Control Unit)、AI(Artificial Intelligence,人工智能)处理器或可编程逻辑器件FPGA(Field Programmable Gate Array)等的处理模块或处理电路。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。处理器110中可以设置存储单元,用于存储指令和数据。在一些实施例中,处理器110中的存储单元为高速缓冲存储器180,处理器110中可以内嵌已训练好的神经网络模型已进行场景识别、人脸人像识别等。
电源模块140可以包括电源、电源管理部件等。电源可以为电池。电源管理部件用于管理电源的充电和电源向其他模块的供电。在一些实施例中,电源管理部件包括充电管理模块和电源管理模块。充电管理模块用于从充电器接收充电输入;电源管理模块用于连接电源,充电管理模块与处理器110。电源管理模块接收电源和/或充电管理模块的输入,为处理器110,显示屏102,摄像头170,及无线通信模块120等供电。
移动通信模块130可以包括但不限于天线、功率放大器、滤波器、LNA(Low noise amplify,低噪声放大器)等。移动通信模块130可以提供应用在手机10上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块130可以由天线接收电磁波,并对接收的电磁波进行滤波,放大等处理,传送至调制解调处理器进行解调。移动通信模块130还可以对经调制解调处理器调制后的信号放大,经天线转为电磁波辐射出去。在一些实施 例中,移动通信模块130的至少部分功能模块可以被设置于处理器110中。在一些实施例中,移动通信模块130至少部分功能模块可以与处理器110的至少部分模块被设置在同一个器件中。
无线通信模块120可以包括天线,并经由天线实现对电磁波的收发。无线通信模块120可以提供应用在手机10上的包括无线局域网(wireless localarea networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),全球导航卫星系统(global navigation satellite system,GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。手机10可以通过无线通信技术与网络以及其他设备进行通信。在一些实施例中,手机10的移动通信模块130和无线通信模块120也可以位于同一模块中。
触摸显示屏102用于显示人机交互界面、图像、视频等。传感器模块190可以包括接近光传感器、压力传感器,陀螺仪传感器,气压传感器,磁传感器,加速度传感器,距离传感器,指纹传感器,温度传感器,触摸传感器,环境光传感器,骨传导传感器等。接口模块160包括外部存储器接口、通用串行总线(universal serial bus,USB)接口及用户标识模块(subscriber identification module,SIM)卡接口等。
在一些实施例中,手机10还包括按键101、马达以及指示器等。其中,按键101可以包括音量键、开/关机键等。马达用于使手机10产生振动效果,例如在用户的手机10被呼叫的时候产生振动,以提示用户接听手机10来电。指示器可以包括激光指示器、射频指示器、LED指示器等。
图8所示为根据本申请的一些实施例的系统800的框图。图8示意性地示出了根据多个实施例的示例系统800。在一些实施例中,系统800可以包括一个或多个处理器804,与处理器804中的至少一个连接的系统控制逻辑808,与系统控制逻辑808连接的系统内存812,与系统控制逻辑808连接的非易失性存储器(NVM)816,以及与系统控制逻辑808连接的网络接口820。
在一些实施例中,处理器804可以包括一个或多个单核或多核处理器。在一些实施例中,处理器804可以包括通用处理器和专用处理器(例如,图形处理器,应用处理器,基带处理器等)的任意组合。
在一些实施例中,系统控制逻辑808可以包括任意合适的接口控制器,以向处理器804中的至少一个和/或与系统控制逻辑808通信的任意合适的设备或组件提供任意合适的接口。
在一些实施例中,系统控制逻辑808可以包括一个或多个存储器控制器,以提供连接到系统内存812的接口。系统内存812可以用于加载以及存储数据和/或指令。在一些实施例中系统800的内存812可以包括任意合适的易失性存储器,例如合适的动态随机存取存储器(DRAM)。
NVM/存储器816可以包括用于存储数据和/或指令的一个或多个有形的、非暂时性的计算机可读介质。在一些实施例中,NVM/存储器816可以包括闪存等任意合适的非易失性存储器和/或任意合适的非易失性存储设备,例如HDD(Hard Disk Drive,硬盘驱动器),CD(Compact Disc,光盘)驱动器,DVD(Digital Versatile Disc,数字通用光盘)驱动器中的至少一个。
NVM/存储器816可以包括安装系统800的装置上的一部分存储资源,或者它可以由设备访问,但不一定是设备的一部分。例如,可以经由网络接口820通过网络访问NVM/存储816。
特别地,系统内存812和NVM/存储器816可以分别包括:指令824的暂时副本和永久副本。指令824可以包括:由处理器804中的至少一个执行时导致系统800实施如图3-5所示的方法的指令。在一些实施例中,指令824、硬件、固件和/或其软件组件可另外地/替代地置于系统控制逻辑808,网络接口820和/或处理器804中。
网络接口820可以包括收发器,用于为系统800提供无线电接口,进而通过一个或多个网络与任意其他合适的设备(如前端模块,天线等)进行通信。在一些实施例中,网络接口820可以集成于系统800的其他组件。例如,网络接口820可以集成于处理器804,系统内存812,NVM/存储器816,和具有指令的固件设备(未示出)中的至少一种,当处理器804中的至少一个执行所述指令时,系统800实现如图4所示的基于场景识别的语音处理方法。
网络接口820可以进一步包括任意合适的硬件和/或固件,以提供多输入多输出无线电接口。例如,网络接口820可以是网络适配器,无线网络适配器,电话调制解调器和/或无线调制解调器。
在一个实施例中,处理器804中的至少一个可以与用于系统控制逻辑808的一个或多个控制器的逻辑封装在一起,以形成系统封装(SiP)。在一个实施例中,处理器804中的至少一个可以与用于系统控制逻辑808的一个或多个控制器的逻辑集成在同一管芯上,以形成片上系统(SoC)。
系统800可以进一步包括:输入/输出(I/O)设备832。I/O设备832可以包括用户界面,使得用户能够与系统800进行交互;外围组件接口的设计使得外围组件也能够与系统800交互。在一些实施例中,系统800还包括传感器,用于确定与系统800相关的环境条件和位置信息的至少一种。
根据本申请的实施例,图9示出了一种SoC(System on Chip,片上系统)900的框图。在图9中,相似的部件具有同样的附图标记。另外,虚线框是更先进的SoC的可选特征。在图9中,SoC 900包括:互连单元950,其被耦合至应用处理器910;系统代理单元970;总线控制器单元980;集成存储器控制器单元940;一组或一个或多个协处理器920,其可包括集成图形逻辑、图像处理器、音频处理器和视频处理器;静态随机存取存储器(SRAM)单元930;直接存储器存取(DMA)单元960。在一个实施例中,协处理器920包括专用处理器,诸如例如网络或通信处理器、压缩引擎、GPGPU、高吞吐量MIC处理器、或嵌入式处理器等等。
本申请公开的机制的各实施例可以被实现在硬件、软件、固件或这些实现方法的组合中。本申请的实施例可实现为在可编程系统上执行的计算机程序或程序代码,该可编程系统包括至少一个处理器、存储系统(包括易失性和非易失性存储器和/或存储元件)、至少一个输入设备以及至少一个输出设备。
可将程序代码应用于输入指令,以执行本申请描述的各功能并生成输出信息。可以按已知方式将输出信息应用于一个或多个输出设备。为了本申请的目的,处理系统包括具有诸如例如数字信号处理器(DSP)、微控制器、专用集成电路(ASIC)或微处理器之类的 处理器的任何系统。
程序代码可以用高级程序化语言或面向对象的编程语言来实现,以便与处理系统通信。在需要时,也可用汇编语言或机器语言来实现程序代码。事实上,本申请中描述的机制不限于任何特定编程语言的范围。在任一情形下,该语言可以是编译语言或解释语言。
在一些情况下,所公开的实施例可以以硬件、固件、软件或其任何组合来实现。所公开的实施例还可以被实现为由一个或多个暂时或非暂时性机器可读(例如,计算机可读)存储介质承载或存储在其上的指令,其可以由一个或多个处理器读取和执行。例如,指令可以通过网络或通过其他计算机可读介质分发。因此,机器可读介质可以包括用于以机器(例如,计算机)可读的形式存储或传输信息的任何机制,包括但不限于,软盘、光盘、光碟、只读存储器(CD-ROMs)、磁光盘、只读存储器(ROM)、随机存取存储器(RAM)、可擦除可编程只读存储器(EPROM)、电可擦除可编程只读存储器(EEPROM)、磁卡或光卡、闪存、或用于利用因特网以电、光、声或其他形式的传播信号来传输信息(例如,载波、红外信号数字信号等)的有形的机器可读存储器。因此,机器可读介质包括适合于以机器(例如,计算机)可读的形式存储或传输电子指令或信息的任何类型的机器可读介质。
在附图中,可以以特定布置和/或顺序示出一些结构或方法特征。然而,应该理解,可能不需要这样的特定布置和/或排序。而是,在一些实施例中,这些特征可以以不同于说明性附图中所示的方式和/或顺序来布置。另外,在特定图中包括结构或方法特征并不意味着暗示在所有实施例中都需要这样的特征,并且在一些实施例中,可以不包括这些特征或者可以与其他特征组合。
需要说明的是,本申请各设备实施例中提到的各单元/模块都是逻辑单元/模块,在物理上,一个逻辑单元/模块可以是一个物理单元/模块,也可以是一个物理单元/模块的一部分,还可以以多个物理单元/模块的组合实现,这些逻辑单元/模块本身的物理实现方式并不是最重要的,这些逻辑单元/模块所实现的功能的组合才是解决本申请所提出的技术问题的关键。此外,为了突出本申请的创新部分,本申请上述各设备实施例并没有将与解决本申请所提出的技术问题关系不太密切的单元/模块引入,这并不表明上述设备实施例并不存在其它的单元/模块。
需要说明的是,在本专利的示例和说明书中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
虽然通过参照本申请的某些优选实施例,已经对本申请进行了图示和描述,但本领域的普通技术人员应该明白,可以在形式上和细节上对其作各种改变,而不偏离本申请的精神和范围。

Claims (12)

  1. A speech processing method based on scene recognition for an electronic device, characterized in that the method comprises:
    when it is detected that the electronic device is recording video, acquiring image data and audio data in the currently recorded video;
    performing feature extraction on the image data and the audio data to obtain image features of the image data and audio features of the audio data;
    recognizing the extracted image features and audio features to identify the scene category in which the electronic device is currently recording video;
    based on the recognized scene category, processing the audio data in the video recorded by the electronic device in real time, and outputting the processed audio data and the corresponding image data.
  2. The speech processing method based on scene recognition according to claim 1, characterized in that the performing feature extraction on the image data and the audio data to obtain image features of the image data and audio features of the audio data comprises:
    performing structured processing on the image data to obtain the image features of the image data, and performing a Fourier transform on the audio data to obtain the audio features of the audio data.
  3. The speech processing method based on scene recognition according to claim 1, characterized in that the processing, based on the recognized scene category, the audio data in the video recorded by the electronic device in real time, and outputting the processed audio data and the corresponding image data comprises:
    based on the recognized scene category, selecting a noise reduction processing algorithm, an equalization processing mode, an automatic gain control mode and a dynamic range control mode corresponding to the scene category;
    processing the audio data in the video recorded by the electronic device in real time based on the selected noise reduction processing algorithm, equalization processing mode, automatic gain control mode and dynamic range control mode;
    outputting the processed audio data and the corresponding image data.
  4. The speech processing method based on scene recognition according to claim 1, characterized in that the cases in which the electronic device records video include: video shooting, video livestreaming or a video call.
  5. The speech processing method based on scene recognition according to claim 4, characterized by further comprising:
    determining that the case in which the electronic device records video is video livestreaming or a video call;
    identifying, based on the audio data in the currently recorded video, the voice of the user who is livestreaming or on the video call; and
    the processing, based on the recognized scene category, the audio data in the video recorded by the electronic device in real time, and outputting the processed audio data and the corresponding image data comprises:
    based on the recognized scene category and the user's voice, enhancing the voice in the audio of the video recorded by the electronic device in real time, performing noise reduction on sounds other than the voice in the audio data, and outputting the processed audio data and the corresponding image data.
  6. The speech processing method based on scene recognition according to claim 5, characterized in that the identifying, based on the audio data in the currently recorded video, the voice of the user who is livestreaming or on the video call comprises:
    identifying, based on the audio data in the currently recorded video, the voice of the user who is livestreaming or on the video call through at least one of a signal processing method and an NN network method.
  7. The speech processing method based on scene recognition according to claim 5, characterized in that the method further comprises: identifying, based on the image data in the currently recorded video, the portrait of the user who is livestreaming or on the video call;
    the user's portrait is identified in the following manner:
    recognizing the image data in the currently recorded video;
    when it is recognized that the size of a portrait in the image data is greater than a preset threshold, identifying that portrait as the portrait of the user who is livestreaming or on the video call.
  8. A speech processing method based on scene recognition, characterized in that the method comprises:
    acquiring a video to be processed;
    performing feature extraction on image data and audio data in at least part of the video to be processed, to obtain image features of the image data and audio features of the audio data;
    recognizing the extracted image features and audio features to identify the scene category of the scene in the video to be processed;
    based on the recognized scene category, processing the audio data in the video to be processed.
  9. A speech processing apparatus based on scene recognition, characterized in that the apparatus comprises:
    a detection module, configured to acquire image data and audio data in the currently recorded video when it is detected that the electronic device is recording video;
    a first feature extraction module, configured to perform feature extraction on the image data and the audio data to obtain image features of the image data and audio features of the audio data;
    a first recognition module, configured to recognize the extracted image features and audio features to determine the scene category in which the electronic device is currently recording video;
    a first audio processing module, configured to process, based on the recognized scene category, the audio data in the video recorded by the electronic device in real time, and output the processed audio data and the corresponding image data.
  10. A speech processing apparatus based on scene recognition, characterized in that the apparatus comprises:
    an acquisition module, configured to acquire a video to be processed;
    a second feature extraction module, configured to perform feature extraction on image data and audio data in at least part of the video to be processed, to obtain image features of the image data and audio features of the audio data;
    a second recognition module, configured to recognize the extracted image features and audio features to identify the scene category of the scene in the video to be processed;
    a second audio processing module, configured to process, based on the recognized scene category, the audio data in the video to be processed.
  11. A machine-readable medium, characterized in that instructions are stored on the machine-readable medium, and when the instructions are executed on a machine, the machine is caused to perform the speech processing method based on scene recognition according to any one of claims 1 to 8.
  12. A system, comprising:
    a memory, configured to store instructions to be executed by one or more processors of the system, and
    a processor, which is one of the processors of the system, configured to perform the speech processing method based on scene recognition according to any one of claims 1 to 8.
PCT/CN2021/070509 2020-01-15 2021-01-06 基于场景识别的语音处理方法及其装置、介质和系统 WO2021143599A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010043607.3 2020-01-15
CN202010043607.3A CN113129917A (zh) 2020-01-15 2020-01-15 基于场景识别的语音处理方法及其装置、介质和系统

Publications (1)

Publication Number Publication Date
WO2021143599A1 true WO2021143599A1 (zh) 2021-07-22

Family

ID=76772173

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/070509 WO2021143599A1 (zh) 2020-01-15 2021-01-06 基于场景识别的语音处理方法及其装置、介质和系统

Country Status (2)

Country Link
CN (1) CN113129917A (zh)
WO (1) WO2021143599A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113946222A (zh) * 2021-11-17 2022-01-18 杭州逗酷软件科技有限公司 一种控制方法、电子设备及计算机存储介质
CN114333436A (zh) * 2021-12-17 2022-04-12 云南电网有限责任公司昆明供电局 一种将变压器套管将军帽结构分解重组培训装置
CN116631063A (zh) * 2023-05-31 2023-08-22 武汉星巡智能科技有限公司 基于用药行为识别的老人智能看护方法、装置及设备
CN117116302A (zh) * 2023-10-24 2023-11-24 深圳市齐奥通信技术有限公司 一种在复杂场景下的音频数据分析方法、系统及存储介质
CN117118956A (zh) * 2023-10-25 2023-11-24 腾讯科技(深圳)有限公司 音频处理方法、装置、电子设备及计算机可读存储介质
WO2024059536A1 (en) * 2022-09-13 2024-03-21 Dolby Laboratories Licensing Corporation Audio-visual analytic for object rendering in capture

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986689A (zh) * 2020-07-30 2020-11-24 维沃移动通信有限公司 音频播放方法、音频播放装置和电子设备
CN114121033B (zh) * 2022-01-27 2022-04-26 深圳市北海轨道交通技术有限公司 基于深度学习的列车广播语音增强方法和系统
CN116989884A (zh) * 2022-04-26 2023-11-03 荣耀终端有限公司 异音检测方法、电子设备和存储介质
WO2023230782A1 (zh) * 2022-05-30 2023-12-07 北京小米移动软件有限公司 一种音效控制方法、装置及存储介质
CN117133311B (zh) * 2023-02-09 2024-05-10 荣耀终端有限公司 音频场景识别方法及电子设备
CN117238322B (zh) * 2023-11-10 2024-01-30 深圳市齐奥通信技术有限公司 一种基于智能感知的自适应语音调控方法及系统

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140368700A1 (en) * 2013-06-12 2014-12-18 Technion Research And Development Foundation Ltd. Example-based cross-modal denoising
CN105556593A (zh) * 2013-03-12 2016-05-04 谷歌技术控股有限责任公司 预处理音频信号的方法和设备
CN105872205A (zh) * 2016-03-18 2016-08-17 联想(北京)有限公司 一种信息处理方法及装置
CN107077859A (zh) * 2014-10-31 2017-08-18 英特尔公司 针对音频处理的基于环境的复杂度减小
CN109286772A (zh) * 2018-09-04 2019-01-29 Oppo广东移动通信有限公司 音效调整方法、装置、电子设备以及存储介质
CN109599107A (zh) * 2018-12-07 2019-04-09 珠海格力电器股份有限公司 一种语音识别的方法、装置及计算机存储介质

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090076816A1 (en) * 2007-09-13 2009-03-19 Bionica Corporation Assistive listening system with display and selective visual indicators for sound sources
WO2014194273A2 (en) * 2013-05-30 2014-12-04 Eisner, Mark Systems and methods for enhancing targeted audibility
CN108305616B (zh) * 2018-01-16 2021-03-16 国家计算机网络与信息安全管理中心 一种基于长短时特征提取的音频场景识别方法及装置
CN110049403A (zh) * 2018-01-17 2019-07-23 北京小鸟听听科技有限公司 一种基于场景识别的自适应音频控制装置和方法
CN109817236A (zh) * 2019-02-01 2019-05-28 安克创新科技股份有限公司 基于场景的音频降噪方法、装置、电子设备和存储介质
CN110147711B (zh) * 2019-02-27 2023-11-14 腾讯科技(深圳)有限公司 视频场景识别方法、装置、存储介质和电子装置
CN110225285B (zh) * 2019-04-16 2022-09-02 深圳壹账通智能科技有限公司 音视频通信方法、装置、计算机装置、及可读存储介质
CN110473568B (zh) * 2019-08-08 2022-01-07 Oppo广东移动通信有限公司 场景识别方法、装置、存储介质及电子设备

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105556593A (zh) * 2013-03-12 2016-05-04 谷歌技术控股有限责任公司 预处理音频信号的方法和设备
US20140368700A1 (en) * 2013-06-12 2014-12-18 Technion Research And Development Foundation Ltd. Example-based cross-modal denoising
CN107077859A (zh) * 2014-10-31 2017-08-18 英特尔公司 针对音频处理的基于环境的复杂度减小
CN105872205A (zh) * 2016-03-18 2016-08-17 联想(北京)有限公司 一种信息处理方法及装置
CN109286772A (zh) * 2018-09-04 2019-01-29 Oppo广东移动通信有限公司 音效调整方法、装置、电子设备以及存储介质
CN109599107A (zh) * 2018-12-07 2019-04-09 珠海格力电器股份有限公司 一种语音识别的方法、装置及计算机存储介质

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113946222A (zh) * 2021-11-17 2022-01-18 杭州逗酷软件科技有限公司 一种控制方法、电子设备及计算机存储介质
CN114333436A (zh) * 2021-12-17 2022-04-12 云南电网有限责任公司昆明供电局 一种将变压器套管将军帽结构分解重组培训装置
CN114333436B (zh) * 2021-12-17 2024-03-05 云南电网有限责任公司昆明供电局 一种将变压器套管将军帽结构分解重组培训装置
WO2024059536A1 (en) * 2022-09-13 2024-03-21 Dolby Laboratories Licensing Corporation Audio-visual analytic for object rendering in capture
CN116631063A (zh) * 2023-05-31 2023-08-22 武汉星巡智能科技有限公司 基于用药行为识别的老人智能看护方法、装置及设备
CN116631063B (zh) * 2023-05-31 2024-05-07 武汉星巡智能科技有限公司 基于用药行为识别的老人智能看护方法、装置及设备
CN117116302A (zh) * 2023-10-24 2023-11-24 深圳市齐奥通信技术有限公司 一种在复杂场景下的音频数据分析方法、系统及存储介质
CN117116302B (zh) * 2023-10-24 2023-12-22 深圳市齐奥通信技术有限公司 一种在复杂场景下的音频数据分析方法、系统及存储介质
CN117118956A (zh) * 2023-10-25 2023-11-24 腾讯科技(深圳)有限公司 音频处理方法、装置、电子设备及计算机可读存储介质
CN117118956B (zh) * 2023-10-25 2024-01-19 腾讯科技(深圳)有限公司 音频处理方法、装置、电子设备及计算机可读存储介质

Also Published As

Publication number Publication date
CN113129917A (zh) 2021-07-16

Similar Documents

Publication Publication Date Title
WO2021143599A1 (zh) 基于场景识别的语音处理方法及其装置、介质和系统
CN112400325B (zh) 数据驱动的音频增强
JP6464449B2 (ja) 音源分離装置、及び音源分離方法
WO2020078237A1 (zh) 音频处理方法和电子设备
CN108877823B (zh) 语音增强方法和装置
Zhao et al. Audio splicing detection and localization using environmental signature
JP2015508205A (ja) 音識別に基づくモバイルデバイスの制御
WO2021244056A1 (zh) 一种数据处理方法、装置和可读介质
US20230164509A1 (en) System and method for headphone equalization and room adjustment for binaural playback in augmented reality
US11636866B2 (en) Transform ambisonic coefficients using an adaptive network
WO2022007846A1 (zh) 语音增强方法、设备、系统以及存储介质
JP2016144080A (ja) 情報処理装置、情報処理システム、情報処理方法及びプログラム
TWI687917B (zh) 語音系統及聲音偵測方法
CN110049409B (zh) 用于全息影像的动态立体声调节方法及装置
CN113542466A (zh) 音频处理方法、电子设备及存储介质
CN113362849A (zh) 一种语音数据处理方法以及装置
WO2024093460A1 (zh) 语音检测方法及其相关设备
CN115424628B (zh) 一种语音处理方法及电子设备
CN114758669B (zh) 音频处理模型的训练、音频处理方法、装置及电子设备
US20240135944A1 (en) Controlling local rendering of remote environmental audio
CN111696565B (zh) 语音处理方法、装置和介质
CN111696564B (zh) 语音处理方法、装置和介质
EP4362502A1 (en) Controlling local rendering of remote environmental audio
TW202226222A (zh) 外接式智能音訊降噪裝置
WO2022068675A1 (zh) 发声者语音抽取方法、装置、存储介质及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21741134

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21741134

Country of ref document: EP

Kind code of ref document: A1