WO2015117403A1 - Noise suppression method, device, computer program, and computer storage medium - Google Patents

Noise suppression method, device, computer program, and computer storage medium

Info

Publication number
WO2015117403A1
Authority
WO
WIPO (PCT)
Prior art keywords
video stream
noise
stream data
audio data
pure
Prior art date
Application number
PCT/CN2014/089335
Other languages
English (en)
French (fr)
Inventor
颜蓓
Original Assignee
中兴通讯股份有限公司
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2015117403A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering

Definitions

  • The present invention relates to the field of mobile communications, and in particular to a noise suppression method, a device, a computer program, and a computer storage medium.
  • A very important indicator for evaluating the performance of a smart terminal is how well the terminal suppresses background noise.
  • Ideally, when a mobile phone user is in a very noisy environment, the speech heard by the other party is very clear and the background noise is almost completely filtered out.
  • Most smart terminals currently on the market use multi-microphone noise suppression schemes, of which the dual-microphone scheme is the most widely used.
  • The dual-microphone scheme assumes that the main microphone on the front panel of the phone picks up noise and speech, while the auxiliary microphone on the rear panel picks up noise. The two signals are processed by an algorithm to obtain the speech-only part, which is then transmitted to the other party's phone, so that the other party receives speech with the background noise suppressed.
  • Embodiments of the invention provide a noise suppression method, device, computer program, and computer storage medium that solve the problem of the narrow application range of existing noise suppression approaches.
  • A noise suppression method, comprising:
  • during a voice call, synchronously collecting video stream data that records the user's mouth motion and audio data that records the user's call;
  • determining, according to the video stream data, a speech-plus-noise portion and a pure ambient-noise portion of the audio data;
  • processing the speech-plus-noise portion and the pure ambient-noise portion separately to obtain a clean speech component.
  • Optionally, synchronously collecting the video stream data that records the user's mouth motion and the audio data that records the user's call comprises:
  • aligning in time the starting points of collecting the audio data and the video stream data;
  • collecting the video stream data and the audio data synchronously.
  • Optionally, determining, according to the video stream data, the speech-plus-noise portion and the pure ambient-noise portion of the audio data comprises:
  • slicing the video stream data and the audio data synchronously, so that the slices of the video stream data correspond one-to-one with the slices of the audio data;
  • analyzing the slices of the video stream data one by one;
  • when a slice of the video stream data contains mouth motion, determining that the corresponding slice of the audio data belongs to the speech-plus-noise portion;
  • when a slice of the video stream data contains no mouth motion, determining that the corresponding slice of the audio data belongs to the pure ambient-noise portion.
  • Optionally, processing the speech-plus-noise portion and the pure ambient-noise portion separately to obtain a clean speech component comprises:
  • removing ambient noise from the speech-plus-noise portion to obtain a clean speech component;
  • directly deleting the pure ambient-noise portion.
  • Optionally, after the processing step, the method further includes:
  • immediately sending the resulting clean speech component to the other party.
  • An embodiment of the invention further provides a noise suppression device, comprising:
  • a data collection module configured to: during a voice call, synchronously collect video stream data that records the user's mouth motion and audio data that records the user's call;
  • a first baseband processing module configured to: determine, according to the video stream data, a speech-plus-noise portion and a pure ambient-noise portion of the audio data;
  • a second baseband processing module configured to: process the speech-plus-noise portion and the pure ambient-noise portion separately to obtain a clean speech component.
  • Optionally, the data collection module includes:
  • an aligning unit configured to: align in time the starting points of collecting the audio data and the video stream data;
  • a synchronous collection unit configured to: collect the video stream data and the audio data synchronously, using the synchronization reference line as the reference.
  • Optionally, the first baseband processing module includes:
  • a slicing unit configured to: slice the video stream data and the audio data synchronously, so that the slices of the video stream data correspond one-to-one with the slices of the audio data;
  • a video stream data analyzing unit configured to: analyze the slices of the video stream data one by one;
  • an audio data analyzing unit configured to: when a slice of the video stream data contains mouth motion, determine that the corresponding slice of the audio data belongs to the speech-plus-noise portion; and when a slice of the video stream data contains no mouth motion, determine that the corresponding slice of the audio data belongs to the pure ambient-noise portion.
  • Optionally, the second baseband processing module includes:
  • a speech denoising unit configured to: remove ambient noise from the speech-plus-noise portion to obtain a clean speech component;
  • an ambient noise processing unit configured to: directly delete the pure ambient-noise portion.
  • Optionally, the device further includes:
  • a voice sending module configured to: immediately send the resulting clean speech component to the other party.
  • An embodiment of the invention further provides a computer program comprising program instructions which, when executed by a terminal, enable the terminal to perform the above method.
  • An embodiment of the invention also provides a computer-readable storage medium carrying the computer program.
  • Embodiments of the invention provide a noise suppression method and device. During a voice call, video stream data recording the user's mouth motion and audio data recording the user's call are collected synchronously; the speech-plus-noise portion and the pure ambient-noise portion of the audio data are then determined according to the video stream data; and the two portions are processed separately to obtain a clean speech component. This achieves efficient and accurate noise suppression suited to different scenarios and solves the problem of the narrow application range of existing noise suppression approaches.
  • FIG. 1 is a schematic structural diagram of a noise suppression system according to Embodiment 1 of the present invention.
  • FIG. 2 is a schematic structural diagram of a terminal according to Embodiment 2 of the present invention.
  • FIG. 3 is a flowchart of a noise suppression method according to Embodiment 3 of the present invention.
  • FIG. 4 is a schematic structural diagram of a noise suppression device according to Embodiment 4 of the present invention.
  • FIG. 5 is a schematic structural diagram of the data collection module 401 of FIG. 4;
  • FIG. 6 is a schematic structural diagram of the first baseband processing module 402 of FIG. 4;
  • FIG. 7 is a schematic structural diagram of the second baseband processing module 403 of FIG. 4.
  • The noise suppression schemes of the related art have a narrow range of application, and in many scenarios they filter noise in voice calls poorly.
  • An embodiment of the invention provides a noise suppression system.
  • The structure of the whole system is shown in FIG. 1 and is divided into four parts:
  • Mouth motion collection module 101: collects the user's mouth motion and transmits it to the baseband processing module 103 for subsequent recognition and analysis;
  • Speech-and-noise collection module 102: collects the user's speech during the call together with the background noise. This module must work synchronously with the pinhole camera 101 and its accessory circuit 201, and the data it collects is also sent to the baseband processing module 103 to be processed together with the data produced by the pinhole camera 101 at the same time;
  • Baseband processing module 103: processes and analyzes the mouth motion data produced by the pinhole camera 101 to determine whether the user's mouth is moving; it also processes the audio data sent by module 102 at the same moment. How the audio data is processed depends on the result of recognizing and analyzing the mouth motion data from the pinhole camera 101: the noise captured when there is no mouth motion is subtracted, by some algorithm, from the speech-plus-noise captured when there is mouth motion, yielding clean speech data;
  • Uplink voice path 104: a functional module that receives the speech processed by the baseband processing module 103 and transmits it to the other party's terminal.
  • An embodiment of the invention provides a terminal.
  • The main microphone 202 and the pinhole camera 201 are both mounted on the front of the mobile phone.
  • The main microphone 202 may also be mounted on the lower right side of the mobile phone, as long as it is as close to the mouth as possible.
  • The pinhole camera 201 should be mounted on the lower part of the front of the mobile phone, preferably near the center, so that whether the user holds the phone in the left or the right hand, the pinhole camera 201 is not blocked by the face and can clearly capture the mouth motion. If the mobile phone already has a front camera, that camera may be used for this function instead; in that case the front camera cannot be placed at the upper left of the front of the phone and must be placed on the lower part of the front.
  • Pinhole camera and its accessory circuit 201: captures the user's mouth motion and sends the captured content to the video data memory 204 in the baseband processing main chip 203 for subsequent recognition and analysis;
  • Main microphone and its accessory circuit 202: collects the user's speech during the call together with the surrounding background noise. The collected audio data is sent to the audio data memory 205 of the baseband processing main chip 203, to be processed together with the video data produced by the pinhole camera and its accessory circuit 201;
  • Baseband processing main chip 203: processes and analyzes the video data in the video data memory 204. It slices the video stream data and determines, for the content of each small slice, whether the user's mouth is moving. Image recognition of mouth motion is a mature technology and is not described in detail here. At the same time, the speech-plus-noise audio data in the audio data memory 205 is also sliced.
  • How the audio data is processed depends on the result of recognizing and analyzing the corresponding video data slice in the video data memory 204: the noise captured when there is no mouth motion is subtracted, by an algorithm, from the speech-plus-noise captured when there is mouth motion, yielding clean speech data. Many algorithms for separating speech from noise already exist, so they are not repeated here;
  • the video data memory 204 is configured to: store the video data stream produced by the pinhole camera and its accessory circuit 201;
  • the audio data memory 205 is configured to: store the audio data stream produced by the main microphone and its accessory circuit 202;
  • Codec (CODEC) and modem (MODEM) 206: encode and modulate the audio data stream processed by the baseband processing main chip 203 and transmit it to the uplink call link 207;
  • Uplink call link 207: the air link connecting the two parties of the call.
  • The terminal in the embodiments of the invention may be a mobile device such as a mobile phone, a tablet computer, or a notebook computer; the embodiments of the invention place no limitation on this.
  • An embodiment of the invention provides a noise suppression method used in combination with the noise suppression device shown in FIG. 2. The processing flow is shown in FIG. 3 and includes:
  • Step 301: Start.
  • Step 302: Determine whether a voice call has started; if so, go to step 303.
  • During the voice call, video stream data recording the user's mouth motion and audio data recording the user's call are collected synchronously.
  • The starting points of collecting the audio data and the video stream data are first aligned in time, and then the video stream data and the audio data are collected synchronously.
  • In this step, the collected audio data and video stream data are already synchronized, that is, aligned in time, which facilitates the subsequent synchronized processing.
  • Step 303: The pinhole camera and its accessory circuit 201 start working and continuously capture video data of the user's mouth motion.
  • Step 304: The main microphone and its accessory circuit 202 start working synchronously with the pinhole camera and its accessory circuit 201, so that audio data and video stream data are collected continuously; the audio data contains both the user's speech and the background noise.
  • Step 305: The video stream data collected by the pinhole camera and its accessory circuit 201 is stored in the video data memory 204. In general, the shortest time a person needs to say one character is about 200 ms, so processing can begin once the stored video stream data and audio data reach 200 ms in length.
  • Step 306: The audio data collected by the main microphone and its accessory circuit 202 is stored in the audio data memory 205.
  • Step 307: The baseband processing main chip 203 slices and analyzes the contents of the video data memory 204 and the audio data memory 205 synchronously, so that the video data from the camera and the voice data from the microphone are analyzed together. If the video data in video slice N is judged to contain mouth motion, the audio data of the corresponding audio slice N is defined as a superposition of speech and noise; if the video data in video slice N is judged to contain no mouth motion, the audio data of the corresponding audio slice N is defined as background noise only. The audio data of the two cases is then subtracted by an algorithm to obtain a clean speech component.
  • The processing method is as follows:
  • First, the speech-plus-noise portion and the pure ambient-noise portion of the audio data are determined according to the video stream data.
  • The collection and storage of the video stream and the audio stream are strictly synchronized. The video stream data and the audio data are sliced synchronously, and the slices of the video stream data correspond one-to-one with the slices of the audio data.
  • Slicing of the video data and the audio data can start from the aligned starting point of collection, and the slicing must also be synchronized: for example, with one slice every 0.3 seconds, both the video data and the audio data must be cut into slices of that length.
  • The first slice of the video data is defined as S1 and the first slice of the audio data as Y1, and so on: the nth slice of the video data is defined as Sn and the nth slice of the audio data as Yn.
  • Once the speech-plus-noise portion and the pure ambient-noise portion have been distinguished, the two portions can be processed separately to obtain a clean speech component.
  • Specifically, ambient noise is removed from the speech-plus-noise portion to obtain a clean speech component, and the pure ambient-noise portion is directly deleted.
  • Step 308: Immediately send the resulting clean speech component to the other party;
  • In this step, the clean speech component obtained from the processing is sent to the CODEC and MODEM 206 for encoding and modulation and then transmitted over the uplink call link 207 to the other party's terminal, where the clean speech with the ambient noise removed can be heard.
  • To avoid excessive call delay, each slice can be sent as soon as a clean speech slice is obtained.
  • Step 309: Determine whether the voice call has ended; if not, return to step 302;
  • Step 310: The voice call ends, and the device of this embodiment stops working.
  • An embodiment of the invention provides a noise suppression device.
  • The structure of the device is shown in FIG. 4 and includes:
  • the data collection module 401, configured to: during a voice call, synchronously collect video stream data that records the user's mouth motion and audio data that records the user's call;
  • the first baseband processing module 402, configured to: determine, according to the video stream data, a speech-plus-noise portion and a pure ambient-noise portion of the audio data;
  • the second baseband processing module 403, configured to: process the speech-plus-noise portion and the pure ambient-noise portion separately to obtain a clean speech component.
  • Optionally, the structure of the data collection module 401 is shown in FIG. 5 and includes:
  • the aligning unit 4011, configured to: align in time the starting points of collecting the audio data and the video stream data;
  • the synchronous collection unit 4012, configured to: collect the video stream data and the audio data synchronously, using the synchronization reference line as the reference.
  • Optionally, the structure of the first baseband processing module 402 is shown in FIG. 6 and includes:
  • the slicing unit 4021, configured to: slice the video stream data and the audio data synchronously, so that the slices of the video stream data correspond one-to-one with the slices of the audio data;
  • the video stream data analyzing unit 4022, configured to: analyze the slices of the video stream data one by one;
  • the audio data analyzing unit 4023, configured to: when a slice of the video stream data contains mouth motion, determine that the corresponding slice of the audio data belongs to the speech-plus-noise portion; and when a slice of the video stream data contains no mouth motion, determine that the corresponding slice of the audio data belongs to the pure ambient-noise portion.
  • Optionally, the structure of the second baseband processing module 403 is shown in FIG. 7 and includes:
  • the speech denoising unit 4031, configured to: remove ambient noise from the speech-plus-noise portion to obtain a clean speech component;
  • the ambient noise processing unit 4032, configured to: directly delete the pure ambient-noise portion.
  • Optionally, the device further includes:
  • the voice sending module 404, configured to: immediately send the resulting clean speech component to the other party.
  • Embodiments of the invention provide a noise suppression method and device. During a voice call, video stream data recording the user's mouth motion and audio data recording the user's call are collected synchronously; the speech-plus-noise portion and the pure ambient-noise portion of the audio data are then determined according to the video stream data; and the two portions are processed separately to obtain a clean speech component. This achieves efficient and accurate noise suppression suited to different scenarios and solves the problem of the narrow application range of existing noise suppression approaches.
  • Optionally, all or part of the steps of the above embodiments may also be implemented with integrated circuits: the steps may each be fabricated as an individual integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the invention is not limited to any specific combination of hardware and software.
  • Each device, functional module, or functional unit in the above embodiments may be implemented with general-purpose computing devices; they may be concentrated on a single computing device or distributed across a network of multiple computing devices.
  • When a device, functional module, or functional unit in the above embodiments is implemented as a software functional module and sold or used as a stand-alone product, it can be stored in a computer-readable storage medium.
  • The computer-readable storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
  • Embodiments of the invention achieve efficient and accurate noise suppression suited to different scenarios and solve the problem of the narrow application range of existing noise suppression approaches.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • Telephonic Communication Services (AREA)

Abstract

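A noise suppression method, device, computer program, and computer-readable storage device. The method includes: during a voice call, synchronously collecting video stream data that records the user's mouth motion and audio data that records the user's call; determining, according to the video stream data, a speech-plus-noise portion and a pure ambient-noise portion of the audio data; and processing the speech-plus-noise portion and the pure ambient-noise portion separately to obtain a clean speech component.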

Description

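Noise suppression method, device, computer program, and computer storage medium
Technical Field
The present invention relates to the field of mobile communications, and in particular to a noise suppression method, a device, a computer program, and a computer storage medium.
Background
More and more smart terminals are on the market, and a very important indicator for evaluating the performance of a smart terminal is how well it suppresses background noise. Ideally, when a mobile phone user is in a very noisy environment, the speech heard by the other party is very clear and the background noise is almost completely filtered out. Most smart terminals currently on the market use multi-microphone noise suppression schemes, of which the dual-microphone scheme is the most widely used. The dual-microphone scheme assumes that the main microphone on the front panel of the phone picks up noise and speech, while the auxiliary microphone on the rear panel picks up noise. The two signals are processed by some algorithm to obtain the speech-only part, which is then transmitted to the other party's phone, so that the other party receives speech with the background noise suppressed.
This scheme suppresses noise well in most noisy environments, but it has one drawback: when the background noise arrives from nearly the same direction as the speech, the scheme has difficulty telling noise from speech. It may pass some of the noise through, or treat some of the speech as noise and filter it out, so that the speech heard on the other party's phone is distorted or even intermittent and is accompanied by some background noise.
A method is therefore needed that both restores clear speech and filters out ambient noise arriving from all directions, so as to give a better user experience in the sending direction of speech in noisy environments.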
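Summary of the Invention
Embodiments of the invention provide a noise suppression method, device, computer program, and computer storage medium that solve the problem of the narrow application range of existing noise suppression approaches.
A noise suppression method, comprising:
during a voice call, synchronously collecting video stream data that records the user's mouth motion and audio data that records the user's call;
determining, according to the video stream data, a speech-plus-noise portion and a pure ambient-noise portion of the audio data;
processing the speech-plus-noise portion and the pure ambient-noise portion separately to obtain a clean speech component.
Optionally, synchronously collecting the video stream data that records the user's mouth motion and the audio data that records the user's call specifically comprises:
aligning in time the starting points of collecting the audio data and the video stream data;
collecting the video stream data and the audio data synchronously.
Optionally, determining, according to the video stream data, the speech-plus-noise portion and the pure ambient-noise portion of the audio data comprises:
slicing the video stream data and the audio data synchronously, so that the slices of the video stream data correspond one-to-one with the slices of the audio data;
analyzing the slices of the video stream data one by one;
when a slice of the video stream data contains mouth motion, determining that the corresponding slice of the audio data belongs to the speech-plus-noise portion;
when a slice of the video stream data contains no mouth motion, determining that the corresponding slice of the audio data belongs to the pure ambient-noise portion.
Optionally, processing the speech-plus-noise portion and the pure ambient-noise portion separately to obtain a clean speech component comprises:
removing ambient noise from the speech-plus-noise portion to obtain a clean speech component;
directly deleting the pure ambient-noise portion.
Optionally, after the step of processing the speech-plus-noise portion and the pure ambient-noise portion separately to obtain a clean speech component, the method further comprises:
immediately sending the resulting clean speech component to the other party.
An embodiment of the invention further provides a noise suppression device, comprising:
a data collection module configured to: during a voice call, synchronously collect video stream data that records the user's mouth motion and audio data that records the user's call;
a first baseband processing module configured to: determine, according to the video stream data, a speech-plus-noise portion and a pure ambient-noise portion of the audio data;
a second baseband processing module configured to: process the speech-plus-noise portion and the pure ambient-noise portion separately to obtain a clean speech component.
Optionally, the data collection module includes:
an aligning unit configured to: align in time the starting points of collecting the audio data and the video stream data;
a synchronous collection unit configured to: collect the video stream data and the audio data synchronously, using the synchronization reference line as the reference.
Optionally, the first baseband processing module includes:
a slicing unit configured to: slice the video stream data and the audio data synchronously, so that the slices of the video stream data correspond one-to-one with the slices of the audio data;
a video stream data analyzing unit configured to: analyze the slices of the video stream data one by one;
an audio data analyzing unit configured to: when a slice of the video stream data contains mouth motion, determine that the corresponding slice of the audio data belongs to the speech-plus-noise portion,
and when a slice of the video stream data contains no mouth motion, determine that the corresponding slice of the audio data belongs to the pure ambient-noise portion.
Optionally, the second baseband processing module includes:
a speech denoising unit configured to: remove ambient noise from the speech-plus-noise portion to obtain a clean speech component;
an ambient noise processing unit configured to: directly delete the pure ambient-noise portion.
Optionally, the device further includes:
a voice sending module configured to: immediately send the resulting clean speech component to the other party.
An embodiment of the invention further provides a computer program comprising program instructions which, when executed by a terminal, enable the terminal to perform the above method.
An embodiment of the invention further provides a computer-readable storage medium carrying the computer program.
Embodiments of the invention provide a noise suppression method and device. During a voice call, video stream data recording the user's mouth motion and audio data recording the user's call are collected synchronously; the speech-plus-noise portion and the pure ambient-noise portion of the audio data are then determined according to the video stream data; and the two portions are processed separately to obtain a clean speech component. This achieves efficient and accurate noise suppression suited to different scenarios and solves the problem of the narrow application range of existing noise suppression approaches.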
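Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of a noise suppression system according to Embodiment 1 of the present invention;
FIG. 2 is a schematic structural diagram of a terminal according to Embodiment 2 of the present invention;
FIG. 3 is a flowchart of a noise suppression method according to Embodiment 3 of the present invention;
FIG. 4 is a schematic structural diagram of a noise suppression device according to Embodiment 4 of the present invention;
FIG. 5 is a schematic structural diagram of the data collection module 401 of FIG. 4;
FIG. 6 is a schematic structural diagram of the first baseband processing module 402 of FIG. 4;
FIG. 7 is a schematic structural diagram of the second baseband processing module 403 of FIG. 4.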
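Preferred Embodiments of the Invention
The noise suppression schemes of the related art have a narrow range of application, and in many scenarios they filter noise in voice calls poorly.
To solve the above problem, embodiments of the invention provide a noise suppression method and device. The embodiments are described in detail below with reference to the drawings. Note that, where no conflict arises, the embodiments in this application and the features in those embodiments may be combined with one another arbitrarily.
First, Embodiment 1 of the invention is described with reference to the drawings.
An embodiment of the invention provides a noise suppression system. The structure of the whole system is shown in FIG. 1 and is divided into four parts:
Mouth motion collection module 101: collects the user's mouth motion and transmits it to the baseband processing module 103 for subsequent recognition and analysis;
Speech-and-noise collection module 102: collects the user's speech during the call together with the background noise. This module must work synchronously with the pinhole camera 101 and its accessory circuit 201, and the data it collects is also sent to the baseband processing module 103 to be processed together with the data produced by the pinhole camera 101 at the same time;
Baseband processing module 103: processes and analyzes the mouth motion data produced by the pinhole camera 101 to determine whether the user's mouth is moving; it also processes the audio data sent by module 102 at the same moment. How the audio data is processed depends on the result of recognizing and analyzing the mouth motion data from the pinhole camera 101: the noise captured when there is no mouth motion is subtracted, by some algorithm, from the speech-plus-noise captured when there is mouth motion, yielding clean speech data;
Uplink voice path 104: a functional module that receives the speech processed by the baseband processing module 103 and transmits it to the other party's terminal.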
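Embodiment 2 of the invention is described below with reference to the drawings.
An embodiment of the invention provides a terminal, as shown in FIG. 2. The main microphone 202 and the pinhole camera 201 are both mounted on the front of the phone; the main microphone 202 may also be mounted on the lower right side of the phone, as long as it is as close to the mouth as possible. The pinhole camera 201 should be mounted on the lower part of the front of the phone, preferably near the center, so that whether the user holds the phone in the left or the right hand, the pinhole camera 201 is not blocked by the face and can clearly capture the mouth motion. If the phone already has a front camera, that camera may be used for this function instead; in that case the front camera cannot be placed at the upper left of the front of the phone and must be placed on the lower part of the front. Because mature techniques such as screen rotation exist, this design does not affect the front camera's photographing function. The design covers both handheld and hands-free call modes. Moreover, because this embodiment needs only one microphone and no additional auxiliary noise-cancelling microphone, and the camera can be the existing front camera, it greatly saves layout space in the phone and greatly reduces manufacturing cost.
The solution of Embodiment 2 is divided into the following parts:
Pinhole camera and its accessory circuit 201: captures the user's mouth motion and sends the captured content to the video data memory 204 in the baseband processing main chip 203 for subsequent recognition and analysis;
Main microphone and its accessory circuit 202: collects the user's speech during the call together with the surrounding background noise. The collected audio data is sent to the audio data memory 205 of the baseband processing main chip 203, to be processed together with the video data produced by the pinhole camera and its accessory circuit 201;
Baseband processing main chip 203: processes and analyzes the video data in the video data memory 204. It slices the video stream data and determines, for the content of each small slice, whether the user's mouth is moving. Image recognition of mouth motion is a mature technology and is not described in detail here. At the same time, the speech-plus-noise audio data in the audio data memory 205 is also sliced. How the audio data is processed depends on the result of recognizing and analyzing the corresponding video data slice in the video data memory 204: the noise captured when there is no mouth motion is subtracted, by an algorithm, from the speech-plus-noise captured when there is mouth motion, yielding clean speech data. Many algorithms for separating speech from noise already exist, so they are not repeated here;
The video data memory 204 is configured to: store the video data stream produced by the pinhole camera and its accessory circuit 201;
The audio data memory 205 is configured to: store the audio data stream produced by the main microphone and its accessory circuit 202;
Codec (CODEC) and modem (MODEM) 206: encode and modulate the audio data stream processed by the baseband processing main chip 203 and transmit it to the uplink call link 207;
Uplink call link 207: the air link connecting the two parties of the call.
The terminal in the embodiments of the invention may be a mobile device such as a mobile phone, a tablet computer, or a notebook computer; the embodiments of the invention place no limitation on this.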
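Embodiment 3 of the invention is described below with reference to the drawings.
An embodiment of the invention provides a noise suppression method used in combination with the noise suppression device shown in FIG. 2. The processing flow is shown in FIG. 3 and includes:
Step 301: Start.
Step 302: Determine whether a voice call has started; if so, go to step 303.
During the voice call, video stream data recording the user's mouth motion and audio data recording the user's call are collected synchronously. The starting points of collecting the audio data and the video stream data are first aligned in time, and then the video stream data and the audio data are collected synchronously. In this step, the collected audio data and video stream data are already synchronized, that is, aligned in time, which facilitates the subsequent synchronized processing.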
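The start-point alignment described in this step can be sketched as follows. This is only a minimal illustration under stated assumptions: the patent does not say how the alignment is carried out, and the function name `align_start` and the `(timestamp_ms, payload)` tuple layout are hypothetical.

```python
from typing import List, Tuple, TypeVar

T = TypeVar("T")
# Hypothetical layout: each captured item is (timestamp_ms, payload).
Stamped = List[Tuple[int, T]]


def align_start(video: Stamped, audio: Stamped) -> Tuple[Stamped, Stamped]:
    """Trim both streams to a common starting point and rebase timestamps to 0."""
    if not video or not audio:
        return [], []
    # The later of the two first timestamps becomes the shared starting point.
    start = max(video[0][0], audio[0][0])
    aligned_video = [(t - start, p) for t, p in video if t >= start]
    aligned_audio = [(t - start, p) for t, p in audio if t >= start]
    return aligned_video, aligned_audio
```

Anything captured before the shared starting point is discarded, so slice n of one stream covers the same time window as slice n of the other.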
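Step 303: The pinhole camera and its accessory circuit 201 start working and continuously capture video data of the user's mouth motion.
Step 304: The main microphone and its accessory circuit 202 start working synchronously with the pinhole camera and its accessory circuit 201, so that audio data and video stream data are collected continuously; the audio data contains both the user's speech and the background noise.
Step 305: The video stream data collected by the pinhole camera and its accessory circuit 201 is stored in the video data memory 204. In general, the shortest time a person needs to say one character is about 200 ms, so processing can begin once the stored video stream data and audio data reach 200 ms in length.
Step 306: The audio data collected by the main microphone and its accessory circuit 202 is stored in the audio data memory 205.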
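The buffering rule in steps 305 and 306 (start processing once about 200 ms of data has been stored) might look like the sketch below. The class `SliceBuffer`, its methods, and the sample-rate parameter are hypothetical names introduced for illustration; the patent only states the 200 ms threshold.

```python
from typing import List, Optional


class SliceBuffer:
    """Accumulates samples and releases them in fixed-length chunks."""

    def __init__(self, rate_hz: int, min_ms: int = 200) -> None:
        # Number of samples that make up min_ms of data at this rate.
        self.needed = rate_hz * min_ms // 1000
        self.items: List[float] = []

    def push(self, chunk: List[float]) -> None:
        self.items.extend(chunk)

    def ready(self) -> bool:
        return len(self.items) >= self.needed

    def pop(self) -> Optional[List[float]]:
        """Return one chunk of min_ms length, or None if not enough is stored."""
        if not self.ready():
            return None
        out, self.items = self.items[: self.needed], self.items[self.needed :]
        return out
```

One such buffer per stream (video frames at the frame rate, audio samples at the sample rate) would keep the two memories releasing data covering the same duration.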
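Step 307: The baseband processing main chip 203 slices and analyzes the contents of the video data memory 204 and the audio data memory 205 synchronously, so that the video data from the camera and the voice data from the microphone are analyzed together. If the video data in video slice N is judged to contain mouth motion, the audio data of the corresponding audio slice N is defined as a superposition of speech and noise; if the video data in video slice N is judged to contain no mouth motion, the audio data of the corresponding audio slice N is defined as background noise only. The audio data of the two cases is then subtracted by an algorithm to obtain a clean speech component.
The processing method is as follows:
First, the speech-plus-noise portion and the pure ambient-noise portion of the audio data are determined according to the video stream data. This includes:
1. In this embodiment, the collection and storage of the video stream and the audio stream are strictly synchronized. The video stream data and the audio data are sliced synchronously, and the slices of the video stream data correspond one-to-one with the slices of the audio data. Slicing of the video data and the audio data can start from the aligned starting point of collection, and the slicing must also be synchronized: for example, with one slice every 0.3 seconds, both the video data and the audio data must be cut into slices of that length. (At a normal speaking rate a person says between 100 and 300 characters per minute, so one character takes between 200 ms and 600 ms; the slice length can therefore be set between 200 ms and 600 ms, as long as it is long enough to be recognizable. The invention places no specific limitation on this.) The first slice of the video data is defined as S1 and the first slice of the audio data as Y1, and so on: the nth slice of the video data is defined as Sn and the nth slice of the audio data as Yn.
2. The mouth motion in the video stream data is analyzed. A slice with mouth motion is defined as 1 and a slice without mouth motion as 0; the speech-plus-noise portion is defined as S and the pure ambient-noise portion as N. When a slice of the video stream data contains mouth motion, the corresponding slice of the audio data is determined to belong to the speech-plus-noise portion; when a slice of the video stream data contains no mouth motion, the corresponding slice of the audio data is determined to belong to the pure ambient-noise portion.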
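The synchronous slicing and labelling in items 1 and 2 can be sketched as below. The patent treats mouth-motion recognition as existing technology and gives no detector, so `has_mouth_motion` is a hypothetical stand-in that assumes each video frame has already been reduced to a single mouth-opening measurement; the threshold, the frame rate, and all function names are likewise assumptions.

```python
from typing import List, Sequence, Tuple


def slice_length_bounds_ms(min_chars_per_min: int = 100,
                           max_chars_per_min: int = 300) -> Tuple[int, int]:
    """Per-character duration range implied by the stated speaking rate."""
    return 60_000 // max_chars_per_min, 60_000 // min_chars_per_min


def slice_stream(items: Sequence[float], rate_hz: int, slice_ms: int) -> List[List[float]]:
    """Cut a stream into equal slices of slice_ms; a trailing partial slice is dropped."""
    per_slice = rate_hz * slice_ms // 1000
    if per_slice <= 0:
        raise ValueError("slice_ms is too short for this rate")
    count = len(items) // per_slice
    return [list(items[i * per_slice:(i + 1) * per_slice]) for i in range(count)]


def has_mouth_motion(openings: Sequence[float], threshold: float = 0.1) -> bool:
    """Hypothetical detector: the mouth moved if its opening varied enough."""
    return max(openings) - min(openings) > threshold


def label_slices(video_slices: Sequence[Sequence[float]],
                 audio_slices: Sequence[List[float]]) -> List[Tuple[int, List[float]]]:
    """Pair Sn with Yn and tag Yn with 1 (speech plus noise) or 0 (noise only)."""
    return [(1 if has_mouth_motion(s) else 0, y)
            for s, y in zip(video_slices, audio_slices)]
```

With 300 ms slices, a 30 fps video stream yields 9 frames per slice and an 8 kHz audio stream yields 2400 samples per slice, so slice n of each stream covers the same time window.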
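Once the speech-plus-noise portion and the pure ambient-noise portion have been distinguished, the two portions can be processed separately to obtain a clean speech component. Specifically, ambient noise is removed from the speech-plus-noise portion to obtain a clean speech component, and the pure ambient-noise portion is directly deleted.
For example, when Sn = 0, Yn = N; when Sn = 1, Yn = N + S. The N part can therefore be extracted easily, which yields the speech-plus-noise S part; ambient-noise removal is then applied to the S part to obtain the clean speech component.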
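The two-branch processing above can be illustrated with the sketch below. The patent leaves the denoising algorithm open, so the broadband power-subtraction gain used here is a stand-in chosen for brevity and is not the patent's method; the function names are hypothetical. Noise-only slices refresh a noise power estimate and are then deleted; speech-plus-noise slices are scaled by a gain derived from that estimate.

```python
import math
from typing import List, Sequence, Tuple


def mean_power(samples: Sequence[float]) -> float:
    """Average squared amplitude of one slice."""
    return sum(x * x for x in samples) / len(samples)


def suppress_noise(labelled: Sequence[Tuple[int, List[float]]]) -> List[List[float]]:
    """Return only the cleaned speech slices, in their original order."""
    noise_power = 0.0
    clean: List[List[float]] = []
    for label, y in labelled:
        if label == 0:
            # Pure ambient noise: remember its power, then drop the slice.
            noise_power = mean_power(y)
            continue
        power = mean_power(y)
        # Assumed stand-in: attenuate by the share of power attributed to noise.
        gain = math.sqrt(max(0.0, 1.0 - noise_power / power)) if power > 0 else 0.0
        clean.append([gain * x for x in y])
    return clean
```

A practical implementation would more likely work per frequency band, for example with spectral subtraction, but the control flow (estimate on label 0, clean on label 1) would stay the same.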
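Step 308: Immediately send the resulting clean speech component to the other party;
In this step, the clean speech component obtained from the processing is sent to the CODEC and MODEM 206 for encoding and modulation and then transmitted over the uplink call link 207 to the other party's terminal, where the clean speech with the ambient noise removed can be heard. To avoid excessive call delay, each slice can be sent as soon as a clean speech slice is obtained.
Step 309: Determine whether the voice call has ended; if not, return to step 302;
Step 310: The voice call ends, and the device of this embodiment stops working.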
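The overall flow of steps 301 to 310 can be summarized as a loop. This is a sketch only: `run_call` and its five callback parameters are hypothetical names standing in for the hardware and algorithms of FIG. 2, none of which the patent specifies at code level.

```python
from typing import Callable, List, Sequence, Tuple


def run_call(
    call_active: Callable[[], bool],
    capture_slice: Callable[[], Tuple[Sequence[float], List[float]]],
    has_mouth_motion: Callable[[Sequence[float]], bool],
    denoise: Callable[[List[float]], List[float]],
    send: Callable[[List[float]], None],
) -> int:
    """Process one call slice by slice; returns the number of slices sent."""
    sent = 0
    while call_active():                              # steps 302 and 309
        video_slice, audio_slice = capture_slice()    # steps 303 to 306
        if has_mouth_motion(video_slice):             # step 307: speech plus noise
            send(denoise(audio_slice))                # step 308: send right away
            sent += 1
        # Otherwise the slice is pure ambient noise and is dropped.
    return sent                                       # step 310: call ended
```

Sending each cleaned slice inside the loop, rather than collecting them first, reflects the stated aim of keeping call delay low.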
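Embodiment 4 of the invention is described below with reference to the drawings.
An embodiment of the invention provides a noise suppression device. The structure of the device is shown in FIG. 4 and includes:
the data collection module 401, configured to: during a voice call, synchronously collect video stream data that records the user's mouth motion and audio data that records the user's call;
the first baseband processing module 402, configured to: determine, according to the video stream data, a speech-plus-noise portion and a pure ambient-noise portion of the audio data;
the second baseband processing module 403, configured to: process the speech-plus-noise portion and the pure ambient-noise portion separately to obtain a clean speech component.
Optionally, the structure of the data collection module 401 is shown in FIG. 5 and includes:
the aligning unit 4011, configured to: align in time the starting points of collecting the audio data and the video stream data;
the synchronous collection unit 4012, configured to: collect the video stream data and the audio data synchronously, using the synchronization reference line as the reference.
Optionally, the structure of the first baseband processing module 402 is shown in FIG. 6 and includes:
the slicing unit 4021, configured to: slice the video stream data and the audio data synchronously, so that the slices of the video stream data correspond one-to-one with the slices of the audio data;
the video stream data analyzing unit 4022, configured to: analyze the slices of the video stream data one by one;
the audio data analyzing unit 4023, configured to: when a slice of the video stream data contains mouth motion, determine that the corresponding slice of the audio data belongs to the speech-plus-noise portion,
and when a slice of the video stream data contains no mouth motion, determine that the corresponding slice of the audio data belongs to the pure ambient-noise portion.
Optionally, the structure of the second baseband processing module 403 is shown in FIG. 7 and includes:
the speech denoising unit 4031, configured to: remove ambient noise from the speech-plus-noise portion to obtain a clean speech component;
the ambient noise processing unit 4032, configured to: directly delete the pure ambient-noise portion.
Optionally, the device further includes:
the voice sending module 404, configured to: immediately send the resulting clean speech component to the other party.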
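The module split of FIG. 4 could be mirrored in software roughly as follows. The class names follow the module names in the text, but every method, field, and callback is an assumption made for illustration; the data collection module 401 is left out and its output (already aligned, already sliced data) is passed in directly.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Slice = List[float]


@dataclass
class FirstBasebandModule:
    """Module 402: label each audio slice from its paired video slice."""
    detect_mouth_motion: Callable[[Sequence[float]], bool]

    def classify(self, video_slices, audio_slices) -> List[Tuple[int, Slice]]:
        return [(1 if self.detect_mouth_motion(s) else 0, y)
                for s, y in zip(video_slices, audio_slices)]


@dataclass
class SecondBasebandModule:
    """Module 403: denoise speech-plus-noise slices, delete noise-only ones."""
    denoise: Callable[[Slice], Slice]

    def process(self, labelled: Sequence[Tuple[int, Slice]]) -> List[Slice]:
        return [self.denoise(y) for label, y in labelled if label == 1]


@dataclass
class VoiceSendingModule:
    """Module 404: stand-in that records what would be sent uplink."""
    sent: List[Slice] = field(default_factory=list)

    def send(self, slices: Sequence[Slice]) -> None:
        self.sent.extend(slices)


@dataclass
class NoiseSuppressionDevice:
    first: FirstBasebandModule
    second: SecondBasebandModule
    sender: VoiceSendingModule

    def handle(self, video_slices, audio_slices) -> None:
        labelled = self.first.classify(video_slices, audio_slices)
        self.sender.send(self.second.process(labelled))
```

Injecting the detector and the denoiser as callables keeps the structure independent of whichever recognition and denoising algorithms are actually chosen.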
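Embodiments of the invention provide a noise suppression method and device. During a voice call, video stream data recording the user's mouth motion and audio data recording the user's call are collected synchronously; the speech-plus-noise portion and the pure ambient-noise portion of the audio data are then determined according to the video stream data; and the two portions are processed separately to obtain a clean speech component. This achieves efficient and accurate noise suppression suited to different scenarios and solves the problem of the narrow application range of existing noise suppression approaches.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented as a computer program flow. The computer program may be stored in a computer-readable storage medium and is executed on a corresponding hardware platform (such as a system, equipment, apparatus, or device); when executed, it includes one of the steps of the method embodiments or a combination of them.
Optionally, all or part of the steps of the above embodiments may also be implemented with integrated circuits: the steps may each be fabricated as an individual integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the invention is not limited to any specific combination of hardware and software.
Each device, functional module, or functional unit in the above embodiments may be implemented with general-purpose computing devices; they may be concentrated on a single computing device or distributed across a network of multiple computing devices.
When a device, functional module, or functional unit in the above embodiments is implemented as a software functional module and sold or used as a stand-alone product, it can be stored in a computer-readable storage medium. The computer-readable storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the protection scope of the invention. The protection scope of the invention shall therefore be that of the claims.
Industrial Applicability
Embodiments of the invention achieve efficient and accurate noise suppression suited to different scenarios and solve the problem of the narrow application range of existing noise suppression approaches.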

Claims (12)

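  1. A noise suppression method, comprising:
    during a voice call, synchronously collecting video stream data that records the user's mouth motion and audio data that records the user's call;
    determining, according to the video stream data, a speech-plus-noise portion and a pure ambient-noise portion of the audio data; and
    processing the speech-plus-noise portion and the pure ambient-noise portion separately to obtain a clean speech component.
  2. The noise suppression method according to claim 1, wherein synchronously collecting the video stream data that records the user's mouth motion and the audio data that records the user's call comprises:
    aligning in time the starting points of collecting the audio data and the video stream data;
    collecting the video stream data and the audio data synchronously.
  3. The noise suppression method according to claim 2, wherein determining, according to the video stream data, the speech-plus-noise portion and the pure ambient-noise portion of the audio data comprises:
    slicing the video stream data and the audio data synchronously, so that the slices of the video stream data correspond one-to-one with the slices of the audio data;
    analyzing the slices of the video stream data one by one;
    when a slice of the video stream data contains mouth motion, determining that the slice of the audio data corresponding to that video stream data belongs to the speech-plus-noise portion;
    when a slice of the video stream data contains no mouth motion, determining that the slice of the audio data corresponding to that video stream data belongs to the pure ambient-noise portion.
  4. The noise suppression method according to claim 1, 2, or 3, wherein processing the speech-plus-noise portion and the pure ambient-noise portion separately to obtain a clean speech component comprises:
    removing ambient noise from the speech-plus-noise portion to obtain a clean speech component;
    deleting the pure ambient-noise portion.
  5. The noise suppression method according to claim 1, wherein after the step of processing the speech-plus-noise portion and the pure ambient-noise portion separately to obtain a clean speech component, the method further comprises:
    immediately sending the resulting clean speech component to the other party.
  6. A noise suppression device, comprising:
    a data collection module configured to: during a voice call, synchronously collect video stream data that records the user's mouth motion and audio data that records the user's call;
    a first baseband processing module configured to: determine, according to the video stream data, a speech-plus-noise portion and a pure ambient-noise portion of the audio data; and
    a second baseband processing module configured to: process the speech-plus-noise portion and the pure ambient-noise portion separately to obtain a clean speech component.
  7. The noise suppression device according to claim 6, wherein the data collection module comprises:
    an aligning unit configured to: align in time the starting points of collecting the audio data and the video stream data; and
    a synchronous collection unit configured to: collect the video stream data and the audio data synchronously, using the synchronization reference line as the reference.
  8. The noise suppression device according to claim 7, wherein the first baseband processing module comprises:
    a slicing unit configured to: slice the video stream data and the audio data synchronously, so that the slices of the video stream data correspond one-to-one with the slices of the audio data;
    a video stream data analyzing unit configured to: analyze the slices of the video stream data one by one; and
    an audio data analyzing unit configured to: when a slice of the video stream data contains mouth motion, determine that the slice of the audio data corresponding to that video stream data belongs to the speech-plus-noise portion,
    and when a slice of the video stream data contains no mouth motion, determine that the slice of the audio data corresponding to that video stream data belongs to the pure ambient-noise portion.
  9. The noise suppression device according to claim 6, 7, or 8, wherein the second baseband processing module comprises:
    a speech denoising unit configured to: remove ambient noise from the speech-plus-noise portion to obtain a clean speech component; and
    an ambient noise processing unit configured to: delete the pure ambient-noise portion.
  10. The noise suppression device according to claim 6, further comprising:
    a voice sending module configured to: immediately send the resulting clean speech component to the other party.
  11. A computer program comprising program instructions which, when executed by a terminal, enable the terminal to perform the method of any one of claims 1 to 5.
  12. A computer-readable storage medium carrying the computer program of claim 11.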
PCT/CN2014/089335 2014-07-23 2014-10-23 Noise suppression method, device, computer program, and computer storage medium WO2015117403A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410353699.XA CN105321523A (zh) 2014-07-23 2014-07-23 Noise suppression method and device
CN201410353699.X 2014-07-23

Publications (1)

Publication Number Publication Date
WO2015117403A1 (zh) 2015-08-13

Family

ID=53777221

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/089335 WO2015117403A1 (zh) 2014-07-23 2014-10-23 Noise suppression method, device, computer program, and computer storage medium

Country Status (2)

Country Link
CN (1) CN105321523A (zh)
WO (1) WO2015117403A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880737A (zh) * 2021-09-26 2023-03-31 天翼爱音乐文化科技有限公司 一种基于降噪自学习的字幕生成方法、系统、设备及介质

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107437420A (zh) * 2016-05-27 2017-12-05 富泰华工业(深圳)有限公司 语音信息的接收方法、系统及装置
CN108986830B (zh) * 2018-08-28 2021-02-09 安徽淘云科技有限公司 一种音频语料筛选方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003195883A (ja) * 2001-12-26 2003-07-09 Toshiba Corp 雑音除去装置およびその装置を備えた通信端末
US20030212556A1 (en) * 2002-05-09 2003-11-13 Nefian Ara V. Factorial hidden markov model for audiovisual speech recognition
CN1742322A (zh) * 2003-01-24 2006-03-01 索尼爱立信移动通讯股份有限公司 噪声减小和视听语音活动检测
CN102324035A (zh) * 2011-08-19 2012-01-18 广东好帮手电子科技股份有限公司 口型辅助语音识别术在车载导航中应用的方法及系统
CN102682273A (zh) * 2011-03-18 2012-09-19 夏普株式会社 嘴唇运动检测设备和方法

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20070050058A (ko) * 2004-09-07 2007-05-14 코닌클리케 필립스 일렉트로닉스 엔.브이. 향상된 잡음 억제를 구비한 전화통신 디바이스
US9053697B2 (en) * 2010-06-01 2015-06-09 Qualcomm Incorporated Systems, methods, devices, apparatus, and computer program products for audio equalization
JP5529635B2 (ja) * 2010-06-10 2014-06-25 キヤノン株式会社 音声信号処理装置および音声信号処理方法
KR101739942B1 (ko) * 2010-11-24 2017-05-25 삼성전자주식회사 오디오 노이즈 제거 방법 및 이를 적용한 영상 촬영 장치
CN102298443B (zh) * 2011-06-24 2013-09-25 华南理工大学 结合视频通道的智能家居语音控制系统及其控制方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880737A (zh) * 2021-09-26 2023-03-31 天翼爱音乐文化科技有限公司 一种基于降噪自学习的字幕生成方法、系统、设备及介质
CN115880737B (zh) * 2021-09-26 2024-04-19 天翼爱音乐文化科技有限公司 一种基于降噪自学习的字幕生成方法、系统、设备及介质

Also Published As

Publication number Publication date
CN105321523A (zh) 2016-02-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14881938; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14881938; Country of ref document: EP; Kind code of ref document: A1)