CN106303357A

CN106303357A - Video call method and system with far-field speech enhancement

Info

Publication number: CN106303357A
Application number: CN201610770495.5A
Authority: CN
Inventors: 洪涛; 孙铭俊
Original assignee: Fuzhou Rockchip Electronics Co Ltd
Current assignee: Rockchip Electronics Co Ltd
Priority date: 2016-08-30
Filing date: 2016-08-30
Publication date: 2017-01-04
Anticipated expiration: 2036-08-30
Also published as: CN106303357B

Abstract

The present invention provides a video call system with far-field speech enhancement. The system includes at least two video call terminals, a multi-noise filtering engine, and a multi-noise filtering API management server. The multi-noise filtering engine and the multi-noise filtering API management server are connected to the two video call terminals through a communication network. When the video call terminal at one end carries out a far-field video call, the caller's far-field voice and multiple noise sources are received and recorded by the video call terminal at the same time. The multi-noise filtering engine filters the far-field sound and the multiple noise sources. The multi-noise filtering API management server suppresses the multiple noise sources to enhance the voice of the primary speaker, and the video call terminal at the other end then receives the processed voice of the primary speaker. The present invention improves the quality of the speech data in long-distance video calls.

Description

Video call method and system with far-field speech enhancement

Technical field

The present invention relates to the technical field of set-top boxes, and in particular to a video call method and system with far-field speech enhancement.

Background art

Far-field voice calling is voice calling at a distance, in particular where the caller is 3 to 5 meters from the microphone. Because of interference such as noise and/or reverberation, the voice quality during such a video call is very poor. Real far-field voice communication involves the following noise sources: (1) Reverberation noise: when sound waves propagate indoors, they are reflected and absorbed by obstacles before finally dying away, so that after the sound source stops emitting, a mixture of several sound waves is perceived to persist for a period of time, namely the reverberation time. The length of the reverberation time is an important acoustic characteristic of buildings such as concert halls, theaters, and auditoriums. (2) Background noise: a general term for all noise other than the object of study. (3) Human voice interference: ambient voices, that is, the voices of anyone other than the object of study. (4) Echo noise: when a sound wave encounters a large reflecting surface during propagation (a building wall, a mountain, etc.), it is reflected at the interface; a reflected sound wave that can be distinguished from the original sound is called an echo.

In summary, during a far-field video call, multiple kinds of noise must be filtered out of the far-field speech before a clean, clear voice signal of the call participant can be obtained.

The prior art includes the Chinese patent with application number 201310066421.X, entitled "Speech enhancement processing method and device". An embodiment of that invention provides a speech enhancement processing method and device. The method includes: decoding a bitstream to obtain the coding parameters of the speech subframe currently to be processed, the coding parameters including a first algebraic codebook gain and a first adaptive codebook gain; adjusting the first algebraic codebook gain to obtain a second algebraic codebook gain; determining a second adaptive codebook gain from the first adaptive codebook gain and the second algebraic codebook gain; and replacing the bits in the bitstream that correspond to the first algebraic codebook gain and the first adaptive codebook gain with the quantization indices of the second algebraic codebook gain and the second adaptive codebook gain. That technical solution can effectively improve noise removal and voice call quality. However, the technical approach of that patent is entirely different from the one taken in the present application.

The prior art also discloses "A call system and method based on wireless-positioning microphone array speech enhancement"; see the Chinese patent with application number 201310513373.4. That invention discloses a call system and method based on wireless-positioning microphone array speech enhancement. The system includes a wireless positioning transmitter module, a wireless positioning receiver module, a microphone array speech receiving module, a speech enhancement module, a far-end speech playback module, and a communication module. The wireless positioning transmitter module and the wireless positioning receiver module are connected wirelessly; the wireless positioning receiver module and the microphone array speech receiving module are each connected to the speech enhancement module; the speech enhancement module is connected to the communication module; and the far-end speech playback module is connected to the communication module. The call method first locates the target sound source using wireless positioning technology, and then applies microphone array speech enhancement to the voice of the target speaker before communication. That invention offers fast and accurate positioning, good enhancement, and high robustness, and can effectively improve the voice quality of existing call systems. That patent focuses mainly on sound source localization with a microphone array and directional speech enhancement. The present application focuses on enhancing the voice of the primary speaker and suppressing multiple noise sources during a far-field video call.

Summary of the invention

The first technical problem to be solved by the present invention is to provide a video call system with far-field speech enhancement that uses a multi-noise filtering engine to suppress multiple noise sources and enhance the voice of the primary speaker, thereby improving the call quality of far-field video calls.

The first problem is solved as follows: a video call system with far-field speech enhancement, the system including at least two video call terminals, a multi-noise filtering engine, and a multi-noise filtering API management server; the multi-noise filtering engine and the multi-noise filtering API management server are connected to the two video call terminals through a communication network;

When the video call terminal at one end carries out a far-field video call, the caller's far-field voice and multiple noise sources are received and recorded by the video call terminal at the same time;

The multi-noise filtering engine filters the far-field sound and the multiple noise sources;

The multi-noise filtering API management server suppresses the multiple noise sources to enhance the voice of the primary speaker;

The video call terminal at the other end then receives the processed voice of the primary speaker.

Further, the video call terminal is provided with hardware drivers, an operating system module, a video call middleware module, a microphone array recording module, a raw sound enhancement module, a call main-voice and noise-source separation module, a multi-noise filtering engine API, a call main-voice and noise-source merging module, a video call audio/video packaging module, and a video call transmission module;

The hardware drivers: the device includes internal or external hardware modules; a hardware driver is the driver software for a hardware module and generally completes initialization during the operating system initialization stage;

The operating system: a unified abstraction of the device hardware and hardware interfaces; the operating system is the basic environment in which software runs;

The video call middleware module: a software package providing the basic functions of video calling;

The microphone array recording module: a module that calls the operating system's microphone array interface and records sound;

The raw sound enhancement module: calls an audio algorithm to enhance the recorded raw sound, that is, to amplify the sound signal;

The call main-voice and noise-source separation module: calls the multi-noise filtering engine API with the enhanced raw sound as input, and outputs the main voice and the noise sources;

The multi-noise filtering engine API: takes the enhanced raw sound as input and outputs the main voice and the noise sources; the multi-noise filtering engine API can be deployed on the local device or on a server;

The call main-voice and noise-source merging module: after the main voice has been enhanced and the noise sources suppressed, synthesizes the enhanced main voice and the attenuated noise sources into a single sound;

The video call audio/video packaging module: the video stream is encoded with H264/H265 and then packetized into a PES stream; the audio is encoded with AAC or AC3 and then packetized into a PES stream; the audio and video PES streams are then multiplexed into a TS stream suitable for network transmission;

The video call transmission module: transmits the TS stream over the communication network according to the video call service logic.
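The step in which a PES stream is carried in a TS stream can be illustrated with a minimal sketch. It is not taken from the patent: the function name packetize_pes and its parameters are hypothetical, and the sketch covers only the splitting of one PES packet into fixed 188-byte TS packets (sync byte 0x47, 13-bit PID, 4-bit continuity counter, adaptation-field stuffing in the last packet). PSI tables, PCR, and the network layer are omitted.

```python
TS_PACKET_SIZE = 188                 # fixed MPEG-TS packet length in bytes
TS_SYNC_BYTE = 0x47                  # first byte of every TS packet
TS_PAYLOAD_MAX = TS_PACKET_SIZE - 4  # 184 bytes remain after the 4-byte header


def packetize_pes(pes: bytes, pid: int, first_cc: int = 0) -> list:
    """Split one PES packet into 188-byte TS packets (hypothetical helper)."""
    if not 0 <= pid <= 0x1FFF:
        raise ValueError("PID must fit in 13 bits")
    packets = []
    cc = first_cc & 0x0F             # continuity counter wraps at 16
    offset = 0
    while offset < len(pes):
        chunk = pes[offset:offset + TS_PAYLOAD_MAX]
        pusi = 1 if offset == 0 else 0   # payload_unit_start_indicator
        if len(chunk) == TS_PAYLOAD_MAX:
            afc = 0b01                   # payload only
            adaptation = b""
        else:
            afc = 0b11                   # adaptation field, then payload
            af_len = TS_PAYLOAD_MAX - 1 - len(chunk)  # bytes after the length byte
            if af_len == 0:
                adaptation = bytes([0])
            else:
                # one flags byte (all zero) followed by 0xFF stuffing bytes
                adaptation = bytes([af_len, 0x00]) + b"\xff" * (af_len - 1)
        header = bytes([
            TS_SYNC_BYTE,
            (pusi << 6) | (pid >> 8),
            pid & 0xFF,
            (afc << 4) | cc,
        ])
        packets.append(header + adaptation + chunk)
        cc = (cc + 1) & 0x0F
        offset += len(chunk)
    return packets
```

Calling packetize_pes on a 512-byte payload yields three packets of exactly 188 bytes each; only the first has the payload_unit_start_indicator set, and only the last carries stuffing.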

Further, the video call middleware module includes: an input device management module, an audio/video preprocessing module, an audio/video encoding module, an audio/video packaging module, and a network transmission module.

Further, the video call with far-field speech enhancement focuses on the input and output of data at each module:

Far-field sound input, including: the call voice, environmental noise, echo noise, reverberation noise, and multi-speaker noise;

The microphone array recording module receives and records the above far-field sound and outputs the sound as a digital signal;

The digitized far-field sound is input to the multi-noise filtering engine;

The multi-noise filtering engine accesses the multi-noise filtering API management server to obtain a multi-noise filtering engine API;

The multi-noise filtering API management server manages external multi-noise filtering engine APIs;

The multi-noise filtering engine calls the multi-noise filtering engine API to process the digitized far-field sound, producing audio data in which the far-field speech is enhanced and the multiple noise sources are suppressed.
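The digital signal output by the recording module can be pictured with a small sketch. The patent does not specify a sample format; the 16-bit little-endian PCM format and the function name to_pcm16 below are assumptions made only for illustration.

```python
import struct


def to_pcm16(samples):
    """Quantize float samples in [-1.0, 1.0] to little-endian 16-bit PCM bytes.

    Assumed format for illustration; the source does not name one.
    """
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))          # clip out-of-range input
        ints.append(int(round(s * 32767)))  # scale to the signed 16-bit range
    return struct.pack("<%dh" % len(ints), *ints)
```

Each input sample becomes two bytes, so a 10 ms frame at an assumed 16 kHz sampling rate (160 samples) occupies 320 bytes before it is handed to the filtering engine.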

Further, the multi-noise filtering API management server mainly has the following functions: maintaining the multi-noise filtering engine APIs; managing external multi-noise filtering engine APIs; maintaining the adaptation of external multi-noise filtering engine APIs to the internal multi-noise filtering engine API; maintaining the update policy of the multi-noise filtering engine APIs; maintaining the management policy for external multi-noise filtering engine APIs; and auditing the service quality of the multi-noise filtering engine APIs.
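The management functions listed above can be sketched as a small in-memory registry. Everything in the sketch is an assumption for illustration, not the patent's design: the class name EngineApiRegistry, the internal API shape (samples in, a voice/noise pair out), the "highest version wins" update policy, the dict-returning external API, and the average-latency threshold used as the service-quality audit.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EngineEntry:
    name: str
    version: int
    call: Callable                      # internal form: samples -> (voice, noise)
    latencies_ms: List[float] = field(default_factory=list)


class EngineApiRegistry:
    """Hypothetical stand-in for the API management server."""

    def __init__(self, max_avg_latency_ms: float = 50.0):
        self._entries: Dict[str, EngineEntry] = {}
        self._max_avg_latency_ms = max_avg_latency_ms

    def register(self, name, version, call):
        # Assumed update policy: keep only the highest version per name.
        current = self._entries.get(name)
        if current is None or version > current.version:
            self._entries[name] = EngineEntry(name, version, call)

    def register_external(self, name, version, external_call):
        # Adapt an external API that returns a dict to the internal tuple form.
        def adapted(samples):
            result = external_call(samples)
            return result["voice"], result["noise"]
        self.register(name, version, adapted)

    def report_latency(self, name, ms):
        self._entries[name].latencies_ms.append(ms)

    def audit(self):
        # Return the names whose average reported latency exceeds the threshold.
        failing = []
        for entry in self._entries.values():
            samples = entry.latencies_ms
            if samples and sum(samples) / len(samples) > self._max_avg_latency_ms:
                failing.append(entry.name)
        return failing

    def get(self, name):
        return self._entries[name].call
```

A filtering engine would call get with an engine name and then invoke the returned function on its audio frames; audit gives an operator a list of engine APIs whose reported latency is above the assumed threshold.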

Further, the video call operation with far-field speech enhancement is specifically as follows: the microphone array recording module of the video call terminal at one end receives the voice of the far-field video call participant together with the associated multiple noise sources; the video call terminal amplifies the raw audio data through the raw sound enhancement module and then passes it to the local or online multi-noise filtering engine for processing. The local or online multi-noise filtering engine first separates the voice of the primary speaker from the multiple noise sources through the call main-voice and noise-source separation module; it then enhances the voice of the primary speaker and suppresses the multiple noise sources through the multi-noise filtering engine API; it then merges the enhanced voice of the primary speaker with the suppressed noise sources through the call main-voice and noise-source merging module and returns the result to the video call terminal. The video call terminal packages the video data and the processed audio data into network packets suitable for network transmission through the video call audio/video packaging module, and the video call transmission module of the video call middleware module transmits the audio/video data to the video call terminal at the other end.
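The audio part of this operation (amplify, separate, enhance and suppress, merge) can be outlined in a few lines. This is a sketch under stated assumptions: the function names and gain values are hypothetical, samples are floats in [-1.0, 1.0], and the separation step is passed in as a callable because the patent does not disclose the separation algorithm.

```python
def clip(x):
    """Keep a sample inside the valid range [-1.0, 1.0]."""
    return max(-1.0, min(1.0, x))


def amplify(samples, gain):
    """Raw sound enhancement: scale the recorded signal, clipping overflow."""
    return [clip(s * gain) for s in samples]


def merge(voice, noise, voice_gain, noise_gain):
    """Mix the enhanced main voice with the attenuated noise into one signal."""
    return [clip(v * voice_gain + n * noise_gain) for v, n in zip(voice, noise)]


def process_frame(samples, separate, pre_gain=2.0, voice_gain=1.5, noise_gain=0.1):
    """Run one audio frame through the stages described above.

    `separate` is any callable mapping a frame to (voice, noise); it stands in
    for the multi-noise filtering engine API, whose internals are not specified.
    """
    boosted = amplify(samples, pre_gain)
    voice, noise = separate(boosted)
    return merge(voice, noise, voice_gain, noise_gain)
```

A noise_gain below 1.0 attenuates the noise rather than removing it, which matches the description of merging the enhanced voice with the suppressed noise sources into a single sound.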

Further, the video call terminal includes one or more processors, a memory, one or more storage devices, a power supply, one or more connectors, a network interface, and a microphone array; the video call terminal also includes an operating system, which contains a number of modules or applications that can run on the one or more processors; the video call terminal may include a standby wake-up module; the processors, memory, storage devices, power supply, connectors, network interface, and microphone array are interconnected by means of inter-component communication;

The one or more processors are configured to perform functions or process instructions in the video call terminal; the one or more processors can process instructions stored in the memory or the storage devices; these instructions can be used to operate the hardware modules to carry out specific functions or processes;

The memory is internal storage that exchanges data directly with the CPU; the contents of its storage cells can be read or written at random as needed, and its access speed is independent of the location of the storage cell.

The second technical problem to be solved by the present invention is to provide a video call method with far-field speech enhancement that uses a multi-noise filtering engine to suppress multiple noise sources and enhance the voice of the primary speaker, thereby improving the call quality of far-field video calls.

The second problem is solved as follows: a video call method with far-field speech enhancement, the method requiring at least two video call terminals, a multi-noise filtering engine, and a multi-noise filtering API management server;

The method is specifically as follows: when the video call terminal at one end carries out a far-field video call, the caller's far-field voice and multiple noise sources are received and recorded by the video call terminal at the same time; the multi-noise filtering engine filters the far-field sound and the multiple noise sources; the multi-noise filtering API management server then suppresses the multiple noise sources to enhance the voice of the primary speaker, and the processed voice of the primary speaker is sent to the video call terminal at the other end.

The present invention has the following advantages: the video call terminals of the present invention are interconnected through a basic communication network (the Internet, etc.); the video call includes a multi-noise filtering engine; the video call includes a multi-noise filtering API management server. During a far-field video call, the caller's far-field voice and multiple noise sources are received and recorded by the microphone array at the same time, and the caller's main voice is often drowned out by the noise sources, which severely degrades call quality. The present invention uses a multi-noise filtering engine to suppress multiple noise sources and enhance the voice of the primary speaker, thereby improving the call quality of far-field video calls.

Brief description of the drawings

The present invention is further described below with reference to the accompanying drawings and embodiments.

Fig. 1 is an overall framework diagram of the system of the present invention.

Fig. 2 is a structural diagram of the modules in the video call terminal of the present invention.

Fig. 3 is a flow diagram of the multi-noise filtering process of the far-field speech enhancement system of the present invention.

Fig. 4 is a hardware structure diagram of the video call terminal of the present invention.

Fig. 5 is a flow diagram of the operation of the method of the present invention.

Detailed description of the invention

Referring to Fig. 1 to Fig. 4, a video call system with far-field speech enhancement includes: at least two video call terminals, a multi-noise filtering engine, and a multi-noise filtering API management server; the multi-noise filtering engine and the multi-noise filtering API management server are connected to the two video call terminals through a communication network;

The video call terminal is provided with hardware drivers, an operating system module, a video call middleware module, a microphone array recording module, a raw sound enhancement module, a call main-voice and noise-source separation module, a multi-noise filtering engine API, a call main-voice and noise-source merging module, a video call audio/video packaging module, and a video call transmission module;

The hardware drivers: the device includes internal or external hardware modules; a hardware driver is the driver software for a hardware module (network driver, microphone array driver) and generally completes initialization during the operating system initialization stage;

The video call middleware module: a software package providing the basic functions of video calling; it generally includes modules for input device management (microphone, etc.), audio/video preprocessing, audio/video encoding, audio/video packaging, and network transmission. The video call middleware module relies on the operating system to run.

As shown in Fig. 3, in the present invention the video call with far-field speech enhancement focuses on the input and output of data at each module:

Far-field sound input, including: the call voice (Cn), environmental noise, echo noise, reverberation noise, and multi-speaker noise;

The multi-noise filtering API management server mainly has the following functions: maintaining the multi-noise filtering engine APIs; managing external multi-noise filtering engine APIs; maintaining the adaptation of external multi-noise filtering engine APIs to the internal multi-noise filtering engine API; maintaining the update policy of the multi-noise filtering engine APIs; maintaining the management policy for external multi-noise filtering engine APIs; and auditing the service quality of the multi-noise filtering engine APIs.

As shown in Fig. 5, the video call operation with far-field speech enhancement of the present invention is specifically as follows: the microphone array recording module of the video call terminal at one end receives the voice of the far-field video call participant together with the associated multiple noise sources; the video call terminal amplifies the raw audio data through the raw sound enhancement module and then passes it to the local or online multi-noise filtering engine for processing. The local or online multi-noise filtering engine first separates the voice of the primary speaker from the multiple noise sources through the call main-voice and noise-source separation module; it then enhances the voice of the primary speaker and suppresses the multiple noise sources through the multi-noise filtering engine API; it then merges the enhanced voice of the primary speaker with the suppressed noise sources through the call main-voice and noise-source merging module and returns the result to the video call terminal. The video call terminal packages the video data and the processed audio data into network packets suitable for network transmission through the video call audio/video packaging module, and the video call transmission module of the video call middleware module transmits the audio/video data to the video call terminal at the other end.

In addition, the video call terminal of the present invention includes one or more processors, a memory, one or more storage devices, a power supply, one or more connectors, a network interface (WIFI/3G/4G), and a microphone array; the video call terminal also includes an operating system, which contains a number of modules or applications that can run on the one or more processors; the video call terminal may include a standby wake-up module; the processors, memory, storage devices, power supply, connectors, network interface, and microphone array are interconnected by means of inter-component communication (physical connection, two-way communication, two-way operation);

The one or more processors can be configured to perform functions or process instructions in the far-field video call device. The one or more processors can process instructions stored in the memory or the storage devices. These instructions can be used to operate the hardware modules to carry out specific functions or processes.

The memory is internal storage that exchanges data directly with the CPU; the contents of its storage cells can be read or written at random as needed, and its access speed is independent of the location of the storage cell. The memory usually serves as the temporary data storage medium for the operating system or other running programs. It is a temporary storage medium in which software or programs store transient data or instructions during execution. The memory typically uses RAM or SRAM.

The one or more storage devices include one or more computer-readable storage media. The one or more storage devices are used for the persistent storage of data or information. The one or more storage devices include non-volatile storage media such as hard disks, SSDs, Flash, and EEPROM.

The far-field video call device may include a network interface. The network interface is used for LAN or WAN communication: WIFI is used for local area network communication, and the 3G/4G module is used for wide area network communication. Through the network interface, the far-field video call device can communicate with external far-field video call devices (mobile phone, tablet, TV, set-top box, video call server, etc.).

The far-field video call device may include connectors (WIFI network connection, Bluetooth, GLONASS, FM radio reception).

The far-field video call device may include a power supply. The power supply may be a rechargeable battery, and the battery may be made of lithium, graphene, or other suitable materials. The power supply may include a transformer that converts external power into power suitable for charging.

The far-field video call device may include a microphone array, which couples the signals of two microphones into one signal. With this technique, the difference in phase between the sound waves received by the two microphones can be used to filter the sound waves, removing ambient background sound to the greatest extent and leaving only the desired sound waves. For devices used in noisy environments, this configuration allows the listener to hear the speaker clearly, without noise, even when the environment is noisy.
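One common textbook way to exploit the arrival-time (phase) difference between two microphones is delay-and-sum beamforming. The patent does not state which technique its microphone array uses, so the sketch below is only an illustration of the general idea; the function name delay_and_sum and the integer-sample delay are assumptions.

```python
def delay_and_sum(mic_a, mic_b, delay_b):
    """Average mic_a with mic_b advanced by delay_b samples (delay_b >= 0).

    delay_b is the number of samples by which the wanted sound reaches
    microphone B later than microphone A. Aligning the two channels makes the
    wanted sound add coherently, while sound from other directions does not.
    """
    n = min(len(mic_a), len(mic_b) - delay_b)
    return [(mic_a[i] + mic_b[i + delay_b]) / 2.0 for i in range(max(n, 0))]


def energy(signal):
    """Sum of squared samples, used here to compare outputs."""
    return sum(s * s for s in signal)
```

When the delay matches the direction of the wanted speaker, the output keeps that speaker's signal at full amplitude; with a mismatched delay, the same two channels partly cancel, which is the effect a beamformer uses to favor one direction over others.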

In the far-field video call device, the processor, memory, storage devices, power supply, and connectors form the minimal system required for the device to run. The network interface (WIFI/3G/4G) and the microphone array are the hardware basis for the far-field video call function.

The operating system (Linux and Android) controls the operation of the hardware modules in the far-field video call device. The operating system encapsulates the complex and variable operation and control of the hardware in the hardware driver layer, keeping hardware interface calls at the operating system layer uniform. The operating system is the interface between the user and the computer, and also the interface between the computer hardware and other software. Its functions include managing the hardware, software, and data resources of the computer system, controlling program execution, improving the human-machine interface, supporting other application software, making full use of all resources of the computer system, and providing various forms of user interface so that the user has a good working environment, as well as providing the necessary services and corresponding interfaces for the development of other software.

Referring to Fig. 4 and Fig. 5, a video call method with far-field speech enhancement according to the present invention requires at least two video call terminals, a multi-noise filtering engine, and a multi-noise filtering API management server;

In the present invention, the video call with far-field speech enhancement focuses on the input and output of data at each module;

In addition, as shown in Fig. 4, the video call terminal of the present invention includes one or more processors, a memory, one or more storage devices, a power supply, one or more connectors, a network interface (WIFI/3G/4G), and a microphone array; the video call terminal also includes an operating system, which contains a number of modules or applications that can run on the one or more processors; the video call terminal may include a standby wake-up module; the processors, memory, storage devices, power supply, connectors, network interface, and microphone array are interconnected by means of inter-component communication (physical connection, two-way communication, two-way operation);

Although specific embodiments of the present invention have been described above, those skilled in the art should understand that the specific embodiments described are merely illustrative and are not intended to limit the scope of the present invention. Equivalent modifications and changes made by those skilled in the art in accordance with the spirit of the present invention shall fall within the scope of protection claimed by the present invention.

Claims

1. A video call system with far-field speech enhancement, characterized in that the system includes: at least two video call terminals, a multi-noise filtering engine, and a multi-noise filtering API management server; the multi-noise filtering engine and the multi-noise filtering API management server are connected to the two video call terminals through a communication network;

When the video call terminal at one end carries out a far-field video call, the caller's far-field voice and multiple noise sources are received and recorded by the video call terminal at the same time;

2. The video call system with far-field speech enhancement according to claim 1, characterized in that the video call terminal is provided with hardware drivers, an operating system module, a video call middleware module, a microphone array recording module, a raw sound enhancement module, a call main-voice and noise-source separation module, a multi-noise filtering engine API, a call main-voice and noise-source merging module, a video call audio/video packaging module, and a video call transmission module;

The operating system is a unified abstraction of the device hardware and hardware interfaces, and is the basic environment in which software runs;

The raw sound enhancement module: calls an audio algorithm to enhance the recorded raw sound, that is, to amplify the sound signal;

The call main-voice and noise-source separation module: calls the multi-noise filtering engine API with the enhanced raw sound as input, and outputs the main voice and the noise sources;

The multi-noise filtering engine API: takes the enhanced raw sound as input and outputs the main voice and the noise sources; the multi-noise filtering engine API can be deployed on the local device or on a server;

The call main-voice and noise-source merging module: after the main voice has been enhanced and the noise sources suppressed, synthesizes the enhanced main voice and the attenuated noise sources into a single sound;

The video call audio/video packaging module: the video stream is encoded with H264/H265 and then packetized into a PES stream; the audio is encoded with AAC or AC3 and then packetized into a PES stream; the audio and video PES streams are then multiplexed into a TS stream suitable for network transmission;

3. The video call system with far-field speech enhancement according to claim 2, characterized in that the video call middleware module includes: an input device management module, an audio/video preprocessing module, an audio/video encoding module, an audio/video packaging module, and a network transmission module.

4. The video call system with far-field speech enhancement according to claim 2, characterized in that the video call with far-field speech enhancement focuses on the input and output of data at each module;

The multi-noise filtering engine calls the multi-noise filtering engine API to process the digitized far-field sound, producing audio data in which the far-field speech is enhanced and the multiple noise sources are suppressed.

5. The video call system with far-field speech enhancement according to claim 2, characterized in that the multi-noise filtering API management server mainly has the following functions: maintaining the multi-noise filtering engine APIs; managing external multi-noise filtering engine APIs; maintaining the adaptation of external multi-noise filtering engine APIs to the internal multi-noise filtering engine API; maintaining the update policy of the multi-noise filtering engine APIs; maintaining the management policy for external multi-noise filtering engine APIs; and auditing the service quality of the multi-noise filtering engine APIs.

6. The video call system with far-field speech enhancement according to claim 1, characterized in that the video call operation with far-field speech enhancement is specifically as follows: the microphone array recording module of the video call terminal at one end receives the voice of the far-field video call participant together with the associated multiple noise sources; the video call terminal amplifies the raw audio data through the raw sound enhancement module and then passes it to the local or online multi-noise filtering engine for processing; the local or online multi-noise filtering engine first separates the voice of the primary speaker from the multiple noise sources through the call main-voice and noise-source separation module; it then enhances the voice of the primary speaker and suppresses the multiple noise sources through the multi-noise filtering engine API; it then merges the enhanced voice of the primary speaker with the suppressed noise sources through the call main-voice and noise-source merging module and returns the result to the video call terminal; the video call terminal packages the video data and the processed audio data into network packets suitable for network transmission through the video call audio/video packaging module, and the video call transmission module of the video call middleware module transmits the audio/video data to the video call terminal at the other end.

7. A far-field speech-enhanced video call method, characterized in that the method requires at least two video call terminals, a multi-noise filtering processing engine, and a multi-noise filtering API management server;

the method is specifically as follows: when the video call terminal at one end conducts a far-field video call, the caller's far-field voice and multiple noise sources are received and recorded simultaneously by the video call terminal; the multi-noise filtering processing engine then filters the far-field voice and the multiple noise sources; the multi-noise filtering API management server then suppresses the multiple noise sources so as to enhance the voice of the call subject; and the processed voice of the call subject is sent to the video call terminal at the other end.

8. The far-field speech-enhanced video call method according to claim 7, characterized in that the video call terminal is provided with a hardware driver, an operating system module, a video call middleware module, a microphone array recording module, an original sound enhancement module, a call-subject voice and noise source separation module, a multi-noise filtering engine API, a call-subject voice and noise source merging module, a video call audio/video packetization module, and a video call transport module.

9. The far-field speech-enhanced video call method according to claim 8, characterized in that the video call middleware module comprises: an input device management module, an audio/video preprocessing module, an audio/video encoding module, an audio/video packetization module, and a network transmission module.

10. The far-field speech-enhanced video call method according to claim 8, characterized in that the far-field speech-enhanced video call focuses on the input and output of data at each module.

11. The far-field speech-enhanced video call method according to claim 8, characterized in that the multi-noise filtering API management server mainly has the following functions: maintaining the multi-noise filtering engine APIs; managing external multi-noise filtering engine APIs; maintaining the adaptation of external multi-noise filtering engine APIs to the internal multi-noise filtering engine API; maintaining the update policy of the multi-noise filtering engine APIs; maintaining the management policy of external multi-noise filtering engine APIs; and auditing the quality of service of the multi-noise filtering engine APIs.

12. The far-field speech-enhanced video call method according to claim 8, characterized in that the far-field speech-enhanced video call operates as follows: the microphone array recording module of the video call terminal at one end receives the voice of the far-field video call participant together with the associated multiple noise sources; the video call terminal passes the raw audio data through the original sound enhancement module for signal amplification and then hands it to a local or online multi-noise filtering processing engine; the local or online multi-noise filtering processing engine first separates the voice of the call subject from the multiple noise sources by means of the call-subject voice and noise source separation module, then enhances the voice of the call subject and suppresses the multiple noise sources through the multi-noise filtering engine API, then merges the enhanced call-subject voice with the suppressed noise sources by means of the call-subject voice and noise source merging module, and returns the result to the video call terminal; the video call terminal packs the video data and the processed audio data, by means of the video call audio/video packetization module, into network packets suitable for network transmission, and the video call transport module of the video call middleware module transmits the audio/video data to the video call terminal at the other end.
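
The order of operations recited above (amplify, separate, enhance and suppress, merge, packetize) can be illustrated with a minimal sketch. It is not the claimed implementation: the moving-average "separation" is only a toy stand-in for real source separation, and all function names, gains, the window length, and the packet size are assumptions chosen for illustration. Video data and network transport are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple


def amplify(samples: List[float], gain: float = 2.0) -> List[float]:
    """Original sound enhancement step: plain signal amplification."""
    return [gain * s for s in samples]


def separate(samples: List[float], window: int = 3) -> Tuple[List[float], List[float]]:
    """Toy separation (assumption): a trailing moving average stands in for the
    call subject's voice, and the residual stands in for the noise sources."""
    voice = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - window + 1):i + 1]
        voice.append(sum(chunk) / len(chunk))
    noise = [s - v for s, v in zip(samples, voice)]
    return voice, noise


def enhance_and_suppress(voice: List[float], noise: List[float],
                         voice_gain: float = 1.5,
                         noise_gain: float = 0.1) -> Tuple[List[float], List[float]]:
    """Boost the voice estimate and attenuate the noise estimate."""
    return [voice_gain * v for v in voice], [noise_gain * n for n in noise]


def merge(voice: List[float], noise: List[float]) -> List[float]:
    """Recombine the enhanced voice with the suppressed noise."""
    return [v + n for v, n in zip(voice, noise)]


@dataclass
class Packet:
    seq: int               # sequence number of the packet
    payload: List[float]   # audio samples carried by the packet


def packetize(samples: List[float], size: int = 4) -> List[Packet]:
    """Split processed audio into fixed-size packets for transmission."""
    return [Packet(seq=i // size, payload=samples[i:i + size])
            for i in range(0, len(samples), size)]


def process_call_audio(raw: List[float]) -> List[Packet]:
    """Run the steps in the order the claim lists them."""
    amplified = amplify(raw)
    voice, noise = separate(amplified)
    voice, noise = enhance_and_suppress(voice, noise)
    return packetize(merge(voice, noise))
```

Calling `process_call_audio` on a list of recorded samples returns the packets that would be handed to the transport step; swapping `separate` for a real separation algorithm would leave the rest of the flow unchanged.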