CN105679328A - Speech signal processing method, device and system

Info

Publication number: CN105679328A
Application number: CN201610060386.4A
Authority: CN
Inventors: 刘焕; 汤峰峰; 修平平; 鄢仁祥; 曹李军
Original assignee: Suzhou Keda Technology Co Ltd
Current assignee: Suzhou Keda Technology Co Ltd
Priority date: 2016-01-28
Filing date: 2016-01-28
Publication date: 2016-06-15

Abstract

The invention discloses a speech signal processing method, device and system. The method comprises the following steps: obtaining position information of a target sound source with respect to each microphone in a microphone array; obtaining delay time of the target sound source from the time of sending a speech signal to each microphone to the time of receiving the speech signal according to the position information; and according to the delay time, carrying out speech signal processing on the sound information from each microphone, and obtaining the speech information sent by the target sound source. The speech signal processing method, device and system can carry out accurate positioning on the target sound source, have good processing effect of the target sound source speech signal, can enable the processed speech of the target sound source to realize local play or remote communication, can also process each marked sound source speech and store the processed speeches for evidence, and have very high flexibility.

Description

Speech signal processing method, device and system

Technical field

The present invention relates to the field of audio and video technology, and in particular to a speech signal processing method, device and system.

Background art

With the rapid development of audio and video technology, cameras and microphone array equipment have become indispensable in applications such as video surveillance and video conferencing. Unfortunately, although clear video can be recorded, voice communication is usually affected by interfering sound sources, noise and reverberation, so that the speech content at the recorded scene is difficult to hear clearly.

To improve the reception of audio signals in adverse environments, a microphone array is usually used to locate the sound source, and speech signal processing such as beamforming is applied in the direction of the sound source. In a noisy environment with many people talking, however, current microphone array techniques cannot locate the sound source in such a complex acoustic scene, so the effect of the speech signal processing applied to the speech information emitted by the sound source is difficult to guarantee, and the ability to suppress noise and interference is poor.

Summary of the invention

Therefore, the technical problem to be solved by the embodiments of the present invention is that speech signal processing systems of the prior art have a poor ability to suppress noise and interference in complex multi-speaker acoustic environments.

To this end, a speech signal processing method according to an embodiment of the present invention comprises the following steps:

obtaining position information of a target sound source relative to each microphone in a microphone array;

obtaining, according to the position information of the target sound source relative to each microphone in the microphone array, the delay time from the target sound source emitting speech information to each microphone receiving said speech information;

performing, according to said delay time, speech signal processing on the speech information from each microphone to obtain the speech information emitted by said target sound source.

Preferably, obtaining the position information of the target sound source relative to each microphone in the microphone array comprises:

obtaining position information of said target sound source relative to a camera;

obtaining, according to the position information of said target sound source relative to said camera and a preset positional relationship between the microphone array and the camera, the position information of the target sound source relative to each microphone in the microphone array.

Preferably, obtaining the position information of said target sound source relative to the camera comprises:

receiving live video information containing sound sources sent by the camera, and a target sound source selected from among the sound sources contained in said live video information;

obtaining, according to said live video information, the position information of said target sound source relative to said camera.

Preferably, obtaining the position information of the target sound source relative to each microphone in the microphone array further comprises:

verifying and adjusting the obtained position information of said target sound source relative to each microphone in the microphone array by using the spatial geometry of the microphone array and the correlation statistics between the microphones, to obtain verified and adjusted position information.

Preferably, the method further comprises the following step:

sending the obtained speech information emitted by said target sound source to a local loudspeaker for playback, to a communication device for speech information exchange with a remote device, or to a storage device for storage.

A speech signal processing device according to an embodiment of the present invention comprises:

a position acquiring unit, configured to obtain position information of a target sound source relative to each microphone in a microphone array;

a delay acquiring unit, configured to obtain, according to the position information of the target sound source relative to each microphone in the microphone array, the delay time from said target sound source emitting speech information to each microphone receiving said speech information;

a speech acquiring unit, configured to perform, according to said delay time, speech signal processing on the speech information from each microphone to obtain the speech information emitted by said target sound source.

Preferably, said position acquiring unit comprises:

a first position acquiring subunit, configured to obtain position information of said target sound source relative to a camera;

a second position acquiring subunit, configured to obtain, according to the position information of said target sound source relative to said camera and a preset positional relationship between the microphone array and the camera, the position information of the target sound source relative to each microphone in the microphone array.

Preferably, said first position acquiring subunit comprises:

a receiving unit, configured to receive live video information containing sound sources sent by the camera, and a target sound source selected from among the sound sources contained in said live video information;

a position acquiring sub-subunit, configured to obtain, according to said live video information, the position information of said target sound source relative to said camera.

Preferably, said position acquiring unit further comprises:

a position verification and adjustment unit, configured to verify and adjust the obtained position information of said target sound source relative to each microphone in the microphone array by using the spatial geometry of the microphone array and the correlation statistics between the microphones, to obtain verified and adjusted position information.

Preferably, the device further comprises:

a sending unit, configured to send the obtained speech information emitted by said target sound source to a local loudspeaker for playback, to a communication device for speech information exchange with a remote device, or to a storage device for storage.

A speech signal processing system according to an embodiment of the present invention comprises:

a camera, configured to capture live video information containing sound sources and send it to a speech signal processing device;

a microphone array, configured to capture the speech information emitted by a target sound source and send it to the speech signal processing device;

the speech signal processing device, configured to: receive the live video information containing sound sources sent by the camera; obtain, according to said live video information, position information of the target sound source relative to said camera; obtain, according to the position information of said target sound source relative to said camera and a preset positional relationship between the microphone array and the camera, position information of the target sound source relative to each microphone in the microphone array; obtain, according to the position information of the target sound source relative to each microphone in the microphone array, the delay time from said target sound source emitting speech information to each microphone receiving said speech information; and perform, according to said delay time, speech signal processing on the speech information from each microphone to obtain the speech information emitted by said target sound source.

Preferably, said speech signal processing device is further configured to verify and adjust the obtained position information of said target sound source relative to each microphone in the microphone array by using the spatial geometry of the microphone array and the correlation statistics between the microphones, to obtain verified and adjusted position information.

Preferably, said speech signal processing device is further configured to send the obtained speech information emitted by said target sound source to a local loudspeaker for playback, to a communication device for speech information exchange with a remote device, or to a storage device for storage.

Preferably, the system further comprises:

a display device, configured to display the live video information, obtain the selected target sound source and send it to said speech signal processing device;

a loudspeaker device, configured to receive and play the speech information emitted by said target sound source and sent by said speech signal processing device;

a communication device, configured to receive the speech information emitted by said target sound source and sent by said speech signal processing device, and to exchange speech information with a remote device;

a storage device, configured to receive and store the speech information emitted by said target sound source and sent by said speech signal processing device.

The technical solutions of the embodiments of the present invention have the following advantages:

1. With the speech signal processing method, device and system provided by the embodiments of the present invention, by obtaining the position information of the target sound source relative to each microphone in the microphone array, the delay with which each microphone receives the speech information emitted by the target sound source can be estimated directly. Combined with the position of the target sound source, this reduces the influence of other sound sources on microphone speech acquisition in complex multi-speaker acoustic environments when speech signal processing is performed, so that the processing result is good and the ability to suppress interference is improved.

2. With the speech signal processing method, device and system provided by the embodiments of the present invention, by acquiring the position information of the target sound source relative to the camera and combining it with the preset positional relationship between the microphone array and the camera, the position information of the target sound source relative to each microphone in the microphone array can be obtained accurately. This improves the positioning accuracy for the target sound source and thus further improves the effect of speech signal processing.

3. The speech signal processing method, device and system provided by the embodiments of the present invention use microphone array technology to accurately verify the sound source position by means of the statistical correlation between adjacent microphones and to tune the azimuth and distance of the sound source. This further improves the positioning accuracy for the target sound source and thus further improves the effect of speech signal processing.

Brief description of the drawings

In order to explain the technical solutions of the specific embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a specific example of the speech signal processing method in Embodiment 1 of the present invention;

Fig. 2 is a layout diagram of a specific example of the camera, microphone array and sound sources in Embodiment 1 of the present invention;

Fig. 3 is a functional block diagram of a specific example of the speech signal processing device in Embodiment 2 of the present invention;

Fig. 4 is a functional block diagram of a specific example of the speech signal processing system in Embodiment 3 of the present invention.

Detailed description

The technical solutions of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

In the description of the present invention, it should be noted that the terms "first", "second" and "third" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance.

The technical features involved in the different embodiments of the present invention described below can be combined with one another as long as they do not conflict.

Embodiment 1

The present embodiment provides a speech signal processing method which, as shown in Fig. 1, comprises the following steps:

S1: obtain position information of the target sound source relative to each microphone in the microphone array. The position information may include azimuth, distance and the like. The spatial geometry of the microphone array can be chosen according to actual needs; for example, as shown in Fig. 2, a plurality of sound sources 101 exist in the space, and the microphone array 30 has a circular spatial geometry and is located on the camera 20.

S2: according to the position information of the target sound source relative to each microphone in the microphone array, obtain the delay time from the target sound source emitting speech information to each microphone receiving that speech information. Preferably, because the azimuth and distance between the target sound source and each microphone are already known, the delay time can be calculated directly from the relationship between the speed of sound and the distance, without the complex computation based on the correlation between microphones, which improves processing efficiency.
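
The delay calculation in step S2 can be sketched as follows. This is a minimal illustration, assuming free-field propagation and a fixed speed of sound of 343 m/s; the function names, the coordinate convention and the four-microphone circular layout are hypothetical and are not taken from the original text.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly 20 degrees Celsius


def propagation_delays(source_xyz, mic_positions, c=SPEED_OF_SOUND_M_S):
    """Return the delay in seconds from the source to each microphone."""
    return [math.dist(source_xyz, mic) / c for mic in mic_positions]


def relative_delays(delays):
    """Delays relative to the microphone that hears the source first."""
    earliest = min(delays)
    return [d - earliest for d in delays]


# Hypothetical circular array of radius 0.1 m in the z = 0 plane (metres).
mics = [(0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (-0.1, 0.0, 0.0), (0.0, -0.1, 0.0)]
source = (3.43, 0.0, 0.0)
delays = propagation_delays(source, mics)
```

Only the differences between the delays matter for aligning the channels, which is why `relative_delays` subtracts the earliest arrival.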

S3: according to the delay time, perform speech signal processing on the speech information from each microphone to obtain the speech information emitted by the target sound source. The speech signal processing may be beamforming, echo cancellation, noise suppression, gain control and the like. For example, filtering is applied toward the direction of the target sound source, with beamforming suppressing sound from other directions. Echo cancellation is then applied to the beamformed speech to filter out the loudspeaker signal picked up by the microphones. Noise suppression is applied to the echo-cancelled speech to further filter out interfering noise. Finally, gain control is applied to the noise-suppressed speech, adjusting the gain so that the speech sounds clearer.
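
As an illustration of the beamforming stage in step S3, the sketch below uses delay-and-sum beamforming with whole-sample alignment. The text does not name a specific beamformer, so this choice is an assumption, and the function name and arguments are hypothetical. Echo cancellation, noise suppression and gain control are not shown.

```python
def delay_and_sum(channels, delays_s, sample_rate_hz):
    """Align each channel by its rounded delay and average the result.

    channels: list of equal-length lists of samples, one per microphone.
    delays_s: propagation delay of the target source to each microphone.
    """
    earliest = min(delays_s)
    # Convert each delay, relative to the earliest microphone, to whole samples.
    shifts = [round((d - earliest) * sample_rate_hz) for d in delays_s]
    length = len(channels[0]) - max(shifts)
    output = []
    for n in range(length):
        # Advancing channel i by its shift lines up the target's wavefront.
        total = sum(ch[n + s] for ch, s in zip(channels, shifts))
        output.append(total / len(channels))
    return output
```

Sound from the target direction adds coherently after alignment, while sound from other directions is averaged with mismatched timing and is attenuated. A practical implementation would typically use fractional delays rather than rounding to whole samples.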

With the above speech signal processing method, by obtaining the position information of the target sound source relative to each microphone in the microphone array, the delay with which each microphone receives the speech information emitted by the target sound source can be estimated directly. Combined with the position of the target sound source, this reduces the influence of other sound sources on microphone speech acquisition in complex multi-speaker acoustic environments when speech signal processing is performed, so that the processing result is good and the ability to suppress interference is improved.

Preferably, the above step S1 comprises:

S11: obtain position information of the target sound source relative to the camera. Preferably, this step comprises: receiving live video information containing sound sources sent by the camera, together with the target sound source selected from among the sound sources contained in the live video information; and obtaining, according to the live video information, the position information of the target sound source relative to the camera. Preferably, the camera may be a bullet-dome linked camera system, in which the bullet camera covers the whole field of view of the scene and the dome camera generates the detailed live video image containing the target sound source. From the intrinsic camera parameters of the dome camera and the convex lens model, position information such as the azimuth and distance of the target sound source relative to the camera can be calculated.
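
One way to picture the calculation in step S11 is the pinhole camera model sketched below. The intrinsic parameters `fx`, `fy`, `cx`, `cy`, the function names, and in particular the use of a known real-world object height to estimate distance are assumptions made for illustration; the text only states that azimuth and distance follow from the camera intrinsics and the lens model.

```python
import math


def pixel_to_bearing(u, v, fx, fy, cx, cy):
    """Return (azimuth, elevation) in radians of the ray through pixel (u, v)."""
    x = (u - cx) / fx  # normalized image coordinates, camera looks along +z
    y = (v - cy) / fy  # image y grows downward
    azimuth = math.atan2(x, 1.0)
    elevation = math.atan2(-y, math.hypot(x, 1.0))
    return azimuth, elevation


def depth_from_known_height(pixel_height, real_height_m, fy):
    """Estimate depth along the optical axis from an object's apparent height."""
    return fy * real_height_m / pixel_height


def camera_frame_position(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) at the given depth into camera coordinates."""
    return ((u - cx) / fx * depth_m, (v - cy) / fy * depth_m, depth_m)
```

For example, with a hypothetical 1920x1080 image and `fx = fy = 1000`, a target selected at pixel (1960, 540) lies 45 degrees to the right of the optical axis.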

S12: according to the position information of the target sound source relative to the camera and the preset positional relationship between the microphone array and the camera, obtain the position information of the target sound source relative to each microphone in the microphone array. The positional relationship between the microphone array and the camera can be calibrated in advance.
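
The conversion in step S12 can be sketched as a rigid transform followed by a per-microphone range and bearing calculation. The sketch assumes that the pre-calibrated positional relationship is expressed as a 3x3 rotation matrix and a translation vector; that representation and all names below are hypothetical.

```python
import math


def camera_to_array_frame(point_cam, rotation, translation):
    """Map a camera-frame point into the array frame: p = R * p_cam + t."""
    return tuple(
        sum(rotation[r][k] * point_cam[k] for k in range(3)) + translation[r]
        for r in range(3)
    )


def range_and_azimuth(source, mic):
    """Distance and horizontal azimuth (radians) of the source seen from a microphone."""
    dx, dy, dz = (s - m for s, m in zip(source, mic))
    return math.sqrt(dx * dx + dy * dy + dz * dz), math.atan2(dy, dx)
```

Because the calibration is done once in advance, only this cheap transform has to run each time a new target is selected.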

With the above speech signal processing method, by acquiring the position information of the target sound source relative to the camera and combining it with the preset positional relationship between the microphone array and the camera, the position information of the target sound source relative to each microphone in the microphone array can be obtained accurately. This improves the positioning accuracy for the target sound source and thus further improves the effect of speech signal processing.

Preferably, the above step S1 further comprises:

After step S12, step S13 is performed: using the spatial geometry of the microphone array and the correlation statistics between the microphones, verify and adjust the obtained position information of the target sound source relative to each microphone in the microphone array, obtaining the verified and adjusted position information. Microphone array technology is used to accurately verify the sound source position by means of the statistical correlation between adjacent microphones (for example, computing correlation statistics on the energy received by the microphones to obtain azimuth and distance information) and to tune the azimuth and distance of the sound source, which further improves the positioning accuracy for the target sound source and thus further improves the effect of speech signal processing.
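
The text does not spell out the verification procedure of step S13. One plausible reading, sketched below purely as an assumption, is to measure the lag between a pair of microphones from the peak of their cross-correlation and to compare it with the lag predicted from the camera-derived position. The function names and the tolerance of one sample are hypothetical.

```python
def best_lag(ref, other, max_lag):
    """Lag in samples by which `other` trails `ref`, from the cross-correlation peak."""
    def score(lag):
        return sum(
            ref[n] * other[n + lag]
            for n in range(len(ref))
            if 0 <= n + lag < len(other)
        )
    return max(range(-max_lag, max_lag + 1), key=score)


def verify_position(predicted_lags, measured_lags, tolerance=1):
    """True if every predicted lag is within `tolerance` samples of the measured one."""
    return all(abs(p - m) <= tolerance for p, m in zip(predicted_lags, measured_lags))
```

If the check fails, the position estimate could be adjusted, for example by searching nearby candidate positions for the one whose predicted lags best match the measured ones; that refinement is not shown here.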

Preferably, the speech signal processing method further comprises the following step:

S4: send the obtained speech information emitted by the target sound source to a local loudspeaker for playback, to a communication device for speech information exchange with a remote device, or to a storage device for storage. When a sound source position is selected or marked within the camera's field of view or in a recording, the speech at that position after speech signal processing such as beamforming can be listened to, which facilitates storage and evidence collection and provides high flexibility.
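
The routing in step S4 can be pictured as encoding the processed speech once and handing it to whichever outputs are configured. The sketch below is an assumption for illustration: the 16-bit mono WAV encoding, the 16 kHz default sample rate and the idea of modelling the loudspeaker, communication device and storage device as callables are all hypothetical.

```python
import io
import struct
import wave


def samples_to_wav_bytes(samples, sample_rate_hz=16000):
    """Encode mono float samples in [-1, 1] as 16-bit PCM WAV bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate_hz)
        clipped = [max(-1.0, min(1.0, s)) for s in samples]
        wav.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in clipped))
    return buffer.getvalue()


def dispatch(samples, sinks):
    """Send the processed speech to every configured sink (playback, network, storage)."""
    payload = samples_to_wav_bytes(samples)
    for sink in sinks:
        sink(payload)
    return payload
```

A storage sink could, for instance, write the payload to a file named after the marked sound source so that recordings for different sources are kept separately.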

Embodiment 2

Corresponding to Embodiment 1, the present embodiment provides a speech signal processing device which, as shown in Fig. 3, comprises:

a position acquiring unit 1, configured to obtain position information of the target sound source relative to each microphone in the microphone array;

a delay acquiring unit 2, configured to obtain, according to the position information of the target sound source relative to each microphone in the microphone array, the delay time from the target sound source emitting speech information to each microphone receiving the speech information;

a speech acquiring unit 3, configured to perform, according to the delay time, speech signal processing on the speech information from each microphone to obtain the speech information emitted by the target sound source.
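
The three units listed above can be pictured as stages of one pipeline. The sketch below is only a structural illustration under the assumption that each unit is supplied as a callable; the class name, field names and type aliases are hypothetical and do not describe the actual implementation of the device.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class SpeechSignalProcessingDevice:
    # Unit 1: returns the source position and the microphone positions.
    acquire_positions: Callable[[], Tuple[Point, List[Point]]]
    # Unit 2: turns those positions into one delay per microphone.
    acquire_delays: Callable[[Point, List[Point]], List[float]]
    # Unit 3: processes the microphone channels using the delays.
    acquire_speech: Callable[[Sequence[Sequence[float]], List[float]], List[float]]

    def process(self, channels):
        """Run the units in order and return the target source's speech."""
        source, mics = self.acquire_positions()
        delays = self.acquire_delays(source, mics)
        return self.acquire_speech(channels, delays)
```

Keeping the units separate in this way means, for example, that the position source can be swapped from a camera-based estimate to a verified and adjusted one without touching the delay or speech stages.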

With the above speech signal processing device, by obtaining the position information of the target sound source relative to each microphone in the microphone array, the delay with which each microphone receives the speech information emitted by the target sound source can be estimated directly. Combined with the position of the target sound source, this reduces the influence of other sound sources on microphone speech acquisition in complex multi-speaker acoustic environments when speech signal processing is performed, so that the processing result is good and the ability to suppress interference is improved.

Preferably, the position acquiring unit 1 comprises:

a first position acquiring subunit, configured to obtain position information of the target sound source relative to the camera;

a second position acquiring subunit, configured to obtain, according to the position information of the target sound source relative to the camera and the preset positional relationship between the microphone array and the camera, the position information of the target sound source relative to each microphone in the microphone array.

Preferably, the first position acquiring subunit comprises:

a receiving unit, configured to receive live video information containing sound sources sent by the camera, and the target sound source selected from among the sound sources contained in the live video information;

a position acquiring sub-subunit, configured to obtain, according to the live video information, the position information of the target sound source relative to the camera.

With the above speech signal processing device, by acquiring the position information of the target sound source relative to the camera and combining it with the preset positional relationship between the microphone array and the camera, the position information of the target sound source relative to each microphone in the microphone array can be obtained accurately. This improves the positioning accuracy for the target sound source and thus further improves the effect of speech signal processing.

Preferably, the position acquiring unit 1 further comprises:

a position verification and adjustment unit, configured to verify and adjust the obtained position information of the target sound source relative to each microphone in the microphone array by using the spatial geometry of the microphone array and the correlation statistics between the microphones, to obtain verified and adjusted position information.

The above speech signal processing device uses microphone array technology to accurately verify the sound source position by means of the statistical correlation between adjacent microphones and to tune the azimuth and distance of the sound source. This further improves the positioning accuracy for the target sound source and thus further improves the effect of speech signal processing.

Preferably, the speech signal processing device further comprises:

a sending unit, configured to send the obtained speech information emitted by the target sound source to a local loudspeaker for playback, to a communication device for speech information exchange with a remote device, or to a storage device for storage. In this way, the processed speech of the target sound source can be played locally or used for remote communication, and the speech of each marked sound source can also be processed separately and stored as evidence, which provides high flexibility.

Embodiment 3

The present embodiment provides a speech signal processing system, which can be applied, for example, in video surveillance or video conferencing and which, as shown in Fig. 4, comprises:

a camera 20, configured to capture live video information containing sound sources and send it to the speech signal processing device;

a microphone array 30, configured to capture the speech information emitted by the target sound source and send it to the speech signal processing device;

a speech signal processing device 10, configured to: receive the live video information containing sound sources sent by the camera; obtain, according to the live video information, position information of the target sound source relative to the camera; obtain, according to the position information of the target sound source relative to the camera and the preset positional relationship between the microphone array and the camera, position information of the target sound source relative to each microphone in the microphone array; obtain, according to the position information of the target sound source relative to each microphone in the microphone array, the delay time from the target sound source emitting speech information to each microphone receiving the speech information; and perform, according to the delay time, speech signal processing on the speech information from each microphone to obtain the speech information emitted by the target sound source.

With the above speech signal processing system, by obtaining the position information of the target sound source relative to each microphone in the microphone array, the delay with which each microphone receives the speech information emitted by the target sound source can be estimated directly. Combined with the position of the target sound source, this reduces the influence of other sound sources on microphone speech acquisition in complex multi-speaker acoustic environments when speech signal processing is performed, so that the processing result is good and the ability to suppress interference is improved. By acquiring the position information of the target sound source relative to the camera and combining it with the preset positional relationship between the microphone array and the camera, the position information of the target sound source relative to each microphone in the microphone array can be obtained accurately, which improves the positioning accuracy for the target sound source and thus further improves the effect of speech signal processing.

Preferably, the speech signal processing device 10 is further configured to verify and adjust the obtained position information of the target sound source relative to each microphone in the microphone array by using the spatial geometry of the microphone array and the correlation statistics between the microphones, to obtain verified and adjusted position information. Microphone array technology is used to accurately verify the sound source position by means of the statistical correlation between adjacent microphones and to tune the azimuth and distance of the sound source, which further improves the positioning accuracy for the target sound source and thus further improves the effect of speech signal processing.

Preferably, the speech signal processing device 10 is further configured to send the obtained speech information emitted by the target sound source to a local loudspeaker for playback, to a communication device for speech information exchange with a remote device, or to a storage device for storage.

As shown in Fig. 4, the speech signal processing system further comprises:

a display device 40, configured to display the live video information, obtain the selected target sound source and send it to the speech signal processing device;

a loudspeaker device 50, configured to receive and play the speech information emitted by the target sound source and sent by the speech signal processing device;

a communication device 60, configured to receive the speech information emitted by the target sound source and sent by the speech signal processing device, and to exchange speech information with a remote device;

a storage device 70, configured to receive and store the speech information emitted by the target sound source and sent by the speech signal processing device.

With the above speech signal processing system, the processed speech of the target sound source can be played locally or used for remote communication, and the speech of each marked sound source can also be processed separately and stored as evidence, which provides high flexibility.

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical memory and the like) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Obviously, the above embodiments are merely examples given for clarity of description and do not limit the implementations. Those of ordinary skill in the art may make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Obvious changes or variations derived therefrom still fall within the scope of protection of the invention.

Claims

1. A speech signal processing method, characterized by comprising the following steps:

2. The method according to claim 1, characterized in that obtaining the position information of the target sound source relative to each microphone in the microphone array comprises:

3. The method according to claim 2, characterized in that obtaining the position information of said target sound source relative to the camera comprises:

4. The method according to claim 2 or 3, characterized in that obtaining the position information of the target sound source relative to each microphone in the microphone array further comprises:

5. The method according to any one of claims 1-4, characterized by further comprising the following steps:

6. A speech signal processing device, characterized by comprising:

7. The device according to claim 6, characterized in that said position acquiring unit comprises:

8. The device according to claim 7, characterized in that said first position acquiring subunit comprises:

9. The device according to claim 7 or 8, characterized in that said position acquiring unit further comprises:

10. The device according to any one of claims 6-9, characterized by further comprising:

11. A speech signal processing system, characterized by comprising:

a speech signal processing device, configured to: receive live video information containing sound sources sent by a camera; obtain, according to said live video information, position information of a target sound source relative to said camera; obtain, according to the position information of said target sound source relative to said camera and a preset positional relationship between a microphone array and the camera, position information of the target sound source relative to each microphone in the microphone array; obtain, according to the position information of the target sound source relative to each microphone in the microphone array, the delay time from said target sound source emitting speech information to each microphone receiving said speech information; and perform, according to said delay time, speech signal processing on the speech information from each microphone to obtain the speech information emitted by said target sound source.

12. The system according to claim 11, characterized in that said speech signal processing device is further configured to verify and adjust the obtained position information of said target sound source relative to each microphone in the microphone array by using the spatial geometry of the microphone array and the correlation statistics between the microphones, to obtain verified and adjusted position information.

13. The system according to claim 11 or 12, characterized in that said speech signal processing device is further configured to send the obtained speech information emitted by said target sound source to a local loudspeaker for playback, to a communication device for speech information exchange with a remote device, or to a storage device for storage.

14. The system according to claim 13, characterized by further comprising: