CN110858476A - Sound collection method and device based on microphone array - Google Patents
- Publication number
- CN110858476A (application CN201810974352.5A)
- Authority
- CN
- China
- Prior art keywords
- voice
- channel
- speaker
- speakers
- data stream
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/05—Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Abstract
The invention provides a sound collection method and device based on a microphone array. The method comprises the following steps: acquiring voice in an environment, together with information about the speakers in the environment (their directions and their number), by using a microphone array, to obtain multi-channel voice; converting the multi-channel voice into single-channel voice; performing sentence segmentation on the single-channel voice to obtain a segmented voice data stream containing sounds of a preset type; matching the segmented voice data stream with each speaker to obtain a single separated voice for each speaker; and synthesizing the single separated voice matched to each speaker into its own voice section. The technical scheme provided by the invention is widely applicable, suits both near-field and far-field voice environments, detects voice with high accuracy, and separates the voices of multiple speakers so that each speaker corresponds to one separated voice section.
Description
Technical Field
The invention relates to the field of automatic computer information processing, and in particular to a sound collection method and device based on a microphone array.
Background
Voice is one of the most natural and effective means of exchanging information. While acquiring voice signals, people inevitably suffer environmental noise, room reverberation, and interference from other speakers, all of which seriously degrade voice quality. As a pre-processing stage, voice enhancement and separation is an effective way to suppress such interference.
Voice separation means extracting the desired voice data from a mixture of sounds; it mainly studies how to effectively select and track particular sounds in a complex acoustic environment. Its research goal is to correctly distinguish noise from the target voice of interest, to emphasize the target voice, and to attenuate or eliminate the noise. Signal processing experts, artificial intelligence researchers, and audiologists have studied this problem for decades, but the methods proposed so far remain unsatisfactory.
At present, voice separation mainly relies on methods such as computational auditory scene analysis and non-negative matrix factorization, which are simple to implement. These methods, however, have severe limitations: they suit few scenarios, their performance degrades rapidly in the presence of noise, they ignore the characteristics of voice and therefore damage it, and they do not consider far-field voice environments.
As voice technology develops, it is being applied in ever more complex environments, and voice separation is likewise expected to work well in far-field, noisy acoustic conditions.
Therefore, the invention provides a sound collection method and device based on a microphone array to remedy these defects of the prior art.
Disclosure of Invention
The invention aims to provide a sound collection method and device based on a microphone array that solve the problems of existing voice separation.
According to an aspect of the present invention, there is provided a sound collecting method based on a microphone array, including:
acquiring voice in an environment and information of a speaker in the environment by using a microphone array to obtain multi-channel voice; the speaker information includes: the direction and number of speakers;
converting the multi-channel speech into single-channel speech;
performing sentence segmentation on the single-channel voice to obtain a voice segmented data stream containing preset type sounds;
matching the voice segment data stream with each speaker to obtain single separated voice of each speaker;
and respectively synthesizing the single separated voice matched with each speaker into respective voice sections.
Further, converting the multi-channel speech into single-channel speech, comprising:
receiving the multi-channel voice;
performing voice enhancement on far-field voice by using microphone-array direction-finding and beamforming technology;
and converting the enhanced channel voices corresponding to all the microphones into single-channel voice.
Further, sentence segmentation is performed on the single-channel speech to obtain a speech segment data stream containing a preset type of sound, including:
detecting each frame of voice of the single-channel voice according to a pre-established neural network;
and performing sentence segmentation on the voice frame in the threshold range in the single-channel voice to obtain a voice segmented data stream containing preset type voice.
Further, matching the voice segment data stream with each speaker to obtain a single separated voice of each speaker, comprising: separating the voice segmented data stream based on the number of the speakers to obtain a plurality of single-separated voices respectively corresponding to the number of the speakers;
and matching the single separated voice corresponding to each speaker based on the voice production directions of all the speakers.
According to another aspect of the present invention, a microphone array based sound collection device is disclosed, comprising:
the microphone array acquisition module is used for acquiring voices in an environment and information of speakers in the environment by using a microphone array to obtain multi-channel voices; the speaker information includes: the direction and number of speakers;
the voice conversion module is used for converting the multi-channel voice into single-channel voice;
the voice detection module is used for segmenting the single-channel voice to obtain a voice segmented data stream containing preset type sounds;
the voice separation module is used for matching the voice segmented data stream with each speaker to obtain single separation voice of each speaker;
and the voice synthesis module is used for respectively synthesizing the single separated voice matched with each speaker into respective voice sections.
Further, the voice conversion module includes:
the voice receiving submodule is used for receiving the multi-channel voice;
the voice enhancement sub-module is used for performing voice enhancement on far-field voice by using microphone-array direction-finding and beamforming technology;
and the voice conversion submodule is used for converting the channel voice corresponding to all the microphones after enhancement into single-channel voice.
Further, the voice detection module comprises:
the voice detection submodule is used for detecting each frame of voice of the single-channel voice according to a pre-established neural network;
and the voice segmentation submodule is used for segmenting the voice frames in the threshold range in the single-channel voice to obtain a voice segmentation data stream containing preset type voice.
Further, the voice separation module includes:
a voice separation submodule, configured to separate the voice segment data stream based on the number of the speakers to obtain a plurality of single-separated voices corresponding to the number of the speakers;
and the voice matching sub-module is used for matching the single separated voice corresponding to each speaker based on the voice production directions of all the speakers.
The technical scheme provided by the invention acquires the voice in an environment, together with information about the speakers in it, by using a microphone array to obtain multi-channel voice. Microphone-array voice enhancement automatically identifies, locks onto, and enhances each speaker's voice signal by analyzing the far-field signals, automatically suppresses surrounding random noise and background noise, and improves the accuracy of the voice signal delivered at the receiving end. The multi-channel voice is then converted into single-channel voice, and sounds of the preset type are cut out of it to form a segmented voice data stream: the starting point of each continuous voice signal is identified automatically, silent sections are removed, and the input is split into separate utterances while the continuity of the conversation is preserved. Finally, the matched voices are synthesized into voice sections, guaranteeing that each section after separation contains only one speaker.
Because the voice separation relies only on microphone-array equipment, the device is easy to carry and store, convenient, and practical, and it avoids the high cost and low efficiency of traditional voice separation based on a cloud server. The invention can solve voice separation in scenarios such as a user's daily life, study, and meetings, and is of great significance for the development of voice separation.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
fig. 2 is a flow chart of far-field speech separation provided in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
As shown in fig. 1, the present invention provides a sound collection method based on a microphone array, which comprises the following steps:
step 1, collecting voice in an environment and information of a speaker in the environment by using a microphone array to obtain multi-channel voice; the speaker information includes: the direction and number of speakers;
step 2, converting the multi-channel voice into single-channel voice;
step 3, sentence segmentation is carried out on the single-channel voice to obtain a voice segmentation data stream containing preset type sounds;
step 4, matching the voice segment data stream with each speaker to obtain single separated voice of each speaker;
and 5, respectively synthesizing the single separated voice matched with each speaker into respective voice sections.
In the embodiment of the application, a microphone array is used to acquire the voice in an environment together with information about the speakers in it, obtaining multi-channel voice. Microphone-array voice enhancement automatically identifies, locks onto, and enhances the speakers' voice signals by analyzing the far-field signals, automatically suppresses surrounding random noise and background noise, and improves the accuracy of the voice signal output at the receiving end. The multi-channel voice is then converted into single-channel voice, and sounds of the preset type are cut out of it to form a segmented voice data stream: the starting point of each continuous voice signal is identified automatically, silent sections are removed, and the input is split into separate utterances while the continuity of the conversation is preserved. Finally, the matched voices are synthesized into voice sections, guaranteeing that each section after separation contains only one speaker.
In some embodiments of the present application, a microphone array is used to collect voices in an environment and speaker information in the environment, so as to obtain multi-channel voices, which specifically includes:
the microphone array receives voice signal input continuously, voice directions are searched for in a 360-degree plane in real time by utilizing a microphone array multi-speaker direction-finding technology, voice direction finding under a scene that a plurality of speakers sound at the same time can be achieved by the technology, and the direction of each speaker and the number of speakers are output.
In some embodiments of the present application, converting the multi-channel speech to single-channel speech includes:
receiving the multi-channel voice;
performing voice enhancement on far-field voice by using microphone-array direction-finding and beamforming technology;
and converting the channel voice corresponding to all the enhanced microphones into single-channel voice.
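The patent does not name a particular beamforming method for fusing the enhanced microphone channels into one stream. A minimal illustration is a frequency-domain delay-and-sum beamformer; the sketch below assumes the per-channel steering delays are already known from direction finding, and all names are our own.

```python
import numpy as np

def delay_and_sum(channels, fs, delays):
    """Steer toward one direction by advancing each channel by its known
    arrival delay so the target's wavefronts align, then averaging.
    `channels` is (n_mics, n_samples); `delays` holds seconds per channel."""
    n = channels.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for ch, tau in zip(channels, delays):
        # a fractional delay is a linear phase term in the frequency domain
        spectrum = np.fft.rfft(ch) * np.exp(2j * np.pi * freqs * tau)
        out += np.fft.irfft(spectrum, n=n)
    return out / len(channels)
```

Averaging aligned channels reinforces the target direction while uncorrelated noise from other directions partially cancels, which is the enhancement effect the description relies on.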
In some embodiments of the present application, performing sentence segmentation on the single-channel speech to obtain a speech segment data stream containing a preset type of sound, includes:
detecting each frame of voice of the single-channel voice according to a pre-established neural network;
and performing sentence segmentation on the voice frame in the threshold range in the single-channel voice to obtain a voice segmented data stream containing preset type voice.
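The patent's per-frame detector is a trained neural network. As a stand-in, the sketch below uses frame energy for the voice/non-voice decision but keeps the described thresholding logic: a segment opens once enough consecutive frames look voiced and closes after a run of unvoiced frames. The frame length and both thresholds are illustrative assumptions.

```python
import numpy as np

def frame_signal(x, frame_len):
    """Chop a mono signal into non-overlapping frames, dropping the tail."""
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

def segment_speech(x, frame_len=160, energy_thresh=0.01, min_voiced=3, hang=5):
    """Return (start, end) sample indices of voiced segments.  The energy
    test stands in for the patent's per-frame neural-network classifier."""
    frames = frame_signal(x, frame_len)
    voiced = (frames ** 2).mean(axis=1) > energy_thresh
    segments, start, run_v, run_u = [], None, 0, 0
    for i, v in enumerate(voiced):
        if v:
            run_v, run_u = run_v + 1, 0
            if start is None and run_v >= min_voiced:
                start = i - min_voiced + 1      # segment start point found
        else:
            run_u, run_v = run_u + 1, 0
            if start is not None and run_u >= hang:
                segments.append((start * frame_len, (i - hang + 1) * frame_len))
                start = None
    if start is not None:                        # stream ended mid-segment
        segments.append((start * frame_len, len(frames) * frame_len))
    return segments
```

Only samples inside the returned segments would be kept, matching the description's discarding of silent sections.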
In some embodiments of the present application, matching the speech segment data stream with each speaker to obtain a single isolated speech for each speaker comprises: separating the voice segmented data stream based on the number of the speakers to obtain a plurality of single-separated voices respectively corresponding to the number of the speakers;
and matching the single separated voice corresponding to each speaker based on the voice production directions of all the speakers.
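Matching each separated voice to a speaker by sound-source direction can be framed as choosing the assignment that minimizes total angular error. The brute-force permutation search below is our own illustration (adequate for the handful of speakers a meeting involves), not the patent's disclosed method.

```python
from itertools import permutations

def match_streams_to_speakers(stream_doas, speaker_doas):
    """Assign each separated stream (by its estimated direction of arrival,
    in degrees) to a known speaker direction, minimizing the summed
    angular error over all one-to-one assignments."""
    def ang_diff(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)                # wrap-around angular distance
    n = len(speaker_doas)
    best, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(ang_diff(stream_doas[i], speaker_doas[p])
                   for i, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best)       # best[i] = speaker index matched to stream i
```

For larger speaker counts a Hungarian-algorithm solver would replace the factorial search, but the objective stays the same.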
In some embodiments of the present application, the predetermined type of sound is a human voice and the speaker is a human.
A microphone array acquires the voices in an environment together with information about the people in it (the directions of the people speaking and their number), obtaining multi-channel voice. Far-field voice is enhanced using microphone-array direction-finding and beamforming technology, and the enhanced channel voices corresponding to all the microphones are converted into single-channel voice. Voice/non-voice detection uses a trained neural network to classify each frame; if the number of voice frames in a short stretch of voice exceeds a preset threshold, the first voice frame is judged to be a voice starting point. Only the voice after the starting point is saved, and non-human sounds are discarded, yielding a segmented voice data stream that contains only human voice. The segmented stream is separated according to the number of people, and each single separated voice obtained in real time is assigned to its speaker, so that each stretch of voice contains only one speaker and no cross-aliasing errors occur. Finally, each person's single separated voice is synthesized into voice sections, forming as many voice sections as there are people.
In other embodiments of the present application, the speaker is a musical instrument, such as a violin, an accordion, a flute, or an erhu. The method separates the sound of each instrument from the environmental sound, distinguishes and matches the sound of each instrument by its timbre, and synthesizes each instrument's sound to form voice sections corresponding to the instruments.
In other embodiments of the present application, the speaker is an animal. The method separates each animal's sound from the environmental sound, distinguishes the sounds and assigns them to the individual animals, and finally synthesizes each animal's sound to form one voice section per animal.
Fig. 2 shows an embodiment of the present invention, applied to a far-field acoustic environment:
in a far-field environment, a plurality of users communicate at different positions, the environment contains background noises with different degrees, and the users can realize real-time and continuous separation of voice under zero operation.
In fig. 2, a microphone array receives the voices in the environment. At a given moment, the microphone-array direction-finding module finds every direction that contains a voice at that moment and records them as the speaker directions and the number of speakers; the speaker directions steer the beams, and the number of speakers is sent to the voice separation module to set the number of output voices. The microphone-array beam module forms a beam toward each obtained voice direction, obtains the enhanced voice in each direction, and fuses them into single-channel voice. Voice/non-voice detection decomposes the continuous voice signal into a segmented voice data stream and further filters out non-voice and noise, improving system efficiency. The voice separation module separates the mixed voice into as many voices as there are speakers. The tracking module uses similarity calculation to assign each real-time separated voice segment to its speaker, ensuring that after separation no speaker's voice contains the voice of any other speaker. Finally, voice synthesis joins the segmented voice data stream into rhythmic, continuous voice, which can be output to the user through the device's loudspeaker or uploaded to a server. While processing the voice signal, the device displays the processing progress in real time.
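The Figure-2 data flow (direction finding steering the beams, the speaker count steering separation, and direction steering the matching) can be summarized as a pipeline whose stages are injected callables. Everything in this sketch, names included, is an illustrative assumption rather than the patent's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class FarFieldPipeline:
    """Wires the stages of the far-field flow together.  Each stage is an
    injected callable, so trivial stand-ins can exercise the plumbing."""
    find_directions: Callable   # multichannel audio -> list of speaker DOAs
    beamform: Callable          # (audio, doas) -> single-channel stream
    segment: Callable           # single-channel stream -> voiced segments
    separate: Callable          # (segment, n_speakers) -> separated streams
    match: Callable             # (streams, doas) -> speaker index per stream

    def run(self, audio) -> List[list]:
        doas = self.find_directions(audio)          # directions + count
        mono = self.beamform(audio, doas)           # fuse enhanced beams
        per_speaker: List[list] = [[] for _ in doas]
        for seg in self.segment(mono):              # voiced segments only
            streams = self.separate(seg, len(doas))
            for spk, stream in zip(self.match(streams, doas), streams):
                per_speaker[spk].append(stream)     # one speaker per bucket
        return per_speaker
```

With trivial stand-in stages the pipeline routes each separated stream into its matched speaker's bucket, mirroring the guarantee that each output section contains only one speaker.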
Working in this way, the voices of multiple users can be separated at long distances.
Based on the same inventive concept, the invention also provides a sound collection device based on a microphone array, which comprises:
the microphone array acquisition module is used for acquiring voices in an environment and information of speakers in the environment by using a microphone array to obtain multi-channel voices; the speaker information includes: the direction and number of speakers;
the voice conversion module is used for converting the multi-channel voice into single-channel voice;
the voice detection module is used for segmenting the single-channel voice to obtain a voice segmented data stream containing preset type sounds;
the voice separation module is used for matching the voice segmented data stream with each speaker to obtain single separation voice of each speaker;
and the voice synthesis module is used for respectively synthesizing the single separated voice matched with each speaker into respective voice sections.
Preferably, the voice conversion module includes:
the voice receiving submodule is used for receiving the multi-channel voice;
the voice enhancement sub-module is used for performing voice enhancement on far-field voice by using microphone-array direction-finding and beamforming technology;
and the voice conversion submodule is used for converting the channel voice corresponding to all the microphones after enhancement into single-channel voice.
Preferably, the voice detection module includes:
the voice detection submodule is used for detecting each frame of voice of the single-channel voice according to a pre-established neural network;
and the voice segmentation submodule is used for segmenting the voice frames in the threshold range in the single-channel voice to obtain a voice segmentation data stream containing preset type voice.
Preferably, the voice separation module includes:
a voice separation submodule, configured to separate the voice segment data stream based on the number of the speakers to obtain a plurality of single-separated voices corresponding to the number of the speakers;
and the voice matching sub-module is used for matching the single separated voice corresponding to each speaker based on the voice production directions of all the speakers.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (8)
1. A sound collection method based on a microphone array is characterized by comprising the following steps:
acquiring voice in an environment and information of a speaker in the environment by using a microphone array to obtain multi-channel voice; the speaker information includes: the direction and number of speakers;
converting the multi-channel speech into single-channel speech;
performing sentence segmentation on the single-channel voice to obtain a voice segmented data stream containing preset type sounds;
matching the voice segment data stream with each speaker to obtain single separated voice of each speaker;
and respectively synthesizing the single separated voice matched with each speaker into respective voice sections.
2. The method of claim 1, wherein converting the multi-channel speech to single-channel speech comprises:
receiving the multi-channel voice;
performing voice enhancement on far-field voice by using microphone-array direction-finding and beamforming technology;
and converting the channel voice corresponding to all the enhanced microphones into single-channel voice.
3. The method of claim 2, wherein performing sentence-segmentation on the single-channel speech to obtain a speech segmented data stream containing a preset type of sound comprises:
detecting each frame of voice of the single-channel voice according to a pre-established neural network;
and performing sentence segmentation on the voice frame in the threshold range in the single-channel voice to obtain a voice segmented data stream containing preset type voice.
4. The method of claim 3, wherein matching the stream of speech segments to each speaker results in a single separate speech for each speaker, comprising: separating the voice segmented data stream based on the number of the speakers to obtain a plurality of single-separated voices respectively corresponding to the number of the speakers;
and matching the single separated voice corresponding to each speaker based on the voice production directions of all the speakers.
5. A sound collection device based on a microphone array, comprising:
the microphone array acquisition module is used for acquiring voices in an environment and information of speakers in the environment by using a microphone array to obtain multi-channel voices; the speaker information includes: the direction and number of speakers;
the voice conversion module is used for converting the multi-channel voice into single-channel voice;
the voice detection module is used for segmenting the single-channel voice to obtain a voice segmented data stream containing preset type sounds;
the voice separation module is used for matching the voice segmented data stream with each speaker to obtain single separation voice of each speaker;
and the voice synthesis module is used for respectively synthesizing the single separated voice matched with each speaker into respective voice sections.
6. The apparatus of claim 5, wherein the voice conversion module comprises:
the voice receiving submodule is used for receiving the multi-channel voice;
the voice enhancement sub-module is used for performing voice enhancement on far-field voice by using microphone-array direction-finding and beamforming technology;
and the voice conversion submodule is used for converting the channel voice corresponding to all the microphones after enhancement into single-channel voice.
7. The apparatus of claim 5, wherein the voice detection module comprises:
the voice detection submodule is used for detecting each frame of voice of the single-channel voice according to a pre-established neural network;
and the voice segmentation submodule is used for segmenting the voice frames within the threshold range in the single-channel voice to obtain a voice segmented data stream containing the preset type of voice.
8. The apparatus of claim 5, wherein the voice separation module comprises:
the voice separation submodule is used for separating the voice segmented data stream based on the number of the speakers to obtain a plurality of single separated voices respectively corresponding to the number of the speakers;
and the voice matching submodule is used for matching the single separated voice corresponding to each speaker based on the sound production directions of all the speakers.
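Claim 6 describes far-field enhancement by direction finding plus beamforming before the channels are merged. The patent does not specify the beamformer, so the following is an illustrative sketch only: a minimal far-field delay-and-sum beamformer for a linear array, with all function and parameter names hypothetical.

```python
import numpy as np

def delay_and_sum(channels, fs, mic_positions, doa_deg, c=343.0):
    """Steer a linear array toward doa_deg and sum the channels.

    channels: (num_mics, num_samples) array of time-domain signals
    fs: sampling rate in Hz
    mic_positions: mic x-coordinates (metres) along the array axis
    doa_deg: target direction measured from broadside, in degrees
    c: speed of sound in m/s
    """
    num_mics, n = channels.shape
    # Far-field delay of each mic relative to the array origin
    delays = np.asarray(mic_positions) * np.sin(np.deg2rad(doa_deg)) / c
    sample_shifts = np.round(delays * fs).astype(int)
    out = np.zeros(n)
    for ch, s in zip(channels, sample_shifts):
        out += np.roll(ch, -s)  # shift each channel to align the wavefronts
    return out / num_mics     # average into a single enhanced channel
```

Signals arriving from `doa_deg` add coherently after alignment while noise from other directions averages down, which is the enhancement effect the claim relies on.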
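Claim 7 classifies each frame of the single-channel voice with a pre-established neural network and then segments the frames within the threshold range. A minimal sketch of that segmentation logic, substituting a simple frame-energy detector for the neural network (all names and thresholds are hypothetical, not the patent's):

```python
import numpy as np

def segment_speech(signal, fs, frame_ms=20, threshold=0.01, min_frames=5):
    """Split a single-channel signal into speech segments.

    A stand-in energy detector replaces the patent's neural-network
    frame classifier; contiguous runs of at least min_frames voiced
    frames become one (start, end) segment in samples.
    """
    flen = int(fs * frame_ms / 1000)
    nframes = len(signal) // flen
    voiced = [np.mean(signal[i * flen:(i + 1) * flen] ** 2) > threshold
              for i in range(nframes)]
    segments, run_start = [], None
    for i, v in enumerate(voiced + [False]):  # sentinel closes the last run
        if v and run_start is None:
            run_start = i
        elif not v and run_start is not None:
            if i - run_start >= min_frames:
                segments.append((run_start * flen, i * flen))
            run_start = None
    return segments
```

In the patented method the per-frame decision would come from the trained network rather than an energy comparison; only the run-grouping step is sketched faithfully here.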
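Claims 4 and 8 match each separated voice to a speaker using the sound production directions obtained from the microphone array. A minimal sketch of such direction-based matching, assuming direction-of-arrival estimates in degrees for both the separated streams and the speakers (the function and its greedy strategy are illustrative assumptions, not the patent's specified algorithm):

```python
import numpy as np

def match_streams_to_speakers(stream_doas, speaker_doas):
    """Greedily assign each separated stream to the nearest free speaker.

    stream_doas: estimated direction (degrees) of each separated stream
    speaker_doas: direction (degrees) of each speaker from the array
    Returns a dict mapping stream index -> speaker index.
    """
    def ang_diff(a, b):
        # circular angular distance in degrees
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    assignment = {}
    free = set(range(len(speaker_doas)))
    for i, sd in enumerate(stream_doas):
        best = min(free, key=lambda j: ang_diff(sd, speaker_doas[j]))
        assignment[i] = best
        free.remove(best)  # one stream per speaker
    return assignment
```

A production system might instead solve the assignment globally (e.g. minimum-cost matching), but nearest-angle assignment shows the idea of pairing separated voices with speaker directions.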
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810974352.5A CN110858476B (en) | 2018-08-24 | 2018-08-24 | Sound collection method and device based on microphone array |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110858476A | 2020-03-03 |
CN110858476B | 2022-09-27 |
Family
ID=69635531
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810974352.5A Active CN110858476B (en) | 2018-08-24 | 2018-08-24 | Sound collection method and device based on microphone array |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110858476B (en) |
Worldwide applications: 2018-08-24 — CN CN201810974352.5A (patent CN110858476B, active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102388416A (en) * | 2010-02-25 | 2012-03-21 | 松下电器产业株式会社 | Signal processing apparatus and signal processing method |
US20180082690A1 (en) * | 2012-11-09 | 2018-03-22 | Mattersight Corporation | Methods and system for reducing false positive voice print matching |
CN107919133A (en) * | 2016-10-09 | 2018-04-17 | 赛谛听股份有限公司 | For the speech-enhancement system and sound enhancement method of destination object |
CN106782563A (en) * | 2016-12-28 | 2017-05-31 | 上海百芝龙网络科技有限公司 | A kind of intelligent home voice interactive system |
CN108074576A (en) * | 2017-12-14 | 2018-05-25 | 讯飞智元信息科技有限公司 | Inquest the speaker role's separation method and system under scene |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113963694A (en) * | 2020-07-20 | 2022-01-21 | 中移(苏州)软件技术有限公司 | Voice recognition method, voice recognition device, electronic equipment and storage medium |
CN111883135A (en) * | 2020-07-28 | 2020-11-03 | 北京声智科技有限公司 | Voice transcription method and device and electronic equipment |
CN112804401A (en) * | 2020-12-31 | 2021-05-14 | 中国人民解放军战略支援部队信息工程大学 | Conference role determination and voice acquisition control method and device |
CN113464858A (en) * | 2021-08-03 | 2021-10-01 | 浙江欧菲克斯交通科技有限公司 | Mobile emergency lighting control method and device |
CN113464858B (en) * | 2021-08-03 | 2023-02-28 | 浙江欧菲克斯交通科技有限公司 | Mobile emergency lighting control method and device |
CN113825082A (en) * | 2021-09-19 | 2021-12-21 | 武汉左点科技有限公司 | Method and device for relieving hearing aid delay |
CN113825082B (en) * | 2021-09-19 | 2024-06-11 | 武汉左点科技有限公司 | Method and device for relieving hearing aid delay |
WO2024099359A1 (en) * | 2022-11-09 | 2024-05-16 | 北京有竹居网络技术有限公司 | Voice detection method and apparatus, electronic device and storage medium |
CN115762525A (en) * | 2022-11-18 | 2023-03-07 | 北京中科艺杺科技有限公司 | Voice filtering and recording method and system based on omnibearing voice acquisition |
CN115762525B (en) * | 2022-11-18 | 2024-05-07 | 北京中科艺杺科技有限公司 | Voice filtering and recording method and system based on omnibearing voice acquisition |
Also Published As
Publication number | Publication date |
---|---|
CN110858476B (en) | 2022-09-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110858476B (en) | Sound collection method and device based on microphone array | |
Chen et al. | Continuous speech separation: Dataset and analysis | |
US11132997B1 (en) | Robust audio identification with interference cancellation | |
Cai et al. | Sensor network for the monitoring of ecosystem: Bird species recognition | |
CN111508498B (en) | Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium | |
CN109147796B (en) | Speech recognition method, device, computer equipment and computer readable storage medium | |
CN111816218A (en) | Voice endpoint detection method, device, equipment and storage medium | |
CN111429939B (en) | Sound signal separation method of double sound sources and pickup | |
CN110111808B (en) | Audio signal processing method and related product | |
Wang et al. | Deep learning assisted time-frequency processing for speech enhancement on drones | |
CN108520756B (en) | Method and device for separating speaker voice | |
CN113593601A (en) | Audio-visual multi-modal voice separation method based on deep learning | |
US20240249714A1 (en) | Multi-encoder end-to-end automatic speech recognition (asr) for joint modeling of multiple input devices | |
Wang et al. | Attention-based fusion for bone-conducted and air-conducted speech enhancement in the complex domain | |
CN111429916B (en) | Sound signal recording system | |
CN114333874A (en) | Method for processing audio signal | |
CN113823303A (en) | Audio noise reduction method and device and computer readable storage medium | |
CN111009259B (en) | Audio processing method and device | |
CN115171716B (en) | Continuous voice separation method and system based on spatial feature clustering and electronic equipment | |
CN117198324A (en) | Bird sound identification method, device and system based on clustering model | |
Kamble et al. | Teager energy subband filtered features for near and far-field automatic speech recognition | |
WO2022068675A1 (en) | Speaker speech extraction method and apparatus, storage medium, and electronic device | |
Weber et al. | Constructing a dataset of speech recordings with lombard effect | |
Yeow et al. | Real-Time Sound Event Localization and Detection: Deployment Challenges on Edge Devices | |
Venkatesan et al. | Analysis of monaural and binaural statistical properties for the estimation of distance of a target speaker |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||