CN109300475A - Microphone array sound pick-up method and device

Info

Publication number: CN109300475A
Application number: CN201710608727.1A
Authority: CN
Inventors: 施隆海
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2017-07-25
Filing date: 2017-07-25
Publication date: 2019-02-01

Abstract

The present invention discloses a microphone array sound pick-up method and device. The method includes: performing face recognition on panoramic video to capture the face of an authorized person and obtain the azimuth information of the authorized person's face; performing beamforming according to the azimuth information of the authorized person's face; and picking up sound using the beamformed microphone array. By using video recognition, the invention reduces the difficulty of beamforming, addresses the problem of speech recognition under the cocktail party effect, and achieves highly directional sound pickup.

Description

Microphone array sound pick-up method and device

Technical field

The present invention relates to the field of speech recognition, and in particular to a microphone array sound pick-up method and device.

Background art

Smart home voice access is currently a hot topic in the smart home field.

The human auditory system can distinguish and track the voice signal it is interested in within a noisy environment with multiple talkers, and pick out the sound it needs. This ability is a perceptual capability specific to the human speech understanding mechanism, namely the human ability to separate speech, and is known as the "cocktail party effect".

Current speech recognition systems can achieve a very high recognition rate on clean speech, but when the speech is contaminated by noise, system performance declines sharply.

Summary of the invention

In view of the above technical problem, the present invention provides a microphone array sound pick-up method and device that reduce the difficulty of beamforming by means of video recognition and achieve highly directional sound pickup.

According to one aspect of the present invention, a microphone array sound pick-up method is provided, comprising:

performing face recognition using panoramic video, capturing the face of an authorized person, and obtaining azimuth information of the authorized person's face;

performing beamforming according to the azimuth information of the authorized person's face; and

picking up sound using the beamformed microphone array.

In one embodiment of the invention, picking up sound using the beamformed microphone array comprises:

separating signals according to the beamforming, and picking up only the voice signal from the direction of the authorized person's face.

In one embodiment of the invention, the method further comprises:

performing joint authentication using face recognition and voiceprint recognition.

In one embodiment of the invention, performing joint authentication using face recognition and voiceprint recognition comprises:

confirming the authorized person using face recognition; and

recognizing, by voiceprint recognition, a keyword uttered by the authorized person, so as to further confirm the authorized person.

In one embodiment of the invention, after the joint authentication succeeds, the method further comprises:

extracting the control instruction issued by the authorized person; and

parsing the control instruction, and completing the corresponding control action according to the parsed control instruction.

According to another aspect of the present invention, a microphone array sound pick-up device is provided, comprising:

a face recognition module, configured to perform face recognition using panoramic video, capture the face of an authorized person, and obtain azimuth information of the authorized person's face;

a beamforming module, configured to perform beamforming according to the azimuth information of the authorized person's face; and

a pickup module, configured to pick up sound using the beamformed microphone array.

In one embodiment of the invention, the pickup module is configured to separate signals according to the beamforming and pick up only the voice signal from the direction of the authorized person's face.

In one embodiment of the invention, the microphone array sound pick-up device performs joint authentication using face recognition and voiceprint recognition.

In one embodiment of the invention, the device further comprises a voiceprint recognition module, wherein:

the face recognition module is configured to confirm the authorized person using face recognition; and

the voiceprint recognition module is configured to recognize, by voiceprint recognition, a keyword uttered by the authorized person, so as to further confirm the authorized person.

In one embodiment of the invention, the device further comprises:

a voice control module, configured to, after the joint authentication succeeds, extract the control instruction issued by the authorized person, parse the control instruction, and complete the corresponding control action according to the parsed control instruction.

By using video recognition, the present invention reduces the difficulty of beamforming, addresses the problem of speech recognition under the cocktail party effect, and achieves highly directional sound pickup.

Brief description of the drawings

In order to explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a first embodiment of the microphone array sound pick-up method of the present invention.

Fig. 2 is a schematic diagram of face recognition in one embodiment of the invention.

Fig. 3 is a schematic diagram comparing sound pickup by a single microphone and by a microphone array in one embodiment of the invention.

Fig. 4 is a schematic diagram of a second embodiment of the microphone array sound pick-up method of the present invention.

Fig. 5 is a schematic diagram of a first embodiment of the microphone array sound pick-up device of the present invention.

Fig. 6 is a schematic diagram of a second embodiment of the microphone array sound pick-up device of the present invention.

Fig. 7 is a schematic diagram of the operation of the microphone array sound pick-up device in one embodiment of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of those embodiments. The described embodiments are obviously only some, not all, of the embodiments of the present invention. The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the invention, its application, or its uses. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Unless specifically stated otherwise, the relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the invention.

It should also be understood that, for ease of description, the sizes of the parts shown in the drawings are not drawn to actual scale.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate they should be regarded as part of the granted specification.

In all examples shown and discussed here, any specific value should be interpreted as merely illustrative and not as a limitation. Other examples of the exemplary embodiments may therefore have different values.

It should also be noted that similar reference numerals and letters denote similar items in the following drawings; once an item has been defined in one drawing, it need not be discussed further in subsequent drawings.

Fig. 1 is a schematic diagram of a first embodiment of the microphone array sound pick-up method of the present invention. Preferably, this embodiment may be performed by the microphone array sound pick-up device of the present invention. The method includes the following steps:

Step 11, as shown in Fig. 2, perform face recognition using panoramic video and capture the face of the authorized person; confirm the authorized person by face recognition; and obtain the azimuth information of the authorized person's face.

In one embodiment of the invention, the face recognition on panoramic video in step 11 may specifically include: performing face recognition on the panoramic video of a VR camera (virtual reality camera).

Step 12, perform beamforming according to the azimuth information of the authorized person's face.

Step 13, pick up sound using the beamformed microphone array.
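The text does not say how the azimuth information is derived from the panoramic video. As a minimal sketch, assuming an equirectangular panorama whose width spans a full 360 degrees, the horizontal position of the detected face can be mapped linearly to an azimuth. The function name `face_azimuth_deg` and the convention that the centre column corresponds to 0 degrees are assumptions made for illustration only.

```python
def face_azimuth_deg(face_center_x: float, frame_width: int) -> float:
    """Map the horizontal pixel position of a detected face to an azimuth in degrees.

    Assumption (not from the source): the frame is an equirectangular panorama
    covering 360 degrees, with column 0 at -180 degrees and the centre column
    at 0 degrees.
    """
    if frame_width <= 0:
        raise ValueError("frame_width must be positive")
    if not 0 <= face_center_x < frame_width:
        raise ValueError("face_center_x must lie inside the frame")
    return (face_center_x / frame_width) * 360.0 - 180.0


# Hypothetical usage: a face bounding box spanning columns 1400..1480
# in a 1920-pixel-wide panorama.
left, right = 1400, 1480
azimuth = face_azimuth_deg((left + right) / 2, 1920)
```

A real system would also need to calibrate the camera orientation against the microphone array so that both share the same zero direction.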

A microphone array adds a spatial domain to the time and frequency domains and performs joint space-time-frequency processing on signals arriving from different directions in space. Microphone arrays inherit the relevant algorithms of antenna arrays while also absorbing some methods of single-microphone speech processing. A microphone array has spatial selectivity: it captures a high-quality signal from a specific direction while reducing noise and other interference. In addition, a microphone array does not need to restrict the speaker's movement; it can automatically detect, locate, and track a speaker within its reception area.

In one embodiment of the invention, the microphone array may adopt a planar array topology or a spatial array topology.

Fig. 3 is a schematic diagram comparing sound pickup by a single microphone and by a microphone array in one embodiment of the invention. As shown in Fig. 3, a single microphone picks up the desired speech together with the ambient noise, whereas a microphone array using beamforming picks up the desired speech (for example, the voice signal from the direction of the authorized person's face) while suppressing the ambient noise.

In one embodiment of the invention, step 13 may specifically include: separating signals according to the beamforming, and picking up only the voice signal from the direction of the authorized person's face.
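The source does not name a particular beamforming algorithm. As one simple illustration of steering an array toward a known azimuth, the sketch below uses delay-and-sum beamforming with integer-sample delays under a far-field (plane wave) assumption. The names `steering_delays` and `delay_and_sum`, the speed-of-sound constant, and the 2-D microphone coordinates are assumptions made for this example, not details from the patent.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def steering_delays(mic_positions, azimuth_deg, sample_rate):
    """Integer sample delays that align a far-field plane wave from azimuth_deg.

    mic_positions is a list of (x, y) coordinates in metres.
    """
    theta = math.radians(azimuth_deg)
    ux, uy = math.cos(theta), math.sin(theta)
    # A microphone lying further toward the source hears the wavefront
    # earlier by (p . u) / c seconds, so it must be delayed by that amount.
    advances = [(x * ux + y * uy) / SPEED_OF_SOUND for x, y in mic_positions]
    base = min(advances)
    return [round((a - base) * sample_rate) for a in advances]


def delay_and_sum(channels, delays):
    """Delay each channel by its sample delay and average the channels."""
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        for i in range(d, n):
            out[i] += samples[i - d]
    return [v / len(channels) for v in out]
```

Signals arriving from the steered direction add coherently, while signals from other directions are partially cancelled. Practical systems usually use fractional delays or frequency-domain weights instead of the integer rounding used here.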

The microphone array sound pick-up method with panoramic-video-assisted beamforming provided by the above embodiment of the present invention reduces the difficulty of beamforming by means of video recognition, addresses the problem of speech recognition under the cocktail party effect, and achieves highly directional sound pickup.

The above embodiment of the present invention incorporates a video channel, which simplifies the problem of steering indoor sound pickup. Combined with a spatial microphone (MIC) array and beamforming technology, it can achieve relatively good pickup directivity and effectively reduce noise and the speech of other people.

The above embodiment of the present invention reduces the difficulty of blind source separation. Blind speech separation refers to the process of recovering each source signal from the observed signals alone, based on the statistical properties of the source speech signals, when both the source speech signals and the transmission channel parameters are unknown. "Blind" has two meanings: first, the source speech signals cannot be observed; second, how the source signals are mixed is unknown.

Because the above embodiment of the present invention reduces the difficulty of blind source separation, it improves the reliability of speech signal separation in the system and can therefore improve the success rate of speech recognition.

Fig. 4 is a schematic diagram of a second embodiment of the microphone array sound pick-up method of the present invention. Preferably, this embodiment may be performed by the microphone array sound pick-up device of the present invention. The method includes the following steps:

Step 41, as shown in Fig. 2, perform face recognition using panoramic video and capture the face of the authorized person; confirm the authorized person by face recognition; and obtain the azimuth information of the authorized person's face.

Step 42, perform beamforming according to the azimuth information of the authorized person's face.

Step 43, pick up sound using the beamformed microphone array.

Step 44, recognize, by voiceprint recognition, a keyword uttered by the authorized person, so as to further confirm the authorized person. The above embodiment of the present invention can thus perform joint authentication using face recognition and voiceprint recognition.

In one embodiment of the invention, voiceprint recognition may include:

Step 441, collect speech.

Step 442, perform noise suppression and speech detection.

Step 443, perform feature extraction.

The task of feature extraction is to extract and select acoustic or linguistic features of the speaker's voiceprint that are highly separable and highly stable. Unlike in speech recognition, the features used for voiceprint recognition must be "individual" features, whereas the features used for speech recognition must be features "common" to all speakers.

Although most current voiceprint recognition systems use acoustic-level features, the features that characterize a person should be multi-level.

In one embodiment of the invention, the above features may include: (1) acoustic features related to the anatomical structure of the human pronunciation mechanism (such as spectrum, cepstrum, formants, pitch, and reflection coefficients), as well as nasality, breathiness, hoarseness, laughter, and so on; (2) semantics, rhetoric, pronunciation, and speech habits influenced by socioeconomic status, education level, birthplace, and so on; (3) personal characteristics or features influenced by parents, such as prosody, rhythm, speed, intonation, and volume.
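The cepstrum named among the acoustic features can be illustrated with a short computation. The sketch below computes the real cepstrum of a single frame as the inverse DFT of the log magnitude spectrum, using a direct O(n^2) transform so that it depends only on the standard library. The function name `real_cepstrum` and the magnitude floor of 1e-12 are choices made for this example; the source does not specify which cepstral variant (for instance, mel-frequency cepstral coefficients) is intended.

```python
import cmath
import math


def real_cepstrum(frame):
    """Real cepstrum of one frame: inverse DFT of the log magnitude spectrum."""
    n = len(frame)
    # Forward DFT in direct O(n^2) form, adequate for a short illustrative frame.
    spectrum = [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Floor the magnitude so that log() stays defined for empty bins.
    log_mag = [math.log(max(abs(c), 1e-12)) for c in spectrum]
    # Inverse DFT. For a real input frame the log magnitude is even-symmetric,
    # so the imaginary part of the result is only rounding noise.
    return [
        (sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)) / n).real
        for q in range(n)
    ]
```

In practice an FFT routine would replace the direct transform, and the low-order cepstral coefficients would typically be kept as the feature vector for each frame.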

Step 444, build the voice model.

In one embodiment of the invention, the features that an automatic voiceprint recognition model can use include: (1) acoustic features (cepstrum); (2) lexical features (speaker-dependent word n-grams and phoneme n-grams); (3) prosodic features (pitch and energy "contours" described with n-grams); (4) language, dialect, and accent information; (5) channel information (which channel is used); and so on.

Step 445, perform recognition matching.

In one embodiment of the invention, the recognition matching in step 445 may use methods from the following main categories:

(1) Template matching: dynamic time warping (DTW) is used to align the training and test feature sequences. This method is mainly used in fixed-phrase applications (usually text-dependent tasks).

(2) Nearest neighbor: all feature vectors are retained during training; at recognition time, the K nearest training vectors are found for each vector, and the decision is made accordingly. The amounts of model storage and similarity computation are usually both very large.

(3) Neural networks: there are many forms, such as multilayer perceptrons and radial basis function (RBF) networks. They can be trained explicitly to distinguish a speaker from the background speakers, but the training load is very large and the models generalize poorly.

(4) Hidden Markov model (HMM): a single-state HMM, that is, a Gaussian mixture model (GMM), is typically used. This is a popular method with relatively good results.

(5) VQ clustering (such as LBG): the results are relatively good and the algorithm complexity is not high; combined with the HMM method, even better results can be obtained.

(6) Polynomial classifier: this method has higher accuracy, but its model storage and computation are both relatively large.
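Of the matching methods listed above, template matching with DTW is the simplest to show in a few lines. The following sketch aligns two sequences of feature vectors with the classic DTW recurrence and picks the enrolled template with the smallest alignment cost. The function names `dtw_distance` and `identify`, the Euclidean local distance, and the speaker labels in the usage example are assumptions made for illustration.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Dynamic time warping cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] is the best cost of aligning seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch seq_b
                cost[i][j - 1],      # stretch seq_a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def identify(test_seq, templates):
    """Return the enrolled label whose template is closest under DTW."""
    return min(templates, key=lambda name: dtw_distance(test_seq, templates[name]))


# Hypothetical usage with one-dimensional "features" for readability.
enrolled = {
    "alice": [(1.0,), (2.0,), (2.0,), (3.0,)],
    "bob": [(5.0,), (6.0,), (7.0,)],
}
best = identify([(1.0,), (2.0,), (3.0,)], enrolled)
```

A verification system would additionally compare the winning cost against a threshold so that an unknown speaker can be rejected rather than mapped to the nearest template.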

Step 45, extract the control instruction issued by the authorized person.

Step 46, parse the control instruction, and complete the corresponding control action according to the parsed control instruction.
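Steps 45 and 46 can be sketched as a small phrase-to-action lookup. The example below assumes that a speech recognizer has already turned the picked-up audio into a text transcript, which the source does not describe; the command phrases, the action return values, and the function names `parse_instruction` and `execute` are all hypothetical.

```python
from typing import Callable, Dict, Optional


def parse_instruction(transcript: str, actions: Dict[str, Callable[[], str]]) -> Optional[str]:
    """Return the known control phrase contained in the transcript, or None."""
    text = " ".join(transcript.lower().split())  # normalise case and spacing
    # Try longer phrases first so that a more specific command wins.
    for phrase in sorted(actions, key=len, reverse=True):
        if phrase in text:
            return phrase
    return None


def execute(transcript: str, actions: Dict[str, Callable[[], str]]) -> Optional[str]:
    """Parse the transcript and run the matching control action, if any."""
    phrase = parse_instruction(transcript, actions)
    return actions[phrase]() if phrase is not None else None


# Hypothetical command table for a smart home controller.
ACTIONS = {
    "turn on the light": lambda: "light:on",
    "turn off the light": lambda: "light:off",
}
result = execute("Please  turn ON the light", ACTIONS)
```

Keeping parsing separate from execution makes it possible to log or confirm the parsed instruction before any control action is carried out.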

In the above embodiment of the present invention, when several people are talking to one another at the same time in a test room, the VR camera can be used to perform face recognition and confirm the authorized person; the beam is steered toward the authorized person's face; the authorized person utters a keyword and a control word; the keyword is recognized by voiceprint recognition to confirm authorization; and control then begins: the control word is parsed and the control action is completed.
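The sequence described above can be summarized as a gated pipeline in which each stage runs only if the previous one succeeds. The sketch below wires the stages together as injected callables; the class name `PickupPipeline`, its field names, and the stub functions in the usage example are hypothetical and stand in for the real face recognition, beamforming, voiceprint, and control components.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PickupPipeline:
    # Each stage is injected as a callable; all names here are illustrative.
    locate_authorized_face: Callable[[object], Optional[float]]  # frame -> azimuth or None
    pick_up: Callable[[float], object]                           # azimuth -> beamformed audio
    keyword_voiceprint_ok: Callable[[object], bool]              # audio -> voiceprint check
    run_control: Callable[[object], str]                         # audio -> control result

    def handle(self, frame) -> Optional[str]:
        """Run face check, beamformed pickup, voiceprint check, then control."""
        azimuth = self.locate_authorized_face(frame)
        if azimuth is None:
            return None  # no authorized face in view: do nothing
        audio = self.pick_up(azimuth)
        if not self.keyword_voiceprint_ok(audio):
            return None  # joint authentication failed
        return self.run_control(audio)


# Hypothetical stubs standing in for the real components.
pipeline = PickupPipeline(
    locate_authorized_face=lambda frame: 30.0 if frame == "frame_with_owner" else None,
    pick_up=lambda azimuth: f"audio@{azimuth}",
    keyword_voiceprint_ok=lambda audio: audio == "audio@30.0",
    run_control=lambda audio: "light:on",
)
outcome = pipeline.handle("frame_with_owner")
```

Control is reached only when both the face check and the voiceprint check pass, which mirrors the joint authentication described in the text.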

Beamforming-based sound pickup is a relatively mature technology, but the speech recognition scenario of the cocktail party effect remains very difficult to solve.

The above embodiment of the present invention introduces face recognition on VR video to assist beamforming, which improves the reliability of authentication, reduces the difficulty of blind speech processing, and improves the accuracy of speech recognition.

The above embodiment of the present invention can implement biometric recognition through face recognition and voiceprint recognition, providing multi-factor authentication that combines the two and thereby strengthening the security of the system.

Fig. 5 is a schematic diagram of a first embodiment of the microphone array sound pick-up device of the present invention. As shown in Fig. 5, the microphone array sound pick-up device may include a face recognition module 51, a beamforming module 52, and a pickup module 53, wherein:

The face recognition module 51 is configured to, as shown in Fig. 2, perform face recognition using panoramic video and capture the face of the authorized person; confirm the authorized person by face recognition; and obtain the azimuth information of the authorized person's face.

In one embodiment of the invention, the face recognition module 51 may be implemented as a VR camera (virtual reality camera).

The beamforming module 52 is configured to perform beamforming according to the azimuth information of the authorized person's face.

The pickup module 53 is configured to pick up sound using the beamformed microphone array.

In one embodiment of the invention, the pickup module 53 may specifically be implemented as a microphone array.

In one embodiment of the invention, the pickup module 53 may specifically be configured to separate signals according to the beamforming and pick up only the voice signal from the direction of the authorized person's face. In the above embodiment of the present invention, the microphone array using beamforming can pick up the desired speech (for example, the voice signal from the direction of the authorized person's face) while suppressing ambient noise.

The microphone array sound pick-up device with panoramic-video-assisted beamforming provided by the above embodiment of the present invention reduces the difficulty of beamforming by means of video recognition, addresses the problem of speech recognition under the cocktail party effect, and achieves highly directional sound pickup.

Fig. 6 is a schematic diagram of a second embodiment of the microphone array sound pick-up device of the present invention. Compared with the embodiment shown in Fig. 5, in the embodiment shown in Fig. 6 the microphone array sound pick-up device may further include a voiceprint recognition module 54 and a voice control module 55, wherein:

The face recognition module 51 is configured to confirm the authorized person using face recognition.

The voiceprint recognition module 54 is configured to recognize, by voiceprint recognition, a keyword uttered by the authorized person, so as to further confirm the authorized person.

The microphone array sound pick-up device of the above embodiment of the present invention thus performs joint authentication using face recognition and voiceprint recognition.

The voice control module 55 is configured to, after the joint authentication succeeds, extract the control instruction issued by the authorized person, parse the control instruction, and complete the corresponding control action according to the parsed control instruction.

Fig. 7 is a schematic diagram of the operation of the microphone array sound pick-up device in one embodiment of the invention. As shown in Fig. 7, the microphone array sound pick-up device may include a microphone array and a VR camera, wherein:

The VR camera performs face recognition, and the microphone array performs voiceprint recognition authentication. Through joint authentication by face recognition and voiceprint recognition, the above embodiment of the present invention strengthens the security of the system.

The VR camera performs face recognition using panoramic video, captures the face of the authorized person, and obtains the recognized face position (the azimuth information of the authorized person's face). The microphone array sound pick-up device of the present invention then uses the azimuth information of the authorized person's face to form a beam and separate signals, so that the microphone array can separate signals according to the beamforming and pick up only the voice signal from the direction of the authorized person's face.

The microphone array using beamforming in the above embodiment of the present invention can thus pick up the desired speech (for example, the voice signal from the direction of the authorized person's face) while suppressing ambient noise.

For example, when several people are talking to one another at the same time in a test room, the microphone array sound pick-up device of the present invention can use the VR camera to perform face recognition and confirm the authorized person; the beam is steered toward the authorized person's face; the authorized person utters a keyword and a control word; the keyword is recognized by voiceprint recognition to confirm authorization; and control then begins: the control word is parsed and the control action is completed.

The above embodiment of the present invention can reduce the difficulty of beamforming by means of video recognition, addresses the problem of speech recognition under the cocktail party effect, and achieves highly directional sound pickup.

The functional units described above, such as the beamforming module 52, the voiceprint recognition module 54, and the voice control module 55, may be implemented as a general-purpose processor, programmable logic controller (PLC), digital signal processor (DSP), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any suitable combination thereof, for performing the functions described herein.

The present invention has now been described in detail. Some details well known in the art have not been described, in order to avoid obscuring the concept of the invention. From the above description, those skilled in the art can fully understand how to implement the technical solutions disclosed herein.

Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

The description of the present invention is given for the purposes of illustration and description and is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be obvious to those of ordinary skill in the art. The embodiments were selected and described in order to better explain the principles and practical application of the invention, and to enable those of ordinary skill in the art to understand the invention and design various embodiments, with various modifications, suited to particular uses.

Claims

1. A microphone array sound pick-up method, characterized by comprising:

picking up sound using the beamformed microphone array.

2. The method according to claim 1, characterized in that picking up sound using the beamformed microphone array comprises:

3. The method according to claim 1 or 2, characterized by further comprising:

4. The method according to claim 3, characterized in that performing joint authentication using face recognition and voiceprint recognition comprises:

confirming the authorized person using face recognition;

5. The method according to claim 3, characterized in that, after the joint authentication succeeds, the method further comprises:

extracting the control instruction issued by the authorized person;

6. A microphone array sound pick-up device, characterized by comprising:

a face recognition module, configured to perform face recognition using panoramic video, capture the face of an authorized person, and obtain azimuth information of the authorized person's face;

7. The device according to claim 6, characterized in that

the pickup module is configured to separate signals according to the beamforming and pick up only the voice signal from the direction of the authorized person's face.

8. The device according to claim 6 or 7, characterized in that

the microphone array sound pick-up device performs joint authentication using face recognition and voiceprint recognition.

9. The device according to claim 8, characterized by further comprising:

the face recognition module is configured to confirm the authorized person using face recognition;

10. The device according to claim 8, characterized by further comprising:

a voice control module, configured to, after the joint authentication succeeds, extract the control instruction issued by the authorized person, parse the control instruction, and complete the corresponding control action according to the parsed control instruction.