CN106710603B - Speech recognition method and system using a linear microphone array - Google Patents

Speech recognition method and system using a linear microphone array

Info

Publication number
CN106710603B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611202169.0A
Other languages
Chinese (zh)
Other versions
CN106710603A (en)
Inventor
贺来朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Yunzhixin Intelligent Technology Co Ltd
Unisound Shanghai Intelligent Technology Co Ltd
Original Assignee
Unisound Shanghai Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Unisound Shanghai Intelligent Technology Co Ltd
Priority to CN201611202169.0A
Publication of CN106710603A
Application granted
Publication of CN106710603B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Abstract

The invention discloses a speech recognition method using a linear microphone array, comprising the following steps: recording ambient sound with the linear microphone array to form audio data; setting beamformers for the sound acquisition region in front of the linear microphone array, the beamformers forming, within the sound acquisition region, a main beam region in the middle and a first noise beam region and a second noise beam region on the two sides; inputting the audio data into the beamformers to obtain a main beam corresponding to the main beam region, a first noise beam corresponding to the first noise beam region, and a second noise beam corresponding to the second noise beam region; filtering the first noise beam and the second noise beam out of the main beam to obtain speech data to be recognized; and performing speech recognition on the speech data to be recognized to obtain and output the corresponding text data. The invention requires little computation, yields high-quality speech data, and can improve the accuracy of speech recognition.

Description

Speech recognition method and system using a linear microphone array
Technical field
The present invention relates to the field of human-machine speech recognition, and in particular to a speech recognition method and system using a linear microphone array.
Background art
In a speech recognition system, the audio signal captured by the microphones usually undergoes noise reduction to suppress the ambient noise component in the signal and thereby improve the recognition accuracy of the system. Depending on the number of microphones the system uses, noise reduction algorithms can be broadly divided into single-microphone noise reduction, dual-microphone noise reduction, and microphone array noise reduction.
With the rapid development of hardware, microphone arrays are being used more and more widely. According to the topology of the array elements, microphone arrays can generally be divided into linear arrays and circular arrays. For either a linear array or a circular array, noise reduction usually requires obtaining the spatial direction of the desired signal through a sound source localization algorithm, then forming a receive beam of a specific shape through a fixed beamforming algorithm and pointing the center of the beam's main lobe in the direction of the desired signal.
However, performing sound source localization and adaptive beamforming at the same time is computationally very expensive, and when the localization is off, the desired signal is easily suppressed or distorted, which in turn degrades the performance of the speech recognition system.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by proposing a speech recognition method and system using a linear microphone array. It addresses the heavy computation, complex calculation, and high implementation cost of existing microphone array configurations, with the aim of achieving good noise reduction with a microphone array, so as to obtain high-quality audio data and improve the accuracy of speech recognition.
To achieve the above object, the present invention provides a speech recognition method using a linear microphone array, the method comprising:
recording ambient sound with the linear microphone array to form audio data; setting beamformers for the sound acquisition region in front of the linear microphone array, the beamformers forming, within the sound acquisition region, a main beam region in the middle and a first noise beam region and a second noise beam region on the two sides; inputting the audio data into the beamformers to obtain a main beam corresponding to the main beam region, a first noise beam corresponding to the first noise beam region, and a second noise beam corresponding to the second noise beam region; filtering the first noise beam and the second noise beam out of the main beam to obtain speech data to be recognized; and performing speech recognition on the speech data to be recognized to obtain and output the corresponding text data.
The beneficial effects of the invention are as follows. The invention divides the sound acquisition region into three beam regions: two beams are used to capture noise and the third is used to capture the desired signal. The beamformers output the corresponding noise beams and main beam, and an adaptive filter module then further filters the noise beams out of the main beam. The method does not need to track the sound source direction in real time, which avoids the suppression or distortion of the desired signal that conventional algorithms may cause when the estimated source position is off. The algorithm also requires little computation, is simple and convenient to implement, and is low in cost; the speech data obtained is of high quality, which can improve the accuracy of speech recognition. In addition, adapting the speech recognizer with the speech data can further improve the accuracy of speech recognition.
As a further improvement of the present invention, setting beamformers for the sound acquisition region in front of the linear microphone array comprises: the sound acquisition region being a planar region spanning angles from 0° to 180°; setting a first beamformer that forms the first noise beam region, with the center of the beam formed by the first beamformer pointing in the 20° direction of the sound acquisition region; setting a second beamformer that forms the main beam region, with the center of the beam formed by the second beamformer pointing in the 90° direction of the sound acquisition region; and setting a third beamformer that forms the second noise beam region, with the center of the beam formed by the third beamformer pointing in the 160° direction of the sound acquisition region.
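The three look directions can be illustrated with a minimal sketch. It assumes a uniform linear array, far-field sources, a single analysis frequency, and plain delay-and-sum weights; none of these choices, nor the names `steering_vector`, `beam_response`, the 4 cm spacing, or the 2 kHz frequency, come from the patent, which designs its filters with the fixed beamforming algorithm described in the text.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature

# Beam centers stated in the text; the dictionary keys are illustrative names.
BEAM_CENTERS_DEG = {"noise_1": 20.0, "main": 90.0, "noise_2": 160.0}


def steering_vector(num_mics, spacing_m, angle_deg, freq_hz):
    """Far-field steering vector of a uniform linear array.

    The angle is measured from the array axis, so 90 degrees is straight
    ahead of the array and 0/180 degrees lie along the array.
    """
    theta = math.radians(angle_deg)
    vec = []
    for n in range(num_mics):
        delay = n * spacing_m * math.cos(theta) / SPEED_OF_SOUND
        vec.append(cmath.exp(-2j * math.pi * freq_hz * delay))
    return vec


def beam_response(look_deg, source_deg, num_mics=4, spacing_m=0.04, freq_hz=2000.0):
    """Magnitude response of a delay-and-sum beam steered to look_deg
    for a plane wave arriving from source_deg (1.0 means no attenuation)."""
    w = steering_vector(num_mics, spacing_m, look_deg, freq_hz)
    d = steering_vector(num_mics, spacing_m, source_deg, freq_hz)
    return abs(sum(wn.conjugate() * dn for wn, dn in zip(w, d))) / num_mics


if __name__ == "__main__":
    for name, center in BEAM_CENTERS_DEG.items():
        gains = {src: round(beam_response(center, src), 3) for src in (20.0, 90.0, 160.0)}
        print(name, gains)
```

Each beam passes a source at its own center with unit gain and attenuates sources at the other two centers, which is the property the later noise-cancelling stage relies on.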
As a further improvement of the present invention, when the beamformers are set, each beamformer is provided with filters connected in one-to-one correspondence with the microphones of the linear microphone array, and a fixed beamforming algorithm is used to calculate the filter coefficients of the filters in each beamformer;
The fixed beamforming algorithm includes:
y_n(k) = x_n(k) + v_n(k), n = 1, 2, ..., N (Formula 1)
In Formula 1, y_n(k) is the audio data captured by the n-th microphone, and x_n(k) and v_n(k) are, respectively, the captured desired signal and the additive noise. Formula 2 gives the output of the beamformer, which approximates the desired signal received by one microphone of the linear microphone array, in terms of the filter coefficients corresponding to the n-th microphone;
In Formula 3, e_m(k) denotes the error between the output signal of the beamformer and the captured desired signal; it equals the sum of the desired-signal error e_{x,m}(k) and the additive-noise error e_{v,m}(k). The desired-signal error e_{x,m}(k) and the additive-noise error e_{v,m}(k) are given by Formula 4 and Formula 5;
Formula 6 and Formula 7 are obtained by minimizing the mean-square error: the mean-square error criterion is minimized so that the additive noise is minimized, subject to the constraint e_{x,m}(k) = 0, which yields the optimal filter coefficients h_{m,o}. Here h_m is the filter coefficient matrix for all filters in the beamformer, and h_{m,o} is the corresponding optimal filter coefficient value for all filters in the beamformer.
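The idea of minimizing the noise in the beamformer output while leaving the desired signal undistorted can be sketched with a simplified analogue. The example below is not the patent's Formulas 2 to 7: it assumes a single frequency bin, a known steering vector for the desired direction, and noise that is uncorrelated between microphones, in which case the constrained minimum has a simple closed form. The function names are hypothetical.

```python
def distortionless_weights(steering, noise_var):
    """Weights that minimize output noise power subject to unit gain
    on the desired direction, for noise uncorrelated across microphones.

    steering:  complex response of each microphone to the desired source
    noise_var: noise variance at each microphone (positive floats)
    """
    scaled = [d / v for d, v in zip(steering, noise_var)]
    denom = sum(abs(d) ** 2 / v for d, v in zip(steering, noise_var))
    return [s / denom for s in scaled]


def desired_gain(weights, steering):
    """Gain applied to the desired signal; 1 means zero desired-signal error."""
    return sum(w.conjugate() * d for w, d in zip(weights, steering))


def output_noise_power(weights, noise_var):
    """Noise power remaining at the beamformer output."""
    return sum(abs(w) ** 2 * v for w, v in zip(weights, noise_var))


if __name__ == "__main__":
    steering = [1 + 0j, 0 + 1j, -1 + 0j]   # illustrative phases only
    noise_var = [1.0, 2.0, 4.0]            # illustrative variances only
    w = distortionless_weights(steering, noise_var)
    print(desired_gain(w, steering), output_noise_power(w, noise_var))
```

The unit-gain condition plays the role of the constraint e_{x,m}(k) = 0, and the minimized output noise power plays the role of the mean-square criterion; the time-domain filter matrices of the patent generalize this per-frequency picture.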
As a further improvement of the present invention, performing speech recognition on the speech data to be recognized comprises: first adapting an acoustic model using the speech data to be recognized; then performing speech recognition on the speech data to be recognized using the adapted acoustic model.
As a further improvement of the present invention, adapting the acoustic model using the speech data to be recognized comprises: extracting a set quantity of the speech data to be recognized, and annotating the extracted speech data with text labels;
extracting the acoustic features corresponding to the set quantity of speech data to be recognized, and combining the corresponding text labels with the acoustic features to form adaptive training data;
performing adaptive training on the acoustic model using the adaptive training data.
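The assembly of adaptive training data described above can be sketched as follows. This is an illustration under stated assumptions: the per-frame log energy is only a stand-in for whatever acoustic features the recognizer actually uses, and `AdaptationSample`, `log_energy_features`, and `build_adaptation_set` are hypothetical names. The adaptive training step itself depends on the acoustic model and is not shown.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class AdaptationSample:
    """One adaptive-training item: acoustic features plus the text label."""
    features: List[float]
    transcript: str


def log_energy_features(samples: Sequence[float], frame_len: int = 160) -> List[float]:
    """Placeholder acoustic feature: log energy of each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        feats.append(math.log(energy + 1e-10))
    return feats


def build_adaptation_set(utterances: Sequence[Sequence[float]],
                         transcripts: Sequence[str],
                         max_items: int) -> List[AdaptationSample]:
    """Take a set quantity of utterances and pair features with text labels."""
    if len(utterances) != len(transcripts):
        raise ValueError("each utterance needs exactly one transcript")
    pairs = list(zip(utterances, transcripts))[:max_items]
    return [AdaptationSample(log_energy_features(u), t) for u, t in pairs]
```

Keeping features and labels together in one record mirrors the text's requirement that the two be combined before training.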
The present invention also provides a linear microphone array speech recognition system, the system comprising: a beamformer communicatively connected to the linear microphone array, the beamformer forming, in the sound acquisition region in front of the linear microphone array, a main beam region in the middle and a first noise beam region and a second noise beam region on the two sides, and being used to process the received audio data to obtain a main beam corresponding to the main beam region, a first noise beam corresponding to the first noise beam region, and a second noise beam corresponding to the second noise beam region;
an adaptive filter module, communicatively connected to the beamformer, which receives the main beam, the first noise beam, and the second noise beam, and filters the first noise beam and the second noise beam out of the main beam to obtain speech data to be recognized;
a speech recognizer, communicatively connected to the adaptive filter module, which receives the speech data to be recognized and performs speech recognition on it to obtain and output the corresponding text data.
As a further improvement of the present invention, the sound acquisition region is a planar region spanning angles from 0° to 180°; the beamformer comprises: a first beamformer that forms the first noise beam region, the center of the beam formed by the first beamformer pointing in the 20° direction of the sound acquisition region;
a second beamformer that forms the main beam region, the center of the beam formed by the second beamformer pointing in the 90° direction of the sound acquisition region;
a third beamformer that forms the second noise beam region, the center of the beam formed by the third beamformer pointing in the 160° direction of the sound acquisition region.
As a further improvement of the present invention, each beamformer is provided with filters connected in one-to-one correspondence with the microphones of the linear microphone array, and each filter in each beamformer has a corresponding filter coefficient; the filter coefficients are calculated by a fixed beamforming algorithm;
The fixed beamforming algorithm includes:
y_n(k) = x_n(k) + v_n(k), n = 1, 2, ..., N (Formula 1)
In Formula 1, y_n(k) is the audio data captured by the n-th microphone, and x_n(k) and v_n(k) are, respectively, the captured desired signal and the additive noise. Formula 2 gives the output of the beamformer, which approximates the desired signal received by one microphone of the linear microphone array, in terms of the filter coefficients corresponding to the n-th microphone;
In Formula 3, e_m(k) denotes the error between the output signal of the beamformer and the captured desired signal; it equals the sum of the desired-signal error e_{x,m}(k) and the additive-noise error e_{v,m}(k). The desired-signal error e_{x,m}(k) and the additive-noise error e_{v,m}(k) are given by Formula 4 and Formula 5;
Formula 6 and Formula 7 are obtained by minimizing the mean-square error: the mean-square error criterion is minimized so that the additive noise is minimized, subject to the constraint e_{x,m}(k) = 0, which yields the optimal filter coefficients h_{m,o}. Here h_m is the filter coefficient matrix for all filters in the beamformer, and h_{m,o} is the corresponding optimal filter coefficient value for all filters in the beamformer.
As a further improvement of the present invention, the speech recognizer includes an acoustic model, which is used to recognize the speech data to be recognized after being adaptively trained on that speech data.
As a further improvement of the present invention, the speech recognizer further includes a feature extraction module, a text input module, a training data storage module, and a training module;
the feature extraction module is communicatively connected to the adaptive filter module, receives the speech data to be recognized, and extracts acoustic features from the received speech data;
the text input module is used to input the text labels corresponding to the speech data to be recognized;
the training data storage module is communicatively connected to the feature extraction module and the text input module, and stores the acoustic features and the corresponding text labels, which are combined to form adaptive training data;
the training module is communicatively connected to the training data storage module, reads the adaptive training data stored in the training data storage module, and performs adaptive training on the acoustic model using the adaptive training data it has read.
Brief description of the drawings
Fig. 1 is a schematic diagram of the sound acquisition region;
Fig. 2 is a flow chart of the linear microphone array speech recognition method;
Fig. 3 is a schematic diagram of the multi-channel adaptive filter.
Detailed description of the embodiments
The invention is described in further detail below with reference to the accompanying drawings. With the rapid development of hardware, microphone arrays are being used more and more widely. In human-machine voice interaction in particular, conventional techniques perform sound source localization and adaptive beamforming at the same time, which is computationally very expensive; when the localization is off, the desired signal is easily suppressed or distorted, which in turn degrades the performance of the speech recognition system. The present invention applies to the 180° planar region in front of the linear microphone array, and its intended scenario is human-machine voice interaction. When a speaker controls a machine by voice, the speaker stands facing the machine, so when the linear microphone array captures the speaker's voice, only sound in front of the machine needs to be considered, not sound behind it. The invention divides the planar region in front of the microphones: a wider beam in the middle captures as much of the speaker's voice as possible while suppressing as much ambient noise as possible, and narrower beams on the two sides capture as much ambient noise as possible while suppressing the desired voice signal. An adaptive filtering algorithm then further removes the ambient noise component from the output of the main beam. The linear microphone array speech recognition method and system of the present invention are described below with reference to the drawings.
As shown in Fig. 2, the invention discloses a linear microphone array speech recognition system comprising a linear microphone array 1, beamformers, an adaptive filter module 3, and a speech recognizer 4.
The linear microphone array records sound from the external environment and digitizes the sound signal into audio data. A sound acquisition region is formed in front of the linear microphone array. Within the sound acquisition region, the beamformers form a main beam region in the middle and a first noise beam region and a second noise beam region on either side of the main beam region. The beamformers are communicatively connected to the linear microphone array; they receive the audio data from the main beam region, the first noise beam region, and the second noise beam region and, after processing, obtain the main beam corresponding to the main beam region, the first noise beam corresponding to the first noise beam region, and the second noise beam corresponding to the second noise beam region.
As shown in Fig. 3, the adaptive filter module 3 is a multi-channel filter communicatively connected to the beamformers. It receives the main beam, the first noise beam, and the second noise beam output by the beamformers. The first noise beam is adapted by a first adaptive filter 31 and the second noise beam is adapted by a second adaptive filter 32; the adapted first and second noise beams are then filtered out of the main beam and the result is output. The output is fed back to adaptive filters 31 and 32, whose coefficients are continuously updated according to the normalized least mean square (NLMS) algorithm. The final result is output from the adaptive filter module 3 as the speech data to be recognized.
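A two-reference NLMS noise canceller of the kind described can be sketched as below. The structure (two adaptive filters, subtraction from the main beam, error feedback, NLMS update) follows the text; the tap count, step size `mu`, regularization `eps`, and the function name `nlms_cancel` are illustrative assumptions.

```python
def nlms_cancel(main, noise_1, noise_2, taps=4, mu=0.5, eps=1e-8):
    """Remove from `main` the components predictable from two noise beams.

    main, noise_1, noise_2: equal-length sequences of float samples.
    Returns the error signal, which serves as the cleaned output.
    """
    w1 = [0.0] * taps  # coefficients of the first adaptive filter
    w2 = [0.0] * taps  # coefficients of the second adaptive filter
    out = []
    for k in range(len(main)):
        # Most recent `taps` samples of each noise beam, zero-padded at the start.
        u1 = [noise_1[k - i] if k - i >= 0 else 0.0 for i in range(taps)]
        u2 = [noise_2[k - i] if k - i >= 0 else 0.0 for i in range(taps)]
        # Noise estimate formed by the two adaptive filters.
        estimate = sum(a * b for a, b in zip(w1, u1)) + sum(a * b for a, b in zip(w2, u2))
        # Subtract it from the main beam; this error is also the feedback signal.
        e = main[k] - estimate
        # NLMS update: step normalized by the joint input energy.
        norm = sum(x * x for x in u1) + sum(x * x for x in u2) + eps
        step = mu * e / norm
        w1 = [w + step * x for w, x in zip(w1, u1)]
        w2 = [w + step * x for w, x in zip(w2, u2)]
        out.append(e)
    return out
```

Because the noise beams suppress the speaker's voice, the filters can only model the noise part of the main beam, so the fed-back error converges toward the speech component.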
The speech recognizer 4 is communicatively connected to the adaptive filter module. It receives the speech data to be recognized output by the adaptive filter module, performs speech recognition on it, and obtains and outputs the corresponding text data.
A sound acquisition region spanning angles from 0° to 180° lies in front of the linear microphone array 1. There are three beamformers: a first beamformer 21, a second beamformer 22, and a third beamformer 23. The first beamformer 21 forms the first noise beam region, and the center of the beam it forms points in the 20° direction of the sound acquisition region. The second beamformer 22 forms the main beam region, and the center of the beam it forms points in the 90° direction of the sound acquisition region. The third beamformer 23 forms the second noise beam region, and the center of the beam it forms points in the 160° direction of the sound acquisition region. The main lobe of the beam formed by the second beamformer 22 is wider than the main lobes of the beams formed by the first and third beamformers. Preferably, the main lobe width of the beam formed by the second beamformer 22 is less than or equal to 90°, and the main lobe widths of the beams formed by the first and third beamformers are less than or equal to 40°.
As a preferred embodiment of the present invention, the first beamformer 21, the second beamformer 22, and the third beamformer 23 are each provided with filters connected in one-to-one correspondence with the microphones of the linear microphone array. Each filter filters with its own filter coefficients, which are calculated by a fixed beamforming algorithm. The fixed beamforming algorithm includes:
y_n(k) = x_n(k) + v_n(k), n = 1, 2, ..., N (Formula 1)
In Formula 1, y_n(k) is the audio data captured by the n-th microphone, and x_n(k) and v_n(k) are, respectively, the captured desired signal and the additive noise. Formula 2 gives the estimated output of the beamformer: the audio data is filtered so that the beamformer output approximates the desired signal received by one microphone of the linear microphone array before being output, in terms of the filter coefficients corresponding to the n-th microphone;
In Formula 3, e_m(k) denotes the error between the output signal of the beamformer and the captured desired signal; it equals the sum of the desired-signal error e_{x,m}(k) and the additive-noise error e_{v,m}(k). The desired-signal error e_{x,m}(k) and the additive-noise error e_{v,m}(k) are given by Formula 4 and Formula 5: the desired-signal error equals the difference between the beamformer output and the beamformer input, and the additive-noise error equals the sum of all additive noise.
Formula 6 and Formula 7 are obtained by minimizing the mean-square error: the mean-square error criterion is minimized so that the additive noise is minimized, subject to the constraint e_{x,m}(k) = 0, which yields the optimal filter coefficients h_{m,o}. Here h_m is the filter coefficient matrix for all filters in the beamformer, and h_{m,o} is the corresponding optimal filter coefficient value for all filters in the beamformer.
Because the additive noise in the final beamformer output should be as small as possible, the mean-square error is minimized under an error criterion, as shown in Formula 6. When the mean-square error is at its minimum, the optimal filter coefficients h_{m,o} are obtained, as shown in Formula 7. To keep distortion of the desired signal to a minimum, the constraint e_{x,m}(k) = 0 is added when estimating the optimal filter coefficients h_{m,o}.
The speech recognizer 4 contains an acoustic model, which is used to perform speech recognition on the input speech data to be recognized and produce the corresponding text. Because the filtering performed by the microphone array inevitably distorts the main beam, recognition accuracy suffers when the acoustic model processes the speech data. To reduce the effect of this distortion on recognition accuracy, the acoustic model is adaptively trained, before it is used for recognition, on speech data that has been processed by the microphone array. The speech recognizer 4 then performs recognition with the adaptively trained acoustic model, which improves its recognition accuracy on the speech data and reduces the effect of the distortion.
As a preferred embodiment of the present invention, the speech recognizer is further provided with a feature extraction module, a text input module, a training data storage module, and a training module. The feature extraction module is communicatively connected to the adaptive filter module; it receives the speech data to be recognized output by the adaptive filter module and extracts acoustic features from it. The text input module receives manually entered text labels corresponding to the speech data to be recognized. The training data storage module is communicatively connected to both the feature extraction module and the text input module; it stores the acoustic features extracted by the feature extraction module and the corresponding text labels output by the text input module, and the acoustic features and text labels are combined to form adaptive training data. The training module is communicatively connected to the training data storage module; it reads the adaptive training data stored there and uses it to adaptively train the acoustic model.
Take a voice-controlled automatic teller machine (ATM) as an example. Multiple microphones, at least three, are arranged in a horizontal row on the ATM at a set spacing; they capture sound in front of the ATM to form sound signals. The horizontally arranged microphones form a 180° sound acquisition region in front of the ATM, and the sound captured in this region includes both the speaker's control commands and ambient noise. Three beamformers form three beams in the sound acquisition region: the main beam, the first noise beam, and the second noise beam. The sound signals captured by the microphones are digitized into audio data and input to the beamformers; the first beamformer outputs the first noise beam, the second beamformer outputs the main beam, and the third beamformer outputs the second noise beam. The main beam, the first noise beam, and the second noise beam are input together to the adaptive filter module, which filters the first and second noise beams out of the main beam and outputs the speech data to be recognized; this is the output of the microphone array. The speech data to be recognized is input to the speech recognizer for adaptation, which improves the recognizer's accuracy on that speech data; the speech data is then recognized and the corresponding text data is output. The text data can be sent to the ATM so that the ATM performs the corresponding action.
The present invention also provides a linear microphone array speech recognition method. In this method, the microphones are first arranged as a linear array and convert the speaker's audio signal into audio data. The audio data is then fed to the beamformers, which filter it to form the main beam and the noise beams and output them to the adaptive filter module. The adaptive filter module takes the noise beams and the main beam and removes the noise beams from the main beam to form the speech data to be recognized. Finally, the speech data to be recognized is input to the speech recognizer, which recognizes it and outputs text data. Specifically:
First step: select three or more microphones and arrange them horizontally at intervals to form linear microphone array 1. A sound acquisition region spanning 0° to 180° is formed in front of the linear microphone array, as shown in Fig. 1. The sound captured from this region includes both the speaker's voice and noise from the surroundings. The microphones receive the sound from the sound acquisition region, digitize the captured audio signal to form audio data, and output it.
Second step: with reference to Fig. 2, set beamformers for the sound acquisition region in front of linear microphone array 1. Within the sound acquisition region, the beamformers form a main beam region in the middle and a first noise beam region and a second noise beam region on either side of the main beam region. The audio data is then input to the beamformers, which output the main beam corresponding to the main beam region, the first noise beam corresponding to the first noise beam region, and the second noise beam corresponding to the second noise beam region.
Third step: filter the first noise beam, the main beam, and the second noise beam output by the beamformers. Using the noise information in the input first and second noise beams, remove the first and second noise beams from the main beam to form the speech data to be recognized. Specifically, the main beam, the first noise beam, and the second noise beam output by the beamformers are received; the first noise beam is adapted by the first adaptive filter 31 and the second noise beam is adapted by the second adaptive filter 32; the adapted first and second noise beams are then filtered out of the main beam and the result is output. A judgment module compares the output against a manually set standard. If the beam quality after adaptive filtering does not meet the standard, the output is returned to the first adaptive filter 31 and the second adaptive filter 32 for another round of adaptation. This cycle repeats until the output meets the manually set standard, at which point the final result is output from the adaptive filter module 3 as the speech data to be recognized.
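The repeat-until-standard loop can be sketched as follows. The patent does not say how the judgment module measures quality, so this example makes several assumptions that should be read as placeholders: quality is taken to be the normalized correlation between the output and a single noise reference, each pass removes only part of the remaining noise, and the names `residual_correlation`, `subtract_projection`, and `refine_until_ok` are hypothetical.

```python
import math


def residual_correlation(signal, ref):
    """Assumed quality measure: |normalized correlation| with the noise beam."""
    num = sum(s * r for s, r in zip(signal, ref))
    den = math.sqrt(sum(s * s for s in signal) * sum(r * r for r in ref)) + 1e-12
    return abs(num) / den


def subtract_projection(signal, ref, step=0.5):
    """One filtering pass: remove a fraction `step` of the noise component."""
    coeff = sum(s * r for s, r in zip(signal, ref)) / (sum(r * r for r in ref) + 1e-12)
    return [s - step * coeff * r for s, r in zip(signal, ref)]


def refine_until_ok(main, ref, threshold=0.1, max_passes=20):
    """Repeat the filtering pass until the output meets the set standard.

    Returns the filtered signal and the number of passes used. `max_passes`
    guards against looping forever if the standard cannot be reached.
    """
    output = list(main)
    for passes in range(1, max_passes + 1):
        output = subtract_projection(output, ref)
        if residual_correlation(output, ref) <= threshold:
            return output, passes
    return output, max_passes
```

A pass limit is added here as a design precaution; the text itself describes looping until the standard is met.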
In the fourth step, the voice data to be recognized is recognized by machine, and the resulting text data is output.
In a preferred embodiment of the present invention, a sound acquisition region spanning 0° to 180° lies in front of the linear microphone array, and the microphones input the audio data they form into the beamformers. There are three beamformers: the first beamformer 21, the second beamformer 22, and the third beamformer 23. The first beamformer 21 forms the first noise beam region: the center of its beam points in the 20° direction of the 180° sound acquisition region, and it acquires the audio data located in the first noise beam region of the linear microphone array and outputs the first noise beam. The second beamformer 22 forms the main beam region: the center of its beam points in the 90° direction of the 180° sound acquisition region, and it acquires the audio data located in the main beam region of the linear microphone array and outputs the main beam. The third beamformer 23 forms the second noise beam region: the center of its beam points in the 160° direction of the 180° sound acquisition region, and it acquires the noise located in the second noise beam region of the linear microphone array and outputs the second noise beam. The main lobe of the beam formed by the second beamformer 22 is wider than the main lobes of the beams formed by the first and third beamformers. Preferably, the main-lobe width of the beam formed by the second beamformer 22 is no more than 90°, and the main-lobe widths of the beams formed by the first and third beamformers are no more than 40°.
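To make the three look directions concrete, the sketch below computes the single-frequency response of a beam steered to 20°, 90°, or 160°. It is an illustration only: it uses plain delay-and-sum steering rather than the filter design described in the patent, and the microphone count, spacing, test frequency, speed of sound, and all function names are assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def steering_delays(num_mics, spacing_m, angle_deg):
    """Relative far-field arrival delays (seconds) across a uniform linear array.

    Angles follow the text's convention: 0 and 180 degrees lie along the
    array axis and 90 degrees is straight ahead (broadside).
    """
    positions = np.arange(num_mics) * spacing_m
    return -positions * np.cos(np.deg2rad(angle_deg)) / SPEED_OF_SOUND


def beam_response(num_mics, spacing_m, steer_deg, arrival_deg, freq_hz):
    """Magnitude response of a delay-and-sum beam steered to steer_deg
    for a plane wave arriving from arrival_deg at one frequency."""
    mismatch = (steering_delays(num_mics, spacing_m, arrival_deg)
                - steering_delays(num_mics, spacing_m, steer_deg))
    return float(np.abs(np.mean(np.exp(-2j * np.pi * freq_hz * mismatch))))


# The three beam centers named in the embodiment.
BEAM_CENTERS_DEG = {"first_noise": 20.0, "main": 90.0, "second_noise": 160.0}

if __name__ == "__main__":
    # Hypothetical array: 4 microphones, 5 cm apart, evaluated at 1 kHz.
    for name, center in BEAM_CENTERS_DEG.items():
        gains = [beam_response(4, 0.05, center, a, 1000.0) for a in (20, 90, 160)]
        print(name, np.round(gains, 3))
```

Each beam has unit gain toward its own center and lower gain toward the other two centers, which is the property that lets the side beams act as noise references for the main beam.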
In a preferred embodiment of the present invention, each of the first beamformer 21, the second beamformer 22, and the third beamformer 23 contains one filter connected to each microphone in the linear microphone array. Each filter filters with its own filter coefficients, which are calculated by a fixed beamforming algorithm. The fixed beamforming algorithm includes:
y_n(k) = x_n(k) + v_n(k), n = 1, 2, ..., N (Formula 1)
In Formula 1, y_n(k) is the audio data captured by the n-th microphone, and x_n(k) and v_n(k) are the captured desired signal and additive noise, respectively. In Formula 2, the estimated output of the beamformer is obtained by filtering the audio data so that the beamformer output approximates the desired signal received by one of the microphones in the linear microphone array; each microphone n has its own corresponding filter coefficient.
In Formula 3, e_m(k) denotes the error between the beamformer output signal and the captured desired signal. It equals the sum of the desired-signal error e_{x,m}(k) and the additive-noise error e_{v,m}(k), which are given by Formula 4 and Formula 5, respectively: the desired-signal error equals the difference between the beamformer's output and its input, and the additive-noise error equals the sum of all additive noise terms.
Because the additive noise in the final beamformer output should be as small as possible, the minimum mean-square-error criterion is applied, as shown in Formula 6. Minimizing this mean-square error yields the optimal filter coefficients h_{m,o}, as shown in Formula 7. To keep the distortion of the desired signal to a minimum, the constraint e_{x,m}(k) = 0 is added when estimating the optimal filter coefficients h_{m,o}.
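Formulas 2 through 7 are described in the text but not written out. The block below is one possible reconstruction that matches those descriptions and Formula 1. It is an assumption, not the patent's own notation: the bold stacked sample vectors, the estimate \hat{x}_m(k), the per-microphone filter vectors \mathbf{h}_n (stacked into \mathbf{h}_m), and the cost symbol J_v are introduced here for illustration.

```latex
\begin{align}
\hat{x}_m(k) &= \sum_{n=1}^{N} \mathbf{h}_{n}^{T}\,\mathbf{y}_n(k) \tag{2}\\
e_m(k) &= \hat{x}_m(k) - x_m(k) = e_{x,m}(k) + e_{v,m}(k) \tag{3}\\
e_{x,m}(k) &= \sum_{n=1}^{N} \mathbf{h}_{n}^{T}\,\mathbf{x}_n(k) - x_m(k) \tag{4}\\
e_{v,m}(k) &= \sum_{n=1}^{N} \mathbf{h}_{n}^{T}\,\mathbf{v}_n(k) \tag{5}\\
J_v(\mathbf{h}_m) &= E\left\{ e_{v,m}^{2}(k) \right\} \tag{6}\\
\mathbf{h}_{m,o} &= \arg\min_{\mathbf{h}_m} J_v(\mathbf{h}_m)
  \quad \text{subject to } e_{x,m}(k) = 0 \tag{7}
\end{align}
```

Substituting y_n(k) = x_n(k) + v_n(k) from Formula 1 into (2) shows that (3) is the sum of (4) and (5). Under the constraint e_{x,m}(k) = 0 the total error reduces to the noise term, so minimizing the mean-square value of e_m(k) and of e_{v,m}(k) would be equivalent in this reading.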
The speech recognizer 4 contains an acoustic model, which performs speech recognition on the input voice data to be recognized in order to identify the corresponding text. Because the filtering performed by the microphone array inevitably distorts the main beam, recognition accuracy suffers when the acoustic model recognizes the voice data. To reduce the effect of this distortion on recognition accuracy, the acoustic model is adaptively trained, before it performs speech recognition, on voice data that has been processed by the microphone array. The speech recognizer 4 then performs recognition with the adaptively trained acoustic model, which improves its recognition accuracy on the voice data and reduces the effect of the distortion.
In a preferred embodiment of the present invention, adapting the acoustic model with the voice data to be recognized further includes the following steps. First, the speech recognizer extracts a set quantity of voice data to be recognized and annotates the extracted data with text labels. Next, it extracts the acoustic features corresponding to that set quantity of voice data and combines them with the corresponding text labels to form adaptive training data. Finally, it adaptively trains the acoustic model on the adaptive training data. Once adaptive training is complete, the adapted acoustic model performs speech recognition on the voice data to be recognized.
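The data-preparation part of these steps can be sketched as pairing features with transcripts. Everything named here is hypothetical: the text does not specify the acoustic feature, so per-frame log energy stands in for it, and `AdaptationExample`, `log_energy_features`, and `build_adaptation_set` are illustrative names. The adaptive training step itself depends on the acoustic model and is not shown.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class AdaptationExample:
    features: np.ndarray  # shape (num_frames, feature_dim)
    transcript: str       # manually supplied text label


def log_energy_features(samples, frame_len=400, hop=160):
    """Per-frame log energy; a stand-in for whatever acoustic feature is used."""
    samples = np.asarray(samples, dtype=float)
    if len(samples) < frame_len:
        raise ValueError("utterance is shorter than one frame")
    num_frames = 1 + (len(samples) - frame_len) // hop
    frames = np.stack([samples[i * hop:i * hop + frame_len]
                       for i in range(num_frames)])
    return np.log(np.sum(frames ** 2, axis=1, keepdims=True) + 1e-10)


def build_adaptation_set(utterances, transcripts, quantity):
    """Take the first `quantity` utterances and pair features with labels."""
    if len(utterances) != len(transcripts):
        raise ValueError("each utterance needs exactly one transcript")
    pairs = list(zip(utterances, transcripts))[:quantity]
    return [AdaptationExample(log_energy_features(u), t) for u, t in pairs]
```

The utterances passed in would be the voice data output by the adaptive filter module, so that the training data carries the same distortion the recognizer will see at run time.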
Take a voice-controlled automatic teller machine (ATM) as an example. Several microphones, at least three, are mounted horizontally on the ATM at a set spacing to capture the sound in front of the ATM and form sound signals. The horizontally mounted microphones form a 180° sound acquisition region in front of the ATM, and the sound in this region includes the speaker's spoken control commands and environmental noise. Three beamformers form three beams in the sound acquisition region: the main beam, the first noise beam, and the second noise beam. The sound signals captured by the microphones are digitized into voice data and input to the beamformers; the first beamformer outputs the first noise beam, the second beamformer outputs the main beam, and the third beamformer outputs the second noise beam. The main beam, the first noise beam, and the second noise beam are input together into the adaptive filter module, which filters the two noise beams out of the main beam and outputs the voice data to be recognized; this voice data is the output of the microphone array. The voice data to be recognized is input into the speech recognizer for the adaptation operation, which improves the recognizer's accuracy on that data; the recognizer then recognizes the voice data and outputs the corresponding text data. The text data can be sent to the ATM so that the ATM performs the corresponding action.
The present invention targets a specific class of human-machine voice interaction and does not need to track the sound source direction in real time. This avoids the suppression or distortion of the desired signal that conventional algorithms may suffer when the estimated source position is biased. The algorithm also requires little computation, is simple to implement, and is inexpensive; the voice data obtained is of high quality, which can improve speech recognition accuracy.
The present invention has been described in detail above with reference to the drawings and embodiments, and those of ordinary skill in the art can make many variations based on this description. Accordingly, certain details of the embodiments should not be construed as limiting the invention, whose scope of protection is defined by the appended claims.

Claims (6)

1. A speech recognition method using a linear microphone array, characterized in that the method comprises the following steps:
recording ambient sound with a linear microphone array to form audio data;
setting up beamformers for the sound acquisition region in front of the linear microphone array, and using the beamformers to form, within the sound acquisition region, a main beam region in the middle and a first noise beam region and a second noise beam region on either side;
inputting the audio data into the beamformers to obtain a main beam corresponding to the main beam region, a first noise beam corresponding to the first noise beam region, and a second noise beam corresponding to the second noise beam region;
filtering the first noise beam and the second noise beam out of the main beam to obtain voice data to be recognized;
performing speech recognition on the voice data to be recognized to obtain and output corresponding text data;
wherein the sound acquisition region comprises a planar region spanning angles from 0° to 180°, and setting up beamformers for the sound acquisition region in front of the linear microphone array comprises:
setting up a first beamformer for forming the first noise beam region, with the center of the beam formed by the first beamformer pointing in the 20° direction of the sound acquisition region;
setting up a second beamformer for forming the main beam region, with the center of the beam formed by the second beamformer pointing in the 90° direction of the sound acquisition region;
setting up a third beamformer for forming the second noise beam region, with the center of the beam formed by the third beamformer pointing in the 160° direction of the sound acquisition region;
wherein, when the beamformers are set up, each beamformer is provided with a filter connected to each microphone in the linear microphone array, and a fixed beamforming algorithm is used to calculate filter coefficients for the filters in each beamformer;
the fixed beamforming algorithm comprising:
y_n(k) = x_n(k) + v_n(k), n = 1, 2, ..., N (Formula 1)
In Formula 1, y_n(k) is the audio data captured by the n-th microphone, and x_n(k) and v_n(k) are the captured desired signal and additive noise, respectively; Formula 2 gives the output of the beamformer, which is made to approximate the desired signal received by one of the microphones in the linear microphone array, with a filter coefficient corresponding to the n-th microphone;
in Formula 3, e_m(k) denotes the error between the beamformer output signal and the captured desired signal, and it equals the sum of the desired-signal error e_{x,m}(k) and the additive-noise error e_{v,m}(k); the desired-signal error e_{x,m}(k) and the additive-noise error e_{v,m}(k) are given by Formula 4 and Formula 5, respectively;
Formula 6 and Formula 7 are obtained by minimizing the mean-square error: the additive noise is minimized and, combined with the constraint e_{x,m}(k) = 0, the optimal filter coefficients h_{m,o} are obtained, where h_m is the filter coefficient matrix corresponding to all filters in the beamformer and h_{m,o} is the optimal filter coefficient value corresponding to all filters in the beamformer.
2. The method of claim 1, characterized in that performing speech recognition on the voice data to be recognized comprises: first adapting an acoustic model using the voice data to be recognized; and then performing speech recognition on the voice data to be recognized using the adapted acoustic model.
3. The method of claim 2, characterized in that adapting the acoustic model using the voice data to be recognized comprises:
extracting a set quantity of voice data to be recognized, and annotating the extracted voice data with text labels;
extracting the acoustic features corresponding to the set quantity of voice data to be recognized, and combining the corresponding text labels with the acoustic features to form adaptive training data;
adaptively training the acoustic model using the adaptive training data.
4. A speech recognition system using a linear microphone array, characterized in that the system comprises:
a linear microphone array for recording ambient sound to form audio data;
beamformers communicatively connected to the linear microphone array, the beamformers forming, within the sound acquisition region in front of the linear microphone array, a main beam region in the middle and a first noise beam region and a second noise beam region on either side, and serving to process the received audio data to obtain a main beam corresponding to the main beam region, a first noise beam corresponding to the first noise beam region, and a second noise beam corresponding to the second noise beam region;
an adaptive filter module communicatively connected to the beamformers, which receives the main beam, the first noise beam, and the second noise beam, and serves to filter the first noise beam and the second noise beam out of the main beam to obtain voice data to be recognized;
a speech recognizer communicatively connected to the adaptive filter module, which receives the voice data to be recognized and serves to perform speech recognition on it to obtain and output corresponding text data;
wherein the sound acquisition region comprises a planar region spanning angles from 0° to 180°;
the beamformers comprise: a first beamformer for forming the first noise beam region, the center of the beam formed by the first beamformer pointing in the 20° direction of the sound acquisition region;
a second beamformer for forming the main beam region, the center of the beam formed by the second beamformer pointing in the 90° direction of the sound acquisition region;
a third beamformer for forming the second noise beam region, the center of the beam formed by the third beamformer pointing in the 160° direction of the sound acquisition region;
each beamformer is provided with a filter connected to each microphone in the linear microphone array, and the filters in each beamformer are provided with corresponding filter coefficients, which are calculated by a fixed beamforming algorithm;
the fixed beamforming algorithm comprising:
y_n(k) = x_n(k) + v_n(k), n = 1, 2, ..., N (Formula 1)
In Formula 1, y_n(k) is the audio data captured by the n-th microphone, and x_n(k) and v_n(k) are the captured desired signal and additive noise, respectively; Formula 2 gives the output of the beamformer, which is made to approximate the desired signal received by one of the microphones in the linear microphone array, with a filter coefficient corresponding to the n-th microphone;
in Formula 3, e_m(k) denotes the error between the beamformer output signal and the captured desired signal, and it equals the sum of the desired-signal error e_{x,m}(k) and the additive-noise error e_{v,m}(k); the desired-signal error e_{x,m}(k) and the additive-noise error e_{v,m}(k) are given by Formula 4 and Formula 5, respectively;
Formula 6 and Formula 7 are obtained by minimizing the mean-square error: the additive noise is minimized and, combined with the constraint e_{x,m}(k) = 0, the optimal filter coefficients h_{m,o} are obtained, where h_m is the filter coefficient matrix corresponding to all filters in the beamformer and h_{m,o} is the optimal filter coefficient value corresponding to all filters in the beamformer.
5. The system of claim 4, characterized in that the speech recognizer includes an acoustic model, and the acoustic model is used to recognize the voice data to be recognized after being adaptively trained on the voice data to be recognized.
6. The system of claim 5, characterized in that the speech recognizer further includes a feature extraction module, a text input module, a training data storage module, and a training module;
the feature extraction module is communicatively connected to the adaptive filter module, receives the voice data to be recognized, and extracts acoustic features from the received voice data to be recognized;
the text input module is used to input text labels corresponding to the voice data to be recognized;
the training data storage module is communicatively connected to the feature extraction module and the text input module and stores the acoustic features and the corresponding text labels, which are combined to form adaptive training data;
the training module is communicatively connected to the training data storage module, reads the adaptive training data stored in the training data storage module, and uses it to adaptively train the acoustic model.
CN201611202169.0A 2016-12-23 2016-12-23 Utilize the audio recognition method and system of linear microphone array Active CN106710603B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611202169.0A CN106710603B (en) 2016-12-23 2016-12-23 Utilize the audio recognition method and system of linear microphone array

Publications (2)

Publication Number Publication Date
CN106710603A CN106710603A (en) 2017-05-24
CN106710603B true CN106710603B (en) 2019-08-06

Family

ID=58903066

Country Status (1)

Country Link
CN (1) CN106710603B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1995034983A1 (en) * 1994-06-14 1995-12-21 Ab Volvo Adaptive microphone arrangement and method for adapting to an incoming target-noise signal
CN1851806A (en) * 2006-05-30 2006-10-25 北京中星微电子有限公司 Adaptive microphone array system and its voice signal processing method
CN102831898A (en) * 2012-08-31 2012-12-19 厦门大学 Microphone array voice enhancement device with sound source direction tracking function and method thereof
CN102969002A (en) * 2012-11-28 2013-03-13 厦门大学 Microphone array speech enhancement device capable of suppressing mobile noise
CN105532017A (en) * 2013-03-12 2016-04-27 谷歌技术控股有限责任公司 Apparatus and method for beamforming to obtain voice and noise signals

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
    Effective date of registration: 20171010
    Address after: 200233 Shanghai City, Xuhui District Guangxi 65 No. 1 Jinglu room 702 unit 03
    Applicant after: Cloud known sound (Shanghai) Technology Co. Ltd.
    Address before: 200233 Shanghai, Qinzhou, North Road, No. 82, building 2, layer 1198,
    Applicant before: SHANGHAI YUZHIYI INFORMATION TECHNOLOGY CO., LTD.
GR01 Patent grant
TR01 Transfer of patent right
    Effective date of registration: 20200415
    Address after: 200233 Shanghai City, Xuhui District Guangxi 65 No. 1 Jinglu room 702 unit 03
    Co-patentee after: Xiamen yunzhixin Intelligent Technology Co., Ltd
    Patentee after: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY Co.,Ltd.
    Address before: 200233 Shanghai City, Xuhui District Guangxi 65 No. 1 Jinglu room 702 unit 03
    Patentee before: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY Co.,Ltd.