CN109509465A - Voice signal processing method, component, device, and medium - Google Patents

Voice signal processing method, component, device, and medium

Info

Publication number
CN109509465A
CN109509465A
Authority
CN
China
Prior art keywords
voice signal
voice
signal
channel
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710850441.4A
Other languages
Chinese (zh)
Other versions
CN109509465B (en)
Inventor
都家宇
田彪
雷鸣
姚海涛
刘勇
黄雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201710850441.4A priority Critical patent/CN109509465B/en
Publication of CN109509465A publication Critical patent/CN109509465A/en
Application granted granted Critical
Publication of CN109509465B publication Critical patent/CN109509465B/en
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G10L21/028: Voice signal separating using properties of sound source
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223: Execution procedure of a spoken command
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Abstract

The embodiments of the present application disclose a voice signal processing method, component, device, and medium, so as to improve the flexibility of voice control. The method comprises: a processing component separates the voice signals coming from different directions in a received mixed voice signal to obtain multiple channels of voice signals; the processing component performs parallel recognition on some or all of the multiple channels of voice signals, wherein the parallel recognition comprises: for some or all of the channels, dividing each channel's voice signal into multiple recognition units that are recognized separately, each recognition unit comprising multiple consecutive frames.

Description

Voice signal processing method, component, device, and medium
Technical field
The present application relates to the technical field of data processing, and more particularly to a voice signal processing method, component, device, and computer-readable storage medium.
Background technique
With the continuous development of speech recognition technology, intelligent voice control systems have developed rapidly. By recognizing voice, an intelligent voice control system can execute corresponding functions quickly, accurately, and efficiently.
After collecting a voice signal, an existing intelligent voice control system searches its database for target data that matches the semantics of the voice signal, and then, according to the control instruction corresponding to the target data found, controls the execution of the corresponding function.
However, an existing voice control system responds only to the voice signal of a single user when executing a function, and therefore lacks flexibility.
Summary of the invention
The embodiments of the present application provide a voice signal processing method, component, device, and computer-readable storage medium, so as to improve the flexibility of voice control.
According to a first aspect of the embodiments of the present application, a voice signal processing method is provided, comprising:
separating, by a processing component, the voice signals coming from different directions in a received mixed voice signal to obtain multiple channels of voice signals;
performing, by the processing component, parallel recognition on some or all of the multiple channels of voice signals, wherein the parallel recognition comprises: for some or all of the channels, dividing each channel's voice signal into multiple recognition units that are recognized separately, each recognition unit comprising multiple consecutive frames.
According to a second aspect of the embodiments of the present application, a voice signal processing component is provided, comprising:
a voice processing module, configured to separate the voice signals coming from different directions in a received mixed voice signal to obtain multiple channels of voice signals;
a recognition module, configured to perform parallel recognition on some or all of the multiple channels of voice signals, wherein the parallel recognition comprises: for some or all of the channels, dividing each channel's voice signal into multiple recognition units that are recognized separately, each recognition unit comprising multiple consecutive frames.
According to a third aspect of the embodiments of the present application, a voice signal processing device is provided, comprising a memory and a processor, wherein the memory is configured to store executable program code, and the processor is configured to read the executable program code stored in the memory to execute the above voice signal processing method.
According to a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, on which computer program instructions are stored; when the computer program instructions are executed by a processor, the above voice signal processing method is implemented.
According to a fifth aspect of the embodiments of the present application, an in-vehicle voice interaction device is provided, comprising a microphone array and a processor, wherein:
the microphone array is configured to collect a mixed voice signal;
the processor, communicatively connected to the microphone array, is configured to separate the voice signals coming from different directions in the received mixed voice signal to obtain multiple channels of voice signals, and to perform parallel recognition on some or all of the channels, wherein the parallel recognition comprises: for some or all of the channels, dividing each channel's voice signal into multiple recognition units that are recognized separately, each recognition unit comprising multiple consecutive frames.
According to a sixth aspect of the embodiments of the present application, an in-vehicle Internet control system is provided, comprising a microphone control component and a control component, wherein:
the microphone control component is configured to control a microphone array to collect a mixed voice signal;
the control component is configured to control the separation of the voice signals coming from different directions in the received mixed voice signal to obtain multiple channels of voice signals, and to perform parallel recognition on some or all of the channels, wherein the parallel recognition comprises: for some or all of the channels, dividing each channel's voice signal into multiple recognition units that are recognized separately, each recognition unit comprising multiple consecutive frames.
According to the voice signal processing method, component, device, and computer-readable storage medium of the embodiments of the present application, the voice signals coming from different directions in the received mixed voice signal are separated to obtain multiple channels of voice signals, and parallel recognition is performed on some or all of the channels, wherein the parallel recognition comprises: for some or all of the channels, dividing each channel's voice signal into multiple recognition units that are recognized separately, each recognition unit comprising multiple consecutive frames. In the technical solution of the embodiments of the present application, recognizing each channel in recognition units of multiple consecutive frames effectively reduces the number of recognition operations, thereby reducing the central processing unit (CPU) resources occupied when recognizing each channel, so that some or all of the multiple channels of voice signals can be recognized in parallel. Consequently, a voice interaction device adopting the technical solution of the embodiments of the present application can recognize multiple channels of voice signals in parallel, which greatly improves the flexibility of voice control compared with the prior art, which responds only to the voice signal of a single user.
Detailed description of the invention
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic diagram of semantic recognition in the prior art;
Fig. 2 is a schematic diagram of semantic recognition in an embodiment of the present application;
Fig. 3 is a schematic diagram of an application scenario of the voice signal processing method according to an embodiment of the present application;
Fig. 4 is a schematic diagram of beamforming based on a microphone array in an embodiment of the present application;
Fig. 5 is a schematic diagram of another application scenario of the voice signal processing method according to an embodiment of the present application;
Fig. 6 is a schematic flowchart of the voice signal processing method according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of the voice signal processing component according to an embodiment of the present application;
Fig. 8 is a structural diagram of an exemplary hardware architecture of a computing device capable of implementing the voice signal processing method and component according to the embodiments of the present application;
Fig. 9 is a schematic structural diagram of the in-vehicle voice interaction device of an embodiment of the present application;
Fig. 10 is a schematic structural diagram of the in-vehicle Internet control system of an embodiment of the present application.
Specific embodiment
The features and exemplary embodiments of various aspects of the present application are described in detail below. In order to make the objectives, technical solutions, and advantages of the present application clearer, the application is further described in detail with reference to the drawings and the embodiments. It should be understood that the specific embodiments described herein are intended only to explain the application, not to limit it. To those skilled in the art, the application may be practiced without some of these details. The following description of the embodiments is provided merely to facilitate a better understanding of the application by showing examples thereof.
It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include" and "comprise", and any variants thereof, are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements, but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes it.
It should be noted that recognition of a voice signal may include, but is not limited to, semantic recognition, context recognition, tone recognition, and the like. The embodiments of the present application are illustrated by taking semantic recognition in an intelligent voice control system as an example.
When an existing intelligent voice control system executes a corresponding function according to a received voice signal, the processing comprises a semantic recognition stage, a semantic matching stage, and a control execution stage. In the semantic recognition stage, after the voice signal is collected, semantic recognition is performed on the voice signal to identify the semantics it contains. In the semantic matching stage, based on the semantic recognition result of the voice signal, target data matching the semantic recognition result is searched for in the database of the intelligent voice control system. In the control execution stage, according to the control instruction corresponding to the target data found, the device is controlled to execute the corresponding function.
In the semantic recognition stage of an existing intelligent voice control system, after the voice signal is collected, it is first divided into frames; each frame of voice data in the voice signal is then recognized, and the semantics of the voice signal are determined according to the recognition result of every frame.
For example, as shown in Fig. 1, the voice signal after framing contains 7 frames of voice data, namely the 7 frames from time t=i-3 to time t=i+3. In the semantic recognition stage, semantic recognition is performed on each of the 7 frames separately, and the semantic recognition results of the 7 frames are then combined to determine the semantics of the voice signal.
The above processing in the semantic recognition stage occupies a large amount of CPU resources, and the CPU resources of an intelligent voice control system are often extremely limited. This inevitably leaves even fewer CPU resources for the semantic matching stage and the control execution stage, so that the existing intelligent voice control system can respond only to the voice signal of a single user when executing a function, and therefore lacks flexibility.
For example, when an existing intelligent voice control system is applied in a vehicle to form an in-vehicle voice control system, voice control can typically be performed only from the driver's seat. In actual use, the driver's voice signal is easily interfered with by the voice signals of the front passenger and the rear-seat occupants, so the control effect of the in-vehicle voice control system in practice is unsatisfactory.
As another example, when an existing intelligent voice control system is applied in a smart device, such as a smart speaker, a smart television, or an automatic vending machine, the smart device can often be voice-controlled by only one user at a time; when multiple users speak simultaneously, or in a noisy environment, the control effect of the intelligent voice control system in the smart device degrades substantially.
In view of this, in one embodiment, the embodiments of the present application use a low frame rate (LFR) acoustic model when performing semantic recognition on the collected voice signal, so as to reduce the CPU resources occupied by the semantic recognition stage.
In one embodiment, when the LFR acoustic model performs semantic recognition on the collected voice signal, the voice signal is divided into multiple recognition units for semantic recognition, where each recognition unit comprises multiple consecutive frames.
In one example, the collected voice signal is first divided into frames. After framing, one frame of voice data is chosen out of every preset number of frames as a target frame, and the multiple frames of voice data adjacent to the target frame, together with the target frame itself, form the recognition unit used to perform semantic recognition on the target frame. Adjacent recognition units may contain the same frames of voice data.
For example, as shown in Fig. 2, the voice signal after framing contains N frames of voice data; 7 of these frames, namely the 7 frames from time t=i-3 to time t=i+3, are taken as an example for illustration.
When performing semantic recognition on the voice signal, one frame out of every 3 frames of voice data is chosen as the target frame. For example, among the 7 frames from time t=i-3 to time t=i+3, the frames at times t=i-3, t=i, and t=i+3 are chosen as target frames.
For the frame at time t=i-3, semantic recognition is performed in combination with the frames at times t=i-6, t=i-5, t=i-4, t=i-3, t=i-2, t=i-1, and t=i. Similarly, for the frame at time t=i, semantic recognition is performed in combination with the frames at times t=i-3, t=i-2, t=i-1, t=i, t=i+1, t=i+2, and t=i+3.
Compared with the semantic recognition process shown in Fig. 1, the process shown in Fig. 2 substantially reduces the number of recognition operations during semantic recognition, thereby reducing the CPU resources occupied by the semantic recognition stage. At the same time, because the number of frames recognized is reduced, the process shown in Fig. 2 can also improve the efficiency of semantic recognition.
In the semantic recognition process shown in Fig. 2, after a target frame is chosen, semantic recognition is performed on it in units of the target frame plus its adjacent frames. Compared with recognizing the chosen target frame alone, each recognition operation thus combines more voice information. Therefore, the process shown in Fig. 2 not only reduces the recognition frequency and the CPU resources occupied by the semantic recognition stage and improves recognition efficiency, but can also effectively guarantee the accuracy of semantic recognition.
In this example, the recognition unit for each target frame consists of the three frames before it, the three frames after it, and the target frame itself. In other embodiments of the present application, the number of frames adjacent to the target frame that form the recognition unit can be set according to the accuracy requirement of speech recognition: if a higher recognition accuracy is required, the number of adjacent frames can be set larger; conversely, if a lower accuracy is acceptable, the number of adjacent frames can be set smaller.
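As a concrete illustration, the unit-selection scheme described above (one target frame chosen every three frames, each recognized together with up to three context frames on either side) can be sketched as follows. The function name and parameter defaults are ours, not the patent's, and frame indices falling outside the utterance are simply clamped:

```python
def lfr_units(num_frames, stride=3, context=3):
    """Return (target_frame, unit_frames) pairs: one target frame per
    `stride` frames, each recognized as a unit consisting of the target
    frame plus up to `context` adjacent frames on each side (indices
    are clamped at the utterance boundaries)."""
    units = []
    for target in range(0, num_frames, stride):
        lo = max(0, target - context)
        hi = min(num_frames - 1, target + context)
        units.append((target, list(range(lo, hi + 1))))
    return units

# With 7 frames, targets fall on frames 0, 3 and 6, mirroring the
# t=i-3, t=i, t=i+3 example in the text: 3 recognition operations
# replace 7 per-frame operations.
print(lfr_units(7))
```

Running three recognition operations instead of seven is the source of the CPU saving claimed above; widening `context` trades CPU for the higher accuracy the text mentions.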
In one embodiment, because the semantic recognition method shown in Fig. 2 reduces the CPU resources occupied by the semantic recognition stage, the voice signal processing scheme provided by the embodiments of the present application can still perform semantic recognition on multiple channels of voice signals in parallel even when CPU resources are limited.
Although the processing method shown in Fig. 2 makes it possible to perform semantic recognition on multiple channels of voice signals in parallel, when a mixed voice signal containing voice signals from different directions is received, each voice signal must also be separated from the mixed voice signal to obtain the multiple channels; parallel semantic recognition is then performed on some or all of the channels, so as to improve the accuracy of semantic recognition of the voice signals.
To this end, in one embodiment of the present application, a microphone array is used to separate the voice signals coming from different directions in the received mixed voice signal to obtain multiple channels of voice signals: each voice signal is separated from the collected mixed voice signal based on a beamforming algorithm, and semantic recognition is then performed on each voice signal. This improves the accuracy of semantic recognition and solves the problem in the prior art that the accuracy of semantic recognition on a mixed voice signal is low.
For example, as shown in Fig. 3, an in-vehicle voice control environment includes a driver 31, a front passenger 32, and a voice interaction device 33, where the voice interaction device 33 includes a microphone array.
While the driver 31 and the front passenger 32 are talking, if the voice interaction device 33 is switched on, the microphone array in the voice interaction device 33 collects in real time a mixed voice signal containing the voice signal of the driver 31 and the voice signal of the front passenger 32. In a real vehicle environment, the mixed voice signal may of course also contain the voice signals of rear-seat occupants and environmental noise.
When the microphone array collects the mixed voice signal, the driver 31 and the front passenger 32 are located in different directions from the voice interaction device 33; therefore, for the microphone array, the voice signal of the driver 31 and the voice signal of the front passenger 32 come from different directions. Based on this, the microphone array can form beams in different directions, pick up the voice signal within each beam, and eliminate the noise outside the beam, thereby achieving voice signal separation and voice signal enhancement.
In one example, as shown in Fig. 4, after the microphone array collects the mixed voice signal containing the voice signals of the driver 31 and the front passenger 32, the collected mixed voice signal is first preprocessed; after preprocessing, the delay of each channel of voice signal relative to a reference signal is obtained using the phase-transform-weighted generalized cross-correlation algorithm; finally, based on the calculated delays, beams are formed by a delay-and-sum beamforming algorithm.
In one example, the preprocessing includes framing, silence detection, and applying a Hamming window. A voice signal is a non-stationary signal whose characteristics change over time; within a very short period, however, it can be considered to have relatively stable characteristics, i.e., the voice signal is short-term stationary. Therefore, when processing a voice signal it is usually divided into frames over short periods.
The purpose of silence detection is to reject the mute frames in the voice signal. Silence detection both eliminates the influence of mute frames on the recognition of adjacent frames and reduces unnecessary computation, improving computational efficiency.
In addition, framing the voice signal is equivalent to truncating the time-domain signal with a rectangular window. Since multiplication in the time domain is equivalent to convolution in the frequency domain, rectangular-window truncation causes spectral leakage of the voice signal in the frequency domain; a Hamming window is therefore applied to alleviate the spectral leakage.
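A minimal sketch of this preprocessing chain, assuming 16 kHz audio with 25 ms frames (400 samples) and a 10 ms hop; the function names and the silence-energy threshold are illustrative, not from the patent:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (assumes
    len(x) >= frame_len) and apply a Hamming window to each frame to
    alleviate the spectral leakage caused by rectangular truncation."""
    n = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])
    return frames * np.hamming(frame_len)

def is_silent(frame, threshold=1e-4):
    """Crude energy-based silence check used to reject mute frames."""
    return float(np.mean(frame ** 2)) < threshold
```

In practice the threshold would be tuned to the microphone's noise floor, and frames flagged by `is_silent` would be dropped before delay estimation.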
After the voice signal is preprocessed, voice enhancement based on the microphone array requires sound source localization to estimate the position or direction of the desired source; an enhancement algorithm is then applied to obtain the enhanced voice signal.
In one example, sound source localization based on time difference of arrival (TDOA) estimation is taken as an example for illustration.
Common time-delay estimation methods include the generalized cross-correlation (GCC) method, the linear regression (LR) method, and the least mean square (LMS) adaptive method. The GCC method is taken as an example below.
The GCC method first computes the cross-power spectrum of a pair of microphone signals, multiplies it by a weighting function, and finally applies the inverse Fourier transform to obtain the cross-correlation function of the signals; the time corresponding to the peak of the cross-correlation function is the arrival delay difference τi of this microphone pair.
The performance of the GCC method depends on the chosen weighting function, the most representative being maximum likelihood (ML) weighting and phase transform (PHAT) weighting.
Ideally, maximum likelihood weighting achieves an optimal estimate, but it requires the power spectra of the source signal and the noise to be known, a condition that is rarely satisfied in practice. Phase transform weighting abandons the need for the power spectra of the source signal and the noise: by normalizing the cross-power spectrum it sharpens the cross-correlation function, making the peak prominent and better suppressing the interference of spurious correlation peaks. In addition, phase transform weighting is more robust in reverberant environments.
In an ideal free-field environment, the cross-correlation function has a single dominant maximum, so when computing the delay it suffices to find the maximum of the cross-correlation function; the time corresponding to it is the delay difference.
In a reverberant environment, however, countless reverberation signals are superimposed, and the cross-correlation function may have multiple peaks. This problem can be solved by the phase-transform-weighted generalized cross-correlation (GCC-PHAT) algorithm.
Instead of computing the cross-correlation function directly in the time domain, the GCC-PHAT algorithm uses the correspondence between the time-domain cross-correlation function and the frequency-domain cross-power spectrum: it first computes the cross-spectral density between the two voice signals, then applies PHAT weighting, and finally obtains the generalized cross-correlation function through the inverse Fourier transform, from which the corresponding delay difference is derived.
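The GCC-PHAT computation just described can be sketched in NumPy as follows; `gcc_phat` is our own helper name, and the small constant added before normalization merely guards against division by zero:

```python
import numpy as np

def gcc_phat(sig, ref, fs=16000):
    """Delay of `sig` relative to `ref`, in seconds, via PHAT-weighted
    generalized cross-correlation: cross-power spectrum, normalized to
    unit magnitude (keeping only phase), then inverse FFT; the peak of
    the generalized cross-correlation gives the delay difference."""
    n = len(sig) + len(ref)
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12          # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (int(np.argmax(np.abs(cc))) - max_shift) / fs
```

A positive return value means the first signal arrived later; in the delay-sum stage each microphone channel is then compensated by its estimated delay.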
The delay-sum beamforming (DSB) algorithm uses the delay differences τi obtained by GCC-PHAT: it first applies delay compensation to the voice signal in each microphone channel so that the voice signals received by the microphones are aligned on the time axis, and then uniformly weights and sums them to obtain the output signal.
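Under the same assumptions (integer-sample delays estimated beforehand), the delay-sum step might look like the sketch below; the helper name is ours, and the uniform averaging follows the description above:

```python
import numpy as np

def delay_and_sum(channels, delays, fs=16000):
    """Align each microphone channel by its estimated delay (in
    seconds) using an integer-sample shift, then average the aligned
    channels with uniform weights."""
    aligned = [np.roll(ch, -int(round(tau * fs)))
               for ch, tau in zip(channels, delays)]
    return np.mean(aligned, axis=0)
```

Coherent speech adds in phase after alignment while uncorrelated noise averages down, which is what steers the beam toward the desired source.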
During beamforming, the direction information of each channel of voice signal can be determined according to the relationship between the beam energy and the phase difference in each direction.
In one example, after the microphone array determines the direction information of the voice signal of the driver 31 and the voice signal of the front passenger 32 in the mixed voice signal relative to the microphone array, the voice signal of the driver 31 and the voice signal of the front passenger 32 can be separated from the collected mixed voice signal based on the determined direction information.
In one example, after the voice signal of the driver 31 and the voice signal of the front passenger 32 are separated from the mixed voice signal, beamforming processing and signal enhancement processing can also be performed on each of the two voice signals. The signal enhancement processing can include, but is not limited to, amplification processing, noise reduction processing, and the like.
In one example, after the voice signal of the driver 31 and the voice signal of the front passenger 32 are separated from the collected mixed voice signal, semantic recognition can be performed on the two voice signals in parallel. Specifically, when performing semantic recognition on the voice signal of the driver 31 and the voice signal of the front passenger 32, the semantic recognition method shown in Fig. 2 can be used, so as to reduce the CPU resources occupied by semantic recognition.
It in one embodiment, is including the mixing of the voice signal from different direction by microphone array acquisition After voice signal, the azimuth information of each voice signal can be determined using the beamforming algorithm shown in Fig. 4, and based on true The azimuth information for each voice signal made isolates each voice signal from mixing voice signal, obtains multi-path voice letter Number, so using Fig. 2 shows method for recognizing semantics to some or all of carry out language in the multi-path voice signal isolated Justice identification.
In one example, if the mixed voice signal collected by the microphone array includes the main driver's voice signal, the copilot's voice signal, and environmental noise, then after these three components have been separated from the mixed voice signal, semantic recognition can be performed only on the separated main driver's voice signal and copilot's voice signal, since the environmental noise clearly contains no valuable information. Skipping semantic recognition of the separated environmental noise further reduces the CPU resources occupied by the semantic recognition stage.
In one embodiment, after semantic recognition is performed in parallel on the voice signal of the main driver 31 and the voice signal of the copilot 32, multiple wake engines in the voice interaction device 33 can check the two semantic recognition results in parallel, to detect whether the semantic recognition result of the main driver's voice signal or that of the copilot's voice signal contains the wake word.
In one example, the wake word is a password or command that activates the voice control system in the voice interaction device 33; it can be a predetermined particular word, a specific sentence, a specific signal, or the like. For example, the wake word may be "Hello, Banma".
In one example, when a wake engine in the voice interaction device 33 detects that the voice signal of a certain user (the main driver or the copilot) contains the wake word, the wake word is used to wake up the voice control system in the voice interaction device 33, and for a subsequent preset duration voice control is carried out according to that user's voice signal.
In one embodiment, the multiple wake engines in the voice interaction device 33 can all be connected to the voice control system, but in actual use which engine's semantic recognition result is sent to the voice control system is decided by whether the result contains the wake word: whichever channel's semantic recognition result is detected to contain the wake word is the result sent to the voice control system.
For example, after semantic recognition is performed in parallel on the voice signal of the main driver 31 and the voice signal of the copilot 32, the multiple wake engines in the voice interaction device 33 can check the semantic recognition result of the main driver's voice signal and that of the copilot's voice signal in parallel.
If the semantic recognition result of the main driver's voice signal is detected to contain the wake word, that result is sent to the voice control system, and voice control is subsequently performed by the main driver 31; if instead the semantic recognition result of the copilot's voice signal is detected to contain the wake word, that result is sent to the voice control system, and voice control is subsequently performed by the copilot 32.
In one embodiment, for convenience of control, after the voice control system has been woken up by the main driver 31 or the copilot 32 (for example, by the main driver 31), the azimuth information of the main driver 31 determined by the microphone array can be used, within a preset duration after the current moment, to directionally collect voice from the direction in which the main driver 31 is located. Beamforming processing and signal enhancement processing are applied to the collected voice signal, and the processed voice signal is then sent to the voice control system. The preset duration can be set according to an empirical value, for example 30 seconds.
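The directed pickup described above can be sketched as a small gate that accepts audio only from the waker's azimuth and only within the preset duration. The class name, azimuth tolerance, and injectable clock are illustrative assumptions, not part of the embodiments:

```python
import time

WAKE_WINDOW_S = 30.0  # preset duration; 30 s is the empirical value given above

class DirectedPickup:
    """Gate audio after wake-up: accept a frame only if it arrives within
    the preset duration and comes from the waker's azimuth."""

    def __init__(self, azimuth_deg, now=time.monotonic):
        self.azimuth = azimuth_deg   # azimuth of the user who woke the system
        self.now = now               # clock function, injectable for testing
        self.woken_at = now()        # instant the system was woken up

    def accept(self, frame_azimuth_deg, tolerance_deg=15.0):
        within_window = self.now() - self.woken_at <= WAKE_WINDOW_S
        from_waker = abs(frame_azimuth_deg - self.azimuth) <= tolerance_deg
        return within_window and from_waker
```

Injecting the clock makes the time window testable without waiting 30 seconds.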
The processing method of a voice signal provided by the embodiments of the present application has been illustrated above in the context of a vehicle environment. The embodiments of the present application can also be used in other smart devices that include a voice control system, including but not limited to smart speakers, smart televisions, and automatic vending machines.
For example, taking a smart speaker as an example, as shown in Fig. 5, a smart home environment includes a smart speaker 50, a user 51, and a user 52. The smart speaker 50 includes a microphone array for collecting voice signals, a semantic recognition system, a semantic detection system, and a voice control system.
In use, the user 51 and the user 52 are within the recognition range of the smart speaker 50 and send control commands to the smart speaker 50 by voice.
The microphone array in the smart speaker 50 collects a mixed voice signal that includes the voice signal of the user 51 and the voice signal of the user 52, determines the azimuth information of the user 51 and of the user 52 based on a beamforming algorithm, separates the voice signal of the user 51 and the voice signal of the user 52 from the mixed voice signal according to the determined azimuth information, and then sends the two voice signals to the semantic recognition system.
After receiving the voice signal of the user 51 and the voice signal of the user 52, the semantic recognition system in the smart speaker 50 performs semantic recognition on the two voice signals in parallel, and then sends the semantic recognition result of each voice signal to the semantic detection system.
After receiving the semantic recognition results of the two voice signals, the semantic detection system in the smart speaker 50 starts two wake engines and detects in parallel whether the semantic recognition result of the user 51's voice signal and that of the user 52's voice signal contain the wake word. For example, wake engine 1 and wake engine 2 are started and run in parallel: wake engine 1 detects whether the semantic recognition result of the user 51's voice signal contains the wake word, and wake engine 2 detects whether the semantic recognition result of the user 52's voice signal contains the wake word.
If wake engine 1 detects that the semantic recognition result of the user 51's voice signal contains the wake word, wake engine 1 sends that result to the voice control system; if wake engine 2 detects that the semantic recognition result of the user 52's voice signal contains the wake word, wake engine 2 sends that result to the voice control system.
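The two parallel wake engines can be sketched with one thread per channel, each checking its channel's recognition result for the wake word and forwarding a hit. The wake word string and channel names are placeholders, and a production wake engine would of course do more than a substring test:

```python
import queue
import threading

def wake_engine(channel, text, wake_word, out_q):
    """One wake engine: check one channel's semantic recognition result
    for the wake word and forward a hit toward the voice control system."""
    if wake_word in text.lower():
        out_q.put((channel, text))

def detect_parallel(results, wake_word="hello, banma"):
    """Run one wake engine per channel in parallel, as in the Fig. 5 example."""
    out_q = queue.Queue()
    engines = [threading.Thread(target=wake_engine,
                                args=(ch, txt, wake_word, out_q))
               for ch, txt in results.items()]
    for e in engines:
        e.start()
    for e in engines:
        e.join()
    return sorted(out_q.queue)   # results to be sent to the control system
```

Only the channel whose result contains the wake word is forwarded; the other channel is simply dropped.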
After receiving a semantic recognition result sent by the semantic detection system, the voice control system in the smart speaker 50 searches a database for target data matching the semantic recognition result, and then, according to the control instruction corresponding to the target data found, controls the smart speaker 50 to execute the corresponding function.
When searching the database for target data matching the semantic recognition result, the voice control system can search a database stored locally on the smart speaker 50, or it can upload the semantic recognition result to a cloud server or cloud computing platform and search the database of the cloud server or of the cloud computing platform; the present application does not limit this.
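One common way to combine the two lookup options above is local-first with a cloud fallback; the following sketch assumes that arrangement, and the function names and database shape are illustrative only:

```python
def lookup_target(semantic_result, local_db, cloud_lookup):
    """Search the locally stored database first; if no match is found,
    upload the semantic result to the cloud side and search there."""
    if semantic_result in local_db:
        return local_db[semantic_result]
    return cloud_lookup(semantic_result)  # e.g. query a cloud server's database
```

The text does not mandate this ordering, so a deployment could equally query the cloud unconditionally.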
The implementation of the above processing method of a voice signal is illustrated below with reference to a specific system processing flow. It should be noted, however, that the specific embodiment is merely intended to better describe the present application and does not constitute an improper limitation on it.
In terms of the overall flow, as shown in Fig. 6, the processing method 600 of a voice signal may include the following steps:
Step S601: a processing component separates the voice signals from different directions in the received mixed voice signal, obtaining multiple channels of voice signals.
Step S602: the processing component performs parallel recognition on some or all of the multiple channels of voice signals, where parallel recognition includes: for some or all of the channels, dividing each channel's voice signal into multiple recognition units to be recognized, each recognition unit including multiple consecutive frames.
In the embodiments of the present application, when some or all of the channels are recognized, dividing each channel's voice signal into multiple recognition units effectively reduces the number of recognition passes, and thereby reduces the CPU resources occupied when recognizing each channel, making it possible to recognize some or all of the channels in parallel. Furthermore, a voice interaction device using the technical solution of the embodiments of the present application can recognize multiple voice channels in parallel; compared with the prior art, which responds only to the voice signal of a single user, this greatly increases the flexibility of voice control.
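Steps S601/S602 can be sketched as a pipeline that fans the separated channels out to worker threads. The recognizer here is a stub that just joins the recognition units, standing in for a real acoustic/semantic model:

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_channel(name, units):
    """Stub recognizer: a real system would decode each recognition unit
    with an acoustic and semantic model; here we just join the units."""
    return name, " ".join(units)

def parallel_recognize(channels):
    """Step S602: recognize some or all separated channels in parallel."""
    with ThreadPoolExecutor(max_workers=max(1, len(channels))) as pool:
        futures = [pool.submit(recognize_channel, name, units)
                   for name, units in channels.items()]
        return dict(f.result() for f in futures)
```

Each channel is an independent task, so one slow channel does not block recognition of the others.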
In implementation, as shown in Fig. 7, the processing component 700 of a voice signal may include:
a speech processing module 701, configured to separate the voice signals from different directions in the received mixed voice signal, obtaining multiple channels of voice signals; and
an identification module 702, configured to perform parallel recognition on some or all of the multiple channels of voice signals, where parallel recognition includes: for some or all of the channels, dividing each channel's voice signal into multiple recognition units to be recognized, each recognition unit including multiple consecutive frames.
In one embodiment, the identification module 702 is specifically configured to: perform framing processing on each channel's voice signal to obtain multiple frames of voice data; from the multiple frames of voice data, choose one frame out of every preset number of frames as a target frame; and recognize each channel's voice signal using, as a recognition unit, the target frame together with the frames of voice data adjacent to it.
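Under the assumption that "one frame out of every preset number of frames" means a fixed stride, the recognition-unit construction can be sketched as follows; the stride and context width are illustrative values:

```python
def split_into_units(frames, stride=3, context=1):
    """Pick one target frame per `stride` frames and group it with its
    `context` adjacent frames on each side. Recognizing these units
    instead of every frame cuts the number of recognition passes
    roughly by a factor of `stride`."""
    units = []
    for i in range(0, len(frames), stride):   # i indexes each target frame
        lo = max(0, i - context)
        hi = min(len(frames), i + context + 1)
        units.append(frames[lo:hi])           # target frame plus neighbors
    return units
```

For seven frames with the defaults, the targets are frames 0, 3, and 6, each padded with its available neighbors.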
In one embodiment, the speech processing module 701 is specifically configured to: determine the azimuth information of each voice signal in the mixed voice signal; and, based on the azimuth information of each voice signal, separate out the multiple channels of voice signals from different directions in the mixed voice signal.
In one embodiment, a signal enhancement module 703 is configured to perform beamforming processing and signal enhancement processing on each channel's voice signal.
In one embodiment, the device further includes: a detection module 704, configured to detect in parallel whether the recognition result of each channel's voice signal contains the wake word; and a first sending module 705, configured to send the recognition result containing the wake word to the voice control system when the recognition result of any channel's voice signal is detected to contain the wake word.
In one embodiment, the device further includes: an azimuth determining module 706, configured to determine the azimuth information of the voice signal corresponding to the recognition result containing the wake word; a collection module 707, configured to directionally collect, within a preset duration, the voice signal from that azimuth, and to perform beamforming processing and signal enhancement processing on the collected voice signal; and a second sending module 708, configured to send the beamformed and enhanced voice signal to the voice control system.
In one embodiment, after being woken up by the recognition result containing the wake word, the voice control system executes the corresponding function according to the voice signal it receives.
In one embodiment, the multiple channels of voice signals include: the voice signal of the main driver and the voice signal of the copilot.
Fig. 8 shows a structural diagram of an exemplary hardware architecture of a computing device capable of implementing the processing method and component of a voice signal according to the embodiments of the present application. As shown in Fig. 8, the computing device 800 includes an input device 801, an input interface 802, a central processing unit 803, a memory 804, an output interface 805, and an output device 806. The input interface 802, the central processing unit 803, the memory 804, and the output interface 805 are interconnected by a bus 810; the input device 801 and the output device 806 are connected to the bus 810 through the input interface 802 and the output interface 805 respectively, and thereby to the other components of the computing device 800.
Specifically, the input device 801 receives input information from outside and transmits the input information to the central processing unit 803 through the input interface 802; the central processing unit 803 processes the input information based on the computer-executable instructions stored in the memory 804 to generate output information, stores the output information temporarily or permanently in the memory 804, and then transmits the output information to the output device 806 through the output interface 805; the output device 806 outputs the output information to the outside of the computing device 800 for use by the user.
That is, the computing device shown in Fig. 8 can also be implemented as a processing equipment of a voice signal, which may include: a memory storing computer-executable instructions; and a processor that, when executing the computer-executable instructions, can implement the processing method and component of a voice signal described in conjunction with Figs. 2 to 7.
The above embodiments can be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, they can be implemented wholly or partly in the form of a computer program product or computer-readable storage medium comprising one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated wholly or partly. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions can be transmitted from one web site, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), a semiconductor medium (e.g., a solid-state disk (SSD)), or the like.
In addition, in conjunction with the processing method of a voice signal in the above embodiments, the embodiments of the present application can be implemented by providing a computer-readable storage medium on which computer program instructions are stored; when executed by a processor, the computer program instructions implement the processing method of a voice signal of any of the above embodiments.
The present application also provides a vehicle-mounted voice interaction equipment. Those skilled in the art will understand that the vehicle-mounted voice interaction equipment can manage and control the hardware of the processing component of Fig. 7 above or of the processing equipment of a voice signal shown in Fig. 8, as well as the computer programs of the software resources involved in the present application, and is system software running directly on the above processing component or processing equipment.
The vehicle-mounted voice interaction equipment provided by the present application can interact with other modules or functional devices on the vehicle to control the functions of the corresponding module or functional device.
The structure of the vehicle-mounted voice interaction equipment provided by the present application is described in detail below. Fig. 9 is a structural diagram of the vehicle-mounted voice interaction equipment provided by an embodiment of the present application. As shown in Fig. 9, the vehicle-mounted voice interaction equipment provided by the present application includes a microphone array 901 and a processor 902, wherein:
the microphone array 901 is configured to collect a mixed voice signal; and
the processor 902, communicatively connected to the microphone array 901, is configured to separate the voice signals from different directions in the received mixed voice signal, obtaining multiple channels of voice signals, and to perform parallel recognition on some or all of the multiple channels of voice signals, where parallel recognition includes: for some or all of the channels, dividing each channel's voice signal into multiple recognition units to be recognized, each recognition unit including multiple consecutive frames.
In one embodiment, the processor 902 is specifically configured to: perform framing processing on each channel's voice signal to obtain multiple frames of voice data; from the multiple frames of voice data, choose one frame out of every preset number of frames as a target frame; and recognize each channel's voice signal using, as a recognition unit, the target frame together with the frames of voice data adjacent to it.
In one embodiment, the processor 902 is specifically configured to: determine the azimuth information of each voice signal in the mixed voice signal; and, based on the azimuth information of each voice signal, separate out the multiple channels of voice signals from different directions in the mixed voice signal.
In one embodiment, the processor 902 is further configured to perform beamforming processing and signal enhancement processing on each channel's voice signal.
In one embodiment, the processor 902 is further configured to: detect in parallel whether the recognition result of each channel's voice signal contains the wake word; and, when the recognition result of any channel's voice signal is detected to contain the wake word, send the recognition result containing the wake word to the voice control system.
In one embodiment, the processor 902 is further configured to: determine the azimuth information of the voice signal corresponding to the recognition result containing the wake word; within a preset duration, directionally collect the voice signal from that azimuth and perform beamforming processing and signal enhancement processing on the collected voice signal; and send the beamformed and enhanced voice signal to the voice control system.
In one embodiment, after being woken up by the recognition result containing the wake word, the voice control system executes the corresponding function according to the voice signal it receives.
Further, the vehicle-mounted voice interaction equipment can, through the above microphone array 901 and processor 902, or on the basis of the microphone array 901 and processor 902 in combination with other units, control the corresponding components to execute the processing method of a voice signal of Fig. 6 above.
The present application also provides a vehicle-mounted internet operating system. Those skilled in the art will understand that the vehicle-mounted internet operating system can manage and control the hardware of the processing component of Fig. 7 above or of the processing equipment of a voice signal shown in Fig. 8, as well as the computer programs of the software resources involved in the present application, and is system software running directly on the above processing component or processing equipment.
The vehicle-mounted internet operating system provided by the present application can interact with other modules or functional devices on the vehicle to control the functions of the corresponding module or functional device.
Based on the vehicle-mounted internet operating system provided by the present application and the development of vehicle communication technology, the vehicle is no longer outside the communication network: the vehicle can be networked with a server side to form a vehicle-mounted internet. The vehicle-mounted internet system can provide voice communication services, positioning services, navigation services, mobile internet access, vehicle emergency rescue, vehicle data and management services, in-vehicle entertainment services, and the like.
The structure of the vehicle-mounted internet operating system provided by the present application is described in detail below. Fig. 10 is a structural diagram of the vehicle-mounted internet operating system provided by an embodiment of the present application. As shown in Fig. 10, the vehicle-mounted internet operating system provided by the present application includes a microphone control component 1001 and a control component 1002, wherein:
the microphone control component 1001 is configured to control the microphone array to collect a mixed voice signal; and
the control component 1002 is configured to control the separation of the voice signals from different directions in the received mixed voice signal, obtaining multiple channels of voice signals, and to perform parallel recognition on some or all of the multiple channels of voice signals, where parallel recognition includes: for some or all of the channels, dividing each channel's voice signal into multiple recognition units to be recognized, each recognition unit including multiple consecutive frames.
Further, the vehicle-mounted internet operating system can, through the above microphone control component 1001 and control component 1002, or on the basis of the microphone control component 1001 and control component 1002 in combination with other units, control the corresponding components to execute the processing method of a voice signal of Fig. 6 above.
It should be clear that the present application is not limited to the specific configurations and processing described above and shown in the figures. For brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps have been described and illustrated as examples, but the method processes of the present application are not limited to the specific steps described and illustrated; those skilled in the art can make various changes, modifications, and additions, or change the order of the steps, after understanding the spirit of the present application.
It should also be noted that the exemplary embodiments mentioned in the present application describe some methods or systems on the basis of a series of steps or devices. However, the present application is not limited to the order of the above steps; that is, the steps may be executed in the order mentioned in the embodiments, the order may differ from that in the embodiments, or several steps may be performed simultaneously.
The above are only specific embodiments of the present application. It is apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, modules, and units described above can refer to the corresponding processes in the foregoing method embodiments and are not repeated here. It should be understood that the protection scope of the present application is not limited thereto; any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions shall all fall within the protection scope of the present application.

Claims (26)

1. A processing method of a voice signal, wherein the method includes:
separating, by a processing component, the voice signals from different directions in a received mixed voice signal to obtain multiple channels of voice signals; and
performing, by the processing component, parallel recognition on some or all of the multiple channels of voice signals, wherein the parallel recognition includes: for some or all of the channels, dividing each channel's voice signal into multiple recognition units to be recognized, each recognition unit including multiple consecutive frames.
2. The method according to claim 1, wherein dividing each channel's voice signal into multiple recognition units to be recognized includes:
performing framing processing on each channel's voice signal to obtain multiple frames of voice data;
in the multiple frames of voice data, choosing one frame out of every preset number of frames as a target frame of voice data; and
recognizing each channel's voice signal using, as a recognition unit, the target frame of voice data together with the frames of voice data adjacent to the target frame.
3. The method according to claim 1, wherein separating, by the processing component, the multiple channels of voice signals from different directions in the received mixed voice signal includes:
determining the azimuth information of each voice signal in the mixed voice signal; and
based on the azimuth information of each voice signal, separating out the multiple channels of voice signals from different directions in the mixed voice signal.
4. The method according to claim 1, wherein after the processing component separates the multiple channels of voice signals from different directions in the received mixed voice signal and obtains the multiple channels of voice signals, and before the parallel recognition is performed on some or all of the multiple channels of voice signals, the method further includes:
performing, by the processing component, beamforming processing and signal enhancement processing on each channel's voice signal.
5. The method according to any one of claims 1-4, wherein the method further includes:
detecting in parallel whether the recognition result of each channel's voice signal contains a wake word; and
when the recognition result of any channel's voice signal is detected to contain the wake word, sending the recognition result containing the wake word to a voice control system.
6. The method according to claim 5, wherein the method further includes:
determining the azimuth information of the voice signal corresponding to the recognition result containing the wake word;
within a preset duration, directionally collecting the voice signal from said azimuth, and performing beamforming processing and signal enhancement processing on the collected voice signal; and
sending the beamformed and enhanced voice signal to the voice control system.
7. The method according to claim 6, wherein, after being woken up by the recognition result containing the wake word, the voice control system executes a corresponding function according to the voice signal received.
8. The method according to claim 1, wherein the multiple channels of voice signals include: the voice signal of the main driver and the voice signal of the copilot.
9. a kind of processing component of voice signal, which is characterized in that described device includes:
Speech processing module, the speech signal separation from different direction in the mixing voice signal for will receive, obtains Multi-path voice signal;
Identification module, for progress parallelism recognition some or all of in the multi-path voice signal, wherein the parallel knowledge Do not include: to some or all of in the multi-path voice signal, respectively by every road voice signal be divided into multiple recognition units with It is identified, wherein each recognition unit includes continuous multiframe.
10. component according to claim 9, which is characterized in that the identification module is specifically used for:
Sub-frame processing is carried out to every road voice signal respectively, obtains multiframe voice data;
In the multiframe voice data, a frame is chosen from every preset quantity frame voice data as target frame voice data;
It is that identification is single with the multiframe voice data adjacent with the target frame voice data and the target frame voice data Position, identifies every road voice signal.
11. component according to claim 9, which is characterized in that the speech processing module is specifically used for:
Determine the azimuth information of the road mixing voice signal Zhong Mei voice signal;
Based on the azimuth information of every road voice signal, the multi-path voice in the mixing voice signal from different direction is believed It number is separated.
12. component according to claim 9, which is characterized in that described device further includes signal enhancing module, for pair Every road voice signal carries out beam forming processing and signal enhancing processing.
13. the component according to any one of claim 9-12, which is characterized in that described device further include:
Detection module, for whether including wake-up word in the recognition result of the every road voice signal of parallel detection;
First sending module will include institute when in the recognition result for detecting any road voice signal comprising waking up word The recognition result for stating wake-up word is sent to speech control system.
14. The component according to claim 13, wherein the device further comprises:
an orientation determining module, configured to determine azimuth information of the voice signal corresponding to the recognition result containing the wake-up word;
an acquisition module, configured to directionally acquire, within a preset duration, the voice signal from the direction indicated by the azimuth information, and to perform beamforming processing and signal enhancement processing on the acquired voice signal;
a second sending module, configured to send the voice signal processed by beamforming and signal enhancement to the speech control system.
15. The component according to claim 14, wherein the speech control system, after being woken up by the recognition result containing the wake-up word, executes a corresponding function according to the received voice signal.
16. The component according to claim 9, wherein the multiple channels of voice signals comprise: a voice signal of the driver seat and a voice signal of the front passenger seat.
17. A voice signal processing device, comprising a memory and a processor; wherein the memory is configured to store executable program code, and the processor is configured to read the executable program code stored in the memory to execute the method according to any one of claims 1-8.
18. A computer-readable storage medium having computer program instructions stored thereon, wherein the method according to any one of claims 1-8 is implemented when the computer program instructions are executed by a processor.
19. A vehicle-mounted voice interaction device, comprising: a microphone array and a processor; wherein
the microphone array is configured to acquire a mixed voice signal;
the processor, communicatively connected with the microphone array, is configured to separate voice signals coming from different directions in the received mixed voice signal to obtain multiple channels of voice signals, and to perform parallel recognition on some or all of the multiple channels of voice signals, wherein the parallel recognition comprises: for some or all of the multiple channels of voice signals, dividing each channel of voice signal into multiple recognition units for recognition, wherein each recognition unit comprises multiple consecutive frames.
20. The device according to claim 19, wherein the processor is specifically configured to:
perform framing processing on each channel of voice signal to obtain multiple frames of voice data;
among the multiple frames of voice data, select one frame out of every preset number of frames as a target frame of voice data;
recognize each channel of voice signal by taking the target frame of voice data together with the frames of voice data adjacent to it as a recognition unit.
21. The device according to claim 19, wherein the processor is specifically configured to:
determine azimuth information of each voice signal in the mixed voice signal;
separate, based on the azimuth information of each voice signal, the multiple channels of voice signals coming from different directions in the mixed voice signal.
22. The device according to claim 19, wherein the processor is further configured to perform beamforming processing and signal enhancement processing on each channel of voice signal.
23. The device according to any one of claims 19-22, wherein the processor is further configured to:
detect in parallel whether the recognition result of each channel of voice signal contains a wake-up word;
when the recognition result of any channel of voice signal is detected to contain the wake-up word, send the recognition result containing the wake-up word to a speech control system.
24. The device according to claim 23, wherein the processor is further configured to:
determine azimuth information of the voice signal corresponding to the recognition result containing the wake-up word;
directionally acquire, within a preset duration, the voice signal from the direction indicated by the azimuth information, and perform beamforming processing and signal enhancement processing on the acquired voice signal;
send the voice signal processed by beamforming and signal enhancement to the speech control system.
25. The device according to claim 24, wherein the speech control system, after being woken up by the recognition result containing the wake-up word, executes a corresponding function according to the received voice signal.
26. A vehicle-mounted Internet control system, comprising: a microphone control component and a control component; wherein
the microphone control component is configured to control a microphone array to acquire a mixed voice signal;
the control component is configured to control separating voice signals coming from different directions in the received mixed voice signal to obtain multiple channels of voice signals, and performing parallel recognition on some or all of the multiple channels of voice signals, wherein the parallel recognition comprises: for some or all of the multiple channels of voice signals, dividing each channel of voice signal into multiple recognition units for recognition, wherein each recognition unit comprises multiple consecutive frames.
CN201710850441.4A 2017-09-15 2017-09-15 Voice signal processing method, assembly, equipment and medium Active CN109509465B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710850441.4A CN109509465B (en) 2017-09-15 2017-09-15 Voice signal processing method, assembly, equipment and medium


Publications (2)

Publication Number Publication Date
CN109509465A true CN109509465A (en) 2019-03-22
CN109509465B CN109509465B (en) 2023-07-25

Family

ID=65745190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710850441.4A Active CN109509465B (en) 2017-09-15 2017-09-15 Voice signal processing method, assembly, equipment and medium

Country Status (1)

Country Link
CN (1) CN109509465B (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040024592A1 (en) * 2002-08-01 2004-02-05 Yamaha Corporation Audio data processing apparatus and audio data distributing apparatus
CN101923855A (en) * 2009-06-17 2010-12-22 复旦大学 Test-irrelevant voice print identifying system
CN103733258A (en) * 2011-08-24 2014-04-16 索尼公司 Encoding device and method, decoding device and method, and program
CN103794212A (en) * 2012-10-29 2014-05-14 三星电子株式会社 Voice recognition apparatus and voice recognition method thereof
KR101681988B1 (en) * 2015-07-28 2016-12-02 현대자동차주식회사 Speech recognition apparatus, vehicle having the same and speech recongition method
US20170076726A1 (en) * 2015-09-14 2017-03-16 Samsung Electronics Co., Ltd. Electronic device, method for driving electronic device, voice recognition device, method for driving voice recognition device, and non-transitory computer readable recording medium
US9734822B1 (en) * 2015-06-01 2017-08-15 Amazon Technologies, Inc. Feedback based beamformed signal selection
CN107924687A (en) * 2015-09-23 2018-04-17 三星电子株式会社 Speech recognition apparatus, the audio recognition method of user equipment and non-transitory computer readable recording medium


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862963A (en) * 2019-04-12 2020-10-30 阿里巴巴集团控股有限公司 Voice wake-up method, device and equipment
CN110021298A (en) * 2019-04-23 2019-07-16 广州小鹏汽车科技有限公司 A kind of automotive voice control system
CN110223693A (en) * 2019-06-21 2019-09-10 北京猎户星空科技有限公司 A kind of robot control method, device, electronic equipment and readable storage medium storing program for executing
CN110223693B (en) * 2019-06-21 2021-08-20 北京猎户星空科技有限公司 Robot control method and device, electronic equipment and readable storage medium
CN112289335A (en) * 2019-07-24 2021-01-29 阿里巴巴集团控股有限公司 Voice signal processing method and device and pickup equipment
CN110954866B (en) * 2019-11-22 2022-04-22 达闼机器人有限公司 Sound source positioning method, electronic device and storage medium
CN110954866A (en) * 2019-11-22 2020-04-03 达闼科技成都有限公司 Sound source positioning method, electronic device and storage medium
CN111816180A (en) * 2020-07-08 2020-10-23 北京声智科技有限公司 Method, device, equipment, system and medium for controlling elevator based on voice
CN111816180B (en) * 2020-07-08 2022-02-08 北京声智科技有限公司 Method, device, equipment, system and medium for controlling elevator based on voice
CN113327608A (en) * 2021-06-03 2021-08-31 阿波罗智联(北京)科技有限公司 Voice processing method and device for vehicle, electronic equipment and medium
CN113327608B (en) * 2021-06-03 2022-12-09 阿波罗智联(北京)科技有限公司 Voice processing method and device for vehicle, electronic equipment and medium
WO2023168713A1 (en) * 2022-03-11 2023-09-14 华为技术有限公司 Interactive speech signal processing method, related device and system
CN117275480A (en) * 2023-09-20 2023-12-22 镁佳(北京)科技有限公司 Full duplex intelligent voice dialogue method and device

Also Published As

Publication number Publication date
CN109509465B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN109509465A (en) Processing method, component, equipment and the medium of voice signal
CN110556103B (en) Audio signal processing method, device, system, equipment and storage medium
CN110491403B (en) Audio signal processing method, device, medium and audio interaction equipment
US10602267B2 (en) Sound signal processing apparatus and method for enhancing a sound signal
EP3248189B1 (en) Environment adjusted speaker identification
US11158333B2 (en) Multi-stream target-speech detection and channel fusion
CN109949810A (en) A kind of voice awakening method, device, equipment and medium
CN108922553B (en) Direction-of-arrival estimation method and system for sound box equipment
KR20190095181A (en) Video conference system using artificial intelligence
EP3714452B1 (en) Method and system for speech enhancement
EP2987312B1 (en) System and method for acoustic echo cancellation
CN106782563B (en) Smart home voice interaction system
US20190355354A1 (en) Method, apparatus and system for speech interaction
US11869481B2 (en) Speech signal recognition method and device
US10629226B1 (en) Acoustic signal processing with voice activity detector having processor in an idle state
KR20190098110A (en) Intelligent Presentation Method
CN110673096B (en) Voice positioning method and device, computer readable storage medium and electronic equipment
EP3057097B1 (en) Time zero convergence single microphone noise reduction
US20210390952A1 (en) Robust speaker localization in presence of strong noise interference systems and methods
US20200211560A1 (en) Data Processing Device and Method for Performing Speech-Based Human Machine Interaction
CN110261816A (en) Voice Wave arrival direction estimating method and device
US10157611B1 (en) System and method for speech enhancement in multisource environments
Zhang et al. Robust DOA Estimation Based on Convolutional Neural Network and Time-Frequency Masking.
US11528571B1 (en) Microphone occlusion detection
CN113077779A (en) Noise reduction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant