CN109509465B - Voice signal processing method, assembly, equipment and medium - Google Patents

Voice signal processing method, assembly, equipment and medium

Info

Publication number
CN109509465B
Authority
CN
China
Prior art keywords
voice
signal
recognition
voice signals
signals
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710850441.4A
Other languages
Chinese (zh)
Other versions
CN109509465A (en)
Inventor
都家宇
田彪
雷鸣
姚海涛
刘勇
黄雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201710850441.4A priority Critical patent/CN109509465B/en
Publication of CN109509465A publication Critical patent/CN109509465A/en
Application granted granted Critical
Publication of CN109509465B publication Critical patent/CN109509465B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 Voice signal separating
    • G10L 21/028 Voice signal separating using properties of sound source
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The embodiments of the present application disclose a voice signal processing method, component, device, and medium for improving the flexibility of voice control. The method comprises the following steps: a processing component separates voice signals coming from different directions in a received mixed voice signal to obtain multiple paths of voice signals; the processing component performs parallel recognition on some or all of the multiple paths of voice signals, where the parallel recognition comprises: for some or all of the multiple paths of voice signals, dividing each path of voice signal into a plurality of recognition units for recognition, where each recognition unit comprises a plurality of consecutive frames.

Description

Voice signal processing method, assembly, equipment and medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method, a component, an apparatus, and a computer readable storage medium for processing a voice signal.
Background
With the continuous development of voice recognition technology, intelligent voice control systems have advanced rapidly; through voice recognition, such systems can execute corresponding functions quickly, accurately, and effectively.
In the existing intelligent voice control system, after voice signals are collected, target data matched with the semantics of the voice signals can be searched in a database of the intelligent voice control system, and then corresponding functions are controlled and executed according to control instructions corresponding to the searched target data.
However, the existing voice control system can only perform a corresponding function in response to a voice signal of a single user, and lacks flexibility.
Disclosure of Invention
The embodiment of the application provides a processing method, a processing component, processing equipment and a computer readable storage medium for voice signals, which are used for improving the flexibility of voice control.
According to a first aspect of an embodiment of the present application, there is provided a method for processing a speech signal, including:
the processing component separates the voice signals from different directions in the received mixed voice signals to obtain multiple paths of voice signals;
the processing component performs parallel recognition on some or all of the multiple paths of voice signals, where the parallel recognition comprises: for some or all of the multiple paths of voice signals, dividing each path of voice signal into a plurality of recognition units for recognition, where each recognition unit comprises a plurality of consecutive frames.
According to a second aspect of embodiments of the present application, there is provided a processing component for a speech signal, including:
the voice processing module is used for separating voice signals from different directions in the received mixed voice signals to obtain multiple paths of voice signals;
the recognition module is used for performing parallel recognition on some or all of the multiple paths of voice signals, where the parallel recognition comprises: for some or all of the multiple paths of voice signals, dividing each path of voice signal into a plurality of recognition units for recognition, where each recognition unit comprises a plurality of consecutive frames.
According to a third aspect of embodiments of the present application, there is provided a processing apparatus for a speech signal, including: a memory and a processor; the memory is used for storing executable program codes; the processor is configured to read executable program code stored in the memory to perform the above-described method of processing a speech signal.
According to a fourth aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method of processing a speech signal.
According to a fifth aspect of embodiments of the present application, there is provided a vehicle-mounted voice interaction device, the device including: a microphone array and a processor; wherein,
The microphone array is used for collecting mixed voice signals;
the processor is communicatively connected to the microphone array and is used for separating voice signals from different directions in the received mixed voice signal to obtain multiple paths of voice signals and performing parallel recognition on some or all of the multiple paths of voice signals, where the parallel recognition comprises: for some or all of the multiple paths of voice signals, dividing each path of voice signal into a plurality of recognition units for recognition, where each recognition unit comprises a plurality of consecutive frames.
According to a sixth aspect of the embodiments of the present application, there is provided an in-vehicle internet control system, including: a microphone control component and a control component; wherein,
the microphone control component is used for controlling the microphone array to collect the mixed voice signals;
the control component is used for controlling the separation of voice signals from different directions in the received mixed voice signal to obtain multiple paths of voice signals and performing parallel recognition on some or all of the multiple paths of voice signals, where the parallel recognition comprises: for some or all of the multiple paths of voice signals, dividing each path of voice signal into a plurality of recognition units for recognition, where each recognition unit comprises a plurality of consecutive frames.
According to the voice signal processing method, component, device, and computer-readable storage medium of the embodiments of the present application, voice signals coming from different directions in a received mixed voice signal are separated to obtain multiple paths of voice signals, and some or all of the multiple paths of voice signals are recognized in parallel, where the parallel recognition comprises: for some or all of the multiple paths of voice signals, dividing each path of voice signal into a plurality of recognition units for recognition, where each recognition unit comprises a plurality of consecutive frames. In this technical solution, when each path of voice signal among some or all of the voice signals is recognized, it is divided into a plurality of recognition units, which effectively reduces the number of recognition operations and in turn the central processing unit (Central Processing Unit, CPU) resources occupied when recognizing each path of voice signal, so that some or all of the multiple paths of voice signals can be recognized in parallel. Furthermore, a voice interaction device adopting the technical solution of the embodiments of the present application can recognize multiple paths of voice signals in parallel; compared with the prior art, which can only respond to the voice signal of a single user, this greatly improves the flexibility of voice control.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 shows a schematic diagram of semantic recognition in the prior art;
FIG. 2 shows a schematic diagram of semantic recognition in an embodiment of the present application;
fig. 3 is a schematic diagram illustrating an application scenario of a processing method of a speech signal according to an embodiment of the present application;
fig. 4 shows a schematic diagram of beamforming based on a microphone array in an embodiment of the present application;
fig. 5 shows a schematic diagram of another application scenario of a processing method of a speech signal according to an embodiment of the present application;
fig. 6 shows a schematic flow chart of a method of processing a speech signal according to an embodiment of the present application;
FIG. 7 shows a schematic structural diagram of a processing component of a speech signal according to an embodiment of the present application;
FIG. 8 illustrates a block diagram of an exemplary hardware architecture of a computing device capable of implementing the processing methods and components of speech signals according to embodiments of the present application;
Fig. 9 shows a schematic structural diagram of a vehicle-mounted voice interaction device according to an embodiment of the present application;
fig. 10 shows a schematic structural diagram of the in-vehicle internet control system of the embodiment of the present application.
Detailed Description
Features and exemplary embodiments of various aspects of the present application are described in detail below. In order to make the objects, technical solutions, and advantages of the present application more apparent, the present application is described in further detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely intended to explain the present application and are not intended to limit it. It will be apparent to one skilled in the art that the present application may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present application by showing examples of the present application.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It should be noted that, when identifying the voice signal, the method may include, but is not limited to: semantic recognition, context recognition, mood recognition, etc. The embodiment of the application is described by taking semantic recognition as an example in an intelligent voice control system.
The existing intelligent voice control system, when executing corresponding functions according to received voice signals, includes: the system comprises a semantic recognition link, a semantic matching link and a control execution link. The semantic recognition link is used for carrying out semantic recognition on the voice signal after the voice signal is acquired, and recognizing the semantics contained in the voice signal; the semantic matching link is to search target data matched with the semantic recognition result from a database of the intelligent voice control system based on the semantic recognition result of the voice signal; and the control execution link refers to controlling the equipment to execute corresponding functions according to the control instruction corresponding to the searched target data.
In the existing intelligent voice control system, after a voice signal is acquired in a semantic recognition link, firstly framing the voice signal, then recognizing each frame of voice data in the voice signal, and further determining the semantics contained in the voice signal according to the recognition result of each frame of voice data.
For example, as shown in fig. 1, the speech signal after framing includes 7 frames of speech data, that is, 7 frames of speech data from time t=i-3 to time t=i+3, in the semantic recognition link, semantic recognition is performed on the 7 frames of speech data, and then the semantic contained in the speech signal is determined by combining the semantic recognition result of the 7 frames of speech data.
The processing procedure in the semantic recognition link occupies a large amount of CPU resources, and the CPU resources of the intelligent voice control system are often extremely limited, which tends to make the CPU resources which can be distributed to the semantic matching link and the control execution link more limited, so that the existing intelligent voice control system can only respond to the voice signals of a single user to execute corresponding functions, and the flexibility is lacking.
For example, when the existing intelligent voice control system is applied in an automobile to form a vehicle-mounted voice control system, the system can only be controlled by the primary driver, and in actual use the voice signal of the primary driver is easily interfered with by the voice signals of the secondary driver and rear-seat occupants, so the control effect of such a vehicle-mounted voice control system is often not ideal.
For another example, when the existing intelligent voice control system is applied to a smart device, for example a smart speaker, a smart TV, or an automatic shopping machine, the smart device can be voice-controlled by only one user; when several users speak at the same time or the environment is noisy, the control effect of the intelligent voice control system in the smart device is greatly reduced.
In view of this, in one embodiment, when performing semantic recognition on a voice signal, the embodiment of the present application performs semantic recognition on the collected voice signal using a Low Frame Rate (LFR) acoustic model, so as to reduce CPU resources occupied by a semantic recognition link.
In one embodiment, when an LFR acoustic model is used to semantically recognize the acquired speech signal, the speech signal is divided into a plurality of recognition units for semantic recognition, where each recognition unit comprises a plurality of consecutive frames.
In one example, the collected voice signal is first framed. After framing, one frame of voice data is selected out of every preset number of frames as a target frame, and the multi-frame voice data adjacent to the target frame together with the target frame itself serve as the recognition unit for performing semantic recognition on that target frame. Adjacent recognition units may share the same frames of voice data.
For example, as shown in fig. 2, the speech signal after framing includes N frames of speech data, and 7 frames of speech data are illustrated as examples, that is, 7 frames of speech data from time t=i-3 to time t=i+3.
When semantic recognition is performed on the voice signal, one frame out of every 3 frames of voice data is selected as a target frame; for example, among the 7 frames of voice data from t=i-3 to t=i+3, the voice data at t=i-3, t=i, and t=i+3 are selected as target frame voice data.
For the voice data at the time t=i-3, when the semantic recognition is performed, the voice data at the time t=i-6, the voice data at the time t=i-5, the voice data at the time t=i-4, the voice data at the time t=i-3, the voice data at the time t=i-2, the voice data at the time t=i-1, and the voice data at the time t=i are combined to perform the semantic recognition. Similarly, when semantic recognition is performed for the voice data at time t=i, the voice data at time t=i-3, the voice data at time t=i-2, the voice data at time t=i-1, the voice data at time t=i+1, the voice data at time t=i+2, and the voice data at time t=i+3 are combined to perform semantic recognition.
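For illustration only, the frame grouping described above can be sketched in a few lines of code. This is a minimal sketch assuming a stride of 3 frames and 3 adjacent frames on each side, as in the example of fig. 2; the function name and the 40-dimensional toy features are assumptions made for the example and are not part of this application.

```python
import numpy as np

def build_recognition_units(frames, stride=3, context=3):
    """Group framed voice data into recognition units of consecutive frames.

    frames:  array of shape (num_frames, feature_dim), one row per frame
    stride:  select one target frame out of every `stride` frames
    context: number of adjacent frames kept on each side of the target frame
    Returns a list of (target_index, unit) pairs; each unit stacks the target
    frame with its adjacent frames, clipped at the signal boundaries.
    """
    units = []
    for target in range(0, len(frames), stride):
        lo = max(0, target - context)
        hi = min(len(frames), target + context + 1)
        units.append((target, frames[lo:hi]))  # consecutive multi-frame unit
    return units

# Toy usage: 7 frames of 40-dimensional features, as in the t=i-3 .. t=i+3 example
frames = np.random.randn(7, 40)
for target, unit in build_recognition_units(frames):
    print(f"target frame {target}: unit spans {unit.shape[0]} frames")
```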
As can be seen from the semantic recognition process shown in fig. 2, the semantic recognition process shown in fig. 2 can significantly reduce the number of times or frequency of recognition during the semantic recognition, and further reduce CPU resources occupied by the semantic recognition link, compared with the semantic recognition process shown in fig. 1. Meanwhile, the semantic recognition process shown in fig. 2 can also improve the efficiency of semantic recognition because the number of frames of recognized voice data is reduced in the semantic recognition process.
In the semantic recognition process shown in fig. 2, after a target frame is selected, semantic recognition is performed on the target frame voice data using the target frame together with its adjacent multi-frame voice data as a unit. Compared with performing semantic recognition on the selected target frame alone, more voice information is taken into account for each target frame. The process shown in fig. 2 can therefore effectively ensure the accuracy of semantic recognition while reducing the recognition frequency, reducing the CPU resources occupied by the semantic recognition link, and improving semantic recognition efficiency.
Of course, in this example the recognition unit for a target frame consists of the three preceding frames, the three following frames, and the target frame itself. In other embodiments of the present application, the number of adjacent frames included in a recognition unit may be set according to the accuracy requirement of voice recognition: if the accuracy requirement is high, more adjacent frames may be included; conversely, if the accuracy requirement is low, fewer adjacent frames may be included.
In one embodiment, the CPU resources occupied by the semantic recognition link can be reduced due to the semantic recognition performed by the semantic recognition method shown in fig. 2. Therefore, after the processing scheme of the voice signal provided by the embodiment of the application adopts the semantic recognition method shown in fig. 2, the semantic recognition can be performed on multiple paths of voice signals in parallel under the condition of limited CPU resources.
Although adopting the processing method shown in fig. 2 makes it possible to perform semantic recognition on multiple voice signals in parallel, when a mixed voice signal containing voice signals from different directions is received, each voice signal first needs to be separated from the mixed voice signal to obtain multiple paths of voice signals; parallel semantic recognition is then performed on some or all of the paths, which improves the accuracy of semantic recognition of the voice signals.
Therefore, in one embodiment of the present application, a microphone array is used to separate the voice signals from different directions in the received mixed voice signal to obtain multiple paths of voice signals: each voice signal is separated from the collected mixed voice signal based on a beamforming algorithm, so that semantic recognition can be performed on each voice signal individually. This improves the accuracy of semantic recognition and thereby addresses the low recognition accuracy of the prior art when a mixed voice signal is recognized directly.
For example, as shown in fig. 3, in an on-vehicle voice control environment, a primary driver 31, a secondary driver 32, and a voice interaction device 33 are included, wherein the voice interaction device 33 includes a microphone array therein.
During the speaking of the primary driver 31 and the secondary driver 32, if the voice interaction device 33 is in an on state, the microphone array in the voice interaction device 33 will collect in real time a mixed voice signal comprising the voice signal of the primary driver 31 and the voice signal of the secondary driver 32. Of course, in an actual vehicle environment, the mixed voice signal may also include the voice signal of the rear seat occupant and the environmental noise.
When the microphone array collects the mixed speech signal, the primary driver 31 and the secondary driver 32 are in different orientations relative to the voice interaction device 33, so from the point of view of the microphone array their speech signals come from different directions. Based on this, the microphone array can form beams in different directions, pick up the voice signal inside each beam, and suppress noise outside the beam, thereby separating the individual voice signals from the mixture and enhancing them.
In one example, as shown in fig. 4, after the microphone array collects a mixed voice signal including the voice signal of the primary driver 31 and the voice signal of the secondary driver 32, the collected mixed voice signal is first preprocessed; a phase-transform-weighted generalized cross-correlation algorithm is then used to find the time delay difference of each voice signal relative to a reference signal; and finally, based on the calculated time delay differences, beams are formed by a delay-and-sum beamforming algorithm.
In one example, preprocessing includes framing, silence detection, and Hamming windowing. Since the speech signal is non-stationary, its characteristics vary over time; within a very short period, however, the speech signal can be considered to have relatively stable characteristics, i.e. it exhibits short-time stationarity. Therefore, when processing a voice signal, it is generally divided into short frames.
The purpose of silence detection is to remove silent frames from the voice signal; this eliminates the influence of silent frames on the recognition of neighbouring frames, reduces unnecessary computation, and improves computational efficiency.
In addition, framing the speech signal is equivalent to truncating the time-domain signal with a rectangular window, and since multiplication in the time domain corresponds to convolution in the frequency domain, this rectangular truncation causes spectral leakage in the frequency domain; a Hamming window therefore needs to be applied to mitigate the spectral leakage.
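As an illustration of this preprocessing stage, the sketch below frames a signal, drops low-energy (silent) frames, and applies a Hamming window. The 16 kHz sample rate, 25 ms frame length, 10 ms shift, and energy threshold are assumed values chosen for the example and are not fixed by this application.

```python
import numpy as np

def preprocess(signal, sample_rate=16000, frame_ms=25, shift_ms=10,
               silence_ratio=0.05):
    """Frame the signal, drop near-silent frames, and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)

    # Framing: cut the time-domain signal into overlapping short frames.
    frames = np.array([signal[start:start + frame_len]
                       for start in range(0, len(signal) - frame_len + 1, shift)])

    # Silence detection: drop frames whose energy falls below a fraction of
    # the maximum frame energy.
    energy = np.sum(frames ** 2, axis=1)
    frames = frames[energy > silence_ratio * energy.max()]

    # Hamming window to mitigate the spectral leakage of rectangular truncation.
    return frames * np.hamming(frame_len)

# Usage with one second of synthetic audio
windowed_frames = preprocess(np.random.randn(16000))
print(windowed_frames.shape)
```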
After the speech signal is preprocessed, microphone-array-based speech enhancement requires sound source localization to estimate the position or direction of the desired sound source, after which an enhancement algorithm is used to obtain the enhanced speech signal.
In one example, sound source localization based on a time delay difference (Time Difference of Arrival, TDOA) estimation method is illustrated.
The common delay difference estimation method comprises the following steps: generalized cross correlation (Generalized Cross Correlation, GCC) method, linear regression (Linear Regression, LR) method, least mean square (Least Mean Square, LMS) adaptive method, and the like. The GCC method is described below as an example.
The GCC method first calculates the cross-power spectrum of a pair of microphone signals, multiplies it by the corresponding weight, and finally performs an inverse Fourier transform to obtain the cross-correlation function of the signals; the time corresponding to its peak is the time difference of arrival τi for that pair of microphones.
The performance of the GCC method depends on the chosen weighting function, of which the most representative are the maximum likelihood (Maximum Likelihood, ML) weighting and the phase transform (PHAT) weighting.
In the ideal case, maximum likelihood weighting achieves the optimal estimate, but it requires the power spectra of the sound source signal and the noise to be known, a condition that is difficult to satisfy in practical applications. Phase transform weighting drops this requirement and sharpens the cross-correlation function by normalizing the cross-power spectrum, making the peak prominent and better suppressing spurious cross-correlation peaks. In addition, phase transform weighting is robust in reverberant environments.
For an ideal free sound field environment, when the autocorrelation function of the sound source signal is the maximum value, the cross-correlation function is also the maximum value, so that only the maximum value of the cross-correlation function is found out during calculation, and the corresponding time is the time delay difference.
In a reverberant environment, the superposition of numerous reflected signals produces multiple peaks in the correlation function; this problem can be addressed by adopting the phase-transform-weighted generalized cross-correlation (GCC-PHAT) algorithm.
The GCC-PHAT algorithm does not directly calculate a cross-correlation function in the time domain, but utilizes the corresponding relation between the cross-correlation function of signals in the time domain and the cross-power spectrum function of signals in the frequency domain to firstly calculate the cross-power spectrum density between two voice signals, then performs PHAT weighting, finally obtains a generalized cross-correlation function through inverse Fourier transform, and further obtains a corresponding time delay difference.
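A compact sketch of the GCC-PHAT computation described above is given below, assuming two single-channel signals sampled at 16 kHz. It is an illustration of the frequency-domain procedure under these assumptions, not an implementation taken from this application.

```python
import numpy as np

def gcc_phat(sig, ref, fs=16000):
    """Estimate the time delay of `sig` relative to `ref` using GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross_spectrum = SIG * np.conj(REF)               # cross-power spectrum
    cross_spectrum /= np.abs(cross_spectrum) + 1e-12  # PHAT weighting (keep phase only)
    cc = np.fft.irfft(cross_spectrum, n=n)            # generalized cross-correlation

    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # re-centre zero lag
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs  # time delay difference in seconds

# Usage: a copy of the reference delayed by 5 samples
ref = np.random.randn(1024)
sig = np.roll(ref, 5)
print(gcc_phat(sig, ref))  # approximately 5 / 16000 s
```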
The delay-and-sum beamforming (DSB) algorithm uses the delay differences τi obtained by GCC-PHAT: it first applies delay compensation to the voice signal on each microphone channel so that the signals received by the microphones are aligned on the time axis, and then applies uniform weighting and summation to obtain the output signal.
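The delay compensation and uniform summation can be sketched as follows; the delays are assumed to come from a GCC-PHAT estimate such as the one above, and circular shifting is used only to keep the example short.

```python
import numpy as np

def delay_and_sum(channels, delays, fs=16000):
    """Align each microphone channel by its estimated delay and average them.

    channels: list of equal-length 1-D arrays, one per microphone
    delays:   delay of each channel relative to the reference, in seconds
    """
    aligned = [np.roll(x, -int(round(tau * fs)))  # compensate the channel delay
               for x, tau in zip(channels, delays)]
    return np.mean(aligned, axis=0)               # uniform weighting and summation

# Usage: two channels where the second lags the first by 5 samples
ch0 = np.random.randn(1024)
ch1 = np.roll(ch0, 5)
output = delay_and_sum([ch0, ch1], delays=[0.0, 5 / 16000])
```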
In the beam forming process, the azimuth information of each path of voice signal can be determined according to the relation between the energy of each azimuth beam and the phase difference.
In one example, the microphone array may separate the voice signal of the primary driver 31 and the voice signal of the secondary driver 32 from the collected mixed voice signal based on the determined position information after determining the position information of the voice signal of the primary driver 31 and the voice signal of the secondary driver 32 in the mixed voice signal with respect to the microphone array.
In one example, after the voice signal of the primary driver 31 and the voice signal of the secondary driver 32 are separated from the mixed voice signal, the voice signal of the primary driver 31 and the voice signal of the secondary driver 32 may also be subjected to a beam forming process and a signal enhancing process, respectively. For example, the signal enhancement processing may include, but is not limited to: signal amplification processing, noise reduction processing, and the like.
In one example, after separating the voice signal of the primary driver 31 and the voice signal of the secondary driver 32 from the collected mixed voice signals, the voice signals of the primary driver 31 and the voice signals of the secondary driver 32 may be semantically recognized in parallel. In particular, when the voice signal of the primary driver 31 and the voice signal of the secondary driver 32 are subjected to semantic recognition, the semantic recognition method shown in fig. 2 may be adopted to reduce CPU resources occupied by the semantic recognition.
In one embodiment, after the mixed voice signals including the voice signals from different directions are collected through the microphone array, the beamforming algorithm shown in fig. 4 may be used to determine the direction information of each voice signal, separate each voice signal from the mixed voice signals based on the determined direction information of each voice signal, obtain multiple voice signals, and further perform semantic recognition on part or all of the separated multiple voice signals by using the semantic recognition method shown in fig. 2.
In one example, if the mixed voice signal collected by the microphone array includes the voice signal of the primary driver, the voice signal of the secondary driver, and environmental noise, then after the microphone array separates these from the mixed voice signal, only the voice signal of the primary driver and the voice signal of the secondary driver are semantically recognized; the separated environmental noise obviously contains no valuable information and is not recognized, which further reduces the CPU resources occupied by the semantic recognition link.
In one embodiment, after the voice signal of the primary driver 31 and the voice signal of the secondary driver 32 are semantically recognized in parallel, a plurality of wake-up engines in the voice interaction device 33 may detect in parallel whether the semantic recognition result of the voice signal of the primary driver 31 and the semantic recognition result of the voice signal of the secondary driver 32 contain a wake-up word.
In one example, wake-up words refer to passwords or commands that activate a voice control system in the voice interaction device 33, which may be predefined specific words, specific sentences, specific signals, or the like. For example, the wake-up word is "hello zebra".
In one example, when the wake-up words are included in the voice signal of a user (primary driver or secondary driver) detected by the plurality of wake-up engines in the voice interaction device 33, the voice control system in the voice interaction device 33 is awakened by the wake-up words, and voice control is performed according to the voice signal of the user in a subsequent preset time period.
In one embodiment, the plurality of wake-up engines in the voice interaction device 33 may all be connected to the voice control system. In actual use, which wake-up engine's semantic recognition result is actually sent to the voice control system is determined by whether that result contains a wake-up word: whichever path's semantic recognition result is detected to contain the wake-up word is the one forwarded to the voice control system.
For example, after the semantic recognition of the voice signal of the primary driver 31 and the voice signal of the secondary driver 32 in parallel, the semantic recognition result of the voice signal of the primary driver 31 and the semantic recognition result of the voice signal of the secondary driver 32 may be detected in parallel by a plurality of wake-up engines in the voice interaction device 33.
If the semantic recognition result of the voice signal of the main driver 31 is detected to contain the wake-up word, the semantic recognition result of the voice signal of the main driver 31 is sent to a voice control system, and then the voice control is carried out by the main driver 31; if the semantic recognition result of the voice signal of the secondary driver 32 is detected to include the wake-up word, the semantic recognition result of the voice signal of the secondary driver 32 is sent to the voice control system, and the voice control is further performed by the secondary driver 32.
In one embodiment, for convenience, after the primary driver 31 or the secondary driver 32 wakes up the voice control system (for example, the primary driver 31), voice from the azimuth of the primary driver 31 may be collected directionally for a preset duration after the current time, based on the azimuth information of the primary driver 31 determined by the microphone array; the collected voice signal is then subjected to beamforming and signal enhancement, and the processed voice signal is sent to the voice control system. The preset duration may be set according to an empirical value, for example 30 seconds.
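As an illustration of this directional collection step, the sketch below fixes the beam at the waker's azimuth for a preset duration. The `capture_beam`, `enhance`, and `send_to_voice_control` callables are hypothetical placeholders for the device's microphone-array, enhancement, and voice-control interfaces, which this application does not name.

```python
import time

PRESET_DURATION_S = 30  # empirical value mentioned above

def directional_session(azimuth, capture_beam, enhance, send_to_voice_control):
    """After wake-up, collect speech only from the waker's azimuth for a fixed period."""
    deadline = time.time() + PRESET_DURATION_S
    while time.time() < deadline:
        raw = capture_beam(azimuth)           # pick up speech from this direction only
        send_to_voice_control(enhance(raw))   # forward the beamformed, enhanced signal
```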
The method for processing a voice signal provided by the embodiments of the present application has been described above in connection with the vehicle-mounted environment; the embodiments of the present application can also be used in other smart devices that include a voice control system. Such smart devices may include, but are not limited to: smart speakers, smart TVs, and automatic shopping machines.
For example, taking a smart speaker as an example, as shown in fig. 5, in a smart home environment, including a smart speaker 50, a user 51 and a user 52, the smart speaker 50 includes a microphone array for collecting voice signals, a semantic recognition system, a semantic detection system and a voice control system.
In particular use, both user 51 and user 52 are within the recognition range of the smart speaker 50 and send control commands to the smart speaker 50 by voice.
The microphone array in the smart speaker 50 collects a mixed voice signal including the voice signal of the user 51 and the voice signal of the user 52, then determines the azimuth information of the user 51 and the user 52 based on the beam forming algorithm, separates the voice signal of the user 51 and the voice signal of the user 52 from the mixed voice signal according to the determined azimuth information of the user 51 and the azimuth information of the user 52, and then transmits the voice signal of the user 51 and the voice signal of the user 52 to the semantic recognition system.
After receiving the voice signal of the user 51 and the voice signal of the user 52, the semantic recognition system in the smart speaker 50 performs semantic recognition on the voice signal of the user 51 and the voice signal of the user 52 in parallel, and then sends the semantic recognition result of the voice signal of the user 51 and the semantic recognition result of the voice signal of the user 52 to the semantic detection system.
After the semantic detection system in the intelligent sound box 50 receives the semantic recognition result of the voice signal of the user 51 and the semantic recognition result of the voice signal of the user 52, two wake-up engines are started, and whether the semantic recognition result of the voice signal of the user 51 and the semantic recognition result of the voice signal of the user 52 contain wake-up words or not is detected in parallel. For example, the wake-up engine 1 and the wake-up engine 2 are started, the wake-up engine 1 and the wake-up engine 2 run in parallel, the wake-up engine 1 detects whether the semantic recognition result of the voice signal of the user 51 contains a wake-up word, and the wake-up engine 2 detects whether the semantic recognition result of the voice signal of the user 52 contains a wake-up word.
If the awakening engine 1 detects that the semantic recognition result of the voice signal of the user 51 contains the awakening word, the awakening engine 1 sends the semantic recognition result of the voice signal of the user 51 to a voice control system; if the wake-up engine 2 detects that the semantic recognition result of the voice signal of the user 52 includes a wake-up word, the wake-up engine 2 sends the semantic recognition result of the voice signal of the user 52 to the voice control system.
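The parallel wake-word check performed by wake engine 1 and wake engine 2 can be sketched as below. The wake word "hello zebra" is taken from the earlier example; the `send_to_voice_control` callback is a hypothetical stand-in for the voice control system interface, which this application does not specify.

```python
from concurrent.futures import ThreadPoolExecutor

WAKE_WORD = "hello zebra"

def wake_engine(recognition_result: str) -> bool:
    """One wake engine: check whether a semantic recognition result contains the wake word."""
    return WAKE_WORD in recognition_result.lower()

def detect_and_route(recognition_results, send_to_voice_control):
    """Run one wake engine per path in parallel; forward only results containing the wake word."""
    with ThreadPoolExecutor(max_workers=len(recognition_results)) as pool:
        hits = list(pool.map(wake_engine, recognition_results))
    for path, (hit, result) in enumerate(zip(hits, recognition_results)):
        if hit:
            send_to_voice_control(path, result)

# Usage: user 51 says the wake word, user 52 does not
detect_and_route(
    ["hello zebra, play some music", "what is the weather tomorrow"],
    lambda path, text: print(f"path {path} wakes the voice control system: {text}"),
)
```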
After receiving the semantic recognition result sent by the semantic detection system, the voice control system in the intelligent sound box 50 searches target data matched with the semantic recognition result in the database according to the semantic recognition result, and further controls the intelligent sound box 50 to execute corresponding functions according to the control instruction corresponding to the searched target data.
The voice control system searches the database for the target data matched with the semantic recognition result according to the semantic recognition result, may search the database locally stored in the intelligent sound box 50 for the target data matched with the semantic recognition result, or may upload the semantic recognition result to the cloud server or the cloud computing platform, and search the database of the cloud server or the database of the cloud computing platform for the target data matched with the semantic recognition result, which is not limited in this application.
The following describes the execution of the above-mentioned voice signal processing method in conjunction with a specific system processing flow, however, it should be noted that this specific embodiment is only for better explaining the present application, and should not be construed as unduly limiting the present application.
As shown in fig. 6, the processing method 600 of the voice signal may include the following steps in terms of overall flow:
In step S601, the processing component separates the received voice signals from different directions in the mixed voice signal, so as to obtain multiple voice signals.
In step S602, the processing component performs parallel recognition on some or all of the multiple paths of voice signals, where the parallel recognition includes: for some or all of the multiple paths of voice signals, dividing each path of voice signal into a plurality of recognition units for recognition, where each recognition unit comprises a plurality of consecutive frames.
In the embodiments of the present application, when each path of voice signal among some or all of the voice signals is recognized, it is divided into a plurality of recognition units for recognition, which effectively reduces the number of recognition operations and in turn the CPU resources occupied when recognizing each path of voice signal, so that some or all of the multiple paths of voice signals can be recognized in parallel. Furthermore, a voice interaction device adopting the technical solution of the embodiments of the present application can recognize multiple paths of voice signals in parallel; compared with the prior art, which can only respond to the voice signal of a single user, this greatly improves the flexibility of voice control.
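A minimal end-to-end sketch of steps S601 and S602 is shown below. The `separate_mixed_signal` and `recognize_path` functions are hypothetical stand-ins for the beamforming-based separation and the recognition-unit-based recognizer described earlier, and the thread pool simply illustrates that the separated paths are recognized in parallel; none of these names come from this application.

```python
from concurrent.futures import ThreadPoolExecutor

def separate_mixed_signal(mixed_signal):
    """Placeholder: split the mixed signal into per-direction voice signal paths."""
    return mixed_signal  # assumed to already be a list of per-path signals here

def recognize_path(path_signal):
    """Placeholder: recognize one path, e.g. unit by unit as in fig. 2."""
    return f"recognition result for a path of {len(path_signal)} frames"

def process(mixed_signal):
    paths = separate_mixed_signal(mixed_signal)          # step S601: separation
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        return list(pool.map(recognize_path, paths))     # step S602: parallel recognition

# Usage with two already-separated toy paths
print(process([[0.1] * 700, [0.2] * 700]))
```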
In implementation, as shown in fig. 7, the processing component 700 of the voice signal may include:
The voice processing module 701 is configured to separate voice signals from different directions in the received mixed voice signal, so as to obtain multiple voice signals.
The recognition module 702 is configured to perform parallel recognition on some or all of the multiple paths of voice signals, where the parallel recognition includes: for some or all of the multiple paths of voice signals, dividing each path of voice signal into a plurality of recognition units for recognition, where each recognition unit comprises a plurality of consecutive frames.
In one embodiment, the identification module 702 is specifically configured to: respectively carrying out framing treatment on each path of voice signal to obtain multi-frame voice data; selecting one frame from every preset number of frame voice data in multi-frame voice data as target frame voice data; and recognizing each path of voice signal by taking the multi-frame voice data adjacent to the target frame voice data and the target frame voice data as recognition units.
In one embodiment, the voice processing module 701 is specifically configured to: determining azimuth information of each path of voice signals in the mixed voice signals; and separating multiple paths of voice signals from different directions in the mixed voice signals based on the direction information of each path of voice signals.
In one embodiment, the signal enhancement module 703 is configured to perform beam forming processing and signal enhancement processing on each path of voice signal.
In one embodiment, the apparatus further comprises: the detection module 704 is configured to detect in parallel whether the recognition result of each path of voice signal includes a wake-up word; the first sending module 705 is configured to send, when detecting that the recognition result of any one of the voice signals includes a wake-up word, the recognition result including the wake-up word to the voice control system.
In one embodiment, the apparatus further comprises: the location determining module 706 is configured to determine location information of the voice signal corresponding to the recognition result including the wake-up word; the acquisition module 707 is configured to directionally acquire a voice signal of azimuth information within a preset duration, and perform beamforming processing and signal enhancement processing on the acquired voice signal; a second transmitting module 708, configured to transmit the voice signal after the beamforming processing and the signal enhancement processing to a voice control system.
In one embodiment, the voice control system performs a corresponding function according to the received voice signal after being awakened by the recognition result including the awakening word.
In one embodiment, a multi-path speech signal comprises: a voice signal of the main driving and a voice signal of the co-driving.
Fig. 8 illustrates a block diagram of an exemplary hardware architecture of a computing device capable of implementing the processing methods and components of speech signals according to embodiments of the present application. As shown in fig. 8, computing device 800 includes an input device 801, an input interface 802, a central processor 803, a memory 804, an output interface 805, and an output device 806. The input interface 802, the central processor 803, the memory 804, and the output interface 805 are connected to each other through a bus 810, and the input device 801 and the output device 806 are connected to the bus 810 through the input interface 802 and the output interface 805, respectively, and further connected to other components of the computing device 800.
Specifically, the input device 801 receives input information from the outside and transmits the input information to the central processor 803 through the input interface 802; the central processor 803 processes the input information based on computer executable instructions stored in the memory 804 to generate output information, temporarily or permanently stores the output information in the memory 804, and then transmits the output information to the output device 806 through the output interface 805; output device 806 outputs the output information to the outside of computing device 800 for use by a user.
That is, the computing device shown in fig. 8 may also be implemented as a processing device for a speech signal, which may include: a memory storing computer-executable instructions; and a processor that when executing computer-executable instructions can implement the methods and components for processing speech signals described in connection with fig. 2-7.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be embodied in whole or in part in the form of a computer program product or a computer-readable storage medium. The computer program product or computer-readable storage medium includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), etc.
In addition, in combination with the processing method of the voice signal in the above embodiment, the embodiment of the application may be implemented by providing a computer readable storage medium. The computer readable storage medium has stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement a method of processing a speech signal according to any of the above embodiments.
The application also provides a vehicle-mounted voice interaction device. Those skilled in the art will appreciate that the vehicle-mounted voice interaction device may manage and control the hardware of the voice signal processing component shown in fig. 7 or fig. 8, or the hardware of the voice signal processing device related to the present application, as well as the computer programs of the software resources related to the present application, and is system software running directly on the processing component or processing device.
The vehicle-mounted voice interaction device provided by the application can interact with other modules or functional devices on a vehicle to control the functions of the corresponding modules or functional devices.
The following describes in detail the schematic structural diagram of the vehicle-mounted voice interaction device provided by the application. Fig. 9 is a schematic structural diagram of a vehicle-mounted voice interaction device according to an embodiment of the present application. As shown in fig. 9, the vehicle-mounted voice interaction device provided by the present application includes: a microphone array 901, and a processor 902, wherein,
A microphone array 901 for collecting a mixed speech signal.
The processor 902 is communicatively connected to the microphone array 901 and is configured to separate voice signals from different directions in the received mixed voice signal to obtain multiple paths of voice signals, and to perform parallel recognition on some or all of the multiple paths of voice signals, where the parallel recognition includes: for some or all of the multiple paths of voice signals, dividing each path of voice signal into a plurality of recognition units for recognition, where each recognition unit comprises a plurality of consecutive frames.
In one embodiment, the processor 902 is specifically configured to: respectively carrying out framing treatment on each path of voice signal to obtain multi-frame voice data; selecting one frame from every preset number of frame voice data in multi-frame voice data as target frame voice data; and recognizing each path of voice signal by taking the multi-frame voice data adjacent to the target frame voice data and the target frame voice data as recognition units.
In one embodiment, the processor 902 is specifically configured to: determining azimuth information of each path of voice signals in the mixed voice signals; and separating multiple paths of voice signals from different directions in the mixed voice signals based on the direction information of each path of voice signals.
In one embodiment, the processor 902 is further configured to perform beamforming processing and signal enhancement processing on each speech signal.
In one embodiment, the processor 902 is further configured to: detecting whether the recognition result of each path of voice signal contains a wake-up word in parallel; when the recognition result of any voice signal contains the wake-up word, the recognition result containing the wake-up word is sent to the voice control system.
In one embodiment, the processor 902 is further configured to: determining azimuth information of the voice signal corresponding to the recognition result containing the wake-up word; the method comprises the steps of directionally collecting voice signals of azimuth information within a preset duration, and carrying out beam forming processing and signal enhancement processing on the collected voice signals; and sending the voice signals after the beam forming processing and the signal enhancement processing to a voice control system.
In one embodiment, the voice control system performs a corresponding function according to the received voice signal after being awakened by the recognition result including the awakening word.
Further, the vehicle-mounted voice interaction device may control the corresponding components to execute the processing method of the voice signal in fig. 6 through the microphone array 901 and the processor 902, or on the basis of the microphone array 901 and the processor 902 and in combination with other units.
The application also provides a vehicle-mounted internet operating system. Those skilled in the art will appreciate that the vehicle-mounted internet operating system may manage and control the hardware of the voice signal processing component shown in fig. 7 or fig. 8, or the hardware of the voice signal processing device related to the present application, as well as the computer programs of the software resources related to the present application, and is system software running directly on the processing component or processing device.
The vehicle-mounted internet control system provided by the application can interact with other modules or functional equipment on the vehicle to control the functions of the corresponding modules or functional equipment.
With the development of the vehicle-mounted internet control system and vehicle communication technology, a vehicle is no longer isolated from the communication network: it can connect with a server side to form a network, thereby forming the vehicle-mounted internet. The vehicle-mounted internet system can provide voice communication services, positioning services, navigation services, mobile internet access, vehicle emergency rescue, vehicle data and management services, vehicle-mounted entertainment services, and the like.
The following describes in detail the structure of the vehicle-mounted internet control system provided in the present application. Fig. 10 is a schematic structural diagram of an in-vehicle internet control system according to an embodiment of the present application. As shown in fig. 10, the vehicle-mounted internet control system provided in the present application includes: a microphone control component 1001 and a control component 1002, wherein,
A microphone control component 1001 for controlling the microphone array to collect the mixed speech signal;
the control component 1002 is configured to control the separation of voice signals from different directions in the received mixed voice signal to obtain multiple paths of voice signals, and to perform parallel recognition on some or all of the multiple paths of voice signals, where the parallel recognition includes: for some or all of the multiple paths of voice signals, dividing each path of voice signal into a plurality of recognition units for recognition, where each recognition unit comprises a plurality of consecutive frames.
Further, the vehicle-mounted internet control system may control the corresponding components to perform the processing method of the voice signal in fig. 6 through the above-mentioned microphone control component 1001 and control component 1002, or by combining other units on the basis of the above-mentioned microphone control component 1001 and control component 1002.
It should be clear that the present application is not limited to the particular arrangements and processes described above and illustrated in the drawings. For the sake of brevity, a detailed description of known methods is omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present application are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications, and additions, or change the order between steps, after appreciating the spirit of the present application.
It should also be noted that the exemplary embodiments mentioned in this application describe some methods or systems as a series of steps or devices. However, the present application is not limited to the order of the steps described above; that is, the steps may be performed in the order given in the embodiments, in a different order, or several steps may be performed simultaneously.
The foregoing describes only specific embodiments of the present application. Those skilled in the art will clearly understand that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the systems, modules, and units described above, which are not repeated here. It should be understood that the scope of protection of the present application is not limited thereto; any equivalent modifications or substitutions that a person skilled in the art can readily conceive within the technical scope disclosed in the present application are intended to be covered by the scope of the present application.

Claims (23)

1. A method of processing a speech signal, the method comprising:
the processing component separates the voice signals from different directions in the received mixed voice signals to obtain multiple paths of voice signals;
the processing component performs parallel recognition on part or all of the multiple paths of voice signals, wherein the parallel recognition comprises: for part or all of the multiple paths of voice signals, dividing each path of voice signal into a plurality of recognition units for recognition, wherein each recognition unit comprises a plurality of consecutive frames; and wherein dividing each path of voice signal into a plurality of recognition units for recognition comprises:
performing framing processing on each path of voice signal to obtain multiple frames of voice data;
selecting one frame out of every preset number of frames of voice data in the multiple frames of voice data as target frame voice data; and
recognizing each path of voice signal by taking the target frame voice data and the frames of voice data adjacent to the target frame voice data as one recognition unit.
2. The method of claim 1, wherein the processing component separating the voice signals from different directions in the received mixed voice signals comprises:
determining azimuth information of each path of voice signals in the mixed voice signals;
and separating the multiple paths of voice signals from different directions in the mixed voice signals based on the azimuth information of each path of voice signals.
3. The method of claim 1, wherein after the processing component separates the voice signals from different directions in the received mixed voice signals to obtain the multiple paths of voice signals, and before the parallel recognition is performed on part or all of the multiple paths of voice signals, the method further comprises:
the processing component performs beam forming processing and signal enhancement processing on each path of voice signals.
4. A method according to any one of claims 1-3, characterized in that the method further comprises:
detecting, in parallel, whether the recognition result of each path of voice signal contains a wake-up word;
and when the recognition result of any path of voice signal contains a wake-up word, sending the recognition result containing the wake-up word to a voice control system.
5. The method according to claim 4, wherein the method further comprises:
determining azimuth information of the voice signal corresponding to the recognition result containing the wake-up word;
directionally acquiring, within a preset duration, voice signals from the direction indicated by the azimuth information, and carrying out beam forming processing and signal enhancement processing on the acquired voice signals;
and sending the voice signals after the beam forming processing and the signal enhancement processing to the voice control system.
6. The method of claim 5, wherein, after being awakened by the recognition result containing the wake-up word, the voice control system performs a corresponding function according to the received voice signal.
7. The method of claim 1, wherein the multiple paths of voice signals comprise: a voice signal from the driver and a voice signal from the front passenger.
8. A processing assembly for a speech signal, said processing assembly comprising:
the voice processing module is used for separating voice signals from different directions in the received mixed voice signals to obtain multiple paths of voice signals;
the recognition module is used for performing parallel recognition on part or all of the multiple paths of voice signals, wherein the parallel recognition comprises: for part or all of the multiple paths of voice signals, dividing each path of voice signal into a plurality of recognition units for recognition, wherein each recognition unit comprises a plurality of consecutive frames; the recognition module is specifically configured to:
perform framing processing on each path of voice signal to obtain multiple frames of voice data;
select one frame out of every preset number of frames of voice data in the multiple frames of voice data as target frame voice data; and
recognize each path of voice signal by taking the target frame voice data and the frames of voice data adjacent to the target frame voice data as one recognition unit.
9. The assembly according to claim 8, wherein the speech processing module is specifically configured to:
determine azimuth information of each path of voice signals in the mixed voice signals;
and separate the multiple paths of voice signals from different directions in the mixed voice signals based on the azimuth information of each path of voice signals.
10. The assembly of claim 8 wherein the processing assembly further comprises a signal enhancement module for performing beamforming processing and signal enhancement processing on each speech signal.
11. The assembly of any one of claims 8-10, wherein the processing assembly further comprises:
the detection module is used for detecting whether the recognition result of each path of voice signal contains a wake-up word or not in parallel;
the first sending module is used for sending the recognition result containing the wake-up word to the voice control system when the recognition result of any voice signal contains the wake-up word.
12. The assembly of claim 11, wherein the processing assembly further comprises:
the azimuth determining module is used for determining azimuth information of the voice signal corresponding to the recognition result containing the wake-up word;
the acquisition module is used for directionally acquiring, within a preset duration, voice signals from the direction indicated by the azimuth information, and carrying out beam forming processing and signal enhancement processing on the acquired voice signals;
and the second sending module is used for sending the voice signals after the beam forming processing and the signal enhancement processing to the voice control system.
13. The assembly of claim 12, wherein the voice control system performs a corresponding function based on the received voice signal after being awakened by the recognition result including the wake-up word.
14. The assembly of claim 8, wherein the multiple paths of voice signals comprise: a voice signal from the driver and a voice signal from the front passenger.
15. A processing device for speech signals, comprising a memory and a processor; the memory is used for storing executable program codes; the processor is configured to read executable program code stored in the memory to perform the method of any one of claims 1-7.
16. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1-7.
17. An in-vehicle voice interaction device, the device comprising: a microphone array and a processor; wherein,
the microphone array is used for collecting mixed voice signals;
the processor is communicatively connected to the microphone array, and is configured to separate voice signals from different directions in the received mixed voice signals to obtain multiple paths of voice signals, and to perform parallel recognition on part or all of the multiple paths of voice signals, wherein the parallel recognition comprises: for part or all of the multiple paths of voice signals, dividing each path of voice signal into a plurality of recognition units for recognition, wherein each recognition unit comprises a plurality of consecutive frames; wherein the processor is specifically configured to:
perform framing processing on each path of voice signal to obtain multiple frames of voice data;
select one frame out of every preset number of frames of voice data in the multiple frames of voice data as target frame voice data; and
recognize each path of voice signal by taking the target frame voice data and the frames of voice data adjacent to the target frame voice data as one recognition unit.
18. The apparatus of claim 17, wherein the processor is specifically configured to:
determine azimuth information of each path of voice signals in the mixed voice signals;
and separate the multiple paths of voice signals from different directions in the mixed voice signals based on the azimuth information of each path of voice signals.
19. The apparatus of claim 17 wherein the processor is further configured to perform beamforming processing and signal enhancement processing on each of the speech signals.
20. The apparatus of any one of claims 17-19, wherein the processor is further configured to:
detect, in parallel, whether the recognition result of each path of voice signal contains a wake-up word;
and when the recognition result of any path of voice signal contains a wake-up word, send the recognition result containing the wake-up word to a voice control system.
21. The apparatus of claim 20, wherein the processor is further configured to:
determine azimuth information of the voice signal corresponding to the recognition result containing the wake-up word;
directionally acquire, within a preset duration, voice signals from the direction indicated by the azimuth information, and carry out beam forming processing and signal enhancement processing on the acquired voice signals;
and send the voice signals after the beam forming processing and the signal enhancement processing to the voice control system.
22. The apparatus of claim 21, wherein the voice control system performs a corresponding function according to the received voice signal after being awakened by the recognition result including the awakening word.
23. An in-vehicle internet control system, comprising: a microphone control assembly and a control assembly; wherein,
the microphone control assembly is used for controlling the microphone array to collect mixed voice signals;
the control assembly is configured to control separation of voice signals from different directions in the received mixed voice signals to obtain multiple paths of voice signals, and to perform parallel recognition on part or all of the multiple paths of voice signals, wherein the parallel recognition comprises: for part or all of the multiple paths of voice signals, dividing each path of voice signal into a plurality of recognition units for recognition, wherein each recognition unit comprises a plurality of consecutive frames; the control assembly is specifically configured to:
perform framing processing on each path of voice signal to obtain multiple frames of voice data;
select one frame out of every preset number of frames of voice data in the multiple frames of voice data as target frame voice data; and
recognize each path of voice signal by taking the target frame voice data and the frames of voice data adjacent to the target frame voice data as one recognition unit.
CN201710850441.4A 2017-09-15 2017-09-15 Voice signal processing method, assembly, equipment and medium Active CN109509465B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710850441.4A CN109509465B (en) 2017-09-15 2017-09-15 Voice signal processing method, assembly, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710850441.4A CN109509465B (en) 2017-09-15 2017-09-15 Voice signal processing method, assembly, equipment and medium

Publications (2)

Publication Number Publication Date
CN109509465A CN109509465A (en) 2019-03-22
CN109509465B true CN109509465B (en) 2023-07-25

Family

ID=65745190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710850441.4A Active CN109509465B (en) 2017-09-15 2017-09-15 Voice signal processing method, assembly, equipment and medium

Country Status (1)

Country Link
CN (1) CN109509465B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862963A (en) * 2019-04-12 2020-10-30 阿里巴巴集团控股有限公司 Voice wake-up method, device and equipment
CN110021298A (en) * 2019-04-23 2019-07-16 广州小鹏汽车科技有限公司 A kind of automotive voice control system
CN110223693B (en) * 2019-06-21 2021-08-20 北京猎户星空科技有限公司 Robot control method and device, electronic equipment and readable storage medium
CN112289335A (en) * 2019-07-24 2021-01-29 阿里巴巴集团控股有限公司 Voice signal processing method and device and pickup equipment
CN110954866B (en) * 2019-11-22 2022-04-22 达闼机器人有限公司 Sound source positioning method, electronic device and storage medium
CN111816180B (en) * 2020-07-08 2022-02-08 北京声智科技有限公司 Method, device, equipment, system and medium for controlling elevator based on voice
CN113327608B (en) * 2021-06-03 2022-12-09 阿波罗智联(北京)科技有限公司 Voice processing method and device for vehicle, electronic equipment and medium
CN117063229A (en) * 2022-03-11 2023-11-14 华为技术有限公司 Interactive voice signal processing method, related equipment and system
CN117275480A (en) * 2023-09-20 2023-12-22 镁佳(北京)科技有限公司 Full duplex intelligent voice dialogue method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040024592A1 (en) * 2002-08-01 2004-02-05 Yamaha Corporation Audio data processing apparatus and audio data distributing apparatus
CN101923855A (en) * 2009-06-17 2010-12-22 复旦大学 Test-irrelevant voice print identifying system
CN103733258A (en) * 2011-08-24 2014-04-16 索尼公司 Encoding device and method, decoding device and method, and program
CN103794212A (en) * 2012-10-29 2014-05-14 三星电子株式会社 Voice recognition apparatus and voice recognition method thereof
KR101681988B1 (en) * 2015-07-28 2016-12-02 현대자동차주식회사 Speech recognition apparatus, vehicle having the same and speech recongition method
US20170076726A1 (en) * 2015-09-14 2017-03-16 Samsung Electronics Co., Ltd. Electronic device, method for driving electronic device, voice recognition device, method for driving voice recognition device, and non-transitory computer readable recording medium
US9734822B1 (en) * 2015-06-01 2017-08-15 Amazon Technologies, Inc. Feedback based beamformed signal selection
CN107924687A (en) * 2015-09-23 2018-04-17 三星电子株式会社 Speech recognition apparatus, the audio recognition method of user equipment and non-transitory computer readable recording medium

Also Published As

Publication number Publication date
CN109509465A (en) 2019-03-22

Similar Documents

Publication Publication Date Title
CN109509465B (en) Voice signal processing method, assembly, equipment and medium
CN110556103B (en) Audio signal processing method, device, system, equipment and storage medium
CN110491403B (en) Audio signal processing method, device, medium and audio interaction equipment
CN109272989B (en) Voice wake-up method, apparatus and computer readable storage medium
US10373609B2 (en) Voice recognition method and apparatus
EP3347894B1 (en) Arbitration between voice-enabled devices
CN108122563B (en) Method for improving voice awakening rate and correcting DOA
US10123113B2 (en) Selective audio source enhancement
CN109712611B (en) Joint model training method and system
US20170140771A1 (en) Information processing apparatus, information processing method, and computer program product
CN111624553B (en) Sound source positioning method and system, electronic equipment and storage medium
CN108922553B (en) Direction-of-arrival estimation method and system for sound box equipment
US11869481B2 (en) Speech signal recognition method and device
EP3714452B1 (en) Method and system for speech enhancement
CN110610718B (en) Method and device for extracting expected sound source voice signal
CN110673096B (en) Voice positioning method and device, computer readable storage medium and electronic equipment
Wang et al. Mask weighted STFT ratios for relative transfer function estimation and its application to robust ASR
WO2020024816A1 (en) Audio signal processing method and apparatus, device, and storage medium
WO2016119388A1 (en) Method and device for constructing focus covariance matrix on the basis of voice signal
KR20110021419A (en) Apparatus and method for reducing noise in the complex spectrum
JP2008175733A (en) Beam-forming system for estimating voice arrival direction, moving device, and beam forming method for estimating voice arrival direction
US20210020189A1 (en) Learning-Based Distance Estimation
US11528571B1 (en) Microphone occlusion detection
CN115620739A (en) Method for enhancing voice in specified direction, electronic device and storage medium
CN111192569B (en) Double-microphone voice feature extraction method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant