CN110992953A - Voice data processing method, device, system and storage medium - Google Patents


Info

Publication number: CN110992953A
Authority: CN (China)
Prior art keywords: audio, audio data, voice, data, voice interaction
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN201911293058.9A
Other languages: Chinese (zh)
Inventor: 李玉澄
Current Assignee: AI Speech Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: AI Speech Ltd
Application filed by AI Speech Ltd
Priority to CN201911293058.9A
Publication of CN110992953A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M 1/00 - Substation equipment, e.g. for use by subscribers
    • H04M 1/60 - Substation equipment, e.g. for use by subscribers, including speech amplifiers
    • H04M 1/6033 - Substation equipment including speech amplifiers, for providing handsfree use or a loudspeaker mode in telephone sets
    • H04M 1/6041 - Portable telephones adapted for handsfree use
    • H04M 1/6058 - Portable telephones adapted for handsfree use involving the use of a headset accessory device connected to the portable telephone
    • H04M 1/6066 - Portable telephones adapted for handsfree use involving the use of a headset accessory device connected to the portable telephone, including a wireless connection
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 - Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/80 - Services using short range communication, e.g. near-field communication [NFC], radio-frequency identification [RFID] or low energy communication
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W 76/00 - Connection management
    • H04W 76/10 - Connection setup
    • H04W 76/14 - Direct-mode setup
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 - Execution procedure of a spoken command
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M 2250/00 - Details of telephonic subscriber devices
    • H04M 2250/02 - Details of telephonic subscriber devices including a Bluetooth interface

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)

Abstract

The invention discloses a voice data processing method, apparatus, system, and storage medium. The method is carried out by a voice data processing apparatus on a Bluetooth headset: first, audio data is written into an audio buffer at the same time as wake-up recognition is performed on it; then the recognition result is evaluated, and if it is wake-up, an audio transmission path to the voice interaction device is established and the audio data in the audio buffer is sent to the voice interaction device. By retaining data in the audio buffer, the loss of the user's speech while the Synchronous Connection-Oriented (SCO) link is being established is avoided, so the user can speak the wake-up word and the command in a single utterance and still be understood. Moreover, because the wake-up word is sent to the voice interaction device together with the speech that follows it, a secondary wake-up check can be performed there, preventing a falsely triggered wake-up from interrupting the user.

Description

Voice data processing method, device, system and storage medium
Technical Field
The invention relates to the field of artificial-intelligence voice interaction, and in particular to a method, apparatus, system, and storage medium for processing voice data with a Bluetooth headset.
Background
With the continuous development of artificial intelligence and electronic communication technology, intelligent voice interaction devices such as smart watches and smart speakers are increasingly popular. Recently, major vendors have released products that perform voice interaction through Bluetooth headsets.
At present, these products that perform voice interaction through a Bluetooth headset integrate a wake-up algorithm in the Bluetooth chip or on a digital signal processing (DSP) chip, and once voice wake-up is triggered, voice interaction is carried out over the standard Hands-Free Profile (HFP). In this scheme, the mobile terminal receives the wake-up command and requests establishment of a Synchronous Connection-Oriented (SCO) link, and several seconds may elapse before the SCO channel is actually established. The user's speech during that interval is lost, so the user cannot speak the wake-up word and the command word in one breath and still be understood in a single utterance.
In addition, in the existing approach, only the speech following the wake-up word is sent after the wake-up word is detected, so a deeper "secondary wake-up check" cannot be performed on the mobile device side. Consequently, when a wake-up word is triggered by mistake, the user's current activity is interrupted, resulting in a poor user experience.
Disclosure of Invention
In view of the above problems, the present inventors provide a method, apparatus, system, and storage medium for voice data processing.
According to a first aspect of the embodiments of the present invention, a voice data processing method is applied to a voice data processing apparatus on a Bluetooth headset, the method including: performing wake-up recognition on first audio data while writing the first audio data into an audio buffer to form second audio data, the result of the wake-up recognition being a recognition result; and evaluating the recognition result, and if the recognition result is wake-up, establishing an audio transmission path to the voice interaction device and sending the second audio data in the audio buffer to the voice interaction device.
According to an embodiment of the present invention, before the wake-up recognition is performed on the first audio data, the method further includes: collecting an original audio signal; and performing signal processing on the original audio signal to obtain the first audio data.
According to an embodiment of the present invention, collecting the original audio signal includes: acquiring at least two channels of original audio signals through a microphone array. Correspondingly, performing signal processing on the original audio signal to obtain the first audio data includes: performing signal processing on the at least two channels of original audio signals to obtain at least two processing results; and merging the at least two processing results to obtain the first audio data, the first audio data being a single channel of audio data.
According to an embodiment of the present invention, writing the first audio data into the audio buffer to form the second audio data includes: writing the first audio data into the audio buffer in time-ordered segments with overflow overwriting, to form the second audio data.
According to an embodiment of the present invention, performing the wake-up recognition on the first audio data includes: judging, according to a wake-up model, whether a wake-up word exists in the first audio data, and if so, determining the recognition result to be wake-up, recording the write point at which the second audio data is written in the audio buffer, and taking the write point as the wake-up point. Correspondingly, sending the second audio data in the audio buffer to the voice interaction device includes: sending the audio data within a preset period before the wake-up point, together with the audio after the wake-up, to the voice interaction device.
According to an embodiment of the present invention, sending the second audio data in the audio buffer to the voice interaction device includes: sending the entire content of the second audio data in the audio buffer, including the wake-up word, to the voice interaction device for a secondary wake-up check.
According to a second aspect of the embodiments of the present invention, a voice data processing apparatus includes: a wake-up recognition module for performing wake-up recognition on first audio data; an audio writing module for writing the first audio data into an audio buffer to form second audio data; a recognition result judging module for evaluating the recognition result; and a sending module for establishing an audio transmission path to the voice interaction device and sending the second audio data in the audio buffer to the voice interaction device if the recognition result is wake-up.
According to an embodiment of the present invention, the apparatus further includes: a signal collecting module for collecting an original audio signal; and a signal processing module for performing signal processing on the original audio signal to obtain the first audio data.
According to an embodiment of the present invention, the signal collecting module includes: a microphone array unit for acquiring at least two channels of original audio signals through a microphone array. Correspondingly, the signal processing module includes: a multi-channel signal processing unit for performing signal processing on the at least two channels of original audio signals to obtain at least two processing results; and a signal merging unit for merging the at least two processing results into the first audio data, the first audio data being a single channel of audio data.
According to an embodiment of the present invention, the audio writing module is specifically configured to write the first audio data into the audio buffer in time-ordered segments with overflow overwriting, to form the second audio data.
According to an embodiment of the present invention, the wake-up recognition module includes: a judging unit for judging, according to the wake-up model, whether a wake-up word exists in the first audio data; a recognition result determining unit for determining the recognition result to be wake-up if it does; and a wake-up point recording unit for recording the write point at which the second audio data is written in the audio buffer and taking the write point as the wake-up point. Correspondingly, the sending module is specifically configured to establish an audio transmission path to the voice interaction device and send the second audio data in the audio buffer to the voice interaction device if the recognition result is wake-up.
According to an embodiment of the present invention, the sending module is specifically configured to send the second audio data in the audio buffer, including the wake-up word, to the voice interaction device.
According to a third aspect of the embodiments of the present invention, there is provided a voice data processing system, including: a voice data processing apparatus, arranged on a Bluetooth headset, for executing any of the voice data processing methods described above; and a mobile device, connected to the Bluetooth headset, for receiving the second audio data sent by the voice data processing apparatus and processing it.
According to a fourth aspect of the embodiments of the present invention, there is provided a computer storage medium comprising a set of computer-executable instructions which, when executed, perform any of the voice data processing methods described above.
The embodiments of the present invention provide a voice data processing method, apparatus, system, and storage medium. First, the audio data is written into an audio buffer at the same time as wake-up recognition is performed on it; then the recognition result is evaluated, and if it is wake-up, an audio transmission path to the voice interaction device is established and the audio data in the audio buffer is sent to the voice interaction device. By retaining data in the audio buffer, the loss of the user's speech while the SCO channel is being established is avoided, and the user can speak the wake-up word and the command in a single utterance. Furthermore, because the wake-up word is sent to the mobile device along with the command, a secondary wake-up check can be performed on the voice interaction device side to prevent a falsely triggered wake-up from interrupting the user.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 is a schematic flow chart of a voice data processing method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a configuration of an audio data processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
The voice data processing method of the embodiments of the invention is applied to a voice data processing apparatus on a Bluetooth headset. The apparatus is usually a DSP chip, but may be any other hardware device capable of processing audio signals or audio data, and it can be built into the Bluetooth headset or attached to it as an external module.
Fig. 1 shows the implementation flow of a voice data processing method according to an embodiment of the present invention. Referring to fig. 1, the method includes: operation 110, performing wake-up recognition on first audio data while writing the first audio data into an audio buffer to form second audio data, the result of the wake-up recognition being a recognition result; and operation 120, evaluating the recognition result, and if the recognition result is wake-up, establishing an audio transmission path to the voice interaction device and sending the second audio data in the audio buffer to the voice interaction device.
In operation 110, the first audio data may be a directly captured audio signal or audio data that has already undergone signal processing; using processed audio data is recommended, as it improves the accuracy of wake-up recognition. Wake-up recognition means feeding the first audio data into a wake-up module, which runs a wake-up algorithm to produce a recognition result; its purpose is to determine whether the current audio input constitutes a wake-up. Existing wake-up algorithms rely on keyword-spotting techniques that search the input audio for keywords, namely wake-up words; if a wake-up word is recognized, the recognition result is set to wake-up. Meanwhile, the first audio data is written into the audio buffer to form the second audio data. If cost permits, the audio buffer can be made long: the longer the buffer, the more of the user's audio input can be stored, the less audio data is lost, and the better the result. Ideally, the second audio data retains all of the content of the first audio data, including the wake-up word.
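The dual path of operation 110, feeding each incoming frame to the wake-up module while also appending it to the buffer, can be sketched as follows. This is an illustrative sketch only: frames are stand-in strings, and `detect_wake_word`, `FRAME_MS`, and `BUFFER_FRAMES` are hypothetical names; a real DSP would run a keyword-spotting model on PCM frames.

```python
from collections import deque

FRAME_MS = 10        # assumed frame length, matching the 10 ms example later in the text
BUFFER_FRAMES = 300  # assumed capacity: 3 s of history

# The audio buffer holding the "second audio data"; deque(maxlen=...)
# silently drops the oldest frame when full (overflow overwriting).
audio_buffer = deque(maxlen=BUFFER_FRAMES)

def detect_wake_word(frame) -> bool:
    """Stand-in for the wake-up model; a real system runs keyword spotting here."""
    return frame == "wake"  # toy criterion for illustration

def process_frame(frame) -> bool:
    """Operation 110: write the frame into the buffer AND run wake-up recognition."""
    audio_buffer.append(frame)       # first audio data -> audio buffer (second audio data)
    return detect_wake_word(frame)   # recognition result for this frame

frames = ["noise", "noise", "wake", "play", "music"]
results = [process_frame(f) for f in frames]
```

Note that the buffer receives every frame regardless of the recognition result, which is what preserves the speech spoken before and during the wake-up word.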
In operation 120, if the recognition result is wake-up, an audio transmission path to the voice interaction device is established. The Bluetooth headset serves as the input and output device for audio data: its main function is to capture audio input, forward the useful audio to the voice interaction device, and play back the audio the voice interaction device returns. The method of this embodiment is only a preprocessing step for the audio data, so the audio suspected of containing the user's query is ultimately forwarded to the voice interaction device for processing, and the response audio returned by the voice interaction device is played back to the user. The voice interaction device can be any such device connected to the Bluetooth headset, for example a smart watch, a smart speaker, or a smart in-car device. The channel to the voice interaction device is typically established over an Inter-IC Sound (I2S) bus, and establishing the I2S channel can be accomplished by triggering a wake-up event on the Bluetooth chip. Upon receiving the wake-up event, the Bluetooth chip opens the I2S audio channel and sends the audio data held in the DSP chip for a specified period before the wake-up point, together with the audio after the wake-up point, to the voice interaction device.
According to an embodiment of the present invention, before the wake-up recognition is performed on the first audio data, the method further includes: collecting an original audio signal; and performing signal processing on the original audio signal to obtain the first audio data.
In the embodiment of the present invention, the voice data processing apparatus on the Bluetooth headset has some processing capability and independent memory, so it can not only collect the original audio signal but also process it, for example by receiving signals directionally using beamforming or by denoising the received signal.
According to an embodiment of the present invention, collecting the original audio signal includes: acquiring at least two channels of original audio signals through a microphone array. Correspondingly, performing signal processing on the original audio signal to obtain the first audio data includes: performing signal processing on the at least two channels of original audio signals to obtain at least two processing results; and merging the at least two processing results to obtain the first audio data, the first audio data being a single channel of audio data.
In the embodiment of the invention, at least two channels of original audio signals are captured by the microphone array, enabling better denoising and yielding the audio data actually wanted in the voice interaction. Audio data obtained by denoising the original signals is easier to recognize, which in turn improves the accuracy of wake-up recognition.
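The merge step above can be sketched as follows. This is a deliberately minimal illustration under stated assumptions: the per-channel beamforming/denoising stage is omitted, and the two processed channels are simply averaged sample-by-sample into one mono stream; `merge_channels` is a hypothetical name, not an API from the patent.

```python
def merge_channels(ch_a, ch_b):
    """Combine two processed microphone channels into one channel
    (the 'first audio data') by sample-wise averaging."""
    if len(ch_a) != len(ch_b):
        raise ValueError("channels must be aligned and equal in length")
    return [(a + b) / 2 for a, b in zip(ch_a, ch_b)]

# Two toy channels of already-processed samples (integer values avoid
# floating-point representation noise in this illustration).
mic1 = [0, 2, 4]
mic2 = [0, 4, 2]
mono = merge_channels(mic1, mic2)
```

A real implementation would typically weight or time-align the channels (delay-and-sum beamforming) rather than averaging them naively.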
According to an embodiment of the present invention, writing the first audio data into the audio buffer to form the second audio data includes: writing the first audio data into the audio buffer in time-ordered segments with overflow overwriting, to form the second audio data.
In the embodiment of the invention, the first audio data is sent simultaneously to the wake-up module and the audio buffer. Segmenting the first audio data by time makes it convenient to operate on small units, such as when recording or reading data, and easier to process; for example, the first audio data may be stored at 10 ms per frame. Overflow overwriting ensures that new data can still be written once a large amount of historical data has been saved: when the buffer is full, the oldest data is overwritten first.
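Time-ordered segmentation with overflow overwriting amounts to a ring (circular) buffer of fixed-size frames. The sketch below, with hypothetical names (`FrameRingBuffer`, `write_point`) and an assumed capacity, shows how the write point advances and how the oldest frames are overwritten while recent history is preserved:

```python
class FrameRingBuffer:
    """Fixed-capacity buffer of time-ordered frames. When full, each new
    write overwrites the oldest frame, so recent history is always kept."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.frames = [None] * capacity
        self.write_point = 0   # next slot to write; recordable as a wake-up point
        self.count = 0

    def write(self, frame):
        self.frames[self.write_point] = frame
        self.write_point = (self.write_point + 1) % self.capacity
        self.count = min(self.count + 1, self.capacity)

    def history(self):
        """Return buffered frames in chronological order (oldest first)."""
        if self.count < self.capacity:
            return self.frames[:self.count]
        wp = self.write_point
        return self.frames[wp:] + self.frames[:wp]

buf = FrameRingBuffer(4)
for i in range(6):     # 6 writes into 4 slots: frames 0 and 1 are overwritten
    buf.write(i)
```

With 10 ms frames, a capacity of a few hundred frames would retain the last few seconds of speech, which is the history needed to recover the wake-up word and anything spoken before the SCO/I2S channel is ready.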
According to an embodiment of the present invention, performing the wake-up recognition on the first audio data includes: judging, according to the wake-up model, whether a wake-up word exists in the first audio data, and if so, determining the recognition result to be wake-up, recording the write point at which the second audio data is written in the audio buffer, and taking the write point as the wake-up point. Correspondingly, sending the second audio data in the audio buffer to the voice interaction device includes: sending the audio data within a preset period before the wake-up point, together with the audio after the wake-up, to the voice interaction device.
In the embodiment of the invention, a wake-up point is obtained while wake-up recognition is being performed: as soon as a wake-up word is detected, the write point at which the second audio data is being written into the audio buffer is recorded immediately; that point is the wake-up point. With this position recorded, the audio input by the user can be divided into a wake-up-word part and a suspected-query part, and the suspected-query part can then be split off to serve as the input to the voice interaction program.
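The split described above can be sketched as follows, assuming the buffer contents have already been read out in chronological order. The frame contents and the helper name `split_at_wake_point` are illustrative, not part of the patent:

```python
def split_at_wake_point(buffered_frames, wake_point):
    """Divide the buffered 'second audio data' at the recorded wake-up point:
    everything up to the wake point is the wake-up-word part, and everything
    after it is the suspected user-query part."""
    wake_part = buffered_frames[:wake_point]
    query_part = buffered_frames[wake_point:]
    return wake_part, query_part

# Hypothetical example: a 3-frame wake word followed by a 2-frame query.
frames = ["hel", "lo_", "bot", "play", "jazz"]
wake_part, query_part = split_at_wake_point(frames, wake_point=3)
```

The query part alone can feed the dialog system, while wake part plus query part together support the secondary wake-up check described next.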
According to an embodiment of the present invention, sending the second audio data in the audio buffer to the voice interaction device includes: sending the entire content of the second audio data in the audio buffer, including the wake-up word, to the voice interaction device for a secondary wake-up check.
In the embodiment of the present invention, besides sending the suspected-query part after the wake-up word to the voice interaction device for voice interaction processing, the full content of the second audio data, including the wake-up-word part, may also be sent to the voice interaction device. In this way, the voice interaction device can additionally perform a more complex secondary wake-up check on the audio data. During the secondary wake-up check, the context of the conversation is used to judge whether the user's real intention is a wake-up or a false trigger that merely happens to contain a wake-up word. Processing such context usually requires more complex recognition models, such as convolutional neural network models, demands greater processing and computing power, and is usually handled by the dialog processing module of the voice interaction device. Through this secondary wake-up check, the poor experience of the user's current activity being interrupted by a falsely triggered voice interaction can be greatly reduced.
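To make the idea of a context-based secondary check concrete, here is a toy stand-in. The patent envisions a larger model (e.g. a CNN) on the device side; this sketch substitutes a single hand-written context rule on a transcript, and `WAKE_WORD` and `secondary_wake_check` are hypothetical names:

```python
WAKE_WORD = "hello bot"  # hypothetical wake word for illustration

def secondary_wake_check(transcript: str) -> bool:
    """Toy context rule: treat the wake-up as genuine only if the wake word
    begins the utterance; a wake word buried mid-sentence (e.g. reported
    speech) is treated as a false trigger and rejected."""
    return transcript.lower().strip().startswith(WAKE_WORD)

genuine = secondary_wake_check("Hello bot play some jazz")
false_trigger = secondary_wake_check("I said hello bot yesterday")
```

Because the headset forwards the wake word together with the following speech, the device has the full utterance available and can apply far richer contextual models than this one-line rule.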
Furthermore, an embodiment of the present invention provides a voice data processing apparatus. As shown in fig. 2, the apparatus 20 includes: a wake-up recognition module 201 for performing wake-up recognition on first audio data; an audio writing module 202 for writing the first audio data into an audio buffer to form second audio data; a recognition result judging module 203 for evaluating the recognition result; and a sending module 204 for establishing an audio transmission path to the voice interaction device and sending the second audio data in the audio buffer to the voice interaction device if the recognition result is wake-up.
According to an embodiment of the present invention, the apparatus 20 further includes: a signal collecting module for collecting an original audio signal; and a signal processing module for performing signal processing on the original audio signal to obtain the first audio data.
According to an embodiment of the present invention, the signal collecting module includes: a microphone array unit for acquiring at least two channels of original audio signals through a microphone array. Correspondingly, the signal processing module includes: a multi-channel signal processing unit for performing signal processing on the at least two channels of original audio signals to obtain at least two processing results; and a signal merging unit for merging the at least two processing results into the first audio data, the first audio data being a single channel of audio data.
According to an embodiment of the present invention, the audio writing module 202 is specifically configured to write the first audio data into the audio buffer in time-ordered segments with overflow overwriting, to form the second audio data.
According to an embodiment of the present invention, the wake-up recognition module 201 includes: a judging unit for judging, according to the wake-up model, whether a wake-up word exists in the first audio data; a recognition result determining unit for determining the recognition result to be wake-up if it does; and a wake-up point recording unit for recording the write point at which the second audio data is written in the audio buffer and taking the write point as the wake-up point. Correspondingly, the sending module is specifically configured to establish an audio transmission path to the voice interaction device and send the second audio data in the audio buffer to the voice interaction device if the recognition result is wake-up.
According to an embodiment of the present invention, the sending module 204 is specifically configured to send the second audio data in the audio buffer, including the wake-up word, to the voice interaction device.
According to a third aspect of the embodiments of the present invention, there is provided a voice data processing system, including: a voice data processing apparatus, arranged on a Bluetooth headset, for executing any of the voice data processing methods described above; and a mobile device, connected to the Bluetooth headset, for receiving the second audio data sent by the voice data processing apparatus and processing it.
According to a fourth aspect of the embodiments of the present invention, there is provided a computer storage medium comprising a set of computer-executable instructions which, when executed, perform any of the voice data processing methods described above.
It should be noted here that the above descriptions of the voice data processing apparatus, voice data processing system, and computer storage medium embodiments are similar to the descriptions of the foregoing method embodiments and have similar beneficial effects, so they are not repeated. For technical details of the present invention not disclosed in the apparatus, system, and storage medium embodiments, please refer to the description of the foregoing method embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of a unit is only one logical function division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another device, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated into one processing unit, each unit may exist separately as one unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be implemented by hardware instructed by a program; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage medium, a read-only memory (ROM), a magnetic disk, or an optical disk.
Alternatively, if the integrated unit of the present invention is implemented in the form of a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods of the embodiments of the present invention. The aforementioned storage medium includes media capable of storing program code, such as a removable storage medium, a ROM, a magnetic disk, or an optical disk.
The above description covers only specific embodiments of the present invention, but the scope of the present invention is not limited thereto. Any changes or substitutions that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A voice data processing method implemented by a voice data processing apparatus on a Bluetooth headset, the method comprising:
performing wake-up recognition on first audio data, and writing the first audio data into an audio buffer to form second audio data, wherein the result of the wake-up recognition is a recognition result;
and judging the recognition result, and if the recognition result is wake-up, establishing an audio transmission path with a voice interaction device and sending the second audio data in the audio buffer to the voice interaction device.
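As a non-limiting illustration of the flow of claim 1 (all class, method, and parameter names below are hypothetical and not part of the claim): the headset performs wake-up recognition on each incoming frame, writes the frame into the audio buffer, and only on a positive result establishes the transmission path and forwards the buffered second audio data.

```python
from collections import deque

class HeadsetWakeProcessor:
    """Minimal sketch of the claim-1 flow; names are illustrative only."""

    def __init__(self, wake_detector, transport, buffer_frames=100):
        self.wake_detector = wake_detector          # callable: frame -> bool
        self.transport = transport                  # object with connect()/send()
        self.buffer = deque(maxlen=buffer_frames)   # the "audio buffer" (oldest frames drop off)

    def process_frame(self, frame: bytes) -> None:
        woke = self.wake_detector(frame)            # wake-up recognition on first audio data
        self.buffer.append(frame)                   # write into buffer, forming second audio data
        if woke:                                    # recognition result is wake-up
            self.transport.connect()                # establish audio transmission path
            self.transport.send(b"".join(self.buffer))  # forward buffered audio
```

In practice the detector would be a small on-device wake-word model and the transport a Bluetooth link; both are stubbed here.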
2. The method of claim 1, wherein, prior to the wake-up recognition of the first audio data, the method further comprises:
collecting an original audio signal;
and performing signal processing on the original audio signal to obtain the first audio data.
3. The method of claim 2, wherein collecting the original audio signal comprises:
acquiring at least two channels of original audio signals through a microphone array;
correspondingly, performing signal processing on the original audio signal to obtain the first audio data comprises:
performing signal processing on the at least two channels of original audio signals to obtain at least two channels of processing results;
and merging the at least two channels of processing results to obtain the first audio data, wherein the first audio data is a single channel of audio data.
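The channel-merging step of claim 3 can be sketched as follows. The claim does not specify the merge operation; a simple per-sample mean is used here purely for illustration (real pipelines typically apply beamforming or another array-processing technique), and the function name is hypothetical.

```python
import numpy as np

def merge_channels(channels: np.ndarray) -> np.ndarray:
    """Merge two or more processed microphone channels into one channel.

    `channels` has shape (n_channels, n_samples). The per-sample mean
    shown here is only an illustrative stand-in for the unspecified
    merge operation of claim 3.
    """
    if channels.ndim != 2 or channels.shape[0] < 2:
        raise ValueError("expected a 2-D array with at least two channels")
    return channels.mean(axis=0)   # collapse channels into one path
```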
4. The method of claim 1, wherein writing the first audio data into the audio buffer to form the second audio data comprises:
writing the first audio data into the audio buffer in time-ordered segments, overwriting the oldest data on overflow, to form the second audio data.
5. The method of claim 4, wherein performing wake-up recognition on the first audio data comprises:
judging, according to a wake-up model, whether a wake-up word exists in the first audio data; if so, determining the recognition result as wake-up, and recording the write position of the second audio data in the audio buffer as a wake-up point;
correspondingly, sending the second audio data in the audio buffer to the voice interaction device comprises:
sending, to the voice interaction device, the audio data within a preset time period before the wake-up point together with the audio data after wake-up.
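The overflow-overwriting buffer of claim 4 and the wake-point extraction of claim 5 can be sketched together as a ring buffer that records the write position at the moment of wake-up and can then return the audio preceding that point. All names are hypothetical; the claims do not prescribe a byte-level layout.

```python
class AudioRingBuffer:
    """Time-ordered ring buffer that overwrites the oldest data on
    overflow, with a recorded wake-up point (sketch of claims 4-5)."""

    def __init__(self, capacity: int):
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.write_pos = 0        # next write offset within the buffer
        self.total = 0            # total bytes written since start
        self.wake_point = None    # total-bytes count when wake-up fired

    def write(self, data: bytes) -> None:
        for b in data:            # byte-wise for clarity, not speed
            self.buf[self.write_pos] = b
            self.write_pos = (self.write_pos + 1) % self.capacity  # wrap: overwrite oldest
            self.total += 1

    def mark_wake_point(self) -> None:
        """Record the current write position as the wake-up point."""
        self.wake_point = self.total

    def read_before_wake(self, n: int) -> bytes:
        """Return up to n bytes immediately preceding the wake-up point
        (the 'preset time period before the wake-up point')."""
        if self.wake_point is None:
            return b""
        n = min(n, self.wake_point, self.capacity)  # cannot exceed what survives
        start = (self.wake_point - n) % self.capacity
        return bytes(self.buf[(start + i) % self.capacity] for i in range(n))
```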
6. The method of claim 1, wherein sending the second audio data in the audio buffer to the voice interaction device comprises:
sending the entire content of the second audio data in the audio buffer that contains the wake-up word to the voice interaction device for secondary wake-up verification.
7. A voice data processing apparatus arranged on a Bluetooth headset, the apparatus comprising:
a wake-up recognition module, configured to perform wake-up recognition on first audio data;
an audio writing module, configured to write the first audio data into an audio buffer to form second audio data;
a recognition result judging module, configured to judge the recognition result;
and a sending module, configured to, if the recognition result is wake-up, establish an audio transmission path with a voice interaction device and send the second audio data in the audio buffer to the voice interaction device.
8. The apparatus of claim 7, further comprising:
a signal collection module, configured to collect an original audio signal;
and a signal processing module, configured to perform signal processing on the original audio signal to obtain the first audio data.
9. A voice data processing system, characterized in that the system comprises:
a voice data processing apparatus, disposed on a Bluetooth headset and configured to perform the voice data processing method of any one of claims 1 to 6;
and a voice interaction device, connected to the Bluetooth headset and configured to receive the second audio data sent by the voice data processing apparatus and to process the second audio data.
10. A storage medium having program instructions stored thereon, wherein the program instructions, when executed, perform the voice data processing method of any one of claims 1 to 6.
CN201911293058.9A 2019-12-16 2019-12-16 Voice data processing method, device, system and storage medium Pending CN110992953A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911293058.9A CN110992953A (en) 2019-12-16 2019-12-16 Voice data processing method, device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911293058.9A CN110992953A (en) 2019-12-16 2019-12-16 Voice data processing method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN110992953A true CN110992953A (en) 2020-04-10

Family

ID=70094009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911293058.9A Pending CN110992953A (en) 2019-12-16 2019-12-16 Voice data processing method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN110992953A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111681675A (en) * 2020-06-03 2020-09-18 西安Tcl软件开发有限公司 Dynamic data transmission method, device, equipment and storage medium
CN112634897A (en) * 2020-12-31 2021-04-09 青岛海尔科技有限公司 Equipment awakening method and device, storage medium and electronic device
CN114363835A (en) * 2021-12-16 2022-04-15 四川腾盾科技有限公司 Automatic PTT method based on unmanned aerial vehicle data chain vocoded voice
CN111681675B (en) * 2020-06-03 2024-06-07 西安通立软件开发有限公司 Data dynamic transmission method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108962240A (en) * 2018-06-14 2018-12-07 百度在线网络技术(北京)有限公司 A kind of sound control method and system based on earphone
CN109378000A (en) * 2018-12-19 2019-02-22 科大讯飞股份有限公司 Voice awakening method, device, system, equipment, server and storage medium
CN110097876A (en) * 2018-01-30 2019-08-06 阿里巴巴集团控股有限公司 Voice wakes up processing method and is waken up equipment
WO2019192250A1 (en) * 2018-04-04 2019-10-10 科大讯飞股份有限公司 Voice wake-up method and apparatus



Similar Documents

Publication Publication Date Title
CN107240405B (en) Sound box and alarm method
CN108154140A (en) Voice awakening method, device, equipment and computer-readable medium based on lip reading
CN111933112A (en) Awakening voice determination method, device, equipment and medium
CN108986833A (en) Sound pick-up method, system, electronic equipment and storage medium based on microphone array
WO2022033556A1 (en) Electronic device and speech recognition method therefor, and medium
CN115312068B (en) Voice control method, equipment and storage medium
CN110968353A (en) Central processing unit awakening method and device, voice processor and user equipment
CN110992953A (en) Voice data processing method, device, system and storage medium
CN111524513A (en) Wearable device and voice transmission control method, device and medium thereof
CN111370004A (en) Man-machine interaction method, voice processing method and equipment
US20200051548A1 (en) Method for updating a speech recognition model, electronic device and storage medium
CN111028838A (en) Voice wake-up method, device and computer readable storage medium
CN111739515B (en) Speech recognition method, equipment, electronic equipment, server and related system
CN115132212A (en) Voice control method and device
CN110197663B (en) Control method and device and electronic equipment
CN114360546A (en) Electronic equipment and awakening method thereof
WO2023124248A1 (en) Voiceprint recognition method and apparatus
CN109922397A (en) Audio intelligent processing method, storage medium, intelligent terminal and smart bluetooth earphone
CN114121042A (en) Voice detection method and device under wake-up-free scene and electronic equipment
CN111028846B (en) Method and device for registration of wake-up-free words
CN110941455B (en) Active wake-up method and device and electronic equipment
CN114333817A (en) Remote controller and remote controller voice recognition method
CN110083392A (en) Audio wakes up method, storage medium, terminal and its bluetooth headset pre-recorded
CN116030817B (en) Voice wakeup method, equipment and storage medium
CN115331672B (en) Device control method, device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Building 14, Tengfei science and Technology Park, 388 Xinping street, Suzhou Industrial Park, Suzhou area, China (Jiangsu) pilot Free Trade Zone, Suzhou, Jiangsu 215000

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215024 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20200410