WO2023109129A1 - Speech data processing method and apparatus - Google Patents


Info

Publication number
WO2023109129A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
smart device
voice data
user
wake
Prior art date
Application number
PCT/CN2022/107607
Other languages
French (fr)
Chinese (zh)
Inventor
李含珍
王峰
任晓楠
Original Assignee
海信视像科技股份有限公司 (Hisense Visual Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 海信视像科技股份有限公司 (Hisense Visual Technology Co., Ltd.)
Publication of WO2023109129A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L15/34 Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • G10L2015/223 Execution procedure of a spoken command
    • G10L2015/225 Feedback of the input speech

Definitions

  • the present application relates to the technical field of voice data processing, in particular to a voice data processing method and device.
  • To save power, a smart device is usually kept in a low power consumption mode.
  • When the user wants to talk to the smart device, the user needs to speak the wake-up word of the smart device first to "wake up" the device so that it switches to the normal working state.
  • After the smart device detects the wake-up word, it executes the instruction the user speaks after the wake-up word.
  • The wake-up word of some smart devices can be changed, but after the change, once the user forgets or cannot determine the changed wake-up word, the user will no longer be able to "wake up" the smart device, resulting in an insufficient level of intelligence of the smart device and a seriously degraded user experience.
  • In view of this, the present application provides a voice data processing method and device to solve the technical problems of insufficient intelligence of the smart device and poor user experience caused by failure to wake up the smart device.
  • In a first aspect, the present application provides a voice data processing method, including: determining that the wake-up word of the smart device is configured as a first word; collecting first voice data of the user; when it is recognized that the first voice data includes the first word, switching the working state of the smart device; when it is recognized that the first voice data does not include the first word but includes preset content, keeping the working state unchanged and prompting the user with the first word; and when it is recognized that the first voice data includes neither the first word nor the preset content, keeping the working state unchanged.
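The three recognition branches above can be sketched as a small decision function. This is a minimal illustration only: the names `handle_voice_data`, `first_word`, and `preset_content`, and the substring-matching step, are assumptions for the sketch, not the patent's implementation.

```python
def handle_voice_data(transcript, first_word, preset_content):
    """Decide how the device reacts to one utterance, per the three branches."""
    if first_word in transcript:
        return "switch_working_state"      # branch 1: wake word spoken
    if any(word in transcript for word in preset_content):
        return "stay_idle_and_prompt"      # branch 2: old/other wake word spoken
    return "stay_idle"                     # branch 3: unrelated speech

# Example: current wake word "YYYY", old wake word "XXXX" kept as preset content
print(handle_voice_data("XXXX play a movie", "YYYY", {"XXXX"}))  # stay_idle_and_prompt
```

The key point of the branching is that both non-wake branches leave the working state unchanged; only the second one additionally triggers a prompt.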
  • In some embodiments, when it is recognized that the first voice data does not include the first word but includes the preset content, not switching the working state and prompting the user with the first word includes: adding 1 to a detection count, where the detection count is the number of consecutively collected pieces of voice data that do not include the first word but include the preset content; and when the detection count is greater than a preset count, prompting the user with the first word.
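The consecutive-detection counter can be sketched as follows. The class name and the reset-on-break behavior are illustrative assumptions consistent with "continuously collected" in the text.

```python
class WakeWordPrompter:
    """Count consecutive utterances that miss the current wake word but hit
    preset content; signal a prompt once the count exceeds a threshold."""

    def __init__(self, preset_count):
        self.preset_count = preset_count   # threshold from the embodiment
        self.detections = 0

    def observe(self, has_first_word, has_preset):
        """Feed one recognition result; return True when the user should be
        prompted with the first word."""
        if has_first_word or not has_preset:
            self.detections = 0            # streak broken: reset the counter
            return False
        self.detections += 1               # "add 1 to the detection count"
        return self.detections > self.preset_count
```

With `preset_count=2`, the third consecutive mis-wake triggers the prompt, matching the three-times example later in the description.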
  • In some embodiments, when it is recognized that the first voice data does not include the first word but includes the preset content, not switching the working state and prompting the user with the first word includes: collecting second voice data of the user; and when it is recognized that the second voice data includes a sentence whose semantics is to ask about the first word, prompting the user with the first word.
  • In some embodiments, the preset content includes one or more of the following: at least one wake-up word configured for the smart device before the first word; at least one wake-up word configured under the user account bound to the smart device; and a configured wake-up word of at least one other device with a voice data processing function.
  • In some embodiments, the method further includes: when the smart device is started, acquiring from a storage device the at least one wake-up word configured for the smart device before the first word, and acquiring from a server the configured wake-up word of the at least one other device with a voice data processing function; and when the user logs in to the smart device with an account, acquiring from the server, according to the user account, the at least one wake-up word configured under the user account bound to the smart device.
  • In some embodiments, the method further includes: when the user logs in to the smart device with an account and modifies the wake-up word of the smart device from the first word to a second word, sending the second word to the server so that the server records the second word.
  • In some embodiments, prompting the user with the first word includes: displaying a text prompt of the first word on a display interface; or playing a voice prompt of the first word.
  • In some embodiments, the prompting of the first word stops after a preset time; or, when third voice data of the user is collected and the first word is recognized in the third voice data, the prompting of the first word stops.
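The two stop conditions for the prompt (a preset timeout, or the first word being heard in third voice data) might be combined as follows. This is a sketch; the 15-second default mirrors the example timeout given later in the description, and the class and method names are assumptions.

```python
import time

class PromptController:
    """Track whether the wake-word prompt should still be shown: it stops
    after a timeout, or as soon as the first word is heard."""

    def __init__(self, first_word, timeout_s=15.0):
        self.first_word = first_word
        self.timeout_s = timeout_s
        self.shown_at = None

    def show(self, now=None):
        """Start showing the prompt at time `now` (defaults to the clock)."""
        self.shown_at = time.monotonic() if now is None else now

    def still_showing(self, now, latest_transcript=""):
        if self.shown_at is None:
            return False
        if now - self.shown_at >= self.timeout_s:
            return False                   # preset time elapsed
        if self.first_word in latest_transcript:
            return False                   # third voice data contains the word
        return True
```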
  • In some embodiments, after collecting the first voice data of the user, the method further includes: using a machine learning model to determine whether the first voice data includes the first word and the preset content; or determining the pinyin of each character in the first voice data, and determining, through the pinyin of each character, the pinyin of the first word, and the pinyin of the preset content, whether the first voice data includes the first word and the preset content.
  • A second aspect of the present application provides a voice data processing device for executing the voice data processing method provided in any implementation of the first aspect. The device includes: a determination module, configured to determine that the wake-up word of the smart device is configured as a first word; a collection module, configured to collect first voice data of the user; a processing module, configured to identify whether the first voice data includes the first word and whether it includes preset content, where the smart device switches the working state when it is recognized that the first voice data includes the first word, and does not switch the working state when the first voice data does not include the first word, whether or not it includes the preset content; and a prompt module, configured to prompt the user with the first word when it is recognized that the first voice data does not include the first word but includes the preset content.
  • FIG. 1 is a schematic diagram of an application scenario of the present application;
  • FIG. 2 is a schematic flow chart of a method for a smart device to process voice data;
  • FIG. 3 is a schematic flow chart of an embodiment of a voice data processing method provided by the present application;
  • FIG. 4 is a schematic diagram of the wake-up word of the smart device provided by the present application;
  • FIG. 5 is a schematic diagram of one way for a smart device to prompt a wake-up word provided by the present application;
  • FIG. 6 is a schematic diagram of another way for a smart device to prompt a wake-up word provided by the present application;
  • FIG. 7 is a schematic flow chart of another embodiment of the voice data processing method provided by the present application;
  • FIG. 8 is a schematic flow chart of another embodiment of the voice data processing method provided by the present application;
  • FIG. 9 is a schematic diagram of the smart device provided by the present application realizing the processing of the preset content;
  • FIG. 10 is a schematic flow chart of an embodiment of voice data processing provided by the present application;
  • FIG. 11 is a schematic diagram of a processing structure for voice data processing by a smart device provided by the present application;
  • FIG. 12 is a schematic structural diagram of an embodiment of a voice data processing device provided by the present application.
  • FIG. 1 is a schematic diagram of an application scenario of the present application, showing a user 1 controlling a smart device 2 through voice interaction, where the smart device 2 may be a mobile phone, a tablet computer, a TV, a smart speaker, or another smart home appliance or electronic device with voice interaction functions.
  • In the following, the case where the smart device 2 is a TV is taken as an example.
  • To save power consumption, the smart device 2 is usually in a low power consumption mode.
  • When the user 1 needs to issue instructions to the smart device 2 by voice, the user needs to first speak the wake-up word set for the smart device 2, e.g. "XXXX", followed by the command "Play movie". For the smart device 2, the processing flow can refer to the process shown in FIG. 2, where FIG. 2 is a schematic flow chart of a method for a smart device to process voice data.
  • After collecting the first voice data in S10, the smart device 2 first identifies in S20 whether the first voice data includes the first word, such as the wake-up word "XXXX". If the wake-up word is not included, the device does not switch to the normal working state but remains in the low power consumption state, and returns to S10 to continue collecting voice data. If it is recognized in S20 that the first voice data includes the first word, the smart device 2 switches to the working state in S30 according to the first word, and recognizes and executes the command in the first voice data in S40, or continues to collect subsequent voice data and then recognizes and executes the commands therein.
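The S10 to S40 loop can be simulated over a list of transcripts. This is a sketch only; splitting the command out of the text after the wake word is an illustrative assumption.

```python
def device_loop(utterances, wake_word):
    """Simulate the S10-S40 loop over a list of transcripts: stay in low
    power until the wake word is heard, then execute the command that
    follows it in the same utterance."""
    executed = []
    state = "low_power"
    for text in utterances:                    # S10: collect voice data
        if wake_word in text:                  # S20: recognize wake word
            state = "working"                  # S30: switch working state
            command = text.split(wake_word, 1)[1].strip()
            if command:
                executed.append(command)       # S40: execute the command
        # otherwise: remain in low power and keep collecting
    return state, executed
```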
  • When the first word in S20, such as the wake-up word "XXXX", is no longer being spoken (e.g., the interaction has ended), the smart device 2 switches from the working state back to the low power consumption state, and re-executes the steps of collecting voice data in S10 and recognizing the wake-up word.
  • After receiving the voice data from the user, the smart device 2 in the scene of FIG. 1 can process the voice data through its built-in machine learning model to obtain the wake-up words and commands therein; or the smart device 2 can send the voice data to a network server 3, which performs recognition and other processing on the voice data and returns the obtained wake-up words and commands to the smart device. Finally, the smart device 2 determines that the user 1 has spoken the command "Play movie", obtains movie data from the server 3, and plays the movie on its display screen 21.
  • In some embodiments, the wake-up word of the smart device is not fixed but can be deleted, modified, or replaced by the user, so as to enrich the user experience and improve functionality.
  • For example, the wake-up word preset by the supplier of the smart device 2 is "XXXX", and the user 1 can change the wake-up word to "YYYY", and so on.
  • The above "XXXX" and "YYYY" are merely illustrative examples.
  • The number of characters and the specific implementation of each wake-up word are not limited, as long as the wake-up words before and after the change are different; for example, the wake-up word may be changed from "Hisense Xiaoju" to "Xiaoju Xiaoju", and so on.
  • When the wake-up word of the smart device is changed to "YYYY", once the user forgets the modified wake-up word, or other home users do not know or have not adapted to the changed wake-up word and still speak the preset wake-up word "XXXX" to the smart device, the smart device will judge that the collected voice data does not include the wake-up word "YYYY" and will not switch the working state, so the user cannot issue commands to the smart device by voice. This makes users feel that they cannot "wake up" the smart device, and seriously degrades the user experience.
  • In addition, a family may include multiple smart terminals, such as a TV in the living room, an air conditioner in the bedroom, and a smart speaker. These smart terminals are set to different wake-up words by users, or are set to different wake-up words by default, and different wake-up words are required to wake up the corresponding devices. In this way, it is very likely that a user who originally wants to wake up the TV will call out the wake-up word of another device. The user may also be in different places, such as home, the office, or a public place, and the devices in these places may likewise need to be woken up with different wake-up words.
  • Therefore, the present application provides a voice data processing method and device to solve the technical problem that, in the above scenarios, the smart device may fail to wake up after the wake-up word is changed, which makes the smart device less intelligent.
  • the technical solution of the present application will be described in detail below with specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.
  • Fig. 3 is a schematic flow chart of an embodiment of the processing method of voice data provided by the present application.
  • the processing method shown in Fig. 3 can be applied in the scene shown in Fig. 1 and executed by the smart device 2.
  • the method includes:
  • In S101, the smart device collects first voice data, and identifies whether the first voice data includes the wake-up word and whether it includes preset content.
  • In S100, the smart device determines that its wake-up word is configured as a first word. Assuming the wake-up word before the change is "XXXX" and the user changes the wake-up word of the smart device to the first word "YYYY", the smart device will now switch the working state only after collecting voice data and recognizing that it includes the first word "YYYY" currently used as the wake-up word. It is understandable that once the wake-up word of the smart device is configured as the first word, unless the wake-up word is reconfigured, the smart device will repeatedly collect voice data and use the first word as the wake-up word to control the switching of the working state.
  • S100 may be that, after the smart device is started, it determines that its current wake-up word is configured as the first word; or S100 may specifically be that the smart device configures the current wake-up word as the first word according to the user's instruction.
  • In S20-S40, when the smart device determines that the first voice data collected in S101 includes the first word "YYYY", it switches the working state and executes the command after the first word in the first voice data, or continues to collect voice data and executes the commands therein.
  • The implementation of S20-S40 is the same as that shown in FIG. 2 and will not be repeated here. Conversely, when it is detected that the voice data does not include the first word currently used as the wake-up word, the smart device does not switch the working state.
  • When the smart device recognizes that the first voice data collected in S101 does not include the first word "YYYY" but includes the preset content, it determines that the user who spoke the first voice data hoped to wake up the smart device but spoke the wrong wake-up word. Therefore, the smart device reminds the user through S103 that the currently configured wake-up word of the smart device is the first word, and returns to S101 to re-collect voice data and identify it.
  • In some embodiments, the above preset content may include one or more of the following items, labeled a-c: a. at least one wake-up word configured before the wake-up word of the smart device was configured as the first word, denoted as a second word; for example, the default wake-up word provided by the supplier of the smart device is "XXXX".
  • Suppose the user once configured the wake-up word as "AAAA" and "BBBB", and after the latest configuration the current wake-up word is the first word "YYYY".
  • Then the preset content of the smart device may include the words "AAAA" and "BBBB" previously configured on the smart device. These second words are grouped into a second word set and stored in the smart device in the form of voice data. After voice data is subsequently received, a voice recognition model and other methods can be used to determine whether the voice data includes the stored preset content.
  • the server may send the second set of words to the smart device 2 .
  • the above preset content may further include: b. at least one wake-up word configured by the user account bound to the smart device, which is recorded as the third word.
  • FIG. 4 is a schematic diagram of the wake-up word of the smart device provided by the present application, where the user "logs in" to the smart device 2 through his user account while using it, realizing the "binding" between the user account and the smart device 2. At this time, the smart device 2 can obtain a third word set from the network server, where the third word set contains the wake-up words configured on other devices used under the user account.
  • As shown in FIG. 4, when the user logs in to the smart device with the user account and changes the wake-up word from "XXXX" to the new wake-up word "YYYY" through the path labeled 1, the smart device stores the changed wake-up word "YYYY", and then also sends the wake-up word "YYYY" to the server through the path labeled 2 for storage in the third word set corresponding to the user account.
  • When the server receives the wake-up words sent by different devices bound to the same user account, it stores these wake-up words in the third word set corresponding to the user account for recording.
  • When the smart device 2 shown in FIG. 4 detects that the user logs in with a user account, it can request the word set stored on the server according to the user account, so that the server sends the word set to the smart device through the path labeled 3.
  • In some embodiments, the above preset content may further include: c. the configured wake-up word of at least one other device, where other devices refer to electronic devices that also have voice recognition functions, such as smart speakers, computers, and mobile phones, which may be provided by suppliers different from that of the smart device.
  • The server provided by the supplier of the smart device 2 can obtain the wake-up words preset by other devices from the Internet through the path labeled 5, and store them in a fourth word set.
  • the server can send the fourth set of words to the smart device 2 through the path labeled 4 in FIG. 4 .
  • The preset content stored in the smart device shown in FIG. 4 may include one or more of the above items a-c. When the smart device recognizes that the voice data includes the first word, it switches the working state and executes the command; and when the smart device recognizes that the voice data does not include the first word but includes any of the preset content (a second word, third word, or fourth word), it prompts the first word. It can be understood that when the smart device recognizes that the voice data includes neither the first word nor the preset content, it does not respond, and re-collects voice data for recognition.
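Matching against the current wake word and the three preset-content sets a-c might look like the following sketch (the function and parameter names are illustrative assumptions):

```python
def classify_utterance(transcript, first_word, second_words, third_words, fourth_words):
    """Classify one utterance against the current wake word and the three
    preset-content sets: a) previous wake words of this device, b) wake
    words bound to the user account, c) wake words of other devices."""
    if first_word in transcript:
        return "wake"                       # switch state, execute command
    preset = set(second_words) | set(third_words) | set(fourth_words)
    if any(word in transcript for word in preset):
        return "prompt"                     # stay idle, prompt first word
    return "ignore"                         # stay idle, no response
```

Unioning the three sets reflects that any of the second, third, or fourth words triggers the same prompt branch.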
  • FIG. 5 is a schematic diagram of a way for a smart device to prompt a wake-up word provided in the present application.
  • As shown in FIG. 5, the smart device 2 can display, on its display interface 21, a text prompt message 211 in text form, such as "Please call me YYYY", which can be realized through a pop-up window on the UI interface, etc.; this embodiment does not limit the implementation of the UI.
  • After displaying the prompt information, the smart device may keep displaying it until it subsequently collects the user's third voice data and recognizes that the third voice data includes the first word, which indicates that the prompt has enabled the user to determine the new wake-up word, at which point the device stops displaying the prompt information on the display interface; or, to prevent impact on other display pages, the smart device can stop displaying the prompt information after a preset time (for example, 15 s).
  • FIG. 6 is a schematic diagram of another way for the smart device to prompt the wake-up word provided by this application.
  • When the smart device recognizes that the voice data does not include the first word but includes a pre-change wake-up word in the preset content, it can play, through a speaker or the like, a voice prompt message of the first word such as "Please call me YYYY". It is understandable that the above voice prompt information is only an example; richer and more user-friendly voice prompts can also be played, such as "My name is YYYY now, please use my new name to wake me up" or "My name is YYYY now, I am waiting for you to wake me up anytime".
  • The voice data processing method provided by this application can implement the following scenarios in a specific implementation. Scenario 1: after user A changes the wake-up word of the smart device, user B speaks the pre-change wake-up word to the smart device, and the smart device prompts the changed wake-up word.
  • Scenario 2: the user forgets the wake-up word after changing it, or habitually speaks the pre-change wake-up word, and the smart device prompts the changed wake-up word.
  • Scenario 3: the user speaks the wake-up word of another device to the smart device, and the smart device prompts its own wake-up word.
  • In summary, in addition to collecting the first voice data and switching the working state according to the first word in it, the smart device also prompts the user with the first word when the first voice data does not include the first word but includes the preset content. Thus, when the wake-up word of the smart device can be changed, this prevents the user from being unable to "wake up" the smart device because the user forgets or does not know the modified wake-up word, or mistakenly utters the wake-up word of another device: the smart device actively prompts the user with the correct wake-up word when the user, while actually hoping to wake up the device, mistakenly speaks vocabulary in the preset content, helping the user speak the current wake-up word to wake up the smart device, thereby improving the intelligence of the smart device and the user experience. Moreover, the whole process can be realized and optimized through the software of the smart device alone, which avoids changes to the hardware, has low design and manufacturing costs, and is easy to implement and popularize.
  • Fig. 7 is a schematic flow chart of another embodiment of the voice data processing method provided by the present application.
  • The embodiment shown in FIG. 7 is based on the embodiment shown in FIG. 3. When the smart device recognizes that the collected first voice data does not include the first word but includes the preset content, it adds 1 to the detection count in S201, where the detection count is the number of consecutively collected pieces of voice data that do not include the first word but include the preset content. Subsequently, when it is determined in S202 that the accumulated detection count is greater than the preset count, the smart device prompts the first word through S103.
  • For example, when the first voice data collected by the smart device three consecutive times does not include the first word "YYYY" currently used as the wake-up word but includes the word "XXXX" in the preset content, this indicates that the user keeps calling out a word in the preset content to wake up the smart device but is using the wrong wake-up word. Therefore, after it is detected for the third time that the first voice data does not include the first word "YYYY" but includes the same word "XXXX" in the preset content, the smart device prompts the user with the first word in the manner shown in FIG. 5 or FIG. 6.
  • In some embodiments, the above detection count may also be the number of times that voice data collected within a preset time period (for example, 1 minute) does not include the first word but includes the preset content. Therefore, in the embodiment shown in FIG. 7, by calculating and accumulating the detection count, it is verified whether the purpose of the user's utterance of the preset content is to wake up the smart device, so as to ensure the accuracy and effectiveness of the subsequent prompts and improve the processing accuracy and efficiency of the smart device.
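The time-window variant of the detection count can be sketched with a sliding window. This is illustrative only; the one-minute window matches the example, and the class name is an assumption.

```python
from collections import deque

class WindowedDetectionCounter:
    """Count mis-wake detections inside a sliding time window (e.g. one
    minute) instead of requiring strictly consecutive utterances."""

    def __init__(self, window_s, preset_count):
        self.window_s = window_s
        self.preset_count = preset_count
        self.hits = deque()                # timestamps of mis-wake detections

    def record(self, t):
        """Record one 'preset content but no first word' detection at time
        t (seconds); return True when the user should be prompted."""
        self.hits.append(t)
        while self.hits and t - self.hits[0] > self.window_s:
            self.hits.popleft()            # drop detections outside window
        return len(self.hits) > self.preset_count
```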
  • the implementation of other steps in FIG. 7 is the same as that in FIG. 3 and will not be repeated here.
  • Fig. 8 is a schematic flow chart of another embodiment of the voice data processing method provided by the present application.
  • The embodiment shown in FIG. 8 is based on the embodiment shown in FIG. 3. When the smart device recognizes that the collected first voice data does not include the first word but includes the preset content, it does not directly prompt the first word through S103, but continues to collect second voice data in S301, where the smart device can collect the valid words the user speaks after the first voice data until the end of stream recognition, and the collected data is recorded as the second voice data.
  • the detected sentence may be a sentence in which the user asks the smart device for the first word
  • The semantics of the sentence included in the second voice data may be determined to be an inquiry about the first word by means of semantic recognition.
  • According to the above-mentioned sentences included in the collected second voice data, the smart device determines that the user really wants to wake up the smart device but cannot determine the wake-up word.
  • The smart device then prompts the first word in the manner shown in FIG. 5 or FIG. 6. Therefore, in this embodiment, even if the user does not say the wake-up word, the smart device can still respond to the user's inquiry about the wake-up word, further enriching the functions of the smart device and improving its degree of intelligence.
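The check that the second voice data asks for the wake word is described as semantic recognition. A crude keyword stand-in is sketched below; a real system would use an NLU model, and the query patterns are purely illustrative.

```python
# Hypothetical query patterns; a production system would use semantic
# recognition (an NLU model) rather than keyword matching.
QUERY_PATTERNS = (
    "what's your name",
    "what is your wake word",
    "how do i wake you",
    "what should i call you",
)

def asks_for_wake_word(transcript):
    """Return True if the utterance looks like a question asking for the
    currently configured wake word (the first word)."""
    text = transcript.lower()
    return any(pattern in text for pattern in QUERY_PATTERNS)
```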
  • Fig. 9 is a schematic diagram of the smart device provided by the present application to realize the processing of the preset content.
  • As shown in FIG. 9, the smart device stores the preset content in its local storage, which can hold the preset main wake-up word of the smart device, popular wake-up words of other brands of smart devices, and so on.
  • The cloud storage provided by the supplier's server can manage the newly added popular wake-up words issued to the smart device and the wake-up words corresponding to user accounts through two modes: operation and account.
  • The operation management mode means that if new popular wake-up words of other devices appear in the market, smart devices of the same type are identified through a feature code or similar means, and the newly added wake-up words are issued to such devices in batches.
  • The account management mode means that a wake-up word changed on a device the user has logged in to is bound to and stored synchronously with the user account through the cloud. Then, when the smart device is started, it first checks the local wake-up word storage: if no wake-up word is stored locally, it pulls the operation-managed wake-up word data and stores it; if the user account is online, the account-bound wake-up words are pulled and merged with the local store; and if the user changes the wake-up word locally and the user account is detected to be online after the operation is completed, the device actively pushes the update to the cloud.
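The startup merge of local, operation-managed, and account-bound wake words described above can be sketched as follows (function and parameter names are assumptions for the sketch):

```python
def sync_wake_words(local, operation_cloud, account_cloud, account_online):
    """Startup merge: if nothing is stored locally, pull the operation-
    managed popular wake words; if the user account is online, merge its
    account-bound wake words into the local store."""
    merged = list(local)
    if not merged:
        merged.extend(operation_cloud)     # first boot: pull operated list
    if account_online:
        for word in account_cloud:         # merge account-bound wake words
            if word not in merged:
                merged.append(word)
    return merged
```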
  • Figure 10 is a schematic flow diagram of an embodiment of the processing of voice data provided by the present application. After the user speaks the first voice data, the smart device uses a machine learning model to verify the wake-up word. If it determines that the first voice data includes the first word, it responds normally and executes the command. If the first word is not included but the preset content is included (the preset content here being a wake-up word from before the modification), the detection count is incremented and the second voice data spoken by the user continues to be collected.
  • when the smart device determines that the user is asking for the first word, it prompts the user with the first word; alternatively, when the detection count is greater than the preset number of times, the smart device prompts the user with the first word.
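  • the counting logic of this flow can be sketched as follows; the threshold `PRESET_LIMIT` and the substring-based matching are illustrative stand-ins for the model-based detection described above:

```python
# Sketch of the Fig. 10 counting logic: count consecutive utterances that
# contain the preset content (e.g. an old wake word) but not the current
# first word, and prompt once the count exceeds a preset number of times.

PRESET_LIMIT = 3  # illustrative value for the "preset number of times"

def should_prompt(utterance, first_word, preset_words, state):
    """Update the detection count in state; return True when the device
    should prompt the user with the current first word."""
    if first_word in utterance:
        state["count"] = 0          # normal wake-up: respond and reset
        return False
    if any(w in utterance for w in preset_words):
        state["count"] += 1         # old or other-brand wake word detected
        return state["count"] > PRESET_LIMIT
    return False                    # unrelated speech: ignore

state = {"count": 0}
for text in ["XXXX hi", "XXXX hi", "XXXX hi", "XXXX hi"]:
    prompt = should_prompt(text, first_word="YYYY", preset_words=["XXXX"], state=state)
```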
  • the smart device can use its own machine learning model to identify whether the first voice data includes the first word and the preset content; alternatively, the smart device can send the first voice data to a server in the cloud, which recognizes whether the first voice data includes the first word and the preset content and returns the recognition result to the smart device, thereby reducing the computational load on the smart device.
  • the smart device can also perform recognition by comparing the pinyin of each character in the first voice data with the pinyin of the wake-up word and the pinyin of the preset content, so as to make the matching fuzzier and improve the recognition rate.
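  • such pinyin-based fuzzy matching might look like the following sketch; the tiny character-to-pinyin table is an illustrative stand-in for a full pinyin lexicon (which could, for example, be built with a library such as pypinyin):

```python
# Sketch of fuzzy wake-word matching via pinyin comparison. PINYIN is an
# illustrative stand-in for a full character-to-pinyin lexicon; note that
# the homophones 小 and 筱 both map to "xiao", which is what makes the
# matching fuzzy.

PINYIN = {"小": "xiao", "筱": "xiao", "聚": "ju", "海": "hai", "信": "xin"}

def to_pinyin(text):
    """Convert a string of characters to a list of pinyin syllables."""
    return [PINYIN[ch] for ch in text if ch in PINYIN]

def contains_by_pinyin(utterance, wake_word):
    """True if the wake word's pinyin occurs contiguously in the utterance's."""
    u, w = to_pinyin(utterance), to_pinyin(wake_word)
    if not w:
        return False
    return any(u[i:i + len(w)] == w for i in range(len(u) - len(w) + 1))
```

Matching on pinyin rather than on characters lets a mis-transcribed homophone still hit the wake word, which is the increased "fuzziness" described above.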
  • the verification of the wake-up word can be performed, for example, by wake-up model scoring: when the first voice data is collected, the wake-up model scores it against the wake-up word. If the scoring result is the first word currently set by the user, the device responds normally to the user's wake-up. If the result is not the currently set first word but is the stored preset content, the device collects the second voice data and enters the semantic-analysis push preparation stage; the second voice data is then pushed to the server for semantic recognition and processing. When the recognition result sent by the server is received and the semantics of the second voice data is determined to be a query for the first word, the smart device prompts the first word again.
  • Fig. 11 is a schematic diagram of the processing structure with which the smart device provided by the present application processes speech data. The speech recognition technology for speech data processing mainly includes four parts: signal processing and feature extraction, the acoustic model, the language model, and the decoder.
  • signal processing and feature extraction take the audio signal as input, enhance the speech by eliminating noise and channel distortion, transform the signal from the time domain to the frequency domain, and extract suitable representative features for the subsequent acoustic model.
  • there are currently many methods for extracting sound feature vectors, such as Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC), and the Multimedia Content Description Interface (MPEG-7).
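  • as an illustration of the feature-extraction stage, a compact MFCC computation over framed, windowed audio might look as follows (the frame length, hop size, and filter counts are typical defaults, not values from the application):

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    """Compute a simple MFCC matrix (frames x n_ceps) from a 1-D signal."""
    # 1. Frame the signal and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 2. Time domain -> frequency domain: power spectrum per frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Triangular mel filterbank (filters equally spaced on the mel scale).
    n_bins = power.shape[1]
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_bins - 1) * hz_pts / (sr / 2)).astype(int)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # 4. Log mel energies, then a DCT-II to decorrelate (cepstral coefficients).
    log_mel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return log_mel @ dct.T
```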
  • the acoustic model converts speech into an acoustic representation, that is, it finds the probability that a given speech segment originates from a given acoustic symbol.
  • the most commonly used acoustic modeling method is the Hidden Markov Model (HMM), in which the state is a hidden variable, the speech is the observation, and the jumps between states conform to the Markov assumption; the state transition probability density is mostly modeled by a geometric distribution.
  • common modeling choices include the Gaussian mixture model (GMM), the deep neural network (DNN), the convolutional neural network (CNN), and the recurrent neural network (RNN).
  • the FSMN proposed by iFLYTEK is an improved network structure based on the DNN.
  • a delay structure is introduced into the hidden layer of the DNN, and the historical information of the hidden layer at times t-N to t-1 is used as input to the next layer, thereby introducing the historical information of the speech sequence while avoiding the problems caused by training an RNN with BPTT.
  • the problems caused by training an RNN with BPTT include vanishing gradients, high computational complexity, and so on.
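  • the FSMN memory mechanism described above can be sketched as follows; the scalar memory weights and single-layer setup are a simplification for illustration:

```python
import numpy as np

def fsmn_layer(hidden_seq, weights, memory_weights):
    """Sketch of an FSMN-style layer: feed each hidden state plus a weighted
    sum of its previous N hidden states (the memory block) to the next layer."""
    T, d = hidden_seq.shape
    N = len(memory_weights)
    out = np.zeros((T, weights.shape[1]))
    for t in range(T):
        # memory: weighted combination of h[t-N] .. h[t-1] (zeros before start)
        mem = sum((memory_weights[i] * hidden_seq[t - 1 - i]
                   for i in range(N) if t - 1 - i >= 0), np.zeros(d))
        out[t] = np.tanh((hidden_seq[t] + mem) @ weights)
    return out
```

Because the memory is a fixed feedforward combination of past hidden states, the layer sees sequence history without the recurrent feedback loop that makes BPTT expensive and gradient-unstable.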
  • the language model uses the training corpus to learn the relationships between words, so as to estimate the likelihood of hypothesized word sequences, also known as the language model score.
  • statistical language models have become the mainstream language processing technology in speech recognition. There are many kinds of statistical language models, such as the N-gram language model, the Markov N-gram model, exponential models, decision tree models, etc.
  • the N-gram language model is the most commonly used statistical language model, especially the bigram and trigram language models.
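  • as an illustration of the language model score, a bigram model estimated by maximum likelihood from a toy corpus might look as follows:

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate P(w2 | w1) by maximum likelihood over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])                 # history counts
        bigrams.update(zip(tokens[:-1], tokens[1:])) # pair counts
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

def sentence_score(prob, sent):
    """Language model score: product of bigram probabilities over the sentence."""
    tokens = ["<s>"] + sent + ["</s>"]
    score = 1.0
    for w1, w2 in zip(tokens[:-1], tokens[1:]):
        score *= prob(w1, w2)
    return score
```

A production model would add smoothing for unseen bigrams; this unsmoothed version only illustrates how a word-sequence likelihood is composed.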
  • the decoder recognizes the input sequence of speech frames based on the trained acoustic model, in combination with the dictionary and the language model.
  • its main work includes: given the input feature sequence x_1^T, finding the best word string through Viterbi search in the search space composed of four knowledge sources: the acoustic model, the acoustic context, the pronunciation dictionary, and the language model.
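  • the Viterbi search can be sketched as follows over a small dense HMM; a real decoder searches a far larger graph composed from the acoustic model, context, dictionary, and language model:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, observations):
    """Find the most likely hidden-state sequence for an observation sequence.
    log_init: (S,), log_trans: (S, S), log_emit: (S, O), all log-probabilities."""
    S, T = len(log_init), len(observations)
    delta = np.full((T, S), -np.inf)    # best log-score of a path ending in s at t
    back = np.zeros((T, S), dtype=int)  # backpointers for path recovery
    delta[0] = log_init + log_emit[:, observations[0]]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_trans[:, s]
            back[t, s] = np.argmax(scores)
            delta[t, s] = scores[back[t, s]] + log_emit[s, observations[t]]
    path = [int(np.argmax(delta[-1]))]  # trace back the best state string
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```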
  • the voice data processing method provided by the embodiments of the present application has been introduced above. To realize the various functions of the method provided by the above embodiments, the smart device serving as the execution subject may include a hardware structure and/or a software module, and realize the above functions in the form of a hardware structure, a software module, or a hardware structure plus a software module. Whether one of the above functions is executed as a hardware structure, a software module, or a hardware structure plus a software module depends on the specific application and the design constraints of the technical solution.
  • FIG. 12 is a schematic structural diagram of an embodiment of a voice data processing device provided by the present application.
  • the division of the above device into modules is only a division of logical functions; in actual implementation, the modules may be fully or partially integrated into one physical entity, or physically separated.
  • these modules may all be implemented in the form of software called by a processing element, or all in the form of hardware; alternatively, some modules may be implemented as software called by a processing element and others in hardware. A module may be a separately established processing element, or may be integrated into a chip of the above device; it may also be stored in the memory of the above device in the form of program code, with a processing element of the device calling and executing the function of the module.
  • each step of the above method or each module above can be completed by an integrated logic circuit of hardware in the processor element or an instruction in the form of software.
  • the above modules may be one or more integrated circuits configured to implement the above method, for example: one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field programmable gate arrays (FPGA), etc.
  • the processing element may be a general-purpose processor, such as a central processing unit (central processing unit, CPU) or other processors that can call program codes.
  • these modules can be integrated together and implemented in the form of a system-on-a-chip (SOC).
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • when implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (SSD)), etc.
  • the present application also provides an electronic device, including a processor and a memory connected through a bus, wherein a computer program is stored in the memory; when the processor executes the computer program, the processor can be used to execute any of the voice data processing methods in the above-mentioned embodiments of the present application.
  • the present application also provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed, it can be used to perform any of the voice data processing methods provided in the foregoing embodiments of the present application.
  • the embodiment of the present application also provides a chip for running instructions, and the chip is used to execute the voice data processing method provided in any one of the foregoing embodiments of the present application.
  • the present application also provides a computer program product, including a computer program; when executed by a processor, the computer program can be used to implement any of the voice data processing methods described above in the present application.
  • the aforementioned program can be stored in a computer-readable storage medium.
  • when executed, the program performs the steps of the above method embodiments; the aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Telephone Function (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A speech data processing method and apparatus. The method comprises: configuring a wake-up word of a smart device as a first word (S100); processing collected first speech data (S101); and, when the first speech data does not comprise the first word but comprises preset content (S102), prompting a user with the first word (S103). In this way, situations in which the smart device cannot be woken because the user forgets or does not know a modified wake-up word, or mistakenly says the wake-up word of another device, are prevented, so that the intelligence level of the smart device is enhanced and the usage experience of users of smart devices applying the method and apparatus is improved.

Description

Voice data processing method and device
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202111516804.3, titled "Method and Device for Processing Voice Data", filed with the State Intellectual Property Office on December 13, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of voice data processing, and in particular to a voice data processing method and device.
Background
With the development of electronic technology, more and more smart devices such as televisions and speakers are equipped with voice interaction functions, so that users can issue instructions to a smart device by speaking; after the device collects the user's voice data, it recognizes and executes the instructions therein.
In the prior art, in order to save power, a smart device is usually in a low-power working mode. When a user talks to the smart device, the user needs to first say the smart device's wake-up word to "wake up" the device and switch it to the normal working state. Correspondingly, only after the smart device detects the wake-up word does it continue to process the instructions the user speaks after the wake-up word.
In the prior art, the wake-up words of some smart devices can be changed. After the wake-up word is changed, once the user forgets or cannot determine the changed wake-up word, the user will be unable to "wake up" the smart device, which results in insufficient intelligence of the smart device and seriously degrades the user experience.
Summary
The present application provides a voice data processing method and device, used to solve the technical problem that failure to wake up a smart device results in insufficient intelligence of the smart device and a poor user experience.
The present application provides a voice data processing method, including: determining that the wake-up word of the smart device is configured as a first word; collecting first voice data of a user; when it is recognized that the first voice data includes the first word, the smart device switches its working state; when it is recognized that the first voice data does not include the first word but includes preset content, the smart device does not switch its working state and prompts the user with the first word; when it is recognized that the first voice data includes neither the first word nor the preset content, the smart device does not switch its working state.
In an embodiment of the first aspect of the present application, when it is recognized that the first voice data does not include the first word but includes the preset content, the smart device does not switch its working state and prompts the user with the first word, which includes: when it is recognized that the first voice data does not include the first word but includes the preset content, incrementing a detection count by 1, the detection count being the number of consecutive times that collected voice data does not include the first word but includes the preset content; and when the detection count is greater than a preset number of times, prompting the user with the first word.
In an embodiment of the first aspect of the present application, when it is recognized that the first voice data does not include the first word but includes the preset content, the smart device does not switch its working state and prompts the user with the first word, which includes: when it is recognized that the first voice data does not include the first word but includes the preset content, collecting second voice data of the user; and when it is recognized that the second voice data includes a sentence whose semantics is an inquiry about the first word, prompting the user with the first word.
In an embodiment of the first aspect of the present application, the preset content includes one or more of the following: at least one wake-up word configured on the smart device before the first word; at least one wake-up word configured under the user account bound to the smart device; and a configured wake-up word of at least one other device having a voice data processing function.
In an embodiment of the first aspect of the present application, the method further includes: when the smart device starts, obtaining from a storage device the at least one wake-up word configured on the smart device before the first word, and obtaining from a server the configured wake-up word of the at least one other device having a voice data processing function; and when the user logs in to the smart device with an account, obtaining from the server, according to the user's account, the at least one wake-up word configured under the user account bound to the smart device.
In an embodiment of the first aspect of the present application, the method further includes: when the user logs in to the smart device with an account and changes the wake-up word of the smart device from the first word to a second word, sending the second word to the server so that the server records the second word.
In an embodiment of the first aspect of the present application, prompting the user with the first word includes: displaying text prompt information of the first word on a display interface; or playing voice prompt information of the first word.
In an embodiment of the first aspect of the present application, after a preset time, the smart device stops prompting the user with the first word; or, after third voice data of the user is collected and it is recognized that the third voice data includes the first word, the smart device stops prompting the user with the first word.
In an embodiment of the first aspect of the present application, after collecting the first voice data of the user, the method further includes: determining, through a machine learning model, whether the first voice data includes the first word and the preset content; or determining the pinyin of each character in the first voice data, and determining, from the pinyin of each character, the pinyin of the first word, and the pinyin of the preset content, whether the first voice data includes the first word and the preset content.
A second aspect of the present application provides a voice data processing device for executing the voice data processing method provided in any one of the first aspect of the present application. The device includes: a determination module, configured to determine that the wake-up word of the smart device is configured as a first word; a collection module, configured to collect first voice data of a user; a processing module, configured to recognize whether the first voice data includes the first word and whether it includes preset content, wherein when it is recognized that the first voice data includes the first word, the smart device switches its working state; when it is recognized that the first voice data does not include the first word but includes the preset content, the smart device does not switch its working state; and when it is recognized that the first voice data includes neither the first word nor the preset content, the smart device does not switch its working state; and a prompt module, configured to prompt the user with the first word when it is recognized that the first voice data does not include the first word but includes the preset content.
Brief Description of the Drawings
In order to illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Apparently, the drawings in the following description are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of an application scenario of the present application;
Fig. 2 is a schematic flowchart of a method for a smart device to process voice data;
Fig. 3 is a schematic flowchart of an embodiment of the voice data processing method provided by the present application;
Fig. 4 is a schematic diagram of wake-up words of the smart device provided by the present application;
Fig. 5 is a schematic diagram of one way in which the smart device provided by the present application prompts the wake-up word;
Fig. 6 is a schematic diagram of another way in which the smart device provided by the present application prompts the wake-up word;
Fig. 7 is a schematic flowchart of another embodiment of the voice data processing method provided by the present application;
Fig. 8 is a schematic flowchart of yet another embodiment of the voice data processing method provided by the present application;
Fig. 9 is a schematic diagram of how the smart device provided by the present application processes the preset content;
Fig. 10 is a schematic flowchart of an embodiment of processing voice data provided by the present application;
Fig. 11 is a schematic diagram of the processing structure with which the smart device provided by the present application processes voice data;
Fig. 12 is a schematic structural diagram of an embodiment of the voice data processing device provided by the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Apparently, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
The terms "first", "second", "third", "fourth", etc. (if any) in the specification, claims, and drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application described herein can, for example, be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having", and any variants thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the expressly listed steps or units, but may include other steps or units not expressly listed or inherent to such a process, method, product, or device.
Before the embodiments of the present application are formally introduced, the scenarios to which the present application applies, and the problems existing in those scenarios, are described with reference to the drawings. For example, Fig. 1 is a schematic diagram of an application scenario of the present application, showing a user 1 controlling a smart device 2 through voice interaction, where the smart device 2 may be a mobile phone, tablet computer, television, smart speaker, or another smart home appliance or electronic device with voice interaction functions; in Fig. 1, the smart device 2 is a television as an example.
In some embodiments, in order to save power, the smart device 2 is normally in a low-power working mode. When user 1 needs to issue an instruction to the smart device 2 by voice, the user first says the wake-up word "XXXX" set on the smart device 2, and then says the instruction "play a movie". For the smart device 2, the processing flow may refer to the process shown in Fig. 2, which is a schematic flowchart of a method for a smart device to process voice data. After the smart device 2 collects first voice data in S10 through a voice collection apparatus such as a microphone, it first identifies in S20 whether the first voice data includes the first word, e.g., the wake-up word "XXXX". If the wake-up word is not included, the device does not switch to the normal working state but remains in the low-power state, and returns to S10 to continue collecting voice data. If it is recognized in S20 that the first voice data includes the first word, the smart device 2 switches to the working state in S30 according to the first word, and in S40 recognizes and executes the command in the first voice data, or continues to collect subsequent voice data and recognizes and executes the commands therein. Finally, after the command is executed, or after no further user speech is detected for a period of time, the smart device 2 switches from the working state back to the low-power state and re-executes the steps of collecting voice data and recognizing the wake-up word in S10.
In some embodiments, the smart device 2 shown in the scenario of Fig. 1 may, after receiving the voice data uttered by the user, process the voice data through its built-in machine learning model to obtain the wake-up words and commands therein; alternatively, the smart device 2 may send the voice data to a network server 3, which recognizes the voice data and returns the obtained wake-up words and commands to the smart device. Finally, the smart device 2 determines that user 1 has spoken the command "play a movie", obtains the movie data from server 3, and plays the movie on its display screen 21.
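The S10-S40 loop described above can be sketched as a simple two-state machine; the function name and the wake-up word "XXXX" below are illustrative placeholders, not part of the application:

```python
# Minimal sketch of the S10-S40 wake-word loop described above, operating
# on already-transcribed utterances (the microphone pipeline, ASR model,
# and command handler are abstracted away).

LOW_POWER, WORKING = "low_power", "working"

def wake_loop(utterances, wake_word="XXXX"):
    """Process a stream of utterances; return the commands executed."""
    state, executed = LOW_POWER, []
    for text in utterances:          # S10: collect voice data
        if state == LOW_POWER:
            if wake_word in text:    # S20: check for the wake-up word
                state = WORKING      # S30: switch working state
                command = text.replace(wake_word, "").strip()
                if command:          # S40: run a command in the same utterance
                    executed.append(command)
            # else: stay in low-power mode and keep listening
        else:
            if text:                 # S40: subsequent command
                executed.append(text)
            state = LOW_POWER        # return to low power after handling
    return executed
```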
In some embodiments, the wake-up word of a smart device is not fixed; the user may delete, modify, or replace it, so as to enrich the user experience and improve functionality. For example, the wake-up word preset by the supplier of the smart device 2 shown in Fig. 1 is "XXXX", and user 1 may change the wake-up word to "YYYY", etc. The above "XXXX" and "YYYY" are only generic examples; the number of characters and the specific implementation of each wake-up word are not limited, as long as the wake-up words before and after the change differ. For example, the wake-up word may be changed from "海信小聚" ("Hisense Xiaoju") to "小聚小聚" ("Xiaoju Xiaoju").
However, after the wake-up word of the smart device is changed to "YYYY", once the user forgets the modified wake-up word, or other household users do not know the changed wake-up word or have not adapted to it and still say the preset wake-up word "XXXX" to the smart device, the smart device will determine that the collected voice data does not include the wake-up word "YYYY" and therefore will not switch its working state. The user is then unable to issue voice commands to the smart device, creating the impression that the smart device cannot be "woken up" and seriously degrading the user experience.
In other embodiments, a household includes multiple smart terminals, such as a TV in the living room, an air conditioner in the bedroom, and a smart speaker. These terminals are set to different wake-up words by the user, or default to different wake-up words, so each device can only be woken by its own wake-up word. It is therefore quite likely that a user who intends to wake the TV will instead utter the wake-up word of another device. The user may also move between different places, such as home, office, and public venues, where the devices likewise tend to require different wake-up words; a user accustomed to calling out a particular wake-up word at home may well call out the same word elsewhere in an attempt to wake other devices. Clearly, because different devices have different wake-up words, the wake-up word the user utters may not correspond to the target device, and the target device cannot be woken.
Therefore, the present application provides a voice data processing method and apparatus to solve the technical problem that, in the above scenarios, a smart device may fail to wake up after its wake-up word has been changed, which lowers the degree of intelligence of the device. The technical solution of the present application is described in detail below with specific embodiments. The following embodiments may be combined with one another, and identical or similar concepts or processes may not be described again in some embodiments.
FIG. 3 is a schematic flowchart of an embodiment of the voice data processing method provided by the present application. The method shown in FIG. 3 may be applied in the scene shown in FIG. 1 and is executed by the smart device 2. The method includes:
S101: The smart device collects first voice data and identifies whether the first voice data includes the wake-up word and whether it includes preset content.
In S100, performed before S101, the smart device determines that its wake-up word has been configured as a first word. Assume the wake-up word before the change was "XXXX" and the user has changed it to the first word "YYYY"; the smart device will now switch its working state only after collecting voice data and recognizing that it contains the first word "YYYY" currently serving as the wake-up word. It can be understood that once the wake-up word is configured as the first word, and unless it is reconfigured, the smart device repeatedly collects voice data and switches its working state according to the first word. In some embodiments, S100 may consist of the smart device determining, after startup, that its current wake-up word is configured as the first word; alternatively, S100 may consist of the smart device configuring the current wake-up word as the first word according to the user's instruction.
As shown in S20-S40 of FIG. 3, when the smart device determines that the first voice data collected in S101 includes the first word "YYYY", it switches its working state and executes the command following the first word in the first voice data, or continues to collect voice data and execute the commands therein. S20-S40 are implemented in the same way as shown in FIG. 2 and are not repeated here. Otherwise, when the device detects that the voice data does not include the first word currently serving as the wake-up word, it does not switch its working state.
In particular, in S102-S103 of this embodiment of the present application, when the smart device recognizes that the first voice data collected in S101 does not include the first word "YYYY" but does include the preset content, it determines that the user uttered the first voice data intending to wake the device but spoke a wrong wake-up word. Therefore, in S103, the smart device prompts the user, through a visual page (UI), synthesized speech (TTS), or the like, that the currently configured wake-up word of the smart device is the first word, and then returns to S101 to collect and recognize voice data again.
In some embodiments, the preset content may include one or more of the following items a-c. Item a: at least one wake-up word that was configured on the smart device before the wake-up word was configured as the first word, denoted the second word. For example, the default wake-up word provided by the supplier of the smart device is "XXXX"; during use, the user has at various times configured the wake-up word as "AAAA" and "BBBB", and after the latest configuration the current wake-up word is the first word "YYYY". In this case the preset content may include the previously configured words "AAAA" and "BBBB". These second words are stored in the smart device in the form of a second word set; after voice data is subsequently received, a speech recognition model or similar means can be used to determine whether the voice data includes the stored preset content. When the smart device 2 starts, the server may send the second word set to the smart device 2.
In some embodiments, the preset content may further include item b: at least one wake-up word configured under the user account bound to the smart device, denoted the third word. For example, FIG. 4 is a schematic diagram of the wake-up words of the smart device provided by the present application. While using the smart device 2, the user "logs in" to the device with a user account, thereby binding the user account to the smart device 2. The smart device 2 can then obtain from the network server a third word set containing the wake-up words configured on other devices used under that user account.
Specifically, as shown in FIG. 4, when the user logs in to the smart device with the user account and changes the wake-up word from "XXXX" to the new wake-up word "YYYY" via the path labeled ①, the smart device stores the changed wake-up word "YYYY" and then sends it to the server via the path labeled ②, where it is stored in the third word set corresponding to the user account. Whenever the server receives a wake-up word sent by any device bound to the same user account, it records that wake-up word in the third word set corresponding to the account. Thus, when the smart device 2 shown in FIG. 4 detects that the user has logged in with the user account, it can request the word set stored on the server according to the user account, and the server sends the word set to the smart device via the path labeled ③.
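The bookkeeping behind paths ② and ③ can be sketched as follows. This is a minimal illustration of the server-side behavior described above, not the patent's actual implementation; the class and method names are assumptions for the sake of the example.

```python
class WakeWordServer:
    """Toy model of the server-side bookkeeping in FIG. 4.

    Each user account maps to a third word set holding every wake-up
    word reported by any device bound to that account.
    """

    def __init__(self):
        self._third_word_sets = {}  # account id -> set of wake-up words

    def report_wake_word(self, account, wake_word):
        # Path 2: a device pushes its newly configured wake-up word.
        self._third_word_sets.setdefault(account, set()).add(wake_word)

    def fetch_third_words(self, account):
        # Path 3: a device that detects a login pulls the account's set.
        return set(self._third_word_sets.get(account, set()))
```

In this sketch, a device that changes its wake-up word calls `report_wake_word`, and any device on which the same account later logs in calls `fetch_third_words` to obtain the accumulated set.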
In some embodiments, the preset content may further include item c: one or more wake-up words configured on at least one other device having a voice data processing function, denoted the fourth word. Here, other devices are electronic devices that likewise have a speech recognition function, such as smart speakers, computers, and mobile phones, which may be provided by suppliers different from that of the smart device. As shown in FIG. 4, the server provided by the supplier of the smart device 2 can obtain the wake-up words preset on other devices from the Internet via the path labeled ⑤ and store them in a fourth word set. When the smart device 2 starts, the server can send the fourth word set to the smart device 2 via the path labeled ④ in FIG. 4.
In some embodiments, the preset content stored in the smart device shown in FIG. 4 may include one or more of items a-c above. When the smart device recognizes that the voice data includes the first word, it switches its working state and executes the command; when it recognizes that the voice data does not include the first word but includes any of the preset content (a second word, third word, or fourth word), it prompts the first word. It can be understood that when the smart device recognizes that the voice data includes neither the first word nor the preset content, it does not respond and collects voice data again for recognition.
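The three-way decision just described (current wake-up word → switch state; any word from the preset content → prompt; otherwise → no response) can be sketched as follows. This is a simplified illustration operating on already-recognized text; the function name and word sets are hypothetical.

```python
def handle_recognized_text(text, first_word, second_words, third_words, fourth_words):
    """Decide how the smart device reacts to recognized voice text.

    Returns "wake" when the current wake-up word (first word) is present,
    "prompt" when any second/third/fourth word from the preset content is
    present instead, and "ignore" otherwise.
    """
    preset_content = set(second_words) | set(third_words) | set(fourth_words)
    if first_word in text:
        return "wake"
    if any(word in text for word in preset_content):
        return "prompt"
    return "ignore"
```

A real device would apply this decision to the output of a wake-word model rather than to plain strings, but the branching structure is the same.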
In some embodiments, FIG. 5 is a schematic diagram of one way in which the smart device provided by the present application prompts the wake-up word. Assume the first word of the smart device is "YYYY" and the preset content includes the pre-change wake-up word "XXXX". When the user says "XXXX, play a movie" and the smart device recognizes that the voice data does not include the first word but includes the pre-change wake-up word in the preset content, the smart device 2 may display textual prompt information 211 on its display interface 21: "Please call me YYYY". This may be implemented, for example, as a pop-up window in the UI; this embodiment does not limit the UI implementation. In some embodiments, after displaying the textual prompt, the smart device may keep it on screen until it subsequently collects third voice data from the user and recognizes that the third voice data includes the first word, indicating that the prompt has enabled the user to learn the new wake-up word, at which point it stops displaying the prompt; alternatively, to avoid interfering with other display pages, the smart device may stop displaying the prompt after a preset time (for example, 15 s).
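The two dismissal conditions for the textual prompt (the first word is heard, or the preset display time elapses) can be sketched as a small predicate. The parameter names and the 15-second default are taken from the example above; everything else is an assumption for illustration.

```python
def prompt_should_remain(shown_at, now, latest_text, first_word, timeout_s=15.0):
    """Decide whether the "Please call me YYYY" prompt stays on screen.

    The prompt is dismissed either when later voice data contains the
    first word, or after a preset display time (15 s in the example).
    """
    if latest_text is not None and first_word in latest_text:
        return False  # user has used the new wake-up word
    if now - shown_at >= timeout_s:
        return False  # preset display time elapsed
    return True
```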
FIG. 6 is a schematic diagram of another way in which the smart device provided by the present application prompts the wake-up word. When the smart device recognizes that the voice data does not include the first word but includes the pre-change wake-up word in the preset content, it may play a voice prompt for the first word, such as "Please call me YYYY", through a speaker or other playback device. It can be understood that the above voice prompt is merely an example; richer, more personable prompts may also be played, such as "My name is now YYYY; please use my new name to wake me up" or "I'm called YYYY now, waiting for you to wake me anytime".
Therefore, in specific implementations, the voice data processing method provided by the present application supports the following application scenarios. Scenario 1: after user A changes the wake-up word of the smart device, user B says the pre-change wake-up word to the device, and the device prompts the changed wake-up word. Scenario 2: the user changes the wake-up word and then forgets it, or habitually says the pre-change wake-up word, and the device prompts the changed wake-up word. Scenario 3: the user says the wake-up word of another device to the smart device, and the device prompts its own wake-up word.
In summary, in the voice data processing method provided by the embodiments of the present application, the smart device not only collects first voice data and switches its working state according to the first word in the first voice data, but also prompts the user with the first word when the first voice data does not include the first word but does include the preset content. Thus, in cases where the wake-up word of the smart device can be changed, the method prevents the situation in which the user cannot "wake" the device because the user has forgotten or does not know the modified wake-up word, or mistakenly says the wake-up word of another device. When the user mistakenly utters a word from the preset content but actually "hopes" to wake the smart device, the device proactively prompts the user with the correct wake-up word, helping the user say the current wake-up word and wake the device. This improves the degree of intelligence of the smart device and the user experience. Moreover, the entire process can be implemented and optimized purely in the software of the smart device, avoiding changes to its hardware, with low design and manufacturing costs, and is easy to implement and popularize.
FIG. 7 is a schematic flowchart of another embodiment of the voice data processing method provided by the present application. On the basis of the embodiment shown in FIG. 3, in the embodiment shown in FIG. 7, when the smart device recognizes in S102 that the collected first voice data does not include the first word but includes the preset content, it increments a detection count by 1 in S201. The detection count is the number of consecutive times the voice data collected by the smart device has not included the first word but has included the preset content. Subsequently, when S202 determines that the accumulated detection count exceeds a preset number, the smart device prompts the first word via S103.
Illustratively, when the first voice data collected by the smart device on three consecutive occasions does not include the first word "YYYY" currently serving as the wake-up word but does include the word "XXXX" from the preset content, this indicates that the user, by repeatedly calling out a word from the preset content, intends to wake the smart device but is using a wrong wake-up word. Therefore, after the third consecutive detection that the first voice data does not include the first word "YYYY" but includes the same word "XXXX" from the preset content, the smart device prompts the user with the first word in the manner of FIG. 5 or FIG. 6. Alternatively, the detection count may be the number of times, within a preset time period (for example, 1 minute), that the collected voice data does not include the first word but includes the preset content. Thus, by calculating and accumulating the detection count, the embodiment of FIG. 7 verifies whether the user's purpose in uttering the preset content is to wake the smart device, which ensures the accuracy and effectiveness of subsequent prompts and improves the processing accuracy and efficiency of the smart device. The other steps in FIG. 7 are implemented in the same way as in FIG. 3 and are not repeated here.
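Both counting variants described above (a consecutive-detection threshold and a sliding time window) can be sketched in one small class. The threshold of 3 detections and the 60-second window are the example values from the text; the class and method names are assumptions.

```python
import time

class MistakenWakeCounter:
    """Count detections of 'preset content but no first word' and decide
    when the prompt should fire, per the FIG. 7 embodiment."""

    def __init__(self, preset_times=3, window_s=60.0):
        self.preset_times = preset_times
        self.window_s = window_s
        self._timestamps = []

    def record(self, now=None):
        """Register one detection; return True when the prompt should fire."""
        now = time.monotonic() if now is None else now
        self._timestamps.append(now)
        # Keep only detections inside the sliding time window.
        self._timestamps = [t for t in self._timestamps
                            if now - t <= self.window_s]
        return len(self._timestamps) >= self.preset_times

    def reset(self):
        # Called when the first word is finally heard, or after prompting.
        self._timestamps.clear()
```

Passing `now` explicitly makes the logic testable; in the device, `record()` would be called with the default monotonic clock each time S102 detects preset content without the first word.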
FIG. 8 is a schematic flowchart of yet another embodiment of the voice data processing method provided by the present application. On the basis of the embodiment shown in FIG. 3, in the embodiment shown in FIG. 8, when the smart device recognizes in S102 that the collected first voice data does not include the first word but includes the preset content, it does not directly prompt the first word via S103; instead, it continues to collect second voice data in S301. The smart device may collect the valid utterances the user makes after the first voice data, up to the point where streaming recognition ends, and record the collected data as the second voice data. Subsequently, in S302, when the smart device recognizes that the second voice data includes a statement inquiring about the first word, it prompts the first word via S103. The detected statement may be one in which the user asks the smart device about the first word; semantic recognition may be used to determine that the semantics of a statement included in the second voice data is an inquiry about the first word.
Illustratively, assume the current wake-up word of the smart device is the first word "YYYY" and the preset content includes the pre-change word "XXXX". When the user says the word "XXXX" and the smart device does not prompt immediately, the user may go on to say something like "Is the wake-up word wrong?", "Is voice wake-up broken?", or "Why can't I wake it by voice anymore?". The smart device then determines, from such statements included in the collected second voice data, that the user indeed wants to wake the device but cannot determine the wake-up word, and it prompts the first word in the manner of FIG. 5 or FIG. 6. Therefore, in this embodiment, even when the user has not spoken the wake-up word, the smart device can still respond to the user's inquiry about the wake-up word, further enriching its functions and improving its degree of intelligence.
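As a very rough stand-in for the semantic recognition step, intent detection on the follow-up utterance can be sketched with keyword matching. A production system would use a server-side semantic model, as the embodiment describes; the marker phrases below are illustrative only.

```python
def is_wake_word_inquiry(utterance):
    """Decide whether a follow-up utterance is asking about the wake-up
    word (the S302 check), using a toy keyword list instead of a real
    semantic model."""
    inquiry_markers = [
        "wake-up word wrong",
        "voice wake-up broken",
        "why can't",
        "what is the wake-up word",
    ]
    normalized = utterance.lower()
    return any(marker in normalized for marker in inquiry_markers)
```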
FIG. 9 is a schematic diagram of how the smart device provided by the present application processes the preset content. As shown in FIG. 9, the smart device stores the preset content in its local storage, which may be preloaded with the device's primary wake-up word as well as popular wake-up words of smart devices of other brands. The cloud storage provided by the supplier's server then manages, via two modes, the newly added popular wake-up words delivered to the smart device and the wake-up words corresponding to user accounts: operation management and account management. In the operation management mode, if other devices on the market gain newly popular wake-up words, devices of the same type are identified by means such as a feature code, and the new wake-up words are delivered to such devices in batches. In the account management mode, the wake-up words changed on any device the user has logged in to are bound to the user account and stored synchronously via the cloud. When the smart device starts, it first checks its locally stored wake-up words; if none are stored locally, it pulls and stores the operation-managed wake-up word data. When the user logs in and comes online with an account, the cloud proactively pushes the cloud-stored wake-up words, which are merged with the local ones. If the user changes the wake-up word locally and the user account is detected to be online after the operation completes, the device proactively pushes the update to the cloud.
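The startup synchronization just described can be sketched as a pure function over the three word stores. The store shapes (plain Python collections) and the function name are assumptions for illustration; the real device would read from persistent local storage and a network API.

```python
def startup_wake_words(local_store, cloud_operation_words, cloud_account_words,
                       account_online):
    """Sketch of the FIG. 9 startup synchronization.

    If nothing is stored locally, pull the operation-managed wake-up
    words; when the user account is online, merge the account's
    cloud-stored words into the local set.
    """
    words = set(local_store)
    if not words:
        words |= set(cloud_operation_words)  # first boot: pull operation data
    if account_online:
        words |= set(cloud_account_words)    # cloud pushes the account's words
    return words
```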
FIG. 10 is a schematic flowchart of an embodiment of processing voice data provided by the present application. After the user utters the first voice data, the smart device verifies the wake-up word using a machine learning model or similar means; upon determining that the first voice data includes the first word, it responds normally and executes the command. When the first word is not included but the preset content is (the preset content here being a previously used wake-up word), the device counts the number of such detections and continues to collect the second voice data uttered by the user. If it subsequently determines from the semantics of the second voice data that the user is asking about the first word, the smart device prompts the user with the first word; alternatively, once the count exceeds the preset number, the smart device prompts the user with the first word.
In some embodiments, the smart device may use its own machine learning model to identify whether the first voice data includes the first word and the preset content. Alternatively, the smart device may send the first voice data to a cloud server, which identifies whether the first voice data includes the first word and the preset content and returns the recognition result to the smart device, thereby reducing the computational load on the device. As yet another option, the smart device may perform recognition by comparing the pinyin of each character in the first voice data with the pinyin of the wake-up word and of the preset content, increasing the fuzziness of matching and thus improving the recognition rate.
Illustratively, in the embodiment shown in FIG. 10, the wake-up word may be verified in the following two ways. (1) Wake-up model scoring: when the first voice data is collected, a wake-up model scores the wake-up word in the first voice data. If the scoring result is the first word currently set by the user, the device responds to the wake-up normally; if the result is not the currently set first word but is part of the stored preset content, the device collects the second voice data and starts the semantic-analysis push preparation stage. The second voice data is then pushed to the server for semantic recognition and processing; upon receiving the recognition result from the server and determining that the semantics of the second voice data is an inquiry about the first word, the smart device prompts the first word. (2) Cloud text recognition and pinyin transliteration: after the user enables this wrong-wake-up prompt function, the recognition engine remains on; the device collects the first voice data, transliterates the recognized text into pinyin, and performs exact matching against the pinyin transliterations of the stored first word and the preset content. If the pinyin matches the first word currently set by the user, the device responds to the wake-up normally; if the match is not the currently set first word but is part of the preset content, the device collects the second voice data and starts the semantic-analysis push preparation stage.
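Verification method (2) — matching at the pinyin level so that homophone recognition errors still match — can be sketched as follows. The tiny character-to-pinyin table is a stand-in for a full lexicon (a real system might use a library such as pypinyin); everything here is illustrative, not the patent's implementation.

```python
# Tiny stand-in pinyin table; a real system would use a full lexicon.
PINYIN = {"小": "xiao", "聚": "ju", "海": "hai", "信": "xin"}

def to_pinyin(text):
    """Transliterate each known character to pinyin, skipping unknown ones."""
    return " ".join(PINYIN[ch] for ch in text if ch in PINYIN)

def pinyin_match(recognized_text, candidate_word):
    """Exact match at the pinyin level: a homophone character with the
    same pronunciation as the wake-up word still matches."""
    cand = to_pinyin(candidate_word)
    return bool(cand) and cand in to_pinyin(recognized_text)
```

The same `pinyin_match` check would be run once against the first word and once against each stored preset word to pick between the "respond normally" and "prepare semantic analysis" branches.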
FIG. 11 is a schematic diagram of the processing structure with which the smart device provided by the present application processes voice data. Speech recognition technology for voice data processing mainly comprises four parts: signal processing and feature extraction, the acoustic model, the language model, and the decoder. In this structure, signal processing and feature extraction take the audio signal as input, enhance the speech by removing noise and channel distortion, transform the signal from the time domain to the frequency domain, and extract suitable, representative feature vectors for the subsequent acoustic model. Many methods exist for acoustic feature extraction, such as Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC), and the multimedia content description interface (MPEG-7). The acoustic model converts speech into an acoustic representation, that is, it finds the probability that a given piece of speech originates from a particular acoustic symbol. The most commonly used acoustic modeling approach is the hidden Markov model (HMM). Under an HMM, the states are hidden variables, the speech frames are the observations, and transitions between states satisfy the Markov assumption. The state transition probability density is usually modeled with a geometric distribution, while the observation probability linking hidden variables to observations is commonly fitted with a Gaussian mixture model (GMM). With the development of deep learning, models such as deep neural networks (DNN), convolutional neural networks (CNN), and recurrent neural networks (RNN) have been applied to modeling the observation probability with very good results. The FSMN proposed by iFLYTEK is an improved DNN-based network structure: a delay structure is introduced into the hidden layers of the DNN, and the hidden-layer history from times t-N to t-1 is fed as input to the next layer, thereby incorporating the historical information of the speech sequence while avoiding the problems of training RNNs with BPTT, such as vanishing gradients and high computational complexity. The language model estimates the likelihood of a hypothesized word sequence, also called the language model score, by learning the relationships between words from a training corpus. Statistical language models have become the mainstream technology for language processing in speech recognition; there are many kinds, such as N-gram language models, Markov N-gram models, exponential models, and decision tree models. The N-gram language model is the most commonly used statistical language model, particularly the bigram and trigram models. The decoder recognizes the input sequence of speech frames based on the trained acoustic model, combined with the dictionary and the language model. Its main work includes: given an input feature sequence x1…xT, searching, by means of the Viterbi algorithm, a search space composed of four knowledge sources (the acoustic model, acoustic context, pronunciation dictionary, and language model) for the best word string.
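The Viterbi search at the heart of the decoder can be illustrated on a toy HMM. This is the textbook dynamic-programming algorithm over dict-based probability tables, far smaller than a real speech-recognition search space, and the example states and observations below are purely didactic.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Find the most likely hidden-state sequence for an observation
    sequence under an HMM (the decoding step described above)."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best_state = max(V[-1], key=V[-1].get)
    return path[best_state]

# Toy two-state HMM for demonstration.
states = ("Healthy", "Fever")
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}
```

In a real decoder the "states" would be context-dependent phone states, the emission probabilities would come from the GMM or DNN acoustic model, and beam pruning would keep the search tractable.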
The foregoing embodiments have described the voice data processing method provided by the embodiments of the present application. To implement the functions of the methods provided above, the smart device acting as the execution subject may include a hardware structure and/or software modules, and may implement each function in the form of a hardware structure, software modules, or a hardware structure plus software modules. Whether a given function is executed as a hardware structure, a software module, or a combination of both depends on the specific application and design constraints of the technical solution.
For example, FIG. 12 is a schematic structural diagram of an embodiment of the voice data processing apparatus provided by the present application. The apparatus 100 shown in FIG. 12 includes: a collection module 1001, a processing module 1002, a prompt module 1003, and a determination module 1004. The determination module 1004 is configured to determine that the wake-up word of the smart device is configured as a first word; the collection module 1001 is configured to collect first voice data of a user; and the processing module 1002 is configured to recognize whether the first voice data includes the first word most recently set for the smart device, and whether it includes preset content. When it is recognized that the first voice data includes the first word, the smart device switches its working state; when it is recognized that the first voice data does not include the first word but includes the preset content, the smart device does not switch its working state; and when it is recognized that the first voice data includes neither the first word nor the preset content, the smart device does not switch its working state. The prompt module 1003 is configured to prompt the user with the first word when it is recognized that the first voice data does not include the first word but includes the preset content.
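The three-way decision made by the processing module can be sketched as follows. Function and variable names are hypothetical, and plain substring matching stands in for actual speech recognition; a real implementation would operate on recognizer output, as described in the method embodiments.

```python
def handle_voice_data(text, current_wake_word, preset_words):
    """Three-way decision of the embodiment: `preset_words` is the hypothetical
    set of previously configured wake-up words (this device, the bound user
    account, or other voice-capable devices)."""
    if current_wake_word in text:
        return "switch_state"          # current wake word recognized: switch working state
    if any(w in text for w in preset_words):
        return "prompt_new_wake_word"  # stale wake word used: stay, prompt the first word
    return "ignore"                    # neither present: keep the current working state

print(handle_voice_data("hey vision, play music", "hey vision", {"hi tv"}))  # switch_state
print(handle_voice_data("hi tv, play music", "hey vision", {"hi tv"}))      # prompt_new_wake_word
print(handle_voice_data("play music", "hey vision", {"hi tv"}))             # ignore
```

The middle branch is the key behavior of the application: a user who says an old wake-up word is not ignored silently but is told the currently configured word.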
Specifically, for the principles and implementation of the steps performed by each module of the voice data processing apparatus, reference may be made to the description of the voice data processing method in the foregoing embodiments of the present application; details are not repeated here.
It should be noted that the division of the above apparatus into modules is only a division of logical functions; in an actual implementation the modules may be fully or partially integrated into one physical entity, or physically separated. These modules may all be implemented as software called by a processing element; they may all be implemented in hardware; or some modules may be implemented as software called by a processing element while others are implemented in hardware. A module may be a separately established processing element, or may be integrated into a chip of the above apparatus; it may also be stored in the memory of the apparatus in the form of program code that is called and executed by a processing element of the apparatus to perform the functions of, for example, the determination module. The other modules are implemented similarly. In addition, all or some of these modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability. During implementation, each step of the above method, or each of the above modules, may be completed by an integrated logic circuit of hardware in the processor element or by instructions in the form of software.
For example, these modules may be one or more integrated circuits configured to implement the above method, such as one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field-programmable gate arrays (FPGA). As another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of calling program code. As a further example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
The above embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state disk (SSD)).
The present application also provides an electronic device, including a processor and a memory connected by a bus. The memory stores a computer program; when the processor executes the computer program, the processor can be used to perform any of the voice data processing methods in the foregoing embodiments of the present application.
The present application also provides a computer-readable storage medium storing a computer program which, when executed, can be used to perform any of the voice data processing methods provided in the foregoing embodiments of the present application.
An embodiment of the present application also provides a chip for running instructions, the chip being configured to execute the voice data processing method provided in any of the foregoing embodiments of the present application.
The present application also provides a computer program product, including a computer program which, when executed by a processor, can be used to implement any of the foregoing voice data processing methods of the present application.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments; the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements for some or all of the technical features, and that such modifications or replacements do not depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

  1. A method for processing voice data, applied to a smart device, comprising:
    determining that a wake-up word of the smart device is configured as a first word;
    collecting first voice data of a user;
    when it is recognized that the first voice data comprises the first word, switching, by the smart device, a working state;
    when it is recognized that the first voice data does not comprise the first word but comprises preset content, not switching, by the smart device, the working state, and prompting the user with the first word;
    when it is recognized that the first voice data comprises neither the first word nor the preset content, not switching, by the smart device, the working state.
  2. The method according to claim 1, wherein, when it is recognized that the first voice data does not comprise the first word but comprises the preset content, the smart device not switching the working state and prompting the user with the first word comprises:
    when it is recognized that the first voice data does not comprise the first word but comprises the preset content, incrementing a detection count by 1, the detection count being the number of consecutive times that collected voice data does not comprise the first word but comprises the preset content;
    when the detection count is greater than a preset number of times, prompting the user with the first word.
  3. The method according to claim 1, wherein, when it is recognized that the first voice data does not comprise the first word but comprises the preset content, the smart device not switching the working state and prompting the user with the first word comprises:
    when it is recognized that the first voice data does not comprise the first word but comprises the preset content, collecting second voice data of the user;
    when it is recognized that the second voice data comprises a sentence whose semantics is to inquire about the first word, prompting the user with the first word.
  4. The method according to any one of claims 1-3, wherein the preset content comprises one or more of the following:
    at least one wake-up word configured for the smart device before the first word;
    at least one wake-up word configured for a user account bound to the smart device;
    a configured wake-up word of at least one other device having a voice data processing function.
  5. The method according to claim 4, further comprising:
    when the smart device starts up, obtaining, from a storage device, the at least one wake-up word configured for the smart device before the first word, and obtaining, from a server, the configured wake-up word of the at least one other device having the voice data processing function;
    when the user logs in to the smart device with an account, obtaining, from the server according to the user's account, the at least one wake-up word configured for the user account bound to the smart device.
  6. The method according to claim 5, further comprising:
    when the user logs in to the smart device with an account and modifies the wake-up word of the smart device from the first word to a second word, sending the second word to the server so that the server records the second word.
  7. The method according to any one of claims 1-6, wherein prompting the user with the first word comprises:
    displaying text prompt information of the first word on a display interface;
    or playing voice prompt information of the first word by voice.
  8. The method according to claim 7, further comprising:
    stopping prompting the user with the first word after a preset time;
    or stopping prompting the user with the first word after third voice data of the user is collected and it is recognized that the third voice data comprises the first word.
  9. The method according to any one of claims 1-8, further comprising, after the collecting of the first voice data of the user:
    determining, by a machine learning model, whether the first voice data comprises the first word and the preset content;
    or determining the pinyin of each character in the first voice data, and determining, from the pinyin of each character, the pinyin of the first word, and the pinyin of the preset content, whether the first voice data comprises the first word and the preset content.
  10. An apparatus for processing voice data, comprising:
    a determination module, configured to determine that a wake-up word of a smart device is configured as a first word;
    a collection module, configured to collect first voice data of a user;
    a processing module, configured to recognize whether the first voice data comprises the first word and whether it comprises preset content, wherein, when it is recognized that the first voice data comprises the first word, the smart device switches a working state; when it is recognized that the first voice data does not comprise the first word but comprises the preset content, the smart device does not switch the working state; and when it is recognized that the first voice data comprises neither the first word nor the preset content, the smart device does not switch the working state;
    a prompt module, configured to prompt the user with the first word when it is recognized that the first voice data does not comprise the first word but comprises the preset content.
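The pinyin-based alternative of claim 9 can be sketched as follows: matching by pinyin rather than by raw characters tolerates homophone substitutions produced by the recognizer. The tiny hard-coded pinyin table below is hypothetical; a real system would use a full pronunciation lexicon and typically also handle tones and polyphonic characters.

```python
# Hypothetical mini pinyin table; a real system would use a full lexicon.
PINYIN = {"小": "xiao", "效": "xiao", "明": "ming", "鸣": "ming", "你": "ni", "好": "hao"}

def to_pinyin(text):
    """Character-by-character pinyin conversion (tones ignored in this sketch)."""
    return [PINYIN.get(ch, ch) for ch in text]

def contains_by_pinyin(utterance, word):
    """True if the word's pinyin appears as a contiguous run in the utterance's
    pinyin, so a homophone like 效鸣 still matches the configured word 小明."""
    u, w = to_pinyin(utterance), to_pinyin(word)
    return any(u[i:i + len(w)] == w for i in range(len(u) - len(w) + 1))

print(contains_by_pinyin("你好效鸣", "小明"))  # True: homophone match
print(contains_by_pinyin("你好", "小明"))      # False
```

Comparing pronunciations in this way is what lets the method detect a wake-up word (or an old one from the preset content) even when the transcription picks different characters with the same sound.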
PCT/CN2022/107607 2021-12-13 2022-07-25 Speech data processing method and apparatus WO2023109129A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111516804.3A CN114155854B (en) 2021-12-13 2021-12-13 Voice data processing method and device
CN202111516804.3 2021-12-13

Publications (1)

Publication Number Publication Date
WO2023109129A1 true WO2023109129A1 (en) 2023-06-22

Family

ID=80450733

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107607 WO2023109129A1 (en) 2021-12-13 2022-07-25 Speech data processing method and apparatus

Country Status (2)

Country Link
CN (1) CN114155854B (en)
WO (1) WO2023109129A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155854B (en) * 2021-12-13 2023-09-26 海信视像科技股份有限公司 Voice data processing method and device
CN116564316B (en) * 2023-07-11 2023-11-03 北京边锋信息技术有限公司 Voice man-machine interaction method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108986813A (en) * 2018-08-31 2018-12-11 出门问问信息科技有限公司 Wake up update method, device and the electronic equipment of word
CN111386566A (en) * 2017-12-15 2020-07-07 海尔优家智能科技(北京)有限公司 Device control method, cloud device, intelligent device, computer medium and device
CN111640434A (en) * 2020-06-05 2020-09-08 三星电子(中国)研发中心 Method and apparatus for controlling voice device
US10872599B1 (en) * 2018-06-28 2020-12-22 Amazon Technologies, Inc. Wakeword training
CN112885341A (en) * 2019-11-29 2021-06-01 北京安云世纪科技有限公司 Voice wake-up method and device, electronic equipment and storage medium
US20210241759A1 (en) * 2020-02-04 2021-08-05 Soundhound, Inc. Wake suppression for audio playing and listening devices
CN114155854A (en) * 2021-12-13 2022-03-08 海信视像科技股份有限公司 Voice data processing method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109377987B (en) * 2018-08-31 2020-07-28 百度在线网络技术(北京)有限公司 Interaction method, device, equipment and storage medium between intelligent voice equipment
CN111105789A (en) * 2018-10-25 2020-05-05 珠海格力电器股份有限公司 Awakening word obtaining method and device
CN109493849A (en) * 2018-12-29 2019-03-19 联想(北京)有限公司 Voice awakening method, device and electronic equipment
CN113066490B (en) * 2021-03-16 2022-10-14 海信视像科技股份有限公司 Prompting method of awakening response and display equipment


Also Published As

Publication number Publication date
CN114155854B (en) 2023-09-26
CN114155854A (en) 2022-03-08

Similar Documents

Publication Publication Date Title
US11437041B1 (en) Speech interface device with caching component
US11676575B2 (en) On-device learning in a hybrid speech processing system
US11720326B2 (en) Audio output control
US11600291B1 (en) Device selection from audio data
JP6772198B2 (en) Language model speech end pointing
US11669300B1 (en) Wake word detection configuration
US11763808B2 (en) Temporary account association with voice-enabled devices
US10714085B2 (en) Temporary account association with voice-enabled devices
WO2023109129A1 (en) Speech data processing method and apparatus
WO2017071182A1 (en) Voice wakeup method, apparatus and system
US11509525B1 (en) Device configuration by natural language processing system
US8972260B2 (en) Speech recognition using multiple language models
CN111344780A (en) Context-based device arbitration
US11258671B1 (en) Functionality management for devices
CN112927683A (en) Dynamic wake-up word for voice-enabled devices
US11211056B1 (en) Natural language understanding model generation
US20220161131A1 (en) Systems and devices for controlling network applications
JP4475628B2 (en) Conversation control device, conversation control method, and program thereof
WO2019236745A1 (en) Temporary account association with voice-enabled devices
US11763814B2 (en) Hybrid voice command processing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22905891

Country of ref document: EP

Kind code of ref document: A1