WO2023109129A1 - Speech data processing method and apparatus - Google Patents


Info

Publication number
WO2023109129A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
smart device
voice data
user
wake
Prior art date
Application number
PCT/CN2022/107607
Other languages
French (fr)
Chinese (zh)
Inventor
李含珍
王峰
任晓楠
Original Assignee
海信视像科技股份有限公司 (Hisense Visual Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 海信视像科技股份有限公司 (Hisense Visual Technology Co., Ltd.)
Publication of WO2023109129A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L15/34 Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • G10L2015/223 Execution procedure of a spoken command
    • G10L2015/225 Feedback of the input speech

Definitions

  • the present application relates to the technical field of voice data processing, in particular to a voice data processing method and device.
  • To save power, a smart device is usually kept in a low power consumption mode.
  • When the user wants to talk to the smart device, the user needs to speak the wake-up word of the smart device first to "wake up" the device so that it switches to the normal working state.
  • After the smart device detects the wake-up word, it executes the instruction the user speaks after the wake-up word.
  • The wake-up word of some smart devices can be changed, but after the change, once the user forgets or cannot determine the changed wake-up word, the user will no longer be able to "wake up" the smart device, resulting in an insufficient level of intelligence of the smart device and a seriously degraded user experience.
  • In view of this, the present application provides a voice data processing method and device to solve the technical problems of insufficient intelligence of the smart device and poor user experience caused by failure to wake up the smart device.
  • In a first aspect, the present application provides a voice data processing method, including: determining that the wake-up word of the smart device is configured as a first word; collecting first voice data of the user; when it is recognized that the first voice data includes the first word, switching the working state of the smart device; when it is recognized that the first voice data does not include the first word but includes preset content, keeping the working state unchanged and prompting the user with the first word; and when it is recognized that the first voice data includes neither the first word nor the preset content, keeping the working state unchanged.
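The three recognition branches above can be sketched as a small decision function. This is a minimal illustration only: the names `handle_voice_data`, `first_word`, and `preset_content`, and the substring-matching step, are assumptions for the sketch, not the patent's implementation.

```python
def handle_voice_data(transcript, first_word, preset_content):
    """Decide how the device reacts to one utterance, per the three branches."""
    if first_word in transcript:
        return "switch_working_state"      # branch 1: wake word spoken
    if any(word in transcript for word in preset_content):
        return "stay_idle_and_prompt"      # branch 2: old/other wake word spoken
    return "stay_idle"                     # branch 3: unrelated speech

# Example: current wake word "YYYY", old wake word "XXXX" kept as preset content
print(handle_voice_data("XXXX play a movie", "YYYY", {"XXXX"}))  # stay_idle_and_prompt
```

The key point of the branching is that both non-wake branches leave the working state unchanged; only the second one additionally triggers a prompt.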
  • In some embodiments, when it is recognized that the first voice data does not include the first word but includes the preset content, not switching the working state and prompting the user with the first word includes: adding 1 to a detection count, where the detection count is the number of consecutively collected pieces of voice data that do not include the first word but include the preset content; and when the detection count is greater than a preset count, prompting the user with the first word.
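The consecutive-detection counter can be sketched as follows. The class name and the reset-on-break behavior are illustrative assumptions consistent with "continuously collected" in the text.

```python
class WakeWordPrompter:
    """Count consecutive utterances that miss the current wake word but hit
    preset content; signal a prompt once the count exceeds a threshold."""

    def __init__(self, preset_count):
        self.preset_count = preset_count   # threshold from the embodiment
        self.detections = 0

    def observe(self, has_first_word, has_preset):
        """Feed one recognition result; return True when the user should be
        prompted with the first word."""
        if has_first_word or not has_preset:
            self.detections = 0            # streak broken: reset the counter
            return False
        self.detections += 1               # "add 1 to the detection count"
        return self.detections > self.preset_count
```

With `preset_count=2`, the third consecutive mis-wake triggers the prompt, matching the three-times example later in the description.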
  • In some embodiments, when it is recognized that the first voice data does not include the first word but includes the preset content, not switching the working state and prompting the user with the first word includes: collecting second voice data of the user; and when it is recognized that the second voice data includes a sentence whose semantics is to ask about the first word, prompting the user with the first word.
  • In some embodiments, the preset content includes one or more of the following: at least one wake-up word configured for the smart device before the first word; at least one wake-up word configured under the user account bound to the smart device; and a configured wake-up word of at least one other device with a voice data processing function.
  • In some embodiments, the method further includes: when the smart device is started, acquiring from a storage device the at least one wake-up word configured for the smart device before the first word, and acquiring from a server the configured wake-up word of the at least one other device with a voice data processing function; and when the user logs in to the smart device with an account, acquiring from the server, according to the user account, the at least one wake-up word configured under the user account bound to the smart device.
  • In some embodiments, the method further includes: when the user logs in to the smart device with an account and modifies the wake-up word of the smart device from the first word to a second word, sending the second word to the server so that the server records the second word.
  • In some embodiments, prompting the user with the first word includes: displaying a text prompt of the first word on a display interface; or playing a voice prompt of the first word.
  • In some embodiments, the prompting of the first word stops after a preset time; or, when third voice data of the user is collected and the first word is recognized in the third voice data, the prompting of the first word stops.
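The two stop conditions for the prompt (a preset timeout, or the first word being heard in third voice data) might be combined as follows. This is a sketch; the 15-second default mirrors the example timeout given later in the description, and the class and method names are assumptions.

```python
import time

class PromptController:
    """Track whether the wake-word prompt should still be shown: it stops
    after a timeout, or as soon as the first word is heard."""

    def __init__(self, first_word, timeout_s=15.0):
        self.first_word = first_word
        self.timeout_s = timeout_s
        self.shown_at = None

    def show(self, now=None):
        """Start showing the prompt at time `now` (defaults to the clock)."""
        self.shown_at = time.monotonic() if now is None else now

    def still_showing(self, now, latest_transcript=""):
        if self.shown_at is None:
            return False
        if now - self.shown_at >= self.timeout_s:
            return False                   # preset time elapsed
        if self.first_word in latest_transcript:
            return False                   # third voice data contains the word
        return True
```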
  • In some embodiments, after collecting the first voice data of the user, the method further includes: using a machine learning model to determine whether the first voice data includes the first word and the preset content; or determining the pinyin of each character in the first voice data, and determining, through the pinyin of each character, the pinyin of the first word, and the pinyin of the preset content, whether the first voice data includes the first word and the preset content.
  • A second aspect of the present application provides a voice data processing device for executing the voice data processing method provided in any implementation of the first aspect. The device includes: a determination module, configured to determine that the wake-up word of the smart device is configured as a first word; a collection module, configured to collect first voice data of the user; a processing module, configured to identify whether the first voice data includes the first word and whether it includes preset content, where the smart device switches the working state when it is recognized that the first voice data includes the first word, and does not switch the working state when the first voice data does not include the first word, whether or not it includes the preset content; and a prompt module, configured to prompt the user with the first word when it is recognized that the first voice data does not include the first word but includes the preset content.
  • FIG. 1 is a schematic diagram of an application scenario of the present application;
  • FIG. 2 is a schematic flow chart of a method for a smart device to process voice data;
  • FIG. 3 is a schematic flow chart of an embodiment of a voice data processing method provided by the present application;
  • FIG. 4 is a schematic diagram of the wake-up word of the smart device provided by the present application;
  • FIG. 5 is a schematic diagram of one way for a smart device to prompt a wake-up word provided by the present application;
  • FIG. 6 is a schematic diagram of another way for a smart device to prompt a wake-up word provided by the present application;
  • FIG. 7 is a schematic flow chart of another embodiment of the voice data processing method provided by the present application;
  • FIG. 8 is a schematic flow chart of another embodiment of the voice data processing method provided by the present application;
  • FIG. 9 is a schematic diagram of the smart device provided by the present application realizing the processing of the preset content;
  • FIG. 10 is a schematic flow chart of an embodiment of voice data processing provided by the present application;
  • FIG. 11 is a schematic diagram of a processing structure for voice data processing by a smart device provided by the present application;
  • FIG. 12 is a schematic structural diagram of an embodiment of a voice data processing device provided by the present application.
  • FIG. 1 is a schematic diagram of an application scenario of the present application, showing a user 1 controlling a smart device 2 through voice interaction, where the smart device 2 may be a mobile phone, a tablet computer, a TV, a smart speaker, or another smart home appliance or electronic device with voice interaction functions.
  • In the following, the case where the smart device 2 is a TV is taken as an example.
  • To save power consumption, the smart device 2 is usually in a low power consumption mode.
  • When the user 1 needs to issue instructions to the smart device 2 by voice, the user needs to first speak the wake-up word set for the smart device 2, e.g. "XXXX", followed by the command "Play movie". For the smart device 2, the processing flow can refer to the process shown in FIG. 2, where FIG. 2 is a schematic flow chart of a method for a smart device to process voice data.
  • After collecting the first voice data in S10, the smart device 2 first identifies in S20 whether the first voice data includes the first word, such as the wake-up word "XXXX". If the wake-up word is not included, the device does not switch to the normal working state but remains in the low power consumption state, and returns to S10 to continue collecting voice data. If it is recognized in S20 that the first voice data includes the first word, the smart device 2 switches to the working state in S30 according to the first word, and recognizes and executes the command in the first voice data in S40, or continues to collect subsequent voice data and then recognizes and executes the commands therein.
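The S10 to S40 loop can be simulated over a list of transcripts. This is a sketch only; splitting the command out of the text after the wake word is an illustrative assumption.

```python
def device_loop(utterances, wake_word):
    """Simulate the S10-S40 loop over a list of transcripts: stay in low
    power until the wake word is heard, then execute the command that
    follows it in the same utterance."""
    executed = []
    state = "low_power"
    for text in utterances:                    # S10: collect voice data
        if wake_word in text:                  # S20: recognize wake word
            state = "working"                  # S30: switch working state
            command = text.split(wake_word, 1)[1].strip()
            if command:
                executed.append(command)       # S40: execute the command
        # otherwise: remain in low power and keep collecting
    return state, executed
```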
  • When the first word in S20, such as the wake-up word "XXXX", is no longer being spoken (e.g., the interaction has ended), the smart device 2 switches from the working state back to the low power consumption state, and re-executes the steps of collecting voice data in S10 and recognizing the wake-up word.
  • After receiving the voice data from the user, the smart device 2 in the scene of FIG. 1 can process the voice data through its built-in machine learning model to obtain the wake-up words and commands therein; or the smart device 2 can send the voice data to a network server 3, which performs recognition and other processing on the voice data and returns the obtained wake-up words and commands to the smart device. Finally, the smart device 2 determines that the user 1 has spoken the command "Play movie", obtains movie data from the server 3, and plays the movie on its display screen 21.
  • In some embodiments, the wake-up word of the smart device is not fixed but can be deleted, modified, or replaced by the user, so as to enrich the user experience and improve functionality.
  • For example, the wake-up word preset by the supplier of the smart device 2 is "XXXX", and the user 1 can change the wake-up word to "YYYY", and so on.
  • The above "XXXX" and "YYYY" are merely illustrative examples.
  • The number of characters and the specific implementation of each wake-up word are not limited, as long as the wake-up words before and after the change are different; for example, the wake-up word may be changed from "Hisense Xiaoju" to "Xiaoju Xiaoju", and so on.
  • When the wake-up word of the smart device is changed to "YYYY", once the user forgets the modified wake-up word, or other home users do not know or have not adapted to the changed wake-up word and still speak the preset wake-up word "XXXX" to the smart device, the smart device will judge that the collected voice data does not include the wake-up word "YYYY" and will not switch the working state, so the user cannot issue commands to the smart device by voice. This makes users feel that they cannot "wake up" the smart device, and seriously degrades the user experience.
  • In addition, a family may include multiple smart terminals, such as a TV in the living room, an air conditioner in the bedroom, and a smart speaker. These smart terminals are set to different wake-up words by users, or are set to different wake-up words by default, and different wake-up words are required to wake up the corresponding devices. In this way, it is very likely that a user who originally wants to wake up the TV will call out the wake-up word of another device. The user may also be in different places, such as home, the office, or a public place, and the devices in these places may likewise need to be woken up with different wake-up words.
  • Therefore, the present application provides a voice data processing method and device to solve the technical problem that, in the above scenarios, the smart device may fail to wake up after the wake-up word is changed, which makes the smart device less intelligent.
  • the technical solution of the present application will be described in detail below with specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.
  • Fig. 3 is a schematic flow chart of an embodiment of the processing method of voice data provided by the present application.
  • the processing method shown in Fig. 3 can be applied in the scene shown in Fig. 1 and executed by the smart device 2.
  • the method includes:
  • In S101, the smart device collects first voice data, and identifies whether the first voice data includes the wake-up word and whether it includes preset content.
  • In S100, the smart device determines that its wake-up word is configured as a first word. Assuming the wake-up word before the change is "XXXX" and the user changes the wake-up word of the smart device to the first word "YYYY", the smart device will now switch the working state only after collecting voice data and recognizing that it includes the first word "YYYY" currently used as the wake-up word. It is understandable that once the wake-up word of the smart device is configured as the first word, unless the wake-up word is reconfigured, the smart device will repeatedly collect voice data and use the first word as the wake-up word to control the switching of the working state.
  • S100 may be that, after the smart device is started, it determines that its current wake-up word is configured as the first word; or S100 may specifically be that the smart device configures the current wake-up word as the first word according to the user's instruction.
  • In S20-S40, when the smart device determines that the first voice data collected in S101 includes the first word "YYYY", it switches the working state and executes the command after the first word in the first voice data, or continues to collect voice data and executes the commands therein.
  • The implementation of S20-S40 is the same as that shown in FIG. 2 and will not be repeated here. Conversely, when it is detected that the voice data does not include the first word currently used as the wake-up word, the smart device does not switch the working state.
  • When the smart device recognizes that the first voice data collected in S101 does not include the first word "YYYY" but includes the preset content, it determines that the user who spoke the first voice data hoped to wake up the smart device but spoke the wrong wake-up word. Therefore, the smart device reminds the user through S103 that the currently configured wake-up word of the smart device is the first word, and returns to S101 to re-collect voice data and identify it.
  • In some embodiments, the above preset content may include one or more of the following items, labeled a-c: a. at least one wake-up word configured before the wake-up word of the smart device was configured as the first word, denoted as a second word; for example, the default wake-up word provided by the supplier of the smart device is "XXXX".
  • Suppose the user once configured the wake-up word as "AAAA" and "BBBB", and after the latest configuration the current wake-up word is the first word "YYYY".
  • Then the preset content of the smart device may include the words "AAAA" and "BBBB" previously configured on the smart device. These second words are grouped into a second word set and stored in the smart device in the form of voice data. After voice data is subsequently received, a voice recognition model and other methods can be used to determine whether the voice data includes the stored preset content.
  • the server may send the second set of words to the smart device 2 .
  • the above preset content may further include: b. at least one wake-up word configured by the user account bound to the smart device, which is recorded as the third word.
  • FIG. 4 is a schematic diagram of the wake-up word of the smart device provided by the present application, where the user "logs in" to the smart device 2 through his user account while using it, realizing the "binding" between the user account and the smart device 2. At this time, the smart device 2 can obtain a third word set from the network server, where the third word set contains the wake-up words configured on other devices used under the user account.
  • As shown in FIG. 4, when the user logs in to the smart device with the user account and changes the wake-up word from "XXXX" to the new wake-up word "YYYY" through the path labeled 1, the smart device stores the changed wake-up word "YYYY", and then also sends the wake-up word "YYYY" to the server through the path labeled 2 for storage in the third word set corresponding to the user account.
  • When the server receives the wake-up words sent by different devices bound to the same user account, it stores these wake-up words in the third word set corresponding to the user account for recording.
  • When the smart device 2 shown in FIG. 4 detects that the user logs in with a user account, it can request the word set stored on the server according to the user account, so that the server sends the word set to the smart device through the path labeled 3.
  • In some embodiments, the above preset content may further include: c. the configured wake-up word of at least one other device, where other devices refer to electronic devices that also have voice recognition functions, such as smart speakers, computers, and mobile phones, which may be provided by suppliers different from that of the smart device.
  • The server provided by the supplier of the smart device 2 can obtain the wake-up words preset by other devices from the Internet through the path labeled 5, and store them in a fourth word set.
  • the server can send the fourth set of words to the smart device 2 through the path labeled 4 in FIG. 4 .
  • The preset content stored in the smart device shown in FIG. 4 may include one or more of the above items a-c. When the smart device recognizes that the voice data includes the first word, it switches the working state and executes the command; and when the smart device recognizes that the voice data does not include the first word but includes any of the preset content (a second word, third word, or fourth word), it prompts the first word. It can be understood that when the smart device recognizes that the voice data includes neither the first word nor the preset content, it does not respond, and re-collects voice data for recognition.
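Matching against the current wake word and the three preset-content sets a-c might look like the following sketch (the function and parameter names are illustrative assumptions):

```python
def classify_utterance(transcript, first_word, second_words, third_words, fourth_words):
    """Classify one utterance against the current wake word and the three
    preset-content sets: a) previous wake words of this device, b) wake
    words bound to the user account, c) wake words of other devices."""
    if first_word in transcript:
        return "wake"                       # switch state, execute command
    preset = set(second_words) | set(third_words) | set(fourth_words)
    if any(word in transcript for word in preset):
        return "prompt"                     # stay idle, prompt first word
    return "ignore"                         # stay idle, no response
```

Unioning the three sets reflects that any of the second, third, or fourth words triggers the same prompt branch.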
  • FIG. 5 is a schematic diagram of a way for a smart device to prompt a wake-up word provided in the present application.
  • As shown in FIG. 5, the smart device 2 can display, on its display interface 21, a text prompt message 211 in text form, such as "Please call me YYYY", which can be realized through a pop-up window on the UI interface, etc.; this embodiment does not limit the implementation of the UI.
  • After displaying the prompt information, the smart device may keep displaying it until it subsequently collects the user's third voice data and recognizes that the third voice data includes the first word, which indicates that the prompt has enabled the user to determine the new wake-up word, at which point the device stops displaying the prompt information on the display interface; or, to prevent impact on other display pages, the smart device can stop displaying the prompt information after a preset time (for example, 15 s).
  • FIG. 6 is a schematic diagram of another way for the smart device to prompt the wake-up word provided by this application.
  • When the smart device recognizes that the voice data does not include the first word but includes a pre-change wake-up word in the preset content, it can play, through a speaker or the like, a voice prompt message of the first word such as "Please call me YYYY". It is understandable that the above voice prompt information is only an example; richer and more user-friendly voice prompts can also be played, such as "My name is YYYY now, please use my new name to wake me up" or "My name is YYYY now, I am waiting for you to wake me up anytime".
  • The voice data processing method provided by this application can implement the following scenarios in a specific implementation. Scenario 1: after user A changes the wake-up word of the smart device, user B speaks the pre-change wake-up word to the smart device, and the smart device prompts the changed wake-up word.
  • Scenario 2: the user forgets the wake-up word after changing it, or habitually speaks the pre-change wake-up word, and the smart device prompts the changed wake-up word.
  • Scenario 3: the user speaks the wake-up word of another device to the smart device, and the smart device prompts its own wake-up word.
  • In summary, in addition to collecting the first voice data and switching the working state according to the first word in it, the smart device also prompts the user with the first word when the first voice data does not include the first word but includes the preset content. Thus, when the wake-up word of the smart device can be changed, this prevents the user from being unable to "wake up" the smart device because the user forgets or does not know the modified wake-up word, or mistakenly utters the wake-up word of another device: the smart device actively prompts the user with the correct wake-up word when the user, while actually hoping to wake up the device, mistakenly speaks vocabulary in the preset content, helping the user speak the current wake-up word to wake up the smart device, thereby improving the intelligence of the smart device and the user experience. Moreover, the whole process can be realized and optimized through the software of the smart device alone, which avoids changes to the hardware, has low design and manufacturing costs, and is easy to implement and popularize.
  • Fig. 7 is a schematic flow chart of another embodiment of the voice data processing method provided by the present application.
  • The embodiment shown in FIG. 7 is based on the embodiment shown in FIG. 3. When the smart device recognizes that the collected first voice data does not include the first word but includes the preset content, it adds 1 to the detection count in S201, where the detection count is the number of consecutively collected pieces of voice data that do not include the first word but include the preset content. Subsequently, when it is determined in S202 that the accumulated detection count is greater than the preset count, the smart device prompts the first word through S103.
  • For example, when the first voice data collected by the smart device three consecutive times does not include the first word "YYYY" currently used as the wake-up word but includes the word "XXXX" in the preset content, this indicates that the user keeps calling out a word in the preset content to wake up the smart device but is using the wrong wake-up word. Therefore, after it is detected for the third time that the first voice data does not include the first word "YYYY" but includes the same word "XXXX" in the preset content, the smart device prompts the user with the first word in the manner shown in FIG. 5 or FIG. 6.
  • In some embodiments, the above detection count may also be the number of times that voice data collected within a preset time period (for example, 1 minute) does not include the first word but includes the preset content. Therefore, in the embodiment shown in FIG. 7, by calculating and accumulating the detection count, it is verified whether the purpose of the user's utterance of the preset content is to wake up the smart device, so as to ensure the accuracy and effectiveness of the subsequent prompts and improve the processing accuracy and efficiency of the smart device.
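The time-window variant of the detection count can be sketched with a sliding window. This is illustrative only; the one-minute window matches the example, and the class name is an assumption.

```python
from collections import deque

class WindowedDetectionCounter:
    """Count mis-wake detections inside a sliding time window (e.g. one
    minute) instead of requiring strictly consecutive utterances."""

    def __init__(self, window_s, preset_count):
        self.window_s = window_s
        self.preset_count = preset_count
        self.hits = deque()                # timestamps of mis-wake detections

    def record(self, t):
        """Record one 'preset content but no first word' detection at time
        t (seconds); return True when the user should be prompted."""
        self.hits.append(t)
        while self.hits and t - self.hits[0] > self.window_s:
            self.hits.popleft()            # drop detections outside window
        return len(self.hits) > self.preset_count
```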
  • the implementation of other steps in FIG. 7 is the same as that in FIG. 3 and will not be repeated here.
  • Fig. 8 is a schematic flow chart of another embodiment of the voice data processing method provided by the present application.
  • The embodiment shown in FIG. 8 is based on the embodiment shown in FIG. 3. When the smart device recognizes that the collected first voice data does not include the first word but includes the preset content, it does not directly prompt the first word through S103, but continues to collect second voice data in S301, where the smart device can collect the valid words the user speaks after the first voice data until the end of stream recognition, and the collected data is recorded as the second voice data.
  • the detected sentence may be a sentence in which the user asks the smart device for the first word
  • The semantics of the sentence included in the second voice data may be determined to be an inquiry about the first word by means of semantic recognition.
  • According to the above-mentioned sentences included in the collected second voice data, the smart device determines that the user really wants to wake up the smart device but cannot determine the wake-up word.
  • The smart device then prompts the first word in the manner shown in FIG. 5 or FIG. 6. Therefore, in this embodiment, even if the user does not say the wake-up word, the smart device can still respond to the user's inquiry about the wake-up word, further enriching the functions of the smart device and improving its degree of intelligence.
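The check that the second voice data asks for the wake word is described as semantic recognition. A crude keyword stand-in is sketched below; a real system would use an NLU model, and the query patterns are purely illustrative.

```python
# Hypothetical query patterns; a production system would use semantic
# recognition (an NLU model) rather than keyword matching.
QUERY_PATTERNS = (
    "what's your name",
    "what is your wake word",
    "how do i wake you",
    "what should i call you",
)

def asks_for_wake_word(transcript):
    """Return True if the utterance looks like a question asking for the
    currently configured wake word (the first word)."""
    text = transcript.lower()
    return any(pattern in text for pattern in QUERY_PATTERNS)
```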
  • Fig. 9 is a schematic diagram of the smart device provided by the present application to realize the processing of the preset content.
  • As shown in FIG. 9, the smart device stores the preset content in its local storage, which can hold the preset main wake-up word of the smart device, popular wake-up words of other brands of smart devices, and so on.
  • The cloud storage provided by the supplier's server can manage the newly added popular wake-up words issued to the smart device and the wake-up words corresponding to user accounts through two modes: operation and account.
  • The operation management mode means that if new popular wake-up words of other devices appear in the market, smart devices of the same type are identified through a feature code or similar means, and the newly added wake-up words are issued to such devices in batches.
  • The account management mode means that a wake-up word changed on a device the user has logged in to is bound to and stored synchronously with the user account through the cloud. Then, when the smart device is started, it first checks the local wake-up word storage: if no wake-up word is stored locally, it pulls the operation-managed wake-up word data and stores it; if the user account is online, the account-bound wake-up words are pulled and merged with the local store; and if the user changes the wake-up word locally and the user account is detected to be online after the operation is completed, the device actively pushes the update to the cloud.
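The startup merge of local, operation-managed, and account-bound wake words described above can be sketched as follows (function and parameter names are assumptions for the sketch):

```python
def sync_wake_words(local, operation_cloud, account_cloud, account_online):
    """Startup merge: if nothing is stored locally, pull the operation-
    managed popular wake words; if the user account is online, merge its
    account-bound wake words into the local store."""
    merged = list(local)
    if not merged:
        merged.extend(operation_cloud)     # first boot: pull operated list
    if account_online:
        for word in account_cloud:         # merge account-bound wake words
            if word not in merged:
                merged.append(word)
    return merged
```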
  • Figure 10 is a schematic flow diagram of an embodiment of the processing of voice data provided by the present application. After the user speaks the first voice data, the smart device uses a machine learning model to verify the wake-up word. If it determines that the first voice data includes the first word, it responds normally and executes the command. If the first word is not included but the preset content is included (the preset content here being a wake-up word from before the modification), the detection count is incremented and the second voice data spoken by the user continues to be collected.
  • when the smart device determines that the user is asking for the first word, it prompts the user with the first word; alternatively, when the detection count is greater than the preset number of times, the smart device prompts the user with the first word.
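  • the counting logic of this flow can be sketched as follows; the threshold `PRESET_LIMIT` and the substring-based matching are illustrative stand-ins for the model-based detection described above:

```python
# Sketch of the Fig. 10 counting logic: count consecutive utterances that
# contain the preset content (e.g. an old wake word) but not the current
# first word, and prompt once the count exceeds a preset number of times.

PRESET_LIMIT = 3  # illustrative value for the "preset number of times"

def should_prompt(utterance, first_word, preset_words, state):
    """Update the detection count in state; return True when the device
    should prompt the user with the current first word."""
    if first_word in utterance:
        state["count"] = 0          # normal wake-up: respond and reset
        return False
    if any(w in utterance for w in preset_words):
        state["count"] += 1         # old or other-brand wake word detected
        return state["count"] > PRESET_LIMIT
    return False                    # unrelated speech: ignore

state = {"count": 0}
for text in ["XXXX hi", "XXXX hi", "XXXX hi", "XXXX hi"]:
    prompt = should_prompt(text, first_word="YYYY", preset_words=["XXXX"], state=state)
```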
  • the smart device can use its own machine learning model to identify whether the first voice data includes the first word and the preset content; alternatively, the smart device can send the first voice data to a server in the cloud, which recognizes whether the first voice data includes the first word and the preset content and returns the recognition result to the smart device, thereby reducing the computational load on the smart device.
  • the smart device can also perform recognition by comparing the pinyin of each character in the first voice data with the pinyin of the wake-up word and the pinyin of the preset content, so as to make the matching fuzzier and improve the recognition rate.
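  • such pinyin-based fuzzy matching might look like the following sketch; the tiny character-to-pinyin table is an illustrative stand-in for a full pinyin lexicon (which could, for example, be built with a library such as pypinyin):

```python
# Sketch of fuzzy wake-word matching via pinyin comparison. PINYIN is an
# illustrative stand-in for a full character-to-pinyin lexicon; note that
# the homophones 小 and 筱 both map to "xiao", which is what makes the
# matching fuzzy.

PINYIN = {"小": "xiao", "筱": "xiao", "聚": "ju", "海": "hai", "信": "xin"}

def to_pinyin(text):
    """Convert a string of characters to a list of pinyin syllables."""
    return [PINYIN[ch] for ch in text if ch in PINYIN]

def contains_by_pinyin(utterance, wake_word):
    """True if the wake word's pinyin occurs contiguously in the utterance's."""
    u, w = to_pinyin(utterance), to_pinyin(wake_word)
    if not w:
        return False
    return any(u[i:i + len(w)] == w for i in range(len(u) - len(w) + 1))
```

Matching on pinyin rather than on characters lets a mis-transcribed homophone still hit the wake word, which is the increased "fuzziness" described above.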
  • the verification of the wake-up word can be performed, for example, by wake-up model scoring: when the first voice data is collected, the wake-up model scores it against the wake-up word. If the scoring result is the first word currently set by the user, the device responds normally to the user's wake-up. If the result is not the currently set first word but is the stored preset content, the device collects the second voice data and enters the semantic-analysis push preparation stage; the second voice data is then pushed to the server for semantic recognition and processing. When the recognition result sent by the server is received and the semantics of the second voice data is determined to be a query for the first word, the smart device prompts the first word again.
  • Fig. 11 is a schematic diagram of the processing structure with which the smart device provided by the present application processes speech data. The speech recognition technology for speech data processing mainly includes four parts: signal processing and feature extraction, the acoustic model, the language model, and the decoder.
  • signal processing and feature extraction take the audio signal as input, enhance the speech by eliminating noise and channel distortion, transform the signal from the time domain to the frequency domain, and extract suitable representative features for the subsequent acoustic model.
  • there are currently many methods for extracting sound feature vectors, such as Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC), and the Multimedia Content Description Interface (MPEG-7).
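  • as an illustration of the feature-extraction stage, a compact MFCC computation over framed, windowed audio might look as follows (the frame length, hop size, and filter counts are typical defaults, not values from the application):

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    """Compute a simple MFCC matrix (frames x n_ceps) from a 1-D signal."""
    # 1. Frame the signal and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 2. Time domain -> frequency domain: power spectrum per frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Triangular mel filterbank (filters equally spaced on the mel scale).
    n_bins = power.shape[1]
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_bins - 1) * hz_pts / (sr / 2)).astype(int)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # 4. Log mel energies, then a DCT-II to decorrelate (cepstral coefficients).
    log_mel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return log_mel @ dct.T
```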
  • the acoustic model converts speech into an acoustic representation, that is, it finds the probability that a given speech segment originates from a given acoustic symbol.
  • the most commonly used acoustic modeling method is the Hidden Markov Model (HMM), in which the state is a hidden variable, the speech is the observation, and the jumps between states conform to the Markov assumption; the state transition probability density is mostly modeled by a geometric distribution.
  • common modeling choices include the Gaussian mixture model (GMM), the deep neural network (DNN), the convolutional neural network (CNN), and the recurrent neural network (RNN).
  • the FSMN proposed by iFLYTEK is an improved network structure based on the DNN.
  • a delay structure is introduced into the hidden layer of the DNN, and the historical information of the hidden layer at times t-N to t-1 is used as input to the next layer, thereby introducing the historical information of the speech sequence while avoiding the problems caused by training an RNN with BPTT.
  • the problems caused by training an RNN with BPTT include vanishing gradients, high computational complexity, and so on.
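  • the FSMN memory mechanism described above can be sketched as follows; the scalar memory weights and single-layer setup are a simplification for illustration:

```python
import numpy as np

def fsmn_layer(hidden_seq, weights, memory_weights):
    """Sketch of an FSMN-style layer: feed each hidden state plus a weighted
    sum of its previous N hidden states (the memory block) to the next layer."""
    T, d = hidden_seq.shape
    N = len(memory_weights)
    out = np.zeros((T, weights.shape[1]))
    for t in range(T):
        # memory: weighted combination of h[t-N] .. h[t-1] (zeros before start)
        mem = sum((memory_weights[i] * hidden_seq[t - 1 - i]
                   for i in range(N) if t - 1 - i >= 0), np.zeros(d))
        out[t] = np.tanh((hidden_seq[t] + mem) @ weights)
    return out
```

Because the memory is a fixed feedforward combination of past hidden states, the layer sees sequence history without the recurrent feedback loop that makes BPTT expensive and gradient-unstable.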
  • the language model uses the training corpus to learn the relationships between words, so as to estimate the likelihood of hypothesized word sequences, also known as the language model score.
  • statistical language models have become the mainstream language processing technology in speech recognition. There are many kinds of statistical language models, such as the N-gram language model, the Markov N-gram model, exponential models, decision tree models, etc.
  • the N-gram language model is the most commonly used statistical language model, especially the bigram and trigram language models.
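  • as an illustration of the language model score, a bigram model estimated by maximum likelihood from a toy corpus might look as follows:

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate P(w2 | w1) by maximum likelihood over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])                 # history counts
        bigrams.update(zip(tokens[:-1], tokens[1:])) # pair counts
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

def sentence_score(prob, sent):
    """Language model score: product of bigram probabilities over the sentence."""
    tokens = ["<s>"] + sent + ["</s>"]
    score = 1.0
    for w1, w2 in zip(tokens[:-1], tokens[1:]):
        score *= prob(w1, w2)
    return score
```

A production model would add smoothing for unseen bigrams; this unsmoothed version only illustrates how a word-sequence likelihood is composed.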
  • the decoder recognizes the input sequence of speech frames based on the trained acoustic model, in combination with the dictionary and the language model.
  • its main work includes: given the input feature sequence x_1^T, finding the best word string through Viterbi search in the search space composed of four knowledge sources: the acoustic model, the acoustic context, the pronunciation dictionary, and the language model.
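  • the Viterbi search can be sketched as follows over a small dense HMM; a real decoder searches a far larger graph composed from the acoustic model, context, dictionary, and language model:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, observations):
    """Find the most likely hidden-state sequence for an observation sequence.
    log_init: (S,), log_trans: (S, S), log_emit: (S, O), all log-probabilities."""
    S, T = len(log_init), len(observations)
    delta = np.full((T, S), -np.inf)    # best log-score of a path ending in s at t
    back = np.zeros((T, S), dtype=int)  # backpointers for path recovery
    delta[0] = log_init + log_emit[:, observations[0]]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_trans[:, s]
            back[t, s] = np.argmax(scores)
            delta[t, s] = scores[back[t, s]] + log_emit[s, observations[t]]
    path = [int(np.argmax(delta[-1]))]  # trace back the best state string
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```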
  • the voice data processing method provided by the embodiments of the present application has been introduced above. To realize the various functions of the method provided by the above embodiments, the smart device serving as the execution subject may include a hardware structure and/or a software module, and realize the above functions in the form of a hardware structure, a software module, or a hardware structure plus a software module. Whether one of the above functions is executed as a hardware structure, a software module, or a hardware structure plus a software module depends on the specific application and the design constraints of the technical solution.
  • FIG. 12 is a schematic structural diagram of an embodiment of a voice data processing device provided by the present application.
  • the division of the above device into modules is only a division of logical functions; in actual implementation, the modules may be fully or partially integrated into one physical entity, or physically separated.
  • these modules may all be implemented in the form of software called by a processing element, or all in the form of hardware; alternatively, some modules may be implemented as software called by a processing element and others in hardware. A module may be a separately established processing element, or may be integrated into a chip of the above device; it may also be stored in the memory of the above device in the form of program code, with a processing element of the device calling and executing the function of the module.
  • each step of the above method or each module above can be completed by an integrated logic circuit of hardware in the processor element or an instruction in the form of software.
  • the above modules may be one or more integrated circuits configured to implement the above method, for example: one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field programmable gate arrays (FPGA), etc.
  • the processing element may be a general-purpose processor, such as a central processing unit (central processing unit, CPU) or other processors that can call program codes.
  • these modules can be integrated together and implemented in the form of a system-on-a-chip (SOC).
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • when implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (SSD)), etc.
  • the present application also provides an electronic device, including a processor and a memory connected through a bus, wherein a computer program is stored in the memory; when the processor executes the computer program, the processor can be used to execute any of the voice data processing methods in the above-mentioned embodiments of the present application.
  • the present application also provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed, it can be used to perform any of the voice data processing methods provided in the foregoing embodiments of the present application.
  • the embodiment of the present application also provides a chip for running instructions, and the chip is used to execute the voice data processing method provided in any one of the foregoing embodiments of the present application.
  • the present application also provides a computer program product, including a computer program; when executed by a processor, the computer program can be used to implement any of the voice data processing methods described above in the present application.
  • the aforementioned program can be stored in a computer-readable storage medium.
  • when executed, the program performs the steps of the above method embodiments; the aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Telephone Function (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A speech data processing method and apparatus. The method comprises: configuring a wake-up word of a smart device as a first word (S100); processing collected first speech data (S101); and, when the first speech data does not comprise the first word but comprises preset content (S102), prompting a user with the first word (S103). In this way, situations in which the smart device cannot be woken because the user forgets or does not know a modified wake-up word, or mistakenly says the wake-up word of another device, are prevented, so that the intelligence level of the smart device is enhanced and the usage experience of users of smart devices applying the method and apparatus is improved.

Description

Voice data processing method and device
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202111516804.3, titled "Method and Device for Processing Voice Data", filed with the State Intellectual Property Office on December 13, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of voice data processing, and in particular to a voice data processing method and device.
Background
With the development of electronic technology, more and more smart devices such as televisions and speakers are equipped with voice interaction functions, so that users can issue instructions to a smart device by speaking; after the device collects the user's voice data, it recognizes and executes the instructions therein.
In the prior art, in order to save power, a smart device is usually in a low-power working mode. When a user talks to the smart device, the user needs to first say the smart device's wake-up word to "wake up" the device and switch it to the normal working state. Correspondingly, only after the smart device detects the wake-up word does it continue to process the instructions the user speaks after the wake-up word.
In the prior art, the wake-up words of some smart devices can be changed. After the wake-up word is changed, once the user forgets or cannot determine the changed wake-up word, the user will be unable to "wake up" the smart device, which results in insufficient intelligence of the smart device and seriously degrades the user experience.
Summary
The present application provides a voice data processing method and device, used to solve the technical problem that failure to wake up a smart device results in insufficient intelligence of the smart device and a poor user experience.
The present application provides a voice data processing method, including: determining that the wake-up word of the smart device is configured as a first word; collecting first voice data of a user; when it is recognized that the first voice data includes the first word, the smart device switches its working state; when it is recognized that the first voice data does not include the first word but includes preset content, the smart device does not switch its working state and prompts the user with the first word; when it is recognized that the first voice data includes neither the first word nor the preset content, the smart device does not switch its working state.
In an embodiment of the first aspect of the present application, when it is recognized that the first voice data does not include the first word but includes the preset content, the smart device does not switch its working state and prompts the user with the first word, which includes: when it is recognized that the first voice data does not include the first word but includes the preset content, incrementing a detection count by 1, the detection count being the number of consecutive times that collected voice data does not include the first word but includes the preset content; and when the detection count is greater than a preset number of times, prompting the user with the first word.
In an embodiment of the first aspect of the present application, when it is recognized that the first voice data does not include the first word but includes the preset content, the smart device does not switch its working state and prompts the user with the first word, which includes: when it is recognized that the first voice data does not include the first word but includes the preset content, collecting second voice data of the user; and when it is recognized that the second voice data includes a sentence whose semantics is an inquiry about the first word, prompting the user with the first word.
In an embodiment of the first aspect of the present application, the preset content includes one or more of the following: at least one wake-up word configured on the smart device before the first word; at least one wake-up word configured under the user account bound to the smart device; and a configured wake-up word of at least one other device having a voice data processing function.
In an embodiment of the first aspect of the present application, the method further includes: when the smart device starts, obtaining from a storage device the at least one wake-up word configured on the smart device before the first word, and obtaining from a server the configured wake-up word of the at least one other device having a voice data processing function; and when the user logs in to the smart device with an account, obtaining from the server, according to the user's account, the at least one wake-up word configured under the user account bound to the smart device.
In an embodiment of the first aspect of the present application, the method further includes: when the user logs in to the smart device with an account and changes the wake-up word of the smart device from the first word to a second word, sending the second word to the server so that the server records the second word.
In an embodiment of the first aspect of the present application, prompting the user with the first word includes: displaying text prompt information of the first word on a display interface; or playing voice prompt information of the first word.
In an embodiment of the first aspect of the present application, after a preset time, the smart device stops prompting the user with the first word; or, after third voice data of the user is collected and it is recognized that the third voice data includes the first word, the smart device stops prompting the user with the first word.
In an embodiment of the first aspect of the present application, after collecting the first voice data of the user, the method further includes: determining, through a machine learning model, whether the first voice data includes the first word and the preset content; or determining the pinyin of each character in the first voice data, and determining, from the pinyin of each character, the pinyin of the first word, and the pinyin of the preset content, whether the first voice data includes the first word and the preset content.
A second aspect of the present application provides a voice data processing device for executing the voice data processing method provided in any one of the first aspect of the present application. The device includes: a determination module, configured to determine that the wake-up word of the smart device is configured as a first word; a collection module, configured to collect first voice data of a user; a processing module, configured to recognize whether the first voice data includes the first word and whether it includes preset content, wherein when it is recognized that the first voice data includes the first word, the smart device switches its working state; when it is recognized that the first voice data does not include the first word but includes the preset content, the smart device does not switch its working state; and when it is recognized that the first voice data includes neither the first word nor the preset content, the smart device does not switch its working state; and a prompt module, configured to prompt the user with the first word when it is recognized that the first voice data does not include the first word but includes the preset content.
Brief Description of the Drawings
In order to illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Apparently, the drawings in the following description are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of an application scenario of the present application;
Fig. 2 is a schematic flowchart of a method for a smart device to process voice data;
Fig. 3 is a schematic flowchart of an embodiment of the voice data processing method provided by the present application;
Fig. 4 is a schematic diagram of wake-up words of the smart device provided by the present application;
Fig. 5 is a schematic diagram of one way in which the smart device provided by the present application prompts the wake-up word;
Fig. 6 is a schematic diagram of another way in which the smart device provided by the present application prompts the wake-up word;
Fig. 7 is a schematic flowchart of another embodiment of the voice data processing method provided by the present application;
Fig. 8 is a schematic flowchart of yet another embodiment of the voice data processing method provided by the present application;
Fig. 9 is a schematic diagram of how the smart device provided by the present application processes the preset content;
Fig. 10 is a schematic flowchart of an embodiment of processing voice data provided by the present application;
Fig. 11 is a schematic diagram of the processing structure with which the smart device provided by the present application processes voice data;
Fig. 12 is a schematic structural diagram of an embodiment of the voice data processing device provided by the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Apparently, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
The terms "first", "second", "third", "fourth", etc. (if any) in the specification, claims, and drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application described herein can, for example, be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having", and any variants thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the expressly listed steps or units, but may include other steps or units not expressly listed or inherent to such a process, method, product, or device.
Before the embodiments of the present application are formally introduced, the scenarios to which the present application applies, and the problems existing in those scenarios, are described with reference to the drawings. For example, Fig. 1 is a schematic diagram of an application scenario of the present application, showing a user 1 controlling a smart device 2 through voice interaction, where the smart device 2 may be a mobile phone, tablet computer, television, smart speaker, or another smart home appliance or electronic device with voice interaction functions; in Fig. 1, the smart device 2 is a television as an example.
In some embodiments, in order to save power, the smart device 2 is normally in a low-power working mode. When user 1 needs to issue an instruction to the smart device 2 by voice, the user first says the wake-up word "XXXX" set on the smart device 2, and then says the instruction "play a movie". For the smart device 2, the processing flow may refer to the process shown in Fig. 2, which is a schematic flowchart of a method for a smart device to process voice data. After the smart device 2 collects first voice data in S10 through a voice collection apparatus such as a microphone, it first identifies in S20 whether the first voice data includes the first word, e.g., the wake-up word "XXXX". If the wake-up word is not included, the device does not switch to the normal working state but remains in the low-power state, and returns to S10 to continue collecting voice data. If it is recognized in S20 that the first voice data includes the first word, the smart device 2 switches to the working state in S30 according to the first word, and in S40 recognizes and executes the command in the first voice data, or continues to collect subsequent voice data and recognizes and executes the commands therein. Finally, after the command is executed, or after no further user speech is detected for a period of time, the smart device 2 switches from the working state back to the low-power state and re-executes the steps of collecting voice data and recognizing the wake-up word in S10.
In some embodiments, the smart device 2 shown in the scenario of Fig. 1 may, after receiving the voice data uttered by the user, process the voice data through its built-in machine learning model to obtain the wake-up words and commands therein; alternatively, the smart device 2 may send the voice data to a network server 3, which recognizes the voice data and returns the obtained wake-up words and commands to the smart device. Finally, the smart device 2 determines that user 1 has spoken the command "play a movie", obtains the movie data from server 3, and plays the movie on its display screen 21.
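The S10-S40 loop described above can be sketched as a simple two-state machine; the function name and the wake-up word "XXXX" below are illustrative placeholders, not part of the application:

```python
# Minimal sketch of the S10-S40 wake-word loop described above, operating
# on already-transcribed utterances (the microphone pipeline, ASR model,
# and command handler are abstracted away).

LOW_POWER, WORKING = "low_power", "working"

def wake_loop(utterances, wake_word="XXXX"):
    """Process a stream of utterances; return the commands executed."""
    state, executed = LOW_POWER, []
    for text in utterances:          # S10: collect voice data
        if state == LOW_POWER:
            if wake_word in text:    # S20: check for the wake-up word
                state = WORKING      # S30: switch working state
                command = text.replace(wake_word, "").strip()
                if command:          # S40: run a command in the same utterance
                    executed.append(command)
            # else: stay in low-power mode and keep listening
        else:
            if text:                 # S40: subsequent command
                executed.append(text)
            state = LOW_POWER        # return to low power after handling
    return executed
```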
In some embodiments, the wake-up word of a smart device is not fixed; the user may delete, modify, or replace it, so as to enrich the user experience and improve functionality. For example, the wake-up word preset by the supplier of the smart device 2 shown in Fig. 1 is "XXXX", and user 1 may change the wake-up word to "YYYY", etc. The above "XXXX" and "YYYY" are only generic examples; the number of characters and the specific implementation of each wake-up word are not limited, as long as the wake-up words before and after the change differ. For example, the wake-up word may be changed from "海信小聚" ("Hisense Xiaoju") to "小聚小聚" ("Xiaoju Xiaoju").
However, after the wake-up word of the smart device is changed to "YYYY", once the user forgets the modified wake-up word, or other household users do not know the changed wake-up word or have not adapted to it and still say the preset wake-up word "XXXX" to the smart device, the smart device will determine that the collected voice data does not include the wake-up word "YYYY" and therefore will not switch its working state. The user is then unable to issue voice commands to the smart device, creating the impression that the smart device cannot be "woken up" and seriously degrading the user experience.
In other embodiments, a household includes multiple smart terminals, such as a TV in the living room, an air conditioner in the bedroom, and a smart speaker. These terminals are set to different wake-up words by the user, or default to different wake-up words, so each device can only be woken by its own wake-up word. It is therefore quite likely that a user who intends to wake the TV will instead utter the wake-up word of another device. The user may also move between different places, such as home, office, and public venues, where the devices likewise tend to require different wake-up words; a user accustomed to calling out a particular wake-up word at home may well call out the same word elsewhere in an attempt to wake other devices. Clearly, because different devices have different wake-up words, the wake-up word the user utters may not correspond to the target device, and the target device cannot be woken.
Therefore, the present application provides a voice data processing method and apparatus to solve the technical problem that, in the above scenarios, a smart device may fail to wake up after its wake-up word has been changed, which lowers the degree of intelligence of the device. The technical solution of the present application is described in detail below with specific embodiments. The following embodiments may be combined with one another, and identical or similar concepts or processes may not be described again in some embodiments.
FIG. 3 is a schematic flowchart of an embodiment of the voice data processing method provided by the present application. The method shown in FIG. 3 may be applied in the scene shown in FIG. 1 and is executed by the smart device 2. The method includes:
S101: The smart device collects first voice data and identifies whether the first voice data includes the wake-up word and whether it includes preset content.
In S100, performed before S101, the smart device determines that its wake-up word has been configured as a first word. Assume the wake-up word before the change was "XXXX" and the user has changed it to the first word "YYYY"; the smart device will now switch its working state only after collecting voice data and recognizing that it contains the first word "YYYY" currently serving as the wake-up word. It can be understood that once the wake-up word is configured as the first word, and unless it is reconfigured, the smart device repeatedly collects voice data and switches its working state according to the first word. In some embodiments, S100 may consist of the smart device determining, after startup, that its current wake-up word is configured as the first word; alternatively, S100 may consist of the smart device configuring the current wake-up word as the first word according to the user's instruction.
As shown in S20-S40 of FIG. 3, when the smart device determines that the first voice data collected in S101 includes the first word "YYYY", it switches its working state and executes the command following the first word in the first voice data, or continues to collect voice data and execute the commands therein. S20-S40 are implemented in the same way as shown in FIG. 2 and are not repeated here. Otherwise, when the device detects that the voice data does not include the first word currently serving as the wake-up word, it does not switch its working state.
In particular, in S102-S103 of this embodiment of the present application, when the smart device recognizes that the first voice data collected in S101 does not include the first word "YYYY" but does include the preset content, it determines that the user uttered the first voice data intending to wake the device but spoke a wrong wake-up word. Therefore, in S103, the smart device prompts the user, through a visual page (UI), synthesized speech (TTS), or the like, that the currently configured wake-up word of the smart device is the first word, and then returns to S101 to collect and recognize voice data again.
In some embodiments, the preset content may include one or more of the following items a-c. Item a: at least one wake-up word that was configured on the smart device before the wake-up word was configured as the first word, denoted the second word. For example, the default wake-up word provided by the supplier of the smart device is "XXXX"; during use, the user has at various times configured the wake-up word as "AAAA" and "BBBB", and after the latest configuration the current wake-up word is the first word "YYYY". In this case the preset content may include the previously configured words "AAAA" and "BBBB". These second words are stored in the smart device in the form of a second word set; after voice data is subsequently received, a speech recognition model or similar means can be used to determine whether the voice data includes the stored preset content. When the smart device 2 starts, the server may send the second word set to the smart device 2.
In some embodiments, the preset content may further include item b: at least one wake-up word configured under the user account bound to the smart device, denoted the third word. For example, FIG. 4 is a schematic diagram of the wake-up words of the smart device provided by the present application. While using the smart device 2, the user "logs in" to the device with a user account, thereby binding the user account to the smart device 2. The smart device 2 can then obtain from the network server a third word set containing the wake-up words configured on other devices used under that user account.
Specifically, as shown in FIG. 4, when the user logs in to the smart device with the user account and changes the wake-up word from "XXXX" to the new wake-up word "YYYY" via the path labeled ①, the smart device stores the changed wake-up word "YYYY" and then sends it to the server via the path labeled ②, where it is stored in the third word set corresponding to the user account. Whenever the server receives a wake-up word sent by any device bound to the same user account, it records that wake-up word in the third word set corresponding to the account. Thus, when the smart device 2 shown in FIG. 4 detects that the user has logged in with the user account, it can request the word set stored on the server according to the user account, and the server sends the word set to the smart device via the path labeled ③.
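The bookkeeping behind paths ② and ③ can be sketched as follows. This is a minimal illustration of the server-side behavior described above, not the patent's actual implementation; the class and method names are assumptions for the sake of the example.

```python
class WakeWordServer:
    """Toy model of the server-side bookkeeping in FIG. 4.

    Each user account maps to a third word set holding every wake-up
    word reported by any device bound to that account.
    """

    def __init__(self):
        self._third_word_sets = {}  # account id -> set of wake-up words

    def report_wake_word(self, account, wake_word):
        # Path 2: a device pushes its newly configured wake-up word.
        self._third_word_sets.setdefault(account, set()).add(wake_word)

    def fetch_third_words(self, account):
        # Path 3: a device that detects a login pulls the account's set.
        return set(self._third_word_sets.get(account, set()))
```

In this sketch, a device that changes its wake-up word calls `report_wake_word`, and any device on which the same account later logs in calls `fetch_third_words` to obtain the accumulated set.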
In some embodiments, the preset content may further include item c: one or more wake-up words configured on at least one other device having a voice data processing function, denoted the fourth word. Here, other devices are electronic devices that likewise have a speech recognition function, such as smart speakers, computers, and mobile phones, which may be provided by suppliers different from that of the smart device. As shown in FIG. 4, the server provided by the supplier of the smart device 2 can obtain the wake-up words preset on other devices from the Internet via the path labeled ⑤ and store them in a fourth word set. When the smart device 2 starts, the server can send the fourth word set to the smart device 2 via the path labeled ④ in FIG. 4.
In some embodiments, the preset content stored in the smart device shown in FIG. 4 may include one or more of items a-c above. When the smart device recognizes that the voice data includes the first word, it switches its working state and executes the command; when it recognizes that the voice data does not include the first word but includes any of the preset content (a second word, third word, or fourth word), it prompts the first word. It can be understood that when the smart device recognizes that the voice data includes neither the first word nor the preset content, it does not respond and collects voice data again for recognition.
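The three-way decision just described (current wake-up word → switch state; any word from the preset content → prompt; otherwise → no response) can be sketched as follows. This is a simplified illustration operating on already-recognized text; the function name and word sets are hypothetical.

```python
def handle_recognized_text(text, first_word, second_words, third_words, fourth_words):
    """Decide how the smart device reacts to recognized voice text.

    Returns "wake" when the current wake-up word (first word) is present,
    "prompt" when any second/third/fourth word from the preset content is
    present instead, and "ignore" otherwise.
    """
    preset_content = set(second_words) | set(third_words) | set(fourth_words)
    if first_word in text:
        return "wake"
    if any(word in text for word in preset_content):
        return "prompt"
    return "ignore"
```

A real device would apply this decision to the output of a wake-word model rather than to plain strings, but the branching structure is the same.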
In some embodiments, FIG. 5 is a schematic diagram of one way in which the smart device provided by the present application prompts the wake-up word. Assume the first word of the smart device is "YYYY" and the preset content includes the pre-change wake-up word "XXXX". When the user says "XXXX, play a movie" and the smart device recognizes that the voice data does not include the first word but includes the pre-change wake-up word in the preset content, the smart device 2 may display textual prompt information 211 on its display interface 21: "Please call me YYYY". This may be implemented, for example, as a pop-up window in the UI; this embodiment does not limit the UI implementation. In some embodiments, after displaying the textual prompt, the smart device may keep it on screen until it subsequently collects third voice data from the user and recognizes that the third voice data includes the first word, indicating that the prompt has enabled the user to learn the new wake-up word, at which point it stops displaying the prompt; alternatively, to avoid interfering with other display pages, the smart device may stop displaying the prompt after a preset time (for example, 15 s).
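The two dismissal conditions for the textual prompt (the first word is heard, or the preset display time elapses) can be sketched as a small predicate. The parameter names and the 15-second default are taken from the example above; everything else is an assumption for illustration.

```python
def prompt_should_remain(shown_at, now, latest_text, first_word, timeout_s=15.0):
    """Decide whether the "Please call me YYYY" prompt stays on screen.

    The prompt is dismissed either when later voice data contains the
    first word, or after a preset display time (15 s in the example).
    """
    if latest_text is not None and first_word in latest_text:
        return False  # user has used the new wake-up word
    if now - shown_at >= timeout_s:
        return False  # preset display time elapsed
    return True
```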
FIG. 6 is a schematic diagram of another way in which the smart device provided by the present application prompts the wake-up word. When the smart device recognizes that the voice data does not include the first word but includes the pre-change wake-up word in the preset content, it may play a voice prompt for the first word, such as "Please call me YYYY", through a speaker or other playback device. It can be understood that the above voice prompt is merely an example; richer, more personable prompts may also be played, such as "My name is now YYYY; please use my new name to wake me up" or "I'm called YYYY now, waiting for you to wake me anytime".
Therefore, in specific implementations, the voice data processing method provided by the present application supports the following application scenarios. Scenario 1: after user A changes the wake-up word of the smart device, user B says the pre-change wake-up word to the device, and the device prompts the changed wake-up word. Scenario 2: the user changes the wake-up word and then forgets it, or habitually says the pre-change wake-up word, and the device prompts the changed wake-up word. Scenario 3: the user says the wake-up word of another device to the smart device, and the device prompts its own wake-up word.
In summary, in the voice data processing method provided by the embodiments of the present application, the smart device not only collects first voice data and switches its working state according to the first word in the first voice data, but also prompts the user with the first word when the first voice data does not include the first word but does include the preset content. Thus, in cases where the wake-up word of the smart device can be changed, the method prevents the situation in which the user cannot "wake" the device because the user has forgotten or does not know the modified wake-up word, or mistakenly says the wake-up word of another device. When the user mistakenly utters a word from the preset content but actually "hopes" to wake the smart device, the device proactively prompts the user with the correct wake-up word, helping the user say the current wake-up word and wake the device. This improves the degree of intelligence of the smart device and the user experience. Moreover, the entire process can be implemented and optimized purely in the software of the smart device, avoiding changes to its hardware, with low design and manufacturing costs, and is easy to implement and popularize.
FIG. 7 is a schematic flowchart of another embodiment of the voice data processing method provided by the present application. On the basis of the embodiment shown in FIG. 3, in the embodiment shown in FIG. 7, when the smart device recognizes in S102 that the collected first voice data does not include the first word but includes the preset content, it increments a detection count by 1 in S201. The detection count is the number of consecutive times the voice data collected by the smart device has not included the first word but has included the preset content. Subsequently, when S202 determines that the accumulated detection count exceeds a preset number, the smart device prompts the first word via S103.
Illustratively, when the first voice data collected by the smart device on three consecutive occasions does not include the first word "YYYY" currently serving as the wake-up word but does include the word "XXXX" from the preset content, this indicates that the user, by repeatedly calling out a word from the preset content, intends to wake the smart device but is using a wrong wake-up word. Therefore, after the third consecutive detection that the first voice data does not include the first word "YYYY" but includes the same word "XXXX" from the preset content, the smart device prompts the user with the first word in the manner of FIG. 5 or FIG. 6. Alternatively, the detection count may be the number of times, within a preset time period (for example, 1 minute), that the collected voice data does not include the first word but includes the preset content. Thus, by calculating and accumulating the detection count, the embodiment of FIG. 7 verifies whether the user's purpose in uttering the preset content is to wake the smart device, which ensures the accuracy and effectiveness of subsequent prompts and improves the processing accuracy and efficiency of the smart device. The other steps in FIG. 7 are implemented in the same way as in FIG. 3 and are not repeated here.
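Both counting variants described above (a consecutive-detection threshold and a sliding time window) can be sketched in one small class. The threshold of 3 detections and the 60-second window are the example values from the text; the class and method names are assumptions.

```python
import time

class MistakenWakeCounter:
    """Count detections of 'preset content but no first word' and decide
    when the prompt should fire, per the FIG. 7 embodiment."""

    def __init__(self, preset_times=3, window_s=60.0):
        self.preset_times = preset_times
        self.window_s = window_s
        self._timestamps = []

    def record(self, now=None):
        """Register one detection; return True when the prompt should fire."""
        now = time.monotonic() if now is None else now
        self._timestamps.append(now)
        # Keep only detections inside the sliding time window.
        self._timestamps = [t for t in self._timestamps
                            if now - t <= self.window_s]
        return len(self._timestamps) >= self.preset_times

    def reset(self):
        # Called when the first word is finally heard, or after prompting.
        self._timestamps.clear()
```

Passing `now` explicitly makes the logic testable; in the device, `record()` would be called with the default monotonic clock each time S102 detects preset content without the first word.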
FIG. 8 is a schematic flowchart of yet another embodiment of the voice data processing method provided by the present application. On the basis of the embodiment shown in FIG. 3, in the embodiment shown in FIG. 8, when the smart device recognizes in S102 that the collected first voice data does not include the first word but includes the preset content, it does not directly prompt the first word via S103; instead, it continues to collect second voice data in S301. The smart device may collect the valid utterances the user makes after the first voice data, up to the point where streaming recognition ends, and record the collected data as the second voice data. Subsequently, in S302, when the smart device recognizes that the second voice data includes a statement inquiring about the first word, it prompts the first word via S103. The detected statement may be one in which the user asks the smart device about the first word; semantic recognition may be used to determine that the semantics of a statement included in the second voice data is an inquiry about the first word.
Illustratively, assume the current wake-up word of the smart device is the first word "YYYY" and the preset content includes the pre-change word "XXXX". When the user says the word "XXXX" and the smart device does not prompt immediately, the user may go on to say something like "Is the wake-up word wrong?", "Is voice wake-up broken?", or "Why can't I wake it by voice anymore?". The smart device then determines, from such statements included in the collected second voice data, that the user indeed wants to wake the device but cannot determine the wake-up word, and it prompts the first word in the manner of FIG. 5 or FIG. 6. Therefore, in this embodiment, even when the user has not spoken the wake-up word, the smart device can still respond to the user's inquiry about the wake-up word, further enriching its functions and improving its degree of intelligence.
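As a very rough stand-in for the semantic recognition step, intent detection on the follow-up utterance can be sketched with keyword matching. A production system would use a server-side semantic model, as the embodiment describes; the marker phrases below are illustrative only.

```python
def is_wake_word_inquiry(utterance):
    """Decide whether a follow-up utterance is asking about the wake-up
    word (the S302 check), using a toy keyword list instead of a real
    semantic model."""
    inquiry_markers = [
        "wake-up word wrong",
        "voice wake-up broken",
        "why can't",
        "what is the wake-up word",
    ]
    normalized = utterance.lower()
    return any(marker in normalized for marker in inquiry_markers)
```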
FIG. 9 is a schematic diagram of how the smart device provided by the present application processes the preset content. As shown in FIG. 9, the smart device stores the preset content in its local storage, which may be preloaded with the device's primary wake-up word as well as popular wake-up words of smart devices of other brands. The cloud storage provided by the supplier's server then manages, via two modes, the newly added popular wake-up words delivered to the smart device and the wake-up words corresponding to user accounts: operation management and account management. In the operation management mode, if other devices on the market gain newly popular wake-up words, devices of the same type are identified by means such as a feature code, and the new wake-up words are delivered to such devices in batches. In the account management mode, the wake-up words changed on any device the user has logged in to are bound to the user account and stored synchronously via the cloud. When the smart device starts, it first checks its locally stored wake-up words; if none are stored locally, it pulls and stores the operation-managed wake-up word data. When the user logs in and comes online with an account, the cloud proactively pushes the cloud-stored wake-up words, which are merged with the local ones. If the user changes the wake-up word locally and the user account is detected to be online after the operation completes, the device proactively pushes the update to the cloud.
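The startup synchronization just described can be sketched as a pure function over the three word stores. The store shapes (plain Python collections) and the function name are assumptions for illustration; the real device would read from persistent local storage and a network API.

```python
def startup_wake_words(local_store, cloud_operation_words, cloud_account_words,
                       account_online):
    """Sketch of the FIG. 9 startup synchronization.

    If nothing is stored locally, pull the operation-managed wake-up
    words; when the user account is online, merge the account's
    cloud-stored words into the local set.
    """
    words = set(local_store)
    if not words:
        words |= set(cloud_operation_words)  # first boot: pull operation data
    if account_online:
        words |= set(cloud_account_words)    # cloud pushes the account's words
    return words
```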
FIG. 10 is a schematic flowchart of an embodiment of processing voice data provided by the present application. After the user utters the first voice data, the smart device verifies the wake-up word using a machine learning model or similar means; upon determining that the first voice data includes the first word, it responds normally and executes the command. When the first word is not included but the preset content is (the preset content here being a previously used wake-up word), the device counts the number of such detections and continues to collect the second voice data uttered by the user. If it subsequently determines from the semantics of the second voice data that the user is asking about the first word, the smart device prompts the user with the first word; alternatively, once the count exceeds the preset number, the smart device prompts the user with the first word.
In some embodiments, the smart device may use its own machine learning model to identify whether the first voice data includes the first word and the preset content. Alternatively, the smart device may send the first voice data to a cloud server, which identifies whether the first voice data includes the first word and the preset content and returns the recognition result to the smart device, thereby reducing the computational load on the device. As yet another option, the smart device may perform recognition by comparing the pinyin of each character in the first voice data with the pinyin of the wake-up word and of the preset content, increasing the fuzziness of matching and thus improving the recognition rate.
Illustratively, in the embodiment shown in FIG. 10, the wake-up word may be verified in the following two ways. (1) Wake-up model scoring: when the first voice data is collected, a wake-up model scores the wake-up word in the first voice data. If the scoring result is the first word currently set by the user, the device responds to the wake-up normally; if the result is not the currently set first word but is part of the stored preset content, the device collects the second voice data and starts the semantic-analysis push preparation stage. The second voice data is then pushed to the server for semantic recognition and processing; upon receiving the recognition result from the server and determining that the semantics of the second voice data is an inquiry about the first word, the smart device prompts the first word. (2) Cloud text recognition and pinyin transliteration: after the user enables this wrong-wake-up prompt function, the recognition engine remains on; the device collects the first voice data, transliterates the recognized text into pinyin, and performs exact matching against the pinyin transliterations of the stored first word and the preset content. If the pinyin matches the first word currently set by the user, the device responds to the wake-up normally; if the match is not the currently set first word but is part of the preset content, the device collects the second voice data and starts the semantic-analysis push preparation stage.
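Verification method (2) — matching at the pinyin level so that homophone recognition errors still match — can be sketched as follows. The tiny character-to-pinyin table is a stand-in for a full lexicon (a real system might use a library such as pypinyin); everything here is illustrative, not the patent's implementation.

```python
# Tiny stand-in pinyin table; a real system would use a full lexicon.
PINYIN = {"小": "xiao", "聚": "ju", "海": "hai", "信": "xin"}

def to_pinyin(text):
    """Transliterate each known character to pinyin, skipping unknown ones."""
    return " ".join(PINYIN[ch] for ch in text if ch in PINYIN)

def pinyin_match(recognized_text, candidate_word):
    """Exact match at the pinyin level: a homophone character with the
    same pronunciation as the wake-up word still matches."""
    cand = to_pinyin(candidate_word)
    return bool(cand) and cand in to_pinyin(recognized_text)
```

The same `pinyin_match` check would be run once against the first word and once against each stored preset word to pick between the "respond normally" and "prepare semantic analysis" branches.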
FIG. 11 is a schematic diagram of the processing structure with which the smart device provided by the present application processes voice data. Speech recognition technology for voice data processing mainly comprises four parts: signal processing and feature extraction, the acoustic model, the language model, and the decoder. In this structure, signal processing and feature extraction take the audio signal as input, enhance the speech by removing noise and channel distortion, transform the signal from the time domain to the frequency domain, and extract suitable, representative feature vectors for the subsequent acoustic model. Many methods exist for acoustic feature extraction, such as Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC), and the multimedia content description interface (MPEG-7). The acoustic model converts speech into an acoustic representation, that is, it finds the probability that a given piece of speech originates from a particular acoustic symbol. The most commonly used acoustic modeling approach is the hidden Markov model (HMM). Under an HMM, the states are hidden variables, the speech frames are the observations, and transitions between states satisfy the Markov assumption. The state transition probability density is usually modeled with a geometric distribution, while the observation probability linking hidden variables to observations is commonly fitted with a Gaussian mixture model (GMM). With the development of deep learning, models such as deep neural networks (DNN), convolutional neural networks (CNN), and recurrent neural networks (RNN) have been applied to modeling the observation probability with very good results. The FSMN proposed by iFLYTEK is an improved DNN-based network structure: a delay structure is introduced into the hidden layers of the DNN, and the hidden-layer history from times t-N to t-1 is fed as input to the next layer, thereby incorporating the historical information of the speech sequence while avoiding the problems of training RNNs with BPTT, such as vanishing gradients and high computational complexity. The language model estimates the likelihood of a hypothesized word sequence, also called the language model score, by learning the relationships between words from a training corpus. Statistical language models have become the mainstream technology for language processing in speech recognition; there are many kinds, such as N-gram language models, Markov N-gram models, exponential models, and decision tree models. The N-gram language model is the most commonly used statistical language model, particularly the bigram and trigram models. The decoder recognizes the input sequence of speech frames based on the trained acoustic model, combined with the dictionary and the language model. Its main work includes: given an input feature sequence x1…xT, searching, by means of the Viterbi algorithm, a search space composed of four knowledge sources (the acoustic model, acoustic context, pronunciation dictionary, and language model) for the best word string.
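The Viterbi search at the heart of the decoder can be illustrated on a toy HMM. This is the textbook dynamic-programming algorithm over dict-based probability tables, far smaller than a real speech-recognition search space, and the example states and observations below are purely didactic.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Find the most likely hidden-state sequence for an observation
    sequence under an HMM (the decoding step described above)."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best_state = max(V[-1], key=V[-1].get)
    return path[best_state]

# Toy two-state HMM for demonstration.
states = ("Healthy", "Fever")
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}
```

In a real decoder the "states" would be context-dependent phone states, the emission probabilities would come from the GMM or DNN acoustic model, and beam pruning would keep the search tractable.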
The foregoing embodiments have described the voice data processing method provided by the embodiments of the present application. To implement the functions of the methods provided above, the smart device acting as the execution subject may include a hardware structure and/or software modules, and may implement each function in the form of a hardware structure, software modules, or a hardware structure plus software modules. Whether a given function is executed as a hardware structure, a software module, or a combination of both depends on the specific application and design constraints of the technical solution.
For example, FIG. 12 is a schematic structural diagram of an embodiment of the voice data processing apparatus provided by the present application. The apparatus 100 shown in FIG. 12 includes: a collection module 1001, a processing module 1002, a prompt module 1003, and a determination module 1004. The determination module 1004 is configured to determine that the wake-up word of the smart device is configured as a first word; the collection module 1001 is configured to collect first voice data of a user; and the processing module 1002 is configured to recognize whether the first voice data includes the first word most recently set for the smart device, and whether it includes preset content. When it is recognized that the first voice data includes the first word, the smart device switches its working state; when it is recognized that the first voice data does not include the first word but includes the preset content, the smart device does not switch its working state; and when it is recognized that the first voice data includes neither the first word nor the preset content, the smart device does not switch its working state. The prompt module 1003 is configured to prompt the user with the first word when it is recognized that the first voice data does not include the first word but includes the preset content.
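The three-way decision made by the processing module can be sketched as follows. Function and variable names are hypothetical, and plain substring matching stands in for actual speech recognition; a real implementation would operate on recognizer output, as described in the method embodiments.

```python
def handle_voice_data(text, current_wake_word, preset_words):
    """Three-way decision of the embodiment: `preset_words` is the hypothetical
    set of previously configured wake-up words (this device, the bound user
    account, or other voice-capable devices)."""
    if current_wake_word in text:
        return "switch_state"          # current wake word recognized: switch working state
    if any(w in text for w in preset_words):
        return "prompt_new_wake_word"  # stale wake word used: stay, prompt the first word
    return "ignore"                    # neither present: keep the current working state

print(handle_voice_data("hey vision, play music", "hey vision", {"hi tv"}))  # switch_state
print(handle_voice_data("hi tv, play music", "hey vision", {"hi tv"}))      # prompt_new_wake_word
print(handle_voice_data("play music", "hey vision", {"hi tv"}))             # ignore
```

The middle branch is the key behavior of the application: a user who says an old wake-up word is not ignored silently but is told the currently configured word.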
Specifically, for the principles and implementation of the steps performed by each module of the voice data processing apparatus, reference may be made to the description of the voice data processing method in the foregoing embodiments of the present application; details are not repeated here.
It should be noted that the division of the above apparatus into modules is only a division of logical functions; in an actual implementation the modules may be fully or partially integrated into one physical entity, or physically separated. These modules may all be implemented as software called by a processing element; they may all be implemented in hardware; or some modules may be implemented as software called by a processing element while others are implemented in hardware. A module may be a separately established processing element, or may be integrated into a chip of the above apparatus; it may also be stored in the memory of the apparatus in the form of program code that is called and executed by a processing element of the apparatus to perform the functions of, for example, the determination module. The other modules are implemented similarly. In addition, all or some of these modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability. During implementation, each step of the above method, or each of the above modules, may be completed by an integrated logic circuit of hardware in the processor element or by instructions in the form of software.
For example, these modules may be one or more integrated circuits configured to implement the above method, such as one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field-programmable gate arrays (FPGA). As another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of calling program code. As a further example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
The above embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state disk (SSD)).
The present application also provides an electronic device, including a processor and a memory connected by a bus. The memory stores a computer program; when the processor executes the computer program, the processor can be used to perform any of the voice data processing methods in the foregoing embodiments of the present application.
The present application also provides a computer-readable storage medium storing a computer program which, when executed, can be used to perform any of the voice data processing methods provided in the foregoing embodiments of the present application.
An embodiment of the present application also provides a chip for running instructions, the chip being configured to execute the voice data processing method provided in any of the foregoing embodiments of the present application.
The present application also provides a computer program product, including a computer program which, when executed by a processor, can be used to implement any of the foregoing voice data processing methods of the present application.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments; the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements for some or all of the technical features, and that such modifications or replacements do not depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

  1. A method for processing voice data, applied to a smart device, comprising:
    determining that a wake-up word of the smart device is configured as a first word;
    collecting first voice data of a user;
    when it is recognized that the first voice data comprises the first word, switching, by the smart device, a working state;
    when it is recognized that the first voice data does not comprise the first word but comprises preset content, not switching, by the smart device, the working state, and prompting the user with the first word;
    when it is recognized that the first voice data comprises neither the first word nor the preset content, not switching, by the smart device, the working state.
  2. The method according to claim 1, wherein, when it is recognized that the first voice data does not comprise the first word but comprises the preset content, the smart device not switching the working state and prompting the user with the first word comprises:
    when it is recognized that the first voice data does not comprise the first word but comprises the preset content, incrementing a detection count by 1, the detection count being the number of consecutive times that collected voice data does not comprise the first word but comprises the preset content;
    when the detection count is greater than a preset number of times, prompting the user with the first word.
  3. The method according to claim 1, wherein, when it is recognized that the first voice data does not comprise the first word but comprises the preset content, the smart device not switching the working state and prompting the user with the first word comprises:
    when it is recognized that the first voice data does not comprise the first word but comprises the preset content, collecting second voice data of the user;
    when it is recognized that the second voice data comprises a sentence whose semantics is to inquire about the first word, prompting the user with the first word.
  4. The method according to any one of claims 1-3, wherein the preset content comprises one or more of the following:
    at least one wake-up word configured for the smart device before the first word;
    at least one wake-up word configured for a user account bound to the smart device;
    a configured wake-up word of at least one other device having a voice data processing function.
  5. The method according to claim 4, further comprising:
    when the smart device starts up, obtaining, from a storage device, the at least one wake-up word configured for the smart device before the first word, and obtaining, from a server, the configured wake-up word of the at least one other device having the voice data processing function;
    when the user logs in to the smart device with an account, obtaining, from the server according to the user's account, the at least one wake-up word configured for the user account bound to the smart device.
  6. The method according to claim 5, further comprising:
    when the user logs in to the smart device with an account and modifies the wake-up word of the smart device from the first word to a second word, sending the second word to the server so that the server records the second word.
  7. The method according to any one of claims 1-6, wherein prompting the user with the first word comprises:
    displaying text prompt information of the first word on a display interface;
    or playing voice prompt information of the first word by voice.
  8. The method according to claim 7, further comprising:
    stopping prompting the user with the first word after a preset time;
    or stopping prompting the user with the first word after third voice data of the user is collected and it is recognized that the third voice data comprises the first word.
  9. The method according to any one of claims 1-8, further comprising, after the collecting of the first voice data of the user:
    determining, by a machine learning model, whether the first voice data comprises the first word and the preset content;
    or determining the pinyin of each character in the first voice data, and determining, from the pinyin of each character, the pinyin of the first word, and the pinyin of the preset content, whether the first voice data comprises the first word and the preset content.
  10. An apparatus for processing voice data, comprising:
    a determination module, configured to determine that a wake-up word of a smart device is configured as a first word;
    a collection module, configured to collect first voice data of a user;
    a processing module, configured to recognize whether the first voice data comprises the first word and whether it comprises preset content, wherein, when it is recognized that the first voice data comprises the first word, the smart device switches a working state; when it is recognized that the first voice data does not comprise the first word but comprises the preset content, the smart device does not switch the working state; and when it is recognized that the first voice data comprises neither the first word nor the preset content, the smart device does not switch the working state;
    a prompt module, configured to prompt the user with the first word when it is recognized that the first voice data does not comprise the first word but comprises the preset content.
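The pinyin-based alternative of claim 9 can be sketched as follows: matching by pinyin rather than by raw characters tolerates homophone substitutions produced by the recognizer. The tiny hard-coded pinyin table below is hypothetical; a real system would use a full pronunciation lexicon and typically also handle tones and polyphonic characters.

```python
# Hypothetical mini pinyin table; a real system would use a full lexicon.
PINYIN = {"小": "xiao", "效": "xiao", "明": "ming", "鸣": "ming", "你": "ni", "好": "hao"}

def to_pinyin(text):
    """Character-by-character pinyin conversion (tones ignored in this sketch)."""
    return [PINYIN.get(ch, ch) for ch in text]

def contains_by_pinyin(utterance, word):
    """True if the word's pinyin appears as a contiguous run in the utterance's
    pinyin, so a homophone like 效鸣 still matches the configured word 小明."""
    u, w = to_pinyin(utterance), to_pinyin(word)
    return any(u[i:i + len(w)] == w for i in range(len(u) - len(w) + 1))

print(contains_by_pinyin("你好效鸣", "小明"))  # True: homophone match
print(contains_by_pinyin("你好", "小明"))      # False
```

Comparing pronunciations in this way is what lets the method detect a wake-up word (or an old one from the preset content) even when the transcription picks different characters with the same sound.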
PCT/CN2022/107607 2021-12-13 2022-07-25 Speech data processing method and apparatus WO2023109129A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111516804.3A CN114155854B (en) 2021-12-13 2021-12-13 Voice data processing method and device
CN202111516804.3 2021-12-13

Publications (1)

Publication Number Publication Date
WO2023109129A1 true WO2023109129A1 (en) 2023-06-22

Family

ID=80450733

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107607 WO2023109129A1 (en) 2021-12-13 2022-07-25 Speech data processing method and apparatus

Country Status (2)

Country Link
CN (1) CN114155854B (en)
WO (1) WO2023109129A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155854B (en) * 2021-12-13 2023-09-26 海信视像科技股份有限公司 Voice data processing method and device
CN116564316B (en) * 2023-07-11 2023-11-03 北京边锋信息技术有限公司 Voice man-machine interaction method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108986813A (en) * 2018-08-31 2018-12-11 出门问问信息科技有限公司 Wake up update method, device and the electronic equipment of word
CN111386566A (en) * 2017-12-15 2020-07-07 海尔优家智能科技(北京)有限公司 Device control method, cloud device, intelligent device, computer medium and device
CN111640434A (en) * 2020-06-05 2020-09-08 三星电子(中国)研发中心 Method and apparatus for controlling voice device
US10872599B1 (en) * 2018-06-28 2020-12-22 Amazon Technologies, Inc. Wakeword training
CN112885341A (en) * 2019-11-29 2021-06-01 北京安云世纪科技有限公司 Voice wake-up method and device, electronic equipment and storage medium
US20210241759A1 (en) * 2020-02-04 2021-08-05 Soundhound, Inc. Wake suppression for audio playing and listening devices
CN114155854A (en) * 2021-12-13 2022-03-08 海信视像科技股份有限公司 Voice data processing method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109377987B (en) * 2018-08-31 2020-07-28 百度在线网络技术(北京)有限公司 Interaction method, device, equipment and storage medium between intelligent voice equipment
CN111105789A (en) * 2018-10-25 2020-05-05 珠海格力电器股份有限公司 Awakening word obtaining method and device
CN109493849A (en) * 2018-12-29 2019-03-19 联想(北京)有限公司 Voice awakening method, device and electronic equipment
CN113066490B (en) * 2021-03-16 2022-10-14 海信视像科技股份有限公司 Prompting method of awakening response and display equipment


Also Published As

Publication number Publication date
CN114155854B (en) 2023-09-26
CN114155854A (en) 2022-03-08

Similar Documents

Publication Publication Date Title
US11437041B1 (en) Speech interface device with caching component
US11676575B2 (en) On-device learning in a hybrid speech processing system
US11720326B2 (en) Audio output control
US11600291B1 (en) Device selection from audio data
JP6772198B2 (en) Language model speech end pointing
US11669300B1 (en) Wake word detection configuration
US11763808B2 (en) Temporary account association with voice-enabled devices
US10714085B2 (en) Temporary account association with voice-enabled devices
WO2023109129A1 (en) Speech data processing method and apparatus
WO2017071182A1 (en) Voice wakeup method, apparatus and system
US11509525B1 (en) Device configuration by natural language processing system
US8972260B2 (en) Speech recognition using multiple language models
CN111344780A (en) Context-based device arbitration
US11258671B1 (en) Functionality management for devices
CN112927683A (en) Dynamic wake-up word for voice-enabled devices
US11211056B1 (en) Natural language understanding model generation
US20220161131A1 (en) Systems and devices for controlling network applications
JP4475628B2 (en) Conversation control device, conversation control method, and program thereof
WO2019236745A1 (en) Temporary account association with voice-enabled devices
US11763814B2 (en) Hybrid voice command processing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22905891

Country of ref document: EP

Kind code of ref document: A1