CN111261151A - Voice processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111261151A
Authority
CN
China
Prior art keywords
voice
control
server
target
party
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811465282.7A
Other languages
Chinese (zh)
Other versions
CN111261151B (en)
Inventor
杨一帆
徐运
曹轲
罗红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Hangzhou Information Technology Co Ltd
Priority to CN201811465282.7A
Publication of CN111261151A
Application granted
Publication of CN111261151B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • G10L 2015/225 Feedback of the input speech
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a voice processing method and apparatus, an electronic device, and a storage medium. The method comprises: the electronic device receives a wake-up voice input; if a first wake-up word is recognized from the wake-up voice, the device receives a subsequently input control voice and sends it to a first server, so that the first server recognizes and returns a first control instruction of the electronic device corresponding to the control voice, and the electronic device executes the first control instruction; if a second wake-up word is recognized from the wake-up voice, the device receives a subsequently input control voice and sends it to a second server, so that the second server determines the target third-party device corresponding to the control voice based on a control feature word set of third-party devices, and recognizes and returns a second control instruction of the target third-party device corresponding to the control voice; the electronic device then sends the second control instruction to the target third-party device, which executes it. The method and apparatus improve the accuracy of voice control and the user experience.

Description

Voice processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method and an apparatus for processing speech, an electronic device, and a storage medium.
Background
With the continuous development of Internet technology, intelligent hardware has emerged. Intelligent hardware upgrades traditional devices through a combination of software and hardware so that they gain intelligent functions. Once upgraded, such a device (a smart device) can connect to the Internet, forming the typical "cloud + terminal" architecture and offering greater added value.
To make devices easier to use, more and more smart devices support voice interaction. When a smart device processes an input voice, it usually first performs automatic speech recognition (ASR) to convert the voice into text, and then performs natural language processing (NLP) on the text to understand the semantics of the user's voice and give corresponding feedback.
In the prior art, however, when a user performs voice control through a smart device, the smart device cannot accurately identify which device the user intends to control, which degrades the user experience.
Disclosure of Invention
The invention provides a voice processing method and apparatus, an electronic device, and a storage medium, to solve the prior-art problem that voice control through a smart device is inaccurate.
In a first aspect, the present invention discloses a speech processing method applied to an electronic device, the method comprising:
receiving a wake-up voice input;
if a first wake-up word is recognized from the wake-up voice, receiving a subsequently input control voice and sending it to a first server, so that the first server recognizes and returns a first control instruction of the electronic device corresponding to the control voice; and executing the first control instruction;
if a second wake-up word is recognized from the wake-up voice, receiving a subsequently input control voice and sending it to a second server, so that the second server determines a target third-party device corresponding to the control voice based on a control feature word set of third-party devices, and recognizes and returns a second control instruction of the target third-party device corresponding to the control voice; and sending the second control instruction to the target third-party device, so that the target third-party device executes it.
In an optional design, determining the target third-party device corresponding to the control voice based on the control feature word set of third-party devices comprises:
identifying a target control feature word contained in the control voice based on the control feature word set of third-party devices; and
determining the target third-party device that has a mapping relationship with the target control feature word, according to the mapping relationship between each third-party device and the control feature words in the set.
In an optional design, if no wake-up word is recognized from the wake-up voice, the method further comprises:
receiving a subsequently input dialogue voice and sending it to a third server, so that the third server converts the dialogue voice into dialogue text, analyzes the text to generate a response result, converts the response result into a response voice, and sends the response voice to the electronic device;
and receiving and playing the response voice sent by the third server.
In an optional design, if the third server is preset with a preferentially recognized hotword corresponding to the electronic device, the converting the dialogue speech into the dialogue text by the third server includes:
the third server converts the dialogue speech into dialogue text based on the preferentially recognized hotword.
In an optional design, if the third server is preset with user information corresponding to the electronic device, converting the response result into a response voice comprises:
selecting, according to the user information, a text-to-speech (TTS) engine corresponding to the user information, and converting the response result into the response voice, wherein the user information comprises at least one of age, region, and gender.
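As a rough illustration of this optional design, the sketch below selects a TTS voice from the user-information fields named above (age, region, gender). The engine names and the selection rules are illustrative assumptions, not something specified by the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserInfo:
    age: Optional[int] = None
    region: Optional[str] = None
    gender: Optional[str] = None


def select_tts_engine(user: UserInfo) -> str:
    """Pick a TTS voice matching the user's profile (illustrative rules only)."""
    if user.age is not None and user.age < 12:
        return "child-friendly-voice"         # gentler voice for children
    if user.region is not None:
        return f"{user.region}-accent-voice"  # regional accent preference
    if user.gender is not None:
        return f"{user.gender}-voice"
    return "default-voice"
```

Any priority among the three fields would work; age is checked first here only to show that the fields can be combined.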
In an optional design, after sending the dialogue voice to the third server, the method further comprises:
receiving and playing a prompt voice, sent by the third server, indicating that the dialogue voice contains a sensitive word; the third server sends the prompt voice after detecting that the dialogue voice contains a preset sensitive word.
In a second aspect, the present invention discloses a speech processing apparatus applied to an electronic device, the apparatus comprising:
a receiving module, configured to receive a wake-up voice input;
a processing module, configured to: if a first wake-up word is recognized from the wake-up voice, receive a subsequently input control voice and send it to a first server, so that the first server recognizes and returns a first control instruction of the electronic device corresponding to the control voice; and execute the first control instruction;
the processing module is further configured to: if a second wake-up word is recognized from the wake-up voice, receive a subsequently input control voice and send it to a second server, so that the second server determines a target third-party device corresponding to the control voice based on a control feature word set of third-party devices, and recognizes and returns a second control instruction of the target third-party device corresponding to the control voice; and send the second control instruction to the target third-party device, so that the target third-party device executes it.
In a third aspect, the present invention discloses an electronic device, comprising: a memory, a processor, and a transceiver;
the processor is configured to read the program in the memory and perform the following process: receiving a wake-up voice input; if a first wake-up word is recognized from the wake-up voice, receiving a subsequently input control voice and sending it to a first server through the transceiver, so that the first server recognizes and returns a first control instruction of the electronic device corresponding to the control voice; and executing the first control instruction; if a second wake-up word is recognized from the wake-up voice, receiving a subsequently input control voice and sending it to a second server through the transceiver, so that the second server determines a target third-party device corresponding to the control voice based on a control feature word set of third-party devices, and recognizes and returns a second control instruction of the target third-party device corresponding to the control voice; and sending the second control instruction to the target third-party device, so that the target third-party device executes it.
In an optional design, determining the target third-party device corresponding to the control voice based on the control feature word set of third-party devices comprises:
identifying a target control feature word contained in the control voice based on the control feature word set of third-party devices; and
determining the target third-party device that has a mapping relationship with the target control feature word, according to the mapping relationship between each third-party device and the control feature words in the set.
In an optional design, the processor is further configured to receive a subsequently input dialogue voice and send it to a third server through the transceiver, so that the third server converts the dialogue voice into dialogue text, analyzes the text to generate a response result, converts the response result into a response voice, and sends it to the electronic device; and to receive and play the response voice sent by the third server.
In an optional design, if the third server is preset with a preferentially recognized hotword corresponding to the electronic device, the converting the dialogue speech into the dialogue text by the third server includes:
the third server converts the dialogue speech into dialogue text based on the preferentially recognized hotword.
In an optional design, if the third server is preset with user information corresponding to the electronic device, converting the response result into a response voice comprises:
selecting, according to the user information, a text-to-speech (TTS) engine corresponding to the user information, and converting the response result into the response voice, wherein the user information comprises at least one of age, region, and gender.
In an optional design, the processor is further configured to receive, through the transceiver, and play a prompt voice sent by the third server indicating that the dialogue voice contains a sensitive word; the third server sends the prompt voice after detecting that the dialogue voice contains a preset sensitive word.
In a fourth aspect, the present invention discloses an electronic device, comprising: the system comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus;
the memory has stored therein a computer program which, when executed by the processor, causes the processor to perform the method as set forth in the first aspect or any one of the alternative designs of the first aspect.
In a fifth aspect, the present invention discloses a computer readable storage medium storing a computer program executable by an electronic device, the program, when run on the electronic device, causing the electronic device to perform the method as set forth in the first aspect or any one of the alternative designs of the first aspect.
The invention has the following beneficial effects:
in the embodiment of the invention, the electronic device selects different servers, and thereby different voice-processing business logic, to handle the control voice according to the wake-up word contained in the wake-up voice: when the wake-up voice contains the first wake-up word, the first control instruction of the electronic device corresponding to the control voice is recognized; when it contains the second wake-up word, the target third-party device corresponding to the control voice and the second control instruction for that device are recognized. This ensures the accuracy of voice control and improves the user experience.
Drawings
To illustrate the embodiments of the present invention and the prior-art technical solutions more clearly, the drawings needed in their description are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a speech processing process according to an embodiment of the present invention;
FIG. 2 is a second schematic diagram of a speech processing procedure according to an embodiment of the present invention;
FIG. 3 is a third schematic diagram illustrating a speech processing procedure according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
It should be understood that the terms "first", "second", and the like in the description of the present application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or order.
Example 1:
fig. 1 is a schematic diagram of a speech processing process provided in an embodiment of the present invention, where the process includes:
s101: and receiving awakening voice input, if a first awakening word is recognized from the awakening voice, performing S102, and if a second awakening word is recognized from the awakening voice, performing S103.
The voice processing method provided by the embodiment of the invention is applied to an electronic device. The electronic device may be intelligent hardware with a voice input function, such as a smart speaker or a smartphone.
To accurately identify whether a voice uttered by a user is meant for the electronic device, the electronic device usually requires the user to wake it with a wake-up voice before interaction, for example waking a smart speaker by calling its name. In the embodiment of the invention, to distinguish between the different business logics of the user having a dialogue with the electronic device, controlling the electronic device itself, or controlling other third-party devices through the electronic device, a first wake-up word for controlling the electronic device and a second wake-up word for controlling third-party devices other than the electronic device are preset in the electronic device. Illustratively, the electronic device is a smart speaker, the first wake-up word is "mikugu", and the second wake-up word is "linguo". If the smart speaker receives a wake-up voice containing "mikugu", it determines that the user's subsequent input is a control voice to be executed by the smart speaker itself, and sends the subsequently received control voice to the first server; if the smart speaker receives a wake-up voice containing "linguo", it determines that the subsequent input is a control voice to be executed by a third-party device (such as an air conditioner or a television), and sends the subsequently received control voice to the second server.
Specifically, after receiving the user's wake-up voice input, the electronic device performs ASR processing on the wake-up voice, converts it into wake-up text, and checks whether the converted text contains the first wake-up word or the second wake-up word.
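The wake-word routing of S101 can be sketched as follows. The wake words are the ones from the example above; the server names are placeholders, since the patent does not name the servers.

```python
FIRST_WAKE_WORD = "mikugu"   # controls the electronic device itself
SECOND_WAKE_WORD = "linguo"  # controls third-party devices


def route_wake_text(wake_text: str):
    """Decide which server should process the follow-up control voice."""
    if FIRST_WAKE_WORD in wake_text:
        return "first-server"   # device's own control instructions (S102)
    if SECOND_WAKE_WORD in wake_text:
        return "second-server"  # third-party device control (S103)
    return None                 # no wake word: dialogue flow (Example 2)
```

The returned value stands in for the choice of business logic; a `None` result corresponds to the dialogue branch described in Example 2 below.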
S102: receiving subsequently input control voice, sending the control voice to a first server, and enabling the first server to identify and return a first control instruction of the electronic equipment corresponding to the control voice; and executing the first control instruction.
If the electronic device is woken by a wake-up voice containing the first wake-up word, it enters a sound-pickup mode, continues to receive the control voice subsequently input by the user, and sends the received control voice to the first server. After receiving the control voice, the first server performs ASR processing to convert it into control text, performs the NLP processing corresponding to the electronic device on that text, determines which of the electronic device's first control instructions corresponds to the control text, and sends that first control instruction to the electronic device. The electronic device executes the instruction upon receipt.
Optionally, a first database containing each first control instruction of the electronic device and its corresponding instruction text may be preset in the first server. When performing NLP processing on the control text converted from the control voice, the first server may query the first database for the instruction text with the highest matching degree to the control text, and take the first control instruction corresponding to that instruction text as the first control instruction corresponding to the control voice.
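A minimal sketch of this highest-matching-degree lookup, with `difflib.SequenceMatcher` standing in for the unspecified matching-degree metric; the instruction texts and instruction identifiers are invented for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical first database: instruction text -> first control instruction.
FIRST_DATABASE = {
    "turn up the volume": "VOLUME_UP",
    "pause playback": "PAUSE",
    "play the next song": "NEXT_TRACK",
}


def match_instruction(control_text: str) -> str:
    """Return the instruction whose stored text best matches the control text."""
    def degree(candidate: str) -> float:
        return SequenceMatcher(None, control_text, candidate).ratio()

    best_text = max(FIRST_DATABASE, key=degree)
    return FIRST_DATABASE[best_text]
```

A production system would likely use an intent classifier rather than string similarity, but the query-for-the-best-match shape is the same.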
S103: receiving a control voice input subsequently, sending the control voice to a second server, enabling the second server to determine a target third-party device corresponding to the control voice based on a control feature word set of a third-party device, identifying a second control instruction of the target third-party device corresponding to the control voice, and returning; and sending the second control instruction to the target third-party equipment, so that the target third-party equipment executes the second control instruction.
Determining the target third-party device corresponding to the control voice based on the control feature word set of third-party devices comprises:
identifying a target control feature word contained in the control voice based on the control feature word set of third-party devices; and
determining the target third-party device that has a mapping relationship with the target control feature word, according to the mapping relationship between each third-party device and the control feature words in the set.
If the electronic device is woken by a wake-up voice containing the second wake-up word, it enters the sound-pickup mode, continues to receive the control voice subsequently input by the user, and sends it to the second server. After receiving the control voice, the second server performs ASR processing to convert it into control text, performs the NLP processing corresponding to third-party devices on that text, and determines the target third-party device corresponding to the control text and the second control instruction for controlling it.
Specifically, a second database containing each second control instruction of each third-party device and its corresponding instruction text may be preset in the second server. The third-party-device NLP processing performed by the second server on the control text may be: the second server identifies the target control feature word contained in the control text based on the control feature word set of third-party devices, and determines the target third-party device that has a mapping relationship with that feature word, according to the mapping relationship between each third-party device and the control feature words in the set. The second server may then query, among the instruction texts of the target third-party device's second control instructions stored in the second database, the instruction text with the highest matching degree to the control text, and take the corresponding second control instruction as the second control instruction corresponding to the control voice.
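The second server's two-step resolution can be sketched as below: find the target control feature word in the control text, map it to a third-party device, then match the text against that device's own instruction set. All feature words, device names, instructions, and the crude word-overlap matching are illustrative assumptions.

```python
# Hypothetical control feature word set: feature word -> third-party device.
FEATURE_WORD_TO_DEVICE = {
    "channel": "television",
    "temperature": "air-conditioner",
}

# Hypothetical second database: per-device instruction text -> instruction.
DEVICE_INSTRUCTIONS = {
    "television": {"switch the channel": "TV_SWITCH_CHANNEL"},
    "air-conditioner": {"lower the temperature": "AC_TEMP_DOWN"},
}


def resolve(control_text: str):
    """Return (target device, second control instruction) for a control text."""
    for word, device in FEATURE_WORD_TO_DEVICE.items():
        if word in control_text:  # target control feature word found
            table = DEVICE_INSTRUCTIONS[device]
            # Pick the instruction text sharing the most words with the input.
            best = max(table, key=lambda t: len(set(t.split()) & set(control_text.split())))
            return device, table[best]
    return None, None  # no feature word: cannot determine a target device
```

Restricting the second match to the target device's own instruction table is what makes the device determination improve accuracy: instructions of unrelated devices never compete.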
After determining the target third-party device and the second control instruction, the second server sends them to the electronic device, and the electronic device forwards the second control instruction to the target third-party device so that the target third-party device executes it.
For example, if the control text converted from the control voice is "switch to Central Channel 1", the second server finds the target control feature word "Central Channel 1", which has a mapping relationship with the television, and therefore takes the television as the target third-party device. It then identifies, among the instruction texts of the television's second control instructions, the instruction text with the highest matching degree to "switch to Central Channel 1", and thereby determines the corresponding second control instruction.
In this embodiment of the present invention, the first server and the second server may be the same server. Optionally, if they are the same server, the electronic device sends, together with the control voice, a wake-up word identifier corresponding to the control voice, such as an application identifier (APPID). For example, for a control voice received after recognizing the first wake-up word, the electronic device sends the control voice together with the APPID "Oneself" corresponding to the first wake-up word; for a control voice received after recognizing the second wake-up word, it sends the control voice together with the APPID "Other" corresponding to the second wake-up word. The server can thus select the business logic, for example the NLP processing logic, corresponding to each wake-up word.
Preferably, to let the electronic device identify which device should execute a control instruction, the first server and the second server may send the first and second control instructions to the electronic device as structured data (JSON) containing both the identifier of the device that should execute the instruction and the instruction itself, so that the electronic device can identify the executing device.
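The structured (JSON) reply might look like the following. The field names and the `"self"` sentinel are assumptions; the patent only specifies that the structure carries the executing device's identifier together with the control instruction.

```python
import json

# A server-side reply pairing the instruction with its executing device.
reply = json.dumps({
    "device_id": "television-01",        # device that must execute the command
    "instruction": "TV_SWITCH_CHANNEL",  # the recognized control instruction
})

# The electronic device parses the reply and decides where to dispatch it.
parsed = json.loads(reply)
execute_locally = parsed["device_id"] == "self"  # else forward to that device
```

This single dispatch test is what lets one code path serve both S102 (execute locally) and S103 (forward to the target third-party device).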
As shown in fig. 2, the server selects the NLP with the appropriate processing logic according to the APPID corresponding to the wake-up word; the NLP analyzes the control text converted from the control voice and feeds back the result; and the electronic device dispatches the control instruction according to that result: if the instruction is to be executed by the electronic device itself, the electronic device executes it; otherwise, the electronic device sends it to the corresponding target third-party device for execution.
In the embodiment of the invention, the electronic device selects different servers, and thereby different voice-processing business logic, to handle the control voice according to the wake-up word contained in the wake-up voice: when the wake-up voice contains the first wake-up word, the first control instruction of the electronic device corresponding to the control voice is recognized; when it contains the second wake-up word, the target third-party device corresponding to the control voice and the second control instruction for that device are recognized. This ensures the accuracy of voice control and improves the user experience.
Example 2:
To improve the user experience, on the basis of the above embodiment, if no wake-up word is recognized from the wake-up voice, the method in the embodiment of the present invention further comprises:
receiving a subsequently input dialogue voice and sending it to a third server, so that the third server converts the dialogue voice into dialogue text, analyzes the text to generate a response result, converts the response result into a response voice, and sends the response voice to the electronic device;
and receiving and playing the response voice sent by the third server.
Specifically, if the electronic device does not recognize a wake-up word in the wake-up voice, it indicates that the user wants to have a dialogue with the electronic device. For example, the electronic device is a smart speaker whose first wake-up word is "migu"; if the smart speaker receives a voice input containing neither the first nor the second wake-up word, the user needs a dialogue with the smart speaker.
After being woken by a wake-up voice that contains no wake-up word, the electronic device enters the sound-pickup mode, continues to receive the dialogue voice subsequently input by the user, and sends it to the third server. After receiving the dialogue voice, the third server converts it into dialogue text through ASR processing, performs dialogue-related NLP processing on the text to generate a response result, converts the response result into a response voice, and sends it to the electronic device for playback. For example, the third server converts the dialogue voice into the dialogue text "what is the weather today", looks up the day's weather ("sunny, 8 to 20 °C, light breeze"), converts the search result into voice information, and sends it to the electronic device for playback.
Preferably, in the embodiment of the present invention, the third server may store response pairs associated with the electronic device, where a response pair is a question together with its corresponding response and is preset by the user. When performing dialogue-related NLP processing, the third server preferentially matches the dialogue text against the questions in the response pairs associated with the electronic device, and if the matching degree between the dialogue text and a question is greater than a matching-degree threshold, the response corresponding to that question is directly used as the response result.
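A minimal sketch of this preferential matching follows. The patent does not specify how the matching degree is computed, so a simple string-similarity ratio stands in for it; the names `match_response_pair` and `MATCH_THRESHOLD` are illustrative, not from the patent:

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.8  # assumed matching-degree threshold

def match_response_pair(dialog_text, response_pairs, threshold=MATCH_THRESHOLD):
    """Preferentially match the dialogue text against the user-preset
    question/response pairs; return the stored response when the best
    matching degree exceeds the threshold, otherwise None (so general
    dialogue NLP can run instead)."""
    best_answer, best_score = None, 0.0
    for question, answer in response_pairs.items():
        score = SequenceMatcher(None, dialog_text, question).ratio()
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer if best_score > threshold else None
```

A `None` return signals the server to fall through to ordinary dialogue analysis, matching the "preferentially match, else parse" order the embodiment describes.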
In addition, in order to more accurately convert the dialogue voice into the dialogue text, if the third server is preset with a preferentially-recognized hotword corresponding to the electronic device, the converting the dialogue voice into the dialogue text by the third server includes:
the third server converts the dialogue speech into dialogue text based on the preferentially recognized hotword.
Specifically, the user can establish a connection with the third server through a terminal and bind a certain electronic device through the terminal. The user can also customize hotwords through the terminal and upload them to the third server, where a hotword can be information such as a device alias, the name of a home scene, or a name in an address book. The connection between the terminal and the third server and the binding of an electronic device can be implemented through an APP on the terminal, and are not described repeatedly here.
When converting the dialogue voice into dialogue text, the third server preferentially matches the dialogue voice against the hotwords set for the electronic device during speech-to-text recognition, so as to improve the accuracy of the determined response result.
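One way to realize this hotword preference, sketched under the assumption that the biasing is applied as a post-ASR correction step (a real system might instead bias the recognizer's language model directly): tokens that nearly match an uploaded hotword are snapped to it. The function name and cutoff value are illustrative.

```python
from difflib import get_close_matches

def apply_hotwords(transcript, hotwords, cutoff=0.75):
    """Bias speech-to-text output toward the user's uploaded hotwords
    (device aliases, home-scene names, address-book names) by replacing
    each token that closely resembles a hotword with that hotword."""
    corrected = []
    for token in transcript.split():
        close = get_close_matches(token, hotwords, n=1, cutoff=cutoff)
        corrected.append(close[0] if close else token)
    return " ".join(corrected)
```

With hotwords `["bedroom", "kitchen"]`, a misrecognized token like "bedrom" would be corrected to "bedroom", improving downstream intent matching.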
Example 3:
in order to improve user experience, on the basis of the foregoing embodiments, in an embodiment of the present invention, if user information corresponding to the electronic device is preset in the third server, the converting the response result into a response voice includes:
selecting a Text-To-Speech (TTS) engine corresponding To the user information according To the user information, and converting the response result into response voice, wherein the user information comprises: at least one of age, region, and gender.
Specifically, after the user binds a certain electronic device through the terminal, user information corresponding to the electronic device, such as age, region, and gender, can be set through the terminal. When converting the response result into response voice, the third server selects the TTS engine corresponding to the user information and uses it to perform the conversion, so that the response voice better conforms to the speech habits of the user of the electronic device. For example, taking the region 'Sichuan' as the user information, the third server selects the TTS engine corresponding to 'Sichuan', and that engine converts the response result into response voice using the speech characteristics, such as tone and intonation, associated with 'Sichuan'.
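A sketch of this engine selection, assuming a lookup table keyed on (region, age, gender) with progressively more general fallbacks; the registry contents and engine names are hypothetical stand-ins for whatever TTS engines the third server actually hosts:

```python
# Hypothetical TTS-engine registry; engine names are illustrative.
TTS_ENGINES = {
    ("sichuan", None, None): "tts-sichuan-dialect",
    (None, "child", None): "tts-child-voice",
    (None, None, None): "tts-default",
}

def select_tts_engine(user_info):
    """Select the TTS engine matching the preset user information
    (region, age, gender), falling back from the most specific key
    to a default engine when no attribute matches."""
    region = user_info.get("region")
    age = user_info.get("age")
    gender = user_info.get("gender")
    for key in ((region, age, gender), (region, None, None),
                (None, age, None), (None, None, gender)):
        if key in TTS_ENGINES:
            return TTS_ENGINES[key]
    return TTS_ENGINES[(None, None, None)]
```

The fallback order means any single preset attribute (e.g. region alone) is enough to pick a tailored voice, while an unconfigured device still gets the default engine.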
In order to facilitate the filtering of illegal speech, after the sending the dialogue voice to a third server, the method further includes:
receiving and playing prompt voice containing sensitive words in the dialogue voice sent by the third server; and the prompt voice is sent by the third server after detecting that the dialogue voice contains preset sensitive words.
The third server is provided with a sensitive-word database containing a plurality of preset sensitive words. After converting the dialogue voice into dialogue text, the third server first detects whether the dialogue text contains a sensitive word; if so, it sends the electronic device a prompt voice indicating that the dialogue voice contains a sensitive word and does not carry out the process of analyzing the dialogue text to generate a response result; if not, it proceeds to analyze the dialogue text and generate the response result.
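The check described above can be sketched as a simple substring screen against the database; the word list here is a placeholder for the server's actual sensitive-word database, and the function name is illustrative:

```python
SENSITIVE_WORDS = {"badword1", "badword2"}  # stand-in for the server's database

def screen_dialog_text(dialog_text):
    """Screen the converted dialogue text against the sensitive-word
    database before any NLP analysis. Returns (ok, hit): ok is False
    and hit names the offending word when one is found, in which case
    the server sends a prompt voice instead of generating a response."""
    for word in SENSITIVE_WORDS:
        if word in dialog_text:
            return False, word
    return True, None
```

Running the screen before NLP analysis, as the embodiment specifies, avoids spending parsing work on dialogue that will only be answered with a prompt.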
As shown in fig. 3, the user configures the electronic device through the terminal, sets various hotwords, and uploads them to the third server. The third server performs priority matching against the hotwords set by the user to obtain the dialogue text converted from the dialogue voice, and then identifies whether the dialogue text contains a sensitive word. If it does, a prompt voice indicating the sensitive word is returned to the electronic device. If it does not, the third server judges whether there is a matching response pair: if there is, the corresponding response is converted into response voice and returned to the electronic device; if there is not, dialogue-related NLP processing is performed to generate a response result, which is converted into response voice and returned to the electronic device.
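The server-side order of operations in fig. 3 can be sketched end to end as follows. The helpers (`asr`, `tts`, `nlp_answer`) are stubs standing in for the real ASR, TTS, and NLP services, so only the control flow is meaningful:

```python
def handle_dialog(audio, hotwords, response_pairs, sensitive_words):
    """Fig. 3 flow: hotword-biased ASR -> sensitive-word check ->
    user-preset response-pair match -> general dialogue NLP."""
    text = asr(audio, hotwords)                # hotword-biased speech-to-text
    for word in sensitive_words:               # filter illegal speech first
        if word in text:
            return tts("your speech contains a sensitive word")
    if text in response_pairs:                 # preset response pair wins
        return tts(response_pairs[text])
    return tts(nlp_answer(text))               # otherwise general dialogue NLP

# Stubbed helpers so the sketch runs; real services replace them.
def asr(audio, hotwords): return audio
def tts(text): return f"<speech:{text}>"
def nlp_answer(text): return f"answer to '{text}'"
```

Each branch ends in a TTS call, reflecting that the electronic device always receives voice back, whether it is a prompt, a preset response, or a generated answer.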
Example 4:
fig. 4 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present invention, applied to an electronic device, where the speech processing apparatus includes:
a receiving module 41, configured to receive a wake-up voice input;
the processing module 42 is configured to, if a first wake-up word is recognized from the wake-up voice, receive a subsequently input control voice and send the control voice to a first server, so that the first server recognizes a first control instruction of the electronic device corresponding to the control voice and returns the first control instruction; and execute the first control instruction;
the processing module 42 is further configured to, if a second wake-up word is recognized from the wake-up voice, receive a subsequently input control voice and send the control voice to a second server, so that the second server determines, based on a control feature word set of third-party devices, a target third-party device corresponding to the control voice, recognizes a second control instruction of the target third-party device corresponding to the control voice, and returns the second control instruction; and send the second control instruction to the target third-party device, so that the target third-party device executes the second control instruction.
Optionally, the determining, based on the control feature word set of the third-party device, the target third-party device corresponding to the control voice includes:
identifying target control feature words contained in the control speech based on the control feature word set of the third-party equipment;
and determining target third-party equipment which has a mapping relation with the target control characteristic words according to the mapping relation between each third-party equipment and the control characteristic words in the control characteristic word set of the third-party equipment.
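The two steps above, identifying the target control feature word and then following its mapping relation, can be sketched as a dictionary lookup; the mapping contents are illustrative, since the actual feature words come from each registered third-party device:

```python
# Illustrative mapping between control feature words and third-party devices.
FEATURE_WORD_TO_DEVICE = {
    "air conditioner": "living-room-ac",
    "curtain": "bedroom-curtain",
    "rice cooker": "kitchen-cooker",
}

def find_target_device(control_text):
    """Identify the target control feature word contained in the control
    speech (represented here by its transcript) and return the third-party
    device that has a mapping relation with that feature word."""
    for feature_word, device in FEATURE_WORD_TO_DEVICE.items():
        if feature_word in control_text:
            return device
    return None
```

A `None` result would mean the control voice names no known third-party device, so no second control instruction can be issued.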
The processing module 42 is further configured to receive a subsequently input dialog voice if a wakeup word is not recognized from the wakeup voice, send the dialog voice to a third server, enable the third server to convert the dialog voice into a dialog text, analyze the dialog text to generate a response result, convert the response result into a response voice, and send the response voice to the electronic device; and receiving and playing the response voice sent by the third server.
Optionally, if a preferentially-recognized hotword corresponding to the electronic device is preset in the third server, the converting the dialog speech into a dialog text by the third server includes:
the third server converts the dialogue speech into dialogue text based on the preferentially recognized hotword.
Optionally, if the third server is preset with user information corresponding to the electronic device, the converting the response result into response voice includes:
according to the user information, selecting a text-to-speech TTS engine corresponding to the user information, and converting the response result into response speech, wherein the user information comprises: at least one of age, region, and gender.
The processing module 42 is further configured to receive and play a prompt voice containing a sensitive word in the dialog voice sent by the third server; and the prompt voice is sent by the third server after detecting that the dialogue voice contains preset sensitive words.
Example 5:
As shown in fig. 5, based on the same inventive concept, an embodiment of the present invention further provides an electronic device. Since the principle by which the electronic device solves the problem is similar to that of the foregoing voice processing method, the implementation of the electronic device may refer to the implementation of the method, and repeated details are not described again.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. In fig. 5, the bus architecture may include any number of interconnected buses and bridges that link together various circuits, including one or more processors represented by the processor 51 and memory represented by the memory 53. The bus architecture may also link together various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore not described further herein. The bus interface provides an interface. The transceiver 52 may be a number of elements, including a transmitter and a receiver, providing a means for communicating with various other apparatus over a transmission medium. The processor 51 is responsible for managing the bus architecture and general processing, and the memory 53 may store data used by the processor 51 in performing operations.
In the electronic device provided in the embodiment of the present invention:
the processor 51 is configured to read the program in the memory 53 and execute the following processes: receiving a wake-up voice input; if a first wake-up word is recognized from the wake-up voice, receiving a subsequently input control voice and sending the control voice to a first server through the transceiver 52, so that the first server recognizes a first control instruction of the electronic device corresponding to the control voice and returns the first control instruction; executing the first control instruction; if a second wake-up word is recognized from the wake-up voice, receiving a subsequently input control voice and sending the control voice to a second server through the transceiver 52, so that the second server determines, based on a control feature word set of third-party devices, a target third-party device corresponding to the control voice, recognizes a second control instruction of the target third-party device corresponding to the control voice, and returns the second control instruction; and sending the second control instruction to the target third-party device, so that the target third-party device executes the second control instruction.
Optionally, the determining, based on the control feature word set of the third-party device, the target third-party device corresponding to the control voice includes:
identifying target control feature words contained in the control speech based on the control feature word set of the third-party equipment;
and determining target third-party equipment which has a mapping relation with the target control characteristic words according to the mapping relation between each third-party equipment and the control characteristic words in the control characteristic word set of the third-party equipment.
The processor 51 is further configured to, if a wake-up word is not recognized from the wake-up voice, receive a subsequently input dialogue voice and send the dialogue voice to a third server through the transceiver 52, so that the third server converts the dialogue voice into dialogue text, analyzes the dialogue text to generate a response result, converts the response result into response voice, and sends the response voice to the electronic device; and to receive and play the response voice sent by the third server.
Optionally, if a preferentially-recognized hotword corresponding to the electronic device is preset in the third server, the converting the dialog speech into a dialog text by the third server includes:
the third server converts the dialogue speech into dialogue text based on the preferentially recognized hotword.
Optionally, if the third server is preset with user information corresponding to the electronic device, the converting the response result into response voice includes:
according to the user information, selecting a text-to-speech TTS engine corresponding to the user information, and converting the response result into response speech, wherein the user information comprises: at least one of age, region, and gender.
The processor 51 is further configured to receive, through the transceiver 52, a prompt voice containing a sensitive word in the conversation voice sent by the third server and play the prompt voice; and the prompt voice is sent by the third server after detecting that the dialogue voice contains preset sensitive words.
Example 6:
on the basis of the foregoing embodiments, an embodiment of the present invention further provides an electronic device, as shown in fig. 6, including: the system comprises a processor 61, a communication interface 62, a memory 63 and a communication bus 64, wherein the processor 61, the communication interface 62 and the memory 63 complete mutual communication through the communication bus 64;
the memory 63 stores therein a computer program that, when executed by the processor 61, causes the processor 61 to execute the speech processing method described in the above embodiments.
On the basis of the foregoing embodiments, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program executable by an electronic device is stored; when the program runs on the electronic device, the electronic device is caused to execute the voice processing method described in the foregoing embodiments.
For the system/apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (15)

1. A speech processing method, applied to an electronic device, the method comprising:
receiving a wake-up voice input;
if a first wake-up word is recognized from the wake-up voice, receiving a subsequently input control voice and sending the control voice to a first server, so that the first server recognizes a first control instruction of the electronic device corresponding to the control voice and returns the first control instruction; and executing the first control instruction;
if a second wake-up word is recognized from the wake-up voice, receiving a subsequently input control voice and sending the control voice to a second server, so that the second server determines, based on a control feature word set of third-party devices, a target third-party device corresponding to the control voice, recognizes a second control instruction of the target third-party device corresponding to the control voice, and returns the second control instruction; and sending the second control instruction to the target third-party device, so that the target third-party device executes the second control instruction.
2. The method of claim 1, wherein determining the target third party device to which the control speech corresponds based on a set of control feature words for the third party device comprises:
identifying target control feature words contained in the control speech based on the control feature word set of the third-party equipment;
and determining target third-party equipment which has a mapping relation with the target control characteristic words according to the mapping relation between each third-party equipment and the control characteristic words in the control characteristic word set of the third-party equipment.
3. The method of claim 1, wherein if a wake word is not recognized from the wake speech, the method further comprises:
receiving subsequently input dialogue voice, sending the dialogue voice to a third server, enabling the third server to convert the dialogue voice into a dialogue text, analyzing the dialogue text to generate a response result, converting the response result into response voice, and sending the response voice to the electronic equipment;
and receiving and playing the response voice sent by the third server.
4. The method of claim 3, wherein if the third server has a pre-set prior-recognized hotword corresponding to the electronic device, the third server converting the conversational speech into conversational text comprises:
the third server converts the dialogue speech into dialogue text based on the preferentially recognized hotword.
5. The method of claim 3, wherein if the user information corresponding to the electronic device is preset in the third server, the converting the response result into the response voice comprises:
according to the user information, selecting a text-to-speech TTS engine corresponding to the user information, and converting the response result into response speech, wherein the user information comprises: at least one of age, region, and gender.
6. The method of claim 3, wherein after the transmitting the conversational speech to a third server, the method further comprises:
receiving and playing prompt voice containing sensitive words in the dialogue voice sent by the third server; and the prompt voice is sent by the third server after detecting that the dialogue voice contains preset sensitive words.
7. A speech processing apparatus, applied to an electronic device, the apparatus comprising:
the receiving module is used for receiving awakening voice input;
the processing module is configured to, if a first wake-up word is recognized from the wake-up voice, receive a subsequently input control voice and send the control voice to a first server, so that the first server recognizes a first control instruction of the electronic device corresponding to the control voice and returns the first control instruction; and execute the first control instruction;
the processing module is further configured to receive a subsequently input control voice if a second wake-up word is recognized from the wake-up voice, send the control voice to a second server, enable the second server to determine a target third-party device corresponding to the control voice based on a control feature word set of the third-party device, recognize a second control instruction of the target third-party device corresponding to the control voice, and return the second control instruction; and sending the second control instruction to the target third-party equipment, so that the target third-party equipment executes the second control instruction.
8. An electronic device, comprising: a memory, a processor, and a transceiver;
the processor is configured to read the program in the memory and execute the following processes: receiving a wake-up voice input; if a first wake-up word is recognized from the wake-up voice, receiving a subsequently input control voice and sending the control voice to a first server through the transceiver, so that the first server recognizes a first control instruction of the electronic device corresponding to the control voice and returns the first control instruction; executing the first control instruction; if a second wake-up word is recognized from the wake-up voice, receiving a subsequently input control voice and sending the control voice to a second server through the transceiver, so that the second server determines, based on a control feature word set of third-party devices, a target third-party device corresponding to the control voice, recognizes a second control instruction of the target third-party device corresponding to the control voice, and returns the second control instruction; and sending the second control instruction to the target third-party device, so that the target third-party device executes the second control instruction.
9. The electronic device of claim 8, wherein determining the target third party device to which the control speech corresponds based on a set of control feature words for the third party device comprises:
identifying target control feature words contained in the control speech based on the control feature word set of the third-party equipment;
and determining target third-party equipment which has a mapping relation with the target control characteristic words according to the mapping relation between each third-party equipment and the control characteristic words in the control characteristic word set of the third-party equipment.
10. The electronic device of claim 8, wherein the processor is further configured to receive a subsequently input dialog voice, send the dialog voice to a third server via a transceiver, cause the third server to convert the dialog voice into a dialog text, parse the dialog text to generate a response result, and convert the response result into a response voice to send to the electronic device; and receiving and playing the response voice sent by the third server.
11. The electronic device of claim 10, wherein if the third server has a pre-set prior-recognized hotword corresponding to the electronic device, the third server converting the conversational speech into conversational text comprises:
the third server converts the dialogue speech into dialogue text based on the preferentially recognized hotword.
12. The electronic device according to claim 10, wherein if the user information corresponding to the electronic device is preset in the third server, the converting the response result into the response voice includes:
according to the user information, selecting a text-to-speech TTS engine corresponding to the user information, and converting the response result into response speech, wherein the user information comprises: at least one of age, region, and gender.
13. The electronic device of claim 10, wherein the processor is further configured to receive and play, through the transceiver, a prompt voice containing a sensitive word in the dialog voice sent by the third server; and the prompt voice is sent by the third server after detecting that the dialogue voice contains preset sensitive words.
14. An electronic device, comprising: the system comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus;
the memory has stored therein a computer program which, when executed by the processor, causes the processor to carry out the steps of the method of any one of claims 1-6.
15. A computer-readable storage medium, characterized in that it stores a computer program executable by an electronic device, which program, when run on the electronic device, causes the electronic device to carry out the steps of the method according to any one of claims 1-6.
CN201811465282.7A 2018-12-03 2018-12-03 Voice processing method and device, electronic equipment and storage medium Active CN111261151B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811465282.7A CN111261151B (en) 2018-12-03 2018-12-03 Voice processing method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN111261151A true CN111261151A (en) 2020-06-09
CN111261151B CN111261151B (en) 2022-12-27

Family

ID=70946808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811465282.7A Active CN111261151B (en) 2018-12-03 2018-12-03 Voice processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111261151B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111787461A (en) * 2020-06-30 2020-10-16 歌尔科技有限公司 Intelligent sound equipment, control method and device thereof and computer readable storage medium
CN111933135A (en) * 2020-07-31 2020-11-13 深圳Tcl新技术有限公司 Terminal control method and device, intelligent terminal and computer readable storage medium
CN112634897A (en) * 2020-12-31 2021-04-09 青岛海尔科技有限公司 Equipment awakening method and device, storage medium and electronic device
CN113066493A (en) * 2021-03-30 2021-07-02 联想(北京)有限公司 Equipment control method and system and first electronic equipment
CN113555016A (en) * 2021-06-24 2021-10-26 北京房江湖科技有限公司 Voice interaction method, electronic equipment and readable storage medium
CN114244879A (en) * 2021-12-15 2022-03-25 北京声智科技有限公司 Industrial control system, industrial control method and electronic equipment
CN115294983A (en) * 2022-09-28 2022-11-04 科大讯飞股份有限公司 Autonomous mobile equipment awakening method, system and base station

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130133629A (en) * 2012-05-29 2013-12-09 삼성전자주식회사 Method and apparatus for executing voice command in electronic device
CN107704275A (en) * 2017-09-04 2018-02-16 百度在线网络技术(北京)有限公司 Smart machine awakening method, device, server and smart machine
CN108520743A (en) * 2018-02-02 2018-09-11 百度在线网络技术(北京)有限公司 Sound control method, smart machine and the computer-readable medium of smart machine
CN108831448A (en) * 2018-03-22 2018-11-16 北京小米移动软件有限公司 The method, apparatus and storage medium of voice control smart machine



Also Published As

Publication number Publication date
CN111261151B (en) 2022-12-27

Similar Documents

Publication Publication Date Title
CN111261151B (en) Voice processing method and device, electronic equipment and storage medium
WO2019101083A1 (en) Voice data processing method, voice-based interactive device, and storage medium
US9336773B2 (en) System and method for standardized speech recognition infrastructure
JP2019057273A (en) Method and apparatus for pushing information
CN111081280B (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
US11574637B1 (en) Spoken language understanding models
US11605376B1 (en) Processing orchestration for systems including machine-learned components
CN109712610A (en) The method and apparatus of voice for identification
CN110570855A (en) system, method and device for controlling intelligent household equipment through conversation mechanism
CN111243604B (en) Training method for speaker recognition neural network model supporting multiple awakening words, speaker recognition method and system
CN111081254B (en) Voice recognition method and device
CN112767916A (en) Voice interaction method, device, equipment, medium and product of intelligent voice equipment
CN111178081A (en) Semantic recognition method, server, electronic device and computer storage medium
CN112740323A (en) Voice understanding method and device
CN112579031A (en) Voice interaction method and system and electronic equipment
KR20210001082A (en) Electornic device for processing user utterance and method for operating thereof
CN112837683B (en) Voice service method and device
US11527237B1 (en) User-system dialog expansion
CN110659361B (en) Conversation method, device, equipment and medium
WO2023093280A1 (en) Speech control method and apparatus, electronic device, and storage medium
CN116486815A (en) Vehicle-mounted voice signal processing method and device
KR20200119035A (en) Dialogue system, electronic apparatus and method for controlling the dialogue system
CN115019781A (en) Conversation service execution method, device, storage medium and electronic equipment
CN112306560B (en) Method and apparatus for waking up an electronic device
CN111899718A (en) Method, apparatus, device and medium for recognizing synthesized speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant