WO2020119542A1 - Voice interaction method, device and system

Info

Publication number
WO2020119542A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
voice data
interaction
configuration information
voice
Prior art date
Application number
PCT/CN2019/122934
Other languages
French (fr)
Chinese (zh)
Inventor
祝俊
袁英灿
王德淼
孟伟
吴逸超
Original Assignee
Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited
Publication of WO2020119542A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 2015/223 Execution procedure of a spoken command
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G06F 3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/4401 Bootstrapping
    • G06F 9/4418 Suspend and resume; Hibernate and awake

Definitions

  • The invention relates to the field of computer technology, and in particular to a voice interaction method, device, and system.
  • The present invention provides a voice interaction method, device, and system, in an effort to solve, or at least alleviate, at least one of the problems described above.
  • According to one aspect, a voice interaction method includes the steps of: sending first voice data input by a user to a server, so that the server confirms an interaction scenario based on the first voice data; acquiring configuration information based on the interaction scenario; and processing second voice data input by the user based on the acquired configuration information and outputting a response.
  • The method according to the present invention further includes the step of presetting configuration information for each interaction scenario, where the configuration information contains at least one piece of target data for use in that interaction scenario.
  • The step of processing the second voice data input by the user based on the configuration information and outputting a response includes: determining whether the second voice data input by the user matches the target data in the configuration information; if it matches, acquiring the state data at the current moment; sending the second voice data and the state data to the server, so that the server recognizes the second voice data according to the state data and returns a response instruction; and outputting a response to the user according to the response instruction.
  • The method according to the present invention further includes steps for receiving third voice data input by the user: detecting whether the third voice data contains a predetermined object; and, if it does, entering the interactive state.
  • The step of sending the first voice data input by the user to the server, so that the server confirms the interaction scenario according to the first voice data, includes: in response to the user inputting the first voice data, acquiring the state data at the current moment; and sending the first voice data and the state data to the server, so that the server performs recognition processing on the first voice data according to the state data and returns a response instruction, where the response instruction also includes the interaction scenario.
  • The step of acquiring configuration information based on the interaction scenario further includes: outputting a response to the user according to the response instruction.
  • The method according to the present invention further includes the steps of: in response to the user's request to switch the interaction scenario, forwarding the request to the server, so that the server confirms the interaction scenario to be switched to; determining whether to close the pre-switch interaction scenario; if the pre-switch interaction scenario is closed, obtaining the configuration information based on the post-switch interaction scenario; and, if the pre-switch interaction scenario is not closed, obtaining the configuration information based on both the pre-switch and post-switch interaction scenarios.
  • According to another aspect, a voice interaction method includes the steps of: determining an interaction scenario based on first voice data input by a user; acquiring configuration information based on the interaction scenario; and processing second voice data input by the user based on the acquired configuration information and outputting a response.
  • According to yet another aspect, a voice interaction device includes: a connection management unit, adapted to receive first voice data input by a user and send it to a server, so that the server confirms the interaction scenario according to the first voice data; an information acquisition unit, adapted to acquire configuration information based on the interaction scenario; and a data processing unit, adapted to process second voice data input by the user based on the acquired configuration information and output a response.
  • The device according to the present invention further includes an information storage unit, adapted to pre-store configuration information for each interaction scenario, where the configuration information contains at least one piece of target data for use in that interaction scenario.
  • The data processing unit further includes a judgment module, adapted to judge whether the second voice data input by the user matches the target data in the configuration information; the information acquisition unit is further adapted to acquire the state data at the current moment when the second voice data matches the target data; the connection management unit is further adapted to send the second voice data and the state data to the server and to receive the response instruction returned after the server performs recognition processing on the second voice data according to the state data; and the connection management unit is further adapted to output a response to the user according to the response instruction.
  • The connection management unit is further adapted to receive third voice data input by the user; the device further includes a detection unit, adapted to detect whether the third voice data input by the user contains a predetermined object and to enter the interactive state when it does.
  • The connection management unit is further adapted to forward the user's request to switch the interaction scenario to the server, so that the server confirms the interaction scenario to be switched to.
  • The information acquisition unit further includes a decision module, adapted to determine whether to close the pre-switch interaction scenario.
  • The information acquisition unit is further adapted to obtain the configuration information based on the post-switch interaction scenario when the pre-switch interaction scenario is closed, and based on both the pre-switch and post-switch interaction scenarios when it is not.
  • According to a further aspect, a voice interaction system includes: a client, including the voice interaction device described above; and a server, adapted to receive voice data and state data from the client and to determine the interaction scenario of the client based on the state data and the voice data.
  • The server is further adapted to perform recognition processing on the voice data according to the state data and to return a response instruction to the client.
  • Optionally, the client is a smart speaker.
  • According to a further aspect, a smart speaker includes: an interface unit, adapted to receive first voice data input by a user; and an interaction control unit, adapted to determine an interaction scenario based on the first voice data, acquire configuration information based on the interaction scenario, and further process second voice data based on the configuration information and output a response.
  • According to a further aspect, a computing device includes: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor and include instructions for performing any of the methods described above.
  • According to a further aspect, a readable storage medium stores program instructions that, when read and executed by a computing device, cause the computing device to perform any of the methods described above.
  • According to the solution of the present invention, when the client receives the first voice data input by the user, it forwards the first voice data to the server, and the server confirms the interaction scenario; the client then obtains the configuration information according to the interaction scenario, and within that scenario, as long as the voice data input by the user matches the target data in the configuration information, the client is awakened directly for voice interaction.
  • This solution can reduce the interaction cost and improve the user experience, as illustrated by the sketch below.
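  • As an illustration only, the claimed flow can be sketched in a few lines of Python; every name below (server.recognize, client.load_configuration, and so on) is a hypothetical placeholder, since the patent does not prescribe any concrete API.

        # Minimal sketch of the claimed flow; all names are hypothetical.
        def voice_interaction_round(client, server, first_voice_data):
            # 1. Forward the first voice data so the server can confirm the scenario.
            response = server.recognize(first_voice_data, client.current_state())
            scenario = response["interaction_scenario"]
            # 2. Load the preset configuration information for that scenario.
            config = client.load_configuration(scenario)
            # 3. Later voice data that matches the target data in the configuration
            #    wakes the client directly, without the wake-up word.
            second_voice_data = client.listen()
            if client.matches_target(second_voice_data, config["target_data"]):
                reply = server.recognize(second_voice_data, client.current_state())
                client.output_response(reply)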
  • FIG. 1 shows a schematic diagram of a scene of a voice interaction system 100 according to an embodiment of the present invention.
  • FIG. 2 shows a schematic diagram of a computing device 200 according to an embodiment of the present invention.
  • FIG. 3 shows an interaction flowchart of a voice interaction method 300 according to an embodiment of the present invention.
  • FIG. 4 shows a schematic flowchart of a voice interaction method 400 according to another embodiment of the present invention.
  • FIG. 5 shows a schematic diagram of a voice interaction device 500 according to an embodiment of the present invention.
  • FIG. 1 shows a schematic diagram of a scene of a voice interaction system 100 according to an embodiment of the present invention.
  • The system 100 includes a client 110 and a server 120.
  • The system 100 shown in FIG. 1 is only an example; those skilled in the art can understand that, in practical applications, the system 100 generally includes multiple clients 110 and servers 120.
  • The present invention does not limit the number of clients 110 and servers 120 included in the system 100.
  • The client 110 is a smart device having a voice interaction device (for example, the voice interaction device 500 according to an embodiment of the present invention), which can receive voice instructions from a user and return voice or non-voice information to the user.
  • A typical voice interaction device includes a voice input unit such as a microphone, a voice output unit such as a speaker, and a processor.
  • The voice interaction device may be built into the client 110, or may cooperate with the client 110 as an independent module (for example, communicating with the client 110 via an API or by other means to call functions or applications on the client 110); the embodiments of the present invention do not limit this.
  • The client 110 may be, for example, a mobile device with a voice interaction device (e.g., a smart speaker), a smart robot, or a smart home appliance (including a smart TV, a smart refrigerator, a smart microwave oven, etc.), but is not limited thereto.
  • One application scenario of the client 110 is the home: the client 110 is placed in the user's home, and the user can issue voice instructions to the client 110 to implement certain functions, such as browsing the Internet, playing songs on demand, shopping, checking the weather forecast, and controlling other smart home devices.
  • The server 120 communicates with the client 110 via a network and may be, for example, a cloud server physically located in one or more locations.
  • The server 120 provides a recognition service for the voice data received by the client 110 to obtain a text representation of the voice data input by the user; the server 120 also derives a representation of the user's intention from the text representation and generates a response instruction to return to the client 110.
  • The client 110 performs corresponding operations according to the response instruction to provide the user with corresponding services, such as setting an alarm clock, making a call, sending an email, broadcasting information, or playing songs and videos.
  • The client 110 may also output a corresponding voice response to the user according to the response instruction; the embodiments of the present invention do not limit this.
  • The microphone of the voice interaction module continuously receives external sounds.
  • When the user wants to use the client 110 for voice interaction, the user first needs to speak the corresponding wake-up word to wake up the client 110 (more precisely, to wake up the voice interaction module in the client 110), so that it enters the interactive state.
  • After the client 110 ends a voice interaction, if the user wants to interact with the client 110 again, the user needs to input the wake-up word again to wake up the client 110.
  • The following exemplarily shows some voice interaction processes, in which the fixed wake-up word is set to "elf".
  • In the system 100, one or more pieces of target data that the user may use in each interaction scenario are preset according to the interaction scenarios in which the user interacts with the client 110 by voice; these pieces of target data constitute the configuration information for each interaction scenario.
  • In other words, the configuration information includes interaction templates corresponding to the various interaction scenarios.
  • In a specific interaction scenario, the user does not need to enter the wake-up word repeatedly to interact with the client 110, as long as the input voice instruction contains the target data of that interaction scenario.
  • In the interaction scenario of listening to songs, the target data may be: "previous song", "next song", "collect this song", "louder", "pause playback", "continue playing", "what is the current song", and so on; these pieces of target data constitute the configuration information corresponding to the song-listening interaction scenario.
  • In a video-playing scenario, the target data may be set to "a little louder", "turn up the volume", and so on, as in the sketch below.
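  • As an illustration, the configuration information might be stored as a simple mapping from interaction scenario to target phrases; only the example phrases come from the text above, while the structure and names below are assumptions.

        # Hypothetical shape of the preset configuration information.
        CONFIGURATION = {
            "music_scene": {
                "target_data": [
                    "previous song", "next song", "collect this song",
                    "louder", "pause playback", "continue playing",
                    "what is the current song",
                ],
            },
            "video_scene": {
                "target_data": ["a little louder", "turn up the volume"],
            },
        }

        def load_configuration(scenario: str) -> dict:
            # Fall back to an empty target list for unknown scenarios.
            return CONFIGURATION.get(scenario, {"target_data": []})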
  • After receiving the voice data input by the user, the client 110 also obtains its state data at the current moment and transmits it to the server 120 together with the voice data.
  • The state data of the client 110 indicates, for example, which application or similar software the user is operating on the client 110.
  • For example, the user may be using an application to play streaming video, or using social software to communicate with a specific contact; the state data is not limited to this.
  • The server 120 can also perform scene analysis based on the state data and the voice data to confirm the interaction scenario that the user expects to enter when inputting the voice data. For example, when the user inputs the voice data "I want to watch a drama" and the server 120 confirms through the state data that music player software is currently in use on the client 110, the server 120 can basically determine that the user expects to enter the video-playing interaction scenario. As another example, if the user inputs the voice data "What is the weather in Hangzhou now", the server 120 can basically confirm that the user expects to enter the weather-forecast interaction scenario.
  • The server 120 returns the confirmed interaction scenario together with the response instruction to the client 110.
  • The client 110 obtains the corresponding configuration information according to the interaction scenario. In this way, within this interaction scenario, the client 110 only needs to determine whether the voice data input by the user matches the target data in the configuration information, and if so, it directly outputs a response.
  • The voice interaction process between the user and the client 110 may thus be optimized: from the user's point of view, the client 110 is always in the interactive state, and the user can directly input voice instructions to make the client 110 execute the corresponding operations.
  • The user does not need to repeat interaction steps that were previously executed (for example, waking up the client 110), which reduces the interaction cost and improves the user experience.
  • The following takes the client 110 implemented as a smart speaker as an example to outline the voice interaction solution according to an embodiment of the present invention.
  • The smart speaker includes an interface unit and an interaction control unit.
  • The interface unit receives the first voice data input by the user; the interaction control unit determines the interaction scenario according to the first voice data and obtains the configuration information based on the interaction scenario; the interaction control unit can also process the second voice data based on the configuration information and output a response.
  • It should be noted that the server 120 may also be implemented as another electronic device (e.g., another computing device in the same IoT environment) connected to the client 110 through the network; when the client 110 has sufficient storage space and computing power, the server 120 can even be implemented as the client 110 itself.
  • FIG. 2 shows a schematic diagram of a computing device 200 according to an embodiment of the invention.
  • In a basic configuration 202, the computing device 200 typically includes a system memory 206 and one or more processors 204.
  • A memory bus 208 may be used for communication between the processor 204 and the system memory 206.
  • The processor 204 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof.
  • The processor 204 may include one or more levels of cache, such as a level-one cache 210 and a level-two cache 212, a processor core 214, and registers 216.
  • The example processor core 214 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof.
  • An example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.
  • The system memory 206 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof.
  • The system memory 206 may include an operating system 220, one or more applications 222, and program data 224.
  • The application 222 may be arranged to be executed on the operating system by the one or more processors 204 using the program data 224.
  • The computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., an output device 242, a peripheral interface 244, and a communication device 246) to the basic configuration 202 via a bus/interface controller 230.
  • The example output device 242 includes a graphics processing unit 248 and an audio processing unit 250, which may be configured to facilitate communication with various external devices such as a display or speakers via one or more A/V ports 252.
  • The example peripheral interface 244 may include a serial interface controller 254 and a parallel interface controller 256, which may be configured to facilitate communication, via one or more I/O ports 258, with input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripheral devices (e.g., printer, scanner, etc.).
  • The example communication device 246 may include a network controller 260, which may be arranged to facilitate communication with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
  • The network communication link may be an example of a communication medium.
  • Communication media can generally be embodied as computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transmission mechanism, and can include any information delivery media.
  • A "modulated data signal" is a signal in which one or more of its characteristics are set or changed in such a way as to encode information in the signal.
  • As non-limiting examples, communication media may include wired media such as a wired network or a dedicated-line network, and various wireless media such as sound, radio frequency (RF), microwave, infrared (IR), or other wireless media.
  • The term "computer-readable media" as used herein may include both storage media and communication media.
  • The computing device 200 may be implemented as a server, such as a file server, a database server, an application server, or a web server, or as a personal computer including desktop and notebook configurations. Of course, the computing device 200 may also be implemented as part of a small-sized portable (or mobile) electronic device. In the embodiments according to the present invention, the computing device 200 is configured to perform the voice interaction method according to the present invention.
  • The application 222 of the computing device 200 includes multiple program instructions for executing the voice interaction method 300 according to the present invention.
  • FIG. 3 shows an interaction flowchart of a voice interaction method 300 according to an embodiment of the present invention.
  • The interaction method 300 is suitable for execution in the system 100 described above.
  • In the embodiments of the present invention, the voice data (or voice instructions) input by the user is divided into third voice data (voice data used to wake up the client 110, generally containing a predetermined object, i.e., a wake-up word), first voice data (voice data input by the user after the client 110 is awakened, generally containing a general instruction), and second voice data (voice data input by the user after the interaction scenario is confirmed, generally containing target data).
  • The method 300 starts at step S301.
  • In step S301, the client 110 receives the third voice data input by the user, detects whether it contains a predetermined object (for example, a predetermined wake-up word), and enters the interactive state if the predetermined object is present.
  • The third voice data is generally used to wake up the client 110 and place it in the interactive state.
  • The predetermined object may be set in advance when the client 110 leaves the factory, or may be set by the user while using the client 110.
  • The present invention does not limit the length or content of the predetermined object.
  • Optionally, when detecting that the third voice data contains the predetermined object, the client 110 responds to the user by playing a voice, for example, "Hello, please speak", to inform the user that the client 110 is already in the interactive state and voice interaction can begin.
  • In step S302, the client 110 receives the first voice data input by the user and, in response to the user input, acquires the state data of the client 110 at the current moment.
  • The state data of the client 110 may include any information available on the client 110.
  • In some embodiments, the state data of the client 110 includes one or more of the following: the client's process data, the client's application list, application usage history data on the client, the user's personal data associated with the client, data obtained from at least one sensor of the client (such as the client's location information, environmental information, etc.), and text data in the client's display interface, but it is not limited thereto. A sketch of such a snapshot follows.
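  • A sketch of what such a state-data snapshot could look like when serialized for the server is given below; every field name and client method is an assumption, since the patent only names the categories of state data.

        import json
        import time

        def collect_state_data(client) -> str:
            # Illustrative snapshot of the state items listed above.
            state = {
                "timestamp": time.time(),
                "running_processes": client.process_list(),   # process data
                "installed_apps": client.app_list(),          # application list
                "usage_history": client.recent_app_usage(),   # usage history data
                "foreground_text": client.screen_text(),      # display-interface text
                "location": client.sensor("location"),        # sensor data
            }
            return json.dumps(state)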
  • In step S303, the client 110 sends the first voice data from the user, together with the local state data, to the server 120.
  • In step S304, the server 120 performs recognition processing on the first voice data according to the received state data.
  • The recognition processing of the first voice data by the server 120 may be divided into two parts.
  • First, the server 120 recognizes the first voice data through ASR (Automatic Speech Recognition) technology.
  • In some embodiments, the server 120 may first represent the first voice data as text data and then perform word segmentation on the text data to obtain a text representation of the first voice data (it should be noted that other representations of the voice data may also be used; it is not limited to a text representation).
  • A typical ASR method may be, for example, a method based on a vocal tract model and speech knowledge, a template matching method, or a method using a neural network.
  • The embodiments of the present invention do not place too many restrictions on which ASR method is used for the speech recognition processing; any known or future algorithm of this type can be combined with the embodiments of the present invention to implement the method 300.
  • Optionally, before recognition, the server 120 may also perform some preprocessing operations on the first voice data, such as sampling, quantization, removing voice data that does not contain speech content (e.g., silent voice data), framing, windowing, and so on.
  • Second, the server 120 processes the text representation in combination with the state data to understand the user's intention, finally obtaining a representation of that intention.
  • In some embodiments, the server 120 may use NLP (Natural Language Processing) methods to understand the first voice data input by the user and finally identify the user's intention; the user's intention often corresponds to an actual operation, such as playing music or viewing contacts.
  • The server 120 may further determine the parameters of the user's intention, such as which song, or which singer's songs, should be played, and so on.
  • The embodiments of the present invention do not place too many restrictions on which NLP algorithm is used to understand the user's intention; any known or future algorithm of this type can be combined with the embodiments of the present invention to implement the method 300.
  • Meanwhile, the server 120 determines the current interaction scenario by analyzing the user's intention.
  • The interaction scenario characterizes the scenario that the client 110 is currently in, or is about to enter according to the user's input.
  • The interaction scenario may be, for example, a call scenario, a short-message scenario, a music scenario, a video scenario, a browsing-information scenario, and so on.
  • The server 120 generates a response instruction after performing the above recognition on the first voice data.
  • The response instruction includes a response to the user's intention and the specific execution parameters.
  • For example, if the first voice data input by the user is "I want to listen to a song", the response instruction generated by the server 120 includes a playback instruction.
  • Optionally, the response instruction may include corresponding text with which to reply to the voice data input by the user; for example, the response instruction contains the text data "OK, about to play for you".
  • The response instruction may also include execution parameters of the playback instruction, such as a playlist, the cover of the song to be played, a download address, and so on, but is not limited thereto.
  • In the embodiments of the present invention, the response instruction also contains the interaction scenario.
  • Continuing the example above, the server 120 determines through processing and analysis that the interaction scenario corresponding to "I want to listen to a song" is the "music scenario"; then, in addition to the above parameters, the response instruction generated by the server 120 also includes the "music scenario". A sketch of such a response instruction follows.
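  • Putting these pieces together, a response instruction for "I want to listen to a song" might be serialized as sketched below; the patent names the contents of the fields but not any concrete wire format, so the layout is an assumption and elided specifics are left elided.

        # Hypothetical serialization of the response instruction described above.
        response_instruction = {
            "intent": "play_music",                     # response to the user's intention
            "reply_text": "OK, about to play for you",  # text to be rendered via TTS
            "interaction_scenario": "music_scene",      # scenario confirmed by the server
            "execution_params": {                       # execution parameters
                "playlist": ["..."],
                "cover_url": "...",
                "download_url": "...",
            },
        }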
  • In step S305, the server 120 returns the above response instruction to the client 110.
  • In step S306, the client 110 acquires configuration information based on the interaction scenario in the received response instruction.
  • As mentioned earlier, the client 110 is preset with configuration information for each interaction scenario.
  • The configuration information contains at least one piece of target data for use in the interaction scenario.
  • For the configuration information, refer to the related description of FIG. 1 above; details are not repeated here.
  • Meanwhile, the client 110 outputs a response to the user according to the relevant instructions and execution parameters in the response instruction.
  • For example, the client 110 converts the text data contained in the response instruction into voice data through TTS technology and replies to the user by voice, "OK, about to play for you"; at the same time, the client 110 executes the playback instruction to play songs for the user.
  • Of course, the client 110 may also download the corresponding songs, covers, and so on according to the execution parameters, which is not repeated here.
  • In step S307, the client 110 receives the second voice data input by the user and determines whether it matches the target data in the configuration information.
  • If they match, the interactive state is entered directly; that is, the user can wake up the client 110 without inputting the predetermined object again.
  • For example, in the music scenario, the user inputs the second voice data "next song"; after the client 110 determines that the second voice data matches the target data of the music scenario, the client 110 directly enters the interactive state.
  • Regarding how the match is determined, the embodiments of the present invention do not impose excessive restrictions: a person skilled in the art may calculate the matching degree between the second voice data and the target data in any way, and determine that the two match when the matching degree is higher than a preset value, as in the sketch below.
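  • One possible way to compute such a matching degree, using only the Python standard library, is sketched below; the 0.8 threshold is an arbitrary assumption standing in for the preset value.

        from difflib import SequenceMatcher

        def matches_target(second_voice_text: str, target_data: list[str],
                           threshold: float = 0.8) -> bool:
            # True if the recognized text is close enough to any preset target
            # phrase; SequenceMatcher.ratio() is just one usable similarity metric.
            return any(
                SequenceMatcher(None, second_voice_text.lower(), t.lower()).ratio() >= threshold
                for t in target_data
            )

        # e.g. matches_target("next song", CONFIGURATION["music_scene"]["target_data"])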
  • The client 110 then acquires the state data at the current moment, as described in step S302; details are not repeated here.
  • In step S308, the client 110 sends the second voice data and the state data to the server 120.
  • In step S309, the server 120 performs recognition processing on the second voice data according to the received state data.
  • The processing of the second voice data is completely consistent with the processing of the first voice data, so for the specific content of the recognition, refer to the related description of step S304; details are not repeated here.
  • In step S310, the server 120 returns a response instruction to the client 110.
  • Subsequently, the client 110 outputs a response to the user according to the response instruction.
  • Within an interaction scenario, steps S307 to S310 may be repeated.
  • The method 300 may also include a process of switching the interaction scenario.
  • In step S311, in response to the user's request to switch the interaction scenario, the client 110 forwards the request to the server 120.
  • In the embodiments of the present invention, the user's request to switch the interaction scenario can be triggered in the following two ways.
  • In the first way, the user again sends third voice data containing the predetermined object to the client 110.
  • The client 110 detects the predetermined object and determines that the user wants to switch the interaction scenario. For example, in the music scenario, the user inputs "Elf, help me check the current weather", which triggers a request to switch the interaction scenario.
  • In the second way, the user switches the display interface of the client 110 so that the client 110 jumps to another application or service.
  • For example, when the display interface of the client 110 is a video playback interface and the user switches it to a picture-shooting interface, a request to switch the interaction scenario is triggered.
  • In step S312, the server 120 confirms the interaction scenario to be switched to, and in the subsequent step S313 returns a response instruction.
  • In one embodiment, the server 120 may analyze the interaction scenario that the user wants to switch to according to the third voice data input by the user. For example, if the user inputs "Elf, help me check the current weather", the server 120 can determine from this that the interaction scenario to be switched to is the weather-query scenario.
  • Optionally, in response to the request to switch the interaction scenario, the client 110 also collects the state data at the current moment and sends it to the server 120 together with the request. In this way, the server 120 can use the state data to perform scene analysis to confirm the interaction scenario to be switched to. For example, when the display interface of the client 110 is switched from the video playback interface to the picture-shooting interface, the server 120 may determine that the interaction scenario to be switched to is a picture-shooting scenario.
  • Of course, the server 120 may also combine the state data and the third voice data input by the user to perform scene analysis to confirm the interaction scenario to be switched to.
  • After confirming the interaction scenario to be switched to, the server 120 generates a corresponding response instruction and returns it to the client 110, which outputs the response to the user, for example by switching to the application the user desires to open.
  • For a description of the response instruction, refer to the previous description; it is not expanded here.
  • In step S314, the client 110 determines whether to close the pre-switch interaction scenario.
  • In one embodiment, the client 110 determines, through the state data, whether to close the pre-switch interaction scenario.
  • For example, the client 110 obtains the process data currently being executed to make the judgment: if the executing process data does not include the process data corresponding to the pre-switch interaction scenario, the previous process has been closed, so the pre-switch interaction scenario is closed; if the executing process data still includes the process data corresponding to the pre-switch interaction scenario, the previous process is still running, so the pre-switch interaction scenario is not closed.
  • If the pre-switch interaction scenario is closed, the configuration information is obtained based on the post-switch interaction scenario.
  • If the pre-switch interaction scenario is not closed, the configuration information is obtained based on both the pre-switch and post-switch interaction scenarios; that is, while the original configuration information is retained, the configuration information corresponding to the post-switch interaction scenario is also acquired. A sketch of this decision logic follows.
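  • The decision logic above can be sketched as follows; the process-name mapping and the reuse of load_configuration from the earlier sketch are assumptions for illustration.

        def configuration_after_switch(old_scenario: str, new_scenario: str,
                                       running_processes: set,
                                       scenario_process: dict) -> list:
            # The pre-switch scenario counts as closed when its process is no
            # longer present in the executing process data (see the judgment above).
            old_closed = scenario_process.get(old_scenario) not in running_processes
            scenarios = [new_scenario] if old_closed else [old_scenario, new_scenario]
            # Keep (or drop) the original configuration accordingly, then add
            # the configuration of the post-switch scenario.
            return [t for s in scenarios
                    for t in load_configuration(s)["target_data"]]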
  • In summary, in the voice interaction solution of the embodiments of the present invention, the client combines local state, user habits, and other information to preset different configuration information for different interaction scenarios, so as to support rapid wake-up of the client in each interaction scenario, that is, to respond directly to the user's voice instruction without a wake-up word (i.e., a predetermined object).
  • When receiving the first voice data input by the user, the client 110 forwards it to the server 120, and the server 120 confirms the interaction scenario; the client 110 then obtains the configuration information according to the interaction scenario, and within this interaction scenario, as long as the voice data input by the user matches the target data in the configuration information, the client 110 is awakened directly for voice interaction.
  • In one aspect, this solution has the advantages of fast response and low cost.
  • In another aspect, the server 120 performs scene analysis based on the state data of the client 110 and closely integrates the recognition of the voice data with the current state of the client 110 and the interaction scenario, which can significantly improve the recognition accuracy.
  • FIG. 4 shows a schematic flowchart of a voice interaction method 400 according to another embodiment of the present invention.
  • The method 400 shown in FIG. 4 is suitable for execution in the client 110 and further describes the method shown in FIG. 3.
  • The method 400 starts at step S410: the first voice data input by the user is sent to the server 120, so that the server 120 confirms the interaction scenario according to the first voice data.
  • In other embodiments, the client 110 may also determine the current interaction scenario locally according to the first voice data input by the user; for example, the client 110 confirms the current interaction scenario according to the current state data (for example, the application currently in use, but not limited to this). The embodiments of the present invention do not limit this.
  • In step S420, configuration information is acquired based on the interaction scenario.
  • In step S430, the second voice data input by the user is processed based on the acquired configuration information and a response is output.
  • In some embodiments, the method 400 further includes the step of setting the configuration information for each interaction scenario in advance.
  • The configuration information contains at least one piece of target data for use in the interaction scenario; through this target data, the client can be awakened directly into the interactive state.
  • The target data in the configuration information can be set in combination with the state of the client itself, the user's input preferences, and so on; it can be preset at the factory or set by the user while using the client 110. The embodiments of the present invention impose no restrictions on this.
  • In some embodiments, the method 400 also includes a process of switching the client 110 from the sleep state to the interactive state according to the third voice data input by the user.
  • In some embodiments, the method 400 also includes the step of reloading the configuration information in response to the user's request to switch the interaction scenario.
  • FIG. 5 shows a schematic diagram of a voice interaction device 500 residing in the client 110 according to an embodiment of the present invention.
  • As shown in FIG. 5, the voice interaction device 500 includes at least: an information storage unit 510, a connection management unit 520, an information acquisition unit 530, and a data processing unit 540.
  • The information storage unit 510 pre-stores the configuration information for each interaction scenario, where the configuration information contains at least one piece of target data for use in that interaction scenario.
  • The connection management unit 520 implements the various input/output operations of the voice interaction device 500, for example, receiving the first voice data input by the user and sending it to the server 120 so that the server 120 confirms the interaction scenario according to the first voice data.
  • The information acquisition unit 530 acquires the configuration information based on the interaction scenario.
  • The data processing unit 540 processes the second voice data input by the user based on the acquired configuration information and outputs a response.
  • In some embodiments, the data processing unit 540 further includes a judgment module 542, adapted to judge whether the second voice data input by the user matches the target data in the configuration information.
  • When they match, the information acquisition unit 530 acquires the state data at the current moment.
  • The connection management unit 520 sends the second voice data and the state data to the server 120 and receives the response instruction returned by the server 120 after performing recognition processing on the second voice data according to the state data. Finally, the connection management unit 520 also outputs a response to the user according to the response instruction.
  • The connection management unit 520 is also used to receive the third voice data input by the user.
  • In some embodiments, the voice interaction device 500 includes a detection unit (not shown) in addition to the above-mentioned parts.
  • The detection unit detects whether the third voice data input by the user contains a predetermined object, and the client 110 enters the interactive state when it does.
  • The connection management unit 520 may also, in response to the user's request to switch the interaction scenario, forward the request to the server 120 so that the server 120 confirms the interaction scenario to be switched to.
  • In some embodiments, the information acquisition unit 530 further includes a decision module 532, which determines whether to close the pre-switch interaction scenario. If it determines that the pre-switch interaction scenario is to be closed, the information acquisition unit 530 obtains the configuration information based on the post-switch interaction scenario; if it determines that the pre-switch interaction scenario is not to be closed, the information acquisition unit 530 obtains the configuration information based on both the pre-switch and post-switch interaction scenarios. A sketch of how these units could be composed follows.
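  • A skeleton showing how the units of the device 500 could be composed is sketched below; the class and method names are invented for illustration and collapse the four units into one object.

        class VoiceInteractionDevice:
            """Sketch of device 500; all names are hypothetical."""

            def __init__(self, server, configuration):
                self.server = server
                self.configuration = configuration  # information storage unit 510
                self.targets = {"target_data": []}

            def on_first_voice_data(self, voice, state):
                # Connection management unit 520: forward data, receive instruction.
                instruction = self.server.recognize(voice, state)
                # Information acquisition unit 530: load the scenario configuration.
                self.targets = self.configuration[instruction["interaction_scenario"]]
                return instruction

            def on_second_voice_data(self, voice, state):
                # Data processing unit 540 / judgment module 542: match, then send
                # (matches_target as in the earlier sketch).
                if matches_target(voice, self.targets["target_data"]):
                    return self.server.recognize(voice, state)
                return None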
  • The various technologies described herein may be implemented in combination with hardware or software, or a combination thereof.
  • Thus, the method and apparatus of the present invention, or certain aspects or parts thereof, may take the form of program code (i.e., instructions) embedded in tangible media such as a removable hard disk, USB flash drive, floppy disk, CD-ROM, or any other machine-readable storage medium, wherein when the program code is loaded into and executed by a machine such as a computer, the machine becomes a device for practicing the invention.
  • In the case where the program code is executed on a programmable computer, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
  • The memory is configured to store the program code; the processor is configured to execute the method of the present invention according to the instructions in the program code stored in the memory.
  • By way of example and not limitation, readable media include readable storage media and communication media.
  • A readable storage medium stores information such as computer-readable instructions, data structures, program modules, or other data.
  • Communication media generally embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of readable media.
  • The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device.
  • Various general-purpose systems may also be used with the examples of the present invention; from the above description, the structure required to construct such systems is obvious.
  • Moreover, the present invention is not directed to any particular programming language. It should be understood that various programming languages can be used to implement the contents of the present invention described herein, and the above descriptions of specific languages are made to disclose the best embodiments of the present invention.
  • Those skilled in the art should understand that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiments, or alternatively may be located in one or more devices different from those in the examples.
  • The modules in the foregoing examples may be combined into one module or divided into multiple sub-modules.
  • The modules in the devices of the embodiments can be adaptively changed and arranged in one or more devices different from those of the embodiments.
  • The modules, units, or components in the embodiments may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • User Interface Of Digital Computer (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice interaction method, device and system. The voice interaction method comprises the following steps: sending first voice data inputted by a user so as to receive an interactive scenario confirmed on the basis of the first voice data (S410); acquiring configuration information on the basis of the interactive scenario (S420); and on the basis of the acquired configuration information, processing second voice data inputted by the user and outputting a response (S430).

Description

Voice interaction method, device and system
This application claims priority to Chinese patent application No. 201811513712.8, filed on December 11, 2018 and entitled "Voice interaction method, device and system", the entire contents of which are incorporated herein by reference.
Technical Field
The invention relates to the field of computer technology, and in particular to a voice interaction method, device, and system.
Background
Over the past decade or so, the Internet has penetrated ever deeper into every area of people's lives, and people can conveniently shop, socialize, entertain themselves, and manage finances through it. At the same time, to improve the user experience, researchers have implemented many interaction schemes, such as text input, gesture input, and voice input. Among them, intelligent voice interaction has become a research hotspot of the new generation of interaction modes due to its convenience of operation.
With the gradual development of voice interaction technology, more and more smart devices have a voice wake-up function. The smart devices currently popular on the market are all configured with a fixed wake-up word; when voice data input by the user is received, the device directly judges whether it matches the preset fixed wake-up word. If the two match, the smart device is switched from the dormant state to the interactive or working state. Thus, every time the user wants to interact with the smart device, the user must first wake it up with the fixed wake-up word and then input a voice instruction. After one voice interaction ends, if the user wants to interact with the smart device again, the user needs to input the fixed wake-up word again to wake it up, and then input a voice instruction.
In this way, before every voice interaction the user needs to input the fixed wake-up word, which undoubtedly increases the number of user operations, thereby increasing the interaction cost and degrading the user's interactive experience. Therefore, an optimized voice interaction solution is needed.
发明内容Summary of the invention
为此,本发明提供了一种语音交互方法、装置及系统,以力图解决或至少缓解上面存在的至少一个问题。To this end, the present invention provides a voice interaction method, device and system, in an effort to solve or at least alleviate at least one of the above problems.
根据本发明的一个方面,提供了一种语音交互方法,包括步骤:将用户输入的第一语音数据发送给服务器,以便服务器根据所述第一语音数据确认交互场景;基于交互场景获取配置信息;以及基于所获取的配置信息对用户输入的第二语音数据进行处理并输 出响应。According to an aspect of the present invention, there is provided a voice interaction method, including the steps of: sending first voice data input by a user to a server, so that the server confirms an interaction scenario based on the first voice data; acquiring configuration information based on the interaction scenario; And processing the second voice data input by the user based on the obtained configuration information and outputting a response.
可选地,根据本发明的方法还包括步骤:预先设置各交互场景下的配置信息,其中,配置信息中包含至少一条用于在该交互场景下使用的目标数据。Optionally, the method according to the present invention further includes the step of presetting configuration information in each interaction scenario, where the configuration information includes at least one piece of target data for use in the interaction scenario.
可选地,在根据本发明的方法中,基于配置信息对用户输入的第二语音数据进行处理并输出响应的步骤包括:判断用户输入的第二语音数据与配置信息中的目标数据是否匹配;若匹配则获取当前时刻的状态数据;将第二语音数据与状态数据发送给服务器,以便服务器根据状态数据对第二语音数据进行识别处理并返回响应指令;以及根据响应指令输出响应给用户。Optionally, in the method according to the present invention, the step of processing the second voice data input by the user based on the configuration information and outputting a response includes: determining whether the second voice data input by the user matches the target data in the configuration information; If it matches, the state data at the current moment is obtained; the second voice data and the state data are sent to the server, so that the server recognizes the second voice data according to the state data and returns a response instruction; and outputs a response to the user according to the response instruction.
可选地,根据本发明的方法还包括接收用户输入的第三语音数据的步骤:检测用户输入的第三语音数据中是否包含预定对象;以及若第三语音数据中包含预定对象,则进入交互状态。Optionally, the method according to the present invention further includes the step of receiving the third voice data input by the user: detecting whether the third voice data input by the user contains a predetermined object; and if the third voice data contains the predetermined object, enter the interaction status.
可选地,在根据本发明的方法中,将用户输入的第一语音数据发送给服务器,以便服务器根据第一语音数据确认交互场景的步骤包括:响应于用户输入第一语音数据,获取当前时刻的状态数据;以及将第一语音数据与状态数据发送给服务器,以便服务器根据状态数据对所述第一语音数据进行识别处理并返回响应指令,其中响应指令中还包括交互场景。Optionally, in the method according to the present invention, the step of sending the first voice data input by the user to the server, so that the server confirms the interaction scenario according to the first voice data includes: in response to the user inputting the first voice data, obtaining the current time State data of the server; and sending the first voice data and the state data to the server, so that the server performs recognition processing on the first voice data according to the state data and returns a response command, where the response command also includes an interactive scene.
可选地,在根据本发明的方法中,基于交互场景获取配置信息的步骤还包括:根据响应指令输出响应给用户。Optionally, in the method according to the present invention, the step of acquiring configuration information based on the interaction scenario further includes: outputting a response to the user according to the response instruction.
可选地,根据本发明的方法还包括步骤:响应于用户切换交互场景的请求,转发请求至服务器,以便服务器确认待切换的交互场景;判断是否关闭切换前的交互场景;若关闭切换前的交互场景,则基于切换后的交互场景得到配置信息;以及若不关闭切换前的交互场景,则基于切换前的交互场景和切换后的交互场景得到配置信息。Optionally, the method according to the present invention further includes the steps of: in response to the user's request to switch the interactive scene, forward the request to the server, so that the server confirms the interactive scene to be switched; determine whether to close the interactive scene before switching; For the interactive scene, the configuration information is obtained based on the interactive scene after switching; and if the interactive scene before switching is not closed, the configuration information is obtained based on the interactive scene before switching and the interactive scene after switching.
According to another aspect of the present invention, a voice interaction method is provided, including the steps of: determining an interaction scene according to first voice data input by a user; acquiring configuration information based on the interaction scene; and processing second voice data input by the user based on the acquired configuration information and outputting a response.
According to yet another aspect of the present invention, a voice interaction apparatus is provided, including: a connection management unit adapted to receive first voice data input by a user and send it to a server, so that the server confirms an interaction scene according to the first voice data; an information acquisition unit adapted to acquire configuration information based on the interaction scene; and a data processing unit adapted to process second voice data input by the user based on the acquired configuration information and output a response.
Optionally, the apparatus according to the present invention further includes an information storage unit adapted to pre-store configuration information for each interaction scene, where the configuration information contains at least one piece of target data for use in that interaction scene.
Optionally, in the apparatus according to the present invention, the data processing unit further includes a judgment module adapted to judge whether the second voice data input by the user matches the target data in the configuration information; the information acquisition unit is further adapted to acquire state data at the current moment when the second voice data matches the target data; the connection management unit is further adapted to send the second voice data and the state data to the server and to receive the response instruction returned by the server after it performs recognition processing on the second voice data according to the state data; and the connection management unit is further adapted to output a response to the user according to the response instruction.
Optionally, in the apparatus according to the present invention, the connection management unit is further adapted to receive third voice data input by the user; the apparatus further includes a detection unit adapted to detect whether the third voice data input by the user contains a predetermined object, and to enter the interactive state when the third voice data contains the predetermined object.
Optionally, in the apparatus according to the present invention, the connection management unit is further adapted to, in response to a user request to switch the interaction scene, forward the request to the server so that the server confirms the interaction scene to be switched to; the information acquisition unit further includes a decision module adapted to determine whether to close the pre-switch interaction scene; and the information acquisition unit is further adapted to obtain configuration information based on the post-switch interaction scene when the pre-switch interaction scene is closed, and to obtain configuration information based on both the pre-switch and post-switch interaction scenes when the pre-switch interaction scene is not closed.
According to still another aspect of the present invention, a voice interaction system is provided, including: a client including the voice interaction apparatus described above; and a server adapted to receive voice data and state data from the client and to determine the client's interaction scene based on the state data and the voice data.
Optionally, in the system according to the present invention, the server is further adapted to perform recognition processing on the voice data according to the state data and return a response instruction to the client.
Optionally, in the system according to the present invention, the client is a smart speaker.
According to still another aspect of the present invention, a smart speaker is provided, including: an interface unit adapted to receive first voice data input by a user; and an interaction control unit adapted to determine an interaction scene according to the first voice data input by the user and acquire configuration information based on the interaction scene, the interaction control unit being further adapted to process second voice data based on the configuration information and output a response.
According to still another aspect of the present invention, a computing device is provided, including: at least one processor; and a memory storing program instructions, where the program instructions are configured to be executed by the at least one processor and include instructions for performing any of the methods described above.
According to still another aspect of the present invention, a readable storage medium storing program instructions is provided; when the program instructions are read and executed by a computing device, they cause the computing device to perform any of the methods described above.
According to the voice interaction method of the present invention, upon receiving first voice data input by a user, the client forwards the first voice data to the server, which confirms the interaction scene; the client then acquires configuration information according to that interaction scene. Within that interaction scene, as long as voice data input by the user matches the target data in the configuration information, the client is woken directly for voice interaction. Compared with existing voice interaction solutions, this solution reduces interaction cost and improves user experience.
The above description is merely an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented according to the contents of this specification, and in order to make the above and other objects, features, and advantages of the present invention more apparent, specific embodiments of the present invention are set forth below.
BRIEF DESCRIPTION OF THE DRAWINGS
To achieve the above and related objects, certain illustrative aspects are described herein in conjunction with the following description and drawings. These aspects indicate various ways in which the principles disclosed herein may be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the drawings. Throughout this disclosure, like reference numerals generally refer to like components or elements.
FIG. 1 shows a schematic diagram of a scenario of a voice interaction system 100 according to an embodiment of the present invention;
FIG. 2 shows a schematic diagram of a computing device 200 according to an embodiment of the present invention;
FIG. 3 shows an interaction flowchart of a voice interaction method 300 according to an embodiment of the present invention;
FIG. 4 shows a schematic flowchart of a voice interaction method 400 according to another embodiment of the present invention; and
FIG. 5 shows a schematic diagram of a voice interaction apparatus 500 according to an embodiment of the present invention.
DETAILED DESCRIPTION
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
FIG. 1 shows a schematic diagram of a scenario of a voice interaction system 100 according to an embodiment of the present invention. As shown in FIG. 1, the system 100 includes a client 110 and a server 120. It should be noted that the system 100 shown in FIG. 1 is only an example; those skilled in the art will understand that in practical applications the system 100 typically includes multiple clients 110 and servers 120, and the present invention places no limit on the number of clients 110 and servers 120 included in the system 100.
The client 110 is a smart device having a voice interaction apparatus (e.g., the voice interaction apparatus 500 according to an embodiment of the present invention), which can receive voice instructions issued by a user and return voice or non-voice information to the user. A typical voice interaction apparatus includes a voice input unit such as a microphone, a voice output unit such as a speaker, and a processor. The voice interaction apparatus may be built into the client 110, or it may be used as an independent module in cooperation with the client 110 (e.g., communicating with the client 110 via an API or by other means to invoke functions or applications on the client 110); embodiments of the present invention place no limit on this. The client 110 may be, for example, a mobile device with a voice interaction apparatus (e.g., a smart speaker), a smart robot, or a smart home appliance (including a smart TV, smart refrigerator, smart microwave oven, etc.), but is not limited thereto. One application scenario for the client 110 is the home: the client 110 is placed in the user's home, and the user can issue voice instructions to the client 110 to accomplish certain functions, such as browsing the Internet, requesting songs, shopping, checking the weather forecast, controlling other smart home devices in the house, and so on.
The server 120 communicates with the client 110 over a network and may be, for example, a cloud server physically located at one or more sites. The server 120 provides recognition services for voice data received at the client 110 to obtain a text representation of the voice data input by the user; the server 120 also derives a representation of the user's intent from the text representation, generates a response instruction, and returns it to the client 110. The client 110 performs the corresponding operation according to the response instruction to provide the user with the corresponding service, such as setting an alarm, making a call, sending an email, broadcasting news, or playing songs and videos. Of course, the client 110 may also output a corresponding voice response to the user according to the response instruction; embodiments of the present invention place no limit on this.
According to some embodiments, in the client 110, the microphone of the voice interaction module continuously receives external sound. When the user wants to use the client 110 for voice interaction, the user must first speak the corresponding wake word to wake the client 110 (more specifically, to wake the voice interaction module in the client 110 by inputting the wake word) so that it enters the interactive state. After the client 110 finishes one voice interaction, if the user wants to interact with the client 110 again, the wake word must be input again to wake the client 110.
Some voice interaction sessions are shown below by way of example. Here, the fixed wake word is set to "Elf".
User: Elf.
Client: I'm here, go ahead.
User: I want to hear **'s songs.
Client: OK, about to play **'s songs for you.
User: Elf.
Client: I'm here, go ahead.
User: Turn the volume to 50.
Client: OK, the volume has been set to 50.
User: Elf, favorite this song.
Client: OK, the song has been added to favorites.
User: Elf, play my favorites.
Client: OK, about to play your favorites.
User: Elf.
Client: I'm here, go ahead.
User: Next song.
Client: OK.
User: Elf, previous song.
Client: OK.
As can be seen from the example above, at every interaction the user must first input the wake word and then the corresponding voice instruction. That is, every time the user wants to instruct the client 110 to perform an operation, the wake word must be input once more. For the user, this mode of interaction is too cumbersome. To lower the interaction cost and reduce the user's repeated input of the wake word, in the system 100 according to the present invention, one or more pieces of target data that the user is likely to use in each interaction scene are preset according to the interaction scenes in which the user conducts voice interaction with the client 110, and these constitute the configuration information for each interaction scene. In other words, the configuration information contains the interaction templates corresponding to the various interaction scenes. According to embodiments of the present invention, in a given interaction scene the user does not need to input the wake word repeatedly to interact with the client 110; it suffices that the input voice instruction contains target data for that interaction scene.
For example, in the song-listening interaction scene shown in the example above, the target data may be: "previous song", "next song", "favorite this song", "louder", "pause playback", "resume playback", "what song is this", and so on; these pieces of target data constitute the configuration information corresponding to the song-listening interaction scene. It should be noted that the above is only an example, and embodiments of the present invention are not limited thereto. In a specific embodiment, a piece of target data may be set to "louder" or to "turn up the volume", and so on.
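To make the notion of per-scene configuration information concrete, the following is a minimal sketch in Python, assuming a plain mapping from scene names to target phrases; the scene names and phrases shown are illustrative stand-ins, not values fixed by this disclosure.

```python
# Minimal sketch of per-scene configuration information: each interaction
# scene maps to the target data (phrases) usable without a wake word.
# Scene names and phrases are illustrative assumptions.
SCENE_CONFIG = {
    "music": [
        "previous song",
        "next song",
        "favorite this song",
        "louder",
        "pause playback",
        "resume playback",
        "what song is this",
    ],
    "weather": [
        "what about tomorrow",
        "will it rain today",
    ],
}

def get_config(scene: str) -> list[str]:
    """Return the target data configured for the given interaction scene."""
    return SCENE_CONFIG.get(scene, [])
```

With this shape, adding a new scene or a new target phrase is a data change rather than a code change, which matches the description of configuration information being preset (at the factory or by the user).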
According to embodiments of the present invention, after receiving voice data input by the user, the client 110 also acquires the state data of the client 110 at the current moment and transmits it to the server 120 together with the voice data. The state data of the client 110 is, for example, the state in which the user is operating some application or similar software on the client 110. For example, the user may be using an application to play streaming video, or the user may be using some social software to chat with a particular person; but the state data is not limited to these.
While generating the response instruction, the server 120 may also perform scene analysis based on the state data and the voice data to confirm the interaction scene that the user expects to enter when inputting the voice data. For example, if the user inputs the voice data "I want to watch a show" and the server 120 confirms from the state data that music player software is currently in use on the client 110, the server 120 can essentially determine that the user expects to enter the video-playing interaction scene. As another example, if the user inputs the voice data "What's the weather like in Hangzhou right now", the server 120 can essentially confirm that the user expects to enter the weather-forecast interaction scene.
The server 120 returns the confirmed interaction scene to the client 110 together with the response instruction. The client 110 acquires the corresponding configuration information according to that interaction scene. In this way, within that interaction scene, the client 110 only needs to judge whether voice data input by the user matches the target data in the configuration information, and if it does, the client outputs a response directly.
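One way to picture the client-side fast path this enables is the sketch below, which reuses the `get_config()` mapping from the earlier sketch; the exact-match test is a deliberate simplification (a similarity-based test is sketched at step S307 further down), and the function name is an assumption.

```python
def matches_active_scene(utterance: str, active_scene: str) -> bool:
    """Return True when the recognized utterance matches a target phrase of
    the currently active scene's configuration, in which case the client may
    skip the wake word and forward the request (with state data) directly."""
    targets = get_config(active_scene)  # per-scene config, sketched above
    return any(utterance.strip().lower() == t.lower() for t in targets)
```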
Again taking the song-listening interaction scene from the example above, in the voice interaction system 100 according to the present invention, the voice interaction between the user and the client 110 can be streamlined to:
User: Elf.
Client: I'm here, go ahead.
User: I want to hear **'s songs.
Client: OK, about to play **'s songs for you.
User: Turn the volume to 50.
Client: OK, the volume has been set to 50.
User: Favorite this song.
Client: OK, the song has been added to favorites.
User: Play my favorites.
Client: OK, about to play your favorites.
User: Next song.
Client: OK.
User: Previous song.
Client: OK.
Comparing this with the interaction shown earlier, it can be seen that with the voice interaction system 100 of the present invention, as long as the current interaction scene has not been switched away from, the client 110 remains in the interactive state, and the user can input voice instructions directly to instruct the client 110 to perform the corresponding operations. In this way, the system 100 does not need to repeat interaction flows that have already been executed (for example, the flow of waking the client 110), which lowers the interaction cost and improves the user experience.
Taking the client 110 implemented as a smart speaker as an example, the voice interaction solution according to an embodiment of the present invention is outlined below.
In addition to its basic configuration, the smart speaker according to an embodiment of the present invention further includes an interface unit and an interaction control unit. The interface unit receives first voice data input by the user; the interaction control unit determines the interaction scene according to the first voice data input by the user and acquires configuration information based on that interaction scene; the interaction control unit can also process second voice data based on the configuration information and output a response.
For a detailed description of the voice interaction process of the smart speaker, refer to the related descriptions of FIG. 3 above and below; it is not repeated here.
It should be noted that in other embodiments according to the present invention, the server 120 may also be implemented as another electronic device connected to the client 110 via a network (e.g., another computing device in the same IoT environment). Indeed, where the client 110 has sufficient storage space and computing power, the server 120 may even be implemented as the client 110 itself.
According to embodiments of the present invention, both the client 110 and the server 120 may be implemented by a computing device 200 as described below. FIG. 2 shows a schematic diagram of a computing device 200 according to an embodiment of the present invention.
As shown in FIG. 2, in the basic configuration 202, the computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communication between the processors 204 and the system memory 206.
Depending on the desired configuration, the processor 204 may be any type of processor, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 204 may include one or more levels of cache, such as a level-1 cache 210 and a level-2 cache 212, a processor core 214, and registers 216. An example processor core 214 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.
Depending on the desired configuration, the system memory 206 may be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. The system memory 206 may include an operating system 220, one or more applications 222, and program data 224. In some embodiments, the applications 222 may be arranged to execute instructions on the operating system by the one or more processors 204 using the program data 224.
The computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., an output device 242, a peripheral interface 244, and a communication device 246) to the basic configuration 202 via a bus/interface controller 230. Example output devices 242 include a graphics processing unit 248 and an audio processing unit 250, which may be configured to facilitate communication with various external devices such as a display or speakers via one or more A/V ports 252. An example peripheral interface 244 may include a serial interface controller 254 and a parallel interface controller 256, which may be configured to facilitate communication via one or more I/O ports 258 with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printers, scanners). An example communication device 246 may include a network controller 260, which may be arranged to facilitate communication with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
A network communication link may be one example of a communication medium. Communication media may typically be embodied as computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable media as used herein may include both storage media and communication media.
The computing device 200 may be implemented as a server, such as a file server, database server, application server, or web server, or as a personal computer including desktop and notebook configurations. Of course, the computing device 200 may also be implemented as part of a small-form-factor portable (or mobile) electronic device. In embodiments according to the present invention, the computing device 200 is configured to perform the voice interaction method according to the present invention, and the applications 222 of the computing device 200 contain a plurality of program instructions for performing the voice interaction method 300 according to the present invention.
FIG. 3 shows an interaction flowchart of a voice interaction method 300 according to an embodiment of the present invention. The interaction method 300 is suitable for execution in the system 100 described above. It should be noted that, for clarity in the description below, the voice data (or voice instructions) input by the user are distinguished as third voice data (voice data used to wake the client 110, generally containing the predetermined object, i.e., the wake word), first voice data (voice data containing a general instruction, input by the user after the client 110 has been woken), and second voice data (voice data input by the user after the interaction scene has been confirmed, generally containing target data). It should be understood, however, that all of these are voice data input by the user, and the present invention is not limited in this regard.
As shown in FIG. 3, the method 300 starts at step S301.
In step S301, the client 110 receives third voice data input by the user and detects whether it contains a predetermined object (e.g., a predetermined wake word); if it contains the predetermined object, the client enters the interactive state.
In embodiments according to the present invention, the third voice data is generally used to wake the client 110 and put it in the interactive state. It should be noted that the predetermined object may be preset when the client 110 leaves the factory, or it may be set by the user while using the client 110; the present invention places no limit on the length or content of the predetermined object.
In one embodiment, when the client 110 detects that the third voice data contains the predetermined object, it responds to the user by playing a voice prompt; for example, the client 110 plays the voice "Hello, please speak" to inform the user that the client 110 is now in the interactive state and voice interaction can begin.
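A minimal sketch of the detection in step S301 might look as follows, assuming the wake word has already been recognized into text; the wake word value is illustrative (the disclosure leaves it to the factory or the user).

```python
WAKE_WORD = "elf"  # the predetermined object; factory-set or user-configured

def contains_wake_word(third_voice_text: str) -> bool:
    """Step S301 sketch: the client enters the interactive state only when
    the third voice data contains the predetermined object (wake word)."""
    return WAKE_WORD in third_voice_text.lower()
```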
Subsequently, in step S302, the client 110 receives first voice data input by the user and, in response to the user input, acquires the state data of the client 110 at the current moment.
The state data of the client 110 may include any obtainable information on the client 110. In some embodiments, the state data of the client 110 includes one or more of the following: the client's process data, the client's application list, application usage history on the client, personal data of the user associated with the client, data obtained from at least one sensor of the client (such as the client's location information or environmental information), and text data in the client's display interface, but is not limited thereto.
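An illustrative snapshot of such state data is shown below; every field is optional, and the key names and example values are assumptions made for illustration only, not terms fixed by the disclosure.

```python
# Illustrative state-data snapshot gathered in step S302.
state_data = {
    "processes": ["music_player", "news_app"],            # process data
    "app_list": ["music_player", "news_app", "camera"],   # application list
    "app_history": [("music_player", "2019-12-01T10:00")],# usage history
    "user_profile": {"preferred_genre": "pop"},           # associated user data
    "location": {"city": "Hangzhou"},                     # sensor reading
    "screen_text": "Now playing: ...",                    # display-interface text
}
```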
Subsequently, in step S303, the client 110 sends the first voice data from the user together with the local state data to the server 120.
Subsequently, in step S304, the server 120 performs recognition processing on the first voice data according to the received state data.
In embodiments according to the present invention, the server 120's recognition processing of the first voice data can be divided into two parts.
First, the server 120 recognizes the first voice data using ASR (Automatic Speech Recognition) technology. The server 120 may first represent the first voice data as text data and then perform word segmentation on the text data to obtain a text representation of the first voice data (it should be noted that the voice data may also be represented in other ways; embodiments of the present invention are not limited to text representations). Typical ASR methods include, for example, methods based on vocal tract models and speech knowledge, template matching methods, and neural network methods; embodiments of the present invention place no particular restriction on which ASR method is used, and any such algorithm, known now or in the future, may be combined with embodiments of the present invention to implement the method 300.
It should be noted that when performing recognition with ASR technology, the server 120 may also perform some preprocessing on the first voice data, such as sampling, quantization, removing voice data that contains no speech content (e.g., silence), and framing and windowing the voice data. These operations are not elaborated further here.
Then, the server 120 processes the text representation in combination with the state data to understand the user's intent and finally obtain a representation of that intent. In some embodiments, the server 120 may use NLP (Natural Language Processing) methods to understand the first voice data input by the user and ultimately identify the user's intent; the user's intent usually corresponds to an actual operation, such as playing music or viewing contacts. In other embodiments, the server 120 may further determine the parameters of the user's intent, such as exactly which song or which singer's songs to play. Embodiments of the present invention place no particular restriction on which NLP algorithm is used to understand user intent, and any such algorithm, known now or in the future, may be combined with embodiments of the present invention to implement the method 300.
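The two-part server-side processing can be sketched as a pipeline, as below. Since the disclosure fixes no particular ASR or NLP algorithm, the sketch takes the two stages as injected callables rather than naming any concrete library; the function and key names are assumptions.

```python
from typing import Callable

def recognize(
    voice_data: bytes,
    state: dict,
    asr: Callable[[bytes], str],        # speech -> segmented text representation
    nlu: Callable[[str, dict], dict],   # text + state -> intent with parameters
) -> dict:
    """Step S304 sketch: ASR first, then intent understanding in combination
    with the client's state data, yielding the intent and the inferred scene."""
    text = asr(voice_data)
    intent = nlu(text, state)
    scene = intent.get("scene", "unknown")  # e.g. "music", "video", "weather"
    return {"intent": intent, "scene": scene}
```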
In embodiments according to the present invention, the server 120 determines the current interaction scene by analyzing the user's intent. The interaction scene characterizes the interaction scene the client 110 is currently in, or (according to the user's input) is about to be in. The interaction scene may be, for example, a phone-call scene, a text-message scene, a music scene, a video scene, or a news-browsing scene.
In embodiments according to the present invention, after performing the above recognition on the first voice data, the server 120 generates a response instruction.
On the one hand, the response instruction contains the response to the user's intent and the specific execution parameters. For example, if the first voice data input by the user is "I want to listen to a song", the response instruction generated by the server 120 contains a play instruction. The response instruction may also contain corresponding text data used to reply to the voice data input by the user; for example, the response instruction contains the text data "OK, about to play for you". In addition, the response instruction may contain execution parameters for the play instruction, such as a playlist, the cover art of the songs to be played, or a download address, without being limited thereto.
On the other hand, the response instruction also contains the interaction scene. For example, if the server 120 determines through its analysis that the interaction scene corresponding to "I want to listen to a song" is the "music scene", then in addition to the above parameters, the response instruction generated by the server 120 also contains "music scene".
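Putting the two aspects together, a response instruction for "I want to listen to a song" might be serialized as follows; the field names and the URL are illustrative assumptions chosen only to mirror the description above.

```python
# Illustrative response instruction combining the reply, execution
# parameters, and the confirmed interaction scene.
response_instruction = {
    "action": "play_music",                          # response to the intent
    "reply_text": "OK, about to play for you",       # text to speak via TTS
    "params": {
        "playlist": ["song-id-1", "song-id-2"],      # execution parameters
        "cover_url": "https://example.com/cover.jpg" # hypothetical address
    },
    "scene": "music",                                # the interaction scene
}
```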
Subsequently, in step S305, the server 120 returns the above response instruction to the client 110.
Then, in step S306, the client 110, on the one hand, acquires configuration information based on the interaction scene in the received response instruction.
As described above, configuration information for each interaction scene is preset on the client 110, and the configuration information contains at least one piece of target data for use in that interaction scene. For a detailed description of the configuration information, refer to the description of FIG. 1 above; it is not repeated here.
On the other hand, the client 110 outputs a response to the user according to the relevant instructions and execution parameters in the response instruction. For example, the client 110 converts the text data contained in the response instruction into voice data using TTS technology and replies to the user by voice, "OK, about to play for you"; at the same time, the client 110 executes the play instruction and plays songs for the user. In still other embodiments, the client 110 may also download the corresponding songs, cover art, etc., according to the execution parameters; these are not enumerated here.
Next, in step S307, the client 110 receives second voice data input by the user and judges whether the second voice data input by the user matches the target data in the configuration information.
According to embodiments of the present invention, if the second voice data matches at least one piece of target data in the configuration information, the client enters the interactive state directly. That is, the client 110 can be woken without the user inputting the predetermined object again. Continuing the example above: in the music scene, the user inputs the second voice data "next song"; after judging that this second voice data matches target data of the music scene, the client 110 enters the interactive state directly.
It should be noted that embodiments of the present invention place no particular restriction on the method used to judge whether the second voice data matches the target data. For example, a person skilled in the art may compute the degree of match between the second voice data and the target data in any manner, and when the degree of match is higher than a preset value, the two are judged to match.
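One possible form of such a match test, over the text representation of the second voice data, is sketched below. The use of `difflib.SequenceMatcher` and the 0.8 threshold are illustrative choices only; the disclosure deliberately leaves the matching method and the preset value open.

```python
from difflib import SequenceMatcher

def matches(utterance: str, target: str, threshold: float = 0.8) -> bool:
    """Step S307 sketch: compute a similarity score between the recognized
    utterance and one target phrase, and treat scores at or above a preset
    threshold as a match."""
    score = SequenceMatcher(None, utterance.lower(), target.lower()).ratio()
    return score >= threshold
```

For example, `matches("next song please", "next song")` scores well above the threshold, so the client would enter the interactive state without a wake word.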
In the interactive state, the client 110 acquires the state data at the current moment as described in step S302; this is not repeated here.
Subsequently, in step S308, the client 110 sends the second voice data and the state data to the server 120.
Next, in step S309, the server 120 recognizes the corresponding second voice data according to the received state data. In embodiments according to the present invention, the processing of the second voice data is identical to the processing of the first voice data, so for the details of recognition, refer to the description of step S304; they are not repeated here.
Subsequently, in step S310, the server 120 returns a response instruction to the client 110, and the client 110 outputs a response to the user according to the response instruction.
Thereafter, as long as the second voice data input by the user matches the target data in the configuration information, i.e., the interaction remains in the current interaction scene, the process of steps S307 to S310 is repeated.
According to some embodiments, the method 300 may also involve switching interaction scenes.
In step S311, in response to a user request to switch the interaction scene, the client 110 forwards the request to the server 120.
In embodiments according to the present invention, a user request to switch the interaction scene may be triggered in the following two ways.
In one embodiment, the user again sends third voice data containing the predetermined object to the client 110. When the client 110 detects the predetermined object, it determines that the user wants to switch the interaction scene. For example, in the music scene, the user input "Elf, check the current weather for me" triggers a request to switch the interaction scene.
In another embodiment, the user switches the display interface of the client 110 so that the client 110 jumps to another application or service. For example, in the video scene, the display interface of the client 110 is the video playback interface; when the user switches the display interface to the picture-taking interface, a request to switch the interaction scene is triggered.
In the subsequent step S312, the server 120 confirms the interaction scene to be switched to, and in the subsequent step S313 it returns a response instruction.
The server 120 may analyze the interaction scene the user wants to switch to according to the third voice data input by the user. For example, from the user input "Elf, check the current weather for me", the server 120 can determine that the interaction scene to be switched to is the weather-query scene.
In addition, in response to the request to switch the interaction scene, the client 110 likewise collects the state data at the current moment and sends it to the server 120 together with the request. In this way, the server 120 can use the state data to perform scene analysis and confirm the interaction scene to be switched to. For example, when the display interface of the client 110 is switched from the video playback interface to the picture-taking interface, the server 120 can determine that the interaction scene to be switched to is the picture-taking scene.
Of course, the server 120 may also combine the state data with the third voice data input by the user to perform scene analysis and confirm the interaction scene to be switched to. For the details of this part, refer to the description of step S304 above; they are not repeated here.
After confirming the interaction scene to be switched to, the server 120 generates a corresponding response instruction for the client 110, which outputs a response to the user, for example switching to the application the user wishes to open. For a description of response instructions, see above; it is not expanded here.
Meanwhile, in step S314, the client 110 determines whether to close the pre-switch interaction scene.
According to embodiments of the present invention, the client 110 uses the state data to determine whether to close the pre-switch interaction scene. The client 110 obtains the data of the currently executing processes and judges as follows: if the executing process data does not include the process data corresponding to the pre-switch interaction scene, the previous process has already been terminated, so the pre-switch interaction scene is closed; if the executing process data still includes the process data corresponding to the pre-switch interaction scene, the previous process is still running, so the pre-switch interaction scene is not closed.
Next, when it is determined that the pre-switch interaction scene should be closed, configuration information is obtained based on the post-switch interaction scene. When it is determined that the pre-switch interaction scene should not be closed, configuration information is obtained based on both the pre-switch and post-switch interaction scenes; that is, the configuration information corresponding to the post-switch interaction scene is acquired while the original configuration information is retained. For the details of acquiring configuration information, refer to the description of step S306 above; they are not repeated here.
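The close-or-keep decision and the resulting configuration load can be sketched as follows, reusing the `get_config()` helper from the earlier sketch; the `scene_process` mapping from scenes to process names is an illustrative assumption, since the disclosure only says the judgment is made from the executing process data.

```python
def targets_after_switch(
    old_scene: str,
    new_scene: str,
    running_processes: list[str],
    scene_process: dict[str, str],  # assumed scene -> backing process name
) -> list[str]:
    """Steps S313/S314 sketch: if the process backing the pre-switch scene is
    no longer running, load only the post-switch scene's configuration;
    otherwise keep both scenes' target data active."""
    if scene_process.get(old_scene) in running_processes:
        return get_config(old_scene) + get_config(new_scene)
    return get_config(new_scene)
```

In the music-plus-news example below, the music player keeps running in the background, so the target data of both scenes would remain usable at once.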
In real application scenarios, users often browse news or chat while listening to music. Imagine a scenario in which the user is playing songs through an audio playback application (i.e., the client 110 is in the music scene). The user inputs the third voice data "Elf, I want to see the latest news", and the client 110 opens a news application on the client 110 according to the response instruction. The display interface of the client 110 then jumps to the news application, but the audio playback application can still play songs in the background. By collecting and analyzing state data, the client 110 finally determines that the pre-switch interaction scene (i.e., the music scene) need not be closed, and therefore obtains configuration information based on both the pre-switch interaction scene and the post-switch interaction scene (i.e., the news-browsing scene). That is, the configuration information for the music scene currently in use on the client 110 is retained, and the configuration information for the news-browsing scene is additionally acquired. Thereafter, the user can use the target data from both sets of configuration information to interact with the client 110 by voice.
Based on the above description, with the voice interaction solution of the present invention, the client combines information such as local state and user habits to preset different configuration information for different interaction scenes, so that in each interaction scene the client can be woken quickly; that is, it can respond directly to a user voice instruction containing no wake word (i.e., no predetermined object).
According to the voice interaction method 300 of the present invention, upon receiving first voice data input by the user, the client 110 forwards the first voice data to the server 120, which confirms the interaction scene; the client 110 then acquires configuration information according to that interaction scene. Within that interaction scene, as long as voice data input by the user matches the target data in the configuration information, the client 110 is woken directly for voice interaction. Compared with existing voice interaction solutions, this solution has the advantages of fast response and low cost. In addition, the server 120 performs scene analysis based on the state data from the client 110, tightly coupling the recognition of voice data with the current state and interaction scene of the client 110, which can significantly improve recognition accuracy.
The execution of the method 300 involves the various components of the system 100. To this end, FIG. 4 shows a schematic flowchart of a voice interaction method 400 according to another embodiment of the present invention. The method 400 shown in FIG. 4 is suitable for execution in the client 110 and further illustrates the method shown in FIG. 3.
As shown in FIG. 4, the method 400 includes step S410: sending first voice data input by the user to the server 120, so that the server 120 confirms the interaction scene according to the first voice data.
It should be noted that, after receiving the first voice data input by the user, the client 110 may also determine the current interaction scene locally from the first voice data. For example, the client 110 confirms the current interaction scene according to the current state data (e.g., the application currently in use, without being limited thereto). Embodiments of the present invention place no limit on this.
Subsequently, in step S420, configuration information is acquired based on the interaction scene. Then, in step S430, second voice data input by the user is processed based on the acquired configuration information and a response is output.
In addition, the method 400 further includes a step of presetting the configuration information for each interaction scene. According to embodiments of the present invention, the configuration information contains at least one piece of target data for use in that interaction scene; with this target data, the client can be woken directly into the interactive state. The target data in the configuration information may be set in light of the client's own state, the user's input preferences, and so on; it may be preset at the factory or set by the user while using the client 110. Embodiments of the present invention place no limit on this.
Of course, before the first voice data is received at the client 110, there is also a process of switching the client 110 from the sleep state to the interactive state according to third voice data input by the user, as well as, in the interactive state, a step of reloading the configuration information in response to a user request to switch the interaction scene. For a description of the entire voice interaction flow, refer to the detailed explanation of the method 300 above; for brevity it is not repeated here.
为配合图3~图4的相关描述进一步说明客户端110,图5示出了根据本发明一个实施例的驻留在客户端110中的语音数据识别装置500的示意图。To further illustrate the client 110 in conjunction with the related descriptions of FIGS. 3 to 4, FIG. 5 shows a schematic diagram of a voice data recognition device 500 residing in the client 110 according to an embodiment of the present invention.
如图5所示,除基本的配置外,语音交互装置500至少包括:信息存储单元510、连接管理单元520、信息获取单元530、数据处理单元540。As shown in FIG. 5, in addition to the basic configuration, the voice interaction device 500 includes at least: an information storage unit 510, a connection management unit 520, an information acquisition unit 530, and a data processing unit 540.
根据一种实施方式,信息存储单元510预先存储各交互场景下的配置信息,其中,配置信息中包含至少一条用于在该交互场景下使用的目标数据。连接管理单元520用于实现语音交互装置500的各种输入/输出操作,例如,接收用户输入的第一语音数据并发送给服务器120,以便服务器120根据第一语音数据确认交互场景。信息获取单元530基于交互场景获取配置信息。数据处理单元540基于所获取的配置信息对用户输入的第二语音数据进行处理并输出响应。According to an embodiment, the information storage unit 510 pre-stores configuration information in each interaction scenario, where the configuration information includes at least one piece of target data for use in the interaction scenario. The connection management unit 520 is used to implement various input/output operations of the voice interaction device 500, for example, receiving the first voice data input by the user and sending it to the server 120, so that the server 120 confirms the interaction scene according to the first voice data. The information acquisition unit 530 acquires configuration information based on the interaction scenario. The data processing unit 540 processes the second voice data input by the user based on the acquired configuration information and outputs a response.
在一些实施例中,数据处理单元540还包括判断模块542,适于判断用户输入的第二语音数据与配置信息中的目标数据是否匹配。在第二语音数据与目标数据相匹配时,信息获取单元530获取当前时刻的状态数据。连接管理单元520将第二语音数据与状态数据发送给服务器120,并接收该服务器120根据状态数据对第二语音数据进行识别处理后返回的响应指令。最后,连接管理单元520还会根据该响应指令输出响应给用户。In some embodiments, the data processing unit 540 further includes a judgment module 542 adapted to judge whether the second voice data input by the user matches the target data in the configuration information. When the second voice data matches the target data, the information acquisition unit 530 acquires the state data at the current moment. The connection management unit 520 sends the second voice data and the status data to the server 120, and receives a response instruction returned by the server 120 after performing recognition processing on the second voice data according to the status data. Finally, the connection management unit 520 also outputs a response to the user according to the response instruction.
当然,连接管理单元520还用于接收用户输入的第三语音数据。Of course, the connection management unit 520 is also used to receive the third voice data input by the user.
语音交互装置500除了上述各部分外,还包括检测单元(未示出)。检测单元检测用户输入的第三语音数据中是否包含预定对象,客户端110在第三语音数据包含预定对象时进入交互状态。The voice interaction device 500 includes a detection unit (not shown) in addition to the above-mentioned parts. The detection unit detects whether the third voice data input by the user includes a predetermined object, and the client 110 enters an interactive state when the third voice data includes the predetermined object.
在又一些实施例中,连接管理单元520还可以响应用户切换交互场景的请求,转发请求给服务器120,以便服务器120确认待切换的交互场景。进一步地,信息获取单元530还包括判决模块532,该判决模块532用于判断是否关闭切换前的交互场景。若经判断后确认要关闭切换前的交互场景,则信息获取单元530基于切换后的交互场景得到配置信息;若经判断后确认不关闭切换前的交互场景,则信息获取单元530基于切换前的交互场景和切换后的交互场景得到配置信息。In still other embodiments, the connection management unit 520 may also respond to the user's request to switch the interactive scene, and forward the request to the server 120, so that the server 120 confirms the interactive scene to be switched. Further, the information acquisition unit 530 further includes a decision module 532, which is used to determine whether to close the interaction scene before the handover. If it is determined after judgment that the interaction scene before switching is to be closed, the information acquisition unit 530 obtains configuration information based on the interaction scene after switching; if it is determined that the interaction scene before switching is not to be closed after judgment, the information acquisition unit 530 is based on pre-switching The configuration information is obtained in the interactive scenario and the switched interactive scenario.
For a detailed description of the operations performed by the various parts of the voice interaction apparatus 500, refer to the related descriptions of FIG. 1, FIG. 3, and FIG. 4 above; they are not repeated here.
The various techniques described herein may be implemented in hardware, in software, or in a combination of the two. Thus, the method and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (that is, instructions) embodied in a tangible medium, such as a removable hard disk, a USB flash drive, a floppy disk, a CD-ROM, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the invention.
Where the program code is executed on programmable computers, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code; the processor is configured to execute the method of the present invention according to the instructions in the program code stored in the memory.
By way of example and not limitation, readable media comprise readable storage media and communication media. Readable storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, the algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the examples of the present invention. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the above descriptions of specific languages are given to disclose the best mode of the invention.
In the specification provided here, numerous specific details are set forth. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid in the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention the various features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiment, or may alternatively be located in one or more devices different from the device in the example. The modules in the foregoing examples may be combined into one module or may be divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components in an embodiment may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all of the features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all of the processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include certain features included in other embodiments but not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments may be used in any combination.
Furthermore, some of the embodiments are described herein as methods, or as combinations of method elements, that can be implemented by a processor of a computer system or by other means of performing the function. Thus, a processor having the necessary instructions for implementing such a method or method element forms a means for implementing the method or method element. Furthermore, the elements of an apparatus embodiment described herein are examples of means for carrying out the function performed by the element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinals "first", "second", "third", and so on to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
Although the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be devised within the scope of the invention as described herein. Moreover, it should be noted that the language used in this specification has been principally selected for readability and instructional purposes, and not to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made herein is illustrative rather than restrictive, the scope of the invention being defined by the appended claims.

Claims (16)

  1. A voice interaction method, comprising the steps of:
    sending first voice data input by a user to a server, so that the server confirms an interaction scenario according to the first voice data;
    acquiring configuration information based on the interaction scenario; and
    processing second voice data input by the user based on the acquired configuration information and outputting a response.
  2. The method of claim 1, further comprising the step of:
    pre-setting configuration information for each interaction scenario,
    wherein the configuration information contains at least one piece of target data to be used in that interaction scenario.
  3. The method of claim 2, wherein the step of processing the second voice data input by the user based on the configuration information and outputting a response comprises:
    determining whether the second voice data input by the user matches the target data in the configuration information;
    if it matches, acquiring state data at the current moment;
    sending the second voice data and the state data to the server, so that the server performs recognition processing on the second voice data according to the state data and returns a response instruction; and
    outputting a response to the user according to the response instruction.
  4. The method of any one of claims 1-3, wherein, before the step of sending the first voice data input by the user to the server so that the server confirms the interaction scenario according to the first voice data, the method further comprises the steps of receiving third voice data input by the user:
    detecting whether the third voice data input by the user contains a predetermined object; and
    entering an interactive state if the third voice data contains the predetermined object.
  5. The method of any one of claims 1-4, wherein the step of sending the first voice data input by the user to the server so that the server confirms the interaction scenario according to the first voice data comprises:
    acquiring state data at the current moment in response to the user inputting the first voice data; and
    sending the first voice data and the state data to the server, so that the server performs recognition processing on the first voice data according to the state data and returns a response instruction,
    wherein the response instruction further includes the interaction scenario.
  6. The method of claim 5, wherein the step of acquiring configuration information based on the interaction scenario further comprises:
    outputting a response to the user according to the response instruction.
  7. The method of any one of claims 1-6, further comprising the steps of:
    in response to a user request to switch the interaction scenario, forwarding the request to the server, so that the server confirms the interaction scenario to be switched to;
    determining whether to close the pre-switch interaction scenario;
    if the pre-switch interaction scenario is closed, obtaining configuration information based on the post-switch interaction scenario; and
    if the pre-switch interaction scenario is not closed, obtaining configuration information based on both the pre-switch interaction scenario and the post-switch interaction scenario.
  8. A voice interaction method, comprising the steps of:
    determining an interaction scenario according to first voice data input by a user;
    acquiring configuration information based on the interaction scenario; and
    processing second voice data input by the user based on the acquired configuration information and outputting a response.
  9. A voice interaction apparatus, comprising:
    a connection management unit adapted to receive first voice data input by a user and send it to a server, so that the server confirms an interaction scenario according to the first voice data;
    an information acquisition unit adapted to acquire configuration information based on the interaction scenario; and
    a data processing unit adapted to process second voice data input by the user based on the acquired configuration information and output a response.
  10. The apparatus of claim 9, further comprising:
    an information storage unit adapted to pre-store configuration information for each interaction scenario, wherein the configuration information contains at least one piece of target data to be used in that interaction scenario.
  11. A voice interaction system, comprising:
    a client comprising the voice interaction apparatus of claim 9 or 10; and
    a server adapted to receive voice data and state data from the client and to determine the interaction scenario of the client based on the state data and the voice data.
  12. The system of claim 11, wherein
    the server is further adapted to perform recognition processing on the voice data according to the state data and to return a response instruction to the client.
  13. The system of claim 11 or 12, wherein the client is a smart speaker.
  14. A smart speaker, comprising:
    an interface unit adapted to receive first voice data input by a user; and
    an interaction control unit adapted to determine an interaction scenario according to the first voice data input by the user and to acquire configuration information based on the interaction scenario, the interaction control unit being further adapted to process second voice data based on the configuration information and to output a response.
  15. A computing device, comprising:
    at least one processor; and
    a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor, the program instructions comprising instructions for performing the method of any one of claims 1-8.
  16. A readable storage medium storing program instructions which, when read and executed by a computing device, cause the computing device to perform the method of any one of claims 1-8.
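Claims 8, 11 and 12 describe the same exchange from the server side. The following sketch pairs with the client sketch given in the description above; the keyword table and the canned response instruction are placeholders for real scene classification and speech recognition, not the patent's required server behavior.

class InteractionServer:
    """Sketch of server 120: maps first voice data to a scenario and
    recognizes later utterances with the help of client state data."""

    SCENE_KEYWORDS = {"music": "music", "navigation": "navigate"}  # assumed

    def confirm_scene(self, first_voice_data):
        # Determine the interaction scenario from the first voice data.
        for scene, keyword in self.SCENE_KEYWORDS.items():
            if keyword in first_voice_data.lower():
                return scene
        return "default"

    def recognize(self, voice_data, state_data):
        # Recognition processing conditioned on the client's state data;
        # the returned dict stands in for the response instruction.
        return {
            "speech": f"Recognized '{voice_data}' in scene {state_data['scene']}",
            "scene": state_data["scene"],
        }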
PCT/CN2019/122934 2018-12-11 2019-12-04 Voice interaction method, device and system WO2020119542A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811513712.8A CN111312235B (en) 2018-12-11 2018-12-11 Voice interaction method, device and system
CN201811513712.8 2018-12-11

Publications (1)

Publication Number Publication Date
WO2020119542A1 true WO2020119542A1 (en) 2020-06-18

Family

ID=71075824

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/122934 WO2020119542A1 (en) 2018-12-11 2019-12-04 Voice interaction method, device and system

Country Status (3)

Country Link
CN (1) CN111312235B (en)
TW (1) TW202025138A (en)
WO (1) WO2020119542A1 (en)


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210147027A (en) * 2019-05-06 2021-12-06 구글 엘엘씨 Proactive caching of assistant action content on client devices to enable on-device resolution of speech or input utterances
CN112084768A (en) * 2020-08-06 2020-12-15 珠海格力电器股份有限公司 Multi-round interaction method and device and storage medium
CN112104533B (en) * 2020-09-14 2023-02-17 深圳Tcl数字技术有限公司 Scene switching method, terminal and storage medium
CN112397061B (en) * 2020-11-04 2023-10-27 中国平安人寿保险股份有限公司 Online interaction method, device, equipment and storage medium
CN112463106A (en) * 2020-11-12 2021-03-09 深圳Tcl新技术有限公司 Voice interaction method, device and equipment based on intelligent screen and storage medium
CN113411459B (en) * 2021-06-10 2022-11-11 品令科技(北京)有限公司 Remote voice interaction system and method controlled by initiator
CN114356275B (en) * 2021-12-06 2023-12-29 上海小度技术有限公司 Interactive control method and device, intelligent voice equipment and storage medium


Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4537755B2 (en) * 2004-04-30 2010-09-08 株式会社日立製作所 Spoken dialogue system
CN101453688B (en) * 2007-12-04 2010-07-14 中兴通讯股份有限公司 Method for fast responding scene switching in mobile stream media service
CN103413549B (en) * 2013-07-31 2016-07-06 深圳创维-Rgb电子有限公司 The method of interactive voice, system and interactive terminal
CN104240698A (en) * 2014-09-24 2014-12-24 上海伯释信息科技有限公司 Voice recognition method
CN105719649B (en) * 2016-01-19 2019-07-05 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN107507615A (en) * 2017-08-29 2017-12-22 百度在线网络技术(北京)有限公司 Interface intelligent interaction control method, device, system and storage medium
CN108091333B (en) * 2017-12-28 2021-11-30 Oppo广东移动通信有限公司 Voice control method and related product
CN108521500A (en) * 2018-03-13 2018-09-11 努比亚技术有限公司 A kind of voice scenery control method, equipment and computer readable storage medium
CN108597509A (en) * 2018-03-30 2018-09-28 百度在线网络技术(北京)有限公司 Intelligent sound interacts implementation method, device, computer equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145329A (en) * 2017-04-10 2017-09-08 北京猎户星空科技有限公司 Apparatus control method, device and smart machine
CN107977183A (en) * 2017-11-16 2018-05-01 百度在线网络技术(北京)有限公司 voice interactive method, device and equipment
CN108337362A (en) * 2017-12-26 2018-07-27 百度在线网络技术(北京)有限公司 Voice interactive method, device, equipment and storage medium
CN108509619A (en) * 2018-04-04 2018-09-07 科大讯飞股份有限公司 A kind of voice interactive method and equipment
CN108538298A (en) * 2018-04-04 2018-09-14 科大讯飞股份有限公司 voice awakening method and device
CN108874967A (en) * 2018-06-07 2018-11-23 腾讯科技(深圳)有限公司 Dialogue state determines method and device, conversational system, terminal, storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112017667A (en) * 2020-09-04 2020-12-01 华人运通(上海)云计算科技有限公司 Voice interaction method, vehicle and computer storage medium
CN112017667B (en) * 2020-09-04 2024-03-15 华人运通(上海)云计算科技有限公司 Voice interaction method, vehicle and computer storage medium
CN113370923A (en) * 2021-07-23 2021-09-10 深圳市元征科技股份有限公司 Vehicle configuration adjusting method and device, electronic equipment and storage medium
CN113370923B (en) * 2021-07-23 2023-11-03 深圳市元征科技股份有限公司 Vehicle configuration adjusting method and device, electronic equipment and storage medium
CN113838464A (en) * 2021-09-24 2021-12-24 浪潮金融信息技术有限公司 Intelligent voice interaction system, method and medium
CN114884974A (en) * 2022-04-08 2022-08-09 海南车智易通信息技术有限公司 Data multiplexing method, system and computing equipment
CN114884974B (en) * 2022-04-08 2024-02-23 海南车智易通信息技术有限公司 Data multiplexing method, system and computing device

Also Published As

Publication number Publication date
TW202025138A (en) 2020-07-01
CN111312235A (en) 2020-06-19
CN111312235B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
WO2020119542A1 (en) Voice interaction method, device and system
CN110634483B (en) Man-machine interaction method and device, electronic equipment and storage medium
TWI511125B Voice control method, mobile terminal apparatus and voice control system
EP3734596B1 (en) Determining target device based on speech input of user and controlling target device
KR102429436B1 (en) Server for seleting a target device according to a voice input, and controlling the selected target device, and method for operating the same
WO2020119541A1 (en) Voice data identification method, apparatus and system
WO2020119569A1 (en) Voice interaction method, device and system
WO2020244573A1 (en) Voice instruction processing method and device, and control system
EP2919472A1 (en) Display apparatus, method for controlling display apparatus, and interactive system
US20150222948A1 (en) Multimedia Device Voice Control System and Method, and Computer Storage Medium
CN109309751B (en) Voice recording method, electronic device and storage medium
CN110740262A (en) Background music adding method and device and electronic equipment
JP6619488B2 (en) Continuous conversation function in artificial intelligence equipment
CN109410950B (en) Voice control method and system of cooking equipment
JP6449991B2 (en) Media file processing method and terminal
CN112702633A (en) Multimedia intelligent playing method and device, playing equipment and storage medium
CN112233676A (en) Intelligent device awakening method and device, electronic device and storage medium
WO2021051588A1 (en) Data processing method and apparatus, and apparatus used for data processing
CN108881766B (en) Video processing method, device, terminal and storage medium
US10693944B1 (en) Media-player initialization optimization
CN109658924B (en) Session message processing method and device and intelligent equipment
KR102584324B1 (en) Method for providing of voice recognition service and apparatus thereof
CN111862965A (en) Awakening processing method and device, intelligent sound box and electronic equipment
WO2019154282A1 (en) Household appliance and voice recognition method, control method and control device thereof
JP2019053774A (en) Method and terminal for processing media file

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19895194

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19895194

Country of ref document: EP

Kind code of ref document: A1