CN111161734A - Voice interaction method and device based on designated scene

Info

Publication number
CN111161734A
CN111161734A
Authority
CN
China
Prior art keywords
scene, operation instruction, type, instructions, instruction
Prior art date
Legal status (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed): Withdrawn
Application number
CN201911417773.9A
Other languages
Chinese (zh)
Inventor
叶亚玲
郭鹏亮
刘强
Current Assignee (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list): AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed): 2019-12-31
Filing date: 2019-12-31
Publication date: 2020-05-15
Application filed by AI Speech Ltd
Priority to CN201911417773.9A
Publication of CN111161734A
Legal status: Withdrawn

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue

Abstract

The invention discloses a voice interaction method based on a designated scene, comprising the following steps: performing scene confirmation in response to a received operation instruction corresponding to the user's voice; according to the scene confirmation result, in the designated scene, responding to the user's voice according to the result of matching the received operation instruction against the response instructions configured for the designated scene; and in a non-designated scene, responding to the user's voice according to the type of the operation instruction. The invention also discloses a voice interaction device based on the designated scene. The disclosed method and device prevent other applications from interfering with or interrupting the designated scene, avoid false triggering in non-designated scenes, and allow the various operations of the designated scene to be controlled accurately without a wake-up word. They are particularly suitable for karaoke scenes and improve the user's interaction experience.

Description

Voice interaction method and device based on designated scene
Technical Field
The invention relates to the technical field of voice interaction, and in particular to a voice interaction method and device based on a designated scene.
Background
At present, karaoke scenes on the market are generally controlled only by manual input of karaoke operations such as requesting a song, skipping a song, and adjusting the volume, which is cumbersome and inconvenient. Voice control of karaoke scenes has therefore been proposed in the industry, but existing schemes can issue karaoke instructions only through a voice remote controller or only through a microphone, an instruction can be spoken only after waking the system with a wake-up word, and simultaneous voice control from both the remote controller and the microphone is not supported.
Disclosure of Invention
To solve these problems, the inventors conceived that, during voice interaction, responses should be made preferentially on the basis of the karaoke scene: in a non-karaoke scene, only specific instructions receive a karaoke response, while other instructions are answered according to their respective voice interaction intents. In this way the various operation instructions of the karaoke scene can be controlled accurately by voice, accurate responses can be made without a wake-up word, voice instructions from a voice remote controller and from multiple microphones can be supported simultaneously, people who are singing are not disturbed, and the user experience of voice interaction in the karaoke scene is improved. The same concept can be extended to other situations in which the voice interaction of a specific scene must be controlled accurately to avoid interference.
According to a first aspect of the present invention, there is provided a voice interaction method based on a designated scene, comprising the following steps:
performing scene confirmation in response to a received operation instruction corresponding to the user's voice;
according to the scene confirmation result, in the designated scene, responding to the user's voice according to the result of matching the received operation instruction against the response instructions configured for the designated scene; and in a non-designated scene, responding to the user's voice according to the type of the operation instruction.
According to a second aspect of the present invention, there is provided a voice interaction device based on a designated scene, comprising:
a scene confirmation module, configured to perform scene confirmation in response to a received operation instruction corresponding to the user's voice; and
a response scheduling module, configured to respond to the user's voice according to the scene confirmation result: in the designated scene, according to the result of matching the received operation instruction against the response instructions configured for the designated scene; and in a non-designated scene, according to the type of the operation instruction.
According to a third aspect of the present invention, there is provided an electronic device, comprising: at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the steps of the above method.
According to a fourth aspect of the present invention, there is provided a storage medium on which a computer program is stored which, when executed by a processor, carries out the steps of the above method.
According to the voice interaction scheme provided by the embodiments, the audio instruction input by the user is responded to on the basis of the designated scene: in the designated scene, the instructions configured for that scene are answered preferentially, while in non-designated scenes the response is chosen according to the specific type of the operation instruction. This prevents other applications from interfering with or interrupting the designated scene, avoids false triggering of its operations in non-designated scenes, and allows the various operations of the designated scene to be performed accurately without a wake-up word, improving the user's interaction experience. The scheme is particularly suitable for the karaoke scene.
Drawings
FIG. 1 is a flowchart of a voice interaction method based on a designated scene according to an embodiment of the present invention;
FIG. 2 is a schematic block diagram of a voice interaction device based on a designated scene according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity: hardware, a combination of hardware and software, or software in execution. In particular, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Likewise, an application or script running on a server, or the server itself, may be an element. One or more elements may reside within a process and/or thread of execution, and an element may be localized on one computer and/or distributed between two or more computers and operated through various computer-readable media. Elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., a signal from data interacting with another element in a local system or a distributed system, and/or interacting with other systems across a network such as the Internet.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
The voice interaction method based on a designated scene in the embodiments of the present invention may be applied to any terminal device with an integrated voice function, for example a computer, a smart phone, a tablet computer, a smart home device, or another terminal device with a voice function, which is not limited here. The scheme provided by the invention responds to the user's audio instructions on the basis of the designated scene, so that the designated scene is not disturbed and its operation instructions are not falsely triggered, improving the user's interaction experience. It is particularly suitable for the karaoke scene, where it realizes karaoke voice control that needs no wake-up word and supports multiple input channels.
The present invention will be described in further detail with reference to the accompanying drawings.
Fig. 1 schematically shows the flow of a voice interaction method based on a designated scene according to an embodiment of the present invention. As shown in Fig. 1, the method of this embodiment includes the following steps:
step S101: and responding to the received operation instruction corresponding to the user voice, and performing scene confirmation. In the voice interaction, the voice of the user can be monitored in real time, the audio data corresponding to the voice of the user is collected for recognition and analysis, the operation instruction corresponding to the voice of the user is obtained through semantic analysis, and the contents of the voice recognition and the semantic analysis can be obtained by referring to the prior art. After receiving the analyzed operation instruction, the embodiment of the invention firstly confirms the scene and judges whether the current scene is in the appointed scene. The scene confirmation may be performed by determining whether the currently running application is a specific application, for example, by a package name of the application. Illustratively, taking the specified scene as the K song scene as an example, the package name of the running application can be acquired through the system interface, and whether the package name of the K song application is included is judged, if the package name includes that the K song application is running, it is determined that the K song application is in the K song scene, otherwise, the K song application is regarded as a non-K song scene.
Step S102: the response processing is performed according to the scene confirmation result, and the processing of step S103 is performed in the designated scene, and the processing of step S104 is performed in the non-designated scene. After the scene confirmation is performed, according to the scene confirmation result, if it is determined that the scene is the designated scene, the process of step S103 is performed, otherwise, the process of step S104 is performed.
Step S103: and responding to the voice of the user according to the matching result of the response instruction configured for the specified scene and the received operation instruction. And under the specified scene, responding based on the response instruction of the specified scene. The response instruction of the designated scene is obtained by configuring and storing according to requirements in advance. Illustratively, the designated scene is a Karaoke scene, and the response instructions configured for the Karaoke scene can be a first type of instructions for Karaoke control and a second type of instructions related to music. The first type of instruction comprises operating instructions for Karaoke control such as Karaoke, original singing, cutting, re-singing and a set song, and the second type of instruction is instructions which are related to music and can touch applications of vocal music such as ' song of XXX I ' wanted to sing ', ' XXX I wanted to sing ', ' XXX playing ', and the like. In the scene, the analyzed operation instruction is matched with a response instruction configured in the Karaoke scene, and according to a matching result, if the operation instruction is consistent with a certain response instruction, an action corresponding to the response instruction is executed, otherwise, the response action is not executed, namely, no response is executed. Therefore, under the specified scene, the instruction of other voice scenes cannot be responded, for example, the voice instruction of 'i want to watch live broadcast' cannot be responded, and the interruption of the specified scene such as a Karaoke scene can be avoided. In addition, in a specified scene, the instructions in the related field all respond through the response instructions in the specified scene, for example, in a K song scene, the music listening instruction of 'play XXX' also responds through the K song application, but does not respond through other music applications such as the music of the cool dog, and the controllability of the voice interaction process is ensured.
Step S104: and responding to the voice of the user according to the type of the operation instruction. And under the non-specified scene, responding according to the type of the operation instruction so as to achieve the effect of not carrying out false triggering on the special instruction of the specified scene. Illustratively, taking the specified scene as the song-K scene as an example, in the non-song-K scene, responding to the user voice according to the type of the operation instruction includes not responding when the operation instruction is the first type of instruction. Therefore, the first class of instructions are configured into control instructions related to a specified scene, such as karaoke, original singing, song cutting, singing re-singing, specified songs and other operation instructions for controlling the karaoke in a karaoke scene, and the error triggering of the operation instructions of the specific scene in a voice scene can be avoided without responding to the instructions in a non-karaoke scene. Preferably, the specific special response instructions can be further configured for the specified scene, so that the special response instructions can uniquely correspond to the specified scene, and the response instructions configured for the Karaoke scene also comprise a third type of instructions for specifying Karaoke song-requesting operation, such as expressions of 'I want to sing', and the like, which are related to singing, and can be specified to the Karaoke scene. Correspondingly, responding to the user voice according to the type of the operation instruction can also be realized by entering a Karaoke scene when the operation instruction is a third type instruction, and responding according to a response instruction configured for the Karaoke scene and a matching result of the operation instruction. And starting the application through the package name of the application when the karaoke scene enters, and executing a specific operation instruction under the application so as to respond to the voice of the user. In addition, in order to enable the scheme of the embodiment of the present invention to coordinate the voice interaction process of scheduling multiple voice applications as much as possible, when the operation instruction is of another type, the method may further be implemented to respond by entering a corresponding scene, for example, when the operation instruction is a second type of instruction and not a third type of instruction, the method may respond by opening another similar application, for example, when the operation instruction is "play XXX", the method may be implemented to respond by opening music playing software, such as music of a cool dog, because the operation instruction belongs to the second type of instruction related to music but does not belong to the third type of instruction specified for opening the song K application. Of course, a live application may be opened to respond if there are other types of operational instructions, such as "watch live".
The method thus responds preferentially on the basis of the designated scene's response instructions while in the designated scene, and outside it triggers a response only for the designated special instructions, otherwise either not responding or responding through another scene. This achieves accurate control of the voice interaction process of the designated scene, avoids falsely triggering it, and improves the interaction experience. With this method, voice interaction in the designated scene is started without waking on a wake-up word, keeping the interaction simple.
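Putting the pieces together, the overall flow of steps S101-S104 reduces to a small dispatcher. This ties the three sketches above into one hedged end-to-end illustration; the function and argument names are assumptions.

```python
# Sketch of the overall flow (steps S101-S104), combining the helper
# functions from the previous sketches.
def handle_user_voice(operation_instruction, instruction_type, running_packages):
    # S101: scene confirmation via the running applications' package names.
    if is_in_karaoke_scene(running_packages):  # S102: branch on the result
        respond_in_karaoke_scene(operation_instruction)  # S103
    else:
        respond_outside_karaoke_scene(instruction_type, operation_instruction)  # S104

# Example: "skip_song" spoken while the karaoke app runs skips the song;
# the same words spoken outside the karaoke scene are ignored.
handle_user_voice("skip_song", "karaoke_control", ["com.example.karaoke"])
handle_user_voice("skip_song", "karaoke_control", ["com.android.launcher"])
```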
FIG. 2 schematically shows a voice interaction device based on a designated scene according to one implementation of the invention, which includes:
a scene confirmation module 20, configured to perform scene confirmation in response to a received operation instruction corresponding to the user's voice; and
a response scheduling module 21, configured to respond to the user's voice according to the scene confirmation result: in the designated scene, according to the result of matching the received operation instruction against the response instructions configured for the designated scene; and in a non-designated scene, according to the type of the operation instruction.
Illustratively, the designated scene is the karaoke scene, and the response instructions configured for it comprise a first type of instruction for karaoke control, a second type of instruction related to music, and a third type of instruction for designated karaoke song-requesting operations. The first type may include karaoke control instructions such as accompaniment, original vocal, song skipping, re-singing, and selecting a designated song; the second type may be music-related instructions including "I want to sing a song by XXX", "play XXX", and the like; and the third type may be singing-related karaoke song-requesting instructions such as "I want to sing". For the specific implementation of each module of this device embodiment, refer to the description of the method above; the preferred schemes mentioned there also apply to the device and are not repeated here.
In practice, the user can set the designated scene as required; application of the inventive concept is not limited to the karaoke scene and may target any other designated scene that must be controlled accurately while avoiding false triggering. In this way, not only are accurate control and avoidance of false triggering achieved, but accurate voice control is possible without a wake-up stage, and audio can be collected through multiple recording devices, which is very user-friendly.
In some embodiments, the present invention further provides a computer-readable storage medium in which one or more programs including executable instructions are stored, where the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device) to perform the above voice interaction method based on a designated scene of the present invention.
In some embodiments, the present invention further provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to execute the above voice interaction method based on a designated scene.
In some embodiments, an embodiment of the present invention further provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the above voice interaction method based on a designated scene.
In some embodiments, the present invention further provides a storage medium on which a computer program is stored which, when executed by a processor, executes the above voice interaction method based on a designated scene.
The voice interaction device based on a designated scene in the embodiments of the present invention may be used to execute the voice interaction method based on a designated scene in the embodiments of the present invention, and accordingly achieves the technical effects of that method, which are not repeated here. In the embodiments of the present invention, the relevant functional modules may be implemented by a hardware processor.
Fig. 3 is a schematic hardware structure diagram of an electronic device for performing the voice interaction method based on a designated scene according to another embodiment of the present application. As shown in Fig. 3, the electronic device includes:
one or more processors 510 and a memory 520; one processor 510 is taken as an example in Fig. 3.
The apparatus for performing the voice interaction method based on a designated scene may further include: an input device 530 and an output device 540.
The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or in another manner; connection by a bus is taken as an example in Fig. 3.
The memory 520, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the voice interaction method based on a designated scene in the embodiments of the present application. By running the non-volatile software programs, instructions, and modules stored in the memory 520, the processor 510 executes the various functional applications and data processing of the server, i.e. implements the voice interaction method based on a designated scene of the above method embodiments.
The memory 520 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application required for at least one function, and the data storage area may store data created from the use of the voice interaction device based on a designated scene, and the like. Further, the memory 520 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 520 may optionally include memory remotely located relative to the processor 510, which may be connected to the voice interaction device based on a designated scene via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may receive input numeric or character information and generate signals related to the user settings and function control of the voice interaction device based on a designated scene. The output device 540 may include a display device such as a display screen.
The one or more modules are stored in the memory 520 and, when executed by the one or more processors 510, perform the voice interaction method based on a designated scene in any of the above method embodiments.
The above product can execute the method provided by the embodiments of the present application and has the functional modules and beneficial effects corresponding to executing the method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: these are characterized by mobile communication capability and primarily provide voice and data communication. Such terminals include smart phones (e.g., the iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices, e.g., the iPad.
(3) Portable entertainment devices: these can display and play multimedia content. They include audio and video players (e.g., the iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) Servers: devices providing computing services. A server is similar in architecture to a general-purpose computer, but because highly reliable services must be provided, the requirements on processing capability, stability, reliability, security, scalability, and manageability are higher.
(5) Other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general-purpose hardware platform, and certainly also by hardware. Based on this understanding, the above technical solutions, in essence or in the part contributing to the related art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method of each embodiment or of certain parts of an embodiment.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A voice interaction method based on a designated scene, characterized by comprising the following steps:
performing scene confirmation in response to a received operation instruction corresponding to the user's voice;
according to the scene confirmation result, in the designated scene, responding to the user's voice according to the result of matching the received operation instruction against the response instructions configured for the designated scene; and in a non-designated scene, responding to the user's voice according to the type of the operation instruction.
2. The method according to claim 1, wherein the designated scene is a karaoke scene, and the response instructions configured for the designated scene comprise a first type of instructions for karaoke control and a second type of instructions related to music.
3. The method according to claim 2, wherein the response instructions configured for the designated scene further comprise a third type of instructions for designated karaoke song-requesting operations.
4. The method according to claim 3, wherein responding to the user's voice according to the type of the operation instruction in the non-designated scene comprises:
making no response when the operation instruction is of the first type;
entering the karaoke scene when the operation instruction is of the third type, and responding according to the result of matching the operation instruction against the response instructions configured for the karaoke scene; and
entering the corresponding scene to respond when the operation instruction is of another type.
5. A voice interaction device based on a designated scene, characterized by comprising:
a scene confirmation module, configured to perform scene confirmation in response to a received operation instruction corresponding to the user's voice; and
a response scheduling module, configured to respond to the user's voice according to the scene confirmation result: in the designated scene, according to the result of matching the received operation instruction against the response instructions configured for the designated scene; and in a non-designated scene, according to the type of the operation instruction.
6. The device according to claim 5, wherein the designated scene is a karaoke scene.
7. The device according to claim 6, wherein the response instructions configured for the designated scene comprise a first type of instructions for karaoke control, a second type of instructions related to music, and a third type of instructions for designated karaoke song-requesting operations.
8. The device according to claim 7, wherein the first type of instructions comprises accompaniment, original vocal, song skipping, re-singing, and designated-song instructions.
9. An electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-4.
10. A storage medium on which a computer program is stored which, when executed by a processor, carries out the steps of the method of any one of claims 1-4.
CN201911417773.9A (priority date 2019-12-31, filing date 2019-12-31) Voice interaction method and device based on designated scene - Withdrawn - CN111161734A (en)

Priority Applications (1)

Application Number: CN201911417773.9A | Priority Date: 2019-12-31 | Filing Date: 2019-12-31 | Title: Voice interaction method and device based on designated scene

Applications Claiming Priority (1)

Application Number: CN201911417773.9A | Priority Date: 2019-12-31 | Filing Date: 2019-12-31 | Title: Voice interaction method and device based on designated scene

Publications (1)

Publication Number: CN111161734A | Publication Date: 2020-05-15

Family

ID=70560452

Family Applications (1)

Application Number: CN201911417773.9A | Priority Date: 2019-12-31 | Filing Date: 2019-12-31 | Title: Voice interaction method and device based on designated scene

Country Status (1)

Country: CN | Link: CN111161734A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108428450A (en) * 2017-02-15 2018-08-21 阿里巴巴集团控股有限公司 A kind of operational order processing method and processing device
CN107564518A (en) * 2017-08-21 2018-01-09 百度在线网络技术(北京)有限公司 Smart machine control method, device and computer equipment
CN109285542A (en) * 2018-09-05 2019-01-29 厦门轻唱科技有限公司 Voice interactive method, medium, the apparatus and system of K song system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463106A (en) * 2020-11-12 2021-03-09 深圳Tcl新技术有限公司 Voice interaction method, device and equipment based on intelligent screen and storage medium
WO2023273321A1 (en) * 2021-06-29 2023-01-05 荣耀终端有限公司 Voice control method and electronic device

Similar Documents

Publication Publication Date Title
EP3721605B1 (en) Streaming radio with personalized content integration
US20190304448A1 (en) Audio playback device and voice control method thereof
JP6752870B2 (en) Methods and systems for controlling artificial intelligence devices using multiple wake words
CN107112014A (en) Application foci in voice-based system
US10298640B1 (en) Overlaying personalized content on streaming audio
JP6783339B2 (en) Methods and devices for processing audio
CN105264485A (en) Providing content on multiple devices
CN111142833B (en) Method and system for developing voice interaction product based on contextual model
KR102209092B1 (en) Method and system for controlling artificial intelligence device using plurality wake up word
CN110246499B (en) Voice control method and device for household equipment
CN111063353B (en) Client processing method allowing user-defined voice interactive content and user terminal
CN104035995A (en) Method and device for generating group tags
CN113596508B (en) Virtual gift giving method, device and medium for live broadcasting room and computer equipment
JP6920398B2 (en) Continuous conversation function in artificial intelligence equipment
WO2020135773A1 (en) Data processing method, device, and computer-readable storage medium
CN111161734A (en) Voice interaction method and device based on designated scene
CN109614470A Information reply processing method, device, terminal, and readable storage medium
CN106384586A (en) Method and device for reading text information
US11908471B2 (en) Integrating logic services with a group communication service and a voice assistant service
CN111613232A (en) Voice interaction method and system for multi-terminal equipment
CN112447177B (en) Full duplex voice conversation method and system
CN104317404A (en) Voice-print-control audio playing equipment, control system and method
CN110600021A (en) Outdoor intelligent voice interaction method, device and system
CN115188377A (en) Voice interaction method, electronic device and storage medium
CN111312244B (en) Voice interaction system and method for sand table

Legal Events

    • PB01 - Publication
    • SE01 - Entry into force of request for substantive examination
    • CB02 - Change of applicant information
      Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
      Applicant after: Sipic Technology Co.,Ltd.
      Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
      Applicant before: AI SPEECH Co.,Ltd.
    • WW01 - Invention patent application withdrawn after publication (application publication date: 20200515)