CN110767220A

CN110767220A - Interaction method, device, equipment and storage medium of intelligent voice assistant

Info

Publication number: CN110767220A
Application number: CN201910984654.5A
Authority: CN
Inventors: 李卓卿; 龙振海
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-10-16
Filing date: 2019-10-16
Publication date: 2020-02-07
Anticipated expiration: 2039-10-16

Abstract

The invention provides an interaction method, apparatus, device, and storage medium for an intelligent voice assistant, relating to artificial intelligence technology. The method comprises: acquiring resources of an avatar and configuration information corresponding to the resources from a server, and extracting from the resources the model resources of the avatar corresponding to the intelligent voice assistant; presenting the avatar of the intelligent voice assistant according to the model resources and the resources indicated by the default configuration items in the configuration information; querying the configuration information according to an interaction instruction directed at the intelligent voice assistant to obtain the configuration items that match the interaction instruction; extracting the image resources and speech resources indicated in the configuration items from the resources issued by the server; and, based on the image resources, controlling the avatar to present an image that matches the interaction instruction and, based on the speech resources, controlling the avatar to play speech that matches the interaction instruction.

Description

Interaction method, device, equipment and storage medium of intelligent voice assistant

Technical Field

The present invention relates to artificial intelligence technology, and in particular, to an interactive method, an interactive device, an interactive apparatus, and a storage medium for an intelligent voice assistant.

Background

Artificial Intelligence (AI) is a theory, method and technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.

With the development of artificial intelligence technology, various intelligent voice assistant products have appeared on the market. An intelligent voice assistant can take the form of an approachable three-dimensional character that users like and that can interact with them by voice. Such an assistant can increase users' goodwill toward an application product, improve user stickiness, and greatly improve the user experience.

Disclosure of Invention

The embodiment of the invention provides an interaction method, an interaction device, interaction equipment and a storage medium of an intelligent voice assistant, which can provide an efficient, configurable, one-stop intelligent voice assistant solution for various clients, thereby effectively reducing the development cost.

The technical scheme of the embodiment of the invention is realized as follows:

the embodiment of the invention provides an interactive method of an intelligent voice assistant, which comprises the following steps:

acquiring resources of an avatar and configuration information corresponding to the resources from a server, and extracting model resources of the avatar corresponding to the intelligent voice assistant from the resources;

presenting an avatar of the intelligent voice assistant according to the model resources and resources indicated by default configuration items in the configuration information;

inquiring the configuration information according to the interactive instruction corresponding to the intelligent voice assistant to obtain configuration items conforming to the interactive instruction;

extracting the image resource and the speech resource indicated in the configuration items from the resources issued by the server;

based on the image resource, controlling the virtual image to present an image conforming to the interactive instruction, and based on the speech resource, controlling the virtual image to play speech conforming to the interactive instruction.
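The client-side steps above can be sketched as a simple lookup. This is a minimal illustration, not the patented implementation: the `ConfigItem` type, the resource names, the `"default"` key, and the choice to fall back to the default item for an unknown instruction are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigItem:
    """One configuration item: names of the image and speech resources to use."""
    image: str
    speech: str


def respond(instruction, config, resources):
    """Return (image_resource, speech_resource) for an interaction instruction."""
    # Query the configuration information; falling back to the default item
    # for unknown instructions is an assumption of this sketch.
    item = config.get(instruction, config["default"])
    # Extract the indicated resources from what the server issued.
    return resources[item.image], resources[item.speech]


# Hypothetical resource bundle and configuration issued by the server.
RESOURCES = {
    "idle.anim": b"<idle>",
    "wave.anim": b"<wave>",
    "hello.txt": "Hello, how can I help?",
    "greet.txt": "Nice to see you again!",
}
CONFIG = {
    "default": ConfigItem(image="idle.anim", speech="hello.txt"),
    "greet": ConfigItem(image="wave.anim", speech="greet.txt"),
}

image, speech = respond("greet", CONFIG, RESOURCES)
```

A real client would hand `image` to the rendering engine and `speech` to the audio player; here they are plain values so the flow stays visible.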

acquiring resources of a virtual image of the intelligent voice assistant and generating configuration information corresponding to the resources;

issuing the resources of the intelligent voice assistant and the configuration information corresponding to the resources to a client so that the client executes the following operations:

presenting an avatar of the voice assistant;

The embodiment of the invention provides an interactive device of an intelligent voice assistant, which comprises:

the resource management module is used for acquiring resources of the virtual image and configuration information corresponding to the resources from a server and extracting model resources of the virtual image corresponding to the intelligent voice assistant from the resources;

the configuration item query module is used for querying the configuration information according to the interactive instruction corresponding to the intelligent voice assistant to obtain the configuration items conforming to the interactive instruction;

the resource extraction module is used for extracting the image resources and the conversational resources indicated in the configuration items from the resources issued by the server;

the animation function module is used for presenting the avatar of the intelligent voice assistant according to the model resources and the resources indicated by the default configuration items in the configuration information, and is further used for controlling the avatar to present an image conforming to the interaction instruction based on the image resource and to play speech conforming to the interaction instruction based on the speech resource.

In the foregoing solution, the resource management module is further configured to:

submitting to the server a resource acquisition request carrying the version of the client and the avatar identifier corresponding to the avatar, so that the server queries a database, according to the version and the avatar identifier, for the resources that are adapted to the version of the client and correspond to the avatar;

and receiving, from the server, the resources that are adapted to the version of the client and correspond to the avatar, together with the configuration information corresponding to the resources.
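As a rough illustration of the server-side query described above, the sketch below selects a resource package by avatar identifier and client version. The record layout, the field names, and the compatibility rule (a package is "adapted" when its minimum client version does not exceed the client's version) are assumptions; the source does not specify how adaptation is decided.

```python
def find_resources(database, client_version, avatar_id):
    """Return the newest package for avatar_id that the client can use, or None."""
    candidates = [
        pkg for pkg in database
        if pkg["avatar_id"] == avatar_id
        # Assumed rule: the client must be at least the package's minimum version.
        and pkg["min_client_version"] <= client_version
    ]
    if not candidates:
        return None
    # Among compatible packages, prefer the highest resource version.
    return max(candidates, key=lambda pkg: pkg["resource_version"])


# Hypothetical database rows; versions are tuples so they compare element-wise.
DATABASE = [
    {"avatar_id": "fox", "resource_version": 1, "min_client_version": (1, 0, 0)},
    {"avatar_id": "fox", "resource_version": 2, "min_client_version": (2, 0, 0)},
    {"avatar_id": "owl", "resource_version": 1, "min_client_version": (1, 0, 0)},
]

package = find_resources(DATABASE, client_version=(1, 5, 0), avatar_id="fox")
```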

In the foregoing solution, the animation function module is further configured to:

presenting a default image of the virtual image of the intelligent voice assistant on the basis of the model corresponding to the model resource according to the image resource indicated by the default configuration item in the configuration information;

wherein the default persona comprises: a default skin of the avatar and a default prop.

The animation function module is further configured to play the speech corresponding to the avatar according to the speech resources indicated by the default configuration items in the configuration information.

In the foregoing solution, the configuration item query module is further configured to:

when the interaction instruction comprises a voice interaction instruction, performing speech recognition and semantic recognition on the voice interaction instruction to obtain scene information representing the environment in which the avatar is located and condition information representing the keywords for executing the voice interaction instruction; and querying the configuration information according to the scene information and the condition information to obtain the configuration items that satisfy the scene information and the condition information; and/or

when the interaction instruction is a touch interaction instruction, presenting intention options of the intelligent voice assistant corresponding to at least one of the following: intelligent conversation, device control, vehicle-mounted device (car machine) messages, switching the avatar of the intelligent voice assistant, adding a prop to the intelligent voice assistant, switching the scene in which the intelligent voice assistant is located, and setting the active time of the intelligent voice assistant; and querying the configuration information based on the intention represented by the selected intention option to obtain the configuration item that matches the intention; and/or

sending the interaction instruction directed at the intelligent voice assistant to the server, so that the server performs semantic recognition on the interaction instruction to obtain its intention; and querying the configuration information according to the intention to obtain the configuration items that satisfy the intention.
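The scene-and-condition query in the first branch above might look like the following sketch. The item fields, the matching rule (exact scene match plus membership of the condition keyword in the recognized keywords), and the sample data are assumptions for illustration only; speech and semantic recognition are taken as already done.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class ConfigItem:
    scene: str      # scene information: environment in which the avatar is located
    condition: str  # condition information: keyword that triggers this item
    image: str      # name of the image resource to present
    speech: str     # name of the speech resource to play


def query_config(items: Iterable[ConfigItem], scene: str,
                 keywords: Set[str]) -> Optional[ConfigItem]:
    """Return the first item whose scene matches and whose condition was recognized."""
    for item in items:
        if item.scene == scene and item.condition in keywords:
            return item
    return None


# Hypothetical configuration information.
ITEMS = [
    ConfigItem("car", "weather", "look_up.anim", "weather_report.txt"),
    ConfigItem("car", "music", "dance.anim", "play_music.txt"),
    ConfigItem("home", "weather", "point_window.anim", "weather_report.txt"),
]

# Keywords as they might come out of semantic recognition of "play some music".
match = query_config(ITEMS, scene="car", keywords={"play", "music"})
```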

when the intention of the interaction instruction is to hold a conversation with the avatar, controlling the avatar, based on the image resources corresponding to the conversation, to perform a feedback action that matches the emotion of the conversation result, and controlling the avatar, based on the speech resources corresponding to the conversation, to play speech that matches the question-and-answer characteristics of the conversation; and/or

when the intention of the interaction instruction is to control a device, controlling the avatar, based on the image resources corresponding to the control operation, to present the control operation on the device, and controlling the avatar, based on the speech resources corresponding to the control operation, to play speech representing the control result; and/or

when the intention of the interaction instruction is to view a vehicle-mounted device (car machine) message, controlling the avatar, based on the image resources corresponding to viewing the message, to present the action of viewing the message, and controlling the avatar, based on the speech resources corresponding to viewing the message, to play speech that broadcasts the message.

the resource configuration module is used for acquiring resources of the virtual image of the intelligent voice assistant and generating configuration information corresponding to the resources;

the resource issuing module is used for issuing the resources of the intelligent voice assistant and the configuration information corresponding to the resources to a client so that the client executes the following operations:

presenting an avatar of the voice assistant;

In the foregoing solution, the resource configuration module is further configured to:

receiving image resources of the intelligent voice assistant uploaded by an art resource provider, and creating speech resources of the intelligent voice assistant;

wherein the avatar resources include at least one of: scene resources, model resources, skin resources, and action resources;

when the received image resources belong to a new avatar, assigning a new version identifier to the received image resources and generating configuration information corresponding to the new version, wherein the configuration information comprises at least one of the following: a scene resource configuration item, a model resource configuration item, a skin resource configuration item, an action resource configuration item, and a speech resource configuration item;

and when the received image resources of the intelligent voice assistant are updates to an existing avatar, updating the image resources and the configuration information of the corresponding version of the existing avatar.
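The new-versus-existing handling above can be sketched with a small in-memory store. The `ResourceStore` class, its field names, and the way configuration items are derived from the stored resources are hypothetical stand-ins for the cloud-side database, which the source does not describe in detail.

```python
class ResourceStore:
    """In-memory stand-in for the cloud-side resource database (hypothetical)."""

    def __init__(self):
        self._next_version = 1
        # avatar_id -> {"version": int, "resources": dict, "config": dict}
        self.avatars = {}

    def upload(self, avatar_id, image_resources):
        """Store uploaded resources and return the version they belong to."""
        entry = self.avatars.get(avatar_id)
        if entry is None:
            # New avatar: assign a new version identifier.
            entry = {"version": self._next_version, "resources": {}, "config": {}}
            self._next_version += 1
            self.avatars[avatar_id] = entry
        # New or existing avatar: store the resources, then regenerate the
        # configuration items (one per resource kind) for that version.
        entry["resources"].update(image_resources)
        entry["config"] = {
            kind: sorted(names) for kind, names in entry["resources"].items()
        }
        return entry["version"]


store = ResourceStore()
first = store.upload("fox", {"skin": ["default"], "model": ["fox_v1"]})
update = store.upload("fox", {"action": ["wave", "nod"]})  # same version, updated
```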

In the above solution, the apparatus further comprises: a speech semantic recognition module to:

receiving an interactive instruction of the corresponding intelligent voice assistant uploaded by the client;

when the interaction instruction is a voice interaction instruction, performing speech recognition on the voice interaction instruction to obtain text information, performing semantic recognition on the text information to obtain the intention of the interaction instruction, and sending the intention to the client, so that the client queries the configuration information according to the intention to obtain a configuration item that matches the intention, controls the avatar to present an image that matches the interaction instruction according to the image resource indicated in the configuration item, and controls the avatar to play speech that matches the interaction instruction according to the speech resource indicated in the configuration information;

wherein the intent comprises at least one of:

intelligent conversation, device control, vehicle-mounted device (car machine) messages, switching the avatar of the intelligent voice assistant, adding a prop to the intelligent voice assistant, switching the scene in which the intelligent voice assistant is located, and setting the active time of the intelligent voice assistant.

In the above solution, the apparatus further comprises: a voiceprint signal processing module to:

extracting voice print characteristic parameters of the voice interaction instruction;

according to the voiceprint characteristic parameters, the identity of the user who initiates the voice interaction instruction is identified;

determining to continue semantic recognition of the text information when the user is identified as an authorized user;

and when the user is identified as an unauthorized user, returning prompt information that the user does not have the operation authority to the client.
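The authorization gate above can be illustrated as follows. The source does not say how voiceprint features are compared; this sketch assumes fixed-length feature vectors matched by cosine similarity against enrolled users, and the threshold, function names, and prompt text are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def handle_voice_instruction(features, text, enrolled, threshold=0.9):
    """Continue to semantic recognition only for an enrolled (authorized) speaker."""
    for user, reference in enrolled.items():
        if cosine_similarity(features, reference) >= threshold:
            # Authorized: the text goes on to semantic recognition.
            return {"status": "authorized", "user": user, "text": text}
    # Unauthorized: return prompt information to the client instead.
    return {
        "status": "denied",
        "message": "You do not have permission to perform this operation.",
    }


# Hypothetical enrolled voiceprint for one authorized user.
ENROLLED = {"owner": [1.0, 0.0, 0.0]}

result = handle_voice_instruction([0.99, 0.05, 0.0], "open the window", ENROLLED)
```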

for each candidate intention of the interaction instruction, performing the following:

associating the resources that respond to the intention with the scene information and the condition information corresponding to the intention to form corresponding configuration items, so that the configuration information corresponding to the intention is formed from the combination of the configuration items.

binding image resources and conversational resources used for responding to a conversational intention with scene information and condition information of the conversational intention to generate corresponding configuration items, wherein the conversational intention comprises at least one of chatting, knowledge question answering and weather inquiry;

binding image resources and speech resources for responding to a device control intention with the scene information and condition information corresponding to the device control intention to generate corresponding configuration items, wherein the device control intention includes at least one of: controlling a household device through a client in a vehicle, and controlling a vehicle through a client on a household device;

and binding image resources and speech resources for responding to the vehicle-mounted device message checking intention with the scene information and the condition information corresponding to the vehicle-mounted device message checking intention so as to generate a corresponding configuration item.
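The binding steps above amount to building configuration items and grouping them by intention. The sketch below shows one way this could be represented; the dictionary layout, the intent labels, and the sample resource names are assumptions, not details from the source.

```python
def bind(intent, scene, condition, image, speech):
    """Bind image and speech resources to an intention's scene and condition."""
    return {
        "intent": intent,
        "scene": scene,
        "condition": condition,
        "image": image,
        "speech": speech,
    }


def build_configuration(items):
    """Group configuration items by intention to form the configuration information."""
    config = {}
    for item in items:
        config.setdefault(item["intent"], []).append(item)
    return config


# Hypothetical items covering the three intention families named above.
CONFIGURATION = build_configuration([
    bind("conversation", "car", "weather", "look_up.anim", "weather_report.txt"),
    bind("conversation", "car", "chat", "smile.anim", "small_talk.txt"),
    bind("device_control", "car", "air_conditioner", "press_button.anim", "ac_on.txt"),
    bind("car_message", "car", "check", "open_letter.anim", "read_message.txt"),
])
```

Keeping scene and condition on each item means the client-side query only has to filter a flat list per intention.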

The embodiment of the invention provides interactive equipment of an intelligent voice assistant, which comprises:

a memory for storing executable instructions;

and the processor is used for realizing the interaction method of the intelligent voice assistant provided by the embodiment of the invention when executing the executable instructions stored in the memory.

The embodiment of the invention provides a storage medium, which stores executable instructions and is used for causing a processor to execute so as to realize the interaction method of the intelligent voice assistant provided by the embodiment of the invention.

The embodiment of the invention has the following beneficial effects:

the embodiment of the invention realizes a complete set of solution of the intelligent voice assistant for the resources and configuration of the intelligent voice assistant, so that content providers can access the intelligent voice assistant in a quick and low-cost manner to flexibly apply the functions of the intelligent voice assistant in various products, and the repeated development cost is saved.

Drawings

FIG. 1 is an alternative structural diagram of an interactive system architecture of an intelligent voice assistant provided by an embodiment of the present invention;

FIG. 2A is an alternative structural diagram of an interactive device of an intelligent voice assistant provided by an embodiment of the invention;

FIG. 2B is an alternative structural diagram of an interaction device of an intelligent voice assistant provided by an embodiment of the invention;

FIG. 3 is an alternative flow diagram of an interaction method of an intelligent voice assistant according to an embodiment of the present invention;

FIGS. 4A-4B are alternative flow diagrams of the interaction method of the intelligent voice assistant provided by the embodiment of the invention;

FIG. 5 is a block diagram of an interactive system for an intelligent voice assistant provided by an embodiment of the present invention;

FIG. 6 is a resource configuration architecture diagram of an intelligent voice assistant provided by an embodiment of the present invention;

FIG. 7 is a flow chart of a terminal of an intelligent voice assistant according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail with reference to the accompanying drawings, the described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.

In the following description, the terms "first", "second", and "third" are used only to distinguish similar objects and do not denote a particular order. Where permitted, the specific order or sequence denoted by "first", "second", and "third" may be interchanged, so that the embodiments of the invention described herein can be practiced in an order other than that shown or described herein.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.

Before further detailed description of the embodiments of the present invention, terms and expressions mentioned in the embodiments of the present invention are explained, and the terms and expressions mentioned in the embodiments of the present invention are applied to the following explanations.

1) Intelligent voice assistant: a typical application scenario of intelligent voice interaction. It implements an interaction mode based on voice input, so that a user can obtain feedback by speaking.

2) Skin: a decoration for an avatar, including clothing, equipment, and the like, which can be used to decorate various avatars.

3) Props: weapons, tools, and similar items used together with the avatar.

4) Rendering: projecting a model in a three-dimensional scene into a two-dimensional digital image according to the configured environment, lighting, materials, and rendering parameters.

In the related art, intelligent voice assistant solutions either provide voice interaction only, with no three-dimensional avatar on display and no way to interact with one (for example, touch feedback), or provide only a visual system based on a three-dimensional avatar, with expression animations but no configuration of actions and speech. Such solutions cannot produce feedback that corresponds to the user's input, cannot issue resources in real time, and do not support ongoing operation.

At present, bringing an intelligent voice assistant with a three-dimensional avatar to market requires development on several fronts: a background that provides capabilities such as artificial intelligence and speech/semantic recognition; an art provider that supplies the corresponding image resources, such as actions and character models; a cloud system for uploading and managing the resources and their configuration; and a terminal that handles user interaction and renders and displays the avatar of the intelligent voice assistant. The cost of developing such an assistant is therefore very high. The present technical scheme provides a complete solution covering the background, the cloud, the art resources, and the terminal. A content provider or application developer only needs to complete a simple integration, upload image resources to the cloud, and configure the speech and actions in order to realize a three-dimensional-avatar intelligent voice assistant in the application client. In addition, compared with an intelligent voice assistant developed independently by each content provider, having the art resource provider develop the image resources in a unified way helps ensure the presentation quality of the avatar.

The embodiment of the invention provides an interaction method, an interaction device, equipment and a storage medium of an intelligent voice assistant, which can realize voice interaction and three-dimensional image display between the intelligent voice assistant and a user. In the following, an exemplary application will be explained when the device is implemented as a terminal as well as a server.

Referring to fig. 1, fig. 1 is a schematic diagram of an alternative architecture of an interactive system 100 of an intelligent voice assistant according to an embodiment of the present invention, which includes a terminal 400, a network 300, a server 200, an intelligent device 500, and an art resource provider 600. The terminal 400 is connected to the server 200 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of both. The server 200 comprises a background server 200-1 and a cloud server 200-2, the terminal 400 is used for processing user interaction and rendering/displaying an avatar of the intelligent voice assistant, for example, displaying the avatar of the intelligent voice assistant on a graphical interface of the client 410, the background server 200-1 is used for providing functions of artificial intelligence, voice/semantic recognition and the like, and the cloud server 200-2 is used for uploading and managing resources and corresponding configuration resources. 
The interactive system 100 works as follows. The cloud server 200-2 receives the resources of the avatar of the intelligent voice assistant uploaded by the art resource provider 600 and generates configuration information corresponding to the resources. The background server 200-1 issues the resources of the intelligent voice assistant and the corresponding configuration information to the client. The client 410 on the terminal 400 acquires the avatar resources and the corresponding configuration information from the background server 200-1 and receives the user's interaction instruction for the intelligent voice assistant. The client 410 then queries the configuration information according to the interaction instruction to obtain the matching configuration item, and extracts the image resources and speech resources indicated in the configuration item from the resources issued by the background server 200-1, so as to control the avatar to present an image and play speech that match the interaction instruction. In this way, both voice interaction and three-dimensional interaction between the user and the intelligent voice assistant are realized.

In some embodiments, the system architecture 100 may further include the smart device 500. The client 410 receives a control instruction of the user for the smart device 500 and sends the control instruction to the smart device 500 either through the background server 200-1 or by local control. The client 410 may obtain the execution status of the control instruction in real time and present it to the user through a corresponding image and corresponding speech.

It can be understood that the backend server 200-1 and the cloud server 200-2 may be different service modules integrated on the same device, or may be independent service modules disposed on different devices.

Referring to fig. 2A, fig. 2A is a schematic diagram of an optional structure of an interactive device of an intelligent voice assistant according to an embodiment of the present invention, and taking the interactive device as a terminal as an example, a terminal 400 shown in fig. 2A includes: at least one processor 410, memory 450, at least one network interface 420, and a user interface 430. The various components in the terminal 400 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable communications among the components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 440 in fig. 2A.

The Processor 410 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.

The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable the presentation of media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.

The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 450 optionally includes one or more storage devices physically located remote from processor 410.

The memory 450 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 450 described in embodiments of the invention is intended to comprise any suitable type of memory.

In some embodiments, memory 450 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.

An operating system 451, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;

a network communication module 452 for communicating with other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 including: Bluetooth, Wireless Fidelity (WiFi), and Universal Serial Bus (USB), etc.;

a presentation module 453 for enabling presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more output devices 431 (e.g., display screens, speakers, etc.) associated with user interface 430;

an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.

In some embodiments, the interaction device of the intelligent voice assistant provided by the embodiments of the present invention may be implemented in software, and fig. 2A illustrates the interaction device 455 of the intelligent voice assistant stored in the memory 450, which may be software in the form of programs and plug-ins, and includes the following software modules: a resource management module 4551, a configuration item query module 4552, a resource extraction module 4553 and an animation function module 4554. These modules are logical and thus may be combined or further split according to the functionality implemented, the functionality of the individual modules being described below.

Referring to fig. 2B, fig. 2B is a schematic diagram of an optional structure of an interactive device of an intelligent voice assistant according to an embodiment of the present invention, and taking the interactive device as a server as an example, a server 200 shown in fig. 2B includes: at least one processor 210, memory 250, at least one network interface 220, and a user interface 230. The various components in server 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communications among the components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 240 in fig. 2B.

The Processor 210 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.

The user interface 230 includes one or more output devices 231, including one or more speakers and/or one or more visual display screens, that enable the presentation of media content. The user interface 230 also includes one or more input devices 232, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.

The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.

The memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 250 described in embodiments of the invention is intended to comprise any suitable type of memory.

In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.

An operating system 251 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;

a network communication module 252 for reaching other computing devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), etc.;

a presentation module 253 to enable presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 231 (e.g., a display screen, speakers, etc.) associated with the user interface 230;

an input processing module 254 for detecting one or more user inputs or interactions from one of the one or more input devices 232 and translating the detected inputs or interactions.

In some embodiments, the interaction device of the intelligent voice assistant provided by the embodiments of the present invention may be implemented in software, and fig. 2B shows the interaction device 255 of the intelligent voice assistant stored in the memory 250, which may be software in the form of programs and plug-ins, etc., and includes the following software modules: a resource configuration module 2551, a resource issuing module 2552, a speech semantic recognition module 2553 and a voiceprint signal processing module 2554. These modules are logical and thus may be combined or further split according to the functionality implemented, the functionality of the individual modules being described below.

In other embodiments, the interaction device of the intelligent voice assistant provided by the embodiments of the present invention may be implemented in hardware. For example, the device may be a processor in the form of a hardware decoding processor that is programmed to perform the interaction method of the intelligent voice assistant provided by the embodiments of the present invention. Such a processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.

The implementation process of the interaction method of the intelligent voice assistant provided by the embodiments of the present invention is divided into two stages. The first stage is the process in which the server acquires resources and configures them. The second stage is the process in which the client performs voice interaction and image interaction with the user using the resources and configuration received from the server. The two stages are explained separately below.

The resource configuration process performed on the server is described below with reference to steps 301 to 304 shown in fig. 3.

Referring to fig. 3, fig. 3 is an alternative flowchart of the interaction method of the intelligent voice assistant according to the embodiment of the present invention.

In step 301, the server obtains resources of the avatar of the intelligent voice assistant.

In step 302, the server generates configuration information for the corresponding resource.

In some embodiments, acquiring the resources of the avatar of the intelligent voice assistant in step 301 may be implemented by receiving image resources of the intelligent voice assistant uploaded by an art resource provider and creating the conversational resources of the intelligent voice assistant, where the image resources include at least one of: scene resources, model resources, skin resources, and action resources. Generating the configuration information of the corresponding resources in step 302 may be implemented in either of the following two ways. When image resources of a new avatar are received, a new version identifier is allocated to the received image resources, and configuration information corresponding to the new version is generated, where the configuration information includes at least one of: a scene resource configuration item, a model resource configuration item, a skin resource configuration item, an action resource configuration item, and a conversational resource configuration item. When the image resources of the intelligent voice assistant are updated resources of an existing avatar, the image resources and the configuration information of the corresponding version of the existing avatar are updated.

In some embodiments, the resources uploaded by the art resource provider may be stored in a database of a cloud server or in a background server. The image resources include at least one of: scene resources, model resources, skin resources, and action resources. The conversational resources may be embodied as the speech played by the avatar when it is presented.

In some embodiments, when image resources of a new avatar are received, a new version identifier is allocated to the received image resources, and configuration information corresponding to the new version is generated, the configuration information including at least one of: scene resource configuration items, model resource configuration items, skin resource configuration items, action resource configuration items, and conversational resource configuration items. Specifically, for a new voice assistant, the corresponding image resources are the first resource version under the new voice assistant, and a new version identifier is allocated to them so that resource versions can be managed. In the resource version management process, the version identifier of the resources is configured in correspondence with the version of the client. When the image resources of the intelligent voice assistant are updated resources of an existing avatar, the image resources and the configuration information of the corresponding version of the existing avatar are updated. Specifically, the resources of the existing version may be updated, or the original version may be retained and a new version created from the image resources uploaded by the art resource provider.
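The version handling described above can be illustrated with a minimal Python sketch. It covers only the variant in which the original version is retained and a new version is created on every upload. The names `ResourceVersion`, `AvatarStore` and `upload`, the in-memory dictionary standing in for the database, and the way configuration items are derived from the uploaded resources are all assumptions made for illustration and are not taken from the embodiments.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceVersion:
    version_id: int        # version identifier allocated by the server
    image_resources: dict  # e.g. {"scene": [...], "skin": [...], "action": [...]}
    config_items: dict     # configuration information generated for this version


@dataclass
class AvatarStore:
    # avatar_id -> list of versions, oldest first (hypothetical stand-in for the database)
    versions: dict = field(default_factory=dict)

    def upload(self, avatar_id: str, image_resources: dict) -> ResourceVersion:
        """Store an upload as a new version, keeping all earlier versions."""
        history = self.versions.setdefault(avatar_id, [])
        # A new avatar gets version 1; an update gets the next free identifier.
        version_id = len(history) + 1
        # Assumed configuration format: one configuration item list per resource kind.
        config_items = {kind: sorted(names) for kind, names in image_resources.items()}
        version = ResourceVersion(version_id, image_resources, config_items)
        history.append(version)
        return version
```

Because earlier versions stay in the history, a server built this way could keep serving older clients from the version that matches them.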

In step 303, the server issues the resources of the intelligent voice assistant and the configuration information of the corresponding resources to the client.

In step 304, the client performs the following operations: presenting an avatar of the voice assistant; based on the image resource, the virtual image is controlled to present the image conforming to the interactive instruction, and based on the conversational resource, the virtual image is controlled to play the voice conforming to the interactive instruction.

In some embodiments, after the resources of the intelligent voice assistant and the configuration information of the corresponding resources are issued to the client in step 303, the following technical solution may be further implemented. The server receives an interactive instruction for the intelligent voice assistant uploaded by the client. When the interactive instruction is a voice interactive instruction, the server performs voice recognition on it to obtain text information, performs semantic recognition on the text information to obtain the intention of the interactive instruction, and sends the intention to the client. The client then queries the configuration information according to the intention to obtain a configuration item conforming to the intention, controls the avatar to present the image conforming to the interactive instruction according to the image resources indicated in the configuration item, and controls the avatar to play the voice conforming to the interactive instruction according to the conversational resources indicated in the configuration item. The intention includes at least one of: intelligent conversation, device control, vehicle-mounted device messages, switching the avatar of the intelligent voice assistant, adding a prop of the intelligent voice assistant, switching the scene where the intelligent voice assistant is located, and setting the active time of the intelligent voice assistant.

In some embodiments, after the server issues the resources of the intelligent voice assistant and the configuration information of the corresponding resources to the client, the client uploads an interactive instruction for the intelligent voice assistant to the server. The interactive instruction may be a voice interactive instruction or a touch interactive instruction. When the server receives a voice interactive instruction uploaded by the client, it performs voice recognition to convert the received voice into text, performs semantic analysis on the converted text to obtain the intention of the interactive instruction, and returns the obtained intention to the client. Among the configured resources, one dimension of configuration is to configure the resources by intention. The client queries the configuration information according to the intention returned by the server to obtain the configuration item conforming to the intention and, based on the image resources and the conversational resources indicated by the configuration item, controls the avatar to present the image conforming to the interactive instruction and play the corresponding voice.

The intention here is the same as that mentioned in the previous embodiment: it may be intelligent conversation, device control, vehicle-mounted device messages, switching the avatar of the intelligent voice assistant, adding a prop of the intelligent voice assistant, switching the scene where the intelligent voice assistant is located, or setting the active time of the intelligent voice assistant. Instructions for some intentions can be executed locally on the client, while instructions for other intentions need to be executed remotely by the server.

In some embodiments, before semantic recognition is performed on the text information, the following technical solution may be further executed: extracting voiceprint characteristic parameters of the voice interactive instruction; identifying, according to the voiceprint characteristic parameters, the identity of the user who initiated the voice interactive instruction; when the user is identified as an authorized user, determining to continue with semantic recognition of the text information; and when the user is identified as an unauthorized user, returning to the client prompt information indicating that the user does not have operation permission.

In some embodiments, the server authenticates the initiator of a voice interactive instruction before executing the voice interactive instruction sent by the client. Different users can use the client on the same terminal to send interactive instructions to the intelligent voice assistant. To prevent a malicious user from interacting with the intelligent voice assistant without the authorized user's knowledge, for example by having the intelligent voice assistant control household appliances in a harmful way, the server performs voiceprint recognition on the received voice instruction and identifies the user who initiated it from the voiceprint characteristic parameters. When the user is an authorized user, the server continues with semantic recognition of the text information in order to respond to the voice interactive instruction. When the user is an unauthorized user, the server returns warning information to the client, prompting that the user does not have operation permission, and does not perform semantic recognition on the text information until authorization passes.
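The ordering described above, in which voiceprint authorization gates semantic recognition, can be sketched as follows. This is an illustrative sketch only: the function `handle_voice_instruction`, its parameters, and the returned dictionary format are hypothetical, and the speech recognition, voiceprint identification and semantic recognition steps are passed in as callables because the embodiments do not prescribe a specific implementation for them.

```python
from typing import Callable, Optional, Set


def handle_voice_instruction(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],
    identify_speaker: Callable[[bytes], Optional[str]],
    recognize_intent: Callable[[str], str],
    authorized_users: Set[str],
) -> dict:
    """Run semantic recognition only after the speaker is authorized."""
    # Voice recognition: convert the voice instruction into text information.
    text = speech_to_text(audio)
    # Voiceprint recognition: identify who initiated the instruction.
    user = identify_speaker(audio)
    if user is None or user not in authorized_users:
        # Unauthorized: return a prompt and skip semantic recognition entirely.
        return {"status": "denied",
                "message": "The user does not have operation permission."}
    # Authorized: continue with semantic recognition to obtain the intention.
    return {"status": "ok", "user": user, "intent": recognize_intent(text)}
```

Passing the three recognizers as parameters keeps the gate itself testable without any real audio processing.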

In some embodiments, the resources are configured along an intention dimension, and the configuration information of the corresponding resources is generated for the intention dimension as follows. For each candidate intention of the interactive instruction, the following is performed: the resources for responding to the intention are associated with the scene information and the condition information corresponding to the intention to form a corresponding configuration item, and the configuration information corresponding to the intention is formed from the combination of these configuration items.

In some embodiments, the intentions can be divided into conversation intentions, device control intentions, and intentions related to vehicle-mounted device messages. Based on this division, the resources for responding to an intention are bound with the scene information and the condition information corresponding to the intention to form the corresponding configuration items. The image resources and the conversational resources for responding to a conversation intention are bound with the scene information and the condition information of the conversation intention to generate the corresponding configuration items, where the conversation intention includes at least one of: chatting, knowledge question answering, and weather inquiry. The image resources and the conversational resources for responding to a device control intention are bound with the scene information and the condition information corresponding to the device control intention to generate the corresponding configuration items, where the device control intention includes at least one of: controlling household devices through a client in the vehicle, and controlling the vehicle through a client in a household device. The image resources and the conversational resources for responding to the intention of viewing vehicle-mounted device messages are bound with the scene information and the condition information corresponding to that intention to generate the corresponding configuration items.
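One way to picture a configuration item is as a record that binds the resources for an intention to scene information and condition information. The following Python sketch assumes such a record layout. The class `ConfigItem`, its field names, the function `build_configuration`, and all resource names used with them are hypothetical and serve only to illustrate the binding.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ConfigItem:
    intent: str                       # e.g. "weather_inquiry" (assumed label)
    scene: str                        # scene information bound to the intention
    condition: str                    # condition information bound to the intention
    image_resources: Tuple[str, ...]  # names of scene/skin/action resources
    conversational_resource: str      # name of the speech or text resource


def build_configuration(items: Iterable[ConfigItem]) -> Dict[str, List[ConfigItem]]:
    """Group configuration items by intention to form the configuration information."""
    config: Dict[str, List[ConfigItem]] = {}
    for item in items:
        config.setdefault(item.intent, []).append(item)
    return config
```

Grouping by intention mirrors the intention dimension described above, so a client that knows the intention only has to scan the items under that key.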

The following describes the process in which the intelligent voice assistant performs voice interaction and image interaction with the user on the client, in conjunction with steps 401 to 406 shown in fig. 4A.

Referring to fig. 4A, fig. 4A is an optional flowchart of the interaction method of the intelligent voice assistant according to an embodiment of the present invention, which will be described with reference to steps 401 to 406 shown in fig. 4A.

In step 401, the client acquires the resources of the avatar and the configuration information of the corresponding resources from the server.

In step 402, the client extracts model resources corresponding to the avatar of the intelligent voice assistant from the resources.

The resources may specifically include model resources, scene resources, action resources, skin resources, prop resources, special effect resources, or conversational resources. The scene resources, action resources, skin resources, prop resources, special effect resources, and conversational resources correspond to the avatar and may form a resource package for the avatar.

Referring to fig. 4B, which is based on fig. 4A, fig. 4B is an optional flowchart of the interaction method of the intelligent voice assistant according to an embodiment of the present invention, which will be described with reference to steps 4011 to 4013 shown in fig. 4B.

In step 401, the client obtains the resources of the avatar and the configuration information of the corresponding resources from the server, which can be implemented by performing the following steps 4011 to 4013.

In step 4011, the client submits a resource acquisition request carrying a version of the client and an avatar identifier corresponding to the avatar to the server.

In step 4012, the server queries, according to the version and the avatar identifier, resources corresponding to the version of the client and adapted to the avatar in the database.

In step 4013, the client receives the resource that is adapted to the version of the client and corresponds to the avatar and the configuration information of the resource that is issued by the server.

In some embodiments, the client submits a resource acquisition request to the server. The request carries the avatar identifier so as to acquire the resources corresponding to that identifier, where the resources include scene resources, action resources, skin resources, prop resources, special effect resources, or conversational resources configured around a specific avatar. The request also carries the version information of the client. The same avatar can have resource configurations corresponding to different client versions, and the resource configurations that can be used differ between client versions. For example, the resources corresponding to client version 1.0 may have three scenes, while the resources corresponding to client version 2.0 may have five scenes. According to the client version and the avatar identifier carried in the resource acquisition request, the server queries the database for the resources of the corresponding avatar that meet the requirements of the client version, and issues the queried resources and the configuration information of the corresponding resources to the client that submitted the request.
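The server-side query of step 4012 can be sketched as a lookup keyed on the avatar identifier and the client version. The sketch below is not the method of the embodiments: the record layout, the function names `parse_version` and `query_resources`, and in particular the fallback rule (when no package exists for the exact client version, use the newest package whose version does not exceed it) are assumptions introduced for illustration.

```python
from typing import List, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as "2.0" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def query_resources(database: List[dict], client_version: str,
                    avatar_id: str) -> Optional[dict]:
    """Return the resource package of the avatar that fits the client version."""
    client = parse_version(client_version)
    candidates = [
        package for package in database
        if package["avatar_id"] == avatar_id
        and parse_version(package["client_version"]) <= client
    ]
    if not candidates:
        return None  # no package is compatible with this client
    # Assumed rule: prefer the newest package the client can still use.
    return max(candidates, key=lambda package: parse_version(package["client_version"]))
```

With a database holding a three-scene package for version 1.0 and a five-scene package for version 2.0, a 1.0 client would receive the former and a 2.0 client the latter.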

In some embodiments, the configuration information includes configuration items corresponding to various scenes and conditions as well as the configuration items used for the initial presentation. The resources and configuration issued by the server may be the complete resources and configuration information of the requested avatar, or only part of the resources and the corresponding configuration information. For example, the server may issue only the popular ("hot") resources of the requested avatar, determined from historical data statistics, together with the corresponding configuration information, thereby reducing transmission delay.

In some embodiments, the client may first obtain the configuration information of the corresponding resources from the server, analyze the configuration information, and then obtain from the server some or all of the resources that correspond to the current application requirement. For example, if the current application requirement is to perform question-answering interaction with the avatar, the received configuration information may be analyzed so that only the conversational resources and the image resources corresponding to the question-answering intention are obtained, which shortens the time needed to obtain resources.
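The configuration-first approach can be illustrated with a short Python sketch in which the client analyzes the configuration information, works out which resources a given intention needs, and downloads only those. The configuration layout (a dictionary from intention to configuration items), the functions `resources_needed` and `fetch_for_intent`, and the `fetch_resource` callable standing in for a download from the server are all hypothetical.

```python
from typing import Callable, Dict, List


def resources_needed(config_info: Dict[str, List[dict]], intent: str) -> List[str]:
    """List the resource names referenced by the configuration items of one intention."""
    needed: List[str] = []
    for item in config_info.get(intent, []):
        names = list(item["image_resources"]) + [item["conversational_resource"]]
        for name in names:
            if name not in needed:  # keep first-seen order, drop duplicates
                needed.append(name)
    return needed


def fetch_for_intent(config_info: Dict[str, List[dict]], intent: str,
                     fetch_resource: Callable[[str], bytes],
                     cache: Dict[str, bytes]) -> Dict[str, bytes]:
    """Download only the resources required by the intention that are not cached yet."""
    for name in resources_needed(config_info, intent):
        if name not in cache:
            cache[name] = fetch_resource(name)
    return cache
```

Resources for other intentions are never requested in this sketch, which is where the shorter acquisition time would come from.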

In step 403, the client presents the avatar of the intelligent voice assistant according to the model resources and the resources indicated by the default configuration item in the configuration information. This may be implemented as follows: according to the image resources indicated by the default configuration item in the configuration information, a default image of the avatar of the intelligent voice assistant is presented on the basis of the model corresponding to the model resources, where the default image includes the default skin and the default prop of the avatar, and the model may be a two-dimensional model or a three-dimensional model. After the default image of the avatar of the intelligent voice assistant is presented, the client may play the voice of the corresponding avatar according to the conversational resources indicated by the default configuration item in the configuration information.

In some embodiments, when a particular avatar is first rendered on the display interface of the client, it appears in an initialized default image, which may include the default scene in which the avatar is first rendered, the skin in which the avatar is first rendered, the prop the avatar wears at first, and the special effects surrounding the avatar. On this basis, when the avatar is presented, the voice of the corresponding avatar may also be played, for example a greeting from the avatar or a sentence associated with the avatar itself.

In step 404, the client queries the configuration information according to the interactive instruction of the corresponding intelligent voice assistant to obtain the configuration items conforming to the interactive instruction.

In step 404, the client queries the configuration information according to the interactive instruction of the corresponding intelligent voice assistant to obtain a configuration item conforming to the interactive instruction, which can be implemented by executing the following steps.

When the interactive instruction comprises a voice interactive instruction, voice recognition and semantic recognition are performed on the voice interactive instruction to obtain scene information representing the environment where the avatar is located and condition information representing the keywords for executing the voice interactive instruction. The configuration information is then queried according to the scene information and the condition information to obtain the configuration items conforming to both. The configuration items may be the scene resource configuration items, model resource configuration items, skin resource configuration items, action resource configuration items, and conversational resource configuration items generated in the resource configuration stage.

In some embodiments, the voice recognition and semantic recognition of the voice interactive instruction may be executed locally at the client, or the voice interactive instruction may be sent to the server, which performs voice recognition on it through Automatic Speech Recognition (ASR) technology.

In some embodiments, the scene information representing the environment where the avatar is located and the condition information representing the keywords for executing the voice interactive instruction are obtained from the text produced by voice recognition. For example, when the voice interactive instruction is "how is the weather today" and the weather data returned by the server indicates rain, "weather" is used as the scene information, that is, the use scene of the image configuration and the conversational configuration, and "rainy" is used as the condition information, that is, the use condition of the image configuration and the conversational configuration. The configuration information of the issued resources is then queried according to the scene information and the condition information to obtain the configuration item that matches both the weather scene and the rainy condition.
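The weather example can be expressed as a simple lookup over configuration items keyed by scene information and condition information. The sketch below assumes that each configuration item is a dictionary with `scene` and `condition` fields. The function name `query_config_items` and the resource names are hypothetical.

```python
from typing import List


def query_config_items(config_items: List[dict], scene: str, condition: str) -> List[dict]:
    """Return the configuration items matching both the scene and the condition."""
    return [
        item for item in config_items
        if item["scene"] == scene and item["condition"] == condition
    ]


# Hypothetical configuration information issued by the server.
CONFIG_ITEMS = [
    {"scene": "weather", "condition": "rainy",
     "image": ["rain_scene", "open_umbrella_action"], "conversational": "rain_tip"},
    {"scene": "weather", "condition": "sunny",
     "image": ["sun_scene", "sunglasses_prop"], "conversational": "sun_tip"},
]

# "How is the weather today" with rain reported -> scene "weather", condition "rainy".
matched = query_config_items(CONFIG_ITEMS, "weather", "rainy")
```

An empty result would mean that no configuration item was configured for that scene and condition. How a client should react in that case is not specified here.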

When the interactive instruction is a touch interactive instruction, intention options of the intelligent voice assistant are presented, corresponding to at least one of the following: intelligent conversation, device control, vehicle-mounted device messages, switching the avatar of the intelligent voice assistant, adding a prop of the intelligent voice assistant, switching the scene where the intelligent voice assistant is located, and setting the active time of the intelligent voice assistant.

Intelligent conversation includes the chat intention, the knowledge question-answering intention, the weather inquiry intention, and the like. Device control includes controlling smart home devices through the intelligent voice assistant client on the vehicle-mounted system, and controlling the smart devices in the vehicle through the intelligent voice assistant client on a terminal at home. A vehicle-mounted device message means a message sent, through the intelligent voice assistant client on the vehicle-mounted system, to a smart product corresponding to the intelligent voice assistant. The configuration information is queried based on the intention represented by the selected intention option to obtain the configuration items conforming to the intention. The configuration items may be the scene resource configuration items, model resource configuration items, skin resource configuration items, action resource configuration items, and conversational resource configuration items generated in the resource configuration stage.

In some embodiments, the interactive instruction may be a touch interactive instruction, that is, a touch on the avatar of the intelligent voice assistant. When the avatar is touched, various intention options for the intelligent voice assistant may be presented, and the intention represented by the touch interactive instruction is determined from the user's selection. The intention corresponds to configuration items in the configuration information, for example configuration items for the prop, the scene, and the active time, so that the configuration items conforming to the intention can be obtained.

In some embodiments, the client may also determine the intention from the touch mode of the touch interactive instruction. The touch mode may be a tap mode, a press mode, or a two-point click mode. Different intentions are configured for different touch modes, and the corresponding intention is determined locally by obtaining the touch mode of the interactive instruction.
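Determining the intention locally from the touch mode amounts to a table lookup. In the sketch below, the touch mode names follow the text above, but the intention assigned to each mode is purely an assumed example, since the embodiments do not state which intention belongs to which touch mode. The names `TOUCH_MODE_INTENTS` and `intent_for_touch` are hypothetical.

```python
from typing import Optional

# Assumed example mapping; the actual assignment would be part of the configuration.
TOUCH_MODE_INTENTS = {
    "tap": "intelligent_conversation",
    "press": "switch_scene",
    "two_point_click": "switch_avatar",
}


def intent_for_touch(touch_mode: str) -> Optional[str]:
    """Resolve the intention configured for a touch mode, or None if none is configured."""
    return TOUCH_MODE_INTENTS.get(touch_mode)
```

Because the lookup runs on the client, no round trip to the server is needed to resolve the intention of a touch instruction.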

Alternatively, the interactive instruction for the intelligent voice assistant is sent to the server so that the server performs semantic recognition on it to obtain the intention of the interactive instruction. The configuration information is then queried according to the intention to obtain the configuration items conforming to the intention. The configuration items may be the scene resource configuration items, model resource configuration items, skin resource configuration items, action resource configuration items, and conversational resource configuration items generated in the resource configuration stage.

In some examples, the intention of the interactive instruction is obtained by having the server analyze the instruction: the client sends the interactive instruction for the intelligent voice assistant to the server, the server performs semantic recognition on it and returns the resulting intention to the client, and the client queries the configuration information according to that intention to obtain the configuration item conforming to it.

In some embodiments, semantic analysis is performed on the interactive instruction through a speech recognition module, a Text-to-Speech (TTS) module, and a voiceprint signal processing module in the server to obtain the intention of the interactive instruction, and the configuration information is then queried according to the intention to obtain a configuration item conforming to the intention. As in the above embodiments, the intention here corresponds to configuration items in the configuration information, such as configuration items for props, scenes, and active times.

In step 405, the client extracts the image resource and the conversational resource indicated in the configuration item from the resources issued by the server.

Here, the configuration item file in the configuration information indicates the image resources and the conversational resources, and the indicated image resources and conversational resources are read by means of the configuration item file. The image resources include visually presented resources such as scene resources, action resources, skin resources, prop resources, and special effect resources. The conversational resources are configured text resources or voice resources.
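Step 405 can be pictured as resolving the names listed in a configuration item against the resources already issued by the server. The following sketch assumes that a configuration item maps resource kinds to resource names and that the issued resources are held in a dictionary keyed by name. The function `extract_resources`, the field names `image` and `conversational`, and the reporting of missing names are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple


def extract_resources(config_item: dict, issued_resources: Dict[str, object]
                      ) -> Tuple[Dict[str, object], Optional[object], List[str]]:
    """Look up the image and conversational resources named by a configuration item."""
    image: Dict[str, object] = {}
    missing: List[str] = []
    # Image resources: scene, action, skin, prop, special effect, and so on.
    for kind, name in config_item["image"].items():
        if name in issued_resources:
            image[kind] = issued_resources[name]
        else:
            missing.append(name)  # e.g. only part of the resources was issued
    # Conversational resource: a configured text or voice resource.
    speech_name = config_item["conversational"]
    speech = issued_resources.get(speech_name)
    if speech is None:
        missing.append(speech_name)
    return image, speech, missing
```

Reporting missing names instead of failing would let a client request the remaining resources later, which fits the case where the server issued only part of the resources.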

In step 406, the client controls the avatar to present the image conforming to the interactive instruction based on the image resources, and controls the avatar to play the voice conforming to the interactive instruction based on the conversational resources.

In some embodiments, step 406 may be implemented as follows: when the intention of the interactive instruction is to have a conversation with the avatar, the avatar is controlled, based on the image resources corresponding to the conversation, to perform a feedback action matching the emotion of the conversation result, and is controlled, based on the conversational resources corresponding to the conversation, to play voice that conforms to the question-and-answer characteristics of the conversation.

In some embodiments, when the intention of the interactive instruction is to have a conversation with the avatar, for example a question-answering session, image resources may be configured specifically for the question-answering intention. The corresponding scene in the image resources may be a scene related to question answering, for example a constructed quiz program scene. The corresponding skin may be that of a participant in the question-answering session, and the corresponding action may be an action that fits the question answering.

For example, for asking a question, the corresponding action may be a questioning action. For answering, it may be an answering action. For evaluating an answer, it may be an action matching the characteristics of the answer result. The configuration items found by the client in step 404 may further include configuration items such as skin and special effects, so that the avatar is presented with the corresponding skin and special effect resources. During feedback, the skin, special effects, scene, and action under the image resource configuration items of the intention may also change as the conversation result changes. In addition, the voice interactive instruction may be answered by a question-answering model based on a neural network.

In some embodiments, step 406 may be implemented as follows: when the intention of the interactive instruction is to control a device, the avatar is controlled, based on the image resource corresponding to the control operation, to present the control operation on the device, and is controlled, based on the conversational resource corresponding to the control operation, to play voice representing the control result.

In some embodiments, when the intention of the interactive instruction is to control a smart device in the home through the avatar on the client of the in-vehicle system, for example to turn on a device in the home, a specifically configured image resource may be provided. The scene in the image resource may be one in which the avatar turns on the device, and the action in the image resource may be an action that fits the control operation: for the control operation of turning on a light, for example, the action may be pressing a light switch while the light serving as a prop lights up. The configuration items found by the client in step 404 may further include configuration items such as skin, special effect and prop, so that the avatar is presented based on the corresponding skin, special-effect and prop resources. While the control operation is carried out, the presented control result changes with the control state, and the skin, special effect, scene and action under the image resource configuration item of this intention may change as well.

In some embodiments, step 406 may be implemented as follows: when the intention of the interactive instruction is to view in-vehicle device messages, the avatar is controlled, based on the image resource corresponding to viewing in-vehicle device messages, to present an action of viewing the messages, and is controlled, based on the conversational resource corresponding to viewing in-vehicle device messages, to play the voice of the messages.

In some embodiments, when the intention of the interactive instruction is to leave a message for, or query messages from, the intelligent robot at home that corresponds to the client, through the avatar on the client of the in-vehicle system, a specifically configured image resource exists for this in-vehicle message intention. The scene in the image resource may be a scene of leaving or querying a message, and the action in the image resource may be an action of viewing an in-vehicle device message or of leaving a message for the intelligent robot. The configuration items found by the client in step 404 may further include configuration items such as skin, special effect and prop, so that the avatar is presented based on the corresponding skin, special-effect and prop resources. While a message is being left or queried in the vehicle, the skin, special effect, scene and action under the image resource configuration item of this intention may change with the state of the intelligent robot during the message-leaving and message-querying processes. In addition, the voice interaction process may be embodied as the avatar outputting the message content as voice.

In the following, an exemplary application of the embodiments of the present invention in a practical application scenario will be described.

The interactive system of the intelligent voice assistant mainly comprises a background server that provides AI voice and semantic recognition, a cloud server responsible for image resource management and conversational configuration management, and a terminal responsible for parsing the conversational configuration, handling user interaction and rendering the three-dimensional intelligent voice assistant. The scheme is currently used in gaming machine applications (Android operating system / Apple operating system) and in in-vehicle systems.

Referring to fig. 5, fig. 5 is a module architecture diagram of an interactive system of an intelligent voice assistant according to an embodiment of the present invention. As shown in fig. 5, the architecture is divided into a terminal portion and a server portion, and the server portion is divided into a background server and a cloud server.

As for the terminal, its service capability refers to the functions of the application client, such as the image management function, the text-to-speech playing function, the intelligent voice assistant development function, the mall function and the function of interacting with the avatar.

The image management function is realized by the resource management modules, which comprise: an action management module, a sound management module, a scene management module, a light management module, a prop management module, a special effect management module and a configuration item query module. For the text-to-speech playing function, the application client converts text into speech and plays it. For the development function, a measure of intimacy between the avatar and the user is introduced: as intimacy increases, action resources and conversational resources that were previously locked can be unlocked and loaded, that is, the avatar gains the action resources and conversational resources corresponding to the intimacy level; likewise, locked items in the mall can be unlocked for purchase, and locked functions can be unlocked so that the avatar has the functions corresponding to the intimacy level. For the mall function, a virtual mall is provided in which the user can purchase skins and props for the avatar.
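
The intimacy-based unlocking described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the `Unlockable` type, the `CATALOG` entries and the numeric thresholds are all hypothetical, since the description does not specify how intimacy is measured or stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unlockable:
    """A resource, mall item or function gated by intimacy (hypothetical model)."""
    name: str
    kind: str               # "action", "script", "mall_item" or "function"
    required_intimacy: int  # assumed numeric threshold


# Illustrative catalog; names and thresholds are invented for the example.
CATALOG = [
    Unlockable("wave", "action", 0),
    Unlockable("dance", "action", 10),
    Unlockable("good_morning", "script", 5),
    Unlockable("party_hat", "mall_item", 20),
]


def unlocked(catalog, intimacy):
    """Return the names of entries whose threshold is at or below the intimacy level."""
    return [u.name for u in catalog if u.required_intimacy <= intimacy]


def newly_unlocked(catalog, old_intimacy, new_intimacy):
    """Return the names that become available when intimacy rises from old to new."""
    before = set(unlocked(catalog, old_intimacy))
    return [name for name in unlocked(catalog, new_intimacy) if name not in before]
```

With this sketch, a client could call `newly_unlocked` after each intimacy change and load only the resources that have just become available.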

The avatar interaction function includes: dancing, touch feedback, prop action, scene switching, sound and mouth shape, long standby and role switching. For dancing, the avatar is controlled to dance according to preset dance motions. For touch feedback, touching the avatar makes it perform a corresponding feedback action; for example, when a hand of the avatar is tapped, the avatar waves. For prop action, a prop is configured for the avatar, and the avatar performs the action corresponding to the prop. For scene switching, the scene where the avatar is located is switched. For sound and mouth shape, when the avatar outputs voice, the mouth shape corresponding to the voice is looked up so that the avatar's voice matches its mouth shape. Long standby refers to controlling the active time of the avatar. Role switching refers to controlling the intelligent voice assistant to switch to a different avatar.

Besides the resource management modules mentioned above, the terminal is also provided with a task management module, basic function modules, an adaptation module and an underlying native framework. The basic function modules comprise: a network function module, a storage function module, a log reporting module, a text-to-speech module, an input module and an animation module. The resource management modules and the basic function modules are developed within the rendering engine. The adaptation module is the adaptation layer through which the rendering engine communicates with the underlying layer: for example, when a system volume adjustment interface is needed in the rendering engine, an Android/Apple interface has to be called. The native framework layer consists of these Android/Apple interfaces.
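
The role of the adaptation layer can be illustrated with a small sketch. It assumes an abstract `PlatformAdapter` interface with one implementation per platform; all class and method names are hypothetical, and the platform calls are stubbed with strings because the description does not name the underlying Android or Apple interfaces.

```python
from abc import ABC, abstractmethod


class PlatformAdapter(ABC):
    """Adaptation layer between the rendering engine and the native framework."""

    @abstractmethod
    def set_system_volume(self, level: int) -> str:
        """Forward a volume change to the platform and describe the call made."""


class AndroidAdapter(PlatformAdapter):
    def set_system_volume(self, level: int) -> str:
        # A real adapter would call the platform's audio interface here.
        return f"android:volume={level}"


class AppleAdapter(PlatformAdapter):
    def set_system_volume(self, level: int) -> str:
        # A real adapter would call the platform's audio interface here.
        return f"apple:volume={level}"


class RenderingEngine:
    """Engine-side code that depends only on the adapter interface."""

    def __init__(self, adapter: PlatformAdapter):
        self._adapter = adapter

    def adjust_volume(self, level: int) -> str:
        # Clamp to an assumed 0..100 range before delegating to the platform.
        level = max(0, min(100, level))
        return self._adapter.set_system_volume(level)
```

Keeping the engine code against the interface means the same resource management and basic function modules can run unchanged on either platform.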

The background server contains a voice recognition module, a text-to-speech module and a voiceprint signal processing module. The voice recognition module performs voice recognition on speech, the text-to-speech module converts text into speech for playback, and the voiceprint signal processing module performs permission recognition on speech, identifying the speaker from the voice. The cloud server contains a cloud content management system comprising a resource management end and a conversational configuration management end. The resource management end manages image resources such as skin resources, scene resources, action resources and prop resources, and the conversational configuration management end configures conversational scripts based on the image resources.

The supporting modules that support the terminal's service functions are implemented by the modules built into the background server and the cloud server, and comprise intention recognition and cloud management. Intention recognition covers the chat intention, the knowledge question-and-answer intention, the weather query intention, the home-controls-vehicle intention, the vehicle-controls-home intention and the in-vehicle device message intention. For the chat intention, the interaction between the user and the intelligent voice assistant is casual chat; for the knowledge question-and-answer intention, the interaction is a knowledge question-and-answer session; for the weather query intention, the interaction is a weather query. For the home-controls-vehicle intention, the intelligent voice assistant on the mobile terminal controls the air conditioner in the vehicle or starts the vehicle; for the vehicle-controls-home intention, the intelligent voice assistant on the in-vehicle system controls smart devices in the home; for the in-vehicle device message intention, a message is left for the intelligent robot through the intelligent voice assistant on the in-vehicle system, the intelligent robot being the robot corresponding to the intelligent voice assistant. Cloud management comprises image resource management, action and conversational script management, and resource and configuration version management.

Referring to fig. 6, fig. 6 is a resource configuration architecture diagram of the intelligent voice assistant according to an embodiment of the present invention. As shown in fig. 6, a content provider first registers a client id on the server. After obtaining the client id, the content provider may upload image resources to the server as follows: create an avatar, for which the system generates a corresponding avatar identifier; create different resource versions for the avatar; and add image resources such as scenes and actions to the corresponding resource version. After the resources are added, the conversational scripts are configured for the avatar identifier: select the resource version, import the corresponding action resources, and select the usage scenario and usage condition under which each action and script applies. The configuration specifies which action should be triggered under which scenario and condition. For example, for the interactive instruction "how is the weather today", the background returns the usage condition "rain", and the action configuration is searched with the usage scenario "weather" and the usage condition "rain" to find the matching configuration item, which determines the action to be used under the current scenario and condition.
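
The scenario-and-condition lookup in the weather example can be sketched as below. The `ConfigItem` fields, the action names and the script texts are assumptions made for illustration; only the "weather" scenario and the "rain" condition come from the description.

```python
from typing import NamedTuple, Optional


class ConfigItem(NamedTuple):
    """One configuration item: which action and script apply under a scenario and condition."""
    scene: str
    condition: str
    action: str   # hypothetical action resource name
    script: str   # hypothetical conversational script text


# Illustrative configuration; action names and scripts are invented.
CONFIG = [
    ConfigItem("weather", "rain", "open_umbrella", "It is raining today, remember an umbrella."),
    ConfigItem("weather", "sunny", "put_on_sunglasses", "It is sunny today."),
]


def find_config_item(config, scene: str, condition: str) -> Optional[ConfigItem]:
    """Return the first item matching both the usage scenario and the usage condition."""
    for item in config:
        if item.scene == scene and item.condition == condition:
            return item
    return None
```

For the instruction "how is the weather today" with the returned condition "rain", `find_config_item(CONFIG, "weather", "rain")` yields the item whose action the avatar should perform; an unmatched pair returns `None`, leaving the caller to decide on a fallback.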

The resource version serves version management. Because resources may differ between versions, a resource version may be tied to a client version: for example, the resource version for a version 1.0 client contains only 10 actions, whereas the resource version for a version 2.0 client contains 20 actions. This is controlled by resource version management.
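
One way to tie resource versions to client versions is sketched below. The selection rule (pick the newest resource version whose minimum client version does not exceed the requesting client) is an assumption, as are the `min_client` and `resource_version` field names; the description only states that a 1.0 client sees 10 actions and a 2.0 client sees 20.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.0' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Illustrative table; the identifiers "r1" and "r2" are hypothetical.
RESOURCE_VERSIONS = [
    {"min_client": "1.0", "resource_version": "r1", "actions": 10},
    {"min_client": "2.0", "resource_version": "r2", "actions": 20},
]


def select_resource_version(client_version, versions=RESOURCE_VERSIONS):
    """Pick the newest resource version that the given client version can use."""
    client = parse_version(client_version)
    eligible = [v for v in versions if parse_version(v["min_client"]) <= client]
    if not eligible:
        return None  # client is older than every known resource version
    return max(eligible, key=lambda v: parse_version(v["min_client"]))
```

Under this rule a hypothetical 1.5 client would still receive the 10-action version, since it does not yet meet the 2.0 requirement.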

Referring to fig. 7, fig. 7 is a flowchart of the terminal of the intelligent voice assistant according to an embodiment of the present invention. As shown in fig. 7, a content provider configures, on the cloud server, the resources of the three-dimensional avatar of the intelligent voice assistant, including action resources, model resources, scene resources and the action and conversational script configuration policy; if the avatar has multiple actions, the resources of all of them need to be uploaded. The uploaded resources may be stored in a cloud database, and corresponding configuration information is generated from them. The background server packages the resources and the configuration information from the cloud database according to the agreed message format and content, and sends the packaged configuration information to the client through a communication protocol. The client downloads, decompresses and loads the resources according to the configuration information sent by the background. When a user issues a voice instruction (such as "what day of the week is it today") or touches the avatar on the screen (such as tapping the head of the three-dimensional avatar), the client queries the configuration according to the instruction. If a matching configuration is found, the client plays the corresponding action and conversational script using the configured resources; otherwise it plays the default action and conversational script.
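
The client-side step of this flow, querying the configuration and falling back to the default when nothing matches, can be sketched as follows. The package layout, the `(kind, payload)` keys and the file names are hypothetical, because the agreed message format is not given in the description.

```python
# Illustrative package as it might look after download and decompression.
PACKAGE = {
    "resources": {
        "actions": {
            "nod": "nod.anim",
            "shake_head": "shake_head.anim",
            "idle": "idle.anim",
        },
    },
    "config": {
        # Default configuration item, used when no specific item matches.
        "default": {"action": "idle", "script": "I am here."},
        "items": {
            ("voice", "what day of the week is it today"): {
                "action": "nod",
                "script": "Let me check the calendar.",
            },
            ("touch", "head"): {
                "action": "shake_head",
                "script": "That tickles.",
            },
        },
    },
}


def respond(package, kind, payload):
    """Return (animation file, script) for an instruction, or the defaults if unmatched."""
    config = package["config"]
    key = (kind, payload.strip().lower())
    item = config["items"].get(key, config["default"])
    animation = package["resources"]["actions"][item["action"]]
    return animation, item["script"]
```

Here `kind` distinguishes a voice instruction from a touch on the avatar, and an instruction without a configured item simply resolves to the default action and script.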

The following continues with an exemplary structure in which the interaction device 455 of the intelligent voice assistant provided by the embodiments of the present invention is implemented as software modules. In some embodiments, as shown in FIG. 2A, the software modules of the interaction device 455 stored in the memory 450 may include:

a resource management module 4551, configured to obtain resources of an avatar and configuration information corresponding to the resources from a server, and extract model resources of the avatar corresponding to the intelligent voice assistant from the resources;

a configuration item query module 4552, configured to query the configuration information according to the interaction instruction corresponding to the intelligent voice assistant, so as to obtain a configuration item meeting the interaction instruction;

a resource extraction module 4553, configured to extract an image resource and a conversational resource indicated in the configuration item from resources delivered by the server;

an animation function module 4554, configured to present an avatar of the intelligent voice assistant according to the model resource and the resource indicated by the default configuration item in the configuration information; and the voice playing module is used for controlling the virtual image to present an image conforming to the interactive instruction based on the image resource and controlling the virtual image to play voice conforming to the interactive instruction based on the speech resource.

In some embodiments, the resource management module 4551 is further configured to:

submitting a resource acquisition request carrying a version of the client and an avatar identifier of the corresponding avatar to the server, such that

and receiving the resources which are adapted to the version of the client and correspond to the virtual image and the configuration information of the corresponding resources, which are sent by the server.

In some embodiments, the animation function module 4554 is further configured to:

Animation function module 4554 also used for

And playing the voice of the corresponding virtual image according to the conversational resources indicated by the default configuration items in the configuration information.

In some embodiments, the configuration item query module 4552 is further configured to:

when the interactive instruction comprises a voice interactive instruction, performing voice recognition and semantic recognition on the voice interactive instruction to obtain scene information representing the environment where the virtual image is located and condition information representing keywords for executing the voice interactive instruction, and querying the configuration information according to the scene information and the condition information to obtain configuration items conforming to the scene information and the condition information; and/or

when the interactive instruction is a touch interactive instruction, presenting intention options corresponding to at least one of the following for the intelligent voice assistant: intelligent conversation, device control, in-vehicle device message leaving, switching the virtual image of the intelligent voice assistant, adding a prop to the intelligent voice assistant, switching the scene where the intelligent voice assistant is located, and setting the active time of the intelligent voice assistant; and querying the configuration information based on the intention represented by the selected intention option to obtain configuration items conforming to the intention; and/or

and sending the interactive instruction corresponding to the intelligent voice assistant to a server so that the server performs semantic recognition on the interactive instruction to obtain the intention of the interactive instruction, and inquiring configuration information according to the intention to obtain a configuration item according with the intention.

when the intention of the interactive instruction is to hold a conversation with the virtual image, controlling the virtual image to perform a feedback action corresponding to the emotion of the conversation result based on the image resource corresponding to the conversation, and controlling the virtual image to play voice conforming to the question-and-answer characteristics of the conversation based on the conversational resource corresponding to the conversation; and/or

when the intention of the interactive instruction is to control a device, controlling the virtual image to present the control operation on the device based on the image resource corresponding to the control operation, and controlling the virtual image to play voice representing the control result based on the conversational resource corresponding to the control operation; and/or

when the intention of the interactive instruction is to check the car machine messages, the virtual images are controlled to present actions of checking the car machine messages based on the image resources corresponding to the checked car machine messages, and the virtual images are controlled to play the voices of the car machine messages based on the speech resources corresponding to the checked car machine messages.

The following continues with an exemplary structure in which the interaction device 255 of the intelligent voice assistant provided by the embodiments of the present invention is implemented as software modules. In some embodiments, as shown in FIG. 2B, the software modules of the interaction device 255 stored in the memory 250 may include:

the resource configuration module 2551 is used for acquiring resources of the virtual image of the intelligent voice assistant and generating configuration information of the corresponding resources;

the resource issuing module 2552 is configured to issue the resources of the intelligent voice assistant and the configuration information of the corresponding resources to the client, so that the client performs the following operations:

presenting an avatar of the voice assistant;

based on the image resource, controlling the virtual image to present the image conforming to the interactive instruction, and

and controlling the virtual image to play the voice which accords with the interactive instruction based on the speech technology resource.

In some embodiments, resource configuration module 2551 is further configured to:

wherein the image resource comprises at least one of: scene resources, model resources, skin resources, and action resources;

when receiving the image resource of new virtual image, distributing new version identification for the received image resource, and

and when the image resources of the intelligent voice assistant are updated resources of the existing virtual image, updating the image resources and the configuration information of the corresponding version of the existing virtual image.
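
The two cases handled by the resource configuration module, a new virtual image versus an update to an existing one, can be sketched as follows. The `ResourceRegistry` class, the `v1`, `v2` identifier scheme and the shape of the generated configuration are hypothetical placeholders; the description does not specify how version identifiers or configuration information are represented.

```python
import itertools


class ResourceRegistry:
    """Server-side store that assigns version identifiers to uploaded image resources."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._store = {}  # avatar_id -> {"version", "resources", "config"}

    def receive(self, avatar_id, resources):
        """Store resources (name -> kind) and return the version identifier used."""
        entry = self._store.get(avatar_id)
        if entry is None:
            # New virtual image: allocate a new version identifier.
            entry = {"version": f"v{next(self._counter)}", "resources": {}, "config": {}}
            self._store[avatar_id] = entry
        # Existing virtual image: update resources of its current version in place.
        entry["resources"].update(resources)
        # Regenerate placeholder configuration information from the stored resources.
        entry["config"] = {name: {"kind": kind} for name, kind in entry["resources"].items()}
        return entry["version"]

    def entry(self, avatar_id):
        """Return the stored entry for an avatar identifier."""
        return self._store[avatar_id]
```

In this sketch an update keeps the version identifier stable, so clients bound to that version pick up the refreshed resources and configuration together.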

In some embodiments, the apparatus further comprises: a speech semantic recognition module 2553 configured to:

receiving a corresponding interactive instruction of the intelligent voice assistant uploaded by the client;

when the interactive instruction is a voice interactive instruction, voice recognition is carried out on the voice interactive instruction to obtain text information, semantic recognition is carried out on the text information to obtain the intention of the interactive instruction, and the intention is sent to the client side so that the interactive instruction can be sent to the client side

Controlling the virtual image to play the voice conforming to the interactive instruction according to the speech resources indicated in the configuration information;

wherein the intent includes at least one of:

intelligent conversation, device control, in-vehicle device message leaving, switching the virtual image of the intelligent voice assistant, adding a prop to the intelligent voice assistant, switching the scene where the intelligent voice assistant is located, and setting the active time of the intelligent voice assistant.

In some embodiments, the apparatus further comprises: a voiceprint signal processing module 2554 to:

according to the voiceprint characteristic parameters, the identity of a user initiating a voice interaction instruction is identified;

when the user is identified as an authorized user, determining to continue semantic identification on the text information;
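
The voiceprint gate can be illustrated with a short sketch in which semantic recognition only runs for an authorized speaker. Comparing feature vectors by cosine similarity against enrolled references, and the 0.9 threshold, are assumptions chosen for the example; the description does not state how voiceprint characteristic parameters are matched.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(features, enrolled, threshold=0.9):
    """Return the best-matching enrolled user, or None if nobody reaches the threshold."""
    best_user, best_score = None, threshold
    for user, reference in enrolled.items():
        score = cosine_similarity(features, reference)
        if score >= best_score:
            best_user, best_score = user, score
    return best_user


def gate(features, text, enrolled, recognise_intent):
    """Run semantic recognition on the text only when the speaker is authorized."""
    user = identify_speaker(features, enrolled)
    if user is None:
        return None  # unauthorized speaker: stop before semantic recognition
    return user, recognise_intent(text)
```

The `recognise_intent` callable stands in for the semantic recognition step, so the gate itself stays independent of how intentions are derived.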

for each candidate intention of the interactive instruction, the following is performed:

and associating the resources for responding to the intention with scene information and condition information corresponding to the intention to form corresponding configuration items so as to form configuration information corresponding to the intention based on the combination of the configuration items.

binding the avatar resource and the dialog resource for responding to the control device intention, the scene information and the condition information corresponding to the control device intention to generate a corresponding configuration item, wherein the control device intention comprises at least one of the following: controlling the household equipment through a client in the vehicle, and controlling the vehicle through the client in the household equipment;

and binding the image resources and the dialogue resources for responding the vehicle-mounted device message checking intention with the scene information and the condition information corresponding to the vehicle-mounted device message checking intention so as to generate corresponding configuration items.
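
The binding of resources to scene information and condition information can be sketched as below. The dictionary layout and the example intention, scene, condition and resource names are hypothetical; the sketch only shows configuration items being formed and then grouped per intention into configuration information.

```python
def bind(intent, scene, condition, image_resource, script_resource):
    """Form one configuration item by binding resources to a scene and a condition."""
    return {
        "intent": intent,
        "scene": scene,
        "condition": condition,
        "image": image_resource,
        "script": script_resource,
    }


def build_configuration(items):
    """Group configuration items by intention to form the configuration information."""
    configuration = {}
    for item in items:
        configuration.setdefault(item["intent"], []).append(item)
    return configuration


# Illustrative items; all names and script texts are invented for the example.
ITEMS = [
    bind("device_control", "home", "light_on", "press_switch", "The light is on."),
    bind("device_control", "home", "light_off", "press_switch", "The light is off."),
    bind("view_message", "vehicle", "has_message", "read_note", "You have one message."),
]
CONFIGURATION = build_configuration(ITEMS)
```

A client that later resolves an intention together with its scene and condition can then select the matching item from `CONFIGURATION[intent]`.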

a memory for storing executable instructions;

and a processor, configured to execute the executable instructions stored in the memory, to implement the interaction method of the intelligent voice assistant provided by the embodiment of the present invention, for example, as shown in fig. 3 and fig. 4A-4B.

Embodiments of the present invention provide a storage medium storing executable instructions, which when executed by a processor, will cause the processor to perform the methods provided by embodiments of the present invention, for example, as shown in fig. 3 and fig. 4A-4B, which illustrate the interactive methods of the intelligent voice assistant.

In some embodiments, the storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.

In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).

By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.

In summary, in the embodiments of the present invention, resources are uploaded on the background side and rendered by the client on the terminal, which realizes voice interaction and image-based interaction between the user and the intelligent voice assistant. This forms a complete intelligent voice assistant solution covering both the terminal and the background, which content providers can adopt quickly and at low cost so that their application products gain the functions of an intelligent voice assistant.

The above description is only an example of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims

1. An interactive method for an intelligent voice assistant, the method comprising:

based on the image resource, controlling the virtual image to present an image conforming to the interactive instruction, and

and controlling the virtual image to play the voice conforming to the interactive instruction based on the speech technology resource.

2. The method of claim 1,

the acquiring of the resources of the avatar and the configuration information corresponding to the resources from the server includes:

and receiving the resources which are adapted to the version of the client and correspond to the virtual image and the configuration information corresponding to the resources, wherein the resources are issued by the server.

3. The method of claim 1, wherein presenting the avatar of the intelligent voice assistant according to the model resources, resources indicated by default configuration items in the configuration information, comprises:

wherein the default persona comprises: a default skin and a default prop of the avatar;

the method further comprises the following steps:

4. The method of claim 1, wherein the querying the configuration information according to the interactive instruction corresponding to the intelligent voice assistant to obtain the configuration item according with the interactive instruction comprises:

when the interactive instruction comprises a voice interactive instruction, performing voice recognition and semantic recognition on the voice interactive instruction to obtain scene information representing the environment where the virtual image is located and condition information representing keywords for executing the voice interactive instruction, and querying the configuration information according to the scene information and the condition information to obtain configuration items conforming to the scene information and the condition information; and/or

when the interactive instruction is a touch interactive instruction, presenting intention options corresponding to at least one of the following for the intelligent voice assistant: intelligent conversation, device control, in-vehicle device message leaving, switching the virtual image of the intelligent voice assistant, adding a prop to the intelligent voice assistant, switching the scene where the intelligent voice assistant is located, and setting the active time of the intelligent voice assistant; and querying the configuration information based on the intention represented by the selected intention option to obtain a configuration item conforming to the intention; and/or

and sending the interactive instruction corresponding to the intelligent voice assistant to the server so that the server carries out semantic recognition on the interactive instruction to obtain the intention of the interactive instruction, and inquiring the configuration information according to the intention to obtain the configuration item conforming to the intention.

5. The method according to any of claims 1-4, wherein said controlling said avatar to present an avatar conforming to said interactive instructions based on said avatar resources comprises:

when the intention of the interactive instruction is to hold a conversation with the virtual image, controlling the virtual image to perform a feedback action corresponding to the emotion of the conversation result based on the image resource corresponding to the conversation, and controlling the virtual image to play voice conforming to the question-and-answer characteristics of the conversation based on the conversational resource corresponding to the conversation; and/or

when the intention of the interactive instruction is to control a device, controlling the virtual image to present the control operation on the device based on the image resource corresponding to the control operation, and controlling the virtual image to play voice representing the control result based on the conversational resource corresponding to the control operation; and/or

6. An interactive method for an intelligent voice assistant, the method comprising:

presenting an avatar of the voice assistant;

7. The method of claim 6, wherein the obtaining resources for the avatar of the intelligent voice assistant comprises:

the generating configuration information corresponding to the resource includes:

8. The method of claim 6, wherein after issuing the resources of the intelligent voice assistant and the configuration information corresponding to the resources to a client, the method further comprises:

wherein the intent comprises at least one of:

9. The method of claim 8, wherein prior to semantically identifying the textual information, the method further comprises:

10. The method of claim 7, wherein the generating configuration information corresponding to the resource comprises:

11. The method of claim 10, wherein associating the resource for responding to the intention with the scenario information and condition information corresponding to the intention forms a corresponding configuration item, comprising:

12. An interactive apparatus of an intelligent voice assistant, the apparatus comprising:

13. An interactive apparatus of an intelligent voice assistant, the apparatus comprising:

presenting an avatar of the voice assistant;

14. An intelligent voice assistant interaction device, the device comprising:

a memory for storing executable instructions;

a processor for implementing the intelligent voice assistant interaction method of any one of claims 1 to 5 or 6 to 11 when executing the executable instructions stored in the memory.

15. A storage medium having stored thereon executable instructions for causing a processor to perform the method of intelligent voice assistant interaction of any of claims 1-5 or 6-11 when executed.