CN113157241A - Interaction equipment, interaction device and interaction system - Google Patents

Interaction equipment, interaction device and interaction system

Info

Publication number
CN113157241A
CN113157241A (application number CN202110482187.3A)
Authority
CN
China
Prior art keywords
information
module
display
server
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110482187.3A
Other languages
Chinese (zh)
Inventor
司马华鹏
周亚南
曹志惠
涂坤
樊景星
朱逸飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Guiji Intelligent Technology Co ltd
Original Assignee
Nanjing Guiji Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Guiji Intelligent Technology Co ltd filed Critical Nanjing Guiji Intelligent Technology Co ltd
Priority to CN202110482187.3A
Publication of CN113157241A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/162Interface to dedicated audio devices, e.g. audio drivers, interface to CODECs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • G06F9/44505Configuring for program initiating, e.g. using registry, configuration files
    • G06F9/4451User profiles; Roaming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/451Execution arrangements for user interfaces
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/14Session management
    • H04L67/143Termination or inactivation of sessions, e.g. event-controlled end of session
    • H04L67/145Termination or inactivation of sessions, e.g. event-controlled end of session avoiding end of session, e.g. keep-alive, heartbeats, resumption message or wake-up for inactive or interrupted session

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

An embodiment of the present application provides an interaction device, an interaction apparatus and an interaction system. The interaction device is communicatively connected to a server and to a display device, and comprises a housing and functional modules disposed inside the housing. The functional modules include: a camera module configured to acquire image information of a target area; a sound pickup module configured to acquire audio input information of the target area; a control module electrically connected to the camera module and the sound pickup module, configured to send the image information and/or the audio input information to the server so that the server can generate display information from them, and to send control instructions to the camera module and the sound pickup module; and a transmission interface configured to output the display information to the display device, wherein the display information includes at least motion information of an avatar and/or image presentation information.

Description

Interaction equipment, interaction device and interaction system
Technical Field
The application relates to the technical field of voice interaction, and in particular to an interaction device, an interaction apparatus and an interaction system.
Background
With the development of intelligent terminals, more and more service or public places, such as banks, shopping malls and exhibition halls, have begun to serve users through large-screen terminal devices with certain interactive functions. Such a large-screen terminal device generally integrates a camera, a loudspeaker, a microphone, a display, the necessary control devices and other related components inside itself; it acquires the corresponding information through these components and feeds information back to the user according to preset instructions, so as to interact with the user.
In the related art, most large-screen terminal devices adopt a design in which the required components, such as the camera, loudspeaker, microphone, display and necessary control devices, are integrated inside the device. This design places extremely high demands on the performance of the device's processing unit, so the production cost of the large-screen terminal device is high. Moreover, treating the whole large-screen terminal device as the unit of use inevitably means abandoning older large-screen devices that lack the interactive function, resulting in an excessively high cost of use for the end user.
For the problems in the related art that large-screen terminal devices with an integrated design are too costly to use and adapt poorly to their usage scenarios, no reasonable solution has yet been proposed.
Disclosure of Invention
The embodiments of the present application provide an interaction device, an interaction apparatus and an interaction system, intended at least to solve the problems in the related art that large-screen terminal devices with an integrated design are too costly to use and adapt poorly to their usage scenarios.
In an embodiment of the present application, an interaction device is provided. The interaction device is communicatively connected to a server and to a display device, and includes a housing and functional modules disposed inside the housing. The functional modules include: a camera module configured to acquire image information of a target area; a sound pickup module configured to acquire audio input information of the target area; a control module electrically connected to the camera module and the sound pickup module, configured to send the image information and/or the audio input information to the server so that the server can generate display information from them, and to send control instructions to the camera module and the sound pickup module; and a transmission interface configured to output the display information to the display device, wherein the display information includes at least motion information of an avatar and/or image presentation information.
In another embodiment of the present application, an interaction apparatus is further provided, including the interaction device and a display module described in the above embodiments. The interaction device is configured to send the acquired image information and audio input information to a server; the display module is configured to present display information, where the display information is generated by the server from the image information and the audio input information and includes at least motion information of an avatar and/or image presentation information.
In another embodiment of the present application, an interaction system is further provided, including the interaction apparatus and the server described in the above embodiments, wherein the server is configured to generate display information from the image information and audio input information acquired by the interaction device, the display information including at least motion information of an avatar and/or image presentation information.
In an embodiment of the present application, a computer-readable storage medium is also proposed, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above-described method embodiments when executed.
In an embodiment of the present application, there is further proposed an electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to perform the steps of any of the above method embodiments.
Through the embodiments of the present application, an interaction device is provided that includes a housing and functional modules disposed inside it, and that can handle the information exchange with both the server and the display device. The interaction device can be understood as a functional box, easy to carry and complete in function: through the camera module, sound pickup module, control module and transmission interface inside the box, it completes the acquisition of image and audio information and the generation of display information and control instructions, and only an external display device is needed to complete the interaction with the service object. This solves the problems in the related art that large-screen terminal devices with an integrated design are too costly to use and adapt poorly to their usage scenarios. Because the interaction device in the embodiments of the present application is independent of the display device, a display device of any function or form can be connected to it to form a large-screen terminal device capable of providing the corresponding services. Therefore, on the one hand, a user can build such a terminal device by attaching the interaction device to display equipment the user already owns, which reduces the replacement of existing equipment and thus lowers the user's cost of use; on the other hand, for some special usage scenarios, the interaction device can be connected directly to the display device targeted at that scenario to form a terminal device providing the corresponding services, thereby adapting to the requirements of the scenario.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a block diagram of an alternative interactive device architecture according to an embodiment of the present application;
FIG. 2 is a schematic connection diagram of an alternative interactive device according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an interactive device application according to an embodiment of the present application;
FIG. 4 is a block diagram of an alternative interactive apparatus according to an embodiment of the present application;
FIG. 5 is a block diagram of an alternative interactive system architecture according to an embodiment of the present application.
Detailed Description
The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Fig. 1 is a block diagram of an alternative interaction device according to an embodiment of the present application, and Fig. 2 is a connection schematic diagram of an alternative interaction device according to an embodiment of the present application. As shown in Fig. 1 and Fig. 2, an embodiment of the present application provides an interaction device 1 that is communicatively connected to a server 2 and to a display device 3. The interaction device 1 includes a housing 11 and functional modules 12 disposed inside the housing, and the functional modules 12 include:
a camera module 121 configured to acquire image information of a target area;
a pickup module 122 configured to acquire audio input information of a target area;
a control module 123, electrically connected to the camera module 121 and the sound pickup module 122 respectively; the control module 123 is configured to send the image information and/or the audio input information to the server 2 so that the server 2 can generate display information from them, and to send control instructions to the camera module 121 and the sound pickup module 122;
a transmission interface 124 configured to output display information to the display device 3, wherein the display information at least includes: motion information of the avatar, and/or image presentation information.
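As a rough illustration of how these four modules relate, the decomposition above can be sketched as follows. This is a minimal model, not the patented implementation; all class and function names are our own, and the server is a stub standing in for the remote service described later.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class DisplayInfo:
    """Display information returned by the server: avatar motion and/or an image frame."""
    avatar_motion: Optional[str] = None
    image_frame: Optional[bytes] = None

@dataclass
class ControlModule:
    """Forwards captured image/audio to the server and queues the result for the display.

    `server` stands in for the remote connection; `sent_to_display` stands in for
    the transmission interface to the display device.
    """
    server: Callable[[Optional[bytes], Optional[bytes]], DisplayInfo]
    sent_to_display: List[DisplayInfo] = field(default_factory=list)

    def handle_capture(self, image: Optional[bytes], audio: Optional[bytes]) -> None:
        info = self.server(image, audio)   # server generates the display information
        self.sent_to_display.append(info)  # output via the transmission interface

def stub_server(image, audio):
    """Stub server: produces a 'speak' motion whenever audio input is present."""
    return DisplayInfo(avatar_motion="speak" if audio else "idle", image_frame=image)

ctrl = ControlModule(server=stub_server)
ctrl.handle_capture(image=b"frame0", audio=b"hello")
```

In this sketch the camera and sound pickup modules are represented only by the `image` and `audio` arguments; the point is the routing: captured data goes to the server, and whatever the server returns is what reaches the display.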
It should be noted that "electrically connected" in the embodiments of the present application may refer to components in a circuit being connected by physical conductors capable of carrying electrical signals, such as PCB copper foil or wires; to components connected by a cable between corresponding interfaces for wired signal transmission and control; or to components linked by Bluetooth, radio frequency, Wi-Fi or other wireless means for wireless signal transmission and control.
It should be noted that the target area may be a peripheral area of the positions where the interaction device 1 and the display device 3 are located, may be a circular area with a diameter of 3-10 meters and centered on the interaction device 1, or may be a circular area with a diameter of 2-9 meters and centered on the display device 3.
The image information or image presentation information may include a static image frame, or a dynamic video, i.e. a sequence of continuous image frames.
It should be noted that the camera module 121 is generally a camera with the corresponding optical and mechanical components. The image/video information of the target area acquired by the camera module 121 may be face information of a user detected by the camera unit, or environmental information of the setting in which the interaction device is placed, for example a snapshot of the current environment at a given moment, or a real-time surveillance video stream of the current environment. Based on the camera module 121, computer vision processing such as face detection, face capture and face tracking can further be carried out, with subsequent analysis and application using existing computer vision techniques.
The sound pickup module 122 may be a microphone array. The audio input information of the target area acquired by the sound pickup module 122 may be environmental sound in an environment set by the interactive device, or may be sound input of the user detected by the sound pickup module 122.
The control module 123 generally includes a processor together with a video codec unit, an audio processing unit, a power supply unit, a communication and Wi-Fi unit, an OTA (Over the Air) unit, and the like. The video codec unit can encode the video stream acquired by the camera module for sending to the server, and can also decode the video stream sent by the server and forward it to the display device through the transmission interface. The audio processing unit generally includes speech recognition, hardware noise reduction, software noise reduction and VAD (Voice Activity Detection) units; it processes the audio input information acquired by the sound pickup module 122 to perform operations such as noise reduction and speech recognition, and may also perform A/D conversion on the audio input information and D/A conversion on the audio output information. The power supply unit handles power management, the communication and Wi-Fi unit establishes the connection between the interaction device and the server, and the OTA unit enables local firmware upgrades.
The transmission interface 124 is generally an HDMI (High-Definition Multimedia Interface) or VGA (Video Graphics Array) interface, and can also be extended through other interfaces or adapters to connect the interaction device to any display device and output the display information.
In the embodiment of the present application, the display device may be a display integrating a display function and other functions, for example, a desktop computer, a tablet computer, a television, or a display only having a display function, for example, a liquid crystal display, an LED screen, or the like.
In an embodiment, the interaction device 1 further comprises:
an audio output module 125, disposed inside or outside the housing 11 and configured to output first audio output information generated by the server 2 from the image information and the audio input information. The audio output module 125 may be a speaker.
The first audio output information may include music played for a specific scene, or a welcome or answer output when a service object is detected approaching the display device, for example "Hi, where do you want to go?" or "The toilet is beside XXX on this floor."
In one example, the interaction device body is a box comprising the housing 11 and the functional modules 12, with the camera module, sound pickup module, audio output module, control module and transmission interface all disposed inside the housing. In this example every component of the interaction device is integrated in the box, giving better integration and making deployment during use more convenient and flexible.
In another example, the interaction device body is a box comprising the housing 11 and the functional modules 12; the camera module, sound pickup module, control module and transmission interface are disposed inside the box, while the audio output module is disposed outside it. Setting the audio output module apart from the other components avoids the near-field noise that the audio output may otherwise introduce at the sound pickup module, improving its pickup and speech-recognition performance. In this example the audio output module may be placed in a separate box, or an audio output device such as a speaker may be used directly. In general, the audio output module can be placed away from the main body of the interaction device to further reduce near-field noise at the sound pickup module; for example, for a given display device, the main body of the interaction device may be mounted on top of the display while the audio output module is placed at its sides or bottom.
Fig. 3 is a schematic diagram of an interaction device application according to an embodiment of the present application. As shown in Fig. 3, the "multimodal interaction apparatus" in Fig. 3 corresponds to the aforementioned "interaction device 1", and the "display unit" corresponds to the aforementioned "display device 3". In use, the camera module and the sound pickup module acquire the corresponding image/video information and audio input information; the control module processes this information and sends it to the server, which generates the corresponding display information and audio output information according to preset voice and vision algorithm services; the server returns the display information and audio output information to the control module, which outputs the display information to the display through the transmission interface and the audio output information through the audio output module, completing the interaction with the user.
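One round of the interaction cycle just described (capture, server processing, output to display and speaker) can be sketched as below. The server function, the greeting text and the "person detected" condition are all illustrative stand-ins, not details taken from the patent.

```python
def interaction_cycle(camera_frame, mic_audio, server_fn):
    """One interaction cycle: send captured inputs to the server, route its outputs.

    `server_fn` stands in for the remote voice/vision algorithm service; it
    returns (display_info, audio_out).
    """
    display_info, audio_out = server_fn(camera_frame, mic_audio)
    return {
        "to_display": display_info,  # via the HDMI/VGA transmission interface
        "to_speaker": audio_out,     # via the audio output module
    }

def demo_server(frame, audio):
    """Stub server: greets when a person is 'detected' in the camera frame."""
    greeting = "Hi, where do you want to go?" if frame == "person" else ""
    return {"avatar_motion": "wave" if greeting else "idle"}, greeting

result = interaction_cycle("person", "", demo_server)
```

Here the returned dictionary plays the role of the control module fanning the server's response out to the two output paths (display and speaker).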
The motion information of the avatar may be motion matched with the image presentation picture, or motion matched with the audio output information. The avatar may be a virtual human figure, or a virtual animal, cartoon, mythological or fairy-tale figure.
In an embodiment, the control module 123 is further configured to send the image information and/or the audio input information to the server so that the server can generate dialogue script information from them, where the dialogue script information indicates the wording the avatar uses when interacting with the service object;
the audio output module 125 is further configured to output second audio output information when the avatar interacts with the service object according to the dialogue script information, where the second audio output information corresponds to the motion information of the avatar, the motion information of the avatar including facial motion information and limb motion information.
In one embodiment, the control module 123 is further configured to:
instruct the server to select, from a preset avatar action database and according to the image display information, a first action module corresponding to the image display information, where the first action module indicates the actions of the avatar while the picture corresponding to the image display information is displayed;
instruct the server to select, from the avatar action database and according to the dialogue script information, a second action module corresponding to the dialogue script information, where the second action module indicates the actions of the avatar while it interacts with the service object according to the dialogue script information;
the control module 123 is further configured to transmit the avatar motion information determined by the first action module and/or the second action module to the display device through the transmission interface;
the avatar action database includes a plurality of preset action modules, where each action module corresponds to one or more limb actions and/or facial actions of the avatar. An action module may correspond to a single limb action and/or facial action, or to a set of limb actions and/or a set of facial actions.
It should be noted that the image display information may include a static image frame, or a dynamic video, i.e. continuous image frames. The dialogue script information may include the scripting logic rules set for different application scenarios, and the avatar motion information may include the limb and/or facial actions used when introducing a product, as well as the limb and/or facial actions corresponding to the script used when replying to the user.
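The action-module selection described above amounts to a keyed lookup into the avatar action database. The sketch below illustrates that idea only; the database contents, keys and the fallback behaviour are our own assumptions, not specified in the patent.

```python
# Hypothetical avatar action database: each action module maps to one or more
# limb/facial actions, keyed by the display or script content it accompanies.
ACTION_DB = {
    "product_intro":   {"limb": ["point_at_screen"], "face": ["smile"]},  # first action module
    "answer_greeting": {"limb": ["nod"],             "face": ["smile"]},  # second action module
}

def select_action_module(content_key: str) -> dict:
    """Select the action module matching the given content, falling back to idle."""
    return ACTION_DB.get(content_key, {"limb": ["idle"], "face": ["neutral"]})

module = select_action_module("product_intro")
```

A single key may thus resolve to a set of limb and facial actions, matching the note that one action module can cover one action or a group of actions.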
For example, in the financial field, the image display information may be a display picture or video generated from the textual introduction of a financial product and the corresponding image material, and may include a picture introducing the financial product or a picture containing the avatar. The dialogue script information may include reply scripts automatically generated for the points an investor is likely to care about, e.g. the investment form, investment period, expected return, handling fees and risk level, generated in coordination with the image display information. The avatar motion information may include the limb and/or facial actions used when introducing the financial product, and the limb and/or facial actions corresponding to the script used when replying to the user's enquiries about the financial product.
In the field of education and training, the image display information may be a display picture or video generated from the textual introduction of a training course and the corresponding image material, and may include a picture introducing the course or a picture containing the avatar. The dialogue script information may include reply scripts automatically generated for the points a trainee is likely to care about, e.g. the course outline, course content, teacher profile, and the purpose or value of the course, generated in coordination with the image display information. The avatar motion information may include the limb and/or facial actions used when introducing the training course, and the limb and/or facial actions corresponding to the script used when replying to the user's enquiries about the course.
In the public service field, for example when a government investment promotion department wishes to publicize and explain a certain investment promotion policy, the image display information may be a display picture or video generated from the textual policy explanation and the corresponding image material, and may include a picture introducing the policy or a picture containing the avatar. The dialogue script information may include reply scripts automatically generated for the points an enterprise is likely to care about, e.g. which enterprises the policy applies to, when it takes effect and expires, the materials or procedures the enterprise must complete, and the preferential policies and tax benefits it can enjoy, generated in coordination with the image display information. The avatar motion information may include the limb and/or facial actions used when promoting the policy, and the limb and/or facial actions corresponding to the script used when replying to the enterprise's enquiries about the policy.
In an embodiment, the sound pickup module 122 is further configured to acquire voice data input by a service object in the target area, where the voice data includes real-time voice data and/or non-real-time voice data;
the control module 123 is further configured to instruct the server to select, from the avatar action database and according to the voice data and the dialogue script information, a target first action module and/or a target second action module corresponding to the voice data;
the control module 123 is further configured to obtain the target first action module and/or target second action module and push it to the display device through the transmission interface.
In an embodiment, the control module 123 is further configured to instruct the server to send to the control module the target uniform resource locator (URL) address corresponding to the target first action module and/or target second action module, so that the control module can obtain the target first action module and/or target second action module according to the target URL address;
the URL address indicates the address of a content delivery network (CDN) node that holds a first action module and/or a second action module, each CDN node corresponding to one URL address;
the control module 123 is further configured to send the target first action module and/or target second action module to the display device via the transmission interface.
It should be noted that, in the above embodiment, the second action modules in the avatar action database may each be placed on a corresponding CDN node, with that node's URL address corresponding to the second action module.
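The URL-based retrieval just described reduces to resolving a module identifier to its CDN node's address and downloading from it. The mapping table, URL scheme and download stub below are illustrative assumptions; a real implementation would issue an HTTP request instead of calling the stub.

```python
# Hypothetical mapping from action-module identifiers to CDN node URLs
# (one URL address per CDN node, as stated above).
CDN_URLS = {
    "answer_greeting": "https://cdn.example.com/actions/answer_greeting.bin",
}

def resolve_and_fetch(module_id: str, fetch=lambda url: f"<data from {url}>"):
    """Resolve the target action module to its CDN URL, then download it.

    `fetch` is a stand-in for the real HTTP download performed by the
    control module; here it just echoes the URL it was given.
    """
    target_url = CDN_URLS[module_id]
    return fetch(target_url)

payload = resolve_and_fetch("answer_greeting")
```

After this fetch, the control module would forward the downloaded action module to the display device through the transmission interface.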
The process of pushing the target second action module to the display device is illustrated below in combination with the operation of the interaction device described above:
in an example, the interactive device employs non-real-time voice functionality. In this example, a service object performs a voice query for a certain problem, and an Automatic Speech Recognition technology (ASR) module integrated with a pickup module in the interactive device performs semantic Recognition on an audio corresponding to the voice query to determine a query text corresponding to a query content of the service object; the intelligent dialogue module integrated by the server stores question-answer rules, and the intelligent dialogue module can inquire answer texts corresponding to the consultation texts in the preset question-answer rules. After the answer text is determined, the target second action module corresponding to the answer text can be determined in the avatar action database.
The interactive device downloads the target second action module from the corresponding CDN node according To the target URL address corresponding To the target second action module and then sends the target second action module To the display device, so that the answer Text is converted into corresponding answer audio through a Text To Speech (TTS) module integrated in the interactive device and is output through an audio output module, and the virtual image is made To interact with the service object according To the target second action module.
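The non-real-time voice flow above (ASR, question-answer rule lookup, TTS) can be sketched as a three-step pipeline. The ASR and TTS functions are stubs standing in for the integrated units, and the rule table and fallback answer are our own illustrative assumptions.

```python
# Stand-in for the server's preset question-answer rules.
QA_RULES = {"where is the toilet": "The toilet is beside the stairs."}

def asr(audio: str) -> str:
    """Stub ASR: in the device this is the integrated speech-recognition unit."""
    return audio.lower().strip("?")

def tts(text: str) -> bytes:
    """Stub TTS: returns 'synthesized' audio bytes for the answer text."""
    return text.encode()

def answer_query(audio: str):
    query_text = asr(audio)                                          # 1. recognize the query
    answer_text = QA_RULES.get(query_text, "Sorry, I don't know.")   # 2. look up answer text
    return answer_text, tts(answer_text)                             # 3. synthesize reply audio

text, audio_out = answer_query("Where is the toilet?")
```

In the described embodiment, step 2 would also select the target second action module matching `answer_text`, so the avatar's motion plays in sync with the synthesized reply.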
Compared with the related-art scheme in which the avatar required for interaction must be rendered in real time locally or on the server side and then streamed, the technical solution in this embodiment significantly reduces latency and hardware cost; in experiments, it reduced latency by 10 to 20 ms relative to the related art. Since the services provided by the interactive device often have very strict real-time requirements, this embodiment can noticeably improve the user experience when serving a service object.
It should be noted that the interactive device in the embodiment of the present application is independent of the display device. Therefore, a display unit of any function or form can be connected to the interactive device to form an apparatus capable of providing the corresponding service, i.e., the large-screen terminal of the related art. A user can thus build such an apparatus from display equipment they already own, such as a tablet computer, a liquid crystal display, or an LED screen, which avoids replacing existing display equipment and reduces the user's cost. Meanwhile, for certain special use scenarios, the interactive device can be connected directly to the display device suited to that scenario. For example, in a narrow environment where a large-screen terminal cannot be installed, a tablet computer or liquid crystal display can be hung on the wall and the interactive device attached to it, allowing rapid deployment in that environment.
On the other hand, for service forms with high demands on display-content processing, such as the avatar processing described above, the interaction device in the embodiment of the present application deploys the corresponding processing on a server independent of the display device. This lowers the hardware requirements of the local display device and the interaction device, which further controls hardware cost; and because no processing performance is demanded of the display device or the interaction device, service content of any form and processing requirement can be provided on this basis, further improving the service adaptability of the interaction device.
In addition, because the relevant components are integrated inside the interactive device, subsequent maintenance and upgrades only need to service the multi-modal interactive device itself. Since the interactive device is significantly smaller than the integrated large-screen terminal of the related art, the corresponding after-sales work is also more convenient.
In another embodiment of the present application, an interaction apparatus is provided. Fig. 4 is a block diagram of an alternative interaction apparatus according to an embodiment of the present application. As shown in Fig. 4, the apparatus includes the interactive device and the display module according to the above embodiments, where the interactive device is configured to send the acquired image information and audio input information to a server, and the display module is configured to display information generated by the server according to the image information and the audio input information, the display information at least including: motion information of the avatar, and/or image presentation information.
In an embodiment, the display module is further configured to:
receiving first display information sent by interactive equipment, wherein the first display information is display information subjected to decoding processing by the interactive equipment; and/or
and receiving second display information sent by the server, wherein the second display information is display information that has not been subjected to decoding processing.
In an embodiment, the display module is further configured to:
acquiring instruction information input by a user, wherein the instruction information at least comprises one of the following items: touch information, mouse input information and keyboard input information;
and sending the instruction information to the interaction equipment or the server.
In an embodiment, the interaction device is further configured to: sending a first registration request to a server; after the first registration request passes, sending a first heartbeat signal to the server to maintain connection with the server;
the display module is further configured to: sending a second registration request to the server; and after the second registration request passes, sending a second heartbeat signal to the server to maintain the connection with the server.
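The register-then-heartbeat handshake described above can be sketched as a small client with an injected transport. The message shapes, the `"ok"` reply, and the method names are assumptions for illustration, not the patent's protocol:

```python
import itertools

class InteractionClient:
    """Minimal sketch of a device that registers with the server and then
    keeps the connection alive with heartbeat signals. `send` is an
    injected transport callable (hypothetical API)."""

    def __init__(self, device_id, send):
        self.device_id = device_id
        self.send = send
        self.registered = False
        self._seq = itertools.count(1)  # heartbeat sequence numbers

    def register(self):
        reply = self.send({"type": "register", "device": self.device_id})
        self.registered = (reply == "ok")
        return self.registered

    def heartbeat(self):
        # Only a device whose registration request passed may keep the
        # connection alive, mirroring the two-step flow described above.
        if not self.registered:
            raise RuntimeError("register before sending heartbeats")
        return self.send({"type": "heartbeat",
                          "device": self.device_id,
                          "seq": next(self._seq)})
```

The same pattern would apply to the display module's second registration request and second heartbeat signal.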
In the above-mentioned interactive apparatus, the interactive device and the display module are independent from each other.
The display module is any display device capable of realizing the display function. In one example, referring to the process above, the interactive device sends the display information to the display module through the transmission interface for display; in this case the display module needs essentially no processing or computation of its own. In another example, the server sends the generated display information directly to the display module over a wireless communication link; in this case the display module needs a certain amount of processing or computing power. In yet another example, the two paths are combined: the multi-modal interaction device sends parts with relatively low real-time requirements and large data volume, such as video streams, to the display module through the transmission interface, while parts with relatively high real-time requirements and small data volume, such as text or image display, are sent over the wireless communication link. This satisfies real-time feedback needs such as character recognition during user interaction while reducing the processing load on the display module itself.
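As a rough illustration of the combined delivery path above, a link-selection helper could route each piece of display information by its latency sensitivity and size. The threshold value and link names here are assumptions for the sketch, not values from the patent:

```python
def choose_link(payload_bytes, realtime, *, size_threshold=512 * 1024):
    """Pick a delivery path for one piece of display information.

    Sketch of the split described above: small, latency-sensitive
    payloads (e.g. recognized text) go over the wireless link; large or
    latency-tolerant payloads (e.g. video streams) go over the wired
    transmission interface. The 512 KiB threshold is an assumption.
    """
    if realtime and payload_bytes <= size_threshold:
        return "wireless"
    return "transmission_interface"
```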
Correspondingly, while interacting with the user, the display module can feed user input such as touch, mouse, or keyboard information back to the interactive device through the transmission interface, which then forwards it to the server for processing; alternatively, the display module can send the input directly to the server over the wireless communication link.
In addition, to support management of display modules, a display module can also register with the server and maintain the connection through heartbeat signals, so that the server can control and manage the display modules connected to each interactive device.
In another embodiment of the present application, an interactive system is provided. Fig. 5 is a block diagram of an alternative interactive system according to an embodiment of the present application. As shown in Fig. 5, the interactive system includes the interaction apparatus and the server described in the above embodiments. The server is configured to generate display information according to the image information and audio input information acquired by the interactive device, where the display information at least includes: motion information of the avatar, and/or image presentation information.
In one embodiment, a server includes:
a media and resource module configured to access multimedia resources and publish the multimedia resources to a content distribution network;
the instant messaging module is configured to implement message exchange between the interactive device and/or the server and a human agent;
a voice communication module configured to receive, recognize and record audio data;
the intelligent dialogue module is configured to interact with a user through a robot, wherein the robot is a trained neural network model;
the conversation middleware module is configured to control and schedule the instant messaging module, the voice communication module, and the intelligent dialogue module;
and the video live broadcast module is configured to generate a real-time video stream from the image information acquired by the interactive equipment and store the real-time video stream.
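The middleware's control-and-scheduling role can be sketched as a dispatcher that routes one user turn through the other server modules, escalating to a human agent via the instant messaging module when no dialogue rule applies. All class names and method interfaces below are hypothetical illustrations, not the patent's actual APIs:

```python
class DialogMiddleware:
    """Sketch of the conversation middleware: route one user turn through
    the voice module (speech recognition), the intelligent dialogue module
    (answer lookup), and the instant-messaging module (human escalation)."""

    def __init__(self, voice, dialog, im):
        self.voice, self.dialog, self.im = voice, dialog, im

    def handle_turn(self, audio):
        text = self.voice.recognize(audio)     # ASR step
        answer = self.dialog.answer(text)      # question-answer rules
        if answer is None:                     # no rule matched
            return self.im.escalate(text)      # hand off to a human agent
        return answer
```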
It should be noted that, in the media and resource module, the multimedia resources may include the avatar's action modules from the avatar action library, image display information, or other image/video and audio information. These resources can be placed on corresponding CDN nodes so that, when requested by the control module, they are pushed to it and then transmitted to the display module for display.
The voice communication module stores a pre-trained ASR model and a pre-trained TTS model. The ASR model recognizes the audio input by the user, i.e., converts it into the corresponding text content; the TTS model converts the answer text into audio and pushes it to the control module for output. It should be noted that, in one example, the voice communication module may instead be integrated into the control module rather than the server; in that case the control module recognizes the user's audio and transmits the recognized text to the intelligent dialogue module in the server for dialogue processing.
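A minimal sketch of the voice communication module's two conversions, with the ASR and TTS models injected as callables so the same module can sit either in the server or in the control module, as described above. The interfaces are assumptions, and trivial stand-ins replace real models in the usage below:

```python
class VoiceCommunicationModule:
    """Sketch of the voice module: an ASR model turns user audio into
    text, and a TTS model turns answer text back into audio. Both model
    callables are injected (hypothetical interface)."""

    def __init__(self, asr_model, tts_model):
        self.asr = asr_model
        self.tts = tts_model

    def transcribe(self, audio: bytes) -> str:
        """Recognize user audio into text for the dialogue module."""
        return self.asr(audio)

    def synthesize(self, answer_text: str) -> bytes:
        """Convert answer text into audio for the control module to output."""
        return self.tts(answer_text)
```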
The instant messaging module is connected to a backend human agent group. When a user requires manual intervention, the instant messaging module can request the agent group to assign a human agent, enabling message exchange among the agent, the server, and the interactive device.
The intelligent dialogue module, also called the BOT module, stores question-answer rules, i.e., answer content corresponding to the various query or consultation contents a user may input. The module interprets the text recognized by the voice communication module using a pre-trained NLP model, selects the corresponding answer text from the question-answer rules according to the interpretation, and feeds the answer back to the voice communication module.
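The BOT module's rule lookup can be illustrated with a toy keyword match standing in for the pre-trained NLP model; both the rules and the matching strategy here are hypothetical:

```python
QA_RULES = {
    # Hypothetical question-answer rules of the kind the BOT module stores.
    "card progress": "Your card is in production and ships within 5 days.",
    "opening hours": "The branch is open 9:00-17:00 on weekdays.",
}

def answer_for(query_text):
    """Pick the answer whose rule key appears in the recognized text.

    A keyword scan is only a stand-in for the NLP intent matching the
    module actually performs; unmatched queries return None (which, in
    the server design above, could trigger human-agent escalation).
    """
    for key, answer in QA_RULES.items():
        if key in query_text.lower():
            return answer
    return None
```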
With the user's permission, the live video module can send the real-time video stream to the human agent group so that agents can monitor the user's interaction state in real time, judge whether manual intervention is needed, and actively intervene and communicate when the user is likely to need it.
In an embodiment, the server further comprises:
the management and control module is configured to receive the registration requests sent by the interactive devices and the display modules, record the parameter information of each interactive device's display module, and monitor the states of the interactive devices and display modules, where there may be one or more interactive devices and one or more display modules.
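The management and control module's record-and-monitor behavior can be sketched as a registry that tracks each device's last heartbeat and marks stale devices offline. The timeout value and method names are assumptions for the sketch:

```python
import time

class ManagementModule:
    """Sketch of the server-side management/control module: it records
    registrations from interactive devices and display modules, and
    treats a device as offline once its heartbeat goes stale."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout          # assumed keepalive window (seconds)
        self.clock = clock              # injectable for testing
        self.devices = {}               # device_id -> last heartbeat time

    def register(self, device_id, params=None):
        self.devices[device_id] = self.clock()
        return True

    def heartbeat(self, device_id):
        if device_id not in self.devices:
            return False                # must register first
        self.devices[device_id] = self.clock()
        return True

    def is_online(self, device_id):
        last = self.devices.get(device_id)
        return last is not None and (self.clock() - last) <= self.timeout
```

Injecting the clock keeps the staleness logic deterministic; a real server would also persist the recorded parameter information.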
The following description takes as an example a large-screen terminal deployed by a bank, where the service is provided in the form of a digital person (equivalent to the aforementioned "avatar").
The present exemplary embodiment provides a multi-modal interaction device (equivalent to the aforementioned "interaction device"), including a multi-modal interaction device main body, inside which (equivalent to the "functional modules" of the aforementioned interaction device) are provided:
a camera unit configured to acquire image/video information of a user;
a pickup unit configured to acquire audio input information of a user;
an audio output unit configured to output audio output information;
a control unit configured to drive and control the image pickup unit, the sound pickup unit, and the audio output unit;
and the transmission interface is configured to output display information to the display, and the display information is generated by the server according to the image/video information and the audio input information.
In this exemplary embodiment, the multi-modal interaction device main body adopts a structural design in which the camera unit, the pickup unit, the audio output unit, the control unit, and the transmission interface are all disposed in one box body. These units correspond, respectively, to the camera module, pickup module, audio output module, control module, and transmission interface described above.
The bank uses its own large LED screen as the display unit; the multi-modal interaction device is connected to the LED screen through the transmission interface, forming with it the multi-modal interaction apparatus.
While a user operates the multi-modal interaction apparatus, the user's speech is captured by the microphone array in the pickup unit as audio input information. After hardware and software noise reduction, a speech recognition module in the control unit performs recognition with preset models such as an ASR (Automatic Speech Recognition) model and an NLP (Natural Language Processing) model to determine the user's intent or expected instruction. Meanwhile, the camera unit detects and follows the user's face and acquires video information, such as the video stream during use, for processing such as expression recognition.
The control unit sends the processed audio input information and video information to a server deployed in the cloud, which generates the corresponding display information and audio output information based on preset voice and visual rules according to the recognition results. In one example, the user asks by voice, through the microphone array in the pickup unit, how soon the bank card applied for on October 20 will be ready. After recognition, the system determines that the user is querying a service-handling progress. Meanwhile, the camera unit acquires the user's face information, and the user's identity can be confirmed by comparison with the face information recorded when the service was transacted. After receiving the uploaded voice input and face information, the server looks up the service-handling progress under the corresponding user in the backend and, once the result is found, generates the corresponding display information and audio output information. In this example, the display information is a video stream containing the detailed progress of the user's card application together with the digital person's avatar guiding the user through it, and the audio output information is the voice narration accompanying that guidance.
It should be noted that, in this example, the display content includes the digital person's avatar guiding the user through the corresponding flow, and generating this content involves the computation and driving of the avatar. In practice, the image and video processing for such a digital person depends on high-performance CPUs and GPUs. A traditional integrated large-screen terminal handling a performance-intensive task like a digital person would need high-performance hardware deployed locally, with correspondingly high hardware cost. In this example, the computation and driving of the avatar are completed in the cloud; the local display unit and multi-modal interaction device do not participate in this processing, so their hardware requirements remain low.
After the server generates the display information and the audio output information, it sends them to the multi-modal interaction device, which forwards the display information to the display unit through the transmission interface for display, while the audio output information is played directly through a loudspeaker in the audio output unit. In the example above, the detailed progress of the user's card application and the digital person's guidance are displayed on the large LED screen while the voice explanation is played through the loudspeaker, forming the interaction with the user.
Optionally, in this embodiment, those skilled in the art will understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with the terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, read-only memories (ROMs), random access memories (RAMs), magnetic disks, optical disks, and the like.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
The integrated units in the above embodiments, if implemented as software functional units and sold or used as independent products, may be stored in the above computer-readable storage medium. Based on this understanding, the technical solution of the present application, or the part of it contributing to the prior art, may in essence be embodied in whole or in part in the form of a software product stored in a storage medium, including instructions for causing one or more computer devices (which may be personal computers, servers, network devices, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative; for example, the division of units is only a division by logical function, and other divisions are possible in actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (13)

1. An interaction device, characterized in that the interaction device is communicatively connected to a server and a display device respectively, the interaction device comprises a housing and functional modules disposed inside the housing, and the functional modules comprise:
the camera module is configured to acquire image information of a target area;
the pickup module is configured to acquire audio input information of the target area;
the control module is electrically connected with the camera module and the pickup module respectively; the control module is configured to send the image information and/or the audio input information to the server so that the server can generate display information according to the image information and/or the audio input information, and send a control instruction to the camera module and the pickup module;
a transmission interface configured to output the display information to the display device, wherein the display information at least includes: motion information of the avatar, and/or image presentation information.
2. The interactive device of claim 1, further comprising:
an audio output module disposed inside or outside the housing and configured to output first audio output information, wherein the first audio output information is generated by the server according to the image information and the audio input information.
3. The interaction device of claim 2, wherein the control module is further configured to send the image information and/or the audio input information to the server for the server to generate conversational information according to the image information and the audio input information, wherein the conversational information is used to indicate a conversational operation used by the avatar to interact with a service object;
the audio output module is further configured to output second audio output information when the avatar interacts with the service object according to the tactical information, wherein the second audio output information corresponds to action information of the avatar, and the action information of the avatar includes facial action information and limb action information.
4. The interaction device of claim 3, wherein the control module is further configured to,
the server is instructed to select a first action module corresponding to the image display information from a preset virtual image action database according to the image display information, wherein the first action module is used for indicating the action of the virtual image when the image corresponding to the image display information is displayed;
instructing the server to select a second action module corresponding to the speech operation information from the virtual image action database according to the speech operation information, wherein the second action module is used for instructing the action of the virtual image when interacting with the service object according to the speech operation information;
the control module is further configured to send the action information of the avatar determined by the first action module and/or the second action module to the display device through the transmission interface;
the virtual image action database comprises a plurality of preset action modules, wherein each action module corresponds to one or more limb actions and/or facial actions of the virtual image.
5. The interactive device of claim 4, wherein the pickup module is further configured to obtain voice data input by the service object in the target area, wherein the voice data comprises real-time voice data and/or non-real-time voice data;
the control module is further configured to instruct the server to select a target first action module and/or a target second action module corresponding to the voice data in the avatar action database according to the voice data and the dialect information;
the control module is further configured to acquire the target first action module and/or the target second action module, and push the target first action module and/or the target second action module to the display device through the transmission interface.
6. The interaction device according to claim 5, wherein the control module is further configured to instruct the server to send a target Uniform Resource Locator (URL) address corresponding to the target first action module and/or the target second action module to the control module, so that the control module obtains the target first action module and/or the target second action module according to the target URL address;
the URL address is used for indicating an address of a CDN node of a content delivery network provided with the first action module and/or the second action module, and each CDN node corresponds to one URL address;
the control module is further configured to send the target first action module and/or the target second action module to the display device through the transmission interface.
7. An interaction device, comprising the interaction apparatus of any one of claims 1 to 6 and a display module, wherein,
the interactive device is configured to send the acquired image information and audio input information to a server;
the display module is configured to display information, wherein the display information is generated by the server according to the image information and the audio input information, and the display information at least includes: motion information of the avatar, and/or image presentation information.
8. The interaction device of claim 7, wherein the display module is further configured to:
receiving first display information sent by the interactive equipment, wherein the first display information is display information subjected to decoding processing by the interactive equipment; and/or
and receiving second display information sent by the server, wherein the second display information is display information that has not been subjected to decoding processing.
9. The interaction device of claim 7, wherein the display module is further configured to:
acquiring instruction information input by a user, wherein the instruction information at least comprises one of the following items: touch information, mouse input information and keyboard input information;
and sending the instruction information to the interaction equipment or the server.
10. The interaction device of claim 7,
the interaction device is further configured to: sending a first registration request to the server; after the first registration request passes, sending a first heartbeat signal to the server to maintain connection with the server;
the display module is further configured to: sending a second registration request to the server; and after the second registration request passes, sending a second heartbeat signal to the server to maintain the connection with the server.
11. An interactive system, comprising the interactive apparatus of any one of claims 7 to 10 and a server, wherein,
the server is configured to generate display information according to the image information and the audio input information acquired by the interactive device, wherein the display information at least comprises: motion information of the avatar, and/or image presentation information.
12. The interactive system according to claim 11, wherein the server comprises:
a media and resource module configured to access multimedia resources and publish the multimedia resources to a content distribution network;
the instant messaging module is configured to realize the message transceiving between the interactive equipment and/or the server and the human agent;
a voice communication module configured to receive, recognize and record audio data;
an intelligent dialogue module configured to interact with a user through a robot, wherein the robot is a trained neural network model;
the conversation middleware module is configured to control and schedule the instant messaging module, the voice communication module and the intelligent dialogue module;
and the video live broadcast module is configured to generate a real-time video stream from the image information acquired by the interactive equipment and store the real-time video stream.
13. The interactive system of claim 11, wherein the server further comprises:
the management and control module is configured to receive the registration requests sent by the interactive device and the display module, record the parameter information of the display module of the interactive device, and monitor the states of the interactive device and the display module, wherein there may be one or more interactive devices and one or more display modules.
CN202110482187.3A 2021-04-30 2021-04-30 Interaction equipment, interaction device and interaction system Pending CN113157241A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110482187.3A CN113157241A (en) 2021-04-30 2021-04-30 Interaction equipment, interaction device and interaction system


Publications (1)

Publication Number Publication Date
CN113157241A true CN113157241A (en) 2021-07-23

Family

ID=76873088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110482187.3A Pending CN113157241A (en) 2021-04-30 2021-04-30 Interaction equipment, interaction device and interaction system

Country Status (1)

Country Link
CN (1) CN113157241A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114095829A (en) * 2021-11-08 2022-02-25 广州番禺巨大汽车音响设备有限公司 Control method and control device for sound integration with HDMI (high-definition multimedia interface)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080201227A1 (en) * 2006-11-07 2008-08-21 Lucia Urban Bakewell Interactive, Internet-based, trip planning, travel resource, travel community, virtual travel, travel-based education, travel-related gaming and virtual/live tour system, methods, and processes, emphasizing a user's ability to tailor complete travel, trip, route, game and touring choices to unique user-specified personal interests, preferences, and special requirements for self and companions
CN103535008A (en) * 2011-02-01 2014-01-22 时间游戏娱乐公司 Systems and methods for interactive experiences and controllers therefor
CN105162892A (en) * 2015-10-15 2015-12-16 戚克明 Language technique exercise treatment method, apparatus and system, and language technique exercise supervision method
WO2018045553A1 (en) * 2016-09-09 2018-03-15 上海海知智能科技有限公司 Man-machine interaction system and method
CN208622062U (en) * 2018-08-30 2019-03-19 合肥虹慧达科技有限公司 Augmented reality interactive system
CN112181127A (en) * 2019-07-02 2021-01-05 上海浦东发展银行股份有限公司 Method and device for man-machine interaction
CN112652200A (en) * 2020-11-16 2021-04-13 北京家有课堂科技有限公司 Man-machine interaction system, man-machine interaction method, server, interaction control device and storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114095829A (en) * 2021-11-08 2022-02-25 广州番禺巨大汽车音响设备有限公司 Control method and control device for sound integration with HDMI (high-definition multimedia interface)
CN114095829B (en) * 2021-11-08 2023-06-09 广州番禺巨大汽车音响设备有限公司 Sound integrated control method and control device with HDMI interface

Similar Documents

Publication Publication Date Title
US11158102B2 (en) Method and apparatus for processing information
CN110298906B (en) Method and device for generating information
US9621851B2 (en) Augmenting web conferences via text extracted from audio content
CN110400251A (en) Method for processing video frequency, device, terminal device and storage medium
CN101502094B (en) Methods and systems for a sign language graphical interpreter
CN112162628A (en) Multi-mode interaction method, device and system based on virtual role, storage medium and terminal
JP2019102063A (en) Method and apparatus for controlling page
US20090044112A1 (en) Animated Digital Assistant
CN113392201A (en) Information interaction method, information interaction device, electronic equipment, medium and program product
WO2022170848A1 (en) Human-computer interaction method, apparatus and system, electronic device and computer medium
US20230047858A1 (en) Method, apparatus, electronic device, computer-readable storage medium, and computer program product for video communication
CN111556279A (en) Monitoring method and communication method of instant session
CN113923462A (en) Video generation method, live broadcast processing method, video generation device, live broadcast processing device and readable medium
CN109885277A Human-computer interaction device, methods, systems and devices
CN114242069A (en) Switching method, device and equipment of human-computer customer service and storage medium
CN113850898A (en) Scene rendering method and device, storage medium and electronic equipment
CN105388786B An intelligent puppet control method
CN114048299A (en) Dialogue method, apparatus, device, computer-readable storage medium, and program product
CN113157241A (en) Interaction equipment, interaction device and interaction system
KR102506604B1 (en) Method for providing speech video and computing device for executing the method
JP7130290B2 (en) information extractor
CN116610777A (en) Conversational AI platform with extracted questions and answers
CN113961680A (en) Human-computer interaction based session processing method and device, medium and electronic equipment
KR102441456B1 (en) Method and system for mimicking tone and style of real person
CN113742473A (en) Digital virtual human interaction system and calculation transmission optimization method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination