CN112788004A - Method and equipment for executing instructions through a virtual conference robot

Info

Publication number
CN112788004A
Authority
CN
China
Prior art keywords
instruction information; information; instruction; user; audio
Prior art date
Legal status
Granted
Application number
CN202011594013.8A
Other languages
Chinese (zh)
Other versions
CN112788004B (en)
Inventor
程翰
Current Assignee
Shanghai Zhangmen Science and Technology Co Ltd
Original Assignee
Shanghai Zhangmen Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Zhangmen Science and Technology Co Ltd
Priority to CN202011594013.8A
Publication of CN112788004A
Priority to PCT/CN2021/125284 (WO2022142618A1)
Application granted
Publication of CN112788004B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00: Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40: Support for services or applications
    • H04L65/403: Arrangements for multi-party communication, e.g. for conferences
    • H04L12/00: Data switching networks
    • H04L12/02: Details
    • H04L12/16: Arrangements for providing special services to substations
    • H04L12/18: Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L12/1813: Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
    • H04L12/1822: Conducting the conference, e.g. admission, detection, selection or grouping of participants, correlating users to one or more conference sessions, prioritising transmission
    • H04L12/185: Arrangements for providing special services to substations for broadcast or conference, e.g. multicast with management of multicast group membership

Abstract

The application provides a method and device for executing instructions through a virtual conference robot. The method comprises: creating a conference group in response to a creation request for a multi-person audio/video conference group, the conference group comprising a plurality of users and a virtual conference robot; acquiring, through the virtual conference robot, audio/video input information sent by the plurality of users, identifying a plurality of pieces of associated instruction information in the audio/video input information, and generating instruction information to be executed according to those pieces; and executing the instruction information to be executed through the virtual conference robot. Because the virtual conference robot generates the final instruction to be executed from multiple pieces of instruction information issued by the identified users in the audio/video conference group, instruction execution efficiency is greatly improved and the human-computer interaction experience is optimized.

Description

Method and equipment for executing instructions through virtual conference robot
Technical Field
The present application relates to the field of communications, and more particularly, to a technique for executing instructions through a virtual conference robot.
Background
With the development of the times, artificial intelligence technology has made breakthrough progress and has become ever more closely tied to people's daily lives; intelligent robots interact with humans in increasingly capable ways and in more and more scenarios. In the prior art, however, the interaction capability of a virtual robot (such as Siri) is relatively limited: typically it only recognizes, in isolation, a single instruction issued by a single user.
Disclosure of Invention
An object of the present application is to provide a method and apparatus for executing instructions by a virtual conference robot.
According to an aspect of the present application, there is provided a method of executing instructions by a virtual conference robot, the method including:
creating a conference group in response to a creation request for a multi-person audio/video conference group, wherein the conference group comprises a plurality of users and a virtual conference robot;
acquiring, through the virtual conference robot, audio/video input information sent by the plurality of users, identifying a plurality of pieces of associated instruction information in the audio/video input information, and generating instruction information to be executed according to those pieces; and
executing the instruction information to be executed through the virtual conference robot.
According to an aspect of the present application, there is provided a network device for executing instructions by a virtual conference robot, the device including:
a module one-one, configured to create a conference group in response to a creation request for a multi-person audio/video conference group, wherein the conference group comprises a plurality of users and a virtual conference robot;
a module one-two, configured to acquire, through the virtual conference robot, audio/video input information sent by the plurality of users, identify a plurality of pieces of associated instruction information in the audio/video input information, and generate instruction information to be executed according to those pieces; and
a module one-three, configured to execute the instruction information to be executed through the virtual conference robot.
According to an aspect of the present application, there is provided an apparatus for executing an instruction by a virtual conference robot, wherein the apparatus includes:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
create a conference group in response to a creation request for a multi-person audio/video conference group, wherein the conference group comprises a plurality of users and a virtual conference robot;
acquire, through the virtual conference robot, audio/video input information sent by the plurality of users, identify a plurality of pieces of associated instruction information in the audio/video input information, and generate instruction information to be executed according to those pieces; and
execute the instruction information to be executed through the virtual conference robot.
According to one aspect of the application, there is provided a computer-readable medium storing instructions that, when executed, cause a system to:
create a conference group in response to a creation request for a multi-person audio/video conference group, wherein the conference group comprises a plurality of users and a virtual conference robot;
acquire, through the virtual conference robot, audio/video input information sent by the plurality of users, identify a plurality of pieces of associated instruction information in the audio/video input information, and generate instruction information to be executed according to those pieces; and
execute the instruction information to be executed through the virtual conference robot.
According to another aspect of the application, there is provided a computer program product comprising a computer program which, when executed by a processor, performs the method of:
creating a conference group in response to a creation request for a multi-person audio/video conference group, wherein the conference group comprises a plurality of users and a virtual conference robot;
acquiring, through the virtual conference robot, audio/video input information sent by the plurality of users, identifying a plurality of pieces of associated instruction information in the audio/video input information, and generating instruction information to be executed according to those pieces; and
executing the instruction information to be executed through the virtual conference robot.
Compared with the prior art, the present application creates, in response to a creation request for a multi-person audio/video conference group, a conference group comprising a plurality of users and a virtual conference robot. Through the virtual conference robot, a plurality of pieces of associated instruction information can be identified from the audio/video input information sent by the plurality of users, and instruction information to be executed can be generated from those pieces and executed. Because the virtual conference robot generates the final instruction information to be executed from the multiple instructions issued by the identified users in the audio/video conference group, instruction execution efficiency can be greatly improved and the human-computer interaction experience optimized.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow diagram of a method for executing instructions by a virtual conference robot, according to one embodiment of the present application;
fig. 2 illustrates a network device architecture diagram for executing instructions via a virtual conference robot, according to one embodiment of the present application;
FIG. 3 illustrates an exemplary system that can be used to implement the various embodiments described in this application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (e.g., Central Processing Units (CPUs)), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory. Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PCM), programmable random access memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can store information accessible by a computing device.
The device referred to in this application includes, but is not limited to, a user device, a network device, or a device formed by integrating a user device and a network device through a network. The user device includes, but is not limited to, any mobile electronic product capable of human-computer interaction with a user (for example, through a touch panel), such as a smartphone or tablet computer; the mobile electronic product may employ any operating system, such as Android or iOS. The network device includes an electronic device capable of automatically performing numerical calculation and information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like. The network device includes, but is not limited to, a computer, a network host, a single network server, a set of network servers, or a cloud of servers; here, the cloud is composed of a large number of computers or network servers based on cloud computing, a kind of distributed computing in which a collection of loosely coupled computers forms one virtual supercomputer. The network includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a VPN, a wireless ad hoc network, and the like. Preferably, the device may also be a program running on the user device, the network device, or a device formed by integrating the user device with the network device, a touch terminal, or the network device with a touch terminal through a network.
Of course, those skilled in the art will appreciate that the foregoing is by way of example only, and that other existing or future devices, which may be suitable for use in the present application, are also encompassed within the scope of the present application and are hereby incorporated by reference.
In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Fig. 1 shows a flowchart of a method for executing instructions by a virtual conference robot according to an embodiment of the present application, the method including step S11, step S12, and step S13. In step S11, the network device creates a multi-user audio/video conference group in response to a creation request for the conference group, where the conference group includes a plurality of users and a virtual conference robot; in step S12, the network device obtains audio and video input information sent by the multiple users through the virtual conference robot, identifies multiple associated instruction information in the audio and video input information, and generates instruction information to be executed according to the multiple instruction information; in step S13, the network device executes the instruction information to be executed through the virtual conference robot.
In step S11, the network device creates the conference group in response to a creation request for a multi-person audio/video conference group, wherein the conference group comprises a plurality of users and a virtual conference robot. In some embodiments, the conference group includes, but is not limited to, a multi-person online audio conference group, a multi-person online video conference group, and a multi-person online audio/video conference group. In some embodiments, the creation request includes the user identification information of a plurality of users, and after creating the conference group according to the request, the server automatically joins those users to the group. Alternatively, after the group is created, its creator actively invites users to join; an invited user may join immediately, or only after confirming the invitation. Alternatively, users actively apply to join, and either join immediately upon applying or only after the creator approves the application. In some embodiments, in response to the creation request, the conference group and a virtual conference robot corresponding to it are created together, the group comprising the plurality of users and the robot. In other embodiments, the conference group comprising the plurality of users is created first; then, in response to a request to add a virtual conference robot to the group, a virtual conference robot corresponding to the group is created and joined to it. In some embodiments, the virtual conference robot is a virtual user role with no real counterpart in the conference group; it can acquire the audio/video input information sent by the users in the group, identify a plurality of pieces of associated instruction information from that input, generate the final instruction information to be executed from those pieces, and execute it.
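As a minimal illustration of this group-creation step, the following Python sketch models a conference group created from a creation request and populated with the requesting users plus a virtual conference robot member; all class and function names here are hypothetical and not part of the application itself.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class VirtualConferenceRobot:
        robot_id: str  # a virtual user role, not a real participant

    @dataclass
    class ConferenceGroup:
        group_id: str
        member_ids: List[str] = field(default_factory=list)  # user identification info
        robot: Optional[VirtualConferenceRobot] = None

    def create_conference_group(group_id: str, user_ids: List[str]) -> ConferenceGroup:
        # create the group from the creation request and auto-join the listed users
        group = ConferenceGroup(group_id=group_id, member_ids=list(user_ids))
        # create and join the virtual conference robot corresponding to this group
        group.robot = VirtualConferenceRobot(robot_id=f"robot-{group_id}")
        return group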
In step S12, the network device acquires, through the virtual conference robot, the audio/video input information sent by the plurality of users, identifies a plurality of pieces of associated instruction information in it, and generates the instruction information to be executed from those pieces. In some embodiments, the audio/video input information may include only audio input information, or both audio and video input information. In some embodiments, the virtual conference robot can directly determine the video-source user identification information of the video input information sent by each user. In some embodiments, if the virtual conference robot acquires each user's audio input information separately, before audio mixing, it can directly determine the audio-source user identification information of each piece; if it acquires the audio input information after mixing, it must determine the audio-source user identification information of each piece of audio content in the mixed audio according to the voiceprint feature information of each user stored in advance on the server. In some embodiments, the virtual conference robot identifies, in the acquired audio/video input information, a plurality of pieces of instruction information that are semantically associated, or that are semantically associated and issued within a predetermined time interval of one another; the pieces may originate from the same user or from different users. Instruction information may be identified from the audio input information, for example by taking a segment of audio content directly as instruction information, or from the video input information, for example by inferring an instruction from a user's gestures, body movements, or expressions: the server may store in advance the instruction information corresponding to each gesture, body movement, and expression, and when a corresponding gesture, movement, or expression is recognized in the video frames by image recognition, the matching instruction information is taken as instruction information issued by that user. In some embodiments, the instruction information to be executed is generated from the plurality of identified pieces; for example, user A issues "book a box at P restaurant at 6 pm", user B issues "pre-order one J dish", and user C issues "pre-order one K dish", so the instruction information to be executed "book a box at P restaurant at 6 pm and pre-order one J dish and one K dish" can be generated.
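For the mixed-audio case above, one simple way to recover the audio-source user of each audio segment is to compare a segment embedding against each user's pre-stored voiceprint feature. The Python sketch below uses cosine similarity; the embedding model, data shapes, and function name are assumptions for illustration only.

    import numpy as np

    def attribute_segments(segments, voiceprints):
        # segments: list of (segment_id, embedding vector) from the mixed audio
        # voiceprints: {user_id: pre-stored voiceprint feature vector}
        results = []
        for seg_id, emb in segments:
            best_user, best_score = None, -1.0
            for user_id, vp in voiceprints.items():
                score = float(np.dot(emb, vp) /
                              (np.linalg.norm(emb) * np.linalg.norm(vp)))
                if score > best_score:
                    best_user, best_score = user_id, score
            results.append((seg_id, best_user, best_score))  # likely source user
        return results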
In some embodiments, the instruction information to be executed is generated from the plurality of pieces of instruction information in combination with the instruction-source user identification information of each piece (i.e., the audio/video-source user identification information of the audio/video input information in which that piece was recognized). For example, user A issues "book a box at P restaurant at 6 pm", user B issues "pre-order one J dish", and user C issues "pre-order one K dish", so the instruction information to be executed "book a box at P restaurant at 6 pm, pre-order one J dish for user B, and pre-order one K dish for user C" can be generated.
In step S13, the network device executes the instruction information to be executed through the virtual conference robot. For example, if the instruction information to be executed is "book a box at P restaurant at 6 pm and pre-order one J dish and one K dish", the virtual conference robot first finds the web page or applet corresponding to P restaurant according to that instruction, and performs the corresponding box-booking and dish pre-ordering operations there.
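One plausible, purely illustrative way to realize this execution step is to dispatch the synthesized instruction to a registered handler by intent keyword; the handler would then drive the restaurant's page or applet. Nothing below is prescribed by the application itself, and all names are hypothetical.

    def execute_instruction(instruction: str, handlers: dict):
        # handlers: {intent keyword: callable taking the instruction text}
        for intent, handler in handlers.items():
            if intent in instruction:
                return handler(instruction)  # e.g. a box-booking routine
        raise ValueError("no handler matches the instruction to be executed")

    # usage sketch:
    # execute_instruction("book a box at P restaurant at 6 pm",
    #                     {"book a box": book_box_handler})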
In response to a creation request for a multi-person audio/video conference group, the present application creates a conference group comprising a plurality of users and a virtual conference robot. Through the virtual conference robot, a plurality of pieces of associated instruction information can be identified from the audio/video input information sent by the plurality of users, and instruction information to be executed can be generated from those pieces and executed. Because the virtual conference robot generates a final piece of instruction information to be executed from the multiple instructions issued by the identified users in the audio/video conference group, instruction execution efficiency can be greatly improved and the human-computer interaction experience optimized.
In some embodiments, the step S11 includes: the network device, in response to a creation request for a multi-person audio/video conference group, creates the conference group and a virtual conference robot corresponding to it, wherein the conference group comprises a plurality of users and the virtual conference robot. In some embodiments, the user who initiated the creation request is the creator of the conference group. In some embodiments, the creation request includes the user identification information of a plurality of users, and the server automatically joins those users to the conference group after creating it. Alternatively, the creator actively invites users after the group is created; an invited user may join immediately, or only after confirming the invitation. Alternatively, users actively apply to join, and either join immediately upon applying or only after the creator approves the application.
In some embodiments, the step S11 includes: the network device, in response to a creation request for a multi-person audio/video conference group, creates the conference group comprising a plurality of users; and, in response to a request to add a virtual conference robot to the conference group, creates a virtual conference robot corresponding to the group and joins it to the group. In some embodiments, the user who initiated the creation request is the creator of the conference group. In some embodiments, only the creator may initiate the request to add the virtual conference robot; alternatively, any user in the group may initiate it. In some embodiments, the server creates the corresponding virtual conference robot directly upon receiving the request; alternatively, it forwards the request to all other users in the group, each of whom confirms whether to approve it, and the robot is created only if at least a predetermined number or proportion of users approve.
In some embodiments, the step S12 includes step S121 (not shown), step S122 (not shown), and step S123 (not shown). In step S121, the network device acquires, through the virtual conference robot, the audio/video input information sent by the plurality of users, and identifies the target instruction information of a first user among them; in step S122, the network device identifies, through the virtual conference robot, one or more pieces of instruction information associated with the target instruction information from second audio/video input information sent by users other than the first user; in step S123, the network device generates, through the virtual conference robot, the instruction information to be executed according to the target instruction information and the one or more pieces of instruction information. In some embodiments, the target instruction information issued by the first user is identified from first audio/video input information sent by the first user; then one or more pieces of instruction information semantically associated with the target instruction information (optionally also issued within a predetermined time interval of it) are identified from the second audio/video input information sent by the other users in the conference group; and the instruction information to be executed is then generated from the target instruction information and those pieces, optionally in combination with the first user's identification information and the instruction-source user identification information of each piece.
In some embodiments, the step S121 includes: the network device acquires, through the virtual conference robot, the audio/video input information sent by the plurality of users; if predetermined instruction-trigger indication information is recognized in first audio/video input information sent by the first user, a second time point is determined starting from the first time point corresponding to the trigger indication, and the target instruction information of the first user is obtained from the first audio/video input information between the first time point and the second time point. In some embodiments, the predetermined instruction-trigger indication may be a specific voice password (e.g., a wake-up phrase) recognized in the first audio input information sent by the first user, or a specific gesture or body movement, e.g., an "OK" gesture, recognized in the first video input information sent by the first user. In some embodiments, the target instruction information issued by the first user is collected from the first audio/video information starting at the first time point until the second time point (i.e., the instruction-end time point) is identified; the content of the audio clip between the two time points in the first audio input information may serve as the target instruction information, or the target instruction information may be determined from the content of the video frames between the two time points in the first video input information.
In some embodiments, identifying and determining a second time point starting from the first time point corresponding to the instruction-trigger indication information includes: if predetermined instruction-end indication information is recognized in the first audio/video input information sent by the first user after the first time point, determining the time point corresponding to the end indication as the second time point. In some embodiments, the predetermined instruction-end indication may be a specific voice password (e.g., "over"), or a specific gesture or body movement. In some embodiments, after the first time point, if the specific voice password is recognized in the first audio input information sent by the first user, or the specific gesture or body movement is recognized in the first video input information, the corresponding time point is determined as the second time point.
In some embodiments, identifying and determining a second time point starting from the first time point corresponding to the instruction-trigger indication information includes: after the first time point, if no new audio content is acquired from the first audio/video input information within a predetermined time interval following some third time point, determining that third time point as the second time point. In some embodiments, after the first time point, if no new audio content is acquired from the first audio input information sent by the first user within a predetermined time interval (e.g., 5 seconds) after a third time point, the third time point is determined as the second time point. For example, the virtual conference robot recognizes the specific voice password (i.e., the instruction-trigger indication information) in user A's first audio input information at a first time point; if no new audio content is then received within the predetermined interval after a third time point, that third time point is taken as the second time point, and the content of the audio clip between the first and second time points is determined as the target instruction information issued by the first user.
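The two end-point rules above (an explicit end indication, or a silence gap following a third time point) can be combined in one extraction routine. The Python sketch below assumes a time-ordered list of recognized-speech events; the passwords and gap value are placeholders, not values fixed by the application.

    from typing import List, Optional, Tuple

    def extract_target_instruction(
        events: List[Tuple[float, str]],   # (timestamp in seconds, recognized text)
        trigger: str = "wake-word",        # predetermined trigger password (assumed)
        end_word: str = "over",            # predetermined end password (assumed)
        silence_gap: float = 5.0,          # predetermined silence interval
    ) -> Optional[str]:
        first_t = None                     # first time point (trigger recognized)
        last_t = 0.0
        collected: List[str] = []
        for t, text in events:
            if first_t is None:
                if trigger in text:
                    first_t, last_t = t, t # start collecting after the trigger
                continue
            if end_word in text:           # explicit instruction-end indication
                return " ".join(collected)
            if t - last_t > silence_gap:   # no new audio after the third time point,
                return " ".join(collected) # so that point becomes the second one
            collected.append(text)
            last_t = t
        return " ".join(collected) if first_t is not None else None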
In some embodiments, the step S122 includes a step S1221 (not shown) and a step S1222 (not shown). In step S1221, the network device identifies, through the virtual conference robot, at least one piece of instruction information corresponding to the other users from second audio/video input information sent by users other than the first user; in step S1222, the network device determines, from that at least one piece, one or more pieces of instruction information associated with the target instruction information. In some embodiments, the virtual conference robot identifies the at least one piece of instruction information issued by the other users in the same manner as described above for identifying the first user's target instruction information, which is not repeated here. In some embodiments, the one or more pieces determined are those semantically associated with the target instruction information, or those semantically associated with it and issued within a predetermined time interval of it.
In some embodiments, the step S1222 includes: the network device determines, from the at least one piece of instruction information, one or more pieces that are associated with the target instruction information and whose issue times lie within a predetermined time interval of the target instruction information's issue time. In some embodiments, each of the one or more pieces must be issued within the predetermined interval of the target instruction information; additionally, the pieces may be required to be issued within a predetermined interval of one another, which may be the same as or different from the former interval.
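A compact sketch of this time-window filter follows; the 30-second value and field names are invented for illustration, and semantic association is assumed to be checked elsewhere.

    def filter_by_issue_time(target, candidates, max_gap=30.0):
        # keep candidates issued within max_gap seconds of the target instruction
        kept = sorted(
            (c for c in candidates
             if abs(c["issued_at"] - target["issued_at"]) <= max_gap),
            key=lambda c: c["issued_at"],
        )
        # stricter variant: consecutive kept instructions must also lie
        # within max_gap of each other
        out = []
        for c in kept:
            if out and c["issued_at"] - out[-1]["issued_at"] > max_gap:
                break
            out.append(c)
        return out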
In some embodiments, the target instruction information includes one or more pieces of user identification information; the step S1221 then includes: identifying, through the virtual conference robot, at least one piece of instruction information corresponding to the one or more users identified by that user identification information, from second audio/video input information sent by those users. In some embodiments, the user identification information includes, but is not limited to, user name information, user nickname information, user ID information, and the like. For example, the target instruction information is "book a box at P restaurant at 6 pm; user B and user C, please each pre-order tonight's dishes", which includes the user identification information "user B" and "user C" corresponding to user B and user C respectively. In some embodiments, the at least one piece of instruction information issued by those users is identified from the second audio/video input information they send; in the example above, instruction information issued by user B and user C is identified from the second audio/video input information sent by user B and user C.
In some embodiments, the step S122 includes a step S1221 (not shown). In step S1221, the network device outputs, through the virtual conference robot and according to the target instruction information, instruction-issue prompt information corresponding to it in the conference group, and identifies, from second audio/video input information sent by users other than the first user, one or more pieces of instruction information those users issue in response to the prompt. In some embodiments, the virtual conference robot outputs the prompt in audio form in the conference group to invite the other users to issue instruction information associated with the target instruction information. For example, if the target instruction information is "a box is booked at P restaurant at 6 pm; tonight's dishes to pre-order have not been decided", the robot may generate, from its semantics, the prompt "everyone, please pre-order tonight's dishes" and output it in audio form in the group. In some embodiments, the other users issue their own instruction information through their respective second audio/video input information in response to the prompt. In some embodiments, the prompt has a predetermined issue time limit: instruction information issued after the limit expires is not counted among the one or more pieces associated with the target instruction information.
In some embodiments, the step S1221 includes a step S12211 (not shown) and a step S12212 (not shown). In step S12211, the network device obtains, through the virtual conference robot, identification information of at least one other user from the target instruction information; in step S12212, the network device outputs, through the virtual conference robot and according to the target instruction information, instruction-issue prompt information corresponding to that at least one other user in the conference group, and identifies, according to the identification information, one or more pieces of instruction information those users issue in response to the prompt from the second audio/video input information they send. In some embodiments, the target instruction information includes identification information of at least one other user; for example, if it is "a box is booked at P restaurant at 6 pm; tonight's dishes to pre-order have not been decided; user B and user C, please each arrange your own", the prompt "user B and user C, please pre-order tonight's dishes" corresponding to user B and user C is generated and output in audio form in the group. In some embodiments, if the target instruction information contains a referential expression such as "everybody", the identification information of the at least one other user it refers to must be determined by analyzing its semantics; for example, "everybody" refers to the identification information of all users in the conference group except the first user.
In some embodiments, the step S12212 includes: for each of the at least one other user, the network device outputs, through the virtual conference robot and according to the target instruction information, instruction-issue prompt information corresponding to that user in the conference group, and identifies, from the second audio/video input information sent by that user, the instruction information the user issues in response. In some embodiments, the virtual conference robot generates prompt information for each user from the target instruction information; the prompts may differ per user or be identical, and are output in audio form in the conference group. In some embodiments, the per-user prompts may be output consecutively, or the prompt for one user may be output first and, once that user's response has been identified, the prompt for the next user output in turn, as in the sketch after this paragraph. For example, if the target instruction information is "a box is booked at P restaurant at 6 pm; tonight's dishes to pre-order have not been decided; user B and user C, please each arrange your own", and the at least one other user comprises user B and user C, then the prompt "user B, please pre-order tonight's dishes" corresponding to user B and the prompt "user C, please pre-order tonight's dishes" corresponding to user C are generated.
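The sequential per-user prompting just described might look like the following sketch, where robot.say and collect_instruction are assumed interfaces for audio output in the group and for recognizing a user's response within the predetermined issue time limit.

    def prompt_users_in_turn(robot, user_ids, make_prompt, collect_instruction,
                             time_limit=60.0):
        issued = {}
        for user_id in user_ids:
            robot.say(make_prompt(user_id))            # e.g. "please pre-order ..."
            info = collect_instruction(user_id, timeout=time_limit)
            if info is not None:                       # late replies are ignored
                issued[user_id] = info
        return issued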
In some embodiments, the step S123 includes: the network device performs semantic analysis, through the virtual conference robot, on the target instruction information and the one or more pieces of instruction information to obtain the instruction purpose of each piece, and, according to those purposes, performs an instruction synthesis operation on the target instruction information and the one or more pieces to generate the instruction information to be executed. In some embodiments, the virtual conference robot semantically analyzes the target instruction information and its associated pieces to obtain each piece's instruction purpose. If no semantic conflict exists among the purposes, the synthesis operation is performed directly to produce the final instruction information to be executed. If at least two of the associated pieces conflict with each other in instruction purpose, a second target instruction information is determined from among them, and the rest of those conflicting pieces are discarded, so that only the second target instruction information participates in the subsequent synthesis with the target instruction information and the remaining non-conflicting pieces. If a piece conflicts in instruction purpose with the target instruction information itself, that piece may be discarded directly, and the synthesis is performed on the remaining pieces and the target instruction information. For example, user A issues "book a box at P restaurant at 6 pm", user B issues "pre-order one J dish", and user C issues "pre-order one K dish"; since there is no purpose conflict among them, the synthesis can be performed directly, yielding the final instruction information to be executed: "book a box at P restaurant at 6 pm, pre-order one J dish for user B, and pre-order one K dish for user C". By contrast, if user B issues "pre-order one J dish and do not pre-order K dish" while user C issues "pre-order one K dish and do not pre-order J dish", the two pieces can be regarded as semantically conflicting in instruction purpose.
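To make the three conflict branches concrete, here is a schematic Python pipeline; purpose_of, conflicts, and pick_second_target are assumed semantic-analysis helpers, and the final string join merely stands in for a real synthesis step.

    def synthesize(target, associated, purpose_of, conflicts, pick_second_target):
        # 1. discard pieces whose purpose conflicts with the target instruction
        kept = [i for i in associated
                if not conflicts(purpose_of(i), purpose_of(target))]
        # 2. resolve mutual conflicts among the remainder by keeping only a
        #    chosen "second target" per conflicting cluster
        resolved = []
        for piece in kept:
            clash = [j for j in resolved
                     if conflicts(purpose_of(piece), purpose_of(j))]
            if not clash:
                resolved.append(piece)
            else:
                winner = pick_second_target(clash + [piece])
                resolved = [j for j in resolved if j not in clash] + [winner]
        # 3. instruction synthesis (here, naive concatenation)
        return "; ".join([target] + resolved)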
In some embodiments, the method further comprises: if the instruction purpose of a first piece of the one or more pieces of instruction information includes the instruction purpose of a second piece, the network device discards the second piece. The discarded second piece does not participate in the subsequent synthesis; only the first piece does, together with the target instruction information and the remaining pieces. For example, the first piece is "pre-order one J dish and one K dish" and the second piece is "pre-order one J dish"; since the first piece's purpose includes the second's, the second piece is discarded.
In some embodiments, the method further comprises: if the instruction purpose of a first piece of the one or more pieces of instruction information is a lower-level (more specific) form of the instruction purpose of a second piece, the network device discards the second piece. The discarded second piece does not participate in the subsequent synthesis; only the first piece does, together with the target instruction information and the remaining pieces. For example, the first piece is "pre-order one J dish and one K dish" and the second piece is "pre-order two dishes"; since the first piece's purpose is a lower-level form of the second's, the second piece is discarded.
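Both discard rules (a purpose contained in another's, or a purpose of which another piece is a lower-level refinement) fit one pruning pass; contains and refines below are assumed semantic helpers, not functions defined by the application.

    def prune_redundant(pieces, purpose_of, contains, refines):
        kept = []
        for a in pieces:
            dominated = any(
                contains(purpose_of(b), purpose_of(a))    # b's purpose includes a's
                or refines(purpose_of(b), purpose_of(a))  # b is a lower-level form of a
                for b in pieces if b is not a
            )
            if not dominated:
                kept.append(a)
        return kept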
In some embodiments, the method further comprises: if at least two of the one or more pieces of instruction information conflict with each other in instruction purpose, the network device determines a second target instruction information from among them and discards the others. In that case, only the second target instruction information participates in the subsequent synthesis operation with the target instruction information and the remaining non-conflicting pieces.
In some embodiments, determining the second target instruction information from the at least one piece of instruction information comprises: determining the degree of association between each piece and the target instruction information, and selecting the piece with the highest association as the second target instruction information. In some embodiments, the selection is made according to the semantic association between each piece and the target instruction information. For example, the target instruction information is "a box is to be booked at P restaurant tonight; everyone, please help decide the booking time", the first piece is "book for 6 pm", and the second piece is "book for 4 pm"; the two conflict in instruction purpose, and because the first piece's semantic association with the target instruction information is greater than the second's, the first piece is taken as the second target instruction information.
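If the association degree is computed as semantic similarity between instruction embeddings, the selection reduces to an argmax; the embedding source and field names below are assumptions for illustration.

    import numpy as np

    def pick_by_association(target_vec, conflicting):
        # conflicting: list of {"text": instruction text, "vec": embedding vector}
        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        # the piece most associated with the target becomes the second target
        return max(conflicting, key=lambda c: cosine(target_vec, c["vec"]))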
In some embodiments, determining the second target instruction information from the at least one piece of instruction information comprises: performing a feasibility analysis on each piece to obtain its feasibility, and selecting the piece with the highest feasibility as the second target instruction information. For example, the target instruction information is "a box is to be booked at P restaurant tonight; everyone, please help decide which box to book", the first piece is "book box A", and the second piece is "book box B"; the two conflict in instruction purpose. The virtual conference robot queries tonight's box bookings on P restaurant's web page or applet, learns that box A is free tonight while box B is booked by others between 5 pm and 7 pm, and therefore determines that the first piece is more feasible than the second, so the first piece is taken as the second target instruction information.
In some embodiments, determining the second target instruction information from the at least one piece of instruction information comprises: generating instruction-conflict indication information about the at least one piece and sending it to a specific user in the conference group; receiving from that user feedback information corresponding to the indication, the feedback including the identification information of one of the pieces; and determining, according to the feedback, the piece identified by that identification information as the second target instruction information. In some embodiments, the conflict indication includes the instruction content of each conflicting piece and, preferably, the reason for the purpose conflict obtained by the virtual conference robot's semantic analysis. The indication may be sent only to a specific user, who may be the instruction-source user of the conflicting pieces, the creator of the conference group, or all users in the group. In some embodiments, after receiving the indication, the specific user selects one of the pieces, generates the corresponding feedback information including the selected piece's identification information, and returns it to the virtual conference robot, which then determines the identified piece as the second target instruction information. In some embodiments, if the indication was sent to multiple users in the group, the piece whose identification information appears most frequently across the returned feedback is determined as the second target instruction information.
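When the conflict indication goes to several users, the majority-vote resolution described above is a short fold over the returned feedback; the field name is illustrative.

    from collections import Counter

    def resolve_by_feedback(feedbacks):
        # each feedback names the identification info of one chosen instruction
        if not feedbacks:
            return None
        counts = Counter(f["chosen_instruction_id"] for f in feedbacks)
        winner_id, _ = counts.most_common(1)[0]
        return winner_id  # becomes the second target instruction information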
In some embodiments, the performing an instruction synthesis operation on the target instruction information and the one or more instruction information to generate instruction information to be executed comprises: acquiring the instruction source user identification information corresponding to the target instruction information and to the one or more instruction information; and performing, in combination with the instruction source user identification information, an instruction synthesis operation on the target instruction information and the one or more instruction information to generate the instruction information to be executed. In some embodiments, the virtual conference robot acquires the instruction source user identification information corresponding to the target instruction information and to each of the one or more instruction information, and then performs an instruction synthesis operation on the target instruction information and the one or more instruction information in combination with the instruction source user identification information corresponding to each instruction information, to generate the final instruction information to be executed. For example, the target instruction information is "starting tomorrow we will fly to City L one after another; everyone decide your own flight time, and the tickets will be booked once confirmed", the first instruction information issued by user b is "a flight tomorrow afternoon" with corresponding instruction source user identification information "user b", and the second instruction information issued by user c is "a flight the day after tomorrow morning" with corresponding instruction source user identification information "user c". By combining the instruction source user identification information corresponding to each instruction information, an instruction synthesis operation is performed on the target instruction information, the first instruction information, and the second instruction information to generate the final instruction information to be executed: "book a flight to City L tomorrow afternoon for user b, and book a flight to City L the day after tomorrow morning for user c". The virtual conference robot then executes this instruction information to be executed by performing the corresponding flight booking operations on the ticket booking web page or applet.
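A minimal Python sketch of the synthesis step follows. Plain string concatenation stands in here for the semantic instruction synthesis operation described above; the function name and message formats are assumptions, not part of the application:

```python
from typing import List, Tuple

def synthesize_instruction(
    target_instruction: str,
    per_user_instructions: List[Tuple[str, str]],  # (source user id, content)
) -> str:
    """Combine the target instruction information with each associated
    instruction information, tagged by its instruction source user
    identification information, into one instruction to be executed."""
    parts = [f"for {user}: {content}" for user, content in per_user_instructions]
    return f"{target_instruction} -> " + "; ".join(parts)

to_execute = synthesize_instruction(
    "book flights to City L",
    [("user b", "flight tomorrow afternoon"),
     ("user c", "flight the day after tomorrow morning")],
)
print(to_execute)
# -> "book flights to City L -> for user b: flight tomorrow afternoon;
#     for user c: flight the day after tomorrow morning"
```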
Fig. 2 is a diagram illustrating the structure of a network device for executing instructions through a virtual conference robot according to an embodiment of the present application, where the device includes a module 11, a module 12, and a module 13. The module 11 is configured to respond to a creation request for a multi-person audio/video conference group and create the conference group, where the conference group includes a plurality of users and a virtual conference robot. The module 12 is configured to acquire, through the virtual conference robot, audio/video input information sent by the plurality of users, identify a plurality of associated instruction information in the audio/video input information, and generate instruction information to be executed according to the plurality of instruction information. The module 13 is configured to execute the instruction information to be executed through the virtual conference robot.
The module 11 is configured to respond to a creation request for a multi-person audio/video conference group and create the conference group, where the conference group includes a plurality of users and a virtual conference robot. In some embodiments, the conference group includes, but is not limited to, a multi-person online audio conference group, a multi-person online video conference group, and a multi-person online audio and video conference group. In some embodiments, the creation request includes user identification information corresponding to a plurality of users, and after creating the conference group according to the creation request, the server automatically joins the plurality of users corresponding to the user identification information to the conference group. Alternatively, after the conference group is created, the creator of the conference group needs to actively invite the plurality of users to join it, where an invited user either joins the conference group immediately upon being invited or joins only after confirming the invitation. Alternatively, after the conference group is created, the plurality of users need to actively apply to join it, where an applying user either joins the conference group immediately upon applying or joins only after the creator of the conference group approves the application. In some embodiments, in response to a creation request for a multi-person audio/video conference group, the conference group and a virtual conference robot corresponding to the conference group are created, where the conference group includes a plurality of users and the virtual conference robot. In some embodiments, in response to a creation request for a multi-person audio/video conference group, the conference group is created, the conference group including a plurality of users; then, in response to a request to add a virtual conference robot to the conference group, a virtual conference robot corresponding to the conference group is created and added to the conference group. In some embodiments, the virtual conference robot is a virtual user role that does not correspond to a real person in the conference group; the virtual conference robot may acquire the audio/video input information sent by the plurality of users in the conference group, identify a plurality of associated instruction information from the audio/video input information, generate final instruction information to be executed according to the plurality of instruction information, and execute it.
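The creation flows described above can be summarized in a short Python sketch. This is illustrative only; the JoinPolicy names, the data classes, and the robot placeholder string are assumptions, not part of the application:

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Set

class JoinPolicy(Enum):
    AUTO_JOIN = auto()       # server joins the listed users immediately
    INVITE_CONFIRM = auto()  # users join only after accepting an invite
    APPLY_APPROVE = auto()   # users apply; the creator must approve

@dataclass
class ConferenceGroup:
    creator: str
    members: Set[str] = field(default_factory=set)
    robot: str = "virtual_conference_robot"  # stand-in for the robot role

def create_conference_group(creator: str, invited: List[str],
                            policy: JoinPolicy) -> ConferenceGroup:
    """Create the conference group and its virtual conference robot.
    Under AUTO_JOIN the users named in the creation request are added
    immediately; under the other policies they join later."""
    group = ConferenceGroup(creator=creator, members={creator})
    if policy is JoinPolicy.AUTO_JOIN:
        group.members.update(invited)
    return group

group = create_conference_group("user a", ["user b", "user c"],
                                JoinPolicy.AUTO_JOIN)
print(sorted(group.members))  # -> ['user a', 'user b', 'user c']
```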
The module 12 is configured to acquire, through the virtual conference robot, audio/video input information sent by the plurality of users, identify a plurality of associated instruction information in the audio/video input information, and generate instruction information to be executed according to the plurality of instruction information. In some embodiments, the audio/video input information may include only audio input information, or both audio input information and video input information. In some embodiments, the virtual conference robot may directly determine the video source user identification information corresponding to the video input information respectively sent by each user. In some embodiments, if the virtual conference robot acquires the audio input information respectively sent by each user before audio synthesis, the audio source user identification information corresponding to each piece of audio input information may be determined directly; if the virtual conference robot acquires audio input information after audio synthesis, the audio source user identification information corresponding to each piece of audio content needs to be determined from the synthesized audio information according to voiceprint feature information of each user pre-stored in the server. In some embodiments, the virtual conference robot identifies a plurality of semantically related instruction information from the acquired audio/video input information, or identifies a plurality of instruction information that are semantically related and whose issue time intervals are smaller than or equal to a predetermined time interval; the plurality of instruction information may originate from the same user or from different users. The instruction information may be identified from the audio input information, for example, by directly determining a certain piece of audio content in the audio input information as instruction information; or the instruction information may be identified from the video input information, for example, by determining the instruction information issued by a user from the image information of the video input information according to the user's gestures, body motions, expressions, and the like. For example, the server may store in advance the instruction information corresponding to each gesture, body motion, and expression; if a corresponding gesture, body motion, or expression is recognized in a video picture of the video input information through image recognition technology, the corresponding instruction information can be determined as instruction information issued by that user. In some embodiments, the instruction information to be executed is generated according to the plurality of identified instruction information. For example, the instruction information issued by user A is "book a private room at P restaurant at 6 pm", the instruction information issued by user B is "pre-order one dish J", and the instruction information issued by user C is "pre-order one dish K", so the instruction information to be executed "book a private room at P restaurant at 6 pm, and pre-order one dish J and one dish K" can be generated.
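Where only synthesized (mixed) audio is available, the voiceprint-based attribution described above can be sketched as follows. This is an illustrative Python fragment, not part of the application: the embedding representation, the cosine similarity measure, and the example vectors are assumptions standing in for whatever voiceprint features the server actually stores.

```python
import numpy as np

def attribute_segments(segment_embeddings: np.ndarray,
                       user_voiceprints: dict) -> list:
    """For synthesized audio, attribute each audio segment to the enrolled
    user whose pre-stored voiceprint embedding is most similar (cosine
    similarity); this recovers the audio source user identification
    information per segment."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    labels = []
    for emb in segment_embeddings:
        labels.append(max(user_voiceprints,
                          key=lambda u: cosine(emb, user_voiceprints[u])))
    return labels

# Hypothetical two-dimensional "voiceprints" for two enrolled users:
prints = {"user a": np.array([1.0, 0.0]), "user b": np.array([0.0, 1.0])}
segments = np.array([[0.9, 0.1], [0.2, 0.8]])
print(attribute_segments(segments, prints))  # -> ['user a', 'user b']
```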
In some embodiments, the instruction information to be executed is generated according to the plurality of instruction information in combination with the instruction source user identification information corresponding to each instruction information (i.e., the audio/video source user identification information of the audio/video input information corresponding to that instruction information). For example, the instruction information issued by user A is "book a private room at P restaurant at 6 pm", the instruction information issued by user B is "pre-order one dish J", and the instruction information issued by user C is "pre-order one dish K", so the instruction information to be executed "book a private room at P restaurant at 6 pm, pre-order one dish J for user B, and pre-order one dish K for user C" can be generated.
The module 13 is configured to execute the instruction information to be executed through the virtual conference robot. For example, the instruction information to be executed is "book a private room at P restaurant at 6 pm, and pre-order one dish J and one dish K"; according to the instruction information to be executed, the virtual conference robot first searches for the web page or applet corresponding to P restaurant, and then performs the corresponding room reservation and dish pre-ordering operations on that web page or applet.
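A rough dispatch of the instruction information to be executed might look like the following Python sketch. It is illustrative only: the service registry, the substring matching, and the printing callable are hypothetical stand-ins for locating and driving the restaurant's actual web page or applet.

```python
def execute_instruction(to_execute: str, service_registry: dict) -> None:
    """Very rough dispatch: find the service (web page / applet) named in
    the instruction and invoke a booking callable with the full details.
    Real execution would drive the page or applet itself."""
    for service_name, book in service_registry.items():
        if service_name in to_execute:
            book(to_execute)
            return
    raise LookupError("no service matched the instruction to be executed")

# Hypothetical registry: "P restaurant" maps to a callable that performs
# the room reservation and advance ordering on its page or applet.
registry = {"P restaurant": lambda details: print("booking:", details)}
execute_instruction(
    "book a private room at 6 pm at P restaurant, pre-order dish J and dish K",
    registry,
)
```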
According to the present application, a conference group including a plurality of users and a virtual conference robot can be created in response to a creation request for a multi-person audio/video conference group; the virtual conference robot can identify a plurality of associated instruction information from the audio/video input information sent by the plurality of users, and generate and execute instruction information to be executed according to the plurality of instruction information. In this way, the virtual conference robot generates one final instruction information to be executed from the plurality of instruction information respectively issued by the plurality of users in the audio/video conference group, which can greatly improve instruction execution efficiency and optimize the human-computer interaction experience.
In some embodiments, the module 11 is configured to: responding to a creating request aiming at a multi-person audio and video conference group, and creating the conference group and a virtual conference robot corresponding to the conference group, wherein the conference group comprises a plurality of users and the virtual conference robot. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the module 11 is configured to: responding to a creating request aiming at a multi-person audio and video conference group, and creating the conference group, wherein the conference group comprises a plurality of users; and responding to a request of adding a virtual conference robot aiming at the conference group, creating a virtual conference robot corresponding to the conference group, and adding the virtual conference robot into the conference group. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the module 12 includes a module 121 (not shown), a module 122 (not shown), and a module 123 (not shown). The module 121 is configured to acquire, through the virtual conference robot, audio/video input information sent by the plurality of users, and identify, from the audio/video input information, target instruction information of a first user among the plurality of users. The module 122 is configured to identify, through the virtual conference robot, one or more instruction information associated with the target instruction information from second audio/video input information sent by users other than the first user. The module 123 is configured to generate instruction information to be executed according to the target instruction information and the one or more instruction information through the virtual conference robot. Here, the specific implementations of the module 121, the module 122, and the module 123 are the same as or similar to the embodiments related to steps S121, S122, and S123 in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the module 121 is configured to: acquire, through the virtual conference robot, the audio/video input information sent by the plurality of users; if preset instruction triggering indication information is identified from first audio/video input information sent by the first user, identify and determine a second time point starting from a first time point corresponding to the instruction triggering indication information, and acquire the target instruction information of the first user from the first audio/video input information between the first time point and the second time point. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the identifying and determining a second time point from a first time point corresponding to the instruction triggering indication information includes: and if the preset instruction ending indication information is identified from the first audio and video input information sent by the first user after the first time point corresponding to the instruction triggering indication information, determining the time point corresponding to the instruction ending indication information as a second time point. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the identifying and determining a second time point from a first time point corresponding to the instruction triggering indication information includes: and after a first time point corresponding to the instruction triggering indication information, if no new audio content is acquired from the first audio and video input information within a preset time interval after a third time point is identified, determining the third time point as a second time point. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
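Taking the trigger-phrase, end-indication, and silence-timeout variants together, the extraction of the target instruction information between the first and second time points can be sketched as follows. This is illustrative Python only; the phrase strings, the five-second timeout, and the (timestamp, text) utterance format are assumptions, not part of the application:

```python
from typing import List, Optional, Tuple

# Each recognized utterance: (timestamp in seconds, text)
Utterance = Tuple[float, str]

def extract_target_instruction(
    utterances: List[Utterance],
    trigger: str = "robot, listen up",        # assumed trigger phrase
    end_marker: str = "robot, that is all",   # assumed end-indication phrase
    silence_timeout: float = 5.0,             # assumed timeout interval
) -> Optional[List[str]]:
    """Collect speech between the first time point (trigger recognized) and
    the second time point, which is either the end-indication phrase or the
    last utterance followed by `silence_timeout` seconds with no new audio
    content (the third time point becoming the second time point)."""
    collecting = False
    last_ts: Optional[float] = None
    collected: List[str] = []
    for ts, text in utterances:
        if not collecting:
            if trigger in text:
                collecting = True
                last_ts = ts
            continue
        if end_marker in text:
            return collected
        if last_ts is not None and ts - last_ts > silence_timeout:
            return collected
        collected.append(text)
        last_ts = ts
    return collected or None

speech = [(0.0, "robot, listen up"),
          (1.5, "book a private room at P restaurant"),
          (2.5, "at 6 pm tonight"),
          (12.0, "unrelated chatter")]
print(extract_target_instruction(speech))
# -> ['book a private room at P restaurant', 'at 6 pm tonight']
```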
In some embodiments, the module 122 includes a module 1221 (not shown) and a module 1222 (not shown). The module 1221 is configured to identify, through the virtual conference robot, at least one instruction information corresponding to other users from second audio/video input information sent by the other users except the first user. The module 1222 is configured to determine one or more instruction information associated with the target instruction information from the at least one instruction information. Here, the specific implementations of the module 1221 and the module 1222 are the same as or similar to the embodiments related to steps S1221 and S1222 in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the module 1222 is configured to: determine, from the at least one instruction information, one or more instruction information associated with the target instruction information and whose issue time interval compared to the target instruction information is smaller than or equal to a predetermined time interval. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the target instruction information includes one or more user identification information, and the module 1221 is configured to: identify, through the virtual conference robot, at least one instruction information corresponding to one or more users from second audio/video input information sent by the one or more users corresponding to the one or more user identification information. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the module 122 includes a module 1221 (not shown), configured to output, through the virtual conference robot, instruction issue prompt information corresponding to the target instruction information in the conference group according to the target instruction information, and identify, from second audio/video input information sent by users other than the first user among the plurality of users, one or more instruction information output by the other users for the instruction issue prompt information. Here, the specific implementation of the module 1221 is the same as or similar to the embodiment related to step S1221 in fig. 1, and therefore is not described again, and is included herein by reference.
In some embodiments, the module 1221 includes a module 12211 (not shown) and a module 12212 (not shown). The module 12211 is configured to acquire, through the virtual conference robot, identification information of at least one other user from the target instruction information. The module 12212 is configured to output, through the virtual conference robot, instruction issue prompt information corresponding to the at least one other user in the conference group according to the target instruction information, and identify, according to the identification information, one or more instruction information output by the at least one other user for the instruction issue prompt information from second audio/video input information sent by the at least one other user. Here, the specific implementations of the module 12211 and the module 12212 are the same as or similar to the embodiments related to steps S12211 and S12212 in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the module 12212 is configured to: for each user among the at least one other user, output, through the virtual conference robot, instruction issue prompt information corresponding to that user in the conference group according to the target instruction information, and identify, from second audio/video input information sent by that user, the instruction information output by that user for the instruction issue prompt information. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the module 123 is configured to: perform semantic analysis on the target instruction information and the one or more instruction information through the virtual conference robot to obtain an instruction purpose corresponding to each instruction information; and perform an instruction synthesis operation on the target instruction information and the one or more instruction information according to the instruction purpose corresponding to each instruction information to generate the instruction information to be executed. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the apparatus is further configured to: and if the instruction purpose corresponding to the first instruction information in the one or more instruction information comprises the instruction purpose corresponding to the second instruction information, discarding the second instruction information. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the apparatus is further configured to: if the priority of the instruction purpose corresponding to the first instruction information in the one or more instruction information is lower than the priority of the instruction purpose corresponding to the second instruction information, discard the second instruction information. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the apparatus is further configured to: if at least one of the one or more instruction information has an instruction purpose conflict, determining second target instruction information from the at least one instruction information, and discarding other instruction information except the second target instruction information in the at least one instruction information. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the determining the second target instruction information from the at least one instruction information comprises: determining a degree of association between each of the at least one instruction information and the target instruction information; and determining the instruction information with the highest corresponding relevance degree from the at least one instruction information as second target instruction information. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the determining the second target instruction information from the at least one instruction information comprises: performing feasibility analysis on each instruction information in the at least one instruction information to obtain feasibility corresponding to each instruction information; and determining the corresponding instruction information with the highest feasibility as second target instruction information from the at least one instruction information. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the determining the second target instruction information from the at least one instruction information comprises: generating instruction conflict indication information about the at least one instruction information and sending the instruction conflict indication information to a specific user in the conference group; receiving feedback information which is returned by the specific user and corresponds to the instruction conflict indication information, wherein the feedback information comprises identification information of one instruction information in the at least one instruction information; and according to the feedback information, determining instruction information corresponding to the identification information from the at least one instruction information as second target instruction information. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
In some embodiments, the performing an instruction synthesis operation on the target instruction information and the one or more instruction information to generate instruction information to be executed includes: acquiring the target instruction information and instruction source user identification information corresponding to the one or more instruction information; and combining the instruction source user identification information, executing instruction synthesis operation on the target instruction information and the one or more instruction information, and generating instruction information to be executed. Here, the related operations are the same as or similar to those of the embodiment shown in fig. 1, and therefore are not described again, and are included herein by reference.
FIG. 3 illustrates an exemplary system that can be used to implement the various embodiments described in this application.
In some embodiments, as shown in FIG. 3, the system 300 can be implemented as any of the devices in the various embodiments described. In some embodiments, system 300 may include one or more computer-readable media (e.g., system memory 315 or NVM/storage 320) having instructions, and one or more processors (e.g., processor(s) 305) coupled with the one or more computer-readable media and configured to execute the instructions to implement modules that perform the actions described in this application.
For one embodiment, system control module 310 may include any suitable interface controllers to provide any suitable interface to at least one of processor(s) 305 and/or any suitable device or component in communication with system control module 310.
The system control module 310 may include a memory controller module 330 to provide an interface to the system memory 315. Memory controller module 330 may be a hardware module, a software module, and/or a firmware module.
System memory 315 may be used, for example, to load and store data and/or instructions for system 300. For one embodiment, system memory 315 may include any suitable volatile memory, such as suitable DRAM. In some embodiments, the system memory 315 may include double data rate fourth-generation synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, system control module 310 may include one or more input/output (I/O) controllers to provide an interface to NVM/storage 320 and communication interface(s) 325.
For example, NVM/storage 320 may be used to store data and/or instructions. NVM/storage 320 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).
NVM/storage 320 may include storage resources that are physically part of the device on which system 300 is installed or may be accessed by the device and not necessarily part of the device. For example, NVM/storage 320 may be accessible over a network via communication interface(s) 325.
Communication interface(s) 325 may provide an interface for system 300 to communicate over one or more networks and/or with any other suitable device. System 300 may wirelessly communicate with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols.
For one embodiment, at least one of the processor(s) 305 may be packaged together with logic for one or more controller(s) (e.g., memory controller module 330) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be packaged together with logic for one or more controller(s) of the system control module 310 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic for one or more controller(s) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic for one or more controller(s) of the system control module 310 to form a system on a chip (SoC).
In various embodiments, system 300 may be, but is not limited to being: a server, a workstation, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.). In various embodiments, system 300 may have more or fewer components and/or different architectures. For example, in some embodiments, system 300 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and speakers.
The present application also provides a computer readable storage medium having stored thereon computer code which, when executed, performs a method as in any one of the preceding.
The present application also provides a computer program product, which when executed by a computer device, performs the method of any of the preceding claims.
The present application further provides a computer device, comprising:
one or more processors;
a memory for storing one or more computer programs;
the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method of any preceding claim.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Those skilled in the art will appreciate that the form in which the computer program instructions reside on a computer-readable medium includes, but is not limited to, source files, executable files, installation package files, and the like, and that the manner in which the computer program instructions are executed by a computer includes, but is not limited to: the computer directly executes the instruction, or the computer compiles the instruction and then executes the corresponding compiled program, or the computer reads and executes the instruction, or the computer reads and installs the instruction and then executes the corresponding installed program. Computer-readable media herein can be any available computer-readable storage media or communication media that can be accessed by a computer.
Communication media includes media by which communication signals, including, for example, computer readable instructions, data structures, program modules, or other data, are transmitted from one system to another. Communication media may include conductive transmission media such as cables and wires (e.g., fiber optics, coaxial, etc.) and wireless (non-conductive transmission) media capable of propagating energy waves such as acoustic, electromagnetic, RF, microwave, and infrared. Computer readable instructions, data structures, program modules, or other data may be embodied in a modulated data signal, for example, in a wireless medium such as a carrier wave or similar mechanism such as is embodied as part of spread spectrum techniques. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. The modulation may be analog, digital or hybrid modulation techniques.
By way of example, and not limitation, computer-readable storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. For example, computer-readable storage media include, but are not limited to, volatile memory such as random access memory (RAM, DRAM, SRAM); and non-volatile memory such as flash memory, various read-only memories (ROM, PROM, EPROM, EEPROM), magnetic and ferromagnetic/ferroelectric memories (MRAM, FeRAM); and magnetic and optical storage devices (hard disk, tape, CD, DVD); or other now known media or later developed that can store computer-readable information/data for use by a computer system.
An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (24)

1. A method for executing instructions through a virtual conference robot, applied to a network device, wherein the method comprises the following steps:
responding to a creating request aiming at a multi-person audio and video conference group, and creating the conference group, wherein the conference group comprises a plurality of users and a virtual conference robot;
acquiring audio and video input information sent by the users through the virtual conference robot, identifying a plurality of associated instruction information in the audio and video input information, and generating instruction information to be executed according to the instruction information;
and executing the instruction information to be executed through the virtual conference robot.
2. The method of claim 1, wherein the creating the conference group in response to a create request for a multi-person audio video conference group comprises:
responding to a creating request aiming at a multi-person audio and video conference group, and creating the conference group and a virtual conference robot corresponding to the conference group, wherein the conference group comprises a plurality of users and the virtual conference robot.
3. The method of claim 1, wherein the creating the conference group in response to a create request for a multi-person audio video conference group comprises:
responding to a creating request aiming at a multi-person audio and video conference group, and creating the conference group, wherein the conference group comprises a plurality of users;
and responding to a request of adding a virtual conference robot aiming at the conference group, creating a virtual conference robot corresponding to the conference group, and adding the virtual conference robot into the conference group.
4. The method according to claim 1, wherein the acquiring, by the virtual conference robot, audio and video input information sent by the plurality of users, identifying a plurality of associated instruction information in the audio and video input information, and generating instruction information to be executed according to the plurality of instruction information includes:
acquiring audio and video input information sent by the multiple users through the virtual conference robot, and identifying target instruction information of a first user in the multiple users from the audio and video input information;
identifying one or more instruction information associated with the target instruction information from second audio and video input information sent by other users except the first user in the plurality of users through the virtual conference robot;
and generating instruction information to be executed through the virtual conference robot according to the target instruction information and the one or more instruction information.
5. The method of claim 4, wherein the obtaining, by the virtual conference robot, audio and video input information sent by the plurality of users, and identifying target instruction information of a first user of the plurality of users from the audio and video input information comprises:
the method comprises the steps that audio and video input information sent by a plurality of users is obtained through the virtual conference robot, if preset instruction triggering indication information is identified from first audio and video input information sent by a first user, a second time point is identified and determined from a first time point corresponding to the instruction triggering indication information, and target instruction information of the first user is obtained from the first audio and video input information between the first time point and the second time point.
6. The method of claim 5, wherein the identifying and determining a second time point from a first time point corresponding to the instruction triggering indication information comprises:
and if the preset instruction ending indication information is identified from the first audio and video input information sent by the first user after the first time point corresponding to the instruction triggering indication information, determining the time point corresponding to the instruction ending indication information as a second time point.
7. The method of claim 5, wherein the identifying and determining a second time point from a first time point corresponding to the instruction triggering indication information comprises:
and after a first time point corresponding to the instruction triggering indication information, if no new audio content is acquired from the first audio and video input information within a preset time interval after a third time point is identified, determining the third time point as a second time point.
8. The method of claim 5, wherein the identifying, by the virtual conference robot, one or more instruction information associated with the target instruction information from second audiovisual input information sent by users of the plurality of users other than the first user comprises:
identifying at least one instruction information corresponding to other users from second audio and video input information sent by other users except the first user in the plurality of users through the virtual conference robot;
determining one or more instruction information associated with the target instruction information from the at least one instruction information.
9. The method of claim 8, wherein said determining one or more instruction information associated with the target instruction information from the at least one instruction information comprises:
determining one or more instruction information associated with the target instruction information and having an issue time interval less than or equal to a predetermined time interval compared to the target instruction information from the at least one instruction information.
10. The method of claim 8, wherein if the target instruction information includes one or more user identification information;
the identifying, by the virtual conference robot, at least one instruction information corresponding to another user from second audio/video input information sent by another user of the plurality of users except the first user includes:
and identifying at least one instruction information corresponding to one or more users from second audio and video input information sent by one or more users corresponding to the one or more user identification information through the virtual conference robot.
11. The method of claim 5, wherein the identifying, by the virtual conference robot, one or more instruction information associated with the target instruction information from second audiovisual input information sent by users of the plurality of users other than the first user comprises:
and outputting instruction issuing prompt information corresponding to the target instruction information in the conference group through the virtual conference robot according to the target instruction information, and identifying one or more instruction information output by other users aiming at the instruction issuing prompt information from second audio and video input information sent by other users except the first user in the plurality of users.
12. The method of claim 11, wherein the outputting, by the virtual conference robot, instruction issue prompt information corresponding to the target instruction information in the conference group according to the target instruction information, and identifying one or more instruction information output by other users in relation to the instruction issue prompt information from second audio and video input information sent by other users in the plurality of users except the first user comprises:
acquiring identification information of at least one other user from the target instruction information through the virtual conference robot;
and outputting instruction issuing prompt information corresponding to the at least one other user in the conference group through the virtual conference robot according to the target instruction information, and identifying one or more instruction information output by the at least one other user aiming at the instruction issuing prompt information from second audio and video input information sent by the at least one other user according to the identification information.
13. The method according to claim 12, wherein the outputting, by the virtual conference robot, instruction issue prompt information corresponding to the at least one other user in the conference group according to the target instruction information, and identifying, according to the identification information, one or more instruction information output by the at least one other user for the instruction issue prompt information from second audio/video input information sent by the at least one other user comprises:
and for each user in the at least one other user, outputting instruction issuing prompt information corresponding to the user in the conference group through the virtual conference robot according to the target instruction information, and identifying the instruction information output by the user aiming at the instruction issuing prompt information from second audio and video input information sent by the user.
14. The method of claim 4, wherein the generating, by the virtual conference robot, instruction information to be executed from the target instruction information and the one or more instruction information comprises:
semantic analysis is carried out on the target instruction information and the one or more instruction information through the virtual conference robot, and an instruction purpose corresponding to each instruction information is obtained;
and according to the instruction purpose corresponding to each instruction information, executing instruction synthesis operation on the target instruction information and the one or more instruction information to generate instruction information to be executed.
15. The method of claim 14, wherein the method further comprises:
and if the instruction purpose corresponding to the first instruction information in the one or more instruction information comprises the instruction purpose corresponding to the second instruction information, discarding the second instruction information.
16. The method of claim 14, wherein the method further comprises:
and if the priority of the instruction purpose corresponding to the first instruction information in the one or more instruction information is lower than the priority of the instruction purpose corresponding to the second instruction information, discarding the second instruction information.
17. The method of claim 14, wherein the method further comprises:
if at least one of the one or more instruction information has an instruction purpose conflict, determining second target instruction information from the at least one instruction information, and discarding other instruction information except the second target instruction information in the at least one instruction information.
18. The method of claim 17, wherein said determining a second target instruction information from said at least one instruction information comprises:
determining a degree of association between each of the at least one instruction information and the target instruction information;
and determining the instruction information with the highest corresponding relevance degree from the at least one instruction information as second target instruction information.
19. The method of claim 17, wherein said determining a second target instruction information from said at least one instruction information comprises:
performing feasibility analysis on each instruction information in the at least one instruction information to obtain feasibility corresponding to each instruction information;
and determining the corresponding instruction information with the highest feasibility as second target instruction information from the at least one instruction information.
20. The method of claim 17, wherein said determining a second target instruction information from said at least one instruction information comprises:
generating instruction conflict indication information about the at least one instruction information and sending the instruction conflict indication information to a specific user in the conference group;
receiving feedback information which is returned by the specific user and corresponds to the instruction conflict indication information, wherein the feedback information comprises identification information of one instruction information in the at least one instruction information;
and according to the feedback information, determining instruction information corresponding to the identification information from the at least one instruction information as second target instruction information.
21. The method of claim 14, wherein the performing an instruction synthesis operation on the target instruction information and the one or more instruction information to generate instruction information to be executed comprises:
acquiring the target instruction information and instruction source user identification information corresponding to the one or more instruction information;
and combining the instruction source user identification information, executing instruction synthesis operation on the target instruction information and the one or more instruction information, and generating instruction information to be executed.
22. An apparatus for executing instructions by a virtual conference robot, wherein the apparatus comprises:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to perform the method of any one of claims 1 to 21.
23. A computer-readable medium storing instructions that, when executed, cause a system to perform the operations of any of the methods of claims 1-21.
24. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method according to any one of claims 1 to 21 when executed by a processor.
CN202011594013.8A 2020-12-29 2020-12-29 Method, device and computer readable medium for executing instructions by virtual conference robot Active CN112788004B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011594013.8A CN112788004B (en) 2020-12-29 2020-12-29 Method, device and computer readable medium for executing instructions by virtual conference robot
PCT/CN2021/125284 WO2022142618A1 (en) 2020-12-29 2021-10-21 Method and device for executing instruction by means of virtual conference robot

Publications (2)

Publication Number Publication Date
CN112788004A true CN112788004A (en) 2021-05-11
CN112788004B CN112788004B (en) 2023-05-09

Family

ID=75753201

Country Status (2)

Country Link
CN (1) CN112788004B (en)
WO (1) WO2022142618A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022142618A1 (en) * 2020-12-29 2022-07-07 上海掌门科技有限公司 Method and device for executing instruction by means of virtual conference robot

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107831903A (en) * 2017-11-24 2018-03-23 科大讯飞股份有限公司 The man-machine interaction method and device that more people participate in
CN109951519A (en) * 2019-01-22 2019-06-28 视联动力信息技术股份有限公司 A kind of control method and device of convention business
US20200092339A1 (en) * 2018-09-17 2020-03-19 International Business Machines Corporation Providing device control instructions for increasing conference participant interest based on contextual data analysis
CN111258528A (en) * 2018-12-03 2020-06-09 华为技术有限公司 Voice user interface display method and conference terminal

Also Published As

Publication number Publication date
CN112788004B (en) 2023-05-09
WO2022142618A1 (en) 2022-07-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant