CN115249359B - Explanation method, robot, electronic device, and storage medium - Google Patents

Explanation method, robot, electronic device, and storage medium

Info

Publication number
CN115249359B
CN115249359B CN202111089343.6A
Authority
CN
China
Prior art keywords
explanation
scene
audience
target
robot
Prior art date
Legal status
Active
Application number
CN202111089343.6A
Other languages
Chinese (zh)
Other versions
CN115249359A (en)
Inventor
王伟健
王军锋
Current Assignee
Cloudminds Beijing Technologies Co Ltd
Original Assignee
Cloudminds Beijing Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Cloudminds Beijing Technologies Co Ltd filed Critical Cloudminds Beijing Technologies Co Ltd
Priority to CN202111089343.6A priority Critical patent/CN115249359B/en
Priority to PCT/CN2022/119161 priority patent/WO2023040992A1/en
Publication of CN115249359A publication Critical patent/CN115249359A/en
Application granted
Publication of CN115249359B publication Critical patent/CN115249359B/en

Classifications

    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09F DISPLAYING; ADVERTISING; SIGNS; LABELS OR NAME-PLATES; SEALS
    • G09F 25/00 Audible advertising
    • G09F 2025/005 Message recorded in a memory device

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Manipulator (AREA)

Abstract

The application relates to the technical field of robot control, and discloses an explanation method, an explanation device, an electronic device, and a storage medium. The explanation method includes: acquiring real-time audio and video data within a preset spatial range, where the preset spatial range includes at least one audience; determining the current explanation scene according to the real-time audio and video data, where explanation scenes include a guide explanation scene and a non-guide explanation scene; and determining the explanation mode corresponding to the current explanation scene according to a preset correspondence between explanation scenes and explanation modes, and explaining for the audience in the corresponding explanation mode. The robot identifies the current explanation scene from the real-time audio and video data and explains in the mode that matches it, so that the robot can deal flexibly with complex scenes, improving the interactivity of the explanation process and the audience's user experience.

Description

Explanation method, robot, electronic device, and storage medium
Technical Field
Embodiments of the application relate to the technical field of robot control, and in particular to an explanation method, a robot, an electronic device, and a storage medium.
Background
With the continuous development of communication technology and artificial intelligence, intelligent robot technology is steadily maturing. Robots are coming ever closer to real people in their degree of personification and in the fluency of their communication, and using robots to take over part of human work has become a mainstream trend, for example, deploying intelligent robots to give knowledge explanations to tourists.
However, when communicating with an audience or providing explanation services for it, a robot facing a complex scene interacts poorly with the audience, has limited ability to respond to the scene, and delivers a relatively poor user experience.
Disclosure of Invention
The main purpose of the embodiments of the application is to provide an explanation method, a robot, an electronic device, and a storage medium that enable a robot to change its explanation mode according to the explanation scene during an explanation, improving the robot's ability to respond to complex scenes, the flexibility of its explanation, and the audience experience during the explanation process.
In order to achieve the above object, an embodiment of the present application provides an explanation method applied to a robot, the method including: acquiring real-time audio and video data within a preset spatial range, where the preset spatial range includes at least one audience; determining the current explanation scene according to the real-time audio and video data, where explanation scenes include a guide explanation scene and a non-guide explanation scene; and determining the explanation mode corresponding to the current explanation scene according to a preset correspondence between explanation scenes and explanation modes, and explaining for the audience in the corresponding explanation mode.
In order to achieve the above object, an embodiment of the present application further provides a robot, including: an acquisition module for acquiring real-time audio and video data within a preset spatial range, where the preset spatial range includes at least one audience; a determining module for determining the current explanation scene according to the real-time audio and video data, where explanation scenes include a guide explanation scene and a non-guide explanation scene; and an explanation module for determining the explanation mode corresponding to the current explanation scene according to a preset correspondence between explanation scenes and explanation modes, and explaining for the audience in the corresponding explanation mode.
In order to achieve the above object, an embodiment of the present application further provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the explanation method described above.
To achieve the above object, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the explanation method described above.
According to the explanation method provided by the embodiments of the application, the robot acquires audio and video data within a preset spatial range to analyze the current explanation scene and, following the correspondence between explanation scenes and explanation modes, explains for the audience in the mode that matches that scene. Analyzing the explanation scene from the acquired audio and video data lets the robot accurately determine its current explanation environment, so that a suitable explanation mode can be adopted. Explaining in the mode appropriate to the determined environment allows the robot to explain and interact with the audience in a more human way, according to the actual situation and needs, which greatly improves the flexibility of the explanation process and the robot's ability to cope with complex environments, gives the audience a better experience, and increases the practicality of robot explanation.
Drawings
One or more embodiments are illustrated by the corresponding figures in the drawings, which are not meant to be limiting.
FIG. 1 is a flow chart of an explanation method in an embodiment of the invention;
FIG. 2 is a schematic diagram of a viewfinder image before positioning target tracking in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a viewfinder image after positioning target tracking in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a viewfinder image before positioning target tracking in another embodiment of the present invention;
FIG. 5 is a schematic diagram of a viewfinder image after positioning target tracking in another embodiment of the present invention;
FIG. 6 is a schematic diagram of a robot in another embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device in another embodiment of the present invention.
Detailed Description
As can be seen from the background, when a robot communicates with an audience or provides explanation services for it, the robot interacts poorly with the audience in complex scenes, has limited ability to respond to such scenes, and delivers a relatively poor user experience. How to improve the robot's ability to respond to complex scenes, and the user experience, is therefore a problem in urgent need of a solution.
In order to solve the above problem, an embodiment of the present application provides an explanation method applied to a robot, including: acquiring real-time audio and video data within a preset spatial range, where the preset spatial range includes at least one audience; determining the current explanation scene according to the real-time audio and video data, where explanation scenes include a guide explanation scene and a non-guide explanation scene; and determining the explanation mode corresponding to the current explanation scene according to a preset correspondence between explanation scenes and explanation modes, and explaining for the audience in the corresponding explanation mode.
According to the explanation method provided by the embodiments of the application, the robot analyzes the current explanation scene from the audio and video data acquired within the preset spatial range and, following the correspondence between explanation scenes and explanation modes, explains for the audience in the mode that matches that scene. Analyzing the explanation scene from the acquired real-time audio and video data lets the robot accurately determine the explanation environment at its current position, so that a suitable explanation mode can be adopted. Explaining in the mode appropriate to the determined environment allows the robot to explain and interact with the audience in a more human way, according to the actual situation and needs, which greatly improves the flexibility of the explanation process and the robot's ability to cope with complex environments, gives the audience a better user experience, and increases the practicality of robot explanation.
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments in order to provide a better understanding of the application; the technical solutions claimed herein, however, can be implemented without these details, and with various changes and modifications based on the following embodiments. The division into the following embodiments is made for convenience of description and should not limit the specific implementations of the application; the embodiments may be combined and cross-referenced with one another where no contradiction arises.
Implementation details of the explanation method applied to the robot are described below in conjunction with specific embodiments. The following details are provided only for ease of understanding and are not necessary for implementing the solution.
A first aspect of the embodiments of the present invention provides an explanation method applied to a robot. The method can be applied to many kinds of robots and terminals; this embodiment takes a single humanoid robot as an example. The flow of the explanation method, referring to FIG. 1, includes the following steps:
Step 101: acquiring real-time audio and video data within a preset spatial range.
Specifically, the robot enters a working state after an explanation task has been set, and starts the explanation service once the preset explanation time is reached. The robot may acquire real-time audio and video data within a preset spatial range through a head camera that serves as the robot's eyes, where the preset spatial range includes at least one audience; for example, according to its head orientation, the robot collects the sound and video of one or more audience members within its line of sight. In practical applications, the camera may also be arranged elsewhere, for example on the chest. Collecting real-time audio and video data within the preset spatial range makes it convenient to judge accurately the explanation environment at the current position and any changes in that environment.
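As a minimal sketch of this acquisition step, and only as an assumption about the implementation, the snippet below grabs one video frame and a short audio window; the library choices (OpenCV, sounddevice), the camera index, and the four-channel microphone array are all illustrative, since the embodiment only requires a head camera and a sound collection device.

    # Sketch only: cv2/sounddevice, camera index 0 and the 4-channel
    # microphone array are assumptions, not part of the embodiment.
    import cv2
    import sounddevice as sd

    def acquire_realtime_av(camera_index=0, window_s=1.0, samplerate=16000):
        cap = cv2.VideoCapture(camera_index)        # head camera ("eyes")
        ok, frame = cap.read()                      # one video frame
        audio = sd.rec(int(window_s * samplerate),  # short audio window
                       samplerate=samplerate, channels=4)
        sd.wait()                                   # block until recorded
        cap.release()
        return (frame if ok else None), audio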
Step 102: determining the current explanation scene according to the real-time audio and video data.
Specifically, after the camera acquires real-time audio and video data within the preset range, the current explanation scene is determined from the acquired data, where explanation scenes include a guide explanation scene and a non-guide explanation scene. The acquired audio and video data are analyzed: the audience's face images are extracted and recognized by a face recognition algorithm, the loudness of the sound from different directions is calculated, and the current explanation scene is determined from the face-image analysis and the loudness calculation. Obtaining the current explanation scene by analyzing the acquired real-time audio and video data gives an accurate determination of the robot's explanation environment within a complex setting, making it convenient for the robot to later explain in a suitable mode and improving its ability to cope with complex environments.
In one example, the robot determining the current explanation scene according to the real-time audio and video data includes: detecting whether a target audience is in a speaking state, where a target audience is an audience member whose face proportion within the face image is larger than a preset threshold; judging the current explanation scene to be a guide explanation scene when the target audience is in the speaking state; and judging it to be a non-guide explanation scene when the target audience is not. In an actual explanation task, among the many audience members facing the robot there may be a guide who is speaking. When such a guide is present, the robot should focus on interacting with the guide and cooperate with the guide's explanation; when no speaking guide is present, the robot should focus on interacting with all audience members, so as to explain for as many of them as possible. Therefore, after acquiring the real-time audio and video data, the robot analyzes it, identifies as target audiences the members whose face proportion in the face image exceeds the preset threshold, and detects whether a target audience is in the speaking state. If a target audience is speaking, it is judged that a speaking guide is present among the audience and the current explanation scene is a guide explanation scene; if not, it is judged that no speaking guide is present and the current scene is a non-guide explanation scene. Judging the current explanation environment by detecting whether a target audience is speaking identifies that environment accurately, makes it convenient to explain in a suitable mode, and improves the robot's adaptability to complex environments.
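A minimal sketch of this decision logic, under the assumption that face detection and a speaking-state predicate are available as helpers; the threshold value and all names are illustrative rather than taken from the patent.

    # Sketch: FACE_RATIO_THRESHOLD and the Face fields are assumptions.
    from dataclasses import dataclass

    FACE_RATIO_THRESHOLD = 0.05  # assumed: face area / image area

    @dataclass
    class Face:
        ratio: float          # face proportion in the face image
        direction_deg: float  # bearing of this audience member

    def classify_scene(faces, is_speaking):
        """Guide scene iff at least one target audience is speaking."""
        targets = [f for f in faces if f.ratio > FACE_RATIO_THRESHOLD]
        if any(is_speaking(f) for f in targets):
            return "guide"
        return "non-guide"  # also covers "no target audience at all"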
In addition, the preset threshold for the face proportion in the target audience's face image can be changed according to the robot's specific working environment and actual needs. The acquired real-time audio and video data may contain one or more target audiences whose face proportion exceeds the preset threshold, and the current explanation scene can be judged a guide explanation scene as long as at least one target audience is speaking. Moreover, after the robot acquires the real-time audio and video data within the preset range and judges the current explanation scene, if the state of the audience in that data changes, the robot can promptly re-determine the scene, so that the explanation proceeds more flexibly.
In another example, the robot detecting whether the target audience is in a speaking state includes: detecting whether the volume from the direction in which the target audience is located is larger than a preset threshold; judging the target audience to be speaking when the volume exceeds the threshold; and judging it not to be speaking otherwise. When judging the target audience's speaking state, the robot further analyzes the audio data according to the target audience's position in the video data to obtain the volume from the direction of the target audience, and checks whether that volume exceeds the preset threshold: if it does, the target audience is judged to be speaking, and if it does not, the target audience is judged not to be speaking. Judging the speaking state from the volume in the target audience's direction improves the efficiency of speaking-state recognition while keeping the judgment accurate.
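A sketch of this volume check, assuming a microphone array whose channel bearings are known, so that the channel nearest the target audience's direction stands in for "the volume of the direction in which the target audience is located"; the RMS measure and the threshold value are assumptions.

    import numpy as np

    VOLUME_THRESHOLD = 0.02  # assumed normalized RMS threshold

    def speaking_in_direction(direction_deg, mic_bearings_deg, mic_signals,
                              threshold=VOLUME_THRESHOLD):
        """mic_signals: array of shape (channels, samples). Pick the channel
        whose bearing is closest to the target audience (with wrap-around)
        and compare its RMS level against the preset threshold."""
        diffs = (np.asarray(mic_bearings_deg) - direction_deg + 180) % 360 - 180
        nearest = int(np.argmin(np.abs(diffs)))
        rms = float(np.sqrt(np.mean(np.square(mic_signals[nearest]))))
        return rms > threshold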
In addition, judging the speaking state purely from the volume in the target audience's direction suits relatively quiet scenes. In a noisy environment, or when the volume within the preset range is excessive, the robot can also bring the image data in the video into the analysis: it detects whether the target audience shows the limb movements or pose changes that usually accompany speech, and judges the target audience to be speaking only when the volume from the target audience's direction exceeds the preset threshold and such movements or pose changes are present, so that the speaking state is judged as accurately as possible even in a noisy environment.
In another example, before detecting whether the target audience is in the speaking state, the robot further: detects whether a target audience exists; and, when none exists, judges the current explanation scene to be a non-guide explanation scene. When the robot analyzes its explanation environment from the acquired real-time audio and video data, it extracts and analyzes the face images of all audience members, calculates the position and size of each face image, obtains the face proportion in each image, and detects whether any face image has a face proportion larger than the preset threshold. If such a face image exists, the audience member it belongs to is marked as a target audience, which facilitates the further analysis of the current explanation scene; if none exists, there is no target audience and the current explanation scene is judged directly to be a non-guide explanation scene. Detecting through the face proportion whether a target audience exists, and directly determining a non-guide explanation scene when none does, greatly improves the efficiency of judging the current explanation scene while keeping the judgment accurate.
In addition, the detection of whether a target audience exists can also take the distance between the audience and the robot into account. After the face recognition algorithm computes the position and size of each audience member's face image, the distance between each audience member and the robot is obtained, and an audience member whose face proportion exceeds the preset threshold and/or whose distance to the robot is below a certain value is taken as a target audience. Detecting the target audience by combining the face proportion with the human-robot distance identifies more accurately whether a target audience is present, as the sketch below illustrates.
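The sketch combines the two criteria from the preceding paragraph; the thresholds and the detection format are illustrative assumptions.

    # Sketch: thresholds and the detection format are assumptions.
    def find_target_audiences(detections, frame_area,
                              ratio_threshold=0.05, max_distance_m=2.0):
        """detections: iterable of ((x, y, w, h), distance_m), one per
        audience member. A member counts as a target audience when the
        face proportion exceeds the threshold and/or the human-robot
        distance is below the cutoff."""
        targets = []
        for (x, y, w, h), distance in detections:
            ratio = (w * h) / frame_area
            if ratio > ratio_threshold or distance < max_distance_m:
                targets.append(((x + w / 2.0, y + h / 2.0), distance))
        return targets  # empty => judged directly as a non-guide scene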
Step 103: explaining for the audience in the explanation mode corresponding to the current explanation scene, according to the preset correspondence between explanation scenes and explanation modes.
Specifically, after the robot has recognized the current explanation scene from the acquired real-time audio and video, it determines the explanation mode corresponding to that scene according to the preset correspondence between explanation scenes and explanation modes, and explains for the audience in that mode. By following the recognition result and the preset correspondence, the robot explains in the mode matched to the current scene, so that it can explain flexibly within a complex scene and the audience is guaranteed a good experience during the robot's explanation.
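One plausible realization of the preset correspondence is a plain lookup table, as in the sketch below; the mode names mirror the embodiments that follow, while the table itself is an assumption.

    # Sketch: a lookup table as one realization of the preset correspondence.
    EXPLANATION_MODES = {
        "guide":     "track_focus_face",     # follow the guide's face center
        "non-guide": "track_target_circle",  # follow the target-circle center
    }

    def select_explanation_mode(scene: str) -> str:
        return EXPLANATION_MODES[scene]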
In one example, when the current explanation scene obtained by the robot is a guide explanation scene, explaining for the audience in the corresponding explanation mode includes: taking the center of the target audience's face image as the positioning target and tracking that target while explaining. When the current explanation scene is a non-guide explanation scene, explaining in the corresponding mode includes: obtaining a target circle, taking the center of the target circle as the positioning target, and tracking that target while explaining, where the target circle is the smallest circle containing the center point of each audience member's face image.
Specifically, when the robot's current explanation scene is a guide explanation scene, the robot takes the target audience's face as the focus face and tracks it as the positioning target during the explanation. While tracking the target audience's face, the robot analyzes the acquired real-time audio and video data with a face recognition algorithm to obtain each audience member's face image and its position. For example, the viewfinder image acquired before positioning target tracking is shown in FIG. 2: the intersection of the dotted lines is the viewfinder center of the robot's camera, and four audience members, audience 1, audience 2, audience 3, and audience 4, are within the preset spatial range, among whom the face proportion in the face image of audience 1 exceeds the preset threshold and audience 1 is speaking, so the robot takes audience 1's face image as the focus face. At this moment, the viewfinder center of the camera is not aligned with the center of the focus face; upon detecting the misalignment, the robot automatically adjusts its face orientation and pose so that the viewfinder center moves toward the center of the focus face. The viewfinder image acquired after positioning target tracking is shown in FIG. 3: the viewfinder center coincides with the center of audience 1's face image, that is, the tracking of the positioning target is complete, and the robot then starts explaining for the audience.
When the robot's current explanation scene is a non-guide explanation scene, the center of the calculated target circle serves as the positioning target, and the robot tracks the center of the target circle while explaining. To track the center of the target circle, the robot first analyzes the acquired real-time audio and video data with a face recognition algorithm to obtain every audience member's face image together with its size and position, and computes the target circle containing each face image's center point. For example, the viewfinder image before positioning target tracking is shown in FIG. 4: four audience members, audience 1, audience 2, audience 3, and audience 4, are within the preset spatial range, and the viewfinder center is not aligned with the center of the target circle; upon detecting the misalignment, the robot automatically adjusts its orientation and pose so that the viewfinder center moves toward the center of the target circle. The viewfinder image after positioning target tracking is shown in FIG. 5: the viewfinder center is aligned with the center of the target circle, and the robot then starts explaining for the audience.
Obtaining the positioning target in a way that depends on the current explanation scene, then tracking that target while explaining for the audience, lets the robot provide a more human explanation service when facing many audience members, so that the audience gets a good human-robot interaction experience.
Further, the robot obtaining the target circle includes: obtaining the coordinates of the center point of each audience member's face image and generating a hash point array from those center-point coordinates; and computing the smallest circle containing every hash point in the array, which is taken as the target circle. While obtaining the target circle, the robot reads the center-point coordinates of each audience member's face image from the face recognition result, generates a hash point array from those coordinates, then computes the smallest circle containing every hash point in the array and takes that smallest circle as the target circle. Introducing the hash-point computation to obtain the target circle finds the best positioning target for the robot to track, so that when the robot tracks the target circle's center as the positioning target and explains, as many audience members as possible feel that the robot is explaining while facing them, greatly improving the audience's user experience.
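The patent does not spell out how the smallest circle is computed; one standard choice is the incremental (Welzl-style) minimum enclosing circle construction, sketched below under that assumption over the face-center ("hash point") coordinates.

    import random

    def _circle_two(a, b):
        cx, cy = (a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0
        r = (((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5) / 2.0
        return cx, cy, r

    def _circle_three(a, b, c):
        # Circumcircle of three points; None if (nearly) collinear.
        d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1])
                   + c[0] * (a[1] - b[1]))
        if abs(d) < 1e-12:
            return None
        sa = a[0] ** 2 + a[1] ** 2
        sb = b[0] ** 2 + b[1] ** 2
        sc = c[0] ** 2 + c[1] ** 2
        ux = (sa * (b[1] - c[1]) + sb * (c[1] - a[1]) + sc * (a[1] - b[1])) / d
        uy = (sa * (c[0] - b[0]) + sb * (a[0] - c[0]) + sc * (b[0] - a[0])) / d
        return ux, uy, ((a[0] - ux) ** 2 + (a[1] - uy) ** 2) ** 0.5

    def _inside(p, circle):
        cx, cy, r = circle
        return (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= r * r + 1e-9

    def target_circle(points):
        """Smallest circle containing every face-center point, built by the
        incremental construction (expected O(n) after shuffling)."""
        pts = list(points)
        random.shuffle(pts)
        c = None
        for i, p in enumerate(pts):
            if c is not None and _inside(p, c):
                continue
            c = (p[0], p[1], 0.0)             # circle through p alone
            for j, q in enumerate(pts[:i]):
                if _inside(q, c):
                    continue
                c = _circle_two(p, q)         # circle through p and q
                for k in pts[:j]:
                    if not _inside(k, c):
                        c = _circle_three(p, q, k) or c
        return c  # (center_x, center_y, radius)

For example, target_circle([(0, 0), (4, 0), (2, 3)]) returns the circumcircle of those three face centers, approximately (2.0, 0.83, 2.17), and its center is then tracked as the positioning target.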
In another example, the robot tracking the positioning target while explaining includes: adjusting the robot's orientation and beginning the explanation once the positioning target appears within a preset area inside the robot's line of sight; and readjusting the orientation whenever the positioning target leaves the preset area, until the target is inside it again. Owing to limits in the robot's control technology or its environment, the robot may be unable to make the viewfinder center and the positioning target coincide exactly by adjusting its orientation and pose. Therefore, when tracking the positioning target, a preset area can be defined in advance within the robot's line of sight, and the target counts as tracked once it lies inside that area. For example, a tolerance circle is drawn with the viewfinder center as its center; while tracking, the robot judges the tracking complete, and starts explaining, once the center of the focus face or of the target circle appears within the tolerance circle. After the explanation starts, the robot can periodically check, from the acquired real-time audio and video data, whether the positioning target is still inside the preset area, and readjust its orientation and pose whenever the target moves outside, until the target is back inside, so that the positioning target is tracked accurately throughout the explanation. Tracking the positioning target with the tolerance defined by the preset area greatly improves the robot's tracking efficiency, prevents the robot, under the influence of its own limits or environmental factors, from failing to track the target and disrupting the fluency of the explanation, and thereby safeguards the audience's user experience during the robot's explanation.
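A sketch of one control step of this tolerance-circle tracking; the tolerance radius and the proportional gain are assumed values for illustration.

    import math

    TOLERANCE_RADIUS_PX = 40  # assumed radius of the preset tolerance circle
    GAIN_DEG_PER_PX = 0.05    # assumed proportional gain for the orientation

    def track_step(view_center, positioning_target):
        """Return (pan_deg, tilt_deg) corrections when the positioning target
        has left the tolerance circle around the viewfinder center, or None
        when it lies in the preset area and the explanation can proceed."""
        dx = positioning_target[0] - view_center[0]
        dy = positioning_target[1] - view_center[1]
        if math.hypot(dx, dy) <= TOLERANCE_RADIUS_PX:
            return None                   # target tracked: keep explaining
        return dx * GAIN_DEG_PER_PX, dy * GAIN_DEG_PER_PX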
In another example, after the robot determines the explanation mode from the explanation scene and explains for the audience accordingly, the method further includes: switching to a manual explanation mode according to an administrator's instruction, and playing the explanation content input by the administrator. While the robot explains for the audience, a staff member at a cloud-based manual agent station can observe the situation in the robot's area through the robot's camera and sound collection device, for example a microphone. When the robot cannot recognize the explanation scene, or an emergency occurs, the administrator at the cloud station can send an instruction to the robot; upon receiving it, the robot switches to the manual explanation mode and plays the explanation content input by the administrator. The administrator can thus intervene directly, from the cloud station, in the robot's conversation with the on-site audience and provide manual explanation and service, further improving the audience's user experience during the robot's explanation and the ability to cope with complex scenes and emergencies.
In addition, it should be understood that the steps of the above methods are divided only for clarity of description; in implementation they may be merged into one step, or a step may be split into several, and all such divisions fall within the protection scope of this patent so long as they contain the same logical relationship. Adding insignificant modifications to an algorithm or process, or introducing insignificant designs, without changing the core design of the algorithm and process, also falls within the protection scope of this patent.
Another aspect of an embodiment of the present invention provides a robot, referring to fig. 6, including:
the acquisition module 601 is used for acquiring real-time audio and video data in a preset spatial range; wherein the preset spatial range includes at least one viewer.
The determining module 602 is configured to determine a current explanation scene according to real-time audio/video data; wherein, the explanation scene includes: a guided explanation scene and a non-guided explanation scene.
The explanation module 603 is configured to determine an explanation mode corresponding to the current explanation scene according to a preset correspondence between the explanation scene and the explanation mode, and perform explanation for the audience according to the corresponding explanation mode.
It should be understood that this embodiment is an apparatus embodiment corresponding to the method embodiment above, and can be implemented in cooperation with it. The related technical details mentioned in the method embodiment remain valid in this embodiment and are not repeated here in order to reduce duplication; correspondingly, the related technical details mentioned in this embodiment can also be applied in the method embodiment.
It should be noted that all the modules involved in this embodiment are logical modules. In practical applications, a logical unit may be one physical unit, a part of one physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, this embodiment does not introduce units less closely related to solving the technical problem proposed by the invention, but this does not mean that no other units exist in this embodiment.
Another aspect of the embodiments of the present invention further provides an electronic device, which, referring to FIG. 7, includes: at least one processor 701; and a memory 702 communicatively connected to the at least one processor 701; wherein the memory 702 stores instructions executable by the at least one processor 701, and the instructions are executed by the at least one processor 701 to enable the at least one processor 701 to perform the method of any one of the method embodiments described above.
The memory 702 and the processor 701 are connected by a bus, which may comprise any number of interconnected buses and bridges linking the one or more processors 701 and the various circuits of the memory 702. The bus may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore not described further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatus over a transmission medium. Data processed by the processor 701 is transmitted over a wireless medium through an antenna, and the antenna also receives data and transmits it to the processor 701.
The processor 701 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions, while the memory 702 may be used for storing data the processor 701 uses in performing operations.
Another aspect of the embodiments of the present invention also provides a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.
That is, as those skilled in the art can understand, all or part of the steps of the methods in the above embodiments may be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions to enable a device (which may be a single chip machine, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It will be understood by those of ordinary skill in the art that the above embodiments are specific examples of carrying out the present application, and that various changes in form and detail may be made in practice without departing from the spirit and scope of the application.

Claims (10)

1. An explanation method, applied to a robot, comprising:
acquiring real-time audio and video data within a preset spatial range; wherein the preset spatial range includes at least one audience;
determining a current explanation scene according to the real-time audio and video data; wherein the explanation scene comprises: a guide explanation scene and a non-guide explanation scene;
determining an explanation mode corresponding to the current explanation scene according to the corresponding relation between a preset explanation scene and an explanation mode, and explaining for the audience according to the corresponding explanation mode;
wherein the determining the current explanation scene according to the real-time audio and video data comprises:
detecting whether a target audience is in a speaking state; the target audience is the audience of which the face proportion in the face image is larger than a preset threshold;
under the condition that the target audience is in a speaking state, judging that the current explanation scene is a guide explanation scene;
and under the condition that the target audience is not in the speaking state, judging that the current explanation scene is a non-guide explanation scene.
2. The explanation method of claim 1, further comprising, before the detecting whether the target audience is in the speaking state:
detecting whether the target audience exists;
and under the condition that the target audience does not exist, judging that the current explanation scene is the non-guide explanation scene.
3. The explanation method of claim 1, wherein the detecting whether the target audience is in a speaking state comprises:
detecting whether the volume of the direction in which the target audience is located is larger than a preset threshold value;
under the condition that the volume is larger than the preset threshold value, judging that the target audience is in a speaking state;
and if the volume is not greater than the preset threshold value, judging that the target audience is not in the speaking state.
4. The explanation method according to claim 1, wherein, in a case where the current explanation scene is the guide explanation scene, the explaining for the audience according to the corresponding explanation mode includes: taking the center of the face image of the target audience as a positioning target, and tracking the positioning target for explanation;
when the current explanation scene is the non-guidance explanation scene, the explaining for the audience according to the corresponding explanation mode includes: acquiring a target circle, taking the center of the target circle as the positioning target, and tracking the positioning target for explanation; the target circle is the smallest circle containing the center point of the face image of each audience.
5. The explanation method as claimed in claim 4, wherein the tracking the positioning target for explanation comprises:
adjusting the orientation of the robot, and beginning to explain after the positioning target appears in a preset area within the range of the visual line of the robot;
and under the condition that the positioning target exceeds the preset area, the orientation of the robot is readjusted until the positioning target is in the preset area.
6. The explanation method according to claim 4, wherein the obtaining the target circle comprises:
acquiring coordinates of a central point of each audience face image, and generating a hash point array according to the coordinates of the central point;
and calculating the minimum circle containing each hash point in the hash point array, and taking the minimum circle as the target circle.
7. The explanation method according to any one of claims 1 to 6, wherein after determining an explanation mode based on the explanation scene and explaining for the audience based on the determined explanation mode, the method further comprises:
and switching to a manual explanation mode according to an instruction of an administrator, and playing the explanation content input by the administrator.
8. A robot, comprising:
the acquisition module is used for acquiring real-time audio and video data within a preset spatial range; wherein the preset spatial range includes at least one audience;
the determining module is used for determining the current explanation scene according to the real-time audio and video data; wherein the explanation scene comprises: a guide explanation scene and a non-guide explanation scene;
wherein the determining the current explanation scene according to the real-time audio and video data comprises:
detecting whether a target audience is in a speaking state; the target audience is the audience with the face proportion in the face image larger than a preset threshold;
under the condition that the target audience is in a speaking state, judging that the current explanation scene is a guide explanation scene;
under the condition that the target audience is not in a speaking state, judging that the current explanation scene is a non-guidance explanation scene;
and the explanation module is used for determining the explanation mode corresponding to the current explanation scene according to the corresponding relation between the preset explanation scene and the explanation mode, and explaining for the audience according to the corresponding explanation mode.
9. An electronic device, comprising:
at least one processor; and,
a memory communicatively connected to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the explanation method of any one of claims 1 to 7.
CN202111089343.6A 2021-09-16 2021-09-16 Explanation method, robot, electronic device, and storage medium Active CN115249359B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111089343.6A CN115249359B (en) 2021-09-16 2021-09-16 Explanation method, robot, electronic device, and storage medium
PCT/CN2022/119161 WO2023040992A1 (en) 2021-09-16 2022-09-15 Commentary method, robot, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111089343.6A CN115249359B (en) 2021-09-16 2021-09-16 Explanation method, robot, electronic device, and storage medium

Publications (2)

Publication Number Publication Date
CN115249359A CN115249359A (en) 2022-10-28
CN115249359B (en) 2023-03-31

Family

ID=83697193

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111089343.6A Active CN115249359B (en) 2021-09-16 2021-09-16 Explanation method, robot, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN115249359B (en)
WO (1) WO2023040992A1 (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102013176B (en) * 2010-12-01 2016-04-06 曹乃承 On-line study system
CN104464406A (en) * 2013-09-12 2015-03-25 郑州学生宝电子科技有限公司 Real-time interactive online learning platform
CN106406319A (en) * 2016-11-11 2017-02-15 华南智能机器人创新研究院 Smart guiding robot
US10045076B2 (en) * 2016-11-22 2018-08-07 International Business Machines Corporation Entertainment content ratings system based on physical expressions of a spectator to scenes of the content
CN108536672A (en) * 2018-03-12 2018-09-14 平安科技(深圳)有限公司 Intelligent robot Training Methodology, device, computer equipment and storage medium
CN109129507B (en) * 2018-09-10 2022-04-19 北京联合大学 Intelligent explaining robot and explaining method and system
CN109887503A (en) * 2019-01-20 2019-06-14 北京联合大学 A kind of man-machine interaction method of intellect service robot
CN112289239B (en) * 2020-12-28 2021-03-30 之江实验室 Dynamically adjustable explaining method and device and electronic equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018006875A1 (en) * 2016-07-08 2018-01-11 纳恩博(北京)科技有限公司 Robot-based mode switching method and device, and computer storage medium
CN110599823A (en) * 2019-09-05 2019-12-20 北京科技大学 Service robot teaching method based on fusion of teaching video and spoken voice
WO2021149296A1 (en) * 2020-01-24 2021-07-29 株式会社日立システムズ Chatbot control device, chatbot control method, and chatbot control device system
CN111805557A (en) * 2020-07-22 2020-10-23 上海上实龙创智能科技股份有限公司 Indoor explanation system and method based on humanoid robot

Also Published As

Publication number Publication date
CN115249359A (en) 2022-10-28
WO2023040992A1 (en) 2023-03-23

Similar Documents

Publication Title
CN106024003B (en) Voice positioning and enhancing system and method combined with image
US10891473B2 (en) Method and device for use in hand gesture recognition
CN109032039B (en) Voice control method and device
CN108734083B (en) Control method, device, equipment and storage medium of intelligent equipment
US11562471B2 (en) Arrangement for generating head related transfer function filters
CN108919958A (en) A kind of image transfer method, device, terminal device and storage medium
CN105654512A (en) Target tracking method and device
CN103716595A (en) Linkage control method and device for panoramic mosaic camera and dome camera
EP3275213B1 (en) Method and apparatus for driving an array of loudspeakers with drive signals
EP2757771A2 (en) Image pickup apparatus, remote control apparatus, and methods of controlling image pickup apparatus and remote control apparatus
CN109640224B (en) Pickup method and device
CN111251307B (en) Voice acquisition method and device applied to robot and robot
CN105022999A (en) Man code company real-time acquisition system
US20140086551A1 (en) Information processing apparatus and information processing method
CN104185116A (en) Automatic acoustic radiation mode determining method
JP2021051716A (en) Method and device for outputting information
CN113302907B (en) Shooting method, shooting device, shooting equipment and computer readable storage medium
CN111881740A (en) Face recognition method, face recognition device, electronic equipment and medium
CN107452381B (en) Multimedia voice recognition device and method
CN111182280A (en) Projection method, projection device, sound box equipment and storage medium
CN115249359B (en) Explanation method, robot, electronic device, and storage medium
CN109688512B (en) Pickup method and device
EP3647907A1 (en) User signal processing method and device for performing method
CN112655021A (en) Image processing method, image processing device, electronic equipment and storage medium
WO2021179125A1 (en) Monitoring system, monitoring method, mobile platform and remote device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant