CN110853073A - Method, device, equipment and system for determining attention point and information processing method - Google Patents


Info

Publication number
CN110853073A
Authority
CN
China
Prior art keywords: image, point, information, images, interest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810829677.4A
Other languages
Chinese (zh)
Inventor
李炜明
张辉
王强
考月英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Samsung Telecom R&D Center
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Original Assignee
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Samsung Telecommunications Technology Research Co Ltd, Samsung Electronics Co Ltd filed Critical Beijing Samsung Telecommunications Technology Research Co Ltd
Priority to CN201810829677.4A priority Critical patent/CN110853073A/en
Publication of CN110853073A publication Critical patent/CN110853073A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration

Abstract

The embodiments of the present application provide a method, an apparatus, a device and a system for determining a point of interest, and an information processing method, belonging to the technical field of information processing. The method for determining a point of interest comprises: acquiring at least two images of a scene; and determining, based on the at least two images, a point of interest in the scene of an object appearing in the images. With this scheme, the point of interest of an object in a real scene can be determined without the object wearing any device, so the scheme is applicable to practical application scenarios in which the object cannot be expected to cooperate by wearing additional equipment, and it can provide the object with a more natural interaction mode.

Description

Method, device, equipment and system for determining attention point and information processing method
Technical Field
The present application relates to the field of information processing technologies, and in particular to a method, an apparatus, a device and a system for determining a point of interest, and to an information processing method.
Background
A user's gaze point reflects the object the user is looking at; knowing the gaze point therefore reveals which objects the user is interested in. Most existing gaze-point detection schemes require the user to wear a device equipped with an eye camera that observes the user's eyes and an outward-facing camera that observes the surrounding environment, and the gaze point is extracted by associating the gaze direction detected by the eye camera with the environment image captured by the outward-facing camera. Although such schemes can achieve high accuracy, requiring the user to wear additional equipment constrains and inconveniences the user and greatly limits the applicable scenarios; in service settings such as reception or shopping, for example, it is not appropriate to ask a customer to wear a specific device. Other detection schemes do not require the user to wear additional equipment, but they typically restrict the detected gaze point to a specific display.
Disclosure of Invention
The present application aims to solve at least one of the above technical drawbacks. The technical scheme adopted by the application is as follows:
in a first aspect, the present application provides a method of determining a point of interest, the method comprising:
acquiring at least two images of a scene;
determining, based on the at least two images, a point of interest in the scene of an object appearing in the images.
In a second aspect, the present application provides an apparatus for determining a point of interest, the apparatus comprising:
an image acquisition module, configured to acquire at least two images of a scene;
and a point-of-interest determination module, configured to determine, based on the at least two images, a point of interest in the scene of an object appearing in the images.
In a third aspect, the present application provides an electronic device comprising an image acquisition module, a memory, and a processor;
the system comprises an image acquisition module, a scene acquisition module and a scene processing module, wherein the image acquisition module is used for acquiring at least two images of a scene;
a memory for storing machine readable instructions that, when executed by the processor, configure the processor to determine a point of interest in a scene for an object in an image based on at least two images acquired by an image acquisition module.
In a fourth aspect, the present application provides a system for determining a point of interest, the system comprising an image acquisition device and an electronic device connected to the image acquisition device;
the image acquisition device is configured to acquire at least two images of a scene and send the at least two images to the electronic device;
and the electronic device is configured to receive the at least two images sent by the image acquisition device and determine, based on the at least two received images, a point of interest in the scene of an object appearing in the images.
In a fifth aspect, the present application provides a behavior information obtaining method, including:
acquiring a focus of an object;
and acquiring the behavior information of the object according to the attention point.
In a sixth aspect, the present application provides a behavior information acquiring apparatus, including:
the attention point acquisition module is used for acquiring the attention point of the object;
and the behavior information acquisition module is used for acquiring the behavior information of the object according to the attention point.
In a seventh aspect, the present application provides an electronic device comprising a memory and a processor;
a memory for storing machine readable instructions, which when executed by the processor, cause the processor to perform the method for determining a point of interest as shown in the first aspect of the present application and/or the behavior information obtaining method as shown in the fifth aspect of the present application.
In an eighth aspect, the present application provides a computer-readable storage medium for storing computer instructions, which when executed on a computer, enable the computer to perform the method for determining a point of interest as shown in the first aspect of the present application and/or the behavior information obtaining method as shown in the fifth aspect of the present application.
The technical solutions provided by the embodiments of the present application have the following beneficial effects: based on the at least two images, the point of interest of an object in the scene can be detected. Because the object does not need to wear any wearable device for its point of interest in the real scene to be determined, the scheme is applicable to practical application scenarios in which the object cannot be expected to cooperate by wearing additional equipment, and it can provide the object with a more natural interaction mode.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic diagram illustrating a method for determining a point of interest in an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating a method for determining a point of interest according to an embodiment of the present application;
FIG. 3 is a schematic diagram of two panoramic cameras for acquiring panoramic images in the embodiment of the present application;
FIG. 4 is a schematic diagram illustrating the rectification of a panoramic image according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating a position relationship between projected points of object points in space on two images according to an embodiment of the present application;
fig. 6 is a schematic diagram of acquiring a panoramic image by a panoramic camera in the embodiment of the present application;
fig. 7 is a schematic diagram of a method for calibrating two panoramic images at different times in an embodiment of the present application;
fig. 8a is a schematic diagram of acquiring a panoramic image by a monocular camera according to an embodiment of the present application;
fig. 8b is a schematic diagram of acquiring a panoramic image by using a binocular camera in the embodiment of the present application;
fig. 8c is a schematic diagram of acquiring a panoramic image by a wide-angle camera composed of a plurality of monocular cameras according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a method for capturing panoramic information by controlling the motion of a camera according to an embodiment of the present application;
FIG. 10 is a diagram illustrating a method for determining a position image based on the quality of an object image according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a method of obtaining three-dimensional gaze information in an example of the present application;
FIG. 12 is a schematic diagram of a method of obtaining three-dimensional gaze information in another example of the present application;
FIG. 13 is a schematic diagram of a method of obtaining three-dimensional gaze information in yet another example of the present application;
FIG. 14 is a diagram illustrating a method for obtaining a position image of an object according to an embodiment of the present disclosure;
FIG. 15 is a diagram illustrating another method for obtaining a position image of an object according to an embodiment of the present disclosure;
FIG. 16 is a schematic diagram of a method of determining a point of interest in an example of the present application;
FIG. 17 is a schematic diagram of a method of determining a point of interest in another example of the present application;
FIG. 18 is a schematic illustration of a method of determining a point of interest in yet another example of the present application;
FIG. 19 is a schematic diagram of a method for obtaining depth information of a stationary object according to an embodiment of the present application;
FIG. 20 is a schematic flow chart illustrating a method for determining a point of interest according to another embodiment of the present application;
FIG. 21 is a schematic diagram illustrating an example of determining an image of a user's field of view;
FIG. 22 is a schematic diagram of a method for determining an image of a user's field of view in an example of the present application;
FIG. 23 is a schematic structural diagram of an apparatus for determining a point of interest in an embodiment of the present application;
fig. 24 is a schematic structural diagram of an electronic device in an embodiment of the present application;
FIG. 25 is a block diagram of a system for determining points of interest in an embodiment of the present application;
fig. 26 is a schematic flowchart of an information obtaining method according to an embodiment of the present application;
fig. 27 is a schematic flowchart of an information obtaining method according to another embodiment of the present application;
fig. 28 is a schematic structural diagram of an information acquisition apparatus according to an embodiment of the present application;
FIGS. 29a, 29b, 29c and 29d are schematic diagrams illustrating four setting modes of the electronic device for determining the attention point in the shopping scene in the embodiment of the present application;
FIG. 30 is a schematic diagram of a method for providing a customer-centric service in a shopping or hospitality scenario in an embodiment of the present application;
fig. 31 is a schematic diagram of an intelligent home scene in an embodiment of the present application;
fig. 32 is a schematic view of another smart home scenario in the embodiment of the present application;
fig. 33 is a schematic diagram illustrating a manner in which a control interface of an internet of things device is displayed according to a user's gaze in an embodiment of the present application;
FIG. 34 is a schematic view of a travel application scenario in an embodiment of the present application;
FIG. 35 is a schematic view of an assistant driving scenario in an embodiment of the present application;
FIG. 36 is a schematic view of a method for automatically determining the travel intention of surrounding pedestrians on a vehicle in the embodiment of the present application;
FIG. 37 is a schematic diagram of a teaching operation scenario in an embodiment of the present application;
FIG. 38 is a schematic diagram of a method for providing operation suggestions to a user in a teaching operation scenario in an embodiment of the present application;
FIG. 39 is a schematic view of a driver interaction scenario in an embodiment of the present application;
FIG. 40 is a diagram illustrating a method for providing input to a human-computer interaction system based on an image of a user's field of view, in accordance with an embodiment of the present invention;
FIG. 41 is a schematic view of another exemplary driver interaction scenario in accordance with an embodiment of the subject application;
FIG. 42 is a schematic illustration of a method of detecting and alerting a user to a potentially threatening object in a surrounding traffic environment;
FIG. 43 is a schematic diagram of a multi-user scene security monitoring scenario in an embodiment of the present application;
FIG. 44 is a schematic diagram of a method for performing FOV detection on multi-row human behavior in an embodiment of the present application;
FIG. 45 is a schematic diagram of a classroom scene in an embodiment of the subject application;
FIG. 46 is a diagram illustrating a method for attention analysis of multiple users in an embodiment of the present application;
fig. 47 is a schematic structural diagram of an electronic device provided in the present application;
FIG. 48 is a schematic illustration of a method of determining a point of interest of an object in an example of the present application;
FIG. 49 is a schematic diagram of a method for determining a point of interest of an object according to another example of the present application;
FIG. 50 is a diagram illustrating a method for determining a point of interest of an object according to yet another example of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
With the rapid development of science and technology and people's rising living needs, technologies such as human-computer interaction and artificial intelligence appear more and more in daily life. When providing services to an object (such as a user), making those services match what the object actually cares about is currently one of the important problems to be solved. Existing schemes for detecting a user's gaze point generally require the detected object to wear corresponding equipment, which inconveniences the object and greatly limits the applicable scenarios, or they can only detect the gaze point on a specific display; neither meets the requirements of practical applications.
The method, apparatus, device and system for determining a point of interest and the information acquisition method provided by the present application aim to solve the above technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
In the embodiment of the present application, the focus point may include a gaze point and/or a pointing point of a part of the object. That is, the specific content of the focus point may be determined according to the actual application scene and the requirement, for example, if an object gazed by an object in the scene needs to be detected, the focus point may be a gazing point of the object, and if a hand motion of the object needs to be detected, the focus point may be a pointing point of a hand, a pointing point of an arm or other parts, or the like.
For better understanding of the present application, the following first takes an object as a user and a focus point as a fixation point of the user as an example, and a principle of the method for determining the focus point provided by the present application is explained.
Fig. 1 shows a schematic diagram of a method of determining the gaze point of a user in a scene. As shown in Fig. 1, two panoramic images containing both the scene and the user can be captured simultaneously by panoramic camera 1 and panoramic camera 2 located at different positions; the two panoramic images have a parallax and form a stereoscopic image pair. Object A, object B and object C are objects in the scene. To determine the user's gaze point in the actual scene, the user's gaze information in the scene is acquired first, and the position where the gaze intersects an object in the scene is analyzed, thereby determining the gaze point. After the gaze point is determined, related services or information may further be provided to the user based on it, for example: the user's field-of-view image is determined based on the gaze point, the user's intention is obtained based on the field-of-view image, and related services or information are provided to the user based on that intention.
The sight-line information of the user may include start-point information and direction information of the sight line. The start point of the sight line may specifically be the midpoint of the line connecting the centers of the user's two eyeballs, and the start-point information is the coordinates of that start point in the reference coordinate system. The direction information of the sight line comprises the angles between the user's sight line and the coordinate axes of the reference coordinate system. The ray uniquely determined in space by the start point and the direction of the sight line may be referred to as the user's sight line (i.e., the user's gaze path), and a point on this line may be referred to as a sight-line point.
The reference coordinate system may be selected as needed, and may be, for example, a camera coordinate system of any one of at least two panoramic cameras that capture panoramic images, a specified coordinate system, or a world coordinate system defined on an object in a certain environment.
It should be noted that all the information related to the coordinate systems are finally based on the same coordinate system, and when the reference coordinate system is not the camera coordinate system, the information related to the coordinate systems, such as position information and/or direction information, which are involved in determining the point of interest, need to be converted into the reference coordinate system, and the conversion between different coordinate systems can be realized based on the rotation parameters and the translation parameters between the coordinate systems. In practical applications, the reference coordinate system may be selected as a camera coordinate system of any panoramic camera in order to save computation.
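As a brief, hedged illustration of the coordinate conversion mentioned above, assuming a rotation matrix and translation vector from a camera coordinate system to the chosen reference coordinate system are known from calibration (the function and variable names below are illustrative, not taken from the application), a point can be converted as follows:

```python
import numpy as np

def to_reference_frame(p_cam, R_cam_to_ref, t_cam_to_ref):
    """Convert a 3D point from a camera coordinate system to the reference
    coordinate system using the extrinsic rotation R and translation t.

    p_cam:         (3,) point expressed in the camera frame
    R_cam_to_ref:  (3, 3) rotation matrix from camera frame to reference frame
    t_cam_to_ref:  (3,) position of the camera origin in the reference frame
    """
    p_cam = np.asarray(p_cam, dtype=float)
    return R_cam_to_ref @ p_cam + t_cam_to_ref

# Example: a gaze start point measured in panoramic camera 2's frame,
# converted into camera 1's frame (used here as the reference frame).
R_21 = np.eye(3)                      # placeholder extrinsics; in practice from calibration
t_21 = np.array([0.0, 0.0, 0.3])
start_point_ref = to_reference_frame([0.1, 0.2, 1.5], R_21, t_21)
```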
In an alternative of the present application, the user's gaze information may be three-dimensional gaze information. When the user sight line information is three-dimensional sight line information, the reference coordinate system can be a three-dimensional coordinate system, the starting point information of the three-dimensional sight line information is a three-dimensional coordinate of the starting point in the three-dimensional coordinate system, and the direction information of the three-dimensional sight line information is an included angle between the user sight line and three coordinate axes of the three-dimensional coordinate system. For convenience of description, the three-dimensional line-of-sight information may be represented by (X, Y, Z, Ra, Rb, Rc), where (X, Y, Z) represents start point information of the three-dimensional line-of-sight and (Ra, Rb, Rc) represents direction information of the three-dimensional line-of-sight.
The gaze information of the user may specifically be determined based on the position image of the user's body in the two images. As shown in fig. 1, for any sight point on the sight line of the user, there are two cases:
(1) When the sight-line point does not coincide with any object point in the scene, it is an empty point with no object at the corresponding spatial position. For example, when sight-line point A is projected onto the two panoramic images, the projection points on the two images correspond to different objects in space, and the image windows determined on the two images based on the projection points have different image contents.
(2) When the sight-line point coincides with an object in the scene, such as sight-line point B, which lies on object B, the projection points of that point on the two panoramic images correspond to the same object seen by the user, and the image windows determined on the two images based on the projection points have the same image content.
Therefore, based on the above principle, whether a sight-line point intersects an object in space can be detected by comparing whether the projection points of that point in the two panoramic images correspond to the same object and/or whether the image contents of the image windows determined based on the projection points are consistent, and the gaze point can be determined accordingly.
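A minimal sketch of this intersection test is given below, assuming a rectified panoramic stereo pair, a known attention route (start point and direction in the reference frame), and hypothetical projection functions that map a 3D point to pixel coordinates in each image. Window similarity is measured here with normalized cross-correlation, which is one possible choice rather than a method prescribed by the application:

```python
import numpy as np

def window(img, uv, half=8):
    """Extract a square image window centered at pixel uv (u = column, v = row)."""
    u, v = int(round(uv[0])), int(round(uv[1]))
    return img[v - half:v + half + 1, u - half:u + half + 1].astype(float)

def ncc(a, b):
    """Normalized cross-correlation between two equally sized windows."""
    a, b = a - a.mean(), b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / denom) if denom > 0 else -1.0

def find_point_of_interest(img1, img2, project1, project2,
                           start, direction, depths, threshold=0.8):
    """Sample candidate points along the attention route (start + d * direction),
    project each candidate into both images, and return the nearest candidate
    whose two image windows show consistent content."""
    direction = np.asarray(direction, float)
    direction = direction / np.linalg.norm(direction)
    for d in depths:                      # e.g. np.arange(0.3, 10.0, 0.05) metres
        p = np.asarray(start, float) + d * direction
        w1 = window(img1, project1(p))
        w2 = window(img2, project2(p))
        if w1.shape == w2.shape and w1.size and ncc(w1, w2) > threshold:
            return p                      # candidate lies on an object surface
    return None                           # route meets no object within the sampled range
```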
It can be seen that the physical constraint that must be satisfied by an object in the scene observed by the user is that the user's gaze point lies on the surface of that object. The gaze point can therefore be obtained by detecting where the user's sight line intersects objects in the scene, and services based on the user's intention can then be provided, for example by further determining the user's field-of-view image based on the gaze point and the panoramic images, or by presenting detailed information about the object corresponding to the gaze point to the user.
It should be noted that the attention route information differs for different kinds of points of interest. For the gaze point in the example of Fig. 1, the attention route information is the sight-line information; when the point of interest is a pointing point, the attention route information is the pointing route of a finger, whose start point may be the fingertip, a designated position on the finger, or a point determined from key points of the finger image. When the fingertip is used as the start point of the pointing route, the coordinates of the fingertip in the reference coordinate system are the start-point information of the attention route, and the angles between the pointing route and the coordinate axes of the reference coordinate system are its direction information. Similarly, in the ideal case, the projection points on the at least two images of the point of interest, i.e., the intersection of the attention route with the nearest object in space, also correspond to the same object.
With the method for determining a point of interest described above, in a practical application scenario, images can be acquired continuously, for example by continuously capturing a video of the scene where the object is located, and the object's points of interest at different moments can be determined from different frames of the video, so that the object's point of interest is tracked.
Fig. 2 is a flowchart illustrating a method for determining a point of interest according to an embodiment of the present application, where as shown in fig. 2, the method may include the following steps:
step S110: acquiring at least two images of a scene;
step S120: determining, based on the at least two images, a point of interest in the scene of an object appearing in the images.
In the embodiments of the present application, the object may be a user and/or another subject whose point of interest needs to be detected. It should be noted that each of the at least two acquired images should at least contain an image of the object and images of the things in the scene that the object may pay attention to (or that the object is expected to pay attention to). At least two of the acquired images must be different images, that is, they should have a stereoscopic parallax and be images of the scene acquired from different positions or angles.
In the embodiment of the present application, the at least two images may be selected as the at least two panoramic images.
Acquiring panoramic images ensures that the acquired images cover the contents of the scene more comprehensively. The type of panoramic image can be chosen according to how the object's attention is likely to move in the actual scene; for example, the at least two panoramic images may be 360-degree panoramas arranged in the horizontal direction or 360-degree panoramas arranged in the vertical direction. In general, a vertically arranged panoramic pair (for example, two panoramic images acquired by two panoramic cameras placed one above the other) gives good depth-estimation quality for objects located in the 360-degree range around the cameras, but poorer depth-estimation quality for objects directly above or below the cameras. If the object and the things it attends to appear in the front, back, left and right directions of the scene space, it is best to acquire at least two panoramic images in the vertical arrangement; if something of interest may appear near the top of the scene space, at least two panoramic images in the horizontal arrangement, captured by horizontally placed panoramic cameras, may be acquired instead.
In the embodiments of the present application, when the point of interest of the object is determined based on at least two panoramic images, the panoramic images are generally planar panoramic images; if the obtained panoramic image is a spherical or cylindrical panoramic image, a planar panoramic image can be obtained after the image is rectified and unrolled.
When more than two images are acquired, the point of interest may be determined using any two different images, using all of the images, or using a subset of at least two images selected according to the quality of the acquired images or other screening conditions.
With the method for determining a point of interest provided by the embodiments of the present application, the point of interest of an object in an actual scene can be determined based on at least two different images. Because the object does not need to wear any wearable device, the scheme is applicable to practical application scenarios in which the object cannot be expected to cooperate by wearing additional equipment, and it provides a more natural interaction mode. The object's behavior information can be obtained in time, which provides technical support for offering related services or information to the object, or to other parties associated with the object, based on that behavior information, and thus better meets practical application requirements.
In this embodiment of the present application, acquiring at least two images of a scene may include:
at least two images are acquired by at least two cameras in different positions, or at least two images are acquired by controlling the translation and/or rotation of the cameras.
In practical applications, the way of acquiring the at least two images can be chosen according to the actual scenario and requirements. When the number of cameras matches the required number of images, at least two images at the same moment can be acquired by controlling the plurality of cameras; when the number of cameras is smaller than the number of required images, at least two images can be acquired by controlling the motion of a camera.
In the following, different schemes for acquiring panoramic images will be described by taking the acquisition of two panoramic images as an example.
The first method is as follows: two panoramic images are acquired by two panoramic cameras.
When an image is acquired by a camera, the camera generally needs to be calibrated in order to determine the correlation between the position of a certain point on the surface of an object in space and the corresponding point in the image. The panoramic camera calibration aims to determine parameters in an imaging model of the panoramic camera, wherein the parameters comprise internal parameters of the camera, lens distortion parameters (generally comprising radial distortion parameters and tangential distortion parameters), rotation parameters and translation parameters between two fisheye cameras of each panoramic camera, and the like. The rotation parameters can be generally expressed by Euler angles, the translation parameters can be expressed by translation vectors, the rotation parameters between the two panoramic cameras describe the rotation angles between the camera coordinate systems of the two panoramic cameras, the translation parameters describe the translation between the camera coordinate systems of the two panoramic cameras, and when the optical center of one panoramic camera and the camera coordinate system of the panoramic camera are given arbitrarily, the optical center and the camera coordinate system of the other panoramic camera can be known according to the rotation parameters and the translation parameters. The parameters to be calibrated can be calibrated in advance and stored in a storage device.
In practical application, the calibration may be to correct the image according to the calibration parameters stored in the storage device after the image is acquired, or to configure the camera according to the calibration parameters before the image is acquired, and when the camera acquires the image, to finish correcting the output image according to the configured calibration parameters. The specific implementation of camera calibration is prior art and will not be described in detail here.
In the embodiment of the application, when two panoramic images are acquired by two panoramic cameras, the baseline direction of the two panoramic cameras can be perpendicular to the ground, and the baseline direction is a connecting line of optical centers of the two panoramic cameras. The base line directions of the two panoramic cameras are perpendicular to the ground, namely the two panoramic cameras can be placed up and down, and therefore when images are obtained, omission of horizontal direction information of the surrounding environment can be effectively avoided, and comprehensiveness of objects which may be interested by users in the obtained panoramic images is guaranteed.
Fig. 3 shows a schematic diagram of the position relationship between two panoramic cameras in an example of the present application. As shown in fig. 3, the two panoramic cameras may be fixed by a connecting rod, a connecting line of optical centers of the two panoramic cameras is perpendicular to the ground, and a panoramic shooting device shown in fig. 3 may capture a 360-degree horizontal stereoscopic video around, and acquire two spherical panoramic images with parallax at each time. Two sets of panoramic videos respectively shot by the upper and lower panoramic cameras shown in fig. 3 form a panoramic stereo video. By using the calibration parameters, the shot panoramic video can be corrected.
In one example, the optical center of the upper panoramic camera is denoted O1 and that of the lower panoramic camera O2, and the line segment O1O2 is the baseline of the two panoramic cameras. The camera coordinate system of the upper panoramic camera is denoted O1-X1Y1Z1, with X1, Y1 and Z1 as three mutually perpendicular coordinate axes, and the camera coordinate system of the lower panoramic camera is denoted O2-X2Y2Z2, with X2, Y2 and Z2 as three mutually perpendicular coordinate axes. In the actual manufacturing and assembly of the equipment, errors in the postures and orientations of the two panoramic cameras are inevitable; to reduce these errors, longitude-latitude rectification needs to be performed on the two panoramic images, as shown in Fig. 4. After rectification, in the ideal case, the Z1 and Z2 directions both coincide with the O1O2 direction, the X1 direction coincides with the X2 direction, and the Y1 direction coincides with the Y2 direction. The spherical panoramic images are then unrolled onto a cylinder to obtain cylindrical panoramic images; as can be seen from the figure, the coordinate systems O1-X1Y1Z1 and O2-X2Y2Z2 of the unrolled cylindrical panoramas likewise satisfy that Z1 and Z2 coincide with the O1O2 direction, X1 coincides with X2, and Y1 coincides with Y2. After the cylindrical panoramic images are unrolled into planar panoramic images, as shown in Fig. 5, the projection points of the same spatial object point P onto the two planar panoramas, i.e., image points P1 and P2, ideally lie on the same image column. Each unrolled planar panorama covers a 360-degree field of view in the horizontal direction (X1 = 0 to 360 degrees and X2 = 0 to 360 degrees in the figure) and a 180-degree field of view in the vertical direction (Z1 = 0 to 180 degrees and Z2 = 0 to 180 degrees in the figure).
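The application only states that corresponding projections lie on the same image column; as a hedged illustration of how depth could then be triangulated from such a vertically rectified pair, the sketch below applies the law of sines in the triangle O1-O2-P under explicitly stated angle conventions (the mapping from image rows to these angles depends on the panorama resolution and is assumed to have been done already):

```python
import numpy as np

def depth_from_vertical_stereo(theta1, theta2, baseline):
    """Triangulate the distance from the upper camera O1 to a scene point P.

    theta1:   angle at O1 between the baseline ray O1->O2 and the ray O1->P (radians)
    theta2:   angle at O2 between the ray O2->O1 and the ray O2->P (radians)
    baseline: length of O1O2 in metres

    Law of sines in triangle O1-O2-P:
        |O1P| / sin(theta2) = baseline / sin(pi - theta1 - theta2)
    """
    return baseline * np.sin(theta2) / np.sin(theta1 + theta2)

# Example: baseline of 0.3 m, rays read off the same image column of both panoramas.
r1 = depth_from_vertical_stereo(np.deg2rad(80.0), np.deg2rad(95.0), 0.3)
```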
The second method is as follows: two panoramic images are acquired with one panoramic camera.
In this method, two panoramic images are acquired by controlling the motion of a single panoramic camera. First, the ego-motion information of the panoramic camera is acquired, for example from an inertial sensor unit on the camera, from a visual simultaneous localization and mapping (SLAM) algorithm, or from a combination of the two. From the video captured while the camera moves, the panoramic image at time t1 is selected as the reference-time image and the panoramic image at time t0 as the current-time image, with the distance between the camera positions at which the two images were captured exceeding a certain threshold, so that two panoramic images with parallax are obtained, as shown in Fig. 6.
The panoramic image at the reference moment and the panoramic image at the current moment need to be calibrated, and two panoramic images shot by one panoramic camera at different positions at two different moments are subjected to image transformation to form a pair of panoramic stereo image pairs, so that the capturing of the panoramic stereo information of the scene is realized.
Fig. 7 is a schematic diagram illustrating a method for calibrating two panoramic images captured at different times. As shown in the figure, camera motion parameter estimation yields the 6-degree-of-freedom pose of the panoramic camera at time t1 and at time t0. Specifically, the pose at time t1 can be expressed by six parameters: the translation parameters Tt1 = [Xt1, Yt1, Zt1] and the rotation parameters Rt1 = [Rat1, Rbt1, Rct1]; the pose at time t0 can likewise be expressed by the translation parameters Tt0 = [Xt0, Yt0, Zt0] and the rotation parameters Rt0 = [Rat0, Rbt0, Rct0]. The three translation parameters are the translations of the camera along the X, Y and Z coordinate axes, and the three rotation parameters are the rotation angles around the X, Y and Z axes, respectively. Based on the camera poses at t1 and t0, the rotation matrix R01 between the panoramic images at the reference time and the current time can be calculated as R01 = inv(Rt0) × Rt1, where inv(Rt0) denotes the inverse matrix of Rt0. The panoramic image at time t0 is rotated according to R01 to complete its rectification; after the rectified t0 panorama and the t1 panorama are unrolled longitudinally and laterally, the projections of the same object point on the corresponding planar panoramas are aligned to the same image column. The rectified reference-time panorama and the current-time panorama can then be used as the two panoramic images to be acquired, forming a panoramic stereo image pair.
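The relative rotation R01 = inv(Rt0) × Rt1 can be computed directly once the two rotation parameter triples are turned into rotation matrices. A short sketch follows; the Euler-angle convention ('xyz', degrees) is an assumption for illustration, not specified by the application:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def relative_rotation(euler_t0, euler_t1, order="xyz", degrees=True):
    """Compute R01 = inv(R_t0) * R_t1, the rotation that aligns the panorama taken
    at the current time t0 with the panorama taken at the reference time t1.

    euler_t0, euler_t1: rotation parameters (Ra, Rb, Rc) of the camera pose at
    the two times; the Euler convention is an assumption and should match the
    one used by the motion-parameter estimation.
    """
    R_t0 = Rotation.from_euler(order, euler_t0, degrees=degrees).as_matrix()
    R_t1 = Rotation.from_euler(order, euler_t1, degrees=degrees).as_matrix()
    return np.linalg.inv(R_t0) @ R_t1

# Example: poses estimated by an inertial sensor unit and/or visual SLAM.
R01 = relative_rotation(euler_t0=[2.0, -1.0, 30.0], euler_t1=[0.0, 0.0, 0.0])
```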
The third method is as follows: two panoramic images are acquired with one or more ordinary (i.e., non-panoramic) cameras.
Fig. 8a, 8b, and 8c are schematic diagrams illustrating the principle of acquiring panoramic images by a single monocular camera, a single binocular camera (stereo camera), and a wide-angle camera composed of a plurality of monocular cameras, respectively, in which images at different positions of a scene at different times can be captured by controlling the camera motion, and as shown in the figures, images at different positions of the camera at times t0 and t1 can be acquired.
When one or more ordinary cameras are used, camera calibration is no longer needed; the motion parameters between camera positions can be obtained from a camera motion parameter recording module (a camera position acquisition module) while the camera moves. In this scheme, panoramic video can be captured by controlling the camera to rotate, and video with stereoscopic parallax can be captured by controlling the camera to translate; two panoramic images with stereoscopic parallax at different times are then obtained based on the captured panoramic video and the video with stereoscopic parallax.
When controlling the camera motion, the control system can plan the camera's motion trajectory through the camera motion control and camera position acquisition module, so as to control the motion of the camera and obtain its position, and it can build a depth map of the static objects in the surrounding environment through a modeling module. Because the camera's field of view is limited in this scheme, the camera cannot observe everything in the scene at the same time; the surrounding environment therefore needs to be scanned by moving the camera, so that static objects, dynamic objects and other things in the scene are observed at a certain temporal sampling rate. In practical applications, the camera can be rotated by a motorized pan-tilt platform, and the camera motion needs to be controlled in order to capture information about both the object and the panoramic scene.
Fig. 9 illustrates a flow of a method for capturing panoramic information of a user and a scene by controlling camera motion. As shown in Fig. 9, during image capture it is necessary to check whether the currently captured image set already captures the complete panoramic information of the scene. When the depth-completeness condition is not met, the camera is driven to translate and capture images; when the field-of-view completeness condition is not met, the camera is driven to rotate and capture images. Through this control flow, the capture of the scene's panoramic information is completed, and each captured image carries a 6-degree-of-freedom record of the camera's translation position and rotation posture in space at the moment of capture.
When the field angle of the images in the image set covers a 360-degree spherical field, the currently shot image set can be considered to capture the complete field of view of the scene; when there are at least two or more images having a certain shooting position distance for all regions in the 360-degree field of view, it is considered that the currently shot image set captures the complete depth information of the scene. When the visual field and the depth are complete, the spatial positions and the postures of the images are read from the self-motion parameter recording module of the camera and are put into an image set with 6-degree-of-freedom posture marks.
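The control flow of Fig. 9 can be summarized as a simple loop. The sketch below is a hedged illustration only: the camera methods and the two completeness predicates are hypothetical placeholders passed in by the caller, not an API defined by the application:

```python
def capture_panoramic_information(camera, fov_complete, depth_complete):
    """Sketch of the capture control loop of FIG. 9.

    camera is assumed to expose capture(), translate(), rotate() and
    read_6dof_pose(); fov_complete / depth_complete are callables testing the
    two completeness conditions on the image set. All names are illustrative.
    """
    image_set = []
    while not (fov_complete(image_set) and depth_complete(image_set)):
        if not depth_complete(image_set):
            camera.translate()            # missing depth: add baseline by translating
        else:
            camera.rotate()               # missing field of view: widen coverage by rotating
        image = camera.capture()
        pose = camera.read_6dof_pose()    # 6-DoF translation + rotation at the shot time
        image_set.append((image, pose))
    return image_set                      # images with 6-degree-of-freedom pose marks
```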
In addition, the control system can track the position of moving objects in the scene through an object tracking module and track the motion trajectory of the object (such as a user) through a tracking module for the object, so that the object and everything in the scene are captured comprehensively. As shown in Fig. 9, tracking of the object is maintained during capture: when the object is within the field of view, its position is estimated, and the estimated position is used to point the camera at that position at the next moment and to update the captured image of the object; when the object is outside the field of view, the camera is driven to rotate and capture images. Likewise, tracking of moving objects, i.e., dynamic objects in the scene, is maintained: when a moving object is within the field of view, its estimated position is used to point the camera at that position at the next moment and to update its image; when the moving object is outside the field of view, the camera is driven to rotate and capture images. The tracking of moving objects can be implemented with existing techniques.
For the resulting set of images with 6-degree-of-freedom pose marks, each image in the set has a record of the 6-degree-of-freedom motion pose parameters in the world coordinate system at the time the camera took the image. Using this image set, the two required panoramic images can be synthesized by image stitching.
In addition, since every region of the scene is covered by at least two images captured at positions a certain distance apart, i.e., each region corresponds to a stereo image pair, a depth map of each partial image region can be obtained by three-dimensional reconstruction from the stereo image pair (at least two images with parallax) covering that region, and the depth maps of all regions can then be stitched into a panoramic depth image. A panoramic image with depth information can therefore be acquired with this scheme.
In practical applications, if two panoramic images are obtained by one panoramic camera or at least one common camera, in order to improve the accuracy of the determined focus, two panoramic images that are relatively continuous can be obtained by controlling the cameras. The relative continuity may mean that two panoramic images are continuously acquired, that a shooting time difference between the two panoramic images is smaller than a set value, or that an acquisition duration between a plurality of two-dimensional images used for generating the two panoramic images is smaller than a set duration when the panoramic images are generated based on the acquired two-dimensional images.
In an embodiment of the present application, determining a point of interest of an object in an image in a scene based on at least two images may include:
determining attention route information corresponding to the object based on the at least two images;
and determining the attention point according to the attention route information.
The attention route information is information for identifying an attention position and an attention direction of an object in a scene, and may specifically include start point information of the attention route and direction information of the attention route. The starting point information may specifically be coordinates of the attention point in the reference coordinate system, and the direction information may specifically be an included angle between the attention route and a coordinate axis of the reference coordinate system. For example, when the point of interest is the gaze point of the user, the line of interest information is the line of sight information of the user, and when the point of interest is the pointing point of the finger of the subject, the line of interest information is the pointing line information.
In the embodiment of the present application, determining, based on at least two images, route information of interest corresponding to an object includes:
determining a position image of the object based on the at least two images;
the attention route information is determined based on the position image of the object.
Wherein the position image of the object may include at least one of:
a body image group, a head image group, a face key point image group, an eye image group, an arm image group, and a hand image group.
In practical application, one or more of the above image groups can be selected according to different application requirements and application scenes, and the specific content of the point of interest to be determined. For example, when the point of interest is a fixation point, the region image may include at least one of a body image group, a head image group, a face key point image group, and an eye image group; when the focus point is a pointing point of a finger, the position image may be at least one of a body image group, a hand image group, and an arm image group.
In an embodiment of the present application, determining the position image of the object based on the at least two images may include:
determining an image quality of an object image of the at least two images;
based on the image quality of the object image, a corresponding location image is determined.
In practical applications, the quality of the acquired images may vary with the application scenario and/or the image capture device. To better adapt to different scenarios and to avoid an excessive deviation of the finally determined point of interest caused by image quality, after the at least two images are acquired, the position image used for determining the attention route information can be chosen according to the quality of the object image in the images.
The image quality of the object image can be identified by the image quality of at least two images, or the object image can be extracted from at least two images, and then the quality of the extracted object image is determined. The image quality may be classified based on at least one of resolution, sharpness, and the like of the image.
Generally, the higher the image quality, the more accurately the position image can be extracted and the more accurate the attention route information determined from it. For example, when the object is a user and the point of interest is the gaze point: if the quality of the user image is high enough, e.g., a preset high-quality condition is met, the position image can be refined down to the eye image group, and the sight-line information can be determined based on the head image group, the face image group and the eye image group; if the quality is not high enough, e.g., a medium-quality condition is met but the high-quality condition is not, the position image may only be refined to the head image group, the sight-line information is determined based on the head images, and the orientation of the user's head is used as the sight-line information on the assumption that the user's visual field follows the head orientation; if the image quality is worse still and even the medium-quality condition is not met, the position image may only be refined to the body image group and the sight-line information has to be determined from it. In practical applications, different image-quality conditions can be configured as needed. As another example, when the point of interest is the pointing point of a finger: if the image quality is high enough, the position image can be refined to the hand image group and the pointing route information determined from the hand image group, or from the hand and arm image groups together; if the image quality is not high enough, the position image may only be refined to the arm image group, the pointing route information has to be determined from the arm image group, and the pointing direction of the arm is used to represent the pointing direction of the hand.
As shown in fig. 10, when the object is a user, taking three-dimensional sight line information as an example, first, user detection may be performed on an acquired image to obtain a user image, quality of the user image is obtained by evaluating quality of the user image (quality evaluation may be performed based on resolution, sharpness, and the like), and when the quality is high, three-dimensional sight line estimation may be performed based on a face image and a human eye image to obtain three-dimensional sight line information; in medium quality, three-dimensional sight estimation can be carried out based on the face image, and the face orientation is used as the three-dimensional sight information of the fixation point; at low quality, three-dimensional gaze estimation may be performed based on body orientation, which is taken as three-dimensional gaze information for the point of regard.
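A hedged sketch of the quality-based cascade of Fig. 10 is shown below. The variance of the Laplacian is used as one possible sharpness proxy, the thresholds are illustrative, and the three gaze estimators are placeholder callables supplied by the caller rather than components defined by the application:

```python
import cv2

def sharpness(gray):
    """Variance of the Laplacian as a simple sharpness proxy (one possible metric)."""
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def estimate_gaze(user_crop, eye_estimator, face_estimator, body_estimator,
                  high_q=300.0, mid_q=100.0):
    """Cascade of FIG. 10: select the gaze estimator according to user-image quality.
    The estimator callables and the thresholds are illustrative placeholders."""
    gray = cv2.cvtColor(user_crop, cv2.COLOR_BGR2GRAY)
    q = sharpness(gray)
    if q >= high_q:
        return eye_estimator(user_crop)      # face + eye images -> 3D sight line
    if q >= mid_q:
        return face_estimator(user_crop)     # face orientation as the sight line
    return body_estimator(user_crop)         # body orientation as the sight line
```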
In this embodiment of the application, determining the route information of interest based on the position image of the object may include:
obtaining at least two pieces of initial attention route information based on the position image of the object;
and obtaining the concerned route information by fusing at least two pieces of initial concerned route information.
To improve the accuracy of the attention route information, the two (or more) different pieces of initial attention route information obtained can be fused into the final attention route information.
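The application does not prescribe a particular fusion rule; a simple weighted average of start points and normalized direction vectors, sketched below purely as an illustration, is one possible way to combine several initial routes:

```python
import numpy as np

def fuse_routes(routes, weights=None):
    """Fuse several initial attention routes (start point, direction) into one.

    routes:  list of (start, direction) pairs of 3-vectors
    weights: optional per-route confidences; uniform if omitted
    Returns the fused (start point, unit direction).
    """
    starts = np.array([s for s, _ in routes], dtype=float)
    dirs = np.array([np.asarray(d, float) / np.linalg.norm(d) for _, d in routes])
    w = np.ones(len(routes)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    start = (w[:, None] * starts).sum(axis=0)
    direction = (w[:, None] * dirs).sum(axis=0)
    return start, direction / np.linalg.norm(direction)
```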
In this embodiment of the application, the method for determining the attention point may further include:
and correcting the attention route information by performing correction through part modeling.
After the attention route information is obtained based on the part image, modeling can be carried out based on at least one image in the part images, and then the attention route information can be corrected based on the difference between the part model image and the corresponding part image, so that the accuracy of the attention route information is improved.
For example, when the attention route information is the sight line information, face modeling may be performed based on the already obtained sight line information, and correction of the sight line information may be performed based on a difference between the face model and at least one face image in the face image group; when the attention route information is the pointing route information, the pointing route information can be corrected based on a difference between the hand model image and at least one hand image in the hand image group by performing hand modeling based on the already obtained pointing route information.
In this embodiment of the application, the method for determining the attention point may further include:
based on the position image of the object, the category information of the object is obtained.
In practical applications, the method for determining the attention point according to the embodiment of the application may further obtain the category information of the object. When the attention route information is obtained through a neuron network, the network can be trained to output both the attention route information and the category information of the object. The attention route information output by such a trained network is produced on the basis of the known object category, which can effectively improve its accuracy compared with a network that outputs only the attention route information.
In the embodiment of the present application, determining the route information of interest based on the position image of the object may include at least one of the following manners:
the first method is as follows:
obtaining attention route information through a first neuron network based on the position image;
the second method comprises the following steps:
obtaining starting point information of the concerned route based on the part image;
based on the part image, direction information of the route of interest is determined by the second neuron network.
As can be seen, the start point information and the direction information of the attention route may be obtained simultaneously based on the position images, or obtained separately. The neuron network (the first neuron network or the second neuron network) may be an attention route information estimation model obtained through deep learning training; specifically, the first neuron network outputs both start point information and direction information, while the second neuron network outputs direction information only. The neuron network may be one or more of a convolutional neuron network, a fully connected neuron network, a twin neuron network, or other types of neuron networks.
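As a rough illustration of the two output configurations, the sketch below regresses either the full (X, Y, Z, Ra, Rb, Rc) tuple or the direction alone from part-image features. The layer sizes and feature dimension are placeholders, not the networks described here.

```python
import torch
import torch.nn as nn

# Hedged sketch: layer sizes are arbitrary placeholders.

class RouteHead(nn.Module):
    """Fully connected head regressing attention-route information from part-image features."""
    def __init__(self, feat_dim: int = 256, joint: bool = True):
        super().__init__()
        # Way one: output start point + direction jointly (6 values).
        # Way two: output direction only (3 values); the start point is obtained
        # separately, e.g. by triangulating facial key points.
        out_dim = 6 if joint else 3
        self.fc = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.fc(features)

features = torch.randn(2, 256)                  # features of a part-image group (batch of 2)
print(RouteHead(joint=True)(features).shape)    # torch.Size([2, 6]) -> (X, Y, Z, Ra, Rb, Rc)
print(RouteHead(joint=False)(features).shape)   # torch.Size([2, 3]) -> (Ra, Rb, Rc)
```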
In this embodiment of the application, if the attention point is a pointing point of a finger, the attention route information is obtained based on the position image, and the method specifically includes:
obtaining attention route information based on the hand image group; and/or,
and obtaining attention route information based on the hand image group and the arm image group.
Specifically, based on the hand image group, hand motion posture classification can be performed on the hand images, the pointing motion of the target hand is recognized, and then the pointing route information is determined based on the image of the portion corresponding to the pointing motion in the hand image group.
In addition, the performance of the pointing route estimation can be enhanced by analyzing the arm image group. Specifically, the arm posture can be recognized based on the arm image group, anatomical constraints of finger pointing and arm posture are utilized, the pointing route information is constrained based on the recognized arm posture, and the accuracy of the pointing route information is improved.
As can be seen from the foregoing description, obtaining the pointing route information of the finger based on the hand image group, or based on the hand image group and the arm image group, may specifically be implemented through a neuron network that takes these image groups as input.
In the embodiment of the present application, when the point of interest is a gaze point, obtaining the route information of interest based on the position image may specifically include at least one of the following manners:
mode 1:
the attention route information is obtained based on the head image group, the face key point image group, and the eye image group.
Mode 2:
obtaining initial attention route information based on the head image group, the face key point image group and the eye image group;
performing face modeling according to the initial attention route information to obtain a face model image;
and acquiring the face model image and the image error of the face image, and acquiring the attention route information according to the initial attention route information and the image error.
Mode 3:
obtaining a head characteristic image group, a face characteristic image group and an eye characteristic image group according to the head image group;
obtaining first initial attention route information based on the head characteristic image group, the face characteristic image group and the eye characteristic image group;
obtaining second initial attention route information based on the eye feature image group;
and fusing the first initial concerned route information and the second initial concerned route information to obtain the concerned route information.
Comparing mode 1 with mode 2, it can be seen that mode 2 may first obtain attention route information by the scheme of mode 1, then correct it by means of face modeling, and take the corrected result as the final attention route information. Mode 3 obtains the final attention route information by fusing two different pieces of initial attention route information.
In the embodiment of the present application, the attention route information and the initial attention route information in modes 1 and 2, as well as the first initial attention route information and the second initial attention route information in mode 3, may all be obtained through neuron networks.
When the position images are determined based on the at least two images, modes 1, 2 and 3 may determine them in the same way or in different ways. For example, the position images in modes 1 and 2 each include a head image group, a face key point image group and an eye image group; if the two modes determine the position images in the same way, the corresponding position images in the two modes are identical, whereas if the determination methods differ, the corresponding position images in the two modes may or may not be identical.
In this embodiment of the application, obtaining, by a neural network, the first initial route information of interest and the second initial route information of interest in the above mode 3 based on the group of head images may specifically include:
extracting the features of the head image group through a first convolutional neural network to obtain a head feature image group;
carrying out face position detection on the head characteristic image group through a first full-connection neuron network to obtain face position information of each head characteristic image;
obtaining a face feature image group by the face feature pooling layer according to the head feature image group and the face position information of each head feature image;
performing face key point detection on the face feature image group through a second full-connection neuron network to obtain face key point position information of each face feature image;
obtaining an eye feature image group by the eye feature pooling layer according to the face feature image group and the position information of the face key points of each face feature image;
extracting the features of the head feature image group, the face feature image group and the eye feature image group through a second convolutional neural network, and obtaining first initial attention route information through a third fully-connected neural network according to the extracted feature images;
and performing feature extraction on the eye feature image group through a third convolution neural network, and obtaining second initial attention route information through a fourth full-connection neural network according to the extracted feature image.
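As a schematic sketch of the cascaded pipeline just listed, the code below chains head-feature extraction, face position detection, eye-feature pooling and the two regression branches. All layer sizes, the pooling/crop stand-ins and the fusion rule are illustrative assumptions; only the ordering of the stages follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only; not the claimed architecture.

class GazePipeline(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn_head = nn.Conv2d(3, 8, 3, padding=1)     # head feature extraction
        self.fc_face_pos = nn.Linear(8 * 64 * 64, 4)      # face position (x, y, w, h) in [0, 1]
        self.cnn_global = nn.Conv2d(8, 8, 3, padding=1)   # head + face + eye features
        self.fc_global = nn.Linear(8 * 64 * 64, 6)        # first initial route info (X, Y, Z, Ra, Rb, Rc)
        self.cnn_local = nn.Conv2d(8, 8, 3, padding=1)    # eye features
        self.fc_local = nn.Linear(8 * 16 * 16, 6)         # second initial route info

    def forward(self, head_img):                          # head_img: (B, 3, 64, 64)
        head_feat = F.relu(self.cnn_head(head_img))
        face_box = torch.sigmoid(self.fc_face_pos(head_feat.flatten(1)))
        # In a full implementation face_box would drive an RoI crop (the "face feature
        # pooling layer"); here the crops are replaced by identity / average pooling.
        face_feat = head_feat
        eye_feat = F.adaptive_avg_pool2d(face_feat, 16)   # stand-in for eye feature pooling
        first_init = self.fc_global(F.relu(self.cnn_global(face_feat)).flatten(1))
        second_init = self.fc_local(F.relu(self.cnn_local(eye_feat)).flatten(1))
        return 0.5 * first_init + 0.5 * second_init       # simple fusion of the two estimates

print(GazePipeline()(torch.randn(2, 3, 64, 64)).shape)    # torch.Size([2, 6])
```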
The following description will be made of three ways of obtaining the route information of interest among the above-described ways 1, 2, and 3, taking the user as an object, the user's gaze point as a point of interest, the three-dimensional sight line information as route information of interest, and two panoramic images of a scene as an example.
In the mode 1, the region image includes a head image group, a face image group, an eye image group, and a face key point image group, and three-dimensional sight line information of the user can be obtained by the neural network based on these region images. As a specific example of mode 1, fig. 11 shows a schematic diagram of obtaining three-dimensional sight line information by a neuron network. The neuron network may include one convolutional neuron network and two fully-connected neuron networks respectively connected to the convolutional neuron networks, wherein one of the fully-connected neuron networks is a regression network for regressing and outputting three-dimensional sight line information, and the other one of the fully-connected neuron networks is a classification network for feature classification of an image.
As shown in fig. 11, two head images, two face key point images and two eye images extracted from the two panoramic images are used as the input of the deep convolutional neuron network. Before being input, all the input images may be unified to the same size by zero padding through a feature connection layer and then stacked into a multi-layer image pile of the same size, which is fed to the convolutional neuron network. The convolutional neuron network performs feature extraction on the input, and the resulting feature maps are connected to the two fully connected branches. The regression network outputs the three-dimensional sight line information based on the features output by the convolutional neuron network; the categories output by the classification network may include classification information such as the coarse head orientation of the user, gender, whether glasses are worn, and age. These categories can be further refined; for example, the coarse head orientation may be further divided into eight categories, and the age group may be divided into five types: children, teenagers, young adults, middle-aged adults and the elderly.
As can be seen from the foregoing description, the classification network shown in fig. 11 is not necessary for determining the three-dimensional sight line information. In practical applications, the neuron network shown in fig. 11 may omit the classification network; the classification branch may be used only during training so that the regression network learns better features, improving the accuracy of the three-dimensional sight line information it outputs. Therefore, when three-dimensional sight line information is obtained with the neuron network shown in fig. 11, the fully connected classification branch may be left unused to save computation.
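For illustration, the sketch below shows a shared backbone with a regression branch and an auxiliary classification branch that can be skipped at inference. Layer sizes, the number of stacked input channels and the class count are assumptions, not the network of fig. 11.

```python
import torch
import torch.nn as nn

# Hedged sketch of a shared-backbone network with regression and auxiliary classification heads.

class GazeNet(nn.Module):
    def __init__(self, n_classes: int = 16):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),    # stacked image pile as input channels
            nn.AdaptiveAvgPool2d(8), nn.Flatten())
        self.regress = nn.Linear(16 * 8 * 8, 6)           # (X, Y, Z, Ra, Rb, Rc)
        self.classify = nn.Linear(16 * 8 * 8, n_classes)  # auxiliary: orientation, glasses, age, ...

    def forward(self, image_pile, with_aux: bool = False):
        feat = self.backbone(image_pile)
        if with_aux:                        # classification branch used only during training
            return self.regress(feat), self.classify(feat)
        return self.regress(feat)           # inference: skip the classifier to save computation

net = GazeNet()
gaze = net(torch.randn(1, 6, 64, 64))       # six zero-padded images stacked into one pile
print(gaze.shape)                           # torch.Size([1, 6])
```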
In the mode 2, the part image includes a head image group, a face image group, an eye image group, and a face key point image group, and as a specific example of the mode 2, fig. 12 shows a schematic diagram in which three-dimensional sight line information is obtained by a convolutional neuron network, a fully-connected neuron network, a three-dimensional face rendering model, and a twin neuron network. Specifically, the neuron network in this example includes a convolutional neuron network and a fully connected neuron network connected to the convolutional neuron network, and the twin neuron network includes an image comparison twin network connected to a fully connected neuron network. In this scheme, the determination of the three-dimensional sight line information may be performed in two stages.
In the first stage, the two head images, two face key point images and two eye images are feature-connected and input to the convolutional neuron network; the image features output by the convolutional neuron network are connected to a fully connected neuron network for regression, which outputs a piece of initial three-dimensional sight line information, i.e. (X', Y', Z', Ra', Rb', Rc') shown in the figure, where (X', Y', Z') is the three-dimensional sight line start point of the initial sight line information and (Ra', Rb', Rc') is its three-dimensional sight line direction.
In the second stage, the initial three-dimensional sight line information is used to render a three-dimensional model image of the user's face. The three-dimensional face rendering model may use a parameterized face image rendering method; the rendered face image has the same face orientation and eye sight orientation as the initial three-dimensional sight line information, i.e. the rendering of the three-dimensional face model image is completed according to the initial three-dimensional sight line information. In addition, besides the initial three-dimensional sight line information, the rendering may also incorporate face information (such as head key points and face key points) so that the rendered three-dimensional face model image is closer to the face images input to the convolutional neuron network. The parameterized face rendering method is an image rendering technique for drawing a three-dimensional face model image; it can automatically adjust the shape and texture of the three-dimensional eye and face models according to the head and/or face pose and the three-dimensional sight line information, and render the three-dimensional face image.
After rendering, image error extraction is performed through an image comparison twin network between the rendered three-dimensional face model image and at least one of the two face images originally input to the convolutional neuron network; the extracted error features and the initial three-dimensional sight line information are input to a fully connected neuron network, which outputs the final three-dimensional sight line information based on the initial three-dimensional sight line information and the image error.
As can be seen from the schemes shown in fig. 11 and 12, compared with the scheme shown in fig. 11, the scheme shown in fig. 12 can correct the initial three-dimensional sight line information based on the image error, and obtain more accurate three-dimensional sight line information.
It is to be understood that the convolutional neural network in the scheme shown in fig. 12 may be the same as or different from the convolutional neural network in the scheme shown in fig. 11, and similarly, the fully-connected neural network shown in fig. 12 that outputs the initial three-dimensional sight line information may be the same as or different from the fully-connected neural network shown in fig. 11 that outputs the three-dimensional sight line information.
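As a rough illustration of the second-stage correction in fig. 12, the sketch below compares a rendered face with an observed face through weight-sharing branches and refines the initial sight line from the error features. The renderer itself is assumed (not shown), and all network sizes are placeholders.

```python
import torch
import torch.nn as nn

# Hedged sketch of render-and-compare refinement; a parameterized 3-D face renderer is assumed.

class RefineStage(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.compare = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                                     nn.Linear(8 * 4 * 4, feat_dim))   # shared twin branch
        self.refine = nn.Linear(feat_dim + 6, 6)                       # error features + initial gaze

    def forward(self, rendered_face, input_face, initial_gaze):
        # The twin branches share weights; their difference encodes the image error.
        err = self.compare(rendered_face) - self.compare(input_face)
        return self.refine(torch.cat([err, initial_gaze], dim=1))      # corrected (X, Y, Z, Ra, Rb, Rc)

stage = RefineStage()
rendered = torch.randn(1, 3, 64, 64)    # stand-in for the rendered 3-D face model image
observed = torch.randn(1, 3, 64, 64)    # one of the original face images
print(stage(rendered, observed, torch.randn(1, 6)).shape)   # torch.Size([1, 6])
```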
As for mode 3 above, the position images may include only the head image group; the first initial attention route information and the second initial attention route information are obtained through neuron networks based on the head images, and the final attention route information is obtained by fusing the two.
As a specific example of the mode 3, fig. 13 shows a schematic diagram of obtaining the first initial attention route information and the second initial attention route information through the neural network, and then obtaining the attention route information by fusing the first initial attention route information and the second initial attention route information through the fusion network. In this example, the user is still used as an object, the gaze point is used as a focus point, the three-dimensional sight line information is used as focus route information, and two panoramic images are used as examples.
As shown in fig. 13, the input in this scheme is two head images. The two head images first pass through a convolutional neuron network CNN1 to obtain two head feature images, and the two head feature images pass through a fully connected neuron network FC1 for face detection, which outputs the face position information (face position) in each head feature image. The two head feature images and the detected face position information are input to a face feature pooling layer, which crops the image region corresponding to the face position from each head feature image according to the face position information, forming two face feature images. The two face feature images then pass through a fully connected neuron network FC2 for face key point detection, which detects the face key point position information (face key points), such as the eyes, in each face feature image. The two face feature images and the detected face key point position information are input to an eye feature pooling layer, which crops the eye feature image corresponding to the eye region from each face feature image according to the eye position information among the detected face key points. After feature re-extraction of the two eye feature images through a convolutional neuron network CNN3, the extracted features are input to a fully connected neuron network FC3 for regression, which performs local three-dimensional sight line regression to obtain the local three-dimensional sight line information (the second initial attention route information). The two head feature images, two face feature images and two eye feature images are feature-connected and input to a convolutional neuron network CNN2 for feature re-extraction, and the extracted features are input to another fully connected neuron network FC4 for regression, which performs global three-dimensional sight line regression to obtain the global three-dimensional sight line information (the first initial attention route information). Finally, the global three-dimensional sight line information and the local three-dimensional sight line information are fused through a fusion network to obtain the final three-dimensional sight line estimate, i.e. the user's three-dimensional sight line information.
The specific scheme of fusing the global three-dimensional sight line information and the local three-dimensional sight line information to obtain the final three-dimensional sight line information can be selected according to the requirement. For example, the global three-dimensional gaze information and the local three-dimensional gaze information may be fused by means of a weighted average.
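For illustration, a minimal weighted-average fusion might look as follows; the weight value is an assumption, and the re-normalization of the direction component is a design choice added here rather than something specified above.

```python
import numpy as np

# Minimal sketch of weighted-average fusion of the global and local 3-D sight estimates.

def fuse_gaze(global_gaze, local_gaze, w_global=0.4):
    fused = w_global * np.asarray(global_gaze, dtype=float) \
        + (1.0 - w_global) * np.asarray(local_gaze, dtype=float)
    fused[3:] /= np.linalg.norm(fused[3:])       # keep the direction part (Ra, Rb, Rc) unit-length
    return fused

g = np.array([0.0, 0.0, 0.0, 0.10, 0.02, 0.99])  # global (X, Y, Z, Ra, Rb, Rc)
l = np.array([0.0, 0.0, 0.0, 0.12, 0.01, 0.99])  # local estimate from the eye branch
print(fuse_gaze(g, l))
```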
In the embodiment of the present application, when the start point information and the direction information of the attention route information are determined separately, as in the second method above, the position images may include a head image group, a face image group, an eye image group and a face key point image group.
Fig. 14 is a schematic diagram illustrating a method for determining the position images of an object based on two panoramic images in an example of the present application. In this example, the user's three-dimensional sight line information is still used as the attention route information, and two panoramic images are taken as an example. As shown in fig. 14, the user's position may be detected from the two panoramic images (body detection) to obtain two user images; head detection is then performed on the two user images to obtain two head images; face detection is performed on the two head images to obtain two face images; and the positions of the face key points and the eye image regions are detected from the two face images to obtain two face key point images and two eye images. The face key points mainly include the eyes, the roots of the two nose wings, the mouth corners and the like. The user images, head images, face images and face key point images may be obtained using existing techniques, which will not be described in detail here.
It is to be understood that the scheme shown in fig. 14 may be used to obtain the position images of the object for modes 1, 2 and 3; other schemes may also be used.
As for the second method, the orientation of the face relative to the reference coordinate system can be estimated using the extracted face key point positions; the spatial position of the face relative to the reference coordinate system can be calculated by triangulation from the positions of corresponding key points detected in the two panoramic images, and the spatial positions of the person's eyes relative to the reference coordinate system can likewise be obtained, giving the coordinates of the midpoint of the line connecting the two eye centers in the reference coordinate system, i.e. the start point information of the attention route information.
The principle of triangulation is to calculate the position of a key point in the camera coordinate system from the distance (baseline) between the optical centers of two cameras (or of the same camera at two different positions) and the two angles between the baseline and the lines connecting the two optical centers to the key point. The detailed steps for calculating the spatial position of a key point by triangulation are existing technology and will not be described in detail here.
When the reference coordinate system is not the coordinate system of the camera for acquiring the panoramic image, after the positions of the key points (e.g., the face key point and the eye center) are calculated by triangulation, the calculated positions need to be converted into the reference coordinate system.
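For illustration, a generic ray-midpoint triangulation under the above principle might look as follows. The panoramic projection model that converts pixel coordinates into viewing rays is assumed and not shown; only the geometric step is sketched.

```python
import numpy as np

# Hedged sketch: recover a key point's 3-D position from two optical centers and the
# viewing directions toward the same key point (closest point between the two rays).

def triangulate(c1, d1, c2, d2):
    """Midpoint of the closest points between rays x = c1 + t*d1 and x = c2 + s*d2."""
    d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    r = c1 - c2
    denom = a * c - b * b                      # ~0 if the rays are (nearly) parallel
    t = (b * (d2 @ r) - c * (d1 @ r)) / denom
    s = (a * (d2 @ r) - b * (d1 @ r)) / denom
    return 0.5 * ((c1 + t * d1) + (c2 + s * d2))

c1, c2 = np.array([0.0, 0, 0]), np.array([0.1, 0, 0])   # two optical centers, 0.1 m baseline
p = np.array([0.3, 0.2, 2.0])                            # ground-truth key point (for checking)
print(triangulate(c1, p - c1, c2, p - c2))               # ~ [0.3, 0.2, 2.0]
```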
For the determination of the direction information in the second method, it may be obtained through a neuron network based on the position images; the neuron network may be a direction information estimation model of the attention route obtained through deep learning training, and this model may only need to output the direction information. For example, for three-dimensional sight line information, only the direction information (Ra, Rb, Rc) may be output.
Taking the three-dimensional sight line information as an example, it can be understood that when (X, Y, Z) and (Ra, Rb, Rc) are determined separately, (Ra, Rb, Rc) may be determined in any of the manners shown in figs. 11 to 13; it is only necessary, when training the neuron network, to train the networks of figs. 11 to 13 that output (X, Y, Z, Ra, Rb, Rc) to output (Ra, Rb, Rc) instead. For example, in the network architecture shown in fig. 8, the fully connected neuron network serving as the regression network may only need to output (Ra, Rb, Rc), i.e. the regression network obtained through training only needs the detection function for (Ra, Rb, Rc).
In the embodiment of the present application, the position image of the object may be obtained by performing motion prediction on the object based on the historical position information of the position image.
In practical applications, once the object image has been detected in an image at a certain moment, for images acquired later a candidate region in which the object is likely to appear in the current image can be predicted from the motion of the object at the previous moment, and object image detection can then be carried out only within that candidate region. That is, based on the historical position information of the object's position images, motion prediction can be performed on the object, and the predicted candidate region can be detected to obtain the position images of the object at the current moment. In this way, the computational overhead can be effectively reduced. The specific manner of predicting the motion of the object can be realized with existing techniques.
As an example, fig. 15 shows a schematic diagram of a method of obtaining the position images of a user. After system initialization, i.e. after the panoramic images of the scene where the user is located have been obtained, user image detection is performed on the panoramic image acquired at the current time T0, yielding the user image features at that time (the user's position images, such as the body image, the face image and the face key point images). For an image at a time Ti after T0, the user's motion can be predicted: the possible position of the user image in the image at time Ti is predicted based on the position of the user image in the image at time T0, the predicted position is taken as the candidate region (user image candidate region, UCR) for detecting the user's position images at time Ti, and user detection is performed within that candidate region of the image acquired at time Ti to obtain the user's position images.
With this scheme, in practical applications, if the user's attention points at different moments need to be tracked based on at least two panoramic images of corresponding frames in at least two videos, the detection of the object's position images can be carried out continuously on each frame of the input videos after system initialization. Once the user image has been detected for the first time at some moment, for each subsequent frame the candidate region where the user is likely to appear can be predicted from the user's motion at the previous moment, and user image detection is then performed only within that candidate region. This avoids full user image detection on every panoramic frame and effectively reduces the amount of computation.
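For illustration only, a simple constant-velocity prediction of the candidate region might look as follows; a real system could instead use a Kalman filter or a learned predictor, and the margin value is an assumption.

```python
import numpy as np

# Minimal sketch of predicting the user-image candidate region (UCR) for the next frame.

def predict_candidate_region(prev_boxes, margin: float = 0.2):
    """prev_boxes: list of (cx, cy, w, h) user-image boxes at earlier frames (oldest first)."""
    (cx0, cy0, w, h), (cx1, cy1, _, _) = prev_boxes[-2], prev_boxes[-1]
    vx, vy = cx1 - cx0, cy1 - cy0                 # image-plane velocity between the last two frames
    cx_pred, cy_pred = cx1 + vx, cy1 + vy         # predicted centre at the next frame
    return (cx_pred, cy_pred, w * (1 + margin), h * (1 + margin))   # enlarged search window

print(predict_candidate_region([(100, 80, 40, 90), (104, 82, 40, 90)]))
# -> (108, 84, 48.0, 108.0): detection at the next frame is run only inside this region.
```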
In an embodiment of the present application, the manner of determining the attention point of the object in the scene in the at least two images based on the at least two images may include at least one of the following manner ① and manner ②:
mode ①:
determining the matching degree of image information corresponding to projection points of the concerned route on at least two images according to the concerned route information;
based on the degree of matching, a point of interest is determined.
Mode ②:
acquiring a depth image of a scene;
according to the attention route information, determining the matching degree of the depth values of the route points of the attention route corresponding to the projection points of the depth image and the depth values corresponding to the corresponding route points;
based on the degree of matching, a point of interest is determined.
A route point of the attention route refers to a point on the attention route; for the user's sight line information, the route points are the sight points.
In an embodiment of the present application, the depth image of the scene may be selected as a dense depth image of the scene.
A dense depth image is an image in which every pixel has a depth value; by using a dense depth image, depth information is available at every pixel.
The depth value of a route point may be calculated based on its coordinates in the reference coordinate system. When the reference coordinate system and the camera coordinate system of the depth image are the same coordinate system, the distance between the route point's coordinates in the reference coordinate system and the origin of the reference coordinate system may be calculated and used as the depth value of the route point. When they are not the same coordinate system, the depth value at the projection point of the route point on the depth image and the depth value of the route point need to be unified into the same coordinate system, and the attention point is determined based on the degree of matching between the two depth values after unification. Specifically, either of the following two ways may be adopted:
One way is to convert the coordinates of the route point in the reference coordinate system into the corresponding coordinates in the camera coordinate system of the depth image, and calculate the distance between the converted route point coordinates and the origin of that camera coordinate system; this distance serves as the depth value of the route point, i.e. of the pixel corresponding to the projection of the route point on at least one of the at least two images. The attention point is then determined based on the degree of matching between this depth value and the depth value at the projection point of the corresponding route point on the depth image. If a route point is the attention point, the depth value at its projection point on the depth image of the scene should match the calculated depth value of the route point; therefore, for any route point, comparing the depth value read from the depth image of the scene with the computed depth value of the route point indicates whether that route point is the attention point.
Another way is to convert the depth values of the pixels in the depth image into depth values in the reference coordinate system, i.e. to convert the depth value at the projection point of the route point on the depth image into the reference coordinate system. In this case the depth value of the route point can be obtained directly by calculating the distance between the route point's coordinates in the reference coordinate system and the origin of the reference coordinate system, and the attention point is determined based on the degree of matching between this calculated depth value and the converted depth value at the projection point.
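For illustration, the first of the two ways above might be sketched as follows. The pose (R, t) of the depth camera in the reference coordinate system, the projection function of the depth camera and the tolerance are all assumptions introduced here.

```python
import numpy as np

# Hedged sketch of depth-based intersection testing (manner ②).

def route_point_matches(route_point_ref, depth_image, R, t, project_to_depth_image, tol=0.05):
    p_cam = R @ route_point_ref + t                 # reference frame -> depth-camera frame
    route_depth = np.linalg.norm(p_cam)             # depth value of the route point itself
    u, v = project_to_depth_image(p_cam)            # its projection point on the depth image
    measured_depth = depth_image[v, u]              # depth value stored at that pixel
    return abs(route_depth - measured_depth) < tol  # match -> candidate attention point

depth_img = np.full((4, 4), 2.0)                    # toy 4x4 depth image, everything at 2 m
ok = route_point_matches(np.array([0.0, 0.0, 2.0]), depth_img,
                         np.eye(3), np.zeros(3),
                         lambda p: (1, 1))          # toy projection: always pixel (1, 1)
print(ok)                                           # True: depths agree within tolerance
```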
In this embodiment of the application, the depth image of the scene may be a depth image obtained based on the at least two acquired images, may also be a depth image of the scene obtained through shooting by the depth camera, and may also be a depth image of the scene obtained based on a stereoscopic image pair of each region in the scene shot by the camera when the at least two images are acquired in the above-described manner.
As can be seen from the foregoing description, if a route point intersects an object in space, then ideally the image information of the image windows around its projection points (e.g. the sight point b shown in fig. 1) on the at least two images should be the same. Therefore, the attention point can be determined based on the degree of matching of the image information corresponding to the projection points.
In this embodiment of the application, after obtaining the matching degree, the method may further include: correcting the matching degree by at least one of the following modes:
and (4) carrying out focus point motion prediction correction and image semantic segmentation correction.
By correcting the matching degree and taking the corrected matching degree as the final matching degree, the accuracy of the determined attention point can be further improved.
In the embodiment of the present application, determining the matching degree of the image information corresponding to the projection points of the route of interest on the at least two images may include at least one of the following manners:
the method a:
determining the image similarity degree of image windows corresponding to projection points of route points of the attention route on at least two images, wherein the image similarity degree is the matching degree;
mode b:
determining the image similarity of image windows corresponding to projection points of route points of the attention route on at least two images;
obtaining a predicted position of the route point according to the historical position of the route point, and obtaining a position similarity degree corresponding to the route point based on the current position and the predicted position of the route point;
determining the matching degree according to the image similarity degree and the position similarity degree;
mode c:
determining the image similarity of image windows corresponding to projection points of route points of the attention route on at least two images;
performing image semantic segmentation on at least one panoramic image to obtain a semantic segmentation image; determining the semantic image information corresponding to the projection points of the route points on the semantic segmentation image; and obtaining the semantic possibility degree corresponding to each route point based on the semantic image information and the image window corresponding to the projection point of that route point on at least one of the at least two images;
determining the matching degree according to the image similarity degree and the semantic possibility degree;
mode d:
determining the image similarity of image windows corresponding to projection points of route points of the attention route on at least two images;
obtaining a predicted position of the route point according to the historical position of the route point, and obtaining a position similarity degree corresponding to the route point based on the current position and the predicted position of the route point;
performing image semantic segmentation on at least one panoramic image to obtain a semantic segmentation image; determining the semantic image information corresponding to the projection points of the route points on the semantic segmentation image; and obtaining the semantic possibility degree corresponding to each route point based on the semantic image information and the image window corresponding to the projection point of that route point on at least one of the at least two images;
and determining the matching degree according to the image similarity degree, the position similarity degree and the semantic possibility degree.
The image window corresponding to the projection point of the route point on the image is the image window determined in the image according to the projection point on the image. The specific mode of determining the image window according to the projection point may be selected according to actual needs, for example, the image window may be obtained on the panoramic image according to a preset window size with the projection point as a center. The image similarity of the image windows corresponding to the projection points of the route points on the at least two panoramic images refers to the similarity of the images in the image windows corresponding to the at least two projection points of the route points on the at least two images.
For mode a, each route point of the attention route can be projected onto the at least two images respectively, each route point corresponding to one projection point in each image; all the projection points of the route points on a given image thus form a projection curve corresponding to the attention route. For each projection point, an image window can be determined, so the image similarity degree of the image windows corresponding to a route point's projection points on the different images can be calculated, and the attention point is determined based on this similarity degree.
As an example, fig. 16 is a schematic diagram illustrating a method of determining the attention point with the image similarity degree as the matching degree, again using two panoramic images. As shown in fig. 16, the attention route is projected onto the two panoramic images (the upper and lower panoramic images), giving the corresponding projection curves (the arc-shaped broken lines in the two images). Since the projection points of the same spatial object point on the two images lie on the same image column, the two projection points of any route point on the attention route also lie on the same image column. Starting from the start point of the attention route, the image similarity degree of the corresponding image windows on the two panoramic images can be calculated for each point on the projection curve. Specifically, a point is selected on the projection curve of the upper panoramic image, the point with the same column coordinate is selected on the projection curve of the lower panoramic image, and windows centered on these two points are taken from the upper and lower panoramic images (the window size can be chosen as required). By computing the image similarity of the two image windows for each route point, a curve of image similarity along the route direction is obtained, such as the image window similarity curve shown in fig. 16, where a1 and a2 are the two projection points of the same route point on the two panoramic images, W1 and W2 are the corresponding image windows, and the image similarity of W1 and W2 corresponds to point S on the similarity curve. After the image similarity degree corresponding to each route point is determined, the attention point can be determined based on these similarities.
The specific calculation method of the image similarity can be selected according to application requirements, such as calculating the similarity according to the gray level difference. The image characteristics of the corresponding image windows can be extracted, the similarity degree of the image characteristics of the corresponding image windows is calculated, and the similarity degree of the image characteristics is used as the image similarity degree. For example, a convolutional neuron network may be adopted to extract features of images of corresponding image windows, to obtain feature images of each image, and then different feature images are input to a twin neuron network, so that a similarity score between the feature images may be obtained by the twin neuron network, and the similarity score is used as an image similarity degree.
In this embodiment of the application, when the image similarity is taken as the matching degree, the determining of the attention point may include:
the route point that is closest to the starting point of the attention route and is greater than the set value in the image similarity degrees corresponding to all the route points may be determined as the attention point, or the route point corresponding to the maximum image similarity degree in the image similarity degrees corresponding to all the route points may be determined as the attention point.
It should be noted that, in an actual application scene, if an object in the scene intersects with a point of interest, the object blocks further extension of the object sight line, and therefore, in the actual application, the route point closest to the starting point of the route of interest may be selected as the point of interest, at this time, when determining the image similarity, the image similarity corresponding to all the route points does not need to be calculated, and the image similarity corresponding to different route points may be calculated from near to far according to the distance between the route point and the starting point of the route of interest, and then the route point corresponding to the first image similarity greater than the set value is the point of interest.
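For illustration, the walk-and-stop search just described might be sketched as follows. The normalized-correlation similarity is a simple stand-in (a CNN plus twin network could be substituted), and the projection function, window half-size and threshold are assumptions.

```python
import numpy as np

# Minimal sketch of manner a with early stopping at the first sufficiently similar route point.

def window_similarity(win1: np.ndarray, win2: np.ndarray) -> float:
    a, b = win1.ravel() - win1.mean(), win2.ravel() - win2.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def find_point_of_interest(route_points, img_top, img_bottom, project, half=8, thresh=0.9):
    """route_points are ordered from the route start outward; project(p, i) returns (row, col)
    of p's projection on image i. Projections are assumed to lie away from the image borders."""
    for p in route_points:
        (r1, c1), (r2, c2) = project(p, 0), project(p, 1)   # same image column on both panoramas
        w1 = img_top[r1 - half:r1 + half, c1 - half:c1 + half]
        w2 = img_bottom[r2 - half:r2 + half, c2 - half:c2 + half]
        if w1.shape == w2.shape and w1.size and window_similarity(w1, w2) > thresh:
            return p                                        # nearest matching route point
    return None
```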
Comparing modes b, c and d with mode a, it can be seen that modes b, c and d correct the image similarity degree obtained in mode a by means of the position similarity degree and/or the semantic possibility degree: the position similarity degree represents the motion prediction information of the route point, and the semantic possibility degree represents the possibility that the object at which the route point is located is the object in the image window corresponding to that route point.
As an example of determining the matching degree according to the image similarity degree, the position similarity degree, and the semantic likelihood degree in the manner d, fig. 17 shows a schematic diagram of a scheme of determining the attention point based on the manner d, and the scheme still takes two panoramic images and takes the user's gaze point as the attention point for explanation.
As shown in fig. 17, the scheme may include three branches. The first branch takes as input the images of the image windows corresponding to the same sight point (denoted P0) in the two panoramic images (the upper and lower image windows shown in the figure); the two image windows are located on the same image column. Image feature extraction may be performed on the two image windows through a convolutional neuron network, and the two feature images output by the convolutional neuron network are passed through a twin neuron network to obtain their similarity score, which is used as the image similarity degree.
For the second branch, denote the current time as T(i). The positions of P0 at times T(i-d) before T(i) (d ≥ 1), i.e. the historical positions of P0, may first be obtained; the position of P0 at the current time (the predicted position of P0) is predicted by a Long Short-Term Memory (LSTM) network based on these historical positions, and the position similarity score of P0 is obtained from the predicted position and the actual position at the current time. In this way, a position similarity score can be obtained for each sight point, and this score is used as the position similarity degree. The historical positions of a sight point can be determined from the panoramic images at the historical moments, for example from the panoramic image at the moment before the current one. The position similarity degree can be computed with existing techniques, for example based on the distance between the two positions, a smaller distance giving a higher similarity.
For the third branch, any one of the panoramic images can be passed through an object segmentation convolutional neuron network to obtain a semantic segmentation image, and based on it the semantic image information at the projection point of each sight point can be obtained, thereby determining the semantic possibility degree (semantic score) of the sight point, i.e. the possibility that the object at the projection point of the sight point on the semantic segmentation image is the object in the image window determined from the corresponding projection point in the first branch.
According to the image similarity degree, the position similarity degree and the semantic possibility degree obtained by the three branches, a binary prediction of whether the sight point corresponding to the image window in the first branch is the gaze point is obtained through a fusion network. When the object contained in the image window of branch one is an object the subject is attending to (the object corresponding to the projection point of the sight point on the semantic segmentation image), the binary prediction result is yes; otherwise it is no. With this scheme, the sight points whose prediction result is yes may be used as candidate gaze points, and the gaze point may be determined from the candidates, for example by taking the candidate closest to the sight line start point as the gaze point.
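For illustration, the final fusion step might be sketched as a small classifier over the three scores; the MLP size is an assumption, and the network below is untrained, so its output is only meaningful after training on labelled gaze data.

```python
import torch
import torch.nn as nn

# Hedged sketch of the three-branch fusion classifier (yes/no gaze-point prediction).

fusion_net = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 2))

scores = torch.tensor([[0.92, 0.80, 0.75]])         # [image, position, semantic] scores for one sight point
is_gaze_point = fusion_net(scores).argmax(dim=1)    # 1 -> candidate gaze point, 0 -> not
print(is_gaze_point)
```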
Fig. 18 shows, as an example of manner ②, a schematic diagram of a scheme for determining the attention point based on the matching degree of depth values (e.g. the difference between depth values), again illustrated with two panoramic images and three-dimensional sight line information. As shown in fig. 18, on the one hand, semantic segmentation may be performed on one panoramic image by a semantic segmentation network (which may include a convolutional neuron network and a deconvolutional neuron network), segmenting the panoramic image into image regions corresponding to the different object types in the scene and yielding a semantic segmentation image of the scene. On the other hand, image key point extraction may be performed on each of the two panoramic images, the extracted key points on the two images are matched, and the depths of the matched points in the scene are estimated by a stereo triangulation depth estimation method, giving a sparse depth map of the scene. The sparse depth map of the scene and the semantic segmentation image of the scene are then superimposed and passed through a convolutional neuron network, which outputs a dense depth map of the scene. The three-dimensional sight line is projected onto the dense depth map to obtain a projection curve; for each sight point on the sight line, the depth value read from the dense depth map at its projection point is compared with the depth value of the sight point itself calculated from its coordinates (unified into the same coordinate system as described above). Sight points whose two depth values match may be taken as candidate gaze points, and the gaze point may be determined from the candidates, for example by taking the candidate closest to the sight line start point.
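For illustration, the depth-completion step in fig. 18 (sparse depth plus semantic segmentation in, dense depth out) might be sketched as follows; the network architecture and image sizes are placeholders.

```python
import torch
import torch.nn as nn

# Hedged sketch of dense-depth completion from a sparse depth map and a semantic map.

completion_net = nn.Sequential(
    nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1))                 # 2 input channels -> 1 dense depth channel

sparse_depth = torch.zeros(1, 1, 128, 256)          # zeros where no key point was triangulated
sparse_depth[0, 0, 40, 100] = 2.3                   # a few triangulated depths (metres)
semantics = torch.randint(0, 5, (1, 1, 128, 256)).float()   # per-pixel object-class labels

dense_depth = completion_net(torch.cat([sparse_depth, semantics], dim=1))
print(dense_depth.shape)                            # torch.Size([1, 1, 128, 256])
```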
In the embodiment of the application, the depth value of the static object in the depth image of the scene is obtained according to the historical depth value of the static object.
In this embodiment of the application, obtaining the depth image of the scene based on the at least two images may specifically include:
determining a static object and a dynamic object in at least two images;
obtaining historical dense depth information of the static objects, and determining the current dense depth information of the static objects according to the historical dense depth information;
and obtaining a dense depth image of the scene according to the current dense depth information of the static objects and the current dense depth information of the dynamic objects.
Generally, most image regions in an image of the scene correspond to stationary objects in the scene, i.e. static objects, and the depth image of the static objects may only need to be estimated at system initialization, so that repeatedly calculating the static-object depth image at every moment can be avoided, reducing the computational overhead. When the camera moves, the previously obtained depth image of the static objects can be projectively transformed in space according to the camera motion between the initial moment and the current moment, registering the depth information from the initial moment to the current moment to obtain the depth image at the current moment.
Fig. 19 is a schematic diagram illustrating a method for obtaining the current depth values of static objects from their historical depth values. Specifically, as shown in the figure, static objects in the scene may be detected first at system initialization. When the camera is static relative to the scene, the image regions corresponding to static objects in the scene can be detected using a pixel gray-level change threshold between video frames, and the scene depth corresponding to the pixels in those regions is estimated from the stereo image pairs obtained by the system. Camera self-motion estimation may be performed using existing camera ego-motion estimation methods (e.g. SLAM-based and/or inertial-sensor-based methods). If the camera has not moved, the static-object scene depth map obtained at initialization can be used directly as the static-object depth map at the current moment. If the camera has moved, the rotation and translation parameters between the camera's current position and its initial position can be obtained through camera motion estimation, and these parameters are used to apply a rotation-and-translation transformation to the static-object depth obtained at the initial moment, giving the static-object scene depth map in the image at the current moment. The depth image of the scene is then obtained from the current depth information of the static objects and the current depth information of the dynamic objects, so that the determination of the object's attention point can be realized based on the depth image of the scene; in particular, three-dimensional gaze intersection detection can be performed for the user to obtain the user's gaze point.
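For illustration, registering the static-scene depth from the initial moment to the current moment might be sketched as follows; the camera motion (R, t) is assumed to come from ego-motion estimation, and re-projection into the current image and merging with dynamic-object depth are omitted.

```python
import numpy as np

# Minimal sketch: transform static-scene 3-D points reconstructed at initialization by the
# camera motion estimated since then, giving their depth values at the current frame.

def register_static_points(points_t0: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """points_t0: (N, 3) static-scene points in the camera frame at initialization."""
    points_now = points_t0 @ R.T + t                   # rigid transform into the current camera frame
    return np.linalg.norm(points_now, axis=1)          # current depth values of the static points

pts = np.array([[0.0, 0.0, 3.0], [1.0, 0.5, 4.0]])
print(register_static_points(pts, np.eye(3), np.array([0.0, 0.0, -0.5])))   # camera moved 0.5 m forward
```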
Fig. 20 is a flowchart illustrating a method for determining a point of interest according to another embodiment of the present application, where as shown in fig. 20, on the basis of the method shown in fig. 2, the method may further include:
step S130: an image of interest of the object is determined from the point of interest and at least one of the at least two images.
After the attention point of the object is obtained, an image of a region of interest of the object in the scene, that is, an attention image of the object, may be acquired based on the attention point, so that a service more meeting the object requirement may be further provided for the object based on the attention image of the object.
For example, when the attention point is the user's gaze point, the attention image is the user's visual field image; when the attention point is the pointing point of a finger, the attention image may be an image of the range to which the object's finger points. After the attention image is acquired, it can be presented to the user through an electronic device, or the information of the objects in the attention image can be presented through an electronic device to the object or to other parties related to the object, or the items the object may be interested in can be analyzed based on the specific content of the attention image and information about related items can be recommended to the object, and so on.
The attention image is determined according to the attention point; it can be acquired by controlling an image acquisition device based on the attention point, or it can be determined based on the attention point and at least one of the at least two acquired images.
In an embodiment of the present application, determining the image of interest of the object according to the point of interest and at least one of the at least two images may include:
determining an observation image window of the object according to the projection point of the attention point in at least one image of the at least two images;
and determining the attention image according to the corresponding relation between the observation image window and the attention window.
For an acquired image of a scene, a focus range focused by an object may correspond to an image window corresponding to a projection point of a focus point on the image, and therefore, an observation image window of the object may be determined according to the projection point of the focus point on the image, and a focused image may be determined based on a correspondence between the observation image window and the focus window. Wherein, determining the observation image window of the object may include:
determining a window of interest of the object;
and determining an image window corresponding to the attention window determined according to the projection point of the attention point on at least one image of the at least two images as an observation image window.
The attention window may be determined according to the actual attention range of the object, and the size of the attention window may be configured empirically. For example, when the attention point is the user's gaze point, the attention window may be the user's visual field window, and the user's visual field range, i.e. the size of the visual field window, may adopt an average human visual field value; generally, the comfortable monocular visual field of the normal human eye is about 60 degrees, so the vertical and horizontal angles of the user's visual field range may both be taken as the average human-eye visual field of 60 degrees. When the attention point is the pointing point of a finger, the size of the attention window may be configured as needed.
Specifically, a rectangular window in space may be determined based on the start point of the attention route, the attention route and the attention range, and this rectangular window may be used as the attention window. In practical applications, optionally, the attention route starting from its start point may be perpendicular to the attention window and pass through the center of the attention window; the width of the window and the start point of the attention route determine the horizontal viewing angle of the object's attention range (for a visual field window, the horizontal viewing angle may be 60 degrees), and the height of the window and the start point of the attention route determine the vertical viewing angle of the attention range. The size of the rectangle is related to its distance from the start point: the larger the distance, the larger the area of the rectangle. The distance between the rectangle and the start point can be set arbitrarily, for example to d meters (d ≥ 0), with d chosen according to requirements. In practical applications, considering the matching between the resolution of the attention image and the camera image, d may be set to the focal length of the camera. The attention point is projected onto the image, and the observation image window is determined around the projection point (optionally centered on it) according to the image-region size obtained by converting the size of the attention window.
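For illustration, converting the attention window's angular size into an observation image window around the projection point might be sketched as follows; the pixels-per-degree factor is an assumption that would in practice follow from the panoramic camera's intrinsic parameters.

```python
# Minimal sketch of placing the observation image window around the projection point.

def observation_window(proj_point, fov_deg=60.0, px_per_deg=10.0):
    """Return (row_min, row_max, col_min, col_max) of the observation image window."""
    r, c = proj_point
    half = int(round(0.5 * fov_deg * px_per_deg))   # half-size of the window in pixels
    return (r - half, r + half, c - half, c + half)

print(observation_window((512, 1024)))              # e.g. (212, 812, 724, 1324)
```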
It will be appreciated that a rectangle is only one alternative form of the focus window for the object, but is not exclusive and a circle or other form of focus window may be chosen as desired.
In this embodiment of the application, determining the attention image according to the correspondence between the observation image window and the attention window may include:
determining a conversion matrix from the observation image window to the attention window according to the position information of the attention window in the reference coordinate system and the position information of the observation image window in the reference coordinate system;
and converting the observation image window according to the conversion matrix, and determining the window image corresponding to the converted observation image window in the corresponding one of the at least two images as the image of interest.
Since the optical center of the device acquiring the image of the scene and the starting point of the attention route are in most cases not coincident but separated by a certain distance, the image in the observation image window needs to be transformed in order to obtain an image closer to what the object actually attends to. Specifically, the relationship matrix from the observation image window to the attention window can be obtained from the intrinsic parameters of the panoramic camera and the attention route information. The observation image window is then transformed according to this relationship matrix, and the image corresponding to the transformed observation image window in the panoramic image is taken as the image of interest.
Specifically, when the observation image window is converted according to the relationship matrix, for each pixel coordinate in the attention window, the corresponding pixel coordinate in the observation image window can be obtained through the conversion matrix, and the color value of the pixel at that source coordinate is assigned to the pixel at the corresponding coordinate in the attention window. This ensures, to the greatest extent possible, that the viewing angle of the converted observation image window as seen from the camera's optical center is consistent with the angle of the object's attention range.
Fig. 21 is a schematic diagram illustrating the principle of determining a visual field image of a user from a gaze point, where M is the starting point of the three-dimensional sight line, i.e., the gaze center of the user, and G is the gaze point. A schematic diagram of a method of determining a visual field image from the gaze point and a panoramic image is shown in fig. 22. As shown in figs. 21 and 22, a rectangle in three-dimensional space, i.e., a visual field window (the user visual field screen shown in the figure), is first determined based on the point M, the three-dimensional sight line information, and the visual field range; the three-dimensional sight line from M is perpendicular to the visual field window and passes through its center. The gaze point G is projected onto the panoramic image to obtain the projection point R; to show this process more intuitively, fig. 21 illustrates the projection of G onto the spherical panoramic image. The observation image window is obtained around the projection point R by converting the size of the visual field window, and it is then transformed according to the homography relation matrix H from the observation image window to the visual field window. Specifically, the three-dimensional coordinates of the four vertices of the observation image window in the reference three-dimensional coordinate system are obtained from the intrinsic parameters of the camera, the three-dimensional coordinates of the four vertices of the visual field window in the reference coordinate system are obtained from the three-dimensional sight line information, the matrix H is calculated from these correspondences, the observation image window is transformed according to H, and the image on the panoramic image corresponding to the transformed observation image window is output as the user's visual field image.
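The homography-based transformation described above can be sketched with OpenCV as follows. Here the matrix is computed directly from four 2D corner correspondences rather than from the 3D vertex coordinates used in the embodiment, which is a simplification; cv2.warpPerspective performs the per-pixel inverse lookup described in connection with the conversion matrix.

```python
import cv2
import numpy as np

def extract_view_image(panorama, obs_window_px, view_window_px, out_size):
    """Warp the observation image window of a panorama onto the view window.

    panorama:        equirectangular panorama (H x W x 3)
    obs_window_px:   4x2 pixel coordinates of the observation image window
                     corners in the panorama (around projection point R)
    view_window_px:  4x2 target coordinates of the view-field window corners
                     in the output image, in the same corner order
    out_size:        (width, height) of the output view image
    """
    src = np.float32(obs_window_px)
    dst = np.float32(view_window_px)

    # Homography H mapping the observation window onto the view window.
    H = cv2.getPerspectiveTransform(src, dst)

    # warpPerspective performs the inverse lookup described above: for each
    # output pixel it fetches the colour of the corresponding panorama pixel.
    return cv2.warpPerspective(panorama, H, out_size)
```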
Based on the same principle as the method for determining a point of interest shown in fig. 2, the embodiment of the present application further provides an apparatus for determining a point of interest, as shown in fig. 23, the apparatus 100 for determining a point of interest may include an image acquisition module 110 and a point of interest confirmation module 120.
An image acquisition module 110, configured to acquire at least two images of a scene;
a focus point confirmation module 120, configured to determine a focus point of an object in the image in the scene based on the at least two images.
Compared with the prior art, the apparatus for determining a point of interest can detect the point of interest of an object in an actual scene based on at least two images of the scene where the object is located. The point of interest of the object in the real scene can be determined without the object wearing any equipment, so the scheme is suitable for practical application scenarios in which the object cannot be expected to wear additional equipment, and provides a more natural interaction mode for the object.
The apparatus for determining a point of interest provided in the embodiments of the present application may implement the embodiments of the method for determining a point of interest provided in the embodiments of the present application; for the specific function implementation of the apparatus, reference may be made to the descriptions in the method embodiments, which are not repeated here.
Based on the same principle as the method and the apparatus for determining a point of interest provided by the embodiments of the present application, the present application also provides an electronic device. As shown in fig. 24, the electronic device 200 may include an image acquisition module 210, a memory 220, and a processor 230.
An image acquisition module 210 for acquiring at least two images of a scene;
a memory 220 for storing machine-readable instructions which, when executed by the processor, cause the processor 230 to determine a point of interest in the scene of an object in the image based on the at least two images acquired by the image acquisition module.
The electronic device 200 provided in the embodiment of the present application may implement the embodiment of the method for determining a point of interest provided in the embodiment of the present application, and may perform the function of the apparatus for determining a point of interest provided in the embodiment of the present application, and for the detailed description of the electronic device 200, reference may be made to the description in the embodiment of the method for determining a point of interest, which is not described herein again.
Based on the same principle as the method and the apparatus for determining a point of interest provided by the embodiments of the present application, the present application also provides a system for determining a point of interest. As shown in fig. 25, the system 300 for determining a point of interest may include an image capturing device 310 and an electronic device 320 connected to the image capturing device 310;
the image acquisition device 310 is configured to acquire at least two images of a scene and send the at least two images to the electronic device;
and the electronic device 320 is used for receiving the at least two images sent by the image acquisition device and determining the attention point of the object in the images in the scene based on the at least two received images.
The system for determining a point of interest provided in the embodiment of the present application may implement the embodiment of the method for determining a point of interest provided in the embodiment of the present application, may perform a function of the apparatus for determining a point of interest provided in the embodiment of the present application, and for a detailed description of the system for determining a point of interest, reference may be made to the description in the embodiment of the method for determining a point of interest, which is not described herein again.
As can be seen from the solutions shown in fig. 24 and fig. 25, in the method for determining a point of interest provided in the embodiments of the present application, the step of acquiring at least two images and the step of determining the point of interest may be performed by a single electronic device that includes an image capturing module, or may be performed jointly by an image capturing device and an electronic device that determines the point of interest based on the images captured by that device. In the latter case, the electronic device may also be a cloud server; that is, the software that determines the point of interest of an object in the scene based on the at least two images may run in the cloud, with the cloud obtaining the at least two images captured by the image capturing device.
The application also provides a behavior information acquisition method. Fig. 26 is a schematic flowchart illustrating an information obtaining method according to an embodiment of the present application, and as shown in fig. 26, the information obtaining method may include:
step S210: acquiring a focus of an object;
step S220: and acquiring the behavior information of the object according to the attention point.
The focus of the object in step S210 may be a focus obtained according to the method for determining a focus shown in any of the above embodiments of the present application. That is to say, the information acquiring method according to the embodiment of the present application may be implemented continuously after the method for determining the point of interest according to the embodiment of the present application obtains the point of interest, and acquire the behavior information of the object according to the point of interest.
According to the information acquisition method, after the attention point of the object is obtained, the behavior information of the object can be further acquired according to the information of the attention point, so that a service which better meets the actual requirement of the object can be provided for the object based on the behavior information of the object, or the intention of the object is analyzed based on the behavior information of the object, or other processing based on the behavior information is performed as required.
In an embodiment of the present application, the behavior information may include at least one of the following a to G:
A. an object of interest to the subject;
B. a duration of time that the object is focused on the object;
C. information of changes over time of an object of interest to the subject;
D. a trajectory of points of interest of the object;
E. an image of interest of the object;
F. interaction information of the object and the device;
G. speech information of the object.
For the information A, the object at which the point of interest is located in space may be determined according to the coordinate information of the point of interest, or it may be determined based on the projection point of the point of interest on an image used to determine the point of interest, in which case the object at which the projection point is located on the image is the object of attention.
For the information B, the attention points of the object at different times may be continuously detected to obtain the attention duration of the object at the attention point, where the attention duration may be the attention duration of the object corresponding to each attention point, may also be the attention duration of the same object, and may also be the attention duration of the same class of objects.
For the information C, the object concerned by the object at different times can be determined according to the points of interest at different times, and then the change information of the object concerned by the object with time can be obtained.
For the information D, the attention point trajectory of the object can be obtained according to the attention points of the object at different times.
As for the information E, the image of interest of the object may be an image acquired by the image acquisition apparatus based on the point of interest, or an image obtained from the point of interest and the scene image based on which the point of interest was determined, for example an image of interest obtained in the manner shown in fig. 22.
For the information F, the interaction information of the object and the device may include voice interaction information between the object and the device, and/or contact information between the object and the device, and/or an instruction of the user received through the device, and the like. For example, voice/touch interaction may be performed with an object through a voice interaction device or a touch device, so as to obtain voice information of the object or touch action information of the device in a time period corresponding to a time when the attention point is acquired or a time when the attention point is determined.
For the information G, the voice information of the object may be the voice information of the object acquired by the voice acquisition device/human-computer interaction device, and the voice information of the object in the corresponding time period may be acquired based on the determined time of the attention point.
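As a hedged illustration of how information B, C, and D above might be aggregated from a stream of timestamped points of interest, the following Python sketch accumulates per-object attention durations and the change-over-time segments; the sampling format and object identifiers are assumptions.

```python
from collections import defaultdict

def dwell_times(samples):
    """Aggregate attention duration per object (information B) from a
    time-ordered list of (timestamp_seconds, object_id) samples, and return
    the change-over-time sequence (information C) as (object_id, start, end)
    segments; the sequence of segments also gives the trajectory of objects
    attended to over time (information D)."""
    durations = defaultdict(float)
    segments = []
    for (t0, obj0), (t1, obj1) in zip(samples, samples[1:]):
        durations[obj0] += t1 - t0          # attribute the interval to obj0
        if not segments or segments[-1][0] != obj0:
            segments.append([obj0, t0, t1])
        else:
            segments[-1][2] = t1
    return dict(durations), [tuple(s) for s in segments]

# Example: the subject looks at shelf item "A", then "B", then back to "A".
samples = [(0.0, "A"), (0.5, "A"), (1.0, "B"), (1.6, "B"), (2.0, "A")]
print(dwell_times(samples))
```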
Fig. 27 is a schematic flow chart of a behavior information obtaining method provided in another embodiment of the present application, and as can be seen from fig. 27, on the basis of the information obtaining method shown in fig. 26, the information obtaining method may further include:
step S230: and processing according to the behavior information of the object.
In the embodiment of the present application, the processing may be at least one of the following a to f:
a. saving an image or video of an object of interest to the subject;
b. providing a service or information related to an object of interest to the subject;
c. providing information or services related to an image of interest of the object;
d. controlling the state of the object of interest to the subject, or the state of the subject itself;
e. providing prompt information;
f. attention-related information is provided.
For a, when an object that the subject is paying attention to has been identified, an image or video containing that object may be saved for the subject or related personnel to use. For example, at a tourist attraction, a related electronic device may be mounted on a vehicle or ride; by determining the visitor's point of interest, the device can identify which object in the attraction the visitor is looking at and store the related image or video, thereby automatically saving images or videos of the sights the visitor pays attention to. Alternatively, the attraction's management may deploy such a device at a popular spot and, after identifying the objects visitors pay attention to, store the related images or videos so that they can be provided to users or managers, which also helps managers run the attraction based on the stored material.
For b, providing services or information related to the object of interest may include:
providing related information, obtained according to an instruction of the subject, about the object the subject is paying attention to; and/or providing services or information determined jointly from the object of interest and the subject's speech information. For example, when the subject's instruction is a voice search instruction, related information about the object of interest may be retrieved according to that instruction, and the search instruction can then be answered based on the retrieved information.
In practical applications, the related information about the object of interest (such as information retrieved on the Internet, information found in a pre-configured database, the image of interest, or an image or video of the attended object) may be displayed to the subject or to related personnel through a designated device. The subject's intention may also be determined based on at least one of the information A to F, and corresponding services or information may be provided based on that intention. The designated device may be the device executing the information acquisition method itself, or another device communicating with it.
For c, when the image of interest of the subject is acquired, relevant services or information can be provided to the subject or to related parties based on the image or the objects in it. For example, the image of interest may be shown to the subject; or all objects in the image may be extracted by image analysis and relevant information for each object provided to the subject; or related parties may analyze the subject's intention based on the image of interest and take corresponding actions.
For d, when the object of interest has been identified and is, for example, an electronic device, its related parameters or control information may be sent to a device associated with the subject or displayed directly on the attended device, and the subject may control the states, parameters, and so on of that device based on the displayed parameters or control information.
For e, the hint information may include at least one of the following:
prompt information for abnormal behavior of the object;
prompt information for a potentially threatening object;
hints for the behavior of the object.
For example, when the change-over-time information of the objects the subject attends to is acquired, it may be compared with standard change information, and correction information may be generated based on the comparison result, either to remind the subject about their behavior, or to guide the subject so that the order in which objects are attended to is corrected to follow the standard change information.
As another example, a potentially dangerous object that the subject has not noticed can be determined from the subject's image of interest and the image of the scene, and corresponding reminder information can be generated in time. The behavior pattern of the subject can also be predicted from the image of interest, and when the pattern is abnormal, prompt information about the abnormal behavior can be generated to alert the subject or other related parties. Further, the predicted behavior of the subject can be obtained from the subject's position and the trajectory of the point of interest, prompt information can be generated accordingly, and the subject or other related parties in the scene can be alerted based on it; the prompt information may also be sent to a designated device so that the device can adjust its state according to the predicted behavior.
For f, the objects that the subject pays attention to at different times may be obtained from the points of interest at those times, and the subject's attention information may be analyzed or statistically summarized from them. The attention-related information may be provided to the subject or to other related parties so that they can adjust their state or the scene accordingly, or the electronic device may itself generate scene-adjustment information from the attention-related information according to a configured policy.
Based on the same principle as the information acquisition method shown in fig. 26, the embodiment of the present application also provides a behavior information acquisition apparatus. As shown in fig. 28, the information acquisition apparatus 400 may include a point of interest acquisition module 410 and a behavior information acquisition module 420.
A focus point acquiring module 410, configured to acquire a focus point of an object;
the behavior information obtaining module 420 is configured to obtain behavior information of the object according to the attention point.
The information acquisition device of the embodiment of the application can further acquire the behavior information of the object according to the attention point after acquiring the attention point of the object, so that a service which better meets the actual requirement of the object can be provided for the object based on the behavior information of the object, or the intention of the object is analyzed based on the behavior information of the object, or other processing based on the behavior information is performed as required, and the like, thereby better meeting the actual application requirement.
The information acquisition apparatus provided in the embodiments of the present application may implement the embodiments of the behavior information acquisition method provided in the embodiments of the present application; for its specific function implementation, reference may be made to the descriptions in the information acquisition method embodiments, which are not repeated here.
In order to better understand the scheme for determining the focus and the information acquisition scheme provided by the present application, the following describes in detail the information acquisition method provided in the embodiments of the present application with a user as an object and a gaze point of the user as a focus, in combination with a specific practical application scenario.
Scene 1: shopping scenario 1
In an actual shopping scenario, the electronic device for determining the point of interest can be installed at different positions according to actual needs. For example, it can be installed near a merchandise shelf, as shown in fig. 29a, where it can simultaneously capture the merchandise on the shelf and the shoppers near it; or on a mobile platform, as shown in fig. 29b, which amounts to an intelligent agent or robot that can capture the goods in the shopping venue and nearby shoppers at the same time; or on a mobile shopping cart, as shown in fig. 29c, capturing customers and nearby goods; or at the front desk, as shown in fig. 29d, where it can see guests and the objects placed nearby.
After the gaze point is determined by the electronic device, the commodity the user is gazing at, i.e., the object of interest, can be identified from the gaze point, and related services can then be provided to the user based on that commodity. Specifically, related information about the commodity can be displayed to the user through an electronic device (which may be the device that determines the gaze point or another device); the related information may be detailed information about the commodity, or information about commodities or services that the user may be interested in, predicted from the attended commodity.
Scene 2: shopping or reception scenarios
The user's gaze point can be continuously detected by the electronic device that determines the gaze point, and the commodities the user pays attention to at different moments can be derived from the gaze points at those moments. From this, the duration for which the user gazes at the same object or the same class of objects can be obtained, the user's intention can be determined based on that duration, and more appropriate services or information can be provided based on the intention.
FIG. 30 is a schematic illustration of a method of providing customer-centric services in a shopping or hospitality scenario. As shown in the figure, the user's gaze points at different moments can be obtained by tracking the user's field of view, and the objects the user gazes at within the field of view can be continuously detected; the objects gazed at at moments such as T(i-2), T(i-1), T(i) can be obtained from the corresponding gaze points, and the user's intention type can be analyzed by recording the time the user spends gazing at the same item or the same object type in the field of view. For example, when the user gazes at an item A for longer than a time threshold, the user's intention may be classified as "interested in item A". As another example, when the user's gaze on any item stays below the threshold gazing time and the gaze point keeps moving through the scene, the user's intention may be classified as "looking for something".
After the user's intention at the corresponding moment is classified based on the gazing duration, services better matching the user's needs can be provided based on the classified intention. For example, when the intention is "interested in item A", related information about item A may be presented to the user; when the intention is "looking for something", information about items other than those the user has already focused on may be presented.
In addition, in practical applications, a service database may be configured in which the most suitable service scheme for each user intention is recorded. For example, for the user intention "interested in item A", the corresponding service can be configured as "recommend information about item A"; the corresponding information can be played to the user through on-site multimedia devices, or sent through a communication device to the user's personal multimedia devices (such as a mobile phone or smart glasses).
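A minimal sketch of the dwell-time intent classification and service lookup described above; the threshold value, item identifiers, and the contents of the hypothetical SERVICE_TABLE are illustrative assumptions rather than values from the embodiment.

```python
def classify_intent(gaze_log, dwell_threshold=2.0):
    """Classify a coarse user intent from per-object dwell times (seconds).

    gaze_log: dict mapping item id -> accumulated gaze duration
    Returns ("interested", item) if any item exceeds the dwell threshold,
    otherwise ("looking_for", None) when the gaze keeps moving.
    """
    if gaze_log:
        item, dwell = max(gaze_log.items(), key=lambda kv: kv[1])
        if dwell >= dwell_threshold:
            return ("interested", item)
    return ("looking_for", None)

# A hypothetical service table keyed by intent, standing in for the database.
SERVICE_TABLE = {
    "interested":  lambda item: f"recommend detailed information for {item}",
    "looking_for": lambda item: "present other items near the shelf",
}

intent, item = classify_intent({"item_A": 3.2, "item_B": 0.4})
print(SERVICE_TABLE[intent](item))   # -> recommend detailed information for item_A
```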
In addition, in a shopping or reception scenario, a voice interaction device may also be provided through which voice interaction can take place with users in the scene. When a gaze point is determined, the dialogue between the user and the interaction device during the time period corresponding to the moment the gaze point was determined can also be obtained, so that the gaze point and the dialogue information can be combined to infer the user's intention more accurately and provide relevant services or information. As shown in fig. 30, the user's voice information may be obtained through a user dialogue engine.
Scene 3: reception scene
In a reception scenario, the movement of the user's line of sight, i.e., how the object the user attends to changes over time, can be obtained from the gaze points at different moments; the user's focus of attention or needs can be judged accordingly, and related services or information can be provided. User behavior can also be predicted based on these changes, so that related services or information can be provided in time.
Scene 4: internet of things intelligent home scene
In the smart home scenario, home appliances in the home may communicate with each other in a networked manner and may be connected to a smart agent. The electronic equipment for determining the focus point can be installed on a mobile platform, moves in a home environment, captures images of a user, determines the focus point of the user based on the captured images, and analyzes smart home articles observed by the sight line of the user based on the focus point, so that corresponding services or information are provided.
In the smart home scenario shown in fig. 31, when it is detected that a user is gazing at a home appliance (e.g., a smart refrigerator), an intelligent agent (which may be an intelligent robot or other intelligent device) capable of determining the gaze point and acquiring user behavior information can connect to the refrigerator, retrieve information such as the food stored in it, and present the retrieved information to the user through a smart speaker or a display in the room; if the intelligent agent itself has a display, it can show the information directly.
Fig. 32 is a schematic diagram of another smart home scene, and fig. 33 illustrates how a control interface of an Internet-of-things device is displayed according to the user's line of sight in that scene. As shown in figs. 32 and 33, the intelligent agent may track the user's line of sight, obtain the user's visual field image based on the gaze point once it is determined, and detect whether the visual field contains one of a group of registered Internet-of-things devices. When the object the user is gazing at matches a registered device, such as the Internet-of-things air conditioner or television shown in the figures, the intelligent agent can communicate with that device over the Internet of things, read its state and control parameters, and display them on the user's mobile multimedia device, allowing the user to check or control the device. Through the control interface of the air conditioner displayed on the mobile device, the user can, for example, switch the air conditioner on or adjust the control parameters of a running air conditioner according to the displayed state and parameters.
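The matching of the gazed-at object against the group of registered Internet-of-things devices might be sketched as follows; the registry entries, device addresses, and the read_state / show_panel interfaces are hypothetical stand-ins for whatever IoT and UI layers a deployment actually uses.

```python
# Registered IoT devices: label expected in the view image -> device address.
REGISTRY = {
    "air_conditioner": "iot://living-room/ac",
    "television":      "iot://living-room/tv",
}

def match_registered_device(detected_labels, registry=REGISTRY):
    """Return the first registered IoT device whose label appears among the
    object labels detected in the user's view image, or None."""
    for label in detected_labels:
        if label in registry:
            return label, registry[label]
    return None

def push_control_panel(device_address, mobile_ui, read_state):
    """Read the device state over the IoT link and show a control panel on
    the user's mobile device; read_state and mobile_ui are assumed interfaces."""
    state = read_state(device_address)           # e.g. {"power": "on", "temp": 26}
    mobile_ui.show_panel(device_address, state)  # user can now view / adjust it
```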
Scene 5: play scene
In the play scenario shown in fig. 34, the electronic device for determining the point of interest and acquiring behavior information may be mounted on a ride vehicle. While the user is playing, the device can capture video of the user and the objects in the scene, detect the user's gaze point at different moments based on each captured frame, and thereby track the user's field of view. Based on the gaze point, the visual field image in each frame, i.e., a visual field video, can be saved, so that the sights the user paid attention to during the ride are automatically captured and stored, recording the user's travel memories.
Scene 6: driving assistance scenarios
In the driving assistance scenario shown in fig. 35, the electronic device for determining the attention point and acquiring the behavior information may be installed on the vehicle, observe pedestrians around the vehicle and objects in the environment, and determine whether the pedestrians around the vehicle may enter the driving route, thereby providing an early warning for the driver or providing a control basis for automated driving.
Fig. 36 is a schematic diagram illustrating a method for automatically determining the travel intention of a pedestrian around a vehicle in a driving assistance scene according to an embodiment of the present disclosure. As shown in fig. 35 and 36, the electronic device mounted on the vehicle can detect pedestrians in the image by acquiring the scene image, and continuously detect the gaze point of each pedestrian, so as to track the field of view of each pedestrian, thereby identifying the traveling direction of the pedestrian and providing help for safe driving of the vehicle.
Specifically, a visual field image and a gaze-point trajectory are obtained for each pedestrian based on visual field tracking, and the pedestrian's intention is analyzed and their behavior predicted from the visual field content and the trajectory. For example, when a pedestrian approaches an intersection and their gaze trajectory shows that they are scanning the traffic at the intersection, it can be inferred that the pedestrian intends to cross. Pedestrian behavior can then be predicted from this intention and a pedestrian behavior model, for example that the pedestrian is likely to cross the road. These predictions can be passed to an automated driving decision unit to adjust the vehicle state for safe driving. When the predicted pedestrian route conflicts with the predicted vehicle route, the situation can be judged dangerous and an early warning can be given to the driver; in this way the vehicle can automatically judge the travel intentions of surrounding pedestrians.
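A possible heuristic form of the crossing-intention rule in the example above is sketched below; the sweep threshold and the assumption that gaze directions are available as 2D unit vectors are illustrative, and angle wrap-around is ignored for brevity.

```python
import numpy as np

def pedestrian_intends_to_cross(gaze_directions, near_intersection,
                                sweep_threshold_deg=60.0):
    """Heuristic: a pedestrian standing near an intersection whose gaze has
    swept across a wide horizontal angle is assumed to be checking traffic
    and is classified as intending to cross.

    gaze_directions: list of 2D unit vectors (horizontal gaze direction)
                     sampled over the last few seconds
    """
    if not near_intersection or len(gaze_directions) < 2:
        return False
    angles = [np.degrees(np.arctan2(d[1], d[0])) for d in gaze_directions]
    sweep = max(angles) - min(angles)            # total horizontal sweep
    return sweep >= sweep_threshold_deg
```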
In addition, the scheme can also be used for autonomous navigation of the robot, and help is provided for avoiding collision between the robot and pedestrians in the traveling process.
Scene 7: teaching operational scenarios
In a kitchen, factory, laboratory, etc. scenario, a teaching intelligence agent may assist a user in completing a particular task consisting of a series of operations by observing the user and providing suggestions.
In the application scenario shown in fig. 37, a sequence criterion for the operator's attention shifts (standard change information), i.e., which object (such as object A, B, or C in the drawing) the operator's point of interest should be on at which point in time, may be generated from the standard operation sequence of a specific task.
Fig. 38 is a schematic diagram showing a method of providing operation suggestions to the user in the teaching scenario shown in fig. 37. As shown in figs. 37 and 38, the intelligent agent may be installed at a fixed or mobile location in the operating environment, so that it can simultaneously observe the operator and the items the operator needs to complete the task. Specifically, by acquiring images of the scene, the intelligent agent can continuously detect the user's gaze point and track the field of view; by continuously analyzing the scene images, it can track how the type of object the operator gazes at changes over time, obtaining the change information of the attended object types over the operation time. This observed change information is compared with the standard change information to judge whether the operation meets the standard. When, during a specific operation, the operator's point of interest is not on the object that should be watched, correction information is generated based on the comparison result, and operation suggestions are presented to the operator in a multimedia form so that the operator can correct their behavior accordingly.
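The comparison of the operator's observed attention sequence against the standard change information might look like the following sketch, where the step format and tolerance are assumed for illustration.

```python
def check_operation(observed, standard, tolerance=2.0):
    """Compare the operator's observed attention sequence with the standard
    one and return correction prompts.

    observed / standard: lists of (time_seconds, object_id) describing which
    object the attention point was (or should be) on at each step.
    """
    prompts = []
    for (t_std, obj_std), (t_obs, obj_obs) in zip(standard, observed):
        if obj_obs != obj_std:
            prompts.append(f"at step t={t_std}s attention should be on "
                           f"{obj_std}, but was on {obj_obs}")
        elif abs(t_obs - t_std) > tolerance:
            prompts.append(f"attention reached {obj_std} {t_obs - t_std:+.1f}s "
                           f"off the standard time")
    return prompts

standard = [(0, "A"), (10, "B"), (20, "C")]
observed = [(0, "A"), (11, "C"), (21, "C")]
print(check_operation(observed, standard))
```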
Scene 8: driver interaction scenario
In the scenario shown in fig. 39, the user's hands are occupied by the driving device (e.g., steering wheel, handlebar) while driving the motor vehicle or bicycle. Based on the scheme of the embodiment of the application, the visual field of the user can be extracted, and a natural user interaction mode without user gestures can be provided.
Fig. 40 is a schematic diagram illustrating a method for providing input to a human-computer interaction system based on the user's visual field image in the scenario shown in fig. 39. As shown in figs. 39 and 40, an electronic device for determining the point of interest and acquiring behavior information may be mounted in the vehicle, where it can simultaneously observe the user and the scene outside the vehicle. The user can interact with the on-board question-and-answer system by voice; for example, when the user notices an object outside the vehicle, they can issue an information retrieval request by voice (such as "what is that?"). The image acquisition module of the electronic device (such as a panoramic camera system) captures the user's visual field image at the moment the request is spoken; the object can then be detected from the visual field image, searched for on the Internet, and suitable matching information can be presented to the user as images or speech.
It can be understood that, in practical applications, the electronic device may be integrated with the vehicle's on-board computer; that is, the functional modules for determining the gaze point and acquiring behavior information may be integrated into the on-board device, which then carries out the method shown in fig. 40. For example, after capturing the user's visual field image at the moment the retrieval request is spoken, the on-board computer can detect the object in the visual field image, search the Internet for it, and present suitable matching information to the user as images or speech. In addition, image acquisition, determination of the point of interest based on the images, and acquisition of behavior information may be completed by one device or distributed over several devices.
In the scenario shown in fig. 41, prompt information about potentially threatening objects can be provided to a riding user based on the solution of the embodiments of the present application. Fig. 42 shows a schematic illustration of a method of detecting potentially threatening objects in the surrounding traffic environment and alerting the user in the scenario of fig. 41. As shown in figs. 41 and 42, the electronic device mounted on the vehicle can determine the user's gaze point by acquiring a panoramic image of the traffic environment, obtain the user's visual field image, and track the user's field of view. By analyzing the visual field image, it can be determined which objects in the scene the user has seen; by analyzing the panoramic image, it can be determined which objects are present in the scene and which of them are potentially threatening. By comparing the objects in the scene with the objects in the user's visual field image, if the user has not looked at a potentially threatening object in the surrounding traffic (such as a vehicle approaching from behind), the user early-warning system (which may be the electronic device itself or another device connected to it) can give the user a safety prompt in a multimedia form.
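A minimal sketch of the comparison between scene objects and the objects in the user's visual field image; the object representation and the set of threat classes are assumptions.

```python
def unattended_threats(scene_objects, viewed_objects,
                       threat_classes=frozenset({"vehicle"})):
    """Return potentially threatening objects present in the panoramic scene
    that do not appear in the user's view image.

    scene_objects / viewed_objects: sets of (object_id, class_name) detected
    in the panorama and in the user's view image, respectively.
    """
    unseen = scene_objects - viewed_objects
    return [obj for obj in unseen if obj[1] in threat_classes]

scene  = {("car_12", "vehicle"), ("tree_3", "static"), ("bike_7", "vehicle")}
viewed = {("bike_7", "vehicle")}
for obj_id, cls in unattended_threats(scene, viewed):
    print(f"warning: {cls} {obj_id} approaching outside the rider's view")
```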
Scene 9: security monitoring scenario
In a security monitoring scenario, such as the security monitoring scenario of a public place shown in fig. 43, based on the scheme of the embodiment of the present application, a matched service may be provided for a user in the scenario.
Fig. 44 shows a schematic diagram of a method of simultaneously tracking the gaze of multiple people in a scene such as that shown in fig. 43. As shown in figs. 43 and 44, the electronic device for determining points of interest and acquiring behavior information may be installed on a fixed or mobile platform in the scene and track the gaze points of multiple people; a visual field image can be obtained for each person based on their gaze points, realizing pedestrian visual field tracking. By analyzing each person's visual field image, the inferred pedestrian intentions can be matched against a configured pedestrian intention database, and the behavior pattern of each monitored pedestrian can be predicted. If an abnormal behavior pattern appears, corresponding prompt information can be generated so that appropriate action can be taken or matching services provided. For example, if a user's gaze point keeps changing, the inferred intention is searching; according to the database configuration, a user with a searching intention may need help and should receive special handling, so corresponding prompt information can be generated to offer help to that user in time.
Scene 10: meeting or classroom scene
In a multi-user scenario such as the meeting room or classroom shown in fig. 45, an electronic device for determining points of interest and acquiring behavior information may be installed at a fixed or mobile location. The device can simultaneously observe panoramic images of multiple users in the venue and analyze their attention (points of interest). By statistically analyzing the attention of participants or students, the result of the analysis can be provided to the speaker or teacher so that they can adjust the meeting or lecture accordingly; alternatively, adjustment suggestions for the scene can be generated from the attention analysis and provided to the speaker or teacher as a reference.
Fig. 46 is a schematic diagram illustrating a method for analyzing user attention in the scene shown in fig. 45. As shown in figs. 45 and 46, by tracking and detecting the gaze points of multiple users (such as students) in the scene, it can be determined whether each user's attention is in the right place, yielding statistics on attention deviation and an overall statistical analysis of attention. Based on the attention-deviation statistics, the meeting host or teacher can be reminded to adjust the meeting or lecture accordingly.
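The overall attention statistic might be computed along the following lines; the expected target label and user identifiers are illustrative.

```python
def attention_statistics(gaze_targets, expected_target="blackboard"):
    """Aggregate per-user gaze targets into an overall attention statistic.

    gaze_targets: dict mapping user id -> target currently gazed at
    Returns the fraction of users attending to the expected target and the
    list of users whose attention has drifted elsewhere.
    """
    drifted = [u for u, tgt in gaze_targets.items() if tgt != expected_target]
    ratio = 1.0 - len(drifted) / max(len(gaze_targets), 1)
    return ratio, drifted

ratio, drifted = attention_statistics(
    {"s1": "blackboard", "s2": "phone", "s3": "blackboard"})
print(f"{ratio:.0%} attentive; drifted: {drifted}")   # 67% attentive; drifted: ['s2']
```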
The present application further provides an electronic device comprising a memory and a processor;
a memory for storing machine-readable instructions which, when executed by the processor, cause the processor to perform the method of determining a point of interest shown in any embodiment of the present application and/or the information acquisition method shown in any embodiment of the present application.
Embodiments of the present application further provide a computer-readable storage medium for storing computer instructions, which when executed on a computer, enable the computer to perform the method for determining a point of interest shown in any embodiment of the present application, and/or the information acquisition method shown in any embodiment of the present application.
Fig. 47 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application. The electronic device 2000 may include a processor 2001 and a memory 2003. The processor 2001 is coupled to the memory 2003, for example via a bus 2002. Optionally, the electronic device 2000 may also include a transceiver 2004. In practical applications, the number of processors 2001, transceivers 2004, and memories 2003 is not limited to one, and the structure of the electronic device 2000 does not limit the embodiments of the present application.
The processor 2001 and the memory 2003 may be applied to the embodiments of the present application, and are used to implement the functions of the apparatus for determining the attention point and/or the apparatus for acquiring information in the embodiments of the present application. The transceiver 2004 may include a receiver and/or a transmitter for receiving and/or transmitting information to enable data interaction between the electronic device 2000 and other devices.
The processor 2001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 2001 may also be a combination of computing functions, for example a combination of one or more microprocessors, or of a DSP and a microprocessor.
Bus 2002 may include a path that conveys information between the aforementioned components. The bus 2002 may be a PCI bus or an EISA bus, etc. The bus 2002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 47, but this does not mean only one bus or one type of bus.
The memory 2003 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
Optionally, the memory 2003 is used for storing application program codes for executing the schemes in the embodiments of the present application, and the execution of the application program codes is controlled by the processor 2001. The processor 2001, when executing the application program code stored in the memory 2003, may implement the method of determining a point of interest and/or the behavior information acquisition method provided in any of the embodiments of the present application.
The method for determining a point of interest and/or the information acquisition method of the present application are further described below with reference to three specific examples, which take the user as the object, the user's gaze point as the point of interest, and the three-dimensional sight line information as the attention route information, in combination with concrete application scenarios.
Example 1
In this example, two panoramic cameras connected up and down as shown in fig. 3 are used to acquire two panoramic images, and the gaze point of the user is obtained based on the two panoramic images. A flow diagram of this example is shown in fig. 48, and as shown in the figure, the method of this example may mainly include the following steps:
step S1.1: panoramic camera calibration and panoramic stereo video pair correction
As can be seen from the foregoing description, when panoramic cameras are used to capture images or videos, they need to be calibrated; the calibration includes the calibration of each panoramic camera itself and the calibration between the two cameras. After calibration, the upper and lower panoramic cameras can each shoot a panoramic video of the scene where the user is located, and the two videos form a panoramic stereo video pair. From the corresponding frames of the two videos, two panoramic images of the scene at each capture moment can be obtained, so the user's point of interest can be determined continuously at different moments and the user's line of sight can be tracked.
The obtained panoramic stereo video pair (two spherical panoramic videos) also needs latitude-longitude rectification: the spherical panoramic images of corresponding frames in the two rectified videos are expanded into latitude-longitude form to obtain two planar panoramic images, in which the projection points of the same object point in the scene space lie on the same image column, as shown in fig. 5 above.
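The same-column property of the rectified latitude-longitude pair can be illustrated with the following sketch of an equirectangular projection for two vertically stacked panorama centres; the image resolution and camera positions are arbitrary example values.

```python
import numpy as np

def project_equirect(point, cam_center, width, height):
    """Project a 3D point into a latitude-longitude (equirectangular) image
    of a spherical panorama centred at cam_center."""
    v = point - cam_center
    lon = np.arctan2(v[1], v[0])                       # longitude in (-pi, pi]
    lat = np.arcsin(v[2] / np.linalg.norm(v))          # latitude in [-pi/2, pi/2]
    col = (lon + np.pi) / (2 * np.pi) * width
    row = (np.pi / 2 - lat) / np.pi * height
    return col, row

# Two cameras stacked along the vertical (z) axis: after rectification the
# same scene point falls on the same image column in both panoramas, and the
# row difference encodes the vertical disparity used for triangulation.
P = np.array([2.0, 1.5, 0.3])
col_top, row_top = project_equirect(P, np.array([0, 0, 0.2]), 2048, 1024)
col_bot, row_bot = project_equirect(P, np.array([0, 0, -0.2]), 2048, 1024)
assert abs(col_top - col_bot) < 1e-6
```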
Step S1.2: user image feature extraction (human body/face/key point)
By performing user feature extraction on the two planar panoramic images obtained in step S1.1, a part image of the user can be obtained, and in this example, the part image may include at least one of two body images, two head images, two face keypoint images, and two eye image groups.
Step S1.3: three-dimensional gaze extraction algorithm and model
Based on the user' S position image obtained in step S1.2, three-dimensional gaze information of the user, including start point information and direction information of the three-dimensional gaze, can be determined. The three-dimensional sight line extraction algorithm model can be realized by adopting a three-dimensional sight line information estimation model obtained based on deep learning training. The model may specifically implement at least one scheme including, but not limited to, the scheme in fig. 11, fig. 12, and fig. 13 described above.
Step S1.4: sight line three-dimensional intersection detection algorithm and model
This step detects, in three-dimensional space, the intersection of the user's three-dimensional sight line with objects in the scene, and determines the user's gaze point. The sight-line three-dimensional intersection detection algorithm and model of this step may implement at least one of the schemes including, but not limited to, those in figs. 12, 13, and 14 above.
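One simple way to realize the intersection test, assuming the system maintains some queryable depth or 3D model of the scene, is to march along the sight line until the travelled distance matches the scene depth; the query_depth callable, step size, and tolerance below are assumptions for illustration.

```python
import numpy as np

def intersect_gaze_with_scene(start, direction, query_depth,
                              max_range=10.0, step=0.05, tol=0.05):
    """March along the 3D sight line and return the first sample point whose
    distance from the scene origin matches the scene depth in that direction.

    query_depth: callable taking a 3D point and returning the distance of the
                 nearest scene surface along the ray from the origin through
                 that point; it stands in for whatever depth map or 3D model
                 the system maintains.
    """
    direction = direction / np.linalg.norm(direction)
    for t in np.arange(step, max_range, step):
        p = start + t * direction
        if abs(np.linalg.norm(p) - query_depth(p)) < tol:
            return p                      # gaze ray meets a scene surface here
    return None                           # no intersection within range
```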
Step S1.5: user visual field extraction algorithm
The step may obtain the view field image of the user based on the gaze point of the user and the panoramic image, and the specific implementation of the step may adopt the scheme of obtaining the attention image of the object based on the attention point described in the above embodiment of the present application, where the attention point in this example is the gaze point of the user and the attention image is the view field image of the user.
Step S1.6: service method based on visual field analysis
In this step, based on the user's visual field image obtained in step S1.5, corresponding services can be provided by using the visual field image in combination with different application scenarios. The application scenarios may include, but are not limited to, those shown in figs. 29 (29a, 29b, 29c, 29d) to 46.
Example two
In this example, the following description will be given by taking an example in which one panoramic camera is used to acquire two panoramic images and the gaze point of the user is obtained based on the two panoramic images, as shown in fig. 6. A flow diagram of this example is shown in fig. 49, and as shown in the figure, the method of this example may mainly include the following steps:
step S2.1: camera self-motion acquisition, reference time panoramic image capture, current time panoramic image capture
In this step, two panoramic images are obtained by controlling the movement of a single panoramic camera; for details, refer to the earlier description of obtaining two panoramic images with one panoramic camera. Camera self-motion acquisition means acquiring the self-motion information of the panoramic camera; by controlling the camera's motion, two panoramic images are obtained at different times with shooting positions separated by more than a set distance threshold, namely the reference-time panoramic image and the current-time panoramic image.
Step S2.2: panoramic stereo image calibration
Panoramic stereo calibration is performed on the reference-time and current-time panoramic images acquired in step S2.1: the two panoramic images are transformed into a panoramic stereo image pair so that, after latitude-longitude expansion, the projection points of the same spatial point on the two planar panoramic images are aligned to the same image column.
Step S2.3: user image feature extraction
Step S2.4: three-dimensional gaze information estimation
Step S2.5: three-dimensional line-of-sight intersection detection
Step S2.6: user field of view extraction
Steps S2.3 to S2.6 are respectively used to implement the acquisition of the position image of the user, the determination of the three-dimensional sight line information, the determination of the gaze point, and the acquisition of the view field image of the user, and correspond to steps S1.2 to S1.5 in the above example 1, and may be implemented by using the scheme described in the above example 1.
Example three
In this example, the non-panoramic camera shown in fig. 8a, 8b, or 8c acquires two panoramic images, and obtains the gaze point of the user based on the two panoramic images. A flow diagram of this example is shown in fig. 50, and as shown in the figure, the method of this example may mainly include the following steps:
step S3.1: camera motion control and camera three-dimensional position acquisition
Because a non-panoramic camera has a limited field of view, it cannot observe all objects and users in the scene at once; the camera therefore needs to scan the surrounding environment under motion control in order to observe them. Camera motion control includes acquiring images of different scene areas by controlling camera rotation, and capturing images with stereo parallax by controlling camera translation.
When the camera is controlled to move, the motion trail of the camera can be planned through the camera motion control and camera three-dimensional position acquisition module, so that the camera motion is controlled, and the three-dimensional position of the camera is acquired.
Step S3.2: three-dimensional modeling of an environment
Based on the three-dimensional position information of the camera, a three-dimensional depth map of a static object in the surrounding environment of the scene can be established through a three-dimensional modeling module, and the depth information of the static object in the scene is obtained.
Step S3.3: object three-dimensional tracking
Step S3.4: user three-dimensional tracking
During image capture, the three-dimensional position of a moving object in the scene can be tracked by the object three-dimensional tracking module to estimate its position at the next moment; the camera is then pointed at that estimated position to capture the image of the moving object at the next moment. Similarly, the user's three-dimensional motion trajectory in the scene is tracked by the user three-dimensional tracking module to predict the user's position at the next moment and capture the user's image then. If the user or a moving object is about to leave the camera's field of view, the camera is rotated to keep shooting; if both the user and the moving object are outside the field of view, the camera can be rotated based on the user's position.
Based on the schemes in steps S3.1 to S3.4, images covering a 360-degree horizontal view of the scene can be obtained, with each scene region covered by two or more images whose shooting positions are separated by more than a set distance; the two panoramic images can then be obtained from these images by image stitching. For details of steps S3.1 to S3.4, refer to the earlier description of acquiring two panoramic images with one or more ordinary (non-panoramic) cameras.
Step S3.5: three-dimensional gaze information estimation
Step S3.6: three-dimensional line-of-sight intersection detection
Step S3.7: user field of view extraction
Steps S3.5 to S3.7 implement the scheme for obtaining the gaze point of the user based on the two acquired panoramic images; they correspond to steps S1.2 to S1.5 of example 1 and may be implemented by the scheme described in example 1.
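To make the correspondence concrete, the Python sketch below shows one way the three-dimensional line-of-sight intersection of step S3.6 could be carried out on two equirectangular panoramas: candidate route points are sampled along the estimated line of sight, projected into both panoramas, and the candidate whose two image patches agree best is kept as the gaze point. The equirectangular projection model, the sampling range, the patch size and the mean-squared-difference score are all assumptions of this sketch; the embodiments describe the matching degree more generally.

import numpy as np

def project_equirect(point, cam_pos, width, height):
    # Project a 3D world point into an equirectangular panorama captured at cam_pos
    # (world z up; longitude measured around z). Returns integer pixel coordinates.
    d = point - cam_pos
    d = d / np.linalg.norm(d)
    lon = np.arctan2(d[1], d[0])                  # [-pi, pi]
    lat = np.arcsin(np.clip(d[2], -1.0, 1.0))     # [-pi/2, pi/2]
    u = (lon / (2 * np.pi) + 0.5) * width
    v = (0.5 - lat / np.pi) * height
    return int(u) % width, min(max(int(v), 0), height - 1)

def patch(img, uv, r=8):
    # Small square patch around a pixel (truncated at the image border).
    u, v = uv
    return img[max(v - r, 0): v + r, max(u - r, 0): u + r].astype(np.float32)

def find_gaze_point(route_origin, route_dir, pano1, pos1, pano2, pos2,
                    depths=np.linspace(0.3, 10.0, 200)):
    # Walk along the estimated line of sight, project each candidate route point
    # into both panoramas, and keep the candidate whose two patches agree best
    # (lowest mean squared difference) as the gaze point.
    h, w = pano1.shape[:2]
    route_dir = route_dir / np.linalg.norm(route_dir)
    best, best_cost = None, np.inf
    for s in depths:
        p = route_origin + s * route_dir
        pa = patch(pano1, project_equirect(p, pos1, w, h))
        pb = patch(pano2, project_equirect(p, pos2, w, h))
        if pa.shape != pb.shape or pa.size == 0:
            continue
        cost = np.mean((pa - pb) ** 2)
        if cost < best_cost:
            best, best_cost = p, cost
    return best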
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and such modifications and refinements shall also fall within the protection scope of the present invention.

Claims (29)

1. A method of determining a point of interest, comprising:
acquiring at least two images of a scene;
and determining, based on the at least two images, a point of interest of an object in the image in the scene.
2. The method of claim 1, wherein determining the point of interest of the object in the image in the scene based on the at least two images comprises:
determining attention route information corresponding to the object based on the at least two images;
and determining the attention point according to the attention route information.
3. The method of claim 2, wherein the determining of the attention route information corresponding to the object based on the at least two images comprises:
determining a position image of the object based on the at least two images;
determining the attention route information based on the position image of the object.
4. The method of claim 3, further comprising:
and obtaining the category information of the object based on the position image of the object.
5. The method according to claim 3 or 4, wherein the position image of the object comprises at least one of:
a body image group, a head image group, a face key point image group, an eye image group, an arm image group, and a hand image group.
6. The method according to claim 3 or 4, characterized in that the attention route information comprises start point information and direction information of the attention route.
7. The method according to claim 3 or 4, wherein the determining of the attention route information based on the position image of the object comprises:
obtaining at least two pieces of initial attention route information based on the position image of the object;
and obtaining the attention route information by fusing the at least two pieces of initial attention route information.
8. The method according to any one of claims 3 to 7, further comprising:
and correcting the attention route information through part modeling.
9. The method according to claim 7, wherein the point of interest is a gaze point, the position image of the object is the head image group, and the obtaining of at least two pieces of initial attention route information based on the position image of the object comprises:
extracting features of the head image group through a first convolutional neural network to obtain a head feature image group;
performing face position detection on the head feature image group through a first fully-connected neural network to obtain face position information in each head feature image;
obtaining a face feature image group through a face feature pooling layer according to the head feature image group and the face position information in each head feature image;
performing face key point detection on the face feature image group through a second fully-connected neural network to obtain face key point position information in each face feature image;
obtaining an eye feature image group through an eye feature pooling layer according to the face feature image group and the face key point position information in each face feature image;
performing feature extraction on the head feature image group, the face feature image group and the eye feature image group through a second convolutional neural network, and obtaining first initial attention route information through a third fully-connected neural network according to the extracted feature images;
and performing feature extraction on the eye feature image group through a third convolutional neural network, and obtaining second initial attention route information through a fourth fully-connected neural network according to the extracted feature image.
10. The method according to any one of claims 2 to 9, characterized in that the determining of the point of interest comprises at least one of the following manners:
manner one:
determining, according to the attention route information, a matching degree of image information corresponding to projection points of the attention route on the at least two images;
determining the attention point based on the matching degree;
manner two:
acquiring a depth image of the scene;
determining, according to the attention route information, a matching degree between the depth values of the depth image at the projection points of the route points of the attention route and the depth values of the corresponding route points;
determining the attention point based on the matching degree.
11. The method of claim 10, further comprising: correcting the matching degree in at least one of the following manners:
route point motion prediction correction, and image semantic segmentation correction.
12. The method of claim 10, wherein the depth values of the static objects in the depth image of the scene are derived from historical depth values of the static objects.
13. The method of any one of claims 1 to 12, further comprising:
determining an image of interest of the object from the point of interest and at least one of the at least two images.
14. The method of claim 13, wherein the determining of the image of interest of the object from the point of interest and at least one of the at least two images comprises:
determining an observation image window according to the projection point of the attention point in at least one image of the at least two images;
and determining the image of interest according to the correspondence between the observation image window and the attention window of the object.
15. The method according to any one of claims 1 to 14, characterized in that the at least two images are at least two panoramic images.
16. The method of any of claims 1 to 15, wherein said acquiring at least two images of a scene comprises:
acquiring the at least two images through at least two cameras at different positions; or,
acquiring the at least two images by controlling translation and/or rotation of a camera.
17. The method according to any one of claims 1 to 16, characterized in that the point of interest comprises a gaze point and/or a pointing point of a part of the object.
18. An apparatus for determining a point of interest, comprising:
an image acquisition module, configured to acquire at least two images of a scene;
and an attention point determination module, configured to determine, based on the at least two images, an attention point of an object in the image in the scene.
19. An electronic device comprising an image acquisition module, a memory, and a processor;
wherein the image acquisition module is configured to acquire at least two images of a scene;
and the memory is configured to store machine-readable instructions which, when executed by the processor, cause the processor to determine, based on the at least two images acquired by the image acquisition module, a point of interest of an object in the image in the scene.
20. A system for determining a point of interest, the system comprising an image acquisition device and an electronic device connected to the image acquisition device;
wherein the image acquisition device is configured to acquire at least two images of a scene and send the at least two images to the electronic device;
and the electronic device is configured to receive the at least two images sent by the image acquisition device and to determine, based on the received at least two images, an attention point of an object in the image in the scene.
21. A behavior information acquisition method, comprising:
acquiring an attention point of an object;
and acquiring behavior information of the object according to the attention point.
22. The method according to claim 21, characterized in that the point of interest of the object is obtained according to the method for determining a point of interest according to any of claims 1 to 17.
23. The method according to claim 21 or 22, characterized in that the behavior information of the object comprises at least one of the following:
an object of interest to the subject;
a duration of time that the object is focused on an object;
information of a change over time of an object of interest to the subject;
a point of interest trajectory of the object;
an image of interest of the object;
voice information of the object;
and the interaction information of the object and the equipment.
24. The method according to any one of claims 21 to 23, wherein, after the obtaining of the behavior information of the object, the method further comprises:
and performing processing according to the behavior information of the object.
25. The method of claim 24, wherein the processing comprises at least one of:
saving an image or video of an object of interest to the subject;
providing a service or information related to an object of interest to the subject;
providing a service or information related to an image of interest of the object;
controlling the state of an object of interest of the subject or the state of the subject;
providing prompt information;
and providing attention-related information.
26. The method of claim 25, wherein the providing of a service or information related to the object of interest of the subject comprises at least one of:
providing information related to an object of interest of the subject, the information being obtained according to an instruction of the subject;
providing a service or information determined from both the object of interest of the subject and the voice information of the subject.
27. A behavior information acquisition apparatus, comprising:
an attention point acquisition module, configured to acquire an attention point of an object;
and a behavior information acquisition module, configured to acquire behavior information of the object according to the attention point.
28. An electronic device comprising a memory and a processor;
a memory for storing machine-readable instructions which, when executed by the processor, cause the processor to perform the method of determining a point of interest of any one of claims 1 to 17 and/or the behavior information acquisition method of any one of claims 21 to 26.
29. A computer-readable storage medium, wherein the computer-readable storage medium is configured to store computer instructions which, when run on a computer, cause the computer to perform the method of determining a point of interest of any one of claims 1 to 17 and/or the behavior information acquisition method of any one of claims 21 to 26.
CN201810829677.4A 2018-07-25 2018-07-25 Method, device, equipment and system for determining attention point and information processing method Pending CN110853073A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810829677.4A CN110853073A (en) 2018-07-25 2018-07-25 Method, device, equipment and system for determining attention point and information processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810829677.4A CN110853073A (en) 2018-07-25 2018-07-25 Method, device, equipment and system for determining attention point and information processing method

Publications (1)

Publication Number Publication Date
CN110853073A true CN110853073A (en) 2020-02-28

Family

ID=69594640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810829677.4A Pending CN110853073A (en) 2018-07-25 2018-07-25 Method, device, equipment and system for determining attention point and information processing method

Country Status (1)

Country Link
CN (1) CN110853073A (en)

Patent Citations (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070002131A1 (en) * 2005-02-15 2007-01-04 Ritchey Kurtis J Dynamic interactive region-of-interest panoramic/three-dimensional immersive communication system and method
CN101133962A (en) * 2006-09-01 2008-03-05 西门子公司 Method for reconstructing a three-dimensional image volume and x-ray devices
JP2009086017A (en) * 2007-09-27 2009-04-23 Kyocera Corp Imaging device and imaging method
CN101216896A (en) * 2008-01-14 2008-07-09 浙江大学 An identification method for movement by human bodies irrelevant with the viewpoint based on stencil matching
US20160210602A1 (en) * 2008-03-21 2016-07-21 Dressbot, Inc. System and method for collaborative shopping, business and entertainment
CN101344816A (en) * 2008-08-15 2009-01-14 华南理工大学 Human-machine interaction method and device based on sight tracing and gesture discriminating
CN101799939A (en) * 2010-04-02 2010-08-11 天津大学 Rapid and self-adaptive generation algorithm of intermediate viewpoint based on left and right viewpoint images
CN103026700A (en) * 2010-04-09 2013-04-03 3D-4U股份有限公司 Apparatus and method for capturing images
JP2013081762A (en) * 2011-09-26 2013-05-09 Dainippon Printing Co Ltd Eye-line analyzer, eye-line measuring system, method therefor, program therefor, and recording medium
CN102547123A (en) * 2012-01-05 2012-07-04 天津师范大学 Self-adapting sightline tracking system and method based on face recognition technology
KR20130088666A (en) * 2012-01-31 2013-08-08 한국전자통신연구원 Apparatus for focus measurement in eye tracking system using multi layer perception
CN107346061A (en) * 2012-08-21 2017-11-14 Fotonation开曼有限公司 For the parallax detection in the image using array camera seizure and the system and method for correction
US20140177051A1 (en) * 2012-12-21 2014-06-26 Elaine Frances Kopko Holographic Display System
CN104978548A (en) * 2014-04-02 2015-10-14 汉王科技股份有限公司 Visual line estimation method and visual line estimation device based on three-dimensional active shape model
CN105303998A (en) * 2014-07-24 2016-02-03 北京三星通信技术研究有限公司 Method, device and equipment for playing advertisements based on inter-audience relevance information
CN105468580A (en) * 2014-09-28 2016-04-06 北京三星通信技术研究有限公司 Attention point information based method and apparatus for providing service
CN105426827A (en) * 2015-11-09 2016-03-23 北京市商汤科技开发有限公司 Living body verification method, device and system
CN105787442A (en) * 2016-02-19 2016-07-20 电子科技大学 Visual interaction based wearable auxiliary system for people with visual impairment, and application method thereof
CN107801413A (en) * 2016-06-28 2018-03-13 华为技术有限公司 The terminal and its processing method being controlled to electronic equipment
JP2018067773A (en) * 2016-10-18 2018-04-26 キヤノン株式会社 Imaging device, control method thereof, program, and storage medium
CN107193383A (en) * 2017-06-13 2017-09-22 华南师范大学 A kind of two grades of Eye-controlling focus methods constrained based on facial orientation
CN107818310A (en) * 2017-11-03 2018-03-20 电子科技大学 A kind of driver attention's detection method based on sight
CN107977560A (en) * 2017-11-23 2018-05-01 北京航空航天大学 Identity identifying method and device based on Eye-controlling focus
CN108171152A (en) * 2017-12-26 2018-06-15 深圳大学 Deep learning human eye sight estimation method, equipment, system and readable storage medium storing program for executing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANDREAS BULLING ET AL: "Fixation Detection for Head-Mounted Eye Tracking Based on Visual Similarity of Gaze Targets", ACM Press, The 2018 ACM Symposium, pages 1 - 9 *
侯树卫; 李斌; 谢春雷: "Home appliance control based on intelligent gaze sensing and the Internet of Things" (基于智能视线感知和物联网的家电控制), 电子技术 (Electronic Technology), no. 06, pages 1 - 4 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401296B (en) * 2020-04-02 2023-09-29 浙江大华技术股份有限公司 Behavior analysis method, device and apparatus
CN111401296A (en) * 2020-04-02 2020-07-10 浙江大华技术股份有限公司 Behavior analysis method, equipment and device
CN111601080A (en) * 2020-05-12 2020-08-28 杭州武盛广告制作有限公司 Video management system for community security monitoring video storage
CN111601080B (en) * 2020-05-12 2021-08-10 湖北君赞智能科技有限公司 Video management system for community security monitoring video storage
JP2022042427A (en) * 2020-09-02 2022-03-14 株式会社サイバーエージェント Estimation system, estimation device, estimation method, and computer program
JP7093935B2 (en) 2020-09-02 2022-07-01 株式会社サイバーエージェント Estimating system, estimation device, estimation method and computer program
CN112734820A (en) * 2021-03-29 2021-04-30 之江实验室 Method and device for estimating fixation target, electronic equipment and neural network structure
CN113705349B (en) * 2021-07-26 2023-06-06 电子科技大学 Attention quantitative analysis method and system based on line-of-sight estimation neural network
CN113705349A (en) * 2021-07-26 2021-11-26 电子科技大学 Attention power analysis method and system based on sight estimation neural network
CN113849142A (en) * 2021-09-26 2021-12-28 深圳市火乐科技发展有限公司 Image display method and device, electronic equipment and computer readable storage medium
CN114067119B (en) * 2022-01-17 2022-05-24 深圳市海清视讯科技有限公司 Training method of panorama segmentation model, panorama segmentation method and device
CN114067119A (en) * 2022-01-17 2022-02-18 深圳市海清视讯科技有限公司 Training method of panorama segmentation model, panorama segmentation method and device
CN115201206A (en) * 2022-07-22 2022-10-18 西安理工大学 Electric vehicle handlebar defect detection method based on machine vision
CN115201206B (en) * 2022-07-22 2024-04-26 西安理工大学 Electric vehicle handle bar defect detection method based on machine vision
WO2024055925A1 (en) * 2022-09-13 2024-03-21 影石创新科技股份有限公司 Image transmission method and apparatus, image display method and apparatus, and computer device
CN115601687A (en) * 2022-12-15 2023-01-13 南京睿聚科技发展有限公司(Cn) Intelligent processing method for on-site survey data in insurance claim settlement process
CN115601687B (en) * 2022-12-15 2023-03-07 南京睿聚科技发展有限公司 Intelligent processing method for on-site survey data in insurance claim settlement process
CN115904499A (en) * 2023-02-27 2023-04-04 珠海市鸿瑞信息技术股份有限公司 Artificial intelligence-based dangerous situation perception real-time early warning system and method

Similar Documents

Publication Publication Date Title
CN110853073A (en) Method, device, equipment and system for determining attention point and information processing method
US11336867B2 (en) Method of tracking a mobile device and method of generating a geometrical model of a real environment using a camera of a mobile device
CN105391970B (en) The method and system of at least one image captured by the scene camera of vehicle is provided
JP6695503B2 (en) Method and system for monitoring the condition of a vehicle driver
JP7244655B2 (en) Gaze Area Detection Method, Apparatus, and Electronic Device
Choi et al. A general framework for tracking multiple people from a moving camera
CN107025662B (en) Method, server, terminal and system for realizing augmented reality
CN107004275B (en) Method and system for determining spatial coordinates of a 3D reconstruction of at least a part of a physical object
US20220012495A1 (en) Visual feature tagging in multi-view interactive digital media representations
CN108388882B (en) Gesture recognition method based on global-local RGB-D multi-mode
CN110135249B (en) Human behavior identification method based on time attention mechanism and LSTM (least Square TM)
US20180300551A1 (en) Identifying a Position of a Marker in an Environment
KR102441171B1 (en) Apparatus and Method for Monitoring User based on Multi-View Face Image
US20120075343A1 (en) Augmented reality (ar) system and method for tracking parts and visually cueing a user to identify and locate parts in a scene
JP2018528536A (en) Method and system for monitoring driving behavior
Younis et al. A hazard detection and tracking system for people with peripheral vision loss using smart glasses and augmented reality
CN102663722A (en) Moving object segmentation using depth images
CN107025661B (en) Method, server, terminal and system for realizing augmented reality
Borghi et al. Hands on the wheel: a dataset for driver hand detection and tracking
CN114041175A (en) Neural network for estimating head pose and gaze using photorealistic synthetic data
US20210225038A1 (en) Visual object history
CN106774910A (en) Streetscape implementation method and device based on virtual reality
Cheng et al. Improving dense mapping for mobile robots in dynamic environments based on semantic information
CN114387679A (en) System and method for realizing sight line estimation and attention analysis based on recursive convolutional neural network
Lee et al. Multi-modal user interaction method based on gaze tracking and gesture recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination