CN117724612A - Intelligent video target automatic monitoring system and method based on man-machine interaction - Google Patents

Intelligent video target automatic monitoring system and method based on man-machine interaction

Info

Publication number
CN117724612A
CN117724612A CN202311758400.4A CN202311758400A
Authority
CN
China
Prior art keywords
user
feature
hand
training
interest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311758400.4A
Other languages
Chinese (zh)
Inventor
李宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rizhao Ruifei Media Co ltd
Original Assignee
Rizhao Ruifei Media Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rizhao Ruifei Media Co ltd filed Critical Rizhao Ruifei Media Co ltd
Priority to CN202311758400.4A priority Critical patent/CN117724612A/en
Publication of CN117724612A publication Critical patent/CN117724612A/en
Pending legal-status Critical Current

Abstract

The application relates to the field of intelligent monitoring of targets, and particularly discloses an intelligent video target automatic monitoring system and method based on human-computer interaction. A monitoring video of a user to be monitored is first collected; a user hand region of interest and a user eye region of interest are then obtained from the monitoring video through a user hand feature extraction module based on a target detection network and through a target detection network based on an anchor-free window, respectively; feature extraction is performed on the user hand region of interest by a depth feature fusion module serving as the user hand feature extractor and on the user eye region of interest by a user local part feature extractor based on a spatial attention mechanism; finally, the fused feature information of the two regions is passed through a classifier to obtain a classification result. In this way, based on the classification result, the user to be monitored can better experience the VR game.

Description

Intelligent video target automatic monitoring system and method based on man-machine interaction
Technical Field
The application relates to the field of intelligent monitoring of targets, and more particularly, to an intelligent video target automatic monitoring system and method based on man-machine interaction.
Background
With the rapid development of computer technology, virtual reality (VR) technology is becoming more and more popular. VR technology is a computer simulation system capable of creating and experiencing a virtual world: a computer processor generates a simulated environment, a system simulation of interactive three-dimensional dynamic views and entity behaviors fused from multi-source information, in which the user can be immersed. Currently, VR technology is widely used in video, virtual reality games, painting, and other scenes.
Conventional VR systems typically use handles, controllers, or gesture recognition as input devices. However, these approaches may limit the user's freedom and sense of realism. Handles and controllers must be held, which can cause fatigue, and they are not natural or intuitive enough. Gesture recognition provides a more natural way of interacting, but its accuracy and reliability remain a challenge.
Therefore, an intelligent video target automatic monitoring system and method based on man-machine interaction is desired, which realizes interaction between the user to be monitored and a VR game through an interaction mode based on the hands and the viewpoint.
Disclosure of Invention
The present application has been made in order to solve the above technical problems. The embodiment of the application provides an intelligent video target automatic monitoring system and method based on human-computer interaction. A monitoring video of a user to be monitored is first collected; a user hand region of interest and a user eye region of interest are then obtained from the monitoring video through a user hand feature extraction module based on a target detection network and through a target detection network based on an anchor-free window, respectively; feature extraction is performed on the user hand region of interest by a depth feature fusion module serving as the user hand feature extractor, and on the user eye region of interest by a user local part feature extractor based on a spatial attention mechanism; finally, the feature information of the two regions is fused and passed through a classifier to obtain a classification result. In this way, based on the classification result, the user to be monitored can better experience the VR game.
According to a first aspect of the present application, there is provided an intelligent video object automatic monitoring system based on human-computer interaction, comprising:
the video acquisition module is used for acquiring a monitoring video of a user to be monitored;
The target area acquisition module is used for acquiring a user hand region of interest and a user eye region of interest based on the hand motion and the eye motion of the user to be detected in the monitoring video of the user to be monitored respectively;
the target feature extraction module is used for respectively acquiring feature information of the user hand region of interest and feature information of the user eye region of interest to obtain a user hand action feature map and a user eye sight feature map;
and the user action determining module is used for obtaining a classification result based on the user hand action feature map and the user eye sight feature map.
The intelligent video target automatic monitoring system based on human-computer interaction further comprises a training module for training the user hand feature extraction module based on the target detection network, the depth feature fusion module serving as the user hand feature extractor, the target detection network based on the anchor-free window, the user local part feature extractor based on the spatial attention mechanism and the classifier;
wherein, training module includes:
the training data acquisition unit is used for acquiring training data, wherein the training data comprises a monitoring video of a user to be monitored;
The key frame acquisition unit is used for acquiring a plurality of video key frames from the monitoring video of the user to be monitored and then arranging the video key frames into a three-dimensional user experience input tensor;
the user hand region training unit is used for enabling the three-dimensional user experience input tensor to pass through a user hand feature extraction module based on a target detection network so as to obtain a training user hand region of interest;
the user hand feature training unit is used for enabling the hand interested region of the training user to pass through a depth feature fusion module serving as a user hand feature extractor to obtain a hand action feature diagram of the training user;
the user eye region training unit is used for enabling the three-dimensional user experience input tensor to pass through a target detection network based on an anchor-free window so as to obtain a training user eye region of interest;
the user eye feature training unit is used for enabling the training user eye interested area to pass through a user local part feature extractor based on a spatial attention mechanism so as to obtain a training user eye sight feature map;
the user training feature fusion unit is used for fusing the hand action feature images of the training user and the eye sight feature images of the training user to obtain training user experience feature images;
The geometric rigidity consistency factor calculation unit is used for calculating geometric rigidity consistency factors based on order prior between the hand action feature images of the training user and the eye sight feature images of the training user;
the classification loss unit is used for enabling the training user experience feature images to pass through a classifier to obtain classification loss function values;
the training unit is used for training the user hand feature extraction module based on the target detection network, the depth feature fusion module used as the user hand feature extractor, the target detection network based on the anchor-free window, the user local part feature extractor based on the spatial attention mechanism and the classifier by taking the weighted sum between the classification loss function value and the order-prior-based geometric rigidity consistency factor as the loss function value.
According to a second aspect of the present application, there is provided an intelligent video target automatic monitoring method based on human-computer interaction, which includes:
collecting a monitoring video of a user to be monitored;
the hand action and the eye action of the user to be detected in the monitoring video of the user to be monitored are respectively based on to obtain a user hand region of interest and a user eye region of interest;
The characteristic information of the user hand region of interest and the characteristic information of the user eye region of interest are respectively obtained to obtain a user hand action characteristic diagram and a user eye sight line characteristic diagram;
and obtaining a classification result based on the hand motion feature map of the user and the eye sight feature map of the user.
With reference to the first aspect of the present application, in an intelligent video target automatic monitoring system based on man-machine interaction in the first aspect of the present application, the target area obtaining module includes: the video preprocessing unit is used for preprocessing the monitoring video of the user to be monitored to obtain a three-dimensional user experience input tensor; and the target region acquisition unit is used for acquiring the user hand region of interest and the user eye region of interest based on the three-dimensional user experience input tensor. Wherein, the video preprocessing unit includes: a key frame acquisition subunit, configured to acquire a plurality of video key frames from the monitoring video of the user to be monitored; and the key frame arrangement subunit is used for arranging the video key frames into the three-dimensional user experience input tensor according to the time dimension. The target area acquisition unit includes: the user hand feature acquisition subunit is used for extracting target area features of the three-dimensional user experience input tensor by using a user hand feature extraction module based on a target detection network so as to obtain an interesting area of the user hand; and the user eye feature acquisition subunit is used for extracting target area features of the three-dimensional user experience input tensor by using a target detection network based on an anchor-free window so as to obtain the user eye region of interest.
With reference to the first aspect of the present application, in the intelligent video target automatic monitoring system based on man-machine interaction of the first aspect of the present application, the geometric rigidity consistency factor calculating unit is configured to: calculate the rigidity consistency factor of the parameterized geometric-relationship prior feature between the training user hand motion feature map and the training user eye sight feature map according to a formula in which the feature value at the (i, j, k)-th position of the training user hand motion feature map and the feature value at the (i, j, k)-th position of the training user eye sight feature map are evaluated over all positions, W, H and C denote the width, height and channel number of the feature maps, log denotes the base-2 logarithm, λ denotes a predetermined weight, and Loss denotes the rigidity consistency factor of the parameterized geometric-relationship prior feature.
With reference to the second aspect of the present application, in an intelligent video target automatic monitoring method based on man-machine interaction in the second aspect of the present application, the method for obtaining feature information of a region of interest of a user's hand and feature information of a region of interest of a user's eye to obtain a feature map of a user's hand motion and a feature map of a user's eye sight line respectively includes: the user hand feature extraction unit is used for enabling the region of interest of the user hand to pass through the depth feature fusion module serving as a user hand feature extractor to obtain the user hand action feature map; and the user eye feature extraction unit is used for enabling the user eye region of interest to pass through a user local part feature extractor based on a spatial attention mechanism so as to obtain the user eye sight feature map.
Compared with the prior art, the intelligent video target automatic monitoring system and method based on human-computer interaction provided by the application first collect a monitoring video of a user to be monitored; a user hand region of interest and a user eye region of interest are then obtained from the monitoring video through a user hand feature extraction module based on a target detection network and through a target detection network based on an anchor-free window, respectively; feature extraction is performed on the user hand region of interest by a depth feature fusion module serving as the user hand feature extractor, and on the user eye region of interest by a user local part feature extractor based on a spatial attention mechanism; finally, the feature information of the two regions is fused and passed through a classifier to obtain a classification result. In this way, based on the classification result, the user to be monitored can better experience the VR game.
Drawings
The foregoing and other objects, features and advantages of the present application will become more apparent from the following more particular description of embodiments of the present application, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification, illustrate the application and not constitute a limitation to the application. In the drawings, like reference numerals generally refer to like parts or steps.
Fig. 1 illustrates a schematic block diagram of an intelligent video object automatic monitoring system based on human-machine interaction according to an embodiment of the present application.
Fig. 2 illustrates a schematic block diagram of a target area acquisition module in an intelligent video target automatic monitoring system based on human-computer interaction according to an embodiment of the application.
Fig. 3 illustrates a schematic block diagram of a video preprocessing unit in a target area acquisition module in an intelligent video target automatic monitoring system based on man-machine interaction according to an embodiment of the application.
Fig. 4 illustrates a schematic block diagram of a target area acquisition unit in a target area acquisition module in an intelligent video target automatic monitoring system based on man-machine interaction according to an embodiment of the application.
Fig. 5 illustrates a schematic block diagram of a target feature extraction module in an intelligent video target automatic monitoring system based on human-computer interaction according to an embodiment of the application.
Fig. 6 illustrates a schematic block diagram of a user action determination module in an intelligent video object automatic monitoring system based on human-computer interaction according to an embodiment of the present application.
Fig. 7 illustrates a schematic block diagram of a training module in an intelligent video object automatic monitoring system based on human-machine interaction according to an embodiment of the present application.
Fig. 8 illustrates an architecture diagram of an intelligent video object automatic monitoring system based on human-computer interaction according to an embodiment of the present application.
Fig. 9 illustrates a flowchart of an intelligent video object automatic monitoring method based on human-computer interaction according to an embodiment of the application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
Exemplary System
Fig. 1 illustrates a schematic block diagram of an intelligent video object automatic monitoring system based on human-machine interaction according to an embodiment of the present application. As shown in fig. 1, the intelligent video target automatic monitoring system 100 based on man-machine interaction according to an embodiment of the present application includes: the video acquisition module 110 is used for acquiring a monitoring video of a user to be monitored; the target area obtaining module 120 is configured to obtain a user hand region of interest and a user eye region of interest based on a hand motion and an eye motion of the user to be detected in the monitoring video of the user to be monitored, respectively; the target feature extraction module 130 is configured to obtain feature information of the user hand region of interest and feature information of the user eye region of interest, so as to obtain a user hand motion feature map and a user eye sight feature map; the user action determining module 140 is configured to obtain a classification result based on the user hand action feature map and the user eye sight feature map.
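A high-level sketch of how the four modules listed above could be wired together is given below. The module internals are only placeholder callables supplied by the caller; sketches of the individual components follow later in this description. All names here are illustrative assumptions, not the patent's own interfaces.

```python
# Minimal pipeline sketch: the four modules of system 100 as pluggable callables.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class MonitoringPipeline:
    acquire_video: Callable[[], Any]        # video acquisition module 110
    get_hand_roi: Callable[[Any], Any]      # target area acquisition module 120 (hand)
    get_eye_roi: Callable[[Any], Any]       # target area acquisition module 120 (eye)
    hand_features: Callable[[Any], Any]     # target feature extraction module 130 (hand)
    eye_features: Callable[[Any], Any]      # target feature extraction module 130 (eye)
    classify: Callable[[Any, Any], Any]     # user action determining module 140

    def run(self) -> Any:
        video = self.acquire_video()
        hand_roi = self.get_hand_roi(video)
        eye_roi = self.get_eye_roi(video)
        hand_map = self.hand_features(hand_roi)   # user hand action feature map
        eye_map = self.eye_features(eye_roi)      # user eye sight feature map
        return self.classify(hand_map, eye_map)   # classification result
```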
Human-computer interaction refers to the process of information exchange and operation between a human being and a computer or other intelligent device. It involves the design and development of a user interface so that one can effectively interact and control with a computer system. The goal of human-machine interaction is to enable a user to easily use a computer system and obtain the desired information or perform a particular task. Human-machine interaction may include a variety of forms, such as using a keyboard, mouse, touch screen, or voice commands to enter and manipulate information. It may also involve Graphical User Interface (GUI), virtual Reality (VR), augmented Reality (AR), etc. technologies to provide a more intuitive, natural and immersive interactive experience. Among other things, human-machine interaction plays an important role in Virtual Reality (VR) games, which can provide a more immersive and interactive game experience. For example, gesture controls are often used in VR games to simulate a player's hand movements. By using a device such as a handle, glove, or camera, a player can directly use hand movements to manipulate characters or objects in a game, such as grabbing, dragging, throwing, etc.
As described in the background above, conventional VR systems typically use handles, controls, or gesture recognition as input devices. However, these approaches may limit the freedom and realism of the user. The handle and controller require the user to hold, can cause fatigue, and are not natural and intuitive enough. Gesture recognition, while providing a more natural way of interaction, remains a challenge for its accuracy and reliability. Therefore, an intelligent video target automatic monitoring system and method based on man-machine interaction are expected, and interaction between a user to be monitored and a VR game is realized based on an interaction mode of hands and viewpoints.
Specifically, in the embodiment of the present application, the video capturing module 110 is configured to capture a monitoring video of a user to be monitored. It should be appreciated that by capturing the surveillance video, the hand motion of the user to be monitored in the VR game may be captured in real time. Thus, gesture recognition and hand interaction can be realized, so that a user can control characters or objects in the game through gestures, and the immersion and interactivity of the game are enhanced. In addition, the monitoring video can acquire viewpoint information of the user to be monitored, namely the gaze point and the observation behavior of the user in the VR game. By analyzing the viewpoint information of the user, the system can learn the points of interest, the attention distribution, etc. of the user, thereby providing personalized game experience and interactive feedback. Therefore, in order to achieve better interaction between the user and the VR game, a monitoring video of the user to be monitored is first acquired.
More specifically, when the monitoring video of the user to be monitored is collected, the whole action information of the user to be monitored can be collected through the camera. The cameras may be mounted at appropriate locations around the VR museum or near the machine to clearly monitor the user's motion changes during the game.
In this embodiment of the present application, the target area obtaining module 120 is configured to obtain the user hand region of interest and the user eye region of interest based on the hand motion and the eye motion of the user to be detected in the surveillance video of the user to be monitored, respectively. Considering that the hand motions of the user may exhibit the user's interest in characters, objects, or interface elements in the game and operational intent, the eye motions of the user may exhibit the user's attention distribution and points of interest in the game. Based on the characteristic information, the characteristic information of the hand action area and the characteristic information of the eye area of the user to be monitored are respectively obtained from the monitoring video. Thus, not only is the interaction mode facilitated to be optimized, so that a user can interact with the virtual environment more naturally and efficiently, but also the content and design of the game are facilitated to be improved, the game is more attractive and interesting, and more relevant content is provided. Thus, by capturing the user's hand region of interest and eye region of interest, the system may provide personalized interactive feedback based on the user's interests and attention. For example, when a user looks at an object, the system may trigger a corresponding animation, sound effect, or interaction event, enhancing the user's immersion and engagement.
Fig. 2 illustrates a schematic block diagram of a target area acquisition module in an intelligent video target automatic monitoring system based on human-computer interaction according to an embodiment of the application. As shown in fig. 2, the target area obtaining module 120 includes: the video preprocessing unit 121 is configured to perform a preprocessing operation on the surveillance video of the user to be monitored to obtain a three-dimensional user experience input tensor; the target region obtaining unit 122 is configured to obtain the region of interest of the user's hand and the region of interest of the user's eye based on the three-dimensional user experience input tensor.
The surveillance video is preprocessed based on the fact that the surveillance video typically contains a lot of redundant information and background noise. Through preprocessing operation, the redundant information can be filtered, key information in the video, such as hand actions, eye actions, gestures and the like of a user, is extracted, so that the calculated amount of subsequent processing is reduced, and the efficiency and accuracy of the system are improved. In addition, through preprocessing operation, the monitoring video can be converted into input data with higher dimensionality and richer dimension, such as three-dimensional user experience input tensor. Thus, the diversity and details of the user behaviors can be better captured, and more comprehensive user experience analysis and interaction functions are provided.
Fig. 3 illustrates a schematic block diagram of a video preprocessing unit in a target area acquisition module in an intelligent video target automatic monitoring system based on man-machine interaction according to an embodiment of the application. As shown in fig. 3, the video preprocessing unit 121 includes: a key frame obtaining subunit 121-1, configured to obtain a plurality of video key frames from the surveillance video of the user to be monitored; a keyframe arrangement subunit 121-2, configured to arrange the plurality of video keyframes into the three-dimensional user experience input tensor according to a time dimension.
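A minimal sketch of this preprocessing step follows, assuming the surveillance video is an ordinary file readable by OpenCV. The uniform sampling strategy, the 16-frame count and the grayscale resizing are illustrative assumptions rather than choices fixed by the patent text.

```python
# Sample key frames and arrange them along the time dimension into a 3D tensor.
import cv2
import numpy as np

def video_to_experience_tensor(video_path: str, num_keyframes: int = 16,
                               size: tuple = (224, 224)) -> np.ndarray:
    """Return a (T, H, W) three-dimensional user experience input tensor."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick frame indices spread evenly over the video (assumed key-frame rule).
    indices = np.linspace(0, max(total - 1, 0), num_keyframes).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        frame = cv2.cvtColor(cv2.resize(frame, size), cv2.COLOR_BGR2GRAY)
        frames.append(frame.astype(np.float32) / 255.0)
    cap.release()
    # Arrange the key frames according to the time dimension.
    return np.stack(frames, axis=0)
```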
Further, it should be appreciated that acquiring the user's hand region of interest and eye region of interest based on the three-dimensional user experience input tensor may provide more accurate, fine and personalized interactive control, feedback and adaptability while also supporting data analysis and user behavior understanding. This helps to promote the user experience, game effect and level of intelligentization of the system. Therefore, the hand characteristic information and the eye characteristic information of the user to be monitored are extracted respectively.
Fig. 4 illustrates a schematic block diagram of a target area acquisition unit in a target area acquisition module in an intelligent video target automatic monitoring system based on man-machine interaction according to an embodiment of the application. As shown in fig. 4, the target area acquiring unit 122 includes: a user hand feature acquiring subunit 122-1, configured to perform target region feature extraction on the three-dimensional user experience input tensor by using a target detection network-based user hand feature extraction module to obtain the user hand region of interest; a user eye feature acquiring subunit 122-2, configured to perform target region feature extraction on the three-dimensional user experience input tensor by using a target detection network based on an anchor-free window to obtain the region of interest of the user eye. The user hand feature extraction module is based on an anchor window target detection network.
In particular, the user hand feature acquisition subunit 122-1 is configured to: processing the three-dimensional user experience input tensor by using the anchor window-based target detection network according to the following formula to obtain the user hand region of interest; wherein, the formula is:
ROI = H(ψ_det, B) = (cls(ψ_det, B), Regr(ψ_det, B))

wherein ψ_det is the three-dimensional user experience input tensor, B is the anchor box, ROI is the user hand region of interest, cls(ψ_det, B) denotes classification, and Regr(ψ_det, B) denotes regression.
In particular, the user ocular feature acquisition subunit 122-2 is configured to: each input matrix in the three-dimensional user experience input tensor is respectively passed through a plurality of convolution layers to obtain a plurality of shallow feature images; the shallow feature images are respectively passed through the target detection network based on the anchor-free window to obtain eye feature images of the users to be monitored; and arranging the plurality of user eye feature images to be monitored along a channel dimension to form the user eye region of interest.
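A sketch of this eye-region branch is given below (PyTorch is assumed): each input matrix of the three-dimensional tensor passes through a few convolution layers to obtain a shallow feature map, a simplified anchor-free objectness head highlights the eye region, and the per-frame eye feature maps are arranged along the channel dimension. The head design and channel sizes are illustrative assumptions, not the patent's concrete network.

```python
# Anchor-free eye region-of-interest sketch over a (T, H, W) input tensor.
import torch
import torch.nn as nn

class AnchorFreeEyeDetector(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.shallow = nn.Sequential(                      # shallow feature layers
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.center = nn.Conv2d(channels, 1, 3, padding=1)  # anchor-free objectness map

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        feat = self.shallow(frame)                 # shallow feature map
        heat = torch.sigmoid(self.center(feat))    # where the eye is likely to be
        # Weight the shallow features by the objectness map to obtain a
        # per-frame user eye feature map.
        return feat * heat

def eye_region_of_interest(tensor_3d: torch.Tensor) -> torch.Tensor:
    """tensor_3d: (T, H, W) three-dimensional user experience input tensor."""
    det = AnchorFreeEyeDetector()
    per_frame = [det(f.unsqueeze(0).unsqueeze(0)) for f in tensor_3d]
    # Arrange the per-frame eye feature maps along the channel dimension.
    return torch.cat(per_frame, dim=1)             # (1, T * channels, H, W)
```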
In this embodiment of the present application, the target feature extraction module 130 is configured to obtain feature information of the region of interest of the user's hand and feature information of the region of interest of the user's eye, so as to obtain a feature map of the user's hand motion and a feature map of the user's eye sight line. It should be appreciated that the user's hand movements are one of the important ways in which they interact with VR games. By acquiring the characteristic information of the region of interest of the hand of the user, the system can recognize the hand actions, namely, the gestures and the hand actions of the user are converted into instructions for controlling the game. And the eye gaze of the user is the focus of his attention in VR games. By acquiring the characteristic information of the region of interest of the eyes of the user, the system can analyze the gaze point, gaze time and gaze sequence of the user, so as to know the attention distribution and behavior intention of the user. The feature information is extracted and analyzed from the hands and eyes of the user, so that the game content presentation is optimized, personalized feedback is provided, the user experience is improved, a more natural and visual interaction mode is realized, and the immersion feeling and the operation convenience of the user are improved. Therefore, in order to better monitor the experience of the user, the hand region of interest and the eye region of interest of the user are respectively subjected to convolution coding through a deep learning technology so as to extract feature change information.
Fig. 5 illustrates a schematic block diagram of a target feature extraction module in an intelligent video target automatic monitoring system based on human-computer interaction according to an embodiment of the application. As shown in fig. 5, the target feature extraction module 130 includes: the user hand feature extraction unit 131 is configured to obtain the user hand motion feature map by using the user hand region of interest as a depth feature fusion module of the user hand feature extractor; a user eye feature extraction unit 132, configured to pass the user eye region of interest through a user local part feature extractor based on a spatial attention mechanism to obtain the user eye gaze feature map.
It should be appreciated that hand movements in the region of interest of the user's hand often involve spatial and temporal variations. The depth feature fusion module can comprehensively consider the space and time information and integrate the space and time information into the feature extraction process. In contrast, convolutional neural network models typically focus on only local spatial information, while modeling for temporal information is relatively weak. Therefore, the depth feature fusion module captures the feature information in space and time of the region of interest of the hand of the user. The depth feature fusion module can comprehensively consider the space and time information and fuse the space and time information into the generation process of the hand action feature map. In this way, the hand movements of the user, including aspects of shape, motion trajectory, and speed of the gestures, may be more fully described.
In a specific embodiment of the present application, the user hand feature extraction unit 131 is configured to: extract shallow feature maps from the i-th layers of the user hand feature extractor, where the i-th layers are the first to sixth layers of the user hand feature extractor; fuse the plurality of shallow feature maps to obtain a shallow fusion feature map; extract deep feature maps from the j-th layers of the user hand feature extractor, where the ratio between the j-th layer and the i-th layer is greater than or equal to 5; fuse the plurality of deep feature maps to obtain a deep fusion feature map; and fuse the shallow fusion feature map and the deep fusion feature map by using the depth feature fusion module to obtain the user hand action feature map.
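A hedged sketch of this depth feature fusion idea follows: shallow maps are taken from the early layers of the hand feature extractor, deep maps from much later layers (depth ratio of at least 5), and the two groups are fused. The layer count, channel width, mean fusion and 1x1-convolution fusion are illustrative assumptions; the patent does not fix a concrete architecture.

```python
# Depth feature fusion sketch for the user hand region of interest (B, 1, H, W).
import torch
import torch.nn as nn

class DepthFeatureFusion(nn.Module):
    def __init__(self, channels: int = 32, num_layers: int = 30):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(1 if i == 0 else channels, channels,
                                     3, padding=1), nn.ReLU())
             for i in range(num_layers)])
        self.fuse = nn.Conv2d(2 * channels, channels, 1)   # shallow + deep fusion

    def forward(self, hand_roi: torch.Tensor) -> torch.Tensor:
        shallow, deep = [], []
        x = hand_roi
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if i <= 6:            # shallow feature maps: layers 1 to 6
                shallow.append(x)
            if i >= 5 * 6:        # deep feature maps: j / i >= 5 with i = 6 (assumed)
                deep.append(x)
        shallow_fused = torch.stack(shallow).mean(dim=0)   # shallow fusion feature map
        deep_fused = torch.stack(deep).mean(dim=0)         # deep fusion feature map
        # Fuse both to obtain the user hand action feature map.
        return self.fuse(torch.cat([shallow_fused, deep_fused], dim=1))
```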
Further, it should be appreciated that the user's eye gaze information may provide important cues regarding the user's attention and interests. By extracting features of the region of interest of the user's eyes, the user's point of attention and gaze point during viewing of the VR game may be captured. Such information is valuable for analyzing user behavior, understanding user intent, and providing personalized interactions and services. Thus, a user's eye region of interest is characterized by a user local region feature extractor based on a spatial attention mechanism. The spatial attention mechanism can weight and select the features according to the importance of the user eye region of interest, and the effect of feature extraction can be improved by focusing on the user eye region of interest, so that the extracted features are more accurate and targeted. Therefore, the attention degree of the system to the eye vision of the user can be enhanced, and the fineness and the intelligent level of interaction are improved.
In a specific embodiment of the present application, the user eye feature extraction unit 132 is configured to: perform depth convolution encoding on the user eye region of interest by using the convolution encoding part of the user local part feature extractor to obtain an initial convolution feature map; input the initial convolution feature map into the spatial attention part of the user local part feature extractor to obtain a spatial attention map; pass the spatial attention map through a Softmax activation function to obtain a spatial attention feature map; and calculate the position-wise point multiplication of the spatial attention feature map and the initial convolution feature map to obtain the user eye sight feature map.
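A minimal sketch of this spatial-attention extractor follows: convolutionally encode the eye region, produce a spatial attention map, normalize it with Softmax over spatial positions, and multiply it point-wise with the initial convolution feature map. Channel sizes and encoder depth are assumptions.

```python
# Spatial-attention user local part feature extractor sketch.
import torch
import torch.nn as nn

class SpatialAttentionEyeExtractor(nn.Module):
    def __init__(self, in_channels: int, channels: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(                        # convolution encoding part
            nn.Conv2d(in_channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.attention = nn.Conv2d(channels, 1, 1)            # spatial attention part

    def forward(self, eye_roi: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(eye_roi)                           # initial convolution feature map
        b, _, h, w = feat.shape
        attn = self.attention(feat).view(b, 1, h * w)
        attn = torch.softmax(attn, dim=-1).view(b, 1, h, w)    # Softmax over positions
        return feat * attn                                     # position-wise point multiplication
```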
In this embodiment, the user action determining module 140 is configured to obtain a classification result based on the user hand action feature map and the user eye gaze feature map.
Fig. 6 illustrates a schematic block diagram of a user action determination module in an intelligent video object automatic monitoring system based on human-computer interaction according to an embodiment of the present application. As shown in fig. 6, the user action determining module 140 includes: the user action feature fusion unit 141 is configured to perform feature fusion on the user hand action feature map and the user eye sight feature map to obtain a user experience feature map; the user action type determining unit 142 is configured to perform feature classification on the user experience feature map through a classifier to obtain a classification result. The classification result is used for representing the action type of the user to be monitored.
It should be appreciated that the user's hand movements and eye gaze are important ways for the user to interact with the VR game. By fusing the hand action feature diagram and the eye sight feature diagram of the user, the hand action and the visual attention of the user can be comprehensively considered, so that the interaction action and experience of the user can be more comprehensively described. In this way, the understanding and simulation of the user's behavior by the system may be improved, enabling the system to more accurately respond to the user's intent and needs.
In a specific embodiment of the present application, the user action feature fusion unit 141 is configured to: fusing the user hand motion feature map and the user eye sight feature map to obtain the user experience feature map by using a fusion formula, wherein the fusion formula is as follows:
F1 = αFs + βFd

wherein F1 is the user experience feature map, Fs is the user hand motion feature map, Fd is the user eye sight feature map, "+" denotes that elements at corresponding positions of the user hand motion feature map and the user eye sight feature map are added, and α and β are weighting parameters for controlling the balance between the user hand motion feature map and the user eye sight feature map in the user experience feature map.
Further, considering that the classifier can learn the relationship between different behavioral patterns and interactive patterns, and map the user experience feature map to the corresponding category or label. Based on the feature classification, the user experience feature map is subjected to feature classification through a classifier. In this way, the system can be helped to accurately identify the behavior of the user, such as the type of gesture motion, the target of eye gaze, etc., so as to more accurately understand the intention and demand of the user, and thus better realize the interaction of the user with the VR game.
In a specific embodiment of the present application, the user action type determining unit 142 is configured to: process the user experience feature map with the classifier according to the following classification formula to obtain the classification result; wherein, the classification formula is: O = softmax{(Wc, Bc) | Project(F1)}, where Project(F1) denotes projecting the user experience feature map into a vector, Wc is the weight matrix, Bc is the bias vector, softmax denotes the normalized exponential function, and O denotes the classification result.
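The following sketch ties together the fusion formula F1 = αFs + βFd and the classifier O = softmax{(Wc, Bc) | Project(F1)}: the two feature maps are combined by weighted element-wise addition, flattened ("projected") into a vector, and passed through a linear layer followed by softmax. The weighting values and dimensions are illustrative assumptions.

```python
# Fusion plus classification sketch for the user action determining module.
import torch
import torch.nn as nn

class UserActionClassifier(nn.Module):
    def __init__(self, feature_dim: int, num_classes: int,
                 alpha: float = 0.5, beta: float = 0.5):
        super().__init__()
        self.alpha, self.beta = alpha, beta
        self.linear = nn.Linear(feature_dim, num_classes)     # (Wc, Bc)

    def forward(self, f_s: torch.Tensor, f_d: torch.Tensor) -> torch.Tensor:
        f1 = self.alpha * f_s + self.beta * f_d               # user experience feature map F1
        vec = torch.flatten(f1, start_dim=1)                   # Project(F1)
        return torch.softmax(self.linear(vec), dim=-1)         # classification result O
```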
It should be noted that, besides classifying the motion feature information of the user with a classifier, a clustering algorithm may also be used to perform cluster analysis on the user experience feature map. Clustering algorithms are an unsupervised learning method that discovers the inherent structure and patterns in data by dividing the data samples into similar groups or clusters. In such an embodiment, the user experience feature map may be used as input data and subjected to cluster analysis with a clustering algorithm to obtain a classification result. The specific steps are as follows:
1. Data preprocessing: perform preprocessing operations such as feature normalization and dimension reduction on the user experience feature map to facilitate processing by the clustering algorithm.
2. Selecting a clustering algorithm: select a suitable clustering algorithm to process the user experience feature map. Common clustering algorithms include K-means, hierarchical clustering, DBSCAN, and the like.
3. Cluster analysis: input the user experience feature map into the clustering algorithm, which divides the feature maps into different clusters according to their similarity. Each cluster represents a class of user experience patterns or behavior patterns.
4. Classification result: obtain the classification result of the user experience feature map from the output of the clustering algorithm. Each cluster may be labeled as a different category, or the center of the cluster may be used as a representative feature.
5. Subsequent analysis and application: carry out subsequent analysis and application according to the classification result. For example, the frequency, duration and other information of each category can be counted for user behavior analysis and improvement.
This embodiment has the advantage that patterns and structures in the user experience feature map can be discovered automatically, without labeling training data in advance. However, a clustering algorithm does not precisely classify feature maps into predefined categories the way a classifier does; it is better suited to exploratory analysis and to discovering new user behavior patterns.
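A brief sketch of this clustering alternative follows, using scikit-learn K-means as one of the algorithms the text names. The normalization, PCA dimensionality reduction and number of clusters are illustrative choices, not values fixed by the patent.

```python
# Cluster analysis of user experience feature maps as an unsupervised alternative.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_user_experience(feature_maps: np.ndarray, n_clusters: int = 5):
    """feature_maps: (num_samples, ...) stack of user experience feature maps."""
    flat = feature_maps.reshape(len(feature_maps), -1)            # flatten each map
    normed = StandardScaler().fit_transform(flat)                 # step 1: preprocessing
    reduced = PCA(n_components=min(32, *normed.shape)).fit_transform(normed)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10)             # steps 2-3: cluster analysis
    labels = kmeans.fit_predict(reduced)                          # step 4: classification result
    return labels, kmeans.cluster_centers_                        # cluster centers as representatives
```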
Further, it should be appreciated that when monitoring interactions between the user to be monitored and the VR game, the user hand feature extraction module based on the target detection network, the depth feature fusion module serving as the user hand feature extractor, the target detection network based on the anchor-free window, the user local part feature extractor based on the spatial attention mechanism, and the classifier need to be trained. That is, the intelligent video target automatic monitoring system based on man-machine interaction according to the embodiment of the present application further includes a training module, where the training module is configured to train the user hand feature extraction module based on the target detection network, the depth feature fusion module serving as the user hand feature extractor, the target detection network based on the anchor-free window, the user local part feature extractor based on the spatial attention mechanism, and the classifier.
Fig. 7 illustrates a schematic block diagram of a training module in an intelligent video object automatic monitoring system based on human-machine interaction according to an embodiment of the present application. As shown in fig. 7, the training module 200 includes: a training data acquisition unit 210, configured to acquire training data, where the training data includes a monitoring video of a user to be monitored; a key frame acquisition unit 220, configured to acquire a plurality of video key frames from the monitoring video of the user to be monitored and arrange the video key frames into a three-dimensional user experience input tensor; a user hand region training unit 230, configured to pass the three-dimensional user experience input tensor through a user hand feature extraction module based on a target detection network to obtain a training user hand region of interest; a user hand feature training unit 240, configured to pass the training user hand region of interest through a depth feature fusion module serving as the user hand feature extractor to obtain a training user hand motion feature map; a user eye region training unit 250, configured to pass the three-dimensional user experience input tensor through a target detection network based on an anchor-free window to obtain a training user eye region of interest; a user eye feature training unit 260, configured to pass the training user eye region of interest through a user local part feature extractor based on a spatial attention mechanism to obtain a training user eye sight line feature map; a user training feature fusion unit 270, configured to fuse the training user hand motion feature map and the training user eye sight feature map to obtain a training user experience feature map; a geometric rigidity consistency factor calculation unit 280, configured to calculate an order-prior-based geometric rigidity consistency factor between the training user hand motion feature map and the training user eye sight feature map; a classification loss unit 290, configured to pass the training user experience feature map through a classifier to obtain a classification loss function value; and a training unit 300, configured to train the user hand feature extraction module based on the target detection network, the depth feature fusion module serving as the user hand feature extractor, the target detection network based on the anchor-free window, the user local part feature extractor based on the spatial attention mechanism, and the classifier with the weighted sum between the classification loss function value and the order-prior-based geometric rigidity consistency factor as the loss function value.
In particular, in the technical solution of the present application, it is considered that since two feature extractors play different roles in the entire system and mutually complement each other's feature extraction capabilities through mutual learning, overall performance is improved, that is, by training a depth feature fusion module as a user hand feature extractor and a user local part feature extractor based on a spatial attention mechanism, it is intended to have them mutually learn in terms of feature extraction to improve the feature extraction capabilities of the resulting user hand motion feature map and user gaze motion feature map. Specifically, the depth feature fusion module, which is a user hand feature extractor, is responsible for extracting features of hand motions from a region of interest of the user's hand. The module may learn deep and shallow features associated with hand movements, such as hand position, finger movement, etc. Through training, the module can gradually optimize the feature extraction process, and more meaningful and effective hand motion features of the user are extracted. On the other hand, a user local part feature extractor based on a spatial attention mechanism is responsible for extracting features of gaze actions from a region of interest of the user's eyes. The feature extractor may learn local site features associated with gaze actions, such as eye position, gaze point, etc. Through training, the feature extractor can gradually optimize the feature extraction process, and more meaningful and effective user sight line action features are extracted. By letting the two feature extractors learn each other, they can be made to improve the feature extraction capabilities in common. For example, a user hand motion profile may provide deep and shallow information of hand motion, while a user gaze motion profile may provide local location information of gaze motion. By fusing the features, the classifier can better judge the action type of the user to be monitored. Therefore, in the technical scheme of the application, the depth feature fusion module serving as the user hand feature extractor and the user local part feature extractor based on the spatial attention mechanism are trained by calculating the rigidity unification factor of the parameterized geometric relation priori features between the training user hand motion feature image and the training user eye sight feature image, so that the feature extraction capability of the obtained user hand motion feature image and the obtained user sight motion feature image can be improved, and the accuracy and the reliability of the system to the type of the user motion to be monitored are improved.
In order to realize feature alignment and migration learning of a depth feature fusion module serving as a user hand feature extractor and a user local part feature extractor based on a spatial attention mechanism, in the technical scheme of the application, a parameterized geometric relation matrix can be constructed through information measurement of internal element sub-dimensions of a feature map by calculating a rigidity unification factor of parameterized geometric relation priori features between a training user hand action feature map and a training user eye sight feature map, and the parameterized geometric relation matrix is used for carrying out rigidity transformation and adjustment on the feature map so as to enable the feature map to be more approximate to a manifold structure of a target feature map. The factors are used as loss functions, and the feature extraction parameters of the depth feature fusion module serving as the user hand feature extractor and the user local part feature extractor based on the spatial attention mechanism can be optimized in a self-adaptive mode in the training process, so that the super convex consistency and the expression capacity of the features are improved, the original information and the semantics of the features can be kept, and the performances of the depth feature fusion module serving as the user hand feature extractor and the user local part feature extractor based on the spatial attention mechanism under different tasks and scenes can be improved.
Specifically, in one embodiment of the present application, the geometric rigidity consistency factor calculation unit 280 is configured to: calculate the rigidity consistency factor of the parameterized geometric-relationship prior feature between the training user hand motion feature map and the training user eye sight feature map according to a formula in which the feature value at the (i, j, k)-th position of the training user hand motion feature map and the feature value at the (i, j, k)-th position of the training user eye sight feature map are evaluated over all positions, W, H and C denote the width, height and channel number of the feature maps, log denotes the base-2 logarithm, λ denotes a predetermined weight, and Loss denotes the rigidity consistency factor of the parameterized geometric-relationship prior feature.
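A hedged sketch of the training step follows: the total loss is the weighted sum of the classification loss and the geometric rigidity consistency factor. The exact base-2 logarithmic formula for that factor is not reproduced in the text above, so the `rigidity_consistency_factor` below is only an assumed stand-in (a λ-weighted mean-squared difference between the two feature maps), not the patent's formula.

```python
# Training-step sketch: classification loss + assumed consistency term.
import torch
import torch.nn.functional as F

def rigidity_consistency_factor(f_hand: torch.Tensor, f_eye: torch.Tensor,
                                lam: float = 0.1) -> torch.Tensor:
    # Placeholder consistency term between the hand and eye feature maps;
    # the patent's logarithmic formulation over all (i, j, k) positions is
    # not reproduced here.
    return lam * torch.mean((f_hand - f_eye) ** 2)

def training_step(f_hand, f_eye, fused_logits, labels, optimizer, lam: float = 0.1):
    cls_loss = F.cross_entropy(fused_logits, labels)            # classification loss value
    factor = rigidity_consistency_factor(f_hand, f_eye, lam)     # geometric rigidity factor
    loss = cls_loss + factor                                     # weighted sum as loss function value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```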
Further, fig. 8 illustrates a schematic diagram of an intelligent video object automatic monitoring system based on man-machine interaction according to an embodiment of the present application. As shown in fig. 8, firstly, a monitoring video of a user to be monitored is collected, and a plurality of video key frames are obtained from the monitoring video of the user to be monitored; then, arranging the video key frames into three-dimensional user experience input tensors, and enabling the three-dimensional user experience input tensors to pass through a user hand feature extraction module based on a target detection network so as to obtain a user hand region of interest; then, the region of interest of the hand of the user passes through a depth feature fusion module serving as a hand feature extractor of the user so as to obtain a hand action feature diagram of the user; further, the three-dimensional user experience input tensor passes through a target detection network based on an anchor-free window to obtain a user eye region of interest; still further, the user eye region of interest is passed through a user local part feature extractor based on a spatial attention mechanism to obtain a user eye gaze feature map; then, fusing the hand action feature image of the user and the eye sight feature image of the user to obtain a user experience feature image; and finally, the user experience feature map passes through a classifier to obtain a classification result, wherein the classification result is used for representing the action type of the user to be monitored.
In summary, the intelligent video target automatic monitoring system 100 based on man-machine interaction according to the embodiment of the application is illustrated, firstly, a monitoring video of a user to be monitored is collected, then, a user hand region of interest and a user eye region of interest are obtained from the monitoring video of the user to be monitored through a user hand feature extraction module based on a target detection network and a target detection network based on an anchor-free window, respectively, then, feature extraction is performed on the user hand region of interest through a depth feature fusion module serving as a user hand feature extractor, feature extraction is performed on the user eye region of interest through a user local part feature extractor based on a spatial attention mechanism, and finally, classification results are obtained through a classifier after feature information of the user hand region of interest and the user eye region of interest are fused. In this way, based on the classification result, the user to be monitored can better experience the VR game.
As described above, the intelligent video object automatic monitoring system 100 based on man-machine interaction according to the embodiment of the present application may be implemented in various terminal devices, for example, a server deployed with an intelligent video object automatic monitoring algorithm based on man-machine interaction. In one example, the human-machine interaction-based intelligent video object automatic monitoring system 100 may be integrated into a terminal device as a software module and/or a hardware module. For example, the human-machine interaction-based intelligent video object automatic monitoring system 100 may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the intelligent video object automatic monitoring system 100 based on man-machine interaction may also be one of a plurality of hardware modules of the terminal device.
Alternatively, in another example, the human-machine interaction-based intelligent video object automatic monitoring system 100 and the terminal device may be separate devices, and the human-machine interaction-based intelligent video object automatic monitoring system 100 may be connected to the terminal device through a wired and/or wireless network and transmit interaction information in a contracted data format.
Exemplary method
Fig. 9 illustrates a flowchart of an intelligent video object automatic monitoring method based on human-computer interaction according to an embodiment of the application. As shown in fig. 9, an intelligent video target automatic monitoring method based on man-machine interaction according to an embodiment of the present application includes: s110, collecting a monitoring video of a user to be monitored; s120, obtaining a user hand region of interest and a user eye region of interest based on the hand motion and the eye motion of the user to be detected in the monitoring video of the user to be monitored; s130, respectively acquiring characteristic information of the user hand region of interest and characteristic information of the user eye region of interest to obtain a user hand action characteristic diagram and a user eye sight line characteristic diagram; and S140, obtaining a classification result based on the hand motion feature map of the user and the eye sight feature map of the user.
Here, it will be understood by those skilled in the art that the specific functions and operations of the respective steps in the above-described human-computer interaction-based intelligent video object automatic monitoring method have been described in detail in the above description with reference to the human-computer interaction-based intelligent video object automatic monitoring system of fig. 1, and thus, repetitive descriptions thereof will be omitted.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The above-described integrated units may be implemented in hardware or in a combination of hardware and firmware, as will be apparent to those skilled in the art. When implemented in software, the functions described above may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of example and not limitation, a computer-readable medium may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
In summary, the foregoing description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An intelligent video target automatic monitoring system based on man-machine interaction, which is characterized by comprising:
the video acquisition module is used for acquiring a monitoring video of a user to be monitored;
the target area acquisition module is used for acquiring a user hand region of interest and a user eye region of interest based respectively on the hand motion and the eye motion of the user to be monitored in the monitoring video of the user to be monitored;
the target feature extraction module is used for respectively acquiring feature information of the user hand region of interest and feature information of the user eye region of interest to obtain a user hand action feature map and a user eye sight feature map;
and the user action determining module is used for obtaining a classification result based on the user hand action feature map and the user eye sight feature map.
2. The human-computer interaction-based intelligent video target automatic monitoring system according to claim 1, wherein the target area acquisition module comprises:
The video preprocessing unit is used for preprocessing the monitoring video of the user to be monitored to obtain a three-dimensional user experience input tensor;
and the target region acquisition unit is used for acquiring the user hand region of interest and the user eye region of interest based on the three-dimensional user experience input tensor.
3. The human-computer interaction-based intelligent video target automatic monitoring system according to claim 2, wherein the video preprocessing unit comprises:
a key frame acquisition subunit, configured to acquire a plurality of video key frames from the monitoring video of the user to be monitored;
and the key frame arrangement subunit is used for arranging the video key frames into the three-dimensional user experience input tensor according to the time dimension.
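Purely as an illustrative sketch (not the patent's own implementation), the key-frame arrangement of claim 3 might look roughly as follows in Python; the uniform sampling strategy, frame count, resolution, and OpenCV-based decoding are all assumptions introduced here for illustration:

```python
import cv2
import numpy as np

def build_user_experience_tensor(video_path: str, num_keyframes: int = 16,
                                 size=(224, 224)) -> np.ndarray:
    """Sample key frames from a monitoring video and stack them along the
    time dimension into a three-dimensional (T, H, W, C) input tensor."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Uniform temporal sampling stands in for true key-frame selection.
    indices = np.linspace(0, max(total - 1, 0), num_keyframes).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size))
    cap.release()
    # Arrange the collected key frames along the time dimension.
    return np.stack(frames, axis=0)
```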
4. The human-computer interaction-based intelligent video target automatic monitoring system according to claim 3, wherein the target area acquisition unit comprises:
the user hand feature acquisition subunit is used for extracting target area features of the three-dimensional user experience input tensor by using a user hand feature extraction module based on a target detection network so as to obtain the user hand region of interest;
And the user eye feature acquisition subunit is used for extracting target area features of the three-dimensional user experience input tensor by using a target detection network based on an anchor-free window so as to obtain the user eye region of interest.
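As a rough sketch of the two-detector arrangement in claim 4, the following uses torchvision's Faster R-CNN (anchor-based) and FCOS (anchor-free) models purely as stand-ins; the specific model choices, class count, score threshold, and cropping logic are assumptions, not details taken from the patent:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn, fcos_resnet50_fpn

# Anchor-based detector standing in for the user hand feature extraction module.
hand_detector = fasterrcnn_resnet50_fpn(weights=None, num_classes=2).eval()
# Anchor-free detector standing in for the eye region-of-interest network.
eye_detector = fcos_resnet50_fpn(weights=None, num_classes=2).eval()

@torch.no_grad()
def extract_rois(frames: torch.Tensor, detector, score_thresh: float = 0.5):
    """frames: (T, 3, H, W) float tensor in [0, 1]; returns one cropped ROI per frame (or None)."""
    rois = []
    for frame in frames:
        det = detector([frame])[0]               # dict with 'boxes', 'scores', 'labels'
        keep = det["scores"] >= score_thresh
        if keep.any():
            x1, y1, x2, y2 = det["boxes"][keep][0].round().int().tolist()
            rois.append(frame[:, y1:y2, x1:x2])  # crop the top-scoring region
        else:
            rois.append(None)
    return rois
```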
5. The human-computer interaction-based intelligent video target automatic monitoring system of claim 4, wherein the target detection network of the user hand feature extraction module is an anchor window-based target detection network.
6. The human-computer interaction-based intelligent video target automatic monitoring system according to claim 5, wherein the target feature extraction module comprises:
the user hand feature extraction unit is used for enabling the user hand region of interest to pass through a depth feature fusion module serving as a user hand feature extractor to obtain the user hand action feature map;
and the user eye feature extraction unit is used for enabling the user eye region of interest to pass through a user local part feature extractor based on a spatial attention mechanism so as to obtain the user eye sight feature map.
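One plausible reading of the spatial attention mechanism named in claim 6 is a CBAM-style spatial attention block; this is an assumption made only for illustration, since the claim does not specify the exact mechanism:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: channel-wise average and maximum maps are
    concatenated, convolved, and squashed into a per-position weight."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = x.mean(dim=1, keepdim=True)   # channel-wise average
        max_map = x.amax(dim=1, keepdim=True)   # channel-wise maximum
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                          # re-weight each spatial position
```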
7. The human-computer interaction-based intelligent video target automatic monitoring system according to claim 6, wherein the user action determining module comprises:
the user action feature fusion unit is used for carrying out feature fusion on the user hand action feature map and the user eye sight feature map so as to obtain a user experience feature map;
and the user action type determining unit is used for carrying out feature classification on the user experience feature map through a classifier to obtain the classification result.
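A minimal sketch of the fusion-then-classification step in claim 7, assuming channel-wise concatenation of two feature maps with matching spatial size; the channel counts and number of action classes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenates the hand action and eye sight feature maps along the channel
    axis and classifies the fused user experience feature map."""
    def __init__(self, hand_channels: int = 256, eye_channels: int = 256, num_classes: int = 10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(hand_channels + eye_channels, num_classes)

    def forward(self, hand_feat: torch.Tensor, eye_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([hand_feat, eye_feat], dim=1)   # user experience feature map
        pooled = self.pool(fused).flatten(1)
        return self.fc(pooled)                            # classification logits
```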
8. The human-computer interaction-based intelligent video target automatic monitoring system according to claim 7, further comprising a training module for training the target detection network-based user hand feature extraction module, the depth feature fusion module serving as the user hand feature extractor, the anchor-free window-based target detection network, the spatial attention mechanism-based user local part feature extractor, and the classifier;
wherein, training module includes:
the training data acquisition unit is used for acquiring training data, wherein the training data comprises a monitoring video of a user to be monitored;
the key frame acquisition unit is used for acquiring a plurality of video key frames from the monitoring video of the user to be monitored and then arranging the video key frames into a three-dimensional user experience input tensor;
The user hand region training unit is used for enabling the three-dimensional user experience input tensor to pass through a user hand feature extraction module based on a target detection network so as to obtain a training user hand region of interest;
the user hand feature training unit is used for enabling the training user hand region of interest to pass through a depth feature fusion module serving as a user hand feature extractor to obtain a training user hand action feature map;
the user eye region training unit is used for enabling the three-dimensional user experience input tensor to pass through a target detection network based on an anchor-free window so as to obtain a training user eye region of interest;
the user eye feature training unit is used for enabling the training user eye region of interest to pass through a user local part feature extractor based on a spatial attention mechanism so as to obtain a training user eye sight feature map;
the user training feature fusion unit is used for fusing the training user hand action feature map and the training user eye sight feature map to obtain a training user experience feature map;
the geometric rigidity consistency factor calculation unit is used for calculating a geometric rigidity consistency factor based on an order prior between the training user hand action feature map and the training user eye sight feature map;
the classification loss unit is used for enabling the training user experience feature map to pass through a classifier to obtain a classification loss function value;
the training unit is used for training the user hand feature extraction module based on the target detection network, the depth feature fusion module serving as the user hand feature extractor, the target detection network based on the anchor-free window, the user local part feature extractor based on the spatial attention mechanism, and the classifier, with the weighted sum of the classification loss function value and the geometric rigidity consistency factor based on the order prior as the loss function value.
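The training step of claim 8 combines a classification loss with the order-prior consistency term as a weighted sum. A minimal sketch follows; because the patent gives the consistency formula only as an image, the geometric_rigidity_consistency function below is a hypothetical placeholder, and the weight value is likewise assumed:

```python
import torch
import torch.nn.functional as F

def geometric_rigidity_consistency(hand_feat: torch.Tensor, eye_feat: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for the order-prior geometric rigidity consistency
    factor of claim 9; a log-domain discrepancy is used purely for illustration."""
    return ((torch.log2(1 + hand_feat.clamp(min=0)) -
             torch.log2(1 + eye_feat.clamp(min=0))) ** 2).mean()

def training_step(hand_feat, eye_feat, logits, labels, weight: float = 0.1) -> torch.Tensor:
    cls_loss = F.cross_entropy(logits, labels)               # classification loss function value
    consistency = geometric_rigidity_consistency(hand_feat, eye_feat)
    return cls_loss + weight * consistency                   # weighted sum used as the training loss
```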
9. The human-computer interaction-based intelligent video target automatic monitoring system according to claim 8, wherein the geometric rigidity consistency factor calculation unit is configured to: calculate the rigidity consistency factor of the parameterized geometric relationship prior feature between the training user hand action feature map and the training user eye sight feature map according to the following formula;
wherein the formula itself is given as an image in the original publication; in that formula, the two operands are the feature value at the (i, j, k)-th position of the training user hand action feature map and the feature value at the (i, j, k)-th position of the training user eye sight feature map, W, H and C denote the width, height and number of channels of the feature maps, log denotes the base-2 logarithm, λ denotes a predetermined weight, and Loss denotes the rigidity consistency factor of the parameterized geometric relationship prior feature.
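Since the formula image is not reproduced in the text, the following LaTeX expression is offered only as one hedged possibility consistent with the stated symbols (a base-2 logarithm, a predetermined weight λ, and a sum over a W × H × C feature map); the symbols F^h and F^e for the two feature maps are introduced here and do not appear in the original:

```latex
\mathrm{Loss} = \lambda \cdot \frac{1}{W \times H \times C}
\sum_{i=1}^{W} \sum_{j=1}^{H} \sum_{k=1}^{C}
\left( \log_{2}\!\left(1 + F^{h}_{i,j,k}\right) - \log_{2}\!\left(1 + F^{e}_{i,j,k}\right) \right)^{2}
```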
10. An intelligent video target automatic monitoring method based on man-machine interaction is characterized by comprising the following steps:
collecting a monitoring video of a user to be monitored;
obtaining a user hand region of interest and a user eye region of interest based respectively on the hand motion and the eye motion of the user to be monitored in the monitoring video of the user to be monitored;
acquiring feature information of the user hand region of interest and feature information of the user eye region of interest respectively to obtain a user hand action feature map and a user eye sight feature map;
and obtaining a classification result based on the user hand action feature map and the user eye sight feature map.
CN202311758400.4A 2023-12-19 2023-12-19 Intelligent video target automatic monitoring system and method based on man-machine interaction Pending CN117724612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311758400.4A CN117724612A (en) 2023-12-19 2023-12-19 Intelligent video target automatic monitoring system and method based on man-machine interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311758400.4A CN117724612A (en) 2023-12-19 2023-12-19 Intelligent video target automatic monitoring system and method based on man-machine interaction

Publications (1)

Publication Number Publication Date
CN117724612A true CN117724612A (en) 2024-03-19

Family

ID=90204969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311758400.4A Pending CN117724612A (en) 2023-12-19 2023-12-19 Intelligent video target automatic monitoring system and method based on man-machine interaction

Country Status (1)

Country Link
CN (1) CN117724612A (en)

Similar Documents

Publication Publication Date Title
Boulahia et al. Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition
Jegham et al. Vision-based human action recognition: An overview and real world challenges
US9690982B2 (en) Identifying gestures or movements using a feature matrix that was compressed/collapsed using principal joint variable analysis and thresholds
Hu et al. Deep 360 pilot: Learning a deep agent for piloting through 360 sports videos
Molchanov et al. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network
US20160328604A1 (en) Systems and methods of monitoring activities at a gaming venue
Martin et al. Scangan360: A generative model of realistic scanpaths for 360 images
Do et al. Deep neural network-based fusion model for emotion recognition using visual data
CN106028134A (en) Detect sports video highlights for mobile computing devices
JP2008538623A (en) Method and system for detecting and classifying events during motor activity
US20210233288A1 (en) Augmented reality map curation
Gupta et al. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural networks
CN112132197A (en) Model training method, image processing method, device, computer equipment and storage medium
Mania et al. A framework for self-training perceptual agents in simulated photorealistic environments
CN112827168B (en) Target tracking method, device and storage medium
US11372518B2 (en) Systems and methods for augmented or mixed reality writing
Linqin et al. Dynamic hand gesture recognition using RGB-D data for natural human-computer interaction
CN116188392A (en) Image processing method, computer-readable storage medium, and computer terminal
Ding et al. A systematic survey of data mining and big data in human behavior analysis: Current datasets and models
CN117724612A (en) Intelligent video target automatic monitoring system and method based on man-machine interaction
Guo et al. FT-HID: a large-scale RGB-D dataset for first-and third-person human interaction analysis
Ladjailia et al. Encoding human motion for automated activity recognition in surveillance applications
CN112915539A (en) Virtual object detection method and device and readable storage medium
Lee et al. A Long‐Range Touch Interface for Interaction with Smart TVs
WO2007066953A1 (en) Apparatus for recognizing three-dimensional motion using linear discriminant analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination