CN116719419B - Intelligent interaction method and system for meta universe - Google Patents

Intelligent interaction method and system for meta universe

Info

Publication number
CN116719419B
Authority
CN
China
Prior art keywords
gesture
foreground
feature
interaction
background
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310995469.2A
Other languages
Chinese (zh)
Other versions
CN116719419A (en)
Inventor
张青辉
王英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4u Beijing Technology Co ltd
Original Assignee
4u Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4u Beijing Technology Co ltd
Priority to CN202310995469.2A
Publication of CN116719419A
Application granted
Publication of CN116719419B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Social Psychology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

An intelligent interaction method and system for the meta-universe are disclosed. First, a user gesture interaction video captured by a camera is acquired; the video is then sampled and features are extracted from it to obtain a gesture operation type semantic understanding feature vector; finally, the user manipulation action intention corresponding to the user gesture interaction video is determined based on that feature vector. In this way, information such as the spatial layout and object positions contained in the background can be used to assist in recognizing and understanding the user's manipulation action intention, providing a more accurate and adaptable interaction experience.

Description

Intelligent interaction method and system for meta universe
Technical Field
The present disclosure relates to the field of intelligent interaction, and more particularly, to an intelligent interaction method of a meta-universe and a system thereof.
Background
The meta universe is a virtual digital world that allows users to perform immersive interactions and experiences through different devices and platforms. With the development and popularization of the metauniverse, users increasingly demand intelligent interaction with virtual environments.
Gesture recognition and tracking technology, as a natural and intuitive interaction method, is widely used in metauniverse systems. However, real-world scenes are often complex and changeable, and background information may convey user-specific intention, yet it is usually ignored outright during gesture recognition. As a result, gesture recognition and tracking algorithms have difficulty accurately understanding the user's intention, which degrades the user's interactive experience. An optimized meta-universe interaction scheme is therefore desired.
Disclosure of Invention
In view of this, the disclosure proposes a meta-universe intelligent interaction method and system thereof, which can utilize information such as spatial layout and object position in background information to assist in identifying and understanding the intention of the manipulation actions of the user, so as to provide a more accurate and adaptable interaction experience.
According to an aspect of the present disclosure, there is provided an intelligent interaction method of a meta universe, including: acquiring a user gesture interaction video acquired by a camera; sampling and feature extraction are carried out on the user gesture interaction video to obtain gesture operation type semantic understanding feature vectors; and determining the user operation and control action intention corresponding to the user gesture interaction video based on the gesture operation type semantic understanding feature vector.
According to another aspect of the present disclosure, there is provided an intelligent interactive system of a meta-universe, comprising: the video acquisition module is used for acquiring user gesture interaction videos acquired by the camera; the sampling and feature extraction module is used for sampling and feature extraction of the user gesture interaction video to obtain gesture operation type semantic understanding feature vectors; and the control action intention judging module is used for determining the user control action intention corresponding to the user gesture interaction video based on the gesture operation type semantic understanding feature vector.
According to the embodiment of the disclosure, firstly, a user gesture interaction video acquired by a camera is acquired, then, the user gesture interaction video is sampled and extracted with features to obtain gesture operation type semantic understanding feature vectors, and then, user control action intentions corresponding to the user gesture interaction video are determined based on the gesture operation type semantic understanding feature vectors. In this way, the information such as the spatial layout and the object position in the background information can be utilized to assist in identifying and understanding the intention of the manipulation action of the user, so that a more accurate and adaptable interaction experience is provided.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 illustrates a flow chart of a method of intelligent interaction of a metauniverse in accordance with an embodiment of the present disclosure.
Fig. 2 shows an architectural diagram of a smart interaction method of the metauniverse in accordance with an embodiment of the present disclosure.
Fig. 3 shows a flowchart of sub-step S120 of the intelligent interaction method of the meta-universe, according to an embodiment of the disclosure.
Fig. 4 shows a flowchart of sub-step S122 of the intelligent interaction method of the meta-universe, according to an embodiment of the disclosure.
Fig. 5 shows a flowchart of sub-step S1221 of the intelligent interaction method of the metauniverse according to an embodiment of the disclosure.
Fig. 6 shows a flowchart of sub-step S123 of the intelligent interaction method of the meta-universe, according to an embodiment of the disclosure.
Fig. 7 shows a flowchart of sub-step S1231 of the intelligent interaction method of the meta-universe, according to an embodiment of the disclosure.
FIG. 8 illustrates a block diagram of an intelligent interaction system of the metauniverse in accordance with an embodiment of the disclosure.
Fig. 9 illustrates an application scenario diagram of a smart interaction method of a metauniverse according to an embodiment of the disclosure.
Detailed Description
The following description of the embodiments of the present disclosure will be made clearly and fully with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some, but not all embodiments of the disclosure. All other embodiments, which can be made by one of ordinary skill in the art without undue burden based on the embodiments of the present disclosure, are also within the scope of the present disclosure.
As used in this disclosure and in the claims, the terms "a," "an," and "the" do not denote the singular and may include the plural unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; they do not constitute an exclusive list, and a method or apparatus may also include other steps or elements.
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
Aiming at the technical problems, the technical conception of the present disclosure is as follows: by using a camera and a deep learning algorithm, the gesture characteristics of the user are captured, the dependency relationship between the foreground and the background is analyzed to read and judge the operation intention of the user, and a proper response is provided for the object in the meta universe.
It should be appreciated that when a user attempts to select, manipulate, and control objects in the metauniverse through gestures, the environment in which the user is located tends to limit and affect the gesture actions. For example, in a meta-universe simulating a real environment, a user may need to mimic a real motion pattern by gestures, such as swinging an arm, to perform a running motion. However, the user's surroundings may limit the speed and force with which the arm can be swung, so that the same manipulation intent produces different gesture actions. In particular, if the user is at home in a living room, the space may be relatively large and private, so the user gestures without concern and swings the arm with relatively high speed and force. In a crowded shopping mall, by contrast, the user may have to adapt the gestures to a narrower space and swing the arm with lower speed and force to avoid colliding with surrounding objects or people. Thus, the technical concept of the present disclosure is to use information such as the spatial layout and object positions in the background to assist in recognizing and understanding the user's manipulation action intention, thereby providing a more accurate and adaptable interactive experience.
Based on this, FIG. 1 shows a flow chart of a smart interaction method of the metauniverse according to an embodiment of the disclosure. Fig. 2 shows an architectural diagram of a smart interaction method of the metauniverse in accordance with an embodiment of the present disclosure. As shown in fig. 1 and 2, the intelligent interaction method of the meta universe according to an embodiment of the disclosure includes the steps of: s110, acquiring a user gesture interaction video acquired by a camera; s120, sampling and feature extraction are carried out on the user gesture interaction video to obtain gesture operation type semantic understanding feature vectors; and S130, determining the user operation action intention corresponding to the user gesture interaction video based on the gesture operation type semantic understanding feature vector.
Accordingly, in the technical scheme of the disclosure, first, a user gesture interaction video acquired by a camera is acquired, and the user gesture interaction video is sparsely sampled to obtain a plurality of user gesture interaction image frames.
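As a concrete illustration of the sparse-sampling step, the following is a minimal Python sketch assuming OpenCV is available; the frame count of 16 and the evenly spaced sampling strategy are illustrative choices, not values specified by the disclosure.

```python
import cv2
import numpy as np

def sparse_sample(video_path: str, num_frames: int = 16) -> list:
    """Evenly sample `num_frames` frames from a user gesture interaction video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```

Evenly spaced indices keep the temporal coverage of the gesture while reducing the number of frames that the downstream feature extractors must process.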
And then, carrying out image feature extraction and foreground-background feature blending on the plurality of user gesture interaction image frames to obtain a plurality of foreground-background interaction perception gesture feature vectors. In a specific example of the present disclosure, the encoding process of extracting image features and blending foreground-background features on the plurality of user gesture interaction image frames to obtain a plurality of foreground-background interaction perception gesture feature vectors includes: firstly, respectively passing the plurality of user gesture interaction image frames through a region divider based on a target detection network to respectively obtain a plurality of user gesture foreground images; then, masking the plurality of user gesture interaction image frames based on the plurality of user gesture foreground images to obtain a plurality of user gesture background images; respectively passing the plurality of user gesture foreground images through a foreground feature capturer based on a first convolution neural network model to obtain a plurality of user gesture foreground feature vectors; simultaneously, the user gesture background images are respectively passed through a background feature capturer based on a second convolution neural network model to obtain a plurality of user gesture background feature vectors; and then, performing feature level interaction on the corresponding foreground feature vectors of the user gestures and the background feature vectors of the user gestures by using a cascading function to obtain a plurality of foreground-background interaction perception gesture feature vectors.
The cascade function is used for carrying out feature level interaction on the user gesture foreground feature vector and the user gesture background feature vector, so that a network has certain logic reasoning capability to better mine association information between vectors.
Further, global context semantic information among the plurality of foreground-background interaction-aware gesture feature vectors is extracted to obtain the gesture operation type semantic understanding feature vector. That is, each foreground-background interaction-aware gesture feature vector characterizes the hidden feature information of one corresponding user gesture interaction image frame, and correlations exist among these feature vectors. Thus, in the technical solution of the present disclosure, it is desirable to mine and model the global context semantic information among the plurality of foreground-background interaction-aware gesture feature vectors.
In a specific example of the present disclosure, the process of extracting global context semantic information among the plurality of foreground-background interaction-aware gesture feature vectors to obtain the gesture operation type semantic understanding feature vector includes: passing the plurality of foreground-background interaction-aware gesture feature vectors through a transformer-based gesture semantic understanding device to obtain the gesture operation type semantic understanding feature vector.
Accordingly, as shown in fig. 3, sampling and feature extraction are performed on the user gesture interaction video to obtain a gesture operation type semantic understanding feature vector, which includes: s121, sparsely sampling the user gesture interaction video to obtain a plurality of user gesture interaction image frames; s122, extracting image features and blending foreground-background features on the plurality of user gesture interaction image frames to obtain a plurality of foreground-background interaction perception gesture feature vectors; and S123, extracting global context semantic information among the foreground-background interaction perception gesture feature vectors to obtain the gesture operation type semantic understanding feature vector. It should be understood that in step S121, the purpose of sparsely sampling the user gesture interactive video to obtain a plurality of user gesture interactive image frames is to extract a plurality of image frames from the original user gesture interactive video. Since video is typically composed of sequential image frames, sparse sampling may selectively extract some image frames to reduce data volume and computational complexity; in step S122, image feature extraction is performed on each user gesture interaction image frame, where the image features may include color histogram, texture features, edge features, and the like, and in addition, foreground-background feature blending may be performed, and features of the foreground (user gesture) and the background (environmental information) are fused to obtain a more comprehensive gesture feature vector; in step S123, global context semantic information is extracted from the plurality of foreground-background interaction-aware gesture feature vectors, where the global context semantic information may include a time sequence relationship, a spatial position relationship, a semantic association between gestures, and the like, and by extracting the global context semantic information, the type and the intention of the gesture operation may be better understood, so as to obtain a semantic understanding feature vector of the gesture operation type. In other words, the three steps are respectively used for extracting the image frame of the user gesture interactive video, extracting the foreground-background interactive perception gesture feature vector and extracting the global context semantic information, and finally obtaining the semantic understanding feature vector of the gesture operation type, and the feature vector can be used in the application fields of gesture recognition, gesture control and the like.
More specifically, in step S122, as shown in fig. 4, image feature extraction and foreground-background feature blending are performed on the plurality of user gesture interaction image frames to obtain a plurality of foreground-background interaction perception gesture feature vectors, including: s1221, dividing the areas of the user gesture interaction image frames to obtain a plurality of user gesture foreground images and a plurality of user gesture background images; s1222, respectively passing the plurality of user gesture foreground images through a foreground feature capturer based on a first convolution neural network model to obtain a plurality of user gesture foreground feature vectors; s1223, enabling the user gesture background images to respectively pass through a background feature capturer based on a second convolution neural network model to obtain a plurality of user gesture background feature vectors; and S1224, performing feature level interaction on the plurality of user gesture foreground feature vectors and the plurality of user gesture background feature vectors to obtain the plurality of foreground-background interaction perception gesture feature vectors.
It is worth mentioning that convolutional neural networks (Convolutional Neural Network, CNN) are a deep learning model, dedicated to processing data with a grid structure, such as images and videos. The core idea of convolutional neural networks is to build the network through convolutional layers, pooling layers and fully connected layers. The convolution layer (Convolutional Layer) is a core layer of the convolution neural network, performs feature extraction on input data through convolution operation, and performs sliding window convolution operation on the input data by using a set of learnable convolution kernels (or filters), so that local features are extracted, and the convolution layer can capture spatial structure information of the input data. A Pooling Layer (Pooling Layer) is used to reduce the spatial dimensions of the feature map, reduce the number of parameters, and extract more robust features, and common Pooling operations include maximum Pooling and average Pooling, which respectively select the maximum or average value in the local area as the pooled value. The fully connected layer (Fully Connected Layer) connects the outputs of the previous convolutional layer and the pooling layer and maps them to the final output class, each neuron in the fully connected layer being connected to all neurons of the previous layer, and the combination and classification of features is done by learning weights. In addition to these major components, convolutional neural networks may also include activation functions, batch normalization, dropout, and the like techniques to enhance the expressive and generalizing capabilities of the model. By stacking multiple convolutional and pooling layers, convolutional neural networks can progressively extract increasingly abstract features, from low-level features (such as edges and textures) to high-level features (such as shapes and objects).
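As a rough illustration of what the foreground and background feature capturers could look like, the sketch below stacks convolution, batch normalization, pooling, and a fully connected projection in PyTorch; the channel sizes, layer count, and 128-dimensional output are assumptions made for illustration only, since the disclosure does not fix a specific architecture.

```python
import torch
import torch.nn as nn

class FeatureCapturer(nn.Module):
    """Small CNN that maps an RGB image to a fixed-length feature vector."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                      # halve spatial resolution
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),                      # halve again
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),              # global average pooling
        )
        self.fc = nn.Linear(128, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) -> (B, feat_dim)
        return self.fc(self.backbone(x).flatten(1))

# Two independent capturers, as described: one for the foreground, one for the background.
foreground_capturer = FeatureCapturer()
background_capturer = FeatureCapturer()
```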
Further, the first convolutional neural network model and the second convolutional neural network model are two independent models for respectively extracting features of a foreground image and a background image of a user gesture. These models are typically trained on large-scale image data to efficiently extract semantic features of the image. In gesture recognition and understanding, they are used to capture local features of gestures in order to better distinguish between foreground and background and extract relevant information about the gesture.
More specifically, in step S1221, as shown in fig. 5, the area division is performed on the plurality of user gesture interaction image frames to obtain a plurality of user gesture foreground images and a plurality of user gesture background images, including: s12211, respectively passing the plurality of user gesture interaction image frames through a region divider based on a target detection network to respectively obtain a plurality of user gesture foreground images; and S12212, masking the plurality of user gesture interaction image frames based on the plurality of user gesture foreground images to obtain the plurality of user gesture background images.
It is worth mentioning that the object detection network (Object Detection Network) is a deep learning model for detecting and locating the position and class of a plurality of objects in an image or video. It can identify the different objects in the image and provide each object with a bounding box and a corresponding class label. The main purpose of the object detection network is to find objects of interest in images and to determine their location and class. The detection network is generally composed of two main components: 1. regional proposal network (Region Proposal Network, RPN): the RPN is responsible for generating candidate target areas, namely a boundary box which can contain targets, and the RPN proposes candidate areas on different positions and scales in a sliding window or anchor box mode; 2. target classification and bounding box regression network: the network receives the candidate areas provided by the RPN, classifies each candidate area, and carries out bounding box regression, wherein the classification part is responsible for determining the object category in each candidate area, and the bounding box regression part is used for accurately positioning the bounding box of the target. In the case of multi-user gesture interaction images, the target detection network may be configured to divide the user gesture interaction image frames and extract the region where the gesture is located as the user gesture foreground image. By identifying and locating gesture targets, the target detection network may help accomplish the task of gesture identification and understanding.
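A hedged sketch of the region divider: here a pretrained torchvision Faster R-CNN stands in for the target detection network, and the highest-scoring detection box is treated as the user gesture foreground region. The choice of detector, the pretrained weights, and the top-score heuristic are all illustrative assumptions; the disclosure only requires a region divider based on a target detection network.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Stand-in region divider: a pretrained detector (illustrative choice).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def divide_foreground(frame_rgb):
    """Return the bounding box of the highest-scoring detection as the gesture foreground region."""
    img = to_tensor(frame_rgb)              # (3, H, W), float in [0, 1]
    det = detector([img])[0]
    if len(det["boxes"]) == 0:
        return None
    best = det["scores"].argmax()
    x1, y1, x2, y2 = det["boxes"][best].int().tolist()
    return (x1, y1, x2, y2)
```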
It is worth mentioning that masking is an image processing technique for hiding or occluding certain areas of an image so that subsequent processing focuses only on the region of interest. In the case of multi-user gesture interaction images, masking can be used to obtain the foreground and background images of a user gesture. Specifically, the region divider based on the target detection network divides each user gesture interaction image frame to extract the region containing the user gesture as the user gesture foreground image. The masking process then masks the gesture regions out of the interaction image frames, so that the remaining content forms the user gesture background image. The main purpose of masking is to isolate the region of interest for subsequent processing and analysis. In a multi-user gesture interaction scene, extracting the foreground and background images of the user gesture allows the gesture to be further analyzed, recognized, or tracked: the foreground image can be fed into a gesture recognition model to recognize different gesture actions, while the background image can be used for background modeling, background segmentation, or scene understanding. By occluding parts of the image to expose the region of interest, masking separates the foreground and background of a user gesture and provides more accurate and efficient data for subsequent gesture recognition and analysis.
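Following the masking description above, a minimal sketch: the detected gesture box is cropped out as the foreground image, and the same region is zeroed out of the frame to leave the background image. Using a rectangular box mask (rather than a pixel-accurate mask) is an assumption made for simplicity.

```python
import numpy as np

def split_foreground_background(frame_rgb: np.ndarray, box):
    """Crop the gesture foreground and mask it out of the frame to get the background image."""
    x1, y1, x2, y2 = box
    foreground = frame_rgb[y1:y2, x1:x2].copy()
    background = frame_rgb.copy()
    background[y1:y2, x1:x2] = 0   # mask (zero out) the gesture region
    return foreground, background
```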
More specifically, in step S1224, performing feature level interaction on the plurality of user gesture foreground feature vectors and the plurality of user gesture background feature vectors to obtain the plurality of foreground-background interaction-aware gesture feature vectors includes: performing feature level interaction on each corresponding pair of user gesture foreground feature vector and user gesture background feature vector by using a cascading function to obtain the plurality of foreground-background interaction-aware gesture feature vectors. It should be appreciated that the cascading function (Concatenation Function) is a feature-level interaction method that joins two feature vectors into one larger feature vector. When the user gesture foreground feature vectors and the user gesture background feature vectors interact at the feature level, the cascading function connects each foreground feature vector with its corresponding background feature vector in sequence to obtain a foreground-background interaction-aware gesture feature vector. A cascading function is typically implemented by a concatenation operation that links the two feature vectors along the feature dimension: if the user gesture foreground feature vector has dimension n1 and the user gesture background feature vector has dimension n2, the cascading function produces a foreground-background interaction-aware gesture feature vector of dimension n1+n2. The purpose of the cascading function is to fuse the foreground and background features and combine their information; by connecting them, the interaction-aware gesture feature vector contains both foreground and background information and therefore describes the gesture interaction more completely. The choice of cascading function depends on the specific application scenario and task requirements: besides simple vector concatenation, more complex functions such as element-wise addition or multiplication with weighted fusion can be used to control how the features are combined. In short, the cascading function splices the user gesture foreground feature vectors and the user gesture background feature vectors into foreground-background interaction-aware gesture feature vectors, fusing foreground and background information and providing a richer feature representation for subsequent gesture analysis and recognition tasks.
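A minimal sketch of the cascading-function interaction: each foreground feature vector is concatenated with its corresponding background feature vector along the feature dimension, giving an n1+n2-dimensional foreground-background interaction-aware gesture feature vector. The optional linear-plus-activation mixing layer stands in for the point convolution and activation mentioned later in the description, and the 128-dimensional inputs are illustrative.

```python
import torch
import torch.nn as nn

class CascadeInteraction(nn.Module):
    """Concatenate foreground and background vectors, then optionally mix them."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Linear(2 * feat_dim, 2 * feat_dim),  # stand-in for the pointwise convolution
            nn.ReLU(),
        )

    def forward(self, fg: torch.Tensor, bg: torch.Tensor) -> torch.Tensor:
        # fg, bg: (B, feat_dim) -> (B, 2 * feat_dim)
        return self.mix(torch.cat([fg, bg], dim=-1))

# Example: fuse one pair of per-frame feature vectors.
fg = torch.randn(1, 128)
bg = torch.randn(1, 128)
fused = CascadeInteraction()(fg, bg)     # shape: (1, 256), i.e. n1 + n2
```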
More specifically, in step S123, as shown in fig. 6, extracting global context semantic information among the plurality of foreground-background interaction-aware gesture feature vectors to obtain the gesture operation type semantic understanding feature vector includes: S1231, performing feature distribution optimization on the plurality of foreground-background interaction-aware gesture feature vectors respectively to obtain a plurality of optimized foreground-background interaction-aware gesture feature vectors; and S1232, passing the plurality of optimized foreground-background interaction-aware gesture feature vectors through a transformer-based gesture semantic understanding device to obtain the gesture operation type semantic understanding feature vector.
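A hedged sketch of the transformer-based gesture semantic understanding device (S1232): a standard TransformerEncoder runs over the sequence of per-frame interaction-aware vectors, and mean pooling over the frame axis produces the gesture operation type semantic understanding feature vector. The head count, layer count, and mean-pooling choice are assumptions, not details taken from the disclosure.

```python
import torch
import torch.nn as nn

class GestureSemanticUnderstander(nn.Module):
    """Transformer encoder over the per-frame foreground-background interaction-aware
    gesture feature vectors; mean pooling yields the gesture-operation-type vector."""
    def __init__(self, dim: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, num_frames, dim) -> (B, dim)
        return self.encoder(frame_feats).mean(dim=1)

seq = torch.randn(1, 16, 256)            # 16 sampled frames, 256-d fused vectors
semantic_vec = GestureSemanticUnderstander()(seq)
```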
More specifically, in step S1231, as shown in fig. 7, performing feature distribution optimization on the plurality of foreground-background interaction-aware gesture feature vectors to obtain a plurality of optimized foreground-background interaction-aware gesture feature vectors includes: S12311, performing homogeneous Hilbert space metric dense point distribution sampling fusion on the user gesture foreground feature vector and the user gesture background feature vector to obtain a fused feature vector; and S12312, cascading the fused feature vector with the foreground-background interaction-aware gesture feature vector to obtain the optimized foreground-background interaction-aware gesture feature vector.
In the technical solution of the present disclosure, when the cascade function is used to perform feature level interaction on each corresponding pair of user gesture foreground feature vector and user gesture background feature vector to obtain the plurality of foreground-background interaction-aware gesture feature vectors, the cascade function can extract the interaction features of the two vectors through point convolution and activation operations while preserving the image semantic features of the respective foreground and background images. Nevertheless, considering the user gesture interaction image frame as a whole, it is still desirable to obtain a feature representation that expresses the overall image semantics of the frame by fusing the user gesture foreground feature vector with the user gesture background feature vector.
Moreover, the applicant of the present disclosure notes that the user gesture foreground feature vector and the user gesture background feature vector are the image-semantic homogeneous encodings, based on the convolutional neural network models, of the user gesture foreground image and the user gesture background image respectively, i.e., local-image semantic association features densely acquired at the convolution kernel scale. Therefore, denoting the user gesture foreground feature vector as V1 and the user gesture background feature vector as V2, homogeneous Hilbert space metric dense point distribution sampling fusion is performed on them.
Accordingly, in one specific example, performing homogeneous Hilbert space metric dense point distribution sampling fusion on the user gesture foreground feature vector and the user gesture background feature vector to obtain a fused feature vector includes: fusing the user gesture foreground feature vector and the user gesture background feature vector with a fusion formula to obtain the fused feature vector; wherein, in the fusion formula, V1 denotes the user gesture foreground feature vector, V2 denotes the user gesture background feature vector, (·)ᵀ denotes the transpose operation, μ1 and μ2 denote the global feature means of V1 and V2 respectively, the feature vectors V1 and V2 are both row vectors, d(V1, V2) denotes the Minkowski distance between them, α is a hyperparameter, ⊕ denotes vector addition, ⊗ denotes position-wise multiplication, and Vf denotes the fused feature vector.
Here, by applying the homogeneous Hilbert space metric to the feature distribution centers of the user gesture foreground feature vector V1 and the user gesture background feature vector V2, the fused feature distribution of V1 and V2 is constrained by the ground-truth geometric center of the fused feature manifold hyperplane in the high-dimensional feature space, and the point-by-point feature association under the cross-distance constraint serves as a bias term. This realizes dense-point sampling pattern distribution fusion within the association constraints of the feature distributions, thereby enhancing the homogeneous sampling association fusion between the vectors. Then, cascading the fused feature vector Vf with the foreground-background interaction-aware gesture feature vector improves the ability of the foreground-background interaction-aware gesture feature vector to represent the overall image semantics of the user gesture interaction image frame.
Further, the gesture operation type semantic understanding feature vector is passed through a classifier to obtain a classification result, wherein the classification result is used for representing a user operation action intention label corresponding to the user gesture interaction video.
Accordingly, in step S130, determining, based on the gesture operation type semantic understanding feature vector, a user manipulation action intention corresponding to the user gesture interaction video includes: and the gesture operation type semantic understanding feature vector passes through a classifier to obtain a classification result, wherein the classification result is used for representing a user operation action intention label corresponding to the user gesture interaction video.
It should be appreciated that the role of the classifier is to learn classification rules from training data with known class labels and then classify (or predict) unknown data. Logistic regression, SVM, and similar methods are commonly used for binary classification problems; for multi-class classification, logistic regression or SVM can also be used by combining multiple binary classifiers, but this is error-prone and inefficient, so the commonly used approach for multi-class problems is the Softmax classification function.
Accordingly, in one possible implementation manner, the gesture operation type semantic understanding feature vector is passed through a classifier to obtain a classification result, where the classification result is used to represent a user manipulation action intention label corresponding to a user gesture interaction video, and the method includes: performing full-connection coding on the gesture operation type semantic understanding feature vector by using a full-connection layer of the classifier to obtain a coding classification feature vector; and inputting the coding classification feature vector into a Softmax classification function of the classifier to obtain the classification result.
It is noted that fully connected encoding (Fully Connected Encoding) encodes an input feature vector through a fully connected layer to generate an encoded classification feature vector. In the gesture operation type semantic understanding task, fully connected encoding converts the gesture operation type semantic understanding feature vector into a more discriminative encoded classification feature vector. Fully connected encoding generally multiplies the input feature vector by the weight matrix of the fully connected layer and applies a nonlinear activation function: the weight matrix defines a linear combination of the features, and the activation function introduces a nonlinear transformation that increases the expressive capacity of the model. Through fully connected encoding, the original gesture operation type semantic understanding feature vector can be mapped into a higher-dimensional feature space, capturing richer semantic information and feature representations; because the weight matrix is learned, the combination of features is learned adaptively, making the resulting encoded classification feature vector more discriminative and expressive. The encoded classification feature vector is then passed as input to the Softmax function of the classifier to obtain the classification result of the gesture operation type. The Softmax function maps the encoded classification feature vector to a probability distribution, with one probability value per category representing the confidence of that gesture operation type; the category with the highest probability is selected as the classification result, which represents the user manipulation action intention label corresponding to the user gesture interaction video. In other words, fully connected encoding extracts a more representative feature representation from the gesture operation type semantic understanding feature vector, and the Softmax function of the classifier converts it into the classification result used to represent the user manipulation action intention label.
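A minimal sketch of the classifier described above: a fully connected encoding layer followed by a linear head and Softmax over the intention labels. The hidden size and the number of intention labels are illustrative assumptions; in training one would typically feed the pre-Softmax logits to a cross-entropy loss instead.

```python
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    """Fully connected encoding of the semantic understanding vector followed by Softmax."""
    def __init__(self, dim: int = 256, num_intents: int = 8):   # 8 intent labels is illustrative
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.head = nn.Linear(dim, num_intents)

    def forward(self, semantic_vec: torch.Tensor) -> torch.Tensor:
        # Probability distribution over manipulation-intention labels.
        return torch.softmax(self.head(self.encode(semantic_vec)), dim=-1)

probs = IntentClassifier()(torch.randn(1, 256))
intent_label = probs.argmax(dim=-1)       # index of the predicted manipulation intention
```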
In summary, according to the meta-universe intelligent interaction method disclosed by the embodiment of the disclosure, the identification and understanding of the manipulation action intention of the user can be assisted by using the information such as the spatial layout and the object position in the background information, so that more accurate and adaptive interaction experience is provided.
Fig. 8 shows a block diagram of a smart interaction system 100 of the metauniverse in accordance with an embodiment of the present disclosure. As shown in fig. 8, a meta-universe intelligent interaction system 100 according to an embodiment of the present disclosure includes: the video acquisition module 110 is used for acquiring user gesture interaction videos acquired by the camera; the sampling and feature extraction module 120 is configured to sample and extract features of the gesture interaction video of the user to obtain a gesture operation type semantic understanding feature vector; and a manipulation action intention judging module 130, configured to determine a user manipulation action intention corresponding to the user gesture interaction video based on the gesture operation type semantic understanding feature vector.
Here, it will be understood by those skilled in the art that the specific functions and operations of the respective units and modules in the above-described metauniverse intelligent interaction system 100 have been described in detail in the above description of the metauniverse intelligent interaction method with reference to fig. 1 to 7, and thus, repetitive descriptions thereof will be omitted.
As described above, the smart interactive system 100 of the meta universe according to the embodiment of the present disclosure may be implemented in various wireless terminals, for example, a server or the like having a smart interactive algorithm of the meta universe. In one possible implementation, the meta-universe intelligent interaction system 100 according to embodiments of the present disclosure may be integrated into a wireless terminal as one software module and/or hardware module. For example, the smart interactive system 100 of the meta-universe may be a software module in the operating system of the wireless terminal, or may be an application developed for the wireless terminal; of course, the meta-universe of intelligent interactive systems 100 may equally be one of a number of hardware modules of the wireless terminal.
Alternatively, in another example, the smart interactive system 100 of the metauniverse and the wireless terminal may also be separate devices, and the smart interactive system 100 of the metauniverse may be connected to the wireless terminal through a wired and/or wireless network and transmit interactive information in a agreed data format.
Fig. 9 illustrates an application scenario diagram of a smart interaction method of a metauniverse according to an embodiment of the disclosure. As shown in fig. 9, in this application scenario, first, a user gesture interaction video (e.g., D shown in fig. 9) acquired by a camera (e.g., C shown in fig. 9) is acquired, and then, the user gesture interaction video is input to a server (e.g., S shown in fig. 9) in which a smart interaction algorithm of a metauniverse is deployed, where the server can process the user gesture interaction video using the smart interaction algorithm of the metauniverse to obtain a classification result for representing a user manipulation action intention tag corresponding to the user gesture interaction video.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (4)

1. An intelligent interaction method of a meta-universe is characterized by comprising the following steps: acquiring a user gesture interaction video acquired by a camera; sampling and feature extraction are carried out on the user gesture interaction video to obtain gesture operation type semantic understanding feature vectors; determining user operation and control action intentions corresponding to the user gesture interaction video based on the gesture operation type semantic understanding feature vector;
the method for sampling and extracting the features of the user gesture interaction video to obtain gesture operation type semantic understanding feature vectors comprises the following steps: sparse sampling is carried out on the user gesture interaction video to obtain a plurality of user gesture interaction image frames; image feature extraction and foreground-background feature blending are carried out on the plurality of user gesture interaction image frames so as to obtain a plurality of foreground-background interaction perception gesture feature vectors; extracting global context semantic information among the foreground-background interaction perception gesture feature vectors to obtain gesture operation type semantic understanding feature vectors;
the image feature extraction and foreground-background feature blending are performed on the plurality of user gesture interaction image frames to obtain a plurality of foreground-background interaction perception gesture feature vectors, including: dividing the areas of the plurality of user gesture interaction image frames to obtain a plurality of user gesture foreground images and a plurality of user gesture background images; respectively passing the plurality of user gesture foreground images through a foreground feature capturer based on a first convolutional neural network model to obtain a plurality of user gesture foreground feature vectors; respectively passing the plurality of user gesture background images through a background feature capturer based on a second convolution neural network model to obtain a plurality of user gesture background feature vectors; performing feature level interaction on the foreground feature vectors and the background feature vectors of the user gestures to obtain the foreground-background interaction sensing gesture feature vectors;
The feature level interaction is performed on the foreground feature vectors and the background feature vectors of the user gestures to obtain the foreground-background interaction sensing gesture feature vectors, which comprises the following steps: performing feature level interaction on the corresponding foreground feature vectors of the user gestures and the background feature vectors of the user gestures by using a cascading function to obtain a plurality of foreground-background interaction perception gesture feature vectors;
extracting global context semantic information among the foreground-background interaction perception gesture feature vectors to obtain the gesture operation type semantic understanding feature vector comprises the following steps: respectively carrying out feature distribution optimization on the foreground-background interaction perception gesture feature vectors to obtain a plurality of optimized foreground-background interaction perception gesture feature vectors; and passing the plurality of optimized foreground-background interaction perception gesture feature vectors through a transformer-based gesture semantic understanding device to obtain the gesture operation type semantic understanding feature vector;
the feature distribution optimization is performed on the foreground-background interaction perception gesture feature vectors to obtain optimized foreground-background interaction perception gesture feature vectors, including: carrying out homogeneous Hilbert space metric dense point distribution sampling fusion on the user gesture foreground feature vector and the user gesture background feature vector to obtain a fused feature vector; and cascading the fused feature vector with the foreground-background interaction perception gesture feature vector to obtain the optimized foreground-background interaction perception gesture feature vector;
the homogeneous Hilbert space metric dense point distribution sampling fusion is performed on the user gesture foreground feature vector and the user gesture background feature vector by means of a fusion formula to obtain the fused feature vector;
wherein, in the fusion formula, V1 denotes the user gesture foreground feature vector, V2 denotes the user gesture background feature vector, (·)ᵀ denotes the transpose operation, μ1 and μ2 denote the global feature means of V1 and V2 respectively, the feature vectors V1 and V2 are both row vectors, d(V1, V2) denotes the Minkowski distance between them, α is a hyperparameter, ⊕ denotes vector addition, ⊗ denotes position-wise multiplication, and Vf denotes the fused feature vector.
2. The method of claim 1, wherein the partitioning the plurality of user gesture interaction image frames to obtain a plurality of user gesture foreground images and a plurality of user gesture background images comprises: respectively passing the plurality of user gesture interaction image frames through a region divider based on a target detection network to respectively obtain a plurality of user gesture foreground images; and masking the plurality of user gesture interaction image frames based on the plurality of user gesture foreground images to obtain the plurality of user gesture background images.
3. The intelligent interaction method of the metauniverse according to claim 2, wherein determining the user manipulation action intention corresponding to the user gesture interaction video based on the gesture operation type semantic understanding feature vector comprises: and the gesture operation type semantic understanding feature vector passes through a classifier to obtain a classification result, wherein the classification result is used for representing a user operation action intention label corresponding to the user gesture interaction video.
4. An intelligent interactive system of a meta-universe, comprising: the video acquisition module is used for acquiring user gesture interaction videos acquired by the camera; the sampling and feature extraction module is used for sampling and feature extraction of the user gesture interaction video to obtain gesture operation type semantic understanding feature vectors; the control action intention judging module is used for determining the user control action intention corresponding to the user gesture interaction video based on the gesture operation type semantic understanding feature vector;
wherein, the sampling and feature extraction module includes: sparse sampling is carried out on the user gesture interaction video to obtain a plurality of user gesture interaction image frames; image feature extraction and foreground-background feature blending are carried out on the plurality of user gesture interaction image frames so as to obtain a plurality of foreground-background interaction perception gesture feature vectors; extracting global context semantic information among the foreground-background interaction perception gesture feature vectors to obtain gesture operation type semantic understanding feature vectors;
Image feature extraction and foreground-background feature blending are carried out on the plurality of user gesture interaction image frames to obtain a plurality of foreground-background interaction perception gesture feature vectors, and the method comprises the following steps: dividing the areas of the plurality of user gesture interaction image frames to obtain a plurality of user gesture foreground images and a plurality of user gesture background images; respectively passing the plurality of user gesture foreground images through a foreground feature capturer based on a first convolutional neural network model to obtain a plurality of user gesture foreground feature vectors; respectively passing the plurality of user gesture background images through a background feature capturer based on a second convolution neural network model to obtain a plurality of user gesture background feature vectors; performing feature level interaction on the foreground feature vectors and the background feature vectors of the user gestures to obtain the foreground-background interaction sensing gesture feature vectors;
the feature level interaction is performed on the foreground feature vectors and the background feature vectors of the user gestures to obtain the foreground-background interaction sensing gesture feature vectors, which comprises the following steps: performing feature level interaction on the corresponding foreground feature vectors of the user gestures and the background feature vectors of the user gestures by using a cascading function to obtain a plurality of foreground-background interaction perception gesture feature vectors;
wherein extracting global context semantic information among the plurality of foreground-background interaction-aware gesture feature vectors to obtain the gesture operation type semantic understanding feature vector comprises: performing feature distribution optimization on each of the foreground-background interaction-aware gesture feature vectors to obtain a plurality of optimized foreground-background interaction-aware gesture feature vectors; and passing the plurality of optimized foreground-background interaction-aware gesture feature vectors through a transformer-based gesture semantic understanding device to obtain the gesture operation type semantic understanding feature vector;
wherein performing feature distribution optimization on the foreground-background interaction-aware gesture feature vectors to obtain the optimized foreground-background interaction-aware gesture feature vectors comprises: performing homogeneous Hilbert space metric dense point distribution sampling fusion on the user gesture foreground feature vector and the user gesture background feature vector to obtain a fused feature vector; and cascading the fused feature vector with the foreground-background interaction-aware gesture feature vector to obtain the optimized foreground-background interaction-aware gesture feature vector;
wherein performing homogeneous Hilbert space metric dense point distribution sampling fusion on the user gesture foreground feature vector and the user gesture background feature vector to obtain the fused feature vector comprises: performing homogeneous Hilbert space metric dense point distribution sampling fusion on the user gesture foreground feature vector and the user gesture background feature vector by using the following fusion formula to obtain the fused feature vector; wherein the fusion formula is:
wherein $V_1$ represents the user gesture foreground feature vector, $V_2$ represents the user gesture background feature vector, $(\cdot)^\top$ represents the transpose operation, $\mu_1$ and $\mu_2$ are the global feature means of the user gesture foreground feature vector $V_1$ and the user gesture background feature vector $V_2$, respectively, the feature vectors $V_1$ and $V_2$ are both row vectors, $d(V_1, V_2)$ represents the Minkowski distance between $V_1$ and $V_2$, $\alpha$ is a hyper-parameter, $\oplus$ represents vector addition, $\odot$ represents position-wise multiplication, and $V_f$ represents the fused feature vector.
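To make the feature extraction and interaction steps of claim 4 concrete, the following PyTorch sketch shows one possible form of the first/second convolutional feature capturers and of the cascading feature-level interaction, read here as vector concatenation; the layer counts, channel widths, and 256-dimensional outputs are assumptions, not details from the patent.

```python
import torch
import torch.nn as nn

class FeatureCapturer(nn.Module):
    """Small convolutional feature capturer; the claim only states that the
    foreground and background branches use two separate CNN models, so the
    depth and width below are assumptions."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, out_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(image).flatten(1)   # (batch, 64)
        return self.proj(feat)                   # per-frame feature vector

foreground_capturer = FeatureCapturer()          # based on the first CNN model
background_capturer = FeatureCapturer()          # based on the second CNN model

def interact(fg_vec: torch.Tensor, bg_vec: torch.Tensor) -> torch.Tensor:
    """Feature-level interaction via a cascading (concatenation) function,
    yielding a foreground-background interaction-aware gesture feature vector."""
    return torch.cat([fg_vec, bg_vec], dim=-1)
```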
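The fusion formula itself is not reproduced in this text, so the sketch below is not the granted formula; it only combines the quantities defined above (global feature means, a Minkowski distance, the hyper-parameter, position-wise multiplication, and vector addition) into one plausible fusion for illustration.

```python
import torch

def dense_point_fusion(v1: torch.Tensor, v2: torch.Tensor,
                       alpha: float = 1.0, p: float = 2.0) -> torch.Tensor:
    """Illustrative fusion of a user gesture foreground feature vector (v1) and
    background feature vector (v2) built only from the quantities defined in the
    claim; the exact arrangement is an assumption, not the patented formula."""
    mu1, mu2 = v1.mean(), v2.mean()                         # global feature means
    minkowski = ((v1 - v2).abs() ** p).sum() ** (1.0 / p)   # Minkowski distance d(v1, v2)
    centered = (v1 - mu1) * (v2 - mu2)                      # position-wise multiplication
    v_f = (v1 + v2) + alpha / (1.0 + minkowski) * centered  # vector addition, distance-scaled
    return v_f

# The optimized interaction-aware gesture feature vector then cascades
# (concatenates) the fused vector with the original interaction-aware vector:
# optimized = torch.cat([dense_point_fusion(v1, v2), interaction_vector], dim=-1)
```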
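Finally, the transformer-based gesture semantic understanding device can be sketched as a standard transformer encoder applied to the sequence of per-frame optimized interaction-aware gesture feature vectors; the head count, layer count, input dimension, and mean-pooling readout below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GestureSemanticUnderstander(nn.Module):
    """Transformer encoder over the sequence of per-frame interaction-aware
    gesture feature vectors; pools the encoded sequence into a single
    gesture operation type semantic understanding feature vector."""
    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_vectors: torch.Tensor) -> torch.Tensor:
        # frame_vectors: (batch, num_sampled_frames, dim)
        encoded = self.encoder(frame_vectors)
        return encoded.mean(dim=1)   # global-context semantic understanding vector
```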
CN202310995469.2A 2023-08-09 2023-08-09 Intelligent interaction method and system for meta universe Active CN116719419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310995469.2A CN116719419B (en) 2023-08-09 2023-08-09 Intelligent interaction method and system for meta universe

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310995469.2A CN116719419B (en) 2023-08-09 2023-08-09 Intelligent interaction method and system for meta universe

Publications (2)

Publication Number Publication Date
CN116719419A CN116719419A (en) 2023-09-08
CN116719419B CN116719419B (en) 2023-11-03

Family

ID=87873780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310995469.2A Active CN116719419B (en) 2023-08-09 2023-08-09 Intelligent interaction method and system for meta universe

Country Status (1)

Country Link
CN (1) CN116719419B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348420B (en) * 2019-07-18 2022-03-18 腾讯科技(深圳)有限公司 Sign language recognition method and device, computer readable storage medium and computer equipment
CN114578952B (en) * 2020-11-17 2024-03-15 京东方科技集团股份有限公司 Human-computer interaction method, system, processing device and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530613A (en) * 2013-10-15 2014-01-22 无锡易视腾科技有限公司 Target person hand gesture interaction method based on monocular video sequence
CN110602516A (en) * 2019-09-16 2019-12-20 腾讯科技(深圳)有限公司 Information interaction method and device based on live video and electronic equipment
CN111831120A (en) * 2020-07-14 2020-10-27 上海岁奇智能科技有限公司 Gesture interaction method, device and system for video application
CN112164061A (en) * 2020-10-27 2021-01-01 广州宇中网络科技有限公司 Micro-gesture detection method beneficial to non-contact human-computer interaction
CN113343950A (en) * 2021-08-04 2021-09-03 之江实验室 Video behavior identification method based on multi-feature fusion
CN113660449A (en) * 2021-10-20 2021-11-16 中兴通讯股份有限公司 Gesture communication method and device, storage medium and electronic device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Review of constraints on vision-based gesture recognition for human-computer interaction; Biplab Ketan et al.; IET Computer Vision; Full text *
Gesture recognition method based on Fourier descriptor and BP neural network; Qin Wenjun et al.; Journal of Northeastern University (Natural Science) (09); Full text *
Research on action recognition method fusing multiple features; Lin Xianming et al.; Computer Engineering and Applications (05); Full text *

Also Published As

Publication number Publication date
CN116719419A (en) 2023-09-08

Similar Documents

Publication Publication Date Title
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
Mohanty et al. Deep gesture: static hand gesture recognition using CNN
Sahoo et al. Hand gesture recognition using DWT and F‐ratio based feature descriptor
Liu et al. Coupled network for robust pedestrian detection with gated multi-layer feature extraction and deformable occlusion handling
Mohanty et al. Robust pose recognition using deep learning
CN111783749A (en) Face detection method and device, electronic equipment and storage medium
Zhang et al. Hand Gesture recognition in complex background based on convolutional pose machine and fuzzy Gaussian mixture models
CN111310604A (en) Object detection method and device and storage medium
CN113449573A (en) Dynamic gesture recognition method and device
Avola et al. 3D hand pose and shape estimation from RGB images for keypoint-based hand gesture recognition
Kompella et al. A semi-supervised recurrent neural network for video salient object detection
Fan Research and realization of video target detection system based on deep learning
Arun Prasath et al. Prediction of sign language recognition based on multi layered CNN
Pandey Automated gesture recognition and speech conversion tool for speech impaired
Zhang et al. A simple and effective static gesture recognition method based on attention mechanism
Xia et al. Multi-stream neural network fused with local information and global information for HOI detection
CN113822134A (en) Instance tracking method, device, equipment and storage medium based on video
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
CN116719419B (en) Intelligent interaction method and system for meta universe
Li et al. A novel art gesture recognition model based on two channel region-based convolution neural network for explainable human-computer interaction understanding
Tyagi et al. Sign language recognition using hand mark analysis for vision-based system (HMASL)
Rajarajeswari et al. Real-time translation of indian sign language to assist the hearing and speech impaired
Karthik et al. Survey on Gestures Translation System for Hearing Impaired People in Emergency Situation using Deep Learning Approach
Robert et al. Literature survey: application of machine learning techniques on static sign language recognition
Ghorai et al. Indian sign language recognition system using network deconvolution and spatial transformer network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant