CN115237255B - Natural image co-pointing target positioning system and method based on eye movement and voice - Google Patents

Natural image co-pointing target positioning system and method based on eye movement and voice Download PDF

Info

Publication number
CN115237255B
CN115237255B
Authority
CN
China
Prior art keywords
module
image
eye movement
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210906536.4A
Other languages
Chinese (zh)
Other versions
CN115237255A (en)
Inventor
张珺倩
黄如强
杨超
王宁慈
于文东
张久松
耿震
孟祥轶
任晓琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202210906536.4A priority Critical patent/CN115237255B/en
Publication of CN115237255A publication Critical patent/CN115237255A/en
Application granted granted Critical
Publication of CN115237255B publication Critical patent/CN115237255B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Acoustics & Sound (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a natural image co-pointing target positioning system based on eye movement and voice, which comprises a natural image co-pointing target positioning system and a data acquisition and preprocessing system. The natural image co-pointing target positioning system comprises an image feature extraction module, a text feature extraction module, an eye movement feature extraction module, a feature fusion module, a target identification module and a confidence calculation module; the data acquisition and preprocessing system comprises an image acquisition module, an image preprocessing module, a voice acquisition module, a voice preprocessing module, an eye movement acquisition module and an eye movement preprocessing module. The feature fusion module, the target identification module and the confidence calculation module are connected in sequence, and the input end of the feature fusion module is connected to the output ends of the image feature extraction module, the text feature extraction module and the eye movement feature extraction module respectively. The invention acquires the user's eye movement input and voice input at the same time and exploits the complementary advantages of multi-modal information, which helps achieve efficient human-computer interaction.

Description

Natural image co-pointing target positioning system and method based on eye movement and voice
Technical Field
The invention relates to a human-computer interaction system and a human-computer interaction method, in particular to a natural image co-pointing target positioning system and a natural image co-pointing target positioning method based on eye movement and voice.
Background
Currently, existing human-computer interaction systems generally adopt a single-modality input form, such as touch/key input or voice input. Human expression, however, is not limited to a single modality and often exhibits multi-modal characteristics. When human intention must be comprehensively understood and a target must be located, single-modality interaction methods or systems have low accuracy, adapt poorly to different interaction environments, and make natural, intelligent human-machine cooperation difficult to realize.
In existing research, single-modality image-text matching methods combine computer vision and natural language processing: they analyze human intention by parsing the language structure of the text description while referring to the visual information acquired by the system, applying various reasoning strategies. These methods mainly include models based on holistic feature representation, modular models, and graph-based models. Models based on holistic feature representation use a single vector to represent image-text features, ignoring the complex textual context and the spatial structure within the image; modular models simplify the language structure, ignore relations between visual targets, and are unsuitable for long sentences; graph-based models can express graph structures but rely on a target detection model and a text parsing model.
In addition, human gaze information, as an important carrier of human intention, has also been applied in the field of human-computer interaction. Research on gaze estimation and tracking reveals the gazing direction of the human eyes, and on this basis eye movement information has been combined with natural images to predict salient targets, gazed targets, and related tasks. However, with single-modality input, when multiple similar objects appear in the image it is difficult for an algorithmic model to correctly identify the referred target.
Disclosure of Invention
The invention provides a natural image co-pointing target positioning system and method based on eye movement and voice for solving the technical problems in the prior art.
The technical solution adopted by the invention to solve the technical problems in the prior art is as follows:
a natural image co-pointing target positioning system based on eye movement and voice comprises a natural image co-pointing target positioning system and a data acquisition and preprocessing system; the natural image co-pointing target positioning system comprises an image feature extraction module, a text feature extraction module, an eye movement feature extraction module, a feature fusion module, a target identification module and a confidence calculation module; the data acquisition and preprocessing system comprises an image acquisition module, an image preprocessing module, a voice acquisition module, a voice preprocessing module, an eye movement acquisition module and an eye movement preprocessing module; the image acquisition module, the image preprocessing module and the image feature extraction module are sequentially connected; the voice acquisition module, the voice preprocessing module and the text feature extraction module are sequentially connected; the eye movement acquisition module, the eye movement preprocessing module and the eye movement characteristic extraction module are sequentially connected; the feature fusion module, the target identification module and the confidence coefficient calculation module are connected in sequence; the input end of the feature fusion module is respectively connected with the output ends of the image feature extraction module, the text feature extraction module and the eye movement feature extraction module;
the image acquisition module is used for acquiring an environment image in front of a user in real time, the image preprocessing module is used for preprocessing the acquired image, and the image feature extraction module is used for extracting image features of the preprocessed image;
the voice acquisition module is used for acquiring voice description information sent by a user, the voice preprocessing module is used for preprocessing the acquired voice information into text information, and the text feature extraction module is used for extracting text features of the preprocessed text information;
the eye movement acquisition module is used for acquiring a real-time eye movement coordinate sequence of a user, the eye movement preprocessing module is used for preprocessing the acquired eye movement coordinate sequence, and the eye movement characteristic extraction module is used for extracting eye movement characteristics of the preprocessed eye movement coordinate sequence;
the feature fusion module is used for fusing the image features, the text features and the eye movement features and generating multi-modal features;
the target recognition module is used for reducing the dimension of the multi-mode features and generating candidate target frames;
the confidence calculating module is used for calculating the confidence of all candidate target frames.
Further, the image preprocessing module comprises an image screening module, an image scaling module and an image normalization module which are connected in sequence; the image screening module is used for selecting images with qualified quality; the image scaling module is used for randomly scaling the image; the image normalization module is used for normalizing the image pixel values.
Further, the image feature extraction module comprises: a pre-trained residual convolutional neural network for extracting image features, a one-layer convolutional neural network for reducing the dimension of the image features, and a position feature extraction module for generating position features that represent the position corresponding to each feature vector in the image features.
Further, the voice preprocessing module comprises a template matching voice recognition algorithm module; the template matching voice recognition algorithm module is used for converting the audio information into a text sequence.
Further, the text feature extraction module comprises a RoBERTa model pre-trained on public datasets; the RoBERTa model is used to generate text embedding representations.
Further, the eye movement preprocessing module comprises a Gaussian map generation module, which is used for converting the eye movement coordinate sequence into a Gaussian map sequence according to the mean and variance of the eye movement recognition error.
Further, the eye movement feature extraction module comprises: a bilinear interpolation scaling module for scaling the Gaussian maps to the size of the image features, a one-layer convolutional neural network for reducing the feature dimension, and a long short-term memory neural network for computing the eye movement features.
Further, the feature fusion module comprises a feature splicing module and a Transformer model pre-trained on public datasets, wherein the feature splicing module is used for concatenating the image features, text features and eye movement features; the Transformer model comprises an encoder and a decoder, each of which comprises six multi-head attention modules; each multi-head attention module comprises an eight-head self-attention layer, Dropout layer A, normalization layer A, full connection layer A, an activation function layer, Dropout layer B, full connection layer B, Dropout layer C and normalization layer B, connected in sequence.
Further, the target recognition module comprises a full connection layer; the confidence computation module includes a SoftMax layer.
The invention also provides a natural image co-pointing target positioning method based on eye movement and voice by using the natural image co-pointing target positioning system based on eye movement and voice, which comprises the following steps:
step 1, acquiring environment images, user voice description information and data of a user eye movement coordinate sequence through experiments by utilizing a data acquisition and preprocessing system, preprocessing the data to manufacture a training sample set and a test set, and training and testing a natural image co-pointing target positioning system by using the training sample set and the test set;
step 2, synchronously acquiring corresponding environment images, user voice description information and user eye movement coordinate sequences by the image acquisition module, the voice acquisition module and the eye movement acquisition module;
step 3, the image preprocessing module, the voice preprocessing module and the eye movement preprocessing module correspondingly preprocess the acquired environment image, the user voice description information and the user eye movement coordinate sequence;
step 4, correspondingly inputting the preprocessed environment image, the user voice description information and the user eye movement coordinate sequence into an image feature extraction module, a text feature extraction module and an eye movement feature extraction module to correspondingly obtain image features, text features and eye movement features;
step 5, the image features, the text features and the eye movement features are fused by a feature fusion module to obtain multi-modal features, and the multi-modal features are subjected to dimension reduction by a target recognition module to generate candidate target frames; calculating the confidence coefficient of all candidate target frames by a confidence coefficient calculation module; filtering the candidate frames according to the ordering of the confidence degrees, and determining a target frame;
step 6, confirming whether the target frame result is correct or not by the user; if the result of the target frame is correct, the result is saved, otherwise, the steps 2 to 5 are repeated.
The invention has the advantages and positive effects that:
1. Unlike existing single-modality human-computer interaction approaches, the natural image co-pointing target positioning system and method based on eye movement and voice provided by the invention acquire the user's eye movement input and voice input simultaneously, exploit the complementary advantages of multi-modal information, understand the user's expressed intention more comprehensively, and help achieve an efficient and natural human-computer interaction effect.
2. The natural image co-pointing target positioning system and method based on eye movement and voice establish a deep learning model that fuses features of multiple modalities and extract eye movement features through a long short-term memory neural network; compared with a model without eye movement features, the positioning accuracy is significantly improved.
Drawings
Fig. 1 is a schematic structural diagram of a natural image co-pointing object positioning system based on eye movement and voice according to the present invention.
Fig. 2 is a schematic diagram of the operation of the multi-modal object localization of the eye movement and voice based natural image co-pointing object localization system of the present invention.
Fig. 3 is a workflow diagram of a natural image co-pointing object localization method based on eye movement and voice of the present invention.
Detailed Description
For a further understanding of the invention, its features and advantages, reference is now made to the following examples, which are illustrated in the accompanying drawings in which:
the following English words and English abbreviations in the invention are defined as follows:
ResNet101: a convolutional neural network with a residual structure, in which certain layers skip the connection to the neurons of the immediately following layer, realizing cross-layer connections and weakening the tight coupling between adjacent layers, thereby alleviating the degradation problem in deep networks. The number 101 indicates that the model contains 101 convolutional and fully connected layers.
Transformer model: a deep neural network based on a self-attention mechanism, originally used for machine translation, composed of an encoder module and a decoder module.
Dropout layer: a neural network layer used during deep learning training that reduces overfitting by randomly setting half of the feature values to zero.
RoBERTa model: a pre-trained text feature encoder based on a self-attention mechanism.
SoftMax layer: a neural network layer that uses the softmax function to convert feature value outputs into a probability distribution over [0, 1] that sums to 1.
AR glasses: augmented reality glasses, a hardware device in the form of glasses that fuses virtual information with the real world.
ImageNet: a large natural image public dataset for computer vision research.
CC-NEWS, OPENWEBTEXT, STORIES, BOOKCORPUS: public text datasets.
Flickr30k, MS COCO and Visual Genome: public datasets in the image-text cross-modal field.
Referring to fig. 1 to 3, a natural image co-pointing target positioning system based on eye movement and voice includes a natural image co-pointing target positioning system, a data acquisition and preprocessing system; the natural image co-pointing target positioning system comprises an image feature extraction module, a text feature extraction module, an eye movement feature extraction module, a feature fusion module, a target identification module and a confidence calculation module; the data acquisition and preprocessing system comprises an image acquisition module, an image preprocessing module, a voice acquisition module, a voice preprocessing module, an eye movement acquisition module and an eye movement preprocessing module; the image acquisition module, the image preprocessing module and the image feature extraction module are sequentially connected; the voice acquisition module, the voice preprocessing module and the text feature extraction module are sequentially connected; the eye movement acquisition module, the eye movement preprocessing module and the eye movement characteristic extraction module are sequentially connected; the feature fusion module, the target identification module and the confidence coefficient calculation module are connected in sequence; the input end of the feature fusion module is respectively connected with the output ends of the image feature extraction module, the text feature extraction module and the eye movement feature extraction module.
The image acquisition module is used for acquiring an environment image in front of a user in real time, the image preprocessing module is used for preprocessing the acquired image, and the image feature extraction module is used for extracting image features of the preprocessed image.
The voice acquisition module is used for acquiring voice description information sent by a user, the voice preprocessing module is used for preprocessing the acquired voice information into text information, and the text feature extraction module is used for extracting text features of the preprocessed text information.
The eye movement collection module is used for collecting real-time eye movement coordinate sequences of users, the eye movement preprocessing module is used for preprocessing the collected eye movement coordinate sequences, and the eye movement feature extraction module is used for extracting eye movement features of the preprocessed eye movement coordinate sequences.
The feature fusion module is used for fusing the image features, the text features and the eye movement features and generating multi-modal features.
The target recognition module is used for reducing the dimension of the multi-mode features and generating candidate target frames.
The confidence calculating module is used for calculating the confidence of all candidate target frames.
The voice acquisition module may include a microphone; the image acquisition module can comprise AR glasses with a display function and a real-time eye movement coordinate acquisition function; the eye movement acquisition module can comprise AR glasses with a display function and a real-time eye movement coordinate acquisition function; the data acquisition system may also include a synchronous acquisition program. The image acquisition module and the eye movement acquisition module can share AR glasses.
The data acquisition and preprocessing system can acquire data such as environment images, user voice description information and user eye movement coordinate sequences, preprocess the data, and compile training and verification samples for training and verifying the target positioning model. The acquired data set is divided into a training set and a test set at a ratio between 2:1 and 5:1; the training set data are used to train the target positioning model, and the test set data are used to test it.
Preferably, the image preprocessing module can comprise an image screening module, an image scaling module and an image normalization module which are sequentially connected; the image screening module is used for selecting images with qualified quality; the image scaling module is used for randomly scaling the image; the image normalization module is used for normalizing the image pixel values.
Preferably, the image feature extraction module may comprise: a pre-trained residual convolutional neural network for extracting image features, a one-layer convolutional neural network for reducing the dimension of the image features, and a position feature extraction module for generating position features that represent the position corresponding to each feature vector in the image features.
Preferably, the voice preprocessing module may include a template matching voice recognition algorithm module; the template matching voice recognition algorithm module is used for converting the audio information into a text sequence.
Preferably, the text feature extraction module may comprise a RoBERTa model pre-trained on public datasets; the RoBERTa model is used to generate text embedding representations. Pre-training of the RoBERTa model uses a total of 160 GB of training text, including the public datasets CC-NEWS, OPENWEBTEXT, STORIES, BOOKCORPUS and Wikipedia.
Preferably, the eye movement preprocessing module may comprise a Gaussian map generation module for converting the eye movement coordinate sequence into a Gaussian map sequence according to the mean and variance of the eye movement recognition error.
Preferably, the eye movement feature extraction module may comprise: a bilinear interpolation scaling module for scaling the Gaussian maps to the size of the image features, a one-layer convolutional neural network for reducing the feature dimension, and a long short-term memory neural network for computing the eye movement features.
Preferably, the feature fusion module may comprise a feature splicing module and a Transformer model pre-trained on public datasets; the feature splicing module is used for concatenating the image features, text features and eye movement features. The Transformer model may comprise an encoder and a decoder, each of which may comprise six multi-head attention modules; each multi-head attention module comprises an eight-head self-attention layer, Dropout layer A, normalization layer A, full connection layer A, an activation function layer, Dropout layer B, full connection layer B, Dropout layer C and normalization layer B, connected in sequence.
Dropout layer A, Dropout layer B and Dropout layer C are all Dropout layers; full connection layer A, full connection layer B and the like are all full connection layers; normalization layer A, normalization layer B and the like are all normalization layers. The letter suffixes are added only to distinguish individual instances.
The public datasets used to pre-train the Transformer model are the Flickr30k, MS COCO and Visual Genome datasets.
Preferably, the object recognition module may comprise a fully connected layer; the confidence computation module includes a SoftMax layer.
The above-mentioned image acquisition module, voice acquisition module, eye movement acquisition module, image preprocessing module, image screening module, image scaling module, image normalization module, image feature extraction module, residual convolutional neural network, one-layer convolutional neural network, position feature extraction module, voice preprocessing module, template matching voice recognition algorithm module, text feature extraction module, RoBERTa model, eye movement preprocessing module, Gaussian map generation module, eye movement feature extraction module, bilinear interpolation scaling module, long short-term memory neural network, feature fusion module, feature splicing module, Transformer model, encoder, decoder, six-layer multi-head attention module, eight-head self-attention layer, Dropout layer A, normalization layer A, full connection layer A, activation function layer, Dropout layer B, full connection layer B, Dropout layer C, normalization layer B, target recognition module, full connection layer, confidence calculation module, SoftMax layer and other functional modules can all be implemented with existing technology, or constructed from components and functional modules in the prior art using conventional techniques.
The invention also provides an embodiment of a natural image co-pointing target positioning method based on eye movement and voice by using the natural image co-pointing target positioning system based on eye movement and voice, which comprises the following steps:
step 1, acquiring environment images, user voice description information and data of a user eye movement coordinate sequence through experiments by utilizing a data acquisition and preprocessing system, preprocessing the data, manufacturing the preprocessed data into a training sample set and a test set, and training and testing a natural image co-pointing target positioning system by using the training sample set and the test set.
Step 2, the image acquisition module, the voice acquisition module and the eye movement acquisition module synchronously acquire the corresponding environment image, user voice description information and user eye movement coordinate sequence.
Step 3, the image preprocessing module, the voice preprocessing module and the eye movement preprocessing module correspondingly preprocess the acquired environment image, user voice description information and user eye movement coordinate sequence.
Step 4, the preprocessed environment image, user voice description information and user eye movement coordinate sequence are correspondingly input into the image feature extraction module, the text feature extraction module and the eye movement feature extraction module to obtain the image features, text features and eye movement features.
Step 5, the image features, the text features and the eye movement features are fused by a feature fusion module to obtain multi-modal features, and the multi-modal features are subjected to dimension reduction by a target recognition module to generate candidate target frames; calculating the confidence coefficient of all candidate target frames by a confidence coefficient calculation module; and filtering the candidate frames according to the ordering of the confidence degrees, and determining the target frame.
Step 6, confirming whether the target frame result is correct or not by the user; if the result of the target frame is correct, the result is saved, otherwise, the steps 2 to 5 are repeated.
The workflow and working principle of the invention are further described in the following with a preferred embodiment of the invention:
a natural image co-pointing target positioning system based on eye movement and voice, comprising: the device comprises an image acquisition module, an image preprocessing module and an image feature extraction module which are connected in sequence; the device comprises a voice acquisition module, a voice preprocessing module and a text feature extraction module which are connected in sequence; the device comprises an eye movement acquisition module, an eye movement preprocessing module and an eye movement characteristic extraction module which are connected in sequence; the system also comprises a feature fusion module, a target identification module and a confidence coefficient calculation module which are connected in sequence; the input end of the feature fusion module is respectively connected with the output ends of the image feature extraction module, the text feature extraction module and the eye movement feature extraction module.
The image preprocessing module comprises an image screening module, an image scaling module and an image normalization module which are connected in sequence; the image screening module is used for selecting images with qualified quality; the image scaling module is used for randomly scaling the image; the image normalization module is used for normalizing the image pixel values.
The image feature extraction module comprises: a pre-trained residual convolutional neural network for extracting image features, a one-layer convolutional neural network for reducing the dimension of the image features, and a position feature extraction module for generating position features that represent the position corresponding to each feature vector in the image features.
The voice preprocessing module comprises a template matching voice recognition algorithm module; the template matching voice recognition algorithm module is used for converting the audio information into a text sequence.
The text feature extraction module includes a RoBERTa model that uses the common dataset to complete the pre-training, the RoBERTa model being used to generate the text-embedded expression.
The eye movement preprocessing module comprises a Gaussian map generation module, which is used for converting the eye movement coordinate sequence into a Gaussian map sequence according to the mean and variance of the eye movement recognition error.
The eye movement feature extraction module comprises: a bilinear interpolation scaling module for scaling the Gaussian maps to the size of the image features, a one-layer convolutional neural network for reducing the feature dimension, and a long short-term memory neural network for computing the eye movement features.
The feature fusion module comprises a feature splicing module and a Transformer model pre-trained on public datasets; the feature splicing module is used for concatenating the image features, text features and eye movement features. The Transformer model comprises an encoder and a decoder, each of which comprises six multi-head attention modules; each multi-head attention module comprises an eight-head self-attention layer, Dropout layer A, normalization layer A, full connection layer A, an activation function layer, Dropout layer B, full connection layer B, Dropout layer C and normalization layer B, connected in sequence.
The target recognition module comprises a full connection layer; the confidence computation module includes a SoftMax layer.
The positioning method of the natural image co-pointing target positioning system based on eye movement and voice comprises the following steps:
1. Data acquisition.
The invention provides a data acquisition system comprising an image acquisition module, a voice acquisition module and an eye movement acquisition module, wherein the voice acquisition module comprises a microphone; the image acquisition module comprises AR glasses with a display function and a real-time eye movement coordinate acquisition function; the eye movement acquisition module comprises AR glasses with a display function and a real-time eye movement coordinate acquisition function; the data acquisition system also includes a synchronous acquisition program. The image acquisition module and the eye movement acquisition module share AR glasses.
A synchronous acquisition program is used to synchronously acquire the user voice description and the user eye movement coordinate sequence. During the data acquisition experiment, the subject sits on a chair while the worn AR glasses sequentially display a text description and the corresponding natural image; the subject reads the text description aloud and gazes at the target position described by the text while the natural image is displayed. The method collects 6 seconds of eye movement coordinate data during the period in which 1000 users view natural images, with the frame rate set to 60 frames per second.
2. Data preprocessing.
Image preprocessing:
Using the sequentially connected image screening module, image scaling module and image normalization module, images of poor quality (e.g., ghosted or blurred images) are removed and images of qualified quality are selected; to prevent overfitting during training, the images are randomly scaled so that the side length lies between 800 and 1333 pixels, and normalization is then applied.
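As an illustration, a minimal preprocessing sketch along these lines is shown below; reading the 800–1333 px range as applying to the shorter image side, and using ImageNet mean/std for normalization, are assumptions not stated in the text.

```python
import random
import torch
import torchvision.transforms.functional as TF
from PIL import Image

# Assumed normalization statistics (ImageNet), not specified in the patent.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def preprocess_image(img: Image.Image) -> torch.Tensor:
    # Randomly rescale so the (assumed) shorter side falls in [800, 1333] pixels.
    target_short = random.randint(800, 1333)
    w, h = img.size
    scale = target_short / min(w, h)
    img = TF.resize(img, [round(h * scale), round(w * scale)])
    # Convert to a [0, 1] tensor and normalize per channel.
    tensor = TF.to_tensor(img)
    return TF.normalize(tensor, IMAGENET_MEAN, IMAGENET_STD)
```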
Voice preprocessing:
The voice preprocessing module can adopt a dynamic time warping (DTW) algorithm for feature training and recognition, use the template matching voice recognition algorithm module for voice template matching, build a statistical model of the temporal structure of the voice signal with a hidden Markov model, and compress the signal with vector quantization. The template matching voice recognition algorithm module converts the audio information into a text sequence.
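For reference, a minimal DTW distance between two acoustic feature sequences, the core operation of template-matching recognition, could be sketched as follows; the acoustic features themselves (e.g., MFCC frames) are assumed and not shown.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic dynamic-time-warping distance between two feature sequences
    (frames x dims), as used for template-matching speech recognition."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])
```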
Eye movement preprocessing:
The eye movement preprocessing module comprises a Gaussian map generation module, which selects the 180-frame eye movement sequence from the third to the sixth second, calculates the mean and variance of the eye movement coordinate recognition error, and builds a Gaussian map sequence centered on each eye movement coordinate.
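A possible sketch of this Gaussian map generation step is given below; the standard deviation sigma would be derived from the eye tracker's recognition-error statistics, whose actual values are not given in the text.

```python
import numpy as np

def gaussian_map_sequence(coords, height, width, sigma):
    """Convert a gaze-coordinate sequence into a sequence of 2-D Gaussian maps,
    each centred on one gaze point. `sigma` is assumed to come from the
    mean/variance of the eye-tracker's recognition error."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = []
    for (cx, cy) in coords:  # coords: list of (x, y) pixel positions, e.g. 180 frames
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        maps.append(g)
    return np.stack(maps)  # shape: (num_frames, height, width)
```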
Image information processing:
The image feature extraction module comprises: a pre-trained residual convolutional neural network for extracting image features, a one-layer convolutional neural network for reducing the dimension of the image features, and a position feature extraction module for generating position features that represent the position corresponding to each feature vector in the image features. The residual convolutional neural network adopts ResNet101, pre-trained on the public dataset ImageNet; the pre-trained ResNet101 extracts features from the environment image, yielding a feature map of size (w, h), where w is the width and h is the height of the feature map.
The values of w and h are positively correlated with the size of the input image, and the feature dimension is 2048; after dimension reduction through a one-dimensional convolution kernel the dimension becomes 256, and the feature map is flattened into a sequence of length w*h, giving image features of size (w*h, 256). To represent the position information of each feature in the feature map, the position coding module generates a corresponding position feature for the image features, with the same size as the image features. The first 128 dimensions encode the position on the x axis and the last 128 dimensions encode the position on the y axis; each feature value is computed as follows:
P(pos, 2i) = sin(pos / 10000^(2i/d_model))
P(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
wherein:
P(pos, 2i) represents the coded value of the even-numbered dimensions in the position feature;
P(pos, 2i+1) represents the coded value of the odd-numbered dimensions in the position feature;
pos represents the position on the x axis or the y axis;
i represents the index of the feature dimension;
d_model represents the dimension of the position feature, which may take the value 128.
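This is the standard sinusoidal positional encoding. A sketch of how the (w*h, 256) position features could be assembled from separate x- and y-encodings is shown below; the implementation details beyond the formula (ordering, broadcasting) are assumptions.

```python
import torch

def positional_features(w: int, h: int, d_model: int = 128) -> torch.Tensor:
    """Sinusoidal position features for a (w, h) feature map: the first 128
    dims encode the x position, the last 128 the y position (256 dims total)."""
    def encode(pos: torch.Tensor, d: int) -> torch.Tensor:
        i = torch.arange(d // 2, dtype=torch.float32)
        angles = pos.unsqueeze(-1) / torch.pow(10000.0, 2 * i / d)
        pe = torch.zeros(pos.shape[0], d)
        pe[:, 0::2] = torch.sin(angles)   # even dimensions
        pe[:, 1::2] = torch.cos(angles)   # odd dimensions
        return pe

    x_pe = encode(torch.arange(w, dtype=torch.float32), d_model)  # (w, 128)
    y_pe = encode(torch.arange(h, dtype=torch.float32), d_model)  # (h, 128)
    # Broadcast to every (x, y) cell and concatenate -> (h*w, 256)
    grid_x = x_pe.unsqueeze(0).expand(h, w, d_model)
    grid_y = y_pe.unsqueeze(1).expand(h, w, d_model)
    return torch.cat([grid_x, grid_y], dim=-1).reshape(h * w, 2 * d_model)
```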
Text information processing:
The text feature extraction module comprises a RoBERTa model pre-trained on public datasets. The text description is first converted into a sequence of token ids using the vocabulary, and this sequence is then input into the RoBERTa model for encoding to obtain the text features. The text features have the same length as the input text and a dimension of 256, giving a size of (T, 256), where T is the length of the text.
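A hedged sketch of this text branch using the Hugging Face transformers library is shown below; the checkpoint name and the linear projection from RoBERTa's hidden size down to 256 dimensions are assumptions, since the patent only states that a publicly pre-trained RoBERTa yields (T, 256) text features.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

# Assumed checkpoint; any publicly pre-trained RoBERTa would fit the description.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")
project = torch.nn.Linear(encoder.config.hidden_size, 256)  # assumed 768 -> 256 projection

def text_features(description: str) -> torch.Tensor:
    tokens = tokenizer(description, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**tokens).last_hidden_state  # (1, T, hidden_size)
    return project(hidden).squeeze(0)                 # (T, 256)
```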
Eye movement information processing:
The bilinear interpolation scaling module scales the Gaussian map sequence to the same (w, h) size as the image features, with the feature dimension equal to the length of the eye movement coordinate sequence; a one-layer convolutional neural network adjusts this dimension to 256, the feature map is flattened to length w*h, and the result is fed into the long short-term memory neural network to obtain the eye movement feature representation of size (w*h, 256).
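One possible reading of this eye-movement branch, sketched below, treats the 180 Gaussian-map frames as input channels, reduces them to 256 with a single convolution, and runs the LSTM over the w*h flattened positions; this interpretation and the 1x1 kernel size are assumptions rather than confirmed details of the patented model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EyeMovementEncoder(nn.Module):
    """Rough sketch: bilinear resizing of the Gaussian-map sequence to the
    image-feature grid (w, h), a one-layer conv reducing the frame dimension
    to 256, and an LSTM over the flattened positions."""

    def __init__(self, num_frames: int = 180, dim: int = 256):
        super().__init__()
        self.reduce = nn.Conv2d(num_frames, dim, kernel_size=1)
        self.lstm = nn.LSTM(input_size=dim, hidden_size=dim, batch_first=True)

    def forward(self, gaussian_maps: torch.Tensor, w: int, h: int) -> torch.Tensor:
        # gaussian_maps: (batch, num_frames, H, W)
        x = F.interpolate(gaussian_maps, size=(h, w), mode="bilinear",
                          align_corners=False)
        x = self.reduce(x)                 # (batch, 256, h, w)
        x = x.flatten(2).transpose(1, 2)   # (batch, w*h, 256)
        out, _ = self.lstm(x)              # eye movement features
        return out                         # (batch, w*h, 256)
```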
Multi-feature fusion:
The image features, text features and eye movement features are concatenated along the first dimension to obtain a multi-modal feature vector of size (w*h + T + w*h, 256); the position features of the environment image are padded with all-zero vectors and expanded to the same length as the multi-modal feature vector, serving as the multi-modal position features.
The feature fusion module comprises a feature splicing module and a Transformer model pre-trained on public datasets; the feature splicing module is used for concatenating the image features, text features and eye movement features. The Transformer model comprises an encoder and a decoder, each of which comprises six multi-head attention modules; each multi-head attention module comprises an eight-head self-attention layer, Dropout layer A, normalization layer A, full connection layer A, an activation function layer, Dropout layer B, full connection layer B, Dropout layer C and normalization layer B, connected in sequence.
The multi-modal feature vector and the multi-modal position feature are added and then passed sequentially through the encoder and the decoder to obtain the output feature vector.
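A rough sketch of the fusion stage under these hyper-parameters (six encoder and decoder layers, eight heads, 256-dim features) is given below; the use of 100 learned object queries as the decoder input is an assumption borrowed from DETR-style detectors and is not stated in the patent.

```python
import torch
import torch.nn as nn

d_model, num_queries = 256, 100
fusion = nn.Transformer(d_model=d_model, nhead=8,
                        num_encoder_layers=6, num_decoder_layers=6,
                        batch_first=True)
# Assumed learned decoder queries (DETR-style), one per candidate target frame.
object_queries = nn.Parameter(torch.randn(num_queries, d_model))

def fuse(img_feat, txt_feat, eye_feat, pos_feat):
    # img_feat / eye_feat: (B, w*h, 256), txt_feat: (B, T, 256)
    # pos_feat: (B, w*h + T + w*h, 256) image position features padded with zeros
    tokens = torch.cat([img_feat, txt_feat, eye_feat], dim=1) + pos_feat
    queries = object_queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
    return fusion(src=tokens, tgt=queries)  # (B, 100, 256) fused output features
```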
The acquired data set is divided into a training set and a test set at a ratio of 5:1; the training set data are used to train the target positioning model, and the test set data are used to test it.
3. Result calculation.
The preprocessed test set data are input into the trained natural image co-pointing target positioning system based on eye movement and voice to predict results. Feature extraction and feature fusion are performed on the environment image, the user text description and the eye movement Gaussian map sequence, and the fused features are then fed into the full connection layer and the SoftMax layer to obtain 100 candidate target frame predictions for each image together with their 100 corresponding confidence values; model accuracy is computed by comparing all test set results with the ground truth.
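A minimal sketch of the prediction heads is shown below; splitting the output into a box-regression head and a two-class (target/background) head whose SoftMax gives the confidence is an assumed arrangement, since the patent only names a full connection layer and a SoftMax layer.

```python
import torch
import torch.nn as nn

box_head = nn.Linear(256, 4)   # (cx, cy, w, h) per candidate (assumed box format)
cls_head = nn.Linear(256, 2)   # target vs. background logits (assumed)

def predict(fused: torch.Tensor):
    # fused: (batch, 100, 256) output feature vectors from the fusion module
    boxes = box_head(fused)                                      # (batch, 100, 4)
    confidence = torch.softmax(cls_head(fused), dim=-1)[..., 0]  # (batch, 100)
    return boxes, confidence
```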
For each image, the candidate frame with the highest confidence is selected as the prediction result, and its intersection-over-union (IoU) with the ground-truth target frame is calculated as follows:
IoU = area(A ∩ B) / area(A ∪ B)
where A is the predicted target frame and B is the ground-truth target frame; the IoU is the ratio of the area of their intersection to the area of their union. A prediction with an IoU greater than 0.5 is considered correct. The prediction accuracy is the ratio of the number of correctly predicted images to the total number of test images. Table 1 compares the test results of the present method with the results of single-modality voice input:
TABLE 1
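For completeness, a minimal sketch of the IoU scoring used to judge correctness is given below; boxes are assumed to be in (x1, y1, x2, y2) corner format, which the patent does not specify.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0

# A prediction counts as correct when iou(predicted_box, ground_truth) > 0.5;
# accuracy is the fraction of test images whose top-confidence box passes this check.
```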
The above-described embodiments are only for illustrating the technical spirit and features of the present invention, and it is intended to enable those skilled in the art to understand the content of the present invention and to implement it accordingly, and the scope of the present invention is not limited to the embodiments, i.e. equivalent changes or modifications to the spirit of the present invention are still within the scope of the present invention.

Claims (9)

1. The natural image co-pointing target positioning system based on eye movement and voice is characterized by comprising a natural image co-pointing target positioning system and a data acquisition and preprocessing system; the natural image co-pointing target positioning system comprises an image feature extraction module, a text feature extraction module, an eye movement feature extraction module, a feature fusion module, a target identification module and a confidence calculation module; the data acquisition and preprocessing system comprises an image acquisition module, an image preprocessing module, a voice acquisition module, a voice preprocessing module, an eye movement acquisition module and an eye movement preprocessing module; the image acquisition module, the image preprocessing module and the image feature extraction module are sequentially connected; the voice acquisition module, the voice preprocessing module and the text feature extraction module are sequentially connected; the eye movement acquisition module, the eye movement preprocessing module and the eye movement characteristic extraction module are sequentially connected; the feature fusion module, the target identification module and the confidence coefficient calculation module are connected in sequence; the input end of the feature fusion module is respectively connected with the output ends of the image feature extraction module, the text feature extraction module and the eye movement feature extraction module;
the image acquisition module is used for acquiring an environment image in front of a user in real time, the image preprocessing module is used for preprocessing the acquired image, and the image feature extraction module is used for extracting image features of the preprocessed image;
the voice acquisition module is used for acquiring voice description information sent by a user, the voice preprocessing module is used for preprocessing the acquired voice information into text information, and the text feature extraction module is used for extracting text features of the preprocessed text information;
the eye movement acquisition module is used for acquiring a real-time eye movement coordinate sequence of a user, the eye movement preprocessing module is used for preprocessing the acquired eye movement coordinate sequence, and the eye movement characteristic extraction module is used for extracting eye movement characteristics of the preprocessed eye movement coordinate sequence;
the feature fusion module is used for fusing the image features, the text features and the eye movement features and generating multi-modal features; the feature fusion module comprises a feature splicing module and a Transformer model pre-trained on public datasets, wherein the feature splicing module is used for concatenating the image features, text features and eye movement features; the Transformer model comprises an encoder and a decoder, wherein the encoder and the decoder each comprise six multi-head attention modules; each multi-head attention module comprises an eight-head self-attention layer, Dropout layer A, normalization layer A, full connection layer A, an activation function layer, Dropout layer B, full connection layer B, Dropout layer C and normalization layer B, connected in sequence;
the feature fusion module fuses the image feature, the text feature and the eye movement feature and generates a multi-modal feature, which comprises the following steps:
performing feature concatenation on the image features, the text features and the eye movement features along the first dimension to obtain a multi-modal feature vector; the feature size of the multi-modal feature vector is (w*h + T + w*h, 256), where w is the width of the feature map, h is the height of the feature map, and T is the length of the text;
filling the position features of the environment image with all zero vectors, and expanding the position features to be the same as the length of the multi-mode feature vectors to serve as multi-mode position features; the position feature of the environment image is the position feature of the position corresponding to each feature vector in the image feature;
adding the multi-mode feature vector and the multi-mode position feature, and sequentially inputting the multi-mode feature vector and the multi-mode position feature to the encoder and the decoder to obtain an output feature vector;
the target recognition module is used for reducing the dimension of the multi-mode features and generating candidate target frames;
the confidence calculating module is used for calculating the confidence of all candidate target frames.
2. The natural image co-pointing target positioning system based on eye movement and voice according to claim 1, wherein the image preprocessing module comprises an image screening module, an image scaling module and an image normalization module which are connected in sequence; the image screening module is used for selecting images with qualified quality; the image scaling module is used for randomly scaling the image; the image normalization module is used for normalizing the image pixel values.
3. The eye movement and voice based natural image co-pointing object localization system of claim 1, wherein the image feature extraction module comprises: a pre-trained residual convolutional neural network for extracting image features, a one-layer convolutional neural network for reducing the dimension of the image features, and a position feature extraction module for generating position features that represent the position corresponding to each feature vector in the image features.
4. The eye movement and voice based natural image co-pointing target positioning system of claim 1, wherein the voice preprocessing module comprises a template matching voice recognition algorithm module; the template matching voice recognition algorithm module is used for converting the audio information into a text sequence.
5. The eye movement and voice based natural image co-pointing object localization system of claim 1, wherein the text feature extraction module comprises a RoBERTa model that performs pre-training using a common dataset, the RoBERTa model being used to generate the text-embedded expressions.
6. The eye movement and voice based natural image co-pointing target positioning system of claim 1, wherein the eye movement preprocessing module comprises a Gaussian map generation module, the Gaussian map generation module being used for converting the eye movement coordinate sequence into a Gaussian map sequence according to the mean and variance of the eye movement recognition error.
7. The eye movement and voice based natural image co-pointing object localization system of claim 1, wherein the eye movement feature extraction module comprises: a bilinear interpolation scaling module for scaling the Gaussian maps to the size of the image features, a one-layer convolutional neural network for reducing the feature dimension, and a long short-term memory neural network for computing the eye movement features.
8. The eye movement and voice based natural image co-pointing object localization system of claim 1, wherein the object recognition module comprises a fully connected layer; the confidence computation module includes a SoftMax layer.
9. An eye movement and voice based natural image co-pointing object localization method using the eye movement and voice based natural image co-pointing object localization system according to any one of claims 1 to 8, comprising the steps of:
step 1, acquiring environment images, user voice description information and data of a user eye movement coordinate sequence through experiments by utilizing a data acquisition and preprocessing system, preprocessing the data to manufacture a training sample set and a test set, and training and testing a natural image co-pointing target positioning system by using the training sample set and the test set;
step 2, synchronously acquiring corresponding environment images, user voice description information and user eye movement coordinate sequences by the image acquisition module, the voice acquisition module and the eye movement acquisition module;
step 3, the image preprocessing module, the voice preprocessing module and the eye movement preprocessing module correspondingly preprocess the acquired environment image, the user voice description information and the user eye movement coordinate sequence;
step 4, correspondingly inputting the preprocessed environment image, the user voice description information and the user eye movement coordinate sequence into an image feature extraction module, a text feature extraction module and an eye movement feature extraction module to correspondingly obtain image features, text features and eye movement features;
step 5, the image features, the text features and the eye movement features are fused by a feature fusion module to obtain multi-modal features, and the multi-modal features are subjected to dimension reduction by a target recognition module to generate candidate target frames; calculating the confidence coefficient of all candidate target frames by a confidence coefficient calculation module; filtering the candidate frames according to the ordering of the confidence degrees, and determining a target frame;
step 6, confirming whether the target frame result is correct or not by the user; if the result of the target frame is correct, the result is saved, otherwise, the steps 2 to 6 are repeated.
CN202210906536.4A 2022-07-29 2022-07-29 Natural image co-pointing target positioning system and method based on eye movement and voice Active CN115237255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210906536.4A CN115237255B (en) 2022-07-29 2022-07-29 Natural image co-pointing target positioning system and method based on eye movement and voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210906536.4A CN115237255B (en) 2022-07-29 2022-07-29 Natural image co-pointing target positioning system and method based on eye movement and voice

Publications (2)

Publication Number Publication Date
CN115237255A CN115237255A (en) 2022-10-25
CN115237255B true CN115237255B (en) 2023-10-31

Family

ID=83676884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210906536.4A Active CN115237255B (en) 2022-07-29 2022-07-29 Natural image co-pointing target positioning system and method based on eye movement and voice

Country Status (1)

Country Link
CN (1) CN115237255B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116185182B (en) * 2022-12-30 2023-10-03 天津大学 Controllable image description generation system and method for fusing eye movement attention
CN116385757B (en) * 2022-12-30 2023-10-31 天津大学 Visual language navigation system and method based on VR equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109814722A (en) * 2019-02-25 2019-05-28 苏州长风航空电子有限公司 A kind of multi-modal man-machine interactive system and exchange method
CN110495854A (en) * 2019-07-30 2019-11-26 科大讯飞股份有限公司 Feature extracting method, device, electronic equipment and storage medium
CN111967334A (en) * 2020-07-20 2020-11-20 中国人民解放军军事科学院国防科技创新研究院 Human body intention identification method, system and storage medium
JP2021015189A (en) * 2019-07-11 2021-02-12 中部電力株式会社 Multi-modal voice recognition device and multi-modal voice recognition method
CN112424727A (en) * 2018-05-22 2021-02-26 奇跃公司 Cross-modal input fusion for wearable systems

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2018256365A1 (en) * 2017-04-19 2019-10-31 Magic Leap, Inc. Multimodal task execution and text editing for a wearable system
KR102168802B1 (en) * 2018-09-20 2020-10-22 한국전자통신연구원 Apparatus and method for interaction

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112424727A (en) * 2018-05-22 2021-02-26 奇跃公司 Cross-modal input fusion for wearable systems
CN109814722A (en) * 2019-02-25 2019-05-28 苏州长风航空电子有限公司 A kind of multi-modal man-machine interactive system and exchange method
JP2021015189A (en) * 2019-07-11 2021-02-12 中部電力株式会社 Multi-modal voice recognition device and multi-modal voice recognition method
CN110495854A (en) * 2019-07-30 2019-11-26 科大讯飞股份有限公司 Feature extracting method, device, electronic equipment and storage medium
CN111967334A (en) * 2020-07-20 2020-11-20 中国人民解放军军事科学院国防科技创新研究院 Human body intention identification method, system and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Target selection technology in multi-modal interaction; 周小舟 et al.; Packaging Engineering (包装工程); pp. 36-44 *

Also Published As

Publication number Publication date
CN115237255A (en) 2022-10-25

Similar Documents

Publication Publication Date Title
KR102266529B1 (en) Method, apparatus, device and readable storage medium for image-based data processing
CN115237255B (en) Natural image co-pointing target positioning system and method based on eye movement and voice
JP6351689B2 (en) Attention based configurable convolutional neural network (ABC-CNN) system and method for visual question answering
US11783615B2 (en) Systems and methods for language driven gesture understanding
CN111161200A (en) Human body posture migration method based on attention mechanism
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN114511906A (en) Cross-modal dynamic convolution-based video multi-modal emotion recognition method and device and computer equipment
CN113822192A (en) Method, device and medium for identifying emotion of escort personnel based on Transformer multi-modal feature fusion
CN110826462A (en) Human body behavior identification method of non-local double-current convolutional neural network model
CN108538283B (en) Method for converting lip image characteristics into voice coding parameters
CN114120432A (en) Online learning attention tracking method based on sight estimation and application thereof
CN114550057A (en) Video emotion recognition method based on multi-modal representation learning
CN114140885A (en) Emotion analysis model generation method and device, electronic equipment and storage medium
CN112768070A (en) Mental health evaluation method and system based on dialogue communication
CN114724224A (en) Multi-mode emotion recognition method for medical care robot
CN112906520A (en) Gesture coding-based action recognition method and device
CN114170537A (en) Multi-mode three-dimensional visual attention prediction method and application thereof
Ahammad et al. Recognizing Bengali sign language gestures for digits in real time using convolutional neural network
CN114913342A (en) Motion blurred image line segment detection method and system fusing event and image
CN116434252A (en) Training of image recognition model and image recognition method, device, medium and equipment
CN113453065A (en) Video segmentation method, system, terminal and medium based on deep learning
Zhang et al. C2st: Cross-modal contextualized sequence transduction for continuous sign language recognition
CN115116117A (en) Learning input data acquisition method based on multi-mode fusion network
CN114663910A (en) Multi-mode learning state analysis system
CN114359785A (en) Lip language identification method and device based on adaptive matrix feature fusion network and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant