CN115237255B - Natural image co-pointing target positioning system and method based on eye movement and voice - Google Patents

Natural image co-pointing target positioning system and method based on eye movement and voice Download PDF

Info

Publication number
CN115237255B
CN115237255B
Authority
CN
China
Prior art keywords
module
image
eye movement
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210906536.4A
Other languages
Chinese (zh)
Other versions
CN115237255A (en)
Inventor
张珺倩
黄如强
杨超
王宁慈
于文东
张久松
耿震
孟祥轶
任晓琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202210906536.4A priority Critical patent/CN115237255B/en
Publication of CN115237255A publication Critical patent/CN115237255A/en
Application granted granted Critical
Publication of CN115237255B publication Critical patent/CN115237255B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Acoustics & Sound (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a natural image co-pointing target positioning system based on eye movement and voice, which comprises a natural image co-pointing target positioning system and a data acquisition and preprocessing system. The natural image co-pointing target positioning system comprises an image feature extraction module, a text feature extraction module, an eye movement feature extraction module, a feature fusion module, a target identification module and a confidence calculation module; the data acquisition and preprocessing system comprises an image acquisition module, an image preprocessing module, a voice acquisition module, a voice preprocessing module, an eye movement acquisition module and an eye movement preprocessing module. The feature fusion module, the target identification module and the confidence calculation module are connected in sequence, and the input end of the feature fusion module is connected to the output ends of the image feature extraction module, the text feature extraction module and the eye movement feature extraction module respectively. The invention acquires the user's eye movement input and voice input at the same time and exploits the complementary advantages of multi-modal information, which helps achieve efficient human-computer interaction.

Description

Natural image co-pointing target positioning system and method based on eye movement and voice
Technical Field
The invention relates to a human-computer interaction system and a human-computer interaction method, in particular to a natural image co-pointing target positioning system and a natural image co-pointing target positioning method based on eye movement and voice.
Background
Currently, existing human-computer interaction systems generally adopt a single-modality input form, such as touch/key input or voice input. Human expression, however, is not limited to a single modality and often exhibits multi-modal characteristics. When human intention must be comprehensively understood and a target must be located, single-modality interaction methods or systems have low accuracy, adapt poorly to different interaction environments, and make natural, intelligent human-machine cooperation difficult to realize.
In existing research, single-modality image-text matching methods combine computer vision and natural language processing: they analyze human intention by parsing the language structure of the text description while referring to the visual information acquired by the system, applying various reasoning strategies. These methods mainly include models based on holistic feature representation, modular models, and graph-based models. Models based on holistic feature representation use a single vector to represent image-text features, ignoring the complex textual context and the spatial structure within the image; modular models simplify the language structure, ignore relations between visual targets, and are unsuitable for long sentences; graph-based models can express graph structures but rely on a target detection model and a text parsing model.
In addition, human gaze information, as an important carrier of human intention, has also been applied in the field of human-computer interaction. Research on gaze estimation and tracking reveals the gazing direction of the human eyes, and on this basis eye movement information has been combined with natural images to predict salient targets, gazed targets, and related tasks. However, with single-modality input, when multiple similar objects appear in the image it is difficult for an algorithmic model to correctly identify the referred target.
Disclosure of Invention
The invention provides a natural image co-pointing target positioning system and method based on eye movement and voice for solving the technical problems in the prior art.
The technical solution adopted by the invention to solve the technical problems in the prior art is as follows:
a natural image co-pointing target positioning system based on eye movement and voice comprises a natural image co-pointing target positioning system and a data acquisition and preprocessing system; the natural image co-pointing target positioning system comprises an image feature extraction module, a text feature extraction module, an eye movement feature extraction module, a feature fusion module, a target identification module and a confidence calculation module; the data acquisition and preprocessing system comprises an image acquisition module, an image preprocessing module, a voice acquisition module, a voice preprocessing module, an eye movement acquisition module and an eye movement preprocessing module; the image acquisition module, the image preprocessing module and the image feature extraction module are sequentially connected; the voice acquisition module, the voice preprocessing module and the text feature extraction module are sequentially connected; the eye movement acquisition module, the eye movement preprocessing module and the eye movement characteristic extraction module are sequentially connected; the feature fusion module, the target identification module and the confidence coefficient calculation module are connected in sequence; the input end of the feature fusion module is respectively connected with the output ends of the image feature extraction module, the text feature extraction module and the eye movement feature extraction module;
the image acquisition module is used for acquiring an environment image in front of a user in real time, the image preprocessing module is used for preprocessing the acquired image, and the image feature extraction module is used for extracting image features of the preprocessed image;
the voice acquisition module is used for acquiring voice description information sent by a user, the voice preprocessing module is used for preprocessing the acquired voice information into text information, and the text feature extraction module is used for extracting text features of the preprocessed text information;
the eye movement acquisition module is used for acquiring a real-time eye movement coordinate sequence of a user, the eye movement preprocessing module is used for preprocessing the acquired eye movement coordinate sequence, and the eye movement characteristic extraction module is used for extracting eye movement characteristics of the preprocessed eye movement coordinate sequence;
the feature fusion module is used for fusing the image features, the text features and the eye movement features and generating multi-modal features;
the target recognition module is used for reducing the dimension of the multi-mode features and generating candidate target frames;
the confidence calculating module is used for calculating the confidence of all candidate target frames.
Further, the image preprocessing module comprises an image screening module, an image scaling module and an image normalization module which are connected in sequence; the image screening module is used for selecting images with qualified quality; the image scaling module is used for randomly scaling the image; the image normalization module is used for normalizing the image pixel values.
Further, the image feature extraction module comprises: a pre-trained residual convolutional neural network for extracting image features, a one-layer convolutional neural network for reducing the dimension of the image features, and a position feature extraction module for generating position features that represent the position corresponding to each feature vector in the image features.
Further, the voice preprocessing module comprises a template matching voice recognition algorithm module; the template matching voice recognition algorithm module is used for converting the audio information into a text sequence.
Further, the text feature extraction module comprises a RoBERTa model pre-trained on public datasets; the RoBERTa model is used to generate text embedding representations.
Further, the eye movement preprocessing module comprises a Gaussian map generation module, which is used for converting the eye movement coordinate sequence into a Gaussian map sequence according to the mean and variance of the eye movement recognition error.
Further, the eye movement feature extraction module comprises: a bilinear interpolation scaling module for scaling the Gaussian maps to the size of the image features, a one-layer convolutional neural network for reducing the feature dimension, and a long short-term memory neural network for computing the eye movement features.
Further, the feature fusion module comprises a feature splicing module and a Transformer model pre-trained on public datasets, wherein the feature splicing module is used for concatenating the image features, text features and eye movement features; the Transformer model comprises an encoder and a decoder, each of which comprises six multi-head attention modules; each multi-head attention module comprises an eight-head self-attention layer, Dropout layer A, normalization layer A, full connection layer A, an activation function layer, Dropout layer B, full connection layer B, Dropout layer C and normalization layer B, connected in sequence.
Further, the target recognition module comprises a full connection layer; the confidence computation module includes a SoftMax layer.
The invention also provides a natural image co-pointing target positioning method based on eye movement and voice by using the natural image co-pointing target positioning system based on eye movement and voice, which comprises the following steps:
step 1, acquiring environment images, user voice description information and data of a user eye movement coordinate sequence through experiments by utilizing a data acquisition and preprocessing system, preprocessing the data to manufacture a training sample set and a test set, and training and testing a natural image co-pointing target positioning system by using the training sample set and the test set;
step 2, synchronously acquiring corresponding environment images, user voice description information and user eye movement coordinate sequences by the image acquisition module, the voice acquisition module and the eye movement acquisition module;
step 3, the image preprocessing module, the voice preprocessing module and the eye movement preprocessing module correspondingly preprocess the acquired environment image, the user voice description information and the user eye movement coordinate sequence;
step 4, correspondingly inputting the preprocessed environment image, the user voice description information and the user eye movement coordinate sequence into an image feature extraction module, a text feature extraction module and an eye movement feature extraction module to correspondingly obtain image features, text features and eye movement features;
step 5, the image features, the text features and the eye movement features are fused by a feature fusion module to obtain multi-modal features, and the multi-modal features are subjected to dimension reduction by a target recognition module to generate candidate target frames; calculating the confidence coefficient of all candidate target frames by a confidence coefficient calculation module; filtering the candidate frames according to the ordering of the confidence degrees, and determining a target frame;
step 6, confirming whether the target frame result is correct or not by the user; if the result of the target frame is correct, the result is saved, otherwise, the steps 2 to 5 are repeated.
The invention has the advantages and positive effects that:
1. Unlike existing single-modality human-computer interaction approaches, the natural image co-pointing target positioning system and method based on eye movement and voice provided by the invention acquire the user's eye movement input and voice input simultaneously, exploit the complementary advantages of multi-modal information, understand the user's expressed intention more comprehensively, and help achieve an efficient and natural human-computer interaction effect.
2. The natural image co-pointing target positioning system and method based on eye movement and voice establish a deep learning model that fuses features of multiple modalities and extract eye movement features through a long short-term memory neural network; compared with a model without eye movement features, the positioning accuracy is significantly improved.
Drawings
Fig. 1 is a schematic structural diagram of a natural image co-pointing object positioning system based on eye movement and voice according to the present invention.
Fig. 2 is a schematic diagram of the operation of the multi-modal object localization of the eye movement and voice based natural image co-pointing object localization system of the present invention.
Fig. 3 is a workflow diagram of a natural image co-pointing object localization method based on eye movement and voice of the present invention.
Detailed Description
For a further understanding of the invention, its features and advantages, reference is now made to the following examples, which are illustrated in the accompanying drawings in which:
the following English words and English abbreviations in the invention are defined as follows:
ResNet101: a convolutional neural network with a residual structure, in which certain layers skip the connection to the neurons of the immediately following layer, realizing cross-layer connections and weakening the tight coupling between adjacent layers, thereby alleviating the degradation problem in deep networks. The number 101 indicates that the model contains 101 convolutional and fully connected layers.
Transformer model: a deep neural network based on a self-attention mechanism, originally used for machine translation, composed of an encoder module and a decoder module.
Dropout layer: a neural network layer used during deep learning training that reduces overfitting by randomly setting half of the feature values to zero.
RoBERTa model: a pre-trained text feature encoder based on a self-attention mechanism.
SoftMax layer: a neural network layer that uses the softmax function to convert feature value outputs into a probability distribution over [0, 1] that sums to 1.
AR glasses: augmented reality glasses, a hardware device in the form of glasses that fuses virtual information with the real world.
ImageNet: a large natural image public dataset for computer vision research.
CC-NEWS, OPENWEBTEXT, STORIES, BOOKCORPUS: public text datasets.
Flickr30k, MS COCO and Visual Genome: public datasets in the image-text cross-modal field.
Referring to fig. 1 to 3, a natural image co-pointing target positioning system based on eye movement and voice includes a natural image co-pointing target positioning system, a data acquisition and preprocessing system; the natural image co-pointing target positioning system comprises an image feature extraction module, a text feature extraction module, an eye movement feature extraction module, a feature fusion module, a target identification module and a confidence calculation module; the data acquisition and preprocessing system comprises an image acquisition module, an image preprocessing module, a voice acquisition module, a voice preprocessing module, an eye movement acquisition module and an eye movement preprocessing module; the image acquisition module, the image preprocessing module and the image feature extraction module are sequentially connected; the voice acquisition module, the voice preprocessing module and the text feature extraction module are sequentially connected; the eye movement acquisition module, the eye movement preprocessing module and the eye movement characteristic extraction module are sequentially connected; the feature fusion module, the target identification module and the confidence coefficient calculation module are connected in sequence; the input end of the feature fusion module is respectively connected with the output ends of the image feature extraction module, the text feature extraction module and the eye movement feature extraction module.
The image acquisition module is used for acquiring an environment image in front of a user in real time, the image preprocessing module is used for preprocessing the acquired image, and the image feature extraction module is used for extracting image features of the preprocessed image.
The voice acquisition module is used for acquiring voice description information sent by a user, the voice preprocessing module is used for preprocessing the acquired voice information into text information, and the text feature extraction module is used for extracting text features of the preprocessed text information.
The eye movement collection module is used for collecting real-time eye movement coordinate sequences of users, the eye movement preprocessing module is used for preprocessing the collected eye movement coordinate sequences, and the eye movement feature extraction module is used for extracting eye movement features of the preprocessed eye movement coordinate sequences.
The feature fusion module is used for fusing the image features, the text features and the eye movement features and generating multi-modal features.
The target recognition module is used for reducing the dimension of the multi-mode features and generating candidate target frames.
The confidence calculating module is used for calculating the confidence of all candidate target frames.
The voice acquisition module may include a microphone; the image acquisition module can comprise AR glasses with a display function and a real-time eye movement coordinate acquisition function; the eye movement acquisition module can comprise AR glasses with a display function and a real-time eye movement coordinate acquisition function; the data acquisition system may also include a synchronous acquisition program. The image acquisition module and the eye movement acquisition module can share AR glasses.
The data acquisition and preprocessing system can acquire data such as environment images, user voice description information and user eye movement coordinate sequences, preprocess the data, and compile training and verification samples for training and verifying the target positioning model. The acquired data set is divided into a training set and a test set at a ratio between 2:1 and 5:1; the training set data are used to train the target positioning model, and the test set data are used to test it.
Preferably, the image preprocessing module can comprise an image screening module, an image scaling module and an image normalization module which are sequentially connected; the image screening module is used for selecting images with qualified quality; the image scaling module is used for randomly scaling the image; the image normalization module is used for normalizing the image pixel values.
Preferably, the image feature extraction module may comprise: a pre-trained residual convolutional neural network for extracting image features, a one-layer convolutional neural network for reducing the dimension of the image features, and a position feature extraction module for generating position features that represent the position corresponding to each feature vector in the image features.
Preferably, the voice preprocessing module may include a template matching voice recognition algorithm module; the template matching voice recognition algorithm module is used for converting the audio information into a text sequence.
Preferably, the text feature extraction module may comprise a RoBERTa model pre-trained on public datasets; the RoBERTa model is used to generate text embedding representations. Pre-training of the RoBERTa model uses a total of 160 GB of training text, including the public datasets CC-NEWS, OPENWEBTEXT, STORIES, BOOKCORPUS and Wikipedia.
Preferably, the eye movement preprocessing module may comprise a Gaussian map generation module for converting the eye movement coordinate sequence into a Gaussian map sequence according to the mean and variance of the eye movement recognition error.
Preferably, the eye movement feature extraction module may comprise: a bilinear interpolation scaling module for scaling the Gaussian maps to the size of the image features, a one-layer convolutional neural network for reducing the feature dimension, and a long short-term memory neural network for computing the eye movement features.
Preferably, the feature fusion module may comprise a feature splicing module and a Transformer model pre-trained on public datasets; the feature splicing module is used for concatenating the image features, text features and eye movement features. The Transformer model may comprise an encoder and a decoder, each of which may comprise six multi-head attention modules; each multi-head attention module comprises an eight-head self-attention layer, Dropout layer A, normalization layer A, full connection layer A, an activation function layer, Dropout layer B, full connection layer B, Dropout layer C and normalization layer B, connected in sequence.
Dropout layer A, Dropout layer B and Dropout layer C are all Dropout layers; full connection layer A, full connection layer B and the like are all full connection layers; normalization layer A, normalization layer B and the like are all normalization layers. The letter suffixes are added only to distinguish individual instances.
The public datasets used to pre-train the Transformer model are the Flickr30k, MS COCO and Visual Genome datasets.
Preferably, the object recognition module may comprise a fully connected layer; the confidence computation module includes a SoftMax layer.
The above-mentioned image acquisition module, voice acquisition module, eye movement acquisition module, image preprocessing module, image screening module, image scaling module, image normalization module, image feature extraction module, residual convolutional neural network, one-layer convolutional neural network, position feature extraction module, voice preprocessing module, template matching voice recognition algorithm module, text feature extraction module, RoBERTa model, eye movement preprocessing module, Gaussian map generation module, eye movement feature extraction module, bilinear interpolation scaling module, long short-term memory neural network, feature fusion module, feature splicing module, Transformer model, encoder, decoder, six-layer multi-head attention module, eight-head self-attention layer, Dropout layer A, normalization layer A, full connection layer A, activation function layer, Dropout layer B, full connection layer B, Dropout layer C, normalization layer B, target recognition module, full connection layer, confidence calculation module, SoftMax layer and other functional modules can all be implemented with existing technology, or constructed from components and functional modules in the prior art using conventional techniques.
The invention also provides an embodiment of a natural image co-pointing target positioning method based on eye movement and voice by using the natural image co-pointing target positioning system based on eye movement and voice, which comprises the following steps:
step 1, acquiring environment images, user voice description information and data of a user eye movement coordinate sequence through experiments by utilizing a data acquisition and preprocessing system, preprocessing the data, manufacturing the preprocessed data into a training sample set and a test set, and training and testing a natural image co-pointing target positioning system by using the training sample set and the test set.
Step 2, the image acquisition module, the voice acquisition module and the eye movement acquisition module synchronously acquire the corresponding environment image, user voice description information and user eye movement coordinate sequence.
Step 3, the image preprocessing module, the voice preprocessing module and the eye movement preprocessing module correspondingly preprocess the acquired environment image, user voice description information and user eye movement coordinate sequence.
Step 4, the preprocessed environment image, user voice description information and user eye movement coordinate sequence are correspondingly input into the image feature extraction module, the text feature extraction module and the eye movement feature extraction module to obtain the image features, text features and eye movement features.
Step 5, the image features, the text features and the eye movement features are fused by a feature fusion module to obtain multi-modal features, and the multi-modal features are subjected to dimension reduction by a target recognition module to generate candidate target frames; calculating the confidence coefficient of all candidate target frames by a confidence coefficient calculation module; and filtering the candidate frames according to the ordering of the confidence degrees, and determining the target frame.
Step 6, confirming whether the target frame result is correct or not by the user; if the result of the target frame is correct, the result is saved, otherwise, the steps 2 to 5 are repeated.
The workflow and working principle of the invention are further described in the following with a preferred embodiment of the invention:
a natural image co-pointing target positioning system based on eye movement and voice, comprising: the device comprises an image acquisition module, an image preprocessing module and an image feature extraction module which are connected in sequence; the device comprises a voice acquisition module, a voice preprocessing module and a text feature extraction module which are connected in sequence; the device comprises an eye movement acquisition module, an eye movement preprocessing module and an eye movement characteristic extraction module which are connected in sequence; the system also comprises a feature fusion module, a target identification module and a confidence coefficient calculation module which are connected in sequence; the input end of the feature fusion module is respectively connected with the output ends of the image feature extraction module, the text feature extraction module and the eye movement feature extraction module.
The image preprocessing module comprises an image screening module, an image scaling module and an image normalization module which are connected in sequence; the image screening module is used for selecting images with qualified quality; the image scaling module is used for randomly scaling the image; the image normalization module is used for normalizing the image pixel values.
The image feature extraction module comprises: a pre-trained residual convolutional neural network for extracting image features, a one-layer convolutional neural network for reducing the dimension of the image features, and a position feature extraction module for generating position features that represent the position corresponding to each feature vector in the image features.
The voice preprocessing module comprises a template matching voice recognition algorithm module; the template matching voice recognition algorithm module is used for converting the audio information into a text sequence.
The text feature extraction module includes a RoBERTa model that uses the common dataset to complete the pre-training, the RoBERTa model being used to generate the text-embedded expression.
The eye movement preprocessing module comprises a Gaussian map generation module, which is used for converting the eye movement coordinate sequence into a Gaussian map sequence according to the mean and variance of the eye movement recognition error.
The eye movement feature extraction module comprises: a bilinear interpolation scaling module for scaling the Gaussian maps to the size of the image features, a one-layer convolutional neural network for reducing the feature dimension, and a long short-term memory neural network for computing the eye movement features.
The feature fusion module comprises a feature splicing module and a Transformer model pre-trained on public datasets; the feature splicing module is used for concatenating the image features, text features and eye movement features. The Transformer model comprises an encoder and a decoder, each of which comprises six multi-head attention modules; each multi-head attention module comprises an eight-head self-attention layer, Dropout layer A, normalization layer A, full connection layer A, an activation function layer, Dropout layer B, full connection layer B, Dropout layer C and normalization layer B, connected in sequence.
The target recognition module comprises a full connection layer; the confidence computation module includes a SoftMax layer.
The positioning method of the natural image co-pointing target positioning system based on eye movement and voice comprises the following steps:
1. Data acquisition.
The invention provides a data acquisition system comprising an image acquisition module, a voice acquisition module and an eye movement acquisition module, wherein the voice acquisition module comprises a microphone; the image acquisition module comprises AR glasses with a display function and a real-time eye movement coordinate acquisition function; the eye movement acquisition module comprises AR glasses with a display function and a real-time eye movement coordinate acquisition function; the data acquisition system also includes a synchronous acquisition program. The image acquisition module and the eye movement acquisition module share AR glasses.
A synchronous acquisition program is used to synchronously acquire the user voice description and the user eye movement coordinate sequence. During the data acquisition experiment, the subject sits on a chair while the worn AR glasses sequentially display a text description and the corresponding natural image; the subject reads the text description aloud and gazes at the target position described by the text while the natural image is displayed. The method collects 6 seconds of eye movement coordinate data during the period in which 1000 users view natural images, with the frame rate set to 60 frames per second.
2. Data preprocessing.
Image preprocessing:
Using the sequentially connected image screening module, image scaling module and image normalization module, images of poor quality (e.g., ghosted or blurred images) are removed and images of qualified quality are selected; to prevent overfitting during training, the images are randomly scaled so that the side length lies between 800 and 1333 pixels, and normalization is then applied.
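As an illustration, a minimal preprocessing sketch along these lines is shown below; reading the 800–1333 px range as applying to the shorter image side, and using ImageNet mean/std for normalization, are assumptions not stated in the text.

```python
import random
import torch
import torchvision.transforms.functional as TF
from PIL import Image

# Assumed normalization statistics (ImageNet), not specified in the patent.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def preprocess_image(img: Image.Image) -> torch.Tensor:
    # Randomly rescale so the (assumed) shorter side falls in [800, 1333] pixels.
    target_short = random.randint(800, 1333)
    w, h = img.size
    scale = target_short / min(w, h)
    img = TF.resize(img, [round(h * scale), round(w * scale)])
    # Convert to a [0, 1] tensor and normalize per channel.
    tensor = TF.to_tensor(img)
    return TF.normalize(tensor, IMAGENET_MEAN, IMAGENET_STD)
```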
Voice preprocessing:
The voice preprocessing module can adopt a dynamic time warping (DTW) algorithm for feature training and recognition, use the template matching voice recognition algorithm module for voice template matching, build a statistical model of the temporal structure of the voice signal with a hidden Markov model, and compress the signal with vector quantization. The template matching voice recognition algorithm module converts the audio information into a text sequence.
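For reference, a minimal DTW distance between two acoustic feature sequences, the core operation of template-matching recognition, could be sketched as follows; the acoustic features themselves (e.g., MFCC frames) are assumed and not shown.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic dynamic-time-warping distance between two feature sequences
    (frames x dims), as used for template-matching speech recognition."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])
```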
Eye movement preprocessing:
The eye movement preprocessing module comprises a Gaussian map generation module, which selects the 180-frame eye movement sequence from the third to the sixth second, calculates the mean and variance of the eye movement coordinate recognition error, and builds a Gaussian map sequence centered on each eye movement coordinate.
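A possible sketch of this Gaussian map generation step is given below; the standard deviation sigma would be derived from the eye tracker's recognition-error statistics, whose actual values are not given in the text.

```python
import numpy as np

def gaussian_map_sequence(coords, height, width, sigma):
    """Convert a gaze-coordinate sequence into a sequence of 2-D Gaussian maps,
    each centred on one gaze point. `sigma` is assumed to come from the
    mean/variance of the eye-tracker's recognition error."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = []
    for (cx, cy) in coords:  # coords: list of (x, y) pixel positions, e.g. 180 frames
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        maps.append(g)
    return np.stack(maps)  # shape: (num_frames, height, width)
```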
Image information processing:
The image feature extraction module comprises: a pre-trained residual convolutional neural network for extracting image features, a one-layer convolutional neural network for reducing the dimension of the image features, and a position feature extraction module for generating position features that represent the position corresponding to each feature vector in the image features. The residual convolutional neural network adopts ResNet101, pre-trained on the public dataset ImageNet; the pre-trained ResNet101 extracts features from the environment image, yielding a feature map of size (w, h), where w is the width and h is the height of the feature map.
The values of w and h are positively correlated with the size of the input image, and the feature dimension is 2048; after dimension reduction through a one-dimensional convolution kernel the dimension becomes 256, and the feature map is flattened into a sequence of length w*h, giving image features of size (w*h, 256). To represent the position information of each feature in the feature map, the position coding module generates a corresponding position feature for the image features, with the same size as the image features. The first 128 dimensions encode the position on the x axis and the last 128 dimensions encode the position on the y axis; each feature value is computed as follows:
P(pos, 2i) = sin(pos / 10000^(2i/d_model))
P(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
wherein:
P(pos, 2i) represents the coded value of the even-numbered dimensions in the position feature;
P(pos, 2i+1) represents the coded value of the odd-numbered dimensions in the position feature;
pos represents the position on the x axis or the y axis;
i represents the index of the feature dimension;
d_model represents the dimension of the position feature, which may take the value 128.
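This is the standard sinusoidal positional encoding. A sketch of how the (w*h, 256) position features could be assembled from separate x- and y-encodings is shown below; the implementation details beyond the formula (ordering, broadcasting) are assumptions.

```python
import torch

def positional_features(w: int, h: int, d_model: int = 128) -> torch.Tensor:
    """Sinusoidal position features for a (w, h) feature map: the first 128
    dims encode the x position, the last 128 the y position (256 dims total)."""
    def encode(pos: torch.Tensor, d: int) -> torch.Tensor:
        i = torch.arange(d // 2, dtype=torch.float32)
        angles = pos.unsqueeze(-1) / torch.pow(10000.0, 2 * i / d)
        pe = torch.zeros(pos.shape[0], d)
        pe[:, 0::2] = torch.sin(angles)   # even dimensions
        pe[:, 1::2] = torch.cos(angles)   # odd dimensions
        return pe

    x_pe = encode(torch.arange(w, dtype=torch.float32), d_model)  # (w, 128)
    y_pe = encode(torch.arange(h, dtype=torch.float32), d_model)  # (h, 128)
    # Broadcast to every (x, y) cell and concatenate -> (h*w, 256)
    grid_x = x_pe.unsqueeze(0).expand(h, w, d_model)
    grid_y = y_pe.unsqueeze(1).expand(h, w, d_model)
    return torch.cat([grid_x, grid_y], dim=-1).reshape(h * w, 2 * d_model)
```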
Text information processing:
The text feature extraction module comprises a RoBERTa model pre-trained on public datasets. The text description is first converted into a sequence of token ids using the vocabulary, and this sequence is then input into the RoBERTa model for encoding to obtain the text features. The text features have the same length as the input text and a dimension of 256, giving a size of (T, 256), where T is the length of the text.
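A hedged sketch of this text branch using the Hugging Face transformers library is shown below; the checkpoint name and the linear projection from RoBERTa's hidden size down to 256 dimensions are assumptions, since the patent only states that a publicly pre-trained RoBERTa yields (T, 256) text features.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

# Assumed checkpoint; any publicly pre-trained RoBERTa would fit the description.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")
project = torch.nn.Linear(encoder.config.hidden_size, 256)  # assumed 768 -> 256 projection

def text_features(description: str) -> torch.Tensor:
    tokens = tokenizer(description, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**tokens).last_hidden_state  # (1, T, hidden_size)
    return project(hidden).squeeze(0)                 # (T, 256)
```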
Eye movement information processing:
The bilinear interpolation scaling module scales the Gaussian map sequence to the same (w, h) size as the image features, with the feature dimension equal to the length of the eye movement coordinate sequence; a one-layer convolutional neural network adjusts this dimension to 256, the feature map is flattened to length w*h, and the result is fed into the long short-term memory neural network to obtain the eye movement feature representation of size (w*h, 256).
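One possible reading of this eye-movement branch, sketched below, treats the 180 Gaussian-map frames as input channels, reduces them to 256 with a single convolution, and runs the LSTM over the w*h flattened positions; this interpretation and the 1x1 kernel size are assumptions rather than confirmed details of the patented model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EyeMovementEncoder(nn.Module):
    """Rough sketch: bilinear resizing of the Gaussian-map sequence to the
    image-feature grid (w, h), a one-layer conv reducing the frame dimension
    to 256, and an LSTM over the flattened positions."""

    def __init__(self, num_frames: int = 180, dim: int = 256):
        super().__init__()
        self.reduce = nn.Conv2d(num_frames, dim, kernel_size=1)
        self.lstm = nn.LSTM(input_size=dim, hidden_size=dim, batch_first=True)

    def forward(self, gaussian_maps: torch.Tensor, w: int, h: int) -> torch.Tensor:
        # gaussian_maps: (batch, num_frames, H, W)
        x = F.interpolate(gaussian_maps, size=(h, w), mode="bilinear",
                          align_corners=False)
        x = self.reduce(x)                 # (batch, 256, h, w)
        x = x.flatten(2).transpose(1, 2)   # (batch, w*h, 256)
        out, _ = self.lstm(x)              # eye movement features
        return out                         # (batch, w*h, 256)
```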
Multi-feature fusion:
The image features, text features and eye movement features are concatenated along the first dimension to obtain a multi-modal feature vector of size (w*h + T + w*h, 256); the position features of the environment image are padded with all-zero vectors and expanded to the same length as the multi-modal feature vector, serving as the multi-modal position features.
The feature fusion module comprises a feature splicing module and a Transformer model pre-trained on public datasets; the feature splicing module is used for concatenating the image features, text features and eye movement features. The Transformer model comprises an encoder and a decoder, each of which comprises six multi-head attention modules; each multi-head attention module comprises an eight-head self-attention layer, Dropout layer A, normalization layer A, full connection layer A, an activation function layer, Dropout layer B, full connection layer B, Dropout layer C and normalization layer B, connected in sequence.
The multi-modal feature vector and the multi-modal position feature are added and then passed sequentially through the encoder and the decoder to obtain the output feature vector.
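A rough sketch of the fusion stage under these hyper-parameters (six encoder and decoder layers, eight heads, 256-dim features) is given below; the use of 100 learned object queries as the decoder input is an assumption borrowed from DETR-style detectors and is not stated in the patent.

```python
import torch
import torch.nn as nn

d_model, num_queries = 256, 100
fusion = nn.Transformer(d_model=d_model, nhead=8,
                        num_encoder_layers=6, num_decoder_layers=6,
                        batch_first=True)
# Assumed learned decoder queries (DETR-style), one per candidate target frame.
object_queries = nn.Parameter(torch.randn(num_queries, d_model))

def fuse(img_feat, txt_feat, eye_feat, pos_feat):
    # img_feat / eye_feat: (B, w*h, 256), txt_feat: (B, T, 256)
    # pos_feat: (B, w*h + T + w*h, 256) image position features padded with zeros
    tokens = torch.cat([img_feat, txt_feat, eye_feat], dim=1) + pos_feat
    queries = object_queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
    return fusion(src=tokens, tgt=queries)  # (B, 100, 256) fused output features
```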
The acquired data set is divided into a training set and a test set at a ratio of 5:1; the training set data are used to train the target positioning model, and the test set data are used to test it.
3. Result calculation.
The preprocessed test set data are input into the trained natural image co-pointing target positioning system based on eye movement and voice to predict results. Feature extraction and feature fusion are performed on the environment image, the user text description and the eye movement Gaussian map sequence, and the fused features are then fed into the full connection layer and the SoftMax layer to obtain 100 candidate target frame predictions for each image together with their 100 corresponding confidence values; model accuracy is computed by comparing all test set results with the ground truth.
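A minimal sketch of the prediction heads is shown below; splitting the output into a box-regression head and a two-class (target/background) head whose SoftMax gives the confidence is an assumed arrangement, since the patent only names a full connection layer and a SoftMax layer.

```python
import torch
import torch.nn as nn

box_head = nn.Linear(256, 4)   # (cx, cy, w, h) per candidate (assumed box format)
cls_head = nn.Linear(256, 2)   # target vs. background logits (assumed)

def predict(fused: torch.Tensor):
    # fused: (batch, 100, 256) output feature vectors from the fusion module
    boxes = box_head(fused)                                      # (batch, 100, 4)
    confidence = torch.softmax(cls_head(fused), dim=-1)[..., 0]  # (batch, 100)
    return boxes, confidence
```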
For each image, the candidate frame with the highest confidence is selected as the prediction result, and its intersection-over-union (IoU) with the ground-truth target frame is calculated as follows:
IoU = area(A ∩ B) / area(A ∪ B)
where A is the predicted target frame and B is the ground-truth target frame; the IoU is the ratio of the area of their intersection to the area of their union. A prediction with an IoU greater than 0.5 is considered correct. The prediction accuracy is the ratio of the number of correctly predicted images to the total number of test images. Table 1 compares the test results of the present method with the results of single-modality voice input:
TABLE 1
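For completeness, a minimal sketch of the IoU scoring used to judge correctness is given below; boxes are assumed to be in (x1, y1, x2, y2) corner format, which the patent does not specify.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0

# A prediction counts as correct when iou(predicted_box, ground_truth) > 0.5;
# accuracy is the fraction of test images whose top-confidence box passes this check.
```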
The above-described embodiments are only for illustrating the technical spirit and features of the present invention, and it is intended to enable those skilled in the art to understand the content of the present invention and to implement it accordingly, and the scope of the present invention is not limited to the embodiments, i.e. equivalent changes or modifications to the spirit of the present invention are still within the scope of the present invention.

Claims (9)

1. The natural image co-pointing target positioning system based on eye movement and voice is characterized by comprising a natural image co-pointing target positioning system and a data acquisition and preprocessing system; the natural image co-pointing target positioning system comprises an image feature extraction module, a text feature extraction module, an eye movement feature extraction module, a feature fusion module, a target identification module and a confidence calculation module; the data acquisition and preprocessing system comprises an image acquisition module, an image preprocessing module, a voice acquisition module, a voice preprocessing module, an eye movement acquisition module and an eye movement preprocessing module; the image acquisition module, the image preprocessing module and the image feature extraction module are sequentially connected; the voice acquisition module, the voice preprocessing module and the text feature extraction module are sequentially connected; the eye movement acquisition module, the eye movement preprocessing module and the eye movement characteristic extraction module are sequentially connected; the feature fusion module, the target identification module and the confidence coefficient calculation module are connected in sequence; the input end of the feature fusion module is respectively connected with the output ends of the image feature extraction module, the text feature extraction module and the eye movement feature extraction module;
the image acquisition module is used for acquiring an environment image in front of a user in real time, the image preprocessing module is used for preprocessing the acquired image, and the image feature extraction module is used for extracting image features of the preprocessed image;
the voice acquisition module is used for acquiring voice description information sent by a user, the voice preprocessing module is used for preprocessing the acquired voice information into text information, and the text feature extraction module is used for extracting text features of the preprocessed text information;
the eye movement acquisition module is used for acquiring a real-time eye movement coordinate sequence of a user, the eye movement preprocessing module is used for preprocessing the acquired eye movement coordinate sequence, and the eye movement characteristic extraction module is used for extracting eye movement characteristics of the preprocessed eye movement coordinate sequence;
the feature fusion module is used for fusing the image features, the text features and the eye movement features and generating multi-modal features; the feature fusion module comprises a feature splicing module and a Transformer model pre-trained on public datasets, wherein the feature splicing module is used for concatenating the image features, text features and eye movement features; the Transformer model comprises an encoder and a decoder, wherein the encoder and the decoder each comprise six multi-head attention modules; each multi-head attention module comprises an eight-head self-attention layer, Dropout layer A, normalization layer A, full connection layer A, an activation function layer, Dropout layer B, full connection layer B, Dropout layer C and normalization layer B, connected in sequence;
the feature fusion module fuses the image feature, the text feature and the eye movement feature and generates a multi-modal feature, which comprises the following steps:
performing feature concatenation on the image features, the text features and the eye movement features along the first dimension to obtain a multi-modal feature vector; the feature size of the multi-modal feature vector is (w*h + T + w*h, 256), where w is the width of the feature map, h is the height of the feature map, and T is the length of the text;
filling the position features of the environment image with all zero vectors, and expanding the position features to be the same as the length of the multi-mode feature vectors to serve as multi-mode position features; the position feature of the environment image is the position feature of the position corresponding to each feature vector in the image feature;
adding the multi-mode feature vector and the multi-mode position feature, and sequentially inputting the multi-mode feature vector and the multi-mode position feature to the encoder and the decoder to obtain an output feature vector;
the target recognition module is used for reducing the dimension of the multi-mode features and generating candidate target frames;
the confidence calculating module is used for calculating the confidence of all candidate target frames.
2. The natural image co-pointing target positioning system based on eye movement and voice according to claim 1, wherein the image preprocessing module comprises an image screening module, an image scaling module and an image normalization module which are connected in sequence; the image screening module is used for selecting images with qualified quality; the image scaling module is used for randomly scaling the image; the image normalization module is used for normalizing the image pixel values.
3. The eye movement and voice based natural image co-pointing object localization system of claim 1, wherein the image feature extraction module comprises: a pre-trained residual convolutional neural network for extracting image features, a one-layer convolutional neural network for reducing the dimension of the image features, and a position feature extraction module for generating position features that represent the position corresponding to each feature vector in the image features.
4. The eye movement and voice based natural image co-pointing target positioning system of claim 1, wherein the voice preprocessing module comprises a template matching voice recognition algorithm module; the template matching voice recognition algorithm module is used for converting the audio information into a text sequence.
5. The eye movement and voice based natural image co-pointing object localization system of claim 1, wherein the text feature extraction module comprises a RoBERTa model that performs pre-training using a common dataset, the RoBERTa model being used to generate the text-embedded expressions.
6. The eye movement and voice based natural image co-pointing target positioning system of claim 1, wherein the eye movement preprocessing module comprises a Gaussian map generation module, the Gaussian map generation module being used for converting the eye movement coordinate sequence into a Gaussian map sequence according to the mean and variance of the eye movement recognition error.
7. The eye movement and voice based natural image co-pointing object localization system of claim 1, wherein the eye movement feature extraction module comprises: a bilinear interpolation scaling module for scaling the Gaussian maps to the size of the image features, a one-layer convolutional neural network for reducing the feature dimension, and a long short-term memory neural network for computing the eye movement features.
8. The eye movement and voice based natural image co-pointing object localization system of claim 1, wherein the object recognition module comprises a fully connected layer; the confidence computation module includes a SoftMax layer.
9. An eye movement and voice based natural image co-pointing object localization method using the eye movement and voice based natural image co-pointing object localization system according to any one of claims 1 to 8, comprising the steps of:
step 1, acquiring environment images, user voice description information and data of a user eye movement coordinate sequence through experiments by utilizing a data acquisition and preprocessing system, preprocessing the data to manufacture a training sample set and a test set, and training and testing a natural image co-pointing target positioning system by using the training sample set and the test set;
step 2, synchronously acquiring corresponding environment images, user voice description information and user eye movement coordinate sequences by the image acquisition module, the voice acquisition module and the eye movement acquisition module;
step 3, the image preprocessing module, the voice preprocessing module and the eye movement preprocessing module correspondingly preprocess the acquired environment image, the user voice description information and the user eye movement coordinate sequence;
step 4, correspondingly inputting the preprocessed environment image, the user voice description information and the user eye movement coordinate sequence into an image feature extraction module, a text feature extraction module and an eye movement feature extraction module to correspondingly obtain image features, text features and eye movement features;
step 5, the image features, the text features and the eye movement features are fused by a feature fusion module to obtain multi-modal features, and the multi-modal features are subjected to dimension reduction by a target recognition module to generate candidate target frames; calculating the confidence coefficient of all candidate target frames by a confidence coefficient calculation module; filtering the candidate frames according to the ordering of the confidence degrees, and determining a target frame;
step 6, confirming whether the target frame result is correct or not by the user; if the result of the target frame is correct, the result is saved, otherwise, the steps 2 to 6 are repeated.
CN202210906536.4A 2022-07-29 2022-07-29 Natural image co-pointing target positioning system and method based on eye movement and voice Active CN115237255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210906536.4A CN115237255B (en) 2022-07-29 2022-07-29 Natural image co-pointing target positioning system and method based on eye movement and voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210906536.4A CN115237255B (en) 2022-07-29 2022-07-29 Natural image co-pointing target positioning system and method based on eye movement and voice

Publications (2)

Publication Number Publication Date
CN115237255A CN115237255A (en) 2022-10-25
CN115237255B true CN115237255B (en) 2023-10-31

Family

ID=83676884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210906536.4A Active CN115237255B (en) 2022-07-29 2022-07-29 Natural image co-pointing target positioning system and method based on eye movement and voice

Country Status (1)

Country Link
CN (1) CN115237255B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116185182B (en) * 2022-12-30 2023-10-03 天津大学 Controllable image description generation system and method for fusing eye movement attention
CN116385757B (en) * 2022-12-30 2023-10-31 天津大学 Visual language navigation system and method based on VR equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109814722A (en) * 2019-02-25 2019-05-28 苏州长风航空电子有限公司 A kind of multi-modal man-machine interactive system and exchange method
CN110495854A (en) * 2019-07-30 2019-11-26 科大讯飞股份有限公司 Feature extracting method, device, electronic equipment and storage medium
CN111967334A (en) * 2020-07-20 2020-11-20 中国人民解放军军事科学院国防科技创新研究院 Human body intention identification method, system and storage medium
JP2021015189A (en) * 2019-07-11 2021-02-12 中部電力株式会社 Multi-modal voice recognition device and multi-modal voice recognition method
CN112424727A (en) * 2018-05-22 2021-02-26 奇跃公司 Cross-modal input fusion for wearable systems

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2018256365A1 (en) * 2017-04-19 2019-10-31 Magic Leap, Inc. Multimodal task execution and text editing for a wearable system
KR102168802B1 (en) * 2018-09-20 2020-10-22 한국전자통신연구원 Apparatus and method for interaction

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112424727A (en) * 2018-05-22 2021-02-26 奇跃公司 Cross-modal input fusion for wearable systems
CN109814722A (en) * 2019-02-25 2019-05-28 苏州长风航空电子有限公司 A kind of multi-modal man-machine interactive system and exchange method
JP2021015189A (en) * 2019-07-11 2021-02-12 中部電力株式会社 Multi-modal voice recognition device and multi-modal voice recognition method
CN110495854A (en) * 2019-07-30 2019-11-26 科大讯飞股份有限公司 Feature extracting method, device, electronic equipment and storage medium
CN111967334A (en) * 2020-07-20 2020-11-20 中国人民解放军军事科学院国防科技创新研究院 Human body intention identification method, system and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Target selection technology in multi-modal interaction; 周小舟 et al.; Packaging Engineering (包装工程); pp. 36-44 *

Also Published As

Publication number Publication date
CN115237255A (en) 2022-10-25

Similar Documents

Publication Publication Date Title
KR102266529B1 (en) Method, apparatus, device and readable storage medium for image-based data processing
CN115237255B (en) Natural image co-pointing target positioning system and method based on eye movement and voice
JP6351689B2 (en) Attention based configurable convolutional neural network (ABC-CNN) system and method for visual question answering
US11783615B2 (en) Systems and methods for language driven gesture understanding
CN111161200A (en) Human body posture migration method based on attention mechanism
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN114511906A (en) Cross-modal dynamic convolution-based video multi-modal emotion recognition method and device and computer equipment
CN113822192A (en) Method, device and medium for identifying emotion of escort personnel based on Transformer multi-modal feature fusion
CN110826462A (en) Human body behavior identification method of non-local double-current convolutional neural network model
CN108538283B (en) Method for converting lip image characteristics into voice coding parameters
CN114120432A (en) Online learning attention tracking method based on sight estimation and application thereof
CN114550057A (en) Video emotion recognition method based on multi-modal representation learning
CN114140885A (en) Emotion analysis model generation method and device, electronic equipment and storage medium
CN112768070A (en) Mental health evaluation method and system based on dialogue communication
CN114724224A (en) Multi-mode emotion recognition method for medical care robot
CN112906520A (en) Gesture coding-based action recognition method and device
CN114170537A (en) Multi-mode three-dimensional visual attention prediction method and application thereof
Ahammad et al. Recognizing Bengali sign language gestures for digits in real time using convolutional neural network
CN114913342A (en) Motion blurred image line segment detection method and system fusing event and image
CN116434252A (en) Training of image recognition model and image recognition method, device, medium and equipment
CN113453065A (en) Video segmentation method, system, terminal and medium based on deep learning
Zhang et al. C2st: Cross-modal contextualized sequence transduction for continuous sign language recognition
CN115116117A (en) Learning input data acquisition method based on multi-mode fusion network
CN114663910A (en) Multi-mode learning state analysis system
CN114359785A (en) Lip language identification method and device based on adaptive matrix feature fusion network and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant