CN112818883B - Deep learning detection and positioning method for targets of interest based on eye movement signals - Google Patents

Deep learning detection and positioning method for targets of interest based on eye movement signals

Info

Publication number
CN112818883B
CN112818883B (application CN202110169344.5A)
Authority
CN
China
Prior art keywords
eye movement
data
fixation
sequence
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110169344.5A
Other languages
Chinese (zh)
Other versions
CN112818883A (en)
Inventor
Zeng Hong (曾洪)
Wang Xinzhi (王新志)
Song Aiguo (宋爱国)
Zhang Jianxi (张建喜)
Liu Xing (刘兴)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202110169344.5A
Publication of CN112818883A
Application granted
Publication of CN112818883B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/02 Preprocessing
    • G06F2218/04 Denoising
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/08 Feature extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12 Classification; Matching

Abstract

The invention discloses a method for deep learning detection and positioning of targets of interest based on eye movement signals, comprising the following steps: acquire the raw output data of an eye tracker; preprocess the raw data, including removing abnormal samples and filtering; separate fixation events from saccade events in the preprocessed eye movement data with an adaptive threshold algorithm; manually label the fixation sequences; design a deep neural network that handles sequences of unequal length and train a ConvLSTM neural network; use the network to detect intentional and unintentional fixation clusters; and overlay the intentional fixation clusters on the original image to locate the specific position of the target of interest in the picture. The adaptive threshold algorithm used to identify fixation and saccade events improves classification accuracy across different subjects, and the deep neural network likewise improves the accuracy of detecting and positioning targets of interest, especially occluded, incomplete and uncertain targets, far exceeding that of computer-vision-based methods.

Description

Deep learning detection and positioning method for targets of interest based on eye movement signals
Technical Field
The invention relates to the technical field of eye tracking, and in particular to a method for deep learning detection and positioning of targets of interest based on eye movement signals.
Background
Target detection is a popular direction in computer vision and digital image processing, with broad applications in the military and civilian fields: militarily it can be used to search for and detect military targets, and in civilian use it serves industries such as meteorology, agriculture and security. Existing computer-vision techniques for detecting targets of interest achieve good results on specific image targets; however, they depend heavily on prior information about and features of the image target, still struggle to detect uncertain, occluded and incomplete targets in an image, and their accuracy is further affected by the picture's depth of field, resolution, weather, illumination conditions and so on. The development of eye tracking technology offers a way to solve this problem. The spatio-temporal characteristics of eye movement signals are the physiological and behavioral manifestation of visual information extraction, and a large body of research shows that human gaze patterns are closely related to human cognitive processes. For example, fixation and saccade behavior exhibit differences in pupil size, fixation duration, fixation count and the like, and a subject's intentional fixations provide evidence for detecting and locating targets of interest.
Classifying eye movement events from the eye movement signal is a prerequisite for detecting and locating targets of interest, and several mature classification algorithms exist in the field, such as the velocity-threshold I-VT algorithm, the dispersion-based I-DT algorithm, and the machine-learning-based I-HMM hidden Markov model algorithm. These algorithms, however, only classify eye movement events and perform no subsequent target detection and localization analysis. In recent years there has also been research on predicting targets of interest in remote sensing images from the degree of attention paid to attention regions: the subject's attention regions are extracted with a clustering algorithm, and differences in the subject's attention to each region are used to judge whether it contains a target. Experimental results show that, by combining eye-movement-based region-of-interest analysis with cognitive analysis, this approach achieves detection and localization of targets in complex-background images to a certain extent. Other researchers obtain a subject's fixation clusters through clustering, manually extract and label fixation features, and use machine learning to distinguish intentional from unintentional fixation clusters so as to detect targets of interest; this method has some recognition ability, but its performance depends heavily on the hand-crafted features.
With the development of artificial intelligence, a large number of excellent neural network models and algorithms have emerged, and deep learning is widely applied to image classification and recognition, text classification, speech processing and other fields with unmatched results. Among today's mainstream deep learning models, convolutional neural networks and long short-term memory networks stand out. A CNN is better suited to spatially extended data, extracting local features and abstracting them into high-dimensional features through convolution, so CNNs are commonly used for image feature extraction; an LSTM is better suited to modeling time-series data and capturing temporal dependencies, so LSTMs are commonly used for data such as text, speech and video. Owing to their nature, eye movement signals have both spatial and temporal characteristics. For similar signals, such as brain signals, researchers have adopted ConvLSTM networks, which extract signal features better than a CNN or an LSTM alone and achieve better classification results.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a method for deep learning detection and positioning of targets of interest based on eye movement signals, in which fixation sequences are extracted from the raw gaze points and fed into a neural network to detect intentional fixation clusters, which are then overlaid on the original image to locate the targets of interest in the image.
In order to achieve this technical purpose, the technical scheme of the invention is as follows:
A method for deep learning detection and positioning of targets of interest based on eye movement signals comprises the following steps:
Step 1, acquire the raw output data of an eye tracker in real time, including a timestamp, a horizontal coordinate x, a vertical coordinate y and a data-validity flag;
Step 2, preprocess the raw output data by removing abnormal samples together with some valid samples near them, and filtering the data;
Step 3, compute the eye movement angular velocity v and angular acceleration a from the preprocessed data, and separate the preprocessed data into fixation events and saccade events with an adaptive threshold algorithm to obtain fixation sequences;
Step 4, visualize the fixation sequences, overlay them on the corresponding experimental pictures, and use the overlays to assist the manual labeling of fixation-sequence categories;
Step 5, train a deep neural network that handles sequences of unequal length on the manually labeled fixation sequences, and use it to detect and locate targets of interest.
Further, the raw output data in step 1 is read from the eye tracker automatically by a program in real time. To simplify computing the eye movement angular velocity and angular acceleration, the eye tracker's output coordinates (x, y) must be converted to screen pixel coordinates (x_s, y_s); the conversion formula is:
x_s = x * r_v
y_s = y * r_h
where x is the horizontal coordinate output by the eye tracker, y is the vertical coordinate output by the eye tracker, x_s and y_s are the horizontal and vertical screen pixel coordinates, r_v is the horizontal screen resolution, and r_h is the vertical screen resolution.
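As a minimal illustration, the conversion can be sketched as below; the function name and the normalized [0, 1] input range are assumptions drawn from the surrounding description.

```python
def to_screen_pixels(x, y, r_v, r_h):
    """Map normalized eye-tracker coordinates in [0, 1] to screen pixels.

    r_v: horizontal screen resolution, r_h: vertical screen resolution
    (the subscript naming follows the patent text).
    """
    x_s = x * r_v  # horizontal pixel coordinate
    y_s = y * r_h  # vertical pixel coordinate
    return x_s, y_s
```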
Further, step 2 preprocesses the coordinate-converted data. Abnormal samples, produced by the subject blinking, by external noise, or by the subject looking off-screen, are removed first: the center of each invalid data run is located, and the abnormal samples together with a small number of valid samples around them are discarded to make the removal robust. Spike filtering, median filtering and Butterworth filtering are then applied to reduce signal noise.
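A sketch of such a denoising chain using SciPy follows; the kernel size, cutoff frequency and filter order are illustrative assumptions rather than values from the patent.

```python
import numpy as np
from scipy.signal import medfilt, butter, filtfilt

def denoise(samples, fs, median_k=5, cutoff_hz=30.0, order=4):
    """Suppress single-sample spikes with a median filter, then apply a
    zero-phase Butterworth low-pass (all parameter values are assumptions).

    samples: 1-D coordinate signal; fs: sampling rate in Hz.
    """
    s = medfilt(np.asarray(samples, dtype=float), kernel_size=median_k)
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype="low")
    return filtfilt(b, a, s)  # filtfilt avoids introducing phase lag
```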
Further, the eye movement velocity v in step 3 is an angular velocity and the eye movement acceleration a is an angular acceleration, computed from the screen pixel coordinates, the sampling interval and the viewing distance, where v denotes the eye movement angular velocity, a the angular acceleration, t the time interval between gaze points, (x_0, y_0) the screen pixel coordinates at the start time, (x_t, y_t) the screen pixel coordinates at time t, and d the distance from the subject to the screen.
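A standard reconstruction of these formulas from the variables defined above, assuming the pixel displacement is first scaled to the same physical units as the viewing distance d, is:
v = (1/t) * arctan( sqrt((x_t - x_0)^2 + (y_t - y_0)^2) / d )
a = Δv / t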
Further, the adaptive threshold algorithm for classifying eye movement events in step 3 first computes a velocity threshold over the whole eye movement recording and classifies the saccade events; it then uses the saccades to split the data into small windows, computes a local velocity threshold within each window, and classifies saccade and fixation events there. Compared with computing a single uniform velocity threshold directly from all the data, adaptively computing the velocity threshold within each window is more reasonable and handles individual differences between subjects.
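The sketch below illustrates this two-pass idea under stated assumptions: the threshold form (mean plus k standard deviations) and the constants are illustrative, not the patent's actual rule.

```python
import numpy as np

def adaptive_classify(v, k_global=3.0, k_local=3.0):
    """Two-pass adaptive velocity thresholding (illustrative sketch).

    v: numpy array of per-sample angular velocities (deg/s).
    Returns a boolean array: True = saccade sample, False = fixation sample.
    """
    # Pass 1: a global threshold over the whole recording flags clear saccades.
    saccade = v > v.mean() + k_global * v.std()

    # Pass 2: between consecutive saccades, recompute a local threshold so the
    # fixation/saccade split adapts to each window and each subject.
    change = np.flatnonzero(np.diff(saccade.astype(int)))
    bounds = np.concatenate(([0], change + 1, [v.size]))
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        if not saccade[lo:hi].any():          # a candidate fixation window
            w = v[lo:hi]
            saccade[lo:hi] = w > w.mean() + k_local * w.std()
    return saccade
```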
Further, the manual fixation-sequence labeling in step 4 visualizes the fixation sequences classified in step 3 and overlays them on the corresponding experimental picture. Categories are assigned relative to the target objects specified in the experiment: a fixation sequence near a target object is labeled an intentional fixation cluster, otherwise an unintentional fixation cluster.
Further, the neural network in step 5 handles sequences of unequal length by means of a Masking technique. Different fixation sequences are clearly not all the same length, yet the network requires input data of consistent dimensions, so a deep neural network that handles unequal-length sequences must be designed. The usual practice is to zero-pad to a common length, or to truncate directly to equal-length sequences, at the cost of slightly altering a fixation sequence's classification characteristics. The network in step 5 still uses zero-padding, but applies Masking so that the padded zeros take no part in the network's computation; this lets the network handle unequal-length sequences while removing the influence of the zero-padding on the fixation sequences' classification characteristics.
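A minimal sketch of the zero-padding plus masking idea, assuming a Keras implementation (the patent names no framework). Note that Keras's Masking layer skips a timestep only when every feature equals the mask value, so a genuine all-zero gaze sample would also be masked.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Three gaze sequences of unequal length; each timestep is (x_s, y_s).
seqs = [np.random.rand(n, 2).astype("float32") for n in (7, 12, 9)]
padded = tf.keras.preprocessing.sequence.pad_sequences(
    seqs, padding="post", dtype="float32")          # shape: (3, 12, 2)
labels = np.array([1.0, 0.0, 1.0], dtype="float32")  # hypothetical labels

model = models.Sequential([
    layers.Masking(mask_value=0.0, input_shape=(None, 2)),  # padded steps skipped
    layers.LSTM(32),                                        # mask-aware recurrence
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(padded, labels, epochs=2, verbose=0)
```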
Furthermore, the deep neural network in step 5 combines CNN and LSTM layers into a ConvLSTM network, which can process data with spatio-temporal characteristics: the CNN learns the data's local spatial features and abstracts them into high-dimensional features through convolution, while the LSTM learns the data's temporal features.
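One plausible CNN+LSTM stack is sketched below; the layer order follows the layer list of FIG. 2 (convolution, pooling, Dropout, LSTM, fully connected), but the hyperparameters are assumptions. Stock Keras Conv1D layers do not propagate a Masking layer's mask, so this sketch consumes padded sequences directly.

```python
from tensorflow.keras import layers, models

def build_convlstm(n_steps, n_feats, n_classes=2):
    """Assumed CNN+LSTM ("ConvLSTM") classifier for fixation sequences."""
    return models.Sequential([
        layers.Conv1D(32, 3, padding="same", activation="relu",
                      input_shape=(n_steps, n_feats)),   # local spatial features
        layers.MaxPooling1D(2),
        layers.Dropout(0.3),
        layers.LSTM(64),                                 # temporal dependencies
        layers.Dense(n_classes, activation="softmax"),   # intentional / unintentional
    ])

model = build_convlstm(n_steps=12, n_feats=2)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```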
Further, a ConvLSTM deep neural network is trained on the manually labeled fixation sequences. Fixation clusters are then separated from newly acquired eye movement data by the adaptive threshold algorithm, the trained ConvLSTM network detects the intentional and unintentional fixation clusters, and the intentional fixation clusters are overlaid on the original image to locate the specific position of the target of interest in the image.
Compared with the prior art, the invention has the following beneficial effects:
(1) The method for deep learning detection and positioning of targets of interest based on eye movement signals makes full use, through the eye movement signal, of the subject's ability to detect occluded, incomplete and uncertain targets, improving the accuracy of detecting and positioning targets of interest in complex environments;
(2) The method classifies eye movement events with an adaptive threshold, accounting for individual variability between subjects; this enables higher-accuracy event classification, which benefits extracting fixation sequences for manual labeling, improves the reliability of the training data, and in turn improves the training of the deep neural network;
(3) The method uses a ConvLSTM deep neural network to distinguish intentional from unintentional fixation sequences. Combining the advantages of CNN and LSTM, it can process data with spatio-temporal characteristics and recognizes intentional fixation sequences with high accuracy, achieving high-accuracy detection and positioning of targets of interest.
Drawings
FIG. 1 is a flow diagram of the overall framework of the invention;
FIG. 2 is a schematic of the deep neural network architecture.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings; the examples are given to illustrate the invention and are not to be construed as limiting its scope.
The target of interest in the invention is usually an object in a picture that the subject is interested in, but it can also be an object specified by the experiment, such as military targets like aircraft, ships and tanks. Here the subject freely views pictures containing varying numbers of targets of interest, and the eye tracker records the subject's eye movement signal for subsequent analysis and processing to detect the targets the subject is interested in.
As shown in FIG. 1, the method for deep learning detection and positioning of targets of interest based on eye movement signals provided by this embodiment of the invention comprises the following steps:
(1) Eye movement signal acquisition and preprocessing
Eye tracking is an important component of multimodal interaction technology: the visual system is a primary way for humans to acquire external information and gives humans a unique information source for visually exploring the world. Eye tracking uses sensors to sense the rotation of the eyeball and computes the gaze direction and angle with a microprocessor, thereby tracking the gaze position. An eye tracker is an important instrument that uses eye tracking technology to record the eye movement trajectory, recording information such as the user's current gaze position and pupil diameter. The eye movement signal recorded by an eye tracker can be divided by eye movement speed into several event types: fixations, saccades, smooth pursuit and so on. A fixation indicates where the user's attention lies and is most likely a region of interest in the picture.
In this example, a picture-switching experiment is designed with the eye tracker, and the eye movement signal of each subject freely browsing the pictures is recorded. The raw data output by the eye tracker comprises a timestamp, a horizontal coordinate x, a vertical coordinate y and a data-validity flag. The eye tracker outputs coordinates in [0, 1], which must be converted to screen pixel coordinates to simplify computing the eye movement angular velocity and angular acceleration; the conversion formula is:
x_s = x * r_v
y_s = y * r_h
where x is the horizontal coordinate output by the eye tracker, y is the vertical coordinate output by the eye tracker, x_s and y_s are the horizontal and vertical screen pixel coordinates, r_v is the horizontal screen resolution, and r_h is the vertical screen resolution.
The coordinate-converted data is then preprocessed. Abnormal samples, produced by the subject blinking, by external noise, or by the subject looking off-screen, are removed first: the center of each invalid data run is located, and the abnormal samples together with a small number of valid samples around them are discarded to make the removal robust. Spike filtering, median filtering and Butterworth filtering are then applied to reduce signal noise.
(2) Eye movement event marking algorithm
The eye movement velocity and acceleration are computed from the preprocessed eye movement data, and fixation events are separated from saccade events with the adaptive threshold algorithm. Here the eye movement velocity is an angular velocity and the eye movement acceleration is an angular acceleration, computed from the screen pixel coordinates, the sampling interval and the viewing distance, where v denotes the angular velocity, a the angular acceleration, t the time interval between gaze points, (x_0, y_0) the screen pixel coordinates at the start time, (x_t, y_t) the screen pixel coordinates at time t, and d the distance from the subject to the screen.
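A sketch of this computation, following the reconstruction given in the summary above; the pixel-to-millimetre scale px_per_mm is an assumption, since the on-screen displacement must share units with the viewing distance d.

```python
import numpy as np

def angular_velocity(x0, y0, xt, yt, t, d, px_per_mm=3.78):
    """Angular eye-movement velocity in degrees per second (sketch).

    (x0, y0), (xt, yt): screen pixel coordinates at the start and at time t;
    t: time interval in seconds; d: subject-to-screen distance in mm;
    px_per_mm: assumed display scale (3.78 px/mm is roughly 96 dpi).
    """
    disp_mm = np.hypot(xt - x0, yt - y0) / px_per_mm   # on-screen displacement
    theta = np.degrees(np.arctan2(disp_mm, d))          # visual angle subtended
    return theta / t

# The angular acceleration follows as the change in v over successive intervals.
```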
The adaptive algorithm for classifying eye movement events first computes an eye movement velocity threshold over the whole recording and classifies the saccade events; it then uses the saccades to split the data into small windows, computes a local velocity threshold within each window, and classifies saccade and fixation events there. Compared with computing one velocity threshold directly from all the data, the adaptive per-window threshold is more reasonable, and the algorithm handles well the threshold differences caused by individual differences between subjects.
(3) Fixation sequence labeling and deep neural network training
Each experimental picture is overlaid with its corresponding fixation sequences to obtain a series of overlay pictures that assist the manual labeling of fixation-sequence categories. The labeling criterion is whether the target object specified for the experiment coincides with the fixation sequence: a fixation sequence near a target object is labeled an intentional fixation sequence, otherwise an unintentional fixation sequence.
Obviously, different fixation sequences have unequal lengths, so a neural network that handles unequal-length sequences must be designed; without such handling a deep neural network cannot be trained, because the network requires input data of consistent dimensions. The usual practice is to zero-pad to a common length or to truncate directly to equal-length sequences, at the cost of losing some of a sequence's classification characteristics. The method here keeps the idea of zero-padding to a common length, but uses a Masking technique so that the padded zeros are excluded from the network's computation and from the cost function, letting the network handle unequal-length sequences while removing the influence of the zero-padding on the sequence characteristics.
Because the eye movement signal is special, containing local spatial features while also changing continuously over time, a convolutional neural network (CNN) and a long short-term memory network (LSTM) are fused into a ConvLSTM network that combines the CNN's spatial feature extraction with the LSTM's temporal feature extraction and can better extract the eye movement signal's features. The network structure, shown schematically in FIG. 2, comprises convolutional layers, pooling layers, a Dropout layer, LSTM layers and fully connected layers.
(4) Using the trained network
Once trained, the ConvLSTM deep neural network can be used to detect and locate targets of interest: fixation clusters are first separated from newly acquired eye movement data by the adaptive threshold algorithm, the trained ConvLSTM network then detects the intentional and unintentional fixation clusters, and the intentional fixation clusters are overlaid on the original picture to locate the specific position of the target of interest in the picture.
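Putting the pieces together, the localization step might look like the sketch below; locate_targets, the probability threshold, and the cluster layout (rows of (x_s, y_s) pixel coordinates) are illustrative assumptions.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def locate_targets(model, fixation_clusters, max_len, prob_thr=0.5):
    """Classify fixation clusters and return the (x, y) centroids of the
    intentional ones, ready to overlay on the original picture (sketch).

    fixation_clusters: list of numpy arrays, each of shape (n_i, 2).
    model: a trained two-class (softmax) ConvLSTM classifier.
    """
    X = pad_sequences(fixation_clusters, maxlen=max_len,
                      padding="post", dtype="float32")
    p_intentional = model.predict(X)[:, 1]       # softmax column 1 = intentional
    return [cluster.mean(axis=0)                 # centroid in screen pixels
            for cluster, p in zip(fixation_clusters, p_intentional)
            if p > prob_thr]
```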
The foregoing describes preferred embodiments of the invention and is not intended to limit the invention to the precise forms disclosed; any modifications, equivalents and alternatives falling within the spirit and scope of the invention are intended to be included within its scope.

Claims (8)

1. A method for deep learning detection and positioning of targets of interest based on eye movement signals, characterized by comprising the following steps:
step 1, acquire the raw output data of an eye tracker in real time, including a timestamp, a horizontal coordinate x, a vertical coordinate y and a data-validity flag;
step 2, preprocess the raw output data by removing abnormal samples together with some valid samples near them, and filtering the data;
step 3, compute the eye movement angular velocity v and angular acceleration a from the preprocessed data, and separate the preprocessed data into fixation events and saccade events with an adaptive threshold algorithm to obtain fixation sequences;
step 4, visualize the fixation sequences, overlay them on the corresponding experimental pictures, and manually label the fixation-sequence categories;
step 5, train a deep neural network that handles sequences of unequal length on the manually labeled fixation sequences, and use it to detect and locate targets of interest.
2. The method for deep learning detection and positioning of targets of interest based on eye movement signals of claim 1, characterized in that: the raw output data in step 1 is read from the eye tracker automatically by a program in real time.
3. The method for deep learning detection and positioning of targets of interest based on eye movement signals of claim 1, characterized in that: the preprocessing in step 2 comprises removing abnormal data and filtering the data; the abnormal data, produced by the subject blinking, by external noise, or by the subject looking off-screen, is located at the center of each invalid data run, and the abnormal data together with a small number of valid samples around it is removed to make the removal robust; the filtering applies spike filtering, median filtering and Butterworth filtering to reduce signal noise.
4. The method for deep learning detection and positioning of targets of interest based on eye movement signals of claim 1, characterized in that: the eye movement angular velocity v and angular acceleration a in step 3 are computed from the gaze coordinates, the sampling interval and the viewing distance, where v denotes the angular velocity, a the angular acceleration, t the time interval between gaze points, (x_0, y_0) the coordinates at the start time, (x_t, y_t) the coordinates at time t, and d the distance from the subject to the screen;
the adaptive threshold algorithm comprises the following steps: first compute a velocity threshold over the whole eye movement data and classify the saccade events; then use the saccades to divide the eye movement data into small windows, compute a local velocity threshold within each window, and classify saccade and fixation events.
5. The method for deep learning detection and positioning of targets of interest based on eye movement signals of claim 1, characterized in that: the manual fixation-sequence labeling in step 4 visualizes the fixation sequences classified in step 3 and overlays them on the corresponding experimental picture; according to the target objects specified in the experiment, a fixation sequence near a target object is labeled an intentional fixation cluster, otherwise an unintentional fixation cluster.
6. The method for deep learning detection and positioning of targets of interest based on eye movement signals of claim 1, characterized in that: zero-padding is used in step 5 when the neural network processes unequal-length sequences, and a Masking technique ensures that the padded zeros in the sequences take no part in the network's computation.
7. The method for deep learning detection and positioning of targets of interest based on eye movement signals of claim 1, characterized in that: the deep neural network in step 5 combines CNN and LSTM into a ConvLSTM network, which can process data with spatio-temporal characteristics, the CNN learning the data's local spatial features, which are abstracted into high-dimensional features through convolution, and the LSTM learning the data's temporal features.
8. The method for deep learning detection and positioning of targets of interest based on eye movement signals of claim 7, characterized in that: the ConvLSTM deep neural network is trained with the manually labeled fixation sequences; fixation clusters are separated from newly acquired eye movement data by the adaptive threshold algorithm; the trained ConvLSTM network detects the intentional and unintentional fixation clusters; and the intentional fixation clusters are overlaid on the original image to locate the specific position of the target of interest in the image.
CN202110169344.5A 2021-02-07 2021-02-07 Deep learning detection and positioning method for targets of interest based on eye movement signals Active CN112818883B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110169344.5A CN112818883B (en) 2021-02-07 2021-02-07 Deep learning detection and positioning method for targets of interest based on eye movement signals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110169344.5A CN112818883B (en) 2021-02-07 2021-02-07 Deep learning detection and positioning method for targets of interest based on eye movement signals

Publications (2)

Publication Number Publication Date
CN112818883A (en) 2021-05-18
CN112818883B (en) 2024-03-26

Family

ID=75862240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110169344.5A Active CN112818883B (en) 2021-02-07 2021-02-07 Deep learning detection and positioning method for targets of interest based on eye movement signals

Country Status (1)

Country Link
CN (1) CN112818883B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115439921A (en) * 2022-09-22 2022-12-06 徐州华讯科技有限公司 Image preference prediction method based on eye diagram reasoning


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190100982A (en) * 2018-02-05 2019-08-30 동국대학교 산학협력단 Deep learning based gaze detection system and method for automobile drivers
CN110298303A (en) * 2019-06-27 2019-10-01 西北工业大学 A kind of crowd recognition method based on the long pan of memory network in short-term path learning

Also Published As

Publication number Publication date
CN112818883A (en) 2021-05-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant