CN112418296B - Bionic binocular target identification and tracking method based on human eye visual attention mechanism - Google Patents

Bionic binocular target identification and tracking method based on human eye visual attention mechanism Download PDF

Info

Publication number
CN112418296B
CN112418296B CN202011298898.7A
Authority
CN
China
Prior art keywords
point
salient
frame
saliency
tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011298898.7A
Other languages
Chinese (zh)
Other versions
CN112418296A (en)
Inventor
陈利利
谭锦钢
李嘉茂
王开放
张晓林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Microsystem and Information Technology of CAS
Original Assignee
Shanghai Institute of Microsystem and Information Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Microsystem and Information Technology of CAS filed Critical Shanghai Institute of Microsystem and Information Technology of CAS
Priority to CN202011298898.7A priority Critical patent/CN112418296B/en
Publication of CN112418296A publication Critical patent/CN112418296A/en
Application granted granted Critical
Publication of CN112418296B publication Critical patent/CN112418296B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Abstract

The invention relates to a bionic binocular target identification and tracking method based on the human visual attention mechanism, comprising the following steps: a bionic binocular device observes the current scene; an instance-level segmentation network, a salient gaze point detection network and a human body posture detection network are constructed; the image information of the current scene is input into the instance-level segmentation network to obtain an instance-level segmentation result map for the current scene; a mask map of the salient gaze point region is acquired by attempting, in turn, the voice information of the current scene, the human body posture detection network and the salient gaze point detection network; the mask map of the salient gaze point region is aligned with the instance-level segmentation result map to obtain the salient target in the current scene together with its instance-level category and contour, and the bionic binocular device tracks that salient target. The invention improves the robustness of tracking switching, makes the final result more accurate, and comes closer to the human visual attention mechanism.

Description

Bionic binocular target identification and tracking method based on human eye visual attention mechanism
Technical Field
The invention relates to the technical field of intelligent bionics, in particular to a bionic binocular target identification and tracking method based on a human eye visual attention mechanism.
Background
The human visual attention mechanism works as follows: during visual perception, the rod cells of peripheral vision rapidly perceive the real scene and determine the region of the field of view the eye most wants to attend to; the central vision of the eye (served by the cone cells) is then aligned with that region, and the category and contour information of the salient target is obtained.
Existing target tracking and switching methods for bionic binocular vision systems achieve real-time tracking by randomly initializing a target to be tracked and rotating the eyeballs. When the tracked target is lost, the system either randomly switches to another target, or switches targets at a fixed interval (for example every 5 seconds) to imitate binocular behavior. From the viewpoint of the human visual attention mechanism, however, both modes clearly contradict the design goal of a bionic binocular vision system, which is to realize bionic movable eyes that behave like human eyes. The human visual system has an exceptionally strong ability to screen data: facing the surrounding environment, it reacts promptly to the target regions it finds most interesting and effectively "ignores" irrelevant targets or regions. When entering a new scene, more attention is allocated to the target regions considered important and less to those considered unimportant, and the attention paid to the currently attended targets and regions gradually decays over time.
However, current target tracking and switching methods in bionic binocular vision systems do not possess this human visual attention mechanism; they are governed by rigid, hand-crafted control logic, which clearly does not match the mechanism of the human eye. In addition, whether the tracked target should be switched during tracking is determined not only by visual factors but also by factors from other modalities (such as hearing and smell); in other words, the gaze switching of bionic binocular vision is the result of multi-modal information fusion. Existing target tracking and switching methods for bionic binocular vision can switch the line of sight of the bionic eye based on visual information only.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a bionic binocular target recognition and tracking method based on a human eye visual attention mechanism, which is more in line with the human eye visual attention mechanism.
The invention provides a bionic binocular target recognition and tracking method based on a human eye visual attention mechanism, which comprises the following steps:
step S1, providing a bionic binocular device, detecting a current scene by the bionic binocular device, and acquiring multi-modal information of the current scene, wherein the multi-modal information comprises image information and voice information;
step S2, constructing an instance-level segmentation network, a salient gaze point detection network and a human body posture detection network;
step S3, inputting the image information of the current scene into the instance-level segmentation network to obtain an instance-level segmentation result map for the current scene, wherein the instance-level segmentation result map comprises instance-level categories and contours;
step S4, attempting to acquire a mask map of the salient gaze point region based on the voice information in the current scene, and performing step S5 when the attempt succeeds; if the attempt fails, continuing to attempt to acquire the mask map of the salient gaze point region based on the human body posture detection network, and performing step S5 if this attempt succeeds; if it fails, continuing to attempt to acquire the mask map of the salient gaze point region based on the salient gaze point detection network, and performing step S5 if this attempt succeeds; ending the procedure if it fails as well;
and step S5, aligning the acquired mask map of the salient gaze point region with the instance-level segmentation result map obtained in step S3, obtaining the salient target in the current scene together with its instance-level category and contour, and tracking the salient target with the bionic binocular device.
Further, the step S4 includes:
step S41, judging whether the bionic binocular device detects voice information in the current scene, if so, performing step S42, and if not, performing step S43;
step S42, judging from the voice information in the current scene whether the content under discussion in the current voice interaction is related to the instance-level categories obtained in step S3; if so, taking the instance-level category related to the discussion content as the target category with the highest saliency, generating a mask map of the salient gaze point region, and performing step S5; if not, performing step S43;
step S43, inputting the image information of the current scene into the human body posture detection network, detecting human body key points in the image information, and judging whether a person in the current scene performs a specific action; if so, determining the salient gaze point from the human body key points, generating a mask map of the salient gaze point region, and performing step S5; if not, performing step S44;
step S44, inputting the image information of the current scene into the salient gaze point detection network, generating a salient gaze point prediction result, and acquiring the mask map of the salient gaze point region based on that prediction result.
Further, the step S42 includes:
step S421, extracting the center words of the voice information in the current scene through a word segmentation operation;
step S422, calculating the distances between the center words and the category words of the instance-level categories;
step S423, sorting these distances from low to high, and retaining the center-word categories whose distance is smaller than a preset threshold;
step S424, judging whether the retained center-word categories intersect with the instance-level categories obtained in step S3, and if so, extracting the target category in the intersection.
Further, the method for determining the salient gaze point in step S43 is as follows: detecting the wrist, elbow, eye and shoulder key points; when the height difference between a wrist key point and the eye key points is detected to be within a certain threshold, the salient gaze point is located at the middle position between the two eye key points; when the heights of the elbow, wrist and shoulder key points are detected to increase in turn, the salient gaze point is located at the wrist key point.
Further, the step S44 includes:
step S441, inputting the images of several consecutive frames into the salient gaze point detection network to obtain the network output results of these frames and the rotation angle data of the bionic binocular device while tracking these frames, and caching the obtained network output results and rotation angle data;
step S442, among the consecutive frames of step S441, buffering those captured in a stationary state and establishing a salient-point count table;
step S443, obtaining the coordinates of the most salient point of each of the consecutive frames from the network output results, and judging whether the current frame i satisfies the replacement condition; if so, replacing the network output result of the current frame i with the network output result of the previous frame i-1, updating the cached network output results, and performing step S444; if the replacement condition is not satisfied, performing step S444 directly;
step S444, judging whether the current frame i is now a jump initial frame; if so, clearing the cached salient-point count table completely; if not, updating the salient-point count table;
step S445, applying a Gaussian decay strategy to the salient-point count table obtained in step S444 to obtain a saliency attenuation coefficient map, and multiplying this coefficient map onto the network output map of the current frame i to obtain the final mask map of the salient gaze point region for the current frame i.
Further, the method for acquiring the consecutive frames in the stationary state in step S442 is as follows: from the several consecutive frames of step S441, extracting the consecutive frames for which the rotation angle of the bionic binocular device is smaller than a preset threshold.
Further, the method for establishing the salient-point count table in step S442 is as follows: for the first frame of the consecutive frames in the stationary state, an all-zero salient-point count table is established.
Further, the replacement condition in step S443 is:
d(i, i-1) > τ and d(i-1, i-2) > τ,
wherein d denotes the distance function between the most-salient-point coordinates (x_i, y_i) of two frames, τ denotes a preset distance threshold, i-1 denotes the frame before the current frame i, and i-2 denotes the frame two before the current frame i; or
d(i, i-1) > τ and d(i-1, i-2) = 0 and d(i-2, i-3) > τ,
wherein d denotes the distance function, τ denotes the preset distance threshold, i-1 denotes the frame before the current frame i, i-2 denotes the frame two before, and i-3 denotes the frame three before the current frame i.
Further, the method for judging whether a frame is a jump initial frame in step S444 is as follows: judging whether the current frame i and its previous frame i-1 satisfy d(i, i-1) > τ; if so, the current frame i is a jump initial frame; if not, the current frame i is not a jump initial frame.
Further, the method for updating the salient-point count table in step S444 is as follows: in the salient-point count table, all values within the range of radius τ centered at (x_i, y_i) are incremented by 1.
Driven by multi-modal information, the invention realizes a bionic binocular target recognition and tracking-switching method based on the human visual attention mechanism. The method takes multi-modal input into account and constructs two network branches, salient gaze point detection and instance-level segmentation; adding a jump detection and filtering strategy to the salient gaze point detection greatly improves the robustness of tracking switching, and adding a Gaussian decay strategy more naturally simulates the saliency attention mechanism of human vision, so that the final result is not only more accurate but also closer to the human visual mechanism.
Drawings
FIG. 1 is a flow chart of a bionic binocular object recognition and tracking method based on a human eye visual attention mechanism according to the present invention;
FIG. 2 is a structural diagram of the instance-level segmentation network;
FIG. 3 is a structural diagram of the salient gaze point detection network;
FIG. 4 is a diagram of a human body posture detection network;
fig. 5 is a diagram illustrating an example of human body key point detection of a bionic binocular target recognition and tracking method based on a human eye visual attention mechanism according to the present invention.
Fig. 6 is a diagram illustrating recognition of a specific human action (calling with the right hand) in the bionic binocular target recognition and tracking method based on the human visual attention mechanism according to the present invention.
Fig. 7 is a diagram of an example of the results of a bionic binocular object recognition and tracking method based on a human eye visual attention mechanism according to the present invention when there is no voice interaction in a scene.
Fig. 8 is a diagram of an example of the results of a bionic binocular object recognition and tracking method based on a human eye visual attention mechanism according to the present invention when a scene contains a voice interaction.
Detailed Description
Preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.
As shown in fig. 1, the bionic binocular target recognition and tracking method based on the human eye visual attention mechanism of the present invention comprises the following steps:
In step S1, a bionic binocular device is provided. The bionic binocular device observes the current scene; image information of the current scene is obtained during this detection process, and information of other modalities, such as voice information, can be obtained at the same time.
In step S2, an instance-level segmentation network and a salient gaze point detection network are constructed. In addition, in a human-computer interaction scenario, not only the voice input of the interacting person but also his or her body language should be considered, so a human body posture detection network is constructed in this step as well.
The instance-level segmentation network structure is shown in fig. 2. Building on Faster R-CNN, the network maps the position of each generated candidate box in the original image onto the feature map according to the structural characteristics of the backbone (for example, 32x downsampling) to obtain the region of the candidate box on the feature map, and then crops the features in this region to obtain the feature block corresponding to the candidate box. Since the candidate boxes differ in size, the cropped feature blocks also differ in size, so a RoI-Align module (region-of-interest alignment module) is used to obtain features of a uniform size (to be fed to each prediction branch at the back end). After each candidate box has obtained a feature block of uniform size, it passes through two branch sub-networks: a multi-layer convolutional network predicts a binary mask map (binary values 0 and 1, i.e. which areas in the candidate box belong to the background and which belong to the foreground of the instance), while a fully connected branch is responsible for box regression (predicting the error of the candidate box's center coordinates (x, y) and box size (h, w) relative to the ground-truth box, i.e. fine-tuning the position and size of the candidate box) and for category prediction (i.e. which category the instance in the candidate box belongs to). Based on the results of the two branches, the segmentation results and corresponding categories of all instances in the whole scene are finally obtained.
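By way of illustration only, this instance-level segmentation step can be exercised with an off-the-shelf Mask R-CNN of the same general structure (backbone, RoI-Align, and mask/box/class branches). The sketch below uses the pretrained model from torchvision as a stand-in; the specific model, score threshold and class list are assumptions and not the exact network of this embodiment.

```python
# Minimal sketch of the instance-level segmentation step using torchvision's
# Mask R-CNN as an assumed stand-in for the network described above.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

def instance_level_segmentation(image, score_thresh=0.5):
    """Return per-instance boxes, class ids, and binary masks for one image."""
    with torch.no_grad():
        pred = model([to_tensor(image)])[0]
    keep = pred["scores"] > score_thresh
    boxes = pred["boxes"][keep]          # (N, 4) refined candidate boxes
    labels = pred["labels"][keep]        # (N,)   instance-level categories
    masks = pred["masks"][keep] > 0.5    # (N, 1, H, W) binary foreground masks
    return boxes, labels, masks
```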
The salient gaze point detection network structure is shown in fig. 3. It is based on TASED-Net, i.e. a network structure that jointly performs spatio-temporal encoding: a convolutional neural network takes the consecutive image frames over a period of time as input and predicts a probability map of the salient target for the current image frame. As can be seen from fig. 3, the network takes images of consecutive frames as input, extracts features layer by layer with convolution and pooling layers in the encoding stage, enlarges the output features by alternating unpooling and transposed convolution in the decoding stage, and fuses the corresponding shallow features (obtained from the pooling of the shallow stages) at each decoding stage, so as to combine high-level semantic features with low-level texture features.
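The following is a minimal sketch of such a spatio-temporal encoder-decoder; it is not TASED-Net itself, and the layer sizes, temporal handling and upsampling factors are illustrative assumptions.

```python
# Minimal sketch of a spatio-temporal saliency network: 3D convolutions
# encode a clip of consecutive frames, and a 2D decoder upsamples back to
# a single-frame saliency probability map.
import torch
import torch.nn as nn

class SaliencyGazeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(64, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 1, kernel_size=1),
            nn.Sigmoid(),  # per-pixel saliency probability for the current frame
        )

    def forward(self, clip):            # clip: (B, 3, T, H, W)
        feat = self.encoder(clip)       # (B, 64, T', H/4, W/4)
        feat = feat.mean(dim=2)         # collapse the temporal dimension
        return self.decoder(feat)       # (B, 1, H, W) saliency map
```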
The human body posture detection network structure is shown in fig. 4. The network contains a target detection sub-network: the position of each human body box is detected first, and the pose of each person is then estimated within that box (the human body boxes can also be obtained from the instance segmentation results). First, the network uses a spatial transformer network (parameter θ) to apply an affine transformation to the content of the detection box, turning an inaccurate human detection box into an accurate, high-quality one. The transformed human body box and pixels are fed into a Single Person Pose Estimator (SPPE) module to obtain the estimated pose. Finally, a spatial inverse-transformer network (parameter γ, obtained from the inverse transformation of θ) maps the estimated pose back to the original image coordinates. The parallel SPPE module in the network is used only during training, where it serves as an additional regularization term; its parameters are frozen to prevent the model from falling into a local optimum and to help optimize the spatial transformer network.
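For illustration, the pose branch can be stood in for by torchvision's Keypoint R-CNN, which also outputs 17 COCO keypoints per person; this is an assumed substitute for the two-stage detector plus single-person pose estimator described above, not the exact network of this embodiment.

```python
# Minimal sketch of the human-pose branch using torchvision's Keypoint R-CNN
# (17 COCO keypoints) as an assumed stand-in.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

pose_model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
pose_model.eval()

def detect_keypoints(image, score_thresh=0.7):
    """Return (N, 17, 3) keypoints [x, y, visibility] for each detected person."""
    with torch.no_grad():
        pred = pose_model([to_tensor(image)])[0]
    keep = pred["scores"] > score_thresh
    return pred["keypoints"][keep]
```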
In step S3, the image information of the current scene detected in step S1 is input into the instance-level segmentation network constructed in step S2 to obtain the instance-level segmentation result map for the current scene, which contains the instance-level boxes, contours and corresponding categories.
Step S4: attempting to acquire a mask map of the salient point gaze point region based on the voice information in the current scene, and performing step S5 when the attempt is successful; if the attempt is unsuccessful, continuing to attempt to acquire a mask map of the salient point gaze area based on the human body posture detection network, and if the attempt is successful, performing step S5; if the attempt is unsuccessful, continuing to attempt to acquire a mask map of the salient point region based on the salient point detection network, and if the attempt is successful, performing step S5; ending the step when the attempt is unsuccessful; the step S4 specifically comprises the following steps:
step S41, judging whether voice information in the current scene is detected, if yes, proceeding to step S42, otherwise proceeding to step S43.
Step S42, judging from the voice information in the current scene whether the content under discussion in the current voice interaction is related to the instance-level categories obtained in step S3; if so, the instance-level category related to the discussion content is taken as the target category with the highest saliency, this category being the salient gaze point tracked by the bionic binocular device, a mask map of the salient gaze point region is generated, and step S5 is performed; if not, step S43 is performed. Step S42 specifically comprises:
step S421, extracting the center word of the voice information in the current scene through word segmentation operation.
Step S422, calculating the distance between the center word and the known category word, wherein the known category word is the category word of the instance-level category obtained in step S3.
Step S423, sorting the distances between the center words and the category words from low to high, and retaining the center-word categories whose distance is smaller than a preset threshold, thereby screening out the categories with low correlation.
Step S424, judging whether the retained center-word categories intersect with the instance-level categories obtained in step S3. If they do, the content under discussion in the current voice interaction is related to a target category detected by the instance-level segmentation network; the target category in the intersection is extracted, and a mask map in which the gaze point of that category's region has the highest saliency is generated. If there is no intersection, the content under discussion in the current voice interaction is unrelated to the salient object categories in the current scene, and the attempt to generate a mask map of the salient gaze point region based on the voice information in the current scene fails.
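By way of illustration, steps S421-S424 can be sketched as follows. The jieba tokenizer, the pre-trained word-vector table `word_vectors` and the cosine-distance threshold are assumptions used only for this sketch, not components fixed by this embodiment.

```python
# Minimal sketch of the speech-to-category matching in steps S421-S424,
# assuming a dict `word_vectors` mapping words to numpy embedding vectors.
import jieba
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def salient_category_from_speech(sentence, instance_categories,
                                 word_vectors, dist_thresh=0.4):
    # S421: extract candidate center words by word segmentation
    center_words = [w for w in jieba.lcut(sentence) if w in word_vectors]
    # S422/S423: keep categories whose word-vector distance to some center
    # word is below the preset threshold
    related = set()
    for w in center_words:
        for cat in instance_categories:
            if cat in word_vectors and \
               cosine_distance(word_vectors[w], word_vectors[cat]) < dist_thresh:
                related.add(cat)
    # S424: the intersection with the instance-level categories (if any)
    # gives the target category with the highest saliency
    return related & set(instance_categories)
```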
Step S43, inputting the image information of the current scene detected in step S1 into the human body posture detection network constructed in step S2, detecting the human body key points in the image information, and judging whether a person in the current scene performs a specific action; if so, determining the salient gaze point from the human body key points, generating a mask map of the salient gaze point region, and performing step S5; if not, performing step S44.
Specifically, as shown in fig. 5, the human body posture detection network detects 17 key points of the human body (eyes, nose, wrists, shoulders, etc.). In this embodiment, whether the current interacting person is performing a specific action is determined from the positional relationship of the wrist, elbow, eye and shoulder key points. Among these specific actions, the focus is on the two actions of "calling" and "delivering an object". Specifically, as shown in fig. 6, when the height difference between the wrist key point and the eye key points is detected to be within a certain threshold (the threshold is a hyperparameter and is adjusted dynamically for different actual scenes), a "calling" action is considered to be taking place, and the salient gaze point is located at the middle position between the two eye key points; when the heights of the elbow, wrist and shoulder on the same side are detected to increase in turn, an object-delivery action is considered to be taking place, and the salient gaze point is located at the wrist key point.
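A minimal sketch of this specific-action test is given below. It assumes keypoints in the COCO order (eyes at indices 1 and 2, shoulders at 5 and 6, elbows at 7 and 8, wrists at 9 and 10), image coordinates in which a smaller y value means a higher position, and an illustrative height threshold.

```python
# Minimal sketch of step S43's specific-action test on COCO-ordered keypoints.
import numpy as np

def salient_gaze_point_from_pose(kp, height_thresh=20.0):
    """kp: (17, 2) or (17, 3) array of keypoints; returns a gaze point or None."""
    kp = np.asarray(kp)[:, :2]
    l_eye, r_eye = kp[1], kp[2]
    for shoulder, elbow, wrist in ((kp[5], kp[7], kp[9]), (kp[6], kp[8], kp[10])):
        # "calling": wrist roughly at eye height -> gaze point between the eyes
        if abs(wrist[1] - l_eye[1]) < height_thresh or \
           abs(wrist[1] - r_eye[1]) < height_thresh:
            return (l_eye + r_eye) / 2.0
        # "delivery": elbow, wrist, shoulder heights increase in turn
        # (smaller y is higher) -> gaze point at the wrist keypoint
        if elbow[1] > wrist[1] > shoulder[1]:
            return wrist
    return None  # no specific action detected; fall through to step S44
```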
Step S44, inputting the image information of the current scene detected in step S1 into the salient gaze point detection network constructed in step S2, generating a salient gaze point prediction result, and acquiring the mask map of the salient gaze point region based on that prediction result.
By analyzing human eye gaze-tracking data sets, it is found that in reality the human gaze does not jump very frequently, and each fixation is maintained for a period of time after the gaze point jumps. The raw output of an existing salient gaze point detection network, however, may jump abruptly, which contradicts the human visual attention mechanism. The invention therefore introduces a jump detection algorithm into the process by which the salient gaze point detection network acquires the mask map of the salient gaze point region. At the same time, when the eyes fixate on a salient gaze point for a long time, their attention to the current gaze point gradually decreases over time, while their attention to moving objects, or to objects not yet looked at, relatively increases. The invention therefore also designs a Gaussian decay strategy for the salient gaze point.
Specifically, step S44 includes:
in step S441, images of a plurality of continuous frames are input into the saliency point of regard detection network, network output results of the plurality of continuous frames and corner data of images of the continuous frames tracked by the bionic binocular device are obtained, and the obtained network output results and corner data are cached.
In step S442, for the plurality of continuous frames in step S441, it is determined whether the rotation angle of the bionic binocular device from the previous frame to the next frame is smaller than a preset threshold (super parameter), if yes, the previous frame to the next frame is considered to be a relatively static scene, so that the continuous frames with rotation angles smaller than the preset threshold of the bionic binocular device are continuous frames in a static state. According to the judging method, continuous frames in a static state are cached, and a count table of all 0 salient points is established for a first frame in the continuous frames in the static state. The salient point count table is used for counting the number of times of most salient points in the vicinity of each point in the continuous frames in a static state in the following steps, and caching the counted salient point count table.
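A minimal sketch of this stationary-frame buffering and count-table initialization might look as follows; the rotation-angle threshold and the map resolution are assumptions.

```python
# Minimal sketch of step S442: buffer frames captured while the device is
# (nearly) stationary and start a fresh all-zero salient-point count table
# whenever a large rotation breaks the stationary run.
import numpy as np

def update_stationary_buffer(buffer, count_table, frame_saliency, rot_angle,
                             angle_thresh=1.0, map_shape=(240, 320)):
    """Append one frame's saliency map to the stationary buffer."""
    if rot_angle >= angle_thresh:
        buffer = []          # the device moved: drop the old stationary run
        count_table = None
    if count_table is None:
        count_table = np.zeros(map_shape, dtype=np.int32)  # all-zero count table
    buffer.append(frame_saliency)
    return buffer, count_table
```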
Step S443, the coordinates of the most salient point of each of the consecutive frames are obtained from the network output results, and it is judged whether the current frame i satisfies either of the following replacement conditions:
1) d(i, i-1) > τ and d(i-1, i-2) > τ, wherein d denotes the distance function between the most-salient-point coordinates (x_i, y_i) of two frames and τ denotes a preset distance threshold. That is, the result of the current frame i exceeds the range of the previous frame i-1 (i.e. the distance between the most salient point of the current frame i and that of the previous frame i-1 exceeds the threshold), and the result of the previous frame i-1 also exceeds the range of the frame i-2 before it; consecutive jumps have appeared, and the replacement condition is satisfied; or:
2) d(i, i-1) > τ, d(i-1, i-2) = 0 and d(i-2, i-3) > τ. That is, the result of the current frame i exceeds the range of the previous frame i-1, the result of frame i-1 has already been replaced by that of frame i-2 (so their distance is 0), and the result of frame i-2 exceeds the range of frame i-3; the requirement of remaining stable for at least 3 consecutive frames is still not met, and the replacement condition is satisfied.
If the replacement condition is satisfied, the network output result of the current frame i is replaced by the network output result of the previous frame i-1 (to avoid frequent jumps), the cached network output results are updated, and step S444 is performed; if the replacement condition is not satisfied, step S444 is performed directly.
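Assuming the cached most-salient-point coordinates of frames i, i-1, i-2 and i-3 are available as (x, y) pairs and that τ is measured in pixels, the replacement test can be sketched as follows.

```python
# Minimal sketch of the jump-suppression test in step S443.
import numpy as np

def should_replace(p_i, p_i1, p_i2, p_i3, tau=30.0):
    """p_k: (x, y) most-salient-point coordinates of frames i, i-1, i-2, i-3."""
    d = lambda a, b: float(np.hypot(a[0] - b[0], a[1] - b[1]))
    cond1 = d(p_i, p_i1) > tau and d(p_i1, p_i2) > tau           # consecutive jumps
    cond2 = d(p_i, p_i1) > tau and d(p_i1, p_i2) == 0 and \
            d(p_i2, p_i3) > tau                                  # i-1 already replaced
    return cond1 or cond2
```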
Step S444, it is judged whether the current frame i is now a jump initial frame; if so, the cached salient-point count table is cleared completely; if not, only the salient-point count table needs to be updated, as follows: in the salient-point count table, all values within the range of radius τ centered at (x_i, y_i) are incremented by 1. The method for judging whether a frame is a jump initial frame is: judging whether the current frame i and its previous frame satisfy d(i, i-1) > τ; if so, the current frame i is a jump initial frame; if not, it is not.
Step S445, a Gaussian decay strategy is applied to the salient-point count table obtained in step S444 to obtain a saliency attenuation coefficient map, and this coefficient map is multiplied onto the network output map of the current frame i to obtain the final mask map of the salient gaze point region for the current frame i.
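Steps S444-S445 can be sketched as follows. The embodiment does not fix the exact mapping from accumulated counts to attenuation coefficients, so the Gaussian form and the sigma value below are assumptions.

```python
# Minimal sketch of the count-table update (S444) and Gaussian decay (S445).
import numpy as np

def update_count_table(count_table, most_salient_xy, tau=30):
    """Increment the counts within radius tau of the current most salient point."""
    h, w = count_table.shape
    ys, xs = np.ogrid[:h, :w]
    x0, y0 = most_salient_xy
    count_table[(xs - x0) ** 2 + (ys - y0) ** 2 <= tau ** 2] += 1
    return count_table

def apply_gaussian_decay(saliency_map, count_table, sigma=50.0):
    """Attenuate saliency as the accumulated fixation counts grow."""
    decay = np.exp(-(count_table.astype(np.float32) ** 2) / (2.0 * sigma ** 2))
    return saliency_map * decay  # final mask map of the salient gaze point region
```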
In step S5, the acquired mask map of the salient gaze point region is aligned with the instance-level segmentation result map obtained in step S3 to obtain the salient target in the current scene together with its instance-level category and contour, and the bionic binocular device tracks that salient target.
The eye movement of the human eye (the tracking of targets) is related to the actual scene, with an obvious correlation to object mobility, object category, object distance and so on within the scene. To bring the eye movement of the bionic vision system closer to the eye movement mechanism of the human visual system, it should not be explicitly specified how long a certain target is to be tracked; instead, implicit tracking should be realized based on the real-time results of saliency detection (the results show that the salient target does not jump very frequently, but rather that attention to a salient target continues for a period of time, which also agrees with human visual characteristics). Meanwhile, to make the whole implicit tracking logic closer to the mechanism of the human brain, the invention additionally takes voice interaction into account and applies the jump detection and filtering algorithm and the Gaussian decay strategy, both of which conform to the behavior of the human eye.
In summary, driven by multi-modal information, the invention realizes a bionic binocular target recognition and tracking-switching method based on the human visual attention mechanism. The method takes multi-modal input into account and divides the task into two branches, salient gaze point detection and instance-level segmentation; adding the jump detection and filtering strategy to the salient gaze point detection greatly improves the robustness of tracking switching, and adding the Gaussian decay strategy more naturally simulates the saliency attention mechanism of human vision, so that the final result is not only more accurate but also closer to the human visual mechanism. Fig. 7 shows an example of the results of salient object detection and segmentation when there is no voice interaction in the scene, and fig. 8 shows an example of the visual saliency map generated under the guidance of multi-modal information when the scene contains voice interaction (voice input: "I feel somewhat thirsty"). Compared with traditional methods, the invention realizes the biomimetic characteristic in the bionic vision system.
The foregoing description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention; various modifications can be made to the above-described embodiment of the present invention. All simple, equivalent changes and modifications made in accordance with the claims and the specification of this application fall within the scope of the patent claims. Matters not described in detail herein belong to the conventional art.

Claims (8)

1. A bionic binocular target recognition and tracking method based on a human eye visual attention mechanism is characterized by comprising the following steps:
step S1, providing a bionic binocular device, detecting a current scene by the bionic binocular device, and acquiring multi-modal information of the current scene, wherein the multi-modal information comprises image information and voice information;
step S2, constructing an instance-level segmentation network, a salient gaze point detection network and a human body posture detection network;
step S3, inputting the image information of the current scene into the instance-level segmentation network to obtain an instance-level segmentation result map for the current scene, wherein the instance-level segmentation result map comprises instance-level categories and contours;
step S4, attempting to acquire a mask map of the salient gaze point region based on the voice information in the current scene, and performing step S5 when the attempt succeeds; if the attempt fails, continuing to attempt to acquire the mask map of the salient gaze point region based on the human body posture detection network, and performing step S5 if this attempt succeeds; if it fails, continuing to attempt to acquire the mask map of the salient gaze point region based on the salient gaze point detection network, and performing step S5 if this attempt succeeds; ending the procedure if it fails as well;
step S5, aligning the acquired mask map of the salient gaze point region with the instance-level segmentation result map obtained in step S3, obtaining the salient target in the current scene together with its instance-level category and contour, and tracking the salient target with the bionic binocular device;
the step S4 includes:
step S41, judging whether the bionic binocular device detects voice information in the current scene, if so, performing step S42, and if not, performing step S43;
step S42, judging from the voice information in the current scene whether the content under discussion in the current voice interaction is related to the instance-level categories obtained in step S3; if so, taking the instance-level category related to the discussion content as the target category with the highest saliency, generating a mask map of the salient gaze point region, and performing step S5; if not, performing step S43;
step S43, inputting the image information of the current scene into the human body posture detection network, detecting human body key points in the image information, and judging whether a person in the current scene performs a specific action; if so, determining the salient gaze point from the human body key points, generating a mask map of the salient gaze point region, and performing step S5; if not, performing step S44;
step S44, inputting the image information of the current scene into the salient gaze point detection network, generating a salient gaze point prediction result, and acquiring the mask map of the salient gaze point region based on that prediction result;
the step S44 includes:
step S441, inputting the images of several consecutive frames into the salient gaze point detection network to obtain the network output results of these frames and the rotation angle data of the bionic binocular device while tracking these frames, and caching the obtained network output results and rotation angle data;
step S442, among the consecutive frames of step S441, buffering those captured in a stationary state and establishing a salient-point count table;
step S443, obtaining the coordinates of the most salient point of each of the consecutive frames from the network output results, and judging whether the current frame i satisfies the replacement condition; if so, replacing the network output result of the current frame i with the network output result of the previous frame i-1, updating the cached network output results, and performing step S444; if the replacement condition is not satisfied, performing step S444 directly;
step S444, judging whether the current frame i is now a jump initial frame; if so, clearing the cached salient-point count table completely; if not, updating the salient-point count table;
step S445, applying a Gaussian decay strategy to the salient-point count table obtained in step S444 to obtain a saliency attenuation coefficient map, and multiplying this coefficient map onto the network output map of the current frame i to obtain the final mask map of the salient gaze point region for the current frame i.
2. The method for identifying and tracking a bionic binocular object based on a human eye visual attention mechanism according to claim 1, wherein the step S42 comprises:
step S421, extracting the center word of the voice information in the current scene through word segmentation operation;
step S422, calculating the distances between the center words and the category words of the instance-level categories;
step S423, sorting these distances from low to high, and retaining the center-word categories whose distance is smaller than a preset threshold;
step S424, judging whether the retained center-word categories intersect with the instance-level categories obtained in step S3, and if so, extracting the target category in the intersection.
3. The method for identifying and tracking a bionic binocular target based on a human eye visual attention mechanism according to claim 1, wherein the method for determining the salient gaze point in step S43 is as follows: detecting the wrist, elbow, eye and shoulder key points; when the height difference between a wrist key point and the eye key points is detected to be within a certain threshold, the salient gaze point is located at the middle position between the two eye key points; when the heights of the elbow, wrist and shoulder key points are detected to increase in turn, the salient gaze point is located at the wrist key point.
4. The method for identifying and tracking a bionic binocular target based on a human eye visual attention mechanism according to claim 1, wherein the method for acquiring the consecutive frames in the stationary state in step S442 is as follows: from the several consecutive frames of step S441, extracting the consecutive frames for which the rotation angle of the bionic binocular device is smaller than a preset threshold.
5. The method for identifying and tracking a bionic binocular target based on a human eye visual attention mechanism according to claim 1, wherein the method for establishing the salient-point count table in step S442 is as follows: for the first frame of the consecutive frames in the stationary state, establishing an all-zero salient-point count table.
6. The method for identifying and tracking a bionic binocular target based on a human eye visual attention mechanism according to claim 1, wherein the replacement condition in step S443 is:
d(i, i-1) > τ and d(i-1, i-2) > τ,
wherein d denotes the distance function between the most-salient-point coordinates (x_i, y_i) of two frames, τ denotes a preset distance threshold, i-1 denotes the frame before the current frame i, and i-2 denotes the frame two before the current frame i; or
d(i, i-1) > τ and d(i-1, i-2) = 0 and d(i-2, i-3) > τ,
wherein d denotes the distance function, τ denotes the preset distance threshold, i-1 denotes the frame before the current frame i, i-2 denotes the frame two before, and i-3 denotes the frame three before the current frame i.
7. The method for identifying and tracking a bionic binocular target based on a human eye visual attention mechanism according to claim 6, wherein the method for judging whether a frame is a jump initial frame in step S444 is as follows: judging whether the current frame i and its previous frame i-1 satisfy d(i, i-1) > τ; if so, the current frame i is a jump initial frame; if not, the current frame i is not a jump initial frame.
8. The method for identifying and tracking a bionic binocular target based on a human eye visual attention mechanism according to claim 6, wherein the method for updating the salient-point count table in step S444 is as follows: in the salient-point count table, all values within the range of radius τ centered at (x_i, y_i) are incremented by 1.
CN202011298898.7A 2020-11-18 2020-11-18 Bionic binocular target identification and tracking method based on human eye visual attention mechanism Active CN112418296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011298898.7A CN112418296B (en) 2020-11-18 2020-11-18 Bionic binocular target identification and tracking method based on human eye visual attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011298898.7A CN112418296B (en) 2020-11-18 2020-11-18 Bionic binocular target identification and tracking method based on human eye visual attention mechanism

Publications (2)

Publication Number Publication Date
CN112418296A CN112418296A (en) 2021-02-26
CN112418296B true CN112418296B (en) 2024-04-02

Family

ID=74773498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011298898.7A Active CN112418296B (en) 2020-11-18 2020-11-18 Bionic binocular target identification and tracking method based on human eye visual attention mechanism

Country Status (1)

Country Link
CN (1) CN112418296B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076972A (en) * 2021-03-04 2021-07-06 山东师范大学 Two-stage Logo image detection method and system based on deep learning
CN113326751B (en) * 2021-05-19 2024-02-13 中国科学院上海微系统与信息技术研究所 Hand 3D key point labeling method
CN114445267B (en) * 2022-01-28 2024-02-06 南京博视医疗科技有限公司 Eye movement tracking method and device based on retina image
CN115690892B (en) * 2023-01-03 2023-06-13 京东方艺云(杭州)科技有限公司 Mitigation method and device, electronic equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933435A (en) * 2015-06-25 2015-09-23 中国计量学院 Machine vision construction method based on human vision simulation
CN105005788A (en) * 2015-06-25 2015-10-28 中国计量学院 Target perception method based on emulation of human low level vision
CN106042005A (en) * 2016-06-01 2016-10-26 山东科技大学 Bionic eye positioning tracking system and working method thereof
CN107909059A (en) * 2017-11-30 2018-04-13 中南大学 It is a kind of towards cooperateing with complicated City scenarios the traffic mark board of bionical vision to detect and recognition methods
CN110210563A (en) * 2019-06-04 2019-09-06 北京大学 The study of pattern pulse data space time information and recognition methods based on Spike cube SNN
CN110347186A (en) * 2019-07-17 2019-10-18 中国人民解放军国防科技大学 Ground moving target autonomous tracking system based on bionic binocular linkage
CN110458877A (en) * 2019-08-14 2019-11-15 湖南科华军融民科技研究院有限公司 The infrared air navigation aid merged with visible optical information based on bionical vision
WO2020119518A1 (en) * 2018-12-11 2020-06-18 中国科学院深圳先进技术研究院 Control method and device based on spatial awareness of artificial retina
CN111723707A (en) * 2020-06-09 2020-09-29 天津大学 Method and device for estimating fixation point based on visual saliency

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10621991B2 (en) * 2018-05-06 2020-04-14 Microsoft Technology Licensing, Llc Joint neural network for speaker recognition

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933435A (en) * 2015-06-25 2015-09-23 中国计量学院 Machine vision construction method based on human vision simulation
CN105005788A (en) * 2015-06-25 2015-10-28 中国计量学院 Target perception method based on emulation of human low level vision
CN106042005A (en) * 2016-06-01 2016-10-26 山东科技大学 Bionic eye positioning tracking system and working method thereof
CN107909059A (en) * 2017-11-30 2018-04-13 中南大学 It is a kind of towards cooperateing with complicated City scenarios the traffic mark board of bionical vision to detect and recognition methods
WO2020119518A1 (en) * 2018-12-11 2020-06-18 中国科学院深圳先进技术研究院 Control method and device based on spatial awareness of artificial retina
CN110210563A (en) * 2019-06-04 2019-09-06 北京大学 The study of pattern pulse data space time information and recognition methods based on Spike cube SNN
CN110347186A (en) * 2019-07-17 2019-10-18 中国人民解放军国防科技大学 Ground moving target autonomous tracking system based on bionic binocular linkage
CN110458877A (en) * 2019-08-14 2019-11-15 湖南科华军融民科技研究院有限公司 The infrared air navigation aid merged with visible optical information based on bionical vision
CN111723707A (en) * 2020-06-09 2020-09-29 天津大学 Method and device for estimating fixation point based on visual saliency

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
X. Min et al., "A Multimodal Saliency Model for Videos With High Audio-Visual Correspondence," IEEE Transactions on Image Processing, vol. 29, pp. 3805-3819 *
Tan J. et al., "BVMSOD: bionic vision mechanism based salient object detection," 2019 IEEE International Conference on Cyborg and Bionic Systems (CBS), pp. 335-339 *
W. Wei et al., "Occluded Pedestrian Detection Based on Depth Vision Significance in Biomimetic Binocular," IEEE Sensors Journal, vol. 19, no. 23, pp. 11469-11474 *
Wang Kaifang et al., "Stereoscopic vision control system of bionic binocular eyes," Electronic Design Engineering, vol. 26, no. 6, pp. 1-6 *

Also Published As

Publication number Publication date
CN112418296A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN112418296B (en) Bionic binocular target identification and tracking method based on human eye visual attention mechanism
CN109344725B (en) Multi-pedestrian online tracking method based on space-time attention mechanism
Habili et al. Segmentation of the face and hands in sign language video sequences using color and motion cues
CN112889108B (en) Speech classification using audiovisual data
US20220101654A1 (en) Method for recognizing actions, device and storage medium
Yang et al. Visual tracking for multimodal human computer interaction
CN112651334B (en) Robot video interaction method and system
CN111723707A (en) Method and device for estimating fixation point based on visual saliency
CN115562499B (en) Intelligent ring-based accurate interaction control method and system and storage medium
CN114926796A (en) Bend detection method based on novel mixed attention module
CN112712068A (en) Key point detection method and device, electronic equipment and storage medium
CN111950507A (en) Data processing and model training method, device, equipment and medium
Ogasawara et al. Object-based video coding by visual saliency and temporal correlation
ELBAŞI et al. Control charts approach for scenario recognition in video sequences
WO2021259033A1 (en) Facial recognition method, electronic device, and storage medium
CN114677620A (en) Focusing method, electronic device and computer readable medium
CN112069943A (en) Online multi-person posture estimation and tracking method based on top-down framework
Tapu et al. Face recognition in video streams for mobile assistive devices dedicated to visually impaired
Xue et al. Infrared Target Tracking Algorithm Based on Motion Features and Contour Features
CN112784648B (en) Method and device for optimizing feature extraction of pedestrian re-identification system of video
CN117237411A (en) Pedestrian multi-target tracking method based on deep learning
CN116896654B (en) Video processing method and related device
CN116301388B (en) Man-machine interaction scene system for intelligent multi-mode combined application
CN116824641B (en) Gesture classification method, device, equipment and computer storage medium
CN113903083B (en) Behavior recognition method and apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant