CN114639125B - Pedestrian intention prediction method and device based on video image and electronic equipment - Google Patents

Pedestrian intention prediction method and device based on video image and electronic equipment

Info

Publication number
CN114639125B
Authority
CN
China
Prior art keywords
pedestrian
sequence
video image
feature
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210323532.3A
Other languages
Chinese (zh)
Other versions
CN114639125A (en)
Inventor
陈禹行
董铮
李雪
范圣印
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yihang Yuanzhi Technology Co Ltd
Original Assignee
Beijing Yihang Yuanzhi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yihang Yuanzhi Technology Co Ltd filed Critical Beijing Yihang Yuanzhi Technology Co Ltd
Priority to CN202210323532.3A
Publication of CN114639125A
Application granted
Publication of CN114639125B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The present disclosure provides a pedestrian intention prediction method based on a video image, including: acquiring a video image sequence containing the pedestrian and an observation track sequence of the pedestrian based on continuous frames containing the pedestrian in the video image data acquired in real time; acquiring a vehicle speed sequence corresponding to continuous frames based on vehicle speed data acquired in real time; extracting video image characteristics frame by frame based on a video image sequence containing pedestrians, and acquiring continuous frame average characteristics of the video image sequence; acquiring observation track characteristics of the pedestrian based on the observation track sequence of the pedestrian; acquiring the speed characteristics of the vehicle based on the speed sequence of the vehicle corresponding to the continuous frames; acquiring modal fusion characteristics based on the continuous frame average characteristics, the observation track characteristics and the speed characteristics of the vehicle; and extracting intention characteristics representing the intention of the pedestrian based on the semantic information of the modal fusion characteristics. The disclosure also provides a pedestrian intention prediction device, an electronic device and a readable storage medium.

Description

Pedestrian intention prediction method and device based on video image and electronic equipment
Technical Field
The present disclosure relates to the field of automatic driving technologies, and in particular, to a method and an apparatus for predicting pedestrian intention based on a video image, an electronic device, and a storage medium.
Background
Autonomous vehicles are equipped with a series of sensors that sense the environment around the vehicle and assist the vehicle in making decisions. Existing autonomous vehicles generally integrate mature perception technologies, such as target detection and target tracking, and can accurately capture pedestrians and other vehicles on the road. Pedestrians are a vulnerable group in traffic scenes and are easily seriously injured in traffic accidents; braking a vehicle requires a certain buffer time, and relying only on detection and tracking technology does not allow effective measures to be taken against sudden pedestrian behaviors. Therefore, if the intention of a pedestrian can be predicted online in real time, input information can be provided to the vehicle's decision-making in advance, the vehicle behavior can be adjusted in time, and traffic road safety can be guaranteed.
The 2019 BMVC (British Machine Vision Conference) paper "Pedestrian action anticipation using contextual feature fusion in stacked RNNs" proposes a pedestrian intention prediction method that fuses context information with stacked RNNs: on the basis of gated recurrent units (GRUs), features of different modalities are extracted and fused layer by layer, ordered bottom to top from complex to simple information as pedestrian image, scene image, pedestrian pose, pedestrian trajectory, and ego-vehicle speed. The method can predict pedestrian intention in real time, but complex video semantics are not sufficiently extracted, and performing temporal modeling with convolutional layers and GRUs can damage the 2D spatial structure of the video image.
A baseline method for pedestrian intention prediction is provided in the 2021 WACV (Winter Conference on Applications of Computer Vision) paper "Benchmark for Evaluating Pedestrian Action Prediction": video image features of pedestrians are extracted with 3D convolution, the pedestrian trajectory, pose, and ego-vehicle speed are then encoded with recurrent neural networks, and finally the features of the different modalities are fused by an attention mechanism. The method can effectively fuse information of different modalities, but it is difficult for it to meet real-time requirements.
The IEEE T-ITS paper "Crossing or Not? Context-Based Recognition of Pedestrian Crossing Intention in the Urban Environment" proposes an intention prediction framework that extracts pedestrian video information with 3D convolution, computes pedestrian distance information simply from pixel coordinates, and designs a fusion module based on fully connected layers. The method explores the relationship between ego-vehicle speed, pedestrian distance, and the pedestrian's street-crossing intention, but it cannot meet real-time requirements and its generalization ability in complex scenes is weak.
At this stage, the related papers and methods in the field of pedestrian intention prediction have at least the following drawbacks and disadvantages.
First, video image features cannot be effectively extracted in real time. Due to the computing resource limitations of autonomous driving platforms and the requirement of online video feature extraction, the video pedestrian feature extraction network must run in real time and make full use of computing resources. Video data is essentially image data stacked along the time dimension. A 3D convolutional neural network can extract features from such 3-dimensional data, but 3D convolution is computationally expensive and, in addition, feature maps produced by 3D convolution cannot be reused, which is unfavorable for real-time feature extraction. A 2D convolutional neural network is an effective method for extracting image spatial features. One architecture uses a 2D convolutional neural network to extract features of single frames of the video and then feeds them into a recurrent neural network for temporal modeling and iterative updating, but the reshape operations involved cause high computational complexity and damage the spatial structure. Another typical architecture performs late fusion on the single-frame video features extracted by the 2D convolutional neural network, but lacks modeling of semantics at different levels.
Second, an efficient method of fusing multi-modal features is lacking. Some methods use an attention mechanism to correlate information of different modalities and obtain better prediction accuracy, but suffer from high computational complexity and are prone to overfitting. Other methods fuse multi-modal information at different levels and different stages to address modality heterogeneity and insufficient fusion, but this causes a certain amount of information loss, lacks robustness, and adapts poorly to changing traffic scenes. In addition, such methods are unfavorable for real-time operation and require more feature tensors to be kept in memory (video memory), so they cannot meet the fast-response requirement of autonomous vehicles.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present disclosure provides a method, an apparatus, an electronic device, and a storage medium for predicting a pedestrian intention based on a video image.
According to an aspect of the present disclosure, there is provided a pedestrian intention prediction method based on a video image, including:
acquiring a video image sequence containing the pedestrian and an observation track sequence of the pedestrian based on continuous frames containing the pedestrian in the video image data acquired in real time; acquiring a vehicle speed sequence corresponding to the continuous frames based on vehicle speed data acquired in real time;
extracting video image features frame by frame based on the video image sequence containing pedestrians, and acquiring continuous frame average features of the video image sequence; acquiring observation track characteristics of the pedestrian based on the observation track sequence of the pedestrian; acquiring the speed characteristics of the vehicle based on the speed sequence of the vehicle corresponding to the continuous frames;
acquiring modal fusion characteristics based on the continuous frame average characteristics, the observation track characteristics and the vehicle speed characteristics;
and extracting intention characteristics representing the intention of the pedestrian at least based on the semantic information of the modal fusion characteristics.
The pedestrian intention prediction method based on video images according to at least one embodiment of the present disclosure, wherein the extracting, frame by frame, video image features of pedestrians and acquiring continuous frame average features of the video image sequence based on the video image sequence containing pedestrians comprises:
performing feature extraction based on 2D convolution on a video image of a current frame to obtain a feature map of the current frame and a corresponding feature tensor;
performing time sequence modeling based on the feature map of the current frame and the feature map of the previous frame of the current frame to update the feature map of the current frame and the corresponding feature tensor, and taking the updated feature tensor of the current frame as the video image feature of the current frame;
if the feature map of the previous frame of the current frame does not exist, the feature map of the previous frame of the current frame is filled with a value of 0.
The pedestrian intention prediction method based on the video image according to at least one embodiment of the present disclosure further includes:
and carrying out dimensionality reduction processing based on a full connection layer on the feature tensor corresponding to the updated feature map of the current frame to obtain a high-dimensional feature tensor of the current frame, and storing the high-dimensional feature tensor into a high-dimensional feature tensor sequence.
The pedestrian intention prediction method based on the video image according to at least one embodiment of the present disclosure further includes:
deleting the feature map of the previous frame of the current frame, and keeping the feature map of the current frame.
The pedestrian intention prediction method based on the video image according to at least one embodiment of the present disclosure further includes:
and when the length of the high-dimensional feature tensor sequence reaches the frame number value of the continuous frames, averaging the high-dimensional feature tensor in the high-dimensional feature tensor sequence to obtain the average features of the continuous frames.
According to the pedestrian intention prediction method based on the video image of at least one embodiment of the present disclosure, if the sequence element at the earliest time in the high-dimensional feature tensor sequence corresponds to a feature map filled with the value 0, it is deleted before the high-dimensional feature tensor sequence is averaged to obtain the average feature of the continuous frames.
According to at least one embodiment of the present disclosure, a pedestrian intention prediction method based on a video image, performing time sequence modeling based on a feature map of a current frame and a feature map of a previous frame of the current frame to update the feature map of the current frame, includes:
associating at least one part of channels of the feature map of the previous frame of the current frame with corresponding channels of the feature map of the current frame, and performing time sequence modeling;
and updating the characteristic diagram of the at least one part of channels of the previous frame of the current frame to the corresponding position of the characteristic diagram of the current frame to obtain the updated characteristic diagram of the current frame.
According to the pedestrian intention prediction method based on the video image, the feature extraction based on the 2D convolution is carried out on the video image of the current frame through a 2D convolution backbone network, and the 2D convolution backbone network comprises one or more 2D convolution layers.
According to the pedestrian intention prediction method based on the video image of at least one embodiment of the present disclosure, the 2D convolutional backbone network is a 2D convolutional backbone network embedded with a time sequence modeling so that the 2D convolutional backbone network can perform the time sequence modeling.
According to the pedestrian intention prediction method based on the video image, disclosed by the at least one embodiment of the disclosure, the 2D convolution backbone network is provided with a limited number of time sequence modeling positions so as to balance the performance and the calculation amount of the 2D convolution backbone network.
According to at least one embodiment of the present disclosure, a pedestrian intention prediction method based on a video image, the method for obtaining an observation track characteristic of a pedestrian based on an observation track sequence of the pedestrian, includes:
enhancing the observation track sequence by using a full-connection layer to obtain an enhanced input track set;
splicing the enhanced input track set in a time dimension to obtain an input track tensor;
inputting the input track tensor into a 1D convolution network, and extracting local short-term features;
and inputting the local short-term features into a multi-layer perceptron to carry out coding processing so as to obtain global track features.
According to at least one embodiment of the present disclosure, a pedestrian intention prediction method based on a video image, which acquires a vehicle speed feature based on a vehicle speed sequence corresponding to the continuous frames, includes:
splicing the speed sequence of the vehicle in a time dimension to obtain an input speed tensor;
and inputting the input speed tensor to a multi-layer perceptron for coding processing so as to obtain the speed characteristic of the vehicle.
According to at least one embodiment of the present disclosure, a pedestrian intention prediction method based on a video image, wherein the acquiring of a modal fusion feature based on the continuous frame average feature, the observation trajectory feature, and the vehicle speed feature includes:
adding the continuous frame average feature, the observation track feature, and the vehicle speed feature S_i to obtain the modal fusion feature.
According to at least one embodiment of the present disclosure, the method for predicting pedestrian intention based on video images, which extracts intention features representing pedestrian intention based on semantic information of the modal fusion features, includes:
and inputting the modal fusion features into a full connection layer, and mapping the modal fusion features into a two-dimensional tensor so as to represent the pedestrian's crossing-intention class and non-crossing-intention class.
According to another aspect of the present disclosure, there is provided a pedestrian intention prediction apparatus based on a video image, including:
the system comprises a video image sequence acquisition module, a pedestrian detection module and a pedestrian detection module, wherein the video image sequence acquisition module acquires a video image sequence containing pedestrians on the basis of continuous frames containing the pedestrians in video image data acquired in real time;
the pedestrian observation track sequence acquisition module is used for acquiring an observation track sequence of a pedestrian based on continuous frames containing the pedestrian in video image data acquired in real time;
the vehicle speed sequence acquisition module acquires a vehicle speed sequence corresponding to the continuous frames on the basis of vehicle speed data acquired in real time;
the image feature acquisition module extracts video image features frame by frame based on the video image sequence containing pedestrians and acquires continuous frame average features of the video image sequence;
the pedestrian observation track characteristic acquisition module acquires observation track characteristics of pedestrians on the basis of an observation track sequence of the pedestrians;
a vehicle speed feature acquisition module that acquires a vehicle speed feature based on a vehicle speed sequence corresponding to the continuous frames;
a multi-modal feature fusion module that obtains modal fusion features based on the continuous frame average features, the observed trajectory features, and the vehicle speed features; and
and the pedestrian intention acquisition module extracts intention features representing the intention of the pedestrian at least based on the semantic information of the modal fusion features.
According to yet another aspect of the present disclosure, there is provided an electronic device including:
a memory storing execution instructions;
a processor executing execution instructions stored by the memory such that the processor performs the pedestrian intent prediction method of any of the embodiments of the present disclosure.
According to still another aspect of the present disclosure, there is provided a readable storage medium having stored therein execution instructions for implementing the pedestrian intention prediction method of any one of the embodiments of the present disclosure when executed by a processor.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
Fig. 1 is a flowchart illustrating a pedestrian intention prediction method based on a video image according to an embodiment of the present disclosure.
Fig. 2 is a flowchart of a method for extracting video image features of pedestrians in a preferred embodiment of the present disclosure.
FIG. 3 is a network architecture diagram of online, real-time pedestrian intention (crossing intention) prediction in one embodiment of the present disclosure.
Fig. 4 is a flowchart of a method for acquiring average features of consecutive frames of a video image sequence in a pedestrian intention prediction method according to an embodiment of the present disclosure.
Fig. 5 is a complete flow chart of extracting video image features frame by frame and acquiring average features of consecutive frames of a video image sequence in the pedestrian intention prediction method according to an embodiment of the present disclosure.
Fig. 6 is a block diagram schematically illustrating a configuration of a pedestrian intention prediction apparatus using a hardware implementation of a processing system according to an embodiment of the present disclosure.
Description of the reference numerals
1000 pedestrian intention prediction device
1002 video image sequence acquisition module
1004 observation track sequence acquisition module
1006 vehicle speed sequence acquisition module
1008 image characteristic acquisition module
1010 pedestrian observation track feature acquisition module
1012 vehicle speed characteristic acquisition module
1014 multimodal feature fusion Module
1016 pedestrian intention acquisition module
1100 bus
1200 processor
1300 memory
1400 other circuits
Detailed Description
The present disclosure will be described in further detail with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the present disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. Technical solutions of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Unless otherwise indicated, the illustrated exemplary embodiments/examples are to be understood as providing exemplary features of various details of some ways in which the technical concepts of the present disclosure may be practiced. Accordingly, unless otherwise indicated, features of the various embodiments may be additionally combined, separated, interchanged, and/or rearranged without departing from the technical concept of the present disclosure.
The use of cross-hatching and/or shading in the drawings is generally used to clarify the boundaries between adjacent components. As such, unless otherwise noted, the presence or absence of cross-hatching or shading does not convey or indicate any preference or requirement for a particular material, material property, size, proportion, commonality between the illustrated components and/or any other characteristic, attribute, property, etc., of a component. Further, in the drawings, the size and relative sizes of components may be exaggerated for clarity and/or descriptive purposes. While example embodiments may be practiced differently, the specific process sequence may be performed in a different order than that described. For example, two processes described consecutively may be performed substantially simultaneously or in reverse order to that described. In addition, like reference numerals denote like parts.
When an element is referred to as being "on" or "over," "connected to" or "coupled to" another element, it can be directly on, connected or coupled to the other element or intervening elements may be present. However, when an element is referred to as being "directly on," "directly connected to" or "directly coupled to" another element, there are no intervening elements present. For purposes of this disclosure, the term "connected" may refer to physically, electrically, etc., and may or may not have intermediate components.
The terminology used herein is for the purpose of describing particular embodiments and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, when the terms "comprises" and/or "comprising" and variations thereof are used in this specification, the stated features, integers, steps, operations, elements, components and/or groups thereof are stated to be present but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It is also noted that, as used herein, the terms "substantially," "about," and other similar terms are used as approximate terms and not as degree terms, and as such, are used to interpret inherent deviations in measured values, calculated values, and/or provided values that would be recognized by one of ordinary skill in the art.
The video image-based pedestrian intention prediction method, apparatus, electronic device, and storage medium of the present disclosure are described in detail below with reference to fig. 1 to 6.
Fig. 1 is a flowchart illustrating a pedestrian intention prediction method based on a video image according to an embodiment of the present disclosure.
Referring to fig. 1, a pedestrian intention prediction method S100 based on a video image of the present disclosure includes:
s110, acquiring a video image sequence containing the pedestrian and an observation track sequence of the pedestrian based on continuous frames (namely continuous observation frames) containing the pedestrian in the video image data acquired in real time; and acquiring a vehicle speed sequence corresponding to continuous frames based on vehicle speed data acquired in real time.
In the pedestrian intention prediction method based on video images, three data sources are used: video image data of the pedestrian and the scene around the pedestrian from the forward-looking video stream captured by the vehicle-mounted camera, observation trajectory data of the pedestrian over a continuous period of time, and ego-vehicle speed data recorded over a continuous period of time by an on-board sensor (such as an on-board diagnostics (OBD) sensor).
The video image data of the continuous observation frames may include one pedestrian or may include more than two pedestrians.
The specific application scenario can be that the automatic driving vehicle is equipped with a high-resolution vehicle-mounted camera, video data in front of the vehicle is collected, the field angle of the vehicle-mounted camera covers pedestrian walking areas on two sides of a motor vehicle lane, and an on-board diagnostic (OBD) sensor synchronized with the vehicle-mounted camera records the position coordinates (such as GPS coordinates) and the motion state (speed, direction and the like) of the vehicle.
Using a prior-art pedestrian target detection and tracking algorithm, the pedestrian bounding box coordinates can be extracted in real time.
Illustratively, the present disclosure identifies the pedestrian in the video image data of the successive observation frames containing the pedestrian, and obtains the coordinates (x_lt, y_lt, x_rb, y_rb) of the pedestrian bounding box in the current video frame together with the pedestrian ID, where (x_lt, y_lt) and (x_rb, y_rb) respectively denote the top-left and bottom-right corner coordinates of the bounding box of a certain pedestrian in the pixel coordinate system.
Illustratively, the present disclosure extracts the observation trajectory sequence of a pedestrian by the following method.
For the original pedestrian bounding box coordinates (x_lt, y_lt, x_rb, y_rb), the method converts the top-left and bottom-right corner coordinates into center-point coordinates, height and width, and the first-order differences thereof, and divides the converted center-point coordinates by the resolution of the video image according to a normalization principle so as to map the coordinate values to between 0 and 1.
Illustratively, the present disclosure represents the observation trajectory of pedestrian i as
B_obs^i = { b_t^i | T-n+1 ≤ t ≤ T },
i.e., the observation trajectory sequence of pedestrian i, where each element b_t^i consists of the center-point coordinates and the height and width of the pedestrian bounding box together with their first-order differences, T is the index of the current (i.e., last) observation frame, n is the number of observation frames, and i is the pedestrian ID (identification number); the present disclosure exemplarily takes n = 15.
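As a non-authoritative illustration of this preprocessing, the sketch below (the helper name, the use of NumPy, and the exact ordering and normalization of the eight trajectory components are assumptions, not taken from the patent) converts per-frame detection boxes of one pedestrian into the normalized trajectory sequence described above:

    import numpy as np

    def boxes_to_trajectory(boxes, img_w, img_h):
        # boxes: (n, 4) array of (x_lt, y_lt, x_rb, y_rb) for a single pedestrian ID
        boxes = np.asarray(boxes, dtype=np.float32)
        cx = (boxes[:, 0] + boxes[:, 2]) / 2.0 / img_w   # normalized center x
        cy = (boxes[:, 1] + boxes[:, 3]) / 2.0 / img_h   # normalized center y
        w = (boxes[:, 2] - boxes[:, 0]) / img_w          # normalized width
        h = (boxes[:, 3] - boxes[:, 1]) / img_h          # normalized height
        c = np.stack([cx, cy, w, h], axis=1)             # center/size part of b_t^i
        d = np.diff(c, axis=0, prepend=c[:1])            # first-order differences (zero for the first frame)
        return np.concatenate([c, d], axis=1)            # (n, 8) trajectory sequence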
According to a preferred embodiment of the present disclosure, a video image containing a pedestrian is extracted by the following method.
The pedestrian bounding box is expanded, taking the shorter side of the box as the reference, by the factor k_context, obtaining a bounding box (x'_lt, y'_lt, x'_rb, y'_rb) that contains background information. Preferably k_context = 1.5; experiments show that 1.5 gives the best effect: on one hand the pedestrian's features remain sufficiently prominent, and on the other hand the scene around the pedestrian and other related pedestrians are included, whereas a k_context that is too large or too small degrades both aspects. Preferably, the pedestrian bounding box is expanded according to the following equations:
w' = w + min(w, h) × k_context
h' = h + min(w, h) × k_context
x'_lt = x_c - w'/2,  y'_lt = y_c - h'/2
x'_rb = x_c + w'/2,  y'_rb = y_c + h'/2
where w and h are the width and height of the original bounding box and (x_c, y_c) is its center. The corresponding pedestrian and surrounding-scene image region is then cropped according to (x'_lt, y'_lt, x'_rb, y'_rb); if (x'_lt, y'_lt, x'_rb, y'_rb) exceeds the boundary of the original video image, the coordinates of the image boundary are taken as the coordinates of the expanded bounding box.
Keeping the aspect ratio of the cropped rectangular region unchanged, its long side is scaled, for example, to 224 pixels (so that the size of the original rectangular image region changes), the resized image region is placed at the center of a 2D space of size, for example, 224 × 224, and the pixel values of the non-image area in this 2D space are filled with (0, 0, 0), yielding the video image sequence containing the pedestrian
V^i = { v_t^i | T-n+1 ≤ t ≤ T },
where T is the index of the current (i.e., last) observation frame, n is the number of observation frames, and i is the pedestrian ID (identification number).
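A minimal sketch of this cropping step is given below; it assumes OpenCV and NumPy are available, and the function name and boundary handling details are illustrative rather than prescribed by the patent:

    import cv2
    import numpy as np

    def crop_pedestrian_context(frame, box, k_context=1.5, out_size=224):
        # Expand the pedestrian box by k_context times its shorter side, clamp to the image,
        # resize the long side to out_size and zero-pad to an out_size x out_size square.
        H, W = frame.shape[:2]
        x_lt, y_lt, x_rb, y_rb = box
        w, h = x_rb - x_lt, y_rb - y_lt
        cx, cy = (x_lt + x_rb) / 2.0, (y_lt + y_rb) / 2.0
        w2 = w + min(w, h) * k_context
        h2 = h + min(w, h) * k_context
        x0, x1 = int(max(0, cx - w2 / 2)), int(min(W, cx + w2 / 2))
        y0, y1 = int(max(0, cy - h2 / 2)), int(min(H, cy + h2 / 2))
        crop = frame[y0:y1, x0:x1]
        scale = out_size / max(crop.shape[0], crop.shape[1])
        crop = cv2.resize(crop, (int(crop.shape[1] * scale), int(crop.shape[0] * scale)))
        canvas = np.zeros((out_size, out_size, 3), dtype=frame.dtype)  # non-image area filled with (0, 0, 0)
        oy, ox = (out_size - crop.shape[0]) // 2, (out_size - crop.shape[1]) // 2
        canvas[oy:oy + crop.shape[0], ox:ox + crop.shape[1]] = crop
        return canvas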
In the actual processing process, along with the image acquisition of the vehicle-mounted camera, the video image data is extracted frame by frame, and historical frame data does not need to be saved.
The ego-vehicle speed data described above can be obtained in real time from the records of the on-board diagnostic system, and the vehicle speed corresponding to the consecutive frames is further represented as
S_obs = { s_t | T-n+1 ≤ t ≤ T },
i.e., the vehicle speed sequence, where T is the index of the current (i.e., last) observation frame and n is the number of observation frames; the present disclosure exemplarily takes n = 15.
With continuing reference to fig. 1, the video image-based pedestrian intention prediction method S100 of the present disclosure further includes:
S120, based on the video image sequence V^i containing the pedestrian, extracting video image features frame by frame (obtained from the feature map at time t) and acquiring the continuous frame average feature E_i of the video image sequence; based on the observation trajectory sequence B_obs^i of the pedestrian, acquiring the observation trajectory feature B_i of the pedestrian; based on the vehicle speed sequence S_obs corresponding to the consecutive frames, acquiring the vehicle speed feature S_i.
The acquiring of the continuous frame average feature, the acquiring of the observation trajectory feature of the pedestrian, and the acquiring of the vehicle speed feature in step S120 may be performed simultaneously or substantially simultaneously.
The method preferably extracts the video image features of the pedestrians on line frame by frame based on a progressive real-time video image feature extraction network, and obtains continuous frame average features of the video image sequence.
According to a preferred embodiment of the present disclosure, the present disclosure extracts video image features of pedestrians frame by the following steps.
Fig. 2 is a flowchart of a method for extracting video image features of pedestrians according to a preferred embodiment of the present disclosure.
Referring to fig. 2, in step S120, extracting the video image features of the pedestrian frame by frame based on the video image sequence V^i containing the pedestrian includes:
S1202, performing 2D-convolution-based feature extraction on the video image of the current frame (the video image at time t) to obtain the feature map M_t^i of the current frame and its corresponding feature tensor f_t^i;
S1204, performing temporal modeling based on the feature map M_t^i of the current frame and the feature map M_{t-1}^i of the previous frame so as to update the feature map of the current frame and its corresponding feature tensor, and taking the updated feature tensor of the current frame as the video image feature of the current frame.
If the feature map of the previous frame does not exist (i.e., there is no feature map at time t-1), the feature map M_{t-1}^i of the previous frame is filled with the value 0.
In step S1202, feature extraction based on 2D convolution is performed on a video image of a current frame (a video image at time t), and the feature extraction is performed through a 2D convolution backbone network, where the 2D convolution backbone network includes one or more 2D convolution layers. The 2D convolutional layer generally includes an activation function module, a residual module, and the like, and can effectively extract a single-frame image feature of a video image.
The 2D convolutional backbone network selected by the method can be ResNet-50 used for algorithm research or MobileNet-V2 used for actual deployment, in the video understanding method and the pedestrian intention identification method in the prior art, ResNet-50 is mostly adopted as a basic network to compare algorithm performance, and MobileNet-V2 can effectively reduce calculation cost and guarantee the real-time requirement of an automatic driving embedded platform.
Other types of 2D convolutional backbone networks may be adopted by those skilled in the art in light of the teachings of the present disclosure, and all of them fall within the scope of the present disclosure.
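As a non-authoritative sketch (the patent does not name a deep-learning framework; torchvision and the 2048-dimensional ResNet-50 output are assumptions), such a 2D convolutional backbone could be instantiated as a per-frame feature extractor like this:

    import torch.nn as nn
    import torchvision.models as models

    # Drop the classification head so the backbone outputs a per-frame feature tensor
    # for each 224 x 224 pedestrian crop.
    backbone = models.resnet50()                 # or models.mobilenet_v2() for deployment
    feature_extractor = nn.Sequential(*list(backbone.children())[:-1])  # ResNet-50: [N, 2048, 1, 1] per frame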
In step S1204, performing temporal modeling based on the feature map M_t^i of the current frame and the feature map M_{t-1}^i of the previous frame so as to update the feature map M_t^i of the current frame preferably includes:
associating at least a part of the channels of the feature map M_{t-1}^i of the previous frame with the corresponding channels of the feature map M_t^i of the current frame, and performing temporal modeling;
updating the feature map of the at least one part of channels of the previous frame to the corresponding positions of the feature map of the current frame to obtain the updated feature map of the current frame.
In the video image-based pedestrian intention prediction method of the present disclosure, preferably, the 2D convolutional backbone network is a 2D convolutional backbone network embedded in the time-series modeling, so that the 2D convolutional backbone network can perform the time-series modeling.
According to a preferred embodiment of the present disclosure, the 2D convolutional backbone network used by the present disclosure is provided with a limited number of timing modeling positions to balance the performance and the computational load of the 2D convolutional backbone network.
Since the timing modeling may bring extra computation to the 2D convolution backbone network, the present disclosure preferably performs the timing modeling only in limited locations. In the initial stage, the positions in the 2D convolutional backbone network which need to be subjected to timing modeling can be preset.
Specifically, associating at least a part of the channels of the feature map M_{t-1}^i of the previous frame with the corresponding channels of the feature map M_t^i of the current frame and performing temporal modeling proceeds as follows: given the feature maps M_{t-1}^i and M_t^i of two temporally adjacent frames, both have consistent dimensions and sizes, expressed as [N, C, H, W], where N denotes the batch size during network training and inference, and C, H, and W respectively denote the number of channels, the height, and the width of the feature map at the current position in the 2D convolutional backbone network. If the previous-frame feature map M_{t-1}^i does not exist, the present disclosure preferably fills M_{t-1}^i with the value 0.
Fig. 3 is a network architecture diagram of online, real-time pedestrian intention (crossing intention) prediction according to one embodiment of the present disclosure.
According to the preferred embodiment of the present disclosure, the front-closed, rear-open channel region [0, d) of the feature maps M_{t-1}^i and M_t^i at adjacent moments is selected for temporal modeling, where d is the channel truncation parameter.
The temporal modeling can be implemented in various ways, such as concatenation followed by a multi-layer perceptron, channel exchange, addition, or subtraction. Selection or adjustment of the specific temporal modeling approach by those skilled in the art in light of the teachings of the present disclosure falls within the scope of the present disclosure.
In the present disclosure, the channel truncation parameter is exemplarily taken to be d = 4.
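A minimal sketch of this channel-wise temporal modeling is shown below, using the exchange variant mentioned above (PyTorch, the function name, and the zero-filling of a missing previous frame inside the helper are assumptions; other variants such as addition or subtraction would replace the marked line):

    import torch

    def temporal_update(feat_t, feat_prev, d=4):
        # feat_t, feat_prev: [N, C, H, W] feature maps at times t and t-1; d is the channel truncation parameter
        if feat_prev is None:                  # no feature map at time t-1: fill with the value 0
            feat_prev = torch.zeros_like(feat_t)
        out = feat_t.clone()
        out[:, :d] = feat_prev[:, :d]          # move channels [0, d) of frame t-1 into the current feature map
        return out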
Fig. 4 is a flowchart of a method for acquiring average features of consecutive frames of a video image sequence in a pedestrian intention prediction method according to an embodiment of the present disclosure.
Referring to fig. 4, in step S120, extracting the video image features of the pedestrian frame by frame based on the video image sequence containing the pedestrian and acquiring the continuous frame average feature of the video image sequence includes:
S1202, performing 2D-convolution-based feature extraction on the video image of the current frame (the video image at time t) to obtain the feature map M_t^i of the current frame and its corresponding feature tensor f_t^i;
S1204, performing temporal modeling based on the feature map M_t^i of the current frame and the feature map M_{t-1}^i of the previous frame so as to update the feature map of the current frame and its corresponding feature tensor, and taking the updated feature tensor of the current frame as the video image feature of the current frame;
if the feature map of the previous frame does not exist (i.e., there is no feature map at time t-1), the feature map M_{t-1}^i of the previous frame is filled with the value 0;
S1206, performing fully-connected-layer-based dimensionality reduction on the feature tensor f_t^i corresponding to the updated feature map M_t^i of the current frame to obtain the high-dimensional feature tensor e_t^i of the current frame, and storing it in the high-dimensional feature tensor sequence L_i;
S1208, when the length of the high-dimensional feature tensor sequence L_i reaches the number of consecutive frames (in this disclosure, n is exemplarily 15), averaging the high-dimensional feature tensors in L_i to obtain the continuous frame average feature E_i.
Since the dimensionality of the feature tensor f_t^i produced by the 2D convolutional backbone network is too high, which easily increases the amount of computation in subsequent processing, fully-connected-layer-based dimensionality reduction is preferably applied to it.
The dimensionality reduction of the feature tensor f_t^i corresponding to the updated feature map M_t^i of the current frame, which yields the high-dimensional feature tensor e_t^i of the current frame, is performed by the following formula:
e_t^i = φ_emb(f_t^i)
where φ_emb(·) denotes a fully connected layer and e_t^i denotes the high-dimensional feature tensor; in this disclosure, the dimension of e_t^i is exemplarily taken to be 128.
The averaging of the high-dimensional feature tensors in the sequence L_i to obtain the continuous frame average feature E_i can be done by the following equation:
E_i = (1/n) Σ_{t=T-n+1}^{T} e_t^i
where the high-dimensional feature tensor sequence L_i is expressed as:
L_i = { e_t^i | T-n+1 ≤ t ≤ T }
and n is the maximum length of the high-dimensional feature tensor sequence L_i.
According to the preferred embodiment of the present disclosure, in step S1206, the feature map M_{t-1}^i of the previous frame is deleted and the feature map M_t^i of the current frame is kept.
According to the preferred embodiment of the present disclosure, in step S1208, if the earliest sequence element in the high-dimensional feature tensor sequence L_i corresponds to a feature map filled with the value 0, it is deleted before L_i is averaged to obtain the continuous frame average feature E_i.
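The sketch below illustrates steps S1206 and S1208 under stated assumptions (PyTorch, a ResNet-50-sized 2048-dimensional input, and the class name are illustrative): each per-frame feature tensor is reduced to 128 dimensions by a fully connected layer standing in for φ_emb, buffered in a rolling sequence of length n, and averaged once the buffer is full.

    import torch
    import torch.nn as nn
    from collections import deque

    class FrameFeatureAverager(nn.Module):
        def __init__(self, in_dim=2048, emb_dim=128, n=15):
            super().__init__()
            self.fc = nn.Linear(in_dim, emb_dim)   # plays the role of phi_emb
            self.buffer = deque(maxlen=n)          # high-dimensional feature tensor sequence L_i
            self.n = n

        def forward(self, frame_feat):             # frame_feat: [N, in_dim] feature tensor of the current frame
            self.buffer.append(self.fc(frame_feat))
            if len(self.buffer) < self.n:
                return None                        # not enough observation frames yet
            return torch.stack(list(self.buffer)).mean(dim=0)  # continuous frame average feature E_i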
Fig. 5 is a complete flow chart of extracting video image features frame by frame and acquiring average features of consecutive frames of a video image sequence in the pedestrian intention prediction method according to an embodiment of the present disclosure.
In the pedestrian intention prediction method S100 based on video images according to the preferred embodiment of the present disclosure, in step S120, acquiring the observation trajectory feature B_i of the pedestrian based on the observation trajectory sequence B_obs^i of the pedestrian includes:
enhancing the observation trajectory sequence with a fully connected layer to obtain the enhanced input trajectory set;
splicing the enhanced input trajectory set in the time dimension to obtain the input trajectory tensor B_in^i;
inputting the input trajectory tensor B_in^i into a 1D convolutional network and extracting the local short-term features B_loc^i;
inputting the local short-term features B_loc^i into a multi-layer perceptron for encoding so as to obtain the global trajectory feature B_i.
The trajectory feature extraction network is a lightweight trajectory feature extraction network, and the input observation trajectory sequence B_obs^i is lightweight information. Preferably, the present disclosure uses a fully connected layer to enhance the input trajectory:
u_t^i = φ_traj(b_t^i)
where φ_traj(·) denotes a fully connected layer and the enhanced input trajectory set is { u_t^i | T-n+1 ≤ t ≤ T }; in this disclosure, illustratively, u_t^i has a dimension of 32.
Preferably, the present disclosure employs a 1D convolutional network to extract the local short-term features of the trajectory, which can be expressed as:
B_loc^i = Conv1D(B_in^i)
Preferably, the present disclosure inputs the local short-term features B_loc^i into a multi-layer perceptron for encoding so as to obtain the global trajectory feature B_i, which can be expressed as:
B_i = MLP_traj(B_loc^i)
where MLP_traj is a multi-layer perceptron and B_i is the global trajectory feature; in this disclosure, illustratively, the dimension of B_i is 128.
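A minimal sketch of this trajectory branch is given below (PyTorch; the kernel size, hidden sizes other than 32 and 128, and the class name are assumptions):

    import torch.nn as nn

    class TrajectoryEncoder(nn.Module):
        def __init__(self, in_dim=8, enh_dim=32, out_dim=128, n=15):
            super().__init__()
            self.enhance = nn.Linear(in_dim, enh_dim)                 # plays the role of phi_traj
            self.conv1d = nn.Conv1d(enh_dim, enh_dim, kernel_size=3, padding=1)
            self.mlp = nn.Sequential(nn.Linear(enh_dim * n, 128), nn.ReLU(), nn.Linear(128, out_dim))

        def forward(self, traj):                     # traj: [N, n, in_dim] observation trajectory sequence
            x = self.enhance(traj)                   # enhanced input trajectory set, [N, n, enh_dim]
            x = self.conv1d(x.transpose(1, 2))       # local short-term features, [N, enh_dim, n]
            return self.mlp(x.flatten(1))            # global trajectory feature B_i, [N, out_dim]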
According to the pedestrian intention prediction method S100 based on video images of the preferred embodiment of the present disclosure, in step S120, acquiring the vehicle speed feature S_i based on the vehicle speed sequence S_obs corresponding to the consecutive frames includes:
splicing the vehicle speed sequence in the time dimension to obtain the input speed tensor S_in;
inputting the input speed tensor S_in into a multi-layer perceptron for encoding so as to obtain the vehicle speed feature S_i.
The speed encoding network of the present disclosure is a lightweight speed encoding network. The vehicle speed input is the vehicle speed sequence S_obs, which is lightweight information; the elements of the set are spliced in the time dimension to obtain the tensor S_in, which is encoded directly by a multi-layer perceptron to obtain the vehicle speed feature S_i:
S_i = MLP_spd(S_in)
where MLP_spd denotes a multi-layer perceptron; in this disclosure, illustratively, the dimension of S_i is 128.
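A corresponding sketch of the speed branch (the hidden layer size and class name are assumptions):

    import torch.nn as nn

    class SpeedEncoder(nn.Module):
        def __init__(self, n=15, out_dim=128):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(n, 64), nn.ReLU(), nn.Linear(64, out_dim))  # MLP_spd

        def forward(self, speeds):        # speeds: [N, n] ego-vehicle speed sequence spliced over time
            return self.mlp(speeds)       # vehicle speed feature S_i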
With continuing reference to fig. 1, the video image-based pedestrian intention prediction method S100 of the present disclosure further includes:
S130, acquiring the modal fusion feature H_i based on the continuous frame average feature E_i, the observation trajectory feature B_i, and the vehicle speed feature S_i;
S140, extracting intention features characterizing the intention of the pedestrian based on the semantic information of the modal fusion feature.
The multi-modal fusion network of the present disclosure is a lightweight multi-modal fusion network that fuses the feature tensors of the different modalities; it can achieve efficient multi-modal fusion while keeping the number of parameters and FLOPs low, and obtains the modal fusion feature from which the intention feature characterizing the pedestrian's intention is further extracted.
In the method, based on the back-propagation algorithm of the neural network, the feature extraction networks of the different modalities can effectively extract intention features characterizing the pedestrian's intention; practice shows that by setting the feature tensors of the different modalities to the same channel size and then performing the fusion operation (preferably feature addition followed by a fully connected layer), the feature extraction networks of the different modalities can produce homogeneous feature tensors.
The lightweight multi-modal fusion network of the present disclosure adds the continuous frame average feature E_i, the observation trajectory feature B_i, and the vehicle speed feature S_i to obtain the modal fusion feature:
H_i = E_i + B_i + S_i
where E_i, B_i, and S_i have the same size (exemplarily 128), and the modal fusion feature H_i contains rich semantic information of multiple modalities.
In light of the technical solutions disclosed herein, those skilled in the art can also perform multimodal fusion by other methods, such as feature splicing, decision layer fusion, and the like, all of which fall within the scope of the present disclosure.
Further, in step S140, extracting the intention feature characterizing the intention of the pedestrian based on the semantic information of the modal fusion feature includes:
inputting the modal fusion feature into a fully connected layer and mapping it into a two-dimensional tensor so as to represent the pedestrian's crossing-intention and non-crossing-intention categories:
output = φ_fusion(H_i)
where output denotes the intention prediction result and φ_fusion(·) denotes a fully connected layer.
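The fusion and prediction stage can be sketched as follows (PyTorch; the class name is illustrative), adding the three 128-dimensional modality features elementwise and mapping the result through a fully connected layer to the two intention categories:

    import torch.nn as nn

    class IntentionHead(nn.Module):
        def __init__(self, dim=128, num_classes=2):
            super().__init__()
            self.fc = nn.Linear(dim, num_classes)      # plays the role of phi_fusion

        def forward(self, img_feat, traj_feat, speed_feat):
            fused = img_feat + traj_feat + speed_feat  # modal fusion feature H_i
            return self.fc(fused)                      # logits for crossing / not crossing

In use, the three inputs would be the continuous frame average feature E_i, the global trajectory feature B_i, and the vehicle speed feature S_i produced by the branches sketched above.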
As can be seen from the above description, the pedestrian intention prediction method based on video images of the present disclosure takes into account the computational limitations of the autonomous vehicle's embedded platform and the speed requirement of pedestrian intention prediction. Based on the progressive real-time video image feature extraction network architecture provided by the present disclosure, for the video image at time t only the feature map of the previous time t-1 is needed for temporal modeling, so the video image features are extracted frame by frame without repeatedly computing features.
As for the problems of excessive fusion and high computational complexity of multi-modal fusion methods in the prior art, the present method preferably fuses the features of the different modalities by feature addition and fully-connected-layer mapping, which requires few parameters and FLOPs and realizes real-time prediction of pedestrian intention.
The video image-based pedestrian intention prediction apparatus 1000 according to an embodiment of the present disclosure includes:
a video image sequence acquisition module 1002, wherein the video image sequence acquisition module 1002 acquires a video image sequence including a pedestrian based on continuous frames (continuous observation frames) including the pedestrian in video image data acquired in real time;
a pedestrian observation trajectory sequence acquisition module 1004, wherein the observation trajectory sequence acquisition module 1004 acquires an observation trajectory sequence of a pedestrian based on continuous frames (continuous observation frames) containing the pedestrian in the video image data acquired in real time;
a vehicle speed sequence acquisition module 1006, wherein the vehicle speed sequence acquisition module 1006 acquires a vehicle speed sequence corresponding to successive frames based on vehicle speed data acquired in real time;
an image feature acquisition module 1008 (a progressive real-time video image feature extraction network, a 2D convolution backbone network) extracts video image features frame by frame based on a video image sequence containing pedestrians, and acquires continuous frame average features of the video image sequence;
the pedestrian observation trajectory feature acquisition module 1010, the pedestrian observation trajectory feature acquisition module 1010 acquires observation trajectory features of pedestrians based on an observation trajectory sequence of the pedestrians;
the vehicle speed feature acquisition module 1012 acquires the vehicle speed feature S based on the vehicle speed sequence corresponding to the continuous frames by the vehicle speed feature acquisition module 1012 i
The multi-modal feature fusion module 1014, wherein the multi-modal feature fusion module 1014 obtains modal fusion features based on the average features of the continuous frames, the observation trajectory features and the speed features of the vehicle;
a pedestrian intention acquisition module 1016, wherein the pedestrian intention acquisition module 1016 extracts intention characteristics representing the intention of the pedestrian based on at least the semantic information of the modal fusion characteristics.
The video image-based pedestrian intention prediction apparatus 1000 of the present disclosure may be implemented by way of a computer software program architecture.
Fig. 6 is a block diagram schematically illustrating the structure of a video image-based pedestrian intention prediction apparatus 1000 using a hardware implementation of a processing system according to an embodiment of the present disclosure.
The video image-based pedestrian intention prediction apparatus 1000 may include corresponding modules that perform each or several of the steps of the above-described flowcharts. Thus, each step or several steps in the above-described flow charts may be performed by a respective module, and the apparatus may comprise one or more of these modules. The modules may be one or more hardware modules specifically configured to perform the respective steps, or implemented by a processor configured to perform the respective steps, or stored within a computer-readable medium for implementation by a processor, or by some combination.
The hardware architecture may be implemented with a bus architecture. The bus architecture may include any number of interconnecting buses and bridges depending on the specific application of the hardware and the overall design constraints. The bus 1100 couples various circuits including the one or more processors 1200, the memory 1300, and/or the hardware modules together. The bus 1100 may also connect various other circuits 1400, such as peripherals, voltage regulators, power management circuits, external antennas, and the like.
The bus 1100 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only a single connection line is shown, but this does not mean that there is only one bus or only one type of bus.
Any process or method descriptions in the flowcharts or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. Moreover, the scope of the preferred embodiments of the present disclosure includes other implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those skilled in the art to which the embodiments of the present disclosure pertain. The processor performs the various methods and processes described above. For example, the method embodiments in the present disclosure may be implemented as a software program tangibly embodied in a machine-readable medium, such as a memory. In some embodiments, some or all of the software program may be loaded and/or installed via the memory and/or a communication interface. When the software program is loaded into the memory and executed by the processor, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above in any other suitable manner (e.g., by means of firmware).
The logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a "readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the readable storage medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in the memory.
It should be understood that portions of the present disclosure may be implemented in hardware, software, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing relevant hardware; the program may be stored in a readable storage medium and, when executed, performs one of or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing module, or each unit may exist physically alone, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The present disclosure also provides an electronic device, including: a memory storing execution instructions; and a processor or other hardware module that executes the execution instructions stored in the memory, such that the processor or other hardware module performs the video image-based pedestrian intent prediction method described above.
The present disclosure also provides a readable storage medium, in which an execution instruction is stored, and the execution instruction is executed by a processor to implement the above-mentioned pedestrian intention prediction method based on video images.
In the description herein, reference to the terms "one embodiment/implementation," "some embodiments/implementations," "an example," "a specific example," "some examples," and the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment/implementation or example is included in at least one embodiment/implementation or example of the present application. In this specification, the schematic representations of the above terms do not necessarily refer to the same embodiment/implementation or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments/implementations or examples. In addition, those skilled in the art may combine the different embodiments/implementations or examples, and the features of the different embodiments/implementations or examples, described in this specification, provided that they do not contradict each other.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two or three, unless specifically limited otherwise.
It will be understood by those skilled in the art that the foregoing embodiments are merely for clarity of illustration of the disclosure and are not intended to limit the scope of the disclosure. Other variations or modifications may occur to those skilled in the art, based on the foregoing disclosure, and are still within the scope of the present disclosure.

Claims (13)

1. A pedestrian intention prediction method based on a video image is characterized by comprising the following steps:
acquiring a video image sequence containing the pedestrian and an observation track sequence of the pedestrian based on continuous frames containing the pedestrian in the video image data acquired in real time; acquiring a vehicle speed sequence corresponding to the continuous frames based on vehicle speed data acquired in real time;
extracting video image features frame by frame based on the video image sequence containing pedestrians, and acquiring continuous frame average features of the video image sequence; acquiring observation track characteristics of the pedestrian based on the observation track sequence of the pedestrian; acquiring the speed characteristics of the vehicle based on the speed sequence of the vehicle corresponding to the continuous frames;
acquiring modal fusion characteristics based on the continuous frame average characteristics, the observation track characteristics and the vehicle speed characteristics; and
extracting intention features representing the intention of the pedestrian at least based on semantic information of the modal fusion features;
the method for extracting the video image features of the pedestrians frame by frame based on the video image sequence containing the pedestrians and acquiring the continuous frame average features of the video image sequence comprises the following steps: performing feature extraction based on 2D convolution on a video image of a current frame to obtain a feature map of the current frame and a corresponding feature tensor; performing time sequence modeling based on the feature map of the current frame and the feature map of the previous frame of the current frame to update the feature map of the current frame and the corresponding feature tensor, and taking the updated feature tensor of the current frame as the video image feature of the current frame; if the feature map of the previous frame of the current frame does not exist, filling the feature map of the previous frame of the current frame with a value of 0; performing dimensionality reduction processing based on a full connection layer on the feature tensor corresponding to the updated feature map of the current frame to obtain a high-dimensional feature tensor of the current frame, and storing the high-dimensional feature tensor into a high-dimensional feature tensor sequence; when the length of the high-dimensional feature tensor sequence reaches the frame number value of the continuous frames, averaging the high-dimensional feature tensor in the high-dimensional feature tensor sequence to obtain the average features of the continuous frames;
wherein the acquiring modal fusion characteristics based on the continuous frame average characteristics, the observation track characteristics and the vehicle speed characteristics comprises: adding the continuous frame average characteristics, the observation track characteristics and the vehicle speed characteristics to obtain the modal fusion characteristics.
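Purely by way of illustration of the frame-by-frame image branch recited above, a minimal PyTorch-style sketch follows; the backbone network, the fraction of channels used for the time-sequence modeling, the spatial pooling before the full connection layer, the feature dimensions and the window length are all assumptions chosen only to make the example self-contained.

```python
import torch
import torch.nn as nn


class ProgressiveImageFeature(nn.Module):
    """Illustrative sketch: 2D-conv features per frame, simple temporal modeling
    against the previous frame's feature map, FC-based dimensionality reduction,
    and averaging once a full window of frames has been observed."""

    def __init__(self, backbone: nn.Module, feat_channels: int = 512,
                 out_dim: int = 128, window: int = 16):
        super().__init__()
        self.backbone = backbone              # any 2D convolutional backbone (assumed)
        self.fc = nn.Linear(feat_channels, out_dim)
        self.window = window                  # number of continuous observation frames (assumed)
        self.prev_map = None                  # feature map of the previous frame
        self.buffer = []                      # high-dimensional feature tensor sequence

    def forward(self, frame: torch.Tensor):
        fmap = self.backbone(frame)                    # (1, C, H', W') feature map of the current frame
        if self.prev_map is None:
            self.prev_map = torch.zeros_like(fmap)     # fill the missing previous-frame map with 0
        updated = fmap.clone()
        c = max(1, fmap.shape[1] // 8)                 # fraction of channels to associate (assumed)
        updated[:, :c] = self.prev_map[:, :c]          # temporal modeling via previous-frame channels
        self.prev_map = fmap                           # keep only the current frame's map
        pooled = updated.mean(dim=(2, 3))              # spatial pooling (assumption, not in the claim)
        feat = self.fc(pooled)                         # FC-based dimensionality reduction
        self.buffer.append(feat)
        if len(self.buffer) < self.window:
            return None                                # window not yet full
        avg = torch.stack(self.buffer).mean(dim=0)     # continuous frame average features
        self.buffer.pop(0)                             # slide the window for real-time use (assumed)
        return avg
```

The final fusion step of the claim then reduces to adding this average feature to the observation track feature and the vehicle speed feature, assuming all three features share the same dimension.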
2. The video-image-based pedestrian intention prediction method according to claim 1, characterized by further comprising:
deleting the feature map of the previous frame of the current frame, and keeping the feature map of the current frame.
3. The method according to claim 1, wherein, if the sequence element at the earliest time in the high-dimensional feature tensor sequence is a feature map filled with the value 0, the element is deleted before the high-dimensional feature tensor sequence is averaged to obtain the average features of the continuous frames.
4. The method according to claim 1, wherein the step of performing time-series modeling based on the feature map of the current frame and the feature map of the previous frame to update the feature map of the current frame comprises:
associating at least a part of the channels of the feature map of the previous frame of the current frame with the corresponding channels of the feature map of the current frame to perform the time sequence modeling; and
updating the feature map of the at least a part of the channels of the previous frame of the current frame to the corresponding positions of the feature map of the current frame to obtain the updated feature map of the current frame.
5. The method of claim 4, wherein the 2D convolution-based feature extraction is performed on the video image of the current frame through a 2D convolution backbone network, and the 2D convolution backbone network comprises one or more 2D convolution layers.
6. The video-image-based pedestrian intention prediction method according to claim 5, wherein the 2D convolutional backbone network is a 2D convolutional backbone network in which the time-series modeling is embedded, so that the 2D convolutional backbone network can perform the time-series modeling.
7. The video-image-based pedestrian intention prediction method of claim 6, wherein a limited number of time-series modeling positions are set in the 2D convolutional backbone network to balance the performance and the computational load of the 2D convolutional backbone network.
8. The pedestrian intention prediction method based on video images according to claim 1, wherein the acquiring of the observation trajectory feature of the pedestrian based on the observation trajectory sequence of the pedestrian comprises:
enhancing the observation track sequence by using a full-connection layer to obtain an enhanced input track set;
splicing the enhanced input track set in a time dimension to obtain an input track tensor;
inputting the input track tensor into a 1D convolution network, and extracting local short-term features; and
inputting the local short-term features into a multi-layer perceptron for encoding to obtain a global trajectory feature.
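As a non-limiting illustration of this trajectory branch, the sketch below follows the same sequence of operations; the embedding sizes, the 1D convolution kernel size, the pooling over time before the multi-layer perceptron, and the use of bounding boxes as trajectory points are assumptions.

```python
import torch
import torch.nn as nn


class TrajectoryEncoder(nn.Module):
    """Sketch: per-step FC enhancement, splicing along the time dimension,
    1D convolution for local short-term features, MLP for a global feature."""

    def __init__(self, in_dim: int = 4, embed_dim: int = 64, out_dim: int = 128):
        super().__init__()
        self.enhance = nn.Linear(in_dim, embed_dim)          # full connection layer enhancement
        self.conv1d = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1)
        self.mlp = nn.Sequential(nn.Linear(embed_dim, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, trajectory: torch.Tensor) -> torch.Tensor:
        # trajectory: (T, 4) observed bounding boxes of the pedestrian (assumed input)
        enhanced = self.enhance(trajectory)          # (T, embed_dim) enhanced input track set
        stacked = enhanced.t().unsqueeze(0)          # (1, embed_dim, T) input track tensor
        local = torch.relu(self.conv1d(stacked))     # local short-term features
        global_feat = self.mlp(local.mean(dim=2))    # (1, out_dim) global trajectory feature
        return global_feat
```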
9. The method according to claim 1, wherein the obtaining of the vehicle speed feature based on the vehicle speed sequence corresponding to the consecutive frames comprises:
splicing the speed sequence of the vehicle in a time dimension to obtain an input speed tensor; and
inputting the input speed tensor into a multi-layer perceptron for encoding to obtain the vehicle speed features.
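A correspondingly minimal sketch of the speed branch, with the window length and output dimension assumed, is:

```python
import torch
import torch.nn as nn


class SpeedEncoder(nn.Module):
    """Sketch: splice the ego-vehicle speed sequence along the time dimension
    and encode it with a multi-layer perceptron."""

    def __init__(self, window: int = 16, out_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(window, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, speeds: torch.Tensor) -> torch.Tensor:
        # speeds: (T,) one scalar speed per observed frame, with T equal to the window length
        return self.mlp(speeds.unsqueeze(0))    # (1, out_dim) vehicle speed feature
```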
10. The pedestrian intention prediction method based on video images according to claim 1, wherein the extracting of the intention feature characterizing the intention of pedestrians based on the semantic information of the modal fusion feature comprises:
inputting the modal fusion features into a full connection layer and mapping the modal fusion features into a two-dimensional tensor to represent the crossing-intention class and the non-crossing-intention class of the pedestrian.
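Illustratively, and with the input dimension and class ordering assumed, this mapping can be sketched as:

```python
import torch
import torch.nn as nn

# Sketch only: one full connection layer maps the modal fusion feature to two
# logits (crossing vs. not crossing). The input dimension of 128 is an assumption.
intent_head = nn.Linear(128, 2)

fused_feature = torch.randn(1, 128)          # placeholder modal fusion feature
logits = intent_head(fused_feature)          # (1, 2) two-dimensional intention tensor
predicted_class = logits.argmax(dim=1)       # index of the predicted intention class (labeling assumed)
```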
11. A pedestrian intention prediction apparatus based on a video image, comprising:
the system comprises a video image sequence acquisition module, a pedestrian detection module and a pedestrian detection module, wherein the video image sequence acquisition module acquires a video image sequence containing pedestrians on the basis of continuous frames containing the pedestrians in video image data acquired in real time;
the pedestrian observation track sequence acquisition module is used for acquiring an observation track sequence of a pedestrian based on continuous frames containing the pedestrian in video image data acquired in real time;
the vehicle speed sequence acquisition module is used for acquiring a vehicle speed sequence corresponding to the continuous frames based on vehicle speed data acquired in real time;
the image feature acquisition module extracts video image features frame by frame based on the video image sequence containing pedestrians and acquires continuous frame average features of the video image sequence;
the pedestrian observation track characteristic acquisition module acquires observation track characteristics of pedestrians on the basis of an observation track sequence of the pedestrians;
a vehicle speed feature acquisition module that acquires a vehicle speed feature based on a vehicle speed sequence corresponding to the continuous frames;
a multi-modal feature fusion module that obtains modal fusion features based on the continuous frame average features, the observed trajectory features, and the vehicle speed features; and
a pedestrian intention acquisition module which extracts intention features representing pedestrian intention at least based on semantic information of the modal fusion features;
the method for extracting the video image features of the pedestrians frame by frame based on the video image sequence containing the pedestrians and acquiring the continuous frame average features of the video image sequence comprises the following steps: performing feature extraction based on 2D convolution on a video image of a current frame to obtain a feature map of the current frame and a corresponding feature tensor; performing time sequence modeling based on the feature map of the current frame and the feature map of the previous frame of the current frame to update the feature map of the current frame and the corresponding feature tensor, and taking the updated feature tensor of the current frame as the video image feature of the current frame; if the feature map of the previous frame of the current frame does not exist, filling the feature map of the previous frame of the current frame with a value of 0; performing dimensionality reduction processing based on a full connection layer on the feature tensor corresponding to the updated feature map of the current frame to obtain a high-dimensional feature tensor of the current frame, and storing the high-dimensional feature tensor into a high-dimensional feature tensor sequence; when the length of the high-dimensional feature tensor sequence reaches the frame number value of the continuous frames, averaging the high-dimensional feature tensor in the high-dimensional feature tensor sequence to obtain the average features of the continuous frames;
wherein the acquiring modal fusion characteristics based on the continuous frame average characteristics, the observation trajectory characteristics and the vehicle speed characteristics comprises: adding the continuous frame average characteristics, the observation trajectory characteristics and the vehicle speed characteristics to obtain the modal fusion characteristics.
12. An electronic device, comprising:
a memory storing execution instructions; and
a processor that executes the execution instructions stored in the memory, so that the processor performs the video image-based pedestrian intention prediction method according to any one of claims 1 to 10.
13. A readable storage medium having stored therein executable instructions for implementing the pedestrian intention prediction method of any one of claims 1 to 10 when executed by a processor.
CN202210323532.3A 2022-03-29 2022-03-29 Pedestrian intention prediction method and device based on video image and electronic equipment Active CN114639125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210323532.3A CN114639125B (en) 2022-03-29 2022-03-29 Pedestrian intention prediction method and device based on video image and electronic equipment

Publications (2)

Publication Number Publication Date
CN114639125A (en) 2022-06-17
CN114639125B (en) 2022-09-16

Family

ID=81952338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210323532.3A Active CN114639125B (en) 2022-03-29 2022-03-29 Pedestrian intention prediction method and device based on video image and electronic equipment

Country Status (1)

Country Link
CN (1) CN114639125B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016472A (en) * 2020-08-31 2020-12-01 山东大学 Driver attention area prediction method and system based on target dynamic information
CN113392725A (en) * 2021-05-26 2021-09-14 苏州易航远智智能科技有限公司 Pedestrian street crossing intention identification method based on video data
CN113807298A (en) * 2021-07-26 2021-12-17 北京易航远智科技有限公司 Pedestrian crossing intention prediction method and device, electronic equipment and readable storage medium
CN114120439A (en) * 2021-10-12 2022-03-01 江苏大学 Pedestrian intention multi-task identification and track prediction method under self-vehicle view angle of intelligent vehicle

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11205082B2 (en) * 2019-10-08 2021-12-21 Toyota Research Institute, Inc. Spatiotemporal relationship reasoning for pedestrian intent prediction
US20210114627A1 (en) * 2019-10-17 2021-04-22 Perceptive Automata, Inc. Neural networks for navigation of autonomous vehicles based upon predicted human intents
CN112579824A (en) * 2020-12-16 2021-03-30 北京中科闻歌科技股份有限公司 Video data classification method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114639125A (en) 2022-06-17

Similar Documents

Publication Publication Date Title
US11256986B2 (en) Systems and methods for training a neural keypoint detection network
JP2019061658A (en) Area discriminator training method, area discrimination device, area discriminator training device, and program
CN112889071B (en) System and method for determining depth information in a two-dimensional image
US20190375261A1 (en) Method and device for determining a trajectory in off-road scenarios
CN114418895A (en) Driving assistance method and device, vehicle-mounted device and storage medium
CN113591872A (en) Data processing system, object detection method and device
CN108830131B (en) Deep learning-based traffic target detection and ranging method
CN113312983A (en) Semantic segmentation method, system, device and medium based on multi-modal data fusion
CN113392725B (en) Pedestrian street crossing intention identification method based on video data
CN111062405A (en) Method and device for training image recognition model and image recognition method and device
CN111967396A (en) Processing method, device and equipment for obstacle detection and storage medium
CN113807298B (en) Pedestrian crossing intention prediction method and device, electronic equipment and readable storage medium
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
CN115273032A (en) Traffic sign recognition method, apparatus, device and medium
Aditya et al. Collision Detection: An Improved Deep Learning Approach Using SENet and ResNext
Nejad et al. Vehicle trajectory prediction in top-view image sequences based on deep learning method
CN114639125B (en) Pedestrian intention prediction method and device based on video image and electronic equipment
JP6992099B2 (en) Information processing device, vehicle, vehicle control method, program, information processing server, information processing method
CN113723170A (en) Integrated hazard detection architecture system and method
CN117372991A (en) Automatic driving method and system based on multi-view multi-mode fusion
CN112446292B (en) 2D image salient object detection method and system
JP7420607B2 (en) Information processing device, information processing method, vehicle, information processing server, and program
CN115249269A (en) Object detection method, computer program product, storage medium, and electronic device
Wu et al. A survey of vision-based road parameter estimating methods
CN115240133A (en) Bus congestion degree analysis method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant