CN114639125B - Pedestrian intention prediction method and device based on video image and electronic equipment - Google Patents

Pedestrian intention prediction method and device based on video image and electronic equipment

Info

Publication number
CN114639125B
Authority
CN
China
Prior art keywords
pedestrian
sequence
video image
feature
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210323532.3A
Other languages
Chinese (zh)
Other versions
CN114639125A (en)
Inventor
陈禹行
董铮
李雪
范圣印
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yihang Yuanzhi Technology Co Ltd
Original Assignee
Beijing Yihang Yuanzhi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yihang Yuanzhi Technology Co Ltd filed Critical Beijing Yihang Yuanzhi Technology Co Ltd
Priority to CN202210323532.3A
Publication of CN114639125A
Application granted
Publication of CN114639125B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The present disclosure provides a pedestrian intention prediction method based on a video image, including: acquiring a video image sequence containing the pedestrian and an observation track sequence of the pedestrian based on continuous frames containing the pedestrian in the video image data acquired in real time; acquiring a vehicle speed sequence corresponding to continuous frames based on vehicle speed data acquired in real time; extracting video image characteristics frame by frame based on a video image sequence containing pedestrians, and acquiring continuous frame average characteristics of the video image sequence; acquiring observation track characteristics of the pedestrian based on the observation track sequence of the pedestrian; acquiring the speed characteristics of the vehicle based on the speed sequence of the vehicle corresponding to the continuous frames; acquiring modal fusion characteristics based on the continuous frame average characteristics, the observation track characteristics and the speed characteristics of the vehicle; and extracting intention characteristics representing the intention of the pedestrian based on the semantic information of the modal fusion characteristics. The disclosure also provides a pedestrian intention prediction device, an electronic device and a readable storage medium.

Description

Pedestrian intention prediction method and device based on video image and electronic equipment
Technical Field
The present disclosure relates to the field of automatic driving technologies, and in particular, to a method and an apparatus for predicting pedestrian intention based on a video image, an electronic device, and a storage medium.
Background
Autonomous vehicles are equipped with a series of sensors that sense the environment around the vehicle and assist the vehicle in making decisions. Existing autonomous vehicles generally integrate mature perception technologies, such as target detection and target tracking, and can accurately capture pedestrians and other vehicles on the road. Pedestrians are a vulnerable group in traffic scenes and are easily seriously injured in traffic accidents; braking a vehicle requires a certain buffer time, and relying only on detection and tracking technology does not allow effective measures to be taken against sudden pedestrian behaviors. Therefore, if the intention of a pedestrian can be predicted online in real time, input information can be provided to the vehicle's decision-making in advance, the vehicle behavior can be adjusted in time, and traffic road safety can be guaranteed.
The 2019 BMVC (British Machine Vision Conference) paper "Pedestrian action anticipation using contextual feature fusion in stacked RNNs" proposes a pedestrian intention prediction method that fuses context information with stacked RNNs: on the basis of gated recurrent units (GRUs), features of different modalities are extracted and fused layer by layer, ordered bottom to top from complex to simple information as pedestrian image, scene image, pedestrian pose, pedestrian trajectory, and ego-vehicle speed. The method can predict pedestrian intention in real time, but complex video semantics are not sufficiently extracted, and performing temporal modeling with convolutional layers and GRUs can damage the 2D spatial structure of the video image.
A baseline method for pedestrian intention prediction is provided in the 2021 WACV (Winter Conference on Applications of Computer Vision) paper "Benchmark for Evaluating Pedestrian Action Prediction": video image features of pedestrians are extracted with 3D convolution, the pedestrian trajectory, pose, and ego-vehicle speed are then encoded with recurrent neural networks, and finally the features of the different modalities are fused by an attention mechanism. The method can effectively fuse information of different modalities, but it is difficult for it to meet real-time requirements.
The IEEE T-ITS paper "Crossing or Not? Context-Based Recognition of Pedestrian Crossing Intention in the Urban Environment" proposes an intention prediction framework that extracts pedestrian video information with 3D convolution, computes pedestrian distance information simply from pixel coordinates, and designs a fusion module based on fully connected layers. The method explores the relationship between ego-vehicle speed, pedestrian distance, and the pedestrian's street-crossing intention, but it cannot meet real-time requirements and its generalization ability in complex scenes is weak.
At this stage, the related papers and methods in the field of pedestrian intention prediction have at least the following drawbacks and disadvantages.
First, video image features cannot be effectively extracted in real time. Due to the computing resource limitations of autonomous driving platforms and the requirement of online video feature extraction, the video pedestrian feature extraction network must run in real time and make full use of computing resources. Video data is essentially image data stacked along the time dimension. A 3D convolutional neural network can extract features from such 3-dimensional data, but 3D convolution is computationally expensive and, in addition, feature maps produced by 3D convolution cannot be reused, which is unfavorable for real-time feature extraction. A 2D convolutional neural network is an effective method for extracting image spatial features. One architecture uses a 2D convolutional neural network to extract features of single frames of the video and then feeds them into a recurrent neural network for temporal modeling and iterative updating, but the reshape operations involved cause high computational complexity and damage the spatial structure. Another typical architecture performs late fusion on the single-frame video features extracted by the 2D convolutional neural network, but lacks modeling of semantics at different levels.
Second, an efficient method of fusing multi-modal features is lacking. Some methods use an attention mechanism to correlate information of different modalities and obtain better prediction accuracy, but suffer from high computational complexity and are prone to overfitting. Other methods fuse multi-modal information at different levels and different stages to address modality heterogeneity and insufficient fusion, but this causes a certain amount of information loss, lacks robustness, and adapts poorly to changing traffic scenes. In addition, such methods are unfavorable for real-time operation and require more feature tensors to be kept in memory (video memory), so they cannot meet the fast-response requirement of autonomous vehicles.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present disclosure provides a method, an apparatus, an electronic device, and a storage medium for predicting a pedestrian intention based on a video image.
According to an aspect of the present disclosure, there is provided a pedestrian intention prediction method based on a video image, including:
acquiring a video image sequence containing the pedestrian and an observation track sequence of the pedestrian based on continuous frames containing the pedestrian in the video image data acquired in real time; acquiring a vehicle speed sequence corresponding to the continuous frames based on vehicle speed data acquired in real time;
extracting video image features frame by frame based on the video image sequence containing pedestrians, and acquiring continuous frame average features of the video image sequence; acquiring observation track characteristics of the pedestrian based on the observation track sequence of the pedestrian; acquiring the speed characteristics of the vehicle based on the speed sequence of the vehicle corresponding to the continuous frames;
acquiring modal fusion characteristics based on the continuous frame average characteristics, the observation track characteristics and the vehicle speed characteristics;
and extracting intention characteristics representing the intention of the pedestrian at least based on the semantic information of the modal fusion characteristics.
The pedestrian intention prediction method based on video images according to at least one embodiment of the present disclosure, wherein the extracting, frame by frame, video image features of pedestrians and acquiring continuous frame average features of the video image sequence based on the video image sequence containing pedestrians comprises:
performing feature extraction based on 2D convolution on a video image of a current frame to obtain a feature map of the current frame and a corresponding feature tensor;
performing time sequence modeling based on the feature map of the current frame and the feature map of the previous frame of the current frame to update the feature map of the current frame and the corresponding feature tensor, and taking the updated feature tensor of the current frame as the video image feature of the current frame;
if the feature map of the previous frame of the current frame does not exist, the feature map of the previous frame of the current frame is filled with a value of 0.
The pedestrian intention prediction method based on the video image according to at least one embodiment of the present disclosure further includes:
and carrying out dimensionality reduction processing based on a full connection layer on the feature tensor corresponding to the updated feature map of the current frame to obtain a high-dimensional feature tensor of the current frame, and storing the high-dimensional feature tensor into a high-dimensional feature tensor sequence.
The pedestrian intention prediction method based on the video image according to at least one embodiment of the present disclosure further includes:
deleting the feature map of the previous frame of the current frame, and keeping the feature map of the current frame.
The pedestrian intention prediction method based on the video image according to at least one embodiment of the present disclosure further includes:
and when the length of the high-dimensional feature tensor sequence reaches the frame number value of the continuous frames, averaging the high-dimensional feature tensor in the high-dimensional feature tensor sequence to obtain the average features of the continuous frames.
According to the pedestrian intention prediction method based on the video image of at least one embodiment of the present disclosure, if the sequence element at the earliest time in the high-dimensional feature tensor sequence corresponds to a feature map filled with the value 0, it is deleted before the high-dimensional feature tensor sequence is averaged to obtain the average feature of the continuous frames.
According to at least one embodiment of the present disclosure, a pedestrian intention prediction method based on a video image, performing time sequence modeling based on a feature map of a current frame and a feature map of a previous frame of the current frame to update the feature map of the current frame, includes:
associating at least one part of channels of the feature map of the previous frame of the current frame with corresponding channels of the feature map of the current frame, and performing time sequence modeling;
and updating the characteristic diagram of the at least one part of channels of the previous frame of the current frame to the corresponding position of the characteristic diagram of the current frame to obtain the updated characteristic diagram of the current frame.
According to the pedestrian intention prediction method based on the video image, the feature extraction based on the 2D convolution is carried out on the video image of the current frame through a 2D convolution backbone network, and the 2D convolution backbone network comprises one or more 2D convolution layers.
According to the pedestrian intention prediction method based on the video image of at least one embodiment of the present disclosure, the 2D convolutional backbone network is a 2D convolutional backbone network embedded with a time sequence modeling so that the 2D convolutional backbone network can perform the time sequence modeling.
According to the pedestrian intention prediction method based on the video image, disclosed by the at least one embodiment of the disclosure, the 2D convolution backbone network is provided with a limited number of time sequence modeling positions so as to balance the performance and the calculation amount of the 2D convolution backbone network.
According to at least one embodiment of the present disclosure, a pedestrian intention prediction method based on a video image, the method for obtaining an observation track characteristic of a pedestrian based on an observation track sequence of the pedestrian, includes:
enhancing the observation track sequence by using a full-connection layer to obtain an enhanced input track set;
splicing the enhanced input track set in a time dimension to obtain an input track tensor;
inputting the input track tensor into a 1D convolution network, and extracting local short-term features;
and inputting the local short-term features into a multi-layer perceptron to carry out coding processing so as to obtain global track features.
According to at least one embodiment of the present disclosure, a pedestrian intention prediction method based on a video image, which acquires a vehicle speed feature based on a vehicle speed sequence corresponding to the continuous frames, includes:
splicing the speed sequence of the vehicle in a time dimension to obtain an input speed tensor;
and inputting the input speed tensor to a multi-layer perceptron for coding processing so as to obtain the speed characteristic of the vehicle.
According to at least one embodiment of the present disclosure, a pedestrian intention prediction method based on a video image, wherein the acquiring of a modal fusion feature based on the continuous frame average feature, the observation trajectory feature, and the vehicle speed feature includes:
adding the continuous frame average feature, the observation track feature, and the vehicle speed feature S_i to obtain the modal fusion feature.
According to at least one embodiment of the present disclosure, the method for predicting pedestrian intention based on video images, which extracts intention features representing pedestrian intention based on semantic information of the modal fusion features, includes:
and inputting the modal fusion features into a full connection layer, and mapping the modal fusion features into a two-dimensional tensor so as to represent the pedestrian's crossing-intention class and non-crossing-intention class.
According to another aspect of the present disclosure, there is provided a pedestrian intention prediction apparatus based on a video image, including:
the system comprises a video image sequence acquisition module, a pedestrian detection module and a pedestrian detection module, wherein the video image sequence acquisition module acquires a video image sequence containing pedestrians on the basis of continuous frames containing the pedestrians in video image data acquired in real time;
the pedestrian observation track sequence acquisition module is used for acquiring an observation track sequence of a pedestrian based on continuous frames containing the pedestrian in video image data acquired in real time;
the vehicle speed sequence acquisition module acquires a vehicle speed sequence corresponding to the continuous frames on the basis of vehicle speed data acquired in real time;
the image feature acquisition module extracts video image features frame by frame based on the video image sequence containing pedestrians and acquires continuous frame average features of the video image sequence;
the pedestrian observation track characteristic acquisition module acquires observation track characteristics of pedestrians on the basis of an observation track sequence of the pedestrians;
a vehicle speed feature acquisition module that acquires a vehicle speed feature based on a vehicle speed sequence corresponding to the continuous frames;
a multi-modal feature fusion module that obtains modal fusion features based on the continuous frame average features, the observed trajectory features, and the vehicle speed features; and
and the pedestrian intention acquisition module extracts intention features representing the intention of the pedestrian at least based on the semantic information of the modal fusion features.
According to yet another aspect of the present disclosure, there is provided an electronic device including:
a memory storing execution instructions;
a processor executing execution instructions stored by the memory such that the processor performs the pedestrian intent prediction method of any of the embodiments of the present disclosure.
According to still another aspect of the present disclosure, there is provided a readable storage medium having stored therein execution instructions for implementing the pedestrian intention prediction method of any one of the embodiments of the present disclosure when executed by a processor.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
Fig. 1 is a flowchart illustrating a pedestrian intention prediction method based on a video image according to an embodiment of the present disclosure.
Fig. 2 is a flowchart of a method for extracting video image features of pedestrians in a preferred embodiment of the present disclosure.
FIG. 3 is a network architecture diagram of online, real-time pedestrian intention (crossing intention) prediction in one embodiment of the present disclosure.
Fig. 4 is a flowchart of a method for acquiring average features of consecutive frames of a video image sequence in a pedestrian intention prediction method according to an embodiment of the present disclosure.
Fig. 5 is a complete flow chart of extracting video image features frame by frame and acquiring average features of consecutive frames of a video image sequence in the pedestrian intention prediction method according to an embodiment of the present disclosure.
Fig. 6 is a block diagram schematically illustrating a configuration of a pedestrian intention prediction apparatus using a hardware implementation of a processing system according to an embodiment of the present disclosure.
Description of the reference numerals
1000 pedestrian intention prediction device
1002 video image sequence acquisition module
1004 observation track sequence acquisition module
1006 vehicle speed sequence acquisition module
1008 image characteristic acquisition module
1010 pedestrian observation track feature acquisition module
1012 vehicle speed characteristic acquisition module
1014 multimodal feature fusion Module
1016 pedestrian intention acquisition module
1100 bus
1200 processor
1300 memory
1400 other circuits
Detailed Description
The present disclosure will be described in further detail with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the present disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. Technical solutions of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Unless otherwise indicated, the illustrated exemplary embodiments/examples are to be understood as providing exemplary features of various details of some ways in which the technical concepts of the present disclosure may be practiced. Accordingly, unless otherwise indicated, features of the various embodiments may be additionally combined, separated, interchanged, and/or rearranged without departing from the technical concept of the present disclosure.
The use of cross-hatching and/or shading in the drawings is generally used to clarify the boundaries between adjacent components. As such, unless otherwise noted, the presence or absence of cross-hatching or shading does not convey or indicate any preference or requirement for a particular material, material property, size, proportion, commonality between the illustrated components and/or any other characteristic, attribute, property, etc., of a component. Further, in the drawings, the size and relative sizes of components may be exaggerated for clarity and/or descriptive purposes. While example embodiments may be practiced differently, the specific process sequence may be performed in a different order than that described. For example, two processes described consecutively may be performed substantially simultaneously or in reverse order to that described. In addition, like reference numerals denote like parts.
When an element is referred to as being "on" or "over," "connected to" or "coupled to" another element, it can be directly on, connected or coupled to the other element or intervening elements may be present. However, when an element is referred to as being "directly on," "directly connected to" or "directly coupled to" another element, there are no intervening elements present. For purposes of this disclosure, the term "connected" may refer to physically, electrically, etc., and may or may not have intermediate components.
The terminology used herein is for the purpose of describing particular embodiments and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, when the terms "comprises" and/or "comprising" and variations thereof are used in this specification, the stated features, integers, steps, operations, elements, components and/or groups thereof are stated to be present but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It is also noted that, as used herein, the terms "substantially," "about," and other similar terms are used as approximate terms and not as degree terms, and as such, are used to interpret inherent deviations in measured values, calculated values, and/or provided values that would be recognized by one of ordinary skill in the art.
The video image-based pedestrian intention prediction method, apparatus, electronic device, and storage medium of the present disclosure are described in detail below with reference to fig. 1 to 6.
Fig. 1 is a flowchart illustrating a pedestrian intention prediction method based on a video image according to an embodiment of the present disclosure.
Referring to fig. 1, a pedestrian intention prediction method S100 based on a video image of the present disclosure includes:
s110, acquiring a video image sequence containing the pedestrian and an observation track sequence of the pedestrian based on continuous frames (namely continuous observation frames) containing the pedestrian in the video image data acquired in real time; and acquiring a vehicle speed sequence corresponding to continuous frames based on vehicle speed data acquired in real time.
In the pedestrian intention prediction method based on video images, three data sources are used: video image data of the pedestrian and the scene around the pedestrian from the forward-looking video stream captured by the vehicle-mounted camera, observation trajectory data of the pedestrian over a continuous period of time, and ego-vehicle speed data recorded over a continuous period of time by an on-board sensor (such as an on-board diagnostics (OBD) sensor).
The video image data of the continuous observation frames may include one pedestrian or may include more than two pedestrians.
The specific application scenario can be that the automatic driving vehicle is equipped with a high-resolution vehicle-mounted camera, video data in front of the vehicle is collected, the field angle of the vehicle-mounted camera covers pedestrian walking areas on two sides of a motor vehicle lane, and an on-board diagnostic (OBD) sensor synchronized with the vehicle-mounted camera records the position coordinates (such as GPS coordinates) and the motion state (speed, direction and the like) of the vehicle.
Using a prior-art pedestrian target detection and tracking algorithm, the pedestrian bounding box coordinates can be extracted in real time.
Illustratively, the present disclosure identifies the pedestrian in the video image data of the successive observation frames containing the pedestrian, and obtains the coordinates (x_lt, y_lt, x_rb, y_rb) of the pedestrian bounding box in the current video frame together with the pedestrian ID, where (x_lt, y_lt) and (x_rb, y_rb) respectively denote the top-left and bottom-right corner coordinates of the bounding box of a certain pedestrian in the pixel coordinate system.
Illustratively, the present disclosure extracts the observation trajectory sequence of a pedestrian by the following method.
For the original pedestrian bounding box coordinates (x_lt, y_lt, x_rb, y_rb), the method converts the top-left and bottom-right corner coordinates into center-point coordinates, height and width, and the first-order differences thereof, and divides the converted center-point coordinates by the resolution of the video image according to a normalization principle so as to map the coordinate values to between 0 and 1.
Illustratively, the present disclosure represents the observation trajectory of pedestrian i as
B_obs^i = { b_t^i | T-n+1 ≤ t ≤ T },
i.e., the observation trajectory sequence of pedestrian i, where each element b_t^i consists of the center-point coordinates and the height and width of the pedestrian bounding box together with their first-order differences, T is the index of the current (i.e., last) observation frame, n is the number of observation frames, and i is the pedestrian ID (identification number); the present disclosure exemplarily takes n = 15.
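As a non-authoritative illustration of this preprocessing, the sketch below (the helper name, the use of NumPy, and the exact ordering and normalization of the eight trajectory components are assumptions, not taken from the patent) converts per-frame detection boxes of one pedestrian into the normalized trajectory sequence described above:

    import numpy as np

    def boxes_to_trajectory(boxes, img_w, img_h):
        # boxes: (n, 4) array of (x_lt, y_lt, x_rb, y_rb) for a single pedestrian ID
        boxes = np.asarray(boxes, dtype=np.float32)
        cx = (boxes[:, 0] + boxes[:, 2]) / 2.0 / img_w   # normalized center x
        cy = (boxes[:, 1] + boxes[:, 3]) / 2.0 / img_h   # normalized center y
        w = (boxes[:, 2] - boxes[:, 0]) / img_w          # normalized width
        h = (boxes[:, 3] - boxes[:, 1]) / img_h          # normalized height
        c = np.stack([cx, cy, w, h], axis=1)             # center/size part of b_t^i
        d = np.diff(c, axis=0, prepend=c[:1])            # first-order differences (zero for the first frame)
        return np.concatenate([c, d], axis=1)            # (n, 8) trajectory sequence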
According to a preferred embodiment of the present disclosure, a video image containing a pedestrian is extracted by the following method.
The pedestrian bounding box is expanded, taking the shorter side of the box as the reference, by the factor k_context, obtaining a bounding box (x'_lt, y'_lt, x'_rb, y'_rb) that contains background information. Preferably k_context = 1.5; experiments show that 1.5 gives the best effect: on one hand the pedestrian's features remain sufficiently prominent, and on the other hand the scene around the pedestrian and other related pedestrians are included, whereas a k_context that is too large or too small degrades both aspects. Preferably, the pedestrian bounding box is expanded according to the following equations:
w' = w + min(w, h) × k_context
h' = h + min(w, h) × k_context
x'_lt = x_c - w'/2,  y'_lt = y_c - h'/2
x'_rb = x_c + w'/2,  y'_rb = y_c + h'/2
where w and h are the width and height of the original bounding box and (x_c, y_c) is its center. The corresponding pedestrian and surrounding-scene image region is then cropped according to (x'_lt, y'_lt, x'_rb, y'_rb); if (x'_lt, y'_lt, x'_rb, y'_rb) exceeds the boundary of the original video image, the coordinates of the image boundary are taken as the coordinates of the expanded bounding box.
Keeping the aspect ratio of the cropped rectangular region unchanged, its long side is scaled, for example, to 224 pixels (so that the size of the original rectangular image region changes), the resized image region is placed at the center of a 2D space of size, for example, 224 × 224, and the pixel values of the non-image area in this 2D space are filled with (0, 0, 0), yielding the video image sequence containing the pedestrian
V^i = { v_t^i | T-n+1 ≤ t ≤ T },
where T is the index of the current (i.e., last) observation frame, n is the number of observation frames, and i is the pedestrian ID (identification number).
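A minimal sketch of this cropping step is given below; it assumes OpenCV and NumPy are available, and the function name and boundary handling details are illustrative rather than prescribed by the patent:

    import cv2
    import numpy as np

    def crop_pedestrian_context(frame, box, k_context=1.5, out_size=224):
        # Expand the pedestrian box by k_context times its shorter side, clamp to the image,
        # resize the long side to out_size and zero-pad to an out_size x out_size square.
        H, W = frame.shape[:2]
        x_lt, y_lt, x_rb, y_rb = box
        w, h = x_rb - x_lt, y_rb - y_lt
        cx, cy = (x_lt + x_rb) / 2.0, (y_lt + y_rb) / 2.0
        w2 = w + min(w, h) * k_context
        h2 = h + min(w, h) * k_context
        x0, x1 = int(max(0, cx - w2 / 2)), int(min(W, cx + w2 / 2))
        y0, y1 = int(max(0, cy - h2 / 2)), int(min(H, cy + h2 / 2))
        crop = frame[y0:y1, x0:x1]
        scale = out_size / max(crop.shape[0], crop.shape[1])
        crop = cv2.resize(crop, (int(crop.shape[1] * scale), int(crop.shape[0] * scale)))
        canvas = np.zeros((out_size, out_size, 3), dtype=frame.dtype)  # non-image area filled with (0, 0, 0)
        oy, ox = (out_size - crop.shape[0]) // 2, (out_size - crop.shape[1]) // 2
        canvas[oy:oy + crop.shape[0], ox:ox + crop.shape[1]] = crop
        return canvas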
In the actual processing process, along with the image acquisition of the vehicle-mounted camera, the video image data is extracted frame by frame, and historical frame data does not need to be saved.
The ego-vehicle speed data described above can be obtained in real time from the records of the on-board diagnostic system, and the vehicle speed corresponding to the consecutive frames is further represented as
S_obs = { s_t | T-n+1 ≤ t ≤ T },
i.e., the vehicle speed sequence, where T is the index of the current (i.e., last) observation frame and n is the number of observation frames; the present disclosure exemplarily takes n = 15.
With continuing reference to fig. 1, the video image-based pedestrian intention prediction method S100 of the present disclosure further includes:
S120, based on the video image sequence V^i containing the pedestrian, extracting video image features frame by frame (obtained from the feature map at time t) and acquiring the continuous frame average feature E_i of the video image sequence; based on the observation trajectory sequence B_obs^i of the pedestrian, acquiring the observation trajectory feature B_i of the pedestrian; based on the vehicle speed sequence S_obs corresponding to the consecutive frames, acquiring the vehicle speed feature S_i.
The acquiring of the continuous frame average feature, the acquiring of the observation trajectory feature of the pedestrian, and the acquiring of the vehicle speed feature in step S120 may be performed simultaneously or substantially simultaneously.
The method preferably extracts the video image features of the pedestrians on line frame by frame based on a progressive real-time video image feature extraction network, and obtains continuous frame average features of the video image sequence.
According to a preferred embodiment of the present disclosure, the present disclosure extracts video image features of pedestrians frame by the following steps.
Fig. 2 is a flowchart of a method for extracting video image features of pedestrians according to a preferred embodiment of the present disclosure.
Referring to fig. 2, in step S120, extracting the video image features of the pedestrian frame by frame based on the video image sequence V^i containing the pedestrian includes:
S1202, performing 2D-convolution-based feature extraction on the video image of the current frame (the video image at time t) to obtain the feature map M_t^i of the current frame and its corresponding feature tensor f_t^i;
S1204, performing temporal modeling based on the feature map M_t^i of the current frame and the feature map M_{t-1}^i of the previous frame so as to update the feature map of the current frame and its corresponding feature tensor, and taking the updated feature tensor of the current frame as the video image feature of the current frame.
If the feature map of the previous frame does not exist (i.e., there is no feature map at time t-1), the feature map M_{t-1}^i of the previous frame is filled with the value 0.
In step S1202, feature extraction based on 2D convolution is performed on a video image of a current frame (a video image at time t), and the feature extraction is performed through a 2D convolution backbone network, where the 2D convolution backbone network includes one or more 2D convolution layers. The 2D convolutional layer generally includes an activation function module, a residual module, and the like, and can effectively extract a single-frame image feature of a video image.
The 2D convolutional backbone network selected by the method can be ResNet-50 used for algorithm research or MobileNet-V2 used for actual deployment, in the video understanding method and the pedestrian intention identification method in the prior art, ResNet-50 is mostly adopted as a basic network to compare algorithm performance, and MobileNet-V2 can effectively reduce calculation cost and guarantee the real-time requirement of an automatic driving embedded platform.
Other types of 2D convolutional backbone networks may be adopted by those skilled in the art in light of the teachings of the present disclosure, and all of them fall within the scope of the present disclosure.
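As a non-authoritative sketch (the patent does not name a deep-learning framework; torchvision and the 2048-dimensional ResNet-50 output are assumptions), such a 2D convolutional backbone could be instantiated as a per-frame feature extractor like this:

    import torch.nn as nn
    import torchvision.models as models

    # Drop the classification head so the backbone outputs a per-frame feature tensor
    # for each 224 x 224 pedestrian crop.
    backbone = models.resnet50()                 # or models.mobilenet_v2() for deployment
    feature_extractor = nn.Sequential(*list(backbone.children())[:-1])  # ResNet-50: [N, 2048, 1, 1] per frame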
In step S1204, performing temporal modeling based on the feature map M_t^i of the current frame and the feature map M_{t-1}^i of the previous frame so as to update the feature map M_t^i of the current frame preferably includes:
associating at least a part of the channels of the feature map M_{t-1}^i of the previous frame with the corresponding channels of the feature map M_t^i of the current frame, and performing temporal modeling;
updating the feature map of the at least one part of channels of the previous frame to the corresponding positions of the feature map of the current frame to obtain the updated feature map of the current frame.
In the video image-based pedestrian intention prediction method of the present disclosure, preferably, the 2D convolutional backbone network is a 2D convolutional backbone network embedded in the time-series modeling, so that the 2D convolutional backbone network can perform the time-series modeling.
According to a preferred embodiment of the present disclosure, the 2D convolutional backbone network used by the present disclosure is provided with a limited number of timing modeling positions to balance the performance and the computational load of the 2D convolutional backbone network.
Since the timing modeling may bring extra computation to the 2D convolution backbone network, the present disclosure preferably performs the timing modeling only in limited locations. In the initial stage, the positions in the 2D convolutional backbone network which need to be subjected to timing modeling can be preset.
Specifically, associating at least a part of the channels of the feature map M_{t-1}^i of the previous frame with the corresponding channels of the feature map M_t^i of the current frame and performing temporal modeling proceeds as follows: given the feature maps M_{t-1}^i and M_t^i of two temporally adjacent frames, both have consistent dimensions and sizes, expressed as [N, C, H, W], where N denotes the batch size during network training and inference, and C, H, and W respectively denote the number of channels, the height, and the width of the feature map at the current position in the 2D convolutional backbone network. If the previous-frame feature map M_{t-1}^i does not exist, the present disclosure preferably fills M_{t-1}^i with the value 0.
Fig. 3 is a network architecture diagram of online, real-time pedestrian intention (crossing intention) prediction according to one embodiment of the present disclosure.
According to the preferred embodiment of the present disclosure, the front-closed, rear-open channel region [0, d) of the feature maps M_{t-1}^i and M_t^i at adjacent moments is selected for temporal modeling, where d is the channel truncation parameter.
The temporal modeling can be implemented in various ways, such as concatenation followed by a multi-layer perceptron, channel exchange, addition, or subtraction. Selection or adjustment of the specific temporal modeling approach by those skilled in the art in light of the teachings of the present disclosure falls within the scope of the present disclosure.
In the present disclosure, the channel truncation parameter is exemplarily taken to be d = 4.
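A minimal sketch of this channel-wise temporal modeling is shown below, using the exchange variant mentioned above (PyTorch, the function name, and the zero-filling of a missing previous frame inside the helper are assumptions; other variants such as addition or subtraction would replace the marked line):

    import torch

    def temporal_update(feat_t, feat_prev, d=4):
        # feat_t, feat_prev: [N, C, H, W] feature maps at times t and t-1; d is the channel truncation parameter
        if feat_prev is None:                  # no feature map at time t-1: fill with the value 0
            feat_prev = torch.zeros_like(feat_t)
        out = feat_t.clone()
        out[:, :d] = feat_prev[:, :d]          # move channels [0, d) of frame t-1 into the current feature map
        return out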
Fig. 4 is a flowchart of a method for acquiring average features of consecutive frames of a video image sequence in a pedestrian intention prediction method according to an embodiment of the present disclosure.
Referring to fig. 4, in step S120, extracting the video image features of the pedestrian frame by frame based on the video image sequence containing the pedestrian and acquiring the continuous frame average feature of the video image sequence includes:
S1202, performing 2D-convolution-based feature extraction on the video image of the current frame (the video image at time t) to obtain the feature map M_t^i of the current frame and its corresponding feature tensor f_t^i;
S1204, performing temporal modeling based on the feature map M_t^i of the current frame and the feature map M_{t-1}^i of the previous frame so as to update the feature map of the current frame and its corresponding feature tensor, and taking the updated feature tensor of the current frame as the video image feature of the current frame;
if the feature map of the previous frame does not exist (i.e., there is no feature map at time t-1), the feature map M_{t-1}^i of the previous frame is filled with the value 0;
S1206, performing fully-connected-layer-based dimensionality reduction on the feature tensor f_t^i corresponding to the updated feature map M_t^i of the current frame to obtain the high-dimensional feature tensor e_t^i of the current frame, and storing it in the high-dimensional feature tensor sequence L_i;
S1208, when the length of the high-dimensional feature tensor sequence L_i reaches the number of consecutive frames (in this disclosure, n is exemplarily 15), averaging the high-dimensional feature tensors in L_i to obtain the continuous frame average feature E_i.
Since the dimensionality of the feature tensor f_t^i produced by the 2D convolutional backbone network is too high, which easily increases the amount of computation in subsequent processing, fully-connected-layer-based dimensionality reduction is preferably applied to it.
The dimensionality reduction of the feature tensor f_t^i corresponding to the updated feature map M_t^i of the current frame, which yields the high-dimensional feature tensor e_t^i of the current frame, is performed by the following formula:
e_t^i = φ_emb(f_t^i)
where φ_emb(·) denotes a fully connected layer and e_t^i denotes the high-dimensional feature tensor; in this disclosure, the dimension of e_t^i is exemplarily taken to be 128.
The averaging of the high-dimensional feature tensors in the sequence L_i to obtain the continuous frame average feature E_i can be done by the following equation:
E_i = (1/n) Σ_{t=T-n+1}^{T} e_t^i
where the high-dimensional feature tensor sequence L_i is expressed as:
L_i = { e_t^i | T-n+1 ≤ t ≤ T }
and n is the maximum length of the high-dimensional feature tensor sequence L_i.
According to the preferred embodiment of the present disclosure, in step S1206, the feature map M_{t-1}^i of the previous frame is deleted and the feature map M_t^i of the current frame is kept.
According to the preferred embodiment of the present disclosure, in step S1208, if the earliest sequence element in the high-dimensional feature tensor sequence L_i corresponds to a feature map filled with the value 0, it is deleted before L_i is averaged to obtain the continuous frame average feature E_i.
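The sketch below illustrates steps S1206 and S1208 under stated assumptions (PyTorch, a ResNet-50-sized 2048-dimensional input, and the class name are illustrative): each per-frame feature tensor is reduced to 128 dimensions by a fully connected layer standing in for φ_emb, buffered in a rolling sequence of length n, and averaged once the buffer is full.

    import torch
    import torch.nn as nn
    from collections import deque

    class FrameFeatureAverager(nn.Module):
        def __init__(self, in_dim=2048, emb_dim=128, n=15):
            super().__init__()
            self.fc = nn.Linear(in_dim, emb_dim)   # plays the role of phi_emb
            self.buffer = deque(maxlen=n)          # high-dimensional feature tensor sequence L_i
            self.n = n

        def forward(self, frame_feat):             # frame_feat: [N, in_dim] feature tensor of the current frame
            self.buffer.append(self.fc(frame_feat))
            if len(self.buffer) < self.n:
                return None                        # not enough observation frames yet
            return torch.stack(list(self.buffer)).mean(dim=0)  # continuous frame average feature E_i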
Fig. 5 is a complete flow chart of extracting video image features frame by frame and acquiring average features of consecutive frames of a video image sequence in the pedestrian intention prediction method according to an embodiment of the present disclosure.
In the pedestrian intention prediction method S100 based on video images according to the preferred embodiment of the present disclosure, in step S120, acquiring the observation trajectory feature B_i of the pedestrian based on the observation trajectory sequence B_obs^i of the pedestrian includes:
enhancing the observation trajectory sequence with a fully connected layer to obtain the enhanced input trajectory set;
splicing the enhanced input trajectory set in the time dimension to obtain the input trajectory tensor B_in^i;
inputting the input trajectory tensor B_in^i into a 1D convolutional network and extracting the local short-term features B_loc^i;
inputting the local short-term features B_loc^i into a multi-layer perceptron for encoding so as to obtain the global trajectory feature B_i.
The trajectory feature extraction network is a lightweight trajectory feature extraction network, and the input observation trajectory sequence B_obs^i is lightweight information. Preferably, the present disclosure uses a fully connected layer to enhance the input trajectory:
u_t^i = φ_traj(b_t^i)
where φ_traj(·) denotes a fully connected layer and the enhanced input trajectory set is { u_t^i | T-n+1 ≤ t ≤ T }; in this disclosure, illustratively, u_t^i has a dimension of 32.
Preferably, the present disclosure employs a 1D convolutional network to extract the local short-term features of the trajectory, which can be expressed as:
B_loc^i = Conv1D(B_in^i)
Preferably, the present disclosure inputs the local short-term features B_loc^i into a multi-layer perceptron for encoding so as to obtain the global trajectory feature B_i, which can be expressed as:
B_i = MLP_traj(B_loc^i)
where MLP_traj is a multi-layer perceptron and B_i is the global trajectory feature; in this disclosure, illustratively, the dimension of B_i is 128.
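A minimal sketch of this trajectory branch is given below (PyTorch; the kernel size, hidden sizes other than 32 and 128, and the class name are assumptions):

    import torch.nn as nn

    class TrajectoryEncoder(nn.Module):
        def __init__(self, in_dim=8, enh_dim=32, out_dim=128, n=15):
            super().__init__()
            self.enhance = nn.Linear(in_dim, enh_dim)                 # plays the role of phi_traj
            self.conv1d = nn.Conv1d(enh_dim, enh_dim, kernel_size=3, padding=1)
            self.mlp = nn.Sequential(nn.Linear(enh_dim * n, 128), nn.ReLU(), nn.Linear(128, out_dim))

        def forward(self, traj):                     # traj: [N, n, in_dim] observation trajectory sequence
            x = self.enhance(traj)                   # enhanced input trajectory set, [N, n, enh_dim]
            x = self.conv1d(x.transpose(1, 2))       # local short-term features, [N, enh_dim, n]
            return self.mlp(x.flatten(1))            # global trajectory feature B_i, [N, out_dim]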
According to the pedestrian intention prediction method S100 based on video images of the preferred embodiment of the present disclosure, in step S120, acquiring the vehicle speed feature S_i based on the vehicle speed sequence S_obs corresponding to the consecutive frames includes:
splicing the vehicle speed sequence in the time dimension to obtain the input speed tensor S_in;
inputting the input speed tensor S_in into a multi-layer perceptron for encoding so as to obtain the vehicle speed feature S_i.
The speed encoding network of the present disclosure is a lightweight speed encoding network. The vehicle speed input is the vehicle speed sequence S_obs, which is lightweight information; the elements of the set are spliced in the time dimension to obtain the tensor S_in, which is encoded directly by a multi-layer perceptron to obtain the vehicle speed feature S_i:
S_i = MLP_spd(S_in)
where MLP_spd denotes a multi-layer perceptron; in this disclosure, illustratively, the dimension of S_i is 128.
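A corresponding sketch of the speed branch (the hidden layer size and class name are assumptions):

    import torch.nn as nn

    class SpeedEncoder(nn.Module):
        def __init__(self, n=15, out_dim=128):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(n, 64), nn.ReLU(), nn.Linear(64, out_dim))  # MLP_spd

        def forward(self, speeds):        # speeds: [N, n] ego-vehicle speed sequence spliced over time
            return self.mlp(speeds)       # vehicle speed feature S_i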
With continuing reference to fig. 1, the video image-based pedestrian intention prediction method S100 of the present disclosure further includes:
S130, acquiring the modal fusion feature H_i based on the continuous frame average feature E_i, the observation trajectory feature B_i, and the vehicle speed feature S_i;
S140, extracting intention features characterizing the intention of the pedestrian based on the semantic information of the modal fusion feature.
The multi-modal fusion network of the present disclosure is a lightweight multi-modal fusion network that fuses the feature tensors of the different modalities; it can achieve efficient multi-modal fusion while keeping the number of parameters and FLOPs low, and obtains the modal fusion feature from which the intention feature characterizing the pedestrian's intention is further extracted.
In the method, based on the back-propagation algorithm of the neural network, the feature extraction networks of the different modalities can effectively extract intention features characterizing the pedestrian's intention; practice shows that by setting the feature tensors of the different modalities to the same channel size and then performing the fusion operation (preferably feature addition followed by a fully connected layer), the feature extraction networks of the different modalities can produce homogeneous feature tensors.
The lightweight multi-modal fusion network of the present disclosure adds the continuous frame average feature E_i, the observation trajectory feature B_i, and the vehicle speed feature S_i to obtain the modal fusion feature:
H_i = E_i + B_i + S_i
where E_i, B_i, and S_i have the same size (exemplarily 128), and the modal fusion feature H_i contains rich semantic information of multiple modalities.
In light of the technical solutions disclosed herein, those skilled in the art can also perform multimodal fusion by other methods, such as feature splicing, decision layer fusion, and the like, all of which fall within the scope of the present disclosure.
Further, in step S140, extracting the intention feature characterizing the intention of the pedestrian based on the semantic information of the modal fusion feature includes:
inputting the modal fusion feature into a fully connected layer and mapping it into a two-dimensional tensor so as to represent the pedestrian's crossing-intention and non-crossing-intention categories:
output = φ_fusion(H_i)
where output denotes the intention prediction result and φ_fusion(·) denotes a fully connected layer.
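The fusion and prediction stage can be sketched as follows (PyTorch; the class name is illustrative), adding the three 128-dimensional modality features elementwise and mapping the result through a fully connected layer to the two intention categories:

    import torch.nn as nn

    class IntentionHead(nn.Module):
        def __init__(self, dim=128, num_classes=2):
            super().__init__()
            self.fc = nn.Linear(dim, num_classes)      # plays the role of phi_fusion

        def forward(self, img_feat, traj_feat, speed_feat):
            fused = img_feat + traj_feat + speed_feat  # modal fusion feature H_i
            return self.fc(fused)                      # logits for crossing / not crossing

In use, the three inputs would be the continuous frame average feature E_i, the global trajectory feature B_i, and the vehicle speed feature S_i produced by the branches sketched above.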
As can be seen from the above description, the pedestrian intention prediction method based on video images of the present disclosure takes into account the computational limitations of the autonomous vehicle's embedded platform and the speed requirement of pedestrian intention prediction. Based on the progressive real-time video image feature extraction network architecture provided by the present disclosure, for the video image at time t only the feature map of the previous time t-1 is needed for temporal modeling, so the video image features are extracted frame by frame without repeatedly computing features.
As for the problems of excessive fusion and high computational complexity of multi-modal fusion methods in the prior art, the present method preferably fuses the features of the different modalities by feature addition and fully-connected-layer mapping, which requires few parameters and FLOPs and realizes real-time prediction of pedestrian intention.
The video image-based pedestrian intention prediction apparatus 1000 according to an embodiment of the present disclosure includes:
a video image sequence acquisition module 1002, wherein the video image sequence acquisition module 1002 acquires a video image sequence including a pedestrian based on continuous frames (continuous observation frames) including the pedestrian in video image data acquired in real time;
a pedestrian observation trajectory sequence acquisition module 1004, wherein the observation trajectory sequence acquisition module 1004 acquires an observation trajectory sequence of a pedestrian based on continuous frames (continuous observation frames) containing the pedestrian in the video image data acquired in real time;
a vehicle speed sequence acquisition module 1006, wherein the vehicle speed sequence acquisition module 1006 acquires a vehicle speed sequence corresponding to successive frames based on vehicle speed data acquired in real time;
an image feature acquisition module 1008 (a progressive real-time video image feature extraction network, a 2D convolution backbone network) extracts video image features frame by frame based on a video image sequence containing pedestrians, and acquires continuous frame average features of the video image sequence;
the pedestrian observation trajectory feature acquisition module 1010, the pedestrian observation trajectory feature acquisition module 1010 acquires observation trajectory features of pedestrians based on an observation trajectory sequence of the pedestrians;
the vehicle speed feature acquisition module 1012 acquires the vehicle speed feature S based on the vehicle speed sequence corresponding to the continuous frames by the vehicle speed feature acquisition module 1012 i
The multi-modal feature fusion module 1014, wherein the multi-modal feature fusion module 1014 obtains modal fusion features based on the average features of the continuous frames, the observation trajectory features and the speed features of the vehicle;
a pedestrian intention acquisition module 1016, wherein the pedestrian intention acquisition module 1016 extracts intention characteristics representing the intention of the pedestrian based on at least the semantic information of the modal fusion characteristics.
The video image-based pedestrian intention prediction apparatus 1000 of the present disclosure may be implemented by way of a computer software program architecture.
Fig. 6 is a block diagram schematically illustrating the structure of a video image-based pedestrian intention prediction apparatus 1000 using a hardware implementation of a processing system according to an embodiment of the present disclosure.
The video image-based pedestrian intention prediction apparatus 1000 may include corresponding modules that perform each or several of the steps of the above-described flowcharts. Thus, each step or several steps in the above-described flow charts may be performed by a respective module, and the apparatus may comprise one or more of these modules. The modules may be one or more hardware modules specifically configured to perform the respective steps, or implemented by a processor configured to perform the respective steps, or stored within a computer-readable medium for implementation by a processor, or by some combination.
The hardware architecture may be implemented with a bus architecture. The bus architecture may include any number of interconnecting buses and bridges depending on the specific application of the hardware and the overall design constraints. The bus 1100 couples various circuits including the one or more processors 1200, the memory 1300, and/or the hardware modules together. The bus 1100 may also connect various other circuits 1400, such as peripherals, voltage regulators, power management circuits, external antennas, and the like.
The bus 1100 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only a single connection line is shown, but this does not mean that there is only one bus or only one type of bus.
Any process or method descriptions in the flowcharts or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. Moreover, the scope of the preferred embodiments of the present disclosure includes other implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those skilled in the art to which the embodiments of the present disclosure pertain. The processor performs the various methods and processes described above. For example, the method embodiments in the present disclosure may be implemented as a software program tangibly embodied in a machine-readable medium, such as a memory. In some embodiments, some or all of the software program may be loaded and/or installed via the memory and/or a communication interface. When the software program is loaded into the memory and executed by the processor, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above in any other suitable manner (e.g., by means of firmware).
The logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a "readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the readable storage medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in the memory.
It should be understood that portions of the present disclosure may be implemented in hardware, software, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing relevant hardware; the program may be stored in a readable storage medium and, when executed, performs one of or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing module, or each unit may exist physically alone, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The present disclosure also provides an electronic device, including: a memory storing execution instructions; and a processor or other hardware module that executes the execution instructions stored in the memory, such that the processor or other hardware module performs the video image-based pedestrian intent prediction method described above.
The present disclosure also provides a readable storage medium, in which an execution instruction is stored, and the execution instruction is executed by a processor to implement the above-mentioned pedestrian intention prediction method based on video images.
In the description herein, reference to the terms "one embodiment/implementation," "some embodiments/implementations," "an example," "a specific example," "some examples," and the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment/implementation or example is included in at least one embodiment/implementation or example of the present application. In this specification, the schematic representations of the above terms do not necessarily refer to the same embodiment/implementation or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments/implementations or examples. In addition, those skilled in the art may combine the different embodiments/implementations or examples, and the features of the different embodiments/implementations or examples, described in this specification, provided that they do not contradict each other.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two or three, unless specifically limited otherwise.
It will be understood by those skilled in the art that the foregoing embodiments are merely for clarity of illustration of the disclosure and are not intended to limit the scope of the disclosure. Other variations or modifications may occur to those skilled in the art, based on the foregoing disclosure, and are still within the scope of the present disclosure.

Claims (13)

1. A pedestrian intention prediction method based on a video image is characterized by comprising the following steps:
acquiring a video image sequence containing the pedestrian and an observation track sequence of the pedestrian based on continuous frames containing the pedestrian in the video image data acquired in real time; acquiring a vehicle speed sequence corresponding to the continuous frames based on vehicle speed data acquired in real time;
extracting video image features frame by frame based on the video image sequence containing pedestrians, and acquiring continuous frame average features of the video image sequence; acquiring observation track characteristics of the pedestrian based on the observation track sequence of the pedestrian; acquiring the speed characteristics of the vehicle based on the speed sequence of the vehicle corresponding to the continuous frames;
acquiring modal fusion characteristics based on the continuous frame average characteristics, the observation track characteristics and the vehicle speed characteristics; and
extracting intention features representing the intention of the pedestrian at least based on semantic information of the modal fusion features;
the method for extracting the video image features of the pedestrians frame by frame based on the video image sequence containing the pedestrians and acquiring the continuous frame average features of the video image sequence comprises the following steps: performing feature extraction based on 2D convolution on a video image of a current frame to obtain a feature map of the current frame and a corresponding feature tensor; performing time sequence modeling based on the feature map of the current frame and the feature map of the previous frame of the current frame to update the feature map of the current frame and the corresponding feature tensor, and taking the updated feature tensor of the current frame as the video image feature of the current frame; if the feature map of the previous frame of the current frame does not exist, filling the feature map of the previous frame of the current frame with a value of 0; performing dimensionality reduction processing based on a full connection layer on the feature tensor corresponding to the updated feature map of the current frame to obtain a high-dimensional feature tensor of the current frame, and storing the high-dimensional feature tensor into a high-dimensional feature tensor sequence; when the length of the high-dimensional feature tensor sequence reaches the frame number value of the continuous frames, averaging the high-dimensional feature tensor in the high-dimensional feature tensor sequence to obtain the average features of the continuous frames;
wherein the acquiring modal fusion characteristics based on the continuous frame average characteristics, the observation track characteristics and the vehicle speed characteristics comprises: adding the continuous frame average characteristics, the observation track characteristics and the vehicle speed characteristics to obtain the modal fusion characteristics.
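Purely by way of illustration of the frame-by-frame image branch recited above, a minimal PyTorch-style sketch follows; the backbone network, the fraction of channels used for the time-sequence modeling, the spatial pooling before the full connection layer, the feature dimensions and the window length are all assumptions chosen only to make the example self-contained.

```python
import torch
import torch.nn as nn


class ProgressiveImageFeature(nn.Module):
    """Illustrative sketch: 2D-conv features per frame, simple temporal modeling
    against the previous frame's feature map, FC-based dimensionality reduction,
    and averaging once a full window of frames has been observed."""

    def __init__(self, backbone: nn.Module, feat_channels: int = 512,
                 out_dim: int = 128, window: int = 16):
        super().__init__()
        self.backbone = backbone              # any 2D convolutional backbone (assumed)
        self.fc = nn.Linear(feat_channels, out_dim)
        self.window = window                  # number of continuous observation frames (assumed)
        self.prev_map = None                  # feature map of the previous frame
        self.buffer = []                      # high-dimensional feature tensor sequence

    def forward(self, frame: torch.Tensor):
        fmap = self.backbone(frame)                    # (1, C, H', W') feature map of the current frame
        if self.prev_map is None:
            self.prev_map = torch.zeros_like(fmap)     # fill the missing previous-frame map with 0
        updated = fmap.clone()
        c = max(1, fmap.shape[1] // 8)                 # fraction of channels to associate (assumed)
        updated[:, :c] = self.prev_map[:, :c]          # temporal modeling via previous-frame channels
        self.prev_map = fmap                           # keep only the current frame's map
        pooled = updated.mean(dim=(2, 3))              # spatial pooling (assumption, not in the claim)
        feat = self.fc(pooled)                         # FC-based dimensionality reduction
        self.buffer.append(feat)
        if len(self.buffer) < self.window:
            return None                                # window not yet full
        avg = torch.stack(self.buffer).mean(dim=0)     # continuous frame average features
        self.buffer.pop(0)                             # slide the window for real-time use (assumed)
        return avg
```

The final fusion step of the claim then reduces to adding this average feature to the observation track feature and the vehicle speed feature, assuming all three features share the same dimension.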
2. The video-image-based pedestrian intention prediction method according to claim 1, characterized by further comprising:
deleting the feature map of the previous frame of the current frame, and keeping the feature map of the current frame.
3. The method according to claim 1, wherein, if the sequence element at the earliest time in the high-dimensional feature tensor sequence is a feature map filled with the value 0, the element is deleted before the high-dimensional feature tensor sequence is averaged to obtain the average features of the continuous frames.
4. The method according to claim 1, wherein the step of performing time-series modeling based on the feature map of the current frame and the feature map of the previous frame to update the feature map of the current frame comprises:
associating at least a part of the channels of the feature map of the previous frame of the current frame with the corresponding channels of the feature map of the current frame to perform the time sequence modeling; and
updating the feature map of the at least a part of the channels of the previous frame of the current frame to the corresponding positions of the feature map of the current frame to obtain the updated feature map of the current frame.
5. The method of claim 4, wherein the 2D convolution-based feature extraction is performed on the video image of the current frame through a 2D convolution backbone network, and the 2D convolution backbone network comprises one or more 2D convolution layers.
6. The video-image-based pedestrian intention prediction method according to claim 5, wherein the 2D convolutional backbone network is a 2D convolutional backbone network in which the time-series modeling is embedded, so that the 2D convolutional backbone network can perform the time-series modeling.
7. The video-image-based pedestrian intention prediction method of claim 6, wherein a limited number of time-series modeling positions are set in the 2D convolutional backbone network to balance the performance and the computational load of the 2D convolutional backbone network.
8. The pedestrian intention prediction method based on video images according to claim 1, wherein the acquiring of the observation trajectory feature of the pedestrian based on the observation trajectory sequence of the pedestrian comprises:
enhancing the observation track sequence by using a full-connection layer to obtain an enhanced input track set;
splicing the enhanced input track set in a time dimension to obtain an input track tensor;
inputting the input track tensor into a 1D convolution network, and extracting local short-term features; and
inputting the local short-term features into a multi-layer perceptron for encoding to obtain a global trajectory feature.
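As a non-limiting illustration of this trajectory branch, the sketch below follows the same sequence of operations; the embedding sizes, the 1D convolution kernel size, the pooling over time before the multi-layer perceptron, and the use of bounding boxes as trajectory points are assumptions.

```python
import torch
import torch.nn as nn


class TrajectoryEncoder(nn.Module):
    """Sketch: per-step FC enhancement, splicing along the time dimension,
    1D convolution for local short-term features, MLP for a global feature."""

    def __init__(self, in_dim: int = 4, embed_dim: int = 64, out_dim: int = 128):
        super().__init__()
        self.enhance = nn.Linear(in_dim, embed_dim)          # full connection layer enhancement
        self.conv1d = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1)
        self.mlp = nn.Sequential(nn.Linear(embed_dim, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, trajectory: torch.Tensor) -> torch.Tensor:
        # trajectory: (T, 4) observed bounding boxes of the pedestrian (assumed input)
        enhanced = self.enhance(trajectory)          # (T, embed_dim) enhanced input track set
        stacked = enhanced.t().unsqueeze(0)          # (1, embed_dim, T) input track tensor
        local = torch.relu(self.conv1d(stacked))     # local short-term features
        global_feat = self.mlp(local.mean(dim=2))    # (1, out_dim) global trajectory feature
        return global_feat
```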
9. The method according to claim 1, wherein the obtaining of the vehicle speed feature based on the vehicle speed sequence corresponding to the consecutive frames comprises:
splicing the speed sequence of the vehicle in a time dimension to obtain an input speed tensor; and
inputting the input speed tensor into a multi-layer perceptron for encoding to obtain the vehicle speed features.
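A correspondingly minimal sketch of the speed branch, with the window length and output dimension assumed, is:

```python
import torch
import torch.nn as nn


class SpeedEncoder(nn.Module):
    """Sketch: splice the ego-vehicle speed sequence along the time dimension
    and encode it with a multi-layer perceptron."""

    def __init__(self, window: int = 16, out_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(window, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, speeds: torch.Tensor) -> torch.Tensor:
        # speeds: (T,) one scalar speed per observed frame, with T equal to the window length
        return self.mlp(speeds.unsqueeze(0))    # (1, out_dim) vehicle speed feature
```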
10. The pedestrian intention prediction method based on video images according to claim 1, wherein the extracting of the intention feature characterizing the intention of pedestrians based on the semantic information of the modal fusion feature comprises:
inputting the modal fusion features into a full connection layer and mapping the modal fusion features into a two-dimensional tensor to represent the crossing-intention class and the non-crossing-intention class of the pedestrian.
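Illustratively, and with the input dimension and class ordering assumed, this mapping can be sketched as:

```python
import torch
import torch.nn as nn

# Sketch only: one full connection layer maps the modal fusion feature to two
# logits (crossing vs. not crossing). The input dimension of 128 is an assumption.
intent_head = nn.Linear(128, 2)

fused_feature = torch.randn(1, 128)          # placeholder modal fusion feature
logits = intent_head(fused_feature)          # (1, 2) two-dimensional intention tensor
predicted_class = logits.argmax(dim=1)       # index of the predicted intention class (labeling assumed)
```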
11. A pedestrian intention prediction apparatus based on a video image, comprising:
the system comprises a video image sequence acquisition module, a pedestrian detection module and a pedestrian detection module, wherein the video image sequence acquisition module acquires a video image sequence containing pedestrians on the basis of continuous frames containing the pedestrians in video image data acquired in real time;
the pedestrian observation track sequence acquisition module is used for acquiring an observation track sequence of a pedestrian based on continuous frames containing the pedestrian in video image data acquired in real time;
the vehicle speed sequence acquisition module is used for acquiring a vehicle speed sequence corresponding to the continuous frames based on vehicle speed data acquired in real time;
the image feature acquisition module extracts video image features frame by frame based on the video image sequence containing pedestrians and acquires continuous frame average features of the video image sequence;
the pedestrian observation track characteristic acquisition module acquires observation track characteristics of pedestrians on the basis of an observation track sequence of the pedestrians;
a vehicle speed feature acquisition module that acquires a vehicle speed feature based on a vehicle speed sequence corresponding to the continuous frames;
a multi-modal feature fusion module that obtains modal fusion features based on the continuous frame average features, the observed trajectory features, and the vehicle speed features; and
a pedestrian intention acquisition module which extracts intention features representing pedestrian intention at least based on semantic information of the modal fusion features;
the method for extracting the video image features of the pedestrians frame by frame based on the video image sequence containing the pedestrians and acquiring the continuous frame average features of the video image sequence comprises the following steps: performing feature extraction based on 2D convolution on a video image of a current frame to obtain a feature map of the current frame and a corresponding feature tensor; performing time sequence modeling based on the feature map of the current frame and the feature map of the previous frame of the current frame to update the feature map of the current frame and the corresponding feature tensor, and taking the updated feature tensor of the current frame as the video image feature of the current frame; if the feature map of the previous frame of the current frame does not exist, filling the feature map of the previous frame of the current frame with a value of 0; performing dimensionality reduction processing based on a full connection layer on the feature tensor corresponding to the updated feature map of the current frame to obtain a high-dimensional feature tensor of the current frame, and storing the high-dimensional feature tensor into a high-dimensional feature tensor sequence; when the length of the high-dimensional feature tensor sequence reaches the frame number value of the continuous frames, averaging the high-dimensional feature tensor in the high-dimensional feature tensor sequence to obtain the average features of the continuous frames;
wherein the acquiring modal fusion characteristics based on the continuous frame average characteristics, the observation trajectory characteristics and the vehicle speed characteristics comprises: adding the continuous frame average characteristics, the observation trajectory characteristics and the vehicle speed characteristics to obtain the modal fusion characteristics.
12. An electronic device, comprising:
a memory storing execution instructions; and
a processor that executes the execution instructions stored in the memory, so that the processor performs the video image-based pedestrian intention prediction method according to any one of claims 1 to 10.
13. A readable storage medium having stored therein executable instructions for implementing the pedestrian intention prediction method of any one of claims 1 to 10 when executed by a processor.
CN202210323532.3A 2022-03-29 2022-03-29 Pedestrian intention prediction method and device based on video image and electronic equipment Active CN114639125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210323532.3A CN114639125B (en) 2022-03-29 2022-03-29 Pedestrian intention prediction method and device based on video image and electronic equipment

Publications (2)

Publication Number Publication Date
CN114639125A (en) 2022-06-17
CN114639125B (en) 2022-09-16

Family

ID=81952338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210323532.3A Active CN114639125B (en) 2022-03-29 2022-03-29 Pedestrian intention prediction method and device based on video image and electronic equipment

Country Status (1)

Country Link
CN (1) CN114639125B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016472A (en) * 2020-08-31 2020-12-01 山东大学 Driver attention area prediction method and system based on target dynamic information
CN113392725A (en) * 2021-05-26 2021-09-14 苏州易航远智智能科技有限公司 Pedestrian street crossing intention identification method based on video data
CN113807298A (en) * 2021-07-26 2021-12-17 北京易航远智科技有限公司 Pedestrian crossing intention prediction method and device, electronic equipment and readable storage medium
CN114120439A (en) * 2021-10-12 2022-03-01 江苏大学 Pedestrian intention multi-task identification and track prediction method under self-vehicle view angle of intelligent vehicle

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11205082B2 (en) * 2019-10-08 2021-12-21 Toyota Research Institute, Inc. Spatiotemporal relationship reasoning for pedestrian intent prediction
US20210114627A1 (en) * 2019-10-17 2021-04-22 Perceptive Automata, Inc. Neural networks for navigation of autonomous vehicles based upon predicted human intents
CN112579824A (en) * 2020-12-16 2021-03-30 北京中科闻歌科技股份有限公司 Video data classification method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114639125A (en) 2022-06-17

Similar Documents

Publication Publication Date Title
US11256986B2 (en) Systems and methods for training a neural keypoint detection network
JP2019061658A (en) Area discriminator training method, area discrimination device, area discriminator training device, and program
CN112889071B (en) System and method for determining depth information in a two-dimensional image
US20190375261A1 (en) Method and device for determining a trajectory in off-road scenarios
CN114418895A (en) Driving assistance method and device, vehicle-mounted device and storage medium
CN113591872A (en) Data processing system, object detection method and device
CN108830131B (en) Deep learning-based traffic target detection and ranging method
CN113312983A (en) Semantic segmentation method, system, device and medium based on multi-modal data fusion
CN113392725B (en) Pedestrian street crossing intention identification method based on video data
CN111062405A (en) Method and device for training image recognition model and image recognition method and device
CN111967396A (en) Processing method, device and equipment for obstacle detection and storage medium
CN113807298B (en) Pedestrian crossing intention prediction method and device, electronic equipment and readable storage medium
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
CN115273032A (en) Traffic sign recognition method, apparatus, device and medium
Aditya et al. Collision Detection: An Improved Deep Learning Approach Using SENet and ResNext
Nejad et al. Vehicle trajectory prediction in top-view image sequences based on deep learning method
CN114639125B (en) Pedestrian intention prediction method and device based on video image and electronic equipment
JP6992099B2 (en) Information processing device, vehicle, vehicle control method, program, information processing server, information processing method
CN113723170A (en) Integrated hazard detection architecture system and method
CN117372991A (en) Automatic driving method and system based on multi-view multi-mode fusion
CN112446292B (en) 2D image salient object detection method and system
JP7420607B2 (en) Information processing device, information processing method, vehicle, information processing server, and program
CN115249269A (en) Object detection method, computer program product, storage medium, and electronic device
Wu et al. A survey of vision-based road parameter estimating methods
CN115240133A (en) Bus congestion degree analysis method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant