IT201800003102A1

IT201800003102A1 - A METHOD FOR PREDICTING A TRAJECTORY AND A VIEW FRUSTUM, CORRESPONDING SYSTEM AND COMPUTER PROGRAM PRODUCT

Info

Publication number: IT201800003102A1
Application number: IT102018000003102A
Authority: IT
Inventors: Fabio Galasso; Theodore Tsesmelis; Herbert Kaestle; Irtiza Hasan; Francesco Setti; Marco Cristani; Bue Alessio Del
Original assignee: Osram Gmbh; Fondazione St Italiano Tecnologia; Univ Degli Studi Di Verona
Priority date: 2018-02-27
Filing date: 2018-02-27
Publication date: 2019-08-27

Description

DESCRIPTION of the industrial invention entitled:

"A method for predicting a trajectory and a view frustum, corresponding system and computer program product"

TEXT OF THE DESCRIPTION

Field of the Invention

The present invention relates generally to the field of intelligent lighting, and in particular to a method for predicting a trajectory and a view frustum of a person, and to a corresponding system and computer program product.

Throughout this description, reference is made to various documents by giving in square brackets (e.g., [X]) a number that identifies the document in a "List of Cited Documents" at the end of the description.

Background of the Invention

Anticipating the trajectories that may occur in the future is important for several reasons: in computer vision, path prediction helps model the dynamics for target tracking [40, 47, 48, 59] and behavior understanding [3, 30, 33, 35, 47]; in robotics, autonomous systems should plan routes that avoid collisions and respect human proxemics [13, 21, 31, 36, 53, 62].

For self-driving systems, state-of-the-art path planning concepts consider only the current state and positions of the vehicles and pedestrians involved, which leads to an unnatural and artificial movement pattern of the entire traffic flow. Classical forecasting approaches [38] adopted Kalman filters [29], linear [37] or Gaussian [44, 45, 57, 58] regression models, autoregressive models [2] and time-series analysis [43]. These approaches ignore human-to-human interactions, which instead play an important role.

The consideration of other pedestrians in the scene and of their innate collision avoidance was pioneered by [24]. This initial seed was further developed by [34], [35] and [40], which introduced a data-driven, a continuous and a game-theoretic model, respectively. Notably, these approaches successfully employ essential cues for track prediction, such as human-to-human interaction and people's intended destination. More recent works encode human-to-human interactions in a "social" descriptor [4] or propose human attributes [60] for forecasting in crowds. More implicitly, other methods [3, 55] embed proxemic reasoning in the prediction by pooling hidden variables that represent the probable location of a pedestrian in a Long Short-Term Memory (LSTM) network. However, these earlier works, which extrapolate the movements observed in previous time intervals, are not adequate to solve the task, because human behavior is driven by environmental influence as well as by the simple dynamics of inertia.

Summary of the Invention

The present invention seeks to provide a method that predicts a trajectory of a person by considering a visual attention area driven by the head pose, that is, the area within the person's cone of attention. In particular, an object of the present invention is to provide a method which uses the head pose, together with positional information, as a cue for performing the prediction. The present invention also seeks to provide a corresponding system and computer program product.

According to a first aspect, the description provides a method comprising the steps of: receiving, from at least one image sensor, image signals of at least one person; detecting, from the image signals, a two-dimensional position of the at least one person; estimating, as a function of the image signals received from the at least one image sensor, a head pose of the at least one person; generating, from the estimated head pose, a view frustum of the at least one person; inputting the two-dimensional position and the view frustum into a recurrent neural network; and generating a predicted movement trajectory of the at least one person.

The view frustum (or viewing frustum) is the region of space that may appear on a computer screen, and thus corresponds roughly to the field of view of a notional camera. The designation "frustum" refers to the cone that represents the person's line of sight together with the typical opening angle of the human field of view.

The exact shape of the region covered by a view frustum may vary depending, e.g., on the optical system considered. In most applications it can be regarded as a frustum of a rectangular pyramid.

Preferably, the method also comprises generating a predicted view frustum of the at least one person. This makes it possible to reason about where people will probably look, providing a fine-grained level of long-term prediction not previously achieved in crowded scenarios. Preferably, the head pose of said at least one person is two-dimensional.

Preferably, the method also comprises passing the predicted movement trajectory and the predicted view frustum to a panel control server.

Preferably, the method also comprises the panel control server responding to the predicted movement trajectory and the predicted view frustum by controlling the panel to display and/or close video content.

Preferably, the method also comprises passing the predicted movement trajectory and the predicted view frustum to a traffic control system.

Preferably, the method also comprises the traffic control system responding to the predicted movement trajectory and the predicted view frustum by assigning a vehicle trajectory.

Preferably, the recurrent neural network is a Long Short-Term Memory (LSTM) network.

According to a further aspect, the description provides a system comprising:

- at least one image sensor for generating image signals of at least one person,

- a joint person detector and head pose estimator for detecting, from the image signals, a two-dimensional position of the at least one person; estimating, as a function of the image signals received from the at least one image sensor, a head pose of the at least one person; and generating, from the estimated head pose, a view frustum of the at least one person,

- a recurrent neural network coupled to the joint person detector and head pose estimator, for processing the two-dimensional position and the view frustum of said at least one person and generating a predicted movement trajectory of the at least one person.

Preferably, the recurrent neural network also generates a predicted view frustum of the at least one person.

Preferably, the system further comprises a panel control server which responds to the predicted movement trajectory and the predicted view frustum and controls the panel to display video content.

Preferably, the system further comprises a traffic control system which responds to the predicted movement trajectory and the predicted view frustum and assigns a vehicle trajectory.

According to yet a further aspect, the description provides a non-transitory computer-readable recording medium storing a computer program product which, when executed by a processor, causes a computer to perform the above methods.

Brief Description of the Drawings

Embodiments are explained by way of example with reference to the attached drawings, in which:

Fig. 1a shows a graphical explanation of a tracklet and a vislet anchor point;

Fig. 1b shows a graphical explanation of view frustum pooling;

Fig. 1c shows a graphical explanation of the angles used for the correlation analysis;

Fig. 2a shows the analysis relating the angle discrepancy ω between head pose and movement, the smoothed velocity of the pedestrian, and the average errors of different approaches on the UCY sequence;

Fig. 2b shows the correlation between the movement angle β and the head orientation angle α as the velocity varies;

Fig. 3a shows qualitative results of the MX-LSTM;

Fig. 3b shows a qualitative ablation study on a single MX-LSTM;

Fig. 4 shows an exemplary flow chart of an embodiment;

Fig. 5 shows an exemplary flow chart of another embodiment.

Detailed Description of the Invention

Example embodiments incorporating one or more aspects of the invention are described and illustrated in the drawings. It is to be understood that other embodiments may be utilized and that structural or logical changes may be made without departing from the scope of the present invention. The following detailed description, therefore, is not to be taken in a limiting sense. The same reference numerals are used to refer to the same or similar portions or components.

The present invention uses the head pose, together with positional information, as a cue for performing the prediction. In particular, tracklets (sequences of (x, y) coordinates) and vislets, i.e., reference points indicating the pan orientation of the head, are the input to the new MiXing LSTM (MX-LSTM), an LSTM-based model that learns how the tracklet and vislet streams are related and mixes them together in the LSTM hidden-state recursion by means of cross-stream full covariance matrices, optimized during backpropagation. The MX-LSTM can encode how head movements and people's dynamics are connected. For example, it captures the fact that rotating the head towards a particular direction may anticipate a trajectory drift with an acceleration (as in the case of a person leaving a group after a conversation). This is made possible by a new optimization of the LSTM parameters that uses a full Gaussian covariance through a log-Cholesky parameterization in backpropagation, which ensures positive semi-definite matrices. The vislet information is also used to build a scene context, i.e., where people are and how they move, via shared-state pooling as in [3, 55], which is here further improved by using the head pose to discard the people that an individual cannot see.

The MX-LSTM also predicts head orientations, making it possible to reason about where people will probably look and providing a fine-grained level of long-term prediction not previously achieved in crowded scenarios.

By adopting standard protocols for trajectory forecasting [3, 34, 40] and using head pose information provided by a standard head pose estimator [32], the MX-LSTM defines the new state of the art on both the UCY sequences (Zara01, Zara02 and UCY) and the TownCentre dataset. In particular, the MX-LSTM is able to forecast people when they move slowly, the Achilles' heel of all the other approaches proposed so far.

Here, we present the MX-LSTM, which is able to jointly predict the positions and head orientations of an individual thanks to the presence of two information streams: tracklets and vislets.

Given a subject i, a tracklet (see Fig. 1a) is formed by consecutive (x, y) positions on the ground plane, while a vislet is formed by anchor points, each indicating a reference point at a fixed distance r from the corresponding position, towards which the face is oriented. (The distance r has no influence in this invention and may take any value; here, for visualization purposes, we set it to 0.5.) In practice, a vislet is a fixed-length vector that originates at the position and whose direction implicitly indicates the pan angle of the head. In principle, the head orientation could be encoded directly as an angle. We prefer the vislet representation because it exhibits no discontinuity (between 360° and 0°) and because it is closer to the (x, y) position representation and therefore better suited to the vislet-position interaction.
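The continuity argument for vislets can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the pan angle is measured counter-clockwise from the x axis of the ground plane; the function names `vislet_anchor` and `pan_from_vislet` are hypothetical and do not come from the description.

```python
import math

def vislet_anchor(position, pan_deg, r=0.5):
    """Anchor point at distance r from `position` along the head pan angle.

    Assumption: the pan angle is measured counter-clockwise from the x axis.
    """
    x, y = position
    theta = math.radians(pan_deg)
    return (x + r * math.cos(theta), y + r * math.sin(theta))

def pan_from_vislet(position, anchor):
    """Recover the pan angle in degrees, in [0, 360), from a position and its anchor."""
    dx = anchor[0] - position[0]
    dy = anchor[1] - position[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0

# Two head orientations on either side of the 360/0 wrap-around.
p = (2.0, 3.0)
a_359 = vislet_anchor(p, 359.0)
a_001 = vislet_anchor(p, 1.0)

angle_gap = abs(359.0 - 1.0)          # 358 degrees apart as raw angle values
anchor_gap = math.dist(a_359, a_001)  # yet the two anchor points almost coincide
```

Two nearly identical head orientations differ by 358 as raw angle values, while their anchor points lie less than 0.02 apart, which is the property the vislet representation is chosen for.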

In the forecasting literature [3, 53, 59] it is assumed that the prediction follows an "observation" period in which ground-truth data are fed into the machine. Here, the observation tracklets and vislets are fed into the MX-LSTM, which mixes the two streams to learn their relationship and provides a joint prediction. In the experiments we evaluate the case in which the past vislets are ground truth, and also the "in context" case, in which the head pose is provided by a real head pose detector. In this way, the MX-LSTM requires no additional annotation with respect to previous approaches.

A single MX-LSTM is instantiated for each pedestrian i, accepting tracklets and vislets through two separate embedding functions:

[equations not reproduced]

where the embedding function φ consists of a linear projection through the embedding weights Wx and Wα into a D-dimensional vector, followed by a ReLU non-linearity, where D is the dimension of the hidden space.
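As a rough illustration of the two embedding functions, the following NumPy sketch projects a tracklet point and a vislet anchor point into D dimensions and applies a ReLU. The value D = 64, the random initialization, the absence of a bias term and the names `embed`, `W_x` and `W_a` are assumptions made for the example only.

```python
import numpy as np

def embed(v, W):
    """Linear projection of a 2-D input into D dimensions, followed by ReLU."""
    return np.maximum(W @ v, 0.0)

rng = np.random.default_rng(0)
D = 64                                     # hidden size (illustrative value)
W_x = rng.normal(scale=0.1, size=(D, 2))   # tracklet embedding weights (random stand-in)
W_a = rng.normal(scale=0.1, size=(D, 2))   # vislet embedding weights (random stand-in)

xy = np.array([2.0, 3.0])       # a tracklet point (x, y)
anchor = np.array([2.5, 3.0])   # the corresponding vislet anchor point

e_x = embed(xy, W_x)        # D-dimensional tracklet embedding
e_a = embed(anchor, W_a)    # D-dimensional vislet embedding
```

Keeping two weight matrices lets each stream learn its own projection before the streams are mixed in the recursion.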

The social pooling introduced in [3] is an effective way to let the LSTM capture how people move in a crowded scene while avoiding collisions. That work considers an isotropic area of interest around each pedestrian, in which the hidden states of the neighbors are considered, including those behind the pedestrian. According to our invention, we improve this module by using the vislet information to select which individuals to consider, building a view frustum of attention (VFOA), i.e., a triangle that originates at the pedestrian's position, is aligned with the vislet, and has an aperture given by the angle γ and a depth d; these parameters were learned by cross-validation on the training partition of the TownCentre dataset.

Our view frustum social pooling is an No × No × D tensor, in which the space around the pedestrian is divided into a grid of No × No cells as in [3], and in which the VFOA is positioned, acting as the new region of interest in which people are considered. Pooling is performed as follows:

[equation not reproduced]

where the indices m and n run over the No × No grid and the condition j ∈ VFOAi is satisfied when subject j is in the VFOA of subject i. The pooling vector is then embedded into a D-dimensional vector via

[equation not reproduced]
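The frustum-restricted pooling described above can be sketched as follows. This is an illustrative reading, not the patented implementation: the grid size, cell size, aperture and depth values are arbitrary, the VFOA is modelled as an isosceles triangle whose axis runs from the position to the vislet anchor, and the names `in_vfoa` and `vfoa_pool` are hypothetical.

```python
import numpy as np

def in_vfoa(p_i, a_i, p_j, gamma_deg, depth):
    """True if p_j lies in the triangle with apex p_i, axis p_i -> a_i,
    aperture gamma_deg and height `depth` along the axis."""
    u = (a_i - p_i) / np.linalg.norm(a_i - p_i)      # unit gaze direction
    rel = p_j - p_i
    along = rel @ u                                  # distance along the gaze axis
    across = abs(u[0] * rel[1] - u[1] * rel[0])      # distance from the gaze axis
    half_width = along * np.tan(np.radians(gamma_deg) / 2.0)
    return bool(0.0 < along <= depth and across <= half_width)

def vfoa_pool(i, pos, vis, hidden, n_o=4, cell=1.0, gamma_deg=40.0, depth=3.0):
    """Sum the hidden states of the neighbours seen by subject i into an
    n_o x n_o x D grid centred on subject i."""
    H = np.zeros((n_o, n_o, hidden.shape[1]))
    half = n_o * cell / 2.0
    for j in range(len(pos)):
        if j == i or not in_vfoa(pos[i], vis[i], pos[j], gamma_deg, depth):
            continue                                 # skip self and unseen people
        m, n = np.floor((pos[j] - pos[i] + half) / cell).astype(int)
        if 0 <= m < n_o and 0 <= n < n_o:
            H[m, n] += hidden[j]
    return H

# Four subjects, all looking along +x; subject 2 stands behind subject 0.
pos = np.array([[0.0, 0.0], [1.0, 0.1], [-1.0, 0.0], [1.5, -0.2]])
vis = pos + np.array([0.5, 0.0])
hidden = np.arange(12, dtype=float).reshape(4, 3)    # stand-in hidden states, D = 3
H = vfoa_pool(0, pos, vis, hidden)
```

In this toy scene only subjects 1 and 3 contribute to the pooled tensor of subject 0; the person behind is discarded, whereas an isotropic neighborhood would have included it.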

Finally, the MX-LSTM recursion equation is

[equation not reproduced]

In principle (although the formulation is modified later on), the hidden state is constrained to contain the parameters of a four-dimensional multivariate Gaussian distribution, namely its mean and the vectorized form of its covariance matrix ∑:

[equation not reproduced]

In practice, the cross-stream terms of ∑ contain the covariances between the (x, y) coordinate distributions of the tracklets and of the vislets. The distribution is then sampled to generate the joint prediction of the tracklet and vislet points. In other words, the MX-LSTM is able to forecast trajectories and head poses at the same time.
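A minimal sketch of the sampling step, assuming the four variables are ordered as (x, y) of the tracklet followed by (x, y) of the vislet anchor. The mean and covariance below are arbitrary stand-ins for the values the network would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters of the joint distribution over (x, y, a_x, a_y).
mu = np.array([2.0, 3.0, 2.5, 3.0])
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 1e-3 * np.eye(4)    # any symmetric positive definite matrix will do

# One draw gives the next position and the next vislet anchor together.
sample = rng.multivariate_normal(mu, Sigma)
next_xy, next_anchor = sample[:2], sample[2:]

# Off-diagonal 2 x 2 block: covariances between tracklet and vislet coordinates.
cross_cov = Sigma[:2, 2:]
```

If `cross_cov` were forced to zero, positions and head orientations would be sampled independently, which is exactly the coupling the full covariance is meant to retain.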

The weight parameters of the LSTM are found by minimizing the multivariate Gaussian negative log-likelihood for the i-th trajectory

[equation not reproduced]

where Tobs is the last time frame at which ground-truth data are observed by the LSTM, while Tobs + 1, ..., Tpred are the time frames for which prediction is required. The loss of Eq. 7 is minimized over all the training sequences and, to prevent overfitting, we include an l2 regularization term.

The optimization provides the weight matrices of the MX-LSTM, which in turn produce the set of Gaussian parameters, including the full covariance ∑. The latter is needed for the LSTM to encode the relations between the (x, y) coordinate distributions of the tracklets and of the vislets.
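The loss can be illustrated with a small NumPy sketch that sums the negative log-density of the ground-truth points over the prediction frames. The function names, the regularization weight and the toy inputs are assumptions for the example; a training loop would compute the same quantity in an automatic-differentiation framework.

```python
import numpy as np

def gaussian_nll(targets, mus, sigmas):
    """Sum over time of -log N(target_t | mu_t, Sigma_t) for k-dimensional points."""
    total = 0.0
    for z, mu, S in zip(targets, mus, sigmas):
        k = z.shape[0]
        diff = z - mu
        _, logdet = np.linalg.slogdet(S)           # log-determinant of the covariance
        maha = diff @ np.linalg.solve(S, diff)     # squared Mahalanobis distance
        total += 0.5 * (k * np.log(2.0 * np.pi) + logdet + maha)
    return total

def l2_penalty(weights, lam=1e-4):
    """l2 regularization term; lam is an illustrative value."""
    return lam * sum(float(np.sum(W ** 2)) for W in weights)

# Toy check: two prediction frames, predictions equal to the targets, unit covariance.
targets = [np.array([2.0, 3.0, 2.5, 3.0]), np.array([2.1, 3.1, 2.6, 3.1])]
mus = [t.copy() for t in targets]
sigmas = [np.eye(4), np.eye(4)]

nll = gaussian_nll(targets, mus, sigmas)
loss = nll + l2_penalty([np.ones((3, 2))])
```

With perfect means and unit covariance only the normalization constant remains, so `nll` equals 4 log(2π) for the two four-dimensional frames.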

In generale, la stima di una matrice di covarianza completa attraverso l'ottimizzazione di una funzione oggettiva (come il logaritmo di verosimiglianza dell'Eq. (7)) è un difficile problema numerico [41], poiché si deve garantire che la risultante stima si una covarianza appropriata, cioè, una matrice semi-definita positiva (p.s.d.). In general, the estimation of a complete covariance matrix through the optimization of an objective function (such as the log likelihood of Eq. (7)) is a difficult numerical problem [41], since it must be ensured that the resulting estimate si an appropriate covariance, that is, a positive semi-definite matrix (p.s.d.).

LSTMs with log-likelihood losses over Gaussian distributions have so far been limited to two dimensions, for single Gaussian distributions [3] or mixtures of Gaussian distributions [19], in which the 2 × 2 covariance matrices were obtained by simply optimizing the scalar correlation index $\rho$, which becomes the covariance term of $\Sigma$ with $\sigma_{12} = \rho\,\sigma_1 \sigma_2$ [19]. In higher-dimensional problems, pairwise correlation terms cannot be optimized and then used to build $\Sigma$, since the optimization of each correlation term is independent of the others, whereas positive definiteness is a simultaneous constraint on multiple variables [42]. This lack of coordination yields matrices that are far from being p.s.d., which in turn require correction procedures that project them onto the nearest p.s.d. matrix using, for example, a cost function based on the Frobenius norm [7, 25]. These procedures are expensive [41] and difficult to embed in the optimization process [12], especially for an LSTM, where the non-linearities due to the embedding weights make the analytical derivation hard to formulate. To date, no LSTM loss has involved full covariances of dimension > 2.
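The coordination problem described above can be made concrete with a small numerical sketch. Three pairwise correlations, each individually valid (absolute value below 1), are assembled into a 3 × 3 matrix that turns out not to be p.s.d.; a Frobenius-norm projection onto the p.s.d. cone (here done by clipping negative eigenvalues of the symmetric matrix) is then needed as a separate correction step. The helper names and the example values are hypothetical, and the projection is only one of the possible corrections mentioned in the text.

```python
import numpy as np


def correlation_matrix_from_pairs(n, pair_corr):
    """Assemble a symmetric matrix with unit diagonal from pairwise terms."""
    r = np.eye(n)
    for (i, j), rho in pair_corr.items():
        r[i, j] = r[j, i] = rho
    return r


def nearest_psd_frobenius(matrix):
    """Project a matrix onto the p.s.d. cone in the Frobenius norm.

    The matrix is symmetrized and its negative eigenvalues are set to zero.
    """
    sym = 0.5 * (matrix + matrix.T)
    eigvals, eigvecs = np.linalg.eigh(sym)
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T


# Each pairwise correlation is admissible on its own...
r = correlation_matrix_from_pairs(3, {(0, 1): 0.9, (0, 2): 0.9, (1, 2): -0.9})
# ...but jointly they give a matrix with a negative eigenvalue.
min_eig = np.linalg.eigvalsh(r).min()
r_fixed = nearest_psd_frobenius(r)
```

Note that the projected matrix is p.s.d. but no longer has a unit diagonal, which is one reason such corrections are awkward to place inside a training loop.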

Our solution involves an unconstrained optimization, in which a suitable parameterization of the variables to be learned enforces the positive semi-definite constraint. This is easier to express and considerably improves the convergence properties of the optimization algorithm.

In practice, we consider the family of Cholesky parameterizations [42]: let $\Sigma$ denote an $n \times n$ positive definite covariance matrix (in our case, $n = 4$). Since $\Sigma$ is symmetric, only $n(n+1)/2$ parameters are required to represent it. The Cholesky factorization is given by:

$\Sigma = L^{T} L, \quad (8)$

where $L$ is an $n \times n$ upper triangular matrix. In practice, the optimization process would focus on finding the $n(n+1)/2$ distinct scalar values of $L$, which then serve to recover the covariance through Eq. (8). A problem with the Cholesky factorization is its non-uniqueness: any matrix obtained by multiplying a subset of the rows of $L$ by -1 is also a valid factor, and this non-uniqueness makes it difficult for the optimization process to converge. To make $L$ unique, its diagonal elements must all be positive. To this end, the Log-Cholesky parameterization [42] assumes that the values found by the optimizer for the main diagonal are the logarithms of the diagonal values of $L$. Formally, the values found by the optimizer are denoted $\theta_L$.

In practice, after the estimation of the parameters $W_x$, $W_a$, $W_H$, $W_{LSTM}$, $W_O$, the values of $\theta_L$ are extracted through Eq. (9) [...]. Subsequently, the diagonal values of $\theta_L$ are exponentiated to form $L$, and $\Sigma$ is obtained through Eq. (8).
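The Log-Cholesky step just described can be sketched as follows: an unconstrained vector of $n(n+1)/2$ values is placed into an upper triangular matrix, its diagonal is exponentiated so that it is strictly positive, and $\Sigma = L^{T}L$ is formed as in Eq. (8). The function name and the row-major ordering of the entries of $\theta_L$ are assumptions made for the example; the text does not fix an ordering.

```python
import numpy as np


def log_cholesky_to_covariance(theta, n=4):
    """Map an unconstrained vector to a positive definite covariance.

    theta: n(n+1)/2 free values; the entries that fall on the diagonal of L
           are interpreted as logarithms (assumed row-major upper-triangle order).
    """
    theta = np.asarray(theta, dtype=float)
    if theta.size != n * (n + 1) // 2:
        raise ValueError("theta must contain n(n+1)/2 values")
    L = np.zeros((n, n))
    L[np.triu_indices(n)] = theta          # fill the upper triangle
    diag = np.diag_indices(n)
    L[diag] = np.exp(L[diag])              # positive diagonal makes L unique
    return L.T @ L                         # Eq. (8)
```

Because any real-valued `theta` maps to a valid covariance, a gradient-based optimizer can update the values freely, without a projection step.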

To date, no quantitative study has focused on how knowledge of the head pose affects trajectory forecasting. Here, we present a preliminary analysis of common forecasting datasets, with an emphasis on head pose, which motivated the design of the MX-LSTM.

In particular, we focus on the UCY dataset [34], composed of the Zara01, Zara02 and UCY sequences, which provides annotations of the head pose pan angle for all pedestrians. We also consider the Town Center dataset [6], on which we manually annotated the head pose using the same annotation protocol as [34]. We find the following facts (the analysis presented here is on the UCY sequence and is similar to what we observed on the other sequences):

1) People often do not look where they are stepping. To show this, for each trajectory of T frames (omitting the trajectory indices) we compute all the $\alpha_t$, $\beta_t$ and $\omega_t$ of Fig. 1c. $\alpha_t$ is the pan angle of the head pose with respect to a given reference system; similarly, $\beta_t$ is the angle of motion, and $\omega_t$ is the discrepancy between the two. For each trajectory we compute the mean of $\omega_t$ over its frames, denoted $\omega$. On the multiple y-axis plot of Fig. 2a we show the value of $\omega$ (in degrees) for all sequences, in increasing order. We omit from the figure those sequences whose speed is below 0.45 m/s: in those cases the individual is essentially still and the motion vector $x_{t+1} - x_t$ carries little or no meaning, so the angle $\beta_t$ cannot be considered. The value of $\omega$ ranges from 0.02° to 72°. We conclude that in 25% of the video sequences the misalignment between the head pose and the walking direction is greater than 20°.

2) Head pose and movement are (statistically) correlated. In the same figure we report the speed curve, where each y value gives the average speed of the i-th trajectory, ordered along the x axis. For readability, the curve has been smoothed with a moving-average filter of size 10. As shown, there is an inverse proportionality relationship between $\omega$ and the speed of the pedestrian: the head is better aligned with the direction of motion when the speed is higher, and when the person slows down the head pose is markedly misaligned. The relationship is statistically significant: we consider the Pearson circular correlation coefficient [28] between the angles $\alpha_t$ and $\beta_t$, computed over all frames of the sequences considered for that figure. Over the whole data, the correlation is 0.83 (p-value < 0.01). We also examine how the correlation changes with speed: Fig. 2b shows the correlation values against speed, computed by pooling the angles $\alpha_t$ and $\beta_t$ around a given speed value; in particular, each correlation value at speed $\tau$ was computed over all samples in the interval $[\tau - 0.01R, \tau + 0.01R]$, where $R$ is the whole speed range. All reported values are statistically significant (p-value < 0.01). The plot clearly shows that the correlation is lower at low speeds, where the discrepancy between the angles $\alpha_t$ and $\beta_t$ is generally higher. The challenge here is to verify whether this discrepancy can be learned by the MX-LSTM to improve forecasting. More intriguingly, the MX-LSTM should learn how these relationships evolve over time, which has not yet been examined, since the analysis carried out so far treats each time instant as independent of the others.

3) Prediction errors are generally higher when the pedestrian speed is lower. Fig. 2 reports the Mean Average Displacement (MAD) error [40] of the following approaches: SF [59], LTA [53], Vanilla LSTM and Social LSTM [3], together with our MX-LSTM approach. In general, lower speeds lead to larger errors, since the behavior of people who walk very slowly becomes less predictable, for physical reasons (less inertia) but also behavioral ones (people walking slowly are usually engaged in other activities, such as talking to other people or looking around). In contrast, the MX-LSTM is shown here to perform very well even at lower speeds, reaching errors very close to zero for static people.
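The per-trajectory discrepancy used in point 1) can be sketched as follows. The motion angle $\beta_t$ is taken from the displacement between consecutive positions, $\omega_t$ is the absolute difference between head pan angle and motion angle wrapped to [0°, 180°], and trajectories whose average speed is below 0.45 m/s are skipped. The function name, the frame interval `dt` and the choice of comparing $\alpha_t$ with the displacement starting at frame $t$ are assumptions made for the example.

```python
import numpy as np


def mean_head_motion_discrepancy(positions, head_pan_deg, dt, min_speed=0.45):
    """Mean angular discrepancy (degrees) between head pan and motion direction.

    positions:    (T, 2) ground-plane coordinates in meters.
    head_pan_deg: (T,) head pan angles in degrees, same reference system.
    dt:           time between frames in seconds (assumed known).
    Returns None when the trajectory is too slow for beta_t to be meaningful.
    """
    positions = np.asarray(positions, dtype=float)
    alpha = np.asarray(head_pan_deg, dtype=float)[:-1]
    step = np.diff(positions, axis=0)              # x_{t+1} - x_t
    speed = np.linalg.norm(step, axis=1) / dt
    if speed.mean() < min_speed:
        return None
    beta = np.degrees(np.arctan2(step[:, 1], step[:, 0]))
    # Wrap the signed difference to (-180, 180] before taking |.|.
    omega = np.abs((alpha - beta + 180.0) % 360.0 - 180.0)
    return float(omega.mean())
```

Sorting the returned values over all trajectories gives a curve of the kind described for Fig. 2a.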

In summary, head pose is correlated with movement, especially when people move quickly. When people move slowly, the correlation is weaker but still significant, prediction errors are larger, and the head pose is markedly misaligned with the movement. These facts justify and motivate our goal with the MX-LSTM: to capture head pose information jointly with movement and to use it for better forecasting.
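For readers who want to reproduce the kind of correlation analysis summarized above, the following sketch computes a circular correlation coefficient between two sets of angles. It uses the sine-based form commonly attributed to Jammalamadaka and SenGupta; whether this is exactly the coefficient of reference [28] is an assumption, and the function names are hypothetical.

```python
import numpy as np


def circular_mean(angles_rad):
    """Mean direction of a set of angles (radians)."""
    return np.arctan2(np.sin(angles_rad).sum(), np.cos(angles_rad).sum())


def circular_correlation(alpha_deg, beta_deg):
    """Circular correlation in [-1, 1] between two angle samples (degrees)."""
    a = np.radians(np.asarray(alpha_deg, dtype=float))
    b = np.radians(np.asarray(beta_deg, dtype=float))
    sa = np.sin(a - circular_mean(a))
    sb = np.sin(b - circular_mean(b))
    return float((sa * sb).sum() / np.sqrt((sa ** 2).sum() * (sb ** 2).sum()))
```

Restricting the samples to those whose speed lies within a window around a value $\tau$ and calling `circular_correlation` on each window would give a speed-dependent curve like the one described for Fig. 2b.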

Next, both quantitative and qualitative experiments are presented. The quantitative results validate the proposed MX-LSTM model, setting the new state of the art for trajectory forecasting; results are also provided for an ablation study showing the importance of the different parts of the MX-LSTM. Finally, we present the very first results on head pose forecasting. The qualitative results reveal the interplay between tracklets and vislets learned by the MX-LSTM.

Our model is evaluated against all published approaches whose code has been made publicly available: the Social Force (SF) model [59], Linear Trajectory Avoidance (LTA) [40], Vanilla LSTM and Social LSTM (S-LSTM) [3].

The experiments follow the widely used evaluation protocol of [40], in which the algorithm first observes 8 ground-truth (GT) "observation" frames of a trajectory and then predicts the following 12. Three models were trained for the three UCY sequences: for each, we used two sequences as training data and then tested on the third. For the Town Center dataset the model was trained and tested on the respective provided sets. The grid for the social pooling (Eq. (3)) has $N_o \times N_o$ cells with $N_o = 32$. The aperture angle of the view frustum was cross-validated on the training partition of Town Center and kept fixed for the remaining trials ($\gamma = 40°$), while the depth $d$ is simply bounded by the social pooling grid. Trajectory prediction performance is analyzed with the Mean Average Displacement (MAD) error (Euclidean distance between predicted and GT points, averaged over the sequence) and the Final Average Displacement (FAD) error (distance between the last predicted point and the corresponding GT point) [40].
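The two metrics defined above can be sketched for a single trajectory as follows. The function name and the array layout (one row per predicted frame, coordinates in meters) are assumptions made for the example.

```python
import numpy as np


def displacement_errors(predicted, ground_truth):
    """Return (MAD, FAD) in the units of the input coordinates.

    predicted, ground_truth: arrays of shape (T, 2), e.g. T = 12 frames.
    MAD: Euclidean distance between predicted and GT points, averaged over frames.
    FAD: Euclidean distance between the last predicted point and the last GT point.
    """
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    distances = np.linalg.norm(predicted - ground_truth, axis=1)
    return float(distances.mean()), float(distances[-1])
```

Dataset-level scores would then be obtained by averaging these per-trajectory values over all test trajectories.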

The results are reported in Table 1. The MX-LSTM outperforms the state-of-the-art methods on every single sequence and with both metrics, with an average improvement of 32.7%. The highest relative gain is achieved on the Zara02 dataset, where complex non-linear paths are mostly caused by standing conversational groups and by people walking close to them while avoiding collisions. People who slow down and look at shop windows also pose a challenge. As shown in Fig. 2, slow-moving and interacting pedestrians cause problems for the competing methods, while the MX-LSTM clearly overcomes these shortcomings, indicating a better model.

Note that different methods rely on different input data: both SF and LTA require the destination point of each individual, while SF also requires annotations about social groups; the MX-LSTM requires the head pose of each individual for the first 8 frames, but this can be obtained from a head pose estimator. This motivates our next experiment: we automatically estimate the head bounding box given the feet positions on the floor plane, assuming that people are 1.80 m tall. We then apply the head pose estimator of [32], which provides continuous angles that can be used as input to our approach, now called "MX-LSTM-HPE". As shown by the scores in Table 1, the MX-LSTM-HPE is not affected by small errors in the input head pose, with an average performance drop of only 5%. Note that the MX-LSTM-HPE still outperforms all competing methods on all datasets even with noisy estimated head pose information.

How accurate must the head pose estimate be for the MX-LSTM-HPE to perform convincingly, for example to outperform the Social LSTM? We answer this question by corrupting the true head pose with additive Gaussian noise $\mathcal{N}(\alpha_t, \sigma)$, where $\alpha_t$ is the correct head pose and $\sigma$ the standard deviation.

The MX-LSTM-HPE outperforms the Social LSTM up to a noise level of 24°.
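A noise-sensitivity test of this kind can be sketched by perturbing the ground-truth pan angles with zero-mean Gaussian noise of a chosen standard deviation before feeding them to the model. The function name, the use of degrees and the wrapping of the result to [0°, 360°) are assumptions made for the example.

```python
import numpy as np


def corrupt_head_pose(pan_deg, sigma_deg, rng=None):
    """Add zero-mean Gaussian noise (std sigma_deg) to head pan angles.

    pan_deg:   array of ground-truth pan angles in degrees.
    sigma_deg: noise standard deviation in degrees (e.g. swept from 0 to 24).
    rng:       optional numpy Generator, for reproducible experiments.
    """
    rng = np.random.default_rng() if rng is None else rng
    pan = np.asarray(pan_deg, dtype=float)
    noisy = pan + rng.normal(0.0, sigma_deg, size=pan.shape)
    return np.mod(noisy, 360.0)   # keep angles in [0, 360)
```

Sweeping `sigma_deg` and re-running the evaluation at each level would show where the forecasting error crosses that of a baseline.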

In addition to the models from the literature, we test three variations of the MX-LSTM to capture the net contributions of the different parts that characterize our approach.

Block-Diagonal MX-LSTM (BD-MX-LSTM): this variant serves to highlight the importance of estimating full covariances in order to capture the interplay between tracklets and vislets. Essentially, the approach estimates two bidimensional covariances $\Sigma_x$ and $\Sigma_a$, for the trajectory and the vislet modeling respectively, without capturing the cross-stream covariances. (Each 2 × 2 covariance is estimated using two variances $\sigma_1$, $\sigma_2$ and a correlation term $\rho$, as presented in [19], Eqs. (24) and (25).) The equations that differ with respect to the MX-LSTM will be provided as supplementary material.

MX-LSTM Without Frustum: this variation of the MX-LSTM uses the social pooling as in [3], in which the area of interest where people's hidden states are pooled into the social tensor extends all around the individual. In other words, no frustum selecting the people to be considered is used here.

Single MX-LSTM: in this case no social pooling is taken into account, so the embedding operation of Eq. (4) is absent and the weight matrix $W_H$ vanishes. In practice, this variant learns independent models for each person, each considering the tracklet and vislet points.
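The difference between the block-diagonal variant described above and a full 4 × 4 covariance can be sketched as follows. Each 2 × 2 block is built from two scale parameters and one correlation term, and the two blocks are placed on the diagonal with the cross-stream terms fixed at zero. The function names are hypothetical, and $\sigma_1$, $\sigma_2$ are treated here as standard deviations, so that the off-diagonal term is $\rho\,\sigma_1\sigma_2$; this reading is an assumption.

```python
import numpy as np


def covariance_2x2(sigma1, sigma2, rho):
    """2 x 2 covariance from two scale parameters and a correlation term."""
    cross = rho * sigma1 * sigma2
    return np.array([[sigma1 ** 2, cross],
                     [cross, sigma2 ** 2]])


def block_diagonal_covariance(cov_x, cov_a):
    """4 x 4 covariance with tracklet and vislet blocks and no cross terms."""
    full = np.zeros((4, 4))
    full[:2, :2] = cov_x   # tracklet block
    full[2:, 2:] = cov_a   # vislet block
    return full
```

The zero off-diagonal blocks are exactly the tracklet-vislet covariances that the full model estimates and this variant ignores.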

Table 1, last three columns, reports numerical results for all the MX-LSTM simplifications on all datasets. The main facts that emerge are: 1) the largest variations occur on the Zara02 sequence, where the MX-LSTM doubles the performance of the worst approach (Single MX-LSTM); 2) the worst performer is in general the Single MX-LSTM, which shows that social reasoning is indeed necessary; 3) social reasoning is systematically improved with the help of the vislet-based view frustum; 4) full covariance estimation plays a role in further reducing the error, which is already small thanks to the adoption of vislets.

Table 1. Mean and Final Average Displacement errors (in meters) for all methods on all datasets. The first 5 columns are the comparative methods and our proposed model, trained and tested with GT annotations. MX-LSTM-HPE is our model tested with the output of a real head pose estimator [32]. The last 3 columns are variations of our approach, trained and tested on GT annotations.

Table 2. Mean angular error (in degrees) for the state-of-the-art head pose estimator [32] and for the MX-LSTM model fed with GT annotations and with estimated values (MX-LSTM-HPE).

To summarize the results so far, having vislets as input definitely increases trajectory forecasting performance, even when the vislets are estimated with noise. Vislets should be used to capture social interactions through the social pooling, by building a view frustum that tells which people are currently being observed by each individual. All these features are handled efficiently by the MX-LSTM: the training time is in fact the same as for an LSTM with social pooling.

As with the trajectories, we also provide a forecast of the head pose of each individual at each frame, which is a distinctive attribute of our method. We evaluate the performance of this estimate in terms of the mean angular error $e_\alpha$, which is the mean absolute difference between the estimated pose (angle $\alpha_t$ in Fig. 1c) and the annotated GT.
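The mean angular error described above can be sketched as follows. The function name is hypothetical, and wrapping the difference so that, for example, 350° and 10° are 20° apart is an assumption; the text only states that the error is a mean absolute difference.

```python
import numpy as np


def mean_angular_error(predicted_deg, ground_truth_deg):
    """Mean absolute angular difference in degrees, with wrap-around."""
    diff = (np.asarray(predicted_deg, dtype=float)
            - np.asarray(ground_truth_deg, dtype=float))
    wrapped = (diff + 180.0) % 360.0 - 180.0   # signed difference in [-180, 180)
    return float(np.abs(wrapped).mean())
```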

Table 2 shows numerical results for the static head pose estimator [32] (HPE), the MX-LSTM using GT head poses, and the MX-LSTM fed with the HPE output during the observation period (MX-LSTM-HPE). In all cases our forecasting output is comparable to that of the HPE, although in our case we do not use appearance cues, that is, we do not look at the images at all. In the case of Zara01, the MX-LSTM is even better than the static prediction, which demonstrates the forecasting power of our model. In our opinion this is because in this sequence the trajectories are mostly very linear and fast, and the heads are mostly aligned with the direction of motion. When we feed estimates to the MX-LSTM model during the observation period, the angular error increases, as expected. Nevertheless, the error is surprisingly limited.

Fig. 3 shows qualitative results on the Zara02 dataset, which proved to be the most complex scenario in the quantitative experiments. Fig. 3a presents MX-LSTM results: a group scenario is considered, with the attention focused on the girl in the lower left corner. In the left column, the ground-truth prediction vislets show that the girl is conversing with the group members, with near-zero movement and an oscillating head pan angle. The S-LSTM incorrectly predicts that the girl leaves the group. This error confirms the problem of the competing methods in forecasting the motion of slow-moving or static people, as discussed above and further confirmed by the results of the quantitative experiments. In the central column, the observation sequence provided to the MX-LSTM (almost static, with oscillating vislets) is shown. The related prediction shows oscillating vislets and almost no movement, confirming that the MX-LSTM has learned this particular social behavior. If we provide the MX-LSTM with an artificial observation sequence with no trajectory but with the vislets oriented toward the west (third column), where no person is present, the MX-LSTM predicts a trajectory departing from the group.

The two rows of Fig. 3b analyze the Single MX-LSTM, in which no social grouping is taken into account. Here each pedestrian is therefore unaffected by the surrounding people, and the relationship between the tracklets and the vislets in the prediction can be observed without any confounding factor. The first row of Fig. 3b shows three situations in which the vislets of the observation sequence are artificially made to point north, so that they are not aligned with the trajectory. In this case the Single MX-LSTM predicts a decelerating trajectory that deviates toward the north, especially in the second and third panels. If the observation has the original, unmodified vislets (barely visible because they are aligned with the trajectory), the resulting trajectory behaves differently and is closer to the GT. The second row is similar, with the observation vislets artificially made to point south. The prediction with the modified vislets is also shown. The only difference is in the lower left panel: here the observation vislets pointing south agree with the direction of motion, so that the resulting predicted trajectory does not decelerate as in the other cases, but accelerates toward the south.

Fig. 4 shows an exemplary flow chart of one embodiment. As illustrated in Fig. 4, at least one camera 100 installed in a shopping mall captures images and sends them to a detector 101. The detector is a joint person detector and head pose estimator, as described in [23], which processes the labeled images and outputs the spatial positions and view frustums of at least one person in the images. The spatial position and the view frustum of a person enter, as two inputs, the LSTM system 102, also referred to as the MX-LSTM system. The LSTM system 102 processes these signals and predicts the person's future trajectory and view frustum. The predicted position and view frustum of the person are then passed to the advertising panel control server 103, which uses the information about where the person will be and where they will look to change the advertising panel 104 at run time. For example, the monitor is switched on only when a person is going to look at it, which saves energy. Combined with complex face detection technology, this system could also be used to select appropriate content to present on the screen for a particular person.
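
The decision taken by the panel control server can be sketched as a simple geometric test. This is a minimal illustration, not the server logic of the embodiment: it assumes a two-dimensional view frustum modeled as a circular sector with its apex at the predicted position, and the aperture angle, the depth, and the function names are hypothetical values chosen for the example.

```python
import math


def panel_in_frustum(position, pan_deg, panel, aperture_deg=60.0, depth=5.0):
    """Return True if the panel point lies inside the 2D view frustum.

    The frustum is modeled (as an assumption) as a sector with apex at
    `position`, centered on the head pan angle `pan_deg`, with total opening
    `aperture_deg` and maximum viewing distance `depth`.
    """
    dx = panel[0] - position[0]
    dy = panel[1] - position[1]
    distance = math.hypot(dx, dy)
    if distance > depth:
        return False
    if distance == 0.0:
        return True
    bearing = math.degrees(math.atan2(dy, dx))
    # Signed angular offset between gaze direction and panel, in [-180, 180).
    offset = (bearing - pan_deg + 180.0) % 360.0 - 180.0
    return abs(offset) <= aperture_deg / 2.0


def panel_should_be_on(predicted_steps, panel, **frustum_args):
    """Switch the panel on if any predicted (position, pan angle) step sees it."""
    return any(panel_in_frustum(pos, pan, panel, **frustum_args)
               for pos, pan in predicted_steps)
```

Here `predicted_steps` stands for the sequence of predicted positions and head pan angles produced by the forecasting stage; keeping the monitor off whenever the function returns False corresponds to the energy-saving behavior described above.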

Fig. 5 shows an exemplary flow chart of another embodiment. As in the embodiment above, the camera 203 captures images of at least one person, the joint person detector and head pose estimator 202 outputs the spatial position and head orientation information, and the LSTM 201 uses this information to predict the person's future position and view frustum. Based on the information about where a person will be, a traffic flow system 200 for autonomous vehicles adjusts the vehicle's path accordingly. In this embodiment, the predicted trajectories can be used to avoid potential collisions with pedestrians.
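
How a predicted trajectory might be used to flag a potential collision can be sketched as follows. This is only an illustrative check under stated assumptions: the pedestrian and vehicle paths are sampled at the same time steps, a conflict is declared when the two are closer than a safety radius, and both the radius and the function name are hypothetical.

```python
import math


def first_conflict_step(pedestrian_path, vehicle_path, safety_radius=1.5):
    """Return the first time step at which the predicted pedestrian position
    and the planned vehicle position are closer than `safety_radius`,
    or None if no such step exists. Paths are lists of (x, y) tuples sampled
    at the same time steps (an assumption of this sketch)."""
    for step, (ped, veh) in enumerate(zip(pedestrian_path, vehicle_path)):
        if math.hypot(ped[0] - veh[0], ped[1] - veh[1]) < safety_radius:
            return step
    return None
```

A traffic flow system could replan the vehicle path whenever the returned step is not None; how the replanning itself is done is outside the scope of this sketch.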

The main contributions of this invention are as follows: we show that trajectory forecasting can be considerably improved by taking head pose estimates into account; we propose a new LSTM architecture, the MX-LSTM, which exploits positional (tracklet) and orientation (vislet) information through an optimization of d-variate Gaussian parameters that include full covariances with d > 2; we motivate the need for the MX-LSTM by showing that head poses are correlated with trajectories even at low speeds, where most forecasting approaches fail; we define a new type of social grouping, in the sense of [3, 55], that exploits the vislet information; we report state-of-the-art forecasting results on several datasets; and we present head pose forecasting results for the MX-LSTM, which show new capabilities for long-term behavior analysis.
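
Optimizing a d-variate Gaussian with a full covariance requires that the covariance stay symmetric and positive definite throughout training. One way to achieve this is an unconstrained log-Cholesky parametrization of the kind discussed in [41]. The sketch below illustrates that idea under the assumption that such a parametrization is used; the function names and the parameter layout are hypothetical and are not taken from the description.

```python
import math


def cholesky_from_params(params, d):
    """Build a lower-triangular factor L (d x d) from d*(d+1)/2 unconstrained
    numbers, listed row by row. Diagonal entries are exponentiated, so
    Sigma = L L^T is always symmetric positive definite."""
    if len(params) != d * (d + 1) // 2:
        raise ValueError("expected d*(d+1)/2 parameters")
    L = [[0.0] * d for _ in range(d)]
    k = 0
    for i in range(d):
        for j in range(i + 1):
            L[i][j] = math.exp(params[k]) if i == j else params[k]
            k += 1
    return L


def gaussian_nll(x, mean, L):
    """Negative log-likelihood of x under N(mean, L L^T)."""
    d = len(x)
    # Solve L z = (x - mean) by forward substitution.
    z = []
    for i in range(d):
        s = x[i] - mean[i] - sum(L[i][j] * z[j] for j in range(i))
        z.append(s / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(d))
    quad = sum(v * v for v in z)
    return 0.5 * (d * math.log(2.0 * math.pi) + log_det + quad)
```

For d = 4 the covariance is described by 10 unconstrained numbers, which a network output layer could produce directly alongside the 4 mean values; the loss for one time step would then be `gaussian_nll` evaluated at the observed target.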

List of Cited Documents

[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.

[2] H. Akaike. Fitting autoregressive models for prediction. Annals of the institute of Statistical Mathematics, 21(1):243–247, 1969.

[3] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In CVPR, 2016.

[4] A. Alahi, V. Ramanathan, and L. Fei-Fei. Socially-aware large-scale crowd forecasting. In CVPR, 2014.

[5] S. O. Ba and J.-M. Odobez. A probabilistic framework for joint head tracking and pose estimation. In ICPR, 2004.

[6] B. Benfold and I. Reid. Stable multi-target tracking in real-time surveillance video. In CVPR, 2011.

[7] S. Boyd and L. Xiao. Least-squares covariance matrix adjustment. SIAM Journal on Matrix Analysis and Applications, 27(2):532–546, 2005.

[8] J. F. Caminada and W. J. M. van Bommel. Philips engineering report 43, 1980.

[9] C. Chen and J.-M. Odobez. We are not contortionists: Coupled adaptive learning for head and body orientation estimation in surveillance video. In CVPR, 2012.

[10] H. Coskun, F. Achilles, R. Di Pietro, N. Navab, and F. Tombari. Long short-term memory kalman filters: Recurrent neural estimators for pose regularization. In ICCV, 2017.

[11] N. Davoudian and P. Raynham. What do pedestrians look at at night? Lighting Research and Technology, 44(4):438–448, 2012.

[12] J. E. Dennis Jr and R. B. Schnabel. Numerical methods for unconstrained optimization and nonlinear equations. SIAM, 1996.

[13] A. D. Dragan, N. D. Ratliff, and S. S. Srinivasa. Manipulation planning with goal sets using constrained trajectory optimization. In ICRA, 2011.

[14] S. Fotios, J. Uttley, C. Cheal, and N. Hara. Using eye-tracking to identify pedestrians' critical visual tasks, Part 1. Dual task approach. Lighting Research & Technology, 47(2):133–148, 2015.

[15] S. Fotios, J. Uttley, and B. Yang. Using eye-tracking to identify pedestrians' critical visual tasks. Part 2. Fixation on pedestrians. Lighting Research & Technology, 47(2):149–160, 2015.

[16] T. Foulsham, E. Walker, and A. Kingstone. The where, what and when of gaze allocation in the lab and the natural environment. Vision research, 51(17):1920–1931, 2011.

[17] N. Gourier, J. Maisonnasse, D. Hall, and J. L. Crowley. Head pose estimation on low resolution images. In CLEAR, 2006.

[18] A. Graves. Supervised sequence labelling with recurrent neural networks, volume 385. Springer, 2012.

[19] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[20] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.

[21] E. T. Hall. The hidden dimension. Doubleday & Co, 1966.

[22] Y. D. B. Z. Hang Su, Jun Zhu. Forecast the plausible paths in crowd scenes. In IJCAI, 2017.

[23] I. Hasan, T. Tsesmelis, F. Galasso, A. Del Bue, and M. Cristani. Tiny head pose classification by bodily cues. In ICIP, 2017.

[24] D. Helbing and P. Molnar. Social force model for. Physical review E, 51(5):4282, 1995.

[25] N. J. Higham. Computing a nearest symmetric positive semidefinite matrix. Linear algebra and its applications, 103:103–118, 1988.

[26] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[27] J. Intriligator and P. Cavanagh. The spatial resolution of visual attention. Cognitive psychology, 43(3):171–216, 2001.

[28] S. R. Jammalamadaka and A. Sengupta. Topics in circular statistics, volume 5. World Scientific, 2001.

[29] R. E. Kalman et al. A new approach to linear filtering and prediction problems. ASME Journal of Basic Engineering, 1960.

[30] K. Kitani, B. Ziebart, J. Bagnell, and M. Hebert. Activity forecasting. In ECCV, 2012.

[31] M. Kuderer, H. Kretzschmar, C. Sprunk, and W. Burgard. Feature-based prediction of trajectories for socially compliant navigation. In Robotics: science and systems, 2012.

[32] D. Lee, M.-H. Yang, and S. Oh. Fast and accurate head pose estimation via random projection forests. In ICCV, 2015.

[33] N. Lee and K. M. Kitani. Predicting wide receiver trajectories in american football. In WACV, 2016.

[34] A. Lerner, Y. Chrysanthou, and D. Lischinski. Crowds by example. In Computer Graphics Forum, 2007.

[35] W.-C. Ma, D.-A. Huang, N. Lee, and K. M. Kitani. Forecasting interactive dynamics of pedestrians with fictitious play. In CVPR, 2017.

[36] J. Mainprice, R. Hayne, and D. Berenson. Goal set inverse optimal control and iterative replanning for predicting human reaching motions in shared workspaces. IEEE Trans. on Robotics, 32(4):897–908, 2016.

[37] P. McCullagh and J. A. Nelder. Generalized linear models, no. 37 in monograph on statistics and applied probability, 1989.

[38] B. T. Morris and M. M. Trivedi. A survey of vision-based trajectory learning and analysis for surveillance. IEEE Trans. on Circuits and Systems for Video Technology, 18(8):1114–1127, 2008.

[39] A. E. Patla and J. N. Vickers. How far ahead do we look when required to step on specific locations in the travel path during locomotion? Experimental brain research, 148(1):133–138, 2003.

[40] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In ICCV, 2009.

[41] J. C. Pinheiro and D. M. Bates. Unconstrained parametrizations for variance-covariance matrices. Statistics and Computing, 6(3):289–296, 1996.

[42] M. Pourahmadi. Covariance estimation: The GLM and regularization perspectives. Statistical Science, pages 369–387, 2011.

[43] M. B. Priestley. Spectral analysis and time series. Academic press, 1981.

[44] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6(12):1939–1959, 2005.

[45] C. E. Rasmussen. Gaussian processes for machine learning. In Adaptive Computation and Machine Learning, 2006.

[46] N. M. Robertson and I. D. Reid. Estimating gaze direction from low-resolution faces in video. In ECCV, 2006.

[47] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese. Learning social etiquette: Human trajectory understanding in crowded scenes. In ECCV, 2016.

[48] A. Sadeghian, A. Alahi, and S. Savarese. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. arXiv preprint arXiv:1701.01909, 2017.

[49] R. Stiefelhagen, M. Finke, J. Yang, and A. Waibel. From gaze to focus of attention. In VISUAL, 1999.

[50] H. Su, Y. Dong, J. Zhu, H. Ling, and B. Zhang. Crowd scene understanding with coherent recurrent neural networks. In IJCAI, 2016.

[51] L. Sun, Z. Yan, S. M. Mellado, M. Hanheide, and T. Duckett. 3DOF pedestrian trajectory prediction learned from long-term autonomous mobile robot deployment data. arXiv preprint arXiv:1710.00126, 2017.

[52] D. Tosato, M. Spera, M. Cristani, and V. Murino. Characterizing humans on riemannian manifolds. IEEE TPAMI, 35(8):1972–1984, 2013.

[53] P. Trautman and A. Krause. Unfreezing the robot: Navigation in dense, interacting crowds. In IROS, 2010.

[54] P. Vansteenkiste, G. Cardon, E. D'Hondt, R. Philippaerts, and M. Lenoir. The visual control of bicycle steering: The effects of speed and path width. Accident Analysis & Prevention, 51:222–227, 2013.

[55] D. Varshneya and G. Srinivasaraghavan. Human trajectory prediction using spatially aware deep attention models. In NIPS, 2017.

[56] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.

[57] J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. IEEE TPAMI, 30(2):283–298, 2008.

[58] C. K. I. Williams. Prediction with gaussian processes: From linear regression to linear prediction and beyond. In Learning in graphical models, pages 599–621. Springer, 1998.

[59] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg. Who are you with and where are you going? In CVPR, 2011.

[60] S. Yi, H. Li, and X. Wang. Understanding pedestrian behaviors from stationary crowd groups. In CVPR, 2015.

[61] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.

[62] B. D. Ziebart, N. Ratliff, G. Gallagher, C. Mertz, K. Peterson, J. A. Bagnell, M. Hebert, A. K. Dey, and S. Srinivasa. Planning-based prediction for pedestrians. In IROS, 2009.

Claims

1. A method comprising: - receiving, from at least one image sensor, image signals of at least one person, - detecting, from the image signals, a two-dimensional position of the at least one person, - estimating, as a function of the image signals received from the at least one image sensor, a head pose of the at least one person, - generating a view frustum of the at least one person from the estimated head pose, - inputting the two-dimensional position and the view frustum into a recurrent neural network, - generating a predicted trajectory of movement of the at least one person.

2. A method according to claim 1, further comprising: - generating a predicted view frustum of the at least one person.

3. A method according to claim 2, further comprising passing the predicted trajectory of movement and the predicted view frustum to a panel control server.

4. A method according to claim 3, further comprising the panel control server responding to the predicted trajectory of movement and the predicted view frustum, and controlling the panel to display and/or close video content.

5. A method according to claim 2, further comprising passing the predicted trajectory of movement and the predicted view frustum to a traffic control system.

6. A method according to claim 5, further comprising the traffic control system responding to the predicted trajectory of movement and the predicted view frustum, and assigning a vehicle trajectory.

7. A method according to any of claims 1 to 6, wherein the recurrent neural network is a Long Short-Term Memory (LSTM) network.

8. A system comprising: - at least one image sensor for generating image signals of at least one person, - a joint person detector and head pose estimator configured to detect a two-dimensional position of the at least one person from the image signals; to estimate, as a function of the image signals received from the at least one image sensor, a head pose of the at least one person; and to generate a view frustum of the at least one person from the estimated head pose, - a recurrent neural network coupled to the joint person detector and head pose estimator, to process the two-dimensional position and the view frustum of said at least one person and to generate a predicted trajectory of movement of the at least one person.

9. A system according to claim 8, wherein the recurrent neural network also generates a predicted view frustum of the at least one person.

10. A system according to claim 9, further comprising a panel control server which responds to the predicted trajectory of movement and the predicted view frustum, and which controls the panel to display video content.

11. A system according to claim 9, further comprising a traffic control system for responding to the predicted trajectory of movement and the predicted view frustum, and for assigning a vehicle trajectory.

12. A system according to any of claims 8 to 11, wherein the recurrent neural network is a Long Short-Term Memory (LSTM) network.

13. A non-transitory computer-readable recording medium storing a computer program product which, when executed by a processor, causes a computer to perform the method according to any of claims 1 to 7.