WO2018134854A1 - Movement analysis from visual and audio data - Google Patents

Movement analysis from visual and audio data

Info

Publication number
WO2018134854A1
Authority
WO
WIPO (PCT)
Prior art keywords
designed
retail store
kinesthesis
image
person vision
Prior art date
Application number
PCT/IT2017/000007
Other languages
French (fr)
Inventor
Vito SANTARCANGELO
Giovanni Maria Farinella
Sebastiano Battiato
Alberto CAMPORESI
Original Assignee
Centro Studi S.R.L.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Centro Studi S.R.L.
Priority to PCT/IT2017/000007
Publication of WO2018134854A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising

Abstract

A kinesthesis analysis system (1) based on artificial vision and audio analysis for process control and management in a retail store, the system (1) comprising: - a 1st-person vision equipment (2) intended to be embedded in customer-carried shopping containers to be tracked in the retail store, and designed to output image/video/audio data, - a 2nd-person vision equipment (3) intended to be embedded in stationary interactive multimedia kiosks distributed in the retail store, and designed to output image/video data, - a 3rd-person vision equipment (4) designed to output image/video data; - an environment sensory network (5) distributed over the retail store and designed to gather retail store-related information, - a Human-Machine Interface (HMI) (7) designed to expose a Graphical User Interface (GUI) for process control and monitoring and data display, - a communication infrastructure (6) designed to communicate with the 1st-, 2nd- and 3rd-person vision equipment (2, 3, 4) to receive the image/video/audio data therefrom, and with the environment sensory network (5) to receive the retail store-related information therefrom, and - a kinesthesis analysis equipment (8) designed to communicate with the communication infrastructure (6) to receive and process the image/video/audio data outputted by the 1st-, 2nd- and 3rd-person vision equipment (2, 3, 4) and the retail store-related information outputted by the environment sensory network (5) to compute kinesthesis information relating to the retail store.

Description

MOVEMENT ANALYSIS FROM VISUAL AND AUDIO DATA
Technical Field of the Invention
The present invention relates in general to kinesthesis analysis, namely identification of changes in location, pose and motion of people without relying on information from the five senses, and in particular to advanced kinesthesis analysis based on artificial vision and audio analysis for monitoring a process occurring in a delimited area.
The present invention finds advantageous but not limitative application for monitoring people flows, behaviours and interactions with processes occurring in delimited areas, in particular retail stores, such as supermarkets, commercial centres, shops, self-shopping areas, where assessing promotional (marketing and communication) effectiveness and identifying problems in the sale process is of great concern, and to which the following description will refer without losing generality. Nevertheless, the present invention may also find application in different other scenarios, for example in logistic places such as airport and harbour areas or storage warehouses.
State of the Art
As is known, artificial vision in retail stores has long been of interest for many applications, ranging from anti-shoplifting, to trolley/cart and people counting, up to customer re-identification, namely linking records for individuals with no identifying information to records with identifying information (e.g., name or social security number) in order to profile individuals within the anonymous data.
The mainly adopted solutions are those based on RGB and/or thermal cameras. However, the potential of these solutions is currently under-exploited, given the huge amount of visual and audio information provided by distributed heterogeneous capture systems, which could return a faithful picture of the sales trend over time.
Technologies such as RFID, Bluetooth Low Energy, and surveillance cameras are used to monitor objects (trolleys/carts) and to interact with customers in retail stores; these, however, require multi-level design and installation and, given how dynamic retail stores are, are hard to apply in practice.
US 8,325,982 B1 and CN 104637198 A disclose cart tracking technologies based on RFID technology and surveillance cameras.
Subject and Summary of the Invention
The Applicant has appreciated that knowing how to monitor customers' behaviours in a delimited area, such as a retail store, and to assess an associated process from a global analysis perspective represents a considerable advantage factor.
The aim of the present invention is to provide an advanced kinesthesis analysis system based on artificial vision and audio information for process control in a retail store.
This aim is achieved by the present invention, which relates to an advanced kinesthesis analysis system based on artificial vision and audio analysis for process control and monitoring in a retail store, as claimed in the appended claims.
Brief Description of Drawings
Figure 1 is a system-level, block-diagram representation of the present invention.
Figure 2 is a block-diagram representation of a 1st-person vision equipment in the present invention.
Figure 3 is a block-diagram representation of the architecture of the processing software.
Figure 4 is a block-diagram representation of types of analysis carried out by the 1st-person vision software module and the associated representation and classification methods exploited.
Figure 5 depicts a tree-representation of the tree-classification carried out by the 1st-, 2nd- and 3rd-person vision software modules.
Figure 6 depicts an exemplary structure of the 1st-, 2nd- and 3rd-person vision software modules.
Figure 7 is a graphical representation of a real-time computable snapshot of a retail store.
Detailed Description of Embodiments of the Invention
The following description is provided to enable a person skilled in the art to make and use the invention. Various modifications to the embodiments will be readily apparent to those skilled in the art, without departing from the scope of the claimed invention. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein and defined in the appended claims.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments disclosed belong. In case of conflict, the present specification, including definitions, will control. In addition, the examples are illustrative only and not intended to be limiting. In particular, the block diagrams depicted in the Figures and described are not to be construed to represent structural features, namely constructive limitations, but to represent functional features, namely intrinsic properties of devices defined by the achieved effects or functional limitations, which may be implemented with different structures, so protecting the functionalities thereof (possibility to function).
For the purposes of promoting understanding of the embodiments described herein, reference will be made to certain embodiments and specific language will be used to describe the same. The terminology used herein is for the purpose of describing particular embodiments only, and is not intended to limit the scope of the present disclosure.
Fig. 1 depicts a system-level, block-diagram representation of an advanced kinesthesis analysis system based on artificial vision and audio analysis for process control and monitoring in a retail store according to the present invention.
The advanced kinesthesis analysis system, designated as a whole with reference numeral 1, essentially comprises:
- a 1st-person vision equipment 2 intended to be embedded in customer-carried shopping containers, such as shopping carts or baskets (not shown), to be tracked in the retail store, and designed to output image/video/audio;
- a 2nd-person vision equipment 3 intended to be embedded in stationary interactive multimedia terminals, such as kiosks or totems (not shown), conveniently in the form of retail robots able to interact with the customers, distributed in the retail store, and designed to output image/video/audio;
- a 3rd-person vision equipment 4 designed to output image/video/audio;
- an environment sensory network 5 distributed over the retail store to gather heterogeneous retail store-related information about the environment in the retail store,
- a communication infrastructure 6 distributed over the retail store to wiredly/wirelessly communicate with, and receive the audio/image/video and the retail store-related information outputted by, the 1st-, 2nd- and 3rd-person vision equipment 2, 3, 4 and the environment sensory network 5,
- a Human-Machine Interface (HMI) 7 designed to expose a Graphical User Interface (GUI) for process control and monitoring and data display, and
- a kinesthesis analysis equipment 8, conveniently in the form of a server, designed to communicate with the communication infrastructure 6 to receive and process the image/video/audio from the 1st-, 2nd- and 3rd-person vision equipment 2, 3, 4 and the retail store-related information from the environment sensory network 5, to output kinesthesis data relating to the behaviour in the retail store, as detailed here below, and to provide it to the HMI 7 for process monitoring and data display.
As shown in Figure 2, the 1st-person vision equipment 2 comprises an on-board sensory equipment intended to be embedded in each shopping cart to be tracked in the retail store and featuring:
- an image/video capture apparatus 9 comprising one or more digital image/video sensors comprising one or more outwards facing digital cameras operable to sense and capture digital images/videos of the surroundings of the shopping cart, and, optionally, one or more inwards facing digital cameras operable to sense and capture digital images/videos of the inside and outside of the shopping cart, wherein the outwards facing digital cameras conveniently comprise at least a forward facing digital camera, and, optionally, a rear facing digital camera, and one or two side facing digital cameras,
- an audio capture apparatus 10 comprising one or more microphones either distinct from, or integrated in, the cameras, and operable to sense and capture sounds in the proximity of the shopping cart,
- a geolocation apparatus 11 operable to output geolocation data, and conveniently comprising a Global Navigation Satellite System (GNSS) Receiver and a gyroscope,
- a communication interface (Bluetooth/Wi-Fi/RFID) 12 connected to the image/video capture apparatus 9, the audio capture apparatus 10 and the geolocation apparatus 11, and operable to wirelessly transmit the captured digital images/videos/sounds and the geolocation data to the communication infrastructure 6 in the retail store,
- an electric power source 13, conveniently in the form of a rechargeable battery, to power supply the on-board sensory equipment, and
- an electronic control unit 14, conveniently in the form of a microprocessor, connected to the image/video capture apparatus 9, the audio capture apparatus 10, the geolocation apparatus 11, and the communication interface 12, and configured to control operation of the on-board sensory equipment.
The 2nd-person vision equipment 3 is structurally similar to the 1st-person vision equipment 2 and, for this reason, the same reference numerals as those used for the 1st-person vision equipment 2 and shown in Figure 2 are used to designate similar components in the 2nd-person vision equipment 3.
The 2nd-person vision equipment 3 comprises a sensory equipment for each interactive robotic kiosk featuring:
- an image/video capture apparatus 9 comprising one or more digital image/video sensors operable to sense and capture digital images/videos of the surroundings of the kiosk,
- an optional audio capture apparatus 10 comprising one or more microphones and operable to sense and capture sounds in the proximity of the kiosk,
- a communication interface 12 connected to the image/video capture apparatus 9 and the audio capture apparatus 10 and operable to wiredly or wirelessly transmit the captured digital images/videos/sounds to the communication infrastructure 6 in the retail store, and
- an electronic control unit 14 connected to the image/video capture apparatus 9, to the audio capture apparatus 10 and to the communication interface 12, and configured to control operation of the sensory equipment.
The 3rd-person vision equipment 4 is structurally similar to the 1st- and 2nd-person vision equipment 2, 3 and, for this reason, the same reference numerals as those used for the 1st- and 2nd-person vision equipment 2, 3 and shown in Figure 2 are used to designate similar components in the 3rd-person vision equipment 4.
The 3rd-person vision equipment 4 comprises a sensory equipment featuring:
- a stationary image/video capture apparatus 9 comprising one or more stationary digital image/video sensors distributed over the retail store, and a mobile image/video capture apparatus 9 comprising one or more mobile digital image/video sensors carried by mobile robots movable in the retail store, and operable to sense and capture digital images/videos of the retail store,
- an optional stationary audio capture apparatus 10 comprising one or more stationary microphones distributed over the retail store, and an optional mobile audio capture apparatus 10 comprising one or more mobile microphones carried by the mobile robots, and operable to sense and capture sounds in the retail store in general,
- a communication interface 12 wiredly/wirelessly connected to the stationary and mobile image/video capture apparatuses 9 and to the stationary and mobile audio capture apparatuses 10 and operable to wiredly/wirelessly transmit the captured digital images/videos/sounds to the communication infrastructure 6 in the retail store, and
- an electronic control unit 14 connected to the stationary and mobile image/video capture apparatuses 9, to the stationary and mobile audio capture apparatuses 10 and to the communication interface 12, and configured to control operation of the sensory equipment.
The digital image/video sensors may conveniently be in the form of commercially available digital cameras such as Charge-Coupled Device (CCD) cameras, Complementary Metal-Oxide-Semiconductor (CMOS) cameras, also known as Active Pixel Sensor (APS) cameras, or even thermographic cameras, whose output thermographic images allow tactile/kinesthesis feedbacks useful for behavioural analysis to be computed, for example as disclosed in the Applicant's Italian patent applications 102014902291114 and 102014902314973 filed on 05.09.2014 and on 05.12.2014, respectively.
The environment sensory network 5 comprises one or more of the following sensors (not shown): passive infrared (PIR) or ultrasound presence/motion sensors, gas sensors, smart scales, checkout counters or cash desks, fidelity card readers, etc.
The electronic control units 14 of the 1st-, 2nd-, and 3rd-person vision equipment 2, 3, 4 are configured to operate the associated image/video capture apparatus 9, audio capture apparatus 10, and communication interfaces 12 to sense, capture and transmit to the communication infrastructure 6 in the retail store digital images/videos of the surroundings and of the inside of the shopping cart during a mission thereof, and of the retail store in general, as well as sounds in the retail store in general and in the proximity of the shopping cart during a mission thereof, wherein a mission of a shopping cart is meant to indicate the period of time elapsing from when the shopping cart is picked up from a collection area to when the shopping cart is returned to a collection area.
Digital images/videos/sounds are captured according to proprietary criteria that fall outside the scope of the present invention and hence will not be described in detail. Exemplarily, the digital camera(s) on the shopping carts, as well as those in the retail store, might be operated to periodically capture either individual digital images spaced apart in time or short/long sequences of digital images close in time to form short/long digital videos, at a settable capture rate, e.g., one digital image or video clip every second or every few seconds. Similarly, the microphones on the shopping carts, as well as those in the retail store, might be operated to periodically capture short/long audio clips in sync with the captured digital images or video clips.
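By way of illustration only, such a periodic, synchronized capture schedule might look like the following sketch; the CartSensorController class and the camera/microphone/radio objects are hypothetical, and the 1 Hz rate and one-second clips are assumed values, not ones given in this disclosure.

```python
import time

class CartSensorController:
    """Hypothetical sketch of the on-board capture scheduling described above."""

    def __init__(self, camera, microphone, radio, period_s=1.0, clip_s=1.0):
        self.camera = camera          # image/video capture apparatus 9
        self.microphone = microphone  # audio capture apparatus 10
        self.radio = radio            # communication interface 12
        self.period_s = period_s      # settable capture rate (assumed 1 Hz)
        self.clip_s = clip_s          # assumed audio clip length

    def run_mission(self, mission_active):
        # A mission runs from cart pick-up to return to a collection area.
        while mission_active():
            t = time.time()
            frame = self.camera.grab_frame()             # one image/video frame
            audio = self.microphone.record(self.clip_s)  # clip in sync with it
            self.radio.send({"t": t, "frame": frame, "audio": audio})
            time.sleep(max(0.0, self.period_s - (time.time() - t)))
```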
Geolocation data from the GNSS receiver in the geolocation apparatus 11 may conveniently be used to locate the shopping cart with respect to the retail store when the shopping cart is outdoors, where geolocation data are reliable, so as to provide so-called priors, namely initial geolocation data, while geolocation data from the gyroscope in the geolocation apparatus 11 may conveniently be used to enhance georeferencing of the audio/images/videos from the 1st-person vision equipment 2 and, as a result, the customer behavioural analysis.
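A minimal sketch of this use of the geolocation apparatus, under the assumptions of a constant cart speed and simple dead reckoning (neither is stated in the source), might be:

```python
import math

def georeference_track(gnss_prior, gyro_yaw_rates, dt=1.0, speed_mps=1.0):
    """Dead-reckon indoor cart positions from an outdoor GNSS prior (sketch).

    gnss_prior     -- (x, y, heading) from the last reliable outdoor GNSS fix
    gyro_yaw_rates -- yaw rates in rad/s, one sample every dt seconds
    speed_mps      -- assumed constant cart speed (illustrative only)
    """
    x, y, heading = gnss_prior
    track = [(x, y)]
    for omega in gyro_yaw_rates:
        heading += omega * dt                 # integrate the gyroscope heading
        x += speed_mps * dt * math.cos(heading)
        y += speed_mps * dt * math.sin(heading)
        track.append((x, y))                  # georeference for each capture
    return track
```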
The kinesthesis analysis equipment 8 is designed to store and execute a kinesthesis processing software designed to process the audio/images/videos from the 1st-, 2nd- and 3rd-person vision equipment 2, 3, 4, the information from the environment sensory network 5, and the geolocation data from the geolocation apparatus 11 to output data indicative of the customer behaviours in the retail store, as detailed below.
In particular, the kinesthesis processing software is designed to perform data fusion of the digital images/videos captured by the 1st-person vision equipment 2 with those captured by the 2nd- and 3rd-person vision equipment 3, 4, through known face/people detection and product recognition algorithms, to compute several quantities, such as a 2D/3D map of the retail store, kinesthesis information relating to the retail store, numbers, positions and paths of the customers and carts in the retail store, and an analytics map of the retail store based on the number, positions and paths of the customers, so providing information, among other things, on the stops and interactions with objects/displays/personnel in the delimited area, a so-called visual genome (synthesis/account), a promotional (communication/marketing) effectiveness index, alerts, etc.
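As an illustration of one such fused quantity, the sketch below turns georeferenced positions into a per-cell dwell-time map of the store, from which stops and the analytics map can be derived; the one-metre cell and five-second stop threshold are assumptions, not values from the source.

```python
from collections import defaultdict

def dwell_map(track, cell_m=1.0, stop_s=5.0, dt=1.0):
    """Accumulate dwell time per grid cell and flag stops (sketch only).

    track -- georeferenced (x, y) positions sampled every dt seconds
    """
    dwell = defaultdict(float)
    for x, y in track:
        dwell[(int(x // cell_m), int(y // cell_m))] += dt
    stops = [cell for cell, seconds in dwell.items() if seconds >= stop_s]
    return dict(dwell), stops   # heatmap cells and stop locations
```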
Figure 3 is a high-level, block-diagram representation of an architecture of the kinesthesis processing software.
The processing software comprises three different processing modules logically associated with the 1st-, 2nd- and 3rd-person vision equipment 2, 3, 4, and for this reason hereinafter referred to as 1st-, 2nd-, and 3rd-person vision software modules, respectively designated with reference numerals 15, 16 and 17, and an occlusion prediction software module 18.
The 1st-person vision software module 15 is designed to receive and process the digital images/videos/sounds captured by the 1st-person vision equipment 2, and refinement data computed by the 3rd-person vision software module 17, whereby computing and outputting high-level behavioural data, as detailed below.
The 2nd-person vision software module 16 is designed to receive and process the digital images/videos/sounds captured by the 2nd-person vision equipment 3 and the high-level behavioural data computed by the 1st-person vision software module 15, and to enrich the high-level behavioural data based on the digital images/videos/sounds captured by the 2nd-person vision equipment 3, whereby outputting high-level enriched behavioural data, as detailed below.
The 3rd-person vision software module 17 is designed to receive and process the digital images/videos/sounds captured by the 3rd-person vision equipment 4 and the high-level behavioural data computed by the 1st-person vision software module 15, and to refine the high-level behavioural data based on the digital images/videos/sounds captured by the 3rd-person vision equipment 4, whereby outputting the refinement data for the 1st-person vision software module 15.
The occlusion prediction software module 18 is designed to cooperate with the 1st-person vision software module 15 to predict and appropriately correct occlusions in the image/video capturing of the 1st-person vision equipment 2, in a way that is known per se and will not be described in more detail.
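The resulting data flow, including the refinement loop between the 1st- and 3rd-person modules, can be pictured as in this sketch (the module objects and method names are hypothetical, not from the source):

```python
def kinesthesis_pipeline(fp_frames, sp_frames, tp_frames,
                         fp_module, sp_module, tp_module):
    """Sketch of the flow among software modules 15, 16 and 17."""
    behaviour = fp_module.classify(fp_frames)                # module 15
    refinement = tp_module.refine(tp_frames, behaviour)      # module 17
    behaviour = fp_module.classify(fp_frames, refinement)    # refined pass
    return sp_module.enrich(sp_frames, behaviour)            # module 16 enriches
```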
In particular, the 1st-person vision software module 15 is designed to process the captured digital images/videos/sounds from the 1st-person vision equipment 2 considering also the method disclosed in V. Santarcangelo, G. M. Farinella and S. Battiato, Egocentric Vision for Visual Market Basket Analysis, Conference Paper, The First International Workshop on Egocentric Perception, Interaction and Computing (EPIC@ECCV16) held in Amsterdam on October 9th, 2016, the content of which is incorporated herein by reference in its entirety.
In summary, the 1st-person vision software module 15 is designed to process all, or appropriately selected ones, of the captured digital images/video frames and the captured sound(s) to classify each processed digital image/video frame into an associated behavioural class using a hierarchy of classifiers, so resulting in high-level behavioural data being outputted for each processed digital image/video frame and sound, which high-level behavioural data is indicative of a motion status of the associated shopping cart, namely whether the shopping cart is stopped or moving, of a coarse geolocation of the associated shopping cart in the retail store, namely whether the shopping cart is outdoors or indoors, and of a fine geolocation of the shopping cart in the retail store, namely in which product/service area of the retail store the shopping cart is.
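A schematic of such a hierarchy of classifiers, with hypothetical classifier callables and example class labels, might read:

```python
def classify_frame(frame, motion_clf, coarse_clf, area_clf):
    """Descend the classifier hierarchy for one frame (sketch only)."""
    label = {"motion": motion_clf(frame)}       # stopped / moving
    label["coarse"] = coarse_clf(frame)         # indoor / outdoor
    if label["coarse"] == "indoor":
        label["fine"] = area_clf(frame)         # e.g. dairy, bakery, checkout
    return label
```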
In particular, the motion status (stopped/moving) of the shopping cart, as well as other actions, is determined using an artificial vision technique based on Visual Flow; the shopping cart is coarsely geolocated based on Deep Learning (DL), conveniently a Convolutional Neural Network (CNN), while it is finely geolocated based on Structure From Motion (SFM), a range imaging technique for estimating three-dimensional structures from two-dimensional image sequences, combined with a Convolutional Neural Network (CNN).
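For the Visual Flow stage, a minimal stopped/moving test could use dense optical flow; the sketch below uses Farneback's method from OpenCV, with an assumed magnitude threshold (the actual technique and parameters are not given in the source).

```python
import cv2
import numpy as np

def cart_motion_status(prev_frame, frame, flow_thresh=0.5):
    """Classify the cart as stopped or moving from two consecutive frames."""
    g0 = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)   # per-pixel flow magnitude
    return "moving" if float(np.median(magnitude)) > flow_thresh else "stopped"
```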
The 1st-person vision software module 15 is further designed to recognize picking and releasing of goods from/into the shopping cart based on the Visual Flow, through which the movement direction may be recognized, and which may be used either individually or in combination with visual descriptors, conveniently Scale-Invariant Feature Transform (SIFT), KAZE and CNN descriptors.
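On the descriptor side, a sketch of matching a moving region against a small catalogue of product images with SIFT and Lowe's ratio test follows; the catalogue dictionary and the 0.75 ratio are illustrative assumptions, and KAZE (cv2.KAZE_create) or CNN embeddings would play the same role.

```python
import cv2

def identify_product(gray_crop, catalogue):
    """Match a grayscale crop of the moving region against known products.

    catalogue -- hypothetical {product_name: grayscale_image} dictionary
    """
    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher()
    _, query = sift.detectAndCompute(gray_crop, None)
    best_name, best_score = None, 0
    for name, image in catalogue.items():
        _, train = sift.detectAndCompute(image, None)
        if query is None or train is None:
            continue
        pairs = matcher.knnMatch(query, train, k=2)
        good = [p for p in pairs
                if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
        if len(good) > best_score:            # keep the best-matching product
            best_name, best_score = name, len(good)
    return best_name
```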
The 1st-person vision software module 15 is further designed to carry out a Visual Market Basket Analysis of the content of the shopping cart to determine an effectiveness index of the marketing and communication in the retail store.
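The disclosure does not spell out the effectiveness index; one plausible sketch, assuming the index is the fraction of promoted products the customer was exposed to along the path that actually ended up in the basket, is:

```python
def promo_effectiveness(basket_items, exposed_promos):
    """Assumed formula (not from the source): exposure-to-basket conversion."""
    exposed = set(exposed_promos)
    if not exposed:
        return 0.0
    converted = set(basket_items) & exposed   # promoted items actually picked
    return len(converted) / len(exposed)
```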
Figure 4 is a graphical representation of the types of analysis carried out by the 1st-person vision software module 15 and the associated representation and classification methods exploited. Figure 5 is a tree-representation hierarchy of the above-summarized classification process carried out by the 1st-person vision software module 15, where the classes indicated are provided as an example only and may differ depending on the scenario and the detail level to be achieved. Figure 6 depicts an exemplary output of the 1st-person vision software module 15.
The 2nd-person vision software module 16 is designed to process all, or appropriately selected ones, of the captured digital images/video frames and the captured sound(s) to determine customer-related features, in particular customer age, gender, and emotion, by means of both generative and discriminative learning machines used for classification in computer vision, conveniently Local Binary Patterns (LBP); customer Neuro-Linguistic Programming (NLP) cues by means of Skeletal Tracking; customer interactions, such as promotions, products, ticket bookings, product department, day/evening/night shift, etc.; and fidelity card-related information.
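As a sketch of the LBP representation step (the downstream age/gender/emotion classifier is assumed, not specified here), a uniform-LBP histogram of a cropped grayscale face can be computed with scikit-image:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def face_lbp_histogram(gray_face, points=8, radius=1):
    """Normalized uniform-LBP histogram for a cropped grayscale face (sketch)."""
    lbp = local_binary_pattern(gray_face, points, radius, method="uniform")
    bins = points + 2                    # uniform patterns plus the "other" bin
    hist, _ = np.histogram(lbp.ravel(), bins=bins, range=(0, bins))
    return hist / max(1, hist.sum())     # feature vector for a classifier
```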
The distinguishing features of the present invention allow several advantages to be achieved.
In particular, the present invention makes it possible to assess customers' behaviours in the retail store and to compute several quantities, such as a 2D/3D map of the retail store, kinesthesis information relating to the retail store, numbers, positions and paths of the customers in the retail store, an analytics map of the retail store based on the number, positions and paths of the customers, so providing information, among other things, on the stops and interactions with objects/displays/personnel in the delimited area, a so-called visual genome (synthesis/account), a promotional (communication/marketing) effectiveness index, alerts, etc.
Moreover, the kinesthesis information allows process inefficiencies to be identified, such as noise nuisance, obstacles to motion, dirtiness, and queues in the retail store departments and at the checkout counters or cash desks.
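A trivial sketch of one such check, flagging over-threshold checkout queues so that the HMI can raise an alert (the threshold value is an assumption):

```python
def queue_alerts(people_per_checkout, max_queue=5):
    """Return the checkout desks whose queue exceeds the assumed threshold."""
    return [desk for desk, count in people_per_checkout.items()
            if count > max_queue]
```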
The kinesthesis information may be displayed in real time to the retail store staff on screens/tablets and used to generate alerts and graphical snapshots of the overall retail store status, such as the one depicted in Figure 7, to carry out visual market basket analysis, and to navigate through the retail store to find specific products using appropriate smartphone/tablet apps, the mobile robots, or the interactive kiosks.

Claims

1. A kinesthesis analysis system (1) based on artificial vision and audio analysis for process control and management in a retail store, the system (1) comprising:
- a 1st-person vision equipment (2) intended to be embedded in customer-carried shopping containers to be tracked in the retail store, and designed to output image/video/audio data,
- a 2nd-person vision equipment (3) intended to be embedded in stationary interactive multimedia kiosks distributed in the retail store, and designed to output image/video data,
- a 3rd-person vision equipment (4) designed to output image/video data;
- an environment sensory network (5) distributed over the retail store and designed to gather retail store-related information,
- a communication infrastructure (6) designed to communicate with the 1st-, 2nd- and 3rd-person vision equipment (2, 3, 4) to receive the image/video/audio data therefrom, and with the environment sensory network (5) to receive the retail store-related information therefrom, and
- a kinesthesis analysis equipment (8) designed to communicate with the communication infrastructure (6) to receive and process the image/video/audio data outputted by the 1st-, 2nd- and 3rd-person vision equipment (2, 3, 4) and the retail store-related information outputted by the environment sensory network (5) to compute kinesthesis information relating to the retail store.
2. The kinesthesis analysis system of claim 1, wherein the 1st-person vision equipment (2) comprises an on-board sensory equipment intended to be embedded in each customer-carried shopping container to be tracked in the retail store, and featuring:
- an image/video capture apparatus (9) designed to sense and capture digital images/videos of the surroundings of the customer-carried shopping container,
- an audio capture apparatus (10) designed to sense and capture sounds in the proximity of the customer-carried shopping container,
- a communication interface (12) connected to the image/video capture apparatus (9) and to the audio capture apparatus (10) to receive the captured digital images/videos/sounds and operable to transmit the received digital images/videos/sounds to the communication infrastructure (6), and
- an electronic control unit (14) connected to the image/video capture apparatus (9), the audio capture apparatus (10), and the communication interface (12), and designed to operate the image/video capture apparatus (9), the audio capture apparatus (10), and the communication interface (12) to capture and transmit to the communication infrastructure (6) digital images/videos of the surroundings of the customer-carried shopping container and sounds in the proximity of the customer-carried shopping container during a mission thereof, wherein a mission of a customer-carried shopping container is the period of time elapsing from when it is picked up from a collection area to when it is returned to a collection area.
3. The kinesthesis analysis system of claim 2, wherein the image/video capture apparatus (9) is further designed to capture digital images/videos of the inside of the customer-carried shopping container.
4. The kinesthesis analysis system of claim 2 or 3, wherein the 2nd-person vision equipment (3) comprises a sensory equipment in each interactive kiosk, featuring:
- an image/video capture apparatus (9) designed to sense and capture digital images/videos of the surroundings of the interactive kiosk,
- a communication interface (12) connected to the image/video capture apparatus (9) to receive the captured digital images/videos and operable to transmit the received digital images/videos to the communication infrastructure (6), and
- an electronic control unit (14) connected to the image/video capture apparatus (9) and the communication interface (12), and designed to operate the image/video capture apparatus (9) and the communication interface (12) to capture and transmit to the communication infrastructure (6) digital images/videos of the surroundings of the interactive kiosk.
5. The kinesthesis analysis system of claim 4, wherein the 3rd-person vision equipment (4) comprises:
- a stationary image/video capture apparatus (9) distributed over the retail store, and a mobile image/video capture apparatus (9) carried by mobile robots movable in the retail store, and designed to sense and capture digital images/videos of the retail store,
- a communication interface (12) connected to the stationary and mobile image/video capture apparatuses (9) to receive the captured digital images/videos and operable to transmit the received digital images/videos to the communication infrastructure (6), and
- an electronic control unit (14) connected to the stationary and mobile image/video capture apparatuses (9) and the communication interface (12), and designed to operate the stationary and mobile image/video capture apparatuses (9) and the communication interface (12) to capture and transmit to the communication infrastructure (6) digital images/videos of the retail store.
6. The kinesthesis analysis system of claim 5, wherein the 2nd-person vision equipment (3) further comprises an audio capture apparatus (10) designed to sense and capture sounds in the proximity of the kiosk; and the 3rd-person vision equipment (4) further comprises a stationary audio capture apparatus (10) and a mobile audio capture apparatus (10) carried by the mobile robots, and operable to sense and capture sounds in the retail store.
7. The kinesthesis analysis system of any one of the preceding claims, wherein the environment sensory network (5) comprises one or more of the following sensors: passive infrared (PIR) or ultrasound presence/motion sensors, gas sensors, smart scales, checkout counters or cash desks, and fidelity card readers.
8. The kinesthesis analysis system of any one of the preceding claims, further comprising a Human-Machine Interface (HMI) (7) designed to expose a Graphical User Interface (GUI) for process control and monitoring and data display.
9. The kinesthesis analysis system of any one of the preceding claims, wherein the kinesthesis analysis equipment (8) is designed to store and execute a kinesthesis processing software designed to perform, when executed, data fusion of the images/videos/sounds captured by the 1st-, 2nd- and 3rd-person vision equipment (2, 3, 4) and of the information from the environment sensory network (5) to output data indicative of the customers' behaviours in the retail store;
the kinesthesis processing software comprises 1st-, 2nd- and 3rd-person vision software modules (15, 16, 17) logically associated with the 1st-, 2nd- and 3rd-person vision equipment (2, 3, 4), respectively; the 1st-person vision software module (15) is designed to receive and process the digital images/videos/sounds captured by the 1st-person vision equipment (2), and refinement data computed by the 3rd-person vision software module (17), whereby computing and outputting high-level behavioural data;
the 2nd-person vision software module (16) is designed to receive and process the digital images/videos captured by the 2nd-person vision equipment (3) and the high-level behavioural data computed by the 1st-person vision software module (15), and to enrich the high-level behavioural data based on the digital images/videos captured by the 2nd-person vision equipment (3), whereby outputting high-level enriched behavioural data; and
the 3rd-person vision software module (17) is designed to receive and process the digital images/videos captured by the 3rd-person vision equipment (4) and the high-level behavioural data computed by the 1st-person vision software module (15), and to refine the high-level behavioural data based on the digital images/videos captured by the 3rd-person vision equipment (4), whereby outputting the refinement data for the 1st-person vision software module (15).
10. The kinesthesis analysis system of claim 8, wherein the 1st-person vision software module (15) is designed to process all, or selected ones, of the captured digital images/video frames and the captured sound(s) to classify each processed digital image/video frame in an associated behavioural class using a hierarchy of classifiers, so resulting in high-level behavioural data being outputted for each processed digital image/video frame and sound, which high-level behavioural data is indicative of a motion status, a coarse geolocation, and a fine geolocation of the associated customer-carried shopping container in the retail store;
wherein the motion status of the customer-carried shopping container is determined based on Visual Flow, the customer-carried shopping container is coarsely geolocated based on Deep Learning (DL), conveniently Convolutional Neural Network (CNN), and the customer-carried shopping container is finely geolocated based on Structure From Motion (SFM) and on Convolutional Neural Network (CNN).
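By way of non-limiting illustration, a minimal sketch of the Visual Flow-based motion status determination of claim 10, using dense optical flow as computed by OpenCV; the threshold and flow parameters are illustrative assumptions:

```python
import cv2
import numpy as np

def motion_status(prev_gray, curr_gray, thresh=0.5):
    # Dense optical flow between two consecutive greyscale ego-vision frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    # Mean magnitude of the flow field as a simple motion cue for the
    # customer-carried shopping container (cart/basket-mounted camera).
    mean_mag = np.linalg.norm(flow, axis=2).mean()
    return "moving" if mean_mag > thresh else "stationary"
```

The coarse and fine geolocation steps (CNN classification over store departments, SFM-based positioning) would consume the same frames and are omitted from the sketch.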
11. The kinesthesis analysis system of claim 9 or 10, wherein the 1st-person vision software module (15) is further designed to recognize picking and releasing of goods from/into the customer-carried shopping container based on the Visual Flow, used either individually or in combination with Scale-Invariant Feature Transform (SIFT), KAZE and Convolutional Neural Network (CNN) descriptors.
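By way of non-limiting illustration, SIFT-descriptor support for picking/releasing recognition may be sketched as follows; the `container_content_changed` helper, the ratio test value and the match threshold are hypothetical:

```python
import cv2

def container_content_changed(img_before, img_after, min_matches=10):
    # Compare SIFT keypoints of the container region before/after a
    # hand-in-view event detected by the Visual Flow.
    sift = cv2.SIFT_create()                 # cv2.KAZE_create() is an alternative
    _, des1 = sift.detectAndCompute(img_before, None)
    _, des2 = sift.detectAndCompute(img_after, None)
    if des1 is None or des2 is None:
        return True                          # nothing to match: assume a change
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    good = [m for m in matches
            if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]
    # Few surviving matches suggest a good was picked or released.
    return len(good) < min_matches
```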
12. The kinesthesis analysis system of claim 11, wherein the 1st-person vision software module (15) is further designed to carry out a Visual Market Basket Analysis of the content of the customer-carried shopping container to determine an effectiveness index of the marketing and communication in the retail store.
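By way of non-limiting illustration, a toy effectiveness index over the visually recognised basket contents might be computed as follows; the metric (share of baskets containing each promoted good) is an assumption for illustration, not the patented formula:

```python
from collections import Counter

def effectiveness_index(baskets, promoted_goods):
    # Share of observed baskets containing each promoted good; a simple
    # stand-in for a marketing/communication effectiveness index.
    counts = Counter(item for basket in baskets
                     for item in set(basket) if item in promoted_goods)
    n_baskets = max(len(baskets), 1)
    return {good: counts[good] / n_baskets for good in promoted_goods}

# Example: prints {'milk': 1.0, 'promo_snack': 0.5}
print(effectiveness_index([["milk", "promo_snack"], ["milk", "bread"]],
                          {"milk", "promo_snack"}))
```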
13. The kinesthesis analysis system of any one of claims 9 to 12, wherein the 2nd-person vision software module (16) is designed to process all, or selected ones, of the captured digital images/video frames to determine: customer-related features, such as customer age, gender, and emotion, by means of visual features, conveniently Local Binary Patterns (LBP), customer Neuro-Linguistic Programming (NLP), and skeletal tracking and pose estimation; customer interactions, such as promotions, products, ticket bookings, product department, day/evening/night shift, etc.; and fidelity card-related information.
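By way of non-limiting illustration, the Local Binary Patterns (LBP) visual features mentioned in claim 13 may be extracted as follows, using scikit-image; the parameter values and the use of the histogram as input to an age/gender/emotion classifier are illustrative assumptions:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_face, p=8, r=1):
    # Uniform LBP of a greyscale face crop; with P=8 the codes fall in
    # 0..9, giving a 10-bin descriptor usable as classifier input.
    lbp = local_binary_pattern(gray_face, P=p, R=r, method="uniform")
    hist, _ = np.histogram(lbp, bins=np.arange(p + 3), density=True)
    return hist
```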
14. A kinesthesis processing software as claimed in any one of the preceding claims.
PCT/IT2017/000007 2017-01-18 2017-01-18 Movement analysis from visual and audio data WO2018134854A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IT2017/000007 WO2018134854A1 (en) 2017-01-18 2017-01-18 Movement analysis from visual and audio data

Publications (1)

Publication Number Publication Date
WO2018134854A1 true WO2018134854A1 (en) 2018-07-26

Family

ID=58461411

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IT2017/000007 WO2018134854A1 (en) 2017-01-18 2017-01-18 Movement analysis from visual and audio data

Country Status (1)

Country Link
WO (1) WO2018134854A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070058040A1 (en) * 2005-09-09 2007-03-15 Objectvideo, Inc. Video surveillance using spatial-temporal motion analysis
US8325982B1 (en) 2009-07-23 2012-12-04 Videomining Corporation Method and system for detecting and tracking shopping carts from videos
US20140278745A1 (en) * 2013-03-15 2014-09-18 Toshiba Global Commerce Solutions Holdings Corporation Systems and methods for providing retail process analytics information based on physiological indicator data
CN104637198A (en) 2013-11-14 2015-05-20 复旦大学 Intelligent settlement shopping cart system with autonomous tracking function

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
V. Santarcangelo; G. M. Farinella; S. Battiato: "Egocentric Vision for Visual Market Basket Analysis", 9 October 2016 (2016-10-09)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SE2050058A1 (en) * 2020-01-22 2021-07-23 Itab Shop Products Ab Customer behavioural system

Similar Documents

Publication Title
US11087130B2 (en) Simultaneous object localization and attribute classification using multitask deep neural networks
US11270260B2 (en) Systems and methods for deep learning-based shopper tracking
JP6649306B2 (en) Information processing apparatus, information processing method and program
US10127438B1 (en) Predicting inventory events using semantic diffing
US20230316762A1 (en) Object detection in edge devices for barrier operation and parcel delivery
JP6562077B2 (en) Exhibition device, display control device, and exhibition system
JP2020115344A (en) Autonomous shop tracking system
CA3072058A1 (en) Predicting inventory events using semantic diffing
Popa et al. Kinect sensing of shopping related actions
JP2018093283A (en) Monitoring information gathering system
US11881090B2 (en) Investigation generation in an observation and surveillance system
CN107206601A (en) Customer service robot and related systems and methods
Liu et al. Customer behavior classification using surveillance camera for marketing
EP3748565A1 (en) Environment tracking
Hasanuzzaman et al. Monitoring activity of taking medicine by incorporating RFID and video analysis
US11600019B2 (en) Image-based inventory estimation
CN110689389A (en) Computer vision-based shopping list automatic maintenance method and device, storage medium and terminal
US20240119553A1 (en) Dynamically controlled cameras for computer vision monitoring
EP3683757A1 (en) Investigation generation in an observation and surveillance system
JP2023153148A (en) Self-register system, purchased commodity management method and purchased commodity management program
WO2018134854A1 (en) Movement analysis from visual and audio data
US11615430B1 (en) Method and system for measuring in-store location effectiveness based on shopper response and behavior analysis
CA3231848A1 (en) Contactless checkout system with theft detection
Lie et al. Fall-down event detection for elderly based on motion history images and deep learning
Gagvani Challenges in video analytics

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17714916

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17714916

Country of ref document: EP

Kind code of ref document: A1