WO2022047342A9 - System and method for using deep neural networks to add value to video streams - Google Patents

System and method for using deep neural networks to add value to video streams

Info

Publication number
WO2022047342A9
WO2022047342A9 (application PCT/US2021/048300)
Authority
WO
WIPO (PCT)
Prior art keywords
video stream
information
video
neural network
server
Prior art date
Application number
PCT/US2021/048300
Other languages
English (en)
Other versions
WO2022047342A1 (fr)
Inventor
Rogelio AGUILERA JR.
Brandon Alan KAPPELER
Original Assignee
Aguilera Jr Rogelio
Kappeler Brandon Alan
Priority date
Filing date
Publication date
Application filed by Aguilera Jr Rogelio, Kappeler Brandon Alan filed Critical Aguilera Jr Rogelio
Publication of WO2022047342A1
Publication of WO2022047342A9

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V 20/597 Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • the field of the invention relates generally to video image analysis and to the use of deep neural networks for automatically adding value-added annotations to video streams.
  • Video feeds for monitoring environments are commonly utilized in personal and commercial settings in a large variety of applications. For example, video feeds are used in monitoring warehouses to make sure that the premises are secured.
  • Existing security systems for homes and commercial properties have been disclosed which provide a video feed from multiple video cameras, typically connected to a manual monitoring station to facilitate observation by security personnel.
  • Some home or commercial settings utilize cameras for monitoring the surroundings for the purposes of recording suspicious activities.
  • What is needed is a system that can continually monitor the conditions in a video stream, for example one generated from the monitoring of a commercial restaurant kitchen or of a delivery vehicle’s occupants, and add value to the video stream by adding annotations that serve as alerts if certain protocols for observing basic hygiene are violated during food preparation or delivery.
  • What is also needed is a system for rating establishments that conform to the health and safety guidelines in food preparation and delivery, using the rating as a mechanism for promoting these establishments and using market competitive forces to bring others in line, so they are also motivated to improve their compliance score. Additionally, what is needed is an ability for customers to have a system detect and report any noncompliance in real time and thereby empower consumers to force compliance by using the captured value-added video streams from the kitchen and delivery vehicles and reporting non-compliance to the establishment personnel.
  • What is also needed are client applications that are further configured to make clips of the value added video streams, add comments about non-compliance, and forward these clips as emails or upload them to social media sites as a method of enhancing transparency and encouraging compliance. And what is needed is the ability to assign a score indicative of the extent to which a video stream complies with a set of predefined conditions, so customers can monitor trends.
  • What is needed is a system that uses machine learning and artificial intelligence models for detecting at-risk behavior and annotates the video feed to alert the consumers and stakeholders.
  • the disclosed application for utilizing a deep learning system for generating value added video streams has several practical applications.
  • A system and method are disclosed for performing the steps of capturing video feeds, analyzing each frame with a deep neural network, and annotating the result of the analysis back onto the frame and the stream.
  • One specific application of the invention uses video streams from a commercial kitchen establishment where the video stream is generated from collecting optical information from cameras monitoring the cooking area.
  • field cameras in the delivery vehicles are used by the establishment to monitor the transport of prepared orders of food to the consumers.
  • an establishment may also include video feeds from remotely located food preparation facilities, such as those used for serving specific needs like a bakery or confectionery wing of the establishment.
  • the client applications are further configured to make clips of the value added video streams and forward these clips as emails or upload to social media sites.
  • Another aspect of the system and method disclosed is its ability to assign a score indicative of the extent to which a video stream complies with a set of predefined conditions.
  • Embodiments utilize a multitude of deep neural network models, with each neural network dedicated to detecting a specific condition or a set of closely related conditions. Furthermore, the deep neural network models begin with a specific set of parameters which generally perform well at detecting a broad set of conditions, such as objects in a video frame. These network models can be fine-tuned by training on a specialized set of examples encountered in a specific setting. Embodiments of the invention begin with a general model and tune it using additional examples from the situations or scenarios being monitored.
  • the system and method are used for performing the steps of capturing video feeds and adding annotations indicating noncompliance or other aspects of the video stream and making the annotated video streams available as “value added video streams” to subscribers over the Internet.
  • the motivation of the system and processes disclosed is to create transparency from the monitored activities, such as providing transparency from the kitchen to the consumer.
  • video feeds from an end point such as a restaurant are collected and processed by an end point monitor or aggregator and transmitted to a server over wired or wireless network.
  • the end point monitor or aggregator performs the function of combining several feeds into a single stream and using compression, error correction, and encryption to communicate the video stream securely and efficiently to the server for processing.
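  • As a non-limiting illustration, the collection and forwarding performed by such an end point monitor can be sketched in Python as below; the ingest URL, the camera sources, the JPEG quality setting, and the use of HTTPS for transport encryption are assumptions of this sketch rather than details taken from the disclosure.

```python
# Minimal end-point monitor sketch: read frames from several cameras,
# JPEG-compress each frame, and upload it to the server over HTTPS.
import cv2
import requests

SERVER_URL = "https://example.com/ingest"   # hypothetical ingest endpoint
CAMERA_SOURCES = [0, 1]                     # e.g. two locally attached station cameras

captures = [cv2.VideoCapture(src) for src in CAMERA_SOURCES]

while True:
    for cam_id, cap in enumerate(captures):
        ok, frame = cap.read()
        if not ok:
            continue
        # JPEG compression keeps the upload small; HTTPS provides transport encryption.
        ok, buf = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 80])
        if not ok:
            continue
        requests.post(
            SERVER_URL,
            data=buf.tobytes(),
            headers={"Content-Type": "image/jpeg", "X-Camera-Id": str(cam_id)},
            timeout=5,
        )
```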
  • the server is a computer system using deep neural network learning model for detecting if the personnel in the kitchen and the field, including the delivery personnel, are observing health protocols, such as wearing of face masks and hand gloves.
  • the video stream, plus the information detected, is referred to as a “value added video stream,” or simply a “value added stream.”
  • the server delivers the value added stream to a recipient subscriber who can take any necessary actions such as informing the establishment about the breach or non-compliance.
  • the application used by the client is used for searching, viewing, saving, and uploading the value added video streams.
  • An embodiment further uses triangulation of information obtained from a plurality of cameras to detect if the workers are observing social distancing protocols while working in the commercial kitchen.
  • the disclosed application helps enforce health protocols and is designed to be yet another precautionary measure in humanity’s fight against communicable infections like COVID-19, helping protect against epidemiological outbreaks.
  • An embodiment of the system further empowers consumers by allowing them to use a client viewing application to report concerning behavior to the restaurant owners by capturing clips or a plurality of snapshots of any of the value-added video streams attached to an electronic message sent to the establishment. Further, the client application enables the completion of a survey which is used to update a compliance score indicative of the user’s perspective of the extent to which the restaurant is complying with health protocols.
  • the value added video streams are made available to a plurality of subscribing client applications.
  • the client applications are further configured to make clips of the value added video streams, save and forward these clips as emails, or upload to social media sites.
  • the deep learning neural network creates a value added stream by adding annotations on sections of video streams that pertain to personnel violating or adhering to protocols for health, safety, and hygiene.
  • An embodiment measures the distance between the personnel on the floor to determine whether they are observing social distancing.
  • An embodiment further alerts a human to confirm the model’s predicted violations for further validation.
  • An embodiment of the system comprises a process where a plurality of input data is compiled into a calculation of a safety score where the safety score is reflective of the viewers’ perception of the level to which an establishment is adhering to a relevant set of health, safety and hygiene protocols.
  • a process is disclosed where the list of establishments is searched from a database in a predefined order dependent upon the safety scores recorded for the establishments.
  • the invention accordingly comprises several steps and the relation of one or more of such steps with respect to each other, and the apparatus embodying features of construction, combinations of elements and arrangement of parts that are adapted to effect such steps, all as exemplified in the following detailed disclosure, and the scope of the invention will be indicated in the claims.
  • FIG. 1 depicts a system architecture of an embodiment showing the connectivity between the various components with the transmittal of video streams by the endpoint monitor to a cloud server comprising processing components for annotating and adding value to the video stream and streaming of the annotated video stream to client applications;
  • FIG. 2 shows the internal processing steps leading to the annotation of a video stream, the steps comprising splitting the video stream into its frames, applying a plurality of pretrained models to annotate individual frames, and streaming the frames and the associated annotations to the client application;
  • FIG. 3 depicts the flow of processing steps performed on the cloud server for detecting faces and masks on a video frame
  • FIG. 4 shows an environmental diagram for an embodiment being used for monitoring a kitchen of a commercial restaurant and depicts the use of station and environment cameras, such as the fish-eye cameras shown, being in live communication with an end-point monitor that collects, processes, and transmits the video feeds to a server;
  • FIG. 5 shows an environmental view with a field camera being used for monitoring a delivery vehicle with delivery personnel and packages where the field camera is connected to a portable end-point monitor that collects, processes, and transmits the video feed from the delivery vehicle over the cloud;
  • FIG. 6 shows an environmental view where a portable phone including a processor, a camera, and a network interface serves as a device for capturing information, with the phone camera serving as an optical sensor and the processor executing software instructions to process and transmit the video to the server;
  • FIG. 7 shows the processing steps performed by the client application allowing for a selection of the value-added video stream, viewing of the selected stream, and reporting any concerning behavior;
  • FIG. 8(A) depicts the block diagrams for the computer vision model used for detecting the bounding box over a human face
  • FIG. 8(B) shows the flow chart and the software procedure corresponding to the implementation of the face detection, producing a bounding box around the face to set up the next stage of detecting a mask on the face;
  • FIG. 9 shows the architecture of a Convolutional Deep Neural Network utilizing a plurality of filters and sampling layers followed by a plurality of dense layers of neurons, ultimately followed by a single output layer generating the determination of the monitored condition;
  • FIG. 10 (A) shows the architecture of a series of LSTM cells that are configured to examine each frame and provide an output to a fully connected hidden layer of neurons that is then fed to an output softmax function in the shown embodiment
  • FIG. 10 (B) shows the inner architecture of each of the LSTM cells used in the autoregressive chain shown above;
  • FIG. 11 depicts the architecture of a deep neural network for detecting the presence of a mask within the bounding box of the image containing a human face;
  • FIG. 12 shows a swim lane diagram for the three concurrent activities in progress, i.e. collection of video feeds from the end-points, processing of video feeds to overlay the value-added annotations, and the receiving and commenting on the value-added feeds by the client application user;
  • FIG. 13 shows an activity diagram depicting the ability of the client application to sort the feeds by The Real Meal Score, or the TRM Score, assigned by the provider of each of the value-added video feeds streamed from the cloud server;
  • FIG. 14 depicts the inclusion of a database on the server to manage a plurality of restaurant information
  • FIG. 15(A) shows a GUI providing the search and display capabilities of the client application providing capabilities of searching for a specific restaurant using a variety of criteria including location, name, TRM score and the like;
  • FIG. 15(B) depicts the capability of the client application to drill down and view all the camera feeds provided by a specific restaurant including the plurality of value-added video streams from the kitchen monitoring stations and fish eye camera, and value added streams from the plurality of delivery vehicles;
  • FIG. 16 depicts a Graphical User Interface for a viewing of a specific value-added video stream on the client application and the ability to send an electronic message with attached frames depicting the concerning behavior;
  • FIG. 17 shows a component and packaging diagram of an embodiment of the system.
  • FIG. 18(A) shows an example of a frame where five faces are recognized, none of which are seen wearing masks, receiving a TRM score of 0, and FIG. 18(B) shows an example where two faces are recognized and both faces are annotated with a checkmark indicating that masks were detected, receiving a score of 10.
  • FIG. 1 depicts a system architecture of an embodiment showing the connectivity between the various components with the transmittal of video streams by the endpoint monitor to a cloud server comprising processing components for annotating and adding value to the video stream and streaming of the annotated video stream to client applications.
  • the Video Acquisition Process 22 is configured to receive the plurality of video streams from End Point Monitor 16.
  • the Video Acquisition Process 22 is a software process configured to capture the incoming video streams. Upon receiving the streams, Video Acquisition Process 22 forwards them over to the process for Video Analysis and Annotation 26.
  • the Video Analysis and Annotation 26 is a process configured to perform an analysis of the frames of a video stream and generate a predefined set of annotations.
  • the Video Analysis and Annotation 26 results in creating the annotations where the annotations are specific to the problem being solved such as whether the images depict people wearing face masks or not.
  • the annotated video streams are subsequently conveyed to Annotated Video Steaming Process 28.
  • the Annotated Video Steaming Process 28 is a process configured to make the value added streams available to client applications.
  • An embodiment uses a deep neural network based system comprising an optical sensor configured to capture information where the optical sensor is in communication with a network interface; the network interface configured to receive the information that is captured and transmit said information to a server; the server configured to execute a deep learning neural network based computer implemented method to perform an analysis of said information to detect a presence of a plurality of monitored conditions, and label said information with the presence that is detected of the plurality of monitored conditions.
  • the labeling of said information comprises adding a visual artifact, or adding an audio artifact, or adding both the visual artifact and the audio artifact to the captured information.
  • the end clients use Web Client 30 and Mobile Client 32 and obtain a value-added video stream from the Annotated Video Steaming Process 28.
  • the Video Analysis and Annotation 26 process receives frames and processes them using a high performance computing server and upon completing the processing merges the annotation with the existing frame.
  • the annotation on a stream may somewhat lag in phase behind the video stream being disseminated to the client application. It will be appreciated by one skilled in the art that with the high performance capability of the computing server being utilized, this delay will be minimized and any lag in phase will be imperceptible to the consumer of the value-added annotated video stream.
  • FIG. 2 shows the internal processing steps leading to the annotation of a video stream, the steps comprising splitting the video stream into its frames, applying a plurality of pretrained models to annotate individual frames, and streaming the frames and the associated annotations to the client application.
  • the Video Splitting Process 34 is configured to receive a video feed from Video Acquisition Process 22.
  • the Video Splitting Process 34 is a process to split an incoming video stream into individual frames or a small group of frames for the purpose of analysis.
  • the Predefined Trained Model 36 is a predefined computational model that, given input data, such as an image frame in an embodiment, can cause the production of either a discrete value or a classification value, where the discrete or classification value is a function of the input data.
  • Predefined Trained Model 36 would be configured to perform a function that detects the presence of a face mask in the output of Video Splitting Process 34.
  • Predefined Trained Model 36 would be configured to detect the distance between objects and individuals in the output of Video Splitting Process 34.
  • the Video Splitting Process 34 generates input data as frames for applying a Predefined Trained Model 36 through the process Apply Model 42, which is a process that takes the frame from Video Splitting Process 34, performs the function of the computational model provided by the Predefined Trained Model 36, and causes the production of a discrete or classification value.
  • the output of Apply Model 42 is then merged with the output of Video Splitting Process 34 by the process Overlay Classification on Video 38.
  • Overlay Classification on Video 38 is a process that takes the classification or discrete value provided by the Apply Model 42 process and overlays a visual representation of the classification or discrete value over the video, where the visual representation stays overlaid for a predetermined number of frames or time duration.
  • Overlay Classification on Video 38 can provide annotation dependent on the output of Apply Model 42.
  • Annotated Video Steaming Process 28 receives the output of Overlay Classification on Video 38.
  • Annotated Video Steaming Process 28 is a process configured to make the value added streams available to client applications by communicating this stream to client applications, such as a plurality of Web Clients 30 and Mobile Clients 32.
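  • As a non-limiting illustration, the split / apply-model / overlay pipeline described above can be sketched in Python with OpenCV as below; the stream URL and the placeholder apply_model function are hypothetical stand-ins for Predefined Trained Model 36.

```python
# Sketch of FIG. 2: split the incoming stream into frames, apply a trained
# model to each frame, and overlay the resulting classification on the frame.
import cv2

def apply_model(frame):
    """Placeholder for Predefined Trained Model 36; returns a label string."""
    return "mask detected"  # hypothetical classification value

cap = cv2.VideoCapture("rtsp://endpoint-monitor/stream")  # hypothetical source
while True:
    ok, frame = cap.read()          # Video Splitting Process 34: one frame at a time
    if not ok:
        break
    label = apply_model(frame)      # Apply Model 42
    # Overlay Classification on Video 38: draw the label onto the frame.
    cv2.putText(frame, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                1.0, (0, 255, 0), 2)
    # The annotated frame would now be handed to the streaming process (28).
```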
  • the term “client application” will be used to refer to any of the plurality of applications designed to access the restaurants and observe their respective value-added video streams.
  • the client applications include but are not limited to Web Client 30 and Mobile Client 32.
  • a skilled artisan will be able to envision other dedicated applications and appliances for accessing the Cloud Server 20, search for restaurants or other establishments and observe their corresponding value added video feeds delivered through the Annotated Video Steaming Process 28.
  • a client application will utilize at least a display surface, such as a screen, as an output device for rendering the value-added video stream. It will further utilize a processor, a memory, a network interface, and input devices to enable the selection of an establishment, viewing of the stream, composing a survey, and reporting concerning behavior.
  • the client application is a software process executing on any general-purpose computing device.
  • FIG. 3 depicts the flow of processing steps performed on the cloud server for detecting faces and masks on a video frame. This process is being continually performed on the Cloud Server 20, specifically the Apply Model 42 process on the Cloud Server 20, on all the video streams received by the Video Acquisition Process 22.
  • Apply Model 42 is applying a Predefined Trained Model 36 to detect the presence of a face mask on individuals located in a frame.
  • Preprocess Model Input Frame 76 receives input from Video Splitting Process 34.
  • Preprocess Model Input Frame 76 is a component to pre-process the output of Video Splitting Process 34, that will render the frame as an acceptable input to the model application processes within Apply Model 42.
  • Preprocess Model Input Frame 76 would convert the image from a color image to a black-and-white image.
  • Preprocess Model Input Frame 76 would reduce the size of the image, which enhances the ability of specific AI algorithms to process the image.
  • the output of Preprocess Model Input Frame 76 is then applied to Apply Face Detector 78.
  • Apply Face Detector 78 is a component that will apply a face detector from Predefined Trained Model 36, to the output of Preprocess Model Input Frame 76.
  • the Predefined Trained Model 36 that is used to Apply Face Detector 78 would be a Convolutional Neural Network that is trained on a dataset of open-source images.
  • the Predefined Trained Model 36 that is used to Apply Face Detector 78 would be a Convolutional Neural Network that is trained on images from Video Splitting Process 34 that have been annotated by a practitioner in the art.
  • the Predefined Trained Model 36 that is used to Apply Face Detector 78 would be a model that has been pretrained to detect faces.
  • the output of Apply Face Detector 78 is the location of the face or faces detected in the output of Preprocess Model Input Frame 76.
  • Apply Mask Detector 80 detects if a mask is present given the location of face.
  • Apply Mask Detector 80 is a component that will apply a mask detector from Predefined Trained Model 36, to the output of Apply Face Detector 78.
  • the Predefined Trained Model 36 that is used to Apply Mask Detector 80 would be a Convolutional Neural Network that is trained on a dataset of open-source images.
  • the Predefined Trained Model 36 that is used to Apply Mask Detector 80 would be a Convolutional Neural Network that is trained on images from Video Splitting Process 34 that have been annotated by a practitioner in the art.
  • the Predefined Trained Model 36 that is used to Apply Mask Detector 80 would be a Convolutional Neural Network that has been pretrained to detect masks.
  • Model Output 82 receives the output from Apply Mask Detector 80.
  • Model Output 82 is a component that will output a binary classification or a discrete value with the results of the application of Predefined Trained Model 36.
  • When Predefined Trained Model 36 is detecting the presence of a face mask, Model Output 82 will output a binary value as to whether a person in the frame is wearing a face mask.
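  • A minimal Python sketch of the FIG. 3 flow is shown below; it assumes face_detector is an OpenCV cascade classifier and mask_model is a two-output Keras classifier, and the class ordering of the mask model's output is an assumption of this sketch.

```python
# Per-frame inference sketch: preprocess, detect faces, then classify mask presence.
import cv2
import numpy as np

def annotate_frame(frame, face_detector, mask_model):
    # Preprocess Model Input Frame 76: grayscale conversion.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Apply Face Detector 78: returns (x, y, w, h) boxes for detected faces.
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    results = []
    for (x, y, w, h) in faces:
        crop = cv2.resize(frame[y:y + h, x:x + w], (255, 255))  # standardized crop
        # Apply Mask Detector 80: assumed two-way classifier (mask / no mask).
        probs = mask_model.predict(np.expand_dims(crop / 255.0, axis=0))[0]
        results.append(bool(probs[0] > probs[1]))  # Model Output 82: binary per face
    return results
```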
  • FIG. 4 shows an environmental diagram for an embodiment being used for monitoring a kitchen of a commercial restaurant and depicts the use of station and environment cameras, such as the fish-eye cameras shown, being in live communication with an end-point monitor that collects, processes, and transmits the video feeds to a server.
  • a plurality of Station Cameras 12 are installed at predetermined locations.
  • a Station Camera 12 is a camera installed for observing a full view at a station configured to monitor the activities in the close proximity of the station.
  • the location of each Station Camera 12 is configured to get a clear view of the personnel working at that specific station to ensure compliance with health or other hygiene related protocols being annotated by the system.
  • the Station Cameras 12 are configured to detect whether the personnel preparing food in a commercial kitchen are complying with the requirements of wearing masks while working at their station.
  • a Fish Eye Camera 14 is a camera installed on the ceiling or a similar location to monitor the area at an environmental level, such as the entire commercial kitchen facility.
  • a plurality of Fish Eye Cameras 14 can help establish the distance between each of the personnel working in the kitchen and use this information to annotate the video stream with a level of social distancing being observed by the personnel.
  • computer vision algorithms can be used to detect a face in a given image from Fish Eye Cameras 14.
  • the distance between multiple faces located in an image can be computed by taking the pixel differential between multiple faces and applying a general heuristic or scaling it using an object with known dimensions.
  • fixed bounding boxes are placed on the regions with the stream processing ensuring that the personnel working in the kitchen remain within the confines of the bounding boxes and triggering a non-compliance when personnel step outside of the bounding box for a period greater than a predefined threshold.
  • the focal length of a plurality of cameras is used to compute the global coordinates of each of the personnel, and the Euclidean distance between each location is used to ensure observance of social distancing.
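  • A minimal Python sketch of the pixel-differential approach is given below; the pixels_per_meter scale factor, derived for example from an object of known dimensions in the camera view, and the 2 meter threshold are assumptions of this sketch.

```python
# Estimate pairwise distances between detected people from their face boxes.
import math

def pairwise_distances_m(face_boxes, pixels_per_meter):
    # face_boxes: list of (x, y, w, h) boxes produced by the face detector.
    centers = [(x + w / 2.0, y + h / 2.0) for (x, y, w, h) in face_boxes]
    distances = []
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            dx = centers[i][0] - centers[j][0]
            dy = centers[i][1] - centers[j][1]
            # Pixel differential scaled to meters by the calibration factor.
            distances.append(math.hypot(dx, dy) / pixels_per_meter)
    return distances

# Example: flag a social-distancing concern when any pair is closer than 2 meters.
# violations = [d for d in pairwise_distances_m(faces, ppm) if d < 2.0]
```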
  • the information from the plurality of Station Cameras 12 and the plurality of Fish Eye Cameras 14 is processed by the End Point Monitor 16.
  • the End Point Monitor 16 is a system designed to collect and preprocess the feeds from each of the plurality of Station Cameras 12 and each of the plurality of Fish Eye Cameras 14 and transmit the consolidated feed over the network to a server for further processing.
  • the End Point Monitor 16 communicates the video streams from the plurality of cameras (from both the plurality of Station Cameras 12 and the plurality of Fish Eye Cameras 14) to the Cloud Server 20.
  • the Cloud Server 20 is a high throughput performance computing server with integrated database and processing for streaming in live videos, running processes for annotating video streams, and making the value added streams available for client devices.
  • End Point Monitor 16 communicates the physical locations and information including but not limited to focal length, aperture setting, geolocation, contrast, and enhancement settings, about the cameras to the Cloud Server 20.
  • the cameras also communicate shutter speeds of any still photographs taken to the End Point Monitor 16 which in turn communicates this information to the Cloud Server 20.
  • FIG. 5 shows an environmental view with a field camera being used for monitoring a delivery vehicle with delivery personnel and packages where the field camera is connected to a portable end-point monitor that collects, processes, and transmits the video feed from the delivery vehicle over the cloud.
  • a Field Camera 11 is located inside of a delivery vehicle and is recording the delivery personnel.
  • Field Camera 11 is a camera installed for observing the activities inside of the delivery vehicle.
  • the location of a Field Camera 11 is configured to get a clear view of the delivery personnel to ensure compliance with health or other hygiene related protocols being annotated by the system.
  • In an embodiment the Field Camera 11 is configured to detect whether the personnel delivering food in a delivery vehicle are complying with the requirements of wearing masks.
  • the information from the Field Camera 11 is processed by Field Endpoint 13.
  • the Field Endpoint 13 is a system designed to collect and preprocess the feeds from each of the plurality of Field Cameras 11 and transmit the consolidated feed over the network to a server for further processing.
  • the Field Endpoint 13 communicates the video streams from the Field Camera 11 to the Cloud Server 20 over a wireless data network provisioned for the use by the Field Endpoint 13.
  • An embodiment has the Field Endpoint 13 communicating with End Point Monitor 16 over the wireless data network provisioned for the use by the Field Endpoint 13 where the End Point Monitor 16 consolidates video feeds from the Station Cameras 12, Fish Eye Cameras 14 and Field Cameras 11, and uploads the consolidated stream to Cloud Server 20.
  • An embodiment uses a deep neural network based system comprising a plurality of cameras or optical sensors connected to an end-point monitor, where the plurality of cameras or optical sensors capture information and communicate the information to the end-point monitor; the end-point monitor collects the information that is captured to create a video stream and further communicates the video stream to a server, wherein the server is configured to execute a deep neural network based computer implemented method to perform an analysis of the video stream to detect a presence of a plurality of monitored conditions, and annotate the video stream with the presence that is detected of the plurality of monitored conditions to produce an annotated video stream.
  • An embodiment of the system uses a plurality of cameras or optical sensors configured to monitor an interior of a food preparation facility, wherein the computer implemented method is configured to annotate the video stream with the analysis of the video stream for monitored conditions including detecting a presence of a human face in the video stream, and upon detecting the presence of the human face, further detecting whether the human face includes a mask or a face covering.
  • In an embodiment of the system, the computer implemented method is configured to annotate the video stream with the analysis of the video stream for monitored conditions including an indication of the presence of a plurality of humans in the video stream, and upon the indication of the presence of the plurality of humans, further indicates whether said plurality of humans satisfy a predefined separation condition from each other.
  • FIG. 6 shows an environmental view where a portable phone including a processor, a camera, and a network interface serves as a device for capturing information, with the phone camera serving as an optical sensor and the processor executing software instructions to process and transmit the video to the server.
  • a Mobile Phone 15 is located inside of a delivery vehicle and is recording the delivery personnel.
  • Mobile Phone 15 is a portable cellular device which includes a camera and is executing an application that combines the functionality of Field Camera 11 and Field Endpoint 13.
  • the location of Mobile Phone 15 is configured to get a clear view of the delivery personnel to ensure compliance with health or other hygiene related protocols being annotated by the system.
  • In an embodiment, the optical sensor and the network interface are configured to monitor conditions inside of a vehicle.
  • the deep learning neural network based computer implemented method is configured for detecting a presence of a human face from the information that is captured, and upon detecting the presence of the human face, further detecting whether the human face includes a mask or a face covering.
  • the Mobile Phone 15 is configured to detect whether the personnel delivering food in a delivery vehicle are complying with the requirements of wearing masks.
  • the Mobile Phone 15 records and communicates video streams to Cloud Server 20 over a wireless data network provisioned for the use by the Mobile Phone 15.
  • An embodiment has the Mobile Phone 15 communicating with End Point Monitor 16 over the wireless data network provisioned for the use by the Mobile Phone 15, where the End Point Monitor 16 consolidates video feeds from the Station Cameras 12, Fish Eye Cameras 14, and the Mobile Phone 15, and uploads the consolidated stream to Cloud Server 20.
  • the system is also configured to view the packages being delivered in plain view.
  • the video feeds for the Field Cameras 11 and Mobile Phone 15 are configured to keep the food packages within the field of view.
  • the food packages being in plain view serves as a deterrent since the delivery personnel actions are being recorded in plain view.
  • the consumers watching any of the video feeds, including the feeds from within the delivery vehicles can observe any non-compliant behavior and report it to the establishment. This further helps in achieving the application goal of maintaining transparency from kitchen to the consumer.
  • An embodiment of the video stream annotation system further including a mobile cellular device including a camera and a wireless networking interface adapted to communicate over a wireless network, with the camera on the mobile cellular device serving as an optical sensor, the wireless networking interface serving as the network interface, where information of the camera is configured to be received by the wireless networking interface and further transmitted to the server over the wireless network.
  • the server is in communication with a client system, and the server is further configured to communicate to the client system the said information and the labeling of said information with the presence of the plurality of monitored conditions.
  • the client system further communicates the presence of the monitored conditions to a plurality of subscribers and further alerts the subscribers about the presence of a predefined set of monitored conditions.
  • FIG. 7 shows the processing steps performed by the client application allowing for a selection of the value-added video stream, viewing of the selected stream, and reporting any concerning behavior.
  • the genesis of the process begins with a Login 44 step wherein an authentication of the client application is performed with the Cloud Server 20 to establish the privilege level of the client application in continuing with further processing.
  • the Login 44 step is further performed by Authenticate Login 46, which is a subordinate process supporting the Login 44 authentication with the help of local authentication or biometric keys, alleviating the need to authenticate by communicating with Cloud Server 20.
  • Select Restaurant 48 represents a selection step configured to enable a selection of one of the many value-added video streams provided by the Cloud Server 20.
  • Select Restaurant 48 is configured to enable selection of a restaurant with options to filter by geolocation or other location specifiers of the restaurant, options to filter by the name, or a part of the name, of the restaurant, filtering by the type of cuisine served by the restaurant, filtering by the Cloud Server 20 assigned score of the restaurant, or filtering by the hours of operation of the restaurant.
  • the client application proceeds to the next step of Display Value Added Stream 50.
  • Display Value Added Stream 50 is a step of the client application that connects to the Annotated Video Steaming Process 28 on the Cloud Server 20 to obtain and display the value-added video stream for the restaurant selected in the Select Restaurant 48 step. While Display Value Added Stream 50 is ongoing, the client application further offers the capability to report concerning behavior with the step of Report Behavior 58, which is a step in the client application configured to capture input and attachments and send a report to the restaurant. The idea here is that upon observing concerning behavior on the value-added video feed, a report with attachments of the frames depicting the concerning behavior, together with any further text-based input gathered, should be sent to the restaurant. In an embodiment, concerning behavior messages and attachments will be delivered to the restaurant by the client application using an electronic messaging system.
  • Report Behavior 58 can be generated by rendering a one-time prompt on the client-device. In another embodiment of the invention, Report Behavior 58 can be generated by rendering a button which would continuously prompt feedback from the Mobile Client 32 or the Web Client 30.
  • Report to Restaurant 60 is a step on the client application that enables sending a report of concerning behavior to the restaurant.
  • Report to Restaurant 60 will render a prompt on the client application confirming that a report would be sent to the establishment regarding the incident.
  • the client application will provide data input fields where detailed information regarding the incident could be entered.
  • the client application is further adapted to enable the attaching of one or more frames thereto by its Attach Concerning Behavior Frame 62 step.
  • Attach Concerning Behavior Frame 62 is a step that provides a capability of the client application to attach a single or plurality of frames of the video to the report of concerning behavior to further substantiate and provide additional information with the report being sent to the restaurant.
  • In an embodiment, the business would receive the message from Attach Concerning Behavior Frame 62 in the form of an email.
  • the business would receive the message from Attach Concerning Behavior Frame 62 in the form of a message on the platform directly.
  • Exit Stream 64 is a step in which the client application stops the feed being received from the Annotated Video Steaming Process 28.
  • the client application proceeds to the Survey 66 step.
  • Survey 66 is a step of conducting a survey through a series of questions related to the stream.
  • the Survey 66 would elicit feedback from the operator of the client application regarding the contents of Display Value Added Stream 50 regarding adherence to predefined protocols understood to enhance safety.
  • Send to Business 68 is a step that sends the data collected in Survey 66 to the restaurant. Additionally, upon submission of Survey 66, the client application also performs Send to Provider 70, a step that sends the data collected in Survey 66 to the Cloud Server 20 and to the value-added stream service provider.
  • the cumulative data collected from all instances of Survey 66 triggering Send to Provider 70 allows the value added stream provider to perform Update Restaurant Score 72, a computation to update a score for each establishment on the platform given the data from Survey 66. The result of Update Restaurant Score 72 then becomes available as a search criterion for the Select Restaurant 48 step in the client application.
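  • As a non-limiting illustration, the Update Restaurant Score 72 computation might be sketched as below; the 0 to 10 rating scale and the simple averaging rule are assumptions of this sketch.

```python
def update_restaurant_score(survey_responses):
    # survey_responses: list of dicts mapping survey question -> rating (0-10).
    if not survey_responses:
        return None
    per_survey = [sum(answers.values()) / len(answers) for answers in survey_responses]
    # Average across all surveys received for the establishment.
    return round(sum(per_survey) / len(per_survey), 1)

# Example usage:
# score = update_restaurant_score([{"masks": 9, "distancing": 8},
#                                  {"masks": 7, "distancing": 7}])
```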
  • FIG. 8(A) depicts the block diagrams for the computer vision model used for detecting the bounding box over a human face.
  • the resultant output of Video Splitting Process 34 is used to Apply Model 42 by Cloud Server 20.
  • FIG. 8(A) depicts the embodiment of the invention where Apply Model 42 is used to detect a face in the output of Video Splitting Process 34.
  • Preprocess Model Input Frame 76 is a process to convert the output of Video Splitting Process 34 to grayscale.
  • Apply Model 42 next processes the output of Preprocess Model Input Frame 76, which in the embodiment shown converts a colored image to a grayscale image, through Apply Face Detector 78, which is a component that will apply a face detector from Predefined Trained Model 36 to the output of Preprocess Model Input Frame 76.
  • Apply Face Detector 78 produces the Face Detector Output 84 as its output, which delineates the location of faces detected in a video frame.
  • FIG. 8(B) shows the flow chart and the software procedure corresponding to the implementation of the face detection, producing a bounding box around the face to set up the next stage of detecting a mask on the face.
  • An embodiment is implemented utilizing the Open Source Computer Vision Library (opencv.org) invoked from a Python program. Python is an interpreted programming language commonly used for machine learning applications. Additional information can be found on Python’s official website, www.python.org. The specific libraries used in building the models are included in the attached sequence listing and incorporated herein by reference.
  • An embodiment of the flow chart shown uses functions from the OpenCV library for receiving the output of Video Splitting Process 34 in FIG. 8(B) and uses, as the Predefined Trained Model 36, OpenCV’s prebuilt face detection model with Haar-cascade preprocessing for detecting faces.
  • This model is incorporated herein by reference.
  • An embodiment uses an OpenCV method to convert the image to grayscale within the flowchart block Preprocess Model Input Frame 76.
  • an embodiment applies Predefined Trained Model 36 - utilizing Haar-cascade model of OpenCV applied by the Apply Face Detector 78 block - to search for faces in the processed frame produced by Preprocess Model Input Frame 76.
  • the output of the Apply Face Detector 78 is the Face Detector Output 84 which is a set of coordinates corresponding to the location of the face or faces in the image analyzed by Apply Face Detector 78.
  • the Haar-cascade model being used in an embodiment can be found in OpenCV’s official repository at https://github.com/opencv/opencv/blob/master/data/haarcascades/haarcascade_frontalface_default.xml. This model is incorporated herein by reference.
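  • A minimal Python sketch of this face detection step is shown below; it uses the frontal-face Haar cascade that ships with the opencv-python package, and the input filename and detectMultiScale parameters are illustrative assumptions.

```python
import cv2

# Load OpenCV's prebuilt frontal-face Haar cascade bundled with opencv-python.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_detector = cv2.CascadeClassifier(cascade_path)

frame = cv2.imread("frame.jpg")                  # a frame from the splitting process
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # Preprocess Model Input Frame 76
faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
# Face Detector Output 84: one (x, y, w, h) bounding box per detected face.
for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```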
  • FIG. 9 shows the architecture of a Convolutional Deep Neural Network utilizing a plurality of filters and sampling layers followed by a plurality of dense layers of neurons, ultimately followed by a single output layer generating the determination of the monitored condition.
  • Each of the Filter Layer 91 performs a plurality of preconfigured filtering operations where each is adapted to capture a specific image property.
  • the Filter Layer 91 is followed by Pooling Layer 92 which performs an aggregation of the filtered image layers to create coarser image layers which in turn capture occurrence of higher level features.
  • An embodiment uses two instances of Filter Layer 91, the two instances using 64 and 128 filters respectively, each filter using a 5 by 5 mask for convolving with the input.
  • the set of filters may be chosen a priori, including blurring filters, edge detection filters, or sharpening filters;
  • however, the actual weights of the filters used in a convolutional neural network are not defined a priori but are learned during the training process. In this way a deep convolutional neural network can find features in the image that pre-configured filters may not find.
  • Each Filter Layer 91 is followed by a Pooling Layer 92.
  • the Pooling Layer 92 simply replaces a 4x4 neighborhood in an image with the maximum value. Since the pooling is done with tiling, the output of the Pooling Layer 92 results in cutting the input image size by half. Thus, the image after two instances of Pooling Layer 92 is reduced to one quarter of the original size.
  • the size of input image of a human face which was standardized to 255 by 255 will be reduced to 253 by 253 after the first instance of Filter Layer 91, further gets reduced to 126 by 126 after the first instance of Pooling Layer 92, which gets reduced to 124 by 124 after the second instance of Filter Layer 91, which ultimately is reduced to 62 by 62 by the second instance of Pooling Layer 92.
  • This 62 by 62 image is next flattened by Flatten Layer 96 which essentially reorganizes the two or higher dimensional data set, such as an image, to a single dimension vector making it possible to be fed to a dense neural network layer.
  • the feed forward process works just like it would in a multi-layer perceptron with the additional characteristic that a deep neural network will involve a plurality of hidden layers.
  • Each of these layers is an instance of Dense Layer 93, which comprises neurons that are fully connected to all neurons in the previous layer, with a weight associated with each of these connections.
  • the last dense layer is often labelled the Output Layer 94, it being the final layer of neurons, which is followed by a function such as the sigmoid or a softmax function for classification.
  • The embodiment shown has three neurons in the Output Layer 94 and would therefore utilize a softmax function to convert the output of the three neurons into a probability density function using the formula below, where the probability of the i-th output is calculated given that the neurons in the Output Layer 94 produced outputs O_1, O_2, ..., O_n:

    P(i) = e^(O_i) / (e^(O_1) + e^(O_2) + ... + e^(O_n))
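  • A minimal Keras sketch of the FIG. 9 architecture is given below; the 255 by 255 grayscale input follows the standardized size mentioned above, while the dense-layer width, activations, and training configuration are assumptions of this sketch.

```python
# Convolutional network with two filter/pooling stages (64 and 128 filters,
# 5x5 masks), a flatten layer, a dense layer, and a three-neuron softmax output.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(255, 255, 1)),              # standardized grayscale input
    layers.Conv2D(64, (5, 5), activation="relu"),   # Filter Layer 91, first instance
    layers.MaxPooling2D(pool_size=(2, 2)),          # Pooling Layer 92
    layers.Conv2D(128, (5, 5), activation="relu"),  # Filter Layer 91, second instance
    layers.MaxPooling2D(pool_size=(2, 2)),          # Pooling Layer 92
    layers.Flatten(),                               # Flatten Layer 96
    layers.Dense(128, activation="relu"),           # Dense Layer 93 (width assumed)
    layers.Dense(3, activation="softmax"),          # Output Layer 94 with softmax
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```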
  • An embodiment of the system is used for detecting a plurality of conditions, where said computer implemented method further comprises a plurality of deep neural networks, each for detecting one or more of the plurality of conditions, and each of the deep neural networks comprising a convolutional neural network including a plurality of filtering layers where each of the filtering layers has an associated set of filtering layer parameters, a plurality of pooling layers where each of the pooling layers has an associated set of pooling layer parameters, a plurality of dense layers where each of the dense layers has an associated set of dense layer parameters, and an output layer where the output layer has an associated set of output layer parameters, and where the convolutional neural network is configured to detect, and the output layer is configured to report, a status of one or more of the predefined monitored conditions.
  • a base set of deep neural network models obtained from a source like ImageNet is utilized and further fine-tuned, where an inductive tuning of the deep neural network is performed by training the deep neural network on a plurality of examples of the predefined monitored condition, where each example is used for updating the filtering parameters, the pooling parameters, the dense layer parameters, and the output layer parameters; and the inductive tuning of the deep neural network is configured to cause the deep neural network to recognize the presence of the predefined monitored condition with improved accuracy.
  • FIGS. 10 (A) and (B) show the architecture of a Long Short Term Memory or LSTM Deep Neural Network, which offers the advantage of using autoregressive memory and captures dependencies between consecutive video frame sequences which are modeled as a time series where the monitored conditions will also be determined by the condition’s value in prior frames.
  • FIG. 10 (A) shows the architecture of a series of LSTM cells that are configured to examine each frame and provide an output to a fully connected hidden layer of neurons that is then fed to an output softmax function in the shown embodiment
  • FIG. 10 (B) shows the inner architecture of each of the LSTM cells used in the autoregressive chain shown above.
  • This architecture of deep neural network is configured to capture changes in subsequent images which may not be readily perceptible.
  • a deep neural network for learning time dependent variability in the image sequences is utilized in an embodiment.
  • the LSTM belongs to the class of Recurrent Neural Networks which retain the state of learning from one time step to the next, and the results from the previous frame influence the interpretation of the subsequent frame. In this manner the entire “video sequence” is used for detection of a monitored condition, where the deep learning neural network utilizes an autoregression of a time series of said captured information, where said autoregression is performed using a computer implemented method utilizing a deep recurrent neural network based on a Long Short Term Memory (LSTM) architecture.
  • a sequence of frames is fed into an LSTM deep neural network, where each LSTM Cell 95 is a component of the LSTM chain that also includes a memory and uses a neural network to combine the memory with the image frame and the previous state to produce an output, which is then fed to the next LSTM cell and to a dense layer.
  • a five second sequence will typically comprise anywhere between 50 and 150 frames, corresponding to frame capture rates of 10 fps to 30 fps respectively.
  • Embodiments using 100 LSTM Cells 95 offer a sufficiently long span of time for detection of monitored conditions.
  • Each of the LSTM Cells 95 is fed a portion of the video frame comprising a detected region of interest.
  • the region of interest is the face of any humans in the video frame which is detected by Face Detector Output 84.
  • the subsection detected by Face Detector Output 84 from the video frame is standardized to a size of 255 by 255 and then passed through an embedding process which converts the 2D image data into a single dimensional vector which is then supplied as an input to LSTM Cells 95.
  • the output of the chain of LSTM Cells 95 is fed to a plurality of neurons of a Dense Layer 93.
  • the embodiment shown uses a single Dense Layer 93 which is also the Output Layer 94.
  • Other embodiments use multiple instances of Dense Layer 93.
  • the output of the series of LSTM Cells 95 comprises a flattened set of inputs that are fed into a multi-layer perceptron having a plurality of hidden layers.
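  • A minimal Keras sketch of the FIG. 10 arrangement is shown below; the sequence length of 100 frames follows the text, while the embedding dimension, LSTM unit count, and two-class output are assumptions of this sketch.

```python
# Sequence model: one embedded face crop per frame feeds a chain of LSTM cells,
# whose final output feeds a dense layer and a softmax output.
from tensorflow.keras import layers, models

SEQUENCE_LENGTH = 100   # roughly 5-10 seconds of video at 10-30 fps
EMBEDDING_DIM = 256     # length of the per-frame embedding vector (assumed)

model = models.Sequential([
    layers.Input(shape=(SEQUENCE_LENGTH, EMBEDDING_DIM)),
    layers.LSTM(128),                       # LSTM Cells 95 (unit count assumed)
    layers.Dense(64, activation="relu"),    # Dense Layer 93
    layers.Dense(2, activation="softmax"),  # Output Layer 94: condition present/absent
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```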
  • FIG. 11 depicts the architecture of a deep neural network for detecting the presence of a mask within the bounding box of the image containing a human face.
  • a Deep Neural Network (DNN) is used to classify if a mask is present or absent within a given bounding box containing a human face.
  • FIG. 11 depicts an embodiment of the invention in which Predefined Trained Model 36 classifies the presence or absence of a mask within a given bounding box containing a human face.
  • the input into this DNN is Face Detector Output 84.
  • the DNN consists of a base model, Mobile Net V2 86.
  • the Mobile Net V2 model used in an embodiment is included in the attached sequence listing and is incorporated herein by reference.
  • Mobile Net V2 86 is a general and stable open-source model architecture used for multiple applications in computer vision use cases. Additional information is available at https://www.tensorflow.org/api_docs/python/tf/keras/applications/MobileNetV2. The model information and parameters are incorporated herein by reference. Face Detector Output 84 is standardized to an image size of 255 by 255 pixels before it is fed to the Mobile Net V2 86.
  • Head Model 88 is the portion of the DNN architecture that is added to the underlying model to accomplish the specialized training of the base model as will be appreciated by a practitioner in the art.
  • Head Model 88 consists of three layers. The first layer flattens the output of Mobile Net V2 86. The second layer is a layer of 128 dense neurons. The third layer is a layer of 2 dense neurons. The two neurons in this layer correspond to the binary output.
  • the output of the DNN is Mask Detector Output 90.
  • Mask Detector Output 90 is the output of Apply Mask Detector 80 which is a binary classification of the presence of a mask within a given bounding box containing a human face.
  • the DNN that is used to generate the model for Apply Mask Detector 80 would be trained on a dataset of open-source images. In another embodiment of the invention, the DNN that is used to generate the model for Apply Mask Detector 80 would be trained on images from Video Splitting Process 34 that have been annotated by a practitioner in the art.
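  • A minimal Keras sketch of the FIG. 11 detector is given below; the head structure (flatten, 128 dense neurons, 2 dense neurons) follows the text, while the 224 by 224 input size is MobileNetV2's default rather than the 255 by 255 standardization mentioned above, and freezing the base during training is an assumption of this sketch.

```python
# Mask / no-mask classifier: MobileNetV2 base with a small custom head.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

base = MobileNetV2(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # keep the pretrained base fixed

mask_detector = models.Sequential([
    base,                                   # Mobile Net V2 86 base model
    layers.Flatten(),                       # Head Model 88, layer 1
    layers.Dense(128, activation="relu"),   # Head Model 88, layer 2: 128 dense neurons
    layers.Dense(2, activation="softmax"),  # Head Model 88, layer 3: binary output
])
mask_detector.compile(optimizer="adam", loss="categorical_crossentropy",
                      metrics=["accuracy"])
```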
  • FIG. 12 shows a swim lane diagram of the three concurrent activities in progress, i.e. collection of video feeds from the end-points, processing of the video feeds to overlay the value-added annotations, and the receiving of and commenting on the value-added feeds by the client application user.
  • a client application facilitates Login 44, Authenticate Login 46, and Select Restaurant 48 processes.
  • the client device selects Display Value Added Stream 50, which establishes a connection between the client application and the Cloud Server 20.
  • Shown in FIG. 12 is an embodiment where the streaming process commences with End Point Monitor 16 capturing the video feed in Acquire Video 52.
  • Acquire Video 52 is a subsystem in which the End Point Monitor 16 acquires video feed from a Station Camera 12 or a Fish Eye Camera 14.
  • the End Point Monitor 16 then connects to the cloud in Connect to Cloud 54.
  • Connect to Cloud 54 is a process to connect the End Point Monitor 16 to the Cloud Server 20.
  • Connect to Cloud 54 feeds into Stream to Cloud 56.
  • Stream to Cloud 56 is a process to transmit video streams from the End Point Monitor 16 to the Cloud Server 20.
  • Stream to Cloud 56 connects to the Cloud Server 20, specifically to Video Acquisition Process 22.
  • Video Acquisition Process 22 inputs into Video Splitting Process 34.
  • the output of Video Splitting Process 34 is the input to Apply Model 42.
  • the output of Apply Model 42 is the input to Annotated Video Streaming Process 28. This process is further detailed in FIG. 3.
  • the output of Annotated Video Streaming Process 28 is made available to be displayed on a Web Client 30 or Mobile Client 32. A simplified sketch of this server-side flow follows.
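
A simplified sketch of the server-side portion of the FIG. 12 flow, expressed as a plain frame loop. The function names here are hypothetical stand-ins; in the disclosure these are separate cloud processes rather than local calls.

```python
# Illustrative only: Video Acquisition Process 22 -> Video Splitting Process 34
# -> Apply Model 42 -> Annotated Video Streaming Process 28.
def process_stream(acquired_frames, apply_model, annotate, publish):
    """acquired_frames: frames received from Stream to Cloud 56."""
    for frame in acquired_frames:                # splitting the stream into frames
        detections = apply_model(frame)          # Apply Model 42
        annotated = annotate(frame, detections)  # overlay value-added labels
        publish(annotated)                       # stream to Web Client 30 / Mobile Client 32
```
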
  • FIG. 13 shows an activity diagram depicting the ability of the client application to sort the feeds by The Real Meal Score, or TRM Score, assigned by the provider to each of the value-added video feeds streamed from the cloud server.
  • the client application enables searching using a plurality of Search Criteria 49.
  • Each Search Criteria 49 is a criterion used for finding establishments of interest, including geographic location, establishment name, and the like.
  • the client application communicates the search criteria to the Cloud Server 20 and receives a list of restaurants that meet the Search Criteria 49.
  • TRM Score 71 is a score assigned by considering the answers to Survey 66 received by the client application indicative of a perception of health or other concerns pertaining to an establishment.
  • the client application further offers the ability to present the list of restaurants ordered by TRM Score 71.
  • the display of establishments is further facilitated by TRM Sort 74, a list of establishments sorted by TRM Score 71 in which any establishment falling below a predefined TRM Score 71 threshold is suppressed, as sketched below.
  • the server further assigns a score to each of the plurality of video streams where the score is a representation of the extent to which the video stream achieves a compliance status with all the plurality of monitored conditions, wherein the score is further annotated to the video stream.
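
A minimal sketch of the TRM Sort 74 behavior described above: establishments are ordered by TRM Score 71 and any establishment below a predefined threshold is suppressed. The Establishment structure and the threshold value are illustrative assumptions, not taken from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class Establishment:
    name: str
    trm_score: float  # 0 (no monitored conditions met) .. 10 (all met)

def trm_sort(establishments, threshold: float = 5.0):
    # Suppress establishments below the predefined TRM Score 71 threshold,
    # then present the remainder sorted by score, highest first.
    visible = [e for e in establishments if e.trm_score >= threshold]
    return sorted(visible, key=lambda e: e.trm_score, reverse=True)
```
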
  • FIG. 14 depicts the inclusion of a database on the server to manage a plurality of restaurant information.
  • This allows the Cloud Server 20 to use the Database 24 in managing information about the locations of cameras, the survey results, and other pertinent information about the establishments.
  • This database is accessed by the client application while running a Select Restaurant 48 step in response to the Search Criteria 49 specified in Search Box 47. Additionally, Cloud Server 20 uses information stored in the Database 24 to provide additional details to the client application related to health measures or other important attributes.
  • a system including a database wherein a unique identifier is further associated with said captured information, the server further stores said captured information and the label for said information in the database, and where the database is configured to make said captured information and the label for said information retrievable by the unique identifier.
  • FIG. 15(A) shows a GUI providing the search and display capabilities of the client application, allowing a user to search for a specific restaurant using a variety of criteria including location, name, TRM score, and the like. Shown here is a plurality of Establishment Pages 19, where each Establishment Page 19 corresponds to one of the many establishments that meet the selection criteria specified in the Search Box 47. As illustrated, the Search Box 47 is a search criteria input box allowing users to specify a plurality of Search Criteria 49.
  • the Establishment Page 19 is a component of the GUI that refers to the dedicated page of an establishment on the Real Meal Platform.
  • the client application also displays a TRM Score, or The Real Meal Score, computed by the Cloud Server 20 and associated with each Value Added Stream 18.
  • TRM Score 71 is computed based on the input received by the client application in response to a survey questionnaire, Survey 66.
  • the client application further provides the capability of presenting the search results in a sorted manner.
  • the server further assigns an identifier to the annotated video stream and saves the identifier to a database wherein the database is configured to search and retrieve the annotated video stream by the identifier; the server further accepts a request from client software wherein the request includes the identifier of the annotated video stream; and the server is configured to retrieve and deliver the annotated video stream to the client software. A minimal storage-and-retrieval sketch follows.
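
A minimal storage-and-retrieval sketch, assuming a relational store such as SQLite; the table layout, column names, and use of UUIDs are hypothetical, since the disclosure does not specify a schema for Database 24.

```python
import sqlite3
import uuid

db = sqlite3.connect("real_meal.db")
db.execute("""CREATE TABLE IF NOT EXISTS annotated_streams (
                  stream_id     TEXT PRIMARY KEY,
                  establishment TEXT,
                  stream_url    TEXT)""")

def save_annotated_stream(establishment: str, stream_url: str) -> str:
    # The server assigns an identifier to the annotated video stream and
    # saves it so the stream can later be retrieved by that identifier.
    stream_id = str(uuid.uuid4())
    db.execute("INSERT INTO annotated_streams VALUES (?, ?, ?)",
               (stream_id, establishment, stream_url))
    db.commit()
    return stream_id

def fetch_annotated_stream(stream_id: str):
    # Handles a client request that includes the identifier of the stream.
    row = db.execute("SELECT stream_url FROM annotated_streams WHERE stream_id = ?",
                     (stream_id,)).fetchone()
    return row[0] if row else None
```
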
  • FIG. 15(B) depicts the capability of the client application to drill down and view all the camera feeds provided by a specific restaurant, including the plurality of value-added video streams from the kitchen monitoring stations and fish eye camera, and value-added streams from the plurality of delivery vehicles. Shown here are a plurality of Value Added Streams 18, where each Value Added Stream 18 is one of many Value Added Streams 18 for a selected Establishment Page 19. A Value Added Stream 18 is a raw video stream obtained from the End Point Monitor 16 or the Field Endpoint 13 that has been annotated with additional information by overlaying informational items, including text and color codes, to convey a specific message to the recipient. In the embodiment of the invention shown, there are four Value Added Streams 18 originating from End Point Monitor 16 and two Value Added Streams 18 originating from a Field Endpoint 13.
  • an embodiment has the Field Endpoint 13 send the video feed to the End Point Monitor 16, which sends a consolidated feed comprising all feeds to the Cloud Server 20.
  • the feeds from the Field Endpoint 13 are generated by a Mobile Phone 15 mounted inside a delivery vehicle, configured to monitor that delivery personnel comply with the requirement of wearing a face mask, for example, or that food packets are visible in plain view.
  • FIG. 16 depicts a Graphical User Interface for viewing a specific value-added video stream on the client application and the ability to send an electronic message with attached frames depicting the concerning behavior.
  • Illustrated herein is a Snap Frame 63 input on the client application.
  • the Snap Frame 63 is an input on the client application adapted to allow the instantaneous snapping of the frame being displayed in the Value Added Stream 18 section of the application.
  • the client application enables the sending of a Report Behavior 58 message by attaching a Snap Frame 63 as evidence to the message and sending an electronic message via Report to Restaurant 60.
  • the Snap Frame 63 also allows the user to post the clip with their own message to one of the social media platforms.
  • FIG. 17 shows a component and packaging diagram of an embodiment of the system. This figure depicts the various components used for the implementation of the disclosed system. As illustrated the Station Camera 12 and Fish Eye Camera 14 are in communication with End Point Monitor 16 which in turn streams the video to the Cloud Server 20.
  • the Video Acquisition Process 22 manages all the streams, correlates the streams with the establishments, and conveys the streams for analysis to Video Analysis and Annotation 26, which has a plurality of processes for splitting the video, analyzing it using a deep convolutional neural network, and adding the result of the analysis as a value-added annotation on the respective video streams.
  • the annotated video streams, or Value Added Streams 18, are then streamed to the Web Client 30 or Mobile Client 32 when they request a specific stream satisfying their search of the Database 24 for further information on specific establishments. An illustrative sketch of the annotation overlay follows.
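
An illustrative sketch of the annotation overlay, in the spirit of FIG. 18: informational items (a color-coded box and a short label) are drawn onto the frame for each detected face. It assumes OpenCV (opencv-python) and a box format produced by the mask detector; neither is specified in the disclosure.

```python
import cv2

def annotate_frame(frame, boxes):
    """boxes: iterable of (x, y, w, h, mask_present) for each detected face."""
    for x, y, w, h, mask_present in boxes:
        color = (0, 200, 0) if mask_present else (0, 0, 255)  # green / red (BGR)
        label = "mask detected" if mask_present else "no mask"
        cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
        cv2.putText(frame, label, (x, y - 8),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)
    return frame
```
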
  • FIG. 18 shows examples of video streams that have been annotated.
  • FIG. 18(A) shows an example of a frame where five faces are recognized, none of which are seen wearing masks, receiving a TRM score of 0, and
  • FIG. 18(B) shows an example where two faces are recognized and both faces are annotated with a checkmark indicating that masks were detected, receiving a score of 10.
  • the TRM, or The Real Meal Score, is assigned based on the number of monitored conditions that are met. In an embodiment, when none of the monitored conditions are met the score assigned is zero, and a maximum score of 10 is assigned when all the monitored conditions are met. A minimal sketch of this scoring rule follows.
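
A minimal sketch of the scoring rule stated above. The text fixes only the endpoints (0 when no monitored conditions are met, 10 when all are met); linear scaling for intermediate cases, and treating each detected face's mask status as a monitored condition in the example, are assumptions.

```python
def trm_score(conditions_met: int, conditions_total: int) -> float:
    if conditions_total == 0:
        return 0.0
    return round(10.0 * conditions_met / conditions_total, 1)

# FIG. 18(A): no faces masked -> score 0; FIG. 18(B): all faces masked -> score 10
assert trm_score(0, 5) == 0.0
assert trm_score(2, 2) == 10.0
```
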
  • the server further assigns a score to each of the plurality of video streams where the score is a representation of the extent to which the video stream achieves a compliance status with all of the plurality of monitored conditions, wherein the score is further annotated to the video stream.
  • the client applications are further configured to make clips of the value-added video streams and forward these clips as emails or upload them to social media sites.
  • Another aspect of the system and method disclosed is its ability to assign a score indicative of the extent to which a video stream complies with a set of predefined conditions.
  • An embodiment is a process of using a deep neural network comprising: having a plurality of cameras or optical sensors connected to an end-point aggregator, where the plurality of cameras or optical sensors capture information and communicate the information to the end-point aggregator; having the end-point aggregator collect the captured information to create a video stream and further communicate the video stream to a server; having the server execute a deep neural network based computer-implemented method to perform an analysis of the video stream, wherein the analysis is configured to detect a presence of a plurality of monitored conditions; and having the server annotate the video stream with the detected presence of the plurality of monitored conditions to produce an annotated video stream.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Closed-Circuit Television Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a system and method for analyzing frames of a video stream. The analysis is performed using a computer-implemented deep neural network method that produces one or more predefined labels based on the presence or absence of specific elements in the video stream. The computer-generated labels are added to the video stream to create an annotated or "value-added" video stream. In certain embodiments, the value-added video streams originating from cameras or optical sensors in commercial kitchens or delivery vehicles indicate whether workers or drivers are wearing cloth masks and adhering to health and safety protocols. Other embodiments include generating value-added streams from retirement homes, child day-care facilities, food trucks, or other establishments. The value-added video streams are made accessible over the network and are designed to be streamed to subscribing client applications that can clip, email, save, or upload them to social media.
PCT/US2021/048300 2020-08-28 2021-08-30 System and method of using deep neural networks to add value to video streams WO2022047342A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063071365P 2020-08-28 2020-08-28
US63/071,365 2020-08-28

Publications (2)

Publication Number Publication Date
WO2022047342A1 WO2022047342A1 (fr) 2022-03-03
WO2022047342A9 true WO2022047342A9 (fr) 2022-06-30

Family

ID=80355785

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/048300 WO2022047342A1 (fr) 2020-08-28 2021-08-30 System and method of using deep neural networks to add value to video streams

Country Status (1)

Country Link
WO (1) WO2022047342A1 (fr)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6173317B1 (en) * 1997-03-14 2001-01-09 Microsoft Corporation Streaming and displaying a video stream with synchronized annotations over a computer network
US7697026B2 (en) * 2004-03-16 2010-04-13 3Vr Security, Inc. Pipeline architecture for analyzing multiple video streams
US9390169B2 (en) * 2008-06-28 2016-07-12 Apple Inc. Annotation of movies
US8503539B2 (en) * 2010-02-26 2013-08-06 Bao Tran High definition personal computer (PC) cam
CN103347446B (zh) * 2010-12-10 2016-10-26 Tk控股公司 用于监控车辆驾驶员的系统
US20120192220A1 (en) * 2011-01-25 2012-07-26 Youtoo Technologies, LLC User-generated social television content

Also Published As

Publication number Publication date
WO2022047342A1 (fr) 2022-03-03

Similar Documents

Publication Publication Date Title
US12008880B2 (en) Utilizing artificial intelligence to detect objects or patient safety events in a patient room
US20210397843A1 (en) Selective usage of inference models based on visual content
JP6905850B2 (ja) 画像処理システム、撮像装置、学習モデル作成方法、情報処理装置
US10812761B2 (en) Complex hardware-based system for video surveillance tracking
US8737688B2 (en) Targeted content acquisition using image analysis
US9747502B2 (en) Systems and methods for automated cloud-based analytics for surveillance systems with unmanned aerial devices
US20150145991A1 (en) System and method for shared surveillance
JP2018139403A (ja) ビデオ監視システムで警告を発する方法
US20180150695A1 (en) System and method for selective usage of inference models based on visual content
US20140149264A1 (en) Method and system for virtual collaborative shopping
KR102474047B1 (ko) 사진 또는 비디오에서 잠재적 목록에 대한 관심 수집
EP2737698A2 (fr) Système et procédé d'enregistrement et de notification d'anomalies d'un site
WO2013132463A2 (fr) Système et procédé d'analyse d'indications non verbales et de notation d'un contenu numérique
US20180150683A1 (en) Systems, methods, and devices for information sharing and matching
WO2022113439A1 (fr) Dispositif et procédé d'analyse de données
US20180157917A1 (en) Image auditing method and system
US10191969B2 (en) Filtering online content using a taxonomy of objects
CN108701393A (zh) 用于事件处理的系统及方法
WO2022047342A9 (fr) Système et procédé d'utilisation de réseaux neuronaux profonds pour ajouter une valeur à des flux vidéo
JP4061821B2 (ja) ビデオサーバシステム
US10902274B2 (en) Opting-in or opting-out of visual tracking
Gascueña et al. Engineering the development of systems for multisensory monitoring and activity interpretation
TWI706381B (zh) 影像物件偵測方法及系統
US20140375827A1 (en) Systems and Methods for Video System Management
Rupa Multi-Modal Deep Fusion for Activity Detection from Both the Virtual and Real-World

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21862943

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21862943

Country of ref document: EP

Kind code of ref document: A1