CA3057931A1 - Method to identify and see a person in a public security camera feed without been able to see the rest of the people or tags - Google Patents

Method to identify and see a person in a public security camera feed without been able to see the rest of the people or tags

Info

Publication number
CA3057931A1
Authority
CA
Canada
Prior art keywords
subject
video
individual
interest
marked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA3057931A
Other languages
French (fr)
Inventor
Alfonso F. De La Fuente Sanchez
Dany A. Cabrera Vargas
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CA3057931A
Publication of CA3057931A1
Legal status: Abandoned

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/70 - Denoising; Smoothing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/60 - Image enhancement or restoration using machine learning, e.g. neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30196 - Human being; Person
    • G06T2207/30201 - Face
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30232 - Surveillance
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 - Server components or server architectures
    • H04N21/218 - Source of audio or video content, e.g. local disk arrays
    • H04N21/21805 - Source of audio or video content, e.g. local disk arrays enabling multiple viewpoints, e.g. using a plurality of cameras
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 - Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 - Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418 - Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

A method and system to edit, record, and transmit a video where a located subject of interest is present while keeping other subjects shown in it private, or vice versa. More exactly, our invention identifies a person by different methods, including one that uses the person's geolocation, and locates that person's image when the person is in an area covered by a publicly connected security camera, while complying with privacy requirements.

Description

METHOD TO IDENTIFY AND SEE A PERSON IN A PUBLIC SECURITY CAMERA
FEED WITHOUT BEEN ABLE TO SEE THE REST OF THE PEOPLE OR TAGS.
INVENTORS: DE LA FUENTE SANCHEZ, Alfonso Fabian; CABRERA, Dany Alejandro
BACKGROUND
Security cameras are becoming a commodity - most public spaces have one or more.
Security cameras are traditionally connected to a video recorder, which in some cases is connected to the Internet. A remote user can then view a recorded image or a live feed from any specific camera among those connected to that particular video recorder. Different security cameras record or transmit video feeds at different quality levels and from different angles.
Most people own and carry a smart gadget such as a smartphone or tablet.
SUMMARY
In general, in one aspect, the invention relates to a method and system to edit, record, and transmit a video where a located subject of interest is present while keeping other subjects shown in it private. More exactly, our invention identifies a person by different methods, including one that uses the person's geolocation, and makes it possible to see that person's image when the person is in an area covered by a publicly connected security camera, while complying with privacy requirements.
DETAILED DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a typical structure of a CNN model for human face detection.
FIG. 2A is a side view of a person and a frontal camera.
FIG. 2B is a side view of pedestrians with an overhead security camera.
FIG. 3A to 3C show illustrated steps for one embodiment of the invention.
FIG. 4A is a flowchart that shows how one embodiment of the invention works.
FIG. 4B shows a frontal picture used in one embodiment of the invention.
FIG. 5A shows a second embodiment of the invention using overhead camera footage.
FIG. 5B shows a different embodiment of the invention.
FIG. 6A shows a flowchart of the method for blurring.
FIG. 6B shows the blurring or masking module.
FIG. 7 is a flow chart of the system and method that describes how the localization of the subject takes place.
FIG. 8 is a flowchart that shows how we create a ground truth frame.
FIG. 9 is a flowchart, continuation of FIG. 8.
FIG. 10 describes a different embodiment of the invention.
FIG. 11 shows a diagram of a computing system.
DETAILED DESCRIPTION
Specific embodiments of the technology will now be described in detail with reference to the accompanying FIGS. In the following detailed description of embodiments of the technology, numerous specific details are set forth in order to provide a more thorough understanding of the technology. However, it will be apparent to one of ordinary skill in the art that the technology may be practiced without these specific details.
In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description.

In the following description of the FIGS., any component described with regard to a figure, in various embodiments of the technology, may be equivalent to one or more like-named components described with regard to any other figure. For brevity, descriptions of these components will not be repeated with regard to each figure. Thus, each and every embodiment of the components of each FIG. is incorporated by reference and assumed to be optionally present within every other FIG. having one or more like-named components. Additionally, in accordance with various embodiments of the technology, any description of the components of a FIG. is to be interpreted as an optional embodiment, which may be implemented in addition to, in conjunction with, or in place of the embodiments described with regard to a corresponding like-named component in any other figure.
FIG. 1 shows a typical structure of a CNN model for human face detection. Our system relies on Computer Vision (a discipline in Artificial Intelligence) to detect and localize the subject of interest in a video and differentiate this person from other people in the same scene. We take advantage of Machine Learning techniques to increase the accuracy of our system, specifically a subset of techniques commonly known as deep learning.
Deep learning refers to the use of a Convolutional Neural Network (CNN) (100), which is a multi-layered system able to learn to recognize visual patterns after being trained with millions of examples (e.g. what a person should look like in a picture).
Once trained, a CNN model can understand images it has never seen before by relying on what it learned in the past. This image-understanding method can also be applied to video footage, as video can be understood as a sequence of images (known as "frames") and individual video frames can be analyzed separately.
A CNN(100) is trained using an "image dataset" (110) which comprises a vast collection of image samples, manually annotated to indicate what the CNN (100) should learn. As an example, a CNN (100) can be trained to detect faces using a dataset of thousands of images (110) where someone has manually indicated the position and size of every face contained on every image. This process of annotating the training information in a dataset is known as "labeling a dataset". The bigger and more diverse a dataset (110) is, the more the CNN (100) can learn from it whilst increasing its accuracy.
For illustrative purposes, FIG. 1 shows a common CNN (100) architecture for face recognition. The CNN (100) can be seen as a set of interconnected "layers"
(101, 102, 103, 104) each containing "neurons" (120) which are implemented as units that produce a numerical output from a set of numerical inputs through some mathematical formula.
Neurons (120) are densely interconnected (mimicking the human brain) so their outputs can serve as input for others. The types of layers, neurons, and mathematical formulas that govern their connections are diverse, and it requires domain knowledge to determine the optimal structure and mathematical formulas a CNN model (100) should have to serve a specific purpose.
Further describing the inner workings of a CNN, the illustrative example in FIG. 1 shows how an input image from a dataset (110) is provided to the CNN's input layer (101), which produces patterns of local contrast (111). That information is fed to the first hidden layer (102), which produces an output describing face features (112); this output is then fed to the second hidden layer (103), which produces a recognition of faces (113) and finally feeds the output layer (104), which produces a result of possible matches. This structure is one of many that can be used to model a CNN to produce a prediction given an input image.
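To make the layer structure of FIG. 1 concrete, the following is a minimal, illustrative PyTorch sketch of a small CNN with an input layer, two hidden convolutional layers, and an output layer. The layer sizes and the binary face/no-face output are assumptions made for illustration only, not the architecture disclosed in this document.

```python
# Minimal illustrative CNN loosely mirroring the layer structure of FIG. 1.
# All layer sizes and the face/not-face output are illustrative assumptions.
import torch
import torch.nn as nn

class TinyFaceCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # input layer: local contrast patterns
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # hidden layer 1: low-level face features
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # hidden layer 2: higher-level face parts
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.output = nn.Linear(64, 2)  # output layer: face / not-face scores

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.output(x)

# Example: a batch of one 64x64 RGB image produces two class scores.
scores = TinyFaceCNN()(torch.randn(1, 3, 64, 64))
print(scores.shape)  # torch.Size([1, 2])
```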
As one would expect, a CNN (100) will only learn what it was taught; for example, a CNN trained to detect human faces cannot detect full bodies. It will also encounter difficulties when the environmental conditions change (e.g. natural and artificial lighting conditions, weather changes, image noise and quality, video camera recording format) unless it has been trained to handle such conditions with a sufficiently vast and diverse dataset.
In practice, a single task might require not a single CNN but a group of them;
for example, a smaller CNN might be used to improve the accuracy of the output from a bigger CNN, or the final result might be obtained from averaging individual results of a set of different CNNs (often called an "ensemble"). The best architecture for a specific purpose is usually obtained after experimentation.
FIGS. 2A and 2B show how the different camera footage is considered. As described in FIG. 1, a CNN will only learn what it was taught; a CNN trained to recognize footage of people filmed from their front will struggle to recognize footage of people recorded from a ceiling. The system of our invention works both when the camera is installed facing people from the front, as shown in FIG. 2A, and when the camera is installed in a high place to film people from above, as shown in FIG. 2B. We describe both cases in the following sections.
FIG. 2A is a side view of a person and a frontal camera. It shows a person (200) facing the front of a camera (201). Following the concepts discussed in FIG. 1, in the interest of optimizing performance and accuracy, our system includes two modules that act depending on the camera's position and distance from the people being filmed.
Note that the methods described for both cases can be used simultaneously in the same video, since certain video footage might contain people from both cases. An example of a frontal camera is a body camera.

FIG. 2B is a side view of pedestrians with an overhead security camera. It shows a side view when the camera (202) is installed in a high place to film people (203, 204) from above. One familiar with the art will appreciate that the angle from where the camera is filming or capturing the image of the person or pedestrian (203, 204) is different from what is shown in FIG. 2A. As such, it needs a different method to process the subject recognition described in this document.
FIGS. 3A to 3C show illustrative steps for one embodiment of the invention, exemplifying frontal camera footage. FIG. 3A shows the original image with four persons (301, 302, 303, 304). FIG. 3B shows how faces (311, 312, 313, 314) are detected and extracted from the original image for further analysis. FIG. 3C shows how all faces are transformed to a mathematical representation for comparison purposes (feature vectors).
Continuing with the description of FIGS. 3A to 3C, in the same example using the frontal camera footage, when the camera is installed to record people from the front so that their face is exposed and recognizable as in FIG. 2A, the system of our invention relies on two modules: first, a face-detection module, including but not limited to a CNN architecture, that detects and localizes faces in every video frame, obtaining a bounding box for every face so that we can extract them as illustrated in FIG. 3B; and second, a face-recognition module, including but not limited to a CNN architecture for face recognition, that is able to determine that a certain face belongs to a certain person.
This process needs to transform every face found into a mathematical representation that we can compare against the representations of the subject of interest, as illustrated in FIG. 3C.

The face-detection module is implemented by reusing an existing network architecture meant for object detection. Some examples of existing compatible CNN architectures are faced, YOLOv3, and MTCNN, among others. These architectures can be re-trained to detect faces in the scene environment conditions and video format required by the system, or alternatively, their structure can be used as inspiration to formulate a customized face-detection CNN model for this use case.
This "make or reuse" decision is often made based on a small study where we test multiple architectures against actual video footage provided by the user to obtain a quantitative benchmark that reveals the option with the highest accuracy and performance for the technical particularities of the user's camera systems.
On the other hand, the process of transforming an image into its representative mathematical form as illustrated in FIGS. 3B and 3C is a process known as "feature extraction". Given an image, this process obtains a set of numbers (known as "feature vector") (321, 322, 323, 324) that represent the image's (311, 312, 313, 314) visual features. Feature vectors can be compared against each other using mathematical formulas (i.e. by their n-dimensional euclidean distance), allowing the system to determine the probability of two faces belonging to the same person. This step in our system relies on CNN models for facial feature extraction like FaceNet and OpenFace, to name a few published by the research community.
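As a concrete illustration of comparing feature vectors by Euclidean distance, the following Python sketch treats a face as matching the subject when its embedding lies within a distance threshold of any reference embedding. The 128-dimension size and the 0.9 threshold are assumptions for illustration, since real values depend on the embedding model used (e.g. FaceNet or OpenFace).

```python
# Illustrative comparison of face feature vectors by Euclidean distance.
# Embedding size and threshold are assumptions, not values from the patent.
import numpy as np

def euclidean_distance(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    return float(np.linalg.norm(vec_a - vec_b))

def is_same_person(face_vec, subject_vecs, threshold=0.9):
    """A face matches the subject if it is close to any reference embedding."""
    return min(euclidean_distance(face_vec, s) for s in subject_vecs) < threshold

# Example with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
subject = [rng.normal(size=128) for _ in range(3)]
candidate = subject[0] + rng.normal(scale=0.01, size=128)
print(is_same_person(candidate, subject))  # True
```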
FIG. 4A shows a flowchart in accordance with one or more embodiments of the invention. While the various steps in these flowcharts are presented and described sequentially, one of ordinary skill will appreciate that some or all of the steps may be executed in different orders, may be combined or omitted, and some or all of the steps may be executed in parallel. Furthermore, the steps may be performed actively or passively. The flowchart shows how one embodiment of our invention works. It shows the steps for our face recognition method, as described in FIGS. 1 through 3, including our method for blurring all faces that don't belong to a person of interest (from now on called "the subject" or "subject of interest") for the same example described in FIGS. 3A to 3C.
Steps 401 and 402 show the inputs to our method.
Step 401 - Video file: The user provides the system a video file where the person of interest appears.
Step 402 - Pictures of the subject: The user loads one or multiple front-face images of the subject of interest into our system. If this is not possible, our system will alternatively present an interface that facilitates the gathering of these images from the video itself.
After preparing the input, the following is a description of the method's steps:
Step 411 - Extract subject's features: Through our face recognition module, the system will obtain the feature vectors of the pictures of the subject and preserve them for later.
Step 412 - Detect faces in all frames: For every frame in the video, the system will use our face detection module to detect and localize all faces, obtaining a "bounding box" for every face found.
Step 413 - Extract features from all faces: For every face found, the system will use our face recognition module to obtain their corresponding feature vectors.

Step 414 - Compare all faces: The system will use our face recognition module to compare all the feature vectors obtained in Step 413 with the feature vectors obtained for our subject of interest in Step 411. For all faces in the video, this process will output a probability of the face belonging to the subject of interest (measured as a number from 0.0 to 1.0).
Step 415 - Label the subject: All faces with a high probability of belonging to the subject of interest are labeled as positive cases, while every other face is labeled as a negative case. One familiar with the art will appreciate that a face could also be reflected in mirrors or shiny objects such as chromed items, for example. In a different embodiment of the invention, the system allows the user or operator to select objects of interest, for example faces, and then redacts or masks these objects from the scene. One familiar with the art will appreciate that masking can also be described as blacking out. In a different embodiment of the invention, when redacting complicated scenes that may contain a large number of objects with personally identifiable information (for example, multiple people and objects), the operator can select an area to be blurred instead of blurring individual objects of interest.
Step 416 - Blur faces: Using our blur module, all faces labeled as negative cases are blurred. The blurring process is also known as redacting the objects from the scene, wherein the objects include faces, text, and other identifiable objects.
Step 417 - The output produced by our method is a video where all faces different from the subject of interest will appear blurred.
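The following Python sketch strings Steps 411 through 417 together for a single video. The `detector` and `recognizer` objects and the similarity threshold are hypothetical placeholders for the face detection and face recognition modules described above, not a published API, and a Gaussian blur stands in for the blur module of FIG. 6B.

```python
# Sketch of the frontal-camera flow in FIG. 4A (Steps 411-417), assuming
# hypothetical detector/recognizer objects with detect(), embed(), and
# similarity() helpers. This is an illustration, not the patented system.
import cv2

def redact_video(video_path, subject_images, detector, recognizer,
                 out_path="redacted.mp4", threshold=0.6):
    # Step 411: feature vectors of the subject's reference pictures.
    subject_vecs = [recognizer.embed(img) for img in subject_images]

    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Step 412: bounding boxes for every face in the frame.
        for (x, y, bw, bh) in detector.detect(frame):
            face = frame[y:y + bh, x:x + bw]
            # Step 413: feature vector for this face.
            vec = recognizer.embed(face)
            # Steps 414-415: probability of being the subject of interest.
            score = max(recognizer.similarity(vec, s) for s in subject_vecs)
            # Step 416: blur every face labeled as a negative case.
            if score < threshold:
                frame[y:y + bh, x:x + bw] = cv2.GaussianBlur(face, (51, 51), 0)
        writer.write(frame)  # Step 417: output video with non-subjects blurred.

    cap.release()
    writer.release()
```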

In a different embodiment of the invention, once the system and method successfully recognizes the subject, it uses the same feature vectors in other videos where the subject of interest is present, which makes it possible to recognize the subject in multiple videos using the same process. In a different embodiment of our invention, after running all the steps described above, if the pictures provided for the subject do not suffice for successful recognition, the user can try again or start a new process from Step 402 by loading a different set of pictures or by gathering the subject's pictures from a video where the face appears clearly.
FIG. 4B shows a frontal picture and how each face is marked with a box.
FIG. 5A shows a second embodiment of the invention using overhead camera footage.
It shows the case in which the camera is installed on a ceiling or some other high place to film, capture, or record the images of people from above. In the example shown in FIG. 5A, the system is not expected to detect the faces of people on the ground, as they most often will not be recognizable by traditional automated computer vision methods, due to the lower video quality (required for reasonable storage costs) and a camera angle that does not allow full face exposure. Instead, we track each person's entire body (501, 502, 503, 504, 505, 506, 507) using two modules. The first is a pedestrian detection module comprising a CNN architecture that detects and localizes pedestrians (501, 502, 503, 504, 505, 506, 507) in every video frame, obtaining a bounding box for every pedestrian so that we can extract them as illustrated in FIG. 5B. The second is a pedestrian recognition module comprising a CNN architecture for pedestrian recognition that is able to determine the probability that one of the detected pedestrians (511, 512, 513, 514, 515, 516, 517) looks the same as the subject of interest. This process needs to transform every pedestrian found into a mathematical representation that we can compare against the representations of the subject of interest, as illustrated in FIG. 5C.

FIG. 5B shows a different embodiment of our invention, in which the pedestrian detection module reuses an existing network architecture meant for object detection. Some generic examples of compatible network architectures are MobileNets, Inception-v4, and Faster R-CNN, to name a few. These architectures can be re-trained to detect pedestrians (501, 502, 503, 504, 505, 506, 507) in the scene environment conditions and video format required by the user, or alternatively, their structure can be used as inspiration to formulate a customized pedestrian detection CNN model for this use case. This "make or reuse" decision is made based on a small study where we test multiple alternatives against actual video footage provided by the user to obtain a quantitative benchmark that reveals the option with the highest accuracy and performance for the technical particularities of the user's camera systems. The output is known as the pedestrian detection's probability.
Note that in cases where the pedestrian detection's probability is not high enough to make a decision on marking a pedestrian (e.g. the pedestrian does not look as expected or is covered by other people), further information, such as the position and speed of pedestrians in previous and future video frames, might be used to refine and support the decision.
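A minimal sketch of that refinement, assuming a constant-velocity motion model: if a candidate detection lies implausibly far from where the subject's previous position and speed predict it should be, its appearance-based probability is penalized. The data structures, penalty, and pixel threshold are assumptions for illustration.

```python
# Sketch of refining a low-confidence match with motion information.
# The constant-velocity model, penalty factor, and threshold are assumptions.
def predict_position(prev_center, velocity, frames_elapsed):
    """Constant-velocity prediction of the subject's next center point."""
    return (prev_center[0] + velocity[0] * frames_elapsed,
            prev_center[1] + velocity[1] * frames_elapsed)

def refine_match(appearance_prob, candidate_center, prev_center, velocity,
                 frames_elapsed=1, max_jump_px=80.0):
    predicted = predict_position(prev_center, velocity, frames_elapsed)
    jump = ((candidate_center[0] - predicted[0]) ** 2 +
            (candidate_center[1] - predicted[1]) ** 2) ** 0.5
    # Penalize candidates that moved implausibly far between frames.
    return appearance_prob if jump <= max_jump_px else appearance_prob * 0.5
```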
In a different embodiment of our invention, the process of transforming an image (511, 512, 513, 514, 515, 516, 517) into its representative mathematical form (521, 522, 523, 524, 525, 526, 527), as illustrated in FIG. 5C, is the same "feature extraction" process explained for the frontal camera case. We again rely on the feature descriptor generation capabilities of existing generic CNN object detection models to obtain the feature vector of every pedestrian.
FIG. 6A shows a flowchart describing the method for blurring or masking all pedestrians or people that don't belong to a person of interest (from now on called "the subject") in accordance with one or more embodiments of the invention.

The inputs to our method are:
Step 620 - Video file: The user provides the system a video file where the person of interest appears.
Step 621 - Subject selection: After all pedestrians in the video have been detected, the system will ask the user to select (i.e. click on) the person of interest. This operation might be required more than once in order to improve detection results.
The method steps can be described as follows:
Step 630 - Detect pedestrians in all frames: Using our pedestrian detection module we obtain the bounding box for all pedestrians present for every frame in the video.
Step 631 - Extract features from all pedestrians: For every pedestrian found, the system will use our pedestrian recognition module to obtain their corresponding feature vectors.
Step 633 - Extract subject's features: The user will manually select the detected pedestrian that matches the subject of interest. Further frames of the same pedestrian detected with a high degree of confidence might be used to increase the amount of information on this pedestrian.
Step 634 - Compare all pedestrians: The system will use our pedestrian recognition module to compare all the feature vectors obtained in Step 631 with the feature vectors obtained for our subject of interest in Step 633. For all pedestrians in the video, this process will output a probability of the pedestrian being the subject of interest (measured as a number from 0.0 to 1.0).
Step 635 - Label the subject: All pedestrians with a high probability of being the subject of interest are labeled as positive cases, while every other pedestrian is labeled as a negative case.
Step 636 - Blur faces: Using our blur module, blur all faces belonging to pedestrians labeled as negative cases.
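The following Python sketch covers Steps 633 through 635 for the overhead case: the user's click selects the reference pedestrian, and every other detection is labeled by comparing feature vectors. The detection dictionaries and the `recognizer.similarity` helper are hypothetical placeholders, not the concrete implementation described here.

```python
# Sketch of the overhead-camera flow in FIG. 6A (Steps 633-635), assuming
# detections are dicts with "box" and "feature_vector" keys and a hypothetical
# recognizer object. Illustration only.
def select_subject_by_click(detections, click_xy):
    """Step 633: return the detected pedestrian whose box contains the click."""
    cx, cy = click_xy
    for det in detections:
        x, y, w, h = det["box"]
        if x <= cx <= x + w and y <= cy <= y + h:
            return det
    return None

def label_pedestrians(all_detections, subject_vec, recognizer, threshold=0.6):
    """Steps 634-635: mark each pedestrian as a positive or negative case."""
    for det in all_detections:
        prob = recognizer.similarity(det["feature_vector"], subject_vec)
        det["is_subject"] = prob >= threshold
    return all_detections
```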
FIG. 6B shows the blurring or masking module. In all cases of detected faces or pedestrians or subjects of interest, we use a separate face blur module that is in charge of masking or blurring the person's head (601).
One embodiment of our invention uses the following blurring methods on an original image (601) where the subject's face (611) is not covered or blurred. A first embodiment of the invention uses the covering method (602): in this case, we cover the head (611) with a solid color polygon (602) (e.g. a black ellipse) to completely hide it.
In a different embodiment of the invention the Blurring method (603) is used.
In this case the system applies a "pixelation" (613) filter to a polygon (e.g. an ellipse) that surrounds the head. The pixelation filter divides the head area into a grid and, for every cell in the grid, replaces the cell with a square filled with the average color of all pixels contained in said cell.
In a different embodiment of the invention - to protect against algorithms that try to recover the original image using different methods including but not limited to reverse engineering a blurred image - the system randomly switches pixel places to make recovery impossible. In a different embodiment of the invention, the polygon, size of pixels, and amount of pixels switched are user-configurable. One familiar with the art will appreciate that this method can also be used to blur ID tags as well (when text is detected), by switching the blur area polygon to a rectangle. These features are user configurable too.
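A minimal NumPy sketch of the pixelation filter and the randomized pixel switching described above, applied to a rectangular head region; the cell size and the fraction of switched pixels stand in for the user-configurable parameters and are assumptions for illustration.

```python
# Sketch of pixelation and randomized pixel switching on an image region
# (H x W x 3 NumPy array). Cell size and switch fraction are assumptions.
import numpy as np

def pixelate(region: np.ndarray, cell: int = 16) -> np.ndarray:
    """Replace each grid cell with the average color of its pixels."""
    out = region.copy()
    h, w = region.shape[:2]
    for y in range(0, h, cell):
        for x in range(0, w, cell):
            block = region[y:y + cell, x:x + cell]
            out[y:y + cell, x:x + cell] = block.mean(axis=(0, 1)).astype(region.dtype)
    return out

def switch_pixels(region: np.ndarray, fraction: float = 0.3,
                  rng=np.random.default_rng()) -> np.ndarray:
    """Randomly permute a fraction of pixel positions to hinder recovery."""
    out = region.reshape(-1, region.shape[-1]).copy()
    n = int(len(out) * fraction)
    idx = rng.choice(len(out), size=n, replace=False)
    out[idx] = out[rng.permutation(idx)]  # permute the selected pixels
    return out.reshape(region.shape)
```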
When the face of the person has already been detected and localized (i.e.
after face detection), the system applies the blur method to the area of the face.
However, when dealing with the whole human body (as in the pedestrian detection module described above), the head must first be accurately localized, and its size and position will vary depending on the camera angle and the subject's pose (i.e. the geometrical coordinates of the head of a person standing and a person sitting will appear different).
To localize the head on detected pedestrians or subjects of interest, this module includes a head detection method based on a CNN architecture for human pose estimation (e.g. OpenPose, among others). The pose estimation CNN outputs an approximation of the head's position and size, which the system then proceeds to obfuscate.
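As an illustration of deriving a head region from pose-estimation output, the following sketch builds a padded bounding box around the head-related keypoints. The keypoint names and the padding factor are assumptions, since keypoint formats vary between pose models.

```python
# Sketch of approximating a head bounding box from pose keypoints.
# Keypoint names and the padding factor are illustrative assumptions.
def head_box_from_keypoints(keypoints, pad=0.4):
    """keypoints: dict of name -> (x, y) for the joints the pose model found."""
    head_points = [keypoints[k] for k in
                   ("nose", "left_eye", "right_eye", "left_ear", "right_ear")
                   if k in keypoints]
    if not head_points:
        return None  # no head keypoints detected for this pedestrian
    xs = [p[0] for p in head_points]
    ys = [p[1] for p in head_points]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    # Expand the tight keypoint box so it covers the whole head.
    return (min(xs) - pad * w, min(ys) - pad * h,
            (1 + 2 * pad) * w, (1 + 2 * pad) * h)
```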
In a different embodiment of the invention, in regard to text detection, the system of our invention uses reliable scene text detection methods that are available and can be implemented for direct application. In a different embodiment of the invention, the system uses methods that achieve higher degrees of accuracy and utilize deep learning models, for example the EAST, Pyrboxes, and Textboxes++ methods, to name a few published by the research community. In a different embodiment of the invention, after the text in the video footage has been detected and localized, the system blurs any text that is positioned inside the bounding box of any detected pedestrian or below a nearby detected face. Examples of text that one may want to blur for privacy reasons are name tags, licence plates, and any other text that needs to be protected.
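A small sketch of the containment rule described above: any detected text box that falls inside the bounding box of a detected pedestrian is selected for blurring. The (x, y, width, height) box format is an assumption for illustration.

```python
# Sketch of selecting text boxes to redact: keep only text that lies inside
# a detected pedestrian's bounding box. Box format is an assumption.
def box_inside(inner, outer):
    ix, iy, iw, ih = inner
    ox, oy, ow, oh = outer
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh

def text_boxes_to_blur(text_boxes, pedestrian_boxes):
    return [t for t in text_boxes
            if any(box_inside(t, p) for p in pedestrian_boxes)]
```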

FIG. 7 shows a flowchart describing how the localization of the subject is performed in accordance with one or more embodiments of the invention.
Step 701 - Localizing a subject of interest in a geographical area. One familiar with the art will appreciate that this could take the form of a coordinate, a region, a building, a room, or a similar area such as a street or plaza. One familiar with the art will also appreciate that a subject of interest might be a person, or part of a person, for example the face of a person. The subject of interest could also be an animal, a tag, or a licence plate, to name a few.
Step 702 - Does the subject of interest carry a geolocation sensor? Here, the geolocation sensor is connected to a remote server from which the device running our software can retrieve this data.
If yes, then Step 703; if no, then Step 704.
Step 703 - Retrieve the geopositioning data from a smart gadget. Localizing a person of interest comprises using the geopositioning data from a sensor in a smart gadget that the person of interest carries. For example, smart gadgets such as smartphones have geopositioning sensors, and the data collected from those sensors can be transmitted to a remote server or directly to the device running the software of our invention; by using software applications, one with the appropriate permissions could access the geopositional data generated by the smart gadget of the subject of interest.
With such data, one could identify the location of the subject of interest using coordinates sent by the smart gadget. In a different example, when the subject of interest is an Internet of Things device, such as an autonomous vehicle, a shared bicycle, or a drone (to name a few), those devices already have geopositioning sensors and data transmission capabilities that transmit their localization. Then proceed to Step 705.
Step 704 - Detecting the presence of the person of interest in an area, manually, by automatic face detection, or by other means of detection, such as a request from the person of interest, who provides the date, time, and location where they were present. For example, people in Canada have a legislated right to request access to government information under the Access to Information Act and to their own personal information under the Privacy Act. Then proceed to Step 705. FIG. 10 describes the process for the face detection.
Step 705 - Matching the localization of the subject of interest with a first video, wherein the first video comprises a recording or a live video feed from a camera, for example a security camera or a smart gadget. The frames in the first video comprise all of the frames in the first video or selected frames in the first video.
In a different embodiment of the invention, the localization of the subject is performed manually, either from live video feed monitoring or from pre-recorded tapes from different cameras and areas. For example, video footage may be reviewed by employees to identify scenes containing certain individuals. Such video footage may come from a single camera or multiple cameras in a single location or multiple locations.
In this example, the first video mentioned in FIG. 7 will be considered a manually identified video or videos containing the image of the subject of interest.
FIG. 10 shows how face detection can replace the manual entry.
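To illustrate matching a subject's geolocation against camera coverage (Steps 702 to 705), the following sketch uses the standard haversine formula to find every camera whose coverage radius contains the subject's coordinates. The camera records and coverage radii are illustrative assumptions, not part of this disclosure.

```python
# Sketch of matching a subject's geolocation to camera coverage areas.
# The haversine formula is standard; camera records are assumptions.
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates, in meters."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def cameras_covering(subject_latlon, cameras):
    """cameras: iterable of dicts with 'id', 'lat', 'lon', and 'radius_m'."""
    lat, lon = subject_latlon
    return [c["id"] for c in cameras
            if haversine_m(lat, lon, c["lat"], c["lon"]) <= c["radius_m"]]
```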

FIG. 8 shows a flowchart describing how the invention creates a ground truth frame when the images come from a camera installed to film people from above, in accordance with one or more embodiments of the invention; in this case the subjects are pedestrians.
FIG. 8 is a continuation of FIG. 7.
Step 801 - Video selection: Loading a video where the subject of interest is present, as previously described.
Step 802 - Selecting a first still frame from the first video, usually one from the first group of frames where the subject of interest appears in the video. For example, consider a video of a thief taken by a security camera in which other subjects are also present, for example a girl and a policeman. All three subjects are identified, but only one individual is the subject of interest whose face we want to show; the faces of the rest of the identified subjects (the girl and the policeman) need to be masked.
Step 803 - Detecting and localizing individual subjects in the first still frame by using pre-trained convolutional neural networks.
Step 804 - Obtaining a bounding box for all individual subjects present in frames of the first video, wherein the frames include the first still frame.
Step 805 - Creating a first ground truth frame by selecting and marking a detected individual subject as the marked subject at the first still frame, wherein the selection of the detected individual as the person of interest is one or more from the group of marking the subject by a user and automatically detecting the individual by matching the individual's biometrics. One familiar with the art will appreciate that biometrics is the technical term for body measurements and calculations.

Biometrics refers to metrics related to human characteristics. Biometrics authentication (or realistic authentication) is used in computer science as a form of identification and access control. It is also used to identify individuals in groups that are subjects of interest or under surveillance. Biometric identifiers are the distinctive, measurable characteristics used to label and describe individuals.
Biometric identifiers are often categorized as physiological versus behavioral characteristics. Physiological characteristics are related to the shape of the body.
FIG. 9 shows a flowchart describing the method to identify a person and mask the faces of the other subjects, in accordance with one or more embodiments of the invention. FIG. 9 is a continuation of FIG. 8.
Step 901 - Obtaining a set of features for the marked subject by extracting the visual features of the marked subject from any of the subsequent frames and any contiguous frame where the marked subject is determined to be present, wherein the visual features of the marked individual subject comprise one or more from the group of feature vectors.
Step 902 - Obtaining, for every other frame where the marked subject is determined to be present, feature vectors for all individual subjects detected. Then proceed to the next step.
Step 903 - Computing the vector distance between the feature vectors obtained and the feature vectors stored for the marked subject.
Step 904 - Determining if any of the individual subjects matches the marked subject, also determining the position and displacement speed of the individual subjects to discard unlikely matches.

Step 905 - Masking every detected individual subject that does not match the marked subject, wherein masking comprises one or more from the group of blurring and changing individual pixels to a different color than the original pixel color.
FIG. 10 shows a flowchart describing a different embodiment of the invention, when the camera captures the images facing people or subjects from the front, so that their faces are exposed and recognizable, in accordance with one or more embodiments of the invention.
Step 1001 - Load a video recording where the person of interest appears. For example, a person walks into an office that has a security camera system that includes personal (face) cameras or overhead (pedestrian) cameras.
Step 1002 - Face detection: Detect and locate faces in the video using a pre-trained deep learning model for face detection and localization. This model can be obtained by training (or re-training) a Convolutional Neural Network (CNN) with an architecture that supports face detection, configured to detect and localize a single class (human face). Examples of network architectures for this step are faced, YOLOv3, and MTCNN, amongst other public alternatives. In a different embodiment of the invention, as each technique performs differently depending on the video quality, scene complexity, and distance to the subject, among other factors, the system uses one or more face detection techniques.
As output, this step infers the position and size of the "bounding box" for all faces in the frame.

Step 1003 - Assemble the set of faces detected in this frame into the frame's "face group": Having obtained a bounding box for all faces in the frame, each bounding box's image contents are copied and stored to be processed separately from the rest of the frame. This method refers to the set of cropped faces obtained this way as the "face group" for the frame.
Step 1004 - Group feature extraction: For each face in the "face group", encode the face into a "feature vector" as explained in Step 1002.
Step 1005 - Feature matching. Compare the feature vectors of all faces in this frame's "face group" with the feature vectors available for the person of interest (a process often called "distance measurement").
Step 1006 - Does the "face group" contain a face where the distance measured is low enough to be a match? If yes, then Step 1007; if not, return to Step 1002. If the "face group" contains a face where the distance measured is low enough (within thresholds configurable by the user), it is "marked" as the person of interest.
Step 1007 - Masking every face in the "face group" not marked as the person of interest.
Wherein masking comprises one or more from the group of blurring, changing individual pixels to a different color than the original pixel color. In a different embodiment of the invention, in every video frame where the subject is successfully recognized, a record is produced on the coordinates of the subject's face. If due to motion blur or noise the subject's face cannot be recognized, the position of the face between frames is predicted based on the face's coordinates in previous and future frames.
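A minimal sketch of predicting the face position in a frame where recognition failed, by interpolating between the nearest earlier and later frames where the subject was recognized. Linear interpolation is an assumption for illustration, since the text only states that previous and future coordinates are used.

```python
# Sketch of predicting a missing face box by linear interpolation between
# the nearest recognized frames. Linear interpolation is an assumption.
def interpolate_box(frame_idx, prev_idx, prev_box, next_idx, next_box):
    """Boxes are (x, y, width, height); indices are frame numbers."""
    t = (frame_idx - prev_idx) / float(next_idx - prev_idx)
    return tuple(p + t * (n - p) for p, n in zip(prev_box, next_box))

# Example: recognition failed at frame 12, succeeded at frames 10 and 14.
print(interpolate_box(12, 10, (100, 50, 40, 40), 14, (120, 58, 40, 40)))
# (110.0, 54.0, 40.0, 40.0)
```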

Embodiments of the invention may be implemented on a computing system. Any combination of mobile, desktop, server, router, switch, embedded device, or other types of hardware may be used. For example, as shown in FIG. 11, the computing system (1100) may include one or more computer processors (1101), non-persistent storage (1102) (for example, volatile memory, such as random access memory (RAM), cache memory), persistent storage (1103) (for example, a hard disk, an optical drive such as a compact disk (CD) drive or digital versatile disk (DVD) drive, a flash memory, etc.), a communication interface (1104) (for example, Bluetooth interface, infrared interface, network interface, optical interface, etc.), and numerous other elements and functionalities.
The computer processor(s) (1101) may be an integrated circuit for processing instructions. For example, the computer processor(s) may be one or more cores or micro-cores of a processor. The computing system (1100) may also include one or more input devices (1110), such as a touchscreen, keyboard, mouse, microphone, touchpad, electronic pen, or any other type of input device.
The communication interface (1104) may include an integrated circuit for connecting the computing system (1100) to a network (not shown) (for example, a local area network (LAN), a wide area network (WAN) such as the Internet, mobile network, or any other type of network) and/or to another device, such as another computing device.
Further, the computing system (1100) may include one or more output devices (1106), such as a screen (for example, an LCD display, a plasma display, touch screen, cathode ray tube (CRT) monitor, projector, or other display device), a printer, external storage, or any other output device. One or more of the output devices may be the same or different from the input device(s). The input and output device(s) may be locally or remotely connected to the computer processor(s) (1101), non-persistent storage (1102) , and persistent storage (1103). Many different types of computing systems exist, and the aforementioned input and output device(s) may take other forms.
Software instructions in the form of computer readable program code to perform embodiments of the invention may be stored, in whole or in part, temporarily or permanently, on a non-transitory computer readable medium such as a CD, DVD, storage device, a diskette, a tape, flash memory, physical memory, or any other computer readable storage medium.
Specifically, the software instructions may correspond to computer readable program code that, when executed by a processor(s), is configured to perform one or more embodiments of the invention.
The computing system (1100) in FIG. 11 may be connected to or be a part of a network.
For example, a network may include multiple nodes (for example, node X and node Y).
Each node may correspond to a computing system, or a group of nodes combined may correspond to the computing system shown in FIG. 11. By way of an example, embodiments of the invention may be implemented on a node of a distributed system that is connected to other nodes. By way of another example, embodiments of the invention may be implemented on a distributed computing system having multiple nodes, where each portion of the invention may be located on a different node within the distributed computing system. Further, one or more elements of the aforementioned computing system (1100) may be located at a remote location and connected to the other elements over a network.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this disclosure, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as disclosed herein. Accordingly, the scope of the invention should be limited only by the attached claims.

Claims (17)

What is claimed is:
1. A method and system to edit, record, transmit a video where a located subject of interest is present while keeping other subjects shown on it private, comprising:
localizing a subject of interest, matching the localization of the subject of interest with a first video, selecting a first still frame from the first video, detecting and localizing individual subjects in the first still frame by using pre-trained convolutional neural networks, obtaining a bounding box for all individual subjects present in frames of the first video, wherein the frames include the first still frame, creating a first ground truth frame by selecting and marking a detected individual subject as the marked subject at the first still frame.
2. The method and system of claim 1, further comprising:
obtaining a set of features for the marked subject by extracting the visual features of the marked subject from any of the subsequent frames, and any contiguous frame where the marked subject is determined to be present, obtaining, for every other frame where the marked subject is determined to be present, feature vectors for all individual subjects detected, compute the vector distance between the feature vectors obtained and the feature vectors stored for the marked subject, determining if any of the individual subjects matches the marked subject, masking every detected individual subject that does not match the marked subject.
3. The method and system of claim 1, further comprising:
determining the position and displacement speed of the individual subjects to discard unlikely matches.
4. The method and system of claim 2, wherein masking comprises one or more from the group of blurring, changing individual pixels to a different color than the original pixel color.
5. The method and system of claim 2, wherein the visual features of the marked individual subject comprises one or more from the group of feature vectors.
6. The method and system of claim 1, wherein the selection of the detected individual as the person of interest is one or more from the group of marking the subject by a user, automatically detecting the individual by matching the individual's biometrics.
7. The method and system of claim 1, wherein the frames in the first video comprise all the frames in the first video, selected frames in the first video.
8. The method and system of claim 1, wherein localizing a person of interest comprises one or more from the group of using the geopositioning data from a smart gadget's sensor that he/she carries, matching the geopositioning data with areas covered by cameras, manually detecting the presence of the person of interest in a video feed, using machine learning to identify a person.
9. The method and system of claim 1, wherein a first video feed comprises a video from a recording or live video from a camera feed.
10. A non-transitory computer readable medium comprising instructions, which when executed by a processor, performs a method, the method to edit, record, transmit a video where a located subject of interest is present while keeping other subjects shown on it private, comprising:
localizing a subject of interest, matching the localization of the subject of interest with a first video, selecting a first still frame from the first video, detecting and localizing individual subjects in the first still frame by using pre-trained convolutional neural networks, obtaining a bounding box for all individual subjects present in frames of the first video, wherein the frames include the first still frame, creating a first ground truth frame by selecting and marking a detected individual subject as the marked subject at the first still frame.
11. The non-transitory computer readable medium of claim 10, further comprising:
obtaining a set of features for the marked subject by extracting the visual features of the marked subject from any of the subsequent frames, and any contiguous frame where the marked subject is determined to be present, obtaining, for every other frame where the marked subject is determined to be present, feature vectors for all individual subjects detected, compute the vector distance between the feature vectors obtained and the feature vectors stored for the marked subject, determining if any of the individual subjects matches the marked subject, masking every detected individual subject that does not match the marked subject, wherein masking comprises one or more from the group of blurring, changing individual pixels to a different color than the original pixel color.
12. The non-transitory computer readable medium of claim 10, further comprising:
determining the position and displacement speed of the individual subjects to discard unlikely matches.
13. The method and system of claim 11, wherein the visual features of the marked individual subject comprises one or more from the group of feature vectors.
14. The method and system of claim 10, wherein the selection of the detected individual as the person of interest is one or more from the group of marking the subject by a user, automatically detecting the individual by matching the individual's biometrics.
15. The method and system of claim 10, wherein the frames in the first video comprise all the frames in the first video, selected frames in the first video.
16. The method and system of claim 10, wherein localizing a person of interest comprises one or more from the group of using the geopositioning data from a smart gadget's sensor that he/she carries, matching the geopositioning data with areas covered by cameras, manually detecting the presence of the person of interest in a video feed, using machine learning to identify a person.
17. The method and system of claim 10, wherein a first video feed comprises a video from a recording or live video from a camera feed.
CA3057931A 2019-10-08 2019-10-08 Method to identify and see a person in a public security camera feed without been able to see the rest of the people or tags Abandoned CA3057931A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA3057931A CA3057931A1 (en) 2019-10-08 2019-10-08 Method to identify and see a person in a public security camera feed without been able to see the rest of the people or tags

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CA3057931A CA3057931A1 (en) 2019-10-08 2019-10-08 Method to identify and see a person in a public security camera feed without been able to see the rest of the people or tags

Publications (1)

Publication Number Publication Date
CA3057931A1 true CA3057931A1 (en) 2021-04-08

Family

ID=75381942

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3057931A Abandoned CA3057931A1 (en) 2019-10-08 2019-10-08 Method to identify and see a person in a public security camera feed without been able to see the rest of the people or tags

Country Status (1)

Country Link
CA (1) CA3057931A1 (en)

Similar Documents

Publication Publication Date Title
US20200364443A1 (en) Method for acquiring motion track and device thereof, storage medium, and terminal
Kalra et al. Dronesurf: Benchmark dataset for drone-based face recognition
CN111291633B (en) Real-time pedestrian re-identification method and device
US20170213091A1 (en) Video processing
US20200012923A1 (en) Computer device for training a deep neural network
KR101781358B1 (en) Personal Identification System And Method By Face Recognition In Digital Image
JP2016072964A (en) System and method for subject re-identification
JP6789601B2 (en) A learning video selection device, program, and method for selecting a captured video masking a predetermined image area as a learning video.
Chen et al. Protecting personal identification in video
US11113838B2 (en) Deep learning based tattoo detection system with optimized data labeling for offline and real-time processing
KR20210020723A (en) Cctv camera device having assault detection function and method for detecting assault based on cctv image performed
KR20190093799A (en) Real-time missing person recognition system using cctv and method thereof
CA3057939A1 (en) Method that redacts zones of interest in an audio file using computer vision and machine learning
Sah et al. Video redaction: a survey and comparison of enabling technologies
KR101826669B1 (en) System and method for video searching
KR101547255B1 (en) Object-based Searching Method for Intelligent Surveillance System
US20230055581A1 (en) Privacy preserving anomaly detection using semantic segmentation
CA3057931A1 (en) Method to identify and see a person in a public security camera feed without been able to see the rest of the people or tags
Zhu et al. Recognizing irrelevant faces in short-form videos based on feature fusion and active learning
KR102194511B1 (en) Representative video frame determination system and method using same
Tran et al. Towards real-time secured IP camera via state machine architecture
Kosmopoulos et al. Vision-based production of personalized video
Koppikar et al. Face liveness detection to overcome spoofing attacks in face recognition system
JP2015158745A (en) Behavior identifier generation apparatus, behavior recognition apparatus, and program
KR102568495B1 (en) Artificial Intelligence Camera Capable of Real-time Mosaicing Using Edge Computing Technology

Legal Events

Date Code Title Description
FZDE Discontinued

Effective date: 20210831
