CA3057939A1 - Method that redacts zones of interest in an audio file using computer vision and machine learning - Google Patents

Method that redacts zones of interest in an audio file using computer vision and machine learning

Info

Publication number
CA3057939A1
Authority
CA
Canada
Prior art keywords
audio
interest
person
matching
track
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA3057939A
Other languages
French (fr)
Inventor
Alfonso F. De La Fuente Sanchez
Dany A. Cabrera Vargas
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CA3057939A priority Critical patent/CA3057939A1/en
Publication of CA3057939A1 publication Critical patent/CA3057939A1/en
Abandoned legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/78 - Detection of presence or absence of voice signals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24133 - Distances to prototypes
    • G06F 18/24137 - Distances to cluster centroïds
    • G06F 18/2414 - Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 - Protecting data
    • G06F 21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 - Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 - Protecting personal data, e.g. for financial or medical purposes
    • G06F 21/6254 - Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/50 - Context or environment of the image
    • G06V 20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/02 - Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B 27/031 - Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B 27/036 - Insert-editing
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/10 - Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B 27/19 - Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B 27/28 - Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 - Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/233 - Processing of audio elementary streams
    • H04N 21/2335 - Processing of audio elementary streams involving reformatting operations of audio signals, e.g. by converting from one coding standard to another
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 - Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 - Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N 21/23418 - Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs, involving operations for analysing video streams, e.g. detecting features or characteristics
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/45 - Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N 21/454 - Content or additional data filtering, e.g. blocking advertisements
    • H04N 21/4542 - Blocking scenes or portions of the received content, e.g. censoring scenes
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/45 - Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N 21/454 - Content or additional data filtering, e.g. blocking advertisements
    • H04N 21/4545 - Input to filtering algorithms, e.g. filtering a region of the image
    • H04N 21/45455 - Input to filtering algorithms, e.g. filtering a region of the image, applied to a region of the image
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 - Monomedia components thereof
    • H04N 21/8106 - Monomedia components thereof involving special audio data, e.g. different tracks for different languages
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Bioethics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Image Analysis (AREA)

Abstract

Method to replace zones of interest with silenced audio, specifically by detecting the lip sync of a person with the available audio using a trained machine learning system and video redaction tools. In a different embodiment of the invention, the audio is processed and analysed using speech recognition to identify words or phrases of interest or to identify unrecognized audio, then proceeding to redact the audio accordingly.

Description

METHOD THAT REDACTS ZONES OF INTEREST IN AN AUDIO FILE USING
COMPUTER VISION AND MACHINE LEARNING
INVENTORS: DE LA FUENTE SANCHEZ, Alfonso Fabian; CABRERA VARGAS, Dany Alejandro

BACKGROUND
Law enforcement agencies around the globe are using body cameras for their field agents and security cameras inside buildings. Before releasing video captured by those cameras to the public or to other entities, they need to protect the identities of victims, suspects, witnesses, and informants. There is a need for a software suite that can redact faces, license plates, addresses, voices, metadata, and more, to respond to FOIA (USA), GDPR (Europe), or Access to Information Act (Canada) requests in a timely manner without compromising anyone's identity.
Law enforcement agencies around the globe need to redact and enhance videos.
Manual video redaction and enhancement, even for a short video, can be difficult and take hours of work.
SUMMARY
In general, in one aspect, the invention relates to a method to replace zones of interest with silenced audio, specifically by detecting the lip sync of a person with the available audio using a trained machine learning system and video redaction tools. In a different embodiment of the invention, the audio is processed and analysed using speech recognition to identify words or phrases of interest or to identify unrecognized audio, then proceeding to redact the audio accordingly.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows a typical structure of a CNN model for human face detection.
FIG. 2A is a side view of a person and a frontal camera.
FIG. 2B is a side view of pedestrians with an overhead security camera.
FIG. 3A to 3C show illustrated steps for one embodiment of the invention.
FIG. 4A is a flowchart that shows how one embodiment of the invention works.
FIG. 4B shows a frontal picture used in one embodiment of the invention.
FIG. 5A shows a second embodiment of the invention using overhead camera footage.
FIG. 5B shows a different embodiment of the invention.
FIG. 6A shows a flowchart of the method for blurring.
FIG. 6B shows the blurring or masking module.
FIG. 7 is a flow chart of the system and method that describes how the localization of the subject takes place.
FIG. 8 is a flowchart that shows how we create a ground truth frame.
FIG. 9 is a flowchart, continuation of FIG. 8.
FIG. 10 describes a different embodiment of the invention.
FIG. 11 shows a diagram of a computing system.
FIG. 12 is a diagram describing the components of an input audio-video master file.
FIG. 13 shows a flowchart in accordance with one or more embodiments of our invention.
FIG. 14 shows a flowchart describing how the audio redacting is performed once the identified audio track is selected.
FIG. 15 shows a flowchart describing the system and method in accordance with one or more embodiments of the invention.
FIG. 16 shows a flowchart describing a different embodiment of the invention.
FIG. 17 shows a flowchart describing a different embodiment of the invention.
DETAILED DESCRIPTION
Specific embodiments of the technology will now be described in detail with reference to the accompanying FIGS. In the following detailed description of embodiments of the technology, numerous specific details are set forth in order to provide a more thorough understanding of the technology. However, it will be apparent to one of ordinary skill in the art that the technology may be practiced without these specific details.
In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description.
In the following description of FIGS., any component described with regard to a figure, in various embodiments of the technology, may be equivalent to one or more like-named components described with regard to any other figure. For brevity, descriptions of these components will not be repeated with regard to each figure. Thus, each and every embodiment of the components of each FIG. is incorporated by reference and assumed to be optionally present within every other FIG. having one or more like-named components. Additionally, in accordance with various embodiments of the technology, any description of the components of a FIG. is to be interpreted as an optional embodiment, which may be implemented in addition to, in conjunction with, or in place of the embodiments described with regard to a corresponding like-named component in any other figure.
FIG. 1 shows a typical structure of a CNN model for human face detection. Our system relies on Computer Vision (a discipline in Artificial Intelligence) to detect and localize the subject of interest in a video and differentiate this person from other people in the same scene. We take advantage of Machine Learning techniques to increase the accuracy of our system, specifically a subset of techniques commonly known as deep learning.
Deep learning refers to the use of a Convolutional Neural Network (CNN) (100), which is a multi-layered system able to learn to recognize visual patterns after being trained with millions of examples (i.e. what a person should look like in a picture).
Once trained, a CNN model can understand images it has never seen before by relying on what it learned in the past. This image-understanding method can also be applied to video footage, as video can be understood as a sequence of images (known as "frames") and individual video frames can be analyzed separately.
A CNN(100) is trained using an "image dataset" (110) which comprises a vast collection of image samples, manually annotated to indicate what the CNN (100) should learn. As an example, a CNN (100) can be trained to detect faces using a dataset of thousands of images (110) where someone has manually indicated the position and size of every face contained on every image. This process of annotating the training information in a dataset is known as "labeling a dataset". The bigger and more diverse a dataset (110) is, the more the CNN (100) can learn from it whilst increasing its accuracy.
For illustrative purposes, FIG. 1 shows a common CNN (100) architecture for face recognition. The CNN (100) can be seen as a set of interconnected "layers"
(101, 102, 103, 104) each containing "neurons" (120) which are implemented as units that produce a numerical output from a set of numerical inputs through some mathematical formula.
Neurons (120) are highly interconnected (mimicking the human brain) so their outputs can serve as inputs for others. The types of layers, neurons, and mathematical formulas that govern their connections are diverse, and it requires domain knowledge to determine the optimal structure and mathematical formulas a CNN model (100) should have to serve a specific purpose.
Further describing the inner workings of a CNN, the illustrative example of FIG. 1 shows how an input image from a dataset (110) can be provided to the CNN's input layer (101), which produces patterns of local contrast (111). That information is fed to the first hidden layer (102), which produces an output with the face features (112); that information is then fed to the second hidden layer (103), which after processing produces a recognition of faces (113); the result is then fed to the output layer (104), producing a set of possible matches. This structure is one of many that can be used to model a CNN to produce a prediction given an input image.
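To make the layered structure described above concrete, the following is a minimal, illustrative sketch of a small CNN written in PyTorch. It is not the patent's model; the layer sizes, the 64x64 input, and the two-class output are assumptions chosen only to loosely mirror the input layer, hidden layers, and output layer of FIG. 1.

import torch
import torch.nn as nn

class TinyFaceCNN(nn.Module):
    def __init__(self, num_classes: int = 2):  # e.g. face vs. not-face
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # first block (analogue of the input layer 101)
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # analogue of the first hidden layer 102
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # analogue of the second hidden layer 103
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)      # analogue of the output layer 104

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

# Example: one 64x64 RGB crop in, two raw class scores out.
scores = TinyFaceCNN()(torch.randn(1, 3, 64, 64))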
As one would expect, a CNN (100) will only learn what it was taught; for example, a CNN trained to detect human faces cannot detect full bodies. It will also encounter difficulties when the environmental conditions change (i.e. natural and artificial lighting conditions, weather changes, image noise and quality, video camera recording format) unless it has been trained to handle such conditions with a sufficiently vast and diverse dataset.
In practice, a single task might require not a single CNN but a group of them;
for example, a smaller CNN might be used to improve the accuracy of the output from a bigger CNN, or the final result might be obtained from averaging individual results of a set of different CNNs (often called an "ensemble"). The best architecture for a specific purpose is usually obtained after experimentation.
FIGS. 2A and 2B show how different types of camera footage are handled. As described in FIG. 1, a CNN will only learn what it was taught; a CNN trained to recognize footage of people filmed from the front will struggle to recognize footage of people recorded from a ceiling. The system of our invention works when the camera is installed facing people from the front as shown in FIG. 2A, and when the camera is installed in a high place to film people from above as shown in FIG. 2B. We describe both cases in the following sections.
FIG. 2A is a side view of a person and a frontal camera. It shows a person (200) facing the front of a camera (201). Following the concepts discussed in FIG. 1, in the interest of optimizing performance and accuracy, our system includes two modules that act depending on the camera's position and distance from the people being filmed.
Note that the methods described for both cases can be used simultaneously in the same video, since certain video footage might contain people from both cases. An example of a frontal camera is a body camera.
FIG. 2B is a side view of pedestrians with an overhead security camera. It shows a side view when the camera (202) is installed in a high place to film people (203, 204) from above. One familiar with the art will appreciate that the angle from where the camera is filming or capturing the image of the person or pedestrian (203, 204) is different from what is shown in FIG. 2A. As such, it needs a different method to process the subject recognition described in this document.
FIGS. 3A to 3C show illustrative steps for one embodiment of the invention. They exemplify frontal camera footage. FIG. 3A shows the original image with four persons (301, 302, 303, 304). FIG. 3B shows how faces (311, 312, 313, 314) are detected and extracted from the original image for further analysis. FIG. 3C shows how all faces are transformed to a mathematical representation for comparison purposes (feature vectors).

Continuing with the description of FIGS. 3A to 3C, in the same example using the frontal camera footage, when the camera is installed to record people from the front so that their face is exposed and recognizable as in FIG. 2A, the system of our invention relies on two modules: the first, a face-detection module including but not limited to a CNN architecture that detects and localizes faces in every video frame, obtaining a bounding box for every face so that we can extract them as illustrated in FIG. 3B; and second, a face recognition module including but not limited to a CNN architecture for face recognition that is able to determine that a certain face belongs to a certain person.
This process needs to transform every face found into a mathematical representation that we can compare against the representations of the subject of interest, as illustrated in FIG. 3C.
The face-detection module is implemented by reusing an existing network architecture meant for object detection. Some examples of existing compatible CNN architectures are faced, YOLOv3, and MTCNN, among others. These architectures can be re-trained to detect faces in the scene environment conditions and video format required by the system, or alternatively, their structure can be used as inspiration to formulate a customized face-detection CNN model for this use case.
This "make or reuse" decision is often made based on a small study where we test multiple architectures against actual video footage provided by the user to obtain a quantitative benchmark that reveals the option with the highest accuracy and performance for the technical particularities of the user's camera systems.
On the other hand, the process of transforming an image into its representative mathematical form as illustrated in FIGS. 3B and 3C is a process known as "feature extraction". Given an image, this process obtains a set of numbers (known as a "feature vector") (321, 322, 323, 324) that represents the image's (311, 312, 313, 314) visual features. Feature vectors can be compared against each other using mathematical formulas (i.e. by their n-dimensional Euclidean distance), allowing the system to determine the probability of two faces belonging to the same person. This step in our system relies on CNN models for facial feature extraction like FaceNet and OpenFace, to name a few published by the research community.
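As an illustration of the comparison just described, the short sketch below computes the n-dimensional Euclidean distance between two feature vectors with NumPy and applies a decision threshold. The 128-dimensional vectors and the 0.6 threshold are illustrative assumptions, not values specified by the patent.

import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # n-dimensional Euclidean distance between two feature vectors.
    return float(np.linalg.norm(a - b))

def same_person(vec_a: np.ndarray, vec_b: np.ndarray, threshold: float = 0.6) -> bool:
    # A smaller distance means the two face crops are more likely the same person.
    return euclidean_distance(vec_a, vec_b) < threshold

subject_vec = np.random.rand(128)    # in practice, produced by a model such as FaceNet or OpenFace
candidate_vec = np.random.rand(128)
print(same_person(subject_vec, candidate_vec))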
FIG. 4A shows a flowchart in accordance with one or more embodiments of the invention. While the various steps in these flowcharts are presented and described sequentially, one of ordinary skill will appreciate that some or all of the steps may be executed in different orders, may be combined or omitted, and some or all of the steps may be executed in parallel. Furthermore, the steps may be performed actively or passively. The flowchart shows how one embodiment of our invention works. It shows the steps for our face recognition method, as described in FIGS. 1 through 3, including our method for blurring all faces that don't belong to a person of interest (from now on called "the subject" or "subject of interest") for the same example described in FIGS. 3A to 3C.
Steps 401 and 402 show the inputs to our method.
Step 401 - Video file: The user provides the system a video file where the person of interest appears.
Step 402 - Pictures of the subject: The user loads one or multiple front-face images of the subject of interest into our system. If this is not possible, our system will alternatively present an interface that facilitates the gathering of these images from the video itself.
After preparing the input, the following is a description of the method's steps:

Step 411 - Extract subject's features: Through our face recognition module, the system will obtain the feature vectors of the pictures of the subject and preserve them for later.
Step 412 - Detect faces in all frames: For every frame in the video, the system will use our face detection module to detect and localize all faces, obtaining a "bounding box" for every face found.
Step 413 - Extract features from all faces: For every face found, the system will use our face recognition module to obtain their corresponding feature vectors.
Step 414 - Compare all faces: The system will use our face recognition module to compare all the feature vectors obtained in Step 413 with the feature vectors obtained for our subject of interest in Step 411. For all faces in the video, this process will output a probability of the face belonging to the subject of interest (measured as a number from 0.0 to 1.0).
Step 415 - Label the subject: All faces with a high probability of belonging to the subject of interest are labeled as positive cases, while every other face is labeled as a negative case. One familiar with the art will appreciate that a face could also be reflected in mirrors or shiny objects, such as chromed items for example. In a different embodiment of the invention, the system allows the user or operator to select objects of interest, for example faces, and then the system redacts or masks these objects from the scene. One familiar with the art will appreciate that masking can also be described as blacking out. In a different embodiment of the invention, when redacting complicated scenes that may contain a large number of objects with personally identifiable information, for example multiple people and objects, the operator of the system can select an area to be blurred instead of just blurring individual objects of interest.
Step 416 - Blur faces: Using our blur module, all faces labeled as negative cases are blurred. The blurring process is also known as redacting the objects from the scene, wherein the objects include faces, text, and other identifiable objects.
Step 417 - The output produced by our method is a video where all faces different from the subject of interest will appear blurred.
In a different embodiment of the invention, once the system and method successfully recognizes the subject, it uses the same feature vectors in other videos where the subject of interest is present, which makes possible the recognition of the subject in multiple videos using the same process. In a different embodiment of our invention, after running all the steps described above, if the pictures provided for the subject don't suffice for its successful recognition, the user can try again or start a new process from Step 402 by loading a different set of pictures or by gathering the subject's pictures from a video where the face appears clearly.
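The following hedged sketch strings Steps 411 through 417 together. The helpers detect_faces(), embed_face(), and blur_region() are hypothetical stand-ins for the face-detection, face-recognition, and blur modules described above, not APIs defined by the patent, and the 0.6 distance threshold is an assumption.

import numpy as np

def redact_video(frames, subject_vectors, detect_faces, embed_face, blur_region,
                 threshold: float = 0.6):
    # frames: iterable of numpy image arrays; subject_vectors: feature vectors from Step 411.
    redacted = []
    for frame in frames:
        out = frame.copy()
        for box in detect_faces(frame):                                   # Step 412: bounding boxes
            vec = embed_face(frame, box)                                  # Step 413: feature vector
            dist = min(np.linalg.norm(vec - s) for s in subject_vectors)  # Step 414: compare
            if dist >= threshold:                                         # Step 415: negative case
                out = blur_region(out, box)                               # Step 416: blur
        redacted.append(out)
    return redacted                                                       # Step 417: output video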
FIG. 4B shows a frontal picture and how each face is marked with a box.
FIG. 5A shows a second embodiment of the invention using overhead camera footage.
It shows when the camera is installed in a ceiling or some kind of high place to film, capture, or record the images of people from above. In the example shown in FIG. 5A, the system is not expected to detect the faces of people on the ground, as most often they will not be recognizable by traditional automated computer vision methods due to lower video quality (required for reasonable storage costs) and a camera angle that does not allow full face exposure. Instead, we track the entire person's body (501, 502, 503, 504, 505, 506, 507) using two modules: the first is a pedestrian detection module comprising a CNN architecture that detects and localizes pedestrians (501, 502, 503, 504, 505, 506, 507) in every video frame, obtaining a bounding box for every pedestrian so that we can extract them as illustrated in FIG. 5B; the second is a pedestrian recognition module comprising a CNN architecture for pedestrian recognition that is able to determine the probability that one of the detected pedestrians (511, 512, 513, 514, 515, 516, 517) looks the same as the subject of interest. This process needs to transform every pedestrian found into a mathematical representation that we can compare against the representations of the subject of interest, as illustrated in FIG. 5C.
FIG. 5B shows a different embodiment of our invention, in which the pedestrian detection module reuses an existing network architecture meant for object detection. Some generic examples of compatible network architectures are MobileNets, Inception-v4, and Faster R-CNN, to name a few. These architectures can be re-trained to detect pedestrians (501, 502, 503, 504, 505, 506, 507) in the scene environment conditions and video format required by the user, or alternatively, their structure can be used as inspiration to formulate a customized pedestrian detection CNN model for this use case. This "make or reuse" decision is made based on a small study where we test multiple alternatives against actual video footage provided by the user to obtain a quantitative benchmark that reveals the option with the highest accuracy and performance for the technical particularities of the user's camera systems. The output is known as the pedestrian detection's probability.
Note that in cases where the pedestrian detection's probability isn't confident enough to make a decision on marking a pedestrian (i.e. the pedestrian doesn't look as expected or is covered by other people, etc.), further information like the position and speed of pedestrians in previous and future video frames might be used to refine and support the decision.

In a different embodiment of our invention, the process of transforming an image (511, 512, 513, 514, 515, 516, 517) into its representative mathematical form (521, 522, 523, 524, 525, 526, 527) as illustrated in FIG. 5C is the same "feature extraction" process explained for the frontal camera case. We again rely on the feature descriptor generation capabilities of existing generic CNN object detection models to obtain the feature vector of every pedestrian.
FIG. 6A shows a flowchart describing the method for blurring or masking all pedestrians or people that don't belong to a person of interest (from now on called "the subject") in accordance with one or more embodiments of the invention.
The inputs to our method are:
Step 620 - Video file: The user provides the system a video file where the person of interest appears.
Step 621 - Subject selection: After all pedestrians in the video have been detected, the system will ask the user to select (i.e. click on) the person of interest. This operation might be required more than once in order to improve detection results.
The method steps can be described as follows:
Step 630 - Detect pedestrians in all frames: Using our pedestrian detection module we obtain the bounding box for all pedestrians present for every frame in the video.

Step 631 - Extract features from all pedestrians: For every pedestrian found, the system will use our pedestrian recognition module to obtain their corresponding feature vectors.
Step 633 - Extract subject's features: The user will manually select the detected pedestrian that matches the subject of interest. Further frames of the same pedestrian detected with a high degree of confidence might be used to increase the amount of information on this pedestrian.
Step 634 - Compare all pedestrians: The system will use our pedestrian recognition module to compare all the feature vectors obtained in Step 631 with the feature vectors obtained for our subject of interest in Step 633. For all pedestrians in the video, this process will output a probability of the pedestrian being the subject of interest (measured as a number from 0.0 to 1.0).
Step 635 - Label the subject: All pedestrians with a high probability of being the subject of interest are labeled as positive cases, while every other pedestrian is labeled as a negative case.
Step 636 - Blur faces: Using our blur module, blur all faces belonging to pedestrians labeled as negative cases.
FIG. 6B shows the blurring or masking module. In all cases of detected faces or pedestrians or subjects of interest, we use a separate face blur module that is in charge of masking or blurring the person's head (601).
One embodiment of our invention uses the following blurring methods on an original image (601) where the subject's face (611) is not covered or blurred. In a first embodiment of the invention, it uses the covering method (602): in this case, we cover the head (611) with a solid color polygon (602) (i.e. a black ellipse) to completely hide it.
In a different embodiment of the invention, the blurring method (603) is used.
In this case the system applies a "pixelation" (613) filter to a polygon (i.e. an ellipse) that surrounds the head. The pixelation filter will divide the head area into a "grid", and for every cell in the grid, replace it with a square with the average color of all pixels contained in said cell.
In a different embodiment of the invention - to protect against algorithms that try to recover the original image using different methods including but not limited to reverse engineering a blurred image - the system randomly switches pixel places to make recovery impossible. In a different embodiment of the invention, the polygon, size of pixels, and amount of pixels switched are user-configurable. One familiar with the art will appreciate that this method can also be used to blur ID tags as well (when text is detected), by switching the blur area polygon to a rectangle. These features are user configurable too.
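For illustration, the sketch below applies the grid-averaging pixelation described above to a rectangular head region of a NumPy image array and then randomly switches a number of pixel places, as the text suggests, to hinder reconstruction of the original. The cell size and swap count stand for the user-configurable values mentioned above and are arbitrary here.

import numpy as np

def pixelate_region(frame: np.ndarray, x: int, y: int, w: int, h: int,
                    cell: int = 16, swaps: int = 500) -> np.ndarray:
    out = frame.copy()
    region = out[y:y + h, x:x + w]
    # Divide the area into a grid and replace every cell with its average colour.
    for cy in range(0, h, cell):
        for cx in range(0, w, cell):
            block = region[cy:cy + cell, cx:cx + cell]
            block[...] = block.reshape(-1, block.shape[-1]).mean(axis=0).astype(out.dtype)
    # Randomly switch pixel places inside the region.
    rows = np.random.randint(0, h, size=(swaps, 2))
    cols = np.random.randint(0, w, size=(swaps, 2))
    for (r1, r2), (c1, c2) in zip(rows, cols):
        region[r1, c1], region[r2, c2] = region[r2, c2].copy(), region[r1, c1].copy()
    return out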
When the face of the person has already been detected and localized (i.e.
after face detection), the system applies the blur method to the area of the face.
However, when dealing with the whole human body (as in the pedestrian detection module described above), the head must first be accurately localized, and its size and position will vary depending on the camera angle and the subject's pose (i.e. the geometrical coordinates of the head of a person standing and a person sitting will appear different).
To localize the head on detected pedestrians or subjects of interest, this module includes a head detection method based on a CNN architecture for human pose estimation (i.e. OpenPose, among others). The pose estimation CNN outputs an approximation of the head's position and size, which the system then proceeds to obfuscate.
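A minimal sketch of that head localization step, under stated assumptions: the pose estimation model is assumed to return named keypoints such as "nose" and "neck", and the nose-to-neck distance is used as a rough head radius. The keypoint names and the size heuristic are illustrative and not taken from any specific pose estimation library.

from typing import Dict, Tuple

def head_box(keypoints: Dict[str, Tuple[float, float]]) -> Tuple[int, int, int, int]:
    # keypoints: {"nose": (x, y), "neck": (x, y), ...} as estimated for one person.
    nose_x, nose_y = keypoints["nose"]
    _, neck_y = keypoints["neck"]
    radius = max(abs(neck_y - nose_y), 8.0)  # rough head radius in pixels
    return int(nose_x - radius), int(nose_y - radius), int(2 * radius), int(2 * radius)

# The returned (x, y, width, height) box is then passed to the blur/masking module.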

In a different embodiment of the invention, in regard to text detection, the system of our invention uses reliable scene text detection methods that are available and implemented for direct application. In a different embodiment of the invention, the system uses methods that achieve higher degrees of accuracy and utilize deep learning models; for example, the EAST, Pyrboxes, and Textboxes++ methods, to name a few published by the research community. In a different embodiment of the invention, after the detection and localization of the text in the video footage has occurred, the system then blurs any text that is positioned inside the bounding box of any detected pedestrian or below a nearby detected face. Examples of text that one may want to blur for privacy concerns are name tags, licence plates, and any other text that needs to be protected for privacy reasons.
FIG. 7 shows a flowchart describing how the localization of the subject is performed in accordance with one or more embodiments of the invention.
Step 701 - Localizing a subject of interest in a geographical area. One familiar with the art will appreciate that this could take the form of a coordinate, a region, a building, a room, or a similar area such as a street or plaza. One familiar with the art will also appreciate that a subject of interest might be a person, or part of a person, for example the face of a person. The subject of interest could also be an animal, a tag, or a licence plate, to name a few.
Step 702 - Does the subject of interest carry a geolocation sensor? The geolocation sensor is connected to a remote server from which the device running our software can retrieve this data. If yes, proceed to Step 703; if no, proceed to Step 704.
Step 703 - Retrieve the geopositioning data from a smart gadget. Localizing a person of interest comprises using the geopositioning data from a sensor in a smart gadget that the person of interest carries. For example, smart gadgets such as smartphones have geopositioning sensors; the data collected from those sensors can be transmitted to a remote server or directly to the device running the software of our invention, and, by using software applications, one with the appropriate permissions could access the geopositional data generated by the smart gadget of the subject of interest. With such data, one could identify the location of the subject of interest using coordinates sent by the smart gadget. In a different example, when the subject of interest is an internet of things device, such as an autonomous vehicle, a shared bicycle, or a drone (to name a few), those devices by themselves already have geopositioning sensors and data transmission capabilities that transmit their localization. Then proceed to Step 705.
Step 704 - Detecting the presence of the person of interest in an area, manually, by automatic face detection, or by other means of detection such as a request from the person of interest, who provides the date, time, and location where they were present. For example, people in Canada have a legislated right to request access to government information under the Access to Information Act and to their own personal information under the Privacy Act. Then proceed to Step 705.

FIG. 10 describes the process for the face detection.
Step 705 - Matching the localization of the subject of interest with a first video, wherein the first video comprises a video from a recording or a live video feed from a camera, for example a security camera or a smart gadget. The frames in the first video comprise all the frames in the first video or selected frames in the first video.
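A hypothetical sketch of the matching in Step 705: given a geopositioning fix from the subject's smart gadget and a camera described by its coordinates, coverage radius, and recording period, the code checks whether the subject was plausibly within the camera's view while the first video was recorded. The camera record format, the coverage radius, and the haversine test are assumptions made for illustration.

import math
from datetime import datetime

def haversine_m(lat1, lon1, lat2, lon2) -> float:
    # Great-circle distance in metres between two latitude/longitude points.
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def subject_in_view(fix, camera) -> bool:
    # fix: (lat, lon, timestamp); camera: dict with lat, lon, radius_m, start, end.
    lat, lon, ts = fix
    in_range = haversine_m(lat, lon, camera["lat"], camera["lon"]) <= camera["radius_m"]
    in_time = camera["start"] <= ts <= camera["end"]
    return in_range and in_time

# Example with invented values:
fix = (49.2820, -123.1171, datetime(2019, 10, 7, 14, 30))
camera = {"lat": 49.2821, "lon": -123.1170, "radius_m": 50.0,
          "start": datetime(2019, 10, 7, 14, 0), "end": datetime(2019, 10, 7, 15, 0)}
print(subject_in_view(fix, camera))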

In a different embodiment of the invention, the localization of the subject is made manually, either from live video feed monitoring or from pre-recorded tapes of different cameras and areas. For example, video footage may be reviewed by employees to identify scenes containing certain individuals. Such video footage may come from a single camera or multiple cameras, in a single location or multiple locations.
In this example, the first video mentioned in Step 704 of FIG. 7 will be considered a manually identified video or videos containing the image of the subject of interest.
FIG. 10 shows how a face detection can replace the manual entry.
FIG. 8 shows a flowchart describing how the invention creates a ground truth frame when the images come from a camera installed facing people from above in accordance with one or more embodiments of the invention. Wherein the subjects are pedestrians.
FIG. 8 is a continuation of FIG. 7.
Step 801 - Video selection: Loading a video where the subject of interest is present, as described in FIG. 7.
Step 802 - Selecting a first still frame from the first video; usually it is one from the first group of frames where the subject of interest appears in the video.
For example, consider a video of a thief taken by a security camera, where other subjects are also present in the video, for example a girl and a policeman. All three subjects are identified, but only one individual is the subject of interest whose face we want to show; the faces of the other identified subjects (the girl and the policeman) need to be masked.

Step 803 - Detecting and localizing individual subjects in the first still frame by using pre-trained convolutional neural networks.
Step 804 - Obtaining a bounding box for all individual subjects present in frames of the first video, wherein the frames include the first still frame.
Step 805 - Creating a first ground truth frame by selecting and marking a detected individual subject as the marked subject at the first still frame.
Wherein the selection of the detected individual as the person of interest comprises one or more from the group of: marking of the subject by a user, and automatically detecting the individual by matching the individual's biometrics. One familiar with the art will appreciate that biometrics is the technical term for body measurements and calculations. Biometrics refers to metrics related to human characteristics.
Biometrics authentication (or realistic authentication) is used in computer science as a form of identification and access control. It is also used to identify individuals in groups that are subjects of interest or under surveillance. Biometric identifiers are the distinctive, measurable characteristics used to label and describe individuals. Biometric identifiers are often categorized as physiological versus behavioral characteristics. Physiological characteristics are related to the shape of the body.
FIG. 9 shows a flowchart describing the method to identify a person and masking their faces in accordance with one or more embodiments of the invention. FIG. 9 is a continuation of FIG. 8.
Step 901 Obtaining a set of features for the marked subject by extracting the visual features of the marked subject from any of the subsequent frames and any contiguous frame where the marked subject is determined to be present.

Wherein the visual features of the marked individual subject comprise one or more feature vectors.
Step 902 Obtaining feature vectors for all individual subjects detected.
Obtaining, for every other frame where the marked subject is determined to be present, feature vectors for all individual subjects detected. Then the next step.
Step 903 - Computing the vector distance between the feature vectors obtained and the feature vectors stored for the marked subject.
Step 904 - Determining if any of the individual subjects matches the marked subject, and also determining the position and displacement speed of the individual subjects to discard unlikely matches.
Step 905 masking every detected individual subject that does not match the marked subject. Wherein masking comprises one or more from the group of blurring, changing individual pixels to a different color than the original pixel color.
FIG. 10 shows a flowchart describing a different embodiment of the invention, when the camera captures the images facing people or subjects from the front, where their faces are exposed and recognizable, in accordance with one or more embodiments of the invention.
Step 1001 - Load a video recording where the person of interest appears. For example, a person walks into an office that has a security camera system that includes personal (face) cameras or overhead (pedestrian) cameras.

Step 1002 - Face detection. Detect and locate faces in the video using a pre-trained deep learning model for face detection and localization. This model can be obtained by training (or re-training) a Convolutional Neural Network (CNN) with an architecture that supports face detection, configured to detect and localize a single class (human face). Examples of network architectures for this step are faced, YOLOv3, and MTCNN, amongst other public alternatives. In a different embodiment of the invention, as each technique performs differently depending on the video quality, scene complexity, and distance to the subject, among other factors, the system uses one or more face detection techniques.
As output, this step infers the position and size of the "bounding box" for all faces in the frame.
Step 1003 - Assemble the set of faces detected in this frame into the frame's "face group". Having obtained a bounding box for all faces in the frame, each bounding box's image contents are copied and stored to be processed separately from the rest of the frame. This method refers to the set of cropped faces obtained this way as the "face group" for the frame.
Step 1004 - Group feature extraction: For each face in the "face group", encode the face into a "feature vector" as explained for the feature extraction process above (FIGS. 3B and 3C).
Step 1005 - Feature matching. Compare the feature vectors of all faces in this frame's "face group" with the feature vectors available for the person of interest (a process often called "distance measurement").
Step 1006 - Does the "face group" contain a face where the distance measured is low enough to be a match? If yes, proceed to Step 1007; if not, return to Step 1002. If the "face group" contains a face where the distance measured is low enough (within thresholds configurable by the user), it is "marked" as the person of interest.
Step 1007 - Masking every face in the "face group" not marked as the person of interest. Wherein masking comprises one or more from the group of blurring, changing individual pixels to a different color than the original pixel color.
In a different embodiment of the invention, in every video frame where the subject is successfully recognized, a record is produced on the coordinates of the subject's face. If due to motion blur or noise the subject's face cannot be recognized, the position of the face between frames is predicted based on the face's coordinates in previous and future frames.
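A minimal sketch of that prediction, assuming simple linear motion between the last and next frames where the face was recognised; the frame indices and box values in the example are made up.

def interpolate_box(f, f0, box0, f1, box1):
    # box = (x, y, w, h); f0 < f < f1 are frame indices where box0 and box1 were recognised.
    t = (f - f0) / float(f1 - f0)
    return tuple(a + t * (b - a) for a, b in zip(box0, box1))

# Face known at frames 10 and 20, position predicted at frame 15.
print(interpolate_box(15, 10, (100, 80, 40, 40), 20, (140, 90, 40, 40)))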
Embodiments of the invention may be implemented on a computing system. Any combination of mobile, desktop, server, router, switch, embedded device, or other types of hardware may be used. For example, as shown in FIG. 11, the computing system (1100) may include one or more computer processors (1101), non-persistent storage (1102) (for example, volatile memory, such as random access memory (RAM), cache memory), persistent storage (1103) (for example, a hard disk, an optical drive such as a compact disk (CD) drive or digital versatile disk (DVD) drive, a flash memory, etc.), a communication interface (1104) (for example, Bluetooth interface, infrared interface, network interface, optical interface, etc.), and numerous other elements and functionalities.
The computer processor(s) (1101) may be an integrated circuit for processing instructions. For example, the computer processor(s) may be one or more cores or micro-cores of a processor. The computing system (1100) may also include one or more input devices (1110), such as a touchscreen, keyboard, mouse, microphone, touchpad, electronic pen, or any other type of input device.

The communication interface (1104) may include an integrated circuit for connecting the computing system (1100) to a network (not shown) (for example, a local area network (LAN), a wide area network (WAN) such as the Internet, mobile network, or any other type of network) and/or to another device, such as another computing device.
Further, the computing system (1100) may include one or more output devices (1106), such as a screen (for example, an LCD display, a plasma display, touch screen, cathode ray tube (CRT) monitor, projector, or other display device), a printer, external storage, or any other output device. One or more of the output devices may be the same or different from the input device(s). The input and output device(s) may be locally or remotely connected to the computer processor(s) (1101), non-persistent storage (1102), and persistent storage (1103). Many different types of computing systems exist, and the aforementioned input and output device(s) may take other forms.
Software instructions in the form of computer readable program code to perform embodiments of the invention may be stored, in whole or in part, temporarily or permanently, on a non-transitory computer readable medium such as a CD, DVD, storage device, a diskette, a tape, flash memory, physical memory, or any other computer readable storage medium.
Specifically, the software instructions may correspond to computer readable program code that, when executed by a processor(s), is configured to perform one or more embodiments of the invention.
The computing system (1100) in FIG. 11 may be connected to or be a part of a network.
For example, a network may include multiple nodes (for example, node X, node Y).
Each node may correspond to a computing system, or a group of nodes combined may correspond to the computing system shown in FIG. 11. By way of an example, embodiments of the invention may be implemented on a node of a distributed system that is connected to other nodes. By way of another example, embodiments of the invention may be implemented on a distributed computing system having multiple nodes, where each portion of the invention may be located on a different node within the distributed computing system. Further, one or more elements of the aforementioned computing system (1100) may be located at a remote location and connected to the other elements over a network.
FIG. 12 is a diagram describing the components of an input audio-video master file (1200) comprising one or more tracks (1201, 1202, 1203); tracks can be audio (1202), video (1201), or metadata files (1203), to name a few. Each video file is made of frames. Each frame has a timestamp which corresponds to a linear timeline. A track can be edited or cut. When edited, it can be split into other tracks or combined with other tracks. For example, a single audio track may contain a conversation between two people; if the audio for each person is identified, that single track can be split into two or more tracks, each track with each individual's voice. In a reverse scenario, a couple of tracks can be edited to become one track. When redacting an audio file, the content of a portion of the audio file can be cut or replaced with a different audio source. For example, such an audio source may take the form of silence, a beep, a continuous tone, or a different voice. The metadata track may also contain other information such as the source of the video, date and time, and other data.
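As an illustration of replacing a portion of an audio track with a different audio source, the sketch below overwrites a time range of a mono track (held as a NumPy array of float samples in the range -1 to 1) with either silence or a continuous tone ("beep"). The sample rate, tone frequency, and amplitude are assumptions, not values from the patent.

import numpy as np

def redact_segment(samples: np.ndarray, sample_rate: int, start_s: float, end_s: float,
                   mode: str = "beep", freq_hz: float = 1000.0) -> np.ndarray:
    out = samples.copy()
    i0, i1 = int(start_s * sample_rate), int(end_s * sample_rate)
    if mode == "silence":
        out[i0:i1] = 0.0
    else:
        # Replace the segment with a continuous tone.
        t = np.arange(i1 - i0) / sample_rate
        out[i0:i1] = 0.3 * np.sin(2 * np.pi * freq_hz * t)
    return out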
FIG. 13 shows a flowchart in accordance with one or more embodiments of our invention. While the various steps in these flowcharts are presented and described sequentially, one of ordinary skill will appreciate that some or all of the steps may be executed in different orders, may be combined or omitted, and some or all of the steps may be executed in parallel. Furthermore, the steps may be performed actively or passively.
FIG. 13 shows a flowchart describing the system and method to automatically replace zones of interest with alternate audio from an audio and video recording or feed in accordance with one or more embodiments of the invention. Continuing with the description of FIG. 13, the zone of interest for audio may include, for example, any audio that contains information that is personally identifiable and private to anyone who is not the interested party - an interested party may be, for example, a requestor of information, a judge, lawyer, police officer, or any other person who has an interest in a person of interest in a video but should keep the privacy of others intact. For example, the video is a recording of an interview, and during the interview a person of interest, in this case an alleged criminal, answers a question by providing the name and address of an acquaintance. The name and address of the acquaintance would need to be redacted by bleeping that portion of the audio.
Step 1301 - Master audio track input. The master audio track can be a single track (mono) or multiple tracks, i.e. left and right audio tracks (stereo).
The master audio track as well as the video track are part of a linear timeline that runs simultaneously. When used with a video file that contains audio, the audio is stored in the audio track of the video file, thus the name audio-video file or recording.
Step 1302 - Matching a first audio tone (for example, the pitch of the voice, the speed of speech, or the language), from a master audio track, to a first person while the first person moves their lips. This process can be automatic or manual; a sketch of the automatic matching idea appears after Step 1306 below.
Automatic matching uses computer vision to identify when the person moves their lips.

Manual matching is performed by the operator identifying the person they think or know is the person who is talking. The automatic process includes using a pre-trained machine learning convolutional neural network architecture to identify lips on a face that move at the same time the audio track reproduces the sound of the voice of a person. If a person is detected to move their lips while a voice recording is detected in the audio track, the probability that the person moving their lips is the one talking is high enough to mark that person as the speaker. When two or more persons become subjects of interest because they move their lips when the sound of a voice is detected in the audio track, an exhaustive process of elimination is performed by evaluating the audio and video tracks: when a person who is suspected to be the person of interest does not move their lips while the audio track contains the voice to which that person was matched, then, by a process of elimination, that person is no longer the subject of interest producing that voice. This process repeats until all of the individuals identified in the video are matched to a voice recording. Each voice can be attributed to its individual owner using speaker identification, which refers to identifying the speaker rather than what they are saying. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person's voice, or it can be used to authenticate or verify the identity of a speaker as part of a security process.
Step 1303 - Isolating a first identified audio track from the rest of the audio based on the first audio tone. This process is then repeated for other individuals in the same audio-video recording or track. The number of confluent individuals at the same time in a segment of the recording could represent the number of individuals speaking. At any given time, the system may or may not have a view of the individual's lips. It is when the computer vision system has access to the lips of the face of the individual that the match can be made. Once the individual's voice has been matched to its owner, it is not necessary for the camera to have a view of the lips of the owner of the voice to identify that individual as the one talking. This process is important as the final redacted video may require removing the audio from a specific individual, or leaving only the audio of the specific individual and redacting the rest of the audio.

Step 1304 - Matching a second audio tone, from the master audio track, to a second person while the second person moves their lips.
Step 1305 - Isolating a second identified audio track from the rest of the audio based on the second audio tone.
Step 1306 - Making a determination of the identified audio track to silence.
If the audio track is divided into individual tracks by the voice of each individual, one familiar with the art will appreciate that an audio track in which two or more individuals talk at the same time can be redacted so that only one individual's voice is heard on the final redacted audio track, or, conversely, so that one individual's voice is redacted or silenced while all other voices are heard.
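For illustration only, the following sketch shows the two redaction modes just described, assuming the per-person segment lists and the isolate_track helper from the previous sketch.

    # Minimal sketch of the two redaction modes discussed above, operating on the
    # per-voice segment lists produced by the matching step.
    import numpy as np

    def silence_person(master, sample_rate, person_segments):
        """Return the master track with one individual's (start_s, end_s) spans zeroed."""
        redacted = master.copy()
        for start_s, end_s in person_segments:
            redacted[int(start_s * sample_rate):int(end_s * sample_rate)] = 0
        return redacted

    # Keeping only one individual is simply that individual's isolated track from step 1303:
    #   final_track = isolate_track(master, rate, segments_of_person_1)
    # Removing only one individual while keeping every other voice:
    #   final_track = silence_person(master, rate, segments_of_person_2)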
FIG. 14 shows a flowchart describing how the audio redacting is performed once the identified audio track is selected in accordance with one or more embodiments of the invention. FIG. 14 is a continuation of the process from FIG. 13.
Step 1401 - Creating a linear timestamp of the periods of time to silence by matching the audio tracks to silence against the audio and video timeline. One familiar with the art will appreciate that where silence is mentioned it also represents the insertion of alternate audio, which comprises one or more from the group of blurring, a bleep, silence, white noise, and anonymizing of voices.
Step 1402 - Editing the master audio track by silencing the timestamped periods of time to silence.
Step 1403 - Output an edited audio and video recording or feed.
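By way of illustration only, the following Python sketch shows one way steps 1401 through 1403 could be realized: overlapping periods to silence are merged into a single linear list of timestamps and then applied to the master audio track. Writing the edited track back into the audio-video container is not shown, and all names are hypothetical.

    # Minimal sketch of steps 1401-1403: build a linear list of timestamped periods to
    # silence, merge overlaps, and apply them to the master audio track.
    import numpy as np

    def merge_periods(periods):
        """Merge overlapping (start_s, end_s) periods into a sorted, non-overlapping list."""
        merged = []
        for start, end in sorted(periods):
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged

    def silence_periods(master, sample_rate, periods):
        """Return a copy of the master track with every merged period zeroed out."""
        edited = master.copy()
        for start_s, end_s in merge_periods(periods):
            edited[int(start_s * sample_rate):int(end_s * sample_rate)] = 0
        return edited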

FIG. 15 shows a flowchart describing the system and method in accordance with one or more embodiments of the invention.
Step 1501 - Speech recognition cannot translate the speech to text. This can happen when the speech recognition engine does not recognize the word the person is saying, either because the word is not in the selected language's dictionary or because the audio is incomprehensible.
Step 1502 - Marking the area of interest as undetermined for the operator to manually confirm the automatic blurring of the audio.
Step 1503 - Playing back the undetermined audio to the operator.
Step 1504 - The operator confirms or changes the marking of the undetermined audio, either by redacting the audio or by leaving the audio track intact.
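Purely as an illustration, the following Python sketch shows a simple operator review loop for steps 1502 through 1504; the play_segment callable is hypothetical and would hand each undetermined segment to an audio player.

    # Minimal sketch of the operator review pass in steps 1502-1504.
    def review_undetermined(undetermined_segments, play_segment):
        """undetermined_segments: list of (start_s, end_s). Returns the spans to redact."""
        to_redact = []
        for start_s, end_s in undetermined_segments:
            play_segment(start_s, end_s)                       # playback for the operator
            answer = input(f"Redact {start_s:.1f}-{end_s:.1f} s? [y/N] ").strip().lower()
            if answer == "y":
                to_redact.append((start_s, end_s))             # confirm the automatic blurring
            # otherwise the audio segment is left intact
        return to_redact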
FIGS. 16 and 17 show a different embodiment of the invention. FIG. 16 shows a flowchart describing the system and method to automatically replace zones of interest with alternate audio from an audio and video recording or feed, wherein the alternate audio comprises one or more from the group of blurring, a bleep, silence, white noise, and anonymizing of voices, in accordance with one or more embodiments of the invention.
Step 1601 - Matching the audio track of an audio-video recording to a language, wherein the matching can be automatic or manual.
Step 1602 - Using speech recognition, identify a phrase of interest. The speech recognition engine used is the one matching the language. In a different embodiment of the invention, if more than one language is detected, speech recognition is run more than once, each time with a different language matching one of the languages detected within the audio-video recording. Speech recognition is an interdisciplinary subfield of computational linguistics that develops methodologies and technologies enabling the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition, or speech to text (STT). It incorporates knowledge and research from the fields of linguistics, computer science, and electrical engineering. Speech recognition has a long history with several waves of major innovations; most recently, the field has benefited from advances in deep learning and big data, and a variety of deep learning methods have been adopted industry-wide in designing and deploying speech recognition systems.
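By way of illustration only, the following Python sketch uses one possible off-the-shelf recognizer (openai-whisper) to detect the recording's language and produce timestamped text as in steps 1601 and 1602; the invention is not tied to any particular speech recognition engine, and the file name is hypothetical.

    # Minimal sketch of steps 1601-1602 with an off-the-shelf recognizer.
    # Requires: pip install openai-whisper (and ffmpeg for audio decoding).
    import whisper

    model = whisper.load_model("base")

    # First pass: let the model detect the dominant language of the recording.
    result = model.transcribe("master_track.wav")              # hypothetical file name
    languages = {result["language"]}
    # If more than one language is expected, further language codes could be added here.

    # Run recognition once per detected/expected language, keeping word timings per segment.
    for lang in languages:
        result = model.transcribe("master_track.wav", language=lang)
        for segment in result["segments"]:
            print(f'{lang} {segment["start"]:.2f}-{segment["end"]:.2f}: {segment["text"]}')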
Continuing with the description of step 1602, one familiar with the art will appreciate that a phrase of interest may comprise one or more from the group of predetermined phrases from a first list, unidentified words, and combinations of words from a second list. For example, a number before a word is interpreted as an address.
In a different embodiment of the invention, a database is created with words or phrases of interest, i.e. street names, city names, web addresses, individual names in one or more languages, or slang terms, to name a few. Based on that database, the voice recognition module compares the audio identified by the speech recognition against the words or phrases in the database. If the confidence percentage is higher or lower than the parameters stipulated by default or by the user, those words or phrases become candidates for phrases of interest and are processed as described below.
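As a non-limiting illustration, the following Python sketch compares recognized words against a small illustrative database of phrases of interest using a similarity threshold, and applies the number-before-a-word address heuristic mentioned above; the phrase list, the threshold value, and the availability of word-level timestamps from the recognizer are all assumptions.

    # Minimal sketch of the database comparison described above: recognized words are
    # compared against a list of phrases of interest and become candidates when the
    # similarity score crosses a configurable threshold.
    import difflib

    PHRASES_OF_INTEREST = ["main street", "elm avenue", "john smith", "example.com"]  # illustrative

    def find_candidates(transcript_words, threshold=0.8):
        """transcript_words: list of (start_s, end_s, word). Returns candidate spans."""
        candidates = []
        for phrase in PHRASES_OF_INTEREST:
            n = len(phrase.split())
            for i in range(len(transcript_words) - n + 1):
                window = " ".join(w.lower() for _, _, w in transcript_words[i:i + n])
                score = difflib.SequenceMatcher(None, window, phrase).ratio()
                if score >= threshold:
                    start_s = transcript_words[i][0]
                    end_s = transcript_words[i + n - 1][1]
                    candidates.append((start_s, end_s, phrase, score))
        # Heuristic from the description: a number followed by a word is treated as an address.
        for i in range(len(transcript_words) - 1):
            if transcript_words[i][2].isdigit():
                candidates.append((transcript_words[i][0], transcript_words[i + 1][1],
                                   "possible address", 1.0))
        return candidates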
Step 1603 - Editing the master audio track by replacing the phrase of interest with alternate audio. The editing of the master audio track includes replacing the words or phrases of interest with alternate audio, which includes one or more from the group of blurring, a bleep, silence, white noise, and anonymizing of voices.
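By way of illustration only, the following Python sketch overwrites each phrase-of-interest span with alternate audio; only silence, white noise, and a bleep tone are sketched, and the tone frequency and amplitudes are arbitrary choices rather than claimed values.

    # Minimal sketch of step 1603: overwrite each phrase-of-interest span with the
    # chosen alternate audio. Spans are assumed to lie within the track.
    import numpy as np

    def alternate_audio(kind, length, sample_rate):
        if kind == "silence":
            return np.zeros(length, dtype=np.int16)
        if kind == "whitenoise":
            return (np.random.uniform(-1, 1, length) * 3000).astype(np.int16)
        # default: a 1 kHz bleep tone
        t = np.arange(length) / sample_rate
        return (np.sin(2 * np.pi * 1000 * t) * 8000).astype(np.int16)

    def redact_phrases(master, sample_rate, spans, kind="bleep"):
        """spans: list of (start_s, end_s) covering the phrases of interest."""
        edited = master.copy()
        for start_s, end_s in spans:
            a, b = int(start_s * sample_rate), int(end_s * sample_rate)
            edited[a:b] = alternate_audio(kind, b - a, sample_rate)
        return edited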
For example, the subject of interest says a phrase that is recorded in the audio track; the zone of interest for audio may therefore include any audio that contains information that is personally identifiable and private. For example, the video is a recording of an interview, and during the interview a person or subject of interest says the name and address of a third party. The name and address of that third party need to be redacted by redacting that portion of the audio.
Step 1604 - Output an edited audio and video recording or feed. The final product after redacting the audio-video, or just the audio track itself, is a file that cannot be reverse-engineered. One familiar with the art will appreciate that the final product is a file that only contains the contents of a new audio track and cannot be reverse-engineered to figure out the phrases that were redacted. In a different embodiment of the invention, an editable file can be produced which also contains the original audio track on the same timeline as the final output file, so that an authorized person is able to compare what has been redacted against the original content of the audio-video file.
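For illustration only, the following Python sketch writes the redacted track as the deliverable file and, optionally, an editable review file that carries the original audio on the same timeline as a second channel; the file names and the two-channel layout are assumptions, not the claimed output format.

    # Minimal sketch of step 1604: export the redacted track, and optionally an
    # "editable" review copy carrying the original audio on the same timeline.
    import wave
    import numpy as np

    def write_wav(path, samples, sample_rate, channels=1):
        with wave.open(path, "wb") as wav:
            wav.setnchannels(channels)
            wav.setsampwidth(2)                       # 16-bit PCM
            wav.setframerate(sample_rate)
            wav.writeframes(samples.astype(np.int16).tobytes())

    def export(redacted, original, sample_rate, keep_editable_copy=False):
        """redacted and original are equal-length 1-D sample arrays."""
        write_wav("redacted_output.wav", redacted, sample_rate)          # final product
        if keep_editable_copy:                                           # authorized review copy
            interleaved = np.column_stack([redacted, original]).ravel()  # L=redacted, R=original
            write_wav("editable_review.wav", interleaved, sample_rate, channels=2)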
One familiar with the art will appreciate that the input file can be an audio file or an audio-video file combination, and that the output file or final product can likewise be an audio file or an audio-video file.
FIG. 17 shows a flowchart describing a different embodiment of the invention.
Step 1701 - Matching the audio track of an audio-video recording to a language, wherein the matching can be automatic or manual.
Step 1702 - Using speech recognition, identify a phrase of interest within the audio-video track.
Step 1703 - Is the result undetermined (area of interest not identified)?
Undetermined audio is an audio segment that the speech recognition module was not able to identify or translate, either because the word is a slang term, an adjective, the name of a person, street, or city, an email or web address, an identifiable number such as a street or phone number, or any other name or phrase without a meaning in the translation.
If yes, go to step 1704; if no, go to step 1709.
Step 1704 - Marking the area of interest as undetermined for the operator to manually confirm the automatic blurring of the audio. Alternatively, creating an audio track with all the undetermined audio unredacted so the operator can determine whether those audio segments should be redacted or not.
Step 1705 - Playing back all the undetermined audio to an operator. This option is provided so the operator can verify whether the redaction of the audio is correct; this track plays back the undetermined audio segments.
Step 1706 - Is the automatic blurring of the audio correct? If yes, go to step 1707; if no, go to step 1708.
Step 1707 - The operator confirms that the automatic blurring of the audio is correct, then go to step 1709.
Step 1708 - The operator reverses it, then go to step 1709.
Step 1709 - Editing the master audio track by replacing the phrase of interest with alternate audio.
Step 1710 - Output an edited audio and video recording or feed.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this disclosure, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as disclosed herein. Accordingly, the scope of the invention should be limited only by the attached claims.

Claims (20)

What is claimed is:
1. System and method to automatically replace zones of interest with alternate audio from an audio and video recording or feed comprising:
matching a first audio tone, from a master audio track, to a first person while the first person moves their lips;
isolating a first identified audio track from the rest of the audio based on the first audio tone;
matching a second audio tone, from the master audio track, to a second person while the second person moves their lips;
isolating a second identified audio track from the rest of the audio based on the second audio tone; and
making a determination of the identified audio track to silence.
2. The system and method of claim 1, further comprising:
creating a linear timestamp of periods of time to silence by matching the audio tracks to silence with the audio and video timeline;
editing the master audio track, silencing the timestamped periods of time to silence; and
outputting an edited audio and video recording or feed.
3. The system and method of claim 1, wherein the matching of an audio tone, from a master audio track, to a person comprises one or more from the group of using computer vision to identify when the person moves the lips, and manually identifying the person.
4. The system and method of claim 1, wherein an audio tone further comprises one or more from the group of the pitch of their voice, the speed of their talk, their language.
5. The system and method of claim 1, wherein the alternate audio comprises one or more from the group of blurring, a bleep, silence, white noise, and anonymizing of voices.
6. The system and method of claim 1, further comprising marking the area of interest as undetermined for the operator to manually confirm the automatic blurring of the audio.
7. System and method to automatically replace zones of interest with alternate audio from an audio and video recording or feed comprising:
matching the audio track of an audio-video recording to at least one language, wherein the matching comprises one from the group of automatic or manual;
using speech recognition, identifying a phrase of interest, wherein a phrase of interest comprises one or more from the group of predetermined phrases from a first list, unidentified words, and combinations of words from a second list;
editing the master audio track by replacing the phrase of interest with alternate audio; and
outputting an edited audio and video recording or feed.
8. The system and method of claim 7, wherein the alternate audio comprises one or more from the group of blurring, a bleep, silence, white noise, and anonymizing of voices.
9. The system and method of claim 7, further comprising marking the area of interest as undetermined for the operator to manually confirm the automatic blurring of the audio.
10. The system and method of claim 7, further comprising playing back all the undetermined audio to an operator.
11. The system and method of claim 7, when more than one language is detected, further comprising:
matching the audio track of the audio-video recording to a second language; and
using speech recognition, identifying a phrase of interest in the second language.
12. A non-transitory computer readable medium comprising instructions, which when executed by a processor, performs a method, the method to automatically replace zones of interest with alternate audio from an audio and video recording or feed comprising:
matching a first audio tone, from a master audio track, to a first person while the first person moves their lips, wherein an audio tone further comprises one or more from the group of the pitch of their voice, the speed of their talk, their language;
isolating a first identified audio track from the rest of the audio based on the first audio tone;
matching a second audio tone, from the master audio track, to a second person while the second person moves their lips;
isolating a second identified audio track from the rest of the audio based on the second audio tone; and
making a determination of the identified audio track to silence.
13. The non-transitory computer readable medium of claim 12, further comprising:

creating a linear timestamp of periods of time to silence by matching the audio tracks to silence with the audio and video timeline;
editing the master audio track, silencing the timestamped periods of time to silence; and
outputting an edited audio and video recording or feed.
14. The non-transitory computer readable medium of claim 12, wherein the matching of an audio tone, from a master audio track, to a person comprises one or more from the group of using computer vision to identify when the person moves the lips, and manually identifying the person.
15. The non-transitory computer readable medium of claim 12, further comprising marking the area of interest as undetermined for the operator to manually confirm the automatic blurring of the audio.
16. A non-transitory computer readable medium comprising instructions, which when executed by a processor, performs a method, the method to automatically replace zones of interest with alternate audio from an audio and video recording or feed comprising:
matching the audio track of an audio-video recording to at least one language, wherein the matching comprises one from the group of automatic or manual;
using speech recognition, identifying a phrase of interest, wherein a phrase of interest comprises one or more from the group of predetermined phrases from a first list, unidentified words, and combinations of words from a second list;
editing the master audio track by replacing the phrase of interest with alternate audio; and
outputting an edited audio and video recording or feed.
17. The non-transitory computer readable medium of claim 16, wherein the alternate audio comprises one or more from the group of blurring, a bleep, silence, white noise, and anonymizing of voices.
18. The non-transitory computer readable medium of claim 16, further comprising marking the area of interest as undetermined for the operator to manually confirm the automatic blurring of the audio.
19. The non-transitory computer readable medium of claim 16, further comprising playing back all the undetermined audio to an operator.
20. The non-transitory computer readable medium of claim 16, when more than one language is detected, further comprising:
matching the audio track of the audio-video recording to a second language; and
using speech recognition, identifying a phrase of interest in the second language.
CA3057939A 2019-10-08 2019-10-08 Method that redacts zones of interest in an audio file using computer vision and machine learning Abandoned CA3057939A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA3057939A CA3057939A1 (en) 2019-10-08 2019-10-08 Method that redacts zones of interest in an audio file using computer vision and machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CA3057939A CA3057939A1 (en) 2019-10-08 2019-10-08 Method that redacts zones of interest in an audio file using computer vision and machine learning

Publications (1)

Publication Number Publication Date
CA3057939A1 true CA3057939A1 (en) 2021-04-08

Family

ID=75381938

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3057939A Abandoned CA3057939A1 (en) 2019-10-08 2019-10-08 Method that redacts zones of interest in an audio file using computer vision and machine learning

Country Status (1)

Country Link
CA (1) CA3057939A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230097019A1 (en) * 2021-09-24 2023-03-30 Zebra Technologies Corporation Computational load mitigation for image-based item recognition
US11941860B2 (en) * 2021-09-24 2024-03-26 Zebra Technologies Corporation Computational load mitigation for image-based item recognition
US20230138811A1 (en) * 2021-10-29 2023-05-04 The Adt Security Corporation Simulating operation of the premises security system
CN115661005A (en) * 2022-12-26 2023-01-31 成都索贝数码科技股份有限公司 Generation method and device for customized digital person

Similar Documents

Publication Publication Date Title
US10108709B1 (en) Systems and methods for queryable graph representations of videos
Tao et al. End-to-end audiovisual speech recognition system with multitask learning
Ribaric et al. De-identification for privacy protection in multimedia content: A survey
US11527265B2 (en) Method and system for automatic object-aware video or audio redaction
US6959099B2 (en) Method and apparatus for automatic face blurring
CA3057939A1 (en) Method that redacts zones of interest in an audio file using computer vision and machine learning
US11164046B1 (en) Method for producing labeled image from original image while preventing private information leakage of original image and server using the same
JP6789601B2 (en) A learning video selection device, program, and method for selecting a captured video masking a predetermined image area as a learning video.
US11113838B2 (en) Deep learning based tattoo detection system with optimized data labeling for offline and real-time processing
CN111489819A (en) Method, server and computer readable medium for detecting cognitive and language disorders
CN114186069A (en) Deep video understanding knowledge graph construction method based on multi-mode heteromorphic graph attention network
Masood et al. Classification of Deepfake videos using pre-trained convolutional neural networks
Sahla Habeeba et al. Detection of deepfakes using visual artifacts and neural network classifier
Rothkrantz Lip-reading by surveillance cameras
Zhao et al. SPACE: Finding key-speaker in complex multi-person scenes
Nanda et al. Soft computing techniques-based digital video forensics for fraud medical anomaly detection
Chelliah et al. Adaptive and effective spatio-temporal modelling for offensive video classification using deep neural network
CN116883900A (en) Video authenticity identification method and system based on multidimensional biological characteristics
US20230089648A1 (en) Video camera and device for automatic pre-recorded video or audio redaction of objects
Bijhold et al. Forensic audio and visual evidence 2004-2007: A review
CA3057931A1 (en) Method to identify and see a person in a public security camera feed without been able to see the rest of the people or tags
Ashok et al. Deepfake Detection Using XceptionNet
Kariyawasam et al. Standalone Application and Chromium Browser Extension-based System for Online Examination Cheating Detection
KR20200094844A (en) Face recognition apparatus using deep learning and operating method thereof
Malathi et al. Generic object detection using deep learning

Legal Events

Date Code Title Description
FZDE Discontinued

Effective date: 20230411
