WO2017012123A1 - Video processing - Google Patents

Video processing

Info

Publication number
WO2017012123A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
data
machine learning
attribute data
images
Prior art date
Application number
PCT/CN2015/084956
Other languages
French (fr)
Inventor
Song CAO
Genquan DUAN
Original Assignee
Wizr
Priority date
Filing date
Publication date
Application filed by Wizr filed Critical Wizr
Priority to EP15898679.4A priority Critical patent/EP3326083A4/en
Priority to PCT/CN2015/084956 priority patent/WO2017012123A1/en
Publication of WO2017012123A1 publication Critical patent/WO2017012123A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/217 Validation; Performance evaluation; Active pattern learning techniques


Abstract

A system described in the disclosure and/or shown in the drawings. A method as described in the disclosure and/or shown in the drawings. A method comprising: determining a first set of object attributes of a video clip using machine learning based on machine learning training data; receiving a second set of object attributes of the video clip; comparing the first set of object attributes and the second set of object attributes; and updating the machine learning training data with at least a portion of either or both the first set of object attributes and the second set of object attributes.

Description

VIDEO PROCESSING
BRIEF DESCRIPTION OF THE FIGURES
These and other features, aspects, and advantages of the present disclosure are better understood when the following Disclosure is read with reference to the accompanying drawings.
Figure 1 illustrates a block diagram of a system 100 for a multi-camera video tracking system.
Figure 2 is a flowchart of a method for training a video annotation machine learning process according to some embodiments.
Figure 3 is a flowchart of a method for processing images from a video according to some embodiments.
Figure 4 shows an illustrative computational system for performing functionality to facilitate implementation of the various embodiments described in this document.
DISCLOSURE
Systems and methods are disclosed for training a video annotation machine learning process using video annotation data from a third party. Systems and methods are also disclosed for processing images that represent a video. Various other video and/or image processing systems and/or methods are also described.
Figure 1 illustrates a block diagram of a system 100 that may be used in various embodiments. The system 100 may include a plurality of cameras: camera 120, camera 121, and camera 122. While three cameras are shown, any number of cameras may be included. These cameras may include any type of video camera such as, for example, a wireless video camera, a black and white video camera, a surveillance video camera, portable cameras, battery powered cameras, CCTV cameras, Wi-Fi enabled cameras, smartphones, smart devices, tablets, computers, GoPro cameras, wearable cameras, etc. The cameras may be positioned anywhere such as, for example, within the same geographic location, in separate geographic locations, positioned to record portions of the same scene, positioned to record different portions of the same scene, etc. In some embodiments, the cameras may be owned and/or operated by different users, organizations, companies, entities, etc.
The cameras may be coupled with the network 115. The network 115 may, for example, include the Internet, a telephonic network, a wireless telephone network, a 4G network, etc. In some embodiments, the network may include multiple networks, connections, servers, switches, routers, etc. that may enable the transfer of data. In some embodiments, the network 115 may be or may include the Internet. In some embodiments, the network may include one or more LAN, WAN, WLAN, AN, SAN, PAN, EPN, and/or VPN.
In some embodiments, one or more of the cameras may be coupled with a base station, digital video recorder, or a controller that is then coupled with the network 115.
The system 100 may also include video data storage 105 and/or a video processor 110. In some embodiments, the video data storage 105 and the video processor 110 may be coupled together via a dedicated communication channel that is separate from or part of the network 115. In some embodiments, the video data storage 105 and the video processor 110 may share data via the network 115. In some embodiments, the video data storage 105 and the video processor 110 may be part of the same system or systems. In some embodiments, the video processor 110 may include various machine learning algorithms, functions, etc.
In some embodiments, the video data storage 105 may include one or more remote or local data storage locations such as, for example, a cloud storage location, a remote storage location, etc.
In some embodiments, the video data storage 105 may store video files recorded by one or more of camera 120, camera 121, and camera 122. In some embodiments, the video files may be stored in any video format such as, for example, mpeg, avi, etc. In some embodiments, video files from the cameras may be transferred to the video data storage 105 using any data transfer protocol such as, for example, HTTP Live Streaming (HLS), Real Time Streaming Protocol (RTSP), Real Time Messaging Protocol (RTMP), HTTP Dynamic Streaming (HDS), Smooth Streaming, Dynamic Streaming over HTTP, HTML5, Shoutcast, etc.
In some embodiments, the video data storage 105 may store user identified event data reported by one or more individuals. The user identified event data may be used, for example, to train the video processor 110 to capture feature events.
In some embodiments, a video file may be recorded and stored in memory located at a user location prior to being transmitted to the video data storage 105. In some embodiments, a video file may be recorded by the camera and streamed directly to the video data storage 105.
In some embodiments, the video processor 110 may include one or more local and/or remote servers that may be used to perform data processing on videos stored in the video data storage 105. In some embodiments, the video processor 110 may execute one or more algorithms on one or more video files stored in the video storage location. In some embodiments, the video processor 110 may execute a plurality of algorithms in parallel on a plurality of video files stored within the video data storage 105. In some embodiments, the video processor 110 may include a plurality of processors (or servers) that each execute one or more algorithms on one or more video files stored in video data storage 105. In some embodiments, the video processor 110 may include one or more of the components of computational system 400 shown in Figure 4.
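As a rough sketch of the parallel processing described above, the following Python example fans a per-file analysis routine out over stored video files with a worker pool. The storage path, the placeholder analyze_video() routine, and the worker count are illustrative assumptions rather than details taken from the disclosure.
```python
# Hypothetical sketch: run an analysis routine over many stored video files in
# parallel, in the spirit of the video processor 110 described above.
from multiprocessing import Pool
from pathlib import Path


def analyze_video(path: str) -> dict:
    # Placeholder per-file algorithm; a real system would decode frames and
    # run its detectors here.
    return {"file": path, "events": []}


def process_all(storage_dir: str, workers: int = 4) -> list:
    paths = [str(p) for p in Path(storage_dir).glob("*.mp4")]
    with Pool(processes=workers) as pool:
        return pool.map(analyze_video, paths)


if __name__ == "__main__":
    results = process_all("/data/video_storage")  # assumed storage mount point
    print(f"processed {len(results)} video files")
```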
In some embodiments, the system 100 may include a video annotation service 125 or have access to data provided by the video annotation service 125. Video annotation service 125 may receive a video from the video processor 110 and/or the video data storage 105 and annotate the video with video annotations. The video annotation service 125 may output video annotations for an input video. In some embodiments, the video may only be a portion of a video or clip of interest. The video annotation service 125 may use any number of techniques to annotate the video based on any number of video attributes such as, for example, human annotation of the video.
In some embodiments, the video attributes may include attributes such as, for example, object boundary data, object characteristic data, and/or object event data. The video attributes may vary depending on the object type. The objects, for example, may include humans, animals, cars, bikes, etc.
The object boundary data may include coordinate data describing the boundary of the object within the frame of the video. For example, the object boundary data may specify a plurality of points that, when connected, form a polygonal shape bounding or roughly bounding the object. In some embodiments, the object boundary data may include pixel data that identifies the pixels representing the object data. Various other data or data types may be used to identify the boundary of an object within the video or a video frame. For example, a polygon (or rectangle) can be used to define the boundary of an object. In this example, the boundary data may include a vector or matrix comprising three or more coordinates corresponding with the indices of the boundary of the object within the video or the video frame.
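A minimal sketch of boundary data of this kind, assuming a NumPy representation; the coordinate values and helper name are illustrative only.
```python
# Hypothetical sketch: object boundary data as a matrix of (x, y) coordinates
# outlining an object within one video frame.
import numpy as np

# Three or more points that, when connected in order, roughly bound the object.
person_boundary = np.array([
    [120, 40],   # upper-left of the figure
    [180, 40],   # upper-right
    [185, 220],  # lower-right
    [115, 220],  # lower-left
])


def bounding_rectangle(boundary: np.ndarray) -> tuple:
    # Collapse a polygon boundary to an axis-aligned rectangle (x, y, w, h).
    x_min, y_min = boundary.min(axis=0)
    x_max, y_max = boundary.max(axis=0)
    return int(x_min), int(y_min), int(x_max - x_min), int(y_max - y_min)


print(bounding_rectangle(person_boundary))  # (115, 40, 70, 180)
```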
The object characteristic data may include characteristics that are general to most objects or specific to certain objects. For example, the object characteristic data may include data specifying the object's size, color, shape, etc. For a human object, the specific object characteristic data may include an estimated sex, approximate height, approximate weight, hair color, clothing type, accessories, the facial expression of the human, etc. In some embodiments, the accessories may include whether the human is carrying or wearing a box, package, backpack, luggage, briefcase, overnight bag, umbrella, gun, purse, hat, hoody, sunglasses, gender, age, race, etc. For a vehicle object, specific object characteristics may include the number of wheels, vehicle type, the number of doors, etc.
The object event data may include, for example, general event data and/or specific event data. For example, general event data may include the speed of the object, the velocity of the object, the orientation of the object, the direction of motion of the object, whether the object stops, etc. For a human object, specific event data may include the placement and/or motion of the human’s hands, whether the human is walking, whether the human is running, the direction where the human is looking, whether the human is riding a bicycle, whether the human is standing still, whether the human is jumping, whether the human is sitting, etc.
In some embodiments, video annotation data may be included in a file separate and/or distinct from the video. For example, the video annotation data may be included in a text file or a metadata file that includes video attribute data and/or data identifying the video. In some embodiments, the video attribute data may include frame data such as, for example, frame timing data. In some embodiments, the video attribute data may vary for each frame of the video. For example, as the video attributes change between frames, the video annotation data may vary accordingly.
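One way such a separate annotation file could look, sketched as JSON written from Python; the schema, keys, and frame timing field are assumptions rather than a format defined by the disclosure.
```python
# Hypothetical sketch: per-frame video annotation data kept in a metadata file
# that is separate and distinct from the video itself.
import json

annotations = {
    "video_id": "camera120_2015-07-23T10-00-00",  # data identifying the video
    "frames": [
        {
            "time_ms": 0,  # frame timing data
            "objects": [
                {
                    "type": "human",
                    "boundary": [[120, 40], [180, 40], [185, 220], [115, 220]],
                    "characteristics": {"clothing": "jacket", "accessory": "backpack"},
                    "events": {"walking": True, "running": False},
                }
            ],
        },
        {"time_ms": 40, "objects": []},  # attributes may change frame to frame
    ],
}

with open("camera120_annotations.json", "w") as handle:
    json.dump(annotations, handle, indent=2)
```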
In some embodiments, the video annotation service 125 may receive a video from the video processor 110 and/or the video data storage 105, for example, through the network 115 as an input and in exchange may provide video annotation data to the system 100. The video annotation data, for example, may be stored in video data storage 105 or another data storage location such as, for example, another cloud storage location.
Figure 2 is a flowchart of an example process 200 for training a video annotation machine learning process. One or more steps of the process 200 may be implemented, in some embodiments, by one or more components of system 100 of Figure 1, such as video processor 110. Although illustrated as discrete blocks, various blocks may be divided into additional blocks, combined into fewer blocks, or eliminated, depending on the desired implementation.
Process 200 may begin at block 205. At block 205 a video clip may be identified from one or more videos such as, for example, videos stored in video data storage 105. For example, the video clip may be a portion of a video that includes an event of interest. The video clip, for example, may include portions or frames of the video before and/or after the video clip in addition to the video clip. In some embodiments, the video clip may include a single frame or multiple frames selected from one or more videos.
In some embodiments, low level events such as, for example, motion detection events may be used to identify the video clip. In some embodiments, a feature detection algorithm may be used to determine that an event has occurred. In some embodiments, the feature description can be determined using a low level detection algorithm.
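A hedged sketch of one such low-level motion detection pass, using simple frame differencing with OpenCV; the thresholds are illustrative and would need tuning per camera.
```python
# Hypothetical sketch: flag frames whose difference from the previous frame is
# large, as candidate frames around which a video clip could be identified.
import cv2


def motion_frame_indices(video_path: str, pixel_thresh: int = 25,
                         area_thresh: int = 500):
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        cap.release()
        return
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        index += 1
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        diff = cv2.absdiff(gray, prev_gray)
        _, mask = cv2.threshold(diff, pixel_thresh, 255, cv2.THRESH_BINARY)
        if cv2.countNonZero(mask) > area_thresh:
            yield index  # candidate frame for an event of interest
        prev_gray = gray
    cap.release()
```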
An event may include any number or type of occurrences captured by a video camera and stored in a video file. An event may include, for example, a person moving through a scene, a car or an object moving through a scene, a particular face entering the scene, a face, a shadow, animals entering the scene, an automobile entering or leaving the scene, etc.
In some embodiments, a feature description of the event can be determined for each event. The feature description, for example, may be determined using a feature detector algorithm such as, for example, SURF, SIFT, GLOH, HOG, Affine shape adaptation, Harris affine, Hessian affine, etc. In some embodiments, the feature description can be determined using a high level detection algorithm. Various other feature detector algorithms may be used. In some embodiments, the feature description may be saved in the video storage location such as, for example, as metadata associated with the video.
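A small sketch of computing a feature description for a frame of an event with HOG, one of the detectors listed above; the fixed 64x128 window is an assumption imposed by OpenCV's default HOG parameters.
```python
# Hypothetical sketch: a HOG feature description for a single frame, which
# could be saved as metadata associated with the video.
import cv2


def hog_feature_description(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    window = cv2.resize(gray, (64, 128))  # default HOG detection window size
    hog = cv2.HOGDescriptor()             # yields a 3780-dimensional descriptor
    return hog.compute(window).flatten()
```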
At block 210 a first set of object attribute data may be determined from the video clip using machine learning techniques based on machine learning data 230. The dashed lines shown in the flow chart show the flow of data but may not show a step in the process 200. Between blocks 230 and 210, machine learning data may be sent and/or used to determine the first set of object attribute data. The first set of object attribute data may include any type of attribute data.
In some embodiments, the first set of object attribute data may be determined using supervised machine learning techniques. For example, the machine learning data 230 may include a plurality of video clips and a plurality of corresponding object attributes for the plurality of video clips. Support vector machines or decision trees may be used as the type of machine learning functions and may be trained using the machine learning data 230 to produce machine learning functions that can be used to predict the first set of object attribute data.
In some embodiments, the first set of object attribute data may be determined using any type of machine learning technique such as, for example, association rule learning techniques, artificial neural network techniques, inductive logic programming, support vector machines, decision tree learning, Bayesian networks, reinforcement learning, representation learning, similarity and metric learning, sparse dictionary learning, convolutional networks, genetic algorithms, etc.
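A minimal supervised-learning sketch in the spirit of block 210, assuming scikit-learn, toy feature vectors, and a single activity label; none of these specifics come from the disclosure.
```python
# Hypothetical sketch: machine learning data 230 as clip feature vectors paired
# with known object attributes, used to fit a classifier that predicts an
# attribute (the first set of object attribute data) for a new clip.
from sklearn.svm import SVC

train_features = [[0.1, 0.8, 0.3],   # toy feature descriptions of annotated clips
                  [0.7, 0.2, 0.9],
                  [0.2, 0.9, 0.1]]
train_labels = ["walking", "running", "walking"]  # known object attributes

model = SVC(kernel="rbf")
model.fit(train_features, train_labels)

new_clip_features = [0.15, 0.85, 0.2]
first_set = {"activity": model.predict([new_clip_features])[0]}
print(first_set)
```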
At block 215 a second set of object attribute data may be received from a video annotation service such as, for example, video annotation service 125. The second set of object attribute data may be received, for example, through network 115. In some embodiments, the second set of object attribute data may be created based on user input.
At block 220 the first set of object attribute data and the second set of object attribute data may be compared. For example, the first set of object attribute data may comprise a matrix or vector of data values and the second set of object attribute data may comprise a matrix or vector of data values. Each value, or one or more values, of the vectors or matrices may be compared.
In some embodiments, the comparison that occurs in block 220 can provide a measure of the quality of the first set of object attribute data. If the second set of object attribute data and the first set of object attribute data are the same, similar, and/or congruent, then the comparison can be used to validate the quality of the first set of object attribute data and/or the machine learning process used to determine the first set of object attribute data.
If the second set of object attribute data and the first set of object attribute data are not the same, similar, and/or congruent, then the comparison can be used to revise the machine learning process used to determine the first set of object attribute data.
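A sketch of the comparison in block 220 when both sets are flattened to vectors of attribute values; the agreement threshold is an assumption.
```python
# Hypothetical sketch: element-wise comparison of the first and second sets of
# object attribute data, yielding a simple agreement score.
import numpy as np


def attribute_agreement(first: np.ndarray, second: np.ndarray) -> float:
    # Fraction of attribute values on which the two sets agree.
    return float(np.mean(first == second))


first_set = np.array(["human", "walking", "backpack"])
second_set = np.array(["human", "running", "backpack"])  # from the annotation service

agreement = attribute_agreement(first_set, second_set)   # ~0.67 here
needs_revision = agreement < 0.9  # low agreement: revise the learned functions
```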
At block 225 the results of the comparison that occurred at block 220 may be used to update the machine learning data 230. In some embodiments, comparison data may be sent to machine learning data 230 and may be used to revise, update, modify, and/or change the algorithm or inferred function that was used to determine the first set of object attribute data. In some embodiments, the updated machine learning data 230 may be used as training data for the machine learning algorithm.
In some embodiments, the first set of object attribute data may be determined using supervised learning techniques. For supervised training purposes, for example, the video clip may be the input to the supervised learning algorithm and the second set of object attribute data may be the desired output for the input data. Using this data and/or other machine learning data 230, the machine learning functions used to determine the first set of object attribute data may or may not be revised. In some embodiments, the comparison data and/or the second set of object attribute data may be added to the machine learning data 230 and the functions used to determine the first set of object attribute data may be continuously updated and/or revised based on the updated machine learning data 230.
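Continuing the earlier training sketch, block 225 could fold the annotation service's values back into machine learning data 230 and refit; the helper below is an illustrative assumption, not the disclosed algorithm.
```python
# Hypothetical sketch: append the annotated clip to the training data and
# retrain, treating the second set of object attribute data as the desired
# output for the clip's feature description.
from sklearn.svm import SVC


def refit_with_annotation(train_features, train_labels, clip_features, second_set_label):
    updated_features = train_features + [clip_features]
    updated_labels = train_labels + [second_set_label]
    model = SVC(kernel="rbf")
    model.fit(updated_features, updated_labels)  # revised inferred function
    return model, updated_features, updated_labels
```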
In some embodiments, attribute data may be exported. If there is a disagreement between the first set of object attribute data and the second set of object attribute data, the exported attribute data may include the second set of object attribute data. In some embodiments, if some or all of the second set of object attribute data is missing, then the data from the first set of object attribute data may be exported.
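A sketch of the export rule described above: prefer the second (annotation service) set where the two disagree, and fall back to the first set for any attribute the second set is missing. The dictionary keys are illustrative.
```python
# Hypothetical sketch: merge the two attribute sets for export.
def export_attributes(first_set: dict, second_set: dict) -> dict:
    exported = dict(first_set)  # start from the machine-learned values
    # Override with the annotation service's values wherever they are present.
    exported.update({k: v for k, v in second_set.items() if v is not None})
    return exported


exported = export_attributes(
    {"type": "human", "activity": "walking", "accessory": "backpack"},
    {"type": "human", "activity": "running", "accessory": None},  # missing value
)
# -> {"type": "human", "activity": "running", "accessory": "backpack"}
```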
Figure 3 is a flowchart of an example process 300 for processing video. One or more steps of the process 300 may be implemented, in some embodiments, by one or more components of system 100 of Figure 1, such as video processor 110. Although illustrated as discrete blocks, various blocks may be divided into additional blocks, combined into fewer blocks, or eliminated, depending on the desired implementation.
Process 300 may begin at block 305. At block 305 the system 100 may receive video streams from a plurality of cameras such as, for example, camera 120, camera 121, and/or camera 122. The video streams, for example, may be received as an mjpeg video stream, h264 video stream, VP8 video stream, MP4, FLV, WebM, ASF, ISMA, flash, HTTP Live Streaming, etc. Various other streaming formats and/or protocols may be used.
In some embodiments, at block 310 the video processor 110 may ensure that the video streams are stored in one or more cloud storage locations such as, for example, video data storage 105. In some embodiments, the video streams may be converted to lower bandwidth video files such as, for example, video files with a lower resolution, more compression, or a lower frame rate.
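A hedged sketch of the bandwidth reduction in block 310: re-encode a stored stream at reduced resolution and frame rate with OpenCV. The codec, scale factor, and frame-skip value are assumptions; a production system might use a dedicated transcoder instead.
```python
# Hypothetical sketch: write a lower-resolution, lower-frame-rate copy of a
# stored video file.
import cv2


def downscale_video(src: str, dst: str, scale: float = 0.5, keep_every: int = 2):
    cap = cv2.VideoCapture(src)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH) * scale)
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT) * scale)
    writer = cv2.VideoWriter(dst, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps / keep_every, (width, height))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % keep_every == 0:  # drop frames to lower the frame rate
            writer.write(cv2.resize(frame, (width, height)))
        index += 1
    cap.release()
    writer.release()
```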
At block 315 images from the videos may be extracted for processing. In some embodiments, processing individual video frames rather than a video clip or video file may be less time consuming and/or less resource demanding. Video processor 110 may extract images from the video data storage 105.
In some embodiments, at block 320 the images may be filtered. For example, images that include only the foreground or include a majority of the foreground may be filtered out. As another example, images that do not include an event of interest may also be filtered. Event detection and/or determination may be determined in various ways some of which are described herein. As another example, poor quality images may be filtered out. These images may include images that are not in focus, have poor lighting or no lighting, etc. In some embodiments, the video processor 110 may filter the images.
In some embodiments, the video processor 110 may execute a function on the image prior to determining whether to filter the image. For example, the video processor 110 may determine whether the image includes at least a portion of an event or whether the image has sufficient quality for processing.
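A sketch of the quality checks mentioned above, estimating focus from Laplacian variance and lighting from mean intensity; both thresholds are assumptions to be tuned per camera.
```python
# Hypothetical sketch: decide whether an extracted image is sharp and lit well
# enough to keep for processing.
import cv2


def keep_image(frame_bgr, focus_thresh: float = 100.0, dark_thresh: float = 30.0) -> bool:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    in_focus = cv2.Laplacian(gray, cv2.CV_64F).var() > focus_thresh
    well_lit = gray.mean() > dark_thresh
    return bool(in_focus and well_lit)
```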
At block 325 the images may be processed. In some embodiments, the images may be processed as described in conjunction with process 200 shown in Figure 2. Various other processing may occur. For example, the images may be processed to determine a set of object attribute data of one or more objects within the image. As another example, the images may be processed to determine whether to trigger an alarm. As another example, the images may be processed to predict future activity that is likely to occur within a future image. As another example, the images may be processed for camera tracking purposes. As another example, the images may be processed to determine suspicious events occurring within the image. As another example, the images may be processed to synchronize cameras recording videos and/or images of the same scene. As another example, the images may be processed to prepare video summaries. As another example, the images may be processed to filter out false alarm events.
In some embodiments, the system 100 may include one or more web servers that may host a website where users can interact with videos stored in the video storage location 105, select videos to view, select videos to monitor using embodiments described in this document, assign or modify video attributes, search for videos and/or video clips of interest, set alarms based on events occurring within one or more selected video clips, select and/or identify foreground portions of videos, enter feedback, provide object attribute data, select cameras from which to synchronize data, etc. In some embodiments, the website may allow a user to select a camera that they wish to monitor. For example, the user may enter the IP address of the camera, a user name, and/or a password. Once a camera has been identified, for example, the website may allow the user to view video and/or images from the camera within a frame or page being presented by the website. As another example, the website may store the video from the camera in the video storage location 105 and/or the video processor 110 may begin processing the video from the camera to identify events, features, objects, etc.
The computational system 400 (or processing unit) illustrated in Figure 4 can be used to perform and/or control operation of any of the embodiments described herein. For example, the computational system 400 can be used alone or in conjunction with other components. As another example, the computational system 400 can be used to perform any calculation, solve any equation, perform any identification, and/or make any determination described here.
The computational system 400 may include any or all of the hardware elements shown in the figure and described herein. The computational system 400 may include hardware elements that can be electrically coupled via a bus 405 (or may otherwise be in communication, as appropriate). The hardware elements can include one or more processors 410, including, without limitation, one or more general-purpose processors and/or one or more special-purpose processors (such as digital signal processing chips, graphics acceleration chips, and/or the like); one or more input devices 415, which can include, without limitation, a mouse, a keyboard, and/or the like; and one or more output devices 420, which can include, without limitation, a display device, a printer, and/or the like.
The computational system 400 may further include (and/or be in communication with) one or more storage devices 425, which can include, without limitation, local and/or network-accessible storage and/or can include, without limitation, a disk drive, a drive array, an optical storage device, or a solid-state storage device, such as random access memory (“RAM”) and/or read-only memory (“ROM”), which can be programmable, flash-updateable, and/or the like. The computational system 400 might also include a communications subsystem 430, which can include, without limitation, a modem, a network card (wireless or wired), an infrared communication device, a wireless communication device, and/or a chipset (such as an 802.6 device, a Wi-Fi device, a WiMAX device, cellular communication facilities, etc.), and/or the like. The communications subsystem 430 may permit data to be exchanged with a network (such as the network described below, to name one example) and/or any other devices described herein. In many embodiments, the computational system 400 will further include a working memory 435, which can include a RAM or ROM device, as described above.
The computational system 400 also can include software elements, shown as being currently located within the working memory 435, including an operating system 440 and/or other code, such as one or more application programs 445, which may include computer programs of the invention, and/or may be designed to implement methods of the invention and/or configure systems of the invention, as described herein. For example, one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer). A set of these instructions and/or codes might be stored on a computer-readable storage medium, such as the storage device(s) 425 described above.
In some cases, the storage medium might be incorporated within the computational system 400 or in communication with the computational system 400. In other embodiments, the storage medium might be separate from the computational system 400 (e.g., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program a general-purpose computer with the instructions/code stored thereon. These instructions might take the form of executable code, which is executable by the computational system 400 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computational system 400 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.), then takes the form of executable code.
The term “substantially” means within 4% or 10% of the value referred to or within manufacturing tolerances.
Numerous specific details are set forth herein to provide a thorough understanding of the claimed subject matter. However, those skilled in the art will understand that the claimed subject matter may be practiced without these specific details. In other instances, methods, apparatuses, or systems that would be known by one of ordinary skill have not been described in detail so as not to obscure claimed subject matter.
Some portions are presented in terms of algorithms or symbolic representations of operations on data bits or binary digital signals stored within a computing system memory, such as a computer memory. These algorithmic descriptions or representations are examples of techniques used by those of ordinary skill in the data processing art to convey the substance of their work to others skilled in the art. An algorithm is a self-consistent sequence of operations or similar processing leading to a desired result. In this context, operations or processing involves physical manipulation of physical quantities. Typically, although not necessarily, such quantities may take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, or otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to such signals as bits, data, values, elements, symbols, characters, terms, numbers, numerals, or the like. It should be understood, however, that all of these and similar terms are to be associated with appropriate physical quantities and are merely convenient labels. Unless specifically stated otherwise, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” and “identifying” or the like refer to actions or processes of a computing device, such as one or more computers or a similar electronic computing device or devices, that manipulate or transform data represented as physical, electronic, or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the computing platform.
The system or systems discussed herein are not limited to any particular hardware architecture or configuration. A computing device can include any suitable arrangement of components that provides a result conditioned on one or more inputs. Suitable computing devices include multipurpose microprocessor-based computer systems accessing stored software that programs or configures the computing system from a general-purpose computing apparatus to a specialized computing apparatus implementing one or more embodiments of the present subject matter. Any suitable programming, scripting, or other type of language or combinations of languages may be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.
Embodiments of the methods disclosed herein may be performed in the operation of such computing devices. The order of the blocks presented in the examples above can be varied—for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
The use of “adapted to” or “configured to” herein is meant as open and inclusive language that does not foreclose devices adapted to or configured to perform additional tasks or steps. Additionally, the use of “based on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based on” one or more recited conditions or values may, in practice, be based on additional conditions or values beyond those recited. Headings, lists, and numbering included herein are for ease of explanation only and are not meant to be limiting.
While the present subject matter has been described in detail with respect to specific embodiments thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing, may readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, it should be understood that the present disclosure has been presented for purposes of example rather than limitation, and does not preclude inclusion of such modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art.

Claims (3)

  1. A system described in the disclosure and/or shown in the drawings.
  2. A method as described in the disclosure and/or shown in the drawings.
  3. A method comprising:
    determining a first set of object attributes of a video clip using machine learning based on machine learning training data;
    receiving a second set of object attributes of the video clip;
    comparing the first set of object attributes and the second set of object attributes; and
    updating the machine learning data with at least a portion of either or both the first set of object attributes and the second set of object attributes.
PCT/CN2015/084956 2015-07-23 2015-07-23 Video processing WO2017012123A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP15898679.4A EP3326083A4 (en) 2015-07-23 2015-07-23 Video processing
PCT/CN2015/084956 WO2017012123A1 (en) 2015-07-23 2015-07-23 Video processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/084956 WO2017012123A1 (en) 2015-07-23 2015-07-23 Video processing

Publications (1)

Publication Number Publication Date
WO2017012123A1 true WO2017012123A1 (en) 2017-01-26

Family

ID=57833867

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/084956 WO2017012123A1 (en) 2015-07-23 2015-07-23 Video processing

Country Status (2)

Country Link
EP (1) EP3326083A4 (en)
WO (1) WO2017012123A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008079850A2 (en) * 2006-12-22 2008-07-03 Google Inc. Annotation framework for video
US20090274434A1 (en) * 2008-04-29 2009-11-05 Microsoft Corporation Video concept detection using multi-layer multi-instance learning
WO2011025701A1 (en) * 2009-08-24 2011-03-03 Google Inc. Relevance-based image selection
US20130156095A1 (en) * 2011-12-15 2013-06-20 Imerj LLC Networked image/video processing system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3326083A4 *

Also Published As

Publication number Publication date
EP3326083A4 (en) 2019-02-06
EP3326083A1 (en) 2018-05-30

Similar Documents

Publication Publication Date Title
US10140554B2 (en) Video processing
US10489660B2 (en) Video processing with object identification
Tsakanikas et al. Video surveillance systems-current status and future trends
US11809998B2 (en) Maintaining fixed sizes for target objects in frames
CN110235138B (en) System and method for appearance search
US9600744B2 (en) Adaptive interest rate control for visual search
CN109614517B (en) Video classification method, device, equipment and storage medium
US10582211B2 (en) Neural network to optimize video stabilization parameters
US11373685B2 (en) Event/object-of-interest centric timelapse video generation on camera device with the assistance of neural network input
US20170193810A1 (en) Video event detection and notification
WO2016201683A1 (en) Cloud platform with multi camera synchronization
US11869241B2 (en) Person-of-interest centric timelapse video with AI input on home security camera to protect privacy
US10410059B2 (en) Cloud platform with multi camera synchronization
CN113228626B (en) Video monitoring system and method
US11954880B2 (en) Video processing
Kong et al. Digital and physical face attacks: Reviewing and one step further
US20190370553A1 (en) Filtering of false positives using an object size model
US20200327332A1 (en) Moving image analysis apparatus, system, and method
WO2017012123A1 (en) Video processing
US20190373165A1 (en) Learning to switch, tune, and retrain ai models
KR102375541B1 (en) Apparatus for Providing Artificial Intelligence Service with structured consistency loss and Driving Method Thereof
US20190370560A1 (en) Detecting changes in object size over long time scales
KR20230069394A (en) Apparatus and method for automatic selection of frame in livestock shooting video
WO2019003040A1 (en) Pulsating image

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15898679

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE