WO2021051230A1 - Systems and methods for object detection - Google Patents

Systems and methods for object detection

Info

Publication number
WO2021051230A1
Authority
WO
WIPO (PCT)
Prior art keywords
machine learning
learning model
samples
object detection
trained machine
Application number
PCT/CN2019/105931
Other languages
English (en)
Inventor
Peilun Li
Guozhen Li
Youzeng LI
Zhangxi Yan
Meiqi LU
Hongda Yang
Yifei Zhang
Original Assignee
Beijing Didi Infinity Technology And Development Co., Ltd.
Application filed by Beijing Didi Infinity Technology And Development Co., Ltd.
Priority to PCT/CN2019/105931
Publication of WO2021051230A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/59: Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/25: Fusion techniques
    • G06F18/254: Fusion techniques of classification results, e.g. of results related to same input data
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Definitions

  • the present disclosure generally relates to object detection technology, and in particular, to systems and methods for hard example mining for object detection.
  • Object detection can be performed on one or more images acquired by one or more video surveillance devices (e.g., a camera), thereby identifying and/or recognizing an object such as a specific person, and further implementing functions such as object tracking based on the recognized information.
  • image-based object detection technology requires a large number of training samples including various types of images collected in different scenes to train an object detection model.
  • the training samples may lack certain specific samples (e.g., samples associated with children), which may cause the trained machine learning model to be insufficiently precise in identifying objects. Therefore, it is desirable to provide systems and methods for object detection with improved accuracy.
  • a system may include at least one storage medium and at least one processor in communication with the at least one storage medium.
  • the at least one storage medium may include a set of instructions.
  • When executing the set of instructions, the system may be configured to perform one or more of the following operations.
  • the system may obtain multiple samples. Each of the multiple samples may include image frames acquired by one or more cameras.
  • the system may determine, based on a first trained machine learning model for object detection, at least a portion of the multiple samples. A count of one or more objects presented in each of the image frames corresponding to each of the at least a portion of the multiple samples may change with time.
  • the system may determine, based on a second trained machine learning model for image classification, a group of samples from the at least a portion of the multiple samples. An identification result of a sample in the group may be false using the first trained machine learning model
  • the system may obtain, based on the group of samples, a sample set for object detection.
  • the system may determine, based on the sample set, a target machine learning model for object detection.
  • the system may be configured to perform one or more of the following operations.
  • the system may determine the count of the one or more objects presented in each of the image frames using the first trained machine learning model.
  • the system may determine a count change curve of the one or more objects presented in the image frames.
  • the system may determine, based on the count change curve corresponding to each of the multiple samples, the at least a portion of the multiple samples.
  • the system may designate samples of the multiple samples whose count change curves have one or more burrs as the at least a portion of the multiple samples.
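  • The disclosure does not specify how a "burr" in a count change curve is detected. The following is a minimal Python sketch, assuming a burr is a short-lived jump in the per-frame object count that reverts to the previous value within a few frames; the function names, the `max_burr_len` threshold, and the example curves are illustrative assumptions, not part of the disclosure.

```python
from typing import List


def count_change_curve(frame_counts: List[int]) -> List[int]:
    """The per-frame object counts, indexed by frame, form the count change curve."""
    return list(frame_counts)


def has_burr(curve: List[int], max_burr_len: int = 3) -> bool:
    """Return True if the curve contains a short-lived deviation ("burr"):
    the count jumps away from a value and returns to it within
    `max_burr_len` frames. Such samples are treated as potential hard samples."""
    n = len(curve)
    for i in range(1, n - 1):
        if curve[i] == curve[i - 1]:
            continue
        # The count changed at frame i; check whether it reverts quickly.
        for j in range(i + 1, min(i + 1 + max_burr_len, n)):
            if curve[j] == curve[i - 1]:
                return True
    return False


def select_potential_hard_samples(samples_frame_counts):
    """Keep samples whose count change curves have one or more burrs."""
    return [sample for sample, counts in samples_frame_counts.items()
            if has_burr(count_change_curve(counts))]


# Example: a 4-passenger ride where the detector briefly reports 3 passengers.
curves = {"ride_001": [4, 4, 3, 4, 4, 4], "ride_002": [2, 2, 2, 2]}
print(select_potential_hard_samples(curves))  # ['ride_001']
```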
  • the system may be configured to perform one or more of the following operations.
  • the system may classify each of the at least a portion of samples into a first group and a second group using the second trained machine learning model.
  • the system may designate samples in the first group or the second group as the group of samples.
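  • A hedged sketch of this filtering step follows, assuming the second trained model exposes a `predict(sample)` method that returns one of two illustrative class labels ("true_hard" or "pseudo_hard"); neither the interface nor the label names are specified in the disclosure.

```python
def filter_true_hard_samples(potential_hard_samples, classifier):
    """Split potential hard samples into two groups with the second trained
    model and keep the group whose identification result by the first model
    is presumed false (the true hard samples)."""
    first_group, second_group = [], []
    for sample in potential_hard_samples:
        if classifier.predict(sample) == "true_hard":
            first_group.append(sample)
        else:
            second_group.append(sample)
    # Designate the group of mis-identified samples as the group of samples.
    return first_group
```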
  • the system may be configured to perform one or more of the following operations.
  • the system may obtain a plurality of groups of training data.
  • Each of the plurality of groups of training data may include one or more image frames and a count of one or more objects presented in each of the one or more image frames.
  • the system may train a first machine learning model using the plurality of groups of training data to obtain the first trained machine learning model.
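  • As an illustration only, a toy training loop for such a first model is sketched below as a per-frame count regressor in PyTorch; the architecture, loss function, optimizer, and the shape of the training groups are assumptions of this sketch and are not fixed by the disclosure.

```python
import torch
from torch import nn


class CountRegressor(nn.Module):
    """Toy per-frame counting network standing in for the first machine
    learning model; the disclosure does not fix an architecture."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1))

    def forward(self, frames):
        # frames: float tensor of shape (N, 3, H, W); returns (N,) counts.
        return self.net(frames).squeeze(-1)


def train_first_model(training_groups, epochs=5, lr=1e-3):
    """Each element of `training_groups` is a (frames, counts) pair: the image
    frames of one group and the labelled object count for each frame."""
    model = CountRegressor()
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for frames, counts in training_groups:
            optimizer.zero_grad()
            loss = loss_fn(model(frames), counts.float())
            loss.backward()
            optimizer.step()
    return model
```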
  • the system may be configured to perform one or more of the following operations.
  • the system may identify a portion of the plurality of groups of training data based on the first trained machine learning model.
  • the system may manually label each of the portion of the plurality of groups of training data with a label.
  • the system may train a second machine learning model using each of the portion of the plurality of groups of training data and the corresponding label.
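  • A possible sketch of this step, assuming the manually assigned label is binary (1 if the first model's identification result for the sample was judged false, 0 otherwise) and that a feature extractor for samples is available; both assumptions, and the use of a plain scikit-learn classifier as a stand-in for the second machine learning model, are illustrative.

```python
from sklearn.linear_model import LogisticRegression


def train_second_model(labeled_samples, extract_features):
    """`labeled_samples` is a list of (sample, label) pairs with binary labels;
    `extract_features` maps a sample to a fixed-length feature vector."""
    X = [extract_features(sample) for sample, _ in labeled_samples]
    y = [label for _, label in labeled_samples]
    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(X, y)
    return classifier
```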
  • the system may be configured to perform one or more of the following operations.
  • the system may determine the count of the one or more objects presented in each of the one or more image frames of each group of the plurality of groups of training data using the first trained machine learning model.
  • the system may determine a count change curve for each group of the plurality of groups of training data based on the count of the one or more objects presented in each of the one or more image frames.
  • the system may determine the portion of the plurality of groups of training data corresponding to the count change curve with one or more burrs.
  • the system may be configured to perform one or more of the following operations.
  • the system may train a machine learning model for object detection using the sample set to obtain a trained machine learning model for object detection.
  • the system may designate the trained machine learning model for object detection as the target machine learning model.
  • the system may further be configured to perform one or more of the following operations.
  • the system may obtain specific image data acquired by a specific camera.
  • the system may perform an object detection on the specific image data using the target machine learning model.
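  • A minimal inference sketch for these two operations, assuming the target model returns per-frame detections as (box, score) pairs and that counting score-thresholded detections is the desired output; the interface and the threshold value are assumptions, not part of the disclosure.

```python
def detect_on_camera_data(frames, target_model, score_threshold=0.5):
    """Run the target machine learning model on image data acquired by a
    specific camera and return the per-frame object count."""
    per_frame_counts = []
    for frame in frames:
        detections = target_model(frame)
        per_frame_counts.append(
            sum(1 for _, score in detections if score >= score_threshold))
    return per_frame_counts
```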
  • a method may be implemented on at least one computing device, each of which may include at least one processor and a storage device.
  • the method may include one or more of the following operations.
  • the method may include obtaining multiple samples, each of the multiple samples may include image frames acquired by one or more cameras; determining, based on a first trained machine learning model for object detection, at least a portion of the multiple samples, a count of one or more objects presented in each of the image frames corresponding to each of the at least a portion of the multiple samples may change with time; determining, based on a second trained machine learning model for image classification, a group of samples from the at least a portion of the multiple samples, an identification result of a sample in the group being false using the first trained machine learning model; obtaining, based on the group of samples, a sample set for object detection; determining, based on the sample set, a target machine learning model for object detection.
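  • Tying the preceding sketches together, the following is a hedged end-to-end outline of the claimed method; `first_model`, `second_model`, `train_detector`, and the `frames` attribute of a sample are illustrative assumptions, and `has_burr` and `filter_true_hard_samples` refer to the sketches above.

```python
def mine_hard_examples_and_retrain(samples, first_model, second_model,
                                   base_training_set, train_detector):
    """End-to-end outline: first_model maps a frame to an object count,
    second_model separates true from pseudo hard samples, and train_detector
    trains an object detection model on a sample set."""
    # 1. Potential hard samples: the per-frame count change curve has burrs.
    potential = [s for s in samples
                 if has_burr([first_model(frame) for frame in s.frames])]
    # 2. True hard samples: keep the group flagged by the image classification
    #    model as mis-identified by the first model.
    hard = filter_true_hard_samples(potential, second_model)
    # 3. Build the sample set and train the target object detection model.
    sample_set = list(base_training_set) + hard
    return train_detector(sample_set)
```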
  • a non-transitory computer readable medium may store instructions. The instructions, when executed by at least one processor, may cause the at least one processor to perform one or more of the following operations.
  • the at least one processor may obtain multiple samples. Each of the multiple samples may include image frames acquired by one or more cameras.
  • the at least one processor may determine, based on a first trained machine learning model for object detection, at least a portion of the multiple samples. A count of one or more objects presented in each of the image frames corresponding to each of the at least a portion of the multiple samples may change with time.
  • the at least one processor may determine, based on a second trained machine learning model for image classification, a group of samples from the at least a portion of the multiple samples.
  • An identification result of a sample in the group may be false using the first trained machine learning model.
  • the at least one processor may obtain, based on the group of samples, a sample set for object detection.
  • the at least one processor may determine, based on the sample set, a target machine learning model for object detection.
  • FIG. 1 is a schematic diagram illustrating an exemplary online to offline service system according to some embodiments of the present disclosure
  • FIG. 2 is a schematic diagram illustrating hardware and/or software components of a computing device according to some embodiments of the present disclosure
  • FIG. 3 is a schematic diagram illustrating hardware and/or software components of an exemplary mobile device according to some embodiments of the present disclosure
  • FIG. 4 is a block diagram illustrating an exemplary processing device according to some embodiments of the present disclosure.
  • FIG. 5 is a schematic flowchart illustrating an exemplary process for object detection according to some embodiments of the present disclosure
  • FIG. 6 is a schematic flowchart illustrating an exemplary process for determining a target machine learning model for object detection according to some embodiments of the present disclosure
  • FIG. 7 is a schematic flowchart illustrating an exemplary process for determining hard samples for object detection according to some embodiments of the present disclosure
  • FIG. 8 is a schematic flowchart illustrating an exemplary process for determining an object detection model according to some embodiments of the present disclosure.
  • FIG. 9 is a schematic flowchart illustrating an exemplary process for determining an image classification model according to some embodiments of the present disclosure.
  • the term “module,” “unit,” or “block,” as used herein, refers to logic embodied in hardware or firmware, or to a collection of software instructions.
  • a module, a unit, or a block described herein may be implemented as software and/or hardware and may be stored in any type of non-transitory computer-readable medium or another storage device.
  • a software module/unit/block may be compiled and linked into an executable program. It will be appreciated that software modules can be callable from other modules/units/blocks or from themselves, and/or may be invoked in response to detected events or interrupts.
  • Software modules/units/blocks configured for execution on computing devices may be provided on a computer-readable medium, such as a compact disc, a digital video disc, a flash drive, a magnetic disc, or any other tangible medium, or as a digital download (and can be originally stored in a compressed or installable format that needs installation, decompression, or decryption prior to execution) .
  • Such software code may be stored, partially or fully, on a storage device of the executing computing device, for execution by the computing device.
  • Software instructions may be embedded in firmware, such as an erasable programmable read-only memory (EPROM) .
  • modules/units/blocks may be included in connected logic components, such as gates and flip-flops, and/or can be included in programmable units, such as programmable gate arrays or processors.
  • the modules/units/blocks or computing device functionality described herein may be implemented as software modules/units/blocks, but may be represented in hardware or firmware.
  • the modules/units/blocks described herein refer to logical modules/units/blocks that may be combined with other modules/units/blocks or divided into sub-modules/sub-units/sub-blocks despite their physical organization or storage. The description may be applicable to a system, an engine, or a portion thereof.
  • the flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood that the operations of the flowcharts may be implemented out of the order shown. Conversely, the operations may be implemented in inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts, and one or more operations may be removed from the flowcharts.
  • Embodiments of the present disclosure may be applied to different transportation systems including but not limited to land transportation, sea transportation, air transportation, space transportation, or the like, or any combination thereof.
  • a vehicle of the transportation systems may include a rickshaw, travel tool, taxi, chauffeured car, hitch, bus, rail transportation (e.g., a train, a bullet train, high-speed rail, and subway) , ship, airplane, spaceship, hot-air balloon, driverless vehicle, or the like, or any combination thereof.
  • the transportation system may also include any transportation system that applies management and/or distribution, for example, a system for sending and/or receiving an express.
  • the application scenarios of different embodiments of the present disclosure may include but are not limited to one or more webpages, browser plugins and/or extensions, client terminals, custom systems, intracompany analysis systems, artificial intelligence robots, or the like, or any combination thereof. It should be understood that application scenarios of the system and method disclosed herein are only some examples or embodiments. Those having ordinary skills in the art, without further creative efforts, may apply these drawings to other application scenarios, for example, another similar server.
  • a method may include obtaining multiple samples. Each of the multiple samples may include image frames acquired by one or more cameras. The method may include determining at least a portion of the multiple samples based on a first trained machine learning model for object detection. A count of one or more objects presented in each of the image frames corresponding to each of the at least a portion of the multiple samples may change with time. The method may also include determining a group of samples from the at least a portion of the multiple samples based on a second trained machine learning model for image classification. A count of one or more objects presented in a sample in the group may be unavailable using the first trained machine learning model. The method may also include obtaining a sample set for object detection based on the group of samples. The method may also include determining a target machine learning model for object detection based on the sample set. The method may also include obtaining a specific image data. The method may include detecting one or more objects presented in the specific image data using the target machine learning model.
  • the systems and methods of the present disclosure may be applied in an online to offline system, a driving system (e.g., an automatic pilot system) , a security system, a smart home system, or the like, or a combination thereof.
  • the systems and methods for object detection may monitor an area of a vehicle of a service provider when the service provider provides a transport service for a service requester.
  • the systems and methods for object detection may identify a count of service requesters from a video or an image collected by a camera installed in the vehicle.
  • the systems and methods for object detection may send identified information (e.g., the count of service requesters) to an online to offline platform for reference, for example, to provide help for the service requester and/or the service provider.
  • systems and methods disclosed in the present disclosure are described primarily regarding an online to offline system for detecting one or more persons inside a vehicle, it should be understood that this is only one exemplary embodiment.
  • the systems and methods of the present disclosure may be applied to other systems to detect any other kind of object.
  • the systems and methods of the present disclosure may be applied to object detection of different objects including an animal, a tree, a roadblock, a building, a vehicle, etc.
  • FIG. 1 is a block diagram illustrating an exemplary online to offline (O2O) service system 100 according to some embodiments of the present disclosure.
  • the O2O service system 100 may be an online transportation service platform for transportation services.
  • the O2O service system 100 may include a server 110, a network 120, a requester terminal 130, a provider terminal 140, a vehicle 150, a storage device 160, and a navigation system 170.
  • the O2O service system 100 may provide a plurality of services. Exemplary services may include a taxi-hailing service, a chauffeur service, an express car service, a carpool service, a bus service, a driver hire service, and a shuttle service.
  • the O2O service may be any online service, such as booking a meal, shopping, or the like, or any combination thereof.
  • the server 110 may be a single server or a server group.
  • the server group may be centralized, or distributed (e.g., the server 110 may be a distributed system) .
  • the server 110 may be local or remote.
  • the server 110 may access information and/or data stored in the requester terminal 130, the provider terminal 140, and/or the storage device 160 via the network 120.
  • the server 110 may be directly connected to the requester terminal 130, the provider terminal 140, and/or the storage device 160 to access stored information and/or data.
  • the server 110 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
  • the server 110 may be implemented on a computing device 200 having one or more components illustrated in FIG. 2 in the present disclosure.
  • the server 110 may include a processing device 112.
  • the processing device 112 may process information and/or data related to object detection to perform one or more functions described in the present disclosure. For example, the processing device 112 may obtain image data from a camera installed in the vehicle 150 and a target machine learning model for object detection from the storage device 160 or any other storage device. The processing device 112 may determine a count of objects presented in the image data using the target machine learning model. As another example, the processing device 112 may determine the target machine learning model for object detection by training a preliminary machine learning model using a sample set. As a further example, the processing device 112 may determine the sample set.
  • the processing device 112 may receive multiple samples including image frames acquired by one or more cameras from the cameras, the requester terminal 130, the provider terminal 140, and/or the storage device 160 via the network 120.
  • the processing device 112 may obtain a first trained machine learning model for object detection from the storage device 160.
  • the processing device 112 may also determine at least a portion of the multiple samples based on the first trained machine learning model for object detection. A count of one or more objects presented in each of the image frames corresponding to each of the at least a portion of the multiple samples may change with time.
  • the processing device 112 may determine a group of samples from the at least a portion of the multiple samples based on a second trained machine learning model for image classification.
  • a count of one or more objects presented in a sample in the group may be unavailable using the first trained machine learning model.
  • the group of samples from the at least a portion of the multiple samples may include hard samples as described elsewhere in the present disclosure (e.g., FIG. 5 and the descriptions thereof) .
  • the determination and/or updating of models may be performed on a processing device, while the application of the models may be performed on a different processing device. In some embodiments, the determination and/or updating of the models may be performed on a processing device of a system different than the online to offline system 100 or a server different than the server 110 on which the application of the models is performed.
  • the determination and/or updating of the models may be performed on a first system of a vendor who provides and/or maintains such a machine learning model, and/or has access to training samples used to determine and/or update the machine learning model, while object detection based on the provided machine learning model, may be performed on a second system of a client of the vendor.
  • the determination and/or updating of the models may be performed online in response to a request for object detection. In some embodiments, the determination and/or updating of the models may be performed offline.
  • the processing device 112 may include one or more processing engines (e.g., single-core processing engine (s) or multi-core processor (s) ) .
  • the processing device 112 may include a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field-programmable gate array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction-set computer (RISC) , a microprocessor, or the like, or any combination thereof.
  • the network 120 may facilitate exchange of information and/or data.
  • one or more components of the O2O service system 100 (e.g., the server 110, the requester terminal 130, the provider terminal 140, the vehicle 150, the storage device 160, and the navigation system 170) may exchange information and/or data with other components of the O2O service system 100 via the network 120.
  • the server 110 may receive a service request from the requester terminal 130 via the network 120.
  • the network 120 may be any type of wired or wireless network, or combination thereof.
  • the network 120 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, an Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public telephone switched network (PSTN), a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or the like, or any combination thereof.
  • the network 120 may include one or more network access points.
  • the network 120 may include wired or wireless network access points such as base stations and/or internet exchange points 120-1, 120-2, through which one or more components of the O2O service system 100 may be connected to the network 120 to exchange data and/or information.
  • a passenger may be an owner of the requester terminal 130. In some embodiments, the owner of the requester terminal 130 may be someone other than the passenger. For example, an owner A of the requester terminal 130 may use the requester terminal 130 to transmit a service request for a passenger B or receive a service confirmation and/or information or instructions from the server 110.
  • a service provider may be a user of the provider terminal 140. In some embodiments, the user of the provider terminal 140 may be someone other than the service provider. For example, a user C of the provider terminal 140 may use the provider terminal 140 to receive a service request for a service provider D, and/or information or instructions from the server 110.
  • “passenger” and “passenger terminal” may be used interchangeably, and “service provider” and “provider terminal” may be used interchangeably.
  • the provider terminal may be associated with one or more service providers (e.g., a night-shift service provider, or a day-shift service provider) .
  • the requester terminal 130 may include a mobile device 130-1, a tablet computer 130-2, a laptop computer 130-3, a built-in device in a vehicle 130-4, a wearable device 130-5, or the like, or any combination thereof.
  • the mobile device 130-1 may include a smart home device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof.
  • the smart home device may include a smart lighting device, a control device of an intelligent electrical apparatus, a smart monitoring device, a smart television, a smart video camera, an interphone, or the like, or any combination thereof.
  • the smart mobile device may include a smartphone, a personal digital assistant (PDA) , a gaming device, a navigation device, a point of sale (POS) device, or the like, or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include a virtual reality helmet, virtual reality glasses, a virtual reality patch, an augmented reality helmet, augmented reality glasses, an augmented reality patch, or the like, or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include Google™ Glasses, an Oculus Rift, a HoloLens, a Gear VR, etc.
  • the built-in device in the vehicle 130-4 may include an onboard computer, an onboard television, etc.
  • the requester terminal 130 may be a device with positioning technology for locating the position of the passenger and/or the requester terminal 130.
  • the wearable device 130-5 may include a smart bracelet, a smart footgear, smart glasses, a smart helmet, a smart watch, smart clothing, a smart backpack, a smart accessory, or the like, or any combination thereof.
  • the provider terminal 140 may include a plurality of provider terminals 140-1, 140-2, ..., 140-n. In some embodiments, the provider terminal 140 may be similar to, or the same device as the requester terminal 130. In some embodiments, the provider terminal 140 may be customized to be able to implement the O2O service system 100. In some embodiments, the provider terminal 140 may be a device with positioning technology for locating the service provider, the provider terminal 140, and/or a vehicle 150 associated with the provider terminal 140. In some embodiments, the requester terminal 130 and/or the provider terminal 140 may communicate with another positioning device to determine the position of the passenger, the requester terminal 130, the service provider, and/or the provider terminal 140.
  • the requester terminal 130 and/or the provider terminal 140 may periodically transmit the positioning information to the server 110.
  • the provider terminal 140 may also periodically transmit the availability status to the server 110.
  • the availability status may indicate whether a vehicle 150 associated with the provider terminal 140 is available to carry a passenger.
  • the requester terminal 130 and/or the provider terminal 140 may transmit the positioning information and the availability status to the server 110 every thirty minutes.
  • the requester terminal 130 and/or the provider terminal 140 may transmit the positioning information and the availability status to the server 110 each time the user logs into the mobile application associated with the O2O service system 100.
  • the provider terminal 140 may correspond to one or more vehicles 150.
  • the vehicles 150 may carry the passenger and travel to the destination.
  • the vehicles 150 may include a plurality of vehicles 150-1, 150-2, ..., 150-n.
  • One vehicle may correspond to one type of service (e.g., a taxi-hailing service, a chauffeur service, an express car service, a carpool service, a bus service, a driver hire service, or a shuttle service).
  • the vehicles 150 may be installed with a camera.
  • the camera may be configured to obtain image data by performing surveillance of an area within the scope of the camera.
  • a camera may refer to an apparatus for visual recording.
  • the camera may include a color camera, a digital video camera, a camcorder, a PC camera, a webcam, an infrared (IR) video camera, a low-light video camera, a thermal video camera, a CCTV camera, a pan-tilt-zoom (PTZ) camera, a video sensing device, or the like, or a combination thereof.
  • the image data may include a video.
  • the video may include a television, a movie, an image sequence, a computer-generated image sequence, or the like, or a combination thereof.
  • the area may be reflected in the video as a scene.
  • the scene may include one or more objects of interest.
  • the one or more objects may include a person, a vehicle, an animal, a physical subject, or the like, or a combination thereof.
  • the storage device 160 may store data and/or instructions.
  • the storage device 160 may store data of a plurality of travel trajectories, a plurality of orders, image data obtained by the camera in the vehicle 150, one or more machine learning models, a training set of a machine learning model, etc.
  • the storage device 160 may store data obtained from the requester terminal 130 and/or the provider terminal 140.
  • the storage device 160 may store data and/or instructions that the server 110 may execute or use to perform exemplary methods described in the present disclosure.
  • the storage device 160 may include a mass storage device, a removable storage device, a volatile read-and-write memory, a read-only memory (ROM) , or the like, or any combination thereof.
  • Exemplary mass storage devices may include a magnetic disk, an optical disk, a solid-state drive, etc.
  • Exemplary removable storage devices may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc.
  • Exemplary volatile read-and-write memory may include a random-access memory (RAM) .
  • Exemplary RAM may include a dynamic RAM (DRAM), a double data rate synchronous dynamic RAM (DDR SDRAM), a static RAM (SRAM), a thyristor RAM (T-RAM), and a zero-capacitor RAM (Z-RAM), etc.
  • Exemplary ROM may include a mask ROM (MROM) , a programmable ROM (PROM) , an erasable programmable ROM (EPROM) , an electrically-erasable programmable ROM (EEPROM) , a compact disk ROM (CD-ROM) , and a digital versatile disk ROM, etc.
  • the storage device 160 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
  • the storage device 160 may be connected to the network 120 to communicate with one or more components of the O2O service system 100 (e.g., the server 110, the requester terminal 130, or the provider terminal 140) .
  • One or more components of the O2O service system 100 may access the data or instructions stored in the storage device 160 via the network 120.
  • the storage device 160 may be directly connected to or communicate with one or more components of the O2O service system 100 (e.g., the server 110, the requester terminal 130, the provider terminal 140) .
  • the storage device 160 may be part of the server 110.
  • the navigation system 170 may determine information associated with an object, for example, one or more of the requester terminal 130, the provider terminal 140, the vehicle 150, etc.
  • the navigation system 170 may be a global positioning system (GPS) , a global navigation satellite system (GLONASS) , a compass navigation system (COMPASS) , a BeiDou navigation satellite system, a Galileo positioning system, a quasi-zenith satellite system (QZSS) , etc.
  • the information may include a location, an elevation, a speed, or an acceleration of the object, or a current time.
  • the navigation system 170 may include one or more satellites, for example, a satellite 170-1, a satellite 170-2, and a satellite 170-3.
  • the satellites 170-1 through 170-3 may determine the information mentioned above independently or jointly.
  • the satellite navigation system 170 may transmit the information mentioned above to the network 120, the requester terminal 130, the provider terminal 140, or the vehicle 150 via wireless connections.
  • one or more components of the O2O service system 100 may have permissions to access the storage device 160.
  • one or more components of the O2O service system 100 may read and/or modify information related to the passenger, service provider, and/or the public when one or more conditions are met.
  • the server 110 may read and/or modify one or more passengers’ information after a service is completed.
  • the server 110 may read and/or modify one or more service providers’ information after a service is completed.
  • when an element or component of the O2O service system 100 performs a function, the element may perform through electrical signals and/or electromagnetic signals.
  • a processor of the requester terminal 130 may generate an electrical signal encoding the request.
  • the processor of the requester terminal 130 may then transmit the electrical signal to an output port. If the requester terminal 130 communicates with the server 110 via a wired network, the output port may be physically connected to a cable, which further may transmit the electrical signal to an input port of the server 110.
  • if the requester terminal 130 communicates with the server 110 via a wireless network, the output port of the requester terminal 130 may be one or more antennas, which convert the electrical signal to an electromagnetic signal.
  • the requester terminal 130 and/or the provider terminal 140 may receive an instruction and/or a service request from the server 110 via electrical signals or electromagnetic signals.
  • within an electronic device such as the requester terminal 130, the provider terminal 140, and/or the server 110, when a processor thereof processes an instruction, transmits out an instruction, and/or performs an action, the instruction and/or action is conducted via electrical signals.
  • when the processor retrieves or saves data from a storage medium, it may transmit out electrical signals to a read/write device of the storage medium, which may read or write structured data in the storage medium.
  • the structured data may be transmitted to the processor in the form of electrical signals via a bus of the electronic device.
  • an electrical signal may refer to one electrical signal, a series of electrical signals, and/or a plurality of discrete electrical signals.
  • FIG. 2 is a schematic diagram illustrating exemplary hardware and/or software components of a computing device 200 on which the server 110, the requester terminal 130 and/or the provider terminal 140 may be implemented according to some embodiments of the present disclosure.
  • the processing device 112 may be implemented on the computing device 200 and configured to perform functions of the processing device 112 disclosed in this disclosure.
  • the computing device 200 may be a special purpose computer in some embodiments.
  • the computing device 200 may be used to implement the online to offline system 100 for the present disclosure.
  • the computing device 200 may implement any component that performs one or more functions disclosed in the present disclosure. In FIGs. 1-2, only one such computing device is shown purely for convenience purposes.
  • One of ordinary skill in the art would understand that the computer functions relating to the object detection as described herein may be implemented in a distributed fashion on a number of similar platforms, to distribute the processing load.
  • the computing device 200 may include communication ports (COM PORTS) 250 connected to and from a network (e.g., the network 120) connected thereto to facilitate data communications.
  • the computing device 200 may also include a processor 220, in the form of one or more processors, for executing program instructions.
  • the exemplary computer platform may include an internal communication bus 210, a program storage and a data storage of different forms, for example, a disk 270, and a read only memory (ROM) 230, or a random access memory (RAM) 240, for various data files to be processed and/or transmitted by the computer.
  • the exemplary computer platform may also include program instructions stored in the ROM 230, the RAM 240, and/or other type of non-transitory storage medium to be executed by the processor 220.
  • the methods and/or processes of the present disclosure may be implemented as the program instructions.
  • the computing device 200 also includes an I/O component 260, supporting input/output between the computing device 200 and other components.
  • only one processor 220 is described in the computing device 200.
  • the computing device 200 in the present disclosure may also include multiple processors, thus operations and/or method steps that are performed by one processor 220 as described in the present disclosure may also be jointly or separately performed by the multiple processors.
  • for example, if the processor 220 of the computing device 200 executes both operation A and operation B, operation A and operation B may also be performed by two different processors jointly or separately in the computing device 200 (e.g., the first processor executes operation A and the second processor executes operation B, or the first and second processors jointly execute operations A and B).
  • FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary mobile device according to some embodiments of the present disclosure.
  • the mobile device 300 may include a camera 305, a communication platform 310, a display 320, a graphic processing unit (GPU) 330, a central processing unit (CPU) 340, an I/O 350, a memory 360, a mobile operating system (OS) 370, application (s) 380, and a storage 390.
  • any other suitable component including but not limited to a system bus or a controller (not shown) , may also be included in the mobile device 300.
  • the mobile operating system 370 (e.g., iOS™, Android™, Windows Phone™, etc.)
  • the applications 380 may include a browser or any other suitable mobile apps for receiving and rendering information relating to image processing or other information from the online to offline system 100.
  • User interactions with the information stream may be achieved via the I/O 350 and provided to the storage device 160, the server 110 and/or other components of the online to offline system 100.
  • the mobile device 300 may be an exemplary embodiment corresponding to a terminal associated with the online to offline system 100, the requester terminal 130 and/or the provider terminal 140.
  • computer hardware platforms may be used as the hardware platform (s) for one or more of the elements described herein.
  • the hardware elements, operating systems and programming languages of such computers are conventional in nature, and it is presumed that those skilled in the art are adequately familiar therewith to adapt those technologies to detect objects and/or obtain samples as described herein.
  • a computer with user interface elements may be used to implement a personal computer (PC) or other type of work station or terminal device, although a computer may also act as a server if appropriately programmed. It is believed that those skilled in the art are familiar with the structure, programming and general operation of such computer equipment and as a result the drawings should be self-explanatory.
  • the element may perform through electrical signals and/or electromagnetic signals.
  • the server 110 may operate logic circuits in its processor to process such a task.
  • the processor of the server 110 may generate electrical signals encoding the target machine learning model.
  • the processor of the server 110 may then send the electrical signals to at least one data exchange port of a target system associated with the server 110.
  • if the server 110 communicates with the target system via a wired network, the at least one data exchange port may be physically connected to a cable, which may further transmit the electrical signals to an input port (e.g., an information exchange port) of the requester terminal 130.
  • if the server 110 communicates with the target system via a wireless network, the at least one data exchange port of the target system may be one or more antennas, which may convert the electrical signals to electromagnetic signals.
  • within an electronic device such as the requester terminal 130, the provider terminal 140, and/or the server 110, when a processor thereof processes an instruction, sends out an instruction, and/or performs an action, the instruction and/or action is conducted via electrical signals.
  • when the processor retrieves or saves data from a storage medium (e.g., the storage device 160), it may send out electrical signals to a read/write device of the storage medium, which may read or write structured data in the storage medium.
  • the structured data may be transmitted to the processor in the form of electrical signals via a bus of the electronic device.
  • an electrical signal may be one electrical signal, a series of electrical signals, and/or a plurality of discrete electrical signals.
  • FIG. 4 is a block diagram illustrating an exemplary processing device 112 according to some embodiments of the present disclosure.
  • the processing device 112 may include an obtaining module 410, a determination module 420, an object detection module 430, and a storage module 440.
  • Each of the modules described above may be a hardware circuit that is designed to perform certain actions, e.g., according to a set of instructions stored in one or more storage media, and/or any combination of the hardware circuit and the one or more storage media.
  • the obtaining module 410 may be configured to obtain specific image data acquired by a specific camera.
  • the specific image data may be generated by the specific camera via monitoring an area within the scope of the specific camera.
  • the obtaining module 410 may obtain the specific image data in real time or periodically.
  • the obtaining module 410 may obtain the specific image data from the specific camera, the requester terminal 130, the provider terminal 140, the storage device 160, or any other storage device.
  • the obtaining module 410 may be configured to obtain multiple samples. Each of the multiple samples may include image frames. In some embodiments, the obtaining module 410 may obtain the multiple samples from same or different cameras in a scene, such as a meeting scene, a working scene, a game scene, a party scene, a travel scene, a play scene, or the like, or any combination thereof. In some embodiments, the obtaining module 410 may obtain the image frames of a sample from a camera automatically and periodically, such as every 0.001 seconds, every 0.002 seconds, every 0.003 seconds, every 0.01 seconds, etc. In some embodiments, the obtaining module 410 may obtain the multiple samples in real time or periodically. In some embodiments, the obtaining module 410 may obtain the multiple samples from the cameras, the requester terminal 130, the provider terminal 140, the storage device 160, and/or any other storage device (not shown in the online to offline system 100) via the network 120.
  • the determination module 420 may be configured to determine at least a portion of the multiple samples based on a first trained machine learning model for object detection. A count of one or more objects presented in each of the image frames corresponding to each of the at least a portion of the multiple samples may change with time. The at least a portion of the multiple samples may be also referred to as a potential hard sample set. In some embodiments, the determination module 420 may determine the first trained machine learning model by training a machine learning model using a first training set. In some embodiments, the determination module 420 may determine the first trained machine learning model by training a prior updated object detection model using the training set.
  • the determination module 420 may determine a potential hard sample based on a count of the one or more objects presented in each of the multiple image frames of a sample.
  • a potential hard sample may be a true hard sample or a pseudo-hard sample.
  • the determination module 420 may determine a potential hard sample based on a count change curve of the one or more objects presented in the image frames of a sample.
  • the determination module 420 may determine the count change curve of one or more objects presented in the multiple image frames of the specific sample based on the count of one or more objects presented in each of the multiple image frames of the specific sample.
  • the determination module 420 may determine a group of samples from the at least a portion of the multiple samples based on a second trained machine learning model for image classification. Each of the group of samples may also be referred to as a true hard sample. In some embodiments, the determination module 420 may determine the second trained machine learning model by training a machine learning model using a second training set. The second training set may include a plurality of samples. Each of the plurality of samples in the second training set may include a video. In some embodiments, the determination module 420 may classify each sample in the potential hard sample set into a first group and a second group using the second trained machine learning model. The determination module 420 may designate samples in the first group or the second group as the group of samples. In some embodiments, the determination module 420 may classify each sample in the potential hard sample set into several groups using the second trained machine learning model. Each of the several groups may include one or more samples acquired under a same scene.
  • the obtaining module 410 may obtain a sample set for object detection based on the group of samples.
  • the determination module 420 may determine a target machine learning model for object detection based on the sample set. In some embodiments, the determination module 420 may determine the target machine learning model by training an object detection model using the sample set.
  • the object detection module 430 may be configured to perform an object detection on the specific image data using the target machine learning model.
  • the object detection module 430 may input the specific image data into the target machine learning model to determine whether the specific image data presents one or more objects (e.g., passengers) .
  • the object detection module 430 may input the specific image data into the target machine learning model to determine a count of the one or more objects presented in the specific image data.
  • the specific image data may be a video including multiple image frames.
  • the object detection module 430 may input the specific image data into the target machine learning model to determine a count of the one or more objects presented in each of the multiple image frames and/or a count change curve of the one or more objects presented in the specific image data.
  • the object detection module 430 may input the specific image data into the target machine learning model to determine whether an anomaly behavior (e.g., a fighting, a quarrel) is involved between detected objects presented in the specific image data.
  • the storage module 440 may be configured to store information.
  • the information may include programs, software, algorithms, data, text, number, images and some other information.
  • the information may include a specific image, a video, etc.
  • the processing device 112 may further include an image preprocessing module (not shown in FIG. 4) .
  • the image preprocessing module may be configured to preprocess images of the multiple samples.
  • the determination module 420 and the object detection module 430 may be integrated into one module.
  • FIG. 5 is a schematic flowchart illustrating an exemplary process for object detection according to some embodiments of the present disclosure.
  • the process 500 may be implemented as a set of instructions (e.g., an application) stored in the storage device 160, ROM 230 or RAM 240, or storage 390.
  • the processing device 112, the processor 220, and/or the CPU 340 may execute the set of instructions, and when executing the instructions, the processing device 112, the processor 220, and/or the CPU 340 may be configured to perform the process 500.
  • the operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 500 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order in which the operations of the process 500 are illustrated in FIG. 5 and described below is not intended to be limiting.
  • the processing device 112 may obtain specific image data acquired by a specific camera.
  • the specific image data may be generated by the specific camera via monitoring an area within the scope of the specific camera.
  • the specific camera may denote any device for visual recording as described elsewhere in the present disclosure (e.g., FIG. 1 and the descriptions thereof) .
  • the specific image data may include a video as described elsewhere in the present disclosure (e.g., FIG. 1 and the descriptions thereof) .
  • the video may include a plurality of image frames.
  • the video may record and/or present a specific scene in the area, such as, a meeting scene, a working scene, a game scene, a party scene, a travel scene, a play scene, or the like, or any combination thereof.
  • the specific image data may be a video recording a travel scene inside and/or outside the vehicle collected by the specific camera (e.g., a driving recorder) installed inside and/or outside of the vehicle.
  • the processing device 112 may obtain the specific image data in real time or periodically.
  • the processing device 112 (e.g., the obtaining module 410) may obtain the specific image data from the specific camera, the requester terminal 130, the provider terminal 140, the storage device 160, or any other storage device.
  • the processing device 112 may obtain a target machine learning model for object detection.
  • the target machine learning model may be constructed based on a linear regression model, a support vector machine (SVM) model, a decision tree model, a boosting model, a neural network model, or the like, or any combination thereof.
  • the target machine learning model may be configured to determine whether the specific image data includes one or more objects. For example, the target machine learning model may output “1” denoting that the specific image data includes one or more objects or “0” denoting that the specific image data does not present one or more objects.
  • the target machine learning model for object detection may be configured to determine a count of one or more objects presented in the specific image data. In some embodiments, the target machine learning model for object detection may be configured to determine behaviors of the one or more objects presented in the specific image data. For example, the target machine learning model may determine whether an anomaly behavior (e.g., a fighting, a quarrel) is involved between detected objects presented in the specific image data.
  • the target machine learning model may be determined by training a preliminary machine learning model (e.g., the first trained machine learning model for object detection as described in FIG. 6) using a training set.
  • the preliminary machine learning model may be an initial machine learning model that has not been trained using any training data, such as a neural network model.
  • the preliminary machine learning model may be an original object detection model provided by a vendor who provides and/or maintains such a machine learning model for object detection, and/or has access to the training set used to determine and/or update the machine learning model for object detection.
  • the object detection model provided by a vendor may be updated from time to time, e.g., periodically or not, based on a training set that is at least partially different from a prior training set from which a prior updated object detection model is determined.
  • the preliminary machine learning model may be an updated object detection model.
  • the preliminary machine learning model may be a last updated object detection model based on a training set including new training samples that are not in the prior training set, training samples being assessed using the prior updated object detection model updated before the last updated object detection model, or the like, or a combination thereof.
  • the training set used for object detection may include a plurality of videos (or images) . Each of at least a portion of the plurality of videos (or images) may present one or more objects, such as persons (e.g., adults, children, old people) , animals, physical subjects, etc.
  • a video (or one or more images) in the training set may also be referred to as a sample.
  • a video (or one or more images) in the training set that presents one or more objects may be labeled as a positive sample.
  • a video (or one or more images) in the training set that does not present one or more objects may be labeled as a negative sample.
  • a video (or one or more images) in the training set may be labeled with a label representing a count (or number) of one or more objects presented in the video (or one or more images) .
• for example, a video (or one or more images) presenting one object may be labeled with a label “1.”
  • the training set may include one or more hard samples (i.e., true hard samples) , also referred to as hard examples.
• a hard sample may refer to a sample (e.g., a video) whose identification or detection result is proved false when the sample is analyzed using an object detection model that is different from the target machine learning model.
  • a training set used for determining the object detection model may include rare hard samples.
  • An identification or detection result of a sample being false based on the object detection model means the identification or detection result is different from a true result of the sample.
  • a hard sample may be a positive sample, but it is identified as a negative sample using the object detection model.
  • a hard sample may be a negative sample, but it is identified as a positive sample using the object detection model.
  • a hard sample may be a video (or one or more images) showing a true count of objects. However, the estimated count of objects presented in the video (or one or more images) identified by the object detection model may be different from the true count of objects.
• a hard sample may also be referred to as a rare sample of the training set that is used to determine the object detection model.
• a rare sample may refer to a video that has captured objects that are rarely presented in other samples of the training set used to determine the object detection model.
• a ratio of the number of rare samples to the total number of samples in the training set may be less than a threshold, such as less than 1%, 0.5%, etc.
• children and/or disabled individuals may rarely request an online transport service alone. The camera installed in a vehicle for providing the online transport service thus may rarely collect videos that have captured children and/or disabled individuals. So a video presenting a child or a disabled individual may be designated as a hard sample (a rare sample) .
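As a rough illustration of the rare-sample imbalance described above, the sketch below computes the fraction of rare samples (e.g., those presenting children or disabled individuals) in a training set and compares it against a threshold. The tag names and the 1% threshold are assumed example values, not values fixed by the disclosure.

```python
def rare_sample_ratio(labels):
    """labels: per-sample tags such as 'adult', 'child', or 'disabled'."""
    rare = sum(1 for tag in labels if tag in {"child", "disabled"})
    return rare / len(labels) if labels else 0.0

training_labels = ["adult"] * 995 + ["child"] * 3 + ["disabled"] * 2
ratio = rare_sample_ratio(training_labels)
print(f"rare-sample ratio: {ratio:.3%}")                      # 0.500%
print("under-represented" if ratio < 0.01 else "sufficient")  # under-represented
```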
  • At least a portion of the one or more hard samples in the training set may be acquired from multiple videos (or images) collected by one or more cameras manually.
• for example, an operator (e.g., an engineer) may choose videos (or one or more images) presenting one or more children, one or more disabled individuals, etc., from the multiple videos (or images) as the at least a portion of the one or more hard samples.
  • the at least a portion of the one or more hard samples in the training set may be acquired from multiple videos (or images) collected by one or more cameras automatically.
  • a processor may determine, based on the object detection model (e.g., the preliminary machine learning model) , at least a portion of the multiple videos (i.e., potential hard samples) .
  • the count of one or more objects presented in each of image frames of each of the multiple videos may change with time. In other words, the counts of one or more objects presented in at least two of image frames of each of the multiple videos may be different.
• the processor (e.g., the processing device 112 or a processing device different from the processing device 112) may determine a group of videos (i.e., true hard samples) from the at least a portion of the multiple videos using an image classification model.
• the count of one or more objects presented in a video in the group, as determined using the object detection model, may be false. More descriptions for determining hard samples may be found elsewhere in the present disclosure (e.g., FIG. 6 and the descriptions thereof) .
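A compact sketch of the automatic two-stage mining outlined above: the object detection model flags videos whose per-frame counts are inconsistent (potential hard samples), and an image classification model then keeps only the true hard samples. The `detector` and `classifier` interfaces are hypothetical stand-ins, not APIs defined by the disclosure.

```python
def mine_true_hard_samples(videos, detector, classifier):
    """videos: iterable of videos, each an iterable of frames."""
    potential = []
    for video in videos:
        counts = [detector.count_objects(frame) for frame in video]
        if len(set(counts)) > 1:   # per-frame counts change -> potential hard sample
            potential.append(video)
    # second stage: keep only the samples the classifier marks as true hard samples
    return [video for video in potential if classifier.is_true_hard(video)]
```

In this reading, the detector plays the role of the preliminary (first trained) model and the classifier plays the role of the second, classification model discussed with FIG. 6.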
  • the processing device 112 may perform an object detection on the specific image data using the target machine learning model.
  • the target machine learning model may be configured to output an identification or detection result including that there are the one or more objects presented in the specific image data or no object presented in the specific image data.
• the target machine learning model may be configured to output an identification or detection result including the count of the one or more objects presented in the specific image data.
  • the count of the one or more objects may be “0” , “1” , “2” , “3” , etc.
  • the specific image data may be a video including multiple image frames.
  • the target machine learning model may be configured to output an identification or detection result including the count change curve of the one or more objects presented in the specific image data and/or the count of the one or more objects presented in each of the multiple image frames.
  • the count change curve of the one or more objects may be a curve showing the change of the count of the one or more objects along time or with respect to each of image frames of the specific image data.
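As an illustration of the count change curve just described, the sketch below represents the curve as (frame index, count) pairs; the hard-coded per-frame counts stand in for outputs of the target machine learning model and are assumed values.

```python
per_frame_counts = [2, 2, 2, 1, 2, 2]                    # assumed per-frame detection counts
count_change_curve = list(enumerate(per_frame_counts))   # (frame index, count) pairs

for frame_index, count in count_change_curve:
    print(f"frame {frame_index}: {count} object(s)")
```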
  • the processing device 112 may transmit a signal to a terminal (e.g., a mobile terminal associated with the requester terminal 130) , a server associated with an online to offline platform, etc.
  • the signal may include the identification or detection result for object detection.
  • the signal may be configured to direct the terminal to display the identification or detection result to a user associated with the terminal.
  • the processing device 112 and/or the server associated with the online to offline platform may determine an event associated with the one or more detected objects presented in the specific image data based on the identification or detection result.
• the processing device 112 and/or the server associated with the online to offline platform may determine whether a service provider (e.g., a driver) picks up a service requester (e.g., a passenger) based on the count of the one or more objects presented in the specific image data. If the processing device 112 determines that the count of the one or more objects presented in the specific image data does not change, the processing device 112 may determine that the service provider does not pick up the service requester. As another example, the processing device 112 may determine whether an anomalous behavior of a driver and/or a passenger requires intervention by a third party. Further, if the processing device 112 determines that an anomalous behavior of a driver and/or a passenger requires intervention by a third party, the processing device 112 may generate an alert, call the police, etc.
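One possible way to use the detection result for the pick-up check described above is sketched below: if the in-vehicle object count never increases across the video, the passenger is taken as not picked up. The rule is an assumption for illustration, not the disclosed decision logic.

```python
def passenger_picked_up(counts_over_time):
    """counts_over_time: object counts for consecutive frames of the in-vehicle video."""
    return any(later > earlier
               for earlier, later in zip(counts_over_time, counts_over_time[1:]))

print(passenger_picked_up([1, 1, 1, 1]))   # False -> count never changes, no pick-up inferred
print(passenger_picked_up([1, 1, 2, 2]))   # True  -> count increases, pick-up inferred
```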
  • the processing device 112 may update the target machine learning model by updating the training set using the specific image data and the detected result of the specific image data.
  • the processing device 112 may also preprocess the specific image data.
• in some embodiments, one or more other optional operations (e.g., a storing operation) may be added in the process. For example, the processing device 112 may store information and/or data (e.g., the specific image data, the target machine learning model, the count of the one or more objects presented in the specific image data, etc.) associated with the online to offline system 100 in a storage device (e.g., the storage device 160) disclosed elsewhere in the present disclosure.
  • FIG. 6 is a schematic flowchart illustrating an exemplary process for determining a target machine learning model for object detection according to some embodiments of the present disclosure.
  • the process 600 may be implemented as a set of instructions (e.g., an application) stored in the storage device 160, ROM 230 or RAM 240, or storage 390.
  • the processing device 112, the processor 220, and/or the CPU 340 may execute the set of instructions, and when executing the instructions, the processing device 112, the processor 220, and/or the CPU 340 may be configured to perform the process 600.
• the operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 600 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order of the operations of the process 600 as illustrated in FIG. 6 and described below is not intended to be limiting. Operation 520 may be performed according to the process 600 as illustrated in FIG. 6.
  • the processing device 112 may obtain multiple samples.
  • Each of the multiple samples may include image frames.
  • each of the multiple samples may include a video, an image sequence, or the like, or any combination thereof.
  • Each of the multiple samples may be collected by a camera as described elsewhere in the present disclosure (e.g., FIG. 1 and the descriptions thereof) .
• the multiple samples may be obtained from the same or different cameras in a scene, such as a meeting scene, a working scene, a game scene, a party scene, a travel scene, a play scene, or the like, or any combination thereof.
  • the multiple samples may be obtained from a camera installed in a vehicle for recording a scene in the vehicle.
• the image frames of a sample may be obtained by a camera automatically and periodically, such as every 0.001 seconds, every 0.002 seconds, every 0.003 seconds, every 0.01 seconds, etc.
  • the processing device 112 may obtain the multiple samples in real time or periodically.
• the processing device 112 may obtain the multiple samples from the cameras, the requester terminal 130, the provider terminal 140, the storage device 160, and/or any other storage device (not shown in the online to offline system 100) via the network 120.
  • the camera may acquire one of the multiple samples via monitoring an area surrounding the camera.
  • the area may be part of a space in a vehicle.
  • a sample may record a scene inside and/or outside of the vehicle.
  • a sample may include a video recording a passenger getting on or off the vehicle.
  • a sample may include a video recording a moment of turn-on or turn-off of a lighting device (e.g., a fill light of the camera) installed in the vehicle.
  • a sample may include a video recording an abnormal pose (e.g., lying down, bending down, etc. ) of a passenger.
  • a sample may include a video recording the passenger’s partially or entirely blocked body, for example, by the driver.
  • a sample may include a video recording a rare passenger (e.g., a child and/or a disabled person) , or a passenger holding a child, etc.
  • the processing device 112 may preprocess the image frames of each of one or more samples.
  • the preprocessing of an image frame may include, but is not limited to, an image thresholding operation, an image enhancement operation, an image segmentation operation, a symmetric neighborhood filter operation, or any preprocessing technique for binarization of the image frame.
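A minimal sketch of one preprocessing step of the kind listed above (image thresholding for binarization). A real pipeline might instead apply enhancement, segmentation, or filtering; the 8-bit threshold value is an assumption for the sketch.

```python
import numpy as np

def binarize(frame, threshold=128):
    """frame: 2-D array of 8-bit grayscale pixel values; returns a 0/1 mask."""
    return (frame >= threshold).astype(np.uint8)

frame = np.array([[10, 200],
                  [150, 90]], dtype=np.uint8)
print(binarize(frame))   # [[0 1]
                         #  [1 0]]
```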
  • the processing device 112 may determine at least a portion of the multiple samples based on a first trained machine learning model for object detection.
  • a count of one or more objects presented in each of the image frames corresponding to each of the at least a portion of the multiple samples may change with time. In other words, the count of the one or more objects presented in each of the image frames of each of the at least a portion of the multiple samples may be inconsistent or different.
  • the image frames of a sample in which the count of one or more objects presented shows this inconsistency or difference may record a special scene.
• the special scene recorded by a video may be a cause of the inconsistency identified in the video by the first trained machine learning model.
• for example, the image frames of the sample may record a passenger getting on or off a vehicle, the moment of turn-on of a lighting device (e.g., a fill light of the camera) installed in the vehicle, an abnormal pose of the passenger (e.g., lying down) , the passenger’s body partially or entirely blocked, for example, by the driver, a rare passenger (e.g., a child and/or a disabled person) , a passenger holding a child, or the like, or any combination thereof.
  • Each of the at least a portion of the multiple samples in which the count of one or more objects presented changes may also be referred to as a potential hard sample.
  • a potential hard sample may be a true hard sample or a pseudo-hard sample.
  • the true hard sample may be a sample (e.g., a video) of which an identification result produced by the first trained machine learning model is proved false.
  • the pseudo-hard sample may be a sample (e.g., a video) of which an identification result produced by the first trained machine learning model is proved correct.
  • the first trained machine learning model may determine the count of one or more objects (e.g., persons) presented in a pseudo-hard sample based on the count of one or more objects (e.g., persons) presented in each of multiple image frames of the pseudo-hard sample.
  • the first trained machine learning model cannot determine the count of one or more objects (e.g., persons) presented in a true-hard sample based on the count of one or more objects (e.g., persons) presented in each of multiple image frames of the true hard sample.
  • a true hard sample may be a video recording a rare type passenger (e.g., a child and/or a disabled person) , or recording a passenger holding a child, etc.
  • a pseudo-hard sample may be a video recording a passenger getting on or off a vehicle, or recording the moment of turn-on or turn-off of a lighting device (e.g., a fill light of the camera) installed in the vehicle, or recording an abnormal pose of a passenger (e.g., lying down) , or recording the passenger’s partially or entirely blocked body, for example, by the driver, or the like, or any combination thereof.
  • the first trained machine learning model may be constructed based on a linear regression model, a support vector machine (SVM) model, a decision tree model, a boosting model, a neural network model, or the like, or any combination thereof.
  • the processing device 112 may obtain the first trained machine learning model from the storage device 160, the requester terminal 130, or any other storage device.
  • the first trained machine learning model may be an object detection model that is determined by training a machine learning model using a first training set.
  • the first trained machine learning model may be an original object detection model provided by a vendor who provides and/or maintains such a machine learning model for object detection, and/or has access to the training set used to determine and/or update the machine learning model.
  • the first trained machine learning model may be an object detection model that is determined by training a prior updated object detection model using the training set.
  • the first trained machine learning model may be determined by training a prior updated object detection model using a new training set including new samples that are not in a prior training set used for determining the prior updated object detection model.
• a ratio of the count (or number) of true hard samples to the total count (or number) of samples in the training set used for determining the first trained machine learning model may be less than a threshold, such as less than 0.5%, less than 1%, less than 2%, less than 3%, less than 5%, etc., which may cause the first trained machine learning model to be unable to detect and/or determine the count of one or more objects presented in a true hard sample. More descriptions for determining the first trained machine learning model may be found elsewhere in the present disclosure (e.g., FIG. 8, and the descriptions thereof) .
  • the first trained machine learning model may be configured to identify and/or detect one or more objects (e.g., persons) presented in each image frame of a sample. For example, the first trained machine learning model may be configured to mark each of one or more objects (e.g., persons) presented in a specific image frame of a specific sample and output the specific image frame with marked objects. In some embodiments, the first trained machine learning model may be configured to identify and/or determine a count of one or more objects (e.g., persons) presented in each image frame of a sample. The first trained machine learning model may output a count of one or more objects presented in each image frame of the sample.
  • the first trained machine learning model may be further configured to determine a count change curve of the one or more objects presented in the multiple image frames of the sample.
  • the count of one or more objects presented in each of the image frames of a sample may be the same or different.
• for example, the count of one or more objects in each of the image frames of a sample may always be 2.
• as another example, the count of one or more objects in one image frame of a specific sample may be 1, and the count of one or more objects in another image frame of the specific sample may be 2.
  • the processing device 112 may determine a potential hard sample based on a count of the one or more objects presented in each of the multiple image frames of a sample. For example, the processing device 112 may determine the count of one or more objects presented in each of multiple image frames of a specific sample. The processing device 112 may determine whether the count of one or more objects presented in each of the multiple image frames of the specific sample is consistent or same. If the processing device 112 determines that the count of one or more objects presented in each of the multiple images of the specific sample is inconsistent or different, the processing device 112 may determine the specific sample as a potential hard sample.
• the processing device 112 may determine a potential hard sample based on a count change curve of the one or more objects presented in the image frames of a sample. For example, the processing device 112 may determine a count change curve of the one or more objects presented in the multiple image frames of a specific sample using the first trained machine learning model. The processing device 112 may determine whether the count change curve of one or more objects presented in the multiple image frames of the specific sample has one or more burrs. If the processing device 112 determines that the count change curve of one or more objects presented in the multiple image frames of the specific sample has one or more burrs, the processing device 112 may determine the specific sample as a potential hard sample.
  • the count change curve of a specific sample may be a straight line, which indicates that the count of the one or more objects presented in each of the image frames of a sample remains unchanged.
  • the count change curve of a specific sample may include one or more burrs, which indicates that the count of one or more objects presented in each of the image frames of the specific sample changes with time.
• the processing device 112 may designate the specific sample with a count change curve including one or more burrs as a potential hard sample.
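The two checks described above for flagging a potential hard sample can be sketched as follows: (a) the per-frame counts are not all identical, or (b) the count change curve contains a burr, read here as a short interior run of frames whose count differs from its neighbors. The burr length limit is an assumed example parameter, not a value given by the disclosure.

```python
from itertools import groupby

def counts_inconsistent(per_frame_counts):
    return len(set(per_frame_counts)) > 1

def has_burr(per_frame_counts, max_burr_len=2):
    """True if an interior run of at most max_burr_len frames differs from its neighbors."""
    runs = [(value, len(list(group))) for value, group in groupby(per_frame_counts)]
    return any(length <= max_burr_len for _, length in runs[1:-1])

counts = [2, 2, 2, 1, 2, 2, 2]                        # a one-frame dip in the detected count
print(counts_inconsistent(counts), has_burr(counts))  # True True -> potential hard sample
```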
  • the processing device 112 may input the multiple image frames of a specific sample into the first trained machine learning model.
  • the first trained machine learning model may determine and/or output the count change curve of one or more objects presented in the multiple image frames of the specific sample.
  • the processing device 112 may input each of the multiple image frames of a specific sample into the first trained machine learning model.
  • the first trained machine learning model may determine and/or output the count of one or more objects presented in each of the multiple image frames of the specific sample.
  • the processing device 112 may determine the count change curve of one or more objects presented in the multiple image frames of the specific sample based on the count of one or more objects presented in each of the multiple image frames of the specific sample.
• the processing device 112 may determine a group of samples from the at least a portion of the multiple samples based on a second trained machine learning model for image classification. Each of the group of samples may also be referred to as a true hard sample.
  • the group of samples determined from the at least a portion of the multiple samples (i.e., the potential hard sample set) determined in 620 may also be referred to as a true hard sample set.
  • an identification result of a true hard sample produced by the first trained machine learning model may be proved false.
  • a count of one or more objects presented in a sample in the group may be unavailable using the first trained machine learning model.
  • a true hard sample may include a video recording a child and/or a disabled person, or recording a passenger holding a child, etc.
  • the processing device 112 may store the group of samples into the storage device 160.
  • the second trained machine learning model for image classification may include a classification model.
  • the second trained machine learning model may be constructed based on a decision tree model, a multiclass support vector machine (SVM) model, a K-nearest neighbors classifier, a Gradient Boosting Machine (GBM) model, a neural network model, or the like, or any combination thereof.
  • the processing device 112 may obtain the second trained machine learning model from the storage device 160, the requester terminal 130, or any other storage device.
  • the second trained machine learning model may be determined by training a machine learning model using a second training set.
  • the second training set may include a plurality of samples. Each of the plurality of samples in the second training set may include a video.
  • each of the plurality of samples in the second training set may correspond to a label.
  • the label corresponding to a sample in the second training set may denote a category to which the sample belongs.
  • the category to which a sample in the second training set belongs may include a true hard sample and a pseudo-hard sample.
  • the label corresponding to each of the plurality of samples in the second training set may denote whether the each of the plurality of samples is a true hard sample or a pseudo-hard sample.
  • the label may be one of a positive sample and a negative sample.
  • the positive sample may denote a true hard sample.
  • the negative sample may denote a pseudo-hard sample.
  • the label may be one of a true hard sample and a pseudo-hard sample.
• the category to which a sample in the second training set belongs may denote a scene that the sample (e.g., a video) records.
• the scene may include a passenger getting on or off a vehicle, the moment of turn-on of a lighting device (e.g., a fill light of the camera) installed in the vehicle, an abnormal pose of a passenger (e.g., lying down) , a passenger’s body partially or entirely blocked, for example, by the driver, a rare passenger (e.g., a child and/or a disabled person) , a passenger holding a child, or the like, or any combination thereof.
  • the label of a sample in the second training set may represent the category that the sample belongs to.
  • the label of a sample may be denoted as a character (e.g., A, B, etc. ) , a digit (e.g., 1, 2, etc. ) , etc., to represent a category that the training sample belongs to.
  • the label of a sample in the second training set may be “1” representing that the sample belongs to a category denoting the sample recording a rare type passenger.
  • the label of another sample in the second training set may be “2” representing that the sample belongs to a category denoting the another sample recording a passenger getting on or off a vehicle.
  • the label corresponding to each of the plurality of samples in the second training set may be determined by an operator (e.g., an engineer) manually according to the each of the plurality of samples.
  • the operator may label a sample with a label “true hard sample” or “positive sample” if the sample presents a child or a disabled person.
  • the operator may label a sample in the second training set with a label “pseudo-hard sample” or “negative sample” if the training sample presents a passenger getting on or off a vehicle, a passenger with an abnormal (e.g., lying down) pose, a part or all of the passenger’s body being blocked, for example, by the driver, etc.
  • the operator may label a sample in the second training set with a label “0” if the sample presents a passenger getting on or off a vehicle, with a label “1” if the sample presents a passenger with an abnormal (e.g., lying down) pose, with a label “2” if the sample presents a part or all of the passenger’s body being blocked, for example, by the driver, with a label “3” if the color or brightness of the sample (e.g., a video) changes, with a label “4” if the sample presents one or more children or disabled persons, etc.
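The manual labeling scheme sketched in the example above can be captured as a simple mapping from scene categories to digit labels; the category names below are paraphrases introduced for illustration, and the mapping itself is an assumption rather than a fixed part of the disclosure.

```python
SCENE_LABELS = {
    "passenger_getting_on_or_off": 0,
    "abnormal_pose":               1,   # e.g., lying down
    "body_partially_blocked":      2,   # e.g., blocked by the driver
    "color_or_brightness_change":  3,   # e.g., fill light turned on or off
    "child_or_disabled_person":    4,   # true hard samples under this scheme
}

def label_sample(scene_name):
    return SCENE_LABELS[scene_name]

print(label_sample("child_or_disabled_person"))   # 4
```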
  • the second trained machine learning model may be configured to determine a category that a specific video belongs to.
  • the category that the specific video belongs to may be associated with labels of the samples in the second training set. For example, if the labels of the samples in the second training set include a true hard sample and a pseudo-hard sample, the category that the specific video belongs to may be a true hard sample or a pseudo-hard sample.
  • the processing device 112 may classify each sample in the potential hard sample set determined in 620 into a first group and a second group using the second trained machine learning model. The processing device 112 may designate samples in the first group or the second group as the group of samples.
• if the samples in the first group are true hard samples, the processing device 112 may designate the samples in the first group as the group of samples. If the samples in the second group are true hard samples, the processing device 112 may designate the samples in the second group as the group of samples. In some embodiments, the processing device 112 may classify each sample in the potential hard sample set determined in 620 into several groups using the second trained machine learning model. Each group of the several groups may include one or more samples acquired under a same scene.
• for example, one group of the several groups may include samples recording passengers getting on or off vehicles, another group may include samples recording lighting devices installed in the vehicles being turned on or off, and other groups may include samples recording passengers with abnormal poses (e.g., lying down) , samples recording a part or all of a passenger’s body being blocked by, for example, the driver, or samples recording children and/or disabled persons, etc.
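The grouping step just described might look like the sketch below, where the second trained model assigns each potential hard sample a scene category and the group(s) corresponding to true hard samples are retained. The `classifier.predict` interface and the category name are assumptions for the sketch.

```python
from collections import defaultdict

def group_potential_hard_samples(potential_hard_samples, classifier,
                                 true_hard_categories=("child_or_disabled_person",)):
    groups = defaultdict(list)
    for sample in potential_hard_samples:
        groups[classifier.predict(sample)].append(sample)   # one group per predicted scene
    true_hard = [s for category in true_hard_categories for s in groups.get(category, [])]
    return dict(groups), true_hard
```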
  • the processing device 112 may obtain a sample set for object detection based on the group of samples.
  • the sample set for object detection may include a video, an image sequence, or the like, or any combination thereof.
  • the processing device 112 may obtain the sample set from the requester terminal 130, the provider terminal 140, the storage device 160, and/or any other storage device (not shown in the online to offline system 100) via the network 120.
  • the processing device 112 may update an original sample set for object detection based on the group of samples to obtain the sample set. For example, the processing device 112 may add the group of samples into the first training set of the first trained machine learning model to obtain the sample set.
  • the processing device 112 may update the sample set from time to time, e.g., periodically (e.g., per day, per week, per month, etc. ) based on true hard samples (e.g., the group of samples) determined according to, for example, operation 610 to 630.
  • the processing device 112 may update the sample set based on a proportion of true hard samples in the sample set.
  • the proportion of true hard samples in the sample set may be equal to a ratio of a count of true hard samples in the sample set and a total count of samples in the sample set. If the proportion of true hard samples in the sample set is less than a threshold, the processing device 112 may update the sample set.
  • the updating of the sample set may be performed online in response to a request of a user. In some embodiments, the updating of the sample set may be performed offline.
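A sketch of the proportion-based update rule described above: when true hard samples fall below a threshold fraction of the sample set, the newly mined true hard samples are added. The 5% threshold and the label convention are assumed example values.

```python
def maybe_update_sample_set(sample_set, new_true_hard_samples, is_true_hard, threshold=0.05):
    proportion = sum(1 for s in sample_set if is_true_hard(s)) / max(len(sample_set), 1)
    if proportion < threshold:
        return sample_set + new_true_hard_samples   # enrich the set with mined hard samples
    return sample_set

samples = ["easy"] * 98 + ["hard"] * 2
updated = maybe_update_sample_set(samples, ["hard_new"], lambda s: s.startswith("hard"))
print(len(updated))   # 101 -> the sample set was updated
```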
  • the processing device 112 may determine a target machine learning model for object detection based on the sample set.
  • the target machine learning model for object detection may be determined by training a machine learning model based on the sample set determined in 640.
  • the machine learning model may include a linear regression model, a support vector machine (SVM) model, a decision tree model, a boosting model, a neural network model, or the like, or any combination thereof.
  • the target machine learning model may be determined by training an object detection model (e.g., the first trained machine learning model) using the sample set.
• the processing device 112 may update parameters of the object detection model (e.g., the first trained machine learning model) based on the sample set to obtain the target machine learning model.
  • the target machine learning model may be obtained by performing a plurality of iterations. For each of the plurality of iterations, a specific sample in the sample set may first be input into a machine learning model or object detection model (e.g., the first trained machine learning model) .
• the machine learning model or the object detection model (e.g., the first trained machine learning model) may extract one or more object features (e.g., a count of objects) included in the specific sample.
• the machine learning model or the object detection model (e.g., the first trained machine learning model) may determine a predicted output for the specific sample based on the one or more extracted object features. The predicted output of the specific sample may then be compared with an actual or true label (i.e., a desired output) corresponding to the specific sample based on a cost function.
• the cost function of the machine learning model or the object detection model (e.g., the first trained machine learning model) may be configured to assess a difference between an estimated value (e.g., the predicted output) of the machine learning model and a desired value (e.g., the actual label or desired output) .
• if the value of the cost function exceeds a threshold in a current iteration, parameters of the machine learning model or the object detection model may be adjusted and updated to cause the value of the cost function (i.e., the difference between the predicted output and the actual label) to be smaller than the threshold.
  • another sample may be input into the machine learning model or the object detection model (e.g., the first trained machine learning model) to train the machine learning model or the first trained machine learning model as described above.
• the plurality of iterations may be performed to update the parameters of the machine learning model or the object detection model (e.g., the first trained machine learning model) until a termination condition is satisfied.
• the termination condition may provide an indication of whether the machine learning model or the object detection model (e.g., the first trained machine learning model) is sufficiently trained. For example, the termination condition may be satisfied if the value of the cost function associated with the machine learning model or the object detection model (e.g., the first trained machine learning model) is minimal or smaller than a threshold (e.g., a constant) . As another example, the termination condition may be satisfied if the value of the cost function converges. The convergence may be deemed to have occurred if the variation of the values of the cost function in two or more consecutive iterations is smaller than a threshold (e.g., a constant) .
• the termination condition may be satisfied when a specified number of iterations are performed in the training process.
• if the termination condition is satisfied, the trained machine learning model (i.e., the target machine learning model) may be determined based on the updated parameters.
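The iterative training procedure described above (predict, compare with the label through a cost function, update parameters, stop on a termination condition) is sketched below with a toy one-parameter count estimator; the learning rate, thresholds, and model form are assumptions for illustration, not the disclosed model.

```python
def train(samples, labels, lr=0.01, cost_threshold=1e-4, max_iters=1000):
    w = 0.0                                    # single parameter of the toy estimator
    prev_cost = float("inf")
    for _ in range(max_iters):
        cost = 0.0
        for x, y in zip(samples, labels):
            pred = w * x                       # predicted count for this sample
            cost += (pred - y) ** 2            # squared-error cost
            w -= lr * 2 * (pred - y) * x       # parameter update toward a smaller cost
        cost /= len(samples)
        if cost < cost_threshold or abs(prev_cost - cost) < 1e-9:
            break                              # termination condition satisfied
        prev_cost = cost
    return w

print(round(train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 2))   # approaches 2.0
```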
• although the present disclosure takes a person as the object in the images as an example, it should be noted that the processing device 112 may process other objects associated with other object detection systems (e.g., an animal, a vehicle, a roadblock, etc.) according to the process and/or technique disclosed elsewhere in the present disclosure. Taking the vehicle as an example, the processing device 112 may obtain multiple samples including images associated with vehicles. The processing device 112 may determine at least a portion of the multiple samples based on the first trained machine learning model for object detection and select a group of samples from the at least a portion of the multiple samples based on a second trained machine learning model.
  • one or more operations may be omitted and/or one or more additional operations may be added.
• the operation 620 and the operation 630 may be combined into a single operation to obtain the group of samples.
• the image frames of the multiple samples may be preprocessed by one or more preprocessing operations (e.g., an image enhancement operation) .
  • FIG. 7 is a schematic flowchart illustrating an exemplary process for determining hard samples for object detection according to some embodiments of the present disclosure.
  • the process 700 may be implemented as a set of instructions (e.g., an application) stored in the storage device 160, ROM 230 or RAM 240, or storage 390.
  • the processing device 112, the processor 220 and/or the CPU 340 may execute the set of instructions, and when executing the instructions, the processing device 112, the processor 220 and/or the CPU 340 may be configured to perform the process 700.
• the operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 700 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order of the operations of the process 700 as illustrated in FIG. 7 and described below is not intended to be limiting.
  • the processing device 112 may obtain multiple samples.
  • Each of the multiple samples may include image frames.
  • the multiple samples may include a video, an image sequence, or the like, or any combination thereof. More descriptions for the multiple samples may be found elsewhere in the present disclosure (e.g., FIG. 6 and the descriptions thereof) .
  • the processing device 112 may determine a count of one or more objects presented in each of the image frames of each of the multiple samples using a first trained machine learning model.
  • the first trained machine learning model may be an object detection model as described in connection with operation 620 illustrated in FIG. 6.
  • the count of one or more objects presented in each of the image frames corresponding to each of the multiple samples may be the same or different.
• for example, the count of one or more objects in each of the image frames of a sample may always be 2.
• as another example, the count of one or more objects in one image frame of a sample may be 1, and the count of one or more objects in another image frame of the sample may be 2.
  • the processing device 112 may determine a count change curve of the one or more objects presented in the image frames. In some embodiments, the processing device 112 (e.g., the determination module 420) may determine the count change curve based on the count of the one or more objects presented in each of the image frames of a sample. In some embodiments, the count change curve may be determined using the first trained machine learning model based on the count of one or more objects presented in the image frames of a sample.
  • the processing device 112 may designate samples whose count change curves have one or more burrs among the multiple samples as potential hard samples.
  • the processing device 112 may store the potential hard samples into a storage device (e.g., the storage device 160, the storage module 440, or any other storage device) for storage.
  • a potential hard sample may be a true hard sample or a pseudo-hard sample.
• the true hard sample may be a video whose identification or detection result is proved false when the true hard sample is analyzed using the first trained machine learning model.
• the pseudo-hard sample may be a video whose identification or detection result is proved correct when the pseudo-hard sample is analyzed using the first trained machine learning model.
  • a true hard sample may be a video recording a rare type passenger (e.g., a child and/or a disabled person) , or recording a passenger holding a child, etc.
• a pseudo-hard sample may be a video recording a passenger getting on or off a vehicle, or recording the moment of turn-on or turn-off of a lighting device (e.g., a fill light of the camera) installed in the vehicle, or recording the abnormal pose of a passenger (e.g., lying down, bending down) , or recording the passenger’s partially or entirely blocked body, for example, by the driver, or the like, or any combination thereof.
  • the processing device 112 may classify each of the potential hard samples into a first group and a second group using a second trained machine learning model.
  • the second trained machine learning model may be an image classification model as described in connection with operation 630 illustrated in FIG. 6.
  • one of the first group and the second group may include one or more true hard samples.
  • Another one of the first group and the second group may include one or more pseudo-hard samples.
  • the processing device 112 may determine a sample set (or training set) used for object detection based on the first group or the second group. If the first group includes true hard samples, the processing device 112 may determine the sample set based on samples in the first group. If the second group includes true hard samples, the processing device 112 may determine the sample set based on samples in the second group. In some embodiments, the processing device 112 may update an original sample set for object detection based on the true hard samples in the first group or the second group to obtain the sample set. For example, the processing device 112 may add the true hard samples in the first group or the second group into the training set of the first trained machine learning model to obtain the sample set. In some embodiments, the processing device 112 may store the sample set into a storage device 160, the storage module 440, or any other storage device.
  • the processing device 112 may also preprocess the image frames of the multiple samples.
• in some embodiments, one or more other optional operations (e.g., a storing operation) may be added in the process. For example, the processing device 112 may store information and/or data (e.g., the multiple samples, the target machine learning model, the count of the one or more objects presented in each of the image frames, etc.) associated with the online to offline system 100 in a storage device (e.g., the storage device 160) disclosed elsewhere in the present disclosure.
  • FIG. 8 is a schematic flowchart illustrating an exemplary process for determining an object detection model according to some embodiments of the present disclosure.
  • the process 800 may be implemented as a set of instructions (e.g., an application) stored in the storage device 160, ROM 230 or RAM 240, or storage 390.
  • the processing device 112, the processor 220 and/or the CPU 340 may execute the set of instructions, and when executing the instructions, the processing device 112, the processor 220 and/or the CPU 340 may be configured to perform the process 800.
• the operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 800 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order of the operations of the process 800 as illustrated in FIG. 8 and described below is not intended to be limiting.
• in some embodiments, the processing device 112 may obtain a plurality of groups of training data. Each group of the plurality of groups of training data may include a video including one or more image frames and a count of one or more objects presented in each of the one or more image frames of the video and/or in the video.
  • the count of one or more objects presented in each of the one or more image frames and/or in the video may be also referred to as a label corresponding to the video.
• the one or more objects may include a person, a vehicle, an animal, a physical subject, or the like, or a combination thereof.
• the plurality of groups of training data may be obtained from the same or different cameras in the same or different scenes, such as a meeting scene, a working scene, a game scene, a party scene, a travel scene, or the like, or any combination thereof.
  • a group of training data may be obtained from a camera installed in a vehicle for recording a scene in the vehicle
  • the count of the one or more objects presented in each of the one or more image frames of a video may be the same or different.
• for example, the count of the one or more objects in each of the one or more image frames of a video may always be 2.
• as another example, the count of one or more objects in one image frame of a video may be 1, and the count of one or more objects in another image frame of the video may be 2.
  • the label of the video of each group of the plurality of groups of training data may be determined by an operator (e.g., an engineer) manually.
• the processing device 112 may train a machine learning model using the plurality of groups of training data to obtain a first trained machine learning model.
  • the first machine learning model may include a linear regression model, a support vector machine (SVM) model, a decision tree model, a boosting model, a neural network model, or the like, or any combination thereof.
  • a training process of the first trained machine learning model may be similar to or same as the training process of the target machine learning model as described in FIG. 6 of the present disclosure.
  • the first trained machine learning model may be obtained by performing a plurality of iterations. For each of the plurality of iterations, a specific group of training data may first be input into the first machine learning model.
• the first machine learning model may determine a predicted output (e.g., a predicted count of one or more objects) corresponding to the specific group of training data.
  • the predicted output may then be compared with a desired output (i.e. a label) corresponding to the specific group of training data based on a cost function.
• if the value of the cost function exceeds a threshold, parameters of the first machine learning model may be adjusted and updated to cause the value of the cost function (i.e., the difference between the predicted output and the desired output) to be smaller than the threshold. Accordingly, in a next iteration, another group of training data may be input into the first machine learning model to train the first machine learning model as described above. Then the plurality of iterations may be performed to update the parameters of the first machine learning model until a termination condition is satisfied. The first trained machine learning model may be obtained based on the updated parameters.
  • the first trained machine learning model may be configured to determine a count of one or more objects presented in a specific video. In some embodiments, the first trained machine learning model may be configured to determine a count change curve of one or more objects presented in each of image frames of the specific video. More descriptions for the first trained machine learning model may be found elsewhere in the present disclosure (e.g., FIGs. 5-6 and the descriptions thereof) .
  • the first trained machine learning model may be updated from time to time, e.g., periodically or not, based on a sample set that is at least partially different from the original sample set from which the original first trained machine learning model is determined. For instance, the first trained machine learning model may be updated based on a sample set including new samples that are not in the original sample set, samples whose count of one or more objects is assessed using the machine learning model, or the like, or a combination thereof.
  • one or more other optional operations may be added in the process 800.
• the image frames of the multiple samples may be preprocessed by one or more preprocessing operations (e.g., an image enhancement operation) .
  • FIG. 9 is a schematic flowchart illustrating an exemplary process for determining an image classification model according to some embodiments of the present disclosure.
  • the process 900 may be implemented as a set of instructions (e.g., an application) stored in the storage device 160, ROM 230 or RAM 240, or storage 390.
  • the processing device 112, the processor 220 and/or the CPU 340 may execute the set of instructions, and when executing the instructions, the processing device 112, the processor 220 and/or the CPU 340 may be configured to perform the process 900.
• the operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 900 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order of the operations of the process 900 as illustrated in FIG. 9 and described below is not intended to be limiting.
  • the processing device 112 may obtain a plurality of samples.
  • each of the plurality of samples may include a video including one or more image frames.
  • a count of one or more objects presented in each of the image frames of a sample may change with time.
  • the plurality of samples may be obtained from the same or different cameras
  • the plurality of samples may be obtained from a camera installed in a vehicle for recording a scene in the vehicle.
• the processing device 112 (e.g., the obtaining module 410) may obtain the plurality of samples from the storage device 160, a camera, or any other storage device.
  • the processing device 112 may determine potential hard samples from the plurality of samples based on a first trained machine learning model for object detection.
  • the first trained machine learning model may be configured to identify or detect one or more objects (e.g., persons) from a sample (e.g., video) .
  • the first trained machine learning model may be configured to determine a count of one or more objects presented in the sample.
  • the first trained machine learning model may be configured to determine a count change curve of one or more objects presented in each of the image frames of the sample. More descriptions for the first trained machine learning model may be found elsewhere in the present disclosure (e.g., FIGs. 5-6 and the descriptions thereof) .
  • a potential hard sample may refer to a sample in which the count of one or more objects presented changes.
  • the potential hard sample may be a true hard sample or a pseudo-hard sample.
• the true hard sample may be a sample (e.g., a video) whose identification result is proved false when the sample is analyzed using the first trained machine learning model.
• the pseudo-hard sample may be a sample (e.g., a video) whose identification result is proved correct when the sample is analyzed using the first trained machine learning model. More descriptions for the potential hard samples may be found elsewhere in the present disclosure (e.g., FIG. 6 and the descriptions thereof) .
• the processing device 112 may determine a potential hard sample based on a count of the one or more objects presented in each of the image frames of a sample. For example, the processing device 112 may determine the count of one or more objects presented in each of the image frames of the sample using the first trained machine learning model. The processing device 112 may determine whether the count of one or more objects presented in each of the image frames of the sample is consistent or the same. If the processing device 112 determines that the count of one or more objects presented in each of the image frames of the sample is inconsistent or different, the processing device 112 may determine the sample as a potential hard sample.
• the processing device 112 may determine a potential hard sample based on a count change curve of the one or more objects presented in the image frames of the sample. For example, the processing device 112 may determine a count change curve of a sample based on the count of the one or more objects presented in the image frames of the sample using the first trained machine learning model. The processing device 112 may determine whether the count change curve of the sample has one or more burrs. If the processing device 112 determines that the count change curve of the sample has one or more burrs, the processing device 112 may determine the sample as a potential hard sample. In some embodiments, the processing device 112 may input the multiple image frames of a sample into the first trained machine learning model.
  • the first trained machine learning model may determine and/or output the count change curve of one or more objects presented in the multiple image frames of the sample.
  • the processing device 112 may input each of the multiple image frames of a sample into the first trained machine learning model.
  • the first trained machine learning model may determine and/or output the count of one or more objects presented in each of the multiple image frames of the sample.
  • the processing device 112 may determine the count change curve of one or more objects presented in the multiple image frames of the sample based on the count of one or more objects presented in each of the multiple image frames of the sample.
• the processing device 112 may determine a plurality of potential hard samples from the plurality of samples. Then the processing device 112 may obtain a label for each of the potential hard samples according to the scene that each of the potential hard samples records.
  • the processing device 112 may determine a plurality of groups of training data based on the potential hard samples.
  • Each group of the plurality of groups of training data may include one or more potential hard samples.
  • the one or more potential hard samples in one of the plurality of groups of training data may record a same scene.
  • the potential hard samples in one of the plurality of groups of training data may record a rare passenger (e.g., a child and/or a disabled person) , or record a passenger holding a child, or record a passenger getting on or off a vehicle, or record the moment of turn-on or turn-off of a lighting device (e.g., a fill light of the camera) installed in the vehicle, or record an abnormal pose of the passenger (e.g., lying down, bending down) , or record the passenger’s partially or entirely blocked body , for example, by the driver, or the like, or any combination thereof.
  • the potential hard samples in each group of the plurality of groups of training data may include a same label.
• the label of each group of the plurality of groups of training data may be denoted as a character (e.g., A, B, etc.) , a digit (e.g., 1, 2, etc.) , etc., to represent a category to which the each group of the plurality of groups of training data belongs.
  • the label corresponding to each group of the plurality of groups of training data may be marked by an operator (e.g., an engineer) manually. For example, an operator may label a potential hard sample with a label “true hard sample” or “positive sample” if the potential hard sample presents a child or a disabled person.
  • the operator may label a potential hard sample with a label “pseudo-hard sample” or “negative sample” if the potential hard sample presents a passenger getting on or off a vehicle, the passenger with an abnormal (e.g., lying down) pose, the passenger’s partially or entirely blocked body, for example, by the driver, etc.
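Assembling the labeled groups of training data described above might look like the sketch below, which pairs each potential hard sample with its manually assigned label and groups samples carrying the same label together. The label strings mirror the true/pseudo-hard convention in the text; the data layout is an assumption for illustration.

```python
from collections import defaultdict

def build_training_groups(potential_hard_samples, manual_labels):
    """manual_labels: label per sample, e.g., 'true hard sample' or 'pseudo-hard sample'."""
    groups = defaultdict(list)
    for sample, label in zip(potential_hard_samples, manual_labels):
        groups[label].append(sample)   # one group of training data per label/scene
    return dict(groups)

groups = build_training_groups(
    ["video_a", "video_b", "video_c"],
    ["true hard sample", "pseudo-hard sample", "true hard sample"],
)
print({label: len(samples) for label, samples in groups.items()})
# {'true hard sample': 2, 'pseudo-hard sample': 1}
```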
  • the processing device 112 may train a second machine learning model using the plurality of groups of training data.
  • the second trained machine learning model may be a classification model.
  • the second machine learning model may include a decision tree model, a multiclass support vector machine (SVM) model, a K-nearest neighbors classifier, a Gradient Boosting Machine (GBM) model, a neural network model, or the like, or any combination thereof.
  • a training process of the second trained machine learning model may also be similar to or same as the training process of the first trained machine learning model or the target machine learning model as described in FIG. 6 of the present disclosure.
  • the second trained machine learning model may be obtained by performing a plurality of iterations. For each of the plurality of iterations, a group of training data (i.e., a potential hard sample) may first be input into the second machine learning model. The second machine learning model may determine a predicted category of the group of training data. The predicted category may then be compared with an actual category (i.e. a label) corresponding to the group of training data based on a cost function.
• if the value of the cost function exceeds a threshold, parameters of the second machine learning model may be adjusted and updated to cause the value of the cost function (i.e., the difference between the predicted category and the actual category) to be smaller than the threshold. Accordingly, in a next iteration, another group of training data may be input into the second machine learning model to train the second machine learning model as described above. Then the plurality of iterations may be performed to update the parameters of the second machine learning model until a termination condition is satisfied. The second trained machine learning model may be obtained based on the updated parameters.
  • one or more operations may be omitted and/or one or more additional operations may be added.
  • the operation 910 and the operation 920 may be combined into a single operation to obtain potential hard samples.
  • the plurality of groups of training data may be preprocessed by a preprocessing operation (e.g., an image enhancement operation), one example of which is sketched below.
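  • As one possible example of such a preprocessing operation, the sketch below applies contrast-limited adaptive histogram equalization (CLAHE) with OpenCV to the lightness channel of a frame, which can help with dim in-vehicle images. The choice of enhancement, its parameters, and the frame path are illustrative assumptions, not part of the disclosure.

```python
# Illustrative preprocessing sketch: contrast enhancement of training frames.
# CLAHE is only one possible image enhancement operation; the parameters and
# path below are assumptions for demonstration.

import cv2

def enhance_frame(path: str):
    """Load a frame and boost its contrast, e.g., for dim in-vehicle images."""
    image = cv2.imread(path)                        # BGR image
    lab = cv2.cvtColor(image, cv2.COLOR_BGR2LAB)    # enhance lightness channel only
    light, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge((clahe.apply(light), a, b))
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)

# Example usage on a hypothetical frame path:
# enhanced = enhance_frame("clip_001/frame_01.jpg")
```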
  • aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or contexts including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of software and hardware implementations that may all generally be referred to herein as a “unit,” “module,” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer-readable media having computer readable program code embodied thereon.
  • a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electromagnetic, optical, or the like, or any suitable combination thereof.
  • a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python or the like, conventional procedural programming languages, such as the "C" programming language, Visual Basic, Fortran, Perl, COBOL, PHP, ABAP, dynamic programming languages such as Python, Ruby, and Groovy, or other programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider), in a cloud computing environment, or offered as a service such as Software as a Service (SaaS).
  • the numbers expressing quantities, properties, and so forth, used to describe and claim certain embodiments of the application are to be understood as being modified in some instances by the term “about,” “approximate,” or “substantially.”
  • “about,” “approximate,” or “substantially” may indicate a ±20% variation of the value it describes, unless otherwise stated.
  • the numerical parameters set forth in the written description and attached claims are approximations that may vary depending upon the desired properties sought to be obtained by a particular embodiment.
  • the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments of the application are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a system and method for object detection. The method may include obtaining multiple samples; determining, based on a first trained machine learning model for object detection, at least a portion of the multiple samples, a count of one or more objects presented in each of the image frames corresponding to each of the at least a portion of the multiple samples changing with time; determining, based on a second trained machine learning model for image classification, a group of samples from among the at least a portion of the multiple samples, an identification result of a sample in the group being false using the first trained machine learning model; obtaining, based on the group of samples, a sample set for object detection; and determining, based on the sample set, a target machine learning model for object detection.
PCT/CN2019/105931 2019-09-16 2019-09-16 Systèmes et procédés de détection d'objet WO2021051230A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/105931 WO2021051230A1 (fr) 2019-09-16 2019-09-16 Systèmes et procédés de détection d'objet

Publications (1)

Publication Number Publication Date
WO2021051230A1 true WO2021051230A1 (fr) 2021-03-25

Family

ID=74882939

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/105931 WO2021051230A1 (fr) 2019-09-16 2019-09-16 Systèmes et procédés de détection d'objet

Country Status (1)

Country Link
WO (1) WO2021051230A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140133742A1 (en) * 2012-11-09 2014-05-15 Seiko Epson Corporation Detector Evolution With Multi-Order Contextual Co-Occurrence
CN107092878A (zh) * 2017-04-13 2017-08-25 中国地质大学(武汉) 一种基于混合分类器的可自主学习多目标检测方法
US20190130221A1 (en) * 2017-11-02 2019-05-02 Royal Bank Of Canada Method and device for generative adversarial network training
US20190279028A1 (en) * 2017-12-12 2019-09-12 TuSimple Method and Apparatus for Object Re-identification
CN108038515A (zh) * 2017-12-27 2018-05-15 中国地质大学(武汉) 无监督多目标检测跟踪方法及其存储装置与摄像装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11948346B1 (en) 2023-06-22 2024-04-02 The Adt Security Corporation Machine learning model inference using user-created machine learning models while maintaining user privacy
US12073606B1 (en) 2023-06-22 2024-08-27 The Adt Security Corporation Machine learning model inference using user-created machine learning models while maintaining user privacy

Similar Documents

Publication Publication Date Title
WO2020001261A1 (fr) Systèmes et procédés destinés à estimer un temps d'arrivée d'un véhicule
US20220245792A1 (en) Systems and methods for image quality detection
US11790551B2 (en) Method and system for object centric stereo in autonomous driving vehicles
US20230140540A1 (en) Method and system for distributed learning and adaptation in autonomous driving vehicles
US20210294326A1 (en) Method and system for closed loop perception in autonomous driving vehicles
US12032670B2 (en) Visual login
US20200265239A1 (en) Method and apparatus for processing video stream
US12008794B2 (en) Systems and methods for intelligent video surveillance
US11709282B2 (en) Asset tracking systems
US20210042531A1 (en) Systems and methods for monitoring traffic sign violation
US20200182618A1 (en) Method and system for heading determination
WO2020029231A1 (fr) Systèmes et procédés d'identification de demandeurs ivres dans une plateforme de service hors-ligne à en ligne
CN110945484B (zh) 数据存储中异常检测的系统和方法
WO2021056250A1 (fr) Systèmes et procédés de recommandation et d'affichage de point d'intérêt
US11657515B2 (en) Device, method and storage medium
WO2021051230A1 (fr) Systèmes et procédés de détection d'objet
WO2020107509A1 (fr) Systèmes et procédés pour traiter des données à partir d'une plateforme de service à la demande en ligne
CN113673527A (zh) 一种车牌识别方法及系统
WO2022264055A1 (fr) Système, appareil et procédé de surveillance
CN113099385A (zh) 停车监测方法、系统和设备
AU2018102206A4 (en) Systems and methods for identifying drunk requesters in an Online to Offline service platform
CN112106067A (zh) 一种用于用户分析的系统和方法
US20220368862A1 (en) Apparatus, monitoring system, method, and computer-readable medium
CN113378972A (zh) 一种复杂场景下的车牌识别方法及系统
Onim et al. Vehicle Detection and Localization in Extremely Low-Light Condition with Generative Adversarial Network and Yolov4

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19946040

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19946040

Country of ref document: EP

Kind code of ref document: A1