WO2021038752A1 - Image processing device, system, image processing method, and image processing program - Google Patents

Image processing device, system, image processing method, and image processing program

Info

Publication number
WO2021038752A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
processing
information
unit
image processing
Prior art date
Application number
PCT/JP2019/033709
Other languages
English (en)
Japanese (ja)
Inventor
小泉 誠
Original Assignee
株式会社ソニー・インタラクティブエンタテインメント
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社ソニー・インタラクティブエンタテインメント filed Critical 株式会社ソニー・インタラクティブエンタテインメント
Priority to JP2021541869A priority Critical patent/JP7304955B2/ja
Priority to PCT/JP2019/033709 priority patent/WO2021038752A1/fr
Priority to US17/635,304 priority patent/US20220308157A1/en
Publication of WO2021038752A1 publication Critical patent/WO2021038752A1/fr

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/20 Position of source determined by a plurality of spaced direction-finders
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/02 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using radio waves
    • G01S5/0294 Trajectory determination or predictive filtering, e.g. target tracking or Kalman filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/20 Arrangements for obtaining desired frequency or directional characteristics
    • H04R1/32 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R1/40 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R1/406 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/22 Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2499/00 Aspects covered by H04R or H04S not otherwise provided for in their subgroups
    • H04R2499/10 General applications
    • H04R2499/15 Transducers incorporated in visual displaying devices, e.g. televisions, computer displays, laptops
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field

Definitions

  • The present invention relates to an image processing device, a system, an image processing method, and an image processing program.
  • There is known a moving object detection technology that performs image analysis on an image generated by an imaging device in order to detect and track an object. Such motion detection is useful for focus adjustment during imaging and for applications such as surveillance cameras. A technique for detecting a moving object in this way is described in, for example, Patent Document 1.
  • The invention of Patent Document 1 has a mode for acquiring RGB images and a mode for acquiring infrared images, and realizes efficient motion detection by determining whether or not the background model needs to be regenerated when performing motion detection using the background subtraction method.
  • An object of the present invention is to provide an image processing device, a system, an image processing method, and an image processing program that can perform processing on an appropriate object according to the purpose of the processing by applying audio information.
  • According to one aspect of the present invention, there is provided an image processing apparatus including: a first receiving unit that receives image information acquired by an image sensor; a second receiving unit that receives audio information, acquired by one or more directional microphones, in at least a partial region of the field of view of the image sensor; an association processing unit that associates the audio information with a pixel address of the image information indicating a position in the field of view; an object detection unit that detects, from the image information, at least a part of an object existing in the field of view; and a processing execution unit that performs predetermined processing on the object based on the result of association by the association processing unit.
  • According to another aspect of the present invention, there is provided a system including: an image sensor that acquires image information; one or more directional microphones that acquire audio information in at least a partial region of the field of view of the image sensor; and a terminal device having a first receiving unit that receives the image information, a second receiving unit that receives the audio information, an association processing unit that associates the audio information with a pixel address of the image information indicating a position in the field of view, an object detection unit that detects, from the image information, at least a part of an object existing in the field of view, and a processing execution unit that performs predetermined processing on the object based on the result of association by the association processing unit.
  • According to still another aspect of the present invention, there is provided an image processing method including: a step of receiving image information acquired by an image sensor; a step of receiving audio information, acquired by one or more directional microphones, in at least a partial region of the field of view of the image sensor; a step of associating the audio information with a pixel address of the image information indicating a position in the field of view; a step of detecting, from the image information, at least a part of an object existing in the field of view; and a step of performing predetermined processing on the object based on the result of the association.
  • According to yet another aspect of the present invention, there is provided an image processing program that causes a computer to realize: a function of receiving image information acquired by an image sensor; a function of receiving audio information, acquired by one or more directional microphones, in at least a partial region of the field of view of the image sensor; a function of associating the audio information with a pixel address of the image information indicating a position in the field of view; a function of detecting, from the image information, at least a part of an object existing in the field of view; and a function of performing predetermined processing on the object based on the result of the association.
  • FIG. 1 is a block diagram showing a schematic configuration of an image processing system 10 according to a first embodiment of the present invention.
  • The image processing system 10 includes a vision sensor 101, a microphone 102, and an information processing device 200.
  • The vision sensor 101 includes a sensor array made up of event-driven sensors (EDS: Event Driven Sensor), which generate an event signal when a change in light intensity is detected, and a processing circuit connected to the sensors.
  • Each EDS includes a light receiving element and generates an event signal when it detects a change in the intensity of incident light, more specifically a change in brightness. Since an EDS that does not detect a change in brightness generates no event signal, event signals are generated in the vision sensor 101 asynchronously in time, for the pixel addresses where events have occurred.
  • The event signal includes sensor identification information (for example, a pixel address), the polarity of the luminance change (increase or decrease), and a time stamp.
  • The event signal generated by the vision sensor 101 is output to the information processing device 200.
  • The microphone 102 converts the sound generated in at least a partial region of the field of view of the vision sensor 101 into an audio signal.
  • The microphone 102 includes, for example, a plurality of directional microphones constituting a microphone array, and, when a sound at or above a predetermined signal level is detected, generates an audio signal associated with position information indicating where the sound was generated in at least a partial region of the field of view of the vision sensor 101.
  • The audio signal generated by the microphone 102 includes position information in the field of view of the vision sensor 101 (for example, XY coordinates), a signal level (volume), and a time stamp.
  • The audio signal generated by the microphone 102 is output to the information processing device 200.
  • The time stamp of the audio signal is common to, or can be associated with, the time stamp of the event signal.
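  • The publication lists the fields of the two signal types but does not fix their data format. Purely as an illustrative sketch (not part of the disclosure), the event signal and the audio signal could be represented as records like the following; all field names are assumptions, and the later sketches in this description build on these hypothetical structures.

```python
from dataclasses import dataclass

@dataclass
class EventSignal:
    """One event from the vision sensor 101 (EDS): pixel address, polarity, time stamp."""
    x: int              # pixel address (column)
    y: int              # pixel address (row)
    polarity: int       # +1 = brightness increase, -1 = brightness decrease
    timestamp_us: int   # time stamp, e.g. in microseconds

@dataclass
class AudioSignal:
    """One detection from the directional microphone array 102."""
    pos_x: float        # position of the sound source in the field of view (XY coordinates)
    pos_y: float
    level_db: float     # signal level (volume)
    timestamp_us: int   # time stamp, shared with or convertible to the event-signal clock
```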
  • The information processing device 200 is implemented by, for example, a computer having a communication interface, a processor, and a memory, and includes the functions of an event signal receiving unit 201, an object detection unit 202, an audio signal receiving unit 203, an alignment processing unit 204, an association processing unit 205, an object classification unit 206, a first image processing unit 207, and a second image processing unit 208, which are realized by the processor operating according to a program stored in the memory or received through the communication interface. The function of each unit is further described below.
  • The event signal receiving unit 201 receives the event signal generated by the vision sensor 101.
  • When the position of an object changes in the field of view of the vision sensor 101, a brightness change occurs, and the event signal generated by the EDS at the pixel address where that brightness change occurred is received by the event signal receiving unit 201.
  • A position change of an object in the field of view is caused not only by the movement of a moving object within the field of view of the vision sensor 101, but also when an object that is actually stationary apparently moves because of the movement of the device on which the vision sensor 101 is mounted; the two cases are indistinguishable in the event signal generated by the EDS.
  • The object detection unit 202 detects objects based on the event signals received by the event signal receiving unit 201. For example, the object detection unit 202 detects an object existing in a continuous pixel region for which the received event signals indicate that events of the same polarity have occurred, and supplies information indicating the detection result to the association processing unit 205. As described above, since the event signal does not distinguish between an object that is actually moving and an object that is only apparently moving due to the movement of the device on which the vision sensor 101 is mounted, the objects detected by the object detection unit 202 include both objects that are actually moving in the field of view of the vision sensor 101 and objects that are actually stationary but apparently moving due to the movement of the device.
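  • The publication does not specify how a continuous pixel region of same-polarity events is found; one minimal, non-authoritative way to sketch it is a 4-connected flood fill over the events of a single time slice, as below (the minimum region size is an assumption).

```python
from collections import deque

def detect_objects(events, min_pixels=20):
    """Group same-polarity events of one time slice into 4-connected pixel regions
    (each surviving region is treated as one detected object)."""
    grid = {(e.x, e.y): e.polarity for e in events}
    seen, regions = set(), []
    for start, pol in grid.items():
        if start in seen:
            continue
        seen.add(start)
        region, queue = [], deque([start])
        while queue:
            x, y = queue.popleft()
            region.append((x, y))
            for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nxt in grid and nxt not in seen and grid[nxt] == pol:
                    seen.add(nxt)
                    queue.append(nxt)
        if len(region) >= min_pixels:
            regions.append(region)   # list of pixel addresses belonging to one object
    return regions
```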
  • The audio signal receiving unit 203 receives the audio signal generated by the microphone 102.
  • The audio signal received by the audio signal receiving unit 203 is associated with position information indicating the position where sound was generated in at least a partial region of the field of view of the vision sensor 101.
  • An object that is actually moving in the field of view of the vision sensor 101 produces a sound generated by the object itself (for example, the sound of a motor or an engine, or the sound of parts colliding with each other) or a sound accompanying the movement of the object (for example, friction noise or wind noise), and the audio signals representing these sounds are received by the audio signal receiving unit 203 together with the position information.
  • Therefore, while object detection based on the event signal from the vision sensor 101 does not distinguish between an object that is actually moving and an object that is actually stationary but apparently moving, the audio signal from the microphone 102 is likely to be obtained only for the object that is actually moving.
  • The alignment processing unit 204 performs processing for aligning the coordinate system of the audio signal received by the audio signal receiving unit 203 with the coordinate system of the event signal received by the event signal receiving unit 201.
  • Specifically, the position information (pixel addresses) of the event signal generated by the vision sensor 101 and the position information of the audio signal generated by the microphone 102 are calibrated in advance, and the alignment processing unit 204 converts the coordinate system of the audio signal received by the audio signal receiving unit 203 into the coordinate system of the event signal received by the event signal receiving unit 201 by performing a geometric calculation based on the correspondence between the two pieces of position information.
  • The vision sensor 101 and the microphone 102 may be arranged coaxially or close to each other, in which case the above-mentioned calibration can be performed easily and accurately.
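  • The geometric calculation used for the alignment is not detailed in the publication; assuming, only for illustration, that the pre-calibrated relationship between microphone coordinates and pixel addresses can be summarized as a 3x3 planar homography H, the conversion could look like the sketch below. When the vision sensor 101 and the microphone 102 are coaxial or close together, H reduces to something close to a simple scale-and-offset mapping.

```python
import numpy as np

def align_audio_to_pixels(audio, H):
    """Convert an audio position (microphone XY coordinates) into a pixel address
    using a homography H obtained from the prior calibration (assumed form)."""
    p = H @ np.array([audio.pos_x, audio.pos_y, 1.0])
    u, v = p[0] / p[2], p[1] / p[2]
    return int(round(u)), int(round(v))
```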
  • The association processing unit 205 uses the processing result of the alignment processing unit 204 to perform a process of associating the audio signal with the pixel addresses corresponding to the region, in the image, of the object detected by the object detection unit 202.
  • That is, since the alignment processing unit 204 converts the coordinate system based on the result of calibrating the position information of the audio signal and the pixel addresses, the association processing unit 205 also uses that calibration result to associate the audio information with the pixel addresses. Specifically, for example, the association processing unit 205 associates, with the pixel addresses of the region of the object in the image, information based on the audio signal indicating the sound generated during the period in which the event signals from which the object was detected were generated (for example, between the minimum and maximum of the time stamps of those event signals).
  • The information associated with the pixel addresses of the object may include, for example, only the presence or absence of audio detection, or may further include the signal level of the audio signal.
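  • As a sketch of this association step, reusing the hypothetical structures and the alignment helper above (the bounding-box margin and the returned dictionary keys are assumptions):

```python
def associate(obj_region, obj_events, audio_signals, H, pad=10):
    """Attach audio information (presence and peak level) to one object's pixel region,
    considering only sounds that fall inside the object's event time window."""
    t_min = min(e.timestamp_us for e in obj_events)
    t_max = max(e.timestamp_us for e in obj_events)
    xs = [x for x, _ in obj_region]
    ys = [y for _, y in obj_region]
    peak = None
    for a in audio_signals:
        if not (t_min <= a.timestamp_us <= t_max):
            continue
        u, v = align_audio_to_pixels(a, H)   # from the alignment sketch above
        if min(xs) - pad <= u <= max(xs) + pad and min(ys) - pad <= v <= max(ys) + pad:
            if peak is None or a.level_db > peak:
                peak = a.level_db
    return {"audio_detected": peak is not None, "level_db": peak}
```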
  • The object classification unit 206 classifies the objects detected by the object detection unit 202 based on the result of the association by the association processing unit 205.
  • For example, the object classification unit 206 classifies an object associated with information indicating that audio was detected, or an object for which the signal level of the audio signal indicated by the associated information is equal to or higher than a threshold value, as an object with sound, and classifies the other objects as soundless objects.
  • Alternatively, the object classification unit 206 may classify an object that is not associated with information indicating that audio was detected, or an object for which the signal level of the audio signal indicated by the associated information is less than the threshold, as a soundless object, and classify the other objects as objects with sound.
  • The object with sound classified by the processing of the object classification unit 206 as described above is an object that is actually moving (a moving object), and the soundless object is an object that is actually stationary but apparently moving (the background).
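  • A minimal sketch of the classification rule, using the signal-level-threshold variant described above (the threshold value is an arbitrary placeholder):

```python
def classify(association, threshold_db=40.0):
    """Classify an object as 'with_sound' (actually moving) or 'soundless' (background)."""
    level = association["level_db"]
    if level is not None and level >= threshold_db:
        return "with_sound"   # handled by the first image processing unit 207 (e.g. tracking)
    return "soundless"        # handled by the second image processing unit 208 (e.g. self-position estimation)
```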
  • The first image processing unit 207 performs first image processing based on the information of the objects classified as objects with sound by the object classification unit 206.
  • The first image processing is, for example, processing for an object (moving object) that is actually moving, and includes, for example, tracking processing and processing that cuts out and draws the moving object.
  • When the first image processing unit 207 executes the tracking process, for example, the object classification unit 206 adds only the above-mentioned objects with sound to the tracking target objects, and the first image processing unit 207 performs the tracking process on the tracking target objects based on the time-series detection results of the event signals.
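  • The tracking algorithm itself is not specified in the publication; purely as an illustration, a tracked object could be matched to the nearest region classified as an object with sound in the next time slice, as in the following sketch (the maximum allowed jump is an assumption):

```python
def centroid(region):
    xs = [x for x, _ in region]
    ys = [y for _, y in region]
    return sum(xs) / len(xs), sum(ys) / len(ys)

def track(tracked, sound_regions, max_jump=50.0):
    """Update tracked positions using only regions classified as objects with sound."""
    updated = {}
    for obj_id, (px, py) in tracked.items():
        best, best_d = None, max_jump
        for region in sound_regions:
            cx, cy = centroid(region)
            d = ((cx - px) ** 2 + (cy - py) ** 2) ** 0.5
            if d < best_d:
                best, best_d = (cx, cy), d
        updated[obj_id] = best if best is not None else (px, py)
    return updated
```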
  • The second image processing unit 208 performs second image processing based on the information of the objects classified as soundless objects by the object classification unit 206.
  • The second image processing is, for example, processing for an object (background) that is actually stationary but apparently moving, and includes, for example, self-position estimation processing, motion cancellation processing, and processing that erases moving objects from the image and draws only the background.
  • When the second image processing unit 208 executes the self-position estimation process, for example, the object classification unit 206 adds only the above-mentioned soundless objects to the target objects of the self-position estimation process, and the second image processing unit 208 performs the self-position estimation process on the target objects based on the time-series detection results of the event signals, for example using a method such as SLAM (Simultaneous Localization and Mapping). Similarly, when the second image processing unit 208 executes the motion cancellation process, the object classification unit 206 adds only the above-mentioned soundless objects to the target objects of the motion cancellation process.
  • In the motion cancellation process, the second image processing unit 208 compensatorily rotates or moves the vision sensor 101 so that the positions of the target objects are maintained within its field of view.
  • The motion cancellation process may be executed, for example, by transmitting a control signal to a drive unit of the device on which the vision sensor 101 is mounted.
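  • Motion cancellation is described only as compensatory rotation or movement of the vision sensor 101; the following very rough sketch estimates the apparent shift of the soundless (background) regions between two time slices and issues a compensating command. The drive-unit interface is entirely hypothetical, and region correspondence is assumed to be given (a real implementation would first match regions between slices).

```python
def estimate_background_shift(prev_regions, curr_regions):
    """Average displacement of soundless (background) region centroids between two time slices."""
    if not prev_regions or not curr_regions:
        return 0.0, 0.0
    prev = [centroid(r) for r in prev_regions]   # centroid() as in the tracking sketch
    curr = [centroid(r) for r in curr_regions]
    n = min(len(prev), len(curr))                # naive pairing; correspondence assumed given
    dx = sum(c[0] - p[0] for p, c in zip(prev[:n], curr[:n])) / n
    dy = sum(c[1] - p[1] for p, c in zip(prev[:n], curr[:n])) / n
    return dx, dy

def motion_cancel(drive_unit, dx, dy, gain=1.0):
    """Send a compensating command so that background positions stay fixed in the field of view."""
    drive_unit.move(-gain * dx, -gain * dy)      # hypothetical drive-unit API
```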
  • FIG. 2 is a diagram for conceptually explaining the processing in the image processing system shown in FIG.
  • In the example shown in FIG. 2, the event signal generated by the vision sensor 101 includes a vehicle (obj1), which is an object (moving object) that is actually moving, and a building (obj2), which is an object (background) that apparently moves due to the movement of the device on which the vision sensor 101 is mounted. Since the microphone 102 picks up only the sound generated by the traveling vehicle, the audio signal is generated only in the region (indicated by hatching) that coincides with or overlaps the moving vehicle.
  • In the association processing unit 205 of the information processing device 200, information indicating that sound was detected (or a signal level of the audio signal equal to or higher than the threshold value) is associated only with the vehicle object (obj1), and the object classification unit 206 classifies the vehicle object (obj1) as an object with sound.
  • The first image processing unit 207 therefore executes processing such as tracking on the vehicle object (obj1).
  • On the other hand, the building object (obj2) is not associated with information indicating that sound was detected (or is associated with a signal level of the audio signal below the threshold value), and the object classification unit 206 classifies the building object (obj2) as a soundless object.
  • The second image processing unit 208 executes processing such as self-position estimation and motion cancellation using the building object (obj2).
  • In FIG. 2, the vehicle object (obj1) and the building object (obj2) are cut out and drawn separately for the sake of explanation, but it is not necessary to cut out and draw each object as an image; the image processing described above may be performed without drawing the objects.
  • FIG. 3 is a flowchart showing an example of the process according to the first embodiment of the present invention.
  • In the illustrated example, the event signal receiving unit 201 of the information processing device 200 receives the event signal generated by the vision sensor 101 (step S101), and the object detection unit 202 detects objects based on the received event signal (step S102).
  • The audio signal receiving unit 203 receives the audio signal acquired by the microphone 102 (step S103), and the alignment processing unit 204 performs the alignment process (step S104).
  • The association processing unit 205 then performs the association process for each object detected by the object detection unit 202 (step S105).
  • When the first image processing unit 207 executes the tracking process, the object classification unit 206 determines whether or not audio was detected at the position of each object (step S202) and, if audio was detected, classifies the object as an object to be processed (step S203). The object classification unit 206 repeats this classification process for the objects detected by the object detection unit 202 in step S102 above (steps S201 to S204). The first image processing unit 207 then executes the tracking process on the objects classified as objects to be processed (step S205).
  • Similarly, when the second image processing unit 208 executes the self-position estimation process or the motion cancellation process, the object classification unit 206 determines whether or not audio was detected at the position of each object (step S302) and, if no audio was detected, classifies the object as an object to be processed (step S303). The object classification unit 206 repeats this classification process for the objects detected by the object detection unit 202 in step S102 above (steps S301 to S304). The second image processing unit 208 then executes the self-position estimation process or the motion cancellation process using the objects classified as objects to be processed (step S305).
  • As described above, in the first embodiment of the present invention, the audio information in at least a partial region of the field of view of the vision sensor 101, acquired by the directional microphone 102, is associated with the pixel addresses of the event signal indicating positions in the image, at least a part of an object existing in the field of view is detected from the image information, and predetermined processing is performed on the object based on the result of the association processing. Therefore, by applying the audio information, processing can be performed on an appropriate object according to the purpose of the processing.
  • Further, the objects are classified into objects with sound and soundless objects based on the result of the association, and by selectively using at least one of the objects with sound and the soundless objects to perform the predetermined processing, appropriate processing can be carried out according to the characteristics of the object, such as whether the object is a moving object or the background.
  • For example, tracking processing can be executed for an object (moving object) that is actually moving. In this case, the possibility of capturing the moving object can be expected to increase even when the device on which the vision sensor 101 is mounted is moving, so that, for example, when a nearby object is tracked for the purpose of detecting danger, the problem of erroneously tracking an apparently moving object can be avoided. Moreover, since the possibility of tracking only truly moving objects is increased, objects can be tracked more accurately and without delay even if event signals are generated over the entire screen while the device equipped with the vision sensor 101 is moving.
  • Also, for example, self-position estimation processing for the device on which the vision sensor 101 is mounted can be executed using the time-series detection results of an object (background) that is actually stationary but apparently moving. For example, when only stationary objects need to be mapped in the self-position estimation process, the first embodiment of the present invention correctly distinguishes the stationary objects when performing the self-position estimation process, so the accuracy of the map used for self-position estimation can be improved. Similarly, motion cancellation processing in the device equipped with the vision sensor 101 can be executed using the time-series detection results of an object (background) that is actually stationary but apparently moving. In this case as well, the stationary objects are correctly distinguished when performing the motion cancellation process, so motion cancellation that correctly compensates for the rotation or movement of the vision sensor 101 becomes possible.
  • The image processing by the image processing system 10 is not limited to the examples described above; the system may be configured to perform only one of the image processings described in FIGS. 3 and 4, or to perform a plurality of image processings. The configuration may also be such that only one of the image processing by the first image processing unit 207 and the image processing by the second image processing unit 208 is performed, in which case only the corresponding one of the first image processing unit 207 and the second image processing unit 208 may be provided.
  • FIG. 6 is a block diagram showing a schematic configuration of the image processing system 20 according to the second embodiment of the present invention.
  • Components having substantially the same functional configuration as those of the first embodiment are designated by the same reference numerals, and duplicate description is omitted.
  • In the second embodiment, object detection is performed based on the result of the association processing.
  • The image processing system 20 includes a vision sensor 101, a microphone 102, and an information processing device 300.
  • The information processing device 300 is implemented by, for example, a computer having a communication interface, a processor, and a memory, and includes the functions of an event signal receiving unit 201, an audio signal receiving unit 203, an alignment processing unit 204, an association processing unit 301, an object detection unit 302, and an image processing unit 303, which are realized by the processor operating according to a program stored in the memory or received via the communication interface.
  • The functions whose configuration differs from that of FIG. 1 are further described below.
  • The association processing unit 301 uses the processing result of the alignment processing unit 204 described in the first embodiment to perform a process of associating the audio signal received by the audio signal receiving unit 203 with the pixel addresses of the event signal indicating positions in the field of view of the vision sensor 101. Specifically, for example, the association processing unit 301 associates, with the pixel addresses of the event signal, information based on the audio signal indicating the sound generated in at least a partial region of the field of view of the vision sensor 101 during the period in which the event signal was generated (for example, between the minimum and maximum of the time stamps of the event signal).
  • The information associated with the pixel addresses of the event signal may include, for example, only the presence or absence of audio detection, or may further include the signal level of the audio signal.
  • The object detection unit 302 performs object detection based on the event signal in regions of the image determined according to the audio signal associated with the pixel addresses of the event signal.
  • More specifically, within a region of the image determined according to the audio information and according to the characteristics of the objects to be image-processed by the image processing unit 303, the object detection unit 302 detects an object existing in a continuous pixel region for which the event signals indicate that events of the same polarity have occurred, and supplies information indicating the detection result to the image processing unit 303.
  • For example, when the image processing unit 303 targets an object with sound that is actually moving in the field of view of the vision sensor 101, as described for the first image processing unit 207 of the first embodiment, the object detection unit 302 performs object detection based on the event signals in the regions of the image with which information indicating that audio was detected, or information indicating that the signal level of the audio signal is equal to or higher than the threshold value, is associated as the audio information.
  • Conversely, when the image processing unit 303 targets a soundless object that is actually stationary but apparently moving due to the movement of the device on which the vision sensor 101 is mounted, the object detection unit 302 performs object detection based on the event signals in the regions of the image with which information indicating that audio was detected is not associated as the audio information, or with which information indicating that the signal level of the audio signal is less than the threshold value is associated.
  • As described above, in the present embodiment, the object detection unit 302 does not detect all objects but applies the audio information to detect only the objects to be image-processed by the image processing unit 303.
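  • As a sketch of how this second-embodiment detection could be limited to audio-selected regions, reusing the hypothetical detect_objects() helper from the first embodiment (the audio_map, a lookup from pixel address to an "audio detected" flag produced by the association step, and the want_sound flag are assumptions):

```python
def detect_in_audio_regions(events, audio_map, want_sound, min_pixels=20):
    """Detect objects only where the audio information associated with the pixel address
    matches the kind of object targeted by the image processing unit 303."""
    selected = [e for e in events if audio_map.get((e.x, e.y), False) == want_sound]
    return detect_objects(selected, min_pixels=min_pixels)
```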
  • The image processing unit 303 performs image processing in the same manner as the first image processing unit 207 or the second image processing unit 208 of the first embodiment, based on the information of the objects detected by the object detection unit 302.
  • FIG. 7 is a diagram for conceptually explaining the processing in the image processing system shown in FIG.
  • In the example shown in FIG. 7, the event signal generated by the vision sensor 101 includes a vehicle, which is an object (moving object) that is actually moving, and a building, which is an object (background) that apparently moves due to the movement of the device on which the vision sensor 101 is mounted. Since the microphone 102 picks up only the sound generated by the traveling vehicle, the audio signal is generated only in the region (indicated by hatching) that coincides with or overlaps the moving vehicle.
  • The association processing unit 301 of the information processing device 300 associates information indicating that audio was detected (or a signal level of the audio signal equal to or higher than the threshold value) only with the region R1 including the vehicle object; the object detection unit 302 detects the vehicle object (obj1) within the region R1, and the image processing unit 303 executes processing such as tracking on this object.
  • Alternatively, the object detection unit 302 may detect the building object (obj2), in which case the image processing unit 303 may execute processing such as self-position estimation and motion cancellation on this object.
  • In FIG. 7, the vehicle object (obj1) and the building object (obj2) are cut out and drawn separately for the sake of explanation, but it is not necessary to cut out and draw each object as an image; the image processing described above may be performed without drawing the objects.
  • FIG. 8 is a flowchart showing an example of the process according to the second embodiment of the present invention.
  • In the illustrated example, the event signal receiving unit 201 of the information processing device 300 receives the event signal generated by the vision sensor 101 (step S401).
  • The audio signal receiving unit 203 receives the audio signal acquired by the microphone 102 (step S402), and the alignment processing unit 204 performs the alignment process (step S403).
  • The association processing unit 301 then performs the association process (step S404).
  • Finally, the object detection unit 302 detects objects (step S405), and the image processing unit 303 executes the image processing (step S406).
  • The image processing by the image processing system 10 and the image processing system 20 described in the above embodiments may be executed in combination with general image-based object recognition (general object recognition).
  • For example, if an object identified by image-based object recognition as a normally stationary object, such as a structure (a building or the like) or a stationary article (a chair or the like), is classified by the object classification unit 206 of the information processing device 200 described above as a soundless object (an object that is actually stationary but apparently moving, that is, background), it may be determined that the object classification was performed correctly. On the other hand, if the recognition result of the image-based object recognition and the classification result are inconsistent, it may be determined that the object classification was not performed correctly, and, for example, the object recognition or the association with the audio signal may be re-executed. With such a configuration, the classification accuracy of objects can be improved.
  • Similarly, if the recognition result of the image-based object recognition is consistent with the detection result of the object detection unit 302, it can be judged that the object detection was performed correctly. On the other hand, if the recognition result of the image-based object recognition and the detection result are inconsistent, it may be determined that the object detection by the object detection unit 302 was not performed correctly, and, for example, the object recognition or the association with the audio signal may be re-executed. With such a configuration, the detection accuracy of objects can be improved.
  • Further, frequency analysis may be performed on the audio signal generated by the microphone 102 to recognize the type and characteristics of the sound source, and it may be judged whether or not the recognition result based on the audio signal is consistent with the recognition result of the above-mentioned general object recognition. In this case, for example, if the recognition result based on the audio signal of an object is the cry of an animal and the recognition result of the general object recognition is an animal, the two results are consistent, so the object is made a target of the association processing and the object classification processing. On the other hand, if they do not match, it is determined that there is noise in at least one of the image signal and the audio signal, and the object is not made a target of the association processing or the object classification processing. With such a configuration, the accuracy of object detection can be improved.
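  • A sketch of such a consistency check follows; the audio-based sound-source labels, the visual labels, and the compatibility table are placeholders introduced only for illustration and are not part of the disclosure.

```python
# Hypothetical compatibility table between sound-source types and visual object classes.
COMPATIBLE = {
    "animal_cry": {"animal"},
    "engine_sound": {"car", "truck", "motorcycle"},
}

def is_consistent(audio_label, visual_label):
    """True if the sound-source recognition agrees with the general-object-recognition result."""
    return visual_label in COMPATIBLE.get(audio_label, set())

def filter_candidates(candidates):
    """Keep only objects whose audio-based and image-based recognition results agree;
    inconsistent ones are treated as noise and excluded from association and classification."""
    return [c for c in candidates if is_consistent(c["audio_label"], c["visual_label"])]
```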
  • The image processing by the image processing system 10 and the image processing system 20 described in the above embodiments may also be applied to tracking processing that targets a specific object, for example an input device such as a controller of a game device. In that case, the input device is provided with a transmitting member that constantly emits a predetermined sound. First, a rough tracking process is performed based on the audio information, the tracking range is then limited based on the result of the rough tracking, and a more detailed tracking process based on the image information is performed, whereby the processing load can be suppressed and the accuracy of the tracking process can be improved.
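  • The coarse-to-fine idea described above could be sketched as follows, reusing the earlier hypothetical helpers: the controller's constant sound gives a rough position, which then bounds the window in which detailed event-based tracking is run (the window size and all function names are assumptions).

```python
def track_controller(audio_signals, events, H, window=80):
    """Coarse localization from audio, then detailed event-based detection in a limited window."""
    if not audio_signals:
        return None
    loudest = max(audio_signals, key=lambda a: a.level_db)
    cx, cy = align_audio_to_pixels(loudest, H)      # rough position from the controller's sound
    roi_events = [e for e in events
                  if abs(e.x - cx) <= window and abs(e.y - cy) <= window]
    regions = detect_objects(roi_events)            # detailed detection only inside the window
    if not regions:
        return (cx, cy)
    return centroid(max(regions, key=len))          # refined controller position
```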
  • In each of the above embodiments, an example in which the vision sensor 101 generates event signals has been described, but the present invention is not limited to this example; for example, an image pickup device that acquires RGB images may be provided instead of the vision sensor 101. In this case, the same effect can be obtained by performing object detection based on differences between images of a plurality of frames, and the processing load of object detection can also be reduced by limiting the detection range based on the audio information before performing object detection.
  • The image processing system 10 and the image processing system 20 described in the above embodiments may each be mounted in a single device or distributed over a plurality of devices. For example, the entire image processing system 10 or image processing system 20 may be mounted in a terminal device including the vision sensor 101, or the information processing device 200 or the information processing device 300 may be mounted separately in a server device. The image processing may also be performed after the data resulting from the association processing or the object classification has been saved. Furthermore, the event signal receiving unit, the audio signal receiving unit, the object detection unit, the alignment processing unit, the association processing unit, the object classification unit, the first image processing unit, the second image processing unit, and the image processing unit may be implemented in devices different from one another.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Otolaryngology (AREA)
  • Signal Processing (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Image Analysis (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to an image processing device comprising: a first receiving unit that receives image information acquired by an image sensor; a second receiving unit that receives audio information, acquired by one or more directional microphones, in at least a partial region of the field of view of the image sensor; an association processing unit that associates the audio information with the pixel address of the image information indicating a position in the field of view; an object detection unit that detects, from the image information, at least a part of an object present in the field of view; and a processing unit that performs prescribed processing on the object based on the result of association by the association processing unit.
PCT/JP2019/033709 2019-08-28 2019-08-28 Dispositif de traitement d'image, système, procédé de traitement d'image et programme de traitement d'image WO2021038752A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2021541869A JP7304955B2 (ja) 2019-08-28 2019-08-28 画像処理装置、システム、画像処理方法および画像処理プログラム
PCT/JP2019/033709 WO2021038752A1 (fr) 2019-08-28 2019-08-28 Dispositif de traitement d'image, système, procédé de traitement d'image et programme de traitement d'image
US17/635,304 US20220308157A1 (en) 2019-08-28 2019-08-28 Image processing apparatus, system, image processing method, and image processing program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/033709 WO2021038752A1 (fr) 2019-08-28 2019-08-28 Dispositif de traitement d'image, système, procédé de traitement d'image et programme de traitement d'image

Publications (1)

Publication Number Publication Date
WO2021038752A1 (fr) 2021-03-04

Family

ID=74683969

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/033709 WO2021038752A1 (fr) 2019-08-28 2019-08-28 Dispositif de traitement d'image, système, procédé de traitement d'image et programme de traitement d'image

Country Status (3)

Country Link
US (1) US20220308157A1 (fr)
JP (1) JP7304955B2 (fr)
WO (1) WO2021038752A1 (fr)


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090292504A1 (en) * 2008-04-11 2009-11-26 Haas Alfred M Adaptive Image Sensor
US9736580B2 (en) * 2015-03-19 2017-08-15 Intel Corporation Acoustic camera based audio visual scene analysis
US10909384B2 (en) * 2015-07-14 2021-02-02 Panasonic Intellectual Property Management Co., Ltd. Monitoring system and monitoring method
US10134422B2 (en) * 2015-12-01 2018-11-20 Qualcomm Incorporated Determining audio event based on location information
US10045120B2 (en) * 2016-06-20 2018-08-07 Gopro, Inc. Associating audio with three-dimensional objects in videos
US11543521B2 (en) * 2017-03-09 2023-01-03 Sony Corporation Information processing apparatus, information processing method, and recording medium

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008301161A (ja) * 2007-05-31 2008-12-11 Fujifilm Corp 画像処理装置、デジタルカメラ、及び画像処理方法
JP2010103972A (ja) * 2008-09-25 2010-05-06 Sanyo Electric Co Ltd 画像処理装置及び電子機器
JP2011217334A (ja) * 2010-04-02 2011-10-27 Canon Inc 撮像装置および撮像装置の制御方法
JP2013141090A (ja) * 2011-12-28 2013-07-18 Canon Inc 撮影装置及びその処理方法
JP2015100066A (ja) * 2013-11-20 2015-05-28 キヤノン株式会社 撮像装置及びその制御方法、並びにプログラム
JP2015177490A (ja) * 2014-03-18 2015-10-05 株式会社リコー 映像音声処理システム、情報処理装置、映像音声処理方法、及び映像音声処理プログラム
JP2016039407A (ja) * 2014-08-05 2016-03-22 パナソニックIpマネジメント株式会社 音声処理システム及び音声処理方法
JP2017028529A (ja) * 2015-07-23 2017-02-02 パナソニックIpマネジメント株式会社 モニタリングシステム及びモニタリング方法
WO2017159003A1 (fr) * 2016-03-17 2017-09-21 ソニー株式会社 Appareil et procédé de traitement d'image, et programme
JP2017175474A (ja) * 2016-03-24 2017-09-28 パナソニックIpマネジメント株式会社 モニタリングシステム及びモニタリング方法
US20180098082A1 (en) * 2016-09-30 2018-04-05 Intel Corporation Motion estimation using hybrid video imaging system
JP2019029962A (ja) * 2017-08-03 2019-02-21 キヤノン株式会社 撮像装置およびその制御方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116540178A (zh) * 2023-04-28 2023-08-04 广东顺德西安交通大学研究院 一种音视频融合的噪声源定位方法及系统
CN116540178B (zh) * 2023-04-28 2024-02-20 广东顺德西安交通大学研究院 一种音视频融合的噪声源定位方法及系统

Also Published As

Publication number Publication date
US20220308157A1 (en) 2022-09-29
JP7304955B2 (ja) 2023-07-07
JPWO2021038752A1 (fr) 2021-03-04

Similar Documents

Publication Publication Date Title
CN108831474B (zh) 语音识别设备及其语音信号捕获方法、装置和存储介质
CN111034222B (zh) 拾音装置、拾音方法以及计算机程序产品
US20200075012A1 (en) Methods, apparatuses, systems, devices, and computer-readable storage media for processing speech signals
US10694312B2 (en) Dynamic augmentation of real-world sounds into a virtual reality sound mix
US10339913B2 (en) Context-based cancellation and amplification of acoustical signals in acoustical environments
US11152001B2 (en) Vision-based presence-aware voice-enabled device
JP2012103865A5 (fr)
JP2017067666A5 (fr)
US10964326B2 (en) System and method for audio-visual speech recognition
JP2005250397A (ja) ロボット
KR102333107B1 (ko) 차량용 다중 센서를 위한 딥러닝 처리 장치 및 방법
US7377650B2 (en) Projection of synthetic information
US11740315B2 (en) Mobile body detection device, mobile body detection method, and mobile body detection program
US9992593B2 (en) Acoustic characterization based on sensor profiling
WO2021038752A1 (fr) Dispositif de traitement d'image, système, procédé de traitement d'image et programme de traitement d'image
WO2021108991A1 (fr) Procédé et appareil de commande, et plateforme mobile
JP2019159380A (ja) 物体検知装置、物体検知方法およびプログラム
RU174044U1 (ru) Аудиовизуальный многоканальный детектор наличия голоса
JP2005202578A (ja) コミュニケーション装置およびコミュニケーション方法
WO2020003764A1 (fr) Dispositif de traitement d'images, appareil mobile, procédé et programme
JP2000092368A (ja) カメラ制御装置及びコンピュータ読み取り可能な記憶媒体
CN117859339A (zh) 媒体设备及其控制方法和装置、目标跟踪方法和装置
WO2017029886A1 (fr) Dispositif de traitement d'informations, procédé de traitement d'informations et programme
US10812898B2 (en) Sound collection apparatus, method of controlling sound collection apparatus, and non-transitory computer-readable storage medium
Taj et al. Audio-assisted trajectory estimation in non-overlapping multi-camera networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19942822; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2021541869; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19942822; Country of ref document: EP; Kind code of ref document: A1)