CN116797926A - Robot multi-mode near-field environment sensing method and system

Robot multi-mode near-field environment sensing method and system

Info

Publication number
CN116797926A
Authority
CN
China
Prior art keywords
image
panoramic
sound
robot
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310679503.5A
Other languages
Chinese (zh)
Inventor
朱晓秀
邸荻
吴耀忠
叶亚峰
张龙飞
马宁
马添龙
王俊豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
32398 Troops Of Chinese Pla
Original Assignee
32398 Troops Of Chinese Pla
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 32398 Troops Of Chinese Pla
Priority to CN202310679503.5A
Publication of CN116797926A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/10 - Terrestrial scenes
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/10 - Image acquisition
    • G06V 10/16 - Image acquisition using multiple overlapping images; Image stitching
    • G06V 10/20 - Image preprocessing
    • G06V 10/24 - Aligning, centring, orientation detection or correction of the image
    • G06V 10/245 - Aligning, centring, orientation detection or correction of the image by locating a pattern; Special marks for positioning
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/803 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of input or preprocessed data
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G10L 25/78 - Detection of presence or absence of voice signals

Abstract

The invention discloses a robot multi-mode near-field environment sensing method and system, belonging to the technical field of near-field environment sensing. Based on a panoramic camera array and microphone array equipment carried by the robot, the invention stitches the panoramic camera images and uses the microphones to identify and localize near-field sound sources, thereby efficiently recognizing and classifying targets around the robot such as buildings, trees, pedestrians and vehicles. By combining a real-time, seamlessly fused panoramic video stitching method with a high-precision multi-object detection and classification method based on multi-scale information balancing and regression learning, the invention overcomes the ghosting, misalignment, blurring and visible seams caused by moving objects in panoramic video and images, as well as the diversity of targets in scale, viewing angle and appearance and the complex data characteristics of practical application scenes, and achieves high-precision target detection.

Description

Robot multi-mode near-field environment sensing method and system
Technical Field
The invention belongs to the technical field of near-field environment sensing, and particularly relates to a multi-mode near-field environment sensing method and system for a robot.
Background
Environment perception is one of the key modules of a mobile robot's autonomous navigation system, which determines the robot's motion behaviour. The autonomous navigation system mainly consists of four modules: perception, planning, control and localization. The perception module is the bridge between the robot and its environment; it reads and extracts the content of the environment. The basic idea is to collect raw data about the robot's surroundings with various environment-sensing sensors and to extract target features with perception algorithms, so that the robot ultimately knows where it is in the environment, what the situation around it is, what the content of the environment means, and how the different elements relate to one another.
Mobile robots emerged in the 1960s, when artificial intelligence techniques began to be applied to platforms equipped with electronic cameras, triangulation rangefinders, collision sensors and drive motors; such robots could solve simple perception, motion-planning and control problems (Xiong Yunlong. Virtual intelligent manager system based on a six-wheeled cart and related SLAM technology research [D]. University of warrior, 2020), while reliable localization and environment perception technologies are required for intelligent robot locomotion. In environment perception technology, the research results of related algorithms vary, and perception algorithms place different emphases on different application scenes: fields such as mapping and AR/VR, for example, need to reproduce the geometric, colour and other details of the real environment as faithfully as possible, while real-time requirements can be relaxed. At present, a robot can perceive its environment by selecting or processing real-scene point-cloud data according to its own pose (CN111645067B). With the popularity and growing capability of deep learning methods, neural-network-based robot navigation has also developed, and a robot can execute decision actions obtained directly from the network output (CN114879660A).
At present, robot near-field environment sensing still has some common shortcomings. Some sensing methods rely on vision alone: they can produce visual sensing results, but their accuracy is limited and matching position information is missing. Other methods achieve better results in some respects, yet still cannot provide the adaptive sensing capability required in robot applications.
In summary, a method for multi-mode near-field environment sensing on a robot platform is currently lacking, and the invention aims to solve the technical problems set forth above.
Object of the Invention
The invention aims to disclose a robot multi-mode near-field environment sensing method and system which, based on a panoramic camera array and microphone array equipment carried by the robot, stitch the panoramic camera images and use the microphones to identify and localize near-field sound sources, thereby efficiently recognizing and classifying targets around the robot such as buildings, trees, pedestrians and vehicles. By combining a real-time, seamlessly fused panoramic video stitching method with a high-precision multi-object detection and classification method based on multi-scale information balancing and regression learning, the invention overcomes the ghosting, misalignment, blurring and visible seams caused by moving objects in panoramic video and images, as well as the diversity of targets in scale, viewing angle and appearance and the complex data characteristics of practical application scenes, and achieves high-precision target detection.
In order to achieve the above purpose and solve the above technical problems, the technical scheme of the invention is as follows:
A robot multi-mode near-field environment sensing system comprises a panoramic video stitching module, a near-field identification labeling module and a sound monitoring module;
the panoramic video stitching module is used for acquiring continuous image data from a plurality of panoramic cameras loaded by the robot, and constructing and stitching 360-degree panoramic video pictures, and the specific implementation process is as follows:
1) Image preprocessing, including image denoising and balanced illumination, and meanwhile, in order to keep the consistency of space constraint and vision in the picture, coordinate transformation of cylindrical projection is needed to be carried out on the image, so that the spliced panoramic image can meet 360-degree circular view in the horizontal direction;
2) The image registration, the image to be spliced is input into a feature extraction model by adopting a feature-based image registration method, the matched feature point pair coordinates are calculated and stored, a transformation model is estimated, and the image to be spliced is transformed into the same coordinate system;
3) Image fusion: an image fusion algorithm combining dynamic updating of the optimal seam line with an improved fade-in/fade-out method determines, from the foreground region of moving objects in the acquired video, whether the current video frame requires the optimal seam line to be updated, and then uses the improved fade-in/fade-out method to smooth the transition region, eliminating blurring and ghosting in the overlap region and weakening the seam line while preserving the original image information;
the near field identification labeling module is used for receiving and storing the image data of the panoramic video stitching module, and detecting, classifying and labeling all objects in the stitched panoramic video picture; outputting the input complete image into a rectangular detection frame with category labeling information, compressing and encoding the panoramic picture with the labels to form a panoramic labeling video stream, and storing and transmitting the video stream to a main control;
the near field identification labeling module realizes multi-target detection and classification of all targets appearing in a delimited area through a one-stage end-to-end model, and regards a target detection task as a joint regression problem of target area prediction and category prediction, adopts a single neural network to directly predict object boundaries and category probabilities, and realizes end-to-end real-time target detection, and the method comprises the following specific realization steps:
1) Adopting a network connection structure of a self-adaptive multi-scale information flow, integrating adjacent scale features by utilizing information fusion, and then further enhancing feature representation of all levels in a feature pyramid by a strategy of transition from adjacent scale feature interaction to global scale feature interaction;
2) Target classification, extracting target information from a target candidate frame and a detection window with more accurate positioning and generating classification confidence based on a target classification enhancement algorithm of a multi-path detection head;
3) Target localization: based on statistical analysis of the training samples, a balanced optimized regression learning network is adopted to improve the performance of the localization task; the diversity of the training samples is learned adaptively through self-iterative window sampling, and the window regression processes of target candidate boxes with different localization accuracies are modelled separately;
the sound monitoring module is used for analyzing the audio data acquired by the 360-degree microphone array carried by the robot, extracting audio characteristics, calculating the deflection angle of sound information, positioning sudden sound, storing and streaming the audio data with the voice information, and specifically comprises the following steps:
1) Audio frequency orientation: the input audio information is processed to produce a highly directional low-frequency audible sound signal;
2) Acoustic intensity detection
After being converted into an electrical signal, the sound signal is very weak and cannot be fed directly to A/D conversion, so the signal is first amplified by the microphone circuit and then passed through the A/D conversion circuit, after which sound intensity detection is completed;
3) Positioning sudden sound by microphone array
The digital MEMS microphone sensors convert the analog signals into digital signals, which are encoded, modulated and finally uploaded to the PC side; the received data are further processed, and the position of the sound source signal is estimated by algorithmic calculation.
Further, the specific process of audio frequency orientation is as follows: when the sound source emits audible sound that needs to be directionally transmitted and propagated, the audio signal is first low-pass filtered and boosted and then fed into an A/D converter; a single-chip microcontroller then performs signal processing on the converted audio signal, the preprocessed signal drives the transducer array after power amplification, an ultrasonic signal carrying the audible sound is radiated into the air, and a highly directional low-frequency audible sound signal is obtained by self-demodulation.
Further, the specific process by which the microphone array localizes a burst sound is as follows: when the sound source signal of the burst sound reaches the front-end microphone array, the digital MEMS microphone sensors convert the acquired analog signals into digital form and, after encoding and modulation, output 1-bit PDM signals; the FPGA packs the four synchronously acquired PDM channels into 128-bit words and buffers them in DDR SDRAM; when the data length reaches the configured burst length, the FPGA controls the Ethernet port to package the data read from the DDR SDRAM into Ethernet frames and upload them to the PC; the PC further processes the received data and then estimates the position of the sound source signal with a delay-estimation algorithm.
The invention also provides a multi-mode near-field environment sensing method of the robot, which is realized by adopting a multi-mode near-field environment sensing system of the robot and comprises the following steps,
step 1, acquiring image data through a panoramic camera array carried by a robot;
step 2, a panoramic video stitching module constructs and splices 360-degree panoramic video pictures through the acquired continuous image data of a plurality of cameras;
step 3, detecting the object in the panoramic video picture by the near field identification labeling module, acquiring the position and type information of the object, and outputting the input complete image into a rectangular detection frame with the type labeling information;
step 4, compressing and encoding the labelled panoramic pictures from step 3 to form a panoramic annotated video, wherein the frame layout of the video stream is an upper/lower structure holding two 180-degree views in opposite directions, which together compose the panoramic video stream;
step 5, simultaneously acquiring audio data through a microphone array carried by the robot;
step 6, the sound monitoring module extracts the audio characteristics through the acquired audio data, calculates the deflection angle of sound, and detects the sound intensity after audio frequency orientation;
step 7, positioning the sudden sound by adopting a microphone array;
and step 8, recording timestamps for the video and audio data obtained in steps 4 and 7 through the audio and video equipment, so that, once the two modalities are time-consistent, they can be stored and forwarded to the master controller for near-field environment sensing and judgment.
Further, the devices carried by the robot are distributed as follows: the panoramic camera array is distributed over four points, each point carrying two cameras for local-area data acquisition; one sound pickup module is mounted on each of the left and right sides, and each module has one microphone in the middle and six microphones distributed on the surrounding annular circuit board.
The beneficial effects of the invention are as follows:
1. The invention constructs an audio-video multi-modal environment sensing method for close-range scenes, using a mobile platform such as a robot as the carrier. Image data collected by the panoramic camera array are fused in real time, a 360-degree panoramic video picture is constructed by seamless panoramic video stitching, objects in the panoramic picture are detected, and their position, type and other information are acquired; burst sounds are localized from the audio data collected by the 360-degree microphone array, improving the identification and localization of near-field sound sources; and the data of the two modalities are time-synchronized, stored and forwarded.
2. The real-time seamlessly fused panoramic video stitching method and the high-precision multi-object detection and classification method based on multi-scale information balancing and regression learning are optimized and improved. To address object detection with an extensible number of categories and cross-domain detection when target-domain samples are scarce, a feature-pyramid-based target feature enhancement algorithm is adopted to improve recognition accuracy under diverse scales, viewing angles and appearances.
3. The invention provides real-time seamlessly fused panoramic video stitching, which increases feature extraction speed and removes ghosting, gaps and similar artefacts caused by accumulated registration errors.
4. In practical scenes the environment is varied and complex, non-line-of-sight conditions are common, high-precision localization is difficult, and acoustic events in the speech signal are disturbed by echo, noise and overlapping sound sources. For these problems the scheme provides an audio directional-emission method based on an acoustic parametric array and a sound monitoring method with event prediction and enhancement, yielding a high-performance acoustic enhancement method and improving the identification and localization of near-field sound sources.
Drawings
FIG. 1 is a flow chart of a multi-modal near-field environment awareness method of the present invention;
FIG. 2 is a schematic diagram of a panoramic video single frame processing flow according to the present invention;
FIG. 3 is a schematic view of an image fusion stitching framework of the present invention;
fig. 4 is a schematic diagram of panoramic camera and microphone positioning according to an embodiment of the invention.
Detailed Description
The invention is explained and illustrated in detail below with reference to the attached drawings.
The invention provides a robot multi-mode near-field environment sensing system comprising a panoramic video stitching module, a near-field identification labeling module and a sound monitoring module. The design idea of the system is as follows: the panoramic video stitching module acquires continuous image data from a plurality of panoramic cameras carried by the robot and constructs a stitched 360-degree panoramic video picture; the near-field identification labeling module then detects objects in the panoramic picture and acquires their position, type and other information; the sound monitoring module localizes burst sounds from the audio data acquired by the 360-degree microphone array; finally, the data of the two modalities are time-synchronized, stored and forwarded.
1. Panoramic video stitching module
The panoramic video stitching module is used for acquiring continuous image data from a plurality of panoramic cameras loaded by the robot, and constructing and stitching 360-degree panoramic video pictures. Firstly, preprocessing images to be spliced, then, positioning similar parts in the images, obtaining a transformation model between the images to be spliced according to the positions of the similar parts, transforming a plurality of images to be spliced under the same coordinate system, realizing image registration, finally, adjusting a splicing area, and fusing the plurality of images more smoothly to obtain an image splicing result.
The panoramic image stitching module realizes the whole process of panoramic image stitching, wherein the whole process comprises the functions of image preprocessing, image feature extraction, image feature matching and the like. The panoramic image stitching module is used for preprocessing a plurality of images to be stitched input by a user, the processing method comprises denoising, balanced illumination, cylindrical projection and the like, then the processed images are input into the feature extraction network, and the feature extraction network outputs matched feature point pairs; inputting the preprocessed images to be spliced into an image segmentation network, segmenting moving objects contained in the images, outputting masks (masks) by a model, and segmenting the moving objects on the images to be spliced according to the masks; dynamically generating joint lines avoiding the moving object according to the feature matching point pairs and the moving object area, and splicing a plurality of images to be spliced; and finally, fusing panoramic stitching result images.
The specific implementation process is as follows, as shown in fig. 2:
1) Image preprocessing
The purpose of the image preprocessing stage is to eliminate, as far as possible, the impact that poor image quality would otherwise have on stitching quality. Image preprocessing includes denoising and illumination balancing; in addition, to preserve spatial constraints and visual consistency within the picture, a cylindrical-projection coordinate transform is applied to each image so that the stitched panorama satisfies a 360-degree circular view in the horizontal direction.
When the images to be stitched have poor brightness, strong noise or distortion, they must be preprocessed; the goal of this stage is to remove, to the greatest possible extent, the downstream effects of poor image quality so that the stitched result is of higher quality and looks better. Image preprocessing typically includes image denoising and image transformation. Image denoising: during image formation, the properties of the sensor and the surrounding environment introduce various kinds of noise, such as thermal noise from resistance, photon noise and photo-response non-uniformity noise; in addition, imperfect transmission media can contaminate the digital image with noise during transmission and recording. In preprocessing, the influence of noise is usually reduced by filtering. Mean filtering averages the pixel values linearly; limited by its inherent nature, it may destroy local image detail and introduce blurring, reducing overall image quality. Mean filtering suits Gaussian noise but not salt-and-pepper noise. Median filtering denoises the image with a nonlinear method and preserves edge features; it smooths salt-and-pepper noise well but copes poorly with Gaussian noise. Other common denoising methods include Gaussian filtering, image pyramids and histogram equalization. Image transformation: during panoramic stitching, a projection transform must be applied to keep spatial constraints and visual consistency within the picture. The coordinate transform of the cylindrical projection is easy to compute, and a panorama stitched under cylindrical projection satisfies a 360-degree circular view in the horizontal direction with a natural appearance.
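As an illustration only, a minimal Python/OpenCV sketch of such a preprocessing chain (denoising, illumination balancing with CLAHE, and an inverse cylindrical warp) might look as follows; the focal length in pixels, file name and filter parameters are assumed values, not parameters fixed by this embodiment:

```python
import cv2
import numpy as np

def cylindrical_project(img, focal_px):
    """Warp an image onto a cylinder so stitched frames can close into a
    360-degree horizontal ring. `focal_px` is an assumed, pre-calibrated
    focal length in pixels."""
    h, w = img.shape[:2]
    cx, cy = w / 2.0, h / 2.0
    ys, xs = np.indices((h, w), dtype=np.float32)
    theta = (xs - cx) / focal_px            # horizontal angle on the cylinder
    v = (ys - cy) / focal_px                # normalised height on the cylinder
    x_src = focal_px * np.tan(theta) + cx   # back-project onto the image plane
    y_src = focal_px * v / np.cos(theta) + cy
    return cv2.remap(img, x_src, y_src, interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_CONSTANT)

# Typical chain: denoise, balance illumination, then project.
frame = cv2.imread("camera_0.jpg")
frame = cv2.fastNlMeansDenoisingColored(frame, None, 5, 5, 7, 21)
lab = cv2.cvtColor(frame, cv2.COLOR_BGR2LAB)
l, a, b = cv2.split(lab)
l = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(l)
frame = cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)
cylindrical = cylindrical_project(frame, focal_px=1100.0)
```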
2) Image registration
Because of the overlapping phenomenon of the original pictures shot by the adjacent panoramic cameras, the images to be spliced are input into a feature extraction model, matched feature point pair coordinates are calculated and stored, a transformation model is estimated, and the images to be spliced are transformed into the same coordinate system.
Image registration is the most important part of the panoramic stitching process and determines the quality of the stitching result. Common registration methods are pixel-based and feature-based. Pixel-based methods search for similar overlapping regions between the images to be stitched, evaluate the similarity of those regions, and register according to the chosen attribute similarity. Feature-based methods detect distinctive points in the image, such as blobs and corners, then describe and match the extracted feature points; because the feature points are far fewer than the pixels, these methods markedly reduce the computation and increase matching speed compared with pixel-based registration. In addition, feature points are less sensitive to illumination and noise than pixel-based comparisons and are invariant to translation and rotation.
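For illustration, a minimal feature-based registration sketch in Python/OpenCV is given below; ORB features, the ratio-test threshold and the RANSAC parameters are assumptions standing in for the feature extraction model described above:

```python
import cv2
import numpy as np

def register_pair(img_a, img_b, min_matches=20):
    """Estimate the homography that maps img_b into img_a's coordinate system
    from matched feature point pairs."""
    orb = cv2.ORB_create(nfeatures=3000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    # Match binary descriptors with Hamming distance, then apply a ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(des_b, des_a, k=2)
    good = [p[0] for p in knn
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    if len(good) < min_matches:
        raise RuntimeError("not enough matched feature point pairs")
    src = np.float32([kp_b[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_a[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    # RANSAC rejects outlier pairs while estimating the transformation model.
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H, inlier_mask
```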
3) Image fusion
An image fusion algorithm combining dynamic updating of the optimal seam line with an improved fade-in/fade-out method determines, from the foreground region of moving objects in the acquired video, whether the current frame requires the optimal seam line to be updated, and then uses the improved fade-in/fade-out method to smooth the transition region, eliminating blurring and ghosting in the overlap region and weakening the seam line while preserving the original image information.
Image fusion is another key part of image stitching. Because of interference from illumination, parallax, colour and other factors during capture, the stitched region of the registered result often shows unpleasant artefacts such as ghosting, blurring or misalignment. To reduce or eliminate these phenomena, an image fusion algorithm combining dynamic updating of the optimal seam line with an improved fade-in/fade-out method is used. The algorithm obtains the foreground region of moving objects in the video with a background-subtraction method based on a Gaussian mixture model, decides from that foreground region whether the current frame needs a new optimal seam line, and then smooths the transition region with the improved fade-in/fade-out method; this removes blurring and ghosting in the overlap region while preserving the original image information, weakens the visible defects caused by the seam line, and makes the transition across the stitched region smoother and more natural. The image fusion framework is shown in fig. 3.
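A minimal sketch of these two ingredients is given below, assuming OpenCV's Gaussian-mixture background subtractor and a simple linear fade across the overlap; the thresholds, kernel size and seam-band mask are illustrative rather than prescribed by the invention:

```python
import cv2
import numpy as np

# Gaussian-mixture background subtraction supplies the moving-object
# foreground used to decide whether the optimal seam must be re-computed.
bg_model = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25)

def seam_needs_update(frame, seam_band_mask, area_thresh=500):
    """Re-compute the seam only when a moving object's foreground overlaps
    the current seam band (thresholds are illustrative)."""
    fg = bg_model.apply(frame)
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    overlap = cv2.bitwise_and(fg, seam_band_mask)
    return cv2.countNonZero(overlap) > area_thresh

def fade_blend(left, right, overlap_width):
    """Fade-in/fade-out blending: weights fall linearly across the overlap so
    the transition region is smoothed and the seam is weakened."""
    h = left.shape[0]
    weights = np.linspace(1.0, 0.0, overlap_width, dtype=np.float32)
    weights = np.tile(weights, (h, 1))[..., None]
    blended = (left[:, -overlap_width:].astype(np.float32) * weights +
               right[:, :overlap_width].astype(np.float32) * (1.0 - weights))
    return np.hstack([left[:, :-overlap_width],
                      blended.astype(left.dtype),
                      right[:, overlap_width:]])
```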
In the stitching algorithm framework based on semantic segmentation and seam lines, two images to be stitched are input and passed through a progressive local feature matching network and an image segmentation network respectively. The matching network outputs paired feature points between the two images; the segmentation network, built on an encoder-decoder architecture, outputs pixel-level segmentation predictions for both images, which are then converted into feature masks of the moving objects. The feature matching results and the feature masks are combined, and the inliers and outliers of the feature matches are estimated iteratively to compute the homography between the two images. Finally, a seam-line algorithm that avoids the moving-object regions is combined with multi-band fusion to obtain the final stitching result.
2. Near field identification labeling module
The near field identification labeling module is used for receiving and storing the image data of the panoramic video stitching module, and detecting, classifying and labeling all objects in the stitched panoramic video picture. And outputting the input complete image into a rectangular detection frame with category labeling information, compressing and encoding the panoramic picture with the labels to form a panoramic labeling video stream, and storing and transmitting the video stream to a main control.
The method is characterized in that multi-target detection and classification of all targets in a delimited area are realized through an end-to-end one-stage model, a target detection task is regarded as a joint regression problem of target area prediction and category prediction, a single neural network is adopted to directly predict object boundaries and category probabilities, and end-to-end real-time target detection is realized, wherein the method comprises the following specific realization steps:
1) Adopting a network connection structure of a self-adaptive multi-scale information flow, integrating adjacent scale features by utilizing information fusion, and then further enhancing feature representation of all levels in a feature pyramid by a strategy of transition from adjacent scale feature interaction to global scale feature interaction;
2) Target classification, extracting target information from a target candidate frame and a detection window with more accurate positioning and generating classification confidence based on a target classification enhancement algorithm of a multi-path detection head;
3) Target localization: based on statistical analysis of the training samples, a balanced optimized regression learning network is adopted to improve the performance of the localization task; the diversity of the training samples is learned adaptively through self-iterative window sampling, and the window regression processes of target candidate boxes with different localization accuracies are modelled separately.
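As a purely illustrative sketch of the adjacent-to-global scale interaction described in step 1), the following PyTorch module fuses each pyramid level with its neighbouring scales and then modulates all levels with a globally pooled context vector; the channel count, layer shapes and pooling choices are assumptions, not the network defined by the invention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacentScaleFusion(nn.Module):
    """Fuse every feature-pyramid level with its neighbouring scales, then
    refine all levels with one globally pooled context vector."""

    def __init__(self, channels=256, num_levels=4):
        super().__init__()
        self.smooth = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_levels))
        self.global_fc = nn.Linear(channels, channels)

    def forward(self, feats):
        # feats: list of [N, C, Hi, Wi] tensors ordered from fine to coarse.
        fused = []
        for i, f in enumerate(feats):
            out = f
            if i > 0:                           # finer neighbour, pooled down
                out = out + F.adaptive_max_pool2d(feats[i - 1], f.shape[-2:])
            if i + 1 < len(feats):              # coarser neighbour, upsampled
                out = out + F.interpolate(feats[i + 1], size=f.shape[-2:],
                                          mode="nearest")
            fused.append(self.smooth[i](out))
        # Global-scale interaction: one pooled context vector gates every level.
        ctx = torch.stack([f.mean(dim=(2, 3)) for f in fused]).mean(dim=0)
        gate = torch.sigmoid(self.global_fc(ctx))[:, :, None, None]
        return [f * gate for f in fused]

# Example: four pyramid levels from a backbone, batch of 2, 256 channels.
levels = [torch.randn(2, 256, s, s) for s in (64, 32, 16, 8)]
enhanced = AdjacentScaleFusion()(levels)
```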
3. Sound monitoring module
The sound monitoring module analyzes the audio data collected by the 360-degree microphone array carried by the robot, extracts audio features, calculates the deflection angle of sound information, localizes burst sounds, and stores and streams the audio data containing speech information. The specific steps are as follows:
1) Audio frequency orientation
The input audio information is processed to produce a highly directional low-frequency audible sound signal.
When the sound source emits audible sound that needs to be directionally transmitted and propagated, the audio signal is first low-pass filtered and boosted and then fed into an A/D converter; a single-chip microcontroller then performs signal processing on the converted audio signal, the preprocessed signal drives the transducer array after power amplification, an ultrasonic signal carrying the audible sound is radiated into the air, and a highly directional low-frequency audible sound signal is obtained by self-demodulation.
2) Sound intensity detection
After being converted into an electrical signal, the sound signal is very weak and cannot be fed directly to A/D conversion, so the signal is first amplified by the microphone circuit and then passed through the A/D conversion circuit, after which sound intensity detection is completed.
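After digitization, the intensity check can be approximated in software; the sketch below reports per-frame levels in dBFS (the frame length, hop size and silence floor are assumed values, and calibrated SPL would additionally require a known microphone sensitivity):

```python
import numpy as np

def frame_level_db(samples, ref=1.0, floor=1e-12):
    """RMS level of one analysis frame in dB relative to full scale (dBFS);
    `samples` are PCM values normalised to [-1, 1]."""
    rms = np.sqrt(np.mean(np.square(samples.astype(np.float64))))
    return 20.0 * np.log10(max(rms, floor) / ref)

def intensity_track(pcm, frame_len=1024, hop=512):
    """Per-frame intensity over a captured stream; a sudden rise well above
    the running background level can flag a burst sound for localization."""
    return np.array([frame_level_db(pcm[i:i + frame_len])
                     for i in range(0, len(pcm) - frame_len + 1, hop)])
```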
3) Burst-sound localization with the microphone arrays (the annular arrays on the left and right sides)
The digital MEMS microphone sensors convert the analog signals into digital signals, which are encoded, modulated and finally uploaded to the PC side; the received data are further processed, and the position of the sound source signal is estimated by algorithmic calculation.
When the sound source signal of the burst sound reaches the front-end microphone array, the digital MEMS microphone sensors convert the acquired analog signals into digital form and, after encoding and modulation, output 1-bit PDM signals; the FPGA packs the four synchronously acquired PDM channels into 128-bit words and buffers them in DDR SDRAM; when the data length reaches the configured burst length, the FPGA controls the Ethernet port to package the data read from the DDR SDRAM into Ethernet frames and upload them to the PC; the PC further processes the received data and then estimates the position of the sound source signal with a delay-estimation algorithm.
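One common way to realise the delay-estimation step is generalized cross-correlation with phase transform (GCC-PHAT); the sketch below estimates the delay between two synchronously captured channels and converts it to a bearing. The algorithm choice, microphone spacing and speed of sound are illustrative assumptions, since the description does not fix a particular estimator:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Time delay of `sig` relative to `ref` via GCC-PHAT (seconds)."""
    n = 2 * max(len(sig), len(ref))
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)

def bearing_from_delay(tau, mic_spacing_m, c=343.0):
    """Convert an inter-microphone delay into a deflection angle (radians)
    for a far-field source, given the microphone spacing in metres."""
    return np.arcsin(np.clip(c * tau / mic_spacing_m, -1.0, 1.0))

# Example with two channels of one captured frame at 48 kHz:
# tau = gcc_phat(mic_left, mic_right, fs=48000, max_tau=0.001)
# angle = bearing_from_delay(tau, mic_spacing_m=0.05)
```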
Meanwhile, the invention provides a multi-mode near-field environment sensing method of a robot, which specifically comprises the following steps of, as shown in fig. 1:
step 1, acquiring image data through a panoramic camera array carried by a robot;
the panoramic camera array can be set according to specific task requirements, and the panoramic camera array is distributed at four points as shown in fig. 4, wherein each point comprises two cameras for local area data acquisition; the left and right sound source modules are respectively provided with a sound source module structure, the right sound source module structure is that one microphone is arranged in the middle, and six microphones are distributed on the surrounding annular circuit board.
The panoramic camera used in this embodiment has a 70-degree wide-angle lens and a C6130 micro-core detector with a resolution of 1920 × 1080 and a pixel size of 2.7 μm; a 2.1 mm wide-angle lens is matched to the detector. The panoramic camera and microphone interfaces include 1 × RJ45, 2 × power input and 1 × microphone output; the front-end video centralized control box measures 300 mm × 150 mm × 300 mm, and its interfaces comprise 1 × power input, 9 × RJ45, 1 × CAN and 8 × microphone signal.
Step 2, a panoramic video stitching module constructs and splices 360-degree panoramic video pictures through the acquired continuous image data of a plurality of cameras;
Specifically, referring to fig. 2, continuous image data from a plurality of cameras are obtained through the panoramic camera array. Because the original pictures captured by adjacent cameras overlap, the acquired image data are preprocessed, the similar regions in the output images of adjacent cameras are located, and the coordinate pairs of key points in those regions are extracted; referring to fig. 3, a transformation model between the images to be stitched is then obtained from the matched key-point coordinate pairs, completing the stitching of the pictures captured by adjacent cameras.
Step 3, detecting the object in the panoramic video picture by the near field identification labeling module, acquiring the position and type information of the object, and outputting the input complete image into a rectangular detection frame with the type labeling information;
step 4, compressing and encoding the labelled panoramic pictures from step 3 to form a panoramic annotated video, wherein the frame layout of the video stream is an upper/lower structure holding two 180-degree views in opposite directions, which together compose a 1920 × 1080, 30 fps panoramic video stream that is then stored;
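A minimal sketch of this frame layout is shown below, assuming OpenCV for composition and encoding; the codec, file name and resize strategy are illustrative choices, not part of the defined interface:

```python
import cv2
import numpy as np

# Two opposite 180-degree annotated views stacked top/bottom inside a single
# 1920x1080 frame, written out at 30 fps.
writer = cv2.VideoWriter("panorama_annotated.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), 30.0, (1920, 1080))

def compose_frame(front_180, rear_180):
    """Resize each 180-degree view to 1920x540 and stack them vertically."""
    top = cv2.resize(front_180, (1920, 540))
    bottom = cv2.resize(rear_180, (1920, 540))
    return np.vstack([top, bottom])

# for front_view, rear_view in annotated_views:   # per annotated frame pair
#     writer.write(compose_frame(front_view, rear_view))
# writer.release()
```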
step 5, simultaneously acquiring audio data through a 360-degree microphone array carried by the robot;
step 6, the sound monitoring module extracts the audio characteristics through the acquired audio data and calculates the deflection angle of the sound;
Specifically, for audio frequency orientation, when the sound source emits audible sound that needs to be directionally transmitted and propagated, the audio signal is first low-pass filtered and boosted and then fed into an A/D converter; a single-chip microcontroller then performs signal processing on the converted audio signal, the preprocessed signal drives the transducer array after power amplification, an ultrasonic signal carrying the audible sound is radiated into the air, and a highly directional low-frequency audible sound signal is obtained by self-demodulation.
For sound intensity detection, the signal is amplified after the microphone circuit and finally passed through the A/D conversion circuit, which completes the detection.
Step 7, positioning the sudden sound by adopting a microphone array;
Specifically, for microphone-array localization, the analog signals are converted into digital signals, which are encoded, modulated and finally uploaded to the PC side; the received data are further processed, and the position of the sound source signal is estimated by algorithmic calculation.
And step 8, recording timestamps for the video and audio data obtained in steps 4 and 7 through the audio and video equipment, so that, once the two modalities are time-consistent, they can be stored and forwarded to the master controller for near-field environment sensing and judgment.
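As an illustration of this time-alignment step, a small sketch is given below; the packet fields, tolerance value and use of a shared monotonic clock are assumptions rather than a defined interface:

```python
import time

def stamp_packet(payload, modality, clock=time.monotonic):
    """Attach a shared monotonic timestamp to a video or audio packet so the
    master controller can align the two modalities before fusion."""
    return {"t": clock(), "modality": modality, "payload": payload}

def align(video_packets, audio_packets, tolerance=0.04):
    """Pair each annotated video frame with the audio packet closest in time;
    a tolerance of 40 ms is roughly one frame period at 25-30 fps."""
    pairs = []
    for v in video_packets:
        a = min(audio_packets, key=lambda p: abs(p["t"] - v["t"]))
        if abs(a["t"] - v["t"]) <= tolerance:
            pairs.append((v, a))
    return pairs
```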

Claims (5)

1. A robot multi-mode near-field environment sensing system, characterized by comprising a panoramic video stitching module, a near-field identification labeling module and a sound monitoring module;
the panoramic video stitching module is used for acquiring continuous image data from a plurality of panoramic cameras loaded by the robot, and constructing and stitching 360-degree panoramic video pictures, and the specific implementation process is as follows:
1) Image preprocessing, including image denoising and balanced illumination, and meanwhile, in order to keep the consistency of space constraint and vision in the picture, coordinate transformation of cylindrical projection is needed to be carried out on the image, so that the spliced panoramic image can meet 360-degree circular view in the horizontal direction;
2) The image registration, the image to be spliced is input into a feature extraction model by adopting a feature-based image registration method, the matched feature point pair coordinates are calculated and stored, a transformation model is estimated, and the image to be spliced is transformed into the same coordinate system;
3) Image fusion: an image fusion algorithm combining dynamic updating of the optimal seam line with an improved fade-in/fade-out method determines, from the foreground region of moving objects in the acquired video, whether the current video frame requires the optimal seam line to be updated, and then uses the improved fade-in/fade-out method to smooth the transition region, eliminating blurring and ghosting in the overlap region and weakening the seam line while preserving the original image information;
the near field identification labeling module is used for receiving and storing the image data of the panoramic video stitching module, and detecting, classifying and labeling all objects in the stitched panoramic video picture; outputting the input complete image into a rectangular detection frame with category labeling information, compressing and encoding the panoramic picture with the labels to form a panoramic labeling video stream, and storing and transmitting the video stream to a main control;
the near field identification labeling module realizes multi-target detection and classification of all targets appearing in a delimited area through a one-stage end-to-end model, and regards a target detection task as a joint regression problem of target area prediction and category prediction, adopts a single neural network to directly predict object boundaries and category probabilities, and realizes end-to-end real-time target detection, and the method comprises the following specific realization steps:
1) Adopting a network connection structure of a self-adaptive multi-scale information flow, integrating adjacent scale features by utilizing information fusion, and then further enhancing feature representation of all levels in a feature pyramid by a strategy of transition from adjacent scale feature interaction to global scale feature interaction;
2) Target classification, extracting target information from a target candidate frame and a detection window with more accurate positioning and generating classification confidence based on a target classification enhancement algorithm of a multi-path detection head;
3) Target localization: based on statistical analysis of the training samples, a balanced optimized regression learning network is adopted to improve the performance of the localization task; the diversity of the training samples is learned adaptively through self-iterative window sampling, and the window regression processes of target candidate boxes with different localization accuracies are modelled separately;
the sound monitoring module is used for analyzing the audio data acquired by the 360-degree microphone array carried by the robot, extracting audio characteristics, calculating the deflection angle of sound information, positioning sudden sound, storing and streaming the audio data with the voice information, and specifically comprises the following steps:
1) Audio frequency orientation: the input audio information is processed to produce a highly directional low-frequency audible sound signal;
2) Acoustic intensity detection
After being converted into an electrical signal, the sound signal is very weak and cannot be fed directly to A/D conversion, so the signal is first amplified by the microphone circuit and then passed through the A/D conversion circuit, after which sound intensity detection is completed;
3) Positioning sudden sound by microphone array
The digital MEMS microphone sensors convert the analog signals into digital signals, which are encoded, modulated and finally uploaded to the PC side; the received data are further processed, and the position of the sound source signal is estimated by algorithmic calculation.
2. The robot multi-mode near-field environment sensing system according to claim 1, wherein the specific process of audio frequency orientation is as follows: when the sound source emits audible sound that needs to be directionally transmitted and propagated, the audio signal is first low-pass filtered and boosted and then fed into an A/D converter; a single-chip microcontroller then performs signal processing on the converted audio signal, the preprocessed signal drives the transducer array after power amplification, an ultrasonic signal carrying the audible sound is radiated into the air, and a highly directional low-frequency audible sound signal is obtained by self-demodulation.
3. The robot multi-mode near-field environment sensing system according to claim 1, wherein the specific process by which the microphone array localizes a burst sound is as follows: when the sound source signal of the burst sound reaches the front-end microphone array, the digital MEMS microphone sensors convert the acquired analog signals into digital form and, after encoding and modulation, output 1-bit PDM signals; the FPGA packs the four synchronously acquired PDM channels into 128-bit words and buffers them in DDR SDRAM; when the data length reaches the configured burst length, the FPGA controls the Ethernet port to package the data read from the DDR SDRAM into Ethernet frames and upload them to the PC; the PC further processes the received data and then estimates the position of the sound source signal with a delay-estimation algorithm.
4. The method for sensing the multi-mode near-field environment of the robot is characterized by being realized by adopting the multi-mode near-field environment sensing system of the robot as claimed in claim 1 and comprises the following steps,
step 1, acquiring image data through a panoramic camera array carried by a robot;
step 2, a panoramic video stitching module constructs and splices 360-degree panoramic video pictures through the acquired continuous image data of a plurality of cameras;
step 3, detecting the object in the panoramic video picture by the near field identification labeling module, acquiring the position and type information of the object, and outputting the input complete image into a rectangular detection frame with the type labeling information;
step 4, compressing and encoding the labelled panoramic pictures from step 3 to form a panoramic annotated video, wherein the frame layout of the video stream is an upper/lower structure holding two 180-degree views in opposite directions, which together compose the panoramic video stream;
step 5, simultaneously acquiring audio data through a microphone array carried by the robot;
step 6, the sound monitoring module extracts the audio characteristics through the acquired audio data, calculates the deflection angle of sound, and detects the sound intensity after audio frequency orientation;
step 7, positioning the sudden sound by adopting a microphone array;
and step 8, recording timestamps for the video and audio data obtained in steps 4 and 7 through the audio and video equipment, so that, once the two modalities are time-consistent, they can be stored and forwarded to the master controller for near-field environment sensing and judgment.
5. The robot multi-mode near-field environment sensing method according to claim 4, wherein the devices carried by the robot are distributed as follows: the panoramic camera array is distributed over four points, each point carrying two cameras for local-area data acquisition; one sound pickup module is mounted on each of the left and right sides, and each module has one microphone in the middle and six microphones distributed on the surrounding annular circuit board.
CN202310679503.5A 2023-06-09 2023-06-09 Robot multi-mode near-field environment sensing method and system Pending CN116797926A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310679503.5A CN116797926A (en) 2023-06-09 2023-06-09 Robot multi-mode near-field environment sensing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310679503.5A CN116797926A (en) 2023-06-09 2023-06-09 Robot multi-mode near-field environment sensing method and system

Publications (1)

Publication Number Publication Date
CN116797926A true CN116797926A (en) 2023-09-22

Family

ID=88049023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310679503.5A Pending CN116797926A (en) 2023-06-09 2023-06-09 Robot multi-mode near-field environment sensing method and system

Country Status (1)

Country Link
CN (1) CN116797926A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117079667A (en) * 2023-10-16 2023-11-17 华南师范大学 Scene classification method, device, equipment and readable storage medium
CN117079667B (en) * 2023-10-16 2023-12-22 华南师范大学 Scene classification method, device, equipment and readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination