WO2024017363A1

WO2024017363A1 - Action recognition method and apparatus, and storage medium, sensor and vehicle

Info

Publication number: WO2024017363A1
Application number: PCT/CN2023/108545
Authority: WO
Inventors: 牛寅; 陈枭雄; 岑冠男
Original assignee: 联合汽车电子有限公司
Priority date: 2022-07-21
Filing date: 2023-07-21
Publication date: 2024-01-25
Also published as: CN115273229A

Abstract

An action recognition method and apparatus, and a storage medium, a sensor and a vehicle. An ultra wide band (UWB) radar is used to recognize a kicking action of a human body, and a deep learning technique is combined with the UWB radar to solve the problems of mis-recognition and mis-operation in a recognition technique, such that the performance of a related system can be further improved.

Description

An action recognition method, device, storage medium, sensor and vehicle

Technical field

The invention belongs to the technical field of smart vehicles, and in particular relates to an action recognition method, device, storage medium, sensor and vehicle.

Background technique

The kick automatic tailgate system is an implementation of smart cars in the field of perception control. The system automatically recognizes human body kicking movements through sensors installed around the body and controls the automatic opening of the tailgate. The core of the system is used to identify Sensor for kicking motion of human body.

Currently, the mainstream method on the market is to use capacitive sensors, usually under the rear bumper of the vehicle. A capacitive electric field emitting electrode is installed at the rear and bottom of the vehicle. When a human leg appears in the detection area, a capacitance will be formed between the sensor and the electrode; The capacitance value changes with the distance between the leg and the electrode, and the kicking movement of the human body can be recognized.

With the intelligent development of automobiles, radar has gradually become an indispensable part of the automobile perception system; among them, ultra-wideband UWB (Ultra Wide Band) radar has great advantages due to its large bandwidth, strong penetration, and low cost. development potential.

Compared with capacitive sensors that only measure changes in capacitance values, UWB radar can simultaneously measure distance, speed, angle and other information of objects to obtain richer features; at the same time, if the UWB anchor points on the left or right rear of the vehicle are reused, further cut costs.

Contents of the invention

Embodiments of the present invention disclose an action recognition method, device, storage medium, sensor and vehicle; the action recognition method identifies valid data in relevant signals through artificial intelligence methods, and can effectively filter interference signals and white noise.

Specifically, the method includes a first data collection step and a fourth action recognition step; wherein the first data collection step scans and acquires a first original signal, converts the first original signal into a two-dimensional array form and forms a first image, This first image is a distance-time image.

Further, the fourth action recognition step passes the first type feature processing step, the second type feature The processing steps, the third category determination output step and other related steps identify the valid data; among them, the first type feature processing step: extract the first type feature data in the time dimension of the first image; the second type feature processing step: extract the first type feature data. The second type of feature data in the image distance and speed dimensions; the third category determination output step: synthesizing the first type feature data and the second type feature data to obtain an action recognition data set; wherein the third category determination output step also includes : Obtain third recognition result data by classifying and identifying the action recognition data set; wherein, the action recognition data set is divided into at least three categories: the first effective data set, the second noise data set, and the third interference data set.

Specifically, its first data acquisition step periodically acquires a first original signal, which may be a fixed-length I/Q complex signal; its first image is arranged by a first original signal sequence, and its two-dimensional The array is in the form of MxN; where M and N are natural numbers; the third identification result data includes switch quantities or signals used to trigger relevant mechanisms.

Further, the method may also include a second preprocessing step and a third intermediate processing step; wherein the second preprocessing step also includes a noise reduction processing step, and the third intermediate processing step further includes a fast Fourier transform step FFT and /or the short-time Fourier transform STFT step; the STFT step adds a window function before the FFT, and the window function can be a hanning window; the first image is processed by denoising to obtain the second denoised image; the second denoising The fourth action recognition step can be further performed by replacing the first image with the latter image, thereby improving the effect of the related recognition process; the third intermediate processing step obtains the signal-to-noise ratio SNR of the second denoised image and forms a second image for Extraction of the second type of feature data.

Specifically, its first type of feature data can be a first feature vector with a length of N, its second type of feature data can be a second feature vector with a length of L, and its action recognition data set can be a length of ( N+L) third feature vector; by performing 1x1 convolution on the third feature vector with the preset convolution kernel, the activation values of various types of data in the action recognition data set are obtained, and then the activation values are normalized and the action recognition is obtained The probability distribution of the values of various types of data in the data set; among them, the category with the highest probability corresponds to the third recognition result data; its convolution kernel can be, and its normalization process can use the softmax method.

Furthermore, the pulse repetition interval PRI (Pulse Repetition Interval) of the first original signal can be a fixed value, the first original signal can be the echo of an ultra-wideband UWB radar, and the operating frequency of the radar can be set at 6.4GHZ. and 8GHZ or set its wavelength to Between 3.75CM and 4.69CM.

Further, the method may also include an anti-shake output step and a model training step; wherein, the anti-shake output step obtains the third recognition result data for R consecutive times to confirm the validity of the data, avoid malfunctions caused by interference signals, and improve The reliability and robustness of the system; where R is a natural number greater than or equal to 2.

If the third recognition result data can be obtained R times in succession and all are first valid data, then the first original signal is determined to be a valid signal.

Further, the model training step may include a noise reduction model training step, a recognition model training step and/or a related model training step; wherein the noise reduction model training step is used in the noise reduction processing step; the noise reduction model training step is performed by normalization Process the training sample X-Normal, X-Noise>; Obtain decoding output Y by inputting Target value; wherein, in the forward inference stage, the first image can also be normalized and the normalized result is input into the autoencoder.

Specifically, the synthetic minority class oversampling method SMOTE (Synthetic Minority Oversampling Technique) can be used to amplify its training samples; the recognition model training step can also be performed on the first image or the second denoised image and the second image respectively. Normalization processing; then input the normalized sample into the action recognition model to obtain the predicted probability distribution P; then use Focal Loss as the loss function to calculate the error Loss between the predicted value and the real label; and use the gradient descent method Iteratively optimize the model parameters until the loss drops to the preset accuracy range.

In addition, an embodiment of the present invention also discloses an action recognition device, which includes a first data acquisition module and a third action recognition module; wherein: the first data acquisition module scans and acquires a first original signal, and converts the first original signal into In the form of a two-dimensional array, a first image is formed, and the first image is a distance-time image.

Further, its third action recognition module may include a first type feature processing module, a second type feature processing module, and a third type judgment output unit; wherein, the first type feature processing module: extracts the third type of feature processing module. The first type feature data in the time dimension of an image; the second type feature processing module: extracts the second type feature data in the distance and speed dimensions of the first image; the third category determination output unit: synthesizes the first type feature data and the second type The action recognition data set is obtained after classifying the feature data; its third category judgment output unit classifies and identifies the action recognition data set, and obtains the third recognition result data; wherein, the action recognition data set is divided into at least a first valid data set, a third valid data set, and a first valid data set. There are three categories: the second noise data set and the third interference data set.

Specifically, the first data acquisition module periodically acquires the first original signal, where the first original signal may be a fixed-length I/Q complex signal; the first image is arranged by the first original signal sequence, and its two-dimensional The array is in the form of MxN; where M and N are natural numbers; the third recognition result data includes the switch quantity or signal used to trigger the relevant mechanism.

Further, the device may also include: a second data processing module to improve the effect of feature recognition through data preprocessing; wherein the second data processing module includes a noise reduction processing module and an intermediate processing module; the intermediate processing module can complete rapid Fourier transform FFT and/or short-time Fourier transform STFT; its STFT adds a window function before the FFT, and the window function can be a hanning window function; its noise reduction processing module processes the first image to obtain the second noise reduction image; the second denoised image can replace the first image to participate in the processing of the third action recognition module; the intermediate processing module obtains the signal-to-noise ratio SNR of the second denoised image and forms a second image for the second type of feature Data extraction.

Specifically, its first type of feature data is a first feature vector of length N, its second type of feature data is a second feature vector of length L, and its action recognition data set is a length of (N+L) The third feature vector of Probability distribution of data values; among them, the category with the highest probability corresponds to the third recognition result data. The optional convolution kernel is 3, and the normalization method can be implemented by softmax.

Furthermore, the pulse repetition interval PRI of the first original signal can be a fixed value, and the first original signal can be the echo of an ultra-wideband UWB radar. The working frequency of the radar is between 6.4GHZ and 8GHZ or the wavelength is between 6.4GHZ and 8GHZ. Between 3.75CM and 4.69CM.

Further, the device may also include a fourth control output module; after acquiring the third recognition result data R times continuously; if the third recognition result data acquired R times are all first valid data, then determine The first original signal is a valid signal.

Further, its second data processing module may also include a noise reduction model training module, a recognition model training module and/or a related model training module; wherein, the noise reduction model training module is used for the optimization of noise reduction processing; the noise reduction model training module Normalize the training sample X to obtain the normalized sample X-Normal, and superimpose random white noise on the For <X-Normal, Until the loss function Loss reaches the target value; in the forward inference stage, the first image can also be normalized and the normalized result is input into the autoencoder; the second data processing module can use the synthetic minority class The oversampling method SMOTE amplifies the training samples to balance the sample data; in addition, the recognition model training module can perform normalization processing on the first image or the second denoised image and the second image respectively; and then normalize the The final sample is input into the action recognition model, and the predicted probability distribution P is obtained and Focal Loss is used as the loss function to calculate the error Loss between the predicted value and the real label; then the gradient descent method is used to iteratively optimize the model parameters until the Loss drops to the preset value. accuracy range.

Further, embodiments of the present invention also disclose related products using the above methods and devices, including computer storage media, sensors and vehicles. Wherein, the storage medium includes a storage medium body for storing a computer program; when the computer program is executed by the microprocessor, the above action recognition method can be implemented.

Similarly, its sensors and vehicles include any of the above devices and storage media, which can also recognize relevant feature data and respond to relevant actions. The specific process will not be described again.

The embodiment of the present invention uses UWB radar to identify human body kicking movements, and combines deep learning technology with UWB radar to solve the problems of misrecognition and misoperation in the recognition technology, which can further improve the performance of related systems; in addition, using The kick sensor implemented by UWB radar is lower in cost. It combines UWB radar signals with deep learning models, and its recognition performance is also improved.

It should be noted that the "first", "second" and other similar terms used in this article are only used to describe the various components of the technical solution, and do not constitute a limitation on the technical solution, nor can they be interpreted. It is interpreted as an indication or hint of the importance of the corresponding element; elements with similar terms such as "first" and "second" mean that the element contains at least one in the corresponding technical solution.

Description of drawings

In order to explain the technical solution of the present invention more clearly and facilitate a further understanding of the technical effects, technical features and purposes of the present invention, the present invention will be described in detail below in conjunction with the accompanying drawings, which constitute an essential part of the specification and are closely related to the implementation of the present invention. The examples are used to illustrate the technical solution of the present invention, but do not constitute a limitation of the present invention.

The same reference numerals in the drawings represent the same parts, specifically:

Figure 1 is a schematic flow diagram 1 of an embodiment of the method of the present invention;

Figure 2 is a schematic diagram of data collection according to the method and product embodiment of the present invention;

Figure 3 is a schematic structural diagram of the noise reduction auto-encoding and decoding method and product embodiment of the present invention;

Figure 4 is a schematic diagram of the action recognition structure of the method and product embodiment of the present invention;

Figure 5 is a schematic diagram of the action recognition process according to the method embodiment of the present invention;

Figure 6 is a schematic diagram of the model training process according to the method embodiment of the present invention;

Figure 7 is a schematic flow diagram 2 of the method embodiment of the present invention;

Figure 8 is a schematic structural diagram of an embodiment of the device of the present invention;

Figure 9 is a schematic structural diagram of an action recognition module according to an embodiment of the device of the present invention;

Figure 10 is a schematic structural diagram of the model training module of the device embodiment of the present invention;

Figure 11 is a schematic structural diagram of an embodiment of the product of the present invention;

Figure 12 is a schematic diagram 2 of the composition and structure of an embodiment of the product of the present invention;

Figure 13 is a schematic diagram three of the composition and structure of an embodiment of the product of the present invention;

Figure 14 is a schematic diagram 4 of the composition and structure of an embodiment of the product of the present invention.

in:

001 - the first original signal, the embodiment I/Q signal,

011-First distance-time image (two-dimensional array) one,

012-Overlap between signals,

022-First distance-time image (two-dimensional array) two,

100-First data collection step,

111-The first denoised image (two-dimensional array),

200-The second preprocessing step, including noise reduction in the embodiment;

300-The third intermediate processing step, in the embodiment, the distance velocity image is constructed;

400-The fourth action recognition step. In the embodiment, white noise, body movements and interference actions are distinguished;

410-The first type of feature processing step involves data extraction in the time dimension;

420-The second type of feature processing step involves data extraction on distance and speed dimensions;

430-The third category determination output step;

500-The fifth anti-shake output step,

600-Sixth model training step,

602-Noise reduction model training steps,

604-Recognition model training step,

60M-related model training steps, M is a natural number, M is greater than or equal to 2;

700-motion recognition device,

710-First data acquisition module,

720-Second data processing module,

721-Noise reduction model training module,

722-Recognition model training module,

72M-related model training module,

730-The third action recognition module,

740-Fourth control output module;

771-The first type of feature processing module,

772-The second type of feature processing module,

773-The third category judgment output module,

810-The eighth denoising auto-encoding decoding structure,

811-encoder,

812-Decoder,

900-vehicle,

901-Controller,

902-motion recognition device,

903-Computer storage media,

905-sensor,

1001-One-dimensional convolutional neural network processing process, namely 1D-CNN;

1003-Feature tensor, namely CNN-F;

1005-Feature map, that is, the N output quantities obtained through LSTM in the embodiment;

1111-First feature data (vector),

2001-Two-dimensional convolutional neural network feature extraction process, that is, the L outputs obtained through 2D-CNN in the embodiment;

2222-Second feature data (vector),

3000-feature data (vector) merge,

3333-Third recognition result data.

Detailed ways

The present invention will be further described in detail below in conjunction with the accompanying drawings and examples. Of course, the specific embodiments described below are only for explaining the technical solution of the present invention, but are not intended to limit the present invention. In addition, the parts described in the embodiments or drawings are only illustrations of the relevant parts of the present invention, and are not the entirety of the present invention.

The action recognition method shown in Figures 1, 2, 4, and 5 includes a first data collection step 100 and a fourth action recognition step 400; wherein the first data collection step 100 scans and acquires the first original signal 001 , convert the first original signal 001 into a two-dimensional array form and form a first image 011, and the first image 011 is a distance-time image.

As shown in Figure 5, its fourth action recognition step 400 includes a first type feature processing step 410, a second type feature processing step 420, and a third type judgment output step 430; as shown in Figure 4, its first type feature processing step Step 410 extracts the first type feature data 1111 in the time dimension of the first image 011, and the second type feature processing step 420 extracts the second type feature data 2222 in the distance and speed dimensions of the first image 011; the third category determination output step 430 synthesizes the One type of feature data 1111 and the second type of feature number After data 2222, the action recognition data set is obtained.

Further, as shown in Figure 4, the third recognition result data 3333 is obtained by classifying and identifying the action recognition data set; wherein the action recognition data set is at least divided into a first effective data set, a second noise data set, a third Interference dataset three categories.

Specifically, as shown in Figure 2, the first data acquisition step 100: periodically acquires the first original signal 001, which is a fixed-length I/Q complex signal; and whose first image 011 is composed of the first original signal 001. The signal 001 is arranged in sequence, and its two-dimensional array is in the form of MxN; where M and N are natural numbers; in addition, the third recognition result data 3333 in Figure 4 includes switch quantities or signals used to trigger related mechanisms.

Further, as shown in Figure 7, the action recognition method also includes a second preprocessing step 200 and a third intermediate processing step 300; the second preprocessing step 200 may include a noise reduction processing step, and the third intermediate processing step 300 may include Fast Fourier transform step FFT and/or short-time Fourier transform STFT; wherein STFT adds a window function before FFT, and the window function can be a hanning window; the second reduced noise can be obtained by processing the first image 011 through noise reduction. The second noise-reduced image 111 can replace the first image 011 to perform the fourth action recognition step 400; the third intermediate processing step 300 obtains the signal-to-noise ratio SNR of the second noise-reduced image 111, and forms a third The second image 222 is used to extract the second type of feature data 2222.

As shown in Figure 4, the first type of feature data 1111 can be a first feature vector with a length of N, the second type of feature data 2222 can be a second feature vector with a length of L, and the action recognition data set is a The third eigenvector of length (N+L).

Further, perform 1x1 convolution on the third feature vector with a preset convolution kernel to obtain the activation values of various types of data in the action recognition data set, normalize the activation values and obtain the values of various types of data in the action recognition data set. Probability distribution; among them, the category with the highest probability corresponds to the third recognition result data 3333.

Specifically, the convolution kernel can be 3; the normalization process can be obtained by using the softmax method.

Further, the pulse repetition interval PRI of the first original signal 001 is fixed. The first original signal 001 is the echo of an ultra-wideband UWB radar. The operating frequency of the radar can be selected between 6.4GHZ and 8GHZ or the wavelength can be between 6.4GHZ and 8GHZ. Between 3.75CM and 4.69CM.

In addition, as shown in Figure 7, the method may also include an anti-shake output step 500 and a model training step 600; wherein the anti-shake output step 500 ensures the reliability of the recognition by obtaining the third recognition result data 3333 for R consecutive times; where , R is a natural number greater than or equal to 2; if the third recognition result data 3333 obtained R times are all first valid data, it is determined that the first original signal 001 is a valid signal.

As shown in Figure 6, the model training step 600 includes a noise reduction model training step 602, a recognition model training step 604 and/or a related model training step 60M, where M is a natural number and M is not less than 2; wherein, the noise reduction model training step 602 Used for the noise reduction processing step; the noise reduction model training step 602 normalizes the training sample X to obtain the normalized sample X-Normal, and superimposes random white noise on the X-Normal to obtain the noise sample X-Noise; and then normalizes The sample X-Normal and the noise sample X-Noise construct a training sample pair <X-Normal, Normal); iteratively optimize the encoding and decoding parameters until the loss function Loss reaches the target value; wherein, in the forward inference stage, the first image 011 can also be normalized and the normalized result can be input into the auto-encoding device.

Furthermore, in order to ensure the balance of training samples, the synthetic minority class oversampling method SMOTE can be used to amplify the training samples.

In addition, as shown in Figure 6, the recognition model training step 604 can perform normalization processing on the first image 011 or the second denoised image 111 and the second image 222 respectively; and then input the normalized sample into the action In the recognition model, the predicted probability distribution P is obtained; Focal Loss is then used as the loss function to calculate the error Loss between the predicted value and the real label; and the gradient descent method is used to iteratively optimize the model parameters until the Loss drops to the preset accuracy range.

As shown in Figure 8, the embodiment of the present invention also discloses an action recognition device 700, which includes a first data acquisition module 710 and a third action recognition module 730; wherein: the first data acquisition module 710 scans and acquires the first original signal. 001, convert the first original signal 001 into a two-dimensional array form and form a first image 011. The first image 011 is a distance-time image.

As shown in Figure 9, its third action recognition module 730 includes a first type feature processing module 771, a second type feature processing module 772, and a third type judgment output unit 773; among which, the first type feature processing module 771 extracts the first image 011 The first type of feature data in the time dimension 1111, the second type of feature processing The module 772 extracts the second type feature data 2222 in the distance and speed dimensions of the first image 011; the third category determination output unit 773 synthesizes the first type feature data 1111 and the second type feature data 2222 to obtain an action recognition data set; the third category The determination output unit 773 classifies and recognizes the action recognition data set to obtain third recognition result data 3333; wherein the action recognition data set is divided into at least three categories: a first valid data set, a second noise data set, and a third interference data set.

Further, as shown in Figure 2, the first data acquisition module 710 periodically acquires the first original signal 001, which is a fixed-length I/Q complex signal; the first image 011 is composed of the first original signal 001 Arranged in sequence, the two-dimensional array is in the form of MxN; where M and N are natural numbers; the third recognition result data 3333 includes switch quantities or signals used to trigger related mechanisms.

Further, as shown in Figure 8, the action recognition device also includes: a second data processing module 720; as shown in Figure 10, the second data processing module 720 includes a noise reduction processing module and an intermediate processing module; the intermediate processing module Complete the fast Fourier transform FFT and/or the short-time Fourier transform STFT; STFT adds a window function before the FFT, and the window function can be in the form of hanning; the noise reduction processing module processes the first image 011 to obtain the second reduction The denoised image 111; the second denoised image 111 replaces the first image 011 to participate in the processing of the third action recognition module 730; the intermediate processing module obtains the signal-to-noise ratio SNR of the second denoised image 111 and forms the second image 222 Used for the extraction of the second type of feature data 2222.

As shown in Figure 4, the first type of feature data 1111 is a first feature vector with a length of N, the second type of feature data 2222 is a second feature vector with a length of L, and the action recognition data set is a length of (N +L) third feature vector; perform 1x1 convolution on the third feature vector with the preset convolution kernel 3 to obtain the activation values of various types of data in the action recognition data set, normalize the activation values and obtain the action recognition data Set the probability distribution of the values of various types of data; among them, the category with the highest probability corresponds to the third recognition result data 3333; among them, the normalization method can be implemented by using softmax.

Further, the pulse repetition interval PRI of the first original signal 001 is fixed. The first original signal 001 may be the echo of an ultra-wideband UWB radar. The operating frequency of the radar is between 6.4 GHZ and 8 GHZ or the wavelength is between 3.75 CM and 4.69 CM. between CM.

As shown in Figure 8, the device 700 may also include a fourth control output module 740; wherein the fourth control output module 740 continuously obtains the third recognition result data 3333 R times, and R is greater than or equal to 2. Natural number; if the third recognition result data 3333 obtained R times are all first valid data, then the first original signal 001 is determined to be a valid signal; in addition, the second data processing module 720 may also include a noise reduction model training module 721, a recognition Model training module 722 and/or related model training module 72M; among them, the noise reduction model training module 721 is used for the optimization of noise reduction processing; the noise reduction model training module 721 obtains the normalized sample X- by normalizing the training sample X Normal, and superimpose random white noise on X-Normal to obtain the noise sample X-Noise; then construct the training sample pair <X-Normal, X-Noise> from the normalized sample X-Normal and the noise sample X-Noise; and input X -Noise to the autoencoder to obtain the decoded output Y; the noise reduction model training module 721 obtains the loss function Loss = MSE (Y, X-Normal) and iteratively optimizes the encoding and decoding parameters until the loss function Loss reaches the target value.

Among them, in the forward inference stage, the first image 011 can also be normalized and the normalized result can be input into the autoencoder; in addition, the second data processing module 720 can use the synthetic minority class oversampling method SMOTE to The training samples are amplified.

Further, the recognition model training module 722 can perform normalization processing on the first image 011 or the second denoised image 111 and the second image 222 respectively; and then input the normalized samples into the action recognition model to obtain predictions. Probability distribution P and use Focal Loss as the loss function to calculate the error Loss between the predicted value and the real label; then use the gradient descent method to iteratively optimize the model parameters until the Loss drops to the preset accuracy range.

The computer storage medium 903 shown in Figures 11 to 14 includes a storage medium body for storing a computer program; when the computer program is executed by a microprocessor, any action recognition method disclosed in the present invention can be implemented.

In addition, the sensor 905 shown in FIG. 14 can use any action recognition device and/or storage medium disclosed in the present invention; similarly, vehicles using the device or storage medium disclosed in the present invention also naturally fall into the scope of this invention. protection scope of the invention.

Specifically, embodiments of the present invention can use NXP's UWB radar chip to operate between 6.4GHz and 8GHz, and its wavelength range is between 3.75CM and 4.69CM.

The UWB radar periodically emits narrow pulse signals with a fixed PRI. If there is a target within the detection range, the received echo will carry target information.

This device embodiment consists of a UWB radar data acquisition module, a data processing module, and an action recognition module. It consists of blocks and output modules; its UWB radar data acquisition module is responsible for collecting the original I/Q signals received by the UWB radar, the data processing module is responsible for preprocessing the original I/Q signals such as noise reduction and FFT, and the action recognition module is responsible for preprocessing the UWB radar signals. The characteristics of human kicking movements are extracted from the signal to determine whether it is a human kicking movement. The output module is responsible for synthesizing multiple recognition results and outputting control signals.

Specifically, it may include several processes as shown in Figure 1:

Among them, the UWB radar data acquisition module receives the original I/Q signal of a fixed length each time, and arranges the original signal sequence into a two-dimensional array of MxN, which can also be expressed as a two-dimensional image, recorded as a range-time image Img_DT; further Ground, input Img_DT to the data processing module, perform denoising processing on Img_DT through the autoencoder, and obtain the denoised image Img_Denoise; further, perform STFT processing on Img_Denoise along the slow time dimension, and calculate the signal for the processed image. Noise ratio, the distance-velocity heat map Img_DVH is obtained; then Img_Denoise and Img_DVH are input into the action recognition module at the same time to obtain the recognition result of human kicking action; in addition, if the action recognition module recognizes human kicking action multiple times, output The module outputs the control signal for opening the tailgate.

As shown in Figure 2, the UWB radar acquisition module is implemented by dynamically maintaining a buffer queue: due to the existence of the buffer queue, the writing speed and reading speed of data may be inconsistent, that is, when the length of the data in the queue accumulates to exceed the length that needs to be read, The reading is only performed when the length is reached; in addition, an overlap area is set between two consecutive reads, which can reduce the probability that a continuous action is split into 2 frames of data. The overlap can be between [0,1) , a typical value is 0.5.

Further, after reading the data from the queue, the data is arranged into a two-dimensional array of MxN, where M corresponds to the fast time dimension and N corresponds to the slow time dimension. That is, each column is a sampling signal of a single pulse and is a one-dimensional distance. Like, there are N groups of pulse sampling signals; each element in the two-dimensional array is a complex signal of I+Q*j, and the signal is modulo, that is, A=SQR(I^2+Q^2), image Img_DT Each pixel value in is the corresponding module value.

Specifically, an autoencoder can be used to denoise the image Img_DT: its autoencoder can adopt an Encoder-Decoder structure, as shown in Figure 3; among them, the Encoder is used to encode the input image Img_DT through stacked convolution (Convolution ), activation (Activation) and maximum value pooling (MaxPooling) to extract effective features of the image and reduce the feature dimension; Decoder is used to use encoded low-dimensional features through stacked convolution (Convolution) and activation (Activation) and upsample (Upsample) to restore the image content; the output of the Decoder is the denoised image Img_Denoise, whose size is consistent with the input size of the Encoder.

Furthermore, the autoencoder needs to be trained before use. The training process includes:

Normalize the training sample X, that is, X_norm=(X-mean(X))/std(X); superimpose random white noise on the training sample ) distribution of random numbers, obtain the noise training sample X_noise, which forms the training sample pair <X_norm, X_noise> with (Y, into the autoencoder.

Specifically, due to the limited width of the image, direct FFT will cause frequency leakage problems; in order to reduce frequency leakage, short-time Fourier transform (STFT) can be used to first multiply the signal by a window function ( Such as hanning window), then do FFT (Fast Fourier Transformation).

Among them, the signal-to-noise ratio SNR=20log(|STFT(Img_Denoise)|); after calculation by this formula, the distance-velocity heat map Img_DVH can be obtained.

Furthermore, the action recognition module is an end-to-end recognition model that combines the convolutional neural network CNN (Convolutional Neural Network) and the recurrent neural network RNN (Recurrent Neural Network). As shown in Figure 4, the model has two branches. Branch 1 mainly extracts features in the time dimension, and branch 2 mainly extracts features in the distance and speed dimensions.

Specifically, the input of branch 1 is Img_Denoise. This branch uses 1D-CNN+LSTM, that is, the one-dimensional convolutional long short-term memory network LSTM (Long Short Term Memory) to extract object features: take a column of Img_Denoise as an object, and perform a Dimensional convolution, where the convolution kernel size is 1xk (k is typically 3), the number of convolution kernels is C, and the feature after 1D-CNN is an MxNxC tensor; the tensor is accumulated along the C direction to obtain MxN Large and small feature maps CNN_F; split CNN_F by columns into N vectors of length M, and input the N M-dimensional vectors into LSTM in turn (LSTM is a type of RNN), and N outputs can be obtained. That is, F_branch1 of branch 1 is a feature vector of length N.

Similarly, the input of branch 2 is Img_DVH. This branch uses a typical 2D-CNN to extract object features. It is internally stacked by a series of convolutional layers, batch normalization layers, activation layers, pooling layers, etc., after 2D -F_branch2 of feature branch 2 after CNN is a feature vector of length L.

Further, as shown in Figure 4, after each of the two branches calculates the feature vector, F_branch1 and F_branch2 are merged to obtain a feature vector with a length of (N+L); a 1x1 convolution is performed on the feature vector, and the number of convolution kernels is is 3, the activation values of each category are obtained; finally, the activation values of each category are normalized through softmax to obtain the probability distribution of the category, and the category with the highest probability is the output of the model.

The method of the present invention sets the number of categories to 3. One category is background noise, one category is human body kicks, and the other category is others (such as a cat crawling under the car, strong wind, etc.), that is, those that are easy to interact with human feet. Kick motion confuses the interference unified as a class.

Specifically, the model needs to be trained before use. The training process includes: normalizing the training samples Img_Denoise and Img_DVH respectively; inputting the normalized sample pairs into the model as shown in Figure 4 to obtain predictions Probability distribution P; use Focal Loss as the loss function to calculate the error Loss between the predicted value and the real label; then use the gradient descent method to iteratively optimize the model parameters until the Loss drops to no longer small.

Optionally, if the samples used for training are unbalanced, the model effect will be affected. To solve this problem, the synthetic minority oversampling technique SMOTE (Synthetic Minority Oversampling Technique) method can be used to amplify the training samples.

Furthermore, in order to improve the robustness of the system, a signal debounce strategy is added: its category signal needs to be a kicking action for R consecutive times (such as R=2) before the tailgate control signal will be output.

In addition, the denoising autoencoder shown in Figure 3, as long as it uses CNN as the Encoder or Decoder, should fall within the scope of this solution; the action recognition model shown in Figure 4, the LSTM in branch 1 can also be replaced Into other RNN models, such as gate recurrent unit GRU (Gate Recurrent Unit), etc.; the 2D-CNN model in branch 2, as long as it consists of a series of convolutional layers, batch normalization layers, activation layers, pooling layers, and full One or more components in the connection layer are stacked, and no matter how they are combined, they should all fall within the scope of the embodiments of the present invention.

It should be noted that the above embodiments are only to illustrate the technical solution of the present invention more clearly. Those skilled in the art can understand that the implementation of the present invention is not limited to the above content, and obvious changes, substitutions or substitutions based on the above content do not exceed the scope of the technical solution of the present invention; without departing from the concept of the present invention , other implementations will also fall within the scope of the present invention.

Claims

An action recognition method, characterized by including: a first data collection step (100) and a fourth action recognition step (400); wherein,

The first data acquisition step (100) scans and acquires a first original signal (001), converts the first original signal (001) into a two-dimensional array form and forms a first image (011), the first image (011) is the distance-time image;

The fourth action recognition step (400) includes: a first type feature processing step (410), a second type feature processing step (420), and a third type judgment output step (430); wherein,

The first type feature processing step (410): extract the first type feature data (1111) in the time dimension of the first image (011); the second type feature processing step (420): extract the first The second type of feature data (2222) in the distance and speed dimensions of the image (011); the third category determination output step (430): synthesize the first type of feature data (1111) and the second type of feature data ( 2222) and then obtain the action recognition data set;

Wherein, the third category determination output step (430) also includes: classifying and identifying the action recognition data set to obtain third recognition result data (3333); wherein the action recognition data set is divided into at least the first valid There are three categories: data set, second noise data set, and third interference data set.
The action recognition method of claim 1, wherein:

The first data acquisition step (100) periodically acquires the first original signal (001), which is a fixed-length I/Q complex signal;

The first image (011) is arranged in sequence from the first original signal (001), and the two-dimensional array is in the form of MxN; where M and N are natural numbers;

The third identification result data (3333) includes a switch value or signal used to trigger the relevant mechanism.
The action recognition method of claim 2, further comprising: a second preprocessing step (200) and a third intermediate processing step (300);

The second preprocessing step (200) includes a noise reduction processing step, and the third intermediate processing step (300) includes a fast Fourier transform step FFT and/or a short time Fourier transform STFT; wherein, the STFT Add a window function before FFT, and the window function includes a hanning window;

The first image (011) is processed by noise reduction to obtain a second noise-reduced image (111); the second noise-reduced image is The post-noise image (111) replaces the first image (011) to perform the fourth action recognition step (400);

The third intermediate processing step (300) obtains the signal-to-noise ratio SNR of the second denoised image (111), and forms a second image (222) for extraction of the second type of feature data (2222) .
The action recognition method of claim 3, wherein:

The first type of feature data (1111) is a first feature vector with a length of N, the second type of feature data (2222) is a second feature vector with a length of L, and the action recognition data set is a The third eigenvector of length (N+L);

Perform 1x1 convolution on the third feature vector with a preset convolution kernel to obtain activation values of various types of data in the action recognition data set, normalize the activation values and obtain various types of data in the action recognition data set. Probability distribution of data values; wherein the category with the highest probability corresponds to the third recognition result data (3333).
The action recognition method of claim 4, wherein:

The convolution kernel is 3; the normalization process is obtained by using the softmax method.
The action recognition method according to any one of claims 3, 4 or 5, wherein:

The pulse repetition interval PRI of the first original signal (001) is fixed, the first original signal (001) is the echo of an ultra-wideband UWB radar, and the operating frequency of the radar is between 6.4GHZ and 8GHZ or the The wavelength of radar waves is between 3.75CM and 4.69CM.
The action recognition method according to any one of claims 1, 2, 3, 4 or 5, further comprising:

Anti-shake output step (500), model training step (600); among them,

The anti-shake output step (500) continuously acquires the third recognition result data (3333) R times, where R is a natural number greater than or equal to 2; if the third recognition result data (3333) is acquired R times, are the first valid data, then it is determined that the first original signal (001) is a valid signal;

The model training step (600) includes a noise reduction model training step (602), a recognition model training step (604) and/or a related model training step (60M); wherein the noise reduction model training step (602) is used to Noise reduction processing step; the noise reduction model training step (602) normalizes the training sample X to obtain the normalized sample X-Normal, and superimposes random white noise on the X-Normal to obtain the noise sample X-Noise; and then The normalized sample X-Normal and the noise sample X-Noise are constructed Create a training sample pair <X-Normal, X-Noise>; input the X-Noise to the autoencoder to obtain the decoded output Y;

Obtain the loss function Loss=MSE(Y, The normalization process is described above and the normalized result is input into the autoencoder.
The action recognition method of claim 7, wherein:

Use the synthetic minority class oversampling method SMOTE to amplify the training samples;

The recognition model training step (604) performs normalization processing on the first image (011) or the second denoised image (111) and the second image (222) respectively; and then normalizes the The transformed samples are input into the action recognition model to obtain the predicted probability distribution P; Focal Loss is used as the loss function to calculate the error Loss between the predicted value and the real label; then the gradient descent method is used to iteratively optimize the model parameters until the Loss drops to the predetermined value. Set the accuracy range.
An action recognition device, including:

The first data collection module (710), the third action recognition module (730); among which:

The first data acquisition module (710) scans and acquires the first original signal (001), converts the first original signal (001) into a two-dimensional array form and forms a first image (011). One image (011) is a distance-time image;

The third action recognition module (730) includes: a first type feature processing module (771), a second type feature processing module (772), and a third type judgment output unit (773); wherein,

The first type of feature processing module (771): extracts the first type of feature data (1111) in the time dimension of the first image (011); the second type of feature processing module (772): extracts the first type of feature data (1111) The second type of feature data (2222) in the distance and speed dimensions of the image (011); the third category determination output unit (773): synthesizes the first type of feature data (1111) and the second type of feature data ( 2222) and then obtain the action recognition data set;

The third category judgment output unit (773) classifies and identifies the action recognition data set to obtain third recognition result data (3333); wherein the action recognition data set is divided into at least a first valid data set, a second valid data set, and a second effective data set. There are three categories: noise data set and third interference data set.
The action recognition device of claim 9, wherein:

The first data acquisition module (710) periodically acquires the first original signal (001), which is a fixed-length I/Q complex signal;

The first image (011) is arranged in sequence from the first original signal (001), and the two-dimensional array is in the form of MxN; where M and N are natural numbers;

The third identification result data (3333) includes a switch value or signal used to trigger the relevant mechanism.
The action recognition device of claim 10, further comprising: a second data processing module (720);

The second data processing module (720) includes a noise reduction processing module and an intermediate processing module; the intermediate processing module completes the fast Fourier transform FFT and/or the short time Fourier transform STFT; wherein the STFT is in the FFT Previously, a window function was added, and the window function includes a hanning window;

The noise reduction processing module processes the first image (011) to obtain a second noise-reduced image (111); the second noise-reduced image (111) replaces the first image (011) to participate in the third image (011). Processing of three action recognition modules (730);

The intermediate processing module obtains the signal-to-noise ratio SNR of the second denoised image (111), and forms a second image (222) for extraction of the second type of feature data (2222).
The action recognition device of claim 11, wherein:

The first type of feature data (1111) is a first feature vector with a length of N, the second type of feature data (2222) is a second feature vector with a length of L, and the action recognition data set is a The third eigenvector of length (N+L);

Perform 1x1 convolution on the third feature vector with a preset convolution kernel to obtain activation values of various types of data in the action recognition data set, normalize the activation values and obtain various types of data in the action recognition data set. Probability distribution of data values; wherein, the category with the highest probability corresponds to the third recognition result data (3333), the convolution kernel includes a natural number 3, and the normalization method includes softmax.
The action recognition device according to any one of claims 9, 10, 11 or 12, wherein:

The pulse repetition interval PRI of the first original signal (001) is fixed, the first original signal (001) is the echo of an ultra-wideband UWB radar, and the operating frequency of the radar is between 6.4GHZ and 8GHZ or the The wavelength of radar waves is between 3.75CM and 4.69CM.
The action recognition device according to any one of claims 9, 10, 11 or 12, further comprising:

The fourth control output module (740); wherein,

The fourth control output module (740) continuously acquires the third recognition result data (3333) R times, where R is a natural number greater than or equal to 2; if the third recognition result data (3333) acquired R times ) are the first valid data, then it is determined that the first original signal (001) is a valid signal;

The second data processing module (720) also includes a noise reduction model training module (721), a recognition model training module (722) and/or a related model training module (72M); wherein, the noise reduction model training module (721 ) is used for the optimization of noise reduction processing; the noise reduction model training module (721) normalizes the training sample X to obtain the normalized sample X-Normal, and superimposes random white noise on the X-Normal to obtain the noise sample X-Noise ; Then construct a training sample pair <X-Normal, X-Noise> from the normalized sample X-Normal and the noise sample X-Noise; input the X-Noise to the autoencoder to obtain the decoding output Y;

The noise reduction model training module (721) obtains the loss function Loss = MSE (Y, X-Normal) and iteratively optimizes the encoding and decoding parameters until the loss function Loss reaches the target value; wherein, in the forward inference stage, the The first image (011) performs the normalization process and inputs the normalized result into the autoencoder; the second data processing module (720) uses the synthetic minority class oversampling method SMOTE to Training samples are amplified;

The recognition model training module (722) performs normalization processing on the first image (011) or the second denoised image (111) and the second image (222) respectively; and then normalizes the The transformed samples are input into the action recognition model to obtain the predicted probability distribution P and use Focal Loss as the loss function to calculate the error Loss between the predicted value and the real label; then use the gradient descent method to iteratively optimize the model parameters until the Loss drops to the predetermined value. Set the accuracy range.
A computer storage medium consisting of:

The storage medium body used to store computer programs;

When executed by a microprocessor, the computer program implements any action recognition method as described in claims 1 to 8.
A sensor including:

The action recognition device (902) according to any one of claims 9 to 14;

and/or the storage medium (903) of claim 15.
A vehicle including:

The action recognition device (902) according to any one of claims 9 to 14;

And/or the storage medium (903) of claim 15;

and/or a sensor (905) as claimed in any one of claims 16 to 16.