CN113587935B - Indoor scene understanding method based on radio frequency signal multi-task learning network - Google Patents

Indoor scene understanding method based on radio frequency signal multi-task learning network

Info

Publication number
CN113587935B
CN113587935B (application CN202110891904.8A)
Authority
CN
China
Prior art keywords
layer
network
data
net
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110891904.8A
Other languages
Chinese (zh)
Other versions
CN113587935A (en)
Inventor
王林
王新雨
高畅
石中玉
张德安
厉斌斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yanshan University
Original Assignee
Yanshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yanshan University filed Critical Yanshan University
Priority to CN202110891904.8A priority Critical patent/CN113587935B/en
Publication of CN113587935A publication Critical patent/CN113587935A/en
Application granted granted Critical
Publication of CN113587935B publication Critical patent/CN113587935B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20Instruments for performing navigational calculations
    • G01C21/206Instruments for performing navigational calculations specially adapted for indoor navigation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/12Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Automation & Control Theory (AREA)
  • Medical Informatics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of behavior perception, and in particular to an indoor scene understanding method based on a radio frequency signal multi-task learning network, which comprises the following steps. Data acquisition: channel state information is collected using a wireless network card equipped with an Atheros chipset. Data preprocessing: the noise contained in the original signal is filtered out, the denoised multi-link data are synthesized, the data format is standardized, and the input data set of the neural network is constructed. Multi-task identification network: indoor scene understanding is achieved with a multi-task learning network Wisenet, which contains a shared representation layer together with a domain identification network Dom_Net, a location identification network Loc_Net and a behavior identification network Act_Net that exploit gradient information exchanged among the tasks through the shared representation layer. The method uses multi-task learning to identify the scene the user is in, including the domain, the position and the action, perceiving the user from multiple angles in order to understand the meaning of the user's behavior.

Description

Indoor scene understanding method based on radio frequency signal multi-task learning network
Technical Field
The invention relates to the technical field of behavior perception, in particular to an indoor scene understanding method based on a radio frequency signal multi-task learning network.
Background
When commercial WiFi is used for behavior perception, action semantics are often closely tied to the scene in which the action occurs, and single-action recognition cannot satisfy the need to understand action semantics in specific scenes. This patent designs and implements a scene understanding multi-task learning method based on channel state information. The method assigns different weights to signals from different sources by means of an attention mechanism, mines hidden information with a multi-task learning network, and offers strong cross-domain capability and scalability.
There is a large body of mature work on both WiFi-based behavior perception and indoor positioning. In an indoor home environment, however, a user's actions cannot be separated from the environment and position in which they occur. The same or similar actions may carry entirely different semantics in different environments. Take lying down as an example: a user lying on the bed in a bedroom is most likely asleep, whereas lying on the living-room floor may mean the user has fallen, is in shock, or worse. To avoid such misunderstandings in the home environment, it is important to distinguish the semantics of the same or similar actions. This is especially true when monitoring elderly people who live alone: knowing their position as well as judging the action that occurs allows their behavior to be understood correctly and unnecessary misunderstandings to be avoided. In an AR game, the same action performed at different locations may represent different operations of the game character; if the user's position and current area can be determined while the action is recognized, explicit semantics can be given to the user's behavior, and the scenes that AR can support become much richer. The environment and position of the user constrain the actions the user can perform; in other words, the actions a user performs reflect the environment and position the user is in, and the two should not be treated separately.
Existing contact-based sensing, such as wearable devices, is limited by battery capacity: the device stops working once the battery is exhausted, and frequent charging places a burden on users. Contactless sensing devices such as RFID, millimeter-wave radar and infrared sensors are expensive and are better suited to places with heavy foot traffic such as shopping malls, airports and stations. The ubiquity of WiFi in home environments frees it from such application constraints, and WiFi is low-cost and can be deployed at scale. Many key technologies for indoor behavior semantic understanding based on WiFi signals still need to be developed: accurate action semantic understanding requires not only identifying user behavior but also supporting information about the area and position the user occupies. No existing work fuses information from these three dimensions together.
Disclosure of Invention
To solve the above problems, the invention provides an indoor scene understanding method based on a radio frequency signal multi-task learning network, which uses multi-task learning to identify the scene the user is in, including the domain, the position and the action, perceiving the user from multiple angles in order to understand the meaning of the user's behavior.
In order to achieve the above purpose, the invention adopts the following technical scheme:
an indoor scene understanding method based on a radio frequency signal multi-task learning network comprises the following steps,
step 1, data acquisition: collecting channel state information by using a wireless network card equipped with an Atheros chipset;
step 2, data preprocessing: filtering noise contained in an original signal, synthesizing multilink data after denoising, standardizing a data format, and constructing an input data set of a neural network;
step 3, multi-task identification network: indoor scene understanding is achieved using a multi-task learning network Wisenet, where Wisenet contains a shared representation layer together with a domain identification network Dom_Net, a location identification network Loc_Net and a behavior identification network Act_Net that exploit gradient information exchanged among the tasks through the shared representation layer.
Preferably, the method is characterized in that:
in step 1, the data acquisition equipment comprises two computers, two routers carrying Atheros wireless network cards and network cables; the computers are connected with the routers through the network cables, the router system is accessed through a notebook computer to complete the mode, center frequency and packet-sending-rate parameter settings, and send-signal and receive-signal instructions are issued to the routers; the two routers control the sending and receiving of the CSI signals according to the command issued by the terminal, the command contains the destination address and the number of packets to send, each router has two pairs of transmit-receive antennas, the transmitting-end packet rate is 500 packets/second, the bandwidth is 20 MHz, and the center frequency is 2.4 GHz.
Preferably, in step 2, the denoising method is wavelet decomposition and reconstruction in wavelet transformation, single-scale wavelet transformation analysis is performed on the amplitude of the CSI by using db3 wavelet, and db3 wavelet coefficient decomposition and reconstruction are performed by randomly selecting one subcarrier data in the original signal, so as to complete noise filtering.
Preferably, in step 2, all link data of the two pairs of transmit-receive ends are synthesized into a data format of (2000, 56, 4), and the synthesized data together with its three corresponding labels, namely domain, position and action, generates a data set.
Preferably, in step 3, the domain identification network Dom_Net uses a convolution attention mechanism based on minimum pooling to give more weight to information with smaller amplitude values in order to distinguish different domains; the behavior recognition network Act_Net uses a convolution attention mechanism based on maximum pooling to give greater weight to information with larger amplitude values in order to distinguish different actions.
Preferably, the input data set obtained in step 2 is input into a convolution attention mechanism AM, which comprises a channel attention module and a spatial attention module.
Preferably, the input data set obtained in step 2 is passed through an ordinary convolution operation while the attention mechanism is added, and the channel attention module is expressed as:
$M_c(F)=\sigma(\mathrm{MLP}(\mathrm{AvgPool}(X))+\mathrm{MLP}(\mathrm{MinPool}(X)))$,
where X is the input data of the neural network, AvgPool and MinPool are the average pooling layer and the minimum pooling layer respectively, MLP is the shared layer, in which data dimension reduction and feature extraction are realized through convolution operations, and σ is the corresponding Sigmoid activation function; the channel attention module compresses the feature map in the spatial dimension and considers only the features within each channel, and while the convolution operation is carried out, the input feature map passes through the global average pooling layer and the global minimum pooling layer of the channel attention module respectively, where the average pooling layer has feedback for every feature point and is used to retain the background information in the feature map, whereas during gradient back-propagation only the feature points with the smallest responses on the feature map receive gradient feedback through the minimum pooling layer; the two feature maps from the average pooling layer and the minimum pooling layer are fed into the shared layer MLP for dimension reduction and feature extraction, compressing the spatial dimension of the feature maps, the outputs of the MLP are added and activated through a sigmoid function to obtain the channel attention matrix, and the element-wise product of this result and the convolved feature matrix gives the adjusted feature F';
the spatial attention module compresses the channel, expressed as:
$M_s(F)=\sigma(f^{\,n\times n}([\mathrm{AvgPool}(F');\mathrm{MinPool}(F')]))$,
where F' is the feature after the channel attention mechanism, $f^{\,n\times n}$ denotes a two-dimensional convolution with kernel dimension n, AvgPool extracts the average value over the channels and MinPool extracts the minimum value over the channels; the feature matrices extracted by the average pooling layer and the minimum pooling layer are concatenated, passed through the convolution layer and activated by sigmoid to obtain the spatial attention matrix (Spatial Attention), and the element-wise product of the spatial attention matrix and the adjusted feature F' gives:
$C_A=M_c(F)\cdot M_s(F)$,
$C_A$ is the result of adding the attention mechanism on top of the CNN; in the specific domain identification application, $C_A$ contains the background information in the acquired data that characterizes the domain the current user is in, and when the network comprises multiple layers, $C_A$ is iterated as input into the computation of the next layer.
Preferably, the shared representation layer comprises two convolution layers, and each convolution operation is followed by a batch normalization layer and a leaky rectified linear unit (Leaky ReLU) to avoid vanishing and exploding gradients.
Preferably, in step 3, with the Wisenet network structure, the calculation process from data input to output is as follows:
the original data set is $D=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}$, where $x_i$ passes through two hard-shared layers to obtain the shared-layer output $S_i$:
$S_i=\mathrm{LeakyReLU}\left(f\left(\sum_{i\in D}x_i*k_s^i+b_s^i\right)\right)$,
where $k$ is the corresponding convolution kernel parameter and $b$ is the bias; the convolved $x_i$ is activated by Leaky ReLU, and $k$ and $b$ are shared among the three tasks; during gradient updating, the gradient information of the shared parameters is back-propagated together with the task-specific gradient information;
to judge the domain the user is in, the network structure shown by Dom_Net is used; the shared-layer output $S_i$ is first convolved;
during training, changes in the data distribution of the intermediate layers can cause gradients to vanish or explode; to solve this problem and at the same time speed up training, a batch normalization (BN) layer is required; the BN output is activated by Leaky ReLU and then max-pooled to obtain the one-dimensional convolution result $F_{dom}$;
the minimum value on each channel is first extracted, i.e. the channel attention mechanism CA is added;
the channel information is then compressed, i.e. the spatial attention mechanism SA is added;
after the convolved data $x_i$ has gone through the above steps twice, the output is obtained through a linear fully connected layer,
where $W_{dom}$ and $b_{dom}$ are the iteratively updated weight matrix and bias matrix of the fully connected layer;
the index of the maximum value in each row is the output predicted by the network, and the corresponding loss function is $L_{dom}$;
similarly, $S_i$ is fed into Act_Net and passes through three convolution layers to obtain $F_{act}$; because signals with larger variation amplitude contain more of the user's behavior information, an attention mechanism consisting of an average pooling layer and a maximum pooling layer is added;
the output is then obtained through a linear fully connected layer,
with the corresponding loss function $L_{act}$ defined analogously;
by comparison, the network structure of Loc_Net is simpler, because a CNN is sensitive to spatial information and the position can be identified well without adding an attention mechanism; the output of the convolution layers in Loc_Net,
after the batch normalization layer and the activation function layer, gives $F_{loc}$;
$S_i$ passes through two convolution layers and finally through a fully connected layer;
likewise, the loss function of Loc_Net is $L_{loc}$;
since the shared layer is embedded in each sub-network, the loss back-propagated in each sub-network contains both task-specific gradient information and gradient information from the shared layer, i.e. the parameter set $\Theta$ contains the shared parameters $\theta_s$ and the task-specific parameters $\theta_i$; the optimization objective of Wisenet is to minimize the sum of the task losses,
where $L_i\in\{L_{dom},L_{act},L_{loc}\}$, and the objective function is minimized by updating the parameters;
the final output of Wisenet is the output of the three networks, corresponding respectively to the domain the user is in, the user's position within the current domain, and the action performed; from the domain and location information, the specific meaning of the action can be inferred.
The beneficial effects of using the invention are as follows:
the method of minimum pooling in the field of image recognition is less used mainly because in the representation RGB representation of an image, 000 represents black and the smaller the value, the more towards black. The picture information extracted by adopting the minimum pooling is background information with fewer characteristics, and the characteristics have no meaning. But in the field of signal processing, 0 is of practical significance. The range of signal amplitudes for signals reflected from different domains is different due to the different room locations and indoor furnishings. Taking this as a starting point, the different domains are distinguished based on the amplitude level when the space is relatively stationary. The use of a convolution attention mechanism based on minimization pooling may give more weight to information with smaller amplitude values. Thereby ignoring the effect of the information that the amplitude fluctuations are large. There is no way to multitasking concurrency in CSI-based behavior awareness techniques. Aiming at the problem of understanding the multitasking scenes, the hard sharing mechanism-based multitasking learning network structure wiset provided by the patent utilizes the sharing mechanism of the convolution layer to extract hidden information among subtasks, and provides possibility for recognition of actions across scenes and indoor positioning.
From the above, the advantages of the present invention are:
(1) The system resolves different meanings of the same action under different scenes and positions, and solves the problem that the traditional method can not realize semantic understanding of behaviors.
(2) The system takes all the receiving and transmitting terminal carrier signals as network input, defines an effective data splicing format, and more effectively utilizes indoor multipath information.
(3) The system provides a CSI-based multi-task scene understanding network Wisenet, with which behavior recognition and indoor positioning can be performed across multiple scenes without retraining the model.
Drawings
Fig. 1 is a flowchart of an indoor scene understanding method based on a radio frequency signal multi-task learning network.
Fig. 2 is a diagram of the actions performed by volunteers in the indoor scene understanding method based on the radio frequency signal multi-task learning network.
Fig. 3 is a schematic view of a hall scene of the indoor scene understanding method based on the radio frequency signal multi-task learning network.
Fig. 4 is an office scene diagram of an indoor scene understanding method based on a radio frequency signal multi-task learning network.
Fig. 5 is a schematic diagram of different-scale wavelet reconstruction signals of the indoor scene understanding method based on the radio frequency signal multi-task learning network.
Fig. 6 is a schematic diagram of indoor scene understanding method data set construction based on a radio frequency signal multi-task learning network.
Fig. 7 is a diagram of a wisnet network structure of an indoor scene understanding method based on a radio frequency signal multi-task learning network.
Fig. 8 is a diagram illustrating an indoor scene understanding method dom_net attention mechanism structure based on a radio frequency signal multi-task learning network according to the present invention.
Fig. 9 is a schematic diagram of accuracy and loss values of a sub-network in the training process of the indoor scene understanding method based on the radio frequency signal multi-task learning network.
Fig. 10 is a schematic diagram of a wisnet confusion matrix of the indoor scene understanding method based on the radio frequency signal multi-task learning network.
Fig. 11 is a schematic diagram of indoor scene understanding method wisnet performance evaluation based on a radio frequency signal multi-task learning network.
Fig. 12 is a comparison chart of training accuracy of different structures of the act_net in the indoor scene understanding method based on the radio frequency signal multi-task learning network.
Fig. 13 is a comparison chart of various indexes of the indoor scene understanding method act_net based on the radio frequency signal multi-task learning network in different networks.
Detailed Description
In order to make the objects, technical solutions and advantages of the present technical solution more apparent, the present technical solution is further described in detail below in conjunction with the specific embodiments. It should be understood that the description is only illustrative and is not intended to limit the scope of the present technical solution.
As shown in figs. 1-8, this embodiment proposes an indoor scene understanding method based on a radio frequency signal multi-task learning network, focusing on a key technology in the field of intelligent perception, the semantic understanding of cross-domain actions, and proposes the indoor-wireless-signal scene understanding system architecture Wi-SeSys shown in fig. 1. Wi-SeSys contains three parts: data acquisition, data preprocessing, and a multi-task identification network. First, channel state information (Channel State Information, CSI) is acquired using a wireless network card equipped with an Atheros chipset. Then the noise contained in the original signal is filtered out, the denoised multi-link data are synthesized, the data format is standardized, and the input data set of the neural network is constructed. Finally, indoor scene understanding is realized using the multi-task learning network Wisenet, where Wisenet comprises a shared representation layer, a domain identification network Dom_Net, a location identification network Loc_Net and a behavior identification network Act_Net.
Data acquisition
The equipment used to collect the experimental data comprises two notebook computers, two routers carrying Atheros wireless network cards and two 5-meter network cables. Each computer is connected to a router through a network cable, and the router system can be accessed through the notebook computer to complete parameter settings such as mode, center frequency and packet-sending rate and to issue the send-signal and receive-signal instructions to the router. The two routers control the sending and receiving of CSI signals according to the command issued by the terminal; the command contains the destination address and the number of packets. Each router has two pairs of transmit-receive antennas, the transmitting end sends 500 packets/second, the bandwidth is 20 MHz, and the center frequency is 2.4 GHz.
In the experimental setup, the actions performed by the volunteers are shown in fig. 2 and include actions common in daily life such as squatting, bending down, walking and raising hands. The volunteers perform each action 10 times at each position in the domain, with a sampling time of about 4.5 seconds per action. Each collected sample consists of 2300 CSI packets.
The scene shown in fig. 3 is an open hall of a teaching building; a few tables are arranged around the perimeter and the hall has many windows. The routers are 85 cm above the ground, each position fingerprint block measures 1.2 × 1.2 meters, and each domain contains 9 positions, numbered 1-9. The domain covers about 13 square meters. During CSI collection pedestrians passed through, which brought some interference to the effective signal. Fig. 4 shows a conference room, in which tables and chairs are closely arranged and the wall area is large. Its space is larger and its surroundings more complex than the hall scene shown in fig. 3. After the signal is emitted it is reflected many more times by static objects such as tables, chairs and walls, so the collected CSI signal contains more uncertain factors.
Data preprocessing
When the Atheros network card is used to collect CSI, the signal is reflected, diffracted and scattered by furniture, human bodies and other objects on its way from the transmitter to the receiver. During this process the devices themselves may vibrate, and other devices transmitting wireless signals in the home environment can also interfere with CSI propagation. This results in packet loss, delay and noise during transmission between the two ends, which can easily drown out the effective signal. The data therefore need to be denoised before effective features are extracted from the CSI signal. The denoising method used here is wavelet decomposition and reconstruction within the wavelet transform: a db3 wavelet is used to perform single-scale wavelet transform analysis on the amplitude of the CSI. One subcarrier of the original signal is chosen at random for db3 wavelet coefficient decomposition and reconstruction, giving the result shown in fig. 5. As the reconstruction scale increases, the signal becomes smoother. With scale-6 reconstruction more of the relatively high-frequency signal is lost and part of the signal no longer matches the original, so a5-scale reconstruction is selected.
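As a non-limiting illustration, this denoising step could be sketched in Python with PyWavelets as follows. The level-5 approximation-only reconstruction follows the a5 choice described above; the function name wavelet_denoise and the treatment of the input as a one-dimensional NumPy array of one subcarrier's amplitude are assumptions for illustration only.

```python
import numpy as np
import pywt

def wavelet_denoise(amplitude, wavelet="db3", level=5):
    """Denoise one subcarrier's CSI amplitude by db3 decomposition and
    approximation-only (a5) reconstruction, as described above."""
    # Decompose into [cA5, cD5, cD4, cD3, cD2, cD1]
    coeffs = pywt.wavedec(amplitude, wavelet, level=level)
    # Keep only the level-5 approximation; zero every detail band
    coeffs = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
    smoothed = pywt.waverec(coeffs, wavelet)
    return smoothed[: len(amplitude)]  # waverec may pad by one sample

# Hypothetical usage on a randomly chosen subcarrier of the raw CSI amplitude
# csi_amplitude: array of shape (n_packets, n_subcarriers)
# denoised = wavelet_denoise(csi_amplitude[:, 20])
```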
During data set construction it was observed that even when the same volunteer performs the same action at the same position in the same domain, the signals acquired by different transceiver devices differ, as shown in fig. 5. Even for the same receiver, the amplitude intervals and data change patterns coming from different transmitters differ. Different transceiver links form different viewing angles on the human body's changes in space, and common sense says that the richer the viewing angles, the more comprehensive and faithful the observed change. To make better use of the data redundancy caused by multipath while meeting the input requirements of the neural network, all link data of the two pairs of transmit-receive ends are synthesized into the data format (2000, 56, 4). The data collected by the two pairs of transceiver devices are concatenated and spliced longitudinally to obtain this (2000, 56, 4) format, and the spliced data together with its three corresponding labels (domain, location, action) generates the data set; the data format is shown in fig. 6.
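A minimal sketch of this assembly is given below, assuming each of the four transmit-receive links provides a denoised amplitude matrix of at least 2000 packets by 56 subcarriers; the function names and the simple trimming to 2000 packets are assumptions, since the patent specifies only the final (2000, 56, 4) shape and the three labels.

```python
import numpy as np

def build_sample(link_amplitudes, n_packets=2000):
    """Stack the denoised amplitudes of the four transmit-receive links into
    one (2000, 56, 4) sample: packets x subcarriers x links."""
    # link_amplitudes: list of 4 arrays, each (>= n_packets, 56)
    trimmed = [a[:n_packets, :] for a in link_amplitudes]
    return np.stack(trimmed, axis=-1)                     # -> (2000, 56, 4)

def build_dataset(samples, domains, locations, actions):
    """Pair each stacked sample with its three labels (domain, location, action)."""
    x = np.stack(samples)                                 # (N, 2000, 56, 4)
    y = np.stack([domains, locations, actions], axis=1)   # (N, 3)
    return x, y
```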
Multitasking identification network
Unlike a single-task learning network, the data set of a multi-task learning network contains information in three dimensions of domain, location and action at the same time. Three kinds of information are read and processed simultaneously by using a multi-task learning method, and hidden information among tasks can be fully mined. The process is mainly completed by a parameter sharing mechanism, and a sharing layer can synthesize gradient information among a plurality of tasks and update the multiple tasks synchronously. The scene understanding neural network multitasking learning architecture employed herein is shown in fig. 7.
In the attention mechanism, Dom_Net distinguishes different domains by the amplitude level when the space is relatively static. Using a convolution attention mechanism based on minimum pooling gives more weight to information with smaller amplitude values, thereby suppressing the influence of information with large amplitude fluctuations, whereas Act_Net adds a maximum-pooling-based attention mechanism so that information with larger amplitude dominates. Different networks add different attention mechanisms in order to focus on different signals. The attention module employed by Dom_Net is shown in fig. 8.
The convolution attention mechanism AM mainly consists of two parts: a channel attention module and a spatial attention module. Each channel of the feature represents a special detector, and the channel attention module compresses the feature matrix in the space dimension and extracts the feature information to be focused on from each channel. The spatial attention mechanism compresses the channels, and integrates the extracted features of each channel from the feature dimension of the whole data.
The input data is normally convolved while the attention mechanism is added. The channel attention module is expressed as:
$M_c(F)=\sigma(\mathrm{MLP}(\mathrm{AvgPool}(X))+\mathrm{MLP}(\mathrm{MinPool}(X)))$,
wherein X is input data of the neural network, avgPool and MinPool are an average pooling layer and a minimum pooling layer respectively, MLP is a sharing layer, data dimension reduction and feature extraction are realized through convolution operation in the sharing layer, σ is a corresponding activation function, and Sigmoid activation function is used here.
The channel attention module compresses the feature map in the spatial dimension and considers only the features inside each channel. While the convolution is carried out, the input feature map passes through the global average pooling layer and the global minimum pooling layer of the channel attention module respectively. The average pooling layer has feedback for every feature point and is used to retain the background information in the feature map; during gradient back-propagation, only the feature points with the smallest responses on the feature map receive gradient feedback through the minimum pooling layer, so minimum pooling can be used to select the features whose changes are least pronounced. The two feature maps produced by the average pooling layer and the minimum pooling layer are fed into the shared layer MLP for dimension reduction and feature extraction, compressing the spatial dimension of the feature maps. After the outputs of the MLP are added and activated by a sigmoid function, the channel attention matrix (Channel Attention) is obtained, and the element-wise product of this result and the convolved feature matrix gives the adjusted feature F'.
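The channel attention described above could be sketched in PyTorch as follows; since the patent does not give layer sizes, the reduction ratio and the use of 1x1 convolutions as the shared MLP are assumptions, and minimum pooling is obtained as the negative of max pooling applied to the negated input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionMin(nn.Module):
    """Channel attention M_c(F) = sigma(MLP(AvgPool(X)) + MLP(MinPool(X)))."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        # Shared MLP as two 1x1 convolutions: dimension reduction then restoration
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):                      # x: (N, C, H, W)
        avg = F.adaptive_avg_pool2d(x, 1)      # global average pooling
        mn = -F.adaptive_max_pool2d(-x, 1)     # global minimum pooling
        attn = torch.sigmoid(self.mlp(avg) + self.mlp(mn))  # (N, C, 1, 1)
        return x * attn                        # element-wise product -> adjusted F'
```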
The spatial attention module compresses the channels and comprehensively considers the relation among the channels. The spatial attention module is expressed as:
$M_s(F)=\sigma(f^{\,n\times n}([\mathrm{AvgPool}(F');\mathrm{MinPool}(F')]))$,
where F' is the feature after the channel attention mechanism, F corresponds to the two-dimensional convolution operation, and n is the dimension of the convolution kernel.
AvgPool is used to extract the average value over the channels and MinPool the minimum value over the channels. The feature matrices extracted by the average pooling layer and the minimum pooling layer are concatenated, passed through the convolution layer and activated by sigmoid to obtain the spatial attention matrix (Spatial Attention), and the element-wise product of the spatial attention matrix and the adjusted feature F' gives:
$C_A=M_c(F)\cdot M_s(F)$,
where $C_A$ is the result of adding the attention mechanism on top of the CNN; in the specific domain identification application, $C_A$ contains the background information in the acquired data that characterizes the domain the current user is in. When the network comprises multiple layers, $C_A$ is iterated as input into the computation of the next layer.
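A matching sketch of the spatial attention module and of the combined attention mechanism AM follows; the 7x7 kernel size is an assumption (n is not specified in the patent), and ChannelAttentionMin refers to the sketch above.

```python
import torch
import torch.nn as nn

class SpatialAttentionMin(nn.Module):
    """Spatial attention M_s(F) = sigma(f^{n x n}([AvgPool(F'); MinPool(F')]))."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                           # x: adjusted feature F'
        avg = x.mean(dim=1, keepdim=True)           # average over the channels
        mn = x.min(dim=1, keepdim=True).values      # minimum over the channels
        attn = torch.sigmoid(self.conv(torch.cat([avg, mn], dim=1)))
        return x * attn                             # element-wise product -> C_A

class AttentionAM(nn.Module):
    """Convolution attention mechanism AM: channel attention then spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.channel = ChannelAttentionMin(channels)  # from the previous sketch
        self.spatial = SpatialAttentionMin()

    def forward(self, x):
        return self.spatial(self.channel(x))
```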
The structure of the attention mechanism in Act_Net is similar to that of fig. 8, with minimum pooling replaced by maximum pooling.
Wisenet contains a shared representation layer, a domain identification network Dom_Net, a location identification network Loc_Net and an action identification network Act_Net. The shared representation layer comprises two convolution layers, and each convolution operation is followed by a batch normalization layer and a leaky rectified linear unit (Leaky ReLU) to avoid vanishing and exploding gradients. The network structure of the three subtasks is shown in fig. 7, and the calculation process from data input to output is as follows:
The original data set is $D=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}$, where $x_i$ passes through two hard-shared layers to obtain the shared-layer output $S_i$:
$S_i=\mathrm{LeakyReLU}\left(f\left(\sum_{i\in D}x_i*k_s^i+b_s^i\right)\right)$,
where $k$ is the corresponding convolution kernel parameter and $b$ is the bias. The convolved $x_i$ is activated by Leaky ReLU; $k$ and $b$ are shared among the three tasks. During gradient updating, the gradient information of the shared parameters is back-propagated together with the task-specific gradient information.
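A sketch of the two hard-shared convolution layers (each followed by batch normalization and Leaky ReLU, as stated above) might look like the following; the channel counts, kernel sizes and the Leaky ReLU slope are assumptions, and the (2000, 56, 4) sample is treated as a 4-channel image after permuting its axes.

```python
import torch.nn as nn

class SharedRepresentation(nn.Module):
    """Two hard-shared convolution layers: Conv -> BN -> LeakyReLU, twice.
    Their parameters k and b are shared by Dom_Net, Loc_Net and Act_Net."""
    def __init__(self, in_channels=4, hidden=16, out_channels=32):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.BatchNorm2d(hidden),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(hidden, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):          # x: (N, 4, 2000, 56) after permuting the sample
        return self.layers(x)      # shared-layer output S_i
```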
To determine the domain the user is in, the network structure shown as Dom_Net in fig. 8 is used. The shared-layer output $S_i$ is first convolved.
During training, changes in the data distribution of the intermediate layers can cause gradients to vanish or explode. To solve this problem and at the same time speed up training, a batch normalization (BN) layer is required. The BN output is activated by Leaky ReLU and then max-pooled to obtain the one-dimensional convolution result $F_{dom}$.
The minimum value on each channel is first extracted, i.e. the channel attention mechanism CA is added.
The channel information is then compressed, i.e. the spatial attention mechanism SA is added.
After the convolved data $x_i$ has gone through the above steps twice, the output is obtained through a linear fully connected layer,
where $W_{dom}$ and $b_{dom}$ are the iteratively updated weight matrix and bias matrix of the fully connected layer.
The index of the maximum value in each row is the output predicted by the network, and the corresponding loss function is $L_{dom}$.
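Under the same assumptions, the Dom_Net head could be sketched as below: convolution, batch normalization, Leaky ReLU and max pooling, followed by the min-pooling attention AM, applied twice, and a linear layer producing the domain logits. The number of domains, all layer sizes, and the global pooling before the fully connected layer are assumptions; AttentionAM refers to the earlier sketch.

```python
import torch.nn as nn

class DomNet(nn.Module):
    """Domain head sketch: (Conv -> BN -> LeakyReLU -> MaxPool -> AM) x 2 -> FC."""
    def __init__(self, in_channels=32, n_classes=2):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.LeakyReLU(0.1, inplace=True),
                nn.MaxPool2d(2),
                AttentionAM(c_out),             # min-pooling channel + spatial attention
            )
        self.blocks = nn.Sequential(block(in_channels, 32), block(32, 64))
        self.pool = nn.AdaptiveMaxPool2d(1)     # collapse the spatial dimensions (assumed)
        self.fc = nn.Linear(64, n_classes)      # W_dom, b_dom

    def forward(self, s_i):                     # s_i: shared-layer output (N, 32, H, W)
        f = self.pool(self.blocks(s_i)).flatten(1)
        return self.fc(f)                       # argmax per row is the predicted domain
```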
Similarly, $S_i$ is fed into Act_Net and passes through three convolution layers to obtain $F_{act}$. Because signals with larger variation amplitude contain more of the user's behavior information, an attention mechanism consisting of an average pooling layer and a maximum pooling layer is added.
The output is then obtained through a linear fully connected layer.
The corresponding loss function $L_{act}$ is defined analogously.
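Act_Net differs mainly in pairing average pooling with maximum pooling inside the attention modules; a sketch of that variant of the channel attention is given below, under the same assumptions as the earlier attention sketches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionMax(nn.Module):
    """Act_Net variant: M_c(F) = sigma(MLP(AvgPool(X)) + MLP(MaxPool(X)))."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = F.adaptive_avg_pool2d(x, 1)
        mx = F.adaptive_max_pool2d(x, 1)   # maximum pooling favours large-amplitude changes
        return x * torch.sigmoid(self.mlp(avg) + self.mlp(mx))
```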
the network structure of loc_net is relatively simple, because CNN convolutional neural network is sensitive to spatial information, so that the position can be well identified without adding attention mechanisms. The output after passing through the convolutional layer after Loc_Net is
After the batch normalization layer and the activation function layer, F is obtained loc :
S i Through two-layer convolution and finally through a full connection layer
Likewise, the loss function of the final loc_net is:
Since the shared layer is embedded in each sub-network, the loss back-propagated in each sub-network contains both task-specific gradient information and gradient information from the shared layer, i.e. the parameter set $\Theta$ contains the shared parameters $\theta_s$ and the task-specific parameters $\theta_i$. The optimization objective of Wisenet is to minimize the sum of the task losses,
where $L_i\in\{L_{dom},L_{act},L_{loc}\}$, and the objective function is minimized by updating the parameters.
The final output of Wisenet is the output of the three networks, corresponding respectively to the domain the user is in, the user's position within the current domain, and the action performed. From the domain and location information, the specific meaning of the action can be inferred.
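The joint optimization of the three losses over the shared and task-specific parameters could be sketched as follows; equal loss weights, the Adam optimizer and the class counts are assumptions, and the Loc_Net and Act_Net heads are stood in for by the DomNet sketch above (the real heads would respectively drop the attention modules and use the max-pooling attention).

```python
import torch
import torch.nn as nn

shared = SharedRepresentation()
heads = {
    "dom": DomNet(n_classes=2),   # domain head (min-pooling attention, as sketched)
    "loc": DomNet(n_classes=9),   # placeholder for a Loc_Net head (no attention in the patent)
    "act": DomNet(n_classes=7),   # placeholder for an Act_Net head (max-pooling attention)
}
params = list(shared.parameters()) + [p for h in heads.values() for p in h.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(x, labels):
    """One Wisenet update: the summed loss L_dom + L_act + L_loc back-propagates
    task-specific gradients and shared-layer gradients together."""
    s_i = shared(x)                               # x: (N, 4, 2000, 56)
    loss = sum(criterion(head(s_i), labels[name]) for name, head in heads.items())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```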
Example 1
In this embodiment, the accuracy and system robustness of the above method are verified.
Recognition accuracy
Training uses the data sets from the two domains. The actions contained in each domain's data set are not exactly the same, and actions not present in a domain are grouped into a single additional class. The accuracy and loss changes during training are shown in fig. 9.
After adding the shared layer, the accuracy gradually increases while the loss gradually decreases as the number of training rounds increases. After 200 rounds of training, the accuracy of the three tasks reaches more than 95%, and the loss is reduced to below 0.1 on average.
Wisenet is trained using the data sets under the two domains. The confusion matrix of Wisenet on the test set is shown in fig. 10.
As can be seen from fig. 10 a) and b), the accuracy of every category of Act_Net exceeds 80%, and the accuracy of Loc_Net exceeds 95%.
The other evaluation indices on the test set, recall, precision and macro-F1, are shown in fig. 11.
As can be seen from fig. 11, Dom_Net and Loc_Net perform best, with every index at 95% or above. Act_Net is harder to classify, because the action features vary considerably across changing domains and locations. Even so, its precision, recall and macro-F1 reach 83%. Overall, when performing action and location recognition under multiple domains, adding the hard sharing mechanism significantly improves the performance of the model.
Correctly classifying each of Wisenet's subtasks is a necessary condition for scene understanding: in the scene understanding task, the action semantics can be resolved correctly only when the domain, the position and the action are all classified correctly. To evaluate the classification performance of Wisenet, a test is performed on the test set. The test indices are detailed in Table 1 below.
Table 1. Wisenet test results
where √ denotes a correct classification and × an incorrect one.
Of the 1888 test samples, 1553 are TTT, a proportion of 82.3%. Among the remaining 335 misclassified samples, 291 are TTF. This shows that, given Loc_Net and Dom_Net classify correctly, an Act_Net classification error accounts for 87% of the overall classification errors. The sum of TTF, TFF, FFF and FTF is 300, of which TTF is 291; that is, when Act_Net misclassifies, Loc_Net and Dom_Net are both correct 97% of the time. The sum of TTF and TTT is 1844, a proportion of 97.6%, so Loc_Net and Dom_Net classify the great majority of the data correctly and have little influence on the overall classification. This analysis shows that Wisenet exhibits a weakest-link effect: its overall classification performance is determined by the subtask network Act_Net. Therefore, when improving Wisenet with different structures and parameters, attention should focus on the classification performance of Act_Net.
System robustness
To observe the role of the attention mechanism, the following comparative experiments were carried out for different network structures, named Act_o_Dom_o, Act_o_Dom_w, Act_w_Dom_o and Act_w_Dom_w according to whether the attention mechanisms are added. Fig. 12 shows the accuracy of Act_Net over 100 training rounds on the same data set under the four network structures. It is clear that the network without any attention mechanism performs worst, reaching only about 80% accuracy, while the networks with attention mechanisms perform better; the structure with both attention mechanisms added, Act_w_Dom_w, i.e. Wisenet, performs best.
Fig. 13 shows the accuracy of Wisenet semantic recognition under the four different network structures. It can be seen that the action semantic recognition accuracy improves markedly once attention is added to both Act_Net and Dom_Net.
The foregoing is merely exemplary of the present invention, and those skilled in the art can make many variations in the specific embodiments and application scope according to the spirit of the present invention, as long as the variations do not depart from the spirit of the invention.

Claims (4)

1. An indoor scene understanding method based on a radio frequency signal multi-task learning network, characterized by comprising the following steps:
step 1, data acquisition: collecting channel state information by using a wireless network card equipped with an Atheros chipset;
step 2, data preprocessing: filtering noise contained in an original signal, synthesizing multilink data after denoising, standardizing a data format, and constructing an input data set of a neural network;
step 3, multi-task identification network: indoor scene understanding is achieved by using a multi-task learning network Wisenet, wherein Wisenet comprises a shared representation layer, and a domain identification network Dom_Net, a location identification network Loc_Net and a behavior identification network Act_Net which use gradient information exchanged among the tasks through the shared representation layer;
in step 3, the domain identification network Dom_Net uses a convolution attention mechanism based on minimum pooling to give more weight to information with smaller amplitude values in order to distinguish different domains; the behavior recognition network Act_Net uses a convolution attention mechanism based on maximum pooling to give greater weight to information with larger amplitude values in order to distinguish different actions;
inputting the input data set obtained in step 2 into a convolution attention mechanism AM, wherein the convolution attention mechanism AM comprises a channel attention module and a spatial attention module;
inputting the input data set obtained in step 2 into an ordinary convolution operation while adding the attention mechanism, wherein the channel attention module is expressed as:
$M_c(F)=\sigma(\mathrm{MLP}(\mathrm{AvgPool}(X))+\mathrm{MLP}(\mathrm{MinPool}(X)))$,
where X is the input data of the neural network, AvgPool and MinPool are the average pooling layer and the minimum pooling layer respectively, MLP is the shared layer, in which data dimension reduction and feature extraction are realized through convolution operations, and σ is the corresponding Sigmoid activation function; the channel attention module compresses the feature map in the spatial dimension and considers only the features within each channel, and while the convolution operation is carried out, the input feature map passes through the global average pooling layer and the global minimum pooling layer of the channel attention module respectively, where the average pooling layer has feedback for every feature point and is used to retain the background information in the feature map, whereas during gradient back-propagation only the feature points with the smallest responses on the feature map receive gradient feedback through the minimum pooling layer; the two feature maps from the average pooling layer and the minimum pooling layer are fed into the shared layer MLP for dimension reduction and feature extraction, compressing the spatial dimension of the feature maps, the outputs of the MLP are added and activated through a sigmoid function to obtain the channel attention matrix, and the element-wise product of this result and the convolved feature matrix gives the adjusted feature F';
the spatial attention module compresses the channel, expressed as:
$M_s(F)=\sigma(f^{\,n\times n}([\mathrm{AvgPool}(F');\mathrm{MinPool}(F')]))$,
where F' is the feature after the channel attention mechanism, $f^{\,n\times n}$ denotes a two-dimensional convolution with kernel dimension n, AvgPool extracts the average value over the channels and MinPool extracts the minimum value over the channels; the feature matrices extracted by the average pooling layer and the minimum pooling layer are concatenated, passed through the convolution layer and activated by sigmoid to obtain the spatial attention matrix (Spatial Attention), and the element-wise product of the spatial attention matrix and the adjusted feature F' gives:
$C_A=M_c(F)\cdot M_s(F)$,
$C_A$ is the result of adding the attention mechanism on top of the CNN; in the specific domain identification application, $C_A$ contains the background information in the acquired data that characterizes the domain the current user is in, and when the network comprises multiple layers, $C_A$ is iterated as input into the computation of the next layer;
the shared representation layer comprises two convolution layers, and each convolution operation is followed by a batch normalization layer and a leaky rectified linear unit (Leaky ReLU) to avoid vanishing and exploding gradients;
in step 3, with the Wisenet network structure, the calculation process from data input to output is as follows:
the original data set is $D=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}$, where $x_i$ passes through two hard-shared layers to obtain the shared-layer output $S_i$:
$S_i=\mathrm{LeakyReLU}\left(f\left(\sum_{i\in D}x_i*k_s^i+b_s^i\right)\right)$,
where $k$ is the corresponding convolution kernel parameter and $b$ is the bias; the convolved $x_i$ is activated by Leaky ReLU, and $k$ and $b$ are shared among the three tasks; during gradient updating, the gradient information of the shared parameters is back-propagated together with the task-specific gradient information;
to judge the domain the user is in, the network structure shown by Dom_Net is used; the shared-layer output $S_i$ is first convolved;
during training, changes in the data distribution of the intermediate layers can cause gradients to vanish or explode; to solve this problem and at the same time speed up training, a batch normalization (BN) layer is required; the BN output is activated by Leaky ReLU and then max-pooled to obtain the one-dimensional convolution result $F_{dom}$;
the minimum value on each channel is first extracted, i.e. the channel attention mechanism CA is added;
the channel information is then compressed, i.e. the spatial attention mechanism SA is added;
after the convolved data $x_i$ has gone through the above steps twice, the output is obtained through a linear fully connected layer,
where $W_{dom}$ and $b_{dom}$ are the iteratively updated weight matrix and bias matrix of the fully connected layer;
the index of the maximum value in each row is the output predicted by the network, and the corresponding loss function is $L_{dom}$;
similarly, $S_i$ is fed into Act_Net and passes through three convolution layers to obtain $F_{act}$; because signals with larger variation amplitude contain more of the user's behavior information, an attention mechanism consisting of an average pooling layer and a maximum pooling layer is added;
the output is then obtained through a linear fully connected layer,
with the corresponding loss function $L_{act}$ defined analogously;
by comparison, the network structure of Loc_Net is simpler, because a CNN is sensitive to spatial information and the position can be identified well without adding an attention mechanism; the output of the convolution layers in Loc_Net,
after the batch normalization layer and the activation function layer, gives $F_{loc}$;
$S_i$ passes through two convolution layers and finally through a fully connected layer;
likewise, the loss function of Loc_Net is $L_{loc}$;
since the shared layer is embedded in each sub-network, the loss back-propagated in each sub-network contains both task-specific gradient information and gradient information from the shared layer, i.e. the parameter set $\Theta$ contains the shared parameters $\theta_s$ and the task-specific parameters $\theta_i$; the optimization objective of Wisenet is to minimize the sum of the task losses,
where $L_i\in\{L_{dom},L_{act},L_{loc}\}$, and the objective function is minimized by updating the parameters;
the final output of Wisenet is the output of the three networks, corresponding respectively to the domain the user is in, the user's position within the current domain, and the action performed; the specific meaning of the action is inferred from the domain and location information.
2. The indoor scene understanding method based on radio frequency signal multi-task learning network of claim 1, wherein the indoor scene understanding method comprises the following steps:
in step 1, the data acquisition equipment comprises two computers, two routers carrying Atheros wireless network cards and network cables; the computers are connected with the routers through the network cables, the router system is accessed through a notebook computer to complete the mode, center frequency and packet-sending-rate parameter settings, and send-signal and receive-signal instructions are issued to the routers; the two routers control the sending and receiving of the CSI signals according to the command issued by the terminal, the command contains the destination address and the number of packets to send, each router has two pairs of transmit-receive antennas, the transmitting-end packet rate is 500 packets/second, the bandwidth is 20 MHz, and the center frequency is 2.4 GHz.
3. The indoor scene understanding method based on radio frequency signal multi-task learning network of claim 1, wherein the indoor scene understanding method comprises the following steps: in step 2, the denoising method is wavelet decomposition and reconstruction in wavelet transformation, single-scale wavelet transformation analysis is performed on the amplitude of the CSI by using db3 wavelet, and db3 wavelet coefficient decomposition and reconstruction are performed by randomly selecting one subcarrier data in the original signal, so as to complete noise filtering.
4. The indoor scene understanding method based on the radio frequency signal multi-task learning network according to claim 2, characterized in that: in step 2, all link data of the two pairs of transmit-receive ends are synthesized into a data format of (2000, 56, 4), and the synthesized data together with its three corresponding labels, namely domain, position and action, generates a data set.
CN202110891904.8A 2021-08-04 2021-08-04 Indoor scene understanding method based on radio frequency signal multi-task learning network Active CN113587935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110891904.8A CN113587935B (en) 2021-08-04 2021-08-04 Indoor scene understanding method based on radio frequency signal multi-task learning network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110891904.8A CN113587935B (en) 2021-08-04 2021-08-04 Indoor scene understanding method based on radio frequency signal multi-task learning network

Publications (2)

Publication Number Publication Date
CN113587935A CN113587935A (en) 2021-11-02
CN113587935B true CN113587935B (en) 2023-12-01

Family

ID=78254994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110891904.8A Active CN113587935B (en) 2021-08-04 2021-08-04 Indoor scene understanding method based on radio frequency signal multi-task learning network

Country Status (1)

Country Link
CN (1) CN113587935B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709481A (en) * 2017-03-03 2017-05-24 深圳市唯特视科技有限公司 Indoor scene understanding method based on 2D-3D semantic data set
CN107451620A (en) * 2017-08-11 2017-12-08 深圳市唯特视科技有限公司 A kind of scene understanding method based on multi-task learning
CN111723635A (en) * 2019-03-20 2020-09-29 北京四维图新科技股份有限公司 Real-time scene understanding system
CN112183395A (en) * 2020-09-30 2021-01-05 深兰人工智能(深圳)有限公司 Road scene recognition method and system based on multitask learning neural network
CN112347933A (en) * 2020-11-06 2021-02-09 浙江大华技术股份有限公司 Traffic scene understanding method and device based on video stream
CN112507835A (en) * 2020-12-01 2021-03-16 燕山大学 Method and system for analyzing multi-target object behaviors based on deep learning technology

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11361225B2 (en) * 2018-12-18 2022-06-14 Microsoft Technology Licensing, Llc Neural network architecture for attention based efficient model adaptation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709481A (en) * 2017-03-03 2017-05-24 深圳市唯特视科技有限公司 Indoor scene understanding method based on 2D-3D semantic data set
CN107451620A (en) * 2017-08-11 2017-12-08 深圳市唯特视科技有限公司 A kind of scene understanding method based on multi-task learning
CN111723635A (en) * 2019-03-20 2020-09-29 北京四维图新科技股份有限公司 Real-time scene understanding system
CN112183395A (en) * 2020-09-30 2021-01-05 深兰人工智能(深圳)有限公司 Road scene recognition method and system based on multitask learning neural network
CN112347933A (en) * 2020-11-06 2021-02-09 浙江大华技术股份有限公司 Traffic scene understanding method and device based on video stream
CN112507835A (en) * 2020-12-01 2021-03-16 燕山大学 Method and system for analyzing multi-target object behaviors based on deep learning technology

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Indoor scene recognition via multi-task metric multi-kernel learning from RGB-D images;Zheng Yu et al.;《MULTIMEDIA TOOLS AND APPLICATIONS》;第76卷(第3期);4427-4443 *
Indoor scene recognition based on convolutional neural networks; Yang Peng, Cai Qingqing, Sun Hao, Sun Lihong; Journal of Zhengzhou University (Natural Science Edition), No. 3; 76-80 *
Research on scene recognition based on deep learning; Jiang Xiaoyuan; China Master's Theses Full-text Database (Information Science and Technology); I138-1901 *

Also Published As

Publication number Publication date
CN113587935A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
US11763599B2 (en) Model training method and apparatus, face recognition method and apparatus, device, and storage medium
Hsieh et al. Deep learning-based indoor localization using received signal strength and channel state information
CN112036433B (en) CNN-based Wi-Move behavior sensing method
US20210352441A1 (en) Handling concept drift in wi-fi-based localization
CN109983348A (en) Realize the technology of portable frequency spectrum analyzer
US20140169623A1 (en) Action recognition based on depth maps
Alazrai et al. An end-to-end deep learning framework for recognizing human-to-human interactions using Wi-Fi signals
CN114359738B (en) Cross-scene robust indoor people number wireless detection method and system
CN114423034B (en) Indoor personnel action recognition method, system, medium, equipment and terminal
CN112419326B (en) Image segmentation data processing method, device, equipment and storage medium
Zhu et al. NotiFi: A ubiquitous WiFi-based abnormal activity detection system
Ding et al. Device-free location-independent human activity recognition using transfer learning based on CNN
JP2022545969A (en) System and method for event recognition
CN114781463A (en) Cross-scene robust indoor tumble wireless detection method and related equipment
CN113723159A (en) Scene recognition model training method, scene recognition method and model training device
CN111901028A (en) Human body behavior identification method based on CSI (channel State information) on multiple antennas
Mei et al. WiWave: WiFi-based human activity recognition using the wavelet integrated CNN
Sruthi et al. An improved Wi-Fi sensing-based human activity recognition using multi-stage deep learning model
CN112380903B (en) Human body activity recognition method based on WiFi-CSI signal enhancement
CN113587935B (en) Indoor scene understanding method based on radio frequency signal multi-task learning network
Chen et al. WiTT: Modeling and the evaluation of table tennis actions based on WIFI signals
Gao et al. A Multitask Sign Language Recognition System Using Commodity Wi-Fi
Hu et al. IDSDL: a sensitive intrusion detection system based on deep learning
Zhang et al. Cross-domain gesture recognition via learning spatiotemporal features in Wi-Fi sensing
CN114676727B (en) CSI-based human body activity recognition method irrelevant to position

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant