CN113587935A - Indoor scene understanding method based on radio frequency signal multitask learning network - Google Patents
Indoor scene understanding method based on radio frequency signal multitask learning network
- Publication number
- CN113587935A CN113587935A CN202110891904.8A CN202110891904A CN113587935A CN 113587935 A CN113587935 A CN 113587935A CN 202110891904 A CN202110891904 A CN 202110891904A CN 113587935 A CN113587935 A CN 113587935A
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/20—Instruments for performing navigational calculations
- G01C21/206—Instruments for performing navigational calculations specially adapted for indoor navigation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/12—Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
Abstract
The invention relates to the technical field of behavior perception, in particular to an indoor scene understanding method based on a radio frequency signal multitask learning network, which comprises the following steps. Data acquisition: collecting channel state information with a wireless network card using an Atheros chipset. Data preprocessing: filtering the noise contained in the original signal, synthesizing the multi-link data after denoising, standardizing the data format, and constructing the input data set of the neural network. Multitask identification network: achieving indoor scene understanding with the multitask learning network wisnet, which comprises a shared representation layer and three sub-networks that use the gradient information shared between tasks through that layer: the domain identification network Dom_Net, the position identification network Loc_Net, and the behavior identification network Act_Net. The method uses multitask learning to simultaneously identify the scene of the user, including the domain, position, and action, sensing the user from multiple angles so as to understand the meaning of the user's behavior.
Description
Technical Field
The invention relates to the technical field of behavior perception, in particular to an indoor scene understanding method based on a radio frequency signal multitask learning network.
Background
When behavior perception is realized using commercial WiFi, action semantics are often closely related to the scene in which the action occurs, and single-action recognition cannot meet the requirement of understanding the semantics of actions in specific scenes. This patent designs and realizes a multitask learning method for scene understanding based on channel state information. The method uses an attention mechanism to assign different weights to signals from different sources and a multitask learning network to mine hidden information, and has strong cross-domain capability and extensibility.
There has been much mature work on WiFi-based behavior awareness and indoor positioning. However, in an indoor home environment, the user's actions cannot be separated from the environment and location in which they occur. The same or similar actions may carry distinct semantics in different environments. For example, lying down in a bed in a bedroom most likely means the user is sleeping, while lying down on the floor of a living room may mean the user has fallen, fainted, or worse. To avoid such misunderstandings in a home environment, it is important to distinguish the semantics of the same or similar actions. Especially when monitoring elderly people living alone, knowing their position makes it possible to judge their actions and better understand their behavior, avoiding unnecessary misunderstandings. In an AR game, the same action performed at different positions may represent different operations of a game character; if the action can be recognized together with the position and current area of the user, the user's behavior can be given a definite semantic meaning, and the scenes AR can support become much richer. The environment and location of the user constrain the actions the user can perform; in other words, the actions the user performs reflect the environment and location of the user, and the two should not be split apart.
Existing contact sensing, such as wearable devices, is limited by battery capacity: the device stops working once the battery is exhausted, and frequent charging undoubtedly burdens the user. Non-contact sensing devices such as RFID, millimeter-wave and infrared sensors are expensive to manufacture and are better suited to places with large flows of people, such as shopping malls, airports and stations. The ubiquity of WiFi in the home environment frees it from these application-scenario limitations, and WiFi is low-cost and can be deployed at scale. Many key technologies for indoor behavior semantic understanding based on WiFi signals still need breakthroughs: accurate action semantic understanding requires not only identifying user behavior but also information about the domain and position of the user. There is currently no related work fusing these three dimensions of information together.
Disclosure of Invention
In order to solve the problems, the invention provides an indoor scene understanding method based on a radio frequency signal multitask learning network.
In order to achieve the purpose, the invention adopts the technical scheme that:
an indoor scene understanding method based on a radio frequency signal multitask learning network comprises the following steps,
Preferably, the method is characterized in that:
in step 1, the data acquisition equipment comprises two computers, two routers carrying Atheros wireless network cards, and network cables; the computers are connected to the routers through the network cables, the router system can be accessed from a notebook computer to complete the setting of parameters such as mode, center frequency and packet-sending rate, and signal sending and receiving instructions are transmitted to the routers; the two routers control the sending and receiving of CSI signals according to the commands sent by the terminals, each command comprising the destination address and the number of packets to send; each router has two pairs of transmitting and receiving antennas, the sending rate of the transmitting end is 500 packets/second, the bandwidth is 20 MHz, and the center frequency is 2.4 GHz.
Preferably, in step 2, the denoising method is wavelet decomposition and reconstruction in wavelet transform, single-scale wavelet transform analysis is performed on the amplitude of CSI by using db3 wavelet, and one subcarrier data in an original signal is randomly selected to perform db3 wavelet coefficient decomposition and reconstruction, so as to complete noise filtering.
Preferably, in step 2, all the link data of the two pairs of transmitting and receiving terminals are synthesized into the data format (2000, 56, 4); the synthesized data, together with its three corresponding tags (domain, position, and action), generates the data set.
Preferably, in step 3, the domain identification network Dom_Net uses a convolutional attention mechanism based on minimum pooling, which gives more weight to information with smaller amplitude values, to distinguish different domains; the behavior identification network Act_Net uses a convolutional attention mechanism based on maximum pooling, which gives more weight to information with larger amplitude values, to distinguish different actions.
Preferably, the input data set obtained in step 2 is input into a convolutional attention mechanism AM, which comprises a channel attention module and a spatial attention module.
Preferably, the input data set obtained in step 2 undergoes a normal convolution operation while an attention mechanism is added; the channel attention module is expressed as follows:

M_c(F) = σ(MLP(AvgPool(X)) + MLP(MinPool(X))),

where X is the input data of the neural network, AvgPool and MinPool are the average pooling layer and the minimum pooling layer respectively, MLP is a shared layer that realizes data dimension reduction and feature extraction mainly through convolution operations, and σ is the corresponding Sigmoid activation function. The channel attention module compresses the feature map in the spatial dimension, considering only the features inside each channel. While the convolution operation is performed, the input feature map passes through the global average pooling layer and the global minimum pooling layer of the channel attention module respectively; the average pooling layer has feedback on every feature point and is used to retain the background information in the feature map, while during gradient back-propagation only the feature points with the smallest response on the feature map receive gradient feedback from the minimum pooling layer. The two feature maps from the average pooling layer and the minimum pooling layer are input into the shared layer MLP to realize dimension reduction and feature extraction, compressing the spatial dimension of the feature maps; the outputs of the MLP are added and activated by the Sigmoid function to obtain the channel attention matrix, and an element-wise product of this result with the convolved feature matrix gives the adjusted feature F';
the spatial attention module compresses the channel dimension and is expressed as:

M_s(F) = σ(f^(n×n)([AvgPool(F'); MinPool(F')])),

where F' is the feature after the channel attention mechanism, f corresponds to a two-dimensional convolution operation, n is the dimension of the convolution kernel, AvgPool extracts the average value over the channels, and MinPool extracts the minimum value over the channels. The feature matrices extracted by the average pooling layer and the minimum pooling layer are concatenated and, after the convolution layer, activated by Sigmoid to obtain the spatial attention matrix (Spatial Attention); an element-wise product of the spatial attention matrix with the adjusted feature F' gives the following formula:
C_A = M_c(F) · M_s(F),

where C_A is the result of adding the attention mechanism on the basis of the CNN. In the specific domain identification application, C_A includes the background information in the collected data and is used to characterize the domain where the current user is located; when the network includes multiple layers, C_A is iterated as the input to the calculation of the next layer.
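As a concrete illustration of the two modules above, the following NumPy sketch combines channel and spatial attention into C_A by element-wise products. It is a simplification rather than the patented implementation: the convolutions are omitted, the n×n convolution in the spatial module is replaced by a simple sum of the two pooled maps, and the MLP weights `w1`/`w2` and the (C, H, W) feature layout are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """M_c: compress spatial dims with average and minimum pooling,
    pass both through a shared MLP, add, and squash with Sigmoid."""
    avg = x.mean(axis=(1, 2))                    # (C,) global average pooling
    mn = x.min(axis=(1, 2))                      # (C,) global minimum pooling
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0)   # shared two-layer MLP
    return sigmoid(mlp(avg) + mlp(mn))           # (C,) per-channel weights

def spatial_attention(x):
    """M_s: compress the channel dim with average and minimum pooling
    (the n-by-n convolution is omitted; pooled maps are simply summed)."""
    avg = x.mean(axis=0)                         # (H, W)
    mn = x.min(axis=0)                           # (H, W)
    return sigmoid(avg + mn)                     # (H, W) per-location weights

def attention_module(x, w1, w2):
    mc = channel_attention(x, w1, w2)            # channel weights
    f_prime = x * mc[:, None, None]              # element-wise product -> F'
    ms = spatial_attention(f_prime)              # spatial weights
    return f_prime * ms[None, :, :]              # C_A = M_c(F) . M_s(F)
```

The attended output has the same shape as the input feature map, so it can be fed directly into the next convolution layer, matching the iterative use of C_A described above.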
Preferably, the shared representation layer comprises two convolution layers; after each convolution operation there is a batch normalization layer and a leaky rectified linear unit (Leaky ReLU) to avoid the phenomena of gradient vanishing and gradient explosion.
Preferably, in step 3, using the wisnet network structure, the calculation process from data input to output is as follows:
original dataset D { (x)1,y1),(x2,y2)...(xn,yn) Therein ofxiObtaining shared layer output S through two hard shared layersi:
Si=LeaklyRelu(f(∑i∈Dxi*ks i+bs i)),
Wherein k is a corresponding convolution kernel parameter, and b is an offset; after convolution xiAfter activation by LeaklyRelu, k and b are shared among the three tasks; in the gradient updating process, returning the task specific gradient information and simultaneously returning the gradient information of the shared parameter;
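The hard-sharing idea above can be sketched in NumPy as a single 1-D convolution whose kernel and bias are reused by every task. This is an illustrative simplification (f is taken as the identity, and the slope 0.01 for Leaky ReLU is an assumed default, not stated in the patent):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def shared_layer(x, k, b):
    """One hard-shared 1-D convolution: the kernel k and bias b are the
    same for all three tasks (Dom_Net, Loc_Net, Act_Net), so gradients
    from every task would update the same k and b."""
    n, m = len(x), len(k)
    out = np.array([x[i:i + m] @ k + b for i in range(n - m + 1)])
    return leaky_relu(out)   # S_i = LeakyReLU(x_i * k_s + b_s)
```

The same output S_i then feeds each task-specific sub-network, which is what lets the three losses jointly shape the shared parameters.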
in order to judge the domain where the user is located, the network structure of Dom_Net is used: the shared-layer output S_i is first convolved;
during training, changes in the data distribution of the intermediate layers can cause gradient vanishing or explosion; to solve this problem and at the same time speed up training, a batch normalization layer (BN) is required; after the BN layer and LeakyReLU activation, maximum pooling and a one-dimensional convolution yield the result F_dom:
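The batch normalization step mentioned here can be sketched directly: each feature is standardized over the batch so the intermediate-layer distribution stays stable. A minimal NumPy version (training-mode statistics only; the learnable gamma/beta and running averages of a full BN layer are reduced to scalars here):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature column over the batch dimension so shifts
    in the intermediate-layer distribution do not cause vanishing or
    exploding gradients; gamma/beta rescale the normalized output."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

After this step each feature has approximately zero mean and unit variance across the batch, regardless of the scale of the incoming convolution outputs.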
first, the minimum values on each channel are extracted, i.e., the channel attention mechanism C_A is added;
after these two steps, the convolved data x_i passes through a linear fully-connected layer to obtain the output,
where W_dom and b_dom are the weight matrix and the bias matrix iteratively updated in the fully-connected layer;
the index corresponding to the maximum value of each row is the output predicted by the network, and the corresponding loss function is L_dom:
similarly, S_i passes through the three convolutional layers of Act_Net to obtain the output F_act; since signals with larger change amplitudes contain more of the user's behavior information, an attention mechanism consisting of an average pooling layer and a maximum pooling layer is added, and applying this attention mechanism gives the attended feature;
The corresponding loss functions are respectively:
the network structure of Loc_Net is comparatively simple, because the CNN convolutional neural network is sensitive to spatial information and can identify the position well without adding an attention mechanism; the output of the Loc_Net convolutional layers, after the batch normalization layer and the activation function layer, gives F_loc;
likewise, the loss function of the final Loc_Net is:
since the sharing layer is embedded in each sub-network, the loss returned in each sub-network contains both the gradient information of the specific task and the gradient information from the sharing layer, i.e. θ contains the two parts θ_sh and θ_i; the optimization objective function of wisnet is:

where L_i ∈ {L_dom, L_act, L_loc}, and the parameters are updated to minimize the objective function;
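A minimal sketch of this joint objective, assuming (as is common, though not stated explicitly in the patent) that each sub-network ends in a softmax with cross-entropy loss and that the three losses are summed with equal weights:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))   # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    p = softmax(logits)
    return -np.log(p[np.arange(len(labels)), labels]).mean()

def wisnet_loss(dom_logits, loc_logits, act_logits, y_dom, y_loc, y_act):
    """Overall objective: the sum of the three sub-network losses.
    In the real network, the gradient of each term would flow back
    through both the task-specific parameters theta_i and the
    shared-layer parameters theta_sh."""
    return (cross_entropy(dom_logits, y_dom)
            + cross_entropy(loc_logits, y_loc)
            + cross_entropy(act_logits, y_act))
```

Minimizing this sum updates the shared parameters with gradient information from all three tasks at once, which is the mechanism the patent relies on to mine hidden cross-task information.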
the final output of wisnet is the output of three networks,andrespectively corresponding to the domain where the user is located, the position of the current domain where the user is located and the executed action; from the information of the domain and the location, the specific meaning contained by the action can be deduced.
The beneficial effects of the invention are as follows:
the method of minimum pooling in the field of image recognition is less used, mainly because in the representation RGB of the image, 000 represents black, and the smaller the value the closer to black. The picture information extracted by the minimum pooling is background information with few characteristics, and the characteristics have no significance. But in the field of signal processing 0 is of practical significance. The signal reflected from different domains will have different amplitudes due to different room locations and furnishings. Starting from this, different domains are distinguished based on the amplitude level when the space is relatively stationary. Using a convolutional attention mechanism based on minimal pooling may give more weight to information with smaller amplitude values. Thereby neglecting the influence of information with large amplitude fluctuation. In the behavior perception technology based on the CSI, a method of multitask concurrency never exists. Aiming at the problem of multi-task scene understanding, the multi-task learning network structure wisnet based on the hard sharing mechanism provided by the patent utilizes the sharing mechanism of the convolutional layer to extract hidden information among subtasks, and provides possibility for cross-scene action identification and indoor positioning.
From the above, the advantages of the present invention are:
(1) the system distinguishes different meanings of the same action under different scenes and positions, and solves the problem that the traditional method cannot realize behavior semantic understanding.
(2) The system takes all the receiving and transmitting terminal carrier signals as network input, defines an effective data splicing format and more effectively utilizes indoor multipath information.
(3) The system provides a multi-task scene understanding network wisnet based on CSI, and behavior recognition and indoor positioning can be carried out under multiple scenes without retraining a model.
Drawings
Fig. 1 is a flowchart of an indoor scene understanding method based on a radio frequency signal multitask learning network according to the present invention.
Fig. 2 is a diagram of actions performed by volunteers in an indoor scene understanding method based on a radio frequency signal multitask learning network according to the present invention.
Fig. 3 is a schematic view of a hall scene in the indoor scene understanding method based on the radio frequency signal multitask learning network.
Fig. 4 is a schematic view of an office scene of the indoor scene understanding method based on the radio frequency signal multitask learning network.
Fig. 5 is a schematic diagram of wavelet reconstruction signals of different scales of the indoor scene understanding method based on the radio frequency signal multitask learning network.
Fig. 6 is a schematic diagram of data set construction of an indoor scene understanding method based on a radio frequency signal multitask learning network.
Fig. 7 is a diagram of a wisnet network structure for understanding an indoor scene based on a radio frequency signal multitask learning network according to the present invention.
Fig. 8 is a structural diagram of the Dom_Net attention mechanism in the indoor scene understanding method based on a radio frequency signal multitask learning network.
FIG. 9 is a schematic diagram of sub-network accuracy and loss value in the training process of the indoor scene understanding method based on the radio frequency signal multitask learning network.
Fig. 10 is a schematic diagram of a wisnet confusion matrix of an indoor scene understanding method based on a radio frequency signal multitask learning network.
Fig. 11 is a schematic diagram of wisnet performance evaluation of an indoor scene understanding method based on a radio frequency signal multitask learning network.
Fig. 12 is a comparison graph of training accuracy rates of different structures of Act _ Net in the indoor scene understanding method based on the radio frequency signal multitask learning network.
Fig. 13 is a comparison graph of indexes of the indoor scene understanding method Act _ Net based on the radio frequency signal multitask learning network under different networks.
Detailed Description
In order to make the purpose, technical solution and advantages of the present technical solution clearer, the present technical solution is further described in detail below with reference to specific embodiments. It should be understood that the description is intended to be exemplary only and is not intended to limit the scope of the present technical solution.
As shown in figs. 1 to 8, the present embodiment provides an indoor scene understanding method based on a radio frequency signal multitask learning network, focusing on semantic understanding of cross-domain actions, a key technology in the intelligent perception field, and provides the indoor-wireless-signal scene understanding system architecture Wi-Sys shown in fig. 1. Wi-Sys comprises three parts: data acquisition, data preprocessing, and the multitask identification network. First, the Atheros wireless network card is used to collect Channel State Information (CSI). Then, the noise contained in the original signal is filtered, the multi-link data is synthesized after denoising, the data format is standardized, and the input data set of the neural network is constructed. Finally, indoor scene understanding is achieved through the multitask learning network wisnet, which comprises a shared representation layer, the domain identification network Dom_Net, the position identification network Loc_Net, and the behavior identification network Act_Net.
Data acquisition
The equipment for acquiring the experimental data comprises two notebook computers, two routers carrying Atheros wireless network cards, and two 5-meter network cables. Each computer is connected to a router through a network cable; the router system can be accessed from the notebook computer to set parameters such as mode, center frequency and packet-sending rate, and to send the signal transmission and reception instructions to the router. The two routers control the transmission and reception of the CSI signals according to the commands sent by the terminals; each command includes the destination address and the number of packets to transmit. Each router has two pairs of transmit-receive antennas; the packet transmission rate of the transmitting end is 500 packets/second, the bandwidth is 20 MHz, and the center frequency is 2.4 GHz.
In the experimental setup, the volunteers performed the actions shown in fig. 2, including squatting, stooping, walking, raising hands and other actions common in daily life. Each volunteer performed each action 10 times at each position in the field, with a sample time of approximately 4.5 seconds per action; each collected sample consists of 2300 CSI packets.
The scene shown in fig. 3 is a relatively open hall of a teaching building, with a few tables distributed around it and many surrounding windows. The routers are 85 cm above the ground, each position fingerprint block is 1.2 × 1.2 m, and each domain contains 9 positions, numbered 1-9. The domain covers about 13 square meters. While the CSI was collected, pedestrians passed through, bringing some interference to the effective signal. Fig. 4 shows a conference room in which the desks and chairs are arranged closely and the wall area is large. This scene is larger and its surroundings more complex than the hall scene shown in fig. 3. After a signal is sent out, it is reflected more times by static objects in the environment such as tables, chairs and walls, so the collected CSI signal contains more uncertain factors. The volunteers performed the actions shown in fig. 2, including actions common in daily life such as squatting, bending, walking and raising hands. Each volunteer performed each action 10 times at each position in the field, with a sample time of approximately 4.5 seconds per action; each collected sample consists of 2300 CSI packets.
Data pre-processing
In the acquisition and denoising part, the Atheros network card is used to collect the CSI. On its way from the transmitting end to the receiving end, the signal is reflected, diffracted and scattered by furniture, other static objects and human bodies. In this process the device itself may also vibrate, and other devices transmitting wireless signals in the home environment may interfere with CSI propagation, leading to packet loss, delay and noise during the double-ended transmission, which easily submerges the effective signal. The data therefore needs to be denoised before effective features are extracted from the CSI signal. The denoising method used here is wavelet decomposition and reconstruction in the wavelet transform: the db3 wavelet is used to perform a single-scale wavelet transform analysis of the CSI amplitudes. One subcarrier's data is randomly selected from the original signal for db3 wavelet coefficient decomposition and reconstruction, with the result shown in fig. 5. As the reconstruction scale increases, the signal becomes smoother. With scale-6 reconstruction, too much of the relatively high-frequency signal is lost and part of the signal no longer matches the original, so the a5-scale reconstruction is selected.
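The decompose-threshold-reconstruct idea behind this denoising step can be sketched in a self-contained way. The patent uses the db3 wavelet via a standard wavelet toolbox; the sketch below substitutes a single-level Haar wavelet (an assumption made for brevity, since the Haar filters are only two taps long) to show the same mechanism: split the signal into approximation and detail coefficients, zero the small detail coefficients carrying noise, and invert the transform.

```python
import numpy as np

def haar_denoise(signal, threshold):
    """Single-level Haar wavelet decomposition/reconstruction:
    keep the approximation (low-pass) coefficients, zero detail
    (high-pass) coefficients whose magnitude is below the threshold."""
    x = np.asarray(signal, dtype=float)
    x = x[:len(x) // 2 * 2].reshape(-1, 2)          # pair up samples
    approx = (x[:, 0] + x[:, 1]) / np.sqrt(2)       # low-pass coefficients
    detail = (x[:, 0] - x[:, 1]) / np.sqrt(2)       # high-pass coefficients
    detail = np.where(np.abs(detail) < threshold, 0.0, detail)
    rec = np.empty_like(x)                          # inverse transform
    rec[:, 0] = (approx + detail) / np.sqrt(2)
    rec[:, 1] = (approx - detail) / np.sqrt(2)
    return rec.ravel()
```

With threshold 0 the reconstruction is exact; with a large threshold only the smoothed approximation survives, which mirrors how higher reconstruction scales in fig. 5 progressively smooth the CSI amplitude.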
In the dataset construction part, it was observed during the experiments that the collected signals differ across transceiver devices even when the same volunteer executes the same action at the same position in the same domain, as shown in fig. 5. Even for the same receiving end, the amplitude intervals and data change patterns from different transmitting ends differ. Different transceiver links form different perspectives on the human body's variation in space, and common sense tells us that the richer the viewing angles, the more comprehensive and true the changes we see. To better utilize the data redundancy brought by multipath while satisfying the neural network's input requirements, all the link data of the two pairs of transceivers are connected and longitudinally spliced into the data format (2000, 56, 4). The spliced data, together with its three corresponding labels (domain, location, action), generates the data set; the data format is shown in fig. 6.
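The splicing step can be sketched as a simple array stack. The shapes are taken from the text (2000 packets × 56 subcarriers per link, 4 links); the random amplitudes stand in for real CSI data and are purely illustrative:

```python
import numpy as np

# Each of the 4 transmit-receive links yields a (2000 packets,
# 56 subcarriers) amplitude matrix after denoising; stacking the links
# along a new last axis gives the (2000, 56, 4) network input.
links = [np.random.rand(2000, 56) for _ in range(4)]
sample = np.stack(links, axis=-1)

# Each sample carries three labels: (domain, location, action).
labels = (0, 3, 1)   # illustrative label values, not from the patent
print(sample.shape)  # (2000, 56, 4)
```

Keeping the links as separate channels (rather than averaging them) preserves the per-link "viewing angles" the text describes, letting the convolution layers weigh them independently.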
Multitask identification network
Unlike a single-task learning network, the data set of a multitask learning network contains information in three dimensions: domain, position and action. Multitask learning reads and processes the three kinds of information simultaneously and can fully mine the hidden information among the tasks. This is mainly accomplished by the parameter sharing mechanism: the sharing layer synthesizes the gradient information among the multiple tasks and updates them synchronously. The architecture of the scene-perception multitask learning neural network used here is shown in fig. 7.
In the attention mechanisms, Dom_Net distinguishes different domains by the amplitude level when the space is relatively static: a convolutional attention mechanism based on minimum pooling gives more weight to information with smaller amplitude values, thereby ignoring the influence of information with large amplitude fluctuations. Act_Net instead adds an attention mechanism based on maximum pooling, letting information with larger amplitudes dominate. Different networks add different attention mechanisms to focus on different signals. The attention module employed by Dom_Net is shown in fig. 8.
The convolutional attention mechanism AM consists of two parts: a channel attention module and a spatial attention module. Each channel of the feature map acts as a feature detector; the channel attention module compresses the feature matrix along the spatial dimension to extract the feature information each channel should attend to. The spatial attention module compresses the channels, integrating the features extracted by each channel across the feature dimension of the whole data.
The input data is normally convolved while an attention mechanism is added. The channel attention module is expressed as follows:
M_c(F) = σ(MLP(AvgPool(X)) + MLP(MinPool(X))),
where X is the input data of the neural network, AvgPool and MinPool are the average pooling layer and the minimum pooling layer respectively, MLP is the shared layer (which realizes data dimensionality reduction and feature extraction mainly through convolution operations), and σ is the corresponding activation function, here the Sigmoid function.
The channel attention module compresses the feature map along the spatial dimension, considering only the features inside each channel. While the convolution operation is performed, the input feature map passes through the module's global average pooling layer and global minimum pooling layer respectively. The average pooling layer gives feedback on every feature point and preserves the background information in the feature map; during gradient back-propagation through the minimum pooling layer, only the feature points with the smallest responses on the feature map receive gradient feedback, so minimum pooling selects the features whose variation is least pronounced. The two pooled feature maps are input into the shared layer MLP for dimensionality reduction and feature extraction, compressing the spatial dimension of the feature maps. The sum of the MLP outputs is activated by the Sigmoid function to obtain the channel attention matrix (Channel Attention), and the element-wise (Hadamard) product of this result with the convolved feature matrix gives the adjusted feature F'.
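A minimal numpy sketch of the channel attention step with minimum pooling, as used by Dom_Net; the feature-map size and the shared-MLP weight shapes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention_min(x, w1, w2):
    """Channel attention with *minimum* pooling (Dom_Net sketch).
    x: feature map (C, H, W); w1 (C, C//r) and w2 (C//r, C) are the
    shared two-layer MLP's weights (shapes are illustrative)."""
    avg = x.mean(axis=(1, 2))                     # global average pooling -> (C,)
    mn = x.min(axis=(1, 2))                       # global minimum pooling -> (C,)
    mlp = lambda v: np.maximum(v @ w1, 0.0) @ w2  # shared MLP, applied to both
    m_c = sigmoid(mlp(avg) + mlp(mn))             # channel attention matrix
    return x * m_c[:, None, None]                 # element-wise (Hadamard) product

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 20, 14))
w1 = rng.standard_normal((8, 4))
w2 = rng.standard_normal((4, 8))
f_adjusted = channel_attention_min(x, w1, w2)     # adjusted feature F'
```

Because the Sigmoid output lies in (0, 1), the adjusted feature F' is a per-channel damping of the input, with the smallest-amplitude channels least suppressed.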
The spatial attention module compresses the channels and comprehensively considers the relationship among the channels. The spatial attention module is expressed as:
M_s(F) = σ(f^{n×n}([AvgPool(F′); MinPool(F′)])),
where F′ is the feature after the channel attention mechanism, f^{n×n} denotes a two-dimensional convolution operation, and n is the size of the convolution kernel.
AvgPool extracts the mean over the channels and MinPool extracts the minimum over the channels. The feature matrices produced by the average pooling layer and the minimum pooling layer are concatenated, passed through the convolution layer, and activated by Sigmoid to obtain the spatial attention matrix (Spatial Attention), whose element-wise product with the adjusted feature F′ gives the following formula:
C_A = M_c(F) · M_s(F),
where C_A is the result of adding the attention mechanism on top of the CNN. In the specific domain-identification application, C_A encodes the background information in the collected data and characterizes the domain where the current user is located. When the network contains multiple layers, C_A is fed as input into the calculation of the next layer.
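A matching sketch of the spatial attention step; for brevity the n×n convolution over the two pooled maps is simplified here to a 1×1 weighted combination, which is an assumption rather than the patent's exact operator.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention_min(f_prime, w_avg=0.5, w_min=0.5):
    """Spatial attention over a feature map f_prime of shape (C, H, W):
    mean and minimum are taken over the channel axis and combined.
    The n x n convolution is reduced to a 1x1 weighted sum (assumption)."""
    avg = f_prime.mean(axis=0)                # AvgPool over channels -> (H, W)
    mn = f_prime.min(axis=0)                  # MinPool over channels -> (H, W)
    m_s = sigmoid(w_avg * avg + w_min * mn)   # spatial attention matrix
    return f_prime * m_s[None, :, :]          # element-wise product

rng = np.random.default_rng(1)
f_prime = rng.standard_normal((8, 20, 14))    # feature after channel attention
c_a = spatial_attention_min(f_prime)          # result after both modules
```

Chaining the channel module and then this spatial module yields the attention-adjusted feature C_A that is passed on to the next layer.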
The attention mechanism in Act_Net is similar to that of fig. 8, except that the minimum pooling is replaced by maximum pooling.
The wisnet includes a shared representation layer, a domain-identification network Dom_Net, a location-identification network Loc_Net, and an action-identification network Act_Net. The shared representation layer comprises two convolution layers; each convolution operation is followed by a batch normalization layer and a leaky rectified linear unit (LeakyReLU), which avoids the gradient vanishing and gradient explosion phenomena. The network structure of the three subtasks is shown in fig. 7, and the data input-output calculation proceeds as follows:
Given the original data set D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, each sample x_i passes through two hard-shared layers to obtain the shared-layer output S_i:

S_i = LeakyReLU(f(Σ_{i∈D} x_i * k_i^s + b_i^s)),
where k is the corresponding convolution kernel parameter and b is the bias. The convolved x_i is activated by LeakyReLU; k and b are shared among the three tasks. During gradient updating, the task-specific gradient information is back-propagated together with the gradient information of the shared parameters.
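A minimal single-channel sketch of the hard-shared representation layer (batch normalization omitted; the input and kernel sizes are assumptions):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def conv2d_valid(x, k, b):
    """Naive single-channel 'valid' 2D convolution (illustrative only)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k) + b
    return out

def shared_layer(x, k1, b1, k2, b2):
    """Two hard-shared convolutions with LeakyReLU: the parameters
    k1, b1, k2, b2 are reused by all three task heads."""
    s = leaky_relu(conv2d_valid(x, k1, b1))
    return leaky_relu(conv2d_valid(s, k2, b2))

rng = np.random.default_rng(2)
x_i = rng.standard_normal((12, 10))           # toy stand-in for one sample
s_i = shared_layer(x_i, rng.standard_normal((3, 3)), 0.1,
                   rng.standard_normal((3, 3)), -0.2)
```

Since every task head consumes the same `s_i`, the gradients of all three task losses accumulate on `k1, b1, k2, b2`, which is exactly the hard-sharing behavior described above.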
To determine the domain in which the user is located, the Dom_Net structure shown in fig. 7 is used. The shared-layer output S_i is first convolved. During training, changes in the distribution of intermediate-layer data can cause gradients to vanish or explode; to address this problem, and to speed up training, a batch normalization layer BN is applied after the convolution. The BN output is activated by LeakyReLU and max-pooled after a one-dimensional convolution, yielding the result F_dom. The channel attention mechanism C_A then extracts the minimum value on each channel, giving the adjusted feature F′_dom. After these two steps, the convolved data x_i is passed through a linear fully connected layer to obtain the output

ŷ_dom = W_dom F′_dom + b_dom,

where W_dom and b_dom are respectively the weight matrix and the bias matrix iteratively updated in the fully connected layer. The index corresponding to the maximum value of each row of ŷ_dom is the output of the network prediction, and the corresponding loss function is L_dom.
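The fully connected output stage and the per-row argmax prediction can be sketched as follows; the flattened-feature and weight shapes are illustrative assumptions.

```python
import numpy as np

def dom_net_output(f_dom_adj, w_dom, b_dom):
    """Linear fully connected output of Dom_Net followed by a per-row
    argmax. f_dom_adj: (batch, D) flattened adjusted features;
    w_dom: (D, n_domains); b_dom: (n_domains,)."""
    y_dom = f_dom_adj @ w_dom + b_dom       # fully connected layer
    return y_dom.argmax(axis=1)             # index of each row's maximum value

# Toy check: identity weights copy the two features straight through,
# so the prediction is simply the larger feature's index.
f = np.array([[1.0, 3.0], [5.0, 2.0]])
w = np.eye(2)
b = np.zeros(2)
pred = dom_net_output(f, w, b)              # -> array([1, 0])
```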
By the same principle, the output of S_i after Act_Net's three convolution layers is F_act. Since the signals with larger variation amplitude contain more user-behavior information, an attention mechanism composed of average pooling and maximum pooling layers is added; applying it yields the adjusted feature F′_act. The corresponding loss function is L_act.
The network structure of Loc_Net is comparatively simple: the CNN convolutional neural network is inherently sensitive to spatial information, so the position can be identified well without adding an attention mechanism. The output of Loc_Net's convolution layers, after the batch normalization layer and the activation function layer, is F_loc. Likewise, the loss function of the final Loc_Net is L_loc.
Since the shared layer is embedded in every sub-network, the loss back-propagated in each sub-network contains both task-specific gradient information and gradient information from the shared layer; that is, the parameter set θ consists of two parts, θ_sh and θ_i. The optimization objective of wisnet is to minimize the sum of the task losses

min_θ Σ_i L_i, where L_i ∈ {L_dom, L_act, L_loc},

and the parameters are updated to minimize this objective function.
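The joint objective can be sketched as the sum of three per-task cross-entropy losses; equal loss weighting is an assumption here, since the patent does not specify a weighting scheme.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (numerically stable sketch)."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def wisnet_objective(dom_logits, act_logits, loc_logits, y_dom, y_act, y_loc):
    """Joint objective: the three task losses are summed, so the shared
    layer receives gradient information from every task."""
    return (cross_entropy(dom_logits, y_dom)
            + cross_entropy(act_logits, y_act)
            + cross_entropy(loc_logits, y_loc))

loss = wisnet_objective(np.array([2.0, 0.1]), np.array([0.3, 1.2, 0.5]),
                        np.array([1.0, 1.0]), 0, 1, 1)
```

Minimizing this summed loss drives the updates of both the shared parameters θ_sh and the task-specific parameters θ_i simultaneously.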
The final output of wisnet is the outputs of the three networks, ŷ_dom, ŷ_loc and ŷ_act, corresponding respectively to the domain where the user is located, the user's position within that domain, and the action performed. From the domain and position information, the specific meaning of the action can be inferred.
Example 1
The present embodiment verifies the accuracy and system robustness of the above method.
Accuracy of identification
Training is performed using the data sets under both domains. The actions contained in each domain's data set are not exactly the same; actions not present in a domain are grouped together into a single class. The accuracy and loss variation during training are shown in fig. 9.
After the shared layer is added, the accuracy gradually increases and the loss gradually decreases as the number of training rounds grows. After 200 rounds of training, the accuracy of all three tasks exceeds 95%, and the loss falls below 0.1 on average.
wisnet was trained using the data sets under both domains. The confusion matrices for wisnet on the test set are shown in fig. 10.
As can be seen from figs. 10(a) and 10(b), the per-class accuracy of Act_Net exceeds 80%, and that of Loc_Net exceeds 95%.
The other evaluation indicators on the test set are recall (Recall), precision (Precision), and macro-F1, as shown in fig. 11.
As can be seen from fig. 11, Dom_Net and Loc_Net perform best, with every index at 95% or above. Act_Net is harder to identify because its statistical features vary greatly across different domains and positions; even so, its precision, recall, and macro-F1 all reach 83%. In summary, when performing action and location recognition across multiple domains, adding a hard-sharing mechanism significantly improves model performance.
The correct classification of each wisnet subtask is a necessary condition for scene understanding: in the scene-understanding task, the action semantics can be correctly parsed only when the (domain, position, action) triple is correctly classified. To evaluate the classification performance of wisnet, tests were performed on the test set. The test indexes are detailed in Table 1 below.
TABLE 1 wisnet test results
Wherein √ is a correct classification, and x is a wrong classification.
Of the 1888 test samples, 1553 are TTT, accounting for 82.3%. Of the remaining 335 misclassified samples, 291 are TTF. This indicates that, given that Loc_Net and Dom_Net classify correctly, the probability that an Act_Net misclassification causes the overall misclassification is 87%. The sum of TTF, TFF, FFF and FTF is 300, of which TTF accounts for 291; that is, when Act_Net misclassifies, Loc_Net and Dom_Net are still both correct 97% of the time. The sum of TTF and TTT is 1844, accounting for 97.6%: Loc_Net and Dom_Net classify the great majority of the data correctly and have little influence on the overall classification. This analysis shows that wisnet exhibits a "short-board effect": its overall classification performance is determined by the subtask network Act_Net. Therefore, when improving wisnet with different structures and parameters, the classification performance of Act_Net should be the focus.
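The reported proportions can be reproduced directly from the outcome counts (T/F marking correct/incorrect classification by Dom_Net, Loc_Net and Act_Net in that order):

```python
total = 1888
ttt = 1553                 # all three subtasks correct
ttf = 291                  # only Act_Net wrong
act_wrong = 300            # TTF + TFF + FFF + FTF (sum given in the text)

share_ttt = round(ttt / total * 100, 1)        # 82.3% fully correct
act_causes = round(ttf / (total - ttt) * 100)  # 87% of errors are Act_Net-only
both_ok = round(ttf / act_wrong * 100)         # 97% of Act errors keep Dom/Loc right
dom_loc_ok = ttt + ttf                         # 1844 samples, about 97.6%
```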
System robustness
To observe the effect of the attention mechanism, the following comparative experiments were performed on different network structures. According to whether the attention mechanism is added to Act_Net and Dom_Net, the variants are named Act_o_Dom_o, Act_o_Dom_w, Act_w_Dom_o and Act_w_Dom_w. Fig. 12 shows the Act_Net accuracy over 100 training rounds for the four network structures on the same data set. It can be clearly observed that the network without any attention mechanism performs worst, reaching only about 80% accuracy, while the networks with attention perform relatively well; Act_w_Dom_w, i.e. wisnet, with both attention mechanisms added simultaneously, performs best.
Fig. 13 shows the accuracy of wisnet's action-semantic recognition under the four different network structures. It can be seen that after attention is added to both Act_Net and Dom_Net simultaneously, the recognition accuracy of the action semantics improves markedly.
The foregoing is only a preferred embodiment of the present invention, and many variations in the specific embodiments and applications of the invention may be made by those skilled in the art without departing from the spirit of the invention, which falls within the scope of the claims of this patent.
Claims (9)
1. An indoor scene understanding method based on a radio frequency signal multitask learning network is characterized by comprising the following steps: comprises the following steps of (a) carrying out,
step 1, data acquisition: collecting channel state information by using a wireless network card carrying Atheros;
step 2, data preprocessing: filtering noise contained in an original signal, synthesizing multilink data after denoising is finished, standardizing a data format, and constructing an input data set of a neural network;
step 3, multi-task identification network: indoor scene understanding is achieved using a multi-task learning network wisnet, which comprises a shared representation layer together with a domain-identification network Dom_Net, a position-identification network Loc_Net and a behavior-identification network Act_Net that share gradient information among the tasks through the shared representation layer.
2. The indoor scene understanding method based on the radio frequency signal multitask learning network according to claim 1, characterized in that:
in step 1, the data acquisition equipment comprises two computers, two routers carrying Atheros wireless network cards, and network cables; each computer is connected to a router through a network cable, the router system is accessed from the computer to set parameters such as mode, center frequency and packet-sending rate, and signal sending and receiving instructions are transmitted to the routers; the two routers control the sending and receiving of CSI signals according to the commands sent by the terminals, the commands including destination addresses and the number of packets to send; each router is provided with two pairs of transceiver antennas, the sending rate of the sending end is 500 packets/second, the bandwidth is 20 MHz, and the center frequency is 2.4 GHz.
3. The indoor scene understanding method based on the radio frequency signal multitask learning network according to claim 1, characterized in that:
in step 2, the denoising method is wavelet decomposition and reconstruction in wavelet transformation, single-scale wavelet transformation analysis is performed on the amplitude of the CSI by using a db3 wavelet, and one subcarrier data in an original signal is randomly selected to perform db3 wavelet coefficient decomposition and reconstruction, so that noise filtering is completed.
4. The indoor scene understanding method based on the radio frequency signal multitask learning network according to claim 2, characterized in that: in step 2, all link data of the two transceiver pairs are synthesized into a data format of (2000, 56, 4), and the synthesized data, together with its three corresponding labels, namely domain, position and action, generates a data set.
5. The indoor scene understanding method based on the radio frequency signal multitask learning network according to claim 1, characterized in that: in step 3, the domain identification network Dom _ Net can give more weight to information with smaller amplitude values using a convolutional attention mechanism based on minimum pooling to distinguish different domains; the behavior recognition network Act _ Net distinguishes different actions based on the fact that information with larger amplitude values can be given more weight using a convolutional attention mechanism based on maximum pooling.
6. The indoor scene understanding method based on the radio frequency signal multitask learning network as claimed in claim 5, wherein: and (3) inputting the input data set obtained in the step (2) into a convolution attention mechanism AM, wherein the convolution attention mechanism AM comprises a channel attention module and a space attention module.
7. The indoor scene understanding method based on the radio frequency signal multitask learning network as claimed in claim 6, wherein: inputting the input data set obtained in the step 2 into a normal convolution operation, and adding an attention mechanism at the same time, wherein a channel attention module is expressed as the following formula:
M_c(F) = σ(MLP(AvgPool(X)) + MLP(MinPool(X))),
wherein X is the input data of the neural network, AvgPool and MinPool are respectively the average pooling layer and the minimum pooling layer, MLP is the shared layer, in which data dimensionality reduction and feature extraction are realized mainly through convolution operations, and σ is the corresponding Sigmoid activation function; the channel attention module compresses the feature map in the spatial dimension, considering only the features inside each channel; while the convolution operation is performed, the input feature map passes through the module's global average pooling layer and global minimum pooling layer respectively; the average pooling layer gives feedback on every feature point and preserves the background information in the feature map, and during gradient back-propagation through the minimum pooling layer, only the feature points with the smallest responses on the feature map receive gradient feedback; the two feature maps from the average pooling layer and the minimum pooling layer are input into the shared layer MLP to realize dimensionality reduction and feature extraction, compressing the spatial dimension of the feature maps; the outputs of the MLP are added and activated by the Sigmoid function to obtain the channel attention matrix, and the element-wise product of this result with the convolved feature matrix gives the adjusted feature F';
the spatial attention module compresses the channel, which is expressed as:
M_s(F) = σ(f^{n×n}([AvgPool(F′); MinPool(F′)])),
wherein F′ is the feature after the channel attention mechanism, f^{n×n} denotes a two-dimensional convolution operation, n is the size of the convolution kernel, AvgPool extracts the average value over the channels, and MinPool extracts the minimum value over the channels; the feature matrices extracted by the average pooling layer and the minimum pooling layer are connected, passed through the convolution layer and activated by Sigmoid to obtain the spatial attention matrix (Spatial Attention), whose element-wise product with the adjusted feature F′ gives the following formula:
C_A = M_c(F) · M_s(F),
wherein C_A is the result of adding the attention mechanism on the basis of the CNN; in the specific domain-identification application, C_A includes background information in the collected data and is used to characterize the domain where the current user is located; when the network comprises multiple layers, C_A is fed as input into the calculation of the next layer.
8. The indoor scene understanding method based on the radio frequency signal multitask learning network as claimed in claim 7, wherein: the shared representation layer comprises two layers of convolution, and after each layer of convolution operation, the shared representation layer has a batch normalization layer and a structure with a leaked correction linear unit, so that the phenomena of gradient disappearance and gradient explosion are avoided.
9. The indoor scene understanding method based on the radio frequency signal multitask learning network according to claim 8, characterized in that: in step 3, using the wisnet network structure, the data input-output calculation process is as follows:
the original data set is D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, wherein each x_i passes through two hard-shared layers to obtain the shared-layer output S_i:

S_i = LeakyReLU(f(Σ_{i∈D} x_i * k_i^s + b_i^s)),
wherein k is the corresponding convolution kernel parameter and b is the bias; the convolved x_i is activated by LeakyReLU, and k and b are shared among the three tasks; during gradient updating, the task-specific gradient information is back-propagated together with the gradient information of the shared parameters;
in order to judge the domain in which the user is located, the network structure of Dom_Net is used; the shared-layer output S_i is first convolved; during training, changes in the distribution of intermediate-layer data can cause the gradients to vanish or explode, so to address this problem and speed up training, a batch normalization layer BN is applied; the BN output is activated by LeakyReLU and max-pooled after a one-dimensional convolution to obtain the result F_dom; the channel attention mechanism C_A then extracts the minimum value on each channel to obtain the adjusted feature F′_dom; after these two steps, the convolved data x_i is passed through a linear fully connected layer to obtain the output ŷ_dom = W_dom F′_dom + b_dom, wherein W_dom and b_dom are respectively the weight matrix and the bias matrix iteratively updated in the fully connected layer; the index corresponding to the maximum value of each row is the output of the network prediction, and the corresponding loss function is L_dom;
by the same principle, the output of S_i after Act_Net's three convolution layers is F_act; since the signals with larger variation amplitude contain more user-behavior information, an attention mechanism composed of average pooling and maximum pooling layers is added, and applying it yields the adjusted feature F′_act; the corresponding loss function is L_act;
comparatively, the network structure of Loc_Net is simple, because the CNN convolutional neural network is inherently sensitive to spatial information and the position can therefore be identified well without adding an attention mechanism; the output of Loc_Net's convolution layers, after the batch normalization layer and the activation function layer, is F_loc; likewise, the loss function of the final Loc_Net is L_loc;
since the shared layer is embedded in each sub-network, the loss back-propagated from each sub-network contains both the task-specific gradient information and the gradient information from the shared layer, namely θ comprises two parts, θ_sh and θ_i; the optimization objective of wisnet is to minimize the sum of the task losses min_θ Σ_i L_i, wherein L_i ∈ {L_dom, L_act, L_loc}, and the parameters are updated to minimize the objective function;
the final output of wisnet is the outputs of the three networks, ŷ_dom, ŷ_loc and ŷ_act, respectively corresponding to the domain where the user is located, the position within the current domain, and the executed action; from the information of the domain and the position, the specific meaning of the action can be inferred.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110891904.8A CN113587935B (en) | 2021-08-04 | 2021-08-04 | Indoor scene understanding method based on radio frequency signal multi-task learning network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110891904.8A CN113587935B (en) | 2021-08-04 | 2021-08-04 | Indoor scene understanding method based on radio frequency signal multi-task learning network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113587935A true CN113587935A (en) | 2021-11-02 |
CN113587935B CN113587935B (en) | 2023-12-01 |
Family
ID=78254994
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110891904.8A Active CN113587935B (en) | 2021-08-04 | 2021-08-04 | Indoor scene understanding method based on radio frequency signal multi-task learning network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113587935B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106709481A (en) * | 2017-03-03 | 2017-05-24 | 深圳市唯特视科技有限公司 | Indoor scene understanding method based on 2D-3D semantic data set |
CN107451620A (en) * | 2017-08-11 | 2017-12-08 | 深圳市唯特视科技有限公司 | A kind of scene understanding method based on multi-task learning |
US20200193296A1 (en) * | 2018-12-18 | 2020-06-18 | Microsoft Technology Licensing, Llc | Neural network architecture for attention based efficient model adaptation |
US20200302214A1 (en) * | 2019-03-20 | 2020-09-24 | NavInfo Europe B.V. | Real-Time Scene Understanding System |
CN112183395A (en) * | 2020-09-30 | 2021-01-05 | 深兰人工智能(深圳)有限公司 | Road scene recognition method and system based on multitask learning neural network |
CN112347933A (en) * | 2020-11-06 | 2021-02-09 | 浙江大华技术股份有限公司 | Traffic scene understanding method and device based on video stream |
CN112507835A (en) * | 2020-12-01 | 2021-03-16 | 燕山大学 | Method and system for analyzing multi-target object behaviors based on deep learning technology |
-
2021
- 2021-08-04 CN CN202110891904.8A patent/CN113587935B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106709481A (en) * | 2017-03-03 | 2017-05-24 | 深圳市唯特视科技有限公司 | Indoor scene understanding method based on 2D-3D semantic data set |
CN107451620A (en) * | 2017-08-11 | 2017-12-08 | 深圳市唯特视科技有限公司 | A kind of scene understanding method based on multi-task learning |
US20200193296A1 (en) * | 2018-12-18 | 2020-06-18 | Microsoft Technology Licensing, Llc | Neural network architecture for attention based efficient model adaptation |
US20200302214A1 (en) * | 2019-03-20 | 2020-09-24 | NavInfo Europe B.V. | Real-Time Scene Understanding System |
CN111723635A (en) * | 2019-03-20 | 2020-09-29 | 北京四维图新科技股份有限公司 | Real-time scene understanding system |
CN112183395A (en) * | 2020-09-30 | 2021-01-05 | 深兰人工智能(深圳)有限公司 | Road scene recognition method and system based on multitask learning neural network |
CN112347933A (en) * | 2020-11-06 | 2021-02-09 | 浙江大华技术股份有限公司 | Traffic scene understanding method and device based on video stream |
CN112507835A (en) * | 2020-12-01 | 2021-03-16 | 燕山大学 | Method and system for analyzing multi-target object behaviors based on deep learning technology |
Non-Patent Citations (3)
Title |
---|
ZHENG YU ET AL.: "Indoor scene recognition via multi-task metric multi-kernel learning from RGB-D images", 《MULTIMEDIA TOOLS AND APPLICATIONS》, vol. 76, no. 3, pages 4427 - 4443, XP036185148, DOI: 10.1007/s11042-016-3423-1 * |
JIANG Xiaoyuan: "Research on Scene Recognition Based on Deep Learning", China Master's Theses Full-text Database (Information Science and Technology), pages 138 - 1901 *
YANG Peng; CAI Qingqing; SUN Hao; SUN Lihong: "Indoor Scene Recognition Based on Convolutional Neural Networks", Journal of Zhengzhou University (Natural Science Edition), no. 03, pages 76 - 80 *
Also Published As
Publication number | Publication date |
---|---|
CN113587935B (en) | 2023-12-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11763599B2 (en) | Model training method and apparatus, face recognition method and apparatus, device, and storage medium | |
CN109983348A (en) | Realize the technology of portable frequency spectrum analyzer | |
CN108629380B (en) | Cross-scene wireless signal sensing method based on transfer learning | |
CN112036433B (en) | CNN-based Wi-Move behavior sensing method | |
CN114359738B (en) | Cross-scene robust indoor people number wireless detection method and system | |
AU2016200905A1 (en) | A system and method for identifying and analyzing personal context of a user | |
Hao et al. | CSI‐HC: A WiFi‐Based Indoor Complex Human Motion Recognition Method | |
CN111901028B (en) | Human body behavior identification method based on CSI (channel State information) on multiple antennas | |
CN114423034A (en) | Indoor personnel action identification method, system, medium, equipment and terminal | |
CN114781463A (en) | Cross-scene robust indoor tumble wireless detection method and related equipment | |
CN112052816A (en) | Human behavior prediction method and system based on adaptive graph convolution countermeasure network | |
Wu et al. | Topological machine learning for multivariate time series | |
Gu et al. | Device‐Free Human Activity Recognition Based on Dual‐Channel Transformer Using WiFi Signals | |
CN113587935A (en) | Indoor scene understanding method based on radio frequency signal multitask learning network | |
CN117221816A (en) | Multi-building floor positioning method based on Wavelet-CNN | |
CN112380903A (en) | Human activity identification method based on WiFi-CSI signal enhancement | |
CN113642457B (en) | Cross-scene human body action recognition method based on antagonistic meta-learning | |
CN114676727B (en) | CSI-based human body activity recognition method irrelevant to position | |
Gao et al. | A Multitask Sign Language Recognition System Using Commodity Wi‐Fi | |
CN116959059A (en) | Living body detection method, living body detection device and storage medium | |
CN113202461B (en) | Neural network-based lithology identification method and device | |
CN115002703A (en) | Passive indoor people number detection method based on Wi-Fi channel state information | |
CN114358162A (en) | Falling detection method and device based on continuous wavelet transform and electronic equipment | |
CN113378718A (en) | Action identification method based on generation of countermeasure network in WiFi environment | |
US12026977B2 (en) | Model training method and apparatus, face recognition method and apparatus, device, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |