CN111897416A - Self-adaptive blowing interaction method and system based on twin network - Google Patents

Self-adaptive blowing interaction method and system based on twin network

Info

Publication number
CN111897416A
CN111897416A · CN202010603459.6A
Authority
CN
China
Prior art keywords
data
blowing
domain
twin network
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010603459.6A
Other languages
Chinese (zh)
Inventor
杨承磊
陈叶青
盖伟
卞玉龙
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority: CN202010603459.6A
Publication: CN111897416A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides an adaptive blowing interaction method and system based on a twin network. Acquired raw blowing data are preprocessed; time-domain and frequency-domain features are extracted from the preprocessed data and packed into a grid-like image-frame format; a trained twin network model is used as the classifier for domain-adaptive learning, distinguishing blowing types by the strength, frequency, and duration of the blow according to the extracted features; finally, the blowing behavior is recognized to obtain the interaction information. The method is convenient to use, easy to operate, and low in cost.

Description

Self-adaptive blowing interaction method and system based on twin network
Technical Field
The disclosure belongs to the technical field of virtual reality interaction, and relates to a twin network-based self-adaptive blowing interaction method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Breathing is a human instinct, and because it can be consciously controlled, breath-based interaction is considered another control mechanism that can affect both the real physical world and the virtual world. Its convenience and controllability make breathing a simple, controllable operation suited to certain interaction scenarios. Breathing can therefore serve as a directly controllable natural interaction that complements modalities such as touch and voice.
To date, some studies have used breathing or blowing as a direct interactive input. Depending on whether dedicated detection equipment is needed to acquire the respiratory signal, breath detection methods fall into two categories: those based on custom detection devices and those based on an ordinary microphone. Among the former, one kind obtains the breathing signal by detecting the movement of the chest or abdomen, and the other obtains it by directly measuring the inhaled and exhaled airflow through a special device placed in the mouth. Microphone-based methods detect breathing-related information with the device's microphone and apply certain processing operations. However, these studies either rely on specially customized devices or apply only to a few specific scenes, and they rarely consider noise interference or variability across individuals and devices, so existing interaction methods based on breathing or blowing lack versatility.
Disclosure of Invention
The disclosure provides an adaptive blowing interaction method and system based on a twin network, which enable interaction through breathing signals and assist human information interaction.
According to some embodiments, the following technical scheme is adopted in the disclosure:
a twin network-based adaptive blowing interaction method comprises the following steps:
preprocessing the acquired raw blowing data;
extracting features from the preprocessed data, the features comprising time-domain and frequency-domain features, and packing the extracted features into a grid-like image-frame format;
using the trained twin network model as the classifier for domain-adaptive learning, and distinguishing the blowing type by the strength, frequency, and duration of the blow according to the extracted time-domain and frequency-domain features;
and recognizing the blowing behavior to obtain interaction information.
As an alternative embodiment, the preprocessing includes normalizing the blowing data and discretizing the continuous sound-wave data in the acquired blowing data with a sliding window to form discrete samples.
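As an illustration, this preprocessing can be sketched as follows. This is a minimal sketch, not the patent's implementation: the helper name is hypothetical, and the window length d = 2048 and overlap rate r = 0.75 follow the values fixed later in the embodiment.

```python
import numpy as np

def preprocess(wave, d=2048, r=0.75):
    """Normalize a raw blowing waveform to [-1, 1] and slice it into
    overlapping windows of length d with overlap rate r (hypothetical
    helper; d and r follow the embodiment)."""
    wave = np.asarray(wave, dtype=float)
    peak = np.max(np.abs(wave))
    if peak > 0:
        wave = wave / peak                    # amplitude normalization to [-1, 1]
    step = int(d * (1 - r))                   # hop size: 512 samples for d=2048, r=0.75
    samples = [wave[i:i + d] for i in range(0, len(wave) - d + 1, step)]
    return np.array(samples)

# one second of audio at the 8192 Hz sampling rate used in the embodiment
windows = preprocess(np.sin(np.linspace(0, 20 * np.pi, 8192)))  # 13 windows of 2048 points
```

Each row of `windows` is one discrete sample that the feature-extraction stage then consumes.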
As an alternative embodiment, the time-domain feature extraction process includes computing, for the discrete samples, the mean amplitude, the amplitude variance, the first-order difference, the zero-crossing rate within a set time, and the fluctuation amplitude.
As an alternative embodiment, the frequency-domain feature extraction process includes performing a Fourier transform on the raw blowing data and retaining the positive frequencies of the signal to obtain spectrum data, then computing the mean square frequency, frequency variance, and over-rate features; a correlation calculation is further performed on the peaks of the analyzed spectrum waveform to extract peak-related features.
As an alternative embodiment, the twin network model is constructed as follows: on top of the feature-metric contrastive loss function of the twin network model, a separation loss function is added when defining the semantic alignment loss function, and a classification loss function is set at the fully connected layer of the source-domain CNN processing stream.
Further, the semantic alignment loss function minimizes the distance between samples that belong to different domains (source or target) but share the same label, while the separation loss function maximizes the distance between samples that belong to different domains and have different labels.
An adaptive blowing interaction system based on a twin network, comprising:
an acquisition module configured to acquire blowing data of a user;
a preprocessing module configured to preprocess the acquired raw blowing data;
a feature extraction module configured to extract features from the preprocessed data, the features comprising time-domain and frequency-domain features, and to pack the extracted features into a grid-like image-frame format;
a classification module configured to use the trained twin network model as the classifier for domain-adaptive learning and to distinguish the blowing type by the strength, frequency, and duration of the blow according to the extracted time-domain and frequency-domain features;
and a recognition module configured to recognize the blowing behavior based on the blowing type to obtain the interaction information.
As an alternative embodiment, the classification module includes a first half and a second half. The first half comprises two CNNs for feature extraction, one for source-domain samples and one for target-domain samples;
the second half constructs a distance metric between the two feature vectors as the similarity function of the source-domain and target-domain data. The source-domain processing stream continues with an additional fully connected layer, which models the classification part.
As an alternative embodiment, the classification module further includes a training data collection sub-module, a feature extraction sub-module, and a model training sub-module:
the training data collection submodule is configured to collect user blowing data as input of the feature extraction submodule and is used for training the twin network model;
the feature extraction submodule is configured to perform normalization and discretization preprocessing operations on the original data acquired by the training data collection submodule; extracting features from the preprocessed data, and packaging the data after feature extraction into an image frame form as the input of a model training submodule;
and the model training submodule is configured to train the twin network model by using the feature data obtained by the feature extraction submodule.
A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor of a terminal device to execute the twin-network-based adaptive blowing interaction method.
A terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions, and the computer-readable storage medium storing a plurality of instructions adapted to be loaded by the processor to execute the twin-network-based adaptive blowing interaction method.
Compared with the prior art, the beneficial effect of this disclosure is:
the method disclosed by the invention is convenient to use, easy to operate and low in cost;
the present disclosure has the ability to adapt to noisy environments, individuals, and device differences, can be used in noisy scenes, and by different people, and when used on different devices, the effect is considerable. For example, the ambient music is 64db, the distance between the sound source and the user is not more than 0.5 m, 3-4 people in the same space normally communicate or are beside a noisy road, and the interaction accuracy is considerable;
the application crowd of the present disclosure is wide, except common crowd can use conveniently, some special crowd can use very conveniently too, like deaf-mute crowd that can't use the pronunciation, can't use the interactive limb obstacle crowd of touch-control, etc.;
the application range of the system is wide, the system can be used by different terminals, like an HTC VIVE virtual reality terminal, a PC terminal and a mobile phone terminal application system, and can be used in various interaction situations, such as answering a call during driving, learning and cooking in a noisy kitchen, combining the interaction mode with a digital dictionary and popularizing the interaction mode for deaf-mute people, and the like.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a system architecture diagram of the present embodiment;
FIG. 2(a) and FIG. 2(b) are real views used in the air-blowing interaction method of the present embodiment;
FIGS. 3(a) - (e) are schematic diagrams of the air-blowing interaction type in the present embodiment;
FIGS. 4(a) - (c) are sound waveforms of large blows obtained by three different devices in this embodiment;
FIG. 5 is a flowchart of the algorithm of the present embodiment;
FIG. 6 is a specific process of model training according to the present embodiment;
FIG. 7(a) and FIG. 7(b) show examples of feature-extraction image frames according to the present embodiment;
FIG. 8 is a twin network model representation of the present embodiment.
Detailed Description:
the present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
An adaptive blowing interaction system based on a twin network, comprising:
an acquisition module configured to acquire blowing data of a user;
a preprocessing module configured to preprocess the acquired raw blowing data;
a feature extraction module configured to extract features from the preprocessed data, the features comprising time-domain and frequency-domain features, and to pack the extracted features into a grid-like image-frame format;
a classification module configured to use the trained twin network model as the classifier for domain-adaptive learning and to distinguish the blowing type by the strength, frequency, and duration of the blow according to the extracted time-domain and frequency-domain features;
and a recognition module configured to recognize the blowing behavior based on the blowing type to obtain the interaction information.
As an alternative embodiment, the classification module includes a first half and a second half. The first half comprises two CNNs for feature extraction, one for source-domain samples and one for target-domain samples;
the second half constructs a distance metric between the two feature vectors as the similarity function of the source-domain and target-domain data. The source-domain processing stream continues with an additional fully connected layer, which models the classification part.
As an alternative embodiment, the classification module further includes a training data collection sub-module, a feature extraction sub-module, and a model training sub-module:
the training data collection submodule is configured to collect user blowing data as input of the feature extraction submodule and is used for training the twin network model;
the feature extraction submodule is configured to perform normalization and discretization preprocessing operations on the original data acquired by the training data collection submodule; extracting features from the preprocessed data, and packaging the data after feature extraction into an image frame form as the input of a model training submodule;
and the model training submodule is configured to train the twin network model by using the feature data obtained by the feature extraction submodule.
In execution, the system comprises a client side and a server side; the acquisition module is arranged at the client side, and the server side further processes and analyzes the collected data.
The adaptive blowing interaction method based on the twin network comprises the following steps:
(1) selecting N participants to collect blowing data for the M blowing types; these data serve as the original training data of the twin network model and form the training data set for training the classifier model;
(2) preprocessing the acquired raw blowing data: normalizing the data to the range [-1, 1] and discretizing the original continuous sound-wave data with a sliding window of length d data points and overlap rate r to form discrete samples (in this embodiment, d = 2048 and r = 0.75); then extracting features from the preprocessed data, comprising time-domain and frequency-domain parts, and packing the extracted features into a grid-like image-frame format;
(3) constructing a twin network (Siamese network) as the classifier model for domain-adaptive learning, training it on the extracted and packed image frames, and verifying the accuracy of the classification model through accuracy-comparison experiments;
(4) when a user uses blowing interaction in an application scene, first judging whether the user chooses to practice; if so, adding the practice data to the training data set and executing steps (2) and (3); otherwise, executing step (5) so that the user interacts with the application content by blowing;
(5) during use, detecting the user's blowing behavior through the microphone and sending the blowing data to the server side, which then recognizes the behavior: the user's blowing data are first preprocessed and their features extracted as in step (2), then the trained twin model recognizes the blowing type; the recognition result is returned to the user, and the corresponding interactive behavior is triggered and presented;
(6) judging whether the user's blowing interaction is finished; if so, ending the interaction; otherwise, returning to step (5).
In step (1) of this embodiment, N = 2, M = 5, and the sampling frequency is 8192 Hz; the blowing data of the two participants form a training data set large enough for training the classifier model. Experiments show that a blowing-signal sampling rate of 2048 Hz reduces data redundancy while preserving accuracy.
In step (1), data samples are first acquired with one Android phone (source data acquisition device A); abnormal samples are removed, and the remaining valid samples form the source-domain training data set (1043 items), consisting of feature data X_s and labels Y_s. Then an Android phone of a different model with large data differences (target data acquisition device B) is used to acquire a target-domain data set (10 items), consisting of known feature data X_t and labels Y_t. The 1043 samples serve as source-domain samples for training the twin network model, and the 10 samples serve as target-domain training samples: 8 blowing-interaction samples (2 for each of the 4 blowing types) and 2 speech-noise samples. In addition, 200 items of target-domain data are collected for test verification.
In the data acquisition process, this embodiment collects data for five blowing modes. Four blowing patterns are defined for interaction according to duration, intensity, and frequency, plus one recognition type for rejecting interference from speech:
big blow (large): very strong blowing lasting more than 2 seconds;
strong blow (gust): strong but short blowing, lasting about 1 second;
light blow (breeze): slow, mild blowing lasting about 2 seconds;
two strong blows (two gusts): 2 strong exhalations, each no more than 1 second, about 1 second apart;
noise: normal speech sound.
In step (3), the five blowing types are distinguished by the strength and duration of the blow using time-domain and frequency-domain features. The time-domain features are extracted with the following formulas:
$$M = \frac{1}{n}\sum_{i=1}^{n} \lvert X_i \rvert \qquad (1)$$
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 \qquad (2)$$
$$D = \frac{1}{n-1}\sum_{i=1}^{n-1}\left\lvert X_{i+1} - X_i \right\rvert \qquad (3)$$
$$p = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\left[X_i > a\right] \qquad (4)$$
$$p' = \frac{1}{7}\left(\sum_{i=1}^{4} TF(i) + \sum_{j=1}^{3}\mathbf{1}\left[TF(j+1) \neq TF(j)\right]\right) \qquad (5)$$
wherein X_i is the i-th element in a discrete sample and n is the length of each sample;
equation (1) is an average of the amplitudes, representing a short-time average amplitude.
Equation (2) is the variance of the amplitude, representing the severity of the change.
Equation (3) is a first order difference.
Formula (4) refers to the short-time zero-crossing rate, and takes the proportion that the normalized value is greater than a certain threshold value a as a characteristic; p is the calculated ratio, a is a specific threshold, and is the optimum parameter value found by a Grid search (Grid search) parameter optimization method. Finally, two values of 0.1 and 0.2 are selected as parameters of the characteristic.
Equation (5) characterizes the fluctuation amplitude. The data are divided into 4 equal segments; if the maximum value of a segment exceeds a threshold, the segment is marked true (1), otherwise false (0), yielding a truth table. The marks equal to true are then summed, and whenever the truth value of a segment differs from that of the previous segment, 1 is further added to the sum. The result is converted into a proportional value p' not greater than 1, which judges the overall fluctuation amplitude. Here S_i denotes the i-th divided segment, and TF(i) and the truth-change indicator are computed as follows:

$$TF(i) = \begin{cases} 1, & \max(S_i) > \theta \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$
$$\mathbf{1}\left[TF(j+1) \neq TF(j)\right] = \begin{cases} 1, & TF(j+1) \neq TF(j) \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$

where θ = 0.6, i ∈ {1, 2, 3, 4}, j ∈ {1, 2, 3}, and TF(j+1) ≠ TF(j) denotes a truth-value change between the (j+1)-th and j-th segments.
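The five time-domain features can be sketched together as follows. This is a best-effort reconstruction under the textual descriptions, since the original formula images are not available; the threshold a and θ = 0.6 follow the text.

```python
import numpy as np

def time_domain_features(x, a=0.1, theta=0.6):
    """Sketch of the five time-domain features (1)-(5) as described in the
    text (reconstruction; exact formulas in the patent are images)."""
    x = np.asarray(x, dtype=float)
    mean_amp = np.mean(np.abs(x))                    # (1) short-time average amplitude
    var_amp = np.var(x)                              # (2) amplitude variance
    first_diff = np.mean(np.abs(np.diff(x)))         # (3) first-order difference
    over_rate = np.mean(x > a)                       # (4) proportion above threshold a
    # (5) fluctuation amplitude: 4 equal segments, TF(i)=1 if segment max > theta
    segs = np.array_split(x, 4)
    tf = np.array([1 if s.max() > theta else 0 for s in segs])
    transitions = np.sum(tf[1:] != tf[:-1])          # truth-value changes between segments
    p_prime = (tf.sum() + transitions) / 7.0         # scaled to a value <= 1
    return mean_amp, var_amp, first_diff, over_rate, p_prime
```

For a 2048-point sample with a single burst in its second quarter, TF = [0, 1, 0, 0], giving p' = (1 + 2) / 7.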
The frequency-domain features are extracted as follows:
First, spectrum analysis is performed on the blowing data to obtain the blowing spectrum. Specifically, a Fourier transform is applied to the raw blowing data and the positive frequencies of the signal are retained, i.e., the frequencies [0, fs/2] in the first half of the spectrum. From the spectrum data, features such as the mean square frequency, the frequency variance, and the over-rate are then computed (the computation is analogous to that of the time-domain features such as the mean).
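This spectral step can be sketched as follows; the function name and the centroid-based frequency-variance form are assumptions, as the patent does not spell out the formulas.

```python
import numpy as np

def frequency_features(x, fs=8192):
    """Sketch: FFT, keep the positive half [0, fs/2], then compute the
    mean square frequency and frequency variance over the spectrum
    (reconstruction; exact definitions are not given in the text)."""
    x = np.asarray(x, dtype=float)
    spectrum = np.abs(np.fft.rfft(x))           # positive frequencies only
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs) # bin frequencies in [0, fs/2]
    p = spectrum / spectrum.sum()               # normalized spectral weights
    msf = np.sum(p * freqs**2)                  # mean square frequency
    fc = np.sum(p * freqs)                      # spectral centroid
    fvar = np.sum(p * (freqs - fc)**2)          # frequency variance
    return freqs, spectrum, msf, fvar
```

A pure 128 Hz tone sampled at 8192 Hz lands exactly on one FFT bin, so the mean square frequency comes out at 128².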
Furthermore, a correlation calculation is performed on the peaks of the analyzed spectrum waveform, and peak-related features are extracted. A peak is the maximum value of the signal over a time interval, often denoted X_peak. By peak counting, n peaks {X_p1, X_p2, ..., X_pn}, n < N, are found in a sequence {X_1, X_2, ..., X_N} of length N; the peak index of the sequence is:

$$X_{peak} = \frac{1}{n}\sum_{i=1}^{n} X_{pi}$$
The spectrum data are transformed by combining peak processing with a sliding window of length 10: the difference between the maximum and minimum values in each window is recorded, and the differences of all windows are summed and normalized, denoted the peak difference X'_peak-T. For the characteristics of blowing data, the peak value is slightly deformed to obtain the peak difference X'_peak-T, as shown in equation (8):

$$X'_{peak\text{-}T} = \frac{X'_{peak}}{\max X'_{peak}} \qquad (8)$$

where X'_peak is computed as follows:

$$X'_{peak} = \sum_{i=1}^{n}\left(\max\{X_i, \ldots, X_{i+m}\} - \min\{X_i, \ldots, X_{i+m}\}\right) \qquad (9)$$

Here n is the number of window segments, m is the size of the sliding window, max{·} and min{·} denote the maximum and minimum of the length-m sequence {X_i, X_{i+1}, ..., X_{i+m}}, and max X'_peak is the maximum X'_peak observed over the existing test data; equation (8) thus performs a simple class normalization of the feature X'_peak-T.
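The peak index and the sliding-window peak difference can be sketched as follows. The window step is not specified in the text, so non-overlapping windows are assumed here, and the final normalization by max X'_peak is omitted because it needs a reference data set.

```python
import numpy as np

def peak_index(x, peaks_idx):
    """Peak index: mean of the n detected peak values, (1/n) * sum(X_pi)."""
    return float(np.mean(np.asarray(x, dtype=float)[peaks_idx]))

def peak_difference(x, m=10):
    """Sliding-window peak difference: sum of (max - min) over windows of
    length m (non-overlapping windows assumed; normalization omitted)."""
    x = np.asarray(x, dtype=float)
    diffs = [x[i:i + m].max() - x[i:i + m].min()
             for i in range(0, len(x) - m + 1, m)]
    return float(np.sum(diffs))
```

For a monotone ramp of 100 values, each length-10 window spans 9 units, so the un-normalized peak difference is 90.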
In the step (4), the basic idea of realizing the self-adaptive blowing interaction method based on the twin network model is as follows:
first, the present embodiment processes the blowing data into the form of image frames using the mentioned method of feature extraction, and then inputs "image frame pairs" constructed from the source domain data and the target domain data to the twin network model. The twin network may be divided into a first half and a second half. The former part is modeled by two CNNs for feature extraction (i.e. the embedding function g), i.e. there are two CNN processing streams: one for source domain samples and one for target domain samples. The second half is a distance metric that constructs two feature vectors as a function of similarity calculations for the source domain data and the target domain data. The source domain process flow will continue to be modeled using an additional fully connected layer (i.e., the prediction function h), i.e., the modeling of the class identification portion.
The construction process of the twin network model in the step (4) is as follows:
the method solves the problem of equipment difference and the robust signal classification problem encountered by air blowing interaction by constructing a contrast Loss Function (contrast Loss Function) through appropriate modification on the infrastructure of a twin network. Namely, on the basis of a common feature metric comparison loss function, a separation loss function is added when a semantic alignment loss function is defined so as to better process the similar problems of different fields but belonging to the same category and the separation and problems of different fields and different categories, and a classification loss function is arranged at a full connection layer of a CNN source domain processing flow so as to improve the classification precision. The twin network model used in this embodiment is based on the following specific formula:
$$L(h \circ g) = L_C(h \circ g) + L_{SA}(g) + L_S(g) \qquad (10)$$

where L_C denotes the classification loss, L_SA the semantic alignment loss, and L_S the separation loss.
In training, the model comprises two parts. The first part is the classification loss, which is minimized to ensure high classification accuracy. The second part forms the feature-metric contrastive loss of the twin network, comprising the semantic alignment loss and the separation loss: the semantic alignment loss minimizes the distance between samples that belong to different domains (source or target) but share the same label, while the separation loss maximizes the distance between samples that belong to different domains and have different labels. In addition, g corresponds to the feature-extraction part of the depth model, and h corresponds to the fully connected layer used by the source-domain stream, i.e., the classification part.
Regarding the first part of equation (10), E denotes statistical expectation and loss is a suitable classification loss function, which is used in this embodiment. The functions g and h are the two functions constructing the twin network model, and h ∘ g denotes their composition: g is the embedding function from the initial input space X into the feature embedding space Z, h is the prediction function from Z to the classification result Y, and h ∘ g represents the processing from X to Y, i.e., X → Z → Y.
Regarding the second part of equation (10), T is the number of class labels (5 classes in this classification task) and l is the current label. N_l^S and N_l^T denote the numbers of samples with label l in the source-domain and target-domain data, respectively. X_i^S = X^S|{Y=l} and X_j^T = X^T|{Y=l} denote random data samples of the specified domain with label l. The distance d(g(X_i^S), g(X_j^T)) is a suitable distance between X_i^S and X_j^T as distributed in the embedding space, and ||·||_F denotes the Frobenius norm, a matrix norm defined as the square root of the sum of the squares of the absolute values of the matrix elements.
Regarding the third part of equation (10), l represents the label of the current source domain sample, and l 'represents the label of the current target domain sample, please note that l ≠ l'. N is a radical ofl SRepresents the number of samples in the source domain data in the class labeled l, and Nl' TThe number of samples in the target domain in the class labeled l'. Xi SIs from { XSSample of | -Y ═ l } representing a random data sample in the source domain when labeled l, Xj TIs from { XTSample of l 'represents a random data sample in the target domain when labeled l', Xi SAnd Xj TThe corresponding relationship of (a) is represented in the previous fig. 4-4 as a solid green line connecting different tag data of different fields. Distance metric sim refers to Xi SAnd Xj TA suitable similarity grid representation distributed in the embedding space. Calculation using Euclidean distance
sim(g(Xi^S), g(Xj^T)) = max(0, m − ||g(Xi^S) − g(Xj^T)||)
m is a hyperparameter, the separation margin, representing the separability interval of the embedding space. When g(Xi^S) and g(Xj^T) are distributed very similarly in the embedding space, sim is close to m; when they lie far apart, sim is close to 0. When Xi^S and Xj^T with different labels lie too close to each other in the embedding space, the classification accuracy drops; the penalty factor is therefore added so that the distance between samples of different labels belonging to different domains is maximized, achieving a semantic alignment that is less prone to errors.
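The alignment distance d and the separation term sim above can be sketched in NumPy. This is an illustrative reconstruction (the function names and the margin value are hypothetical), not the patent's implementation:

```python
import numpy as np

def semantic_alignment_distance(g_s, g_t):
    # d: Frobenius-norm distance between same-label source/target embeddings
    # (for vectors this coincides with the Euclidean norm)
    return np.linalg.norm(g_s - g_t)

def separation_similarity(g_s, g_t, m=1.0):
    # sim = max(0, m - ||g_s - g_t||): close to m for similar embeddings,
    # 0 once they are separated by more than the margin m
    return max(0.0, m - np.linalg.norm(g_s - g_t))

a = np.array([1.0, 2.0])
b = np.array([1.0, 2.0])    # identical embedding: d = 0, sim = m
c = np.array([10.0, 10.0])  # distant embedding: sim = 0, no penalty

print(semantic_alignment_distance(a, b))  # 0.0
print(separation_similarity(a, b))        # 1.0 (= m)
print(separation_similarity(a, c))        # 0.0 (beyond the margin)
```

Minimizing d over same-label pairs while penalizing nonzero sim over different-label pairs yields the semantic alignment plus separation behavior described above.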
In the step (4), the specific steps of training the twin model algorithm include:
inputting: source domain data and target domain data which are subjected to preprocessing and feature extraction;
and (3) outputting: a classification model twin network model;
the specific details include:
(4-1) setting parameters including Sample _ per _ class, repetition, input _ size, classes, alpha;
(4-2) creating a CNN model;
(4-3) declaring 2 CNN data streams;
(4-4) creating a prediction function h;
(4-5) calculating euclidean distances of the data processed by the two CNN streams;
(4-6) constructing and compiling the twin network model, i.e., L(h∘g);
and (4-7) training the twin network model.
In step (4-1), the parameter Sample_per_class represents how many samples of each class of the target domain are used for training; in this embodiment there are 2 samples per class.
repetition represents the number of times the model is repeatedly trained. In general, the more repetitions, the longer the training takes and the higher the accuracy; however, overfitting may occur, and the accuracy instead drops when the number is too high. Trading off accuracy against time efficiency, 2 repetitions are used in this embodiment.
input_size represents the size of the input data, 16 × 16 in this embodiment: the original data is processed by the feature extraction method and packaged into an image-frame format of size 16 × 16 as the input of the model.
classes declares the number of classification categories; there are 5 types of blowing to be identified in this embodiment.
Alpha is the weight assigned to the different loss functions: the classification loss function has weight α, while the semantic alignment loss function and the separation loss function have weight 1 − α; in this embodiment α = 0.75. For the blowing-interaction adaptation problem addressed in this embodiment, verification experiments training the model multiple times showed that when α ranges over (0, 1) the recognition accuracy of the model varies by less than 10%, that the model essentially cannot classify when α is 0, and that the model achieves good and stable recognition accuracy when α is between 0.7 and 0.8.
In step (4-2), the Keras library in Python is used to create the CNN model; cnn_model has the structure input layer → convolutional layer (ReLU) → pooling layer → Flatten → fully-connected layer → softmax → output layer. The hyperparameters of the model are set to a 3 × 3 kernel, 32 filters, and 50 epochs.
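A minimal Keras sketch of the embedding part of such a CNN follows; layer sizes use the embodiment's settings (3 × 3 kernel, 32 filters, 16 × 16 input), while the function name is hypothetical and the fully-connected/softmax head is attached later when the twin network is wired up:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def create_cnn_model():
    # embedding function g: input -> conv (ReLU) -> pooling -> flatten
    return keras.Sequential([
        keras.Input(shape=(16, 16, 1)),            # 16x16 feature "image frames"
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
    ])

cnn_model = create_cnn_model()
emb = cnn_model.predict(np.zeros((1, 16, 16, 1)), verbose=0)
print(emb.shape)  # (1, 1568): the flattened 7 x 7 x 32 feature map
```

One 3 × 3 valid convolution maps 16 × 16 to 14 × 14, the 2 × 2 pooling halves it to 7 × 7, and flattening yields a 1568-dimensional feature vector per sample.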
In step (4-3), the twin network used in this embodiment needs two CNNs to process the source domain data and the target domain data respectively; the two data processing streams use the same CNN model with shared parameters. The source-domain processing structure Process_source_CNN is declared via cnn_model(variable representing the source-domain input data), and the target-domain structure Process_target_CNN via cnn_model(variable representing the target-domain input data).
In step (4-4), the prediction function is created following the general process of creating a fully-connected layer. Specifically: the Dropout() function is used to prevent overfitting; its role is, at each update during training, to randomly disable a given proportion of the output nodes of the previous layer, i.e., their outputs are set to zero and their weights are not updated. This embodiment uses a dropout ratio of 0.5. A fully-connected neural network layer is created with the Dense() function, the input size of the first layer matching the size of the input data. The softmax predefined activation function is applied to the model output using the Activation() function.
In step (4-5), the function computes the Euclidean distance between the data of the two data streams according to the Euclidean distance formula.
In step (4-6), the domain-adaptive twin network model is created, and its inputs and outputs are declared. The model is compiled by calling the compile method, which configures the learning process: declaring the optimizer, defining the loss functions, and so on. The loss weights are also declared here: the weight of the classification loss function is α, and the weight of the semantic alignment loss and separation loss functions is 1 − α.
In step (4-7), the data pairs of the source domain and the target domain are first formed, and then the twin network model is trained.
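Steps (4-3) to (4-7) can be sketched in Keras as follows. This is an illustrative reconstruction under stated assumptions: the variable names, the margin value, and the contrastive formulation of the combined alignment/separation loss on the distance output are assumptions, not the patent's exact code.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

classes, alpha, m = 5, 0.75, 1.0  # class count and loss weight from the embodiment; margin assumed

cnn_model = keras.Sequential([     # shared embedding g
    keras.Input(shape=(16, 16, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
])

# (4-3): two data streams through the same CNN, parameters shared
source_in = keras.Input(shape=(16, 16, 1))
target_in = keras.Input(shape=(16, 16, 1))
source_emb = cnn_model(source_in)
target_emb = cnn_model(target_in)

# (4-4): prediction function h on the source-domain stream
pred = layers.Dense(classes, activation="softmax")(layers.Dropout(0.5)(source_emb))

# (4-5): Euclidean distance between the two processed streams
dist = layers.Lambda(lambda t: tf.norm(t[0] - t[1], axis=1, keepdims=True))(
    [source_emb, target_emb])

def pair_loss(y_true, d):
    # y_true = 1 for same-label pairs (semantic alignment: pull together),
    # y_true = 0 for different-label pairs (separation: push beyond margin m)
    return tf.reduce_mean(y_true * tf.square(d) +
                          (1.0 - y_true) * tf.square(tf.maximum(m - d, 0.0)))

# (4-6): build and compile, weighting the classification loss by alpha
# and the alignment/separation pair loss by 1 - alpha
twin = keras.Model([source_in, target_in], [pred, dist])
twin.compile(optimizer="adam",
             loss=["categorical_crossentropy", pair_loss],
             loss_weights=[alpha, 1.0 - alpha])

# (4-7) would then fit the model on source/target data pairs
p, d = twin.predict([np.zeros((2, 16, 16, 1)), np.zeros((2, 16, 16, 1))], verbose=0)
print(p.shape, d.shape)  # (2, 5) (2, 1)
```

Identical dummy inputs give a zero distance between the two streams, and the source branch outputs a softmax distribution over the 5 blowing classes.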
In step (5), before formally using the blowing interaction interface, the user can choose to do some blowing practice to become familiar with the available interaction forms; at the same time, blowing data capturing the user's individual characteristics is acquired and added to the training data set on the server side, so that the model better memorizes the user's blowing pattern and the recognition accuracy is improved. During practice, the user blows correctly according to the prompted blowing types, each blowing type is practiced at least 2 times, and the user's personal blowing data is uploaded to the training set on the server.
In the step (6), the specific steps of the blowing type identification algorithm include:
inputting: a preset threshold value theta is used for judging whether the user performs blowing action;
(6-1) when the interactive interface in the present embodiment is used, cyclically executing the following steps;
(6-2) the user side acquires the volume value v of the blowing sound in real time;
(6-3) comparing v with the threshold theta: if v is larger than theta, the user side sends a request to the server side and the following steps are executed; otherwise, return to step (6-2);
(6-4) the server side acquires the sound data d, and then preprocessing and feature extraction operations are carried out on the data d to obtain d'; identifying the current blowing action according to the d', obtaining a blowing type a, and sending the type a to the client;
and (6-5) the user side executes corresponding interactive operation in the relevant application scene according to the received a.
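The loop in steps (6-1) to (6-5) can be sketched as follows; the threshold value, the function names, and the stand-in classifier are hypothetical, with the server call reduced to a local function for illustration:

```python
def classify_on_server(sound_data):
    # stand-in for step (6-4): preprocessing, feature extraction and model
    # inference on the server side; always returns one of the 5 blowing types
    return "gust"

def interaction_loop(volumes, theta=0.35):
    # volumes: the volume values v read by the user side each cycle (steps (6-1)/(6-2))
    actions = []
    for v in volumes:
        if v > theta:                    # step (6-3): only query the server when blowing
            a = classify_on_server("d")  # step (6-4): server returns the blowing type a
            actions.append(a)            # step (6-5): client triggers the interaction
    return actions

print(interaction_loop([0.1, 0.6, 0.2, 0.9]))  # ['gust', 'gust']: two readings exceed theta
```

Readings at or below theta never reach the server, matching the volume-gated behavior described above.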
The user side acquires the blowing interaction data, sends it to the server side for identification via Socket communication, and then performs the corresponding interactive operation on the running application according to the blowing identification result received from the server side. The server side is the back end used to process the blowing data and to train and run the recognition model. When a user uses the blowing interaction interface, the recognition model running on the server recognizes the blowing action from the user's blowing data and sends the recognition result back to the user side; user sides include PC applications, mobile phone applications, virtual reality system applications (such as HTC VIVE), and so on. Once the user side receives the identified type of the blowing action, the application triggers the corresponding interaction.
As a typical example, as shown in FIG. 1, a Web server acts as the server side for acquiring and processing blowing data and for training and running the classification model. When the user uses the blowing interaction interface, the classification prediction model running on the server identifies the user's blowing action from the blowing data and sends the identification result back to the user side, which includes PC applications, mobile phone applications, virtual reality system applications (such as HTC VIVE), and so on. If the user selects the practice mode on the user side, the practice data is added to the training data set on the server side, and the training model is then updated.
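The Socket round trip between user side and server side can be sketched with a minimal localhost stand-in; the reply value and the one-shot message framing are assumptions for illustration, not the patent's protocol:

```python
import socket
import threading

# server side: accept one connection, read the blowing data d, reply with the type a
srv = socket.socket()
srv.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]

def serve_once():
    conn, _ = srv.accept()
    _ = conn.recv(1024)             # raw blowing data from the user side
    conn.sendall(b"gust")           # identified blowing type (stand-in for the model)
    conn.close()

t = threading.Thread(target=serve_once)
t.start()

# user side: send the captured blowing data, receive the identified type
cli = socket.socket()
cli.connect(("127.0.0.1", port))
cli.sendall(b"blow-data")
blow_type = cli.recv(16).decode()
cli.close()
t.join()
srv.close()

print(blow_type)  # gust
```

In the described system the same pattern runs continuously: each detected blow is sent over the socket, and the returned type drives the client's interaction.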
Part (a) of fig. 2 shows user practice: the user chooses to practice blowing first so as to obtain a better interaction effect during formal use. Part (b) of fig. 2 shows the identified blowing type received by the client being used by the running application to execute corresponding operations, such as playing a video on the PC, zooming a map view on the mobile phone, or running a virtual reality game.
The five figures show, in order, the sound waves acquired by a microphone for the four types of blowing interaction (big blow, forced call, light blow, two forced calls) and for a speaking voice:
big blow (large): a very strong blow lasting more than 2 seconds, see fig. 3(a);
forced call (gust): a strong but short blow lasting about 1 second, see fig. 3(b);
light blow (breeze): slow, gentle and short, lasting about 2 seconds, see fig. 3(c);
two forced calls (two gusts): 2 strong blows, each no longer than 1 second, with an interval of about 1 second, see fig. 3(d);
noise (noise): normal speaking voice, see fig. 3(e).
The difference in the blowing data obtained with different devices is illustrated taking a PC, an OPPO R7 mobile phone, and an OPPO R11S mobile phone as examples. Fig. 4(a), (b) and (c) show the blowing data obtained using the PC, the OPPO R7 and the OPPO R11S respectively; the blowing data obtained by different devices differ, which results in different testing accuracy. Comparing fig. 4(a) with (b) and (c) shows that the blowing data obtained by the same tester performing the same interaction task differ between the PC and the Android mobile phone, and comparing fig. 4(b) with (c) shows that the data also differ when the same tester uses mobile phones of different models.
As shown in fig. 5, a specific flowchart of this embodiment:
(1) randomly selecting two persons for collecting sound wave data of 5 blowing types;
(2) forming a training set for training the classifier model; if a practice session exists, adding the practice data and updating the training set;
(3) carrying out preprocessing operations such as normalization and discretization on the raw data; extracting time-domain and frequency-domain features from the preprocessed data, and then packaging the extracted features into a grid-like image frame format;
(4) selecting a twin network as a classification model, and training the constructed twin network by using the extracted features;
(5) entering a client application stage, setting a required threshold value theta in advance, and judging whether the user performs blowing action;
(6) judging whether the user chooses to practice or not, if so, skipping to the step (2), and updating the training data set; otherwise, jumping to the step (7);
(7) the user formally uses the blowing interaction interface, and the client monitors the blowing volume v in real time;
(8) judging whether v is larger than theta in real time, and if so, skipping to the step (9); otherwise, jumping to the step (7);
(9) using the trained online model to perform recognition and classification at the server side;
(10) the server sends the identification result to the corresponding application client;
(11) according to the identification result, the client executes corresponding interactive operation, triggers corresponding interactive behavior and presents the behavior to the user;
(12) whether the use is finished or not, if so, finishing; otherwise, jumping to step (7).
As shown in fig. 6, the training process of the twin-network-based adaptive blowing interaction method mainly includes data acquisition, data preprocessing, feature extraction, model training, classifier identification, and output of the class result. Data acquisition collects the source domain and target domain samples; data preprocessing standardizes and discretizes the acquired raw data; feature extraction first extracts the time-domain and frequency-domain features and then packages them into a grid-like image frame as the input of the model; model training constructs and trains the adaptive twin network model; classifier identification inputs the data into the trained model for recognition and finally outputs the recognition result.
For a blowing sample that has undergone data standardization and discretization, the relevant features are extracted: time-domain extraction yields basic statistical features such as the short-time average amplitude, the amplitude variance and the first-order difference, together with features such as the short-time zero-crossing rate and the fluctuation amplitude; frequency-domain extraction yields features such as the mean square frequency, the frequency variance and the over-value rate of the frequency, together with peak features of the transformed spectrum. These features are simply fused and packaged into a grid image frame format. Fig. 7(a) and fig. 7(b) show example image-frame representations of two blowing-type signals after feature extraction.
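Several of the time-domain and frequency-domain features named above can be sketched in NumPy; the sampling rate and function names are illustrative, and the patent's exact formulas (e.g., for the over-value rate and the peak features) are not reproduced:

```python
import numpy as np

def time_domain_features(x):
    return {
        "mean_amplitude": np.mean(np.abs(x)),             # short-time average amplitude
        "amplitude_variance": np.var(x),
        "first_difference": np.mean(np.abs(np.diff(x))),  # first-order difference
        "zero_crossing_rate": np.mean(np.diff(np.sign(x)) != 0),
    }

def frequency_domain_features(x, fs=16000):
    spec = np.abs(np.fft.rfft(x))            # keep the positive frequencies of the signal
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    p = spec ** 2 / np.sum(spec ** 2)        # normalized power spectrum as weights
    centroid = np.sum(p * freqs)
    return {
        "mean_square_frequency": np.sum(p * freqs ** 2),
        "frequency_variance": np.sum(p * (freqs - centroid) ** 2),
    }

# synthetic 440 Hz tone as a stand-in for one second of blowing audio
t = np.linspace(0, 1, 16000, endpoint=False)
x = np.sin(2 * np.pi * 440 * t)
print(round(frequency_domain_features(x)["mean_square_frequency"]))  # 193600 = 440^2
```

For a pure tone all spectral power sits in one bin, so the mean square frequency reduces to the tone frequency squared, which makes the feature easy to sanity-check.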
As shown in fig. 8, the twin network model in this embodiment processes the blowing signal as follows: a pair of "image frames" formed from the source domain data and the target domain data is provided to the twin network model. The twin network can be divided into a first half and a second half. The first half performs feature extraction with two CNNs (i.e., the embedding function g): there are two CNN processing streams, one for the source domain samples and the other for the target domain samples, and this part outputs two feature vectors. The second half constructs the distance metric of the two feature vectors as the similarity calculation function of the source domain data and the target domain data. Finally, the CNN source-domain processing stream continues with an additional fully-connected layer (i.e., the prediction function h), where the classification loss function is set, i.e., the modeling of the classification-identification part. The convolutional network structure used in this embodiment is: input layer → convolutional layer (ReLU) → pooling layer → Flatten → fully-connected layer → softmax → output layer.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. A self-adaptive blowing interaction method based on a twin network, characterized by comprising the following steps:
preprocessing the acquired original blowing data;
extracting characteristics of the preprocessed data, wherein the characteristics comprise a time domain characteristic and a frequency domain characteristic, and then packaging the extracted characteristics into a grid image frame format;
using the trained twin network model as the classifier model for solving domain-adaptive learning, and distinguishing the blowing type by the strength, frequency and duration of blowing according to the extracted time domain characteristics and frequency domain characteristics;
and identifying the blowing behavior to obtain interactive information.
2. The twin network-based adaptive blowing interaction method as claimed in claim 1, wherein: the preprocessing comprises the steps of normalizing the blowing data and discretizing continuous voice sound wave data in the collected blowing data by utilizing a sliding window to form discrete samples.
3. The twin network-based adaptive blowing interaction method as claimed in claim 1, wherein: the time domain feature extraction process comprises the steps of calculating the average value of the amplitudes of the discrete samples, the variance of the amplitudes, the first-order difference, the zero-crossing rate in the set time and the features of the fluctuation amplitude.
4. The twin network-based adaptive blowing interaction method as claimed in claim 1, wherein: the process of frequency domain feature extraction comprises the steps of carrying out Fourier transform on original blowing data, reserving the positive frequency of a signal, obtaining frequency spectrum data, and then solving the mean square frequency, the frequency variance and the over-rate feature of the frequency; and performing correlation calculation on the peak value according to the analyzed spectrum waveform, and extracting correlation characteristics about the peak value.
5. The twin network-based adaptive blowing interaction method as claimed in claim 1, wherein: the construction process of the twin network model comprises the following steps: on the basis of comparing loss functions with feature metrics of a twin network model, adding a separation loss function when defining a semantic alignment loss function, and setting a classification loss function in a full connection layer of a CNN source domain processing flow;
a semantic alignment loss function that minimizes the distance between samples belonging to the source domain or the target domain that have the same label; the separation loss function maximizes the distance between samples belonging to either the source domain or the target domain and having different labels.
6. A self-adaptive blowing interaction system based on a twin network, characterized by comprising:
an acquisition module configured to acquire blowing data of a user;
a preprocessing module configured to preprocess the acquired raw blowing data;
the characteristic extraction module is configured to extract characteristics of the preprocessed data, wherein the characteristics comprise a time domain characteristic and a frequency domain characteristic, and then the extracted characteristics are packaged into a format of a grid-shaped image frame;
the classification module is configured to use the trained twin network model as the classifier model for solving domain-adaptive learning and to distinguish the blowing type by the strength, frequency and duration of blowing according to the extracted time domain characteristics and frequency domain characteristics;
and the identification module is configured to identify the blowing behavior based on the blowing type to obtain the interactive information.
7. The twin network based adaptive blowing interaction system of claim 6, characterized in that: the classification module comprises a first half and a second half, wherein the first half comprises two CNN networks for feature extraction, one CNN network being used for the source domain samples and the other for the target domain samples;
the second half is to construct distance measurement of two feature vectors as a similarity calculation function of the source domain data and the target domain data; the source domain process flow will continue to be modeled using additional fully connected layers, i.e., modeling of the class identification portion.
8. An adaptive blowing interaction system based on twin networks as claimed in claim 6 or 7, characterized in that: the classification module further comprises a training data collection submodule, a feature extraction submodule and a model training submodule:
the training data collection submodule is configured to collect user blowing data as input of the feature extraction submodule and is used for training the twin network model;
the feature extraction submodule is configured to perform normalization and discretization preprocessing operations on the original data acquired by the training data collection submodule; extracting features from the preprocessed data, and packaging the data after feature extraction into an image frame form as the input of a model training submodule;
and the model training submodule is configured to train the twin network model by using the feature data obtained by the feature extraction submodule.
9. A computer-readable storage medium characterized by: a plurality of instructions are stored therein, the instructions being adapted to be loaded by a processor of a terminal device and to perform a twin network based adaptive blowing interaction method according to any of claims 1-5.
10. A terminal device is characterized in that: the system comprises a processor and a computer readable storage medium, wherein the processor is used for realizing instructions; the computer readable storage medium is used for storing a plurality of instructions adapted to be loaded by a processor and to execute a twin network based adaptive insufflation interaction method of any one of claims 1-5.
CN202010603459.6A 2020-06-29 2020-06-29 Self-adaptive blowing interaction method and system based on twin network Pending CN111897416A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010603459.6A CN111897416A (en) 2020-06-29 2020-06-29 Self-adaptive blowing interaction method and system based on twin network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010603459.6A CN111897416A (en) 2020-06-29 2020-06-29 Self-adaptive blowing interaction method and system based on twin network

Publications (1)

Publication Number Publication Date
CN111897416A true CN111897416A (en) 2020-11-06

Family

ID=73207929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010603459.6A Pending CN111897416A (en) 2020-06-29 2020-06-29 Self-adaptive blowing interaction method and system based on twin network

Country Status (1)

Country Link
CN (1) CN111897416A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095073A (en) * 2016-05-30 2016-11-09 北京小米移动软件有限公司 The sending method of control instruction and device
US20170300109A1 (en) * 2016-04-14 2017-10-19 National Taiwan University Method of blowable user interaction and an electronic device capable of blowable user interaction
CN109634404A (en) * 2018-11-01 2019-04-16 济南奥维信息科技有限公司济宁分公司 A kind of system and method for the controllable interactive interface based on air blowing
CN109766921A (en) * 2018-12-19 2019-05-17 合肥工业大学 A kind of vibration data Fault Classification based on depth domain-adaptive


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YEQING CHEN, YULONG BIAN, CHENGLEI YANG: "Leveraging Blowing as a Directly Controlled Interface", 2019 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation *
WANG KUNMING: "Research on Underwater Target Recognition Methods Based on Transfer Learning", China Master's Theses Full-text Database, Basic Sciences *

Similar Documents

Publication Publication Date Title
CN110556129B (en) Bimodal emotion recognition model training method and bimodal emotion recognition method
Pahar et al. COVID-19 detection in cough, breath and speech using deep transfer learning and bottleneck features
Fisher et al. Speaker association with signal-level audiovisual fusion
CN107492382B (en) Voiceprint information extraction method and device based on neural network
KR20180125905A (en) Method and apparatus for classifying a class to which a sentence belongs by using deep neural network
CN108648748A (en) Acoustic events detection method under hospital noise environment
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
Ghai et al. Emotion recognition on speech signals using machine learning
CN109394258A (en) A kind of classification method, device and the terminal device of lung&#39;s breath sound
CN111461204A (en) Emotion identification method based on electroencephalogram signals and used for game evaluation
Zhang et al. Intelligent Facial Action and emotion recognition for humanoid robots
Rao et al. Recognition of emotions from video using acoustic and facial features
Xu et al. Multi-type features separating fusion learning for Speech Emotion Recognition
CN110956142A (en) Intelligent interactive training system
Wu et al. The DKU-LENOVO Systems for the INTERSPEECH 2019 Computational Paralinguistic Challenge.
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
CN117746910A (en) Dual-channel CNN-LSTM lung sound classification model training method and system
KR20190125668A (en) Apparatus and method for analyzing emotional status of pet
Huynh et al. Semi-supervised tree support vector machine for online cough recognition
KR101564176B1 (en) An emotion recognition system and a method for controlling thereof
CN111897416A (en) Self-adaptive blowing interaction method and system based on twin network
CN109814716A (en) A kind of motion intention coding/decoding method based on dynamic surface electromyography signal
CN114999633A (en) Depression identification method and system based on multi-mode shared vector space
Pleva et al. Automated Covid-19 respiratory symptoms analysis from speech and cough
Gomes Implementation of i-vector algorithm in speech emotion recognition by using two different classifiers: Gaussian mixture model and support vector machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201106