WO2021169209A1 - Method, apparatus and device for recognizing abnormal behavior on the basis of voice and image features - Google Patents

Method, apparatus and device for recognizing abnormal behavior on the basis of voice and image features

Info

Publication number
WO2021169209A1
Authority
WO
WIPO (PCT)
Prior art keywords
recognized
feature
training
image
layer
Application number
PCT/CN2020/111664
Other languages
French (fr)
Chinese (zh)
Inventor
雷宇泽
陈远旭
周宝
骆加维
廖智
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2021169209A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/251 Fusion techniques of input or preprocessed data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/02 Banking, e.g. interest calculation or account maintenance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q 50/10 Services
    • G06Q 50/26 Government or public services

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method, device and equipment for identifying abnormal behaviors based on voice and image features.
  • The security systems of the service industry bear on social stability and the safety of people's property, and have long been a focus of security development.
  • For example, the existing security systems of bank branches can no longer reliably guarantee branch operations and the safety of personnel inside the branches.
  • Some security systems in the service industry rely on triggered alarms or video surveillance; such systems can only notify the relevant personnel to respond after a dangerous person has already entered.
  • In view of this, the present application provides a method, device, and equipment for identifying abnormal behavior based on voice and image features.
  • The main purpose is to solve the technical problem of the low accuracy of current abnormal behavior recognition based on voice and image features.
  • According to a first aspect, a method for identifying abnormal behavior based on voice and image features is provided, including:
  • after detecting that the user has entered the recognition area, controlling the camera to acquire the user's action image to be recognized, and at the same time starting the recording device to record the voice to be recognized for a predetermined time;
  • inputting the fusion feature vector to be recognized into the abnormal behavior recognition model for processing, and outputting the corresponding human action category and whether that category constitutes abnormal behavior.
  • According to a second aspect, an abnormal behavior recognition device based on voice and image features is provided, including:
  • an acquisition module, configured to, after detecting that the user has entered the recognition area, control the camera to acquire the user's action image to be recognized and at the same time start the recording device to record the voice to be recognized for a predetermined time;
  • an image feature extraction module, configured to perform feature extraction on the action image to be recognized to obtain a feature matrix to be recognized;
  • a feature processing module, configured to process the feature matrix to be recognized with the human body feature extraction model to obtain the corresponding image feature vector to be recognized;
  • a voice feature extraction module, configured to perform text feature extraction on the voice to be recognized to obtain a voice feature vector to be recognized;
  • a fusion feature module, configured to cross-fuse the image feature vector to be recognized and the voice feature vector to be recognized to obtain a fusion feature vector to be recognized;
  • an abnormal behavior recognition module based on voice and image features, configured to input the fusion feature vector to be recognized into the abnormal behavior recognition model for processing, and output the corresponding human action category and whether that category constitutes abnormal behavior.
  • According to a third aspect, a computer device is provided, including a memory and a processor, the memory storing a computer program and the processor implementing the following steps when executing it:
  • after detecting that the user has entered the recognition area, controlling the camera to acquire the user's action image to be recognized, and at the same time starting the recording device to record the voice to be recognized for a predetermined time;
  • inputting the fusion feature vector to be recognized into the abnormal behavior recognition model for processing, and outputting the corresponding human action category and whether that category constitutes abnormal behavior.
  • According to a fourth aspect, a computer storage medium is provided, on which a computer program is stored, the program implementing the following steps when executed by a processor:
  • after detecting that the user has entered the recognition area, controlling the camera to acquire the user's action image to be recognized, and at the same time starting the recording device to record the voice to be recognized for a predetermined time;
  • inputting the fusion feature vector to be recognized into the abnormal behavior recognition model for processing, and outputting the corresponding human action category and whether that category constitutes abnormal behavior.
  • With the above technical solutions, this application provides a method, device, and equipment for identifying abnormal behavior based on voice and image features: the human body feature extraction model obtained through learning and training extracts features from the user's image to obtain an image feature vector to be recognized; features are then extracted from the user's voice to obtain a voice feature vector to be recognized; the two vectors are cross-fused into a fusion feature vector to be recognized; and the abnormal behavior recognition model, obtained by training a convolutional neural network, processes the fusion feature vector to determine whether the user's action constitutes abnormal behavior.
  • If it does, the user is deemed a dangerous person, and the corresponding interception function is started to intercept the user and prevent harm to the person and property of others.
  • Determining the action category from the user's image and voice together, and judging whether that category is abnormal so that corresponding measures can be taken, identifies users with abnormal behavior more quickly and accurately, raises the enterprise's safety factor, and effectively safeguards the enterprise's users.
  • FIG. 1 is a flowchart of an embodiment of the abnormal behavior recognition method based on voice and image features of this application;
  • FIG. 2 is a schematic diagram of an indoor layout in this application;
  • FIG. 3 is a training flowchart of the spatio-temporal convolutional network of this application;
  • FIG. 4 is a flowchart of speech feature extraction in this application;
  • FIG. 5 is a training flowchart of the abnormal behavior recognition model of this application;
  • FIG. 6 is a structural block diagram of an embodiment of the abnormal behavior recognition device based on voice and image features; and
  • FIG. 7 is a schematic structural diagram of the computer equipment of this application.
  • The embodiments of this application provide a method for identifying abnormal behavior based on voice and image features, which determines the action category corresponding to a user's action from the user's image and voice together and judges whether that category is abnormal, so that corresponding measures can be taken according to the judgment result.
  • an embodiment of the present application provides a method for identifying abnormal behaviors based on voice and image features, including the following steps:
  • Step 101: When it is detected that the user has entered the recognition area, the camera is controlled to acquire the user's action image to be recognized, and at the same time the recording device is started to record the voice to be recognized for a predetermined time.
  • The executor of this method may be a robot or an enterprise security system, with an executable program for the method stored in the robot or security system. A recognition area is set for the robot or security system; the size and scope of the area can be set as needed. When the camera detects that a user has entered the recognition area, the camera is pointed at the user to capture the user's action images, and the user's voice is recorded at the same time.
  • Step 102: Feature extraction is performed on the action image to be recognized to obtain a feature matrix to be recognized.
  • The acquired action image to be recognized is digitized, the image of the surrounding environment is removed and the user's image is cropped out, and then information features such as the user's facial expression, body movements, and hand-held objects are extracted from the user image and converted into a feature matrix of dimension D to be recognized.
  • Step 103: The feature matrix to be recognized is processed with the human body feature extraction model to obtain the corresponding image feature vector to be recognized.
  • The human body feature extraction model is obtained by training a spatio-temporal convolutional network on a large number of images representing various human behaviors.
  • The corresponding code program is written into the robot or security system.
  • The dimension of the model's input port is D, so the feature matrix to be recognized can enter the model directly without further conversion, and the dimension of the resulting image feature vector to be recognized is also D.
  • Step 104: Text feature extraction is performed on the voice to be recognized to obtain a voice feature vector to be recognized.
  • The text information in the voice to be recognized is extracted and converted into corresponding numbers, which are arranged to form a voice feature vector of dimension D to be recognized.
  • Step 105: The image feature vector to be recognized and the voice feature vector to be recognized are cross-fused to obtain a fusion feature vector to be recognized.
  • Because both vectors have dimension D, the fusion feature vector obtained after directly cross-fusing them also has dimension D.
  • Step 106: The fusion feature vector to be recognized is input into the abnormal behavior recognition model for processing, and the corresponding human action category, together with whether that category constitutes abnormal behavior, is output.
  • Before this step, a large number of human behavior images and recorded voices are paired and processed in the same way as steps 102-105 above to obtain fusion feature vectors for training a convolutional neural network. After these fusion feature vectors are used to train the network, an abnormal behavior recognition model capable of recognizing human behavior is obtained, and the code program corresponding to the model is written into the robot or security system. The robot or security system can then use the human body feature extraction model and the abnormal behavior recognition model to screen, through the interaction described above, the personnel entering the enterprise.
  • When abnormal behavior is recognized, the robot is controlled to intercept the person, or the interception function of the security system is activated, and the alarm device is triggered at the same time to summon staff to handle the situation.
  • This method of identifying abnormal behavior based on voice and image features effectively protects the person and property of the enterprise, its employees, and its users.
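As a point of reference only, the following Python sketch mirrors the inference flow of steps 101-106 with stand-in functions. Every name, the dimension D = 128, and the placeholder logic are assumptions for illustration; the patent does not publish code.

```python
# Illustrative sketch of steps 101-106; all names and values are assumed.
import numpy as np

D = 128  # feature dimension "D" used throughout the description (value assumed)

def extract_feature_matrix(images):
    """Step 102: turn the captured action images into a feature matrix of dimension D."""
    return np.random.rand(len(images), D)     # placeholder for the encoder described later

def human_feature_model(feature_matrix):
    """Step 103: first three layers of the trained spatio-temporal network."""
    return feature_matrix.mean(axis=0)        # placeholder producing a D-dim vector

def voice_feature_vector(audio):
    """Step 104: ASR text extraction followed by self-attention and a linear map."""
    return np.random.rand(D)                  # placeholder

def cross_fuse(img_vec, voice_vec):
    """Step 105: cross-fusion of the two D-dim vectors (detailed later in the text)."""
    return (img_vec + voice_vec) / 2          # simplest possible stand-in

def recognize(fused):
    """Step 106: recognition model plus the judgment layer described below."""
    abnormal = {"attack", "threaten"}         # assumed category names
    category = "attack" if fused.sum() > D / 2 else "walk"
    return category, category in abnormal

images, audio = [np.zeros((224, 224, 3))] * 4, None
img_vec = human_feature_model(extract_feature_matrix(images))
category, is_abnormal = recognize(cross_fuse(img_vec, voice_feature_vector(audio)))
print(category, is_abnormal)   # e.g. ('walk', False) -> no interception
```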
  • In this embodiment, the human body feature extraction model obtained through learning and training extracts features from the user's image to obtain the image feature vector to be recognized, and features are then extracted from the user's voice to obtain the voice feature vector to be recognized.
  • The two vectors are cross-fused into the fusion feature vector to be recognized, and the abnormal behavior recognition model obtained by training a convolutional neural network processes that vector to determine whether the user's action constitutes abnormal behavior; if it does, the user is deemed a dangerous person, and the corresponding interception function is started to intercept the user and prevent harm to the person and property of others.
  • Determining the action category from the user's image and voice together, and judging whether that category is abnormal so that corresponding measures can be taken, identifies users with abnormal behavior more quickly and accurately, raises the enterprise's safety factor, and effectively safeguards the enterprise's users.
  • In one embodiment, before step 103, the method further includes:
  • Step 1031: Obtain multiple sample images representing various human actions, and label each sample image with the corresponding human action label.
  • Each sample image includes multiple human action pictures, preferably four.
  • Step 1032: Perform feature extraction on each of the multiple sample images to obtain multiple sample feature matrices.
  • Each sample image is digitized, the environment around the person is removed and the person's image is cropped out, and then information features such as the person's facial expression, body movements, and hand-held objects are extracted and converted into a sample feature matrix of dimension D.
  • Step 1033: Construct a five-layer spatio-temporal convolutional network, input the multiple sample feature matrices into the first three layers of the network for processing, pass the resulting one-dimensional feature vectors to the last two layers for recognition processing, and output the sample human action category corresponding to each sample image.
  • Step 1034: Compare each sample human action category with the corresponding human action label to determine a sample loss function, and adjust the parameters of the spatio-temporal convolutional network according to the sample loss function to obtain the spatio-temporal convolutional network model.
  • In these steps, the spatio-temporal convolutional network is trained with the sample feature matrices of dimension D.
  • The network processes a sample feature matrix and outputs the corresponding sample human action category, which is then compared with the correct human action label; a sample loss function is calculated for each comparison, the network is adjusted according to it, and the adjusted network is trained on the next sample feature matrix. This process is repeated until training on all the sample feature matrices is complete, yielding a spatio-temporal convolutional network model that can identify the type of human action from images.
  • Multiple sample images can be obtained for each type of human action, so that the model, trained repeatedly on samples of the same action, recognizes human action categories better.
  • Step 1035: Delete the last two layers of the spatio-temporal convolutional network model to obtain the human body feature extraction model.
  • Because this application does not perform recognition from the image alone but combines image and voice, the last two layers of the spatio-temporal convolutional network model are deleted so that the model can directly output the human body features.
  • In one embodiment, step 1033 specifically includes:
  • Step 10331: The five layers of the constructed spatio-temporal convolutional network are: a first, receiving layer; a second, spatial feature analysis layer; a third, temporal feature analysis layer; a fourth, fully connected layer; and a fifth, classification layer.
  • Step 10332: The first layer transmits the received sample feature matrix to the second layer.
  • Step 10333: The second layer extracts the spatial features of the sample feature matrix and sends the extracted spatial features, together with the sample feature matrix, to the third layer.
  • Step 10334: The third layer extracts the temporal features in the sample feature matrix, combines the temporal and spatial features into a one-dimensional feature vector, and sends it to the fourth layer.
  • Step 10335: The fourth layer performs fully connected processing on the one-dimensional feature vector and sends the processed vector to the fifth layer.
  • Step 10336: The fifth layer analyzes the processed one-dimensional feature vector, determines the corresponding sample human action category, and outputs it.
  • The first layer has D input ports, so the sample feature matrix of dimension D can be input directly in matrix form.
  • After input, the second layer extracts the human action shown in each image of the sample feature matrix as spatial features; the third layer then analyzes the human actions obtained from the multiple images in chronological order and combines the temporal and spatial features into a one-dimensional feature vector.
  • The fourth and fifth layers derive the corresponding sample action category directly from the one-dimensional feature vector. In this way, the one-dimensional feature vector of the sample feature matrix is determined in both the spatial and the temporal dimension, so that the sample action category determined from it is more accurate and obtained faster.
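To make the five-layer structure concrete, here is a minimal PyTorch sketch under stated assumptions: the layer types (a linear map for spatial analysis, a 1-D convolution for temporal analysis), the sizes D = 128 and T = 4 frames, and the number of classes are invented for illustration, and the `features_only` flag plays the role of step 1035 (dropping the last two layers).

```python
# Sketch of the five-layer spatio-temporal network of steps 10331-10336 (all sizes assumed).
import torch
import torch.nn as nn

D, T, NUM_CLASSES = 128, 4, 10   # feature dim, frames per sample, action classes (assumed)

class SpatioTemporalNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.receive = nn.Identity()                               # layer 1: receiving layer (D input ports)
        self.spatial = nn.Linear(D, D)                             # layer 2: spatial feature analysis
        self.temporal = nn.Conv1d(D, D, kernel_size=3, padding=1)  # layer 3: temporal analysis over frames
        self.fc = nn.Linear(D, D)                                  # layer 4: fully connected layer
        self.classify = nn.Linear(D, NUM_CLASSES)                  # layer 5: classification (softmax) layer

    def forward(self, x, features_only=False):
        # x: (batch, T, D) -- one sample feature matrix per element of the batch
        x = torch.relu(self.spatial(self.receive(x)))
        t = torch.relu(self.temporal(x.transpose(1, 2)))           # convolve across the T frames
        feat = t.mean(dim=-1)                                      # combined one-dimensional feature vector
        if features_only:                                          # step 1035: drop the last two layers,
            return feat                                            # yielding the human body feature extractor
        return self.classify(torch.relu(self.fc(feat)))

net = SpatioTemporalNet()
print(net(torch.randn(2, T, D)).shape)                      # torch.Size([2, 10]) -- full network
print(net(torch.randn(2, T, D), features_only=True).shape)  # torch.Size([2, 128]) -- extractor
```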
  • In one embodiment, before step 106, the method further includes:
  • Step 1061: For M individuals, separately acquire each person's action images as training images and simultaneously record a training voice of a predetermined length for each person, obtaining M training images and M training voices.
  • The training image and the training voice acquired from the same person are combined, so that every resulting training fusion feature vector comes from a single person and training based on those vectors works better.
  • Step 1062: Label each training image and each training voice with the corresponding training human action label.
  • The training image and the training voice corresponding to the same person carry the same training human action label.
  • Step 1063: Perform feature extraction on each training image to obtain M training feature matrices.
  • Each training image is digitized, the environment around the human body is removed and the body's image is cropped out, and then information features such as facial expression, body movements, and hand-held objects are extracted and converted into a training feature matrix of dimension D.
  • Step 1064: Input the M training feature matrices into the human body feature extraction model in sequence for processing, and output M training image feature vectors.
  • Each training feature matrix is processed by the three layers of the human body feature extraction model to obtain the corresponding training image feature vector.
  • Step 1065: Perform text feature extraction on the M training voices to obtain M training voice feature vectors.
  • An automatic speech recognition (ASR) system performs speech recognition on each training voice and converts it into the corresponding text, and feature extraction on that text yields a training voice feature vector of dimension D.
  • Step 1066: Cross-fuse the training image feature vector and the training voice feature vector belonging to the same human action label to obtain a training fusion feature vector.
  • The M training image feature vectors and M training voice feature vectors are correspondingly fused into M training fusion feature vectors.
  • Fusing the training image feature vector and the training voice feature vector that share a human action label preserves the uniqueness of the human action label of each training fusion feature vector.
  • Step 1067: Input the M training fusion feature vectors into the convolutional neural network in sequence for training, and compare each output training human action category with the corresponding training human action label to determine the corresponding training loss function.
  • Step 1068: Adjust the convolutional neural network according to the training loss function to obtain a convolutional neural network model.
  • A training loss function is obtained for each training fusion feature vector that is input.
  • The parameters of the convolutional neural network are adjusted according to the training loss function, the next training fusion feature vector is input, and the process is repeated until all the training fusion feature vectors have been used, yielding a convolutional neural network model that can recognize human actions from the fused feature vectors combining a person's image and voice.
  • Step 1069: Before the output layer of the convolutional neural network, add a judgment layer that can judge, from the obtained training human action category, whether the action constitutes abnormal behavior, thereby obtaining the abnormal behavior recognition model.
  • The names of the human action categories that constitute abnormal behavior are registered in the judgment layer.
  • When a human action category is obtained, it is searched for among the registered abnormal behavior categories. If it is found, the category constitutes abnormal behavior, and the category and the judgment result are output from the output layer; if it is not found, the category does not constitute abnormal behavior, and the category and the judgment result are likewise output from the output layer.
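The judgment layer is essentially a lookup against a registered set of abnormal category names, as in the sketch below; the category strings are invented for illustration and are not taken from the patent.

```python
# Sketch of the judgment layer of step 1069 (category names assumed).
ABNORMAL_CATEGORIES = {"attack", "threaten with object", "forced entry"}

def judgment_layer(human_action_category):
    """Search the recognized category among the registered abnormal categories."""
    is_abnormal = human_action_category in ABNORMAL_CATEGORIES
    return human_action_category, is_abnormal   # both leave through the output layer

print(judgment_layer("attack"))     # ('attack', True)  -> trigger interception
print(judgment_layer("queueing"))   # ('queueing', False) -> allow entry
```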
  • In one embodiment, the method further includes:
  • Step 101': When it is detected that the user has entered the recognition area, the camera is controlled to acquire multiple action images of the user to be recognized, and at the same time the recording device is started to record the voice to be recognized for a predetermined time.
  • Accordingly, step 102 specifically includes:
  • Step 1021: Input the multiple action images to be recognized into the encoding processor, use the self-attention mechanism layer in the encoding processor to visually analyze each action image to be recognized, and extract the visualized features of each image, obtaining multiple visualized features corresponding to the multiple action images.
  • A self-attention mechanism layer is added to the encoding processor in advance. It visually analyzes each action image to be recognized, removes interfering environmental factors, and extracts the features of the human body's contour or posture, that is, the visualized features, so that multiple visualized features are obtained for the multiple action images to be recognized.
  • Step 1022: Input the multiple visualized features into the superposition layer of the encoding processor for superposition processing, obtaining multiple superposition results.
  • Step 1023: Input the multiple superposition results into the residual layer for residual processing to strengthen them.
  • Each superposition result is input into the residual layer and compared with the estimated value there to determine its reliability, and the result is strengthened. This prevents the gradient of the image features from vanishing during feature extraction.
  • Step 1024: Splice the multiple strengthened superposition results and then apply linear processing to obtain the feature matrix to be recognized.
  • The strengthened superposition results are spliced linearly according to dimension D, yielding the feature matrix of dimension D to be recognized.
  • In this way, the multiple action images to be recognized are processed into a single feature matrix of dimension D to be recognized.
  • The features in the feature matrix to be recognized are made more prominent, so that they can be identified quickly and accurately.
  • In step 1032, feature extraction is performed on the sample images in the manner described above.
  • In step 1063, feature extraction on the training images is likewise performed in the manner described above.
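A compact PyTorch reading of steps 1021-1024 follows: one self-attention pass per image, a residual superposition with normalization, then splicing and a linear map back to dimension D. Token counts, head counts, and the pooling step are assumptions, not details from the patent.

```python
# Sketch of the encoding processor of steps 1021-1024 (all sizes assumed).
import torch
import torch.nn as nn

D, NUM_IMAGES, TOKENS = 128, 4, 49   # feature dim, images per user, patches per image (assumed)

class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)  # self-attention layer
        self.norm = nn.LayerNorm(D)
        self.linear = nn.Linear(NUM_IMAGES * D, NUM_IMAGES * D)              # final linear processing

    def forward(self, images):
        # images: (NUM_IMAGES, TOKENS, D) -- one token sequence per action image
        out = []
        for x in images:                  # step 1021: visual analysis of each image
            x = x.unsqueeze(0)
            a, _ = self.attn(x, x, x)     # extract the visualized features
            h = self.norm(x + a)          # steps 1022-1023: superposition + residual strengthening
            out.append(h.mean(dim=1))     # one D-dim summary per image
        spliced = torch.cat(out, dim=-1)  # step 1024: splice the strengthened results
        return self.linear(spliced).view(NUM_IMAGES, D)   # feature matrix to be recognized

enc = ImageEncoder()
print(enc(torch.randn(NUM_IMAGES, TOKENS, D)).shape)      # torch.Size([4, 128])
```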
  • In one embodiment, step 104 specifically includes:
  • Step 1041: Use an automatic speech recognition algorithm to perform text feature extraction on the voice to be recognized.
  • Step 1042: Use the self-attention mechanism to analyze the extracted text features and extract word feature vectors.
  • The self-attention mechanism is used here as well, so that the resulting image feature vector and word feature vector are relatively similar, which facilitates the later fusion of image and voice features.
  • Step 1043: Perform a linear transformation on the word feature vector to obtain the voice feature vector to be recognized.
  • The word feature vector is linearly transformed according to dimension D, yielding the voice feature vector of dimension D to be recognized.
  • In step 1065, feature extraction on the training voices is likewise performed in the manner described above.
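Under the same caveats, the speech branch of steps 1041-1043 could be sketched like this; the ASR call is replaced by a stub, and the embedding, attention, and projection sizes are assumed.

```python
# Sketch of the speech branch of steps 1041-1043 (ASR stubbed, sizes assumed).
import torch
import torch.nn as nn

D = 128   # assumed feature dimension

def asr_stub(audio):
    """Stand-in for an automatic speech recognition system returning transcript tokens."""
    return ["open", "the", "vault"]

class SpeechBranch(nn.Module):
    def __init__(self, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, D)                             # word features
        self.attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)  # self-attention
        self.proj = nn.Linear(D, D)                                          # linear transform

    def forward(self, token_ids):
        x = self.embed(token_ids).unsqueeze(0)       # (1, words, D)
        a, _ = self.attn(x, x, x)                    # step 1042: word feature vectors
        return self.proj(a.mean(dim=1)).squeeze(0)   # step 1043: D-dim voice feature vector

tokens = torch.tensor([hash(w) % 1000 for w in asr_stub(None)])
print(SpeechBranch()(tokens).shape)   # torch.Size([128])
```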
  • In one embodiment, step 105 specifically includes:
  • Step 1051: Use the additive attention mechanism to cross-add the image feature vector to be recognized and the voice feature vector to be recognized, obtaining the added feature vector.
  • That is, additive attention is used to cross-add the image feature vector to be recognized and the voice feature vector to be recognized.
  • Step 1052: Perform a scaled dot-product operation on the added feature vector to obtain the fusion feature vector to be recognized.
  • The scaled dot-product is used here; the dot product (also called the scalar product) is a binary operation that takes two vectors over the real numbers R and returns a real-valued scalar.
  • the specific formula is as follows:
  • MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
  • Finally, the fusion feature MultiHead(Q, K, V) obtained above is normalized.
  • The effect is to reduce the complexity of the subsequent abnormal behavior classification.
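Since the text names the operations (additive attention, then a scaled dot-product, then normalization) but not their exact arithmetic, the following is one plausible reading rather than the patent's formula.

```python
# Sketch of steps 1051-1052: cross-addition, scaled dot-product, normalization (assumed arithmetic).
import math
import torch
import torch.nn.functional as F

D = 128
img_vec, voice_vec = torch.randn(D), torch.randn(D)

# Step 1051: additive attention -- learnable maps omitted, plain cross-addition here
added = torch.tanh(img_vec + voice_vec)

# Step 1052: scaled dot-product over the added features (Q = K = V = added)
q = k = v = added.view(1, 1, D)
scores = (q @ k.transpose(-2, -1)) / math.sqrt(D)    # scaled dot product
fused = (F.softmax(scores, dim=-1) @ v).flatten()

# Final normalization, said to reduce the complexity of later classification
fused = F.layer_norm(fused, (D,))
print(fused.shape)   # torch.Size([128])
```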
  • In this embodiment, the human body feature extraction model obtained through learning and training extracts features from the user's image to obtain the image feature vector to be recognized, features are then extracted from the user's voice to obtain the voice feature vector to be recognized, the two vectors are cross-fused into the fusion feature vector to be recognized, and the abnormal behavior recognition model obtained by training a convolutional neural network processes that vector to determine whether the user's actions constitute abnormal behavior; if so, the user is deemed a dangerous person, and the corresponding interception function is started to intercept the user and prevent harm to the person and property of others.
  • Determining the action category from the user's image and voice together, and judging whether that category is abnormal so that corresponding measures can be taken, identifies users with abnormal behavior more quickly and accurately, raises the enterprise's safety factor, and effectively safeguards the enterprise's users.
  • In a specific embodiment, the method for identifying abnormal behavior based on voice and image features includes the following steps:
  • Each image in each image group first passes through a six-layer encoding structure (encoder layers) that completes the strong feature extraction of the contour and posture of the human body in the image, yielding multiple sets of feature matrices (each feature matrix corresponds to one image, and the matrix dimension is D).
  • Within each encoder layer, the data passes through the multi-head self-attention layer and is then superimposed with the residual layer and the normalization layer.
  • The self-attention mechanism extracts the key features in the picture (including the outline and posture of the human body) better, and the residual layer prevents the gradient of the image features from vanishing during feature extraction.
  • Random inactivation (dropout) is a method for optimizing artificial neural networks with a deep structure.
  • During training, part of the weights or outputs of the hidden layers is randomly reset to zero, reducing the interdependence between nodes, regularizing the neural network, and lowering its structural risk.
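As a brief demonstration of that behavior (not code from the patent), PyTorch's built-in dropout zeroes a random subset of activations during training and rescales the survivors:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # randomly zeroes hidden outputs while the module is in training mode
x = torch.ones(1, 8)
print(drop(x))             # about half the entries become 0; the rest are scaled by 1/(1-p) = 2.0
```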
  • Input the fusion matrix obtained in step 3 into a five-layer spatio-temporal graph convolutional network (STGCN) for learning and training.
  • The first layer of the STGCN receives the fusion matrix; the second layer analyzes the spatial features of the multiple orientation images in the fusion matrix; the third layer analyzes the temporal features across the preceding and following frames of the fusion matrix to obtain a one-dimensional feature vector; the fourth layer performs full connection on the one-dimensional feature vector; and the fifth, softmax layer classifies the human behavior from the output of the fourth layer and outputs the classification result.
  • For each output, the corresponding loss function is obtained.
  • The spatio-temporal graph convolutional network is adjusted according to the loss function until learning and training on all the fusion matrices obtained in step 3 are completed, yielding the corresponding spatio-temporal graph convolutional network model.
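The adjust-until-exhausted loop described here is an ordinary supervised training loop; the sketch below shows its shape with an assumed optimizer, loss, and a stand-in model, none of which are specified by the patent.

```python
# Sketch of the training loop: one loss per fusion matrix, parameters adjusted each time,
# repeated until all fusion matrices have been used (optimizer, loss, and model assumed).
import torch
import torch.nn as nn

D, T, NUM_CLASSES = 128, 4, 10
model = nn.Sequential(nn.Flatten(), nn.Linear(T * D, D), nn.ReLU(),
                      nn.Linear(D, NUM_CLASSES))        # stand-in for the five-layer STGCN
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

fusion_matrices = torch.randn(32, T, D)                 # placeholder training data
labels = torch.randint(0, NUM_CLASSES, (32,))

for x, y in zip(fusion_matrices, labels):
    opt.zero_grad()
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    loss.backward()                                     # adjust the network per loss function
    opt.step()
```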
  • The traditional self-attention (transformer) mechanism is used to extract features of the text, obtaining the corresponding word vector features. (The strong feature extraction of the image also uses the transformer mechanism, so the extracted information features are relatively similar, which facilitates later fusion.)
  • After the one-dimensional human behavior feature vector is fused with the word vector features, a DNN is used for learning and training.
  • MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
  • The second layer uses the scaled dot-product method; the dot product (also called the scalar product) is a binary operation that takes two vectors over the real numbers R and returns a real-valued scalar.
  • Finally, the fusion feature MultiHead(Q, K, V) obtained above is normalized, which reduces the complexity of the subsequent abnormal behavior classification.
  • The human body feature extraction model and the abnormal behavior recognition model obtained above are loaded into the robot system, and the robot completes the recognition and detection of the user's behavior.
  • The Alde robot uses the camera to obtain a set of the user's action images (for example, four) and records a segment of the user's voice.
  • The fusion matrix obtained above is input into the human body feature extraction model for processing.
  • The first layer receives the fusion matrix, the second layer analyzes the spatial features of the multiple orientation images in the fusion matrix, and the third layer analyzes the temporal features across the preceding and following frames to obtain the corresponding image feature vector to be recognized.
  • The acquired user's voice is processed according to steps 1-3 in step 2 to obtain the corresponding voice feature vector to be recognized.
  • The image feature vector to be recognized and the voice feature vector to be recognized are fused according to the process of step 1 in step 3 to obtain the fusion feature vector to be recognized.
  • If the behavior is judged abnormal, one or more Alde robots are controlled to intercept the user and stop the user from entering the bank's business processing area, preventing harm or loss to other users, public facilities, bank staff, or property.
  • At the same time, the alarm device is activated to remind the staff to intercept the abnormal user.
  • Otherwise, the user is allowed to enter the business processing area to handle their business.
  • Further, as shown in FIG. 6, an embodiment of the present application provides an abnormal behavior recognition device based on voice and image features.
  • The device includes, connected in sequence: an acquisition module 61, an image feature extraction module 62, a feature processing module 63, a voice feature extraction module 64, a fusion feature module 65, and an abnormal behavior recognition module 66.
  • the acquisition module 61 is configured to, after detecting that the user has entered the recognition area, control the camera to acquire the user's action image to be recognized and at the same time start the recording device to record the voice to be recognized for a predetermined time;
  • the image feature extraction module 62 is configured to perform feature extraction on the action image to be recognized to obtain a feature matrix to be recognized;
  • the feature processing module 63 is configured to process the feature matrix to be recognized with the human body feature extraction model to obtain the corresponding image feature vector to be recognized;
  • the voice feature extraction module 64 is configured to perform text feature extraction on the voice to be recognized to obtain the voice feature vector to be recognized;
  • the fusion feature module 65 is configured to cross-fuse the image feature vector to be recognized and the voice feature vector to be recognized to obtain the fusion feature vector to be recognized;
  • the abnormal behavior recognition module 66 is configured to input the fusion feature vector to be recognized into the abnormal behavior recognition model for processing, and output the corresponding human action category and whether that category constitutes abnormal behavior.
  • the acquiring module 61 is further configured to acquire a plurality of sample images representing various human actions, and label each sample image with a corresponding human body action label;
  • the image feature extraction module 62 is further configured to perform feature extraction on each of the multiple sample images to obtain multiple sample feature matrices;
  • the device also includes:
  • a construction module, configured to construct a five-layer spatio-temporal convolutional network, input multiple sample feature matrices into the first three layers of the network for processing, and transfer the resulting one-dimensional feature vectors to the last two layers for recognition processing, outputting the sample human action category corresponding to each sample image;
  • the feature extraction training module is used to compare the sample human action category with the corresponding human action label to determine the sample loss function, and adjust the parameters of the spatiotemporal convolutional network according to the sample loss function to obtain the spatiotemporal convolutional network model;
  • the deletion module is used to delete the last two layers of the spatio-temporal convolutional network model to obtain a human body feature extraction model.
  • In one embodiment, the construction module specifically includes:
  • a construction unit, in which the five-layer structure of the spatio-temporal convolutional network to be constructed is: a first, receiving layer; a second, spatial feature analysis layer; a third, temporal feature analysis layer; a fourth, fully connected layer; and a fifth, classification layer;
  • a transmitting unit, used by the first layer to transmit the received sample feature matrix to the second layer;
  • a spatial feature processing unit, used by the second layer to extract the spatial features of the sample feature matrix and send the extracted spatial features, together with the sample feature matrix, to the third layer;
  • a temporal feature processing unit, used by the third layer to extract the temporal features in the sample feature matrix, combine the temporal and spatial features into a one-dimensional feature vector, and send it to the fourth layer;
  • a fully connected processing unit, used by the fourth layer to perform fully connected processing on the one-dimensional feature vector and send the processed vector to the fifth layer;
  • an analysis unit, used by the fifth layer to analyze the processed one-dimensional feature vector and output it after determining the corresponding sample human action category.
  • the acquiring module 61 is further configured to separately acquire the action images of each person as training images for M individuals, and simultaneously record training voices for each person for a predetermined time to obtain M training images and M training voices;
  • the device also includes:
  • a labeling module, configured to label each training image and each training voice with the corresponding training human action label;
  • the image feature extraction module 62 is also used to perform feature extraction on each training image to obtain M training feature matrices;
  • the feature processing module 63 is further configured to sequentially input M training feature matrices into the human body feature extraction model for processing, and output M training image feature vectors;
  • the voice feature extraction module 64 is also used to perform text feature extraction on M training voices to obtain M training voice feature vectors;
  • the device also includes:
  • an abnormal behavior training module, configured to: cross-fuse the training image feature vectors and training voice feature vectors belonging to the same human action label to obtain training fusion feature vectors, the M training image feature vectors and M training voice feature vectors being correspondingly fused into M training fusion feature vectors; and input the M training fusion feature vectors into the convolutional neural network in sequence for training, comparing each output training human action category with the corresponding training human action label to determine the corresponding training loss function;
  • In one embodiment, when the acquisition module 61 detects that the user has entered the recognition area, it controls the camera to acquire multiple action images of the user to be recognized;
  • the image feature extraction module 62 is specifically configured to: input the multiple action images to be recognized into the encoding processor, use the self-attention mechanism layer in the encoding processor to visually analyze each action image to be recognized and extract its visualized features, the multiple action images corresponding to multiple visualized features; input the multiple visualized features into the superposition layer of the encoding processor for superposition processing to obtain multiple superposition results; input the multiple superposition results into the residual layer for residual processing to strengthen them; and splice the strengthened superposition results and apply linear processing to obtain the feature matrix to be recognized.
  • the voice feature extraction module 64 is specifically configured to: use an automatic speech recognition algorithm to extract text features from the voice to be recognized; use the self-attention mechanism to analyze the extracted text features and extract word feature vectors; and perform a linear transformation on the word feature vectors to obtain the voice feature vector to be recognized.
  • the fusion feature module 65 is specifically configured to: use the additive attention mechanism to cross-add the image feature vector to be recognized and the voice feature vector to be recognized to obtain the added feature vector; and perform a scaled dot-product operation on the added feature vector to obtain the fusion feature vector to be recognized.
  • On the basis of the above embodiments, an embodiment of the present application also provides a computer device, as shown in FIG. 7, including a memory 72 and a processor 71.
  • The memory 72 and the processor 71 are both arranged on the bus 73, and the memory 72 stores a computer program.
  • When the processor 71 executes the computer program, the abnormal behavior recognition method based on voice and image features shown in FIG. 1 is implemented.
  • The technical solution of this application can be embodied in the form of a software product, which can be stored in a non-volatile memory (such as a CD-ROM, USB flash drive, or removable hard disk) and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in each implementation scenario of this application.
  • Optionally, the device can also be connected to a user interface, a network interface, a camera, radio frequency (RF) circuits, sensors, audio circuits, a Wi-Fi module, and so on.
  • The user interface may include a display screen and input units such as a keyboard, and optionally may also include a USB interface, a card reader interface, and the like.
  • The network interface may optionally include a standard wired interface, a wireless interface (such as a Bluetooth or Wi-Fi interface), and so on.
  • It can be understood that this structure does not limit the physical computer device, which may include more or fewer components, combine certain components, or arrange the components differently.
  • On the basis of the above, an embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the abnormal behavior recognition method based on voice and image features shown in FIG. 1 above is realized.
  • the computer-readable storage medium may be non-volatile or volatile.
  • The storage medium may also include an operating system and a network communication module.
  • The operating system is a program that manages the hardware and software resources of the computer device and supports the operation of the information processing program and other software and/or programs.
  • The network communication module is used to realize communication between the components in the storage medium, as well as communication with other hardware and software in the computer device.
  • In these embodiments, the human body feature extraction model obtained through learning and training extracts features from the user's image to obtain the image feature vector to be recognized, features are then extracted from the user's voice to obtain the voice feature vector to be recognized, the two vectors are cross-fused into the fusion feature vector to be recognized, and the abnormal behavior recognition model obtained by training a convolutional neural network processes that vector to determine whether the user's actions constitute abnormal behavior; if so, the user is deemed a dangerous person, and the corresponding interception function is started to intercept the user and prevent harm to the person and property of others.
  • Determining the action category from the user's image and voice together, and judging whether that category is abnormal so that corresponding measures can be taken, identifies users with abnormal behavior more quickly and accurately, raises the enterprise's safety factor, and effectively safeguards the enterprise's users.

Abstract

The present application relates to the field of artificial intelligence, and disclosed thereby are a method, apparatus and device for recognizing abnormal behavior on the basis of voice and image features. The method comprises: performing feature extraction on an image of a user by utilizing a human body feature extraction model obtained after learning and training, so as to obtain an image feature vector to be recognized; performing feature extraction on the voice of the user to obtain a voice feature vector to be recognized; performing cross fusion on the image feature vector to be recognized and the voice feature vector to be recognized to obtain a fusion feature vector to be recognized; and processing the fusion feature vector to be recognized by using an abnormal behavior recognition model obtained by learning and training a convolutional neural network, so as to determine whether an action of the user is an abnormal behavior, and if so, deeming the user a dangerous person and starting a corresponding interception function to intercept the user. As such, users with abnormal behaviors can be identified more quickly and accurately, effectively improving the safety coefficient of an enterprise and providing an effective guarantee for the safety of the users of the enterprise.

Description

Method, device and equipment for identifying abnormal behavior based on voice and image features
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 27, 2020, with application number 202010123166.8 and entitled "Method, device and equipment for identifying abnormal behavior based on voice and image features", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to a method, device and equipment for identifying abnormal behavior based on voice and image features.
Background
The security systems of the service industry bear on social stability and the safety of people's property, and have long been a focus of security development. For example, the existing security systems of bank branches can no longer reliably guarantee branch operations and the safety of personnel inside the branches.
Some security systems in the service industry rely on triggered alarms or video surveillance; such systems can only notify the relevant personnel to respond after a dangerous person has already entered.
The inventor realized that existing security systems recognize people's behavior only from the human posture in videos or pictures, and their accuracy in recognizing dangerous abnormal behavior from voice and image features is low. As a result, safe persons may be mistaken for dangerous ones, or dangerous persons may be let through, endangering public safety.
Summary of the invention
In view of this, the present application provides a method, device, and equipment for identifying abnormal behavior based on voice and image features. The main purpose is to solve the technical problem of the low accuracy of current abnormal behavior recognition based on voice and image features.
According to a first aspect of the present application, a method for identifying abnormal behavior based on voice and image features is provided, the steps of which include:
after detecting that the user has entered the recognition area, controlling the camera to acquire the user's action image to be recognized, and at the same time starting the recording device to record the voice to be recognized for a predetermined time;
performing feature extraction on the action image to be recognized to obtain a feature matrix to be recognized;
processing the feature matrix to be recognized with a human body feature extraction model to obtain the corresponding image feature vector to be recognized;
performing text feature extraction on the voice to be recognized to obtain a voice feature vector to be recognized;
cross-fusing the image feature vector to be recognized and the voice feature vector to be recognized to obtain a fusion feature vector to be recognized;
inputting the fusion feature vector to be recognized into the abnormal behavior recognition model for processing, and outputting the corresponding human action category and whether that category constitutes abnormal behavior.
According to a second aspect of the present application, an abnormal behavior recognition device based on voice and image features is provided, the device including:
an acquisition module, configured to, after detecting that the user has entered the recognition area, control the camera to acquire the user's action image to be recognized and at the same time start the recording device to record the voice to be recognized for a predetermined time;
an image feature extraction module, configured to perform feature extraction on the action image to be recognized to obtain a feature matrix to be recognized;
a feature processing module, configured to process the feature matrix to be recognized with the human body feature extraction model to obtain the corresponding image feature vector to be recognized;
a voice feature extraction module, configured to perform text feature extraction on the voice to be recognized to obtain a voice feature vector to be recognized;
a fusion feature module, configured to cross-fuse the image feature vector to be recognized and the voice feature vector to be recognized to obtain a fusion feature vector to be recognized;
an abnormal behavior recognition module based on voice and image features, configured to input the fusion feature vector to be recognized into the abnormal behavior recognition model for processing, and output the corresponding human action category and whether that category constitutes abnormal behavior.
According to a third aspect of the present application, a computer device is provided, including a memory and a processor, the memory storing a computer program and the processor implementing the following steps when executing the computer program:
after detecting that the user has entered the recognition area, controlling the camera to acquire the user's action image to be recognized, and at the same time starting the recording device to record the voice to be recognized for a predetermined time;
performing feature extraction on the action image to be recognized to obtain a feature matrix to be recognized;
processing the feature matrix to be recognized with a human body feature extraction model to obtain the corresponding image feature vector to be recognized;
performing text feature extraction on the voice to be recognized to obtain a voice feature vector to be recognized;
cross-fusing the image feature vector to be recognized and the voice feature vector to be recognized to obtain a fusion feature vector to be recognized;
inputting the fusion feature vector to be recognized into the abnormal behavior recognition model for processing, and outputting the corresponding human action category and whether that category constitutes abnormal behavior.
According to a fourth aspect of the present application, a computer storage medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following steps:
after detecting that a user has entered the recognition area, controlling the camera to capture the user's motion image to be recognized, and simultaneously starting the recording component to record speech to be recognized for a predetermined time;
performing feature extraction on the motion image to be recognized to obtain a feature matrix to be recognized;
processing the feature matrix to be recognized with a human body feature extraction model to obtain a corresponding image feature vector to be recognized;
performing text feature extraction on the speech to be recognized to obtain a speech feature vector to be recognized;
cross-fusing the image feature vector to be recognized with the speech feature vector to be recognized to obtain a fused feature vector to be recognized;
inputting the fused feature vector to be recognized into an abnormal behavior recognition model for processing, and outputting the corresponding human action category and whether that human action category constitutes abnormal behavior.
Through the above technical solutions, the method, apparatus, and device for recognizing abnormal behavior based on voice and image features provided by this application use a human body feature extraction model, obtained through learning and training, to extract features from the user's image and obtain an image feature vector to be recognized; features are then extracted from the user's speech to obtain a speech feature vector to be recognized; the two vectors are cross-fused to obtain a fused feature vector to be recognized; and an abnormal behavior recognition model, obtained by training a convolutional neural network, processes the fused feature vector to judge whether the user's action constitutes abnormal behavior. If it does, the user is judged to be a dangerous person, and the corresponding interception function is activated to stop the user and prevent harm to the persons and property of others. Determining the action category of a user's action from both image and sound, and judging whether that category is abnormal so that corresponding measures can be taken according to the result, identifies users with abnormal behavior more quickly and accurately, raises the safety level of the enterprise, and provides an effective safeguard for the safety of its users.
Description of the drawings
FIG. 1 is a flowchart of an embodiment of the abnormal behavior recognition method based on voice and image features of this application;
FIG. 2 is a schematic diagram of the indoor layout of this application;
FIG. 3 is a training flowchart of the spatio-temporal convolutional network of this application;
FIG. 4 is a flowchart of speech feature extraction in this application;
FIG. 5 is a training flowchart of the abnormal behavior recognition model of this application;
FIG. 6 is a structural block diagram of an embodiment of the abnormal behavior recognition apparatus based on voice and image features;
FIG. 7 is a schematic structural diagram of the computer device of this application.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and so that its scope will be fully conveyed to those skilled in the art.
An embodiment of this application provides a method for recognizing abnormal behavior based on voice and image features that determines the action category of a user's action from the user's image and sound together, and judges whether that action category is abnormal so that corresponding measures can be taken according to the result; this makes abnormal behavior recognition faster and more accurate.
As shown in FIG. 1, an embodiment of this application provides an abnormal behavior recognition method based on voice and image features, comprising the following steps:
Step 101: when it is detected that a user has entered the recognition area, the camera is controlled to capture the user's motion image to be recognized, and the recording component is simultaneously started to record speech to be recognized for a predetermined time.
In this step, the executor of the method may be a robot or an enterprise security system, in which the program implementing the abnormal behavior recognition method based on voice and image features is stored. A recognition area is set for the robot or security system; the size and range of the area can be configured as needed. When the camera detects that a user has entered the recognition area, the camera is aimed at the user to capture the user's motion image while the user's voice is recorded.
Step 102: feature extraction is performed on the motion image to be recognized to obtain a feature matrix to be recognized.
In this step, the captured motion image to be recognized is digitized, the image of the environment around the user is removed, and the user's image is cropped out; then information features such as the user's facial expression, body movements, and hand-held objects are extracted from the user image and converted into a feature matrix of dimension D to be recognized.
Step 103: the feature matrix to be recognized is processed with the human body feature extraction model to obtain the corresponding image feature vector to be recognized.
In this step, the human body feature extraction model is obtained by training a spatio-temporal convolutional network on a large number of images representing various human behaviors. After the model has been trained, the corresponding code is written into the robot or security system. The input of the model has dimension D, which guarantees that the feature matrix to be recognized can enter the model directly without further conversion, so the image feature vector to be recognized obtained after processing also has dimension D.
Step 104: text feature extraction is performed on the speech to be recognized to obtain a speech feature vector to be recognized.
In this step, the text information in the speech to be recognized is extracted and converted into corresponding numbers, which are arranged into a speech feature vector of dimension D to be recognized.
Step 105: the image feature vector to be recognized and the speech feature vector to be recognized are cross-fused to obtain a fused feature vector to be recognized.
In this step, since the image feature vector to be recognized and the speech feature vector to be recognized have the same dimension, the fused feature vector obtained by directly cross-fusing the two also has dimension D.
Step 106: the fused feature vector to be recognized is input into the abnormal behavior recognition model for processing, and the corresponding human action category and whether that human action category constitutes abnormal behavior are output.
In this step, a large number of human behavior images and recorded speech samples are processed in the same way as steps 102-105 above to obtain fused feature vectors suitable for training a convolutional neural network; these fused feature vectors are input into the convolutional neural network, and after training an abnormal behavior recognition model capable of recognizing human behavior is obtained, whose corresponding code is written into the robot or security system. The robot or security system can then use the human body feature extraction model and the abnormal behavior recognition model together, as described above, to screen people entering the enterprise. If a person's behavior is detected as abnormal, the robot is directed to intercept that person, or the interception function of the security system is activated to intercept the person, while the alarm device is triggered to summon staff to handle the situation. This method of recognizing abnormal behavior based on voice and image features effectively protects the persons and property of the enterprise, its employees, and its users.
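For illustration only, the following Python sketch strings steps 101-106 together; every callable passed in stands for a component described above (image feature extraction, the human body feature extraction model, speech processing, fusion, and the abnormal behavior recognition model) and is a hypothetical placeholder rather than an interface defined by this application.

```python
def recognize_abnormal_behavior(action_images, speech_clip,
                                to_feature_matrix, body_feature_model,
                                to_speech_vector, cross_fuse, abnormal_model):
    """Hypothetical end-to-end sketch of steps 101-106."""
    feature_matrix = to_feature_matrix(action_images)   # step 102: dimension-D matrix
    image_vec = body_feature_model(feature_matrix)      # step 103: D-dim image vector
    speech_vec = to_speech_vector(speech_clip)          # step 104: D-dim speech vector
    fused_vec = cross_fuse(image_vec, speech_vec)       # step 105: cross fusion
    return abnormal_model(fused_vec)                    # step 106: (category, is_abnormal)
```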
Through the above technical solution, the human body feature extraction model obtained after learning and training extracts features from the user's image to obtain the image feature vector to be recognized; features are then extracted from the user's speech to obtain the speech feature vector to be recognized; the two vectors are cross-fused to obtain the fused feature vector to be recognized; and the abnormal behavior recognition model, obtained by training a convolutional neural network, processes the fused feature vector to judge whether the user's action constitutes abnormal behavior. If so, the user is judged to be a dangerous person and the corresponding interception function is activated to stop the user and prevent harm to the persons and property of others. Determining the action category from image and sound together and judging whether it is abnormal, so that corresponding measures can be taken according to the result, identifies users with abnormal behavior more quickly and accurately, raises the safety level of the enterprise, and provides an effective safeguard for the safety of its users.
In a specific embodiment, before step 103 the method further comprises:
Step 1031: a plurality of sample images representing various human actions are obtained, and each sample image is labeled with the corresponding human action label.
In this step, the various human actions include running, walking, clapping, holding a gun, holding a knife, punching, kicking, and so on. The human action corresponding to each sample image is labeled so that, during subsequent training, it can be judged whether the recognition result is correct. Each sample image comprises multiple human action pictures, preferably four.
Step 1032: feature extraction is performed on each of the sample images to obtain a plurality of sample feature matrices.
In this step, the sample images are digitized, the image of the environment around the person is removed, and the person's image is cropped out; then information features such as facial expression, body movements, and hand-held objects are extracted from the person's image and converted into a sample feature matrix of dimension D.
Step 1033: a five-layer spatio-temporal convolutional network is constructed; the sample feature matrices are input in turn into the first three layers of the network for processing, and the resulting one-dimensional feature vectors are passed to the last two layers for recognition, which output the sample human action category corresponding to each sample image.
Step 1034: the sample human action category is compared with the corresponding human action label to determine a sample loss function, and the parameters of the spatio-temporal convolutional network are adjusted according to the sample loss function to obtain the spatio-temporal convolutional network model.
In the above steps, the spatio-temporal convolutional network is trained with sample feature matrices of dimension D. The network processes each sample feature matrix and outputs the corresponding sample human action category, which is compared with the correct human action label; each comparison yields a sample loss function, according to which the network is adjusted once, after which the adjusted network is trained on the next sample feature matrix. This process is repeated until all sample feature matrices have been used for training, yielding a spatio-temporal convolutional network model that can recognize the type of human action from images.
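A minimal training-loop sketch of steps 1033-1034 follows, written in PyTorch purely for illustration; the model, samples, and labels are assumed to be provided as tensors, and the choice of loss and optimizer is an assumption, since the application does not specify them.

```python
import torch
import torch.nn as nn

def train_stgcn(model, samples, labels, epochs=1, lr=1e-3):
    """Illustrative sketch: per-sample loss drives the parameter adjustment."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            logits = model(x.unsqueeze(0))             # one sample feature matrix
            loss = criterion(logits, y.unsqueeze(0))   # compare with the action label
            optimizer.zero_grad()
            loss.backward()                            # adjust the network parameters
            optimizer.step()
    return model
```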
In addition, multiple sample images can be obtained for each kind of human action; a spatio-temporal convolutional network model trained repeatedly on sample images of the same kind can recognize that action category better.
Step 1035: the last two layers of the spatio-temporal convolutional network model are deleted to obtain the human body feature extraction model.
In this step, this application does not perform recognition from the image alone but combines image and sound; therefore the last two layers of the spatio-temporal convolutional network model are removed, so that the model can directly produce the human body feature representation.
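Assuming, for illustration, that the trained model is a torch.nn.Sequential of five layers, step 1035 reduces to dropping the last two children:

```python
import torch.nn as nn

def to_feature_extractor(stgcn_model: nn.Sequential) -> nn.Sequential:
    """Drop the fully connected and classification layers (step 1035),
    keeping the first three layers as the human body feature extractor."""
    return nn.Sequential(*list(stgcn_model.children())[:-2])
```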
In a specific embodiment, step 1033 specifically comprises:
Step 10331: the five layers of the constructed spatio-temporal convolutional network are, respectively, a first receiving layer, a second spatial feature analysis layer, a third temporal feature analysis layer, a fourth fully connected layer, and a fifth classification layer.
Step 10332: the first layer passes the received sample feature matrix to the second layer.
Step 10333: the second layer extracts the spatial features of the sample feature matrix and sends the extracted spatial features, together with the sample feature matrix, to the third layer.
Step 10334: the third layer extracts the temporal features from the sample feature matrix, combines the temporal and spatial features into a one-dimensional feature vector, and sends it to the fourth layer.
Step 10335: the fourth layer performs fully connected processing on the one-dimensional feature vector and sends the processed vector to the fifth layer.
Step 10336: the fifth layer analyzes the processed one-dimensional feature vector, determines the corresponding sample human action category, and outputs it.
In the above scheme, the first layer has D input ports, so a sample feature matrix of dimension D can be input directly in matrix form. Once input is complete, the second layer extracts the human actions in the sample feature matrix as spatial features; the third layer analyzes each human action obtained from the multiple pictures in chronological order and combines the temporal and spatial features into a one-dimensional feature vector; the fourth and fifth layers then derive the corresponding sample action category directly from that vector. In this way the one-dimensional feature vector of the sample feature matrix is determined in both the spatial and temporal dimensions, making the sample action category determined from it more accurate and faster to obtain.
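The following is a simplified, illustrative sketch of the five-layer structure; the layer types, hidden sizes, and the softmax placement are assumptions chosen to mirror steps 10331-10336, not the exact network of this application.

```python
import torch.nn as nn

class FiveLayerSTCN(nn.Module):
    """Illustrative five-layer structure; all dimensions are assumptions."""
    def __init__(self, d_in=256, hidden=128, num_classes=10):
        super().__init__()
        self.receive = nn.Identity()                                        # layer 1: receiving
        self.spatial = nn.Conv1d(4, hidden, kernel_size=1)                  # layer 2: spatial features
        self.temporal = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1) # layer 3: temporal features
        self.fc = nn.Linear(hidden * d_in, hidden)                          # layer 4: fully connected
        self.classify = nn.Linear(hidden, num_classes)                      # layer 5: classification

    def forward(self, x):                 # x: (batch, 4, D) sample feature matrix
        x = self.receive(x)
        x = self.spatial(x)
        x = self.temporal(x)
        x = x.flatten(1)                  # combine into a one-dimensional feature vector
        x = self.fc(x)
        return self.classify(x).softmax(dim=-1)
```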
In a specific embodiment, before step 106 the method specifically comprises:
Step 1061: for M people, an action image of each person is obtained as a training image, and a training speech of a predetermined length is recorded for each person, yielding M training images and M training speech samples.
In this step, the training image and training speech obtained for each person are paired, so that each resulting training fused feature vector comes from the same person, which makes training on the fused feature vectors more effective.
Step 1062: each training image and each training speech sample is labeled with a corresponding training human action label.
In this step, the training image and training speech of the same person carry the same training human action label.
Step 1063: feature extraction is performed on each training image, yielding M training feature matrices.
In this step, each training image is digitized, the image of the environment around the body is removed, and the body's image is cropped out; then information features such as facial expression, body movements, and hand-held objects are extracted and converted into a training feature matrix of dimension D.
Step 1064: the M training feature matrices are input in turn into the human body feature extraction model for processing, and M training image feature vectors are output.
In this step, each training feature matrix is processed by the three layers of the human body feature extraction model, yielding the corresponding training image feature vector.
Step 1065: text feature extraction is performed on the M training speech samples, yielding M training speech feature vectors.
In this step, an Automatic Speech Recognition (ASR) system transcribes each training speech sample into text, from which features are extracted to obtain a training speech feature vector of dimension D.
Step 1066: the training image feature vector and training speech feature vector belonging to the same human action label are cross-fused to obtain a training fused feature vector; the M training image feature vectors and M training speech feature vectors are fused correspondingly into M training fused feature vectors.
In this step, fusing the training image feature vector and training speech feature vector of the same human action label guarantees that each training fused feature vector has a unique human action label.
Step 1067: the M training fused feature vectors are input in turn into the convolutional neural network for training, and the output training human action category is compared with the corresponding training human action label to determine the corresponding training loss function.
Step 1068: the convolutional neural network is adjusted according to the training loss function to obtain the convolutional neural network model.
In this step, each input of a training fused feature vector yields one training loss function; the parameters of the convolutional neural network are adjusted according to it before the next training fused feature vector is input, and this process is repeated until all training fused feature vectors have been used. The result is a convolutional neural network model that can recognize human actions from a fused feature vector combining a person's image and speech.
Step 1069: before the output layer of the convolutional neural network, a judgment layer is added that can judge, from the obtained training human action category, whether the behavior is abnormal, yielding the abnormal behavior recognition model.
In this step, the names of the human action categories that constitute abnormal behavior are added to the judgment layer. After the convolutional neural network model produces a human action category, that category is looked up among the registered abnormal action categories. If it is found, the category constitutes abnormal behavior, and the category together with the judgment result is output directly from the output layer. If it is not found, the category does not constitute abnormal behavior, and the category together with the judgment result is likewise output from the output layer.
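As a sketch, the judgment layer amounts to a membership test against the registered abnormal categories; the category names below are hypothetical examples, not the configured list of this application.

```python
# Hypothetical names of action categories deemed abnormal; the actual list
# would be configured when the judgment layer is added (step 1069).
ABNORMAL_CATEGORIES = {"punching", "kicking", "holding a gun", "holding a knife"}

def judgment_layer(action_category: str):
    """Look up the predicted category among the registered abnormal categories
    and output both the category and the judgment result."""
    return action_category, action_category in ABNORMAL_CATEGORIES
```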
In a specific embodiment, the method further comprises:
Step 101': when it is detected that a user has entered the recognition area, the camera is controlled to capture multiple motion images of the user to be recognized, and the recording component is simultaneously started to record speech to be recognized for a predetermined time.
Step 102 then specifically comprises:
Step 1021: the multiple motion images to be recognized are input into an encoding processor, and the self-attention mechanism layer in the encoding processor performs a visual analysis of each motion image to be recognized and extracts its visual features, so that the multiple motion images yield multiple corresponding sets of visual features.
In this step, a self-attention mechanism layer is added to the encoding processor in advance. Using it, each motion image to be recognized can be analyzed visually, interfering environmental factors can be removed, and the human body contour and posture features (i.e., the visual features) can be extracted, so that the multiple motion images to be recognized yield the corresponding multiple sets of visual features.
Step 1022: the multiple visual features are input into the superposition layer of the encoding processor for superposition processing, yielding multiple superposition results.
Step 1023: the multiple superposition results are input into the residual layer for residual processing, which strengthens them.
In this step, each superposition result is input into the residual layer and compared with the estimated value there to judge its reliability, and the superposition result is strengthened. This prevents the gradients of the features in the image from vanishing during feature extraction.
Step 1024: the strengthened superposition results are concatenated and then linearly transformed, yielding the feature matrix to be recognized.
In this step, the strengthened superposition results are linearly concatenated according to dimension D, yielding the feature matrix of dimension D to be recognized.
Through the above technical solution, the multiple motion images to be recognized can be processed into a feature matrix of dimension D to be recognized; this processing makes the features in the resulting matrix more prominent, enabling fast and accurate recognition.
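For illustration, one encoding layer of the kind described in steps 1021-1023 (self-attention followed by a residual connection and normalization) might look as follows in PyTorch; the dimensions and head count are assumptions.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Illustrative encoding layer: self-attention, residual, normalization."""
    def __init__(self, d_model=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (batch, seq, d_model)
        attended, _ = self.attn(x, x, x)      # visual self-attention
        return self.norm(x + attended)        # residual keeps gradients from vanishing
```

The per-image outputs would then be concatenated and passed through a linear layer, corresponding to the concatenation and linear processing of step 1024.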
In addition, in step 1032 feature extraction is performed on the sample images in the same manner as above. Likewise, in step 1063 feature extraction is performed on the training images in the same manner.
In a specific embodiment, step 104 specifically comprises:
Step 1041: text features are extracted from the speech to be recognized using an automatic speech recognition algorithm.
Step 1042: the extracted text features are analyzed with a self-attention mechanism to extract word feature vectors.
In this step, the self-attention mechanism is likewise used when extracting speech features, which makes the resulting image feature vectors and word feature vectors relatively similar and facilitates the later fusion of image and speech features.
Step 1043: the word feature vectors are linearly transformed to obtain the speech feature vector to be recognized.
In this step, the word feature vectors are linearly transformed according to dimension D, yielding the speech feature vector of dimension D to be recognized.
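A hedged sketch of steps 1041-1043 follows, assuming the ASR transcription and tokenization happen upstream and using illustrative vocabulary and dimension sizes.

```python
import torch
import torch.nn as nn

class SpeechFeatureExtractor(nn.Module):
    """Illustrative pipeline: text tokens -> self-attention -> linear map to D."""
    def __init__(self, vocab_size=10000, d_model=128, d_out=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.proj = nn.Linear(d_model, d_out)    # linear transform to dimension D

    def forward(self, token_ids):                # token_ids: (batch, seq) of ints
        x = self.embed(token_ids)
        x, _ = self.attn(x, x, x)                # word feature vectors
        return self.proj(x.mean(dim=1))          # pooled D-dimensional speech vector
```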
In addition, in step 1065 feature extraction is performed on the training speech in the same manner as above.
In a specific embodiment, step 105 specifically comprises:
Step 1051: the image feature vector to be recognized and the speech feature vector to be recognized are cross-added using an additive attention mechanism, yielding the added feature vector.
In this step, additive attention is used to cross-add the image feature vector to be recognized and the speech feature vector to be recognized.
The specific formulation is as follows, where Q is the word vector and K = V is the image matrix:
head_i = Attention(Q_i, K, V)
Step 1052: a dot product is applied to the added feature vector using the scalar product method, yielding the fused feature vector to be recognized.
The scaled dot-product method is used (the dot product, or scalar product, is a binary operation that takes two vectors over the real numbers R and returns a real-valued scalar); the specific formula is as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
The fusion feature MultiHead(Q, K, V) obtained above is normalized, which reduces the complexity of the subsequent abnormal behavior classification.
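As a rough illustration of this fusion, the sketch below applies a single scaled dot-product attention step with Q taken from the word vector and K = V from the image matrix, followed by normalization; it stands in for, and simplifies, the two-layer multi-head mechanism described above.

```python
import torch.nn.functional as F

def fuse_features(word_vec, image_mat):
    """word_vec: (batch, D) speech vector; image_mat: (batch, N, D) image features."""
    q = word_vec.unsqueeze(1)                           # (batch, 1, D) query
    d = q.size(-1)
    scores = q @ image_mat.transpose(-2, -1) / d ** 0.5 # scaled dot product
    fused = (F.softmax(scores, dim=-1) @ image_mat).squeeze(1)
    return F.normalize(fused, dim=-1)                   # normalization step
```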
Through the above technical solution of this application, the human body feature extraction model obtained after learning and training extracts features from the user's image to obtain the image feature vector to be recognized; features are then extracted from the user's speech to obtain the speech feature vector to be recognized; the two vectors are cross-fused to obtain the fused feature vector to be recognized; and the abnormal behavior recognition model, obtained by training a convolutional neural network, processes the fused feature vector to judge whether the user's action constitutes abnormal behavior. If so, the user is judged to be a dangerous person and the corresponding interception function is activated to stop the user and prevent harm to the persons and property of others. Determining the action category from image and sound together and judging whether it is abnormal, so that corresponding measures can be taken according to the result, identifies users with abnormal behavior more quickly and accurately, raises the safety level of the enterprise, and provides an effective safeguard for the safety of its users.
In another embodiment of this application, the abnormal behavior recognition method based on voice and image features comprises the following steps:
Multiple robots are deployed in the Alde deployment area shown in FIG. 2 to recognize and judge abnormal behavior.
The specific process of recognizing and judging abnormal behavior is as follows:
1. Obtain image samples and train the spatio-temporal graph convolutional network (STGCN) to obtain the corresponding human body feature extraction model
As shown in FIG. 3:
1. For each of N people, several consecutive images (for example, four) of the person performing various actions are collected into an image group, yielding N image groups, and the action type of each image group (for example, beating, holding a gun, walking, taking something) is labeled.
2. Each image in each image group first passes through a six-layer encoding structure (encoder layers) that performs strong feature extraction of the human body contour and posture in the image, yielding multiple sets of feature matrices (each set corresponds to one image; the matrix dimension is D).
Specifically:
In each layer, the data first passes through a multi-head self-attention layer (multihead-self-attention), on which a residual layer and a normalization layer are superimposed. The self-attention mechanism extracts the key features in the picture (including the human body contour and posture) better, and the residual layer prevents the gradients of the image features from vanishing during feature extraction.
3. The multiple sets of feature matrices are concatenated and then input into a linear layer (linear-layer) that completes the linear transformation of the multiple images, yielding a multi-image-to-one fusion matrix of size D×4 (dimension D).
4. An initial spatio-temporal convolutional network is constructed, with the dropout_rate in the network set below 0.3 by default to increase the diversity of the training samples.
Here, dropout is a method of optimizing artificial neural networks with deep structure: during learning, part of the weights or outputs of the hidden layers are randomly zeroed, reducing the interdependence between nodes, thereby regularizing the neural network and lowering its structural risk.
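A two-line illustration of the dropout behavior described here:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.25)   # rate kept below the 0.3 ceiling described above
x = torch.ones(8)
print(drop(x))  # in training mode, roughly a quarter of the entries are zeroed,
                # and the survivors are scaled by 1/(1 - p) to keep the expectation
```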
5. The fusion matrix obtained in step 3 is input into the five-layer spatio-temporal graph convolutional network (STGCN) for learning and training.
Here, the first STGCN layer receives the fusion matrix; the second layer analyzes the spatial features of the multiple orientation images in the fusion matrix; the third layer analyzes the temporal features of the preceding and following frames in the fusion matrix to obtain a one-dimensional feature vector; the fourth layer performs fully connected processing on the one-dimensional feature vector; and the fifth layer, a softmax layer, classifies the human behavior from the output of the fourth layer and outputs the classification result.
The output human behavior classification result is compared with the labels above to derive the corresponding loss function, according to which the spatio-temporal graph convolutional network is adjusted until all fusion matrices obtained in step 3 have been used for learning and training, yielding the corresponding spatio-temporal graph convolutional network model.
6. Since what is needed here is the one-dimensional feature vector characterizing human behavior produced by the intermediate levels of the spatio-temporal graph convolutional network model, the fourth and fifth layers of the model obtained above are deleted, yielding a human body feature extraction model that can produce the one-dimensional feature vector (of dimension D) of human behavior.
2. Obtain speech training samples and perform strong feature extraction
As shown in FIG. 4:
1. The collected speech is converted into text using an existing ASR (Automatic Speech Recognition) system.
2. A conventional transformer self-attention mechanism is used to extract features from the text, yielding the corresponding word vector features. (The transformer mechanism is used for the strong feature extraction of both text and images, so that the extracted information features are relatively similar and easier to fuse later.)
3. The word vector features are linearly transformed so that the output word vector features have matrix dimension D (equal to the dimension of the one-dimensional feature vector of human behavior).
3. After fusing the one-dimensional feature vector of human behavior with the word vector features, use a DNN for learning and training
As shown in FIG. 5:
1. A two-layer cross-attention mechanism fuses the one-dimensional feature vector with the word vector features, specifically:
The first layer uses additive attention, the additive attention mechanism (Q = word vector, K = V = image matrix):
head_i = Attention(Q_i, K, V)
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
The second layer uses the scaled dot-product method (the dot product, or scalar product, is a binary operation that takes two vectors over the real numbers R and returns a real-valued scalar) and normalizes the fusion feature MultiHead(Q, K, V) obtained above. This reduces the complexity of the subsequent abnormal behavior classification.
The behavior category of the fused feature is given a corresponding label, and whether that behavior category is abnormal is annotated.
2. A DNN network structure is constructed, and the above fused features are input into the DNN for training; the DNN outputs the behavior category of each fused feature and whether that category is abnormal. The output is compared with the corresponding label, the loss function is computed, and the DNN structure is adjusted according to it; this process is repeated until all fused features have been used for training, yielding an abnormal behavior recognition model that can classify a user's behavior and judge whether it is abnormal.
4. Application
The human body feature extraction model and the abnormal behavior recognition model obtained above are loaded into the robot's system, and the robot performs the recognition and detection of the user's behavior.
The specific process is as follows:
1. When a user enters the Alde deployment area, the Alde robot uses its camera to capture a group of the user's motion images (for example, four) and records a segment of the user's speech.
2. Using sub-steps 2-3 of part 1, the corresponding fusion matrix is obtained.
3. The resulting fusion matrix is input into the human body feature extraction model for processing: the first layer receives the fusion matrix, the second layer analyzes the spatial features of the multiple orientation images in the matrix, and the third layer analyzes the temporal features of the preceding and following frames, yielding the corresponding image feature vector to be recognized.
4. The captured speech of the user is processed according to sub-steps 1-3 of part 2, yielding the corresponding speech feature vector to be recognized.
5. The image feature vector to be recognized and the speech feature vector to be recognized are fused according to the procedure of sub-step 1 of part 3, yielding the fused feature vector to be recognized.
6. The fused feature to be recognized is input into the abnormal behavior recognition model for processing, which outputs the human behavior category corresponding to the fused feature and whether that category is abnormal.
7. If the behavior is abnormal, one or more Alde robots are directed to intercept the user and prevent the user from entering the bank's business processing area, avoiding harm or loss to other users, public facilities, bank staff, or property. At the same time, the alarm device is activated to remind the staff to intercept and handle the abnormal user.
If the behavior is not abnormal, the user is allowed to enter the business area to transact business.
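For illustration, the interception logic of step 7 reduces to the following sketch; the robot and alarm interfaces are hypothetical placeholders, not APIs defined by this application.

```python
def handle_user(category, is_abnormal, robots, alarm):
    """Dispatch on the abnormal behavior judgment (step 7)."""
    if is_abnormal:
        for robot in robots:
            robot.intercept()        # block entry to the business processing area
        alarm.trigger(category)      # notify staff of the abnormal user
        return "intercepted"
    return "allowed"                 # normal users proceed to transact business
```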
Further, as a specific implementation of the method of FIG. 1, an embodiment of this application provides an abnormal behavior recognition apparatus based on voice and image features. As shown in FIG. 6, the apparatus comprises, connected in sequence: an acquisition module 61, an image feature extraction module 62, a feature processing module 63, a speech feature extraction module 64, a feature fusion module 65, and an abnormal behavior recognition module 66 based on voice and image features.
The acquisition module 61 is configured to, after detecting that a user has entered the recognition area, control the camera to capture the user's motion image to be recognized and simultaneously start the recording component to record speech to be recognized for a predetermined time.
The image feature extraction module 62 is configured to perform feature extraction on the motion image to be recognized to obtain a feature matrix to be recognized.
The feature processing module 63 is configured to process the feature matrix to be recognized with the human body feature extraction model to obtain the corresponding image feature vector to be recognized.
The speech feature extraction module 64 is configured to perform text feature extraction on the speech to be recognized to obtain the speech feature vector to be recognized.
The feature fusion module 65 is configured to cross-fuse the image feature vector to be recognized with the speech feature vector to be recognized to obtain the fused feature vector to be recognized.
The abnormal behavior recognition module 66 based on voice and image features is configured to input the fused feature vector to be recognized into the abnormal behavior recognition model for processing and to output the corresponding human action category and whether that category constitutes abnormal behavior.
In a specific embodiment, the acquisition module 61 is further configured to obtain a plurality of sample images representing various human actions and to label each sample image with the corresponding human action label.
The image feature extraction module 62 is further configured to perform feature extraction on each of the sample images to obtain a plurality of sample feature matrices.
The apparatus further comprises:
a construction module, configured to construct a five-layer spatio-temporal convolutional network, input the sample feature matrices in turn into the first three layers of the network for processing, and pass the resulting one-dimensional feature vectors to the last two layers for recognition, outputting the sample human action category corresponding to each sample image;
a feature extraction training module, configured to compare the sample human action category with the corresponding human action label to determine a sample loss function, and to adjust the parameters of the spatio-temporal convolutional network according to that loss function to obtain the spatio-temporal convolutional network model;
a deletion module, configured to delete the last two layers of the spatio-temporal convolutional network model to obtain the human body feature extraction model.
In a specific embodiment, the construction module specifically comprises:
a construction unit, in which the five layers of the constructed spatio-temporal convolutional network are, respectively, a first receiving layer, a second spatial feature analysis layer, a third temporal feature analysis layer, a fourth fully connected layer, and a fifth classification layer;
a transmission unit, configured for the first layer to pass the received sample feature matrix to the second layer;
a spatial feature processing unit, configured for the second layer to extract the spatial features of the sample feature matrix and send the extracted spatial features, together with the sample feature matrix, to the third layer;
a temporal feature processing unit, configured for the third layer to extract the temporal features in the sample feature matrix, combine the temporal and spatial features into a one-dimensional feature vector, and send it to the fourth layer;
a fully connected processing unit, configured for the fourth layer to perform fully connected processing on the one-dimensional feature vector and send the processed vector to the fifth layer;
an analysis unit, configured for the fifth layer to analyze the processed one-dimensional feature vector, determine the corresponding sample human action category, and output it.
In a specific embodiment, the acquisition module 61 is further configured to obtain, for M people, an action image of each person as a training image, and to record a training speech of a predetermined length for each person, yielding M training images and M training speech samples.
The apparatus further comprises:
a labeling module, configured to label each training image and each training speech sample with a corresponding training human action label.
The image feature extraction module 62 is further configured to perform feature extraction on each training image, yielding M training feature matrices.
The feature processing module 63 is further configured to input the M training feature matrices in turn into the human body feature extraction model for processing, outputting M training image feature vectors.
The speech feature extraction module 64 is further configured to perform text feature extraction on the M training speech samples, yielding M training speech feature vectors.
The apparatus further comprises:
an abnormal behavior training module, configured to cross-fuse the training image feature vector and training speech feature vector belonging to the same human action label into a training fused feature vector, the M training image feature vectors and M training speech feature vectors being fused correspondingly into M training fused feature vectors; to input the M training fused feature vectors in turn into the convolutional neural network for training, comparing the output training human action category with the corresponding training human action label to determine the corresponding training loss function; to adjust the convolutional neural network according to the training loss function to obtain the convolutional neural network model; and, before the output layer of the convolutional neural network, to add a judgment layer that can judge from the obtained training human action category whether the behavior is abnormal, yielding the abnormal behavior recognition model.
In a specific embodiment, when the acquisition module 61 detects that a user has entered the recognition area, it controls the camera to capture multiple motion images of the user to be recognized.
The image feature extraction module 62 is specifically configured to:
input the multiple motion images to be recognized into the encoding processor; perform, with the self-attention mechanism layer in the encoding processor, a visual analysis of each motion image to be recognized and extract its visual features, the multiple motion images yielding multiple corresponding sets of visual features; input the multiple visual features into the superposition layer of the encoding processor for superposition processing, yielding multiple superposition results; input the multiple superposition results into the residual layer for residual processing to strengthen them; and, after concatenating the strengthened superposition results, perform a linear transformation to obtain the feature matrix to be recognized.
In a specific embodiment, the speech feature extraction module 64 is specifically configured to: extract text features from the speech to be recognized using an automatic speech recognition algorithm; analyze the extracted text features with a self-attention mechanism to extract word feature vectors; and linearly transform the word feature vectors to obtain the speech feature vector to be recognized.
In a specific embodiment, the feature fusion module 65 is specifically configured to: cross-add the image feature vector to be recognized and the speech feature vector to be recognized using an additive attention mechanism, yielding the added feature vector; and apply a dot product to the added feature vector using the scalar product method, yielding the fused feature vector to be recognized.
基于上述图1所示方法和图2-6所示装置的实施例,为了实现上述目的,本申请实施例还提供了一种计算机设备,如图7所示,包括存储器72和处理器71,其中存储器72和处理器71均设置在总线73上存储器72存储有计算机程序,处理器71执行计算机程序时实现图1所示的基于语音及图像特征的异常行为识别方法。Based on the above-mentioned method shown in FIG. 1 and the embodiment of the apparatus shown in FIGS. 2-6, in order to achieve the above-mentioned purpose, an embodiment of the present application also provides a computer device, as shown in FIG. 7, including a memory 72 and a processor 71, The memory 72 and the processor 71 are both arranged on the bus 73 and the memory 72 stores a computer program. When the processor 71 executes the computer program, the abnormal behavior recognition method based on voice and image features shown in FIG. 1 is implemented.
Based on this understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for causing a computer device (a personal computer, a server, a network device, or the like) to execute the methods described in the implementation scenarios of the present application.
Optionally, the device may also be connected to a user interface, a network interface, a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and so on. The user interface may include a display screen and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface and a wireless interface (such as a Bluetooth or Wi-Fi interface).
Those skilled in the art will understand that the structure of the computer device provided in this embodiment does not limit the physical device, which may include more or fewer components, combine certain components, or arrange the components differently.
Based on the embodiments of the method shown in FIG. 1 and the apparatus shown in FIG. 6, an embodiment of the present application correspondingly further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the method for recognizing abnormal behavior based on voice and image features shown in FIG. 1. The computer-readable storage medium may be non-volatile or volatile.
The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the computer device and supports the running of the information processing program and of other software and/or programs. The network communication module is used to implement communication among the components within the storage medium and with other hardware and software in the computer device.
From the description of the above embodiments, those skilled in the art will clearly understand that the present application may be implemented by software plus a necessary general-purpose hardware platform, or by hardware.
By applying the technical solution of the present application, the human body feature extraction model obtained through training is used to extract features from the user's image to obtain a to-be-recognized image feature vector; features are then extracted from the user's voice to obtain a to-be-recognized voice feature vector; the two vectors are cross-fused into a to-be-recognized fused feature vector; and the abnormal behavior recognition model, obtained by training a convolutional neural network, processes the fused feature vector to judge whether the user's action constitutes abnormal behavior. If it does, the user is deemed dangerous, and the corresponding interception function is activated to stop the user and prevent harm to other people and their property. Determining the action category from the user's image and voice together, and then judging whether that category is abnormal so that corresponding measures can be taken, identifies users with abnormal behavior more quickly and accurately, raises the enterprise's safety factor, and provides an effective guarantee for the safety of its users.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that can readily occur to those skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A method for recognizing abnormal behavior based on voice and image features, wherein the steps of the method comprise:
    when it is detected that a user has entered a recognition area, controlling a camera to capture a to-be-recognized action image of the user, and simultaneously activating a recording component to record a to-be-recognized voice for a predetermined time;
    performing feature extraction on the to-be-recognized action image to obtain a to-be-recognized feature matrix;
    processing the to-be-recognized feature matrix with a human body feature extraction model to obtain a corresponding to-be-recognized image feature vector;
    performing text feature extraction on the to-be-recognized voice to obtain a to-be-recognized voice feature vector;
    cross-fusing the to-be-recognized image feature vector and the to-be-recognized voice feature vector to obtain a to-be-recognized fused feature vector; and
    inputting the to-be-recognized fused feature vector into an abnormal behavior recognition model for processing, and outputting a corresponding human action category and whether the human action category constitutes abnormal behavior.
  2. The method according to claim 1, wherein, before the processing of the to-be-recognized feature matrix with the human body feature extraction model to obtain the corresponding to-be-recognized image feature vector, the method further comprises:
    acquiring a plurality of sample images representing various human actions, and annotating each sample image with a corresponding human action label;
    performing feature extraction on each of the plurality of sample images to obtain a plurality of sample feature matrices;
    constructing a five-layer spatiotemporal convolutional network, inputting the plurality of sample feature matrices in turn into the first three layers of the spatiotemporal convolutional network for processing, passing the resulting one-dimensional feature vectors to the last two layers for recognition processing, and outputting a sample human action category corresponding to each sample image;
    comparing the sample human action categories with the corresponding human action labels to determine a sample loss function, and adjusting the parameters of the spatiotemporal convolutional network according to the sample loss function to obtain a spatiotemporal convolutional network model; and
    deleting the last two layers of the spatiotemporal convolutional network model to obtain the human body feature extraction model.
  3. The method according to claim 2, wherein the constructing of the five-layer spatiotemporal convolutional network, the inputting of the plurality of sample feature matrices in turn into the first three layers for processing, the passing of the resulting one-dimensional feature vectors to the last two layers for recognition processing, and the outputting of the sample human action category corresponding to each sample image specifically comprise:
    the five layers of the constructed spatiotemporal convolutional network being, respectively, a first receiving layer, a second spatial feature analysis layer, a third temporal feature analysis layer, a fourth fully connected layer, and a fifth classification layer;
    the first layer passing each received sample feature matrix to the second layer;
    the second layer extracting the spatial features of the sample feature matrix and sending the extracted spatial features, together with the sample feature matrix, to the third layer;
    the third layer extracting the temporal features from the sample feature matrix, combining the temporal features and the spatial features into a one-dimensional feature vector, and sending it to the fourth layer;
    the fourth layer applying fully connected processing to the one-dimensional feature vector and sending the processed one-dimensional feature vector to the fifth layer; and
    the fifth layer analyzing the processed one-dimensional feature vector, determining the corresponding sample human action category, and outputting it.
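The claim names the five layers but not their concrete operators. Outside the claims, and purely for orientation, one plausible realization uses a linear layer for spatial analysis and a GRU for temporal analysis as stand-ins; dropping the last two modules after training then yields the feature extractor of claim 2:

```python
import torch
import torch.nn as nn

class SpatioTemporalNet(nn.Module):
    """Illustrative five-layer network: receive, spatial analysis, temporal
    analysis, fully connected, classify. The layer operators are assumptions."""
    def __init__(self, dim=128, num_classes=10):
        super().__init__()
        self.spatial = nn.Linear(dim, dim)                   # layer 2
        self.temporal = nn.GRU(dim, dim, batch_first=True)   # layer 3
        self.fc = nn.Linear(2 * dim, dim)                    # layer 4
        self.cls = nn.Linear(dim, num_classes)               # layer 5

    def forward(self, x):                 # layer 1 receives x: (batch, time, dim)
        s = torch.relu(self.spatial(x))   # spatial features
        _, h = self.temporal(x)           # final hidden state as temporal feature
        one_d = torch.cat([s.mean(dim=1), h[-1]], dim=1)  # combined 1-D vector
        return self.cls(torch.relu(self.fc(one_d)))

# After training, deleting the last two layers (fc, cls) would leave the
# first three layers as the human body feature extraction model of claim 2.
```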
  4. The method according to any one of claims 1 to 3, wherein, before the inputting of the to-be-recognized fused feature vector into the abnormal behavior recognition model for processing and the outputting of the corresponding human action category and whether the human action category constitutes abnormal behavior, the method further comprises:
    acquiring, for each of M persons, an action image as a training image while recording a training voice of a predetermined duration for each person, to obtain M training images and M training voices;
    annotating each training image and each training voice with a corresponding training human action label;
    performing feature extraction on each training image to obtain M training feature matrices;
    inputting the M training feature matrices in turn into the human body feature extraction model for processing, and outputting M training image feature vectors;
    performing text feature extraction on the M training voices to obtain M training voice feature vectors;
    cross-fusing the training image feature vectors and training voice feature vectors belonging to the same human action label to obtain training fused feature vectors, the M training image feature vectors and the M training voice feature vectors being correspondingly fused into M training fused feature vectors;
    inputting the M training fused feature vectors in turn into a convolutional neural network for training processing, and comparing each output training human action category with the corresponding training human action label to determine a corresponding training loss function;
    adjusting the convolutional neural network according to the training loss function to obtain a convolutional neural network model; and
    adding, before the output layer of the convolutional neural network, a judgment layer capable of determining, from the obtained training human action category, whether the category constitutes abnormal behavior, to obtain the abnormal behavior recognition model.
  5. The method according to claim 1, wherein, when it is detected that the user has entered the recognition area, the camera is controlled to capture a plurality of to-be-recognized action images of the user;
    and the performing of feature extraction on the to-be-recognized action images to obtain the to-be-recognized feature matrix specifically comprises:
    inputting the plurality of to-be-recognized action images into an encoding processor, performing visual analysis on each to-be-recognized action image with the self-attention mechanism layer of the encoding processor, and extracting a visual feature of each to-be-recognized action image, the plurality of to-be-recognized action images thus yielding a corresponding plurality of visual features;
    inputting the plurality of visual features into a superposition layer of the encoding processor for superposition processing to obtain a plurality of superposition results;
    inputting the plurality of superposition results into a residual layer for residual processing to reinforce the plurality of superposition results; and
    splicing the reinforced superposition results together and applying linear processing to obtain the to-be-recognized feature matrix.
  6. The method according to claim 1, wherein the performing of text feature extraction on the to-be-recognized voice to obtain the to-be-recognized voice feature vector specifically comprises:
    extracting text features from the to-be-recognized voice with an automatic speech recognition algorithm;
    performing text feature analysis on the extracted text features with a self-attention mechanism to extract word feature vectors; and
    linearly transforming the word feature vectors to obtain the to-be-recognized voice feature vector.
  7. The method according to claim 1, wherein the cross-fusing of the to-be-recognized image feature vector and the to-be-recognized voice feature vector to obtain the to-be-recognized fused feature vector specifically comprises:
    cross-adding the to-be-recognized image feature vector and the to-be-recognized voice feature vector with an additive attention mechanism to obtain a summed feature vector; and
    performing a dot product operation on the summed feature vector by the scalar product method to obtain the to-be-recognized fused feature vector.
  8. An apparatus for recognizing abnormal behavior based on voice and image features, wherein the apparatus comprises:
    an acquisition module, configured to control a camera to capture a to-be-recognized action image of a user when it is detected that the user has entered a recognition area, and simultaneously to activate a recording component to record a to-be-recognized voice for a predetermined time;
    an image feature extraction module, configured to perform feature extraction on the to-be-recognized action image to obtain a to-be-recognized feature matrix;
    a feature processing module, configured to process the to-be-recognized feature matrix with a human body feature extraction model to obtain a corresponding to-be-recognized image feature vector;
    a voice feature extraction module, configured to perform text feature extraction on the to-be-recognized voice to obtain a to-be-recognized voice feature vector;
    a feature fusion module, configured to cross-fuse the to-be-recognized image feature vector and the to-be-recognized voice feature vector to obtain a to-be-recognized fused feature vector; and
    an abnormal behavior recognition module based on voice and image features, configured to input the to-be-recognized fused feature vector into an abnormal behavior recognition model for processing, and to output a corresponding human action category and whether the human action category constitutes abnormal behavior.
  9. A computer device, wherein the computer device comprises a memory and a processor connected to each other, the memory being configured to store a computer program, the computer program being configured to be executed by the processor and to carry out a method for recognizing abnormal behavior based on voice and image features:
    wherein the method comprises:
    when it is detected that a user has entered a recognition area, controlling a camera to capture a to-be-recognized action image of the user, and simultaneously activating a recording component to record a to-be-recognized voice for a predetermined time;
    performing feature extraction on the to-be-recognized action image to obtain a to-be-recognized feature matrix;
    processing the to-be-recognized feature matrix with a human body feature extraction model to obtain a corresponding to-be-recognized image feature vector;
    performing text feature extraction on the to-be-recognized voice to obtain a to-be-recognized voice feature vector;
    cross-fusing the to-be-recognized image feature vector and the to-be-recognized voice feature vector to obtain a to-be-recognized fused feature vector; and
    inputting the to-be-recognized fused feature vector into an abnormal behavior recognition model for processing, and outputting a corresponding human action category and whether the human action category constitutes abnormal behavior.
  10. The computer device according to claim 9, wherein, before the processing of the to-be-recognized feature matrix with the human body feature extraction model to obtain the corresponding to-be-recognized image feature vector, the method further comprises:
    acquiring a plurality of sample images representing various human actions, and annotating each sample image with a corresponding human action label;
    performing feature extraction on each of the plurality of sample images to obtain a plurality of sample feature matrices;
    constructing a five-layer spatiotemporal convolutional network, inputting the plurality of sample feature matrices in turn into the first three layers of the spatiotemporal convolutional network for processing, passing the resulting one-dimensional feature vectors to the last two layers for recognition processing, and outputting a sample human action category corresponding to each sample image;
    comparing the sample human action categories with the corresponding human action labels to determine a sample loss function, and adjusting the parameters of the spatiotemporal convolutional network according to the sample loss function to obtain a spatiotemporal convolutional network model; and
    deleting the last two layers of the spatiotemporal convolutional network model to obtain the human body feature extraction model.
  11. The computer device according to claim 10, wherein the constructing of the five-layer spatiotemporal convolutional network, the inputting of the plurality of sample feature matrices in turn into the first three layers for processing, the passing of the resulting one-dimensional feature vectors to the last two layers for recognition processing, and the outputting of the sample human action category corresponding to each sample image specifically comprise:
    the five layers of the constructed spatiotemporal convolutional network being, respectively, a first receiving layer, a second spatial feature analysis layer, a third temporal feature analysis layer, a fourth fully connected layer, and a fifth classification layer;
    the first layer passing each received sample feature matrix to the second layer;
    the second layer extracting the spatial features of the sample feature matrix and sending the extracted spatial features, together with the sample feature matrix, to the third layer;
    the third layer extracting the temporal features from the sample feature matrix, combining the temporal features and the spatial features into a one-dimensional feature vector, and sending it to the fourth layer;
    the fourth layer applying fully connected processing to the one-dimensional feature vector and sending the processed one-dimensional feature vector to the fifth layer; and
    the fifth layer analyzing the processed one-dimensional feature vector, determining the corresponding sample human action category, and outputting it.
  12. The computer device according to any one of claims 9 to 11, wherein, before the inputting of the to-be-recognized fused feature vector into the abnormal behavior recognition model for processing and the outputting of the corresponding human action category and whether the human action category constitutes abnormal behavior, the method further comprises:
    acquiring, for each of M persons, an action image as a training image while recording a training voice of a predetermined duration for each person, to obtain M training images and M training voices;
    annotating each training image and each training voice with a corresponding training human action label;
    performing feature extraction on each training image to obtain M training feature matrices;
    inputting the M training feature matrices in turn into the human body feature extraction model for processing, and outputting M training image feature vectors;
    performing text feature extraction on the M training voices to obtain M training voice feature vectors;
    cross-fusing the training image feature vectors and training voice feature vectors belonging to the same human action label to obtain training fused feature vectors, the M training image feature vectors and the M training voice feature vectors being correspondingly fused into M training fused feature vectors;
    inputting the M training fused feature vectors in turn into a convolutional neural network for training processing, and comparing each output training human action category with the corresponding training human action label to determine a corresponding training loss function;
    adjusting the convolutional neural network according to the training loss function to obtain a convolutional neural network model; and
    adding, before the output layer of the convolutional neural network, a judgment layer capable of determining, from the obtained training human action category, whether the category constitutes abnormal behavior, to obtain the abnormal behavior recognition model.
  13. The computer device according to claim 9, wherein, when it is detected that the user has entered the recognition area, the camera is controlled to capture a plurality of to-be-recognized action images of the user;
    and the performing of feature extraction on the to-be-recognized action images to obtain the to-be-recognized feature matrix specifically comprises:
    inputting the plurality of to-be-recognized action images into an encoding processor, performing visual analysis on each to-be-recognized action image with the self-attention mechanism layer of the encoding processor, and extracting a visual feature of each to-be-recognized action image, the plurality of to-be-recognized action images thus yielding a corresponding plurality of visual features;
    inputting the plurality of visual features into a superposition layer of the encoding processor for superposition processing to obtain a plurality of superposition results;
    inputting the plurality of superposition results into a residual layer for residual processing to reinforce the plurality of superposition results; and
    splicing the reinforced superposition results together and applying linear processing to obtain the to-be-recognized feature matrix.
  14. The computer device according to claim 9, wherein the performing of text feature extraction on the to-be-recognized voice to obtain the to-be-recognized voice feature vector specifically comprises:
    extracting text features from the to-be-recognized voice with an automatic speech recognition algorithm;
    performing text feature analysis on the extracted text features with a self-attention mechanism to extract word feature vectors; and
    linearly transforming the word feature vectors to obtain the to-be-recognized voice feature vector.
  15. The computer device according to claim 9, wherein the cross-fusing of the to-be-recognized image feature vector and the to-be-recognized voice feature vector to obtain the to-be-recognized fused feature vector specifically comprises:
    cross-adding the to-be-recognized image feature vector and the to-be-recognized voice feature vector with an additive attention mechanism to obtain a summed feature vector; and
    performing a dot product operation on the summed feature vector by the scalar product method to obtain the to-be-recognized fused feature vector.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, implements a method for recognizing abnormal behavior based on voice and image features, the method comprising the following steps:
    when it is detected that a user has entered a recognition area, controlling a camera to capture a to-be-recognized action image of the user, and simultaneously activating a recording component to record a to-be-recognized voice for a predetermined time;
    performing feature extraction on the to-be-recognized action image to obtain a to-be-recognized feature matrix;
    processing the to-be-recognized feature matrix with a human body feature extraction model to obtain a corresponding to-be-recognized image feature vector;
    performing text feature extraction on the to-be-recognized voice to obtain a to-be-recognized voice feature vector;
    cross-fusing the to-be-recognized image feature vector and the to-be-recognized voice feature vector to obtain a to-be-recognized fused feature vector; and
    inputting the to-be-recognized fused feature vector into an abnormal behavior recognition model for processing, and outputting a corresponding human action category and whether the human action category constitutes abnormal behavior.
  17. The computer-readable storage medium according to claim 16, wherein, before the processing of the to-be-recognized feature matrix with the human body feature extraction model to obtain the corresponding to-be-recognized image feature vector, the method further comprises:
    acquiring a plurality of sample images representing various human actions, and annotating each sample image with a corresponding human action label;
    performing feature extraction on each of the plurality of sample images to obtain a plurality of sample feature matrices;
    constructing a five-layer spatiotemporal convolutional network, inputting the plurality of sample feature matrices in turn into the first three layers of the spatiotemporal convolutional network for processing, passing the resulting one-dimensional feature vectors to the last two layers for recognition processing, and outputting a sample human action category corresponding to each sample image;
    comparing the sample human action categories with the corresponding human action labels to determine a sample loss function, and adjusting the parameters of the spatiotemporal convolutional network according to the sample loss function to obtain a spatiotemporal convolutional network model; and
    deleting the last two layers of the spatiotemporal convolutional network model to obtain the human body feature extraction model.
  18. The computer-readable storage medium according to claim 17, wherein the constructing of the five-layer spatiotemporal convolutional network, the inputting of the plurality of sample feature matrices in turn into the first three layers for processing, the passing of the resulting one-dimensional feature vectors to the last two layers for recognition processing, and the outputting of the sample human action category corresponding to each sample image specifically comprise:
    the five layers of the constructed spatiotemporal convolutional network being, respectively, a first receiving layer, a second spatial feature analysis layer, a third temporal feature analysis layer, a fourth fully connected layer, and a fifth classification layer;
    the first layer passing each received sample feature matrix to the second layer;
    the second layer extracting the spatial features of the sample feature matrix and sending the extracted spatial features, together with the sample feature matrix, to the third layer;
    the third layer extracting the temporal features from the sample feature matrix, combining the temporal features and the spatial features into a one-dimensional feature vector, and sending it to the fourth layer;
    the fourth layer applying fully connected processing to the one-dimensional feature vector and sending the processed one-dimensional feature vector to the fifth layer; and
    the fifth layer analyzing the processed one-dimensional feature vector, determining the corresponding sample human action category, and outputting it.
  19. The computer-readable storage medium according to any one of claims 16 to 18, wherein, before the inputting of the to-be-recognized fused feature vector into the abnormal behavior recognition model for processing and the outputting of the corresponding human action category and whether the human action category constitutes abnormal behavior, the method further comprises:
    acquiring, for each of M persons, an action image as a training image while recording a training voice of a predetermined duration for each person, to obtain M training images and M training voices;
    annotating each training image and each training voice with a corresponding training human action label;
    performing feature extraction on each training image to obtain M training feature matrices;
    inputting the M training feature matrices in turn into the human body feature extraction model for processing, and outputting M training image feature vectors;
    performing text feature extraction on the M training voices to obtain M training voice feature vectors;
    cross-fusing the training image feature vectors and training voice feature vectors belonging to the same human action label to obtain training fused feature vectors, the M training image feature vectors and the M training voice feature vectors being correspondingly fused into M training fused feature vectors;
    inputting the M training fused feature vectors in turn into a convolutional neural network for training processing, and comparing each output training human action category with the corresponding training human action label to determine a corresponding training loss function;
    adjusting the convolutional neural network according to the training loss function to obtain a convolutional neural network model; and
    adding, before the output layer of the convolutional neural network, a judgment layer capable of determining, from the obtained training human action category, whether the category constitutes abnormal behavior, to obtain the abnormal behavior recognition model.
  20. The computer-readable storage medium according to claim 16, wherein, when it is detected that the user has entered the recognition area, the camera is controlled to capture a plurality of to-be-recognized action images of the user;
    and the performing of feature extraction on the to-be-recognized action images to obtain the to-be-recognized feature matrix specifically comprises:
    inputting the plurality of to-be-recognized action images into an encoding processor, performing visual analysis on each to-be-recognized action image with the self-attention mechanism layer of the encoding processor, and extracting a visual feature of each to-be-recognized action image, the plurality of to-be-recognized action images thus yielding a corresponding plurality of visual features;
    inputting the plurality of visual features into a superposition layer of the encoding processor for superposition processing to obtain a plurality of superposition results;
    inputting the plurality of superposition results into a residual layer for residual processing to reinforce the plurality of superposition results; and
    splicing the reinforced superposition results together and applying linear processing to obtain the to-be-recognized feature matrix.
PCT/CN2020/111664 2020-02-27 2020-08-27 Method, apparatus and device for recognizing abnormal behavior on the basis of voice and image features WO2021169209A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010123166.8 2020-02-27
CN202010123166.8A CN111460889B (en) 2020-02-27 2020-02-27 Abnormal behavior recognition method, device and equipment based on voice and image characteristics

Publications (1)

Publication Number Publication Date
WO2021169209A1 true WO2021169209A1 (en) 2021-09-02

Family

ID=71685056

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/111664 WO2021169209A1 (en) 2020-02-27 2020-08-27 Method, apparatus and device for recognizing abnormal behavior on the basis of voice and image features

Country Status (2)

Country Link
CN (1) CN111460889B (en)
WO (1) WO2021169209A1 (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460889B (en) * 2020-02-27 2023-10-31 平安科技(深圳)有限公司 Abnormal behavior recognition method, device and equipment based on voice and image characteristics
CN112132430B (en) * 2020-09-14 2022-09-27 国网山东省电力公司电力科学研究院 Reliability evaluation method and system for distributed state sensor of power distribution main equipment
CN111832581B (en) * 2020-09-21 2021-01-29 平安科技(深圳)有限公司 Lung feature recognition method and device, computer equipment and storage medium
CN112259078A (en) * 2020-10-15 2021-01-22 上海依图网络科技有限公司 Method and device for training audio recognition model and recognizing abnormal audio
CN112289306B (en) * 2020-11-18 2024-03-26 上海依图网络科技有限公司 Juvenile identification method and device based on human body characteristics
CN113409769B (en) * 2020-11-24 2024-02-09 腾讯科技(深圳)有限公司 Data identification method, device, equipment and medium based on neural network model
CN112992340A (en) * 2021-02-24 2021-06-18 北京大学 Disease early warning method, device, equipment and storage medium based on behavior recognition
CN112820071B (en) * 2021-02-25 2023-05-05 泰康保险集团股份有限公司 Behavior recognition method and device
CN113159013A (en) * 2021-04-28 2021-07-23 平安科技(深圳)有限公司 Paragraph identification method and device based on machine learning, computer equipment and medium
CN113255597B (en) * 2021-06-29 2021-09-28 南京视察者智能科技有限公司 Transformer-based behavior analysis method and device and terminal equipment thereof
CN113610082A (en) * 2021-08-12 2021-11-05 北京有竹居网络技术有限公司 Character recognition method and related equipment thereof
CN113673489B (en) * 2021-10-21 2022-04-08 之江实验室 Video group behavior identification method based on cascade Transformer
CN114140673B (en) * 2022-02-07 2022-05-20 人民中科(北京)智能技术有限公司 Method, system and equipment for identifying violation image
CN114581749B (en) * 2022-05-09 2022-07-26 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application
CN114821805B (en) * 2022-05-18 2023-07-18 湖北大学 Dangerous behavior early warning method, dangerous behavior early warning device and dangerous behavior early warning equipment
CN116704405A (en) * 2023-05-22 2023-09-05 阿里巴巴(中国)有限公司 Behavior recognition method, electronic device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647599B (en) * 2018-04-27 2022-04-15 南京航空航天大学 Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network
CN109784150B (en) * 2018-12-06 2023-08-01 东南大学 Video driver behavior identification method based on multitasking space-time convolutional neural network
CN110222653B (en) * 2019-06-11 2020-06-16 中国矿业大学(北京) Skeleton data behavior identification method based on graph convolution neural network
KR20190101329A (en) * 2019-08-12 2019-08-30 엘지전자 주식회사 Intelligent voice outputting method, apparatus, and intelligent computing device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6496184B1 (en) * 1998-11-30 2002-12-17 William T. Freeman Method for inferring scenes from test images and training data using probability propagation in a markov network
CN105913559A (en) * 2016-04-06 2016-08-31 南京华捷艾米软件科技有限公司 Motion sensing technique based bank ATM intelligent monitoring method
CN110276265A (en) * 2019-05-27 2019-09-24 魏运 Pedestrian monitoring method and device based on intelligent three-dimensional solid monitoring device
CN110837579A (en) * 2019-11-05 2020-02-25 腾讯科技(深圳)有限公司 Video classification method, device, computer and readable storage medium
CN111460889A (en) * 2020-02-27 2020-07-28 平安科技(深圳)有限公司 Abnormal behavior identification method, device and equipment based on voice and image characteristics

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743525A (en) * 2021-09-14 2021-12-03 杭州电子科技大学 Fabric material identification system and method based on luminosity stereo
CN113743525B (en) * 2021-09-14 2024-02-13 杭州电子科技大学 Fabric material identification system and method based on luminosity three-dimensional
CN114218984A (en) * 2021-12-07 2022-03-22 桂林电子科技大学 Radio frequency fingerprint identification method based on sample multi-view learning
CN114218984B (en) * 2021-12-07 2024-03-22 桂林电子科技大学 Radio frequency fingerprint identification method based on sample multi-view learning
CN113987274A (en) * 2021-12-30 2022-01-28 智者四海(北京)技术有限公司 Video semantic representation method and device, electronic equipment and storage medium
CN114510968A (en) * 2022-01-21 2022-05-17 石家庄铁道大学 Fault diagnosis method based on Transformer
CN114612681A (en) * 2022-01-30 2022-06-10 西北大学 GCN-based multi-label image classification method, model construction method and device
CN114464182A (en) * 2022-03-03 2022-05-10 慧言科技(天津)有限公司 Voice recognition fast self-adaption method assisted by audio scene classification
CN114639169A (en) * 2022-03-28 2022-06-17 合肥工业大学 Human body action recognition system based on attention mechanism feature fusion and position independence
CN114639169B (en) * 2022-03-28 2024-02-20 合肥工业大学 Human motion recognition system based on attention mechanism feature fusion and irrelevant to position
CN114973120A (en) * 2022-04-14 2022-08-30 山东大学 Behavior identification method and system based on multi-dimensional sensing data and monitoring video multi-mode heterogeneous fusion
CN114973120B (en) * 2022-04-14 2024-03-12 山东大学 Behavior recognition method and system based on multi-dimensional sensing data and monitoring video multimode heterogeneous fusion
CN114882421A (en) * 2022-06-01 2022-08-09 江南大学 Method for recognizing skeleton behavior based on space-time feature enhancement graph convolutional network
CN114882421B (en) * 2022-06-01 2024-03-26 江南大学 Skeleton behavior recognition method based on space-time characteristic enhancement graph convolution network
CN114998834A (en) * 2022-06-06 2022-09-02 杭州中威电子股份有限公司 Medical warning system based on face image and emotion recognition
CN116664292A (en) * 2023-04-13 2023-08-29 连连银通电子支付有限公司 Training method of transaction anomaly prediction model and transaction anomaly prediction method
CN116246214A (en) * 2023-05-08 2023-06-09 浪潮电子信息产业股份有限公司 Audio-visual event positioning method, model training method, device, equipment and medium
CN116246214B (en) * 2023-05-08 2023-08-11 浪潮电子信息产业股份有限公司 Audio-visual event positioning method, model training method, device, equipment and medium
CN116703161A (en) * 2023-06-13 2023-09-05 湖南工商大学 Prediction method and device for man-machine co-fusion risk, terminal equipment and medium
CN116451139B (en) * 2023-06-16 2023-09-01 杭州新航互动科技有限公司 Live broadcast data rapid analysis method based on artificial intelligence
CN116451139A (en) * 2023-06-16 2023-07-18 杭州新航互动科技有限公司 Live broadcast data rapid analysis method based on artificial intelligence

Also Published As

Publication number Publication date
CN111460889B (en) 2023-10-31
CN111460889A (en) 2020-07-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20921300

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20921300

Country of ref document: EP

Kind code of ref document: A1