WO2021169209A1 - Method, apparatus and device for recognizing abnormal behavior on the basis of voice and image features - Google Patents

Method, apparatus and device for recognizing abnormal behavior on the basis of voice and image features

Info

Publication number
WO2021169209A1
Authority
WO
WIPO (PCT)
Prior art keywords
recognized
feature
training
image
layer
Application number
PCT/CN2020/111664
Other languages
French (fr)
Chinese (zh)
Inventor
雷宇泽
陈远旭
周宝
骆加维
廖智
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2021169209A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/251 Fusion techniques of input or preprocessed data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/02 Banking, e.g. interest calculation or account maintenance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q 50/10 Services
    • G06Q 50/26 Government or public services

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method, device and equipment for identifying abnormal behaviors based on voice and image features.
  • The security systems of the service industry bear on social stability and the safety of people's property, and have long been a focus of security development.
  • For example, the existing security systems of bank branches can no longer reliably guarantee branch operations and the safety of personnel inside the branches.
  • Some security systems in the service industry rely on triggered alarms or video surveillance; such systems can only notify the relevant personnel to respond after a dangerous person has already entered.
  • In view of this, the present application provides a method, device, and equipment for identifying abnormal behavior based on voice and image features.
  • The main purpose is to solve the technical problem of the low accuracy of current abnormal behavior recognition based on voice and image features.
  • According to a first aspect, a method for identifying abnormal behavior based on voice and image features is provided, including:
  • after detecting that the user has entered the recognition area, controlling the camera to acquire the user's action image to be recognized, and at the same time starting the recording device to record the voice to be recognized for a predetermined time;
  • inputting the fusion feature vector to be recognized into the abnormal behavior recognition model for processing, and outputting the corresponding human action category and whether that category constitutes abnormal behavior.
  • According to a second aspect, an abnormal behavior recognition device based on voice and image features is provided, including:
  • an acquisition module, configured to, after detecting that the user has entered the recognition area, control the camera to acquire the user's action image to be recognized and at the same time start the recording device to record the voice to be recognized for a predetermined time;
  • an image feature extraction module, configured to perform feature extraction on the action image to be recognized to obtain a feature matrix to be recognized;
  • a feature processing module, configured to process the feature matrix to be recognized with the human body feature extraction model to obtain the corresponding image feature vector to be recognized;
  • a voice feature extraction module, configured to perform text feature extraction on the voice to be recognized to obtain a voice feature vector to be recognized;
  • a fusion feature module, configured to cross-fuse the image feature vector to be recognized and the voice feature vector to be recognized to obtain a fusion feature vector to be recognized;
  • an abnormal behavior recognition module based on voice and image features, configured to input the fusion feature vector to be recognized into the abnormal behavior recognition model for processing, and output the corresponding human action category and whether that category constitutes abnormal behavior.
  • According to a third aspect, a computer device is provided, including a memory and a processor, the memory storing a computer program and the processor implementing the following steps when executing it:
  • after detecting that the user has entered the recognition area, controlling the camera to acquire the user's action image to be recognized, and at the same time starting the recording device to record the voice to be recognized for a predetermined time;
  • inputting the fusion feature vector to be recognized into the abnormal behavior recognition model for processing, and outputting the corresponding human action category and whether that category constitutes abnormal behavior.
  • According to a fourth aspect, a computer storage medium is provided, on which a computer program is stored, the program implementing the following steps when executed by a processor:
  • after detecting that the user has entered the recognition area, controlling the camera to acquire the user's action image to be recognized, and at the same time starting the recording device to record the voice to be recognized for a predetermined time;
  • inputting the fusion feature vector to be recognized into the abnormal behavior recognition model for processing, and outputting the corresponding human action category and whether that category constitutes abnormal behavior.
  • With the above technical solutions, this application provides a method, device, and equipment for identifying abnormal behavior based on voice and image features: the human body feature extraction model obtained through learning and training extracts features from the user's image to obtain an image feature vector to be recognized; features are then extracted from the user's voice to obtain a voice feature vector to be recognized; the two vectors are cross-fused into a fusion feature vector to be recognized; and the abnormal behavior recognition model, obtained by training a convolutional neural network, processes the fusion feature vector to determine whether the user's action constitutes abnormal behavior.
  • If it does, the user is deemed a dangerous person, and the corresponding interception function is started to intercept the user and prevent harm to the person and property of others.
  • Determining the action category from the user's image and voice together, and judging whether that category is abnormal so that corresponding measures can be taken, identifies users with abnormal behavior more quickly and accurately, raises the enterprise's safety factor, and effectively safeguards the enterprise's users.
  • FIG. 1 is a flowchart of an embodiment of the abnormal behavior recognition method based on voice and image features of this application;
  • FIG. 2 is a schematic diagram of an indoor layout in this application;
  • FIG. 3 is a training flowchart of the spatio-temporal convolutional network of this application;
  • FIG. 4 is a flowchart of speech feature extraction in this application;
  • FIG. 5 is a training flowchart of the abnormal behavior recognition model of this application;
  • FIG. 6 is a structural block diagram of an embodiment of the abnormal behavior recognition device based on voice and image features; and
  • FIG. 7 is a schematic structural diagram of the computer equipment of this application.
  • The embodiments of this application provide a method for identifying abnormal behavior based on voice and image features, which determines the action category corresponding to a user's action from the user's image and voice together and judges whether that category is abnormal, so that corresponding measures can be taken according to the judgment result.
  • an embodiment of the present application provides a method for identifying abnormal behaviors based on voice and image features, including the following steps:
  • Step 101: When it is detected that the user has entered the recognition area, the camera is controlled to acquire the user's action image to be recognized, and at the same time the recording device is started to record the voice to be recognized for a predetermined time.
  • The executor of this method may be a robot or an enterprise security system, with an executable program for the method stored in the robot or security system. A recognition area is set for the robot or security system; the size and scope of the area can be set as needed. When the camera detects that a user has entered the recognition area, the camera is pointed at the user to capture the user's action images, and the user's voice is recorded at the same time.
  • Step 102: Feature extraction is performed on the action image to be recognized to obtain a feature matrix to be recognized.
  • The acquired action image to be recognized is digitized, the image of the surrounding environment is removed and the user's image is cropped out, and then information features such as the user's facial expression, body movements, and hand-held objects are extracted from the user image and converted into a feature matrix of dimension D to be recognized.
  • Step 103: The feature matrix to be recognized is processed with the human body feature extraction model to obtain the corresponding image feature vector to be recognized.
  • The human body feature extraction model is obtained by training a spatio-temporal convolutional network on a large number of images representing various human behaviors.
  • The corresponding code program is written into the robot or security system.
  • The dimension of the model's input port is D, so the feature matrix to be recognized can enter the model directly without further conversion, and the dimension of the resulting image feature vector to be recognized is also D.
  • Step 104: Text feature extraction is performed on the voice to be recognized to obtain a voice feature vector to be recognized.
  • The text information in the voice to be recognized is extracted and converted into corresponding numbers, which are arranged to form a voice feature vector of dimension D to be recognized.
  • Step 105: The image feature vector to be recognized and the voice feature vector to be recognized are cross-fused to obtain a fusion feature vector to be recognized.
  • Because both vectors have dimension D, the fusion feature vector obtained after directly cross-fusing them also has dimension D.
  • Step 106: The fusion feature vector to be recognized is input into the abnormal behavior recognition model for processing, and the corresponding human action category, together with whether that category constitutes abnormal behavior, is output.
  • Before this step, a large number of human behavior images and recorded voices are paired and processed in the same way as steps 102-105 above to obtain fusion feature vectors for training a convolutional neural network. After these fusion feature vectors are used to train the network, an abnormal behavior recognition model capable of recognizing human behavior is obtained, and the code program corresponding to the model is written into the robot or security system. The robot or security system can then use the human body feature extraction model and the abnormal behavior recognition model to screen, through the interaction described above, the personnel entering the enterprise.
  • When abnormal behavior is recognized, the robot is controlled to intercept the person, or the interception function of the security system is activated, and the alarm device is triggered at the same time to summon staff to handle the situation.
  • This method of identifying abnormal behavior based on voice and image features effectively protects the person and property of the enterprise, its employees, and its users.
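As a point of reference only, the following Python sketch mirrors the inference flow of steps 101-106 with stand-in functions. Every name, the dimension D = 128, and the placeholder logic are assumptions for illustration; the patent does not publish code.

```python
# Illustrative sketch of steps 101-106; all names and values are assumed.
import numpy as np

D = 128  # feature dimension "D" used throughout the description (value assumed)

def extract_feature_matrix(images):
    """Step 102: turn the captured action images into a feature matrix of dimension D."""
    return np.random.rand(len(images), D)     # placeholder for the encoder described later

def human_feature_model(feature_matrix):
    """Step 103: first three layers of the trained spatio-temporal network."""
    return feature_matrix.mean(axis=0)        # placeholder producing a D-dim vector

def voice_feature_vector(audio):
    """Step 104: ASR text extraction followed by self-attention and a linear map."""
    return np.random.rand(D)                  # placeholder

def cross_fuse(img_vec, voice_vec):
    """Step 105: cross-fusion of the two D-dim vectors (detailed later in the text)."""
    return (img_vec + voice_vec) / 2          # simplest possible stand-in

def recognize(fused):
    """Step 106: recognition model plus the judgment layer described below."""
    abnormal = {"attack", "threaten"}         # assumed category names
    category = "attack" if fused.sum() > D / 2 else "walk"
    return category, category in abnormal

images, audio = [np.zeros((224, 224, 3))] * 4, None
img_vec = human_feature_model(extract_feature_matrix(images))
category, is_abnormal = recognize(cross_fuse(img_vec, voice_feature_vector(audio)))
print(category, is_abnormal)   # e.g. ('walk', False) -> no interception
```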
  • In this embodiment, the human body feature extraction model obtained through learning and training extracts features from the user's image to obtain the image feature vector to be recognized, and features are then extracted from the user's voice to obtain the voice feature vector to be recognized.
  • The two vectors are cross-fused into the fusion feature vector to be recognized, and the abnormal behavior recognition model obtained by training a convolutional neural network processes that vector to determine whether the user's action constitutes abnormal behavior; if it does, the user is deemed a dangerous person, and the corresponding interception function is started to intercept the user and prevent harm to the person and property of others.
  • Determining the action category from the user's image and voice together, and judging whether that category is abnormal so that corresponding measures can be taken, identifies users with abnormal behavior more quickly and accurately, raises the enterprise's safety factor, and effectively safeguards the enterprise's users.
  • In one embodiment, before step 103, the method further includes:
  • Step 1031: Obtain multiple sample images representing various human actions, and label each sample image with the corresponding human action label.
  • Each sample image includes multiple human action pictures, preferably four.
  • Step 1032: Perform feature extraction on each of the multiple sample images to obtain multiple sample feature matrices.
  • Each sample image is digitized, the environment around the person is removed and the person's image is cropped out, and then information features such as the person's facial expression, body movements, and hand-held objects are extracted and converted into a sample feature matrix of dimension D.
  • Step 1033: Construct a five-layer spatio-temporal convolutional network, input the multiple sample feature matrices into the first three layers of the network for processing, pass the resulting one-dimensional feature vectors to the last two layers for recognition processing, and output the sample human action category corresponding to each sample image.
  • Step 1034: Compare each sample human action category with the corresponding human action label to determine a sample loss function, and adjust the parameters of the spatio-temporal convolutional network according to the sample loss function to obtain the spatio-temporal convolutional network model.
  • In these steps, the spatio-temporal convolutional network is trained with the sample feature matrices of dimension D.
  • The network processes a sample feature matrix and outputs the corresponding sample human action category, which is then compared with the correct human action label; a sample loss function is calculated for each comparison, the network is adjusted according to it, and the adjusted network is trained on the next sample feature matrix. This process is repeated until training on all the sample feature matrices is complete, yielding a spatio-temporal convolutional network model that can identify the type of human action from images.
  • Multiple sample images can be obtained for each type of human action, so that the model, trained repeatedly on samples of the same action, recognizes human action categories better.
  • Step 1035: Delete the last two layers of the spatio-temporal convolutional network model to obtain the human body feature extraction model.
  • Because this application does not perform recognition from the image alone but combines image and voice, the last two layers of the spatio-temporal convolutional network model are deleted so that the model can directly output the human body features.
  • In one embodiment, step 1033 specifically includes:
  • Step 10331: The five layers of the constructed spatio-temporal convolutional network are: a first, receiving layer; a second, spatial feature analysis layer; a third, temporal feature analysis layer; a fourth, fully connected layer; and a fifth, classification layer.
  • Step 10332: The first layer transmits the received sample feature matrix to the second layer.
  • Step 10333: The second layer extracts the spatial features of the sample feature matrix and sends the extracted spatial features, together with the sample feature matrix, to the third layer.
  • Step 10334: The third layer extracts the temporal features in the sample feature matrix, combines the temporal and spatial features into a one-dimensional feature vector, and sends it to the fourth layer.
  • Step 10335: The fourth layer performs fully connected processing on the one-dimensional feature vector and sends the processed vector to the fifth layer.
  • Step 10336: The fifth layer analyzes the processed one-dimensional feature vector, determines the corresponding sample human action category, and outputs it.
  • The first layer has D input ports, so the sample feature matrix of dimension D can be input directly in matrix form.
  • After input, the second layer extracts the human action shown in each image of the sample feature matrix as spatial features; the third layer then analyzes the human actions obtained from the multiple images in chronological order and combines the temporal and spatial features into a one-dimensional feature vector.
  • The fourth and fifth layers derive the corresponding sample action category directly from the one-dimensional feature vector. In this way, the one-dimensional feature vector of the sample feature matrix is determined in both the spatial and the temporal dimension, so that the sample action category determined from it is more accurate and obtained faster.
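To make the five-layer structure concrete, here is a minimal PyTorch sketch under stated assumptions: the layer types (a linear map for spatial analysis, a 1-D convolution for temporal analysis), the sizes D = 128 and T = 4 frames, and the number of classes are invented for illustration, and the `features_only` flag plays the role of step 1035 (dropping the last two layers).

```python
# Sketch of the five-layer spatio-temporal network of steps 10331-10336 (all sizes assumed).
import torch
import torch.nn as nn

D, T, NUM_CLASSES = 128, 4, 10   # feature dim, frames per sample, action classes (assumed)

class SpatioTemporalNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.receive = nn.Identity()                               # layer 1: receiving layer (D input ports)
        self.spatial = nn.Linear(D, D)                             # layer 2: spatial feature analysis
        self.temporal = nn.Conv1d(D, D, kernel_size=3, padding=1)  # layer 3: temporal analysis over frames
        self.fc = nn.Linear(D, D)                                  # layer 4: fully connected layer
        self.classify = nn.Linear(D, NUM_CLASSES)                  # layer 5: classification (softmax) layer

    def forward(self, x, features_only=False):
        # x: (batch, T, D) -- one sample feature matrix per element of the batch
        x = torch.relu(self.spatial(self.receive(x)))
        t = torch.relu(self.temporal(x.transpose(1, 2)))           # convolve across the T frames
        feat = t.mean(dim=-1)                                      # combined one-dimensional feature vector
        if features_only:                                          # step 1035: drop the last two layers,
            return feat                                            # yielding the human body feature extractor
        return self.classify(torch.relu(self.fc(feat)))

net = SpatioTemporalNet()
print(net(torch.randn(2, T, D)).shape)                      # torch.Size([2, 10]) -- full network
print(net(torch.randn(2, T, D), features_only=True).shape)  # torch.Size([2, 128]) -- extractor
```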
  • In one embodiment, before step 106, the method further includes:
  • Step 1061: For M individuals, separately acquire each person's action images as training images and simultaneously record a training voice of a predetermined length for each person, obtaining M training images and M training voices.
  • The training image and the training voice acquired from the same person are combined, so that every resulting training fusion feature vector comes from a single person and training based on those vectors works better.
  • Step 1062: Label each training image and each training voice with the corresponding training human action label.
  • The training image and the training voice corresponding to the same person carry the same training human action label.
  • Step 1063: Perform feature extraction on each training image to obtain M training feature matrices.
  • Each training image is digitized, the environment around the human body is removed and the body's image is cropped out, and then information features such as facial expression, body movements, and hand-held objects are extracted and converted into a training feature matrix of dimension D.
  • Step 1064: Input the M training feature matrices into the human body feature extraction model in sequence for processing, and output M training image feature vectors.
  • Each training feature matrix is processed by the three layers of the human body feature extraction model to obtain the corresponding training image feature vector.
  • Step 1065: Perform text feature extraction on the M training voices to obtain M training voice feature vectors.
  • An automatic speech recognition (ASR) system performs speech recognition on each training voice and converts it into the corresponding text, and feature extraction on that text yields a training voice feature vector of dimension D.
  • Step 1066: Cross-fuse the training image feature vector and the training voice feature vector belonging to the same human action label to obtain a training fusion feature vector.
  • The M training image feature vectors and M training voice feature vectors are correspondingly fused into M training fusion feature vectors.
  • Fusing the training image feature vector and the training voice feature vector that share a human action label preserves the uniqueness of the human action label of each training fusion feature vector.
  • Step 1067: Input the M training fusion feature vectors into the convolutional neural network in sequence for training, and compare each output training human action category with the corresponding training human action label to determine the corresponding training loss function.
  • Step 1068: Adjust the convolutional neural network according to the training loss function to obtain a convolutional neural network model.
  • A training loss function is obtained for each training fusion feature vector that is input.
  • The parameters of the convolutional neural network are adjusted according to the training loss function, the next training fusion feature vector is input, and the process is repeated until all the training fusion feature vectors have been used, yielding a convolutional neural network model that can recognize human actions from the fused feature vectors combining a person's image and voice.
  • Step 1069: Before the output layer of the convolutional neural network, add a judgment layer that can judge, from the obtained training human action category, whether the action constitutes abnormal behavior, thereby obtaining the abnormal behavior recognition model.
  • The names of the human action categories that constitute abnormal behavior are registered in the judgment layer.
  • When a human action category is obtained, it is searched for among the registered abnormal behavior categories. If it is found, the category constitutes abnormal behavior, and the category and the judgment result are output from the output layer; if it is not found, the category does not constitute abnormal behavior, and the category and the judgment result are likewise output from the output layer.
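The judgment layer is essentially a lookup against a registered set of abnormal category names, as in the sketch below; the category strings are invented for illustration and are not taken from the patent.

```python
# Sketch of the judgment layer of step 1069 (category names assumed).
ABNORMAL_CATEGORIES = {"attack", "threaten with object", "forced entry"}

def judgment_layer(human_action_category):
    """Search the recognized category among the registered abnormal categories."""
    is_abnormal = human_action_category in ABNORMAL_CATEGORIES
    return human_action_category, is_abnormal   # both leave through the output layer

print(judgment_layer("attack"))     # ('attack', True)  -> trigger interception
print(judgment_layer("queueing"))   # ('queueing', False) -> allow entry
```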
  • In one embodiment, the method further includes:
  • Step 101': When it is detected that the user has entered the recognition area, the camera is controlled to acquire multiple action images of the user to be recognized, and at the same time the recording device is started to record the voice to be recognized for a predetermined time.
  • Accordingly, step 102 specifically includes:
  • Step 1021: Input the multiple action images to be recognized into the encoding processor, use the self-attention mechanism layer in the encoding processor to visually analyze each action image to be recognized, and extract the visualized features of each image, obtaining multiple visualized features corresponding to the multiple action images.
  • A self-attention mechanism layer is added to the encoding processor in advance. It visually analyzes each action image to be recognized, removes interfering environmental factors, and extracts the features of the human body's contour or posture, that is, the visualized features, so that multiple visualized features are obtained for the multiple action images to be recognized.
  • Step 1022: Input the multiple visualized features into the superposition layer of the encoding processor for superposition processing, obtaining multiple superposition results.
  • Step 1023: Input the multiple superposition results into the residual layer for residual processing to strengthen them.
  • Each superposition result is input into the residual layer and compared with the estimated value there to determine its reliability, and the result is strengthened. This prevents the gradient of the image features from vanishing during feature extraction.
  • Step 1024: Splice the multiple strengthened superposition results and then apply linear processing to obtain the feature matrix to be recognized.
  • The strengthened superposition results are spliced linearly according to dimension D, yielding the feature matrix of dimension D to be recognized.
  • In this way, the multiple action images to be recognized are processed into a single feature matrix of dimension D to be recognized.
  • The features in the feature matrix to be recognized are made more prominent, so that they can be identified quickly and accurately.
  • In step 1032, feature extraction is performed on the sample images in the manner described above.
  • In step 1063, feature extraction on the training images is likewise performed in the manner described above.
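A compact PyTorch reading of steps 1021-1024 follows: one self-attention pass per image, a residual superposition with normalization, then splicing and a linear map back to dimension D. Token counts, head counts, and the pooling step are assumptions, not details from the patent.

```python
# Sketch of the encoding processor of steps 1021-1024 (all sizes assumed).
import torch
import torch.nn as nn

D, NUM_IMAGES, TOKENS = 128, 4, 49   # feature dim, images per user, patches per image (assumed)

class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)  # self-attention layer
        self.norm = nn.LayerNorm(D)
        self.linear = nn.Linear(NUM_IMAGES * D, NUM_IMAGES * D)              # final linear processing

    def forward(self, images):
        # images: (NUM_IMAGES, TOKENS, D) -- one token sequence per action image
        out = []
        for x in images:                  # step 1021: visual analysis of each image
            x = x.unsqueeze(0)
            a, _ = self.attn(x, x, x)     # extract the visualized features
            h = self.norm(x + a)          # steps 1022-1023: superposition + residual strengthening
            out.append(h.mean(dim=1))     # one D-dim summary per image
        spliced = torch.cat(out, dim=-1)  # step 1024: splice the strengthened results
        return self.linear(spliced).view(NUM_IMAGES, D)   # feature matrix to be recognized

enc = ImageEncoder()
print(enc(torch.randn(NUM_IMAGES, TOKENS, D)).shape)      # torch.Size([4, 128])
```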
  • In one embodiment, step 104 specifically includes:
  • Step 1041: Use an automatic speech recognition algorithm to perform text feature extraction on the voice to be recognized.
  • Step 1042: Use the self-attention mechanism to analyze the extracted text features and extract word feature vectors.
  • The self-attention mechanism is used here as well, so that the resulting image feature vector and word feature vector are relatively similar, which facilitates the later fusion of image and voice features.
  • Step 1043: Perform a linear transformation on the word feature vector to obtain the voice feature vector to be recognized.
  • The word feature vector is linearly transformed according to dimension D, yielding the voice feature vector of dimension D to be recognized.
  • In step 1065, feature extraction on the training voices is likewise performed in the manner described above.
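Under the same caveats, the speech branch of steps 1041-1043 could be sketched like this; the ASR call is replaced by a stub, and the embedding, attention, and projection sizes are assumed.

```python
# Sketch of the speech branch of steps 1041-1043 (ASR stubbed, sizes assumed).
import torch
import torch.nn as nn

D = 128   # assumed feature dimension

def asr_stub(audio):
    """Stand-in for an automatic speech recognition system returning transcript tokens."""
    return ["open", "the", "vault"]

class SpeechBranch(nn.Module):
    def __init__(self, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, D)                             # word features
        self.attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)  # self-attention
        self.proj = nn.Linear(D, D)                                          # linear transform

    def forward(self, token_ids):
        x = self.embed(token_ids).unsqueeze(0)       # (1, words, D)
        a, _ = self.attn(x, x, x)                    # step 1042: word feature vectors
        return self.proj(a.mean(dim=1)).squeeze(0)   # step 1043: D-dim voice feature vector

tokens = torch.tensor([hash(w) % 1000 for w in asr_stub(None)])
print(SpeechBranch()(tokens).shape)   # torch.Size([128])
```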
  • In one embodiment, step 105 specifically includes:
  • Step 1051: Use the additive attention mechanism to cross-add the image feature vector to be recognized and the voice feature vector to be recognized, obtaining the added feature vector.
  • That is, additive attention is used to cross-add the image feature vector to be recognized and the voice feature vector to be recognized.
  • Step 1052: Perform a scaled dot-product operation on the added feature vector to obtain the fusion feature vector to be recognized.
  • The scaled dot-product is used here; the dot product (also called the scalar product) is a binary operation that takes two vectors over the real numbers R and returns a real-valued scalar.
  • the specific formula is as follows:
  • MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
  • Finally, the fusion feature MultiHead(Q, K, V) obtained above is normalized.
  • The effect is to reduce the complexity of the subsequent abnormal behavior classification.
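Since the text names the operations (additive attention, then a scaled dot-product, then normalization) but not their exact arithmetic, the following is one plausible reading rather than the patent's formula.

```python
# Sketch of steps 1051-1052: cross-addition, scaled dot-product, normalization (assumed arithmetic).
import math
import torch
import torch.nn.functional as F

D = 128
img_vec, voice_vec = torch.randn(D), torch.randn(D)

# Step 1051: additive attention -- learnable maps omitted, plain cross-addition here
added = torch.tanh(img_vec + voice_vec)

# Step 1052: scaled dot-product over the added features (Q = K = V = added)
q = k = v = added.view(1, 1, D)
scores = (q @ k.transpose(-2, -1)) / math.sqrt(D)    # scaled dot product
fused = (F.softmax(scores, dim=-1) @ v).flatten()

# Final normalization, said to reduce the complexity of later classification
fused = F.layer_norm(fused, (D,))
print(fused.shape)   # torch.Size([128])
```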
  • In this embodiment, the human body feature extraction model obtained through learning and training extracts features from the user's image to obtain the image feature vector to be recognized, features are then extracted from the user's voice to obtain the voice feature vector to be recognized, the two vectors are cross-fused into the fusion feature vector to be recognized, and the abnormal behavior recognition model obtained by training a convolutional neural network processes that vector to determine whether the user's actions constitute abnormal behavior; if so, the user is deemed a dangerous person, and the corresponding interception function is started to intercept the user and prevent harm to the person and property of others.
  • Determining the action category from the user's image and voice together, and judging whether that category is abnormal so that corresponding measures can be taken, identifies users with abnormal behavior more quickly and accurately, raises the enterprise's safety factor, and effectively safeguards the enterprise's users.
  • In a specific embodiment, the method for identifying abnormal behavior based on voice and image features includes the following steps:
  • Each image in each image group first passes through a six-layer encoding structure (encoder layers) that completes the strong feature extraction of the contour and posture of the human body in the image, yielding multiple sets of feature matrices (each feature matrix corresponds to one image, and the matrix dimension is D).
  • Within each encoder layer, the data passes through the multi-head self-attention layer and is then superimposed with the residual layer and the normalization layer.
  • The self-attention mechanism extracts the key features in the picture (including the outline and posture of the human body) better, and the residual layer prevents the gradient of the image features from vanishing during feature extraction.
  • Random inactivation (dropout) is a method for optimizing artificial neural networks with a deep structure.
  • During training, part of the weights or outputs of the hidden layers is randomly reset to zero, reducing the interdependence between nodes, regularizing the neural network, and lowering its structural risk.
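As a brief demonstration of that behavior (not code from the patent), PyTorch's built-in dropout zeroes a random subset of activations during training and rescales the survivors:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # randomly zeroes hidden outputs while the module is in training mode
x = torch.ones(1, 8)
print(drop(x))             # about half the entries become 0; the rest are scaled by 1/(1-p) = 2.0
```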
  • Input the fusion matrix obtained in step 3 into a five-layer spatio-temporal graph convolutional network (STGCN) for learning and training.
  • The first layer of the STGCN receives the fusion matrix; the second layer analyzes the spatial features of the multiple orientation images in the fusion matrix; the third layer analyzes the temporal features across the preceding and following frames of the fusion matrix to obtain a one-dimensional feature vector; the fourth layer performs full connection on the one-dimensional feature vector; and the fifth, softmax layer classifies the human behavior from the output of the fourth layer and outputs the classification result.
  • For each output, the corresponding loss function is obtained.
  • The spatio-temporal graph convolutional network is adjusted according to the loss function until learning and training on all the fusion matrices obtained in step 3 are completed, yielding the corresponding spatio-temporal graph convolutional network model.
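The adjust-until-exhausted loop described here is an ordinary supervised training loop; the sketch below shows its shape with an assumed optimizer, loss, and a stand-in model, none of which are specified by the patent.

```python
# Sketch of the training loop: one loss per fusion matrix, parameters adjusted each time,
# repeated until all fusion matrices have been used (optimizer, loss, and model assumed).
import torch
import torch.nn as nn

D, T, NUM_CLASSES = 128, 4, 10
model = nn.Sequential(nn.Flatten(), nn.Linear(T * D, D), nn.ReLU(),
                      nn.Linear(D, NUM_CLASSES))        # stand-in for the five-layer STGCN
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

fusion_matrices = torch.randn(32, T, D)                 # placeholder training data
labels = torch.randint(0, NUM_CLASSES, (32,))

for x, y in zip(fusion_matrices, labels):
    opt.zero_grad()
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    loss.backward()                                     # adjust the network per loss function
    opt.step()
```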
  • The traditional self-attention (transformer) mechanism is used to extract features of the text, obtaining the corresponding word vector features. (The strong feature extraction of the image also uses the transformer mechanism, so the extracted information features are relatively similar, which facilitates later fusion.)
  • After the one-dimensional human behavior feature vector is fused with the word vector features, a DNN is used for learning and training.
  • MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
  • The second layer uses the scaled dot-product method; the dot product (also called the scalar product) is a binary operation that takes two vectors over the real numbers R and returns a real-valued scalar.
  • Finally, the fusion feature MultiHead(Q, K, V) obtained above is normalized, which reduces the complexity of the subsequent abnormal behavior classification.
  • The human body feature extraction model and the abnormal behavior recognition model obtained above are loaded into the robot system, and the robot completes the recognition and detection of the user's behavior.
  • The Alde robot uses the camera to obtain a set of the user's action images (for example, four) and records a segment of the user's voice.
  • The fusion matrix obtained above is input into the human body feature extraction model for processing.
  • The first layer receives the fusion matrix, the second layer analyzes the spatial features of the multiple orientation images in the fusion matrix, and the third layer analyzes the temporal features across the preceding and following frames to obtain the corresponding image feature vector to be recognized.
  • The acquired user's voice is processed according to steps 1-3 in step 2 to obtain the corresponding voice feature vector to be recognized.
  • The image feature vector to be recognized and the voice feature vector to be recognized are fused according to the process of step 1 in step 3 to obtain the fusion feature vector to be recognized.
  • If the behavior is judged abnormal, one or more Alde robots are controlled to intercept the user and stop the user from entering the bank's business processing area, preventing harm or loss to other users, public facilities, bank staff, or property.
  • At the same time, the alarm device is activated to remind the staff to intercept the abnormal user.
  • Otherwise, the user is allowed to enter the business processing area to handle their business.
  • Further, as shown in FIG. 6, an embodiment of the present application provides an abnormal behavior recognition device based on voice and image features.
  • The device includes, connected in sequence: an acquisition module 61, an image feature extraction module 62, a feature processing module 63, a voice feature extraction module 64, a fusion feature module 65, and an abnormal behavior recognition module 66.
  • the acquisition module 61 is configured to, after detecting that the user has entered the recognition area, control the camera to acquire the user's action image to be recognized and at the same time start the recording device to record the voice to be recognized for a predetermined time;
  • the image feature extraction module 62 is configured to perform feature extraction on the action image to be recognized to obtain a feature matrix to be recognized;
  • the feature processing module 63 is configured to process the feature matrix to be recognized with the human body feature extraction model to obtain the corresponding image feature vector to be recognized;
  • the voice feature extraction module 64 is configured to perform text feature extraction on the voice to be recognized to obtain the voice feature vector to be recognized;
  • the fusion feature module 65 is configured to cross-fuse the image feature vector to be recognized and the voice feature vector to be recognized to obtain the fusion feature vector to be recognized;
  • the abnormal behavior recognition module 66 is configured to input the fusion feature vector to be recognized into the abnormal behavior recognition model for processing, and output the corresponding human action category and whether that category constitutes abnormal behavior.
  • the acquiring module 61 is further configured to acquire a plurality of sample images representing various human actions, and label each sample image with a corresponding human body action label;
  • the image feature extraction module 62 is further configured to perform feature extraction on each of the multiple sample images to obtain multiple sample feature matrices;
  • the device also includes:
  • a construction module, configured to construct a five-layer spatio-temporal convolutional network, input multiple sample feature matrices into the first three layers of the network for processing, and transfer the resulting one-dimensional feature vectors to the last two layers for recognition processing, outputting the sample human action category corresponding to each sample image;
  • the feature extraction training module is used to compare the sample human action category with the corresponding human action label to determine the sample loss function, and adjust the parameters of the spatiotemporal convolutional network according to the sample loss function to obtain the spatiotemporal convolutional network model;
  • the deletion module is used to delete the last two layers of the spatio-temporal convolutional network model to obtain a human body feature extraction model.
  • In one embodiment, the construction module specifically includes:
  • a construction unit, in which the five-layer structure of the spatio-temporal convolutional network to be constructed is: a first, receiving layer; a second, spatial feature analysis layer; a third, temporal feature analysis layer; a fourth, fully connected layer; and a fifth, classification layer;
  • a transmitting unit, used by the first layer to transmit the received sample feature matrix to the second layer;
  • a spatial feature processing unit, used by the second layer to extract the spatial features of the sample feature matrix and send the extracted spatial features, together with the sample feature matrix, to the third layer;
  • a temporal feature processing unit, used by the third layer to extract the temporal features in the sample feature matrix, combine the temporal and spatial features into a one-dimensional feature vector, and send it to the fourth layer;
  • a fully connected processing unit, used by the fourth layer to perform fully connected processing on the one-dimensional feature vector and send the processed vector to the fifth layer;
  • an analysis unit, used by the fifth layer to analyze the processed one-dimensional feature vector and output it after determining the corresponding sample human action category.
  • the acquiring module 61 is further configured to separately acquire the action images of each person as training images for M individuals, and simultaneously record training voices for each person for a predetermined time to obtain M training images and M training voices;
  • the device also includes:
  • a labeling module, configured to label each training image and each training voice with the corresponding training human action label;
  • the image feature extraction module 62 is also used to perform feature extraction on each training image to obtain M training feature matrices;
  • the feature processing module 63 is further configured to sequentially input M training feature matrices into the human body feature extraction model for processing, and output M training image feature vectors;
  • the voice feature extraction module 64 is also used to perform text feature extraction on M training voices to obtain M training voice feature vectors;
  • the device also includes:
  • an abnormal behavior training module, configured to: cross-fuse the training image feature vectors and training voice feature vectors belonging to the same human action label to obtain training fusion feature vectors, the M training image feature vectors and M training voice feature vectors being correspondingly fused into M training fusion feature vectors; and input the M training fusion feature vectors into the convolutional neural network in sequence for training, comparing each output training human action category with the corresponding training human action label to determine the corresponding training loss function;
  • In one embodiment, when the acquisition module 61 detects that the user has entered the recognition area, it controls the camera to acquire multiple action images of the user to be recognized;
  • the image feature extraction module 62 is specifically configured to: input the multiple action images to be recognized into the encoding processor, use the self-attention mechanism layer in the encoding processor to visually analyze each action image to be recognized and extract its visualized features, the multiple action images corresponding to multiple visualized features; input the multiple visualized features into the superposition layer of the encoding processor for superposition processing to obtain multiple superposition results; input the multiple superposition results into the residual layer for residual processing to strengthen them; and splice the strengthened superposition results and apply linear processing to obtain the feature matrix to be recognized.
  • the voice feature extraction module 64 is specifically configured to: use an automatic speech recognition algorithm to extract text features from the voice to be recognized; use the self-attention mechanism to analyze the extracted text features and extract word feature vectors; and perform a linear transformation on the word feature vectors to obtain the voice feature vector to be recognized.
  • the fusion feature module 65 is specifically configured to: use the additive attention mechanism to cross-add the image feature vector to be recognized and the voice feature vector to be recognized to obtain the added feature vector; and perform a scaled dot-product operation on the added feature vector to obtain the fusion feature vector to be recognized.
  • On the basis of the above embodiments, an embodiment of the present application also provides a computer device, as shown in FIG. 7, including a memory 72 and a processor 71.
  • The memory 72 and the processor 71 are both arranged on the bus 73, and the memory 72 stores a computer program.
  • When the processor 71 executes the computer program, the abnormal behavior recognition method based on voice and image features shown in FIG. 1 is implemented.
  • The technical solution of this application can be embodied in the form of a software product, which can be stored in a non-volatile memory (such as a CD-ROM, USB flash drive, or removable hard disk) and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in each implementation scenario of this application.
  • Optionally, the device can also be connected to a user interface, a network interface, a camera, radio frequency (RF) circuits, sensors, audio circuits, a Wi-Fi module, and so on.
  • The user interface may include a display screen and input units such as a keyboard, and optionally may also include a USB interface, a card reader interface, and the like.
  • The network interface may optionally include a standard wired interface, a wireless interface (such as a Bluetooth or Wi-Fi interface), and so on.
  • It can be understood that this structure does not limit the physical computer device, which may include more or fewer components, combine certain components, or arrange the components differently.
  • On the basis of the above, an embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the abnormal behavior recognition method based on voice and image features shown in FIG. 1 above is realized.
  • the computer-readable storage medium may be non-volatile or volatile.
  • The storage medium may also include an operating system and a network communication module.
  • The operating system is a program that manages the hardware and software resources of the computer device and supports the operation of the information processing program and other software and/or programs.
  • The network communication module is used to realize communication between the components in the storage medium, as well as communication with other hardware and software in the computer device.
  • In these embodiments, the human body feature extraction model obtained through learning and training extracts features from the user's image to obtain the image feature vector to be recognized, features are then extracted from the user's voice to obtain the voice feature vector to be recognized, the two vectors are cross-fused into the fusion feature vector to be recognized, and the abnormal behavior recognition model obtained by training a convolutional neural network processes that vector to determine whether the user's actions constitute abnormal behavior; if so, the user is deemed a dangerous person, and the corresponding interception function is started to intercept the user and prevent harm to the person and property of others.
  • Determining the action category from the user's image and voice together, and judging whether that category is abnormal so that corresponding measures can be taken, identifies users with abnormal behavior more quickly and accurately, raises the enterprise's safety factor, and effectively safeguards the enterprise's users.

Abstract

The present application relates to the field of artificial intelligence, and disclosed thereby are a method, apparatus and device for recognizing abnormal behavior on the basis of voice and image features. The method comprises: performing feature extraction on an image of a user by utilizing a human body feature extraction model obtained after learning and training, so as to obtain an image feature vector to be recognized; performing feature extraction on the voice of the user to obtain a voice feature vector to be recognized; performing cross fusion on the image feature vector to be recognized and the voice feature vector to be recognized to obtain a fusion feature vector to be recognized; and processing the fusion feature vector to be recognized by using an abnormal behavior recognition model obtained by learning and training a convolutional neural network, so as to determine whether an action of the user is an abnormal behavior, and if so, deeming the user a dangerous person and starting a corresponding interception function to intercept the user. As such, users with abnormal behaviors can be identified more quickly and accurately, effectively improving the safety coefficient of an enterprise and providing an effective guarantee for the safety of the users of the enterprise.

Description

Method, device and equipment for identifying abnormal behavior based on voice and image features
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 27, 2020, with application number 202010123166.8 and entitled "Method, device and equipment for identifying abnormal behavior based on voice and image features", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to a method, device and equipment for identifying abnormal behavior based on voice and image features.
Background
The security systems of the service industry bear on social stability and the safety of people's property, and have long been a focus of security development. For example, the existing security systems of bank branches can no longer reliably guarantee branch operations and the safety of personnel inside the branches.
Some security systems in the service industry rely on triggered alarms or video surveillance; such systems can only notify the relevant personnel to respond after a dangerous person has already entered.
The inventor realized that existing security systems recognize people's behavior only from the human posture in videos or pictures, and their accuracy in recognizing dangerous abnormal behavior from voice and image features is low. As a result, safe persons may be mistaken for dangerous ones, or dangerous persons may be let through, endangering public safety.
Summary of the invention
In view of this, the present application provides a method, device, and equipment for identifying abnormal behavior based on voice and image features. The main purpose is to solve the technical problem of the low accuracy of current abnormal behavior recognition based on voice and image features.
According to a first aspect of the present application, a method for identifying abnormal behavior based on voice and image features is provided, the steps of which include:
after detecting that the user has entered the recognition area, controlling the camera to acquire the user's action image to be recognized, and at the same time starting the recording device to record the voice to be recognized for a predetermined time;
performing feature extraction on the action image to be recognized to obtain a feature matrix to be recognized;
processing the feature matrix to be recognized with a human body feature extraction model to obtain the corresponding image feature vector to be recognized;
performing text feature extraction on the voice to be recognized to obtain a voice feature vector to be recognized;
cross-fusing the image feature vector to be recognized and the voice feature vector to be recognized to obtain a fusion feature vector to be recognized;
inputting the fusion feature vector to be recognized into the abnormal behavior recognition model for processing, and outputting the corresponding human action category and whether that category constitutes abnormal behavior.
According to a second aspect of the present application, an abnormal behavior recognition device based on voice and image features is provided, the device including:
an acquisition module, configured to, after detecting that the user has entered the recognition area, control the camera to acquire the user's action image to be recognized and at the same time start the recording device to record the voice to be recognized for a predetermined time;
an image feature extraction module, configured to perform feature extraction on the action image to be recognized to obtain a feature matrix to be recognized;
a feature processing module, configured to process the feature matrix to be recognized with the human body feature extraction model to obtain the corresponding image feature vector to be recognized;
a voice feature extraction module, configured to perform text feature extraction on the voice to be recognized to obtain a voice feature vector to be recognized;
a fusion feature module, configured to cross-fuse the image feature vector to be recognized and the voice feature vector to be recognized to obtain a fusion feature vector to be recognized;
an abnormal behavior recognition module based on voice and image features, configured to input the fusion feature vector to be recognized into the abnormal behavior recognition model for processing, and output the corresponding human action category and whether that category constitutes abnormal behavior.
According to a third aspect of the present application, a computer device is provided, including a memory and a processor, the memory storing a computer program and the processor implementing the following steps when executing the computer program:
after detecting that the user has entered the recognition area, controlling the camera to acquire the user's action image to be recognized, and at the same time starting the recording device to record the voice to be recognized for a predetermined time;
performing feature extraction on the action image to be recognized to obtain a feature matrix to be recognized;
processing the feature matrix to be recognized with a human body feature extraction model to obtain the corresponding image feature vector to be recognized;
performing text feature extraction on the voice to be recognized to obtain a voice feature vector to be recognized;
cross-fusing the image feature vector to be recognized and the voice feature vector to be recognized to obtain a fusion feature vector to be recognized;
inputting the fusion feature vector to be recognized into the abnormal behavior recognition model for processing, and outputting the corresponding human action category and whether that category constitutes abnormal behavior.
According to a fourth aspect of the present application, a computer storage medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following steps:
after detecting that a user has entered the recognition area, controlling the camera to capture the user's motion image to be recognized, and simultaneously starting the recording component to record speech to be recognized for a predetermined time;
performing feature extraction on the motion image to be recognized to obtain a feature matrix to be recognized;
processing the feature matrix to be recognized with a human body feature extraction model to obtain a corresponding image feature vector to be recognized;
performing text feature extraction on the speech to be recognized to obtain a speech feature vector to be recognized;
cross-fusing the image feature vector to be recognized with the speech feature vector to be recognized to obtain a fused feature vector to be recognized;
inputting the fused feature vector to be recognized into an abnormal behavior recognition model for processing, and outputting the corresponding human action category and whether that human action category constitutes abnormal behavior.
Through the above technical solutions, the method, apparatus, and device for recognizing abnormal behavior based on voice and image features provided by this application use a human body feature extraction model, obtained through learning and training, to extract features from the user's image and obtain an image feature vector to be recognized; features are then extracted from the user's speech to obtain a speech feature vector to be recognized; the two vectors are cross-fused to obtain a fused feature vector to be recognized; and an abnormal behavior recognition model, obtained by training a convolutional neural network, processes the fused feature vector to judge whether the user's action constitutes abnormal behavior. If it does, the user is judged to be a dangerous person, and the corresponding interception function is activated to stop the user and prevent harm to the persons and property of others. Determining the action category of a user's action from both image and sound, and judging whether that category is abnormal so that corresponding measures can be taken according to the result, identifies users with abnormal behavior more quickly and accurately, raises the safety level of the enterprise, and provides an effective safeguard for the safety of its users.
Description of the drawings
FIG. 1 is a flowchart of an embodiment of the abnormal behavior recognition method based on voice and image features of this application;
FIG. 2 is a schematic diagram of the indoor layout of this application;
FIG. 3 is a training flowchart of the spatio-temporal convolutional network of this application;
FIG. 4 is a flowchart of speech feature extraction in this application;
FIG. 5 is a training flowchart of the abnormal behavior recognition model of this application;
FIG. 6 is a structural block diagram of an embodiment of the abnormal behavior recognition apparatus based on voice and image features;
FIG. 7 is a schematic structural diagram of the computer device of this application.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and so that its scope will be fully conveyed to those skilled in the art.
An embodiment of this application provides a method for recognizing abnormal behavior based on voice and image features that determines the action category of a user's action from the user's image and sound together, and judges whether that action category is abnormal so that corresponding measures can be taken according to the result; this makes abnormal behavior recognition faster and more accurate.
As shown in FIG. 1, an embodiment of this application provides an abnormal behavior recognition method based on voice and image features, comprising the following steps:
Step 101: when it is detected that a user has entered the recognition area, the camera is controlled to capture the user's motion image to be recognized, and the recording component is simultaneously started to record speech to be recognized for a predetermined time.
In this step, the executor of the method may be a robot or an enterprise security system, in which the program implementing the abnormal behavior recognition method based on voice and image features is stored. A recognition area is set for the robot or security system; the size and range of the area can be configured as needed. When the camera detects that a user has entered the recognition area, the camera is aimed at the user to capture the user's motion image while the user's voice is recorded.
Step 102: feature extraction is performed on the motion image to be recognized to obtain a feature matrix to be recognized.
In this step, the captured motion image to be recognized is digitized, the image of the environment around the user is removed, and the user's image is cropped out; then information features such as the user's facial expression, body movements, and hand-held objects are extracted from the user image and converted into a feature matrix of dimension D to be recognized.
Step 103: the feature matrix to be recognized is processed with the human body feature extraction model to obtain the corresponding image feature vector to be recognized.
In this step, the human body feature extraction model is obtained by training a spatio-temporal convolutional network on a large number of images representing various human behaviors. After the model has been trained, the corresponding code is written into the robot or security system. The input of the model has dimension D, which guarantees that the feature matrix to be recognized can enter the model directly without further conversion, so the image feature vector to be recognized obtained after processing also has dimension D.
Step 104: text feature extraction is performed on the speech to be recognized to obtain a speech feature vector to be recognized.
In this step, the text information in the speech to be recognized is extracted and converted into corresponding numbers, which are arranged into a speech feature vector of dimension D to be recognized.
Step 105: the image feature vector to be recognized and the speech feature vector to be recognized are cross-fused to obtain a fused feature vector to be recognized.
In this step, since the image feature vector to be recognized and the speech feature vector to be recognized have the same dimension, the fused feature vector obtained by directly cross-fusing the two also has dimension D.
Step 106: the fused feature vector to be recognized is input into the abnormal behavior recognition model for processing, and the corresponding human action category and whether that human action category constitutes abnormal behavior are output.
In this step, a large number of human behavior images and recorded speech samples are processed in the same way as steps 102-105 above to obtain fused feature vectors suitable for training a convolutional neural network; these fused feature vectors are input into the convolutional neural network, and after training an abnormal behavior recognition model capable of recognizing human behavior is obtained, whose corresponding code is written into the robot or security system. The robot or security system can then use the human body feature extraction model and the abnormal behavior recognition model together, as described above, to screen people entering the enterprise. If a person's behavior is detected as abnormal, the robot is directed to intercept that person, or the interception function of the security system is activated to intercept the person, while the alarm device is triggered to summon staff to handle the situation. This method of recognizing abnormal behavior based on voice and image features effectively protects the persons and property of the enterprise, its employees, and its users.
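For illustration only, the following Python sketch strings steps 101-106 together; every callable passed in stands for a component described above (image feature extraction, the human body feature extraction model, speech processing, fusion, and the abnormal behavior recognition model) and is a hypothetical placeholder rather than an interface defined by this application.

```python
def recognize_abnormal_behavior(action_images, speech_clip,
                                to_feature_matrix, body_feature_model,
                                to_speech_vector, cross_fuse, abnormal_model):
    """Hypothetical end-to-end sketch of steps 101-106."""
    feature_matrix = to_feature_matrix(action_images)   # step 102: dimension-D matrix
    image_vec = body_feature_model(feature_matrix)      # step 103: D-dim image vector
    speech_vec = to_speech_vector(speech_clip)          # step 104: D-dim speech vector
    fused_vec = cross_fuse(image_vec, speech_vec)       # step 105: cross fusion
    return abnormal_model(fused_vec)                    # step 106: (category, is_abnormal)
```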
Through the above technical solution, the human body feature extraction model obtained after learning and training extracts features from the user's image to obtain the image feature vector to be recognized; features are then extracted from the user's speech to obtain the speech feature vector to be recognized; the two vectors are cross-fused to obtain the fused feature vector to be recognized; and the abnormal behavior recognition model, obtained by training a convolutional neural network, processes the fused feature vector to judge whether the user's action constitutes abnormal behavior. If so, the user is judged to be a dangerous person and the corresponding interception function is activated to stop the user and prevent harm to the persons and property of others. Determining the action category from image and sound together and judging whether it is abnormal, so that corresponding measures can be taken according to the result, identifies users with abnormal behavior more quickly and accurately, raises the safety level of the enterprise, and provides an effective safeguard for the safety of its users.
In a specific embodiment, before step 103 the method further comprises:
Step 1031: a plurality of sample images representing various human actions are obtained, and each sample image is labeled with the corresponding human action label.
In this step, the various human actions include running, walking, clapping, holding a gun, holding a knife, punching, kicking, and so on. The human action corresponding to each sample image is labeled so that, during subsequent training, it can be judged whether the recognition result is correct. Each sample image comprises multiple human action pictures, preferably four.
Step 1032: feature extraction is performed on each of the sample images to obtain a plurality of sample feature matrices.
In this step, the sample images are digitized, the image of the environment around the person is removed, and the person's image is cropped out; then information features such as facial expression, body movements, and hand-held objects are extracted from the person's image and converted into a sample feature matrix of dimension D.
Step 1033: a five-layer spatio-temporal convolutional network is constructed; the sample feature matrices are input in turn into the first three layers of the network for processing, and the resulting one-dimensional feature vectors are passed to the last two layers for recognition, which output the sample human action category corresponding to each sample image.
Step 1034: the sample human action category is compared with the corresponding human action label to determine a sample loss function, and the parameters of the spatio-temporal convolutional network are adjusted according to the sample loss function to obtain the spatio-temporal convolutional network model.
In the above steps, the spatio-temporal convolutional network is trained with sample feature matrices of dimension D. The network processes each sample feature matrix and outputs the corresponding sample human action category, which is compared with the correct human action label; each comparison yields a sample loss function, according to which the network is adjusted once, after which the adjusted network is trained on the next sample feature matrix. This process is repeated until all sample feature matrices have been used for training, yielding a spatio-temporal convolutional network model that can recognize the type of human action from images.
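A minimal training-loop sketch of steps 1033-1034 follows, written in PyTorch purely for illustration; the model, samples, and labels are assumed to be provided as tensors, and the choice of loss and optimizer is an assumption, since the application does not specify them.

```python
import torch
import torch.nn as nn

def train_stgcn(model, samples, labels, epochs=1, lr=1e-3):
    """Illustrative sketch: per-sample loss drives the parameter adjustment."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            logits = model(x.unsqueeze(0))             # one sample feature matrix
            loss = criterion(logits, y.unsqueeze(0))   # compare with the action label
            optimizer.zero_grad()
            loss.backward()                            # adjust the network parameters
            optimizer.step()
    return model
```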
In addition, multiple sample images can be obtained for each kind of human action; a spatio-temporal convolutional network model trained repeatedly on sample images of the same kind can recognize that action category better.
Step 1035: the last two layers of the spatio-temporal convolutional network model are deleted to obtain the human body feature extraction model.
In this step, this application does not perform recognition from the image alone but combines image and sound; therefore the last two layers of the spatio-temporal convolutional network model are removed, so that the model can directly produce the human body feature representation.
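Assuming, for illustration, that the trained model is a torch.nn.Sequential of five layers, step 1035 reduces to dropping the last two children:

```python
import torch.nn as nn

def to_feature_extractor(stgcn_model: nn.Sequential) -> nn.Sequential:
    """Drop the fully connected and classification layers (step 1035),
    keeping the first three layers as the human body feature extractor."""
    return nn.Sequential(*list(stgcn_model.children())[:-2])
```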
In a specific embodiment, step 1033 specifically comprises:
Step 10331: the five layers of the constructed spatio-temporal convolutional network are, respectively, a first receiving layer, a second spatial feature analysis layer, a third temporal feature analysis layer, a fourth fully connected layer, and a fifth classification layer.
Step 10332: the first layer passes the received sample feature matrix to the second layer.
Step 10333: the second layer extracts the spatial features of the sample feature matrix and sends the extracted spatial features, together with the sample feature matrix, to the third layer.
Step 10334: the third layer extracts the temporal features from the sample feature matrix, combines the temporal and spatial features into a one-dimensional feature vector, and sends it to the fourth layer.
Step 10335: the fourth layer performs fully connected processing on the one-dimensional feature vector and sends the processed vector to the fifth layer.
Step 10336: the fifth layer analyzes the processed one-dimensional feature vector, determines the corresponding sample human action category, and outputs it.
In the above scheme, the first layer has D input ports, so a sample feature matrix of dimension D can be input directly in matrix form. Once input is complete, the second layer extracts the human actions in the sample feature matrix as spatial features; the third layer analyzes each human action obtained from the multiple pictures in chronological order and combines the temporal and spatial features into a one-dimensional feature vector; the fourth and fifth layers then derive the corresponding sample action category directly from that vector. In this way the one-dimensional feature vector of the sample feature matrix is determined in both the spatial and temporal dimensions, making the sample action category determined from it more accurate and faster to obtain.
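The following is a simplified, illustrative sketch of the five-layer structure; the layer types, hidden sizes, and the softmax placement are assumptions chosen to mirror steps 10331-10336, not the exact network of this application.

```python
import torch.nn as nn

class FiveLayerSTCN(nn.Module):
    """Illustrative five-layer structure; all dimensions are assumptions."""
    def __init__(self, d_in=256, hidden=128, num_classes=10):
        super().__init__()
        self.receive = nn.Identity()                                        # layer 1: receiving
        self.spatial = nn.Conv1d(4, hidden, kernel_size=1)                  # layer 2: spatial features
        self.temporal = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1) # layer 3: temporal features
        self.fc = nn.Linear(hidden * d_in, hidden)                          # layer 4: fully connected
        self.classify = nn.Linear(hidden, num_classes)                      # layer 5: classification

    def forward(self, x):                 # x: (batch, 4, D) sample feature matrix
        x = self.receive(x)
        x = self.spatial(x)
        x = self.temporal(x)
        x = x.flatten(1)                  # combine into a one-dimensional feature vector
        x = self.fc(x)
        return self.classify(x).softmax(dim=-1)
```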
In a specific embodiment, before step 106 the method specifically comprises:
Step 1061: for M people, an action image of each person is obtained as a training image, and a training speech of a predetermined length is recorded for each person, yielding M training images and M training speech samples.
In this step, the training image and training speech obtained for each person are paired, so that each resulting training fused feature vector comes from the same person, which makes training on the fused feature vectors more effective.
Step 1062: each training image and each training speech sample is labeled with a corresponding training human action label.
In this step, the training image and training speech of the same person carry the same training human action label.
Step 1063: feature extraction is performed on each training image, yielding M training feature matrices.
In this step, each training image is digitized, the image of the environment around the body is removed, and the body's image is cropped out; then information features such as facial expression, body movements, and hand-held objects are extracted and converted into a training feature matrix of dimension D.
Step 1064: the M training feature matrices are input in turn into the human body feature extraction model for processing, and M training image feature vectors are output.
In this step, each training feature matrix is processed by the three layers of the human body feature extraction model, yielding the corresponding training image feature vector.
Step 1065: text feature extraction is performed on the M training speech samples, yielding M training speech feature vectors.
In this step, an Automatic Speech Recognition (ASR) system transcribes each training speech sample into text, from which features are extracted to obtain a training speech feature vector of dimension D.
Step 1066: the training image feature vector and training speech feature vector belonging to the same human action label are cross-fused to obtain a training fused feature vector; the M training image feature vectors and M training speech feature vectors are fused correspondingly into M training fused feature vectors.
In this step, fusing the training image feature vector and training speech feature vector of the same human action label guarantees that each training fused feature vector has a unique human action label.
Step 1067: the M training fused feature vectors are input in turn into the convolutional neural network for training, and the output training human action category is compared with the corresponding training human action label to determine the corresponding training loss function.
Step 1068: the convolutional neural network is adjusted according to the training loss function to obtain the convolutional neural network model.
In this step, each input of a training fused feature vector yields one training loss function; the parameters of the convolutional neural network are adjusted according to it before the next training fused feature vector is input, and this process is repeated until all training fused feature vectors have been used. The result is a convolutional neural network model that can recognize human actions from a fused feature vector combining a person's image and speech.
Step 1069: before the output layer of the convolutional neural network, a judgment layer is added that can judge, from the obtained training human action category, whether the behavior is abnormal, yielding the abnormal behavior recognition model.
In this step, the names of the human action categories that constitute abnormal behavior are added to the judgment layer. After the convolutional neural network model produces a human action category, that category is looked up among the registered abnormal action categories. If it is found, the category constitutes abnormal behavior, and the category together with the judgment result is output directly from the output layer. If it is not found, the category does not constitute abnormal behavior, and the category together with the judgment result is likewise output from the output layer.
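As a sketch, the judgment layer amounts to a membership test against the registered abnormal categories; the category names below are hypothetical examples, not the configured list of this application.

```python
# Hypothetical names of action categories deemed abnormal; the actual list
# would be configured when the judgment layer is added (step 1069).
ABNORMAL_CATEGORIES = {"punching", "kicking", "holding a gun", "holding a knife"}

def judgment_layer(action_category: str):
    """Look up the predicted category among the registered abnormal categories
    and output both the category and the judgment result."""
    return action_category, action_category in ABNORMAL_CATEGORIES
```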
In a specific embodiment, the method further comprises:
Step 101': when it is detected that a user has entered the recognition area, the camera is controlled to capture multiple motion images of the user to be recognized, and the recording component is simultaneously started to record speech to be recognized for a predetermined time.
Step 102 then specifically comprises:
Step 1021: the multiple motion images to be recognized are input into an encoding processor, and the self-attention mechanism layer in the encoding processor performs a visual analysis of each motion image to be recognized and extracts its visual features, so that the multiple motion images yield multiple corresponding sets of visual features.
In this step, a self-attention mechanism layer is added to the encoding processor in advance. Using it, each motion image to be recognized can be analyzed visually, interfering environmental factors can be removed, and the human body contour and posture features (i.e., the visual features) can be extracted, so that the multiple motion images to be recognized yield the corresponding multiple sets of visual features.
Step 1022: the multiple visual features are input into the superposition layer of the encoding processor for superposition processing, yielding multiple superposition results.
Step 1023: the multiple superposition results are input into the residual layer for residual processing, which strengthens them.
In this step, each superposition result is input into the residual layer and compared with the estimated value there to judge its reliability, and the superposition result is strengthened. This prevents the gradients of the features in the image from vanishing during feature extraction.
Step 1024: the strengthened superposition results are concatenated and then linearly transformed, yielding the feature matrix to be recognized.
In this step, the strengthened superposition results are linearly concatenated according to dimension D, yielding the feature matrix of dimension D to be recognized.
Through the above technical solution, the multiple motion images to be recognized can be processed into a feature matrix of dimension D to be recognized; this processing makes the features in the resulting matrix more prominent, enabling fast and accurate recognition.
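For illustration, one encoding layer of the kind described in steps 1021-1023 (self-attention followed by a residual connection and normalization) might look as follows in PyTorch; the dimensions and head count are assumptions.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Illustrative encoding layer: self-attention, residual, normalization."""
    def __init__(self, d_model=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (batch, seq, d_model)
        attended, _ = self.attn(x, x, x)      # visual self-attention
        return self.norm(x + attended)        # residual keeps gradients from vanishing
```

The per-image outputs would then be concatenated and passed through a linear layer, corresponding to the concatenation and linear processing of step 1024.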
In addition, in step 1032 feature extraction is performed on the sample images in the same manner as above. Likewise, in step 1063 feature extraction is performed on the training images in the same manner.
In a specific embodiment, step 104 specifically comprises:
Step 1041: text features are extracted from the speech to be recognized using an automatic speech recognition algorithm.
Step 1042: the extracted text features are analyzed with a self-attention mechanism to extract word feature vectors.
In this step, the self-attention mechanism is likewise used when extracting speech features, which makes the resulting image feature vectors and word feature vectors relatively similar and facilitates the later fusion of image and speech features.
Step 1043: the word feature vectors are linearly transformed to obtain the speech feature vector to be recognized.
In this step, the word feature vectors are linearly transformed according to dimension D, yielding the speech feature vector of dimension D to be recognized.
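A hedged sketch of steps 1041-1043 follows, assuming the ASR transcription and tokenization happen upstream and using illustrative vocabulary and dimension sizes.

```python
import torch
import torch.nn as nn

class SpeechFeatureExtractor(nn.Module):
    """Illustrative pipeline: text tokens -> self-attention -> linear map to D."""
    def __init__(self, vocab_size=10000, d_model=128, d_out=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.proj = nn.Linear(d_model, d_out)    # linear transform to dimension D

    def forward(self, token_ids):                # token_ids: (batch, seq) of ints
        x = self.embed(token_ids)
        x, _ = self.attn(x, x, x)                # word feature vectors
        return self.proj(x.mean(dim=1))          # pooled D-dimensional speech vector
```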
In addition, in step 1065 feature extraction is performed on the training speech in the same manner as above.
In a specific embodiment, step 105 specifically comprises:
Step 1051: the image feature vector to be recognized and the speech feature vector to be recognized are cross-added using an additive attention mechanism, yielding the added feature vector.
In this step, additive attention is used to cross-add the image feature vector to be recognized and the speech feature vector to be recognized.
The specific formulation is as follows, where Q is the word vector and K = V is the image matrix:
head_i = Attention(Q_i, K, V)
Step 1052: a dot product is applied to the added feature vector using the scalar product method, yielding the fused feature vector to be recognized.
The scaled dot-product method is used (the dot product, or scalar product, is a binary operation that takes two vectors over the real numbers R and returns a real-valued scalar); the specific formula is as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
The fusion feature MultiHead(Q, K, V) obtained above is normalized, which reduces the complexity of the subsequent abnormal behavior classification.
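As a rough illustration of this fusion, the sketch below applies a single scaled dot-product attention step with Q taken from the word vector and K = V from the image matrix, followed by normalization; it stands in for, and simplifies, the two-layer multi-head mechanism described above.

```python
import torch.nn.functional as F

def fuse_features(word_vec, image_mat):
    """word_vec: (batch, D) speech vector; image_mat: (batch, N, D) image features."""
    q = word_vec.unsqueeze(1)                           # (batch, 1, D) query
    d = q.size(-1)
    scores = q @ image_mat.transpose(-2, -1) / d ** 0.5 # scaled dot product
    fused = (F.softmax(scores, dim=-1) @ image_mat).squeeze(1)
    return F.normalize(fused, dim=-1)                   # normalization step
```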
Through the above technical solution of this application, the human body feature extraction model obtained after learning and training extracts features from the user's image to obtain the image feature vector to be recognized; features are then extracted from the user's speech to obtain the speech feature vector to be recognized; the two vectors are cross-fused to obtain the fused feature vector to be recognized; and the abnormal behavior recognition model, obtained by training a convolutional neural network, processes the fused feature vector to judge whether the user's action constitutes abnormal behavior. If so, the user is judged to be a dangerous person and the corresponding interception function is activated to stop the user and prevent harm to the persons and property of others. Determining the action category from image and sound together and judging whether it is abnormal, so that corresponding measures can be taken according to the result, identifies users with abnormal behavior more quickly and accurately, raises the safety level of the enterprise, and provides an effective safeguard for the safety of its users.
In another embodiment of this application, the abnormal behavior recognition method based on voice and image features comprises the following steps:
Multiple robots are deployed in the Alde deployment area shown in FIG. 2 to recognize and judge abnormal behavior.
The specific process of recognizing and judging abnormal behavior is as follows:
1. Obtain image samples and train the spatio-temporal graph convolutional network (STGCN) to obtain the corresponding human body feature extraction model
As shown in FIG. 3:
1. For each of N people, several consecutive images (for example, four) of the person performing various actions are collected into an image group, yielding N image groups, and the action type of each image group (for example, beating, holding a gun, walking, taking something) is labeled.
2. Each image in each image group first passes through a six-layer encoding structure (encoder layers) that performs strong feature extraction of the human body contour and posture in the image, yielding multiple sets of feature matrices (each set corresponds to one image; the matrix dimension is D).
Specifically:
In each layer, the data first passes through a multi-head self-attention layer (multihead-self-attention), on which a residual layer and a normalization layer are superimposed. The self-attention mechanism extracts the key features in the picture (including the human body contour and posture) better, and the residual layer prevents the gradients of the image features from vanishing during feature extraction.
3. The multiple sets of feature matrices are concatenated and then input into a linear layer (linear-layer) that completes the linear transformation of the multiple images, yielding a multi-image-to-one fusion matrix of size D×4 (dimension D).
4. An initial spatio-temporal convolutional network is constructed, with the dropout_rate in the network set below 0.3 by default to increase the diversity of the training samples.
Here, dropout is a method of optimizing artificial neural networks with deep structure: during learning, part of the weights or outputs of the hidden layers are randomly zeroed, reducing the interdependence between nodes, thereby regularizing the neural network and lowering its structural risk.
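A two-line illustration of the dropout behavior described here:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.25)   # rate kept below the 0.3 ceiling described above
x = torch.ones(8)
print(drop(x))  # in training mode, roughly a quarter of the entries are zeroed,
                # and the survivors are scaled by 1/(1 - p) to keep the expectation
```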
5. The fusion matrix obtained in step 3 is input into the five-layer spatio-temporal graph convolutional network (STGCN) for learning and training.
Here, the first STGCN layer receives the fusion matrix; the second layer analyzes the spatial features of the multiple orientation images in the fusion matrix; the third layer analyzes the temporal features of the preceding and following frames in the fusion matrix to obtain a one-dimensional feature vector; the fourth layer performs fully connected processing on the one-dimensional feature vector; and the fifth layer, a softmax layer, classifies the human behavior from the output of the fourth layer and outputs the classification result.
The output human behavior classification result is compared with the labels above to derive the corresponding loss function, according to which the spatio-temporal graph convolutional network is adjusted until all fusion matrices obtained in step 3 have been used for learning and training, yielding the corresponding spatio-temporal graph convolutional network model.
6. Since what is needed here is the one-dimensional feature vector characterizing human behavior produced by the intermediate levels of the spatio-temporal graph convolutional network model, the fourth and fifth layers of the model obtained above are deleted, yielding a human body feature extraction model that can produce the one-dimensional feature vector (of dimension D) of human behavior.
2. Obtain speech training samples and perform strong feature extraction
As shown in FIG. 4:
1. The collected speech is converted into text using an existing ASR (Automatic Speech Recognition) system.
2. A conventional transformer self-attention mechanism is used to extract features from the text, yielding the corresponding word vector features. (The transformer mechanism is used for the strong feature extraction of both text and images, so that the extracted information features are relatively similar and easier to fuse later.)
3. The word vector features are linearly transformed so that the output word vector features have matrix dimension D (equal to the dimension of the one-dimensional feature vector of human behavior).
3. After fusing the one-dimensional feature vector of human behavior with the word vector features, use a DNN for learning and training
As shown in FIG. 5:
1. A two-layer cross-attention mechanism fuses the one-dimensional feature vector with the word vector features, specifically:
The first layer uses additive attention, the additive attention mechanism (Q = word vector, K = V = image matrix):
head_i = Attention(Q_i, K, V)
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
The second layer uses the scaled dot-product method (the dot product, or scalar product, is a binary operation that takes two vectors over the real numbers R and returns a real-valued scalar) and normalizes the fusion feature MultiHead(Q, K, V) obtained above. This reduces the complexity of the subsequent abnormal behavior classification.
The behavior category of the fused feature is given a corresponding label, and whether that behavior category is abnormal is annotated.
2. A DNN network structure is constructed, and the above fused features are input into the DNN for training; the DNN outputs the behavior category of each fused feature and whether that category is abnormal. The output is compared with the corresponding label, the loss function is computed, and the DNN structure is adjusted according to it; this process is repeated until all fused features have been used for training, yielding an abnormal behavior recognition model that can classify a user's behavior and judge whether it is abnormal.
4. Application
The human body feature extraction model and the abnormal behavior recognition model obtained above are loaded into the robot's system, and the robot performs the recognition and detection of the user's behavior.
The specific process is as follows:
1. When a user enters the Alde deployment area, the Alde robot uses its camera to capture a group of the user's motion images (for example, four) and records a segment of the user's speech.
2. Using sub-steps 2-3 of part 1, the corresponding fusion matrix is obtained.
3. The resulting fusion matrix is input into the human body feature extraction model for processing: the first layer receives the fusion matrix, the second layer analyzes the spatial features of the multiple orientation images in the matrix, and the third layer analyzes the temporal features of the preceding and following frames, yielding the corresponding image feature vector to be recognized.
4. The captured speech of the user is processed according to sub-steps 1-3 of part 2, yielding the corresponding speech feature vector to be recognized.
5. The image feature vector to be recognized and the speech feature vector to be recognized are fused according to the procedure of sub-step 1 of part 3, yielding the fused feature vector to be recognized.
6. The fused feature to be recognized is input into the abnormal behavior recognition model for processing, which outputs the human behavior category corresponding to the fused feature and whether that category is abnormal.
7. If the behavior is abnormal, one or more Alde robots are directed to intercept the user and prevent the user from entering the bank's business processing area, avoiding harm or loss to other users, public facilities, bank staff, or property. At the same time, the alarm device is activated to remind the staff to intercept and handle the abnormal user.
If the behavior is not abnormal, the user is allowed to enter the business area to transact business.
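For illustration, the interception logic of step 7 reduces to the following sketch; the robot and alarm interfaces are hypothetical placeholders, not APIs defined by this application.

```python
def handle_user(category, is_abnormal, robots, alarm):
    """Dispatch on the abnormal behavior judgment (step 7)."""
    if is_abnormal:
        for robot in robots:
            robot.intercept()        # block entry to the business processing area
        alarm.trigger(category)      # notify staff of the abnormal user
        return "intercepted"
    return "allowed"                 # normal users proceed to transact business
```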
Further, as a specific implementation of the method of FIG. 1, an embodiment of this application provides an abnormal behavior recognition apparatus based on voice and image features. As shown in FIG. 6, the apparatus comprises, connected in sequence: an acquisition module 61, an image feature extraction module 62, a feature processing module 63, a speech feature extraction module 64, a feature fusion module 65, and an abnormal behavior recognition module 66 based on voice and image features.
The acquisition module 61 is configured to, after detecting that a user has entered the recognition area, control the camera to capture the user's motion image to be recognized and simultaneously start the recording component to record speech to be recognized for a predetermined time.
The image feature extraction module 62 is configured to perform feature extraction on the motion image to be recognized to obtain a feature matrix to be recognized.
The feature processing module 63 is configured to process the feature matrix to be recognized with the human body feature extraction model to obtain the corresponding image feature vector to be recognized.
The speech feature extraction module 64 is configured to perform text feature extraction on the speech to be recognized to obtain the speech feature vector to be recognized.
The feature fusion module 65 is configured to cross-fuse the image feature vector to be recognized with the speech feature vector to be recognized to obtain the fused feature vector to be recognized.
The abnormal behavior recognition module 66 based on voice and image features is configured to input the fused feature vector to be recognized into the abnormal behavior recognition model for processing and to output the corresponding human action category and whether that category constitutes abnormal behavior.
In a specific embodiment, the acquisition module 61 is further configured to obtain a plurality of sample images representing various human actions and to label each sample image with the corresponding human action label.
The image feature extraction module 62 is further configured to perform feature extraction on each of the sample images to obtain a plurality of sample feature matrices.
The apparatus further comprises:
a construction module, configured to construct a five-layer spatio-temporal convolutional network, input the sample feature matrices in turn into the first three layers of the network for processing, and pass the resulting one-dimensional feature vectors to the last two layers for recognition, outputting the sample human action category corresponding to each sample image;
a feature extraction training module, configured to compare the sample human action category with the corresponding human action label to determine a sample loss function, and to adjust the parameters of the spatio-temporal convolutional network according to that loss function to obtain the spatio-temporal convolutional network model;
a deletion module, configured to delete the last two layers of the spatio-temporal convolutional network model to obtain the human body feature extraction model.
In a specific embodiment, the construction module specifically comprises:
a construction unit, in which the five layers of the constructed spatio-temporal convolutional network are, respectively, a first receiving layer, a second spatial feature analysis layer, a third temporal feature analysis layer, a fourth fully connected layer, and a fifth classification layer;
a transmission unit, configured for the first layer to pass the received sample feature matrix to the second layer;
a spatial feature processing unit, configured for the second layer to extract the spatial features of the sample feature matrix and send the extracted spatial features, together with the sample feature matrix, to the third layer;
a temporal feature processing unit, configured for the third layer to extract the temporal features in the sample feature matrix, combine the temporal and spatial features into a one-dimensional feature vector, and send it to the fourth layer;
a fully connected processing unit, configured for the fourth layer to perform fully connected processing on the one-dimensional feature vector and send the processed vector to the fifth layer;
an analysis unit, configured for the fifth layer to analyze the processed one-dimensional feature vector, determine the corresponding sample human action category, and output it.
In a specific embodiment, the acquisition module 61 is further configured to obtain, for M people, an action image of each person as a training image, and to record a training speech of a predetermined length for each person, yielding M training images and M training speech samples.
The apparatus further comprises:
a labeling module, configured to label each training image and each training speech sample with a corresponding training human action label.
The image feature extraction module 62 is further configured to perform feature extraction on each training image, yielding M training feature matrices.
The feature processing module 63 is further configured to input the M training feature matrices in turn into the human body feature extraction model for processing, outputting M training image feature vectors.
The speech feature extraction module 64 is further configured to perform text feature extraction on the M training speech samples, yielding M training speech feature vectors.
The apparatus further comprises:
an abnormal behavior training module, configured to cross-fuse the training image feature vector and training speech feature vector belonging to the same human action label into a training fused feature vector, the M training image feature vectors and M training speech feature vectors being fused correspondingly into M training fused feature vectors; to input the M training fused feature vectors in turn into the convolutional neural network for training, comparing the output training human action category with the corresponding training human action label to determine the corresponding training loss function; to adjust the convolutional neural network according to the training loss function to obtain the convolutional neural network model; and, before the output layer of the convolutional neural network, to add a judgment layer that can judge from the obtained training human action category whether the behavior is abnormal, yielding the abnormal behavior recognition model.
In a specific embodiment, when the acquisition module 61 detects that a user has entered the recognition area, it controls the camera to capture multiple motion images of the user to be recognized.
The image feature extraction module 62 is specifically configured to:
input the multiple motion images to be recognized into the encoding processor; perform, with the self-attention mechanism layer in the encoding processor, a visual analysis of each motion image to be recognized and extract its visual features, the multiple motion images yielding multiple corresponding sets of visual features; input the multiple visual features into the superposition layer of the encoding processor for superposition processing, yielding multiple superposition results; input the multiple superposition results into the residual layer for residual processing to strengthen them; and, after concatenating the strengthened superposition results, perform a linear transformation to obtain the feature matrix to be recognized.
In a specific embodiment, the speech feature extraction module 64 is specifically configured to: extract text features from the speech to be recognized using an automatic speech recognition algorithm; analyze the extracted text features with a self-attention mechanism to extract word feature vectors; and linearly transform the word feature vectors to obtain the speech feature vector to be recognized.
In a specific embodiment, the feature fusion module 65 is specifically configured to: cross-add the image feature vector to be recognized and the speech feature vector to be recognized using an additive attention mechanism, yielding the added feature vector; and apply a dot product to the added feature vector using the scalar product method, yielding the fused feature vector to be recognized.
基于上述图1所示方法和图2-6所示装置的实施例,为了实现上述目的,本申请实施例还提供了一种计算机设备,如图7所示,包括存储器72和处理器71,其中存储器72和处理器71均设置在总线73上存储器72存储有计算机程序,处理器71执行计算机程序时实现图1所示的基于语音及图像特征的异常行为识别方法。Based on the above-mentioned method shown in FIG. 1 and the embodiment of the apparatus shown in FIGS. 2-6, in order to achieve the above-mentioned purpose, an embodiment of the present application also provides a computer device, as shown in FIG. 7, including a memory 72 and a processor 71, The memory 72 and the processor 71 are both arranged on the bus 73 and the memory 72 stores a computer program. When the processor 71 executes the computer program, the abnormal behavior recognition method based on voice and image features shown in FIG. 1 is implemented.
Based on this understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for causing a computer device (a personal computer, a server, a network device, or the like) to execute the methods described in the implementation scenarios of the present application.
Optionally, the device may also be connected to a user interface, a network interface, a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and so on. The user interface may include a display screen and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface and a wireless interface (such as a Bluetooth or Wi-Fi interface).
Those skilled in the art will understand that the structure of the computer device provided in this embodiment does not limit the physical device, which may include more or fewer components, combine certain components, or arrange the components differently.
Based on the embodiments of the method shown in FIG. 1 and the apparatus shown in FIG. 6, an embodiment of the present application correspondingly further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the method for recognizing abnormal behavior based on voice and image features shown in FIG. 1. The computer-readable storage medium may be non-volatile or volatile.
The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the computer device and supports the running of the information processing program and of other software and/or programs. The network communication module is used to implement communication among the components within the storage medium and with other hardware and software in the computer device.
From the description of the above embodiments, those skilled in the art will clearly understand that the present application may be implemented by software plus a necessary general-purpose hardware platform, or by hardware.
By applying the technical solution of the present application, the human body feature extraction model obtained through training is used to extract features from the user's image to obtain a to-be-recognized image feature vector; features are then extracted from the user's voice to obtain a to-be-recognized voice feature vector; the two vectors are cross-fused into a to-be-recognized fused feature vector; and the abnormal behavior recognition model, obtained by training a convolutional neural network, processes the fused feature vector to judge whether the user's action constitutes abnormal behavior. If it does, the user is deemed dangerous, and the corresponding interception function is activated to stop the user and prevent harm to other people and their property. Determining the action category from the user's image and voice together, and then judging whether that category is abnormal so that corresponding measures can be taken, identifies users with abnormal behavior more quickly and accurately, raises the enterprise's safety factor, and provides an effective guarantee for the safety of its users.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that can readily occur to those skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A method for recognizing abnormal behavior based on voice and image features, wherein the steps of the method comprise:
    when it is detected that a user has entered a recognition area, controlling a camera to capture a to-be-recognized action image of the user, and simultaneously activating a recording component to record a to-be-recognized voice for a predetermined time;
    performing feature extraction on the to-be-recognized action image to obtain a to-be-recognized feature matrix;
    processing the to-be-recognized feature matrix with a human body feature extraction model to obtain a corresponding to-be-recognized image feature vector;
    performing text feature extraction on the to-be-recognized voice to obtain a to-be-recognized voice feature vector;
    cross-fusing the to-be-recognized image feature vector and the to-be-recognized voice feature vector to obtain a to-be-recognized fused feature vector; and
    inputting the to-be-recognized fused feature vector into an abnormal behavior recognition model for processing, and outputting a corresponding human action category and whether the human action category constitutes abnormal behavior.
  2. The method according to claim 1, wherein, before the processing of the to-be-recognized feature matrix with the human body feature extraction model to obtain the corresponding to-be-recognized image feature vector, the method further comprises:
    acquiring a plurality of sample images representing various human actions, and annotating each sample image with a corresponding human action label;
    performing feature extraction on each of the plurality of sample images to obtain a plurality of sample feature matrices;
    constructing a five-layer spatiotemporal convolutional network, inputting the plurality of sample feature matrices in turn into the first three layers of the spatiotemporal convolutional network for processing, passing the resulting one-dimensional feature vectors to the last two layers for recognition processing, and outputting a sample human action category corresponding to each sample image;
    comparing the sample human action categories with the corresponding human action labels to determine a sample loss function, and adjusting the parameters of the spatiotemporal convolutional network according to the sample loss function to obtain a spatiotemporal convolutional network model; and
    deleting the last two layers of the spatiotemporal convolutional network model to obtain the human body feature extraction model.
  3. The method according to claim 2, wherein the constructing of the five-layer spatiotemporal convolutional network, the inputting of the plurality of sample feature matrices in turn into the first three layers for processing, the passing of the resulting one-dimensional feature vectors to the last two layers for recognition processing, and the outputting of the sample human action category corresponding to each sample image specifically comprise:
    the five layers of the constructed spatiotemporal convolutional network being, respectively, a first receiving layer, a second spatial feature analysis layer, a third temporal feature analysis layer, a fourth fully connected layer, and a fifth classification layer;
    the first layer passing each received sample feature matrix to the second layer;
    the second layer extracting the spatial features of the sample feature matrix and sending the extracted spatial features, together with the sample feature matrix, to the third layer;
    the third layer extracting the temporal features from the sample feature matrix, combining the temporal features and the spatial features into a one-dimensional feature vector, and sending it to the fourth layer;
    the fourth layer applying fully connected processing to the one-dimensional feature vector and sending the processed one-dimensional feature vector to the fifth layer; and
    the fifth layer analyzing the processed one-dimensional feature vector, determining the corresponding sample human action category, and outputting it.
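The claim names the five layers but not their concrete operators. Outside the claims, and purely for orientation, one plausible realization uses a linear layer for spatial analysis and a GRU for temporal analysis as stand-ins; dropping the last two modules after training then yields the feature extractor of claim 2:

```python
import torch
import torch.nn as nn

class SpatioTemporalNet(nn.Module):
    """Illustrative five-layer network: receive, spatial analysis, temporal
    analysis, fully connected, classify. The layer operators are assumptions."""
    def __init__(self, dim=128, num_classes=10):
        super().__init__()
        self.spatial = nn.Linear(dim, dim)                   # layer 2
        self.temporal = nn.GRU(dim, dim, batch_first=True)   # layer 3
        self.fc = nn.Linear(2 * dim, dim)                    # layer 4
        self.cls = nn.Linear(dim, num_classes)               # layer 5

    def forward(self, x):                 # layer 1 receives x: (batch, time, dim)
        s = torch.relu(self.spatial(x))   # spatial features
        _, h = self.temporal(x)           # final hidden state as temporal feature
        one_d = torch.cat([s.mean(dim=1), h[-1]], dim=1)  # combined 1-D vector
        return self.cls(torch.relu(self.fc(one_d)))

# After training, deleting the last two layers (fc, cls) would leave the
# first three layers as the human body feature extraction model of claim 2.
```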
  4. The method according to any one of claims 1 to 3, wherein, before the inputting of the to-be-recognized fused feature vector into the abnormal behavior recognition model for processing and the outputting of the corresponding human action category and whether the human action category constitutes abnormal behavior, the method further comprises:
    acquiring, for each of M persons, an action image as a training image while recording a training voice of a predetermined duration for each person, to obtain M training images and M training voices;
    annotating each training image and each training voice with a corresponding training human action label;
    performing feature extraction on each training image to obtain M training feature matrices;
    inputting the M training feature matrices in turn into the human body feature extraction model for processing, and outputting M training image feature vectors;
    performing text feature extraction on the M training voices to obtain M training voice feature vectors;
    cross-fusing the training image feature vectors and training voice feature vectors belonging to the same human action label to obtain training fused feature vectors, the M training image feature vectors and the M training voice feature vectors being correspondingly fused into M training fused feature vectors;
    inputting the M training fused feature vectors in turn into a convolutional neural network for training processing, and comparing each output training human action category with the corresponding training human action label to determine a corresponding training loss function;
    adjusting the convolutional neural network according to the training loss function to obtain a convolutional neural network model; and
    adding, before the output layer of the convolutional neural network, a judgment layer capable of determining, from the obtained training human action category, whether the category constitutes abnormal behavior, to obtain the abnormal behavior recognition model.
  5. The method according to claim 1, wherein, when it is detected that the user has entered the recognition area, the camera is controlled to capture a plurality of to-be-recognized action images of the user;
    and the performing of feature extraction on the to-be-recognized action images to obtain the to-be-recognized feature matrix specifically comprises:
    inputting the plurality of to-be-recognized action images into an encoding processor, performing visual analysis on each to-be-recognized action image with the self-attention mechanism layer of the encoding processor, and extracting a visual feature of each to-be-recognized action image, the plurality of to-be-recognized action images thus yielding a corresponding plurality of visual features;
    inputting the plurality of visual features into a superposition layer of the encoding processor for superposition processing to obtain a plurality of superposition results;
    inputting the plurality of superposition results into a residual layer for residual processing to reinforce the plurality of superposition results; and
    splicing the reinforced superposition results together and applying linear processing to obtain the to-be-recognized feature matrix.
  6. The method according to claim 1, wherein the performing of text feature extraction on the to-be-recognized voice to obtain the to-be-recognized voice feature vector specifically comprises:
    extracting text features from the to-be-recognized voice with an automatic speech recognition algorithm;
    performing text feature analysis on the extracted text features with a self-attention mechanism to extract word feature vectors; and
    linearly transforming the word feature vectors to obtain the to-be-recognized voice feature vector.
  7. The method according to claim 1, wherein the cross-fusing of the to-be-recognized image feature vector and the to-be-recognized voice feature vector to obtain the to-be-recognized fused feature vector specifically comprises:
    cross-adding the to-be-recognized image feature vector and the to-be-recognized voice feature vector with an additive attention mechanism to obtain a summed feature vector; and
    performing a dot product operation on the summed feature vector by the scalar product method to obtain the to-be-recognized fused feature vector.
  8. An apparatus for recognizing abnormal behavior based on voice and image features, wherein the apparatus comprises:
    an acquisition module, configured to control a camera to capture a to-be-recognized action image of a user when it is detected that the user has entered a recognition area, and simultaneously to activate a recording component to record a to-be-recognized voice for a predetermined time;
    an image feature extraction module, configured to perform feature extraction on the to-be-recognized action image to obtain a to-be-recognized feature matrix;
    a feature processing module, configured to process the to-be-recognized feature matrix with a human body feature extraction model to obtain a corresponding to-be-recognized image feature vector;
    a voice feature extraction module, configured to perform text feature extraction on the to-be-recognized voice to obtain a to-be-recognized voice feature vector;
    a feature fusion module, configured to cross-fuse the to-be-recognized image feature vector and the to-be-recognized voice feature vector to obtain a to-be-recognized fused feature vector; and
    an abnormal behavior recognition module based on voice and image features, configured to input the to-be-recognized fused feature vector into an abnormal behavior recognition model for processing, and to output a corresponding human action category and whether the human action category constitutes abnormal behavior.
  9. A computer device, wherein the computer device comprises a memory and a processor connected to each other, the memory being configured to store a computer program, the computer program being configured to be executed by the processor and to carry out a method for recognizing abnormal behavior based on voice and image features:
    wherein the method comprises:
    when it is detected that a user has entered a recognition area, controlling a camera to capture a to-be-recognized action image of the user, and simultaneously activating a recording component to record a to-be-recognized voice for a predetermined time;
    performing feature extraction on the to-be-recognized action image to obtain a to-be-recognized feature matrix;
    processing the to-be-recognized feature matrix with a human body feature extraction model to obtain a corresponding to-be-recognized image feature vector;
    performing text feature extraction on the to-be-recognized voice to obtain a to-be-recognized voice feature vector;
    cross-fusing the to-be-recognized image feature vector and the to-be-recognized voice feature vector to obtain a to-be-recognized fused feature vector; and
    inputting the to-be-recognized fused feature vector into an abnormal behavior recognition model for processing, and outputting a corresponding human action category and whether the human action category constitutes abnormal behavior.
  10. The computer device according to claim 9, wherein, before the processing of the to-be-recognized feature matrix with the human body feature extraction model to obtain the corresponding to-be-recognized image feature vector, the method further comprises:
    acquiring a plurality of sample images representing various human actions, and annotating each sample image with a corresponding human action label;
    performing feature extraction on each of the plurality of sample images to obtain a plurality of sample feature matrices;
    constructing a five-layer spatiotemporal convolutional network, inputting the plurality of sample feature matrices in turn into the first three layers of the spatiotemporal convolutional network for processing, passing the resulting one-dimensional feature vectors to the last two layers for recognition processing, and outputting a sample human action category corresponding to each sample image;
    comparing the sample human action categories with the corresponding human action labels to determine a sample loss function, and adjusting the parameters of the spatiotemporal convolutional network according to the sample loss function to obtain a spatiotemporal convolutional network model; and
    deleting the last two layers of the spatiotemporal convolutional network model to obtain the human body feature extraction model.
  11. The computer device according to claim 10, wherein the constructing of the five-layer spatiotemporal convolutional network, the inputting of the plurality of sample feature matrices in turn into the first three layers for processing, the passing of the resulting one-dimensional feature vectors to the last two layers for recognition processing, and the outputting of the sample human action category corresponding to each sample image specifically comprise:
    the five layers of the constructed spatiotemporal convolutional network being, respectively, a first receiving layer, a second spatial feature analysis layer, a third temporal feature analysis layer, a fourth fully connected layer, and a fifth classification layer;
    the first layer passing each received sample feature matrix to the second layer;
    the second layer extracting the spatial features of the sample feature matrix and sending the extracted spatial features, together with the sample feature matrix, to the third layer;
    the third layer extracting the temporal features from the sample feature matrix, combining the temporal features and the spatial features into a one-dimensional feature vector, and sending it to the fourth layer;
    the fourth layer applying fully connected processing to the one-dimensional feature vector and sending the processed one-dimensional feature vector to the fifth layer; and
    the fifth layer analyzing the processed one-dimensional feature vector, determining the corresponding sample human action category, and outputting it.
  12. The computer device according to any one of claims 9 to 11, wherein, before the inputting of the to-be-recognized fused feature vector into the abnormal behavior recognition model for processing and the outputting of the corresponding human action category and whether the human action category constitutes abnormal behavior, the method further comprises:
    acquiring, for each of M persons, an action image as a training image while recording a training voice of a predetermined duration for each person, to obtain M training images and M training voices;
    annotating each training image and each training voice with a corresponding training human action label;
    performing feature extraction on each training image to obtain M training feature matrices;
    inputting the M training feature matrices in turn into the human body feature extraction model for processing, and outputting M training image feature vectors;
    performing text feature extraction on the M training voices to obtain M training voice feature vectors;
    cross-fusing the training image feature vectors and training voice feature vectors belonging to the same human action label to obtain training fused feature vectors, the M training image feature vectors and the M training voice feature vectors being correspondingly fused into M training fused feature vectors;
    inputting the M training fused feature vectors in turn into a convolutional neural network for training processing, and comparing each output training human action category with the corresponding training human action label to determine a corresponding training loss function;
    adjusting the convolutional neural network according to the training loss function to obtain a convolutional neural network model; and
    adding, before the output layer of the convolutional neural network, a judgment layer capable of determining, from the obtained training human action category, whether the category constitutes abnormal behavior, to obtain the abnormal behavior recognition model.
  13. The computer device according to claim 9, wherein, when it is detected that the user has entered the recognition area, the camera is controlled to capture a plurality of to-be-recognized action images of the user;
    and the performing of feature extraction on the to-be-recognized action images to obtain the to-be-recognized feature matrix specifically comprises:
    inputting the plurality of to-be-recognized action images into an encoding processor, performing visual analysis on each to-be-recognized action image with the self-attention mechanism layer of the encoding processor, and extracting a visual feature of each to-be-recognized action image, the plurality of to-be-recognized action images thus yielding a corresponding plurality of visual features;
    inputting the plurality of visual features into a superposition layer of the encoding processor for superposition processing to obtain a plurality of superposition results;
    inputting the plurality of superposition results into a residual layer for residual processing to reinforce the plurality of superposition results; and
    splicing the reinforced superposition results together and applying linear processing to obtain the to-be-recognized feature matrix.
  14. The computer device according to claim 9, wherein the performing of text feature extraction on the to-be-recognized voice to obtain the to-be-recognized voice feature vector specifically comprises:
    extracting text features from the to-be-recognized voice with an automatic speech recognition algorithm;
    performing text feature analysis on the extracted text features with a self-attention mechanism to extract word feature vectors; and
    linearly transforming the word feature vectors to obtain the to-be-recognized voice feature vector.
  15. The computer device according to claim 9, wherein the cross-fusing of the to-be-recognized image feature vector and the to-be-recognized voice feature vector to obtain the to-be-recognized fused feature vector specifically comprises:
    cross-adding the to-be-recognized image feature vector and the to-be-recognized voice feature vector with an additive attention mechanism to obtain a summed feature vector; and
    performing a dot product operation on the summed feature vector by the scalar product method to obtain the to-be-recognized fused feature vector.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, implements a method for recognizing abnormal behavior based on voice and image features, the method comprising the following steps:
    when it is detected that a user has entered a recognition area, controlling a camera to capture a to-be-recognized action image of the user, and simultaneously activating a recording component to record a to-be-recognized voice for a predetermined time;
    performing feature extraction on the to-be-recognized action image to obtain a to-be-recognized feature matrix;
    processing the to-be-recognized feature matrix with a human body feature extraction model to obtain a corresponding to-be-recognized image feature vector;
    performing text feature extraction on the to-be-recognized voice to obtain a to-be-recognized voice feature vector;
    cross-fusing the to-be-recognized image feature vector and the to-be-recognized voice feature vector to obtain a to-be-recognized fused feature vector; and
    inputting the to-be-recognized fused feature vector into an abnormal behavior recognition model for processing, and outputting a corresponding human action category and whether the human action category constitutes abnormal behavior.
  17. The computer-readable storage medium according to claim 16, wherein, before the processing of the to-be-recognized feature matrix with the human body feature extraction model to obtain the corresponding to-be-recognized image feature vector, the method further comprises:
    acquiring a plurality of sample images representing various human actions, and annotating each sample image with a corresponding human action label;
    performing feature extraction on each of the plurality of sample images to obtain a plurality of sample feature matrices;
    constructing a five-layer spatiotemporal convolutional network, inputting the plurality of sample feature matrices in turn into the first three layers of the spatiotemporal convolutional network for processing, passing the resulting one-dimensional feature vectors to the last two layers for recognition processing, and outputting a sample human action category corresponding to each sample image;
    comparing the sample human action categories with the corresponding human action labels to determine a sample loss function, and adjusting the parameters of the spatiotemporal convolutional network according to the sample loss function to obtain a spatiotemporal convolutional network model; and
    deleting the last two layers of the spatiotemporal convolutional network model to obtain the human body feature extraction model.
  18. The computer-readable storage medium according to claim 17, wherein the constructing of the five-layer spatiotemporal convolutional network, the inputting of the plurality of sample feature matrices in turn into the first three layers for processing, the passing of the resulting one-dimensional feature vectors to the last two layers for recognition processing, and the outputting of the sample human action category corresponding to each sample image specifically comprise:
    the five layers of the constructed spatiotemporal convolutional network being, respectively, a first receiving layer, a second spatial feature analysis layer, a third temporal feature analysis layer, a fourth fully connected layer, and a fifth classification layer;
    the first layer passing each received sample feature matrix to the second layer;
    the second layer extracting the spatial features of the sample feature matrix and sending the extracted spatial features, together with the sample feature matrix, to the third layer;
    the third layer extracting the temporal features from the sample feature matrix, combining the temporal features and the spatial features into a one-dimensional feature vector, and sending it to the fourth layer;
    the fourth layer applying fully connected processing to the one-dimensional feature vector and sending the processed one-dimensional feature vector to the fifth layer; and
    the fifth layer analyzing the processed one-dimensional feature vector, determining the corresponding sample human action category, and outputting it.
  19. The computer-readable storage medium according to any one of claims 16 to 18, wherein, before the inputting of the to-be-recognized fused feature vector into the abnormal behavior recognition model for processing and the outputting of the corresponding human action category and whether the human action category constitutes abnormal behavior, the method further comprises:
    acquiring, for each of M persons, an action image as a training image while recording a training voice of a predetermined duration for each person, to obtain M training images and M training voices;
    annotating each training image and each training voice with a corresponding training human action label;
    performing feature extraction on each training image to obtain M training feature matrices;
    inputting the M training feature matrices in turn into the human body feature extraction model for processing, and outputting M training image feature vectors;
    performing text feature extraction on the M training voices to obtain M training voice feature vectors;
    cross-fusing the training image feature vectors and training voice feature vectors belonging to the same human action label to obtain training fused feature vectors, the M training image feature vectors and the M training voice feature vectors being correspondingly fused into M training fused feature vectors;
    inputting the M training fused feature vectors in turn into a convolutional neural network for training processing, and comparing each output training human action category with the corresponding training human action label to determine a corresponding training loss function;
    adjusting the convolutional neural network according to the training loss function to obtain a convolutional neural network model; and
    adding, before the output layer of the convolutional neural network, a judgment layer capable of determining, from the obtained training human action category, whether the category constitutes abnormal behavior, to obtain the abnormal behavior recognition model.
  20. The computer-readable storage medium according to claim 16, wherein, when it is detected that the user has entered the recognition area, the camera is controlled to capture a plurality of to-be-recognized action images of the user;
    and the performing of feature extraction on the to-be-recognized action images to obtain the to-be-recognized feature matrix specifically comprises:
    inputting the plurality of to-be-recognized action images into an encoding processor, performing visual analysis on each to-be-recognized action image with the self-attention mechanism layer of the encoding processor, and extracting a visual feature of each to-be-recognized action image, the plurality of to-be-recognized action images thus yielding a corresponding plurality of visual features;
    inputting the plurality of visual features into a superposition layer of the encoding processor for superposition processing to obtain a plurality of superposition results;
    inputting the plurality of superposition results into a residual layer for residual processing to reinforce the plurality of superposition results; and
    splicing the reinforced superposition results together and applying linear processing to obtain the to-be-recognized feature matrix.
PCT/CN2020/111664 2020-02-27 2020-08-27 Method, apparatus and device for recognizing abnormal behavior on the basis of voice and image features WO2021169209A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010123166.8 2020-02-27
CN202010123166.8A CN111460889B (en) 2020-02-27 2020-02-27 Abnormal behavior recognition method, device and equipment based on voice and image characteristics

Publications (1)

Publication Number Publication Date
WO2021169209A1 true WO2021169209A1 (en) 2021-09-02

Family

ID=71685056

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/111664 WO2021169209A1 (en) 2020-02-27 2020-08-27 Method, apparatus and device for recognizing abnormal behavior on the basis of voice and image features

Country Status (2)

Country Link
CN (1) CN111460889B (en)
WO (1) WO2021169209A1 (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460889B (en) * 2020-02-27 2023-10-31 平安科技(深圳)有限公司 Abnormal behavior recognition method, device and equipment based on voice and image characteristics
CN112132430B (en) * 2020-09-14 2022-09-27 国网山东省电力公司电力科学研究院 Reliability evaluation method and system for distributed state sensor of power distribution main equipment
CN111832581B (en) * 2020-09-21 2021-01-29 平安科技(深圳)有限公司 Lung feature recognition method and device, computer equipment and storage medium
CN112259078A (en) * 2020-10-15 2021-01-22 上海依图网络科技有限公司 Method and device for training audio recognition model and recognizing abnormal audio
CN112289306B (en) * 2020-11-18 2024-03-26 上海依图网络科技有限公司 Juvenile identification method and device based on human body characteristics
CN113409769B (en) * 2020-11-24 2024-02-09 腾讯科技(深圳)有限公司 Data identification method, device, equipment and medium based on neural network model
CN112992340A (en) * 2021-02-24 2021-06-18 北京大学 Disease early warning method, device, equipment and storage medium based on behavior recognition
CN112820071B (en) * 2021-02-25 2023-05-05 泰康保险集团股份有限公司 Behavior recognition method and device
CN113159013A (en) * 2021-04-28 2021-07-23 平安科技(深圳)有限公司 Paragraph identification method and device based on machine learning, computer equipment and medium
CN113255597B (en) * 2021-06-29 2021-09-28 南京视察者智能科技有限公司 Transformer-based behavior analysis method and device and terminal equipment thereof
CN113610082A (en) * 2021-08-12 2021-11-05 北京有竹居网络技术有限公司 Character recognition method and related equipment thereof
CN113673489B (en) * 2021-10-21 2022-04-08 之江实验室 Video group behavior identification method based on cascade Transformer
CN114140673B (en) * 2022-02-07 2022-05-20 人民中科(北京)智能技术有限公司 Method, system and equipment for identifying violation image
CN114581749B (en) * 2022-05-09 2022-07-26 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application
CN114821805B (en) * 2022-05-18 2023-07-18 湖北大学 Dangerous behavior early warning method, dangerous behavior early warning device and dangerous behavior early warning equipment
CN116704405A (en) * 2023-05-22 2023-09-05 阿里巴巴(中国)有限公司 Behavior recognition method, electronic device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647599B (en) * 2018-04-27 2022-04-15 南京航空航天大学 Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network
CN109784150B (en) * 2018-12-06 2023-08-01 东南大学 Video driver behavior identification method based on multitasking space-time convolutional neural network
CN110222653B (en) * 2019-06-11 2020-06-16 中国矿业大学(北京) Skeleton data behavior identification method based on graph convolution neural network
KR20190101329A (en) * 2019-08-12 2019-08-30 엘지전자 주식회사 Intelligent voice outputting method, apparatus, and intelligent computing device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6496184B1 (en) * 1998-11-30 2002-12-17 William T. Freeman Method for inferring scenes from test images and training data using probability propagation in a markov network
CN105913559A (en) * 2016-04-06 2016-08-31 南京华捷艾米软件科技有限公司 Motion sensing technique based bank ATM intelligent monitoring method
CN110276265A (en) * 2019-05-27 2019-09-24 魏运 Pedestrian monitoring method and device based on intelligent three-dimensional solid monitoring device
CN110837579A (en) * 2019-11-05 2020-02-25 腾讯科技(深圳)有限公司 Video classification method, device, computer and readable storage medium
CN111460889A (en) * 2020-02-27 2020-07-28 平安科技(深圳)有限公司 Abnormal behavior identification method, device and equipment based on voice and image characteristics

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743525A (en) * 2021-09-14 2021-12-03 杭州电子科技大学 Fabric material identification system and method based on luminosity stereo
CN113743525B (en) * 2021-09-14 2024-02-13 杭州电子科技大学 Fabric material identification system and method based on luminosity three-dimensional
CN114218984A (en) * 2021-12-07 2022-03-22 桂林电子科技大学 Radio frequency fingerprint identification method based on sample multi-view learning
CN114218984B (en) * 2021-12-07 2024-03-22 桂林电子科技大学 Radio frequency fingerprint identification method based on sample multi-view learning
CN113987274A (en) * 2021-12-30 2022-01-28 智者四海(北京)技术有限公司 Video semantic representation method and device, electronic equipment and storage medium
CN114510968A (en) * 2022-01-21 2022-05-17 石家庄铁道大学 Fault diagnosis method based on Transformer
CN114612681A (en) * 2022-01-30 2022-06-10 西北大学 GCN-based multi-label image classification method, model construction method and device
CN114464182A (en) * 2022-03-03 2022-05-10 慧言科技(天津)有限公司 Voice recognition fast self-adaption method assisted by audio scene classification
CN114639169A (en) * 2022-03-28 2022-06-17 合肥工业大学 Human body action recognition system based on attention mechanism feature fusion and position independence
CN114639169B (en) * 2022-03-28 2024-02-20 合肥工业大学 Human motion recognition system based on attention mechanism feature fusion and irrelevant to position
CN114973120A (en) * 2022-04-14 2022-08-30 山东大学 Behavior identification method and system based on multi-dimensional sensing data and monitoring video multi-mode heterogeneous fusion
CN114973120B (en) * 2022-04-14 2024-03-12 山东大学 Behavior recognition method and system based on multi-dimensional sensing data and monitoring video multimode heterogeneous fusion
CN114882421A (en) * 2022-06-01 2022-08-09 江南大学 Method for recognizing skeleton behavior based on space-time feature enhancement graph convolutional network
CN114882421B (en) * 2022-06-01 2024-03-26 江南大学 Skeleton behavior recognition method based on space-time characteristic enhancement graph convolution network
CN114998834A (en) * 2022-06-06 2022-09-02 杭州中威电子股份有限公司 Medical warning system based on face image and emotion recognition
CN116664292A (en) * 2023-04-13 2023-08-29 连连银通电子支付有限公司 Training method of transaction anomaly prediction model and transaction anomaly prediction method
CN116246214A (en) * 2023-05-08 2023-06-09 浪潮电子信息产业股份有限公司 Audio-visual event positioning method, model training method, device, equipment and medium
CN116246214B (en) * 2023-05-08 2023-08-11 浪潮电子信息产业股份有限公司 Audio-visual event positioning method, model training method, device, equipment and medium
CN116703161A (en) * 2023-06-13 2023-09-05 湖南工商大学 Prediction method and device for man-machine co-fusion risk, terminal equipment and medium
CN116451139B (en) * 2023-06-16 2023-09-01 杭州新航互动科技有限公司 Live broadcast data rapid analysis method based on artificial intelligence
CN116451139A (en) * 2023-06-16 2023-07-18 杭州新航互动科技有限公司 Live broadcast data rapid analysis method based on artificial intelligence

Also Published As

Publication number Publication date
CN111460889B (en) 2023-10-31
CN111460889A (en) 2020-07-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20921300

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20921300

Country of ref document: EP

Kind code of ref document: A1