CN110443124B - Identification method, device and storage medium - Google Patents

Identification method, device and storage medium

Info

Publication number
CN110443124B
Authority
CN
China
Prior art keywords
face
attribute
lip
image
determining
Prior art date
Legal status
Active
Application number
CN201910562932.8A
Other languages
Chinese (zh)
Other versions
CN110443124A (en)
Inventor
贺鑫 (He Xin)
Current Assignee
Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Original Assignee
Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhuo Erzhi Lian Wuhan Research Institute Co Ltd filed Critical Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Priority to CN201910562932.8A
Publication of CN110443124A
Application granted
Publication of CN110443124B

Classifications

    • G06F18/22: Physics; computing; electric digital data processing; pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06N3/045: Physics; computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06V40/161: Physics; image or video recognition or understanding; recognition of biometric, human-related or animal-related patterns; human faces, e.g. facial parts, sketches or expressions; detection; localisation; normalisation
    • G06V40/171: Physics; image or video recognition or understanding; recognition of biometric, human-related or animal-related patterns; human faces; feature extraction; face representation; local features and components; facial parts; occluding parts, e.g. glasses; geometrical relationships

Abstract

The embodiment of the application discloses an identification method, an identification device and a storage medium. The method comprises: obtaining at least two adjacent images; calculating a feature map of each image; determining a face region of at least one object in the at least two images based on the feature maps; obtaining lip images of the object in each image based on the face regions of the object in the at least two adjacent images; and determining, based on the lip images, an attribute of the object, the attribute characterizing at least whether the object poses a security threat.

Description

Identification method, device and storage medium
Technical Field
The present application relates to identification technologies, and in particular, to an identification method, an identification apparatus, and a storage medium.
Background
At present, a face image of a person can be acquired through face identification technology, and the acquired face image can be matched for similarity against face images of persons known in advance to pose threats to the personal and/or property safety of others, so as to identify whether a monitored person is a suspicious person. This approach suits situations where it has already been determined which person or persons pose a significant security threat. Typically, knowledge of which persons may pose a security threat is determined by a judicial authority, such as the police department, based on case-handling experience. Considering that video monitoring is now installed in most public places, how to automatically identify suspicious or non-suspicious people by means of surveillance video has become a technical problem to be solved urgently.
Disclosure of Invention
In order to solve the existing technical problem, embodiments of the present invention provide an identification method, apparatus, and storage medium, which can at least implement automatic identification of suspicious or non-suspicious people.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides an identification method, which comprises the following steps:
obtaining at least two adjacent images;
calculating a feature map of each image;
determining a face region of at least one object in the at least two images based on the feature map;
obtaining lip images of the object in each image based on the face regions of the object in the at least two adjacent images;
determining, based on the lip image, an attribute of the object, the attribute characterizing at least whether the object poses a security threat.
In the above solution, the determining the attribute of the object based on the lip image includes:
obtaining a lip motion image sequence based on the lip image;
obtaining lip movements based on the lip motion image sequence;
based on the lip action, a property of the object is determined.
In the foregoing solution, the determining a face region of at least one object in the at least two images based on the feature map includes:
obtaining a plurality of candidate target positions of the feature map, wherein each candidate target position is characterized at least by the center coordinates of a candidate region in the feature map;
for any one candidate target position, obtaining at least two candidate regions of the feature map, the candidate regions differing from one another in size;
and obtaining the face region of the object from the plurality of candidate regions of the feature map obtained based on the plurality of candidate target positions.
In the above solution, the determining the attribute of the object based on the lip action includes:
determining first information generated by the object based on the lip action, the first information being characterized by at least one of a word, a phrase, and a sentence spoken by the object;
analyzing the sensitivity of the first information to obtain an analysis result;
based on the analysis results, attributes of the object are determined.
In the above scheme, the analyzing the sensitivity of the first information to obtain an analysis result, and the determining the attribute of the object based on the analysis result, include:
judging whether the first information appears in the acquired sensitive data set or not;
adding a face image of the object generating the first information into a face database in the case that the first information appears in the sensitive data set;
judging the similarity between the added object and each face in a face database;
and determining the attribute of the object based on the judgment result of the similarity.
In the foregoing solution, the determining the attribute of the object based on the determination result of the similarity includes:
determining that the added object has a first attribute when the similarity between the added object and every face in the face database is smaller than a first threshold;
determining that the added object has a third attribute when the similarity between the added object and one of the faces is greater than or equal to the first threshold;
wherein an object with the first attribute poses a lower security threat than an object with the third attribute.
In the foregoing solution, the determining the attribute of the object based on the determination result of the similarity includes:
determining that the added object has the first attribute when the similarity between the added object and every face in the face database is smaller than the first threshold;
and when the similarity between the added object and one of the faces is greater than or equal to the first threshold, determining the attribute of the added object based on the number of times the similarity between the object and that face in the face database has been greater than or equal to the first threshold.
In the foregoing solution, the determining the attribute of the added object based on the number of times the similarity between the object and the one face in the face database is greater than or equal to the first threshold includes:
determining that the object has a second attribute when the number of times is less than a second threshold, and determining that the object has the third attribute when the number of times is greater than or equal to the second threshold;
wherein an object with the first attribute poses a lower security threat than an object with the second attribute, and an object with the second attribute poses a lower security threat than an object with the third attribute.
An embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the foregoing identification method.
The embodiment of the invention provides an identification device, which comprises a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the foregoing identification method.
The identification method, the identification device and the storage medium of the embodiments of the application are provided, wherein the method comprises: obtaining at least two adjacent images; calculating a feature map of each image; determining a face region of at least one object in the at least two images based on the feature maps; obtaining lip images of the object in each image based on the face regions of the object in the at least two adjacent images; and determining, based on the lip images, an attribute of the object, the attribute characterizing at least whether the object poses a security threat.
In the embodiment of the application, the lip images of each image in at least two adjacent images are identified from the face images, and whether the person is a suspicious person or not is determined based on the lip images, so that the automatic identification of the suspicious or non-suspicious person is realized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a schematic flow chart illustrating an implementation of a first embodiment of an identification method provided in the present application;
fig. 2 is a schematic flow chart illustrating an implementation of a second embodiment of the identification method provided in the present application;
FIG. 3 shows schematic diagrams (a) and (b) of two images collected from a surveillance video as provided by the present application;
FIG. 4 is a schematic diagram of anchor points and constructing multi-scale candidate boxes provided by the present application;
fig. 5 is a schematic flow chart of an implementation of a third embodiment of the identification method provided in the present application;
fig. 6 is a schematic structural diagram illustrating a first embodiment of an identification device according to the present application;
fig. 7 is a schematic structural diagram of a second embodiment of an identification device according to the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. In the present application, the embodiments and features of the embodiments may be arbitrarily combined with each other without conflict. The steps illustrated in the flow charts of the figures may be performed in a computer system such as a set of computer-executable instructions. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.
The embodiment of the application concerns automatic identification of suspicious people (people who may pose a threat to the personal or property safety of others) or non-suspicious people, and the scheme can be applied to any device, system or platform that needs to identify suspicious or non-suspicious people, such as a monitoring device, a monitoring system or a monitoring platform. According to the embodiment of the application, a face image is first identified from a video image, a lip image is then identified from the face image, whether a person is a suspicious person is determined based on the lip images identified from at least two adjacent video images, and automatic identification of suspicious or non-suspicious people is thereby realized.
In a first embodiment of the identification method provided in the present application, as shown in fig. 1, the method includes:
step 101: obtaining at least two adjacent images;
step 102: calculating a feature map of each image;
step 103: determining a face region of at least one object in the at least two images based on the feature map;
step 104: obtaining lip images of the object in each image based on the face regions of the object in the at least two adjacent images;
step 105: determining, based on the lip image, an attribute of the object, the attribute characterizing at least whether the object poses a security threat.
The entity performing steps 101-105 is a device, system, or platform, such as a monitoring device, that needs to identify suspicious or non-suspicious personnel. In step 101: the images can be collected from video monitoring, either in real time or at intervals, and two adjacent images are video images adjacent in collection time. In steps 102 and 103: a feature map of each image is obtained, and the face region is identified based on the feature map. In steps 104 and 105: lip images are identified from the face region, and whether the object is a suspicious person is determined based on the lip images of each of the at least two adjacent images, thereby realizing automatic identification of suspicious or non-suspicious personnel. Performing this automatic identification based on the face image and, further, the lip image offers high feasibility and good identification accuracy.
It is understood that the object refers to each person (subject) appearing in the images, and specifically to the same person across the at least two adjacent images. The scheme can identify, through the lip images of the persons appearing in the images, whether each person is a suspicious person, and it is an automatic identification scheme. Because suspicious personnel are identified based on both face identification and lip image identification, the identification accuracy is high and the feasibility is good. A minimal sketch of this flow is given below.
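The sketch below is illustrative only: the helper names and signatures are assumptions standing in for the stages described above, not the patent's actual implementation.

```python
# Hypothetical stubs for steps 102-105; each would be realized as described
# in the embodiments that follow.
def compute_feature_map(image):        # step 102
    raise NotImplementedError

def detect_face_region(feature_map):   # step 103
    raise NotImplementedError

def crop_lip_image(image, region):     # step 104
    raise NotImplementedError

def determine_attribute(lip_images):   # step 105
    raise NotImplementedError

def identify(frames):
    """frames: at least two temporally adjacent surveillance images (step 101)."""
    assert len(frames) >= 2, "the method requires at least two adjacent images"
    feature_maps = [compute_feature_map(f) for f in frames]
    regions = [detect_face_region(fm) for fm in feature_maps]
    lips = [crop_lip_image(f, r) for f, r in zip(frames, regions)]
    # the attribute characterizes at least whether the object poses a threat
    return determine_attribute(lips)
```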
In a second embodiment of the identification method provided in the present application, as shown in fig. 2, the method includes:
step 201: obtaining at least two adjacent images;
step 202: calculating a feature map of each image;
step 203: determining a face region of at least one object in the at least two images based on the feature map;
step 204: obtaining lip images of the object in each image based on the face regions of the object in the at least two adjacent images;
step 205: obtaining a lip motion image sequence based on the lip image;
in this step, the lip images obtained from the at least two adjacent images are arranged in time order and recorded as a lip motion image sequence;
step 206: obtaining lip movements based on the lip motion image sequence;
step 207: determining, based on the lip action, an attribute of the object, the attribute at least indicating whether the object poses a security threat.
The entity performing steps 201-207 is a device, system, or platform, such as a monitoring device, that needs to identify suspicious or non-suspicious personnel. In step 201: the images can be collected from video monitoring, either in real time or at intervals, and two adjacent images are video images adjacent in collection time. In steps 202 and 203: a feature map of each image is obtained, and the face region is identified based on the feature map. In steps 204-207: lip images are identified from the face region and recorded as a lip motion image sequence, lip actions are obtained from the lip motion image sequence, and it is determined whether the person producing the lip actions is a suspicious person, thereby realizing automatic identification of suspicious or non-suspicious personnel. Performing this automatic identification based on the face image and, further, the lip motion images offers high feasibility and good identification accuracy.
Based on the above description of the first and second embodiments of the method:
In an optional embodiment, the determining a face region of at least one object in the at least two images based on the feature map includes: obtaining a plurality of candidate target positions of the feature map, wherein each candidate target position is characterized at least by the center coordinates of a candidate region in the feature map; for any one candidate target position, obtaining at least two candidate regions of the feature map, the candidate regions differing from one another in size; and obtaining the face region of the object from the plurality of candidate regions of the feature map obtained based on the plurality of candidate target positions. In this optional embodiment, the face region of the object is obtained from a plurality of candidate regions derived from a plurality of candidate target positions, so the face region can be identified more accurately. In addition, recognizing the face region based on candidate target positions can avoid the problem that faces cannot be recognized accurately when there are too many people in the image or the shooting distance is too long. In a specific implementation, anchor points may be used as the candidate target positions; please refer to the related description below, which is not repeated here.
In an alternative embodiment, the determining the property of the object based on the lip action includes:
determining first information generated by the object based on the lip action, the first information being characterized by at least one of the words, phrases and sentences spoken by the object; analyzing the sensitivity of the first information to obtain an analysis result; and determining the attribute of the object based on the analysis result. In this optional embodiment, the characters, words, phrases and/or sentences spoken by the photographed person are obtained based on the lip actions derived from the lip motion image sequence, and whether the photographed person is a suspicious person is determined through sensitivity analysis of the spoken content, realizing automatic identification of suspicious persons. Performing this automatic identification based on the sensitivity of what the photographed person says can avoid missed identification of suspicious persons.
In an optional embodiment, the analyzing the sensitivity of the first information obtains an analysis result; determining attributes of the object based on the analysis results, including: judging whether the first information appears in the acquired sensitive data set; adding face data of an object generating first information into a face database under the condition that the first information appears in a sensitive data set; judging the similarity between the added object and each face in a face database; and determining the attribute of the object based on the judgment result of the similarity. In this optional embodiment, it is determined whether the added photographed object is a suspicious person based on the determination result of the similarity between the photographed object added to the face database and each face in the face database. In particular, the method can be implemented in two ways:
the implementation mode is as follows: under the condition that the similarity between the added object and any one of the faces is judged to be smaller than a first threshold value, the added object is judged to be a first attribute; under the condition that the similarity between the added object and one of the faces is judged to be more than or equal to a first threshold value, the added object is judged to be a third attribute; wherein the object with the first attribute has a lower security threat than the object with the third attribute.
In implementation one, whether the added object is a person with the first attribute or a person with the third attribute is determined directly from the comparison between the similarity and the first threshold. A person with the first attribute can be regarded as a non-suspicious person, i.e., a normal person. A person with the third attribute can be regarded as suspicious.
Implementation two: when the similarity between the added object and every face in the face database is smaller than the first threshold, the added object is determined to have the first attribute; when the similarity between the added object and one of the faces is greater than or equal to the first threshold, the attribute of the added object is determined based on the number of times the similarity between the object and that face in the face database has been greater than or equal to the first threshold.
In implementation two, the determining the attribute of the added object based on the number of times the similarity between the object and the one face in the face database is greater than or equal to the first threshold includes:
determining that the object has a second attribute when the number of times is less than a second threshold, and determining that the object has the third attribute when the number of times is greater than or equal to the second threshold; wherein an object with the first attribute poses a lower security threat than an object with the second attribute, and an object with the second attribute poses a lower security threat than an object with the third attribute.
In implementation two, the added object is considered a non-suspicious person when the similarity is smaller than the first threshold. When the similarity is greater than or equal to the first threshold, the degree of suspicion is determined from the number of times the similarity between the object and the same face in the face database has reached the first threshold: if that number is less than the second threshold, the object is determined to be a person of the continued-attention class; if it is greater than or equal to the second threshold, the object is determined to be a suspicious person. A person with the second attribute can be regarded as a person requiring continued attention.
In the foregoing scheme, both implementations can ensure the accuracy of identifying the object attribute. Compared with implementation one, implementation two divides the attributes of persons more finely, yielding a more detailed grading of suspicion, so that the degree to which each person appearing in the video images is suspicious or non-suspicious, and hence the degree to which each person threatens the personal and/or property safety of others, can be determined more accurately.
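The two implementations reduce to simple threshold logic. A minimal sketch follows; the attribute labels and the threshold defaults (30% and 3, taken from the example values given later in this description) are illustrative assumptions.

```python
FIRST, SECOND, THIRD = "normal", "attention", "suspicious"  # assumed labels

def decide_impl_one(max_similarity, t1=0.30):
    # Implementation one: a single similarity threshold splits the classes.
    return FIRST if max_similarity < t1 else THIRD

def decide_impl_two(max_similarity, hit_count, t1=0.30, t2=3):
    # Implementation two: below t1 the object is normal; otherwise the number
    # of times the similarity to the same face reached t1 decides the class.
    if max_similarity < t1:
        return FIRST
    return SECOND if hit_count < t2 else THIRD
```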
In the foregoing solution, when a person is determined to be suspicious, the current activity track of the suspicious person is obtained, the stored historical activity track of the suspicious person is read (each time a suspicious person is identified, the activity track and an identifier of the suspicious person, such as a face image, need to be stored in correspondence), and the target aimed at by the suspicious person is predicted from the current and historical activity tracks. For example, suppose that through collection and analysis of video images of a bank (see the related description of the analysis process), person A is determined to be a suspicious person. If person A is currently present in the bank, and reading person A's stored historical activity track shows that person A frequently appears or stays in the bank, it is predicted that person A may take a threatening action against the bank, such as robbing it.
The following detailed description of the embodiments of the present invention will be made with reference to fig. 3(a), (b) to fig. 5 and the following specific embodiments.
Considering that video monitors are currently installed in public places such as shopping malls and railway stations, at least two images, such as those shown in fig. 3(a) and (b), are collected from the surveillance videos. Since the present application scenario identifies a person's lip movements based on at least two images, the images read are preferably two or more images representing a succession of movements of the person. In terms of shooting time, the two or more images may be acquired at adjacent acquisition times; for example, with an acquisition every 10 s, two adjacent acquired images may be the image captured at 2:00:10 and the image captured at 2:00:20 on June 18, 2019. From a temporal perspective, the two or more images captured from a surveillance video may be regarded as an image sequence, each image carrying its own temporal information. The acquisition interval may be any reasonable interval and is not limited to once every 10 s.
Step 500: collecting two or more adjacent images from a monitoring video;
in this step, the two or more acquired images may be recorded in the form of an image sequence. The scheme mainly performs image analysis on the person objects, particularly the same person object, appearing in the acquired images; inanimate objects and/or animals appearing in the images are not considered. For convenience of description, an image collected from the surveillance video may be referred to as a collected image or an original image.
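A minimal sketch of step 500 follows, assuming OpenCV for video access; the fixed 10 s interval mirrors the example above and is not mandatory.

```python
import cv2

def sample_adjacent_frames(video_path, interval_s=10.0, count=2):
    """Collect `count` images from a surveillance video at a fixed interval."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS is unknown
    step = max(1, int(round(interval_s * fps)))
    frames, idx = [], 0
    while len(frames) < count:
        ok, frame = cap.read()
        if not ok:
            break                              # end of video
        if idx % step == 0:
            frames.append(frame)               # each frame keeps its time order
        idx += 1
    cap.release()
    return frames                              # the recorded image sequence
```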
Step 501: calculating a feature map of the acquired image based on the neural network model;
step 502: based on the collected characteristic image, identifying the position of the human face area of the personnel object in the collected image to obtain a human face image;
In steps 501 and 502:
those skilled in the art will appreciate that neural network models, and in particular neural network deep learning models, include an input layer, a convolutional layer, a pooling layer, a fully-connected layer, and an output layer. In the application scenario, for example, one of the acquired images is input to the input layer, and the input layer performs preprocessing on the image, such as mean value removal, normalization, and/or dimension reduction. The preprocessed image data is input to the convolutional layer. The scheme comprises a plurality of convolution layers, wherein the output of each convolution layer is used as the input of the next convolution layer, and the output of the last 1 convolution layer is input to the full-connection layer. Wherein, the input of the 1 st convolution layer is the preprocessed image data. It is understood that the image data input to each convolution layer will be processed by convolution of the corresponding convolution layer to obtain a corresponding feature map. Among all the convolutional layers, the feature map obtained by the low convolutional layer has less semantic information, but the position of each represented person object is more accurate. The feature map obtained from the high convolution layer has rich semantic information, but the positions of the individual human objects represented by the feature map are rough.
The application scenario identifies the face region by combining the characteristics of these convolutional layers. Specifically, considering that the feature maps output by different convolutional layers differ in size, the position of a given person object in the acquired image is predicted based on feature maps of different sizes.
Taking the feature map output by one of the convolutional layers as an example, the position of the person object in the acquired image is predicted as follows:
and reading the position information of a plurality of anchor points arranged for the convolutional layer, namely the coordinate information of the anchor points in the characteristic diagram. And constructing m length-width ratio and n scales of candidate frames by taking each anchor point as a center, wherein the candidate frames cover the region of the feature map and can be regarded as the candidate region of the feature map. Where it is understood that n dimensions refer to n areas of the constructed candidate box, and m aspect ratios refer to n length-to-width ratios of the constructed candidate box. For example, 3 areas are {128 } with n ═ 3 and m ═ 32,2562,5122And the aspect ratios of the three frames are {1:1, 1:2, 2:1}, and as three candidate frames with different aspect ratios can be constructed under the same area of the same anchor point, n × m candidate frames can be constructed under the n scales and m aspect ratios of the same anchor point, that is, n × m candidate regions for the same anchor point in the feature map. Assuming that the number of anchor points is P, the P anchor points have P × n × m candidate regions in total. Wherein n, m and p are positive integers. Here, in consideration of space limitations, n × m candidate frames of the same anchor point cannot be represented one by one, and may be understood by referring to a form in which three anchor points (anchor points A, B and C) of one feature diagram shown in fig. 4 all have three candidate frames, which is not described in detail herein. Taking anchor point a as an example, the size of the frame with anchor point a as the center coordinate point of the candidate frame may take three sizes as shown in the figure, where the three sizes are sizes under different aspect ratios in the same area.
It should be noted that, in the embodiment of the present application, each parameter in the neural network deep learning model is obtained by training, specifically by training on manually labeled positions of the faces of objects in original images. Because manual labeling is required, the method is a supervised learning method. Considering that in practical applications an original image may contain not only persons but also objects such as cats, dogs and counters, the categories of the respective objects in the original image need to be labeled. It can be understood that, since the emphasis of the embodiment of the present application is on recognizing human faces, only objects labeled as the person category need to be attended to in the present solution.
It can be understood that the scheme in the embodiment of the present application operates on a trained neural network deep learning model. The manually labeled position of the face of an object in the original image is a label area (ground truth). The label area is represented by a box, which can be expressed mathematically as an array [x, y, l, w], where [x, y] are the coordinates of the lower-left corner of the box on the original image, l is the length of the box, and w is its width. On the basis of the label area, the overlap proportion of each candidate region with the label area is calculated, the overlap proportion being the area of the intersection of the candidate region and the label area divided by the area of their union (i.e., the intersection-over-union, IoU). Candidate regions whose overlap proportion is lower than a preset proportion threshold are deleted, those with overlap proportion greater than or equal to the threshold are retained, and NMS (non-maximum suppression) is applied to the retained candidate regions to screen out the candidate region with the highest overlap with the ground truth, which can serve as the position of the face region of the person object in the feature map of that convolutional layer.
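The overlap screening and NMS step can be sketched as follows. Corner-format boxes (x1, y1, x2, y2) are assumed here rather than the [x, y, l, w] array above, and the 0.5 proportion threshold is an assumed example, since the description leaves the threshold to be set flexibly.

```python
import numpy as np

def iou(a, b):
    """Overlap proportion: intersection area / union area."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def screen(candidates, ground_truth, ratio_threshold=0.5):
    """Keep candidate regions whose overlap with the label area meets the threshold."""
    return [c for c in candidates if iou(c, ground_truth) >= ratio_threshold]

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy non-maximum suppression over the retained candidates."""
    order, keep = np.argsort(scores)[::-1], []
    while len(order):
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[[iou(boxes[i], boxes[j]) < iou_threshold for j in rest]]
    return keep  # indices of the surviving candidate regions
```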
The feature map output by each convolutional layer is processed in this way to obtain the positions of the face regions in the feature maps of the different convolutional layers; feature fusion is performed on the face regions so obtained to determine the position of the face region of the person object in the collected image, and the face region in the collected image is then identified to obtain the face image. Because the data volume of the images output by the convolutional layers is large, the identified face region is reduced in dimension through the pooling layer to obtain a face image with a small data volume. The fully-connected layer determines, using probability, whether a face appears in the identified face region, and the output layer outputs the final identified face region. For the implementation of the pooling layer, the fully-connected layer and the output layer, please refer to the related description, which is not repeated here.
In this scheme, the candidate regions of the feature map are divided using m aspect ratios and n scales. Dividing the feature map from multiple angles, i.e., multiple aspect ratios and multiple scales, makes the division finer, provides a good basis for accurately locating the face image in the feature map, and can effectively improve the accuracy of identifying the face image in the collected image.
In this scheme, feature maps of different sizes are analyzed in a multi-scale manner, which amounts to multi-scale face detection. Multi-scale face detection can greatly improve the accuracy of face recognition.
In this scheme, the face region is identified based on anchor point information and in a multi-scale manner, which solves the problem that faces cannot be accurately identified when there are too many people in the image, the shooting distance is too long, or the persons appear too small. It can be understood that the technical scheme of this application scenario at least solves the problem that faces cannot be accurately identified when the collected image contains too many people or the distances are too great.
Those skilled in the art will appreciate that the aforementioned proportion threshold is preset and can be set flexibly according to actual use; values are not enumerated here. For the NMS screening process, please refer to the related description, which is not repeated. The number of convolutional layers may be, for example, 9 or 13; naturally, other numbers may be used, and the number of layers can be set flexibly depending on actual use.
Step 503: preprocessing a face image;
in this step, the preprocessing of the face image may include high-pass filtering and/or super-resolution reconstruction of the face image. High-pass filtering can filter out low-frequency signals in the face image and keep the high-frequency signals, where the high-frequency signals generally represent detail information of the parts of the face, such as lip contour edges. High-pass filtering thus facilitates extraction of the lips and their contour edge information.
In the application scenario, considering that the resolution of surveillance video is usually low, a clear lip image can be obtained by performing super-resolution processing only on the identified face region, yielding a clearer face image. Specifically, the super-resolution reconstruction may employ image interpolation or a GAN (generative adversarial network) to reconstruct the face image, so that a high-resolution face region can be constructed; the constructed face region is clearer, which facilitates the accurate interception of the lip image from the face region in the subsequent scheme.
Compared with the prior art, in which the entire collected image is enhanced, preprocessing only the face image in the collected image can effectively reduce the amount of calculation and the computational load.
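A sketch of step 503 under stated assumptions: an unsharp-masking high-frequency boost stands in for the high-pass filtering, and bicubic interpolation stands in for the super-resolution reconstruction (the description also allows GAN-based reconstruction); the kernel size and the 4x scale are illustrative.

```python
import cv2

def preprocess_face(face, scale=4):
    # High-pass emphasis: subtract a low-pass (blurred) version to keep
    # high-frequency detail such as lip contour edges.
    low = cv2.GaussianBlur(face, (9, 9), 0)
    high = cv2.addWeighted(face, 1.5, low, -0.5, 0)  # face + 0.5*(face - low)

    # Interpolation-based super-resolution on the face region only.
    h, w = high.shape[:2]
    return cv2.resize(high, (w * scale, h * scale),
                      interpolation=cv2.INTER_CUBIC)
```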
Step 504: intercepting a lip image from the preprocessed face image;
in this step, the lip image can be intercepted from the face image according to a set face proportion.
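For illustration, a sketch of such a proportion-based crop; the proportions are assumptions (lips typically sit in the lower-middle part of an aligned face box), not values fixed by the description.

```python
def crop_lip(face, y_range=(0.65, 0.95), x_range=(0.25, 0.75)):
    """Intercept the lip image from the preprocessed face image (a numpy
    array) according to a set face proportion (assumed values)."""
    h, w = face.shape[:2]
    y0, y1 = int(h * y_range[0]), int(h * y_range[1])
    x0, x1 = int(w * x_range[0]), int(w * x_range[1])
    return face[y0:y1, x0:x1]
```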
Step 505: obtaining a lip motion image sequence based on the lip images of each of the two or more acquired images, and obtaining, from the lip motion image sequence, the lip motions produced by the person object across the two or more acquired images;
step 506: based on the lip action, obtaining the content (first information) spoken by the person object in the two or more images;
In steps 505 and 506, the changes in lip shape that occur while a person speaks individual characters, words and/or phrases are learned in advance through training. The lip-shape changes, i.e., lip actions, and the characters, words and/or phrases corresponding to them are recorded in correspondence to obtain a lip-language data set. In use, the characters, words and/or phrases corresponding to the obtained lip action are looked up in the lip-language data set, thereby obtaining the content spoken by the person object. In this application scenario, public lip-language data sets such as CAVSR (Chinese audio-visual speech recognition) and HIT Bi-CAV (a bimodal corpus) can be adopted.
In this step, if the person object says a sentence, the sentence is further split according to semantics into characters, words and/or phrases, and the content is identified based on the above scheme.
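A minimal sketch of the lookup in steps 505-506. The key structure (a lip-action key mapped to text) and the entries are assumptions; a real system would derive comparable keys from a corpus such as CAVSR or HIT Bi-CAV.

```python
# Hypothetical lip-language data set: lip-action key -> corresponding text.
LIP_LANGUAGE_SET = {
    ("closed", "round", "wide"): "example-word",   # illustrative entries only
    ("round", "narrow"): "another-word",
}

def lip_actions_to_text(actions):
    """actions: sequence of lip-action keys extracted from the lip motion
    image sequence; unmatched actions are skipped in this sketch."""
    words = [LIP_LANGUAGE_SET[a] for a in actions if a in LIP_LANGUAGE_SET]
    return " ".join(words)
```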
Step 507: judging whether the content spoken by the person object appears in the sensitive data set;
when so, go to step 508;
if not, ending the flow;
In this scheme, the automatic identification of suspicious personnel is performed based on the sensitivity of what the photographed person says, so missed identification of suspicious personnel can be avoided.
For example, if the content spoken by a photographed person is recognized as "robbery", "bomb", etc., which appears in the sensitive data set, step 508 is performed. It can be understood that the sensitive data set of the present application collects security-related sensitive data such as the aforementioned "robbery" and "bomb".
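Step 507 is then a membership test against that set; a sketch, with illustrative entries only:

```python
SENSITIVE_SET = {"robbery", "bomb"}  # security-related sensitive terms

def is_sensitive(first_information):
    """True if the recognized content contains any sensitive term."""
    return any(term in first_information for term in SENSITIVE_SET)
```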
Step 508: adding the face image of the object into a face database, and judging the similarity between the object added into the face database and each face in the face database;
executing step 509 when it is determined that the similarity between the object added to the face database and any one of the faces is smaller than the first threshold;
executing step 510 when the similarity between the object added to the face database and one of the faces is judged to be greater than or equal to the first threshold;
step 509: judging whether the object added into the face database is a non-suspicious person-a normal person, and ending the process;
step 510: calculating the times that the similarity of the object in the face database and the one face is greater than or equal to a first threshold value;
when the calculated number of times is equal to or greater than the second threshold, step 511 is executed;
when the calculated number of times is less than the second threshold, executing step 512;
step 511: determining the object added into the face database as a suspicious person, and ending the process;
step 512: and determining the objects added into the face database as the people needing continuous attention, and ending the process.
The first threshold and the second threshold are preset values or value ranges and can be set flexibly according to actual conditions; for example, the first threshold may be 30% and the second threshold 3. Of course, other reasonable values can be taken; no specific limitation is imposed here.
In this scheme, photographed persons are divided into three classes: normal, attention and suspicious. The normal class can be regarded as non-suspicious persons posing no threat; persons in the attention class are continuously monitored, and whether they are suspicious is continuously judged from their subsequent lip images; suspicious persons are persons posing a threat. It can be understood that the scheme of the embodiment of the application also performs security classification of each photographed person in the collected images, classifying each accurately into the corresponding class and thereby avoiding missed identification of photographed persons.
In this scheme, the characters, words, phrases and/or sentences spoken by the photographed person are obtained based on the lip actions, and whether the photographed person is a suspicious person is determined through sensitivity analysis of the spoken content, thereby realizing automatic identification of suspicious persons. Performing this identification based on the sensitivity of what the photographed person says can avoid missed identification of suspicious persons. The objects added into the face database are divided into three classes, namely normal, attention and suspicious; this finer division further helps avoid missing suspicious personnel.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is configured to, when executed by a processor, perform at least the steps of the method shown in any one of fig. 1 to 5. The computer readable storage medium may be specifically a memory. The memory may be the memory 62 as shown in fig. 6.
The embodiment of the invention also provides a terminal. Fig. 6 is a schematic diagram of a hardware structure of an identification apparatus according to an embodiment of the present invention. As shown in fig. 6, the identification apparatus includes: a communication component 63 for data transmission, at least one processor 61, and a memory 62 for storing computer programs capable of running on the processor 61. The various components in the terminal are coupled together by a bus system 64. It will be appreciated that the bus system 64 is used to enable communications among these components; in addition to the data bus, it includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are labeled as bus system 64 in fig. 6.
Wherein the processor 61 executes the computer program to perform at least the steps of the method of any of fig. 1 to 5.
It will be appreciated that the memory 62 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferromagnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk memory or tape memory. The volatile memory can be Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 62 described in the embodiments of the invention is intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed in the above embodiments of the present invention may be applied to the processor 61, or implemented by the processor 61. The processor 61 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 61. The processor 61 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. Processor 61 may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present invention. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed by the embodiment of the invention can be directly implemented by a hardware decoding processor, or can be implemented by combining hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 62, and the processor 61 reads the information in the memory 62 and performs the steps of the aforementioned method in conjunction with its hardware.
In an exemplary embodiment, the identification Device may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), FPGAs, general purpose processors, controllers, MCUs, microprocessors (microprocessors), or other electronic components for performing the aforementioned identification method.
An embodiment of the present application further provides an identification apparatus, as shown in fig. 7, including: a first obtaining unit 71, a first calculating unit 72, a first determining unit 73, a second obtaining unit 74, and a second determining unit 75; wherein,
a first obtaining unit 71 for obtaining at least two adjacent images;
a first calculation unit 72 for calculating a feature map of each image;
a first determining unit 73, configured to determine a face region of at least one object in the at least two images based on the feature map;
a second obtaining unit 74, configured to obtain lip images of the object in the respective images based on the face regions of the object in the at least two adjacent images;
a second determining unit 75, configured to determine, based on the lip image, an attribute of the object, the attribute at least indicating whether the object poses a security threat.
In an optional embodiment, the second determining unit 75 is further configured to:
obtaining a lip motion image sequence based on the lip image;
obtaining lip movements based on the lip motion image sequence;
based on the lip action, a property of the object is determined.
In an optional embodiment, the first determining unit 73 is further configured to:
obtaining a plurality of candidate target positions of the feature map, wherein each candidate target position is characterized at least by the center coordinates of a candidate region in the feature map;
for any one candidate target position, obtaining at least two candidate regions of the feature map, the candidate regions differing from one another in size;
and obtaining the face region of the object from the plurality of candidate regions of the feature map obtained based on the plurality of candidate target positions.
In an optional embodiment, the second determining unit 75 is further configured to:
determining first information generated by the object based on lip action, the first information being characterized by at least one of words, phrases and sentences spoken by the object;
analyzing the sensitivity of the first information to obtain an analysis result;
based on the analysis results, attributes of the object are determined.
In an optional embodiment, the second determining unit 75 is further configured to:
judging whether the first information appears in the acquired sensitive data set;
in the case that first information appears in a sensitive data set, adding a face image of an object generating the first information into a face database;
judging the similarity between the added object and each face in a face database;
and determining the attribute of the object based on the judgment result of the similarity.
In an optional embodiment, the second determining unit 75 is further configured to:
determining that the added object has a first attribute when the similarity between the added object and every face in the face database is smaller than a first threshold;
determining that the added object has a third attribute when the similarity between the added object and one of the faces is greater than or equal to the first threshold;
wherein an object with the first attribute poses a lower security threat than an object with the third attribute.
In an optional embodiment, the second determining unit 75 is further configured to:
determining that the added object has the first attribute when the similarity between the added object and every face in the face database is smaller than the first threshold;
and when the similarity between the added object and one of the faces is greater than or equal to the first threshold, determining the attribute of the added object based on the number of times the similarity between the object and that face in the face database has been greater than or equal to the first threshold.
In an optional embodiment, the second determining unit 75 is further configured to:
determining that the object has a second attribute when the number of times is less than a second threshold, and determining that the object has the third attribute when the number of times is greater than or equal to the second threshold;
wherein an object with the first attribute poses a lower security threat than an object with the second attribute, and an object with the second attribute poses a lower security threat than an object with the third attribute.
It should be noted that, since the identification apparatus in the embodiment of the present invention solves problems on a principle similar to that of the identification method, the implementation process and principle of the apparatus can be understood by referring to those of the identification method, and repeated details are not described again.
It is to be understood that the first obtaining unit 71, the first calculating unit 72, the first determining unit 73, the second obtaining unit 74 and the second determining unit 75 may be implemented by a CPU (central processing unit), a DSP (digital signal processor), an FPGA (field-programmable gate array) or an MCU (micro control unit) in any reasonable system, platform or device that needs to identify suspicious persons.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be implemented by program instructions running on related hardware. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes a removable storage device, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Alternatively, if the integrated unit of the present invention is implemented in the form of a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention, or the part contributing to the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes a removable storage device, a ROM, a RAM, a magnetic disk, an optical disk, or any other medium capable of storing program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description covers only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
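For illustration only, the description above and the method claimed below can be strung together as a single pipeline from adjacent video frames to a per-person attribute. In the sketch, detect_face_regions, group_by_object, crop_lip_image, read_lips, and extract_face_feature are hypothetical stand-ins for the feature-map-based detector, cross-frame association, lip-region extraction, lip-reading model, and face embedding, none of which are given as code in the disclosure; classify_attribute is the function from the earlier sketch.

    from typing import Dict, List, Sequence

    def detect_face_regions(image) -> List[tuple]:
        raise NotImplementedError  # feature-map-based face detection

    def group_by_object(regions_per_frame) -> Dict[int, list]:
        raise NotImplementedError  # associate the same person across frames

    def crop_lip_image(image, face_region):
        raise NotImplementedError  # lip image cropped from a face region

    def read_lips(lip_images) -> str:
        raise NotImplementedError  # lip-image sequence -> spoken word/phrase/sentence

    def extract_face_feature(image, face_region):
        raise NotImplementedError  # face embedding for similarity comparison

    def identify(frames: Sequence, sensitive_set: set,
                 face_db: dict, match_counts: dict) -> Dict[int, str]:
        """Sketch of the overall flow: at least two adjacent frames in, attributes out."""
        if len(frames) < 2:
            raise ValueError("the method requires at least two adjacent images")

        regions_per_frame = [detect_face_regions(f) for f in frames]
        attributes: Dict[int, str] = {}

        for obj_id, regions in group_by_object(regions_per_frame).items():
            # One lip image of the object per frame, taken from its face region.
            lips = [crop_lip_image(f, r) for f, r in zip(frames, regions)]
            first_information = read_lips(lips)

            if first_information in sensitive_set:
                # Sensitive speech: enroll the face and classify it against the
                # database, e.g. with classify_attribute() as sketched earlier.
                face_feature = extract_face_feature(frames[-1], regions[-1])
                attributes[obj_id] = classify_attribute(face_feature, face_db,
                                                        match_counts)
                face_db[obj_id] = face_feature
            else:
                attributes[obj_id] = "normal"
        return attributes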

Claims (9)

1. An identification method, characterized in that the method comprises:
obtaining at least two adjacent images;
calculating a feature map of each image;
determining a face region of at least one object in the at least two images based on the feature maps;
obtaining a lip image of the object in each image based on the face regions of the object in the at least two adjacent images;
determining, based on the lip image, attributes of the object, the attributes characterizing at least whether the object presents a security threat; wherein said determining attributes of the object based on the lip image comprises: determining, based on the lip action determined from the lip image, whether first information spoken by the object appears in a sensitive data set; when the first information spoken by the object is determined to appear in the sensitive data set, adding the face image of the object to a face database and judging the similarity between the object added to the face database and each face in the face database; and determining the attribute of the object according to the similarity between the object added to the face database and each face in the face database, the attributes of the object comprising: a normal class, a suspect class, and a hazard class.
2. The method of claim 1, wherein the determining, based on the lip action determined from the lip image, whether first information spoken by the object appears in a sensitive data set comprises:
obtaining a lip motion image sequence based on the lip image;
and obtaining the lip action based on the lip motion image sequence.
3. The method according to claim 1 or 2, wherein the determining a face region of at least one object in the at least two images based on the feature maps comprises:
obtaining a plurality of candidate target positions of the feature map, each candidate target position being characterized at least by the center coordinates of a candidate region in the feature map;
for any one candidate target position, obtaining at least two candidate regions of the feature map, the at least two candidate regions differing in size;
and obtaining the face region of the object from the plurality of candidate regions of the feature map obtained based on the candidate target positions.
4. The method of claim 2, wherein the determining, based on the lip action determined from the lip image, whether first information spoken by the object appears in a sensitive data set comprises:
determining first information generated by the object based on the lip action, the first information being characterized by at least one of a word, a phrase, and a sentence spoken by the object;
judging whether the first information appears in the acquired sensitive data set;
and, in a case where the first information appears in the sensitive data set, adding the face image of the object that generated the first information to the face database.
5. The method of claim 4, wherein the determining the attribute of the object according to the similarity between the object added to the face database and each face in the face database comprises:
determining that the added object has the first attribute in a case where the similarity between the added object and every face in the face database is determined to be less than a first threshold;
determining that the added object has the third attribute in a case where the similarity between the added object and one of the faces is determined to be greater than or equal to the first threshold;
wherein an object with the first attribute poses a lower security threat than an object with the third attribute.
6. The method of claim 4, wherein the determining the attribute of the object according to the similarity between the object added to the face database and each face in the face database comprises:
determining that the added object has the first attribute in a case where the similarity between the added object and every face in the face database is determined to be less than a first threshold;
and, in a case where the similarity between the added object and one of the faces is determined to be greater than or equal to the first threshold, determining the attribute of the added object based on the number of times the similarity between the object and that face in the face database has been greater than or equal to the first threshold.
7. The method of claim 6, wherein the determining the attribute of the object according to the similarity between the object added to the face database and each face in the face database comprises:
determining that the object has the second attribute when the number of times is less than a second threshold, and that the object has the third attribute when the number of times is greater than or equal to the second threshold;
wherein an object with the first attribute poses a lower security threat than an object with the second attribute, and an object with the second attribute poses a lower security threat than an object with the third attribute.
8. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the identification method of any one of claims 1 to 7.
9. An identification device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the identification method of any one of claims 1 to 7.
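The multi-scale candidate-region scheme of claim 3 resembles anchor-based object detection: each candidate target position (a center coordinate on the feature map) spawns candidate regions of at least two different sizes, and the face region is selected from among all of them. A minimal sketch, in which the region sizes and the scoring function score_fn are illustrative assumptions rather than values from the disclosure:

    import numpy as np

    def candidate_regions(center, sizes=((16, 16), (32, 32), (64, 64))):
        """Candidate regions of differing sizes around one candidate target
        position, each returned as (x1, y1, x2, y2) in feature-map coordinates."""
        cx, cy = center
        return [(cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
                for (w, h) in sizes]

    def select_face_region(centers, score_fn):
        """Score every candidate region spawned by every candidate target
        position and keep the best one as the face region; score_fn stands in
        for a learned classifier the disclosure does not spell out."""
        boxes = [box for c in centers for box in candidate_regions(c)]
        scores = [score_fn(box) for box in boxes]
        return boxes[int(np.argmax(scores))]

For example, select_face_region([(40, 40), (120, 80)], score_fn=lambda b: -abs((b[2] - b[0]) - 40)) would pick, among the six candidates, a 32x32 box, whose width is closest to 40.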
CN201910562932.8A 2019-06-26 2019-06-26 Identification method, device and storage medium Active CN110443124B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910562932.8A CN110443124B (en) 2019-06-26 2019-06-26 Identification method, device and storage medium

Publications (2)

Publication Number Publication Date
CN110443124A CN110443124A (en) 2019-11-12
CN110443124B true CN110443124B (en) 2021-11-16

Family

ID=68428935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910562932.8A Active CN110443124B (en) 2019-06-26 2019-06-26 Identification method, device and storage medium

Country Status (1)

Country Link
CN (1) CN110443124B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN208969684U (en) * 2018-09-30 2019-06-11 东北林业大学 Lip reading identifying system under complex environment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101101752A (en) * 2007-07-19 2008-01-09 华中科技大学 Monosyllabic language lip-reading recognition system based on vision character
CN104301669A (en) * 2014-09-12 2015-01-21 重庆大学 Suspicious target detection tracking and recognition method based on dual-camera cooperation
CN106295501A (en) * 2016-07-22 2017-01-04 中国科学院自动化研究所 The degree of depth based on lip movement study personal identification method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Lip Reading Based Surveillance System; Hanan A. Mahmoud et al.; IEEE; 2010-05-23; pp. 1-4 *
Intelligentization of Campus Management Systems (校园管理系统的智能化); Zeng Lingzhuo (曾令卓); Management Innovation (《管理创新》); 2018-08-31; pp. 309-310 *

Also Published As

Publication number Publication date
CN110443124A (en) 2019-11-12

Similar Documents

Publication Publication Date Title
US10706285B2 (en) Automatic ship tracking method and system based on deep learning network and mean shift
CN108256404B (en) Pedestrian detection method and device
WO2021139324A1 (en) Image recognition method and apparatus, computer-readable storage medium and electronic device
US20220092881A1 (en) Method and apparatus for behavior analysis, electronic apparatus, storage medium, and computer program
CN108009466B (en) Pedestrian detection method and device
CN106372572A (en) Monitoring method and apparatus
WO2021135138A1 (en) Target motion trajectory construction method and device, and computer storage medium
Gao et al. Particle filter-based prediction for anomaly detection in automatic surveillance
CN115861915A (en) Fire fighting access monitoring method, fire fighting access monitoring device and storage medium
CN115294519A (en) Abnormal event detection and early warning method based on lightweight network
CN113469080B (en) Method, system and equipment for collaborative perception of individual, group and scene interaction
CN116416281A (en) Grain depot AI video supervision and analysis method and system
Xue et al. Real-time anomaly detection and feature analysis based on time series for surveillance video
CN109583396A (en) A kind of region prevention method, system and terminal based on CNN two stages human testing
Chauhan et al. Study of moving object detection and tracking for video surveillance
CN110443124B (en) Identification method, device and storage medium
Nauman et al. Identification of Anomalous Behavioral Patterns in Crowd Scenes.
US20230360402A1 (en) Video-based public safety incident prediction system and method therefor
Mantini et al. Camera Tampering Detection using Generative Reference Model and Deep Learned Features.
CN115719428A (en) Face image clustering method, device, equipment and medium based on classification model
Chen A video surveillance system designed to detect multiple falls
CN114821978A (en) Method, device and medium for eliminating false alarm
CN114898140A (en) Behavior detection method and device based on PAA algorithm and readable medium
CN111277745B (en) Target person tracking method and device, electronic equipment and readable storage medium
CN113342978A (en) City event processing method and device

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant