CN107977634A - Expression recognition method, apparatus, and device for video - Google Patents

Expression recognition method, apparatus, and device for video

Info

Publication number
CN107977634A
CN107977634A
Authority
CN
China
Prior art keywords
sequence
pictures
video
face
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711274570.XA
Other languages
Chinese (zh)
Inventor
许靳昌
董远
白洪亮
熊风烨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Faceall Co
Original Assignee
Beijing Faceall Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Faceall Co
Priority to CN201711274570.XA
Publication of CN107977634A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 - Facial expression recognition
    • G06V40/175 - Static expression
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

Embodiments of this specification disclose an expression recognition method, apparatus, and device for video. For any video containing a face, a corresponding picture sequence is extracted; the face pictures it contains are extracted by face detection and aligned to serve as input. A pre-trained 3D convolutional neural network then performs feature extraction that captures both spatial and temporal characteristics and generates a feature vector, fusing predictions over the space-time domain of the picture sequence, so that more accurate expression recognition can be achieved from the feature vector.

Description

Expression recognition method, apparatus, and device for video
Technical field
This specification relates to the field of computer technology, and in particular to an expression recognition method, apparatus, and device for video.
Background art
As technology develops, more and more everyday scenarios require expression recognition on video.
In the prior art, expression recognition is mostly performed by extracting various features of the face in a video (such as geometric features or statistical features) and then locating or measuring them. This discards much identification and classification information, and the accuracy of the recognition results is relatively low.
On this basis, a more accurate expression recognition scheme for video is needed.
Summary of the invention
Embodiments of this specification provide an expression recognition method, apparatus, and device for video, to solve the following problem: providing a more accurate expression recognition scheme for video.
On this basis, an embodiment of this specification provides an expression recognition method for video, including:
Obtaining a picture sequence of a specified frame count contained in a video to be identified, the picture sequence containing a face in the video to be identified;
Generating, from the picture sequence, a feature vector characterizing the video to be identified using a 3D convolutional neural network obtained by prior training, wherein a convolution kernel parameter of the 3D convolutional neural network includes the specified frame count;
Determining an expression category of the face in the video to be identified according to the feature vector, the expression category including an angry, happy, sad, surprised, disgusted, fearful, or neutral expression.
Meanwhile the embodiment of this specification also provides a kind of expression recognition apparatus for video, including:
Acquisition module, obtains the sequence of pictures that frame number is specified included in video to be identified, and the sequence of pictures includes Face in video to be identified;
Generation module, according to the sequence of pictures, the 3D convolutional neural networks generation characterization institute obtained using advance training The feature vector of video to be identified is stated, wherein, the convolution nuclear parameter of the 3D convolutional neural networks includes the specified frame number;
Sort module, the expression classification of the face in the video to be identified, the table are determined according to described eigenvector Feelings classification includes angry, glad, sad, surprised, detest, frightened or nature expression.
Correspondingly, an embodiment of this specification also provides an expression recognition device for video, including:
A memory storing an expression recognition program for video;
A processor that calls the expression recognition program for video stored in the memory and performs:
Obtaining a picture sequence of a specified frame count contained in a video to be identified, the picture sequence containing a face in the video to be identified;
Generating, from the picture sequence, a feature vector characterizing the video to be identified using a 3D convolutional neural network obtained by prior training, wherein a convolution kernel parameter of the 3D convolutional neural network includes the specified frame count;
Determining an expression category of the face in the video to be identified according to the feature vector, the expression category including an angry, happy, sad, surprised, disgusted, fearful, or neutral expression.
Correspondingly, an embodiment of this specification also provides a non-volatile computer storage medium storing computer-executable instructions, the computer-executable instructions being configured to:
Obtain a picture sequence of a specified frame count contained in a video to be identified, the picture sequence containing a face in the video to be identified;
Generate, from the picture sequence, a feature vector characterizing the video to be identified using a 3D convolutional neural network obtained by prior training, wherein a convolution kernel parameter of the 3D convolutional neural network includes the specified frame count;
Determine an expression category of the face in the video to be identified according to the feature vector, the expression category including an angry, happy, sad, surprised, disgusted, fearful, or neutral expression.
At least one of the technical solutions above adopted by the embodiments of this specification can achieve the following beneficial effects:
For any video containing a face, a corresponding picture sequence is extracted; the face pictures it contains are extracted by face detection and aligned to serve as input. A pre-trained 3D convolutional neural network then performs feature extraction that captures both spatial and temporal characteristics and generates a feature vector, fusing predictions over the space-time domain of the picture sequence, so that more accurate expression recognition can be achieved from the feature vector. In addition, the feature vector produced by the 3D convolutional neural network can be reduced in dimension to generate a low-dimensional vector, which is then classified by a support vector machine (Support Vector Machine, SVM) to recognize the expression of the face in the video. The dimensionality-reduction operation removes redundant features, making classification more accurate, and also reduces the amount of computation, improving recognition speed.
Brief description of the drawings
Fig. 1 is a schematic diagram of the expression recognition process for video provided by an embodiment of this specification;
Fig. 2 is a schematic diagram of the 3D convolutional neural network convolution process provided by an embodiment of this specification;
Fig. 3 is a schematic diagram of the face key points provided by an embodiment of this specification;
Fig. 4 is an exemplary logic diagram in a practical application provided by an embodiment of this specification;
Fig. 5 is a schematic diagram of a specific embodiment provided by an embodiment of this specification.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of this application clearer, the technical solutions are described clearly and completely below in combination with specific embodiments of the application and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the application. Based on the embodiments in this specification, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.
For human expressions, an expression usually lasts for some time from its appearance to its disappearance, during which the facial muscles change continuously. Reflected in video, the face exhibits different features across the pictures of a consecutive run of frames; that is, it has temporal as well as spatial characteristics. Current expression recognition technology mainly locates and measures prominent positional changes of facial features, determining attributes such as size, distance, shape, and mutual ratio to perform expression recognition. This loses much identification and classification information, lacks correlation between data, can hardly reflect the essence of expression change, and is more susceptible to interference from external factors.
On this basis, embodiments of this specification propose an expression recognition method for video, which extracts the feature vector of a video to be identified using a 3D convolutional neural network obtained by prior training, obtaining more classification information and thereby achieving more accurate expression recognition.
The expression recognition process for video provided by the embodiments of this specification is described in detail below. The process specifically comprises the following steps, as shown in Fig. 1, a schematic diagram of the expression recognition process for video provided by an embodiment of this specification:
S101: obtain a picture sequence of a specified frame count contained in a video to be identified, the picture sequence containing a face in the video to be identified.
The video to be identified may be a pre-recorded video of a facial expression, or a video containing a face shot in real time by a camera (for example, a surveillance video). The specified frame count is usually determined manually, for example 128 frames; the specific value can be chosen according to actual needs (such as the frame rate of the video, the duration of the video, and so on).
A video consists of consecutive frames of images, and in practice the number of pictures contained in a video segment may far exceed the specified frame count. Therefore, the picture sequence of the specified frame count is the specified number of pictures arranged in chronological order within the video, so that the picture sequence can characterize the video to be identified and serve as the input data. In other words, the video is converted into a series of chronological pictures, from which the specified number of pictures is taken in order as input.
S103: generate, from the picture sequence, a feature vector characterizing the video to be identified using a 3D convolutional neural network obtained by prior training, wherein a convolution kernel parameter of the 3D convolutional neural network includes the specified frame count.
A simple way to use a neural network on video is to recognize each frame with the network individually, but this method does not take the motion information between consecutive frames into account. To effectively integrate facial expression information, 3D convolution is used: the convolutional layers in the neural network perform 3D convolution, so as to capture features that are discriminative in both the temporal and the spatial dimensions.
Specifically, 3D convolution applies a 3D convolution kernel to the picture sequence of the specified frame count. In this structure, each feature map in a convolutional layer is connected to multiple adjacent consecutive frames in the previous layer, so the captured information carries space-time attributes. How many consecutive frames are convolved together is specified by setting the relevant parameter in the 3D convolution kernel; in the embodiments of this specification, that parameter is the specified frame count. As shown in Fig. 2, a schematic diagram of the 3D convolutional neural network convolution process provided by an embodiment of this specification, the value at a given position of a convolution map is obtained by convolving the same position of a specified number of consecutive frames of the previous layer (for example, 3 frames here). The schematic can be understood as convolving 3 adjacent frame images and adding the convolution results. In other words, through the convolution across these 3 frames, the 3D convolutional neural network extracts a certain correlation in the time domain of the 3 pictures.
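As an illustration of the convolution just described, the following is a minimal sketch (assuming PyTorch; the tensor sizes are illustrative, not values from this specification) of a 3D convolution whose kernel spans 3 consecutive frames, so that each output value mixes spatial and temporal information:

    import torch
    import torch.nn as nn

    # Input layout: (batch, channels, frames, height, width)
    clip = torch.randn(1, 3, 16, 112, 112)  # 16 consecutive RGB frames

    # kernel_size=(3, 3, 3): 3 frames in time, a 3x3 window in space
    conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                       kernel_size=(3, 3, 3), padding=(1, 1, 1))

    features = conv3d(clip)
    print(features.shape)  # torch.Size([1, 64, 16, 112, 112])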
By convolving the same positions of images across consecutive frames, the space-time features in the image sequence are extracted simultaneously, yielding a feature vector that characterizes the extracted picture sequence (and hence the video to be identified). How many pooling layers and convolutional layers to use, and the parameters of the output feature vector, can be determined according to actual requirements.
The 3D convolution kernel contains multiple convolution parameters. Since the training of the 3D convolutional neural network uses training picture sequences containing the specified frame count as training samples, the convolution kernels in the trained network also clearly contain the specified frame count as a time parameter. Note that a 3D convolutional neural network may contain multiple convolution kernels, and not every kernel contains the specified frame count as a time parameter; some kernels convolve spatial features only and contain no time parameter. However, at least one convolution kernel should contain the specified frame count as a time parameter.
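The remark about time parameters can be made concrete with a short sketch (again assuming PyTorch): a kernel whose temporal depth is 1 convolves spatial features only, while a kernel whose temporal depth is greater than 1 carries a time parameter and mixes adjacent frames:

    import torch.nn as nn

    # kernel_size is (frames, height, width)
    spatial_only = nn.Conv3d(3, 16, kernel_size=(1, 3, 3))     # no time parameter
    spatio_temporal = nn.Conv3d(3, 16, kernel_size=(3, 3, 3))  # convolves 3 frames at a time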
S105: determine an expression category of the face in the video to be identified according to the feature vector, the expression category including an angry, happy, sad, surprised, disgusted, fearful, or neutral expression.
The feature vector obtained by the 3D convolutional neural network is usually a feature vector of higher dimensionality that itself contains many features. It can be classified using the trained 3D convolutional neural network itself, or recognized using a pre-trained support vector machine; the recognition result for the feature vector depends on the trained neural network model or the trained support vector machine, whose specific training methods are described in detail later.
For any video, a corresponding picture sequence is extracted; the face pictures it contains are extracted by face detection and aligned. A pre-trained 3D convolutional neural network then performs feature extraction that captures both spatial and temporal characteristics, fusing predictions over the space-time domain of the picture sequence, so that more accurate expression recognition is achieved from the feature vector. In addition, the feature vector obtained from the 3D convolutional neural network can be reduced in dimension to generate a low-dimensional vector, which is classified by a support vector machine (Support Vector Machine, SVM) to recognize the expression of the face in the video. The dimensionality-reduction operation removes redundant features, making classification more accurate, and also reduces the amount of computation, improving recognition speed.
As a specific embodiment, step S101, obtaining a picture sequence of a specified frame count contained in a video to be identified, can be implemented in many ways; two are set forth below:
First, intercept pictures from the video to be identified frame by frame to obtain the picture sequence of the specified frame count. That is, in the obtained picture sequence, any two adjacent pictures are consecutive.
Second, intercept pictures from the video to be identified to obtain a picture sequence, and select the picture sequence of the specified frame count from it in order. That is, obtain a picture sequence frame by frame or not frame by frame (for example, intercepting one picture every two frames), and then select the specified frame count of pictures from the sequence in order.
Since the duration of an expression can reach several seconds to more than ten seconds, any picture sequence produced in order can reflect the facial muscle changes of the face at different time points while the expression is being produced. In other words, any sampled picture sequence will do as input; the pictures in it need not be as continuous as the video.
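A minimal sketch of the two sampling modes above, assuming OpenCV; the function and parameter names are illustrative, with num_frames standing for the specified frame count (e.g., 128):

    import cv2

    def sample_frames(video_path, num_frames=128, stride=1):
        cap = cv2.VideoCapture(video_path)
        frames = []
        index = 0
        while len(frames) < num_frames:
            ok, frame = cap.read()
            if not ok:                      # ran out of video
                break
            if index % stride == 0:         # stride=1: frame by frame
                frames.append(frame)
            index += 1
        cap.release()
        return frames                       # chronological list of BGR images

    clip = sample_frames("input.mp4", stride=1)    # first embodiment: every frame
    clip2 = sample_frames("input.mp4", stride=3)   # second embodiment: every third frame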
For the video to be identified, some videos may have been recorded under fairly standard conditions (that is, already aligned), and the generated picture sequence can be used directly as input. More generally, in practice the intercepted picture sequence can also be processed as follows: detect the face contained in each frame picture of the acquired picture sequence; according to the detected face, perform an alignment operation on the face contained in each frame picture, and generate a picture sequence of the specified frame count containing aligned faces. That is, perform alignment through face detection technology, and use the aligned picture sequence as the input of the 3D convolutional neural network.
Under a specific embodiment, face alignment can use the following steps: determine key points on the detected face, the key points including at least one of the facial contour, eyes, eyebrows, lips, or nose contour; perform an affine transformation according to the key points to generate a picture sequence of the specified frame count containing aligned faces.
The key points are typically multiple points, and there are various key-point selection schemes, such as 68 points or 194 points, as shown in Fig. 3, a schematic diagram of the face key points provided by an embodiment of this specification. During alignment, the relation between two points (such as a point on the left eye and a point on the right eye) can be used to perform an affine transformation. An affine transformation is a linear transformation of one vector space followed by a translation, mapping it into another vector space; its mathematical expression is p' = Ap + b, where A is a linear transformation matrix and b is a translation vector. In the embodiments of this specification, this can be understood as rotating, zooming, and panning the face in the image according to the vector between face key points, thereby achieving face alignment.
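A minimal sketch of this alignment, assuming OpenCV and that the two eye key points have already been located by a detector; the canonical eye positions and output size are illustrative assumptions:

    import cv2
    import numpy as np

    def align_face(image, left_eye, right_eye, size=112):
        # Rotate so the eye line is horizontal, scale so the inter-eye
        # distance is fixed, then pan the eye midpoint to a canonical spot:
        # the rotation, zooming, and panning described above.
        (lx, ly), (rx, ry) = left_eye, right_eye
        angle = np.degrees(np.arctan2(ry - ly, rx - lx))
        scale = (0.30 * size) / np.hypot(rx - lx, ry - ly)  # assumed eye spacing
        center = ((lx + rx) / 2.0, (ly + ry) / 2.0)

        matrix = cv2.getRotationMatrix2D(center, angle, scale)
        matrix[0, 2] += 0.50 * size - center[0]   # translate the eye midpoint
        matrix[1, 2] += 0.40 * size - center[1]   # to an assumed canonical position
        return cv2.warpAffine(image, matrix, (size, size))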
In the embodiments of this specification, the 3D convolutional neural network for processing the picture sequence can be obtained by prior training as follows: obtain training picture sequences containing the specified frame count for different expression categories; use the specified frame count as a convolution kernel parameter of the 3D convolutional neural network, take the training picture sequences containing the specified frame count of the different expression categories as training samples, and train to obtain the 3D convolutional neural network; wherein the last layer of the trained 3D convolutional neural network is a fully connected layer used to generate the feature vector characterizing a picture sequence.
Specifically, videos containing different expression categories can be selected or recorded in advance, and a face picture sequence of the specified frame count (e.g., 128) is intercepted from each video as the characterization of that video segment (i.e., as the training sample of each expression category). During training, a picture sequence of the specified frame count is input as one training sample, and the 3D convolutional neural network is obtained by training on multiple groups of training samples.
Note that during model training, the last layer of the trained 3D convolutional neural network is a fully connected layer used to generate the feature vector characterizing the training-sample picture sequence, after which conventional backpropagation (Backpropagation, the BP algorithm) is used for training; in the trained 3D convolutional neural network, the output of the fully connected layer is the feature vector characterizing the video to be identified.
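A minimal training sketch matching this description, assuming PyTorch; the layer sizes, the toy data, and the seven-class output head are illustrative assumptions, and the last fully connected feature layer produces the vector that characterizes the picture sequence:

    import torch
    import torch.nn as nn

    class Expression3DCNN(nn.Module):
        def __init__(self, num_classes=7, feature_dim=1024):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool3d(2),
                nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d((4, 4, 4)),
            )
            self.fc = nn.Linear(64 * 4 * 4 * 4, feature_dim)  # feature-vector layer
            self.classifier = nn.Linear(feature_dim, num_classes)

        def forward(self, clip):                   # clip: (N, 3, frames, H, W)
            x = self.features(clip).flatten(1)
            feature_vector = self.fc(x)            # characterizes the picture sequence
            return self.classifier(feature_vector), feature_vector

    model = Expression3DCNN()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    # Toy stand-in for real training samples: 8 random 16-frame clips.
    clips = torch.randn(8, 3, 16, 64, 64)
    labels = torch.randint(0, 7, (8,))

    logits, _ = model(clips)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()                                # ordinary backpropagation (BP)
    optimizer.step()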
In practice, the neural network model can also be trained when the last layer is not a fully connected layer (for example, using global average pooling (global average pooling, GAP) instead of a fully connected layer to fuse the features obtained by convolution). In that case, however, the corresponding feature vector characterizing the picture sequence cannot be obtained.
Since a person's facial expression often lasts several seconds and exhibits different expression features at different time phases, the 3D convolutional neural network model needs to capture this temporal information. To this end, a larger frame count can be used as input to compute expression features; correspondingly, a training sample can also span a longer time (that is, have a higher specified frame count, such as 128 frames). In this case, the output feature vector is also a high-dimensional feature vector, which not only contains more redundant features but also reduces computation speed.
On this basis, model training can also proceed as follows: on the basis of the 3D convolutional neural network, reduce the dimension of the generated feature vector characterizing the training picture sequence to generate a low-dimensional vector, and use the low-dimensional vector as a training sample to train and generate a support vector machine.
There are various ways to reduce vector dimensionality; for example, principal component analysis (principal component analysis, PCA) can be used. Reducing the dimension of high-dimensional vectors can speed up model training as well as classification in practical applications. A support vector machine, as a common supervised learning model, can effectively perform classification training on the input low-dimensional vectors given labeled training samples (that is, low-dimensional vectors characterizing known expressions), and the trained support vector machine model can then be generalized to recognize unlabeled samples (that is, to recognize the low-dimensional vector characterizing the video to be identified).
Thus, in practice, the high-dimensional feature vector output by the 3D convolutional neural network can be classified in either of two ways: classify the feature vector with the trained 3D convolutional neural network to determine the expression category of the face in the video to be identified; or reduce the dimension of the feature vector to generate another low-dimensional vector characterizing the video to be identified, and classify the low-dimensional vector with the pre-trained support vector machine to determine the expression category of the face in the video to be identified. That is, classify directly with the 3D convolutional neural network, or classify with the support vector machine. Classifying the low-dimensional vector with a support vector machine removes the redundant features in the high-dimensional vector, improving classification accuracy; it also reduces the amount of computation, improving classification speed.
The method provided by the embodiments of this specification can be widely applied in various scenarios, such as recognizing students' facial expressions in distance education or recognizing a baby's facial expression in surveillance, so that corresponding judgments and further operations can be made based on the recognition of facial expressions.
An example of a practical application is given below to illustrate the foregoing scheme. For a video containing a baby's facial expression (such as a surveillance video of the baby), perform frame cutting, detect the face in each frame picture, align the detected face pictures according to the positions of the two eyes, and represent the video segment with a picture sequence of 128 aligned face pictures as the input of the 3D convolutional neural network (trained in advance through steps similar to the foregoing). The network yields a 1024-dimensional feature vector, which is reduced to 128 dimensions by PCA and used as the input of a linear support vector machine; the linear support vector machine finally recognizes the baby's current expression, so that a corresponding reminder or alarm can be given according to the recognition result. The logic is shown in Fig. 4, an exemplary logic diagram in a practical application provided by an embodiment of this specification.
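Tying the pieces together, the following end-to-end sketch follows the order of this example (and Fig. 4); sample_frames, align_face, model, pca, and svm are the illustrative objects from the earlier sketches, while detect_face_and_eyes, ALARM_LABELS, and raise_alert stand in for any face/key-point detector and notification hook and are assumptions:

    import numpy as np
    import torch

    def recognize_expression(video_path):
        faces = []
        for frame in sample_frames(video_path, num_frames=128):
            eyes = detect_face_and_eyes(frame)          # assumed detector
            if eyes is not None:
                left_eye, right_eye = eyes
                faces.append(align_face(frame, left_eye, right_eye))
        # Stack into a (1, 3, frames, H, W) clip tensor for the 3D CNN.
        clip = torch.from_numpy(np.stack(faces)).permute(3, 0, 1, 2).float()[None]
        with torch.no_grad():
            _, features = model(clip)                   # 1024-dim feature vector
        low_dim = pca.transform(features.numpy())       # down to 128 dims
        label = svm.predict(low_dim)[0]
        if label in ALARM_LABELS:                       # e.g. crying / distress
            raise_alert()                               # assumed notification hook
        return label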
Based on the same thinking, this specification also provides an expression recognition apparatus for video, as shown in Fig. 5, a structure diagram of an expression recognition apparatus for video provided by an embodiment of this specification, including:
An acquisition module 501, which obtains a picture sequence of a specified frame count contained in a video to be identified, the picture sequence containing a face in the video to be identified;
A generation module 503, which generates, from the picture sequence, a feature vector characterizing the video to be identified using a 3D convolutional neural network obtained by prior training, wherein a convolution kernel parameter of the 3D convolutional neural network includes the specified frame count;
A classification module 505, which determines an expression category of the face in the video to be identified according to the feature vector, the expression category including an angry, happy, sad, surprised, disgusted, fearful, or neutral expression.
Further, the acquisition module 501 intercepts pictures from the video to be identified frame by frame to obtain the picture sequence of the specified frame count; or intercepts pictures from the video to be identified to obtain a picture sequence and selects the picture sequence of the specified frame count from it in order.
Further, the acquisition module 501 detects the face contained in each frame picture of the picture sequence and, according to the detected face, performs an alignment operation on the face contained in each frame picture to generate a picture sequence of the specified frame count containing aligned faces.
Further, the acquisition module 501 determines key points on the detected face, the key points including at least one of the facial contour, eyes, eyebrows, lips, or nose contour, and performs an affine transformation according to the key points to generate a picture sequence of the specified frame count containing aligned faces.
Further, the apparatus also includes a neural network training module 507, which obtains training picture sequences containing the specified frame count for different expression categories; uses the specified frame count as a convolution kernel parameter of the 3D convolutional neural network; and takes the training picture sequences containing the specified frame count of the different expression categories as training samples to train and obtain the 3D convolutional neural network, wherein the last layer of the trained 3D convolutional neural network is a fully connected layer used to generate the feature vector characterizing a picture sequence.
Further, the apparatus also includes a support vector machine training module 509, which reduces the dimension of the generated feature vector characterizing the training picture sequence to generate a low-dimensional vector and uses the low-dimensional vector as a training sample to generate a support vector machine.
Further, the classification module 505 classifies the feature vector with the trained 3D convolutional neural network to determine the expression category of the face in the video to be identified; or reduces the dimension of the feature vector to generate another low-dimensional vector characterizing the video to be identified and classifies the low-dimensional vector with the pre-trained support vector machine to determine the expression category of the face in the video to be identified.
Correspondingly, an embodiment of this application also provides an expression recognition device for video, the device including:
A memory storing an expression recognition program for video;
A processor that calls the expression recognition program for video stored in the memory and performs:
Obtaining a picture sequence of a specified frame count contained in a video to be identified, the picture sequence containing a face in the video to be identified;
Generating, from the picture sequence, a feature vector characterizing the video to be identified using a 3D convolutional neural network obtained by prior training, wherein a convolution kernel parameter of the 3D convolutional neural network includes the specified frame count;
Determining an expression category of the face in the video to be identified according to the feature vector, the expression category including an angry, happy, sad, surprised, disgusted, fearful, or neutral expression.
Based on the same inventive thinking, an embodiment of this specification also provides a non-volatile computer storage medium storing computer-executable instructions, the computer-executable instructions being configured to:
Obtain a picture sequence of a specified frame count contained in a video to be identified, the picture sequence containing a face in the video to be identified;
Generate, from the picture sequence, a feature vector characterizing the video to be identified using a 3D convolutional neural network obtained by prior training, wherein a convolution kernel parameter of the 3D convolutional neural network includes the specified frame count;
Determine an expression category of the face in the video to be identified according to the feature vector, the expression category including an angry, happy, sad, surprised, disgusted, fearful, or neutral expression.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, refer to each other, and each embodiment focuses on its differences from the others. In particular, for the apparatus, device, and medium embodiments, since they are substantially similar to the method embodiment, the description is relatively simple; for relevant parts, refer to the explanation in the method embodiment, which is not repeated here one by one.
Specific embodiments of this specification are described above. Other embodiments are within the scope of the appended claims. In some cases, the actions, steps, or modules recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In the 1990s, an improvement of a technology could be clearly distinguished as an improvement in hardware (for example, an improvement of circuit structures such as diodes, transistors, and switches) or an improvement in software (an improvement of a method flow). With the development of technology, however, the improvement of many method flows today can be regarded as a direct improvement of a hardware circuit structure. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that the improvement of a method flow cannot be realized with hardware entity modules. For example, a programmable logic device (Programmable Logic Device, PLD) (such as a field programmable gate array (Field Programmable Gate Array, FPGA)) is such an integrated circuit whose logic function is determined by the user's programming of the device. Designers program by themselves to "integrate" a digital system onto a piece of PLD, without asking a chip manufacturer to design and make a dedicated integrated circuit chip. Moreover, nowadays, instead of manually making integrated circuit chips, this programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development, and the source code before compilation must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); at present, VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used. Those skilled in the art should also understand that the hardware circuit realizing a logical method flow can easily be obtained merely by slightly programming the method flow in logic with the above hardware description languages and programming it into an integrated circuit.
A controller can be implemented in any suitable manner. For example, a controller can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicone Labs C8051F320; a memory controller can also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides realizing a controller in pure computer-readable program code, it is entirely possible to program the method steps in logic so that the controller realizes the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the devices included in it for realizing various functions can also be regarded as structures within the hardware component. Or even, the devices for realizing various functions can be regarded as both software modules implementing the method and structures within the hardware component.
The systems, apparatuses, modules, or units illustrated in the above embodiments can be specifically realized by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, a computer can be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described in various units divided by function. Of course, when implementing the embodiments of this specification, the functions of the units can be realized in one or more pieces of software and/or hardware.
Those skilled in the art should understand that embodiments of the present invention can be provided as a method, a system, or a computer program product. Therefore, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be realized by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured article including an instruction apparatus that realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
The memory may include volatile memory, random access memory (RAM), and/or non-volatile memory among computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can realize information storage by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise" and "include", and any other variants thereof, are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. In the absence of more restrictions, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity, or device that includes the element.
Those skilled in the art should understand that one or more embodiments of this specification can be provided as a method, a system, or a computer program product. Therefore, the embodiments of this specification can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of this specification can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The embodiments of this specification can be described in the general context of computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or realize particular abstract data types. The embodiments of this specification can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in local and remote computer storage media including storage devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, refer to each other, and each embodiment focuses on its differences from the others. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is relatively simple; for relevant parts, refer to the explanation in the method embodiment.
The foregoing is merely embodiments of this specification and is not intended to limit this application. For those skilled in the art, the embodiments of this specification can have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the embodiments of this specification shall be included within the scope of the claims of this application.

Claims (15)

1. An expression recognition method for video, including:
Obtaining a picture sequence of a specified frame count contained in a video to be identified, the picture sequence containing a face in the video to be identified;
Generating, from the picture sequence, a feature vector characterizing the video to be identified using a 3D convolutional neural network obtained by prior training, wherein a convolution kernel parameter of the 3D convolutional neural network includes the specified frame count;
Determining an expression category of the face in the video to be identified according to the feature vector, the expression category including an angry, happy, sad, surprised, disgusted, fearful, or neutral expression.
2. The method of claim 1, wherein obtaining the picture sequence of the specified frame count contained in the video to be identified includes:
Intercepting pictures from the video to be identified frame by frame to obtain the picture sequence of the specified frame count; or,
Intercepting pictures from the video to be identified to obtain a picture sequence, and selecting the picture sequence of the specified frame count from the picture sequence in order.
3. The method of claim 2, wherein obtaining the picture sequence of the specified frame count contained in the video to be identified, the picture sequence containing a face in the video to be identified, includes:
Detecting the face contained in each frame picture of the acquired picture sequence;
According to the detected face, performing an alignment operation on the face contained in each frame picture, and generating a picture sequence of the specified frame count containing aligned faces.
4. The method of claim 3, wherein, according to the detected face, performing an alignment operation on the face contained in each frame picture and generating a picture sequence of the specified frame count containing aligned faces includes:
Determining key points on the detected face, the key points including at least one of the facial contour, eyes, eyebrows, lips, or nose contour;
Performing an affine transformation according to the key points to generate a picture sequence of the specified frame count containing aligned faces.
5. The method of claim 1, wherein the 3D convolutional neural network obtained by prior training is trained by:
Obtaining training picture sequences containing the specified frame count for different expression categories;
Using the specified frame count as a convolution kernel parameter of the 3D convolutional neural network, taking the training picture sequences containing the specified frame count of the different expression categories as training samples, and training to obtain the 3D convolutional neural network;
Wherein the last layer of the trained 3D convolutional neural network is a fully connected layer used to generate the feature vector characterizing a picture sequence.
6. The method of claim 5, further including:
Reducing the dimension of the generated feature vector characterizing the training picture sequence to generate a low-dimensional vector;
Using the low-dimensional vector as a training sample, and training to generate a support vector machine.
7. The method of claim 6, wherein determining the expression category of the face in the video to be identified according to the feature vector includes:
Classifying the feature vector with the trained 3D convolutional neural network to determine the expression category of the face in the video to be identified; or,
Reducing the dimension of the feature vector to generate another low-dimensional vector characterizing the video to be identified, and classifying the low-dimensional vector with the pre-trained support vector machine to determine the expression category of the face in the video to be identified.
8. An expression recognition apparatus for video, including:
An acquisition module, which obtains a picture sequence of a specified frame count contained in a video to be identified, the picture sequence containing a face in the video to be identified;
A generation module, which generates, from the picture sequence, a feature vector characterizing the video to be identified using a 3D convolutional neural network obtained by prior training, wherein a convolution kernel parameter of the 3D convolutional neural network includes the specified frame count;
A classification module, which determines an expression category of the face in the video to be identified according to the feature vector, the expression category including an angry, happy, sad, surprised, disgusted, fearful, or neutral expression.
9. The apparatus of claim 8, wherein the acquisition module intercepts pictures from the video to be identified frame by frame to obtain the picture sequence of the specified frame count; or intercepts pictures from the video to be identified to obtain a picture sequence and selects the picture sequence of the specified frame count from the picture sequence in order.
10. The apparatus of claim 8, wherein the acquisition module detects the face contained in each frame picture of the picture sequence and, according to the detected face, performs an alignment operation on the face contained in each frame picture to generate a picture sequence of the specified frame count containing aligned faces.
11. The apparatus of claim 10, wherein the acquisition module determines key points on the detected face, the key points including at least one of the facial contour, eyes, eyebrows, lips, or nose contour, and performs an affine transformation according to the key points to generate a picture sequence of the specified frame count containing aligned faces.
12. The apparatus of claim 8, further including a neural network training module, which obtains training picture sequences containing the specified frame count for different expression categories; uses the specified frame count as a convolution kernel parameter of the 3D convolutional neural network; and takes the training picture sequences containing the specified frame count of the different expression categories as training samples to train and obtain the 3D convolutional neural network, wherein the last layer of the trained 3D convolutional neural network is a fully connected layer used to generate the feature vector characterizing a picture sequence.
13. The apparatus of claim 12, further including a support vector machine training module, which reduces the dimension of the generated feature vector characterizing the training picture sequence to generate a low-dimensional vector and uses the low-dimensional vector as a training sample to generate a support vector machine.
14. The apparatus of claim 13, wherein the classification module classifies the feature vector with the trained 3D convolutional neural network to determine the expression category of the face in the video to be identified; or reduces the dimension of the feature vector to generate another low-dimensional vector characterizing the video to be identified and classifies the low-dimensional vector with the pre-trained support vector machine to determine the expression category of the face in the video to be identified.
15. An expression recognition device for video, including:
A memory storing an expression recognition program for video;
A processor that calls the expression recognition program for video stored in the memory and performs:
Obtaining a picture sequence of a specified frame count contained in a video to be identified, the picture sequence containing a face in the video to be identified;
Generating, from the picture sequence, a feature vector characterizing the video to be identified using a 3D convolutional neural network obtained by prior training, wherein a convolution kernel parameter of the 3D convolutional neural network includes the specified frame count;
Determining an expression category of the face in the video to be identified according to the feature vector, the expression category including an angry, happy, sad, surprised, disgusted, fearful, or neutral expression.
CN201711274570.XA 2017-12-06 2017-12-06 Expression recognition method, apparatus, and device for video Pending CN107977634A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711274570.XA CN107977634A (en) 2017-12-06 2017-12-06 A kind of expression recognition method, device and equipment for video

Publications (1)

Publication Number Publication Date
CN107977634A true CN107977634A (en) 2018-05-01

Family

ID=62009203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711274570.XA Pending CN107977634A (en) 2017-12-06 2017-12-06 A kind of expression recognition method, device and equipment for video

Country Status (1)

Country Link
CN (1) CN107977634A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106778444A (en) * 2015-11-23 2017-05-31 Guangzhou Huajiu Information Technology Co., Ltd. Expression recognition method based on multi-view convolutional neural networks
CN106570474A (en) * 2016-10-27 2017-04-19 Nanjing University of Posts and Telecommunications Micro-expression recognition method based on 3D convolutional neural network
CN106919903A (en) * 2017-01-19 2017-07-04 Institute of Software, Chinese Academy of Sciences Robust continuous emotion tracking method based on deep learning
CN107330393A (en) * 2017-06-27 2017-11-07 Nanjing University of Posts and Telecommunications Neonatal pain expression recognition method based on video analysis
CN107292289A (en) * 2017-07-17 2017-10-24 Northeastern University Facial expression recognition method based on video time sequence

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710858A (en) * 2018-05-22 2018-10-26 China Jiliang University Urban happiness index dynamic thermodynamic diagram generation method based on expression recognition
CN108710858B (en) * 2018-05-22 2021-07-06 China Jiliang University Urban happiness index dynamic thermodynamic diagram generation method based on expression recognition
WO2019228317A1 (en) * 2018-05-28 2019-12-05 Huawei Technologies Co., Ltd. Face recognition method and device, and computer readable medium
CN109165573A (en) * 2018-08-03 2019-01-08 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for extracting video feature vectors
CN109190564A (en) * 2018-09-05 2019-01-11 Xiamen Jiwei Technology Co., Ltd. Image analysis method and apparatus, computer storage medium, and terminal
CN109635680B (en) * 2018-11-26 2021-07-06 Shenzhen Intellifusion Technologies Co., Ltd. Multitask attribute identification method and device, electronic equipment and storage medium
CN109635680A (en) * 2018-11-26 2019-04-16 Shenzhen Intellifusion Technologies Co., Ltd. Multitask attribute recognition method, device, electronic device and storage medium
WO2020125217A1 (en) * 2018-12-18 2020-06-25 Shenzhen Intellifusion Technologies Co., Ltd. Expression recognition method and apparatus and recommendation method and apparatus
WO2020151443A1 (en) * 2019-01-23 2020-07-30 Guangzhou Shiyuan Electronics Co., Ltd. Video image transmission method, device, interactive intelligent tablet and storage medium
CN110032926A (en) * 2019-02-22 2019-07-19 Harbin Institute of Technology (Shenzhen) Video classification method and device based on deep learning
CN110706449A (en) * 2019-09-04 2020-01-17 China Mobile (Hangzhou) Information Technology Co., Ltd. Infant monitoring method and device, camera equipment and storage medium
CN110807394A (en) * 2019-10-23 2020-02-18 Shanghai Nengta Intelligent Technology Co., Ltd. Emotion recognition method, test driving experience evaluation method, device, equipment and medium
CN110866489A (en) * 2019-11-07 2020-03-06 Tencent Technology (Shenzhen) Co., Ltd. Image recognition method, device, equipment and storage medium
CN110866489B (en) * 2019-11-07 2020-12-04 Tencent Technology (Shenzhen) Co., Ltd. Image recognition method, device, equipment and storage medium
CN111447190A (en) * 2020-03-20 2020-07-24 Beijing Guancheng Technology Co., Ltd. Encrypted malicious traffic identification method, equipment and device
CN112560679A (en) * 2020-12-15 2021-03-26 Beijing Baidu Netcom Science and Technology Co., Ltd. Expression recognition method, device, equipment and computer storage medium
CN114092649A (en) * 2021-11-25 2022-02-25 Mashang Consumer Finance Co., Ltd. Picture generation method and device based on neural network
CN114092649B (en) * 2021-11-25 2022-10-18 Mashang Consumer Finance Co., Ltd. Picture generation method and device based on neural network

Similar Documents

Publication Publication Date Title
CN107977634A (en) A kind of expression recognition method, device and equipment for video
CN109658455A (en) Image processing method and processing equipment
Liao et al. Deep facial spatiotemporal network for engagement prediction in online learning
Hoang Ngan Le et al. Robust hand detection and classification in vehicles and in the wild
BR112020008021A2 Computing devices, method for generating a CNN trained to process images, and methods for processing an image
Wu et al. Realistic human action recognition with multimodal feature selection and fusion
CN108320296A (en) Method, apparatus and device for detecting and tracking a target object in a video
CN108958610A (en) Face-based special effect generation method, device and electronic device
Deng et al. MIMAMO Net: Integrating micro- and macro-motion for video emotion recognition
CN108229303A (en) Detection and recognition method, training method for the detection and recognition network, and corresponding apparatus, device and medium
Le et al. Robust hand detection in vehicles
Sun et al. Exploring multimodal visual features for continuous affect recognition
CN109598234A (en) Key point detection method and apparatus
CN107957989A (en) Word vector processing method, device and equipment based on clustering
CN108961174A (en) Image inpainting method, device and electronic device
CN111738243A (en) Method, device and equipment for selecting face image and storage medium
CN110139169A (en) Quality evaluation method and device for video streams, and video capture system
Dai et al. TAN: Temporal aggregation network for dense multi-label action recognition
CN111291713B (en) Gesture recognition method and system based on skeleton
Wang et al. Action recognition using edge trajectories and motion acceleration descriptor
Lu et al. Aesthetic guided deep regression network for image cropping
CN108921190A (en) Image classification method, device and electronic device
CN111126358A (en) Face detection method, face detection device, storage medium and equipment
Liang et al. CrossNet: Cross-scene background subtraction network via 3D optical flow
CN108170663A (en) Word vector processing method, device and equipment based on clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20180501