CN110414344A - A video-based person classification method, intelligent terminal and storage medium - Google Patents
- Publication number
- CN110414344A (application CN201910553048.8A)
- Authority
- CN
- China
- Prior art keywords
- target person
- image block
- classification
- video
- video frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Multimedia (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The present invention provides a video-based person classification method, an intelligent terminal and a storage medium. The method comprises: obtaining a video frame image to be detected, and extracting the image block containing a target person from the video frame image; inputting the image block into a classification network model to obtain a preliminary classification result and an attention weight for the target person in the image block; and obtaining a final classification result for the target person from the preliminary classification result and the attention weight, and classifying the target persons contained in the video frame image according to the final classification result. The method extracts the image block of each target person to be detected with a region extraction module and classifies the target person with a classification network model, and combines the attention weights learned by the network with the initial prediction, increasing the contribution of distinctive parts to the final classification result, so that the classification of persons in video is more accurate.
Description
Technical field
The present invention relates to the technical field of image recognition, and more particularly to a video-based person classification method, intelligent terminal and storage medium.
Background art
In recent years, with the development of the internet and the entertainment industry, the number of videos has grown rapidly, and the demand for content-based video understanding and retrieval keeps rising. Within video understanding, person detection is an important research topic.
Because of differences in camera angle, complex illumination conditions, changes in facial expression and occlusion, detecting persons in video is highly challenging. Related techniques include object detection and person re-identification. Object detection takes an image and predicts the coordinates and class of the objects or persons of the classes to be detected in the image, while person re-identification aims to classify and retrieve the persons in an image. Although both methods achieve good results in their respective fields, in person detection for video the similarity between persons is high, so object detection often misclassifies them, resulting in low person classification accuracy.
Therefore, the prior art still needs improvement.
Summary of the invention
In view of the above shortcomings of the prior art, the purpose of the present invention is to provide a video-based person classification method, intelligent terminal and storage medium that overcome the low classification accuracy caused by the high similarity between persons in existing video person detection.
A first embodiment disclosed by the invention is a video-based person classification method, comprising the following steps:
obtaining a video frame image to be detected, and extracting the image block containing a target person from the video frame image;
inputting the image block into a classification network model to obtain a preliminary classification result and an attention weight for the target person in the image block, the classification network model being trained on the correspondence between image blocks of target persons and the preliminary classification results and attention weights of the target persons in those image blocks;
obtaining a final classification result for the target person from the preliminary classification result and the attention weight of the target person in the image block, and classifying the target persons contained in the video frame image according to the final classification result.
In the video-based person classification method, the step of extracting the image block containing the target person from the video frame image specifically comprises:
inputting the video frame image into a region extraction network model and extracting the image block containing the target person from the video frame image, the region extraction network model being trained on the correspondence between input video frame images and the image blocks of the target persons in those images.
In the video-based person classification method, the classification network model comprises a first convolutional layer, a pooling layer and a second convolutional layer containing multiple sub-convolutional layers;
the step of inputting the image block into the classification network model and obtaining the preliminary classification result and attention weight of the target person in the image block specifically comprises:
inputting the image block into the first convolutional layer and extracting the feature map of the image block;
inputting the feature map into the pooling layer to obtain multiple feature vectors of the feature map;
inputting each feature vector into each sub-convolutional layer separately to obtain the preliminary classification result and attention weight of the target person in the image block.
In the video-based person classification method, the second convolutional layer comprises a first sub-convolutional layer, a second sub-convolutional layer, a classifier and a regression network;
the step of inputting each feature vector into each sub-convolutional layer separately and obtaining the preliminary classification result and attention weight corresponding to each feature vector specifically comprises:
inputting each feature vector in turn into the first sub-convolutional layer and the second sub-convolutional layer, and outputting the first dimensional feature and the second dimensional feature corresponding to each feature vector;
inputting the first dimensional feature into the classifier to obtain the preliminary classification result of the target person in the image block;
inputting the second dimensional feature into the regression network to obtain the attention weight of the target person in the image block.
In the video-based person classification method, the step of obtaining the final classification result of the target person from the preliminary classification result and attention weight of the target person in the image block, and classifying the target persons contained in the video frame image according to the final classification result, specifically comprises:
multiplying the preliminary classification result of the target person in the image block by the attention weight to obtain the final classification result of the target person;
choosing the class with the maximum final classification value of the target person as the classification label of the target person contained in the video frame image.
In the video-based person classification method, the region extraction network model comprises a first extraction layer and a second extraction layer;
the step of inputting the video frame image into the region extraction network model and extracting the image block containing the target person from the video frame image specifically comprises:
inputting the video frame image into the first extraction layer to obtain the feature map corresponding to the detection boxes containing target persons;
inputting the feature map corresponding to the detection boxes containing target persons into the second extraction layer, and extracting the image block containing the target person from the video frame image.
In the video-based person classification method, before the step of inputting the video frame image into the region extraction network model and extracting the image block containing the target person from the video frame image, the method further comprises:
obtaining a training image set containing target persons, and annotating the true classes and true coordinates of the target persons in the training image set;
inputting the training image set into the region extraction network model, and obtaining the classes and coordinates of the target persons predicted by the network through a forward propagation algorithm;
comparing, through a loss function, the annotated true classes and true coordinates of the target persons with the classes and coordinates predicted by the network to obtain the prediction error;
training the region extraction network model with the prediction error through a back-propagation algorithm.
In the video-based person classification method, the loss function is:
L = (1/N_arm) Σ_i L_b(p_i, p_i*) + (1/N_odm) Σ_i [p_i* ≥ 1] L_r(x_i, x_i*)
where i is the index of a detection box during training, p_i* is the true class of the target person in the i-th detection box, x_i* is the true coordinate of the target person in the i-th detection box, p_i is the network-predicted class of the target person in the i-th detection box, x_i is the network-predicted coordinate of the target person in the i-th detection box, N_arm and N_odm are the total numbers of detection boxes containing persons to be detected in the two layers of the region extraction network model, L_b is a cross-entropy loss function, and L_r is a regression loss function.
An intelligent terminal, comprising a processor and a storage medium communicatively connected with the processor, the storage medium being suitable for storing a plurality of instructions, and the processor being suitable for calling the instructions in the storage medium to execute the steps of any of the video-based person classification methods described above.
A storage medium, on which a control program of the video-based person classification method is stored, the control program implementing the steps of any of the video-based person classification methods described above when executed by a processor.
Beneficial effects: the present invention provides a video-based person classification method, intelligent terminal and storage medium. The image block of each target person to be detected is extracted by a region extraction module, and the features of the image block are extracted and the target person classified by a classification and detection module, so that the position detection and the classification of the target person are separated. An attention mechanism is introduced into the classification process: attention weights are learned by the network and combined with the initial prediction, increasing the contribution of distinctive parts to the final classification result, so that the classification of persons in video is more accurate.
Brief description of the drawings
Fig. 1 is a flow chart of a preferred embodiment of the video-based person classification method provided by the present invention;
Fig. 2 is a functional schematic diagram of the intelligent terminal of the invention.
Specific embodiment
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is further described below in conjunction with the drawings and embodiments. It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
The video-based person classification method provided by the invention can be applied in a terminal, where the terminal can be, but is not limited to, a personal computer, laptop, mobile phone, tablet computer, vehicle-mounted computer or portable wearable device. The terminal of the invention uses a multi-core processor, and the processor of the terminal can be at least one of a central processing unit (CPU), a graphics processing unit (GPU), a video processing unit (VPU), and so on.
To solve the prior-art problem that, when classifying target persons in video, the high similarity between persons often causes target person detection to misclassify them, resulting in low classification accuracy, the present invention provides a video-based person classification method.
Please refer to Fig. 1, which is a flow chart of a preferred embodiment of the video-based person classification method provided by the invention.
In embodiment 1, the video-based person classification method has three steps:
S100, obtaining a video frame image to be detected, and extracting the image block containing the target person from the video frame image.
The video to be detected is the video to be processed by the video-based person classification method, for example the video recorded by a certain monitor or a certain segment of television video. Since a video is displayed as a continuous sequence of frames, when classifying persons in this embodiment the images of the video to be detected need to be extracted from it in advance. Methods of extracting images from video are mature in the art, for example obtaining each frame image of the video through a decoder or codec, so this is not repeated in the present application.
In specific implementation, since the video to be detected consists of a continuous sequence of frames, some of its images contain target persons and some do not. In order to classify the target persons in the video, this embodiment extracts the image blocks containing target persons from the obtained video images to be detected. The target person may be, for example, a suspect wanted by the police or a certain character in a television drama.
In specific implementation, a region extraction network model needs to be established in advance for extracting the image blocks of target persons. The region extraction network model can be built on a common object detection framework such as RefineDet, SSD or Faster R-CNN. After the video frame image to be detected is obtained, the video frame image is input into the region extraction network model, and the image block containing the target person is extracted from the video frame image.
In the prior art, target person detection is mainly accomplished by deep learning, and learning is a gradual process. During detection, the network generally generates thousands of background boxes, while the detection boxes containing target persons are generally few, so during training the network tends to be biased toward judging boxes as background. Although down-sampling of the background boxes has been used, the network cannot focus on learning class labels and coordinates at the same time, so existing methods do not solve this problem completely. Therefore, the region extraction network model in this embodiment comprises a first extraction layer and a second extraction layer: the first extraction layer makes a preliminary prediction of the target person label, and the second extraction layer regresses the target person coordinates and makes a more accurate prediction of the target person label. By learning the target person label and the target person coordinates in separate parts, the region extraction network model improves detection accuracy.
In specific implementation, before the step of inputting the video frame image into the region extraction network model and extracting the image block containing the person from the video frame image, the method further comprises:
S100a, obtaining a training image set containing target persons, and annotating the true classes and true coordinates of the target persons in the training image set;
S100b, inputting the training image set into the region extraction network model, and obtaining the classes and coordinates of the target persons predicted by the network through a forward propagation algorithm;
S100c, comparing, through a loss function, the annotated true classes and true coordinates of the target persons with the classes and coordinates of the target persons predicted by the network to obtain the prediction error;
S100d, training the region extraction network model with the prediction error through a back-propagation algorithm.
In specific implementation, this embodiment prepares in advance a training image set containing target persons and annotates, with an annotation tool, the true classes and true coordinates of the target persons in the training image set. After annotation, the training image set is input into the region extraction network model, where it first passes through the first extraction layer for coarse extraction. Specifically, the convolutional layers in the first extraction layer mark detection boxes at the positions of target persons on every frame image of the training set, and the forward propagation algorithm coarsely adjusts the coordinates, scale and positive/negative class of each detection box (a positive class indicates a target person is contained, a negative class indicates no target person is contained). The positions and class information of all detection boxes predicted positive are then passed to the second extraction layer, which performs further, more accurate extraction on the basis of the first. Specifically, the positive detection boxes obtained from the coarse extraction of the first extraction layer and their corresponding feature maps are input into the second extraction layer, whose convolutional layers perform feature conversion on the input feature maps and add the constraints of positive/negative class and detection-box class to the converted feature maps, finally outputting the classes and coordinates of the target persons predicted by the network. In this embodiment the region extraction network model comprises a first extraction layer and a second extraction layer: the first makes preliminary label predictions, which lets the second focus on regressing the target person coordinates while predicting the label more accurately; the two extraction layers cooperate to improve the accuracy of extracting target person image blocks.
Specifically, forward propagation proceeds layer by layer from front to back through all convolutional layers of the first extraction layer, and each layer computes:
x_i = f(w_{i-1} ⊛ x_{i-1})
where x_{i-1} is the input of the current layer, w_{i-1} the network parameters of the current layer, ⊛ the convolution operation and x_i the output of the current layer, and f is the ReLU function, defined as:
ReLU(x) = max(0, x)
Further, after the training image set is obtained in the preceding step, the true classes and true coordinates of the target persons in the training set can be annotated manually; after the classes and coordinates of the target persons predicted by the network are obtained, the manually annotated true classes and true coordinates of the target persons are compared, through the loss function, with the classes and coordinates predicted by the network. The manually annotated true coordinate of a target person is the learning target of the network-predicted coordinate, and as training progresses the predicted coordinate values come closer and closer to the manually annotated true coordinate values. Specifically, the formula of the loss function is:
L = (1/N_arm) Σ_i L_b(p_i, p_i*) + (1/N_odm) Σ_i [p_i* ≥ 1] L_r(x_i, x_i*)
where i is the index of a detection box during training, p_i* is the true class of the target person in the i-th detection box, x_i* is the true coordinate of the target person in the i-th detection box, p_i is the network-predicted class of the target person in the i-th detection box, x_i is the network-predicted coordinate of the target person in the i-th detection box, N_arm and N_odm are the total numbers of detection boxes containing persons to be detected in the two layers of the region extraction network model, L_b is a cross-entropy loss function, and L_r is a regression loss function. The loss function in the region extraction network of this embodiment is a foreground/background binary classification loss; a Softmax multi-class loss can also be used to train the network, which is not limited in the present application.
In specific implementation, L_b is a cross-entropy loss function, defined for the foreground/background case as:
L_b(p, p*) = −p* log p − (1 − p*) log(1 − p)
where p is the predicted foreground probability and p* the true class. L_r is a regression loss function, which can use the L1 loss or the L2 loss; preferably, this embodiment uses the L1 loss, defined as L1(x_1, x_2) = |x_1 − x_2|. The indicator [p_i* ≥ 1] in the loss function is 1 when the condition in brackets holds and 0 otherwise.
Further, the manually annotated true classes and true coordinates of the target persons are compared with the network-predicted classes and coordinates through the loss function to obtain the network prediction error, and the prediction error is then used to train the region extraction network model through the back-propagation algorithm. The back-propagation proceeds layer by layer forward from the last convolutional layer, and each layer is updated as:
w_i ← w_i − α ∂L/∂w_i
where ∂L/∂w_i is the partial derivative of the loss function with respect to the parameters of the current convolutional layer and α is the learning rate, generally 0.0001, decayed to 0.1 times its value every 50 training epochs.
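The update rule and the decay schedule above can be sketched as a standard SGD step; the function names and the toy weights/gradients are assumptions for illustration.

```python
# Sketch: gradient step w <- w - alpha * dL/dw, with the learning rate
# starting at 1e-4 and decaying to 0.1x every 50 training epochs.

def learning_rate(epoch, base=1e-4, decay=0.1, step=50):
    # Step decay: multiply by 0.1 once per completed 50 epochs.
    return base * (decay ** (epoch // step))

def sgd_step(weights, grads, alpha):
    # One parameter update for a flat list of weights.
    return [w - alpha * g for w, g in zip(weights, grads)]

w = sgd_step([1.0, -0.5], [10.0, -10.0], learning_rate(0))
```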
Further, the region extraction network model comprises a first extraction layer and a second extraction layer, and the step of inputting the video frame image into the region extraction network model and extracting the image block containing the target person from the video frame image specifically comprises:
S101, inputting the video frame image into the first extraction layer to obtain the feature map corresponding to the detection boxes containing target persons;
S102, inputting the feature map corresponding to the detection boxes containing target persons into the second extraction layer, and extracting the image block containing the target person from the video frame image.
In specific implementation, once the region extraction network model has been trained, the video frame image to be detected can be input into the trained model. The video frame image first passes through the first extraction layer, which filters out most of the background boxes and yields the feature map corresponding to the detection boxes containing target persons. The feature map is then input into the second extraction layer, which performs further feature conversion on it and yields the image blocks containing target persons in the video frame image. Through this double-extraction-layer processing, the network obtains detection boxes containing target persons more accurately than other current mainstream approaches.
Returning to Fig. 1, the video-based person classification method further comprises the step:
S200, inputting the image block into the classification network model to obtain the preliminary classification result and attention weight of the target person in the image block.
Step S100 only extracts the image blocks that may contain target persons from the video frame image to be detected; next, the target persons need to be classified. This embodiment presets a classification network model for classifying the image blocks of target persons. The classification network model uses the ResNet50 architecture with three added convolutional layers; other conventional classification networks, such as VGG, ResNet or DenseNet, can also be used instead, which is not limited in this embodiment.
In specific implementation, the classification network model comprises a first convolutional layer, a pooling layer and a second convolutional layer containing multiple sub-convolutional layers. The step of inputting the image block into the classification network model and obtaining the preliminary classification result and attention weight of the target person in the image block specifically comprises:
S201, inputting the image block into the first convolutional layer and extracting the feature map of the image block;
S202, inputting the feature map into the pooling layer to obtain multiple feature vectors of the feature map;
S203, inputting each feature vector into each sub-convolutional layer separately to obtain the preliminary classification result and attention weight of the target person in the image block.
In specific implementation, after the image block is input into the classification network model in this embodiment, it is first input into the first convolutional layer, which extracts the feature map of the image block. For example, when the classification network model uses the ResNet50 architecture, inputting the image block into the first convolutional layer extracts a 3-dimensional feature map of the image. Afterwards, the pooling layer performs average pooling: the 3-dimensional feature map is evenly divided into 6 parts in the horizontal direction, and each part corresponds to one feature vector of the picture.
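The stripe-pooling step above can be sketched on nested lists. The tiny 12x2x3 feature map here is a made-up stand-in for real network output, and the function name is an assumption.

```python
# Sketch: split a (height x width x channels) feature map into 6 horizontal
# stripes and average-pool each stripe into one feature vector per part.

def horizontal_part_pool(fmap, parts=6):
    h = len(fmap)
    channels = len(fmap[0][0])
    stripe = h // parts
    vectors = []
    for p in range(parts):
        rows = fmap[p * stripe:(p + 1) * stripe]
        cells = [cell for row in rows for cell in row]
        # Average each channel over every cell in the stripe.
        vectors.append([sum(c[ch] for c in cells) / len(cells)
                        for ch in range(channels)])
    return vectors

# 12 x 2 x 3 feature map whose values equal the row index
fmap = [[[float(r)] * 3 for _ in range(2)] for r in range(12)]
vecs = horizontal_part_pool(fmap)
```

Each stripe thus collapses to one vector per body part, which is what the sub-convolutional branches consume next.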
Considering that in practical applications users often pay close attention to certain distinctive parts, such as the face, when observing a person, and in order to bring the final person classification result closer to the judgment of an actual user, the classification network model of this embodiment is provided with a second convolutional layer containing multiple sub-convolutional layers: each feature vector obtained through the pooling layer is separately input into each sub-convolutional layer for convolution, obtaining the preliminary classification result and attention weight of the target person in the image block.
In specific implementation, the second convolutional layer comprises a first sub-convolutional layer, a second sub-convolutional layer, a classifier and a regression network. The step of inputting each feature vector into each sub-convolutional layer separately and obtaining the preliminary classification result and attention weight corresponding to each feature vector specifically comprises:
S201, inputting each feature vector in turn into the first sub-convolutional layer and the second sub-convolutional layer, and outputting the first dimensional feature and second dimensional feature corresponding to each feature vector;
S202, inputting the first dimensional feature into the classifier to obtain the preliminary classification result of the target person in the image block;
S203, inputting the second dimensional feature into the regression network to obtain the attention weight of the target person in the image block.
In a specific implementation, the second convolutional layer in this embodiment is provided with two different sub-convolutional layers, namely the first sub-convolutional layer and the second sub-convolutional layer. Each feature vector is first input into the first sub-convolutional layer, which reduces its dimensionality along the path 2048 -> 256 -> 6; the resulting first dimensional features are connected to a classifier, such as an existing support vector machine (SVM), whose output is the 6x7 preliminary classification result. Each feature vector is likewise input into the second sub-convolutional layer, which reduces its dimensionality along the path 2048 -> 256 -> 1; the resulting second dimensional features are connected to a regression network, such as an existing logistic regression network, whose output is the 6x1 attention weight.
Further, before the obtained image blocks containing the target person are classified with the classification network model, the classification network model must be trained. The specific training process is as follows: obtain an image set to be trained that contains the target person, and annotate the true class of the target person in the image set. Then input the image set into the first convolutional layer to extract the feature maps of the images; input the feature maps into the pooling layer to obtain the feature vector corresponding to each partial feature map; input the feature vectors into the first sub-convolutional layer and the second sub-convolutional layer respectively to obtain the preliminary classification results and attention weights of the target person in the images; output the target person classification result predicted by the classification network model according to the preliminary classification results and attention weights; compare the predicted classification result with the manually annotated true class of the target person, and subtract the two to obtain the training error; and finally train the classification network model by the back-propagation algorithm. The back-propagation algorithm used is identical to the one used when training the aforementioned region extraction network model, and is not repeated here.
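The compare-and-backpropagate loop described above can be sketched for the final linear layer alone. A softmax cross-entropy loss stands in for the "subtract the two" error, and the 6-d fused feature, the 7 classes, and the learning rate are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.standard_normal(6)           # fused 6-d feature of one training image (assumed)
y = 3                                # manually annotated true class, one of 7
W = rng.standard_normal((6, 7)) * 0.01
lr = 0.05

def loss(W):
    # Cross-entropy of the predicted distribution against the true class.
    return -np.log(softmax(x @ W)[y])

before = loss(W)
p = softmax(x @ W)
p[y] -= 1.0                          # gradient of cross-entropy w.r.t. the logits
W -= lr * np.outer(x, p)             # one back-propagation (SGD) step
after = loss(W)
```

A single gradient step on this convex objective reduces the training error (`after < before`); the patent's full procedure repeats such steps over the whole network.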
Returning to Fig. 1, the video-based person classification method further includes the following step:
S300: obtain the final classification result of the target person according to the preliminary classification result and attention weight of the target person in the image block, and classify the target persons contained in the video frame images according to the final classification result.
Specifically, as described above, the first sub-convolutional layer reduces the dimensionality of the first dimensional features along 2048 -> 256 -> 6, and these features are input into an existing support vector machine, which outputs the 6x7 preliminary classification result; denote it c_i, where each c_i represents one classification result, so that 6 classification results are obtained. The second sub-convolutional layer reduces the dimensionality of the second dimensional features along 2048 -> 256 -> 1, and these features are input into an existing logistic regression network, which outputs the 6x1 attention weight; denote it w_i. The final classification result of the target person is then obtained from the preliminary classification results and attention weights according to the formula c = Σ_{i=1..6} w_i · c_i, where each attention weight value may take any value in [0, 1], and the range may also be widened, e.g. to [0, 5]. That is, the 6 classification results c_i of the 6x7 preliminary classification result are weighted by the attention weights w_i and summed to obtain the final classification result, and the class with the largest value in the final classification result is chosen as the classification label of the target person contained in the video frame images, thereby classifying the target person.
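The weighted fusion and label selection just described reduce to a couple of lines (the score and weight values below are purely illustrative):

```python
import numpy as np

# 6x7 preliminary classification scores c_i (one row per part)
# and 6x1 attention weights w_i.
c = np.array([[0.1, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0],
              [0.8, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0],
              [0.2, 0.7, 0.1, 0.0, 0.0, 0.0, 0.0],
              [0.1, 0.8, 0.1, 0.0, 0.0, 0.0, 0.0],
              [0.6, 0.2, 0.2, 0.0, 0.0, 0.0, 0.0],
              [0.1, 0.6, 0.3, 0.0, 0.0, 0.0, 0.0]])
w = np.array([[0.9], [0.1], [0.8], [0.7], [0.2], [0.6]])

final = (w * c).sum(axis=0)     # c = sum_i w_i * c_i, a 7-d final score vector
label = int(np.argmax(final))   # the class with the largest final value
```

Here parts with higher attention (rows 0, 2, 3, 5, which favor class 1) dominate the sum, so `label` is 1 even though two low-attention parts prefer class 0.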
Embodiment 2
Based on the above embodiment, the present invention further provides an intelligent terminal, whose functional block diagram may be as shown in Fig. 2. The intelligent terminal includes a processor, a memory, a network interface, a display screen, and a temperature sensor connected by a system bus. The processor of the intelligent terminal provides computing and control capability. The memory of the intelligent terminal includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the intelligent terminal is used to communicate with external terminals over a network. The computer program, when executed by the processor, implements a video-based person classification method. The display screen of the intelligent terminal may be a liquid crystal display or an electronic ink display, and the temperature sensor of the intelligent terminal is arranged inside the intelligent terminal in advance to detect the current operating temperature of the internal components.
Those skilled in the art will understand that the functional block diagram shown in Fig. 2 is only a block diagram of the partial structure relevant to the solution of the present invention and does not constitute a limitation on the intelligent terminal to which the solution is applied; a specific intelligent terminal may include more or fewer components than shown in the figure, combine certain components, or adopt a different arrangement of components.
In one embodiment, an intelligent terminal is provided, including a memory and a processor, the memory storing a computer program; when the processor executes the computer program, at least the following steps can be implemented:
obtaining video frame images to be detected, and extracting the image blocks containing the target person from the video frame images;
inputting the image blocks into the classification network model to obtain the preliminary classification results and attention weights of the target person in the image blocks, the classification network model being trained on the correspondence between image blocks of the target person and the preliminary classification results and attention weights of the target person in those image blocks;
obtaining the final classification result of the target person according to the preliminary classification results and attention weights of the target person in the image blocks, and classifying the target persons contained in the video frame images according to the final classification result.
In one embodiment, the processor, when executing the computer program, can further implement: inputting the video frame images into a region extraction network model and extracting the image blocks containing the target person from the video frame images; the region extraction network model is trained on the correspondence between input video frame images and the target person image blocks in those images.
In one embodiment, the processor, when executing the computer program, can further implement the step of inputting the image block into the classification network model and obtaining the preliminary classification result and attention weight of the target person in the image block, which specifically includes: inputting the image block into the first convolutional layer and extracting the feature map of the image block; inputting the feature map into the pooling layer to obtain multiple feature vectors of the feature map; and separately inputting each feature vector into each sub-convolutional layer to obtain the preliminary classification result and attention weight of the target person in the image block.
In one embodiment, the processor, when executing the computer program, can further implement the step of separately inputting each feature vector into each sub-convolutional layer and obtaining the preliminary classification result and attention weight corresponding to each feature vector, which specifically includes: sequentially inputting each feature vector into the first sub-convolutional layer and the second sub-convolutional layer, and outputting the first dimensional features and second dimensional features corresponding to each feature vector; inputting the first dimensional features into the classifier to obtain the preliminary classification result of the target person in the image block; and inputting the second dimensional features into the regression network to obtain the attention weight of the target person in the image block.
In one embodiment, the processor, when executing the computer program, can further implement: multiplying the preliminary classification result of the target person in the image block by the attention weight to obtain the final classification result of the target person; and choosing the class with the largest value in the final classification result of the target person as the classification label of the target persons contained in the video frame images.
In one embodiment, the processor, when executing the computer program, can further implement the step of inputting the video frame images into the region extraction network model and extracting the image blocks containing the target person from the video frame images, which specifically includes: inputting the video frame images into the first extraction layer to obtain the feature maps corresponding to the detection boxes containing the target person; and inputting the feature maps corresponding to the detection boxes containing the target person into the second extraction layer to obtain the image blocks containing the target person in the video frame images.
In one embodiment, the processor, when executing the computer program, can further implement: obtaining an image set to be trained that contains the target person, and annotating the true class and true coordinates of the target person in the image set; inputting the image set into the region extraction network model, and obtaining the network-predicted class and coordinates of the target person by the forward-propagation algorithm; comparing, via a loss function, the annotated true class and true coordinates of the target person with the network-predicted class and coordinates, and obtaining a prediction error; and training the region extraction network model with the prediction error by the back-propagation algorithm.
In one embodiment, the processor, when executing the computer program, can further implement: comparing, via the loss function, the annotated true class and true coordinates of the person with the network-predicted class label and coordinates of the person, and obtaining a prediction error, where i is the index of a detection box during training, the true class and true coordinates of the target person in the i-th detection box are the annotated values, p_i is the network-predicted class of the target person in the i-th detection box, x_i is the network-predicted coordinates of the target person in the i-th detection box, N_arm and N_odm are respectively the total numbers of boxes containing persons to be detected in the region extraction network model, L_b is a cross-entropy loss function, and L_r is a regression loss function.
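The exact combination of terms is given only by the elided formula, so the following is a hedged sketch of a loss of the kind described: a cross-entropy classification term L_b plus a smooth-L1 regression term L_r, each normalized by a box count. The smooth-L1 choice for L_r and the way the two terms are summed are assumptions:

```python
import numpy as np

def cross_entropy(pred_logits, true_class):
    # L_b: cross-entropy of the predicted class scores against the true class.
    e = np.exp(pred_logits - pred_logits.max())
    return -np.log(e[true_class] / e.sum())

def smooth_l1(pred_box, true_box):
    # L_r: smooth-L1 regression loss over box coordinates.
    d = np.abs(pred_box - true_box)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def detection_loss(preds, truths, n_arm, n_odm):
    """preds: list of (class_logits, box); truths: list of (true_class, box)."""
    lb = sum(cross_entropy(p, t) for (p, _), (t, _) in zip(preds, truths))
    lr = sum(smooth_l1(px, tx) for (_, px), (_, tx) in zip(preds, truths))
    return lb / n_arm + lr / n_odm

# One detection box: predicted logits for {person, background} and a box,
# against an annotated true class 0 and true box.
preds = [(np.array([2.0, 0.1]), np.array([10.0, 10.0, 50.0, 80.0]))]
truths = [(0, np.array([12.0, 9.0, 48.0, 82.0]))]
error = detection_loss(preds, truths, n_arm=1, n_odm=1)
```

The resulting `error` is the prediction error that would then drive the back-propagation step described above.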
Those of ordinary skill in the art will appreciate that all or part of the processes of the above embodiment methods can be completed by instructing the relevant hardware through a computer program; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided by the present invention may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
In conclusion the present invention provides a kind of human classification method, intelligent terminal and storage medium based on video, institute
The method of stating includes: to obtain video frame images to be detected, extracts the image block in the video frame images comprising target person;It will
Described image block, which inputs in the sorter network model, extracts feature vector, obtains initial point of target person in described image block
Class result and attention weight;According to the preliminary classification result of target person in described image block and the acquisition of attention weight
The final classification of target person is as a result, according to the final classification result to target person contained in the video frame images
Classify.Method provided by the present invention extracts target person to be detected by region extraction module and sorter network model respectively
The image block of object and classify to target person, the e-learning power weight that gains attention combined with initial predicted result,
Contribution of the characteristic part to final classification result is improved, so that video human classification result is more accurate.
It should be understood that system application of the invention is not limited to above-mentioned citing, those of ordinary skill in the art are come
It says, it can be modified or changed according to the above description, and all these modifications and variations all should belong to right appended by the present invention and want
The protection scope asked.
Claims (10)
1. A video-based person classification method, characterized in that it comprises:
obtaining video frame images to be detected, and extracting the image block containing a target person from the video frame images;
inputting the image block into a classification network model to obtain the preliminary classification result and attention weight of the target person in the image block, the classification network model being trained on the correspondence between image blocks of the target person and the preliminary classification results and attention weights of the target person in the image blocks;
obtaining the final classification result of the target person according to the preliminary classification result and attention weight of the target person in the image block, and classifying the target persons contained in the video frame images according to the final classification result.
2. The video-based person classification method according to claim 1, characterized in that the step of extracting the image block containing the target person from the video frame images specifically includes:
inputting the video frame images into a region extraction network model, and extracting the image block containing the target person from the video frame images; the region extraction network model is trained on the correspondence between input video frame images and the target person image blocks in the input video frame images.
3. The video-based person classification method according to claim 1, characterized in that the classification network model includes: a first convolutional layer, a pooling layer, and a second convolutional layer containing multiple sub-convolutional layers;
the step of inputting the image block into the classification network model and obtaining the preliminary classification result and attention weight of the target person in the image block specifically includes:
inputting the image block into the first convolutional layer, and extracting the feature map of the image block;
inputting the feature map into the pooling layer, and obtaining multiple feature vectors of the feature map;
separately inputting each feature vector into each sub-convolutional layer, and obtaining the preliminary classification result and attention weight of the target person in the image block.
4. The video-based person classification method according to claim 3, characterized in that the second convolutional layer includes: a first sub-convolutional layer, a second sub-convolutional layer, a classifier, and a regression network;
the step of separately inputting each feature vector into each sub-convolutional layer and obtaining the preliminary classification result and attention weight corresponding to each feature vector specifically includes:
sequentially inputting each feature vector into the first sub-convolutional layer and the second sub-convolutional layer, and outputting the first dimensional features and second dimensional features corresponding to each feature vector;
inputting the first dimensional features into the classifier, and obtaining the preliminary classification result of the target person in the image block;
inputting the second dimensional features into the regression network, and obtaining the attention weight of the target person in the image block.
5. The video-based person classification method according to claim 1, characterized in that the step of obtaining the final classification result of the target person according to the preliminary classification result and attention weight of the target person in the image block, and classifying the target persons contained in the video frame images according to the final classification result, specifically includes:
multiplying the preliminary classification result of the target person in the image block by the attention weight to obtain the final classification result of the target person;
choosing the class with the largest value in the final classification result of the target person as the classification label of the target persons contained in the video frame images.
6. The video-based person classification method according to claim 2, characterized in that the region extraction network model includes: a first extraction layer and a second extraction layer;
the step of inputting the video frame images into the region extraction network model and extracting the image block containing the target person from the video frame images specifically includes:
inputting the video frame images into the first extraction layer, and obtaining the feature maps corresponding to the detection boxes containing the target person;
inputting the feature maps corresponding to the detection boxes containing the target person into the second extraction layer, and extracting the image block containing the target person from the video frame images.
7. The video-based person classification method according to claim 6, characterized in that, before the step of inputting the video frame images into the region extraction network model and extracting the image block containing the target person from the video frame images, the method further includes:
obtaining an image set to be trained that contains the target person, and annotating the true class and true coordinates of the target person in the image set;
inputting the image set into the region extraction network model, and obtaining the network-predicted class and coordinates of the target person by the forward-propagation algorithm;
comparing, via a loss function, the annotated true class and true coordinates of the target person with the network-predicted class and coordinates of the target person, and obtaining a prediction error;
training the region extraction network model with the prediction error by the back-propagation algorithm.
8. The video-based person classification method according to claim 7, characterized in that, in the loss function:
i is the index of a detection box during training, the annotated values are the true class and true coordinates of the target person in the i-th detection box, p_i is the network-predicted class of the target person in the i-th detection box, x_i is the network-predicted coordinates of the target person in the i-th detection box, N_arm and N_odm are respectively the total numbers of boxes containing persons to be detected in the region extraction network model, L_b is a cross-entropy loss function, and L_r is a regression loss function.
9. An intelligent terminal, characterized by comprising: a processor and a storage medium in communication connection with the processor, the storage medium being suitable for storing a plurality of instructions; the processor is suitable for calling the instructions in the storage medium to execute the steps of the video-based person classification method according to any one of claims 1-8.
10. A storage medium, characterized in that a control program of the video-based person classification method is stored on the storage medium; when the control program of the video-based person classification method is executed by a processor, the steps of the video-based person classification method according to any one of claims 1-8 are implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910553048.8A CN110414344B (en) | 2019-06-25 | 2019-06-25 | Character classification method based on video, intelligent terminal and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110414344A true CN110414344A (en) | 2019-11-05 |
CN110414344B CN110414344B (en) | 2023-06-06 |
Family
ID=68359697
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910553048.8A Active CN110414344B (en) | 2019-06-25 | 2019-06-25 | Character classification method based on video, intelligent terminal and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110414344B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111046974A (en) * | 2019-12-25 | 2020-04-21 | 珠海格力电器股份有限公司 | Article classification method and device, storage medium and electronic equipment |
CN111461246A (en) * | 2020-04-09 | 2020-07-28 | 北京爱笔科技有限公司 | Image classification method and device |
CN111814617A (en) * | 2020-06-28 | 2020-10-23 | 智慧眼科技股份有限公司 | Video-based fire determination method and device, computer equipment and storage medium |
CN111914107A (en) * | 2020-07-29 | 2020-11-10 | 厦门大学 | Instance retrieval method based on multi-channel attention area expansion |
CN112101154A (en) * | 2020-09-02 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Video classification method and device, computer equipment and storage medium |
CN112995666A (en) * | 2021-02-22 | 2021-06-18 | 天翼爱音乐文化科技有限公司 | Video horizontal and vertical screen conversion method and device combined with scene switching detection |
CN113191205A (en) * | 2021-04-03 | 2021-07-30 | 国家计算机网络与信息安全管理中心 | Method for identifying special scene, object, character and noise factor in video |
CN113496231A (en) * | 2020-03-18 | 2021-10-12 | 北京京东乾石科技有限公司 | Classification model training method, image classification method, device, equipment and medium |
CN113673576A (en) * | 2021-07-26 | 2021-11-19 | 浙江大华技术股份有限公司 | Image detection method, terminal and computer readable storage medium thereof |
CN113673588A (en) * | 2021-08-12 | 2021-11-19 | 连尚(北京)网络科技有限公司 | Method, apparatus, medium, and program product for video classification |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN205388826U (en) * | 2016-03-09 | 2016-07-20 | 郑永春 | Vehicle recognition cameras |
CN106845361A (en) * | 2016-12-27 | 2017-06-13 | 深圳大学 | A kind of pedestrian head recognition methods and system |
CN109034024A (en) * | 2018-07-16 | 2018-12-18 | 浙江工业大学 | Logistics vehicles vehicle classification recognition methods based on image object detection |
CN109074472A (en) * | 2016-04-06 | 2018-12-21 | 北京市商汤科技开发有限公司 | Method and system for person recognition |
CN109359592A (en) * | 2018-10-16 | 2019-02-19 | 北京达佳互联信息技术有限公司 | Processing method, device, electronic equipment and the storage medium of video frame |
CN109614517A (en) * | 2018-12-04 | 2019-04-12 | 广州市百果园信息技术有限公司 | Classification method, device, equipment and the storage medium of video |
CN109684990A (en) * | 2018-12-20 | 2019-04-26 | 天津天地伟业信息系统集成有限公司 | A kind of behavioral value method of making a phone call based on video |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111046974A (en) * | 2019-12-25 | 2020-04-21 | 珠海格力电器股份有限公司 | Article classification method and device, storage medium and electronic equipment |
CN113496231A (en) * | 2020-03-18 | 2021-10-12 | 北京京东乾石科技有限公司 | Classification model training method, image classification method, device, equipment and medium |
CN113496231B (en) * | 2020-03-18 | 2024-06-18 | 北京京东乾石科技有限公司 | Classification model training method, image classification method, device, equipment and medium |
CN111461246A (en) * | 2020-04-09 | 2020-07-28 | 北京爱笔科技有限公司 | Image classification method and device |
CN111814617B (en) * | 2020-06-28 | 2023-01-31 | 智慧眼科技股份有限公司 | Fire determination method and device based on video, computer equipment and storage medium |
CN111814617A (en) * | 2020-06-28 | 2020-10-23 | 智慧眼科技股份有限公司 | Video-based fire determination method and device, computer equipment and storage medium |
CN111914107A (en) * | 2020-07-29 | 2020-11-10 | 厦门大学 | Instance retrieval method based on multi-channel attention area expansion |
CN111914107B (en) * | 2020-07-29 | 2022-06-14 | 厦门大学 | Instance retrieval method based on multi-channel attention area expansion |
CN112101154A (en) * | 2020-09-02 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Video classification method and device, computer equipment and storage medium |
CN112101154B (en) * | 2020-09-02 | 2023-12-15 | 腾讯科技(深圳)有限公司 | Video classification method, apparatus, computer device and storage medium |
CN112995666A (en) * | 2021-02-22 | 2021-06-18 | 天翼爱音乐文化科技有限公司 | Video horizontal and vertical screen conversion method and device combined with scene switching detection |
CN112995666B (en) * | 2021-02-22 | 2022-04-22 | 天翼爱音乐文化科技有限公司 | Video horizontal and vertical screen conversion method and device combined with scene switching detection |
CN113191205A (en) * | 2021-04-03 | 2021-07-30 | 国家计算机网络与信息安全管理中心 | Method for identifying special scene, object, character and noise factor in video |
CN113673576A (en) * | 2021-07-26 | 2021-11-19 | 浙江大华技术股份有限公司 | Image detection method, terminal and computer readable storage medium thereof |
CN113673588A (en) * | 2021-08-12 | 2021-11-19 | 连尚(北京)网络科技有限公司 | Method, apparatus, medium, and program product for video classification |
Also Published As
Publication number | Publication date |
---|---|
CN110414344B (en) | 2023-06-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110414344A (en) | A kind of human classification method, intelligent terminal and storage medium based on video | |
Zhang et al. | Cross-modality interactive attention network for multispectral pedestrian detection | |
CN114202672A (en) | Small target detection method based on attention mechanism | |
CN112396002A (en) | Lightweight remote sensing target detection method based on SE-YOLOv3 | |
US20180114071A1 (en) | Method for analysing media content | |
CN109886066A (en) | Fast target detection method based on the fusion of multiple dimensioned and multilayer feature | |
Alshehri et al. | Deep attention neural network for multi-label classification in unmanned aerial vehicle imagery | |
CN111353544B (en) | Improved Mixed Pooling-YOLOV 3-based target detection method | |
CN107239775A (en) | Terrain classification method and device | |
CN111242144A (en) | Method and device for detecting abnormality of power grid equipment | |
CN108229432A (en) | Face calibration method and device | |
Liu et al. | A shadow detection algorithm based on multiscale spatial attention mechanism for aerial remote sensing images | |
CN114782798A (en) | Underwater target detection method based on attention fusion | |
CN115375781A (en) | Data processing method and device | |
CN115984226A (en) | Insulator defect detection method, device, medium, and program product | |
CN115496971A (en) | Infrared target detection method and device, electronic equipment and storage medium | |
Wu et al. | Improved YOLOX foreign object detection algorithm for transmission lines | |
CN111582057B (en) | Face verification method based on local receptive field | |
Mohamed et al. | Data augmentation for deep learning algorithms that perform driver drowsiness detection | |
Zhao et al. | Recognition and Classification of Concrete Cracks under Strong Interference Based on Convolutional Neural Network. | |
Shi et al. | Combined channel and spatial attention for YOLOv5 during target detection | |
CN116958615A (en) | Picture identification method, device, equipment and medium | |
Bai et al. | Countr: An end-to-end transformer approach for crowd counting and density estimation | |
CN114140524A (en) | Closed loop detection system and method for multi-scale feature fusion | |
Yue et al. | A Novel Two-stream Architecture Fusing Static And Dynamic Features for Human Action Recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |