CN109165563A - Person re-identification method and apparatus, electronic device, storage medium, and program product - Google Patents

Person re-identification method and apparatus, electronic device, storage medium, and program product

Info

Publication number
CN109165563A
Authority
CN
China
Prior art keywords
feature
image
candidate
language
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810848366.2A
Other languages
Chinese (zh)
Other versions
CN109165563B (en)
Inventor
陈大鹏
李鸿升
刘希慧
邵静
王晓刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN201810848366.2A
Publication of CN109165563A
Application granted
Publication of CN109165563B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application disclose a person re-identification method and apparatus, an electronic device, a storage medium, and a program product. An image to be identified and a candidate image set are obtained; feature extraction is performed on the image to be identified and on each candidate image in the candidate image set using a feature extraction network, yielding an intermediate feature to be identified corresponding to the image to be identified and candidate intermediate features corresponding to the candidate images, where the feature extraction network is obtained through cross-modal training on image features and language descriptions; a recognition result corresponding to the image to be identified is then obtained from the candidate image set based on the intermediate feature to be identified and the candidate intermediate features. The natural correspondence between an image and the language describing it is exploited, and the correlation between local image regions and noun phrases is further mined through phrase reconstruction, which strengthens the constraints on image feature learning, improves the quality of the visual features used for person re-identification, and thereby improves the accuracy of person re-identification.

Description

Person re-identification method and apparatus, electronic device, storage medium, and program product
Technical field
This application relates to computer vision technology, and in particular to a person re-identification method and apparatus, an electronic device, a storage medium, and a program product.
Background
Person re-identification is a key technology in intelligent video surveillance systems. It aims to measure the similarity between a given target (query) sample and gallery samples, and to find the target sample among a large number of gallery samples. With the application of deep neural networks, the visual features used for person re-identification have improved. To further increase the discriminative power of the features, some methods use auxiliary data; however, the following problems remain: they rely on additional equipment or models, increasing the monetary and time cost of running the algorithm, or they define complicated annotation formats for the auxiliary data, increasing the human cost of data annotation.
Summary of the invention
The embodiments of the present application provide a person re-identification technique.
According to one aspect of the embodiments of the present application, a person re-identification method is provided, comprising:
obtaining an image to be identified and a candidate image set;
performing feature extraction on the image to be identified and on each candidate image in the candidate image set using a feature extraction network, to obtain an intermediate feature to be identified corresponding to the image to be identified and a candidate intermediate feature corresponding to each candidate image, the feature extraction network being obtained through cross-modal training on image features and language descriptions;
obtaining a recognition result corresponding to the image to be identified from the candidate image set based on the intermediate feature to be identified and the candidate intermediate features, the recognition result comprising at least one candidate image.
Optionally, obtaining the recognition result corresponding to the image to be identified from the candidate image set based on the intermediate feature to be identified and the candidate intermediate features comprises:
passing the intermediate feature to be identified and the candidate intermediate features through an average pooling layer and a fully connected layer, respectively, to obtain a feature to be identified and candidate features;
obtaining the recognition result corresponding to the image to be identified from the candidate image set based on the feature to be identified and the candidate features.
Optionally, the method further comprises: performing feature extraction on descriptive text related to the image to be identified based on a language recognition network, to obtain a language feature;
screening the recognition result based on the language feature, to obtain an updated recognition result corresponding to the image to be identified, the updated recognition result comprising at least one candidate image.
Optionally, screening the recognition result based on the language feature to obtain the updated recognition result corresponding to the image to be identified comprises:
screening based on the distance between the language feature and the at least one candidate intermediate feature corresponding to the recognition result;
obtaining the at least one candidate intermediate feature whose distance is less than or equal to a preset value, and taking the candidate image corresponding to each obtained candidate intermediate feature as the updated recognition result.
Optionally, the method further comprises:
performing feature extraction on at least one description word related to the image to be identified based on the language recognition network, to obtain a word feature, each description word corresponding to at least one local part of the image to be identified;
screening the recognition result or the updated recognition result based on the word feature, to obtain a target recognition result corresponding to the image to be identified, the target recognition result comprising at least one candidate image.
Optionally, screening the recognition result or the updated recognition result based on the word feature to obtain the target recognition result corresponding to the image to be identified comprises:
screening based on the distance between the word feature and the at least one candidate intermediate feature corresponding to the recognition result or the updated recognition result;
obtaining the at least one candidate feature whose distance is less than or equal to a preset value, and taking the candidate image corresponding to each obtained candidate intermediate feature as the target recognition result.
Optionally, the feature extraction network being obtained through cross-modal training on image features and language descriptions comprises:
inputting a sample image into the feature extraction network to obtain a sample image feature, the sample image carrying a text description annotation;
performing feature extraction on the text description annotation based on a language recognition network, to obtain a sample language feature;
training the feature extraction network based on the sample language feature and the sample image feature.
Optionally, training the feature extraction network based on the sample language feature and the sample image feature comprises:
obtaining a global relevance probability based on the sample language feature and the sample image feature;
obtaining a global loss using a binary cross-entropy loss, based on the global relevance probability and the correlation between the sample image and the text description annotation;
training the feature extraction network based on the global loss.
Optionally, obtaining the global relevance probability based on the sample language feature and the sample image feature comprises:
pooling the sample image feature and subtracting the sample language feature from it, to obtain a difference feature;
squaring the difference feature element-wise to obtain a joint feature;
normalizing the joint feature, to obtain the global relevance probability indicating the overall correlation.
Optionally, before performing feature extraction on the text description annotation based on the language recognition network to obtain the sample language feature, the method further comprises:
pre-training the language recognition network based on sample text, the sample text carrying an annotated language feature.
Optionally, pre-training the language recognition network based on the sample text comprises:
inputting the sample text into the language recognition network to obtain a first predicted sample feature;
adjusting the parameters of the language recognition network based on the first predicted sample feature and the annotated language feature.
Optionally, the method further comprises: performing feature extraction on at least one phrase annotation in the text description annotation based on the language recognition network, to obtain at least one local feature, each phrase annotation describing at least one region in the sample image;
obtaining a local loss based on the local feature and the sample image feature;
training the feature extraction network based on the global loss then comprises:
training the feature extraction network based on the global loss and the local loss.
Optionally, before performing feature extraction on the at least one phrase annotation in the text description annotation based on the language recognition network to obtain the at least one local feature, the method further comprises:
segmenting the text description annotation to obtain at least one phrase annotation, each phrase annotation containing at least one noun, each obtained phrase annotation corresponding to an annotated probability whose value indicates the probability that the phrase annotation corresponds to the sample image.
Optionally, segmenting the text description annotation to obtain the at least one phrase annotation comprises:
performing part-of-speech recognition on each word in the text description annotation, to obtain the part of speech corresponding to each word;
dividing the text description annotation into at least one phrase annotation based on the parts of speech combined with a preset phrase-chunking condition.
Optionally, determining the local loss based on the local feature and the sample image feature comprises:
performing a pooling operation on the sample image feature, to obtain a global feature map;
obtaining saliency weights based on the global feature map and the local feature;
determining a prediction probability corresponding to each phrase annotation based on the saliency weights and the sample image feature;
obtaining the local loss based on the prediction probability and the annotated probability corresponding to the phrase annotation.
Optionally, obtaining the saliency weights based on the global feature map and the local feature comprises:
subtracting the local feature from the feature value at each position of the global feature map, to obtain local difference features;
squaring each element of the local difference features to obtain local joint features;
obtaining the saliency weights based on the local joint features.
Optionally, obtaining the saliency weights based on the local joint features comprises:
processing the local joint features with a fully connected network, to obtain matching values indicating the degree of matching between the phrase annotation and the sample image;
normalizing, for each phrase annotation, the vector formed by the matching values at all positions of the corresponding global feature map, to obtain the saliency weight corresponding to each phrase annotation.
Optionally, determining the prediction probability corresponding to each phrase annotation based on the saliency weights and the sample image feature comprises:
multiplying the feature value at each position of the sample image feature by the corresponding saliency weight, to obtain a weighted feature vector set corresponding to each phrase annotation;
adding the vectors in the weighted feature vector set, to obtain the local visual feature corresponding to the phrase annotation in the sample image;
obtaining the prediction probability of each word in the phrase annotation based on the local visual feature;
determining the prediction probability corresponding to the phrase annotation based on the prediction probabilities of the words in the phrase annotation.
Optionally, obtaining the prediction probability of each word in the phrase annotation based on the local visual feature comprises:
decomposing the phrase annotation into a word sequence, inputting the local visual feature into a long short-term memory (LSTM) network, and determining at least one hidden variable, each word corresponding to a feature vector;
at each time step, combining the hidden variable of the previous time step with the feature vector of the current word through the LSTM network, to obtain the hidden variable of the next time step;
performing a linear mapping based on the at least one hidden variable, to obtain a predicted vector for each word;
obtaining the prediction probability of each word in the phrase annotation based on the predicted vectors.
Optionally, determining the prediction probability corresponding to the phrase annotation based on the prediction probabilities of the words in the phrase annotation comprises:
taking the product of the prediction probabilities of the words in the phrase annotation as the prediction probability of the phrase annotation.
Optionally, training the feature extraction network based on the global loss and the local loss comprises:
summing the global loss and the local loss, to obtain a sum loss;
adjusting the parameters of the feature extraction network based on the sum loss.
Optionally, the method further comprises:
inputting an identity sample image into the feature extraction network to obtain a sample prediction feature, the identity sample image carrying an annotated identity feature;
processing the sample prediction feature through a pooling layer and a fully connected layer, to obtain a predicted identity feature;
adjusting the parameters of the feature extraction network, the pooling layer, and the fully connected layer based on the annotated identity feature and the predicted identity feature.
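As a purely illustrative sketch of this identity-supervised branch, not the patent's specified implementation: a pooling layer and a fully connected layer are appended after the feature extraction network, and an identity classification loss adjusts all three components. The module names, dimensions, number of identities, and use of PyTorch are assumptions.

```python
import torch
import torch.nn as nn

class IdentityHead(nn.Module):
    """Pools an intermediate feature map and maps it to identity logits."""
    def __init__(self, in_channels=2048, feat_dim=512, num_identities=751):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # pooling layer
        self.fc = nn.Linear(in_channels, feat_dim)   # fully connected layer
        self.classifier = nn.Linear(feat_dim, num_identities)

    def forward(self, feature_map):                  # (B, C, H, W)
        x = self.pool(feature_map).flatten(1)        # (B, C)
        feat = self.fc(x)                            # predicted identity feature
        return feat, self.classifier(feat)

# One training step: compare predictions against annotated identity labels;
# the resulting gradients adjust the backbone, pooling, and FC parameters.
head = IdentityHead()
feature_map = torch.randn(8, 2048, 24, 8)            # backbone output (assumed shape)
labels = torch.randint(0, 751, (8,))
_, logits = head(feature_map)
loss = nn.CrossEntropyLoss()(logits, labels)
```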
According to another aspect of the embodiments of the present application, a person re-identification apparatus is provided, comprising:
an image acquisition unit, configured to obtain an image to be identified and a candidate image set;
a feature extraction unit, configured to perform feature extraction on the image to be identified and on each candidate image in the candidate image set using a feature extraction network, to obtain an intermediate feature to be identified corresponding to the image to be identified and a candidate intermediate feature corresponding to each candidate image, the feature extraction network being obtained through cross-modal training on image features and language descriptions;
a result recognition unit, configured to obtain a recognition result corresponding to the image to be identified from the candidate image set based on the intermediate feature to be identified and the candidate intermediate features, the recognition result comprising at least one candidate image.
Optionally, the result recognition unit is configured to pass the intermediate feature to be identified and the candidate intermediate features through an average pooling layer and a fully connected layer, respectively, to obtain a feature to be identified and candidate features, and to obtain the recognition result corresponding to the image to be identified from the candidate image set based on the feature to be identified and the candidate features.
Optionally, the apparatus further comprises:
a language screening unit, configured to perform feature extraction on descriptive text related to the image to be identified based on a language recognition network, to obtain a language feature, and to screen the recognition result based on the language feature, to obtain an updated recognition result corresponding to the image to be identified, the updated recognition result comprising at least one candidate image.
Optionally, when screening the recognition result based on the language feature to obtain the updated recognition result corresponding to the image to be identified, the language screening unit is configured to screen based on the distance between the language feature and the at least one candidate intermediate feature corresponding to the recognition result, to obtain the at least one candidate intermediate feature whose distance is less than or equal to a preset value, and to take the candidate image corresponding to each obtained candidate intermediate feature as the updated recognition result.
Optionally, the apparatus further comprises:
a word screening unit, configured to perform feature extraction on at least one description word related to the image to be identified based on the language recognition network, to obtain a word feature, each description word corresponding to at least one local part of the image to be identified, and to screen the recognition result or the updated recognition result based on the word feature, to obtain a target recognition result corresponding to the image to be identified, the target recognition result comprising at least one candidate image.
Optionally, when screening the recognition result or the updated recognition result based on the word feature to obtain the target recognition result corresponding to the image to be identified, the word screening unit is configured to screen based on the distance between the word feature and the at least one candidate intermediate feature corresponding to the recognition result or the updated recognition result, to obtain the at least one candidate feature whose distance is less than or equal to a preset value, and to take the candidate image corresponding to each obtained candidate intermediate feature as the target recognition result.
Optionally, the apparatus further comprises:
a sample feature extraction unit, configured to input a sample image into the feature extraction network to obtain a sample image feature, the sample image carrying a text description annotation;
a language feature extraction unit, configured to perform feature extraction on the text description annotation based on a language recognition network, to obtain a sample language feature;
a network training unit, configured to train the feature extraction network based on the sample language feature and the sample image feature.
Optionally, the network training unit comprises:
a global probability module, configured to obtain a global relevance probability based on the sample language feature and the sample image feature;
a global loss module, configured to obtain a global loss using a binary cross-entropy loss, based on the global relevance probability and the correlation between the sample image and the text description annotation;
a loss training module, configured to train the feature extraction network based on the global loss.
Optionally, the global probability module is specifically configured to pool the sample image feature and subtract the sample language feature from it, to obtain a difference feature; to square the difference feature element-wise to obtain a joint feature; and to normalize the joint feature, to obtain the global relevance probability indicating the overall correlation.
Optionally, the apparatus further comprises:
a pre-training unit, configured to pre-train the language recognition network based on sample text, the sample text carrying an annotated language feature.
Optionally, the pre-training unit is specifically configured to input the sample text into the language recognition network to obtain a first predicted sample feature, and to adjust the parameters of the language recognition network based on the first predicted sample feature and the annotated language feature.
Optionally, the network training unit further comprises:
a local feature extraction module, configured to perform feature extraction on at least one phrase annotation in the text description annotation based on the language recognition network, to obtain at least one local feature, each phrase annotation describing at least one region in the sample image;
a local loss module, configured to obtain a local loss based on the local feature and the sample image feature;
the loss training module being specifically configured to train the feature extraction network based on the global loss and the local loss.
Optionally, the network training unit further comprises:
a phrase segmentation module, configured to segment the text description annotation to obtain at least one phrase annotation, each phrase annotation containing at least one noun, each obtained phrase annotation corresponding to an annotated probability whose value indicates the probability that the phrase annotation corresponds to the sample image.
Optionally, the phrase segmentation module is specifically configured to perform part-of-speech recognition on each word in the text description annotation, to obtain the part of speech corresponding to each word, and to divide the text description annotation into at least one phrase annotation based on the parts of speech combined with a preset phrase-chunking condition.
Optionally, the local loss module comprises:
a pooling module, configured to perform a pooling operation on the sample image feature, to obtain a global feature map;
a weight module, configured to obtain saliency weights based on the global feature map and the local feature;
a probability prediction module, configured to determine a prediction probability corresponding to each phrase annotation based on the saliency weights and the sample image feature;
a local loss obtaining module, configured to obtain the local loss based on the prediction probability and the annotated probability corresponding to the phrase annotation.
Optionally, the weight module is configured to subtract the local feature from the feature value at each position of the global feature map, to obtain local difference features; to square each element of the local difference features to obtain local joint features; and to obtain the saliency weights based on the local joint features.
Optionally, when obtaining the saliency weights based on the local joint features, the weight module is configured to process the local joint features with a fully connected network, to obtain matching values indicating the degree of matching between the phrase annotation and the sample image, and to normalize, for each phrase annotation, the vector formed by the matching values at all positions of the corresponding global feature map, to obtain the saliency weight corresponding to each phrase annotation.
Optionally, the probability prediction module is configured to multiply the feature value at each position of the sample image feature by the corresponding saliency weight, to obtain a weighted feature vector set corresponding to each phrase annotation; to add the vectors in the weighted feature vector set, to obtain the local visual feature corresponding to the phrase annotation in the sample image; to obtain the prediction probability of each word in the phrase annotation based on the local visual feature; and to determine the prediction probability corresponding to the phrase annotation based on the prediction probabilities of the words in the phrase annotation.
Optionally, when obtaining the prediction probability of each word in the phrase annotation based on the local visual feature, the probability prediction module is configured to decompose the phrase annotation into a word sequence and input the local visual feature into a long short-term memory (LSTM) network, determining at least one hidden variable, each word corresponding to a feature vector; at each time step, the hidden variable of the previous time step is combined with the feature vector of the current word through the LSTM network, to obtain the hidden variable of the next time step; a linear mapping is performed based on the at least one hidden variable, to obtain a predicted vector for each word; and the prediction probability of each word in the phrase annotation is obtained based on the predicted vectors.
Optionally, when determining the prediction probability corresponding to the phrase annotation based on the prediction probabilities of the words in the phrase annotation, the probability prediction module is configured to take the product of the prediction probabilities of the words in the phrase annotation as the prediction probability of the phrase annotation.
Optionally, the loss training module is specifically configured to sum the global loss and the local loss to obtain a sum loss, and to adjust the parameters of the feature extraction network based on the sum loss.
Optionally, the apparatus further comprises:
an identity sample unit, configured to input an identity sample image into the feature extraction network to obtain a sample prediction feature, the identity sample image carrying an annotated identity feature;
a prediction recognition unit, configured to process the sample prediction feature through a pooling layer and a fully connected layer, to obtain a predicted identity feature;
a parameter adjustment unit, configured to adjust the parameters of the feature extraction network, the pooling layer, and the fully connected layer based on the annotated identity feature and the predicted identity feature.
According to another aspect of the embodiments of the present application, an electronic device is provided, comprising a processor, the processor including the person re-identification apparatus of any one of the above.
According to yet another aspect of the embodiments of the present application, an electronic device is provided, comprising: a memory for storing executable instructions;
and a processor for communicating with the memory to execute the executable instructions so as to complete the operations of the person re-identification method of any one of the above.
According to another aspect of the embodiments of the present application, a computer-readable storage medium is provided for storing computer-readable instructions which, when executed, perform the operations of the person re-identification method of any one of the above.
According to another aspect of the embodiments of the present application, a computer program product is provided, comprising computer-readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the person re-identification method of any one of the above.
Based on the above embodiments of the present application, a person re-identification method and apparatus, an electronic device, a storage medium, and a program product are provided. An image to be identified and a candidate image set are obtained; feature extraction is performed on the image to be identified and on each candidate image in the candidate image set using a feature extraction network, to obtain an intermediate feature to be identified corresponding to the image to be identified and candidate intermediate features corresponding to the candidate images, the feature extraction network being obtained through cross-modal training on image features and language descriptions; and a recognition result corresponding to the image to be identified, comprising at least one candidate image, is obtained from the candidate image set based on the intermediate feature to be identified and the candidate intermediate features. Performing person re-identification with a feature extraction network obtained through such cross-modal training exploits the natural correspondence between an image and the language describing it, and further mines the correlation between local image regions and noun phrases through phrase reconstruction, which strengthens the constraints on image feature learning, improves the quality of the visual features used for person re-identification, and thereby improves its accuracy.
The technical solutions of the present application are described in further detail below through the accompanying drawings and embodiments.
Brief description of the drawings
The accompanying drawings, which constitute a part of the specification, illustrate embodiments of the application and, together with the description, serve to explain the principles of the application.
The application can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of one embodiment of the person re-identification method of the present application.
Fig. 2 is a flowchart of step 130 in one embodiment of the person re-identification method of the present application.
Fig. 3 is a flowchart of another embodiment of the person re-identification method of the present application.
Fig. 4 is a flowchart of step 350 in another embodiment of the person re-identification method of the present application.
Fig. 5 is a flowchart of the training of the feature extraction network in one embodiment of the person re-identification method of the present application.
Fig. 6 is a schematic flowchart of an example of noun phrase extraction in an embodiment of the present application.
Fig. 7 is a structural schematic diagram of an example of associating phrase annotations with image regions through reconstruction in the present disclosure.
Fig. 8 is a structural schematic diagram of an embodiment of the person re-identification apparatus of the present application.
Fig. 9 is a structural schematic diagram of an electronic device suitable for implementing the terminal device or server of an embodiment of the present application.
Detailed description
Various exemplary embodiments of the application are now described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the application.
It should also be understood that, for ease of description, the sizes of the various parts shown in the drawings are not drawn according to actual proportional relationships.
The following description of at least one exemplary embodiment is in fact merely illustrative and in no way serves as any limitation on the application or its application or use.
Techniques, methods, and apparatus known to a person of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and apparatus should be considered part of the specification.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further discussed in subsequent drawings.
To further increase the discriminative power of features, some methods have begun to use auxiliary data, such as camera numbers, human pose, pedestrian attributes, and infrared or depth images. These methods either need to rely on additional equipment or models during testing, such as infrared or depth cameras and pose estimation models, increasing the monetary and time cost of running the algorithm, or define complicated annotation formats for the auxiliary data, as with pedestrian attributes, which require annotators to check dozens of attributes one by one, increasing the human cost of data annotation. In view of the above problems, the embodiments of the present disclosure use natural language as auxiliary training data to improve the discriminability and interpretability of image features.
Fig. 1 is a flowchart of one embodiment of the person re-identification method of the present application. As shown in Fig. 1, the method of this embodiment includes:
Step 110: obtain an image to be identified and a candidate image set.
Here, the image to be identified may be a pedestrian image that needs to be re-identified, and the candidate image set may include at least one candidate image; this embodiment needs to obtain from the candidate image set at least one candidate image matching the image to be identified.
Step 120: perform feature extraction on the image to be identified and on each candidate image in the candidate image set using a feature extraction network, to obtain an intermediate feature to be identified corresponding to the image to be identified and a candidate intermediate feature corresponding to each candidate image, where the feature extraction network is obtained through cross-modal training on image features and language descriptions.
Since auxiliary data can be used in the application of deep neural networks to further increase the discriminative power of features, this embodiment uses natural language as auxiliary training data to improve the discriminability and interpretability of image features. Optionally, feature extraction is performed on the image to be identified and on each candidate image in the candidate image set by a feature extraction network obtained through cross-modal training on images and language, so that the image feature encoding produced by the feature extraction network is improved.
Step 130: obtain a recognition result corresponding to the image to be identified from the candidate image set based on the intermediate feature to be identified and the candidate intermediate features.
Here, the recognition result includes at least one candidate image.
According to the person re-identification method provided by the above embodiment of the present application, an image to be identified and a candidate image set are obtained; feature extraction is performed on them using a feature extraction network obtained through cross-modal training on image features and language descriptions; and a recognition result comprising at least one candidate image is obtained from the candidate image set based on the resulting intermediate features. Performing person re-identification with such a cross-modally trained feature extraction network exploits the natural correspondence between an image and the language describing it, and further mines the correlation between local image regions and noun phrases through phrase reconstruction, which strengthens the constraints on image feature learning, improves the quality of the visual features used for person re-identification, and thereby improves its accuracy.
Fig. 2 is a flowchart of step 130 in one embodiment of the person re-identification method of the present application. As shown in Fig. 2, in one or more optional embodiments, step 130 may include:
Step 1302: pass the intermediate feature to be identified and the candidate intermediate features through an average pooling layer and a fully connected layer, respectively, to obtain a feature to be identified and candidate features.
In this embodiment, what the feature extraction network produces are intermediate features; these must further undergo average pooling and fully connected processing before the visual features describing the image to be identified and the candidate image set (the feature to be identified and the candidate features) are obtained.
Step 1304: obtain the recognition result corresponding to the image to be identified from the candidate image set based on the feature to be identified and the candidate features.
In this embodiment, the similarity between the feature to be identified and each candidate feature is computed, and the recognition result of the image to be identified can be determined based on the magnitude of the similarity, realizing person re-identification. For example, the distance between the feature to be identified and a candidate feature (e.g., cosine distance or Euclidean distance) may be computed and used as the similarity between the image to be identified and each candidate image; in other embodiments, the similarity may also be computed in other ways, which are not limited here.
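For illustration only, a minimal retrieval sketch under assumed conventions: PyTorch tensors and cosine similarity over the pooled features; the function name, dimensions, and top-k cutoff are hypothetical, not the patent's specification.

```python
import torch
import torch.nn.functional as F

def rank_candidates(query_feat, candidate_feats, top_k=5):
    """Ranks candidate images by cosine similarity to the query feature."""
    q = F.normalize(query_feat, dim=0)        # (D,) feature to be identified
    c = F.normalize(candidate_feats, dim=1)   # (N, D) candidate features
    similarity = c @ q                        # cosine similarity per candidate
    scores, indices = similarity.topk(min(top_k, c.size(0)))
    return scores, indices                    # most similar candidates first

query = torch.randn(512)
gallery = torch.randn(100, 512)
scores, idx = rank_candidates(query, gallery)
```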
Fig. 3 is a flowchart of another embodiment of the person re-identification method of the present application. As shown in Fig. 3, one or more optional embodiments further include:
Step 340: perform feature extraction on descriptive text related to the image to be identified based on a language recognition network, to obtain a language feature.
In practical applications, when searching for someone (e.g., a lost child), a language description may be available in addition to the provided image. The language description content can assist in quickly screening out non-matching recognition results, improving the efficiency of person re-identification; the language description may describe the image as a whole, or may describe at least one local part of the image.
Step 350: screen the recognition result based on the language feature, to obtain an updated recognition result corresponding to the image to be identified, the updated recognition result comprising at least one candidate image.
Fig. 4 is a flowchart of step 350 in another embodiment of the person re-identification method of the present application. As shown in Fig. 4, in one or more optional embodiments, step 350 may include:
Step 3502: screen based on the distance between the language feature and the at least one candidate intermediate feature corresponding to the recognition result.
Language descriptions and images are two different forms of expression, so screening images based on a language description requires some processing. In this embodiment, the language recognition network and the feature extraction network respectively produce the language feature and the candidate intermediate features, and the distance between features (e.g., Euclidean distance or cosine distance) determines the similarity between the language description and an image, thereby realizing language-based screening of images.
Step 3504: obtain the at least one candidate intermediate feature whose distance is less than or equal to a preset value, and take the candidate image corresponding to each obtained candidate intermediate feature as the updated recognition result.
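A minimal sketch of this threshold-based screening step, assuming Euclidean distance over PyTorch feature vectors; the threshold value and all names are illustrative only.

```python
import torch

def screen_by_language(language_feat, candidate_feats, candidate_ids, preset_value=1.0):
    """Keeps candidates whose Euclidean distance to the language feature
    is less than or equal to a preset value."""
    dists = torch.cdist(language_feat.unsqueeze(0), candidate_feats).squeeze(0)  # (N,)
    return [cid for cid, d in zip(candidate_ids, dists.tolist()) if d <= preset_value]

ids = ["cand_0", "cand_1", "cand_2"]
updated = screen_by_language(torch.randn(512), torch.randn(3, 512), ids)
# 'updated' is the updated recognition result (candidate images kept after screening)
```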
One or more optional embodiments further include:
performing feature extraction on at least one description word related to the image to be identified based on the language recognition network, to obtain a word feature, each description word corresponding to at least one local part of the image to be identified.
When performing person re-identification, a description of the whole image may not be available, and only local description words can be given; for example, for a pedestrian, a description of the clothes he or she wears. In this case, the language recognition network of this embodiment obtains the word feature corresponding to at least one description word, and screening the recognition result based on the word feature can improve the efficiency of person re-identification.
The recognition result or the updated recognition result is screened based on the word feature, to obtain a target recognition result corresponding to the image to be identified, the target recognition result comprising at least one candidate image.
The recognition result may be screened by the word feature, or the updated recognition result may be screened. Screening by word features realizes image screening based on descriptions of local content in the image, making language-based image screening more convenient.
Optionally, screening the recognition result or the updated recognition result based on the word feature to obtain the target recognition result corresponding to the image to be identified comprises:
screening based on the distance between the word feature and the at least one candidate intermediate feature corresponding to the recognition result or the updated recognition result.
Optionally, the smaller the distance between two features (e.g., Euclidean distance or cosine distance), the greater the correlation between the corresponding words or images; therefore, the recognition result or the updated recognition result is screened by the distances to the candidate intermediate features.
The at least one candidate feature whose distance is less than or equal to a preset value is obtained, and the candidate image corresponding to each obtained candidate intermediate feature is taken as the target recognition result.
There may be at least one description word corresponding to the image to be identified; accordingly, at least one word feature is obtained, and screening the candidate intermediate features by their distance to each word feature can speed up person re-identification.
Fig. 5 is a flowchart of the training of the feature extraction network in one embodiment of the person re-identification method of the present application. As shown in Fig. 5, in this embodiment, obtaining the feature extraction network through cross-modal training on image features and language descriptions includes:
Step 510: input a sample image into the feature extraction network, to obtain a sample image feature.
Here, the sample image carries a text description annotation.
Step 520: perform feature extraction on the text description annotation based on a language recognition network, to obtain a sample language feature.
In one or more optional embodiments, before step 520 is performed, the language recognition network may also be pre-trained based on sample text, the sample text carrying an annotated language feature. Through pre-training, the text feature extraction capability of the language recognition network can be improved, so that the features extracted by the language recognition network express the text more accurately, providing more accurate supervision for training the feature extraction network.
Optionally, the pre-training process may include: inputting the sample text into the language recognition network to obtain a first predicted sample feature;
adjusting the parameters of the language recognition network based on the first predicted sample feature and the annotated language feature.
The language recognition network employed in this embodiment may be any existing neural network capable of performing feature extraction on text; its specific structure is not restricted here. Its training is similar to general neural network training and may include: obtaining a loss based on the predicted sample feature and the annotated language feature, and adjusting the parameters of the language recognition network by backpropagating gradients based on the loss.
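A sketch of one plausible pre-training setup: an LSTM text encoder regressed onto annotated language features with an MSE loss. The loss choice, architecture, vocabulary size, and dimensions are all assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Embeds a token sequence and encodes it with an LSTM into a fixed-length feature."""
    def __init__(self, vocab_size=10000, embed_dim=300, feat_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, feat_dim, batch_first=True)

    def forward(self, tokens):                    # (B, T) token ids
        _, (h, _) = self.lstm(self.embed(tokens))
        return h[-1]                              # (B, feat_dim) predicted sample feature

encoder = TextEncoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)
tokens = torch.randint(0, 10000, (8, 20))         # sample text (token ids)
target = torch.randn(8, 512)                      # annotated language feature (assumed given)
optimizer.zero_grad()
loss = nn.MSELoss()(encoder(tokens), target)      # loss between prediction and annotation
loss.backward()                                   # backpropagate gradients
optimizer.step()                                  # adjust the language network's parameters
```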
Step 530: train the feature extraction network based on the sample image feature and the sample language feature.
According to the above embodiment of the present application, training the feature extraction network in combination with descriptive text provides richer annotation information for the sample images and improves the accuracy of the features extracted by the feature extraction network.
Optionally, step 530 may include: obtaining a global relevance probability based on the sample language feature and the sample image feature;
obtaining a global loss using a binary cross-entropy loss, based on the global relevance probability and the correlation between the sample image and the text description annotation;
training the feature extraction network based on the global loss.
The obtained overall correlation is supervised using a binary cross-entropy loss: the joint feature of a related image-language pair is pushed toward 1, while the joint feature of an unrelated image-language pair is pushed toward 0.
Optionally, obtaining the global relevance probability based on the sample language feature and the sample image feature may include:
pooling the sample image feature and subtracting the sample language feature from it, to obtain a difference feature;
squaring the difference feature element-wise, to obtain a joint feature;
normalizing the joint feature, to obtain the global relevance probability indicating the overall correlation.
When the sample image feature Ψ(I) and the sample language feature θ_g(T) describe the same target (e.g., the same pedestrian), they can be associated through that target, and the correlation between Ψ(I) and θ_g(T) is learned under supervision in a discriminative manner. The supervised learning process can be the following steps:
Jointly represent Ψ(I) and θ_g(T): average pooling is applied to Ψ(I) to obtain a vector ψ̄(I). The difference of the two vectors θ_g(T) and ψ̄(I) is first taken to obtain a difference vector; element-wise squaring is then applied to each dimension of the difference vector, yielding the joint representation vector (joint feature) x̃.
x̃ can be obtained from formula (1):
x̃ = (ψ̄(I) − θ_g(T)) ⊙ (ψ̄(I) − θ_g(T))    (1)
where ⊙ denotes element-wise vector multiplication; multiplying two identical vectors element-wise gives the square of the vector. The purpose of x̃ is to express the correlation of the two vectors, for further predicting whether they are related.
The joint representation vector (joint feature) x̃ is linearly mapped, and the result is mapped into the range (0, 1), yielding the overall correlation of Ψ(I) and θ_g(T).
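A minimal sketch of this global branch under the definitions above, assuming PyTorch; the feature dimension and the sigmoid used for the (0, 1) mapping are assumptions consistent with the description.

```python
import torch
import torch.nn as nn

class GlobalRelevance(nn.Module):
    """Computes the global relevance probability from an image feature map
    and a pooled sentence feature, following formula (1)."""
    def __init__(self, dim=512):
        super().__init__()
        self.fc = nn.Linear(dim, 1)                   # linear mapping to a scalar

    def forward(self, img_map, sent_feat):            # (B, D, H, W), (B, D)
        pooled = img_map.mean(dim=(2, 3))             # average pooling of Psi(I)
        diff = pooled - sent_feat                     # difference feature
        joint = diff * diff                           # element-wise square: joint feature
        return torch.sigmoid(self.fc(joint)).squeeze(1)  # probability in (0, 1)

model = GlobalRelevance()
prob = model(torch.randn(8, 512, 24, 8), torch.randn(8, 512))
labels = torch.randint(0, 2, (8,)).float()            # 1 if the text describes the image
global_loss = nn.BCELoss()(prob, labels)              # binary cross-entropy supervision
```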
One or more optional embodiments further include:
performing feature extraction on at least one phrase annotation in the text description annotation based on the language recognition network, to obtain at least one local feature, each phrase annotation describing at least one region in the sample image.
Here, the language recognition network used may share parameters with the language recognition network that processes the text description annotation to obtain the sample language feature, or it may be a different language recognition network. Feature extraction is performed on each phrase annotation by the language recognition network, giving a corresponding local feature; each local feature corresponds to a region in the sample image.
A local loss is obtained based on the local feature and the sample image feature.
During network training, the local features obtained by the language recognition network are incorporated, where each local feature corresponds to a region in the sample image; optionally, the local loss is obtained using a binary cross-entropy loss.
Training the feature extraction network based on the global loss then comprises:
training the feature extraction network based on the global loss and the local loss.
The extraction of text content may include the following steps: a passage of raw text related to the image is pre-processed. The raw text used for training may be collected from the web, and in research and practical applications a public dataset may be used.
Optionally, before performing feature extraction on the text description annotation based on the language recognition network to obtain the sample language feature, the method further comprises:
segmenting the text description annotation, to obtain at least one phrase annotation, each phrase annotation containing at least one noun.
Here, each obtained phrase annotation corresponds to an annotated probability whose value indicates the probability that the phrase annotation corresponds to the sample image.
For a whole passage of text describing a picture, the Natural Language Toolkit (NLTK) can be used to separate each sentence from the passage and to assign a part-of-speech label to every word of each sentence, and phrase chunking is used to screen, with emphasis, noun phrases with adjectives and phrases containing multiple nouns connected by prepositions.
Optionally, segmenting the text description annotation to obtain the at least one phrase annotation comprises:
performing part-of-speech recognition on each word in the text description annotation, to obtain the part of speech corresponding to each word;
dividing the text description annotation into at least one phrase annotation based on the parts of speech combined with a preset phrase-chunking condition.
Fig. 6 is a schematic flowchart of an example of noun phrase extraction in an embodiment of the present application. As shown in Fig. 6, the text description annotation is part-of-speech tagged (e.g., noun, adjective, preposition); the tagged words are segmented based on preset rules, to obtain at least two phrase annotations; and the processed language content is encoded: the global language description text and the phrase annotations are separately encoded with an LSTM and mapped to feature vectors of a specific length, denoted θ_g(T) and θ_l(P), respectively.
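For illustration, a possible NLTK-based extraction following the steps above. The chunk grammar is a hypothetical phrase-chunking condition, not the patent's rule set, and the punkt and tagger models are assumed to be downloaded.

```python
import nltk  # assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

# Hypothetical chunking condition: first noun phrases joined by a preposition,
# then adjective + noun phrases.
GRAMMAR = r"""
  NP: {<JJ>*<NN.*>+<IN><JJ>*<NN.*>+}
      {<JJ>*<NN.*>+}
"""

def extract_noun_phrases(sentence):
    """POS-tags a sentence and chunks it into noun-phrase annotations."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))   # part-of-speech labels
    tree = nltk.RegexpParser(GRAMMAR).parse(tagged)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == "NP"]

phrases = extract_noun_phrases("The woman wears a red jacket with black trousers.")
# e.g. ['woman', 'red jacket with black trousers']
```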
Optionally, determining the local loss based on the local language feature and the sample image feature comprises:
performing a pooling operation on the sample image feature, to obtain a global feature map;
obtaining saliency weights based on the global feature map and the local language feature;
determining a prediction probability corresponding to each phrase annotation based on the saliency weights and the sample image feature;
obtaining the local loss based on the prediction probability and the annotated probability corresponding to the phrase annotation.
Specifically, obtaining the saliency weights based on the global feature map and the local language feature may include: subtracting the local feature from the feature value at each position of the global feature map, to obtain local difference features;
squaring each element of the local difference features, to obtain local joint features; and obtaining the saliency weights based on the local joint features.
A noun phrase usually corresponds to certain regions of the picture. Fig. 7 is a structural schematic diagram of an example of associating phrase annotations with image regions through reconstruction in the present disclosure. As shown in Fig. 7, a bidirectional association between phrase annotations and image regions is established by way of reconstruction; the process is divided into the following steps:
Generating the saliency weights: the intermediate feature Ψ(I) is pooled to reduce the complexity of object localization. For the feature ψ_k(I_n) at each position of the pooled CNN feature map (the region marked in red in the figure), the noun phrase feature θ_l(P) is combined with it.
Optionally, obtaining the saliency weights based on the local joint features comprises:
processing the local joint features with a fully connected network, to obtain a matching value indicating the degree of matching between the phrase annotation and the sample image;
normalizing the vector formed by the matching values at all positions of the global feature map corresponding to each phrase annotation, to obtain the saliency weight corresponding to each phrase annotation.
Specifically, the executed steps may include: (1) subtracting the two vectors to obtain a difference vector; (2) squaring the element of each dimension of the difference vector, to obtain a new vector; (3) passing this vector through a fully connected network, to obtain a scalar measuring the matching degree of the sample image and the phrase annotation; (4) for the scalars generated at all positions, applying softmax normalization so that they sum to one, thereby generating a value for each position. This value, between zero and one, is the saliency weight. Note that the intermediate feature contains a feature vector at each position, and the saliency weight corresponding to each position is the saliency weight of the intermediate feature.
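The four steps above, expressed as a sketch in PyTorch with assumed dimensions; the fully connected network is reduced to a single linear layer for brevity, which is an assumption rather than the patent's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyWeights(nn.Module):
    """Scores each spatial position of a feature map against a phrase feature
    and normalizes the scores into saliency weights."""
    def __init__(self, dim=512):
        super().__init__()
        self.fc = nn.Linear(dim, 1)                       # matching value per position

    def forward(self, feat_map, phrase_feat):             # (B, D, H, W), (B, D)
        feats = feat_map.flatten(2).transpose(1, 2)       # (B, H*W, D)
        diff = feats - phrase_feat[:, None, :]            # (1) local difference features
        joint = diff * diff                               # (2) element-wise square
        match = self.fc(joint).squeeze(-1)                # (3) matching values, (B, H*W)
        return F.softmax(match, dim=1)                    # (4) weights sum to one per image

weights = SaliencyWeights()(torch.randn(2, 512, 24, 8), torch.randn(2, 512))
```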
Optionally, determining the prediction probability corresponding to each phrase annotation based on the saliency weights and the sample image feature includes:
multiplying the feature value at each position of the sample image feature by the corresponding saliency weight, to obtain a set of weighted feature vectors for each phrase annotation;
adding up the vectors in the set of weighted feature vectors to obtain the local visual feature corresponding to the phrase annotation in the sample image;
obtaining the prediction probability of each word in the phrase annotation based on the local visual feature;
determining the prediction probability corresponding to the phrase annotation based on the prediction probabilities of the words in it.
Obtaining the visual feature relevant to a noun phrase: the feature vector at each position of the mid-level feature Ψ(I) is multiplied by its saliency weight, yielding a weighted feature vector per position; the weighted vectors of all positions are then added to obtain the visual feature φ̂(In, P) of the image region relevant to the given noun phrase (the weights of relevant regions will be high). The visual feature is computed as in formula (2):
φ̂(In, P) = Σk rk ψk(In)    (2)
where rk is the relevance (saliency) weight, k indexes the position, P is the given noun phrase, I is the picture, and n is the index of the picture.
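Continuing the sketch above, and under the same assumed shapes, formula (2) reduces to a single weighted sum:

```python
# Formula (2): saliency-weighted sum of position features -> the visual
# feature of the image region relevant to the given noun phrase.
weights = SaliencyWeights(pos_feats.size(-1))(pos_feats, phrase_feat)  # (K,)
local_visual = (weights.unsqueeze(-1) * pos_feats).sum(dim=0)          # (d,)
```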
Optionally, obtaining the prediction probability of each word in the phrase annotation based on the local visual feature includes:
decomposing the phrase annotation into a word sequence, feeding the local visual feature into a long short-term memory (LSTM) network, and determining at least one hidden state and a feature vector for each word;
at each time step, combining the hidden state of the previous step with the feature vector of the current word through the LSTM to obtain the hidden state of the next step;
applying a linear mapping to the at least one hidden state to obtain a prediction vector for each word;
obtaining the prediction probability of each word in the phrase annotation based on the prediction vectors.
Reconstructing the noun phrase from the obtained visual feature: a phrase reconstruction model is built from a long short-term memory network (LSTM) and a linear mapping. The relevant visual feature φ̂(In, P) is input first; each subsequent word of the phrase is then predicted from the previous word. The probability of a word is obtained by linearly mapping the LSTM hidden state onto a given vocabulary and applying softmax normalization. The first and last input words are special symbols marking the beginning and the end of the phrase; at each step, the hidden state of the previous step and the feature vector of the current word are combined by the LSTM to yield the hidden state of the next step.
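A hedged sketch of such a reconstruction model, assuming batched word-id sequences that already carry the begin/end markers; all names and dimensions are placeholders:

```python
import torch
import torch.nn as nn

class PhraseReconstructor(nn.Module):
    """Feed the local visual feature first, then predict each next word."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTMCell(dim, dim)
        self.out = nn.Linear(dim, vocab_size)  # hidden state -> vocabulary

    def forward(self, visual_feat: torch.Tensor, word_ids: torch.Tensor):
        # visual_feat: (B, d); word_ids: (B, T) with <bos>/<eos> markers
        h = torch.zeros_like(visual_feat)
        c = torch.zeros_like(visual_feat)
        h, c = self.lstm(visual_feat, (h, c))   # visual feature goes in first
        log_prob = visual_feat.new_zeros(word_ids.size(0))
        for t in range(word_ids.size(1) - 1):
            h, c = self.lstm(self.embed(word_ids[:, t]), (h, c))
            logp = torch.log_softmax(self.out(h), dim=-1)            # (B, V)
            log_prob = log_prob + logp.gather(1, word_ids[:, t + 1, None]).squeeze(1)
        return log_prob  # log of the product of per-word probabilities
```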
Optionally, determining the prediction probability corresponding to the phrase annotation based on the prediction probabilities of its words includes:
taking the product of the prediction probabilities of the words in the phrase annotation as the prediction probability of the phrase annotation.
In one or more optional embodiments, training the feature extraction network based on the global loss and the local loss includes:
summing the global loss and the local loss to obtain a summed loss.
Since the textual description annotation yields two losses through the language recognition network, the global loss and the local loss correspond respectively to the full description annotation and the annotated phrases, which describe the sample image as a whole and in part; training the feature extraction network on the summed loss can accelerate training.
The parameters of the feature extraction network are adjusted based on the summed loss.
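A sketch of the corresponding update step; the optimizer and loss tensors are placeholders carried over from the surrounding description:

```python
# Train the feature extraction network on the summed loss; "optimizer" is
# assumed to wrap the feature extraction network's parameters.
loss = global_loss + local_loss
optimizer.zero_grad()
loss.backward()     # gradients of the summed loss
optimizer.step()    # adjust the feature extraction network's parameters
```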
Training the feature extraction network on the sum of the global loss and the local loss solves the problem of establishing, given an image and its corresponding description, the correspondence between noun phrases in the description and sub-regions of the image. The established local correspondences are then used to further constrain the encoding of the image features at each position.
In one or more optional embodiments, the method further includes:
feeding an identity sample image into the feature extraction network to obtain a sample prediction feature, the identity sample image including an annotated identity feature.
This embodiment needs not only the global feature of the target but also the spatial information of the target features, so that the content of local image regions can be mined.
The sample prediction feature is processed through a pooling layer and a fully connected layer to obtain a predicted identity feature.
Optionally, the features are extracted with a classical CNN. Such networks have not only been proved to have strong object-classification ability, but also preserve part of the spatial information during feature extraction: for example, the features encoding the jacket and the trousers are encoded in feature vectors at different positions. This spatial information corresponds to the actual spatial layout of objects and can provide recognition cues. Taking ResNet-50 as an example, for an input pedestrian image (e.g., of size 256x128), the 8x4 feature map before average pooling is labeled Ψ(I) and serves as the mid-level feature that interacts with the language features; applying pooling and a fully connected mapping to Ψ(I) yields the visual feature of the target, labeled φ(I). This feature encodes high-level semantic information while also retaining spatial position information.
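A sketch of this backbone usage with torchvision's ResNet-50; the projection width and the input size are assumptions for illustration:

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights="IMAGENET1K_V1")
stem = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
proj = nn.Linear(2048, 512)  # the "fully connected mapping", width assumed

images = torch.randn(2, 3, 256, 128)  # pedestrian crops, H x W assumed
psi = stem(images)                    # mid-level feature Psi(I): (2, 2048, 8, 4)
phi = proj(psi.mean(dim=(2, 3)))      # pooled visual feature phi(I): (2, 512)
```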
The parameters of the feature extraction network, the pooling layer and the fully connected layer are adjusted based on the annotated identity feature and the predicted identity feature.
The above embodiments of the present disclosure aim to improve the quality of image feature encoding with the help of auxiliary language description data. The core inventive point is a mechanism that associates language descriptions with image features, so that language information can guide the learning of image features and the visual features focus on encoding image appearance that is significant for discrimination. Based on the identity information of pedestrians, a discriminative global image-language association policy is proposed, which pulls together the joint image-language representations belonging to the same individual and pushes apart those belonging to different individuals. At the same time, the above embodiments also exploit the natural correspondence between an image and the language describing it, further mining the correlation between local image regions and noun phrases by means of phrase reconstruction and strengthening the constraints on image feature learning. The proposed technique can not only improve the quality of the visual features for pedestrian re-identification, but can also potentially serve tasks such as image-language cross-modal retrieval and detecting image regions according to noun phrases.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk or an optical disk.
Fig. 8 is a structural diagram of an embodiment of the pedestrian re-identification apparatus of the present application. The apparatus of this embodiment may be used to implement the above method embodiments of the present application. As shown in Fig. 8, the apparatus of this embodiment includes:
an image acquisition unit 81, configured to obtain an image to be recognized and a candidate image set;
a feature extraction unit 82, configured to perform feature extraction on the image to be recognized and each candidate image in the candidate image set using a feature extraction network, to obtain the intermediate feature to be recognized corresponding to the image to be recognized and the candidate intermediate feature corresponding to each candidate image, the feature extraction network being obtained through cross-modal training on image features and language descriptions;
a result recognition unit 83, configured to obtain, from the candidate image set, the recognition result corresponding to the image to be recognized based on the intermediate feature to be recognized and the candidate intermediate features.
The recognition result includes at least one candidate image.
The pedestrian re-identification apparatus provided by the above embodiments of the present application exploits the natural correspondence between an image and the language describing it, further mines the correlation between local image regions and noun phrases by means of phrase reconstruction, and strengthens the constraints on image feature learning, thereby improving the quality of the visual features for pedestrian re-identification and, in turn, the accuracy of pedestrian re-identification.
In one or more optional embodiments, the result recognition unit 83 is configured to pass the intermediate feature to be recognized and the candidate intermediate features through an average pooling layer and a fully connected layer, respectively, to obtain a feature to be recognized and candidate features, and to obtain, from the candidate image set, the recognition result corresponding to the image to be recognized based on the feature to be recognized and the candidate features.
In one or more optional embodiments, the apparatus of this embodiment may also include:
a language screening unit, configured to perform feature extraction on descriptive text relevant to the image to be recognized based on a language recognition network, to obtain a language feature; and to screen the recognition result based on the language feature, to obtain an updated recognition result corresponding to the image to be recognized, the updated recognition result including at least one candidate image.
Optionally, when screening the recognition result based on the language feature to obtain the updated recognition result corresponding to the image to be recognized, the language screening unit is configured to screen based on the distance between the language feature and the at least one candidate intermediate feature corresponding to the recognition result, obtain the at least one candidate intermediate feature whose distance is less than or equal to a preset value, and take the candidate image corresponding to the obtained candidate intermediate feature as the updated recognition result.
In one or more optional embodiments, the apparatus of this embodiment may also include:
a word screening unit, configured to perform feature extraction on at least one descriptive word relevant to the image to be recognized based on the language recognition network, to obtain a word feature, each descriptive word corresponding to at least one part of the image to be recognized; and to screen the recognition result or the updated recognition result based on the word feature, to obtain a target recognition result corresponding to the image to be recognized, the target recognition result including at least one candidate image.
Optionally, when screening the recognition result or the updated recognition result based on the word feature to obtain the target recognition result corresponding to the image to be recognized, the word screening unit is configured to screen based on the distance between the word feature and the at least one candidate intermediate feature corresponding to the recognition result or the updated recognition result, obtain the at least one candidate intermediate feature whose distance is less than or equal to a preset value, and take the candidate image corresponding to the obtained candidate intermediate feature as the target recognition result.
In one or more optional embodiments, the apparatus of this embodiment further includes:
a sample feature extraction unit, configured to feed a sample image into the feature extraction network to obtain a sample image feature, the sample image including a textual description annotation;
a language feature extraction unit, configured to perform feature extraction on the textual description annotation based on a language recognition network, to obtain a sample language feature;
a network training unit, configured to train the feature extraction network based on the sample language feature and the sample image feature.
Optionally, the network training unit includes:
a global probability module, configured to obtain a global relevance probability based on the sample language feature and the sample image feature;
a global loss module, configured to obtain the global loss using a binary cross-entropy loss, based on the global relevance probability and the relevance between the sample image and the textual description annotation;
a loss training module, configured to train the feature extraction network based on the global loss.
Optionally, the global probability module is specifically configured to pool the sample image feature and subtract it from the sample language feature to obtain a difference feature, square the difference feature element-wise to obtain a joint feature, and perform normalization processing on the joint feature to obtain a global relevance probability indicating overall relevance.
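A minimal sketch of this global association computation; the use of a sigmoid for the "normalization processing" and of a fully connected layer are assumptions of the sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def global_association_loss(img_feat: torch.Tensor,
                            lang_feat: torch.Tensor,
                            fc: nn.Linear,
                            is_match: torch.Tensor) -> torch.Tensor:
    # img_feat: pooled image feature (B, d); lang_feat: (B, d);
    # fc: nn.Linear(d, 1); is_match: (B,) -- 1 if the description
    # actually annotates the image, 0 otherwise.
    joint = (img_feat - lang_feat).pow(2)        # difference, squared per element
    prob = torch.sigmoid(fc(joint)).squeeze(-1)  # global relevance probability
    return F.binary_cross_entropy(prob, is_match.float())  # binary cross-entropy
```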
In one or more optional embodiments, the apparatus of this embodiment further includes:
a pre-training unit, configured to pre-train the language recognition network based on sample text, the sample text including an annotated language feature.
Optionally, the pre-training unit is specifically configured to feed the sample text into the language recognition network to obtain a first predicted sample feature, and to adjust the parameters of the language recognition network based on the first predicted sample feature and the annotated language feature.
Optionally, the network training unit further includes:
a local feature extraction module, configured to perform feature extraction on at least one phrase annotation in the textual description annotation based on the language recognition network, to obtain at least one local feature, each phrase annotation being used to describe at least one region of the sample image;
a local loss module, configured to obtain the local loss based on the local features and the sample image feature;
the loss training module being specifically configured to train the feature extraction network based on the global loss and the local loss.
Optionally, the network training unit further includes:
a phrase segmentation module, configured to segment the textual description annotation to obtain at least one phrase annotation, each phrase annotation including at least one noun, the obtained phrase annotations corresponding to annotation probabilities, each probability value indicating the probability that the phrase annotation corresponds to the sample image.
Optionally, the phrase segmentation module is specifically configured to perform part-of-speech recognition on each word in the textual description annotation to obtain the part of speech corresponding to each word, and to segment the textual description annotation into at least one phrase annotation based on the parts of speech combined with preset phrase-segmentation conditions.
Optionally, the local loss module includes:
a pooling module, configured to perform a pooling operation on the sample image feature to obtain a global feature map;
a weight module, configured to obtain saliency weights based on the global feature map and the local features;
a probability prediction module, configured to determine the prediction probability corresponding to each phrase annotation based on the saliency weights and the sample image feature;
a local loss obtaining module, configured to obtain the local loss based on the prediction probabilities and the annotation probabilities corresponding to the phrase annotations.
Optionally, the weight module is configured to subtract the local feature from the feature value at each position of the global feature map to obtain local difference features, square each element of the local difference features to obtain local joint features, and obtain the saliency weights based on the local joint features.
Optionally, when obtaining the saliency weights based on the local joint features, the weight module is configured to process the local joint features with a fully connected network to obtain a matching value representing the degree of match between a phrase annotation and the sample image, and to normalize the vector formed by the matching values at all positions of the global feature map corresponding to each phrase annotation, to obtain the saliency weights corresponding to each phrase annotation.
In one or more optional embodiments, the probability prediction module is configured to multiply the feature value at each position of the sample image feature by the corresponding saliency weight to obtain a set of weighted feature vectors for each phrase annotation; add up the vectors in the set of weighted feature vectors to obtain the local visual feature corresponding to the phrase annotation in the sample image; obtain the prediction probability of each word in the phrase annotation based on the local visual feature; and determine the prediction probability corresponding to the phrase annotation based on the prediction probabilities of the words in it.
Optionally, when obtaining the prediction probability of each word in the phrase annotation based on the local visual feature, the probability prediction module is configured to decompose the phrase annotation into a word sequence and feed the local visual feature into a long short-term memory network, determining at least one hidden state and a feature vector for each word; at each time step, the hidden state of the previous step and the feature vector of the current word are combined by the LSTM to obtain the hidden state of the next step; a linear mapping is applied to the at least one hidden state to obtain a prediction vector for each word; and the prediction probability of each word in the phrase annotation is obtained based on the prediction vectors.
Optionally, when determining the prediction probability corresponding to the phrase annotation based on the prediction probabilities of its words, the probability prediction module takes the product of the prediction probabilities of the words in the phrase annotation as the prediction probability of the phrase annotation.
In one or more optional embodiments, the loss training module is specifically configured to sum the global loss and the local loss to obtain a summed loss, and to adjust the parameters of the feature extraction network based on the summed loss.
In one or more optional embodiments, the apparatus of this embodiment further includes:
an identity sample unit, configured to feed an identity sample image into the feature extraction network to obtain a sample prediction feature, the identity sample image including an annotated identity feature;
a prediction recognition unit, configured to process the sample prediction feature through a pooling layer and a fully connected layer to obtain a predicted identity feature;
a parameter adjustment unit, configured to adjust the parameters of the feature extraction network, the pooling layer and the fully connected layer based on the annotated identity feature and the predicted identity feature.
According to another aspect of the embodiments of the present application, an electronic device is provided, including a processor, the processor including the pedestrian re-identification apparatus according to any one of the above.
According to another aspect of the embodiments of the present application, an electronic device is provided, including: a memory, for storing executable instructions;
and a processor, for communicating with the memory to execute the executable instructions so as to complete the operations of any one of the above pedestrian re-identification methods.
The embodiments of the present invention also provide an electronic device, which may be, for example, a mobile terminal, a personal computer (PC), a tablet computer, a server or the like. Referring now to Fig. 9, which shows a structural diagram of an electronic device 900 suitable for implementing the terminal device or server of the embodiments of the present application: as shown in Fig. 9, the electronic device 900 includes one or more processors, a communication part, and the like. The one or more processors are, for example, one or more central processing units (CPU) 901 and/or one or more graphics processors (GPU) 913, and the processors can perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 902 or loaded from a storage section 908 into a random access memory (RAM) 903. The communication part 912 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (Infiniband) network card.
The processor may communicate with the read-only memory 902 and/or the random access memory 903 to execute the executable instructions, is connected to the communication part 912 through a bus 904, and communicates with other target devices through the communication part 912, thereby completing the operations corresponding to any of the methods provided by the embodiments of the present application, for example: obtaining an image to be recognized and a candidate image set; performing feature extraction on the image to be recognized and each candidate image in the candidate image set using a feature extraction network, to obtain the intermediate feature to be recognized corresponding to the image to be recognized and the candidate intermediate features corresponding to the candidate images, where the feature extraction network is obtained through cross-modal training on image features and language descriptions; and obtaining, from the candidate image set, the recognition result corresponding to the image to be recognized based on the intermediate feature to be recognized and the candidate intermediate features.
In addition, the RAM 903 may also store various programs and data required for the operation of the apparatus. The CPU 901, the ROM 902 and the RAM 903 are connected to one another through the bus 904. Where there is a RAM 903, the ROM 902 is an optional module. The RAM 903 stores executable instructions, or writes executable instructions into the ROM 902 at runtime, and the executable instructions cause the central processing unit 901 to perform the operations corresponding to the above method. An input/output (I/O) interface 905 is also connected to the bus 904. The communication part 912 may be integrated, or may be configured with multiple sub-modules (e.g., multiple IB network cards) linked on the bus.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse and the like; an output section 907 including a cathode-ray tube (CRT), a liquid crystal display (LCD), a loudspeaker and the like; a storage section 908 including a hard disk and the like; and a communication section 909 including a network card such as a LAN card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A driver 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the driver 910 as needed, so that a computer program read therefrom is installed into the storage section 908 as needed.
It should be noted that the architecture shown in Fig. 9 is only an optional implementation; in specific practice, the number and types of the components in Fig. 9 may be selected, deleted, added or replaced according to actual needs. Different functional components may be arranged separately or integrally; for example, the GPU 913 and the CPU 901 may be arranged separately, or the GPU 913 may be integrated on the CPU 901, and the communication part may be arranged separately or integrated on the CPU 901 or the GPU 913, and so on. These alternative embodiments all fall within the protection scope of the present disclosure.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for executing the method shown in the flowchart; the program code may include instructions corresponding to the method steps provided by the embodiments of the present application, for example: obtaining an image to be recognized and a candidate image set; performing feature extraction on the image to be recognized and each candidate image in the candidate image set using a feature extraction network, to obtain the intermediate feature to be recognized corresponding to the image to be recognized and the candidate intermediate features corresponding to the candidate images, where the feature extraction network is obtained through cross-modal training on image features and language descriptions; and obtaining, from the candidate image set, the recognition result corresponding to the image to be recognized based on the intermediate feature to be recognized and the candidate intermediate features. In such embodiments, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. When the computer program is executed by the central processing unit (CPU) 901, the above functions defined in the methods of the present application are performed.
According to another aspect of the embodiments of the present application, a computer-readable storage medium is provided, for storing computer-readable instructions which, when executed, perform the operations of any one of the above pedestrian re-identification methods.
According to another aspect of the embodiments of the present application, a computer program product is provided, including computer-readable code; when the computer-readable code runs on a device, a processor in the device executes instructions for implementing any one of the above pedestrian re-identification methods.
The methods and apparatus of the present application may be implemented in many ways, for example by software, hardware, firmware, or any combination of software, hardware and firmware. The above order of the steps of the methods is merely for illustration; the steps of the methods of the present application are not limited to the order described above unless otherwise specifically stated. Furthermore, in some embodiments, the present application may also be embodied as programs recorded on a recording medium, these programs including machine-readable instructions for implementing the methods according to the present application. Thus, the present application also covers recording media storing programs for executing the methods according to the present application.
The description of the present application is given for the purposes of illustration and description, and is not intended to be exhaustive or to limit the application to the disclosed form. Many modifications and variations are obvious to those of ordinary skill in the art. The embodiments were chosen and described in order to better explain the principles and practical applications of the application, and to enable those skilled in the art to understand the application so as to design various embodiments with various modifications suited to particular uses.

Claims (10)

1. A pedestrian re-identification method, characterized by comprising:
obtaining an image to be recognized and a candidate image set;
performing feature extraction on the image to be recognized and each candidate image in the candidate image set using a feature extraction network, to obtain an intermediate feature to be recognized corresponding to the image to be recognized and candidate intermediate features corresponding to the candidate images, the feature extraction network being obtained through cross-modal training on image features and language descriptions;
obtaining, from the candidate image set, a recognition result corresponding to the image to be recognized based on the intermediate feature to be recognized and the candidate intermediate features, the recognition result including at least one of the candidate images.
2. The method according to claim 1, characterized in that obtaining, from the candidate image set, the recognition result corresponding to the image to be recognized based on the intermediate feature to be recognized and the candidate intermediate features comprises:
passing the intermediate feature to be recognized and the candidate intermediate features through an average pooling layer and a fully connected layer, respectively, to obtain a feature to be recognized and candidate features;
obtaining, from the candidate image set, the recognition result corresponding to the image to be recognized based on the feature to be recognized and the candidate features.
3. The method according to claim 1 or 2, characterized by further comprising: performing feature extraction on descriptive text relevant to the image to be recognized based on a language recognition network, to obtain a language feature;
screening the recognition result based on the language feature, to obtain an updated recognition result corresponding to the image to be recognized, the updated recognition result including at least one candidate image.
4. The method according to claim 3, characterized in that screening the recognition result based on the language feature to obtain the updated recognition result corresponding to the image to be recognized comprises:
screening based on the distance between the language feature and the at least one candidate intermediate feature corresponding to the recognition result;
obtaining the at least one candidate intermediate feature whose distance is less than or equal to a preset value, and taking the candidate image corresponding to the obtained candidate intermediate feature as the updated recognition result.
5. The method according to any one of claims 1 to 4, characterized by further comprising:
performing feature extraction on at least one descriptive word relevant to the image to be recognized based on the language recognition network, to obtain a word feature, each descriptive word corresponding to at least one part of the image to be recognized;
screening the recognition result or the updated recognition result based on the word feature, to obtain a target recognition result corresponding to the image to be recognized, the target recognition result including at least one of the candidate images.
6. A pedestrian re-identification apparatus, characterized by comprising:
an image acquisition unit, configured to obtain an image to be recognized and a candidate image set;
a feature extraction unit, configured to perform feature extraction on the image to be recognized and each candidate image in the candidate image set using a feature extraction network, to obtain an intermediate feature to be recognized corresponding to the image to be recognized and candidate intermediate features corresponding to the candidate images, the feature extraction network being obtained through cross-modal training on image features and language descriptions;
a result recognition unit, configured to obtain, from the candidate image set, a recognition result corresponding to the image to be recognized based on the intermediate feature to be recognized and the candidate intermediate features, the recognition result including at least one candidate image.
7. An electronic device, characterized by comprising a processor, the processor including the pedestrian re-identification apparatus according to claim 6.
8. An electronic device, characterized by comprising: a memory, for storing executable instructions;
and a processor, for communicating with the memory to execute the executable instructions so as to complete the operations of the pedestrian re-identification method according to any one of claims 1 to 5.
9. A computer-readable storage medium, for storing computer-readable instructions, characterized in that, when executed, the instructions perform the operations of the pedestrian re-identification method according to any one of claims 1 to 5.
10. A computer program product, including computer-readable code, characterized in that, when the computer-readable code runs on a device, a processor in the device executes instructions for implementing the pedestrian re-identification method according to any one of claims 1 to 5.
CN201810848366.2A 2018-07-27 2018-07-27 Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product Active CN109165563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810848366.2A CN109165563B (en) 2018-07-27 2018-07-27 Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product


Publications (2)

Publication Number Publication Date
CN109165563A true CN109165563A (en) 2019-01-08
CN109165563B CN109165563B (en) 2021-03-23

Family

ID=64898549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810848366.2A Active CN109165563B (en) 2018-07-27 2018-07-27 Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN109165563B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130343642A1 (en) * 2012-06-21 2013-12-26 Siemens Corporation Machine-learnt person re-identification
US20180060653A1 (en) * 2016-08-26 2018-03-01 Rui Zhang Method and apparatus for annotating a video stream comprising a sequence of frames
CN107480178A (en) * 2017-07-01 2017-12-15 广州深域信息科技有限公司 A kind of pedestrian's recognition methods again compared based on image and video cross-module state
CN107766791A (en) * 2017-09-06 2018-03-06 北京大学 A kind of pedestrian based on global characteristics and coarseness local feature recognition methods and device again
CN107908685A (en) * 2017-10-31 2018-04-13 西安交通大学 The retrieval of various visual angles commodity image and recognition methods based on transfer learning
CN108228757A (en) * 2017-12-21 2018-06-29 北京市商汤科技开发有限公司 Image search method and device, electronic equipment, storage medium, program

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109829430A (en) * 2019-01-31 2019-05-31 中科人工智能创新技术研究院(青岛)有限公司 Cross-module state pedestrian based on isomery stratification attention mechanism recognition methods and system again
CN109829430B (en) * 2019-01-31 2021-02-19 中科人工智能创新技术研究院(青岛)有限公司 Cross-modal pedestrian re-identification method and system based on heterogeneous hierarchical attention mechanism
CN110222686A (en) * 2019-05-27 2019-09-10 腾讯科技(深圳)有限公司 Object detecting method, device, computer equipment and storage medium
CN112214626A (en) * 2019-07-09 2021-01-12 北京地平线机器人技术研发有限公司 Image recognition method and device, readable storage medium and electronic equipment
CN112214626B (en) * 2019-07-09 2024-03-19 北京地平线机器人技术研发有限公司 Image recognition method and device, readable storage medium and electronic equipment
CN110807361B (en) * 2019-09-19 2023-08-08 腾讯科技(深圳)有限公司 Human body identification method, device, computer equipment and storage medium
CN110807361A (en) * 2019-09-19 2020-02-18 腾讯科技(深圳)有限公司 Human body recognition method and device, computer equipment and storage medium
CN110807139A (en) * 2019-10-23 2020-02-18 腾讯科技(深圳)有限公司 Picture identification method and device, computer readable storage medium and computer equipment
CN110807139B (en) * 2019-10-23 2023-09-01 腾讯科技(深圳)有限公司 Picture identification method, device, computer readable storage medium and computer equipment
CN111259786A (en) * 2020-01-14 2020-06-09 浙江大学 Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN111259786B (en) * 2020-01-14 2022-05-03 浙江大学 Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN111860100A (en) * 2020-04-22 2020-10-30 北京嘀嘀无限科技发展有限公司 Pedestrian number determination method and device, electronic equipment and readable storage medium
CN111860100B (en) * 2020-04-22 2024-06-07 北京嘀嘀无限科技发展有限公司 Pedestrian number determining method and device, electronic equipment and readable storage medium
CN111738186A (en) * 2020-06-28 2020-10-02 香港中文大学(深圳) Target positioning method and device, electronic equipment and readable storage medium
CN111738186B (en) * 2020-06-28 2024-02-02 香港中文大学(深圳) Target positioning method, target positioning device, electronic equipment and readable storage medium
CN112052722A (en) * 2020-07-21 2020-12-08 北京大学 Pedestrian identity re-identification method and storage medium
CN114494297A (en) * 2022-01-28 2022-05-13 杭州电子科技大学 Adaptive video target segmentation method for processing multiple priori knowledge

Also Published As

Publication number Publication date
CN109165563B (en) 2021-03-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant