CN106529565A - Target identification model training and target identification method and device, and computing equipment - Google Patents

Target identification model training and target identification method and device, and computing equipment

Info

Publication number
CN106529565A
CN106529565A (application number CN201610849633.9A; granted as CN106529565B)
Authority
CN
China
Prior art keywords
local candidate
model
candidate region
training
target recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610849633.9A
Other languages
Chinese (zh)
Other versions
CN106529565B (en)
Inventor
石建萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201610849633.9A priority Critical patent/CN106529565B/en
Publication of CN106529565A publication Critical patent/CN106529565A/en
Application granted granted Critical
Publication of CN106529565B publication Critical patent/CN106529565B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2134Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Abstract

The invention discloses a target recognition model training method and device, a target recognition method and device, and a computing device, which belong to the technical field of computer vision. The target recognition model training method comprises the steps of: inputting multiple local candidate regions selected for a training image into a target recognition model, so as to obtain a preliminary result of the classification of the multiple local candidate regions output by the target recognition model; performing local candidate region fusion according to weakly supervised information and the preliminary classification result of the multiple local candidate regions; correcting parameters of the target recognition model according to the preliminary classification result of the multiple local candidate regions and the result of the local candidate region fusion; and iteratively executing the training steps until the training result of the target recognition model satisfies a predetermined convergence condition. The disclosed methods and devices can provide direct pixel-level supervision, can optimize a semantic segmentation model in an end-to-end manner, and can improve the target recognition result through the judgment of local candidate regions.

Description

Target recognition model training method and apparatus, target recognition method and apparatus, and computing device
Technical field
The present invention relates to the technical field of computer vision, and more particularly to a target recognition model training method and apparatus, a target recognition method and apparatus, and a computing device.
Background technology
Target recognition is a classical problem in the field of computer vision; its purpose is to predict the positions of objects in an input image. A good target recognition scheme depends on classifying every pixel by object category, achieving an accurate, dense pixel-level understanding of the image. Target recognition annotation is generally very time-consuming because every object position in the image must be marked. Practical experience with common methods suggests that producing an accurate target recognition annotation for a 400*600-pixel image typically takes 5-8 minutes. Annotation speed and quality therefore become the key bottleneck preventing such methods from obtaining big-data support and developing further.
Summary of the invention
Embodiments of the present invention provide a target recognition model training and target recognition scheme.
According to one aspect of the embodiments of the present invention, a target recognition model training method is provided. The method trains a target recognition model using multiple training images pre-labeled with weakly supervised information. For each training image, the training steps include:

inputting multiple local candidate regions selected for the training image into the target recognition model, to obtain a preliminary result of the classification of the multiple local candidate regions output by the target recognition model;

performing local candidate region fusion according to the weakly supervised information and the preliminary classification result of the multiple local candidate regions;

correcting the parameters of the target recognition model according to the preliminary classification result of the multiple local candidate regions and the result of the local candidate region fusion; and

iteratively performing the above training steps until the training result of the target recognition model satisfies a predetermined convergence condition.
Optionally, the weakly supervised information includes object category information.
Optionally, before the multiple local candidate regions selected for the training image are input into the target recognition model, the method further includes: performing superpixel segmentation on the training image, and clustering the image blocks obtained by the superpixel segmentation to obtain the multiple local candidate regions.
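As an illustration of this preprocessing step, the sketch below replaces a real superpixel algorithm (e.g. SLIC) with a toy grid split, and clusters the resulting image blocks by mean intensity with plain 1-D k-means. All helper names are ours, and the block size and cluster count are arbitrary assumptions, not values from the patent.

```python
import numpy as np

def grid_superpixels(image, block=4):
    """Toy superpixel stage: split the image into block x block patches and
    summarize each by its mean intensity (a real system would use SLIC or
    another over-segmentation algorithm)."""
    h, w = image.shape
    patches = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            patches.append(((y, x), image[y:y + block, x:x + block].mean()))
    return patches

def cluster_patches(patches, k=2, iters=20):
    """Cluster patch mean intensities with plain 1-D k-means; each cluster
    of patches plays the role of one local candidate region."""
    vals = np.array([mean for _, mean in patches])
    centers = np.linspace(vals.min(), vals.max(), k)  # deterministic init
    for _ in range(iters):
        labels = np.argmin(np.abs(vals[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = vals[labels == j].mean()
    return [[patches[i][0] for i in np.where(labels == j)[0]] for j in range(k)]
```

On an image whose left half is dark and right half is bright, the two clusters recover the two halves, i.e. two coarse local candidate regions.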
Optionally, the method further includes: inputting the training image into a semantic segmentation model to obtain a preliminary semantic segmentation result of the training image. In this case, performing local candidate region fusion according to the weakly supervised information and the preliminary classification result of the multiple local candidate regions specifically includes: performing local candidate region fusion according to the weakly supervised information and the preliminary classification result of the multiple local candidate regions, to obtain a corrected semantic segmentation result of the training image. Correcting the parameters of the target recognition model according to the preliminary classification result of the multiple local candidate regions and the result of the local candidate region fusion further includes: correcting the model parameters of the semantic segmentation model according to the preliminary semantic segmentation result and the corrected semantic segmentation result; and correcting the parameters of the target recognition model by sharing the corrected output of the semantic segmentation model.
Optionally, inputting the multiple local candidate regions selected for the training image into the target recognition model and obtaining the preliminary result of the classification of the multiple local candidate regions output by the target recognition model further includes: classifying the multiple local candidate regions by object category using a cross-entropy loss function; and predicting the probability that each local candidate region belongs to each object category, to obtain an object category probability prediction value for each local candidate region.

Correcting the parameters of the target recognition model further includes: correcting the parameters of the cross-entropy loss function.
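A concrete (numpy-only) form of the cross-entropy classification over local candidate regions is sketched below. The patent does not specify the network producing the scores, so the logit shapes here are illustrative assumptions:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def region_cross_entropy(logits, class_ids):
    """Cross-entropy loss over local candidate regions.

    logits: (n_regions, n_classes) raw scores; class_ids: (n_regions,)
    target object categories. Returns the mean loss and the per-region
    class-probability predictions that the later fusion and screening
    steps consume.
    """
    probs = softmax(np.asarray(logits, dtype=float))
    n = probs.shape[0]
    loss = -np.mean(np.log(probs[np.arange(n), class_ids] + 1e-12))
    return loss, probs
```

In training, minimizing this loss (and correcting its parameters, i.e. the weights behind the logits) is what sharpens the object category probability prediction value of each region.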
Optionally, the training steps further include: training, according to the weakly supervised information, a function used for predicting the image category of the training image.
Optionally, performing local candidate region fusion according to the weakly supervised information and the preliminary classification result of the multiple local candidate regions to obtain the corrected semantic segmentation result of the training image further includes: selecting, from the multiple local candidate regions, local candidate regions belonging to the same object category; fusing the local candidate regions belonging to the same object category, and dividing the fused image region into a near-object region, a near-background region and an ambiguous region using a clustering algorithm; and segmenting the ambiguous region using a segmentation algorithm with the near-object region and the near-background region as seeds, to obtain the corrected semantic segmentation result of the training image.
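The three-way division and seeded correction can be illustrated on a per-pixel object-probability map. The sketch below substitutes fixed thresholds for the unspecified clustering algorithm and a nearest-score rule for the unspecified seeded segmentation algorithm, so it is only a schematic of the idea:

```python
import numpy as np

def three_way_split(score_map, hi=0.7, lo=0.3):
    """Divide a fused per-pixel object-probability map into near-object,
    near-background and ambiguous masks (fixed thresholds stand in for the
    clustering step, which the patent leaves open)."""
    near_obj = score_map >= hi
    near_bg = score_map <= lo
    ambiguous = ~(near_obj | near_bg)
    return near_obj, near_bg, ambiguous

def resolve_ambiguous(score_map, hi=0.7, lo=0.3):
    """Treat near-object/near-background as seeds and assign each ambiguous
    pixel by which seed its score is closer to — a toy proxy for a
    graph-cut style seeded segmentation."""
    near_obj, _near_bg, amb = three_way_split(score_map, hi, lo)
    corrected = near_obj.astype(int)
    corrected[amb] = (score_map[amb] >= 0.5).astype(int)
    return corrected
```

The output plays the role of the corrected semantic segmentation result: confident pixels keep their label, and only the ambiguous band is re-decided from the seeds.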
Optionally, fusing the local candidate regions belonging to the same object category further includes: selecting a predetermined number of local candidate regions, in descending order of object category probability prediction value, from the local candidate regions belonging to the same object category and fusing them; or selecting, from the local candidate regions belonging to the same object category, the local candidate regions whose object category probability prediction values are higher than a predetermined threshold and fusing them.
Optionally, inputting the multiple local candidate regions selected for the training image into the target recognition model and obtaining the preliminary result of the classification of the multiple local candidate regions output by the target recognition model further includes: classifying the multiple local candidate regions by object category using a cross-entropy loss function; and predicting the probability that each local candidate region belongs to each object category, to obtain an object category probability prediction value for each local candidate region.

The training steps further include: screening the multiple local candidate regions according to the preliminary semantic segmentation result of the training image and the object category probability prediction value of each local candidate region.

Performing local candidate region fusion according to the weakly supervised information and the preliminary classification result of the multiple local candidate regions to obtain the corrected semantic segmentation result of the training image further includes: fusing the screened local candidate regions according to the weakly supervised information and the preliminary classification result of the multiple local candidate regions, to obtain the corrected semantic segmentation result of the training image.
Optionally, screening the multiple local candidate regions according to the preliminary semantic segmentation result of the training image and the object category probability prediction value of each local candidate region further includes: calculating the intersection-over-union of each local candidate region's segmentation mask and the preliminary semantic segmentation result of the training image; and screening the multiple local candidate regions according to the comparison of each local candidate region's intersection-over-union with an intersection-over-union threshold and the comparison of its object category prediction value with a prediction value threshold.
Optionally, screening the multiple local candidate regions according to these comparisons further includes: in response to a local candidate region's intersection-over-union being greater than or equal to a first intersection-over-union threshold and its object category prediction value being greater than or equal to a first prediction value threshold, taking the local candidate region as a positive-sample local candidate region obtained by the screening; and/or, in response to a local candidate region's intersection-over-union being less than or equal to a second intersection-over-union threshold and its object category prediction value being less than or equal to a second prediction value threshold, taking the local candidate region as a negative-sample local candidate region obtained by the screening.
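The positive/negative screening rule reads directly as code. In this sketch, regions are boxes with class scores and the preliminary segmentation is summarized as a single box; the threshold values are illustrative, not taken from the patent:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def screen_regions(regions, seg_box, iou_hi=0.5, iou_lo=0.1,
                   score_hi=0.8, score_lo=0.2):
    """Split (box, score) candidate regions into positive / negative samples
    by comparing each region's IoU with the preliminary segmentation and its
    class score against the two threshold pairs; regions failing both tests
    are simply discarded."""
    positives, negatives = [], []
    for box, score in regions:
        overlap = iou(box, seg_box)
        if overlap >= iou_hi and score >= score_hi:
            positives.append((box, score))
        elif overlap <= iou_lo and score <= score_lo:
            negatives.append((box, score))
    return positives, negatives
```

A region that overlaps the segmentation well but has a middling score falls into neither bucket, which matches the claim's "and" conditions on both thresholds.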
Optionally, fusing the screened local candidate regions to obtain the corrected semantic segmentation result of the training image further includes: selecting, from the screened local candidate regions, local candidate regions belonging to the same object category; fusing the local candidate regions belonging to the same object category, and dividing the fused image region into a near-object region, a near-background region and an ambiguous region using a clustering algorithm; and segmenting the ambiguous region using a segmentation algorithm with the near-object region and the near-background region as seeds, to obtain the corrected semantic segmentation result of the training image.
Optionally, fusing the local candidate regions belonging to the same object category further includes: selecting a predetermined number of local candidate regions, in descending order of object category probability prediction value, from the local candidate regions belonging to the same object category and fusing them; or selecting, from the local candidate regions belonging to the same object category, the local candidate regions whose object category probability prediction values are higher than a predetermined threshold and fusing them.
According to one aspect of the embodiments of the present invention, a target recognition method is provided, including:

taking an image to be recognized as the input of a target recognition model, the target recognition model having been trained in advance using the method described above; and

determining, according to the output result of the target recognition model, the classification results of multiple local candidate regions selected for the image.
According to another aspect of the embodiments of the present invention, a target recognition model training apparatus is provided. The apparatus trains a target recognition model using multiple training images pre-labeled with weakly supervised information, and includes:

a target recognition unit, configured to input multiple local candidate regions selected for the training image into the target recognition model and obtain a preliminary result of the classification of the multiple local candidate regions output by the target recognition model;

a fusion unit, configured to perform local candidate region fusion according to the weakly supervised information and the preliminary classification result of the multiple local candidate regions; and

a correction unit, configured to correct the parameters of the target recognition model according to the preliminary classification result of the multiple local candidate regions and the result of the local candidate region fusion.

The training apparatus operates iteratively until the training result of the target recognition model satisfies a predetermined convergence condition.
Optionally, the apparatus further includes: a data preparation module, configured to obtain the object category information of the training image.
Optionally, the apparatus further includes: a data preparation module, configured to perform superpixel segmentation on the training image and cluster the image blocks obtained by the superpixel segmentation to obtain the multiple local candidate regions.
Optionally, the apparatus further includes: a semantic segmentation unit, configured to input the training image into a semantic segmentation model and obtain a preliminary semantic segmentation result of the training image.

The fusion unit is further configured to: perform local candidate region fusion according to the weakly supervised information and the preliminary classification result of the multiple local candidate regions, to obtain a corrected semantic segmentation result of the training image.

The correction unit is further configured to: correct the model parameters of the semantic segmentation model according to the preliminary semantic segmentation result and the corrected semantic segmentation result, and correct the parameters of the target recognition model by sharing the corrected output of the semantic segmentation model.
Optionally, the target recognition unit is further configured to: classify the multiple local candidate regions by object category using a cross-entropy loss function, and predict the probability that each local candidate region belongs to each object category, to obtain an object category probability prediction value for each local candidate region.

The correction unit is further configured to: correct the parameters of the cross-entropy loss function.
Optionally, the training apparatus further includes: an image category prediction unit, configured to train, according to the weakly supervised information, a function used for predicting the image category of the training image.
Optionally, the fusion unit further includes: a sorting subunit, configured to select, from the multiple local candidate regions, local candidate regions belonging to the same object category; a fusion processing subunit, configured to fuse the local candidate regions belonging to the same object category and divide the fused image region into a near-object region, a near-background region and an ambiguous region using a clustering algorithm; and a segmentation subunit, configured to segment the ambiguous region using a segmentation algorithm with the near-object region and the near-background region as seeds, to obtain the corrected semantic segmentation result of the training image.
Optionally, the fusion processing subunit is further configured to: select a predetermined number of local candidate regions, in descending order of object category probability prediction value, from the local candidate regions belonging to the same object category and fuse them; or select, from the local candidate regions belonging to the same object category, the local candidate regions whose object category probability prediction values are higher than a predetermined threshold and fuse them.
Optionally, the target recognition unit is further configured to: classify the multiple local candidate regions by object category using a cross-entropy loss function, and predict the probability that each local candidate region belongs to each object category, to obtain an object category probability prediction value for each local candidate region.

The training apparatus further includes: a selection unit, configured to screen the multiple local candidate regions according to the preliminary semantic segmentation result of the training image and the object category probability prediction value of each local candidate region.
The fusion unit is further configured to: fuse the screened local candidate regions according to the weakly supervised information, to obtain the corrected semantic segmentation result of the training image.
Optionally, the selection unit is further configured to: calculate the intersection-over-union of each local candidate region's segmentation mask and the preliminary semantic segmentation result of the training image; and screen the multiple local candidate regions according to the comparison of each local candidate region's intersection-over-union with an intersection-over-union threshold and the comparison of its object category prediction value with a prediction value threshold.
Optionally, the selection unit is further configured to: in response to a local candidate region's intersection-over-union being greater than or equal to a first intersection-over-union threshold and its object category prediction value being greater than or equal to a first prediction value threshold, take the local candidate region as a positive-sample local candidate region obtained by the screening; and/or, in response to a local candidate region's intersection-over-union being less than or equal to a second intersection-over-union threshold and its object category prediction value being less than or equal to a second prediction value threshold, take the local candidate region as a negative-sample local candidate region obtained by the screening.
Optionally, the fusion unit further includes: a sorting subunit, configured to select, from the screened local candidate regions, local candidate regions belonging to the same object category; a fusion processing subunit, configured to fuse the local candidate regions belonging to the same object category and divide the fused image region into a near-object region, a near-background region and an ambiguous region using a clustering algorithm; and a segmentation subunit, configured to segment the ambiguous region using a segmentation algorithm with the near-object region and the near-background region as seeds, to obtain the corrected semantic segmentation result of the training image.
Optionally, the fusion processing subunit is further configured to: select a predetermined number of local candidate regions, in descending order of object category probability prediction value, from the local candidate regions belonging to the same object category and fuse them; or select, from the local candidate regions belonging to the same object category, the local candidate regions whose object category probability prediction values are higher than a predetermined threshold and fuse them.
According to another aspect of the embodiments of the present invention, a target recognition apparatus is provided. The target recognition apparatus is configured to take an image to be recognized as the input of a target recognition model, and to determine, according to the output result of the target recognition model, the classification results of multiple local candidate regions selected for the image, wherein the target recognition model is trained in advance using the training apparatus described above.
According to yet another aspect of the embodiments of the present invention, a computing device is provided, including: a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another via the communication bus.

The memory is configured to store at least one instruction, and the instruction causes the processor to perform the following operations:

inputting multiple local candidate regions selected for a training image into the target recognition model, to obtain a preliminary result of the classification of the multiple local candidate regions output by the target recognition model;

performing local candidate region fusion according to the weakly supervised information and the preliminary classification result of the multiple local candidate regions;

correcting the parameters of the target recognition model according to the preliminary classification result of the multiple local candidate regions and the result of the local candidate region fusion; and

iteratively performing the above training steps until the training result of the target recognition model satisfies a predetermined convergence condition.
According to yet another aspect of the embodiments of the present invention, a computer storage medium for storing computer-readable instructions is provided. The instructions include:

an instruction for inputting multiple local candidate regions selected for a training image into the target recognition model, to obtain a preliminary result of the classification of the multiple local candidate regions output by the target recognition model;

an instruction for performing local candidate region fusion according to the weakly supervised information and the preliminary classification result of the multiple local candidate regions;

an instruction for correcting the parameters of the target recognition model according to the preliminary classification result of the multiple local candidate regions and the result of the local candidate region fusion; and

an instruction for iteratively performing the above training steps until the training result of the target recognition model satisfies a predetermined convergence condition.
In the technical scheme provided by the embodiments of the present invention, multiple local candidate regions selected for a training image are first input into the target recognition model to obtain a preliminary classification result of the multiple local candidate regions; local candidate region fusion is then performed according to the weakly supervised information and the preliminary classification result; the parameters of the target recognition model are corrected according to the fusion result; and a trained target recognition model is thereby obtained. Guided by the weakly supervised information provided in advance, local candidate regions can be accurately classified and the object recognition task accomplished.
The above is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be better understood and implemented in accordance with the content of this description, and in order that the above and other objects, features and advantages of the present invention may become more apparent, specific embodiments of the present invention are set forth below.
Description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The accompanying drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Figs. 1a to 1f are schematic diagrams of predicting pixel categories with a multiple instance learning method in the prior art;

Fig. 2 is a flowchart of embodiment one of the target recognition model training method provided by the present invention;

Fig. 3 is a flowchart of embodiment two of the target recognition model training method provided by the present invention;

Fig. 4 is a schematic diagram of the network model of embodiment two of the target recognition model training method provided by the present invention;

Figs. 5a to 5h are schematic diagrams of an example of local candidate region fusion processing in an embodiment of the present invention;

Fig. 6 is a flowchart of embodiment three of the target recognition model training method provided by the present invention;

Fig. 7 is a schematic diagram of the network model of embodiment three of the target recognition model training method provided by the present invention;

Fig. 8 is a functional block diagram of embodiment one of the target recognition model training apparatus provided by the present invention;

Fig. 9 is a functional block diagram of embodiment two of the target recognition model training apparatus provided by the present invention;

Fig. 10 is a functional block diagram of embodiment three of the target recognition model training apparatus provided by the present invention;

Fig. 11 is a block diagram of a computing device for performing the target recognition model training method according to an embodiment of the present invention; and

Fig. 12 is a storage unit for holding or carrying program code that implements the target recognition model training method according to an embodiment of the present invention.
Detailed Description of the Embodiments
Exemplary embodiments of the present disclosure are described in greater detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. On the contrary, these embodiments are provided so that the disclosure will be thoroughly understood and its scope fully conveyed to those skilled in the art.
In the field of computer vision, an image is often decomposed into multiple local candidate regions for understanding and learning in order to recognize objects in the image accurately. The decomposition principle is to cover objects of different sizes in the image as far as possible; each local candidate region covers a part of an object and need not contain the object completely, so the information captured by the local candidate regions as a whole is richer. In the present invention, the target recognition problem is in fact the classification problem of local candidate regions, i.e. determining the object class to which each local candidate region belongs. The present invention uses a large number of training images to obtain a target recognition model through a training process, which can greatly improve the accuracy of local candidate region classification.
In realizing the present invention, the inventor found by studying the prior art that some existing approaches train the target recognition problem with semi-supervised or weakly supervised methods. Traditional weakly supervised target recognition given only image-level classes generally falls into two categories. The first category directly predicts pixel classes using Multiple Instance Learning (hereinafter: MIL). Under this setting, each picture is regarded as a set of pixels or superpixels: if any element of the set is a positive sample, the overall output is positive; conversely, if all elements of the set are negative samples, the overall output is negative. Because no direct supervisory information guides the bottom-up signals, such schemes easily fail to localize objects accurately, as shown in Fig. 1a to Fig. 1f. Here, Fig. 1a and Fig. 1d are the original images, Fig. 1b and Fig. 1e are the ground-truth semantic segmentation images corresponding to Fig. 1a and Fig. 1d respectively, and Fig. 1c and Fig. 1f are the target recognition images predicted by the MIL method. As can be seen from the figures, the accuracy of the target recognition images obtained by MIL prediction is relatively low.
The other direction of weakly supervised learning in the prior art uses the Expectation-Maximization idea, cyclically learning a temporary supervisory labeling and then learning the semantic segmentation model. Such methods benefit from having pixel-level supervision, but they depend on a very good initialization; if the initialization is improper, the result is hard to guarantee.
Based on the above findings, the embodiments of the present invention propose a weakly supervised target recognition scheme that both enjoys direct pixel-level supervision and can optimize the semantic segmentation model end to end; it further introduces an object localization branch that can improve the target recognition result. The scheme is described in detail below through several specific embodiments.
Fig. 2 is a flowchart of Embodiment 1 of the target recognition model training method provided by the present invention. The training method of this embodiment requires no pixel-level annotation; training of the target recognition model is realized on the basis of weakly supervised information supplied in advance. The method trains the target recognition model with multiple training images; this embodiment mainly explains how the target recognition model, aided by semantic segmentation, is trained with one training image. Those skilled in the art will appreciate that the training process requires a large number of training images: the more training images and the wider their coverage, the more accurate the trained target recognition model. The embodiment of the present invention places no restriction on the number of training images.
As shown in Fig. 2, the training method for each training image comprises the following steps:
Step S101: select multiple local candidate regions from the training image, and obtain the weakly supervised information of the training image. Step S101 is the data preparation step.
The embodiment of the present invention decomposes one training image into multiple local candidate regions for understanding and learning. The decomposition principle is to cover objects of different sizes in the training image as far as possible; each local candidate region covers a part of an object and need not contain the object completely, so the information captured by the local candidate regions is richer. Specifically, the decomposition applies superpixel segmentation to the training image to obtain several image blocks, and then clusters and combines these image blocks into multiple local candidate regions. The embodiment of the present invention may adopt any local candidate region selection method provided in the prior art and is not restricted in this respect.
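The segment-then-merge idea can be illustrated with a minimal sketch. This is not the patent's algorithm: it approximates superpixels with grid cells carrying a mean intensity, greedily merges similar adjacent cells with a union-find, and emits one bounding box per merged group plus one per cell, so that candidates cover multiple scales. All names and the similarity threshold are illustrative.

```python
# Toy stand-in for superpixel segmentation + clustering combination:
# "superpixels" are grid cells; 4-adjacent cells with similar mean
# intensity are merged, and each merged group becomes one candidate box.

def propose_regions(cell_values, threshold=0.2):
    """cell_values: 2-D list of per-cell mean intensities in [0, 1].
    Returns (r0, c0, r1, c1) inclusive bounding boxes: one per merged
    group, plus one per original cell for small-scale coverage."""
    rows, cols = len(cell_values), len(cell_values[0])
    parent = list(range(rows * cols))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Merge 4-adjacent cells whose intensities differ by less than `threshold`.
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols and abs(cell_values[r][c] - cell_values[r][c + 1]) < threshold:
                union(r * cols + c, r * cols + c + 1)
            if r + 1 < rows and abs(cell_values[r][c] - cell_values[r + 1][c]) < threshold:
                union(r * cols + c, (r + 1) * cols + c)

    # One bounding box per merged group.
    groups = {}
    for r in range(rows):
        for c in range(cols):
            g = find(r * cols + c)
            r0, c0, r1, c1 = groups.get(g, (r, c, r, c))
            groups[g] = (min(r0, r), min(c0, c), max(r1, r), max(c1, c))
    singles = [(r, c, r, c) for r in range(rows) for c in range(cols)]
    return list(groups.values()) + singles
```

A real implementation would of course segment actual superpixels and score merges on richer features; the sketch only shows why merged groups cover object parts at several scales.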
In addition, the weakly supervised information of the training image needs to be obtained; this is information supplied in advance. Optionally, the weakly supervised information is object class information. Traditional pixel annotation requires accurately marking the object class of every pixel in the training image, whereas in the present invention the weakly supervised information is only the set of object classes contained in the training image. For example, if a training image contains a person and an airplane, traditional pixel annotation must mark whether each pixel in that image belongs to the person or the airplane, while the present invention only needs to note that the image contains a person and an airplane. That is, the training apparatus is told in advance which object classes the training image contains, but not where the objects are.
Step S102: input the multiple local candidate regions selected from the training image into the target recognition model, and obtain the preliminary classification result of the multiple local candidate regions output by the model.
After the data is prepared, the training process starts. First, the multiple local candidate regions selected from the training image are input into the target recognition model, yielding the preliminary classification result of the multiple local candidate regions.
In the embodiment of the present invention, the cross-entropy loss function over the fully connected layer of a deep fully convolutional neural network serves as the target recognition model; the preliminary classification result of the multiple local candidate regions is predicted using the cross-entropy loss function.
Step S103: perform local candidate region fusion according to the weakly supervised information and the preliminary classification result of the multiple local candidate regions.
Relying on the previously prepared weakly supervised information and the preliminary classification result obtained in step S102, the multiple local candidate regions are fused.
Step S104: correct the parameters of the target recognition model according to the preliminary classification result of the multiple local candidate regions and the result of the local candidate region fusion.
The result of the fully convolutional neural network is shared, and the cross-entropy loss function of the fully connected layer is corrected, realizing the correction of the target recognition model.
Steps S102 to S104 above constitute the training step, which is performed iteratively to obtain a trained target recognition model. Specifically, the training step is iterated until the training result of the target recognition model meets a predetermined convergence condition. For example, the predetermined convergence condition may be reaching a predetermined number of iterations, the iterative process ending when that number is reached; or it may be that the difference between the preliminary result and the corrected result converges to some extent, the iterative process ending when that condition is met.
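As a schematic illustration (not the patent's implementation), the iterate-until-convergence control flow can be sketched as follows, where `step_fn` is a hypothetical callback that performs one training step (S102 to S104) and returns the preliminary-vs-corrected discrepancy:

```python
# Iterate the training step until either convergence condition holds:
# (a) a predetermined number of iterations is reached, or
# (b) the preliminary-vs-corrected gap falls below a tolerance.

def train_until_converged(step_fn, max_iters=100, tol=1e-3):
    """Returns (iterations_run, last_gap)."""
    gap = float("inf")
    for it in range(1, max_iters + 1):
        gap = step_fn()
        if gap < tol:           # condition (b): discrepancy has converged
            return it, gap
    return max_iters, gap       # condition (a): iteration budget exhausted
```

Either condition alone terminates training; which one is used (and the concrete budget or tolerance) is a design choice the text leaves open.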
In the target recognition model training method provided by this embodiment, the multiple local candidate regions selected from the training image are first input into the target recognition model to obtain their preliminary classification result; local candidate region fusion is then performed according to the weakly supervised information and that preliminary result, and the parameters of the target recognition model are corrected according to the fusion result, finally yielding the trained target recognition model. Guided by the weakly supervised information supplied in advance, accurate classification of local candidate regions is realized, completing the object recognition task.
Fig. 3 is a flowchart of Embodiment 2 of the target recognition model training method provided by the present invention. Fig. 4 is a schematic diagram of the network model of Embodiment 2. The concrete scheme of this embodiment is described in detail with reference to these two figures; the method of this embodiment again explains how the target recognition model is trained with one training image.
As shown in Fig. 3, the training method for each training image comprises the following steps:
Step S201: select multiple local candidate regions from the training image, and obtain the weakly supervised information of the training image. Step S201 is the data preparation step.
The embodiment of the present invention decomposes one training image into multiple local candidate regions for understanding and learning. The decomposition principle is to cover objects of different sizes in the training image as far as possible; each local candidate region covers a part of an object and need not contain it completely, so the information captured is richer. As shown by branch b in Fig. 4, branch b is the object localization branch, which decomposes the original training image into several local candidate regions; the finer this branch decomposes, the higher the object localization accuracy. In practical operation, branch b yields roughly 2000 local candidate regions. The embodiment of the present invention may adopt any local candidate region selection method provided in the prior art and is not restricted in this respect.
In addition, the weakly supervised information of the training image needs to be obtained; this is information supplied in advance. Optionally, the weakly supervised information is object class information. Traditional pixel annotation requires accurately marking the object class of every pixel in the training image, whereas in the present invention the weakly supervised information is the set of object classes contained in the training image. As shown in the lower-right corner of Fig. 4, the weakly supervised information supplied in advance for that training image is simply "person" and "airplane". Traditional pixel annotation would have to mark whether each pixel of the image belongs to the person or the airplane, while the present invention only needs to note that the image contains a person and an airplane. That is, the training apparatus is told in advance which object classes the training image contains, but not where the objects are.
Step S202: input the training image into the semantic segmentation model, and obtain the preliminary semantic segmentation result of the training image.
After the data is prepared, the training process starts. First, the training image is input into the initial semantic segmentation model to obtain the preliminary semantic segmentation result of the training image. This embodiment uses a deep fully convolutional neural network as the semantic segmentation model; specifically, the fully convolutional network predicts the preliminary segmentation of the training image. This step learns the parameters of intermediate representations through multiple convolution / nonlinear-response / pooling layers. A specific example is as follows:
1. input layer
// first stage: shared convolution results
2. <=1 convolution layer 1_1 (3×3×64)
3. <=2 nonlinear response ReLU layer
4. <=3 convolution layer 1_2 (3×3×64)
5. <=4 nonlinear response ReLU layer
6. <=5 pooling layer (3×3/2)
7. <=6 convolution layer 2_1 (3×3×128)
8. <=7 nonlinear response ReLU layer
9. <=8 convolution layer 2_2 (3×3×128)
10. <=9 nonlinear response ReLU layer
11. <=10 pooling layer (3×3/2)
12. <=11 convolution layer 3_1 (3×3×256)
13. <=12 nonlinear response ReLU layer
14. <=13 convolution layer 3_2 (3×3×256)
15. <=14 nonlinear response ReLU layer
16. <=15 convolution layer 3_3 (3×3×256)
17. <=16 nonlinear response ReLU layer
18. <=17 pooling layer (3×3/2)
19. <=18 convolution layer 4_1 (3×3×512)
20. <=19 nonlinear response ReLU layer
21. <=20 convolution layer 4_2 (3×3×512)
22. <=21 nonlinear response ReLU layer
23. <=22 convolution layer 4_3 (3×3×512)
24. <=23 nonlinear response ReLU layer
25. <=24 pooling layer (3×3/2)
26. <=25 convolution layer 5_1 (3×3×512)
27. <=26 nonlinear response ReLU layer
28. <=27 convolution layer 5_2 (3×3×512)
29. <=28 nonlinear response ReLU layer
30. <=29 convolution layer 5_3 (3×3×512)
31. <=30 nonlinear response ReLU layer
32. <=31 linear interpolation layer
33. <=32 loss layer, where the loss function is computed
Here, the number before "<=" is the index of the current layer, and the number after it is the index of the layer providing its input; for example, "2. <=1" indicates that the current layer is layer 2 and its input is layer 1. The parenthesized values after a convolution layer are the convolution parameters; for example, 3×3×64 indicates a 3×3 kernel with 64 channels. The parenthesized values after a pooling layer are the pooling parameters; for example, 3×3/2 indicates a 3×3 pooling kernel with a stride of 2.
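The listed stack can be sanity-checked by tracing feature-map sizes. The sketch below assumes "same" padding for the 3×3 convolutions (so they preserve spatial size) and ceil-mode pooling arithmetic, assumptions the text does not state explicitly:

```python
import math

def trace_sizes(input_size, num_pool_stages=4):
    """Side length of the feature map after each 3x3/2 pooling stage,
    assuming the 3x3 convolutions use padding 1 and so preserve size."""
    sizes = []
    s = input_size
    for _ in range(num_pool_stages):
        s = math.ceil((s - 3) / 2) + 1   # 3x3 pooling kernel, stride 2
        sizes.append(s)
    return sizes
```

For a 224×224 input this gives side lengths 112, 56, 28 and 14 after the four pooling stages; the linear interpolation layer then enlarges the final features back to the input size.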
In the above neural network, each convolution layer is followed by a nonlinear response unit, specifically a Rectified Linear Unit (hereinafter: ReLU). Adding this rectified linear unit after the convolution layer makes the mapping result of the convolution layer as sparse as possible and closer to the human visual response, so that the image processing effect is better. In the above example, the convolution kernels are set to 3×3, which better integrates local information.
In this embodiment, the stride of the pooling layers is set so that upper-layer features obtain a larger field of view without increasing the amount of computation; at the same time, the pooling stride enhances the spatial invariance of the features, i.e. the same input appearing at different image positions produces the same output response.
The linear interpolation layer enlarges the preceding features to the original image size, yielding a prediction value for each pixel.
In summary, the convolution layers of the fully convolutional network mainly perform information induction and fusion, while the pooling layers (optionally max pooling) mainly summarize higher-level information. The fully convolutional network can be fine-tuned to balance different performance and efficiency requirements.
The preliminary semantic segmentation result of the training image obtained in this step is specifically a pixel-level segmentation labeling, i.e. a semantic segmentation label for each pixel. However, since the semantic segmentation model is still in training and is not the final model, the preliminary result is not accurate enough.
Step S203: classify the multiple local candidate regions by object class using the cross-entropy loss function; predict, for each local candidate region, the probability that it belongs to each object class, obtaining the object class probability prediction value of each local candidate region.
After a series of local candidate regions of objects is obtained in step S201 with the object proposal method, this step classifies those local candidate regions. The embodiment of the present invention additionally designs a multi-task training subsystem that applies constraints using image-level labels. The multi-task training subsystem includes training of the object local candidate region classification (step S203) and training of the image category (step S204), which avoids the semantic drift caused by inaccurate supervisory signals in the initial training samples.
Specifically, in step S203, the multiple local candidate regions are classified by object class using the cross-entropy loss function. A specific example is as follows:
34. <=31 fully connected layer 6_1 (M×N) (M is the output dimension of the previous layer, N is the number of classes to predict)
35. <=34 cross-entropy loss layer
By sharing the result of the aforementioned fully convolutional network, the fully connected layer can predict the class of each local candidate region.
This step also predicts, for each local candidate region, the probability that it belongs to each object class, obtaining each region's object class probability prediction value. Specifically, the object class probability prediction value of each local candidate region is learned by the above fully convolutional network.
Step S204: train the function that predicts the image category of the training image according to the weakly supervised information.
In this step, image category training uses the scheme of multiple instance training: classification is performed with a generative Log-Sum-Exponential model, whose optimization formula is as follows:
Here, I_k is the k-th training image and c is a class; x_kj is the expressive feature of the j-th local candidate region of the k-th training image; M is the number of local candidate regions of the k-th training image; and w_c is the classifier parameter of class c to be learned. The formula predicts the probability that I_k belongs to class c, i.e. Pr(I_k ∈ c | w_c).
In this step the weakly supervised information serves as input, and the classifier parameters of each class are learned through the above optimization formula. Since the weakly supervised information is the standard image-category annotation, learning the per-class classifier parameters from it through the optimization formula equips the network with the ability to predict correctly when it next encounters a similar input image.
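The optimization formula itself is not reproduced in this text, so the sketch below shows the standard Log-Sum-Exponential pooling form of such a multiple instance classifier in NumPy. Variable names follow the text (x for the per-region features of one image, w_c for the class-c classifier parameters); the smoothing constant r and the sigmoid squashing are assumptions, not taken from the patent.

```python
import numpy as np

def lse_image_score(x, w_c, r=1.0):
    """x: (M, D) features of the M local candidate regions of image I_k.
    w_c: (D,) classifier parameters of class c.
    Returns an estimate of Pr(I_k in c | w_c) by Log-Sum-Exp pooling
    of the per-region scores (a smooth maximum over regions)."""
    region_scores = x @ w_c                          # one score per region
    m = region_scores.max()                          # shift for numerical stability
    lse = m + np.log(np.exp(r * (region_scores - m)).mean()) / r
    return 1.0 / (1.0 + np.exp(-lse))                # squash to a probability
```

Because LSE pooling behaves as a smooth maximum, one strongly responding region is enough to raise the image-level score, which matches the MIL reading that a single positive instance makes the bag positive.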
Step S205: select, from the multiple local candidate regions, the local candidate regions belonging to the same object class.
After the classification of step S203, the object class of each local candidate region is known. This step takes the local candidate regions belonging to the same object class as one group and performs the subsequent operations on it. If the training image contains N object classes, the local candidate regions are divided into N groups, and the subsequent operations are performed for each group.
Step S206: fuse the local candidate regions belonging to the same object class, and divide the fused image region into a near-object area, a near-background area and an ambiguous area using a clustering algorithm.
Since many local candidate regions are selected, fusing all local candidate regions belonging to the same object class would involve a large amount of computation. To reduce it, a batch of local candidate regions can optionally be picked out of those belonging to the same object class for fusion. The selection principle may adopt, but is not limited to, the following two:
One is to pick a predetermined number of local candidate regions out of those belonging to the same object class, in descending order of their object class probability prediction values, and fuse them.
The other is to pick, out of the local candidate regions belonging to the same object class, those whose object class probability prediction value exceeds a predetermined threshold, and fuse them.
Both principles select on the basis of the object class probability prediction value, which reflects how likely a local candidate region is to belong to a given object class; the purpose of both is to pick out the local candidate regions most likely to belong to the object.
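The two selection principles can be sketched directly; the value of k and the threshold are illustrative, not taken from the patent:

```python
def select_by_rank(probs, k):
    """Indices of the k regions with the highest class-probability
    prediction values (the predetermined-number principle)."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def select_by_threshold(probs, thresh):
    """Indices of regions whose class-probability prediction value
    exceeds `thresh` (the predetermined-threshold principle)."""
    return [i for i, p in enumerate(probs) if p > thresh]
```

The rank rule fixes the computation budget per class, while the threshold rule fixes a minimum confidence; which trade-off is preferable depends on how many proposals each class attracts.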
The picked local candidate regions are then fused. The detailed process may be: apply image segmentation to the picked local candidate regions to obtain their binary segmentation masks; fuse the binary segmentation masks of the picked regions; and divide the fused image region into a near-object area, a near-background area and an ambiguous area using a clustering algorithm.
Fig. 5a to Fig. 5h are schematic diagrams of an example of local candidate region fusion in the embodiment of the present invention. Fig. 5a is the original image and Fig. 5b the ground-truth semantic segmentation. Fig. 5c, Fig. 5d and Fig. 5e are the binary segmentation masks of the three local candidate regions with the highest object class probability prediction values; fusing them yields Fig. 5f. Clustering the image region with an optional clustering algorithm, such as k-means, yields Fig. 5g, in which the image region is divided into a near-object area, a near-background area and an ambiguous area. The near-object area is the region with a high probability of being the object (the white region in Fig. 5g); the near-background area is the region with a high probability of being background (the black region in Fig. 5g), background here generally meaning whatever does not belong to the object; and the ambiguous area is the region for which it cannot be estimated whether it is the object (the gray region in Fig. 5g).
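A NumPy sketch of the fusion and three-way clustering described above: the selected regions' binary masks are averaged into a per-pixel vote map (the averaging is an assumption about how fusion works), and a toy 1-D k-means with k = 3 splits the pixels into the three areas; k-means is one optional clustering algorithm per the text.

```python
import numpy as np

def fuse_and_cluster(masks, iters=20):
    """masks: (R, H, W) binary masks of the selected candidate regions.
    Returns an (H, W) map: 2 = near-object, 1 = ambiguous, 0 = near-background."""
    votes = np.asarray(masks, dtype=float).mean(axis=0)   # fusion: per-pixel vote share
    flat = votes.ravel()
    centers = np.array([flat.min(), flat.mean(), flat.max()])
    for _ in range(iters):                                # 1-D k-means, k = 3
        labels = np.abs(flat[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(3):
            if (labels == k).any():
                centers[k] = flat[labels == k].mean()
    order = np.argsort(centers)                           # sort clusters by vote level
    remap = np.empty(3, dtype=int)
    remap[order] = np.arange(3)                           # lowest -> 0, highest -> 2
    return remap[labels].reshape(votes.shape)
```

Pixels covered by most selected masks land in the near-object cluster, pixels covered by none in the near-background cluster, and partially covered pixels form the ambiguous cluster to be resolved in the next step.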
Step S207: with the near-object area and near-background area as seeds, segment the ambiguous area using a segmentation algorithm, obtaining the corrected semantic segmentation result of the training image.
To further predict the segmentation of the ambiguous area, this step takes the near-object area and the near-background area as seeds and segments the ambiguous area with an optional segmentation algorithm, such as GrabCut, obtaining the corrected semantic segmentation result for one object class. In the above example, Fig. 5h is the corrected semantic segmentation result of Fig. 5a.
If the training image contains N object classes, the N groups of local candidate regions each go through steps S206 and S207, yielding corrected semantic segmentation results for all object classes and, finally, the corrected semantic segmentation result of the whole training image.
Step S208: correct the model parameters of the semantic segmentation model according to the preliminary result and the corrected result.
In place of ground truth, the corrected result obtained above is treated as the standard output; the difference between the standard output and the preliminary result is determined, the loss function of the semantic segmentation model is obtained from that difference, and the loss value is back-propagated to update the model parameters of the semantic segmentation model.
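One possible reading of this update (an assumption, not necessarily the patent's exact loss) is a pixel-wise cross-entropy in which the corrected labels play the role of standard output against the preliminary per-pixel class probabilities:

```python
import numpy as np

def correction_loss(pred_probs, corrected_labels, eps=1e-12):
    """pred_probs: (H, W, C) per-pixel class probabilities (preliminary result).
    corrected_labels: (H, W) integer labels from the fusion/correction step.
    Returns the mean pixel-wise cross-entropy, whose gradient would be
    back-propagated to update the segmentation model."""
    h, w, _ = pred_probs.shape
    # Probability assigned by the model to each pixel's corrected label.
    picked = pred_probs[np.arange(h)[:, None], np.arange(w)[None, :], corrected_labels]
    return float(-np.log(picked + eps).mean())
```

When the preliminary prediction already agrees with the corrected result the loss is near zero, so the parameter update vanishes, which is exactly the convergence behavior the iteration relies on.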
Step S209: share the output result of the corrected semantic segmentation model, and correct the parameters of the cross-entropy loss function.
The result of the fully convolutional neural network is shared, and the cross-entropy loss function of the fully connected layer is corrected, realizing the correction of the target recognition model.
Steps S202 to S209 above constitute the training step, which is performed iteratively to obtain a trained target recognition model. Specifically, the training step is iterated until the training result of the semantic segmentation model meets a predetermined convergence condition. For example, the predetermined convergence condition may be reaching a predetermined number of iterations, the iterative process ending when that number is reached; or it may be that the difference between the preliminary result and the corrected result converges to some extent, the iterative process ending when that condition is met.
In the target recognition model training method provided by this embodiment, local candidate region fusion is performed according to the weakly supervised information and the multiple local candidate regions, and the corrected semantic segmentation result of the training image is obtained, so that the model parameters of the semantic segmentation model can be corrected. The output of the corrected semantic segmentation model is then shared, the parameters of the target recognition model are corrected, and the trained target recognition model is obtained. Guided by the weakly supervised information supplied in advance, accurate target recognition is realized. Further, the training process of this method contains two branches: one is the training process of the semantic segmentation model, and the other (the object localization branch) trains the local candidate region classification and the image category. The two branches can share training results, which avoids the semantic drift caused by inaccurate supervisory signals in the initial training samples and further improves the accuracy of the semantic segmentation result. Since the scheme obtains pixel-level predictions from the semantic segmentation model that can serve as temporary supervisory information, it enjoys direct pixel-level supervision while still optimizing the semantic segmentation model end to end; the object localization branch it introduces further improves the target recognition result according to the judgments on the local candidate regions, finally realizing accurate classification of the local candidate regions.
Fig. 6 is a flowchart of Embodiment 3 of the target recognition model training method provided by the present invention. Fig. 7 is a schematic diagram of the network model of Embodiment 3. The main difference between this embodiment and Embodiment 2 above is that this embodiment screens the multiple local candidate regions according to the preliminary semantic segmentation result of the training image and the object class probability prediction value of each local candidate region, and carries out the subsequent fusion with the screened local candidate regions. The screening of local candidate regions is likewise part of the training process of the target recognition model. The concrete scheme of this embodiment is described in detail with reference to these two figures; the method described here again explains how the target recognition model is trained with one training image.
As shown in Fig. 6, the training method for each training image comprises the following steps:
Step S301: select multiple local candidate regions from the training image, and obtain the weakly supervised information of the training image. Step S301 is the data preparation step.
Step S302: input the training image into the semantic segmentation model, and obtain the preliminary semantic segmentation result of the training image.
Step S303: classify the multiple local candidate regions by object class using the cross-entropy loss function; predict, for each local candidate region, the probability that it belongs to each object class, obtaining the object class probability prediction value of each region.
Step S304: train the function that predicts the image category of the training image according to the weakly supervised information.
The concrete implementation of steps S301 to S304 can be found in the description of steps S201 to S204 in Embodiment 2 of the present invention and is not repeated here.
Step S305: screen the multiple local candidate regions according to the preliminary semantic segmentation result of the training image and the object class probability prediction value of each local candidate region.
The local candidate regions prepared in step S301 number in the thousands, and as samples they are unbalanced: the number of high-probability object regions (positive samples) and the number of high-probability background regions (negative samples) differ greatly, so the subsequent training process would be affected and yield inaccurate results. This embodiment therefore screens the local candidate regions so that the samples are more balanced.
Specifically, the intersection-over-union (IoU) between the segmentation mask of each local candidate region and the preliminary result of the semantic segmentation of the training image is first computed: the larger the IoU, the higher the probability that the local candidate region is an object; the smaller the IoU, the higher the probability that it is background. Then, the multiple local candidate regions are screened according to the comparison of each region's IoU against the IoU thresholds and of each region's object class prediction value against the prediction value thresholds.
Further, the present embodiment presets two IoU thresholds, a first IoU threshold and a second IoU threshold, where the first IoU threshold is greater than the second; it also presets two object class prediction value thresholds, a first prediction value threshold and a second prediction value threshold, where the first prediction value threshold is greater than the second.
In response to the IoU of a local candidate region being greater than or equal to the first IoU threshold and its object class prediction value being greater than or equal to the first prediction value threshold, the local candidate region is taken as a screened positive-sample local candidate region. In response to the IoU of a local candidate region being less than or equal to the second IoU threshold and its object class prediction value being less than or equal to the second prediction value threshold, the local candidate region is taken as a screened negative-sample local candidate region.
Through the above threshold comparisons, a certain number of positive and negative samples are screened out, and it is ensured that the numbers of positive and negative samples are equal.
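The dual-threshold screening described above can be sketched concretely as follows. This is a minimal NumPy sketch under assumed threshold values; the function names and default thresholds are illustrative, not taken from the patent.

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def screen_candidates(masks, scores, seg, hi_iou=0.5, lo_iou=0.1,
                      hi_score=0.7, lo_score=0.3):
    """Split candidate regions into balanced positive/negative index lists:
    high IoU with the preliminary segmentation AND high class score -> positive;
    low IoU AND low class score -> negative; everything else is discarded."""
    pos, neg = [], []
    for i, (m, s) in enumerate(zip(masks, scores)):
        iou = mask_iou(m, seg)
        if iou >= hi_iou and s >= hi_score:
            pos.append(i)
        elif iou <= lo_iou and s <= lo_score:
            neg.append(i)
    k = min(len(pos), len(neg))   # enforce equal numbers of pos/neg samples
    return pos[:k], neg[:k]
```

Truncating both lists to the shorter length is one simple way to realize the "equal numbers" guarantee; sampling with replacement would be another.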
Step S306: from the screened local candidate regions, select the local candidate regions belonging to the same object class.
This step groups the local candidate regions belonging to the same object class and performs the subsequent operations per group. If the training image contains N object classes, the local candidate regions are divided into N groups, and the subsequent operations are performed for each group.
Step S307: for the local candidate regions belonging to the same object class, perform fusion processing, and use a clustering algorithm to divide the fused image region into a near-object zone, a near-background zone and an ambiguous zone.
Step S308: using the near-object zone and the near-background zone as seeds, segment the ambiguous zone with a segmentation algorithm to obtain the corrected result of the semantic segmentation of the training image.
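Step S308 can be illustrated with a much-simplified stand-in for the seeded segmentation: each ambiguous pixel simply takes the label of its nearest seed pixel. A real implementation would use GrabCut or a similar graph-cut algorithm, which additionally models colour; the nearest-seed rule and all names below are illustrative assumptions.

```python
import numpy as np

def resolve_ambiguous(labels):
    """labels: 2-D int array with 1 = near-object seed, 0 = near-background
    seed, -1 = ambiguous.  Each ambiguous pixel takes the label of the
    nearest seed pixel (squared Euclidean distance).  GrabCut would instead
    fit colour models to the seeds and solve a graph cut."""
    ys, xs = np.nonzero(labels >= 0)          # seed coordinates
    seeds = np.stack([ys, xs], axis=1)
    seed_labels = labels[ys, xs]
    out = labels.copy()
    ay, ax = np.nonzero(labels < 0)           # ambiguous coordinates
    for y, x in zip(ay, ax):
        d = (seeds[:, 0] - y) ** 2 + (seeds[:, 1] - x) ** 2
        out[y, x] = seed_labels[np.argmin(d)]
    return out
```

The point of the sketch is only the seeding scheme: the near-object and near-background zones are trusted, and only the ambiguous zone is re-decided.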
Step S309: correct the model parameters of the semantic segmentation model according to the preliminary result and the corrected result.
Step S310: share the output result of the corrected semantic segmentation model, and correct the parameters of the cross-entropy loss function.
The specific implementation of steps S307 to S310 may refer to the description of steps S206 to S209 in embodiment two of the present invention, and is not repeated here.
Steps S302 to S310 above constitute the training step, which is executed iteratively to obtain the trained target recognition model. Specifically, the training step is iterated until the training result of the semantic segmentation model meets a predetermined convergence condition. For example, the convergence condition may be reaching a predetermined number of iterations, in which case the iteration terminates once that number is reached; or the convergence condition may be that the difference between the preliminary result and the corrected result has converged to a certain extent, in which case the iteration terminates once that condition is met.
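The iterative structure of the training step, with both stopping rules just described, can be sketched as a driver loop. The `model` interface (`segment` / `fuse_and_correct` / `update`) is hypothetical; it only names the roles played by steps S302–S310.

```python
def train(model, images, max_iters=100, tol=1e-3):
    """Iterate the training step until either a fixed iteration budget is
    reached or the average gap between preliminary and corrected
    segmentations falls below a tolerance -- the two convergence
    conditions named in the text."""
    for _ in range(max_iters):
        diff = 0.0
        for img in images:
            prelim = model.segment(img)                       # step S302
            corrected = model.fuse_and_correct(img, prelim)   # steps S303-S308
            diff += model.update(prelim, corrected)           # steps S309-S310
        if diff / len(images) < tol:
            break
    return model
```

`update` is assumed to return a scalar measuring how far the preliminary result was from the corrected one, so that the second convergence rule can be checked directly.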
In the target recognition model training method provided by this embodiment, the local candidate regions are fused according to the weakly supervised information and the multiple local candidate regions, yielding the corrected result of the semantic segmentation of the training image, which is used to correct the model parameters of the semantic segmentation model. Then, the output result of the corrected semantic segmentation model is shared to correct the parameters of the target recognition model, finally obtaining the trained target recognition model. Guided by the weakly supervised information provided in advance, accurate target recognition is achieved. Further, the training process of this method comprises two branches: one branch is the training of the semantic segmentation model, and the other (the object localization branch) is the training of the classification of the local candidate regions of objects together with the training of image-class prediction. The two branches can share training results, which avoids the semantic drift caused by inaccurate supervision signals from the training samples at the initial stage and further improves the accuracy of the semantic segmentation result. The scheme both provides direct pixel-level supervision, allowing the semantic segmentation model to be optimized end to end, and introduces the object localization branch, allowing the target recognition result to be improved according to the judgment of the local candidate regions. In addition, this embodiment also screens the local candidate regions so that the samples are better balanced, further optimizing the training effect and finally achieving accurate classification of the local candidate regions.
The present invention also provides a target identification method, which takes an image to be identified as the input of the target recognition model, and determines the classification results of the multiple local candidate regions selected for the image according to the output result of the target recognition model. Target recognition based on the trained model is performed in the same way as in the prior art; the difference is that the target recognition model used is obtained by the training method provided by the above embodiments of the present invention.
Fig. 8 is a functional block diagram of embodiment one of the target recognition model training apparatus provided by the present invention. As shown in Fig. 8, the target recognition model training apparatus of this embodiment trains the target recognition model using multiple training images annotated in advance with weakly supervised information, and comprises a training module 820.
The training module 820 further comprises an object recognition unit 824, a fusion unit 822 and a correction unit 823.
The object recognition unit 824 is configured to input the multiple local candidate regions selected for the training image into the target recognition model, and obtain the preliminary result of the classification of the multiple local candidate regions output by the target recognition model. In the embodiment of the present invention, the cross-entropy loss function at the fully connected layer of a deep fully convolutional neural network serves as the target recognition model, and the preliminary classification result of the multiple local candidate regions is obtained by prediction with this cross-entropy loss function.
The fusion unit 822 is configured to fuse the local candidate regions according to the weakly supervised information and the preliminary result of the classification of the multiple local candidate regions. The fusion processing of the multiple local candidate regions relies on the previously prepared weakly supervised information.
The correction unit 823 is configured to correct the parameters of the target recognition model according to the preliminary result of the classification of the multiple local candidate regions and the result of the fusion of the local candidate regions.
The training module 820 runs iteratively to obtain the trained target recognition model. Specifically, it iterates until the training result of the target recognition model meets a predetermined convergence condition: for example, reaching a predetermined number of iterations, after which the iteration terminates; or the difference between the preliminary result and the corrected result converging to a certain extent, after which the iteration likewise terminates.
Further, the training apparatus also comprises a data preparation module 810, configured to select multiple local candidate regions from the training image and obtain the weakly supervised information of the training image.
The embodiment of the present invention splits a training image into multiple local candidate regions for understanding and learning. The principle of the split is to cover objects of different sizes in the training image as far as possible; each local candidate region may cover only a part of an object and need not contain it completely, so the information captured by the local candidate regions is richer. Specifically, the training image is split by super-pixel segmentation into several image blocks, which are then clustered and combined to obtain the multiple local candidate regions. The embodiment of the present invention may adopt any local candidate region selection method provided in the prior art and is not restricted in this respect.
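The split-then-cluster idea can be sketched with a toy stand-in: fixed grid cells play the role of superpixels, and a greedy merge by mean intensity plays the role of the clustering/combination step. Real systems would use SLIC-style superpixels or selective search; all names and thresholds here are illustrative.

```python
import numpy as np

def grid_superpixels(img, block=2):
    """Toy stand-in for superpixel segmentation: tile the image into
    block x block cells and return a per-pixel cell-index map."""
    h, w = img.shape
    labels = np.zeros((h, w), dtype=int)
    cells_per_row = w // block
    for y in range(h):
        for x in range(w):
            labels[y, x] = (y // block) * cells_per_row + (x // block)
    return labels

def candidate_regions(img, labels, thresh=10.0):
    """Greedily merge cells with similar mean intensity into candidate
    region masks (the patent clusters real superpixels instead)."""
    ids = np.unique(labels)
    means = {i: img[labels == i].mean() for i in ids}
    groups, used = [], set()
    for i in ids:
        if i in used:
            continue
        grp = [j for j in ids
               if j not in used and abs(means[j] - means[i]) <= thresh]
        used.update(grp)
        groups.append(np.isin(labels, grp))   # boolean mask per candidate
    return groups
```

Note that the resulting candidates may overlap object boundaries only partially, which is exactly the property the text argues makes the learned information richer.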
In addition, the data preparation module 810 is further configured to obtain the object class information of the training image. Traditional pixel-level annotation needs to mark accurately the object class to which each pixel of the training image belongs, whereas the weakly supervised information in the present invention is only the object class information contained in the training image. For example, if a training image contains a person and an airplane, traditional pixel-level annotation needs to mark, for each pixel of that image, whether it belongs to the person or the airplane, while the present invention only needs to note that the training image contains a person and an airplane. In other words, the training apparatus is informed in advance of the object classes contained in the training image, but not of the positions of the objects.
In the target recognition model training apparatus provided by this embodiment, the multiple local candidate regions selected for the training image are input into the target recognition model to obtain the preliminary result of their classification; the local candidate regions are then fused according to the weakly supervised information and this preliminary result, and the parameters of the target recognition model are corrected according to the fusion result, finally obtaining the trained target recognition model. Guided by the weakly supervised information provided in advance, accurate target recognition is achieved.
Fig. 9 is a functional block diagram of embodiment two of the target recognition model training apparatus provided by the present invention. On the basis of apparatus embodiment one, this embodiment additionally designs a multi-task training subsystem that uses image-level annotations as a constraint, so as to avoid the semantic drift caused by inaccurate supervision signals from the training samples at the initial stage.
The training module 820 further comprises a semantic segmentation unit 821, configured to input the training image into the semantic segmentation model and obtain the preliminary result of the semantic segmentation of the training image. Optionally, the embodiment of the present invention uses a deep fully convolutional neural network as the semantic segmentation model: the network learns the parameters of intermediate representations through multiple convolutional / nonlinear-response / pooling layers, and its prediction yields the preliminary result of the semantic segmentation of the training image.
The object recognition unit 824 is configured to classify, using the cross-entropy loss function, the multiple local candidate regions by object class, and to predict the probability that each local candidate region belongs to each object class, obtaining the object class probability prediction value of each local candidate region. Specifically, the object class probability prediction value of each local candidate region is learned by the fully convolutional neural network.
By sharing the results of the fully convolutional neural network, the object recognition unit 824 can predict the class of each local candidate region at the fully connected layer.
The fusion unit 822 is configured to fuse the local candidate regions according to the weakly supervised information and the preliminary result of the classification of the multiple local candidate regions, obtaining the corrected result of the semantic segmentation of the training image.
The correction unit 823 is further configured to: correct the model parameters of the semantic segmentation model according to the preliminary result and the corrected result of the semantic segmentation; and share the output result of the corrected semantic segmentation model to correct the parameters of the target recognition model. The difference between the embodiment of the present invention and the prior art is that this embodiment does not use the ground-truth result of the semantic segmentation (such as pixel-level annotation provided in advance) to correct the model parameters of the semantic segmentation model during training, but instead uses the result obtained by fusing the multiple local candidate regions of the training image. In place of a ground truth, the corrected result obtained by the fusion unit 822 is regarded as the standard output: the difference between the standard output and the preliminary result is determined, the loss function of the semantic segmentation model is derived from this difference, and the loss response is back-propagated to update the model parameters of the semantic segmentation model.
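The "standard output" idea — scoring the preliminary segmentation against the fused/corrected labels with a per-pixel loss whose gradient drives the parameter update — can be sketched as follows. The per-pixel cross-entropy form is an assumption (the text only says a loss is derived from the difference), and the H×W×C logits layout is illustrative.

```python
import numpy as np

def correction_loss(prelim_logits, corrected_labels):
    """Treat the fused/corrected segmentation as the standard output and
    score the preliminary prediction against it with per-pixel
    cross-entropy.  prelim_logits: (H, W, C) class scores from the
    segmentation network; corrected_labels: (H, W) int label map from the
    fusion step.  The gradient of this scalar is what would be
    back-propagated to update the segmentation model."""
    # numerically stable log-softmax over the class axis
    z = prelim_logits - prelim_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    h, w, _ = prelim_logits.shape
    picked = log_probs[np.arange(h)[:, None],
                       np.arange(w)[None, :],
                       corrected_labels]
    return -picked.mean()
```

When the preliminary prediction already agrees with the corrected labels the loss is near zero, so the update signal naturally fades as the two branches converge.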
The training module 820 also comprises an image class prediction unit 825, configured to train, according to the weakly supervised information, the function for predicting the image class of the training image.
Specifically, the image class training makes use of a multi-instance training scheme, classifying with a Log-Sum-Exponential model, whose optimization formula is as follows:
where I_k is the k-th training image, c is a class, x_kj is the expressive feature of the j-th local candidate region of the k-th training image, M is the number of local candidate regions of the k-th training image, and w_c is the classifier parameter of class c to be learned. The formula predicts the probability that the class of I_k is c, i.e. Pr(I_k ∈ c | w_c).
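The formula itself did not survive in the text above. Given the surrounding variable definitions and the "Log-Sum-Exponential" name, it is plausibly the standard log-sum-exp pooling score used in multi-instance training — reconstructed here as an assumption, with $r$ a smoothing hyper-parameter not defined in the surviving text:

```latex
s_{kc} \;=\; \frac{1}{r}\,\log\!\left(\frac{1}{M}\sum_{j=1}^{M}
\exp\!\left(r\, w_c^{\top} x_{kj}\right)\right),
\qquad
\Pr\!\left(I_k \in c \,\middle|\, w_c\right)
\;=\; \frac{\exp\!\left(s_{kc}\right)}{\sum_{c'}\exp\!\left(s_{kc'}\right)}
```

As $r \to \infty$ the pooled score approaches the maximum over the $M$ local candidate regions, and as $r \to 0$ it approaches their mean, which is why log-sum-exp pooling is a common smooth compromise between max- and average-pooling in weakly supervised classification.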
Taking the weakly supervised information as input, the image class prediction unit 825 can learn the classifier parameters of all classes through the above optimization formula. Since the weakly supervised information is standard annotation information of the image class, learning the per-class classifier parameters from it equips the network with the ability to predict correctly when a similar input image is encountered later.
Further, the fusion unit 822 comprises a grouping sub-unit 822a, a fusion processing sub-unit 822b and a segmentation sub-unit 822c.
The grouping sub-unit 822a is configured to select, from the multiple local candidate regions, the local candidate regions belonging to the same object class. It groups the local candidate regions belonging to the same object class and hands each group over to the fusion processing sub-unit 822b and the segmentation sub-unit 822c for processing. If the training image contains N object classes, the local candidate regions are divided into N groups, each of which is handed over to the two sub-units.
The fusion processing sub-unit 822b is configured to perform fusion processing on the local candidate regions belonging to the same object class, and to divide the fused image region, using a clustering algorithm, into a near-object zone, a near-background zone and an ambiguous zone.
Since many local candidate regions are selected, performing fusion processing on all local candidate regions belonging to the same object class would be computationally expensive. To reduce the amount of computation, the fusion processing sub-unit 822b optionally picks out a batch of local candidate regions, from those belonging to the same object class, for fusion processing. The picking principle may adopt, but is not limited to, the following two:
One is to pick out, in descending order of object class probability prediction value, a predetermined number of local candidate regions from those belonging to the same object class, and perform fusion processing on them.
The other is to pick out, from the local candidate regions belonging to the same object class, those whose object class probability prediction value is higher than a predetermined threshold, and perform fusion processing on them.
Both principles select based on the object class probability prediction value of the local candidate regions: the magnitude of this value reflects how likely a local candidate region is to belong to a given object class, and the purpose of both principles is to pick out the local candidate regions most likely to belong to an object.
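The two picking principles reduce to a top-k sort or a threshold filter over the prediction values. A minimal sketch, with names assumed for illustration:

```python
def pick_candidates(scores, k=None, thresh=None):
    """Select candidate-region indices either by top-k prediction value
    or by a prediction-value threshold -- the two picking principles
    described above.  Exactly one of k / thresh should be given."""
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if k is not None:
        return idx[:k]
    return [i for i in idx if scores[i] > thresh]
```

Returning indices rather than masks keeps the selection decoupled from whatever representation the candidate regions use.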
The fusion processing sub-unit 822b performs fusion processing on the picked local candidate regions. The detailed process may be: perform image segmentation on the picked local candidate regions to obtain their binary segmentation masks; fuse the binary segmentation masks of the picked local candidate regions; and divide the fused image region, using a clustering algorithm, into a near-object zone, a near-background zone and an ambiguous zone.
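The mask-fusion step can be illustrated by per-pixel voting over the picked binary masks, with two vote-ratio cutoffs standing in for the clustering into near-object / near-background / ambiguous zones. The cutoff values and the voting scheme are illustrative assumptions.

```python
import numpy as np

def fuse_masks(masks, hi=0.7, lo=0.3):
    """Fuse the binary segmentation masks of the picked candidate regions
    by per-pixel voting, then split the fused region into near-object (1),
    near-background (0) and ambiguous (-1) zones.  Thresholding the vote
    ratio stands in for the clustering step described in the text."""
    votes = np.mean(np.stack(masks).astype(float), axis=0)
    out = np.full(votes.shape, -1, dtype=int)   # default: ambiguous
    out[votes >= hi] = 1                        # most masks agree: object
    out[votes <= lo] = 0                        # most masks agree: background
    return out
```

Pixels on which the picked candidates disagree end up in the ambiguous zone, which is precisely what the seeded segmentation of the next step resolves.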
The segmentation sub-unit 822c is configured to use the near-object zone and the near-background zone as seeds and segment the ambiguous zone with a segmentation algorithm, obtaining the corrected result of the semantic segmentation of the training image.
To further predict the segmentation result of the ambiguous zone, the segmentation sub-unit 822c uses the near-object zone and the near-background zone as seeds and segments the ambiguous zone with an optional segmentation algorithm, such as the GrabCut algorithm, obtaining the corrected semantic segmentation result for one object class. If the training image contains N object classes, the local candidate regions are divided into N groups, each processed in turn by the fusion processing sub-unit 822b and the segmentation sub-unit 822c, yielding the corrected semantic segmentation results of all object classes and finally the corrected result of the semantic segmentation of the whole training image.
In the target recognition model training apparatus provided by this embodiment, the local candidate regions are fused according to the weakly supervised information and the multiple local candidate regions, yielding the corrected result of the semantic segmentation of the training image, which is used to correct the model parameters of the semantic segmentation model. Then, the output result of the corrected semantic segmentation model is shared to correct the parameters of the target recognition model, finally obtaining the trained target recognition model. Guided by the weakly supervised information provided in advance, accurate target recognition is achieved. Further, the training process of this apparatus comprises two branches: one branch is the training of the semantic segmentation model, and the other (the object localization branch) is the training of the classification of the local candidate regions of objects together with the training of image-class prediction. The two branches can share training results, which avoids the semantic drift caused by inaccurate supervision signals from the training samples at the initial stage and further improves the accuracy of the semantic segmentation result. The scheme both provides direct pixel-level supervision, allowing the semantic segmentation model to be optimized end to end, and introduces the object localization branch, allowing the target recognition result to be improved according to the judgment of the local candidate regions, finally achieving accurate classification of the local candidate regions.
Fig. 10 is a functional block diagram of embodiment three of the target recognition model training apparatus provided by the present invention. On the basis of apparatus embodiment two, the training module 820 of this embodiment also comprises a screening unit 826, configured to screen the multiple local candidate regions according to the preliminary result of the semantic segmentation of the training image and the object class probability prediction value of each local candidate region.
The fusion unit 822 is further configured to fuse, according to the weakly supervised information, the screened local candidate regions, obtaining the corrected result of the semantic segmentation of the training image.
Since the data preparation module may prepare thousands of local candidate regions, and these regions serve as training samples, the sample distribution is typically unbalanced: the number of regions that are objects with high probability (positive samples) and the number of regions that are background with high probability (negative samples) differ greatly, so the subsequent training process is liable to produce inaccurate results. The screening unit 826 of this embodiment therefore screens the local candidate regions so that the samples are better balanced.
Specifically, the screening unit 826 first computes the intersection-over-union (IoU) between the segmentation mask of each local candidate region and the preliminary result of the semantic segmentation of the training image: the larger the IoU, the higher the probability that the local candidate region is an object; the smaller the IoU, the higher the probability that it is background. It then screens the multiple local candidate regions according to the comparison of each region's IoU against the IoU thresholds and of each region's object class prediction value against the prediction value thresholds.
Further, the present embodiment presets two IoU thresholds, a first IoU threshold and a second IoU threshold, where the first IoU threshold is greater than the second; it also presets two object class prediction value thresholds, a first prediction value threshold and a second prediction value threshold, where the first prediction value threshold is greater than the second.
The screening unit 826 is further configured to: in response to the IoU of a local candidate region being greater than or equal to the first IoU threshold and its object class prediction value being greater than or equal to the first prediction value threshold, take the local candidate region as a screened positive-sample local candidate region; and in response to the IoU of a local candidate region being less than or equal to the second IoU threshold and its object class prediction value being less than or equal to the second prediction value threshold, take the local candidate region as a screened negative-sample local candidate region.
Through the above threshold comparisons, the screening unit 826 screens out a certain number of positive and negative samples, and ensures that the numbers of positive and negative samples are equal.
The grouping sub-unit 822a is configured to select, from the screened local candidate regions, the local candidate regions belonging to the same object class.
The fusion processing sub-unit 822b is configured to perform fusion processing on the local candidate regions belonging to the same object class, and to divide the fused image region, using a clustering algorithm, into a near-object zone, a near-background zone and an ambiguous zone.
The segmentation sub-unit 822c is configured to use the near-object zone and the near-background zone as seeds and segment the ambiguous zone with a segmentation algorithm, obtaining the corrected result of the semantic segmentation of the training image.
The fusion processing sub-unit 822b is further configured to: pick out, in descending order of object class probability prediction value, a predetermined number of local candidate regions from those belonging to the same object class and perform fusion processing on them; or pick out, from the local candidate regions belonging to the same object class, those whose object class probability prediction value is higher than a predetermined threshold and perform fusion processing on them.
In the target recognition model training apparatus provided by this embodiment, the local candidate regions are fused according to the weakly supervised information and the multiple local candidate regions, yielding the corrected result of the semantic segmentation of the training image, which is used to correct the model parameters of the semantic segmentation model. Then, the output result of the corrected semantic segmentation model is shared to correct the parameters of the target recognition model, finally obtaining the trained target recognition model. Guided by the weakly supervised information provided in advance, accurate target recognition is achieved. Further, the training process of this apparatus comprises two branches: one branch is the training of the semantic segmentation model, and the other (the object localization branch) is the training of the classification of the local candidate regions of objects together with the training of image-class prediction. The two branches can share training results, which avoids the semantic drift caused by inaccurate supervision signals from the training samples at the initial stage and further improves the accuracy of the semantic segmentation result. The scheme both provides direct pixel-level supervision, allowing the semantic segmentation model to be optimized end to end, and introduces the object localization branch, allowing the target recognition result to be improved according to the judgment of the local candidate regions. In addition, this embodiment also screens the local candidate regions so that the samples are better balanced, further optimizing the training effect and finally achieving accurate classification of the local candidate regions.
The present invention also provides a target identification apparatus, which takes an image to be identified as the input of the target recognition model, and determines the classification results of the multiple local candidate regions selected for the image according to the output result of the target recognition model. The apparatus performing target recognition based on the trained model is the same as in the prior art; the difference is that the target recognition model used is obtained by the training apparatus provided by the above embodiments of the present invention.
The methods and displays provided herein are not inherently related to any particular computer, virtual system or other equipment. Various general-purpose systems may also be used with the teaching herein, and the structure required to construct such systems is apparent from the description above. Moreover, the present invention is not directed to any particular programming language; it should be understood that various programming languages may be used to implement the content of the invention described herein, and the above description of specific languages is given in order to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided herein. It is to be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be appreciated that, to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, features of the present invention are sometimes grouped together, in the description of exemplary embodiments above, into a single embodiment, figure or description thereof. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. The claims following the detailed description are thus hereby expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of an embodiment may be combined into one module or unit or component, and may furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, an equivalent or a similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include certain features included in other embodiments but not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of a device for acquiring application information according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing some or all of the methods described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, Figure 11 shows a computing device that can implement the target recognition model training method of the invention. The computing device may be a terminal or a server. The computing device conventionally comprises a processor 1110 and a computer program product or computer-readable medium in the form of a storage device 1120, and additionally includes a communication interface and a communication bus. The storage device 1120 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM, hard disk or ROM. The one or more processors, the communication interface and the memory communicate with one another via the communication bus. The processor may be a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit). The storage device 1120 has a memory space 1130 for program code 1131 for performing any of the steps of the method described above, and stores at least instructions that cause the processor to perform the various steps of the target recognition model training method of embodiments of the invention. For example, the memory space 1130 for program code may comprise respective program codes 1131 for implementing the various steps of the above method. The program code can be read from or written to one or more computer program products. These computer program products comprise a program code carrier such as a hard disk, a compact disc (CD), a memory card or a floppy disk. Such a computer program product is typically a portable or fixed storage unit, for example as shown in Figure 12. The storage unit may have storage segments, storage spaces, and so forth arranged similarly to the storage device 1120 in the computing device of Figure 11. The program code may, for example, be compressed in a suitable form. Typically, the storage unit comprises computer-readable code 1131', i.e. code that can be read by a processor such as the processor 1110, which, when run by a computing device, causes the computing device to perform the steps of the method described above.
It should be noted that the above-described embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second and third, etc. does not indicate any order; these words may be interpreted as names.

Claims (10)

1. A target recognition model training method, characterized in that the method trains a target recognition model using a plurality of training images pre-labeled with weakly supervised information, and for each training image the training step comprises:
inputting a plurality of local candidate regions selected for the training image into the target recognition model, and obtaining preliminary results of classification of the plurality of local candidate regions output by the target recognition model;
performing local candidate region fusion according to the weakly supervised information and the preliminary results of classification of the plurality of local candidate regions;
correcting parameters of the target recognition model according to the preliminary results of classification of the plurality of local candidate regions and the result of the local candidate region fusion;
iteratively performing the above training step until the training result of the target recognition model satisfies a predetermined convergence condition.
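The patent gives no implementation; purely as an illustrative sketch of the iterative step recited in claim 1 (classify candidate regions, fuse them under the weak image-level label, correct the parameters, repeat until convergence), one might write something like the following. The linear classifier, the fusion rule and all function names here are hypothetical stand-ins, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def classify_regions(weights, regions):
    # Preliminary classification: per-region class scores from a
    # (hypothetical) linear model, turned into probabilities by a softmax.
    logits = regions @ weights
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fuse_regions(probs, image_label_mask):
    # Local candidate region "fusion" (one possible reading): keep only the
    # classes named by the weak image-level label, then pick, per region,
    # the most probable permitted class as its pseudo-label.
    return (probs * image_label_mask).argmax(axis=1)

def train(regions, image_label_mask, n_classes, lr=0.5, max_iter=200, tol=1e-4):
    weights = rng.normal(scale=0.1, size=(regions.shape[1], n_classes))
    prev_loss = np.inf
    for _ in range(max_iter):
        probs = classify_regions(weights, regions)      # preliminary results
        pseudo = fuse_regions(probs, image_label_mask)  # fusion result
        onehot = np.eye(n_classes)[pseudo]
        loss = -np.log(probs[np.arange(len(pseudo)), pseudo] + 1e-12).mean()
        weights -= lr * regions.T @ (probs - onehot) / len(regions)  # correction
        if abs(prev_loss - loss) < tol:                 # convergence condition
            break
        prev_loss = loss
    return weights, loss
```

In the patent the model is presumably a deep network rather than this toy linear classifier; the sketch only shows the classify → fuse → correct → converge cycle of claim 1.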
2. The target recognition model training method according to claim 1, characterized in that the weakly supervised information comprises: object category information.
3. The target recognition model training method according to claim 1, characterized in that, before the plurality of local candidate regions selected for the training image are input into the target recognition model, the method further comprises: performing super-pixel segmentation processing on the training image, and clustering the image blocks obtained by the super-pixel segmentation processing to obtain the plurality of local candidate regions.
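Claim 3 obtains the local candidate regions by super-pixel segmentation followed by clustering of the resulting image blocks. The patent does not specify the clustering algorithm; as a hedged sketch only, plain k-means over per-block feature vectors (e.g. mean colour plus position) is assumed here, and `cluster_blocks` is a hypothetical name:

```python
import numpy as np

def cluster_blocks(feats, k, n_iter=20):
    # Group super-pixel blocks into k local candidate regions with plain
    # k-means; farthest-point initialisation keeps the sketch deterministic.
    centers = feats[[0]]
    for _ in range(k - 1):
        d = ((feats[:, None, :] - centers[None]) ** 2).sum(-1).min(axis=1)
        centers = np.vstack([centers, feats[d.argmax()]])
    for _ in range(n_iter):
        # Assign each block to its nearest center, then recompute centers.
        assign = ((feats[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        centers = np.vstack([feats[assign == j].mean(axis=0)
                             if (assign == j).any() else centers[j]
                             for j in range(k)])
    return assign  # assign[i] = index of the candidate region of block i
```

In practice the super-pixel step itself would come from an algorithm such as SLIC; only the block-clustering half of claim 3 is sketched here.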
4. The target recognition model training method according to any one of claims 1-3, characterized in that the method further comprises:
inputting the training image into a semantic segmentation model to obtain a preliminary result of semantic segmentation of the training image;
wherein performing local candidate region fusion according to the weakly supervised information and the preliminary results of classification of the plurality of local candidate regions specifically comprises: performing local candidate region fusion according to the weakly supervised information and the preliminary results of classification of the plurality of local candidate regions, and obtaining a corrected result of the semantic segmentation of the training image;
and wherein correcting the parameters of the target recognition model according to the preliminary results of classification of the plurality of local candidate regions and the result of the local candidate region fusion further comprises:
correcting model parameters of the semantic segmentation model according to the preliminary result of the semantic segmentation and the corrected result of the semantic segmentation;
sharing the output result of the corrected semantic segmentation model to correct the parameters of the target recognition model.
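Claim 4 corrects a preliminary semantic segmentation using the result of local candidate region fusion. One minimal way to picture that correction, assuming (hypothetically, since the patent does not fix the representation) that each fused candidate region carries a boolean pixel mask and a single fused class label:

```python
import numpy as np

def correct_segmentation(prelim_seg, region_masks, fused_labels):
    # Correction of the preliminary semantic segmentation by the fusion
    # result: pixels inside each fused candidate region are overwritten
    # with that region's fused class label.
    corrected = prelim_seg.copy()
    for mask, label in zip(region_masks, fused_labels):
        corrected[mask] = label
    return corrected
```

The corrected map would then serve as the target for updating the segmentation model's parameters, per the last two steps of claim 4.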
5. The target recognition model training method according to any one of claims 1-4, characterized in that inputting the plurality of local candidate regions selected for the training image into the target recognition model and obtaining the preliminary results of classification of the plurality of local candidate regions output by the target recognition model further comprises:
classifying the plurality of local candidate regions by object category using a cross-entropy loss function;
predicting the probability that each local candidate region belongs to an object category, and obtaining an object category probability prediction value for each local candidate region;
and wherein correcting the parameters of the target recognition model further comprises: correcting parameters of the cross-entropy loss function.
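Claim 5 classifies the candidate regions with a cross-entropy loss and predicts per-region class probabilities. A minimal softmax/cross-entropy sketch (the function names are hypothetical and this is not the patent's code):

```python
import numpy as np

def region_class_probs(logits):
    # Softmax turns per-region class scores into the object-category
    # probability prediction values of claim 5 (rows sum to 1).
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    # Mean cross-entropy loss over the local candidate regions; the small
    # epsilon guards the log against a zero probability.
    p = region_class_probs(logits)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
```

Minimizing this loss with respect to the model parameters is one standard reading of the "correcting parameters of the cross-entropy loss function" step.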
6. The target recognition model training method according to any one of claims 1-5, characterized in that the training step further comprises:
training, according to the weakly supervised information, a function for predicting the image category of the training image.
7. A target recognition method, characterized by comprising:
taking an image to be recognized as the input of a target recognition model, the target recognition model having been trained in advance using the method according to any one of claims 1-6;
determining, according to the output result of the target recognition model, classification results of a plurality of local candidate regions selected for the image.
8. A target recognition model training device, characterized in that the device trains a target recognition model using a plurality of training images pre-labeled with weakly supervised information, the training device comprising:
a target recognition unit, for inputting a plurality of local candidate regions selected for the training image into the target recognition model and obtaining preliminary results of classification of the plurality of local candidate regions output by the target recognition model;
a fusion unit, for performing local candidate region fusion according to the weakly supervised information and the preliminary results of classification of the plurality of local candidate regions;
a correction unit, for correcting parameters of the target recognition model according to the preliminary results of classification of the plurality of local candidate regions and the result of the local candidate region fusion;
wherein the training device operates iteratively until the training result of the target recognition model satisfies a predetermined convergence condition.
9. A target recognition device, characterized in that the target recognition device is configured to take an image to be recognized as the input of a target recognition model, and to determine, according to the output result of the target recognition model, classification results of a plurality of local candidate regions selected for the image; wherein the target recognition model is trained in advance using the training device according to claim 8.
10. A computing device, characterized by comprising: a processor, a communication interface, a memory and a communication bus; the processor, the communication interface and the memory communicating with one another via the communication bus;
the memory being configured to store at least an instruction, the instruction causing the computing device to perform the following operations:
inputting a plurality of local candidate regions selected for a training image into a target recognition model, and obtaining preliminary results of classification of the plurality of local candidate regions output by the target recognition model;
performing local candidate region fusion according to weakly supervised information and the preliminary results of classification of the plurality of local candidate regions;
correcting parameters of the target recognition model according to the preliminary results of classification of the plurality of local candidate regions and the result of the local candidate region fusion;
iteratively performing the above training step until the training result of the target recognition model satisfies a predetermined convergence condition.
CN201610849633.9A 2016-09-23 2016-09-23 Target identification model training and target identification method and device, and computing equipment Active CN106529565B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610849633.9A CN106529565B (en) 2016-09-23 2016-09-23 Target identification model training and target identification method and device, and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610849633.9A CN106529565B (en) 2016-09-23 2016-09-23 Target identification model training and target identification method and device, and computing equipment

Publications (2)

Publication Number Publication Date
CN106529565A true CN106529565A (en) 2017-03-22
CN106529565B CN106529565B (en) 2019-09-13

Family

ID=58344227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610849633.9A Active CN106529565B (en) Target identification model training and target identification method and device, and computing equipment

Country Status (1)

Country Link
CN (1) CN106529565B (en)

Cited By (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992841A (en) * 2017-12-13 2018-05-04 北京小米移动软件有限公司 The method and device of identification objects in images, electronic equipment, readable storage medium storing program for executing
CN108182394A (en) * 2017-12-22 2018-06-19 浙江大华技术股份有限公司 Training method, face identification method and the device of convolutional neural networks
CN108230354A (en) * 2017-05-18 2018-06-29 深圳市商汤科技有限公司 Target following, network training method, device, electronic equipment and storage medium
CN108537289A (en) * 2018-04-24 2018-09-14 百度在线网络技术(北京)有限公司 Training method, device and the storage medium of data identification model
CN108564134A (en) * 2018-04-27 2018-09-21 网易(杭州)网络有限公司 Data processing method, device, computing device and medium
CN108764306A (en) * 2018-05-15 2018-11-06 深圳大学 Image classification method, device, computer equipment and storage medium
CN108764235A (en) * 2018-05-23 2018-11-06 中国民用航空总局第二研究所 Neural network model, object detection method, equipment and medium
CN108876804A (en) * 2017-10-12 2018-11-23 北京旷视科技有限公司 It scratches as model training and image are scratched as methods, devices and systems and storage medium
CN108921054A (en) * 2018-06-15 2018-11-30 华中科技大学 A kind of more attribute recognition approaches of pedestrian based on semantic segmentation
CN108965982A (en) * 2018-08-28 2018-12-07 百度在线网络技术(北京)有限公司 Video recording method, device, electronic equipment and readable storage medium storing program for executing
CN108960232A (en) * 2018-06-08 2018-12-07 Oppo广东移动通信有限公司 Model training method, device, electronic equipment and computer readable storage medium
CN109210684A (en) * 2018-09-18 2019-01-15 珠海格力电器股份有限公司 Control the method, apparatus and air-conditioning device of air-conditioning
CN109460820A (en) * 2018-09-28 2019-03-12 深圳百诺国际生命科技有限公司 A kind of neural network training method, device, computer equipment and storage medium
CN109543836A (en) * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 Operation method, device and Related product
CN109543835A (en) * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 Operation method, device and Related product
CN109543833A (en) * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 Operation method, device and Related product
CN109583580A (en) * 2018-11-30 2019-04-05 上海寒武纪信息科技有限公司 Operation method, device and Related product
CN109584862A (en) * 2017-09-29 2019-04-05 上海寒武纪信息科技有限公司 Image processing apparatus and method
CN109670594A (en) * 2018-12-28 2019-04-23 北京旷视科技有限公司 Data training method, device and electronic equipment
CN109724608A (en) * 2017-10-27 2019-05-07 通用汽车环球科技运作有限责任公司 It is adapted to by the region of the classification balance self-training with spatial prior
CN109829501A (en) * 2019-02-01 2019-05-31 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN110008962A (en) * 2019-04-11 2019-07-12 福州大学 Weakly supervised semantic segmentation method based on attention mechanism
CN110135456A (en) * 2019-04-08 2019-08-16 图麟信息科技(上海)有限公司 A kind of training method and device of target detection model
CN110188650A (en) * 2019-05-24 2019-08-30 张碧辉 Twin-stage visual field object detection method, apparatus and system based on ptz camera
CN110245662A (en) * 2019-06-18 2019-09-17 腾讯科技(深圳)有限公司 Detection model training method, device, computer equipment and storage medium
CN110263124A (en) * 2018-11-27 2019-09-20 上海亿通国际股份有限公司 Data detection system
CN110349147A (en) * 2019-07-11 2019-10-18 腾讯医疗健康(深圳)有限公司 Training method, the lesion recognition methods of fundus flavimaculatus area, device and the equipment of model
CN110363138A (en) * 2019-07-12 2019-10-22 腾讯科技(深圳)有限公司 Model training method, image processing method, device, terminal and storage medium
CN110738125A (en) * 2019-09-19 2020-01-31 平安科技(深圳)有限公司 Method, device and storage medium for selecting detection frame by using Mask R-CNN
CN110781980A (en) * 2019-11-08 2020-02-11 北京金山云网络技术有限公司 Training method of target detection model, target detection method and device
CN111160301A (en) * 2019-12-31 2020-05-15 同济大学 Tunnel disease target intelligent identification and extraction method based on machine vision
CN111401376A (en) * 2020-03-12 2020-07-10 腾讯科技(深圳)有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN111401253A (en) * 2020-03-17 2020-07-10 吉林建筑大学 Target detection method based on deep learning
CN111428726A (en) * 2020-06-10 2020-07-17 中山大学 Panorama segmentation method, system, equipment and storage medium based on graph neural network
CN111489357A (en) * 2019-01-29 2020-08-04 广州市百果园信息技术有限公司 Image segmentation method, device, equipment and storage medium
CN111523596A (en) * 2020-04-23 2020-08-11 北京百度网讯科技有限公司 Target recognition model training method, device, equipment and storage medium
CN111539291A (en) * 2020-04-16 2020-08-14 创新奇智(合肥)科技有限公司 Target detection method and device based on radar waves, electronic equipment and storage medium
CN111661059A (en) * 2019-03-08 2020-09-15 虹软科技股份有限公司 Method and system for monitoring distracted driving and electronic equipment
CN111669492A (en) * 2019-03-06 2020-09-15 青岛海信移动通信技术股份有限公司 Method for processing shot digital image by terminal and terminal
CN112070099A (en) * 2020-09-08 2020-12-11 江西财经大学 Image processing method based on machine learning
CN112446378A (en) * 2020-11-30 2021-03-05 展讯通信(上海)有限公司 Target detection method and device, storage medium and terminal
CN112580717A (en) * 2020-12-17 2021-03-30 百度在线网络技术(北京)有限公司 Model training method, positioning element searching method and device
CN113052217A (en) * 2021-03-15 2021-06-29 上海云从汇临人工智能科技有限公司 Prediction result identification and model training method and device thereof, and computer storage medium
CN113163133A (en) * 2018-10-15 2021-07-23 华为技术有限公司 Image processing method, device and equipment
CN117787511A (en) * 2024-02-28 2024-03-29 福州工小四物联科技有限公司 Industrial high-density aquaculture monitoring and early warning method and system thereof

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894275A (en) * 2010-06-29 2010-11-24 武汉大学 Weakly supervised method for classifying SAR images
CN104992184A (en) * 2015-07-02 2015-10-21 东南大学 Multiclass image classification method based on semi-supervised extreme learning machine

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YUNCHAO WEI et al.: "STC: A Simple to Complex Framework for Weakly-Supervised Semantic Segmentation", 《IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE》 *
WANG Hailuo et al.: "Salient region detection based on a superpixel fusion algorithm", 《Journal of Beijing Institute of Technology》 *

Cited By (73)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108230354A (en) * 2017-05-18 2018-06-29 深圳市商汤科技有限公司 Target following, network training method, device, electronic equipment and storage medium
CN108230354B (en) * 2017-05-18 2022-05-10 深圳市商汤科技有限公司 Target tracking method, network training method, device, electronic equipment and storage medium
CN109584862B (en) * 2017-09-29 2024-01-12 上海寒武纪信息科技有限公司 Image processing apparatus and method
CN109584862A (en) * 2017-09-29 2019-04-05 上海寒武纪信息科技有限公司 Image processing apparatus and method
CN108876804A (en) * 2017-10-12 2018-11-23 北京旷视科技有限公司 It scratches as model training and image are scratched as methods, devices and systems and storage medium
CN109724608A (en) * 2017-10-27 2019-05-07 通用汽车环球科技运作有限责任公司 It is adapted to by the region of the classification balance self-training with spatial prior
CN107992841A (en) * 2017-12-13 2018-05-04 北京小米移动软件有限公司 The method and device of identification objects in images, electronic equipment, readable storage medium storing program for executing
CN108182394A (en) * 2017-12-22 2018-06-19 浙江大华技术股份有限公司 Training method, face identification method and the device of convolutional neural networks
CN108537289A (en) * 2018-04-24 2018-09-14 百度在线网络技术(北京)有限公司 Training method, device and the storage medium of data identification model
CN108537289B (en) * 2018-04-24 2023-04-07 百度在线网络技术(北京)有限公司 Training method and device of data recognition model and storage medium
CN108564134B (en) * 2018-04-27 2021-07-06 网易(杭州)网络有限公司 Data processing method, device, computing equipment and medium
CN108564134A (en) * 2018-04-27 2018-09-21 网易(杭州)网络有限公司 Data processing method, device, computing device and medium
US11238311B2 (en) 2018-05-15 2022-02-01 Shenzhen University Method for image classification, computer device, and storage medium
CN108764306A (en) * 2018-05-15 2018-11-06 深圳大学 Image classification method, device, computer equipment and storage medium
CN108764306B (en) * 2018-05-15 2022-04-22 深圳大学 Image classification method and device, computer equipment and storage medium
CN108764235A (en) * 2018-05-23 2018-11-06 中国民用航空总局第二研究所 Neural network model, object detection method, equipment and medium
CN108764235B (en) * 2018-05-23 2021-06-29 中国民用航空总局第二研究所 Target detection method, apparatus and medium
CN108960232A (en) * 2018-06-08 2018-12-07 Oppo广东移动通信有限公司 Model training method, device, electronic equipment and computer readable storage medium
CN108921054B (en) * 2018-06-15 2021-08-03 华中科技大学 Pedestrian multi-attribute identification method based on semantic segmentation
CN108921054A (en) * 2018-06-15 2018-11-30 华中科技大学 A kind of more attribute recognition approaches of pedestrian based on semantic segmentation
CN108965982A (en) * 2018-08-28 2018-12-07 百度在线网络技术(北京)有限公司 Video recording method, device, electronic equipment and readable storage medium storing program for executing
US10880495B2 (en) 2018-08-28 2020-12-29 Baidu Online Network Technology (Beijing) Co., Ltd. Video recording method and apparatus, electronic device and readable storage medium
CN109210684A (en) * 2018-09-18 2019-01-15 珠海格力电器股份有限公司 Control the method, apparatus and air-conditioning device of air-conditioning
CN109460820A (en) * 2018-09-28 2019-03-12 深圳百诺国际生命科技有限公司 A kind of neural network training method, device, computer equipment and storage medium
CN113163133A (en) * 2018-10-15 2021-07-23 华为技术有限公司 Image processing method, device and equipment
CN110263124A (en) * 2018-11-27 2019-09-20 上海亿通国际股份有限公司 Data detection system
CN109583580B (en) * 2018-11-30 2021-08-03 上海寒武纪信息科技有限公司 Operation method, device and related product
CN109543836A (en) * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 Operation method, device and Related product
CN109543835A (en) * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 Operation method, device and Related product
CN109543833A (en) * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 Operation method, device and Related product
CN109583580A (en) * 2018-11-30 2019-04-05 上海寒武纪信息科技有限公司 Operation method, device and Related product
CN109543835B (en) * 2018-11-30 2021-06-25 上海寒武纪信息科技有限公司 Operation method, device and related product
CN109670594A (en) * 2018-12-28 2019-04-23 北京旷视科技有限公司 Data training method, device and electronic equipment
CN111489357A (en) * 2019-01-29 2020-08-04 广州市百果园信息技术有限公司 Image segmentation method, device, equipment and storage medium
CN109829501A (en) * 2019-02-01 2019-05-31 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN111669492A (en) * 2019-03-06 2020-09-15 青岛海信移动通信技术股份有限公司 Method for processing shot digital image by terminal and terminal
US11783599B2 (en) 2019-03-08 2023-10-10 Arcsoft Corporation Limited Distracted-driving monitoring method, system and electronic device
CN111661059A (en) * 2019-03-08 2020-09-15 虹软科技股份有限公司 Method and system for monitoring distracted driving and electronic equipment
CN110135456A (en) * 2019-04-08 2019-08-16 图麟信息科技(上海)有限公司 A kind of training method and device of target detection model
CN110008962B (en) * 2019-04-11 2022-08-12 福州大学 Weak supervision semantic segmentation method based on attention mechanism
CN110008962A (en) * 2019-04-11 2019-07-12 福州大学 Weakly supervised semantic segmentation method based on attention mechanism
CN110188650B (en) * 2019-05-24 2021-01-19 张碧辉 Two-stage view field target detection method, device and system based on PTZ camera
CN110188650A (en) * 2019-05-24 2019-08-30 张碧辉 Twin-stage visual field object detection method, apparatus and system based on ptz camera
CN110245662B (en) * 2019-06-18 2021-08-10 腾讯科技(深圳)有限公司 Detection model training method and device, computer equipment and storage medium
CN110245662A (en) * 2019-06-18 2019-09-17 腾讯科技(深圳)有限公司 Detection model training method, device, computer equipment and storage medium
US11842487B2 (en) 2019-06-18 2023-12-12 Tencent Technology (Shenzhen) Company Limited Detection model training method and apparatus, computer device and storage medium
WO2020253629A1 (en) * 2019-06-18 2020-12-24 腾讯科技(深圳)有限公司 Detection model training method and apparatus, computer device, and storage medium
CN110349147B (en) * 2019-07-11 2024-02-02 腾讯医疗健康(深圳)有限公司 Model training method, fundus macular region lesion recognition method, device and equipment
CN110349147A (en) * 2019-07-11 2019-10-18 腾讯医疗健康(深圳)有限公司 Training method, the lesion recognition methods of fundus flavimaculatus area, device and the equipment of model
WO2021008328A1 (en) * 2019-07-12 2021-01-21 腾讯科技(深圳)有限公司 Image processing method and device, terminal, and storage medium
US11914677B2 (en) 2019-07-12 2024-02-27 Tencent Technology (Shenzhen) Company Limited Image processing method and apparatus, terminal, and storage medium
CN110363138A (en) * 2019-07-12 2019-10-22 腾讯科技(深圳)有限公司 Model training method, image processing method, device, terminal and storage medium
WO2021051601A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Method and system for selecting detection box using mask r-cnn, and electronic device and storage medium
CN110738125A (en) * 2019-09-19 2020-01-31 平安科技(深圳)有限公司 Method, device and storage medium for selecting detection frame by using Mask R-CNN
CN110738125B (en) * 2019-09-19 2023-08-01 平安科技(深圳)有限公司 Method, device and storage medium for selecting detection frame by Mask R-CNN
CN110781980A (en) * 2019-11-08 2020-02-11 北京金山云网络技术有限公司 Training method of target detection model, target detection method and device
CN110781980B (en) * 2019-11-08 2022-04-12 北京金山云网络技术有限公司 Training method of target detection model, target detection method and device
CN111160301A (en) * 2019-12-31 2020-05-15 同济大学 Tunnel disease target intelligent identification and extraction method based on machine vision
CN111160301B (en) * 2019-12-31 2023-04-18 同济大学 Tunnel disease target intelligent identification and extraction method based on machine vision
CN111401376A (en) * 2020-03-12 2020-07-10 腾讯科技(深圳)有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN111401376B (en) * 2020-03-12 2023-06-30 腾讯科技(深圳)有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN111401253A (en) * 2020-03-17 2020-07-10 吉林建筑大学 Target detection method based on deep learning
CN111401253B (en) * 2020-03-17 2022-09-13 吉林建筑大学 Target detection method based on deep learning
CN111539291A (en) * 2020-04-16 2020-08-14 创新奇智(合肥)科技有限公司 Target detection method and device based on radar waves, electronic equipment and storage medium
CN111523596A (en) * 2020-04-23 2020-08-11 北京百度网讯科技有限公司 Target recognition model training method, device, equipment and storage medium
CN111523596B (en) * 2020-04-23 2023-07-04 北京百度网讯科技有限公司 Target recognition model training method, device, equipment and storage medium
CN111428726B (en) * 2020-06-10 2020-09-11 中山大学 Panorama segmentation method, system, equipment and storage medium based on graph neural network
CN111428726A (en) * 2020-06-10 2020-07-17 中山大学 Panorama segmentation method, system, equipment and storage medium based on graph neural network
CN112070099A (en) * 2020-09-08 2020-12-11 江西财经大学 Image processing method based on machine learning
CN112446378A (en) * 2020-11-30 2021-03-05 展讯通信(上海)有限公司 Target detection method and device, storage medium and terminal
CN112580717A (en) * 2020-12-17 2021-03-30 百度在线网络技术(北京)有限公司 Model training method, positioning element searching method and device
CN113052217A (en) * 2021-03-15 2021-06-29 上海云从汇临人工智能科技有限公司 Prediction result identification and model training method and device thereof, and computer storage medium
CN117787511A (en) * 2024-02-28 2024-03-29 福州工小四物联科技有限公司 Industrial high-density aquaculture monitoring and early warning method and system thereof

Also Published As

Publication number Publication date
CN106529565B (en) 2019-09-13

Similar Documents

Publication Publication Date Title
CN106529565A (en) Target identification model training and target identification method and device, and computing equipment
CN106530305B (en) Semantic segmentation model training and image partition method and device calculate equipment
CN112052787B (en) Target detection method and device based on artificial intelligence and electronic equipment
CN106097353B (en) Method for segmenting objects and device, computing device based on the fusion of multi-level regional area
CN109977943A (en) A kind of images steganalysis method, system and storage medium based on YOLO
CN111008640B (en) Image recognition model training and image recognition method, device, terminal and medium
CN112434721A (en) Image classification method, system, storage medium and terminal based on small sample learning
CN109376615A (en) For promoting the method, apparatus and storage medium of deep learning neural network forecast performance
CN107704871A (en) Generate the method and system of the assemblage characteristic of machine learning sample
CN110414559A (en) The construction method and commodity recognition method of intelligence retail cabinet commodity target detection Unified frame
CN112153002B (en) Alarm information analysis method, device, computer equipment and storage medium
CN114330692A (en) Method, device and equipment for deploying neural network model and storage medium
CN109598307A (en) Data screening method, apparatus, server and storage medium
CN107801090A (en) Utilize the method, apparatus and computing device of audio-frequency information detection anomalous video file
CN112016617B (en) Fine granularity classification method, apparatus and computer readable storage medium
CN109671055A (en) Pulmonary nodule detection method and device
CN106503047A (en) A kind of image reptile optimization method based on convolutional neural networks
CN114707641A (en) Training method, device, equipment and medium for neural network model of double-view diagram
CN107832852B (en) Data processing learning method and system and electronic equipment
CN113052217A (en) Prediction result identification and model training method and device thereof, and computer storage medium
CN109978211A (en) Prediction technique and device of the flight into departure from port rate
CN111353577B (en) Multi-task-based cascade combination model optimization method and device and terminal equipment
CN109934352B (en) Automatic evolution method of intelligent model
CN111027669A (en) Method and device for realizing deep neural network on field programmable gate array
CN114358186A (en) Data processing method and device and computer readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant