CN108197668A

CN108197668A - Method and cloud system for establishing a model data set

Info

Publication number: CN108197668A
Application number: CN201810096270.5A
Authority: CN
Inventors: 梁昊; 南冰; 南一冰; 廉士国
Original assignee: As Science And Technology (beijing) Co Ltd
Current assignee: As Science And Technology (beijing) Co Ltd; Cloudminds Beijing Technologies Co Ltd
Priority date: 2018-01-31
Filing date: 2018-01-31
Publication date: 2018-06-22

Abstract

This application provides a method and a cloud system for establishing a model data set. The method includes: clustering the data in a data set according to a selected data feature, and labeling the data in the data set with classes according to the clustering result; training an initialized classification model on the labeled data set to obtain a trained classification model; and testing the trained classification model and establishing the model data set according to the test result. The resulting model data set can be used for classification and recognition without the labor and time cost of manual annotation and its verification, so the model data set is labeled automatically while the efficiency and accuracy of classification and recognition are improved.

Description

Method and cloud system for establishing a model data set

Technical field

This application relates to the field of deep learning, and in particular to a method and a cloud system for establishing a model data set.

Background

In recent years, classification techniques based on deep learning have achieved significant breakthroughs in classification performance, and higher classification accuracy, compared with traditional classification techniques. As deep learning networks such as ResNet and DenseNet continue to be proposed, deep-learning-based classification is becoming the mainstream approach in classification applications.

Deep-learning-based classification relies mainly on a very large training set: the model parameters are trained repeatedly through forward propagation and backpropagation to obtain a trained classification model that achieves the desired classification performance. That performance depends on how representative the classes in the training set are and on how accurate their labels are. To ensure label accuracy, training set labels are currently assigned by manual annotation, in which a person determines the class of each sample. For more complex classification tasks, however, the training set typically contains hundreds of thousands, millions, or even tens of millions of samples, so manual annotation consumes considerable labor and time. For example, in the ImageNet image classification challenge, the training set labels were annotated manually through the MTurk crowdsourcing platform.

The shortcoming of the prior art is that manual annotation is to some degree subjective. To ensure the objectivity and accuracy of the annotations, the annotation process usually has to be supervised, or the annotation results screened, which raises the cost of manual annotation further. As a result, classification models are usually trained on fixed training sets, and classification and recognition are limited to the classes those training sets contain. If a training set has to be built for specific requirements in order to recognize particular classes, the labor and time spent on manual annotation and its verification are high. The dependence on manual annotation therefore limits the widespread adoption of deep-learning-based classification in practical applications.

Summary of the invention

In view of this, the embodiments of the present application aim to provide a method and a cloud system for establishing a model data set, in order to solve the technical problem that existing deep-learning-based classification techniques rely excessively on manual annotation, which makes the labor and time cost of manual annotation and its verification high.

In one aspect, an embodiment of the present application provides a method for establishing a model data set, including:

clustering the data in a data set according to a selected data feature, and labeling the data in the data set with classes according to the clustering result;

training an initialized classification model on the labeled data set to obtain a trained classification model;

testing the trained classification model, and establishing the model data set according to the test result.

In another aspect, an embodiment of the present application provides a cloud system for establishing a model data set, including:

a clustering server, configured to cluster the data in a data set according to a selected data feature, and to label the data in the data set with classes according to the clustering result;

a training server, configured to train an initialized classification model on the labeled data set to obtain a trained classification model;

a test server, configured to test the trained classification model and to establish the model data set according to the test result.

In another aspect, an embodiment of the present application provides an electronic device, which includes:

a transceiver, a memory, and one or more processors; and

one or more modules, which are stored in the memory and configured to be executed by the one or more processors, the one or more modules including instructions for performing each step of the above method.

In another aspect, an embodiment of the present application provides a computer program product for use in combination with an electronic device. The computer program product includes a computer-readable storage medium and a computer program mechanism embedded in it, and the computer program mechanism includes instructions for performing each step of the above method.

To achieve the above objectives, the technical solution of the embodiments of the present application is implemented as follows:

In these embodiments, the data in a data set are clustered according to a selected data feature, and the data in the data set are labeled with classes according to the clustering result. An initialized classification model is trained on the labeled data set to obtain a trained classification model, and the trained classification model is tested to determine the final model data set used for classification and recognition. This removes the labor and time cost of manual annotation and its verification, labels the model data set automatically, and improves the efficiency and accuracy of classification and recognition.

Description of the drawings

Specific embodiments of the application are described below with reference to the accompanying drawings, in which:

Fig. 1 is a schematic diagram of the method for establishing a model data set in Embodiment 1 of the present application;

Fig. 2 is a flowchart of establishing a model data set in Embodiment 1 of the present application;

Fig. 3 is an architecture diagram of the cloud system for establishing a model data set in Embodiment 2 of the present application;

Fig. 4 is a schematic structural diagram of the electronic device in Embodiment 3 of the present application.

Detailed description

The essence of the technical solutions of the embodiments is further explained below through specific examples.

To make the technical solutions and advantages of the present application clearer, exemplary embodiments of the application are described in more detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the application, not an exhaustive list of all embodiments. Where no conflict arises, the embodiments in this description and the features in those embodiments may be combined with one another.

In the course of the invention, the inventors noticed the following:

Building a training set by manual annotation usually requires supervising the annotation process or screening the annotation results, which raises the cost of manual annotation. When a training set has to be built for specific requirements in order to recognize particular classes, the labor and time spent on manual annotation and its verification are high. Deep-learning-based classification techniques are thus heavily dependent on manual annotation.

In view of this shortcoming, the embodiments of the present application propose to establish the data set automatically by extracting features from the data in the data set and clustering them, to train an initialized classification model on the training-set portion of the data set, and to test the classification accuracy of the trained classification model on the test-set portion of the data set, so as to ensure the objectivity of the data classes in the deep-learning-based model data set.

To facilitate implementation of the present application, examples are described below.

Embodiment 1

Fig. 1 shows a schematic diagram of the method for establishing a model data set in Embodiment 1 of the present application. As shown in Fig. 1, the method includes:

Step 101: cluster the data in a data set according to a selected data feature, and label the data in the data set with classes according to the clustering result.

Step 102: train an initialized classification model on the labeled data set to obtain a trained classification model.

Step 103: test the trained classification model, and establish the model data set according to the test result.

In implementation, the above steps may be executed by a cloud server. The cloud server extracts features from the data in the data set according to a feature in a preset feature library, clusters the extracted data features with a clustering algorithm, automatically labels the data corresponding to each data feature with a class according to the clustering result, trains a deep-learning-based classification model on the labeled data, and tests the trained classification model. If the test result satisfies the judgment condition, the classification of the data set is considered successful, and the labeled data set is used directly as the model data set for deep-learning-based classification models, so that data can be classified accurately. If the test result does not satisfy the judgment condition, the classification of the data set is considered to have failed; a new feature is obtained from the preset feature library and the whole process is repeated until the test result satisfies the judgment condition and the model data set is established.

In implementation, the method can be applied to the automatic establishment of image data sets and also, as actual conditions require, to other types of data sets, for example text data sets. This embodiment does not specifically limit the type of data in the model data set.

In this embodiment, clustering the data in the data set according to the selected data feature includes:

selecting, from a preset feature set, a data feature to serve as the clustering basis;

extracting the selected data feature from the data in the data set;

clustering the extracted data features.

In this embodiment, the data features in the preset feature set include one or more hand-crafted features that characterize image color, edges, and texture, as well as the output features of each layer of a classification model.

In implementation, the feature set is established as follows: hand-crafted features that characterize image color, edges, texture and the like, such as the color histogram, HOG, and Haar features, are added to a feature library together with the output features of each layer of deep-learning-based classification models such as VGG16 and ResNet. The feature library is denoted {f_1, f_2, ..., f_k}, where k is the number of data features in the library.
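One way to picture such a feature library is as a mapping from feature names to extraction functions. The sketch below is illustrative only and is not taken from the application: the names `color_histogram`, `horizontal_edge_energy`, and `FEATURE_LIBRARY` are hypothetical, images are assumed to be lists of rows of (r, g, b) tuples, and the two simple hand-crafted features stand in for the color histogram, HOG, Haar, or network-layer features named above.

```python
from typing import Callable, Dict, List, Tuple

Pixel = Tuple[int, int, int]          # assumed 8-bit (r, g, b)
Image = List[List[Pixel]]             # assumed row-major list of rows
Feature = List[float]


def color_histogram(image: Image, bins: int = 4) -> Feature:
    """Normalized per-channel histogram with `bins` buckets per channel."""
    hist = [0.0] * (3 * bins)
    count = 0
    for row in image:
        for pixel in row:
            for channel, value in enumerate(pixel):
                bucket = min(value * bins // 256, bins - 1)
                hist[channel * bins + bucket] += 1
            count += 1
    return [h / count for h in hist] if count else hist


def horizontal_edge_energy(image: Image) -> Feature:
    """Mean absolute brightness difference between horizontal neighbors."""
    total, pairs = 0.0, 0
    for row in image:
        gray = [sum(p) / 3 for p in row]
        for left, right in zip(gray, gray[1:]):
            total += abs(left - right)
            pairs += 1
    return [total / pairs if pairs else 0.0]


# Hypothetical feature library {f_1, ..., f_k}; here k = 2.
FEATURE_LIBRARY: Dict[str, Callable[[Image], Feature]] = {
    "color_histogram": color_histogram,
    "horizontal_edge_energy": horizontal_edge_energy,
}
```

Keeping every feature behind the same call signature means a later step can pick any entry at random without knowing whether it is hand-crafted or a network-layer output.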

In implementation, selecting the data feature that serves as the clustering basis from the preset feature set, and extracting that data feature from the data in the data set, are carried out as follows:

1) Randomly select the clustering basis: a data feature f_i is randomly selected from the feature library as the clustering basis for class labeling, and the selected feature f_i is deleted from the feature library, which is then denoted {f_1, f_2, ..., f_{i-1}, f_{i+1}, ..., f_k}.

2) Extract the data features of the data set: according to the randomly selected clustering basis, the feature f_i is extracted from each item of data in the data set. If the randomly selected feature f_i is a hand-crafted feature such as the histogram of oriented gradients (HOG), it is extracted directly with that feature's extraction method. If f_i is the output feature of a certain layer of a classification model, the data in the data set are fed as input into the deep-learning-based classification model, and the output of the corresponding layer is extracted as the feature of that data.
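The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the feature library is assumed to be a dictionary of extraction functions, and `pick_clustering_feature` and `extract_features` are hypothetical helper names.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Feature = List[float]
Extractor = Callable[[object], Feature]


def pick_clustering_feature(library: Dict[str, Extractor],
                            rng: random.Random) -> Tuple[str, Extractor]:
    """Randomly choose f_i and delete it from the library so that a later
    retry cannot select the same feature again."""
    if not library:
        raise ValueError("feature library is exhausted")
    name = rng.choice(sorted(library))   # sorted() makes the draw reproducible
    return name, library.pop(name)


def extract_features(dataset: Sequence[object],
                     extractor: Extractor) -> List[Feature]:
    """Apply the chosen extractor to every item of the data set."""
    return [extractor(item) for item in dataset]
```

For a network-layer feature, the extractor would wrap a forward pass that returns the chosen layer's output; that part is omitted here because it depends on the deep learning framework in use.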

In implementation, clustering the extracted data features and labeling the data in the data set with classes according to the clustering result are carried out as follows:

1) Cluster the data features: the extracted data features are clustered with the K-Means clustering algorithm. The number of cluster centers can be set as actually needed; it is set to m = 10 here, and this embodiment does not specifically limit it.

2) Label the classes: each item of data x in the data set is automatically labeled according to the clustering result. If the feature f corresponding to data x is assigned to the n-th cluster, data x is labeled as class n.
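The clustering and labeling steps can be sketched with a small, dependency-free version of K-Means (Lloyd's algorithm). The function name `kmeans_labels`, the fixed iteration count, and the random initialization are assumptions made for illustration; a production system would more likely call an existing clustering library.

```python
import random
from typing import List, Sequence


def kmeans_labels(features: Sequence[Sequence[float]], m: int,
                  iterations: int = 20, seed: int = 0) -> List[int]:
    """Return, for every feature vector, the index n of its cluster (0..m-1).
    That index is used directly as the automatic class label of the data."""
    if not 0 < m <= len(features):
        raise ValueError("m must be between 1 and the number of samples")
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(list(features), m)]
    labels = [0] * len(features)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest center.
        for idx, vec in enumerate(features):
            labels[idx] = min(
                range(m),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(vec, centers[c])),
            )
        # Update step: each center moves to the mean of its members.
        for c in range(m):
            members = [features[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

With m = 10, as in the text, `kmeans_labels(features, 10)` would yield one label in 0..9 per item; the cluster indices carry no semantic meaning beyond grouping similar items.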

In this embodiment, the labeled data set includes a training set, and training the initialized classification model on the labeled data set means training the initialized classification model on that training set.

In implementation, the automatically labeled data set is divided into a training set and a test set. For example, 90% of the data in the data set are randomly selected as the training set and the remaining 10% serve as the test set. Using the automatically assigned class labels from the preceding step, the initialized deep-learning-based classification model is trained on the training set to obtain a trained classification model. The proportions of the training set and the test set can be set according to actual conditions; this embodiment does not specifically limit them.
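The random 90/10 division can be sketched as below. The helper name `split_dataset` and its `train_ratio` and `seed` parameters are hypothetical; the 0.9 default simply mirrors the example ratio in the text.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(samples: Sequence[T], labels: Sequence[int],
                  train_ratio: float = 0.9, seed: int = 0
                  ) -> Tuple[List[Tuple[T, int]], List[Tuple[T, int]]]:
    """Shuffle (sample, automatic label) pairs and cut them into a
    training set and a test set."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    pairs = list(zip(samples, labels))
    random.Random(seed).shuffle(pairs)
    cut = round(len(pairs) * train_ratio)
    return pairs[:cut], pairs[cut:]
```

Shuffling the pairs together keeps each sample attached to its automatic label, which matters because the test step later compares predictions against those same labels.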

In this embodiment, the labeled data set includes a test set, and testing the trained classification model and establishing the model data set according to the test result includes:

testing the trained classification model on the test set to obtain the classification accuracy of the trained classification model;

establishing the model data set according to the classification accuracy.

In this embodiment, testing the trained classification model on the test set to obtain its classification accuracy includes:

classifying the data in the test set with the trained classification model to obtain classification results for the data;

comparing the classification results with the class labels of the data in the test set to obtain the classification accuracy of the trained classification model.

In implementation, the classification model is tested as follows: the trained classification model classifies the data in the test set, and each test classification result is compared with the automatically assigned class label of the same data. If the test classification result of data x is the same as its automatically assigned label, data x is considered correctly classified; otherwise, data x is considered misclassified.

Further, the classification accuracy b of the trained classification model over the entire test set is calculated from the test classification results and the automatically assigned labels of all data in the test set. The classification accuracy can be calculated as the ratio of the number of correctly classified data in the test set to the total number of data in the test set. The calculation method can also be defined differently according to actual conditions, and this embodiment does not specifically limit it.
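The ratio described above, b = (number of correctly classified test items) / (total number of test items), can be written as a short function. The name `classification_accuracy` is hypothetical, and the example assumes class labels are integers.

```python
from typing import Sequence


def classification_accuracy(predicted: Sequence[int],
                            auto_labels: Sequence[int]) -> float:
    """Fraction of test items whose predicted class equals the
    automatically assigned class label."""
    if len(predicted) != len(auto_labels):
        raise ValueError("predictions and labels must have the same length")
    if not predicted:
        raise ValueError("test set is empty")
    correct = sum(1 for p, y in zip(predicted, auto_labels) if p == y)
    return correct / len(predicted)
```

Note that this accuracy is measured against the automatic cluster labels, not against human-verified ground truth, so it indicates how learnable the clustering is rather than how semantically correct it is.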

In this embodiment, establishing the model data set according to the classification accuracy includes:

if the classification accuracy is greater than a set value, generating the model data set according to the class labels of the data in the test set;

if the classification accuracy is less than or equal to the set value, selecting again a data feature to serve as the clustering basis.

In implementation, the model data set is established according to the classification accuracy as follows: the calculated classification accuracy b is compared with a preset threshold a. If b > a, the model data set is generated according to the automatically assigned class labels. Otherwise, a data feature is selected again as the clustering basis for class labeling from the feature library {f_1, f_2, ..., f_{i-1}, f_{i+1}, ..., f_k}, from which f_i has been deleted, and the whole process is repeated until the test result satisfies b > a and the model data set is generated.
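The accept-or-retry logic can be sketched as a loop over the shrinking feature library. Everything named here is hypothetical: `label_with` stands for "extract the feature, cluster, and label", `evaluate` stands for "split, train, test, and return b", and returning `None` when the library runs out is an assumption, since the text does not say what happens in that case.

```python
import random
from typing import Callable, List, Optional, Tuple

Labels = List[int]


def build_model_dataset(feature_names: List[str],
                        label_with: Callable[[str], Labels],
                        evaluate: Callable[[Labels], float],
                        threshold_a: float,
                        rng: random.Random) -> Optional[Tuple[str, Labels]]:
    """Try randomly chosen features until accuracy b exceeds threshold a."""
    remaining = list(feature_names)            # working copy of the library
    while remaining:
        # Random selection; pop() also deletes f_i from the library.
        name = remaining.pop(rng.randrange(len(remaining)))
        labels = label_with(name)              # cluster + automatic labels
        b = evaluate(labels)                   # train, test, accuracy b
        if b > threshold_a:                    # strict comparison, as in b > a
            return name, labels
    return None                                # assumed outcome: library exhausted
```

Because each rejected feature is removed before the next draw, the loop runs at most k times for a library of k features.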

Taking the automatic establishment of an image data set as the application scenario, Fig. 2 shows a flowchart of establishing a model data set in Embodiment 1 of the present application. Embodiment 1 is described in detail below with reference to Fig. 2.

The scope of application of the embodiments includes, but is not limited to, the automatic establishment of image data sets. Taking the automatic establishment of an image data set as an example, the specific flow is as follows:

Step 201: establish the image feature library. Hand-crafted features and the output features of each layer of a classification model are added together to the image feature library, which is denoted {f_1, f_2, ..., f_k}, where k is the number of image data features in the library.

Step 202: randomly select the clustering basis and extract the features of the image data in the image data set. Specifically:

1) Randomly select an image data feature: an image data feature f_i is randomly selected from the image feature library as the clustering basis for class labeling, and the selected feature f_i is deleted from the image feature library, which is then denoted {f_1, f_2, ..., f_{i-1}, f_{i+1}, ..., f_k}.

2) Extract the image data features of the image data set: according to the randomly selected clustering basis, the feature f_i is extracted from each item of image data in the image data set.

Step 203: cluster the extracted image data features, and label the image data with classes according to the clustering result. Specifically:

1) Cluster the features: the extracted image data features are clustered with the K-Means clustering algorithm.

2) Label the classes: the image data are automatically labeled according to the clustering result. If the image data feature f corresponding to image data x is assigned to the n-th cluster, image data x is labeled as class n.

Step 204: train the image classification model. The automatically labeled image data set is divided into a training set and a test set. Using the automatically assigned class labels from the preceding step, the initialized image classification model is trained on the training set to obtain a trained image classification model.

Step 205: test the trained image classification model to obtain its classification accuracy, and determine the final model data set according to the classification accuracy. Specifically:

1) Test the image classification model: the trained image classification model classifies the image data in the test set, and each test classification result is compared with the automatically assigned class label. If the test classification result of image data x is the same as its automatically assigned label, image data x is considered correctly classified; otherwise, image data x is considered misclassified. From these comparisons, the classification accuracy b of the image classification model over the entire test set is calculated.

2) Determine the final model data set from the classification accuracy of the image classification model: the calculated classification accuracy b is compared with a preset threshold a. If b > a, the model data set is generated according to the automatically assigned class labels. Otherwise, the flow returns to step 202, and an image data feature is selected again as the clustering basis for class labeling from the feature library {f_1, f_2, ..., f_{i-1}, f_{i+1}, ..., f_k}, from which f_i has been deleted.

The above are only preferred embodiments of the present application and are not intended to limit its scope of protection.

Embodiment 2

Based on the same inventive concept, an embodiment of the present application also provides a cloud system for establishing a model data set. Because the principle by which these devices solve the problem is similar to that of the method for establishing a model data set, their implementation may refer to the implementation of the method, and repeated details are omitted.

Fig. 3 shows an architecture diagram of the cloud system for establishing a model data set in Embodiment 2 of the present application. As shown in Fig. 3, the cloud system 300 for establishing a model data set may include:

a clustering server 301, configured to cluster the data in a data set according to a selected data feature, and to label the data in the data set with classes according to the clustering result;

a training server 302, configured to train an initialized classification model on the labeled data set to obtain a trained classification model;

a test server 303, configured to test the trained classification model and to establish the model data set according to the test result.
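The division of work among the three servers can be illustrated with three small classes wired together by one function. This is a sketch under assumptions: the class names, the injected `cluster` and `fit` callables, and the in-process calls are all hypothetical stand-ins, whereas a real cloud system would communicate over a network and the application does not specify that interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Model = Callable[[object], int]


@dataclass
class ClusteringServer:                       # stands in for server 301
    cluster: Callable[[Sequence[object]], List[int]]

    def label(self, dataset: Sequence[object]) -> List[int]:
        return self.cluster(dataset)


@dataclass
class TrainingServer:                         # stands in for server 302
    fit: Callable[[Sequence[object], Sequence[int]], Model]

    def train(self, samples: Sequence[object], labels: Sequence[int]) -> Model:
        return self.fit(samples, labels)


@dataclass
class ModelTestServer:                        # stands in for server 303
    threshold_a: float

    def accepts(self, model: Model, samples: Sequence[object],
                labels: Sequence[int]) -> bool:
        if not samples:
            raise ValueError("test set is empty")
        correct = sum(1 for s, y in zip(samples, labels) if model(s) == y)
        return correct / len(samples) > self.threshold_a


def run_cloud_system(dataset: Sequence[object], n_train: int,
                     clustering: ClusteringServer, training: TrainingServer,
                     testing: ModelTestServer
                     ) -> Optional[List[Tuple[object, int]]]:
    """Label, train, test; return the labeled data set only if accepted."""
    labels = clustering.label(dataset)
    model = training.train(dataset[:n_train], labels[:n_train])
    if testing.accepts(model, dataset[n_train:], labels[n_train:]):
        return list(zip(dataset, labels))
    return None
```

Injecting the clustering and training behavior as callables keeps the sketch independent of any particular clustering algorithm or deep learning framework.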

In this embodiment, the clustering server 301 is configured to:

cluster the extracted data features.

In this embodiment, the labeled data set includes a training set, and the training server 302 is configured to train the initialized classification model on the training set.

In this embodiment, the labeled data set includes a test set, and the test server 303 is configured to:

establish the model data set according to the classification accuracy.

Embodiment 3

Based on the same inventive concept, an embodiment of the present application also provides an electronic device. Because its principle is similar to that of the method for establishing a model data set, its implementation may refer to the implementation of the method, and repeated details are omitted.

Fig. 4 shows a schematic structural diagram of the electronic device in Embodiment 3 of the present application. As shown in Fig. 4, the electronic device includes: a transceiver 401, a memory 402, and one or more processors 403; and one or more modules, which are stored in the memory and configured to be executed by the one or more processors, the one or more modules including instructions for performing each step of any of the above methods.

Embodiment 4

Based on the same inventive concept, an embodiment of the present application also provides a computer program product for use in combination with an electronic device. Because its principle is similar to that of the method for establishing a model data set, its implementation may refer to the implementation of the method, and repeated details are omitted. The computer program product includes a computer-readable storage medium and a computer program mechanism embedded in it, and the computer program mechanism includes instructions for performing each step of any of the above methods.

For convenience of description, the parts of the apparatus described above are divided by function into various modules and described separately. When the present application is implemented, the functions of the modules or units may of course be realized in one or more pieces of software or hardware.

Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

The application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to its embodiments. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they grasp the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the application.

Claims

1. A method for establishing a model data set, characterized by including:

clustering the data in a data set according to a selected data feature, and labeling the data in the data set with classes according to the clustering result;

2. The method of claim 1, characterized in that clustering the data in the data set according to the selected data feature includes:

clustering the extracted data features.

3. The method of claim 2, characterized in that the data features in the preset feature set include one or more hand-crafted features that characterize image color, edges, and texture, as well as the output features of each layer of a classification model.

4. The method of claim 1, characterized in that the labeled data set includes a training set, and training the initialized classification model on the labeled data set is training the initialized classification model on the training set.

5. The method of claim 1 or 4, characterized in that the labeled data set includes a test set, and testing the trained classification model and establishing the model data set according to the test result includes:

testing the trained classification model on the test set to obtain the classification accuracy of the trained classification model;

establishing the model data set according to the classification accuracy.

6. The method of claim 5, characterized in that testing the trained classification model on the test set to obtain the classification accuracy of the trained classification model includes:

comparing the classification results with the class labels of the data in the test set to obtain the classification accuracy of the trained classification model.

7. The method of claim 5, characterized in that establishing the model data set according to the classification accuracy includes:

if the classification accuracy is greater than a set value, generating the model data set according to the class labels of the data in the test set;

8. A cloud system for establishing a model data set, characterized by including:

a clustering server, configured to cluster the data in a data set according to a selected data feature, and to label the data in the data set with classes according to the clustering result;

a training server, configured to train an initialized classification model on the labeled data set to obtain a trained classification model;

a test server, configured to test the trained classification model and to establish the model data set according to the test result.

9. An electronic device, characterized in that the electronic device includes:

a transceiver, a memory, and one or more processors; and

one or more modules, which are stored in the memory and configured to be executed by the one or more processors, the one or more modules including instructions for performing each step of the method of any one of claims 1-7.

10. A computer program product for use in combination with an electronic device, the computer program product including a computer-readable storage medium and a computer program mechanism embedded in it, the computer program mechanism including instructions for performing each step of the method of any one of claims 1-7.