CN109657087A - A kind of batch data mask method, device and computer readable storage medium - Google Patents
- Publication number
- CN109657087A (application CN201811456459.7A, filed as CN201811456459A)
- Authority
- CN
- China
- Prior art keywords
- data
- image
- cluster
- neural network
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the field of artificial intelligence and provides a batch data labeling method, device, and storage medium. The method includes: performing dimensionality reduction on a data set containing multiple images to obtain a data set composed of low-dimensional vectors; clustering the low-dimensional vectors of the data set to divide the images into different categories; displaying the clustered data through a visualization tool, selecting data of the same category, and applying a unified batch label to that category. By clustering, the data in the data set are divided into different categories, so that data of the same category can be labeled in batches, reducing the labeling workload. The use of unsupervised clustering makes the method widely applicable. Furthermore, by applying neural network recognition after clustering, the features of the images within a category are further identified, so that the common characteristics of the data in that category can be determined and a unified batch label can be applied to the category according to the recognition result.
Description
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a batch data labeling method, device, and computer readable storage medium.
Background technique
With the rapid development of multimedia and Internet information technology, hundreds of millions of new images appear on the Internet every day. Compared with text, images can describe information more intuitively and more accurately; in today's era of information explosion, images therefore allow users to obtain the information they need more conveniently, faster, and more accurately, and image information has become one of the most important channels for disseminating information. In intelligent recognition technology in particular, a large number of labeled pictures are needed as a training data set to train a model and improve its recognition capability. At present, however, image data are usually labeled by manually inspecting the data to distinguish categories and then annotating each picture one by one with a tool. The disadvantages of this approach are that data cannot be labeled in batches, so labeling efficiency is low when the amount of data is large, and that much of the labeling work requires professionals to perform the category annotation, which makes labeling costly.
Summary of the invention
To solve the above technical problems, the present invention provides a batch data labeling method applied to an electronic device: performing dimensionality reduction on a data set containing multiple images to obtain a data set composed of low-dimensional vectors; clustering the low-dimensional vectors of the data set to divide the images into different categories; displaying the clustered data through a visualization tool, selecting data of the same category, and applying a unified batch label to that category.
Preferably, high-dimensional data is converted into low-dimensional data by means of nonlinear dimensionality reduction.
Preferably, the nonlinear dimensionality reduction uses the following formulas.

The high-dimensional space is represented as:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)}$$

where $p_{j|i}$ denotes the conditional probability in the high-dimensional space; $x_i$ and $x_j$ denote points in the high-dimensional space; and $\sigma_i$ denotes the variance of the Gaussian distribution centered on $x_i$.

The low-dimensional space is represented as:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}$$

where $q_{ij}$ denotes the conditional probability in the low-dimensional space, and $y_i$ and $y_j$ denote the points of the high-dimensional space mapped into the low-dimensional space.

The cost function is:

$$C = KL(P \parallel Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

where the KL divergence denotes the error between P and Q at a point; P denotes the high-dimensional conditional probability distribution, Q denotes the low-dimensional conditional probability distribution, and $p_{ij} = (p_{j|i} + p_{i|j})/2n$ is the symmetrized high-dimensional probability over $n$ points.

The gradient is:

$$\frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}$$
Preferably, for categories whose image features cannot be determined, numbers are used for category labeling.
Preferably, after clustering, a neural network is also used to identify at least one image in a given category so as to speed up labeling, comprising the following steps: collecting a training data set, where the training data set contains a large number of labeled pictures serving as training data; training a neural network model with the training data to improve its recognition capability; after clustering is complete, identifying one image in each category with the neural network model to obtain the features of that image; and, according to the features of that image, uniformly labeling all images in the corresponding category.
Preferably, after clustering, a neural network is also used to identify at least two images in each category so as to speed up labeling, comprising the following steps: collecting a training data set, where the training data set contains a large number of labeled pictures serving as training data; training a neural network model with the training data to improve its recognition capability; after clustering is complete, identifying at least two images in each category with the neural network model and extracting their features; if the extracted features have no common characteristic, identifying the next image and continuing to search for a common characteristic among the features of the identified images until one is found; and then using that common characteristic as the reference name of the category and labeling the entire category.
Preferably, the color histogram of each image is used as its feature vector to form the data set.
The present invention also provides an electronic device comprising a memory and a processor. A batch data labeling program is stored in the memory, and when executed by the processor it implements the following steps: performing dimensionality reduction on a data set containing multiple images to obtain a data set composed of low-dimensional vectors; clustering the low-dimensional vectors of the data set to divide the images into different categories; displaying the clustered data through a visualization tool, selecting data of the same category, and applying a unified batch label to that category.
Preferably, after clustering, a neural network is also used to identify at least one image in a given category so as to speed up labeling, comprising the following steps: collecting a training data set, where the training data set contains a large number of labeled pictures serving as training data; training a neural network model with the training data to improve its recognition capability; after clustering is complete, identifying one image in each category with the neural network model to obtain the features of that image; and, according to the features of that image, uniformly labeling all images in the corresponding category.
The present invention also provides a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, implement the batch data labeling method described above.
By clustering, the present invention divides the data in the data set into different categories, so that data of the same category can be labeled in batches, reducing the labeling workload. For data whose features cannot be determined, numbered labels can be used directly, without requiring professionals to identify them. The use of unsupervised clustering makes the method widely applicable. Furthermore, by applying neural network recognition after clustering, the features of the images within a category are further identified, so that the common characteristics of the data in that category can be determined and a unified batch label can be applied to the category according to the recognition result.
Description of the drawings
The above features and technical advantages of the present invention will become clearer and easier to understand from the following description of embodiments in conjunction with the accompanying drawings.
Fig. 1 is a schematic flowchart of the batch data labeling method of an embodiment of the present invention;

Fig. 2 is a schematic flowchart of batch data labeling using neural network recognition after clustering according to one embodiment of the present invention;

Fig. 3 is a schematic flowchart of batch data labeling using neural network recognition after clustering according to another embodiment of the present invention;

Fig. 4 is a schematic diagram of the hardware architecture of the electronic device of an embodiment of the present invention;

Fig. 5 is a block diagram of the modules of the batch data labeling program of an embodiment of the present invention.
Detailed description of embodiments
Embodiments of the batch data labeling method, device, and computer readable storage medium of the present invention are described below with reference to the accompanying drawings. Those skilled in the art will recognize that the described embodiments can be modified in various ways, or combined, without departing from the spirit and scope of the present invention. Therefore, the drawings and description are illustrative in nature and are not intended to limit the scope of the claims. In addition, in the present specification, the drawings are not drawn to scale, and identical reference numerals indicate identical parts.
Fig. 1 is a schematic flowchart of the batch data labeling method provided by an embodiment of the present invention. The method includes the following steps:

Step S10: perform dimensionality reduction on a data set containing multiple images to obtain a data set composed of low-dimensional vectors. Here, the color histogram of each image can be used as its feature vector, so that for each image a low-dimensional representation vector of the high-dimensional data can be obtained. High-dimensional data reduced to two- or three-dimensional data can be used for clustering, and the clustering result can then be displayed.
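As an illustration of the feature-vector step, the sketch below computes a coarse RGB color histogram for an image represented as a list of pixels; the choice of 4 bins per channel and the `color_histogram` name are assumptions made for this example, not details specified by the patent.

```python
def color_histogram(pixels, bins=4):
    """Coarse RGB histogram: bins**3 counts, normalized to sum to 1.

    `pixels` is a list of (r, g, b) tuples with channel values in 0..255.
    The flattened, normalized histogram serves as the image's feature vector.
    """
    step = 256 // bins                      # width of one bin per channel
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1.0
    total = len(pixels) or 1
    return [h / total for h in hist]

# A tiny 2x2 "image": two red pixels, one green, one blue.
feat = color_histogram([(255, 0, 0), (250, 5, 3), (0, 255, 0), (0, 0, 255)])
```

In practice the resulting 64-dimensional vectors would then be passed to the dimensionality reduction step to obtain the 2-D or 3-D points used for clustering.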
Step S30: cluster the low-dimensional vectors of the data set, dividing the images into different categories. For example, if some images are of cars, some of mountains, some of cats, and some of elephants, a clustering algorithm groups images exhibiting the same features together: the car images are clustered together, the cat images are clustered together, and so on.
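The patent does not name a specific clustering algorithm; as a sketch, a few rounds of Lloyd's k-means over 2-D points (the kind of vectors the reduction step produces) could look like the following. The point values and the choice of two initial centers are invented for illustration.

```python
def kmeans(points, centers, iters=10):
    """Plain Lloyd's k-means on 2-D points; returns (centers, labels)."""
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [min(range(len(centers)),
                      key=lambda c: (p[0] - centers[c][0]) ** 2 +
                                    (p[1] - centers[c][1]) ** 2)
                  for p in points]
        # Recompute each center as the mean of its assigned points.
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, labels

# Two visually separated groups of low-dimensional image vectors.
pts = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
centers, labels = kmeans(pts, centers=[(0.0, 0.0), (1.0, 1.0)])
```

Each resulting cluster corresponds to one candidate category to be shown in the visualization tool.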
Step S50: display the clustered data through a visualization tool (for example, a display), select the data of each category, and apply a unified batch label to each category. For example, if a region in the visualization tool contains nothing but "cat" data, that whole region is selected and labeled "cat"; the label of that batch of data is then "cat", achieving the purpose of rapid batch labeling.
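The select-and-label step amounts to assigning one name to every item in a chosen cluster; a minimal sketch follows, in which the `batch_label` helper and the sample file names are invented for illustration and are not part of the patent.

```python
def batch_label(cluster_labels, annotations, cluster_id, name):
    """Assign `name` to every item whose cluster id equals `cluster_id`.

    `cluster_labels` maps item -> cluster id (the clustering result);
    `annotations` accumulates item -> human-readable label.
    """
    for item, cid in cluster_labels.items():
        if cid == cluster_id:
            annotations[item] = name
    return annotations

# The user selects cluster 0 in the visualization tool and names it "cat".
clusters = {"img_001.jpg": 0, "img_002.jpg": 0, "img_003.jpg": 1}
labels = batch_label(clusters, {}, cluster_id=0, name="cat")
```

One call labels the entire selected region at once, which is the source of the efficiency gain over per-image annotation.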
In an alternative embodiment, high-dimensional data is converted into low-dimensional data by means of nonlinear dimensionality reduction.
Further, the high-dimensional data is regarded as points in a high-dimensional space, which are then mapped into a low-dimensional space with a manifold method while preserving their spatial distances: points that are close together in the high-dimensional space remain close after mapping into the low-dimensional space, and points that are far apart remain far apart. The nonlinear dimensionality reduction uses the following formulas.
The high-dimensional space is represented as:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)}$$

where $p_{j|i}$ denotes the conditional probability in the high-dimensional space; $x_i$ and $x_j$ denote points in the high-dimensional space; and $\sigma_i$ denotes the variance of the Gaussian distribution centered on $x_i$.

The low-dimensional space is represented as:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}$$

where $q_{ij}$ denotes the conditional probability in the low-dimensional space, and $y_i$ and $y_j$ denote the points of the high-dimensional space mapped into the low-dimensional space.

The cost function is:

$$C = KL(P \parallel Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

where the KL divergence denotes the error between P and Q at a point; P denotes the high-dimensional conditional probability distribution, Q denotes the low-dimensional conditional probability distribution, and $p_{ij} = (p_{j|i} + p_{i|j})/2n$ is the symmetrized high-dimensional probability over $n$ points.

The gradient is:

$$\frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}$$
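As a small numerical check of the high-dimensional conditional probability, the sketch below computes $p_{j|i}$ for three one-dimensional points with a fixed $\sigma_i = 1$; the point values are invented for illustration.

```python
import math

def p_cond(points, i, sigma=1.0):
    """Conditional probabilities p_{j|i} of the high-dimensional space."""
    num = {j: math.exp(-(points[i] - points[j]) ** 2 / (2 * sigma ** 2))
           for j in range(len(points)) if j != i}
    denom = sum(num.values())
    return {j: v / denom for j, v in num.items()}

# Point 1 is nearer to point 0 than point 2 is, so p_{1|0} > p_{2|0}.
p = p_cond([0.0, 1.0, 3.0], i=0)
```

The probabilities sum to one for each reference point, and nearer neighbors receive higher probability, which is the property the mapping into the low-dimensional space tries to preserve.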
In an alternative embodiment, numbers are used for category labeling of categories that cannot be determined from image features. Medical pictures, for example, require professionals to identify the category; numbers such as "1, 2, 3..." or "a, b, c..." can be used for category labeling instead.
In an alternative embodiment, manually observing each category after clustering in order to label it would still take up personnel time. Therefore, after clustering, a neural network can also be used to further identify at least one image in a given category. Since the pictures have already been grouped by certain features with the clustering algorithm, the subsequent neural network recognition of the pictures can be sped up. For example, if a category contains pictures of cats of different appearances, then after the neural network has recognized at least one of the images (for example, 3 images), the category is automatically confirmed to be "cat", and all pictures of the category are automatically labeled "cat" with the annotation tool. Thus, using a neural network after clustering to further identify at least one image in each cluster allows the images to be labeled more quickly.
Specifically, after clustering, the neural network is used to identify at least one image in a given category so as to speed up labeling, as shown in Fig. 2, comprising the following steps:

Step S100: collect a training data set; the training data set contains a large number of labeled pictures serving as training data;

Step S200: train a neural network model with the training data to improve its recognition capability;

Step S300: after clustering is complete, identify one image in each category with the neural network model to obtain the features of that image;

Step S400: according to the features of that image, uniformly label all images in the corresponding category.
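Steps S300 and S400 amount to classifying one representative image per cluster and propagating the predicted label to the whole cluster. In the sketch below the `predict` stub stands in for the trained neural network model of step S200, which the patent does not specify in code form; the file names are invented.

```python
def predict(image):
    """Stand-in for the trained neural network model of step S200."""
    return {"cat1.jpg": "cat", "car7.jpg": "car"}.get(image, "unknown")

def label_clusters(clusters):
    """Steps S300/S400: classify one image per cluster, label the cluster."""
    annotations = {}
    for images in clusters.values():
        name = predict(images[0])          # identify one representative image
        for img in images:                 # propagate its label to the cluster
            annotations[img] = name
    return annotations

out = label_clusters({0: ["cat1.jpg", "cat2.jpg"], 1: ["car7.jpg", "car8.jpg"]})
```

Because only one image per cluster is run through the model, the recognition cost grows with the number of clusters rather than the number of images.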
In an alternative embodiment, after clustering, the neural network is also used to identify at least two images in each category so as to speed up labeling, as shown in Fig. 3, comprising the following steps:

Step S100: collect a training data set; the training data set contains a large number of labeled pictures serving as training data;

Step S200: train a neural network model with the training data to improve its recognition capability;

Step S500: after clustering is complete, identify at least two images in each category with the neural network model and extract their features; if the extracted features have no common characteristic, identify the next image and continue to search for a common characteristic among the features of the identified images until one is found;

Step S600: use the common characteristic as the reference name of the category and label the entire category.
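Step S500's "identify until a common characteristic is found" loop can be sketched as below; the per-image feature sets and the set-intersection criterion for "common characteristic" are assumptions made for this example, not details specified by the patent.

```python
def common_feature(images, recognize, min_images=2):
    """Recognize images one by one until all recognized feature sets share
    at least one feature; return that shared feature (steps S500/S600)."""
    seen = []
    for img in images:
        seen.append(recognize(img))
        if len(seen) < min_images:
            continue
        shared = set.intersection(*seen)   # features common to all so far
        if shared:
            return sorted(shared)[0]       # deterministic reference name
    return None

# Stand-in recognizer: feature sets the neural network might extract.
features = {"a.jpg": {"furry", "cat"}, "b.jpg": {"striped", "cat"},
            "c.jpg": {"cat", "whiskers"}}
name = common_feature(["a.jpg", "b.jpg", "c.jpg"], lambda i: features[i])
```

The returned feature ("cat" here) would then serve as the reference name under which the entire category is labeled in step S600.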
Refer to Fig. 4, which is a schematic diagram of the hardware architecture of an embodiment of the electronic device of the present invention. In the present embodiment, the electronic device 2 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. For example, it may be a smartphone, tablet computer, laptop, desktop computer, rack server, blade server, tower server, or cabinet server (including an independent server or a cluster composed of multiple servers). As shown in Fig. 4, the electronic device 2 includes at least, but is not limited to, a memory 21, a processor 22, and a network interface 23 that can be communicatively connected to each other through a system bus. The memory 21 includes at least one type of computer-readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random-access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic storage, magnetic disk, optical disc, and the like. In some embodiments, the memory 21 may be an internal storage unit of the electronic device 2, such as a hard disk or memory of the electronic device 2. In other embodiments, the memory 21 may also be an external storage device of the electronic device 2, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card equipped on the electronic device 2. Of course, the memory 21 may also include both the internal storage unit and the external storage device of the electronic device 2. In the present embodiment, the memory 21 is commonly used to store the operating system and various types of application software installed on the electronic device 2, such as the program code of the batch data labeling program. In addition, the memory 21 can also be used to temporarily store various types of data that have been output or are to be output.
The processor 22 may, in some embodiments, be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 22 is commonly used to control the overall operation of the electronic device 2, for example to perform control and processing related to data interaction or communication with the electronic device 2. In the present embodiment, the processor 22 is used to run the program code or process the data stored in the memory 21, for example to run the batch data labeling program.
The network interface 23 may include a wireless network interface or a wired network interface and is commonly used to establish a communication connection between the electronic device 2 and other electronic devices. For example, the network interface 23 is used to connect the electronic device 2 with a push platform through a network and to establish a data transmission channel and communication connection between the electronic device 2 and the push platform. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
Optionally, the electronic device 2 may also include a user interface, which may include an input unit such as a keyboard, a speech input device such as a microphone or other equipment with a speech recognition function, and a speech output device such as a speaker or earphones.

Optionally, the user interface may also include a standard wired interface and a wireless interface.
Optionally, the electronic device 2 may also include a display, which may also be called a display screen or display unit. In some embodiments it may be an LED display, a liquid crystal display, a touch liquid crystal display, an organic light-emitting diode (OLED) display, or the like. The display is used to show the information processed in the electronic device 2 and to show a visual user interface.

It should be pointed out that Fig. 4 only shows the electronic device 2 with components 21-23; it should be understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead.
The memory 21, which comprises a readable storage medium, may include an operating system, a batch data labeling program 50, and the like. The steps implemented when the processor 22 executes the batch data labeling program 50 in the memory 21 correspond one-to-one with the steps of the batch data labeling method described above; to avoid repetition, they are not described in detail here. Each module is briefly described below.

In the present embodiment, the batch data labeling program stored in the memory 21 can be divided into one or more program modules, which are stored in the memory 21 and can be executed by one or more processors (the processor 22 in this embodiment) to complete the present invention. For example, Fig. 5 shows a diagram of the modules of the batch data labeling program; in this embodiment, the batch data labeling program 50 can be divided into a dimensionality reduction module 501, a clustering module 502, a category selection module 503, and a batch labeling module 504. The program modules referred to in the present invention are series of computer program instruction segments capable of completing specific functions, and are more suitable than a program for describing the execution process of the batch data labeling program 50 in the electronic device 2. The specific functions of these program modules are introduced in the description below.
The dimensionality reduction module 501 is used to perform dimensionality reduction on a data set containing multiple images to obtain a data set composed of low-dimensional vectors. Here, the color histogram of each image can be used as its feature vector, so that for each image a low-dimensional representation vector of the high-dimensional data can be obtained. High-dimensional data reduced to two- or three-dimensional data can be used for clustering, and the clustering result can then be displayed.
The clustering module 502 is used to cluster the low-dimensional vectors of the data set, dividing the images into different categories. For example, if some images are of cars, some of mountains, some of cats, and some of elephants, the clustering algorithm groups images exhibiting the same features together: the car images are clustered together, the cat images are clustered together, and so on.
The category selection module 503 is used to select the data of each category after clustering, and the batch labeling module 504 applies a unified batch label to each category. For example, if a region in the visualization tool contains nothing but "cat" data, that whole region is selected and labeled "cat"; the label of that batch of data is then "cat", achieving the purpose of rapid batch labeling.
In an alternative embodiment, the dimensionality reduction module 501 converts high-dimensional data into low-dimensional data by means of nonlinear dimensionality reduction.

Further, the dimensionality reduction module 501 regards the high-dimensional data as points in a high-dimensional space and then maps them into a low-dimensional space with a manifold method while preserving their spatial distances: points that are close together in the high-dimensional space remain close after mapping into the low-dimensional space, and points that are far apart remain far apart. The nonlinear dimensionality reduction uses the following formulas.
The high-dimensional space is represented as:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)}$$

where $p_{j|i}$ denotes the conditional probability in the high-dimensional space; $x_i$ and $x_j$ denote points in the high-dimensional space; and $\sigma_i$ denotes the variance of the Gaussian distribution centered on $x_i$.

The low-dimensional space is represented as:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}$$

where $q_{ij}$ denotes the conditional probability in the low-dimensional space, and $y_i$ and $y_j$ denote the points of the high-dimensional space mapped into the low-dimensional space.

The cost function is:

$$C = KL(P \parallel Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

where the KL divergence denotes the error between P and Q at a point; P denotes the high-dimensional conditional probability distribution, Q denotes the low-dimensional conditional probability distribution, and $p_{ij} = (p_{j|i} + p_{i|j})/2n$ is the symmetrized high-dimensional probability over $n$ points.

The gradient is:

$$\frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}$$
In an alternative embodiment, for categories that cannot be determined from image features, the batch labeling module 504 uses numbers for category labeling. Medical pictures, for example, require professionals to identify the category; numbers such as "1, 2, 3..." or "a, b, c..." can be used for category labeling instead.
In an alternative embodiment, a feature extraction module 505 is also included, since manually observing each category after clustering in order to label it would still take up personnel time. Therefore, after clustering, a neural network can also be used to further identify at least one image in a given category. Since the pictures have already been grouped by certain features with the clustering algorithm, the subsequent neural network recognition of the pictures can be sped up. For example, if a category contains pictures of cats of different appearances, then after the neural network has recognized at least one of the images (for example, 3 images), the category is automatically confirmed to be "cat", and all pictures of the category are automatically labeled "cat" with the annotation tool. Thus, after clustering, the feature extraction module 505 uses the neural network to further identify at least one image in each cluster, allowing the images to be labeled more quickly.
Specifically, after clustering, the feature extraction module 505 also uses the neural network to identify at least one image in a given category so as to speed up labeling, comprising the following steps:

Step S100: collect a training data set; the training data set contains a large number of labeled pictures serving as training data;

Step S200: train a neural network model with the training data to improve its recognition capability;

Step S300: after clustering is complete, the feature extraction module 505 identifies one image in each category with the neural network model to obtain the features of that image;

Step S400: according to the features of that image, uniformly label all images in the corresponding category.
In an alternative embodiment, after clustering, the feature extraction module 505 also uses the neural network to identify at least two images in each category so as to speed up labeling, comprising the following steps:

Step S100: collect a training data set; the training data set contains a large number of labeled pictures serving as training data;

Step S200: train a neural network model with the training data to improve its recognition capability;

Step S500: after clustering is complete, the feature extraction module 505 identifies at least two images in each category with the neural network model and extracts their features; if the extracted features have no common characteristic, it identifies the next image and continues to search for a common characteristic among the features of the identified images until one is found;

Step S600: use the common characteristic as the reference name of the category and label the entire category.
In addition, an embodiment of the present invention also proposes a computer readable storage medium, which may be any one or any combination of a hard disk, multimedia card, SD card, flash card, SMC, read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, and the like. The computer readable storage medium includes the batch data labeling program and so on; when executed by the processor 22, the batch data labeling program 50 implements the following operations:
Step S10: perform dimensionality reduction on a data set containing multiple images to obtain a data set composed of low-dimensional vectors. For example, the color histogram of each image may be used as its feature vector; dimensionality reduction then yields a low-dimensional representation vector for each image. Reducing the high-dimensional data to two or three dimensions makes it usable for clustering and allows the clustering result to be displayed.
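A color-histogram feature of the kind described in step S10 can be sketched in pure Python as follows. The function name `color_histogram`, the bin count, and the toy pixel lists are illustrative assumptions; in practice the resulting vectors would then be reduced to two or three dimensions, e.g. with an off-the-shelf nonlinear dimensionality-reduction implementation:

```python
def color_histogram(pixels, bins=4):
    """Quantize each RGB channel into `bins` ranges and count the pixels
    falling in each (r, g, b) cell, giving a bins**3-dimensional vector."""
    step = 256 // bins
    hist = [0] * (bins ** 3)
    for r, g, b in pixels:
        hist[(r // step) * bins * bins + (g // step) * bins + (b // step)] += 1
    total = len(pixels) or 1
    return [count / total for count in hist]  # normalize across image sizes

# Two toy "images" as flat pixel lists: one mostly red, one mostly blue.
red_image = [(250, 10, 10)] * 90 + [(10, 10, 250)] * 10
blue_image = [(10, 10, 250)] * 95 + [(250, 10, 10)] * 5
vec_red = color_histogram(red_image)
vec_blue = color_histogram(blue_image)
```

With 4 bins per channel the feature vector has 64 dimensions, which is why a further reduction step is needed before the data can be displayed in two or three dimensions.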
Step S30: cluster the low-dimensional vectors of the data set, dividing the images into different categories. For example, if some images are cars, some are mountains, some are cats, and some are elephants, the clustering algorithm groups images with the same features together: the car images form one cluster and the cat images form another.
Step S50: display the clustered data with a visualization tool (such as a display), select data of different categories, and apply a unified batch label to the data of each category. For example, if some region in the visualization tool contains only "cat" data, the whole region is selected and labeled "cat"; the label of this batch of data is then "cat" throughout, achieving the goal of rapid batch labeling.
The specific embodiments of the computer-readable storage medium of the present invention are substantially the same as those of the batch data labeling method and the electronic device 2 described above, and are not repeated here.
The above is only a preferred embodiment of the present invention and is not intended to limit it; for those skilled in the art, the invention may be modified and varied in many ways. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included in its protection scope.
Claims (10)
1. A batch data labeling method applied to an electronic device, characterized by comprising:
performing dimensionality reduction on a data set containing multiple images to obtain a data set composed of low-dimensional vectors;
clustering the low-dimensional vectors of the data set, dividing the images into different categories;
displaying the clustered data with a visualization tool, selecting data of the same category, and applying a unified batch label to the data of that category.
2. The batch data labeling method according to claim 1, characterized in that high-dimensional data is converted into low-dimensional data by means of nonlinear dimensionality reduction.
3. The batch data labeling method according to claim 1, characterized in that the nonlinear dimensionality reduction uses the following formulas:

The high-dimensional space is represented as:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$

wherein $p_{j|i}$ denotes the conditional probability in the high-dimensional space; $x_i$ and $x_j$ denote points in the high-dimensional space; $\sigma_i$ denotes the variance of the Gaussian distribution centered on $x_i$.

The low-dimensional space is represented as:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$

wherein $q_{ij}$ denotes the conditional probability in the low-dimensional space; $y_i$ and $y_j$ denote the points to which the high-dimensional points are mapped in the low-dimensional space.

The cost function is:

$$C = \mathrm{KL}(P \,\Vert\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

wherein the KL divergence denotes the error between P and Q at a point; P denotes the conditional probability distribution of the high-dimensional space and Q denotes that of the low-dimensional space.

The gradient is:

$$\frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}$$
4. The batch data labeling method according to claim 1, characterized in that a category whose image features are uncertain is labeled with a number.
5. The batch data labeling method according to claim 1, characterized in that after clustering, at least one image in a certain category is further recognized using a neural network in order to speed up labeling, comprising the following steps:
collecting a training data set consisting of a large number of labeled pictures to serve as training data;
training the neural network model with the training data to improve its recognition capability;
after clustering is complete, recognizing one image in each category using the neural network model to obtain the features of that image;
uniformly labeling all images in each post-clustering category according to the features of that image.
6. The batch data labeling method according to claim 1, characterized in that after clustering, at least two images in each category are further recognized using a neural network in order to speed up labeling, comprising the following steps:
collecting a training data set consisting of a large number of labeled pictures to serve as training data;
training the neural network model with the training data to improve its recognition capability;
after clustering is complete, recognizing at least two images in each category using the neural network model and extracting their features; if the extracted features share no common characteristic, recognizing the next image and continuing to search for a common characteristic among the features of the recognized images, until such a common characteristic is found; then taking the common characteristic as the label name of the category and labeling the entire category with it.
7. The batch data labeling method according to claim 1, characterized in that the color histogram of each image is used as its feature vector to form the data set.
8. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory stores a batch data labeling program, and the batch data labeling program, when executed by the processor, implements the following steps:
performing dimensionality reduction on a data set containing multiple images to obtain a data set composed of low-dimensional vectors;
clustering the low-dimensional vectors of the data set, dividing the images into different categories;
displaying the clustered data with a visualization tool, selecting data of the same category, and applying a unified batch label to the data of that category.
9. The electronic device according to claim 8, characterized in that after clustering, at least one image in a certain category is further recognized using a neural network in order to speed up labeling, comprising the following steps:
collecting a training data set consisting of a large number of labeled pictures to serve as training data;
training the neural network model with the training data to improve its recognition capability;
after clustering is complete, recognizing one image in each category using the neural network model to obtain the features of that image;
uniformly labeling all images in each post-clustering category according to the features of that image.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, implement the batch data labeling method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811456459.7A CN109657087A (en) | 2018-11-30 | 2018-11-30 | A kind of batch data mask method, device and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109657087A true CN109657087A (en) | 2019-04-19 |
Family
ID=66112260
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811456459.7A Pending CN109657087A (en) | 2018-11-30 | 2018-11-30 | A kind of batch data mask method, device and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109657087A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102929894A (en) * | 2011-08-12 | 2013-02-13 | 中国人民解放军总参谋部第五十七研究所 | Online clustering visualization method of text |
CN105701502A (en) * | 2016-01-06 | 2016-06-22 | 福州大学 | Image automatic marking method based on Monte Carlo data balance |
CN107004141A (en) * | 2017-03-03 | 2017-08-01 | 香港应用科技研究院有限公司 | To the efficient mark of large sample group |
CN107622104A (en) * | 2017-09-11 | 2018-01-23 | 中央民族大学 | A kind of character image identification mask method and system |
CN107644235A (en) * | 2017-10-24 | 2018-01-30 | 广西师范大学 | Image automatic annotation method based on semi-supervised learning |
CN107944454A (en) * | 2017-11-08 | 2018-04-20 | 国网电力科学研究院武汉南瑞有限责任公司 | A kind of semanteme marking method based on machine learning for substation |
CN108182443A (en) * | 2016-12-08 | 2018-06-19 | 广东精点数据科技股份有限公司 | A kind of image automatic annotation method and device based on decision tree |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110110795B (en) * | 2019-05-10 | 2021-04-20 | 厦门美图之家科技有限公司 | Image classification method and device |
CN110110795A (en) * | 2019-05-10 | 2019-08-09 | 厦门美图之家科技有限公司 | Image classification method and device |
CN110264443A (en) * | 2019-05-20 | 2019-09-20 | 平安科技(深圳)有限公司 | Eye fundus image lesion mask method, device and medium based on feature visualization |
CN110264443B (en) * | 2019-05-20 | 2024-04-16 | 平安科技(深圳)有限公司 | Fundus image lesion labeling method, device and medium based on feature visualization |
CN110516093A (en) * | 2019-08-28 | 2019-11-29 | 深圳力维智联技术有限公司 | Picture mask method, device and equipment |
CN110781920A (en) * | 2019-09-24 | 2020-02-11 | 同济大学 | Method for identifying semantic information of cloud components of indoor scenic spots |
CN113127668A (en) * | 2019-12-31 | 2021-07-16 | 深圳云天励飞技术有限公司 | Data annotation method and related product |
CN111639705B (en) * | 2020-05-29 | 2021-06-29 | 江苏云从曦和人工智能有限公司 | Batch picture marking method, system, machine readable medium and equipment |
CN111639705A (en) * | 2020-05-29 | 2020-09-08 | 江苏云从曦和人工智能有限公司 | Batch picture marking method, system, machine readable medium and equipment |
CN113793306A (en) * | 2021-08-23 | 2021-12-14 | 上海派影医疗科技有限公司 | Breast pathology image identification and detection method and system based on fragment processing |
CN113918747A (en) * | 2021-09-29 | 2022-01-11 | 北京三快在线科技有限公司 | Image data cleaning method, device, equipment and storage medium |
CN114116965A (en) * | 2021-11-08 | 2022-03-01 | 竹间智能科技(上海)有限公司 | Opinion extraction method for comment text and electronic equipment |
CN114220111A (en) * | 2021-12-22 | 2022-03-22 | 深圳市伊登软件有限公司 | Image-text batch identification method and system based on cloud platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20190419 |