CN111507380A

CN111507380A - Image classification method, system and device based on clustering and storage medium

Info

Publication number: CN111507380A
Application number: CN202010237384.4A
Authority: CN
Inventors: 王淦
Original assignee: Ping An Property and Casualty Insurance Company of China Ltd
Current assignee: Ping An Property and Casualty Insurance Company of China Ltd
Priority date: 2020-03-30
Filing date: 2020-03-30
Publication date: 2020-08-07
Anticipated expiration: 2040-03-30
Also published as: CN111507380B

Abstract

The invention provides a picture classification method based on clustering, which is applied to an electronic device and comprises the following steps: acquiring all sample pictures within a preset time to establish a sample database; acquiring a positive sample, a negative sample and an unlabel sample from the sample database, and establishing a model training data set according to the acquired positive sample, negative sample and unlabel sample; preprocessing the model training data set, and acquiring the weight value of each sample in the model training data set through preprocessing; establishing an effective classification model by adopting a preset classifier according to the weight value of each sample in the model training data set; and classifying the pictures to be classified according to the effective classification model. The clustering-based picture classification method can effectively avoid the occurrence of wrong classification in the picture classification process.

Description

Image classification method, system and device based on clustering and storage medium

Technical Field

The invention relates to the technical field of picture identification, in particular to a picture classification method and device based on clustering and a computer readable storage medium.

Background

In the technical field of picture identification, an effective classification model is often used to judge whether a picture (sample) is real and effective. Traditionally, this classification was carried out manually; for some special samples, however, a worker cannot tell whether the sample is effective and ends up judging it arbitrarily from experience, which often gives customers an extremely poor experience. The effective classification model has therefore become a necessary link in the field of sample classification. The effective classification model outputs the probability that a sample is effective, and if the probability is higher than a certain threshold value, the sample is judged to be effective. An effective classification model must be trained using raw samples, which include both effective and invalid samples that are determined manually.

However, conventional effective classification models typically take the form of classification or outlier detection. A conventional effective classification model in the classification form is generally trained using manually determined positive samples and negative samples as sample data, so that newly obtained samples can be classified and identified at a later stage. In a historical database, however, not all samples can be manually judged as effective or not, and because of this uncertainty such samples cannot be used in training the conventional model. The conventional effective classification model therefore does not make use of all the samples in the historical database during training, and its classification accuracy is limited.

Based on the above problems, a high-precision image classification method capable of modeling by using all the existing sample images is needed.

Disclosure of Invention

The invention provides a picture classification method based on clustering, an electronic device and a computer storage medium, and mainly aims to solve the problems of low precision and high error probability of classifying pictures by using a traditional effective classification model.

In order to achieve the above object, the present invention provides a clustering-based picture classification method, which is applied to an electronic device, and the method comprises:

acquiring all sample pictures in a preset time period to establish a sample database, wherein the preset time period is determined according to the real-time efficiency of sample picture generation;

acquiring a positive sample, a negative sample and an unlabel sample from the sample database, and establishing a model training data set according to the acquired positive sample, negative sample and unlabel sample; the unlabel sample is a sample picture which cannot be determined whether the unlabel sample is a positive sample or a negative sample;

dividing the positive samples in the model training data set into a plurality of families according to different scenes, acquiring potential positive samples and clean negative samples from the unlabel samples in the model training data set based on a random forest algorithm according to the families of the divided positive samples, and then performing weight distribution on the positive samples, the negative samples, the potential positive samples and the clean negative samples in the model training data set according to a preset rule; wherein the potential positive samples have a greater probability of being positive samples and the clean negative samples have a greater probability of being negative samples;

establishing an effective classification model by adopting a preset classifier according to the weight value of each sample in the model training data set;

and classifying the pictures to be classified according to the effective classification model.
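The last two steps (training a classifier with per-sample weights, then thresholding the predicted probability) can be sketched as follows. This is only an illustration under stated assumptions: the text does not name the preset classifier, so a plain logistic regression is swapped in here, and the function names, toy features, weights, learning rate and the 0.5 threshold are all hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_weighted_logistic(X, y, weights, lr=0.5, epochs=500):
    """Gradient descent where each sample's gradient is scaled by its weight."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    total = sum(weights)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi, si in zip(X, y, weights):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = si * (p - yi)  # the sample weight scales the error term
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / total for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / total
    return w, b

def classify(x, w, b, threshold=0.5):
    """Return the predicted probability and whether it exceeds the threshold."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return p, p > threshold

# Toy normalized features; the lower weights (0.5) stand in for samples whose
# labels are less certain (hypothetical values, not taken from the text).
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.6], [0.1, 0.2], [0.2, 0.1], [0.3, 0.2]]
y = [1, 1, 1, 0, 0, 0]
weights = [1.0, 1.0, 0.5, 1.0, 1.0, 0.5]

w, b = train_weighted_logistic(X, y, weights)
prob_hi, is_pos_hi = classify([0.85, 0.9], w, b)
prob_lo, is_pos_lo = classify([0.1, 0.1], w, b)
```

Scaling each sample's error by its weight means that low-weight samples pull the decision boundary less than samples whose labels were confirmed directly.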

Preferably, the process of obtaining the positive sample, the negative sample and the unlabel sample from the sample database includes:

extracting characteristic information of all sample pictures in the sample database to obtain various types of characteristic information of the sample pictures, wherein the sample pictures are pictures obtained by electronic violation shooting; the characteristic information at least includes: the distance between the wheels of the target automobile and the solid line in the sample picture, whether the target automobile runs a red light, the distance between the target automobile and a nearby pedestrian and whether the automobile runs in the wrong direction or not;

judging whether the target automobile in the sample picture violates rules or not through a preset judgment rule according to the characteristic information, and recording the sample picture as a positive sample if the target automobile violates the rules; if the target automobile is judged not to be violated, recording the sample picture as a negative sample, and if the target automobile in the sample picture cannot be accurately judged to be violated according to the preset judgment rule, recording the sample picture as an unlabel sample;

and acquiring all positive samples, negative samples and unlabel samples from the sample database.

Preferably, the process of dividing the positive samples in the model training set into at least a plurality of family classes according to different scenarios comprises:

normalizing the characteristic values of the characteristic information of the positive sample by using a min-max method so as to normalize the characteristic values of all the characteristic information of the positive sample to be between 0 and 1;

and performing clustering processing on the positive samples by using a k-means algorithm to divide the positive samples in the model training set into at least a plurality of family classes.

Preferably, the process of clustering the positive samples by using the k-means algorithm comprises:

selecting K positive samples as initial clustering centers according to different scenes, wherein the minimum value of K is 20;

calculating the distance from each positive sample to each clustering center, and allocating each positive sample to the nearest clustering center, wherein the clustering center and all the positive samples allocated to the clustering center form a cluster together;

wherein the formula for calculating the distance is as follows:

$$\operatorname{dist}(x_i, x_u) = \sum_{j=1}^{d} \left(x_{ij} - x_{uj}\right)^2$$

wherein $x_i$ is a positive sample, $x_u$ is a clustering center, and $d$ is the dimension of the positive sample.

Preferably, after assigning each positive sample to the cluster center closest to it, the method further comprises: each time a positive sample is assigned to a cluster center, recalculating the cluster center of that cluster according to the positive samples currently in the cluster, wherein the calculation formula is as follows:

$$u_i = \frac{1}{|c_i|} \sum_{x \in c_i} x$$

wherein $u_i$ represents the center point of the ith cluster, $c_i$ represents the ith cluster, and $x$ represents the samples belonging to the cluster;

repeating the assignment and recalculation until the loss function of the set of positive samples reaches a minimum;

wherein the loss function is expressed as:

$$J = \sum_{i=1}^{K} \sum_{x \in c_i} \lVert x - u_i \rVert^2$$

where $u_i$ represents the center point of the ith cluster, and $c_i$ represents the ith cluster.

Preferably, the process of obtaining potential positive samples and clean negative samples in the unlabel samples in the model training data set based on the random forest algorithm according to the classified family of the positive samples comprises:

assigning an independent score to each unlabel sample by using an independent random forest algorithm, assigning an approximate score to each unlabel sample according to the clustering of the positive samples, and calculating the sum of the independent score and the approximate score as a total score;

calculating the average score of the positive samples, and judging whether the total score of each unlabel sample is greater than that average score; recording the unlabel sample as a potential positive sample if its total score is greater than the average score of the positive samples, and recording the unlabel sample as a clean negative sample if its total score is less than a preset hyper-parameter β, wherein the maximum value of the preset hyper-parameter β is 0.2.

In addition, to achieve the above object, the present invention further provides a system for classifying pictures based on clustering, wherein the system for classifying pictures comprises:

a sample picture acquisition unit, used for acquiring all sample pictures in a preset time period to establish a sample database, wherein the preset time period is determined according to the real-time efficiency of sample picture generation;

the training data set establishing unit is used for acquiring a positive sample, a negative sample and an unlabel sample from the sample database, and establishing a model training data set according to the acquired positive sample, negative sample and unlabel sample; the unlabel sample is a sample picture which cannot be determined whether the unlabel sample is a positive sample or a negative sample;

the preprocessing unit is used for dividing the positive samples in the model training data set into a plurality of families according to different scenes, acquiring potential positive samples and clean negative samples from the unlabel samples in the model training data set based on a random forest algorithm according to the families of the divided positive samples, and then performing weight distribution on the positive samples, the negative samples, the potential positive samples and the clean negative samples in the model training data set according to a preset rule; wherein the potential positive samples have a greater probability of being positive samples and the clean negative samples have a greater probability of being negative samples;

the effective classification model establishing and applying unit is used for establishing an effective classification model by adopting a preset classifier according to the weight value of each sample in the model training data set; and classifying the pictures to be classified according to the effective classification model.

In addition, to achieve the above object, the present invention further provides a method for classifying pictures based on clustering, which includes:

acquiring a positive sample, a negative sample and an unlabel sample from the sample database, and establishing a model training data set according to the acquired positive sample, negative sample and unlabel sample;

the unlabel sample is a sample picture which cannot be determined whether the unlabel sample is a positive sample or a negative sample;

In addition, to achieve the above object, the present invention further provides a computer-readable storage medium, in which a clustering-based picture classification program is stored, and when the clustering-based picture classification program is executed by a processor, the steps of the clustering-based picture classification method are implemented.

The invention provides a clustering-based picture classification method, a system, an electronic device and a computer-readable storage medium, which are characterized in that firstly, a sample picture is divided into a positive sample, a negative sample and an unlabel sample by a manual or traditional picture identification method; then, preprocessing is carried out on the model training data set, the weight value of each sample in the model training data set is obtained through preprocessing, then, a high-precision effective classification model is established by adopting a preset classifier according to the weight value of each sample in the model training data set, and finally, the pictures to be classified are classified in a high-precision mode through the effective classification model, so that the occurrence of wrong classification in the picture classification process can be effectively avoided.

Drawings

FIG. 1 is a schematic structural diagram of an electronic device according to an embodiment of the invention;

FIG. 2 is a flowchart illustrating an embodiment of a method for classifying pictures based on clustering according to the present invention;

FIG. 3 is a flow diagram of a preferred embodiment of a process for preprocessing a model training data set according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the internal logic of the clustering-based picture classification program according to the embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The invention provides a picture classification method based on clustering, which is applied to an electronic device 70. Referring to fig. 1, a schematic structural diagram of an electronic device 70 according to a preferred embodiment of the invention is shown.

In the embodiment, the electronic device 70 may be a terminal device having a computing function, such as a server, a smart phone, a tablet computer, a portable computer, or a desktop computer.

The electronic device 70 includes: a processor 71 and a memory 72.

The memory 72 includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory, and the like. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 70, such as a hard disk of the electronic device 70. In other embodiments, the readable storage medium may be an external memory of the electronic device 70, such as a plug-in hard disk provided on the electronic device 70, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), and the like.

In the present embodiment, the readable storage medium of the memory 72 is generally used for storing the cluster-based picture classification program 73 installed in the electronic device 70. The memory 72 may also be used to temporarily store data that has been output or is to be output.

The processor 71, which in some embodiments may be a Central Processing Unit (CPU), a microprocessor or another data processing chip, runs program code stored in the memory 72 or processes data, for example by executing the cluster-based picture classification program 73.

In some embodiments, the electronic device 70 is a terminal device of a smartphone, tablet, portable computer, or the like. In other embodiments, the electronic device 70 may be a server.

Fig. 1 shows only an electronic device 70 having components 71-73, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.

Optionally, the electronic device 70 may further include a user interface, which may include an input unit such as a keyboard, a voice input device such as a microphone or another device with a voice recognition function, and a voice output device such as a speaker or a headset; optionally, the user interface may also include a standard wired interface and a wireless interface.

In some embodiments, the electronic device 70 may include a display, which may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch screen, or the like.

Optionally, the electronic device 70 may further include a touch sensor. The area provided by the touch sensor for the user to perform touch operation is referred to as a touch area. Further, the touch sensor here may be a resistive touch sensor, a capacitive touch sensor, or the like. The touch sensor may include not only a contact type touch sensor but also a proximity type touch sensor. Further, the touch sensor may be a single sensor, or may be a plurality of sensors arranged in an array, for example.

The area of the display of the electronic device 70 may be the same as or different from the area of the touch sensor. Optionally, the display is stacked with the touch sensor to form a touch display screen. The device detects touch operation triggered by a user based on the touch display screen.

Optionally, the electronic device 70 may further include a Radio Frequency (RF) circuit, a sensor, an audio circuit, and the like, which are not described in detail herein.

In addition, fig. 2 is a flowchart of a clustering-based picture classification method according to a preferred embodiment of the present invention. Referring to fig. 1 and fig. 2 together, in the embodiment of the apparatus shown in fig. 1, the memory 72, as a computer storage medium, may include an operating system and the cluster-based picture classification program 73; the processor 71, when executing the cluster-based picture classification program 73 stored in the memory 72, performs the following steps:

S110: acquiring all sample pictures within a preset time period to establish a sample database, wherein the preset time period is determined according to the real-time efficiency of sample picture generation; in order to ensure the accuracy of the model established later, the total number of sample pictures within the preset time period needs to reach at least 10000.

In order to illustrate the content of the invention more clearly, the classification of electronic violation photos is selected as a specific embodiment. With the development of the transportation industry, the number of automobiles keeps growing, and traffic violations are becoming more and more common. To discover violating vehicles on the road in time, a city may install electronic violation photographing devices along its streets, and whether the automobile in a picture sample has committed a violation is then judged manually or by a traditional effective classification model. However, judging violations with a traditional effective classification model often leads to missed judgments or misjudgments, which hinders people's travel to a certain extent. Therefore, the invention adopts a clustering-based picture classification method to classify the electronic violation photos.

It should be noted that the sample pictures are pictures of automobiles taken by roadside cameras for violation enforcement, and various information for determining whether the automobile in a picture has committed a violation can be obtained from the sample pictures, for example, the hand movements of the driver in the sample picture, the distance between the automobile's wheels and a solid line, whether the automobile runs a red light, the positional relationship between the automobile and nearby pedestrians, whether the automobile drives in the wrong direction, and the like.

In addition, the violations that tend to occur differ slightly between electronic violation photographing positions: at an intersection, running a red light is common; in the middle of a road, drivers failing to fasten the safety belt or using a mobile phone is common; at a school gate, failing to yield to pedestrians is common. Therefore, the sample pictures can be grouped in advance according to the specific position at which they were taken. Pre-classifying the sample pictures by group yields a multi-level sample database, so that when the required positive samples, negative samples and unlabel samples are later acquired from the sample database, the same number of positive samples, negative samples and unlabel samples can be acquired from each group. In this way, the types of the acquired samples are more balanced, which improves the classification precision of the effective classification model established later.
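The per-group sampling described above can be sketched as follows. This is a minimal illustration, assuming the multi-level sample database is represented as a nested dictionary keyed by camera location and then by label; the function name `sample_balanced`, the location names and the picture identifiers are hypothetical.

```python
import random

def sample_balanced(groups, per_label, seed=0):
    """Draw up to `per_label` samples of each label from every location group."""
    rng = random.Random(seed)
    dataset = {"positive": [], "negative": [], "unlabel": []}
    for location, by_label in groups.items():
        for label in dataset:
            pool = by_label.get(label, [])
            # Never ask for more samples than the group actually holds.
            k = min(per_label, len(pool))
            dataset[label].extend(rng.sample(pool, k))
    return dataset

# Hypothetical two-level database: location -> label -> picture identifiers.
groups = {
    "intersection": {
        "positive": ["p1", "p2", "p3"],
        "negative": ["n1", "n2"],
        "unlabel": ["u1", "u2", "u3"],
    },
    "school_gate": {
        "positive": ["p4", "p5"],
        "negative": ["n3", "n4", "n5"],
        "unlabel": ["u4", "u5"],
    },
}

dataset = sample_balanced(groups, per_label=2)
```

Drawing the same number of samples per label from every group keeps any single photographing position from dominating the training data.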

S120: acquiring a positive sample, a negative sample and an unlabel sample from a sample database, and establishing a model training data set according to the acquired positive sample, negative sample and unlabel sample;

the positive sample is a sample picture preliminarily confirmed as a violation, the negative sample is a sample picture preliminarily confirmed as not a violation, and the unlabel sample is a sample picture for which the preliminary judgment cannot determine whether a violation occurred.

It should be noted that, because the sample pictures are violation photos taken by roadside cameras, whether a sample picture shows a violation is determined by whether the automobile in the picture breaks traffic regulations, for example, whether the automobile runs a red light, crosses a solid line, or is parked illegally, whether the driver smokes or uses a mobile phone while driving, whether the driver and the front passenger are wearing safety belts, whether the automobile yields to pedestrians, whether the automobile fails to dim its high beams when meeting an oncoming vehicle, and the like. A sample picture is recorded as a positive sample when a violation is found, and as a negative sample when no violation is found. However, in actual photographing, factors such as the speed of the vehicle, the weather (haze, rain, snow, and the like) and the real-time lighting conditions can make a sample picture unclear or leave markers (such as safety belts or license plates) occluded, so that traditional picture identification technology cannot judge whether the picture shows a violation; such pictures are recorded as unlabel samples.

Specifically, the preliminary determination of whether a sample picture shows a violation can be made manually or by a traditional picture identification method. If the determination is made manually, a person visually inspects the sample picture for violation behaviors.

If the sample pictures are judged by adopting the traditional picture identification method, extracting the characteristic information of all the sample pictures to obtain various characteristic information of the sample pictures, wherein the characteristic information comprises: the distance between the wheels of the target automobile and the solid line in the sample picture, whether the target automobile runs a red light, the distance between the target automobile and a nearby pedestrian, whether the automobile runs in the wrong direction and the like.

Judging whether the target automobile in the sample picture violates rules or not through a preset judgment rule according to the characteristic information, and recording the sample picture as a positive sample if the target automobile violates the rules; and if the target automobile is judged not to be violated, recording the sample picture as a negative sample, and if the target automobile in the sample picture cannot be accurately judged to be violated according to a preset judgment rule, recording the sample picture as an unlabel sample.

The preset determination rule is related to the violation behaviors, for example, whether the distance between the wheels of the target vehicle and the solid line is zero, whether the characteristic value for running a red light is 1 (a characteristic value of 1 indicates that the event occurred), whether the characteristic value for driving in the wrong direction is 1, and the like.
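The three-way labeling described above can be sketched as a small rule function. This is an illustration only, assuming that a feature which could not be extracted (for example from a blurred or occluded picture) is represented as `None`; the function name `label_sample` and the feature keys are hypothetical.

```python
def label_sample(features):
    """Return 'positive', 'negative' or 'unlabel' for one sample picture."""
    distance = features.get("wheel_line_distance")
    red = features.get("ran_red_light")
    wrong = features.get("wrong_way")

    # Each verdict is True (violation), False (no violation) or None (unknown).
    verdicts = [
        None if distance is None else distance == 0,
        None if red is None else red == 1,
        None if wrong is None else wrong == 1,
    ]
    if any(v is True for v in verdicts):
        return "positive"   # at least one rule clearly fires
    if all(v is False for v in verdicts):
        return "negative"   # every rule was checked and none fired
    return "unlabel"        # nothing fired, but some feature is unknown

label_a = label_sample({"wheel_line_distance": 0.0, "ran_red_light": 0, "wrong_way": 0})
label_b = label_sample({"wheel_line_distance": 1.2, "ran_red_light": 0, "wrong_way": 0})
label_c = label_sample({"wheel_line_distance": 1.2, "ran_red_light": None, "wrong_way": 0})
```

Treating an unknown feature as neither a violation nor a non-violation is what sends ambiguous pictures to the unlabel pool rather than forcing a guess.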

S130: and preprocessing the model training data set, and acquiring the weight value of each sample in the model training data set through preprocessing.

In one embodiment of the present invention, fig. 3 is a flowchart illustrating a preferred embodiment of a preprocessing process performed on a training data set of a model according to an embodiment of the present invention, wherein the preprocessing process includes the following steps as shown in fig. 3;

S131: The positive samples in the model training data set differ greatly from one another: for example, some are positive because a car runs a red light, some because a driver does not fasten a safety belt, some because a car crosses a solid line, and so on. Therefore, the positive samples cannot simply be regarded as a single category. The characteristic information of positive samples from different violation scenes (such as running a red light) often differs greatly, whereas the characteristic information of positive samples from the same violation scene differs relatively little; for example, the distance between a car wheel and a solid line takes very different values in different violation scenes. Thus, the positive samples may be divided into clusters (at least 20) according to their violation scenes, with similar positive samples included in each cluster.

S132: for the unlabel sample, because the potential positive sample is greatly different from the negative sample and is similar to the known positive sample, the potential positive sample and the clean negative sample in the unlabel sample in the model training data set are obtained according to the family of the positive samples and based on the random forest algorithm, wherein the potential positive sample has a higher probability of being the positive sample, and the clean negative sample has a higher probability of being the negative sample.

It should be noted that, since the purpose of the present solution is to obtain an effective classification model, and the negative samples are non-violating samples that have already been confirmed manually, no processing is performed on the negative samples.

S133: performing weight distribution on the positive samples, the negative samples, the potential positive samples and the clean negative samples in the model training data set according to a preset rule.

Specifically, in step S131, the min-max method is first used to normalize the feature values of the feature information of the positive samples, so that all feature values fall between 0 and 1; this also avoids the problem that feature values with different units cannot be used together in the same formula. Since the normalization method is prior art, it is not described further here.
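For reference, min-max normalization maps each feature column to the range 0 to 1 using that column's minimum and maximum. The sketch below is a minimal illustration; the function name `min_max_normalize` and the handling of a constant column (mapped to 0.0) are assumptions, not details taken from the text.

```python
def min_max_normalize(rows):
    """Scale every feature column of `rows` into [0, 1]."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # A constant column has no spread; map it to 0.0 (assumption).
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, mins, maxs)
        ])
    return normalized

# Two features on very different scales.
rows = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
scaled = min_max_normalize(rows)
```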

And secondly, randomly selecting K positive samples as initial clustering centers, wherein the minimum value of K is 20.

Then, the distance from each positive sample to each cluster center is calculated, and each positive sample is allocated to the cluster center closest to it; a cluster center and all the positive samples allocated to it together form a cluster. The formula for calculating the distance is as follows:

$$\operatorname{dist}(x_i, x_u) = \sum_{j=1}^{d} \left(x_{ij} - x_{uj}\right)^2$$

in the above formula, $x_i$ is a positive sample, $x_u$ is a clustering center, and $d$ is the dimension of the positive sample;

the essence of the above formula is that the sum of the squares of the distances between a positive sample and all normalized feature values of the cluster center is taken as the distance between two points.

In order to improve the accuracy of the cluster centers, each time a positive sample is allocated to its closest cluster center, the cluster center of that cluster is recalculated according to the samples currently in the cluster, wherein the calculation formula is as follows:

$$u_i = \frac{1}{|c_i|} \sum_{x \in c_i} x$$

in the above formula, $u_i$ represents the center point of the ith cluster, $c_i$ represents the ith cluster, and $x$ represents the samples belonging to the cluster; that is, the center point of the cluster is computed first and then used as the new cluster center, so that the cluster center is refreshed and its accuracy is improved.

Finally, the above steps are repeated until the loss function of the whole positive sample set (including all positive samples) reaches its minimum value, wherein the loss function is expressed as:

$$J = \sum_{i=1}^{K} \sum_{x \in c_i} \lVert x - u_i \rVert^2$$

in the above formula, $u_i$ represents the center point of the ith cluster, and $c_i$ represents the ith cluster; the sum of the distances from each sample in a cluster to its cluster center is calculated first, and the sums for all clusters are then added together to give the loss function. Minimizing the loss function further improves the accuracy of the cluster centers of the positive samples.

It should be emphasized that the clusters mentioned in the above clustering process are the families referred to in step S131.
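The clustering procedure described in step S131 can be sketched as follows. Two simplifications are assumptions of this sketch: the centers are recomputed once per pass over all samples (the common batch form of k-means) rather than after every single assignment, and the toy example uses 2 clusters instead of the minimum of 20 stated in the text. The function names and the toy data are hypothetical.

```python
def squared_distance(a, b):
    # Sum of squared differences over all normalized feature values.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(samples, initial_centers, max_iter=100):
    """Assign samples to the nearest center, then refresh each center as the
    mean of its cluster, until the assignment stops changing."""
    centers = [list(c) for c in initial_centers]
    k = len(centers)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(s, centers[c]))
            for s in samples
        ]
        if new_assignment == assignment:
            break  # converged: the loss can no longer decrease
        assignment = new_assignment
        for c in range(k):
            members = [s for s, a in zip(samples, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Loss: total squared distance of every sample to its cluster center.
    loss = sum(squared_distance(s, centers[a]) for s, a in zip(samples, assignment))
    return centers, assignment, loss

# Toy normalized positive samples from two hypothetical violation scenes.
samples = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
           [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
# One initial center per scene; they could equally be drawn at random.
centers, assignment, loss = kmeans(samples, [samples[0], samples[3]])
```

Because k-means only reaches a local minimum of the loss, the result depends on the initial centers, which is one reason to choose them per scene or to try several random starts.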

In step S132, the process of obtaining potential positive samples and clean negative samples from the unlabel samples in the model training data set according to the families and based on the random forest algorithm includes:

An independent score is assigned to each unlabel sample by using an independent random forest algorithm, an approximate score is assigned to each unlabel sample according to the clustering of the positive samples, and the sum of the independent score and the approximate score is calculated as a total score.

The average score of the positive samples is then calculated, and the total score of each unlabel sample is compared with it: an unlabel sample whose total score is greater than the average score of the positive samples is defined as a potential positive sample, and an unlabel sample whose total score is less than a preset hyperparameter β is defined as a clean negative sample. The value of β is set manually; the smaller the value, the more reliable the selected samples, and the maximum value of β in this method is 0.2.

Specifically, because the unlabel samples contain few positive samples and the samples differ greatly from one another, the unlabel samples can be given independent scores using an independent random forest, since positive samples, lying near the root, receive lower scores, while negative samples, lying far from the root, receive higher scores. To obtain the independent score of each sample point, the sample is passed down from the root of each tree until a leaf node is reached, which gives its path length in that tree; the average path length over the independent random forest can then be calculated. Based on the average path length, an independent score IS(x) can be calculated, which describes the likelihood that the sample is a positive sample.

IS(x) = E(h(x))

where h(x) represents the path length of x in a tree. The higher the IS(x) score, the higher the likelihood that x is a positive sample.
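The path-length averaging behind IS(x) = E(h(x)) can be illustrated with a small sketch. The procedure described here resembles what is commonly called an isolation forest; the code below is a simplified version under that assumption (random feature, random split point, no subsampling and no correction for leaves that still hold several points). The function names `build_tree`, `path_length` and `independent_score`, and the parameters `n_trees`, `max_depth` and `seed`, are hypothetical.

```python
import random


def build_tree(points, rng, depth=0, max_depth=12):
    """Recursively split points on a random feature at a random threshold."""
    if len(points) <= 1 or depth >= max_depth:
        return {"leaf": True}
    feature = rng.randrange(len(points[0]))
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    if lo == hi:  # all values equal on this feature, cannot split
        return {"leaf": True}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    if not left or not right:  # degenerate split, stop here
        return {"leaf": True}
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, rng, depth + 1, max_depth),
        "right": build_tree(right, rng, depth + 1, max_depth),
    }


def path_length(tree, x):
    """h(x): number of edges from the root to the leaf that x falls into."""
    length = 0
    while "feature" in tree:
        branch = "left" if x[tree["feature"]] < tree["split"] else "right"
        tree = tree[branch]
        length += 1
    return length


def independent_score(points, x, n_trees=50, seed=0):
    """IS(x) = E(h(x)): mean path length of x over n_trees random trees."""
    rng = random.Random(seed)
    trees = [build_tree(points, rng) for _ in range(n_trees)]
    return sum(path_length(t, x) for t in trees) / n_trees
```

In this sketch a point that lies far from all the others is separated after very few splits and so has a short average path, while a point inside a dense group needs more splits and has a longer one.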

On the other hand, the closer a sample is to the cluster center of a known positive sample, the more likely it is to be a potential positive sample. The approximate score SS(x) between the unlabel sample x and the cluster center closest to it is therefore calculated as follows:

To select potential positive samples and clean negative samples from the unlabel samples, the independent score and the approximate score must be considered together. The total score is calculated as follows:

TS(x) = θ·IS(x) + (1 − θ)·SS(x)

where θ takes a value in [0, 1], with a default value of 0.5, and is used to balance the importance of the independent score and the approximate score.

Specifically, the average total score of the known positive samples is calculated and denoted α. Unlabel samples whose TS(x) value is greater than α are taken as potential positive samples, and unlabel samples whose TS(x) value is less than β are taken as clean negative samples. β is a hyperparameter; the smaller its value, the more reliable the selected samples.
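The scoring and selection rule above can be sketched in a few lines. The sketch assumes that IS(x) and SS(x) have already been computed and scaled to a comparable range (the text does not state the scaling); `total_score` and `select_samples` are hypothetical names, and the default `beta=0.2` is the maximum value mentioned in the description.

```python
def total_score(is_score, ss_score, theta=0.5):
    """TS(x) = theta * IS(x) + (1 - theta) * SS(x), with theta in [0, 1]."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return theta * is_score + (1.0 - theta) * ss_score


def select_samples(positive_ts, unlabel_ts, beta=0.2):
    """Pick potential positives and clean negatives among unlabel samples.

    positive_ts: total scores of the known positive samples.
    unlabel_ts:  total scores of the unlabel samples.
    Returns (alpha, potential_positive_indices, clean_negative_indices).
    """
    # alpha is the average total score of the known positive samples.
    alpha = sum(positive_ts) / len(positive_ts)
    potential_positive = [i for i, ts in enumerate(unlabel_ts) if ts > alpha]
    clean_negative = [i for i, ts in enumerate(unlabel_ts) if ts < beta]
    return alpha, potential_positive, clean_negative
```

Unlabel samples whose score falls between β and α are left unassigned in this sketch, which matches the description: only samples above α or below β are selected.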

Specifically, in S133, a corresponding preset rule may be set according to the scores of the various types of samples, and corresponding weights are assigned to the potential positive samples and clean negative samples obtained from the unlabel samples, as well as to the known positive samples and known negative samples. Because the known positive samples and known negative samples have already been confirmed, the weights of all known positive samples are 1 and the weights of all known negative samples are 0. The weight calculation formula for a potential positive sample is as follows:

For pure negative samples, the smaller the score, the higher the weight; the calculation formula is as follows:

S140: establishing an effective classification model by adopting a preset classifier according to the weight value of each sample in the model training data set.

Specifically, there are various types of classifiers, such as linear regression, SVM (support vector machine), decision tree (DT) and catboost. Here the catboost classifier, which is relatively well suited to image classification, is selected as the effective classification model, and the potential positive samples, pure negative samples, known positive samples and known negative samples are used as training samples to train the catboost classifier, thereby establishing the effective classification model.

More specifically, step S140 may include the following steps:

Samples of each type are drawn from the training data set according to their weight values to serve as training samples, where the sampling probability of a potential positive sample or pure negative sample is set to its weight value, the sampling probability of known positive samples is set to 1, and the sampling probability of known negative samples defaults to 0.

A catboost model is trained using the collected training samples, where each training sample consists of features and a label.

Steps S131 and S132 are iterated in a loop; the iteration stops after n_iter times, and the default value of n_iter is 30.

Model parameter tuning: parameters that influence the model result, such as iterations, depth and scale_pos_weight, are tuned so that the model performs optimally.
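The weighted sampling in the first of the steps above can be sketched as an independent draw per sample, which is one possible reading of "sampling probability" and not necessarily the scheme used in the patent. The function name `draw_training_samples`, the tuple layout and the `seed` parameter are assumptions. Note that under this reading a sample with probability 1 is always drawn and a sample with probability 0 is never drawn.

```python
import random


def draw_training_samples(samples, seed=0):
    """Draw training samples according to per-sample sampling probabilities.

    samples: list of (features, label, probability) tuples, where the
    probability is the weight value assigned during preprocessing.
    Returns the drawn (features, label) pairs.
    """
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so p = 1 always passes and p = 0 never does.
    return [(features, label)
            for features, label, p in samples
            if rng.random() < p]
```

The drawn pairs could then be passed to the chosen classifier; repeating the draw with a different seed in each iteration gives a different training subset each time.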

It should be noted that the catboost model is an existing, commonly used classification model, and the innovation of the invention lies mainly in the preprocessing of the training data set, which significantly improves the accuracy of the final classification model. The specific training process and parameter tuning process of the catboost model are therefore not described in detail.

S150: classifying the pictures to be classified according to the effective classification model, thereby accurately judging whether the vehicles in the pictures to be classified have committed a violation.

It should be emphasized that the clustering-based picture classification method provided by the present invention can be used not only for classifying electronic violation photos, but also for other suitable picture classification scenes, such as classifying pictures or videos used for determining whether a player violates a rule in a sports game (gymnastics, diving, long jump, etc.) scene, and further classifying pictures or videos in scenes such as daily security monitoring, workplace monitoring, farm monitoring, etc.

In addition, the invention also provides a picture classification system based on clustering, which comprises:

a sample picture acquisition unit, used for acquiring all sample pictures in a preset time period to establish a sample database, wherein the preset time period is determined according to the real-time efficiency of sample picture generation;

a training data set establishing unit, used for acquiring a positive sample, a negative sample and an unlabel sample from the sample database, and establishing a model training data set according to the acquired positive sample, negative sample and unlabel sample; an unlabel sample is a sample picture for which it cannot be determined whether it is a positive sample or a negative sample;

a preprocessing unit, used for dividing the positive samples in the model training set into a plurality of families according to different scenes, acquiring potential positive samples and pure negative samples among the unlabel samples in the model training data set based on a random forest algorithm according to the families of the divided positive samples, and then performing weight distribution on the positive samples, the negative samples, the potential positive samples and the pure negative samples in the model training data set; wherein a potential positive sample has a higher probability of being a positive sample, and a pure negative sample has a higher probability of being a negative sample;

In other embodiments, fig. 4 is a schematic diagram of the internal logic of the clustering-based picture classification program according to an embodiment of the present invention. As shown in fig. 4, the clustering-based picture classification program 73 may be further divided into one or more modules, which are stored in the memory 72 and executed by the processor 71 to implement the present invention. A module, as referred to herein, is a series of computer program instruction segments capable of performing a specified function. Fig. 4 shows a block diagram of a preferred embodiment of the clustering-based picture classification program 73 of fig. 1. The clustering-based picture classification program 73 may be divided into: a sample picture acquisition module 74, a training data set establishing module 75, a preprocessing module 76, and an effective classification model establishing and applying module 77. The functions or operational steps performed by the modules 74-77 are similar to those described above and are not described in detail here. For example:

The sample picture acquisition module 74 acquires all sample pictures within a preset time period to establish a sample database, wherein the preset time period is determined according to the real-time efficiency of sample picture generation.

The training data set establishing module 75 acquires a positive sample, a negative sample and an unlabel sample from the sample database, and establishes a model training data set according to the acquired positive sample, negative sample and unlabel sample.

An unlabel sample is a sample picture for which it cannot be determined whether it is a positive sample or a negative sample.

The preprocessing module 76 preprocesses the model training data set to obtain the weight value of each sample in the model training data set.

The effective classification model establishing and applying module 77 is configured to establish an effective classification model by using a preset classifier according to the weight value of each sample in the model training data set, and classify the pictures to be classified according to the effective classification model.

In addition, the invention also provides a picture classification method based on clustering. The method may be performed by an apparatus, which may be implemented by software and/or hardware, the method comprising:

S110: acquiring all sample pictures in a preset time period to establish a sample database, wherein the preset time period is determined according to the real-time efficiency of sample picture generation, and the total number of the sample pictures in the preset time period is at least 10000;

S130: preprocessing the model training data set, and acquiring the weight value of each sample in the model training data set through preprocessing;

S140: establishing an effective classification model by adopting a preset classifier according to the weight value of each sample in the model training data set;

S150: classifying the pictures to be classified according to the effective classification model.

In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a clustering-based picture classification program is stored in the computer-readable storage medium, and when executed by a processor, the clustering-based picture classification program implements the following operations:

S110: acquiring all sample pictures in a preset time period to establish a sample database, wherein the preset time period is determined according to the real-time efficiency of sample picture generation;

The specific implementation of the computer-readable storage medium provided by the present invention is substantially the same as that of the above-mentioned clustering-based picture classification method and electronic device, and is not described again here.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A picture classification method based on clustering is applied to an electronic device, and is characterized by comprising the following steps:

2. The method according to claim 1, wherein the process of obtaining positive samples, negative samples and unlabel samples from the sample database comprises:

3. The method of claim 1, wherein the process of classifying the positive samples in the model training set into at least a plurality of families according to different scenes comprises:

4. The method as claimed in claim 3, wherein the process of clustering the positive samples using k-means algorithm comprises:

wherein, the formula for calculating the distance is as follows;

5. The cluster-based picture classification method according to claim 4,

after assigning each positive sample to the cluster center closest to it, the method further comprises: when each positive sample is assigned to a corresponding cluster center, the cluster center of the cluster is recalculated according to the existing positive sample in the cluster, wherein the calculation formula is as follows:

until the loss function of the set of positive samples reaches a minimum;

wherein, the expression formula of the loss function is as follows:

6. The method according to claim 5, wherein the step of obtaining potential positive samples and clean negative samples of unlabel samples in the model training data set based on a random forest algorithm according to the cluster of the divided positive samples comprises:

7. A system for classifying pictures based on clustering, the system comprising:

the preprocessing unit is used for dividing the positive samples in the model training set into at least a plurality of families according to different scenes, acquiring potential positive samples and pure negative samples in the unlabel samples in the model training data set based on a random forest algorithm according to the families of the divided positive samples, and then performing weight distribution on the positive samples, the negative samples, the potential positive samples and the pure negative samples in the model training data set; wherein the potential positive samples have a greater probability of being positive samples and the clean negative samples have a greater probability of being negative samples;

8. An electronic device, comprising: a memory, a processor, and a cluster-based picture classification program stored in the memory and executable on the processor, the cluster-based picture classification program when executed by the processor implementing the steps of:

dividing positive samples in the model training set into at least a plurality of families according to different scenes, acquiring potential positive samples and pure negative samples in unlabel samples in the model training data set based on a random forest algorithm according to the families of the divided positive samples, and then performing weight distribution on the positive samples, the negative samples, the potential positive samples and the pure negative samples in the model training data set; wherein the potential positive samples have a greater probability of being positive samples and the clean negative samples have a greater probability of being negative samples;

9. The electronic device of claim 8, wherein preprocessing the model training data set comprises:

dividing the positive samples in the model training set into at least 20 family classes according to different violation scenes;

acquiring potential positive samples and pure negative samples in unlabel samples in the model training data set based on a random forest algorithm according to the classified families of the positive samples, wherein the potential positive samples have higher probability of being positive samples, and the pure negative samples have higher probability of being negative samples;

weight assignment is performed on positive samples, negative samples, potential positive samples, and clean negative samples within the model training dataset.

10. A computer-readable storage medium, characterized in that a cluster-based picture classification program is provided in the computer-readable storage medium, which, when executed by a processor, performs the steps of the cluster-based picture classification method according to any one of claims 1 to 6.