CN112579587B - Data cleaning method and device, equipment and storage medium

Data cleaning method and device, equipment and storage medium

Info

Publication number
CN112579587B
Authority
CN
China
Prior art keywords
violation
label
city management
cleaned
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011592748.7A
Other languages
Chinese (zh)
Other versions
CN112579587A (en)
Inventor
唐鑫
王冠皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Button Internet Beijing Technology Co ltd
Original Assignee
Button Internet Beijing Technology Co ltd
Application filed by Button Internet Beijing Technology Co ltd
Priority to CN202011592748.7A
Publication of CN112579587A
Application granted
Publication of CN112579587B

Abstract

The disclosure provides a data cleaning method, a data cleaning apparatus, a device, and a storage medium, and relates to the field of data processing. The implementation is as follows: a trained first classification model is used to perform the following cleaning operation on a data sample set to be cleaned: each data sample to be cleaned in the set is input into the first classification model; based on the output of the first classification model, real labels are determined for one or more data samples to be cleaned whose predicted labels are inconsistent with their initial labels; and the one or more data samples whose real labels have been determined are taken as first standard data samples. The first classification model is then retrained using the determined one or more first standard data samples, and the cleaning operation is performed on the remaining data samples to be cleaned in the set using the retrained first classification model. A first standard data sample set is constructed from the plurality of first standard data samples having real labels.

Description

Data cleaning method and device, equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, in particular to image processing and artificial intelligence technologies, and more specifically to a data cleaning method and apparatus, a neural network training method and apparatus, a method and apparatus for identifying violations in city management violation images, a computer device, a computer readable storage medium, and a computer program product.
Background
Artificial intelligence is the discipline of making a computer simulate certain human thought processes and intelligent behaviors (e.g., learning, reasoning, thinking, planning), and it involves both hardware-level and software-level techniques. Artificial intelligence hardware technology generally includes fields such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technology mainly includes computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology, and the like. Artificial intelligence is increasingly being used in various fields, such as image recognition. In the field of image recognition, data can be labeled by using a data cleaning method to obtain a standard sample data set, which can then be used for training so as to realize image recognition.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
The present disclosure provides a data cleaning method and apparatus, a neural network training method and apparatus, a city management violation image violation identification method and apparatus, a computer device, a computer readable storage medium, and a computer program product.
According to a first aspect of the present disclosure, there is provided a data cleaning method for cleaning a data sample set to be cleaned, the data sample set to be cleaned including a plurality of data samples to be cleaned having initial labels, the data cleaning method comprising: performing the following cleaning operation on the data sample set to be cleaned by using a trained first classification model: in response to each data sample to be cleaned in the data sample set being input into the trained first classification model, outputting, by the first classification model, a predicted label and a label confidence for each data sample to be cleaned; acquiring one or more data samples to be cleaned whose predicted labels are inconsistent with their initial labels; determining, based on a preset rule, real labels of the one or more data samples to be cleaned whose predicted labels are inconsistent with their initial labels; and determining the one or more data samples to be cleaned whose real labels have been determined as first standard data samples; retraining the first classification model by using the determined one or more first standard data samples, so as to perform the cleaning operation on the remaining data samples to be cleaned in the data sample set by using the retrained first classification model; and constructing a first standard data sample set based on the plurality of first standard data samples having real labels.
According to another aspect of the present disclosure, there is provided a training method of a neural network, wherein the neural network includes a violation classification model, the training method including: acquiring a sample image set to be cleaned for city management violations, wherein the sample image set to be cleaned comprises a plurality of violation sample images having initial violation labels; cleaning the sample image set to be cleaned by using the data cleaning method described above, and determining real violation labels of the plurality of violation sample images, so as to obtain a standard sample image set of city management violations; and training the violation classification model by using the standard sample image set.
According to another aspect of the present disclosure, there is provided a method for identifying violations using a neural network, the neural network being trained using the training method described above and including a violation classification model, the identifying method including: acquiring a first city management captured image of a target scene; and, in response to the first city management captured image being input into the violation classification model, outputting, by the violation classification model, a city management violation label corresponding to the first city management captured image, wherein the city management violation label is either violation or non-violation.
According to another aspect of the present disclosure, there is provided a data cleansing apparatus for cleansing a data sample set to be cleansed, the data sample set to be cleansed including a plurality of data samples to be cleansed having an initial tag, the cleansing apparatus comprising: a first cleaning unit configured to perform a cleaning operation on a set of data samples to be cleaned using a trained first classification model, wherein the first cleaning unit comprises: a prediction subunit configured to output a prediction label and label confidence thereof for each data sample to be cleaned in the data sample set to be cleaned in response to inputting each data sample to be cleaned in the data sample set to be cleaned into the first classification model; a first acquisition subunit configured to acquire one or more data samples to be cleaned for which the predicted tag is inconsistent with the initial tag; a first determining subunit configured to determine, based on a preset rule, one or more real tags of the data samples to be cleaned for which the predicted tag is inconsistent with the initial tag; and a second determining subunit configured to determine, as the first standard data sample, one or more data samples to be cleaned after determining the real tag; a first training unit configured to retrain the first classification model using the determined one or more first standard data samples, so that the first cleaning unit performs a cleaning operation on the remaining data samples to be cleaned of the data sample set using the retrained first classification model; and a first construction unit configured to construct a first standard data sample set based on the plurality of first standard data samples having the real tags.
According to another aspect of the present disclosure, there is provided a training apparatus of a neural network, wherein the neural network includes a violation classification model, the training apparatus including: the third acquisition unit is configured to acquire a sample image set to be cleaned of the urban management violation, wherein the sample image set to be cleaned comprises a plurality of violation sample images with initial violation tags; the second cleaning unit is configured to clean the sample image set to be cleaned by adopting the cleaning method, and determine real violation labels of a plurality of violation sample images so as to obtain a standard sample image set of urban management violations; and the third training unit is configured to train the violation classification model by using the standard sample image set.
According to another aspect of the present disclosure, there is provided a violation identification device based on a city management violation image, the identification device including: a neural network obtained by training according to the training method described above, wherein the neural network comprises a violation classification model; and a fifth acquisition unit configured to acquire a first city management captured image of a target scene, wherein the violation classification model is configured to output, in response to the first city management captured image being input into the violation classification model, a city management violation label corresponding to the first city management captured image, the city management violation label being either violation or non-violation.
According to another aspect of the present disclosure, there is provided a computer apparatus comprising: a memory, a processor and a computer program stored on the memory, wherein the processor is configured to execute the computer program to implement the steps of the method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the above-described method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the above-described method.
According to one or more embodiments of the present disclosure, a data sample set to be cleaned can be cleaned by using a classification model, and the classification model can be iteratively trained with the standard sample data obtained in each cleaning operation. In each cleaning operation, data samples with erroneous labels can therefore be accurately identified among a large number of data samples to be cleaned and then cleaned in a targeted manner. This improves the efficiency of cleaning the data samples and, in turn, the effect of training a neural network on the cleaned data samples and the accuracy of image recognition using that neural network.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;
FIG. 2 (a) shows a flow chart of a data cleansing method according to an embodiment of the present disclosure;
FIG. 2 (b) shows a flow chart of a data cleansing operation according to an embodiment of the present disclosure;
FIG. 3 illustrates a flow chart of another data cleansing method according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of a confusion matrix according to an embodiment of the disclosure;
FIG. 5 illustrates a flowchart of a neural network training method, according to an embodiment of the present disclosure;
FIG. 6 illustrates a flow chart of a violation identification method of a city management violation image, according to an embodiment of the present disclosure;
FIG. 7 shows a block diagram of a data cleansing apparatus according to an embodiment of the present disclosure;
FIG. 8 shows a block diagram of a training device of a neural network, according to an embodiment of the present disclosure;
FIG. 9 shows a block diagram of a violation identification device of a city management violation image, according to an embodiment of the disclosure;
FIG. 10 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
In recent years, artificial intelligence technology has made breakthrough progress, and particularly in the field of image recognition, progress has been remarkable. Today, image recognition technology based on artificial intelligence has been applied to more and more practical tasks, and accuracy and processing efficiency of task processing are significantly improved.
In the related art, a standard sample data set is obtained through a data cleaning method, and the neural network is trained by utilizing the standard sample data set, so that the trained neural network can recognize an input image. Thus, the accuracy of the standard sample dataset can affect the training effect of the neural network.
Based on the above, the disclosure provides a data cleaning method that cleans a data sample set to be cleaned by using a classification model and iteratively trains the classification model with the standard sample data obtained in each cleaning operation. In each cleaning operation, data samples with erroneous labels can thus be accurately identified among a large number of data samples to be cleaned and then cleaned in a targeted manner. This improves the efficiency of cleaning the data samples and, in turn, the effect of training a neural network on the cleaned data samples and the accuracy of image recognition using that neural network.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In an embodiment of the present disclosure, the server 120 may run one or more services or software applications that enable the data cleansing method, the neural network training method, and the violation identification method of the city management violation image of the present disclosure to be performed.
In some embodiments, server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may use client devices 101, 102, 103, 104, 105, and/or 106 to obtain a first city management captured image and/or a second city management captured image for the target scene. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as Microsoft Windows, Apple iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., Google Chrome OS); or include various mobile operating systems such as Microsoft Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head mounted displays and other devices. The gaming system may include various handheld gaming devices, Internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture that involves virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.
In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.
In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the drawbacks of high management difficulty and weak service scalability found in traditional physical host and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. The databases 130 may reside in a variety of locations. For example, a database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The databases 130 may be of different types. In some embodiments, the database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.
In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
FIG. 2 (a) and FIG. 2 (b) are flowcharts illustrating a data cleaning method according to an exemplary embodiment of the present disclosure. The method is used for cleaning a data sample set to be cleaned, the data sample set including a plurality of data samples to be cleaned having initial labels, and comprises:
Step S201: performing the following cleaning operation on the data sample set to be cleaned by using a trained first classification model;
Step S201-1: in response to each data sample to be cleaned in the data sample set being input into the first classification model, outputting, by the first classification model, a predicted label and a label confidence for each data sample to be cleaned;
Step S201-2: acquiring one or more data samples to be cleaned whose predicted labels are inconsistent with their initial labels;
Step S201-3: determining, based on a preset rule, real labels of the one or more data samples to be cleaned whose predicted labels are inconsistent with their initial labels;
Step S201-4: determining the one or more data samples to be cleaned whose real labels have been determined as first standard data samples;
Step S202: retraining the first classification model by using the determined one or more first standard data samples, so as to perform the cleaning operation on the remaining data samples to be cleaned in the data sample set by using the retrained first classification model; and
Step S203: constructing a first standard data sample set based on the plurality of first standard data samples having real labels.
Therefore, in the process of carrying out iterative cleaning on the data sample set to be cleaned, the first classification model is continuously and iteratively trained, the data sample to be cleaned with the error label in the data sample set to be cleaned can be identified in each cleaning operation based on the continuously optimized first classification model, and the cleaning efficiency of the data sample is improved by carrying out targeted data sample cleaning.
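The iterative flow of steps S201 to S203 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `CentroidModel` is a hypothetical toy classifier on 1-D features standing in for the first classification model, the confidence score and the threshold values are invented for the example, and the predicted label is simply taken as the real label. How samples whose prediction agrees with the initial label are handled is not specified in this passage, so the sketch leaves them in the pending list.

```python
from typing import Dict, List, Tuple

Labeled = Tuple[float, int]  # (toy 1-D feature, label)


class CentroidModel:
    """Hypothetical toy stand-in for the first classification model."""

    def __init__(self) -> None:
        self.centroids: Dict[int, float] = {}

    def fit(self, samples: List[Labeled]) -> None:
        # One centroid per label: the mean of that label's features.
        sums: Dict[int, Tuple[float, int]] = {}
        for x, y in samples:
            total, count = sums.get(y, (0.0, 0))
            sums[y] = (total + x, count + 1)
        self.centroids = {y: t / n for y, (t, n) in sums.items()}

    def predict(self, x: float) -> Tuple[int, float]:
        # Needs at least two classes; confidence is a relative-distance score.
        ranked = sorted((abs(x - c), y) for y, c in self.centroids.items())
        (d_near, label), (d_far, _) = ranked[0], ranked[1]
        confidence = 1.0 if d_near + d_far == 0 else d_far / (d_near + d_far)
        return label, confidence


def clean(model: CentroidModel, seed: List[Labeled], pending: List[Labeled],
          thresholds: List[float]) -> Tuple[List[Labeled], List[Labeled]]:
    standard = list(seed)            # samples whose real labels are known
    model.fit(standard)              # initial training
    for threshold in thresholds:     # one cleaning operation per threshold
        newly, remaining = [], []
        for x, initial in pending:
            predicted, confidence = model.predict(x)          # S201-1
            if predicted != initial and confidence >= threshold:  # S201-2
                newly.append((x, predicted))                  # S201-3, S201-4
            else:
                remaining.append((x, initial))
        if newly:
            standard.extend(newly)
            model.fit(standard)      # S202: retrain on the standard samples
        pending = remaining
    return standard, pending         # S203: the standard set, plus leftovers


standard, pending = clean(
    CentroidModel(),
    seed=[(0.0, 0), (10.0, 1)],
    pending=[(1.0, 1), (9.0, 0), (2.0, 0)],
    thresholds=[0.8],
)
```

In this toy run the two samples whose initial labels contradict a confident prediction are relabeled and moved into the standard set, while the sample whose initial label agrees with the prediction stays pending.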
According to some embodiments, the first classification model may be initially trained using one or more data samples having real labels to obtain the trained first classification model, so that the cleaning operation on the data sample set to be cleaned can be started.
In one embodiment, whether the label assigned to a data sample is its real label may be determined by manual verification. Alternatively, this may be determined by other means, including a neural network, which is not limited herein.
Through step S201-1 in the cleaning operation of step S201, a prediction label and a label confidence coefficient of each data sample to be cleaned in the data sample set to be cleaned output by the first classification model may be obtained.
Thus, in step S201-2, one or more data samples to be cleaned whose predicted labels are inconsistent with their initial labels may be acquired, these data samples may be determined to have erroneous labels, and a subsequent targeted cleaning operation may be performed on them. This avoids the large amount of computation that indiscriminately cleaning all data samples to be cleaned would incur, and improves data cleaning efficiency.
According to some embodiments, all data samples to be cleaned in the data sample set whose predicted labels are inconsistent with their initial labels may be determined to be data samples with erroneous labels and passed on to subsequent processing, which speeds up the process.
According to some embodiments, a dynamic threshold may be set, wherein the label confidence of each of the one or more data samples to be cleaned whose predicted labels are inconsistent with their initial labels is not less than the dynamic threshold. That is, a data sample to be cleaned is determined to have an erroneous label, and is passed on to subsequent processing, only if its predicted label is inconsistent with its initial label and its label confidence is not less than the dynamic threshold. This improves the accuracy of identifying data samples with erroneous labels, reduces the workload of subsequent processing, and improves data cleaning efficiency.
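The selection just described can be illustrated with a short sketch. The `Prediction` record and the function name `select_suspects` are hypothetical names introduced only for this example; passing no threshold corresponds to treating every mismatched sample as having an erroneous label.

```python
from typing import List, NamedTuple, Optional


class Prediction(NamedTuple):
    # Hypothetical record holding one model output for one sample.
    sample_id: str
    initial_label: str
    predicted_label: str
    confidence: float


def select_suspects(predictions: List[Prediction],
                    dynamic_threshold: Optional[float] = None) -> List[Prediction]:
    """Return samples whose predicted label differs from the initial label.

    If dynamic_threshold is given, keep only mismatches whose label
    confidence is not less than the threshold.
    """
    return [
        p for p in predictions
        if p.predicted_label != p.initial_label
        and (dynamic_threshold is None or p.confidence >= dynamic_threshold)
    ]


predictions = [
    Prediction("a", "violation", "violation", 0.99),      # labels agree
    Prediction("b", "violation", "non-violation", 0.95),  # confident mismatch
    Prediction("c", "non-violation", "violation", 0.60),  # weak mismatch
]
all_mismatches = select_suspects(predictions)
confident_mismatches = select_suspects(predictions, dynamic_threshold=0.9)
```

Here `all_mismatches` contains samples "b" and "c", while `confident_mismatches` contains only "b".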
In the multiple iterative cleaning operation of the data sample set to be cleaned, the size of the dynamic threshold may be changed according to the number of iterations. According to some embodiments, for two consecutive cleaning operations of a data sample set to be cleaned, the dynamic threshold set in the first cleaning operation may be greater than the dynamic threshold set in the second cleaning operation. That is, in multiple iterative cleaning operations of a set of data samples to be cleaned, the magnitude of the dynamic threshold may be continually reduced as the number of iterations increases. Thus, the cleaning of each data sample to be cleaned in the set of data samples to be cleaned can be achieved by a limited number of iterative cleaning operations.
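One possible way to realize a threshold that shrinks with the iteration count is a simple linear decay. The text does not prescribe any particular schedule, so the function name and the `start`, `step`, and `floor` values below are assumptions chosen purely for illustration.

```python
def dynamic_threshold(iteration: int, start: float = 0.95,
                      step: float = 0.05, floor: float = 0.5) -> float:
    """Linearly decaying threshold (illustrative values, not from the source).

    iteration is 0 for the first cleaning operation; the result never
    drops below `floor`.
    """
    return max(floor, start - step * iteration)


# Thresholds for the first five cleaning operations.
schedule = [dynamic_threshold(i) for i in range(5)]
```

Each later cleaning operation in `schedule` uses a smaller threshold than the one before it, so lower-confidence mismatches are admitted only in later rounds, after the model has been retrained on more standard samples.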
For step S201-3, determining, based on the preset rule, the real labels of the one or more data samples to be cleaned whose predicted labels are inconsistent with their initial labels may include: setting a preset threshold; and, in response to determining that the dynamic threshold is not less than the preset threshold, determining the real labels of those data samples based on a first preset rule. Therefore, based on the magnitude relation between the preset threshold and the dynamic threshold, a suitable way of determining the real labels of the one or more data samples to be cleaned whose predicted labels are inconsistent with their initial labels can be selected, which improves processing flexibility.
Wherein, according to some embodiments, the preset threshold may be set as an authenticity threshold of the predictive label output by the first classification model. Specifically, when the label confidence level output by the first classification model is not less than the preset threshold, the current prediction label output by the first classification model can be considered to be a real label, otherwise, the current prediction label output by the first classification model is considered to be not a real label.
Based on this, according to some embodiments, determining, based on the first preset rule, the true label of the one or more data samples to be cleaned for which the predicted label is inconsistent with the initial label comprises: and determining the predicted label of one or more data samples to be cleaned as a real label. Thus, the cleaning efficiency of the data sample to be cleaned with the wrong label can be improved while the accuracy is ensured.
According to some embodiments, the cleaning operation of the data sample set to be cleaned may further comprise: in response to determining that the dynamic threshold is smaller than the preset threshold, determining, based on a second preset rule, the real labels of the one or more data samples to be cleaned whose predicted labels are inconsistent with the initial labels, wherein the second preset rule is different from the first preset rule. Therefore, based on the magnitude relation between the preset threshold and the dynamic threshold, different modes can be set to determine the real labels of the one or more data samples to be cleaned whose predicted labels are inconsistent with the initial labels, so that the processing requirements under different dynamic thresholds are met.
According to some embodiments, determining, based on the second preset rule, the real labels of the one or more data samples to be cleaned whose predicted labels are inconsistent with the initial labels comprises: acquiring real data characteristics of the one or more data samples to be cleaned whose predicted labels are inconsistent with the initial labels; and determining, based on the corresponding real data characteristics, the real labels of those data samples. Therefore, the accuracy of cleaning the data samples to be cleaned can be ensured in a more reliable manner when the dynamic threshold is smaller than the preset threshold.
In one embodiment, the true tags of one or more data samples to be cleaned for which the predicted tag does not coincide with the initial tag may be determined by means of a manual review.
It will be appreciated that the manual review is only one way to achieve an accurate determination of the true tag of the data sample to be cleaned, and that one skilled in the art may also determine the true tag of the data sample to be cleaned in other ways, not limited herein.
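The choice between the first and second preset rules can be sketched as follows. This is a hedged illustration only: the function name `resolve_real_label`, the dictionary keys, and the `review` callback (standing in for manual review or any other reliable determination) are hypothetical names that do not come from the disclosure.

```python
def resolve_real_label(sample, dynamic_threshold, preset_threshold, review):
    """Return the real label of a sample whose predicted label is
    inconsistent with its initial label.

    First preset rule: when the dynamic threshold is not less than the
    preset threshold, the predicted label is taken as the real label.
    Second preset rule: otherwise, the label is determined from the real
    data characteristics, represented here by the `review` callback.
    """
    if dynamic_threshold >= preset_threshold:
        return sample["predicted"]
    return review(sample)


# Hypothetical reviewer that always answers "C", used only for the example.
sample = {"initial": "A", "predicted": "B", "confidence": 0.8}
by_model = resolve_real_label(sample, 0.9, 0.7, review=lambda s: "C")
by_review = resolve_real_label(sample, 0.6, 0.7, review=lambda s: "C")
```

Passing the reviewer in as a callback keeps the rule selection separate from how the second rule is actually carried out, which matches the statement that manual review is only one possible way.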
After the above-mentioned washing is performed on the one or more data samples to be washed with the error tag screened in step S201-2 to determine the true tag thereof, in step S201-4, the one or more data samples to be washed after the true tag is determined as the first standard data sample. Therefore, the first standard data samples can be ensured to have real labels, and meanwhile, the training effect of subsequent iterative training of the first classification model can be improved through the number of the first standard data samples which is continuously increased in the cleaning operation of multiple iterations.
According to some embodiments, the cleaning operation of the data sample set to be cleaned may further comprise: and determining that the initial label of the one or more data samples to be cleaned, of which the label confidence is not less than the dynamic threshold value and the predicted label is consistent with the initial label, is a real label. Therefore, the data sample to be cleaned with the correct label can be screened, and the initial label can be directly determined to be the real label and further can be directly determined to be the first standard data sample, so that the number of cleaning operation iteration times can be reduced.
After determining the one or more first standard data samples, in step S102, the first classification model may be retrained using the determined one or more first standard data samples to perform a cleaning operation on the remaining data samples to be cleaned of the data sample set using the retrained first classification model. Therefore, the first classification model can be iteratively trained based on the continuously updated first standard data sample set, and the first classification model obtained by training can be optimized along with the increase of the number of the first standard data samples. By continuously optimizing the first classification model during the cleaning of the data sample set to be cleaned, the cleaning of all the data samples to be cleaned can be gradually completed.
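The iterative cleaning described above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the one-dimensional nearest-centroid classifier, its confidence formula, the threshold values, and all function names are hypothetical stand-ins for the first classification model and its training procedure.

```python
def train_centroids(standard):
    """Toy stand-in for the first classification model: a 1-D
    nearest-centroid classifier trained on (feature, label) pairs.
    Requires at least two distinct labels."""
    sums = {}
    for x, label in standard:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    centroids = {label: total / count for label, (total, count) in sums.items()}

    def predict(x):
        ranked = sorted(centroids, key=lambda label: abs(x - centroids[label]))
        d1 = abs(x - centroids[ranked[0]])
        d2 = abs(x - centroids[ranked[1]])
        confidence = d2 / (d1 + d2) if d1 + d2 else 0.5
        return ranked[0], confidence

    return predict


def clean(samples, standard, thresholds, preset_threshold, review):
    """Iteratively clean (feature, initial_label) pairs.
    Returns the enlarged standard set and any samples left uncleaned."""
    remaining, standard = list(samples), list(standard)
    for threshold in thresholds:            # decreasing dynamic thresholds
        if not remaining:
            break
        predict = train_centroids(standard)  # retrain on current standard set
        deferred = []
        for x, initial in remaining:
            predicted, confidence = predict(x)
            if confidence < threshold:
                deferred.append((x, initial))    # handled in a later iteration
            elif predicted == initial:
                standard.append((x, initial))    # initial label is the real label
            elif threshold >= preset_threshold:
                standard.append((x, predicted))  # first preset rule
            else:
                standard.append((x, review(x)))  # second preset rule
        remaining = deferred
    return standard, remaining


standard, remaining = clean(
    samples=[(1.0, "A"), (9.0, "A"), (4.0, "A"), (6.0, "B")],
    standard=[(0.0, "A"), (10.0, "B")],
    thresholds=[0.8, 0.6, 0.5],
    preset_threshold=0.7,
    review=lambda x: "A" if x < 5 else "B",  # hypothetical reviewer
)
```

In this toy run the mislabelled sample at 9.0 is relabelled "B" in the first iteration, and the two lower-confidence samples are only accepted after the model has been retrained on the enlarged standard set and the threshold has dropped.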
After the cleaning of all the data samples to be cleaned in the data sample set to be cleaned is completed, in step S103, a first standard data sample set may be constructed based on a plurality of first standard data samples having real labels. The obtained first standard data sample set is the result of cleaning the data sample set to be cleaned.
Fig. 3 is a diagram illustrating another data cleaning method according to an exemplary embodiment of the present disclosure. The method may further include: step S301, training a second classification model by using an initial data sample set, wherein the initial data sample set comprises a plurality of initial data samples with initial labels; step S302, inputting a second standard data sample set into the trained second classification model, and obtaining a predicted label of each second standard data sample in the second standard data sample set output by the second classification model, wherein the second standard data sample set comprises a plurality of second standard data samples with real labels; step S303, constructing a confusion matrix based on the real labels and the predicted labels respectively corresponding to all the second standard data samples in the second standard data sample set; step S304, determining one or more initial labels which are easily confused based on the confusion matrix; and step S305, acquiring at least part of the initial data samples corresponding to the one or more initial labels from the initial data sample set, so as to establish the data sample set to be cleaned. Therefore, through steps S301 to S305, at least part of the easily confused initial data samples can be obtained from the initial data sample set for the subsequent iterative cleaning operations, thereby realizing targeted cleaning of the initial data samples and effectively improving cleaning efficiency.
Since the initial data sample set includes initial data samples having erroneous initial tags, the second classification model trained using the initial data sample set is liable to confuse data corresponding to erroneous initial tags when performing classification. Based on the above, the second standard data sample set is input into the trained second classification model, the prediction label of each second standard data sample in the second standard data sample set output by the second classification model is obtained, based on the confusion matrix constructed by the real labels and the prediction labels respectively corresponding to all the second standard data samples in the second standard data sample set, one or more initial labels which are easy to be confused can be determined, and one or more groups of initial data samples with higher error rate in the initial data sample set can be correspondingly screened from the initial data sample set through the screened one or more initial labels. Therefore, the screening and targeted cleaning of the initial data samples with higher error rate of the initial data sample set are realized, and the data cleaning efficiency is effectively improved.
The confusion matrix is a tool that can be used to evaluate the prediction accuracy of a classification model. In the present disclosure, the confusion matrix can be constructed based on the real labels and the predicted labels respectively corresponding to all the second standard data samples in the second standard data sample set, and one or more easily confused initial labels in the initial data sample set can be screened out through the constructed confusion matrix. For example, in the exemplary embodiment of fig. 4, of the 12 samples with real label A, 10 are predicted as A and 2 are predicted as B; of the 12 samples with real label B, 8 are predicted as B and 4 are predicted as A. It can thus be determined that B is more easily confused than A.
It is to be understood that fig. 4 is only an exemplary embodiment, and the kinds of real tags and predictive tags are not limited to two kinds.
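The screening of easily confused labels can be sketched with the counts given for fig. 4. This is a minimal sketch: the per-label error rate as the confusion criterion and the cut-off value `max_error_rate=0.25` are assumptions made for illustration, since the disclosure does not fix a specific criterion.

```python
from collections import Counter


def confusion_matrix(real_labels, predicted_labels):
    """Count how often each (real label, predicted label) pair occurs."""
    return Counter(zip(real_labels, predicted_labels))


def error_rates(matrix):
    """Fraction of samples of each real label that were mispredicted."""
    totals, errors = Counter(), Counter()
    for (real, predicted), count in matrix.items():
        totals[real] += count
        if real != predicted:
            errors[real] += count
    return {label: errors[label] / totals[label] for label in totals}


def confusable_labels(matrix, max_error_rate):
    """Labels whose error rate exceeds the (assumed) cut-off."""
    rates = error_rates(matrix)
    return sorted(label for label, rate in rates.items() if rate > max_error_rate)


def select_for_cleaning(initial_samples, labels):
    """Pick the (feature, initial_label) pairs carrying a confusable label."""
    return [s for s in initial_samples if s[1] in labels]


# Counts from the fig. 4 example: 12 samples with real label A, 12 with B.
real = ["A"] * 12 + ["B"] * 12
predicted = ["A"] * 10 + ["B"] * 2 + ["B"] * 8 + ["A"] * 4
matrix = confusion_matrix(real, predicted)
confusable = confusable_labels(matrix, max_error_rate=0.25)
to_clean = select_for_cleaning([("img1", "A"), ("img2", "B")], confusable)
```

With these counts the error rate is 2/12 for A and 4/12 for B, so only B exceeds the assumed cut-off and only the initial samples labelled B would be placed in the data sample set to be cleaned.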
According to some embodiments, after the cleaning of the data sample set to be cleaned that was established from the initial data sample set is completed, the standard data samples obtained by the cleaning may be moved back into the initial data sample set. Steps S301 to S305 are then performed again using the updated initial data sample set until the constructed confusion matrix meets the desired level, i.e., until the confusion matrix no longer indicates any easily confused initial labels. In this way, the whole initial data sample set can be cleaned step by step.
Fig. 5 is a diagram illustrating a training method of a neural network, wherein the neural network includes a violation classification model, according to an exemplary embodiment of the present disclosure, the training method including: step S501, acquiring a sample image set to be cleaned of an urban management violation, wherein the sample image set to be cleaned comprises a plurality of violation sample images with initial violation labels; step S502, cleaning the sample image set to be cleaned by adopting the cleaning method, and determining real violation labels of a plurality of violation sample images so as to obtain a standard sample image set of city management violations; and step S503, training the violation classification model by using a standard sample image set. Therefore, the neural network can be trained based on the cleaned standard sample image set of the urban management violation, and the training effect of the neural network can be improved.
According to some embodiments, the sample image set to be cleaned of city management violations may include violation sample images whose initial violation label is "exposed garbage", violation sample images whose initial violation label is "cross-store business", violation sample images corresponding to other city management violations, and so forth.
According to some embodiments, the method further comprises: acquiring case-closing sample images corresponding to each of a plurality of real violation labels, and determining a real case-closing label of each case-closing sample image; and adding the plurality of case-closing sample images with the real case-closing labels to the standard sample image set of city management violations. Therefore, the neural network obtained through training can support both case filing and case-closing determination for city management violations.
In one embodiment, to aid subsequent case-closing determination, case-closing sample images with real case-closing labels may be added to the standard sample image set of city management violations. Optionally, the case-closing sample images may be background images corresponding to the violations; for example, the case-closing sample images corresponding to "exposed garbage" may show trash cans, and those corresponding to "cross-store business" may show store decking, without limitation.
According to some embodiments, the neural network further comprises a violation detection model, and the training method further comprises: training the violation detection model by using the standard sample image set of city management violations, so that the violation detection model can output, based on an input city management acquisition image, the existence attribute of a real violation label related to the standard sample image set of city management violations, wherein the existence attribute is either present or absent. The trained violation detection model can be used for a secondary determination in the case-closing process, thereby improving the reliability of the case-closing determination.
Fig. 6 is a diagram illustrating a method for identifying violations using a neural network according to an exemplary embodiment of the present disclosure, wherein the neural network is trained using the training method described above and includes a violation classification model. The method comprises: step S601, acquiring a first city management acquisition image for a target scene; and step S602, in response to the first city management acquisition image being input into the violation classification model, outputting, by the violation classification model, a city management violation label corresponding to the first city management acquisition image, wherein the city management violation labels include violation and non-violation. Therefore, violations and non-violations in city management images can be recognized by a neural network trained on the cleaned standard sample image set of city management violations, providing a reliable basis for case filing and case closing for violations in city management.
In one embodiment, the violation labels among the city management violation labels may further include specific types of violations, such as "exposed garbage", "cross-store business", and the like.
According to some embodiments, the identification method may further comprise: in response to determining that the city management violation label corresponding to the first city management acquisition image output by the violation classification model is a violation, executing case filing for the first city management acquisition image, wherein the case filing comprises recording the city management violation label. Therefore, a case can be filed for the determined violation label, which facilitates follow-up tracking and law enforcement for the violation.
In one embodiment, the recorded city management violation label may include a specific type of violation. According to some embodiments, the identification method further comprises: after executing the case filing for the first city management acquisition image, acquiring a second city management acquisition image for the target scene; inputting the second city management acquisition image into the violation classification model; and in response to determining that the city management violation label of the second city management acquisition image output by the violation classification model is consistent with the city management violation label of the first city management acquisition image, determining not to cancel the case filing for the first city management acquisition image. Since the two labels are consistent, it can be determined that the violation for which the case was filed has not yet been eliminated, and thus it can be quickly determined that the case filing is not to be cancelled.
In one embodiment, the city management violation label of the first city management acquisition image is a violation and the city management violation label of the second city management acquisition image is a non-violation, so that the city management violation labels of the first and second city management acquisition images are determined to be inconsistent, and the case filing for the first city management acquisition image is not cancelled.
In another embodiment, the city management violation label of the first city management acquisition image is a violation of a first violation type, the city management violation label of the second city management acquisition image is a violation of a second violation type, and the first violation type is different from the second violation type, so that the city management violation labels of the first and second city management acquisition images are determined to be inconsistent, and the case filing for the first city management acquisition image is not cancelled.
The second city management acquisition image may be an image acquired during continued law enforcement after the case filing for the first city management acquisition image, and is used to determine whether to cancel that case filing.
According to some embodiments, the neural network further comprises a violation detection model, and the identification method further comprises: in response to determining that the city management violation label of the second city management acquisition image output by the violation classification model is inconsistent with the city management violation label of the first city management acquisition image, inputting the second city management acquisition image into the violation detection model, and acquiring the existence attribute of the recorded city management violation label output by the violation detection model; and in response to determining that the existence attribute of the recorded city management violation label is present, determining not to cancel the case filing for the first city management acquisition image. Therefore, the reliability of the case-closing determination can be improved by performing a secondary determination through the violation detection model during the case-closing process.
Since multiple kinds of violations may be present in the second city management acquisition image at the same time, relying on the violation classification model alone may lead to an erroneous case-cancellation decision. For example, the city management violation label of the first city management acquisition image is "cross-store business", while the second city management acquisition image contains both "cross-store business" and "exposed garbage", with the "exposed garbage" being more prominent. In this case, the city management violation label of the second city management acquisition image output by the violation classification model is "exposed garbage", and it is therefore determined that this label is inconsistent with the city management violation label of the first city management acquisition image. However, the "cross-store business" violation has not been eliminated. Based on this, the existence attribute of "cross-store business" can be further determined by the violation detection model to increase the reliability of the case-closing determination. According to some embodiments, the method further comprises: in response to determining that the existence attribute of the recorded city management violation label is absent, determining to cancel the case filing for the first city management acquisition image. Thus, erroneous case-closing determinations can be avoided.
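The two-stage decision on whether to cancel a filed case can be sketched as follows. This is a hedged illustration: `should_cancel_case`, `classify`, and `detect` are hypothetical names, and the two stub models below (which treat an image as a list of violations ordered by prominence) merely stand in for the violation classification model and the violation detection model.

```python
def should_cancel_case(recorded_label, second_image, classify, detect):
    """Decide whether the case filed for the first image can be cancelled.

    classify(image) -> label output by the violation classification model
    detect(image, label) -> True if the existence attribute of `label`
                            is "present" according to the detection model
    """
    if classify(second_image) == recorded_label:
        return False  # labels are consistent: the violation persists
    # Labels are inconsistent: secondary determination by the detection model.
    return not detect(second_image, recorded_label)


# Stub models for the example only: the classifier reports the most
# prominent violation, the detector checks whether a label is present.
def classify(image):
    return image[0] if image else "non-violation"


def detect(image, label):
    return label in image


recorded = "cross-store business"
# The more prominent "exposed garbage" hides the recorded violation from
# the classifier, but the detector still finds it, so the case is kept.
keep = should_cancel_case(
    recorded, ["exposed garbage", "cross-store business"], classify, detect
)
# Nothing is left in the scene, so the case can be cancelled.
cancel = should_cancel_case(recorded, [], classify, detect)
```

The detector is consulted only when the classifier's label differs from the recorded one, which mirrors the role of the violation detection model as a secondary check.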
According to another aspect of the present disclosure, as shown in fig. 7, there is also provided a data cleansing apparatus 700 for cleansing a data sample set to be cleansed, the data sample set to be cleansed including a plurality of data samples to be cleansed having an initial label, the cleansing apparatus 700 including: a first cleaning unit 701 configured to perform a cleaning operation on a set of data samples to be cleaned using a trained first classification model, wherein the first cleaning unit 701 comprises: a prediction subunit 701-1 configured to output a prediction label and a label confidence thereof for each data sample to be cleaned in the data sample set to be cleaned in response to inputting each data sample to be cleaned in the data sample set to be cleaned into the first classification model; a first obtaining subunit 701-2 configured to obtain one or more data samples to be cleaned, for which the predicted tag is inconsistent with the initial tag; a first determining subunit 701-3 configured to determine, based on a preset rule, one or more real tags of the data samples to be cleaned, for which the predicted tag is inconsistent with the initial tag; and a second determining subunit 701-4 configured to determine, as the first standard data sample, one or more data samples to be cleaned after determining the real tag; a first training unit 702 configured to retrain the first classification model using the determined one or more first standard data samples, so that the first cleaning unit performs a cleaning operation on remaining data samples to be cleaned of the data sample set using the retrained first classification model; a first construction unit 703 is configured to construct a first set of standard data samples based on a plurality of first standard data samples with real labels.
According to some embodiments, the first acquisition subunit comprises: a first setting subunit configured to set a dynamic threshold, wherein a label confidence of each of the one or more data samples to be cleaned for which the predicted label is inconsistent with the initial label is not less than the dynamic threshold.
According to some embodiments, the preset rules comprise first preset rules, the first cleaning unit further comprising: a second setting subunit configured to set a preset threshold; the first determination subunit is further configured to determine, based on the first preset rule, a true label of the one or more data samples to be cleaned for which the predicted label is inconsistent with the initial label in response to determining that the dynamic threshold is not less than the preset threshold.
According to some embodiments, the first determining subunit is further configured to determine the predicted tag of the one or more data samples to be cleaned as a real tag.
According to some embodiments, the first determining subunit is further configured to determine, in response to determining that the dynamic threshold is less than the preset threshold, a true label of the one or more data samples to be cleaned for which the predicted label is inconsistent with the initial label based on a second preset rule, wherein the second preset rule is different from the first preset rule.
According to some embodiments, the first determining subunit comprises: a second obtaining subunit configured to obtain real data characteristics of one or more data samples to be cleaned, for which the predicted tag is inconsistent with the initial tag; a third determination subunit configured to determine, based on the respective authentic data signatures, the authentic signature of the one or more data samples to be cleaned for which the predicted signature is inconsistent with the original signature.
According to some embodiments, for two successive cleaning operations of a data sample set to be cleaned, the dynamic threshold value set in the first cleaning operation is greater than the dynamic threshold value set in the second cleaning operation.
According to some embodiments, the first cleaning unit further comprises: and a fourth determining subunit configured to determine, for one or more data samples to be cleaned for which the label confidence is not less than the dynamic threshold and for which the predicted label is consistent with the initial label, that the initial label of the one or more data samples to be cleaned for which the predicted label is consistent with the initial label is a real label.
According to some embodiments, the apparatus further comprises: a second training unit configured to train a second classification model with an initial data sample set including a plurality of initial data samples having initial labels; a first acquisition unit configured to input a second standard data sample set into the trained second classification model, and acquire a predicted label of each second standard data sample in the second standard data sample set output by the second classification model, wherein the second standard data sample set comprises a plurality of second standard data samples with real labels; a second construction unit configured to construct a confusion matrix based on the real labels and the predicted labels respectively corresponding to all the second standard data samples in the second standard data sample set; a first determining unit configured to determine one or more easily confused initial labels based on the confusion matrix; and a second acquisition unit configured to acquire at least part of the initial data samples corresponding to the one or more initial labels from the initial data sample set so as to establish the data sample set to be cleaned.
According to another aspect of the present disclosure, as shown in fig. 8, there is also provided a training apparatus 800 of a neural network, wherein the neural network includes a violation classification model, and the training apparatus 800 includes: a third obtaining unit 801 configured to obtain a sample image set to be cleaned of an urban management violation, where the sample image set to be cleaned includes a plurality of violation sample images with initial violation tags; a second cleaning unit 802, configured to clean the sample image set to be cleaned by adopting the cleaning method described above, and determine real violation tags of the multiple violation sample images included, so as to obtain a standard sample image set of the city management violation; a third training unit 803 is configured for training the violation classification model using the set of standard sample images.
According to some embodiments, the apparatus further comprises: a fourth obtaining unit configured to obtain case-closing sample images corresponding to each of the plurality of real violation labels and determine a real case-closing label of each case-closing sample image; and an adding unit configured to add the plurality of case-closing sample images with real case-closing labels to the standard sample image set of city management violations.
According to some embodiments, the neural network further comprises a violation detection model, and the apparatus further comprises: a fourth training unit configured to train the violation detection model by using the standard sample image set of city management violations, so that the violation detection model can output, based on an input city management acquisition image, the existence attribute of a real violation label related to the standard sample image set of city management violations, wherein the existence attribute is either present or absent.
According to another aspect of the present disclosure, as shown in fig. 9, there is also provided a violation identification device 900 for city management violation images, the identification device 900 including: a neural network 901 trained according to the training method described above, wherein the neural network 901 comprises a violation classification model 901-1; and a fifth acquiring unit 902 configured to acquire a first city management acquisition image for a target scene; wherein the violation classification model 901-1 is configured to, in response to the first city management acquisition image being input into the violation classification model 901-1, output a city management violation label corresponding to the first city management acquisition image, wherein the city management violation labels include violation and non-violation.
According to some embodiments, the apparatus further comprises: a case-filing unit configured to execute case filing for the first city management acquisition image in response to determining that the city management violation label corresponding to the first city management acquisition image output by the violation classification model is a violation, wherein the case filing comprises recording the city management violation label.
According to some embodiments, the apparatus further comprises: a sixth acquisition unit configured to acquire a second city management acquisition image for the target scene after the case filing is executed for the first city management acquisition image; an input unit configured to input the second city management acquisition image into the violation classification model; and a second determining unit configured to determine not to cancel the case filing for the first city management acquisition image in response to determining that the city management violation label of the second city management acquisition image output by the violation classification model is consistent with the city management violation label of the first city management acquisition image.
According to some embodiments, the neural network further comprises a violation detection model, and the apparatus further comprises: a seventh obtaining unit configured to, in response to determining that the city management violation label of the second city management acquisition image output by the violation classification model is inconsistent with the city management violation label of the first city management acquisition image, input the second city management acquisition image into the violation detection model and obtain the existence attribute of the recorded city management violation label output by the violation detection model; and a third determining unit configured to determine not to cancel the case filing for the first city management acquisition image in response to determining that the existence attribute of the recorded city management violation label is present.
According to some embodiments, the apparatus further comprises: a fourth determining unit configured to determine to cancel the case filing for the first city management acquisition image in response to determining that the existence attribute of the recorded city management violation label is absent.
According to another aspect of the present disclosure, there is also provided a computer apparatus including: a memory, a processor and a computer program stored on the memory, wherein the processor is configured to execute the computer program to implement the steps of the method described above.
According to another aspect of the present disclosure, there is also provided a non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the above-described method.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program when executed by a processor implements the steps of the above-described method.
According to embodiments of the present disclosure, there is also provided an electronic device, a readable storage medium and a computer program product.
Referring to fig. 10, a block diagram of an electronic device 1000, which may be a server or a client of the present disclosure and is an example of a hardware device to which aspects of the present disclosure may be applied, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 10, the device 1000 includes a computing unit 1001 that can perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
Various components in the device 1000 are connected to the I/O interface 1005, including: an input unit 1006, an output unit 1007, a storage unit 1008, and a communication unit 1009. The input unit 1006 may be any type of device capable of inputting information to the device 1000. The input unit 1006 may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote control. The output unit 1007 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 1008 may include, but is not limited to, magnetic disks and optical disks. The communication unit 1009 allows the device 1000 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 1001 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the various methods and processes described above, such as one or more of the data cleansing method, the neural network training method, and the violation identification method for city management violation images. For example, in some embodiments, one or more of the data cleansing method, the neural network training method, and the violation identification method for city management violation images may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the data cleansing method, the neural network training method, or the violation identification method for city management violation images described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform one or more of the data cleansing method, the neural network training method, and the violation identification method for city management violation images in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
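As one hedged illustration of such program code, the iterative cleaning operation recited in the claims (predict, compare with the initial label, resolve the real label by a preset rule, retrain, repeat) could be sketched in Python roughly as follows. The toy `CentroidModel`, the one-dimensional `feature`, the threshold schedule, the `review` callback standing in for the second preset rule, and the choice to retrain on the seed data plus all standard samples so far are assumptions made only for this sketch; the disclosure does not prescribe them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple


@dataclass
class Sample:
    feature: float                     # toy 1-D feature (assumption)
    initial_label: str                 # possibly noisy initial label
    true_label: Optional[str] = None   # set once the sample is cleaned


class CentroidModel:
    """Hypothetical stand-in for the first classification model."""

    def fit(self, data: Sequence[Tuple[float, str]]) -> None:
        groups: Dict[str, List[float]] = {}
        for x, y in data:
            groups.setdefault(y, []).append(x)
        self.centroids = {y: sum(v) / len(v) for y, v in groups.items()}

    def predict(self, x: float) -> Tuple[str, float]:
        # Rank labels by distance to their centroid; confidence grows as the
        # runner-up centroid gets farther away relative to the best one.
        ranked = sorted(self.centroids, key=lambda y: abs(x - self.centroids[y]))
        if len(ranked) == 1:
            return ranked[0], 1.0
        best = abs(x - self.centroids[ranked[0]])
        second = abs(x - self.centroids[ranked[1]])
        total = best + second
        return ranked[0], (0.5 if total == 0 else second / total)


def clean(model: CentroidModel,
          seed: Sequence[Tuple[float, str]],
          pending: List[Sample],
          thresholds: Sequence[float],
          preset_threshold: float,
          review: Callable[[Sample], str]) -> Tuple[List[Sample], List[Sample]]:
    """Run one cleaning round per dynamic threshold; return (standard, uncleaned)."""
    standard: List[Sample] = []
    for dynamic_threshold in thresholds:       # assumed to decrease round by round
        newly_cleaned: List[Sample] = []
        still_pending: List[Sample] = []
        for s in pending:
            predicted, confidence = model.predict(s.feature)
            if confidence < dynamic_threshold:
                still_pending.append(s)        # left for a later round
                continue
            if predicted == s.initial_label:
                s.true_label = s.initial_label     # labels agree: keep initial label
            elif dynamic_threshold >= preset_threshold:
                s.true_label = predicted           # first preset rule: trust the model
            else:
                s.true_label = review(s)           # second preset rule: inspect the data
            newly_cleaned.append(s)
        standard.extend(newly_cleaned)
        pending = still_pending
        if not pending:
            break
        if newly_cleaned:
            # Retrain before cleaning the remaining samples.
            model.fit(list(seed) + [(s.feature, s.true_label) for s in standard])
    return standard, pending


seed = [(0.0, "a"), (10.0, "b")]
model = CentroidModel()
model.fit(seed)
samples = [Sample(0.5, "a"), Sample(9.5, "a"), Sample(4.0, "b")]
standard, remaining = clean(
    model, seed, samples, thresholds=[0.9, 0.5], preset_threshold=0.8,
    review=lambda s: "a" if s.feature < 5 else "b",
)
print([s.true_label for s in standard], len(remaining))
```

In this toy run, the first two samples are resolved in the first round under the high threshold, and the ambiguous third sample is resolved in the second round through the review callback, after the model has been retrained.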
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved; no limitation is imposed herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in a different order than described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the disclosure.
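For illustration only, the confusion-matrix step used to assemble the data sample set to be cleaned (a trained second classification model is evaluated on a second standard sample set, confusable labels are identified, and initial samples carrying those labels are selected) might be sketched as follows. The function names, the `min_error_rate` cut-off, and the toy labels are hypothetical; the disclosure does not state how "confusable" is quantified.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def confusable_labels(true_labels: Sequence[str],
                      predicted_labels: Sequence[str],
                      min_error_rate: float = 0.2) -> List[str]:
    """Return labels involved in a frequently confused (true, predicted) pair."""
    matrix = Counter(zip(true_labels, predicted_labels))  # sparse confusion matrix
    per_true = Counter(true_labels)                       # row totals
    flagged = set()
    for (t, p), count in matrix.items():
        # Off-diagonal cell whose share of its row reaches the cut-off.
        if t != p and count / per_true[t] >= min_error_rate:
            flagged.update((t, p))
    return sorted(flagged)


def select_samples_to_clean(initial_samples: Sequence[Tuple[str, str]],
                            labels: Sequence[str]) -> List[Tuple[str, str]]:
    """Keep the (sample_id, initial_label) pairs whose label is confusable."""
    wanted = set(labels)
    return [s for s in initial_samples if s[1] in wanted]


# Real labels and second-model predictions on a toy second standard sample set.
real = ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"]
pred = ["A", "A", "B", "B", "B", "B", "B", "A", "C", "C", "C", "C"]
confusable = confusable_labels(real, pred)

initial_set = [("img1", "A"), ("img2", "C"), ("img3", "B")]
to_clean = select_samples_to_clean(initial_set, confusable)
print(confusable, to_clean)
```

Restricting cleaning to samples with confusable labels keeps the set to be cleaned small, since labels the model already separates well are left out.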

Claims (37)

1. A data cleansing method for cleansing a data sample set to be cleaned, the data sample set to be cleaned including a plurality of data samples to be cleaned having initial labels, the data cleansing method comprising:
performing the following cleaning operation on the data sample set to be cleaned by using a trained first classification model:
in response to inputting each data sample to be cleaned in the data sample set to be cleaned into the first classification model, outputting, by the first classification model, a predicted label and a label confidence thereof for each data sample to be cleaned in the data sample set to be cleaned;
acquiring one or more data samples to be cleaned whose predicted label is inconsistent with the initial label;
determining, based on a preset rule, real labels of the one or more data samples to be cleaned whose predicted label is inconsistent with the initial label; and
determining the one or more data samples to be cleaned whose real labels have been determined as first standard data samples;
retraining the first classification model by using the determined one or more first standard data samples, so as to perform the cleaning operation on the remaining data samples to be cleaned in the data sample set by using the retrained first classification model; and
constructing a first standard data sample set based on a plurality of the first standard data samples having real labels.
2. The data cleansing method of claim 1, wherein the acquiring one or more data samples to be cleaned whose predicted label is inconsistent with the initial label comprises:
setting a dynamic threshold,
wherein the label confidence of each of the one or more data samples to be cleaned whose predicted label is inconsistent with the initial label is not less than the dynamic threshold.
3. The data cleansing method according to claim 1, wherein the preset rule includes a first preset rule, and the cleansing operation of the data sample to be cleansed includes:
Setting a preset threshold value;
and determining the real labels of the one or more data samples to be cleaned, wherein the predicted labels are inconsistent with the initial labels, based on a first preset rule, in response to determining that the dynamic threshold is not smaller than the preset threshold.
4. The data cleansing method of claim 3, wherein the determining, based on the first preset rule, the real labels of the one or more data samples to be cleansed for which the predicted labels are inconsistent with the initial labels comprises:
and determining the predicted label of the one or more data samples to be cleaned as the real label.
5. A data cleansing method according to claim 3, wherein the cleansing operation of the data sample to be cleansed further comprises:
in response to determining that the dynamic threshold is less than the preset threshold, determining, based on a second preset rule, a true label of the one or more data samples to be cleaned for which the predicted label is inconsistent with the initial label,
Wherein the second preset rule is different from the first preset rule.
6. The data cleansing method of claim 5, wherein the determining, based on the second preset rule, the real label of the one or more data samples to be cleansed for which the predicted label is inconsistent with the initial label comprises:
acquiring real data characteristics of the one or more data samples to be cleaned, wherein the predicted tag is inconsistent with the initial tag;
based on the corresponding real data characteristics, determining the real labels of the one or more data samples to be cleaned, wherein the predicted labels are inconsistent with the initial labels.
7. The data cleansing method of any of claims 2-6, wherein the dynamic threshold set in a first cleansing operation is greater than the dynamic threshold set in a second cleansing operation for two successive cleansing operations of the set of data samples to be cleansed.
8. The data cleansing method of any of claims 2-6, wherein cleansing operation of the data sample set to be cleansed further comprises:
determining, for one or more data samples to be cleaned whose label confidence is not less than the dynamic threshold and whose predicted label is consistent with the initial label, that the initial label is the real label.
9. The data cleansing method of claim 1, the method further comprising:
training a second classification model with an initial data sample set, the initial data sample set comprising a plurality of initial data samples having initial labels;
inputting a second standard data sample set into a trained second classification model, and obtaining a prediction label of each second standard data sample in the second standard data sample set output by the second classification model, wherein the second standard data sample set comprises a plurality of second standard data samples with real labels;
constructing a confusion matrix based on the corresponding real labels and the corresponding predicted labels of all the second standard data samples in the second standard data sample set;
determining one or more confusable initial labels based on the confusion matrix; and
acquiring, from the initial data sample set, at least part of the initial data samples corresponding to the one or more initial labels, so as to establish the data sample set to be cleaned.
10. A training method of a neural network, wherein the neural network comprises a violation classification model,
The training method comprises the following steps:
acquiring a sample image set to be cleaned of city management violations, wherein the sample image set to be cleaned comprises a plurality of violation sample images having initial violation labels;
cleaning the sample image set to be cleaned by using the data cleansing method of any one of claims 1-9, and determining real violation labels of the plurality of violation sample images, so as to obtain a standard sample image set of city management violations; and
training the violation classification model by using the standard sample image set.
11. The training method of claim 10, the method further comprising:
acquiring case-closed sample images corresponding to each of a plurality of real violation labels, and determining a real case-closed label of each case-closed sample image; and
adding a plurality of case-closed sample images having real case-closed labels to the standard sample image set of city management violations.
12. The training method of claim 11, wherein the neural network further comprises a violation detection model,
The training method further comprises the following steps:
training the violation detection model by using the standard sample image set of city management violations, so that the violation detection model can output, based on an input city management acquisition image, the presence attribute of a real violation label related to the standard sample image set of city management violations,
wherein the presence attribute includes present and absent.
13. A method for identifying violations using a neural network trained using the training method of any of claims 10 to 12, the neural network comprising a violation classification model,
The identification method comprises the following steps:
acquiring a first city management acquisition image aiming at a target scene;
in response to inputting the first city management acquisition image into the violation classification model, outputting, by the violation classification model, a city management violation label corresponding to the first city management acquisition image, wherein the city management violation label comprises violation and non-violation.
14. The identification method of claim 13, the identification method further comprising:
in response to determining that the city management violation label corresponding to the first city management acquisition image output by the violation classification model is a violation, executing case filing for the first city management acquisition image, wherein the case filing comprises recording the city management violation label.
15. The identification method of claim 14, the identification method further comprising:
after the case filing is executed for the first city management acquisition image, acquiring a second city management acquisition image for the target scene;
inputting the second city management acquisition image into the violation classification model; and
in response to determining that the city management violation label of the second city management acquisition image output by the violation classification model is consistent with the city management violation label of the first city management acquisition image, determining not to cancel the case filing for the first city management acquisition image.
16. The identification method of claim 15, wherein the neural network further comprises a violation detection model, the identification method further comprising:
in response to determining that the city management violation label of the second city management acquisition image output by the violation classification model is inconsistent with the city management violation label of the first city management acquisition image, inputting the second city management acquisition image into the violation detection model, and acquiring the presence attribute of the recorded city management violation label output by the violation detection model; and
in response to determining that the presence attribute of the recorded city management violation label is present, determining not to cancel the case filing for the first city management acquisition image.
17. The identification method of claim 16, the method further comprising:
in response to determining that the presence attribute of the recorded city management violation label is absent, determining to cancel the case filing for the first city management acquisition image.
18. A data cleansing apparatus for cleansing a data sample set to be cleansed, the data sample set to be cleansed including a plurality of data samples to be cleansed having an initial tag, the cleansing apparatus comprising:
a first cleaning unit configured to perform a cleaning operation on the data sample set to be cleaned using a trained first classification model, wherein the first cleaning unit comprises:
a prediction subunit configured to output a prediction label and a label confidence thereof for each data sample to be cleaned in the data sample set to be cleaned in response to inputting each data sample to be cleaned in the data sample set to be cleaned into the first classification model;
a first acquisition subunit configured to acquire one or more data samples to be cleaned whose predicted label is inconsistent with the initial label;
a first determining subunit configured to determine, based on a preset rule, real labels of the one or more data samples to be cleaned whose predicted label is inconsistent with the initial label; and
a second determining subunit configured to determine the one or more data samples to be cleaned whose real labels have been determined as first standard data samples;
a first training unit configured to retrain the first classification model using the determined one or more first standard data samples, such that the first cleaning unit performs the cleaning operation on the remaining data samples to be cleaned in the data sample set using the retrained first classification model; and
a first construction unit configured to construct a first standard data sample set based on a plurality of the first standard data samples having real labels.
19. The cleaning apparatus of claim 18, wherein the first acquisition subunit comprises:
a first setting subunit configured to set a dynamic threshold,
wherein the label confidence of each of the one or more data samples to be cleaned whose predicted label is inconsistent with the initial label is not less than the dynamic threshold.
20. The cleaning apparatus of claim 18, wherein the preset rules comprise first preset rules, the first cleaning unit further comprising:
a second setting subunit configured to set a preset threshold;
the first determination subunit is further configured to determine, based on the first preset rule, a true label of the one or more data samples to be cleaned for which the predicted label is inconsistent with the initial label in response to determining that the dynamic threshold is not less than the preset threshold.
21. The cleaning apparatus of claim 20, wherein the first determination subunit is further configured to determine a predicted tag of the one or more data samples to be cleaned as the real tag.
22. The cleaning apparatus of claim 20, wherein the first determination subunit is further configured to determine, based on a second preset rule, a true label of the one or more data samples to be cleaned for which the predicted label is inconsistent with the initial label in response to determining that the dynamic threshold is less than the preset threshold,
Wherein the second preset rule is different from the first preset rule.
23. The cleaning apparatus of claim 22, wherein the first determination subunit comprises:
a second obtaining subunit configured to obtain real data characteristics of the one or more data samples to be cleaned, for which the predictive label is inconsistent with the initial label;
a third determination subunit configured to determine, based on the corresponding real data characteristics, the real labels of the one or more data samples to be cleaned whose predicted label is inconsistent with the initial label.
24. The cleaning apparatus of any one of claims 19-23, wherein, for two successive cleaning operations on the data sample set to be cleaned, the dynamic threshold set in the first cleaning operation is greater than the dynamic threshold set in the second cleaning operation.
25. The cleaning apparatus defined in any one of claims 19-23, wherein the first cleaning unit further comprises:
a fourth determining subunit configured to determine, for one or more data samples to be cleaned whose label confidence is not less than the dynamic threshold and whose predicted label is consistent with the initial label, that the initial label is the real label.
26. The cleaning apparatus of claim 18, the apparatus further comprising:
a second training unit configured to train a second classification model with an initial data sample set including a plurality of initial data samples having initial labels;
a first obtaining unit configured to input a second standard data sample set into a trained second classification model, and obtain a prediction label of each second standard data sample in the second standard data sample set output by the second classification model, wherein the second standard data sample set includes a plurality of second standard data samples with real labels;
a second construction unit configured to construct a confusion matrix based on the corresponding real labels and the corresponding predicted labels of all the second standard data samples in the second standard data sample set;
a first determining unit configured to determine one or more confusable initial labels based on the confusion matrix; and
a second acquisition unit configured to acquire, from the initial data sample set, at least part of the initial data samples corresponding to the one or more initial labels, so as to establish the data sample set to be cleaned.
27. A training apparatus for a neural network, wherein the neural network includes a violation classification model,
The training device comprises:
a third obtaining unit configured to obtain a sample image set to be cleaned of city management violations, wherein the sample image set to be cleaned comprises a plurality of violation sample images having initial violation labels;
a second cleaning unit configured to clean the sample image set to be cleaned by using the data cleansing method of any one of claims 1-9, and determine real violation labels of the plurality of violation sample images, so as to obtain a standard sample image set of city management violations; and
a third training unit configured to train the violation classification model by using the standard sample image set.
28. The training device of claim 27, the device further comprising:
a fourth obtaining unit configured to obtain case-closed sample images corresponding to each of a plurality of real violation labels, and determine a real case-closed label of each case-closed sample image; and
an adding unit configured to add a plurality of case-closed sample images having real case-closed labels to the standard sample image set of city management violations.
29. The training apparatus of claim 28 wherein said neural network further comprises a violation detection model, said apparatus further comprising:
a fourth training unit configured to train the violation detection model using the standard sample image set of city management violations, so that the violation detection model can output, based on an input city management acquisition image, the presence attribute of a real violation label related to the standard sample image set of city management violations,
wherein the presence attribute includes present and absent.
30. A violation identification device based on an urban management violation image, the identification device comprising:
a neural network trained in accordance with the training method of any one of claims 10 to 12, wherein the neural network comprises a violation classification model;
a fifth acquisition unit configured to acquire a first city management captured image for the target scene,
wherein the violation classification model is configured to, in response to the first city management acquisition image being input, output a city management violation label corresponding to the first city management acquisition image, the city management violation label comprising violation and non-violation.
31. The identification device of claim 30, the device further comprising:
a case filing unit configured to execute case filing for the first city management acquisition image in response to determining that the city management violation label corresponding to the first city management acquisition image output by the violation classification model is a violation, wherein the case filing comprises recording the city management violation label.
32. The identification device of claim 31, the device further comprising:
a sixth acquisition unit configured to acquire a second city management acquisition image for the target scene after the case filing is executed for the first city management acquisition image;
an input unit configured to input the second city management acquisition image into the violation classification model; and
a second determining unit configured to determine not to cancel the case filing for the first city management acquisition image in response to determining that the city management violation label of the second city management acquisition image output by the violation classification model is consistent with the city management violation label of the first city management acquisition image.
33. The identification apparatus of claim 32 wherein the neural network further comprises a violation detection model, the apparatus further comprising:
a seventh obtaining unit configured to, in response to determining that the city management violation label of the second city management acquisition image output by the violation classification model is inconsistent with the city management violation label of the first city management acquisition image, input the second city management acquisition image into the violation detection model and obtain the presence attribute of the recorded city management violation label output by the violation detection model; and
a third determining unit configured to determine not to cancel the case filing for the first city management acquisition image in response to determining that the presence attribute of the recorded city management violation label is present.
34. The identification device of claim 33, the device further comprising:
a fourth determining unit configured to determine to cancel the case filing for the first city management acquisition image in response to determining that the presence attribute of the recorded city management violation label is absent.
35. A computer device, comprising:
a memory, a processor and a computer program stored on the memory,
Wherein the processor is configured to execute the computer program to implement the steps of the method of any one of claims 1-17.
36. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the method of any of claims 1-17.
37. A computer program product comprising a computer program, wherein the computer program when executed by a processor implements the steps of the method of any of claims 1-17.
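By way of a non-limiting sketch, the case filing and cancellation flow recited in the identification claims (file a case when the first image is classified as a violation, then use a second image of the same scene to decide whether to keep or cancel it) could look like the following. The callables `classify` and `detect`, the string labels such as "illegal parking", and the returned status strings are hypothetical placeholders for the violation classification model, the violation detection model, and the system's bookkeeping.

```python
from typing import Callable, Dict, Tuple

NON_VIOLATION = "non-violation"  # assumed name for the non-violation label


def review_scene(first_image: str,
                 second_image: str,
                 classify: Callable[[str], str],
                 detect: Callable[[str, str], bool]) -> str:
    """Decide the case status for one target scene from two captured images."""
    recorded_label = classify(first_image)
    if recorded_label == NON_VIOLATION:
        return "no case filed"
    # A case is filed here and the violation label is recorded.
    if classify(second_image) == recorded_label:
        return "case kept"         # same label on the second image
    # Labels disagree: ask the detection model whether the recorded
    # violation is still present in the second image.
    if detect(second_image, recorded_label):
        return "case kept"
    return "case cancelled"


# Stub models backed by lookup tables (assumptions for the demonstration).
labels: Dict[str, str] = {
    "img_a": "illegal parking",
    "img_b": "illegal parking",
    "img_c": NON_VIOLATION,
    "img_d": NON_VIOLATION,
}
presence: Dict[Tuple[str, str], bool] = {
    ("img_c", "illegal parking"): True,   # classifier missed it, detector finds it
    ("img_d", "illegal parking"): False,  # violation no longer present
}


def classify(image: str) -> str:
    return labels[image]


def detect(image: str, label: str) -> bool:
    return presence.get((image, label), False)


results = [
    review_scene("img_a", "img_b", classify, detect),
    review_scene("img_a", "img_c", classify, detect),
    review_scene("img_a", "img_d", classify, detect),
    review_scene("img_c", "img_a", classify, detect),
]
print(results)
```

Consulting the detection model only when the two classification results disagree means the second model acts as a tie-breaker rather than running on every image.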
CN202011592748.7A 2020-12-29 Data cleaning method and device, equipment and storage medium Active CN112579587B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011592748.7A CN112579587B (en) 2020-12-29 Data cleaning method and device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011592748.7A CN112579587B (en) 2020-12-29 Data cleaning method and device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112579587A CN112579587A (en) 2021-03-30
CN112579587B true CN112579587B (en) 2024-07-02

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241903A (en) * 2018-08-30 2019-01-18 平安科技(深圳)有限公司 Sample data cleaning method, device, computer equipment and storage medium
CN110490221A (en) * 2019-07-05 2019-11-22 平安科技(深圳)有限公司 Multi-tag classification method, electronic device and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN112857268B (en) Object area measuring method, device, electronic equipment and storage medium
CN112749758B (en) Image processing method, neural network training method, device, equipment and medium
CN113656587B (en) Text classification method, device, electronic equipment and storage medium
CN114004985B (en) Character interaction detection method, neural network, training method, training equipment and training medium thereof
CN113256583A (en) Image quality detection method and apparatus, computer device, and medium
CN114445667A (en) Image detection method and method for training image detection model
CN116883181B (en) Financial service pushing method based on user portrait, storage medium and server
CN113723305A (en) Image and video detection method, device, electronic equipment and medium
CN116152607A (en) Target detection method, method and device for training target detection model
CN114219046B (en) Model training method, matching method, device, system, electronic equipment and medium
CN112579587B (en) Data cleaning method and device, equipment and storage medium
CN114140547B (en) Image generation method and device
CN114842476A (en) Watermark detection method and device and model training method and device
CN112860681B (en) Data cleaning method and device, computer equipment and medium
CN115359309A (en) Training method, device, equipment and medium of target detection model
CN114445147A (en) Electronic ticket issuing method, electronic ticket issuing device, electronic ticket issuing apparatus, and electronic ticket issuing medium
CN114429678A (en) Model training method and device, electronic device and medium
CN112905743A (en) Text object detection method and device, electronic equipment and storage medium
CN112579587A (en) Data cleaning method and device, equipment and storage medium
CN112784912A (en) Image recognition method and device, and training method and device of neural network model
CN116070711B (en) Data processing method, device, electronic equipment and storage medium
CN115512131B (en) Image detection method and training method of image detection model
CN114140851B (en) Image detection method and method for training image detection model
CN114117046B (en) Data processing method, device, electronic equipment and medium
CN115170536B (en) Image detection method, training method and device of model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240603

Address after: Room 220, 2nd Floor, Building 22, No. A1 Guanghua Road, Tongzhou District, Beijing, 101113

Applicant after: Button Internet (Beijing) Technology Co.,Ltd.

Country or region after: China

Address before: 2/F, Baidu Building, 10 Shangdi 10th Street, Haidian District, Beijing 100085

Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

Country or region before: China

GR01 Patent grant