CN114003591A - Commodity data multi-mode cleaning method and device, equipment, medium and product thereof
- Publication number
- CN114003591A (application CN202111274488.3A)
- Authority
- CN
- China
- Prior art keywords
- commodity data
- label
- clustering
- commodity
- text
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/45—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/48—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The application discloses a multi-modal commodity data cleaning method, together with a corresponding device, equipment, medium and product. The method comprises the following steps: determining a commodity data set, wherein the commodity data set comprises a plurality of commodity data carrying real classification labels, and the real classification labels are leaf nodes in a preset category tree; clustering according to the text characteristic information of each commodity data to determine a text clustering label corresponding to the commodity data, wherein the total number of text clustering labels is equal to the total number of leaf nodes; clustering according to the picture characteristic information of each commodity data to determine a picture clustering label corresponding to the commodity data, wherein the total number of picture clustering labels is equal to the total number of leaf nodes; and performing data cleaning on the commodity data in the commodity data set according to the value information of the text clustering label and the picture clustering label of the same commodity data relative to the real classification label. The method and device can effectively clean massive commodity data across multiple modalities.
Description
Technical Field
The present application relates to the field of e-commerce information technologies, and in particular, to a method and a device for multi-modal cleaning of commodity data, a computer device, a computer-readable storage medium, and a computer program product.
Background
With the development of deep learning, neural network models and their training data keep growing in scale. As the scale grows, the amount of noisy data in the training data also increases. Traditional manual labeling cannot process such large batches of data efficiently, and if the data cannot be cleaned effectively, the noisy data can seriously degrade the training effect of the neural network model.
In the prior art, the training data required by a neural network model is generally cleaned by simple means that detect incomplete data, erroneous data, duplicate data and the like. Such methods are crude and do not fully consider the value of the data to the neural network model, so they do little to improve the value of the training data relative to the model.
For the training data required by model training, and particularly for data carrying supervision labels, whether the corresponding labels are accurate and valid largely determines the training effect of the model and, in particular, its learning capacity, so the labels deserve attention in the data cleaning stage. The solutions proposed by the industry for cleaning labeled training data are mainly based on clustering algorithms: training data whose label information is inconsistent with the clustering result is removed after a simple check against that result. Such approaches improve data quality to a considerable extent, but they are not fine-grained enough and do not clean the data with the model in mind; for multimodal models in particular, simple clustering is hard to apply effectively.
The need for data cleaning is particularly evident in the e-commerce field, where each of a large number of commodity data generally carries its own tag information. When the tag information of the commodity data comes from different sources or is generated according to different standards, how to label the commodity data effectively becomes a significant problem.
In summary, how to clean training data composed of commodity data so that it meets the needs of neural network models and becomes effective training data is worth exploring in the e-commerce field.
Disclosure of Invention
A primary object of the present application is to solve at least one of the above problems and provide a multi-modal merchandise data cleaning method, and a corresponding apparatus, computer device, computer readable storage medium, and computer program product.
To achieve the purposes of the present application, the following technical solution is adopted:
A multi-modal commodity data cleaning method, provided to serve one of the purposes of the present application, comprises the following steps:
determining a commodity data set, wherein the commodity data set comprises a plurality of commodity data carrying real classification labels, and the real classification labels are leaf nodes in a preset category tree;
clustering according to the text characteristic information of each commodity data to determine a text clustering label corresponding to the commodity data, wherein the total number of the text clustering labels is equal to the total number of leaf nodes;
clustering according to the picture characteristic information of each commodity data to determine picture clustering labels corresponding to the commodity data, wherein the total number of the picture clustering labels is equal to the total number of leaf nodes;
and performing data cleaning on the commodity data in the commodity data set according to the value information of the text clustering label and the picture clustering label of the same commodity data relative to the real classification label.
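Both clustering steps above fix the number of clusters to the number of leaf nodes in the category tree. As a minimal sketch, assuming the category tree is held as a nested dictionary in which an empty dictionary marks a leaf (the representation, the `leaf_nodes` helper and the sample categories are illustrative assumptions, not taken from the application), the leaf count could be obtained like this:

```python
def leaf_nodes(tree):
    """Return the leaf category names of a nested-dict category tree.

    Assumption: a category with no children maps to an empty dict.
    """
    leaves = []
    for name, children in tree.items():
        if children:
            leaves.extend(leaf_nodes(children))  # descend into sub-categories
        else:
            leaves.append(name)  # no children: this category is a leaf
    return leaves


# Hypothetical category tree, for illustration only.
category_tree = {
    "clothing": {"dresses": {}, "shirts": {}},
    "electronics": {"phones": {}, "laptops": {"gaming": {}, "ultrabook": {}}},
}

leaves = leaf_nodes(category_tree)
num_clusters = len(leaves)  # used as the cluster count for both modalities
```

With this tree the real classification labels would be drawn from the five leaves, and both the text clustering and the picture clustering would be asked for five clusters.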
In a further embodiment, clustering according to the text characteristic information of each commodity data to determine the text clustering label corresponding to the commodity data, wherein the total number of text clustering labels is equal to the total number of leaf nodes, comprises the following steps:
extracting the text characteristic information of the commodity title of each commodity data in the commodity data set by using a pre-trained text characteristic extraction model;
setting the number of clusters according to the number of leaf nodes, and clustering on the text characteristic vectors of the commodity data to obtain that number of text-based commodity data clusters, wherein each commodity data cluster comprises the commodity data grouped into it;
and, for each text-based commodity data cluster, counting the real classification labels carried by its commodity data and determining the most frequent real classification label as the text clustering label of every commodity data in that commodity data cluster.
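The labeling step can be read as a majority vote inside each cluster. The sketch below assumes that cluster assignments have already been produced by some clustering algorithm (the application text here does not name one) and that they arrive as a list of cluster ids aligned with the list of real classification labels; the function name and sample data are hypothetical. The same routine would apply unchanged to picture-based clusters.

```python
from collections import Counter, defaultdict


def assign_cluster_labels(cluster_ids, real_labels):
    """Give every item the most frequent real label of the cluster it fell into."""
    per_cluster = defaultdict(Counter)
    for cid, label in zip(cluster_ids, real_labels):
        per_cluster[cid][label] += 1  # tally real labels inside each cluster
    majority = {
        cid: counts.most_common(1)[0][0] for cid, counts in per_cluster.items()
    }
    return [majority[cid] for cid in cluster_ids]


# Illustrative data: five items grouped into two clusters.
cluster_ids = [0, 0, 0, 1, 1]
real_labels = ["shoes", "shoes", "bags", "bags", "bags"]
text_cluster_labels = assign_cluster_labels(cluster_ids, real_labels)
```

Here the third item carries the real label "bags" but receives the text clustering label "shoes", which is exactly the kind of disagreement the later cleaning step examines.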
In a further embodiment, clustering according to the picture characteristic information of each commodity data to determine the picture clustering label corresponding to the commodity data, wherein the total number of picture clustering labels is equal to the total number of leaf nodes, comprises the following steps:
extracting the picture characteristic information of the commodity picture of each commodity data in the commodity data set by using a pre-trained picture characteristic extraction model;
setting the number of clusters according to the number of leaf nodes, and clustering on the picture characteristic vectors of the commodity data to obtain that number of picture-based commodity data clusters, wherein each commodity data cluster comprises the commodity data grouped into it;
and, for each picture-based commodity data cluster, counting the real classification labels carried by its commodity data and determining the most frequent real classification label as the picture clustering label of every commodity data in that commodity data cluster.
In a further embodiment, performing data cleaning on the commodity data in the commodity data set according to the value information of the text clustering label and the picture clustering label of the same commodity data relative to the real classification label comprises the following steps:
judging whether the text clustering label and the picture clustering label of the same commodity data are consistent;
when the text clustering label and the picture clustering label of the same commodity data are consistent, determining that both clustering labels have contribution value relative to the real classification label, and resetting the real classification label to the text clustering label or the picture clustering label;
when the text clustering label and the picture clustering label of the same commodity data are inconsistent and one of them is consistent with the real classification label, determining according to a preset condition whether the text clustering label or the picture clustering label has contribution value, and resetting the real classification label of the commodity data that meets the preset condition with that clustering label;
and deleting from the commodity data set the commodity data whose real classification labels have not been reset, thereby realizing cleaning.
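The four steps above amount to a per-item decision: keep and relabel, or delete. The following is a minimal sketch under stated assumptions: each commodity data is a dictionary with `real`, `text` and `picture` label fields, and the preset condition is passed in as a callable `preset_condition(item, modality)`. These names and the sample items are illustrative and do not come from the application.

```python
def clean(items, preset_condition):
    """Return the items whose real label was reset; all others are dropped.

    preset_condition(item, modality) -> bool decides whether the clustering
    label of the given modality ("text" or "picture") has contribution value.
    """
    kept = []
    for item in items:
        text, picture, real = item["text"], item["picture"], item["real"]
        if text == picture:
            new_label = text  # both modalities agree: reset to that label
        elif text == real and preset_condition(item, "text"):
            new_label = text  # only the text label matches the real label
        elif picture == real and preset_condition(item, "picture"):
            new_label = picture  # only the picture label matches the real label
        else:
            continue  # real label not reset: item is cleaned out
        kept.append({**item, "real": new_label})
    return kept


# Illustrative items; the condition here simply trusts the text modality.
items = [
    {"id": 1, "real": "shoes", "text": "bags", "picture": "bags"},
    {"id": 2, "real": "shoes", "text": "shoes", "picture": "bags"},
    {"id": 3, "real": "shoes", "text": "bags", "picture": "hats"},
]
cleaned = clean(items, lambda item, modality: modality == "text")
```

Item 1 is relabeled to "bags" because both clustering labels agree, item 2 keeps "shoes" because the text label matches and the condition holds, and item 3 is deleted because neither clustering label matches the real label.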
In an embodiment, determining according to a preset condition the text clustering label or the picture clustering label that has contribution value, and resetting with it the real classification label of the commodity data that meets the preset condition, comprises the following steps:
counting the average length of the commodity titles of the commodity data in the commodity data set;
when the text clustering label of a commodity data is consistent with its real classification label and the picture clustering label is not, taking as the preset condition that the length of the commodity title of the commodity data exceeds the average length; when the preset condition is met, confirming that the text clustering label has contribution value and resetting the real classification label of the commodity data with the text clustering label;
and when the picture clustering label of a commodity data is consistent with its real classification label and the text clustering label is not, taking as the preset condition that the length of the commodity title of the commodity data is smaller than the average length; when the preset condition is met, confirming that the picture clustering label has contribution value and resetting the real classification label of the commodity data with the picture clustering label.
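The title-length rule can be sketched as a small function. The sketch assumes that title length is measured in characters (the passage above does not say whether characters or tokens are counted), and the function name and sample titles are hypothetical. A title exactly as long as the average satisfies neither condition, so no modality is trusted in that case.

```python
from statistics import mean


def trusted_modality(title, avg_len, text_matches_real, picture_matches_real):
    """Return "text", "picture" or None for an item whose two cluster labels disagree."""
    if text_matches_real and not picture_matches_real and len(title) > avg_len:
        return "text"  # long title: the text clustering label is trusted
    if picture_matches_real and not text_matches_real and len(title) < avg_len:
        return "picture"  # short title: the picture clustering label is trusted
    return None  # preset condition not met: real label is not reset


# Illustrative titles; length is counted in characters (an assumption).
titles = ["red dress", "wireless noise cancelling headphones", "mug"]
avg_len = mean(len(t) for t in titles)

long_title_choice = trusted_modality(titles[1], avg_len, True, False)
short_title_choice = trusted_modality(titles[2], avg_len, False, True)
rejected_choice = trusted_modality(titles[2], avg_len, True, False)
```

The intuition this encodes is that a long title carries enough text for the text clustering label to be reliable, while a short title leaves the picture as the better signal.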
In an extended embodiment, the multi-modal commodity data cleaning method further comprises the following subsequent steps:
after data cleaning is completed, splicing the text characteristic information and the picture characteristic information of each commodity data in the commodity data set to obtain corresponding image-text characteristic information;
and iteratively training a classifier with the image-text characteristic information of each commodity data in the commodity data set, using the real classification label of each commodity data to supervise the classification label predicted for it during iterative training, and updating the weights of the classifier by gradient according to the loss value between the predicted classification label and the real classification label, until the classifier is trained to a convergence state.
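To make the splicing and training steps concrete, here is a toy sketch in plain Python. It assumes that feature splicing means vector concatenation, substitutes a linear softmax classifier trained with a cross-entropy loss for whatever classifier the application intends, and runs a fixed number of epochs instead of testing for convergence. All names, vectors and hyperparameters are illustrative assumptions.

```python
import math
import random


def concat_features(text_vec, picture_vec):
    """Splice the two modality vectors into one image-text feature vector."""
    return list(text_vec) + list(picture_vec)


def logits_for(weights, bias, x):
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b for w, b in zip(weights, bias)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_classifier(features, labels, num_classes, lr=0.5, epochs=200, seed=0):
    """Train a linear softmax classifier by stochastic gradient descent."""
    rng = random.Random(seed)
    dim = len(features[0])
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(dim)] for _ in range(num_classes)]
    bias = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(logits_for(weights, bias, x))
            for c in range(num_classes):
                # Gradient of the cross-entropy loss with respect to logit c.
                grad = probs[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
                bias[c] -= lr * grad
    return weights, bias


def predict(weights, bias, x):
    scores = logits_for(weights, bias, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy feature vectors standing in for the outputs of the two extraction models.
text_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
picture_feats = [[1.0, 0.2], [0.8, 0.0], [0.1, 1.0], [0.0, 0.7]]
labels = [0, 0, 1, 1]  # cleaned real classification labels, as class indices

fused = [concat_features(t, p) for t, p in zip(text_feats, picture_feats)]
weights, bias = train_classifier(fused, labels, num_classes=2)
predictions = [predict(weights, bias, x) for x in fused]
```

In practice the classifier would more likely be a neural network head trained in a deep learning framework; the sketch only shows how the cleaned labels supervise training on the spliced features.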
A multi-modal commodity data cleaning device, provided to serve one of the purposes of the present application, comprises a data determining module, a text clustering module, a picture clustering module and a data cleaning module. The data determining module is used for determining a commodity data set, wherein the commodity data set comprises a plurality of commodity data carrying real classification labels, and the real classification labels are leaf nodes in a preset category tree; the text clustering module is used for clustering according to the text characteristic information of each commodity data to determine a text clustering label corresponding to the commodity data, wherein the total number of text clustering labels is equal to the total number of leaf nodes; the picture clustering module is used for clustering according to the picture characteristic information of each commodity data to determine a picture clustering label corresponding to the commodity data, wherein the total number of picture clustering labels is equal to the total number of leaf nodes; and the data cleaning module is used for performing data cleaning on the commodity data in the commodity data set according to the value information of the text clustering label and the picture clustering label of the same commodity data relative to the real classification label.
In a further embodiment, the text clustering module includes: a text extraction submodule, used for extracting the text characteristic information of the commodity title of each commodity data in the commodity data set by using a pre-trained text characteristic extraction model; a text clustering submodule, used for setting the number of clusters according to the number of leaf nodes and clustering on the text characteristic vectors of the commodity data to obtain that number of text-based commodity data clusters, wherein each commodity data cluster comprises the commodity data grouped into it; and a text labeling submodule, used for counting, for each text-based commodity data cluster, the real classification labels carried by its commodity data, and determining the most frequent real classification label as the text clustering label of every commodity data in that commodity data cluster.
In a further embodiment, the picture clustering module includes: a picture extraction submodule, used for extracting the picture characteristic information of the commodity picture of each commodity data in the commodity data set by using a pre-trained picture characteristic extraction model; a picture clustering submodule, used for setting the number of clusters according to the number of leaf nodes and clustering on the picture characteristic vectors of the commodity data to obtain that number of picture-based commodity data clusters, wherein each commodity data cluster comprises the commodity data grouped into it; and a picture labeling submodule, used for counting, for each picture-based commodity data cluster, the real classification labels carried by its commodity data, and determining the most frequent real classification label as the picture clustering label of every commodity data in that commodity data cluster.
In a further embodiment, the data cleaning module includes: a comparison judgment submodule, used for judging whether the text clustering label and the picture clustering label of the same commodity data are consistent; a priority resetting submodule, used for determining, when the text clustering label and the picture clustering label of the same commodity data are consistent, that both clustering labels have contribution value relative to the real classification label, and resetting the real classification label to the text clustering label or the picture clustering label; a comprehensive resetting submodule, used for determining according to a preset condition, when the text clustering label and the picture clustering label of the same commodity data are inconsistent and one of them is consistent with the real classification label, whether the text clustering label or the picture clustering label has contribution value, and resetting the real classification label of the commodity data that meets the preset condition with that clustering label; and a purification and cleaning submodule, used for deleting from the commodity data set the commodity data whose real classification labels have not been reset, thereby realizing cleaning.
In an embodiment, the comprehensive resetting submodule includes: a title length counting unit, used for counting the average length of the commodity titles of the commodity data in the commodity data set; a text label resetting unit, used for, when the text clustering label of a commodity data is consistent with its real classification label and the picture clustering label is not, taking as the preset condition that the length of the commodity title of the commodity data exceeds the average length, confirming when the preset condition is met that the text clustering label has contribution value, and resetting the real classification label of the commodity data with the text clustering label; and a picture label resetting unit, used for, when the picture clustering label of a commodity data is consistent with its real classification label and the text clustering label is not, taking as the preset condition that the length of the commodity title of the commodity data is smaller than the average length, confirming when the preset condition is met that the picture clustering label has contribution value, and resetting the real classification label of the commodity data with the picture clustering label.
In an extended embodiment, the multi-modal commodity data cleaning device further comprises: a characteristic splicing module, used for splicing, after data cleaning is completed, the text characteristic information and the picture characteristic information of each commodity data in the commodity data set to obtain corresponding image-text characteristic information; and an iterative training module, used for iteratively training a classifier with the image-text characteristic information of each commodity data in the commodity data set, using the real classification label of each commodity data to supervise the classification label predicted for it during iterative training, and updating the weights of the classifier by gradient according to the loss value between the predicted classification label and the real classification label, until the classifier is trained to a convergence state.
A computer device is provided, comprising a central processing unit and a memory, wherein the central processing unit is configured to call and run a computer program stored in the memory to execute the steps of the commodity data multi-mode cleaning method.
A computer-readable storage medium is provided, which stores, in the form of computer-readable instructions, a computer program implementing the commodity data multi-mode cleaning method; when the computer program is called and run by a computer, it executes the steps included in the method.
A computer program product, provided to adapt to another object of the present application, comprises computer programs/instructions which, when executed by a processor, implement the steps of the method described in any of the embodiments of the present application.
Compared with the prior art, the application has the following advantages:
Firstly, when cleaning the commodity data in the commodity data set, the present application accommodates the fact that commodity data contains multi-modal information, which multi-modal training may need. It therefore extracts deep semantic information from the data of each modality in the commodity data and determines a clustering label for each modality from that information, namely the text clustering label and the picture clustering label. It then examines the value information of each modality's clustering label relative to the real classification label originally annotated on the commodity data, and resets the clustering label that has contribution value as the latest real classification label of the corresponding commodity data, thereby correcting that commodity data's label. At the same time, commodity data lacking value is cleaned out of the commodity data set, so that the commodity data set becomes a training data set of better quality.
Secondly, the commodity data set obtained after multi-modal data cleaning can serve as a training data set for a neural network model that performs a classification task. Because the annotation of its classification labels is more accurate, it is suited to training a new classifier instance more efficiently, helps the training of that classifier converge faster, and can improve the prediction accuracy of the classifier.
In addition, because commodity data in the e-commerce field is multi-source and complex, clustering is performed along the two dimensions of text and pictures and data cleaning is carried out on that basis. This takes into account both the complexity of the commodity data and the operating efficiency of the cleaning process, so the technical solution can contribute significant value to the preparation of training data sets in the e-commerce field.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow diagram of an exemplary embodiment of a merchandise data multimodal cleaning method of the present application;
FIG. 2 is a schematic flowchart of a process of acquiring text characteristic information of commodity data in an embodiment of the present application;
FIG. 3 is a schematic flowchart of a process of acquiring picture characteristic information of commodity data in an embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating a data cleansing process implemented in an embodiment of the present application;
FIG. 5 is a schematic block diagram of a process for re-deciding a real category label according to the average length of a product title in the embodiment of the present application;
FIG. 6 is a flowchart illustrating a process of training a classifier using a commodity data set after data cleaning according to an embodiment of the present application;
FIG. 7 is a functional block diagram of the merchandise data multimodal cleaning apparatus of the present application;
FIG. 8 is a schematic structural diagram of a computer device used in the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As will be appreciated by those skilled in the art, "client," "terminal," and "terminal device" as used herein include both devices that are wireless signal receivers, which are devices having only wireless signal receivers without transmit capability, and devices that are receive and transmit hardware, which have receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: cellular or other communication devices such as personal computers, tablets, etc. having single or multi-line displays or cellular or other communication devices without multi-line displays; PCS (Personal Communications Service), which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "client," "terminal device" can be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. The "client", "terminal Device" used herein may also be a communication terminal, a web terminal, a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a Mobile phone with music/video playing function, and may also be a smart tv, a set-top box, and the like.
The hardware referred to by the names "server", "client", "service node", etc. is essentially an electronic device with the performance of a personal computer, and is a hardware device having necessary components disclosed by the von neumann principle such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device, an output device, etc., a computer program is stored in the memory, and the central processing unit calls a program stored in an external memory into the internal memory to run, executes instructions in the program, and interacts with the input and output devices, thereby completing a specific function.
It should be noted that the concept of "server" as referred to in this application can be extended to the case of a server cluster. According to the network deployment principle understood by those skilled in the art, the servers are divided logically; in physical space, they may be independent of each other and callable through an interface, or may be integrated into one physical computer or one set of computer clusters. Those skilled in the art will appreciate this variation, which should not be taken to restrict the implementation of the network deployment of the present application.
One or more technical features of the present application, unless expressly specified otherwise, may be deployed to a server for implementation by a client remotely invoking an online service interface provided by a capture server for access, or may be deployed directly and run on the client for access.
Unless expressly specified otherwise, a neural network model referred to, or possibly referred to, in the application can be deployed on a remote server and called remotely by a client, or can be deployed on a client with sufficient equipment capability and called directly.
Unless expressly specified otherwise, the various data referred to in the present application may be stored remotely on a server or locally on a terminal device, as long as the data is suitable for being called by the technical solution of the present application.
Those skilled in the art will understand that, although the various methods of the present application are described based on the same concept and therefore share common features, they may be performed independently unless otherwise specified. Likewise, the embodiments disclosed in the present application are proposed based on the same inventive concept; therefore, concepts that are expressed identically, and concepts whose expressions differ only by adaptations made for convenience, should be understood as equivalent.
Unless a mutually exclusive relationship between the related technical features is expressly stated, the embodiments disclosed herein can be flexibly constructed by cross-combining the related technical features of the embodiments, as long as the combination does not depart from the inventive spirit of the present application and can meet needs left unmet by the prior art or remedy deficiencies of the prior art. Those skilled in the art will appreciate such variations.
The commodity data multi-mode cleaning method can be programmed into a computer program product and is realized by being deployed in a service cluster to operate, so that the method can be executed by accessing an open interface after the computer program product operates and performing man-machine interaction with the computer program product through a graphical user interface.
The application scenario described by way of example in the present application relates to classification task training in the e-commerce field. The e-commerce field needs to classify commodity objects against a given standard according to their commodity data, including but not limited to commodity titles, commodity pictures, commodity details, commodity attributes and the like, and a commodity data set composed of prepared commodity data is used to train a classifier. The technical solution of the present application can therefore be used to clean the data of this commodity data set and obtain an effective purified training set; the classifier trained on the purified training set is then used as the classifier required for production, so that it can accurately classify the commodity data in the e-commerce platform according to the standard.
The classification performed by the classifier may be, for example, classification of security attributes of the product, classification of the product mapped to a category tree of the e-commerce platform, or the like, and different classification tasks may be determined according to different classification criteria and correspondingly trained. Accordingly, the classification labels in the commodity data set are also the classification labels corresponding to the classification standards, so that the classifier can implement supervised learning according to the classification labels and learn the corresponding classification capability after learning. In this regard, those skilled in the art will appreciate.
Referring to fig. 1, the multi-modal merchandise data cleaning method of the present application, in an exemplary embodiment thereof, includes the following steps:
step S1100, determining a commodity data set, wherein the commodity data set comprises a plurality of commodity data carrying real classification labels, and the real classification labels are leaf nodes in a preset category tree:
Firstly, a commodity data set is prepared. It comprises the commodity data required for training an exemplary neural network model of the application, and each commodity data carries a classification label with which a mapping relation has been established in advance, referred to as its real classification label. The real classification labels provide the supervision labels required for supervised learning, so as to supervise the training process of the neural network model and enable the model to be trained to a convergence state.
The commodity data is adapted to the exemplary application scenario of the present application and includes specific data of multiple modalities, such as commodity titles and commodity pictures. The information content it contains is determined according to the specific needs of the neural network model; for example, it may further include data such as commodity details and commodity attributes.
The commodity data can be acquired from a commodity database of an e-commerce platform or from a website which is open to freely acquire data in a network, and can be flexibly selected by a person skilled in the art.
The real classification label of the commodity data in the commodity data set can be determined as the classification label of that commodity data in a preset category tree constructed by the source e-commerce platform website of the commodity data set. The category tree generally includes a plurality of levels. Every node except the root node is subordinate to a parent node, and a parent node can have a plurality of child nodes; the child nodes located at the bottom layer of the category tree are the leaf nodes. Each commodity data is pre-assigned a classification path, which runs from the root node of the category tree through intermediate nodes at each level to a terminal leaf node. The classification path therefore comprises nodes at multiple levels, and each commodity data has a corresponding leaf node. Because leaf nodes in the category tree are generally unique, the leaf node of a commodity data can be used as the real classification label of that commodity data.
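The relationship between the category tree, classification paths and leaf nodes can be illustrated with a minimal sketch. The nested-dictionary representation, the category names and the helper `leaf_paths` are hypothetical and chosen only for illustration; the application does not prescribe any particular data structure.

```python
from typing import Dict, List, Tuple

# Hypothetical category tree: nested dicts, where an empty dict marks a leaf node.
CATEGORY_TREE: Dict[str, dict] = {
    "root": {
        "clothing": {"dresses": {}, "t-shirts": {}},
        "electronics": {"phones": {}, "laptops": {}},
    }
}


def leaf_paths(tree: Dict[str, dict], prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return every root-to-leaf classification path in the tree."""
    paths: List[Tuple[str, ...]] = []
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            # Intermediate node: descend into its child nodes.
            paths.extend(leaf_paths(children, path))
        else:
            # Leaf node: the path ends here.
            paths.append(path)
    return paths


paths = leaf_paths(CATEGORY_TREE)
leaf_labels = [p[-1] for p in paths]  # leaf node names act as real classification labels
K = len(leaf_labels)                  # total number of leaf nodes
```

In this sketch `K` is the quantity that the later clustering steps use as the preset number of clusters.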
The real classification label of the commodity data can come from the source e-commerce platform website of the commodity data, or can be generated by a pre-trained classification model suitable for the present application. Such a classification model classifies according to deep semantic information of the commodity title and/or the commodity picture of the commodity data, and assigns the commodity data to a leaf node of the preset category tree. Naturally, this classification is only a preliminary classification performed to obtain the real classification label of the commodity data. The classification model referred to here is a neural network model: for example, deep semantic information may be extracted from the commodity title by using BERT, ELECTRA, ALBERT, RoBERTa, etc., and from the commodity picture by using ResNet, EfficientNet, etc.; a classifier then maps the deep semantic information to the leaf nodes of the category tree, thereby determining the real classification label corresponding to the commodity data.
For convenience of explanation, in the present application, the real classification label carried by the commodity data in the commodity data set is denoted T_label.
Step S1200, clustering is carried out according to the text characteristic information of each commodity data to determine a text clustering label corresponding to the commodity data, and the total number of the text clustering labels is equal to the total number of leaf nodes:
To meet the clustering requirement that precedes data cleaning of the commodity data set, deep semantic information of the corresponding commodity data needs to be provided for clustering. In view of the diversity of the specific data contained in the commodity data, the data is divided into two types according to its information modality. One type is text data, which includes but is not limited to the commodity title, commodity details, commodity attributes and the like. The other type is picture data, which includes but is not limited to commodity pictures, advertisement pictures, comment pictures and the like; the commodity pictures include both the default picture that represents the commodity by default and the other detail pictures displayed on the detail page. Data of these two modalities can be handled by different processing means when clustering is carried out.
The step is mainly responsible for clustering according to text type data in the commodity data. These text type data may be the entire text type data in the product data, or may be one of them, for example, for convenience of description in the present exemplary embodiment, only the product title in the product data is selected as the source data required for clustering.
Clustering on the commodity title is performed based on the deep semantic information of the commodity title. It is therefore necessary to extract the deep semantic information of the commodity title of each commodity data, i.e. its text feature information, which is usually expressed in vector form. The text feature information can be extracted with a pre-trained text feature extraction model, including but not limited to optional neural network models such as BERT, ELECTRA, ALBERT and RoBERTa, any of which can extract deep semantic information from the commodity title in the commodity data to obtain the corresponding text feature information. After the text feature information corresponding to the commodity title of each commodity data is obtained, it can be stored in the commodity data set in association with the corresponding commodity data, as extension data to be referenced during clustering.
After the text feature information of each commodity data is obtained, the commodity data may be clustered based on the text feature information by means of any known clustering algorithm, including but not limited to the k-means clustering algorithm, the mean shift clustering algorithm, the DBSCAN clustering algorithm, expectation-maximization (EM) clustering using a Gaussian Mixture Model (GMM), hierarchical clustering algorithms, the spectral clustering algorithm, and the like, all of which are known to those skilled in the art. In this embodiment, the spectral clustering algorithm is recommended as the preferred choice. Spectral clustering is a widely used clustering algorithm; compared with the traditional k-means algorithm, it adapts better to the data distribution, gives an excellent clustering effect, requires a much smaller amount of computation and, just as valuably, is not complicated to implement. The basic principle of the various clustering algorithms is to calculate similarity information, or distance information, among the deep semantic information of the commodity data; commodity data that are close in distance or similar are regarded as data of the same class, and commodity data that are far apart or dissimilar are regarded as data of different classes, so that a plurality of corresponding classes is determined.
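As a rough illustration of the similarity information mentioned above, the following sketch builds a pairwise cosine-similarity matrix over feature vectors. Cosine similarity is an assumption made here for illustration; the application does not fix a particular similarity or distance measure, and the function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def affinity_matrix(features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise similarity between all items; larger values mean 'more alike'."""
    n = len(features)
    return [[cosine_similarity(features[i], features[j]) for j in range(n)]
            for i in range(n)]


# Toy feature vectors standing in for the text feature information of three titles.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
A = affinity_matrix(features)
```

A matrix of this kind is the usual input to a spectral clustering routine, which then groups items whose mutual similarity is high.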
Since the commodity data already has the real classification labels corresponding to the leaf nodes, the total number of the leaf nodes already determines the classification number of all the commodity data in the commodity data set, and the clustering number when clustering is performed based on the text is also equal to the total number of the leaf nodes for the data cleaning requirement, so that the clustering number can be preset as the total number of the leaf nodes distributed by the commodity data in the commodity data set during clustering.
After the commodity data in the commodity data set is clustered based on the text feature information by a clustering algorithm, a plurality of clustering labels corresponding to the total number of leaf nodes is obtained. These clustering labels are the text clustering labels and, for convenience of subsequent description, are denoted V_label. It is understood that each commodity data in the commodity data set obtains a corresponding text clustering label after text clustering based on the commodity titles.
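The idea of fixing the number of clusters to the total number of leaf nodes K can be sketched as follows. Note that this sketch deliberately swaps in a plain k-means loop, written with the standard library only, in place of the spectral clustering that the embodiment recommends; the function names, the farthest-point initialisation and the toy data are all assumptions for illustration.

```python
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Sequence[float]], k: int, iterations: int = 20) -> List[int]:
    """Assign each point to one of k clusters and return the cluster index per point."""
    # Deterministic farthest-point initialisation (an illustrative choice).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))

    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        assignment = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy feature vectors; K stands in for the total number of leaf nodes.
K = 3
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0], [10.0, 0.2]]
assignment = kmeans(features, K)
```

The cluster indices returned here are arbitrary identifiers; turning them into text clustering labels requires a further step that maps each cluster to a classification label.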
Generally, according to statistical principles, the text clustering label of most commodity data may be consistent with its real classification label, while the text clustering label of a small portion of commodity data may correspond to a different real classification label. Conversely, among the commodity data under one text clustering label, the real classification label of most commodity data may be consistent with the text clustering label, but it cannot be excluded that the real classification labels of a small portion of commodity data are inconsistent with the text clustering label and are more scattered. This difference also illustrates the necessity of data cleaning of the commodity data in the commodity data set.
Step S1300, clustering is carried out according to the picture characteristic information of each commodity data to determine a picture clustering label corresponding to the commodity data, and the total number of the picture clustering labels is equal to the total number of leaf nodes:
in the same way as the clustering of the text characteristic information of the commodity title based on the commodity data in the previous step, the clustering of the information of the other modality, namely the commodity picture, is performed in the step.
Clustering on the commodity pictures is performed based on the deep semantic information of the commodity pictures. It is therefore necessary to extract the deep semantic information of the commodity picture of each commodity data, i.e. its picture feature information, which is usually expressed in vector form. The picture feature information can be extracted with a pre-trained picture feature extraction model, including but not limited to optional neural network models such as ResNet and EfficientNet, any of which can extract deep semantic information from the commodity picture in the commodity data to obtain the corresponding picture feature information. After the picture feature information corresponding to the commodity picture of each commodity data is obtained, it can be stored in the commodity data set in association with the corresponding commodity data, as extension data to be referenced during clustering.
After the picture feature information of each commodity data is obtained, the commodity data may be clustered based on the picture feature information by means of any known clustering algorithm, including but not limited to the k-means clustering algorithm, the mean shift clustering algorithm, the DBSCAN clustering algorithm, expectation-maximization (EM) clustering using a Gaussian Mixture Model (GMM), hierarchical clustering algorithms, the spectral clustering algorithm, and the like, all of which are known to those skilled in the art. In this embodiment, the spectral clustering algorithm is likewise recommended as the preferred choice. The basic principle and application of the various clustering algorithms are described in the previous step and are not repeated here.
The same as the previous step, since the commodity data already has the real classification label corresponding to the leaf node, the total number of the leaf nodes already determines the classification number of all the commodity data in the commodity data set, and the clustering number when clustering is performed based on the picture is also equal to the total number of the leaf nodes for the data cleaning requirement, so that the clustering number can be preset as the total number of the leaf nodes distributed by the commodity data in the commodity data set during clustering.
After the commodity data in the commodity data set is clustered based on the picture feature information by a clustering algorithm, a plurality of clustering labels corresponding to the total number of leaf nodes is obtained. These clustering labels are the picture clustering labels and, for convenience of subsequent explanation, are denoted C_label. It is understood that each commodity data in the commodity data set obtains a corresponding picture clustering label after clustering based on the commodity pictures.
Similarly, according to statistical principles, the picture clustering label of most commodity data may be consistent with its real classification label, while the picture clustering label of a small portion of commodity data may correspond to a different real classification label. Conversely, among the commodity data under one picture clustering label, the real classification label of most commodity data may be consistent with the picture clustering label, but it cannot be excluded that the real classification labels of a small portion of commodity data are inconsistent with the picture clustering label and are more scattered. This difference also illustrates the necessity of data cleaning of the commodity data in the commodity data set.
Step S1400, according to the value information of the text clustering label and the picture clustering label of the same commodity data relative to the real classification label, data cleaning is carried out on the commodity data in the commodity data set:
as described above, after the text clustering label and the picture clustering label are obtained by clustering based on the texts and pictures in the commodity data, both of them may be consistent with or inconsistent with the real classification label of the commodity data itself, and further, the situation of consistency or inconsistency may also occur between the text clustering label and the picture clustering label in the same commodity data.
The basic principle of the value information is to examine the credibility of the text clustering label and/or the picture clustering label relative to the real classification label. If the credibility of the former is higher than that of the latter, the clustering result is the more credible one, and the former can replace the latter, which corrects the real classification label of the commodity data. If the credibility of the former is lower than that of the latter, the combined information from all aspects cannot determine an effective classification label for the commodity data with high credibility, and in this case the commodity data as a whole can be deleted from the commodity data set.
For the mining of the value information, preset conditions can be set according to a preset strategy, and the mining is realized by comparing the consistency of the text clustering label, the picture clustering label and the real classification label of the commodity data. For example, V_label = C_label = T_label can be set as a preset condition. When this condition is satisfied, the text clustering label and the picture clustering label are regarded as having contribution value relative to the real classification label, so the original real classification label can be retained, or the text clustering label and the picture clustering label can be used to replace the real classification label of the commodity data (the three labels are essentially consistent). Conversely, V_label ≠ C_label ≠ T_label can be set as a preset condition. When this condition is satisfied, the text clustering label and the picture clustering label have no contribution value relative to the original real classification label, and the classification label of the commodity data cannot be determined anew, so the commodity data can be deleted from the commodity data set.
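The two exemplary preset conditions can be sketched as a simple filter. This is only an illustration under stated assumptions: the record layout (dictionaries with `T_label`, `V_label` and `C_label` keys) and the function name `clean` are hypothetical, the condition V_label ≠ C_label ≠ T_label is read here as "all three labels pairwise different", and records matching neither condition are simply kept unchanged, since the exemplary embodiment does not specify their handling.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def clean(dataset: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Split the data set into (kept, removed) using the two exemplary conditions."""
    kept: List[Record] = []
    removed: List[Record] = []
    for item in dataset:
        t, v, c = item["T_label"], item["V_label"], item["C_label"]
        if v == c == t:
            # Both clustering labels confirm the real label: retain the record.
            kept.append(item)
        elif v != c and c != t and v != t:
            # All three labels disagree (assumed reading): delete the record.
            removed.append(item)
        else:
            # Mixed cases are not covered by the exemplary conditions;
            # keeping them unchanged is an assumption of this sketch.
            kept.append(item)
    return kept, removed


dataset = [
    {"id": "a", "T_label": "dresses", "V_label": "dresses", "C_label": "dresses"},
    {"id": "b", "T_label": "dresses", "V_label": "phones", "C_label": "laptops"},
    {"id": "c", "T_label": "dresses", "V_label": "phones", "C_label": "phones"},
]
kept, removed = clean(dataset)
```

Record "c", where the two clustering labels agree with each other but not with the real label, is the kind of mixed case that finer-grained preset conditions would need to address.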
In addition to the above exemplary manner, in other embodiments other factors may be introduced when setting the preset conditions, in order to sort out the complex relationships among the labels and refine the determination of the value information. For example, length information of the commodity title may be introduced to measure whether the text clustering label has a higher value than the picture clustering label. Such variants are disclosed further in later embodiments of the present application and are not elaborated in the present embodiment. In summary, according to the disclosure of this embodiment, a person skilled in the art can determine, from the relationship of the text clustering label and the picture clustering label to the real classification label, what value information each label carries for the commodity data, including whether it contributes or not, and can then make a cleaning decision according to the value information and clean the corresponding commodity data by corresponding means.
In the commodity data set after data cleaning, commodity data that easily causes classification confusion has been deleted, and the real classification labels of commodity data that were obviously mislabeled have been corrected, so that the real classification labels of the commodity data form more effective supervision labels.
Through this exemplary embodiment, it can be seen that there are many advantages to this application, for example:
Firstly, when cleaning the commodity data in the commodity data set, the application accommodates the fact that the commodity data contains multi-modal information and that this information may be needed for multi-modal training. It therefore extracts deep semantic information from the specific data of each modality in the commodity data and determines a clustering label for each modality from that information, such as the text clustering label and the picture clustering label. It then examines the value information of each modality's clustering label relative to the real classification label originally annotated in the commodity data. A clustering label that contributes relative to the real classification label is set as the new real classification label of the corresponding commodity data, which corrects the label of that commodity data; commodity data lacking value is cleaned out of the commodity data set. As a result, the commodity data set becomes a training data set of better quality.
Secondly, the commodity data set obtained after cleaning the data of multiple modalities can be used as the training data set of a neural network model that performs a classification task. Because the annotation information of the classification labels is more accurate, the data set is suited to training a new classifier instance more efficiently, helps the training process converge more quickly, and can improve the prediction accuracy of the classifier.
In addition, in the e-commerce field the information in commodity data comes from many sources and is complex. Clustering along the two dimensions of text and picture, and cleaning the data on that basis, takes into account both the complexity of the commodity data and the operating efficiency of the cleaning process, so the technical solution can contribute significant value to the preparation of training data sets in the e-commerce field.
Referring to fig. 2, in a further embodiment, the step S1200 performs clustering according to the text feature information of each commodity data to determine the text clustering label corresponding to the commodity data, where the total number of the text clustering labels is equal to the total number of leaf nodes, and includes the following steps:
step S1210, extracting the text feature information of the commodity title of each commodity data in the commodity data set by adopting a pre-trained text feature extraction model:
In this embodiment, in order to obtain the text feature information of the commodity data in the commodity data set, the commodity title in the commodity data is preferably used as the extraction target, and the neural network model for extracting the text feature information of the commodity title is preferably a pre-trained BERT model. The commodity titles of the commodity data in the commodity data set are input into the BERT model, the corresponding text feature information is obtained one by one, and each piece of text feature information is stored in association with the corresponding commodity data.
Step S1220, setting a classification number required for clustering according to the number of leaf nodes, and clustering with the text feature vectors of the commodity data to obtain a plurality of commodity data clusters based on texts corresponding to the classification number, where each commodity data cluster includes a plurality of correspondingly clustered commodity data:
in this embodiment, a spectral clustering algorithm is adopted, the number of classes required for clustering is set to be the same as the total number K of leaf nodes, then distance information is calculated for text feature vectors of each commodity data in the commodity data set, and then K commodity data clusters are correspondingly obtained, wherein each commodity data cluster comprises a plurality of correspondingly clustered commodity data. That is, all the commodity data in the commodity data set are clustered based on the text feature information and then are dispersedly divided into K commodity data clusters. It is understood that all the commodity data in the same commodity data cluster, in theory, represent that they are classified into the same class based on a certain characteristic.
Step S1230, for each text-based commodity data cluster, counting the real classification labels carried by its commodity data, and determining the real classification label with the largest count as the text clustering label of every commodity data in that commodity data cluster:
In order to determine the clustering label corresponding to each commodity data cluster, namely the text clustering label, the common characteristics of the commodity data in the same commodity data cluster can be considered. Specifically, the real classification labels carried by the commodity data in the same commodity data cluster are counted. One real classification label will have the largest count, meaning that it covers the most commodity data in the current commodity data cluster, so the text clustering label of that part of the commodity data is naturally set to this real classification label. In order to unify the clustering label of the current commodity data cluster, the other commodity data in the cluster, which carry different real classification labels, are assumed to share the common characteristic of the cluster, and their text clustering label is also set to the real classification label with the largest count. That is, the real classification label with the largest count in a commodity data cluster is set as the text clustering label of all the commodity data in that cluster. In this way, the text clustering labels of the commodity data in every commodity data cluster can be marked.
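The majority-vote labelling described in this step can be sketched as follows. The function name `cluster_labels` and the toy inputs are hypothetical; the same logic would apply unchanged to picture-based clusters.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence


def cluster_labels(cluster_ids: Sequence[Hashable], true_labels: Sequence[str]) -> List[str]:
    """Give every item the most frequent real classification label of its cluster."""
    # Collect the real classification labels found in each cluster.
    by_cluster: Dict[Hashable, List[str]] = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster[cid].append(label)

    # Pick the label with the largest count per cluster.
    # On a tie, Counter.most_common returns the label encountered first.
    majority = {cid: Counter(labels).most_common(1)[0][0]
                for cid, labels in by_cluster.items()}

    # Every item inherits the majority label of its cluster.
    return [majority[cid] for cid in cluster_ids]


# Toy example: cluster 0 holds two "dresses" and one "t-shirts"; cluster 1 holds two "phones".
cluster_ids = [0, 0, 0, 1, 1]
T_labels = ["dresses", "dresses", "t-shirts", "phones", "phones"]
V_labels = cluster_labels(cluster_ids, T_labels)
```

In the toy example the third item, whose real classification label is "t-shirts", receives the clustering label "dresses", which is exactly the kind of disagreement the cleaning step later inspects.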
The embodiment elaborates the process of clustering the commodity data to generate the text clustering labels of each commodity data based on the text characteristic information by using the preset algorithm, so that the text clustering labels are marked for the commodity data by the clustering algorithm, the common characteristic of the commodity data based on the text type data is embodied, a decision basis required by data cleaning can be formed, and a decision value is provided for the data cleaning process.
Referring to fig. 3, in a further embodiment, the step S1300 performs clustering according to the picture feature information of each commodity data to determine the picture cluster labels corresponding to the commodity data, where the total number of the picture cluster labels is equal to the total number of the leaf nodes, and includes the following steps:
step S1310, extracting the picture feature information of the commodity picture of each commodity data in the commodity data set by using a pre-trained picture feature extraction model:
In this embodiment, in order to obtain the picture feature information of the commodity data in the commodity data set, the commodity picture in the commodity data, in particular the default picture, is preferably used as the extraction target, and the neural network model for extracting the picture feature information of the commodity picture is preferably a pre-trained ResNet model. The commodity pictures of the commodity data in the commodity data set are input into the ResNet model, the corresponding picture feature information is obtained one by one, and each piece of picture feature information is stored in association with the corresponding commodity data.
Step S1320, setting the classification number required by clustering according to the leaf node number, clustering by using the picture characteristic vector of the commodity data, and obtaining a plurality of commodity data clusters based on pictures corresponding to the classification number, wherein each commodity data cluster comprises a plurality of correspondingly clustered commodity data:
in this embodiment, a spectral clustering algorithm is adopted, the number of classifications required for clustering is set to be the same as the total number K of leaf nodes, then distance information is calculated for picture feature vectors of each commodity data in a commodity data set, and then K commodity data clusters are correspondingly obtained, wherein each commodity data cluster comprises a plurality of correspondingly clustered commodity data. That is, all the commodity data in the commodity data set are clustered based on the picture characteristic information and then are dispersedly divided into K commodity data clusters. It is understood that all the commodity data in the same commodity data cluster, in theory, represent that they are classified into the same class based on a certain characteristic.
Step S1330, for each picture-based commodity data cluster, counting the real classification labels carried by its commodity data, and determining the real classification label with the largest count as the picture clustering label of each commodity data in the commodity data cluster:
To determine the cluster label of each commodity data cluster, namely the picture clustering label, the common features of the commodity data in the same commodity data cluster can be considered. Specifically, the real classification labels carried by the commodity data in the same commodity data cluster are counted. One real classification label will have the largest count, meaning that most commodity data in the current cluster carry it, so the picture clustering label of those commodity data is naturally set to that real classification label. To unify the clustering label of the current commodity data cluster, the picture clustering label of the other commodity data in the cluster, which carry different real classification labels, is also set by default to the real classification label with the largest count, on the basis of the features shared within the cluster. That is, the real classification label with the largest count in a commodity data cluster is set as the picture clustering label of all commodity data in that cluster. In this way, the picture clustering labels of the commodity data of every commodity data cluster can be marked.
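The majority-count labelling described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `assign_cluster_labels` and the sample data are hypothetical, and the tie-breaking behaviour (the label seen first wins) is a choice of this sketch, not something the source specifies.

```python
from collections import Counter, defaultdict

def assign_cluster_labels(cluster_ids, true_labels):
    """For every item, return the most frequent real label within its cluster."""
    labels_by_cluster = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        labels_by_cluster[cid].append(label)
    # Majority real label per cluster; ties go to the label encountered first.
    majority = {
        cid: Counter(labels).most_common(1)[0][0]
        for cid, labels in labels_by_cluster.items()
    }
    return [majority[cid] for cid in cluster_ids]

# Five commodity data items, clustered into two picture-based clusters.
cluster_ids = [0, 0, 0, 1, 1]
true_labels = ["shoes", "shoes", "hats", "hats", "hats"]
picture_cluster_labels = assign_cluster_labels(cluster_ids, true_labels)
# The third item carries "hats" but sits in a cluster dominated by "shoes",
# so its picture clustering label becomes "shoes".
```

The same routine applies unchanged to text-based clusters, since only the cluster assignment differs.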
The embodiment elaborates the process of clustering the commodity data based on the picture characteristic information by using the preset algorithm to generate the picture clustering labels of each commodity data, and it can be seen that the picture clustering labels are marked for the commodity data by the clustering algorithm, so that the common characteristic of the commodity data based on the picture type data is embodied, a decision basis required by data cleaning can be formed, and a decision value is provided for the data cleaning process.
Referring to fig. 4, in a further embodiment, in step S1400, the data cleaning is performed on the commodity data in the commodity data set according to the value information of the text clustering label and the picture clustering label of the same commodity data relative to the real classification label, and the method includes the following steps:
Step S1410, performing consistency judgment on the text clustering label and the picture clustering label of the same commodity data:
To determine the value information of the text clustering label and the picture clustering label of the commodity data relative to the real classification label, the two clustering labels can first be compared with each other to see whether they are the same label.
Step S1420, when it is determined that the text clustering label and the picture clustering label of the same commodity data are consistent, determining that the two clustering labels have contribution value with respect to the real classification label, and resetting the real classification label to be the text clustering label or the picture clustering label:
When the text clustering label V_label and the picture clustering label C_label of the same commodity data are consistent, i.e. V_label = C_label, the real classification label T_label of the commodity data can be reset to the text clustering label V_label or the picture clustering label C_label, i.e. T_label = V_label or T_label = C_label, regardless of whether the real classification label was previously consistent with them. It is understood that the reset here covers both the case V_label = C_label = T_label and the case V_label = C_label ≠ T_label. In an alternative embodiment, for the case V_label = C_label = T_label, the reset operation may be skipped because it would not substantially change the real classification label; this is completely equivalent to the present embodiment.
Step S1430, when the text clustering label and the picture clustering label of the same commodity data are judged to be inconsistent and one of them is consistent with the real classification label, determining the text clustering label or the picture clustering label with contribution value according to a preset condition, and resetting the real classification label of the commodity data meeting the preset condition by using that text clustering label or picture clustering label:
When the text clustering label V_label and the picture clustering label C_label of the same commodity data are inconsistent, i.e. V_label ≠ C_label, it is necessary, in order to avoid false cleaning, to further examine the consistency between each clustering label and the real classification label, that is, to determine whether the text clustering label is consistent with the real classification label and whether the picture clustering label is consistent with the real classification label. If either holds, a person skilled in the art can in principle decide, according to the credibility of the corresponding commodity title and commodity picture in the commodity data, whether to use the text clustering label or the picture clustering label to reset the real classification label of the commodity data.
The criterion for the credibility of the commodity title and the commodity picture can be implemented by a person skilled in the art as a preset condition. For example, when the text clustering label is consistent with the real classification label, i.e. T_label = V_label, and the picture clustering label is inconsistent with the real classification label, i.e. T_label ≠ C_label, the preset condition may be whether the commodity title of the corresponding commodity data reaches a preset length. When the commodity title reaches the preset length, the text clustering label is determined to contain value information with contribution value, so the text clustering label can be used to reset the real classification label; otherwise, the commodity data is regarded as an object to be deleted. As another example, when the picture clustering label is consistent with the real classification label, i.e. T_label = C_label, and the text clustering label is inconsistent with the real classification label, i.e. T_label ≠ V_label, the preset condition may be whether the commodity picture of the corresponding commodity data reaches a preset resolution. When the commodity picture reaches the preset resolution, the picture clustering label is determined to contain value information with contribution value, so the picture clustering label can be used to reset the real classification label; otherwise, the commodity data is regarded as an object to be deleted.
Step S1440, deleting the commodity data of which the real classification label is not reset from the commodity data set to realize cleaning:
After the above processing, for every commodity data whose value information includes contribution value, the real classification label has been reset (including the case regarded as equivalent to being reset), and where the reset changed the label, the real classification label has thereby been corrected. All other commodity data in the commodity data set, whose real classification labels have not been reset, theoretically do not contribute to training and can therefore be deleted from the commodity data set entirely. This completes the data cleaning and turns the commodity data set into a purified training set.
In the embodiment, more detailed analysis is further performed according to the consistency of the text clustering labels, the picture clustering labels and the real classification labels of the commodity data, the granularity of data cleaning is refined, the real classification labels of part of the commodity data are corrected by correctly analyzing the relation among the labels, and some useless commodity data are deleted, so that the detailed cleaning of the commodity data set is realized, the quality of the commodity data in the commodity data set is further improved, and a great contribution is made to the improvement of the prediction accuracy of the classification model.
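The decision flow of steps S1410 to S1440 can be sketched as below. This is a minimal illustration, not the applicant's implementation: the dictionary keys, the function names `resolve_label` and `clean`, and the thresholds `MIN_TITLE_LENGTH` and `MIN_RESOLUTION` are hypothetical stand-ins for the "preset length" and "preset resolution" mentioned above, whose actual values the source leaves open.

```python
from typing import Optional

MIN_TITLE_LENGTH = 10  # hypothetical preset title length, in characters
MIN_RESOLUTION = 300   # hypothetical preset resolution, in pixels

def resolve_label(item: dict) -> Optional[str]:
    """Return the label to reset to, or None if the item should be deleted."""
    t, v, c = item["t_label"], item["v_label"], item["c_label"]
    if v == c:
        return v  # S1420: both clustering labels agree
    if v == t and len(item["title"]) >= MIN_TITLE_LENGTH:
        return v  # S1430: text label matches and the title is credible
    if c == t and item["resolution"] >= MIN_RESOLUTION:
        return c  # S1430: picture label matches and the picture is credible
    return None   # S1440: no reset happened, so the item is removed

def clean(dataset):
    """Keep only items whose real label was reset, applying the reset."""
    cleaned = []
    for item in dataset:
        new_label = resolve_label(item)
        if new_label is not None:
            cleaned.append({**item, "t_label": new_label})
    return cleaned

dataset = [
    {"id": 1, "title": "mens leather boots", "resolution": 800,
     "t_label": "hats", "v_label": "shoes", "c_label": "shoes"},
    {"id": 2, "title": "wool winter hat grey", "resolution": 800,
     "t_label": "hats", "v_label": "hats", "c_label": "shoes"},
    {"id": 3, "title": "hat", "resolution": 800,
     "t_label": "hats", "v_label": "hats", "c_label": "shoes"},
    {"id": 4, "title": "hat", "resolution": 800,
     "t_label": "hats", "v_label": "shoes", "c_label": "hats"},
    {"id": 5, "title": "assorted item bundle", "resolution": 800,
     "t_label": "hats", "v_label": "shoes", "c_label": "bags"},
]
cleaned = clean(dataset)
```

In this toy data, item 1 is relabelled from "hats" to "shoes", items 2 and 4 keep their labels, and items 3 and 5 are deleted: item 3 because its title is too short to be trusted, item 5 because neither clustering label agrees with the real label.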
Referring to fig. 5, in an embodiment, in the step S1430, the text cluster label or the picture cluster label with contribution value is determined according to the preset condition, and the real classification label of the commodity data meeting the preset condition is reset according to the text cluster label or the picture cluster label, including the following steps:
Step S1431, counting the average length of the commodity titles of the commodity data in the commodity data set:
In this embodiment, when determining according to a preset condition whether the value information of the text clustering label or the picture clustering label relative to the real classification label contains contribution value, the preset condition can rely on a single information basis, which simplifies the logic for judging the value information and improves the decision efficiency of data cleaning.
For this reason, the length of the commodity title is used as the judgment basis of the preset condition in this embodiment. Specifically, the character length of each commodity title in the commodity data set is obtained, and the sum of these character lengths is divided by the total number of commodity data in the commodity data set to give the average length of the commodity titles.
Step S1432, regarding a case that the text cluster tag of the commodity data is consistent with the real classification tag and the picture cluster tag is inconsistent with the real classification tag, taking that the length of the commodity title of the commodity data exceeds the average length as a preset condition, and when the preset condition is satisfied, determining that the text cluster tag has a contribution value, and resetting the real classification tag of the commodity data by using the text cluster tag:
In this step, for the case where the text cluster tag of the commodity data is consistent with the real classification tag and the picture cluster tag is inconsistent with the real classification tag, it is determined whether the length of the commodity title of the commodity data exceeds the average length. A title longer than the average indicates that the commodity title contains more information, so the preset condition is satisfied. In other words, when the picture cluster tag disagrees with the real classification tag but the commodity title behind the text cluster tag is information-rich, the text cluster tag is considered credible and is determined to have contribution value, and the real classification tag of the commodity data is reset by the text cluster tag.
Step S1433, regarding a case that the picture cluster tag of the commodity data is consistent with the real classification tag and the text cluster tag of the commodity data is inconsistent with the real classification tag, taking that the length of the commodity title of the commodity data is smaller than the average length as a preset condition, and when the preset condition is satisfied, determining that the picture cluster tag has a contribution value, and resetting the real classification tag of the commodity data by using the picture cluster tag:
In this step, for the case where the picture cluster tag of the commodity data is consistent with the real classification tag and the text cluster tag is inconsistent with the real classification tag, it is determined whether the length of the commodity title of the commodity data is smaller than the average length. A title shorter than the average indicates that the commodity title contains less information and that the information provided by the commodity picture is more reliable, so the preset condition is satisfied. The picture cluster tag is therefore determined to have contribution value, and the real classification tag of the commodity data is reset by the picture cluster tag.
In the embodiment, the average length of the commodity title of the commodity data is used as the control threshold value in the preset condition for deciding the effectiveness of the value information of the text clustering label and the picture clustering label, so that the implementation logics of the preset condition under different conditions are unified, the judgment strategy of the value information is simplified, and the decision efficiency can be improved. Because the length of the commodity title also contains the information contribution degree, the decision of the related label is carried out according to the average length of the commodity title, and the method is helpful for effectively determining whether the commodity data needs to update the real classification label in the data cleaning process, so that the method has a great contribution effect on improving the labeling quality of the commodity data and filtering invalid commodity data.
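Steps S1431 to S1433 can be sketched as follows, assuming character counts as the title length, as described above. The function names and dictionary keys are hypothetical. The source does not say what happens when a title length equals the average, so in this sketch neither condition is met in that case.

```python
def average_title_length(titles):
    """S1431: sum of character lengths divided by the number of items."""
    return sum(len(title) for title in titles) / len(titles)

def reset_by_title_length(item, avg_len):
    """Return the clustering label used for the reset, or None if no reset."""
    t, v, c = item["t_label"], item["v_label"], item["c_label"]
    n = len(item["title"])
    if v == t and c != t and n > avg_len:
        return v  # S1432: long title, trust the text clustering label
    if c == t and v != t and n < avg_len:
        return c  # S1433: short title, trust the picture clustering label
    return None   # preset condition not satisfied

items = [
    {"title": "red shoe men", "t_label": "shoes", "v_label": "shoes", "c_label": "hats"},
    {"title": "shoe", "t_label": "shoes", "v_label": "hats", "c_label": "shoes"},
    {"title": "red shoe", "t_label": "shoes", "v_label": "shoes", "c_label": "hats"},
]
avg_len = average_title_length([item["title"] for item in items])
decisions = [reset_by_title_length(item, avg_len) for item in items]
```

The titles have 12, 4 and 8 characters, so the average is 8. The first item passes the text-label condition, the second passes the picture-label condition, and the third, whose title is exactly average, satisfies neither and would be left for deletion.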
Referring to fig. 6, in an expanded embodiment, the multi-modal merchandise data cleaning method further includes the following steps:
Step S1500, performing feature splicing on the text feature information and the picture feature information of each commodity data in the commodity data set after data cleaning is completed to obtain corresponding image-text feature information:
the commodity data set after the data cleaning is finished can be directly used as a training data set of a classifier for executing a classification task, and after the commodity data is processed through the processes disclosed by the embodiments, the corresponding text characteristic information and the corresponding picture characteristic information are obtained from the commodity data, so that the deep semantic information can be directly utilized for training the classifier.
Therefore, the text characteristic information and the picture characteristic information of each commodity data can be normalized into high-dimensional vectors with the same dimensionality, and then the text characteristic information and the picture characteristic information are spliced to obtain corresponding image-text characteristic information, the image-text characteristic information integrates the commodity title of the commodity data and deep semantic information of the commodity picture, and is a result of representing and learning the commodity title and the commodity picture, so that the image-text characteristic information can be used as a basis for a classifier to perform classification after being fully connected.
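The splicing step can be illustrated with a short sketch. The source does not say which normalization is used, so L2 normalization is an assumption here, as are the function names; the sketch also assumes the two vectors have already been brought to the same dimensionality.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (assumed normalization; zero vectors pass through)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def splice_features(text_vec, picture_vec):
    """Concatenate normalized text and picture features into one image-text vector."""
    if len(text_vec) != len(picture_vec):
        raise ValueError("text and picture features must have the same dimension")
    return l2_normalize(text_vec) + l2_normalize(picture_vec)

# Toy 2-dimensional features; real features would have hundreds of dimensions.
fused = splice_features([3.0, 4.0], [0.0, 2.0])
```

Normalizing before concatenation keeps either modality from dominating the spliced vector purely because of its scale, which is one plausible reason for the normalization step described above.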
Step S1600, performing iterative training on the classifier by using the image-text feature information of each commodity data in the commodity data set, wherein the real classification label of each commodity data is used to supervise the classification label predicted for that commodity data during the iterative training, and the weights of the classifier are updated by gradient according to the loss value between the predicted classification label and the real classification label, until the classifier is trained to a convergence state:
When the classifier is trained, the commodity data in the commodity data set are used in sequence to train the classifier iteratively. In each iteration, the classifier calculates, from the image-text feature information, the classification probability of each classification label in the classification space and determines the classification label with the maximum classification probability as the predicted classification label. The loss value between the predicted classification label and the real classification label is then calculated, with the real classification label of the cleaned commodity data serving as the supervision label, and the weights of the classifier are updated by gradient according to the loss value. The classification space used by the classifier is the set of all real classification labels in the commodity data set, which is also the classification space formed by the leaf nodes. A cross-entropy loss function is sufficient as the loss function for calculating the loss value.
During the iterative training, after each commodity data has been used for one training pass and its loss value has been calculated, it is judged whether the loss value approaches zero or has reached a preset threshold. When this condition is met, the classifier is considered to have been trained to a convergence state and training can stop; otherwise the classifier has not converged, and the next commodity data is used to continue the iterative training so that the classifier keeps approaching convergence.
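The training loop can be sketched with a small softmax classifier written in plain Python. This is an illustration under stated assumptions, not the classifier of the application: the single linear layer, the learning rate, the loss threshold, the toy features and all function names are hypothetical, and convergence is checked on the mean loss of a full pass rather than after every single commodity data, as a simplification.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scores(weights, bias, features):
    """Linear logits, one per class."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def train(samples, num_classes, lr=0.5, loss_threshold=0.05, max_epochs=1000):
    dim = len(samples[0][0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    bias = [0.0] * num_classes
    mean_loss = float("inf")
    for _ in range(max_epochs):
        total_loss = 0.0
        for features, target in samples:
            probs = softmax(scores(weights, bias, features))
            total_loss += -math.log(probs[target])  # cross-entropy loss
            for k in range(num_classes):
                # Gradient of cross-entropy w.r.t. logit k is p_k - y_k.
                grad = probs[k] - (1.0 if k == target else 0.0)
                for j in range(dim):
                    weights[k][j] -= lr * grad * features[j]
                bias[k] -= lr * grad
        mean_loss = total_loss / len(samples)
        if mean_loss < loss_threshold:  # treated as the convergence state
            break
    return weights, bias, mean_loss

def predict(weights, bias, features):
    """Index of the class with the maximum classification probability."""
    logits = scores(weights, bias, features)
    return max(range(len(logits)), key=logits.__getitem__)

# Toy image-text features with cleaned real labels (class indices 0 and 1).
samples = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
weights, bias, final_loss = train(samples, num_classes=2)
predictions = [predict(weights, bias, features) for features, _ in samples]
```

In practice the class indices would correspond to the leaf nodes of the category tree, and a deep learning framework would replace the hand-written gradient update.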
In this embodiment, the commodity data set cleaned by the present application is used as a training data set for training the classifier, and since the commodity data set is the purified data, the classifier trained by the commodity data set has higher prediction accuracy, and since the real classification label of each commodity data is also corrected, the training efficiency of the classifier is also significantly improved. When the classifier trained to the convergence state is put into the production stage, the classification label corresponding to the commodity data can be determined with high accuracy according to the text characteristic information of the commodity title and the picture characteristic information of the commodity picture in the commodity data, so that the intelligent automatic classification of the commodity data can be realized.
Referring to fig. 7, a merchandise data multi-modal cleaning apparatus suitable for one of the purposes of the present application is provided, which includes: the system comprises a data determining module 1100, a text clustering module 1200, a picture clustering module 1300 and a data cleaning module 1400, wherein the data determining module 1100 is used for determining a commodity data set, the commodity data set comprises a plurality of commodity data carrying real classification labels, and the real classification labels are leaf nodes in a preset category tree; the text clustering module 1200 is configured to perform clustering according to text feature information of each commodity data to determine a text clustering label corresponding to the commodity data, where the total number of the text clustering labels is equal to the total number of leaf nodes; the picture clustering module 1300 is configured to perform clustering according to the picture characteristic information of each commodity data to determine a picture clustering label corresponding to the commodity data, where the total number of the picture clustering labels is equal to the total number of the leaf nodes; the data cleaning module 1400 is configured to perform data cleaning on the commodity data in the commodity data set according to the value information of the text clustering label and the picture clustering label of the same commodity data relative to the real classification label.
In a further embodiment, the text clustering module 1200 includes: the text extraction submodule is used for extracting the text characteristic information of the commodity title of each commodity data in the commodity data set by adopting a pre-trained text characteristic extraction model; the text clustering submodule is used for setting the classification number required by clustering according to the number of leaf nodes, clustering is carried out according to the text characteristic vector of the commodity data, a plurality of commodity data clusters corresponding to the classification number and based on texts are obtained, and each commodity data cluster comprises a plurality of correspondingly clustered commodity data; and the text labeling submodule is used for counting the maximum number of real classification labels owned by the commodity data for each commodity data cluster based on the text, and determining the maximum number of real classification labels as the text clustering label of each commodity data in the commodity data cluster.
In a further embodiment, the image clustering module 1300 includes: the picture extraction submodule is used for extracting the picture characteristic information of the commodity picture of each commodity data in the commodity data set by adopting a pre-trained picture characteristic extraction model; the image clustering submodule is used for setting the classification number required by clustering according to the number of leaf nodes, clustering is carried out according to the image characteristic vector of the commodity data, a plurality of commodity data clusters based on images corresponding to the classification number are obtained, and each commodity data cluster comprises a plurality of correspondingly clustered commodity data; and the picture labeling submodule is used for counting the maximum number of real classification labels owned by the commodity data for each commodity data cluster based on the picture, and determining the maximum number of real classification labels as the picture clustering labels of each commodity data in the commodity data cluster.
In a further embodiment, the data cleansing module 1400 includes: the comparison judgment submodule is used for carrying out consistency judgment on the text clustering label and the picture clustering label of the same commodity data; the priority resetting sub-module is used for determining that the two clustering labels have contribution values relative to the real classifying label when the text clustering label and the picture clustering label of the same commodity data are consistent, and resetting the real classifying label to be the text clustering label or the picture clustering label; the comprehensive resetting sub-module is used for determining a text clustering label or a picture clustering label with contribution value according to a preset condition when the text clustering label and the picture clustering label of the same commodity data are not consistent and one of the text clustering label and the picture clustering label is consistent with the real classification label, and resetting the real classification label of the commodity data meeting the preset condition; and the purification and cleaning submodule is used for deleting the commodity data of which the real classification labels are not reset from the commodity data set so as to realize cleaning.
In an embodiment, the comprehensive reset submodule includes: a title length counting unit for counting the average length of the commodity titles of the commodity data in the commodity data set; a text label resetting unit which, for the case that the text clustering label of the commodity data is consistent with the real classification label and the picture clustering label is inconsistent with the real classification label, takes the length of the commodity title of the commodity data exceeding the average length as the preset condition and, when the preset condition is satisfied, determines that the text clustering label has contribution value and resets the real classification label of the commodity data with it; and a picture label resetting unit which, for the case that the picture clustering label of the commodity data is consistent with the real classification label and the text clustering label is inconsistent with the real classification label, takes the length of the commodity title of the commodity data being smaller than the average length as the preset condition and, when the preset condition is satisfied, determines that the picture clustering label has contribution value and resets the real classification label of the commodity data with it.
In an extended embodiment, the multi-modal merchandise data cleaning device further comprises: a feature splicing module for performing feature splicing on the text feature information and the picture feature information of each commodity data in the commodity data set after data cleaning is finished, so as to obtain corresponding image-text feature information; and an iterative training module for performing iterative training on the classifier by using the image-text feature information of each commodity data in the commodity data set, wherein the real classification label of each commodity data is used to supervise the classification label predicted for that commodity data during the iterative training, and the weights of the classifier are updated by gradient according to the loss value between the predicted classification label and the real classification label until the classifier is trained to a convergence state.
In order to solve the technical problem, an embodiment of the present application further provides a computer device, whose internal structure is schematically illustrated in fig. 8. The computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected by a system bus. The computer-readable storage medium of the computer device stores an operating system, a database and computer-readable instructions; the database can store control information sequences, and the computer-readable instructions, when executed by the processor, enable the processor to implement the commodity data multi-mode cleaning method. The processor of the computer device provides calculation and control capability and supports the operation of the whole computer device. The memory of the computer device may store computer-readable instructions that, when executed by the processor, cause the processor to perform the commodity data multi-mode cleaning method of the present application. The network interface of the computer device is used for connecting to and communicating with a terminal. Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In this embodiment, the processor is configured to execute the specific functions of each module and its sub-modules in fig. 7, and the memory stores the program codes and the various data required for executing those modules or sub-modules. The network interface is used for data transmission to and from a user terminal or a server. The memory in this embodiment stores the program codes and data necessary for executing all modules and sub-modules of the commodity data multi-mode cleaning device of the present application, and the server can call these program codes and data to execute the functions of all the sub-modules.
The present application further provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the commodity data multi-mode cleaning method according to any of the embodiments of the present application.
The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the method as described in any of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments of the present application can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when the computer program is executed, the processes of the embodiments of the methods can be included. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
In summary, the method and the device can perform effective data cleaning on massive commodity data based on multiple modes to obtain a high-quality training set, so that a classifier of a classification task trained by the method and the device can obtain a classification prediction effect with high accuracy.
Those of skill in the art will appreciate that the various operations, methods, steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or eliminated. Further, other steps, measures, or schemes in various operations, methods, or flows that have been discussed in this application can be alternated, altered, rearranged, broken down, combined, or deleted. Further, steps, measures, schemes in the prior art having various operations, methods, procedures disclosed in the present application may also be alternated, modified, rearranged, decomposed, combined, or deleted.
The foregoing describes only some of the embodiments of the present application. It should be noted that those skilled in the art can make several modifications and refinements without departing from the principle of the present application, and these modifications and refinements should also be regarded as falling within the protection scope of the present application.
Claims (10)
1. A commodity data multi-mode cleaning method is characterized by comprising the following steps:
determining a commodity data set, wherein the commodity data set comprises a plurality of commodity data carrying real classification labels, and the real classification labels are leaf nodes in a preset category tree;
clustering according to the text characteristic information of each commodity data to determine a text clustering label corresponding to the commodity data, wherein the total number of the text clustering labels is equal to the total number of leaf nodes;
clustering according to the picture characteristic information of each commodity data to determine picture clustering labels corresponding to the commodity data, wherein the total number of the picture clustering labels is equal to the total number of leaf nodes;
and performing data cleaning on the commodity data in the commodity data set according to the value information of the text clustering label and the picture clustering label of the same commodity data relative to the real classification label.
2. The multi-modal cleaning method for the commodity data according to claim 1, wherein the text clustering labels corresponding to the commodity data are determined by clustering according to the text characteristic information of each commodity data, and the total number of the text clustering labels is equal to the total number of the leaf nodes, comprising the following steps:
extracting text characteristic information of a commodity title of each commodity data in the commodity data set by adopting a pre-trained text characteristic extraction model;
setting the number of classes required for clustering according to the number of leaf nodes, and clustering on the text characteristic vectors of the commodity data to obtain a plurality of text-based commodity data clusters corresponding to that number, wherein each commodity data cluster comprises the commodity data clustered into it;
and, for each text-based commodity data cluster, counting the real classification labels carried by the commodity data in the cluster, and determining the real classification label with the largest count as the text clustering label of each commodity data in that commodity data cluster.
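The labeling step at the end of claim 2 can be sketched as a per-cluster majority vote. This is an illustrative sketch only: the function name `assign_cluster_labels`, the sample labels, and the tie-breaking behaviour are assumptions, and the cluster indices are taken as given from whatever clustering run produced them.

```python
from collections import Counter, defaultdict

def assign_cluster_labels(cluster_ids, real_labels):
    """Map every item to the most frequent real label inside its cluster.

    cluster_ids[i] is the cluster index of item i (from any clustering run);
    real_labels[i] is the leaf-node category the item currently carries.
    """
    votes = defaultdict(Counter)
    for cid, label in zip(cluster_ids, real_labels):
        votes[cid][label] += 1
    # most_common(1) yields the label with the highest count; on a tie the
    # label seen first wins, which is an arbitrary choice made for this sketch.
    majority = {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}
    return [majority[cid] for cid in cluster_ids]

# Hypothetical data: cluster 0 holds two "shoes" and one "bags" item.
cluster_ids = [0, 0, 0, 1, 1]
real_labels = ["shoes", "shoes", "bags", "bags", "bags"]
text_cluster_labels = assign_cluster_labels(cluster_ids, real_labels)
print(text_cluster_labels)
```

The third item carries the real label "bags" but receives the cluster label "shoes", which is exactly the kind of disagreement the later cleaning step inspects.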
3. The multi-modal cleaning method for commodity data according to claim 1, wherein the step of clustering according to the picture characteristic information of each commodity data to determine the picture clustering label corresponding to the commodity data, the total number of the picture clustering labels being equal to the total number of leaf nodes, comprises the following steps:
extracting the picture characteristic information of the commodity picture of each commodity data in the commodity data set by adopting a pre-trained picture characteristic extraction model;
setting the number of classes required for clustering according to the number of leaf nodes, and clustering on the picture characteristic vectors of the commodity data to obtain a plurality of picture-based commodity data clusters corresponding to that number, wherein each commodity data cluster comprises the commodity data clustered into it;
and, for each picture-based commodity data cluster, counting the real classification labels carried by the commodity data in the cluster, and determining the real classification label with the largest count as the picture clustering label of each commodity data in that commodity data cluster.
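The clustering step shared by claims 2 and 3 fixes the number of clusters to the number of leaf nodes. The claims do not name a clustering algorithm, so the sketch below assumes plain k-means (Lloyd's algorithm) as one common choice; the function `kmeans`, the first-k initialisation, and the toy feature vectors are all hypothetical.

```python
import math

def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm; returns one cluster index per vector.

    Initial centroids are simply the first k vectors, a simplification
    chosen to keep this sketch deterministic.
    """
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

leaf_nodes = ["shoes", "bags"]  # leaves of a hypothetical category tree
picture_features = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
cluster_ids = kmeans(picture_features, k=len(leaf_nodes))
print(cluster_ids)
```

The same call works for text feature vectors; only the input vectors change, while `k` stays tied to the leaf-node count.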
4. The multi-modal cleaning method for commodity data according to claim 1, wherein the step of performing data cleaning on the commodity data in the commodity data set according to the value information of the text clustering label and the picture clustering label of the same commodity data relative to the real classification label comprises the following steps:
carrying out consistency judgment on the text clustering label and the picture clustering label of the same commodity data;
when the text clustering label and the picture clustering label of the same commodity data are judged to be consistent, determining that the two clustering labels have contribution values relative to the real classification label, and resetting the real classification label to be the text clustering label or the picture clustering label;
when the text clustering label and the picture clustering label of the same commodity data are judged to be inconsistent and one of the text clustering label and the picture clustering label is consistent with the real classification label, determining the text clustering label or the picture clustering label with contribution value according to a preset condition, and resetting the real classification label of the commodity data meeting the preset condition;
and deleting the commodity data of which the real classification labels are not reset from the commodity data set to realize cleaning.
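The decision logic of claim 4 can be sketched as a single pass over the data set. This is a minimal illustration under assumptions: the dictionary keys `real`, `text`, and `picture`, the function name `clean`, and the `preset_condition` callback are hypothetical, and the preset condition itself is left pluggable because claim 4 does not define it.

```python
def clean(items, preset_condition):
    """Return the items that survive cleaning, with their labels reset.

    Each item is a dict with hypothetical keys 'real', 'text', 'picture'.
    preset_condition(item, modality) decides whether the cluster label of
    the given modality ('text' or 'picture') has contribution value.
    """
    kept = []
    for item in items:
        real, text, picture = item["real"], item["text"], item["picture"]
        if text == picture:
            # Both modalities agree: adopt the shared cluster label.
            kept.append({**item, "real": text})
        elif text == real and preset_condition(item, "text"):
            kept.append({**item, "real": text})
        elif picture == real and preset_condition(item, "picture"):
            kept.append({**item, "real": picture})
        # Anything else was never reset, so it is dropped from the set.
    return kept

items = [
    {"id": 1, "real": "bags", "text": "shoes", "picture": "shoes"},  # agree
    {"id": 2, "real": "bags", "text": "bags", "picture": "hats"},    # text matches
    {"id": 3, "real": "bags", "text": "shoes", "picture": "hats"},   # no match
]
# Stand-in condition for the demo: trust the text modality only.
cleaned = clean(items, lambda item, modality: modality == "text")
print([(it["id"], it["real"]) for it in cleaned])
```

Item 1 is relabeled to the label both clusterings agree on, item 2 keeps its label because the stand-in condition accepts the text modality, and item 3 is deleted.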
5. The multi-modal cleaning method for commodity data according to claim 4, wherein the step of determining the text clustering label or the picture clustering label with contribution value according to a preset condition, and resetting accordingly the real classification label of the commodity data meeting the preset condition, comprises the following steps:
counting the average length of the commodity titles of the commodity data in the commodity data set;
for the case where the text clustering label of the commodity data is consistent with the real classification label and the picture clustering label of the commodity data is inconsistent with the real classification label, taking the length of the commodity title of the commodity data exceeding the average length as the preset condition, and, when the preset condition is met, confirming that the text clustering label has contribution value and resetting the real classification label of the commodity data with the text clustering label;
and, for the case where the picture clustering label of the commodity data is consistent with the real classification label and the text clustering label of the commodity data is inconsistent with the real classification label, taking the length of the commodity title of the commodity data being smaller than the average length as the preset condition, and, when the preset condition is met, confirming that the picture clustering label has contribution value and resetting the real classification label of the commodity data with the picture clustering label.
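The title-length rule of claim 5 can be sketched as a small predicate. The function name `has_contribution_value` and the sample titles are hypothetical, and measuring length in characters is an assumption, since the claim does not say whether length means characters or tokens.

```python
def has_contribution_value(title, average_length, modality):
    """Title-length rule: long titles favour text, short ones favour pictures."""
    if modality == "text":
        return len(title) > average_length
    if modality == "picture":
        return len(title) < average_length
    raise ValueError(f"unknown modality: {modality}")

# Hypothetical commodity titles; the average is taken over the whole data set.
titles = ["red canvas sneakers size 42", "bag", "leather wallet"]
average_length = sum(len(t) for t in titles) / len(titles)

print(has_contribution_value(titles[0], average_length, "text"))     # long title
print(has_contribution_value(titles[1], average_length, "picture"))  # short title
print(has_contribution_value(titles[2], average_length, "text"))     # below average
```

Read literally, a title whose length equals the average satisfies neither condition, so such an item would not be reset in either case.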
6. The multi-modal cleaning method for commodity data according to any one of claims 1 to 5, further comprising the following subsequent steps:
performing characteristic splicing on the text characteristic information and the picture characteristic information of each commodity data in the commodity data set after data cleaning is completed to obtain corresponding image-text characteristic information;
and performing iterative training on a classifier with the image-text characteristic information of each commodity data in the commodity data set, using the real classification label of each commodity data to supervise the classification label predicted for that commodity data during the iterative training, and performing gradient updating on the weights of the classifier according to the loss value between the predicted classification label and the real classification label until the classifier is trained to a convergence state.
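The two steps of claim 6, feature splicing followed by supervised training with gradient updates, can be sketched as follows. The claim does not specify the classifier or the loss, so a linear softmax classifier with cross-entropy loss is assumed here; `softmax`, `train`, `predict`, the learning rate, and the toy features are hypothetical, and a fixed epoch count stands in for a real convergence check.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit a linear softmax classifier by per-sample gradient descent."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for x, y in zip(features, labels):
            probs = softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])
            epoch_loss += -math.log(probs[y])  # cross-entropy vs. the real label
            # Gradient for class c weights: (p_c - 1[c == y]) * x
            for c in range(num_classes):
                error = probs[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    weights[c][j] -= lr * error * x[j]
        losses.append(epoch_loss / len(features))
    return weights, losses

def predict(weights, x):
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)

text_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
picture_features = [[0.8], [1.0], [0.1], [0.0]]
# Feature splicing: concatenate the text and picture vectors of each item.
fused = [t + p for t, p in zip(text_features, picture_features)]
labels = [0, 0, 1, 1]  # cleaned real labels, encoded as class indices
weights, losses = train(fused, labels, num_classes=2)
print(losses[0] > losses[-1], [predict(weights, x) for x in fused])
```

In practice the stopping rule would monitor the loss rather than run a fixed number of epochs; the fixed count only keeps the sketch short.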
7. A multi-modal cleaning device for commodity data, characterized by comprising:
the data determining module is used for determining a commodity data set, wherein the commodity data set comprises a plurality of commodity data carrying real classification labels, and the real classification labels are leaf nodes in a preset category tree;
the text clustering module is used for clustering according to the text characteristic information of each commodity data to determine a text clustering label corresponding to the commodity data, and the total number of the text clustering labels is equal to the total number of the leaf nodes;
the picture clustering module is used for clustering according to the picture characteristic information of each commodity data to determine picture clustering labels corresponding to the commodity data, and the total number of the picture clustering labels is equal to the total number of the leaf nodes;
and the data cleaning module is used for carrying out data cleaning on the commodity data in the commodity data set according to the value information of the text clustering label and the picture clustering label of the same commodity data relative to the real classification label.
8. A computer device comprising a central processor and a memory, characterized in that the central processor is adapted to invoke execution of a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implemented according to the method of any one of claims 1 to 6, which, when invoked by a computer, performs the steps comprised by the corresponding method.
10. A computer program product comprising computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method as claimed in any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111274488.3A CN114003591A (en) | 2021-10-29 | 2021-10-29 | Commodity data multi-mode cleaning method and device, equipment, medium and product thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111274488.3A CN114003591A (en) | 2021-10-29 | 2021-10-29 | Commodity data multi-mode cleaning method and device, equipment, medium and product thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114003591A (en) | 2022-02-01 |
Family
ID=79925306
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111274488.3A Pending CN114003591A (en) | 2021-10-29 | 2021-10-29 | Commodity data multi-mode cleaning method and device, equipment, medium and product thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114003591A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117763360A (en) * | 2024-02-22 | 2024-03-26 | 杭州光云科技股份有限公司 | Training set rapid analysis method based on deep neural network and electronic equipment |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117763360A (en) * | 2024-02-22 | 2024-03-26 | 杭州光云科技股份有限公司 | Training set rapid analysis method based on deep neural network and electronic equipment |
CN117763360B (en) * | 2024-02-22 | 2024-07-12 | 杭州光云科技股份有限公司 | Training set rapid analysis method based on deep neural network and electronic equipment |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US11922469B2 (en) | Automated news ranking and recommendation system | |
US10423647B2 (en) | Descriptive datacenter state comparison | |
US9552551B2 (en) | Pattern detection feedback loop for spatial and temporal memory systems | |
US8504570B2 (en) | Automated search for detecting patterns and sequences in data using a spatial and temporal memory system | |
US8365019B2 (en) | System and method for incident management enhanced with problem classification for technical support services | |
US8645291B2 (en) | Encoding of data for processing in a spatial and temporal memory system | |
US11847130B2 (en) | Extract, transform, load monitoring platform | |
US20180174062A1 (en) | Root cause analysis for sequences of datacenter states | |
US11860721B2 (en) | Utilizing automatic labelling, prioritizing, and root cause analysis machine learning models and dependency graphs to determine recommendations for software products | |
US20220100772A1 (en) | Context-sensitive linking of entities to private databases | |
US20220100963A1 (en) | Event extraction from documents with co-reference | |
CN113918554A (en) | Commodity data cleaning method and device, equipment, medium and product thereof | |
CN108108743A (en) | Abnormal user recognition methods and the device for identifying abnormal user | |
US20230214679A1 (en) | Extracting and classifying entities from digital content items | |
CN111800289B (en) | Communication network fault analysis method and device | |
US11176403B1 (en) | Filtering detected objects from an object recognition index according to extracted features | |
US20220100967A1 (en) | Lifecycle management for customized natural language processing | |
US20230161661A1 (en) | Utilizing topology-centric monitoring to model a system and correlate low level system anomalies and high level system impacts | |
WO2022148108A1 (en) | Systems, devices and methods for distributed hierarchical video analysis | |
Zhang et al. | An intrusion detection method based on stacked sparse autoencoder and improved gaussian mixture model | |
Ali et al. | Big data classification efficiency based on linear discriminant analysis | |
US20220131766A1 (en) | Cognitive model determining alerts generated in a system | |
CN113792786A (en) | Automatic commodity object classification method and device, equipment, medium and product thereof | |
CN114003591A (en) | Commodity data multi-mode cleaning method and device, equipment, medium and product thereof | |
CN114282622A (en) | Training sample checking method and device, equipment, medium and product thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||