CN113342969A - Data processing method and device
- Publication number: CN113342969A
- Application number: CN202010139561.5A (granted publication: CN113342969B)
- Authority: CN (China)
- Prior art keywords: model, training, data, data set, feature
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/35: Clustering; Classification (G06F16/00 Information retrieval; G06F16/30 of unstructured textual data)
- G06F18/2411: Classification techniques relating to the classification model, based on the proximity to a decision surface, e.g. support vector machines (G06F18/00 Pattern recognition; G06F18/24 Classification techniques)
- G06N20/00: Machine learning (G06N Computing arrangements based on specific computational models)
- G06Q30/0601: Electronic shopping [e-shopping] (G06Q30/06 Buying, selling or leasing transactions)
Abstract
The invention discloses a data processing method and device in the field of computer technology. One embodiment of the method comprises: acquiring a data source, and cleaning and processing it to screen out relevant features for model training; vectorizing the relevant features and merging all vectorized features into a feature vectorization result; dividing the feature vectorization result into a training data set and an evaluation data set, and training a deep factorization machine (DeepFM) model based on the feature vectorization result and the data labels in the training data set; tuning the parameters of the DeepFM model to obtain a tuned model; and processing data with the tuned model. The method effectively identifies unauthorized brand words improperly inserted into product titles, assists compliance checking of product detail pages, and meets business application requirements.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method and apparatus.
Background
With the increasing prosperity of e-commerce platforms, some merchants insert brand words other than their authorized brand words into product titles in order to increase product exposure. To maintain order in product operations on the platform and avoid damaging the brand rights of other merchants, measures must be taken to check the compliance of product detail-page information. Currently, this unauthorized brand insertion ("brand hanging") is largely detected by rule-based matching: for a subset of brand words, a manually maintained table of similar brand words and related category rules is applied with fixed matching rules, so only a limited set of products and brands is covered. In addition, there are detection methods based on traditional machine learning models, which combine various product attributes (e.g., category, brand, and price features) and use models such as LR or XGBoost to identify brand violations in product titles. A minimal sketch of the rule-matching baseline appears below.
In the process of implementing the invention, the inventors found at least the following problems in the prior art: the rule-matching approach requires manual maintenance of many brand lists and brand-category associations, which is time-consuming and labor-intensive and scales poorly with the rapid growth of product data and brands; the traditional machine-learning approach cannot capture interaction effects among features or deep abstract feature representations, and lacks contextual understanding of text data, so it misjudges certain special cases.
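For concreteness, the rule-matching baseline described above can be sketched as follows; this is a minimal illustration, and the brand table, category rules, and field names are hypothetical rather than taken from the patent:

```python
# Hypothetical rule-matching baseline: a manually maintained brand table and
# category rules, applied with fixed matching. Every name here is illustrative.
import re

# brand word -> categories in which unauthorized use of that brand is flagged
BRAND_CATEGORY_RULES = {
    "BrandA": {"clothing", "shoes"},
    "BrandB": {"electronics"},
}

def is_violation(title: str, authorized_brand: str, category: str) -> bool:
    for brand, flagged_categories in BRAND_CATEGORY_RULES.items():
        if brand == authorized_brand or category not in flagged_categories:
            continue
        if re.search(re.escape(brand), title, flags=re.IGNORECASE):
            return True  # unauthorized brand word found in the title
    return False
```

Every new brand or brand-category association must be added to such a table by hand, which is why the background describes this approach as time-consuming and hard to scale.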
Disclosure of Invention
In view of this, the present invention provides a data processing method that processes a data source, builds a related model, and uses the model to process the data to be processed. In the e-commerce field, this enables detection of unauthorized brand words improperly inserted into product titles, effectively supporting brand-violation judgment over large volumes of merchant detail information and protecting merchants' authorized brands.
To achieve the above object, according to one aspect of an embodiment of the present invention, there is provided a data processing method, including: acquiring a data source, and cleaning and processing it to screen out relevant features for model training; vectorizing the relevant features, and merging all vectorized features into a feature vectorization result; dividing the feature vectorization result into a training data set and an evaluation data set, and training a deep factorization machine (DeepFM) model based on the feature vectorization result and the data labels in the training data set; tuning the parameters of the DeepFM model to obtain a tuned model; and processing data with the tuned model.
One embodiment of the above invention has the following advantage: because a DeepFM (Deep Factorization Machine) model replaces traditional machine-learning models and rule-matching methods for detecting unauthorized brand words improperly inserted into product titles, low-order and high-order interaction effects of the feature vectors are combined, the violating behavior can be effectively identified, compliance of product detail pages is assisted, and business application requirements are met.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
Fig. 1 is a schematic diagram of the main flow of a data processing method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the detailed system execution flow of a data processing method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the main modules of a data processing apparatus according to an embodiment of the present invention;
Fig. 4 is an exemplary system architecture diagram to which embodiments of the present invention may be applied;
Fig. 5 is a schematic block diagram of a computer system suitable for implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of the main flow of a data processing method according to an embodiment of the present invention. The method uses a DeepFM model for data processing; DeepFM is a model algorithm that integrates an FM model with a deep learning network and can learn low-order and high-order feature interaction effects simultaneously.
The data processing method of the embodiment of the invention comprises the following steps:
step S101: acquiring a data source, and cleaning and processing it to screen out relevant features for model training;
step S102: vectorizing the relevant features screened in step S101, and merging all vectorized features into a feature vectorization result;
step S103: dividing the feature vectorization result of step S102 into a training data set and an evaluation data set, and training a DeepFM model based on the feature vectorization result and the data labels in the training data set;
step S104: tuning the parameters of the DeepFM model trained in step S103 to obtain a tuned model; and
step S105: processing data using the model tuned in step S104.
The method uses the DeepFM model in place of traditional machine-learning models and rule-matching methods to detect unauthorized brand words improperly inserted into product titles; by combining low-order and high-order interaction effects of the feature vectors, it effectively identifies the violating behavior, assists compliance of product detail pages, and meets business application requirements.
Fig. 2 is a schematic diagram of a detailed system execution flow of a data processing method according to an embodiment of the present invention.
Next, referring to fig. 2, on the basis of the above-described general embodiment, the respective steps are described in detail.
Step S101: acquire a data source, and clean and process it to screen out relevant features for model training. The data source includes e-commerce platform products, brand words, and other data; processing includes word segmentation and/or matching.
In step S101, all data features used in the scheme are preprocessed and screened. This comprises the following steps (a minimal sketch of the preprocessing follows the list):
a) feature preprocessing: mainly cleaning special characters in text data, segmenting the text, and matching and extracting related brand-word data;
b) data feature screening: mainly selecting related attribute information such as the product title, brand words in the product title, the product's authorized brand words, and the product category; and
c) feature analysis: performing basic statistical analysis on features such as brand and category.
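A minimal sketch of this preprocessing, assuming jieba for Chinese word segmentation and hypothetical field names (the patent does not name a specific tokenizer or data schema):

```python
# Hypothetical preprocessing sketch (step S101). The brand lexicon, the field
# names, and the use of jieba are illustrative assumptions.
import re
import jieba

BRAND_LEXICON = {"BrandA", "BrandB", "BrandC"}  # hypothetical brand-word table

def clean_text(text: str) -> str:
    # Remove special characters; keep CJK characters, letters, and digits.
    return re.sub(r"[^\w\u4e00-\u9fff]+", " ", text).strip()

def preprocess_item(item: dict) -> dict:
    title = clean_text(item["title"])
    tokens = list(jieba.cut(title))                           # word segmentation
    title_brands = [t for t in tokens if t in BRAND_LEXICON]  # brand matching
    return {
        "title": title,
        "tokens": tokens,
        "title_brands": title_brands,                  # brand words found in title
        "authorized_brand": item["authorized_brand"],  # merchant's licensed brand
        "category": item["category"],
    }
```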
Step S102: vectorize the relevant features screened in step S101, and merge all vectorized features into a feature vectorization result, which is used for subsequent model training. Specifically, vectorizing the features screened in step S101 includes: one-hot vectorization of the screened discrete features; for text features, embedding vectorization by training a word2vec model; and, for continuous features, one-hot vectorization after discretization. word2vec is an embedding-based text vectorization model proposed by Google.
In step S102, the relevant features screened in step S101 are vectorized as follows (see the sketch after this list):
a) if a feature is discrete, it is one-hot encoded. A discrete feature is one whose values have no continuity, such as a person's gender ({male, female}); one-hot encoding represents N states with an N-bit state register. For example, a nationality feature with values ["China", "USA", "France"] encodes China as 100, USA as 010, and France as 001;
b) if a feature is textual, an embedding vector of the related text is obtained by training a word2vec model on a corpus from the e-commerce platform. A text feature is one expressed as text, such as a product name; an embedding vector is a vector form that captures the text's semantics, produced by a model such as word2vec;
c) if a feature is continuous, it is discretized and then one-hot encoded. A continuous feature has numerical continuity, such as a person's age (a continuous value between 0 and 120) or an exam score (a continuous value between 0 and 100);
d) finally, all vectorized features are merged into a feature vectorization result for subsequent model training.
Here, embedding text features with a word2vec model adds feature vectors that capture brand context. Discretization means, for example, bucketing the product-price feature by price interval.
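A minimal sketch of the three vectorization paths and the final merge, assuming gensim's word2vec implementation; the vocabularies, bucket edges, and vector size are illustrative assumptions:

```python
# Hypothetical vectorization sketch (step S102).
import numpy as np
from gensim.models import Word2Vec

def one_hot(value, vocabulary):
    # N-bit state register: one position per possible value.
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(value)] = 1.0
    return vec

# a) discrete feature, e.g. product category
categories = ["clothing", "electronics", "food"]     # hypothetical vocabulary
cat_vec = one_hot("electronics", categories)         # -> [0, 1, 0]

# b) text feature: train word2vec on the tokenized platform corpus, then
#    average token vectors to embed a title (a simple, common pooling choice).
corpus = [["brandA", "cotton", "shirt"], ["brandB", "phone", "case"]]
w2v = Word2Vec(sentences=corpus, vector_size=32, min_count=1)
title_vec = np.mean([w2v.wv[t] for t in corpus[0]], axis=0)

# c) continuous feature: discretize price by interval, then one-hot encode.
price_buckets = [0, 50, 200, 1000, float("inf")]     # illustrative bucket edges
bucket = int(np.digitize(99.0, price_buckets)) - 1   # price 99 falls in [50, 200)
price_vec = one_hot(bucket, list(range(len(price_buckets) - 1)))

# d) merge all vectorized features into one feature vectorization result.
feature_vector = np.concatenate([cat_vec, title_vec, price_vec])
```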
Step S103: the feature vectorization result in step S102 is divided into a training data set and an evaluation data set, and the deep fm model is trained based on the feature vectorization result and the data labels in the training data set.
In step S103, the method is mainly used for constructing and training the deep fm model. Which comprises the following steps:
a) construction of a deep FM model:
mainly comprises FM (Factor Machine decomposition, which is mainly used for solving the problems of feature combination and Deep sparse matrix) layer construction, DNN (Deep Neural Networks) layer construction and logic layer construction;
b) training a DeepFM model:
first, samples were run as 9: 1, dividing the model into a training data set and an evaluation data set, wherein the training data set is used for model training, and the evaluation data set is used for evaluating the efficiency of the model;
secondly, inputting vectorization results of various feature data of the training data set into a deep FM model, and finally outputting prediction labels of the training samples through the multilayer structure of the deep FM model;
then, the cross-entropy loss function L (shown below) is calculated from the true training labels y and the model-predicted labels ŷ, and minimized by a back-propagation algorithm, which is the process of optimizing the relevant structural parameters of the DeepFM model (the network parameters of the model itself, obtained automatically through training). In its standard binary form,

L = -Σ_{i=1}^{n} [ y_i·log(ŷ_i) + (1 - y_i)·log(1 - ŷ_i) ]

where y_i is the true training label of the i-th sample, ŷ_i is the model-predicted label of the i-th sample, i = 1, 2, …, n, n is the number of training samples, and L is the overall loss function of the model, i.e., the sum of the per-sample losses.
Finally, when the loss converges, a trained DeepFM model is obtained, which can be used to predict samples with unknown labels. A sketch of the model construction and training loop follows.
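A minimal DeepFM-style sketch with the training loop described above, written in PyTorch; the field layout, layer sizes, optimizer, and toy data are assumptions for illustration and are not specified by the patent:

```python
# Hypothetical DeepFM sketch (step S103): an FM second-order layer plus a DNN,
# combined through a logistic (sigmoid) output. Sizes are illustrative.
import torch
import torch.nn as nn

class DeepFM(nn.Module):
    def __init__(self, num_features: int, num_fields: int, embed_dim: int = 8):
        super().__init__()
        self.linear = nn.Embedding(num_features, 1)         # first-order term
        self.embed = nn.Embedding(num_features, embed_dim)  # shared embeddings
        self.dnn = nn.Sequential(                           # high-order interactions
            nn.Linear(num_fields * embed_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x):                 # x: (batch, num_fields) of feature ids
        emb = self.embed(x)               # (batch, num_fields, embed_dim)
        # FM second-order term: 0.5 * ((sum v)^2 - sum v^2), summed over dims
        square_of_sum = emb.sum(dim=1).pow(2)
        sum_of_square = emb.pow(2).sum(dim=1)
        fm = 0.5 * (square_of_sum - sum_of_square).sum(dim=1, keepdim=True)
        first_order = self.linear(x).sum(dim=1)
        deep = self.dnn(emb.flatten(start_dim=1))
        return torch.sigmoid(first_order + fm + deep).squeeze(1)

# Training loop with the binary cross-entropy loss L described above.
model = DeepFM(num_features=10000, num_fields=5)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()                    # cross-entropy for 0/1 labels

x = torch.randint(0, 10000, (256, 5))     # toy batch of feature ids
y = torch.randint(0, 2, (256,)).float()   # 1 = violation, 0 = non-violation
for epoch in range(10):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                        # back-propagation minimizes L
    opt.step()
```

The FM term uses the standard identity Σ_i Σ_j ⟨v_i, v_j⟩ = 0.5 * ((Σ v)² - Σ v²), which keeps the second-order interaction computation linear in the number of fields.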
Step S104: tune the parameters of the DeepFM model trained in step S103 to obtain a tuned model. Model performance is evaluated using the evaluation data set and the DeepFM model, and the model is tuned accordingly. If the model's performance meets the requirement, it is taken as the tuned model; if not, its parameters are adjusted until the best predictive performance is reached, finally yielding the tuned model.
Step S104 mainly evaluates the performance of the model trained in step S103 and tunes the model's hyperparameters. Hyperparameters are parameters set before the learning process starts rather than obtained through training; they can be adjusted manually according to the model's predictive performance, for example the learning rate, the number of iterations, and the number of neurons. Step S104 includes:
a) evaluating the precision of the model trained in step S103, based on the feature vectorization result and labels of the evaluation data set divided in step S103, to determine whether the business accuracy requirement is met;
b) if so, outputting the model for online judgment;
c) if not, adjusting the model's hyperparameters by grid search or random search to improve the model's generalization until the requirement is met, and finally outputting the tuned model.
Precision represents the proportion of examples classified as positive that are actually positive:
Precision = TP / (TP + FP)
where TP (True Positive) is the number of positive samples predicted as positive by the model, and FP (False Positive) is the number of negative samples predicted as positive.
It should be noted that, considering the accuracy requirements of the application system, the parameter tuning of the embodiment of the present invention preferably uses precision as the evaluation index. Other criteria may also be adopted; for example, the F1-score can be considered when the model's precision and recall need to be balanced:
F1-score = 2 * Precision * Recall / (Precision + Recall)
where Precision is the precision rate and Recall is the recall rate. A sketch of the evaluation and hyperparameter search follows.
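A minimal sketch of the precision evaluation and grid search (step S104), using scikit-learn metrics; the threshold, grid values, and the train_fn signature are illustrative assumptions:

```python
# Hypothetical evaluation and hyperparameter-search sketch (step S104).
from itertools import product
from sklearn.metrics import f1_score, precision_score

REQUIRED_PRECISION = 0.95   # assumed business threshold, not from the patent

def evaluate(predict_fn, x_eval, y_eval):
    y_pred = (predict_fn(x_eval) > 0.5).astype(int)   # probabilities -> 0/1 labels
    return precision_score(y_eval, y_pred), f1_score(y_eval, y_pred)

def grid_search(train_fn, x_eval, y_eval):
    best = None
    # Grid over hyperparameters named in the text: learning rate and iterations.
    for lr, epochs in product([1e-2, 1e-3], [10, 30]):
        predict_fn = train_fn(lr=lr, epochs=epochs)
        precision, f1 = evaluate(predict_fn, x_eval, y_eval)
        if best is None or precision > best[0]:
            best = (precision, f1, predict_fn)
        if precision >= REQUIRED_PRECISION:
            break       # requirement met: output this model for online judgment
    return best
```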
Step S105: process data using the model tuned in step S104. For example, the trained and tuned model is deployed to online machines and returns prediction results; a prediction result is the online judgment of whether a product contains a brand violation.
Step S105 mainly covers online deployment of the tuned model and online data invocation (a sketch of such a service follows):
a) online deployment: deploying the model tuned in step S104 to several online machines;
b) online invocation: for products in the online mall, extracting relevant feature information, converting it into feature vectors, calling the deployed model via HTTP (hypertext transfer protocol) requests, and returning the prediction result.
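A minimal sketch of such an online service and its HTTP invocation, using Flask; the route, port, payload schema, and the vectorize/tuned_model helpers are hypothetical:

```python
# Hypothetical online-serving sketch (step S105).
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    item = request.get_json()              # product attributes from the mall
    features = vectorize(item)             # step S102 pipeline, defined elsewhere
    score = float(tuned_model(features))   # tuned DeepFM from step S104
    return jsonify({"violation": score > 0.5, "score": score})

# Caller side: the online system invokes the deployed model over HTTP, e.g.
#   import requests
#   r = requests.post("http://model-host:8080/predict",
#                     json={"title": "...", "authorized_brand": "...",
#                           "category": "..."})
#   print(r.json())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```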
Fig. 3 is a schematic diagram of the main modules of a data processing apparatus according to an embodiment of the present invention.
The invention also provides a data processing device. Based on the DeepFM model, the device detects brand violations in product titles and assists compliance of the platform's product detail pages. The device 300 mainly comprises: a feature screening module 301, which acquires a data source and cleans and processes it to screen out relevant features for model training; a feature vectorization module 302, which vectorizes the relevant features and merges all vectorized features into a feature vectorization result; a model training module 303, which divides the feature vectorization result into a training data set and an evaluation data set and trains the DeepFM model based on the feature vectorization result and the data labels in the training data set; a parameter tuning module 304, which tunes the parameters of the DeepFM model to obtain a tuned model; and a model application module 305, which processes data using the tuned model.
The feature screening module 301 acquires data sources, and cleans and processes them to screen out relevant features for model training. The data sources include e-commerce platform products, brand words, and other data; text processing includes word segmentation and/or matching.
The feature screening module 301 preprocesses and screens all data features used in the scheme. Specifically:
a) feature preprocessing: mainly cleaning special characters in text data, segmenting the text, and matching and extracting related brand-word data;
b) data feature screening: mainly selecting related attribute information such as the product title, brand words in the product title, the product's authorized brand words, and the product category; and
c) feature analysis: performing basic statistical analysis on features such as brand and category.
The feature vectorization module 302 vectorizes the relevant features, and merges all vectorized features into a feature vectorization result, which is used for subsequent model training. Specifically, vectorizing the features screened by the feature screening module 301 includes: one-hot vectorization of the screened discrete features; for text features, embedding vectorization by training a word2vec model; and, for continuous features, one-hot vectorization after discretization.
The feature vectorization module 302 vectorizes the features screened by the feature screening module 301 as follows:
a) if a feature is discrete, it is one-hot encoded. A discrete feature is one whose values have no continuity, such as a person's gender ({male, female}); one-hot encoding represents N states with an N-bit state register. For example, a nationality feature with values ["China", "USA", "France"] encodes China as 100, USA as 010, and France as 001;
b) if a feature is textual, an embedding vector of the related text is obtained by training a word2vec model on a corpus from the e-commerce platform. A text feature is one expressed as text, such as a product name; an embedding vector is a vector form that captures the text's semantics, produced by a model such as word2vec;
c) if a feature is continuous, it is discretized and then one-hot encoded. A continuous feature has numerical continuity, such as a person's age (a continuous value between 0 and 120) or an exam score (a continuous value between 0 and 100); and
d) finally, all vectorized features are merged into a feature vectorization result for subsequent model training.
Here, embedding text features with a word2vec model adds feature vectors that capture brand context. Discretization means, for example, bucketing the product-price feature by price interval.
The model training module 303 divides the feature vectorization result into a training data set and an evaluation data set, and trains the DeepFM model based on the feature vectorization result and the data labels in the training data set. Data labels are the positive/negative class labels of the samples; for example, positive samples are labeled 1 and negative samples 0. In this case, violation samples are labeled 1 and non-violation samples 0.
The model training module 303 is mainly used to construct and train the DeepFM model. Specifically:
a) DeepFM model construction:
mainly construction of the FM (Factorization Machine, used mainly to model feature combinations over deep sparse matrices) layer, the DNN (Deep Neural Network) layer, and the logistic output layer;
b) DeepFM model training:
first, the samples are split 9:1 into a training data set, used for model training, and an evaluation data set, used for evaluating model performance;
second, the vectorization results of the feature data in the training data set are input into the DeepFM model, whose multilayer structure outputs prediction labels for the training samples;
then, the cross-entropy loss function L (shown below) is calculated from the true training labels y and the model-predicted labels ŷ, and minimized by a back-propagation algorithm, which is the process of optimizing the relevant structural parameters of the DeepFM model (the network parameters of the model itself, obtained automatically through training). In its standard binary form,

L = -Σ_{i=1}^{n} [ y_i·log(ŷ_i) + (1 - y_i)·log(1 - ŷ_i) ]

where y_i is the true training label of the i-th sample, ŷ_i is the model-predicted label of the i-th sample, i = 1, 2, …, n, n is the number of training samples, and L is the overall loss function of the model, i.e., the sum of the per-sample losses.
Finally, when the loss converges, a trained DeepFM model is obtained, which can be used to predict samples with unknown labels.
The parameter tuning module 304 tunes the parameters of the DeepFM model to obtain a tuned model. Model performance is evaluated using the evaluation data set and the DeepFM model, and the model is tuned accordingly. If the model's performance meets the requirement, it is taken as the tuned model; if not, its parameters are adjusted until the best predictive performance is reached, finally yielding the tuned model.
The parameter tuning module 304 is mainly used to evaluate the performance of the model trained by the model training module 303 and to tune the model's hyperparameters. Hyperparameters are parameters set before the learning process starts rather than obtained through training; they can be adjusted manually according to the model's predictive performance, for example the learning rate, the number of iterations, and the number of neurons. Specifically:
a) evaluating the precision of the trained model, based on the feature vectorization result and labels of the evaluation data set divided by the model training module 303, to determine whether the business requirement is met;
b) if so, outputting the model for online judgment;
c) if not, adjusting the model's hyperparameters by grid search or random search to improve the model's generalization until the requirement is met, and finally outputting the tuned model.
Precision represents the proportion of examples classified as positive that are actually positive:
Precision = TP / (TP + FP)
where TP (True Positive) is the number of positive samples predicted as positive by the model, and FP (False Positive) is the number of negative samples predicted as positive.
It should be noted that, considering the accuracy requirements of the application system, the parameter tuning module of the embodiment of the present invention preferably uses precision as the evaluation index. Other criteria may also be adopted; for example, the F1-score can be considered when the model's precision and recall need to be balanced:
F1-score = 2 * Precision * Recall / (Precision + Recall)
where Precision is the precision rate and Recall is the recall rate.
The model application module 305 processes data using the tuned model. For example, the trained and tuned model is deployed to online machines and returns prediction results; a prediction result is the online judgment of whether a product contains a brand violation.
The model application module 305 is mainly used for online deployment of the tuned model and for online data invocation. Specifically:
a) online deployment: deploying the model tuned by the parameter tuning module 304 to several online machines;
b) online invocation: for products in the online mall, extracting relevant feature information, converting it into feature vectors, calling the deployed model via HTTP (hypertext transfer protocol) requests, and returning the prediction result.
Fig. 4 shows an exemplary system architecture 400 to which the data processing method or apparatus of an embodiment of the invention may be applied.
As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 serves as a medium for providing communication links between the terminal devices 401, 402, 403 and the server 405. Network 404 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal devices 401, 402, 403 to interact with a server 405 over a network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 405 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 401, 402, 403. The backend management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (for example, target push information, product information — just an example) to the terminal device.
It should be noted that the data processing method provided by the embodiment of the present invention is generally executed by the server 405; accordingly, the data processing apparatus is generally disposed in the server 405.
It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, shown is a block diagram of a computer system 500 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), and a speaker; a storage portion 508 including a hard disk and the like; and a communication portion 509 including a network interface card such as a LAN card or a modem. The communication portion 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as necessary, so that a computer program read therefrom is installed into the storage portion 508 as needed.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising a feature screening module, a feature vectorization module, a model training module, a parameter tuning module, and a model application module. For example, the feature screening module may be further described as a module that, combining various data sources such as e-commerce platform products and brand words, cleans the text data and screens out relevant features for model training through text-processing methods such as word segmentation and matching.
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may stand alone without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to: acquire a data source, and clean and process it to screen out relevant features for model training; vectorize the relevant features, and merge all vectorized features into a feature vectorization result; divide the feature vectorization result into a training data set and an evaluation data set, and train a deep factorization machine (DeepFM) model based on the feature vectorization result and the data labels in the training data set; tune the parameters of the DeepFM model to obtain a tuned model; and process data using the tuned model.
According to the technical solution of the embodiments of the present invention, brand-violation judgment over large volumes of merchant detail information can be effectively supported, merchants' authorized brands are protected, and the technology meets the needs of the application.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A data processing method, comprising:
acquiring a data source, and cleaning and processing the data source to screen out relevant features for model training;
vectorizing the relevant features, and merging all vectorized features into a feature vectorization result;
dividing the feature vectorization result into a training data set and an evaluation data set, and training a deep factorization machine model based on the feature vectorization result and the data labels in the training data set;
tuning parameters of the deep factorization machine model to obtain a tuned model; and
processing data using the tuned model.
2. The data processing method of claim 1, wherein vectorizing the relevant features specifically comprises: for the text features among the relevant features, performing feature-embedding vectorization by training a word2vec model.
3. The data processing method of claim 1, wherein tuning the parameters of the deep factorization machine model specifically comprises: evaluating model performance using the evaluation data set and the deep factorization machine model, and tuning accordingly.
4. The data processing method of claim 1, wherein training the deep factorization machine model specifically comprises: inputting the feature vectorization result of the training data set into the deep factorization machine model, calculating a cross-entropy loss function L from the true training labels y and the model-predicted labels ŷ by the formula L = -Σ_{i=1}^{n} [ y_i·log(ŷ_i) + (1 - y_i)·log(1 - ŷ_i) ], and minimizing the loss function through a back-propagation algorithm to obtain the relevant parameters of the deep factorization machine model through training.
5. A data processing apparatus, comprising:
a feature screening module, configured to acquire a data source, and clean and process the data source to screen out relevant features for model training;
a feature vectorization module, configured to vectorize the relevant features and merge all vectorized features into a feature vectorization result;
a model training module, configured to divide the feature vectorization result into a training data set and an evaluation data set, and to train a deep factorization machine model based on the feature vectorization result and the data labels in the training data set;
a parameter tuning module, configured to tune parameters of the deep factorization machine model to obtain a tuned model; and
a model application module, configured to process data using the tuned model.
6. The data processing apparatus of claim 5, wherein vectorizing the relevant features specifically comprises: the feature vectorization module performs feature-embedding vectorization by training a word2vec model for the text features among the relevant features.
7. The data processing apparatus of claim 5, wherein tuning the parameters of the deep factorization machine model specifically comprises: the parameter tuning module evaluates model performance using the evaluation data set and the deep factorization machine model, and tunes accordingly.
8. The data processing apparatus of claim 5, wherein training the deep factorization machine model specifically comprises: inputting the feature vectorization result of the training data set into the deep factorization machine model, calculating a cross-entropy loss function L from the true training labels y and the model-predicted labels ŷ by the formula L = -Σ_{i=1}^{n} [ y_i·log(ŷ_i) + (1 - y_i)·log(1 - ŷ_i) ], and minimizing the loss function through a back-propagation algorithm to obtain the relevant parameters of the deep factorization machine model through training.
9. An electronic device, comprising:
one or more processors;
a storage device, configured to store one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-4.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010139561.5A (granted as CN113342969B) | 2020-03-03 | | Data processing method and device |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010139561.5A (granted as CN113342969B) | 2020-03-03 | | Data processing method and device |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN113342969A | 2021-09-03 |
| CN113342969B | 2024-10-22 |
Cited By (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116541717A | 2023-07-06 | 2023-08-04 | 图林科技(深圳)有限公司 | Big data analysis method based on cloud computing and deep learning |
Citations (7)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20160239888A1 | 2015-02-13 | 2016-08-18 | David Silver | Systems and methods for verifying compliance in an electronic marketplace |
| US20170206574A1 | 2016-01-20 | 2017-07-20 | Zortag, Inc. | Method of, and system for, preventing unauthorized products from being sold on online sites |
| WO2020020088A1 | 2018-07-23 | 2020-01-30 | 第四范式(北京)技术有限公司 | Neural network model training method and system, and prediction method and system |
| CN109583910A | 2018-10-26 | 2019-04-05 | 阿里巴巴集团控股有限公司 | Merchandise authorization identification method, device and equipment |
| CN110083678A | 2019-03-12 | 2019-08-02 | 平安科技(深圳)有限公司 | E-commerce platform goods matching method, device and readable storage medium |
| CN110503565A | 2019-07-05 | 2019-11-26 | 中国平安人寿保险股份有限公司 | Behavior risk recognition method, system, device and readable storage medium |
| CN110569359A | 2019-08-26 | 2019-12-13 | 腾讯科技(深圳)有限公司 | Recognition model training and application method and device, computing device and storage medium |
Non-Patent Citations (1)

| Title |
|---|
| Ding Hongchao; Guo Yunyun: "English Translation Strategies for Product Titles of Cross-Border E-Commerce Online Stores from the Perspective of Skopos Theory" (目的论视角下跨境电商网店产品标题英译策略), E-Business (电子商务), no. 07, 13 July 2018 |
Similar Documents

| Publication | Title |
|---|---|
| Wassan et al. | Amazon product sentiment analysis using machine learning techniques |
| CN109145280B | Information pushing method and device |
| US20190163742A1 | Method and apparatus for generating information |
| CN107330752B | Method and device for identifying brand words |
| CN107797982B | Method, device and equipment for recognizing text type |
| CN108984554B | Method and device for determining keywords |
| CN108021651B | Network public opinion risk assessment method and device |
| CN106844407B | Tag network generation method and system based on data set correlation |
| US11741094B2 | Method and system for identifying core product terms |
| CN113051480A | Resource pushing method and device, electronic equipment and storage medium |
| CN116739626A | Commodity data mining processing method and device, electronic equipment and readable medium |
| CN112990311A | Method and device for identifying admitted client |
| CN108512674B | Method, device and equipment for outputting information |
| CN115187066A | Risk identification method and device, electronic equipment and storage medium |
| CN112784861A | Similarity determination method and device, electronic equipment and storage medium |
| CN113869904B | Suspicious data identification method, device, electronic equipment, medium and computer program |
| CN115238676A | Method and device for identifying hot spots of bidding demands, storage medium and electronic equipment |
| CN113743973B | Method and device for analyzing market hotspot trend |
| CN113342969B | Data processing method and device |
| CN113342969A | Data processing method and device |
| CN110929175B | Method, device, system and medium for evaluating user evaluation |
| CN113222632A | Object mining method and device |
| CN113822691A | User account identification method, device, system and medium |
| CN111783808A | Method and apparatus for generating information |
| CN110827045B | Method and device for distinguishing commodity relationship |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |