CN108304359A - Unsupervised learning unified feature extractor construction method - Google Patents
Unsupervised learning unified feature extractor construction method
- Publication number
- CN108304359A CN108304359A CN201810117102.XA CN201810117102A CN108304359A CN 108304359 A CN108304359 A CN 108304359A CN 201810117102 A CN201810117102 A CN 201810117102A CN 108304359 A CN108304359 A CN 108304359A
- Authority
- CN
- China
- Prior art keywords
- training
- news
- data
- encoder
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0631—Item recommendations
Abstract
The application provides a construction method for an unsupervised learning unified feature extractor. Actual news text data are obtained from the server side and a news feature training data set is generated; the data in the news feature training data set are preprocessed and vectorized to obtain a news feature training vector set; the news data are classified according to user access data to form a user feature training data set; a stacked asymmetric denoising-contractive autoencoder with a plurality of hidden layers is constructed, and the deep autoencoder is trained with a dedicated objective function; after the deep autoencoder completes training, the decoder part is deleted and a binarization generation layer is added, completing the construction of the unsupervised learning unified feature extractor. The unsupervised learning unified feature extractor provided by the application can unify news features and user features, unify content-based recommendation and collaborative filtering recommendation, and improve the efficiency of real-time recommendation.
Description
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to a construction method of an unsupervised learning unified feature extractor.
Background
Current recommendation systems (recommendation engines) are generally classified into content-based recommendation, collaborative filtering recommendation, hybrid recommendation and other types. They are information tools as important to today's society as search engines and are widely applied in e-commerce, media recommendation and other fields. The currently popular collaborative filtering methods are mainly based on commonality: the similarity between users and the similarity between items is calculated from the scores some users give to commodities or media contents (collectively called "items"); the scores of other users with similar interests are then used to infer a user's score on a new item, or a user's score on a new item is predicted from its similarity to items the user has shown interest in. Such methods are therefore also called score prediction, but they suffer from insufficient personalization and are difficult to apply when score data are scarce.
Content-based recommendation mainly models the preferences of a user and the attributes of items, and recommends items by matching the two. It is strongly personalized, but modeling user preferences and matching them against item attributes is difficult. Past user preference modeling required direct features such as demographics and was prone to invading personal privacy.
Deep learning is a machine learning method that has emerged in recent years and can be divided into supervised and unsupervised learning. The autoencoder (AE) is at the leading edge of current unsupervised learning research, but most current deep autoencoder systems have drawbacks, such as a tendency to overfit; most do not realize unsupervised learning in the complete sense, which greatly restricts the capability of the deep autoencoder.
With the current rapid development of artificial intelligence, deep learning and unsupervised learning, new technologies and methods need to be researched to renew the technical basis of recommendation systems, realize hybrid recommendation effectively, and greatly improve online recommendation efficiency.
Disclosure of Invention
Aiming at problems in current converged-media news recommendation such as insufficient personalization, difficulty in extracting user features, the need to unify different methods into an effective hybrid recommendation method, privacy violations in user feature extraction, and the need to improve real-time recommendation efficiency, and drawing on current artificial intelligence technology, the application discloses a construction method of an Unsupervised Learning Unified Feature Extractor (ULUFE) for extracting a content-based unified feature representation (Unified Representation Based on Content, URBC). The construction method comprises the following steps:
S1, acquiring actual news text data and user access data from a server, and generating a news feature training data set after sorting and randomization;
S2, preprocessing the data in the news feature training data set with a current Chinese word segmentation tool to obtain a preprocessed news feature training data set;
S3, obtaining a news feature training vector set from the preprocessed news feature training data set by the TF-IDF method;
S4, classifying the news feature training vector set according to the user access data to form a user feature training data set;
S5, constructing a stacked asymmetric denoising-contractive autoencoder with a plurality of hidden layers and using $J_{SA\text{-}CDAE}$ as the objective function:

$$J_{SA\text{-}CDAE}(\theta)=\sum_{x_i\in t}\Big[L_{MC}\big(x_i,g_\theta(f_\theta(x_i))\big)+\lambda\,\|J(x_i)\|_F^2\Big]$$

wherein

$$L_{MC}(x_i,\hat{x}_i)=-\sum_{j}k_\sigma\big(x_{ij}-\hat{x}_{ij}\big)$$

wherein $k_\sigma$ is a Gaussian kernel with standard deviation $\sigma=1.0$, the Gaussian kernel function being

$$k_\sigma(z)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{z^2}{2\sigma^2}\right)$$

where x denotes the input of the encoder, $f_\theta(\cdot)$ the output of the encoder and $g_\theta(\cdot)$ the decoder output; $L_{MC}(\cdot)$ denotes the cost function of a single input (the sum over j running over the vector components), $\lambda$ is the regularization parameter of the contractive autoencoder, $\|\cdot\|_F$ is the Frobenius-norm symbol, $J(x)$ is the encoder Jacobian matrix, $\theta$ is the parameter set of the deep autoencoder, $x_i$ denotes the input of the encoder in one training pass, $\hat{x}_i$ the output restored by the decoder, t the training set, and z the algebraic expression in the Gaussian kernel;
S6, training the deep autoencoder, the training steps being as follows:
S61, taking the news feature training vector set as the training data of the deep autoencoder;
S62, adding Gaussian white noise to the training data to generate noisy input data;
S63, taking the noisy input data as the input of the deep autoencoder; during training, adopting a mini-batch gradient descent method and first performing unsupervised layer-by-layer pre-training to obtain the initial parameters of each hidden layer and the output data of the output layer;
S64, comparing the input training data with the output data in the objective function, implementing back-propagation of the gradient and adjusting the initial parameters of each hidden layer;
S65, obtaining the parameter set of the deep autoencoder after training is finished;
and S7, deleting the decoder part of the deep autoencoder and adding a binarization generation layer after the output of the last hidden layer, completing the construction of the unsupervised learning unified feature extractor.
Preferably, step S1, acquiring actual news text data and user access data from the server and generating a news feature training data set after sorting and randomization, specifically comprises the following steps:
S11, collecting news data and user access data within a certain time period on the server;
S12, removing pictures and videos from the news data, uniformly encoding the text as UTF-8, and assigning a sequence number to each news item to form a news data set;
S13, randomizing and reordering the news in the news data set according to the sequence numbers, then splitting it in a certain proportion into the news feature training data sets for the layer-by-layer unsupervised pre-training stage and the global training stage.
Preferably, in step S5, the stacked asymmetric denoising-contractive autoencoder with a plurality of hidden layers comprises 2 hidden layers.
Preferably, the coding function of the first hidden layer is $h_1(x_i)=S(w_1 x_i+b_1)$ and its pre-training decoding function is $\hat{g}_1(h_1)=S(w_1' h_1+b_1')$; the coding function of the second hidden layer is $h_2(h_1)=S(w_2 h_1+b_2)$ and its pre-training decoding function is $\hat{g}_2(h_2)=S(w_2' h_2+b_2')$.
The global training decoding function from the second hidden layer to the output layer is $g_o(h_2)=S(w_o h_2+b_o)$.
The initial parameters of each layer are random numbers in the open interval (0,1); the nonlinear activation function $S(\cdot)$ uniformly uses the Sigmoid function $S(z)=1/(1+e^{-z})$, where e is Euler's number, h denotes the coding function of a hidden layer, g a decoding function, b a bias, x the input of the encoder, and $w_1$, $w_2$ the weight parameters of the first and second hidden layers respectively; primed symbols denote the corresponding pre-training decoder parameters, and $w_o$, $b_o$ the output-layer parameters.
Preferably, the dimension of the binarization generation layer in step S7 is the same as that of the last hidden layer of the deep autoencoder, with a one-to-one connection to each neuron of the last hidden layer; the binarization generation layer is provided with a weight regulator that adjusts the threshold according to the output of the last hidden layer, the threshold T in the weight regulator being chosen so that the output of one complete training pass is divided into two classes whose between-class variance is maximal.
Preferably, the method further comprises step S8: inputting the user feature training vector set into the unsupervised learning unified feature extractor to obtain user preference models, and generating a unified user neighbor table by similarity comparison of the user preference models of the individual users.
The advantages of the application are:
1. Aiming at the problems that the manually labeled data required by supervised learning are hard to obtain in real time for fast recommendation in network media, and that existing deep autoencoders still need supervised fine-tuning after unsupervised layer-by-layer pre-training, the deep autoencoder of the application realizes whole-process unsupervised learning;
2. A deep structure replaces the single-hidden-layer structure, further improving the ability to learn high-order latent explanatory factors of the content;
3. The encoder and decoder are asymmetric and the hidden layer dimension is lower than the input layer dimension, so the nonlinear manifold of the data can be learned and dimensionality reduction is achieved during feature extraction, which is superior to linear manifold methods such as PCA. The asymmetry also serves as a means against the autoencoder's tendency to overfit;
4. The features output by the autoencoder lend themselves to binarization, and binarized features can be generated after adding a binarization generation layer, so that the fast similarity comparison of users and news in converged media can be handled in recommendation by cosine similarity, Hamming distance, hashing and other methods; the effect on fast recommendation of short news in mobile media is significant.
5. In application, the features extracted from the news data (the content-based unified feature representation) serve as the features of both the news to be recommended and the user, unifying the two kinds of features and unifying content-based recommendation with collaborative filtering recommendation; the recommendation method is innovated, user privacy is effectively protected, and recommendation efficiency is improved.
Drawings
FIG. 1 is a schematic structural diagram of the SA-CDAE of the present invention;
FIG. 2 is a schematic diagram of the training of the present invention;
FIG. 3 shows the unsupervised learning unified feature extractor of the present invention;
FIG. 4 is a schematic diagram of online recommendation according to the present invention;
FIG. 5 is a precision comparison chart for the present invention;
FIG. 6 is a recall comparison chart for the present invention.
Detailed Description
The specific implementation and detailed steps of the unsupervised learning unified feature extractor construction method of the present invention are further described below:
the method comprises the following steps: data acquisition and preparation
The invention mainly targets website text news and mobile news client text news in current converged media. Both the news text data and the user access data reside on the server side; this step generates the "news feature training data set" as follows:
1) collecting news data and user access data within a certain time period on the server, the news data comprising the historical news on the server and the user access data comprising the list of news IDs read by each user within that period;
2) removing irrelevant content such as pictures and videos from the news data, uniformly encoding the text as UTF-8, and assigning a sequence number to each news item to form a news data set;
3) randomizing and reordering the news in the news data set according to the sequence numbers, then splitting it in a certain proportion into the "news feature training data sets" for the layer-by-layer unsupervised pre-training stage and the global training stage.
Step two: text data preprocessing
Performing Chinese word segmentation, stop-word removal and other processing on the data in the news feature training data set with a current Chinese word segmentation tool to obtain the preprocessed news feature training data set.
Step three: news text data vectorization
Vectorizing the preprocessed news feature training data set by the TF-IDF method to obtain the news feature training vector set corresponding to the news feature training data set. TF-IDF is the abbreviation of "term frequency-inverse document frequency".
TF denotes term frequency, calculated as

$$TF(t,d)=\frac{n_{t,d}}{\sum_{k}n_{k,d}}$$

where $n_{t,d}$ is the number of occurrences of term t in document d and the denominator is the total number of terms in d.
IDF denotes inverse document frequency, calculated as

$$IDF(t)=\log\frac{|D|}{1+|\{d:t\in d\}|}$$

where $|D|$ is the total number of documents and the denominator counts the documents containing term t.
On the premise of keeping the relative positions of the words in the news feature training data set, the initial feature vectors of the data in the news feature training data set are obtained by the TF-IDF method, forming the news feature training vector set, where

$$TF\text{-}IDF = TF \times IDF$$
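As a concrete illustration of this step, the following minimal Python sketch vectorizes pre-segmented texts, assuming the documents were space-tokenized in step two; the scikit-learn vectorizer, the sample corpus and the max_features dimension are illustrative assumptions rather than parts of the patent.

```python
# Hypothetical sketch of step three: TF-IDF vectorization of
# pre-segmented news texts (tool, corpus and dimensions assumed).
from sklearn.feature_extraction.text import TfidfVectorizer

segmented_news = [
    "央行 发布 最新 货币 政策 报告",   # hypothetical pre-segmented documents
    "主场 球队 昨晚 赢得 关键 比赛",
]

vectorizer = TfidfVectorizer(max_features=2000)  # assumed input dimension
news_vectors = vectorizer.fit_transform(segmented_news).toarray()
print(news_vectors.shape)  # (number of news items, vocabulary size <= 2000)
```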
step four: obtaining a user feature training dataset
Classifying the news feature training vector set according to the user access data to obtain the user feature training vector set.
step five: constructing a stacked asymmetric noise reduction contraction self-encoder
The core component of the unsupervised learning unified feature extractor in this application is a specially designed deep autoencoder, whose role in the invention lies mainly in feature extraction and dimensionality reduction. Around the application target of intelligent recommendation in converged media, the invention combines the advantages of the contractive autoencoder and the denoising autoencoder to design a Stacked Asymmetric Contractive-Denoising AutoEncoder (SA-CDAE) with 2 or 3 hidden layers, as shown in FIG. 1. Structurally, multiple hidden layers are adopted to improve on the feature extraction capability of a single hidden layer; the input and output layers have the same dimension, the hidden layer dimension is smaller than the input layer dimension and decreases layer by layer in proportion, and the coding and decoding structure is asymmetric, improving resistance to overfitting. The initial news training vector set obtained after the preceding preparation and preprocessing is on the whole independent and identically distributed, but contains a certain amount of perturbation whose specific distribution is unknown; denote it $D=\{x_1,x_2,\ldots,x_n\}$, $x_i\in R^d$, $n\in N$. Then:
the coding function of the first hidden layer is h1(xi)=S(w1xi+b1),
The pre-training decoding function of the first hidden layer is
The coding function of the second hidden layer is h2(h1)=S(w2h1+b2),
The pre-training decoding function of the second hidden layer is
The global training decoding function from the second hidden layer to the output layer is go(xi)=S(w1xi+b1);
The initial parameters of each layer are [0,1 ]]Random numbers in open intervals, nonlinear activation function S () uses a Sigmoid function in unison,
wherein D represents news initial training vector set, R is real number set, N is natural number set, h represents coding function of hidden layer, g represents decoding function, b represents bias, e is Euler number, h represents coding function of hidden layer, g represents decoding function, b represents bias, x represents input of encoder, w represents input of encoder, and the like1、w2Weight parameters, x, of the first and second hidden layers, respectivelyiRepresenting the input to the encoder in one training session.
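The following is a minimal NumPy sketch of this forward structure; the layer sizes, the random data and the single decoding layer for the global phase are illustrative assumptions, and the primed/output-layer parameter names follow the reconstructed notation above rather than symbols fixed by the original text.

```python
# Sketch of the asymmetric SA-CDAE forward pass (sizes are assumptions).
import numpy as np

def S(z):                                   # Sigmoid activation from the text
    return 1.0 / (1.0 + np.exp(-z))

d, d1, d2 = 2000, 1024, 512                 # input dim > hidden 1 > hidden 2
rng = np.random.default_rng(0)
w1, b1 = rng.uniform(0, 1, (d1, d)), rng.uniform(0, 1, d1)
w2, b2 = rng.uniform(0, 1, (d2, d1)), rng.uniform(0, 1, d2)
wo, bo = rng.uniform(0, 1, (d, d2)), rng.uniform(0, 1, d)  # output layer

def encode(x):
    h1 = S(w1 @ x + b1)                     # first hidden layer
    return S(w2 @ h1 + b2)                  # second hidden layer

def decode(h2):
    return S(wo @ h2 + bo)                  # one decoding layer: asymmetric

x = rng.uniform(0, 1, d)                    # stand-in for one TF-IDF vector
x_hat = decode(encode(x))                   # approximate reconstruction of x
```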
The principle of the autoencoder is to train an encoding-decoding mechanism so that the input of the encoder reappears at the output of the decoder; the encoder part is also called the hidden layers and the decoder part the output layers. Reconstructing the input perfectly at the output is neither easy nor useful. Instead, by designing a special structure, adding appropriate constraints on the replication, and using special cost functions and training methods, only approximate replication is realized, forcing the model to reproduce the input data according to learned weights; useful distributed features of the data are thereby constructed in the encoder, which has made the autoencoder a leading edge of generative model research in recent years. The prototype autoencoder exhibits good feature extraction capability but is prone to problems such as overfitting in use, losing generalization on real data; derived autoencoders improving and optimizing the prototype have therefore been developed in succession.
The deep autoencoder of the present invention is designed with both noise addition and noise (perturbation) reduction in mind. Noise addition borrows the idea of denoising autoencoders: Gaussian-distributed white noise is added to the input X so that the decoder is forced to remove the interference of the noise at the output, improving the system's resistance to overfitting; adding Gaussian white noise to the input during training achieves the denoising self-encoding property and further reduces the risk of overfitting. The parameter set θ of the neural network is trained by back-propagation and stochastic gradient descent (SGD).
Noise (perturbation) reduction refers to improving the system's resistance to non-Gaussian noise and perturbations during training. To further reduce the influence of outliers in the news feature data set and the user feature data set, and to lay a foundation for the binarization generation adopted later in the scheme, the design partially adopts the characteristics of the contractive autoencoder. The contractive autoencoder adds an analytic contraction penalty term to the cost function of the prototype autoencoder to reduce the degrees of freedom of the feature representation, driving the hidden layer neurons toward saturation and limiting the output data to a certain range of the parameter space. The penalty term is in fact the Frobenius norm of the encoder's Jacobian matrix; it reduces the influence of outliers on the encoder, suppresses perturbations of the training samples in all directions (on the low-dimensional manifold surface), and helps the encoder learn useful data features. Furthermore, the distributed representation learned by the contractive autoencoder has the "saturation" property: the values of most hidden units are close to the two ends (0 or 1) and their partial derivatives with respect to the input are close to 0.
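For a single sigmoid layer $h=S(Wx+b)$ the contraction penalty has a well-known closed form, since $\partial h_i/\partial x_j=h_i(1-h_i)W_{ij}$; the sketch below computes it under that assumption and is illustrative rather than code from the patent.

```python
# ||J(x)||_F^2 for one sigmoid layer h = S(Wx + b):
# J_ij = h_i (1 - h_i) W_ij  =>  sum_i (h_i(1-h_i))^2 * sum_j W_ij^2
import numpy as np

def contractive_penalty(h, W):
    dh2 = (h * (1.0 - h)) ** 2          # squared sigmoid derivative per unit
    return float(dh2 @ np.sum(W ** 2, axis=1))
```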
In ordinary autoencoder training, the mean square error (MSE) is often used as the cost function and shows a certain tolerance to Gaussian noise. In this example, however, considering the existence of perturbations such as outlier variables, for example accidental reads outside a user's preferences, maximum correntropy (MC) is used as the cost function to improve robustness:

$$L_{MC}(x_i,\hat{x}_i)=-\sum_{j}k_\sigma\big(x_{ij}-\hat{x}_{ij}\big)$$

where $k_\sigma$ is a Gaussian kernel with standard deviation $\sigma=1.0$:

$$k_\sigma(z)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{z^2}{2\sigma^2}\right)$$

The overall objective function of the deep autoencoder in the invention is:

$$J_{SA\text{-}CDAE}(\theta)=\sum_{x_i\in t}\Big[L_{MC}\big(x_i,g_\theta(f_\theta(x_i))\big)+\lambda\,\|J(x_i)\|_F^2\Big]$$

In the above formulas, $f_\theta(\cdot)$ denotes the encoder output and $g_\theta(\cdot)$ the decoder output; $L_{MC}(\cdot)$ is the cost function of a single input (the sum over j running over the vector components), $\lambda$ the regularization parameter of the contractive autoencoder, $\|\cdot\|_F$ the Frobenius norm, $J(x)$ the encoder Jacobian matrix, $\theta$ the parameter set of the deep autoencoder, $x_i$ the input of the encoder in one training pass, $\hat{x}_i$ the output reconstructed by the decoder, $t$ the training set, and $z$ the algebraic expression in the Gaussian kernel.
Step six: training depth autoencoder
Training a neural network means taking cleaned and sorted data as input and, through the two phases of forward propagation and back-propagation, making the parameters of the network's objective function converge gradually, thereby learning high-order statistical features. As shown in FIG. 2, the deep autoencoder is trained offline; the main training steps are as follows, with an illustrative code sketch after the list:
1) taking the news feature training vector set as the training data of the deep autoencoder, denoted X; the training data in this application are thus news data themselves, requiring neither manual labeling nor a third-party corpus;
2) adding Gaussian white noise to the training data X to generate the noisy input data X₁;
3) taking X₁ as the input of the deep autoencoder; during training, a mini-batch gradient descent method is adopted, and unsupervised layer-by-layer pre-training is performed first to obtain the initial parameters of each hidden layer and the output X̂ of the output layer;
4) comparing X and X̂ in the objective function, realizing global back-propagation of the gradient and adjusting the initial parameters of all hidden layers;
5) after training is finished, obtaining the parameter set of the deep autoencoder, which is used in the next step to construct the unsupervised learning unified feature extractor.
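The two-phase procedure can be sketched compactly with automatic differentiation. The PyTorch snippet below is an illustrative reduction under assumed layer sizes, noise level, learning rate and epoch counts, with the objective abbreviated to the correntropy term; it is not the patent's actual implementation.

```python
# Hypothetical PyTorch sketch of step six: noisy input, layer-by-layer
# pre-training, then global back-propagation comparing X with X_hat.
import torch
import torch.nn as nn

d, d1, d2 = 2000, 1024, 512                       # assumed dimensions
enc1 = nn.Sequential(nn.Linear(d, d1), nn.Sigmoid())
enc2 = nn.Sequential(nn.Linear(d1, d2), nn.Sigmoid())
dec1 = nn.Sequential(nn.Linear(d1, d), nn.Sigmoid())   # pre-training decoders
dec2 = nn.Sequential(nn.Linear(d2, d1), nn.Sigmoid())
out  = nn.Sequential(nn.Linear(d2, d), nn.Sigmoid())   # global output layer

def corr_cost(x, x_hat, sigma=1.0):               # negative correntropy
    return -torch.exp(-(x - x_hat) ** 2 / (2 * sigma ** 2)).mean()

X  = torch.rand(256, d)                           # stand-in news vectors
X1 = X + 0.1 * torch.randn_like(X)                # 2) add Gaussian white noise

def pretrain(encoder, decoder, data, epochs=20, lr=0.5):
    opt = torch.optim.SGD([*encoder.parameters(), *decoder.parameters()], lr=lr)
    for _ in range(epochs):                       # 3) layer-by-layer pre-training
        opt.zero_grad()
        loss = corr_cost(data, decoder(encoder(data)))
        loss.backward(); opt.step()

pretrain(enc1, dec1, X1)                          # first hidden layer
pretrain(enc2, dec2, enc1(X1).detach())           # second hidden layer

opt = torch.optim.SGD([p for m in (enc1, enc2, out) for p in m.parameters()], lr=0.5)
for _ in range(20):                               # 4) global back-propagation
    opt.zero_grad()
    loss = corr_cost(X, out(enc2(enc1(X1))))      # compare clean X with X_hat
    loss.backward(); opt.step()
```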
The deep autoencoder in this application is conceived afresh in structure, cost function and training mode; it achieves dimensionality reduction while extracting features, can learn nonlinear manifolds, and in dimensionality reduction is greatly superior to linear manifold methods such as PCA (principal component analysis). In addition, exploiting the parallel nature of neural networks, GPU parallel computing is adopted to accelerate the main training steps of the deep autoencoder, greatly improving its training efficiency and hence its practical efficiency in a recommendation system.
Step seven: constructing an unsupervised learning unified feature extractor
The trained output of the deep autoencoder is easy to binarize. Therefore, after the deep autoencoder finishes training, the decoder part is deleted and a binarization generation layer, which completes the binarization processing, is added after the output of the last hidden layer; as shown in FIG. 3, this completes the construction of the unsupervised learning unified feature extractor.
In this embodiment, about 70% of the outputs of the deep autoencoder are close to 0 or 1 and easy to binarize, but how the remaining 30% are processed directly affects the overall binarization quality and the accuracy of subsequent similarity comparison. Structurally, a binarization generation layer with the same dimension as the last hidden layer of the deep autoencoder is therefore designed, connected one-to-one to the neurons of the last hidden layer. Internally, the binarization generation layer does not adopt a common fixed threshold; instead, a weight regulator adjusts the threshold according to the actual distribution of the last hidden layer's output. The threshold T in the weight regulator is chosen so that the output of one complete training pass is divided into two classes whose between-class variance is maximal.
After one complete training pass, let K be the overall output set of the hidden layer units, containing N distinct values. Sort K in ascending order to obtain the data set $K=(k_1,k_2,\ldots,k_N)$ and split it at position t into two groups $K_1$ and $K_2$ of sizes t and N-t. Let each value $k_i$ occur $n_i$ times, $i\in[1,N]$, so its probability of occurrence is $p_i=n_i/N$. Let the probabilities of the two groups within the whole be $\varepsilon_1$ and $\varepsilon_2=1-\varepsilon_1$ and their means $\beta_1$ and $\beta_2$; the mean of the data set K is then $\beta=\varepsilon_1\beta_1+\varepsilon_2\beta_2$. The between-class variance of the two groups is defined as $\delta(t)=\varepsilon_1(\beta_1-\beta)^2+\varepsilon_2(\beta_2-\beta)^2$. Take $T=\arg\max_t\delta(t)$, i.e. the split position at which $\delta(t)$ is maximal, and use the value of K at position T as the threshold: values less than or equal to the threshold are set to 0 and the rest to 1, realizing the binarization of the hidden layer output.
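A direct transcription of this maximum-between-class-variance threshold selection (essentially Otsu's method applied to the hidden layer outputs) might look as follows; the hidden outputs here are simulated stand-ins.

```python
# Sketch of the binarization generation layer's threshold selection.
import numpy as np

def max_variance_threshold(values):
    """Return the threshold maximizing between-class variance delta(t)."""
    k = np.sort(np.asarray(values, dtype=float))
    N = k.size
    best_T, best_delta = k[0], -1.0
    for t in range(1, N):                      # split into K1 = k[:t], K2 = k[t:]
        e1, e2 = t / N, (N - t) / N            # occurrence probabilities
        b1, b2 = k[:t].mean(), k[t:].mean()    # group means
        beta = e1 * b1 + e2 * b2               # overall mean
        delta = e1 * (b1 - beta) ** 2 + e2 * (b2 - beta) ** 2
        if delta > best_delta:
            best_delta, best_T = delta, k[t - 1]
    return best_T

h = np.random.rand(512)                        # simulated hidden-layer outputs
T = max_variance_threshold(h)
code = (h > T).astype(np.uint8)                # <= T -> 0, otherwise -> 1
```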
Step eight: obtaining a user preference model and a user neighbor table
After the construction of the unsupervised learning unified feature extractor is completed, the user feature training vector set is input into it to obtain the user preference models, and a unified user neighbor table is generated by similarity comparison of the preference models of the individual users.
FIG. 4 shows an example of personalized recommendation using the unsupervised learning unified feature extractor. All news texts to be recommended are preprocessed and vectorized, then input into the unsupervised learning unified feature extractor to obtain the feature vectors of the news to be recommended, expressed as the content-based unified feature representation. The similarity of each news feature vector to a user's preference model is compared to generate a content-based recommendation list; a collaborative filtering recommendation list is generated from the news read by users similar to user A1 according to the user neighbor table; after weighted mixing, the Top-N recommendation list of the hybrid recommendation is obtained.
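With binarized features, the similarity comparisons in FIG. 4 reduce to bit operations. The sketch below scores hypothetical binary news codes against a binary user preference model by normalized Hamming similarity; all data are simulated.

```python
# Sketch of fast Top-N candidate scoring on binarized features.
import numpy as np

def hamming_similarity(a, b):
    """1 minus the normalized Hamming distance between two binary codes."""
    return 1.0 - np.count_nonzero(a != b) / a.size

user_pref  = np.random.randint(0, 2, 512)            # binarized user model
news_codes = np.random.randint(0, 2, (1000, 512))    # binarized candidate news

scores = np.array([hamming_similarity(user_pref, c) for c in news_codes])
top_n = np.argsort(scores)[::-1][:10]                # content-based Top-N list
```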
The unsupervised learning unified feature extractor disclosed by the invention is innovative in overall design and mode of application:
1. Innovation in design: the design of the deep autoencoder integrates the characteristics of the contractive autoencoder and the denoising autoencoder and introduces a new objective function. The deep structure (2-3 hidden layers) structurally improves the extraction of high-order statistical information, and the number of neurons decreases from one hidden layer to the next, making the coding and decoding of the deep autoencoder asymmetric; this alleviates the overfitting to which autoencoders are prone, improves the robustness of feature extraction, and achieves dimensionality reduction while extracting features. After training, a binarization generation layer replaces the output layer to obtain the unsupervised learning feature extractor, which can generate binarized features convenient for Hamming distance comparison, hash comparison and similar operations.
2. Innovation in the training mode: single-hidden-layer autoencoders have always used the input and output as comparison data, obtaining the error and updating the network parameters by back-propagation; multi-layer autoencoders generally follow unsupervised layer-by-layer pre-training with a classifier such as softmax after the last hidden layer for supervised learning with class labels, so the whole remains semi-supervised. The deep autoencoder of this application comprehensively weighs network depth against computational efficiency: the input data are also used for comparison at the output end and the resulting error is back-propagated, realizing completely unsupervised learning.
3. Innovation in application: the efficiency of applications such as recommendation systems is improved and personal privacy is effectively protected. For recommendation efficiency, the features extracted from the news texts serve as the user's liking and preference features for the news, so feature extraction is unified (user features and item features), and the hybrid recommendation method is unified in its technical basis. High-order statistical features avoid data such as user demographics; the extracted vectors are abstract information containing no explicit user data, so user information cannot leak even if the vectors are obtained illegally, realizing privacy protection and meeting the increasingly strict national requirements on protecting personal privacy information.
4. Innovation in training data: existing methods such as collaborative filtering calculate the similarity between users and between items from users' scores on commodities or media contents, but users today rarely score the news they read, so scoring data are scarce and training data insufficient. This application directly uses news data and user access data as the training data of the deep autoencoder, which avoids the lack of training data and, by not using a third-party corpus, is closer to reality.
In practical applications, precision and recall are the two most important indicators used in recommendation system evaluation. Practical tests show that the features extracted by the unsupervised learning unified feature extractor constructed by this method match the recommendation methods well. Compared with currently popular methods, the new personalized recommendation method performs well in both precision and recall, as shown in FIGS. 5 and 6.
Finally, it should be noted that: the above-mentioned embodiments are only used for illustrating the technical solution of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (6)
1. A construction method of an unsupervised learning unified feature extractor is characterized by comprising the following steps:
S1, acquiring actual news text data and user access data from a server, and generating a news feature training data set after sorting and randomization;
S2, preprocessing the data in the news feature training data set with a current Chinese word segmentation tool to obtain a preprocessed news feature training data set;
S3, obtaining a news feature training vector set from the preprocessed news feature training data set by the TF-IDF method;
S4, classifying the news feature training vector set according to the user access data to form a user feature training data set;
S5, constructing a stacked asymmetric denoising-contractive autoencoder with a plurality of hidden layers and using $J_{SA\text{-}CDAE}$ as the objective function:

$$J_{SA\text{-}CDAE}(\theta)=\sum_{x_i\in t}\Big[L_{MC}\big(x_i,g_\theta(f_\theta(x_i))\big)+\lambda\,\|J(x_i)\|_F^2\Big]$$

wherein

$$L_{MC}(x_i,\hat{x}_i)=-\sum_{j}k_\sigma\big(x_{ij}-\hat{x}_{ij}\big)$$

wherein $k_\sigma$ is a Gaussian kernel with standard deviation $\sigma=1.0$, the Gaussian kernel function being

$$k_\sigma(z)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{z^2}{2\sigma^2}\right)$$

where x denotes the input of the encoder, $f_\theta(\cdot)$ the output of the encoder and $g_\theta(\cdot)$ the decoder output; $L_{MC}(\cdot)$ denotes the cost function of a single input (the sum over j running over the vector components), $\lambda$ is the regularization parameter of the contractive autoencoder, $\|\cdot\|_F$ is the Frobenius-norm symbol, $J(x)$ is the encoder Jacobian matrix, $\theta$ is the parameter set of the deep autoencoder, $x_i$ denotes the input of the encoder in one training pass, $\hat{x}_i$ the output restored by the decoder, t the training set, and z the algebraic expression in the Gaussian kernel;
S6, training the deep autoencoder, the training steps being as follows:
S61, taking the news feature training vector set as the training data of the deep autoencoder;
S62, adding Gaussian white noise to the training data to generate noisy input data;
S63, taking the noisy input data as the input of the deep autoencoder; during training, adopting a mini-batch gradient descent method and first performing unsupervised layer-by-layer pre-training to obtain the initial parameters of each hidden layer and the output data of the output layer;
S64, comparing the input training data with the output data in the objective function, implementing back-propagation of the gradient and adjusting the initial parameters of each hidden layer;
S65, obtaining the parameter set of the deep autoencoder after training is finished;
and S7, removing the decoder part of the deep autoencoder and adding a binarization generation layer after the output of the last hidden layer to complete the construction of the unsupervised learning unified feature extractor.
2. The unsupervised learning unified feature extractor construction method of claim 1, wherein:
the step S1, acquiring actual news text data and user access data from the server, and generating a news characteristic training data set after sorting and randomizing, specifically includes the following steps:
s11, collecting news data and user access data in a certain time period on the server;
s12, removing pictures and videos in news data, uniformly coding the pictures and videos into UTF-8, setting a sequence number for each news to form a news data set;
s13, randomizing and reordering the news in the news data set according to the sequence numbers, and then respectively using the news as the news characteristic training data sets in the layer-by-layer unsupervised pre-training stage and the global training stage according to a certain proportion.
3. The unsupervised learning unified feature extractor construction method of claim 1, wherein:
In step S5, the stacked asymmetric denoising-contractive autoencoder with a plurality of hidden layers comprises 2 hidden layers.
4. The unsupervised learning unified feature extractor construction method of claim 3, wherein:
the coding function of the first hidden layer is h1(xi)=S(w1xi+b1) The pre-training decoding function is
The coding function of the second hidden layer is h2(h1)=S(w2h1+b2) The pre-training decoding function is
The global training decoding function from the second hidden layer to the output layer is go(xi)=S(w1xi+b1);
The initial parameters of each layer are [0,1 ]]The nonlinear activation function S () uses a Sigmoid function in common, e is the Euler number, h represents the coding function of the hidden layer, g is the decoding function, b represents the offset, x represents the input to the coder, w1、w2The weight parameters of the first and second hidden layers are respectively.
5. The unsupervised learning unified feature extractor construction method of claim 1, wherein:
the dimension of the binary generation layer in the step S7 is the same as that of the last hidden layer of the depth self-encoder, and one-to-one connection is realized with each neuron of the last hidden layer; the binary generation layer is provided with a weight regulator according to the output of the last hidden layer to realize threshold value regulation, the selection of the threshold value T in the weight regulator enables the output result of one complete training to be divided into two types, and the variance between the two types is the largest.
6. The unsupervised learning unified feature extractor construction method of claim 1, further comprising:
S8, inputting the user feature training vector set into the unsupervised learning unified feature extractor to obtain user preference models, and generating a unified user neighbor table through similarity comparison of the user preference models of the individual users.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810117102.XA | 2018-02-06 | 2018-02-06 | Unsupervised learning unified feature extractor construction method |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN108304359A | 2018-07-20 |
| CN108304359B | 2019-06-14 |
Family
- ID=62864632

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201810117102.XA (granted as CN108304359B, active) | Unsupervised learning unified feature extractor construction method | 2018-02-06 | 2018-02-06 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN108304359B (en) |
Patent Citations (6)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9263036B1 * | 2012-11-29 | 2016-02-16 | Google Inc. | System and method for speech recognition using deep recurrent neural networks |
| US20150238148A1 * | 2013-10-17 | 2015-08-27 | Siemens Aktiengesellschaft | Method and system for anatomical object detection using marginal space deep neural networks |
| CN105550677A (en) * | 2016-02-02 | 2016-05-04 | Hebei University | 3D palm print identification method |
| CN106295245A (en) * | 2016-07-27 | 2017-01-04 | Guangzhou Melux Information Technology Co., Ltd. | The method of storehouse noise reduction own coding gene information feature extraction based on Caffe |
| CN106803062A (en) * | 2016-12-20 | 2017-06-06 | Shaanxi Normal University | The recognition methods of stack noise reduction own coding neutral net images of gestures |
| CN107545903A (en) * | 2017-07-19 | 2018-01-05 | Nanjing University of Posts and Telecommunications | A kind of phonetics transfer method based on deep learning |
Non-Patent Citations (2)

| Title |
|---|
| Zhao Feixiang, "Radar target recognition method based on stacked denoising sparse autoencoders", Journal of Radars * |
| Qiu Shuang et al., "Chinese short text classification based on stacked denoising autoencoders", Journal of Inner Mongolia University for Nationalities * |
Cited By (22)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109344391A (en) * | 2018-08-23 | 2019-02-15 | Kunming University of Science and Technology | Multiple features fusion Chinese newsletter archive abstraction generating method neural network based |
| CN109614984A (en) * | 2018-10-29 | 2019-04-12 | Shenzhen Beidou Applied Technology Research Institute Co., Ltd. | A kind of homologous image detecting method and system |
| CN109598336A (en) * | 2018-12-05 | 2019-04-09 | Information and Communication Branch of State Grid Jiangxi Electric Power Co. | A kind of Data Reduction method encoding neural network certainly based on stack noise reduction |
| CN109635303A (en) * | 2018-12-19 | 2019-04-16 | University of Science and Technology of China | The recognition methods of specific area metasemy word |
| CN110022313A (en) * | 2019-03-25 | 2019-07-16 | Hebei Normal University | Polymorphic worm feature extraction and polymorphic worm discrimination method based on machine learning |
| CN110022313B (en) * | 2019-03-25 | 2021-09-17 | Hebei Normal University | Polymorphic worm feature extraction and polymorphic worm identification method based on machine learning |
| CN110136226A (en) * | 2019-04-08 | 2019-08-16 | South China University of Technology | It is a kind of to cooperate with the news of description generation to match drawing method automatically based on image group |
| CN110136226B (en) * | 2019-04-08 | 2023-12-22 | South China University of Technology | News automatic image distribution method based on image group collaborative description generation |
| US20210027168A1 * | 2019-07-23 | 2021-01-28 | Samsung Electronics Co., Ltd. | Electronic apparatus and controlling method thereof |
| CN110442804A (en) * | 2019-08-13 | 2019-11-12 | Beijing SenseTime Technology Development Co., Ltd. | A kind of training method, device, equipment and the storage medium of object recommendation network |
| CN110648282B (en) * | 2019-09-29 | 2021-03-23 | Yanshan University | Image super-resolution reconstruction method and system based on width neural network |
| CN110648282A (en) * | 2019-09-29 | 2020-01-03 | Yanshan University | Image super-resolution reconstruction method and system based on width neural network |
| CN112651221A (en) * | 2019-10-10 | 2021-04-13 | Beijing Sogou Technology Development Co., Ltd. | Data processing method and device and data processing device |
| CN112651221B (en) * | 2019-10-10 | 2024-11-05 | Beijing Sogou Technology Development Co., Ltd. | Data processing method and device for data processing |
| CN111368205A (en) * | 2020-03-09 | 2020-07-03 | Tencent Technology (Shenzhen) Co., Ltd. | Data recommendation method and device, computer equipment and storage medium |
| CN113497938A (en) * | 2020-03-19 | 2021-10-12 | Huawei Technologies Co., Ltd. | Method and device for compressing and decompressing image based on variational self-encoder |
| CN112116029A (en) * | 2020-09-25 | 2020-12-22 | Tianjin Polytechnic University | Intelligent fault diagnosis method for gearbox with multi-scale structure and characteristic fusion |
| CN115146689A (en) * | 2021-03-16 | 2022-10-04 | Tianjin University | Deep learning-based power system high-dimensional measurement data dimension reduction method |
| CN113441421B (en) * | 2021-07-22 | 2022-12-13 | Beijing Information Science and Technology University | Automatic garbage classification system and method |
| CN113441421A (en) * | 2021-07-22 | 2021-09-28 | Beijing Information Science and Technology University | Automatic garbage classification system and method |
| CN114417427A (en) * | 2022-03-30 | 2022-04-29 | Zhejiang University | Deep learning-oriented data sensitivity attribute desensitization system and method |
| CN114817722A (en) * | 2022-04-26 | 2022-07-29 | Qilu University of Technology | QoS prediction method and system based on multiple double-layer stacked noise reduction self-encoder |
Similar Documents

| Publication | Title |
|---|---|
| CN108304359B (en) | Unsupervised learning unified feature extractor construction method |
| CN108363804B (en) | Local model weighted fusion Top-N movie recommendation method based on user clustering |
| CN109145112B (en) | Commodity comment classification method based on global information attention mechanism |
| CN107832663B (en) | Multi-modal emotion analysis method based on quantum theory |
| CN107608956B (en) | Reader emotion distribution prediction algorithm based on CNN-GRNN |
| Shi et al. | Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval |
| CN109886020A (en) | Software vulnerability automatic classification method based on deep neural network |
| CN113343125B (en) | Academic accurate recommendation-oriented heterogeneous scientific research information integration method and system |
| CN107895303B (en) | Personalized recommendation method based on OCEAN model |
| Kumar et al. | Sentic computing for aspect-based opinion summarization using multi-head attention with feature pooled pointer generator network |
| CN113204522A (en) | Large-scale data retrieval method based on Hash algorithm combined with generation countermeasure network |
| CN116680363A (en) | Emotion analysis method based on multi-mode comment data |
| CN116595975A (en) | Aspect-level emotion analysis method for word information enhancement based on sentence information |
| CN115408605A (en) | Neural network recommendation method and system based on side information and attention mechanism |
| Chen et al. | Deformable convolutional matrix factorization for document context-aware recommendation in social networks |
| Li et al. | Coltr: Semi-supervised learning to rank with co-training and over-parameterization for web search |
| Lu et al. | Recommender system based on scarce information mining |
| Mu et al. | Auxiliary stacked denoising autoencoder based collaborative filtering recommendation |
| CN109902169B (en) | Method for improving performance of film recommendation system based on film subtitle information |
| Li et al. | Otcmr: Bridging heterogeneity gap with optimal transport for cross-modal retrieval |
| CN115481236A (en) | News recommendation method based on user interest modeling |
| Wen et al. | Extended factorization machines for sequential recommendation |
| Schmitt et al. | Outlier detection on semantic space for sentiment analysis with convolutional neural networks |
| Zhang et al. | A generic framework for learning explicit and implicit user-item couplings in recommendation |
| CN114817566A (en) | Emotion reason pair extraction method based on emotion embedding |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |