CN108304359B - Unsupervised learning unified feature extractor construction method - Google Patents
- Publication number
- CN108304359B (grant) CN201810117102.XA (application)
- Authority
- CN
- China
- Prior art keywords
- training
- data
- news
- layer
- autoencoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0631—Item recommendations
Abstract
The application provides an unsupervised learning unified feature extractor construction method, characterized in that: actual news text data is obtained from the server end and a news feature training dataset is generated; the data in the news feature training dataset are processed and vectorized to obtain a news feature training vector set; the news dataset is sorted according to user access data to form a user feature training dataset; a stacked asymmetric denoising contractive autoencoder with multiple hidden layers is constructed, and the deep autoencoder is trained with a specific objective function; after the deep autoencoder completes training, the decoder part is deleted and a binarization generation layer is added, completing the construction of the unsupervised learning unified feature extractor. The unsupervised learning unified feature extractor provided by the application can unify news features and user features, unify content-based recommendation and collaborative filtering recommendation, and improve the efficiency of real-time recommendation.
Description
Technical field
The invention belongs to the field of artificial intelligence, and more particularly relates to a method for constructing an unsupervised learning unified feature extractor.
Background technique
Current recommender systems or recommendation engines are generally divided into content-based recommendation, collaborative filtering recommendation, hybrid recommendation and similar types. Together with search engines they are among the most important information tools in today's society, and have been widely applied in fields such as e-commerce and media recommendation. The currently popular collaborative filtering methods are mainly based on commonality: the ratings that certain users give to goods or media content (collectively called "items") are used to compute the similarity between users and the similarity between items; a user's rating of a new item is then inferred from the ratings of other users with similar interests, or predicted from the new item's similarity to items the user was once interested in. This is therefore also known as rating prediction, but its disadvantages are insufficient personalization and difficult prediction when rating data are scarce.
Content-based recommendation mainly models the preferences of a given user and the attributes of items, and makes recommendations according to the user's preferences and interests. It is more strongly personalized, but the modeling and matching of user preferences and item attributes are difficult points. Past user preference modeling has required direct features such as demographic statistics, which easily intrudes on personal privacy.
Deep learning is a machine learning approach that has risen in recent years and can be divided into supervised learning and unsupervised learning. The autoencoder (AutoEncoder, AE) is a current research frontier of unsupervised learning, but most current deep autoencoding schemes have their respective strengths and weaknesses, suffer from drawbacks such as easy overfitting, and mostly do not yet realize unsupervised learning in the complete sense, which significantly constrains their performance.
Under the rapid development of technologies such as artificial intelligence, deep learning and unsupervised learning, it is necessary to study how new technologies and methods can update the technical foundation of recommender systems, effectively realize hybrid recommendation, and vigorously improve the efficiency of online recommendation.
Summary of the invention
In view of problems in current converged-media news recommendation and similar applications — insufficient personalization, difficult user feature extraction, difficulty in unifying different methods into an effective hybrid recommendation method, privacy violations during user feature extraction, and real-time recommendation efficiency in need of improvement — this application discloses, on the basis of current novel artificial intelligence technology, an Unsupervised Learning Unified Feature Extractor (ULUFE) construction method, used to extract a Unified Representation Based on Content (URBC). The unsupervised learning unified feature extractor construction method comprises the following steps:
S1, obtaining actual news text data and user access data from the server end, and generating a news feature training dataset after sorting and randomization;
S2, preprocessing the data in the news feature training dataset using a current Chinese word segmentation tool to obtain a preprocessed news feature training dataset;
S3, obtaining a news feature training vector set from the preprocessed news feature training dataset by the TF-IDF method;
S4, sorting the news feature training vector set according to the user access data to form a user feature training dataset;
S5, constructing a stacked asymmetric denoising contractive autoencoder with multiple hidden layers, using J_SA-CDAE as the objective function:

J_SA-CDAE(θ) = min_θ { −Σ_{i∈t} L_MC(x_i, g_θ(f_θ(x_i))) + λ‖J(x)‖²_F }

wherein the cost of a single input is the correntropy

L_MC(x_i, x̂_i) = Σ k_σ(x_i − x̂_i)

wherein k_σ is a Gaussian kernel whose standard deviation σ takes 1.0, the Gaussian kernel function being

k_σ(z) = exp(−z² / (2σ²)) / (√(2π)·σ)

wherein x represents the input of the encoder, f_θ(·) represents the output of the encoder, g_θ(·) represents the decoder output; L_MC(·) represents the cost function of a single input, λ is the regularization parameter of the contractive autoencoder, ‖·‖_F is the Frobenius norm sign, J(x) is the encoder Jacobian matrix, θ is the parameter set of the deep autoencoder, x_i represents the input of the encoder in one training pass, x̂_i represents the output reconstructed by the decoder, t represents the training set, and z represents the algebraic expression inside the Gaussian kernel;
S6, training the deep autoencoder, the training steps being as follows:
S61, using the news feature training vector set as the training data of the deep autoencoder;
S62, adding white Gaussian noise to the training data to generate input data with noise;
S63, using the input data with noise as the input of the deep autoencoder, adopting mini-batch gradient descent during training, first performing unsupervised layer-by-layer pre-training, and obtaining the initial parameters of each hidden layer and the output data of the output layer;
S64, comparing the input training data with the output data in the objective function, realizing the back-propagation of the gradient, and adjusting the initial parameters of each hidden layer;
S65, after training is completed, obtaining the parameter set of the deep autoencoder;
S7, deleting the decoder part of the deep autoencoder, and adding a binarization generation layer after the output of the last hidden layer, completing the construction of the unsupervised learning unified feature extractor.
Preferably, the step S1 of obtaining actual news text data and user access data from the server end and generating a news feature training dataset after sorting and randomization specifically includes the following steps:
S11, collecting the news data and user access data on the server within a certain period;
S12, removing the pictures and videos in the news data, unifying the encoding to UTF-8, setting a serial number for each news item, and constituting a news dataset;
S13, randomly rearranging the news in the news dataset by serial number, and then using certain proportions of it as the news feature training datasets of the layer-by-layer unsupervised pre-training stage and the global training stage respectively.
Preferably, the stacked asymmetric denoising contractive autoencoder with multiple hidden layers constructed in step S5 includes 2 hidden layers.
Preferably, the coding function of the first hidden layer is h1(xi) = S(w1·xi + b1), with pre-training decoding function g1(h1) = S(w1′·h1 + b1′); the coding function of the second hidden layer is h2(h1) = S(w2·h1 + b2), with pre-training decoding function g2(h2) = S(w2′·h2 + b2′); the global-training decoding function from the second hidden layer to the output layer is go(h2) = S(wo·h2 + bo).
The initial parameters of each layer use random numbers in [0, 1], and the nonlinear activation function S(·) uniformly uses the Sigmoid function S(z) = 1/(1 + e^(−z)), where e is Euler's number; h denotes the coding function of a hidden layer, g a decoding function, b a bias, x the input of the encoder, and w1, w2 the weight parameters of the first and second hidden layers respectively.
Preferably, the dimension of the binarization generation layer in step S7 is the same as that of the last hidden layer of the deep autoencoder, and it realizes a one-to-one connection with each neuron of the last hidden layer; the binarization generation layer sets a weight adjuster according to the output of the last hidden layer to realize threshold adjustment, the threshold T in the weight adjuster being selected so that the output result of one complete training pass is divided into two classes and the between-class variance of the two classes is maximal.
Preferably, the method further includes S8, inputting the user feature training vector set into the unsupervised learning unified feature extractor to obtain user preference models, and generating a unified user neighbor table by similarity comparison according to the user preference model of each user.
The advantages of the application are:
1. The manually labeled data required for supervised learning is difficult to obtain in real time in fast network-media recommendation, and existing deep autoencoders still require supervised fine-tuning after unsupervised layer-by-layer pre-training; the deep autoencoder in the present invention can realize entirely unsupervised learning;
2. A deep structure is used instead of a single-hidden-layer configuration, further improving the ability to learn the high-order latent explanatory factors of the content;
3. The encoder and decoder are asymmetric and the hidden-layer dimensions are lower than the input-layer dimension, so the nonlinear manifold of the data can be learned and dimensionality reduction is realized while extracting features, which is better than linear-manifold methods such as PCA; the asymmetry can also serve as a means of solving the problem that autoencoders easily overfit;
4. The features output by the autoencoder of the present invention are convenient for binarization; after the binarization generation layer is added, binarized features can be generated, so that in recommendation fast methods such as cosine similarity comparison, Hamming distance comparison and hashing can be used, solving the problem of fast user similarity comparison in converged media and news, with an evident effect on the fast recommendation of breaking news in mobile media;
5. In application, the features extracted from news data (the unified representation based on content) serve as the features of both the news to be recommended and the users, realizing the unification of the two kinds of features as well as the unification of content-based and collaborative filtering recommendation methods, while effectively protecting user privacy, innovating the recommendation method, and improving recommendation efficiency.
Detailed description of the drawings
Fig. 1 is the SA-CDAE design diagram of the invention;
Fig. 2 is the training diagram of the invention;
Fig. 3 is the unsupervised learning feature extractor of the invention;
Fig. 4 is the online recommendation diagram of the invention;
Fig. 5 is the precision comparison chart of the invention;
Fig. 6 is the recall comparison chart of the invention.
Specific embodiment
The specific embodiment and detailed steps of the unsupervised learning unified feature extractor construction method of the invention are further described below:
Step 1: data acquisition and preparation
The present invention is mainly directed at website text news and mobile-phone news-client text news in current converged media. The news text data and the user access data are both located at the server end. This step needs to generate the "news feature training dataset"; the specific process is as follows:
1) Collect the news data and user access data on the server within a certain period; the news data includes the historical news on the server, and the user access data includes the list of news IDs each user read within the period;
2) Remove irrelevant content such as pictures and videos from the news data, unify the encoding to UTF-8, set a serial number for each news item, and constitute a news dataset;
3) Randomly rearrange the news in the news dataset by serial number, then use certain proportions of it as the "news feature training datasets" of the layer-by-layer unsupervised pre-training stage and the global training stage respectively.
Step 2: text data preprocessing
Using a current Chinese word segmentation tool, perform Chinese word segmentation, stop-word removal and similar processing on the data in the news feature training dataset to obtain the preprocessed news feature training dataset.
Step 3: news text data vectorization
Perform vectorization processing on the preprocessed news feature training dataset using the TF-IDF method to obtain the news feature training vector set corresponding to the news feature training dataset. TF-IDF is the abbreviation of "term frequency–inverse document frequency".
TF means term frequency; its calculation formula is
TF = (number of occurrences of the term in the document) / (total number of terms in the document).
IDF means inverse document frequency; its calculation formula is
IDF = log(total number of documents / number of documents containing the term).
Under the premise of keeping the relative positions of the words in the news feature training dataset, the initial feature vectors of the data in the news feature training dataset are obtained by the TF-IDF method, constituting the news feature training vector set, the calculation formula of TF-IDF being
TF-IDF = TF × IDF
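The TF-IDF computation described above can be sketched as follows. This is a minimal illustration rather than the patented pipeline: the tokenized toy documents and the un-smoothed IDF form are assumptions, and a production system would use a dedicated vectorizer.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF vectors for a list of tokenized documents.

    TF  = occurrences of the term in the document / total terms in the document
    IDF = log(total documents / documents containing the term)
    """
    n_docs = len(docs)
    vocab = sorted({t for d in docs for t in d})
    # document frequency of each term
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = len(d)
        vectors.append([
            (counts[t] / total) * math.log(n_docs / df[t]) if counts[t] else 0.0
            for t in vocab
        ])
    return vocab, vectors

vocab, vecs = tfidf_vectors([["a", "b", "a"], ["b", "c"]])
```

Note that a term appearing in every document gets IDF = log(1) = 0 and thus contributes nothing, which is exactly the discriminative weighting the method relies on.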
Step 4: obtaining the user feature training dataset
Sorting the news feature training vector set according to the user access data yields the "user feature training vector set".
Step 5: constructing a stacked asymmetric denoising contractive autoencoder
The core component of the unsupervised learning unified feature extractor in this application is a specially designed deep autoencoder, whose role in the present invention is mainly reflected in two aspects: feature extraction and dimensionality reduction. Around the application goal of intelligent converged-media recommendation, and combining the advantages of the contractive autoencoder and the denoising autoencoder, the present invention designs a stacked (deep) asymmetric denoising contractive autoencoder (Stacked Asymmetric Denoising Contractive Auto-encoder, SA-CDAE) with 2 or 3 hidden layers, as shown in Fig. 1. Structurally, multiple hidden layers are adopted to improve on the feature extraction ability of a single hidden layer; the input layer and output layer have identical dimensions, while the hidden-layer dimensions are smaller than the input layer and decrease proportionally layer by layer; and the coding and decoding structure is asymmetric to promote resistance to overfitting. The news initial training vector set obtained after the earlier preparation and preprocessing is on the whole independent and identically distributed, but contains a certain amount of disturbance whose specific distribution is unknown; it is denoted D = {x1, x2, …, xn}, xi ∈ R^d, n ∈ N, then:
The coding function of the first hidden layer is h1(xi) = S(w1·xi + b1);
the pre-training decoding function of the first hidden layer is g1(h1) = S(w1′·h1 + b1′);
the coding function of the second hidden layer is h2(h1) = S(w2·h1 + b2);
the pre-training decoding function of the second hidden layer is g2(h2) = S(w2′·h2 + b2′);
the global-training decoding function from the second hidden layer to the output layer is go(h2) = S(wo·h2 + bo).
The initial parameters of each layer use random numbers in the open interval (0, 1), and the nonlinear activation function S(·) uniformly uses the Sigmoid function S(z) = 1/(1 + e^(−z)),
where D represents the news initial training vector set, R is the set of real numbers, N is the set of natural numbers, h denotes the coding function of a hidden layer, g a decoding function, b a bias, e Euler's number, x the input of the encoder, w1 and w2 the weight parameters of the first and second hidden layers respectively, and xi the input of the encoder in one training pass.
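As a sketch of the network just defined — two sigmoid hidden layers with decreasing dimensions and a single asymmetric decoding layer back to the input dimension — the following uses illustrative layer sizes (8, 5, 3 are assumptions, not the patent's dimensions) and shows only the forward pass, not the training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # the nonlinear activation S(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

class SACDAE:
    """Asymmetric stack: input -> h1 -> h2 (encoder), then a single
    global decoding layer h2 -> output with the input dimension."""

    def __init__(self, d_in, d_h1, d_h2):
        # initial parameters drawn uniformly from [0, 1), per the text
        self.w1 = rng.random((d_h1, d_in)); self.b1 = rng.random(d_h1)
        self.w2 = rng.random((d_h2, d_h1)); self.b2 = rng.random(d_h2)
        self.wo = rng.random((d_in, d_h2)); self.bo = rng.random(d_in)

    def encode(self, x):
        h1 = sigmoid(self.w1 @ x + self.b1)   # h1(x)  = S(w1 x + b1)
        h2 = sigmoid(self.w2 @ h1 + self.b2)  # h2(h1) = S(w2 h1 + b2)
        return h2

    def decode(self, h2):
        # asymmetric: one decoding layer straight back to the input dimension
        return sigmoid(self.wo @ h2 + self.bo)

ae = SACDAE(d_in=8, d_h1=5, d_h2=3)
x = rng.random(8)
code = ae.encode(x)
recon = ae.decode(code)
```

The code vector is strictly inside (0, 1) because of the sigmoid, which is what later makes the binarization generation layer straightforward.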
The autoencoder borrows from the characteristics of the human brain. Its principle is to train a coding–decoding mechanism so that the input of the encoder can be reproduced at the output end of the decoder, where the encoder part is also called the hidden layers and the decoder part is also called the output layer. Perfectly reconstructing the input at the output end is neither easy nor of practical significance; but by designing a special structure, adding suitable constraints to the replication, and using special cost functions and training methods, the model can only achieve approximate replication and is thereby forced to replicate the input data selectively by weight, so that useful distributed features of the data are constructed in the encoder of the autoencoder. This has become a frontier of generative model research in recent years. The prototype autoencoder already embodied good feature extraction ability, but in use it easily overfits and loses generalization ability on real data, so derivative autoencoders that successively improve and optimize the prototype have appeared.
The deep autoencoder of the invention simultaneously considers, in its design, adding noise and reducing noise (disturbance). Adding noise refers to the idea borrowed from denoising autoencoders: white noise with a Gaussian distribution is added to the input X, forcing the decoder to remove the noise in its output, so that the anti-overfitting performance of the system is improved. By adding white Gaussian noise to the input during training, the autoencoder acquires denoising characteristics, further reducing the risk of overfitting. The parameter set θ of the neural network is trained by back-propagation and stochastic gradient descent (SGD).
Reducing noise (disturbance) refers to improving the system's resistance to non-Gaussian noise and disturbance during training. To further reduce the influence of outliers in the news feature dataset and the user feature dataset, and to provide a basis for the binarization generation used later in the scheme, the design also partly adopts the characteristics of the contractive autoencoder. The contractive autoencoder adds an analytic contraction penalty factor to the cost-function expression of the prototype autoencoder, thereby reducing the degrees of freedom of the feature representation, driving the hidden neurons toward saturation, and limiting the output data to a certain range of the parameter space. The penalty factor is in fact the Frobenius norm of the encoder Jacobian matrix; its effect is to reduce the influence of outliers on the encoder and suppress disturbances of the training samples (which lie on a low-dimensional manifold surface) in all directions, helping the encoder learn useful data features. In addition, the distributed representation learned by the contractive autoencoder has a "saturation" characteristic: the values of most hidden units are close to the two ends (0 or 1), and their partial derivatives with respect to the input are close to 0.
The mean square error (MSE) function is often used as the cost function in the training of general autoencoders and has a certain tolerance to Gaussian noise; but in this example, considering the presence of disturbances such as outliers — for instance occasional reading behavior outside a user's preferences — the present embodiment uses maximum correntropy (MC) as the cost function to improve robustness:

L_MC(x_i, x̂_i) = Σ k_σ(x_i − x̂_i)

wherein k_σ is a Gaussian kernel whose standard deviation σ takes 1.0, the Gaussian kernel function being

k_σ(z) = exp(−z² / (2σ²)) / (√(2π)·σ).

The overall objective function of the deep autoencoder in the invention is

J_SA-CDAE(θ) = min_θ { −Σ_{i∈t} L_MC(x_i, g_θ(f_θ(x_i))) + λ‖J(x)‖²_F }.

In the above formulas, f_θ(·) represents the output of the encoder and g_θ(·) the decoder output; L_MC(·) represents the cost function of a single input, λ is the regularization parameter of the contractive autoencoder, ‖·‖_F is the Frobenius norm sign, J(x) is the encoder Jacobian matrix, θ is the parameter set of the deep autoencoder, x_i represents the input of the encoder in one training pass, x̂_i represents the output reconstructed by the decoder, t represents the training set, and z represents the algebraic expression inside the Gaussian kernel.
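The correntropy cost and overall objective can be written out numerically as in the sketch below. The value of λ and the Jacobian argument are placeholder assumptions (the patent leaves them to training); the sketch only shows that maximizing correntropy amounts to minimizing the negated kernel sum plus the contractive penalty:

```python
import numpy as np

SIGMA = 1.0  # kernel standard deviation, per the text

def gaussian_kernel(z, sigma=SIGMA):
    # k_sigma(z) = exp(-z^2 / (2 sigma^2)) / (sqrt(2 pi) * sigma)
    return np.exp(-z ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def correntropy(x, x_hat):
    # L_MC(x, x_hat): summed kernel similarity between input and reconstruction
    return np.sum(gaussian_kernel(x - x_hat))

def sa_cdae_objective(x, x_hat, jacobian, lam=1e-3):
    # minimized objective: negative correntropy + contractive Frobenius penalty
    return -correntropy(x, x_hat) + lam * np.sum(jacobian ** 2)
```

A closer reconstruction scores a lower objective value, so gradient descent on this quantity drives the decoder output toward the clean input while the penalty shrinks the encoder's sensitivity.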
Step 6: training the deep autoencoder
Training a neural network means using cleaned and sorted data as input and, through the two links of forward propagation and back-propagation, gradually making the parameters of the network's objective function converge, so that higher-order statistical characteristics are learned. As shown in Fig. 2, the deep autoencoder adopts offline training; the main training steps are as follows:
1) Use the news feature training vector set as the training data of the deep autoencoder, denoted X. It can be seen from this that the training data in this application uses news data directly, needing neither manual labeling nor a third-party corpus;
2) Add white Gaussian noise to the training data X to generate the input data with noise X1;
3) Use X1 as the input of the deep autoencoder, adopt mini-batch gradient descent during training, first perform unsupervised layer-by-layer pre-training, and obtain the initial parameters of each hidden layer and the output X̂ of the output layer;
4) Compare X and X̂ in the objective function and realize the global back-propagation of the gradient, adjusting the initial parameters of each hidden layer;
5) After training is completed, obtain the parameter set of the deep autoencoder for the next step of constructing the unsupervised learning unified feature extractor.
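The data flow of training steps 2) and 3) — noise injection and mini-batch iteration — might look like the following sketch. The batch size, noise level, and stand-in data are illustrative assumptions; the forward and backward passes themselves are omitted:

```python
import numpy as np

rng = np.random.default_rng(1)

def add_gaussian_noise(X, std=0.1):
    # step 2): corrupt the clean training vectors with white Gaussian noise
    return X + rng.normal(0.0, std, size=X.shape)

def minibatches(X_noisy, X_clean, batch_size):
    # step 3): mini-batch gradient descent iterates over shuffled batches;
    # the decoder target stays the *clean* input, per the denoising scheme
    idx = rng.permutation(len(X_clean))
    for start in range(0, len(X_clean), batch_size):
        sel = idx[start:start + batch_size]
        yield X_noisy[sel], X_clean[sel]

X = rng.random((100, 8))        # stand-in for the news feature training vectors
X1 = add_gaussian_noise(X)
batches = list(minibatches(X1, X, batch_size=32))
```

Each epoch would then run the forward pass on the noisy batch, evaluate the objective against the paired clean batch, and back-propagate the error.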
The deep autoencoder in this application embodies completely new considerations in structure, cost function and training; it can realize dimensionality reduction at the same time as feature extraction, and can learn nonlinear manifolds, significantly outperforming linear-manifold methods such as PCA in dimensionality reduction. In addition, exploiting the parallel nature of neural networks, the main training steps of the deep autoencoder of the invention are accelerated by GPU parallel computation, so that the training efficiency of the deep autoencoder is greatly increased and the practical application efficiency in recommender systems is improved.
Step 7: constructing the unsupervised learning unified feature extractor
The output of the trained deep autoencoder is easy to binarize. Therefore, after the deep autoencoder completes training, the decoder part is deleted, and a binarization generation layer used to complete the binarization processing is added after the output of the last hidden layer, as shown in Fig. 3, completing the construction of the unsupervised learning unified feature extractor.
In the present embodiment, the output of the deep autoencoder is easy to binarize, with about 70% of the values close to 0 or 1; how the remaining 30% are handled directly influences the precision of the overall binarized extraction and of the subsequent similarity comparison. For this purpose, a binarization generation layer with the same dimension as the last hidden layer of the deep autoencoder is designed structurally, realizing a one-to-one connection with each neuron of the last hidden layer. Internally, the binarization generation layer does not use a common fixed threshold; instead, a weight adjuster is designed according to the actual distribution of the last hidden layer's output to realize threshold adjustment. The selection principle of the threshold T in the weight adjuster is that the output result of one complete training pass can be divided into two classes such that the between-class variance of the two classes is maximal.
Let the total output set of all hidden-layer units after one complete training pass be K, containing N distinct data values. Sort K from small to large to obtain the dataset K(k1, k2, …, ki), and suppose it can be divided into two groups K1 and K2 of sizes t and N−t. The number of occurrences of each ki is ni, where i ∈ [1, N]; the probabilities of the two groups in the whole are ε1 and ε2 respectively, and the means of the two groups are β1 and β2 respectively. Then the probability of occurrence of ki is pi = ni/N, with ε1 = Σ_{i=1..t} pi and ε2 = 1 − ε1; the means of the two groups are respectively β1 = (Σ_{i=1..t} ki·pi)/ε1 and β2 = (Σ_{i=t+1..N} ki·pi)/ε2, and the mean of the dataset K is β = ε1·β1 + ε2·β2. The between-class variance of the two groups is then defined as δ(t) = ε1(β1 − β)² + ε2(β2 − β)². Seek T = argmax_t δ(t); that is, find the t at which δ(t) is maximal, take the corresponding value from K as the threshold, set values ≤ T to 0 and the rest to 1, thereby realizing the binarization of the hidden-layer output.
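The threshold search described above is essentially Otsu's between-class-variance criterion applied to the hidden-layer outputs; a direct transcription might read as follows (the sample values are illustrative):

```python
import numpy as np

def weight_adjuster_threshold(outputs):
    """Pick T = argmax_t delta(t): split the sorted distinct values into two
    groups and maximize the between-class variance, as the text defines."""
    vals, counts = np.unique(outputs, return_counts=True)  # sorted distinct values
    p = counts / counts.sum()                              # p_i = n_i / N
    best_t, best_delta = vals[0], -1.0
    for t in range(1, len(vals)):          # group 1 = vals[:t], group 2 = vals[t:]
        eps1 = p[:t].sum()
        eps2 = 1.0 - eps1
        beta1 = (vals[:t] * p[:t]).sum() / eps1
        beta2 = (vals[t:] * p[t:]).sum() / eps2
        beta = eps1 * beta1 + eps2 * beta2
        delta = eps1 * (beta1 - beta) ** 2 + eps2 * (beta2 - beta) ** 2
        if delta > best_delta:
            best_delta, best_t = delta, vals[t - 1]
    return best_t

def binarize(outputs):
    T = weight_adjuster_threshold(outputs)
    return (outputs > T).astype(int)       # values <= T -> 0, the rest -> 1
```

On a bimodal distribution like the one the saturated hidden units produce, the maximizing split falls in the gap between the two clusters.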
Step 8: obtaining the user preference models and the user neighbor table
After the construction of the unsupervised learning unified feature extractor is completed, the user feature training vector set is input into the unsupervised learning unified feature extractor to obtain the user preference models; according to the user preference model of each user, a unified user neighbor table is generated by similarity comparison.
Fig. 4 shows an example of personalized recommendation using the unsupervised learning unified feature extractor. All news texts to be recommended are preprocessed, vectorized and input into the unsupervised learning unified feature extractor, yielding the to-be-recommended news feature vectors expressed as unified representations based on content. Similarity comparison between the to-be-recommended news feature vectors and a user's preference model generates the content-based recommendation list; using the user neighbor table, the news read by users similar to user A1 generates the collaborative filtering recommendation list; after weighted mixing, the Top-N recommendation list of the hybrid recommendation is obtained.
The unsupervised learning unified feature extractor disclosed by the invention is innovative throughout its overall design and operating mode:
1. Innovation in design: the design of the deep autoencoder fuses the characteristics of the contractive autoencoder and the denoising autoencoder and devises a new objective function. Structurally, the deep structure (2–3 hidden layers) improves the ability to extract higher-order statistical information; the number of neurons per hidden layer decreases layer by layer, realizing the asymmetry of the coding and decoding of the deep autoencoder, which helps alleviate the overfitting that autoencoders are prone to, improves the robustness of feature extraction, and realizes dimensionality reduction while extracting features. After training, a binarization generation layer substitutes for the output layer, yielding an unsupervised learning feature extractor that can generate binarized features, convenient for Hamming distance comparison and for further operations such as hash comparison.
2. Innovation in the training method: in the past, single-hidden-layer autoencoders usually used the input and output data as a comparison to obtain an error for back-propagation and parameter updates, while multi-layer autoencoders generally added a classifier such as softmax after the last hidden layer following unsupervised layer-by-layer pre-training and performed supervised learning according to class labels, so that the whole was semi-supervised learning. The deep autoencoder of the present invention comprehensively considers network depth and computational efficiency, and also uses the input data for comparison at the output end, back-propagating the resulting error, thereby realizing completely unsupervised learning.
3. Innovation in application: it not only improves the efficiency of applications such as recommender systems, but also effectively protects personal privacy. In improving recommendation efficiency, the features generated from news texts are extracted to serve as the user's news interest and preference features, realizing unified feature extraction (user features and item features) and the unification of hybrid recommendation methods on a common technical foundation. Moreover, this higher-order statistical approach avoids data such as user demographic information: the extracted vectors are abstract information containing no explicit user data, so that even if illegally accessed they will not leak user information, thereby realizing privacy protection and meeting the country's increasingly strict requirements on protecting personal privacy information.
4. Innovation in the training data. Existing methods such as collaborative filtering compute the similarity between users and between items from users' ratings of goods or media content, but current users seldom rate the news they read, causing rating data to be scarce and training data to be insufficient. This application directly uses news data and user access data as the training data of the deep autoencoder, which has the following characteristics: first, it avoids the defect of lacking data during training; second, no third-party corpus is used, which is closer to reality.
In practical applications, precision and recall are the two most important metrics used in recommender-system evaluation. Actual tests show that the features extracted by the unsupervised-learning unified feature extractor constructed in the present application match the recommendation method well, so that the novel personalized recommendation method achieves good results in both precision and recall compared with currently popular methods, as shown in Figure 5 and Figure 6.
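For reference, the two metrics mentioned above can be computed over a hypothetical recommendation list as follows (all identifiers are illustrative, not from the tests in Figures 5 and 6):

```python
# Precision: fraction of recommended items that were relevant.
# Recall:    fraction of relevant items that were recommended.
recommended = {"n1", "n2", "n3", "n4"}   # hypothetical recommended news ids
relevant = {"n2", "n4", "n5"}            # news the user actually read

hits = recommended & relevant            # correctly recommended items
precision = len(hits) / len(recommended)
recall = len(hits) / len(relevant)
```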
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the previous embodiments, or make equivalent replacements for some or all of the technical features; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the various embodiments of the present invention.
Claims (6)
1. A method for constructing an unsupervised-learning unified feature extractor, characterized in that the construction method comprises the following steps:
S1. Obtain actual news text data and user access data from the server side, and generate a news-feature training data set after sorting and randomization;
S2. Pre-process the data in the news-feature training data set using a current Chinese word-segmentation tool to obtain the pre-processed news-feature training data set;
S3. Convert the pre-processed news-feature training data set into a news-feature training vector set by the TF-IDF method;
S4. Classify the news-feature training vector set according to the user access data to form a user-feature training data set;
S5. Construct a stacked asymmetric denoising contractive autoencoder with multiple hidden layers, using JSA-CDAE as the objective function:
JSA-CDAE(θ) = Σxi∈t [ LMC(xi, x̂i) + λ‖J(xi)‖F² ]
wherein LMC(xi, x̂i) = −kσ(xi − x̂i), with x̂i = gθ(fθ(xi)),
wherein kσ is a Gaussian kernel whose standard deviation σ is taken as 1.0, the Gaussian kernel function being:
kσ(z) = (1/(√(2π)·σ)) · exp(−z²/(2σ²))
wherein x denotes the input of the encoder, fθ(·) denotes the output of the encoder, gθ(·) denotes the output of the decoder; LMC(·) denotes the cost function of a single input, λ is the regularization parameter of the contractive autoencoder, ‖·‖F is the F-norm symbol, J(x) is the Jacobian matrix of the encoder, θ is the parameter set of the deep autoencoder, xi denotes the input of the encoder in one training pass, x̂i denotes the output restored by the decoder, t represents the training set, and z represents the algebraic expression inside the Gaussian kernel;
S6. Train the deep autoencoder, with the following training steps:
S61. Use the news-feature training vector set as the training data of the deep autoencoder;
S62. Add white Gaussian noise to the training data to generate noisy input data;
S63. Use the noisy input data as the input of the deep autoencoder; during training, use the batch gradient-descent method, first performing unsupervised layer-by-layer pre-training to obtain the initial parameters of each hidden layer and the output data of the output layer;
S64. Compare the input training data and the output data in the objective function, realize the back-propagation of the gradient, and adjust the initial parameters of each hidden layer;
S65. After training is completed, obtain the parameter set of the deep autoencoder;
S7. Remove the decoder part of the deep autoencoder, and add a binarization generation layer after the output of the last hidden layer, completing the construction of the unsupervised-learning unified feature extractor.
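Purely as an illustration of the TF-IDF weighting named in step S3 (not the patented implementation; the segmented token lists are hypothetical), the conversion from pre-processed text to weighted vectors can be sketched as:

```python
import math
from collections import Counter

# Hypothetical output of Chinese word segmentation, one token list per news item.
docs = [["news", "about", "sports"],
        ["news", "about", "finance"],
        ["sports", "match"]]

N = len(docs)
df = Counter(t for doc in docs for t in set(doc))   # document frequency per term

def tfidf(doc):
    """Term frequency times inverse document frequency for one document."""
    tf = Counter(doc)
    return {t: (tf[t] / len(doc)) * math.log(N / df[t]) for t in tf}

vec = tfidf(docs[0])   # sparse TF-IDF weights for the first news item
```

Terms appearing in every document get weight zero, so the resulting vectors emphasize discriminative words, which is the property that makes them usable as training vectors.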
2. The unsupervised-learning unified feature extractor construction method according to claim 1, characterized in that:
step S1, obtaining actual news text data and user access data from the server side and generating the news-feature training data set after sorting and randomization, specifically comprises the following steps:
S11. Obtain the news data and user access data on the server within a certain period;
S12. Remove pictures and videos from the news data, unify the encoding as UTF-8, set a serial number for each news item, and constitute a news data set;
S13. Randomly reorder the news in the news data set by serial number, then take certain proportions respectively as the news-feature training data sets of the layer-by-layer unsupervised pre-training stage and of the global training stage.
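Steps S11–S13 can be illustrated with the following sketch; the 80/20 split proportion is an assumption for the example, not a value stated in the claim:

```python
import random

# Serial-numbered news items (step S12 assigns each item a serial number).
news = [f"news-{i:03d}" for i in range(100)]

random.seed(42)
random.shuffle(news)                 # S13: randomized rearrangement

split = int(0.8 * len(news))         # assumed proportion between the two stages
pretrain_set = news[:split]          # layer-by-layer unsupervised pre-training stage
global_set = news[split:]            # global training stage
```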
3. The unsupervised-learning unified feature extractor construction method according to claim 1, characterized in that:
the stacked asymmetric denoising contractive autoencoder with multiple hidden layers constructed in step S5 comprises 2 hidden layers.
4. The unsupervised-learning unified feature extractor construction method according to claim 3, characterized in that:
the coding function of the first hidden layer is h1(xi) = S(w1xi + b1), with pre-training decoding function g1(h1) = S(w1′h1 + b1′);
the coding function of the second hidden layer is h2(h1) = S(w2h1 + b2), with pre-training decoding function g2(h2) = S(w2′h2 + b2′);
the global-training decoding function from the second hidden layer to the output layer is go(h2) = S(woh2 + bo);
the initial parameters of each layer use random numbers in [0, 1]; the nonlinear activation function S(·) uniformly uses the Sigmoid function S(z) = 1/(1 + e−z), where e is Euler's number; h denotes the coding function of a hidden layer, g denotes a decoding function, b denotes a bias, x denotes the input of the encoder, and w1, w2 are the weight parameters of the first and second hidden layers respectively.
5. The unsupervised-learning unified feature extractor construction method according to claim 1, characterized in that:
in step S7, the binarization generation layer has the same dimension as the last hidden layer of the deep autoencoder, realizing a one-to-one connection with each neuron of the last hidden layer; a weight adjuster is set in the binarization generation layer to realize threshold adjustment according to the output of the last hidden layer, the threshold T in the weight adjuster being selected such that the output result of one complete training run is divided into two classes and the between-class variance of the two classes is maximized.
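The threshold selection of claim 5 (maximizing between-class variance, in the spirit of Otsu's method) can be illustrated as follows; the activation values and the candidate grid are hypothetical:

```python
def best_threshold(values, candidates):
    """Pick the candidate threshold maximizing between-class variance."""
    best_t, best_var = None, -1.0
    for t in candidates:
        lo = [v for v in values if v < t]
        hi = [v for v in values if v >= t]
        if not lo or not hi:          # both classes must be non-empty
            continue
        w0, w1 = len(lo) / len(values), len(hi) / len(values)
        m0, m1 = sum(lo) / len(lo), sum(hi) / len(hi)
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

acts = [0.05, 0.1, 0.12, 0.8, 0.85, 0.9]        # hypothetical last-hidden-layer outputs
T = best_threshold(acts, [i / 20 for i in range(1, 20)])
bits = [1 if a >= T else 0 for a in acts]        # binarized feature
```

With two well-separated clusters the chosen T falls between them, so the binarization reproduces the natural two-class split of the activations.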
6. The unsupervised-learning unified feature extractor construction method according to claim 1, characterized by further comprising:
S8. Input the user-feature training vector set into the unsupervised-learning unified feature extractor to obtain user interest models; according to the user interest model of each user, generate a unified user neighbor table through similarity comparison.
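As an illustrative sketch of step S8 (hypothetical users and bit vectors; the match-ratio similarity is an assumption, consistent with Hamming-distance comparison of binarized features):

```python
def similarity(a, b):
    """Fraction of matching bits, i.e. 1 minus the normalized Hamming distance."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical binarized user interest models from the feature extractor.
users = {
    "u1": [1, 0, 1, 1],
    "u2": [1, 0, 1, 0],
    "u3": [0, 1, 0, 0],
}

# Neighbor table: for each user, all other users ordered by similarity.
neighbors = {
    u: sorted((v for v in users if v != u),
              key=lambda v: similarity(users[u], users[v]), reverse=True)
    for u in users
}
```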
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810117102.XA CN108304359B (en) | 2018-02-06 | 2018-02-06 | Unsupervised learning uniform characteristics extractor construction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108304359A CN108304359A (en) | 2018-07-20 |
CN108304359B true CN108304359B (en) | 2019-06-14 |
Family
ID=62864632
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810117102.XA Active CN108304359B (en) | 2018-02-06 | 2018-02-06 | Unsupervised learning uniform characteristics extractor construction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108304359B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344391B (en) * | 2018-08-23 | 2022-10-21 | 昆明理工大学 | Multi-feature fusion Chinese news text abstract generation method based on neural network |
CN109614984A (en) * | 2018-10-29 | 2019-04-12 | 深圳北斗应用技术研究院有限公司 | A kind of homologous image detecting method and system |
CN109598336A (en) * | 2018-12-05 | 2019-04-09 | 国网江西省电力有限公司信息通信分公司 | A kind of Data Reduction method encoding neural network certainly based on stack noise reduction |
CN109635303B (en) * | 2018-12-19 | 2020-08-25 | 中国科学技术大学 | Method for recognizing meaning-changing words in specific field |
CN110022313B (en) * | 2019-03-25 | 2021-09-17 | 河北师范大学 | Polymorphic worm feature extraction and polymorphic worm identification method based on machine learning |
CN110136226B (en) * | 2019-04-08 | 2023-12-22 | 华南理工大学 | News automatic image distribution method based on image group collaborative description generation |
KR20210011844A (en) * | 2019-07-23 | 2021-02-02 | 삼성전자주식회사 | Electronic apparatus and method for controlling thereof |
CN110442804A (en) * | 2019-08-13 | 2019-11-12 | 北京市商汤科技开发有限公司 | A kind of training method, device, equipment and the storage medium of object recommendation network |
CN110648282B (en) * | 2019-09-29 | 2021-03-23 | 燕山大学 | Image super-resolution reconstruction method and system based on width neural network |
CN111368205B (en) * | 2020-03-09 | 2021-04-06 | 腾讯科技(深圳)有限公司 | Data recommendation method and device, computer equipment and storage medium |
CN113497938A (en) * | 2020-03-19 | 2021-10-12 | 华为技术有限公司 | Method and device for compressing and decompressing image based on variational self-encoder |
CN112116029A (en) * | 2020-09-25 | 2020-12-22 | 天津工业大学 | Intelligent fault diagnosis method for gearbox with multi-scale structure and characteristic fusion |
CN115146689A (en) * | 2021-03-16 | 2022-10-04 | 天津大学 | Deep learning-based power system high-dimensional measurement data dimension reduction method |
CN113441421B (en) * | 2021-07-22 | 2022-12-13 | 北京信息科技大学 | Automatic garbage classification system and method |
CN114417427B (en) * | 2022-03-30 | 2022-08-02 | 浙江大学 | Deep learning-oriented data sensitivity attribute desensitization system and method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9263036B1 (en) * | 2012-11-29 | 2016-02-16 | Google Inc. | System and method for speech recognition using deep recurrent neural networks |
CN105550677A (en) * | 2016-02-02 | 2016-05-04 | 河北大学 | 3D palm print identification method |
CN107545903A (en) * | 2017-07-19 | 2018-01-05 | 南京邮电大学 | A kind of phonetics transfer method based on deep learning |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9668699B2 (en) * | 2013-10-17 | 2017-06-06 | Siemens Healthcare Gmbh | Method and system for anatomical object detection using marginal space deep neural networks |
CN106295245B (en) * | 2016-07-27 | 2019-08-30 | 广州麦仑信息科技有限公司 | Method of the storehouse noise reduction based on Caffe from coding gene information feature extraction |
CN106803062A (en) * | 2016-12-20 | 2017-06-06 | 陕西师范大学 | The recognition methods of stack noise reduction own coding neutral net images of gestures |
Non-Patent Citations (2)
Title |
---|
Radar target recognition method based on stacked denoising sparse autoencoder; Zhao Feixiang; Journal of Radars; 2017-04-30; Vol. 6, No. 2; full text |
Chinese short text classification based on stacked denoising autoencoder; Qiu Shuang et al.; Journal of Inner Mongolia University for Nationalities; 2017-09-30; Vol. 32, No. 5; full text |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108304359B (en) | Unsupervised learning uniform characteristics extractor construction method | |
Zhou et al. | A comprehensive survey on pretrained foundation models: A history from bert to chatgpt | |
Brown et al. | Smooth-ap: Smoothing the path towards large-scale image retrieval | |
Yang et al. | Fashion captioning: Towards generating accurate descriptions with semantic rewards | |
Agrawal | Clickbait detection using deep learning | |
CN109145112A (en) | A kind of comment on commodity classification method based on global information attention mechanism | |
Sheu et al. | Knowledge-guided article embedding refinement for session-based news recommendation | |
Xiao et al. | Using convolution control block for Chinese sentiment analysis | |
Sun et al. | Chinese herbal medicine image recognition and retrieval by convolutional neural network | |
Wang et al. | Cross-domain recommendation with user personality | |
CN113569001A (en) | Text processing method and device, computer equipment and computer readable storage medium | |
Yang et al. | Read, attend and comment: A deep architecture for automatic news comment generation | |
Hu et al. | Attentive interactive convolutional matching for community question answering in social multimedia | |
CN111814453A (en) | Fine-grained emotion analysis method based on BiLSTM-TextCNN | |
Liu et al. | Identifying experts in community question answering website based on graph convolutional neural network | |
Du et al. | Sentiment analysis method based on piecewise convolutional neural network and generative adversarial network | |
CN111680190A (en) | Video thumbnail recommendation method fusing visual semantic information | |
Yilmaz et al. | Inferring political alignments of Twitter users | |
CN117033751A (en) | Recommended information processing method, recommended information processing device, storage medium and equipment | |
Banerjee et al. | Recommendation of compatible outfits conditioned on style | |
Cao et al. | Video-based recipe retrieval | |
Deng et al. | A depression tendency detection model fusing weibo content and user behavior | |
Jaya et al. | Analysis of convolution neural network for transfer learning of sentiment analysis in Indonesian tweets | |
Qian et al. | Multi-hop interactive attention based classification network for expert recommendation | |
CN110909167A (en) | Microblog text classification system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||